Tracing Kernel Functions: How the illumos AMD64 FBT Provider Intercepts Function Calls
June 15, 2020
For my money there is perhaps nothing more valuable than in situ kernel debugging. The ability to ask questions of any part of the kernel, on a live production system, without stopping or pausing anything, feels more like a right than a privilege after you’ve experienced it first hand. One such avenue to in situ kernel debugging is the DTrace FBT provider.
The FBT provider, which stands for Function Boundary Tracing, allows one to trace all kernel function entry and return sites, their arguments and return values, and even grab the stack that lead to the call. There is no overhead when probes are disabled, and minimal overhead on the specific entry or return site when enabled. DTrace is made up of several parts: the command, the user library, the kernel framework and VM, and the provider interface. Its authors made the smart decision of separating the general infrastructure of DTrace from the specifics of probe creation and firing: providers provide and fire the probe points, and the framework executes their actions. This allows providers to be developed separately from DTrace itself. If you want to learn more about DTrace I recommend reading Dynamic Instrumentation of Production Systems presented in USENIX 2004.
For this post I want to focus on how FBT intercepts function
calls. That is, when a user enables an FBT probe, how exactly
does the probe end up firing when a kernel thread hits that
entry or return site? For example, the following one-liner
instruments all calls to the mac_ring_tx()
function, responsible for delivering a network packet to the
underlying NIC. It counts the number of packets traveling
across each interface along with the distribution of the
packet size sent across each interface. As you can see, a
simple function entry probe can be quite powerful, and I’m
only touching the surface of what’s possible. But how exactly
did the FBT provider accomplish this?
Insrtumenting Function Calls and Returns
Before I can describe how a site is intercepted I
need to give a primer on how the kernel text
is instrumented. Upon first use DTrace loads the FBT
provider, iterates all loaded kernel modules, and passes each
one to fbt_provide_module()
. This function uses
the ELF section header data to iterate all functions, read
their program text, and find the location of all entry and
return sites. The FBT provider creates an FBT probe
(fbt_probe_t
) for each site, responsible for
holding the information needed to instrument that site. This
design produces zero overhead when no probes are in use as no
program text is modified during this stage.
It’s not until the user executes a D script containing one or
more FBT probes that the provider instruments the
kernel module’s text. The DTrace framework
calls fbt_enable()
, passing in the FBT probe
structure created earlier which contains the information
needed to alter the correct program text. If this is an entry
probe, then FBT replaces the push %rbp
instruction. Otherwise, it is a return probe, and FBT replaces
the ret
instruction. In both cases, FBT replaces
this instruction with INT3
, the breakpoint
instruction (or #BP in Intel mnemonics). This design produces
the minimal overhead needed to intercept the probe site. It
modifies only the site requested, leaving all other text
untouched.
A breakpoint is what’s known as a software exception — a processor exception produced at the request of software, as opposed to an exception generated by a legit program exception, like divide by zero. All exceptions are really just interrupts, with the small difference being that an interrupt is something typically generated external to the processor, whereas an exception is generated by the processor as a consequence of the code it is executing. Furthermore, when reading code, you’ll also see breakpoint referred to as a trap. This is yet a further classification of the exception, and it dictates what actions the processor will take upon servicing the exception. All three terms are used interchangeably throughout the code, but the important thing to remember is it’s an interrupt. It’s interrupting the normal flow of the program code (in this case, the kernel). Like all other interrupts, the operating system must provide the processor with some action to take upon receiving a breakpoint. This action is defined in the Interrupt Descriptor Table (IDT), seen below.
Upon receiving a breakpoint the processor will call into the
handler tr_brktrap
. The tr_
prefix
stands for trampoline, which is a mitigation method used as
part of KPTI. This trampoline provides Meltdown protection and
ultimately leads to brktrap
, which I’ll come back
to later. The SDT_SYSIGT
argument tells us this
is an “interrupt gate”, meaning the processor will disable the
assertion of all maskable hardware interrupts before invoking
the handler — preventing another interrupt from interrupting
the breakpoint. The TRP_UPL
argument indicates
the interrupt may originate from user or kernel space. In the
case of the FBT provider, it will always originate from kernel
space. And finally, the last argument is an index into the
Interrupt Stack Table (IST): a collection of dedicated stacks
to be used during interrupt handling to avoid scribbling on
the interrupted thread’s stack. In this case
the T_BPTFLT
handler maps to a dedicated stack.
Before the Handler
So I’m running a D script that contains an action for one or
more FBT probes, and a kernel thread executes one of these
probe sites, what’s next? Based on the IDT entry it should
call the brktrap
handler, but that’s not quite
what happens. First, the processor does some work on the
operating system’s behalf before calling the interrupt
handler. It must switch to the stack specified in the IDT
entry and push the following values to it, in this order (see
Intel Vol. 3, §6.14.4).
- The interrupted thread’s stack selector (SS).
- The interrupted thread’s stack pointer (RSP).
- The interrupted thread’s RFlags.
- The interrupted thread’s Code Segment selector (CS).
- The interrupted thread’s Instruction Pointer (RIP).
-
The error code of the exception, as
INT3
has no error code this doesn’t apply.
This ends with the processor in interrupt context, with the
interrupted thread’s primary CPU context loaded on the new
stack, ready to execute the operating
system’s brktrap
handler — as depicted in the
figure below.
The invoptrap Handler
The brktrap
handler is actually a front for the
real workhorse: the invoptrap
handler. Different
providers use different methods of trapping into the DTrace
framework. Some use INT3
(#BP), others use
the LOCK
prefix (0xFO
) incorrectly
in order to force an invalid operation fault (#UD). I’m not
sure why both methods exist or when a provider should prefer
one method over the other, perhaps fodder for a future post.
But in the case of FBT, the first stop
is brktrap
.
/*
* #BP
*/
ENTRY_NP(brktrap)
XPV_TRAP_POP
cmpw $KCS_SEL, 8(%rsp)
jne bp_user
/*
* This is a breakpoint in the kernel -- it is very likely that this
* is DTrace-induced. To unify DTrace handling, we spoof this as an
* invalid opcode (#UD) fault. Note that #BP is a trap, not a fault --
* we must decrement the trapping %rip to make it appear as a fault.
* We then push a non-zero error code to indicate that this is coming
* from #BP.
*/
decq (%rsp)
push $1 /* error code -- non-zero for #BP */
jmp ud_kernel
Lines 296-297 determine where the BP originated from:
user-land or kernel-land. If the former, jump
to bp_user
, otherwise fall through. For those
that aren’t familiar with assembly, I’ll break down line 296 a
bit more. At the point of this instruction the
interrupt’s RSP
points to the 64-bit word
containing the RIP
of the interrupted
thread. The 8(%rsp)
instruction is AT&T
syntax for indirect memory addressing: it tells the CPU to
read the value at %rsp + 8
(basically the parens
are equivalent to dereferencing a pointer in C). Since the
stack grows downward, and every value on the stack is 64-bit,
adding 8 bytes references the value “above” RSP
on the stack: the interrupted thread’s CS
.
Therefore, the cmpw
instruction compares the
lower 16 bits of the immediate value $KCS_SEL
to
the interrupted thread’s CS
value. In this case
we know the #BP originated from a kernel thread and the
processor will fall through.
Lines 307-309 are well described by the comment above them.
The bulk of the handling is done under the #UD handler, but in
FBT’s case we are coming in via #BP. Rather than replicate the
handler logic we opt to make the #BP look like a #UD, jumping
to the ud_kernel
handler after doing so. This is
achieved by rolling back the RIP
by one
instruction, as it would be in the case of a #UD fault, and
pushing an error code on the stack. The error code, while
meaningless in this case, is needed to keep the stack layout
consistent with other exceptions. All exceptions share common
routines which expect the stack to be laid out in a particular
way.
ENTRY_NP(invoptrap)
XPV_TRAP_POP
cmpw $KCS_SEL, 8(%rsp)
jne ud_user
#if defined(__xpv)
movb $0, 12(%rsp) /* clear saved upcall_mask from %cs */
#endif
push $0 /* error code -- zero for #UD */
ud_kernel:
push $0xdddd /* a dummy trap number */
On line 345 we find the ud_kernel
label. It sits
a few instructions into the invoptrap
handler, the
main DTrace entry point. Like the brktrap
handler, invoptrap
looks at the the interrupt
origin to determine what to do. Since we originated from #BP
we skip all that and head straight to line 346 where we push a
dummy trap number. Like the dummy error code pushed
in brktrap
, this is done to make sure the stack
has a consistent layout across all exceptions so that macros
like INTR_PUSH
can function correctly. However,
why we push a dummy value instead of the actual trap
number, like all the other traps, is a mystery to me.
INTR_PUSH
As I mentioned, all exceptions use the same stack layout, but
what exactly is that layout? The layout is defined by
the regs
structure, shown below. This structure
includes all values pushed by the processor’s exception
mechanism as well as the trap number, all general-purpose
registers, the segment registers, and two special values
named r_savfp
and r_savpc
. These
last two fields contain copies of the interrupted thread’s
frame pointer and program counter, respectively. The
processor’s exception mechanism populates lines 101-105 of the
structure on the stack, lines 95 & 100 are populated by the
handler, and INTR_PUSH
is responsible for
populating the rest.
struct regs {
/*
* Extra frame for mdb to follow through high level interrupts and
* system traps. Set them to 0 to terminate stacktrace.
*/
greg_t r_savfp; /* a copy of %rbp */
greg_t r_savpc; /* a copy of %rip */
greg_t r_rdi; /* 1st arg to function */
greg_t r_rsi; /* 2nd arg to function */
greg_t r_rdx; /* 3rd arg to function, 2nd return register */
greg_t r_rcx; /* 4th arg to function */
greg_t r_r8; /* 5th arg to function */
greg_t r_r9; /* 6th arg to function */
greg_t r_rax; /* 1st return register, # SSE registers */
greg_t r_rbx; /* callee-saved, optional base pointer */
greg_t r_rbp; /* callee-saved, optional frame pointer */
greg_t r_r10; /* temporary register, static chain pointer */
greg_t r_r11; /* temporary register */
greg_t r_r12; /* callee-saved */
greg_t r_r13; /* callee-saved */
greg_t r_r14; /* callee-saved */
greg_t r_r15; /* callee-saved */
/*
* fsbase and gsbase are sampled on every exception in DEBUG kernels
* only. They remain in the non-DEBUG kernel to avoid any flag days.
*/
greg_t __r_fsbase; /* no longer used in non-DEBUG builds */
greg_t __r_gsbase; /* no longer used in non-DEBUG builds */
greg_t r_ds;
greg_t r_es;
greg_t r_fs; /* %fs is *never* used by the kernel */
greg_t r_gs;
greg_t r_trapno;
/*
* (the rest of these are defined by the hardware)
*/
greg_t r_err;
greg_t r_rip;
greg_t r_cs;
greg_t r_rfl;
greg_t r_rsp;
greg_t r_ss;
};
After the INTR_PUSH
we have an
entire regs
structure populated on the stack
with RSP
pointing to r_savfp
.
movq REGOFF_RIP(%rsp), %rdi
movq REGOFF_RSP(%rsp), %rsi
movq REGOFF_RAX(%rsp), %rdx
pushq (%rsi)
movq %rsp, %rsi
subq $8, %rsp
call dtrace_invop
ALTENTRY(dtrace_invop_callsite)
Lines 348-350 ready the first and third arguments for
the dtrace_invop()
call: the address of the
instruction being intercepted, and the RAX
value
of the function being interrupted (for return probes). Line
349 looks like it’s placing the interrupted thread’s stack
pointer in the second argument position, but really it’s
staging the pointer in RSI
for lines 351-352. On
line 351 we dereference the
RSP
of the interrupted thread and place that
value on the top of the handler’s stack. Which leads to the
question: what was the last thing pushed on the stack of the
interrupted thread? This depends on the probe. For entry
probes we instrument the pushq %rbp
instruction,
thus the last thing on the stack is the return site of the
function that called the interrupted function.
If foo()
called bar()
, and bar’s
entry point is instrumented by DTrace,
then (%rsi)
would be foo()+0xXXXX
.
And it turns out this is true for return probes as well as
illumos uses leave
plus ret
to exit
the function, and FBT instruments the ret
instruction. Therefore, upon entry into the invop handler, the
last thing on the interrupted thread’s stack is always the
return instruction pointer. Moving onto line 352, we replace
the interrupted thread’s stack pointer with our handler’s
stack pointer, so that dtrace_invop()
will have a
pointer to the handler’s stack as its second argument.
Finally, before calling dtrace_invop()
, we
subtract 8 bytes from the stack to keep it at a 16-byte
alignment (see Intel Vol. 3, §6.14.2). The
generic dtrace_invop()
function checks this
instrumentation point against the SDT and FBT providers. In
this case we are dealing with an FBT probe and will end up
calling fbt_invop()
. But before moving on, it’s
also worth noting line 355, after the call
to dtrace_invop()
, where we create the
symbol dtrace_invop_callsite
. This becomes the
return address upon entering dtrace_invop()
, and
its use becomes important when accessing function arguments,
which I will cover in a follow up post.
static int
fbt_invop(uintptr_t addr, uintptr_t *stack, uintptr_t rval)
{
uintptr_t stack0, stack1, stack2, stack3, stack4;
fbt_probe_t *fbt = fbt_probetab[FBT_ADDR2NDX(addr)];
for (; fbt != NULL; fbt = fbt->fbtp_hashnext) {
The FBT probes are stored in a chained hash table
named fbt_probetab
. Line 85 hashes the
instrumented address to determine the bucket the probe is in,
and then line 87 loops through all entries in that bucket to
find the one matching this address.
if ((uintptr_t)fbt->fbtp_patchpoint == addr) {
if (fbt->fbtp_roffset == 0) {
int i = 0;
/*
* When accessing the arguments on the stack,
* we must protect against accessing beyond
* the stack. We can safely set NOFAULT here
* -- we know that interrupts are already
* disabled.
*/
DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
CPU->cpu_dtrace_caller = stack[i++];
#ifdef __amd64
/*
* On amd64, stack[0] contains the dereferenced
* stack pointer, stack[1] contains savfp,
* stack[2] contains savpc. We want to step
* over these entries.
*/
i += 2;
#endif
stack0 = stack[i++];
stack1 = stack[i++];
stack2 = stack[i++];
stack3 = stack[i++];
stack4 = stack[i++];
DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT |
CPU_DTRACE_BADADDR);
dtrace_probe(fbt->fbtp_id, stack0, stack1,
stack2, stack3, stack4);
CPU->cpu_dtrace_caller = 0;
In this case we found an entry probe matching the instrumented
address. We know it’s an entry probe because
fbtp_roffset
is zero.
The use of CPU_DTRACE_NOFAULT
is interesting, and
I’ll be honest, I’m not sure I fully understand why it’s used
here, but here’s my best guess. First off, the no-fault
mechanism is used throughout DTrace to protect the operating
system from bad loads in DTrace actions. This flag allows the
page fault logic
in locore.s
to determine if the #PF originated from DTrace, and if so to
skip over the offending instruction, set some DTrace CPU flags
to indicate the faulting load, and move on with execution.
This is different from a typical #PF, which would wind up in
the
global trap
handler. As for it’s use here? I think it’s because we are
directly accessing the handler’s stack via
the stack
pointer, as opposed to the typical way
you access the stack in C: by using local variables. These
reads should always be valid, but perhaps that wasn’t always
the case in the history of this code. In any event, if any of
the references on lines 109-113 happen to go out of bounds,
the #PF handler will catch it and simply move on (leaving
whatever garbage value it was initialized to).
Now it’s time to start reading data off the handler’s stack.
Remember, stack
is pointing into
the handler’s stack, at the point of the last push in
the invoptrap
handler. The last thing we saved on
the stack was the caller of the instrumented function, found
at location stack[0]
. On line 99 we store that
symbol address in the CPU-local
variable cpu_dtrace_caller
, which is used in the
DTrace stack()
action and to populate the DTrace
built-in caller
variable. The next two values on
the handler’s stack are the regs
structure
values r_savfp
and r_savpc
, the
saved values of the frame pointer and program counter,
respectively. We skip over them to get to the arguments of the
instrumented function.
The System V AMD64 ABI dictates that the first six arguments
are stored in specific registers. These stack[N]
accesses reference those registers and those variables are
passed to dtrace_probe()
to act as a cache for
the first five arguments to the instrumented function. But why
five and not six? While illumos no longer supports a 32-bit
kernel, when we did have such support the #UD handler
setup stack[1]
to point to the beginning of the
first 10 arguments to the instrumented function, so we know
it’s not because of 32-bit. It also doesn’t seem to be because
of a SPARC limitation, as a quick search of the various SPARC
ABIs all claim the first six arguments may be placed in
registers. My best guess is that it’s from a desire to keep
all arguments to dtrace_probe()
in registers —
for efficiency and perhaps to avoid polluting the stack as
some probe actions are sensitive to stack depth such as
the stack()
action. In any event, the final
action is to call dtrace_probe()
, the main entry
point into the DTrace framework which ultimately executes the
actions linked with this DTrace probe.
} else {
#ifdef __amd64
/*
* On amd64, we instrument the ret, not the
* leave. We therefore need to set the caller
* to assure that the top frame of a stack()
* action is correct.
*/
DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
CPU->cpu_dtrace_caller = stack[0];
DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT |
CPU_DTRACE_BADADDR);
#endif
dtrace_probe(fbt->fbtp_id, fbt->fbtp_roffset,
rval, 0, 0, 0);
CPU->cpu_dtrace_caller = 0;
}
return (fbt->fbtp_rval);
}
}
return (0);
}
In this case we have a matching return probe. I’m not so sure
I follow this comment. The caller’s return address is still on
the interrupted thread’s stack regardless of whether we
instrument the leave
or ret
instruction, but the point is that this value must be saved in
case the user calls the DTrace action stack()
in
their action. Like the entry probe, the goal is to send a
probe event to the DTrace framework, but this time instead of
passing arguments we pass the return offset and return value.
These values are accessible in the FBT return probe action as
the variables arg0
and arg1
,
respectively.
In the case of a matching probe, we need to return
the fbtp_rval
value (line 140). This value was
determined at probe creation time and tells the
invoptrap
handler how to return back to the
interrupted thread. This value depends on the type of probe,
entry vs. return, and the architecture, 32-bit vs. 64. Given
that illumos no longer ships a 32-bit kernel, we can ignore
the architecture and assume AMD64. That means we
return DTRACE_INVOP_PUSHL_EBP
for an entry probe
and DTRACE_INVOP_RET
for a return probe. If we
didn’t find a matching probe, then return zero (line 144).
This tells the invop handler to treat this fault as a normal
#UD and continue processing.
pushq (%rsi)
movq %rsp, %rsi
subq $8, %rsp
call dtrace_invop
ALTENTRY(dtrace_invop_callsite)
addq $16, %rsp
cmpl $DTRACE_INVOP_PUSHL_EBP, %eax
je ud_push
cmpl $DTRACE_INVOP_LEAVE, %eax
je ud_leave
cmpl $DTRACE_INVOP_NOP, %eax
je ud_nop
cmpl $DTRACE_INVOP_RET, %eax
je ud_ret
jmp ud_trap
Picking up on line 356 of the invoptrap
handler,
we add 16 bytes to the stack pointer which undoes the pushes
on lines 351 and 353, leaving RSP
at the start of
the regs
structure. From there we jump to one of
the labels depending on the value returned
by fbt_invop()
.
ud_push:
/*
* We must emulate a "pushq %rbp". To do this, we pull the stack
* down 8 bytes, and then store the base pointer.
*/
INTR_POP
subq $16, %rsp /* make room for %rbp */
pushq %rax /* push temp */
movq 24(%rsp), %rax /* load calling RIP */
addq $1, %rax /* increment over trapping instr */
movq %rax, 8(%rsp) /* store calling RIP */
movq 32(%rsp), %rax /* load calling CS */
movq %rax, 16(%rsp) /* store calling CS */
movq 40(%rsp), %rax /* load calling RFLAGS */
movq %rax, 24(%rsp) /* store calling RFLAGS */
movq 48(%rsp), %rax /* load calling RSP */
subq $8, %rax /* make room for %rbp */
movq %rax, 32(%rsp) /* store calling RSP */
movq 56(%rsp), %rax /* load calling SS */
movq %rax, 40(%rsp) /* store calling SS */
movq 32(%rsp), %rax /* reload calling RSP */
movq %rbp, (%rax) /* store %rbp there */
popq %rax /* pop off temp */
jmp tr_iret_kernel /* return from interrupt */
In this case we are leaving an entry probe and need to emulate
the pushq %rbp
instruction (which was overwritten
with a breakpoint). The first line, INTR_POP
restores the GPRs and adjusts RSP
to point to the
values pushed by the processor when switching to the interrupt
stack, namely the CS:RIP
, RFLAGS
,
and SS:RSP
values of the interrupted thread. The
rest of the code, starting with line 373, is a bit odd and
even conflicts with the comment above it somewhat (it’s
pulling the stack down 16 bytes, not 8). This code doesn’t
need to be this way, but I think it is this way as a
consequence of a time in the past when the AMD64 interrupt
handler was not using it’s own stack or perhaps the author of
this routine started with a copy of the 32-bit routine. But
the idea is that the
iret
call is going to pop values off the stack to
restore the CS:RIP
, RFLAGS
,
and SS:RSP
registers to their original values. In
order to emulate the frame pointer push we need to pull the
stack down by enough bytes to store said pointer. To do that
we first need to shift the values that were originally stashed
by the processor on the stack when the interrupt was taken. On
i386 this made sense as the handler reused the kernel thread’s
stack, so we need to “pull it down”. But on AMD64 the handler
has its own stack and so this is all unnecessary. That said,
the AMD64 handler still pulls the stack down, to no real
effect, but it also modifies the interrupted thread’s
(referred to as the “caller” in this code) RSP
to
stash the RBP
and increments the interrupted
thread’s RIP
to the next instruction. Finally, we
jump to tr_iret_kernel
, which is really just
an iret
instruction (the author’s of KPTI decided
it would be wise to alias it so that future maintainers know
that the iret
path was analyzed when working on
the Meltdown mitigations — a blessing for those who will have
to maintain this code for decades to come). After
the iret
has executed we are back in the kernel
thread’s original execution context and it’s as if FBT was
never involved.
Here’s how the push
emulation could be written in
a simpler manner.
simple_ud_push:
/*
* Emulate "pushq %rbp" by stashing RBP on the caller's stack
* and incrementing the caller's RIP by one.
*/
INTR_POP
pushq %rax /* stash RAX to use as scratch */
movq (%rsp), %rax /* load calling RIP */
addq $1, %rax /* increment over trapping instr */
movq %rax, (%rsp) /* store calling RIP */
movq 24(%rsp), %rax /* load calling RSP */
subq $8, %rax /* make room for %rbp */
movq %rbp, (%rax) /* store %rbp there */
movq %rax, 24(%rsp) /* store calling RSP */
popq %rax /* restore RAX */
jmp tr_iret_kernel /* return from interrupt */
ud_ret:
INTR_POP
pushq %rax /* push temp */
movq 32(%rsp), %rax /* load %rsp */
movq (%rax), %rax /* load calling RIP */
movq %rax, 8(%rsp) /* store calling RIP */
addq $8, 32(%rsp) /* adjust new %rsp */
popq %rax /* pop off temp */
jmp tr_iret_kernel /* return from interrupt */
Here we emulate the ret
instruction. On line 426
push RAX
to the stack in order to use it as a
scratch register. On lines 427-428 we use the interrupted
thread’s RSP
to get the return pointer. On line
429 we replace the stashed RIP
on the handler’s
stack with the one we just pulled from the interrupted
thread’s stack (remember, iret
is going to
populate RIP
from the handler’s stack). On line
430 we remove the stored RIP
from the interrupted
thread’s stack. On line 431 we restore RAX
. And
finally we iret
back into the interrupted thread.
And with that we now know how a kernel function entry/return point is instrumented, intercepted, and emulated. While that’s interesting, it’s only part of the FBT story. One of the more powerful aspects of DTrace is the ability to grab the stacktrace and arguments at the time of probe firing. These features are implemented in their own functions, but they rely on the invop handler’s help. Now that we understand the handler we can see how these other DTrace features work, to be described in a follow up post.