Tracing Kernel Functions: FBT stack() and arg
Oct 27, 2020
In my previous
post I described how FBT intercepts function calls and
vectors them into the DTrace framework. That laid the
foundation for what I want to discuss in this post: the
implementation of the
stack() action and
arg variables. These features rely on
the precise layout of the stack, the details of which I
touched on previously. In this post I hope to illuminate those
details a bit more with the help of some visuals, and then
guide you through the implementation of these two DTrace
features as they relate to the FBT provider.
But first I must make a correction to my last post. It turns
out the FBT handler does not execute on the IST stack.
It runs on either the thread’s stack or the CPU’s high-level
interrupt stack depending on the context of the kernel
function call, but never on the IST. Rather, KPTI
uses the IST stack as a scratch space to perform its
trampoline into the real handler. This little detail is
important. Functions that walk the interrupted stack
have zero chance of working if run with the IST stack, for
reasons which become obvious later. This also explains why the
AMD64 handler pulls down the stack for the
pushq %RBP emulation: if it’s working
on the same stack as the thread/interrupt, then it must make
room for RBP. I can explain better with a visual.
First, the diagram from the last post.
On the left we have a kernel thread, interrupt thread, or
high-level interrupt running on CPU. On the right we have the
“interrupt context” of the breakpoint exception, using the
IST. The image is correct in that there are two different
stacks in play, but what’s running on the right-hand side is
not the brktrap handler. The right-hand side is
running the KPTI trampoline, ensuring a CR3 switch when moving
between the user/kernel boundary. The trampoline also provides
a facsimile of the processor frame to the interrupted thread’s
stack, making it none the wiser that KPTI was ever on the
scene. So all the action happens on the left side, but what
does the stack look like as we transition through the #BP
handler on our way to dtrace_invop()?
In phase ① the thread has just called
mac_ring_tx() while it is under FBT entry
instrumentation. The last thing on the thread’s stack is the
return address, and the CPU is about to execute the int3
instruction.
Phase ② is immediately after the CPU has finished execution of
the int3 instruction. The processor (via the
spectre of the KPTI trampoline) has pushed a 16-byte aligned
processor frame on the stack and has vectored into the #BP
handler.
Phase ③ is after some amount of execution of the brktrap and
invoptrap handlers; remember, the #BP handler for DTrace mimics a #UD.
This last phase shows the state just before the call to
dtrace_invop(). At this point we’ve grown a
regs structure on the stack and stashed a
copy of the return address on top of this. The latter is used to
set cpu_dtrace_caller, a variable which
becomes important later.
The stack() Action
The separation of probes and actions is a vital aspect of DTrace’s architecture. A firm boundary between these two makes DTrace more powerful than it ever could be if they were tightly coupled. Think about it: I can ask for the call stack in any probe, not just the probes that deem that information useful. The probes give you access to a context, and the actions give you access to data in that context. To limit the execution of actions to specific probes would limit the questions you can ask about the system. With this design the number of questions you can ask is virtually endless. And it turns out one of the more useful questions to ask is: “what the hell is running on my CPU?”
The stack() action allows you to record the call
stack that led to the probe site. In the context of FBT this
will record the call stack of the kernel thread or interrupt
executing an entry or return from this kernel function. You
can also access the userland stack of a thread with
ustack(), but I don’t cover that here.
The stack() action is implemented by the
dtrace_getpcstack() function. Getting there from
dtrace_invop() requires a couple more
calls in the DTrace framework. Ultimately, the call stack to
get there looks like this.
The implementation of
stack() really starts with the call to dtrace_getpcstack().
The first argument is the address of the array used to store
program counter values (aka function pointers). This array
starts at some offset into the current DTrace buffer. The
second argument is the size of that array. The third argument
is the number of “artificial frames” on the stack; more on
this later. The fourth argument, intrpc, supplies the
first (topmost) program counter in the call stack when the
probe is not anchored. An
“anchored” probe is one that has a function name specified
in its call to dtrace_probe_create(). For example,
the FBT provider uses the name of the kernel function as the
probe’s function name, thus it is anchored on the kernel
function. The profile provider, however, specifies no probe
function name; it is not anchored and is a bit of a special
case. I address this at the end of the post.
This brings us to the
dtrace_getpcstack() function. But first I’ll expand
on figure 2 to show our
stack state as of source line 60 of the function.
To build the call stack we first need to be able to walk the
stack. Luckily, illumos keeps frame pointers in the kernel,
making this easy. But in this particular situation there is
more to consider. First, we might have two stacks in play: the
high-level interrupt’s stack as well as the stack of the
thread it interrupted. Second, the DTrace framework and FBT
provider have put their own frames between this code and the
function that tripped this probe; we must exclude these
“artificial” frames from the result. Finally, we need to make
sure not to walk off the stack and into space, both for
correctness and safety. Speaking of the stack, the
stacktop variable is pointing to the “top” of
the stack in terms of memory (on x86 stacks grow downwards).
Logically, stacktop is the bottom of the
stack and the
dtrace_getpcstack() frame is the top.
If intrpc is set, then that’s our first program counter.
The main loop walks the call stack and fills in program
counters as long as there are slots remaining in
pcstack. If we were in the context of a
high-level interrupt and we’ve walked off its stack, then hop
to the thread stack. Otherwise, we’ve walked off the thread
stack, leaving just this last frame to record.
Make sure to skip over any artificial frames.
The aframes value is based on information given
by the provider at probe creation time
as well as knowledge inherent to the DTrace framework. These
two know how many frames they have each injected between
the stack() action and the first real frame; we
sum the values to know how many total frames to skip.
The caller variable is a bit more subtle, and
this is another thing I got wrong in
my last post while
discussing the return probe. The value comes from
CPU->cpu_dtrace_caller, a per-CPU
value used exclusively by the FBT provider to record the first
real frame of the call stack. But why? First, a refresher on
the code (this is the return probe logic but it’s the same for
the entry probe as well).
In my last post, when discussing this code comment, I said the following.
In this case we have a matching return probe. I’m not so sure I follow this comment. The caller’s return address is still on the interrupted thread’s stack regardless of whether we instrument the leave or ret instruction...
I am correct in stating that the return address is on the
stack. But the subtle detail I forgot is that the interrupt
machinery does not create a frame; it doesn’t push a frame
pointer to the stack. You can see this visually if you trace
the RBP link up from
dtrace_invop(): it links back to the
mac_provider_tx() frame, skipping the program
counter (mac_provider_tx+0x80) stored by the
call instruction just before FBT interposed.
We have a legit call stack frame and room on
pcstack, so add it.
If we’ve finished walking the call stack, then zero out the
rest of pcstack. Otherwise, continue walking the
stack.
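To make the walk concrete, here is a minimal userland sketch of the same idea. It is not the illumos source: walk_pcstack() is a hypothetical name, the struct frame is a two-field stand-in, the caller value is passed as a parameter rather than read from the CPU structure, and the hop from the interrupt stack to the thread stack is left out. It only shows how artificial frames are skipped, how the stashed caller takes the place of the program counter the frame chain cannot see, and how the unused slots get zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a frame record: saved frame pointer, then return address. */
struct frame {
	struct frame *fr_savfp;
	uintptr_t fr_savpc;
};

/*
 * Hypothetical helper. Walks the frame-pointer chain from fp, never going
 * at or beyond stacktop. The first `aframes` frames are treated as
 * artificial and skipped; once they are exhausted, `caller` (if nonzero)
 * is recorded as the first real program counter. Returns the number of
 * program counters recorded; any remaining slots are zeroed.
 */
static int
walk_pcstack(uintptr_t *pcstack, int limit, int aframes, uintptr_t caller,
    const struct frame *fp, const struct frame *stacktop)
{
	const struct frame *minfp = fp;
	int depth = 0, last = 0;

	assert(pcstack != NULL && fp != NULL && limit >= 0);

	while (depth < limit) {
		const struct frame *nextfp = fp->fr_savfp;
		uintptr_t pc = fp->fr_savpc;

		/* The next frame must sit above this one and below stacktop. */
		if ((uintptr_t)nextfp <= (uintptr_t)minfp ||
		    (uintptr_t)nextfp >= (uintptr_t)stacktop)
			last = 1;	/* record this frame, then stop */

		if (aframes > 0) {
			/* Artificial frame: skip it, maybe emit the caller. */
			if (--aframes == 0 && caller != 0) {
				pcstack[depth++] = caller;
				caller = 0;
			}
		} else {
			pcstack[depth++] = pc;
		}

		if (last)
			break;
		fp = minfp = nextfp;
	}

	int used = depth;
	while (depth < limit)
		pcstack[depth++] = 0;	/* zero out the rest */
	return (used);
}
```

Note that the return address held by the last artificial frame is never recorded; the stashed caller is emitted in its place, which is the role cpu_dtrace_caller plays for FBT.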
The built-in arg variables
The arg0-arg9 variables, and their typed
counterparts args[0]-args[9], allow each probe to
supply up to 10 arguments. The
arg values are
provider dependent. FBT passes the kernel function arguments
for entry probes and the return offset plus value for return
probes. Regardless of the provider, all arg variable usage
ultimately ends up at dtrace_dif_variable().
I’m not going to explain this entire thing. I put it on display only to show that these values are dependent on the provider. But in the case of FBT we have two possibilities.
For arg0 through arg4 we pull from
the argument cache stored in
dtms_arg, shown on
line 3215. The provider populates this cache via the call
to dtrace_probe(). For arg5 through arg9 we must get
help from the provider-specific
dtps_getargval. When undefined, as it
is for FBT, we fall back to the DTrace framework’s
dtrace_getarg(). Explaining this
function makes more sense by starting at the end.
As you can see, getting
arg5 through arg9 is a simple matter of dereferencing
stack. But how do we get that?
As was the case for the
stack() action, we need
to walk the current stack; but instead of recording the
program counters we search for the stack pointer at the time
of the #BP. This was recorded in the processor frame as part
of the processor’s trap machinery. If you look back at
figure 3, it’s
where the processor frame’s
RSP points back to
mac_provider_tx+0x80. When we find
dtrace_invop_callsite we know we’re at the
top of the
dtrace_invop() frame. We can’t follow
fr_savfp any further, as we’ll blow by the
processor frame, so what do we do?
Turns out we can use some pointer shenanigans to walk our way
back to the
regs structure set up by the
invoptrap handler. We do this by treating
fp as an array of frames and walking
past the current one. From there we cast to a pointer type so
we can walk individual stack entries, skipping past the
padding and the stashed
RIP. That leaves us at
the beginning of the regs structure.
With a pointer to the
regs structure we can
finally choose the correct
stack based on which
arg we want. As required by the ABI, we
know the first six args are in registers. These registers are
laid out consecutively in the
regs structure. We
can point to the first one and pretend that’s our
stack. The first five arguments are served by
the cache in dtms_arg, so only
arg5 is served by this method.
Arguments beyond arg5 are
served from the stack of the caller. In the diagram above, the
bulk of the caller’s stack frame is
elided. The traced function only has three arguments, but if it had seven or
more, these later arguments would be stored on the stack above
the return address. To index into this stack we must adjust
arg to take into account the register
arguments which are not on the caller’s stack. In this case we
convert arg = 6 to
arg = 1. You might have expected this to
be zero-based, like how the first argument starts at
arg0. But we have to take into account the
fact that the first thing on the caller’s stack is
RIP, and skip over it.
Appendix: What exactly is intrpc?
As a refresher, here’s how
intrpc is used.
And here’s how
intrpc is set.
And profile sets
Here’s the apix interrupt handler
member is the
r_rip of the
So what does it all mean?
The profile provider is the exclusive user of this mechanism.
Its probe sites are implemented via a high-level interrupt.
The initial vectoring of a high-level interrupt is no
different than the #BP interrupt used by FBT: the processor
lays out a processor frame on the current stack and the
interrupt handler builds a
regs structure on top
of it. But remember, this processor frame has no frame pointer
and thus no way to see the interrupted program counter. That
means if bar(), called by foo(), was interrupted by profile’s high-level
interrupt, we’d see the first program counter as
foo(). However, we can grab the
program counter from regs and stash
that for later retrieval. We don’t need to worry about
clobbering this value as it is only set for this specific
interrupt level. And this is why it’s named
intrpc: it’s the interrupted program
counter.
This makes me wonder though: since we always have a
regs structure, why not do away with
cpu_dtrace_caller? Why not always walk to the
regs and pull the
program counter from
there? My only guess is that this is an optimization for when
someone is sitting on a hot probe referencing the
caller built-in variable (which is just the
first frame of the call stack).