Tracing Kernel Functions: FBT stack() and arg
Oct 27, 2020
In my previous
post I described how FBT intercepts function calls and
vectors them into the DTrace framework. That laid the
foundation for what I want to discuss in this post: the
implementation of the stack()
action and
built-in arg
variables. These features rely on
the precise layout of the stack, the details of which I
touched on previously. In this post I hope to illuminate those
details a bit more with the help of some visuals, and then
guide you through the implementation of these two DTrace
features as they relate to the FBT provider.
A Correction
But first I must make a correction to my last post. It turns
out the FBT handler does not execute on the IST stack.
It runs on either the thread’s stack or the CPU’s high-level
interrupt stack depending on the context of the kernel
function call, but never on the IST.
Rather, KPTI
uses the IST stack as a scratch space to perform its
trampoline into the real handler. This little detail is
important. Functions like dtrace_getpcstack()
have zero chance of working if run with the IST stack, for
reasons which become obvious later. This also explains why the
AMD64 handler pulls down the stack
during pushq %RBP
emulation: if it’s working
on the same stack as the thread/interrupt, then it must make
room for RBP
. I can explain better with a visual.
First, the diagram from the last post.
On the left we have a kernel thread, interrupt thread, or
high-level interrupt running on CPU. On the right we have the
“interrupt context” of the breakpoint exception, using the
IST. The image is correct in that there are two different
stacks in play, but what’s running on the right-hand side is
not the brktrap
handler. The right-hand side is
running the KPTI trampoline, ensuring a CR3 switch when moving
between the user/kernel boundary. The trampoline also pushes
a facsimile of the processor frame onto the interrupted thread's
stack, making it none the wiser that KPTI was ever on the
scene. So all the action happens on the left side, but what
does the stack look like as we transition through the #BP
handler on our way to dtrace_invop()
?
In phase ① mac_provider_tx()
is
calling mac_ring_tx()
while it is under FBT entry
instrumentation. The last thing on the thread’s stack is the
return address, and the CPU is about to execute
the int3
instruction.
Phase ② is immediately after the CPU has finished execution of
the int3
instruction. The processor (via the
spectre of the KPTI trampoline) has pushed a 16-byte aligned
processor frame on the stack and has vectored into
the brktrap()
handler.
Phase ③ is after some amount of execution of
the brktrap()
and invoptrap()
handlers—remember, the #BP handler for DTrace mimics a #UD.
This last phase shows the state just before the call
to dtrace_invop()
. At this point we’ve grown an
entire regs
structure on the stack and stashed a
copy of the return address on top of it. The latter is used to
populate cpu_dtrace_caller
, a variable which
becomes important later.
The stack() Action
The separation of probes and actions is a vital aspect of DTrace's architecture. A firm boundary between the two makes DTrace more powerful than it ever could be if they were tightly coupled. Think about it: I can ask for the call stack in any probe, not just the probes that deem that information useful. The probes give you access to a context, and the actions give you access to data in that context. To limit the execution of actions to specific probes would limit the questions you can ask about the system. With this design the number of questions you can ask is virtually endless. And it turns out one of the more useful questions to ask is: "what the hell is running on my CPU?"
The stack()
action allows you to record the call
stack that led to the probe site. In the context of FBT this
will record the call stack of the kernel thread or interrupt
executing an entry or return from this kernel function. You
can also access the userland stack of a thread
via ustack()
, but I don’t cover that here.
The stack()
action is implemented by
the dtrace_getpcstack()
function. To get there
from dtrace_invop()
requires a couple more
calls in the DTrace framework. Ultimately, the call stack to
get there looks like this.
The implementation of stack()
really starts
with DTRACEACT_STACK
inside
of dtrace_probe()
.
The first argument is the address of the array used to store
program counter values (aka function pointers). This array
starts at some offset into the current DTrace buffer. The
second argument is the size of that array. The third argument
is the number of "artificial frames" on the stack; more on
this later. The fourth argument is used to determine if the
first (topmost) program counter in the call stack is the value
passed in arg0
to dtrace_probe()
. An
“anchored” probe is one that has a function name specified
when calling dtrace_probe_create()
. For example,
the FBT provider uses the name of the kernel function as the
probe’s function name, thus it is anchored on the kernel
function. The profile provider, however, specifies no probe
function name; it is not anchored and is a bit of a special
case. I address this at the end of the post.
This brings us to the dtrace_getpcstack()
function. But first I’ll expand
on figure 2 to show our
stack state as of source line 60 of the function.
void
dtrace_getpcstack(pc_t *pcstack, int pcstack_limit, int aframes,
uint32_t *intrpc)
{
struct frame *fp = (struct frame *)dtrace_getfp();
struct frame *nextfp, *minfp, *stacktop;
int depth = 0;
int on_intr, last = 0;
uintptr_t pc;
uintptr_t caller = CPU->cpu_dtrace_caller;
if ((on_intr = CPU_ON_INTR(CPU)) != 0)
stacktop = (struct frame *)(CPU->cpu_intr_stack + SA(MINFRAME));
else
stacktop = (struct frame *)curthread->t_stk;
minfp = fp;
aframes++;
To build the call stack we first need to be able to walk the
stack. Luckily, illumos keeps frame pointers in the kernel,
making this easy. But in this particular situation there is
more to consider. First, we might have two stacks in play: the
high-level interrupt’s stack as well as the stack of the
thread it interrupted. Second, the DTrace framework and FBT
provider have put their own frames between this code and the
function that tripped this probe; we must exclude these
“artificial” frames from the result. Finally, we need to make
sure not to walk off the stack and into space, both for
correctness and safety. Speaking of the stack,
the stacktop
variable is pointing to the “top” of
the stack in terms of memory (on x86 stacks grow downwards).
Logically speaking, stacktop
is the bottom of the
stack and the dtrace_getpcstack()
frame is the
top.
if (intrpc != NULL && depth < pcstack_limit)
pcstack[depth++] = (pc_t)intrpc;
If intrpc
is set, then that’s our first program counter.
while (depth < pcstack_limit) {
nextfp = (struct frame *)fp->fr_savfp;
pc = fp->fr_savpc;
if (nextfp <= minfp || nextfp >= stacktop) {
if (on_intr) {
/*
* Hop from interrupt stack to thread stack.
*/
stacktop = (struct frame *)curthread->t_stk;
minfp = (struct frame *)curthread->t_stkbase;
on_intr = 0;
continue;
}
/*
* This is the last frame we can process; indicate
* that we should return after processing this frame.
*/
last = 1;
}
The main loop walks the call stack and fills in program
counters as long as there are slots remaining in
pcstack
. If we were in the context of a
high-level interrupt and we’ve walked off its stack, then hop
to the thread stack. Otherwise, we’ve walked off the thread
stack, leaving just this last frame to record.
if (aframes > 0) {
if (--aframes == 0 && caller != 0) {
/*
* We've just run out of artificial frames,
* and we have a valid caller -- fill it in
* now.
*/
ASSERT(depth < pcstack_limit);
pcstack[depth++] = (pc_t)caller;
caller = 0;
}
} else {
Make sure to skip over any artificial frames.
The aframes
value is based on information given
by the provider at probe creation time
(dtrace_probe_create()
/dtpr_aframes
)
as well as knowledge inherent to the DTrace framework. These
two know how many frames they have each injected between
the stack()
action and the first real frame; we
sum the values to know how many total frames to skip.
The caller
variable is a bit more subtle, and
this is another thing I got wrong in
my last post while
discussing the return probe. The caller
value
comes from CPU->cpu_dtrace_caller
; a per-cpu
value used exclusively by the FBT provider to record the first
real frame of the call stack. But why? First, a refresher on
the code (this is the return probe logic but it’s the same for
the entry probe as well).
/*
* On amd64, we instrument the ret, not the
* leave. We therefore need to set the caller
* to assure that the top frame of a stack()
* action is correct.
*/
DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
CPU->cpu_dtrace_caller = stack[0];
In my last post, when discussing this code comment, I said the following.
In this case we have a matching return probe. I’m not so sure I follow this comment. The caller’s return address is still on the interrupted thread’s stack regardless of whether we instrument the leave or ret instruction...
I am correct in stating that the return address is on the
stack. But the subtle detail I forgot is that the interrupt
machinery does not create a frame—it doesn’t push a frame
pointer to the stack. You can see this visually if you trace
the RBP
link up
from dtrace_invop()
: it links back to
the mac_provider_tx()
frame, skipping the program
counter (mac_provider_tx+0x80
) stored by
the call
instruction just before FBT interposed.
} else {
if (depth < pcstack_limit)
pcstack[depth++] = (pc_t)pc;
}
We have a legit call stack frame and room in
pcstack
, so we add it.
if (last) {
while (depth < pcstack_limit)
pcstack[depth++] = 0;
return;
}
fp = nextfp;
minfp = fp;
If we’ve finished walking the call stack, then zero out the
rest of pcstack
. Otherwise, continue walking the
call stack.
The built-in arg variables
The arg0-arg9
variables, and their typed
counterparts args[0]-args[9]
, allow each probe to
supply up to 10 arguments. The arg
values are
provider dependent. FBT passes the kernel function arguments
for entry probes and the return offset plus value for return
probes. Regardless of the provider, all arg variable usage
ultimately ends up at dtrace_dif_variable()
.
case DIF_VAR_ARGS:
if (!(mstate->dtms_access & DTRACE_ACCESS_ARGS)) {
cpu_core[CPU->cpu_id].cpuc_dtrace_flags |=
CPU_DTRACE_KPRIV;
return (0);
}
ASSERT(mstate->dtms_present & DTRACE_MSTATE_ARGS);
if (ndx >= sizeof (mstate->dtms_arg) /
sizeof (mstate->dtms_arg[0])) {
int aframes = mstate->dtms_probe->dtpr_aframes + 2;
dtrace_provider_t *pv;
uint64_t val;
pv = mstate->dtms_probe->dtpr_provider;
if (pv->dtpv_pops.dtps_getargval != NULL)
val = pv->dtpv_pops.dtps_getargval(pv->dtpv_arg,
mstate->dtms_probe->dtpr_id,
mstate->dtms_probe->dtpr_arg, ndx, aframes);
else
val = dtrace_getarg(ndx, aframes);
/*
* This is regrettably required to keep the compiler
* from tail-optimizing the call to dtrace_getarg().
* The condition always evaluates to true, but the
* compiler has no way of figuring that out a priori.
* (None of this would be necessary if the compiler
* could be relied upon to _always_ tail-optimize
* the call to dtrace_getarg() -- but it can't.)
*/
if (mstate->dtms_probe != NULL)
return (val);
ASSERT(0);
}
return (mstate->dtms_arg[ndx]);
I’m not going to explain this entire thing. I put it on display only to show that these values are dependent on the provider. But in the case of FBT we have two possibilities.
For arg0
through arg4
we pull from
the argument cache stored in dtms_arg[]
, shown on
line 3215. The provider populates this cache via the call
to dtrace_probe()
.
For arg5
through arg9
we must get
help from the provider-specific
callback: dtps_getargval
. When undefined, as it
is for FBT, we fall back to the DTrace framework
function dtrace_getarg()
. Explaining this
function makes more sense by starting at the end.
load:
DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
val = stack[arg];
DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT);
return (val);
As you can see, getting arg5
through arg9
is a simple matter of dereferencing
the stack
. But how do we get that?
uint64_t
dtrace_getarg(int arg, int aframes)
{
uintptr_t val;
struct frame *fp = (struct frame *)dtrace_getfp();
uintptr_t *stack;
int i;
#if defined(__amd64)
/*
* A total of 6 arguments are passed via registers; any argument with
* index of 5 or lower is therefore in a register.
*/
int inreg = 5;
#endif
for (i = 1; i <= aframes; i++) {
fp = (struct frame *)(fp->fr_savfp);
if (fp->fr_savpc == (pc_t)dtrace_invop_callsite) {
As was the case for the stack()
action, we need
to walk the current stack; but instead of recording the
program counters we search for the stack pointer at the time
of the #BP. This was recorded in the processor frame as part
of the processor’s trap machinery. If you look back at
figure 3, it’s
where the processor frame’s RSP
points back
to mac_provider_tx+0x80
. When we
hit dtrace_invop_callsite
we know we’re at the
top of the dtrace_invop()
frame. We can’t follow
fr_savfp
any further, as we’ll blow by the
processor frame, so what do we do?
#else
/*
* In the case of amd64, we will use the pointer to the
* regs structure that was pushed when we took the
* trap. To get this structure, we must increment
* beyond the frame structure, the calling RIP, and
* padding stored in dtrace_invop(). If the argument
* that we're seeking is passed on the stack, we'll
* pull the true stack pointer out of the saved
* registers and decrement our argument by the number
* of arguments passed in registers; if the argument
* we're seeking is passed in regsiters, we can just
* load it directly.
*/
struct regs *rp = (struct regs *)((uintptr_t)&fp[1] +
sizeof (uintptr_t) * 2);
if (arg <= inreg) {
stack = (uintptr_t *)&rp->r_rdi;
} else {
stack = (uintptr_t *)(rp->r_rsp);
arg -= inreg;
}
#endif
goto load;
Turns out we can use some pointer shenanigans to walk our way
back to the regs
structure set up by
the invoptrap
handler. We do this by
treating fp
as an array of frames and walking
past the current one. From there we cast to a pointer type so
we can walk individual stack entries, skipping past the
padding and the stashed RIP
. That leaves us at
the beginning of regs
.
With a pointer to the regs
structure we can
finally choose the stack
based on
which arg
we want. As required by the ABI, we
know the first six args are in registers. These registers are
laid out consecutively in the regs
structure. We
can point to the first one and pretend that’s
the stack
. The first five arguments are served by
the cache in dtms_arg[]
, so
only arg5
is served by this method.
Finally, arg6
through arg9
are
served from the stack of the caller. In the diagram above, the
bulk of the stack frame for mac_provider_tx()
is
elided. It only has three arguments, but if it had seven or
more, these later arguments would be stored on the stack above
the RIP
.
Before dereferencing stack
we must
adjust arg
to take into account the register
arguments which are not on the caller’s stack. In this case we
subtract inreg
: arg = 6
becomes arg = 1
. You might have expected this to
be zero-based, like how the first argument starts
at arg0
. But we have to take into account the
fact that the first thing on the caller’s stack is
the RIP
, and skip over it.
Appendix: What exactly is intrpc?
As a refresher, here’s how intrpc
is used.
void
dtrace_getpcstack(pc_t *pcstack, int pcstack_limit, int aframes,
uint32_t *intrpc)
{
...
if (intrpc != NULL && depth < pcstack_limit)
pcstack[depth++] = (pc_t)intrpc;
And here’s how intrpc
is set.
case DTRACEACT_STACK:
if (!dtrace_priv_kernel(state))
continue;
dtrace_getpcstack((pc_t *)(tomax + valoffs),
size / sizeof (pc_t), probe->dtpr_aframes,
DTRACE_ANCHORED(probe) ? NULL :
(uint32_t *)arg0);
And profile sets arg0
to CPU->cpu_profile_pc
.
dtrace_probe(prof->prof_id, CPU->cpu_profile_pc,
CPU->cpu_profile_upc, late, 0, 0);
Here’s the apix interrupt handler
setting cpu_profile_pc
. The r_pc
member is the r_rip
of the regs
structure.
if (pil == CBE_HIGH_PIL) { /* 14 */
cpu->cpu_profile_pil = oldpil;
if (USERMODE(rp->r_cs)) {
cpu->cpu_profile_pc = 0;
cpu->cpu_profile_upc = rp->r_pc;
cpu->cpu_cpcprofile_pc = 0;
cpu->cpu_cpcprofile_upc = rp->r_pc;
} else {
cpu->cpu_profile_pc = rp->r_pc;
cpu->cpu_profile_upc = 0;
cpu->cpu_cpcprofile_pc = rp->r_pc;
cpu->cpu_cpcprofile_upc = 0;
}
}
So what does it all mean?
The profile provider is the exclusive user of this mechanism.
Its probe sites are implemented via a high-level interrupt.
The initial vectoring of a high-level interrupt is no
different than the #BP interrupt used by FBT: the processor
lays out a processor frame on the current stack and the
interrupt handler builds a regs
structure on top
of it. But remember, this processor frame has no frame pointer
and thus no way to see the interrupted program counter. That
is, if foo()
called bar()
,
and bar()
was interrupted by profile’s high-level
interrupt, we’d see the first program counter
as foo()
. However, we can grab the
interrupted RIP
from regs
and stash
that for later retrieval. We don’t need to worry about
clobbering this value as it is only set for this specific
interrupt level. And this is why it’s
called intrpc
: it’s the interrupted program
counter.
This makes me wonder though: since we always have
a regs
structure, why not do away with
both cpu_profile_pc
and cpu_dtrace_caller
? Why not always walk the
stack to regs
and pull the RIP
from
there? My only guess: perhaps this is an optimization for when
someone is sitting on a hot probe referencing
the caller
built-in variable (which is just the
first frame of stack()
).