
Tracing Kernel Functions: FBT stack() and arg

Oct 27, 2020

In my previous post I described how FBT intercepts function calls and vectors them into the DTrace framework. That laid the foundation for what I want to discuss in this post: the implementation of the stack() action and built-in arg variables. These features rely on the precise layout of the stack, the details of which I touched on previously. In this post I hope to illuminate those details a bit more with the help of some visuals, and then guide you through the implementation of these two DTrace features as they relate to the FBT provider.

A Correction

But first I must make a correction to my last post. It turns out the FBT handler does not execute on the IST stack. It runs on either the thread’s stack or the CPU’s high-level interrupt stack, depending on the context of the kernel function call, but never on the IST. Rather, KPTI uses the IST stack as scratch space to perform its trampoline into the real handler. This little detail is important. Functions like dtrace_getpcstack() have zero chance of working if run on the IST stack, for reasons which will become obvious later. This also explains why the AMD64 handler pulls down the stack during pushq %RBP emulation: if it’s working on the same stack as the thread/interrupt, then it must make room for RBP. I can explain better with a visual. First, the diagram from the last post.

INT3 thread/interrupt state pre-handler
Figure 1. INT3 thread/interrupt state pre-handler

On the left we have a kernel thread, interrupt thread, or high-level interrupt running on the CPU. On the right we have the “interrupt context” of the breakpoint exception, using the IST. The image is correct in that there are two different stacks in play, but what’s running on the right-hand side is not the brktrap handler. The right-hand side is running the KPTI trampoline, ensuring a CR3 switch when crossing the user/kernel boundary. The trampoline also copies a facsimile of the processor frame onto the interrupted thread’s stack, leaving it none the wiser that KPTI was ever on the scene. So all the action happens on the left side, but what does the stack look like as we transition through the #BP handler on our way to dtrace_invop()?

stack state from #BP to pre dtrace_invop()
Figure 2. stack state from #BP to pre dtrace_invop()

In phase ① mac_provider_tx() is calling mac_ring_tx() while it is under FBT entry instrumentation. The last thing on the thread’s stack is the return address, and the CPU is about to execute the int3 instruction.

Phase ② is immediately after the CPU has finished execution of the int3 instruction. The processor (via the spectre of the KPTI trampoline) has pushed a 16-byte aligned processor frame on the stack and has vectored into the brktrap() handler.

Phase ③ is after some amount of execution of the brktrap() and invoptrap() handlers—remember, the #BP handler for DTrace mimics a #UD. This last phase shows the state just before the call to dtrace_invop(). At this point we’ve grown an entire regs structure on the stack and stashed a copy of the return address on top of it. The latter is used to populate cpu_dtrace_caller, a variable which becomes important later.

The stack() Action

The separation of probes and actions is a vital aspect of DTrace’s architecture. A firm boundary between these two makes DTrace more powerful than it ever could be if they were tightly coupled. Think about it: I can ask for the call stack in any probe, not just the probes that deem that information useful. The probes give you access to a context, and the actions give you access to data in that context. To limit the execution of actions to specific probes would limit the questions you can ask about the system. With this design the number of questions you can ask is virtually endless. And it turns out one of the more useful questions to ask is: “what the hell is running on my CPU?”

The stack() action allows you to record the call stack that led to the probe site. In the context of FBT this will record the call stack of the kernel thread or interrupt executing an entry or return from this kernel function. You can also access the userland stack of a thread via ustack(), but I don’t cover that here.

The stack() action is implemented by the dtrace_getpcstack() function. To get there from dtrace_invop() requires a couple more calls in the DTrace framework. Ultimately, the call stack to get there looks like this.

dtrace_getpcstack()
dtrace_probe()
fbt_invop()
dtrace_invop()
dtrace_invop_callsite() <aka invoptrap>
<rest of call stack that lead here>
call stack between dtrace_getpcstack() and dtrace_invop()

The implementation of stack() really starts with DTRACEACT_STACK inside of dtrace_probe().

usr/src/uts/common/dtrace/dtrace.c
			case DTRACEACT_STACK:
				if (!dtrace_priv_kernel(state))
					continue;

				dtrace_getpcstack((pc_t *)(tomax + valoffs),
				    size / sizeof (pc_t), probe->dtpr_aframes,
				    DTRACE_ANCHORED(probe) ? NULL :
				    (uint32_t *)arg0);
stack() action implementation found in dtrace_probe()

The first argument is the address of the array used to store program counter values (aka function pointers). This array starts at some offset into the current DTrace buffer. The second argument is the size of that array. The third argument is the number of “artificial frames” on the stack, more on this later. The fourth argument is used to determine if the first (topmost) program counter in the call stack is the value passed in arg0 to dtrace_probe(). An “anchored” probe is one that has a function name specified when calling dtrace_probe_create(). For example, the FBT provider uses the name of the kernel function as the probe’s function name, thus it is anchored on the kernel function. The profile provider, however, specifies no probe function name; it is not anchored and is a bit of a special case. I address this at the end of the post.

This brings us to the dtrace_getpcstack() function. But first I’ll expand on figure 2 to show our stack state as of source line 60 of the function.

start of dtrace_getpcstack()
Figure 3. start of dtrace_getpcstack()
usr/src/uts/intel/dtrace/dtrace_isa.c
void
dtrace_getpcstack(pc_t *pcstack, int pcstack_limit, int aframes,
    uint32_t *intrpc)
{
	struct frame *fp = (struct frame *)dtrace_getfp();
	struct frame *nextfp, *minfp, *stacktop;
	int depth = 0;
	int on_intr, last = 0;
	uintptr_t pc;
	uintptr_t caller = CPU->cpu_dtrace_caller;

	if ((on_intr = CPU_ON_INTR(CPU)) != 0)
		stacktop = (struct frame *)(CPU->cpu_intr_stack + SA(MINFRAME));
	else
		stacktop = (struct frame *)curthread->t_stk;
	minfp = fp;

	aframes++;
dtrace_getpcstack()

To build the call stack we first need to be able to walk the stack. Luckily, illumos keeps frame pointers in the kernel, making this easy. But in this particular situation there is more to consider. First, we might have two stacks in play: the high-level interrupt’s stack as well as the stack of the thread it interrupted. Second, the DTrace framework and FBT provider have put their own frames between this code and the function that tripped this probe; we must exclude these “artificial” frames from the result. Finally, we need to make sure not to walk off the stack and into space, both for correctness and safety. Speaking of the stack, the stacktop variable is pointing to the “top” of the stack in terms of memory (on x86 stacks grow downwards). Logically speaking, stacktop is the bottom of the stack and the dtrace_getpcstack() frame is the top.

usr/src/uts/intel/dtrace/dtrace_isa.c
	if (intrpc != NULL && depth < pcstack_limit)
		pcstack[depth++] = (pc_t)intrpc;
dtrace_getpcstack()

If intrpc is set, then that’s our first program counter.

usr/src/uts/intel/dtrace/dtrace_isa.c
	while (depth < pcstack_limit) {
		nextfp = (struct frame *)fp->fr_savfp;
		pc = fp->fr_savpc;

		if (nextfp <= minfp || nextfp >= stacktop) {
			if (on_intr) {
				/*
				 * Hop from interrupt stack to thread stack.
				 */
				stacktop = (struct frame *)curthread->t_stk;
				minfp = (struct frame *)curthread->t_stkbase;
				on_intr = 0;
				continue;
			}

			/*
			 * This is the last frame we can process; indicate
			 * that we should return after processing this frame.
			 */
			last = 1;
		}
dtrace_getpcstack()

The main loop walks the call stack and fills in program counters as long as there are slots remaining in pcstack. If we were in the context of a high-level interrupt and we’ve walked off its stack, then hop to the thread stack. Otherwise, we’ve walked off the thread stack, leaving just this last frame to record.

usr/src/uts/intel/dtrace/dtrace_isa.c
		if (aframes > 0) {
			if (--aframes == 0 && caller != 0) {
				/*
				 * We've just run out of artificial frames,
				 * and we have a valid caller -- fill it in
				 * now.
				 */
				ASSERT(depth < pcstack_limit);
				pcstack[depth++] = (pc_t)caller;
				caller = 0;
			}
		} else {
dtrace_getpcstack()

Make sure to skip over any artificial frames. The aframes value is based on information given by the provider at probe creation time (dtrace_probe_create()/dtpr_aframes) as well as knowledge inherent to the DTrace framework. These two know how many frames they have each injected between the stack() action and the first real frame; we sum the values to know how many total frames to skip.

The caller variable is a bit more subtle, and this is another thing I got wrong in my last post while discussing the return probe. The caller value comes from CPU->cpu_dtrace_caller, a per-CPU value used exclusively by the FBT provider to record the first real frame of the call stack. But why? First, a refresher on the code (this is the return probe logic, but it’s the same for the entry probe as well).

usr/src/uts/intel/dtrace/fbt.c
				/*
				 * On amd64, we instrument the ret, not the
				 * leave.  We therefore need to set the caller
				 * to assure that the top frame of a stack()
				 * action is correct.
				 */
				DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
				CPU->cpu_dtrace_caller = stack[0];
fbt_invop()

In my last post, when discussing this code comment, I said the following.

In this case we have a matching return probe. I’m not so sure I follow this comment. The caller’s return address is still on the interrupted thread’s stack regardless of whether we instrument the leave or ret instruction...

I am correct in stating that the return address is on the stack. But the subtle detail I forgot is that the interrupt machinery does not create a frame—it doesn’t push a frame pointer to the stack. You can see this visually if you trace the RBP link up from dtrace_invop(): it links back to the mac_provider_tx() frame, skipping the program counter (mac_provider_tx+0x80) stored by the call instruction just before FBT interposed.

usr/src/uts/intel/dtrace/dtrace_isa.c
		} else {
			if (depth < pcstack_limit)
				pcstack[depth++] = (pc_t)pc;
		}
dtrace_getpcstack()

We have a legitimate call stack frame and room left in pcstack, so add it.

usr/src/uts/intel/dtrace/dtrace_isa.c
		if (last) {
			while (depth < pcstack_limit)
				pcstack[depth++] = 0;
			return;
		}

		fp = nextfp;
		minfp = fp;
dtrace_getpcstack()

If we’ve finished walking the call stack, then zero out the rest of pcstack. Otherwise, continue walking the call stack.

The built-in arg variables

The arg0-arg9 variables, and their typed counterparts args[0]-args[9], allow each probe to supply up to 10 arguments. The arg values are provider dependent. FBT passes the kernel function’s arguments for entry probes; for return probes it passes the offset of the return instruction and the return value. Regardless of the provider, all arg variable usage ultimately ends up at dtrace_dif_variable().

usr/src/uts/common/dtrace/dtrace.c
	case DIF_VAR_ARGS:
		if (!(mstate->dtms_access & DTRACE_ACCESS_ARGS)) {
			cpu_core[CPU->cpu_id].cpuc_dtrace_flags |=
			    CPU_DTRACE_KPRIV;
			return (0);
		}

		ASSERT(mstate->dtms_present & DTRACE_MSTATE_ARGS);
		if (ndx >= sizeof (mstate->dtms_arg) /
		    sizeof (mstate->dtms_arg[0])) {
			int aframes = mstate->dtms_probe->dtpr_aframes + 2;
			dtrace_provider_t *pv;
			uint64_t val;

			pv = mstate->dtms_probe->dtpr_provider;
			if (pv->dtpv_pops.dtps_getargval != NULL)
				val = pv->dtpv_pops.dtps_getargval(pv->dtpv_arg,
				    mstate->dtms_probe->dtpr_id,
				    mstate->dtms_probe->dtpr_arg, ndx, aframes);
			else
				val = dtrace_getarg(ndx, aframes);

			/*
			 * This is regrettably required to keep the compiler
			 * from tail-optimizing the call to dtrace_getarg().
			 * The condition always evaluates to true, but the
			 * compiler has no way of figuring that out a priori.
			 * (None of this would be necessary if the compiler
			 * could be relied upon to _always_ tail-optimize
			 * the call to dtrace_getarg() -- but it can't.)
			 */
			if (mstate->dtms_probe != NULL)
				return (val);

			ASSERT(0);
		}

		return (mstate->dtms_arg[ndx]);
dtrace_dif_variable()

I’m not going to explain this entire thing. I put it on display only to show that these values are dependent on the provider. But in the case of FBT we have two possibilities.

For arg0 through arg4 we pull from the argument cache stored in dtms_arg[], shown on line 3215. The provider populates this cache via the call to dtrace_probe().

For arg5 through arg9 we must get help from the provider-specific callback: dtps_getargval. When undefined, as it is for FBT, we fall back to the DTrace framework function dtrace_getarg(). Explaining this function makes more sense by starting at the end.

usr/src/uts/intel/dtrace/dtrace_isa.c
load:
	DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
	val = stack[arg];
	DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT);

	return (val);
dtrace_getarg()

As you can see, getting arg5 through arg9 is a simple matter of dereferencing the stack. But how do we get that?

usr/src/uts/intel/dtrace/dtrace_isa.c
uint64_t
dtrace_getarg(int arg, int aframes)
{
	uintptr_t val;
	struct frame *fp = (struct frame *)dtrace_getfp();
	uintptr_t *stack;
	int i;
#if defined(__amd64)
	/*
	 * A total of 6 arguments are passed via registers; any argument with
	 * index of 5 or lower is therefore in a register.
	 */
	int inreg = 5;
#endif

	for (i = 1; i <= aframes; i++) {
		fp = (struct frame *)(fp->fr_savfp);

		if (fp->fr_savpc == (pc_t)dtrace_invop_callsite) {
dtrace_getarg()

As was the case for the stack() action, we need to walk the current stack; but instead of recording the program counters we search for the stack pointer at the time of the #BP. This was recorded in the processor frame as part of the processor’s trap machinery. If you look back at figure 3, it’s where the processor frame’s RSP points back to mac_provider_tx+0x80. When we hit dtrace_invop_callsite we know we’re at the top of the dtrace_invop() frame. We can’t follow fr_savfp any further, as we’ll blow by the processor frame, so what do we do?

usr/src/uts/intel/dtrace/dtrace_isa.c
#else
			/*
			 * In the case of amd64, we will use the pointer to the
			 * regs structure that was pushed when we took the
			 * trap.  To get this structure, we must increment
			 * beyond the frame structure, the calling RIP, and
			 * padding stored in dtrace_invop().  If the argument
			 * that we're seeking is passed on the stack, we'll
			 * pull the true stack pointer out of the saved
			 * registers and decrement our argument by the number
			 * of arguments passed in registers; if the argument
			 * we're seeking is passed in regsiters, we can just
			 * load it directly.
			 */
			struct regs *rp = (struct regs *)((uintptr_t)&fp[1] +
			    sizeof (uintptr_t) * 2);

			if (arg <= inreg) {
				stack = (uintptr_t *)&rp->r_rdi;
			} else {
				stack = (uintptr_t *)(rp->r_rsp);
				arg -= inreg;
			}
#endif
			goto load;
dtrace_getarg()

Turns out we can use some pointer shenanigans to walk our way back to the regs structure set up by the invoptrap handler. We do this by treating fp as an array of frames and walking past the current one. From there we cast to a pointer type so we can walk individual stack entries, skipping past the padding and the stashed RIP. That leaves us at the beginning of regs.

dtrace_getarg() pointer arithmetic
Figure 4. dtrace_getarg() pointer arithmetic

With a pointer to the regs structure we can finally choose the stack based on which arg we want. As required by the ABI, we know the first six args are in registers. These registers are laid out consecutively in the regs structure. We can point to the first one and pretend that’s the stack. The first five arguments are served by the cache in dtms_arg[], so only arg5 is served by this method.

Finally, arg6 through arg9 are served from the stack of the caller. In the diagram above, the bulk of the stack frame for mac_provider_tx() is elided. It only has three arguments, but if it had seven or more, these later arguments would be stored on the stack above the RIP.

Before dereferencing stack we must adjust arg to account for the register arguments, which are not on the caller’s stack. In this case we subtract inreg: arg = 6 becomes arg = 1. You might have expected the result to be zero-based, like how the first argument starts at arg0, but the first thing on the caller’s stack is the RIP, and we have to skip over it.

Appendix: What exactly is intrpc?

As a refresher, here’s how intrpc is used.

usr/src/uts/intel/dtrace/dtrace_isa.c
void
dtrace_getpcstack(pc_t *pcstack, int pcstack_limit, int aframes,
    uint32_t *intrpc)
{
...
	if (intrpc != NULL && depth < pcstack_limit)
		pcstack[depth++] = (pc_t)intrpc;
dtrace_getpcstack()

And here’s how intrpc is set.

usr/src/uts/common/dtrace/dtrace.c
			case DTRACEACT_STACK:
				if (!dtrace_priv_kernel(state))
					continue;

				dtrace_getpcstack((pc_t *)(tomax + valoffs),
				    size / sizeof (pc_t), probe->dtpr_aframes,
				    DTRACE_ANCHORED(probe) ? NULL :
				    (uint32_t *)arg0);
stack() action implementation found in dtrace_probe()

And profile sets arg0 to CPU->cpu_profile_pc.

usr/src/uts/common/dtrace/profile.c
	dtrace_probe(prof->prof_id, CPU->cpu_profile_pc,
	    CPU->cpu_profile_upc, late, 0, 0);
profile probe

Here’s the apix interrupt handler setting cpu_profile_pc. The r_pc member is the r_rip of the regs structure.

usr/src/uts/i86pc/io/apix/apix_intr.c
	if (pil == CBE_HIGH_PIL) {	/* 14 */
		cpu->cpu_profile_pil = oldpil;
		if (USERMODE(rp->r_cs)) {
			cpu->cpu_profile_pc = 0;
			cpu->cpu_profile_upc = rp->r_pc;
			cpu->cpu_cpcprofile_pc = 0;
			cpu->cpu_cpcprofile_upc = rp->r_pc;
		} else {
			cpu->cpu_profile_pc = rp->r_pc;
			cpu->cpu_profile_upc = 0;
			cpu->cpu_cpcprofile_pc = rp->r_pc;
			cpu->cpu_cpcprofile_upc = 0;
		}
	}
setting cpu_profile_pc

So what does it all mean?

The profile provider is the exclusive user of this mechanism. Its probe sites are implemented via a high-level interrupt. The initial vectoring of a high-level interrupt is no different than the #BP interrupt used by FBT: the processor lays out a processor frame on the current stack and the interrupt handler builds a regs structure on top of it. But remember, this processor frame has no frame pointer and thus no way to see the interrupted program counter. That is, if foo() called bar(), and bar() was interrupted by profile’s high-level interrupt, we’d see the first program counter as foo(). However, we can grab the interrupted RIP from regs and stash that for later retrieval. We don’t need to worry about clobbering this value as it is only set for this specific interrupt level. And this is why it’s called intrpc: it’s the interrupted program counter.

This makes me wonder though: since we always have a regs structure, why not do away with both cpu_profile_pc and cpu_dtrace_caller? Why not always walk the stack to regs and pull the RIP from there? My only guess, perhaps this is an optimization for when someone is sitting on a hot probe referencing the caller built-in variable (which is just the first frame of stack()).