
Tracing Kernel Functions: How the illumos AMD64 FBT Provider Intercepts Function Calls

June 15, 2020

For my money there is perhaps nothing more valuable than in situ kernel debugging. The ability to ask questions of any part of the kernel, on a live production system, without stopping or pausing anything, feels more like a right than a privilege after you’ve experienced it first hand. One such avenue to in situ kernel debugging is the DTrace FBT provider.

The FBT provider, which stands for Function Boundary Tracing, allows one to trace all kernel function entry and return sites, their arguments and return values, and even grab the stack that led to the call. There is no overhead when probes are disabled, and minimal overhead on the specific entry or return site when enabled. DTrace is made up of several parts: the command, the user library, the kernel framework and VM, and the provider interface. Its authors made the smart decision of separating the general infrastructure of DTrace from the specifics of probe creation and firing: providers provide and fire the probe points, and the framework executes their actions. This allows providers to be developed separately from DTrace itself. If you want to learn more about DTrace, I recommend reading Dynamic Instrumentation of Production Systems, presented at USENIX 2004.

For this post I want to focus on how FBT intercepts function calls. That is, when a user enables an FBT probe, how exactly does the probe end up firing when a kernel thread hits that entry or return site? For example, the following one-liner instruments all calls to the mac_ring_tx() function, responsible for delivering a network packet to the underlying NIC. It counts the number of packets sent on each interface, along with the distribution of packet sizes per interface. As you can see, a simple function entry probe can be quite powerful, and I'm only scratching the surface of what's possible. But how exactly did the FBT provider accomplish this?

rpz@thunderhead:~$ pfexec dtrace -qn 'mac_ring_tx:entry { this->mip = (mac_impl_t *)arg0; @[this->mip->mi_name] = count(); @dist[this->mip->mi_name] = quantize(msgsize(args[2])); } END { printf("TOTAL PACKETS\n"); printa(@); printf("\nDISTRIBUTION\n"); printa(@dist); }'
^C
TOTAL PACKETS

  igb1                                                             43
  ixgbe6                                                        31046
  ixgbe3                                                        60370
  ixgbe5                                                        68938
  aggr1014                                                      99793
  ixgbe2                                                       137064
  aggr1013                                                     197244

DISTRIBUTION

  igb1
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |@@@@@@@@@@@@@@@                          16
             128 |@@@@@@@@@@@@@@@@@@@@@@                   24
             256 |@@@                                      3
             512 |                                         0

  ixgbe6
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         10
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 31036
             128 |                                         0

  ixgbe5
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         134
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 68804
             128 |                                         0

  aggr1014
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         144
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 99649
             128 |                                         0

  ixgbe3
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         127
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         170
            2048 |                                         139
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  58801
            8192 |@                                        1052
           16384 |                                         12
           32768 |                                         69
           65536 |                                         0

  ixgbe2
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         128
              64 |                                         160
             128 |                                         0
             256 |                                         0
             512 |                                         1
            1024 |                                         79
            2048 |                                         107
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 136106
            8192 |                                         478
           16384 |                                         0
           32768 |                                         5
           65536 |                                         0

  aggr1013
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         128
              64 |                                         97
             128 |                                         0
             256 |                                         0
             512 |                                         1
            1024 |                                         249
            2048 |                                         246
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 194907
            8192 |                                         1530
           16384 |                                         12
           32768 |                                         74
           65536 |                                         0
packet distribution per interface

Instrumenting Function Calls and Returns

Before I can describe how a site is intercepted I need to give a primer on how the kernel text is instrumented. Upon first use DTrace loads the FBT provider, iterates all loaded kernel modules, and passes each one to fbt_provide_module(). This function uses the ELF section header data to iterate all functions, read their program text, and find the location of all entry and return sites. The FBT provider creates an FBT probe (fbt_probe_t) for each site, responsible for holding the information needed to instrument that site. This design produces zero overhead when no probes are in use as no program text is modified during this stage.
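
To make the scan concrete, here's a rough sketch of the kind of check fbt_provide_module() performs on each function's text. This is illustrative, not the actual illumos code: scan_function() and fbt_add_probe() are hypothetical helpers, and the real code is far more careful about instruction boundaries; only the opcode values (0x55 for push %rbp, 0xc3 for ret) are the genuine AMD64 encodings.

#include <stddef.h>
#include <stdint.h>

#define	FBT_PUSH_RBP	0x55	/* first byte of "push %rbp" */
#define	FBT_RET		0xc3	/* near return */

/* Hypothetical helper: record an fbt_probe_t for the given site. */
extern void fbt_add_probe(const char *mod, const char *func,
    const char *name, uint8_t *site);

static void
scan_function(const char *mod, const char *func, uint8_t *text, size_t size)
{
	/* Entry site: a traceable function begins with "push %rbp". */
	if (size > 0 && text[0] == FBT_PUSH_RBP)
		fbt_add_probe(mod, func, "entry", &text[0]);

	/* Return sites: each "ret" in the body gets its own probe. */
	for (size_t off = 1; off < size; off++) {
		if (text[off] == FBT_RET)
			fbt_add_probe(mod, func, "return", &text[off]);
	}
}
a sketch of the per-function scan (illustrative)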

It’s not until the user executes a D script containing one or more FBT probes that the provider instruments the kernel module’s text. The DTrace framework calls fbt_enable(), passing in the FBT probe structure created earlier which contains the information needed to alter the correct program text. If this is an entry probe, then FBT replaces the push %rbp instruction. Otherwise, it is a return probe, and FBT replaces the ret instruction. In both cases, FBT replaces this instruction with INT3, the breakpoint instruction (or #BP in Intel mnemonics). This design produces the minimal overhead needed to intercept the probe site. It modifies only the site requested, leaving all other text untouched.
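
The patch itself is a single byte. Here's a simplified sketch of what enabling and disabling boil down to; the structure and field names are illustrative stand-ins for the bookkeeping the real fbt_probe_t carries, and the real fbt_enable()/fbt_disable() do more work than this.

#include <stdint.h>

#define	FBT_BREAKPOINT	0xcc	/* INT3 */

/* Pared-down probe bookkeeping; names are illustrative, not from fbt.c. */
typedef struct patch_site {
	uint8_t	*ps_patchpoint;	/* address of the push %rbp or ret byte */
	uint8_t	ps_savedval;	/* original byte: 0x55 or 0xc3 */
} patch_site_t;

static void
patch_site_enable(patch_site_t *ps)
{
	ps->ps_savedval = *ps->ps_patchpoint;	/* remember the original */
	*ps->ps_patchpoint = FBT_BREAKPOINT;	/* arm the site with INT3 */
}

static void
patch_site_disable(patch_site_t *ps)
{
	*ps->ps_patchpoint = ps->ps_savedval;	/* put the instruction back */
}
a simplified view of arming a probe site (illustrative)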

A breakpoint is what’s known as a software exception — a processor exception produced at the request of software, as opposed to an exception generated by a genuine program error, like a divide by zero. All exceptions are really just interrupts, the small difference being that an interrupt is typically generated external to the processor, whereas an exception is generated by the processor as a consequence of the code it is executing. Furthermore, when reading code, you’ll also see the breakpoint referred to as a trap. This is a further classification of the exception, and it dictates what actions the processor takes upon servicing it. All three terms are used interchangeably throughout the code, but the important thing to remember is that it’s an interrupt: it interrupts the normal flow of the program code (in this case, the kernel). Like all other interrupts, the operating system must provide the processor with some action to take upon receiving a breakpoint. This action is defined in the Interrupt Descriptor Table (IDT), seen below.

usr/src/uts/intel/ia32/os/desctbls.c
	set_gatesegd(&idt[T_BPTFLT],
	    (kpti_enable == 1) ? &tr_brktrap : &brktrap,
	    KCS_SEL, SDT_SYSIGT, TRP_UPL, idt_vector_to_ist(T_BPTFLT));
breakpoint IDT entry

Upon receiving a breakpoint the processor will call into the handler tr_brktrap. The tr_ prefix stands for trampoline, which is a mitigation method used as part of KPTI. This trampoline provides Meltdown protection and ultimately leads to brktrap, which I’ll come back to later. The SDT_SYSIGT argument tells us this is an “interrupt gate”, meaning the processor clears the interrupt flag (RFLAGS.IF) before invoking the handler, masking all maskable hardware interrupts and preventing another interrupt from interrupting the breakpoint handler. The TRP_UPL argument indicates the interrupt may originate from user or kernel space. In the case of the FBT provider, it will always originate from kernel space. And finally, the last argument is an index into the Interrupt Stack Table (IST): a collection of dedicated stacks used during interrupt handling to avoid scribbling on the interrupted thread’s stack. In this case the T_BPTFLT handler maps to a dedicated stack.

Before the Handler

So I’m running a D script that contains an action for one or more FBT probes, and a kernel thread executes one of these probe sites. What happens next? Based on the IDT entry it should call the brktrap handler, but that’s not quite what happens. First, the processor does some work on the operating system’s behalf before calling the interrupt handler. It must switch to the stack specified in the IDT entry and push the following values to it, in this order (see Intel Vol. 3, §6.14.4).

  1. The interrupted thread’s stack selector (SS).
  2. The interrupted thread’s stack pointer (RSP).
  3. The interrupted thread’s RFlags.
  4. The interrupted thread’s Code Segment selector (CS).
  5. The interrupted thread’s Instruction Pointer (RIP).
  6. The error code of the exception (INT3 produces no error code, so nothing is pushed here).

This ends with the processor in interrupt context, with the interrupted thread’s primary CPU context loaded on the new stack, ready to execute the operating system’s brktrap handler — as depicted in the figure below.
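
If it helps to picture that frame as a C structure, it looks roughly like this. The struct is purely illustrative (nothing by this name exists in the source), but its fields line up with the hardware-defined tail of the regs structure we'll meet shortly.

#include <stdint.h>

/* Illustrative only: the frame the processor builds on the IST stack. */
struct int3_frame {		/* lowest address (the new RSP) first */
	uint64_t	rip;	/* interrupted thread's instruction pointer */
	uint64_t	cs;	/* interrupted thread's code segment */
	uint64_t	rflags;	/* interrupted thread's flags */
	uint64_t	rsp;	/* interrupted thread's stack pointer */
	uint64_t	ss;	/* interrupted thread's stack segment */
	/* no error code: #BP doesn't produce one */
};
the hardware-pushed frame, as a hypothetical struct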

INT3 thread/interrupt state pre-handler

The invoptrap Handler

The brktrap handler is actually a front for the real workhorse: the invoptrap handler. Different providers use different methods of trapping into the DTrace framework. Some use INT3 (#BP), others deliberately misuse the LOCK prefix (0xF0) in order to force an invalid opcode fault (#UD). I’m not sure why both methods exist or when a provider should prefer one over the other; perhaps fodder for a future post. But in the case of FBT, the first stop is brktrap.

usr/src/uts/intel/ia32/ml/exception.s
	/*
	 * #BP
	 */
	ENTRY_NP(brktrap)
	XPV_TRAP_POP
	cmpw	$KCS_SEL, 8(%rsp)
	jne	bp_user

	/*
	 * This is a breakpoint in the kernel -- it is very likely that this
	 * is DTrace-induced.  To unify DTrace handling, we spoof this as an
	 * invalid opcode (#UD) fault.  Note that #BP is a trap, not a fault --
	 * we must decrement the trapping %rip to make it appear as a fault.
	 * We then push a non-zero error code to indicate that this is coming
	 * from #BP.
	 */
	decq	(%rsp)
	push	$1			/* error code -- non-zero for #BP */
	jmp	ud_kernel
breakpoint handler

The cmpw and jne pair determines where the #BP originated: user land or kernel land. If the former, we jump to bp_user; otherwise we fall through. For those who aren’t familiar with assembly, I’ll break the cmpw down a bit more. At the point of this instruction the interrupt’s RSP points to the 64-bit word containing the RIP of the interrupted thread. The 8(%rsp) operand is AT&T syntax for indirect memory addressing: it tells the CPU to read the value at %rsp + 8 (basically the parens are equivalent to dereferencing a pointer in C). Since the stack grows downward, and every value on this stack is 64 bits, adding 8 bytes references the value “above” RSP on the stack: the interrupted thread’s CS. Therefore, the cmpw instruction compares the 16-bit immediate $KCS_SEL against the interrupted thread’s CS value. In this case we know the #BP originated from a kernel thread, and the processor will fall through.
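
If assembly still isn't your thing, here's the same check expressed as C. This is an analogy only; nothing like it exists in the source.

#include <stdint.h>

/*
 * C analogy of "cmpw $KCS_SEL, 8(%rsp); jne bp_user".  rsp points at the
 * word holding the interrupted thread's RIP, so rsp[1] is its CS.
 */
static int
breakpoint_from_kernel(const uint64_t *rsp, uint16_t kcs_sel)
{
	return ((uint16_t)rsp[1] == kcs_sel);
}
the kernel-origin check, as a C analogy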

The decq, push, and jmp that follow are well described by the comment above them. The bulk of the handling is done under the #UD handler, but in FBT’s case we are coming in via #BP. Rather than replicate the handler logic, we opt to make the #BP look like a #UD, jumping to the ud_kernel label after doing so. This is achieved by rolling the RIP back by one byte (the length of INT3) so it points at the trapping instruction, as it would for a #UD fault, and pushing an error code onto the stack. The error code, while meaningless in this case, is needed to keep the stack layout consistent with other exceptions. All exceptions share common routines which expect the stack to be laid out in a particular way.

usr/src/uts/intel/ia32/ml/exception.s
	ENTRY_NP(invoptrap)

	XPV_TRAP_POP

	cmpw	$KCS_SEL, 8(%rsp)
	jne	ud_user

#if defined(__xpv)
	movb	$0, 12(%rsp)		/* clear saved upcall_mask from %cs */
#endif
	push	$0			/* error code -- zero for #UD */
ud_kernel:
	push	$0xdddd			/* a dummy trap number */
#UD handler

The ud_kernel label sits a few instructions into the invoptrap handler, the main DTrace entry point. Like the brktrap handler, invoptrap looks at the interrupt origin to determine what to do. Since we arrived via #BP we skip all that and head straight to ud_kernel, where we push a dummy trap number. Like the dummy error code pushed in brktrap, this is done to make sure the stack has a consistent layout across all exceptions so that macros like INTR_PUSH can function correctly. However, why we push a dummy value instead of the actual trap number, like all the other traps do, is a mystery to me.

usr/src/uts/intel/ia32/ml/exception.s
	INTR_PUSH
#UD handler

As I mentioned, all exceptions use the same stack layout, but what exactly is that layout? The layout is defined by the regs structure, shown below. This structure includes all values pushed by the processor’s exception mechanism as well as the trap number, all general-purpose registers, the segment registers, and two special values named r_savfp and r_savpc. These last two fields contain copies of the interrupted thread’s frame pointer and program counter, respectively. The processor’s exception mechanism populates the last five fields (r_rip through r_ss), the handler pushes r_err and r_trapno, and INTR_PUSH is responsible for populating the rest.

usr/src/uts/intel/amd64/sys/privregs.h
struct regs {
	/*
	 * Extra frame for mdb to follow through high level interrupts and
	 * system traps.  Set them to 0 to terminate stacktrace.
	 */
	greg_t	r_savfp;	/* a copy of %rbp */
	greg_t	r_savpc;	/* a copy of %rip */

	greg_t	r_rdi;		/* 1st arg to function */
	greg_t	r_rsi;		/* 2nd arg to function */
	greg_t	r_rdx;		/* 3rd arg to function, 2nd return register */
	greg_t	r_rcx;		/* 4th arg to function */

	greg_t	r_r8;		/* 5th arg to function */
	greg_t	r_r9;		/* 6th arg to function */
	greg_t	r_rax;		/* 1st return register, # SSE registers */
	greg_t	r_rbx;		/* callee-saved, optional base pointer */

	greg_t	r_rbp;		/* callee-saved, optional frame pointer */
	greg_t	r_r10;		/* temporary register, static chain pointer */
	greg_t	r_r11;		/* temporary register */
	greg_t	r_r12;		/* callee-saved */

	greg_t	r_r13;		/* callee-saved */
	greg_t	r_r14;		/* callee-saved */
	greg_t	r_r15;		/* callee-saved */

	/*
	 * fsbase and gsbase are sampled on every exception in DEBUG kernels
	 * only.  They remain in the non-DEBUG kernel to avoid any flag days.
	 */
	greg_t	__r_fsbase;	/* no longer used in non-DEBUG builds */
	greg_t	__r_gsbase;	/* no longer used in non-DEBUG builds */
	greg_t	r_ds;
	greg_t	r_es;
	greg_t	r_fs;		/* %fs is *never* used by the kernel */
	greg_t	r_gs;

	greg_t	r_trapno;

	/*
	 * (the rest of these are defined by the hardware)
	 */
	greg_t	r_err;
	greg_t	r_rip;
	greg_t	r_cs;
	greg_t	r_rfl;
	greg_t	r_rsp;
	greg_t	r_ss;
};
AMD64 regs structure shared across exception handlers

After the INTR_PUSH we have an entire regs structure populated on the stack with RSP pointing to r_savfp.

usr/src/uts/intel/ia32/ml/exception.s
	movq	REGOFF_RIP(%rsp), %rdi
	movq	REGOFF_RSP(%rsp), %rsi
	movq	REGOFF_RAX(%rsp), %rdx
	pushq	(%rsi)
	movq	%rsp, %rsi
	subq	$8, %rsp
	call	dtrace_invop
	ALTENTRY(dtrace_invop_callsite)
#UD handler

The three movq instructions ready the first and third arguments for the dtrace_invop() call: the address of the instruction being intercepted, and the RAX value of the interrupted function (for return probes). The second movq looks like it’s placing the interrupted thread’s stack pointer in the second argument position, but really it’s staging the pointer in RSI for the pushq and movq that follow. With pushq (%rsi) we dereference the RSP of the interrupted thread and place that value on the top of the handler’s stack. Which leads to the question: what was the last thing pushed on the stack of the interrupted thread? This depends on the probe. For entry probes we instrument the pushq %rbp instruction, thus the last thing on the stack is the return site of the function that called the interrupted function. If foo() called bar(), and bar’s entry point is instrumented by DTrace, then (%rsi) would be foo()+0xXXXX. It turns out this is true for return probes as well: illumos uses leave plus ret to exit a function, and FBT instruments the ret, so by the time the breakpoint fires leave has already torn down the frame, leaving the return address on top of the stack. Therefore, upon entry into the invop handler, the last thing on the interrupted thread’s stack is always the return instruction pointer. The movq %rsp, %rsi then replaces the interrupted thread’s stack pointer with our handler’s stack pointer, so that dtrace_invop() will have a pointer to the handler’s stack as its second argument.

Finally, before calling dtrace_invop(), we subtract 8 bytes from the stack to keep it at a 16-byte alignment (see Intel Vol. 3, §6.14.2). The generic dtrace_invop() function checks this instrumentation point against the SDT and FBT providers. In this case we are dealing with an FBT probe and will end up calling fbt_invop(). But before moving on, it’s also worth noting the ALTENTRY immediately after the call to dtrace_invop(), which creates the symbol dtrace_invop_callsite. This symbol is the return address upon entering dtrace_invop(), and its use becomes important when accessing function arguments, which I will cover in a follow-up post.
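
For reference, dtrace_invop() itself is little more than a dispatcher. The sketch below assumes the handler-list scheme, where each provider registers a callback (via dtrace_invop_add()) and the first handler to claim the address wins; treat it as an approximation rather than the verbatim source.

#include <stdint.h>

/* A sketch of dtrace_invop() dispatch -- an approximation, not the source. */
typedef struct dtrace_invop_hdlr {
	int	(*dtih_func)(uintptr_t, uintptr_t *, uintptr_t);
	struct dtrace_invop_hdlr *dtih_next;
} dtrace_invop_hdlr_t;

static dtrace_invop_hdlr_t *dtrace_invop_hdlr;	/* list head */

int
dtrace_invop(uintptr_t addr, uintptr_t *stack, uintptr_t rax)
{
	dtrace_invop_hdlr_t *hdlr;
	int rval;

	/* Ask each registered provider (FBT, SDT, ...) if this site is theirs. */
	for (hdlr = dtrace_invop_hdlr; hdlr != NULL; hdlr = hdlr->dtih_next) {
		if ((rval = hdlr->dtih_func(addr, stack, rax)) != 0)
			return (rval);	/* e.g. DTRACE_INVOP_PUSHL_EBP */
	}

	return (0);	/* nobody claimed it: treat as a genuine #UD */
}
a sketch of the dtrace_invop() dispatch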

usr/src/uts/intel/dtrace/fbt.c
static int
fbt_invop(uintptr_t addr, uintptr_t *stack, uintptr_t rval)
{
	uintptr_t stack0, stack1, stack2, stack3, stack4;
	fbt_probe_t *fbt = fbt_probetab[FBT_ADDR2NDX(addr)];

	for (; fbt != NULL; fbt = fbt->fbtp_hashnext) {
FBT invop handler

The FBT probes are stored in a chained hash table named fbt_probetab. The FBT_ADDR2NDX() macro hashes the instrumented address to determine which bucket the probe lives in, and the for loop then walks that bucket’s chain to find the entry matching this address.
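
The hash itself is nothing exotic. Something along these lines, where the exact shift and table size are my guesses rather than quotes from fbt.c:

#include <stdint.h>

/* Illustrative guess at the hashing; see FBT_ADDR2NDX() in fbt.c. */
#define	FBT_PROBETAB_SIZE	0x8000	/* buckets; a power of two */
#define	FBT_ADDR2NDX(addr)	\
	((((uintptr_t)(addr)) >> 4) & (FBT_PROBETAB_SIZE - 1))
an illustrative address-to-bucket hash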

usr/src/uts/intel/dtrace/fbt.c
		if ((uintptr_t)fbt->fbtp_patchpoint == addr) {
			if (fbt->fbtp_roffset == 0) {
				int i = 0;
				/*
				 * When accessing the arguments on the stack,
				 * we must protect against accessing beyond
				 * the stack.  We can safely set NOFAULT here
				 * -- we know that interrupts are already
				 * disabled.
				 */
				DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
				CPU->cpu_dtrace_caller = stack[i++];
#ifdef __amd64
				/*
				 * On amd64, stack[0] contains the dereferenced
				 * stack pointer, stack[1] contains savfp,
				 * stack[2] contains savpc.  We want to step
				 * over these entries.
				 */
				i += 2;
#endif
				stack0 = stack[i++];
				stack1 = stack[i++];
				stack2 = stack[i++];
				stack3 = stack[i++];
				stack4 = stack[i++];
				DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT |
				    CPU_DTRACE_BADADDR);

				dtrace_probe(fbt->fbtp_id, stack0, stack1,
				    stack2, stack3, stack4);

				CPU->cpu_dtrace_caller = 0;
FBT entry probe

In this case we found an entry probe matching the instrumented address. We know it’s an entry probe because fbtp_roffset is zero.

The use of CPU_DTRACE_NOFAULT is interesting, and I’ll be honest, I’m not sure I fully understand why it’s used here, but here’s my best guess. First off, the no-fault mechanism is used throughout DTrace to protect the operating system from bad loads in DTrace actions. This flag allows the page fault logic in locore.s to determine if the #PF originated from DTrace, and if so to skip over the offending instruction, set some DTrace CPU flags to indicate the faulting load, and move on with execution. This is different from a typical #PF, which would wind up in the global trap handler. As for its use here? I think it’s because we are directly accessing the handler’s stack via the stack pointer, as opposed to the typical way you access the stack in C: by using local variables. These reads should always be valid, but perhaps that wasn’t always the case in the history of this code. In any event, if any of the stack[i++] reads happen to go out of bounds, the #PF handler will catch it and simply move on (leaving whatever garbage value the destination variable was initialized to).
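
To see the mechanism in isolation, here's the general shape of a no-fault access, pulled out into a hypothetical helper (the real consumers, as above, use the DTRACE_CPUFLAG_* macros inline):

/* The general shape of a DTrace no-fault load; a hypothetical helper. */
static uintptr_t
nofault_load(const uintptr_t *p)
{
	uintptr_t val;

	DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
	val = *p;		/* a #PF here is swallowed, not fatal */
	if (DTRACE_CPUFLAG_ISSET(CPU_DTRACE_BADADDR))
		val = 0;	/* the load faulted; fall back to a default */
	DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT | CPU_DTRACE_BADADDR);

	return (val);
}
the no-fault access pattern, distilled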

Now it’s time to start reading data off the handler’s stack. Remember, stack is pointing into the handler’s stack, at the point of the last push in the invoptrap handler. The last thing we saved on that stack was the caller of the instrumented function, found at stack[0]. We store that address in the CPU-local variable cpu_dtrace_caller, which is used by the DTrace stack() action and to populate the built-in caller variable. The next two values on the handler’s stack are the regs structure fields r_savfp and r_savpc, the saved values of the frame pointer and program counter, respectively. We skip over them to get to the arguments of the instrumented function.

The System V AMD64 ABI dictates that the first six arguments to a function are passed in specific registers. These stack[i++] accesses read the saved copies of those registers from the regs structure, and the resulting variables are passed to dtrace_probe() to act as a cache for the first five arguments to the instrumented function. But why five and not six? While illumos no longer supports a 32-bit kernel, when we did have such support the #UD handler set up stack[1] to point to the beginning of the first 10 arguments to the instrumented function, so we know it’s not because of 32-bit. It also doesn’t seem to be a SPARC limitation, as a quick search of the various SPARC ABIs shows they all allow the first six arguments to be passed in registers. My best guess is that it comes from a desire to keep all arguments to dtrace_probe() in registers — for efficiency, and perhaps to avoid polluting the stack, as some probe actions, such as stack(), are sensitive to stack depth. In any event, the final action is to call dtrace_probe(), the main entry point into the DTrace framework, which ultimately executes the actions bound to this probe.
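
To put the whole layout in one place, here's how the words fbt_invop() reads map to the saved registers and the D variables. This is an illustration pieced together from the code above; these enum names don't exist in the source.

/* Illustration only: the words behind dtrace_invop()'s "stack" argument. */
enum fbt_entry_stack_word {
	STK_CALLER = 0,	/* pushed by "pushq (%rsi)": the caller's return
			   address; becomes cpu_dtrace_caller / caller */
	STK_SAVFP,	/* r_savfp -- skipped */
	STK_SAVPC,	/* r_savpc -- skipped */
	STK_RDI,	/* 1st argument -> arg0 */
	STK_RSI,	/* 2nd argument -> arg1 */
	STK_RDX,	/* 3rd argument -> arg2 */
	STK_RCX,	/* 4th argument -> arg3 */
	STK_R8		/* 5th argument -> arg4 */
	/* %r9 and later arguments aren't cached here; fetching them is
	   part of the argument story deferred to the follow-up post */
};
the handler-stack words an entry probe consumes (illustrative)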

usr/src/uts/intel/dtrace/fbt.c
			} else {
#ifdef __amd64
				/*
				 * On amd64, we instrument the ret, not the
				 * leave.  We therefore need to set the caller
				 * to assure that the top frame of a stack()
				 * action is correct.
				 */
				DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
				CPU->cpu_dtrace_caller = stack[0];
				DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT |
				    CPU_DTRACE_BADADDR);
#endif

				dtrace_probe(fbt->fbtp_id, fbt->fbtp_roffset,
				    rval, 0, 0, 0);
				CPU->cpu_dtrace_caller = 0;
			}

			return (fbt->fbtp_rval);
		}
	}

	return (0);
}
FBT return probe

In this case we have a matching return probe. I’m not so sure I follow this comment. The caller’s return address is still on the interrupted thread’s stack regardless of whether we instrument the leave or the ret instruction, but the point stands: this value must be saved in case the user’s clause invokes the stack() action. Like the entry probe, the goal is to send a probe event to the DTrace framework, but this time instead of passing arguments we pass the return offset and return value. These values are accessible in an FBT return probe clause as the variables arg0 and arg1, respectively.

In the case of a matching probe, we need to return the probe’s fbtp_rval value. This value was determined at probe creation time and tells the invoptrap handler how to return back to the interrupted thread. The value depends on the type of probe, entry versus return, and the architecture, 32-bit versus 64-bit. Given that illumos no longer ships a 32-bit kernel, we can ignore the architecture and assume AMD64. That means we return DTRACE_INVOP_PUSHL_EBP for an entry probe and DTRACE_INVOP_RET for a return probe. If we didn’t find a matching probe, we return zero, which tells the invop handler to treat this fault as a normal #UD and continue processing.
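
In other words, the decision was baked in when the probe was created. Conceptually it's no more than this (a simplified sketch, assuming the DTRACE_INVOP_* constants from the kernel DTrace headers, not the verbatim fbt.c logic):

/* Simplified sketch of the probe-creation-time choice on AMD64. */
/* Assumes the kernel DTrace headers that define DTRACE_INVOP_*. */
static int
fbt_choose_rval(int is_entry_site)
{
	/* Tells invoptrap which instruction to emulate on the way out. */
	return (is_entry_site ? DTRACE_INVOP_PUSHL_EBP : DTRACE_INVOP_RET);
}
how fbtp_rval maps to the emulation, in a sketch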

usr/src/uts/intel/ia32/ml/exception.s
	pushq	(%rsi)
	movq	%rsp, %rsi
	subq	$8, %rsp
	call	dtrace_invop
	ALTENTRY(dtrace_invop_callsite)
	addq	$16, %rsp
	cmpl	$DTRACE_INVOP_PUSHL_EBP, %eax
	je	ud_push
	cmpl	$DTRACE_INVOP_LEAVE, %eax
	je	ud_leave
	cmpl	$DTRACE_INVOP_NOP, %eax
	je	ud_nop
	cmpl	$DTRACE_INVOP_RET, %eax
	je	ud_ret
	jmp	ud_trap
#UD handler

Picking up after the call in the invoptrap handler, we add 16 bytes to the stack pointer, which undoes the earlier pushq (%rsi) and subq $8, %rsp and leaves RSP at the start of the regs structure. From there we jump to one of the labels depending on the value returned by fbt_invop().

usr/src/uts/intel/ia32/ml/exception.s
ud_push:
	/*
	 * We must emulate a "pushq %rbp".  To do this, we pull the stack
	 * down 8 bytes, and then store the base pointer.
	 */
	INTR_POP
	subq	$16, %rsp		/* make room for %rbp */
	pushq	%rax			/* push temp */
	movq	24(%rsp), %rax		/* load calling RIP */
	addq	$1, %rax		/* increment over trapping instr */
	movq	%rax, 8(%rsp)		/* store calling RIP */
	movq	32(%rsp), %rax		/* load calling CS */
	movq	%rax, 16(%rsp)		/* store calling CS */
	movq	40(%rsp), %rax		/* load calling RFLAGS */
	movq	%rax, 24(%rsp)		/* store calling RFLAGS */
	movq	48(%rsp), %rax		/* load calling RSP */
	subq	$8, %rax		/* make room for %rbp */
	movq	%rax, 32(%rsp)		/* store calling RSP */
	movq	56(%rsp), %rax		/* load calling SS */
	movq	%rax, 40(%rsp)		/* store calling SS */
	movq	32(%rsp), %rax		/* reload calling RSP */
	movq	%rbp, (%rax)		/* store %rbp there */
	popq	%rax			/* pop off temp */
	jmp	tr_iret_kernel		/* return from interrupt */
#UD handler

In this case we hit an entry probe and need to emulate the pushq %rbp instruction (which was overwritten with a breakpoint). The first line, INTR_POP, restores the GPRs and adjusts RSP to point to the values pushed by the processor when switching to the interrupt stack, namely the CS:RIP, RFLAGS, and SS:RSP values of the interrupted thread. The rest of the code, starting with the subq $16, %rsp, is a bit odd and even conflicts somewhat with the comment above it (it’s pulling the stack down 16 bytes, not 8). This code doesn’t need to be this way, but I think it is this way as a holdover from a time when the AMD64 interrupt handler did not use its own stack, or perhaps because the author of this routine started from a copy of the 32-bit routine. But the idea is that the iret is going to pop values off the stack to restore the CS:RIP, RFLAGS, and SS:RSP registers to their original values. In order to emulate the frame pointer push we need to pull the stack down by enough bytes to store said pointer. To do that we first need to shift the values that were originally stashed on the stack by the processor when the interrupt was taken. On i386 this made sense, as the handler reused the kernel thread’s stack, so it needed to be “pulled down”. But on AMD64 the handler has its own stack, so this is all unnecessary. That said, the AMD64 handler still pulls the stack down, to no real effect, but it also modifies the interrupted thread’s RSP (the interrupted thread is referred to as the “caller” in this code) to stash the RBP, and increments the interrupted thread’s RIP past the trapping instruction. Finally, we jump to tr_iret_kernel, which is really just an iret instruction (the authors of KPTI decided it would be wise to alias it so that future maintainers know the iret path was analyzed when working on the Meltdown mitigations — a blessing for those who will have to maintain this code for decades to come). After the iret has executed we are back in the kernel thread’s original execution context and it’s as if FBT was never involved.

Here’s how the push emulation could be written in a simpler manner.

simple_ud_push:
	/*
	 * Emulate "pushq %rbp" by stashing RBP on the caller's stack
	 * and incrementing the caller's RIP by one.
	 */
	INTR_POP
	pushq	%rax			/* stash RAX to use as scratch */
	movq	8(%rsp), %rax		/* load calling RIP */
	addq	$1, %rax		/* increment over trapping instr */
	movq	%rax, 8(%rsp)		/* store calling RIP */
	movq	32(%rsp), %rax		/* load calling RSP */
	subq	$8, %rax		/* make room for %rbp */
	movq	%rbp, (%rax)		/* store %rbp there */
	movq	%rax, 32(%rsp)		/* store calling RSP */
	popq	%rax			/* restore RAX */
	jmp	tr_iret_kernel		/* return from interrupt */
A simpler AMD64 push emulation

usr/src/uts/intel/ia32/ml/exception.s
ud_ret:
	INTR_POP
	pushq	%rax			/* push temp */
	movq	32(%rsp), %rax		/* load %rsp */
	movq	(%rax), %rax		/* load calling RIP */
	movq	%rax, 8(%rsp)		/* store calling RIP */
	addq	$8, 32(%rsp)		/* adjust new %rsp */
	popq	%rax			/* pop off temp */
	jmp	tr_iret_kernel		/* return from interrupt */
#UD handler

Here we emulate the ret instruction. First we push RAX to the stack in order to use it as a scratch register. The next two movq instructions use the interrupted thread’s RSP to fetch the return address. The movq %rax, 8(%rsp) then replaces the stashed RIP on the handler’s stack with the one we just pulled from the interrupted thread’s stack (remember, iret is going to populate RIP from the handler’s stack). The addq $8, 32(%rsp) pops the stored return address off the interrupted thread’s stack, we restore RAX, and finally we iret back into the interrupted thread.

And with that we now know how a kernel function entry/return point is instrumented, intercepted, and emulated. While that’s interesting, it’s only part of the FBT story. One of the more powerful aspects of DTrace is the ability to grab the stack trace and arguments at the time of probe firing. These features are implemented in their own functions, but they rely on the invop handler’s help. Now that we understand the handler, we can see how these other DTrace features work, to be described in a follow-up post.