Tracing Kernel Functions: How the illumos AMD64 FBT Provider Intercepts Function Calls
June 15, 2020
For my money there is perhaps nothing more valuable than in situ kernel debugging. The ability to ask questions of any part of the kernel, on a live production system, without stopping or pausing anything, feels more like a right than a privilege after you’ve experienced it first hand. One such avenue to in situ kernel debugging is the DTrace FBT provider.
The FBT provider, which stands for Function Boundary Tracing, allows one to trace all kernel function entry and return sites, their arguments and return values, and even grab the stack that lead to the call. There is no overhead when probes are disabled, and minimal overhead on the specific entry or return site when enabled. DTrace is made up of several parts: the command, the user library, the kernel framework and VM, and the provider interface. Its authors made the smart decision of separating the general infrastructure of DTrace from the specifics of probe creation and firing: providers provide and fire the probe points, and the framework executes their actions. This allows providers to be developed separately from DTrace itself. If you want to learn more about DTrace I recommend reading Dynamic Instrumentation of Production Systems presented in USENIX 2004.
	For this post I want to focus on how FBT intercepts function
	calls. That is, when a user enables an FBT probe, how exactly
	does the probe end up firing when a kernel thread hits that
	entry or return site? For example, the following one-liner
	instruments all calls to the mac_ring_tx()
	function, responsible for delivering a network packet to the
	underlying NIC. It counts the number of packets traveling
	across each interface along with the distribution of the
	packet size sent across each interface. As you can see, a
	simple function entry probe can be quite powerful, and I’m
	only touching the surface of what’s possible. But how exactly
	did the FBT provider accomplish this?
      
rpz@thunderhead:~$ pfexec dtrace -qn 'mac_ring_tx:entry { this->mip = (mac_impl_t *)arg0; @[this->mip->mi_name] = count(); @dist[this->mip->mi_name] = quantize(msgsize(args[2])); } END { printf("TOTAL PACKETS\n"); printa(@); printf("\nDISTRIBUTION\n"); printa(@dist); }'
^C
TOTAL PACKETS
  igb1                                                             43
  ixgbe6                                                        31046
  ixgbe3                                                        60370
  ixgbe5                                                        68938
  aggr1014                                                      99793
  ixgbe2                                                       137064
  aggr1013                                                     197244
DISTRIBUTION
  igb1
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |@@@@@@@@@@@@@@@                          16
             128 |@@@@@@@@@@@@@@@@@@@@@@                   24
             256 |@@@                                      3
             512 |                                         0
  ixgbe6
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         10
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 31036
             128 |                                         0
  ixgbe5
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         134
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 68804
             128 |                                         0
  aggr1014
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         144
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 99649
             128 |                                         0
  ixgbe3
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         127
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         170
            2048 |                                         139
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  58801
            8192 |@                                        1052
           16384 |                                         12
           32768 |                                         69
           65536 |                                         0
  ixgbe2
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         128
              64 |                                         160
             128 |                                         0
             256 |                                         0
             512 |                                         1
            1024 |                                         79
            2048 |                                         107
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 136106
            8192 |                                         478
           16384 |                                         0
           32768 |                                         5
           65536 |                                         0
  aggr1013
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         128
              64 |                                         97
             128 |                                         0
             256 |                                         0
             512 |                                         1
            1024 |                                         249
            2048 |                                         246
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 194907
            8192 |                                         1530
           16384 |                                         12
           32768 |                                         74
           65536 |                                         0
	Insrtumenting Function Calls and Returns
	Before I can describe how a site is intercepted I
	need to give a primer on how the kernel text
	is instrumented. Upon first use DTrace loads the FBT
	provider, iterates all loaded kernel modules, and passes each
	one to fbt_provide_module(). This function uses
	the ELF section header data to iterate all functions, read
	their program text, and find the location of all entry and
	return sites. The FBT provider creates an FBT probe
	(fbt_probe_t) for each site, responsible for
	holding the information needed to instrument that site. This
	design produces zero overhead when no probes are in use as no
	program text is modified during this stage.
      
	It’s not until the user executes a D script containing one or
	more FBT probes that the provider instruments the
	kernel module’s text. The DTrace framework
	calls fbt_enable(), passing in the FBT probe
	structure created earlier which contains the information
	needed to alter the correct program text. If this is an entry
	probe, then FBT replaces the push %rbp
	instruction. Otherwise, it is a return probe, and FBT replaces
	the ret instruction. In both cases, FBT replaces
	this instruction with INT3, the breakpoint
	instruction (or #BP in Intel mnemonics). This design produces
	the minimal overhead needed to intercept the probe site. It
	modifies only the site requested, leaving all other text
	untouched.
      
A breakpoint is what’s known as a software exception — a processor exception produced at the request of software, as opposed to an exception generated by a legit program exception, like divide by zero. All exceptions are really just interrupts, with the small difference being that an interrupt is something typically generated external to the processor, whereas an exception is generated by the processor as a consequence of the code it is executing. Furthermore, when reading code, you’ll also see breakpoint referred to as a trap. This is yet a further classification of the exception, and it dictates what actions the processor will take upon servicing the exception. All three terms are used interchangeably throughout the code, but the important thing to remember is it’s an interrupt. It’s interrupting the normal flow of the program code (in this case, the kernel). Like all other interrupts, the operating system must provide the processor with some action to take upon receiving a breakpoint. This action is defined in the Interrupt Descriptor Table (IDT), seen below.
set_gatesegd(&idt[T_BPTFLT],(kpti_enable == 1) ? &tr_brktrap : &brktrap,KCS_SEL, SDT_SYSIGT, TRP_UPL, idt_vector_to_ist(T_BPTFLT));
	Upon receiving a breakpoint the processor will call into the
        handler tr_brktrap. The tr_ prefix
        stands for trampoline, which is a mitigation method used as
        part of KPTI. This trampoline provides Meltdown protection and
        ultimately leads to brktrap, which I’ll come back
        to later. The SDT_SYSIGT argument tells us this
        is an “interrupt gate”, meaning the processor will disable the
        assertion of all maskable hardware interrupts before invoking
        the handler — preventing another interrupt from interrupting
        the breakpoint. The TRP_UPL argument indicates
        the interrupt may originate from user or kernel space. In the
        case of the FBT provider, it will always originate from kernel
        space. And finally, the last argument is an index into the
        Interrupt Stack Table (IST): a collection of dedicated stacks
        to be used during interrupt handling to avoid scribbling on
        the interrupted thread’s stack. In this case
        the T_BPTFLT handler maps to a dedicated stack.
      
Before the Handler
	So I’m running a D script that contains an action for one or
	more FBT probes, and a kernel thread executes one of these
	probe sites, what’s next? Based on the IDT entry it should
	call the brktrap handler, but that’s not quite
	what happens. First, the processor does some work on the
	operating system’s behalf before calling the interrupt
	handler. It must switch to the stack specified in the IDT
	entry and push the following values to it, in this order (see
	Intel Vol. 3, §6.14.4).
      
- The interrupted thread’s stack selector (SS).
 - The interrupted thread’s stack pointer (RSP).
 - The interrupted thread’s RFlags.
 - The interrupted thread’s Code Segment selector (CS).
 - The interrupted thread’s Instruction Pointer (RIP).
 - 
	    The error code of the exception, as 
INT3has no error code this doesn’t apply. 
	This ends with the processor in interrupt context, with the
	interrupted thread’s primary CPU context loaded on the new
	stack, ready to execute the operating
	system’s brktrap handler — as depicted in the
	figure below.
      
The invoptrap Handler
	The brktrap handler is actually a front for the
	real workhorse: the invoptrap handler. Different
	providers use different methods of trapping into the DTrace
	framework. Some use INT3 (#BP), others use
	the LOCK prefix (0xFO) incorrectly
	in order to force an invalid operation fault (#UD). I’m not
	sure why both methods exist or when a provider should prefer
	one method over the other, perhaps fodder for a future post.
	But in the case of FBT, the first stop
	is brktrap.
      
/** #BP*/ENTRY_NP(brktrap)XPV_TRAP_POPcmpw $KCS_SEL, 8(%rsp)jne bp_user/** This is a breakpoint in the kernel -- it is very likely that this* is DTrace-induced. To unify DTrace handling, we spoof this as an* invalid opcode (#UD) fault. Note that #BP is a trap, not a fault --* we must decrement the trapping %rip to make it appear as a fault.* We then push a non-zero error code to indicate that this is coming* from #BP.*/decq (%rsp)push $1 /* error code -- non-zero for #BP */jmp ud_kernel
	Lines 296-297 determine where the BP originated from:
	user-land or kernel-land. If the former, jump
	to bp_user, otherwise fall through. For those
	that aren’t familiar with assembly, I’ll break down line 296 a
	bit more. At the point of this instruction the
	interrupt’s RSP points to the 64-bit word
	containing the RIP of the interrupted
	thread. The 8(%rsp) instruction is AT&T
	syntax for indirect memory addressing: it tells the CPU to
	read the value at %rsp + 8 (basically the parens
	are equivalent to dereferencing a pointer in C). Since the
	stack grows downward, and every value on the stack is 64-bit,
	adding 8 bytes references the value “above” RSP
	on the stack: the interrupted thread’s CS.
	Therefore, the cmpw instruction compares the
	lower 16 bits of the immediate value $KCS_SEL to
	the interrupted thread’s CS value. In this case
	we know the #BP originated from a kernel thread and the
	processor will fall through.
      
	Lines 307-309 are well described by the comment above them.
	The bulk of the handling is done under the #UD handler, but in
	FBT’s case we are coming in via #BP. Rather than replicate the
	handler logic we opt to make the #BP look like a #UD, jumping
	to the ud_kernel handler after doing so. This is
	achieved by rolling back the RIP by one
	instruction, as it would be in the case of a #UD fault, and
	pushing an error code on the stack. The error code, while
	meaningless in this case, is needed to keep the stack layout
	consistent with other exceptions. All exceptions share common
	routines which expect the stack to be laid out in a particular
	way.
      
ENTRY_NP(invoptrap)XPV_TRAP_POPcmpw $KCS_SEL, 8(%rsp)jne ud_user#if defined(__xpv)movb $0, 12(%rsp) /* clear saved upcall_mask from %cs */#endifpush $0 /* error code -- zero for #UD */ud_kernel:push $0xdddd /* a dummy trap number */
	On line 345 we find the ud_kernel label. It sits
	a few instructions into the invoptrap handler, the
	main DTrace entry point. Like the brktrap
	handler, invoptrap looks at the the interrupt
	origin to determine what to do. Since we originated from #BP
	we skip all that and head straight to line 346 where we push a
	dummy trap number. Like the dummy error code pushed
	in brktrap, this is done to make sure the stack
	has a consistent layout across all exceptions so that macros
	like INTR_PUSH can function correctly. However,
	why we push a dummy value instead of the actual trap
	number, like all the other traps, is a mystery to me.
      
	INTR_PUSH
	  
	As I mentioned, all exceptions use the same stack layout, but
	what exactly is that layout? The layout is defined by
	the regs structure, shown below. This structure
	includes all values pushed by the processor’s exception
	mechanism as well as the trap number, all general-purpose
	registers, the segment registers, and two special values
	named r_savfp and r_savpc. These
	last two fields contain copies of the interrupted thread’s
	frame pointer and program counter, respectively. The
	processor’s exception mechanism populates lines 101-105 of the
	structure on the stack, lines 95 & 100 are populated by the
	handler, and INTR_PUSH is responsible for
	populating the rest.
      
struct regs {/** Extra frame for mdb to follow through high level interrupts and* system traps. Set them to 0 to terminate stacktrace.*/greg_t r_savfp; /* a copy of %rbp */greg_t r_savpc; /* a copy of %rip */greg_t r_rdi; /* 1st arg to function */greg_t r_rsi; /* 2nd arg to function */greg_t r_rdx; /* 3rd arg to function, 2nd return register */greg_t r_rcx; /* 4th arg to function */greg_t r_r8; /* 5th arg to function */greg_t r_r9; /* 6th arg to function */greg_t r_rax; /* 1st return register, # SSE registers */greg_t r_rbx; /* callee-saved, optional base pointer */greg_t r_rbp; /* callee-saved, optional frame pointer */greg_t r_r10; /* temporary register, static chain pointer */greg_t r_r11; /* temporary register */greg_t r_r12; /* callee-saved */greg_t r_r13; /* callee-saved */greg_t r_r14; /* callee-saved */greg_t r_r15; /* callee-saved *//** fsbase and gsbase are sampled on every exception in DEBUG kernels* only. They remain in the non-DEBUG kernel to avoid any flag days.*/greg_t __r_fsbase; /* no longer used in non-DEBUG builds */greg_t __r_gsbase; /* no longer used in non-DEBUG builds */greg_t r_ds;greg_t r_es;greg_t r_fs; /* %fs is *never* used by the kernel */greg_t r_gs;greg_t r_trapno;/** (the rest of these are defined by the hardware)*/greg_t r_err;greg_t r_rip;greg_t r_cs;greg_t r_rfl;greg_t r_rsp;greg_t r_ss;};
	After the INTR_PUSH we have an
	entire regs structure populated on the stack
	with RSP pointing to r_savfp.
      
movq REGOFF_RIP(%rsp), %rdimovq REGOFF_RSP(%rsp), %rsimovq REGOFF_RAX(%rsp), %rdxpushq (%rsi)movq %rsp, %rsisubq $8, %rspcall dtrace_invopALTENTRY(dtrace_invop_callsite)
	Lines 348-350 ready the first and third arguments for
	the dtrace_invop() call: the address of the
	instruction being intercepted, and the RAX value
	of the function being interrupted (for return probes). Line
	349 looks like it’s placing the interrupted thread’s stack
	pointer in the second argument position, but really it’s
	staging the pointer in RSI for lines 351-352. On
	line 351 we dereference the
	RSP of the interrupted thread and place that
	value on the top of the handler’s stack. Which leads to the
	question: what was the last thing pushed on the stack of the
	interrupted thread? This depends on the probe. For entry
	probes we instrument the pushq %rbp instruction,
	thus the last thing on the stack is the return site of the
	function that called the interrupted function.
	If foo() called bar(), and bar’s
	entry point is instrumented by DTrace,
	then (%rsi) would be foo()+0xXXXX.
	And it turns out this is true for return probes as well as
	illumos uses leave plus ret to exit
	the function, and FBT instruments the ret
	instruction. Therefore, upon entry into the invop handler, the
	last thing on the interrupted thread’s stack is always the
	return instruction pointer. Moving onto line 352, we replace
	the interrupted thread’s stack pointer with our handler’s
	stack pointer, so that dtrace_invop() will have a
	pointer to the handler’s stack as its second argument.
      
	Finally, before calling dtrace_invop(), we
	subtract 8 bytes from the stack to keep it at a 16-byte
	alignment (see Intel Vol. 3, §6.14.2). The
	generic dtrace_invop() function checks this
	instrumentation point against the SDT and FBT providers. In
	this case we are dealing with an FBT probe and will end up
	calling fbt_invop(). But before moving on, it’s
	also worth noting line 355, after the call
	to dtrace_invop(), where we create the
	symbol dtrace_invop_callsite. This becomes the
	return address upon entering dtrace_invop(), and
	its use becomes important when accessing function arguments,
	which I will cover in a follow up post.
      
static intfbt_invop(uintptr_t addr, uintptr_t *stack, uintptr_t rval){uintptr_t stack0, stack1, stack2, stack3, stack4;fbt_probe_t *fbt = fbt_probetab[FBT_ADDR2NDX(addr)];for (; fbt != NULL; fbt = fbt->fbtp_hashnext) {
	The FBT probes are stored in a chained hash table
	named fbt_probetab. Line 85 hashes the
	instrumented address to determine the bucket the probe is in,
	and then line 87 loops through all entries in that bucket to
	find the one matching this address.
      
if ((uintptr_t)fbt->fbtp_patchpoint == addr) {if (fbt->fbtp_roffset == 0) {int i = 0;/** When accessing the arguments on the stack,* we must protect against accessing beyond* the stack. We can safely set NOFAULT here* -- we know that interrupts are already* disabled.*/DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);CPU->cpu_dtrace_caller = stack[i++];#ifdef __amd64/** On amd64, stack[0] contains the dereferenced* stack pointer, stack[1] contains savfp,* stack[2] contains savpc. We want to step* over these entries.*/i += 2;#endifstack0 = stack[i++];stack1 = stack[i++];stack2 = stack[i++];stack3 = stack[i++];stack4 = stack[i++];DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT |CPU_DTRACE_BADADDR);dtrace_probe(fbt->fbtp_id, stack0, stack1,stack2, stack3, stack4);CPU->cpu_dtrace_caller = 0;
	In this case we found an entry probe matching the instrumented
	address. We know it’s an entry probe because
	fbtp_roffset is zero.
      
	The use of CPU_DTRACE_NOFAULT is interesting, and
	I’ll be honest, I’m not sure I fully understand why it’s used
	here, but here’s my best guess. First off, the no-fault
	mechanism is used throughout DTrace to protect the operating
	system from bad loads in DTrace actions. This flag allows the
	page fault logic
	in locore.s
	to determine if the #PF originated from DTrace, and if so to
	skip over the offending instruction, set some DTrace CPU flags
	to indicate the faulting load, and move on with execution.
	This is different from a typical #PF, which would wind up in
	the
	global trap
	handler. As for it’s use here? I think it’s because we are
	directly accessing the handler’s stack via
	the stack pointer, as opposed to the typical way
	you access the stack in C: by using local variables. These
	reads should always be valid, but perhaps that wasn’t always
	the case in the history of this code. In any event, if any of
	the references on lines 109-113 happen to go out of bounds,
	the #PF handler will catch it and simply move on (leaving
	whatever garbage value it was initialized to).
      
	Now it’s time to start reading data off the handler’s stack.
	Remember, stack is pointing into
	the handler’s stack, at the point of the last push in
	the invoptrap handler. The last thing we saved on
	the stack was the caller of the instrumented function, found
	at location stack[0]. On line 99 we store that
	symbol address in the CPU-local
	variable cpu_dtrace_caller, which is used in the
	DTrace stack() action and to populate the DTrace
	built-in caller variable. The next two values on
	the handler’s stack are the regs structure
	values r_savfp and r_savpc, the
	saved values of the frame pointer and program counter,
	respectively. We skip over them to get to the arguments of the
	instrumented function.
      
	The System V AMD64 ABI dictates that the first six arguments
	are stored in specific registers. These stack[N]
	accesses reference those registers and those variables are
	passed to dtrace_probe() to act as a cache for
	the first five arguments to the instrumented function. But why
	five and not six? While illumos no longer supports a 32-bit
	kernel, when we did have such support the #UD handler
	setup stack[1] to point to the beginning of the
	first 10 arguments to the instrumented function, so we know
	it’s not because of 32-bit. It also doesn’t seem to be because
	of a SPARC limitation, as a quick search of the various SPARC
	ABIs all claim the first six arguments may be placed in
	registers. My best guess is that it’s from a desire to keep
	all arguments to dtrace_probe() in registers —
	for efficiency and perhaps to avoid polluting the stack as
	some probe actions are sensitive to stack depth such as
	the stack() action. In any event, the final
	action is to call dtrace_probe(), the main entry
	point into the DTrace framework which ultimately executes the
	actions linked with this DTrace probe.
      
} else {#ifdef __amd64/** On amd64, we instrument the ret, not the* leave. We therefore need to set the caller* to assure that the top frame of a stack()* action is correct.*/DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);CPU->cpu_dtrace_caller = stack[0];DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT |CPU_DTRACE_BADADDR);#endifdtrace_probe(fbt->fbtp_id, fbt->fbtp_roffset,rval, 0, 0, 0);CPU->cpu_dtrace_caller = 0;}return (fbt->fbtp_rval);}}return (0);}
	In this case we have a matching return probe. I’m not so sure
	I follow this comment. The caller’s return address is still on
	the interrupted thread’s stack regardless of whether we
	instrument the leave or ret
	instruction, but the point is that this value must be saved in
	case the user calls the DTrace action stack() in
	their action. Like the entry probe, the goal is to send a
	probe event to the DTrace framework, but this time instead of
	passing arguments we pass the return offset and return value.
	These values are accessible in the FBT return probe action as
	the variables arg0 and arg1,
	respectively.
      
	In the case of a matching probe, we need to return
	the fbtp_rval value (line 140). This value was
	determined at probe creation time and tells the
	invoptrap handler how to return back to the
	interrupted thread. This value depends on the type of probe,
	entry vs. return, and the architecture, 32-bit vs. 64. Given
	that illumos no longer ships a 32-bit kernel, we can ignore
	the architecture and assume AMD64. That means we
	return DTRACE_INVOP_PUSHL_EBP for an entry probe
	and DTRACE_INVOP_RET for a return probe. If we
	didn’t find a matching probe, then return zero (line 144).
	This tells the invop handler to treat this fault as a normal
	#UD and continue processing.
      
pushq (%rsi)movq %rsp, %rsisubq $8, %rspcall dtrace_invopALTENTRY(dtrace_invop_callsite)addq $16, %rspcmpl $DTRACE_INVOP_PUSHL_EBP, %eaxje ud_pushcmpl $DTRACE_INVOP_LEAVE, %eaxje ud_leavecmpl $DTRACE_INVOP_NOP, %eaxje ud_nopcmpl $DTRACE_INVOP_RET, %eaxje ud_retjmp ud_trap
	Picking up on line 356 of the invoptrap handler,
	we add 16 bytes to the stack pointer which undoes the pushes
	on lines 351 and 353, leaving RSP at the start of
	the regs structure. From there we jump to one of
	the labels depending on the value returned
	by fbt_invop().
      
ud_push:/** We must emulate a "pushq %rbp". To do this, we pull the stack* down 8 bytes, and then store the base pointer.*/INTR_POPsubq $16, %rsp /* make room for %rbp */pushq %rax /* push temp */movq 24(%rsp), %rax /* load calling RIP */addq $1, %rax /* increment over trapping instr */movq %rax, 8(%rsp) /* store calling RIP */movq 32(%rsp), %rax /* load calling CS */movq %rax, 16(%rsp) /* store calling CS */movq 40(%rsp), %rax /* load calling RFLAGS */movq %rax, 24(%rsp) /* store calling RFLAGS */movq 48(%rsp), %rax /* load calling RSP */subq $8, %rax /* make room for %rbp */movq %rax, 32(%rsp) /* store calling RSP */movq 56(%rsp), %rax /* load calling SS */movq %rax, 40(%rsp) /* store calling SS */movq 32(%rsp), %rax /* reload calling RSP */movq %rbp, (%rax) /* store %rbp there */popq %rax /* pop off temp */jmp tr_iret_kernel /* return from interrupt */
	In this case we are leaving an entry probe and need to emulate
	the pushq %rbp instruction (which was overwritten
	with a breakpoint). The first line, INTR_POP
	restores the GPRs and adjusts RSP to point to the
	values pushed by the processor when switching to the interrupt
	stack, namely the CS:RIP, RFLAGS,
	and SS:RSP values of the interrupted thread. The
	rest of the code, starting with line 373, is a bit odd and
	even conflicts with the comment above it somewhat (it’s
	pulling the stack down 16 bytes, not 8). This code doesn’t
	need to be this way, but I think it is this way as a
	consequence of a time in the past when the AMD64 interrupt
	handler was not using it’s own stack or perhaps the author of
	this routine started with a copy of the 32-bit routine. But
	the idea is that the
	iret call is going to pop values off the stack to
	restore the CS:RIP, RFLAGS,
	and SS:RSP registers to their original values. In
	order to emulate the frame pointer push we need to pull the
	stack down by enough bytes to store said pointer. To do that
	we first need to shift the values that were originally stashed
	by the processor on the stack when the interrupt was taken. On
	i386 this made sense as the handler reused the kernel thread’s
	stack, so we need to “pull it down”. But on AMD64 the handler
	has its own stack and so this is all unnecessary. That said,
	the AMD64 handler still pulls the stack down, to no real
	effect, but it also modifies the interrupted thread’s
	(referred to as the “caller” in this code) RSP to
	stash the RBP and increments the interrupted
	thread’s RIP to the next instruction. Finally, we
	jump to tr_iret_kernel, which is really just
	an iret instruction (the author’s of KPTI decided
	it would be wise to alias it so that future maintainers know
	that the iret path was analyzed when working on
	the Meltdown mitigations — a blessing for those who will have
	to maintain this code for decades to come). After
	the iret has executed we are back in the kernel
	thread’s original execution context and it’s as if FBT was
	never involved.
      
	Here’s how the push emulation could be written in
	a simpler manner.
      
simple_ud_push:/** Emulate "pushq %rbp" by stashing RBP on the caller's stack* and incrementing the caller's RIP by one.*/INTR_POPpushq %rax /* stash RAX to use as scratch */movq (%rsp), %rax /* load calling RIP */addq $1, %rax /* increment over trapping instr */movq %rax, (%rsp) /* store calling RIP */movq 24(%rsp), %rax /* load calling RSP */subq $8, %rax /* make room for %rbp */movq %rbp, (%rax) /* store %rbp there */movq %rax, 24(%rsp) /* store calling RSP */popq %rax /* restore RAX */jmp tr_iret_kernel /* return from interrupt */
ud_ret:INTR_POPpushq %rax /* push temp */movq 32(%rsp), %rax /* load %rsp */movq (%rax), %rax /* load calling RIP */movq %rax, 8(%rsp) /* store calling RIP */addq $8, 32(%rsp) /* adjust new %rsp */popq %rax /* pop off temp */jmp tr_iret_kernel /* return from interrupt */
	Here we emulate the ret instruction. On line 426
	push RAX to the stack in order to use it as a
	scratch register. On lines 427-428 we use the interrupted
	thread’s RSP to get the return pointer. On line
	429 we replace the stashed RIP on the handler’s
	stack with the one we just pulled from the interrupted
	thread’s stack (remember, iret is going to
	populate RIP from the handler’s stack). On line
	430 we remove the stored RIP from the interrupted
	thread’s stack. On line 431 we restore RAX. And
	finally we iret back into the interrupted thread.
      
And with that we now know how a kernel function entry/return point is instrumented, intercepted, and emulated. While that’s interesting, it’s only part of the FBT story. One of the more powerful aspects of DTrace is the ability to grab the stacktrace and arguments at the time of probe firing. These features are implemented in their own functions, but they rely on the invop handler’s help. Now that we understand the handler we can see how these other DTrace features work, to be described in a follow up post.