The illumos SYSCALL Handler
5 May 2016
The system call is the brain stem of the operating system. It ushers requests from the user brain to the kernel body. Without system calls your programs are helpless. Even the primal init process gets nowhere without them. They are the cost of doing business with the OS.
The system call is both simple and complicated. As an abstraction it is simple: a controlled entry point into the kernel used by the application to affect the hardware. Behind the abstraction lies an implementation which must deal with the complicated reality of hardware. On the user side sit the libc stubs, placing the system call number and arguments in the correct registers and invoking the system call instruction. On the kernel side sits a regular old C function with access to kernel memory and privileged instructions. It’s the space between where the wild things are. This is the space of the handler.
At its core the handler transforms a user thread into a kernel thread while it’s running. It changes CPU state by using instructions that themselves change CPU state. This is like Optimus Prime transforming from truck to robot while hauling ass down the road after a Decepticon. If the state is not transitioned in just the right sequence then the system crashes. In some cases, a bad transition leaves a ticking time bomb that explodes far removed from the original bug—leaving no indicator of what caused it. Debugging these kinds of crashes can drive you crazy. As my friend Scott Fritchie likes to say: “this is like seeing a picture of a car on fire and trying to explain why it is on fire”. To top it all off, the handler is coded in assembly. It must be. The handler deals with the CPU architecture directly and C can’t be used.
There are several ways to perform a system call on modern x86 chips. I’m focusing on one: the AMD64 syscall instruction. The illumos syscall handler should mostly look like other systems, but there are also things unique to illumos. For example, the branded system call: a form of OS-level virtualization performed at the system call layer. SmartOS uses this in its lx-brand zones. They run an entire Linux distribution by emulating the Linux system call table. For this post I’m focusing on the essential parts of the handler. I’ve taken liberty with the assembly code—removing extraneous bits to focus on the path most traveled.
The Handler
SWAPGS
To understand the first instruction it helps to understand how we arrive here and the state of the CPU at this point. The syscall instruction is minimal by design—performing a few actions.
- Enter kernel-mode: switch into kernel-mode, aka ring-0. Giving access to privileged instructions and data.
- CS & SS setup: load kernel-mode values into the code and stack segment registers.
- Swap flags: stash the user CPU flags, then change them for kernel-mode execution.
- Swap IP: stash the user instruction pointer. Swap it for the kernel syscall handler: sys_syscall.
The execution of syscall leaves you in an odd place. Like a child halfway over a tall fence. Getting to the top was easy, but now a difficult decision must be made. Slide back down the way you came? Or risk it and jump to the other side? Most of the hardware state is still from the user thread, but the CPU is pointing to kernel code and stack. One wrong move results in falling down the other side of the fence and eating it hard.
The handler’s first step is grabbing a pointer to the CPU kernel data structure. It’s here where it will begin stashing user hardware state. It’s imperative this be done before anything else. The handler must have somewhere to put user state before it can load kernel state. The swapgs instruction is built for this purpose. In fact, it was built specifically for use with syscall. It swaps the gs base value with that in the KERNEL_GS_BASE MSR (Model Specific Register). Each CPU has its own gs segment register. And each CPU’s KERNEL_GS_BASE MSR points to its cpu_t structure. This structure contains space to stash a small amount of user state as well as a pointer to the kernel thread associated with the user thread. After swapgs runs the gs prefix is synonymous with the cpu_t pointer.
movq %r15, %gs:CPU_RTMP_R15 /* stash user r15 */
movq %rsp, %gs:CPU_RTMP_RSP /* stash user stack ptr */
movq %gs:CPU_THREAD, %r15 /* load kernel thread ptr */
movq T_STACK(%r15), %rsp /* load kernel stack ptr */
With the cpu_t in hand you begin to descend the other side of the fence into kernel-mode. Next step, get a handle to the kernel thread and its stack. That’s two pointers in two registers. That means moving two pieces of user state off to the side. The first two instructions stash the user r15 and rsp registers in the temp area of cpu_t. With those out of the way the kernel thread pointer goes into r15. This comes first because the stack pointer value is a member of the kernel thread structure. The fourth instruction loads the kernel stack pointer into rsp. Now you’re feeling pretty confident about getting down the other side of this fence. Just a matter of stashing the rest of the user state.
movl $UCS_SEL, REGOFF_CS(%rsp)
movq %rcx, REGOFF_RIP(%rsp) /* syscall: %rip -> %rcx */
movq %r11, REGOFF_RFL(%rsp) /* syscall: %rfl -> %r11d */
movl $UDS_SEL, REGOFF_SS(%rsp)
The first frame of kernel stack is the user hardware state saved as a regs structure. The REGOFF macros contain offsets into this frame.

SYSCALL destroys the instruction pointer and flags; but not until after it stashes them in rcx and r11. The middle two instructions save the user’s IP and flags. The user loses their original rcx and r11. That’s okay, the AMD64 ABI states that they are caller-saved. The user must save these values beforehand if they care about them. SYSCALL also destroys the code and stack segment registers. The two outside instructions save constant values to these segment selectors—pointing to the appropriate segment descriptors for the return to user-mode.
But sysret already writes its own values to these segment registers. The stashed values are never used. Why do this?

It turns out that just because a thread enters the kernel with syscall doesn’t guarantee that it will return with sysret. The handler may take an alternate return path via iret; which pops the code and stack registers from the stack. Furthermore, the alternate return path uses the stashed cs to determine if it is returning to userland, and if so, if it is returning to a 32 or 64-bit application.
movl %eax, %eax /* wrapper: sysc# -> %eax */
This curious instruction moves the value of eax into…eax. Weird. This instruction hides its intent in implicit behavior. AT&T syntax indicates the operand size by suffix. The movl instruction means “move doubleword” (32 bits). In 64-bit mode the CPU clears the top 32 bits of a 64-bit register during a 32-bit move. The eax register is synonymous with the bottom 32 bits of the rax register. Thus, this instruction clears the top 32 bits of rax.
The reason for this instruction isn’t immediately obvious. I believe its purpose is to prevent a subtle but powerful attack. I return to this later in the post.
movq %rdi, REGOFF_RDI(%rsp)
movq %rsi, REGOFF_RSI(%rsp)
movq %rdx, REGOFF_RDX(%rsp)
movq %r10, REGOFF_RCX(%rsp) /* wrapper: %rcx -> %r10 */
movq %r10, %rcx /* arg[3] for direct calls */
movq %r8, REGOFF_R8(%rsp)
movq %r9, REGOFF_R9(%rsp)
Next, the handler saves the system call arguments to the user state. The AMD64 ABI specifies these registers. The handler stashes the arguments in the same order they are passed to the kernel routine: rdi is the first argument, rsi the second, etc. SYSCALL destroys the value in rcx. To mitigate this the libc stub copies the 4th argument into r10. The handler reestablishes rcx where the kernel routine expects to find its 4th argument.
movq %rax, REGOFF_RAX(%rsp)
movq %rbx, REGOFF_RBX(%rsp)
movq %rbp, REGOFF_RBP(%rsp)
movq %r10, REGOFF_R10(%rsp)
movq %gs:CPU_RTMP_RSP, %r11
movq %r11, REGOFF_RSP(%rsp)
movq %r12, REGOFF_R12(%rsp)
movq %r13, REGOFF_R13(%rsp)
movq %r14, REGOFF_R14(%rsp)
movq %gs:CPU_RTMP_R15, %r10
movq %r10, REGOFF_R15(%rsp)
Here the handler stashes the rest of the GPRs (General Purpose Registers). Before stashing r15 and rsp, the handler extracts them from their temporary holding area in cpu_t.
movq $0, REGOFF_SAVFP(%rsp)
movq $0, REGOFF_SAVPC(%rsp)
Set r_savfp and r_savpc to 0. These two values have something to do with MDB (Modular Debugger), interrupts, and stacktraces. A value of 0 means to end the stacktrace. You can’t see it but I’m waving my hands—moving right along.
/*
* Copy these registers here in case we end up stopped with
* someone (like, say, /proc) messing with our register state.
* We don't -restore- them unless we have to in update_sregs.
*
* Since userland -can't- change fsbase or gsbase directly,
* and capturing them involves two serializing instructions,
* we don't bother to capture them here.
*/
xorl %ebx, %ebx
movw %ds, %bx
movq %rbx, REGOFF_DS(%rsp)
movw %es, %bx
movq %rbx, REGOFF_ES(%rsp)
movw %fs, %bx
movq %rbx, REGOFF_FS(%rsp)
movw %gs, %bx
movq %rbx, REGOFF_GS(%rsp)
Any time you xor a value with itself you get zero. It’s a common idiom: xor is just as fast as mov but encodes in two bytes versus the five needed to move an immediate zero. It’s important to clear ebx because the instructions following it stash 64-bit values. The top 48 bits should be clear because the segment selectors are only 16 bits wide.

The rest of the instructions stash the data segment registers. The comment alludes to a subtle difference between the GPRs and these registers. The handler saves the GPRs because they are destroyed while carrying out the system call. The handler itself uses instructions that touch the GPRs, either implicitly or explicitly. And if the handler doesn’t use them then the kernel routine will.
The segment registers, on the other hand, are less likely to change during the system call. The system modifies them when a context switch occurs before the system call can finish. Or perhaps it’s some other reason; I didn’t trace this rabbit hole the entire way down. In any event, these registers likely don’t need restoring when returning to user-mode. On modification the system sets the LWP’s pcb_rupdate flag. The handler checks this flag on return: running update_sregs() if it is set.
movq %rsp, %rbp
Start a new stack frame by setting the frame base pointer to the current end of the stack. This marks the end of the first frame—where the handler just stashed all the user hardware state. From here on out the stack belongs to the kernel routines invoked by the handler.
Those familiar with assembly might wonder when the first frame, the user hardware state frame, was actually built. After all, there have been no push instructions or manipulation of rsp up to this point. It turns out that the initial frame is a permanent fixture. The system allocates space for the regs structure at thread creation. And the stack pointer is set to the first address after this structure. The REGOFF macros depend on this detail—the stack grows downward so they add to rsp to reach into the initial frame.
movq T_LWP(%r15), %r14
ASSERT_NO_RUPDATE_PENDING(%r14)
ENABLE_INTR_FLAGS
Move the LWP pointer into r14. LWP stands for Light-Weight Process. It’s a holdover from when Solaris had an M:N scheduling model—many user threads on a smaller set of kernel threads. These days illumos uses a 1:1 threading model: a kernel thread per user thread. T_LWP is the offset of the klwp_t pointer. It contains the kernel-side state for a thread: such as hardware context, system call support, I/O, and signal handling.
The second instruction is a sanity check that runs on debug kernels only. It verifies the thread was not switched off-cpu between the time the system call started and now—otherwise it panics the system. This is an important check: a context switch before this point would corrupt the hardware state.
The third instruction goes hand in hand with the second one. It enables the interrupt flag so that interrupts can proceed on this CPU. This implies interrupts were disabled up to this point, but I haven’t shown you any instruction that does this. What’s going on? It’s hidden in the implicit behavior of syscall. Illumos configures syscall to disable interrupts on its way into the kernel. This gives the handler a chance to save the hardware state without interruption. With all the user hardware state safely stashed the handler enables interrupts in case something more important needs to run.
MSTATE_TRANSITION(LMS_USER, LMS_SYSTEM)
movl REGOFF_RAX(%rsp), %eax /* (%rax damaged by mstate call) */
Transition the thread’s microstate accounting from user-mode to system-mode (another name for kernel-mode). This is the value displayed under the USR and SYS columns when running prstat -m. It shows the percentage of time spent in each mode. This is a coarse but useful metric for determining on which side of the line a thread spends most of its time. The microstate update destroys eax so restore it after.
movb $LWP_SYS, LWP_STATE(%r14)
incq LWP_RU_SYSC(%r14)
Transition the LWP into system state and increment its system call counter. Microstate accounting and LWP state are similar but used for different purposes. The former tracks time spent on each side of the line; the latter is a flag used by other parts of the kernel. For example, the scheduler has a “CPU caps” feature that allows one to place an upper limit on the CPU time a thread takes. This limit applies only to user-mode time so the enforcement code uses lwp_state to verify it’s only limiting user-mode.
movb $NORMALRETURN, LWP_EOSYS(%r14)
This has something to do with I/O. It can hold the value NORMALRETURN or JUSTRETURN. The former adjusts the instruction pointer and registers on return; the latter doesn’t.
incq %gs:CPU_STATS_SYS_SYSCALL
Increment the per-cpu system call counter as reported by mpstat(1M).
movw %ax, T_SYSNUM(%r15)
Place the system call number into the kernel thread’s t_sysnum field. This value serves various purposes across the system.
movq REGOFF_RDI(%rbp), %rdi
movq REGOFF_RSI(%rbp), %rsi
movq REGOFF_RDX(%rbp), %rdx
movq REGOFF_RCX(%rbp), %rcx
movq REGOFF_R8(%rbp), %r8
movq REGOFF_R9(%rbp), %r9
Load the system call arguments per the AMD64 ABI.
cmpl $NSYSCALL, %eax
jae _syscall_ill
Verify that the system call number is legal; otherwise jump to _syscall_ill. The instruction jae stands for “jump if above or equal”. The cmpl instruction subtracts $NSYSCALL from eax and sets the flags. Because jae tests the unsigned result, the jump is taken whenever eax is above or equal to $NSYSCALL—that is, whenever the number falls outside the table.
shll $SYSENT_SIZE_SHIFT, %eax
leaq sysent(%rax), %rbx
The system call table is an array of sysent structures. Each sysent contains a pointer to a kernel routine implementing the system call. The first step to calling the routine is gaining access to the sysent entry.
The first instruction multiplies the system call number by the sysent structure size. The structure is coded to be a power-of-two in size, allowing efficient multiplication by shifting left by the log-2 of the size in bytes. For example, the 64-bit kernel sysent is 32 bytes, a shift of 5: 2^5 == 32.

The second instruction loads this address into rbx. The AT&T syntax sysent(%rax) says “add the offset in rax to the address of sysent”. These two instructions are logically equal to %rbx = &sysent[%eax].
Remember that peculiar mov instruction earlier? You know, where the handler moves the value of eax into…eax. That instruction becomes important here. The sysent data structure is in 64-bit address space and we must use 64-bit operands. If any of the top 32 bits of rax were on it would cause a lookup far outside the bounds of the table—maybe pointing to something malicious. An attacker could write custom assembly to load a valid system call in the bottom 32 bits of rax but also turn on some bits in the top 32 bits that end up pointing to code they created. This code would execute in kernel-mode with access to everything. The system call number validation can’t catch this type of attack because it’s using 32-bit operands and can’t see the top 32 bits.
Though, that’s just a guess on my part. I haven’t done the work to verify this attack is possible.
call *SY_CALLC(%rbx)
The moment of truth—actually calling the system call. Everything up to this point is prelude to the system call. Likewise, everything after is epilogue. This prelude and epilogue combine to form the system call handler. The logic shared by all system calls. That’s why it’s such an important piece of the OS. Any useful action starts with a system call; and all system calls must go through the handler.
The SY_CALLC macro is an offset into the sysent structure for the system call function pointer. The SY_CALLC(%rbx) syntax is just like the previous relative addressing. It says “starting from rbx as base, give me the address at the offset for the sy_callc member”. This results in a pointer to a function but call expects the address of the first instruction. The handler dereferences the pointer with an asterisk just like in C.
movq %rax, %r12
movq %rdx, %r13
/*
* If the handler returns two ints, then we need to split the
* 64-bit return value into two 32-bit values.
*/
testw $SE_32RVAL2, SY_FLAGS(%rbx)
je 5f
movq %r12, %r13
shrq $32, %r13 /* upper 32-bits into %edx */
movl %r12d, %r12d /* lower 32-bits into %eax */
At some point, unknown to me, people needed to return 64-bit values on 32-bit CPUs. The way to do it without pointers is to use two registers. The 386 System V ABI already specified eax for a 32-bit return value, and the authors chose edx as an extension to hold the upper 32 bits of a 64-bit value. These registers were not picked by accident. As Joe Demato pointed out to me, instructions like mul already set historical precedent for use of eax and edx in such a manner. It makes sense as an optimization too. If the last instruction in your routine is of this type then you can forego the mov instructions.
When you aren’t using two registers to return a 64-bit value you can use them to return two 32-bit values, and so illumos did. It introduced the rval_t union allowing system calls to return one or two values. When everyone moved to 64-bit they still kept compatibility with 32-bit programs and used two registers. But now the registers are bigger. If we wanted we could return one 128-bit value or two 64-bit values. For the moment, illumos doesn’t return any 128-bit values, but it’s coded so that it can in the future.

The handler must jump through hoops because the system call routines are normal C functions—placing the return value in rax. If the call returns two 32-bit values then the handler needs to recognize that via the $SE_32RVAL2 flag and tease them out: placing one in eax and the other in edx. Just like a 32-bit kernel would have done. Even though no system call is placing a return value in rdx we still act like it does for the sake of future 128-bit return values.
MSTATE_TRANSITION(LMS_SYSTEM, LMS_USER)
The system call has returned and the thread is on its way back to user-mode. Flip the microstate.
CLI(%r14)
Disable maskable interrupts. The handler is swapping the user hardware state back in; a context switch now would send this thread off its rails. On metal this is just an alias for the cli instruction. The register argument applies only when compiling the kernel to run on Xen.
/*
* We need to protect ourselves against non-canonical return values
* because Intel doesn't check for them on sysret (AMD does). Canonical
* addresses on current amd64 processors only use 48-bits for VAs; an
* address is canonical if all upper bits (47-63) are identical. If we
* find a non-canonical %rip, we opt to go through the full
* _syscall_post path which takes us into an iretq which is not
* susceptible to the same problems sysret is.
*
* We're checking for a canonical address by first doing an arithmetic
* shift. This will fill in the remaining bits with the value of bit 63.
* If the address were canonical, the register would now have either all
* zeroes or all ones in it. Therefore we add one (inducing overflow)
* and compare against 1. A canonical address will either be zero or one
* at this point, hence the use of ja.
*
* At this point, r12 and r13 have the return value so we can't use
* those registers.
*/
movq REGOFF_RIP(%rsp), %rcx
sarq $47, %rcx
incq %rcx
cmpq $1, %rcx
ja _syscall_post
This is my favorite part of the handler. The block comment is accurate but it doesn’t tell you the whole story. It doesn’t explain why returning to a non-canonical address is bad. It doesn’t explain why this would ever happen. In Unix tradition let me preface the next few paragraphs with “you are not expected to understand this”.
As the comment says, this code prevents returning to a non-canonical address. A non-canonical address is one that sits in the “VA hole”. When AMD created the 64-bit extensions they limited the MMU (Memory Management Unit) to 48 bits of addressable memory—keeping the page table structure to four levels. This is an important performance trade-off: any time your program reads memory it first needs to translate the virtual address to a physical one. If there is no matching entry in the TLB (Translation Lookaside Buffer, a fast map from virtual to physical) then it has to walk four page tables. A larger address space requires even more levels and more costly cold memory accesses. So the AMD engineers went with 48 bits. This decision leaves a surplus of unused bits in the top word. Us programmers, the frugal bunch we are, love to stash things in our unused bits and consider ourselves clever for doing it. But this poses a big problem if vendors ever decide to extend the address space. Those bits are now owned by the CPU; programs relying on them would break. To keep our feet from getting shot the chip makers invented the canonical address. A canonical address has all its top bits identical to the 47th bit: all ones or all zeros. If you draw the address space out as a vertical rectangle you get 128 terabytes on the top, 128 on the bottom, and a massive hole in the center called the VA hole.
Upon accessing an address in the VA hole the CPU generates a general protection fault. These five instructions prevent this from happening on Intel chips. They verify the return address is a canonical one. Otherwise the handler returns by a safe method and lets the user program segfault on the bad address. But the comment doesn’t tell you why this would ever happen. It seems like an odd thing to worry about. Why can’t the chip just handle this for you?
As it turns out, it can. AMD doesn’t need this code. The code exists solely to protect against an Intel specific exploit discovered in 2012. This is why I invoked Ritchie’s “you are not expected to understand this”. Like Ritchie, I’m not mocking you. Rather, the purpose of this code is not obvious. I figured it out only after looking at the code history, tracking that to an illumos issue, and then finally ending up at a vulnerability note. The exploit is subtle, requiring deep OS knowledge. But let me give you the gist of it.
If you call sysret with a non-canonical address in rip it causes a GP fault. On Intel this GP runs in kernel-mode (ring-0); on AMD it runs in user-mode (ring-3). Because of the way sysret works, the syscall handler must restore the user’s stack pointer before calling it. Thus, the GP happens in kernel-mode but with the user’s stack. When invoked, the GP handler immediately saves the CPU registers on the stack. This is fine in normal circumstances; but using our new knowledge we could concoct some C/ASM code to load nefarious values in the registers (in user-mode), set rsp to some address in kernel memory, somehow load a non-canonical address in rip, and then invoke syscall. It’s tricky—the exploit needs to pass itself off as a real system call and get through the entire handler to the sysret call all while maintaining its nefarious register values—but it’s doable. In the end you have something that can write arbitrary values to kernel memory via system calls.
That’s the story behind those five lines of assembly. After that the bit twiddling tricks seem less interesting. But here’s the summary.
A canonical address will result in an rcx value of 0 or 1. By comparing it to 1 and calling ja we are saying “jump to _syscall_post only if rcx is greater than 1”.
SIMPLE_SYSCALL_POSTSYS(%r15, %r14, %bx)
This macro fixes up various state. It expands to the following.
movb $LWP_USER, LWP_STATE(%r14)
movw %bx, T_SYSNUM(%r15)
andb $_CONST(0xffff - PS_C), REGOFF_RFL(%rsp)
First, set the LWP state back to user. Second, set the kernel thread’s t_sysnum field to zero. The handler assumes bx is always zero at this point. I’m not sure why. I’m going to leave that as an exercise to the reader.
The third instruction clears the carry flag. It starts with all 16 bits enabled, disables the carry flag, and then performs a bitwise-and with the stashed user value.
Why does the handler disable the carry flag? It has to do with failure. This is one place where the Linux and Unix ABI diverge. Linux indicates error with a negative value in rax in the range -1 to -4095; anything else marks success. Unix indicates error by setting the carry flag. It separates the notion of failure from the return value itself, allowing system calls to return negative values as success. You can see this in action if you peek at the libc assembly for the read(2) call.
# dis -F __read /lib/64/libc.so.1
disassembly for /lib/64/libc.so.1
__read()
__read: 49 89 ca movq %rcx,%r10
__read+0x3: b8 03 00 00 00 movl $0x3,%eax
__read+0x8: 0f 05 syscall
__read+0xa: 73 0a jae +0xa <__read+0x16>
__read+0xc: 83 f8 5b cmpl $0x5b,%eax
__read+0xf: 74 ef je -0x11 <__read>
__read+0x11: e9 9a 32 f5 ff jmp -0xacd66 <__cerror>
__read+0x16: c3 ret
The jae instruction is saying “jump to __read+0x16 if the carry flag is not set, otherwise fall into the error handling code”. It’s a bit clearer when reading the source.
/*
* SYSREENTRY provides the entry sequence for restartable system calls.
*/
#define SYSREENTRY(name) \
ENTRY(name); \
1:
/*
* SYSRESTART provides the error handling sequence for restartable
* system calls.
* XX64 -- Are all of the argument registers restored to their
* original values on an ERESTART return (including %rcx)?
*/
#define SYSRESTART(name) \
jae 1f; \
cmpl $ERESTART, %eax; \
je 1b; \
jmp __cerror; \
1:
If the carry flag is set then we have an error. If the error is a restart ($ERESTART/0x5B in eax) then restart the sequence. The je 1b is syntactic sugar to say “jump back to the previous 1: label” (setup by SYSREENTRY). Likewise, jae 1f says “if there is no error then jump forward to the next 1: label”. If neither case is met there is a legit error: jump to __cerror.
And that concludes the SIMPLE_SYSCALL_POSTSYS macro. Simple, right?
movq %r12, REGOFF_RAX(%rsp)
movq %r13, REGOFF_RDX(%rsp)
Stash the return value into the user state.
Why bother? The instructions that follow will reload all the user state—including these registers. For debugging I suppose. If the system panics before the hardware state is restored then the core would include the return values in the kernel thread stack.
/*
* To get back to userland, we need the return %rip in %rcx and
* the return %rfl in %r11d. The sysretq instruction also arranges
* to fix up %cs and %ss; everything else is our responsibility.
*/
movq REGOFF_RDI(%rsp), %rdi
movq REGOFF_RSI(%rsp), %rsi
movq REGOFF_RDX(%rsp), %rdx
/* %rcx used to restore %rip value */
movq REGOFF_R8(%rsp), %r8
movq REGOFF_R9(%rsp), %r9
movq REGOFF_RAX(%rsp), %rax
movq REGOFF_RBX(%rsp), %rbx
movq REGOFF_RBP(%rsp), %rbp
movq REGOFF_R10(%rsp), %r10
/* %r11 used to restore %rfl value */
movq REGOFF_R12(%rsp), %r12
movq REGOFF_R13(%rsp), %r13
movq REGOFF_R14(%rsp), %r14
movq REGOFF_R15(%rsp), %r15
movq REGOFF_RIP(%rsp), %rcx
movl REGOFF_RFL(%rsp), %r11d
movq REGOFF_RSP(%rsp), %rsp
Restore the user hardware state from the saved values in the initial kernel stack frame. The user rip is loaded into rcx and rflags into r11 per the sysret spec.
SWAPGS /* user gsbase */
When entering the kernel via syscall we swap gs and then save the user hardware state. Now do the reverse: restore the user state and then swap gs.
SYSRETQ
And here we are. The last instruction. The end of the road for the system call handler.
The q suffix indicates this is a return to a 64-bit application. This takes us down another rabbit hole where Intel and AMD differ: AMD allows syscall in compatibility mode (32-bit app on 64-bit kernel) but Intel doesn’t. So sysret supports returning to 32-bit or 64-bit applications, and in this case we are in the 64-bit handler which will only ever return to a 64-bit application, thus the q suffix. Illumos uses sysenter/sysexit for 32-bit system calls on Intel chips.
This instruction returns the thread to user-mode, loads the user instruction pointer and flags, and performs some adjustments on the cs and ss segment registers.
And that’s it. You just completed a typical trip through the illumos syscall handler. Lean back in your chair and let this idea sink in: almost every system call needs to do all this work each time it is called. And a busy server might see millions of system calls a second. This code better be bug free and efficient.
Appendix A: SYSCALL/SYSRET Setup
The syscall instruction, and its companion sysret, rely on various special registers to do their job. Six MSRs must be set in total.
- EFER (Extended Feature Enable Register): Set the system call extension (SCE) bit. Otherwise, calling these instructions produces an invalid opcode (UD) exception.
- STAR (legacy-mode SYSCALL Target Address Register): SYSCALL uses bits 47-32 to set the CS and SS selectors upon entry; they point to the kernel code and data GDT descriptors. SYSRET uses bits 63-48 to set the CS and SS selectors when returning to user-mode; they point to the user code and data GDT descriptors. The bottom 32 bits are meant to be the eip when using syscall in legacy-mode but illumos doesn’t support that.
- LSTAR (Long-Mode SYSCALL Target Address Register): SYSCALL uses this value as the instruction pointer upon entry. It points to sys_syscall: the syscall handler. The topic of this post.
- CSTAR (Compatibility-Mode SYSCALL Target Address Register): Same as LSTAR but for 32-bit. It is legal for apps in compatibility-mode to call syscall on AMD, but not on Intel.
- SFMASK (SYSCALL Flag Mask): The flags to disable on syscall entry: IF (interrupt), TF (trap aka single-step), and AC (access control) if SMAP is supported by the CPU. These are named PS_IE, PS_T, and PS_ACHK in the source. Thus, executing syscall disables interrupts, single-step, and SMAP.
- KERNEL_GS_BASE: SWAPGS swaps this value with the gs base value. Upon kernel entry the handler uses it to load a pointer to the kernel’s per-CPU structure where it can hang a small amount of machine state. On return to user-mode the handler uses it to swap the user gs back in.
void
init_cpu_syscall(struct cpu *cp)
{
...
if (is_x86_feature(x86_featureset, X86FSET_MSR) &&
is_x86_feature(x86_featureset, X86FSET_ASYSC)) {
...
/*
* Turn syscall/sysret extensions on.
*/
cpu_asysc_enable();
/*
* Program the magic registers ..
*/
wrmsr(MSR_AMD_STAR,
((uint64_t)(U32CS_SEL << 16 | KCS_SEL)) << 32);
wrmsr(MSR_AMD_LSTAR, (uint64_t)(uintptr_t)sys_syscall);
wrmsr(MSR_AMD_CSTAR, (uint64_t)(uintptr_t)sys_syscall32);
/*
* This list of flags is masked off the incoming
* %rfl when we enter the kernel.
*/
flags = PS_IE | PS_T;
if (is_x86_feature(x86_featureset, X86FSET_SMAP) == B_TRUE)
flags |= PS_ACHK;
wrmsr(MSR_AMD_SFMASK, flags);
}
...
}
Each CPU has its own GDT with descriptors for the code & data segments for both kernel and user-mode. These descriptors are not used in the typical path. The syscall and sysret instructions enforce specific values in the segment registers. The descriptor values should match those.
static void
init_gdt_common(user_desc_t *gdt)
{
...
/*
* 64-bit kernel code segment.
*/
set_usegd(&gdt[GDT_KCODE], SDP_LONG, NULL, 0, SDT_MEMERA, SEL_KPL,
SDP_PAGES, SDP_OP32);
/*
* 64-bit kernel data segment. The limit attribute is ignored in 64-bit
* mode, but we set it here to 0xFFFF so that we can use the SYSRET
* instruction to return from system calls back to 32-bit applications.
* SYSRET doesn't update the base, limit, or attributes of %ss or %ds
* descriptors. We therefore must ensure that the kernel uses something,
* though it will be ignored by hardware, that is compatible with 32-bit
* apps. For the same reason we must set the default op size of this
* descriptor to 32-bit operands.
*/
set_usegd(&gdt[GDT_KDATA], SDP_LONG, NULL, -1, SDT_MEMRWA,
SEL_KPL, SDP_PAGES, SDP_OP32);
gdt[GDT_KDATA].usd_def32 = 1;
/*
* 64-bit user code segment.
*/
set_usegd(&gdt[GDT_UCODE], SDP_LONG, NULL, 0, SDT_MEMERA, SEL_UPL,
SDP_PAGES, SDP_OP32);
/*
* 32-bit user code segment.
*/
set_usegd(&gdt[GDT_U32CODE], SDP_SHORT, NULL, -1, SDT_MEMERA,
SEL_UPL, SDP_PAGES, SDP_OP32);
...
/*
* 32 and 64 bit data segments can actually share the same descriptor.
* In long mode only the present bit is checked but all other fields
* are loaded. But in compatibility mode all fields are interpreted
* as in legacy mode so they must be set correctly for a 32-bit data
* segment.
*/
set_usegd(&gdt[GDT_UDATA], SDP_SHORT, NULL, -1, SDT_MEMRWA, SEL_UPL,
SDP_PAGES, SDP_OP32);
...
}
Appendix B: REGOFF & Other Macros
Many of the macros used in the syscall handler come from assym.h. But you won’t find that file in the illumos-gate source code. It’s generated by a tool called genassym. At build time genassym reads .in files and produces .h files. These .in files contain a cpp-like format for describing the generation of macros for structure members. Most macros are implicitly named after the structure member name, just capitalized. You can also generate macros for structure size (_SIZE), and a log-2 shift (_SIZE_SHIFT) for any structures that have a power-of-two size. You’ll find all the macros used by the handler in either mach_offsets.in or offsets.in.

Here’s the definition of the REGOFF macros in mach_offsets.in.
regs REGSIZE
r_savfp REGOFF_SAVFP
r_savpc REGOFF_SAVPC
r_rdi REGOFF_RDI
r_rsi REGOFF_RSI
r_rdx REGOFF_RDX
r_rcx REGOFF_RCX
r_r8 REGOFF_R8
r_r9 REGOFF_R9
r_rax REGOFF_RAX
r_rbx REGOFF_RBX
r_rbp REGOFF_RBP
r_r10 REGOFF_R10
r_r11 REGOFF_R11
r_r12 REGOFF_R12
r_r13 REGOFF_R13
r_r14 REGOFF_R14
r_r15 REGOFF_R15
\#if DEBUG
__r_fsbase REGOFF_FSBASE
__r_gsbase REGOFF_GSBASE
\#endif
r_ds REGOFF_DS
r_es REGOFF_ES
r_fs REGOFF_FS
r_gs REGOFF_GS
r_trapno REGOFF_TRAPNO
r_err REGOFF_ERR
r_rip REGOFF_RIP
r_cs REGOFF_CS
r_rfl REGOFF_RFL
r_rsp REGOFF_RSP
r_ss REGOFF_SS
\#define REGOFF_PC REGOFF_RIP
Appendix C: Resources
It’s easy to get lost in the weeds when studying system calls. A lot of the content out there is lacking. I’ve found that the best sources are the Intel & AMD development guides and the System V ABI. These documents have their own jargon that is intimidating. But as you pick up the lingo they become powerful references.
Intel Dev Manual Vol 1: Basic Architecture
Intel Dev Manual Vol 2: Instruction Set Reference
Intel Dev Manual Vol 3: System Programming
AMD64 Dev Manual Vol 1: Application Programming
AMD64 Dev Manual Vol 2: System Programming
AMD64 Dev Manual Vol 3: General & System Instructions
System V 386 ABI, 4th Ed (1997)