The illumos SYSCALL Handler
5 May 2016
The system call is the brain stem of the operating system. It ushers requests from the user brain to the kernel body. Without system calls your programs are helpless. Even the primal init process gets nowhere without them. They are the cost of doing business with the OS.
The system call is both simple and complicated. As an abstraction it is simple: a controlled entry point into the kernel used by the application to affect the hardware. Behind the abstraction lies an implementation which must deal with the complicated reality of hardware. On the user side sit the libc stubs, placing the system call number and arguments in the correct registers and invoking the system call instruction. On the kernel side sits a regular old C function with access to kernel memory and privileged instructions. It’s the space between where the wild things are. This is the space of the handler.
At its core the handler transforms a user thread into a kernel thread while it’s running. It changes CPU state by using instructions that themselves change CPU state. This is like Optimus Prime transforming from truck to robot while hauling ass down the road after a Decepticon. If the state is not transitioned in just the right sequence then the system crashes. In some cases, a bad transition leaves a ticking time bomb that explodes far removed from the original bug—leaving no indicator of what caused it. Debugging these kinds of crashes can drive you crazy. As my friend Scott Fritchie likes to say: “this is like seeing a picture of a car on fire and trying to explain why it is on fire”. To top it all off, the handler is coded in assembly. It must be. The handler deals with the CPU architecture directly and C can’t be used.
There are several ways to perform a system call on modern x86 chips. I’m focusing on one: the AMD64 syscall instruction. The illumos syscall handler should mostly look like other systems, but there are also things unique to illumos. For example, the branded system call: a form of OS-level virtualization performed at the system call layer. SmartOS uses this in its lx-brand zones. They run an entire Linux distribution by emulating the Linux system call table. For this post I’m focusing on the essential parts of the handler. I’ve taken liberty with the assembly code—removing extraneous bits to focus on the path most traveled.
The Handler
SWAPGS
To understand the first instruction it helps to understand how we arrive here and the state of the CPU at this point. The syscall instruction is minimal by design—performing a few actions.
- Enter kernel-mode: switch into kernel-mode, aka ring-0. Giving access to privileged instructions and data.
- CS & SS setup: load kernel-mode values into the code and stack segment registers.
- Swap flags: stash the user CPU flags, then change them for kernel-mode execution.
- Swap IP: stash the user instruction pointer. Swap it for the kernel syscall handler: sys_syscall.
The execution of syscall leaves you in an odd place. Like a child halfway over a tall fence. Getting to the top was easy, but now a difficult decision must be made. Slide back down the way you came? Or risk it and jump to the other side? Most of the hardware state is still from the user thread, but the CPU is pointing to kernel code and stack. One wrong move results in falling down the other side of the fence and eating it hard.
The handler’s first step is grabbing a pointer to the CPU kernel data structure. It’s here where it will begin stashing user hardware state. It’s imperative this be done before anything else. The handler must have somewhere to put user state before it can load kernel state. The swapgs instruction is built for this purpose. In fact, it was built specifically for use with syscall. It swaps the gs base value with that in the KERNEL_GS_BASE MSR (Model Specific Register). Each CPU has its own gs segment register. And each CPU’s KERNEL_GS_BASE MSR points to its cpu_t structure. This structure contains space to stash a small amount of user state as well as a pointer to the kernel thread associated with the user thread. After swapgs runs the gs prefix is synonymous with the cpu_t pointer.
movq %r15, %gs:CPU_RTMP_R15 /* stash user r15 */
movq %rsp, %gs:CPU_RTMP_RSP /* stash user stack ptr */
movq %gs:CPU_THREAD, %r15 /* load kernel thread ptr */
movq T_STACK(%r15), %rsp /* load kernel stack ptr */
With the cpu_t in hand you begin to descend the other side of the fence into kernel-mode. Next step, get a handle to the kernel thread and its stack. That’s two pointers in two registers. That means moving two pieces of user state off to the side. The first two instructions stash the user r15 and rsp registers in the temp area of cpu_t. With those out of the way the kernel thread pointer goes into r15. This comes first because the stack pointer value is a member of the kernel thread structure. The fourth instruction loads the kernel stack pointer into rsp. Now you’re feeling pretty confident about getting down the other side of this fence. Just a matter of stashing the rest of the user state.
movl $UCS_SEL, REGOFF_CS(%rsp)
movq %rcx, REGOFF_RIP(%rsp) /* syscall: %rip -> %rcx */
movq %r11, REGOFF_RFL(%rsp) /* syscall: %rfl -> %r11d */
movl $UDS_SEL, REGOFF_SS(%rsp)
The first frame of kernel stack is the user hardware state saved as a regs structure. The REGOFF macros contain offsets into this frame.

SYSCALL destroys the instruction pointer and flags; but not until after it stashes them in rcx and r11. The middle two instructions save the user’s IP and flags. The user loses their original rcx and r11. That’s okay, the AMD64 ABI states that they are caller-saved. The user must save these values beforehand if they care about them. SYSCALL also destroys the code and stack segment registers. The two outside instructions save constant values to these segment selectors—pointing to the appropriate segment descriptors for the return to user-mode.
But sysret already writes its own values to these segment registers. The stashed values are never used. Why do this?

It turns out that just because a thread enters the kernel with syscall doesn’t guarantee that it will return with sysret. The handler may take an alternate return path via iret; which pops the code and stack registers from the stack. Furthermore, the alternate return path uses the stashed cs to determine if it is returning to userland, and if so, if it is returning to a 32 or 64-bit application.
movl %eax, %eax /* wrapper: sysc# -> %eax */
This curious instruction moves the value of eax into…eax. Weird. This instruction hides its intent in implicit behavior. AT&T syntax indicates the operand size by suffix. The movl instruction means “move doubleword” (32 bits). In 64-bit mode the CPU clears the top 32 bits of a 64-bit register during a 32-bit move. The eax register is synonymous with the bottom 32 bits of the rax register. Thus, this instruction clears the top 32 bits of rax.
The reason for this instruction isn’t immediately obvious. I believe its purpose is to prevent a subtle but powerful attack. I return to this later in the post.
movq %rdi, REGOFF_RDI(%rsp)
movq %rsi, REGOFF_RSI(%rsp)
movq %rdx, REGOFF_RDX(%rsp)
movq %r10, REGOFF_RCX(%rsp) /* wrapper: %rcx -> %r10 */
movq %r10, %rcx /* arg[3] for direct calls */
movq %r8, REGOFF_R8(%rsp)
movq %r9, REGOFF_R9(%rsp)
Next, the handler saves the system call arguments to the user state. The AMD64 ABI specifies these registers. The handler stashes the arguments in the same order they are passed to the kernel routine: rdi is the first argument, rsi the second, etc. SYSCALL destroys the value in rcx. To mitigate this the libc stub copies the 4th argument into r10. The handler reestablishes rcx where the kernel routine expects to find its 4th argument.
movq %rax, REGOFF_RAX(%rsp)
movq %rbx, REGOFF_RBX(%rsp)
movq %rbp, REGOFF_RBP(%rsp)
movq %r10, REGOFF_R10(%rsp)
movq %gs:CPU_RTMP_RSP, %r11
movq %r11, REGOFF_RSP(%rsp)
movq %r12, REGOFF_R12(%rsp)
movq %r13, REGOFF_R13(%rsp)
movq %r14, REGOFF_R14(%rsp)
movq %gs:CPU_RTMP_R15, %r10
movq %r10, REGOFF_R15(%rsp)
Here the handler stashes the rest of the GPRs (General Purpose Registers). Before stashing r15 and rsp, the handler extracts them from their temporary holding area in cpu_t.
movq $0, REGOFF_SAVFP(%rsp)
movq $0, REGOFF_SAVPC(%rsp)
Set r_savfp and r_savpc to 0. These two values have something to do with MDB (Modular Debugger), interrupts, and stacktraces. A value of 0 means to end the stacktrace. You can’t see it but I’m waving my hands—moving right along.
/*
* Copy these registers here in case we end up stopped with
* someone (like, say, /proc) messing with our register state.
* We don't -restore- them unless we have to in update_sregs.
*
* Since userland -can't- change fsbase or gsbase directly,
* and capturing them involves two serializing instructions,
* we don't bother to capture them here.
*/
xorl %ebx, %ebx
movw %ds, %bx
movq %rbx, REGOFF_DS(%rsp)
movw %es, %bx
movq %rbx, REGOFF_ES(%rsp)
movw %fs, %bx
movq %rbx, REGOFF_FS(%rsp)
movw %gs, %bx
movq %rbx, REGOFF_GS(%rsp)
Any time you xor a value with itself you get zero. It’s a common idiom: xor is just as fast as mov but encodes in two bytes versus the five needed to move an immediate zero. It’s important to clear ebx because the instructions following it stash 64-bit values. The top 48 bits should be clear because the segment selectors are only 16 bits wide.

The rest of the instructions stash the data segment registers. The comment alludes to a subtle difference between the GPRs and these registers. The handler saves the GPRs because they are destroyed while carrying out the system call. The handler itself uses instructions that touch the GPRs, either implicitly or explicitly. And if the handler doesn’t use them then the kernel routine will.
The segment registers, on the other hand, are less likely to change during the system call. The system modifies them when a context switch occurs before the system call can finish. Or perhaps it’s some other reason; I didn’t trace this rabbit hole the entire way down. In any event, these registers likely don’t need restoring when returning to user-mode. On modification the system sets the LWP’s pcb_rupdate flag. The handler checks this flag on return: running update_sregs() if it is set.
movq %rsp, %rbp
Start a new stack frame by setting the frame base pointer to the current end of the stack. This marks the end of the first frame—where the handler just stashed all the user hardware state. From here on out the stack belongs to the kernel routines invoked by the handler.
Those familiar with assembly might wonder when the first frame, the user hardware state frame, was actually built. After all, there have been no push instructions or manipulation of rsp up to this point. It turns out that the initial frame is a permanent fixture. The system allocates space for the regs structure at thread creation. And the stack pointer is set to the first address after this structure. The REGOFF macros depend on this detail—the stack grows downward so they add to rsp to reach into the initial frame.
movq T_LWP(%r15), %r14
ASSERT_NO_RUPDATE_PENDING(%r14)
ENABLE_INTR_FLAGS
Move the LWP pointer into r14. LWP stands for Light-Weight Process. It’s a holdover from when Solaris had an M:N scheduling model—many user threads on a smaller set of kernel threads. These days illumos uses a 1:1 threading model: a kernel thread per user thread. T_LWP is the offset of the klwp_t pointer. It contains the kernel-side state for a thread: such as hardware context, system call support, I/O, and signal handling.
The second instruction is a sanity check that runs on debug kernels only. It verifies the thread was not switched off-cpu between the time the system call started and now—otherwise it panics the system. This is an important check: a context switch before this point would corrupt the hardware state.
The third instruction goes hand in hand with the second one. It enables the interrupt flag so that interrupts can proceed on this CPU. This implies interrupts were disabled up to this point, but I haven’t shown you any instruction that does this. What’s going on? It’s hidden in the implicit behavior of syscall. Illumos configures syscall to disable interrupts on its way into the kernel. This gives the handler a chance to save the hardware state without interruption. With all the user hardware state safely stashed the handler enables interrupts in case something more important needs to run.
MSTATE_TRANSITION(LMS_USER, LMS_SYSTEM)
movl REGOFF_RAX(%rsp), %eax /* (%rax damaged by mstate call) */
Transition the thread’s microstate accounting from user-mode to system-mode (another name for kernel-mode). This is the value displayed under the USR and SYS columns when running prstat -m. It shows the percentage of time spent in each mode. This is a coarse but useful metric for determining on which side of the line a thread spends most of its time. The microstate update destroys eax so restore it after.
movb $LWP_SYS, LWP_STATE(%r14)
incq LWP_RU_SYSC(%r14)
Transition the LWP into system state and increment its system call counter. Microstate accounting and LWP state are similar but used for different purposes. The former tracks time spent on each side of the line; the latter is a flag used by other parts of the kernel. For example, the scheduler has a “CPU caps” feature that allows one to place an upper limit on the CPU time a thread takes. This limit applies only to user-mode time so the enforcement code uses lwp_state to verify it’s only limiting user-mode.
movb $NORMALRETURN, LWP_EOSYS(%r14)
This has something to do with I/O. It can hold the value NORMALRETURN or JUSTRETURN. The former adjusts the instruction pointer and registers on return; the latter doesn’t.
incq %gs:CPU_STATS_SYS_SYSCALL
Increment the per-cpu system call counter as reported by mpstat(1M).
movw %ax, T_SYSNUM(%r15)
Place the system call number into the kernel thread’s t_sysnum field. This value serves various purposes across the system.
movq REGOFF_RDI(%rbp), %rdi
movq REGOFF_RSI(%rbp), %rsi
movq REGOFF_RDX(%rbp), %rdx
movq REGOFF_RCX(%rbp), %rcx
movq REGOFF_R8(%rbp), %r8
movq REGOFF_R9(%rbp), %r9
Load the system call arguments per the AMD64 ABI.
cmpl $NSYSCALL, %eax
jae _syscall_ill
Verify that the system call number is legal; otherwise jump to _syscall_ill. The instruction jae stands for “jump if above or equal”. The cmpl instruction subtracts $NSYSCALL from eax and sets the flags. Because jae tests the unsigned result, the jump is taken whenever eax is above or equal to $NSYSCALL—that is, whenever the number falls outside the table.
shll $SYSENT_SIZE_SHIFT, %eax
leaq sysent(%rax), %rbx
The system call table is an array of sysent structures. Each sysent contains a pointer to a kernel routine implementing the system call. The first step to calling the routine is gaining access to the sysent entry.
The first instruction multiplies the system call number by the sysent structure size. The structure is coded to be a power-of-two in size, allowing efficient multiplication by shifting left by the log-2 of the size in bytes. For example, the 64-bit kernel sysent is 32 bytes, a shift of 5: 2^5 == 32.

The second instruction loads this address into rbx. The AT&T syntax sysent(%rax) says “add the offset in rax to the address of sysent”. These two instructions are logically equal to %rbx = &sysent[%eax].
Remember that peculiar mov instruction earlier? You know, where the handler moves the value of eax into…eax. That instruction becomes important here. The sysent data structure is in 64-bit address space and we must use 64-bit operands. If any of the top 32 bits of rax were on it would cause a lookup far outside the bounds of the table—maybe pointing to something malicious. An attacker could write custom assembly to load a valid system call in the bottom 32 bits of rax but also turn on some bits in the top 32 bits that end up pointing to code they created. This code would execute in kernel-mode with access to everything. The system call number validation can’t catch this type of attack because it’s using 32-bit operands and can’t see the top 32 bits.
Though, that’s just a guess on my part. I haven’t done the work to verify this attack is possible.
call *SY_CALLC(%rbx)
The moment of truth—actually calling the system call. Everything up to this point is prelude to the system call. Likewise, everything after is epilogue. This prelude and epilogue combine to form the system call handler. The logic shared by all system calls. That’s why it’s such an important piece of the OS. Any useful action starts with a system call; and all system calls must go through the handler.
The SY_CALLC macro is an offset into the sysent structure for the system call function pointer. The SY_CALLC(%rbx) syntax is just like the previous relative addressing. It says “starting from rbx as base, give me the address at the offset for the sy_callc member”. This results in a pointer to a function but call expects the address of the first instruction. The handler dereferences the pointer with an asterisk just like in C.
movq %rax, %r12
movq %rdx, %r13
/*
* If the handler returns two ints, then we need to split the
* 64-bit return value into two 32-bit values.
*/
testw $SE_32RVAL2, SY_FLAGS(%rbx)
je 5f
movq %r12, %r13
shrq $32, %r13 /* upper 32-bits into %edx */
movl %r12d, %r12d /* lower 32-bits into %eax */
At some point, unknown to me, people needed to return 64-bit values on 32-bit CPUs. The way to do it without pointers is to use two registers. The 386 System V ABI already specified eax for a 32-bit return value, and the authors chose edx as an extension to hold the upper 32 bits of a 64-bit value. These registers were not picked by accident. As Joe Demato pointed out to me, instructions like mul already set historical precedent for use of eax and edx in such a manner. It makes sense as an optimization too. If the last instruction in your routine is of this type then you can forego the mov instructions.
When you aren’t using two registers to return a 64-bit value you can use them to return two 32-bit values, and so illumos did. It introduced the rval_t union allowing system calls to return one or two values. When everyone moved to 64-bit they still kept compatibility with 32-bit programs and used two registers. But now the registers are bigger. If we wanted we could return one 128-bit value or two 64-bit values. For the moment, illumos doesn’t return any 128-bit values, but it’s coded so that it can in the future.

The handler must jump through hoops because the system call routines are normal C functions—placing the return value in rax. If the call returns two 32-bit values then the handler needs to recognize that via the $SE_32RVAL2 flag and tease them out: placing one in eax and the other in edx. Just like a 32-bit kernel would have done. Even though no system call is placing a return value in rdx we still act like it does for the sake of future 128-bit return values.
MSTATE_TRANSITION(LMS_SYSTEM, LMS_USER)
The system call has returned and the thread is on its way back to user-mode. Flip the microstate.
CLI(%r14)
Disable maskable interrupts. The handler is swapping the user hardware state back in; a context switch now would send this thread off its rails. On metal this is just an alias for the cli instruction. The register argument applies only when compiling the kernel to run on Xen.
/*
* We need to protect ourselves against non-canonical return values
* because Intel doesn't check for them on sysret (AMD does). Canonical
* addresses on current amd64 processors only use 48-bits for VAs; an
* address is canonical if all upper bits (47-63) are identical. If we
* find a non-canonical %rip, we opt to go through the full
* _syscall_post path which takes us into an iretq which is not
* susceptible to the same problems sysret is.
*
* We're checking for a canonical address by first doing an arithmetic
* shift. This will fill in the remaining bits with the value of bit 63.
* If the address were canonical, the register would now have either all
* zeroes or all ones in it. Therefore we add one (inducing overflow)
* and compare against 1. A canonical address will either be zero or one
* at this point, hence the use of ja.
*
* At this point, r12 and r13 have the return value so we can't use
* those registers.
*/
movq REGOFF_RIP(%rsp), %rcx
sarq $47, %rcx
incq %rcx
cmpq $1, %rcx
ja _syscall_post
This is my favorite part of the handler. The block comment is accurate but it doesn’t tell you the whole story. It doesn’t explain why returning to a non-canonical address is bad. It doesn’t explain why this would ever happen. In Unix tradition let me preface the next few paragraphs with “you are not expected to understand this”.
As the comment says, this code prevents returning to a non-canonical address. A non-canonical address is one that sits in the “VA hole”. When AMD created the 64-bit extensions they limited the MMU (Memory Management Unit) to 48 bits of addressable memory—keeping the page table structure to four levels. This is an important performance trade-off: any time your program reads memory it first needs to translate the virtual address to a physical one. If there is no matching entry in the TLB (Translation Lookaside Buffer, a fast map from virtual to physical) then it has to walk four page tables. A larger address space requires even more levels and more costly cold memory accesses. So the AMD engineers went with 48 bits. This decision leaves a surplus of unused bits in the top word. Us programmers, the frugal bunch we are, love to stash things in our unused bits and consider ourselves clever for doing it. But this poses a big problem if vendors ever decide to extend the address space. Those bits are now owned by the CPU; programs relying on them would break. To keep our feet from getting shot the chip makers invented the canonical address. A canonical address has all its top bits identical to the 47th bit: all ones or all zeros. If you draw the address space out as a vertical rectangle you get 128 terabytes on the top, 128 on the bottom, and a massive hole in the center called the VA hole.
Upon accessing an address in the VA hole the CPU generates a general protection fault. These five instructions prevent this from happening on Intel chips. They verify the return address is a canonical one. Otherwise the handler returns by a safe method and lets the user program segfault on the bad address. But the comment doesn’t tell you why this would ever happen. It seems like an odd thing to worry about. Why can’t the chip just handle this for you?
As it turns out, it can. AMD doesn’t need this code. The code exists solely to protect against an Intel specific exploit discovered in 2012. This is why I invoked Ritchie’s “you are not expected to understand this”. Like Ritchie, I’m not mocking you. Rather, the purpose of this code is not obvious. I figured it out only after looking at the code history, tracking that to an illumos issue, and then finally ending up at a vulnerability note. The exploit is subtle, requiring deep OS knowledge. But let me give you the gist of it.
If you call sysret with a non-canonical address in rip it causes a GP fault. On Intel this GP runs in kernel-mode (ring-0); on AMD it runs in user-mode (ring-3). Because of the way sysret works, the syscall handler must restore the user’s stack pointer before calling it. Thus, the GP happens in kernel-mode but with the user’s stack. When invoked, the GP handler immediately saves the CPU registers on the stack. This is fine in normal circumstances; but using our new knowledge we could concoct some C/ASM code to load nefarious values in the registers (in user-mode), set rsp to some address in kernel memory, somehow load a non-canonical address in rip, and then invoke syscall. It’s tricky—the exploit needs to pass itself off as a real system call and get through the entire handler to the sysret call all while maintaining its nefarious register values—but it’s doable. In the end you have something that can write arbitrary values to kernel memory via system calls.
That’s the story behind those five lines of assembly. After that the bit twiddling tricks seem less interesting. But here’s the summary.
A canonical address will result in an rcx value of 0 or 1. By comparing it to 1 and calling ja we are saying “jump to _syscall_post only if rcx is greater than 1”.
SIMPLE_SYSCALL_POSTSYS(%r15, %r14, %bx)
This macro fixes up various state. It expands to the following.
movb $LWP_USER, LWP_STATE(%r14)
movw %bx, T_SYSNUM(%r15)
andb $_CONST(0xffff - PS_C), REGOFF_RFL(%rsp)
First, set the LWP state back to user. Second, set the kernel thread’s t_sysnum field to zero. The handler assumes bx is always zero at this point. I’m not sure why. I’m going to leave that as an exercise to the reader.
The third instruction clears the carry flag. It starts with all 16 bits enabled, disables the carry flag, and then performs a bitwise-and with the stashed user value.
Why does the handler disable the carry flag? It has to do with failure. This is one place where the Linux and Unix ABI diverge. Linux indicates error with a negative value in rax in the range -1 to -4095; anything else marks success. Unix indicates error by setting the carry flag. It separates the notion of failure from the return value itself, allowing system calls to return negative values as success. You can see this in action if you peek at the libc assembly for the read(2) call.
# dis -F __read /lib/64/libc.so.1
disassembly for /lib/64/libc.so.1
__read()
__read: 49 89 ca movq %rcx,%r10
__read+0x3: b8 03 00 00 00 movl $0x3,%eax
__read+0x8: 0f 05 syscall
__read+0xa: 73 0a jae +0xa <__read+0x16>
__read+0xc: 83 f8 5b cmpl $0x5b,%eax
__read+0xf: 74 ef je -0x11 <__read>
__read+0x11: e9 9a 32 f5 ff jmp -0xacd66 <__cerror>
__read+0x16: c3 ret
The jae instruction is saying “jump to __read+0x16 if the carry flag is not set, otherwise fall into the error handling code”. It’s a bit clearer when reading the source.
/*
* SYSREENTRY provides the entry sequence for restartable system calls.
*/
#define SYSREENTRY(name) \
ENTRY(name); \
1:
/*
* SYSRESTART provides the error handling sequence for restartable
* system calls.
* XX64 -- Are all of the argument registers restored to their
* original values on an ERESTART return (including %rcx)?
*/
#define SYSRESTART(name) \
jae 1f; \
cmpl $ERESTART, %eax; \
je 1b; \
jmp __cerror; \
1:
If the carry flag is set then we have an error. If the error is a restart ($ERESTART/0x5B in eax) then restart the sequence. The je 1b is syntactic sugar to say “jump back to the previous 1: label” (setup by SYSREENTRY). Likewise, jae 1f says “if there is no error then jump forward to the next 1: label”. If neither case is met there is a legit error: jump to __cerror.
And that concludes the SIMPLE_SYSCALL_POSTSYS macro. Simple, right?
movq %r12, REGOFF_RAX(%rsp)
movq %r13, REGOFF_RDX(%rsp)
Stash the return value into the user state.
Why bother? The instructions that follow will reload all the user state—including these registers. For debugging I suppose. If the system panics before the hardware state is restored then the core would include the return values in the kernel thread stack.
/*
* To get back to userland, we need the return %rip in %rcx and
* the return %rfl in %r11d. The sysretq instruction also arranges
* to fix up %cs and %ss; everything else is our responsibility.
*/
movq REGOFF_RDI(%rsp), %rdi
movq REGOFF_RSI(%rsp), %rsi
movq REGOFF_RDX(%rsp), %rdx
/* %rcx used to restore %rip value */
movq REGOFF_R8(%rsp), %r8
movq REGOFF_R9(%rsp), %r9
movq REGOFF_RAX(%rsp), %rax
movq REGOFF_RBX(%rsp), %rbx
movq REGOFF_RBP(%rsp), %rbp
movq REGOFF_R10(%rsp), %r10
/* %r11 used to restore %rfl value */
movq REGOFF_R12(%rsp), %r12
movq REGOFF_R13(%rsp), %r13
movq REGOFF_R14(%rsp), %r14
movq REGOFF_R15(%rsp), %r15
movq REGOFF_RIP(%rsp), %rcx
movl REGOFF_RFL(%rsp), %r11d
movq REGOFF_RSP(%rsp), %rsp
Restore the user hardware state from the saved values in the initial kernel stack frame. The user rip is loaded into rcx and rflags into r11 per the sysret spec.
SWAPGS /* user gsbase */
When entering the kernel via syscall we swap gs and then save the user hardware state. Now do the reverse: restore the user state and then swap gs.
SYSRETQ
And here we are. The last instruction. The end of the road for the system call handler.
The q suffix indicates this is a return to a 64-bit application. This takes us down another rabbit hole where Intel and AMD differ: AMD allows syscall in compatibility mode (32-bit app on 64-bit kernel) but Intel doesn’t. So sysret supports returning to 32-bit or 64-bit applications, and in this case we are in the 64-bit handler which will only ever return to a 64-bit application, thus the q suffix. Illumos uses sysenter/sysexit for 32-bit system calls on Intel chips.
This instruction returns the thread to user-mode, loads the user instruction pointer and flags, and performs some adjustments on the cs and ss segment registers.
And that’s it. You just completed a typical trip through the illumos syscall handler. Lean back in your chair and let this idea sink in: almost every system call needs to do all this work each time it is called. And a busy server might see millions of system calls a second. This code better be bug free and efficient.
Appendix A: SYSCALL/SYSRET Setup
The syscall instruction, and its companion sysret, rely on various special registers to do their job. Six MSRs must be set in total.
- EFER (Extended Feature Enable Register): Set the system call extension (SCE) bit. Otherwise, calling these instructions produces an invalid opcode (UD) exception.
- STAR (legacy-mode SYSCALL Target Address Register): SYSCALL uses bits 47-32 to set the CS and SS selectors upon entry; they point to the kernel code and data GDT descriptors. SYSRET uses bits 63-48 to set the CS and SS selectors when returning to user-mode; they point to the user code and data GDT descriptors. The bottom 32 bits are meant to be the eip when using syscall in legacy-mode but illumos doesn’t support that.
- LSTAR (Long-Mode SYSCALL Target Address Register): SYSCALL uses this value as the instruction pointer upon entry. It points to sys_syscall: the syscall handler. The topic of this post.
- CSTAR (Compatibility-Mode SYSCALL Target Address Register): Same as LSTAR but for 32-bit. It is legal for apps in compatibility-mode to call syscall on AMD, but not on Intel.
- SFMASK (SYSCALL Flag Mask): The flags to disable on syscall entry: IF (interrupt), TF (trap aka single-step), and AC (access control) if SMAP is supported by the CPU. These are named PS_IE, PS_T, and PS_ACHK in the source. Thus, executing syscall disables interrupts, single-step, and SMAP.
- KERNEL_GS_BASE: SWAPGS swaps this value with the gs base value. Upon kernel entry the handler uses it to load a pointer to the kernel’s per-CPU structure where it can hang a small amount of machine state. On return to user-mode the handler uses it to swap the user gs back in.
void
init_cpu_syscall(struct cpu *cp)
{
...
if (is_x86_feature(x86_featureset, X86FSET_MSR) &&
is_x86_feature(x86_featureset, X86FSET_ASYSC)) {
...
/*
* Turn syscall/sysret extensions on.
*/
cpu_asysc_enable();
/*
* Program the magic registers ..
*/
wrmsr(MSR_AMD_STAR,
((uint64_t)(U32CS_SEL << 16 | KCS_SEL)) << 32);
wrmsr(MSR_AMD_LSTAR, (uint64_t)(uintptr_t)sys_syscall);
wrmsr(MSR_AMD_CSTAR, (uint64_t)(uintptr_t)sys_syscall32);
/*
* This list of flags is masked off the incoming
* %rfl when we enter the kernel.
*/
flags = PS_IE | PS_T;
if (is_x86_feature(x86_featureset, X86FSET_SMAP) == B_TRUE)
flags |= PS_ACHK;
wrmsr(MSR_AMD_SFMASK, flags);
}
...
}
Each CPU has its own GDT with descriptors for the code & data segments for both kernel and user-mode. These descriptors are not used in the typical path. The syscall and sysret instructions enforce specific values in the segment registers. The descriptor values should match those.
static void
init_gdt_common(user_desc_t *gdt)
{
...
/*
* 64-bit kernel code segment.
*/
set_usegd(&gdt[GDT_KCODE], SDP_LONG, NULL, 0, SDT_MEMERA, SEL_KPL,
SDP_PAGES, SDP_OP32);
/*
* 64-bit kernel data segment. The limit attribute is ignored in 64-bit
* mode, but we set it here to 0xFFFF so that we can use the SYSRET
* instruction to return from system calls back to 32-bit applications.
* SYSRET doesn't update the base, limit, or attributes of %ss or %ds
* descriptors. We therefore must ensure that the kernel uses something,
* though it will be ignored by hardware, that is compatible with 32-bit
* apps. For the same reason we must set the default op size of this
* descriptor to 32-bit operands.
*/
set_usegd(&gdt[GDT_KDATA], SDP_LONG, NULL, -1, SDT_MEMRWA,
SEL_KPL, SDP_PAGES, SDP_OP32);
gdt[GDT_KDATA].usd_def32 = 1;
/*
* 64-bit user code segment.
*/
set_usegd(&gdt[GDT_UCODE], SDP_LONG, NULL, 0, SDT_MEMERA, SEL_UPL,
SDP_PAGES, SDP_OP32);
/*
* 32-bit user code segment.
*/
set_usegd(&gdt[GDT_U32CODE], SDP_SHORT, NULL, -1, SDT_MEMERA,
SEL_UPL, SDP_PAGES, SDP_OP32);
...
/*
* 32 and 64 bit data segments can actually share the same descriptor.
* In long mode only the present bit is checked but all other fields
* are loaded. But in compatibility mode all fields are interpreted
* as in legacy mode so they must be set correctly for a 32-bit data
* segment.
*/
set_usegd(&gdt[GDT_UDATA], SDP_SHORT, NULL, -1, SDT_MEMRWA, SEL_UPL,
SDP_PAGES, SDP_OP32);
...
}
Appendix B: REGOFF & Other Macros
Many of the macros used in the syscall handler come from assym.h. But you won’t find that file in the illumos-gate source code. It’s generated by a tool called genassym. At build time genassym reads .in files and produces .h files. These .in files contain a cpp-like format for describing the generation of macros for structure members. Most macros are implicitly named after the structure member name, just capitalized. You can also generate macros for structure size (_SIZE), and a log-2 shift (_SIZE_SHIFT) for any structures that have a power-of-two size. You’ll find all the macros used by the handler in either mach_offsets.in or offsets.in.

Here’s the definition of the REGOFF macros in mach_offsets.in.
regs REGSIZE
r_savfp REGOFF_SAVFP
r_savpc REGOFF_SAVPC
r_rdi REGOFF_RDI
r_rsi REGOFF_RSI
r_rdx REGOFF_RDX
r_rcx REGOFF_RCX
r_r8 REGOFF_R8
r_r9 REGOFF_R9
r_rax REGOFF_RAX
r_rbx REGOFF_RBX
r_rbp REGOFF_RBP
r_r10 REGOFF_R10
r_r11 REGOFF_R11
r_r12 REGOFF_R12
r_r13 REGOFF_R13
r_r14 REGOFF_R14
r_r15 REGOFF_R15
\#if DEBUG
__r_fsbase REGOFF_FSBASE
__r_gsbase REGOFF_GSBASE
\#endif
r_ds REGOFF_DS
r_es REGOFF_ES
r_fs REGOFF_FS
r_gs REGOFF_GS
r_trapno REGOFF_TRAPNO
r_err REGOFF_ERR
r_rip REGOFF_RIP
r_cs REGOFF_CS
r_rfl REGOFF_RFL
r_rsp REGOFF_RSP
r_ss REGOFF_SS
\#define REGOFF_PC REGOFF_RIP
Appendix C: Resources
It’s easy to get lost in the weeds when studying system calls. A lot of the content out there is lacking. I’ve found that the best sources are the Intel & AMD development guides and the System V ABI. These documents have their own jargon that is intimidating. But as you pick up the lingo they become powerful references.
Intel Dev Manual Vol 1: Basic Architecture
Intel Dev Manual Vol 2: Instruction Set Reference
Intel Dev Manual Vol 3: System Programming
AMD64 Dev Manual Vol 1: Application Programming
AMD64 Dev Manual Vol 2: System Programming
AMD64 Dev Manual Vol 3: General & System Instructions
System V 386 ABI, 4th Ed (1997)