Debugging SmartOS #1
Finding Lock Inversions with DTrace
19 Jun 2017
The problem was simple, a classic lock inversion bug. Two functions grabbing the same locks in opposite order. It’s an easy fix, change the bad function to use the correct order. But what if you don’t know which one is the bad function because you don’t know what the correct order is? This is the problem I faced. To find my way out of this predicament I needed to assess all uses of these locks and determine the correct order. But when you spread the locks across tens of thousands of lines of C that gets a little tricky. Tracing this type of code by hand is the work of a masochist. No human is going to get this right, and they shouldn’t have to. In this post I’m going to explain how I used DTrace to analyze lock usage in the DLS and MAC layers of the SmartOS networking stack.
When two or more threads grab two or more locks in opposite order you have lock inversion. This is a classic deadlock scenario and most often portrayed in the famous dining philosophers allegory. A bunch of hungry philosophers can’t agree on which order to grab their limited utensil set and end up starving to death. Well, they don’t always starve to death. Deadlock is not guaranteed when you have lock inversion. It requires concurrency and unfortunate timing as well (like a stool, it only works with all three legs). But you should remove lock inversion wherever you find it.
My issue dealt with the acquisition order of the global DLS and MAC perimeter locks across a large collection of functions. But what I’m going to show you works for any pair of locks. The first step to determining lock acquisition order is knowing when individual locks are acquired. To discover that you’ll want to reach for the lockstat and fbt providers.
- lockstat
-
The lockstat provider covers lock acquisition and release. It fires the
lockstat:::{adaptive,rw}-acquire
probes after lock acquisition and thelockstat:::{adaptive,rw}-release
probes after lock release. The adaptive (mutex) probes point to thekmutex_t
inarg0
and the rw probes point to thekrwlock_t
inarg0
. You can use these pointers in clause predicates to filter only the locks you are interested in. - fbt
-
The fbt (Function Boundary Tracing) provider traces entry and exit of kernel functions. It fires
fbt::
on entry and:entry fbt::
on exit. The entry probe includes the typed arguments in the:return args[]
array allowing you to inspect the function arguments as you would in C code.
You might think that lockstat is enough but it doesn’t cover what I call composite locks: a plain mutex or rw lock paired with additional state. Lockstat only covers plain locks. The MAC perimeter is a recursive mutex made up of a plain mutex, an owner field, and a counter. To track this lock I need to watch the kernel functions used to enter and exit it. That’s where the fbt provider comes in.
These two providers aren’t going to tell me when lock inversion happens. They can only tell me when a thread acquires or releases a lock. But with that information in hand it’s a simple matter of tracking state to find inversion. The key idea is this: keep thread-local state to track lock acquisition order and print the order upon acquisition of the last lock. For example, upon grabbing lock A check if the thread-local lock B flag is set, if so then I know the order was B->A. I can then print that order along with the stacktrace leading to that order. The code snippet below shows how I tracked acquisition and release of the DLS and MAC locks with thread-local variables.
lockstat:::rw-acquire /IS_DEVNET_LOCK(arg0)/ {
self->dls = 1;
}
lockstat:::rw-release /IS_DEVNET_LOCK(arg0)/ {
self->dls = 0;
}
fbt::i_mac_perim_enter:entry {
self->mac = 1;
}
/*
* Letting go of the last reference to the perim.
*/
fbt::i_mac_perim_exit:entry /IS_MAC_PERIM_CNT(args[0], 1)/ {
self->mac = 0;
}
Ignoring the predicate macros for a moment, this should look
pretty simple. If a lock is held I set its thread-local flag
to 1; otherwise it’s set to 0. The MAC perimeter is a
recursive mutex so it’s not enough to just check lock release.
The consumer may grab this perimeter five times and it’s not
truly released until the owner releases it as many times as it
acquired it. Thus I only set self->mac
to 0 when
the current reference count is 1, indicating the final release.
Those four clauses cover all cases of acquisition and release. Now I need to detect when both locks are held and print the stacktrace.
lockstat:::rw-acquire /IS_DEVNET_LOCK(arg0) && self->mac/ {
printf("MAC -> DLS %s (%d)\n", execname, pid);
stack();
}
/*
* If MAC perim count is 0 then this is the first entry into the
* perim.
*/
fbt::i_mac_perim_enter:entry /IS_MAC_PERIM_CNT(args[0], 0) && self->dls/ {
printf("DLS -> MAC %s (%d)\n", execname, pid);
stack();
}
These two clauses are like the previous two entry clauses but with a slight twist: when acquiring one of the locks check if the other lock is already held, if it is then I know the order and can print the stacktrace. Once again, the MAC perimeter is more involved. I only want to print the stacktrace upon first entry to the perimeter, not on recursive entries.
And that’s all there is to it. The final thing to understand are the predicate macros I used.
#define MIS_IS_VNIC 0x0002
#define IS_VNIC(mip) (mip->mi_state_flags & MIS_IS_VNIC)
#define MAC_LOWER(upper) \
((mac_client_impl_t *)((vnic_t *)upper->mi_driver)->vn_mc_handles[0])->mci_mip
#define MAC_PERIM_CNT(mip) (IS_VNIC(mip) ? \
MAC_LOWER(mip)->mi_perim_ocnt : mip->mi_perim_ocnt)
#define IS_MAC_PERIM_CNT(mip, cnt) (MAC_PERIM_CNT((mip)) == (cnt))
#define IS_DEVNET_LOCK(l) ((uintptr_t)(l) == (uintptr_t)&`i_dls_devnet_lock)
I used three nice features here: typed arguments à
la CTF,
embedded C preprocessor, and external variables. The SmartOS
kernel is compiled with Compact Type Format (CTF) support.
When CTF is in play DTrace probes support an alternate
argument array, args[]
, containing typed
arguments that you can manipulate just as you would in C.
Combining this with DTrace’s CPP support I was able to lift C
code from the kernel into my script. These macros are complex
only because MAC is complex. Without CPP and DTrace’s type
support I’d have to know the memory offsets of struct members
and make binary comparisons, which would break if the struct
changes. Embedded C makes this script easier to work with and
less prone to breakage. The devnet lock predicate compares
the kmutext_t
pointer to the address of the
global kernel symbol i_dls_devnet_lock
. The
backtick is how you reference external variables in DTrace.
Allowing you to reference the value or address of kernel
symbols.
Finally, here’s the output during a zone start and stop.
MAC -> DLS dladm (125046)
dls`dls_devnet_set+0x6b
dls`dls_devnet_create+0x4c
0xfffffffff81de5ca
0xfffffffff81dd8cd
dld`drv_ioctl+0x1e4
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`_sys_sysenter_post_swapgs+0x241
MAC -> DLS dladm (125046)
dls`dls_devnet_hold_common+0x53
dls`dls_devnet_hold+0x17
dls`dls_devnet_setzid+0x8e
dld`drv_ioc_prop_common+0x44c
dld`drv_ioc_setprop+0x29
dld`drv_ioctl+0x1e4
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`_sys_sysenter_post_swapgs+0x241
MAC -> DLS dladm (125047)
dls`dls_devnet_rename+0xc0
dld`drv_ioc_rename+0xbb
dld`drv_ioctl+0x1e4
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`_sys_sysenter_post_swapgs+0x241
MAC -> DLS flowadm (124904)
dls`dls_devnet_hold_common+0x53
dls`dls_devnet_hold_tmp+0x1a
dls`dls_devnet_hold_link+0x2f
unix`stubs_common_code+0x59
mac`mac_link_flow_walk+0x57
dld`dld_walk_flow+0xca
dld`drv_ioc_walkflow+0x24
dld`drv_ioctl+0x1e4
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`_sys_sysenter_post_swapgs+0x241
DLS -> MAC dlmgmtd (100016)
mac`mac_perim_enter_by_mh+0x23
mac`mac_perim_enter_by_macname+0x33
dls`i_dls_devnet_setzid+0x6b
dls`dls_devnet_unset+0x2b2
dls`dls_devnet_destroy+0x46
vnic`vnic_dev_delete+0x96
vnic`vnic_ioc_delete+0x28
dld`drv_ioctl+0x1e4
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`sys_syscall32+0x227
DLS -> MAC dlmgmtd (100016)
mac`mac_close+0x18
mac`mac_perim_exit+0x33
dls`i_dls_devnet_setzid+0xfc
dls`dls_devnet_unset+0x2b2
dls`dls_devnet_destroy+0x46
vnic`vnic_dev_delete+0x96
vnic`vnic_ioc_delete+0x28
dld`drv_ioctl+0x1e4
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`sys_syscall32+0x227
From this output I could see that functions called during zone start were grabbing the locks in a different order than functions called during zone shutdown. This coincided with reports in the field that mass zone restarts would sometimes cause deadlock. From this output I also noticed that most functions used MAC->DLS ordering. With this information, and some further code spelunking, I convinced myself that this is the only correct order (even though some code comments said otherwise). The real benefit of using DTrace is knowing I can trust the output (especially over my ability to trace the code by hand). I can share this output with others and they will trust it as well. That is, assuming they trust my script, which they can also verify. Finally, this script will far outlast my memory and gives future maintainers a tool to diagnose similar issues.
I can apply this same approach to three or more locks as well. To do that I’d change the thread-local variables from boolean flags to acquisition order. Here’s what such a script might look like.
#define IS_A(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_a)
#define IS_B(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_b)
#define IS_C(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_c)
lockstat:::adaptive-acquire /IS_A(args[0])/ {
self->cnt++;
self->a = self->cnt;
}
lockstat:::adaptive-acquire /IS_B(args[0])/ {
self->cnt++;
self->b = self->cnt;
}
lockstat:::adaptive-acquire /IS_C(args[0])/ {
self->cnt++;
self->c = self->cnt;
}
lockstat:::adaptive-release /IS_A(args[0])/ {
self->cnt--;
self->a = 0;
}
lockstat:::adaptive-release /IS_B(args[0])/ {
self->cnt--;
self->b = 0;
}
lockstat:::adaptive-release /IS_C(args[0])/ {
self->cnt--;
self->c = 0;
}
lockstat:::adaptive-acquire /IS_A(args[0]) && self->a == 3 &&
self->b == 2 && self->c == 1/ {
printf("C -> B -> A\n");
}
lockstat:::adaptive-acquire /IS_A(args[0]) && self->a == 3 &&
self->b == 1 && self->c == 2/ {
printf("B -> C -> A\n");
}
lockstat:::adaptive-acquire /IS_B(args[0]) && self->b == 3 &&
self->a == 1 && self->c == 2/ {
printf("A -> C -> B\n");
}
lockstat:::adaptive-acquire /IS_B(args[0]) && self->b == 3 &&
self->a == 2 && self->c == 1/ {
printf("C -> A -> B\n");
}
lockstat:::adaptive-acquire /IS_C(args[0]) && self->c == 3 &&
self->a == 2 && self->b == 1/ {
printf("B -> A -> C\n");
}
lockstat:::adaptive-acquire /IS_C(args[0]) && self->c == 3 &&
self->a == 1 && self->b == 2/ {
printf("A -> B -> C\n");
}
This gets ugly in a hurry. I’m using the same method as my original script which uses only predicates for flow control. This causes the number of print clauses to grow factorial to the number of locks. What if, instead, I created a thread-local variable for each slot and placed a string representation of the lock in the next open slot when the lock is acquired?
#define IS_A(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_a)
#define IS_B(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_b)
#define IS_C(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_c)
#define IS_A_B_C(ptr) IS_A(ptr) || IS_B(ptr) || IS_C(ptr)
#define LOCK_STR(ptr) \
IS_A(ptr) ? "A" : IS_B(ptr) ? "B" : IS_C(ptr) ? "C" : "?"
lockstat:::adaptive-acquire /IS_A_B_C(args[0]) && !self->third/ {
self->third = LOCK_STR(args[0]);
}
lockstat:::adaptive-acquire /IS_A_B_C(args[0]) && !self->second/ {
self->second = LOCK_STR(args[0]);
}
lockstat:::adaptive-acquire /IS_A_B_C(args[0]) && !self->first/ {
self->first = LOCK_STR(args[0]);
}
lockstat:::adaptive-release /IS_A_B_C(args[0]) && self->first &&
!self->second && !self->third/ {
self->first = 0;
}
lockstat:::adaptive-release /IS_A_B_C(args[0]) && self->first &&
self->second && !self->third/ {
self->second = 0;
}
lockstat:::adaptive-release /IS_A_B_C(args[0]) && self->first &&
self->second && self->third/ {
self->third = 0;
}
lockstat:::adaptive-acquire /IS_A_B_C(args[0]) && self->third/ {
printf("%s -> %s -> %s\n", self->first, self->second, self->third);
stack();
}
It’s an improvement, but a delicate one. These clauses only work if you get the order exactly right. With a newer DTrace (late 2016), making use of Matt Ahrens’ if/else support, I can make this tighter and more robust. Here’s the script as it probably should be written (but it’s still good to know the old tricks).
#define IS_A(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_a)
#define IS_B(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_b)
#define IS_C(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_c)
#define IS_A_B_C(ptr) IS_A(ptr) || IS_B(ptr) || IS_C(ptr)
#define LOCK_STR(ptr) \
IS_A(ptr) ? "A" : IS_B(ptr) ? "B" : IS_C(ptr) ? "C" : "?"
lockstat::adaptive-acquire /IS_A_B_C(args[0])/ {
if (!self->first)
self->first = LOCK_STR(args[0]);
else if (!self->second)
self->second = LOCK_STR(args[0]);
else if (!self->third)
self->third = LOCK_STR(args[0]);
if (self->third) {
printf("%s -> %s -> %s\n", self->first, self->second,
self->third);
stack();
}
}
lockstat::adaptive-release /IS_A_B_C(args[0])/ {
if (self->third)
self->third = 0;
else if (self->second)
self->second = 0;
else if (self->first)
self->first = 0;
}
Now that’s worth taking home to mom.
Remember, whenever you find yourself debugging some code—especially legacy code with twists and turns—don’t do yourself what the computer can do for you.