More Efficient TLB Shootdowns
12 Apr 2015
In illumos issue #5498 Matthew Ahrens reported the occurrence of ZFS “latency bubbles” when switching from I/O with large block sizes to I/O with small ones. He traced the bubbles to an excessive number of TLB shootdowns made while freeing ZIO buffer slabs, finding that a shootdown was made for each page in the slab when the same job could be done with just one. This change removes much extraneous cross-core communication and helps avoid the latency bubbles.
As I studied Matthew’s patch I wondered if there was more to this than his simple description. His focus was on ZFS performance but he changed a very primal part of illumos memory management. A part through which all physical memory changes must travel. I couldn’t help but wonder if other parts of the system might benefit. Surely they must, but how could I tell?
First I needed a good understanding of Matthew’s patch. That specific knowledge would lead me to a more general understanding of the systems involved and how they interact with each other. From there I could then use DTrace to introspect a running system to verify my knowledge and look for effects on other parts of the system. But first I started where Matthew started, with ZFS.
Buffers & Slabs
ZFS makes heavy use of main memory caches to minimize I/O to slower devices like hard drives and solid-state disks. SSDs are fast, but still up to two orders of magnitude slower than DRAM. The entire set of main memory caches is referred to simply as the ARC, standing for Adaptive Replacement Cache because it adapts to varying workloads by using a hybrid MRU-MFU data structure. Underneath the ARC is the ZFS I/O layer, referred to as ZIO. It mediates between the ARC and the underlying storage devices: performing tasks like I/O scheduling, batching, and caching of data. The ZIO layer is what actually caches the data in memory, and the ARC acts as an administrator: determining which buffers are needed and which ones may be dropped from main memory.
These buffers come in many different sizes to accommodate ZFS’s dynamic block size: ranging from small 512-byte blocks to large 128K blocks. Large block support of up to 1M was recently added. To efficiently allocate these variable-sized buffers the ZIO layer, like many kernel services, uses the kernel slab allocator, or kmem for short. kmem gathers up contiguous ranges of kernel heap into chunks called slabs and provides object-based allocation on top of them. Behind the scenes kmem does the heavy lifting of making sure that each CPU core has access to a set of free objects, called a magazine, and that each core can quickly get more magazines from a local depot when it runs out. It’s basically a cache inside a cache inside a cache, giving you fast allocation and deallocation of objects with a malloc-like API.
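If you want to see which kmem caches are hot on your own system, a quick DTrace aggregation over kmem_cache_alloc will list them by name. This is just a throwaway one-liner, assuming the cache_name member is visible through CTF; the ZIO buffer caches show up as zio_buf_<size> and zio_data_buf_<size>:
# dtrace -n '
    fbt::kmem_cache_alloc:entry
    {
            /* count allocations per kmem cache, keyed by cache name */
            @allocs[stringof(args[0]->cache_name)] = count();
    }
    tick-10s { exit(0); }'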
As I/O is performed more and more of these ZIO buffer objects are allocated, taking up more and more memory. Over time space will get tight and allocating new buffers will require freeing old ones, which really means freeing slabs. But how?
Virtual Memory & The Page
Regardless of the method used, all allocations must eventually enter the kernel’s virtual memory system. And just like Fight Club, there is one very important rule: you do not allocate anything other than a page. Page size varies across different combinations of hardware and operating system, and some even offer multiple sizes at once, but all of them allocate memory in units of a page.
On x64 the base page size is 4K. Earlier I defined a slab to be a contiguous range of kernel heap, and that heap is backed by pages of memory, thus a slab is a set of one or more contiguous pages. A 128K slab is equivalent to 32 4K pages. And herein lies the crux of Matthew’s fix.
The high-level memory unit maps down to many contiguous low-level memory units.
Freeing a 128K slab means freeing 32 pages. How those pages are freed makes all the difference.
HAT, TLB Shootdowns & xcalls
Up until this point everything discussed deals with virtual memory. As the name suggests this type of memory is not real—it’s virtual, faux, pseudo. A delicate illusion performed by the Hardware Address Translation layer to trick the process into believing it owns all the memory. Any time a userland program, the kernel, or any piece of code references a virtual address the HAT layer communicates with the CPU’s Memory Management Unit to translate it to a physical page address, or as it is more typically called, a page frame. This translation is done by a lookup in the page table which maps virtual page numbers to physical page frames. To efficiently map the monster address space that is x64, four levels of page tables are actually used. In some cases a single lookup turns into five lookups: four page table lookups followed by the final lookup of the actual frame. If this sounds expensive that’s because it is. Memory access is too common an operation to always take a 5x hit. So the hardware folks created the Translation Lookaside Buffer: a per-core hardware cache that translates an address in one shot.
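To make those four lookups concrete, here is how a 64-bit virtual address splits into four 9-bit table indexes plus a 12-bit page offset. The address is just an example, borrowed from the ZIO output later in this post, and the one-liner is nothing but arithmetic:
# dtrace -qn 'BEGIN {
        va = (uint64_t)0xffffff0047d68000;
        /* 9 bits per table level, 12-bit offset within the 4K page */
        printf("PML4 %d  PDPT %d  PD %d  PT %d  offset 0x%03x\n",
            (int)((va >> 39) & 0x1ff), (int)((va >> 30) & 0x1ff),
            (int)((va >> 21) & 0x1ff), (int)((va >> 12) & 0x1ff),
            (int)(va & 0xfff));
        exit(0);
}'
Each of those four indexes costs a memory access before the final frame can be touched, which is exactly the cost the TLB exists to avoid.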
When a virtual page is freed its mapping must be removed from the page table as well as from any TLB with an entry for that page. Clearing a page from a single core’s TLB is easy enough using the INVLPG instruction, but multicore is a different story. There is no built-in TLB coherency on Intel chips; it is up to the operating system to enforce consistency. This matters here because the ARC lives in the Kernel Address Space, and since all cores are constantly switching into kernel context we have to assume that any page in KAS is mapped in every TLB, and therefore an INVLPG must be sent to all cores—this is referred to as the ol’ TLB shootdown.
In a shootdown one core tells all the other cores to clear a page mapping from their TLBs. A common way to implement this is with a cross call, or xcall for short (the formal name is Inter-Processor Interrupt, or IPI). An xcall allows one core to send an interrupt to one or more other cores for the purpose of taking some immediate action. There are several flavors of xcall, but for the purpose of a TLB shootdown you can loosely think of it like a transaction in a distributed database, except with a really fast and reliable network. The core freeing the page sends out an interrupt to all the other cores, notifying them to free the page as well, and then waits for all of them to reply before continuing. The callees, however, are allowed to continue immediately after they clear the page and send their reply; they don’t have to wait along with the core that originated the request.
As you might imagine this is a relatively expensive operation to perform. Not only is one core blocked for the entire duration, but all the other cores are interrupted as well. With a typical latency of about 8μs on my machine, it approaches two orders of magnitude more than a DRAM access—roughly 28,000 cycles on my 3.5GHz chip. A cost worth avoiding when possible.
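You can watch this cost accumulate on any illumos system: the sysinfo provider fires a probe each time a CPU sends cross calls, and aggregating on the kernel stack shows who is generating them. This is a generic snippet, not specific to TLB shootdowns:
# dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); } tick-30s { exit(0); }'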
Page vs. Range
To recap: ZFS caches data in the ARC. The ARC is a collection of ZIO buffers. These buffers are backed by the kernel memory slab allocator. Each slab is a contiguous set of one or more pages. When a slab is freed, its pages must be unmapped from the page table and the TLB. Since these pages live in KAS they are potentially mapped in all TLBs, so a shootdown must be performed. Each shootdown is a blocking xcall that interrupts all other cores on the machine. This is done for each page, one at a time.
If a 128K slab is made up of 32 pages then it will require 32 individual shootdowns, each of which interrupts every core on the machine.
Ouch.
The fix is simple and relies on one key fact: the pages in a slab are always contiguous and can therefore be unmapped with one range-based shootdown, removing all the mappings at once. An INVLPG instruction is still executed for each page at the hardware level; the difference is that the entire range to unmap is now passed in a single xcall. That is, shootdowns are now performed in batch by contiguous range instead of page by page.
Seeing is Believing
Talking about performance is cool, but ya know what’s really cool? Verifying performance improvements with DTrace. So I wrote a script that reports ZIO buffer frees as they occur, showing the base address and length of the slab being freed and any shootdowns executed.
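The script itself is too long to reproduce here, but the core idea is easy to sketch: trace each ZIO buffer free and count the cross calls sent while it is in progress. This is a rough approximation, not the real script; it assumes segkmem_zio_free is the path ZIO slabs take back to the VM system (the stack traces later in this post show it is) and uses sysinfo:::xcalls as a stand-in for individual shootdowns:
#!/usr/sbin/dtrace -s

#pragma D option quiet

fbt::segkmem_zio_free:entry
{
        self->addr = arg1;      /* slab base address */
        self->len = arg2;       /* slab length in bytes */
        self->xc = 0;
}

sysinfo:::xcalls
/self->len/
{
        /* cross-call stat bumps made by this thread during the free */
        self->xc++;
}

fbt::segkmem_zio_free:return
/self->len/
{
        printf("ZIO FREE addr=0x%p len=%d xcalls=%d\n",
            (void *)self->addr, self->len, self->xc);
        self->addr = 0;
        self->len = 0;
        self->xc = 0;
}
The full script goes a step further and prints every individual shootdown as it happens, which is what the output below shows.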
# /data/zio-shootdowns
ZIO FREE addr=0xffffff0047d68000 len=131072
TLB SHOOTDOWN #1 addr=0xffffff0047d68000 len=4096
TLB SHOOTDOWN #2 addr=0xffffff0047d69000 len=4096
TLB SHOOTDOWN #3 addr=0xffffff0047d6a000 len=4096
TLB SHOOTDOWN #4 addr=0xffffff0047d6b000 len=4096
TLB SHOOTDOWN #5 addr=0xffffff0047d6c000 len=4096
TLB SHOOTDOWN #6 addr=0xffffff0047d6d000 len=4096
TLB SHOOTDOWN #7 addr=0xffffff0047d6e000 len=4096
TLB SHOOTDOWN #8 addr=0xffffff0047d6f000 len=4096
TLB SHOOTDOWN #9 addr=0xffffff0047d70000 len=4096
TLB SHOOTDOWN #10 addr=0xffffff0047d71000 len=4096
TLB SHOOTDOWN #11 addr=0xffffff0047d72000 len=4096
TLB SHOOTDOWN #12 addr=0xffffff0047d73000 len=4096
TLB SHOOTDOWN #13 addr=0xffffff0047d74000 len=4096
TLB SHOOTDOWN #14 addr=0xffffff0047d75000 len=4096
TLB SHOOTDOWN #15 addr=0xffffff0047d76000 len=4096
TLB SHOOTDOWN #16 addr=0xffffff0047d77000 len=4096
TLB SHOOTDOWN #17 addr=0xffffff0047d78000 len=4096
TLB SHOOTDOWN #18 addr=0xffffff0047d79000 len=4096
TLB SHOOTDOWN #19 addr=0xffffff0047d7a000 len=4096
TLB SHOOTDOWN #20 addr=0xffffff0047d7b000 len=4096
TLB SHOOTDOWN #21 addr=0xffffff0047d7c000 len=4096
TLB SHOOTDOWN #22 addr=0xffffff0047d7d000 len=4096
TLB SHOOTDOWN #23 addr=0xffffff0047d7e000 len=4096
TLB SHOOTDOWN #24 addr=0xffffff0047d7f000 len=4096
TLB SHOOTDOWN #25 addr=0xffffff0047d80000 len=4096
TLB SHOOTDOWN #26 addr=0xffffff0047d81000 len=4096
TLB SHOOTDOWN #27 addr=0xffffff0047d82000 len=4096
TLB SHOOTDOWN #28 addr=0xffffff0047d83000 len=4096
TLB SHOOTDOWN #29 addr=0xffffff0047d84000 len=4096
TLB SHOOTDOWN #30 addr=0xffffff0047d85000 len=4096
TLB SHOOTDOWN #31 addr=0xffffff0047d86000 len=4096
TLB SHOOTDOWN #32 addr=0xffffff0047d87000 len=4096
Above is the result of freeing a 128K ZIO buffer on a version of SmartOS from before Matthew’s change. As expected, freeing this buffer required 32 shootdowns, one for each page (sometimes fewer, depending on the TLB state). Notice that the address of each shootdown increments by 4K, the page size. What about the latest SmartOS?
# /data/zio-shootdowns
ZIO FREE addr=0xffffff0209885000 len=12288
TLB SHOOTDOWN #1 addr=0xffffff0209885000 len=12288
ZIO FREE addr=0xffffff02475a1000 len=20480
TLB SHOOTDOWN #1 addr=0xffffff02475a1000 len=20480
ZIO FREE addr=0xffffff0446a14000 len=131072
TLB SHOOTDOWN #1 addr=0xffffff0446a14000 len=131072
Here I’ve shown three different buffer sizes being freed, and each uses only one shootdown. For 128K buffers this is a factor of 32 reduction—that’s huge. Which is why Matthew says, “The performance benefit will be most noticeable when reaping larger (e.g. 128KB) caches.”
But what about my original question? Are there savings elsewhere? This change, being in the bowels of the HAT layer, should apply to anything that needs to free memory.
Beyond ZFS Performance
Matthew’s change makes no difference in the most common case: freeing one page of virtual memory. The system was already doing the minimal amount of work possible, and there was nothing to be saved. But any free involving two or more contiguous pages stands to gain something from the new code. The question is, how often does that happen? And even more important, how often does it happen compared to the common case of one page?
To answer these questions I wrote another DTrace script that expands on the first one. It tracks all TLB shootdowns, no matter their origin, and collects statistics for each one such as the number of pages, latency, reduction factor, and the kernel stack that initiated the shootdown. By default shootdowns are reported as they happen, and after an hour collection is stopped and statistics are reported. For this particular run I had a large amount of 128K ZIO buffers cached, started a large KVM instance to introduce some memory pressure, and then kicked off a smartos-live build in a native zone.
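Again, the full script is too long for the page, but its skeleton looks something like this: time every hat_unload call (the function at the top of most of the stacks below), convert the unmapped length to pages, and aggregate per-stack page counts along with overall latency. A sketch, assuming the base 4K page size:
#!/usr/sbin/dtrace -s

fbt::hat_unload:entry
{
        self->ts = timestamp;
        self->pages = arg2 / 4096;      /* length of the unmapping, in pages */
}

fbt::hat_unload:return
/self->ts/
{
        /* distribution of pages unmapped per call, keyed by kernel stack */
        @bystack[stack()] = quantize(self->pages);
        /* overall unmap latency in microseconds */
        @latency["unmap latency (us)"] =
            lquantize((timestamp - self->ts) / 1000, 0, 1000, 5);
        self->ts = 0;
        self->pages = 0;
}
Here are the results from the full run.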
# /data/shootdowns -ls -u 5 -t 3h | tee -i /data/shootdowns.log
ADDRESS PAGES SHOOTDOWNS FACTOR LATENCY (us)
0xffffff0ab83c9000 1 1 1 21
0xffffff0b5dd3a000 1 1 1 15
0xffffff09592eb000 1 1 1 20
0xffffff2f3860b000 1731 1 1731 509
0xffffff0905a56000 2 1 2 5
0xffffff0905a37000 1 1 1 11
0xffffff0905a37000 1 1 1 5
0xffffff0905a52000 2 1 2 5
0xffffff09b5981000 1 1 1 17
0xffffff0943a2f000 1 1 1 19
0xffffff2f39393000 34 1 34 23
0xffffff2f393b6000 39 1 39 19
0xffffff2f393de000 42 1 42 20
0xffffff2f39409000 47 1 47 20
0xffffff2f39439000 48 1 48 18
0xffffff2f3946a000 52 1 52 20
0xffffff0b4243e000 1 1 1 19
0xffffff2f3949f000 200 1 200 57
0xffffff097cb74000 1 1 1 19
0xffffff09b2ba6000 1 1 1 21
...
I cherry-picked 20 particular shootdowns to highlight that large frees do occur. Sometimes they are very large, like 1,731 pages large. That particular free, caused by DTrace, took 509μs (half a millisecond), which is an astonishingly long time for a kernel operation.
TOP FIVE STACKS BY PAGES UNMAPPED
The five stacks unmapping the most memory. For each stack the
distribution of pages per memory unmapping is shown. The value
on the left is # of pages unmapped in one call.
unix`hat_unload+0x3e
genunix`segkp_release_internal+0x92
genunix`segkp_release+0xa0
genunix`schedctl_freepage+0x34
genunix`schedctl_proc_cleanup+0x68
genunix`exec_args+0x1e0
elfexec`elf32exec+0x4b0
genunix`gexec+0x347
genunix`exec_common+0x4d5
genunix`exece+0x1b
unix`_sys_sysenter_post_swapgs+0x153
value ------------- Distribution ------------- count
< 1 | 0
1 |██████████████████████████████████████ 190171
2 | 0
This first report is the top five stacks by pages unmapped, showing the kernel stacks responsible for freeing the most memory. For this run, the fifth-largest unmapper was the exec system call. Every instance of exec freed only one page, gaining nothing from Matthew’s change.
genunix`segvn_unmap+0x5bf
genunix`as_unmap+0x19c
genunix`munmap+0x83
unix`_sys_sysenter_post_swapgs+0x153
value ------------- Distribution ------------- count
< 1 | 0
1 |███████████████████████████▉ 2882
2 |██████████████████████████████████████ 3921
3 |███████████████████▏ 1983
4 |███████████▏ 1157
5 |█████████▌ 985
6 |███████▎ 753
7 |██████▋ 696
8 |██████▏ 644
9 |████▉ 508
10 |███████▎ 755
12 |█████▏ 541
14 |████▎ 440
16 |██▉ 302
18 |██▏ 230
20 |█▉ 194
22 |██▏ 222
24 |█▎ 137
26 |█▋ 180
28 |█▊ 185
30 |█▍ 152
32 |█▍ 143
34 |█▎ 130
36 |█▍ 147
38 |█▎ 130
40 |█▍ 144
42 |█ 106
44 |▉ 101
46 |█ 108
48 |▉ 101
50 |█ 104
52 |▋ 74
54 |▋ 69
56 |▋ 66
58 |▌ 52
60 |▋ 69
62 |▌ 64
64 |▌ 61
66 |▍ 43
68 |▎ 32
70 |▎ 27
72 |▍ 45
74 |▎ 37
76 |▏ 23
78 |▎ 34
80 |▎ 31
82 |▎ 32
84 |▎ 29
86 |▏ 24
88 |▏ 18
90 |▏ 24
92 |▏ 15
94 |▏ 24
96 |▏ 18
98 |▏ 16
100 |█▏ 117
120 |▍ 50
140 |▏ 17
160 | 12
180 | 3
200 | 2
220 | 0
240 | 1
260 | 1
280 | 1
300 | 0
320 | 0
340 | 0
360 | 0
380 | 0
400 | 0
420 | 0
440 | 0
460 | 0
480 | 0
500 | 0
520 | 0
540 | 0
560 | 0
580 | 0
600 | 0
620 | 0
640 | 0
660 | 0
680 | 0
700 | 0
720 | 0
740 | 0
760 | 0
780 | 0
800 | 0
820 | 0
840 | 0
860 | 0
880 | 0
900 | 0
920 | 0
940 | 0
960 | 0
980 | 0
>= 1000 | 4
The next stack is far more interesting. It is the result of userland making mmap and munmap system calls. The majority of the distribution lies within 1–3 pages, but there is a long tail along with some outliers. The takeaway is that Matthew’s change may have a positive impact on systems with moderate to heavy userland mmap/munmap load.
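If you’re curious how your own workload behaves, a one-liner over the munmap syscall gives a quick read on how many pages userland hands back per call (arg1 is the length being unmapped; 4K pages assumed):
# dtrace -n 'syscall::munmap:entry { @["pages per munmap"] = quantize(arg1 / 4096); } tick-60s { exit(0); }'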
unix`hat_unload+0x3e
unix`segkmem_free_vn+0x62
unix`segkmem_free+0x23
genunix`vmem_xfree+0xf4
genunix`vmem_free+0x23
genunix`kmem_slab_destroy+0x8d
genunix`kmem_slab_free+0x309
genunix`kmem_magazine_destroy+0x6e
genunix`kmem_depot_ws_reap+0x5d
genunix`taskq_thread+0x2d0
unix`thread_start+0x8
value ------------- Distribution ------------- count
< 1 | 0
1 |█████▌ 5736
2 |▏ 180
3 |▍ 466
4 |██████████████████████████████████████ 38795
5 |▏ 254
6 |▏ 147
7 |▏ 199
8 |▎ 267
9 | 0
10 |▌ 554
12 |▏ 223
14 |▏ 176
16 |▏ 195
18 | 0
20 |▎ 283
22 | 0
24 |▎ 269
26 | 0
28 |▎ 333
30 | 0
32 |█▍ 1493
34 | 0
This stack shows the kmem reaper thread freeing various kmem slab sizes. It has a multimodal distribution, with the majority being frees of 16K (four-page) slabs. Use of the kmem system, which basically applies to the entire kernel, stands to gain from Matthew’s change.
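A quick way to see which caches your own reaper is tearing down is to count kmem_slab_destroy calls per cache. A rough sketch, assuming kmem_slab_destroy isn’t inlined on your build (the stacks above suggest it isn’t) and that its first argument is the cache:
# dtrace -n '
    fbt::kmem_slab_destroy:entry
    {
            /* slabs destroyed, keyed by the owning cache name */
            @slabs[stringof(((kmem_cache_t *)arg0)->cache_name)] = count();
    }
    tick-60s { exit(0); }'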
unix`hat_unload+0x3e
genunix`segkp_release_internal+0x92
genunix`segkp_release+0xa0
genunix`schedctl_freepage+0x34
genunix`schedctl_proc_cleanup+0x68
genunix`proc_exit+0x22c
genunix`exit+0x15
genunix`rexit+0x18
unix`_sys_sysenter_post_swapgs+0x153
value ------------- Distribution ------------- count
< 1 | 0
1 |██████████████████████████████████████ 338814
2 | 0
The stack that unmapped the second-largest amount of memory was caused by process exits. This makes sense because building a large project like smartos-live requires forking many short-lived processes—it looks like over 330,000 of them.
unix`hat_unload+0x3e
unix`segkmem_free_vn+0x62
unix`segkmem_zio_free+0x23
genunix`vmem_xfree+0xf4
genunix`vmem_free+0x23
genunix`kmem_slab_destroy+0x8d
genunix`kmem_slab_free+0x309
genunix`kmem_magazine_destroy+0x6e
genunix`kmem_depot_ws_reap+0x5d
genunix`taskq_thread+0x2d0
unix`thread_start+0x8
value ------------- Distribution ------------- count
< 1 | 0
1 |██████████▌ 56392
2 |██████████████████████████████████████ 203538
3 |█████▎ 28252
4 |█▍ 7858
5 |████▋ 25288
6 |█▋ 8725
7 |██▊ 15013
8 |▉ 5171
9 | 0
10 |█▎ 7255
12 |▉ 5231
14 |▋ 3607
16 |▌ 2773
18 | 0
20 |▊ 4216
22 | 0
24 |▌ 3291
26 | 0
28 |▍ 2152
30 | 0
32 |████████████████████ 107393
34 | 0
The stack that unmapped the most memory was kmem on behalf of ZIO, thanks to the fact that I had a large number of active 128K buffers before starting the smartos-live build. It’s another multimodal distribution, this time including over 100,000 32-page frees. Proof that Matthew’s change is a boon for ZFS performance.
OVERALL LATENCY
The latency distribution for all memory unmappings. The value on the
left is latency in microseconds.
value ------------- Distribution ------------- count
2 | 0
3 | 12
4 |████████▊ 151790
5 |████████▉ 154641
6 |████████▎ 143235
7 |██████▉ 120230
8 |█████▍ 94932
9 |███████▊ 134837
10 |██████████████████████████████████████ 658368
15 |██ 35927
20 |█▏ 21435
25 |▏ 2773
30 | 772
35 | 367
40 | 261
45 | 180
50 | 148
55 | 155
60 | 130
65 | 130
70 | 70
75 | 57
80 | 50
85 | 37
90 | 28
95 | 19
100 | 36
150 | 3
200 | 6
250 | 0
300 | 1
350 | 3
400 | 3
450 | 2
500 | 14
550 | 1
600 | 1
650 | 5
700 | 0
750 | 0
800 | 0
850 | 0
900 | 0
950 | 0
>= 1000 | 9
The next section reports the latency distribution for all shootdowns. Without comparing this to the latency distribution of a similar load on an older SmartOS release it doesn’t really prove anything, but it’s nice to know that most shootdowns now take 10μs or less. It also might be worth looking more closely at the outliers, especially the shootdowns that took over 1ms; that’s a long time.
OVERALL SAVINGS
total pages unmapped 6407340
total shootdowns 1529415
total factor saved 4
The savings distribution for all shootdowns. The value on the left
is the factor saved (# pages / # shootdowns).
value ------------- Distribution ------------- count
< 1 | 0
1 |██████████████████████████████████████ 999581
2 |████████▊ 232952
3 |█▏ 31886
4 |█▊ 47926
5 |█ 26308
6 |▌ 15754
7 |▌ 15671
8 |▏ 5803
9 | 242
10 |▎ 8286
12 |▏ 5799
14 |▏ 5109
16 |▏ 3405
18 | 292
20 |▏ 4737
22 | 190
24 |▏ 3755
26 | 149
28 | 2606
30 | 131
32 |████▏ 109187
34 | 68
36 | 95
38 | 59
40 | 69
42 | 68
44 | 56
46 | 61
48 | 43
50 | 31
52 | 35
54 | 29
56 | 29
58 | 19
60 | 21
62 | 17
64 | 28
66 | 19
68 | 8
70 | 8
72 | 9
74 | 4
76 | 3
78 | 11
80 | 11
82 | 8
84 | 4
86 | 4
88 | 1
90 | 1
92 | 2
94 | 2
96 | 2
98 | 2
100 | 18
120 | 14
140 | 5
160 | 1
180 | 1
200 | 4
220 | 1
240 | 4
260 | 1
280 | 0
300 | 0
320 | 0
340 | 0
360 | 0
380 | 0
400 | 0
420 | 0
440 | 0
460 | 0
480 | 0
500 | 0
520 | 0
540 | 0
560 | 0
580 | 0
600 | 0
620 | 0
640 | 0
660 | 0
680 | 0
700 | 0
720 | 0
740 | 0
760 | 0
780 | 0
800 | 0
820 | 0
840 | 0
860 | 0
880 | 0
900 | 0
920 | 0
940 | 0
960 | 0
980 | 0
>= 1000 | 23
The last section reports savings. Savings is the ratio of pages being freed to the number of shootdowns required to free them. Since a multi-page free covers a contiguous range that the old code would have shot down one page at a time, this ratio also represents how much the new code saves over the old: 6,407,340 pages for 1,529,415 shootdowns, a factor of about four for my particular test. The distribution shows that most shootdowns are for one page and thus save nothing, but it also shows a significant number of multi-page shootdowns that do benefit, enough to pull the overall savings up to a factor of four.
While not conclusive, this data backs Matthew’s original claims of better ZFS latency and hints that other workloads might benefit as well. I’d be curious to see results from other workloads. Please run my shootdowns script and share your results.