
More Efficient TLB Shootdowns

12 Apr 2015

In illumos issue #5498 Matthew Ahrens reported the occurrence of ZFS “latency bubbles” when switching from I/O with large block sizes to I/O with small ones. He traced the bubbles to an excessive number of TLB shootdowns made while freeing ZIO buffer slabs, finding that a shootdown was made for each page in the slab when the same job could be done with just one. His change removes that extraneous cross-core communication and helps avoid the latency bubbles.

As I studied Matthew’s patch I wondered if there was more to this than his simple description. His focus was on ZFS performance, but he changed a very fundamental part of illumos memory management, a part through which all changes to physical memory mappings must travel. I couldn’t help but wonder if other parts of the system might benefit. Surely they must, but how could I tell?

First I needed a good understanding of Matthew’s patch. That specific knowledge would lead me to a more general understanding of the systems involved and how they interact with each other. From there I could then use DTrace to introspect a running system to verify my knowledge and look for effects on other parts of the system. But first I started where Matthew started, with ZFS.

Buffers & Slabs

ZFS makes heavy use of main memory caches to minimize I/O to slower devices like hard drives and solid-state disks. SSDs are fast, but still up to two orders of magnitude slower than DRAM. The entire set of main memory caches is referred to simply as the ARC, standing for Adaptive Replacement Cache because it adapts to varying workloads by using a hybrid MRU-MFU data structure. Underneath the ARC is the ZFS I/O layer, referred to as ZIO. It mediates between the ARC and the underlying storage devices: performing tasks like I/O scheduling, batching, and caching of data. The ZIO layer is what actually caches the data in memory, and the ARC acts as an administrator: determining which buffers are needed and which ones may be dropped from main memory.

These buffers come in many different sizes to accommodate ZFS’s dynamic block size: ranging from small 512-byte blocks to large 128K blocks. Large block support of up to 1M was recently added. To efficiently allocate these variable-sized block buffers the ZIO layer, like many kernel services, uses the kernel slab allocator, or kmem for short. kmem gathers up contiguous ranges of kernel heap into chunks called slabs and provides object-based allocation on top of them. Behind the scenes kmem does the heavy lifting of making sure that each CPU core has access to a set of free objects, called a magazine, and that each of these cores can quickly get more magazines from a local depot when they run out. It’s basically a cache inside a cache inside a cache, giving you fast allocation and deallocation of objects with a malloc-like API.
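If you want to watch this machinery at work, a quick DTrace sketch like the one below counts allocations per kmem cache. I’m assuming here that fbt can instrument kmem_cache_alloc() and that its first argument is the kmem_cache_t, whose cache_name member holds the cache’s name; on the illumos bits I’ve looked at the ZIO caches show up with names like zio_buf_131072, but verify against your own build.

/*
 * Sketch: count kmem allocations by cache name, to watch the ZIO buffer
 * caches at work. Assumes fbt::kmem_cache_alloc:entry is available and
 * that args[0]->cache_name is the cache's name.
 */
fbt::kmem_cache_alloc:entry
{
        @allocs[stringof(args[0]->cache_name)] = count();
}

dtrace:::END
{
        trunc(@allocs, 10);     /* keep only the ten busiest caches */
        printa(@allocs);
}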

As I/O is performed more and more of these ZIO buffer objects are allocated, taking up more and more memory. Over time space will get tight and allocating new buffers will require freeing old ones, which really means freeing slabs. But how?

Virtual Memory & The Page

Regardless of the method used, all allocations must eventually enter the kernel’s virtual memory system. And just like Fight Club, there is one very important rule: you do not allocate anything other than a page. Page size varies across different combinations of hardware and operating system, and some even offer multiple sizes at once, but all of them allocate memory in units of a page.

On x64 the base page size is 4K. Earlier I defined a slab to be a contiguous range of kernel heap, and that heap is backed by pages of memory, thus a slab is a set of one or more contiguous pages. A 128K slab is equivalent to 32 4K pages. And herein lies the crux of Matthew’s fix.

The high-level memory unit maps down to many contiguous low-level memory units.

Freeing a 128K slab means freeing 32 pages. How those pages are freed makes all the difference.

HAT, TLB Shootdowns & xcalls

Up until this point everything discussed deals with virtual memory. As the name suggests this type of memory is not real—it’s virtual, faux, pseudo. A delicate illusion performed by the Hardware Address Translation (HAT) layer to trick each process into believing it owns all the memory. Any time a userland program, the kernel, or any piece of code references a virtual address the HAT layer works with the CPU’s Memory Management Unit to translate it to a physical page address, or as it is more typically called, a page frame. This translation is done by a lookup in the page table, which maps virtual page numbers to physical page frames. To efficiently map the monster address space that is x64, four levels of page tables are actually used. In some cases a single memory reference turns into five accesses: four page-table lookups followed by the access to the actual frame. If this sounds expensive, that’s because it is. Memory access is too common an operation to always take a 5x hit. So the hardware folks created the Translation Lookaside Buffer: a per-core hardware cache that translates an address in one shot.

When a virtual page is freed its mapping must be removed from the page table as well as from any TLB with an entry for that page. Clearing a page in a single core’s TLB is easy enough using the INVLPG instruction, but multicore is a different story. There is no built-in TLB coherency on Intel chips; it is up to the operating system to enforce consistency. This matters here because the ARC lives in the Kernel Address Space (KAS), and since all cores are constantly switching into kernel context we have to assume that any page in KAS is mapped in every core’s TLB. An INVLPG must therefore be sent to all cores—this is referred to as the ol’ TLB shootdown.

In a shootdown one core tells all the other cores to clear a page mapping from their TLB. A common way to implement this is with a cross call, or xcall for short (the formal name is Inter-Processor Interrupt, or IPI). An xcall allows one core to send an interrupt to one or more other cores for the purpose of taking some immediate action. There are several flavors of xcall, but for the purpose of a TLB shootdown you can loosely think of it like a transaction in a distributed database, except with a really fast and reliable network. The core freeing the page sends an interrupt to all the other cores, notifying them to clear the mapping as well, and then waits for all of them to reply before continuing. The callees, however, are allowed to continue immediately after they clear the page and send their reply; they don’t have to wait along with the core that originated the request.

As you might imagine this is a relatively expensive operation to perform. Not only is one core blocked for the entire duration but all the other cores are interrupted as well. With a typical latency of about 8us on my machine, it approaches two orders of magnitude more than a DRAM access—roughly 28,000 cycles on my 3.5GHz chip. A cost worth avoiding when possible.
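That 8us number came from timing the cross-call path with DTrace. A rough sketch of the idea, assuming xc_call() is the routine the shootdown path funnels through on your build (worth double-checking against the kernel source), looks like this:

/*
 * Sketch: caller-side cross-call latency. xc_call() as the cross-call
 * routine is an assumption; adjust the probe to your build.
 */
fbt::xc_call:entry
{
        self->ts = timestamp;
}

fbt::xc_call:return
/self->ts/
{
        @["xcall latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}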

Page vs. Range

To recap, ZFS caches data in the ARC. The ARC is a collection of ZIO buffers. These buffers are backed by the kernel memory slab allocator. Each slab is a contiguous set of one or more pages. When a slab is freed, its pages must be unmapped from the page tables and from the TLB. Since these pages live in KAS they are potentially mapped in all TLBs and thus a shootdown must be performed. Each shootdown is a blocking xcall that interrupts all other cores on the machine. This is done for each page, one at a time.

If a 128K slab is made up of 32 pages then it will require 32 individual shootdowns, each of which interrupts every core on the machine.

Ouch.

The fix is simple and relies on one key fact: the pages in a slab are always contiguous, and can therefore be unmapped with one range-based shootdown that covers all of them at once. An INVLPG instruction is still executed for each page at the hardware level; the difference is that the entire range to unmap is now passed in a single xcall. That is, shootdowns are now performed in batch by contiguous range instead of page by page.

Seeing is Believing

Talking about performance is cool, but ya’ know what’s really cool? Verifying performance improvements with DTrace. So I wrote a script that reports ZIO buffer frees as they occur: showing the base address and length of the slab being freed, and any shootdowns executed.
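The real zio-shootdowns script has more plumbing than I care to paste here, but the core idea boils down to something like the sketch below. The probe names and argument meanings, segkmem_zio_free(vmem, addr, size) and hat_tlb_inval(hat, va), are my assumptions (segkmem_zio_free is the function you’ll see in the stacks later on), so treat this as illustrative:

#!/usr/sbin/dtrace -s
/*
 * Sketch of the idea behind zio-shootdowns: report each ZIO buffer free
 * and count the TLB invalidations issued while it is in progress.
 * Probe names and argument layouts are assumptions; the real script
 * does more bookkeeping.
 */
#pragma D option quiet

fbt::segkmem_zio_free:entry
{
        self->infree = 1;
        self->n = 0;
        printf("ZIO FREE addr=0x%x len=%d\n", arg1, arg2);
}

fbt::hat_tlb_inval:entry
/self->infree/
{
        self->n++;
        printf("  TLB SHOOTDOWN #%d addr=0x%x\n", self->n, arg1);
}

fbt::segkmem_zio_free:return
{
        self->infree = 0;
}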

# /data/zio-shootdowns
ZIO FREE addr=0xffffff0047d68000 len=131072
  TLB SHOOTDOWN #1  addr=0xffffff0047d68000 len=4096
  TLB SHOOTDOWN #2  addr=0xffffff0047d69000 len=4096
  TLB SHOOTDOWN #3  addr=0xffffff0047d6a000 len=4096
  TLB SHOOTDOWN #4  addr=0xffffff0047d6b000 len=4096
  TLB SHOOTDOWN #5  addr=0xffffff0047d6c000 len=4096
  TLB SHOOTDOWN #6  addr=0xffffff0047d6d000 len=4096
  TLB SHOOTDOWN #7  addr=0xffffff0047d6e000 len=4096
  TLB SHOOTDOWN #8  addr=0xffffff0047d6f000 len=4096
  TLB SHOOTDOWN #9  addr=0xffffff0047d70000 len=4096
  TLB SHOOTDOWN #10 addr=0xffffff0047d71000 len=4096
  TLB SHOOTDOWN #11 addr=0xffffff0047d72000 len=4096
  TLB SHOOTDOWN #12 addr=0xffffff0047d73000 len=4096
  TLB SHOOTDOWN #13 addr=0xffffff0047d74000 len=4096
  TLB SHOOTDOWN #14 addr=0xffffff0047d75000 len=4096
  TLB SHOOTDOWN #15 addr=0xffffff0047d76000 len=4096
  TLB SHOOTDOWN #16 addr=0xffffff0047d77000 len=4096
  TLB SHOOTDOWN #17 addr=0xffffff0047d78000 len=4096
  TLB SHOOTDOWN #18 addr=0xffffff0047d79000 len=4096
  TLB SHOOTDOWN #19 addr=0xffffff0047d7a000 len=4096
  TLB SHOOTDOWN #20 addr=0xffffff0047d7b000 len=4096
  TLB SHOOTDOWN #21 addr=0xffffff0047d7c000 len=4096
  TLB SHOOTDOWN #22 addr=0xffffff0047d7d000 len=4096
  TLB SHOOTDOWN #23 addr=0xffffff0047d7e000 len=4096
  TLB SHOOTDOWN #24 addr=0xffffff0047d7f000 len=4096
  TLB SHOOTDOWN #25 addr=0xffffff0047d80000 len=4096
  TLB SHOOTDOWN #26 addr=0xffffff0047d81000 len=4096
  TLB SHOOTDOWN #27 addr=0xffffff0047d82000 len=4096
  TLB SHOOTDOWN #28 addr=0xffffff0047d83000 len=4096
  TLB SHOOTDOWN #29 addr=0xffffff0047d84000 len=4096
  TLB SHOOTDOWN #30 addr=0xffffff0047d85000 len=4096
  TLB SHOOTDOWN #31 addr=0xffffff0047d86000 len=4096
  TLB SHOOTDOWN #32 addr=0xffffff0047d87000 len=4096

Above is the result of freeing a 128K ZIO buffer on a version of SmartOS from before Matthew’s change. As expected, freeing this buffer required 32 shootdowns, one for each page (sometimes fewer, depending on the TLB state). Notice the address of each shootdown increments by 4K, the page size. What about the latest SmartOS?

# /data/zio-shootdowns
ZIO FREE addr=0xffffff0209885000 len=12288
  TLB SHOOTDOWN #1 addr=0xffffff0209885000 len=12288
ZIO FREE addr=0xffffff02475a1000 len=20480
  TLB SHOOTDOWN #1 addr=0xffffff02475a1000 len=20480
ZIO FREE addr=0xffffff0446a14000 len=131072
  TLB SHOOTDOWN #1 addr=0xffffff0446a14000 len=131072

Here I’ve shown three different buffer sizes being freed, and each uses only one shootdown. For 128K buffers this is a factor of 32 reduction—that’s huge. It’s why Matthew says, “The performance benefit will be most noticeable when reaping larger (e.g. 128KB) caches.”

But what about my original question? Are there savings elsewhere? This change, being in the bowels of the HAT layer, should apply to anything that needs to free memory.

Beyond ZFS Performance

Matthew’s change makes no difference in the most common case: freeing one page of virtual memory. The system was already doing the minimal amount of work possible, and there was nothing to be saved. But any free involving two or more contiguous pages stands to gain something from the new code. The question is, how often does that happen? And even more important, how often does it happen compared to the common case of one page?

To answer these questions I wrote another DTrace script that expands on the first one. It tracks all TLB shootdowns, no matter their origin, and collects statistics for each one such as the number of pages, latency, reduction factor, and the kernel stack that initiated the shootdown. By default shootdowns are reported as they happen; after an hour collection stops and statistics are reported. For this particular run I had a large number of 128K ZIO buffers cached, started a large KVM instance to introduce some memory pressure, and then kicked off a smartos-live build in a native zone. Here are the results.
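Stripped down to its essence, the heart of that script looks something like the sketch below: for every hat_unload() it records how many base pages were in the range and how long the unload took, keyed by the calling kernel stack. I’m keying off hat_unload() because it is the entry point in all the stacks shown later, and the argument layout, hat_unload(hat, addr, len, flags), is my assumption; the real script also tracks shootdown counts to compute the reduction factor.

/*
 * Sketch of the core of the shootdowns script: pages unmapped and
 * latency per hat_unload(), keyed by kernel stack. Argument layout
 * assumed to be hat_unload(hat, addr, len, flags).
 */
fbt::hat_unload:entry
{
        self->ts = timestamp;
        self->pages = arg2 / 4096;      /* length in bytes -> 4K pages */
}

fbt::hat_unload:return
/self->ts/
{
        @unmapped[stack()] = quantize(self->pages);
        @latency["unload latency (us)"] = quantize((timestamp - self->ts) / 1000);
        self->ts = 0;
        self->pages = 0;
}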

/data/shootdowns -ls -u 5 -t 3h | tee -i /data/shootdowns.log

ADDRESS              PAGES  SHOOTDOWNS  FACTOR  LATENCY (us)
0xffffff0ab83c9000       1           1       1            21
0xffffff0b5dd3a000       1           1       1            15
0xffffff09592eb000       1           1       1            20
0xffffff2f3860b000    1731           1    1731           509
0xffffff0905a56000       2           1       2             5
0xffffff0905a37000       1           1       1            11
0xffffff0905a37000       1           1       1             5
0xffffff0905a52000       2           1       2             5
0xffffff09b5981000       1           1       1            17
0xffffff0943a2f000       1           1       1            19
0xffffff2f39393000      34           1      34            23
0xffffff2f393b6000      39           1      39            19
0xffffff2f393de000      42           1      42            20
0xffffff2f39409000      47           1      47            20
0xffffff2f39439000      48           1      48            18
0xffffff2f3946a000      52           1      52            20
0xffffff0b4243e000       1           1       1            19
0xffffff2f3949f000     200           1     200            57
0xffffff097cb74000       1           1       1            19
0xffffff09b2ba6000       1           1       1            21
...

I cherry-picked 20 particular shootdowns to highlight that large frees do occur. Sometimes they are very large, like 1,731 pages large. That particular free, caused by DTrace, took 509μs (half a millisecond), which is an astonishingly long time for a kernel operation.

TOP FIVE STACKS BY PAGES UNMAPPED

The five stacks unmapping the most memory. For each stack the distribution of pages per memory unmapping is shown. The value on the left is the # of pages unmapped in one call.

        unix`hat_unload+0x3e
        genunix`segkp_release_internal+0x92
        genunix`segkp_release+0xa0
        genunix`schedctl_freepage+0x34
        genunix`schedctl_proc_cleanup+0x68
        genunix`exec_args+0x1e0
        elfexec`elf32exec+0x4b0
        genunix`gexec+0x347
        genunix`exec_common+0x4d5
        genunix`exece+0x1b
        unix`_sys_sysenter_post_swapgs+0x153

   value  ------------- Distribution ------------- count
     < 1 | 0
       1 |██████████████████████████████████████ 190171
       2 | 0

This first report is the top-five-stacks-by-pages-unmapped report, which shows the kernel stacks that caused the most memory to be freed. For this run, the fifth-largest memory freer was the exec system call. Every instance of exec freed only one page, gaining nothing from Matthew’s change.

        genunix`segvn_unmap+0x5bf
        genunix`as_unmap+0x19c
        genunix`munmap+0x83
        unix`_sys_sysenter_post_swapgs+0x153

   value  ------------- Distribution ------------- count
     < 1 | 0
       1 |███████████████████████████▉ 2882
       2 |██████████████████████████████████████ 3921
       3 |███████████████████▏ 1983
       4 |███████████▏ 1157
       5 |█████████▌ 985
       6 |███████▎ 753
       7 |██████▋ 696
       8 |██████▏ 644
       9 |████▉ 508
      10 |███████▎ 755
      12 |█████▏ 541
      14 |████▎ 440
      16 |██▉ 302
      18 |██▏ 230
      20 |█▉ 194
      22 |██▏ 222
      24 |█▎ 137
      26 |█▋ 180
      28 |█▊ 185
      30 |█▍ 152
      32 |█▍ 143
      34 |█▎ 130
      36 |█▍ 147
      38 |█▎ 130
      40 |█▍ 144
      42 |█ 106
      44 |▉ 101
      46 |█ 108
      48 |▉ 101
      50 |█ 104
      52 |▋ 74
      54 |▋ 69
      56 |▋ 66
      58 |▌ 52
      60 |▋ 69
      62 |▌ 64
      64 |▌ 61
      66 |▍ 43
      68 |▎ 32
      70 |▎ 27
      72 |▍ 45
      74 |▎ 37
      76 |▏ 23
      78 |▎ 34
      80 |▎ 31
      82 |▎ 32
      84 |▎ 29
      86 |▏ 24
      88 |▏ 18
      90 |▏ 24
      92 |▏ 15
      94 |▏ 24
      96 |▏ 18
      98 |▏ 16
     100 |█▏ 117
     120 |▍ 50
     140 |▏ 17
     160 | 12
     180 | 3
     200 | 2
     220 | 0
     240 | 1
     260 | 1
     280 | 1
     300 | 0
     320 | 0
     340 | 0
     360 | 0
     380 | 0
     400 | 0
     420 | 0
     440 | 0
     460 | 0
     480 | 0
     500 | 0
     520 | 0
     540 | 0
     560 | 0
     580 | 0
     600 | 0
     620 | 0
     640 | 0
     660 | 0
     680 | 0
     700 | 0
     720 | 0
     740 | 0
     760 | 0
     780 | 0
     800 | 0
     820 | 0
     840 | 0
     860 | 0
     880 | 0
     900 | 0
     920 | 0
     940 | 0
     960 | 0
     980 | 0
 >= 1000 | 4

The next stack is far more interesting. It is the result of userland making mmap and munmap system calls. The majority of the distribution lies within 1–3 pages but there is a long tail along with some outliers. The takeaway is that Matthew’s change may have a positive impact on systems with moderate to heavy userland mmap/munmap load.
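If you’re wondering whether your own workload falls into this category, a one-clause DTrace sketch like this shows the distribution of pages per munmap, per process (assuming the usual munmap(addr, len) argument order):

/*
 * Quick check: distribution of pages per munmap(2), per process.
 * Assumes arg1 is the length of the mapping being torn down.
 */
syscall::munmap:entry
{
        @[execname] = quantize(arg1 / 4096);
}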

        unix`hat_unload+0x3e
        unix`segkmem_free_vn+0x62
        unix`segkmem_free+0x23
        genunix`vmem_xfree+0xf4
        genunix`vmem_free+0x23
        genunix`kmem_slab_destroy+0x8d
        genunix`kmem_slab_free+0x309
        genunix`kmem_magazine_destroy+0x6e
        genunix`kmem_depot_ws_reap+0x5d
        genunix`taskq_thread+0x2d0
        unix`thread_start+0x8

   value  ------------- Distribution ------------- count
     < 1 | 0
       1 |█████▌ 5736
       2 |▏ 180
       3 |▍ 466
       4 |██████████████████████████████████████ 38795
       5 |▏ 254
       6 |▏ 147
       7 |▏ 199
       8 |▎ 267
       9 | 0
      10 |▌ 554
      12 |▏ 223
      14 |▏ 176
      16 |▏ 195
      18 | 0
      20 |▎ 283
      22 | 0
      24 |▎ 269
      26 | 0
      28 |▎ 333
      30 | 0
      32 |█▍ 1493
      34 | 0

This stack shows the kmem reaper thread freeing various kmem slab sizes. It has a multimodal distribution with the majority being frees of 16K slabs. Use of the kmem system, which basically applies to the entire kernel, stands to gain from Matthew’s change.

        unix`hat_unload+0x3e
        genunix`segkp_release_internal+0x92
        genunix`segkp_release+0xa0
        genunix`schedctl_freepage+0x34
        genunix`schedctl_proc_cleanup+0x68
        genunix`proc_exit+0x22c
        genunix`exit+0x15
        genunix`rexit+0x18
        unix`_sys_sysenter_post_swapgs+0x153

   value  ------------- Distribution ------------- count
     < 1 | 0
       1 |██████████████████████████████████████ 338814
       2 | 0

The stack that unloaded the second-largest amount of memory was caused by process exits. This makes sense because building a large project like smartos-live requires forking many short-lived processes—it looks like over 330,000 of them.

        unix`hat_unload+0x3e
        unix`segkmem_free_vn+0x62
        unix`segkmem_zio_free+0x23
        genunix`vmem_xfree+0xf4
        genunix`vmem_free+0x23
        genunix`kmem_slab_destroy+0x8d
        genunix`kmem_slab_free+0x309
        genunix`kmem_magazine_destroy+0x6e
        genunix`kmem_depot_ws_reap+0x5d
        genunix`taskq_thread+0x2d0
        unix`thread_start+0x8

   value  ------------- Distribution ------------- count
     < 1 | 0
       1 |██████████▌ 56392
       2 |██████████████████████████████████████ 203538
       3 |█████▎ 28252
       4 |█▍ 7858
       5 |████▋ 25288
       6 |█▋ 8725
       7 |██▊ 15013
       8 |▉ 5171
       9 | 0
      10 |█▎ 7255
      12 |▉ 5231
      14 |▋ 3607
      16 |▌ 2773
      18 | 0
      20 |▊ 4216
      22 | 0
      24 |▌ 3291
      26 | 0
      28 |▍ 2152
      30 | 0
      32 |████████████████████ 107393
      34 | 0

The stack that unmapped the most memory was kmem on behalf of ZIO, thanks to the fact that I had a large number of active 128K buffers before starting the smartos-live build. It’s another multimodal distribution, and it includes over 100,000 large frees. Strong evidence that Matthew’s change is a boon for ZFS performance.

OVERALL LATENCY

The latency distribution for all memory unmappings. The value on the left is latency in microseconds.

   value  ------------- Distribution ------------- count
       2 | 0
       3 | 12
       4 |████████▊ 151790
       5 |████████▉ 154641
       6 |████████▎ 143235
       7 |██████▉ 120230
       8 |█████▍ 94932
       9 |███████▊ 134837
      10 |██████████████████████████████████████ 658368
      15 |██ 35927
      20 |█▏ 21435
      25 |▏ 2773
      30 | 772
      35 | 367
      40 | 261
      45 | 180
      50 | 148
      55 | 155
      60 | 130
      65 | 130
      70 | 70
      75 | 57
      80 | 50
      85 | 37
      90 | 28
      95 | 19
     100 | 36
     150 | 3
     200 | 6
     250 | 0
     300 | 1
     350 | 3
     400 | 3
     450 | 2
     500 | 14
     550 | 1
     600 | 1
     650 | 5
     700 | 0
     750 | 0
     800 | 0
     850 | 0
     900 | 0
     950 | 0
 >= 1000 | 9

The next section reports the latency distribution for all shootdowns. Without comparing it to the latency distribution of a similar load on an older SmartOS release it doesn’t really prove anything, but it’s nice to know that most shootdowns now take 10μs or less. It might also be worth looking more closely at the outliers, especially the shootdowns that took over 1ms; that’s a long time.
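If I were to chase those outliers I’d probably start by grabbing the stacks behind any unload that takes more than a millisecond, along these lines (same hat_unload() assumptions as before):

/*
 * Sketch: capture the kernel stacks behind unusually slow unloads,
 * here anything over 1ms. Same hat_unload() assumptions as above.
 */
fbt::hat_unload:entry
{
        self->ts = timestamp;
}

fbt::hat_unload:return
/self->ts && (timestamp - self->ts) > 1000000/
{
        @slow[stack()] = count();
}

fbt::hat_unload:return
/self->ts/
{
        self->ts = 0;
}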

OVERALL SAVINGS

total pages unmapped   6407340
total shootdowns       1529415
total factor saved     4

The savings distribution for all shootdowns. The value on the left is the factor saved (# pages / # shootdowns).

   value  ------------- Distribution ------------- count
     < 1 | 0
       1 |██████████████████████████████████████ 999581
       2 |████████▊ 232952
       3 |█▏ 31886
       4 |█▊ 47926
       5 |█ 26308
       6 |▌ 15754
       7 |▌ 15671
       8 |▏ 5803
       9 | 242
      10 |▎ 8286
      12 |▏ 5799
      14 |▏ 5109
      16 |▏ 3405
      18 | 292
      20 |▏ 4737
      22 | 190
      24 |▏ 3755
      26 | 149
      28 | 2606
      30 | 131
      32 |████▏ 109187
      34 | 68
      36 | 95
      38 | 59
      40 | 69
      42 | 68
      44 | 56
      46 | 61
      48 | 43
      50 | 31
      52 | 35
      54 | 29
      56 | 29
      58 | 19
      60 | 21
      62 | 17
      64 | 28
      66 | 19
      68 | 8
      70 | 8
      72 | 9
      74 | 4
      76 | 3
      78 | 11
      80 | 11
      82 | 8
      84 | 4
      86 | 4
      88 | 1
      90 | 1
      92 | 2
      94 | 2
      96 | 2
      98 | 2
     100 | 18
     120 | 14
     140 | 5
     160 | 1
     180 | 1
     200 | 4
     220 | 1
     240 | 4
     260 | 1
     280 | 0
     300 | 0
     320 | 0
     340 | 0
     360 | 0
     380 | 0
     400 | 0
     420 | 0
     440 | 0
     460 | 0
     480 | 0
     500 | 0
     520 | 0
     540 | 0
     560 | 0
     580 | 0
     600 | 0
     620 | 0
     640 | 0
     660 | 0
     680 | 0
     700 | 0
     720 | 0
     740 | 0
     760 | 0
     780 | 0
     800 | 0
     820 | 0
     840 | 0
     860 | 0
     880 | 0
     900 | 0
     920 | 0
     940 | 0
     960 | 0
     980 | 0
 >= 1000 | 23

The last section reports savings. Savings is the ratio of pages freed to the number of shootdowns required to free them. Since the new code issues one shootdown per contiguous range, this ratio also measures how much the new code saves over the old per-page code: a factor of four for my particular test. The distribution shows that most shootdowns are for a single page and thus save nothing, but it also shows a significant number of multi-page shootdowns that do benefit, which is where the overall factor of four comes from.

While not conclusive, this data backs Matthew’s original claims of better ZFS latency and hints that other workloads might benefit as well. I’d be curious to see results from other workloads. Please run my shootdowns script and share your results.