More Efficient TLB Shootdowns
12 Apr 2015
In illumos issue #5498 Matthew Ahrens reported the occurrence of ZFS “latency bubbles” when switching from I/O with large block sizes to I/O with small ones. He traced the bubbles to an excessive number of TLB shootdowns made while freeing ZIO buffer slabs, finding that a shootdown was made for each page in the slab when the same job could be done with just one. This change removes much extraneous cross-core communication and helps avoid the latency bubbles.
As I studied Matthew’s patch I wondered if there was more to this than his simple description. His focus was on ZFS performance but he changed a very primal part of illumos memory management. A part through which all physical memory changes must travel. I couldn’t help but wonder if other parts of the system might benefit. Surely they must, but how could I tell?
First I needed a good understanding of Matthew’s patch. That specific knowledge would lead me to a more general understanding of the systems involved and how they interact with each other. From there I could then use DTrace to introspect a running system to verify my knowledge and look for effects on other parts of the system. But first I started where Matthew started, with ZFS.
Buffers & Slabs
ZFS makes heavy use of main memory caches to minimize I/O to slower devices like hard drives and solid-state disks. SSDs are fast, but still up to two orders of magnitude slower than DRAM. The entire set of main memory caches is referred to simply as the ARC, standing for Adaptive Replacement Cache because it adapts to varying workloads by using a hybrid MRU-MFU data structure. Underneath the ARC is the ZFS I/O layer, referred to as ZIO. It mediates between the ARC and the underlying storage devices: performing tasks like I/O scheduling, batching, and caching of data. The ZIO layer is what actually caches the data in memory, and the ARC acts as an administrator: determining which buffers are needed and which ones may be dropped from main memory.
These buffers come in many different sizes to accommodate ZFS’s dynamic block size: ranging from small 512-byte blocks to large 128K blocks. Large block support of up to 1M was recently added. To efficiently allocate these variable-sized buffers the ZIO layer, like many kernel services, uses the kernel slab allocator, or kmem for short. kmem gathers up contiguous ranges of kernel heap into chunks called slabs and provides object-based allocation on top of them. Behind the scenes kmem does the heavy lifting of making sure that each CPU core has access to a set of free objects, called a magazine, and that each core can quickly get more magazines from a local depot when it runs out. It’s basically a cache inside a cache inside a cache, giving you fast allocation and deallocation of objects with a malloc-like API.
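If you want to see which kmem caches are hot on your own system, a quick DTrace aggregation over kmem_cache_alloc will list them by name. This is just a throwaway one-liner, assuming the cache_name member is visible through CTF; the ZIO buffer caches show up as zio_buf_<size> and zio_data_buf_<size>:
# dtrace -n '
    fbt::kmem_cache_alloc:entry
    {
            /* count allocations per kmem cache, keyed by cache name */
            @allocs[stringof(args[0]->cache_name)] = count();
    }
    tick-10s { exit(0); }'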
As I/O is performed more and more of these ZIO buffer objects are allocated, taking up more and more memory. Over time space will get tight and allocating new buffers will require freeing old ones, which really means freeing slabs. But how?
Virtual Memory & The Page
Regardless of the method used, all allocations must eventually enter the kernel’s virtual memory system. And just like Fight Club, there is one very important rule: you do not allocate anything other than a page. Page size varies across different combinations of hardware and operating system, and some even offer multiple sizes at once, but all of them allocate memory in units of a page.
On x64 the base page size is 4K. Earlier I defined a slab to be a contiguous range of kernel heap, and that heap is backed by pages of memory, thus a slab is a set of one or more contiguous pages. A 128K slab is equivalent to 32 4K pages. And herein lies the crux of Matthew’s fix.
The high-level memory unit maps down to many contiguous low-level memory units.
Freeing a 128K slab means freeing 32 pages. How those pages are freed makes all the difference.
HAT, TLB Shootdowns & xcalls
Up until this point everything discussed deals with virtual memory. As the name suggests this type of memory is not real—it’s virtual, faux, pseudo. A delicate illusion performed by the Hardware Address Translation layer to trick the process into believing it owns all the memory. Any time a userland program, the kernel, or any piece of code references a virtual address the HAT layer communicates with the CPU’s Memory Management Unit to translate it to a physical page address, or as it is more typically called, a page frame. This translation is done by a lookup in the page table which maps virtual page numbers to physical page frames. To efficiently map the monster address space that is x64, four levels of page tables are actually used. In some cases a single lookup turns into five lookups: four page table lookups followed by the final lookup of the actual frame. If this sounds expensive that’s because it is. Memory access is too common an operation to always take a 5x hit. So the hardware folks created the Translation Lookaside Buffer: a per-core hardware cache that translates an address in one shot.
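To make those four lookups concrete, here is how a 64-bit virtual address splits into four 9-bit table indexes plus a 12-bit page offset. The address is just an example, borrowed from the ZIO output later in this post, and the one-liner is nothing but arithmetic:
# dtrace -qn 'BEGIN {
        va = (uint64_t)0xffffff0047d68000;
        /* 9 bits per table level, 12-bit offset within the 4K page */
        printf("PML4 %d  PDPT %d  PD %d  PT %d  offset 0x%03x\n",
            (int)((va >> 39) & 0x1ff), (int)((va >> 30) & 0x1ff),
            (int)((va >> 21) & 0x1ff), (int)((va >> 12) & 0x1ff),
            (int)(va & 0xfff));
        exit(0);
}'
Each of those four indexes costs a memory access before the final frame can be touched, which is exactly the cost the TLB exists to avoid.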
When a virtual page is freed its mapping must be removed from the page table as well as from any TLB with an entry for that page. Clearing a page from a single core’s TLB is easy enough using the INVLPG instruction, but multicore is a different story. There is no built-in TLB coherency on Intel chips; it is up to the operating system to enforce consistency. This matters here because the ARC lives in the Kernel Address Space, and since all cores are constantly switching into kernel context we have to assume that any page in KAS is mapped in every TLB, and therefore an INVLPG must be sent to all cores—this is referred to as the ol’ TLB shootdown.
In a shootdown one core tells all the other cores to clear a page mapping from their TLBs. A common way to implement this is with a cross call, or xcall for short (the formal name is Inter-Processor Interrupt, or IPI). An xcall allows one core to send an interrupt to one or more other cores for the purpose of taking some immediate action. There are several flavors of xcall, but for the purpose of a TLB shootdown you can loosely think of it like a transaction in a distributed database, except with a really fast and reliable network. The core freeing the page sends out an interrupt to all the other cores, notifying them to free the page as well, and then waits for all of them to reply before continuing. The callees, however, are allowed to continue immediately after they clear the page and send their reply; they don’t have to wait along with the core that originated the request.
As you might imagine this is a relatively expensive operation to perform. Not only is one core blocked for the entire duration, but all the other cores are interrupted as well. With a typical latency of about 8μs on my machine, it approaches two orders of magnitude more than a DRAM access—roughly 28,000 cycles on my 3.5GHz chip. A cost worth avoiding when possible.
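You can watch this cost accumulate on any illumos system: the sysinfo provider fires a probe each time a CPU sends cross calls, and aggregating on the kernel stack shows who is generating them. This is a generic snippet, not specific to TLB shootdowns:
# dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); } tick-30s { exit(0); }'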
Page vs. Range
To recap: ZFS caches data in the ARC. The ARC is a collection of ZIO buffers. These buffers are backed by the kernel memory slab allocator. Each slab is a contiguous set of one or more pages. When a slab is freed, its pages must be unmapped from the page table and the TLB. Since these pages live in KAS they are potentially mapped in all TLBs, so a shootdown must be performed. Each shootdown is a blocking xcall that interrupts all other cores on the machine. This is done for each page, one at a time.
If a 128K slab is made up of 32 pages then it will require 32 individual shootdowns, each of which interrupts every core on the machine.
Ouch.
The fix is simple and relies on one key fact: the pages in a slab are always contiguous and can therefore be unmapped with one range-based shootdown, removing all the mappings at once. An INVLPG instruction is still executed for each page at the hardware level; the difference is that the entire range to unmap is now passed in a single xcall. That is, shootdowns are now performed in batch by contiguous range instead of page by page.
Seeing is Believing
Talking about performance is cool, but ya know what’s really cool? Verifying performance improvements with DTrace. So I wrote a script that reports ZIO buffer frees as they occur, showing the base address and length of the slab being freed and any shootdowns executed.
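The script itself is too long to reproduce here, but the core idea is easy to sketch: trace each ZIO buffer free and count the cross calls sent while it is in progress. This is a rough approximation, not the real script; it assumes segkmem_zio_free is the path ZIO slabs take back to the VM system (the stack traces later in this post show it is) and uses sysinfo:::xcalls as a stand-in for individual shootdowns:
#!/usr/sbin/dtrace -s

#pragma D option quiet

fbt::segkmem_zio_free:entry
{
        self->addr = arg1;      /* slab base address */
        self->len = arg2;       /* slab length in bytes */
        self->xc = 0;
}

sysinfo:::xcalls
/self->len/
{
        /* cross-call stat bumps made by this thread during the free */
        self->xc++;
}

fbt::segkmem_zio_free:return
/self->len/
{
        printf("ZIO FREE addr=0x%p len=%d xcalls=%d\n",
            (void *)self->addr, self->len, self->xc);
        self->addr = 0;
        self->len = 0;
        self->xc = 0;
}
The full script goes a step further and prints every individual shootdown as it happens, which is what the output below shows.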
# /data/zio-shootdowns
ZIO FREE addr=0xffffff0047d68000 len=131072
TLB SHOOTDOWN #1 addr=0xffffff0047d68000 len=4096
TLB SHOOTDOWN #2 addr=0xffffff0047d69000 len=4096
TLB SHOOTDOWN #3 addr=0xffffff0047d6a000 len=4096
TLB SHOOTDOWN #4 addr=0xffffff0047d6b000 len=4096
TLB SHOOTDOWN #5 addr=0xffffff0047d6c000 len=4096
TLB SHOOTDOWN #6 addr=0xffffff0047d6d000 len=4096
TLB SHOOTDOWN #7 addr=0xffffff0047d6e000 len=4096
TLB SHOOTDOWN #8 addr=0xffffff0047d6f000 len=4096
TLB SHOOTDOWN #9 addr=0xffffff0047d70000 len=4096
TLB SHOOTDOWN #10 addr=0xffffff0047d71000 len=4096
TLB SHOOTDOWN #11 addr=0xffffff0047d72000 len=4096
TLB SHOOTDOWN #12 addr=0xffffff0047d73000 len=4096
TLB SHOOTDOWN #13 addr=0xffffff0047d74000 len=4096
TLB SHOOTDOWN #14 addr=0xffffff0047d75000 len=4096
TLB SHOOTDOWN #15 addr=0xffffff0047d76000 len=4096
TLB SHOOTDOWN #16 addr=0xffffff0047d77000 len=4096
TLB SHOOTDOWN #17 addr=0xffffff0047d78000 len=4096
TLB SHOOTDOWN #18 addr=0xffffff0047d79000 len=4096
TLB SHOOTDOWN #19 addr=0xffffff0047d7a000 len=4096
TLB SHOOTDOWN #20 addr=0xffffff0047d7b000 len=4096
TLB SHOOTDOWN #21 addr=0xffffff0047d7c000 len=4096
TLB SHOOTDOWN #22 addr=0xffffff0047d7d000 len=4096
TLB SHOOTDOWN #23 addr=0xffffff0047d7e000 len=4096
TLB SHOOTDOWN #24 addr=0xffffff0047d7f000 len=4096
TLB SHOOTDOWN #25 addr=0xffffff0047d80000 len=4096
TLB SHOOTDOWN #26 addr=0xffffff0047d81000 len=4096
TLB SHOOTDOWN #27 addr=0xffffff0047d82000 len=4096
TLB SHOOTDOWN #28 addr=0xffffff0047d83000 len=4096
TLB SHOOTDOWN #29 addr=0xffffff0047d84000 len=4096
TLB SHOOTDOWN #30 addr=0xffffff0047d85000 len=4096
TLB SHOOTDOWN #31 addr=0xffffff0047d86000 len=4096
TLB SHOOTDOWN #32 addr=0xffffff0047d87000 len=4096
Above is the result of freeing a 128K ZIO buffer on a version of SmartOS from before Matthew’s change. As expected, freeing this buffer required 32 shootdowns, one for each page (sometimes fewer, depending on the TLB state). Notice that the address of each shootdown increments by 4K, the page size. What about the latest SmartOS?
# /data/zio-shootdowns
ZIO FREE addr=0xffffff0209885000 len=12288
TLB SHOOTDOWN #1 addr=0xffffff0209885000 len=12288
ZIO FREE addr=0xffffff02475a1000 len=20480
TLB SHOOTDOWN #1 addr=0xffffff02475a1000 len=20480
ZIO FREE addr=0xffffff0446a14000 len=131072
TLB SHOOTDOWN #1 addr=0xffffff0446a14000 len=131072
Here I’ve shown three different buffer sizes being freed, and each uses only one shootdown. For 128K buffers this is a factor of 32 reduction—that’s huge. Which is why Matthew says, “The performance benefit will be most noticeable when reaping larger (e.g. 128KB) caches.”
But what about my original question? Are there savings elsewhere? This change, being in the bowels of the HAT layer, should apply to anything that needs to free memory.
Beyond ZFS Performance
Matthew’s change makes no difference in the most common case: freeing one page of virtual memory. The system was already doing the minimal amount of work possible, and there was nothing to be saved. But any free involving two or more contiguous pages stands to gain something from the new code. The question is, how often does that happen? And even more important, how often does it happen compared to the common case of one page?
To answer these questions I wrote another DTrace script that expands on the first one. It tracks all TLB shootdowns, no matter their origin, and collects statistics for each one such as the number of pages, latency, reduction factor, and the kernel stack that initiated the shootdown. By default shootdowns are reported as they happen, and after an hour collection is stopped and statistics are reported. For this particular run I had a large amount of 128K ZIO buffers cached, started a large KVM instance to introduce some memory pressure, and then kicked off a smartos-live build in a native zone.
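Again, the full script is too long for the page, but its skeleton looks something like this: time every hat_unload call (the function at the top of most of the stacks below), convert the unmapped length to pages, and aggregate per-stack page counts along with overall latency. A sketch, assuming the base 4K page size:
#!/usr/sbin/dtrace -s

fbt::hat_unload:entry
{
        self->ts = timestamp;
        self->pages = arg2 / 4096;      /* length of the unmapping, in pages */
}

fbt::hat_unload:return
/self->ts/
{
        /* distribution of pages unmapped per call, keyed by kernel stack */
        @bystack[stack()] = quantize(self->pages);
        /* overall unmap latency in microseconds */
        @latency["unmap latency (us)"] =
            lquantize((timestamp - self->ts) / 1000, 0, 1000, 5);
        self->ts = 0;
        self->pages = 0;
}
Here are the results from the full run.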
# /data/shootdowns -ls -u 5 -t 3h | tee -i /data/shootdowns.log
ADDRESS PAGES SHOOTDOWNS FACTOR LATENCY (us)
0xffffff0ab83c9000 1 1 1 21
0xffffff0b5dd3a000 1 1 1 15
0xffffff09592eb000 1 1 1 20
0xffffff2f3860b000 1731 1 1731 509
0xffffff0905a56000 2 1 2 5
0xffffff0905a37000 1 1 1 11
0xffffff0905a37000 1 1 1 5
0xffffff0905a52000 2 1 2 5
0xffffff09b5981000 1 1 1 17
0xffffff0943a2f000 1 1 1 19
0xffffff2f39393000 34 1 34 23
0xffffff2f393b6000 39 1 39 19
0xffffff2f393de000 42 1 42 20
0xffffff2f39409000 47 1 47 20
0xffffff2f39439000 48 1 48 18
0xffffff2f3946a000 52 1 52 20
0xffffff0b4243e000 1 1 1 19
0xffffff2f3949f000 200 1 200 57
0xffffff097cb74000 1 1 1 19
0xffffff09b2ba6000 1 1 1 21
...
I cherry-picked 20 particular shootdowns to highlight that large frees do occur. Sometimes they are very large, like 1,731 pages large. That particular free, caused by DTrace, took 509μs (half a millisecond), which is an astonishingly long time for a kernel operation.
TOP FIVE STACKS BY PAGES UNMAPPED
The five stacks unmapping the most memory. For each stack the
distribution of pages per memory unmapping is shown. The value
on the left is # of pages unmapped in one call.
unix`hat_unload+0x3e
genunix`segkp_release_internal+0x92
genunix`segkp_release+0xa0
genunix`schedctl_freepage+0x34
genunix`schedctl_proc_cleanup+0x68
genunix`exec_args+0x1e0
elfexec`elf32exec+0x4b0
genunix`gexec+0x347
genunix`exec_common+0x4d5
genunix`exece+0x1b
unix`_sys_sysenter_post_swapgs+0x153
value ------------- Distribution ------------- count
< 1 | 0
1 |██████████████████████████████████████ 190171
2 | 0
This first report is the top five stacks by pages unmapped, showing the kernel stacks responsible for freeing the most memory. For this run, the fifth-largest unmapper was the exec system call. Every instance of exec freed only one page, gaining nothing from Matthew’s change.
genunix`segvn_unmap+0x5bf
genunix`as_unmap+0x19c
genunix`munmap+0x83
unix`_sys_sysenter_post_swapgs+0x153
value ------------- Distribution ------------- count
< 1 | 0
1 |███████████████████████████▉ 2882
2 |██████████████████████████████████████ 3921
3 |███████████████████▏ 1983
4 |███████████▏ 1157
5 |█████████▌ 985
6 |███████▎ 753
7 |██████▋ 696
8 |██████▏ 644
9 |████▉ 508
10 |███████▎ 755
12 |█████▏ 541
14 |████▎ 440
16 |██▉ 302
18 |██▏ 230
20 |█▉ 194
22 |██▏ 222
24 |█▎ 137
26 |█▋ 180
28 |█▊ 185
30 |█▍ 152
32 |█▍ 143
34 |█▎ 130
36 |█▍ 147
38 |█▎ 130
40 |█▍ 144
42 |█ 106
44 |▉ 101
46 |█ 108
48 |▉ 101
50 |█ 104
52 |▋ 74
54 |▋ 69
56 |▋ 66
58 |▌ 52
60 |▋ 69
62 |▌ 64
64 |▌ 61
66 |▍ 43
68 |▎ 32
70 |▎ 27
72 |▍ 45
74 |▎ 37
76 |▏ 23
78 |▎ 34
80 |▎ 31
82 |▎ 32
84 |▎ 29
86 |▏ 24
88 |▏ 18
90 |▏ 24
92 |▏ 15
94 |▏ 24
96 |▏ 18
98 |▏ 16
100 |█▏ 117
120 |▍ 50
140 |▏ 17
160 | 12
180 | 3
200 | 2
220 | 0
240 | 1
260 | 1
280 | 1
300 | 0
320 | 0
340 | 0
360 | 0
380 | 0
400 | 0
420 | 0
440 | 0
460 | 0
480 | 0
500 | 0
520 | 0
540 | 0
560 | 0
580 | 0
600 | 0
620 | 0
640 | 0
660 | 0
680 | 0
700 | 0
720 | 0
740 | 0
760 | 0
780 | 0
800 | 0
820 | 0
840 | 0
860 | 0
880 | 0
900 | 0
920 | 0
940 | 0
960 | 0
980 | 0
>= 1000 | 4
The next stack is far more interesting. It is the result of userland making mmap and munmap system calls. The majority of the distribution lies within 1–3 pages, but there is a long tail along with some outliers. The takeaway is that Matthew’s change may have a positive impact on systems with moderate to heavy userland mmap/munmap load.
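If you’re curious how your own workload behaves, a one-liner over the munmap syscall gives a quick read on how many pages userland hands back per call (arg1 is the length being unmapped; 4K pages assumed):
# dtrace -n 'syscall::munmap:entry { @["pages per munmap"] = quantize(arg1 / 4096); } tick-60s { exit(0); }'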
unix`hat_unload+0x3e
unix`segkmem_free_vn+0x62
unix`segkmem_free+0x23
genunix`vmem_xfree+0xf4
genunix`vmem_free+0x23
genunix`kmem_slab_destroy+0x8d
genunix`kmem_slab_free+0x309
genunix`kmem_magazine_destroy+0x6e
genunix`kmem_depot_ws_reap+0x5d
genunix`taskq_thread+0x2d0
unix`thread_start+0x8
value ------------- Distribution ------------- count
< 1 | 0
1 |█████▌ 5736
2 |▏ 180
3 |▍ 466
4 |██████████████████████████████████████ 38795
5 |▏ 254
6 |▏ 147
7 |▏ 199
8 |▎ 267
9 | 0
10 |▌ 554
12 |▏ 223
14 |▏ 176
16 |▏ 195
18 | 0
20 |▎ 283
22 | 0
24 |▎ 269
26 | 0
28 |▎ 333
30 | 0
32 |█▍ 1493
34 | 0
This stack shows the kmem reaper thread freeing various kmem slab sizes. It has a multimodal distribution, with the majority being frees of 16K (four-page) slabs. Use of the kmem system, which basically applies to the entire kernel, stands to gain from Matthew’s change.
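A quick way to see which caches your own reaper is tearing down is to count kmem_slab_destroy calls per cache. A rough sketch, assuming kmem_slab_destroy isn’t inlined on your build (the stacks above suggest it isn’t) and that its first argument is the cache:
# dtrace -n '
    fbt::kmem_slab_destroy:entry
    {
            /* slabs destroyed, keyed by the owning cache name */
            @slabs[stringof(((kmem_cache_t *)arg0)->cache_name)] = count();
    }
    tick-60s { exit(0); }'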
unix`hat_unload+0x3e
genunix`segkp_release_internal+0x92
genunix`segkp_release+0xa0
genunix`schedctl_freepage+0x34
genunix`schedctl_proc_cleanup+0x68
genunix`proc_exit+0x22c
genunix`exit+0x15
genunix`rexit+0x18
unix`_sys_sysenter_post_swapgs+0x153
value ------------- Distribution ------------- count
< 1 | 0
1 |██████████████████████████████████████ 338814
2 | 0
The stack that unmapped the second-largest amount of memory was caused by process exits. This makes sense because building a large project like smartos-live requires forking many short-lived processes—it looks like over 330,000 of them.
unix`hat_unload+0x3e
unix`segkmem_free_vn+0x62
unix`segkmem_zio_free+0x23
genunix`vmem_xfree+0xf4
genunix`vmem_free+0x23
genunix`kmem_slab_destroy+0x8d
genunix`kmem_slab_free+0x309
genunix`kmem_magazine_destroy+0x6e
genunix`kmem_depot_ws_reap+0x5d
genunix`taskq_thread+0x2d0
unix`thread_start+0x8
value ------------- Distribution ------------- count
< 1 | 0
1 |██████████▌ 56392
2 |██████████████████████████████████████ 203538
3 |█████▎ 28252
4 |█▍ 7858
5 |████▋ 25288
6 |█▋ 8725
7 |██▊ 15013
8 |▉ 5171
9 | 0
10 |█▎ 7255
12 |▉ 5231
14 |▋ 3607
16 |▌ 2773
18 | 0
20 |▊ 4216
22 | 0
24 |▌ 3291
26 | 0
28 |▍ 2152
30 | 0
32 |████████████████████ 107393
34 | 0
The stack that unmapped the most memory was kmem on behalf of ZIO, thanks to the fact that I had a large number of active 128K buffers before starting the smartos-live build. It’s another multimodal distribution, this time including over 100,000 32-page frees. Proof that Matthew’s change is a boon for ZFS performance.
OVERALL LATENCY
The latency distribution for all memory unmappings. The value on the
left is latency in microseconds.
value ------------- Distribution ------------- count
2 | 0
3 | 12
4 |████████▊ 151790
5 |████████▉ 154641
6 |████████▎ 143235
7 |██████▉ 120230
8 |█████▍ 94932
9 |███████▊ 134837
10 |██████████████████████████████████████ 658368
15 |██ 35927
20 |█▏ 21435
25 |▏ 2773
30 | 772
35 | 367
40 | 261
45 | 180
50 | 148
55 | 155
60 | 130
65 | 130
70 | 70
75 | 57
80 | 50
85 | 37
90 | 28
95 | 19
100 | 36
150 | 3
200 | 6
250 | 0
300 | 1
350 | 3
400 | 3
450 | 2
500 | 14
550 | 1
600 | 1
650 | 5
700 | 0
750 | 0
800 | 0
850 | 0
900 | 0
950 | 0
>= 1000 | 9
The next section reports the latency distribution for all shootdowns. Without comparing this to the latency distribution of a similar load on an older SmartOS release it doesn’t really prove anything, but it’s nice to know that most shootdowns now take 10μs or less. It also might be worth looking more closely at the outliers, especially the shootdowns that took over 1ms; that’s a long time.
OVERALL SAVINGS
total pages unmapped 6407340
total shootdowns 1529415
total factor saved 4
The savings distribution for all shootdowns. The value on the left
is the factor saved (# pages / # shootdowns).
value ------------- Distribution ------------- count
< 1 | 0
1 |██████████████████████████████████████ 999581
2 |████████▊ 232952
3 |█▏ 31886
4 |█▊ 47926
5 |█ 26308
6 |▌ 15754
7 |▌ 15671
8 |▏ 5803
9 | 242
10 |▎ 8286
12 |▏ 5799
14 |▏ 5109
16 |▏ 3405
18 | 292
20 |▏ 4737
22 | 190
24 |▏ 3755
26 | 149
28 | 2606
30 | 131
32 |████▏ 109187
34 | 68
36 | 95
38 | 59
40 | 69
42 | 68
44 | 56
46 | 61
48 | 43
50 | 31
52 | 35
54 | 29
56 | 29
58 | 19
60 | 21
62 | 17
64 | 28
66 | 19
68 | 8
70 | 8
72 | 9
74 | 4
76 | 3
78 | 11
80 | 11
82 | 8
84 | 4
86 | 4
88 | 1
90 | 1
92 | 2
94 | 2
96 | 2
98 | 2
100 | 18
120 | 14
140 | 5
160 | 1
180 | 1
200 | 4
220 | 1
240 | 4
260 | 1
280 | 0
300 | 0
320 | 0
340 | 0
360 | 0
380 | 0
400 | 0
420 | 0
440 | 0
460 | 0
480 | 0
500 | 0
520 | 0
540 | 0
560 | 0
580 | 0
600 | 0
620 | 0
640 | 0
660 | 0
680 | 0
700 | 0
720 | 0
740 | 0
760 | 0
780 | 0
800 | 0
820 | 0
840 | 0
860 | 0
880 | 0
900 | 0
920 | 0
940 | 0
960 | 0
980 | 0
>= 1000 | 23
The last section reports savings. Savings is the ratio of pages being freed to the number of shootdowns required to free them. Since a multi-page free covers a contiguous range that the old code would have shot down one page at a time, this ratio also represents how much the new code saves over the old: 6,407,340 pages for 1,529,415 shootdowns, a factor of about four for my particular test. The distribution shows that most shootdowns are for one page and thus save nothing, but it also shows a significant number of multi-page shootdowns that do benefit, enough to pull the overall savings up to a factor of four.
While not conclusive, this data backs Matthew’s original claims of better ZFS latency and hints that other workloads might benefit as well. I’d be curious to see results from other workloads. Please run my shootdowns script and share your results.