The 8Byte TwoStep
20 Sep 2014
While studying the illumos queue implementation I came across a line of code that left me quite perplexed.
I knew it had something to do with alignment on a 64bit boundary, but just how the hell was it doing it? Why is it subtracting one from the alignment size? Why is it negating the expression on the right? Why is it adding to the max message size? What the hell is going on?
At first I attempted to derive the pattern in my head but kept losing my place; forced to start over multiple times. After placing my pride on the bookshelf next to me, I wrote some C to help decode the mystery. The program breaks the expression into its three parts, printing the base2 representation of each step. Here are some test runs with different sizes:
Breaking apart the expression and testing several sizes revealed the pattern to me. Pause for a moment, do you see it?
Everybody Do The 8Byte TwoStep
Don’t worry if you couldn’t spot the pattern, I’m going to explain it right now. I’ve decided to call it the “8Byte TwoStep” because it does its job in two steps: overflow and truncate.

Overflow  Add 7 to to make sure the
size
either hits the next alignment boundary or lies past it. 
Truncate  Truncate the loworder bits. After the first step you know you are at the next boundary or over it so you simply need to shed the three low bits to make sure you are on an 8byte boundary.
Still don’t follow? That’s okay, everyone learns differently. Perhaps a visual explanation will help.
The Green Bit Rule
In figure 1 I reduced the bit space down to a single byte to simplify matters. The header, in small print, marks the bit positions. The first row shows the powers of two. The next two rows represent the two cases we have to deal with: ❶ the size already lies on an 8byte boundary, or ❷ it doesn’t lie on a boundary. The last two rows show the binary representation of seven as well as its negation.
Using this diagram, there is only one rule to remember: to be an 8byte aligned size only the green bit positions may be turned on, and any combination of green bits will do. Don’t believe me? Try it for yourself. Take ❶ and randomly fill in a one or zero for each question mark. Add the powers of 2 of and divide by eight, there should be no remainder.
As a dual to this rule, the red bits can never be turned on. Any combination of red bits can never result in a number that is divisible by eight. Funny, the rule and its dual look a lot like the bitwisenegation of seven and seven, respectively. Adding seven to ❷ will push a one into the green bits causing the least significant green zero bit to turn on and any green one bits below it to turn off. Prove this to yourself by writing out the bit patterns for 26 and 33. Basically, adding seven to the size will add an eight to the green bits (unless the size is already on a boundary). Once that is done the final step is to bitwiseand with negated seven to make sure that only the green bits are turned on.
It turns out this pattern is not called “The 8Byte TwoStep” or “The Green Bit Rule”, but what great names they are. No, as my friend Steve Vinoski pointed out, bit tricks like this have been in use for a long time. In fact, you can find this trick, and many others like it, in the venerable Hacker’s Delight War03.
Hacker’s Delight
If you flip to section 31 of Hacker’s Delight you’ll see the exact pattern as described above. In fact, as the title indicates, it is more general than alignment on an 8byte boundary: “Rounding Up/Down to a Multiple of a Known Power of 2”. The pattern works for alignment on any power of 2. This is why the green bit rule works: any power of 2 above the alignment size is a multiple of that size. And adding two or more multiples of a size results in a number which is also a multiple of that size. Math is a wondrous thing.
Interestingly, the illumos code is actually a mix of two different patterns shown in Hacker’s Delight. In the book the author shows a pattern for finding the boundary based on a power of 2 or based on the log_{2} of that value. For example, if the boundary was 8 then its log_{2} would be 3. The illumos code could have been written more simply:
But more important than how it is written is—why it is written this way at all?
C.R.E.A.M. Cycles Rule Everything Around Me
A key performance trait of a CPU is its frequency, the number of cycles in a second. For example, my development machine contains an Intel i73970X which runs six cores at 3.50GHz. That’s six cores each ticking threeandahalf billion times a second. Ideally, chips would complete an instruction each cycle, but not all instructions are created equal. Some instructions cost more than others. The less cycles an algorithm uses, the faster it can run.
This pattern, then, is an efficient way to align a number to a multiple of a power of two. But efficient compared to what? How else might you implement this algorithm? I came up with two solutions.
The align_1
function is the original 8Byte
TwoStep implementation. The align_2
function
adds eight until the boundary is greater than or equal
to x
. The align_3
function uses
division to avoid the loop in align_2
but
requires ceil
and multiplication to put the
answer back in terms of number of bytes. One might think the
3rd is the best since it uses only three operations while the
1st uses five, but looks can be deceiving. Remember, C is
a high level language. Let’s take a peak at the
assembler.
gcc std=c99 m32 S masm=intel Wall O0 align.c
I’ve curated the assembly, removing irrelevant lines and adding a latency cost for each instruction. I used Agner Fog’s instruction tables to determine the minimum latency of each instruction Fog14. The numbers represent the minimum number of cycles an instruction uses in an ideal execution. A question mark means there was no entry for the instruction. All numbers are for the Intel Sandybridge as that is what my development machine runs.
It turns out, for any reasonably sized x
(the
size we want to round up to the nearest boundary),
the align_1
function performs the best
and align_2
performs the worst. In fact, for a
size as small as 1024 the align_2
function will
have a latency three orders of magnitude greater
than align_1
. Using less cycles is good not only
because it means less work; but I posit it reduces the chance
of being interrupted which can lead to further latency costs
like cache misses.
These theoretical numbers are great, but I want some hard numbers. Time for some benchmarks.
Benchmarks
I created a separate C file to benchmark each of the three
alignment
functions: align_1_bench.c, align_2_bench.c,
align_3_bench.c,
and align_bench.h.
There is also a control
benchmark, align_control_bench.c,
used as a baseline to account for any constant overhead. Each
benchmark runs onemillion times, passing 1026u
as the size to align. Intel specific assembler is used to
count the number of cycles for each invocation and print that
number to stdout. Finally, the redirected output is turned
into a histogram via a mix of sort, uniq, and awk.
gcc std=c99 m32 lm Wall O0
Control  Align 1  Align 2  Align 3  

Cycles  Count  Cycles  Count  Cycles  Count  Cycles  Count 
112  333,677  112  43  912  2  196  389,942 
116  624,414  116  816,038  916  122,676  200  609,995 
120  41,860  120  183,855  920  212,176  204  2 
124  9  124  3  924  13,760  212  2 
128  7  128  5  928  75,642  216  1 
132  10  132  9  932  205,533  ⋮  ⋮ 
136  7  136  13  936  190,635  264  1 
140  5  140  13  940  481  268  1 
144  3  144  2  944  2  284  20 
148  2  148  5  ⋮  ⋮  288  11 
152  1  152  5  1,036  22  304  2 
156  1  156  5  1,040  2,295  308  1 
164  1  160  1  1,044  111,111  ⋮  ⋮ 
192  1  164  1  1,048  65,609  376  1 
240  1  192  1  ⋮  ⋮  408  1 
268  1  9,672  1  14,584  1  11,320  1 
Interestingly, the results are not only close to my predictions, but they are lower (perhaps due to ILP or some other CPU voodoo). The baseline uses 112–116 cycles per invocation. The 1st method uses 116–120 cycles, 4 more than the baseline. The 2nd method stays within 916–1048 cycles, 900 cycles more than baseline. And finally, the 3rd method takes 196–200 cycles, 84 more than baseline.
The nugget in this data is the fact that align_2
is multimodal (this is why histograms are so important). This
lends some credibility to my theory that the more cycles a
function uses the more it is susceptible to variance from
outside forces. But for me to be scientific about it much
more data would have to be gathered. For now, I’m happy to
leave it a conjecture.
Who Cares?
I know what some of you are thinking: why care about 100 cycles when the CPU is ticking at 3.5 billion a second? You shouldn’t, in most cases. But when writing code that is invoked millions of times a second an orderofmagnitude overhead may become important. As load increases this overhead adds up like compound interest, reducing the total amount of work your machine can service in an acceptable response time. For example, queueing theory tells us that for an M/D/1 model mean response time rises very quickly after 60% utilization Gre13 §2.6.5. The longer your mean response time the faster you become saturated and the longer the requests take; it’s a vicious cycle. That said, this optimization is not needed in the illumos mqueue code. If you are creating queues at ludicrous speed this is not going to be the overhead that bites you first. My guess, this was done more as an idiom of systems programming than as an optimization.
References
Fog14 

Gre13 

Int13  
War03 
