The 8-Byte Two-Step
20 Sep 2014
While studying the illumos queue implementation I came across a line of code that left me quite perplexed.
I knew it had something to do with alignment on a 64-bit boundary, but just how the hell was it doing it? Why is it subtracting one from the alignment size? Why is it negating the expression on the right? Why is it adding to the max message size? What the hell is going on?
At first I attempted to derive the pattern in my head, but I kept losing my place and was forced to start over multiple times. After placing my pride on the bookshelf next to me, I wrote some C to help decode the mystery. The program breaks the expression into its three parts, printing the base-2 representation of each step. Here are some test runs with different sizes:
Breaking apart the expression and testing several sizes revealed the pattern to me. Pause for a moment, do you see it?
Don’t worry if you couldn’t spot the pattern, I’m going to explain it right now. I’ve decided to call it the “8-Byte Two-Step” because it does its job in two steps: overflow and truncate.
Overflow: add 7 to make sure the size either hits the next alignment boundary or lies past it.

Truncate: shed the low-order bits. After the first step you know you are at the next boundary or past it, so you simply drop the three low bits to make sure you land on an 8-byte boundary.
Still don’t follow? That’s okay, everyone learns differently. Perhaps a visual explanation will help.
In figure 1 I reduced the bit space down to a single byte to simplify matters. The header, in small print, marks the bit positions. The first row shows the powers of two. The next two rows represent the two cases we have to deal with: ❶ the size already lies on an 8-byte boundary, or ❷ it doesn’t lie on a boundary. The last two rows show the binary representation of seven as well as its negation.
Using this diagram, there is only one rule to remember: for a size to be 8-byte aligned, only the green bit positions may be turned on, and any combination of green bits will do. Don't believe me? Try it for yourself. Take ❶ and randomly fill in a one or zero for each question mark. Add up the powers of 2 and divide by eight; there should be no remainder.
As a dual to this rule, the red bits can never be turned on. Any combination of red bits can never result in a number that is divisible by eight. Funny, the rule and its dual look a lot like the bitwise-negation of seven and seven, respectively. Adding seven to ❷ will push a one into the green bits causing the least significant green zero bit to turn on and any green one bits below it to turn off. Prove this to yourself by writing out the bit patterns for 26 and 33. Basically, adding seven to the size will add an eight to the green bits (unless the size is already on a boundary). Once that is done the final step is to bitwise-and with negated seven to make sure that only the green bits are turned on.
It turns out this pattern is not called “The 8-Byte Two-Step” or “The Green Bit Rule”, but what great names they are. No, as my friend Steve Vinoski pointed out, bit tricks like this have been in use for a long time. In fact, you can find this trick, and many others like it, in the venerable Hacker’s Delight War03.
If you flip to section 3-1 of Hacker’s Delight you’ll see the exact pattern as described above. In fact, as the title indicates, it is more general than alignment on an 8-byte boundary: “Rounding Up/Down to a Multiple of a Known Power of 2”. The pattern works for alignment on any power of 2. This is why the green bit rule works: any power of 2 above the alignment size is a multiple of that size. And adding two or more multiples of a size results in a number which is also a multiple of that size. Math is a wondrous thing.
Interestingly, the illumos code is actually a mix of two different patterns shown in Hacker’s Delight. In the book the author shows a pattern for finding the boundary based on a power of 2 or based on the log2 of that value. For example, if the boundary was 8 then its log2 would be 3. The illumos code could have been written more simply:
But more important than how it is written: why is it written this way at all?
A key performance trait of a CPU is its frequency: the number of cycles per second. For example, my development machine contains an Intel i7-3970X, which runs six cores at 3.50GHz. That's six cores each ticking three-and-a-half billion times a second. Ideally, a chip would complete an instruction every cycle, but not all instructions are created equal. Some instructions cost more than others. The fewer cycles an algorithm uses, the faster it can run.
This pattern, then, is an efficient way to align a number to a multiple of a power of two. But efficient compared to what? How else might you implement this algorithm? I came up with two alternative implementations.
The align_1 function is the original 8-Byte Two-Step implementation. The align_2 function adds eight until the boundary is greater than or equal to the size. The align_3 function uses division to avoid the loop in align_2, ceil to round up, and multiplication to put the answer back in terms of number of bytes. One might think the 3rd is the best since it uses only three operations while the 1st uses five, but looks can be deceiving. Remember, C is a high-level language. Let's take a peek at the assembly generated by:
gcc -std=c99 -m32 -S -masm=intel -Wall -O0 align.c
I’ve curated the assembly, removing irrelevant lines and adding a latency cost for each instruction. I used Agner Fog’s instruction tables to determine the minimum latency of each instruction Fog14. The numbers represent the minimum number of cycles an instruction uses in an ideal execution. A question mark means there was no entry for the instruction. All numbers are for the Intel Sandy Bridge, as that is what my development machine runs.
It turns out, for any reasonably sized size (the number of bytes we want to round up to the nearest boundary), the align_1 function performs the best and align_2 performs the worst. In fact, for a size as small as 1024 the align_2 function will have a latency three orders of magnitude greater than align_1. Using fewer cycles is good not only because it means less work; I posit it also reduces the chance of being interrupted, which can lead to further latency costs like cache misses.
These theoretical numbers are great, but I want some hard numbers. Time for some benchmarks.
I created a separate C file to benchmark each of the three functions: align_1_bench.c, align_2_bench.c, and align_3_bench.c. There is also a control benchmark used as a baseline to account for any constant overhead. Each benchmark runs one million times, passing the same value as the size to align. Intel-specific assembler is used to count the number of cycles for each invocation and print that number to stdout. Finally, the redirected output is turned into a histogram via a mix of sort, uniq, and awk.
gcc -std=c99 -m32 -lm -Wall -O0
Interestingly, the results are not only close to my predictions, but they are lower (perhaps due to ILP or some other CPU voodoo). The baseline uses 112–116 cycles per invocation. The 1st method uses 116–120 cycles, 4 more than the baseline. The 2nd method stays within 916–1048 cycles, 900 cycles more than baseline. And finally, the 3rd method takes 196–200 cycles, 84 more than baseline.
The nugget in this data is that one of the distributions is multimodal (this is why histograms are so important). This lends some credibility to my theory that the more cycles a function uses, the more it is susceptible to variance from outside forces. But for me to be scientific about it, much more data would have to be gathered. For now, I'm happy to leave it a conjecture.
I know what some of you are thinking: why care about 100 cycles when the CPU is ticking at 3.5 billion a second? You shouldn't, in most cases. But when writing code that is invoked millions of times a second, an order-of-magnitude overhead may become important. As load increases this overhead adds up like compound interest, reducing the total amount of work your machine can service in an acceptable response time. For example, queueing theory tells us that for an M/D/1 model the mean response time rises very quickly after 60% utilization Gre13 §2.6.5. The longer your mean response time, the faster you become saturated and the longer the requests take; it's a vicious cycle. That said, this optimization is not needed in the illumos mqueue code. If you are creating queues at ludicrous speed, this is not going to be the overhead that bites you first. My guess is that this was done more as an idiom of systems programming than as an optimization.