The 8-Byte Two-Step

20 Sep 2014

While studying the illumos queue implementation I came across a line of code that left me quite perplexed.

#define	MQ_ALIGNSIZE	8	/* 64-bit alignment */
...
temp = (mqhp->mq_maxsz + MQ_ALIGNSIZE - 1) & ~(MQ_ALIGNSIZE - 1);

I knew it had something to do with alignment on a 64-bit boundary, but just how the hell was it doing it? Why is it subtracting one from the alignment size? Why is it negating the expression on the right? Why is it adding to the max message size? What the hell is going on?

At first I attempted to derive the pattern in my head but kept losing my place; forced to start over multiple times. After placing my pride on the bookshelf next to me, I wrote some C to help decode the mystery. The program breaks the expression into its three parts, printing the base-2 representation of each step. Here are some test runs with different sizes:

$ ./align_debug 1024
size: 00000000000000000000010000000000 (1024)
   a: 00000000000000000000010000000111 (1031)
   b: 11111111111111111111111111111000 (4294967288)
temp: 00000000000000000000010000000000 (1024)

$ ./align_debug 1023
size: 00000000000000000000001111111111 (1023)
   a: 00000000000000000000010000000110 (1030)
   b: 11111111111111111111111111111000 (4294967288)
temp: 00000000000000000000010000000000 (1024)

$ ./align_debug 9
size: 00000000000000000000000000001001 (9)
   a: 00000000000000000000000000010000 (16)
   b: 11111111111111111111111111111000 (4294967288)
temp: 00000000000000000000000000010000 (16)

$ ./align_debug 11
size: 00000000000000000000000000001011 (11)
   a: 00000000000000000000000000010010 (18)
   b: 11111111111111111111111111111000 (4294967288)
temp: 00000000000000000000000000010000 (16)

Breaking apart the expression and testing several sizes revealed the pattern to me. Pause for a moment, do you see it?

Everybody Do The 8-Byte Two-Step

Don’t worry if you couldn’t spot the pattern, I’m going to explain it right now. I’ve decided to call it the “8-Byte Two-Step” because it does its job in two steps: overflow and truncate.

Overflow - Add 7 to to make sure the size either hits the next alignment boundary or lies past it.
Truncate - Truncate the low-order bits. After the first step you know you are at the next boundary or over it so you simply need to shed the three low bits to make sure you are on an 8-byte boundary.

Still don’t follow? That’s okay, everyone learns differently. Perhaps a visual explanation will help.

The Green Bit Rule

In figure 1 I reduced the bit space down to a single byte to simplify matters. The header, in small print, marks the bit positions. The first row shows the powers of two. The next two rows represent the two cases we have to deal with: ❶ the size already lies on an 8-byte boundary, or ❷ it doesn’t lie on a boundary. The last two rows show the binary representation of seven as well as its negation.

Using this diagram, there is only one rule to remember: to be an 8-byte aligned size only the green bit positions may be turned on, and any combination of green bits will do. Don’t believe me? Try it for yourself. Take ❶ and randomly fill in a one or zero for each question mark. Add the powers of 2 of and divide by eight, there should be no remainder.

As a dual to this rule, the red bits can never be turned on. Any combination of red bits can never result in a number that is divisible by eight. Funny, the rule and its dual look a lot like the bitwise-negation of seven and seven, respectively. Adding seven to ❷ will push a one into the green bits causing the least significant green zero bit to turn on and any green one bits below it to turn off. Prove this to yourself by writing out the bit patterns for 26 and 33. Basically, adding seven to the size will add an eight to the green bits (unless the size is already on a boundary). Once that is done the final step is to bitwise-and with negated seven to make sure that only the green bits are turned on.

It turns out this pattern is not called “The 8-Byte Two-Step” or “The Green Bit Rule”, but what great names they are. No, as my friend Steve Vinoski pointed out, bit tricks like this have been in use for a long time. In fact, you can find this trick, and many others like it, in the venerable Hacker’s Delight War03.

Hacker’s Delight

If you flip to section 3-1 of Hacker’s Delight you’ll see the exact pattern as described above. In fact, as the title indicates, it is more general than alignment on an 8-byte boundary: “Rounding Up/Down to a Multiple of a Known Power of 2”. The pattern works for alignment on any power of 2. This is why the green bit rule works: any power of 2 above the alignment size is a multiple of that size. And adding two or more multiples of a size results in a number which is also a multiple of that size. Math is a wondrous thing.

Interestingly, the illumos code is actually a mix of two different patterns shown in Hacker’s Delight. In the book the author shows a pattern for finding the boundary based on a power of 2 or based on the log₂ of that value. For example, if the boundary was 8 then its log₂ would be 3. The illumos code could have been written more simply:

temp = (mqhp->mq_maxsz + MQ_ALIGNSIZE - 1) & -MQ_ALIGNSIZE

But more important than how it is written is—why it is written this way at all?

C.R.E.A.M. Cycles Rule Everything Around Me

A key performance trait of a CPU is its frequency, the number of cycles in a second. For example, my development machine contains an Intel i7-3970X which runs six cores at 3.50GHz. That’s six cores each ticking three-and-a-half billion times a second. Ideally, chips would complete an instruction each cycle, but not all instructions are created equal. Some instructions cost more than others. The less cycles an algorithm uses, the faster it can run.

This pattern, then, is an efficient way to align a number to a multiple of a power of two. But efficient compared to what? How else might you implement this algorithm? I came up with two solutions.

#include <math.h>
#define ALIGNSZ 8

unsigned int align_1(unsigned int x) {
  return (x + ALIGNSZ - 1) & ~(ALIGNSZ - 1);
}

unsigned int align_2(unsigned int x) {
  unsigned int boundary = ALIGNSZ;
  while (x > boundary) {
    boundary += ALIGNSZ;
  }
  return boundary;
}

unsigned int align_3(unsigned int x) {
  return ALIGNSZ * ceil((double)x / ALIGNSZ);
}

The align_1 function is the original 8-Byte Two-Step implementation. The align_2 function adds eight until the boundary is greater than or equal to x. The align_3 function uses division to avoid the loop in align_2 but requires ceil and multiplication to put the answer back in terms of number of bytes. One might think the 3rd is the best since it uses only three operations while the 1st uses five, but looks can be deceiving. Remember, C is a high level language. Let’s take a peak at the assembler.

ASM generated with: gcc -std=c99 -m32 -S -masm=intel -Wall -O0 align.c

align_1:
	push	ebp                     ; 1
	mov	ebp, esp                ; 1
	mov	eax, DWORD PTR [ebp+8]  ; 2
	add	eax, 7                  ; 1
	and	eax, -8                 ; 1
	pop	ebp                     ; 2
	ret                             ; ?
        ;; TOTAL LATENCY: 8 + ?
align_2:
	push	ebp                     ; 1
	mov	ebp, esp                ; 1
	sub	esp, 16                 ; 1
	mov	DWORD PTR [ebp-4], 8    ; ?
	jmp	.L4                     ; 0
.L5:
	add	DWORD PTR [ebp-4], 8    ; 6
.L4:
	mov	eax, DWORD PTR [ebp+8]  ; 2
	cmp	eax, DWORD PTR [ebp-4]  ; ?
	ja	.L5                     ; ?
	mov	eax, DWORD PTR [ebp-4]  ; 2
	leave                           ; ?
	ret                             ; ?
        ;; TOTAL LATENCY: (size/8)*(13 + 5?)
align_3:
	push	ebp                     ; 1
	mov	ebp, esp                ; 1
	sub	esp, 40                 ; 1
	mov	eax, DWORD PTR [ebp+8]  ; 2
	mov	edx, 0                  ; ?
	mov	DWORD PTR [ebp-24], eax ; 3
	mov	DWORD PTR [ebp-20], edx ; 3
	fild	QWORD PTR [ebp-24]      ; 6
	fstp	QWORD PTR [ebp-16]      ; 5
	fld	QWORD PTR [ebp-16]      ; 3
	fld	TBYTE PTR .LC0          ; 4
	fdivp	st(1), st               ; 10
	fstp	QWORD PTR [ebp-16]      ; 5
	fld	QWORD PTR [ebp-16]      ; 3
	fstp	QWORD PTR [esp]         ; 5
	call	ceil                    ; ?
	fld	TBYTE PTR .LC0          ; 4
	fmulp	st(1), st               ; 5
	fnstcw	WORD PTR [ebp-26]       ; 5
	movzx	eax, WORD PTR [ebp-26]  ; ?
	mov	ah, 12                  ; ?
	mov	WORD PTR [ebp-28], ax   ; 3
	fldcw	WORD PTR [ebp-28]       ; 8
	fistp	QWORD PTR [ebp-24]      ; 7
	fldcw	WORD PTR [ebp-26]       ; 8
	mov	eax, DWORD PTR [ebp-24] ; 2
	mov	edx, DWORD PTR [ebp-20] ; 2
	leave                           ; ?
	ret                             ; ?
        ;; TOTAL LATENCY: 96 + 6?

I’ve curated the assembly, removing irrelevant lines and adding a latency cost for each instruction. I used Agner Fog’s instruction tables to determine the minimum latency of each instruction Fog14. The numbers represent the minimum number of cycles an instruction uses in an ideal execution. A question mark means there was no entry for the instruction. All numbers are for the Intel Sandybridge as that is what my development machine runs.

It turns out, for any reasonably sized x (the size we want to round up to the nearest boundary), the align_1 function performs the best and align_2 performs the worst. In fact, for a size as small as 1024 the align_2 function will have a latency three orders of magnitude greater than align_1. Using less cycles is good not only because it means less work; but I posit it reduces the chance of being interrupted which can lead to further latency costs like cache misses.

These theoretical numbers are great, but I want some hard numbers. Time for some benchmarks.

Benchmarks

I created a separate C file to benchmark each of the three alignment functions: align_1_bench.c, align_2_bench.c, align_3_bench.c, and align_bench.h. There is also a control benchmark, align_control_bench.c, used as a baseline to account for any constant overhead. Each benchmark runs one-million times, passing 1026u as the size to align. Intel specific assembler is used to count the number of cycles for each invocation and print that number to stdout. Finally, the redirected output is turned into a histogram via a mix of sort, uniq, and awk.

Benchmarks Compiled With gcc -std=c99 -m32 -lm -Wall -O0

Control		Align 1		Align 2		Align 3
Cycles	Count	Cycles	Count	Cycles	Count	Cycles	Count
112	333,677	112	43	912	2	196	389,942
116	624,414	116	816,038	916	122,676	200	609,995
120	41,860	120	183,855	920	212,176	204	2
124	9	124	3	924	13,760	212	2
128	7	128	5	928	75,642	216	1
132	10	132	9	932	205,533	⋮	⋮
136	7	136	13	936	190,635	264	1
140	5	140	13	940	481	268	1
144	3	144	2	944	2	284	20
148	2	148	5	⋮	⋮	288	11
152	1	152	5	1,036	22	304	2
156	1	156	5	1,040	2,295	308	1
164	1	160	1	1,044	111,111	⋮	⋮
192	1	164	1	1,048	65,609	376	1
240	1	192	1	⋮	⋮	408	1
268	1	9,672	1	14,584	1	11,320	1

Interestingly, the results are not only close to my predictions, but they are lower (perhaps due to ILP or some other CPU voodoo). The baseline uses 112–116 cycles per invocation. The 1st method uses 116–120 cycles, 4 more than the baseline. The 2nd method stays within 916–1048 cycles, 900 cycles more than baseline. And finally, the 3rd method takes 196–200 cycles, 84 more than baseline.

The nugget in this data is the fact that align_2 is multimodal (this is why histograms are so important). This lends some credibility to my theory that the more cycles a function uses the more it is susceptible to variance from outside forces. But for me to be scientific about it much more data would have to be gathered. For now, I’m happy to leave it a conjecture.

Who Cares?

I know what some of you are thinking: why care about 100 cycles when the CPU is ticking at 3.5 billion a second? You shouldn’t, in most cases. But when writing code that is invoked millions of times a second an order-of-magnitude overhead may become important. As load increases this overhead adds up like compound interest, reducing the total amount of work your machine can service in an acceptable response time. For example, queueing theory tells us that for an M/D/1 model mean response time rises very quickly after 60% utilization Gre13 §2.6.5. The longer your mean response time the faster you become saturated and the longer the requests take; it’s a vicious cycle. That said, this optimization is not needed in the illumos mqueue code. If you are creating queues at ludicrous speed this is not going to be the overhead that bites you first. My guess, this was done more as an idiom of systems programming than as an optimization.

References

Fog14	4. Instruction tables: List of instruction latencies, throughputs, and micro-operation breakdowns for Intel, AMD, and VIA CPUs Agner Fog Technical University of Denmark 2014
Gre13	Systems Performance, Enterprise and the Cloud Brendan Gregg Prentice Hall 2013
Int13	Intel 64 and IA-32 Architectures Software Developer’s Manual: Combined Volumes 1, 2A, 2B, 2C, 3A, 3B, and 3C Intel Corporation 2013
War03	Hacker’s Delight Henry Warren, Jr. Addison-Wesly 2003