Unrolling and Duff's Device

Loop unrolling can be a fascinating art.

The canonical example of loop unrolling is something like

for (i = 0; i < 100; i++)
    x[i] = i;


for (i = 0; i < 25; i+=4) {
    x[i] = i;
    x[i+1] = i+1;
    x[i+2] = i+2;
    x[i+3] = i+3;

Longer runs of instructions afford you a better chance of managing to find instruction level parallelism; that is instructions that can all happen together. You can keep the pipeline fuller and hence run faster, at the cost of some expanded code size.

But unlike that example, you generally do not know where your loop ends. In the above example, we made four copies of the loop body, reducing the iterations by 4. For an indeterminate number of iterations, we would need to do the original loop n % 4 times and then the unrolled body floor(n / 4) times. So to spell it out, if we want to do the loop 7 times, then we execute the original loop 3 times and the unrolled portion once.

Our new code might look something like

int extras = n % 4;
for (i = 0; i < extras; i++)
    x[i] += i;

for ( ; i < n / 4 ; i+=4) {
    x[i] = i;
    x[i+1] = i+1;
    x[i+2] = i+2;
    x[i+3] = i+3;

That leads to a little trick called "Duff's Device", references to which can be found most anywhere (however I found most of them a little unclear, hence this post).

Duff's device looks like this

send(to, from, count)
register short *to, *from;
register count;
        register n=(count+7)/8;

        switch(count%8) {
        case 0: do {    *to = *from++;
        case 7:         *to = *from++;
        case 6:         *to = *from++;
        case 5:         *to = *from++;
        case 4:         *to = *from++;
        case 3:         *to = *from++;
        case 2:         *to = *from++;
        case 1:         *to = *from++;
                } while(--n>0);

Now that doesn't look like loop unrolling, but it really is. In this case, we have unwound a loop doing simply *to = *from 8 times. But we still have that extra bit problem, since we have an indeterminate count.

To understand it, let's just work through what happens for possible inputs. For values of count from 1-7, n is one and count % 8 is just count so we fall right into the case statements, which fall through to each other without break and never repeat. The nifty bit comes when count is greater than 7. In this case, our first iteration comes from the switch where we fall through to do the extra count % 8 iterations. After that, we loop back to the start of the unrolling and do our 8 step loop as many times as we require.

Now by post-incrementing *to you can use this to implement memcpy(). An interesting exercise might be to see how much slower this is than the highly optimised memcpy that comes with glibc.

The code is most certainly ingenious and this page explains some of the history of it. Today I think it is mostly good as an exercise on "why it works".

If you're wondering how gcc handles unrolling, it creates the unrolled loop and the does a sort of fall through test which evaluates how many extra loops are required and branches into the unrolled portion at the correct point. A little extract is below, you can see r14 holds the number of extra loops required and branches higher up the unrolled code for each extra loop required.

  70: [MFB]       cmp4.eq p6,p7=0,r14
  76:             nop.f 0x0
  7c:       (p06) br.cond.dpnt.few 150 ;;
  80: [MFB]       cmp4.eq p6,p7=1,r14
  86:             nop.f 0x0
  8c:       (p06) br.cond.dpnt.few 130 ;;
  90: [MFB]       cmp4.eq p6,p7=2,r14
  96:             nop.f 0x0
  9c:       (p06) br.cond.dpnt.few 120 ;;
  a0: [MFB]       cmp4.eq p6,p7=3,r14
  a6:             nop.f 0x0
  ac:       (p06) br.cond.dpnt.few 110 ;;
  b0: [MFB]       cmp4.eq p6,p7=4,r14
  b6:             nop.f 0x0
  bc:       (p06) br.cond.dpnt.few 100 ;;
  c0: [MFB]       cmp4.eq p6,p7=5,r14
  c6:             nop.f 0x0
  cc:       (p06) br.cond.dpnt.few f0 ;;
  d0: [MMI]       cmp4.eq p6,p7=6,r14;;
  d6:       (p07) st4 [r16]=r17,4
  dc:             nop.i 0x0