RSS | technovelty home | page of ian | ianw@ieee.org
Loop unrolling can be a fascinating art.
The canonical example of loop unrolling is something like
for (i = 0; i < 100; i++)
x[i] = i;
becoming
for (i = 0; i < 25; i+=4) {
x[i] = i;
x[i+1] = i+1;
x[i+2] = i+2;
x[i+3] = i+3;
}
Longer runs of instructions afford you a better chance of managing to find instruction level parallelism; that is instructions that can all happen together. You can keep the pipeline fuller and hence run faster, at the cost of some expanded code size.
But unlike that example, you generally do not know where your loop
ends. In the above example, we made four copies of the loop body,
reducing the iterations by 4. For an indeterminate number of
iterations, we would need to do the original loop n % 4
times and then the unrolled body floor(n / 4) times. So
to spell it out, if we want to do the loop 7 times, then we execute
the original loop 3 times and the unrolled portion once.
Our new code might look something like
int extras = n % 4;
for (i = 0; i < extras; i++)
x[i] += i;
for ( ; i < n / 4 ; i+=4) {
x[i] = i;
x[i+1] = i+1;
x[i+2] = i+2;
x[i+3] = i+3;
}
That leads to a little trick called "Duff's Device", references to which can be found most anywhere (however I found most of them a little unclear, hence this post).
Duff's device looks like this
send(to, from, count)
register short *to, *from;
register count;
{
register n=(count+7)/8;
switch(count%8) {
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while(--n>0);
}
}
Now that doesn't look like loop unrolling, but it really is. In
this case, we have unwound a loop doing simply *to =
*from 8 times. But we still have that extra bit problem, since
we have an indeterminate count.
To understand it, let's just work through what happens for possible
inputs. For values of count from 1-7, n is
one and count % 8 is just count so we fall
right into the case statements, which fall through to each other
without break and never repeat. The nifty bit comes when
count is greater than 7. In this case, our first iteration comes from
the switch where we fall through to do the extra
count % 8 iterations. After that, we loop back to the
start of the unrolling and do our 8 step loop as many times as we
require.
Now by post-incrementing *to you can use this to
implement memcpy(). An interesting exercise might be to
see how much slower this is than the highly optimised
memcpy that comes with glibc.
The code is most certainly ingenious and this page explains some of the history of it. Today I think it is mostly good as an exercise on "why it works".
If you're wondering how gcc handles unrolling, it creates the
unrolled loop and the does a sort of fall through test which evaluates
how many extra loops are required and branches into the unrolled
portion at the correct point. A little extract is below, you can see
r14 holds the number of extra loops required and branches
higher up the unrolled code for each extra loop required.
... 70: [MFB] cmp4.eq p6,p7=0,r14 76: nop.f 0x0 7c: (p06) br.cond.dpnt.few 150;; 80: [MFB] cmp4.eq p6,p7=1,r14 86: nop.f 0x0 8c: (p06) br.cond.dpnt.few 130 ;; 90: [MFB] cmp4.eq p6,p7=2,r14 96: nop.f 0x0 9c: (p06) br.cond.dpnt.few 120 ;; a0: [MFB] cmp4.eq p6,p7=3,r14 a6: nop.f 0x0 ac: (p06) br.cond.dpnt.few 110 ;; b0: [MFB] cmp4.eq p6,p7=4,r14 b6: nop.f 0x0 bc: (p06) br.cond.dpnt.few 100 ;; c0: [MFB] cmp4.eq p6,p7=5,r14 c6: nop.f 0x0 cc: (p06) br.cond.dpnt.few f0 ;; d0: [MMI] cmp4.eq p6,p7=6,r14;; d6: (p07) st4 [r16]=r17,4 dc: nop.i 0x0 ...
posted at: Fri, 24 Mar 2006 09:32 | in /code/c | permalink | add comment (0 others)

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.