I was playing with loop unrolling, and decided to investigate it a
little. The basis is this very simple program which looks like it could
do with some loop unrolling.
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#define SIZE (1500 * 1024 * 1024)
int fn(int *buf) {
        int i;
    for(i=0; i < (SIZE/sizeof(int)); i+=1)
        buf[i] = i;
    return 0;
}
int main(void)
{
        int *buf = malloc(SIZE);
        assert(buf);
    return fn(buf);
}
Firstly, lets have a look at what happens to the function when
-funroll-loops is applied. Below is the de-compilation with some
comments on what each line is doing.
Non-optimised
                         | nop.m 0x0
r14=0                    | mov r14=r0
save the "loop counter"  | mov.i r2=ar.lc
                         | nop.m 0x0
one less than iterations | movl r15=0x176fffff;;
                         | nop.m 0x0
initalise loop counter   | mov.i ar.lc=r15
                         | nop.i 0x0;;
                         | loop:
*buf = i, buf++          | st4 [r32]=r14,4
i++                      | adds r14=1,r14
dec lc, branch if > 0    | br.cloop.sptk.few .loop
                         | nop.m 0x0
restore loop counter     | mov.i ar.lc=r2
Loop unrolled
                         | nop.m 0x0
r26 = 0                  | mov r26=r0
save lc                  | mov.i r2=ar.lc
                         | nop.m 0x0
iter / 32                | movl r14=0x2edffff;;
                         | nop.m 0x0
setup loop counter       | mov.i ar.lc=r14
                         | nop.i 0x0
                         | loop:
r3 = buf                 | mov r3=r32
r8 = 1                   | adds r8=1,r26
r25 = 3                  | adds r25=3,r26
r23 = buf[3]             | adds r23=12,r32
r24 = 4                  | adds r24=4,r26
r22 = buf[4]             | adds r22=16,r32;;
buf[0] = 0; r3=buf+1     | st4 [r3]=r26,4
r21 = 5                  | adds r21=5,r26
r19 = buf[5]             | adds r19=20,r32
r20 = 6                  | adds r20=6,r26
r18 = buf[6]             | adds r18=24,r32
r16 = 7                  | adds r16=7,r26;;
                         | nop.m 0x0
buf[1] = 1, r3=buf+2     | st4 [r3]=r8,4
r15 = buf[7]             | adds r15=28,r32
r17 = 2                  | adds r17=1,r8
r26 += 8                 | adds r26=8,r26
buf += 8 (for next iter) | adds r32=32,r32;;
buf[2] = 2               | st4 [r3]=r17
buf[3] = 3               | st4 [r23]=r25
                         | nop.i 0x0
buf[4] = 4               | st4 [r22]=r24
buf[5] = 5               | st4 [r19]=r21
                         | nop.i 0x0
buf[6] = 6               | st4 [r18]=r20
buf[7] = 7               | st4 [r15]=r16
loop, dec branch         | br.cloop.sptk.few .loop
                         | nop.m 0x0
restore lc               | mov.i ar.lc=r2
                         | br.ret.sptk.many b0;;
We can see that in the first case the loop counter is setup to run
393216000 times (e.g. 1500*1024*1024/sizeof(int)) and a very simple
loop of storing the value and incrementing is undertaken. In the second
version, the code becomes more complex. We can see the loop now executes
49152000 times (e.g. 1500*1024*1024/sizeof(int)/8) so we can infer
that for each loop iteration, 8 values have been unrolled to be written
into our buf array. Indeed, when pulled apart we can see the 8
stores (numbers just reflect the first time through).
So, is it faster?
$ time ./loop
real    0m1.209s
user    0m0.388s
sys     0m0.820s
$ timme ./loop-unroll
real    0m1.056s
user    0m0.244s
sys     0m0.812s
Yes! Great. We can suggest a number of reasons why, but it would be
interesting to get some hard statistics on why this is so. Thanks to the
Itanium performance counters, we can. Firstly, lets start by seeing how
cycles are being spent.
$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop
 547923609 CPU_CYCLES
1179995250 IA64_INST_RETIRED_THIS
    104431 NOPS_RETIRED
 154285534 BACK_END_BUBBLE_ALL
$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop-unroll
 311570590 CPU_CYCLES
1327451286 IA64_INST_RETIRED_THIS
 147560435 NOPS_RETIRED
  16088889 BACK_END_BUBBLE_ALL
Now, the Itanium can issue up to six instructions
(IA64_INST_RETIRED_THIS) in two bundles of three instructions per
cycle (CPU_CYCLES). However, you can't have dependencies between
registers within a 3-instruction bundle (and branches and interruptions
play havoc) so sometimes you need to pad with no-ops (this is why you
need a good compiler). We can work out some figures from this, namely
the useful instructions per cycle (e.g. instructions retired - nop
instructions / cycles).
| Test | Useful instructions/cycle | 
|---|
| loop | 1179995250 - 104431 / 547923609 = 2.15 | 
| unroll | 1327451286 - 147560435 / 311570590 = 3.78 | 
This tells us that for each given cycle, the unrolled-loop version is
doing more work (ideally we'd like that to be a the theoretical maximum
of 6). So what is the unrolled version doing for all this time? Bubbles
are where the CPU is waiting for something; a prime candidate is moving
data from caches to registers. Bubbles make up (154285534/547823609) 28%
of the cycles on the loop version, but only (16088889/311570590) 5% of
the time for the unrolled version. This means the processor is spending
a lot less time waiting for things to happen with the unrolled loop
version.
We can drill down further to see what bubbles are occurring.
$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop
 14229581 BE_L1D_FPU_BUBBLE_ALL
    97669 BE_EXE_BUBBLE_ALL
     1260 BE_RSE_BUBBLE_ALL
131354866 BACK_END_BUBBLE_FE
$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop-unroll
22676169 BE_L1D_FPU_BUBBLE_ALL
   83064 BE_EXE_BUBBLE_ALL
    1263 BE_RSE_BUBBLE_ALL
  718850 BACK_END_BUBBLE_FE
We are now breaking the stalls down to see where they are occuring.
L1D_FPU_BUBBLES are essentially related to micro-architectural issues
with the caches; things like the L1D cache being overwhelmend and
interactions with the hardware page-table walker. BE_EXE bubbles are
more simply described a waiting for data to come from caches (or even
main memory) to registers. BE_RSE bubbles are related to the register
stack engine, and "front end" bubbles (BACK_END_BUBBLE_FE) are where
there are not enough instructions coming from the "front-end" for the
"back-end" to issue out to functional units in the processor.
In this case, the major cause of problems for the simple loop seems to
be not being able to get enough instructions through to keep the
processor busy. We can try to pin down what is happening in the
front-end:
$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop
   27727 FE_BUBBLE_IMISS
    4910 FE_BUBBLE_TLBMISS
98094189 FE_BUBBLE_BRANCH
52615907 FE_BUBBLE_BUBBLE
$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop-unroll
780729 FE_BUBBLE_IMISS
  4858 FE_BUBBLE_TLBMISS
291620 FE_BUBBLE_BRANCH
111089 FE_BUBBLE_BUBBLE
What we can see here is that the increased code size of the unrolled
version causes us a lot more i-cache misses, as expected. However, the
problem for the unrolled version seems to be the time spent by the
front-end dealing with the branch. This is not surprising since there is
a rumor that there is a front-end buffer which fills up after 3 repeated
cycles with branch instructions; apparently this is a problem addressed
in the new Montecito chips (if anyone has one, I'll update this if you
send me the numbers). Since we are doing 393216000 branches all within a
single bundle its not surprising we've hit it!
Just to confirm, we can break-down the BE_EXE bubbles. Remember,
BE_EXE bubbles are caused by stalls in the execution phase primarily
due to cache latencies or inter-register dependencies. Looking at our
code, we don't re-use the same register over-and-over so we would not
expect to spend a long time waiting for registers, and indeed this is
what we see.
$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop
79601 BE_EXE_BUBBLE_GRALL
  919 BE_EXE_BUBBLE_GRGR
$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop-unroll
76627 BE_EXE_BUBBLE_GRALL
  918 BE_EXE_BUBBLE_GRGR
Therefore in this case, the speed increase isn't coming from better
cache behaviour as we might have expected, but greater ability to keep
the processor busy with instructions between branches.
We can also examine the major events happening with the unrolled case.
Here we see BE_L1D_FPU bubbles being the major overhead. On Itanium
floating-point avoids the L1 cache (it is too big, and not re-used
enough to make it useful) hence the concatenation of the event acronym.
These bubbles can be broken down into the following events (measured in
two runs, since you can only do 4 at a time).
$ pfmon -e be_l1d_fpu_bubble_l1d,be_l1d_fpu_bubble_l1d_dcurecir,\
  be_l1d_fpu_bubble_l1d_tlb,be_l1d_fpu_bubble_l1d_stbufrecir ./loop-unroll
 9623193 BE_L1D_FPU_BUBBLE_L1D_DCURECIR
     414 BE_L1D_FPU_BUBBLE_L1D_TLB
     282 BE_L1D_FPU_BUBBLE_L1D_STBUFRECIR
$ pfmon -e be_l1d_fpu_bubble_l1d_fullstbuf,\
    be_l1d_fpu_bubble_l1d_l2bpress,be_l1d_fpu_bubble_l1d_tlb ./loop-unroll
      23 BE_L1D_FPU_BUBBLE_L1D_FULLSTBUF
17220177 BE_L1D_FPU_BUBBLE_L1D_L2BPRESS
     425 BE_L1D_FPU_BUBBLE_L1D_TLB
These events expose micro-architectural details, the full extent of
which is only gleaned by reading the processor manuals (a PhD in
computer architecture probably helps too!). The store buffers are not an
issue here -- the caches are more than able to keep up with our few
writes. The TLB is also not involved; we have pretty good TLB coverage
thanks to the virtual-linear page table and hardware page-table walker.
The two clues are DCURECIR and L2BPRESS. Essentially DCURECIR happens
when references have to "re-circulate"; this is a side-effect of the
data not being available. The L2BPRESS figure suggests why this is so;
this event happens when the L2 cache can not handle any more requests.
The L2 cache keeps a queue of outstanding requests; when this queue is
full (for example, full of long latency requests to get data from main
memory) the L2 cache causes "back-pressure" which stops the L1 cache
from issuing new requests. This suggests the program is chewing through
a lot of data and not using the cache very effectively, which is exactly
what we are doing. Pre-fetching hints can help (but as described in a
previous entry on hacking
Enigma may not
always help), but in this case there probably isn't any spare bandwidth
to pre-fetch in because essentially no processing is happening.
What does all this mean? I don't know, but playing with performance
monitors is fun!