(one reason) why gcc has a hard time on IA64

James Wilson was kind enough to point me to some of the problems that gcc has with the peculiarities of IA64. Take, for example, the following little function:

void function(int *a, int *b)
{
    *a = 3;
    *b = 4;
}

Now let's inspect the code that gcc generates for this on 386 and IA64:

--- 386 ---
push   %ebp
mov    %esp,%ebp
mov    0x8(%ebp),%edx
movl   $0x3,(%edx)
mov    0xc(%ebp),%edx
movl   $0x4,(%edx)
pop    %ebp
ret

--- ia64 ---
[MMI]       mov r2=3;;
            st4 [r32]=r2
            mov r2=4;;
[MIB]       st4 [r33]=r2
            nop.i 0x0
            br.ret.sptk.many b0;;

Now those of you who know IA64 assembly will see that this function takes three cycles when it really only needs to take two. The ;; signifies a stop, which marks a dependency -- i.e. after the mov into r2 you need a stop before the st4 can store r2 to the address held in r32. This is called a "read after write" dependency; you have to tell the processor about it so that the pipeline gets things right. On x86 you don't have to worry about that, as internally the processor breaks instructions down into smaller RISC-like operations and will schedule the two stores independently -- the whole point of IA64 and EPIC is that the compiler should figure this out for you.

What we could do is load the two constants into separate registers (remember, we have plenty) and only then stop, saving a cycle. Indeed, the Intel compiler does as good a job as one could do by hand:

[MII]       alloc r8=ar.pfs,2,2,0
            mov r3=3
            mov r2=4;;
[MMB]       st4 [r32]=r3
            st4 [r33]=r2
            br.ret.sptk.many b0;;

Now, if you're really good (like James, who did this) you can inspect the output of the gcc optimisation passes (via the -d flags) and realise that at first gcc gets it right, putting the two movs before a single stop. But then the register optimisation pass jumps in and tries to reduce the lifespan of registers holding constants, breaking up the instructions. On x86 you want this -- it reduces register pressure at the cost of instruction level parallelism, a trade-off you are willing to make. On IA64, with plenty of registers and a reliance on ILP to run fast, it hurts.

People are starting to look at issues like this; but if you've ever peered under the hood of gcc you'll know that it's not for the faint of heart!