Comparing parts of the 386, AMD64 and IA64 ABIs

Apart from the obvious 32 versus 64 bit distinction between 386 and AMD64, there are two other interesting comparisons: parameter passing conventions and position independent code conventions.

Parameter passing: on 386, parameters are passed via the stack. On AMD64, the first six "integer" arguments (basically anything that fits in a 64 bit register) are passed in registers; similarly, some floating point arguments are passed in SSE registers. Only after these registers are exhausted is data passed on the stack. On IA64, the first 8 arguments are passed in registers, whilst the rest are put on the stack.
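To make the AMD64 case concrete, here is a small sketch. The function names sum8 and scale are made up for illustration, and the register names in the comments are those the System V AMD64 ABI assigns; an optimising compiler may inline the calls so that the registers are never actually loaded.

```c
/* Hypothetical example: eight integer arguments.
 * Under the System V AMD64 convention a..f travel in
 * %rdi, %rsi, %rdx, %rcx, %r8 and %r9; g and h go on the stack.
 * On 386 all eight would be on the stack. */
long sum8(long a, long b, long c, long d,
          long e, long f, long g, long h)
{
    return a + b + c + d + e + f + g + h;
}

/* Floating point arguments are counted separately from integer ones:
 * x and y travel in %xmm0 and %xmm1, while n still gets the first
 * integer register, %rdi. */
double scale(double x, long n, double y)
{
    return x * (double)n + y;
}
```

Compiling this with gcc -S and reading the assembly for a caller is an easy way to check which arguments end up in registers on a given machine.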

On both AMD64 and IA64 there is an extra area at the bottom of the current stack frame: a 16 byte "scratch area" on IA64 and a 128 byte "red zone" on AMD64. I would suggest that the IA64 scratch area is smaller because of register windowing, which AMD64 does not support. On both architectures this area is reserved and is not modified by signal or interrupt handlers. "Leaf functions" (functions that do not call other functions) can use this area as their entire stack frame, saving considerable overhead.
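As a rough illustration, the following is the kind of function that qualifies. The name sum_of_squares is hypothetical, and whether the locals really land in the red zone depends on the compiler and its flags; with optimisation turned on they will most likely live in registers only.

```c
/* Hypothetical leaf function: it calls nothing else, so on AMD64 the
 * compiler is allowed to keep tmp[] in the red zone below the stack
 * pointer without adjusting the stack pointer at all. */
int sum_of_squares(int a, int b, int c)
{
    int tmp[3];          /* small locals, candidates for the red zone */

    tmp[0] = a * a;
    tmp[1] = b * b;
    tmp[2] = c * c;
    return tmp[0] + tmp[1] + tmp[2];
}
```

Comparing the assembly gcc produces with and without -mno-red-zone should show the difference: without the red zone, the function has to move the stack pointer to make room for its locals.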

Varargs functions cause some confusion on AMD64 and IA64, since arguments might be integers or might be floats, meaning they should be passed in general or float/SSE registers respectively. On AMD64, functions known to be varargs functions should have a prologue that saves all register arguments to a "register save area" with a known layout (the caller also passes an upper bound on the number of floating point arguments, so the prologue can avoid saving unnecessary registers). Then, as you use the va_arg macro to step through the arguments, you grab them from the register save area. On IA64, you assume that the first 8 arguments are passed in via registers, and save these registers to your scratch area (2 registers) and 48 bytes of your stack (the remaining 6 registers). This means all your arguments are stacked together (the incoming parameter list sits up against the scratch area) and va_arg can simply "walk" upwards.
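From the C side all of this is hidden behind the stdarg.h macros. The sketch below uses a made-up function, sum_mixed, with a made-up format string convention ('i' for int, 'd' for double) to show a varargs function pulling a mix of integer and floating point arguments; where each va_arg actually reads from is the ABI's business, as described above.

```c
#include <stdarg.h>

/* Hypothetical varargs function. fmt describes the arguments that
 * follow: 'i' means an int, 'd' means a double. */
double sum_mixed(const char *fmt, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0'; p++) {
        if (*p == 'i')
            total += va_arg(ap, int);     /* integer class argument */
        else if (*p == 'd')
            total += va_arg(ap, double);  /* floating point class argument */
    }
    va_end(ap);
    return total;
}
```

A call such as sum_mixed("idi", 1, 2.5, 3) interleaves the two classes, which is exactly the case the register save area (AMD64) or the contiguous spill (IA64) has to cope with.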

Undefined functions (those called without a prototype in scope) are a bit more tricky. IA64 suggests that if a float is passed to a function whose parameters are undeclared, it should be copied to both the first general purpose register and the first floating point register, just to be safe. AMD64 doesn't seem to make such assumptions for you. For example, on IA64:

ianw@lime:/tmp$ cat function.c
void function(float f)
{
        printf("%f\n", f);
}
ianw@lime:/tmp$ cat test.c
extern function();

int main(void)
{
        float f = 10000.01;
        function(f);
}
ianw@lime:/tmp$ gcc -o test test.c function.c
ianw@lime:/tmp$ ./test
10000.009766

The same code on AMD64 prints 0.
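The portable way out is to give the caller a prototype, so it knows the parameter really is a float. The sketch below is a variation on the example above rather than the original code: format_float is a made-up name, and it writes into a buffer instead of calling printf directly so that the result is easy to inspect.

```c
#include <stdio.h>

/* With this declaration in scope the caller knows the third parameter
 * is a float and passes it the way the callee expects. */
void format_float(char *buf, size_t len, float f);

void format_float(char *buf, size_t len, float f)
{
    /* f is promoted to double here, which is what %f expects. */
    snprintf(buf, len, "%f", f);
}
```

Called with the same 10000.01 value, this no longer depends on how a particular ABI treats calls to undeclared functions.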

IP relative addressing: Position Independent Code (PIC) is code that can be loaded anywhere in memory and still work. This is important because a shared library may not always be loaded at the same address, since other shared libraries might be loaded before or after it. To maintain position independence you can't rely on the absolute address of any code or data (because it might change), so you add a layer of indirection to your references. In Linux/ELF land this is done with a Global Offset Table (GOT).

You can think of the GOT as a big two-column list that holds each symbol and its "real address". Thus, instead of referring to the symbol directly, you load its address from the GOT, and then use that address to find the real thing.

Note that you always know the relative address of the GOT, because although the base address might change, the distance between your code and the GOT will not. This means that if you need to load an address from the GOT, the easiest way is to load it via an offset from the current instruction to the GOT entry. The compiler knows the offset of the current instruction (note it can't know the current instruction's address, because the binary might be anywhere in memory), so it wants to say "load the address at (CURRENT_INSTRUCTION + OFFSET_TO_GOT_ENTRY)".
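The double load can be modelled in plain C. This is only a toy: a real GOT holds bare addresses filled in by the dynamic loader, and every name here (got_entry, GOT_SLOT_COUNTER, read_counter, relocate_counter) is invented for the illustration. The fixed slot index plays the role of the link-time offset to the GOT entry.

```c
/* Toy model of the two-column view of the GOT. */
struct got_entry {
    const char *symbol;   /* first column: the symbol name */
    void *address;        /* second column: its "real address" */
};

static int other = 7;     /* stand-ins for data living in other libraries */
static int counter = 42;

static struct got_entry got[] = {
    { "other",   &other   },
    { "counter", &counter },
};

/* Known at link time, like the offset from the code to the GOT entry. */
enum { GOT_SLOT_COUNTER = 1 };

int read_counter(void)
{
    int *p = got[GOT_SLOT_COUNTER].address;  /* load 1: address from the GOT */
    return *p;                               /* load 2: the real thing */
}

/* Stands in for the loader placing the symbol somewhere else. */
void relocate_counter(int *new_home)
{
    got[GOT_SLOT_COUNTER].address = new_home;
}
```

Note that read_counter never changes when the symbol moves; only the table entry does, which is the whole point of the indirection.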

386 just can't do this; there is no way to address data relative to the current instruction pointer. The only way to do it is to keep a pointer to the GOT in a register (%ebx) and offset from that. This wastes a whole register, and when you only have a few, as on the 386, this is a big killer.

AMD64 fixes this and allows you to offset from the current instruction pointer. This frees up a register, and changes the ABI by removing the distinction between the Absolute PLT and PIC PLT.

The PLT (Procedure Linkage Table) is a further enhancement that facilitates lazy binding. The PLT consists of "stubs" that point to a fix-up function in the dynamic loader. At first, the GOT entry for each function points to the PLT entry for that function.

When you call the function you don't go directly to it; you load its address via the GOT and then jump to that address. As mentioned, at first this points to the PLT stub. The stub calls the lookup function in the dynamic loader, which goes off and finds the real function (this might actually be in another shared library that needs to be loaded, for example). As arguments to this lookup function you pass the name of the function you're looking for (obviously) and the GOT entry used by the original call. The dynamic loader finds the function, and then additionally fixes up the GOT entry so that it no longer points to the PLT stub but directly to the required function. This means the next time you load from the GOT, you get the direct address of the function without the overhead of the PLT stub.
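The same dance can be sketched with function pointers. This is a model of the idea, not of how any real dynamic loader is written: resolve stands in for the loader's lookup function, got_square for a single GOT entry, and plt_stub_square for the PLT stub, all of them hypothetical names.

```c
typedef int (*func_t)(int);

static int resolver_calls = 0;      /* counts trips through the "loader" */

static int real_square(int x) { return x * x; }

static int plt_stub_square(int x);

/* The GOT entry for "square" starts out pointing at the PLT stub. */
static func_t got_square = plt_stub_square;

/* Stand-in for the dynamic loader's lookup function: given a name and
 * the GOT entry of the original call, find the function and patch the
 * entry to point straight at it. */
static func_t resolve(const char *name, func_t *got_slot)
{
    (void)name;                     /* a real loader would search by name */
    resolver_calls++;
    *got_slot = real_square;
    return real_square;
}

static int plt_stub_square(int x)
{
    func_t target = resolve("square", &got_square);
    return target(x);               /* finish the original call */
}

/* What a call site does: load from the GOT, then jump there. */
int call_square(int x) { return got_square(x); }

int resolver_call_count(void) { return resolver_calls; }
```

Calling call_square twice goes through resolve only the first time; the second call finds the patched entry and lands on real_square directly.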

IA64, which also allows IP relative addressing, similarly has no distinction between absolute and PIC PLTs.