Position Independent Code and x86-64 libraries

Sat 05 January 2013

A comment pointed out that the original article from 2008 made a few simplifications that were a bit misleading, so I have taken some time to update this. Thanks for the feedback.

If you've ever tried to link non-position independent code into a shared library on x86-64, you should have seen a fairly cryptic error about invalid relocations and missing symbols. Hopefully this will clear it up a little.

Let's start with a small program to illustrate.

int global = 100;

int function(int i) {
    return i + global;
}

Firstly, inspect the disassembley of this function:

$gcc -c function.c

$objdump --disassemble function.o

0000000000000000 <function>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d fc                mov    %edi,-0x4(%rbp)
   7:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # d <function+0xd>
   d:   03 45 fc                add    -0x4(%rbp),%eax
  10:   c9                      leaveq
  11:   c3                      retq

Lets just go through that for clarity:

0,1: save rbp to the stack and save the stack pointer (rsp) to rbp. This common stanza is setting up the frame pointer, which is essentially a rule used by debuggers (mostly) to keep track of the base of the stack. It's not important for now.
4:Move the value from edi to 4 bytes below the stack pointer. This is moving the first argument (int i) into the "red-zone", a 128-byte scratch area each function has reserved below the stack pointer.
7,d: Move the value at offset 0 from the current instruction pointer (rip) into eax (by convention the return value is left in register eax). Then add the incoming argument to it (retrieved from the scratch area); i.e. return global + i

The IP relative move is really the trick here. We know from the code that it has to move the value of the global variable here. The zero value is simply a place holder - the compiler currently does not determine the required address (i.e. how far away from the instruction pointer the memory holding the global variable is). It leaves behind a relocation -- a note that says to the linker "you should determine the correct address of foo (global in our case), and then patch this bit of the code to point to that addresss".

The top portion of the image above gives some idea of how it works. We can examine relocations in binaries with the readelf tool.

$ readelf --relocs ./function.o

Relocation section '.rela.text' at offset 0x518 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000009  000800000002 R_X86_64_PC32     0000000000000000 global + fffffffffffffffc

There are many different types of relocations for different situations; the exact rules for different relocation types are described in the ABI documentation for the architecture. The R_X86_64_PC32 relocation is defined as "the base of the section the symbol is within, plus the symbol value, plus the addend". The addend makes it look more tricky than it is; remember that when an instruction is executing the instruction pointer points to the next instruction to be executed. Therefore, to correctly find the data relative to the instruction pointer, we need to subtract the extra. This can be seen more clearly when layed out in a linear fashion (as in the bottom of the above diagram).

If you try and build a shared object (dynamic library) with an object file with this type of relocation, you should get something like:

$ gcc -shared function.c
/usr/bin/ld: /tmp/ccQ2ttcT.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
/tmp/ccQ2ttcT.o: could not read symbols: Bad value
collect2: ld returned 1 exit status

If you look back at the disassembly, we notice that the R_X86_64_32 relocation has left only 4-bytes (32-bits) of space left for the relocation entry (the zeros in 7: 8b 15 00 00 00 00).

So why does this matter when you're creating a shared library? The first thing to remember is that in a shared library situation, we can not depend on the local value of global actually being the one we want. Consider the following example, where we override the value of global with a LD_PRELOAD library.

$ cat function.c
int global = 100;

int function(int i) {
    return i + global;
}

$ gcc -fPIC -shared -o libfunction.so function.c

$ cat preload.c
int global = 200;

$ gcc -shared preload.c -o libpreload.so

$ cat program.c
#include <stdio.h>

int function(int i);

int main(void) {
   printf("%d\n", function(10));
}

$ gcc -L. -lfunction program.c -o program

$ LD_LIBRARY_PATH=. ./program
110

$ LD_PRELOAD=libpreload.so LD_LIBRARY_PATH=. ./program
210

If the code in libfunction.so were to have a fixed offset into its own data section, it will not be able to be overridden at run-time by the value provided by libpreload.so. Additionally, there are only 4-bytes available to patch in for the address of global -- since a shared library could conceivably be loaded anywhere in the 64-bit (8-byte) address space we therefore need 8-bytes of space to cover ourselves for all possible addresses global might turn up at.

The two basic possibilities for an object file are to be either linked into an executable or linked into a shared-library. In the executable case, the value of global will be in the exectuable's data section, which should definitely be reachable with a 32-bit offset of the current instruction-pointer. The instruction-pointer relative address can simply be patched in and the executable is finalised.

But what about the shared-library case, where we know the value of global could essentially be anywhere within the 64-bit address space? It is possible to leave 8-bytes of space for the address of global, by telling gcc to use the large-code model. e.g.

$ gcc -c -mcmodel=large function.c

$ objdump --disassemble ./function.o

./function.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <function>:
   0:  55                      push   %rbp
   1:  48 89 e5                mov    %rsp,%rbp
   4:  89 7d fc                mov    %edi,-0x4(%rbp)
   7:  48 b8 00 00 00 00 00    movabs $0x0,%rax
   e:  00 00 00
  11:  8b 10                   mov    (%rax),%edx
  13:  8b 45 fc                mov    -0x4(%rbp),%eax
  16:  01 d0                   add    %edx,%eax
  18:  5d                      pop    %rbp
  19:  c3                      retq

However, this creates a problem if you really want to share this code. By having to patch in an address of global directly, this means the run-time code above does not remain unchanged. Two separate processes therefore can't share this code -- they each need separate copies that are identical but for their own addresses of global patched into it.

By enabling Position Independent Code (PIC, with the flag -fPIC) you can ensure the code remains share-able. PIC means that the output binary does not expect to be loaded at a particular base address, but is happy being put anywhere in memory (compare the output of readelf --segments on a binary such as /bin/ls to that of any shared library). This is obviously critical for implementing lazy-loading (i.e. only loaded when required) shared-libraries, where you may have many libraries loaded in essentially any order at any location.

Of course, any problem in computer science can be solved with a layer of abstraction and that is essentially what is done when compiling with -fPIC. To examine this case, let's see what happens with PIC turned on.

$ gcc -fPIC -shared -c  function.c

$ objdump --disassemble ./function.o

./function.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <function>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d fc                mov    %edi,-0x4(%rbp)
   7:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # e <function+0xe>
   e:   8b 00                   mov    (%rax),%eax
  10:   03 45 fc                add    -0x4(%rbp),%eax
  13:   c9                      leaveq
  14:   c3                      retq

It's almost the same! We setup the frame pointer with the first two instructions as before. We push the first argument into memory in the pre-allocated "red-zone" as before. Then, however, we do an IP relative load of an address into rax. Next we de-reference this into eax (e.g. eax = *rax in C) before adding the incoming argument to it and returning.

$ readelf --relocs ./function.o

Relocation section '.rela.text' at offset 0x550 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000a  000800000009 R_X86_64_GOTPCREL 0000000000000000 global + fffffffffffffffc

The magic here is again in the relocations. Notice this time we have a P_X86_64_GOTPCREL relocation. This says "replace the data at offset 0xa with the global offset table (GOT) entry of global.

Global Offset Table operation with data variables

As shown above, the GOT ensures the abstraction required so symbols can be diverted as expected. Each entry is essentially a pointer to the real data (hence the extra dereference in the code above). Since we can assume the GOT is at a fixed offset from the program code within plus or minus 2Gib, the code can use a 32-bit IP relative address to gain access to the table entries.

So, taking a look a the final shared-library binary we see a final offset hard-coded

$ gcc -shared -fPIC -o libfunction.so function.c

$ objdump --disassemble ./libfunction.so

00000000000006b0 <function>:
 6b0:  55                      push   %rbp
 6b1:  48 89 e5                mov    %rsp,%rbp
 6b4:  89 7d fc                mov    %edi,-0x4(%rbp)
 6b7:  48 8b 05 8a 02 20 00    mov    0x20028a(%rip),%rax        # 200948 <_DYNAMIC+0x1d8>
 6be:  8b 10                   mov    (%rax),%edx
 6c0:  8b 45 fc                mov    -0x4(%rbp),%eax
 6c3:  01 d0                   add    %edx,%eax
 6c5:  5d                      pop    %rbp
 6c6:  c3                      retq
 6c7:  90                      nop

Every process who wants to share this code just needs to make sure they have their unique address of global at 0x20028a(%rip). Since each process has a separate address-space, this means they can all have different values for global but share the same code!

Thus the default of the small-code model is sensible. It is exceedingly rare for an executable to need more than 4-byte offsets for a relative access to a variable in it's data region, so using a full 8-byte value would just be a waste of space. Although leaving 8-bytes would allow access to the variable anywhere in the 64-bit address space; when building a shared library, you really want to use -fPIC to ensure the library can actually be shared, which introduces a different relocation and access to data via the GOT.

This should explain why gcc -shared function.c works on x86-32, but does not work on x86-64. Inspecting the code reveals why:

$ objdump --disassemble ./function.o
00000000 <function>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   a1 00 00 00 00          mov    0x0,%eax
   8:   03 45 08                add    0x8(%ebp),%eax
   b:   5d                      pop    %ebp
   c:   c3                      ret
$ readelf --relocs ./function.o
Relocation section '.rel.text' at offset 0x2ec contains 1 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00000004  00000701 R_386_32          00000000   global

We start out the same, with the first two instructions setting up the frame pointer. However, next we load a memory value into eax -- as we can see from the relocation information, the address of global. Next we add the incoming argument from the stack (0x8(%ebp)) to the value in this memory location; implicitly dereferencing it. But since we only have a 32-bit address-space, the 4-bytes allocated is enough to access any possible address. So while this can work, you're not creating position-independent code and hence not enabling code-sharing.

The disadvantage of PIC code is that you require "bouncing" through the GOT, which requires more loads and reads to find an address than directly referencing it. However, if your program is at the point that this is becoming a performance issue you're probably not reading this blog!

Hopefully, this helps clear up that possibly cryptic error message. Further searches around position-independent code, global-offset tables and code-sharing should also yield you more information if it remains unclear.