technovelty

weblog of Ian Wienand


Position Independent Code and x86-64 libraries

If you've ever tried to link non-position independent code into a shared library on x86-64, you should have seen a fairly cryptic error about invalid relocations and missing symbols. Hopefully this will clear it up a little!

Let's start with a small program to illustrate.

$ cat function.c
int global = 100;

int function(int i) {
	return i + global;
}
$ gcc -c function.c

Firstly, inspect the disassembly of this function:

0000000000000000 <function>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	8b 05 00 00 00 00    	mov    0x0(%rip),%eax        # d <function+0xd>
   d:	03 45 fc             	add    -0x4(%rbp),%eax
  10:	c9                   	leaveq
  11:	c3                   	retq

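Let's just go through that for clarity. The first two instructions set up the frame pointer. The third stores the incoming argument (in %edi) into the "red-zone" just below the stack pointer. The mov at offset 7 loads the value of global relative to the instruction pointer, the add sums the two into %eax, and leaveq and retq tear down the frame and return.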

The IP relative move is really the trick here. We know from the code that it has to move the value of the global variable here. The zero value is simply a placeholder: at this point the compiler cannot determine the required address (i.e. how far away from the instruction pointer the memory holding the global variable is). It leaves behind a relocation -- a note that says to the linker "you should determine the correct address of foo (global in our case), and then patch this bit of the code to point to that address (i.e. foo)."

[Figure: Relocations with addend]

The top portion of the image above gives some idea of how it works. We can examine relocations in binaries with the readelf tool.

$ readelf --relocs ./function.o

Relocation section '.rela.text' at offset 0x518 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000009  000800000002 R_X86_64_PC32     0000000000000000 global + fffffffffffffffc

There are many different types of relocations for different situations; the exact rules for different relocation types are described in the ABI documentation for the architecture. The R_X86_64_PC32 relocation is defined as "the symbol value, plus the addend, minus the address of the location being patched" (S + A - P in the ABI's notation). The addend makes it look more tricky than it is; remember that when an instruction is executing, the instruction pointer points to the next instruction to be executed. The four bytes being patched sit at the very end of the mov, so the instruction pointer is four bytes beyond the patched location, and to correctly find the data relative to the instruction pointer we need to subtract that extra four bytes; hence the addend of -4 (fffffffffffffffc). This can be seen more clearly when laid out in a linear fashion (as in the bottom of the above diagram).
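
As a rough sketch of that arithmetic (not how any real linker is written), the patching step can be modelled in a few lines of C. The function names, the load address 0x400000 for the code and the address 0x601000 for global used in the usage note are all made up for illustration; only the offset 0x9 and the addend -4 come from the readelf output above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* R_X86_64_PC32 value: S + A - P (symbol + addend - patched place). */
int64_t pc32_value(uint64_t sym_addr, int64_t addend, uint64_t place)
{
    return (int64_t)sym_addr + addend - (int64_t)place;
}

/* Patch the 4 placeholder bytes at text[offset], assuming (hypothetically)
 * that the section ends up at address text_base. */
void apply_pc32(uint8_t *text, uint64_t text_base, uint64_t offset,
                uint64_t sym_addr, int64_t addend)
{
    int64_t wide = pc32_value(sym_addr, addend, text_base + offset);

    /* The field is only 32 bits wide, so the result must fit in it. */
    assert(wide >= INT32_MIN && wide <= INT32_MAX);

    int32_t disp = (int32_t)wide;
    memcpy(text + offset, &disp, sizeof disp);  /* host byte order */
}

/* What the CPU does at run time: the displacement is added to the
 * address of the *next* instruction. */
uint64_t rip_relative_target(uint64_t next_insn_addr, int32_t disp)
{
    return next_insn_addr + (uint64_t)(int64_t)disp;
}
```

With the made-up addresses above, calling apply_pc32(text, 0x400000, 0x9, 0x601000, -4) stores a displacement such that rip_relative_target(0x400000 + 0xd, disp) lands exactly on 0x601000: the -4 cancels the four bytes between the patched location (0x9) and the next instruction (0xd).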

If you try and build a shared object (dynamic library) with an object file with this type of relocation, you should get something like:

$ gcc -shared function.c
/usr/bin/ld: /tmp/ccQ2ttcT.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
/tmp/ccQ2ttcT.o: could not read symbols: Bad value
collect2: ld returned 1 exit status

The specific problem is how this relocation interacts with Position Independent Code (PIC, enabled with -fPIC). PIC just means that the output binary does not expect to be loaded at a particular base address, but is happy being put anywhere in memory (compare the output of readelf --segments on a binary such as /bin/ls to that of any shared library). This is obviously critical for implementing lazily loaded shared libraries (i.e. loaded only when required), where you may have many libraries loaded in essentially any order. Trying to pre-allocate where in memory they would all live is completely impractical and just does not work (not to mention every single library that might ever be used would be competing for a spot in the limited address space of a 32-bit process!).

What's the specific problem with this relocation in a shared library? In a shared library situation, we can not depend on the local value of global actually being the one we want. Consider the following example, where we override the value of global with a LD_PRELOAD library.

$ cat function.c
int global = 100;

int function(int i) {
	return i + global;
}
$ gcc -fPIC -shared -o libfunction.so function.c

$ cat preload.c
int global = 200;
$ gcc -shared preload.c -o libpreload.so

$ cat program.c
#include <stdio.h>

int function(int i);

int main(void) {
   printf("%d\n", function(10));
}
$ gcc program.c -L. -lfunction -o program

$ LD_LIBRARY_PATH=. ./program
110
$ LD_PRELOAD=libpreload.so LD_LIBRARY_PATH=. ./program
210

If the code in libfunction.so has a fixed offset into its own data section, it will not be able to see the overridden value provided by libpreload.so. This is not the case when building a stand-alone executable, where references are satisfied internally.
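
The override works because the dynamic linker takes the first definition of a name it finds in its search order, and preloaded libraries are searched first. A toy model of that rule might look like the following; the struct layout and the lookup_symbol helper are invented for illustration, and a real dynamic linker uses hash tables, scopes and symbol versioning rather than a linear scan.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified view of a loaded object. */
struct symbol { const char *name; int *addr; };
struct object { const char *soname; const struct symbol *syms; size_t nsyms; };

/* Return the first definition of `name` in search order, or NULL. */
int *lookup_symbol(const struct object *const *search_order, size_t nobjs,
                   const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < nobjs; i++)
        for (size_t j = 0; j < search_order[i]->nsyms; j++)
            if (strcmp(search_order[i]->syms[j].name, name) == 0)
                return search_order[i]->syms[j].addr;
    return NULL;
}

/* Stand-ins for the two definitions of `global` in the example. */
int libfunction_global = 100;
int libpreload_global = 200;

const struct symbol libfunction_syms[] = { { "global", &libfunction_global } };
const struct symbol libpreload_syms[]  = { { "global", &libpreload_global } };

const struct object libfunction = { "libfunction.so", libfunction_syms, 1 };
const struct object libpreload  = { "libpreload.so",  libpreload_syms,  1 };
```

Searching a list containing only libfunction finds the value 100; putting libpreload ahead of it, as LD_PRELOAD does, makes the same lookup find 200 instead.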

Of course, any problem in computer science can be solved with a layer of abstraction, and that is what is done when compiling with -fPIC. To examine this case, let's see what happens with PIC turned on.

$ gcc -fPIC -shared -c  function.c
$ objdump --disassemble ./function.o

./function.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <function>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # e <function+0xe>
   e:	8b 00                	mov    (%rax),%eax
  10:	03 45 fc             	add    -0x4(%rbp),%eax
  13:	c9                   	leaveq
  14:	c3                   	retq

It's almost the same! We set up the frame pointer with the first two instructions as before. We store the first argument in memory in the pre-allocated "red-zone" as before. Then, however, we do an IP relative load of an address into rax. Next we de-reference this into eax (i.e. eax = *rax in C) before adding the incoming argument to it and returning.

$ readelf --relocs ./function.o

Relocation section '.rela.text' at offset 0x550 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000a  000800000009 R_X86_64_GOTPCREL 0000000000000000 global + fffffffffffffffc

The magic here is again in the relocations. Notice this time we have an R_X86_64_GOTPCREL relocation. This says "replace the data at offset 0xa with the IP-relative offset of the global offset table (GOT) entry of global".

[Figure: Global Offset Table operation with data variables]

As shown above, the GOT provides the abstraction required so symbols can be diverted as expected. Each entry is essentially a pointer to the real data (hence the extra dereference in the code above). Since the GOT is at a fixed offset from the program code, the code can use an IP relative address to gain access to the table entries.
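
In C terms, the PIC version of the function behaves roughly like the sketch below. The got array, the GOT_GLOBAL slot number and the bind_got helper are stand-ins invented here to model what the dynamic linker does when it fills in the table; they are not real interfaces.

```c
#include <assert.h>
#include <stddef.h>

enum { GOT_GLOBAL = 0, GOT_SIZE = 1 };  /* hypothetical slot layout */

int local_global = 100;  /* the definition inside the library itself */
int *got[GOT_SIZE];      /* one pointer per global; filled in at load time */

/* Models the dynamic linker resolving a symbol and writing its address
 * into the GOT slot. */
void bind_got(int slot, int *target)
{
    assert(slot >= 0 && slot < GOT_SIZE);
    got[slot] = target;
}

int function_pic(int i)
{
    int *p = got[GOT_GLOBAL];  /* IP relative load of the address into rax */
    return i + *p;             /* eax = *rax, then add the argument */
}
```

Binding the slot to local_global makes function_pic(10) return 110; rebinding it to some other int holding 200 makes the same, unmodified code return 210. The code never changes, only the table entry does.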

This extra reference is obviously slower; however for the most part I imagine the overhead would be essentially immeasurable and is required for "generic" operation. If you have figured the cost of indirection through the GOT is the major bottleneck of your program, I imagine you wouldn't be reading this and would already be considering strategies to remove it!

The next question is why this works on plain old x86-32. Inspecting the code reveals why:

$ objdump --disassemble ./function.o
00000000 <function>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	a1 00 00 00 00       	mov    0x0,%eax
   8:	03 45 08             	add    0x8(%ebp),%eax
   b:	5d                   	pop    %ebp
   c:	c3                   	ret
$ readelf --relocs ./function.o
Relocation section '.rel.text' at offset 0x2ec contains 1 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00000004  00000701 R_386_32          00000000   global

We start out the same, with the first two instructions setting up the frame pointer. However, next we load a value from memory into eax; as we can see from the relocation information, the address it is loaded from is that of global, so the load implicitly dereferences it. Next we add the incoming argument from the stack (0x8(%ebp)) to that value. This provides the abstraction we need -- if the relocation makes the patched address at 0x4 the address of the GOT entry, it will be correctly dereferenced. It is the inability of the x86-32 architecture to optimise by doing instruction-pointer relative offsetting that means it always needs to do slower absolute memory references, which turns out to be just what you want when you're making a shared library!
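
For contrast with the x86-64 case, here is a small sketch of how an absolute R_386_32 relocation is applied. Note the readelf output above has no addend column: on x86-32 the addend is implicit, stored in the bytes being patched. The function name and the address 0x0804a010 in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* R_386_32 value: S + A, where the addend A is read from the 32-bit
 * field being patched (REL-style relocation, no explicit addend). */
void apply_386_32(uint8_t *text, uint32_t offset, uint32_t sym_addr)
{
    assert(text != NULL);

    uint32_t addend;
    memcpy(&addend, text + offset, sizeof addend);  /* implicit addend */

    uint32_t value = sym_addr + addend;
    memcpy(text + offset, &value, sizeof value);    /* host byte order */
}
```

Applied at offset 4 of the function's bytes with a hypothetical symbol address of 0x0804a010, the zero placeholder after the a1 opcode becomes that absolute address. No instruction-pointer arithmetic is involved, which is why the patched value depends on where things are finally loaded.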

So, the executive summary: the ability of x86-64 to use instruction-pointer relative offsetting to data addresses is a nice optimisation, but in a shared-library situation assumptions about the relative location of data are invalid and can not be used. In this case, access to global data (i.e. anything that might be changed around on you) must go through a layer of abstraction, namely the global offset table.

posted at: Wed, 26 Nov 2008 13:53 | in /code/c

Posted by Diego E. "Flameeyes" Pettenò at Fri Nov 28 02:19:11 2008

Very well written and explained, but I think I'll leave a little note here ;)

There is another consideration that has to be made between x86-64 and x86-32 when it comes to shared libraries and PIC/non-PIC code.

Even on x86-32, the default for libtool, and for most distributions, including Debian (which I guess is where you're active, since I read your blog on Planet Debian ;)) and Gentoo (which is where I am active), is to use PIC code for shared objects, because of the very way x86-32 implements non-PIC shared objects.

Since you have to change the actual code of the functions, you have to relocate the content of the .text section, which is where the compiled code of the function lies. This causes two problems.

Since in each process the library may be loaded at a different address, or a symbol might be interposed by another, each process will end up having a possibly different address for the same symbol. This thus requires each process to have its own copy of the patched (relocated) .text section; since a relocated .text section has to be in dirty pages, this dramatically increases the actual memory usage of a program, preventing proper sharing of shared library code.

Also, since you have to change the .text section to do the relocation, it cannot be loaded in read-only pages but it has to be loaded in writeable pages, conflicting with vulnerability mitigation technologies like PaX, W^X and NX (I think SElinux also has something like that).

What is more important to state here is that since the PIC variant on x86-32 requires the use of the ebx register, non-PIC code is preferred, especially by multimedia developers, so that there are more free registers, which allows writing faster code. This explains why for a lot of multimedia libraries, PIC is an opt-in rather than an opt-out.

Also, it's not like x86-64 is the outsider which forces PIC on; PIC-only code for shared libraries was something already very present before introduction of AMD64 on architectures like Alpha, SPARC64 and iirc HP-PA too. So it's more like x86-32 (i386) is the outsider here.

Posted by Branko Badrljica at Mon Dec 8 06:09:31 2008

I still think of this as a dirty and sinful method, looking from the speed perspective.

All this shareability cr*p was introduced in order to save RAM.

But RAM is plentiful these days, while CPU speed is not. Not only that, but these speedbumps can royally scr*w cache filling and prefetching units success rates.

A big part of being fast today depends on how well you can feed your cache units.

I imagine this kind of c*ap with stack games, PLT and GOT tables and whatnot stumbles cache units and so it might be prudent to return to static linkage, at least for some apps...

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.