
Position Independent Code and x86-64 libraries

A comment pointed out that the original article from 2008 made a few simplifications that were a bit misleading, so I have taken some time to update this. Thanks for the feedback.

If you've ever tried to link non-position-independent code into a shared library on x86-64, you have probably seen a fairly cryptic error about invalid relocations and missing symbols. Hopefully this will clear it up a little.

Let's start with a small program to illustrate.

int global = 100;

int function(int i) {
    return i + global;
}

First, inspect the disassembly of this function:

$ gcc -c function.c

$objdump --disassemble function.o

0000000000000000 <function>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d fc                mov    %edi,-0x4(%rbp)
   7:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # d <function+0xd>
   d:   03 45 fc                add    -0x4(%rbp),%eax
  10:   c9                      leaveq
  11:   c3                      retq

Let's just go through that for clarity:

  • 0,1: save rbp to the stack and save the stack pointer (rsp) to rbp. This common stanza is setting up the frame pointer, which is essentially a rule used by debuggers (mostly) to keep track of the base of the stack. It's not important for now.
  • 4: Move the value in edi to 4 bytes below the frame pointer (which here equals the stack pointer). This moves the first argument (int i) into the "red-zone", a 128-byte scratch area each function has reserved below the stack pointer.
  • 7,d: Move the value at offset 0 from the current instruction pointer (rip) into eax (by convention the return value is left in register eax). Then add the incoming argument to it (retrieved from the scratch area); i.e. return global + i

The IP-relative move is really the trick here. We know from the code that it has to move the value of the global variable into eax. The zero value is simply a placeholder -- the compiler does not yet determine the required address (i.e. how far away from the instruction pointer the memory holding the global variable is). It leaves behind a relocation -- a note that says to the linker "you should determine the correct address of foo (global in our case), and then patch this bit of the code to point to that address".

Relocations with addend

The top portion of the image above gives some idea of how it works. We can examine relocations in binaries with the readelf tool.

$ readelf --relocs ./function.o

Relocation section '.rela.text' at offset 0x518 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000009  000800000002 R_X86_64_PC32     0000000000000000 global + fffffffffffffffc

There are many different types of relocations for different situations; the exact rules for different relocation types are described in the ABI documentation for the architecture. The R_X86_64_PC32 relocation is defined as "the base of the section the symbol is within, plus the symbol value, plus the addend". The addend makes it look more tricky than it is; remember that when an instruction is executing the instruction pointer points to the next instruction to be executed. Therefore, to correctly find the data relative to the instruction pointer, we need to subtract the extra. This can be seen more clearly when laid out in a linear fashion (as in the bottom of the above diagram).

If you try to build a shared object (dynamic library) from an object file with this type of relocation, you should get something like:

$ gcc -shared function.c
/usr/bin/ld: /tmp/ccQ2ttcT.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
/tmp/ccQ2ttcT.o: could not read symbols: Bad value
collect2: ld returned 1 exit status

If you look back at the disassembly, notice that only 4 bytes (32 bits) of space were left for the linker to patch in the address (the zeros in 7: 8b 05 00 00 00 00).

So why does this matter when you're creating a shared library? The first thing to remember is that in a shared library situation, we cannot depend on the local value of global actually being the one we want. Consider the following example, where we override the value of global with an LD_PRELOAD library.

$ cat function.c
int global = 100;

int function(int i) {
    return i + global;
}

$ gcc -fPIC -shared -o libfunction.so function.c

$ cat preload.c
int global = 200;

$ gcc -shared -o libpreload.so preload.c

$ cat program.c
#include <stdio.h>

int function(int i);

int main(void) {
   printf("%d\n", function(10));
}

$ gcc program.c -o program -L. -lfunction

$ LD_LIBRARY_PATH=. ./program
110

$ LD_PRELOAD=./libpreload.so LD_LIBRARY_PATH=. ./program
210

If the code in libfunction.so were to use a fixed offset into its own data section, it could not be overridden at run-time by the value provided by libpreload.so. Additionally, there are only 4 bytes available to patch in for the address of global -- since a shared library could conceivably be loaded anywhere in the 64-bit (8-byte) address space, we need 8 bytes of space to cover all possible addresses global might turn up at.

The two basic possibilities for an object file are to be either linked into an executable or linked into a shared-library. In the executable case, the value of global will be in the executable's data section, which should definitely be reachable with a 32-bit offset from the current instruction pointer. The instruction-pointer-relative address can simply be patched in and the executable is finalised.

But what about the shared-library case, where we know the value of global could essentially be anywhere within the 64-bit address space? It is possible to leave 8-bytes of space for the address of global, by telling gcc to use the large-code model. e.g.

$ gcc -c -mcmodel=large function.c

$ objdump --disassemble ./function.o

./function.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <function>:
   0:  55                      push   %rbp
   1:  48 89 e5                mov    %rsp,%rbp
   4:  89 7d fc                mov    %edi,-0x4(%rbp)
   7:  48 b8 00 00 00 00 00    movabs $0x0,%rax
   e:  00 00 00
  11:  8b 10                   mov    (%rax),%edx
  13:  8b 45 fc                mov    -0x4(%rbp),%eax
  16:  01 d0                   add    %edx,%eax
  18:  5d                      pop    %rbp
  19:  c3                      retq

However, this creates a problem if you really want to share this code. Because the address of global must be patched directly into the instruction stream, the code is no longer identical for every user of the library. Two separate processes therefore can't share this code -- they each need their own copy, identical but for their own address of global patched into it.

By enabling Position Independent Code (PIC, with the flag -fPIC) you can ensure the code remains share-able. PIC means that the output binary does not expect to be loaded at a particular base address, but is happy being put anywhere in memory (compare the output of readelf --segments on a binary such as /bin/ls to that of any shared library). This is obviously critical for implementing lazy-loading (i.e. only loaded when required) shared-libraries, where you may have many libraries loaded in essentially any order at any location.

Of course, any problem in computer science can be solved with a layer of abstraction and that is essentially what is done when compiling with -fPIC. To examine this case, let's see what happens with PIC turned on.

$ gcc -fPIC -c function.c

$ objdump --disassemble ./function.o

./function.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <function>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d fc                mov    %edi,-0x4(%rbp)
   7:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # e <function+0xe>
   e:   8b 00                   mov    (%rax),%eax
  10:   03 45 fc                add    -0x4(%rbp),%eax
  13:   c9                      leaveq
  14:   c3                      retq

It's almost the same! We set up the frame pointer with the first two instructions as before, and push the first argument into memory in the pre-allocated "red-zone" as before. Then, however, we do an IP-relative load of an address into rax. Next we dereference this into eax (i.e. eax = *rax in C) before adding the incoming argument to it and returning.

$ readelf --relocs ./function.o

Relocation section '.rela.text' at offset 0x550 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000a  000800000009 R_X86_64_GOTPCREL 0000000000000000 global + fffffffffffffffc

The magic here is again in the relocations. Notice this time we have an R_X86_64_GOTPCREL relocation. This says "replace the data at offset 0xa with the global offset table (GOT) entry of global".

Global Offset Table operation with data variables

As shown above, the GOT provides the abstraction required so symbols can be diverted as expected. Each entry is essentially a pointer to the real data (hence the extra dereference in the code above). Since we can assume the GOT is at a fixed offset from the program code, within ±2GiB, the code can use a 32-bit IP-relative address to gain access to the table entries.

So, taking a look at the final shared-library binary, we see the final offset hard-coded:

$ gcc -shared -fPIC -o libfunction.so function.c

$ objdump --disassemble ./libfunction.so

00000000000006b0 <function>:
 6b0:  55                      push   %rbp
 6b1:  48 89 e5                mov    %rsp,%rbp
 6b4:  89 7d fc                mov    %edi,-0x4(%rbp)
 6b7:  48 8b 05 8a 02 20 00    mov    0x20028a(%rip),%rax        # 200948 <_DYNAMIC+0x1d8>
 6be:  8b 10                   mov    (%rax),%edx
 6c0:  8b 45 fc                mov    -0x4(%rbp),%eax
 6c3:  01 d0                   add    %edx,%eax
 6c5:  5d                      pop    %rbp
 6c6:  c3                      retq
 6c7:  90                      nop

Every process that wants to share this code just needs to make sure the GOT entry at 0x20028a(%rip) holds its own address of global. Since each process has a separate address-space, they can all have different values for global but share the same code!

Thus the default small-code model is sensible. It is exceedingly rare for an executable to need more than a 4-byte offset for a relative access to a variable in its data region, so using a full 8-byte value would just be a waste of space. Leaving 8 bytes would allow access to the variable anywhere in the 64-bit address space; but when building a shared library, you really want to use -fPIC to ensure the library can actually be shared, which introduces a different relocation and access to data via the GOT.

This should explain why gcc -shared function.c works on x86-32, but does not work on x86-64. Inspecting the code reveals why:

$ objdump --disassemble ./function.o
00000000 <function>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   a1 00 00 00 00          mov    0x0,%eax
   8:   03 45 08                add    0x8(%ebp),%eax
   b:   5d                      pop    %ebp
   c:   c3                      ret
$ readelf --relocs ./function.o
Relocation section '.rel.text' at offset 0x2ec contains 1 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00000004  00000701 R_386_32          00000000   global

We start out the same, with the first two instructions setting up the frame pointer. However, next we load a memory value into eax -- as we can see from the relocation information, the address of global. Next we add the incoming argument from the stack (0x8(%ebp)) to the value in this memory location, implicitly dereferencing it. But since we only have a 32-bit address-space, the 4 bytes allocated are enough to reach any possible address. So while this can work, you're not creating position-independent code and hence not enabling code-sharing.

The disadvantage of PIC code is that it requires "bouncing" through the GOT: an extra load to find the address of the data rather than referencing it directly. However, if your program is at the point where this is becoming a performance issue, you're probably not reading this blog!

Hopefully, this helps clear up that possibly cryptic error message. Further searches around position-independent code, global-offset tables and code-sharing should also yield you more information if it remains unclear.

Symbol Versions and dependencies

The documentation on ld's symbol versioning syntax is a little bit vague on "dependencies", which it talks about but doesn't give many details on.

Let's construct a small example:

``$ cat foo.c``

#include <stdio.h>

#ifndef VERSION_2
void foo(int f) {
     printf("version 1 called\n");
}
#else
void foo_v1(int f) {
     printf("version 1 called\n");
}
__asm__(".symver foo_v1,foo@VERSION_1");

void foo_v2(int f) {
     printf("version 2 called\n");
}
/* i.e. foo_v2 is really foo@VERSION_2
 * @@ means this is the default version
 */
__asm__(".symver foo_v2,foo@@VERSION_2");
#endif


``$ cat 1.ver``

VERSION_1 {
        global:
                foo;
        local:
                *;
};

``$ cat 2.ver``

VERSION_1 {
        local:
                *;
};

VERSION_2 {
} VERSION_1;


``$ cat main.c``

#include <stdio.h>

void foo(int);

int main(void) {
    foo(0);
    return 0;
}

``$ cat Makefile``

all: v1 v2

libfoo.so.1 : foo.c
        gcc -shared -fPIC -o libfoo.so.1 -Wl,--soname='libfoo.so.1' -Wl,--version-script=1.ver foo.c

libfoo.so.2 : foo.c
        gcc -shared -fPIC -DVERSION_2 -o libfoo.so.2 -Wl,--soname='libfoo.so.2' -Wl,--version-script=2.ver foo.c

v1: main.c libfoo.so.1
        ln -sf libfoo.so.1 libfoo.so
        gcc -Wall -o v1 -lfoo -L. -Wl,-rpath=. main.c

v2: main.c libfoo.so.2
        ln -sf libfoo.so.2 libfoo.so
        gcc -Wall -o v2 -lfoo -L. -Wl,-rpath=. main.c

.PHONY: clean
clean:
        rm -f libfoo* v1 v2

``$ ./v1``

version 1 called

``$ ./v2``

version 2 called

In words, we create two libraries; a version 1 and a version 2, where we provide a new version of foo in the version 2 library. The soname is set in the libraries, so v1 and v2 can distinguish the correct library to use.

In the updated 2.ver version, we say that VERSION_2 depends on VERSION_1. So, the question is, what does this mean? Does it have any effect?

We can examine the version descriptors in the library and see that there is indeed a relationship recorded there.

``$ readelf --version-info ./libfoo.so.2``

Version definition section '.gnu.version_d' contains 3 entries:
  Addr: 0x0000000000000264  Offset: 0x000264  Link: 5 (.dynstr)
  000000: Rev: 1  Flags: BASE   Index: 1  Cnt: 1  Name: libfoo.so.2
  0x001c: Rev: 1  Flags: none  Index: 2  Cnt: 1  Name: VERSION_1
  0x0038: Rev: 1  Flags: none  Index: 3  Cnt: 2  Name: VERSION_2
  0x0054: Parent 1: VERSION_1

Looking at the specification we can see that each version definition has a vd_aux field which is a linked list of, essentially, strings that give "the version or dependency name". This is a little vague for a specification, however it appears to mean that the first entry is the name of the version specification, and any following elements are your dependencies. At least, this is how readelf interprets it when it shows you the "Parent" field in the output above.

This implies something the ld documentation doesn't mention: you may list multiple dependencies for a version node. That does work, and readelf will just report more parents if you try it.
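
For example, a hypothetical version node with two dependencies would look like this in the version script (VERSION_3 and bar are invented for illustration):

```
VERSION_3 {
        global:
                bar;
} VERSION_2 VERSION_1;
```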

So the question is, what does this dependency actually do? Well, as far as I can tell, nothing really. The dynamic loader doesn't look at the dependency information; and doesn't have any need to — it is looking to resolve something specific, foo@VERSION_2 for example, and doesn't really care that VERSION_1 even exists.

ld does enforce the dependency, in that if you specify a dependent node but leave it out or accidentally erase it, the link will fail. However, it doesn't really convey anything other than its intrinsic documentation value.

How much slower is cross compiling to your own host?

The usual case for cross-compiling is that your target is so woefully slow and under-powered that you would be insane to do anything else.

However, sometimes for one of the best reasons of all, "historical reasons", you might ship a 64-bit product but support building on 32-bit hosts, and thus cross-compile even on a very fast architecture like x86. How much does this cost, even though almost everyone is running the 32-bit cross-compiler on a modern 64-bit machine?

To test, I got a 32-bit cross and a 64-bit native x86_64 compiler and toolchain; in this case based on gcc-4.1.2 and binutils 2.17. I then did an allyesconfig build of a Linux 2.6.33 x86_64 kernel three times with the cross toolchain and three times with the native one, and timed each build.
So, all up, the native build was ~7% faster -- that is roughly what it costs to build your 64-bit code on a 64-bit machine with a 32-bit cross-compiler.

What exactly does -Bsymbolic do? -- update

Some time ago I wrote a description of the -Bsymbolic linker flag which could do with some further explanation. The original article is a good starting place.

One interesting point that I didn't go into was the potential for code optimisation -Bsymbolic brings about. I'm not sure if I missed that at the time or the toolchain has changed -- both are probably equally likely!

Let me recap the example...

ianw@jj:/tmp/bsymbolic$ cat Makefile
all: test test-bsymbolic

clean:
        rm -f *.so test test-bsymbolic

liboverride.so : liboverride.c
        $(CC) -Wall -O2 -shared -fPIC -o $@ $<

libtest.so : libtest.c
        $(CC) -Wall -O2 -shared -fPIC -o $@ $<

libtest-bsymbolic.so : libtest.c
        $(CC) -Wall -O2 -shared -fPIC -Wl,-Bsymbolic -o $@ $<

test : test.c
        $(CC) -Wall -O2 -L. -Wl,-rpath=. -ltest -o $@ $<

test-bsymbolic : test.c
        $(CC) -Wall -O2 -L. -Wl,-rpath=. -ltest-bsymbolic -o $@ $<

$ cat liboverride.c
#include <stdio.h>

int foo(void)
{
    printf("override foo called\n");
    return 0;
}

$ cat libtest.c
#include <stdio.h>

int foo(void) {
    printf("libtest foo called\n");
    return 1;
}

int test_foo(void) {
    return foo();
}

$ cat test.c
#include <stdio.h>

int test_foo(void);

int main(void)
{
    printf("%d\n", test_foo());
    return 0;
}

In words: libtest.so provides test_foo(), which calls foo() to do the actual work. libtest-bsymbolic.so is the same library simply built with the flag in question, -Bsymbolic. liboverride.so provides an alternative version of foo() designed to override the original via an LD_PRELOAD of the library.

test is built against libtest.so, test-bsymbolic against libtest-bsymbolic.so.

Running the examples, we can see that the LD_PRELOAD does not override the symbol in the library built with -Bsymbolic.

$ ./test
libtest foo called
1

$ ./test-bsymbolic
libtest foo called
1

$ LD_PRELOAD=./liboverride.so ./test
override foo called
0

$ LD_PRELOAD=./liboverride.so ./test-bsymbolic
libtest foo called
1

There are a couple of things going on here. Firstly, you can see that the SYMBOLIC flag is set in the dynamic section, leading to the dynamic linker behaviour I explained in the original article:

ianw@jj:/tmp/bsymbolic$ readelf --dynamic ./libtest-bsymbolic.so

Dynamic section at offset 0x550 contains 22 entries:
  Tag        Type                         Name/Value
 0x00000001 (NEEDED)                     Shared library: [libc.so.6]
 0x00000010 (SYMBOLIC)                   0x0

However, there is also an effect on generated code. Have a look at the PLTs:

$ objdump --disassemble-all ./libtest.so
Disassembly of section .plt:

[... blah ...]

0000039c <foo@plt>:
 39c:   ff a3 10 00 00 00       jmp    *0x10(%ebx)
 3a2:   68 08 00 00 00          push   $0x8
 3a7:   e9 d0 ff ff ff          jmp    37c <_init+0x30>
$ objdump --disassemble-all ./libtest-bsymbolic.so
Disassembly of section .plt:

00000374 <__gmon_start__@plt-0x10>:
 374:   ff b3 04 00 00 00       pushl  0x4(%ebx)
 37a:   ff a3 08 00 00 00       jmp    *0x8(%ebx)
 380:   00 00                   add    %al,(%eax)

00000384 <__gmon_start__@plt>:
 384:   ff a3 0c 00 00 00       jmp    *0xc(%ebx)
 38a:   68 00 00 00 00          push   $0x0
 38f:   e9 e0 ff ff ff          jmp    374 <_init+0x30>

00000394 <puts@plt>:
 394:   ff a3 10 00 00 00       jmp    *0x10(%ebx)
 39a:   68 08 00 00 00          push   $0x8
 39f:   e9 d0 ff ff ff          jmp    374 <_init+0x30>

000003a4 <__cxa_finalize@plt>:
 3a4:   ff a3 14 00 00 00       jmp    *0x14(%ebx)
 3aa:   68 10 00 00 00          push   $0x10
 3af:   e9 c0 ff ff ff          jmp    374 <_init+0x30>

Notice the difference? There is no PLT entry for foo() when -Bsymbolic is used.

Effectively, the toolchain has noticed that foo() can never be overridden and optimised out the PLT call for it. This is analogous to using "hidden" attributes for symbols, which I have detailed in another article on symbol visibility attributes (which also goes into PLTs, if the above meant nothing to you).

So -Bsymbolic does have some more side-effects than just setting a flag to tell the dynamic linker about binding rules -- it can actually result in optimised code. However, I'm still struggling to find good use-cases for -Bsymbolic that can't be better done with Version scripts and visibility attributes. I would certainly recommend using these methods if at all possible.

Thanks to Ryan Lortie for comments on the original article.

Relocation truncated to fit - WTF?

If you code for long enough on x86-64, you'll eventually hit an error such as:

(.text+0x3): relocation truncated to fit: R_X86_64_32S against symbol `array' defined in foo section in ./pcrel8.o

Here's a little example that might help you figure out what you've done wrong.

Consider the following code:

$ cat foo.s
.globl foovar
  .section   foo, "aw",@progbits
  .type foovar, @object
  .size foovar, 4
foovar:
   .long 0

.text
.globl _start
  .type _start, @function
_start:
  movq $foovar, %rax

In case it's not clear, that would look something like:

int foovar = 0;

void function(void) {
  int *bar = &foovar;
}

Let's build that code, and see what it looks like

$ gcc -c foo.s

$ objdump --disassemble-all ./foo.o

./foo.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_start>:
   0:        48 c7 c0 00 00 00 00   mov    $0x0,%rax

Disassembly of section foo:

0000000000000000 <foovar>:
   0:        00 00          add    %al,(%rax)

We can see that the mov instruction has only allocated 4 bytes (00 00 00 00) for the linker to put in the address of foovar. If we check the relocations:

$ readelf --relocs ./foo.o

Relocation section '.rela.text' at offset 0x3a0 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000003  00050000000b R_X86_64_32S      0000000000000000 foovar + 0

The R_X86_64_32S relocation is indeed only a 32-bit relocation. Now we can tickle this error. Consider the following linker script (saved here as, say, big.lds), which puts the foo section about 5 gigabytes away from the code.

$ cat big.lds
SECTIONS
{
 . = 10000;
 .text : { *(.text) }
 . = 5368709120;
 .data : { *(foo) }
}

This now means that we can not fit the address of foovar inside the space allocated by the relocation. When we try it:

$ ld -T big.lds ./foo.o
./foo.o: In function `_start':
(.text+0x3): relocation truncated to fit: R_X86_64_32S against symbol `foovar' defined in foo section in ./foo.o

What this means is that the full 64-bit address of foovar, which now lives somewhere above 5 gigabytes, can't be represented within the 32-bit space allocated for it.

For code optimisation purposes, the default immediate size to the mov instructions is a 32-bit value. This makes sense because, for the most part, programs can happily live within a 32-bit address space, and people don't do things like keep their data so far away from their code it requires more than a 32-bit address to represent it. Defaulting to using 32-bit immediates therefore cuts the code size considerably, because you don't have to make room for a possible 64-bit immediate for every mov.

So, if you want to really move a full 64-bit immediate into a register, you want the movabs instruction. Try it out with the code above - with movabs you should get a R_X86_64_64 relocation and 64-bits worth of room to patch up the address, too.

If you're seeing this and you're not hand-coding, you probably want to check out the -mcmodel argument to gcc.

Conditional Operator Harmful ? yes : no

It seems existing versions of gcc don't warn when you use an assignment as a truth value as the first operand to the conditional operator. For example:

int fn(int i)
{
    return (i = 0x20) ? 0x40 : 0x80;
}

Presumably you meant ==, but unfortunately that does not warn with gcc (maybe it will one day).

We can use the tree dumping mechanisms (previously mentioned here) to confirm our problem. For the function above, gcc creates:

fn (i)
{
  int D.1541;
  int iftmp.0;

  i = 32;
  if (1)
    {
      iftmp.0 = 64;
    }
  else
    {
      iftmp.0 = 128;
    }
  D.1541 = iftmp.0;
  return D.1541;
}

As you can see, this isn't the result we wanted. This can be quite nasty if you use the conditional operator in a hidden fashion; for example behind an assert. A common idiom is:

#define ASSERT(x) (x) ? : fail()

/* programmer accidentally substitutes = for == */
ASSERT(x = y);

You'll currently get no warning you've slipped up and used the assert wrong. Moral of the story: a quick audit of your code might turn up some surprising uses! However, in this case, the Intel compiler picks this one up:

$ icc -c test.c
test.c(3): warning #187: use of "=" where "==" may have been intended
       return (i = 0x20) ? 0x40 : 0x80;

Although it's often not trivial to move to another compiler, as a rule I would recommend trying alternative compilers for your code as it's a great sanity check.

Checking users of typedefs

If you poke around the Linux VM layers you might wonder why there is a STRICT_MM_TYPECHECKS option. It is there to stop programmers touching opaque types. Linus has said many times that a typedef should only be made for a completely opaque type. This therefore implies that rather than operate on it directly, there is some sort of "API" to get at the values.

A simple example is below. my_long_t is opaque; the only way you should use its value is via the my_val accessor.

But how do we enforce this, when the type is really just an unsigned long? We wrap it up in something else, in this case a struct.

#ifndef STRICT
  typedef unsigned long my_long_t;
# define my_val(x) (x)
#else
  typedef struct { unsigned long t; } my_long_t;
# define my_val(x) ((x).t)
#endif

my_long_t my_naughty_function(my_long_t input)
{
        return input * 1000;
}

We can now see the outcome

$ gcc -c strict.c
$ gcc -DSTRICT -c strict.c
strict.c: In function 'my_naughty_function':
strict.c:15: error: invalid operands to binary *

So now we have an error where we do the wrong thing, rather than no notification at all! This niftily moves you up from "bug via visual inspection" to "won't compile" (Rusty Russell talked about this scale once -- does anyone have a reference to that talk?). Why not just leave it like this then? Well, it might confuse the compiler; if you have rules about always passing structures on the stack, for example, the above will suck. In general, it's best to hide as little from the compiler as possible, to give it the best chance of doing a good job for you.

GCC Trampolines

Paul Wayper writes about trampolines. This is something I've come across before, when I was learning about function pointers with IA64.

Nested functions are a strange contraption, and probably best avoided (people who should know agree). For example, here is a completely obfuscated way to double a number.

#include <stdio.h>

int function(int arg) {

        int nested_function(int nested_arg) {
                return arg + nested_arg;
        }

        return nested_function(arg);
}

int main(void)
{
        printf("%d\n", function(10));
        return 0;
}

Let's disassemble that on IA64 (because it is so much easier to understand).

0000000000000000 <function>:
   0:   00 10 15 08 80 05       [MII]       alloc r34=ar.pfs,5,4,0
   6:   c0 80 33 7e 46 60                   adds r12=-16,r12
   c:   04 08 00 84                         mov r35=r1
  10:   01 00 00 00 01 00       [MII]       nop.m 0x0
  16:   10 02 00 62 00 80                   mov r33=b0
  1c:   04 00 01 84                         mov r36=r32;;
  20:   0a 70 40 18 00 21       [MMI]       adds r14=16,r12;;
  26:   00 00 39 20 23 e0                   st4 [r14]=r32
  2c:   01 70 00 84                         mov r15=r14
  30:   10 00 00 00 01 00       [MIB]       nop.m 0x0
  36:   00 00 00 02 00 00                   nop.i 0x0
  3c:   48 00 00 50                         br.call.sptk.many b0=70 <nested_function>
  40:   0a 08 00 46 00 21       [MMI]       mov r1=r35;;
  46:   00 00 00 02 00 00                   nop.m 0x0
  4c:   20 02 aa 00                         mov.i ar.pfs=r34
  50:   00 00 00 00 01 00       [MII]       nop.m 0x0
  56:   00 08 05 80 03 80                   mov b0=r33
  5c:   01 61 00 84                         adds r12=16,r12
  60:   11 00 00 00 01 00       [MIB]       nop.m 0x0
  66:   00 00 00 02 00 80                   nop.i 0x0
  6c:   08 00 84 00                         br.ret.sptk.many b0;;

0000000000000070 <nested_function>:
  70:   0a 40 00 1e 10 10       [MMI]       ld4 r8=[r15];;
  76:   80 00 21 00 40 00                   add r8=r32,r8
  7c:   00 00 04 00                         nop.i 0x0
  80:   11 00 00 00 01 00       [MIB]       nop.m 0x0
  86:   00 00 00 02 00 80                   nop.i 0x0
  8c:   08 00 84 00                         br.ret.sptk.many b0;;

Note that nothing funny happens there at all. The nested function looks exactly like a real function, but we don't do any of the alloc or anything to get us a new stack frame -- we're using the stack frame of the old function. No trampoline involved, just a straight branch.

So why do we need a trampoline? Because a nested function has an extra property over a normal function; the stack pointer must be set to be the stack pointer of the function it is nested within. This means that if we were to pass around pointers to the nested function, we would have to keep track that this isn't a pointer to just any old function, but a pointer to special function that needs its stack setup to come from the nested function. C doesn't work like this, there is only one type "pointer to function".

The easiest way is therefore to create a little function that sets the stack to the right place and calls the code. This "trampoline" is a normal function, so no need for trickery. For example, we can modify this to use a function pointer to the nested function:

#include <stdio.h>

int function(int arg) {

        int nested_function(int nested_arg) {
                return arg + nested_arg;
        }

        int (*function_pointer)(int arg) = nested_function;

        return function_pointer(arg);
}

int main(void)
{
        printf("%d\n", function(10));
        return 0;
}

We can see that now we are calling through a function pointer, we go through __ia64_trampoline, which is a little code stub in libgcc.a which loads up the right stack pointer and calls the "nested" function.

    .prologue 12, 33
    .save ar.pfs, r34
    alloc r34 = ar.pfs, 1, 3, 1, 0
    .fframe 48
    adds r12 = -48, r12
    nop 0
    addl r15 = @ltoffx(__ia64_trampoline#), r1
    mov r35 = r1
    .save rp, r33
    mov r33 = b0
    nop 0
    adds r8 = 24, r12
    adds r14 = 16, r12
    ld8.mov r15 = [r15], __ia64_trampoline#
    st4 [r14] = r32
    nop 0
    mov r14 = r8
    st8 [r14] = r15, 8
    adds r15 = 40, r12
    st8 [r14] = r15, 8
    nop 0
    addl r15 = @ltoff(@fptr(nested_function.1814#)), gp
    ld8 r15 = [r15]
    st8 [r14] = r15, 8
    adds r15 = 16, r12
    st8 [r14] = r15
    ld8 r14 = [r8], 8
    nop 0
    ld4 r36 = [r15]
    nop 0
    mov b6 = r14
    ld8 r1 = [r8]
    nop 0
    br.call.sptk.many b0 = b6
    mov r1 = r35
    mov ar.pfs = r34
    mov b0 = r33
    nop 0
    .restore sp
    adds r12 = 48, r12
    br.ret.sptk.many b0
    .endp function#
    .section    .rodata.str1.8,"aMS",@progbits,1
    .align 8

So, the summary is: if you take the address of a nested function, then to avoid needing different pointer types for nested and non-nested functions, gcc inserts the stub "trampoline" code.