A little tour of linux-gate.so

A few people have noticed linux-gate.so.1 in the ldd output of binaries built against newer libcs and wondered what it is.

ianw@morrison:~$ ldd /bin/ls
        linux-gate.so.1 =>  (0xffffe000)
        librt.so.1 => /lib/tls/librt.so.1 (0xb7fdb000)
        libacl.so.1 => /lib/libacl.so.1 (0xb7fd5000)
        libc.so.6 => /lib/tls/libc.so.6 (0xb7e9c000)
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0xb7e8a000)
        /lib/ld-linux.so.2 (0xb7feb000)
        libattr.so.1 => /lib/libattr.so.1 (0xb7e86000)

It's actually a shared library that is exported by the kernel to provide a way to make system calls faster. Most architectures have ways of making system calls that are less expensive than taking a full trap: sysenter on Intel x86 processors (AMD provides a similar syscall instruction) and epc on IA64, for example.

If you want the gist of how it works, we can start by pulling it apart. The following program reads and dumps the shared object on an x86 machine. Note that it's just a single kernel page, so you could simply dump getpagesize() bytes if you wanted to, though you can't pass the address directly to write (i.e. you need to memcpy it into a buffer and then write that). Below I pull apart the headers to work out the size.

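#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <elf.h>
#include <alloca.h>

int main(void)
{
  int i;
  unsigned size = 0;
  char *buf;

  /* Address that ldd reported for linux-gate.so.1 on this machine. */
  Elf32_Ehdr *so = (Elf32_Ehdr*)0xffffe000;
  Elf32_Phdr *ph = (Elf32_Phdr*)((char*)so + so->e_phoff);

  /* ELF header plus the program header table ... */
  size += so->e_ehsize + (so->e_phentsize * so->e_phnum);

  /* ... plus the in-memory size of every segment. */
  for (i = 0 ; i < so->e_phnum; i++)
    {
      size += ph->p_memsz;
      ph = (Elf32_Phdr*)((char*)ph + so->e_phentsize);
    }

  /* write() can't read the kernel page directly, so copy it first. */
  buf = alloca(size);
  memcpy(buf, so, size);

  int f = open("./kernel-gate.so", O_CREAT|O_WRONLY|O_TRUNC, S_IRWXU);
  if (f < 0)
    {
      perror("open");
      return 1;
    }

  int w = write(f, buf, size);
  if (w < 0)
    perror("write");
  else
    printf("wrote %d bytes\n", w);

  close(f);
  return w < 0;
}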

At this stage you should have a binary you can look at with, say, readelf.

ianw@morrison:~/tmp$ readelf --symbols ./kernel-gate.so

Symbol table '.dynsym' contains 15 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
  [--snip--]
    11: ffffe400    20 FUNC    GLOBAL DEFAULT    6 __kernel_vsyscall@@LINUX_2.5
    12: 00000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.5
    13: ffffe440     7 FUNC    GLOBAL DEFAULT    6 __kernel_rt_sigreturn@@LINUX_2.5
    14: ffffe420     8 FUNC    GLOBAL DEFAULT    6 __kernel_sigreturn@@LINUX_2.5

__kernel_vsyscall is the function you call to do the fast syscall magic. But I bet you're wondering just how that gets called?

It's easy to see if you poke inside the auxiliary vector that the kernel passes to ld.so, the dynamic loader. There are a couple of ways to look at it: via an environment variable, by peeking into /proc/self/auxv, or, on PowerPC, as the fourth argument to main().

ianw@morrison:~/tmp$ LD_SHOW_AUXV=1 /bin/true
AT_SYSINFO:      0xffffe400
AT_SYSINFO_EHDR: 0xffffe000
AT_HWCAP:    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0x8048034
AT_PHENT:        32
AT_PHNUM:        7
AT_BASE:         0xb7feb000
AT_FLAGS:        0x0
AT_ENTRY:        0x8048960
AT_UID:          1000
AT_EUID:         1000
AT_GID:          1000
AT_EGID:         1000
AT_SECURE:       0
AT_PLATFORM:     i686

Notice how the AT_SYSINFO entry refers to the fast system call function in our kernel shared object? Also notice that the AT_SYSINFO_EHDR entry points to the library itself.

If you poke through the glibc source code and look at how the sysinfo entry is handled, you can see that the dynamic linker will choose to use the library function for system calls if it is available. If that entry is never passed by the kernel, it can fall back to the old way of doing things (int 0x80 on x86).

IA64 works in the same way, although we keep our kernel shared library at 0xa000000000000000. You can see that the shared object is quite an elegant design that allows maximum compatibility across and within architectures, since the calling mechanism is abstracted away from userspace. A 386 can make the call the same way as a Pentium 4 through the library, and the kernel will make sure the appropriate thing is done in __kernel_vsyscall.