RSS | technovelty home | page of ian | ianw@ieee.org
A few people have noticed linux-gate.so.1 in the ldd
output of their binaries with newer libcs and wondered
what it is.
ianw@morrison:~$ ldd /bin/ls
linux-gate.so.1 => (0xffffe000)
librt.so.1 => /lib/tls/librt.so.1 (0xb7fdb000)
libacl.so.1 => /lib/libacl.so.1 (0xb7fd5000)
libc.so.6 => /lib/tls/libc.so.6 (0xb7e9c000)
libpthread.so.0 => /lib/tls/libpthread.so.0 (0xb7e8a000)
/lib/ld-linux.so.2 (0xb7feb000)
libattr.so.1 => /lib/libattr.so.1 (0xb7e86000)
It's actually a shared library that is exported by the kernel to
provide a way to make system calls faster. Most architectures have
ways of making system calls that are less expensive than taking a full
trap; sysenter on x86 (syscall on AMD, I
think) and epc on IA64, for example.
If you want the gist of how it works, first we can pull it apart.
The following program reads and dumps the so on an x86 machine. Note
it's just a kernel page, so you can just dump
getpagesize() bytes should you want to; though you can't
call write directly on it (i.e. you need to
memcpy and then write). Below I pull apart the
headers.
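#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <elf.h>
#include <alloca.h>

int main(void)
{
    int i;
    unsigned size = 0;
    char *buf;
    /* the address ldd reported for linux-gate.so.1 above */
    Elf32_Ehdr *so = (Elf32_Ehdr *)0xffffe000;
    /* step in bytes via char *; arithmetic on void * is a GNU extension */
    Elf32_Phdr *ph = (Elf32_Phdr *)((char *)so + so->e_phoff);

    /* ELF header, program header table, then each segment (a rough total) */
    size += so->e_ehsize + (so->e_phentsize * so->e_phnum);
    for (i = 0; i < so->e_phnum; i++)
    {
        size += ph->p_memsz;
        ph = (Elf32_Phdr *)((char *)ph + so->e_phentsize);
    }

    /* can't write() straight from the kernel page, so copy it first */
    buf = alloca(size);
    memcpy(buf, so, size);

    int f = open("./kernel-gate.so", O_CREAT|O_WRONLY|O_TRUNC, S_IRWXU);
    if (f < 0)
    {
        perror("open");
        return 1;
    }
    int w = write(f, buf, size);
    /* errno is only meaningful when write() has failed */
    printf("wrote %d (%s)\n", w, w < 0 ? strerror(errno) : "Success");
    close(f);
    return w < 0;
}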
At this stage you should have a binary you can look at with, say,
readelf.
ianw@morrison:~/tmp$ readelf --symbols ./kernel-gate.so
Symbol table '.dynsym' contains 15 entries:
Num: Value Size Type Bind Vis Ndx Name
[--snip--]
11: ffffe400 20 FUNC GLOBAL DEFAULT 6 __kernel_vsyscall@@LINUX_2.5
12: 00000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.5
13: ffffe440 7 FUNC GLOBAL DEFAULT 6 __kernel_rt_sigreturn@@LINUX_2.5
14: ffffe420 8 FUNC GLOBAL DEFAULT 6 __kernel_sigreturn@@LINUX_2.5
__kernel_vsyscall is the function you call to do the
fast syscall magic. But I bet you're wondering just how that gets
called?
It's easy if you poke inside the auxiliary vector that the kernel
passes to ld, the dynamic loader. There are a couple
of ways to see it: via an environment flag, by peeking into
/proc/self/auxv, or on PowerPC it is passed as the fourth
argument to main().
ianw@morrison:~/tmp$ LD_SHOW_AUXV=1 /bin/true
AT_SYSINFO: 0xffffe400
AT_SYSINFO_EHDR: 0xffffe000
AT_HWCAP: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
AT_PAGESZ: 4096
AT_CLKTCK: 100
AT_PHDR: 0x8048034
AT_PHENT: 32
AT_PHNUM: 7
AT_BASE: 0xb7feb000
AT_FLAGS: 0x0
AT_ENTRY: 0x8048960
AT_UID: 1000
AT_EUID: 1000
AT_GID: 1000
AT_EGID: 1000
AT_SECURE: 0
AT_PLATFORM: i686
Notice how the AT_SYSINFO entry refers to the fast
system call function in our kernel shared object? Also notice that
the AT_SYSINFO_EHDR entry points to the library itself.
If you start to poke through the glibc source code and
look at how the sysinfo entry is handled you can see the dynamic linker
will choose to use the library function for system calls if it is
available. If that flag is never passed by the kernel it can fall
back to the old way of doing things.
IA64 works in the same way, although we keep our kernel shared
library at 0xa000000000000000. You can see how the
shared object is quite an elegant design that allows maximum
compatibility across and within architectures, since you have
abstracted the calling mechanism away from userspace. A 386 can call
the same way as a Pentium IV through the library and the kernel will
make sure the appropriate thing is done in
__kernel_vsyscall.
posted at: Mon, 15 Aug 2005 15:06 | in /linux

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.