Tuesday, April 3, 2018

How does a system call (e.g. fork()) work?

(originally published at https://stackoverflow.com/a/11570572/371250)

libc's implementations of fork() and other system calls contain special processor instructions that trap into the kernel. How a system call is invoked is architecture-specific, and the topic can get quite complex.
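
As a rough user-space illustration of the "system call number plus arguments, then trap" pattern, glibc also exposes a generic syscall() function that takes the system call number explicitly. The sketch below assumes Linux with glibc; raw_getpid is a hypothetical helper name, and getpid is used instead of fork only because it has no side effects.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: invoke getpid by its system call number
   instead of going through the dedicated libc wrapper. */
static pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0);        /* getpid cannot fail */
    return (pid_t) ret;
}
```

Calling raw_getpid() should give the same value as getpid(); the only difference is that the caller, not the wrapper, names the system call.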

Let's begin with a "simple" example, MIPS:

On MIPS, system calls are invoked via the SYSCALL instruction. So libc's implementation of fork() ends up putting some arguments in some registers and the system call number in register v0, and then issuing a syscall instruction.

On MIPS, this causes a SYSCALL_EXCEPTION (exception number 8). When booting, the kernel associates exception 8 with a handling routine in arch/mips/kernel/traps.c:trap_init():

set_except_vector(8, handle_sys);

So when the CPU receives an exception 8 because a program has issued a syscall instruction, the CPU transitions into kernel mode and begins executing the handler handle_sys in /usr/src/linux/arch/mips/kernel/scall*.S (there are several files for the different 32/64-bit kernelspace/userspace combinations). That routine looks up the system call number in the system call table and jumps to the appropriate sys_...() function, in this example sys_fork().
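
The "look up the number in a table and jump" step can be modelled in plain C. This is only a toy sketch, not the kernel's code: the table, the two handlers and every toy_ name are made up for illustration.

```c
#include <assert.h>
#include <errno.h>    /* ENOSYS */
#include <stddef.h>

typedef long (*toy_syscall_fn)(long, long, long);

/* Hypothetical handlers standing in for the kernel's sys_...() functions. */
static long toy_sys_getpid(long a0, long a1, long a2)
{
    (void) a0; (void) a1; (void) a2;
    return 1234;
}

static long toy_sys_add(long a0, long a1, long a2)
{
    (void) a2;
    return a0 + a1;
}

/* Toy system call table, indexed by system call number. */
static const toy_syscall_fn toy_table[] = {
    toy_sys_getpid,   /* number 0 */
    toy_sys_add,      /* number 1 */
};

#define TOY_NR_SYSCALLS (sizeof toy_table / sizeof toy_table[0])

/* Conceptually what the entry handler does: validate the number,
   fetch the function pointer, and call it with the arguments. */
static long toy_dispatch(unsigned long nr, long a0, long a1, long a2)
{
    if (nr >= TOY_NR_SYSCALLS)
        return -ENOSYS;             /* unknown system call */
    assert(toy_table[nr] != NULL);
    return toy_table[nr](a0, a1, a2);
}
```

For example, toy_dispatch(1, 2, 3, 0) calls toy_sys_add, while an out-of-range number yields -ENOSYS, which is also how Linux reports an unimplemented system call.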

Now, x86 is more complicated. Traditionally, Linux used interrupt 0x80 to invoke system calls. This interrupt is associated with an x86 gate in arch/x86/kernel/traps_*.c:trap_init():

set_system_gate(SYSCALL_VECTOR,&system_call);

An x86 processor has several levels (rings) of privilege (since the 80286). It is only possible to access (jump to) a lower ring (= more privilege) through predefined gates, which are special kinds of segment descriptors set up by the kernel. So, when an int 0x80 instruction is executed, an interrupt is generated and the CPU looks up the corresponding entry in a special table called the IDT (Interrupt Descriptor Table). It sees that the entry is a gate (a trap gate on x86, an interrupt gate on x86-64) and transitions into ring 0, beginning execution of the handler: system_call in arch/x86/kernel/entry_32.S on x86, or ia32_syscall in arch/x86/ia32/ia32entry.S on x86_64.

But, since the Pentium II, there is a more efficient alternative way to invoke a system call: the SYSENTER instruction (AMD also has its own SYSCALL instruction). The handlers for this "newer" mechanism are set up in arch/x86/vdso/vdso32-setup.c, by syscall32_cpu_init() on x86_64 and by enable_sep_cpu() on 32-bit x86:

#ifdef CONFIG_X86_64
[...]
void syscall32_cpu_init(void)
{
    if (use_sysenter < 0)
            use_sysenter = (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL);

    /* Load these always in case some future AMD CPU supports
       SYSENTER from compat mode too. */
    checking_wrmsrl(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
    checking_wrmsrl(MSR_IA32_SYSENTER_ESP, 0ULL);
    checking_wrmsrl(MSR_IA32_SYSENTER_EIP, (u64)ia32_sysenter_target);

    wrmsrl(MSR_CSTAR, ia32_cstar_target);
}
[...]
#else
[...]
void enable_sep_cpu(void)
{
    int cpu = get_cpu();
    struct tss_struct *tss = &per_cpu(init_tss, cpu);

    if (!boot_cpu_has(X86_FEATURE_SEP)) {
            put_cpu();
            return;
    }

    tss->x86_tss.ss1 = __KERNEL_CS;
    tss->x86_tss.sp1 = sizeof(struct tss_struct) + (unsigned long) tss;
    wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
    wrmsr(MSR_IA32_SYSENTER_ESP, tss->x86_tss.sp1, 0);
    wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) ia32_sysenter_target, 0);
    put_cpu();
}
[...]
#endif  /* CONFIG_X86_64 */

The above uses model-specific registers (MSRs) to do the setup. The handler routines are ia32_sysenter_target and, on x86_64 only, ia32_cstar_target (in arch/x86/kernel/entry_32.S or arch/x86/ia32/ia32entry.S).
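
To see why writing three MSRs is enough, here is a toy model of what SYSENTER does with them. It is a sketch under simplifying assumptions (no flags, no segment caches); struct toy_cpu and the toy_ functions are hypothetical. The MSR numbers 0x174 to 0x176 are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MSR numbers used by SYSENTER. */
enum {
    TOY_MSR_SYSENTER_CS  = 0x174,
    TOY_MSR_SYSENTER_ESP = 0x175,
    TOY_MSR_SYSENTER_EIP = 0x176,
};

struct toy_cpu {
    uint64_t msr_cs, msr_esp, msr_eip;  /* written by the kernel at boot */
    uint16_t cs, ss;                    /* current segment selectors */
    uint64_t sp, ip;                    /* stack and instruction pointers */
    unsigned cpl;                       /* current privilege level */
};

/* Model of wrmsr for the three registers above. */
static void toy_wrmsr(struct toy_cpu *cpu, uint32_t msr, uint64_t value)
{
    switch (msr) {
    case TOY_MSR_SYSENTER_CS:  cpu->msr_cs  = value; break;
    case TOY_MSR_SYSENTER_ESP: cpu->msr_esp = value; break;
    case TOY_MSR_SYSENTER_EIP: cpu->msr_eip = value; break;
    default: assert(!"MSR not modelled");
    }
}

/* Model of SYSENTER: no table lookup, just load state from the MSRs.
   Returns -1 where real hardware would raise a fault. */
static int toy_sysenter(struct toy_cpu *cpu)
{
    uint16_t sel = (uint16_t) (cpu->msr_cs & 0xfffc);

    if (sel == 0)
        return -1;                      /* SYSENTER_CS was never set up */
    cpu->cs = sel;
    cpu->ss = (uint16_t) (sel + 8);     /* stack segment follows CS in the GDT */
    cpu->sp = cpu->msr_esp;
    cpu->ip = cpu->msr_eip;
    cpu->cpl = 0;
    return 0;
}
```

The efficiency gain comes from the absence of any descriptor table lookup or privilege check: the target is fixed by the kernel ahead of time.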

Choosing which syscall mechanism to use

The Linux kernel and glibc have a mechanism to choose between the different ways to invoke a system call.

The kernel sets up a virtual shared library for each process, called the VDSO (virtual dynamic shared object), which you can see in the output of cat /proc/<pid>/maps:

$ cat /proc/self/maps
08048000-0804c000 r-xp 00000000 03:04 1553592    /bin/cat
0804c000-0804d000 rw-p 00003000 03:04 1553592    /bin/cat
[...]
b7ee8000-b7ee9000 r-xp b7ee8000 00:00 0          [vdso]
[...]
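
The same lookup can be done programmatically. The sketch below parses lines in the format shown above; parse_vdso_line and find_own_vdso are hypothetical helper names, and the code assumes a Linux /proc filesystem.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/<pid>/maps. Returns 1 and fills in the
   address range if the mapping is named "[vdso]", 0 otherwise. */
static int parse_vdso_line(const char *line, unsigned long *start,
                           unsigned long *end)
{
    char name[64] = "";
    unsigned long s, e;

    assert(line != NULL && start != NULL && end != NULL);
    /* Fields: start-end perms offset dev inode [pathname] */
    if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %63s", &s, &e, name) != 3)
        return 0;                       /* malformed or unnamed mapping */
    if (strcmp(name, "[vdso]") != 0)
        return 0;
    *start = s;
    *end = e;
    return 1;
}

/* Scan the calling process's own mappings for the vdso.
   Returns 1 if found, 0 if not (or if /proc is unavailable). */
static int find_own_vdso(unsigned long *start, unsigned long *end)
{
    char line[512];
    int found = 0;
    FILE *maps = fopen("/proc/self/maps", "r");

    if (maps == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, maps) != NULL)
        found = parse_vdso_line(line, start, end);
    fclose(maps);
    return found;
}
```

Feeding the [vdso] line from the listing above to parse_vdso_line yields the range 0xb7ee8000 to 0xb7ee9000, i.e. a single 4 KiB page.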

This vdso, among other things, contains an appropriate system call invocation sequence for the CPU in use, e.g.:

ffffe414 <__kernel_vsyscall>:
ffffe414:       51                      push   %ecx        ; \
ffffe415:       52                      push   %edx        ; > save registers
ffffe416:       55                      push   %ebp        ; /
ffffe417:       89 e5                   mov    %esp,%ebp   ; save stack pointer
ffffe419:       0f 34                   sysenter           ; invoke system call
ffffe41b:       90                      nop
ffffe41c:       90                      nop                ; the kernel will usually
ffffe41d:       90                      nop                ; return to the insn just
ffffe41e:       90                      nop                ; past the jmp, but if the
ffffe41f:       90                      nop                ; system call was interrupted
ffffe420:       90                      nop                ; and needs to be restarted
ffffe421:       90                      nop                ; it will return to this jmp
ffffe422:       eb f3                   jmp    ffffe417 <__kernel_vsyscall+0x3>
ffffe424:       5d                      pop    %ebp        ; \
ffffe425:       5a                      pop    %edx        ; > restore registers
ffffe426:       59                      pop    %ecx        ; /
ffffe427:       c3                      ret                ; return to caller

arch/x86/vdso/vdso32/ contains implementations using int 0x80, sysenter and syscall; the kernel selects the appropriate one.

To let userspace know that there is a vdso, and where it is located, the kernel sets the AT_SYSINFO and AT_SYSINFO_EHDR entries in the auxiliary vector (auxv), an array that the kernel places on the initial stack of every newly started process, right after argc, argv and envp, to pass it some information. AT_SYSINFO_EHDR points to the ELF header of the vdso; AT_SYSINFO points to the vsyscall implementation:

$ LD_SHOW_AUXV=1 id    # tell the dynamic linker ld.so to output auxv values
AT_SYSINFO:      0xb7fd4414
AT_SYSINFO_EHDR: 0xb7fd4000
[...]
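
A program can read the same auxv entry itself. The sketch below assumes glibc 2.16 or later, which provides getauxval() in <sys/auxv.h>; has_elf_magic and vdso_base are hypothetical helper names.

```c
#include <assert.h>
#include <elf.h>        /* ELFMAG, SELFMAG, AT_SYSINFO_EHDR */
#include <string.h>     /* memcmp() */
#include <sys/auxv.h>   /* getauxval() */

/* Does the memory at p start with the 4-byte ELF magic number? */
static int has_elf_magic(const void *p)
{
    assert(p != NULL);
    return memcmp(p, ELFMAG, SELFMAG) == 0;
}

/* Returns the address of the vdso's ELF header, or NULL if the
   kernel did not supply an AT_SYSINFO_EHDR entry. */
static const void *vdso_base(void)
{
    return (const void *) getauxval(AT_SYSINFO_EHDR);
}
```

If vdso_base() returns a non-NULL pointer, has_elf_magic() on it should be true, since the vdso is a complete ELF image mapped into the process.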

glibc uses this information to locate the vsyscall. It stores it into the dynamic loader global _dl_sysinfo, e.g.:

glibc-2.16.0/elf/dl-support.c:_dl_aux_init():
#ifdef NEED_DL_SYSINFO
  case AT_SYSINFO:
    GL(dl_sysinfo) = av->a_un.a_val;
    break;
#endif
#if defined NEED_DL_SYSINFO || defined NEED_DL_SYSINFO_DSO
  case AT_SYSINFO_EHDR:
    GL(dl_sysinfo_dso) = (void *) av->a_un.a_val;
    break;
#endif

glibc-2.16.0/elf/dl-sysdep.c:_dl_sysdep_start()

glibc-2.16.0/elf/rtld.c:dl_main:
GLRO(dl_sysinfo) = GLRO(dl_sysinfo_dso)->e_entry + l->l_addr;

and in a field in the header of the TCB (thread control block):

glibc-2.16.0/nptl/sysdeps/i386/tls.h

_head->sysinfo = GLRO(dl_sysinfo)

If the kernel is old and doesn't provide a vdso, glibc provides a default implementation for _dl_sysinfo:

.hidden _dl_sysinfo_int80
_dl_sysinfo_int80:
        int $0x80
        ret

When a program is compiled against glibc, depending on circumstances, a choice is made between different ways of invoking a system call:

glibc-2.16.0/sysdeps/unix/sysv/linux/i386/sysdep.h:
/* The original calling convention for system calls on Linux/i386 is
   to use int $0x80.  */
#ifdef I386_USE_SYSENTER
# ifdef SHARED
#  define ENTER_KERNEL call *%gs:SYSINFO_OFFSET
# else
#  define ENTER_KERNEL call *_dl_sysinfo
# endif
#else
# define ENTER_KERNEL int $0x80
#endif

  • int 0x80 ← the traditional way
  • call *%gs:offsetof(tcb_head_t, sysinfo) ← %gs points to the TCB, so this jumps indirectly through the pointer to the vsyscall stored in the TCB. This is preferred for objects compiled as PIC. It requires TLS initialization: for dynamic executables, TLS is initialized by ld.so; for static PIE executables, it is initialized by __libc_setup_tls().
  • call *_dl_sysinfo ← this jumps indirectly through the global variable. This requires relocation of _dl_sysinfo, so it is avoided for objects compiled as PIC.
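
The two indirect variants differ only in where the pointer lives. The sketch below models both in portable C. Everything prefixed toy_ is hypothetical. struct toy_tcbhead_i386 imitates the first fields of the i386 TCB header, using 32-bit integers in place of i386 pointers so that the offsets come out the same on any host; treat that layout as an assumption to verify against your glibc version.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Assumed layout of the start of the i386 TCB header. */
struct toy_tcbhead_i386 {
    uint32_t tcb;
    uint32_t dtv;
    uint32_t self;
    int32_t  multiple_threads;
    uint32_t sysinfo;            /* pointer to the vsyscall entry */
};

typedef long (*toy_entry_fn)(long nr);

/* Per-thread copy of the pointer, as reached through %gs. */
struct toy_thread {
    toy_entry_fn sysinfo;
};

/* Process-wide copy, standing in for the global _dl_sysinfo. */
static toy_entry_fn toy_dl_sysinfo;

/* Stand-in for __kernel_vsyscall: just tags the number it was given. */
static long toy_vsyscall(long nr)
{
    return 1000 + nr;
}

/* Models "call *%gs:SYSINFO_OFFSET": indirect call through the TCB. */
static long toy_enter_kernel_shared(const struct toy_thread *self, long nr)
{
    assert(self != NULL && self->sysinfo != NULL);
    return self->sysinfo(nr);
}

/* Models "call *_dl_sysinfo": indirect call through the global. */
static long toy_enter_kernel_static(long nr)
{
    assert(toy_dl_sysinfo != NULL);
    return toy_dl_sysinfo(nr);
}
```

Under the assumed layout, offsetof(struct toy_tcbhead_i386, sysinfo) is 0x10, so the TCB variant assembles to call *%gs:0x10. The TCB variant needs no relocation because the pointer is found relative to %gs rather than at a fixed symbol address.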

So, on x86:

                       fork()
                         ↓
int 0x80 / call *%gs:0x10 / call *_dl_sysinfo 
  |                ↓              ↓
  |       (in vdso) int 0x80 / sysenter / syscall
  ↓                ↓              ↓            ↓
      system_call     | ia32_sysenter_target | ia32_cstar_target
                          ↓
                       sys_fork()