Tuesday, April 3, 2018

How does a system call (e.g. fork()) work?

(originally published at https://stackoverflow.com/a/11570572/371250)

libc's implementation of fork() and other system calls contain special processor instructions that invoke a system call. System call invocation is architecture-specific, and can be a quite complex topic.

Let's begin with a "simple" example, MIPS:

On MIPS system calls are invoked via the SYSCALL instruction. So, libc's implementation of fork() ends up putting some arguments on some registers, the system call number in regiter v0, and issuing a syscall instruction.

On MIPS, this causes a SYSCALL_EXCEPTION (exception number 8). When booting, the kernel associates exception 8 to a handling routine in arch/mips/kernel/traps.c:trap_init():

set_except_vector(8, handle_sys);

So when the CPU receives an exception 8 because a program has issued a syscall instruction, the CPU transitions into kernel mode, and begins executing the handler at handle_sys at /usr/src/linux/arch/mips/kernel/scall*.S (there are several files for the different 32/64 bits kernelspace/userspace combinations). That routine looks up the system call number in the system call table and jumps to the appropriate sys_...() function, in this example sys_fork().

Now, x86 is more complicated. Traditionally, Linux used interrupt 0x80 to invoke system calls. This is associated to an x86 gate in arch/x86/kernel/traps_*.c:trap_init():


An x86 processor has several levels (rings) of privilege (since 80286). It is only possible to access (jump to) a lower ring (= more privilege) through predefined gates, which are special kinds of segment descriptors set by the kernel. So, when an int 0x80 is called, an interrupt is generated, the CPU looks up a special table called the IDT (Interrupt Descriptor Table), sees that it has a gate (a trap gate in x86, an interrupt gate in x86-64), and transitions into ring 0, beginning the execution of the system_call/ia32_syscall handler at arch/x86/kernel/entry_32.S/arch/x86/ia32/ia32entry.S (for x86/x86_64 respectively).

But, since the Pentium Pro, there is an alternative way to invoke a system call: using the SYSENTER instruction (AMD also has its own SYSCALL instruction). This is a more efficient way to invoke a system call. The handler for this "newer" mechanism is set at arch/x86/vdso/vdso32-setup.c:syscall32_cpu_init():

#ifdef CONFIG_X86_64
void syscall32_cpu_init(void)
    if (use_sysenter < 0)
            use_sysenter = (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL);

    /* Load these always in case some future AMD CPU supports
       SYSENTER from compat mode too. */
    checking_wrmsrl(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
    checking_wrmsrl(MSR_IA32_SYSENTER_ESP, 0ULL);
    checking_wrmsrl(MSR_IA32_SYSENTER_EIP, (u64)ia32_sysenter_target);

    wrmsrl(MSR_CSTAR, ia32_cstar_target);
void enable_sep_cpu(void)
    int cpu = get_cpu();
    struct tss_struct *tss = &per_cpu(init_tss, cpu);

    if (!boot_cpu_has(X86_FEATURE_SEP)) {

    tss->x86_tss.ss1 = __KERNEL_CS;
    tss->x86_tss.sp1 = sizeof(struct tss_struct) + (unsigned long) tss;
    wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
    wrmsr(MSR_IA32_SYSENTER_ESP, tss->x86_tss.sp1, 0);
    wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) ia32_sysenter_target, 0);
#endif  /* CONFIG_X86_64 */

The above uses machine specific registers (MSRs) to do the setup. The handler routines are ia32_sysenter_target and ia32_cstar_target (this last one only for x86_64) (in arch/x86/kernel/entry_32.S or arch/x86/ia32/ia32entry.S).

Choosing which syscall mechanism to use

The linux kernel and glibc have a mechanism to choose between the different ways to invoke a system call.

The kernel sets up a virtual shared library for each process, it's called the VDSO (virtual dynamic shared object), which you can see in the output of cat /proc/<pid>/maps:

$ cat /proc/self/maps
08048000-0804c000 r-xp 00000000 03:04 1553592    /bin/cat
0804c000-0804d000 rw-p 00003000 03:04 1553592    /bin/cat
b7ee8000-b7ee9000 r-xp b7ee8000 00:00 0          [vdso]

This vdso, among other things, contains an appropriate system call invocation sequence for the CPU in use, e.g:

ffffe414 <__kernel_vsyscall>:
ffffe414:       51                      push   %ecx        ; \
ffffe415:       52                      push   %edx        ; > save registers
ffffe416:       55                      push   %ebp        ; /
ffffe417:       89 e5                   mov    %esp,%ebp   ; save stack pointer
ffffe419:       0f 34                   sysenter           ; invoke system call
ffffe41b:       90                      nop
ffffe41c:       90                      nop                ; the kernel will usually
ffffe41d:       90                      nop                ; return to the insn just
ffffe41e:       90                      nop                ; past the jmp, but if the
ffffe41f:       90                      nop                ; system call was interrupted
ffffe420:       90                      nop                ; and needs to be restarted
ffffe421:       90                      nop                ; it will return to this jmp
ffffe422:       eb f3                   jmp    ffffe417 <__kernel_vsyscall+0x3>
ffffe424:       5d                      pop    %ebp        ; \
ffffe425:       5a                      pop    %edx        ; > restore registers
ffffe426:       59                      pop    %ecx        ; /
ffffe427:       c3                      ret                ; return to caller

In arch/x86/vdso/vdso32/ there are implementations using int 0x80, sysenter and syscall, the kernel selects the appropriate one.

To let userspace know that there is a vdso, and where it is located, the kernel sets AT_SYSINFO and AT_SYSINFO_EHDR entries in the auxiliary vector (auxv, the 4th argument to main(), after argc, argv, envp, which is used to pass some information from the kernel to newly started processes). AT_SYSINFO_EHDR points to the ELF header of the vdso, AT_SYSINFO points to the vsyscall implementation:

$ LD_SHOW_AUXV=1 id    # tell the dynamic linker ld.so to output auxv values
AT_SYSINFO:      0xb7fd4414
AT_SYSINFO_EHDR: 0xb7fd4000

glibc uses this information to locate the vsyscall. It stores it into the dynamic loader global _dl_sysinfo, e.g.:

  case AT_SYSINFO:
    GL(dl_sysinfo) = av->a_un.a_val;
#if defined NEED_DL_SYSINFO || defined NEED_DL_SYSINFO_DSO
    GL(dl_sysinfo_dso) = (void *) av->a_un.a_val;


GLRO(dl_sysinfo) = GLRO(dl_sysinfo_dso)->e_entry + l->l_addr;

and in a field in the header of the TCB (thread control block):


_head->sysinfo = GLRO(dl_sysinfo)

If the kernel is old and doesn't provide a vdso, glibc provides a default implementation for _dl_sysinfo:

.hidden _dl_sysinfo_int80:
int $0x80

When a program is compiled against glibc, depending on circumstances, a choice is made between different ways of invoking a system call:

/* The original calling convention for system calls on Linux/i386 is
   to use int $0x80.  */
#ifdef I386_USE_SYSENTER
# ifdef SHARED
# else
#  define ENTER_KERNEL call *_dl_sysinfo
# endif
# define ENTER_KERNEL int $0x80
  • int 0x80 ← the traditional way
  • call *%gs:offsetof(tcb_head_t, sysinfo)%gs points to the TCB, so this jumps indirectly through the pointer to vsyscall stored in the TCB. This is prefered for objects compiled as PIC. This requires TLS initialization. For dynamic executables, TLS is initialized by ld.so. For static PIE executables, TLS is initialized by __libc_setup_tls().
  • call *_dl_sysinfo ← this jumps indirectly through the global variable. This requires relocation of _dl_sysinfo, so it is avoided for objects compiled as PIC.

So, in x86:

int 0x80 / call *%gs:0x10 / call *_dl_sysinfo 
  |                ↓              ↓
  |       (in vdso) int 0x80 / sysenter / syscall
  ↓                ↓              ↓            ↓
      system_call     | ia32_sysenter_target | ia32_cstar_target

Tuesday, March 12, 2013

In a comment in programmers.SE, Dibbeke states the following:

Small hint here: your most important liability in programming is your code. So, less is truly more here.

Many people view code as an asset. It is not. Code is a liability. The functionality code provides is the asset. So, it follows that you should try to get that functionality with the minimum amount of code. Sometimes, you should even forgo some functionality, because its cost in code (and the cost of code is _not_ the cost of writing it) exceeds the value it provides.

See also: SoftwareAsLiability and LegacyCode at the c2.com wiki.

Thursday, November 17, 2011

Your own linker warnings using the GNU toolchain

You've probably seen linker warnings like these:

linkwarnmain.c:(.text+0x1d): warning: foo is deprecated, please use the shiny new foobar function
linkwarnmain.c:(.text+0x27): warning: the use of `tmpnam' is dangerous, better use `mkstemp'

These warnings are actually stored in ELF sections named .gnu.warning.symbolname:

$ objdump -s -j .gnu.warning.gets /lib/libc.so.6

/lib/libc.so.6:     file format elf32-i386

Contents of section .gnu.warning.gets:
 0000 74686520 60676574 73272066 756e6374  the `gets' funct
 0010 696f6e20 69732064 616e6765 726f7573  ion is dangerous
 0020 20616e64 2073686f 756c6420 6e6f7420   and should not
 0030 62652075 7365642e 00                 be used..

So when you try to use a symbol, if the linker sees a section whose name matches the above pattern, it emits the corresponding warning message.

You can add your own linker warnings to your source files:

void foo(void)

static const char foo_warning[] __attribute__((section(".gnu.warning.foo"))) =
        "foo is deprecated, please use the shiny new foobar function";

void foo(void);

int main()

        return 0;

Then, when you compile, you will get a warning:

$ gcc -Wall -c linkwarn.c
$ gcc -Wall -o linkwarnmain linkwarnmain.c linkwarn.o
/tmp/ccyHLTw6.o: In function `main':
linkwarnmain.c:(.text+0x1d): warning: foo is deprecated, please use the shiny new foobar function

glibc machinery for emitting linker warnings is a little bit more complicated:

#ifdef HAVE_ELF

/* We want the .gnu.warning.SYMBOL section to be unallocated.  */
#  define __make_section_unallocated(section_string)    \
  asm (".section " section_string "\n\t.previous");
#  define __make_section_unallocated(section_string)    \
  asm (".pushsection " section_string "\n\t.popsection");
# else
#  define __make_section_unallocated(section_string)
# endif

/* Tacking on "\n\t#" to the section name makes gcc put it's bogus
   section attributes on what looks like a comment to the assembler.  */
#  define __sec_comment "\"\n\t#\""
# else
#  define __sec_comment "\n\t#"
# endif
# define link_warning(symbol, msg) \
  __make_section_unallocated (".gnu.warning." #symbol) \
  static const char __evoke_link_warning_##symbol[]     \
    __attribute__ ((used, section (".gnu.warning." #symbol __sec_comment))) \
    = msg;

The warning message is marked as used so the optimizer doesn't decide to completely optimize it out of existence. Also, the section is not marked as allocatable to prevent the loader from loading it into memory. Since gcc's section attribute doesn't allow to change its default flags, the solution is to declare the section using asm, return to the previous section (the one the compiler was using beforehand), and prevent (via __sec_comment) the assembler from seeing the flags that gcc adds in its section attribute output. Otherwise we'd have something similar to:

$ gcc -Wall -S linkwarn.c
$ cat linkwarn.s
        .file   "linkwarn.c"
.globl foo
        .type   foo, @function
        pushl   %ebp
        movl    %esp, %ebp
        popl    %ebp
        .size   foo, .-foo
        .section        .gnu.warning.foo,"a",@progbits
        .align 32
        .type   foo_warning, @object
        .size   foo_warning, 60
        .string "foo is deprecated, please use the shiny new foobar function"
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4.5 (Gentoo 3.4.5-r1, ssp-3.4.5-1.0, pie-8.7.9)"

... where the section is marked as allocatable via the "a" flag.

Tuesday, October 18, 2011

Terminals in *nix

Terminals were hardware devices that consisted of a keyboard and an output device (initially a line printer, later a CRT monitor). A large computer could have several remote terminals connected to it. Each terminal would have a protocol for communicating efficiently with the computer, for CRT-based terminals this includes having special "control sequences" to change cursor position, erase parts of the current line/screen, switch to an alternate full-screen mode, ...

A terminal emulator is an application emulating one of those older terminals. It allows to do functions like cursor positioning, setting foreground and background colors, ... Terminal emulators try to emulate some specific terminal protocol, but each has its own set of quirks and deviations.

Unix systems have databases describing terminals and terminal emulators, so applications are abstracted away from the particular terminal (or terminal emulator) in use. An older database is termcap(5), while terminfo(5) is a newer database. These databases allow applications to query for the capabilities of the terminal in use. Capabilities can be booleans, numeric capabilities, or even string capabilities, e.g.: if a specific terminal type has/supports a F12 key, it will have a capability "key_f12" (long terminfo name), "kf12" (short terminfo name), "F2" (termcap name) describing the string that key produces. Try it with: tput kf12 | od -tx1.

Since programming directly with capabilities can be cumbersome, applications typically use a higher-level library like curses/ncurses, slang, etc...

There is a special environment variable called TERM that tells applications what terminal type they are talking to. This variable should be set to the exact terminal type if it exists in the database, for best results. This just tells the application which precise protocol and protocol deviations does the terminal understand. Changing the TERM variable does not change the terminal type, it just changes the terminal type the application thinks it is talking to.

Monday, October 17, 2011


I remember SuSE Linux had a motd that simply said: "Have fun..." That has always been the spirit of Unix, having an environment in which it is fun to work and tinker about. We owe that to Ken Thompson, Dennis Ritchie and their colleagues at Bell Labs (http://www.princeton.edu/~hos/Mahoney/unixpeople.htm).

Dennis Ritchie also evolved Thompson's B into C, giving us a concise and practical systems programming language. Kernighan and Ritchie taught many people how to write great code, some of those people taught other people, directly, or simply by sharing great code. The end result has been a ripple effect that has had a profound and positive impact on how many of us code. Not many people in history have influenced the work of others so much.

Dennis Ritchie is no longer with us, he will be greatly missed. He is among those luminaries we will never forget. The computing community will never forget Jon Postel, Richard Stevens and Dennis Ritchie.


Monday, October 10, 2011

Tentative Definitions in C

Suppose you want to access a variable from several translation units. You would declare it in a header file:

int foo;

You would optionally define it on a translation unit:

int foo = 2;

And you would refer to it on some other(s) translation unit(s):

#include <stdio.h>

#include "example.h"

int main()
        printf("foo = %d\n", foo);

        return 0;

If we compile and run the above we get:

$ gcc -Wall -o example example-main.c example.c
$ ./example
foo = 2

int foo; from header.h is a tentative definition. If you comment out the definition in example.c, the tentative definition acts as a definition with an initializer of 0:

$ gcc -Wall -o example example-main.c example.c
$ ./example
foo = 0

But it can be considered to be better style to restrict header files to contain declarations, and not definitions, not even tentative ones. In that case, we would add a extern storage specifier, which would turn our declaration on the header file from a tentative definition to a mere declaration:

extern int foo;

In this case, if we don't define foo in any translation unit, the linker will error out:

$ gcc -Wall -o example example-main.c example.c
/tmp/ccS3FMxx.o: In function `main':
example-main.c:(.text+0x1d): undefined reference to `foo'
collect2: ld returned 1 exit status

All of the above is for C. C++ doesn't have tentative declarations, and therefore has some different rules on what constitutes a definition. In C++, int foo; constitutes a definition, while extern int foo; constitutes a declaration. So, if we leave out the extern storage specifier in C++, the linker will error out:

$ g++ -Wall -o example example-main.cpp example.cpp
/tmp/cc8fzx4x.o:(.data+0x0): multiple definition of `foo'
/tmp/ccxLfXPo.o:(.bss+0x0): first defined here
collect2: ld returned 1 exit status

Friday, September 23, 2011

Spring @Transactional Propagation Reference

Spring's propagation types for the @Transactional annotation differ in how they behave in presence/absence of an existing transaction:

Propagation types and their behaviour
PROPAGATION TYPE no current transaction there's a current transaction
MANDATORY throw exception use current transaction
NEVER don't create a transaction, run method outside any transaction throw exception
NOT_SUPPORTED don't create a transaction, run method outside any transaction suspend current transaction, run method outside any transaction
SUPPORTS don't create a transaction, run method outside any transaction use current transaction
REQUIRED (default) create a new transaction use current transaction
REQUIRES_NEW create a new transaction suspend current transaction, create a new independent transaction
NESTED create a new transaction create a new nested transaction

It could be argued that NEVER and NOT_SUPPORTED (and arguably SUPPORTS also) don't serve any useful purpose, and exist only for completeness; after all, why would you want to explicitly run something outside of any transaction?

As for the other propagation types, REQUIRED/MANDATORY/SUPPORTS would be used when you want your code to be part of the current transaction, sharing its commit/rollback fate; NESTED would come into use in cases where you want to be able to rollback only parts of a bigger transaction; and finally REQUIRES_NEW is to be used when you want to do something totally unrelated to the current transaction.

The difference between REQUIRED, MANDATORY and SUPPORTS strives in how they behave when there is no existing transaction. MANDATORY acts as a sort of assert directive, ensuring there is an opened transaction. REQUIRED creates a new transaction, being safe to use outside any transaction. SUPPORTS rides along an existing transaction, but doesn't create a new one if none exists.