Sunday, July 3, 2011

vfork()

There are some limitations to using CPUs without an MMU under *nix: for example, fork(), mmap(), shmat() and brk() are all restricted. Let's talk about fork().

fork(2) creates a child process with a copy of the current process's memory (and a copy of its file descriptors, signal handlers, filesystem namespace, ...). This copy uses the same virtual addresses as the parent. On a CPU with an MMU, the MMU translates each process's virtual addresses to different physical addresses, so everything works and everyone is happy.
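
To make the "copy of the memory" point concrete, here is a minimal sketch for a system with an MMU. The function name parent_value_after_child_write() and the variable counter are made up for this illustration. The child writes to a global variable, and the parent then checks that its own copy, at the same virtual address, is untouched.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical global, used only for this illustration. */
static int counter = 1;

/* Forks, lets the child modify its copy of counter, and returns the
 * value the parent sees afterwards (or -1 on error). */
int parent_value_after_child_write(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: same virtual address as in the parent, but the write
         * lands in the child's own copy of the page. */
        counter = 2;
        _exit(counter == 2 ? 0 : 1);
    }

    /* Parent: wait for the child, then look at our own copy. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return counter;
}
```

With a working fork(), the parent should still see its original value of counter, since the child only ever changed its private copy.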

However, on an MMU-less CPU, virtual addresses are the same as physical addresses (put another way, there are no virtual addresses), so fork() cannot work on an MMU-less system in the general case (at least not efficiently; you could always move processes around in memory at each context switch).

Often, fork(2) is called only so that the child can immediately call execve(2). There is a special system call for this: vfork(2). Typically, vfork() doesn't create a new copy of the parent's memory; the child runs in the parent's memory instead. It's also typical for the parent to remain blocked until the child calls execve() or _exit().

The only safe things you can do after vfork() on the child are the following:
  • Calling execve().
  • Calling _exit(). Note that it's _exit(), not exit(): exit() can run C library finalization code, such as closing and freeing file handles, which, in vfork() implementations that use the parent's memory, would also close and free them for the parent, leading to very bad things.
  • Using the pid_t value returned by vfork().
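
The rules above can be sketched as a small helper. This is only an illustration: run_with_vfork() is a made-up name, and the _DEFAULT_SOURCE define is there on the assumption that glibc would otherwise hide the vfork() declaration under a strict -std= setting. The child does nothing except call execve() and, if that fails, _exit().

```c
#define _DEFAULT_SOURCE /* assumed necessary for vfork() on glibc with strict -std= */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: runs argv[0] via vfork() + execve() and returns
 * the program's exit status, or -1 on error. */
int run_with_vfork(char *const argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: only execve() or _exit() from here on. */
        execve(argv[0], argv, environ);
        _exit(127); /* execve() failed; never call exit() here */
    }

    /* Parent: resumes once the child has called execve() or _exit(). */
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, calling it with the argument vector {"/bin/sh", "-c", "exit 7", NULL} should return 7 on a system that has /bin/sh.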

Of course, on a system with an MMU, vfork() can be implemented simply as:
#define vfork fork

When vfork() shares the parent's address space, it works perfectly fine on non-MMU CPUs. That covers a lot of ground, since creating a child to immediately call execve() is a very common use of fork()/vfork().

The other classic *nix API for creating processes/threads/tasks is pthread_create(). Since the threads share a single address space, it works on non-MMU CPUs. POSIX also defines posix_spawn(), which creates a child process running a new program in a single call.
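
As a rough sketch of the posix_spawn() route: the helper name run_with_posix_spawn() is invented for this example, and the two NULL arguments mean no file actions and default spawn attributes are requested.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: runs argv[0] via posix_spawn() and returns the
 * program's exit status, or -1 on error. */
int run_with_posix_spawn(char *const argv[])
{
    pid_t pid;
    int status;

    /* posix_spawn() returns 0 on success and an error number on failure. */
    int err = posix_spawn(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0)
        return -1;

    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The calling code never runs anything in the child itself, so there is no window in which it could break the vfork() rules.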

In the specific case of Linux, there is also clone(2). On non-MMU CPUs, clone() works fine as long as it is passed the CLONE_VM flag.
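
Here is a minimal, Linux-only sketch of clone() with CLONE_VM, using the glibc wrapper. The names clone_vm_demo(), child_fn() and shared_value, and the 64 KiB stack size, are choices made for this example. It also assumes a stack that grows downwards, as on i386. CLONE_VFORK is added so that the parent stays suspended until the child exits, like with vfork().

```c
#define _GNU_SOURCE /* must come before any #include to expose clone() */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024) /* arbitrary size for this sketch */

/* Hypothetical variable used to show that memory is shared. */
static int shared_value = 0;

/* Runs in the child. With CLONE_VM, this write is visible to the parent. */
static int child_fn(void *arg)
{
    shared_value = *(int *)arg;
    return 0; /* becomes the child's exit status */
}

/* Hypothetical helper: returns shared_value as seen by the parent after
 * the child has run, or -1 on error. */
int clone_vm_demo(int value)
{
    int status;
    pid_t pid, waited;
    char *stack = malloc(CHILD_STACK_SIZE);

    if (stack == NULL)
        return -1;

    /* Pass the top of the buffer: the stack is assumed to grow downwards. */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                CLONE_VM | CLONE_VFORK | SIGCHLD, &value);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited < 0)
        return -1;

    return shared_value;
}
```

Unlike the fork() case, the parent here should observe the value written by the child, because both ran in the same address space.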

An interesting detail of vfork() (explained by Jamie Lokier on uclinux-dev at http://www.mail-archive.com/uclinux-dev@uclinux.org/msg01290.html) is how it's implemented in uClibc for i386:


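__vfork:
        popl %ecx                /* save the return address in %ecx */
        movl $__NR_vfork,%eax    /* system call number */
        int $0x80                /* enter the kernel: sys_vfork() */
        pushl %ecx               /* put the return address back on the stack */
        cmpl $-4095,%eax         /* is %eax in [-4095, -1], i.e. an error? */
        jae __syscall_error
        ret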

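When you call vfork(), Linux runs the child first; the parent stays blocked inside vfork() and hasn't returned from it yet. The child shares the parent's stack, so once it returns from vfork() and sets up the call to execve(), it can overwrite what used to be vfork()'s stack frame, which the parent still needs in order to return.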

The solution is not to depend on vfork()'s stack frame. In the i386 example above, the first thing done is to save the return address (which is the only thing stored in vfork()'s stack frame, as vfork() has neither parameters nor local variables) in a register, where it is safe. The int $0x80 instruction is the one that passes control to sys_vfork() in the kernel. On return from sys_vfork(), we push the return address back onto the stack, check for errors, and return from vfork().

(Originally published at http://barrapunto.com/~ninjalj/journal/27731 (in Spanish))
