Monday, July 11, 2011

Introduction to regexes

Let Σ be an alphabet.

  • ε (the empty string) is a regular expression.
  • If a is a symbol from the alphabet Σ, a is a regular expression.
  • If a and b are symbols from the alphabet Σ, ab (a and b concatenated) is a regular expression.
  • If a and b are symbols from the alphabet Σ, a|b (a or b) is a regular expression.
  • If a is a symbol from the alphabet Σ, a* (a repeated 0 or more times) is a regular expression.
A regular language is the language defined by a regular expression. A regular language is the language matched by a finite automaton. A word w from a regular language can be recognized in O(n).

Unix regexes:

  • . means any character.
  • Concatenation is represented without any symbol.
  • Union (alternation) is represented by |.
  • * means Kleene closure (0 or more repetitions).
  • + means 1 or more repetitions.
  • ? means 0 or 1 repetition.
  • ^ means start of line.
  • $ means end of line.
  • \< means start of word (in some programs).
  • \> means end of word (in some programs).
  • {n} means n repetitions (in some programs).
  • {,m} means from 0 to m repetitions (in some programs).
  • {n,m} means from n to m repetitions (in some programs).
  • a|b|c|d|e|f can be written as [abcdef] or [a-f].
  • a|b|c|d|h|i can be written as [a-dhi] or [a-dh-i], ...
  • [^list] matches characters not in list.
  • () is used for grouping.
  • \ escapes special meaning of a metacharacter, e.g.: \*
Precedence is: repetition, concatenation, union.

The supported regular expression syntax varies among programs, (e.g: gawk vs mawk, vi vs vim, ...).

Back-references:

Many programs allow the use of \n (where n is a number) to match a subexpression enclosed in () or \(\), e.g: in vim, \<\(.\)\(.\).\2\1\> matches five-letter palindromes. Regular expressions using these back-references are no longer true regular expressions, and not all of them can be matched in O(n) time.



Regex/Regexp vs regular expressions:


Since modern regexes match things that cannot be matched with true regular expressions, it's been proposed (by Larry Wall) to call them regexes, not regular expressions. Perl regexes have been gaining expressive power through the years, and have grown, among many other things, support for look-around assertions, support for calling Perl code, and support for recursion. Many regex engines have taken things from Perl's regexes. On the other hand, Google's re2 library uses true regular expressions and a DFA-based implementation (in contrast to the NFA-based implementation of backtracking-regex engines) to achieve very high performance.

Sunday, July 3, 2011

vfork()

There are some limitations to using non-MMU CPUs in *nix, e.g: there are restrictions to fork(), mmap(), shmat() and brk(). Let's talk about fork().

fork(2) creates a child process with a copy of the memory of the current process (and a copy of the file descriptors, signal handlers, filesystem namespace, ...) This copy has the same virtual addresses as those used on the parent. In a CPU with a MMU, the MMU translates each process' virtual addresses to different physical addresses, so everything works and everyone is happy.

However, in a MMU-less CPU, virtual addresses are the same as physical addresses (said another way, there are no virtual addresses), so fork() cannot work in a MMU-less system in the general case (at least in an efficient manner, you can always move processes in memory at each context switch).

Often, fork(2) is used to immediately call execve(2) in the child. There is a special system call fot this: vfork(2). Typically, vfork() doesn't create a new copy of the parent's memory, but uses the parent's memory. It's also typical for the parent to remain blocked until the child calls execve() or _exit().

The only safe things you can do after vfork() on the child are the following:
  • Calling execve().
  • Calling _exit() (Note it's _exit(), not exit(), exit() can run C library finalization code, such as closing and freeing file handles, which in vfork() implementations using the parent's memory would also close and free them for the parent, leading to very bad things).
  • Use the pid_t value returned by vfork().

Of course, vfork() can be implemented simply as:
#define vfork fork

As vfork() uses a shared address space, it works perfectly fine on non-MMU CPUs. Also, creating a child to immediately call execve() is a very common use of fork()/vfork().

The other *nix classical API to create processes/threads/tasks is pthread_create(). As the different threads share the memory address space, this works for non-MMU CPUs. POSIX also introduces a posix_spawn() function.

In the specific case of Linux, there is also clone(2). In non-MMU CPUs, clone() works fine if it's passed the CLONE_VM flag.

An interesting detail in vfork() (explained by Jamie Loker at uclinux-dev at http://www.mail-archive.com/uclinux-dev@uclinux.org/msg01290.html) is how it's implemented in uClibc:


__vfork:
popl %ecx
movl $__NR_vfork,%eax
int $0x80
pushl %ecx
cmpl $-4095,%eax
jae __syscall_error
ret

When you call vfork(), Linux first returns control to the child. The parent hasn't yet returned from vfork(). The call to execve() in the child can corrupt vfork()'s stack frame in the parent.

The solution is not depending on vfork()'s stack frame. In the previous i386 example, the first thing that is done is save the return address (which is the only think saved on vfork()'s stack frame, as vfork() has neither parameters nor local variables) in a register, where it is safe. The int $0x80 instruction is the one to pass control to sys_vfork() at the kernel. On return from sys_vfork(), we push the return address into the stack frame again, check for errors, and return from vfork().

(Originally published at http://barrapunto.com/~ninjalj/journal/27731 (in Spanish))

Saturday, July 2, 2011

Integer Promotions and Conversions in C

Suppose you have the following code:
#include <stdio.h>

int main()
{
    unsigned x = 1;
    char y = -1;

    if (x > y)
        printf("x > y\n");
    else
        printf("x <= y\n");

    return 0;
}
What does really happen here? Since we are dealing with integer types, first the integer promotions are applied, and then the arithmetic conversions are applied.

If char is equivalent to signed char:
  • char is promoted to int (Integer Promotions, ISO C99 §6.3.1.1 ¶2)
  • Since int and unsigned have the same rank, int is converted to unsigned (Arithmetic Conversions, ISO C99 §6.3.1.8)
If char is equivalent to unsigned char:
  • char may be promoted to either int or unsigned int:
    • If int can represent all unsigned char values (typically because sizeof(int) > sizeof(char)), char is converted to int.
    • Otherwise (typically because sizeof(char)==sizeof(int)), char is converted to unsigned.
  • Now we have one operand that is either int or unsigned, and another that is unsigned. The first operand is converted to unsigned.

The rules that are applied are the following:

Integer promotions: An expression of a type of lower rank that int is converted to int if int can hold all of the values of the original type, to unsigned otherwise.

Arithmetic conversions: Try to convert to the larger type. When there is conflict between signed and unsigned, if the larger (including the case where the two types have the same rank) type is unsigned, go with unsigned. Otherwise, go with signed only in the case it can represent all the values of both types.

Conversions between integer types(ISO C99 §6.3.1.3):
  • Conversion of an out-of-range value to an unsigned integer type is done via wrap-around (modular arithmetic).
  • Conversion of an out-of-range value to a signed integer type is implementation defined, and can raise a signal (such as SIGFPE).
Ranks: Every type has a rank. unsigned types have the same rank as the corresponding signed type. Ranks satisfy the following: char < short < int < long < long long.

Representation of integer types:
  • Signed types consist of sign bits, padding bits and value bits.
  • Unsigned types consists of padding bits and value bits.
  • An unsigned type has to have a number of value bits greater or equal to the number of value bits of its corresponding signed type.
Now we can rephrase part of the above as:
  • unsigned char may be promoted to either int or unsigned int:
    • If int can represent all unsigned char values (because the number of value bits of int >= number of value bits of unsigned char), unsigned char is converted to int.
    • Otherwise (because the number of value bits of int < number of value bits of unsigned char), unsigned char is converted to unsigned.
Borderline example: a system with an unsigned char type with 1 padding bit and 31 value bits, and an int type with 1 sign bit and 31 value bits would fall into the first condition (and be an exception to the previous rule of thumb using sizeof()).