Monday, July 11, 2011

Introduction to regexes

Let Σ be an alphabet.

  • ε (the empty string) is a regular expression.
  • If a is a symbol from the alphabet Σ, a is a regular expression.
  • If a and b are symbols from the alphabet Σ, ab (a and b concatenated) is a regular expression.
  • If a and b are symbols from the alphabet Σ, a|b (a or b) is a regular expression.
  • If a is a symbol from the alphabet Σ, a* (a repeated 0 or more times) is a regular expression.
A regular language is the language defined by a regular expression. A regular language is the language matched by a finite automaton. A word w from a regular language can be recognized in O(n).

Unix regexes:

  • . means any character.
  • Concatenation is represented without any symbol.
  • Union (alternation) is represented by |.
  • * means Kleene closure (0 or more repetitions).
  • + means 1 or more repetitions.
  • ? means 0 or 1 repetition.
  • ^ means start of line.
  • $ means end of line.
  • \< means start of word (in some programs).
  • \> means end of word (in some programs).
  • {n} means n repetitions (in some programs).
  • {,m} means from 0 to m repetitions (in some programs).
  • {n,m} means from n to m repetitions (in some programs).
  • a|b|c|d|e|f can be written as [abcdef] or [a-f].
  • a|b|c|d|h|i can be written as [a-dhi] or [a-dh-i], ...
  • [^list] matches characters not in list.
  • () is used for grouping.
  • \ escapes special meaning of a metacharacter, e.g.: \*
Precedence is: repetition, concatenation, union.

The supported regular expression syntax varies among programs, (e.g: gawk vs mawk, vi vs vim, ...).

Back-references:

Many programs allow the use of \n (where n is a number) to match a subexpression enclosed in () or \(\), e.g: in vim, \<\(.\)\(.\).\2\1\> matches five-letter palindromes. Regular expressions using these back-references are no longer true regular expressions, and not all of them can be matched in O(n) time.



Regex/Regexp vs regular expressions:


Since modern regexes match things that cannot be matched with true regular expressions, it's been proposed (by Larry Wall) to call them regexes, not regular expressions. Perl regexes have been gaining expressive power through the years, and have grown, among many other things, support for look-around assertions, support for calling Perl code, and support for recursion. Many regex engines have taken things from Perl's regexes. On the other hand, Google's re2 library uses true regular expressions and a DFA-based implementation (in contrast to the NFA-based implementation of backtracking-regex engines) to achieve very high performance.

No comments:

Post a Comment