# Regular expression

Regular expressions (abbreviated RegExp or Regex) serve to describe a family of formal languages, i.e. they describe (sub)sets of character strings. They thus belong to theoretical computer science, where they form the lowest and therefore least expressive level of the Chomsky hierarchy (Type 3). It can be shown that for every regular expression there is an equivalent finite automaton, and vice versa. This automaton is easy to construct, which is why regular expressions are relatively simple to implement.

The mathematician Stephen Kleene used a notation that he called regular sets. The power of regular expressions is sufficient to describe the morphology of a natural language.

## Regular expressions in theoretical computer science

### Theoretical foundations

Note: This section presupposes knowledge of some concepts from the theory of formal languages.

Regular expressions support exactly three operations: alternation, concatenation, and repetition. The formal definition is as follows:

### Syntax

1. $\underline{\varnothing}$ (the empty set) is a regular expression.
2. $\underline{\epsilon}$ (the empty word) is a regular expression.
3. For all $a_i \in \Sigma$, $\underline{a_i}$ (each symbol of the underlying alphabet) is a regular expression.
4. If $x$ and $y$ are regular expressions, then so are $(x \cup y)$ (union), $(xy)$ (concatenation), and $x^*$ (star operator).
5. There are no further regular expressions.
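The five rules can be mirrored directly in code. The following is a minimal sketch, not part of the definition itself: each rule becomes a Python class, and membership is decided with Brzozowski derivatives, a technique chosen here only for illustration. All class and function names (`Empty`, `Epsilon`, `Symbol`, `Union`, `Concat`, `Star`, `nullable`, `derive`, `matches`) are assumptions of this example.

```python
from dataclasses import dataclass


class Regex:
    """Base class for the syntactic forms of the definition."""


@dataclass(frozen=True)
class Empty(Regex):      # rule 1: the empty set, accepts nothing
    pass


@dataclass(frozen=True)
class Epsilon(Regex):    # rule 2: the empty word
    pass


@dataclass(frozen=True)
class Symbol(Regex):     # rule 3: a single symbol of the alphabet
    char: str


@dataclass(frozen=True)
class Union(Regex):      # rule 4: (x ∪ y)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Concat(Regex):     # rule 4: (xy)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):       # rule 4: x*
    inner: Regex


def nullable(r: Regex) -> bool:
    """Return True if r accepts the empty word."""
    if isinstance(r, (Epsilon, Star)):
        return True
    if isinstance(r, Union):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Concat):
        return nullable(r.left) and nullable(r.right)
    return False  # Empty and Symbol


def derive(r: Regex, c: str) -> Regex:
    """Return an expression for all words w such that c + w is accepted by r."""
    if isinstance(r, Symbol):
        return Epsilon() if r.char == c else Empty()
    if isinstance(r, Union):
        return Union(derive(r.left, c), derive(r.right, c))
    if isinstance(r, Concat):
        first = Concat(derive(r.left, c), r.right)
        if nullable(r.left):
            # The left part may be skipped, so c can also start the right part.
            return Union(first, derive(r.right, c))
        return first
    if isinstance(r, Star):
        return Concat(derive(r.inner, c), r)
    return Empty()  # Empty and Epsilon have no continuation


def matches(r: Regex, word: str) -> bool:
    """Decide whether the whole word belongs to the language of r."""
    for c in word:
        r = derive(r, c)
    return nullable(r)


# (a ∪ b)* c : any mix of a and b, followed by exactly one c
example = Concat(Star(Union(Symbol("a"), Symbol("b"))), Symbol("c"))
print(matches(example, "abbac"))  # True
print(matches(example, "ab"))     # False
```

Because every step only inspects the current expression and the next symbol, the procedure behaves like a finite automaton whose states are expressions, which fits the equivalence mentioned in the introduction.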

## Applications of regular expressions

Ken Thompson used this notation to build qed (a predecessor of the Unix editor ed) and later to write the tool grep. Since then, a great many programs and programming-language libraries have implemented functions for searching and replacing character strings with regular expressions. Examples are the programs sed, grep, and emacs and libraries for the programming languages C, Perl, and Java; the word processor of the office suite OpenOffice.org also offers the possibility of searching text with regular expressions.

Some programming languages, such as Perl, support extensions of regular expressions, e.g. backreferences. These are no longer regular expressions in the sense of theoretical computer science, because expressions extended in this way no longer necessarily belong to Type 3 of the Chomsky hierarchy.

### Elements with which a regular expression can be specified

The following syntax descriptions refer to the syntax of the common regex implementations with extensions; they therefore correspond only partly to the above definition from theoretical computer science.

A frequent application of regular expressions is finding particular character strings in a set of character strings. The description given below is a (frequently used) convention for concretely realizing concepts such as character class, quantification, alternation, and grouping. Here a regular expression is formed from the characters of the underlying alphabet in combination with so-called metacharacters (`[`, `]`, `(`, `)`, `{`, `}`, `|`, `?`, `+`, `*`, `^`, `$`, `\`, `.`). All remaining characters of the alphabet stand for themselves.

#### Character literals

Characters that must match directly (literally) are also written directly. Depending on the system, there are also ways of specifying a character by its octal or hexadecimal code.

#### Any character

• `.` : A dot means that (almost) any character may stand in its place. Depending on the program used, a dot may also match a newline (line break); most implementations, however, do not treat the newline as an arbitrary character.
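As an illustration of the newline caveat, the following sketch uses Python's `re` module, which is only one of many implementations. There the dot skips the newline by default and includes it when the `re.DOTALL` flag is set. The sample string is made up for this example.

```python
import re

text = "a\nb"

# By default "." does not match the newline between "a" and "b".
default_match = re.fullmatch(r"a.b", text)

# With re.DOTALL the dot matches every character, including "\n".
dotall_match = re.fullmatch(r"a.b", text, flags=re.DOTALL)

print(default_match is None)     # True
print(dotall_match is not None)  # True
```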

#### A character from a selection

A character selection can be defined with square brackets. The expression in square brackets then stands for exactly one character from this selection (a character class).

Examples:

• `[egh]` : one of the characters "e", "g", or "h"
• `[0-6]` : a digit from "0" to "6" (a hyphen indicates a range)
• `[A-Za-z0-9]` : any Latin letter or any digit
• `[^a]` : any character except "a" (a "^" at the beginning of a character class negates it)

In many newer implementations, named classes can also be specified within the square brackets; these are themselves written with square brackets. They read, for example:

• `[:alnum:]` : Alphanumeric characters: `[:alpha:]` and `[:digit:]`.
• `[:alpha:]` : Letters: `[:lower:]` and `[:upper:]`.
• `[:blank:]` : Space and tab.
• `[:cntrl:]` : Control characters. In ASCII these are the characters 00 to 1F, and 7F (DEL).
• `[:digit:]` : Digits: 0, 1, 2, … to 9.
• `[:graph:]` : Graphic characters: `[:alnum:]` and `[:punct:]`.
• `[:lower:]` : Lowercase letters: a to z.
• `[:print:]` : Printable characters: `[:alnum:]`, `[:punct:]`, and space.
• `[:punct:]` : Punctuation characters such as ``! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~``.
• `[:space:]` : Whitespace: tab, newline, vertical tab, form feed, carriage return, and space.
• `[:upper:]` : Uppercase letters: A to Z.
• `[:xdigit:]` : Hexadecimal digits: 0 to 9, A to F, a to f.
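The bracket examples above can be tried out with a small sketch in Python's `re` module. The helper name `is_member` is made up for this illustration. The sketch uses only plain bracket selections, because Python's `re` module does not support the `[:alpha:]`-style class names.

```python
import re


def is_member(pattern: str, char: str) -> bool:
    """Return True if the single character is matched by the class pattern."""
    return re.fullmatch(pattern, char) is not None


print(is_member(r"[egh]", "g"))          # True
print(is_member(r"[egh]", "f"))          # False
print(is_member(r"[0-6]", "7"))          # False
print(is_member(r"[A-Za-z0-9]", "Q"))    # True
print(is_member(r"[^a]", "a"))           # False
print(is_member(r"[^a]", "b"))           # True
```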

#### Predefined character classes

There are predefined character classes, which, however, are not supported by all implementations, since they are only shorthand forms and can also be described by a character selection. Important character classes are:

• `\d` : a digit `[0-9]`
• `\D` : a non-digit `[^0-9]`
• `\w` : a letter, a digit, or the underscore `[a-zA-Z_0-9]`
• `\W` : neither a letter, nor a digit, nor the underscore `[^\w]`
• `\s` : whitespace, usually `[ \f\n\r\t\v]`
• `\S` : any character except whitespace `[^\s]`
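A short sketch of these shorthand classes in Python's `re` module follows; the sample string is invented. One caveat of this particular implementation: for text strings Python matches Unicode digits, letters, and whitespace by default, so the `re.ASCII` flag is used here to obtain the ASCII-only sets listed above.

```python
import re

sample = "Room_42 is\tfree"

digits = re.findall(r"\d", sample, flags=re.ASCII)
words = re.findall(r"\w+", sample, flags=re.ASCII)
spaces = re.findall(r"\s", sample, flags=re.ASCII)
non_words = re.findall(r"\W", sample, flags=re.ASCII)

print(digits)     # ['4', '2']
print(words)      # ['Room_42', 'is', 'free']
print(spaces)     # [' ', '\t']
print(non_words)  # [' ', '\t']

# Without re.ASCII, \d also accepts non-ASCII digits such as U+0663.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_digit = re.fullmatch(r"\d", "\u0663", flags=re.ASCII) is not None
print(unicode_digit, ascii_digit)  # True False
```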

#### Quantifiers (specifying the number of repetitions)

Quantifiers (also called repetition factors) allow the preceding expression to occur in the character string with different multiplicities:

• `?` : The preceding expression is optional; it may occur once but need not, i.e. the expression occurs zero times or once.
• `+` : The preceding expression must occur at least once, but may occur several times.
• `*` : The preceding expression may occur any number of times (including not at all).
• `{n}` : The preceding expression must occur exactly n times.
• `{min,}` : The preceding expression must occur at least min times.
• `{min,max}` : The preceding expression must occur at least min times and may occur at most max times.

Examples:

• `a+` matches "a" or "aa" or also "aaaa", etc.
• `[ab]+`, by contrast, matches "a", "b", "aa", "baab", etc.
• `[0-9]{2,5}` matches "13", "28333", and "123" as whole strings, but not "0", "123123223", etc.
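The quantifier examples can be checked with a few lines of Python. This is a sketch using the `re` module, and the helper name `accepts` is an assumption of this example. `re.fullmatch` is used so that the pattern has to cover the whole string rather than only a part of it.

```python
import re


def accepts(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


print(accepts(r"a+", "aaaa"))                # True
print(accepts(r"a+", ""))                    # False: at least one "a" is required
print(accepts(r"[ab]+", "baab"))             # True
print(accepts(r"[0-9]{2,5}", "28333"))       # True
print(accepts(r"[0-9]{2,5}", "0"))           # False: fewer than two digits
print(accepts(r"[0-9]{2,5}", "123123223"))   # False: more than five digits
print(accepts(r"ab?", "a"))                  # True: the "b" is optional
print(accepts(r"a{3}", "aa"))                # False: exactly three are required
```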

#### Greedy behavior

Normally, a regular expression with a quantifier finds (matches) the longest possible matching character string, which is why this behavior is called "greedy". Since this behavior is not always what is intended, quantifiers can be declared "non-greedy" in some newer regex implementations. To do so, a question mark `?` is placed after the quantifier. In conventional regex implementations, the resulting expression leads to an error message.

Example:

• Suppose the regular expression `A.*B` is applied to the string "ABCDEB"; it would then match the complete string "ABCDEB". With the non-greedy quantifier `*?`, the expression `A.*?B` matches only the character string "AB", i.e. it stops the search at the first "B" found.

Implementing non-greedy quantifiers is comparatively costly, which is why not all regex parsers support them.
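The difference can be reproduced with Python's `re` module, one of the implementations that support non-greedy quantifiers. The following is a minimal sketch using the string from the example.

```python
import re

text = "ABCDEB"

# Greedy: ".*" takes as much as possible, so the match runs to the last "B".
greedy = re.match(r"A.*B", text).group()

# Non-greedy: ".*?" takes as little as possible and stops at the first "B".
lazy = re.match(r"A.*?B", text).group()

print(greedy)  # ABCDEB
print(lazy)    # AB
```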

#### Grouping with parentheses

Expressions can be grouped with the parentheses `(` and `)`: for instance, `(abc)+` matches "abc" or "abcabc", etc.

Some programs store the groups and allow them to be reused within the regular expression or during text replacement. Searching and replacing with

```
AA(.*?)BB
```

as the regular search expression and

```
\1
```

as the replacement replaces every character string enclosed by AA and BB with the text contained between AA and BB. That is, AA, BB, and the text between them are replaced by the text between them alone, so AA and BB are missing from the result. `\1`, `\2`, etc. are called backreferences. `\1` refers to the first pair of parentheses, `\2` to the second, and so on; the opening parentheses are counted.
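In Python, this search-and-replace can be sketched with `re.sub`, which accepts `\1` in the replacement string. The input text below is invented for the illustration. The second part shows a backreference inside the pattern itself.

```python
import re

text = "x AAhelloBB y AAworldBB z"

# Replace each "AA...BB" span by the text captured in group 1.
result = re.sub(r"AA(.*?)BB", r"\1", text)
print(result)  # x hello y world z

# Inside a pattern, \1 must repeat exactly what group 1 matched:
# here, the first doubled letter in the word.
doubled = re.search(r"(\w)\1", "balloon").group()
print(doubled)  # ll
```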

Interpreters of regular expressions that permit backreferences no longer correspond to Type 3 of the Chomsky hierarchy. With the pumping lemma it is easy to show that the language described by the following expression, which checks whether the same number of 0s occurs before and after the 1 in a string, is not a regular language.

```
/^(0*)1\1$/
```
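The expression can be tried out as follows. This is a sketch in Python, where the pattern is written without the surrounding slashes; the function name `same_zeros` is an assumption of this example.

```python
import re

# Group 1 captures the leading zeros; \1 demands the same run after the "1".
BALANCED = re.compile(r"^(0*)1\1$")


def same_zeros(s: str) -> bool:
    """Return True if s consists of n zeros, a one, and again n zeros."""
    return BALANCED.match(s) is not None


print(same_zeros("00100"))   # True
print(same_zeros("1"))       # True (n = 0)
print(same_zeros("0100"))    # False
print(same_zeros("001000"))  # False
```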

#### Alternatives

Alternative expressions can be specified with the `|` symbol:

• `(ABC|abc)` means "ABC" or "abc", but not, for example, "Abc".
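A brief check of alternation in Python's `re` module, as an illustrative sketch; the variable names are made up.

```python
import re

either_case = re.compile(r"(ABC|abc)")

# Only the two listed alternatives are accepted as whole strings.
results = {s: either_case.fullmatch(s) is not None for s in ("ABC", "abc", "Abc")}
print(results)  # {'ABC': True, 'abc': True, 'Abc': False}
```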

#### Further characters

To support applications on the computer, which often operate on character strings, the following characters are usually defined in addition to those already mentioned:

• `^` stands for the start of the line (not to be confused with `^` in a character selection written with `[` and `]`).
• `$` can stand for the end of the line or of the string, depending on context.
• `\` removes, where applicable, the meta-meaning of the next character; for example, the expression `(A\*)+` matches the character strings "A*", "A*A*", etc.
• `\b` stands for the empty string at the beginning or at the end of a word.
• `\B` stands for the empty string that is not at the beginning or the end of a word.
• `\<` stands for the empty string at the beginning of a word.
• `\>` stands for the empty string at the end of a word.
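Some of these characters can be illustrated in Python's `re` module. This is a sketch with an invented sample text. In this implementation `^` and `$` refer to the whole string unless `re.MULTILINE` is set. `\<` and `\>` are not available, so `\b` is used for both word edges.

```python
import re

text = "cat concat\ncatalog"

# ^ : start of the string by default, start of every line with re.MULTILINE.
starts_default = re.findall(r"^cat", text)
starts_multiline = re.findall(r"^cat", text, flags=re.MULTILINE)

# $ : "concat" ends a line, but not the whole string.
ends_default = re.findall(r"concat$", text)
ends_multiline = re.findall(r"concat$", text, flags=re.MULTILINE)

# \b : "cat" as a whole word; \B : "cat" that does not start a word.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

# \ : the escaped star is an ordinary character.
escaped = re.fullmatch(r"(A\*)+", "A*A*") is not None

print(starts_default, starts_multiline)  # ['cat'] ['cat', 'cat']
print(ends_default, ends_multiline)      # [] ['concat']
print(whole_word, inside_word)           # [0] [7]
print(escaped)                           # True
```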
