.Open Definition of the form and meaning of Regular Expressions . Introduction This is a sample definition of the way that regular expressions are used in the UNIX awk command to define a pattern to be searched for. This is used in my Common Gateway Interface `lookup` to select bibliographic entries .See http://www/dick/samples/lookup.html . Method This piece of documentation defines the syntax and semantics of such regular expressions. The syntax is written in XBNF (stretched Backus-Naur Form). XBNF is a general version of EBNF that allows alternative definitions to be written. This allows the formal definition of several varieties of UNIX regular expression in one document: program::{ed, grep, egrep, sed, ex, vi, awk, ...}. The semantics by a function that maps each syntactic item into a set of matching strings. syntax::=@#$char, sets of ASCII character strings in a pattern. semantics::=@#$char,st of ASCII character strings. m::=$matches, an abbreviation for matches::=$syntax->$semantics. This map is defined piecemeal in the following. . Selected lines Line::=#$char. A given line either matches or does not match a pattern... the set of matching Lines is written $M: $M:: pattern->@Line. The lines that match a given pattern An item is selected by a pattern if any of its lines matches the pattern: For $Item I, $pattern p, selected(I,p)::= for some L:I(L in $M(p) ). . Patterns A pattern can specify where a given regular expression is searched for in a line: at the start, at the end, any where, or filling the whole line. pattern::=$caret $re | $re $dollar | $re | $caret $re $dollar. For $re r, M("^" r "$")= $m(r). For $re r, M("^" r )= $m(r) ! $any. For $re r, M( r "$")= $any ! $m(r). For $re r, M( r )= $any ! $m(r) ! $any. Where (!) = "followed by" or $concatenation. any::=#$char, -- any string of ASCII characters. Note. Anchored patterns (with "^" and "$") may not be useful for looking things up in this web site. . Regular Expressions regular_expression::syntax, re::abbreviation=regular_expression, If $program=awk or egrep then re=$alternative | $alternative $bar regular_expression else re=$alternative. For $alternative a, $regular_expression r, m( a $bar r) = $m(a) | $m(r). alternative ::= # $element. An empty pattern matches anything and so is deprecated. For $element e1,e2,e3,..., m(e1 e2 e3 ...) = $m(e1) ! $m(e2) ! $m(e3) ! ... where (!) = "followed by" or $concatenation. element ::= $single_piece | $option | $repeated | $iteration | $word. Words can be recognized in the two UNIX editirs: ex and vi. word::$syntax, if $program=ex or vi then word = "\\<" element "\\>" else word=$no. Options are only available in awk and egrep: option::syntax. If $program=awk or egrep then option =$single_piece $query else option=$no. For $single_piece c, m( c $query)= ($empty | $m(c)), optional. repeated::syntax. If $program=awk or egrep then repeated::=$single_piece $plus else repeated=$no. For some programs repeated = $single_piece "\\{" O(number) O(, O(number)) "\\}". m( c $plus)= m(c) ! #($m(c)). -- one or more m( c "\\{" n "," m "\\}")= between n and m copies inclusive. iteration::=$single_piece $star, -- any number including none. m( c $star)= #($m(c)). single_piece::= $single_char | $set_of_single_chars | $sub_expression. subexpression::syntax, If $program=awk or egrep then subexpression = $left $pattern $right else subexpression = $no. m( $left p $right) = $m(p). set_of_single_chars::= $dot | $lbracket $string $rbracket. m($dot)=char, -- the wild-card symbol is a period. For $string s, m("[" s "]")= $m2(s). . Sets of single characters Strings of form "[" s1 s2 s3 ... "]" have a special syntax: string::=$normal | $complemented. complemented::=$caret $normal normal::= $range_or_single | $range_or_single normal. range_or_normal::=$range | $char~$brackets. brackets::=$lbracket | $rbracket. Their meaning can be described by the functions $m2 and $m3 that interpret a string of characters(#$char) as a set of single character strings (@#$char) as follows: m2::$string->@#$$char. -- handles complemented forms. m3::$normal->@#$$char. -- handles normal forms only. For $normal s, $char c,c1,c2. m2("^" s )= $char ~ $m3(s). -- complement, but not,... m2( s )= $m3(s). m3(c1 "-" c2 ) = $set{ c:$char. c1<=c<=c2 }. m3(c1 "-" c2 s ) = $set{ c:$char. c1<=c<=c2 } | $m3( s ). m3(c )= {c}. m3(c s )= {c} | $m3(s). . Ommission I haven't documented the rules for putting brackets inside sets of single characters... . Lexemes char::Sets=`the set of ASCII characters`. magic::=$caret | $dollar | $plus | $star | $query | $bar | $left | $right | $dot | $lbracket | $rbracket | $backslash. |- $magic ==> $char. non_magic::=$char ~ $magic. for $non_magic c, $m(c) = {c}. escaped_char::= $backslash $char. For $char c, $backslash b, $m(b c) = {c}. single_char::=$non_magic | $escaped_char. This automatically defines matches ($m) for any $single_char. caret::="^". Indicates pattern occurs at start of line dollar::="$". Indicates pattern occurs at end of line. plus::="+". Indicates `one or more of the previous`. star::="*". Indicates `zero or more of the previous`. query::="?". follows an option piece bar::="|". separates alternative forms. left::="(". starts a subexpressions. right::=")", ends a subexpression. dot::=".". indicates a wild card character. lbracket::="[". starts a set of alternate characters. rbracket::="]". ends a set of alternate characters. backslash::="\\". used to indicate the character itself rather than its meaning. Note. It is probably not necessary to escape $caret and $dollar except at the start (respectively end) of a regular expression since they only have a `magic` meaning in those positions. . Glossary (concatenation): For a,b:#$char, a ! b ::= (a1, a2, a3, ..., b1,b2,b3, ...). For A,B:@#$char, A ! B ::= $set{ s:#$char. for some a:A, b:B( s=a!b) }. For A, #A ::= the smallest(X. empty in X and X!A in X). img(s)::`set of characters listed in string s`, img(s)::=$set( s[i]. i:1..len(s) ) empty::@#char= {""}, -- the set containg an empty or null string. no::@#char={}, -- the empty or null set. In an alternative shows that something is note allowed. magic::UNIX.jargon. Symbols meaning something other than themselves. Terminology dating back to the `vi` 'nomagic' option (I guess). set{ v:S. P} ::= the set of c in S satisfying P. .Close Definition of the form and meaning of Regular Expressions