Introduction
This is a sample definition of the way that regular
expressions are used in the UNIX awk command to define
a pattern to be searched for. This is used in my
Common Gateway Interface lookup to select bibliographic entries
[ lookup.html ]
Method
This piece of documentation defines the syntax and semantics of
such regular expressions. The syntax is written in XBNF (stretched
Backus-Naur Form). XBNF is a general version of EBNF that
allows alternative definitions to be written. This
allows the formal definition of several
varieties of UNIX regular expression in one document:
- program::{ed, grep, egrep, sed, ex, vi, awk, ...}.
The semantics by a function that maps each
syntactic item into a set of matching strings.
- syntax::=@#char, sets of ASCII character strings in a pattern.
- semantics::=@#char,st of ASCII character strings.
- m::=matches, an abbreviation for
- matches::=syntax->semantics.
This map is defined piecemeal in the following.
Selected lines
- Line::=#char.
A given line either matches or does not match a pattern... the set of
matching Lines is written M:
M:: pattern->@Line. The lines that match a given pattern
An item is selected by a pattern if any of its lines matches the pattern:
- For Item I, pattern p, selected(I,p)::= for some L:I(L in M(p) ).
Patterns
A pattern can specify where a given regular expression is searched for
in a line: at the start, at the end, any where, or filling the whole line.
- pattern::=caret re | re dollar | re | caret re dollar.
- For re r, M("^" r "$")= m(r).
- For re r, M("^" r )= m(r) ! any.
- For re r, M( r "$")= any ! m(r).
- For re r, M( r )= any ! m(r) ! any.
- Where (!) = "followed by" or concatenation.
- any::=#char, -- any string of ASCII characters.
Note. Anchored patterns (with "^" and "$") may not be useful for
looking things up in this web site.
Regular Expressions
- regular_expression::syntax,
- re::abbreviation=regular_expression,
- If program=awk or egrep then re=alternative | alternative bar regular_expression
- else re=alternative.
- For alternative a, regular_expression r, m( a bar r) = m(a) | m(r).
- alternative::= # element. An empty pattern matches anything and so is deprecated.
- For element e1,e2,e3,..., m(e1 e2 e3 ...) = m(e1) ! m(e2) ! m(e3) ! ...
- where (!) = "followed by" or concatenation.
- element::= single_piece | option | repeated | iteration | word.
Words can be recognized in the two UNIX editirs: ex and vi.
- word::syntax,
- if program=ex or vi then word = "\\<" element "\\>" else word=no.
Options are only available in awk and egrep:
- option::syntax.
- If program=awk or egrep then option =single_piece query else option=no.
- For single_piece c, m( c query)= (empty | m(c)), optional.
- repeated::syntax.
If program=awk or egrep then repeated::=single_piece plus else repeated=no.
For some programs repeated = single_piece "\\{" O(number) O(, O(number)) "\\}".
- m( c plus)= m(c) ! #(m(c)). -- one or more
- m( c "\\{" n "," m "\\}")= between n and m copies inclusive.
- iteration::=single_piece star, -- any number including none.
- m( c star)= #(m(c)).
- single_piece::= single_char | set_of_single_chars | sub_expression.
- subexpression::syntax,
- If program=awk or egrep then subexpression = left pattern right else subexpression = no.
- m( left p right) = m(p).
- set_of_single_chars::= dot | lbracket string rbracket.
- m(dot)=char, -- the wild-card symbol is a period.
- For string s, m("[" s "]")= m2(s).
Sets of single characters
Strings of form "[" s1 s2 s3 ... "]" have a special syntax:
- string::=normal | complemented.
- complemented::=caret normal
- normal::= range_or_single | range_or_single normal.
- range_or_normal::=range | char~brackets.
- brackets::=lbracket | rbracket.
Their meaning can be described
by the functions m2 and m3 that interpret a string of characters(#char)
as a set of single character strings (@#char) as follows:
- m2::string->@#$char. -- handles complemented forms.
- m3::normal->@#$char. -- handles normal forms only.
For normal s, char c,c1,c2.
- m2("^" s )= char ~ m3(s). -- complement, but not,...
- m2( s )= m3(s).
- m3(c1 "-" c2 ) = set{ c:char. c1<=c<=c2 }.
- m3(c1 "-" c2 s ) = set{ c:char. c1<=c<=c2 } | m3( s ).
- m3(c )= {c}.
- m3(c s )= {c} | m3(s).
Ommission
I haven't documented the rules for putting brackets inside
sets of single characters...
Lexemes
- char::Sets=the set of ASCII characters.
- magic::=caret | dollar | plus | star | query | bar | left | right | dot | lbracket | rbracket | backslash.
- |-magic ==> char.
- non_magic::=char ~ magic.
- for non_magic c, m(c) = {c}.
- escaped_char::= backslash char.
- For char c, backslash b, m(b c) = {c}.
- single_char::=non_magic | escaped_char.
This automatically defines matches (m) for any single_char.
- caret::="^". Indicates pattern occurs at start of line
- dollar::="$". Indicates pattern occurs at end of line.
- plus::="+". Indicates one or more of the previous.
- star::="*". Indicates zero or more of the previous.
- query::="?". follows an option piece
- bar::="|". separates alternative forms.
- left::="(". starts a subexpressions.
- right::=")", ends a subexpression.
- dot::=".". indicates a wild card character.
- lbracket::="[". starts a set of alternate characters.
- rbracket::="]". ends a set of alternate characters.
- backslash::="\\". used to indicate the character itself rather than its meaning.
Note. It is probably not necessary to escape caret and dollar except
at the start (respectively end) of a regular expression since they only have
a magic meaning in those positions.
Glossary
(concatenation):
- For a,b:#char, a ! b::= (a1, a2, a3, ..., b1,b2,b3, ...).
- For A,B:@#char, A ! B::= set{ s:#char. for some a:A, b:B( s=a!b) }.
- For A, #A::= the smallest(X. empty in X and X!A in X).
- img(s)::set of characters listed in string s,
- img(s)::=set( s[i]. i:1..len(s) )
- empty::@#char= {""}, -- the set containg an empty or null string.
- no::@#char={}, -- the empty or null set. In an alternative shows that
something is note allowed.
- magic::UNIX.jargon. Symbols meaning something other than themselves. Terminology dating back to the vi 'nomagic' option (I guess).
set{ v:S. P} ::= the set of c in S satisfying P.