Skip to content

Latest commit

 

History

History
221 lines (157 loc) · 10.7 KB

File metadata and controls

221 lines (157 loc) · 10.7 KB

Regular expressions

Setup

For today's lesson you will need to download the following file and unzip it: regular_expressions.zip

Reference notes

Regular expressions (called REs, regexes, regexps, regex patterns) are essentially a tiny, highly specialized programming language embedded inside general purpose programming languages (Python, XQuery, javascript). Please note that there are differences in how all general purpose programming langugages implement thier RE language. When using a RE language, you specify the rules for the set of possible strings that you want to match:

  • sentences in a language,
  • e-mail addresses,
  • TeX commands, or
  • anything you like.

It does not replace a parser for XML or HTML since you can easily create invalid and non-wellformed markup.

You can then ask questions such as:

  • Does this string match the pattern?, or
  • Is there a match for the pattern anywhere in this string?.

You can also use REs to modify a string or to split it apart in various ways.

RE patterns are usually compiled into a series of bytecodes which are then executed by a matching engine. (For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster or just not use too much resources to be useful).

Since a RE language is relatively small and restricted not all possible string processing tasks can be done using REs. There are also tasks that can be done with REs, but the expressions turn out to be very complicated.

In these cases, you may be better off writing code in the programming language, e.g. Python, to do the processing; Usually it is slower than an elaborate RE but probably a lot easier to understand.

Your best use would probably be to assert negative patterns, e.g. things you know are wrong.

Simple patterns

We will start by learning about the simplest possible REs. Since REs are used to operate on strings, we will begin with the most common task: matching characters.

Characters

  • Most letters and characters will simply match themselves.
  • Though some characters are special metacharacters, and do not match themselves:
    • They signal that some out-of-the-ordinary thing should be matched, or
    • affect other portions of the RE by repeating them or changing their meaning.

Metacharacters

The complete list of metacharacters:

Metacharacter Name
. period or dot
^ caret
$ dollar sign
* asterisk or star
+ plus sign
? question mark
{ opening curly brace
} closing curly brace
[ opening square bracket
] closing square bracket
\ backslash
| pipe or bar
( opening parenthesis
) closing parenthesis

Character classes are surrounded by opening square bracket [ and closing square bracket ] to form a set of characters. Either you specify the characters individually or use ranges by giving a hyphen - inbetween. Metacharacters are not active inside character classes. Since the character class is a set you can also complement it. To do complementing you give a caret ^ as the first character of the class.

One of the most important metacharacters is the backslash \ which is used to:

  • indicate various special sequences
  • escape all metacharacters so they can be used in patterns without their special mening, e.g. use \[ to match an actual opening square bracket [ in the string.

Some of the special sequences beginning with backslash \ represent predefined shorthand sets of characters that are often useful:

  • the set of digits,
  • the set of letters, or
  • the set of anything that is not whitespace.

\w matches any alphanumeric character. For use with Python this set differs depending on whether the RE pattern is:

  • a string, \w will match all the characters marked as letters or digits in the Unicode data plus underscore, or
  • bytes, then this is equivalent to the class [a-zA-Z0-9_].
Special sequence Matches Restricted1 equivalent to
\d any decimal digit [0-9]
\D any non-digit character [^0-9]
\s any whitespace character [ \t\n\r\f\v]
\S any non-whitespace character [^ \t\n\r\f\v]
\w any alphanumeric character [a-zA-Z0-9_]
\W any non-alphanumeric character [^a-zA-Z0-9_]

1 With Python you can use the more restricted definition of e.g. \w in a string pattern by supplying the re.ASCII flag when compiling the regular expression. Otherwise the Unicode character categories are used and thus the sequence sets include a lot more characters.

Sequences can be included inside a character class. E.g. [\s:;] will match any whitespace character, a colon : or semicolon ;.

The final metacharacter in this section is dot .. It matches anything except a newline character. (In Python you can use re.DOTALL to match even a newline. Dot . is often used where you want to match any character.)

Repetition

Matching varying sets of characters is the first thing REs can do. Another capability is that you can specify that portions of the RE must be repeated, i.e. qualified, a certain number of times.

All four repeating qualifiers:

* + ? {m,n}

The first single metacharacter for repeating things that we will look at is star *. Star * does not match the literal character *. It specifies that the previous character can be matched zero or more times, instead of exactly once. This means whatever is being repeated may not be present at all.

String RE Match
star [e]* Yes
staar t[a]*r Yes

The second repeating metacharacter is plus + which matches one or more times. This requires at least one occurrence compared to asterisk *.

String RE Match
plus pl[au]+s Yes
pluus plu+s Yes
plusplus uss+ No
plussusch us[cs]+ Yes

The third single repeating qualifiers is the question mark ? which matches either once or zero times.

String RE Match
question qu?e Yes
question est?s Yes
markka rk?a No
mark a?r Yes

The fourth and most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n repetitions.

String RE Match
complicated li{1,1}c Yes
appreciated p{2,2} Yes
rain [ai]{2,2} Yes
rain [ai]{1,2} Yes
complicated li[act]{3,}ed Yes

If either m or n is omitted it becomes for e.g. {3,} three or more and {,3} up to three repetitions.

With this qualifier you can express all the single repeating qualifiers, e.g. ? as {0,1} + as {1,}, and * as {0,} but the single versions are both easier on the eye and shorter to write.

Anchors

Anchors match a position not characters.

Metacharacter anchors

Metacharacter Matches at
^ the start of a string
$ the end of a string

Most RE engines have a multi-line mode that makes caret ^ match after any line break, and dollar_sign $ before any line break.

String RE Match
complicated ^comp Yes
appreciated ed$ Yes
rain ^rain$ Yes
rain $r[ai]+n$ Yes
complicated ^comp.*ed$ Yes

Special sequence anchors

Special sequence Matches at
\b a word boundary
\B not a word boundary

A word boundary is a position between a character that can be matched by the set of characters of \w and a character that cannot be matched by \w. \b also matches at the ends of the string if the first/last characters in the string are word characters. \B matches at every position where \b cannot match.

String RE Match
complicated \bcomp Yes
appreciated \Bed\b Yes
rain \brain\b Yes
rain $r[ai]+n\b Yes
complicated \bcomp.+\b Yes

Regular Expressions 2

Last session we covered simple patterns and repetition. We also did some exercises on this using egrep. Today we firstly want to cover alternation and grouping before we continue using egrep with more andvanced expressions. Later on we will start compare egreps REs to python.

Alternation

Alternation is the RE equivalent of or. word|weapon matches words in About words and other mighty weapons. Applied again the RE alternation matches weapons. You can add as many alternatives as you want, e.g. letter|syllable|word|phrase|sentence|paragraph.

Alternation has the lowest precedence of all RE operators.

Exercise: Find the preference for all types, e.g concatenation, repetition and alternation.

Grouping

Since we introduced precedence in the previous section we also want to be able to change the behaviour. This is one of the things grouping does.

Metacharacter Explanation
( starts a group
) ends a group

Precedence examples

StringPatternMatch?
word and phrase levelword|phrase levelYes, both word and phrase level
walked up to the talking lamp posted\b|ing\bYes, both ed at the end of walked and ing at the end of talking
word level and phrase levelword|phrase levelYes, but only word and phrase level (not all of word level)
word level and phrase level(word|phrase) levelYes, both word level and phrase level

In addition to use the grouping metachararacters to alter the precedence you can use it for back reference. Some RE implementations have named grouping back references others just \1, \2 etcetera.

Exercise: Check out how this is in egrep.

Comparing to Python

Without going into actual Python programming we are going to see how the egrep REs compare to Python's:

Make sure to tick Python in the Flavour list to the left.