In the first two parts of this series, we looked at some fairly advanced usage of regular expressions. In this part, we take a step back and look at some of the other functions Python offers in the re module, then we talk about some common mistakes people regularly (ha!) make.
Useful Python Regular Expression Functions
Python offers several functions that make it easy to manipulate strings using regular expressions.
- With a single statement, Python can return a list of all the sub-strings that match a regular expression.
```python
>>> import re
>>> s = 'Hello world, this is a test!'
>>> print(re.findall(r'\S+', s))
['Hello', 'world,', 'this', 'is', 'a', 'test!']
```
`\S` means any non-whitespace character, so the regular expression `\S+` means match one or more non-whitespace characters (i.e. a word).
- We can replace each matching sub-string with another string.
```python
>>> print(re.sub(r'\S+', 'WORD', s))
WORD WORD WORD WORD WORD WORD
```
The call to `re.sub` replaces every match for the regular expression (i.e. every word) with the string “WORD”.
- Or if you want to iterate over each matching sub-string and process it yourself, `re.finditer` will loop over each match, returning a `MatchObject` on each iteration.
```python
>>> for mo in re.finditer(r'\S+', s):
...     print('[%d:%d] = %s' % (mo.start(), mo.end(), mo.group()))
[0:5] = Hello
[6:12] = world,
[13:17] = this
[18:20] = is
[21:22] = a
[23:28] = test!
```
- Python also has a function that will split a string into parts, using a regular expression as the separator. Let’s say we have a string that uses commas and semi-colons as separators, with spaces scattered all over the place.
```python
s = 'word1,word2 , word3;word4 ; word5'
```
Our regular expression for the separator would be `\s*[,;]\s*`.
Or in plain English:
- Zero or more white-space characters.
- A comma or semi-colon.
- Zero or more white-space characters.
Here it is in action:
```python
>>> s = 'word1,word2 , word3;word4 ; word5'
>>> print(re.split(r'\s*[,;]\s*', s))
['word1', 'word2', 'word3', 'word4', 'word5']
```
Each word has been split off correctly and the white-space removed.
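As a side note (not covered above, but documented behaviour of `re.split`): if the separator pattern contains a capturing group, the text of the separators is kept in the result as well. A minimal sketch:

```python
import re

s = 'word1,word2 , word3;word4 ; word5'

# Wrapping [,;] in a capturing group makes re.split keep each separator
parts = re.split(r'\s*([,;])\s*', s)
print(parts)
# ['word1', ',', 'word2', ',', 'word3', ';', 'word4', ';', 'word5']
```

This is handy when you need to know which separator was used between each pair of words.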
Common Python Regular Expression Mistakes
Not using the DOTALL flag when searching multi-line strings
In a regular expression, the special character `.` means match any character.
```python
>>> s = 'BEGIN hello world END'
>>> mo = re.search('BEGIN (.*) END', s)
>>> print(mo.group(1))
hello world
```
However, if the string being searched consists of multiple lines, `.` will not match the newline character (`\n`):
```python
>>> s = '''BEGIN hello
... world END'''
>>> mo = re.search('BEGIN (.*) END', s)
>>> print(mo)
None
```
Our regular expression says: find the word BEGIN, then zero or more characters, then the word END. What’s happened is that Python has found the word “BEGIN”, then matched characters up to the newline, which `.` doesn’t match. Then, Python looks for the word “END” and since it doesn’t find it, the regular expression doesn’t match anything.
If you want the regular expression to match a sub-string that spans multiple lines, you need to pass in the DOTALL flag:
```python
>>> mo = re.search('BEGIN (.*) END', s, re.DOTALL)
>>> print(mo.group())
BEGIN hello
world END
```
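An equivalent approach (not shown above, but part of Python’s regular expression syntax) is the inline `(?s)` flag, which switches on DOTALL from inside the pattern itself — useful when you can’t pass flags, e.g. when the pattern comes from a config file:

```python
import re

s = '''BEGIN hello
world END'''

# (?s) at the start of the pattern has the same effect as re.DOTALL
mo = re.search(r'(?s)BEGIN (.*) END', s)
print(mo.group(1))
```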
Not using the MULTILINE flag when searching multi-line strings
In the UNIX world, `^` and `$` are widely understood to match the start and end of a line, but this is only true with Python regular expressions if the `MULTILINE` flag has been set. If it hasn’t, they will only match the start/end of the entire string being searched.
```python
>>> s = '''hello
... world'''
>>> print(re.findall(r'^\S+$', s))
[]
```
To get the behaviour we would expect, pass in the MULTILINE (or `re.M` for short) flag:
```python
>>> print(re.findall(r'^\S+$', s, re.MULTILINE))
['hello', 'world']
```
Not making repetitions non-greedy
The repetition operators `*`, `+` and `?` match 0 or more, 1 or more, and 0 or 1 repetitions respectively, and by default they are greedy (i.e. they try to match as many characters as they possibly can).
A classic mistake is trying to match HTML tags using a regular expression like `<.+>`.
It seems reasonable enough – match the opening `<`, then one or more characters, then the closing `>` – but when we try it out on some HTML, this is what happens:
```python
>>> s = '<head> <style> blah </style> </head>'
>>> mo = re.search('<.+>', s)
>>> print(mo.group())
<head> <style> blah </style> </head>
```
What’s happened is that Python has matched the opening `<`, then one or more characters (head), then the closing `>`, but instead of stopping there, it tries to see if it can do better and get the `.` character to match more characters. And indeed it can: it can match everything all the way up to the `>` at the very end of the string, which is why this regular expression ends up matching the entire string.
The way to fix this is to make the `+` repetition non-greedy (i.e. make it match as few characters as possible) by putting a `?` character after it.
```python
>>> mo = re.search('<.+?>', s)
>>> print(mo.group())
<head>
```
Now, when Python reaches the first `>` (the one that closes the initial `<head>` tag), it stops straight away instead of trying to see if it can do any better.
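Combined with `re.findall` from earlier, the non-greedy version picks out every tag individually — a small sketch using the same sample string:

```python
import re

s = '<head> <style> blah </style> </head>'

# Non-greedy: each match stops at the first '>' it reaches
tags = re.findall(r'<.+?>', s)
print(tags)
# ['<head>', '<style>', '</style>', '</head>']
```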
Not making searches case-insensitive
By default, regular expressions are case-sensitive. For example:
```python
>>> s = 'Hello World!'
>>> mo = re.search('world', s)
>>> print(mo)
None
```
To make the search case-insensitive, pass in the IGNORECASE (or `re.I` for short) flag:
```python
>>> mo = re.search('world', s, re.IGNORECASE)
>>> print(mo.group())
World
```
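One more detail worth knowing (not covered above): flags can be combined with the bitwise `|` operator, so you can, for example, do a case-insensitive search across lines in one call:

```python
import re

s = '''Hello
WORLD'''

# Combine IGNORECASE and MULTILINE with bitwise OR
matches = re.findall(r'^world$', s, re.IGNORECASE | re.MULTILINE)
print(matches)
# ['WORLD']
```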
Not compiling regular expressions
Python does a lot of work to prepare a regular expression for use, so if you’re going to use a particular regular expression a lot, it’s worth compiling it first.
```python
>>> myRegex = re.compile('...')
>>> # This reads the file line-by-line
>>> for lineBuf in open(testFilename, 'r'):
...     print(myRegex.findall(lineBuf))
```
Now Python does the preparatory work only once, then re-uses the pre-compiled regular expression on each pass of the loop, resulting in a big time saving.
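To see the effect on your own patterns, `timeit` gives a quick comparison. This is a rough sketch — the sample data and repeat count are made up, the numbers vary by machine, and note that `re` also keeps a small internal cache of compiled patterns, so the gap you measure is mostly the per-call cache lookup overhead:

```python
import re
import timeit

# Hypothetical workload: many short lines scanned with the same pattern
lines = ['alpha beta gamma'] * 10_000

word_re = re.compile(r'\S+')   # prepared once, up front

def with_compile():
    return [word_re.findall(line) for line in lines]

def without_compile():
    # re.findall must look up its cached compiled pattern on every call
    return [re.findall(r'\S+', line) for line in lines]

print('pre-compiled:', timeit.timeit(with_compile, number=10))
print('on the fly:  ', timeit.timeit(without_compile, number=10))
```

Both functions return identical results; only the timing differs.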