In the first two parts of this series, we looked at some fairly advanced usage of regular expressions. In this part, we take a step back and look at some of the other functions Python offers in the re module, then we talk about some common mistakes people regularly (ha!) make.

Useful Python Regular Expression Functions

Python offers several functions that make it easy to manipulate strings using regular expressions.

  • With a single statement, Python can return a list of all the sub-strings that match a regular expression.

For example:

>>> import re
>>> s = 'Hello world, this is a test!'
>>> print(re.findall(r'\S+', s))
['Hello', 'world,', 'this', 'is', 'a', 'test!']

\S means any non-whitespace character, so the regular expression \S+ means match one or more non-whitespace characters (i.e. roughly a word, including any punctuation attached to it).
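One behaviour of re.findall that often surprises people: if the pattern contains capturing groups, the list holds the groups rather than the whole match. Here is a small sketch using a made-up string (text is our own example, not part of the series):

```python
import re

# Hypothetical sample text for illustration.
text = 'width=80 height=24'

# No capturing groups: each item is the whole match.
whole = re.findall(r'\w+=\d+', text)
print(whole)   # ['width=80', 'height=24']

# Two capturing groups: each item is a tuple of the groups.
pairs = re.findall(r'(\w+)=(\d+)', text)
print(pairs)   # [('width', '80'), ('height', '24')]
```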

  • We can replace each matching sub-string with another string.

For example:

>>> print(re.sub( r'\S+' , 'WORD', s))
WORD WORD WORD WORD WORD WORD

The call to re.sub replaces every match for the regular expression (i.e. each word) with the string “WORD”.
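The replacement doesn't have to be a fixed string. As a minimal sketch, re.sub also accepts a function that receives each match object and returns the replacement text, and a count argument that limits how many matches are replaced (the helper name shout is our own):

```python
import re

s = 'Hello world, this is a test!'

def shout(mo):
    # mo is the match object for one word; return its replacement.
    return mo.group().upper()

upper = re.sub(r'\S+', shout, s)
print(upper)      # HELLO WORLD, THIS IS A TEST!

# count=2 replaces only the first two matches.
first_two = re.sub(r'\S+', 'WORD', s, count=2)
print(first_two)  # WORD WORD this is a test!
```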

  • Or if you want to iterate over each matching sub-string and process it yourself, re.finditer returns an iterator that yields a match object for each match.

For example:

>>> for mo in re.finditer(r'\S+', s):
...     print('[%d:%d] = %s' % (mo.start(), mo.end(), mo.group()))
[0:5] = Hello
[6:12] = world,
[13:17] = this
[18:20] = is
[21:22] = a
[23:28] = test!

  • Python also has a function that will split a string into parts, using a regular expression as the separator. Let’s say we have a string that uses commas and semi-colons as separators, with spaces all over the place.

For example:

s = 'word1,word2 , word3;word4 ; word5'

Our regular expression for the separator would be: \s*[,;]\s*.

Or in plain English:

  • Zero or more white-space characters.
  • A comma or semi-colon.
  • Zero or more white-space characters.

Here it is in action:

>>> s = 'word1,word2 , word3;word4 ; word5'
>>> print(re.split(r'\s*[,;]\s*', s))
['word1', 'word2', 'word3', 'word4', 'word5']

Each word has been split off correctly and the white-space removed.
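Note that only the white-space around the separators is consumed. As a quick sketch with a made-up string, white-space at the very start or end of the input survives the split unless we strip it first:

```python
import re

# Hypothetical input with leading and trailing spaces.
s = '  word1 , word2;word3  '

parts = re.split(r'\s*[,;]\s*', s)
print(parts)  # ['  word1', 'word2', 'word3  ']

# Stripping the string first gives clean parts.
clean = re.split(r'\s*[,;]\s*', s.strip())
print(clean)  # ['word1', 'word2', 'word3']
```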

Common Python Regular Expression Mistakes

Not using the DOTALL flag when searching multi-line strings

In a regular expression, the special character . means match any character.

For example:

>>> s = 'BEGIN hello world END'
>>> mo = re.search('BEGIN (.*) END', s)
>>> print(mo.group(1))
hello world

However, if the string being searched consists of multiple lines, . will not match the newline character (\n).

>>> s = '''BEGIN hello
...        world END'''
>>> mo = re.search('BEGIN (.*) END', s)
>>> print(mo)
None

Our regular expression says: find the word BEGIN, then zero or more characters, then the word END. What has happened is that Python found the word “BEGIN”, then matched characters up to the newline, which . does not match. Python then looks for the word “END” at that point and, since it doesn’t find it, the regular expression doesn’t match anything.

If you want the regular expression to match a sub-string that spans multiple lines, you need to pass in the DOTALL flag:

>>> mo = re.search('BEGIN (.*) END', s, re.DOTALL)
>>> print(mo.group())
BEGIN hello
       world END
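For reference, here is a short sketch of two equivalent spellings of the same flag: the alias re.S, and the inline form (?s) placed at the start of the pattern. The string is a simplified version of the one above:

```python
import re

s = 'BEGIN hello\nworld END'

# re.S is the short alias for re.DOTALL.
mo = re.search(r'BEGIN (.*) END', s, re.S)
print(repr(mo.group(1)))         # 'hello\nworld'

# The same flag can be set inside the pattern with (?s).
mo_inline = re.search(r'(?s)BEGIN (.*) END', s)
print(repr(mo_inline.group(1)))  # 'hello\nworld'
```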

Not using the MULTILINE flag when searching multi-line strings

In the UNIX world, ^ and $ are widely understood to match the start/end of a line but this is only true with Python regular expressions if the MULTILINE flag has been set. If it hasn’t, they will only match the start/end of the entire string being searched.

>>> s = '''hello
... world'''
>>> print(re.findall(r'^\S+$', s))
[]

To get the behaviour we would expect, pass in the MULTILINE (or M for short) flag:

>>> print(re.findall(r'^\S+$', s, re.MULTILINE))
['hello', 'world']
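A typical use is pulling selected lines out of a block of text. The following is a minimal sketch on an invented log string (log and its contents are purely illustrative):

```python
import re

# Hypothetical multi-line log text.
log = 'ERROR disk full\nINFO started\nERROR timeout'

# Without the flag, ^ and $ only anchor to the whole string.
no_flag = re.findall(r'^ERROR (.*)$', log)
print(no_flag)  # []

# With MULTILINE, they anchor to each line.
errors = re.findall(r'^ERROR (.*)$', log, re.MULTILINE)
print(errors)   # ['disk full', 'timeout']
```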

Not making repetitions non-greedy

The operators *, + and ? match 0 or more, 1 or more, and 0 or 1 repetitions respectively and, by default, they are greedy (i.e. they try to match as many characters as they possibly can).

A classic mistake is trying to match HTML tags using a regular expression like this: <.+>

It seems reasonable enough – match the opening <, then one or more characters, then the closing > – but when we try it out on some HTML, this is what happens:

>>> s = '<head> <style> blah </style> </head>'
>>> mo = re.search('<.+>', s)
>>> print(mo.group())
<head> <style> blah </style> </head>

What’s happened is that Python has matched the opening <, then let .+ grab as many characters as it can, which takes it all the way to the end of the string. It then backs up only as far as it needs to for the closing > to match, and the first place that works is the > at the very end of the string, which is why this regular expression ends up matching the entire string.

The way to fix this is to make the + operator non-greedy (i.e. make it match as few characters as possible) by putting a ? character after it.

>>> mo = re.search('<.+?>', s)
>>> print(mo.group())
<head>

Now, when Python reaches the first > (that closes the initial tag), it stops straight away instead of trying to see if it can do any better.
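To see every tag rather than just the first, we can hand the same pattern to re.findall. The sketch below also shows a common alternative that avoids the greediness question altogether, a negated character class that matches anything except >:

```python
import re

s = '<head> <style> blah </style> </head>'

# Non-greedy: stop at the first > after each <.
lazy = re.findall(r'<.+?>', s)
print(lazy)     # ['<head>', '<style>', '</style>', '</head>']

# Negated character class: one or more characters that are not >.
negated = re.findall(r'<[^>]+>', s)
print(negated)  # ['<head>', '<style>', '</style>', '</head>']
```

Both give the same result here. Neither is a substitute for a real HTML parser on anything but simple input.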

Not making searches case-insensitive

By default, regular expressions are case-sensitive. For example:

>>> s = 'Hello World!'
>>> mo = re.search('world', s)
>>> print(mo)
None

To make the search case-insensitive, pass in the IGNORECASE flag:

>>> mo = re.search('world', s, re.IGNORECASE)
>>> print(mo.group())
World
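Flags can be combined with the | operator when more than one is needed. A small sketch, using an invented two-line string, that applies IGNORECASE together with another standard re flag, MULTILINE, and also shows the inline (?i) form:

```python
import re

# Hypothetical two-line string.
s = 'Hello\nWORLD'

# Combine flags with |: case-insensitive and line-anchored.
words = re.findall(r'^[a-z]+$', s, re.IGNORECASE | re.MULTILINE)
print(words)         # ['Hello', 'WORLD']

# The inline form (?i) sets IGNORECASE from inside the pattern.
mo = re.search(r'(?i)world', s)
print(mo.group())    # WORLD
```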

Not compiling regular expressions

Python does a lot of work to prepare a regular expression for use, so if you’re going to use a particular regular expression a lot, it’s worth compiling it first.

For example:

>>> myRegex = re.compile('...')
>>> # This reads the file line-by-line (testFilename is assumed to hold the path to a text file)
>>> for lineBuf in open(testFilename, 'r'):
...     print(myRegex.findall(lineBuf))

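Now Python does the preparatory work only once, then re-uses the pre-compiled regular expression on each pass of the loop. The re module does keep an internal cache of recently used patterns, so the saving comes mostly from skipping that cache lookup on every call, which adds up in tight loops.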
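Here is a self-contained sketch of the same idea. The pattern \d+ and the in-memory file are stand-ins of our own, since the pattern in the example above is elided:

```python
import io
import re

# Hypothetical pattern: one or more digits.
number_re = re.compile(r'\d+')

# io.StringIO stands in for a real file opened with open().
fake_file = io.StringIO('a 1 b 22\nno digits here\n333\n')

results = []
for line in fake_file:
    # The compiled pattern has the same methods as the re module functions.
    results.append(number_re.findall(line))

print(results)  # [['1', '22'], [], ['333']]
```

Flags such as re.IGNORECASE are passed to re.compile rather than to each call.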
