More About Python Regular Expressions

In the first part of this series, we looked at the basic syntax of regular expressions and some simple examples. In this part, we’ll take a look at some more advanced syntax and a few of the other features Python has to offer.

Regular Expression Captured Groups

So far, we’ve searched within a string using a regular expression and used the returned MatchObject to extract the entire sub-string that was matched. Now we’ll look at how we can extract parts within the sub-string that was matched.

This regular expression:

\d{2}-\d{2}-\d{4}

Will match a date with the following format:

  • A 2-digit day.
  • A hyphen.
  • A 2-digit month.
  • A hyphen.
  • A 4-digit year.

For example:

>>> import re
>>> s = 'Today is 31-05-2012'
>>> mo = re.search(r'\d{2}-\d{2}-\d{4}', s)
>>> print(mo.group())
31-05-2012

We can capture various parts of this regular expression by putting them in parentheses:

(\d{2})-(\d{2})-(\d{4})

If Python matches this regular expression, we can then retrieve each captured group separately.

>>> mo = re.search(r'(\d{2})-(\d{2})-(\d{4})', s)
>>> # Note: The entire matched string is still available
>>> print(mo.group())
31-05-2012
>>> # The first captured group is the day
>>> print(mo.group(1))
31
>>> # And this is its start/end position in the string
>>> print('%s %s' % (mo.start(1), mo.end(1)))
9 11
>>> # The second captured group is the month
>>> print(mo.group(2))
05
>>> # The third captured group is the year
>>> print(mo.group(3))
2012
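
When all of the groups are needed at once, it can be more convenient to fetch them in a single call. The following is a small sketch that re-uses the same string and pattern as above; the variable names day, month and year are just illustrative choices.

```python
import re

s = 'Today is 31-05-2012'
mo = re.search(r'(\d{2})-(\d{2})-(\d{4})', s)

# groups() returns every captured group as a tuple of strings
day, month, year = mo.groups()
print(day, month, year)

# group() also accepts several group numbers and returns a tuple
reordered = mo.group(3, 2, 1)
print(reordered)
```

Note that the captured values are always strings, so convert them with int() if you need numbers.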

When you start writing more complex regular expressions, with lots of captured groups, it can be useful to refer to them by a meaningful name rather than a number. The syntax is (?P<name>...), where … is the regular expression to be captured, and name is the name you want to give to the group.

>>> s = "Joe's ID: abc123"
>>> # A normal captured group
>>> mo = re.search(r'ID: (.+)', s)
>>> print(mo.group(1))
abc123
>>> # A named captured group
>>> mo = re.search(r'ID: (?P<id>.+)', s)
>>> print(mo.group('id'))
abc123
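
Named groups pay off most when a pattern has several of them. As a rough sketch, here is the earlier date pattern rewritten with names; the group names day, month and year are assumptions made for this example.

```python
import re

s = 'Today is 31-05-2012'
pattern = r'(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})'
mo = re.search(pattern, s)

# groupdict() maps each group name to the text it captured
parts = mo.groupdict()
print(parts['year'])

# Named groups are still numbered, so both styles can be mixed
same = mo.group('day') == mo.group(1)
print(same)
```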

Re-using Captured Groups with Regular Expressions

We can also take captured groups and re-use them later in the regular expression! (?P=name) means match whatever was previously matched in the named group. For example:

>>> s = 'abc 123 def 456 def 789'
>>> mo = re.search(r'(?P<foo>def) \d+', s)
>>> print(mo.group())
def 456
>>> print(mo.group('foo'))
def
>>> # Now require the text captured in 'foo' to appear again
>>> mo = re.search(r'(?P<foo>def) \d+ (?P=foo)', s)
>>> print(mo.group())
def 456 def
>>> print(mo.group('foo'))
def
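
Groups without a name can be re-used in the same way by number: \1 refers to the first captured group, \2 to the second, and so on. A minimal sketch, using a made-up sentence, that finds words typed twice in a row:

```python
import re

s = 'This is is a test of of the system'

# (\w+) captures a word; \1 requires that same word to follow it.
# The \b anchors stop the pattern from matching inside longer words.
repeated = re.findall(r'\b(\w+) \1\b', s)
print(repeated)
```

Because the pattern contains one group, findall() returns the text of that group for each match rather than the whole matched sub-string.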

Python Regular Expression Assertions

Sometimes we want to match something only if it is followed by something else, which means that Python needs to peek ahead as it is searching the string. This is called a look-ahead assertion and the syntax is (?=...), where … is a regular expression for what needs to follow.

In the example below, the regular expression ham(?= and eggs) means match ‘ham’ but only if it is followed by ‘ and eggs’.

>>> s = 'John likes ham and eggs.'
>>> mo = re.search(r'ham(?= and eggs)', s)
>>> print(mo.group())
ham

Note that the matched sub-string is only ham, and not ham and eggs. The and eggs part is simply a requirement for the ham part to be matched. Let’s see what happens if this requirement is not met.

>>> s = 'John likes ham and mushrooms.'
>>> mo = re.search(r'ham(?= and eggs)', s)
>>> print(mo)
None
>>> s = 'John likes ham, eggs and mushrooms.'
>>> mo = re.search(r'ham(?= and eggs)', s)
>>> print(mo)
None

Unfortunately, a look-ahead only does literal text matching: ham is matched only when it is immediately followed by the exact text ‘ and eggs’, so ‘ham, eggs and mushrooms’ does not count. Artificial intelligence and semantic analysis is a whole ‘nother article.
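
A look-ahead checks the text that follows without making it part of the match. The sketch below illustrates this with the ham and eggs string; the replacement word 'bacon' is only an illustration.

```python
import re

s = 'John likes ham and eggs.'
mo = re.search(r'ham(?= and eggs)', s)

# The match covers only 'ham'; the look-ahead text is left untouched
span = mo.span()
rest = s[mo.end():]
print(span)
print(rest)

# So a substitution replaces 'ham' and keeps what follows it
replaced = re.sub(r'ham(?= and eggs)', 'bacon', s)
print(replaced)
```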

We can also do negative look-ahead assertions, that is, an element matches only if it is not followed by something else.

>>> s = 'My name is John Doe.'
>>> # Syntax is (?!...)
>>> mo = re.search(r'John(?! Doe)', s)
>>> print(mo)
None
>>> s = 'My name is John Jones.'
>>> mo = re.search(r'John(?! Doe)', s)
>>> print(mo.group())
John
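
Negative look-aheads are handy for filtering when a string contains several candidates. As a small sketch with an invented list of names, this collects the surname of every John except John Doe:

```python
import re

s = 'John Doe, John Jones and John Smith'

# 'John' must not be followed by ' Doe'; the surname is then captured
others = re.findall(r'John(?! Doe) (\w+)', s)
print(others)
```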
