In the first part of this series, we looked at the basic syntax of regular expressions and some simple examples. In this part, we’ll take a look at some more advanced syntax and a few of the other features Python has to offer.
Regular Expression Captured Groups
So far, we’ve searched within a string using a regular expression and used the returned MatchObject to extract the entire sub-string that was matched. Now we’ll look at how we can extract parts within the sub-string that was matched.
This regular expression:
\d{2}-\d{2}-\d{4}
Will match a date with the following format:
- A 2-digit day.
- A hyphen.
- A 2-digit month.
- A hyphen.
- A 4-digit year.
For example:
>>> import re
>>> s = 'Today is 31-05-2012'
>>> mo = re.search(r'\d{2}-\d{2}-\d{4}', s)
>>> print(mo.group())
31-05-2012
We can capture various parts of this regular expression by putting them in parentheses:
(\d{2})-(\d{2})-(\d{4})
If Python matches this regular expression, we can then retrieve each captured group separately.
>>> mo = re.search(r'(\d{2})-(\d{2})-(\d{4})', s)
>>> # Note: The entire matched string is still available
>>> print(mo.group())
31-05-2012
>>> # The first captured group is the day
>>> print(mo.group(1))
31
>>> # And this is its start/end position in the string
>>> print('%s %s' % (mo.start(1), mo.end(1)))
9 11
>>> # The second captured group is the month
>>> print(mo.group(2))
05
>>> # The third captured group is the year
>>> print(mo.group(3))
2012
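Once the parts are captured separately, it is easy to do something useful with them. As a rough sketch, the snippet below grabs all three groups at once with groups() and builds a datetime.date from them. The helper name parse_date is hypothetical and exists only for this illustration.

```python
import re
from datetime import date

# Same pattern as above: day, month and year each in their own group.
DATE_RE = re.compile(r'(\d{2})-(\d{2})-(\d{4})')

def parse_date(text):
    """Return the first DD-MM-YYYY date found in text, or None."""
    mo = DATE_RE.search(text)
    if mo is None:
        return None
    # groups() returns every captured group as a tuple of strings.
    day, month, year = (int(g) for g in mo.groups())
    return date(year, month, day)

print(parse_date('Today is 31-05-2012'))  # 2012-05-31
```

Note that the regular expression only checks the shape of the text, so something like 99-99-2012 would still match and date() would then raise a ValueError.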
When you start writing more complex regular expressions, with lots of captured groups, it can be useful to refer to them by a meaningful name rather than a number. The syntax is (?P<name>...), where ... is the regular expression to be captured, and name is the name you want to give to the group.
>>> s = "Joe's ID: abc123"
>>> # A normal captured group
>>> mo = re.search(r'ID: (.+)', s)
>>> print(mo.group(1))
abc123
>>> # A named captured group
>>> mo = re.search(r'ID: (?P<id>.+)', s)
>>> print(mo.group('id'))
abc123
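Named groups become more convenient when there are several of them. The sketch below adds a second, hypothetical group called name to the pattern and uses groupdict() to get all the named groups back as a dictionary.

```python
import re

s = "Joe's ID: abc123"

# Two named groups: the person's name and their ID.
mo = re.search(r"(?P<name>\w+)'s ID: (?P<id>\w+)", s)

# groupdict() maps each group name to the text it captured.
fields = mo.groupdict()
print(fields['name'])  # Joe
print(fields['id'])    # abc123

# Named groups are still numbered, so this works too.
print(mo.group(2))     # abc123
```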
Re-using Captured Groups with Regular Expressions
We can also take captured groups and re-use them later in the regular expression! (?P=name) means match whatever was previously matched in the named group. For example:
>>> s = 'abc 123 def 456 def 789'
>>> # Capture 'def' in a group
>>> mo = re.search(r'(?P<foo>def) \d+', s)
>>> print(mo.group())
def 456
>>> print(mo.group('foo'))
def
>>> # Re-use the captured group later in the pattern
>>> mo = re.search(r'(?P<foo>def) \d+ (?P=foo)', s)
>>> print(mo.group())
def 456 def
>>> print(mo.group('foo'))
def
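Re-using a group is most useful when we don't know in advance what it will capture. As a small sketch, here is one way to spot an accidentally repeated word. The sample sentence and the group name word are made up for this example; the second search shows the equivalent numbered form, \1, which refers to the first captured group.

```python
import re

s = 'This sentence has has a repeated word.'

# Capture any word, then require the very same text again after a space.
mo = re.search(r'\b(?P<word>\w+) (?P=word)\b', s)
print(mo.group())        # has has
print(mo.group('word'))  # has

# The same idea with a numbered group and a numbered back-reference.
mo_numbered = re.search(r'\b(\w+) \1\b', s)
print(mo_numbered.group())  # has has
```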
Python Regular Expression Assertions
Sometimes we want to match something only if it is followed by something else, which means that Python needs to peek ahead as it is searching the string. This is called a look-ahead assertion and the syntax is (?=...), where ... is a regular expression for what needs to follow.
In the example below, the regular expression ham(?= and eggs) means match ‘ham’ but only if it is followed by ‘ and eggs’.
>>> s = 'John likes ham and eggs.'
>>> mo = re.search(r'ham(?= and eggs)', s)
>>> print(mo.group())
ham
Note that the matched sub-string is only ‘ham’, and not ‘ham and eggs’. The ‘ and eggs’ part is simply a requirement for the ‘ham’ part to be matched. Let’s see what happens if this requirement is not met.
>>> s = 'John likes ham and mushrooms.'
>>> mo = re.search(r'ham(?= and eggs)', s)
>>> print(mo)
None
>>> s = 'John likes ham, eggs and mushrooms.'
>>> mo = re.search(r'ham(?= and eggs)', s)
>>> print(mo)
None
Unfortunately, Python only does simple character matching: it will match the string ‘ham’ only when it is immediately followed by ‘ and eggs’. Artificial intelligence and semantic analysis is a whole ‘nother article.
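A look-ahead checks the text that follows but does not consume it. The short sketch below makes that visible by comparing the span of the look-ahead match with the span of a plain match on the same string; the variable name plain is just for this illustration.

```python
import re

s = 'John likes ham and eggs.'

# With a look-ahead, the match ends right after 'ham'.
mo = re.search(r'ham(?= and eggs)', s)
print(mo.span())       # (11, 14)

# The text that was peeked at is still there, unconsumed.
print(s[mo.end():])    # ' and eggs.' (with the leading space)

# Without the look-ahead, ' and eggs' becomes part of the match.
plain = re.search(r'ham and eggs', s)
print(plain.span())    # (11, 23)
```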
We can also do negative look-ahead assertions, that is, an element matches only if it is not followed by something else.
>>> s = 'My name is John Doe.'
>>> # Syntax is (?!...)
>>> mo = re.search(r'John(?! Doe)', s)
>>> print(mo)
None
>>> s = 'My name is John Jones.'
>>> mo = re.search(r'John(?! Doe)', s)
>>> print(mo.group())
John
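Negative look-aheads are handy for filtering. As a quick sketch, using a made-up string that contains several Johns, we can ask re.findall() for every John except John Doe:

```python
import re

s = 'John Doe, John Jones and John Smith'

# Match 'John' plus the following word, unless that word starts with 'Doe'.
names = re.findall(r'John(?! Doe) \w+', s)
print(names)  # ['John Jones', 'John Smith']
```

Because this is still plain character matching, (?! Doe) would also reject a surname that merely starts with those letters, such as ‘Doerr’. Writing the assertion as (?! Doe\b) limits it to the whole word.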