Python Programming - Pattern Matching


Regular expressions, called regexes for short, are descriptions of a pattern of text. A regular expression in a programming language is a special text string used for describing a search pattern. Regexes are extremely useful for extracting information from text such as code, files, logs, spreadsheets, or even documents. They are used on the server side to validate the format of email addresses or passwords during registration, and for parsing text data files to find, replace, or delete certain strings. They are also used for web page “scraping” (extracting large amounts of data from websites).

In Python, regular expressions (abbreviated RE, REs, regexes, or regex patterns) are available through the re module. The re module is included with Python and is used primarily for string searching and manipulation. To use the search() function, you need to import re first and then execute the code.
>>> import re

re attribute

The re attribute of a match object returns the regular expression object that produced the match.

string attribute

The string attribute returns the string that was passed to the matching function.
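
The two attributes can be inspected on any match object. The following is a minimal sketch; the sample string and the variable names are made up for illustration.

```python
import re

# Search for a run of digits in an illustrative sample string.
m = re.search(r"\d+", "Order 66 shipped")

# m.re is the compiled pattern object that produced the match;
# its .pattern attribute holds the original pattern text.
pattern_text = m.re.pattern

# m.string is the string that was passed to search().
searched_text = m.string

print(pattern_text)
print(searched_text)
```

Both attributes are read-only views of what went into the search, which can be handy when a match object is passed around without its original inputs.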

Using r prefix before RegEx

When the r or R prefix is used before a regular expression, it means raw string. For example, '\n' is a newline, whereas r'\n' means two characters: a backslash \ followed by n. The backslash \ is used to escape various characters, including all metacharacters. With the r prefix, however, Python treats \ as a normal character and passes it through to the regex engine unchanged.
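
A short sketch of the difference, assuming a made-up Windows-style path as the text to search:

```python
import re

plain = "\n"     # one character: a newline
raw = r"\n"      # two characters: a backslash and the letter n

print(len(plain), len(raw))

# To match a literal backslash, the regex itself needs two backslashes,
# which is easiest to write as a raw string.
path = r"C:\temp"                     # illustrative sample text
found = re.search(r"\\temp", path)
print(found.group())
```

Without the r prefix, the same pattern would have to be written as "\\\\temp", which is why raw strings are the usual choice for regex patterns.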

Metacharacters

Each character in a Python Regex is either a meta-character or a regular character. Metacharacters are characters with a special meaning, while a regular character matches itself. Python has the following metacharacters as listed in Table 10.3.

Metacharacter     Description
^                 Matches a pattern at the start of the string.
$                 Matches the end of the string.
.                 Matches any single character except a newline.
[ ]               A bracket expression matches a single character from the ones inside it.
[^ ]              Matches a single character other than the ones listed in the brackets.
( )               Parentheses define a marked subexpression, also called a block or a capturing group.
\t, \n, \r, \f    Tab, newline, return, form feed.
*                 Matches zero or more repetitions of the expression to its left.
{m,n}             Matches the preceding expression at least m and at most n times.
{m}               Matches the preceding expression exactly m times.
?                 Matches zero or one occurrence of the expression to its left.
+                 Matches one or more repetitions of the expression to its left.
|                 The choice operator matches either the expression before it or the one after it.

Do not confuse metacharacters with shell-style wildcards, where ? matches any single character and * matches any series of characters. With wildcards, for instance:
a?a matches aaa, aba, aca but not aa or abba.
a*a matches aaa, aba, aca, aa or abba.
In a regular expression, the single-character wildcard is the dot, while ? and * are quantifiers that apply to the expression on their left:
a.a matches aaa, aba, aca, etc.
a..a matches aaaa, aaba, abba, etc.
[abc] matches a or b or c. You can also give a range:
[a-d] means a, b, c, or d.
[^abc] matches any character except a, b, or c.
[a-zA-Z0-9] matches any letter (a to z or A to Z) or digit (0 to 9).
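
As a rough check of the dot and bracket expressions listed above, the sketch below tests whole strings with re.fullmatch(). The helper name fits is hypothetical and exists only for this illustration.

```python
import re

def fits(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"a.a", "aba"))                  # one dot stands for exactly one character
print(fits(r"a.a", "abba"))                 # two middle characters are too many for one dot
print(fits(r"a..a", "abba"))                # two dots, two characters
print(fits(r"[a-d]", "c"))                  # c lies in the range a to d
print(fits(r"[^abc]", "a"))                 # a is excluded by the negated class
print(fits(r"[a-zA-Z0-9]+", "Regex101"))    # letters and digits only
```

re.fullmatch() is used here instead of re.search() so that each result reflects the entire string rather than any substring of it.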

Example
Demo of . metacharacter.

# Demo of . metacharacter
import re
str1 = "Hello World"
# Five dots match any five consecutive characters
if re.search(r".....", str1):
    print(str1 + " has length >= 5")
else:
    print(str1 + " has length < 5")
RUN
>>>
Hello World has length >= 5
>>>

Example
Demo of . metacharacter

# Demo of . metacharacter
import re
str1 = "Cat"
# Five dots cannot match a three-character string
if re.search(r".....", str1):
    print(str1 + " has length >= 5")
else:
    print(str1 + " has length < 5")
RUN
>>>
Cat has length < 5
>>>

Example
Demo of [.]

# Demo of [.]
import re
str1 = "Hello, Python."
# Four arbitrary characters, then a literal dot at the end of the string
if re.search(r"....[.]$", str1):
    print(str1 + " has length >= 5 and ends with a .")
RUN
>>>
Hello, Python. has length >= 5 and ends with a .
>>>
  • The | character is called a pipe. You can use it anywhere you want to match one of many expressions.
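
A minimal sketch of the pipe in use, with a made-up sample sentence:

```python
import re

text = "I have a cat and a dog"   # illustrative sample text

# search() returns the first alternative that occurs in the string.
first = re.search(r"cat|dog", text)

# findall() returns every occurrence of either alternative.
every = re.findall(r"cat|dog", text)

print(first.group())
print(every)
```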

Special Sequences

If the character following the backslash is a recognized escape character, the sequence takes on its special meaning. For example, \n is treated as a newline. However, if the character following the \ is not a recognized escape character, the \ is treated like any other character. A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

Character   Description
\A          (start of string) Returns a match if the specified characters are at the beginning of the string. Unlike ^, it matches only at the very start of the string, even in MULTILINE mode.
\b          (empty string at the beginning or end of a word) Returns a match where the specified characters are at the beginning or at the end of a word.
\B          (empty string not at the beginning or end of a word) Returns a match where the specified characters are present, but not at the beginning (or at the end) of a word.
\d          (a digit) Returns a match where the string contains digits (numbers from 0-9).
\D          (a non-digit) Returns a match where the string contains a character that is not a digit.
\s          (whitespace) Returns a match where the string contains a single whitespace character such as space, newline, tab, or return.
\S          (non-whitespace) Returns a match where the string contains a character that is not whitespace.
\w          (alphanumeric) Returns a match where the string contains any word character (any single letter, digit, or underscore).
\W          (non-alphanumeric) Returns a match where the string contains a character that is not a word character.
\Z          (end of string) Returns a match if the specified characters are at the end of the string.
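
The sketch below tries a few of these special sequences on a made-up sample string; the variable names are illustrative only.

```python
import re

text = "Room 42, floor 7"   # illustrative sample text

digits = re.findall(r"\d+", text)               # runs of digits
words = re.findall(r"\w+", text)                # runs of word characters
pieces = re.split(r"\s+", text)                 # split on runs of whitespace
whole_word = re.search(r"\bfloor\b", text)      # 'floor' as a whole word
inside_word = re.search(r"\Boo\B", text)        # 'oo' strictly inside a word

print(digits)
print(words)
print(pieces)
print(whole_word.span(), inside_word.span())
```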

Rules for a Match

The basic rules of regular expression search for a pattern within a string are:
(a) The search proceeds through the string from start to end.
(b) All of the pattern must be matched, but not all of the string.
(c) The search stops at the first match.
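
These rules can be observed with re.search(). The following is a small sketch on a made-up string:

```python
import re

text = "one 11 two 22"   # illustrative sample text

# Rules (a) and (c): scanning runs left to right and stops at the first match,
# so the second number is never reported.
first = re.search(r"\d+", text)
print(first.group(), first.span())

# Rule (b): the whole pattern must match, but it may cover only part of the string.
partial = re.search(r"two", text)
missing = re.search(r"two 33", text)   # the tail of the pattern fails, so the result is None
print(partial.span(), missing)
```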

Python Regular Expression Functions

All the regex functions in Python are in the re module. Regular expression functions include re.match(), re.search(), and re.findall(). Many Python regex functions take an optional argument called flags.

match()

The match() function returns a match object if the text matches the pattern. Otherwise it returns None. The match() function looks for a pattern at the beginning of a string:

>>> import re
>>> re.match(r'foot', 'football')  # Match
<re.Match object; span=(0, 4), match='foot'>
>>> re.match(r'ball', 'football')  # No match

search()

The search() function checks for a match anywhere in the string. If there is more than one match, only the first occurrence is returned.

>>> re.search(r'foot', 'football')
<re.Match object; span=(0, 4), match='foot'>
>>> re.search(r'ball', 'football')
<re.Match object; span=(4, 8), match='ball'>

The behavior of regular expression matching can be subtly modified by using flags. For instance, by specifying the re.IGNORECASE flag (or simply re.I), you make your regular expressions blind to the difference between lowercase and uppercase characters:

>>> re.search(r'hello', 'HELLO', flags=re.I)
<re.Match object; span=(0, 5), match='HELLO'>

Example
Demo of search().

import re
str1 = "Hello, world."
# l+ matches one or more consecutive occurrences of the letter "l"
if re.search(r"l+", str1):
    print('There are one or more consecutive letter "l"' +
          "'s in " + str1)
RUN
>>>
There are one or more consecutive letter "l"'s in Hello, world.
>>>

Compilation Flags

You can change the way the matching engine processes an expression using option flags. Compilation flags let you modify some aspects of how regular expressions work. The flags can be combined using a bitwise OR operation and passed to compile(), search(), match(), and other functions that accept a pattern for searching. Flags are available in the re module under two names: a long name such as IGNORECASE, and a short, one-letter form such as I. Flags used in Python include the following.

IGNORECASE [re.I] performs case-insensitive matching; character classes and literal strings will match letters by ignoring case.

MULTILINE [re.M] controls how the pattern matching code processes anchoring instructions for text containing newline characters. Within a string made of many lines, it allows ^ and $ to match at the start and end of each line.

DOTALL [re.S] is the other flag related to multiline text. Normally the dot character . matches everything in the input text except a newline character. The flag allows the dot to match newlines as well.
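
The effect of these flags can be seen on a two-line string. This is a minimal sketch with made-up sample text:

```python
import re

text = "first line\nSecond line"   # illustrative two-line sample

# Without MULTILINE, ^ anchors only at the start of the whole string.
plain = re.findall(r"^\w+", text)
# With MULTILINE, ^ also anchors right after each newline.
per_line = re.findall(r"^\w+", text, flags=re.M)

# Without DOTALL the dot stops at the newline; with it, the dot crosses lines.
one_line = re.search(r"first.*", text).group()
all_lines = re.search(r"first.*", text, flags=re.S).group()

# Flags combine with the bitwise OR operator.
combined = re.findall(r"^second", text, flags=re.I | re.M)

print(plain, per_line)
print(repr(one_line), repr(all_lines))
print(combined)
```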

findall()

Both match() and search() return the first match. If you want all matches, you need to use the findall() function.
>>> re.findall(r't.o', 'two cats too cute')
['two', 'too']
Sometimes you would like to access specific parts of the matched string; this can be achieved with groups. To create a group, put the interesting part of a pattern in parentheses. For instance, you can use three groups to match the day, month, and year parts of a date:
>>> import re
>>> re.search(r'\b(\d{2})-(\d{2})-(\d{4})\b', '17-06-2019')
<re.Match object; span=(0, 10), match='17-06-2019'>
To access the groups, you can query the match object returned by the search function:
>>> mo = re.search(r'\b(\d{2})-(\d{2})-(\d{4})\b', '17-06-2019')
>>> mo.group()
'17-06-2019'
>>> mo.group(0)
'17-06-2019'
>>> mo.group(1)
'17'
>>> mo.group(2)
'06'
>>> mo.group(3)
'2019'
>>> mo.groups()
('17', '06', '2019')
Groups can also be used with findall():
>>> re.findall(r'w(..)k', 'week weak')
['ee', 'ea']
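
When a pattern has more than one group, findall() returns a list of tuples, one tuple per match. The sketch below reuses the date pattern on a made-up sentence:

```python
import re

text = "Start 17-06-2019, end 01-07-2019"   # illustrative sample text

# Three groups per match, so each result is a (day, month, year) tuple.
dates = re.findall(r"\b(\d{2})-(\d{2})-(\d{4})\b", text)
print(dates)

# Tuples unpack naturally in a loop.
for day, month, year in dates:
    print(year, month, day)
```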

split()

The split() function splits the string wherever there is a match and returns a list of the resulting strings.
>>> re.split(r'[,:]', 'red,blue:green')
['red', 'blue', 'green']
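
Two common variations, sketched on made-up input: allowing optional whitespace after each delimiter, and limiting the number of splits with the maxsplit argument.

```python
import re

# Swallow any whitespace that follows a delimiter.
parts = re.split(r"[,:]\s*", "red, blue:green")

# maxsplit=1 stops after the first split; the rest of the string stays intact.
limited = re.split(r"[,:]", "red,blue:green", maxsplit=1)

print(parts)
print(limited)
```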

sub()

You can use the sub() function to substitute part of a string with another. The function returns a string in which the matched occurrences are replaced with the replacement text. sub() takes three arguments: the pattern, the replacement, and the string. For example,
>>> re.sub(r'^a', 'an', 'a apple')
'an apple'
Here, the ^ anchors the pattern to the start of the string, so the a in apple is left alone and apple does not become anpple.
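
To see what the anchor prevents, the sketch below compares the anchored pattern with an unanchored one, and also shows the optional count argument as another way to limit replacements:

```python
import re

anchored = re.sub(r"^a", "an", "a apple")         # only the a at the start is replaced
unanchored = re.sub(r"a", "an", "a apple")        # every a is replaced
limited = re.sub(r"a", "an", "a apple", count=1)  # at most one replacement

print(anchored)
print(unanchored)
print(limited)
```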

subn()

The subn() function is similar to sub() except that it returns both the modified string and the number of substitutions made. For example,

Example
Demo of subn().

import re
# string to process
text = 'abc 123 xyz'
# matches each run of whitespace characters
pattern = r'\s+'
# empty replacement string
replace = ''
# subn() returns a (new_string, number_of_substitutions) tuple
new_str = re.subn(pattern, replace, text)
print(new_str)
RUN
>>>
('abc123xyz', 2)
>>>

Compiling Regular Expressions

Regular expressions are compiled into pattern objects, which then have methods for various operations such as searching for pattern matches or performing string substitutions. You can learn about this by experimenting interactively with the re module: enter REs and strings and see whether the RE matches or fails. First, run the Python interpreter, import the re module, and compile an RE.
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')
Now, you can try matching various strings against the RE '[a-z]+'. An empty string should not match at all, since + means 'one or more repetitions'. match() should return None in this case, which will cause the interpreter to print no output. You can explicitly print the result of match() to make this clear.
>>> p.match("")
>>> print(p.match(""))
None
Now, let's try it on a string that it should match, such as "hello". In this case, match() will return a match object, so you should store the result in a variable for later use.
>>> m = p.match('hello')
>>> print(m)
<re.Match object; span=(0, 5), match='hello'>
Now you can query the match object for information about the matching string. Match objects also have several methods and attributes; the most important ones are:

group()

The group() method returns the part of the string where there is a match. If no match was found, match() returns None instead of a match object, so there is nothing to call group() on.
>>> m.group()
'hello'

start()

The start() method returns the index of the start of the matched substring.

end()

The end() method returns the end index of the matched substring.
>>> m.start(), m.end()
(0, 5)

span()

The span() method returns a tuple containing the start and end index of the matched part.
>>> m.span()
(0, 5)
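
Putting the pieces together, the sketch below compiles a pattern once, reuses it on two made-up strings, and checks for None before querying the match object. The variable names are illustrative only.

```python
import re

word = re.compile(r"[a-z]+")   # compile once, reuse many times

m = word.search("123 hello 456")
if m is not None:
    # All four methods describe the same matched substring.
    print(m.group(), m.start(), m.end(), m.span())

# When nothing matches, search() returns None rather than a match object.
no_match = word.search("123 456")
print(no_match)
```

Checking for None first avoids an AttributeError from calling group() on a failed match.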

Greedy and Non-greedy

When a quantifier matches as much of the search string as possible, it is said to be a “greedy match”. A non-greedy match takes the smallest number of repetitions.
By default the + and * operators are greedy, i.e. they will match as much as possible. For instance:
>>> re.search(r'\(.*\)', '(foot) (football)')
<re.Match object; span=(0, 17), match='(foot) (football)'>
will not stop at the first closing parenthesis and will match the whole string '(foot) (football)'.
Sometimes, however, you want your regular expression to be non-greedy and to match as little as possible. The non-greedy counterparts of + and * are +? and *?, respectively.
>>> re.search(r'\(.*?\)', '(foot) (football)')
<re.Match object; span=(0, 6), match='(foot)'>
will only match '(foot)'.
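
The difference is easiest to see with findall(), which collects every match. This is a minimal sketch using the same sample string:

```python
import re

text = "(foot) (football)"

# Greedy: .* runs to the last closing parenthesis, giving a single match.
greedy = re.findall(r"\(.*\)", text)

# Non-greedy: .*? stops at the nearest closing parenthesis, giving one match per pair.
lazy = re.findall(r"\(.*?\)", text)

print(greedy)
print(lazy)
```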
