Learning Python: Programming and Data Structures - Tutorial 13 - Regular Expression Matching


We'll take a quick look at regular expressions in Python.

1. We will need to 'import re' first.
2. We will create a test string and in that test string, we will try to search for a word starting with the letter 'H'.

Searching for a pattern in a string


Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> testString1 = "Time to say Hello World"
>>> match_object = re.search(r"(H....)",testString1)
>>> print match_object.group(1)

Output:
Hello
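
The pattern H.... matches an 'H' followed by any four characters, so it only finds five-character strings. As a minimal sketch, here is a more general variant, assuming that a "word" is a run of \w characters. The helper name find_word_starting_with is hypothetical, and print is written as a function so that the snippet also runs on Python 3.

```python
import re

def find_word_starting_with(letter, text):
    """Return the first word in text that starts with letter, or None."""
    # \b anchors the match at a word boundary; \w* takes the rest of the word.
    pattern = r"\b(" + re.escape(letter) + r"\w*)"
    match_object = re.search(pattern, text)
    # re.search returns None when nothing matches, so check before calling group().
    if match_object is None:
        return None
    return match_object.group(1)

print(find_word_starting_with("H", "Time to say Hello World"))   # Hello
print(find_word_starting_with("Z", "Time to say Hello World"))   # None
```

Checking for None matters because calling group() on a failed search raises an AttributeError.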


Searching for Multiple Instances of a Pattern in a String


Let us try to look for all the numbers which occur in this string: "This is a string with numbers in it: 1, 100, 2, 3, 4,99"
For finding all instances of a pattern in a string, we use 'findall'.

>>> testString2 = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"
>>> numbersInTheString = re.findall(r"(?:^|\b)([0-9]+)(?:\b)",testString2)
>>> for matchedNumber in numbersInTheString: print matchedNumber
...

Output:
1
100
2
3
4
99
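
findall returns the matches as a list of strings, so a common next step is to convert them. The sketch below is one way to do that. It uses the shorter pattern \b([0-9]+)\b, since \b already matches at the start of the string when the first character is a digit. The variable names are our own, and print is written as a function so that the snippet also runs on Python 3.

```python
import re

test_string = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"

# With exactly one capturing group, findall returns that group's text for each match.
numbers_as_text = re.findall(r"\b([0-9]+)\b", test_string)
numbers = [int(n) for n in numbers_as_text]
print(numbers)        # [1, 100, 2, 3, 4, 99]
print(sum(numbers))   # 209

# finditer yields match objects instead, which also carry the position of each match.
positions = [(m.group(1), m.start()) for m in re.finditer(r"\b([0-9]+)\b", test_string)]
print(positions[0])
```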


Replacing a pattern in a String, with a Substring

This can be done using "sub".

>>> replacedString = re.sub(r"(?:^|\b)s[a-z]+","word_starting_with_s",testString2)
>>> print replacedString

Output:
This is a word_starting_with_s with numbers in it: 1, 100, 2, 3, 4,99
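
Two other features of sub are often useful: backreferences in the replacement text and the optional count argument. The following is a small sketch on the same test string. The variable names are our own, and print is written as a function so that the snippet also runs on Python 3.

```python
import re

test_string = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"

# A backreference (\1) in the replacement re-inserts what group 1 captured.
bracketed = re.sub(r"\b([0-9]+)\b", r"<\1>", test_string)
print(bracketed)
# This is a string with numbers in it: <1>, <100>, <2>, <3>, <4>,<99>

# The optional count argument limits how many matches are replaced.
first_only = re.sub(r"\b([0-9]+)\b", "N", test_string, count=1)
print(first_only)
# This is a string with numbers in it: N, 100, 2, 3, 4,99
```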



Example: Detecting an Email Address in a chunk of text

>>> testString = "This is a string with an [email protected] embedded in it."
>>> matchObj = re.search(r"(?:\b)([^\s]*\@[^\s]*\.[a-zA-Z]{2,4})(?:\b)",testString)
>>> print matchObj.group(1)

Output:

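To see what the pattern above captures, it can be tried on a made-up address. In this sketch, user@example.com is a hypothetical placeholder and not the address from the test string above. print is written as a function so that the snippet also runs on Python 3.

```python
import re

# user@example.com is a hypothetical placeholder address.
test_string = "This is a string with an user@example.com embedded in it."

# Same pattern as above: non-space characters, '@', non-space characters,
# a dot, and a top-level domain of 2 to 4 letters.
match_obj = re.search(r"(?:\b)([^\s]*\@[^\s]*\.[a-zA-Z]{2,4})(?:\b)", test_string)
if match_obj is not None:
    print(match_obj.group(1))   # user@example.com
```

This pattern is deliberately loose: it accepts any non-space characters around the '@', so it is a quick detector rather than a validator.
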
Example: Detecting the domain name in a given Url


>>> testString="http://www.thehindu.com/features/education/issues/engineering-students-locked-into-microsoft-office/article4640546.ece"
>>> matchObj = re.search(r"(?:https?\:\/\/)?([^\/]*)",testString)
>>> print matchObj.group(1)

Output:
www.thehindu.com
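
The same pattern can be wrapped in a small function. This is only a sketch: the name extract_domain is hypothetical, the backslashes before ':' and '/' are dropped because those characters need no escaping, and print is written as a function so that the snippet also runs on Python 3.

```python
import re

def extract_domain(url):
    """Return the part of url between an optional http(s):// prefix and the first '/'."""
    # (?:https?://)? optionally consumes the scheme; ([^/]*) captures up to the next slash.
    match_obj = re.search(r"(?:https?://)?([^/]*)", url)
    return match_obj.group(1)

print(extract_domain("http://www.thehindu.com/features/education/issues"))   # www.thehindu.com
print(extract_domain("https://www.thehindu.com"))                            # www.thehindu.com
print(extract_domain("www.thehindu.com/features"))                           # www.thehindu.com
```

Because both parts of the pattern may match an empty string, re.search always succeeds here, so no None check is needed.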


Important Building Blocks of Regular Expressions and Examples


 Symbol / Character Class      Explanation and Example

 .     Any character except a newline.

>>> matchObj = re.search(r"(..o)","Hello")
>>> print matchObj.group(1)

Output:
llo


Explanation: Regex picks up a group which ends with an 'o' and has any two characters before that 'o'.

 *     Zero or more occurrences of the preceding character or group.

>>> matchObj = re.search(r"e(l*)","Hello")
>>> print matchObj.group(1)

Output:
ll


Explanation: Picks up the two consecutive 'l'.

 +     One or more occurrences of the preceding character or group.

>>> matchObj = re.search(r"(l+)","Hello")
>>> print matchObj.group(1)

Output:
ll

 ?     Zero or one occurrence of the preceding character or group. When used after a * or a +, it makes the match lazy: the regex matches as few characters as possible.

Example: detecting the domain. We'd like to handle URLs starting with either https: or http:.

>>> testString="http://www.thehindu.com/features/education/issues"
>>> matchObj = re.search(r"(?:https?\:\/\/)?([^\/]*)",testString)
>>> print matchObj.group(1)

Output:
www.thehindu.com
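
The difference between a greedy and a lazy match is easiest to see side by side. Here is a minimal sketch; the sample text is made up for illustration, and print is written as a function so that the snippet also runs on Python 3.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<(.*)>", text).group(1)
print(greedy)   # b>bold</b> and <i>italic</i

# Lazy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r"<(.*?)>", text).group(1)
print(lazy)     # b

# On its own, ? makes the preceding character optional.
optional_s = re.findall(r"https?", "http and https")
print(optional_s)   # ['http', 'https']
```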

 [a-z] or [A-Z]     Matches any single character in the range a to z or A to Z, inclusive.

>>> matchObj = re.search(r"([a-z]+)","Hello, world")
>>> print matchObj.group(1)

Output:
ello

 [0-9]     Matches any single digit from 0 to 9, inclusive.

>>> testString2 = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"
>>> numbersInTheString = re.findall(r"(?:^|\b)([0-9]+)(?:\b)",testString2)
>>> for matchedNumber in numbersInTheString: print matchedNumber
...

Output:
1
100
2
3
4
99

 \W     Matches any non-alphanumeric character. The underscore counts as a word character, so \W does not match it.

>>> testString = "Convert all non-alphanumeric characters in this string to blanks."
>>> print re.sub(r"[\W]","*",testString)

Output:
Convert*all*non*alphanumeric*characters*in*this*string*to*blanks*

 \w     Matches any alphanumeric character, including the underscore (_).

>>> testString = "A 100 men"
>>> print re.sub(r"[\w]","*",testString)

Output:
* *** ***
 \D Matches a Non-Digit character
>>> testString = "A 100 men"
>>> print re.sub(r"[\D]","*",testString)

Output: All non-digit characters have been replaced by *
**100****
 \d      Matches a Digit Character
>>> testString = "A 100 men"
>>> print re.sub(r"[\d]","*",testString)

Output: All digits have been replaced by *
A *** men
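
 $         End of the string, or just before a newline at the end of the string. With re.MULTILINE, also the end of each line.
 \Z        End of the string only.
 \s        Any whitespace character (space, tab, newline and so on).
 ^         Beginning of the string. With re.MULTILINE, also the beginning of each line.
 {m,n}     Between m and n occurrences of the preceding character or group.
 [^...]    Matches any single character other than the ones inside the square brackets.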
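
The last few building blocks can be tried out in the same way as the earlier ones. The sketch below uses sample strings of our own choosing, with print written as a function so that the snippet also runs on Python 3.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
print(bool(re.search(r"^Hello", "Hello World")))   # True
print(bool(re.search(r"World$", "Hello World")))   # True

# $ also matches just before a trailing newline; \Z matches only at the very end.
print(bool(re.search(r"World$", "Hello World\n")))    # True
print(bool(re.search(r"World\Z", "Hello World\n")))   # False

# \s matches whitespace; here every run of whitespace collapses to one space.
print(re.sub(r"\s+", " ", "A   100\tmen"))   # A 100 men

# {m,n}: between m and n repetitions of the preceding element.
print(re.findall(r"\b[0-9]{2,3}\b", "1, 100, 2, 3, 4,99, 12345"))   # ['100', '99']

# [^...]: any character not listed inside the square brackets.
print(re.sub(r"[^0-9]", "", "A 100 men"))   # 100
```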



