Learning Python: Programming and Data Structures- Tutorial 13- Regular Expression Matching

We'll take a quick look at regular expressions in Python.

1. We need to 'import re' first.
2. We will create a test string and search it for a five-character word starting with the letter 'H'. The pattern 'H....' matches an 'H' followed by any four characters.

Searching for a pattern in a string

Python 2.7.3 (default, Aug  1 2012, 05:16:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> testString1 = "Time to say Hello World"
>>> match_object = re.search(r"(H....)",testString1)
>>> print match_object.group(1)

Output:
Hello
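
One detail the session above does not show: re.search returns None when nothing matches, so calling group() on the result directly would raise an AttributeError. Below is a minimal sketch of guarding against that. The helper name first_word_starting_with is made up for illustration, the pattern uses \w* instead of four dots so the word can be any length, and print() is written with parentheses so the snippet should also run on Python 3.

```python
import re

def first_word_starting_with(letter, text):
    """Return the first word in text that starts with letter, or None."""
    # \b anchors at a word boundary; \w* takes the rest of the word,
    # so the word may be any length (unlike the fixed "H....").
    match_object = re.search(r"\b(" + re.escape(letter) + r"\w*)", text)
    if match_object is None:
        # No match: return None instead of calling group() on None.
        return None
    return match_object.group(1)

print(first_word_starting_with("H", "Time to say Hello World"))  # Hello
print(first_word_starting_with("Z", "Time to say Hello World"))  # None
```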

Searching for Multiple Instances of a Pattern in a String

Let us try to look for all the numbers which occur in this string: "This is a string with numbers in it: 1, 100, 2, 3, 4,99"
For finding all instances of a pattern in a string, we use 'findall'.

>>> testString2 = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"
>>> numbersInTheString = re.findall(r"(?:^|\b)([0-9]+)(?:\b)",testString2)
>>> for matchedNumber in numbersInTheString: print matchedNumber
...

Output:
1
100
2
3
4
99
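
findall returns the matches as strings. The sketch below converts them to integers and also uses finditer to recover where each number sits in the string. The pattern is shortened to \b([0-9]+)\b, which should behave the same here because \b already matches at the start of a string that begins with a word character. The variable names are illustrative only.

```python
import re

test_string = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"

# findall returns the captured text as a list of strings.
numbers_as_text = re.findall(r"\b([0-9]+)\b", test_string)

# Convert to int when arithmetic is needed.
numbers = [int(n) for n in numbers_as_text]
print(numbers)       # [1, 100, 2, 3, 4, 99]
print(sum(numbers))  # 209

# finditer yields match objects, which also carry the match position.
positions = [m.start(1) for m in re.finditer(r"\b([0-9]+)\b", test_string)]
print(positions)
```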

Replacing a pattern in a String, with a Substring

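This can be done using "sub", which returns a copy of the string with every match of the pattern replaced. Here we replace each word starting with 's'.

>>> replacedString = re.sub(r"(?:^|\b)s[a-z]+","word_starting_with_s",testString2)
>>> print replacedString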

Output:
This is a word_starting_with_s with numbers in it: 1, 100, 2, 3, 4,99
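
Two variations on sub that may be useful: the count argument limits how many matches are replaced, and the replacement can be a function that receives each match object. This is a small sketch on the same test string; the helper double_number is hypothetical and exists only for the example.

```python
import re

test_string = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"[0-9]+", "N", test_string, count=1)
print(first_only)

# The replacement may also be a function called once per match.
def double_number(match_object):
    return str(int(match_object.group(0)) * 2)

doubled = re.sub(r"[0-9]+", double_number, test_string)
print(doubled)
```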

Example: Detecting an Email Address in a chunk of text

>>> testString = "This is a string with an [email protected] embedded in it."
>>> matchObj = re.search(r"(?:\b)([^\s]*\@[^\s]*\.[a-zA-Z]{2,4})(?:\b)",testString)
>>> print matchObj.group(1)

Output:

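
To see the pattern above at work, here is a sketch that wraps it in a function and runs it on a made-up address. Both the helper name find_email and the address user@example.com are assumptions for illustration, not part of the original example. The pattern is a loose heuristic (some non-space characters, an '@', more non-space characters, a dot and a 2 to 4 letter suffix), not a full email validator.

```python
import re

# Same pattern as in the tutorial, kept in one place for reuse.
EMAIL_PATTERN = r"(?:\b)([^\s]*\@[^\s]*\.[a-zA-Z]{2,4})(?:\b)"

def find_email(text):
    """Return the first email-like token in text, or None if there is none."""
    match_object = re.search(EMAIL_PATTERN, text)
    if match_object is None:
        return None
    return match_object.group(1)

# Hypothetical sample text with a placeholder address.
sample = "This is a string with an user@example.com embedded in it."
print(find_email(sample))                # user@example.com
print(find_email("No address in here"))  # None
```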
Example: Detecting the domain name in a given URL

>>> testString="http://www.thehindu.com/features/education/issues/engineering-students-locked-into-microsoft-office/article4640546.ece"
>>> matchObj = re.search(r"(?:https?\:\/\/)?([^\/]*)",testString)
>>> print matchObj.group(1)

Output:
www.thehindu.com
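
Because the scheme part of the pattern is optional, the same regex should also handle URLs written without "http://". Here is a minimal sketch; the function name extract_domain and the example.org URLs are hypothetical. The slashes are left unescaped, which is equivalent in Python since '/' has no special meaning in a regex.

```python
import re

def extract_domain(url):
    """Return the text between an optional http(s):// prefix and the next '/'."""
    # The optional non-capturing group skips "http://" or "https://";
    # group 1 then takes everything up to the next "/".
    match_object = re.search(r"(?:https?://)?([^/]*)", url)
    return match_object.group(1)

print(extract_domain("http://www.thehindu.com/features/education/issues"))
print(extract_domain("https://example.org/a/b"))
print(extract_domain("example.org/a"))
```

For real-world URLs (ports, user info, other schemes), the standard library's URL parsing module is likely a safer choice than a hand-written regex.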

Important Building Blocks of Regular Expressions and Examples

Each entry below lists the symbol, character or character class, followed by its explanation and, where available, an example.

Symbol: .
Explanation: Any character.
Example:
>>> matchObj = re.search(r"(..o)","Hello")
>>> print matchObj.group(1)
Output:
llo
Explanation of the example: The regex picks up a group which ends with an 'o' and has any two characters before that 'o'.

Symbol: *
Explanation: Zero or more occurrences of the preceding character.
Example:
>>> matchObj = re.search(r"e(l*)","Hello")
>>> print matchObj.group(1)
Output:
ll
Explanation of the example: Picks up the two consecutive 'l' characters.

Symbol: +
Explanation: One or more occurrences of the preceding character.
Example:
>>> matchObj = re.search(r"(l+)","Hello")
>>> print matchObj.group(1)
Output:
ll

Symbol: ?
Explanation: Zero or one occurrence of the preceding character. If this is used after a * or a +, it makes the match lazy, so that it matches as few characters as possible while still fitting the regex.
Example: Detecting the domain. We'd like to consider URLs starting with https: or http:.
>>> testString="http://www.thehindu.com/features/education/issues"
>>> matchObj = re.search(r"(?:https?\:\/\/)?([^\/]*)",testString)
>>> print matchObj.group(1)
Output:
www.thehindu.com

Symbol: [a-z] or [A-Z]
Explanation: Matches any character between, and including, a to z or A to Z.
Example:
>>> matchObj = re.search(r"([a-z]+)","Hello, world")
>>> print matchObj.group(1)
Output:
ello

Symbol: [0-9]
Explanation: Matches any digit between, and including, 0 to 9.
Example:
>>> testString2 = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"
>>> numbersInTheString = re.findall(r"(?:^|\b)([0-9]+)(?:\b)",testString2)
>>> for matchedNumber in numbersInTheString: print matchedNumber
...
Output:
1
100
2
3
4
99

Symbol: \W
Explanation: Matches a non-alphanumeric character, excluding _.
Example:
>>> testString = "Convert all non-alphanumeric characters in this string to blanks."
>>> print re.sub(r"[\W]","*",testString)
Output:
Convert*all*non*alphanumeric*characters*in*this*string*to*blanks*

Symbol: \w
Explanation: Matches an alphanumeric character, including _.
Example:
>>> testString = "A 100 men"
>>> print re.sub(r"[\w]","*",testString)
Output:
* *** ***

Symbol: \D
Explanation: Matches a non-digit character.
Example:
>>> testString = "A 100 men"
>>> print re.sub(r"[\D]","*",testString)
Output: All non-digit characters have been replaced by *
**100****

Symbol: \d
Explanation: Matches a digit character.
Example:
>>> testString = "A 100 men"
>>> print re.sub(r"[\d]","*",testString)
Output: All digits have been replaced by *
A *** men

Symbol: $
Explanation: End of a string or line.

Symbol: \Z
Explanation: End of a string.

Symbol: \s
Explanation: Whitespace.

Symbol: ^
Explanation: Beginning of a string.

Symbol: {m,n}
Explanation: Between m and n occurrences of the preceding character.

Symbol: [^...]
Explanation: Matches every character other than the ones inside the square brackets.
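
The last few building blocks ($, \Z, \s, ^, {m,n} and [^...]) are listed without examples. The sketch below exercises each of them once. All strings and variable names are made up for illustration. re.MULTILINE is used to show the "or line" behaviour of ^ and $.

```python
import re

text = "line one\nline two"

# ^ anchors at the start of the string ...
start_default = re.findall(r"^line", text)
# ... and at the start of every line when re.MULTILINE is set.
start_multiline = re.findall(r"^line", text, re.MULTILINE)
# $ anchors at the end of each line in MULTILINE mode.
line_ends = re.findall(r"\w+$", text, re.MULTILINE)
# \Z only ever matches at the very end of the string.
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)

# \s matches whitespace (spaces, tabs, newlines).
tokens = re.split(r"\s+", "a  b\tc\nd")

# {m,n}: between m and n repetitions of the preceding item.
two_or_three_digits = re.findall(r"\b[0-9]{2,3}\b", "1, 100, 2, 3, 4,99, 12345")

# [^...]: any character other than the listed ones.
digits_only = re.sub(r"[^0-9]", "", "Phone: 555-0100")

print(start_default)        # ['line']
print(start_multiline)      # ['line', 'line']
print(line_ends)            # ['one', 'two']
print(string_end)           # ['two']
print(tokens)               # ['a', 'b', 'c', 'd']
print(two_or_three_digits)  # ['100', '99']
print(digits_only)          # 5550100
```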