We'll take a quick look at regular expressions in Python.
1. We will need to 'import re' first.
2. We will create a test string and search it for a word starting with the letter 'H'.
Searching for a pattern in a string
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> testString1 = "Time to say Hello World"
>>> match_object = re.search(r"(H....)",testString1)
>>> print match_object.group(1)
Output: Hello
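The same search can be sketched as a small script. This version assumes Python 3 (so `print` is a function, unlike the Python 2.7 session above), and the variable names are illustrative. It also checks for `None`, because `re.search` returns `None` when the pattern is not found and calling `group` on it would fail.

```python
import re

test_string = "Time to say Hello World"

# "H...." means: a capital H followed by any four characters.
match_object = re.search(r"(H....)", test_string)

# re.search returns None when nothing matches, so check before calling group().
if match_object is not None:
    first_word = match_object.group(1)
    print(first_word)
else:
    first_word = None
    print("no match")
```

Here `group(1)` returns the text captured by the first pair of parentheses in the pattern.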
Searching for Multiple Instances of a Pattern in a String
Let us look for all the numbers which occur in this string: "This is a string with numbers in it: 1, 100, 2, 3, 4,99". To find all instances of a pattern in a string, we use 'findall'.
>>> testString2 = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"
>>> numbersInTheString = re.findall(r"(?:^|\b)([0-9]+)(?:\b)",testString2)
>>> for matchedNumber in numbersInTheString:
...     print matchedNumber
...
Output:
1
100
2
3
4
99
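One detail worth seeing in code: `findall` returns the matches as strings, not numbers. The sketch below assumes Python 3 and uses illustrative variable names; it reuses the pattern from the session above and converts the results with `int`.

```python
import re

test_string = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"

# With one capturing group, findall returns a list of that group's text.
numbers_as_text = re.findall(r"(?:^|\b)([0-9]+)(?:\b)", test_string)
print(numbers_as_text)

# The matches are strings; convert them when numeric values are needed.
numbers = [int(n) for n in numbers_as_text]
print(sum(numbers))
```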
Replacing a Pattern in a String with a Substring
This can be done using "sub".
>>> numbersInTheString = re.sub(r"(?:^|\b)s[a-z]+","word_starting_with_s",testString2)
>>> print numbersInTheString
Output: This is a word_starting_with_s with numbers in it: 1, 100, 2, 3, 4,99
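A minimal sketch of the same replacement, assuming Python 3. It also shows the optional `count` argument of `re.sub`, which limits how many matches are replaced. The second sample string and the variable names are made up for illustration.

```python
import re

test_string = "This is a string with numbers in it: 1, 100, 2, 3, 4,99"
pattern = r"(?:^|\b)s[a-z]+"

# re.sub returns a new string; test_string itself is left unchanged.
replaced = re.sub(pattern, "word_starting_with_s", test_string)
print(replaced)

# The optional count argument limits how many matches are replaced.
sample = "she sells sea shells"  # hypothetical sample text
first_two = re.sub(pattern, "X", sample, count=2)
print(first_two)
```

Only "string" is replaced in the first call: the 's' in "This", "is" and "numbers" does not sit at the start of a word followed by more lowercase letters.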
Example: Detecting an Email Address in a chunk of text
>>> matchObj = re.search(r"(?:\b)([^\s]*\@[^\s]*\.[a-zA-Z]{2,4})(?:\b)",testString)
>>> print matchObj.group(1)
Output:
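The test string for this example is not shown, so the sketch below uses a made-up sentence and address (both are assumptions, not the original data) to illustrate how the pattern behaves. It assumes Python 3.

```python
import re

# Hypothetical sample text; the address is invented for illustration.
sample_text = "Contact jane.doe@example.com for details"

email_pattern = r"(?:\b)([^\s]*\@[^\s]*\.[a-zA-Z]{2,4})(?:\b)"
match_obj = re.search(email_pattern, sample_text)

# Guard against None in case the text contains no address at all.
email = match_obj.group(1) if match_obj else None
print(email)
```

The pattern reads as: any run of non-whitespace characters, an '@', another run of non-whitespace characters, then a dot followed by two to four letters. It is a loose check rather than a full validator.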
Example: Detecting the domain name in a given Url
>>> testString="http://www.thehindu.com/features/education/issues/engineering-students-locked-into-microsoft-office/article4640546.ece"
>>> matchObj = re.search(r"(?:https?\:\/\/)?([^\/]*)",testString)
>>> print matchObj.group(1)
Output: www.thehindu.com
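The pattern can be wrapped in a small helper to try it on several inputs. This is a sketch assuming Python 3; the function name `extract_domain` and the `example.org` URL are hypothetical and not part of the original session.

```python
import re

def extract_domain(url):
    """Return the text between an optional http(s):// prefix and the next slash."""
    # (?:https?\:\/\/)? optionally consumes "http://" or "https://";
    # ([^\/]*) then captures everything up to the next "/".
    match = re.search(r"(?:https?\:\/\/)?([^\/]*)", url)
    return match.group(1)

print(extract_domain("http://www.thehindu.com/features/education/issues"))
print(extract_domain("https://www.thehindu.com/features"))
print(extract_domain("www.example.org/page"))  # hypothetical URL without a scheme
```

Because the prefix group is optional and only covers `http` and `https`, a URL with another scheme would have that scheme captured as part of the result, so the helper suits only inputs of the shapes shown.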
Important Building Blocks of Regular Expressions and Examples
| Symbol/Character/Character Class | Explanation | Example |
|---|---|---|
| `.` | Any character | `matchObj = re.search(r"(..o)","Hello"); print matchObj.group(1)` Output: `llo`. The regex picks up a group which ends with an 'o' and has any two characters before that 'o'. |
| `*` | Zero or more occurrences of the preceding character | `matchObj = re.search(r"e(l*)","Hello"); print matchObj.group(1)` Output: `ll`. Picks up the two consecutive 'l' characters after the 'e'. |
| `+` | One or more occurrences of the preceding character | `matchObj = re.search(r"(l+)","Hello"); print matchObj.group(1)` Output: `ll` |
| `?` | Zero or one occurrence of the preceding character or group. If it is used after a `*` or a `+`, it makes the match lazy, so the regex matches as few characters as possible. | Detecting the domain, accepting URLs that start with `https:` or `http:`: `testString="http://www.thehindu.com/features/education/issues"; matchObj = re.search(r"(?:https?\:\/\/)?([^\/]*)",testString); print matchObj.group(1)` Output: `www.thehindu.com` |
| `[a-z]` or `[A-Z]` | Matches any single character from a to z, or from A to Z, inclusive | `matchObj = re.search(r"([a-z]+)","Hello, world"); print matchObj.group(1)` Output: `ello` |
| `[0-9]` | Matches any single digit from 0 to 9, inclusive | The `findall` example under "Searching for Multiple Instances of a Pattern in a String" uses `[0-9]+` to pick out 1, 100, 2, 3, 4 and 99. |
| `\W` | Matches a non-alphanumeric character. The underscore `_` counts as alphanumeric, so it is not matched. | `testString = "Convert all non-alphanumeric characters in this string to blanks."; print re.sub(r"[\W]","*",testString)` Output: `Convert*all*non*alphanumeric*characters*in*this*string*to*blanks*` |
| `\w` | Matches an alphanumeric character, including the underscore `_` | `testString = "A 100 men"; print re.sub(r"[\w]","*",testString)` Output: `* *** ***` |
| `\D` | Matches a non-digit character | `testString = "A 100 men"; print re.sub(r"[\D]","*",testString)` Output: `**100****` (all non-digit characters have been replaced by `*`) |
| `\d` | Matches a digit character | `testString = "A 100 men"; print re.sub(r"[\d]","*",testString)` Output: `A *** men` (all digits have been replaced by `*`) |
| `$` | End of a string or line | |
| `\Z` | End of a string | |
| `\s` | Whitespace | |
| `^` | Beginning of a string | |
| `{m,n}` | Between m and n occurrences of the preceding character | |
| `[^...]` | Matches any character other than the ones inside the square brackets | |
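The lazy form of `?` and the last few symbols in the table have no examples, so here is a brief sketch of each. It assumes Python 3, and the sample strings and variable names are illustrative rather than taken from the original tutorial.

```python
import re

# Greedy vs. lazy: a ? after + (or *) makes the quantifier match as little as possible.
tagged = "<b>bold</b>"  # hypothetical sample text
greedy = re.search(r"<(.+)>", tagged).group(1)
lazy = re.search(r"<(.+?)>", tagged).group(1)
print(greedy)
print(lazy)

# ^ and $ anchor the pattern to the beginning and end of the string.
sentence = "Time to say Hello World"
starts_with_time = re.search(r"^Time", sentence) is not None
ends_with_world = re.search(r"World$", sentence) is not None

# $ also matches just before a trailing newline; \Z matches only at the very end.
dollar_matches = re.search(r"World$", "Hello World\n") is not None
z_matches = re.search(r"World\Z", "Hello World\n") is not None

# {m,n}: keep only the numbers that are two or three digits long.
two_or_three_digits = re.findall(r"\b[0-9]{2,3}\b", "1, 100, 2, 3, 4,99")

# [^...]: delete every character that is not a digit.
digits_only = re.sub(r"[^0-9]", "", "A 100 men")

# \s: replace each whitespace character with an underscore.
underscored = re.sub(r"\s", "_", "A 100 men")

print(two_or_three_digits, digits_only, underscored)
```

The greedy search captures everything up to the last '>', while the lazy one stops at the first, which is usually what is wanted when a delimiter can appear more than once.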