When analyzing text data, we often need to find every string that follows a specific pattern. This is a fairly common task: we might be looking for e-mail addresses, passport numbers, or transaction IDs in a large corpus of text. Searching for such strings manually in a corpus (a large amount of text data) is demanding and, for large corpora, impractical. Regular expressions address this problem by letting us describe the pattern once and search for it automatically. As the name suggests, a regular expression is an expression that holds a defined search pattern, which is used to extract the strings that match it.
Today, regular expressions are available in almost every high-level programming language, with some variation in their implementation. As data scientists or NLP engineers, we should know the basics of regular expressions and when to use them. In this tutorial, we will limit ourselves to Python's implementation of regular expressions.
After reading this article, we will be able to answer the following:
So let’s start our journey in greater detail and first learn more about
Regular expressions are mainly used to extract or replace a specific pattern in a text corpus. In Python, regular expressions are available through the built-in ‘re’ module, whose name is short for regular expression.
For instance, finding all the e-mail addresses in a text corpus by hand would be challenging. A one-line regular expression using the re module can complete this tedious task within seconds. Let’s look at the description of available functions from the re module:
We will go through all the above methods, but first we need to know how a regular expression works.
Let’s take an example to understand the working of a regular expression:
Suppose we have a small piece of text, as mentioned below, and we have to find all the email ids present in it.
“ Yesterday, I received an untitled email with email id as xyz@yahoo.com, and I thought it was spam; I didn’t open it. The following day, I received another mail from jeff with the email id as jeff245@gmail.com, and from there, I found out that It was just a reminder from the bank for the loan dues.”
We aim to find all the e-mail IDs. We could do this manually, but let’s try to do it using regular expressions. First, we identify the properties that characterize an e-mail ID. Let’s list them:
We need to combine the above properties into a single expression to find all the e-mail addresses present in the text. One regular expression that does this is [\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}, which obeys all the rules mentioned above.
Let’s implement the above regular expression in Python:
import re
# Read the sample text; the with-block closes the file automatically
with open('email.txt', 'r') as f:
    string = f.read()
match = re.findall(r'[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}', string)
print(match)
# ['xyz@yahoo.com', 'jeff245@gmail.com']
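To make the structure of this pattern easier to follow, the same expression can be written in verbose mode with one comment per component. This is only a sketch: the sample text is inlined here rather than read from email.txt, and the variable names are illustrative.

```python
import re

# Sample text inlined for illustration (an assumption; the article reads a file).
text = (
    "I received an email with email id as xyz@yahoo.com, and the following "
    "day another mail with the email id as jeff245@gmail.com."
)

# re.VERBOSE lets the pattern span several lines and carry comments.
email_pattern = re.compile(
    r"""
    [\w._%+-]+      # local part: word characters plus . % + -
    @               # the mandatory @ separator
    [\w.-]+         # domain name, possibly with sub-domains
    \.              # literal dot before the top-level domain
    [a-zA-Z]{2,4}   # top-level domain of 2 to 4 letters
    """,
    re.VERBOSE,
)

emails = email_pattern.findall(text)
print(emails)
# ['xyz@yahoo.com', 'jeff245@gmail.com']
```

Note that the {2,4} limit means longer top-level domains would be cut short, so the pattern is a teaching example rather than a complete e-mail validator.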
Now that we know how a regular expression works, we can start exploring the functions of the re module. Following are some standard functions of the re module:
Let’s go through each of these functions one by one:
The re.findall() function returns all the non-overlapping occurrences of the pattern within the string. The result of the findall() function is a list of the matched substrings.
Syntax: re.findall(pattern, string)
Implementation of the findall() function is as follows:
import re
string="I used to live in Northern Japan while I was young, later I moved to southern Japan due to bad weather"
print(re.findall(r'Japan',string))
# ['Japan', 'Japan']
The re.compile() function compiles a regular expression pattern into a pattern object, which can then be reused for repeated searches without compiling the pattern again. The compiled pattern can be searched for within a text using its findall() method.
Syntax: re.compile(pattern)
Implementation of the re.compile() function:
import re
string="I used to live in Northern Japan while I was young, later I moved to southern Japan due to bad weather"
pattern = re.compile('Japan')
print(pattern.findall(string))
# ['Japan', 'Japan']
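The benefit of compiling shows up when the same pattern is applied to many strings. A minimal sketch, with an illustrative list of sentences that is not part of the original example:

```python
import re

# Compile once, then reuse the same pattern object on several strings.
pattern = re.compile(r"Japan")

sentences = [
    "I used to live in Northern Japan while I was young",
    "Later I moved to southern Japan due to bad weather",
    "Now I live somewhere else",
]

# Count the matches in each sentence with the compiled pattern.
counts = [len(pattern.findall(sentence)) for sentence in sentences]
print(counts)
# [1, 1, 0]
```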
The re.search() function detects whether the regular expression pattern exists anywhere in the given input. It returns a Match object for the first location where the pattern is found; otherwise, it returns None.
Syntax: re.search(pattern, string)
import re
string="I used to live in Northern Japan while I was young, later I moved to southern Japan due to bad weather"
pattern_exp = re.search('Japan', string)
print(pattern_exp)
# <re.Match object; span=(27, 32), match='Japan'>
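Because re.search() may return None, it is usually safer to check the result before using it. The following sketch shows one way to do that; the second search word is an illustrative choice that does not occur in the text.

```python
import re

string = "I used to live in Northern Japan while I was young"

found = re.search("Japan", string)
missing = re.search("Korea", string)

for word, result in (("Japan", found), ("Korea", missing)):
    if result is None:
        print(f"{word}: not found")
    else:
        # group() is the matched text; span() gives its start and end indexes.
        print(f"{word}: found {result.group()!r} at {result.span()}")
# Japan: found 'Japan' at (27, 32)
# Korea: not found
```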
The function re.match() returns a Match object only if the pattern matches at the beginning of the string; otherwise, it returns None. The matched text can be extracted using the group() method.
Syntax: re.match(pattern, string)
Implementation of the re.match() function:
import re
string="Gravity is high on poles and less on the equator"
pattern = re.match('Gravity', string)
print(pattern)
print(pattern.group())
# <re.Match object; span=(0, 7), match='Gravity'>
# Gravity
Consider another scenario where the string doesn’t start with the pattern word. The function returns ‘None’ in such cases.
import re
string="Gravity is high on poles and less on the equator"
pattern = re.match('poles', string)
print(pattern)
# None
The re.sub() function replaces every substring that matches the pattern with another substring.
Syntax: re.sub(pattern, replacement, string)
Implementation of re.sub():
import re
string = "Sun-set in the east"
updated_string = re.sub("east", "west", string)
print(updated_string)
# Sun-set in the west
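re.sub() is not limited to fixed words: the first argument can be any pattern, and the optional count argument limits how many matches are replaced. A small sketch with an illustrative input string:

```python
import re

string = "Call 555-1234 or 555-9876 before 5 pm"

# Replace every run of digits with '#'.
masked_all = re.sub(r"\d+", "#", string)
# count=1 limits the substitution to the first match only.
masked_first = re.sub(r"\d+", "#", string, count=1)

print(masked_all)
# Call #-# or #-# before # pm
print(masked_first)
# Call #-1234 or 555-9876 before 5 pm
```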
The re.split() function helps split the string when it encounters a specific pattern. The function returns a list of substrings separated by the match of the pattern.
Syntax: re.split(pattern, string, maxsplit=0)
Implementation of re.split() is as follows:
import re
string = "Sun-set in the west and rises in the east"
updated_string = re.split("and", string)
print(updated_string)
# ['Sun-set in the west ', ' rises in the east']
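The separator can itself be a pattern, and maxsplit caps the number of splits. A brief sketch, assuming an illustrative string with inconsistent delimiters:

```python
import re

string = "apples, oranges;bananas ,  grapes"

# Split on a comma or semicolon, ignoring any whitespace around it.
parts = re.split(r"\s*[,;]\s*", string)
# maxsplit=1 stops after the first split and leaves the rest untouched.
first_split = re.split(r"\s*[,;]\s*", string, maxsplit=1)

print(parts)
# ['apples', 'oranges', 'bananas', 'grapes']
print(first_split)
# ['apples', 'oranges;bananas ,  grapes']
```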
All the above functions require a pattern or an exact word to search, match, substitute, or split on. Building patterns requires characters that carry a special meaning, and such characters are known as metacharacters.
Metacharacters are characters that hold a special meaning in denoting a particular pattern. They are used in regular expressions to build search criteria and to manipulate text. Let’s look at some metacharacters and their meaning.
The backslash character (\) ensures that the character following it is not treated as a metacharacter. It is a way of escaping the metacharacter. Let’s check a quick implementation:
import re
string = 'This dot . in the string is not desirable'
# Without using the backslash(\)
print(re.search(r'.', string))
# Using backslash(\)
print(re.search(r'\.', string))
# <re.Match object; span=(0, 1), match='T'>
# <re.Match object; span=(9, 10), match='.'>
Explanation: without the backslash (\), the ‘.’ character is treated as a metacharacter that matches any character, so the search returns the first character, ‘T’, instead of the dot. With the backslash, the metacharacter is escaped and ‘\.’ matches only the literal dot.
Square brackets ([]) denote a character class, that is, a set of characters that we want to match. For instance, to match any single character between a and k, we search for the pattern [a-k], where the ‘-’ character represents a range.
By this norm, [1-4] is the same as [1234].
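A short sketch of character classes in action; the input string is an illustrative assumption:

```python
import re

string = "jackal 2459"

# [a-k] matches one lowercase letter from a to k, so 'l' is skipped.
letters = re.findall(r"[a-k]", string)
# [1-4] and [1234] describe the same set of digits.
digits_range = re.findall(r"[1-4]", string)
digits_listed = re.findall(r"[1234]", string)

print(letters)
# ['j', 'a', 'c', 'k', 'a']
print(digits_range, digits_listed)
# ['2', '4'] ['2', '4']
```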
The caret meta-character (^) checks whether the string starts with the given characters. For instance, ^Gravity matches ‘Gravity is high on poles’ but ^poles does not.
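A minimal sketch of the caret anchor, reusing the sentence from the re.match() example:

```python
import re

string = "Gravity is high on poles and less on the equator"

# ^ anchors the pattern to the start of the string.
at_start = re.search(r"^Gravity", string)
not_at_start = re.search(r"^poles", string)

print(at_start)
# <re.Match object; span=(0, 7), match='Gravity'>
print(not_at_start)
# None
```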
The dollar meta-character ($) checks whether the string ends with the given characters, which makes it the opposite of the caret meta-character. For instance, equator$ matches ‘less on the equator’ but poles$ does not.
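A minimal sketch of the dollar anchor on the same sentence:

```python
import re

string = "Gravity is high on poles and less on the equator"

# $ anchors the pattern to the end of the string.
at_end = re.search(r"equator$", string)
not_at_end = re.search(r"poles$", string)

print(at_end is not None)
# True
print(not_at_end)
# None
```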
The dot meta-character (.) matches any single character except a newline. For instance, c.t matches ‘cat’, ‘cot’, and ‘c-t’, but not ‘ct’.
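A minimal sketch of the dot, with an illustrative input string:

```python
import re

# The dot stands for exactly one character, so 'ct' is not matched.
dot_matches = re.findall(r"c.t", "cat cot cut ct c-t")

print(dot_matches)
# ['cat', 'cot', 'cut', 'c-t']
```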
The Or meta-character (|) matches if either the pattern before it or the pattern after it matches. For instance, cat|dog matches both ‘cat’ and ‘dog’.
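A minimal sketch of alternation, with an illustrative sentence:

```python
import re

# Either alternative may match; 'bird' matches neither.
or_matches = re.findall(r"cat|dog", "I have a cat, a dog, and a bird")

print(or_matches)
# ['cat', 'dog']
```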
The question mark meta-character (?) matches zero or one occurrence of the immediately preceding regex. For instance, colou?r matches both ‘color’ and ‘colour’.
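A minimal sketch of the optional quantifier, with an illustrative sentence:

```python
import re

# 'u?' makes the letter u optional: zero or one occurrence, never two.
optional_matches = re.findall(r"colou?r", "color and colour, but not colouur")

print(optional_matches)
# ['color', 'colour']
```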
The star meta-character (*) matches zero or more occurrences of the preceding regex. For instance, ab*c matches ‘ac’, ‘abc’, and ‘abbbc’.
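A minimal sketch of the star quantifier, with an illustrative input string:

```python
import re

# 'b*' allows any number of b characters, including none at all.
star_matches = re.findall(r"ab*c", "ac abc abbbc adc")

print(star_matches)
# ['ac', 'abc', 'abbbc']
```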
The plus meta-character (+) matches one or more occurrences of the regex preceding the + character. For instance, ab+c matches ‘abc’ and ‘abbbc’ but not ‘ac’.
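A minimal sketch of the plus quantifier, with an illustrative input string:

```python
import re

# 'b+' requires at least one b, so 'ac' is no longer matched.
plus_matches = re.findall(r"ab+c", "ac abc abbbc")

print(plus_matches)
# ['abc', 'abbbc']
```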
The braces meta-characters ({a,b}) match the preceding regex repeated between a and b times, inclusive of both a and b. For instance, \d{2,4} matches a run of two to four digits.
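A minimal sketch of a bounded repetition. The word boundaries (\b) are an addition here so that only whole numbers of the right length are returned; the input string is illustrative.

```python
import re

# \d{2,4} accepts 2, 3, or 4 digits; \b keeps the match to whole numbers.
brace_matches = re.findall(r"\b\d{2,4}\b", "7 42 123 2024 55555")

print(brace_matches)
# ['42', '123', '2024']
```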
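The group meta-character, written with parentheses (), is used to group and match sub-patterns in the string.
For instance, (ca|t)dog matches the strings ‘cadog’ and ‘tdog’. When searching inside longer strings such as ‘catdog’ or ‘lcadog’, it matches the substrings ‘tdog’ and ‘cadog’ respectively.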
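The difference between matching a whole string and matching inside it can be checked directly. In this sketch, re.search() looks for the pattern anywhere in the string, while re.fullmatch() requires the entire string to match:

```python
import re

pattern = r"(ca|t)dog"

# search() finds the pattern anywhere; group(1) is the text captured by (...).
inside = re.search(pattern, "catdog")
print(inside.group(), inside.group(1), inside.span())
# tdog t (2, 6)

# fullmatch() succeeds only if the entire string fits the pattern.
whole_cadog = re.fullmatch(pattern, "cadog")
whole_catdog = re.fullmatch(pattern, "catdog")
print(whole_cadog is not None, whole_catdog is not None)
# True False
```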
Apart from meta-characters, there are some special sequences that are useful when writing common patterns, such as finding strings that start with a certain word or character. This can also be achieved using meta-characters, but special sequences save a lot of effort. Let’s look at some special sequences.
Special Sequences help find the location of a specific string where the regular expression must match. With special sequences, we can write complex and common patterns exceptionally quickly. Let’s look at some commonly used Special-Sequences:
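As a rough illustration, the sketch below uses four sequences from Python's re module: \d (a digit), \w (a word character), \b (a word boundary), and \A (the start of the string). The input string is an illustrative assumption.

```python
import re

string = "Order 66 shipped on 2024-05-17"

digits = re.findall(r"\d+", string)             # runs of digits
words = re.findall(r"\w+", string)              # runs of word characters
s_words = re.findall(r"\bs\w*", string)         # words that begin with 's'
starts_with = re.search(r"\AOrder", string)     # 'Order' at the very start
not_at_start = re.search(r"\Ashipped", string)  # 'shipped' is not at the start

print(digits)
# ['66', '2024', '05', '17']
print(words)
# ['Order', '66', 'shipped', 'on', '2024', '05', '17']
print(s_words)
# ['shipped']
print(starts_with is not None, not_at_start is not None)
# True False
```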
Regex has many applications within NLP alone. We have seen its potential in extracting e-mail addresses, and it can equally be used to search for passport IDs, telephone numbers, names, etc.
Following are some significant applications of regex:
This blog started with a brief introduction to regular expressions and why they are essential. We went through an example that showed the potential of regular expressions in finding e-mail addresses in a text. We then learned some of the functions available in Python's built-in re module for matching and searching patterns in text. Further, we explored some meta-characters and special sequences used for writing regular expressions, and finally, we looked at some applications of regular expressions. We hope you enjoyed the article.