Regular Expression in ML (Python)

When analyzing text data, we often need to find every string that follows a specific pattern, such as e-mail addresses, passport numbers, or transaction IDs in a large corpus of text. Searching for such strings manually in a corpus (a large amount of text data) is demanding and, for many corpora, impractical. Regular expressions address this problem by replacing the manual search with an automated one. As the name suggests, a regular expression is an expression that holds a defined search pattern and is used to extract the strings that match that pattern.

Today, regular expressions are available in almost every high-level programming language, with slight variations in their implementation. As data scientists or NLP engineers, we should know the basics of regular expressions and when to use them. In this tutorial, we will limit ourselves to Python's implementation of regular expressions.

Key takeaways from this blog

After reading this article, we will be able to answer the following:

What are Regular Expressions in Python?
How does regular expression work?
What are some standard regular expression functions used in NLP?
What are meta-characters and special sequences in regular expressions?
What are some applications of regular expression in text analytics?

So let’s start our journey by first learning what regular expressions are.

What are Regular Expressions?

Regular expressions are expressions used mainly to extract or replace a specific pattern present in a text corpus. In Python, regular expressions are available through the built-in ‘re’ module, whose name is short for Regular Expression.

For instance, finding all the emails in a text corpus by hand would be challenging. However, a one-line regular expression using the re module can solve this tedious task within seconds. Let’s look at the description of available functions from the re module:

List of functions coming with "re" module

We will go through all the above functions, but first we need to know how a regular expression works.

How does Regular Expression work?

Let’s take an example to understand the working of a regular expression:

Suppose we have a small piece of text, as mentioned below, and we have to find all the email ids present in it.

“ Yesterday, I received an untitled email with email id as xyz@yahoo.com, and I thought it was spam; I didn’t open it. The following day, I received another mail from jeff with the email id as jeff245@gmail.com, and from there, I found out that It was just a reminder from the bank for the loan dues.”

We aim to find all the E-mail IDs, which we can achieve manually, but let’s try to do it using regular expressions. First, we will discover the unique properties of an email id. Let’s list down the properties of an Email ID:

Any email starts with an alpha-numeric string, which can also contain symbols like [ ., %, _, +, -], and this part ends where it encounters the ‘@’ symbol.
After the ‘@’ symbol comes the domain name, a string like gmail, yahoo, etc., which terminates at the ‘.’ symbol.
Following the ‘.’ symbol comes the domain extension, which is generally a short string of letters and completes the email address.

We need to accommodate the above patterns in an expression to find all the E-mail addresses present in the text. Following is an example of a regular expression for extracting the email IDs from the text. It obeys all the rules mentioned above:

Regular Expression For Finding Email IDs

Let’s implement the above regular expression in python:

import re

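# Read the whole file; the with-block closes it automatically
with open('email.txt', 'r') as f:
    string = f.read()

# local part, '@', domain name, '.', then a 2-4 letter domain extension
match = re.findall(r'[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}', string)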

print(match)

# ['xyz@yahoo.com', 'jeff245@gmail.com']
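The snippet above reads from a file named email.txt. For readers who want to try the pattern without creating that file, here is a minimal sketch that applies the same expression to the sample text held in a string. The variable names text, email_pattern, and emails are only illustrative.

```python
import re

# Sample text from the example above, kept in a string instead of a file
text = (
    "Yesterday, I received an untitled email with email id as xyz@yahoo.com, "
    "and I thought it was spam; I didn't open it. The following day, I received "
    "another mail from jeff with the email id as jeff245@gmail.com, and from "
    "there, I found out that It was just a reminder from the bank for the loan dues."
)

# local part, '@', domain name, '.', then a 2-4 letter domain extension
email_pattern = r'[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}'

emails = re.findall(email_pattern, text)
print(emails)
# ['xyz@yahoo.com', 'jeff245@gmail.com']
```

Because the extension is limited to four letters by {2,4}, a longer extension would be cut short by this pattern, so the upper bound may need adjusting for real data.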

Standard Regular Expression Functions used in NLP

Now that we know how a regular expression works, we can start exploring the functions of the re module. Following are some standard functions of re module:

re.findall()
re.compile()
re.search()
re.match()
re.sub()
re.split()

Let’s go through each of these functions one by one:

re.findall()

This function returns every non-overlapping occurrence of the pattern within the string. The result of the findall() function is a list of the matched occurrences.

Syntax: re.findall(pattern, string)

Implementation of findall() function is as follows:

import re

string="I used to live in Northern Japan while I was young, later I moved to southern Japan due to bad weather"
print(re.findall(r'Japan',string))

# ['Japan', 'Japan']
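findall() is not limited to literal words. The following sketch uses a made-up sentence and the \d special sequence (any digit) to show two behaviours worth knowing: matching a class of characters, and returning only the captured text when the pattern contains one group.

```python
import re

# Hypothetical sentence, chosen only for illustration
sentence = "Order 66 shipped on day 12 with 3 items"

# \d+ matches runs of one or more digits
numbers = re.findall(r'\d+', sentence)
print(numbers)
# ['66', '12', '3']

# With one capturing group, findall() returns only the group's text
words_before_numbers = re.findall(r'(\w+) \d+', sentence)
print(words_before_numbers)
# ['Order', 'day', 'with']
```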

re.compile()

This function compiles a regular expression pattern into a pattern object that can be stored and reused, which is convenient when the same pattern is searched many times. The compiled pattern can then be searched within the text using its own ‘findall’ method.

Syntax: re.compile(pattern, flags=0)

Implementation of re.compile() function:

import re

string="I used to live in Northern Japan while I was young, later I moved to southern Japan due to bad weather"

pattern = re.compile('Japan')
print(pattern.findall(string))

# ['Japan', 'Japan']
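A compiled pattern pays off when the same expression is applied to many strings. The sketch below assumes a small list of sentences (invented here) and also passes the re.IGNORECASE flag, so the pattern matches regardless of letter case.

```python
import re

# Compile once, with a flag so that letter case is ignored
japan_pattern = re.compile(r'japan', re.IGNORECASE)

# Hypothetical sentences for illustration
sentences = [
    "I used to live in Northern Japan",
    "Later I moved to southern JAPAN",
    "The weather was bad",
]

# Reuse the same compiled object on every sentence
hits = [bool(japan_pattern.search(s)) for s in sentences]
print(hits)
# [True, True, False]
```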

re.search()

This function helps detect whether the regular expression pattern exists anywhere in the given input. It returns a Match object for the first occurrence if the pattern is found in the text; otherwise, it returns None.

Syntax: re.search(pattern, string)

import re

string="I used to live in Northern Japan while I was young, later I moved to southern Japan due to bad weather"

pattern_exp = re.search('Japan', string)
print(pattern_exp)

# <re.Match object; span=(27, 32), match='Japan'>
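The Match object printed above carries the matched text and its position. As a short sketch, these can be read with the group(), start(), end(), and span() methods, after first checking that the result is not None. The word 'Korea' is just an arbitrary pattern that does not occur in the string.

```python
import re

string = "I used to live in Northern Japan while I was young, later I moved to southern Japan due to bad weather"

result = re.search(r'Japan', string)

if result is not None:
    print(result.group())  # Japan
    print(result.start())  # 27
    print(result.end())    # 32
    print(result.span())   # (27, 32)

# A pattern that is absent from the string gives None
missing = re.search(r'Korea', string)
print(missing)  # None
```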

re.match()

The function re.match() returns a Match object only if the pattern matches at the very beginning of the string; otherwise, it returns None. The matched text can be extracted using the group() method. Implementation of re.match() function:

Syntax: re.match(pattern, string)

import re

string="Gravity is high on poles and less on the equator"
pattern = re.match('Gravity', string)

print(pattern)
print(pattern.group())

# <re.Match object; span=(0, 7), match='Gravity'>
# Gravity

Consider another scenario where the string doesn’t start with the pattern word. The function returns ‘None’ in such cases.

import re
string="Gravity is high on poles and less on the equator"

pattern = re.match('poles', string)
print(pattern)

# None
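Since re.match() can return None, calling group() on its result without a check would raise an AttributeError. The sketch below contrasts re.match() with re.search() on the same sentence and shows one possible way to guard the call; the fallback message is made up for the example.

```python
import re

string = "Gravity is high on poles and less on the equator"

at_start = re.match(r'poles', string)   # only tries position 0
anywhere = re.search(r'poles', string)  # scans the whole string

print(at_start)          # None
print(anywhere.group())  # poles

# Test the result before calling .group() to avoid an AttributeError
word = at_start.group() if at_start else "no match at the start"
print(word)  # no match at the start
```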

re.sub()

The re.sub() function replaces every part of the string that matches the pattern with another sub-string. Implementation of re.sub():

Syntax: re.sub(pattern, repl, string)

import re
string = "Sun-set in the east"

updated_string = re.sub("east", "west", string)
print(updated_string)

# Sun-set in the west
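re.sub() replaces all matches by default. The sketch below, using an invented sentence, shows the optional count argument that limits the number of replacements, and a replacement string that refers back to captured groups with \1 and \2.

```python
import re

# Hypothetical sentence for illustration
string = "east or west, east is best"

# Replace every occurrence
all_replaced = re.sub(r'east', 'home', string)
print(all_replaced)
# home or west, home is best

# Replace only the first occurrence with count=1
first_replaced = re.sub(r'east', 'home', string, count=1)
print(first_replaced)
# home or west, east is best

# \1 and \2 in the replacement refer to the first and second captured group
swapped = re.sub(r'(\w+)-(\w+)', r'\2-\1', "Sun-set")
print(swapped)
# set-Sun
```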

re.split()

The re.split() function helps split the string when it encounters a specific pattern. The function returns a list of substrings separated by the match of the pattern.

Syntax: re.split(pattern, string, maxsplit=0)

Implementation of re.split() is as follows:

import re
string = "Sun-set in the west and rises in the east"

updated_string = re.split("and", string)
print(updated_string)


# ['Sun-set in the west ', ' rises in the east']
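Notice that the pieces above keep the spaces around "and". A small sketch of two variations: a pattern that also consumes the surrounding whitespace, and the maxsplit argument from the syntax line, which caps the number of splits.

```python
import re

string = "Sun-set in the west and rises in the east"

# \s+and\s+ also consumes the whitespace around "and"
parts = re.split(r'\s+and\s+', string)
print(parts)
# ['Sun-set in the west', 'rises in the east']

# maxsplit=2 stops after two splits; the remainder stays in the last item
first_words = re.split(r'\s+', string, maxsplit=2)
print(first_words)
# ['Sun-set', 'in', 'the west and rises in the east']
```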

All the above functions require a certain pattern or exact matching word to search, match, map, split, etc. Creating patterns requires some special characters that carry a special meaning and such characters are known as metacharacters.

What are Meta-Characters?

Meta-Characters are some special characters that hold a special meaning in denoting a particular pattern. They are used in regular expressions as a medium to build search criteria and extend to text manipulation. Let’s look at some Meta-Characters and their meaning.

Meta characters and their description in regular expression

Backslash(‘\’)

Backslash character ensures that the search character shouldn’t be considered a meta-character. It is a way of escaping the metacharacter. Let’s check the quick implementation:

import re
string = 'This dot . in the string is not desirable'

# Without using the backslash(\)
print(re.search(r'.', string))

# Using backslash(\)
print(re.search(r'\.', string))


# <re.Match object; span=(0, 1), match='T'>
# <re.Match object; span=(9, 10), match='.'>

Explanation: without the backslash (\), the ‘.’ character is treated as a metacharacter that matches any character, so the first search simply returns the first character, ‘T’. With the backslash, the metacharacter is escaped and the search finds the literal dot.

Square Brackets(‘[]’)

Square brackets denote a character class, that is, a set of characters of which any one may match. For instance, if we want to match any single character between a and k, we need to search for the pattern [a-k], where the ‘-’ character represents a range.

By this norm, [1-4] is the same as [1234]
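A quick sketch of character classes in action, using the arbitrary word "background". It also shows that a caret placed as the first character inside the brackets negates the set, which is a different role from the caret used outside brackets.

```python
import re

word = "background"

# [a-k] matches any single character from a to k
in_range = re.findall(r'[a-k]', word)
print(in_range)
# ['b', 'a', 'c', 'k', 'g', 'd']

# [1-4] and [1234] describe the same set of characters
digits = "0123456789"
same_set = re.findall(r'[1-4]', digits) == re.findall(r'[1234]', digits)
print(same_set)
# True

# A caret as the first character inside the brackets negates the set
out_of_range = re.findall(r'[^a-k]', word)
print(out_of_range)
# ['r', 'o', 'u', 'n']
```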

Caret(‘^’)

Caret meta-character checks whether the string starts with the given characters. For instance:

^e will check if the string begins with the character ‘e’, such as enjoy, evolve, electric, etc.
^en will check if the string starts with the characters ‘en’, such as engage, energetic, enjoy, enrollment, etc.

Dollar(‘$’)

Dollar meta-character checks whether the string ends with the given characters. It is the opposite of the caret meta-character. For instance:

s$ will match all the strings ending with s, such as cats, dogs, status, etc.
ds$ will match all the strings ending with characters ‘ds’ such as ends, commends, etc.
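The caret and dollar anchors can be tried out on a short word list. This is a minimal sketch; the list is made up from the words mentioned above.

```python
import re

words = ["enjoy", "evolve", "engage", "cats", "ends", "status"]

# ^en keeps the words that start with "en"
starts_with_en = [w for w in words if re.search(r'^en', w)]
print(starts_with_en)
# ['enjoy', 'engage', 'ends']

# s$ keeps the words that end with "s"
ends_with_s = [w for w in words if re.search(r's$', w)]
print(ends_with_s)
# ['cats', 'ends', 'status']

# ds$ keeps the words that end with "ds"
ends_with_ds = [w for w in words if re.search(r'ds$', w)]
print(ends_with_ds)
# ['ends']
```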

Dot (‘.’)

Dot meta-character matches any single character (except a newline, by default) at its position in the pattern. For instance:

a.e will match all the strings that have any one character in place of the ‘.’ placeholder, such as are, ace, ate, etc.
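To see that the dot stands for exactly one character, the sketch below tests a few candidate words with re.fullmatch(), which requires the whole string to match. The words "ae" and "axle" are added here as counter-examples.

```python
import re

candidates = ["are", "ace", "ate", "ae", "axle"]

# 'a', then exactly one character of any kind, then 'e'
dot_matches = [w for w in candidates if re.fullmatch(r'a.e', w)]
print(dot_matches)
# ['are', 'ace', 'ate']
```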

Or (‘|’)

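Or meta-character matches if either the pattern on its left or the pattern on its right matches. For instance:

a|r matches either the character ‘a’ or the character ‘r’. When tested at the beginning of a string, as re.match() does, it accepts strings that start with either a or r, such as analytics, research, argument, return, etc.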
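How a|r behaves depends on the function that applies it. As a sketch with invented word lists: re.match() tests only the start of each word, while re.search() accepts an a or an r anywhere in the word.

```python
import re

words = ["analytics", "research", "gym", "return"]

# re.match() anchors at the start, so this keeps words beginning with a or r
begin_with_a_or_r = [w for w in words if re.match(r'a|r', w)]
print(begin_with_a_or_r)
# ['analytics', 'research', 'return']

# re.search() looks anywhere, so any word containing an a or an r qualifies
contain_a_or_r = [w for w in ["gym", "bear", "sky"] if re.search(r'a|r', w)]
print(contain_a_or_r)
# ['bear']
```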

Question Mark(‘?’)

Question Mark meta-character matches zero or one occurrence of the immediately preceding regex. Let’s understand this with an example:

‘dog?’ regex would match the whole of ‘do’ and ‘dog’ but not the whole of ‘dogg’ or ‘doggie’, because the ‘g’ may appear at most once.

Star(‘*’)

Star meta-character matches zero or more occurrences of the previous regex. Let’s understand this with an example:

‘dog*’ regex would match the strings ‘do,’ ‘dog,’ ‘dogg,’ ‘doggg,’ etc.

Plus(‘+’)

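Plus meta-character matches one or more occurrences of the regex previous to the + character.

‘do+g’ regex would match the strings ‘dog’, ‘doog’, and ‘dooog’ but not ‘dobg’ or ‘dg’. Note that the pattern is written without square brackets, since [do+g] would be a character class matching a single one of the characters d, o, +, or g.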
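The three quantifiers ?, *, and + are easiest to compare side by side. The sketch below uses re.fullmatch(), so a word counts only if the pattern covers it entirely; full_matches is a hypothetical helper written for this example, not part of the re module.

```python
import re

def full_matches(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

candidates = ["do", "dog", "dogg", "doggg"]

zero_or_one = full_matches(r'dog?', candidates)
print(zero_or_one)   # ['do', 'dog']

zero_or_more = full_matches(r'dog*', candidates)
print(zero_or_more)  # ['do', 'dog', 'dogg', 'doggg']

one_or_more = full_matches(r'dog+', candidates)
print(one_or_more)   # ['dog', 'dogg', 'doggg']

# Here + applies to the 'o', the character directly before it
plus_on_o = full_matches(r'do+g', ["dg", "dog", "doog", "dobg"])
print(plus_on_o)     # ['dog', 'doog']
```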

Braces(‘{a,b}’)

Braces meta-characters match the previous regex repeated anywhere from a to b times, inclusive of both a and b. The two numbers are written without a space after the comma.

k{1,4}ite would match the whole of the strings ‘kite,’ ‘kkite,’ ‘kkkite,’ ‘kkkkite’ but not the whole of ‘kkkkkite’, since the ‘k’ can repeat at most four times.
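A short sketch of the braces quantifier. It assumes the pattern k{1,4}ite, that is, one to four k characters followed by "ite", and checks whole words with re.fullmatch(). The second part uses {4} with a single number, meaning exactly four repetitions, on a made-up sentence.

```python
import re

candidates = ["ite", "kite", "kkite", "kkkkite", "kkkkkite"]

# Between one and four 'k' characters, then 'ite'
brace_matches = [w for w in candidates if re.fullmatch(r'k{1,4}ite', w)]
print(brace_matches)
# ['kite', 'kkite', 'kkkkite']

# {4} with a single number means exactly four repetitions
years = re.findall(r'\d{4}', "Founded in 1998, renamed in 2015")
print(years)
# ['1998', '2015']
```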

Group(‘()’)

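Group meta-character groups part of a pattern into a sub-pattern, so that alternation or repetition applies to that part only and the text it matched can be retrieved separately.

For instance, when searched for with re.search(), (ca|t)dog would be found in strings like ‘cadog’, ‘tdog’, ‘catdog’, ‘lcadog’ etc.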
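Groups also make it possible to read back what each parenthesised part matched. In the sketch below, group() returns the whole match and group(1) returns the text captured by the first pair of parentheses. The last lines apply the same idea to an e-mail address with a simplified pattern assumed only for this example.

```python
import re

group_pattern = r'(ca|t)dog'

results = []
for text in ["cadog", "tdog", "catdog", "lcadog"]:
    found = re.search(group_pattern, text)
    # group() is the whole match, group(1) is what the parentheses captured
    results.append((text, found.group(), found.group(1)))
    print(text, found.group(), found.group(1))
# cadog cadog ca
# tdog tdog t
# catdog tdog t
# lcadog cadog ca

# Two groups split a match into its parts (simplified e-mail pattern)
user, domain = re.search(r'([\w.]+)@([\w.]+)', "jeff245@gmail.com").groups()
print(user, domain)
# jeff245 gmail.com
```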

Apart from meta-characters, there are some special sequences that are useful when writing common patterns, such as finding strings that start with a certain word or character. This can also be achieved using meta-characters, but using special sequences saves a lot of effort. Let’s look at some special sequences.

What are Special Sequences?

Special sequences consist of a backslash followed by a character and stand for a commonly used set of characters or a position in the string where the regular expression must match. With special sequences, we can write complex and common patterns quickly. Let’s look at some commonly used Special-Sequences:

Special Sequences in regular expression
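As a brief illustration, the sketch below exercises four sequences from Python's re module: \d (a digit), \w (a word character: letter, digit, or underscore), \s (a whitespace character), and \b (a word boundary). The sample strings are invented, and the selection is an assumption about which sequences are most useful rather than a complete list.

```python
import re

text = "Call 555-0199 at 9am, room B12."

# \d+ : runs of digits
digits = re.findall(r'\d+', text)
print(digits)
# ['555', '0199', '9', '12']

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r'\w+', text)
print(words)
# ['Call', '555', '0199', 'at', '9am', 'room', 'B12']

# \s : a single whitespace character
whitespace_count = len(re.findall(r'\s', text))
print(whitespace_count)
# 5

# \b : word boundary, so only the standalone word "cat" is matched
whole_words = re.findall(r'\bcat\b', "cat concatenate cat")
print(whole_words)
# ['cat', 'cat']
```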

Application of Regular Expressions in Text Analytics

There are endless applications of Regex in NLP itself. We have seen the potential of regex in filtering e-mail addresses. We can even use regex to search for passport IDs, telephone numbers, names, etc.

Following are some significant applications of regex:

Web Scraping & Data Collection
Text Preprocessing (NLP)
Pattern Detection for IDs, E-mails, Names
Date-time manipulations
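To make two of these applications concrete, here is a minimal sketch on an invented string: detecting date-like patterns, and a simple text-preprocessing step. The date pattern only checks the YYYY-MM-DD shape and does not validate the calendar, and the cleaning rules (keep letters, digits, and spaces, then lowercase) are an assumption that a real pipeline would adapt to its own data.

```python
import re

# Hypothetical raw text for illustration
raw = "  Hello!!!   Visit us on 2023-05-14 or 2023-06-01.  "

# Pattern detection: four digits, two digits, two digits, separated by hyphens
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', raw)
print(dates)
# ['2023-05-14', '2023-06-01']

# Text preprocessing: drop punctuation, squeeze whitespace, lowercase
cleaned = re.sub(r'[^A-Za-z0-9\s]', '', raw)
cleaned = re.sub(r'\s+', ' ', cleaned).strip().lower()
print(cleaned)
# hello visit us on 20230514 or 20230601
```

Note that removing punctuation also removes the hyphens inside the dates, so in practice the extraction step would run before the cleaning step.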

Conclusion

This blog started with a brief introduction to regular expressions and why they are essential. We went through an example where we saw the potential of a regular expression in finding the email addresses in a text. Moving onwards, we learned some functions available in the built-in re module of Python for matching and searching patterns in text. Further, we explored some meta-characters and special sequences used for writing regular expressions, and finally, we looked at some applications of regular expressions. We hope you enjoyed the article.
