Blessing Agyei Kyem

Posted on Jan 13, 2023

Deep Dive into Preprocessing Techniques in NLP using Python - Part 1

#discuss #productivity #career



Language and speech data we encounter in the real world are usually messy and disorganized. This makes the data hard for machines to interpret, so we need to preprocess it before we can make informed decisions during analysis and modelling.

Without a systematic way to start and keep data clean, bad data will happen - Donato Diorio

Consider the sentence below :

Lifeeee is such a painnn :(

And another sentence :

Life is such a pain :(

The two sentences carry the same semantic meaning but the former requires a bit of preprocessing to remove the extra characters at the end of some of the words.
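As a rough preview of where this is heading, the cleanup described above could be sketched with Python's re module. The helper name collapse_repeats is made up for illustration, and the rule it applies (any character repeated three or more times in a row is reduced to one) is an assumption that will not suit every dataset. It also relies on re.sub() and a backreference, which this part of the tutorial does not cover.

```python
import re

def collapse_repeats(text):
    # Hypothetical helper: a word character followed by two or more
    # copies of itself is replaced by a single copy.
    return re.sub(r'(\w)\1{2,}', r'\1', text)

cleaned = collapse_repeats('Lifeeee is such a painnn :(')
print(cleaned)  # Life is such a pain :(
```

Legitimate double letters, as in cool, are left alone because the rule only fires on runs of three or more.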

Preprocessing is essential in NLP tasks and an important part of the toolkit for machine learning engineers and data scientists as they move on to building ML models.

To become good at data preprocessing, especially in NLP, you need to be able to detect and extract patterns in data.

As a result, knowing how to manipulate strings with regex should be a priority. In this tutorial we will dive deep into how to use regular expressions in Python.

Regular Expressions

Regular expressions are strings of characters and symbols (literals and metacharacters) that are used to detect patterns in text.
Suppose we have the following text :

The numbers are : 022-236-1823, 0554-236-172, 055 345 17584, and 0234456812

We might want to extract only the numbers from the text.

The following regex can help us achieve this :

/\d+[\s-]?\d+[\s-]?\d+/
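The slashes around the pattern are only delimiters. As a sketch, this is how the same pattern might be tried out in Python with re.findall(), a function introduced later in this tutorial. The variable names are illustrative.

```python
import re

text = 'The numbers are : 022-236-1823, 0554-236-172, 055 345 17584, and 0234456812'

# Three runs of digits, each optionally separated by one whitespace
# character or one hyphen.
numbers = re.findall(r'\d+[\s-]?\d+[\s-]?\d+', text)
print(numbers)
# ['022-236-1823', '0554-236-172', '055 345 17584', '0234456812']
```

The last number has no separators at all; it still matches because each separator is optional and the three digit runs can share the ten digits between them.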

Another example :

Hello everyone, Helium, hectic, help me!

Considering the text above we might be interested in words that start with He or he.
We can type the following expression :

/[Hh]e[a-z]+/
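A minimal sketch of the same idea in Python, again using re.findall() ahead of its introduction later on. The variable names are illustrative.

```python
import re

text = 'Hello everyone, Helium, hectic, help me!'

# H or h, then e, then one or more lowercase letters
words = re.findall(r'[Hh]e[a-z]+', text)
print(words)  # ['Hello', 'Helium', 'hectic', 'help']
```

Note that the pattern does not check where a word begins, so on other text it could also match in the middle of a word (for example, here inside there).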

We can also extract a specific word pattern by typing some literals. Example :

The regex below :

/happy/

will extract or match happy from the sentence I am very happy

To understand regex, we have to know the difference between metacharacters and literals.

Let's consider the pattern we used earlier:

/\d+[\s-]?\d+[\s-]?\d+/

The metacharacters are :

\, +, \d, [ ], \s, ?

The only literal we have is -.

In our second example :

/[Hh]e[a-z]+/

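Our literals are H, h, e, a, and z. Here the - inside [a-z] is not a literal: between two characters in square brackets it denotes a range.

There are a lot of metacharacters in regex and each of them has its specific use case. Let's explore them and learn when to use them :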

Metacharacter	Description
\	Escapes the character that follows it: it either gives an ordinary character a special meaning (e.g. `\d`) or turns a metacharacter into a literal (e.g. `\.`)
^	Matches the start of an input
$	Matches the end of an input
.	Matches any single character except a newline
|	Matches either of the alternatives given. E.g. `x|y` will match either `x` or `y`
?	Matches the character before it zero or one time. E.g. `s?it` will match `sit` or `it`
+	Matches the character before it one or more times. E.g. `a+` will match the `a` in bag and the `aaaa` in baaaag
[ ]	Matches any single character listed inside it. E.g. [A-Z] will match any one uppercase letter from A to Z
\w	Matches any word character including underscore. For ASCII text it is equivalent to [A-Za-z0-9_]
\W	Matches any non-word character. For ASCII text it is equivalent to [^A-Za-z0-9_]
\d	Matches any digit. i.e. 0-9
\D	Matches any non-digit character
\s	Matches any whitespace
\S	Matches any non-whitespace
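A few rows of the table can be checked quickly in Python. This is only a sketch: the sample strings are made up, and re.findall() is introduced later in this tutorial.

```python
import re

# ? -> zero or one of the preceding character
optional = re.findall(r's?it', 'sit it')
print(optional)      # ['sit', 'it']

# + -> one or more of the preceding character
one_or_more = re.findall(r'a+', 'bag baaaag')
print(one_or_more)   # ['a', 'aaaa']

# | -> either alternative
either = re.findall(r'x|y', 'x and y')
print(either)        # ['x', 'y']

# [ ] -> any single character from the set
uppercase = re.findall(r'[A-Z]', 'Hello World')
print(uppercase)     # ['H', 'W']

# \D -> any non-digit character (with + to grab whole runs)
non_digits = re.findall(r'\D+', 'abc123def')
print(non_digits)    # ['abc', 'def']
```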

For more information on metacharacters, check this resource.

We will be using Python's built-in re module for regex matching operations.

Let's import re :

import re

To create a pattern, you can use the compile function, which takes the pattern you want to match and, optionally, some flags.

re.compile(pattern, flags=0)

Let's say we want to create a pattern to extract some number from the text : I am 25 years old. We can type :

re.compile(r'\d+')

r marks the pattern as a raw string. Characters such as \ have a special meaning in ordinary Python strings (for example, '\n' is a newline), so we write patterns as raw strings to pass the backslashes through to the regex engine unchanged.
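To see why the raw-string prefix matters, here is a small sketch comparing the two kinds of string. The sample text is made up, and \b (the regex word-boundary token) is not covered elsewhere in this part of the tutorial.

```python
import re

# In an ordinary string, '\n' is one newline character;
# in a raw string, r'\n' is two characters: a backslash and an n.
lengths = (len('\n'), len(r'\n'))
print(lengths)        # (1, 2)

# '\b' in an ordinary string is the backspace character, so this
# pattern looks for literal backspaces and finds nothing.
without_raw = re.findall('\bcat\b', 'a cat sat')
print(without_raw)    # []

# r'\b' reaches the regex engine intact and means "word boundary".
with_raw = re.findall(r'\bcat\b', 'a cat sat')
print(with_raw)       # ['cat']
```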

We will be using the following functions to match a pattern:

re.match() -> checks for a match only at the beginning of the string

re.search() -> checks for a match anywhere in the string

re.findall() -> returns all non-overlapping matches in the string as a list

Suppose we want to check whether Coming is at the beginning of the text below :

Coming is a verb

We will first create our pattern :

# Create our pattern
pattern = re.compile(r'Coming')

Let's use match() to match our pattern :

text = 'Coming is a verb'

# Creating our Match Object
match = pattern.match(text)
print(match)

## Output:
<re.Match object; span=(0, 6), match='Coming'>

Alternatively, we can use re.match() directly :

match = re.match(pattern, text)
print(match)

## Output:
<re.Match object; span=(0, 6), match='Coming'>

NOTE: When using match(), if the pattern isn't found at the beginning of the text, there will be no match.

Let's verify that with an example below :

pattern = re.compile(r'Coming')
text = 'Is Coming a verb?'

# Creating our Match Object
match = pattern.match(text)
print(match)

## Output:
None

As illustrated above, because the text begins with Is, there will be no match.

We can rectify this by using the search() function instead :

pattern = re.compile(r'Coming')
text = 'Is Coming a verb?'

# Creating our Match Object
match = pattern.search(text)
print(match)

## Output:
<re.Match object; span=(3, 9), match='Coming'>

Yes! We have been able to match Coming. This is because the search() function matches anywhere within the text.
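Printing a Match object is handy for inspection, but in practice you usually want the matched text or its position. A minimal sketch, using the Match methods group(), start(), end() and span() from the re module:

```python
import re

match = re.search(r'Coming', 'Is Coming a verb?')

# search() returns None when nothing matches, so check before using it
if match:
    print(match.group())               # Coming
    print(match.start(), match.end())  # 3 9
    print(match.span())                # (3, 9)
```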

Now, what if a particular pattern exists multiple times within a text and we would like to detect all the instances of that pattern?

Example: Say, we want to detect all occurrences of a number within the string below :

These are four-digit numbers : 1245, 1220, 9028.

Using the search() function will only match the first occurrence of the number :

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.search(text)
print(match)

## Output
<re.Match object; span=(31, 35), match='1245'>

Intuition behind the above code :

Our pattern \d+ has two components : \d and +.

\d will match any single digit like 1, 2, ...

+ is a quantifier which, when added to \d, matches one or more consecutive digits and stops at the first non-digit character, such as whitespace or a letter. Eg: 1245

search() then goes through our text and once it sees a single pattern as described above, it immediately matches and returns that pattern. In this case it will match only 1245.

NOTE: search() only returns a single occurrence of the match.

We can use findall() to match all occurrences of the pattern in our text:

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.findall(text) # -> Returns a list
print(match)

## Output:
['1245', '1220', '9028']

Suppose you have a large chunk of data and you don't want all the matches at once. In that case, you can retrieve the matches one at a time.

finditer() can help us achieve that.

Let's get the four-digit numbers one at a time :

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.finditer(text) # -> Returns a callable iterator

# Let's check the type of the match 
print(type(match))

## Output 
<class 'callable_iterator'>

To get the next item in the iterator object, we can use the next() function in Python.

Let's get the matches one at a time :

print(next(match))  # -> Outputs the first match 

print(next(match))  # -> Outputs the second match 

print(next(match))  # -> Outputs the last match 

## Output
<re.Match object; span=(31, 35), match='1245'>
<re.Match object; span=(37, 41), match='1220'>
<re.Match object; span=(43, 47), match='9028'>
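Calling next() by hand works for a known number of matches, but once the iterator is exhausted, another next() call raises StopIteration. A for loop is the more common way to consume finditer(); the sketch below collects each match's text and span (the names found and m are illustrative).

```python
import re

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+')

found = []
for m in pattern.finditer(text):
    # Each m is a Match object, produced lazily one at a time
    found.append((m.group(), m.span()))
    print(m.group(), m.span())

# 1245 (31, 35)
# 1220 (37, 41)
# 9028 (43, 47)
```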

Using the `^` and `$` metacharacters

^ is placed before a pattern to match it only at the beginning of a text. E.g. we can check whether the word The is at the beginning of a text by typing ^The.

Let's illustrate that with an example:
We can detect whether The is at the beginning of the text below:

The work is super easy.

We can achieve that as illustrated:

text = 'The work is super easy.'
pattern = re.compile(r'^The')

match = pattern.search(text)

print(match)

## Output
<re.Match object; span=(0, 3), match='The'>

In the same way, $ is used to match whether a character or some set of characters is at the end of a line.
Let's check if cool is at the end of the sentence in the text below :

Regex is super cool

Code :

text = 'Regex is super cool'
pattern = re.compile(r'cool$')
match = pattern.search(text)
print(match)

## Output:
<re.Match object; span=(15, 19), match='cool'>

NOTE: By default, ^ and $ have a limitation: ^ matches only at the very start of the text and $ only at its end (or just before a final newline), not at the start or end of each line. In NLP and other applications, you might be working with multi-line documents which you would have to preprocess to extract patterns.

Let's consider an example.

Suppose we want to extract the first user id (24ga-d34) from the string:

'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

in which each user id sits at the beginning of a new line.

Using ^ with the search() function alone wouldn't work :

pattern = re.compile(r'^\d{2}[a-z]{2}-[a-z]\d{2}')
text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

match = pattern.search(text)
print(match)

## Output:
None

We can fix this by adding a re.MULTILINE or re.M flag to our compile() function.

You can check all the available flags in re module.

The re.MULTILINE flag lifts this restriction: ^ then matches at the beginning of every line in the text, and $ at the end of every line.

Code :

import re
pattern = re.compile(r'^\d{2}[a-z]{2}-[a-z]\d{2}', re.MULTILINE)
text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

match = pattern.search(text)
print(match)

## Output:
<re.Match object; span=(9, 17), match='24ga-d34'>

Intuition behind the above code :

^ -> matches the pattern at the beginning of a line

\d{2} -> matches exactly two digits

[a-z]{2} -> matches exactly two lowercase letters

- -> matches a hyphen

[a-z] -> matches any single lowercase letter

\d{2} -> matches exactly two digits

re.MULTILINE -> overrides the default behavior of ^, which matches only at the beginning of the whole text, so that it matches at the beginning of every line.

{n} is a quantifier which matches the element before it exactly n times, where n is a non-negative integer.
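Putting these pieces together, the sketch below collects every user id rather than just the first. It combines the same pattern with $ and findall(); adding $ is an extra assumption on top of the original pattern, namely that each id fills its whole line.

```python
import re

text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

# With re.MULTILINE, ^ and $ anchor to the start and end of every line
pattern = re.compile(r'^\d{2}[a-z]{2}-[a-z]\d{2}$', re.MULTILINE)
user_ids = pattern.findall(text)
print(user_ids)   # ['24ga-d34', '87bx-f60', '47nd-q21']

# Without the flag, ^ only matches at the start of the whole string,
# which begins with 'User', so nothing is found.
no_flag = re.findall(r'^\d{2}[a-z]{2}-[a-z]\d{2}$', text)
print(no_flag)    # []
```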

To be continued later...

Conclusion

In this tutorial, you learnt the difference between literals and metacharacters in regex. You also learnt how to use these metacharacters to match patterns in text using the re module in Python. In the next part of the tutorial, we will delve into other preprocessing techniques in NLP.

Follow me for more of this content. Let's connect on LinkedIn!


Top comments (1)

Divyanshu Katiyar • Jan 16 '23

Really nice post! It shows why preprocessing the NLP data is needed before training any models on this data. We want our models to be as cost effective as possible so the first step should always be data preprocessing.