Introduction
Stemming and lemmatization are techniques used in text processing. In Natural Language Processing (NLP), text processing is needed to normalize the text. The aim of text normalization is to reduce the amount of information a machine has to handle, thus improving the efficiency of the machine learning process.
Both stemming and lemmatization involve reducing the inflected forms of words to their root forms. Inflected forms of a word are the variants derived from its root or base form. For example, the words jumped, jumping and jumps are inflected forms of the root word jump. Likewise, creating, created and creates are inflected forms of the root word create, and so on.
Prerequisites
- Basic knowledge of Python programming
- Python installed
- Natural Language Toolkit (nltk) package installed
What is the difference between stemming and lemmatization?
The main difference between stemming and lemmatization is that stemming simply chops off the suffixes of a word to reduce it to its root form, while lemmatization first takes the context of the word into account and uses that context to convert the word to its meaningful base form, known as the lemma.
Below are examples of words that stemming and lemmatization have been performed on.
Stemming Examples
Word --- Porter Stemmer
- jumped --- jump
- friends --- friend
- football --- footbal
- mysteries --- mysteri
- created --- creat
- took --- took
Lemmatization Examples
Word --- Lemmatized word
- jumped --- jump
- friends --- friend
- football --- football
- mysteries --- mystery
- created --- create
- took --- take
How to carry out stemming
The Natural Language Toolkit (nltk) package provides several stemmers for English, including PorterStemmer and LancasterStemmer.
We are going to use PorterStemmer to carry out stemming.
First, let's import PorterStemmer:
from nltk.stem import PorterStemmer
Let's now create a list of words that we want to stem
word_list = ["jumped", "friendship", "friends", "swimming","creation","stability","writing",
"realize","mystery","football", "mysteries", "created", "took"]
We will now stem every word in the list and then print the word with its stemmed version.
stemmer = PorterStemmer()
for word in word_list:
    print((word, stemmer.stem(word)))
Output
('jumped', 'jump')
('friendship', 'friendship')
('friends', 'friend')
('swimming', 'swim')
('creation', 'creation')
('stability', 'stabil')
('writing', 'write')
('realize', 'realiz')
('mystery', 'mysteri')
('football', 'footbal')
('mysteries', 'mysteri')
('created', 'creat')
('took', 'took')
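nltk also ships the more aggressive LancasterStemmer mentioned above. A quick side-by-side sketch shows how the two stemmers can disagree (the exact stems depend on each stemmer's rule set):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster applies its rules iteratively, so its stems are
# often shorter (and less readable) than Porter's.
for word in ["friendship", "stability", "mysteries"]:
    print(word, porter.stem(word), lancaster.stem(word))
```

Heavier stemming can conflate unrelated words, so Porter is usually the safer default.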
How to carry out lemmatization
As mentioned earlier, lemmatization, just like stemming, reduces a word to its root form, but for lemmatization we first tag each word with its part of speech before carrying out the lemmatization. For example, every word that is a verb is given the verb (v) tag, every word that is a noun is given the noun (n) tag, and so on. (If no tag is supplied, WordNetLemmatizer treats the word as a noun by default.)
Let's first import the libraries that we will be using. (If you have not done so before, you may also need to fetch the wordnet corpus and the part-of-speech tagger with nltk.download.)
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
As a start, let's create a function for tagging the words. We will use nltk's pos_tag for the tagging itself, and wordnet for the tag conversion in the next step.
def tag(doc):
    # POS tagging
    tagged_tokens = nltk.pos_tag(doc)
    return tagged_tokens
Next, let's create a function for converting the part-of-speech (pos) tags to wordnet tags.
# function for converting tags
def pos_tag_wordnet(tagged_tokens):
    tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}
    new_tagged_tokens = [(word, tag_map.get(tag[0].lower(), wordnet.NOUN))
                         for word, tag in tagged_tokens]
    return new_tagged_tokens
Let's now tag the words in the word list from before, then convert the tags and print the output.
# tag the words
tagged_tokens = tag(word_list)
# convert the tags
wordnet_tokens = pos_tag_wordnet(tagged_tokens)
print(wordnet_tokens)
Output
[('jumped', 'v'), ('friendship', 'n'), ('friends', 'n'), ('swimming', 'v'), ('creation', 'n'), ('stability', 'n'), ('writing', 'v'), ('realize', 'v'), ('mystery', 'n'), ('football', 'n'), ('mysteries', 'n'), ('created', 'v'), ('took', 'v')]
From the output, we can see that we've got verbs (v) and nouns (n).
Let's now lemmatize the tagged words.
wnl = WordNetLemmatizer()
for word, tag in wordnet_tokens:
    print((word, wnl.lemmatize(word, tag)))
Output
('jumped', 'jump')
('friendship', 'friendship')
('friends', 'friend')
('swimming', 'swim')
('creation', 'creation')
('stability', 'stability')
('writing', 'write')
('realize', 'realize')
('mystery', 'mystery')
('football', 'football')
('mysteries', 'mystery')
('created', 'create')
('took', 'take')
Conclusion
In this article, we've learned what stemming and lemmatization are and how they differ. Both are useful text-processing techniques: stemming is fast but can produce stems that are not real words, while lemmatization is slower but returns a meaningful dictionary form.