Introduction
Stemming and lemmatization are techniques used in text processing. In Natural Language Processing (NLP), text processing is needed to normalize the text. The aim of text normalization is to reduce the amount of information a machine has to handle, thus improving the efficiency of the machine learning process.
Both stemming and lemmatization involve reducing the inflected forms of words to their root forms. Inflected forms of a word are the variants derived from its root or base form. For example, the words jumped, jumping and jumps are inflected forms of the root word jump. Likewise, creating, created and creates are inflected forms of the root word create, and so on.
Prerequisites
- Basic knowledge of Python programming
- Python installed
- Natural Language Toolkit (nltk) package installed
What is the difference between stemming and lemmatization?
The main difference between stemming and lemmatization is that stemming simply chops off the suffixes of a word to reduce it to its root form, while lemmatization first takes the context of the word into account and uses that context to convert the word to its meaningful base form, known as the lemma.
Below are examples of words that stemming and lemmatization have been performed on.
Stemming Examples
Word --- Porter Stemmer
- jumped --- jump
- friends --- friend
- football --- footbal
- mysteries --- mysteri
- created --- creat
- took --- took
Lemmatization Examples
Word --- Lemmatized word
- jumped --- jump
- friends --- friend
- football --- football
- mysteries --- mystery
- created --- create
- took --- take
How to carry out stemming
The Natural Language Toolkit (nltk) package provides several stemmers for English, including PorterStemmer and LancasterStemmer.
We are going to use PorterStemmer to carry out stemming.
First, let's import PorterStemmer:
from nltk.stem import PorterStemmer
Let's now create a list of words that we want to stem
word_list = ["jumped", "friendship", "friends", "swimming","creation","stability","writing",
"realize","mystery","football", "mysteries", "created", "took"]
We will now stem every word in the list and then print the word with its stemmed version.
stemmer = PorterStemmer()
for word in word_list:
    print((word, stemmer.stem(word)))
Output
('jumped', 'jump')
('friendship', 'friendship')
('friends', 'friend')
('swimming', 'swim')
('creation', 'creation')
('stability', 'stabil')
('writing', 'write')
('realize', 'realiz')
('mystery', 'mysteri')
('football', 'footbal')
('mysteries', 'mysteri')
('created', 'creat')
('took', 'took')
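nltk also ships the more aggressive LancasterStemmer mentioned above. A quick side-by-side sketch shows how the two stemmers can disagree (the exact stems depend on each stemmer's rule set):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster applies its rules iteratively, so its stems are
# often shorter (and less readable) than Porter's.
for word in ["friendship", "stability", "mysteries"]:
    print(word, porter.stem(word), lancaster.stem(word))
```

Heavier stemming can conflate unrelated words, so Porter is usually the safer default.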
How to carry out lemmatization
As mentioned earlier, lemmatization, just like stemming, reduces a word to its root form, but for lemmatization we first tag each word with its part of speech before carrying out the lemmatization. For example, every word that is a verb is given the verb (v) tag, every word that is a noun is given the noun (n) tag, and so on. (If no tag is supplied, WordNetLemmatizer treats the word as a noun by default.)
Let's first import the libraries that we will be using. (If you have not done so before, you may also need to fetch the wordnet corpus and the part-of-speech tagger with nltk.download.)
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
As a start, let's create a function for tagging the words. We will use nltk's pos_tag for the tagging itself, and wordnet for the tag conversion in the next step.
def tag(doc):
    # POS tagging
    tagged_tokens = nltk.pos_tag(doc)
    return tagged_tokens
Next, let's create a function for converting the part-of-speech (pos) tags to wordnet tags.
# function for converting tags
def pos_tag_wordnet(tagged_tokens):
    tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}
    new_tagged_tokens = [(word, tag_map.get(tag[0].lower(), wordnet.NOUN))
                         for word, tag in tagged_tokens]
    return new_tagged_tokens
Let's now tag the words in the word list from before, then convert the tags and print the output.
# tag the words
tagged_tokens = tag(word_list)
# convert the tags
wordnet_tokens = pos_tag_wordnet(tagged_tokens)
print(wordnet_tokens)
Output
[('jumped', 'v'), ('friendship', 'n'), ('friends', 'n'), ('swimming', 'v'), ('creation', 'n'), ('stability', 'n'), ('writing', 'v'), ('realize', 'v'), ('mystery', 'n'), ('football', 'n'), ('mysteries', 'n'), ('created', 'v'), ('took', 'v')]
From the output, we can see that we've got verbs (v) and nouns (n).
Let's now lemmatize the tagged words.
wnl = WordNetLemmatizer()
for word, tag in wordnet_tokens:
    print((word, wnl.lemmatize(word, tag)))
Output
('jumped', 'jump')
('friendship', 'friendship')
('friends', 'friend')
('swimming', 'swim')
('creation', 'creation')
('stability', 'stability')
('writing', 'write')
('realize', 'realize')
('mystery', 'mystery')
('football', 'football')
('mysteries', 'mystery')
('created', 'create')
('took', 'take')
Conclusion
In this article, we've learned what stemming and lemmatization are and how they differ. Both are useful text-processing techniques: stemming is fast but can produce stems that are not real words, while lemmatization is slower but returns a meaningful dictionary form.