Ecaterina Teodoroiu

Natural language processing: A data science tutorial in Python

Introduction: What is Natural Language Processing?

Natural language processing (NLP) is the ability of a computer to understand human speech and text.
It is a subset of artificial intelligence and can be used for many things, such as finding patterns in text or speech, translating between languages, and even generating text that sounds as if it were written by a human.
Natural language processing was described as early as 1965 by Joseph Weizenbaum, who called it “computer understanding of natural language” and created the ELIZA program, which could mimic conversations with humans.


Natural Language Processing Uses and Applications

One of the first uses of natural language processing came in the 1940s, when a machine was programmed to translate Russian into English. The Russians were not pleased with the results, which led to a ban on the export of computers from America.

Some applications of NLP are:

  • image recognition software that identifies objects or people in pictures,
  • chatbots that can answer questions and provide information,
  • speech recognition software that transcribes a person's spoken words,
  • sentiment analysis software that analyzes text to detect positive or negative sentiment,
  • AI writing software, and more.

One of the main reasons natural language processing is considered more challenging than other data science domains is that assigning meaning to words is difficult. Another reason is that language is ambiguous, so many tasks have no single clear solution. It is also a highly complex problem tackled with many different types of algorithms.

Syntactic Analysis vs Semantic Analysis

There are two primary ways to understand natural language: syntactic analysis and semantic analysis.

Syntactic analysis

is the process of analyzing a text or sentence to identify its words and grammatical structure. It is primarily concerned with the form of the text, while semantic analysis focuses on the content and meaning of the text.

Syntactic analysis breaks a sentence down into its constituent parts and can be used to identify verb tense, grammatical person, and the grammatical function of words in a sentence.

Semantic analysis

is primarily concerned with understanding what the text means. For example, syntactic analysis might show that 'I drank water' contains an action verb (drank) in the past tense, while semantic analysis might show that 'I'm drinking water' can describe, depending on context, either a current action or a near-future intention.
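
To make the distinction concrete, here is a minimal sketch of syntactic analysis using spaCy (the library introduced later in this tutorial); it assumes the small English pipeline en_core_web_sm has already been downloaded.

import spacy

# Assumes the pipeline was installed with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("I drank water")
for token in doc:
    # Word, coarse part of speech, fine-grained tag, and syntactic role
    print(token.text, token.pos_, token.tag_, token.dep_)

The fine-grained tag (for example, VBD for a past-tense verb) is purely syntactic information: it describes the form of the sentence, not what it means.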

Popular Natural Language Processing Packages

Some of the most popular packages for different NLP methods are the following:

Python Natural Language Toolkit (NLTK)

The Python Natural Language Toolkit (NLTK) is a library that aids in the research and development of natural language processing. It provides resources for beginners to experts in the field, and can be used to develop programs that can answer questions, extract information from text, recognize textual patterns and much more.

spaCy

spaCy is another popular NLP package. It ships with pre-trained pipelines for many languages and provides an API for building custom models on top of them.

NLP packages are designed to help provide the necessary tools for building an intelligent application that can process natural language input. They are essentially libraries of functions and algorithms that provide the capability to parse, analyze, and understand text.

Hugging Face

is one of the most widely used NLP packages out there right now.

AI writing assistants are changing the world of copywriting. These tools understand the structure of language and can quickly produce original content in a variety of formats, such as blog posts, social media posts, articles, and emails, at a fraction of the usual cost.

Hugging Face has gained significant traction because its library of pre-trained transformer models makes it straightforward to build applications that understand human language and generate text.
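
As a minimal sketch of what working with Hugging Face looks like (assuming the transformers library and a backend such as PyTorch are installed), its pipeline API loads a default pre-trained model for a given task:

from transformers import pipeline

# Downloads a default pre-trained sentiment model on first use
classifier = pipeline("sentiment-analysis")

print(classifier("Natural language processing is fascinating!"))
# Example output (the exact model and score may vary):
# [{'label': 'POSITIVE', 'score': 0.99...}]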

First, we will import all necessary libraries that we will be using throughout this tutorial.
import spacy
import nltk
import gensim
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk import ngrams
from sklearn import preprocessing
import gensim.downloader as api

# NLTK needs these resources downloaded once before we can
# tokenize, remove stopwords, and lemmatize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

Data Cleaning in NLP

Data cleaning is a process that ensures the data is of high quality, accurate, and error-free. The data cleaning process largely depends on the type of data that needs to be cleaned.

Cleaning text is different from cleaning other types of data such as images or audio. Textual data can be cleaned by checking for spelling mistakes, punctuation errors, and syntactic errors.

The first step in any type of data cleaning is to identify what kind of errors need to be fixed and which steps are required for fixing them.

We remove special characters such as $, %, #, @, <, and >. These symbols don't contain any information for our model to learn; they act as noise in our data, so we discard them.
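
The tutorial's main pipeline starts from already clean text, so as a minimal illustration of this step, here is a sketch using Python's built-in re module; the sample string and the exact characters stripped are only assumptions for the example.

import re

# A noisy example string (made up for illustration)
raw_text = "Hello <world>! This costs $5 #deal"

# Keep only letters, digits, and whitespace; everything else is noise
cleaned_text = re.sub(r"[^A-Za-z0-9\s]", "", raw_text)
print(cleaned_text)
# Hello world This costs 5 deal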

Data Preprocessing in NLP

Data preprocessing is an important step in natural language processing and the first step in any NLP pipeline. It can be done in many different ways; in this tutorial we'll go over a few of them.

Lowercase

If any character in our text is in uppercase, we convert it to lowercase. We also remove punctuation: in English, punctuation doesn't carry much information for our model, so we can remove it and focus on the words. Finally, we split sentences into individual word tokens, which makes analysis such as part-of-speech tagging easier. For example, "I went to school" would be split into the tokens "i", "went", "to", "school".

Let’s take a look at how to convert our textual data to lowercase.
text_data = """
Let's convert this demo text to Lowercase for this NLP Tutorial using NLTK. NLTK stands for Natural Language Toolkit
"""
lower_text = text_data.lower()
print(lower_text)

Output:
let's convert this demo text to lowercase for this nlp tutorial using nltk. nltk stands for natural language toolkit

Tokenization

In tokenization, we take the text from our documents and break it down into individual words. For example, tokenizing the text "A sea of content" gives the tokens "A", "sea", "of", and "content".
Now we will make word tokens of our textual data and print them.
word_tokens = word_tokenize(text_data)
print(word_tokens)

Output:
[‘Let’, “‘s”, ‘convert’, ‘this’, ‘demo’, ‘text’, ‘to’, ‘Lowercase’, ‘for’, ‘this’, ‘NLP’, ‘Tutorial’, ‘using’, ‘NLTK’, ‘.’, ‘NLTK’, ‘stands’, ‘for’, ‘Natural’, ‘Language’, ‘Toolkit’]

Stopwords Removal

Stopwords are words that don't add much useful information to the document; they mostly act as noise in the text data.

The stopword list may differ between domains. For example, if "the" and "to" are among the tokens in our stopword list, removing stopwords from the sentence "The dog belongs to Jim" leaves us with "dog belongs Jim".

Stopwords don't hold a great deal of information, so let's filter them out of our word tokens:
stopword = stopwords.words('english')
removing_stopwords = [word for word in word_tokens if word not in stopword]
print(removing_stopwords)

Output:
[‘Let’, “‘s”, ‘convert’, ‘demo’, ‘text’, ‘Lowercase’, ‘NLP’, ‘Tutorial’, ‘using’, ‘NLTK’, ‘.’, ‘NLTK’, ‘stands’, ‘Natural’, ‘Language’, ‘Toolkit’]

Stemming

In stemming we reduce a word to its root form (stem) by stripping away inflection. A stem doesn't have to be a valid English word: for example, the Porter stemmer reduces "Tutorial" to "tutori" and "Natural" to "natur", as you can see in the output below. We'll be using the Porter stemmer, a popular stemming algorithm, to stem our word tokens.
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in word_tokens]
print(stemmed_words)

Output:
[‘let’, “‘s”, ‘convert’, ‘thi’, ‘demo’, ‘text’, ‘to’, ‘lowercas’, ‘for’, ‘thi’, ‘nlp’, ‘tutori’, ‘use’, ‘nltk’, ‘.’, ‘nltk’, ‘stand’, ‘for’, ‘natur’, ‘languag’, ‘toolkit’]

Lemmatization

Lemmatization is also a process of reducing words to their root form. It does the same thing as stemming, but the root word (lemma) it produces is a real word with meaning. For example, "dancing" would become "dance" and "running" would become "run".
wnl = WordNetLemmatizer()
word_tokens2 = ["corpora", "better", "rocks", "care", "classes"]
lemmatized_word = [wnl.lemmatize(word) for word in word_tokens2]
print(lemmatized_word)

Output:
[‘corpus’, ‘better’, ‘rock’, ‘care’, ‘class’]
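
By default, WordNetLemmatizer treats every word as a noun, which is why "better" stays unchanged in the output above. Passing a part-of-speech hint gives lemmas closer to the "running" → "run" example mentioned earlier; a small sketch (the specific words are just examples):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# 'v' marks the word as a verb, 'a' as an adjective
print(wnl.lemmatize("running", pos="v"))  # run
print(wnl.lemmatize("better", pos="a"))   # good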

N-Grams

N-grams are contiguous sequences of N items (letters, words, or phrases) that occur in a document. They are used to preserve the sequential information present in the document.

By grouping neighboring words into short phrases, N-grams capture local context that individual tokens lose.

They can be used for various purposes such as identifying key concepts, locating important points, understanding the flow of information, and summarizing content.
When N = 1, they are called unigrams. When N = 2, they are called bigrams. When N = 3, they are called trigrams. And so on.

For example, for the sentence "Today is Tuesday.":

Unigrams = Today, is, Tuesday

Bigrams = Today is, is Tuesday

Trigrams = Today is Tuesday
Let's see how we can convert our text data to N-grams. Here the value of N will be 3, so we'll be making trigrams.
n_grams = ngrams(text_data.split(), 3)
for grams in n_grams:
    print(grams)

Output:
(“Let’s”, ‘convert’, ‘this’)
(‘convert’, ‘this’, ‘demo’)
(‘this’, ‘demo’, ‘text’)
(‘demo’, ‘text’, ‘to’)
(‘text’, ‘to’, ‘Lowercase’)
(‘to’, ‘Lowercase’, ‘for’)
(‘Lowercase’, ‘for’, ‘this’)
(‘for’, ‘this’, ‘NLP’)
(‘this’, ‘NLP’, ‘Tutorial’)
(‘NLP’, ‘Tutorial’, ‘using’)
(‘Tutorial’, ‘using’, ‘NLTK.’)
(‘using’, ‘NLTK.’, ‘NLTK’)
(‘NLTK.’, ‘NLTK’, ‘stands’)
(‘NLTK’, ‘stands’, ‘for’)
(‘stands’, ‘for’, ‘Natural’)
(‘for’, ‘Natural’, ‘Language’)
(‘Natural’, ‘Language’, ‘Toolkit’)

Word Vectorization

Word vectorization is a process of converting texts into vectors. These vectors are numerical representations of words and sentences. This process is used to help machines understand textual data.

The main purpose of word vectorization is to turn the text into a format that will make it easier for machines to understand the data.

One way to do this is to use neural networks to map out the relationships between words and their contexts; those learned relationships then supply meaning and context for words and sentences the model encounters later.

One Hot Vector Encoding

One-hot vector encoding is a technique that represents each word in the corpus as a binary vector: the vector has one position for every word in the vocabulary, with a 1 in the position of the word being encoded and 0 everywhere else.

One-hot vectors are simple to build and quick to compute, but they are sparse and treat every pair of words as equally different, so they do not capture any similarity in meaning between words.

One of the main advantages of one-hot vector encoding is that it requires no training and no human intervention.
word_tokens3 = ['corpora', 'better', 'rocks', 'care', 'classes', 'better', 'apple']
# First map each word to an integer label...
lab_encoder = preprocessing.LabelEncoder()
int_label_encoder = lab_encoder.fit_transform(word_tokens3)
lab_encoded = int_label_encoder.reshape(len(int_label_encoder), 1)
# ...then expand each integer label into a one-hot vector
# (on scikit-learn versions older than 1.2, use sparse=False instead)
one_hot_encoder = preprocessing.OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(lab_encoded)
print(one_hot_encoded)
print(word_tokens3)

Output:
[[0. 0. 0. 0. 1. 0.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1.]
[0. 0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0.]]
[‘corpora’, ‘better’, ‘rocks’, ‘care’, ‘classes’, ‘better’, ‘apple’]

Word2Vec

The word2vec algorithm is a machine learning algorithm that can be used to find relationships between words. It is used in natural language processing and information retrieval.

It has been shown to be useful for many tasks, including classification, clustering, indexing, spelling correction, and translation.
Now let's see how we can work with Word2Vec vectors. We won't be training a Word2Vec model from scratch; we'll load a pre-trained model using Gensim, another important package for NLP methods.
# Loads the large pre-trained Google News vectors (downloaded on first use)
model = api.load("word2vec-google-news-300")
# Words most similar to "obama" by cosine similarity
model.most_similar("obama")

Output:
[(‘romney’, 0.9566564559936523),
(‘president’, 0.9400959610939026),
(‘barack’, 0.9376799464225769),
(‘clinton’, 0.9285898804664612),
(‘says’, 0.9087842702865601),
(‘bill’, 0.9080009460449219),
(‘claims’, 0.9074634909629822),
(‘hillary’, 0.8889248371124268),
(‘talks’, 0.8864543437957764),
(‘government’, 0.8833804130554199)]

There are two architectures we can use to train Word2Vec vectors:

CBOW Model

In the CBOW (continuous bag of words) model, we predict the target (center) word using the context (neighboring) words.

The CBOW model is faster to train than the skip-gram model because it requires fewer computations, and it represents frequent words well.

Skip Gram Model

The skip-gram model does the opposite: it predicts the context (surrounding) words from the target (center) word, and it tends to represent rare words better than CBOW.

Skip-gram embeddings have been successfully applied to many tasks, including sentiment prediction, caption generation, and machine translation.

The model takes the target word as input and outputs a probability distribution over the words in its surrounding context window.
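
As a minimal sketch of how both architectures can be trained with Gensim, here is a toy example; the tiny corpus and parameter values are illustrative assumptions, and real models need far more data.

from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences (made up for illustration)
sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["we", "love", "natural", "language", "processing"],
    ["word", "vectors", "capture", "word", "meaning"],
]

# sg=0 trains a CBOW model, sg=1 trains a skip-gram model
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Each model maps every word in the vocabulary to a 50-dimensional vector
print(cbow_model.wv["language"].shape)
print(skipgram_model.wv.most_similar("language", topn=3))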

Named Entity Recognition

Named entity recognition is a process of identifying word sequences in text that refer to specific entities.

It is an important NLP method for recognizing entities. It can be used for a number of purposes, such as extracting information from text, recognizing key phrases and discovering new relationships between entities.

Named entity recognition can be performed automatically by using a machine learning algorithm or manually by human coders.
A named entity recognition model takes text as input and returns the entities present in the text along with their labels.

It has numerous applications. It can be used for content classification: we can detect entities in text and classify the content based on those entities.

In academia and research, it can be used for retrieving information faster.
Now let's take a look at how we can do NER in Python. First we'll load a pre-trained spaCy pipeline that was trained on many different kinds of text. With it we could apply several NLP methods; for now, let's focus on NER.

# Load the small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Output:

Sebastian Thrun PERSON
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE

Natural language processing summary

In this data science tutorial, we looked at different methods for natural language processing (NLP). We went through several preprocessing techniques that prepare text for modeling, discussed word vectors and why we use them in NLP, and then used NER to identify entities and their labels in our text.

Top comments (2)

Divyanshu Katiyar

Thank you for such an informative article! This indeed contains a good overview of the new NLP models and concepts that are used in today's world of NLP. Can't believe our machines have acquired the power to perform complex tasks like translation, summarization, etc. :)

Ecaterina Teodoroiu

Yes, unbelievable... but at the same time very useful :)