DEV Community

14 tasks for text preprocessing in NLP

amananandrai
Data Science and Machine Learning Enthusiast

Natural Language Processing (NLP), a subfield of Machine Learning, mainly deals with text data. Its tasks include sentiment analysis (for example, deciding whether reviews of books, movies, or Play Store apps are positive or negative), text generation for chatbots, query analysis and resolution for search engines, and many other text-related tasks.

Preprocessing datasets is one of the most arduous parts of the machine learning pipeline, and text preprocessing in particular requires many steps. Some of the tasks involved in dealing with text datasets are given below.

Lower casing

All text is converted to lower case so that the same word in different casings (for example, "Movie" and "movie") is given the same weight.
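
A minimal sketch of this step using Python's built-in `str.lower()`. The function name `to_lower` and the sample sentence are made up for illustration.

```python
def to_lower(text: str) -> str:
    # str.lower() converts every cased character to its lower-case form
    return text.lower()


sample = "The Movie was GREAT"  # hypothetical sample text
print(to_lower(sample))  # the movie was great
```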

Removal of punctuation

Punctuation symbols are removed from the dataset because they carry little information for many tasks, such as word prediction and sentiment analysis.
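
One possible way to do this is with a translation table built from `string.punctuation`, which covers ASCII punctuation only. The helper name `remove_punctuation` is an assumption for this sketch, and non-ASCII punctuation would need extra handling.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    return text.translate(PUNCT_TABLE)


print(remove_punctuation("Wow!!! This is great, isn't it?"))
# Wow This is great isnt it
```

Note that apostrophes are removed as well, so "isn't" becomes "isnt". Whether that is acceptable depends on the task.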

Removal of Stopwords

Stopwords are common words that do not add much meaning to a sentence, such as "the", "he", and "have" in English. They can usually be ignored without sacrificing the meaning of the sentence, so they are removed from the dataset.
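
A small sketch of stopword filtering. The `STOPWORDS` set below is a tiny hand-picked list for illustration only; real projects typically use a much longer list shipped with an NLP library.

```python
# Hypothetical, deliberately tiny stopword list
STOPWORDS = {"the", "he", "have", "a", "an", "is", "to", "of", "and"}


def remove_stopwords(text: str) -> str:
    # Compare in lower case so "He" and "he" are treated alike
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)


print(remove_stopwords("He is going to the market"))  # going market
```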

Removal of frequent words

Sometimes very frequent words are also removed in text classification tasks. Because they appear in every class, they do little to separate the classes, and removing them can improve classification accuracy.
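
A rough sketch of this idea, counting words across the whole corpus with `collections.Counter` and dropping the most common ones. The function name, the `top_n` cutoff, and the sample documents are all assumptions; in practice the cutoff is tuned per dataset.

```python
from collections import Counter


def remove_frequent_words(docs, top_n=2):
    # Count every whitespace-separated word across all documents
    counts = Counter(word for doc in docs for word in doc.split())
    frequent = {word for word, _ in counts.most_common(top_n)}
    return [
        " ".join(word for word in doc.split() if word not in frequent)
        for doc in docs
    ]


docs = ["movie was good", "movie was bad", "movie plot was thin"]
print(remove_frequent_words(docs))  # ['good', 'bad', 'plot thin']
```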

Removal of Rare words

In some cases, rare words are also removed because they behave like outliers: they occur too seldom for a model to learn anything reliable from them.
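
The same counting approach can be sketched for rare words, this time dropping anything seen fewer than `min_count` times. The threshold and the sample documents are illustrative assumptions.

```python
from collections import Counter


def remove_rare_words(docs, min_count=2):
    # Count every whitespace-separated word across all documents
    counts = Counter(word for doc in docs for word in doc.split())
    return [
        " ".join(word for word in doc.split() if counts[word] >= min_count)
        for doc in docs
    ]


docs = ["great phone great battery", "great battery lyfe", "phone battery"]
print(remove_rare_words(docs))
# ['great phone great battery', 'great battery', 'phone battery']
```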

Stemming

Stemming means chopping off the end of a word to reduce it to its root, for example removing "ing" from "consulting" and "ant" from "consultant" so that both become "consult".
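
To make the idea concrete, here is a toy suffix stripper. It is not the Porter stemmer or any other published algorithm; the `SUFFIXES` tuple and the minimum stem length of three characters are assumptions chosen just to reproduce the example above.

```python
# Hypothetical suffix list, far from complete
SUFFIXES = ("ing", "ant", "s")


def naive_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("consulting", "consultant", "consults", "sing"):
    print(w, "->", naive_stem(w))
# consulting -> consult
# consultant -> consult
# consults -> consult
# sing -> sing
```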

Lemmatization

Lemmatization means reducing words to their root form by using a vocabulary and morphological analysis, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
For example:
1) am, are, is => be
2) operating, operates, operation, operative, operatives, operational => operate

In the second example, stemming would not produce "operate". Because a stemmer does not take the meaning of the words into account and simply chops characters off the end, it returns a truncated form such as "operat" (the exact output depends on the stemmer).
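
The contrast with stemming can be sketched with a dictionary lookup. The `LEMMAS` table below is a hand-written stand-in for a real vocabulary and covers only a few of the forms from the examples; actual lemmatizers rely on a full lexicon and part-of-speech information.

```python
# Hypothetical mini-vocabulary mapping inflected forms to their lemma
LEMMAS = {
    "am": "be",
    "are": "be",
    "is": "be",
    "operating": "operate",
    "operates": "operate",
}


def lemmatize(word: str) -> str:
    # Unknown words are returned unchanged
    return LEMMAS.get(word.lower(), word)


print([lemmatize(w) for w in ("am", "is", "operating", "table")])
# ['be', 'be', 'operate', 'table']
```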

Removal of emojis

Emojis are now a staple of text messages, and they can be dealt with in two ways. The first is to remove them from the dataset.
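
A sketch of emoji removal with a regular expression over a few Unicode blocks. The ranges below are an approximation chosen for this example and do not cover every emoji (flags, skin-tone modifiers, and joiner sequences are not handled).

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def remove_emojis(text: str) -> str:
    return EMOJI_PATTERN.sub("", text)


sample = "Great game \U0001F602\U0001F525"  # ends with two emojis
print(repr(remove_emojis(sample)))  # 'Great game '
```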

Removal of emoticons

Emoticons, such as :-) and :(, are likewise removed from many datasets.
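
One way to sketch this is to build a regular expression from a list of known emoticons. The `EMOTICONS` list is a short illustrative sample, not a complete inventory, and a pattern this simple can also match emoticon-like character runs inside other tokens.

```python
import re

# Hypothetical short list; real emoticon lists are much longer
EMOTICONS = [":-)", ":)", ":-(", ":(", ";-)", ";)", ":D", ":P"]

# Longest first so ":-)" is tried before ":)"
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def remove_emoticons(text: str) -> str:
    # Replace with a space, then collapse the leftover whitespace
    return " ".join(EMOTICON_PATTERN.sub(" ", text).split())


print(remove_emoticons("Nice work :-) see you ;)"))  # Nice work see you
```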

Conversion of emoticons to words

The other way to deal with emoticons is to convert them to words.
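
A minimal sketch of the conversion using a lookup table. The `EMOTICON_WORDS` mapping and its labels are invented for this example, and the function assumes emoticons are separated from other words by whitespace.

```python
# Hypothetical mapping from emoticon to a descriptive token
EMOTICON_WORDS = {
    ":-)": "happy_face",
    ":)": "happy_face",
    ":-(": "sad_face",
    ":(": "sad_face",
    ";)": "wink",
}


def emoticons_to_words(text: str) -> str:
    # Tokens that are not known emoticons are kept as they are
    return " ".join(EMOTICON_WORDS.get(tok, tok) for tok in text.split())


print(emoticons_to_words("I passed :) but failed the lab :("))
# I passed happy_face but failed the lab sad_face
```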

Conversion of emojis to words

Emojis can likewise be converted to words that describe them.
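
As a sketch, the standard library's `unicodedata.name()` can supply a description for each emoji. Treating every character in the Unicode category "So" (Symbol, other) as an emoji is an assumption made here for brevity; it also catches symbols such as the copyright sign and misses modifier sequences.

```python
import unicodedata


def emojis_to_words(text: str) -> str:
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Official Unicode character name, or "" if the character has none
            name = unicodedata.name(ch, "")
            pieces.append(" " + name.lower().replace(" ", "_") + " ")
        else:
            pieces.append(ch)
    # Collapse the extra spaces introduced around converted emojis
    return " ".join("".join(pieces).split())


print(emojis_to_words("so funny \U0001F602"))
# so funny face_with_tears_of_joy
```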

Removal of URLs

URLs in the text are usually removed from the dataset, since they rarely carry useful linguistic content.
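
A simple regular-expression sketch for this step. The pattern only targets strings starting with `http://`, `https://`, or `www.` and is an assumption for illustration; it will miss bare domains and may swallow trailing punctuation.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def remove_urls(text: str) -> str:
    # Replace with a space, then collapse the leftover whitespace
    return " ".join(URL_PATTERN.sub(" ", text).split())


print(remove_urls("Read more at https://example.com/post?id=1 today"))
# Read more at today
```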

Removal of HTML tags

Sometimes, when scraping data from websites, HTML tags end up in the dataset. They must be removed to build better language models.
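
A sketch using the standard library's `html.parser.HTMLParser`, which keeps only the text between tags. The class and function names are made up for this example; note that text inside `<script>` or `<style>` elements would also be kept and may need separate filtering.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags
        self.chunks.append(data)


def strip_html(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_html("<p>Hello <b>world</b>!</p>"))  # Hello world!
```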

Spelling correction

Spelling mistakes must be corrected to make better language models. Minimum edit distance can be used to find the dictionary word closest to a slightly misspelled one.
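
The minimum edit distance idea can be sketched with the classic dynamic-programming (Levenshtein) computation, followed by picking the nearest word from a vocabulary. The `VOCAB` list and the `max_distance` threshold are assumptions for illustration; a real spell checker would use a large dictionary and word frequencies.

```python
def edit_distance(a: str, b: str) -> int:
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical mini-vocabulary
VOCAB = ["language", "model", "spelling", "mistake"]


def correct(word: str, vocab=VOCAB, max_distance: int = 2) -> str:
    best = min(vocab, key=lambda v: edit_distance(word, v))
    # Leave the word alone if nothing in the vocabulary is close enough
    return best if edit_distance(word, best) <= max_distance else word


print(edit_distance("kitten", "sitting"))  # 3
print(correct("speling"))                  # spelling
```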

Discussion (1)

Željko Šević

Great overview. Conversion of diacritics to Latin characters can also be added to the list of preprocessing tasks.