Natural Language Processing (NLP), a subfield of Machine Learning, deals mainly with text data. Typical tasks include sentiment analysis (deciding whether reviews of books, movies, Play Store apps, and so on are positive or negative), text generation for chatbots, query analysis and resolution for search engines, and many other text-related tasks.
Preprocessing of datasets is one of the most arduous parts of the machine learning pipeline, and text preprocessing in particular involves many steps. Some of the common steps for text datasets are given below.
All text is converted to lowercase so that the same word written in different casing (for example "Movie" and "movie") is treated as one token and gets the same weight.
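A minimal sketch of lowercasing with Python's built-in string method. The function name `to_lowercase` is only an illustrative choice, not something from the original text.

```python
def to_lowercase(text: str) -> str:
    """Return the text with every cased character converted to lowercase."""
    return text.lower()


print(to_lowercase("The Movie was GREAT"))  # the movie was great
```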
All punctuation symbols are removed from the dataset, since they carry little information for many tasks such as word prediction and sentiment analysis.
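One possible way to strip punctuation, assuming that the ASCII symbols in `string.punctuation` are the ones of interest. Other scripts or typographic quotes would need a wider character set. Note that the apostrophe is also removed, so "It's" becomes "Its".

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Delete all characters listed in string.punctuation."""
    return text.translate(PUNCT_TABLE)


print(remove_punctuation("Hello, world! It's great."))  # Hello world Its great
```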
Stopwords are common words (in English, for example, "the", "he", and "have") that add little meaning to a sentence. They can usually be ignored without sacrificing the meaning of the sentence, so they are removed from the dataset.
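A small sketch of stopword removal. The `STOPWORDS` set below is a tiny hand-picked list used purely for illustration. In practice a fuller list from an NLP library would normally be used.

```python
# Illustrative stopword list only; real lists contain well over a hundred words.
STOPWORDS = {"the", "he", "have", "a", "an", "is", "to", "of", "and", "in"}


def remove_stopwords(text: str) -> str:
    """Drop whitespace-separated tokens that appear in STOPWORDS (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)


print(remove_stopwords("He have read the book"))  # read book
```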
Sometimes the most frequent words are also removed in text classification tasks. Because they occur in every class, they do not help to tell the classes apart, and removing them can improve classification accuracy.
In some cases rare words are removed as well, because they occur too seldom for a model to learn anything from them and so behave like outliers.
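Removal of frequent and rare words can be sketched with a simple word count over the whole corpus. The parameters `top_n` and `min_count` are hypothetical thresholds chosen for this example. Suitable values depend on the dataset.

```python
from collections import Counter


def remove_frequent_and_rare(docs, top_n=1, min_count=2):
    """Drop the top_n most frequent words and every word seen fewer than min_count times."""
    counts = Counter(word for doc in docs for word in doc.split())
    frequent = {word for word, _ in counts.most_common(top_n)}
    rare = {word for word, count in counts.items() if count < min_count}
    drop = frequent | rare
    return [" ".join(w for w in doc.split() if w not in drop) for doc in docs]


docs = ["film good film", "film bad plot", "film good plot zzz"]
print(remove_frequent_and_rare(docs))  # ['good', 'plot', 'good plot']
```

Here "film" is dropped as the most frequent word, while "bad" and "zzz" are dropped because each occurs only once.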
Stemming chops off the end of a word to reduce it to a root form, for example removing "ing" and "ant" from "consulting" and "consultant" so that both become "consult".
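The idea can be illustrated with a deliberately naive suffix stripper. This is not the Porter algorithm or any library's stemmer. The suffix list and the `min_stem_len` guard are assumptions made only for this sketch.

```python
# Hypothetical suffix list, checked in order; real stemmers use many more rules.
SUFFIXES = ("ing", "ant", "es", "s")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print(naive_stem("consulting"))  # consult
print(naive_stem("consultant"))  # consult
```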
Lemmatization reduces words to their root form by using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
1) am, are, is => be
2) operating, operates => operate (derived forms such as operation, operative, and operational are separate dictionary words and usually keep their own lemma)
In the second example, a stemmer would instead produce a truncated form such as "operat" rather than "operate", because stemming does not take the meaning of the words into account and simply chops characters off the end.
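The contrast between the two approaches can be sketched as follows. The lookup table `LEMMA_TABLE` stands in for the vocabulary a real lemmatizer would consult, and `crude_stem` is a toy suffix stripper. Both are assumptions for illustration, not any library's behaviour.

```python
# Toy vocabulary mapping inflected forms to their lemma (illustrative only).
LEMMA_TABLE = {
    "am": "be",
    "are": "be",
    "is": "be",
    "operating": "operate",
    "operates": "operate",
}


def lemmatize(word: str) -> str:
    """Look the word up in the vocabulary; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)


def crude_stem(word: str) -> str:
    """Chop a trailing 'ing' or 'es' without consulting any vocabulary."""
    for suffix in ("ing", "es"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for w in ("is", "operating", "operates"):
    print(w, "->", lemmatize(w), "|", crude_stem(w))
```

The lemmatizer returns dictionary words ("be", "operate"), while the stemmer leaves "is" untouched and turns the other two into the non-word "operat".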
Emojis are ubiquitous in text messages today, and they can be handled in two ways. The first is to remove them from the dataset.
Emoticons are likewise removed for many datasets.
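A rough sketch of both removals using regular expressions. The Unicode ranges below cover the main emoji blocks but are an approximation, not a complete definition of "emoji", and the emoticon pattern only recognises a few simple Western-style forms such as `:-)` or `;D`.

```python
import re

# Approximate emoji coverage; flags and some other symbols are not included.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, smileys, transport, supplemental symbols
    "\u2600-\u27BF"  # miscellaneous symbols and dingbats
    "]+"
)

# Eyes, optional nose, mouth; must stand alone between whitespace.
EMOTICON_PATTERN = re.compile(r"(?<!\S)[:;=][-']?[)(DPp/\\|](?!\S)")


def strip_emojis_and_emoticons(text: str) -> str:
    """Remove emojis and simple emoticons, then collapse repeated whitespace."""
    text = EMOJI_PATTERN.sub(" ", text)
    text = EMOTICON_PATTERN.sub(" ", text)
    return " ".join(text.split())


print(strip_emojis_and_emoticons("great movie 😀 :-) loved it"))  # great movie loved it
```

Requiring whitespace around an emoticon keeps strings such as "3:1" or the colon in a URL from being mistaken for one.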
The other way to deal with emoticons is to convert them to words.
Emojis can likewise be converted to descriptive words.
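Conversion to words might look like the sketch below. The `EMOTICON_WORDS` mapping is a small hand-written table assumed for this example, and emoji names are taken from the Unicode character names in the standard `unicodedata` module, which is only one of several possible naming schemes.

```python
import re
import unicodedata

# Hand-written, illustrative mapping; extend as the dataset requires.
EMOTICON_WORDS = {":)": "smile", ":-)": "smile", ":(": "sad", ":-(": "sad",
                  ";)": "wink", ":D": "laugh"}

# Approximate emoji ranges, matched one character at a time.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoticons_to_words(text: str) -> str:
    """Replace whitespace-separated emoticons with their word from the table."""
    return " ".join(EMOTICON_WORDS.get(tok, tok) for tok in text.split())


def _emoji_name(match: re.Match) -> str:
    # Use the Unicode character name, e.g. "GRINNING FACE" -> "grinning_face".
    name = unicodedata.name(match.group(0), "")
    return " " + name.lower().replace(" ", "_") + " " if name else " "


def emojis_to_words(text: str) -> str:
    """Replace each emoji with its Unicode name and tidy the whitespace."""
    return " ".join(EMOJI_PATTERN.sub(_emoji_name, text).split())


print(emoticons_to_words("nice film :-)"))  # nice film smile
print(emojis_to_words("loved it 😀"))  # loved it grinning_face
```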
URLs present in the text usually carry little linguistic information and are typically removed from the dataset.
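A simple regular-expression sketch for URL removal. The pattern assumes URLs start with `http://`, `https://`, or `www.` and run until the next whitespace. Other forms would need a broader pattern.

```python
import re

# Matches http(s) links and bare "www." links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def remove_urls(text: str) -> str:
    """Delete URLs and collapse the whitespace left behind."""
    return " ".join(URL_PATTERN.sub(" ", text).split())


print(remove_urls("read this https://example.com/a?b=1 now"))  # read this now
```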
Sometimes, when data is scraped from websites, HTML tags end up in the dataset. These should be removed to build better language models.
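For simple markup, tags can be stripped with a regular expression and entities decoded with the standard `html` module, as sketched below. This is an assumption-laden shortcut: badly nested or script-heavy pages are better handled with a real HTML parser.

```python
import html
import re

# Anything between "<" and the next ">" is treated as a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove tags, decode entities such as &amp;, and normalise whitespace."""
    no_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(html.unescape(no_tags).split())


print(strip_html("<p>Great <b>movie</b> &amp; cast</p>"))  # Great movie & cast
```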
Spelling mistakes should be corrected to build better language models. Minimum edit distance can be used to find the dictionary words that differ only slightly from a misspelled word.
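The minimum edit distance idea can be sketched with the classic Levenshtein dynamic program, where insertions, deletions, and substitutions each cost 1. The helper `correct`, its `vocabulary` argument, and the `max_distance` threshold are hypothetical names introduced for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word: str, vocabulary, max_distance: int = 2) -> str:
    """Return the closest vocabulary word if it is within max_distance edits."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_distance else word


vocab = ["language", "model", "mistake"]
print(edit_distance("kitten", "sitting"))  # 3
print(correct("langauge", vocab))  # language
```

Words that are far from every vocabulary entry are left unchanged, which avoids "correcting" names or rare terms into unrelated words.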