Language models are machine learning models that work on text data to perform tasks in Natural Language Processing (NLP). NLP tasks fall into two major categories: Natural Language Understanding and Natural Language Generation. Tasks performed by language models include sentiment analysis, question answering, query resolution, and text summarization. Several intermediate steps are typically performed on the text to make language models work better. Some of these are described below.
In sentence segmentation, the whole corpus, that is, the entire text collection, is broken down into separate sentences. This is the first step in processing a language: the text is first split into simple sentences.
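As a rough illustration of sentence segmentation, the sketch below splits text with a regular expression. The function name split_sentences and the boundary rule (end punctuation followed by whitespace and a capital letter) are assumptions made for this example; real segmenters also handle abbreviations, quotes, and other edge cases that this heuristic gets wrong.

```python
import re

# Heuristic boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
# Assumption for illustration only: abbreviations like "Dr. Smith" would be split wrongly.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the sentences found in `text` as a list of stripped strings."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


corpus = "Language models work on text. They perform many tasks! Is this the first step?"
sentences = split_sentences(corpus)
for sentence in sentences:
    print(sentence)
```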
The next step after sentence segmentation is tokenization or, more precisely, word tokenization. The sentences are broken down into words. In tasks where punctuation marks matter, they are also treated as tokens alongside the words.
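Word tokenization can be sketched with a single regular expression. The tokenize function and its keep_punctuation flag are hypothetical names chosen for this example; the flag shows the choice described above between keeping punctuation marks as tokens and dropping them.

```python
import re

# A token is either a word (optionally with an internal apostrophe, e.g. "Don't")
# or a single non-space, non-word character such as ',' or '!'.
_TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(sentence, keep_punctuation=True):
    """Split a sentence into word tokens, optionally keeping punctuation tokens."""
    tokens = _TOKEN.findall(sentence)
    if keep_punctuation:
        return tokens
    # Keep only tokens that start with a word character.
    return [token for token in tokens if re.match(r"\w", token)]


with_punct = tokenize("Don't panic, it works!")
without_punct = tokenize("Don't panic, it works!", keep_punctuation=False)
print(with_punct)
print(without_punct)
```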
Stemming is the process of reducing words to their root stem by chopping off the ends of the words.
For example, oppressor, oppression, oppressed, and oppressive are all reduced to the root stem oppress by chopping off 'or', 'ion', 'ed', and 'ive' respectively.
Some well-known stemming algorithms are the Porter Stemmer, the Lancaster Stemmer, and the Snowball Stemmer. These are implemented in the nltk library and can be imported as follows:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer
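To make the idea of suffix chopping concrete, here is a deliberately naive stemmer written in plain Python. It is not the Porter, Lancaster, or Snowball algorithm, and it does not use nltk; the suffix list, the naive_stem name, and the minimum stem length are assumptions made for this sketch, using the suffixes from the oppress example.

```python
# Suffixes taken from the "oppress" example; a real stemmer uses many ordered rules.
SUFFIXES = ("ive", "ion", "ed", "or")


def naive_stem(word, min_stem_length=3):
    """Chop the first matching suffix off `word`, trying longer suffixes first.

    The word is returned unchanged when no suffix matches or when the
    remaining stem would be shorter than `min_stem_length`.
    """
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


words = ["oppressor", "oppression", "oppressed", "oppressive"]
stems = [naive_stem(word) for word in words]
print(stems)
```

Because the rule only looks at the characters at the end of a word, it will happily damage words it was never meant for, which is one reason practical stemmers add conditions on the stem before removing a suffix.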
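Lemmatization converts a word to its root word, known as the lemma. It takes the meaning of the word into account and does not simply chop off the last part of the word.
For example, the root word for 'is', 'am', and 'are' is 'be'. As another example,
the words 'creation', 'creating', and 'creative' are all related to the lemma 'create'. Many lemmatizers only undo inflection, so they map 'creating' to 'create' but leave derived words such as 'creation' unchanged. Stemming, by contrast, may leave a truncated form that is not a dictionary word (for example something like 'creat', depending on the algorithm).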
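A lemmatizer can be thought of, in its simplest form, as a dictionary lookup from word forms to lemmas. The sketch below assumes a tiny hand-written table (LEMMA_TABLE) and a lemmatize helper, both hypothetical; real lemmatizers rely on a full lexicon and usually on the part of speech of the word.

```python
# Tiny hand-written lookup table, for illustration only.
LEMMA_TABLE = {
    "is": "be",
    "am": "be",
    "are": "be",
    "was": "be",
    "were": "be",
    "creating": "create",
    "created": "create",
    "creates": "create",
}


def lemmatize(token):
    """Return the lemma of `token`, or the lowercased token if it is not in the table."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


lemmas = [lemmatize(token) for token in ["is", "am", "are", "Creating", "text"]]
print(lemmas)
```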
POS stands for part of speech. In this step, every token is assigned, or tagged with, a part of speech. There are two main approaches to POS tagging: rule-based tagging and stochastic tagging. POS tagging is useful because it helps in building lemmatizers, in building parse trees that are used for Named Entity Recognition, and in resolving word-sense ambiguity.
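The rule-based approach can be illustrated with a toy tagger that first consults a small lexicon and then falls back to suffix rules. The lexicon, the rules, the default tag, and the rule_based_tag name are all assumptions for this sketch; the tag names follow the common Penn Treebank style (DT, NN, VBD, and so on). A stochastic tagger would instead learn tag probabilities from annotated text.

```python
# Closed-class words with a fixed tag (illustrative, far from complete).
LEXICON = {"the": "DT", "a": "DT", "an": "DT", "is": "VBZ", "and": "CC", "on": "IN", "in": "IN"}

# Suffix rules, checked in order; the first match wins.
SUFFIX_RULES = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("s", "NNS")]


def rule_based_tag(tokens):
    """Tag each token using the lexicon first, then suffix rules, then a noun default."""
    tagged = []
    for token in tokens:
        lowered = token.lower()
        if lowered in LEXICON:
            tag = LEXICON[lowered]
        else:
            tag = next((t for suffix, t in SUFFIX_RULES if lowered.endswith(suffix)), "NN")
        tagged.append((token, tag))
    return tagged


tagged = rule_based_tag(["The", "dog", "barked", "loudly"])
print(tagged)
```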
Stopwords are common words in a language that add little meaning to a sentence, such as 'and', 'the', 'is', and 'am'. These words are identified and, depending on the task, removed from the corpus, because they act like noise in the dataset.
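Stopword removal amounts to filtering tokens against a set of known stopwords. The short STOPWORDS set and the remove_stopwords function below are assumptions for this example; in practice the list is larger and is chosen to suit the task.

```python
# A very small stopword set, for illustration only.
STOPWORDS = {"and", "the", "is", "am", "a", "an", "of", "to"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens whose lowercased form is not in `stopwords`."""
    return [token for token in tokens if token.lower() not in stopwords]


filtered = remove_stopwords(["The", "cat", "is", "on", "the", "mat"])
print(filtered)
```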
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Some common NER tools are the Stanford Named Entity Recognizer (SNER), spaCy, and the Natural Language Toolkit (NLTK).
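To show what locating and classifying entities looks like, the sketch below combines a small name list (a gazetteer) with regular expressions for monetary values and percentages. It does not use any of the tools named above; the gazetteer entries, the patterns, and the find_entities name are hypothetical, and real NER systems are usually statistical models trained on annotated data.

```python
import re

# Hypothetical gazetteer mapping known names to entity categories.
GAZETTEER = {"Alice": "PERSON", "Paris": "LOCATION", "Acme Corp": "ORGANIZATION"}

# Pattern-based categories for values that follow a regular format.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
]


def find_entities(text):
    """Return (entity_text, category) pairs in order of appearance in `text`."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Sort by start offset so entities come out in reading order.
    return [(entity, label) for _, entity, label in sorted(found)]


entities = find_entities("Alice of Acme Corp paid $1,200 in Paris, a 15% increase.")
print(entities)
```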
Text classification is one of the most important steps in sentiment analysis. After steps such as tokenization, stemming, and lemmatization have been performed on the corpus, the processed text is passed to a machine learning algorithm that classifies it.
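As one possible choice of machine learning algorithm, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing over already tokenized text. The class name, the training sentences, and the labels are made up for this example; the text above does not prescribe a particular algorithm.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over token lists, with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, labelled_docs):
        for tokens, label in labelled_docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: token lists that have already been preprocessed.
training_data = [
    (["great", "movie", "loved", "it"], "positive"),
    (["wonderful", "acting", "great", "story"], "positive"),
    (["terrible", "movie", "hated", "it"], "negative"),
    (["boring", "story", "awful", "acting"], "negative"),
]
classifier = NaiveBayesClassifier()
classifier.train(training_data)
print(classifier.predict(["great", "story"]))
print(classifier.predict(["terrible", "boring"]))
```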
Chunking works on top of POS tagging: it takes POS tags as input and produces chunks as output. In short, chunking means grouping words or tokens into chunks. A chunk is a group of words that can be combined to form a meaningful part of the sentence, such as a noun phrase or a verb phrase.
Chunking breaks sentences into phrases that are more useful than individual words. It is very important when you want to extract information such as locations and person names from text (NER). NLTK can be used for chunking.
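Chunking over POS-tagged tokens can be sketched without any library. The function below groups tokens that match the pattern "optional determiner, any number of adjectives, one or more nouns" into noun-phrase chunks. The pattern, the Penn Treebank style tags, and the chunk_noun_phrases name are assumptions for this example; NLTK offers a grammar-based chunker for the same purpose.

```python
def chunk_noun_phrases(tagged_tokens):
    """Group (word, tag) pairs matching DT? JJ* NN+ into noun-phrase strings."""
    chunks = []
    i = 0
    n = len(tagged_tokens)
    while i < n:
        start = i
        # Optional determiner.
        if tagged_tokens[i][1] == "DT":
            i += 1
        # Any number of adjectives.
        while i < n and tagged_tokens[i][1] == "JJ":
            i += 1
        # One or more nouns (NN, NNS, NNP, ...).
        noun_start = i
        while i < n and tagged_tokens[i][1].startswith("NN"):
            i += 1
        if i > noun_start:
            chunks.append(" ".join(word for word, _ in tagged_tokens[start:i]))
        else:
            # No noun followed, so this is not a noun phrase: skip one token.
            i = start + 1
    return chunks


tagged_sentence = [
    ("The", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
    ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]
noun_phrases = chunk_noun_phrases(tagged_sentence)
print(noun_phrases)
```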
In linguistics, coreference occurs when two or more expressions in a text refer to the same person or thing, that is, they have the same referent. For example, in "Bill said he would come", the proper noun Bill and the pronoun he refer to the same person, namely Bill. Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for many higher-level NLP tasks that involve natural language understanding, such as document summarization, question answering, and information extraction.
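A very naive baseline for coreference resolution links each pronoun to the most recently mentioned known entity. The sketch below applies that heuristic to the Bill example; the pronoun list, the resolve_pronouns name, and the assumption that entity names are given in advance are all simplifications, and real systems also consider gender, number, syntax, and context.

```python
# Small pronoun list, for illustration only.
PRONOUNS = {"he", "she", "him", "her", "his", "they", "them", "it"}


def resolve_pronouns(tokens, entity_names):
    """Link each pronoun to the most recent preceding entity name.

    Returns a list of (token_index, pronoun, entity) triples. Pronouns that
    appear before any entity has been mentioned are left unresolved.
    """
    links = []
    last_entity = None
    for index, token in enumerate(tokens):
        if token in entity_names:
            last_entity = token
        elif token.lower() in PRONOUNS and last_entity is not None:
            links.append((index, token, last_entity))
    return links


tokens = ["Bill", "said", "he", "would", "come"]
links = resolve_pronouns(tokens, {"Bill"})
print(links)
```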