Welcome back to yet another exciting narrative in our quest to understand the fundamentals of Text Analytics. In the last article we saw what topic modeling is, the different ways to do it, and why it matters. Since then, if you've noticed, we've started putting the smaller pieces together and started to see the big picture of what these NLP applications look like. You might not even realize it, but you are on course to writing your first sophisticated algorithm. From the point of view of this tutorial series you might not see that happening in terms of code, but the knowledge will surely ease the way. You just need to pick up a language and get yourself up to speed.
In this article, we are going to talk about yet another interesting and important topic in NLP: Text Classification. We will see what it is, explore the different ways to do it through examples, and complete the pipeline. So let's get ourselves going...
The simplest definition of text classification is the classification of text based on its content. It can be used to organize, structure, and categorize pretty much any kind of text – documents, medical studies, files, and content from all over the web. For example, news articles can be organized by topic; support tickets by urgency; chat conversations by language; brand mentions by sentiment; and so on.
Consider the following example:
“The movie was boring and slow!”
A classifier can take this text as an input, analyze its content, and then automatically assign relevant tags that represent this text, such as boring and slow.
Now, to understand the process of text classification, let's take a real-world problem: spam. Every day, you receive many emails related to the different activities you are involved in. You must also get some spam emails; that's nothing new for any of us. Have you ever meticulously analyzed a spam mail? Tried to figure out if there is a general structure or pattern in it? I am sure many of us have tried this activity at least once. Okay, maybe you haven't; I'm the procrastinator who used to do such things.
So you see, there are two classes you can sort your mail into based on its usability: spam and not spam. Let's see an example of each of these classes:
Spam: "Dear Customer, we are excited to inform you that your account is eligible for a $1000 reward. To avail click the link below now!!!"
Not Spam: "Dear customer, this is to inform you that our services will be temporarily restricted between 12:00 AM to 4:00 AM for maintenance purposes. We request you to please avoid using the services."
Once you have the data, and a significant number of records, you can start with the tasks typical to an NLP application that we read about in the earlier articles, like tokenization, lemmatization, and stop-word removal. You could use any language of your choice, and depending on it you would have an output more or less like the one below:
Spam: "dear customer excited inform account eligible 1000 reward avail click link"
Not Spam: "dear customer inform service temporarily restricted 12:00 4:00 maintenance purpose request please avoid using service"
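As a rough sketch of how such a cleaning step might look in plain Python: the regex-based tokenizer and the hand-picked stop-word list below are simplifications chosen just to reproduce the example above; a real pipeline would use a proper tokenizer, lemmatizer, and stop-word list from a library such as NLTK or spaCy.

```python
import re

# A tiny, hand-picked stop-word list for this example only;
# real pipelines use a much fuller list from NLTK or spaCy.
STOP_WORDS = {"we", "are", "to", "that", "your", "is", "for",
              "a", "the", "you", "below", "now"}

def clean(text):
    # Lowercase and tokenize on runs of letters/digits (no lemmatization here)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop the stop words and rejoin
    return " ".join(t for t in tokens if t not in STOP_WORDS)

spam = ("Dear Customer, we are excited to inform you that your account "
        "is eligible for a $1000 reward. To avail click the link below now!!!")
print(clean(spam))
# → dear customer excited inform account eligible 1000 reward avail click link
```

The exact output depends entirely on the stop-word list and tokenizer you choose, which is why cleaned text from different toolkits rarely matches word for word.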
Once you have cleaned your data to this kind of representation – it can be something more advanced than this or a bit more primitive, depending entirely on the data you have and, finally, what you expect from it – remember that machine learning algorithms work with numerical data only. In fact, the computer is designed to work with numbers. So we have to represent our cleaned textual data numerically somehow.
The mapping from textual data to real-valued vectors is called feature extraction. There are many ways to do that; one of the most common is the Bag of Words (BoW). It is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
It is called a “bag” of words because any information about the order or structure of the words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur. The intuition is that documents are similar if they have similar content. Consider the following rhyme:
Jack be nimble
Jack be quick
Jack jump over
The candlestick
This snippet consists of 4 lines in all. For this example, let us consider each line to be a separate document; we thus have a collection of documents, each containing a few words.
Now let us design the vocabulary: in this collection of documents, we have 8 unique words out of the 11 total words.
Remember, the objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.
One simple way is to mark the presence of words as boolean values, 1 if present, 0 otherwise.
In that way, our first document will look something like this.
[1, 1, 1, 0, 0, 0, 0, 0]
Consider this array as a vocabulary index boolean array. Similarly, you would do it for the rest of the documents. It would look something like this:
"Jack be quick" = [1, 1, 0, 1, 0, 0, 0, 0]
"Jack jump over" = [1, 0, 0, 0, 1, 1, 0, 0]
"The candlestick" = [0, 0, 0, 0, 0, 0, 1, 1]
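These boolean vectors can be produced with a short sketch in plain Python; building the vocabulary in order of first appearance is an assumption made here so the output matches the vectors above.

```python
docs = ["Jack be nimble", "Jack be quick", "Jack jump over", "The candlestick"]

# Build the vocabulary in order of first appearance (case-insensitive)
vocab = []
for doc in docs:
    for word in doc.lower().split():
        if word not in vocab:
            vocab.append(word)

def bow_vector(doc):
    """Mark presence (1) or absence (0) of each vocabulary word."""
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocab]

for doc in docs:
    print(doc, "=", bow_vector(doc))
# → Jack be nimble = [1, 1, 1, 0, 0, 0, 0, 0]   ... and so on
```

Libraries such as scikit-learn offer this (with counts rather than booleans) via `CountVectorizer`, but the idea is exactly this simple.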
Good, now there is a problem with this, you see. The matrix you have so far is dominated by 0s; such a matrix is known as a sparse matrix. This poses a problem for computation with respect to both space and time, so we must condense it. Again, there are many ways to do that; one of the most favored in NLP is using n-grams. This has to be one of the easiest concepts: an n-gram is a contiguous sequence of n tokens. A 2-token n-gram, commonly known as a bigram, is simply a 2-word string.
For example, consider the second document, "Jack be quick". Its bigrams will be: ["Jack be", "be quick"].
You see, now we have fewer elements to match against to produce the Bag of Words.
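Generating n-grams takes only a couple of lines; this sketch uses whitespace tokenization, which is a simplification:

```python
def ngrams(text, n=2):
    """Return all contiguous n-token sequences of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("Jack be quick"))        # → ['Jack be', 'be quick']
print(ngrams("Jack be quick", n=3))   # → ['Jack be quick']
```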
Still, the problem of the sparse matrix persists. It can be mitigated by using Singular Value Decomposition (SVD), which projects the sparse document vectors into a much smaller dense space. There are comprehensive tutorials available if you want to go into the technical details.
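As a sketch of the idea (this is essentially Latent Semantic Analysis), a truncated SVD of the 4×8 BoW matrix from the rhyme above compresses each document into a short dense vector. The choice of NumPy and of keeping k=2 components are assumptions made for illustration:

```python
import numpy as np

# BoW matrix: 4 documents (rows) over the 8-word vocabulary (columns)
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],  # "Jack be nimble"
    [1, 1, 0, 1, 0, 0, 0, 0],  # "Jack be quick"
    [1, 0, 0, 0, 1, 1, 0, 0],  # "Jack jump over"
    [0, 0, 0, 0, 0, 0, 1, 1],  # "The candlestick"
], dtype=float)

# Full SVD, then keep only the top-k singular values/vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_dense = U[:, :k] * s[:k]   # each row is now a dense 2-d document vector
print(docs_dense.shape)          # → (4, 2)
```

Each document goes from 8 mostly-zero numbers to 2 dense ones; with a realistic vocabulary of tens of thousands of words, the savings are substantial.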
One of the major disadvantages of using BoW is that it discards word order, thereby ignoring the context and, in turn, the meaning of words in the document. For natural language processing (NLP), maintaining the context of words is of utmost importance. To solve this problem we use another approach called Word Embedding.
Word Embedding is a representation of text where words that have the same meaning have a similar representation. There are various models for textual representation using this paradigm, but the most popular ones are Word2Vec, GloVe, and ELMo.
Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function indicates the level of semantic similarity between the words represented by those vectors.
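The "simple mathematical function" usually used to compare such vectors is cosine similarity. The 3-dimensional vectors below are made up purely for illustration (real word2vec vectors typically have 100–300 dimensions learned from a corpus), but the comparison itself is exactly what you would do with real embeddings:

```python
import math

# Hypothetical toy embeddings; real ones are learned, not hand-written.
vectors = {
    "movie":  [0.9, 0.1, 0.3],
    "film":   [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(vectors["movie"], vectors["film"]))    # high (~0.99)
print(cosine_similarity(vectors["movie"], vectors["banana"]))  # much lower
```

With trained embeddings, "movie" and "film" really do land close together, which is what lets a model treat them as near-synonyms.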
GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. It was developed as an open-source project at Stanford and launched in 2014. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families: global matrix factorization and local context window methods.
Once you have your representation of the text ready as numbers and matrices, the next step is to choose a machine learning model for text classification. There is no hard and fast rule as to which model performs best; it depends on many factors such as the data, the computational resources, the time complexity, the use case, the end users, and the device they will be using it on. There are many intricate details in each of the algorithms, but usually these models are grouped into two:
- Machine Learning models
- Deep Learning models
Here is a comprehensive list for both of them:
- Multinomial Naïve Bayes (NB)
- Logistic Regression (LR)
- Support Vector Machine (SVM)
- Stochastic Gradient Descent (SGD)
- k-Nearest-Neighbors (kNN)
- RandomForest (RF)
- Gradient Boosting (GB)
- XGBoost (XGB)
- Shallow Neural Network
- Deep neural network (and 2 variations)
- Recurrent Neural Network (RNN)
- Long Short Term Memory (LSTM)
- Convolutional Neural Network (CNN)
- Gated Recurrent Unit (GRU)
- Bidirectional RNN
- Bidirectional LSTM
- Bidirectional GRU
- Recurrent Convolutional Neural Network (RCNN) (and 3 variations)
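To make one of these concrete, here is a minimal Multinomial Naive Bayes classifier written from scratch in plain Python, trained on a toy, hand-made version of the cleaned spam data from earlier. The training sentences are invented for illustration; a real model would use a library such as scikit-learn and thousands of labelled emails.

```python
import math
from collections import Counter

# Toy labelled data in the cleaned form shown earlier (invented examples)
train = [
    ("spam",     "dear customer account eligible reward click link"),
    ("spam",     "click link claim reward now"),
    ("not_spam", "dear customer service maintenance restricted"),
    ("not_spam", "service temporarily restricted maintenance purpose"),
]

# Count word occurrences per class, and documents per class
word_counts = {"spam": Counter(), "not_spam": Counter()}
class_counts = Counter()
for label, text in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for label in word_counts:
        # Log prior + log likelihoods with add-one (Laplace) smoothing
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("click link reward"))  # → spam
```

Even this tiny model captures the pattern we noticed by eye: reward/click/link language pushes a mail toward the spam class, service/maintenance language away from it.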
You have a number of metrics on which you can judge the performance of these models on your data. Some of them are Precision, Recall, F1 Score, the Confusion Matrix, ROC AUC, ROC curves, Cohen's Kappa, the True/False Positive Rate curve, and so on.
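The most common of these metrics fall straight out of the confusion matrix. The counts below are invented for illustration:

```python
# Hypothetical confusion matrix for a spam classifier:
#                  predicted spam   predicted not-spam
# actual spam           TP=40             FN=10
# actual not-spam       FP=5              TN=45
tp, fn, fp, tn = 40, 10, 5, 45

precision = tp / (tp + fp)   # of mails flagged spam, how many really were
recall    = tp / (tp + fn)   # of real spam, how much we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(f1, 3))
# → 0.889 0.8 0.842
```

Which metric matters most depends on the cost of each error: for spam filtering, a false positive (a real mail lost to the spam folder) usually hurts more than a false negative, so precision tends to be weighted heavily.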
You now have a complete picture and a good pipeline to get you started.
I hope this was helpful and that I was able to put things down in a simple way. Please feel free to reach out to me on Twitter @AashishLChaubey in case you need more clarity or have any suggestions.
Until next time...