AutoNLP for Automating Twitter Sentiment Analysis

#nlp #machinelearning #algorithms #computerscience

What is NLP?

Natural Language Processing or NLP is a field of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages.

NLP is booming in the healthcare industry in particular. As healthcare organizations increasingly adopt electronic health records, the technology is improving care delivery and disease diagnosis while bringing costs down. Better clinical documentation means that patients can be better understood and can benefit from better care. The goal should be to optimize their experience, and several organizations are already working on this.

Figure: Number of publications containing the phrase “natural language processing” in PubMed in the period 1978–2018. As of 2018, PubMed comprised more than 29 million citations for biomedical literature.

What is Automated Machine Learning?

Automated Machine Learning (AutoML) is currently one of the fastest-growing subfields of data science. It makes it easier to build and use machine learning models in the real world by running systematic processes on raw data and selecting the models that pull the most relevant information from that data.
It sounds great to those who are not fluent in machine learning and terrifying to current data scientists.

What is AutoNLP?

Using the concepts of AutoML, AutoNLP automates text preprocessing steps such as stemming, tokenization, and lemmatization. It also handles text processing and picks the best model for the given dataset. AutoNLP was developed as part of AutoViML, which stands for Automatic Variant Interpretable ML.
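To make the preprocessing terms concrete, here is a minimal, hand-rolled sketch of tokenization and stemming in plain Python. It is not how AutoNLP implements these steps; the `tokenize` and `naive_stem` helpers and the suffix list are hypothetical and only illustrate the idea. Lemmatization, which maps a word to its dictionary form, needs a vocabulary and is left out here.

```python
import re

# Illustrative suffix list; real stemmers (e.g. Porter) use far richer rules.
SUFFIXES = ("ing", "ly", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tweet = "Loving the new phones, they worked perfectly!"
tokens = tokenize(tweet)
stems = [naive_stem(t) for t in tokens]

print(tokens)
print(stems)
```

Note that a stem such as "lov" is not a real word. That is normal for stemming, which only aims to map related word forms to the same token.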

Some of the features of AutoNLP are:

Data cleansing: The entire dataset can be sent to the model without any prior processing such as vectorization. It even fills in missing data and cleans the data automatically.
Uses the Featuretools library for feature extraction: Featuretools is another great library that makes feature engineering and extraction easy.
Model performance and graphs are produced automatically: Just by setting the verbose parameter, the model graph and performance can be shown.
Feature reduction is automatic: With huge datasets, it becomes tough to select the best features and perform EDA. AutoNLP takes care of this.

Let's start implementing Twitter sentiment analysis using AutoNLP.

Without AutoNLP, we have to clean the data, stem and lemmatize it, vectorize it, and then choose the best model for the data.
With AutoNLP, we can do all of this in a few lines of code.
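For contrast, a manual version of this workflow might look like the sketch below. It assumes scikit-learn is installed, uses a tiny made-up corpus rather than the real dataset, and hard-codes a TF-IDF vectorizer and a Multinomial Naive Bayes classifier. Those are exactly the kinds of choices AutoNLP is described as making for us.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus for illustration only (1 = negative, 0 = positive).
tweets = [
    "i love this sunny day",
    "what a great match today",
    "thank you for the lovely gift",
    "i hate this awful service",
    "this is the worst phone ever",
    "awful hate filled comments again",
]
labels = [0, 0, 0, 1, 1, 1]

# Manual choices: the vectorizer, its settings, and the classifier.
manual_model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
manual_model.fit(tweets, labels)

prediction = manual_model.predict(["i hate this awful day"])
print(prediction)
```

Every setting here (stop words, vectorizer type, classifier) would normally have to be compared by hand against alternatives.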

Installing AutoNLP

To install it, we can use a simple pip command. Since AutoNLP is part of autoviml, that is the package we need to install.

!pip install autoviml

After installation, you can download the dataset from here.

Let's look at our dataset:

import pandas as pd
df = pd.read_csv('../input/twitter-sentiment-analysis-analytics-vidya/train_E6oV3lV.csv')
df.head()

Model

Now we will split the data into training and test sets and use AutoNLP to build our model.

from sklearn.model_selection import train_test_split
from autoviml.Auto_NLP import Auto_NLP
train, test = train_test_split(df, test_size=0.2)

This splits the dataset into an 80% training set and a 20% test set.
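If the classes turn out to be imbalanced, a plain random split can leave very few minority examples in the test set. The sketch below, which uses made-up data with the same column names as this post, shows one way to check the class balance and keep it in both parts with the `stratify` argument. The `random_state` value is arbitrary and only makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data with the same column names used in this post.
toy = pd.DataFrame({
    "tweet": [f"example tweet {i}" for i in range(20)],
    "label": [0] * 16 + [1] * 4,
})

# Share of each class in the full data.
print(toy["label"].value_counts(normalize=True))

# stratify keeps the class ratio in both parts;
# random_state makes the split repeatable.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)

print(len(toy_train), len(toy_test))
print(toy_test["label"].value_counts())
```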
Since this is a classification problem, we have to say so when calling Auto_NLP through the modeltype parameter.

input_feature, target = "tweet", "label"
train_x, test_x, final, predicted = Auto_NLP(
    input_feature, train, test, target, score_type="balanced_accuracy",
    top_num_features=200, modeltype="Classification", verbose=2, build_model=True)

If we don't set top_num_features, it takes its default value, i.e., 300. Training with a larger top_num_features is also slower.

After a few minutes, you will see the trained model along with some plots that visualize the data.
The most beautiful thing about AutoNLP is that, after choosing the best model, it performs hyperparameter tuning over 30 parameters using RandomizedSearchCV and automatically generates plots for exploratory data analysis.
You can see the results below:

After completing the training process, AutoNLP also generates a confusion matrix, which shows how well our classifier performed on the dataset.
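To show how a confusion matrix relates to the balanced_accuracy score used above, here is a small worked example in plain Python. The labels and predictions are made up, and the helper functions are hypothetical. They follow the standard definition of balanced accuracy for two classes: the average of the recall on each class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def balanced_accuracy(y_true, y_pred):
    """Average of the recall on the positive and the negative class."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2


# Made-up, imbalanced example: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)

print(tn, fp, fn, tp)                     # 7 1 1 1
print(accuracy)                           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.6875
```

Plain accuracy looks good here (0.8) because most examples are negative, while balanced accuracy (0.6875) reflects that only one of the two positive examples was found.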

AutoNLP has selected Multinomial NB as the classifier and trained it.
Note: If top_num_features had not been given, a Random Forest algorithm would have been used.
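To make the randomized hyperparameter search mentioned above more concrete, here is a minimal sketch using scikit-learn's RandomizedSearchCV on a TF-IDF plus Multinomial NB pipeline. This is not AutoNLP's internal code. The toy corpus, the parameter grid, and the values of n_iter and cv are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus for illustration only (1 = negative, 0 = positive).
tweets = [
    "i love this sunny day", "what a great match today",
    "thank you for the lovely gift", "such a happy great morning",
    "love the lovely weather today", "great day with happy friends",
    "i hate this awful service", "this is the worst phone ever",
    "awful hate filled comments again", "worst service and awful food",
    "i hate the terrible traffic", "terrible phone and worst support",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Candidate values; step names prefix the parameter names.
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.01, 0.1, 0.5, 1.0],
}

# Try 5 random combinations, each scored with 3-fold cross-validation.
search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,
    scoring="balanced_accuracy",
    cv=3,
    random_state=0,
)
search.fit(tweets, labels)

print(search.best_params_)
print(search.best_score_)
```

Unlike an exhaustive grid search, a randomized search samples only n_iter combinations, which keeps tuning time bounded when there are many parameters.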

Predictions

You can make predictions as follows:

final.predict(test_x[input_feature])

Conclusion

We saw how AutoNLP cleans, preprocesses, and vectorizes the data, generates plots for visualization, and performs hyperparameter tuning for the best model. It also uses cross-validation to avoid overfitting the model.
But we can't say this is the best approach to classification in NLP, because we are living in the era of Transformers, which give state-of-the-art results in Natural Language Processing. There are many Transformer models, such as Google's BERT (Bidirectional Encoder Representations from Transformers), GPT-2, and XLM.

Thank you! I hope this helps clarify the concept behind automating NLP. If you liked this post, then please do give me a few ❤️.

The full code is here.