Intro
Sentiment analysis is a technique for determining the emotional tone behind a piece of text. For example, a business can use sentiment analysis to classify reviews as positive, negative or neutral.
By analysing online reviews, a business can learn the sentiment behind each review and identify the themes that come up most often. Based on these insights, a business, organisation or individual can make informed decisions about their operations.
Advances in technology have made it possible for systems to learn how to perform tasks through Artificial Intelligence (AI). This means a system can also be taught to perform sentiment analysis, removing the need for a human to analyse the data repetitively.
In this article, we will briefly go over how to get a computer to perform sentiment analysis on its own using machine learning algorithms.
Dataset
To do this, we will use Sentiment140, a collection of about 1.6 million tweets hosted on Kaggle.
The tweets were collected between April and June 2009 using the Twitter API and were labeled automatically using the emoticons they contained: tweets with positive emoticons like :) were labeled positive, and tweets with negative emoticons like :( were labeled negative. The training file used here contains only these two classes, encoded as 0 (negative) and 4 (positive); it has no neutral tweets.
The Sentiment140 dataset is commonly used in research and industry because of its large size and its sentiment polarity labels. Researchers and practitioners can use it to develop and evaluate machine learning models for tasks such as sentiment classification or sentiment regression.
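To make the emoticon-based labeling described above concrete, here is a minimal sketch of the idea. The emoticon lists and the function name label_by_emoticon are assumptions for illustration, not the actual rules used to build Sentiment140; the 0 and 4 codes follow the dataset's label convention.

```python
from typing import Optional

# Hypothetical emoticon lists, chosen only for illustration
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")

def label_by_emoticon(tweet: str) -> Optional[int]:
    """Return 4 (positive), 0 (negative), or None when the signal is absent or mixed."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:  # neither kind present, or both: no clear label
        return None
    return 4 if has_pos else 0

print(label_by_emoticon("Loving the new phone :)"))  # 4
print(label_by_emoticon("Missed my bus :("))         # 0
print(label_by_emoticon("Just landed in Nairobi"))   # None
```

Labels produced this way are noisy, since an emoticon does not always match the tone of the text, which is worth keeping in mind when reading the evaluation scores later on.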
Implementation
As with any data science project, the analysis follows a series of general steps. In this case, the steps are:
1. Data Collection
Instead of downloading the data to the local machine, we will pull the dataset from Kaggle directly into Colab, where the analysis will run.
Authenticate the Kaggle API client
import os
# Get the username and key from your Kaggle account
os.environ['KAGGLE_USERNAME'] = "username"
os.environ['KAGGLE_KEY'] = "key"
Download and unzip the dataset from Kaggle
!kaggle datasets download -d kazanova/sentiment140
# Unzip the downloaded dataset
!unzip sentiment140
Load the downloaded dataset
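import pandas as pd
# The CSV has no header row, so header=None keeps the first tweet as data
tweets_df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', header=None)
tweets_df.head()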
2. Data Pre-Processing
The next step is to preprocess the data: clean it and convert it into a structured format that can be used for analysis.
# Assign a list of column names through the .columns attribute
tweets_df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']
tweets_df.head()
Pre-process the text column using regular expressions to remove punctuation, special characters, URLs, hashtag symbols and usernames, then convert everything to lowercase and remove stop words.
Before making any structural changes to the dataset, I created a copy of the original and worked on the copy.
# import NLTK, Natural Language Toolkit, library
# This library provides good tools for loading and cleaning text
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Define a function to pre-process and clean the text data
def clean_text(text):
    text = re.sub(r'http\S+', '', text) # Remove URLs
    text = re.sub(r'@[^\s]+', '', text) # Remove usernames
    text = re.sub(r'#([^\s]+)', r'\1', text) # Remove the '#' but keep the hashtag word
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = text.lower() # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words]) # Remove stopwords
    return text
# Work on a copy so the original dataframe stays untouched
tweets_cp = tweets_df.copy()
# Apply clean_text to the text column, then drop the original text column
tweets_cp['clean_text'] = tweets_cp['text'].apply(clean_text)
tweets_cp = tweets_cp.drop(columns=['text'])
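To see what the cleaning steps do to a single tweet, here is a small standalone sketch. It repeats the clean_text logic but swaps NLTK's English stop-word list for a tiny hand-picked set (an assumption made so the snippet runs without downloading NLTK data); the sample tweet is made up.

```python
import re

# Tiny stand-in stop-word set; the article uses NLTK's full English list
STOP_WORDS = {"i", "the", "a", "is", "to", "at", "my", "and", "so"}

def clean_text(text: str) -> str:
    text = re.sub(r'http\S+', '', text)       # remove URLs
    text = re.sub(r'@[^\s]+', '', text)       # remove @usernames
    text = re.sub(r'#([^\s]+)', r'\1', text)  # drop '#', keep the hashtag word
    text = re.sub(r'[^\w\s]', '', text)       # remove punctuation
    text = text.lower()                       # lowercase before stop-word matching
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

sample = "@anna I LOVE the new update!!! http://t.co/abc #happy"
print(clean_text(sample))  # love new update happy
```

One thing to watch with this ordering: punctuation is stripped before stop words are filtered, so a contraction such as "don't" becomes "dont" and may no longer match an entry in the stop-word list.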
3. Feature Extraction
After preprocessing, convert the cleaned text into a numerical format that a model can work with. One common technique is TF-IDF (Term Frequency-Inverse Document Frequency), which measures how relevant a word is to a document relative to a collection of documents (the corpus): a word scores high when it appears often in the document but rarely across the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Convert the text data into numerical features using TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X = tfidf.fit_transform(tweets_cp['clean_text'])
# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, tweets_cp['target'], test_size=0.3, random_state=42)
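To build some intuition for the TF-IDF weights used above, here is a hand-rolled sketch on a three-document toy corpus. It uses the textbook formula (term frequency times the log of inverse document frequency). scikit-learn's TfidfVectorizer applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ; the toy corpus is made up.

```python
import math
from collections import Counter

docs = [
    "love this phone",
    "hate this phone",
    "love love sunny days",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: (count / doc length) * ln(n_docs / document frequency)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

weights = [tfidf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {term: round(score, 3) for term, score in w.items()})
```

In the second document, "hate" gets a higher weight than "this" or "phone" because it appears in only one document, which is exactly the kind of distinguishing word a sentiment classifier benefits from.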
4. Model Selection
The next step is to pick an appropriate machine learning algorithm to classify the sentiment of the tweet text. In this case we will use Naive Bayes.
# Instantiate a Multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
5. Model Training
We will train the model on the labeled training set that we split off in the Feature Extraction step, then use it to predict labels for the held-out test set.
# Train the Naive Bayes classifier on the training data
nb.fit(X_train, y_train)
# Predict labels for the testing data
y_pred = nb.predict(X_test)
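For readers curious about what a Multinomial Naive Bayes classifier does internally, here is a minimal from-scratch sketch on raw word counts with Laplace smoothing. The function names train_nb and predict_nb and the toy training data are assumptions for illustration; this is not scikit-learn's implementation, and the article's model is fed TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and word counts from tokenized text."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Return the label with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n / total_docs)  # log prior of the class
        for word in doc.split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Laplace-smoothed log likelihood of the word given the class
            score += math.log(
                (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train_docs = ["love this phone", "great sunny day", "hate mondays", "awful traffic hate"]
train_labels = [4, 4, 0, 0]  # 4 = positive, 0 = negative, as in Sentiment140
model = train_nb(train_docs, train_labels)
print(predict_nb(model, "hate traffic"))  # 0
print(predict_nb(model, "love sunny"))    # 4
```

The smoothing term alpha keeps a word that never appeared with a class from zeroing out that class's probability, which matters a lot for short texts like tweets.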
6. Model Evaluation
After training the model, we need to evaluate its performance on the test set (30% of the original dataset) that we split off in the Feature Extraction step.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Labels are 0 (negative) and 4 (positive), so 4 is passed as the positive class
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, pos_label=4))
print('Recall:', recall_score(y_test, y_pred, pos_label=4))
print('F1-Score:', f1_score(y_test, y_pred, pos_label=4))
Without any fine-tuning, the model reaches an accuracy of about 75% and a precision of about 76%.
Evaluation Score
Accuracy: 0.7511354166666667
Precision: 0.7564523638210522
Recall: 0.7427183457378064
F1-Score: 0.7495224455818614
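To show how these four scores relate to each other, here is a small sketch that computes them by hand from true and predicted labels, treating 4 as the positive class. The helper name binary_metrics and the eight example labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred, pos_label=4):
    """Return (accuracy, precision, recall, f1) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # true positives
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # false positives
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = [4, 4, 4, 0, 0, 0, 4, 0]
y_pred = [4, 4, 0, 0, 0, 4, 4, 0]
print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```

Precision answers "of the tweets predicted positive, how many really were?", recall answers "of the truly positive tweets, how many were found?", and F1 is the harmonic mean of the two.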
7. Predict
We will try to predict the sentiment of a new tweet using the model we have trained, tested and evaluated.
new_tweet = 'I hate Mondays'
new_tweet_cleaned = clean_text(new_tweet)
new_tweet_vectorized = tfidf.transform([new_tweet_cleaned])
sentiment = nb.predict(new_tweet_vectorized)[0]
print('Sentiment:', sentiment)
The model predicts that the new tweet has a negative tone (label 0).
Sentiment: 0
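Since the raw prediction is a numeric code, a small lookup can make the output easier to read. This is a sketch only: the helper name describe is hypothetical, and the mapping assumes the 0 and 4 labels present in the training file.

```python
# Label codes found in the Sentiment140 training file
LABEL_NAMES = {0: "negative", 4: "positive"}

def describe(label) -> str:
    """Map a numeric prediction (plain int or NumPy integer) to a readable name."""
    return LABEL_NAMES.get(int(label), f"unknown ({label})")

print('Sentiment:', describe(0))  # Sentiment: negative
print('Sentiment:', describe(4))  # Sentiment: positive
```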
Conclusion
Sentiment analysis can help gauge how the outside world feels about a business, product, trend and much more. Integrating machine learning models into such analysis can produce strong results. Even a simple model like the one we just built can, with some fine-tuning, inform decision-making at such an entity.
You can find the model code at this Link.
Why did the sentiment analyst's computer keep crashing? It couldn't handle all the feelings.
Exploring the Possibilities: Let's Collaborate on Your Next Data Venture! You can check me out at this Link