ICCHA Technologies

Sentiment Analysis: A Practical Guide to Data Collection, Preparation, and Model Training

Sentiment analysis has become an essential tool in data science and machine learning. It lets us uncover the sentiments expressed in text data, which is valuable for tasks such as analysing product reviews, customer feedback, or social media posts. In this guide, we will walk through the steps of building a sentiment analysis model, from data collection to model evaluation. For this example, we will focus on product reviews.

Data Collection and Preparation

The first step is to collect a dataset containing product reviews. Online platforms like Amazon or Yelp are good sources for such data, as are publicly available datasets on platforms such as Kaggle or UCI Machine Learning Repository. For this example, we will use a small synthetic dataset of product reviews. We'll make sure to include some 'dirty' elements, such as missing values, irrelevant information, and unclean text.

import pandas as pd
import numpy as np

# Creating a synthetic dataset
data = {
    'user_id': ['user1', 'user2', 'user3', 'user4', 'user5', 'user6'],
    'timestamp': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06'],
    'review': ['This product is amazing!', 'I hated it, would not recommend.', 'Best purchase I\'ve ever made!!!', 'Absolutely terrible. Stay away!', 'Pretty good.', np.nan],
    'rating': [5, 1, 5, 1, 4, 3],
}

df = pd.DataFrame(data)

After collecting the dataset, the next step is data preprocessing. This step is critical as the quality of data affects the performance of the machine learning model. Here are some common preprocessing steps:

Remove Irrelevant Information

First, we will drop the columns that are not relevant to the sentiment of the review. In this case, the 'user_id' and 'timestamp' columns:

df = df.drop(['user_id', 'timestamp'], axis=1)

Handle Missing Values

Next, we need to handle the missing review text. Since we can't infer the sentiment of a missing review, we'll drop any rows with missing 'review' values:

df = df.dropna(subset=['review'])

Text Cleaning

We'll now clean the text data. This involves removing punctuation, converting all text to lowercase, and removing unnecessary whitespace. Let's use pandas' vectorised string methods:

df['review'] = df['review'].str.lower()  # Convert to lowercase
df['review'] = df['review'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation (regex=True is needed in recent pandas)
df['review'] = df['review'].str.strip()  # Remove unnecessary whitespace

Now, our data is clean and ready for labelling.

Labelling the Data

In this step, we assign labels to the reviews indicating whether they are positive or negative. For our product reviews, we could use a simple criterion like the star rating: reviews with 4 or 5 stars are labelled as 'positive', and reviews with 1 or 2 stars as 'negative'. For more nuanced criteria, you could manually review and label a subset of the data or use a pre-trained sentiment analysis model to assign initial labels. We'll create a new column, 'label', to store these:

# 4-5 stars -> 'positive'; 1-2 stars -> 'negative' (the only 3-star row was already dropped with its missing review)
df['label'] = df['rating'].apply(lambda x: 'positive' if x > 3 else 'negative')
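
As an aside, if you had no star ratings to lean on, one option mentioned above is to bootstrap labels with a pre-trained sentiment model. Here is a minimal sketch using NLTK's VADER analyser; it assumes nltk is installed, downloads the VADER lexicon on first use, and the resulting 'vader_label' column is purely illustrative:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# VADER's compound score ranges from -1 (most negative) to +1 (most positive)
df['vader_label'] = df['review'].apply(
    lambda text: 'positive' if sia.polarity_scores(text)['compound'] >= 0 else 'negative'
)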

Our data is now preprocessed and labelled, and ready for feature extraction.

Feature Engineering

Great! Now that we have preprocessed and labelled our data, the next step is to extract features from our text data that will be used to train the machine learning model.

Imagine you're a detective investigating a crime. You have a bunch of evidence, including fingerprints, footprints, DNA samples, and security camera footage. Your goal is to extract useful features from these pieces of evidence to help you identify the culprit.

In sentiment analysis, feature engineering is similar to the detective's task. Instead of physical evidence, we have a collection of text documents (such as reviews, tweets, or customer feedback) that express opinions or sentiments. The aim is to extract meaningful and relevant features from this text data that can be used to determine the sentiment associated with each document.

Feature extraction involves creating informative features from the text data that can help classify the sentiment of the reviews. Here are a few common text feature extraction methods:

  • Bag of Words: The simplest form of text representation, where each document is represented as a vector of raw word counts over the vocabulary (a minimal sketch follows this list).
  • TF-IDF: A more sophisticated method that weighs each word by how distinctive it is to a document relative to the entire corpus.
  • Word Embeddings: A more advanced method where each word is represented as a dense vector that captures semantic relationships between words.
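
For intuition, here is what the Bag of Words option could look like with Scikit-Learn's CountVectorizer. This block is illustrative only; the rest of the guide uses TF-IDF:

from sklearn.feature_extraction.text import CountVectorizer

# Each row is a review, each column a vocabulary word, each cell a raw count
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(df['review'])
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())
print(bow_df)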

We'll use the TF-IDF (Term Frequency-Inverse Document Frequency) method for this task. TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus. We will use the TfidfVectorizer from the Scikit-Learn library. This will convert our text data into a matrix of TF-IDF features.

from sklearn.feature_extraction.text import TfidfVectorizer

# Create a vectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the 'review' column
tfidf_matrix = vectorizer.fit_transform(df['review'])

# Convert the matrix into a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

Splitting the Data

Now that we've transformed our text data into numerical features using TF-IDF, we're ready to split our data into training and testing sets. This is an essential step in building any machine learning model. The model is trained on the training set and then evaluated on the unseen testing set. This process helps us understand how well our model can generalize to new, unseen data.

We'll use the train_test_split function from Scikit-Learn to do this. We'll follow the common practice of using 80% of the data for training and 20% for testing.

from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(tfidf_df, df['label'], test_size=0.2, random_state=42)

Let's delve deeper into the details of what's happening here:

train_test_split function: This function from Scikit-learn is used to divide our dataset into training and testing sets. It takes in several parameters:

  • tfidf_df: This is the first parameter, representing the features our model will learn from.

  • df['label']: The second parameter, which is the target variable or labels that our model will aim to predict.

  • test_size=0.2: An optional parameter that specifies the proportion of the dataset to be included in the test split. Here, we have set it to 0.2, implying that 20% of the data will be used for testing.

  • random_state=42: Another optional parameter that manages the shuffling applied to the data prior to executing the split. The number 42 is arbitrary and can be set to any integer. It ensures that the splits you generate are reproducible.

Outputs of the train_test_split function:

  • X_train and X_test: These are the portions of tfidf_df assigned for training and testing, respectively.

  • y_train and y_test: Similarly, these are the portions of df['label'] assigned for training and testing, respectively (a quick shape check follows below).
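
As a quick optional sanity check, you can print the shapes of these four outputs. Keep in mind that with our tiny five-review dataset, a 20% test split leaves just one review for testing:

# Confirm the sizes of the splits (rows x TF-IDF columns for X, rows for y)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)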

Model Selection and Training

At this stage, we need to pick a suitable machine learning algorithm for our sentiment classification task. Here are a few potential options:

  • Logistic Regression: This is a simple, yet potent algorithm for binary classification problems. It's efficient to train and usually offers good baseline performance.

  • Naive Bayes: This is a probabilistic classifier that makes strong independence assumptions. It tends to be particularly effective on high-dimensional text data (a brief sketch follows this list).

  • Decision Trees: These models are straightforward to understand and interpret, and they can handle both numerical and categorical data.

  • Deep Learning Models: For more intricate tasks or larger datasets, deep learning models like Recurrent Neural Networks (RNNs) or Transformers can deliver superior results. However, they demand more computational resources and additional time for training.
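
If you wanted to try Naive Bayes instead, a minimal sketch with Scikit-Learn's MultinomialNB (which accepts the same non-negative TF-IDF features) would look like this; we will stick with Logistic Regression below:

from sklearn.naive_bayes import MultinomialNB

# Train a Multinomial Naive Bayes classifier on the same TF-IDF features
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)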

After deciding on a model, we'll train it using our labeled training data and the features we extracted earlier.

For our specific task, we're going to use a Logistic Regression model. Logistic Regression is often a good starting point for text classification tasks due to its simplicity and efficiency. Let's proceed to train a Logistic Regression model with Scikit-Learn:

from sklearn.linear_model import LogisticRegression

# Create a Logistic Regression model
model = LogisticRegression()

# Train the model on our training data
model.fit(X_train, y_train)

In the above code, we first import the LogisticRegression class from the Scikit-Learn library. We then create an instance of this class, which represents our model. The fit method is used to train the model on our training data.
At this point, our Logistic Regression model has been trained on our training data. The model has learned to associate the TF-IDF features with the 'positive' and 'negative' labels.
The next step is to use this trained model to make predictions on our testing data, and then evaluate how well it performed. Let's proceed with the model evaluation.
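
As an optional way to peek inside the trained model, you can inspect its coefficients. Since our features are TF-IDF columns, each coefficient corresponds to a word; for a binary problem, positive weights push predictions towards model.classes_[1], which is 'positive' here. This is just an inspection sketch, not part of the core workflow:

# Pair each word with its learned weight and sort from most negative to most positive
coefficients = pd.Series(model.coef_[0], index=X_train.columns).sort_values()
print(coefficients.head())  # words pulling towards 'negative'
print(coefficients.tail())  # words pulling towards 'positive'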

Model Evaluation

With our trained model ready, we can now evaluate its performance. This is done by making predictions on our test data and comparing those predictions to the actual labels.

We will use common metrics for binary classification problems to assess the model's performance: accuracy, precision, recall, and F1-score. These metrics offer different viewpoints on how well our model performs:

  • Accuracy: The proportion of all predictions that are correct.

  • Precision: The proportion of predicted positives that are actually positive.

  • Recall: This is the ratio of actual positive instances that our model correctly identified.

  • F1-score: This is the harmonic mean of precision and recall, providing a balance between these two metrics.

Let's see how to generate these metrics using Scikit-Learn:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Generate predictions on the test data
y_pred = model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='positive')
recall = recall_score(y_test, y_pred, pos_label='positive')
f1 = f1_score(y_test, y_pred, pos_label='positive')

# Print the metrics
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1-score:', f1)
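
If you want to see the raw counts these metrics are built from (true and false positives and negatives), the confusion matrix lays them out. With our single-review test set it is trivially small, but the same call applies unchanged to real datasets:

from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_test, y_pred, labels=['negative', 'positive'])
print(cm)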

With this, we've evaluated our model's performance. These metrics provide a starting point to understand how well our model is performing. Remember, in practice, the choice of metrics depends on your specific problem and objectives.
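
Before wrapping up, here is a short sketch of how you might apply the trained model to brand-new reviews; the review texts below are invented for illustration. The important detail is reusing the same fitted vectorizer with transform rather than fit_transform:

# A couple of made-up reviews to illustrate inference on unseen text
new_reviews = pd.Series(['Great value, works perfectly!', 'Broke after two days, very disappointed.'])

# Apply the same cleaning steps used on the training data
new_reviews = new_reviews.str.lower().str.replace(r'[^\w\s]', '', regex=True).str.strip()

# Reuse the fitted vectorizer (transform, not fit_transform), then predict
new_features = pd.DataFrame(vectorizer.transform(new_reviews).toarray(),
                            columns=vectorizer.get_feature_names_out())
print(model.predict(new_features))
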
This completes our guide to building a sentiment analysis model. Remember that this is a simplified example. Real-world data will often require more extensive preprocessing and feature engineering, and you may need to try multiple models or tune their parameters to achieve optimal performance.
