Don't blindly remove STOPWORDS for a Sentiment Analysis Model

Sunil Aleti ・3 min read

Does removing stopwords really improve model performance?

Hey Peeps!!
Before building any model, data preprocessing is a must.
Data preprocessing includes data cleaning, data transformation, and data reduction.

Data Cleaning:

It involves handling missing data, noisy data, etc.
  • Missing Data
  • Noisy Data

Data Transformation:

This step transforms the data into a form suitable for the mining process.
  • Normalization

Data Reduction:

When working with a huge volume of data, analysis becomes harder. To deal with this, we use data reduction techniques.
  • Dimensionality Reduction

I started working on the Amazon Fine Food Reviews dataset, which I got from Kaggle. The main objective of this model is to determine whether a review is positive or negative, so I began with data preprocessing before training the model.

Steps in preprocessing:

  • Begin by removing the HTML tags.
  • Remove punctuation and a limited set of special characters such as , or .
  • Check that the word is made up of English letters only and is not alphanumeric.
  • Check that the word is longer than 2 characters (on the assumption that there are no 2-letter adjectives).
  • Convert the word to lowercase.
  • Remove stopwords.
  • Finally, apply Snowball stemming to the word.
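
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the exact code used for the model: the function name `clean_review`, the tiny `demo_stop_words` set, and the identity default for `stem` are all assumptions made for the example.

```python
import re

def clean_review(text, stop_words, stem=lambda word: word):
    """Apply the listed preprocessing steps and return the kept tokens."""
    text = re.sub(r"<[^>]+>", " ", text)   # 1. strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)   # 2. replace punctuation with spaces
    tokens = []
    for word in text.split():
        # 3. keep only words made of English letters (drops alphanumerics)
        if not re.fullmatch(r"[A-Za-z]+", word):
            continue
        # 4. keep only words longer than 2 characters
        if len(word) <= 2:
            continue
        word = word.lower()                # 5. lowercase
        if word in stop_words:             # 6. drop stopwords
            continue
        tokens.append(stem(word))          # 7. stem (identity by default)
    return tokens

# Tiny hand-written stopword set, for illustration only.
demo_stop_words = {"the", "is", "not"}
print(clean_review("<p>The product is NOT good!</p>", demo_stop_words))
# ['product', 'good']
```

To get closer to the real pipeline, you could pass NLTK's English stopword set as `stop_words` and `nltk.stem.SnowballStemmer('english').stem` as `stem`. Notice that even this toy run already drops "not".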

I used the Naive Bayes algorithm to train a model on this dataset and tested it. Unfortunately, the model underperformed.
After reviewing it, I found that the cause was the stopword removal. Yes, you heard that right.

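The most common way to remove stopwords is to use NLTK's stopword list.

import nltk
from nltk.corpus import stopwords  # needed for stopwords.words() below

nltk.download('stopwords')  # fetch the stopword corpus (only needed once)
stop_words = set(stopwords.words('english'))
sno = nltk.stem.SnowballStemmer('english')  # stemmer for the final preprocessing step
print(stop_words)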

The main objective of this model is to determine whether a given review is positive or negative. But removing stopwords also removes negation words such as "not", which can flip the meaning of a review from negative to positive.
Example:

Before stopword removal | After stopword removal
The product is really very good (Positive) | product really good (Positive)
The products seems to be good. (Positive) | products seems good (Positive)
Good product I really liked it (Positive) | Good product really liked (Positive)
I didn’t like the product (Negative) | like product (Positive)
The product is not good (Negative) | product good (Positive)

We can see that after stopword removal, the negative reviews have turned into positive ones.

A bit scary, right?
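
You can reproduce the effect from the table with a few lines of Python. This is just a sketch: `STOP_WORDS` here is a small hand-written set standing in for NLTK's English list (so the example runs without downloading anything), and `remove_stopwords` is a hypothetical helper name.

```python
# Hand-written stand-in for NLTK's English stopword list (assumption:
# the real list is much longer, but it also contains negations like "not").
STOP_WORDS = {"the", "is", "not", "i", "it", "to", "be", "very", "didn't"}

def remove_stopwords(sentence, stop_words=STOP_WORDS):
    """Drop every word whose lowercase form is in stop_words."""
    kept = [w for w in sentence.split() if w.lower() not in stop_words]
    return " ".join(kept)

reviews = [
    "The product is really very good",
    "I didn't like the product",
    "The product is not good",
]
for review in reviews:
    print(f"{review!r} -> {remove_stopwords(review)!r}")
# 'The product is really very good' -> 'product really good'
# "I didn't like the product" -> 'like product'
# 'The product is not good' -> 'product good'
```

The last two outputs are exactly the problem: the negation is gone and the remaining words look positive.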

If you are working with basic NLP techniques like BOW, W2V, or TF-IDF (Term Frequency and Inverse Document Frequency), then removing stopwords is a good idea, because stopwords act like noise for these methods. For sentiment analysis, though, it is better to create your own stopword list that keeps negation words, or to import NLP from nlppreprocess:

from nlppreprocess import NLP
import pandas as pd

nlp = NLP()
df = pd.read_csv('some_file.csv')
df['text'] = df['text'].apply(nlp.process)
or

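import re

def decontracted(phrase):
    # Handle the irregular negations before the generic n't rule
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # Expand n't to " not" so the negation survives as its own word
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase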
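
The other option mentioned above, creating your own stopword list, could look something like this. Treat it as a sketch under assumptions: `NEGATIONS`, `keep_negations`, `expand_negation`, and `preprocess` are hypothetical names, `base` is a tiny hand-written stopword set used instead of NLTK's list, and `expand_negation` is a compact stand-in for a contraction expander that covers only the negation cases.

```python
import re

# Negation words we never want to throw away (illustrative choice).
NEGATIONS = {"no", "nor", "not", "never"}

def keep_negations(stop_words, negations=NEGATIONS):
    """Return a copy of stop_words with the negation words taken out."""
    return set(stop_words) - set(negations)

def expand_negation(text):
    """Turn n't contractions into a separate 'not' token."""
    text = re.sub(r"won['’]t", "will not", text)
    text = re.sub(r"can['’]t", "can not", text)
    return re.sub(r"n['’]t", " not", text)

def preprocess(text, stop_words):
    """Lowercase, expand negations, then drop the remaining stopwords."""
    words = expand_negation(text.lower()).split()
    return " ".join(w for w in words if w not in stop_words)

# Hand-written stand-in for a full stopword list.
base = {"i", "the", "is", "it", "did", "not", "no", "will", "can"}
custom = keep_negations(base)

print(preprocess("I didn't like the product", custom))  # not like product
print(preprocess("The product is not good", custom))    # product not good
```

With the real NLTK list you would call `keep_negations(stopwords.words('english'))` instead of using `base`. Expanding the contraction first matters, because otherwise "didn't" never produces a "not" token to keep.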

Now, it seems reasonable to use the nlppreprocess package for removing stopwords and other preprocessing.
Let me know your opinion on this in the comment section.

Hope it's useful
A ❤️ would be Awesome 😊
