Does removing stopwords really improve model performance?
Hey Peeps!!
Before building any model, data preprocessing is a must.
Data preprocessing includes Data Cleaning, Data Transformation, and Data Reduction.

Data Cleaning:
It involves handling missing data, noisy data, etc.
- Missing Data
- Noisy Data

Data Transformation:
This step transforms the data into forms suitable for the mining process.
- Normalization

Data Reduction:
While working with a huge volume of data, analysis becomes harder. To get around this, we use data reduction techniques.
- Dimensionality Reduction
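As a quick illustration, here is a minimal sketch of all three steps on a toy pandas DataFrame (the columns and values are invented for this example):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

df = pd.DataFrame({'price': [10.0, None, 30.0, 25.0],
                   'rating': [4.0, 3.5, None, 5.0]})

# Data Cleaning: fill missing values with the column mean
df = df.fillna(df.mean(numeric_only=True))

# Data Transformation: normalize each feature to the [0, 1] range
scaled = MinMaxScaler().fit_transform(df)

# Data Reduction: project the two features down to one dimension
reduced = PCA(n_components=1).fit_transform(scaled)
```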
Steps in preprocessing (a code sketch of these steps follows the list):
- Begin by removing the HTML tags.
- Remove any punctuation or a limited set of special characters like , or . etc.
- Check if the word is made up of English letters and is not alphanumeric.
- Check if the length of the word is greater than 2 (research suggests there are no two-letter adjectives).
- Convert the word to lowercase.
- Remove stopwords.
- Finally, Snowball-stem the word.
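Here is that pipeline as code (a minimal sketch; the regexes and the sample review are my own illustration):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
sno = nltk.stem.SnowballStemmer('english')

def preprocess(review):
    review = re.sub(r'<[^>]+>', ' ', review)      # remove HTML tags
    review = re.sub(r'[^a-zA-Z\s]', ' ', review)  # remove punctuation/special characters
    words = []
    for word in review.split():
        # keep purely alphabetic words longer than 2 letters
        if word.isalpha() and len(word) > 2:
            word = word.lower()                   # lowercase
            if word not in stop_words:            # remove stopwords
                words.append(sno.stem(word))      # Snowball stemming
    return ' '.join(words)

print(preprocess("<p>The product is not good!</p>"))  # -> product good
```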
I then used the Naive Bayes algorithm to train a model on my dataset and tested it. Unfortunately, the model underperformed.
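Roughly, the setup looked like this (a sketch with scikit-learn's Multinomial Naive Bayes; the toy reviews and labels below stand in for the real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy stand-ins for the real, preprocessed reviews
reviews = ["product really good", "product seems good", "good product really liked",
           "like product", "product good", "product really bad"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)  # bag-of-words counts

model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["really liked product"])))
```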
After reviewing the model, I found out that it was because of removing stopwords. Yes, you heard that right!
The most common method to remove stopwords is to use NLTK's stopwords list:
```python
import nltk
from nltk.corpus import stopwords  # needed for stopwords.words below

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
sno = nltk.stem.SnowballStemmer('english')
print(stop_words)
```
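Look closely at what gets printed: the negation words are part of that default list (the exact contents vary a little between NLTK versions):

```python
print('not' in stop_words)     # True
print('no' in stop_words)      # True
print("didn't" in stop_words)  # True in recent NLTK versions
```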
The main objective of building this model is to determine whether a given review is positive or negative. But removing stopwords also removes the negation words, which literally changes the whole meaning of the review, i.e. from negative to positive.
Example:

| Before Stopwords | After Stopwords |
| --- | --- |
| The product is really very good (Positive) | product really good (Positive) |
| The product seems to be good. (Positive) | product seems good (Positive) |
| Good product I really liked it (Positive) | Good product really liked (Positive) |
| I didn't like the product (Negative) | like product (Positive) |
| The product is not good (Negative) | product good (Positive) |
We can see that after stopword removal, the negative reviews have also turned positive.
A bit scary, right?
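You can reproduce the last row yourself with the stop_words set from above:

```python
review = "The product is not good"
filtered = ' '.join(w for w in review.lower().split() if w not in stop_words)
print(filtered)  # product good
```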
If you are working with basic NLP techniques like BOW, W2V, or TF-IDF (Term Frequency and Inverse Document Frequency), then removing stopwords is a good idea because stopwords act like noise for these methods. But instead of the default list, creating a new list of your own, or importing NLP from nlppreprocess, is the better option:
```python
from nlppreprocess import NLP
import pandas as pd

nlp = NLP()

df = pd.read_csv('some_file.csv')
# clean each review with nlppreprocess's pipeline
df['text'] = df['text'].apply(nlp.process)
```
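As for creating a new list yourself, it is just set arithmetic: start from NLTK's list and keep the negations (a sketch; which words to keep is up to you):

```python
from nltk.corpus import stopwords

# keep the words that carry negation; extend this set as needed
negations = {'no', 'nor', 'not', "don't", "didn't", "isn't", "wasn't", "won't"}
custom_stop_words = set(stopwords.words('english')) - negations

review = "The product is not good"
print(' '.join(w for w in review.lower().split() if w not in custom_stop_words))
# -> product not good
```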
Expanding contractions before any stopword removal also helps preserve negations:

```python
import re

def decontracted(phrase):
    # specific contractions first
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general contractions
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
```
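For example:

```python
print(decontracted("I didn't like the product"))  # I did not like the product
```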
All in all, it seems reasonable to use this package, or a custom stopword list, for stopword removal and other preprocessing.
Let me know your opinion on this in the comments section.
Hope it's useful
A ❤️ would be Awesome 😊