CountVectorizer vs TFIDF - Logistic Regression

Recently I have become curious about how Natural Language Processing (NLP) works. If you are someone like me, then this blog could be really helpful.

When beginning with ML, we observe how tabular data is used to train an ML model: most of the columns are numeric, and the remaining text columns usually contain a single word, which is converted to numbers using techniques like one-hot encoding.

There are cases where columns contain whole sentences or even paragraphs, so different techniques are needed to convert raw text data into a computer-usable form. In this blog we are going to look at two such techniques.

What is Vectorization and Why Do We Need It

Many machine learning algorithms and almost all deep learning algorithms are not capable of processing text in its raw form; they need numerical inputs. This process of converting text data to numerical data is called vectorization. In the NLP world, such numerical representations are often referred to as embeddings.

CountVectorizer

When we use CountVectorizer we create a sparse matrix that stores the count of every word in our corpus. This is a simple but efficient way of converting text to numerical data.
Each entry in the sparse matrix records a word and its corresponding count in that particular line.

(Image: CountVectorizer document-term matrix, with distinct words along the x-axis and the document index along the y-axis)
In the above diagram, you can observe that the x-axis holds all the distinct words and the y-axis holds the sentence/line index (represented as doc). This is the sparse matrix representation, and it is what is used to train the ML model.
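To make this concrete, here is a minimal sketch with a toy corpus of my own (not the one from the diagram):

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["cats like milk", "dogs like milk too"]
cv = CountVectorizer()
matrix = cv.fit_transform(toy_corpus)

print(cv.get_feature_names_out())  # ['cats' 'dogs' 'like' 'milk' 'too']
print(matrix.toarray())            # [[1 0 1 1 0]
                                   #  [0 1 1 1 1]]

Each column is a distinct word and each row is a document, exactly as in the diagram.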

Fortunately, scikit-learn has a built-in module that we can use to generate the sparse matrix, and it is super easy to use.

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize

In addition to importing the CountVectorizer module, we are also importing word_tokenize. This is an inbuilt NLTK function that splits each sentence into chunks of words (tokens), a preprocessing step used in almost all NLP algorithms.
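To see what word_tokenize produces (note that NLTK's tokenizer data has to be downloaded once; depending on your NLTK version the resource is named punkt or punkt_tab):

import nltk
nltk.download('punkt')  # one-time download of the tokenizer model

from nltk.tokenize import word_tokenize
print(word_tokenize("hello, how are you?"))
# ['hello', ',', 'how', 'are', 'you', '?']

Notice that the punctuation marks come out as separate tokens; this matters below.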

Let's create our own sample input

corpus = [
    "hello, how are you, I am Praveen?",
    "You know football is a wonderful sport, what do you think?",
    "Opensource is something that everyone should appreciate, what do you think?"
]

This is our sample input

ctv = CountVectorizer(tokenizer=word_tokenize, token_pattern=None)

We create a CountVectorizer object and assign the word_tokenize function to the tokenizer argument, so all words and all special characters are treated as separate tokens. Passing token_pattern=None simply tells scikit-learn that we are supplying our own tokenizer (otherwise it emits a warning).
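A quick way to verify this, continuing with the corpus defined above (a sketch of my own, not from the original post): the default token pattern drops punctuation, while our word_tokenize-based vectorizer keeps it.

default_ctv = CountVectorizer()  # default token pattern ignores punctuation
default_ctv.fit(corpus)
print(',' in default_ctv.vocabulary_)  # False

ctv = CountVectorizer(tokenizer=word_tokenize, token_pattern=None)
ctv.fit(corpus)
print(',' in ctv.vocabulary_)          # True: ',' gets its own column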

ctv.fit(corpus)
corpus_transformed = ctv.transform(corpus)

The corpus_transformed variable now holds the sparse matrix; printing it lets us inspect its contents.

Our input contains 3 lines, hence the row indices 0, 1 and 2; the other number in each tuple is the unique index assigned to that word in the corpus vocabulary, and the value after the tuple is the word's count in that line.
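If you print the matrix yourself, the output looks roughly like this (a sketch; the exact indices depend on the vocabulary scikit-learn builds from your corpus):

print(corpus_transformed)
# (0, 8)   1   <- in line 0, the word with vocabulary index 8 appears once
# (0, 0)   2   <- in line 0, the word with vocabulary index 0 appears twice
# ...
# each entry is (line index, word index)  count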

print("Unique index assigned to each word : ",ctv.vocabulary_)

Unique index assigned to each word : {'hello': 8, ',': 0, 'how': 9, 'are': 4, 'you': 20, '?': 1, 'know': 11, 'football': 7, 'is': 10, 'a': 2, 'wonderful': 19, 'sport': 15, 'what': 18, 'do': 5, 'think': 17, 'opensource': 12, 'something': 14, 'that': 16, 'everyone': 6, 'should': 13, 'appreciate': 3}

Term Frequency-Inverse Document Frequency (TFIDF)

TFIDF also produces a sparse matrix, but instead of a raw count each token gets a score from the TF-IDF formula: the term frequency TF(t, d) (how often token t appears in document d) multiplied by the inverse document frequency IDF(t) (which down-weights tokens that appear in many documents). The resultant value for each token is a float.
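Here is a minimal sketch of the textbook formulation, TF-IDF(t, d) = TF(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. (Scikit-learn's TfidfVectorizer actually uses a smoothed variant, IDF(t) = ln((1 + N) / (1 + df(t))) + 1, and L2-normalizes each row, so its numbers will differ.)

import math

def tf_idf(count_in_doc, doc_length, num_docs, docs_with_token):
    tf = count_in_doc / doc_length               # how frequent the token is in this document
    idf = math.log(num_docs / docs_with_token)   # rarer tokens get a higher weight
    return tf * idf

# 'football' appears once in a 10-token document and in 1 of our 3 documents
print(tf_idf(1, 10, 3, 1))  # ~0.1099
# 'you' appears in all 3 documents, so log(3/3) = 0 and its weight vanishes
print(tf_idf(2, 10, 3, 3))  # 0.0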

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

Importing the necessary modules

corpus = [
    "hello, how are you?",
    "You know football is a wonderful sport, what do you think?",
    "Opensource is something that everyone should appreciate, what do you think?"
]

Input data that we are going to use

# To also include special characters while creating sparse matrix
tfidf = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)
tfidf.fit(corpus)

Creating an object and fitting it on our input data.

# TFIDF vectorizer
corpus_transformed = tfidf.transform(corpus)

corpus_transformed contains the generated sparse matrix, which can be used to train the model.
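As a side note, the fit and transform steps can be combined into a single call:

corpus_transformed = tfidf.fit_transform(corpus)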

print("Sparse Matrix Representation : ", corpus_transformed)

(Image: printed TF-IDF sparse matrix, with the same (line index, word index) structure as before but float TF-IDF scores instead of counts)

print("Unique index assigned to each word : ", tfidf.vocabulary_)

Unique index assigned to each word : {'hello': 8, ',': 0, 'how': 9, 'are': 4, 'you': 20, '?': 1, 'know': 11, 'football': 7, 'is': 10, 'a': 2, 'wonderful': 19, 'sport': 15, 'what': 18, 'do': 5, 'think': 17, 'opensource': 12, 'something': 14, 'that': 16, 'everyone': 6, 'should': 13, 'appreciate': 3}

Let's use the generated sparse matrix in Logistic Regression

Let's use a Kaggle dataset to perform logistic regression. The dataset we are going to use is the IMDB movie review dataset, and the task is sentiment classification: positive/negative.

import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer

Importing the necessary modules

if __name__ == "__main__":
    df = pd.read_csv("/home/praveen/Desktop/Projects/Approching_Almost_Any_ML_Prob_Book/NLP/data/IMDB Dataset.csv")

    # Converting sentiment to 1 and 0
    df.sentiment = df.sentiment.apply(lambda x: 1 if x == 'positive' else 0)

    df["kfold"] = -1

    df = df.sample(frac=1).reset_index(drop=True)

    y = df.sentiment.values

    kf = model_selection.StratifiedKFold(n_splits=5)

    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, 'kfold'] = f


We convert the sentiment column, which is our target column, to 0 and 1. We then take the k-fold cross-validation approach: the dataset is split into k folds (here, 5), and in each round k-1 folds (4) are used for training and 1 fold for validation. This is a more reliable way to evaluate our model's performance.

StratifiedKFold is a variant of k-fold in which every fold has a balanced distribution of classes.
We add a new column called kfold to the dataframe to record which fold each data point (row) belongs to.
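A quick sanity check (my own addition, not part of the original script) that the stratification worked: the fraction of positive reviews should be roughly the same in every fold.

    print(df.groupby("kfold").sentiment.mean())
    # every fold should show a similar positive ratio (about 0.5 for the IMDB dataset)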

    accuracy_list = []
    for fold_ in range(5):
        train_df = df[df.kfold != fold_].reset_index(drop=True)
        test_df = df[df.kfold == fold_].reset_index(drop=True)

        count_vec = CountVectorizer(tokenizer=word_tokenize, token_pattern=None)
        count_vec.fit(train_df.review)

        xtrain = count_vec.transform(train_df.review)
        xtest = count_vec.transform(test_df.review)

        model = linear_model.LogisticRegression()

        model.fit(xtrain, train_df.sentiment)

        preds = model.predict(xtest)

        accuracy = metrics.accuracy_score(test_df.sentiment, preds)
        accuracy_list.append(accuracy)

        print(f"Fold : {fold_}")
        print(f"Accuracy : {accuracy}")
        print("")

    for i in range(5):
        print(f"Fold : {i}, Accuracy : {accuracy_list[i]}")

We use CountVectorizer to transform the reviews into vectors and logistic regression to classify the sentiment. We use accuracy to evaluate our model's performance.

(Image: per-fold accuracy output for CountVectorizer + logistic regression)

We can observe that the model reaches almost 90 percent accuracy. That was easy, right?

Same implementation using TFIDF

import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == "__main__":
    df = pd.read_csv("/home/praveen/Desktop/Projects/Approching_Almost_Any_ML_Prob_Book/NLP/data/IMDB Dataset.csv")

    # Converting sentiment to 1 and 0
    df.sentiment = df.sentiment.apply(lambda x: 1 if x == 'positive' else 0)

    df["kfold"] = -1

    df = df.sample(frac=1).reset_index(drop=True)

    y = df.sentiment.values

    kf = model_selection.StratifiedKFold(n_splits=5)

    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, 'kfold'] = f


    accuracy_list = []
    for fold_ in range(5):
        train_df = df[df.kfold != fold_].reset_index(drop=True)
        test_df = df[df.kfold == fold_].reset_index(drop=True)

        tfidf_vec = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)
        tfidf_vec.fit(train_df.review)

        xtrain = tfidf_vec.transform(train_df.review)
        xtest = tfidf_vec.transform(test_df.review)

        model = linear_model.LogisticRegression()

        model.fit(xtrain, train_df.sentiment)

        preds = model.predict(xtest)

        accuracy = metrics.accuracy_score(test_df.sentiment, preds)
        accuracy_list.append(accuracy)

        print(f"Fold : {fold_}")
        print(f"Accuracy : {accuracy}")
        print("")

    for i in range(5):
        print(f"Fold : {i}, Accuracy : {accuracy_list[i]}")


(Image: per-fold accuracy output for TFIDF + logistic regression)
TFIDF is also close to 90 percent accuracy.

With this I conclude. In this blog we have seen how sentences and words are converted to vectors and how those vectors are used to tackle a classification problem.

Github : https://github.com/praveenr2998/Approching_Almost_Any_ML_Prob_Book/tree/main/NLP
Book : https://github.com/abhishekkrthakur/approachingalmost/blob/master/AAAMLP.pdf (a generous author :)))
LinkedIn : profile

See you in the next blog, bye...
