Not sure how NLP works?
Read this blog to get clear on the basic definitions related to NLP while working through a mini-project.
Let’s begin with basic definitions:
Text corpus or corpora
The language data that all NLP tasks depend upon is called the text corpus or simply corpus. A corpus is a large set of text data in a language such as English or French. The corpus can consist of a single document or a collection of documents. The source of the text corpus can be social network sites like Twitter, blog sites, open discussion forums like Stack Overflow, books, and several others. Some tasks, like machine translation, require a multilingual corpus. For example, we might need both the English and French versions of the same document content to develop a machine translation model. For speech tasks, we would also need human voice recordings and the corresponding transcribed corpus.
A paragraph is the largest unit of text handled by an NLP task. Paragraph-level boundaries by themselves may not be of much use unless the paragraph is broken down into sentences, though sometimes a paragraph may be treated as a context boundary. Tokenizers that can split a document into paragraphs are available in some Python libraries.
Sentences are the next level of lexical unit of language data. A sentence encapsulates a complete meaning or thought and context. It is usually extracted from a paragraph based on boundaries determined by punctuation such as the period. A sentence may also convey an opinion or sentiment. In general, sentences consist of parts of speech (POS) entities like nouns, verbs, adjectives, and so on. There are tokenizers available to split paragraphs into sentences based on punctuation.
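To make punctuation-based sentence splitting concrete, here is a minimal sketch using only Python's standard library. The helper name split_sentences and the sample paragraph are made up for illustration, and the rule (split after ".", "!" or "?" followed by whitespace) is a deliberate simplification, not what a production tokenizer does.

```python
import re

def split_sentences(paragraph):
    """Split a paragraph after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

# Hypothetical sample text, chosen only for illustration.
paragraph = ("Tomorrow is going to be a rainy day. Bring an umbrella! "
             "Will it clear up by evening?")
sentences = split_sentences(paragraph)

for s in sentences:
    print(s)
```

A rule this naive will wrongly split after abbreviations such as "Dr.", which is one reason library tokenizers are usually preferred for real text.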
Phrases and words
Phrases are groups of consecutive words within a sentence that convey a specific meaning. For example, in the sentence “Tomorrow is going to be a rainy day”, the part “going to be a rainy day” expresses a specific thought. Some NLP tasks extract key phrases from sentences for search and retrieval applications. The next smallest unit of text is the word. Common tokenizers split sentences into words based on delimiters such as spaces and commas. One of the problems in NLP is ambiguity: the same word can mean different things in different contexts. We will later see how this is handled well when we discuss word embeddings.
A sequence of characters or words forms an N-gram. For example, a character unigram consists of a single character, a character bigram consists of a sequence of two characters, and so on. Similarly, word N-grams consist of sequences of n words. In NLP, N-grams are used as features for tasks like text classification.
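The N-gram definition above can be sketched in a few lines of plain Python. The function name ngrams is an illustrative choice, and words are obtained with a simple whitespace split, which is an assumption rather than a recommended tokenizer.

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "Tomorrow is going to be a rainy day"
words = sentence.split()  # naive whitespace tokenization

word_bigrams = ngrams(words, 2)     # pairs of consecutive words
char_trigrams = ngrams("rainy", 3)  # triples of consecutive characters

print(word_bigrams)
print(char_trigrams)
```

The same function serves both cases because strings and lists are both sequences: passing a list yields word N-grams, passing a string yields character N-grams.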
Bag-of-words, in contrast to N-grams, does not consider word order or sequence. It captures the word occurrence frequencies in the text corpus. Bag-of-words features are also used in tasks like sentiment analysis and topic identification.
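As a small illustration of what losing word order means, the sketch below builds bag-of-words counts with the standard library's Counter. The function name bag_of_words and the two sample sentences are invented for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())

# Two hypothetical reviews with opposite meanings but the same words.
bow_a = bag_of_words("the food was good not bad")
bow_b = bag_of_words("the food was bad not good")

print(bow_a)
print(bow_a == bow_b)
```

Both sentences map to the same counts even though they say opposite things, which is exactly the information a bag-of-words representation throws away.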
Ready for a mini-project?
We will use the Yelp Review Data Set from Kaggle.
Each observation in this dataset is a review of a particular business by a particular user.
The “stars” column is the number of stars (1 through 5) assigned by the reviewer to the business. A higher number of stars is better. In other words, it is the rating of the business by the person who wrote the review.
Create a dataframe called yelp_class that contains the columns of the yelp dataframe, but only for the 1-star or 5-star reviews:
yelp_class = yelp[(yelp.stars==1) | (yelp.stars==5)]
Create two objects X and y. X will be the ‘text’ column of yelp_class and y will be the ‘stars’ column of yelp_class (your features and target/labels):
X = yelp_class['text']
y = yelp_class['stars']
Import CountVectorizer and create a CountVectorizer object:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
Use the fit_transform method on the CountVectorizer object and pass in X (the ‘text’ column). Save this result by overwriting X:
X = cv.fit_transform(X)
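To see what fit_transform returns, it can help to run it on a tiny corpus first. The two documents and the toy_ variable names below are made up for illustration and are not part of the Yelp data; the sketch relies on CountVectorizer's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, used only to inspect the output.
toy_docs = ["Good food, good service", "Bad food"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_docs)  # sparse document-term matrix

toy_vocab = toy_cv.vocabulary_   # maps each token to its column index
toy_dense = toy_X.toarray()      # dense copy, fine for a corpus this small

print(sorted(toy_vocab, key=toy_vocab.get))  # tokens in column order
print(toy_dense)                             # one row per document
```

Each row is one document and each column is one vocabulary word, so the matrix is a bag-of-words count table. On the real dataset the matrix stays sparse because most reviews use only a small fraction of the vocabulary.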
Train Test Split
Let’s now split our data into training and testing data.
Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
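The effect of test_size and random_state is easier to check on a toy example. The ten fake reviews and the toy_ and tr_/te_ names below are placeholders invented for this sketch.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder reviews with alternating 1-star and 5-star labels.
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1, 5] * 5

tr_X, te_X, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.3, random_state=101
)

# Repeating the call with the same random_state reproduces the same split.
tr_X2, te_X2, _, _ = train_test_split(
    toy_texts, toy_labels, test_size=0.3, random_state=101
)

print(len(tr_X), len(te_X))
print(te_X == te_X2)
```

With test_size=0.3, three of the ten samples are held out for testing and seven remain for training. Fixing random_state makes the shuffle repeatable, which is why results can be compared across runs.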
Training a Model
Time to train a model!
Import MultinomialNB and create an instance of the estimator and call it nb:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
Now fit nb using the training data:
nb.fit(X_train,y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Predictions and Evaluations
Time to see how our model did!
Use the predict method of nb to predict labels from X_test:
predictions = nb.predict(X_test)
Create a confusion matrix and classification report using these predictions and y_test:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test,predictions))
print('\n')
print(classification_report(y_test,predictions))
[[159  69]
 [ 22 976]]
             precision    recall  f1-score   support
          1       0.88      0.70      0.78       228
          5       0.93      0.98      0.96       998
avg / total       0.92      0.93      0.92      1226
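The precision and recall figures in the report can be recomputed by hand from the confusion matrix above, which is a useful check on how to read it. This is a plain-Python sketch; the helper name precision_recall and the nb_ variable names are illustrative, and it assumes the usual scikit-learn layout where rows are true labels and columns are predicted labels.

```python
# Confusion matrix reported above; rows are true labels and columns are
# predicted labels, both in the order (1 star, 5 stars).
nb_cm = [[159, 69],
         [22, 976]]

def precision_recall(cm, k):
    """Precision and recall for the class at index k of a square matrix."""
    true_positives = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column total
    actually_k = sum(cm[k])                     # row total
    return true_positives / predicted_as_k, true_positives / actually_k

nb_p1, nb_r1 = precision_recall(nb_cm, 0)
nb_p5, nb_r5 = precision_recall(nb_cm, 1)
nb_accuracy = (nb_cm[0][0] + nb_cm[1][1]) / sum(map(sum, nb_cm))

print(f"1 star:  precision={nb_p1:.2f} recall={nb_r1:.2f}")
print(f"5 stars: precision={nb_p5:.2f} recall={nb_r5:.2f}")
print(f"accuracy={nb_accuracy:.2f}")
```

The weaker number is the 1-star recall: 69 of the 228 true 1-star reviews were predicted as 5 stars.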
Great! Now let’s see what happens if we add TF-IDF to this process using a pipeline.
Using Text Processing
Import TfidfTransformer from sklearn.
from sklearn.feature_extraction.text import TfidfTransformer
Import Pipeline from sklearn.
from sklearn.pipeline import Pipeline
Now create a pipeline with the following steps: CountVectorizer(), TfidfTransformer(), MultinomialNB():
pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
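For intuition about what the 'tfidf' step does to the counts, here is a rough standard-library sketch of TF-IDF weighting. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the TfidfTransformer defaults; the three-document corpus and the names tfidf_docs, smoothed_idf and tfidf_weights are invented for illustration.

```python
import math

# Hypothetical tokenized corpus: one list of tokens per document.
tfidf_docs = [
    ["good", "food"],
    ["bad", "food"],
    ["good", "good", "service"],
]
n_docs = len(tfidf_docs)

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, where df counts documents containing term."""
    df = sum(term in doc for doc in tfidf_docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_weights(doc):
    """Raw count times IDF for each term, scaled so the vector has length 1."""
    weights = {t: doc.count(t) * smoothed_idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in tfidf_docs:
    print(tfidf_weights(doc))
```

Words that appear in fewer documents, such as "bad" here, get a larger IDF than words that appear in many, such as "food". The classifier therefore sees scaled, fractional weights instead of raw integer counts.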
Using the Pipeline
Time to use the pipeline. Remember that this pipeline already contains all the preprocessing steps, so we need to re-split the original data. Note that we overwrote X with the CountVectorized version; what we need here is just the raw text.
Train Test Split
Redo the train test split on the yelp_class object:
X = yelp_class['text']
y = yelp_class['stars']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=101)
Now fit the pipeline to the training data. Remember you can’t use the same training data as last time because that data has already been vectorized. We need to pass in just the text and labels:
pipeline.fit(X_train,y_train)
Pipeline(steps=[('bow', CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
Predictions and Evaluation
Now use the pipeline to predict from X_test and create a classification report and confusion matrix. You should notice strange results:
predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
[[  0 228]
 [  0 998]]
             precision    recall  f1-score   support
          1       0.00      0.00      0.00       228
          5       0.81      1.00      0.90       998
avg / total       0.66      0.81      0.73      1226
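One way to read the strange result is to check the confusion matrix above against a trivial baseline that always predicts 5 stars. The sketch below uses only the numbers printed above, with illustrative variable names. It shows what the pipeline predicted, not why.

```python
# Confusion matrix from the pipeline run above; rows are true labels and
# columns are predicted labels, both in the order (1 star, 5 stars).
cm_tfidf = [[0, 228],
            [0, 998]]

total_reviews = sum(map(sum, cm_tfidf))
predicted_one_star = cm_tfidf[0][0] + cm_tfidf[1][0]  # first column total
pipeline_accuracy = (cm_tfidf[0][0] + cm_tfidf[1][1]) / total_reviews
majority_share = sum(cm_tfidf[1]) / total_reviews     # share of true 5-star reviews

print(f"reviews predicted as 1 star: {predicted_one_star}")
print(f"pipeline accuracy: {pipeline_accuracy:.2f}")
print(f"always-5-stars baseline: {majority_share:.2f}")
```

No test review was predicted as 1 star, so the pipeline's accuracy is exactly the share of 5-star reviews in the test set. In other words, it does no better than always guessing the majority class.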
TF-IDF actually made things worse!