Text Classification with Natural Language Processing (NLP) in Python using Scikit-Learn


Text classification is a fundamental task in natural language processing (NLP) that involves assigning predefined categories or labels to textual data. This technique has a wide range of applications, from sentiment analysis and spam detection to topic categorization. In this article, we'll explore how to perform text classification using Python and the Scikit-Learn library. We'll walk through the process step by step, including data preprocessing, feature extraction, model training, and evaluation.


Prerequisites

Before we get started, make sure you have Python and Scikit-Learn installed on your system. You can install Scikit-Learn using pip:

pip install scikit-learn

Dataset

For this example, we'll use the "20 Newsgroups" dataset, a collection of newsgroup documents organized into 20 different categories. We'll perform binary text classification to distinguish between two of those categories: "sci.space" and "comp.graphics".

Step 1: Data Preprocessing

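Let's begin by loading the dataset with Scikit-Learn's built-in fetch_20newsgroups function. The remove argument strips headers, footers, and quoted replies from each post, so the model has to learn from the message body rather than from metadata such as the newsgroup name in a header.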

from sklearn.datasets import fetch_20newsgroups

# Define the categories we want to classify
categories = ['sci.space', 'comp.graphics']

# Fetch the training dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))

# Fetch the testing dataset
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))
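
Before extracting features, it can help to check how many documents each class has and whether stripping headers, footers, and quotes left any posts empty. The helper below is a minimal sketch of such a check. The name summarize_corpus is hypothetical, and the small docs and targets lists are made-up stand-ins so the snippet runs without downloading anything.

```python
from collections import Counter


def summarize_corpus(docs, targets, target_names):
    """Return per-class document counts and the number of blank documents."""
    counts = Counter(target_names[t] for t in targets)
    # A document counts as blank if it is empty or whitespace only.
    empty = sum(1 for d in docs if not d.strip())
    return dict(counts), empty


# Made-up stand-in data (not taken from the real dataset).
docs = ["The shuttle launch was delayed.", "", "How do I render a polygon?", "   "]
targets = [1, 1, 0, 0]
target_names = ["comp.graphics", "sci.space"]

class_counts, n_empty = summarize_corpus(docs, targets, target_names)
print(class_counts)
print(n_empty)
```

With the real data, the same call would take newsgroups_train.data, newsgroups_train.target, and newsgroups_train.target_names as its arguments.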

Step 2: Feature Extraction

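To perform text classification, we need to convert the text data into numerical features that machine learning models can understand. We'll use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to do this. TF-IDF gives a term a higher weight when it appears often in a document but in few documents overall, so common words contribute less than distinctive ones.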

from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer that keeps only the 1,000 most frequent terms
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Learn the vocabulary and IDF weights from the training data, then transform it
X_train_tfidf = tfidf_vectorizer.fit_transform(newsgroups_train.data)

# Transform the testing data with the vocabulary learned above (no refitting)
X_test_tfidf = tfidf_vectorizer.transform(newsgroups_test.data)
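
To make the weighting less of a black box, here is a rough pure-Python sketch of the calculation, assuming the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is what Scikit-Learn documents as its default behavior. The function name tfidf_rows and the three-sentence corpus are made up for illustration, and splitting on whitespace is a simplification: the real vectorizer uses a regular-expression tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF weights using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts in this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Made-up toy corpus for illustration.
docs = ["space shuttle launch", "graphics card launch", "space space station"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "shuttle" ends up with a larger weight than "space" or "launch" because it appears in only one document, while the other two terms each appear in two.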

Step 3: Model Training

Now, let's train a text classification model. We'll use a simple logistic regression classifier for this example.

from sklearn.linear_model import LogisticRegression

# Create a logistic regression classifier
clf = LogisticRegression()

# Train the model
clf.fit(X_train_tfidf, newsgroups_train.target)
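
The vectorizer and classifier can also be chained into a single object, so that raw strings go in and predicted labels come out. The sketch below uses Scikit-Learn's make_pipeline for this. The six-document corpus and the "space" and "graphics" labels are made up so the example runs on its own; they are not the 20 Newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (not the 20 Newsgroups data).
train_docs = [
    "rocket launch orbit",
    "satellite orbit moon",
    "nasa rocket moon",
    "pixel image shader",
    "polygon shader texture",
    "render pixel texture",
]
train_labels = ["space", "space", "space", "graphics", "graphics", "graphics"]

# Chain the two steps: fit() fits the vectorizer, then the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# predict() applies the fitted vectorizer before calling the classifier.
predictions = model.predict(["rocket orbit", "shader pixel"])
print(list(predictions))
```

One benefit of this design is that the test data can never be used to fit the vectorizer by accident, because the pipeline only calls transform on it at prediction time.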

Step 4: Model Evaluation

Finally, let's evaluate the performance of our text classification model on the test data.

from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the test data
predicted = clf.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(newsgroups_test.target, predicted)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
report = classification_report(newsgroups_test.target, predicted, target_names=newsgroups_test.target_names)
print("Classification Report:\n", report)
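
The classification report lists precision, recall, and F1-score for each class. As a rough illustration of where those numbers come from, the sketch below computes them by hand for one class. The function name binary_metrics and the two label lists are made up for this example.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 1 is the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

precision, recall, f1 = binary_metrics(y_true, y_pred, positive=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here three of the five documents predicted as positive are correct (precision 0.60), and three of the four truly positive documents are found (recall 0.75).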

Conclusion

Text classification is a foundational NLP task with numerous practical applications. In this article we loaded two categories of the 20 Newsgroups dataset, converted the posts to TF-IDF features, trained a logistic regression classifier, and evaluated it with accuracy and a classification report. The same workflow applies to tasks such as sentiment analysis and document categorization. Experiment with different algorithms and feature extraction methods to further improve the accuracy of your text classification models.
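
As one possible starting point for that experimentation, the sketch below fits three different Scikit-Learn classifiers on the same TF-IDF features. The toy corpus and labels are made up so the example is self-contained, and the choice of MultinomialNB and LinearSVC as alternatives is only a suggestion, not something taken from the steps above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus (not the 20 Newsgroups data).
train_docs = [
    "rocket launch orbit",
    "satellite orbit moon",
    "nasa rocket moon",
    "pixel image shader",
    "polygon shader texture",
    "render pixel texture",
]
train_labels = ["space", "space", "space", "graphics", "graphics", "graphics"]
test_docs = ["rocket orbit", "shader pixel"]

# Candidate classifiers to swap into the same pipeline.
candidates = {
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

results = {}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_docs, train_labels)
    results[name] = list(model.predict(test_docs))

for name, preds in results.items():
    print(name, preds)
```

On the real dataset, you would fit each candidate on newsgroups_train.data and compare accuracy on newsgroups_test.data instead of looking at individual predictions.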