I’ll walk you through the task of detecting sarcasm with machine learning using the Python programming language.
The script reads a dataset of headlines labeled as sarcastic or non-sarcastic, maps the numeric labels to human-readable form, and converts the text into a matrix of token counts using CountVectorizer. The data is then split into training and testing sets, and a Bernoulli Naive Bayes classifier is trained on the training set. The model's accuracy is evaluated on the test set, and the model can also predict whether new text entered by the user is sarcastic.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
These lines import the necessary libraries:
- pandas (pd) for data manipulation.
- numpy (np) for numerical operations.
- CountVectorizer from sklearn for converting text data into a matrix of token counts.
- BernoulliNB from sklearn for implementing the Bernoulli Naive Bayes classifier.
- train_test_split from sklearn for splitting data into training and testing sets.
data = pd.read_json("https://raw.githubusercontent.com/amankharwal/Website-data/master/Sarcasm.json", lines=True)
This line reads JSON data from the given URL into a pandas DataFrame. The lines=True argument specifies that each line in the file is a separate JSON object.
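To see what lines=True does without downloading anything, here is a minimal sketch that parses a small in-memory sample. The two records and the names sample and sample_df are made up for illustration; only the column names match the dataset used in this post.

```python
import io
import pandas as pd

# Hypothetical two-record sample in JSON Lines form (one JSON object per line).
sample = io.StringIO(
    '{"headline": "area man wins argument with himself", "is_sarcastic": 1}\n'
    '{"headline": "city council approves new budget", "is_sarcastic": 0}\n'
)

# lines=True tells pandas to parse each line as its own JSON object,
# producing one DataFrame row per line.
sample_df = pd.read_json(sample, lines=True)
print(sample_df.shape)
print(list(sample_df.columns))
```

Without lines=True, pandas would try to parse the whole input as a single JSON document, which fails for a file laid out like this.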
data.head()
Displays the first few rows of the DataFrame to give an overview of the data.
data.tail()
Displays the last few rows of the DataFrame.
data.columns
Shows the column names of the DataFrame.
data.shape
Displays the dimensions (number of rows and columns) of the DataFrame.
data['is_sarcastic'] = data['is_sarcastic'].map({0:'No Sarcasm', 1: 'Sarcasm'})
Maps the values in the is_sarcastic column from 0 and 1 to 'No Sarcasm' and 'Sarcasm' respectively.
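One thing worth knowing about map with a dictionary is that any value missing from the dictionary becomes NaN. The sketch below uses a made-up Series (labels) with one deliberately unexpected value to show this; it is an illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical label column; the value 2 is deliberately not in the mapping.
labels = pd.Series([0, 1, 1, 2])

# Values found in the dictionary are replaced; anything else becomes NaN.
mapped = labels.map({0: "No Sarcasm", 1: "Sarcasm"})
print(mapped.tolist())

# Counting NaNs is a quick way to check that every label was mapped.
unmapped_count = int(mapped.isna().sum())
print(unmapped_count)
```

Running the same isna().sum() check on data['is_sarcastic'] after the mapping is a cheap way to confirm the dataset contains only 0 and 1.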
data.head()
Displays the first few rows of the DataFrame again to show the updated is_sarcastic column.
data = data[['headline', 'is_sarcastic']]
Selects only the headline and is_sarcastic columns from the DataFrame for further analysis.
x = np.array(data['headline'])
y = np.array(data['is_sarcastic'])
Converts the headline and is_sarcastic columns to numpy arrays, assigning them to x and y respectively.
cv = CountVectorizer()
Creates an instance of CountVectorizer to transform the text data into a matrix of token counts.
X = cv.fit_transform(x)
Fits the CountVectorizer to the headlines and transforms them into a sparse matrix of token counts, assigned to X.
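To make the token-count matrix concrete, here is a small sketch on a two-sentence toy corpus. The names docs, toy_cv, toy_X, and vocab are illustrative only. With default settings, CountVectorizer lowercases the text and keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the output format.
docs = ["the cat sat", "the cat sat on the cat"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)

# Convert to a dense array only for display; each cell is a token count.
print(toy_X.toarray())
```

Each column corresponds to one vocabulary token, and each row counts how often that token appears in the matching document.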
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Splits the data into training and testing sets. 80% of the data is used for training and 20% for testing. Setting random_state=42 fixes the shuffle, so the same split is produced on every run.
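The effect of test_size and random_state can be checked on a tiny array. This is a sketch with made-up data (features and targets); the variable names are chosen to avoid clashing with the ones used in the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and alternating labels.
features = np.arange(20).reshape(10, 2)
targets = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the 10 samples, i.e. 2 of them.
f_train, f_test, t_train, t_test = train_test_split(
    features, targets, test_size=0.2, random_state=42
)
print(len(f_train), len(f_test))

# Repeating the call with the same random_state yields the same split.
f_train2, f_test2, t_train2, t_test2 = train_test_split(
    features, targets, test_size=0.2, random_state=42
)
same_split = np.array_equal(f_test, f_test2)
print(same_split)
```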
model = BernoulliNB()
Creates an instance of the Bernoulli Naive Bayes classifier.
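A detail that is easy to miss: BernoulliNB models whether a token is present, not how many times it occurs. With its default binarize=0.0 setting, every count greater than zero is treated as 1. The sketch below, on made-up counts, suggests that fitting on raw counts and on explicitly binarized features gives the same model.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical token counts for four documents and three tokens.
counts = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 4], [0, 1, 0]])
toy_labels = np.array(["Sarcasm", "No Sarcasm", "Sarcasm", "No Sarcasm"])

# Fit once on the raw counts and once on presence/absence features.
on_counts = BernoulliNB().fit(counts, toy_labels)
on_binary = BernoulliNB().fit((counts > 0).astype(int), toy_labels)

# The learned per-class feature log-probabilities should match.
same_model = np.allclose(on_counts.feature_log_prob_, on_binary.feature_log_prob_)
print(same_model)
```

If token frequency matters for your data, MultinomialNB from the same module is the variant that uses the counts themselves.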
model.fit(X_train, y_train)
Trains the model using the training data (X_train and y_train).
print(model.score(X_test, y_test))
Prints the accuracy of the model on the test data.
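For a classifier, score returns the fraction of samples whose predicted label matches the true label. The sketch below reproduces that number by hand on made-up data and compares it with a majority-class baseline, which helps put an accuracy value in context. For brevity it scores on the same toy data it was trained on; in the post the score is computed on the held-out test set.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features and labels, used only to illustrate scoring.
toy_features = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
toy_targets = np.array(
    ["Sarcasm", "Sarcasm", "No Sarcasm", "No Sarcasm", "Sarcasm", "No Sarcasm"]
)

clf = BernoulliNB().fit(toy_features, toy_targets)

# Accuracy by hand: the share of predictions equal to the true labels.
manual_accuracy = np.mean(clf.predict(toy_features) == toy_targets)
reported_accuracy = clf.score(toy_features, toy_targets)

# Baseline: the accuracy of always guessing the most common label.
_, label_counts = np.unique(toy_targets, return_counts=True)
baseline = label_counts.max() / label_counts.sum()
print(manual_accuracy, reported_accuracy, baseline)
```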
user = input("Enter the text here")
Prompts the user to enter a piece of text for sarcasm detection.
data = cv.transform([user]).toarray()
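Transforms the user input with the already fitted CountVectorizer, so it has the same token-count columns as the training data. The .toarray() call converts the sparse result into a dense array. Note that this reuses the name data, which overwrites the DataFrame loaded earlier.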
output = model.predict(data)
Uses the trained model to predict whether the user input text is sarcastic or not.
print(output)
Prints the prediction result.
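If you want to classify text more than once, one option is to bundle the vectorizer and the classifier into a scikit-learn Pipeline, so raw strings can be passed straight to predict. This is an alternative arrangement, not the approach used above, and the training headlines, the name sarcasm_pipeline, and the classify helper are all made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data; a real run would use the headline dataset.
toy_headlines = [
    "oh great another monday",
    "wow what a surprise",
    "council approves budget",
    "team wins final match",
]
toy_labels = ["Sarcasm", "Sarcasm", "No Sarcasm", "No Sarcasm"]

# The pipeline vectorizes the text and then feeds the counts to the classifier.
sarcasm_pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
sarcasm_pipeline.fit(toy_headlines, toy_labels)


def classify(text):
    """Return the predicted label for a single piece of text."""
    return sarcasm_pipeline.predict([text])[0]


prediction = classify("oh great what a surprise")
print(prediction)
```

A side benefit of this arrangement is that fitting the pipeline on the training split alone builds the vocabulary from training headlines only, rather than from the full dataset.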
You can find the dataset here and the Colab notebook here. You can also follow me on GitHub.
Happy Coding!