DEV Community

Steven Mathew


Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments - Cleaning and Saving The Data

In this part, we clean the Reddit comments and save them so they are ready for training and testing in the next part.

import re

def clean_comment(text):
    if not isinstance(text, str):
        return text  # Leave missing values (NaN) as they are so they can be dropped later
    text = re.sub(r'http\S+', '', text)  # Remove any web URLs in the text
    text = re.sub(r'/?\bu/\w+', '', text)  # Remove mentions of Reddit users (like /u/username or u/username)
    text = re.sub(r'/?\br/\w+', '', text)  # Remove mentions of subreddits (like r/subreddit or /r/subreddit)
    text = re.sub(r'\n', ' ', text)  # Replace new line characters with spaces
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # Remove any characters that are not letters, numbers, or whitespace
    return text.lower()  # Convert the cleaned text to lowercase

This function takes a piece of text (text) and removes web URLs and mentions of Reddit users and subreddits, replaces newline characters with spaces, and strips any character that is not a letter, a digit, or whitespace. Finally, it returns the cleaned text in lowercase.
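To see the cleaning steps in action, here is a minimal sketch that runs them on a made-up comment. The sample string is hypothetical, and the user and subreddit patterns use a word boundary (\b) so that text such as "or/and" is not mistaken for a subreddit mention; treat that detail as an assumption of this sketch.

```python
import re

def clean_comment(text):
    # Same cleaning steps as described above, restated so the snippet runs on its own
    text = re.sub(r'http\S+', '', text)         # web URLs
    text = re.sub(r'/?\bu/\w+', '', text)       # user mentions (assumed \b variant)
    text = re.sub(r'/?\br/\w+', '', text)       # subreddit mentions (assumed \b variant)
    text = re.sub(r'\n', ' ', text)             # newlines become spaces
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # drop punctuation and symbols
    return text.lower()

# Hypothetical sample comment, not taken from the dataset
sample = 'Wow, /u/some_user in r/funny is SO right...\nsee http://example.com'
cleaned = clean_comment(sample)

print(repr(cleaned))    # removed pieces leave repeated spaces behind
print(cleaned.split())  # ['wow', 'in', 'is', 'so', 'right', 'see']
```

Note that the removed URLs and mentions leave extra spaces in the result. Splitting on whitespace, as in the last line, ignores them.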


import pandas as pd

# Load data from a CSV file into a DataFrame
df = pd.read_csv('reddit_comments.csv')

# Apply the cleaning function to each comment and create a new column for cleaned comments
df['cleaned_comment'] = df['comment'].apply(clean_comment)

Here, we load data from a CSV file (reddit_comments.csv) into a table-like structure called a DataFrame. Then, for each comment in the 'comment' column of this DataFrame, we use the clean_comment function we defined earlier to clean up the text. The cleaned versions of the comments are stored in a new column named 'cleaned_comment'.
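If you want to try this step without the real reddit_comments.csv, a small in-memory CSV can stand in for the file. The sketch below is only an illustration: the two comments are invented, and clean_comment is a condensed version of the cleaning steps so the snippet runs on its own.

```python
import io
import re

import pandas as pd

def clean_comment(text):
    # Condensed version of the cleaning steps (patterns applied in order)
    for pattern, repl in [(r'http\S+', ''), (r'/?\bu/\w+', ''), (r'/?\br/\w+', ''),
                          (r'\n', ' '), (r'[^A-Za-z0-9\s]', '')]:
        text = re.sub(pattern, repl, text)
    return text.lower()

# Hypothetical stand-in for reddit_comments.csv with a single 'comment' column
csv_text = io.StringIO(
    "comment\n"
    "Oh GREAT another Monday!!!\n"
    "Thanks u/helper_bot this is SO useful\n"
)

df = pd.read_csv(csv_text)
df['cleaned_comment'] = df['comment'].apply(clean_comment)

print(df[['comment', 'cleaned_comment']])
```

The original 'comment' column is left untouched, so you can always compare the raw and cleaned text side by side.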


# Assign placeholder labels to the comments (for demonstration only)
labels = [0, 1] * (len(df) // 2)  # Create a list of labels alternating between 0 and 1
if len(labels) < len(df):
    labels.append(0)  # Add one more label to match the number of comments

df['label'] = labels  # Assign the labels to a new column named 'label' in the DataFrame

In this part, we assign a label to each comment to indicate whether it is sarcastic or not. For demonstration purposes, we simply alternate between 0 (non-sarcastic) and 1 (sarcastic), so these labels are placeholders that do not depend on what a comment actually says. If the number of comments is odd, one extra 0 is appended so that every comment gets a label. The labels are stored in a new column named 'label' in the DataFrame.
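The odd-length case is easy to check on a tiny example. The sketch below uses a made-up DataFrame with five rows and also shows an equivalent one-line way to build the same alternating list; the one-liner is an alternative offered here, not part of the original code.

```python
import pandas as pd

# Hypothetical DataFrame with an odd number of rows
df = pd.DataFrame({'comment': ['a', 'b', 'c', 'd', 'e']})

# Same logic as above: pairs of [0, 1], plus one extra 0 if a row is left over
labels = [0, 1] * (len(df) // 2)
if len(labels) < len(df):
    labels.append(0)

# Equivalent alternative: the label is the row position modulo 2
alt_labels = [i % 2 for i in range(len(df))]

df['label'] = labels
print(df['label'].tolist())  # [0, 1, 0, 1, 0]
print(alt_labels == labels)  # True
```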

# Remove rows where the cleaned comment is empty or NaN (missing)
df = df.dropna(subset=['cleaned_comment'])  # Remove rows where 'cleaned_comment' is NaN
df = df[df['cleaned_comment'].str.strip() != '']  # Remove rows where 'cleaned_comment' is empty or only whitespace

# Save the cleaned and labeled data to a new CSV file
df.to_csv('labeled_reddit_comments.csv', index=False)  # Save DataFrame to CSV without including the index

Finally, we remove any rows where the cleaned comment is missing (NaN), empty, or made up only of whitespace. Such rows can appear when a comment contained nothing but a URL, a mention, or punctuation.

After cleaning and filtering, we save the cleaned and labeled data (including the 'cleaned_comment' and 'label' columns) to a new CSV file named labeled_reddit_comments.csv.
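Here is a small sketch of the two filtering steps on invented data, so you can see which rows survive. The DataFrame contents are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned data with a missing, an empty, and a whitespace-only comment
df = pd.DataFrame({
    'cleaned_comment': ['nice job', None, '', '   ', 'sure thing'],
    'label': [0, 1, 0, 1, 0],
})

df = df.dropna(subset=['cleaned_comment'])        # drops the None/NaN row
df = df[df['cleaned_comment'].str.strip() != '']  # drops '' and '   '

print(df)
print(len(df))  # 2
```

Because whole rows are dropped, each remaining comment keeps the label it was given before filtering.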

Note:
The index=False parameter ensures that the CSV file does not include an extra column for row numbers.
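To see what index=False changes, the sketch below writes the same small, made-up DataFrame both ways. It relies on to_csv returning the CSV text as a string when no file path is given.

```python
import io

import pandas as pd

# Hypothetical two-row DataFrame
df = pd.DataFrame({'cleaned_comment': ['nice job', 'sure thing'], 'label': [0, 1]})

with_index = df.to_csv()                # default: row index is written as a first column
without_index = df.to_csv(index=False)  # index is left out

print(with_index.splitlines()[0])     # ,cleaned_comment,label
print(without_index.splitlines()[0])  # cleaned_comment,label

# Reading the first version back shows the extra, unnamed column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))  # ['Unnamed: 0', 'cleaned_comment', 'label']
```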

Read Part 3 - Sarcasm Detection From Reddit Comments: Training & Testing

GITHUB: https://github.com/stevie1mat/Sarcasm-Detection-With-Reddit-Comments

Author: Steven Mathew
