Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments - Part 1

I trained a sarcasm detection AI model on Reddit comments. Here is how you can do it too. You will need:

Google Colab
Reddit API Credentials
Lots of time

  1. First we will import the necessary libraries.
import asyncio  # For asynchronous programming in Python.
import asyncpraw  # Python Reddit API Wrapper for asynchronous Reddit API interactions.
import pandas as pd  # Data manipulation and analysis tool.
import nest_asyncio  # Necessary for allowing nested asyncio run loops.
import re  # Regular expressions for pattern matching and text manipulation.
from sklearn.model_selection import train_test_split  # Splits data into training and testing sets.
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text data into TF-IDF feature vectors.
from sklearn.ensemble import RandomForestClassifier  # Random Forest classifier for machine learning.
from sklearn.metrics import accuracy_score, classification_report  # Metrics for evaluating model performance.
from imblearn.over_sampling import SMOTE  # Oversampling technique for handling class imbalance.
from sklearn.pipeline import Pipeline  # Constructs a pipeline of transformations and estimators.
from sklearn.model_selection import GridSearchCV  # Performs grid search over specified parameter values.
  2. Connecting to the Reddit API. Get your API credentials from
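`import praw  # Synchronous Reddit API wrapper; the collector further down uses asyncpraw instead.

client_id = 'your_client_id'
client_secret = 'your_client_secret'
user_agent = 'MyRedditApp/0.1 by your_username'

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent)`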
This code sets up authentication credentials (client_id, client_secret, user_agent) to create a Reddit API connection using praw. The Reddit object initializes a connection to Reddit's API, allowing the Python script to interact with Reddit, retrieve data, and perform various actions programmatically on the platform.
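If you would rather not paste secrets into a notebook cell, one option is to read them from environment variables. The sketch below is only an illustration: the variable names (REDDIT_CLIENT_ID and so on) and the helper load_reddit_credentials are hypothetical, not something praw requires.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from a mapping of settings.

    The variable names below are assumptions made for this example.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Demonstration with a stand-in mapping instead of the real environment.
example_env = {
    "REDDIT_CLIENT_ID": "your_client_id",
    "REDDIT_CLIENT_SECRET": "your_client_secret",
    "REDDIT_USER_AGENT": "MyRedditApp/0.1 by your_username",
}
credentials = load_reddit_credentials(example_env)
print(sorted(credentials))

# With real values set, the connection would then be created as:
# reddit = praw.Reddit(**credentials)
```

Passing the mapping as a parameter keeps the helper easy to try out without touching your real environment.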

  3. Initialization and Setup
`nest_asyncio.apply()`
This line ensures that asyncio can be used in a nested manner, which is necessary in environments that already have an event loop running, such as Google Colab.
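To see why nesting is a problem in the first place, here is a small standard-library sketch. It assumes you run it as a plain Python script rather than in a notebook cell, and the function names inner and outer are made up for the demonstration. It shows that asyncio refuses to start a second event loop from inside a running one, which is the restriction nest_asyncio lifts.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # Tries to start a second loop from within the first.
    except RuntimeError as exc:
        coro.close()  # Avoids a "coroutine was never awaited" warning.
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```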

Asynchronous Function Definition

`async def collect_reddit_comments(subreddit_name, keyword, limit=1000):
    reddit = asyncpraw.Reddit(
        client_id=client_id, client_secret=client_secret, user_agent=user_agent
    )`
Defines an asynchronous function collect_reddit_comments to retrieve comments from Reddit. It initializes a Reddit instance using asyncpraw, passing in credentials (client_id, client_secret, user_agent) for API authentication.

Fetching Subreddit and Comments

`subreddit = await reddit.subreddit(subreddit_name)
comments = []
count = 0
after = None`
Asynchronously fetches the subreddit object based on subreddit_name. Initializes an empty list comments to store comment data, and sets counters (count) and pagination marker (after) for comment retrieval.

Looping Through Submissions and Comments

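`while len(comments) < limit:
        # Search the subreddit for submissions matching the keyword.
        async for submission in subreddit.search(keyword, limit=None, params={'after': after}):
            await submission.load()
            submission.comment_limit = 0
            # Expand nested "load more comments" placeholders.
            await submission.comments.replace_more(limit=None)`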
Enters a loop that searches the specified subreddit for submissions matching keyword. Each submission is loaded asynchronously, and replace_more expands the nested "load more comments" placeholders so that all of its comments are retrieved.
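The "keep fetching until we have enough" idea can be tried without a Reddit connection. The following is a minimal sketch of cursor-style pagination against a made-up data source: FAKE_ITEMS, fetch_page, and collect are all hypothetical names for this illustration and are not part of asyncpraw. It also stops when the source runs dry, which is worth considering for the real loop too.

```python
import asyncio

# Hypothetical stand-in for a paged API: 7 items, served 3 at a time.
FAKE_ITEMS = [f"t3_{i}" for i in range(7)]


async def fetch_page(after=None, page_size=3):
    """Return the items that follow the `after` marker, plus the next marker."""
    start = 0 if after is None else FAKE_ITEMS.index(after) + 1
    page = FAKE_ITEMS[start:start + page_size]
    next_after = page[-1] if page else None
    return page, next_after


async def collect(limit):
    collected = []
    after = None
    while len(collected) < limit:
        page, next_after = await fetch_page(after)
        if not page:  # Source exhausted before reaching the limit.
            break
        collected.extend(page)
        after = next_after  # Resume after the last item seen.
    return collected[:limit]


result = asyncio.run(collect(5))
print(result)
```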

Collecting and Storing Comments

           ` for comment in submission.comments.list():
                if isinstance(comment, asyncpraw.models.Comment):
                    # Deleted accounts have no author object.
                    author_name = comment.author.name if comment.author else '[deleted]'
                    comments.append([comment.body, author_name, comment.created_utc])
                    count += 1

                    if count >= limit:
                        break

            after = submission.fullname  # Sets the 'after' parameter for pagination

            if count >= limit:
                break`
Iterates through each comment in the submission, checking if it's a valid comment. Collects comment details such as body, author name, and creation time (created_utc). Controls the loop with count and limit to ensure the specified number of comments (limit) is collected.
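Here is a small, self-contained sketch of that row-building step. The FakeAuthor and FakeComment classes are stand-ins invented for this example (they only mimic the three attributes used above), so you can check the '[deleted]' fallback and the limit handling without calling Reddit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeAuthor:
    name: str


@dataclass
class FakeComment:
    body: str
    author: Optional[FakeAuthor]  # None when the account was deleted.
    created_utc: float


def comment_to_row(comment):
    """Turn one comment into [body, author name, creation time]."""
    author_name = comment.author.name if comment.author else "[deleted]"
    return [comment.body, author_name, comment.created_utc]


def collect_rows(comment_stream, limit):
    rows = []
    for comment in comment_stream:
        rows.append(comment_to_row(comment))
        if len(rows) >= limit:  # Stop as soon as we have enough.
            break
    return rows


sample = [
    FakeComment("Oh great, another Monday.", FakeAuthor("alice"), 1700000000.0),
    FakeComment("Sure, that will work.", None, 1700000060.0),
    FakeComment("Never collected.", FakeAuthor("bob"), 1700000120.0),
]
rows = collect_rows(sample, limit=2)
print(rows)
```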

Handling API Exceptions

    `except asyncpraw.exceptions.APIException as e:
        print(f"API exception occurred: {e}")
        wait_time = 60  # Wait for 1 minute before retrying
        print(f"Waiting for {wait_time} seconds before retrying...")
        await asyncio.sleep(wait_time)`
Catches and handles API exceptions that may occur during Reddit API interactions. Prints the exception message, waits for a minute (wait_time) before retrying, and then resumes fetching comments.
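The same wait-and-retry pattern can be pulled into a small helper. This is a sketch under a few assumptions: FakeAPIError stands in for the real API exception, call_with_retry and flaky_call are hypothetical names, and the max_attempts cap is an addition of mine so the helper cannot retry forever.

```python
import asyncio


class FakeAPIError(Exception):
    """Stand-in for an API exception; not part of asyncpraw."""


async def call_with_retry(make_call, wait_time=60, max_attempts=3):
    """Await make_call(), sleeping wait_time seconds between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except FakeAPIError as e:
            if attempt == max_attempts:
                raise  # Give up after the last allowed attempt.
            print(f"API exception occurred: {e}")
            print(f"Waiting for {wait_time} seconds before retrying...")
            await asyncio.sleep(wait_time)


attempts = []


async def flaky_call():
    # Fails twice, then succeeds, to exercise the retry path.
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeAPIError("rate limited")
    return "ok"


# A tiny wait keeps the demonstration fast; the tutorial uses 60 seconds.
result = asyncio.run(call_with_retry(flaky_call, wait_time=0.01))
print(result, len(attempts))
```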

Returning Results

`return comments[:limit]  # Returns up to 'limit' comments`
Returns the collected comments, truncated to at most limit entries.

Main Function to Execute Collection

async def main():
    comments = await collect_reddit_comments('sarcasm', 'sarcastic', limit=5000)  # Adjust limit as needed
    df = pd.DataFrame(comments, columns=['comment', 'author', 'created_utc'])
    df.to_csv('reddit_comments.csv', index=False)
    print(f"Total comments collected: {len(df)}")
Defines an asynchronous main function to orchestrate the comment collection process. Calls collect_reddit_comments with parameters subreddit_name='sarcasm', keyword='sarcastic', and limit=5000 (can be adjusted). Converts collected comments into a Pandas DataFrame (df), stores it as a CSV file (reddit_comments.csv), and prints summary information about the collected data.
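To check that the CSV step keeps the data intact, you can round-trip a couple of rows in memory. This sketch uses made-up sample rows and an in-memory buffer instead of reddit_comments.csv. The created_at column is an extra I added for illustration: created_utc is a Unix timestamp in seconds, so pandas can convert it to a readable datetime.

```python
import io

import pandas as pd

# Made-up rows in the same [comment, author, created_utc] layout.
comments = [
    ["Oh great, another Monday.", "alice", 1700000000.0],
    ["Sure, that will work.", "[deleted]", 1700000060.0],
]
df = pd.DataFrame(comments, columns=["comment", "author", "created_utc"])

# Write to an in-memory buffer and read it back, as a file round trip would.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Convert the Unix timestamp (seconds) into a timezone-aware datetime.
restored["created_at"] = pd.to_datetime(restored["created_utc"], unit="s", utc=True)
print(restored.shape)
```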

Running the Main Function

`await main()`
Executes the main function asynchronously: it collects the Reddit comments, loads them into a DataFrame, saves them to a CSV file, and prints the number of comments collected. A top-level await like this works in notebook environments such as Google Colab, where an event loop is already running.

Read Part 2: Sarcasm Detection From Reddit Comments: Cleaning & Saving The Data


Author: Steven Mathew

