I have trained a Sarcasm Detection AI model using Reddit comments. This is how you can do it too.
Requirements:
Google Colab
Reddit API Credentials
Lots of time
Coffee
- First we will import the necessary libraries.
`import asyncio # For asynchronous programming in Python.
import asyncpraw # Python Reddit API Wrapper for asynchronous Reddit API interactions.
import pandas as pd # Data manipulation and analysis tool.
import nest_asyncio # Necessary for allowing nested asyncio run loops.
import re # Regular expressions for pattern matching and text manipulation.
from sklearn.model_selection import train_test_split # Splits data into training and testing sets.
from sklearn.feature_extraction.text import TfidfVectorizer # Converts text data into TF-IDF feature vectors.
from sklearn.ensemble import RandomForestClassifier # Random Forest classifier for machine learning.
from sklearn.metrics import accuracy_score, classification_report # Metrics for evaluating model performance.
from imblearn.over_sampling import SMOTE # Oversampling technique for handling class imbalance.
from sklearn.pipeline import Pipeline # Constructs a pipeline of transformations and estimators.
from sklearn.model_selection import GridSearchCV # Performs grid search over specified parameter values.`
- Connecting to the Reddit API
Get your API credentials from https://www.reddit.com/prefs/apps
`import praw # Synchronous Reddit API wrapper; needed for praw.Reddit below.
client_id = 'your_client_id'
client_secret = 'your_client_secret'
user_agent = 'MyRedditApp/0.1 by your_username'
reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent)`
This code sets the authentication credentials (client_id, client_secret, user_agent) and uses them to create a synchronous Reddit client with praw. The same three values are reused later: the collection function builds its own asynchronous client with asyncpraw, and that client does the actual comment retrieval.
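If you would rather not paste credentials into a shared notebook, one option is to read them from environment variables. The sketch below is only an illustration: the helper load_reddit_credentials and the variable names REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET and REDDIT_USER_AGENT are hypothetical and not part of the original code.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from environment variables.

    The variable names below are hypothetical; use whatever names you prefer.
    """
    required = ["REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT"]
    # Fail early with a clear message if anything is missing or empty.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }
```

The returned dictionary uses the same keyword names as the Reddit constructor, so it could be unpacked as asyncpraw.Reddit(**load_reddit_credentials()).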
- Initialization and Setup
`nest_asyncio.apply()`
This call patches asyncio so that event loops can be nested. It is needed in environments that already have an event loop running, such as Google Colab and Jupyter notebooks.
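To see the problem that nest_asyncio works around, here is a small standard-library sketch. It is meant to be run as a plain Python script, not in a notebook, and the function names inner and outer are made up for the illustration. It tries to start a second event loop from inside a running one, which asyncio refuses by default.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    coro = inner()
    try:
        # A loop is already running here, so asyncio.run() refuses to start another.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

A notebook is in the same position as outer: its own event loop is already running, so calling asyncio.run() from a cell hits this error unless nest_asyncio.apply() has been called first.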
Asynchronous Function Definition
`async def collect_reddit_comments(subreddit_name, keyword, limit=1000):
    reddit = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent
    )`
Defines an asynchronous function collect_reddit_comments to retrieve comments from Reddit. It initializes a Reddit instance using asyncpraw, passing in credentials (client_id, client_secret, user_agent) for API authentication.
Fetching Subreddit and Comments
`    subreddit = await reddit.subreddit(subreddit_name)
    comments = []
    count = 0
    after = None`
Asynchronously fetches the subreddit object based on subreddit_name. Initializes an empty list comments to store comment data, and sets counters (count) and pagination marker (after) for comment retrieval.
Looping Through Submissions and Comments
`    while len(comments) < limit:
        try:
            async for submission in subreddit.search(keyword, limit=None, params={'after': after}):
                await submission.load()
                submission.comment_limit = 0
                # replace_more is a coroutine in asyncpraw, so it must be awaited.
                await submission.comments.replace_more(limit=0)`
Enters a loop that searches the subreddit for submissions matching keyword. Each submission is loaded asynchronously along with its comments. Calling replace_more(limit=0) removes the unexpanded "load more comments" placeholders from the comment tree instead of fetching them, so only the comments already loaded are kept.
Collecting and Storing Comments
`                for comment in submission.comments.list():
                    if isinstance(comment, asyncpraw.models.Comment):
                        author_name = comment.author.name if comment.author else '[deleted]'
                        comments.append([comment.body, author_name, comment.created_utc])
                        count += 1
                        if count >= limit:
                            break
                after = submission.id # Sets the 'after' parameter for pagination
                if count >= limit:
                    break`
Iterates through each comment in the submission, checking if it's a valid comment. Collects comment details such as body, author name, and creation time (created_utc). Controls the loop with count and limit to ensure the specified number of comments (limit) is collected.
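The per-comment step can be tried without touching the Reddit API. The sketch below pulls it into a hypothetical helper, comment_to_row, and feeds it stand-in objects built with SimpleNamespace. These only mimic the three attributes used above (body, author, created_utc) and are not real asyncpraw comments.

```python
from types import SimpleNamespace


def comment_to_row(comment):
    """Turn a comment-like object into a [body, author, created_utc] row."""
    # Comments whose account was deleted have no author object.
    author_name = comment.author.name if comment.author else '[deleted]'
    return [comment.body, author_name, comment.created_utc]


# Stand-in objects for illustration only.
with_author = SimpleNamespace(
    body="Oh great, another Monday.",
    author=SimpleNamespace(name="some_user"),
    created_utc=1700000000.0,
)
without_author = SimpleNamespace(body="Sure, that will work.", author=None, created_utc=1700000100.0)

rows = [comment_to_row(c) for c in (with_author, without_author)]
print(rows)
```

Keeping the row-building logic in a small function like this makes the '[deleted]' fallback easy to check before running a long collection job.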
Handling API Exceptions
`        except asyncpraw.exceptions.APIException as e:
            print(f"API exception occurred: {e}")
            wait_time = 60 # Wait for 1 minute before retrying
            print(f"Waiting for {wait_time} seconds before retrying...")
            await asyncio.sleep(wait_time)`
Catches and handles API exceptions that may occur during Reddit API interactions. Prints the exception message, waits for a minute (wait_time) before retrying, and then resumes fetching comments.
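The same wait-and-retry idea can be written as a small reusable helper. This is a generic sketch, not the article's code: retry_async, flaky_fetch and the retries parameter are hypothetical, and a simulated ConnectionError stands in for a Reddit API exception so the example runs without network access.

```python
import asyncio


async def retry_async(operation, retries=3, wait_time=60, retry_on=(Exception,)):
    """Await operation(); on a listed exception, sleep and try again."""
    for attempt in range(1, retries + 1):
        try:
            return await operation()
        except retry_on as exc:
            if attempt == retries:
                raise  # Out of attempts: let the caller see the error.
            print(f"Attempt {attempt} failed ({exc}); waiting {wait_time} seconds...")
            await asyncio.sleep(wait_time)


calls = {"count": 0}


async def flaky_fetch():
    """Fail twice with a simulated error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated API failure")
    return "ok"
```

In a notebook you could try it with result = await retry_async(flaky_fetch, wait_time=0, retry_on=(ConnectionError,)). Unlike the loop above, this version gives up after a fixed number of attempts, which avoids retrying forever if the API stays unavailable.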
Returning Results
`    return comments[:limit] # Returns up to 'limit' number of comments`
Returns a list of collected comments, limited by the specified limit, ensuring only the required number of comments are returned.
Main Function to Execute Collection
`async def main():
    comments = await collect_reddit_comments('sarcasm', 'sarcastic', limit=5000) # Adjust limit as needed
    df = pd.DataFrame(comments, columns=['comment', 'author', 'created_utc'])
    df.to_csv('reddit_comments.csv', index=False)
    print(f"Total comments collected: {len(df)}")
    print(df.head())`
Defines an asynchronous main function to orchestrate the comment collection process. Calls collect_reddit_comments with parameters subreddit_name='sarcasm', keyword='sarcastic', and limit=5000 (can be adjusted). Converts collected comments into a Pandas DataFrame (df), stores it as a CSV file (reddit_comments.csv), and prints summary information about the collected data.
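The created_utc column holds Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read in the CSV. As a minimal sketch, the hypothetical helper utc_timestamp_to_iso below converts one such value to a readable UTC string using only the standard library.

```python
from datetime import datetime, timezone


def utc_timestamp_to_iso(created_utc):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()


print(utc_timestamp_to_iso(1700000000))  # 2023-11-14T22:13:20+00:00
```

For the whole DataFrame, pd.to_datetime(df['created_utc'], unit='s', utc=True) should give the same conversion for every row at once.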
Running the Main Function
`await main()`
Runs main, which collects the Reddit comments, loads them into a DataFrame, saves them to a CSV file, and prints the number of comments collected along with a preview of the data. A top-level await like this works in a notebook cell such as Colab; in a plain Python script, call asyncio.run(main()) instead.
Read Part 2: Sarcasm Detection From Reddit Comments: Cleaning & Saving The Data
GITHUB: https://github.com/stevie1mat/Sarcasm-Detection-With-Reddit-Comments
Author: Steven Mathew