DEV Community

DishyDev

Posted on • Originally published at dishy.dev

Scraping Images from Reddit Threads in Python

Introduction

This is a little side project I did to try to scrape images out of Reddit threads. There are a few subreddits that discuss shows, notably /r/anime, where users post a lot of screenshots of the episodes. I thought it'd be cool to see how much effort it would take to automatically collate those screenshots from a thread and display them in a simple gallery. The result looked like this:

Reddit Scraper v1 example video

PRAW

PRAW, the Python Reddit API Wrapper, provides a nice set of bindings for talking to Reddit.

To scrape Reddit you need credentials. The way to generate them is tucked away at https://www.reddit.com/prefs/apps, where you have to register a new "app" with Reddit. Connecting is as simple as:

import praw

# Line continuations are implicit inside parentheses, so no backslashes are needed
reddit = praw.Reddit(client_id='id',
                     client_secret='secret',
                     user_agent='useragent',
                     username='username',
                     password='DevToIsCool')
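
Hard-coding a password in the script makes it easy to leak when the code is shared. Here is a minimal sketch of one alternative, assuming the credentials live in environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_credentials` are made up for this example and are not part of PRAW:

```python
import os

# Hypothetical environment variable names, chosen for this example only
REQUIRED_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def load_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS.values() if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in REQUIRED_KEYS.items()}


# Usage (needs praw installed and the variables set):
#   reddit = praw.Reddit(**load_credentials())
```

The function fails early with the list of missing variables, which is usually easier to debug than an authentication error from Reddit.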

The API makes traversing Reddit simple. For example, this prints every comment in a thread:

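# Placeholder URL: pass the full URL of a thread (submission), not a subreddit
submission = reddit.submission(url="https://reddit.com/r/abcde")
# Drop the "load more comments" placeholders so every item has a body
submission.comments.replace_more(limit=0)
for comment in submission.comments.list():
    print(comment.body)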

Finding links

Nearly all of the images I was looking for were posted to Imgur, so I only matched on those links. I used a regular expression to extract them. I always recommend a tool like RegEx101, which makes it much easier to debug regular expressions, as they can be pretty brain-bending.

    import re
    # Dots are escaped to match a literal "."; \S keeps a match inside one link
    REGEX_TEST = r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))"
    p = re.compile(REGEX_TEST, re.IGNORECASE)
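
To see what the pattern captures, here is a small self-contained check against a made-up comment. Because the pattern contains groups, `findall` returns one tuple per match, and the first element of each tuple is the whole URL. The dots are escaped in this version so they only match a literal "."; the sample text and image IDs are invented:

```python
import re

IMGUR_PATTERN = re.compile(
    r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))", re.IGNORECASE
)

# Invented comment text with one bare link and one markdown link
sample_comment = (
    "That scene was great: https://i.imgur.com/abc123.jpg "
    "and also [this one](http://i.imgur.com/XyZ789.PNG)!"
)

# Each match is a tuple of groups; index 0 is the full URL
links = [match[0] for match in IMGUR_PATTERN.findall(sample_comment)]
print(links)
```

Note that `re.IGNORECASE` is what lets the upper-case `.PNG` extension through.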

Check if an image still exists

One of the problems I found was dead image links, so I created a simple helper that checks the status_code for that link.

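import requests

# Check whether a link still exists
def checkLinkActive(url):
    try:
        # HEAD avoids downloading the image; the timeout stops a dead host hanging the scrape
        response = requests.head(url, timeout=10)
    except requests.RequestException:
        return False
    return response.status_code == 200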
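
Each check is a separate network round trip, so a thread with many links can take a while to verify one by one. Here is a rough sketch of running the checks in parallel with the standard library's `ThreadPoolExecutor`. The helper name `filterActiveLinks` and the `max_workers` default are assumptions for illustration, and the checker is passed in as an argument so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor


def filterActiveLinks(urls, check, max_workers=8):
    """Return the urls for which check(url) is truthy, keeping input order."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        results = list(pool.map(check, urls))
    return [url for url, ok in zip(urls, results) if ok]


# Stand-in checker for the demo; in the scraper this would be checkLinkActive
def fakeCheck(url):
    return not url.endswith("dead.jpg")


active = filterActiveLinks(
    ["https://i.imgur.com/a.jpg", "https://i.imgur.com/dead.jpg"], fakeCheck
)
print(active)
```

Threads suit this job because the work is almost entirely waiting on the network rather than using the CPU.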

Getting Thumbnails

To save bandwidth (and your mobile data) I wanted to return a smaller version of each image. On Imgur you can append a size letter to the image ID in the URL to get the image at a different size, for example 'l' for large and 's' for small.

# Add a size letter to an imgur url to get a smaller thumbnail
def getImgurThumbnail(url, size):
    # Assumes a four-character extension such as ".jpg" or ".png"
    return url[:-4] + size + url[-4:]
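
The helper relies on the extension being exactly four characters long (`.jpg` or `.png`), which holds for the links matched earlier. A slightly more defensive sketch, assuming you might also meet `.jpeg` links, splits off the extension instead of counting characters. The name `imgurThumbnail` and the image IDs are hypothetical:

```python
import posixpath


def imgurThumbnail(url, size):
    """Insert the size letter between the image ID and the file extension."""
    # splitext only looks at the part after the last "/", so the dots
    # in "i.imgur.com" are left alone
    stem, extension = posixpath.splitext(url)
    return stem + size + extension


print(imgurThumbnail("https://i.imgur.com/abc123.jpg", "m"))
# https://i.imgur.com/abc123m.jpg
print(imgurThumbnail("https://i.imgur.com/abc123.jpeg", "s"))
# https://i.imgur.com/abc123s.jpeg
```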

Putting it all together

Putting all of these bits together, you get:

def getImages(url):
    submission = reddit.submission(url=url)
    # Tell the API to return every comment in the thread; results are
    # paginated by default
    submission.comments.replace_more(limit=None)

    # Create RegEx object for matching images (dots escaped so they only
    # match a literal "."; \S keeps a match inside a single link)
    REGEX_TEST = r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))"
    p = re.compile(REGEX_TEST, re.IGNORECASE)

    imageMatches = []
    for comment in submission.comments.list():
        matches = p.findall(comment.body)
        for match in matches:
            if checkLinkActive(match[0]):
                imageMatches.append(
                    {"image": match[0], "thumbnail": getImgurThumbnail(match[0], "m")}
                )

    return imageMatches
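
The same screenshot is often linked in several comments, and each duplicate costs another status check. Here is a minimal sketch of the extraction step as a pure function that de-duplicates links before any checking happens. `extractImageLinks` is a hypothetical helper, and it takes plain comment strings rather than PRAW objects so that it can run on its own:

```python
import re

IMGUR_PATTERN = re.compile(
    r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))", re.IGNORECASE
)


def extractImageLinks(bodies):
    """Return the unique imgur image links found in bodies, in first-seen order."""
    seen = set()
    links = []
    for body in bodies:
        for match in IMGUR_PATTERN.findall(body):
            url = match[0]  # group 0 of the tuple is the full URL
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


# Invented comment bodies for the demo
comments = [
    "Best shot: https://i.imgur.com/aaa111.jpg",
    "Agreed! https://i.imgur.com/aaa111.jpg and https://i.imgur.com/bbb222.png",
    "No links here.",
]
unique_links = extractImageLinks(comments)
print(unique_links)
```

With PRAW, the input could be built as `[comment.body for comment in submission.comments.list()]` once `replace_more` has been called.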

Trying it out

I decided to stand up a quick demo of this, using an Azure Function to host my new function and a simple web form to allow people to try it out. Just copy and paste a Reddit URL and the function will return any images.

The demo app uses Bulma for the look and feel, and a little bit of jQuery to load the results into the page.

If you want to give it a go, you can have a play on my site here.


I'll be looking in a future article at providing a show name search instead of having to paste individual episode URLs. Happy Reddit scraping!
