
Jack Parsons

Posted on • Originally published at jackparsonss.hashnode.dev

How To Scrape Articles From Your Favourite Publications On Medium

Welcome everyone, hope you are having an amazing day!
Today we are going to build a simple Python web-scraping script to scrape the top articles from your favourite Medium publications for any day, month, or year you want! And if you stick around till the end, I will show you how to use Python's threading module for blazing-fast web scraping.

Here are the tools we are going to use today:

  • Python Programming Language
  • Beautiful Soup Package
  • Requests Package
  • Threading Module
  • Integrating with the file system

What is Web Scraping?

Before diving into the code, I should probably explain what in the heck web scraping is. Web scraping is extracting information from the internet, pretty simple right? Okay, so we know what web scraping is, but why do we need it, and how is it useful?

Web scraping is so broad that we can't narrow it down to a single use, but common needs include data collection and automating repetitive tasks (hmm, that sounds pretty useful). Sure, you can do just about anything manually that you can do with web scraping, but if you're anything like me, you prefer to avoid boring, repetitive tasks. Plus, writing some code to do the work for you just feels like an upgrade to your life.

What modules do you need?

Before we get started with coding, we have to make sure to have all the proper packages installed. For this script, we are going to need to install the requests and beautifulsoup4 modules with pip.

pip install requests
pip install beautifulsoup4

The requests module is how we are going to actually make a network request to Medium.com and fetch its data, and the beautifulsoup4 module is what will parse this data and return the parts we actually care about. (Oh, and I am going to call the BeautifulSoup module bs4 from now on.)

If you would like to go deeper into these modules, here are two amazing articles from Real Python covering requests and bs4.

Now that we have the modules installed, we will create a get_medium_articles.py script and import everything.

Imports.png
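
For reference, here are the imports we need at this point, taken from the full script at the end of the article (the threading import will be added later, in the bonus section):

import requests, datetime
from bs4 import BeautifulSoup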

How do you get Articles From Publications on Medium?

Before we start fetching data, we need to understand the URLs we are trying to reach. On Medium, every publication has its own unique URL, which is great, but Medium has gone one step further for us, because we want to find different articles depending on the day they were published.

How do we achieve this? Well, all we have to do is add

**/archive/YEAR/MONTH/DAY**

to the end of the URL and we get the top articles from the specified day. This makes our lives easier because all we have to do is generate the day we want (this is where the datetime module comes in), add it to the publication URL, and we are off and running fetching data.

So let's start by finding all your favourite publications and creating a dictionary that stores the publication name as the key and its URL as the value with /archive/ appended to the end.

You may notice that we do not add any dates here, and that is because we are going to generate them dynamically in the program (I am going to stick with the current date, but you have total control over it). One tip is that you can exclude the day or month from the date to get the top articles from a certain month or year.

URLS.png
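
Here is the urls dictionary from the finished script; feel free to swap in your own favourite publications:

urls = {
    "Towards Data Science": "https://towardsdatascience.com/archive/",
    "Personal Growth": "https://medium.com/personal-growth/archive/",
    "Better Programming": "https://betterprogramming.pub/archive/",
}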

After creating your urls dictionary, we are going to create a function called get_articles which will take in a publication and URL and collect the data we want!

Let's start by initializing an empty dictionary called data, and then fetch the current date. We will need the datetime.now() function, which returns the current date and time, and then use the strftime method on the date to format it in a way Medium will understand.
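
A minimal sketch of that, using today's date (the finished script at the end switches to yesterday's date, as per the tip below):

data = {}


def get_articles(publication, url):
    # Format today's date the way Medium's archive URLs expect, e.g. 2021/08/18
    date = datetime.datetime.now().strftime("%Y/%m/%d")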

Just a quick tip: if you are doing this early in the morning, there's a good chance that there aren't any articles out yet, so it may be a good idea to fetch from the previous day. If that is the case, you can change the date in your get_articles function to

yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
date = str(yesterday.strftime("%Y/%m/%d"))

Next, we will need to append the date to the end of the URL and print out the publication to give us some feedback, or, if you want, you can print out the URL here instead. (If you are curious, you can go ahead and visit the URL, change the dates, and see what happens!)
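
In code, inside get_articles, that step is simply:

    url += date
    print(f"Checking {publication}...")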

How do you Make Network Requests?

For this step, all we need to do is perform a GET request. What is a GET request? All we need to know here is that it goes to a URL and fetches all the data from that URL; this can be JSON, HTML, etc. In this case, what we are going to get is HTML, which we will feed into bs4 to get our data. So without any further ado, let's start building our function πŸ˜„.

make_request.png
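
Here is roughly what the request portion of get_articles looks like (note the return after the error print, so we bail out instead of trying to parse a failed response):

    response = requests.get(url, allow_redirects=False)

    try:
        response.raise_for_status()
    except requests.exceptions.RequestException:
        print(f"Invalid URL At {url}")
        return

    page = response.content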

We make the GET request by calling requests.get() and passing the URL into it. This returns a response object, and the first thing we want to do is call raise_for_status(), because we want to fail as early as possible; the closer the crash is to the problem, the easier it is to debug.

In this case, the request will fail if you provide an invalid URL or if there is a connection problem.

I have decided to catch this exception and print out the invalid URL. As long as the request is valid, we will be able to access the HTML by calling .content on the response object, and this is what we are going to feed to bs4 next!

Okay, some of you must be asking, "what does allow_redirects=False do?". Good job if you caught that, but let me explain it.

On Medium, if you pass in a day when no articles are published, then it will redirect to the last month when an article was published, and because I only care about finding all new relevant articles from today I am choosing to disallow redirects. If this is functionality that you want, then just set allow_redirects=True.

How do you Parse HTML With Beautiful Soup?

Before we actually do anything with the code, let's visit any publication of your choice and put in the date you want to check articles for. I am going to go to https://betterprogramming.pub/archive/2021/08/18, and then open up my inspector.

Inspector.png

If you use your selector tool (command/control+shift+c) and hover over an article, you will see a div with the class names class="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls" and this is what bs4 is going to use to find every article on the page.

div.png

Now that we know what to look for, let's create a BeautifulSoup object and pass in the page's contents; for the second parameter, we will pass in html.parser, which tells bs4 to parse HTML.
Looking back, we can see that these classes are on a div, so we need to tell our new bs4 object to find all the divs on the page with these specific classes.

fetch_articles.png
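
Inside get_articles, that looks like this:

    soup = BeautifulSoup(page, "html.parser")
    # Every article on the archive page lives in a div with these classes
    articles = soup.find_all(
        "div",
        class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls",
    )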

After getting all the articles, we will give ourselves some terminal feedback if we found any. Next, we are going to determine how many of the articles we want; I am choosing 3 (or fewer) per publication, but you can have as many as you want.

save_articles.png
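
For reference, here is the rest of get_articles from the full script:

    if len(articles) > 0:
        print(f"Fetching Articles from {url}")

    # Keep at most three articles per publication
    amount_of_articles = min(3, len(articles))

    for i in range(amount_of_articles):
        title = articles[i].find("h3", class_="graf--title")

        # Some posts don't have a proper title element; skip them
        if title is None:
            continue

        title = title.contents[0]
        # The fourth <a> tag holds the article link; drop the query parameters
        article_url = articles[i].find_all("a")[3]["href"].split("?")[0]

        article = {
            "title": title,
            "article_url": article_url,
        }

        if not data.get(publication):
            data[publication] = [article]
        else:
            data[publication].append(article)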

WOAH, that's a lot of code, and this next part can be taken a lot further, but I am choosing only to retrieve the title and URL from each article (I challenge you to also get the author!). Let's break it down and turn the HTML into something we can use!

Currently, we have a list of articles, so let's loop through and extract data one by one.
First, we will extract the title using the find() method, validate it, and retrieve the text string.

The next line looks pretty confusing, but let's break it down: we find all the a tags (links), pick the appropriate one, grab its href, strip the query parameters from the URL, and keep the result.

The way I plan to store the data is a dictionary with the publication as the key and an array of articles as the value, so let's go ahead and create the article and then file it under the current publication.

I just want to say congratulations if you have made it this far, and if you only came here to get the data, you are done, because after you loop through all your publications your dictionary will be filled with all the data you need!

But if you want to see how to output it in a nice Markdown format to your desktop, keep on reading :) (I also show threading at the end)

How do you Write Your Data Into a File?

For this next part, we don't need any special module or package; all the things we need are built right into Python (how nice).

To get started, we are going to create a new function called write_to_desktop(data). I plan on writing all our data to a Markdown file on my desktop, but you can do anything you want with this, such as using a simple text file or writing to a separate drive.

write_to_desktop.png
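
Here is the function from the full script (PATH_TO_DESKTOP is a placeholder for the actual path to your desktop):

def write_to_desktop(data):
    with open("PATH_TO_DESKTOP/articles.md", "a") as file:
        out = ""

        for publication, articles in data.items():
            out += f"### ***{publication}***\n"
            for article in articles:
                out += f"#### [{article['title']}]({article['article_url']})\n\n"

            out += "---\n\n"

        file.write(out)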

We start by creating (or opening) a file on your desktop and initializing an empty string, which is what ends up getting written to the file.

Next, we are going to write two for loops: the first to get each publication and its articles, and the second to get each article's data. If the hashtags and asterisks look scary to you, they are just there to make the Markdown look nice, and you can ignore them if you are writing to a text file.

Then we write our data to the file, and we are just about done. Let's write out our main() function and actually see some results!

main.png
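
Without threading (we will add that in the bonus section), a minimal main() could look something like this:

def main():
    # Fetch articles from every publication, one after another
    for publication, url in urls.items():
        get_articles(publication, url)

    # Then dump everything to the Markdown file
    write_to_desktop(data)


main()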

Output.png
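
The Markdown output ends up looking roughly like this (publication and article names here are just placeholders):

    ### ***Better Programming***
    #### [An Example Article Title](https://betterprogramming.pub/an-example-article)

    ---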

Go ahead and run your script and BOOM, you're all done. Congra... hmm, I guess I did promise you all a bonus. Let's go improve performance πŸš€

BONUS - How to Improve Performance with Threads

Before heading into this section, I do want to point out that this can majorly increase performance; however, if you are only getting articles from a few publications, you won't see much of a change. Once you are fetching from more than 5, the performance gains really start to kick in.

The only function that we need to modify here is our main() function, so this doesn't get too complicated.
First, we need to import the threading module into our script.

Threads.png
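
Here is the threaded main() from the full script (threading is added to the imports at the top):

def main():
    threads = []

    for publication, url in urls.items():
        # One thread per publication, each running get_articles
        thread = threading.Thread(target=get_articles, args=[publication, url])
        threads.append(thread)
        thread.start()

    # Wait for every thread to finish before writing the data
    for thread in threads:
        thread.join()

    write_to_desktop(data)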

Then, we are going to make a new Thread for each publication, so go ahead and call threading.Thread, making the target our get_articles function; everything in args is the list of arguments that will be passed to get_articles.

After that, we will append the thread to our list of threads and then start the thread.

The next for loop may seem a bit odd, but we do this so that all of our threads finish before writing the data. Without this loop, our file might only contain data from some of the threads, because write_to_desktop() would get called before all of the data is ready.

Now try adding a bunch more publications to the dictionary at the top and see if you notice any performance differences!

Once again, congratulations, we are officially done with this script.


Conclusion

WOW, that was a lot! We went over what web scraping is, making network requests, scraping data, writing to a file, and using threads. Hopefully, you can find a use for this script somewhere in your workflow! Personally, I use this every morning: I use Obsidian daily notes, and I have this script append all the articles to the bottom of the note for me to look at while drinking a nice cup of coffee β˜•οΈ. Please comment below on how you plan on using this script!

Anyway, I really hope you enjoyed this article and if you want you can talk to me on Twitter! Have an amazing day! (the code is right below)


Source Code

import requests, datetime, threading
from bs4 import BeautifulSoup


urls = {
    "Towards Data Science": "https://towardsdatascience.com/archive/",
    "Personal Growth": "https://medium.com/personal-growth/archive/",
    "Better Programming": "https://betterprogramming.pub/archive/",
}


data = {}


def get_articles(publication, url):
    # Build the archive URL for yesterday's date, e.g. .../archive/2021/08/18
    yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
    date = str(yesterday.strftime("%Y/%m/%d"))
    url += date

    print(f"Checking {publication}...")

    # Don't follow redirects: Medium redirects empty archive days to an older month
    response = requests.get(url, allow_redirects=False)

    try:
        response.raise_for_status()
    except requests.exceptions.RequestException:
        # Bail out so we don't try to parse a failed response
        print(f"Invalid URL At {url}")
        return

    page = response.content

    soup = BeautifulSoup(page, "html.parser")
    articles = soup.find_all(
        "div",
        class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls",
    )

    if len(articles) > 0:
        print(f"Fetching Articles from {url}")

    # Keep at most three articles per publication
    amount_of_articles = min(3, len(articles))

    for i in range(amount_of_articles):
        title = articles[i].find("h3", class_="graf--title")

        if title is None:
            continue

        title = title.contents[0]
        # The fourth <a> tag holds the article link; drop the query parameters
        article_url = articles[i].find_all("a")[3]["href"].split("?")[0]

        article = {
            "title": title,
            "article_url": article_url,
        }

        if not data.get(publication):
            data[publication] = [article]
        else:
            data[publication].append(article)


def write_to_desktop(data):
    # Replace PATH_TO_DESKTOP with the actual path to your desktop
    with open("PATH_TO_DESKTOP/articles.md", "a") as file:
        out = ""

        for publication, articles in data.items():
            out += f"### ***{publication}***\n"
            for article in articles:
                out += f"#### [{article['title']}]({article['article_url']})\n\n"

            out += "---\n\n"

        file.write(out)


def main():
    threads = []

    for publication, url in urls.items():
        thread = threading.Thread(target=get_articles, args=[publication, url])
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    write_to_desktop(data)


if __name__ == "__main__":
    main()
