Arvind Mehairjan

Originally published at helloiamarra.com

 

Build A Web Crawler To Check for Broken Links with Python & BeautifulSoup 🕷️

In this article, I am going to show you how you can build a simple web crawler with Python and BeautifulSoup that checks for broken links.


Prerequisites

Before we build our application, we need the following tools installed on our device:

  • Python 3. If you haven’t installed it yet, download and install it from their website.
  • An IDE. You are free to use any IDE/text editor that is available out there. I am going to use PyCharm. If you want to download the free version, make sure you download and install the Community Edition.
  • BeautifulSoup. We need to download and install BeautifulSoup using pip. In your command line (or terminal) you can run the following command: pip install beautifulsoup4
  • requests. This is the last library we need to install. You can also install it by entering this command: pip install requests

What is BeautifulSoup?

Beautiful Soup is a library written in Python that extracts data out of HTML and XML files. It works well if you want to get data quickly and saves programmers a lot of time.
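As a quick illustration, here is how parsing works on a small, made-up HTML snippet (the snippet and variable names are just for this example):

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet to demonstrate parsing
html = '<html><body><a href="https://example.com">Example</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a')    # the first <a> element
print(link.get('href'))  # https://example.com
print(link.text)         # Example
```

We will use the same pattern below, only with HTML fetched from a live URL instead of a hard-coded string.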

Writing our script

The first thing we need to do is create a script. Create an empty file in your IDE and name it verify_response_code.py

The second thing we need to do is to import BeautifulSoup from bs4 (the library we installed in our prerequisites). We also need to import the library requests. Our code looks like this:

from bs4 import BeautifulSoup

import requests

Next, we create a variable with the name url, where a prompt message asks for the URL we want to retrieve the links from. Our code looks like this:

url = input("Enter your url: ")

Afterward, we create a variable in which we are going to use the requests library. We use its get method to actually fetch the URL we entered.

page = requests.get(url)

We now have our URL. Next, we want to retrieve its response code. If the site is available, we get the response code 200. If it isn't available, we get an error code such as 404. We take the page variable from before and convert its status code to a string using the str method. Our code looks like this:

response_code = str(page.status_code)
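In practice, 200 is only one of several "healthy" codes (301 redirects are fine too); as a rule of thumb, anything from 400 upward indicates a broken link. A small sketch of that rule, using example codes:

```python
# Rule of thumb: HTTP status codes of 400 and above indicate a broken link
def is_broken(status_code):
    return status_code >= 400

# Example codes: 200 OK, 301 redirect, 404 not found, 500 server error
for code in (200, 301, 404, 500):
    print(code, "broken" if is_broken(code) else "ok")
```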

Furthermore, our application needs the HTML of the page itself. For that we create a variable called data that holds the page's content as a string.

data = page.text

The last variable we have to add is soup. We assign it a BeautifulSoup object, passing the data variable as the first argument and 'html.parser' as the parser. This gives us access to BeautifulSoup's built-in methods (and naming the parser explicitly avoids a warning).

soup = BeautifulSoup(data, 'html.parser')

The last step in our web crawler is adding a for-loop. We use the find_all method with the argument 'a', which finds all a elements on the page. For each one, we print the value of its href attribute, which gives us just the URL, followed by our response code. Our code now looks like this:

for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")

If it is correct, your whole code should look like this:

# Import libraries
from bs4 import BeautifulSoup
import requests

# Prompt the user to enter the URL
url = input("Enter your url: ")

# Make a request to get the URL
page = requests.get(url)

# Get the response code of the given URL
response_code = str(page.status_code)

# Store the HTML of the page as a string
data = page.text

# Parse the page with BeautifulSoup so we can use its built-in methods
soup = BeautifulSoup(data, 'html.parser')

# Iterate over all links on the page, printing each with the response code
for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")

Now run the script by typing python verify_response_code.py in your terminal. You will be asked to enter a URL. Enter one and press Enter. If all goes well, you should see output like the example below.

Output URL and response codes
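One limitation worth noting: the script prints the status code of the page you entered next to every link, so every line shows the same code. To actually flag broken links, each href has to be requested individually. A minimal sketch of that idea (check_link and the timeout value are my own names and choices, not part of the original script):

```python
import requests
from urllib.parse import urljoin

def check_link(base_url, href, timeout=10):
    # Resolve a possibly relative href against the page URL, request it,
    # and return its status code (None if the request fails entirely)
    try:
        return requests.get(urljoin(base_url, href), timeout=timeout).status_code
    except requests.RequestException:
        return None
```

With that helper, the loop becomes for link in soup.find_all('a'): print(link.get('href'), check_link(url, link.get('href'))), and any code of 400 or above marks a broken link.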

That’s it! Our small web crawler is done. I hope this article was helpful. If you want to check out more content on my blog, join the newsletter.

Happy coding!


➡️ If you want to know more tips about programming, feel free to check out my blog.
➡️ Also, feel free to check out my YouTube channel for more tutorials!
