In this article, I am going to show you how you can build a simple web crawler with Python and BeautifulSoup that checks for broken links.
If you prefer watching to reading, check out the video version of this tutorial below:
Prerequisites
Before we start building our application, we need the following tools installed on our device:
- Python 3. If you haven’t installed it yet, download and install it from their website.
- An IDE. You are free to use any IDE/text editor that is available out there. I am going to use PyCharm. If you want to download the free version, make sure you download and install the Community Edition.
- BeautifulSoup. We need to download and install BeautifulSoup using pip. In your command line (or terminal) you can run the following command:
pip install beautifulsoup4
- requests. This is the last library we need to install. You can install it by running the following command:
pip install requests
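If you want to make sure both libraries installed correctly, a quick optional check (this one-liner is just my suggestion, not part of the tutorial) is importing them from the command line:
python -c "import bs4, requests; print(bs4.__version__, requests.__version__)"
This prints the installed versions of both packages.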
What is BeautifulSoup?
Beautiful Soup is a Python library that extracts data out of HTML and XML files. It works well when you want to grab data quickly, and it saves programmers a lot of time.
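To give you an idea of how it works, here is a minimal sketch (the HTML snippet is made up for illustration):
from bs4 import BeautifulSoup

# A made-up HTML snippet for illustration
html = '<a href="https://example.com">Example</a>'

# Parse the snippet and read the href attribute of the first <a> element
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a").get("href"))  # https://example.com
We will use this same find/get pattern on a real page in our script.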
Writing our script
The first thing we need to do is create a script. Create an empty file in your IDE and name it verify_response_code.py.
The second thing we need to do is import BeautifulSoup from bs4 (the library we installed in our prerequisites). We also need to import the requests library. Our code looks like this:
from bs4 import BeautifulSoup
import requests
Next, we create a variable named url and use a prompt message to let the user enter the URL we want to retrieve the links from. Our code looks like this:
url = input("Enter your url: ")
Afterward, we create a variable in which we are going to use the requests library. Within the library, we use the get method to actually fetch the URL we entered.
page = requests.get(url)
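Keep in mind that requests.get raises an exception if the URL is malformed or the site is unreachable. Our script keeps things simple and skips error handling, but a guarded version could look like this (the timeout value is my own choice):
import requests

url = input("Enter your url: ")

try:
    page = requests.get(url, timeout=10)
except requests.exceptions.RequestException as error:
    # Covers invalid URLs, connection errors, and timeouts
    print(f"Could not fetch {url}: {error}")
    raise SystemExit(1)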
We now have our page. Next, we want to retrieve the response code. If our page is available, we get the response code 200; if it can't be found, we get the response code 404. We take the page variable from before and convert its status code to a string using the str method. Our code looks like this:
response_code = str(page.status_code)
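As a quick illustration of those two codes (using httpbin.org, a test service, purely as an example):
import requests

# An existing page returns 200
print(requests.get("https://httpbin.org/status/200").status_code)  # 200

# A missing page returns 404
print(requests.get("https://httpbin.org/status/404").status_code)  # 404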
Furthermore, our application needs the text of the page itself. To do that, we create a variable called data that holds the HTML of the page as a string.
data = page.text
The last variable we have to add is soup. We assign BeautifulSoup to it, passing the data variable as the first argument and "html.parser" as the second so BeautifulSoup knows which parser to use. We do this so we can use the built-in methods of BeautifulSoup.
soup = BeautifulSoup(data, "html.parser")
The last step in our web crawler is adding a for-loop. We are going to use the find_all method with the argument 'a', which finds all a elements on our webpage. After that, we print each URL: we use the get method again to read the href attribute of each a element, since that attribute holds the URL itself. Next to it, we print our response code. Our code now looks like this:
for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")
If it is correct, your whole code should look like this:
# Import libraries
from bs4 import BeautifulSoup
import requests

# Prompt the user to enter the URL
url = input("Enter your url: ")

# Make a request to get the URL
page = requests.get(url)

# Get the response code of the given URL
response_code = str(page.status_code)

# Get the HTML of the page as a string
data = page.text

# Parse the HTML with BeautifulSoup so we can use its built-in methods
soup = BeautifulSoup(data, "html.parser")

# Iterate over all links on the given URL, printing the response code next to each
for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")
Now run the script by typing python verify_response_code.py in your terminal. You are asked to enter a URL. Enter it and press enter. If everything goes well, you should see every link on the page printed with the response code next to it.
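One thing worth noting: as written, the script prints the status code of the page you entered next to every link, rather than requesting each link itself. If you want to actually check each link for broken ones, a sketch like this could work (the use of urljoin, the timeout, and the error handling are my additions, not part of the original script):
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

url = input("Enter your url: ")
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

for link in soup.find_all('a'):
    href = link.get('href')
    if not href:
        continue  # skip <a> elements without an href attribute
    # Resolve relative links (e.g. /about) against the base URL
    full_url = urljoin(url, href)
    try:
        status = requests.get(full_url, timeout=10).status_code
    except requests.exceptions.RequestException:
        status = "failed"
    print(f"Url: {full_url} | Status Code: {status}")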
That’s it! Our small web crawler is done. I hope this article was helpful. If you want to see more content like this, join the newsletter on my blog.
Happy coding!
➡️ If you want to know more tips about programming, feel free to check out my blog.
➡️ Also, feel free to check out my YouTube channel for more tutorials!