Gabe Romualdo

Posted on Nov 10, 2019 • Edited on Feb 14, 2024 • Originally published at xtrp.io

A Quick Script to Find Broken Links on Your Site 🎯

#webdev #python #test

Introduction

It seems like almost every other click on the internet ends up in an "Error 404: Page Not Found" page. "Whoops, the page you're looking for does not exist," "Sorry, the requested URL was not found on this server," "Oops, something went wrong. Page not found." Every internet user has seen pages like these.

I think web developers should pay less attention to building clever 404 pages and more to eliminating broken links altogether.

The Program

I've built an automated program to find broken links.

Written in Python 3, it recursively follows the links on a given site and checks each one for a 404 response. Once it has finished searching the entire site, it prints every broken link it found along with the page that links to it, so that developers can fix them.
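
To make the traversal concrete, here is a minimal sketch of the same idea run against a small in-memory "site" instead of real HTTP. The FAKE_SITE dictionary and the crawl function are hypothetical names used only for illustration, and a page missing from the dictionary stands in for a 404 response.

```python
# Hypothetical in-memory site: each page maps to the links it contains.
# A link that is not a key in this dict stands in for a 404 response.
FAKE_SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/team"],            # "/team" does not exist
    "/blog": ["/blog/post-1", "/about"],
    "/blog/post-1": ["/old-page"],       # "/old-page" does not exist
}

def crawl(page, parent, visited, broken):
    """Visit each page once and record broken links with the page linking to them."""
    if page in visited:
        return
    visited.add(page)
    if page not in FAKE_SITE:
        broken.append((page, parent))
        return
    for link in FAKE_SITE[page]:
        crawl(link, page, visited, broken)

visited, broken = set(), []
crawl("/", "", visited, broken)
for link, parent in broken:
    print("BROKEN: link " + link + " from " + parent)
```

The visited set is what keeps the recursion from looping forever when pages link back to each other, which is the role searched_links plays in the real script.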

Note that the program makes a lot of HTTP requests in a relatively short period of time, so be aware of your data usage and of the load this places on the site being checked.
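
If the request volume is a concern, one option is to pause between requests. The sketch below is not part of the original script: Throttle is a hypothetical helper, and the 0.5 second interval is an assumed value you would tune for your own site.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None        # time of the previous call, or None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Assumed value: at most roughly two requests per second.
throttle = Throttle(min_interval=0.5)
```

Calling throttle.wait() immediately before each requests.get(URL) call would slow the crawl down to the chosen rate.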

Usage

Check if you have Python 3 installed:

If the following command does not yield a version number, download Python 3 from python.org.

$ python3 -V

Install the Requests package (for HTTP requests) and the BeautifulSoup package (for HTML parsing) from PyPI using pip.

(Note: I do not maintain these packages and am not associated with them, so download at your own risk)

$ pip3 install requests
$ pip3 install beautifulsoup4

Copy and paste the following code into a file (I use the name find_broken_links.py in this article).

import requests
import sys
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from urllib.parse import urljoin

searched_links = []
broken_links = []

def getLinksFromHTML(html):
    def getLink(el):
        return el["href"]
    return list(map(getLink, BeautifulSoup(html, features="html.parser").select("a[href]")))

def find_broken_links(domainToSearch, URL, parentURL):
    if (not (URL in searched_links)) and (not URL.startswith("mailto:")) and (not ("javascript:" in URL)) and (not URL.endswith(".png")) and (not URL.endswith(".jpg")) and (not URL.endswith(".jpeg")):
        # Mark the URL as visited before requesting it, so a failed request is not retried
        searched_links.append(URL)
        try:
            requestObj = requests.get(URL)
            if requestObj.status_code == 404:
                broken_links.append("BROKEN: link " + URL + " from " + parentURL)
                print(broken_links[-1])
            else:
                print("NOT BROKEN: link " + URL + " from " + parentURL)
                # Only follow links found on pages within the domain being searched
                if urlparse(URL).netloc == domainToSearch:
                    for link in getLinksFromHTML(requestObj.text):
                        # Resolve relative links against the current page's URL
                        find_broken_links(domainToSearch, urljoin(URL, link), URL)
        except Exception as e:
            print("ERROR: " + str(e))

find_broken_links(urlparse(sys.argv[1]).netloc, sys.argv[1], "")

print("\n--- DONE! ---\n")
print("The following links were broken:")

for link in broken_links:
    print("\t" + link)

Run the script from the command line with the URL of a website of your choice.

$ python3 find_broken_links.py https://your_site.com/
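
The script relies on two standard-library functions: urljoin turns the relative links found on a page into absolute URLs, and urlparse(...).netloc decides whether a page belongs to the site being searched. The sketch below shows how they behave, using hypothetical example.com URLs. The resolve and is_internal helpers are illustrative names, and stripping the #fragment with urldefrag is an optional extra that the script above does not do.

```python
from urllib.parse import urldefrag, urljoin, urlparse

page = "https://example.com/blog/post.html"  # hypothetical page URL

def resolve(base, link):
    """Resolve link against base and drop any #fragment."""
    return urldefrag(urljoin(base, link)).url

def is_internal(url, domain):
    """Return True if url is hosted on the given domain."""
    return urlparse(url).netloc == domain

print(urljoin(page, "about.html"))   # https://example.com/blog/about.html
print(urljoin(page, "/contact"))     # https://example.com/contact
print(urljoin(page, "#top"))         # https://example.com/blog/post.html#top
print(resolve(page, "#top"))         # https://example.com/blog/post.html
print(is_internal("https://other.example.org/x", "example.com"))  # False
```

Without the fragment stripped, a link such as #top produces a URL that differs from the page it points to, so the same page may be requested more than once.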

Conclusion

I hope you found this useful. It certainly helped me find a few broken links on my own site.

This program is CC0 licensed, so it is completely free to use, but it comes with no warranties or guarantees.

Give this post a ❤️ if you liked it!

Thanks for scrolling.

— Gabriel Romualdo, November 10, 2019

Top comments (5)

Anjan Kant • Nov 11 '19

It's very nicely explained and very helpful for finding broken links on a website. Search engines penalize websites that have broken links. I have also written about web scraping and finding broken links, with videos.

Gabe Romualdo • Nov 11 '19 • Edited

Thanks, that's really cool! I enjoyed checking out your site and articles :)

— Gabriel

Anjan Kant • Nov 12 '19

Thanks, Fred, for navigating my website and checking out the cool stuff :)

Manoj Barman • Nov 12 '19

extra closing bracket at the end of line 16:

if (not (URL in searched_links)) and (not URL.startswith("mailto:")) and (not ("javascript:" in URL)) and (not URL.endswith(".png")) and (not URL.endswith(".jpg")) and (not URL.endswith(".jpeg"))):

Gabe Romualdo • Nov 12 '19 • Edited

So sorry! Didn't notice that, thanks so much for letting me know. I'll fix that in the article now. Thanks again :)

— Gabriel