DEV Community

Gabe Romualdo

Posted on • Edited on • Originally published at xtrp.io

A Quick Script to Find Broken Links on Your Site 🎯

Introduction

It seems like almost every other click on the internet ends up on an "Error 404: Page Not Found" page. "Whoops, the page you're looking for does not exist." "Sorry, the requested URL was not found on this server." "Oops, something went wrong. Page not found." Every internet user has seen pages like these.

I think it's important that web developers spend less time building clever 404 pages and more time eliminating broken links altogether.

The Program

I've built an automated program to find broken links.


Written in Python 3, it recursively follows links on any given site and checks each one for 404 errors. When the program has finished searching an entire site, it prints out any broken links it found, along with the pages they appear on, so that developers can fix them.

Note that the program makes a lot of HTTP requests in a relatively short period of time, so be mindful of your bandwidth usage and of the load you place on the servers being checked.
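If that's a concern, one simple way to throttle the crawl is to enforce a minimum delay between requests. This is a sketch of my own, not part of the script below; the `RateLimiter` class name and the 0.5-second default are arbitrary choices:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough that consecutive calls are at least
        # min_interval seconds apart.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

You would create one `RateLimiter` up front and call its `wait()` method immediately before each `requests.get(...)` call.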

Usage

  1. Check if you have Python 3 installed:

If the following command does not yield a version number, download Python 3 from python.org.

$ python3 -V
  2. Install the Requests and BeautifulSoup packages (the latter is used for HTML parsing) from PyPI.

(Note: I do not maintain these packages and am not associated with them, so download at your own risk)

$ pip3 install requests
$ pip3 install beautifulsoup4
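You can sanity-check the install with a quick snippet. It mirrors how the script below extracts links: the `a[href]` CSS selector keeps only anchor tags that actually have an `href` attribute. The HTML string here is just a made-up example:

```python
from bs4 import BeautifulSoup

# Two anchors, but only one has an href; select("a[href]") skips the other.
html = '<a href="/about">About</a> <a name="anchor">no href</a>'
links = [el["href"] for el in BeautifulSoup(html, "html.parser").select("a[href]")]
print(links)  # → ['/about']
```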
  3. Copy and paste the following code into a file (I use the name find_broken_links.py in this article).
import requests
import sys
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

searched_links = []
broken_links = []

def getLinksFromHTML(html):
    # Collect the href attribute of every anchor tag that has one.
    return [el["href"] for el in BeautifulSoup(html, features="html.parser").select("a[href]")]

def find_broken_links(domainToSearch, URL, parentURL):
    # Skip links we've already checked, non-HTTP schemes, and image files.
    if (URL in searched_links
            or URL.startswith("mailto:")
            or "javascript:" in URL
            or URL.endswith((".png", ".jpg", ".jpeg"))):
        return
    try:
        requestObj = requests.get(URL)
        searched_links.append(URL)
        if requestObj.status_code == 404:
            broken_links.append("BROKEN: link " + URL + " from " + parentURL)
            print(broken_links[-1])
        else:
            print("NOT BROKEN: link " + URL + " from " + parentURL)
            # Only recurse into pages on the original domain, so external
            # links are checked but not followed.
            if urlparse(URL).netloc == domainToSearch:
                for link in getLinksFromHTML(requestObj.text):
                    find_broken_links(domainToSearch, urljoin(URL, link), URL)
    except Exception as e:
        print("ERROR: " + str(e))
        searched_links.append(URL)  # don't retry a URL that raised an error

find_broken_links(urlparse(sys.argv[1]).netloc, sys.argv[1], "")

print("\n--- DONE! ---\n")
print("The following links were broken:")

for link in broken_links:
    print("\t" + link)
  4. Run it on the command line with a website of your choice.
$ python3 find_broken_links.py https://your_site.com/
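One caveat: the script makes a recursive call for every link it follows, so a very deep site could hit Python's default recursion limit (roughly 1,000 frames). The same traversal can be written with an explicit queue instead. This is a sketch of the idea, not a drop-in replacement; `fetch` and `extract_links` are placeholder callables standing in for the Requests and BeautifulSoup calls in the script above:

```python
from collections import deque

def crawl_iteratively(start_url, fetch, extract_links):
    """Breadth-first traversal with an explicit queue, so crawl depth
    is limited by memory rather than by the call stack."""
    seen = set()
    queue = deque([start_url])
    visited_order = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        visited_order.append(url)
        # In the real script, fetch() would issue the HTTP request and
        # extract_links() would parse anchors out of the response body.
        for link in extract_links(fetch(url)):
            if link not in seen:
                queue.append(link)
    return visited_order
```

You could fold the 404 check and the same-domain test from the script into the loop body; the `seen` set plays the role of the `searched_links` list, with the bonus that membership tests on a set are O(1).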

Conclusion

I hope you found this useful; it certainly helped me find a few broken links on my own site.

This program is CC0 licensed, so it is completely free to use, but it comes with no warranties or guarantees.

Give this post a ❤️ if you liked it!

Thanks for scrolling.

— Gabriel Romualdo, November 10, 2019

Top comments (5)

Anjan Kant

Very nicely explained, and very helpful for finding broken links on a website. Broken links can cause search engines to penalize a site. I have also written about web scraping, including finding broken links, with videos.

Gabe Romualdo • Edited

Thanks, that's really cool! I enjoyed checking out your site and articles :)

— Gabriel

Anjan Kant

Thanks, Fred, for visiting my website and checking out the cool stuff :)

Manoj Barman

extra closing bracket at the end of line 16:

if (not (URL in searched_links)) and (not URL.startswith("mailto:")) and (not ("javascript:" in URL)) and (not URL.endswith(".png")) and (not URL.endswith(".jpg")) and (not URL.endswith(".jpeg"))):
Gabe Romualdo • Edited

So sorry! Didn't notice that, thanks so much for letting me know. I'll fix that in the article now. Thanks again :)

— Gabriel