
A Quick Script to Find Broken Links on Your Site 🎯

Gabriel Romualdo · Originally published at xtrp.io · 3 min read

This post is originally from my personal website, xtrp.io, where you can read about me, check out my projects, and more.

Introduction

It seems like every other click on the internet ends on an "Error 404: Page Not Found" page. "Whoops, the page you're looking for does not exist," "Sorry, the requested URL was not found on this server," "Oops, something went wrong. Page not found." Every internet user has seen pages like these.

I think web developers should spend less time building clever 404 pages and more time eliminating broken links altogether.

The Program

I've built an automated program to find broken links.

*Program demo*

Written in Python 3, it recursively follows links on any given site and checks each one for 404 errors. When the program has finished searching an entire site, it prints out any broken links it found and the pages they appear on, so that developers can fix them.
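Two standard-library functions do the heavy lifting in the recursion: `urljoin` resolves relative links against the page they appear on, and comparing `urlparse(...).netloc` values keeps the crawl on a single domain. A small illustration (the URLs here are just examples):

```python
from urllib.parse import urljoin, urlparse

base = "https://example.com/blog/post"

# Relative links are resolved against the page they appear on
print(urljoin(base, "/about"))    # https://example.com/about
print(urljoin(base, "archive"))   # https://example.com/blog/archive

# Absolute links pass through unchanged
print(urljoin(base, "https://other.com/x"))  # https://other.com/x

# Comparing netloc values keeps the crawl on one domain:
# external links get checked once, but are never recursed into
print(urlparse("https://other.com/x").netloc == urlparse(base).netloc)  # False
```

This is why the program can safely follow `href="/contact"`-style links on your own pages while only testing, not crawling, links that point off-site.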

Note that the program makes a lot of HTTP requests in a relatively short period of time, so be mindful of bandwidth usage and any rate limits the target site may have.
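If request volume is a concern, one simple mitigation (not part of the original script) is to enforce a minimum interval between successive requests. A minimal sketch of such a throttle, which could wrap each `requests.get` call:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for however much of the interval remains
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each `requests.get(URL)` would space the crawl out to at most one request per `min_interval` seconds.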

Usage

  1. Check that you have Python 3 installed.

If the following command does not yield a version number, download Python 3 from python.org.

$ python3 -V
  2. Download the Requests and Beautiful Soup packages (for HTTP requests and HTML parsing) from PyPI with pip.

(Note: I do not maintain these packages and am not associated with them, so download at your own risk)

$ pip3 install requests
$ pip3 install beautifulsoup4
  3. Copy and paste the following code into a file (I use the name find_broken_links.py in this article).
import requests
import sys
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

searched_links = []
broken_links = []

def get_links_from_html(html):
    # Collect the href attribute of every anchor tag on the page
    soup = BeautifulSoup(html, features="html.parser")
    return [el["href"] for el in soup.select("a[href]")]

def find_broken_links(domain_to_search, url, parent_url):
    # Skip URLs we have already checked, non-HTTP schemes, and images
    if (url in searched_links
            or url.startswith("mailto:")
            or "javascript:" in url
            or url.endswith((".png", ".jpg", ".jpeg"))):
        return
    try:
        request_obj = requests.get(url)
        searched_links.append(url)
        if request_obj.status_code == 404:
            broken_links.append("BROKEN: link " + url + " from " + parent_url)
            print(broken_links[-1])
        else:
            print("NOT BROKEN: link " + url + " from " + parent_url)
            # Only recurse into pages on the domain being searched
            if urlparse(url).netloc == domain_to_search:
                for link in get_links_from_html(request_obj.text):
                    find_broken_links(domain_to_search, urljoin(url, link), url)
    except Exception as e:
        print("ERROR: " + str(e))
        searched_links.append(url)  # mark as searched so it isn't retried

find_broken_links(urlparse(sys.argv[1]).netloc, sys.argv[1], "")

print("\n--- DONE! ---\n")
print("The following links were broken:")

for link in broken_links:
    print("\t" + link)
  4. Run it from the command line with a website of your choice.
$ python3 find_broken_links.py https://your_site.com/
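One caveat worth knowing about: because the script calls itself once per link, a very large site could exceed Python's default recursion limit (around 1,000 frames). A sketch of an iterative equivalent using an explicit stack, shown here with a toy in-memory link graph standing in for live HTTP requests:

```python
def crawl_order(start, graph):
    """Iterative depth-first traversal over a link graph, mirroring the
    script's recursion without risking Python's recursion limit."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        url = stack.pop()
        if url in seen:
            continue  # already checked this URL
        seen.add(url)
        order.append(url)
        # In the real script this is where the page would be fetched
        # and its links extracted; here the graph supplies them directly
        stack.extend(graph.get(url, []))
    return order

# Toy site: page "a" links to "b" and "c"; "b" links back to "a" and on to "d"
links = {"a": ["b", "c"], "b": ["a", "d"], "c": []}
print(crawl_order("a", links))
```

The same `searched_links`/`broken_links` bookkeeping from the script would slot into the loop body in place of the `order` list.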

Conclusion

I hope you find this program useful; it certainly helped me find a few broken links on my own site.

This program is CC0 licensed, so it is completely free to use, but it comes with no warranties or guarantees.

Give this post a ❤️ if you liked it!

Thanks for scrolling.

— Gabriel Romualdo, November 10, 2019

Check out my personal website for my blog posts, an extended about me page, and more: xtrp.io

Note: I formerly wrote under my pseudonym, Fred Adams.

Discussion (5)

Anjan Kant

Very nicely explained, and very helpful for finding broken links on a website; search engines penalize sites that have them. I have also written about web scraping and finding broken links, with videos.

Gabriel Romualdo (Author)

Thanks, that's really cool! I enjoyed checking out your site and articles :)

— Gabriel

Anjan Kant

Thanks, Fred, for visiting my website and checking out the cool stuff :)

Manoj Barman

extra closing bracket at the end of line 16:

if (not (URL in searched_links)) and (not URL.startswith("mailto:")) and (not ("javascript:" in URL)) and (not URL.endswith(".png")) and (not URL.endswith(".jpg")) and (not URL.endswith(".jpeg"))):
Gabriel Romualdo (Author)

So sorry! Didn't notice that, thanks so much for letting me know. I'll fix that in the article now. Thanks again :)

— Gabriel