There are few things as annoying as links on websites that lead to a 404 error page. Hence, designers spend a lot of time carefully crafting these pages to make them a bit more pleasing. But for developers, the goal should be to make that design work unnecessary by finding the faulty links in the first place.
Developers automate everything. If we see a repetitive task, we write a program for it. There are of course cases where this is just an excellent way to procrastinate, but for this specific problem there is no reasonable alternative to automation.
The problem in fact consists of two iterations: the iterative process of finding faulty links on the website, and the iterative process of doing this over and over again. The latter is necessary because websites usually link to external sites, which can change their routes without letting anyone know. Also, if a website does not use a strictly hierarchical structure, there are probably cross-links between articles within texts, and these can become very complex very fast.
Of course, you could create a CMS that automatically checks for links to 404 pages, but that seems complex and probably computationally expensive. If you are aware of such an integrated system, it'd be awesome to let me know.
My first, naive idea was to go through the article database and scan it for <a> tags. For each link, you would then search the database and check whether an article exists for that URL. The limitations and problems of this approach are obvious: you need access to your database, which probably means downloading it and setting it up in a way that allows your code to query it.
External links need extra treatment.
Depending on your CMS, it's likely that there are several different URLs for one article. Hence, you need several regular expressions (or whatever technique you use) to handle those different cases. A great example of this problem is Joomla: if you access an article via the menu bar, it has the URL https://joomla-domain.com/index.php/menu-item-x/article-title; otherwise it's https://joomla-domain.com/index.php/article-title. If you want to extract the article title or ID, it can be necessary to differentiate between such URL forms.
You’ll need a custom implementation for every different CMS.
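To illustrate the kind of per-CMS work this requires, here is a minimal sketch of extracting an article title from the two Joomla URL forms mentioned above. The patterns and the `extract_article_title` helper are hypothetical examples, not code from any real CMS integration:

```python
import re

# Hypothetical patterns for the two Joomla URL forms from the example above;
# a real installation may use different routes entirely.
MENU_PATTERN = re.compile(
    r"^https://joomla-domain\.com/index\.php/[^/]+/(?P<title>[^/]+)$"
)
DIRECT_PATTERN = re.compile(
    r"^https://joomla-domain\.com/index\.php/(?P<title>[^/]+)$"
)

def extract_article_title(url):
    """Return the article title from a known URL form, or None."""
    for pattern in (MENU_PATTERN, DIRECT_PATTERN):
        match = pattern.match(url)
        if match:
            return match.group("title")
    return None
```

Every additional URL scheme means another pattern, and every CMS means a new set of patterns, which is exactly why this approach scales poorly.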
Hence, I decided to go another route (pun intended): building a web crawler.
Counter-intuitively, this concept is a bit simpler than crawling through your database. As a bonus, it's a great example of applied graph theory. You choose a starting point, i.e. a start URL; the crawler visits this page, finds all links on it, and remembers them. It then visits all the linked pages and again collects all links on them. This process is repeated until every page has been visited. In the end, you have a graph of your whole website, and whenever a visited page returned an HTTP 404 status code, you can record that URL in a separate list together with the page that links to it. Not exactly rocket science, right?
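The described process is essentially a breadth-first search over the link graph. Here is a minimal sketch using requests and Beautiful Soup; the function name and the simple `startswith` check for internal links are my own assumptions, and this version ignores all the pitfalls discussed below:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url):
    """Breadth-first crawl from start_url; returns a list of
    (broken_url, found_on) pairs for pages answering with HTTP 404."""
    queue = deque([(start_url, None)])  # (url, page the link was found on)
    visited = set()
    broken = []

    while queue:
        url, found_on = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url)
        if response.status_code == 404:
            broken.append((url, found_on))
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        for tag in soup.find_all("a", href=True):
            # urljoin resolves relative hrefs against the current page
            link = urljoin(url, tag["href"])
            if link.startswith(start_url) and link not in visited:
                queue.append((link, url))

    return broken
```

The `visited` set is what keeps the traversal finite; without it, the loop pitfall below would bite immediately.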
As always, there are some potential pitfalls you should consider when creating such an algorithm:
- Links in <a> tags are often root-relative URLs in the format /main-category/another-layer/landing-page. To make your crawler visit such a page, you need to prepend the domain: https://tld.com/main-category/another-layer/landing-page.
- The website probably has links to external sites. You don't want to crawl these, so you need to differentiate between the website of interest and external links; of course, it still makes sense to check that the external pages exist! It can also be useful to ignore certain URL paths, e.g. some calendar modules use PHP scripts that parse URLs to return events (Joomla again :( ). These can generate thousands of URLs your crawler would visit if you don't ignore them.
- Most likely, you'll run into loops: landing page A links to landing page B and vice versa. If your algorithm doesn't remember which pages it has already visited, congratulations: you've built an infinite loop.
- Menu bars are the devil. You'll perform A LOT more operations if you process them on every page, and the same goes for sidebars and footers. Hence, it can be helpful to ignore certain HTML elements by class or id.
- Some servers limit how many requests per minute they allow. Even if yours doesn't, it's good practice to be polite and wait a few seconds between requests.
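The pitfalls above can be handled in the link-extraction step. Here is a sketch of one way to do it with Beautiful Soup and urllib.parse; the selectors, ignored path prefixes, delay value, and function names are all hypothetical placeholders you would adapt to the site at hand:

```python
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Hypothetical settings; adjust for the site you are crawling.
IGNORED_SELECTORS = ["nav", ".sidebar", "#footer"]  # page regions to skip
IGNORED_PATH_PREFIXES = ["/index.php/calendar"]     # e.g. event-generating scripts
CRAWL_DELAY_SECONDS = 1.0                           # politeness between requests

def polite_get(url):
    """Wait before each request so we don't hammer the server."""
    time.sleep(CRAWL_DELAY_SECONDS)
    return requests.get(url)

def extract_links(page_url, html, site_netloc):
    """Return (internal, external) link sets found on one page,
    skipping ignored page regions and ignored URL paths."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in IGNORED_SELECTORS:
        for element in soup.select(selector):
            element.decompose()  # drop menus/sidebars/footers before scanning

    internal, external = set(), set()
    for tag in soup.find_all("a", href=True):
        link = urljoin(page_url, tag["href"])  # resolve relative URLs
        parsed = urlparse(link)
        if any(parsed.path.startswith(p) for p in IGNORED_PATH_PREFIXES):
            continue
        if parsed.netloc == site_netloc:
            internal.add(link)
        else:
            external.add(link)  # check these once, but don't crawl them
    return internal, external
```

In the crawl loop, only the internal set is queued for further crawling, while each external URL is requested once just to confirm it exists.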
You've probably already noticed that many of these problems are solved in graph theory, so this might be a great time to refresh your knowledge of the topic.
If you don't want to build such a program yourself, you're in luck: there are several free online tools that can find links to nonexistent pages.
Even better: I made my own implementation public a few weeks ago. It includes solutions to all the problems mentioned above and has already helped me find several dozen links to nonexistent pages on a client's website. It also exports the results as a CSV file, so you can open them in Excel or whichever application you prefer and look through the data. It's written in Python and easily usable from the command line.
You can find it here: https://github.com/bahe007/tt404
Quite often, I find it hard to get a foot in the door when starting a completely new project, so here are two great Python libraries to make the first step a bit easier for you:
- Beautiful Soup: an amazing HTML parser. Without it, this project would've taken much longer.
- requests: the go-to Python library for HTTP requests. I highly recommend it, having used it successfully in several projects, although some people prefer urllib.
Feel free to leave feedback or link to your own project, I’d appreciate both. Thanks for reading!