Pablo Calvo

Posted on Dec 27, 2019

Broken links checker with Python and Scrapy webcrawler

#testing #python #scrapy #crawler

Python web crawler using Scrapy to check for broken links

Web crawlers are a great tool for auditing a website: they are automated, fast, and efficient.

In this article I will show how to build a super simple, out-of-the-box web crawler using Python and the Scrapy library. It crawls a given site and generates a .csv report of the broken links it finds.

Prerequisites

This article uses Python 3, so make sure it is installed.

Getting started

First, let's create a project folder and set up a Python virtual environment.

$ mkdir web-crawler && cd web-crawler
$ python3 -m venv venv
$ . venv/bin/activate

Then we install our dependencies. In this case we only need Scrapy:

$ pip install scrapy

Now we will create the script that runs the crawler. At this point I suggest using a code editor (VS Code, Sublime Text, PyCharm, Notepad++), but I will create the file from the terminal.

$ touch script.py

Let's open the file and start scripting.

[!NOTE]
This is Python, so be careful with indentation.

Imports

These are the classes that we will need from Scrapy.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field

Model

The spider will yield Scrapy Item objects. Each one holds the data for one row of the report.

class MyItems(Item):
    referer = Field()   # page where the link was found
    response = Field()  # URL that was requested
    status = Field()    # status code received

CrawlSpider Class

Scrapy provides an out-of-the-box web crawler class called CrawlSpider that crawls the given site according to the rules we define.

class MySpider(CrawlSpider):
    name = "test-crawler"
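    # domains whose links are followed; the name is target_domains rather than
    # allowed_domains, so Scrapy's offsite filter does not drop external links
    target_domains = ["dev.to"]
    start_urls = ["https://dev.to/"]  # list of starting URLs for the crawler
    # by default only 2xx responses reach the callback; list the extra codes to handle
    handle_httpstatus_list = [404, 410, 301, 500]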

    # Throttle crawl speed to prevent hitting site too hard
    custom_settings = {
        'CONCURRENT_REQUESTS': 2, # only 2 requests at the same time
        'DOWNLOAD_DELAY': 0.5 # delay between requests
    }

    rules = [
        # internal links: check them and keep following
        Rule(
            LinkExtractor(allow_domains=target_domains, deny=('patterToBeExcluded',), unique=True),
            callback='parse_my_url',  # method that will be called for each response
            follow=True,
        ),
        # external links: check them but don't follow them
        Rule(
            LinkExtractor(deny=('patterToBeExcluded',), unique=True),
            callback='parse_my_url',
            follow=False,
        ),
    ]

The rules above define how links are extracted from each page:

The first rule says: extract all unique links under the target_domains and follow them, but exclude those that contain patterToBeExcluded.
The second rule says: extract all unique links but do not follow them, and exclude those that contain patterToBeExcluded.

Why two rules? We also want to make sure our site does not link to external pages that are broken or return a 404. For example:

www.oursite.com -> www.google.com/this/does/not/exist
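
To make the split between the two rules concrete, here is a minimal sketch in plain Python (no Scrapy) of how a single link ends up being handled. It is a simplified model, not Scrapy's implementation: `classify_link` and the labels it returns are hypothetical names, and it assumes the deny pattern is applied as a regular-expression search on the full URL and that a domain also covers its subdomains.

```python
import re
from urllib.parse import urlparse

TARGET_DOMAINS = ["dev.to"]
DENY_PATTERN = re.compile("patterToBeExcluded")


def classify_link(url):
    """Return how the two rules would treat a link (simplified model)."""
    # Both rules skip anything matching the deny pattern.
    if DENY_PATTERN.search(url):
        return "excluded"
    host = (urlparse(url).hostname or "").lower()
    # First rule: links on a target domain (or a subdomain) are followed.
    for domain in TARGET_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return "check and follow"
    # Second rule: everything else is requested once but not followed.
    return "check only"


for link in (
    "https://dev.to/t/python",
    "https://www.google.com/this/does/not/exist",
    "https://dev.to/patterToBeExcluded/page",
):
    print(link, "->", classify_link(link))
```

The first matching rule wins, which is why the internal rule is listed before the catch-all external one.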

The callback

This is the method that will be called for each link that gets requested. Every item it yields is added to the CSV report, so this is where we can filter for only what we need to report.

    def parse_my_url(self, response):
        # response codes to include in the report (404 = not found)
        report_if = [404]
        if response.status in report_if:  # only matching responses become report rows
            item = MyItems()
            item['referer'] = response.request.headers.get('Referer', None)
            item['status'] = response.status
            item['response'] = response.url
            yield item
        # any other response yields nothing, so no row is written

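Since 404 is the "not found" status code, every time the crawler requests a page and gets a 404 response, a row is added to the CSV report. Modify this list as needed. Any code you add must also appear in handle_httpstatus_list, otherwise the response never reaches the callback.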
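
The filtering in the callback can be tried outside of Scrapy. The sketch below is a stand-in, not part of the script: `to_report_row` is a hypothetical helper that takes plain values instead of a Scrapy response, and it decodes the referer because Scrapy stores header values as bytes.

```python
def to_report_row(status, url, referer=None, report_if=(404,)):
    """Return a report row for a matching status code, otherwise None."""
    if status not in report_if:
        return None
    # Header values come back as bytes, so decode them for readability.
    if isinstance(referer, bytes):
        referer = referer.decode("utf-8", errors="replace")
    return {"referer": referer, "response": url, "status": status}


# A 200 response is ignored, a 404 becomes a row.
print(to_report_row(200, "https://dev.to/"))
print(to_report_row(404, "https://dev.to/missing", b"https://dev.to/"))
# Widening the list also reports "gone" pages.
print(to_report_row(410, "https://dev.to/old", b"https://dev.to/", report_if=(404, 410)))
```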

Running the crawler

Running the crawler is really simple:

$ scrapy runspider script.py -o report-file.csv

Look at the report

While the crawler runs, report-file.csv is populated with the broken links it finds.
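
Once the crawl finishes, the report can be summarized with the standard library. The sketch below assumes the CSV has a header row with the `referer`, `response`, and `status` columns from `MyItems`. `group_by_referer` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def group_by_referer(csv_file):
    """Map each referring page to the list of broken URLs found on it."""
    broken = defaultdict(list)
    for row in csv.DictReader(csv_file):
        broken[row["referer"]].append(row["response"])
    return dict(broken)


# Made-up sample data standing in for report-file.csv
sample = io.StringIO(
    "referer,response,status\n"
    "https://dev.to/,https://dev.to/missing,404\n"
    "https://dev.to/,https://example.com/gone,404\n"
    "https://dev.to/about,https://dev.to/old-page,404\n"
)

for page, links in group_by_referer(sample).items():
    print(page, "->", len(links), "broken link(s)")
```

To run it against the real report, open the file with `open("report-file.csv", newline="")` and pass the file object to `group_by_referer`.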

Please read more about the library on its official site. It is full of useful information on deploying, creating custom spiders, and much more.

scrapy official documentation

full code can be found here

Final notes

This is a really simple way to crawl a site and find broken links. As I see it, there is plenty of room for adding custom features and custom checks.
Please be aware that this is a super simple script, so don't expect best practices, scalability, or anything outside the scope of this post. This tool is meant to be a starting point from which you can build a customized script that suits your needs.

[!NOTE]
Many sites block most crawlers unless explicitly allowed. Check the site's robots.txt file, which typically lives under the base domain, for rules and guidelines, for example https://www.adobe.com/robots.txt
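
As a rough sketch of how such rules can be checked before crawling, Python's standard `urllib.robotparser` module can evaluate a robots.txt file. The rules text, the `example.com` URLs, and the `is_allowed` helper below are made up for illustration. A real check would load the target site's own file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, not copied from any real site
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS_TXT, "test-crawler", "https://example.com/private/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "test-crawler", "https://example.com/blog"))
```

Scrapy can also handle this on its own: adding 'ROBOTSTXT_OBEY': True to custom_settings makes the spider skip URLs that robots.txt disallows.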

Cheers :) and happy crawling. Check out my personal blog: https://pjcalvo.github.com
