How to scale a crawler for 1000 websites

#webdev #architecture #database

I have a few architecture scaling questions.

For example, let's say we are building a web-based SEO crawler.

You have a crawler that takes the seed/root URL and then starts discovering all linked URLs.

Every new URL it finds is added to the list.

The crawler's job is to simply discover all the pages.
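As a rough sketch of the discovery loop described above (not a definitive implementation), the crawler can be written as a breadth-first walk over a list of pending URLs. The names `LinkParser` and `discover` are made up for illustration, and `fetch` is an assumed callable that takes a URL and returns the page HTML, so the example runs without any network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover(seed_url, fetch):
    """Breadth-first discovery of internal URLs reachable from seed_url.

    `fetch` is a hypothetical callable: URL -> HTML string.
    """
    host = urlparse(seed_url).netloc
    seen = {seed_url}              # every URL found so far
    frontier = deque([seed_url])   # URLs found but not yet fetched
    while frontier:
        url = frontier.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Keep only internal links that have not been seen yet.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen
```

In a real crawler `fetch` would do an HTTP request, which is where the waiting time mentioned in this post comes from.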

At the end of the crawl, it has discovered, say, 500 internal URLs in total.

Let's say a single-threaded crawler took 2 hours to discover all 500 pages and do the necessary processing. (There is a limit to how fast a crawler can discover links, because it has to wait for the website to deliver each response, and that often takes a few seconds.)

The above steps are to crawl one root domain/seed URL.

Let us see how we can scale the operation.

Requirement: A web application with a crawler that is designed to crawl say 1000 websites each having around 500 pages and each website is scanned end-to-end every week.

For the above requirement, a single-threaded crawler will simply not work: scanning 1000 websites at 2 hours each takes 2000 hours, so we cannot meet the SLA of "processing each website every week".

Let's say there are 160 hours available in a week (24 x 7 = 168, minus a few for maintenance, leaves roughly 160 hours).

This leads us to 2000 / 160 = 12.5, so we need approximately 13 crawlers.
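The same estimate, written out so the inputs are easy to change. All numbers come from the example above; the variable names are only for illustration.

```python
import math

websites = 1000
hours_per_website = 2          # single-threaded crawl time from the example
usable_hours_per_week = 160    # 168 minus a few hours for maintenance

total_hours = websites * hours_per_website
# Round up: a fraction of a crawler still needs a whole crawler.
crawlers_needed = math.ceil(total_hours / usable_hours_per_week)

print(total_hours, crawlers_needed)  # 2000 13
```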

To keep things simple, let's say each crawler runs on its own VM. We need 13 VMs just for the crawlers.

We need a single big master database that maintains the list of all the websites that need to be crawled.

Can MySQL or PostgreSQL serve this purpose?

How do we make each crawler smart enough that it works only on its own subset of websites or URLs?
Even if two crawlers are working on the same website, each should work on a different subset of URLs, to speed things up and not waste compute scanning the same URL twice.
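One possible way to give each crawler its own subset without any coordination is static partitioning by hash: every crawler hashes the website (or URL) and keeps only the keys that map to its own number. This is only a sketch under the assumption that crawlers are numbered 0 to N-1; `owner_of` and `my_share` are hypothetical names.

```python
import hashlib


def owner_of(key, crawler_count):
    """Map a website or URL to a crawler number in [0, crawler_count)."""
    # A cryptographic hash is used because it is stable across processes
    # and machines (Python's built-in hash() is randomized per process).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % crawler_count


def my_share(keys, crawler_id, crawler_count):
    """Return the subset of keys this crawler is responsible for."""
    return [k for k in keys if owner_of(k, crawler_count) == crawler_id]
```

The trade-off is that the split is fixed: if one crawler gets the slow websites it cannot hand work to an idle one, and changing the number of crawlers reassigns most keys.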

Related to the above: do we need to implement a queue mechanism in the database?
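To make the queue idea concrete, here is a minimal sketch of a queue table where a crawler atomically claims the next pending URL. It uses SQLite from the Python standard library only so that it runs anywhere; the table name `crawl_queue`, its columns, and the function names are assumptions for illustration. On PostgreSQL or MySQL 8 the claim step is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED` instead of the database-wide write lock used here.

```python
import sqlite3


def open_queue(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/COMMIT statements below control the transactions.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawl_queue ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " claimed_by TEXT)"
    )
    return conn


def enqueue(conn, url):
    # The primary key plus OR IGNORE makes re-discovering a URL a no-op.
    conn.execute("INSERT OR IGNORE INTO crawl_queue (url) VALUES (?)", (url,))


def claim_next(conn, crawler_id):
    """Atomically mark one pending URL as in progress and return it."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT url FROM crawl_queue"
            " WHERE status = 'pending' ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE crawl_queue SET status = 'in_progress', claimed_by = ?"
            " WHERE url = ?",
            (crawler_id, row[0]),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With a table like this, two crawlers can work on the same website without scanning the same URL, because each URL is handed out once. A real setup would also need a way to release URLs claimed by a crawler that died, which this sketch leaves out.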

Scaling up: if we scale up the VM so that it can run more than one crawler, is this a better design?
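Since the crawler in this example spends most of its time waiting for responses, one VM could probably run several fetches at once. A minimal sketch with a thread pool from the standard library is below; `crawl_all` and `workers` are hypothetical names, and `fetch` is again an assumed callable from URL to page content.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(urls, fetch, workers=8):
    """Fetch many URLs concurrently and return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs,
        # while up to `workers` fetches are waiting on the network at once.
        return dict(zip(urls, pool.map(fetch, urls)))
```

How many workers one VM can sustain depends on the processing done per page, and running many concurrent requests against a single website may need throttling so the site is not overloaded.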

Scaling horizontally: Azure and AWS provide auto-scaling of clusters. Does this kind of autoscaling still require the crawler to be smart?
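As far as I understand, autoscaling only adds or removes VMs; it does not decide which crawler works on which URL, so some shared list of pending work would still be needed. What autoscaling does need is a number to scale on. A hedged sketch of such a number, reusing the arithmetic from earlier in this post, is below. The function name, its parameters, and the `min_count`/`max_count` limits are all assumptions for illustration.

```python
import math


def desired_crawlers(pending_sites, hours_per_site, hours_left,
                     min_count=1, max_count=50):
    """Estimate how many crawlers are needed to finish the backlog in time."""
    if hours_left <= 0:
        return max_count  # already out of time: run as many as allowed
    needed = math.ceil(pending_sites * hours_per_site / hours_left)
    # Clamp to the configured limits of the cluster.
    return max(min_count, min(max_count, needed))


print(desired_crawlers(1000, 2, 160))  # 13
```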

Any other considerations?