Web crawling and scraping is largely about taming chaos, and much of that chaos is outside your control. Websites change their code and their navigation, put up restrictions, and may even block your IP if you are not using rotating proxies like Proxies API. Network speeds go up and down. These are the realities of web scraping. Here are some questions to ask your web scraping agency; notice how comfortable they are with the challenges the questions pose. Are they vague and struggling for clear answers, or are they ready for you and even impressed that you asked? So here they are:
a. Ask them about the kinds of projects they have done, and have them show you examples of actual websites they crawled, the challenges those sites posed, and how they overcame them. Ask whether they have an Upwork profile. If they are on Upwork, the most important thing I look for is a combination of high ratings (4.5 and above), enough experience (total hours worked), and a near 100% completion rate. You never want an agency abandoning your work midway. Read the customer reviews to get an idea of the vibe the team has with its customers.
b. The combination of skill sets: web crawling is not just about coding; there will be a fair bit of manual work and data wrangling. Check what skills they have on their team.
c. Ideally, they list web crawling or web scraping as part of their leading service description.
d. Ask them what measures they will take to overcome restrictions like CAPTCHAs, rate limits, and code changes.
e. Go for the jugular: ask them how they will deal with a situation where the website blocks their IP.
f. Ask them what checks and measures they have built in to know whether the web crawler is working as it should.
g. What framework do they use to build on top of and why?
You don’t want them reinventing the wheel from scratch. It will not go well. Frameworks like Scrapy offer abstractions to control concurrency, support for multiple spiders, CSS selectors for scraping, automatic link extractors, cookie support, and more.
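To make questions d and e concrete, here is a rough idea of what a good answer about IP blocks might look like in code. This is only a minimal sketch, not any agency's or framework's actual implementation. It assumes that HTTP 403 and 429 are the block signals (real sites may behave differently), and the names fetch_with_rotation, BLOCK_STATUSES, and the injected fetch function are all hypothetical, made up for illustration.

```python
import itertools
import time
from typing import Callable, Optional, Sequence, Tuple

# Assumption: these status codes mean "blocked or throttled". Some sites
# instead return a CAPTCHA page with a 200, which this sketch would miss.
BLOCK_STATUSES = {403, 429}

# Hypothetical fetcher type: takes (url, proxy), returns (status_code, body).
Fetcher = Callable[[str, str], Tuple[int, str]]


def fetch_with_rotation(
    url: str,
    proxies: Sequence[str],
    fetch: Fetcher,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Fetch a URL, switching proxy and backing off after each block.

    Returns the body of the first 200 response, or None if every attempt
    was blocked. Any other status raises, so failures are not hidden.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")

    pool = itertools.cycle(proxies)
    for attempt in range(max_attempts):
        proxy = next(pool)
        status, body = fetch(url, proxy)
        if status == 200:
            return body
        if status not in BLOCK_STATUSES:
            raise RuntimeError(f"unexpected status {status} via {proxy}")
        # Blocked: wait 1x, 2x, 4x ... the base delay before the next proxy.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

In a real crawler, fetch would wrap whatever HTTP client the team uses. A framework like Scrapy already ships retry and auto-throttling features, which is one more reason to ask about the framework before anyone writes this kind of loop by hand.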
The author is the founder of Proxies API, a proxy rotation API service.