Mohan Ganesan

The Differences Between a Newbie and a Pro-Level Web Scraper Coder

  1. Checks & Balances

Newbie

A newbie uses no checks and balances. His reasoning: if it works on my machine, it should work in production.

Pro

a. A pro has carefully looked at every breaking point imaginable in the code and checks whether any of them can bring the whole operation down. For example: the web server blocks his IP, rate limits kick in, the target site changes its code, the internet connection goes down, the disk fills up, etc.

b. A pro builds in alerts and puts essential information into them so he can debug failures easily.
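
As a rough illustration of the two points above, here is a minimal sketch using only the Python standard library. The names `fetch_with_checks` and `send_alert`, the 500 MB threshold, and the retry counts are all assumptions made for this example; `fetch` stands in for whatever function actually downloads a page.

```python
import logging
import shutil
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

MIN_FREE_BYTES = 500 * 1024 * 1024  # assumed threshold: 500 MB


def enough_disk_space(path=".", min_free=MIN_FREE_BYTES):
    """Return True if the volume holding `path` has at least `min_free` bytes free."""
    return shutil.disk_usage(path).free >= min_free


def send_alert(message, **context):
    """Hypothetical alert hook: it only logs here; swap in email, chat, etc."""
    details = ", ".join(f"{key}={value!r}" for key, value in sorted(context.items()))
    log.error("ALERT: %s (%s)", message, details)


def fetch_with_checks(url, fetch, retries=3, backoff=1.0,
                      min_free=MIN_FREE_BYTES, sleep=time.sleep):
    """Call fetch(url) with retries and exponential backoff; alert on final failure."""
    if not enough_disk_space(min_free=min_free):
        send_alert("low disk space", url=url)
        return None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # network down, blocked IP, rate limit, timeout...
            log.warning("attempt %d/%d failed for %s: %s", attempt, retries, url, exc)
            if attempt < retries:
                sleep(backoff * 2 ** (attempt - 1))
    send_alert("giving up after retries", url=url, retries=retries)
    return None
```

The alert carries the URL and retry count so the failure can be traced without digging through logs first.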

  2. Code & Architecture

Newbie

A newbie spends too much time on code and too little time on the architecture.

Pro

A pro spends a lot of time researching and experimenting with different frameworks and libraries like Scrapy, Puppeteer, Selenium, Beautiful Soup, etc. to see which suits his current needs best.

  3. Framework

Newbie

A newbie doesn’t use a framework because it is not written in his ‘favorite’ programming language, and he writes code without following any best practices.

Pro

A pro knows that a framework might have a small learning curve, but that this cost is soon heavily offset by all the abstractions the framework provides.

  4. Being Like a Bot

Newbie

A newbie doesn’t work on ‘pretending to be human’ enough.

Pro

A pro makes the scraper behave more like a human than an actual human would, handling the target site with the care of someone looking after small children or babies.
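
One common way to look less like a bot is to send browser-like headers and to pause for a slightly random time between requests. The sketch below assumes exactly that; the header strings, the function names, and the default delay values are illustrative choices, not something a particular site requires.

```python
import random

# Example User-Agent strings; the exact values are illustrative assumptions.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def browser_like_headers(rng=random):
    """Build request headers that resemble what a desktop browser sends."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


def human_delay(base=2.0, jitter=1.5, rng=random):
    """Seconds to wait before the next request: base plus random jitter."""
    return base + rng.uniform(0, jitter)
```

A scraper loop would call `time.sleep(human_delay())` between requests and pass `browser_like_headers()` to whatever HTTP client it uses.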

  5. Choosing Proxy

Newbie

A newbie uses free proxy servers available on the internet.

Pro

A pro doesn’t want a free lunch. If the project is important, he knows it is not practical to build a rotating proxy infrastructure himself. He will opt for a service like Proxies API.

  6. Expect the Unexpected

Newbie

A newbie doesn’t factor in that the target website might change its code.

Pro

A pro expects it. He puts a timestamp on every website he has written a scraper for. He writes a Hello World test case for each one, which should pass no matter what; if it doesn’t, he sends himself an alert to change his code.
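
A Hello World test of this kind could be as simple as checking that the page still contains the elements the scraper relies on. The sketch below is one possible shape, using the standard library's `html.parser`; the `SCRAPERS` registry, the site name, the date, and the class names are all hypothetical.

```python
from datetime import date
from html.parser import HTMLParser

# Hypothetical registry: when each scraper was last verified, and which
# CSS class names it depends on.
SCRAPERS = {
    "example-shop": {
        "verified_on": date(2024, 1, 15),
        "required_classes": {"product-title", "price"},
    },
}


class ClassCollector(HTMLParser):
    """Collect every class name that appears on any tag in the page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(html, required):
    """Return the required class names that are absent from the page."""
    collector = ClassCollector()
    collector.feed(html)
    return set(required) - collector.classes


def smoke_test(site, html):
    """Return (ok, missing) for the given site's downloaded HTML."""
    missing = missing_classes(html, SCRAPERS[site]["required_classes"])
    return (not missing, missing)
```

If `smoke_test` returns `ok == False`, the `missing` set says which selectors broke, and `verified_on` says how long the scraper had been working before the site changed.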

  7. Scraping Process

Newbie

A newbie uses RegEx or some such rudimentary way to scrape data.

Pro

A pro uses CSS selectors or XPath to retrieve data predictably. These allow for many changes to be made in the target HTML, and the code will probably still work.
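
To make the difference concrete, here is a small sketch that compares a tightly written regex with an XPath query when the markup changes. It uses `xml.etree.ElementTree` from the standard library, which only handles well-formed markup and a limited XPath subset; for real-world HTML you would more likely use a parser such as lxml or Beautiful Soup. The two snippets and the regex are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# The same price before and after a hypothetical site redesign.
OLD = '<div><span class="price">19.99</span></div>'
NEW = '<div><p><span id="p1" class="price">19.99</span></p></div>'

# A regex tied to the exact surrounding markup.
PRICE_RE = re.compile(r'<div><span class="price">([^<]+)</span>')


def price_by_regex(markup):
    """Extract the price with the regex, or None if the pattern no longer matches."""
    match = PRICE_RE.search(markup)
    return match.group(1) if match else None


def price_by_xpath(markup):
    """Extract the price from any <span class="price">, wherever it is nested."""
    root = ET.fromstring(markup)
    node = root.find(".//span[@class='price']")
    return node.text if node is not None else None
```

Both functions work on `OLD`. On `NEW`, the regex returns `None` because of the extra wrapper and attribute, while the XPath query still finds the price.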

  8. Normalization Of Data

Newbie

A newbie doesn’t normalize the data he downloads.

Pro

Downloading from multiple websites means duplicate data, the same data in multiple formats, etc. A pro puts in normalization code to make sure the end data looks as uniform as possible.
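
Normalization code along these lines might look like the sketch below. The record fields (`name`, `price`, `date`), the accepted date formats, and the assumption that `.` is the decimal separator are all choices made for this example.

```python
from datetime import datetime
from decimal import Decimal

# Assumed input date formats; extend as new sources are added.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def normalize_price(raw):
    """Strip currency symbols and thousands separators; assumes '.' is the decimal point."""
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    return Decimal(cleaned)


def normalize_date(raw):
    """Convert any of the known date formats to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def normalize_record(record):
    """Return a copy of the record with uniform name, price, and date fields."""
    return {
        "name": " ".join(record["name"].split()).lower(),
        "price": normalize_price(record["price"]),
        "date": normalize_date(record["date"]),
    }


def deduplicate(records):
    """Normalize all records and keep the first occurrence of each unique one."""
    seen = set()
    unique = []
    for rec in map(normalize_record, records):
        key = (rec["name"], rec["price"], rec["date"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Normalizing before deduplicating matters: two sites may list the same item as `$1,299.00` on `2024-03-05` and `1299.00` on `March 5, 2024`, and only the normalized forms compare equal.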

  9. Crawling Speed

Newbie

A newbie doesn’t work on scaling the spiders: using concurrency, running multiple spiders with Scrapyd, or using rotating proxies to make more requests per second.

Pro

A pro is always looking to make the crawling process faster and more reliable.
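
Outside of a framework, basic concurrency can be sketched with a thread pool from the standard library. In the example below, `crawl_concurrently` and its `max_workers` default are assumptions for illustration, and `fetch` is a placeholder for the function that downloads one URL. A framework like Scrapy handles this kind of scheduling for you.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_concurrently(urls, fetch, max_workers=8):
    """Fetch URLs in parallel threads; return ({url: result}, {url: exception})."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported per URL.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Collecting errors separately, instead of letting one failed URL abort the run, covers the "more reliable" half of the goal as well as the "faster" half.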

  10. IP Blockage

Newbie

A newbie doesn’t believe that he will ever get IP blocked until he is.

Pro

A pro treats an IP block as almost guaranteed, especially on big sites like Amazon, Reddit, Yelp, etc. He puts in measures like rotating proxies (for example, Proxies API) to help mitigate this risk.
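
To show the basic idea behind proxy rotation, here is a minimal round-robin sketch. It is not how Proxies API or any other managed service works internally; the `ProxyRotator` class and the proxy addresses in the usage note are hypothetical.

```python
import itertools


class ProxyRotator:
    """Round-robin over a proxy pool, skipping proxies marked as bad."""

    def __init__(self, proxies):
        self._pool = list(proxies)
        self._bad = set()
        self._cycle = itertools.cycle(self._pool)

    def mark_bad(self, proxy):
        """Record that a proxy was blocked or failed, so it is skipped from now on."""
        self._bad.add(proxy)

    def next_proxy(self):
        """Return the next healthy proxy, or raise if none are left."""
        for _ in range(len(self._pool)):
            proxy = next(self._cycle)
            if proxy not in self._bad:
                return proxy
        raise RuntimeError("no healthy proxies left")
```

A scraper would create something like `ProxyRotator(["http://proxy1.example:8080", "http://proxy2.example:8080"])`, call `next_proxy()` before each request, and call `mark_bad()` when a request through that proxy gets blocked.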

The author is the founder of Proxies API, a proxy rotation API service.
