DEV Community

A guide to Web scraping without getting blocked

Pierre on July 31, 2019

Web scraping or crawling is the practice of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you w...

Andy Piper • Jul 31 '19 • Edited

I am going to come at this from a different angle, working for an API platform: PLEASE USE THE API (much more specifically, where there IS an API)

Yes, I get that you can (arguably...) work around limits by doing headless scraping, but this is often against the platform's terms of service. You will also get less metadata, and less useful metadata; your IP address may be blocked; and the UI typically has no contract with you as a developer that can help to ensure that data access is maintained when a site layout changes, leading to more work for you.

Be cool. Work with us, as API providers. We want you to use our APIs and not have you run around us to grab data in unnecessary ways.

Most of all though, enjoy coding!

seckelberny67 • Aug 6 '19 • Edited

I remember when I started scraping, I used to search for free proxies and tried to save money on this important step of scraping. But really quickly I realized that you cannot trust free proxies because they are so unreliable and unstable. I totally agree with you that people should not use free proxies. All of your listed proxy providers are really solid names at an affordable price. Personally, I prefer Smartproxy for their balance of price and quality. All in all, a really solid article, Pierre!

Saul Costa • Jul 31 '19

ScrapingNinja looks really cool! Just curious ― are there any legal issues with providing a service like that?

Pierre ScrapingBee • Jul 31 '19

Thank you very much.

As long as we ensure that people don't use our service for DDoS purposes, we've been told we should be fine 🤞

Saul Costa • Jul 31 '19

Hmm, interesting. Many sites list scraping, crawling, and / or non-human access as violations of their Terms of Service.

For example, see section 8.2(b) of LinkedIn's User Agreement (I list LinkedIn because I know they're a common target for scraping).

Pierre ScrapingBee • Jul 31 '19

Yes, you are right, LinkedIn is well known for this.

Well, I am not a lawyer so I'd rather say nothing than talk nonsense.

We plan to do a blog post about this, well-sourced and more detailed than my answers :)

Saul Costa • Jul 31 '19

Well, I am not a lawyer so I'd rather say nothing than talk nonsense.

Smart :)

We plan to do a blog post about this, well-sourced and more detailed than my answers

Cool, looking forward to reading that!

Andy Piper • Jul 31 '19

What about complying with terms of service for the websites and API platforms your service may scrape?

theincognitotech • Dec 20 '19

Great article and nice proxy recommendation. I even have a review about one proxy provider you mentioned.

Crawlbase • Mar 7 '24

This blog is your go-to guide for web scraping essentials. It breaks down why scraping is important and how to avoid detection by websites, offering tips like using Headless Chrome, proxies, and CAPTCHA solving. Plus, it mentions ScrapingBee, a super user-friendly option for hassle-free scraping tasks.
Do explore and check out Crawlbase as well.

Andrés Gutiérrez • Aug 1 '19

Hi Pierre, I really like your post, great job!

I am actually working on a web scraper using Python with the requests library.
I am collecting information about job titles in my country to find a pattern.

MichaelSwerston • Jun 19 '20

Proxy rotation is very useful for this and many other tasks, especially automation, I think. As for proxy services, it's odd that you didn't mention Oxylabs or Geosurf; these seem to be some of the more scraping-focused proxy providers.
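
For readers who have not seen proxy rotation in practice, here is a minimal sketch of the idea in Python: cycle through a pool of proxies so that consecutive requests leave from different addresses. The proxy URLs and the `next_proxies` helper are hypothetical placeholders, not endpoints of any provider named in this thread; the returned dictionary has the shape that the requests library accepts in its `proxies` argument.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with the ones your provider gives you.
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]

# cycle() yields the proxies in order and starts over after the last one.
_proxy_pool = cycle(PROXIES)


def next_proxies():
    """Return a proxies mapping for the next proxy in the rotation."""
    proxy = next(_proxy_pool)
    # Route both plain and TLS traffic through the same proxy.
    return {"http": proxy, "https": proxy}


# Usage with requests (not executed here, since it needs network access):
#   import requests
#   response = requests.get(url, proxies=next_proxies(), timeout=10)
```

Round-robin is the simplest policy; a real scraper would probably also drop proxies that start failing or getting blocked.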

Uli Troyo • Jul 31 '19

I guess I'm here too early? I saw this and immediately came to the comments because there's no way that image isn't going to shock the arachnophobes among us 🤣

Michael Tharrington • Jul 31 '19

Haha!! I expected much more of an uproar as well... I mean that image is absolutely terrifying.

assender • Jun 17 '20

One of the best articles on this topic that I've ever read! Very great!
And I must add that proxy services are very important here, as is choosing the right proxy provider: one that meets all your needs and helps mask your scraping tool from detection.

Kurt Bauer • Jul 31 '19

This was an awesome read! Seems like the link was broken where it said, "just go over here, it's a webpage that simply displays"? Maybe I'm wrong, but I was very interested and wanted to take a look 👀

Pierre ScrapingBee • Jul 31 '19

Oopsie, thanks for the catch, it is now fixed!

Kurt Bauer • Jul 31 '19

Thanks!

Laurel Kline • Jul 31 '19

Do you have any opinion on the Scrapy API? I've gotten some good results with them:
scrapy.org/

Pierre ScrapingBee • Jul 31 '19

Scrapy is AWESOME!

It allows you to do so much with so few lines of code.

I think of Scrapy as the requests package on big steroids.

The fact that you can handle parallelization, throttling, data filtering, and data loading in one place is very good. I am especially fond of the autothrottle feature.

However, Scrapy needs some extensions to work well with proxies and headless browsers.
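
To make the autothrottle point concrete, here is a minimal sketch of the relevant Scrapy settings, written as a plain Python dictionary. The setting names are Scrapy's own; the values and the `AUTOTHROTTLE_SETTINGS` name are assumptions chosen for illustration, not recommendations.

```python
# Illustrative values only; tune them for the site you are crawling.
AUTOTHROTTLE_SETTINGS = {
    # Turn on the AutoThrottle extension.
    "AUTOTHROTTLE_ENABLED": True,
    # Initial download delay, in seconds.
    "AUTOTHROTTLE_START_DELAY": 1.0,
    # Upper bound on the delay when the server responds slowly.
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    # Average number of parallel requests to aim for per remote server.
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
    # Hard cap on concurrent requests to a single domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
    # AutoThrottle never lowers the delay below this value.
    "DOWNLOAD_DELAY": 0.5,
}
```

These keys can go in a project's settings.py as module-level variables, or be assigned to a spider's `custom_settings` attribute to apply to that spider only.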