DEV Community

A guide to Web scraping without getting blocked

Pierre on July 31, 2019

Web scraping or crawling is the fact of fetching data from a third party website by downloading and parsing the HTML code to extract the data you w...
 
Andy Piper • Edited

I am going to come at this from a different angle, working for an API platform: PLEASE USE THE API (much more specifically, where there IS an API)

Yes, I get that you can (arguably...) work around limits by doing headless scraping, but this is often against the platform's terms of service, and it has practical costs: you will get far less metadata, and what you do get is less useful; your IP address may be blocked; and the UI typically has no contract with you as a developer, so nothing ensures that your data access survives a change in site layout, which means more work for you.

Be cool. Work with us, as API providers. We want you to use our APIs and not have you run around us to grab data in unnecessary ways.

Most of all though, enjoy coding!

seckelberny67 • Edited

I remember when I started scraping, I used to search for free proxies and tried to save money on this important step. But I quickly realized that you cannot trust free proxies because they are so unreliable and unstable. I totally agree with you that people should not use free proxies. All of the proxy providers you listed are solid names at an affordable price. Personally, I prefer Smartproxy for its balance of price and quality. All in all, a really solid article, Pierre!

Saul Costa

ScrapingNinja looks really cool! Just curious: are there any legal issues with providing a service like that?

Pierre

Thank you very much.

As long as we ensure that people don't use our service for DDoS purposes, we've been told we should be fine 🤞

Saul Costa

Hmm, interesting. Many sites list scraping, crawling, and / or non-human access as violations of their Terms of Service.

For example, see section 8.2(b) of LinkedIn's User Agreement (I list LinkedIn because I know they're a common target for scraping).

Pierre

Yes, you are right, LinkedIn is well known for this.

Well, I am not a lawyer, so I'd rather say nothing than talk nonsense.

We plan to do a blog post about this, well-sourced and more detailed than my answers :)

Saul Costa

"Well, I am not a lawyer, so I'd rather say nothing than talk nonsense."

Smart :)

"We plan to do a blog post about this, well-sourced and more detailed than my answers"

Cool, looking forward to reading that!

Andy Piper

What about complying with terms of service for the websites and API platforms your service may scrape?

Crawlbase

This blog is your go-to guide for web scraping essentials. It breaks down why scraping is important and how to avoid detection by websites, offering tips like using Headless Chrome, proxies, and CAPTCHA solving. Plus, it mentions ScrapingBee, a super user-friendly option for hassle-free scraping tasks.
Do explore and check out Crawlbase as well.

theincognitotech

Great article and nice proxy recommendation. I even have a review about one proxy provider you mentioned.

Uli Troyo

I guess I'm here too early? I saw this and immediately came to the comments because there's no way that image isn't going to shock the arachnophobes among us 🤣

Michael Tharrington

Haha!! I expected much more of an uproar as well... I mean that image is absolutely terrifying.

Andrés Gutiérrez

Hi Pierre, I really like your post, great job!

I am actually working on a web scraper using Python with the requests library.
I am gathering information about job titles in my country and trying to find a pattern.
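
For readers curious what that kind of job-title scraper could look like, here is a minimal sketch of the parsing and counting step. It is not the commenter's code: the `<h2 class="job-title">` markup, the `JobTitleParser` class, and the `count_job_titles` helper are all assumptions made for illustration, and the HTML is passed in as a string so the example runs without network access.

```python
from collections import Counter
from html.parser import HTMLParser


class JobTitleParser(HTMLParser):
    """Collects the text of every <h2 class="job-title"> element.

    The tag and class name are hypothetical; adapt them to the real page.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or hold several class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "job-title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def count_job_titles(html):
    """Return a Counter of normalized (stripped, lowercased) job titles."""
    parser = JobTitleParser()
    parser.feed(html)
    parser.close()
    return Counter(t.strip().lower() for t in parser.titles if t.strip())
```

With the requests library, the HTML would typically come from something like `requests.get(url, timeout=10).text`, and `count_job_titles(html).most_common(10)` would then list the most frequent titles, which is one simple way to start looking for a pattern.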

MichaelSwerston

Proxy rotation is very useful for this and many other tasks, especially automation, I think. Speaking of proxy services, it's odd that you didn't mention Oxylabs or Geosurf; they seem to be some of the more web-scraping-centered proxy providers.
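
As a rough illustration of proxy rotation, the sketch below cycles through a pool of proxies in round-robin order. The `ProxyRotator` class and the proxy URLs are hypothetical placeholders, not any provider's API; the returned dict simply has the shape that the requests library accepts in its `proxies` argument.

```python
from itertools import cycle


class ProxyRotator:
    """Hands out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxy_urls):
        proxy_urls = list(proxy_urls)
        if not proxy_urls:
            raise ValueError("at least one proxy URL is required")
        self._pool = cycle(proxy_urls)

    def next_proxies(self):
        # Use the same proxy for both schemes of a single request.
        url = next(self._pool)
        return {"http": url, "https": url}


# Placeholder addresses; a real pool would come from your proxy provider.
rotator = ProxyRotator([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
])
```

Each outgoing request would then pick up the next proxy, for example `requests.get(url, proxies=rotator.next_proxies(), timeout=10)`. Plain round-robin is the simplest policy; a real scraper would probably also want to drop proxies that keep failing.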

assender

One of the best articles on this topic that I've ever read! Very great!
And I must add that proxy services are essential here, and so is choosing the right proxy provider: one that meets all your needs and helps mask your scraping tool from detection.

Kurt Bauer

This was an awesome read! Seems like the link was broken where it said, "just go over here, it's a webpage that simply displays"? Maybe I'm wrong, but I was very interested and wanted to take a look 👀

Pierre

Oopsie, thanks for the catch, it is now fixed!

Kurt Bauer

Thanks!

Laurel Kline

Do you have any opinion on the Scrapy API? I've gotten some good results with it:
scrapy.org/

Pierre

Scrapy is AWESOME!

It allows you to do so much with so few lines of code.

I think of Scrapy as the requests package on steroids.

The fact that you can handle parallelization, throttling, data filtering, and data loading in one place is very good. I am especially fond of the autothrottle feature.

However, Scrapy needs extensions to work well with proxies and headless browsers.
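
For anyone who has not used the autothrottle feature, it is switched on through a project's `settings.py`. The fragment below is only a sketch: the setting names are Scrapy's own, but every value is an assumption chosen for illustration and should be tuned per target site.

```python
# settings.py (fragment) -- illustrative values, not recommendations

# Let Scrapy adapt the crawl delay to the latency it observes.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False              # set True to log throttling stats per response

# AutoThrottle never goes below this delay.
DOWNLOAD_DELAY = 1.0

# Hard cap on parallel requests to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Respect robots.txt rules.
ROBOTSTXT_OBEY = True
```

A low target concurrency with a non-zero `DOWNLOAD_DELAY` is the conservative end of the range; raising either one trades politeness for speed.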

Caleb David

Awesome piece

alex24409331

Awesome article, thank you. Also, I have found another site scraper service. Maybe it will help someone too. e-scraper.com/useful-articles/down...

Albert Ko

scraperapi is a tool that takes some of the headache out of this, and you only pay for successes. scraperapi.com/?fp_ref=albert-ko83