
Web Scraping - A Complete Guide

Serpdog on January 21, 2023

Introduction: Web Scraping, also known as data extraction or data scraping, is the process of extracting or collecting data from websites...
Alicia Sykes • Edited

Worth noting that nearly every mainstream site has an API.
And fetching data from APIs is so much easier, faster, more reliable, more scalable, and just plain safer.

Scraping has a lot of issues:

  • Web scraping usually goes against a site's Terms of Service
  • Any small change in a website's markup will break your scraper
  • Captchas, rate limits, and other anti-bot measures will prevent it from working at scale
  • You're loading far more data than you need (scripts, images, fonts, styles, etc.)
  • Many modern websites insert content dynamically on hydration, giving you temperamental results
  • Your IP will very quickly get blacklisted for web scraping
  • A lot of content requires authentication to access via the browser (giving your scraper any credentials would be a terrible idea)
  • The data you're fetching won't be structured in any meaningful way, adding to the processing work you need to do
  • You need to write separate scrapers for different websites
  • It's not fair to the website owners. You're unnecessarily bombarding their site with bot traffic
  • Scraping is also pretty morally dubious: someone has put time and effort into creating and maintaining a data set, which you're just trying to lift for free (and in the clumsiest possible way)

The simple solution to all those issues is just to fetch data from an API instead.
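To illustrate the contrast: an API hands you structured data with stable field names, while a scraper has to dig the same values out of ever-changing markup. A minimal sketch (the response body and its field names below are made up for illustration):

```python
import json

# A typical API response: already structured, with stable field names.
# (This body and its fields are hypothetical, for illustration only.)
api_body = '{"products": [{"name": "Widget", "price": 9.99}]}'

products = json.loads(api_body)["products"]
first = products[0]
print(first["name"], first["price"])  # Widget 9.99
```

No HTML parsing, no selectors to maintain, and a breaking change to the site's layout doesn't affect you.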

Serpdog

People will start to use official APIs when they are scalable and available at cheap prices.
Also, take the example of Google's official API: you can't use it for commercial purposes. We have an $80 billion SEO industry; how will it survive if there is no scraper available in the market?

Richard Greenwood

Alicia - well said! Before you scrape, ask if the site provides alternative means to access the data. As a publisher of public information for local governments and non-profits, I'm coming from the other side of the scraper equation. Just because you can do it doesn't mean that it's the right thing to do or the right way to do it. Ask first.

cubiclesocial

It depends on the government entity, but many supply raw database dumps right on their website, either full dumps performed regularly (e.g. nightly or weekly) or a full dump plus incrementals. When they exist, you can retrieve those raw dumps (with a scraper) and reconstruct your own database from them. Scraping the content from individual pages is unnecessary and wasteful in those instances. Before asking, poke around a bit on the website to see if you can find a data dump that is updated regularly. It saves a little back-and-forth.

Most government entities in the U.S. are required by public-records laws to publish their information. Doing a nightly data dump and shoving it onto a webpage is the easiest way to comply with those laws. If they don't publish a raw dump online, you can ask, but some entities, especially police/sheriff departments, U.S. Border Patrol, and the courts, are extremely obnoxious and will only respond begrudgingly under a court order. This is not how any government employee or entity should ever behave. Some entities respond to FOIA requests for data with PDFs (basically a digital middle finger to the requestor) instead of the requested format (e.g. CSV). In general, you can't get in legal trouble for scraping publicly available content on government websites, as the law itself generally protects you from that. However, there might be some politician with a chip on their shoulder who makes it their mission in life to make your life miserable because they think they can, but that's a separate issue.
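The "reconstruct your own database" step can be sketched in a few lines of Python. The CSV content and column names below are placeholders standing in for a downloaded dump file:

```python
import csv
import io
import sqlite3

# Stand-in for the text of a downloaded nightly CSV dump
# (column names and values here are hypothetical).
dump = "parcel_id,owner\n101,Smith\n102,Jones\n"

# Parse the dump into dicts keyed by column name.
rows = list(csv.DictReader(io.StringIO(dump)))

# Rebuild a local, queryable database from the dump.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (parcel_id TEXT, owner TEXT)")
conn.executemany("INSERT INTO parcels VALUES (:parcel_id, :owner)", rows)
count = conn.execute("SELECT COUNT(*) FROM parcels").fetchone()[0]
print(count)  # 2
```

One fetch per dump file instead of one fetch per page, and the result is structured from the start.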

cubiclesocial • Edited

You can't get in legal trouble for scraping public websites where you don't have a clickwrap agreement for the Terms of Service. (Your IP might get banned by an admin or automated system for abusing web server resources, but that's a completely different issue.) Terms of Service documents are not legally binding if the data being scraped is publicly available. That is, an account or clickwrap approval was not required to obtain the data. Data is generally more like a recipe. Recipes are not protected by copyright law. Most website operators allow googlebot to scrape their content so that the website can be indexed in search results, but googlebot, in this case, violates any Terms of Service document that claims to disallow web scraping. It's a good thing then that googlebot ignores ToS documents.

As an example, imagine if I were allowed to say, "You now owe me $1,000 for the privilege of reading this message on dev.to. Go to any CubicleSoft repo on GitHub and use the Donate link to pay up." Not only is that ridiculous, but you didn't agree to it and the allowance of such would result in the collapse of society. No sane court of law would entertain such an argument.

Similarly, a Terms of Service document on a website is legally non-enforceable unless the user actually agrees to it, either by creating an account where signup includes such language, or because every entry point to valuable data requires agreement prior to accessing the data, thereby forming a contract between the user and the data provider. Contract law then takes effect. It's a subtle but important distinction. Everyone who has gotten in legal trouble to date for scraping content had formally agreed to the provider's ToS.

Whether or not digital clickwrap agreements like ToSes and software EULAs should actually have force of law under contract law is still a matter of ample debate and very little case law.

Note that I'm not a lawyer and this isn't legal advice, but any assumption that simply accessing a website results in automatically agreeing to that website's ToS is an obviously invalid argument. Like any contract, unless you sign the agreement, it has no effect.

Nicholas Jeon • Edited

I agree that using APIs instead of a scraper is better.

sysmaya

I've made like 10 spiders...
To take photos, news, eBay articles, etc...
But I use grandpa, old man Visual Basic 6 :(

Serpdog

Oh no, brother, why are you using such an outdated language??

cubiclesocial

Application stability over time is a perfectly valid reason to use an "outdated language."

Applications written in Python, JavaScript, PHP, or other "modern" languages that are constantly evolving, and getting more and more bloated in the process, are more likely to break when upgrading the language itself.

On the other hand, VB6 is unlikely to ever change the language specification. As long as the runtimes continue to function, code written in VB6 is unlikely to ever break. Windows also ships with the VB6 runtimes (I think for vbscript support), which means there's nothing special to install binary-wise.

Should everyone run out and start writing VB6 code? Probably not. However, we shouldn't judge those who choose one programming/scripting language over another. They obviously have their reasons for their language of choice.

sysmaya

The problem with using VB6 for scraping is the torture of having to use the Internet Control OCX, something like Internet Explorer 5.
But with patience and tricks it can work.
The finished program runs like a charm... First it scans for valid hyperlinks, then it puts them into a database (obviously Access 97), and it downloads photos and content in an acceptable way.
Believe me when I tell you that I have reviewed more than 1,000,000 pages.

Zubair Ahmed Khushk

Hi, can you help me in making a scraper? I am facing a few problems.

Serpdog

Tell me the problem.

sysmaya

A scraper in VB6?? Of course.

Gamerseo

Obtaining data from websites is very important and can lead to many valuable insights.

Cauane Andrade • Edited

Great post! You may also find it useful to check out my post on the differences between Web Crawling and Web Scraping for a more in-depth understanding of the topic.

Serpdog

Thanks for reading the article!!! Will surely check your post.

Comment deleted
Serpdog

Thanks for reading the post Abhay!!

cubiclesocial

PHP does just fine for nearly all web scraping tasks. Shameless self-promotion:

github.com/cubiclesoft/ultimate-we...

Comment deleted
Serpdog

Thanks for reading the article!

sysmaya

Some time ago I tried to build a Google spider, looking for images... Bad idea.
It works fine for the first (say 100) searches, but after that, Google catches the flood of queries and shuts off the tap.

Serpdog

Google is the smartest at catching bots. That is why you need a large pool of residential IPs to scrape it.
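The rotation idea can be sketched without any real network calls. Everything below is hypothetical: the proxy URLs are placeholders, and `fake_request` stands in for an actual HTTP client that would route a request through the given proxy:

```python
import itertools

# Hypothetical pool of residential proxy URLs (placeholders, not real endpoints).
PROXY_POOL = itertools.cycle([
    "http://res-proxy-1.example:8080",
    "http://res-proxy-2.example:8080",
    "http://res-proxy-3.example:8080",
])

def fetch_with_rotation(url, do_request, max_attempts=3):
    """Send each attempt through the next proxy; retry when rate-limited (HTTP 429)."""
    for _ in range(max_attempts):
        proxy = next(PROXY_POOL)
        status, body = do_request(url, proxy)
        if status != 429:
            return body
    raise RuntimeError("all attempts rate-limited")

# Fake transport for demonstration: the first proxy is rate-limited, the rest succeed.
def fake_request(url, proxy):
    if proxy.startswith("http://res-proxy-1"):
        return 429, ""
    return 200, "<html>result page</html>"

result = fetch_with_rotation("https://www.google.com/search?q=test", fake_request)
```

In a real scraper, `do_request` would be backed by an HTTP library with proxy support, and you would also add delays and jitter between attempts rather than hammering the target.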

Samuel Marien

Great article, I learned a lot. Thanks to the author :)

Serpdog

Thanks for reading the article!!