The Web Scraping Continuum

Over the next few weeks, I'm going to be evolving simple code that scrapes a website, turning it into a more robust set of tasks performing an ETL (Extract, Transform, Load) process orchestrated by the widely used open source software Airflow. All of the code can be found [here](https://github.com/CincyBC/bootstrap-to-airflow) on GitHub.

First, a couple of notes on web scraping.

What is Web Scraping?

Web scraping refers to pulling information off the web, whether by a bot or by someone paid to copy and paste the daily stock price from a website into an Excel spreadsheet. In an ideal world, all the data you need would be available from an API (Application Programming Interface), where you send a request and get a response with the data you need, but setting up and maintaining APIs for general consumption costs money.
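For contrast, here's a minimal sketch of what that ideal API case looks like with the `requests` library. The endpoint and parameters are made up purely for illustration:

```python
import requests

# Hypothetical endpoint and parameters for illustration; real APIs will differ.
url = "https://api.example.com/v1/stocks/daily"
params = {"symbol": "ABC", "date": "2023-01-31"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx errors

data = response.json()  # structured data, no HTML parsing needed
print(data)
```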

There are a couple things to consider when thinking about web scraping:

  1. Check your local jurisdiction for any legal restrictions. In the United States, you're not, for example, allowed to scrape copyrighted Getty Images pictures from the web and sell t-shirts with the image on them without the consent of Getty Images.
  2. Check the website for any restrictions or rate limitations (sometimes in a robots.txt file). This isn't a problem when you are manually pulling data from a website, but if you have a script hitting a site several times a second, you could be hurting its performance, which could get your IP address blocked. A quick sketch of checking robots.txt and throttling requests follows this list.
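Here's that sketch, using Python's built-in `urllib.robotparser` and a simple delay between requests; the domain and path are placeholders:

```python
import time
from urllib import robotparser

import requests

# Fetch and parse robots.txt before scraping (example.com is a placeholder).
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/prices"
if rp.can_fetch("*", url):
    response = requests.get(url, timeout=10)
    # Be polite: wait between requests instead of hammering the server.
    time.sleep(1)
else:
    print("robots.txt disallows fetching this URL")
```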

The Web Scraping Continuum

[Image: the web scraping continuum, from hidden APIs to people manually scraping]

Now that we have that out of the way, it's time to talk about the web scraping continuum. On one extreme, you have people paid to pull information regularly from websites. I worked at a company where someone was paid to write a report each day on what happened in the market, and that person would manually go to a list of websites and pull information every day. One step away from this is web scraping with a package like Selenium, which does the same thing as this person, except the button pushing happens in an automated browser.
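A minimal Selenium sketch of that idea looks something like this (it assumes Chrome and the `selenium` package are installed; the URL and CSS selector are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/market-report")
    # Grab the same element a person would read and copy by hand.
    price = driver.find_element(By.CSS_SELECTOR, ".daily-price").text
    print(price)
finally:
    driver.quit()
```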

On the other end, especially with SPAs (Single Page Applications), you can skip loading web pages and use the API the pages themselves call as if it were a fully accessible API. How? [Here is a good writeup.](https://blog.devgenius.io/scrape-data-without-selenium-by-exposing-hidden-apis-946b23850d47) Sometimes this "hidden API" doesn't return nice JSON but rather a block of HTML you need to parse. That's inconvenient, but those blocks of HTML usually stay fairly consistent over time, so it's still fairly stable.
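As a rough sketch of the happy path, calling such an endpoint directly with `requests` might look like this; the URL here is a placeholder you would find in the browser's network tab:

```python
import requests

# Hypothetical "hidden" JSON endpoint spotted in the browser's network tab.
url = "https://example.com/api/fund/performance"
headers = {"User-Agent": "Mozilla/5.0"}  # some endpoints reject default clients

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

payload = response.json()  # skip the page entirely and use the data directly
print(payload)
```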

Right in the middle of the continuum is good ol' BeautifulSoup, where you make a request for an entire webpage (not just a block returned from a hidden API) and parse it all for the bits that you want. This is actually where our series will start: with building a web scraper in 10 lines of code with BeautifulSoup. Join me as we turn those simple 10 lines of code into a robust Airflow DAG with custom operators.
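As a taste of that middle ground, a generic BeautifulSoup fetch-and-parse looks roughly like this (placeholder URL and selector, not the scraper we'll build in the series):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector for illustration only.
url = "https://example.com/fund"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
# Pull out just the piece of the page we care about.
element = soup.find("span", class_="share-price")
if element is not None:
    print(element.get_text(strip=True))
```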
