shrey vijayvargiya

Posted on Apr 22, 2023

Introduction to Web Scraping

Under the Hood

The story begins a few days back, when I decided to start writing more technical articles because I want to learn backend development to an advanced level.

The strategy is simple: I will build something new in Node JS and keep sharing the basics of what I’ve learned over here.

How am I learning? The answer is simple: ChatGPT and the actual documentation.

The first topic I’ve picked is web scraping because I want to build something around it and it has a lot of applications that I’ll cover later in this story.

Introduction

As the name suggests, web scraping means scraping content off the web: grabbing the DOM elements of a page and reading or extracting the content they hold.

In simple words, web scraping is the process of extracting data from web pages.

Why do we need web scraping?

How we will do it is easy to understand if we reverse engineer the problem.

Suppose you want to extract data or content from a web page.

How would you proceed as a front-end developer?

It’s simple: each web page has DOM elements, and those DOM elements hold the data or content. We first read the DOM elements and then read their corresponding data to extract the web page content.

This is how web scraping is done.

If you want to be a web scraper, you need to know what the DOM and DOM elements are; basic knowledge is enough.

How do we do web scraping?

The logic of web scraping is explained above. In a script, it comes down to these steps:

Fetch the HTML file from the URL using the axios npm package in Node JS
Iterate over the DOM elements of the HTML file using jQuery
Read the content of the DOM elements
Extract and save the content

Of course, Node JS has tonnes of other libraries like Cheerio and Puppeteer that make the iteration, and web scraping in general, a cakewalk.
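The four steps above can be sketched without installing anything. This is only a rough illustration, not the axios and jQuery setup itself: it swaps axios for the `fetch` that ships with Node 18 and later, and swaps DOM iteration for a small regular expression. The function names `extractLinks` and `scrapeLinks` and the sample HTML are made up for this example, and a real page deserves a proper parser such as Cheerio.

```javascript
// Steps 2 to 4: read the href of every <a> tag found in an HTML string.
// A regex is enough for this small sketch; prefer a real parser on real pages.
function extractLinks(html) {
  const anchorRegex = /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>/gi;
  return Array.from(html.matchAll(anchorRegex), match => match[1]);
}

// Step 1: download the page, then hand the HTML over to extractLinks.
async function scrapeLinks(url) {
  const response = await fetch(url); // global fetch needs Node 18 or later
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractLinks(await response.text());
}

// extractLinks works on any HTML string, so it can be tried without a network call.
const sampleHtml =
  '<ul><li><a href="/a.mid">Song A</a></li>' +
  '<li><a class="x" href="/b.mid">Song B</a></li></ul>';
console.log(extractLinks(sampleHtml)); // logs the two hrefs, /a.mid and /b.mid
```

Calling `scrapeLinks('https://www.vgmusic.com/music/console/nintendo/nes')` inside an async function would run the same extraction against a live page.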

My friend recommended another package called Playwright. Below is a basic code sample of web scraping with it.

const playwright = require('playwright');

// Page to scrape
const vgmUrl = 'https://www.vgmusic.com/music/console/nintendo/nes';

(async () => {
  // Launch Chromium (headless by default) and open a new page
  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();

  await page.goto(vgmUrl);

  // Collect every <a> element, keep the .mid links whose text has no "(",
  // and return their URLs
  const links = await page.$$eval('a', elements => elements.filter(element => {
    const parensRegex = /^((?!\().)*$/;
    return element.href.includes('.mid') && parensRegex.test(element.textContent);
  }).map(element => element.href));

  links.forEach(link => console.log(link));

  await browser.close();
})();

Here is a basic explanation:

We first define the URL we want to scrape
Launch the browser
Open a new page in the browser and go to the URL
Filter the links using a regular expression, keeping only those whose href contains .mid and whose text has no opening parenthesis
Read and print all those links
Close the browser

This is a simple and basic way to scrape the links of a website that have certain attributes, such as an href that includes the .mid string.
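The filter step can be tried on its own, without a browser. The sketch below reuses the regular expression from the Playwright sample; the helper name `isPlainMidiLink` and the sample link objects are made up for illustration and stand in for real `<a>` elements.

```javascript
// Same pattern as in the Playwright sample: it matches only strings that
// contain no "(" (and no line breaks, since "." does not match them).
const parensRegex = /^((?!\().)*$/;

// Hypothetical helper: decides whether a single link should be kept.
function isPlainMidiLink(link) {
  return link.href.includes('.mid') && parensRegex.test(link.textContent);
}

// Made-up link data standing in for real <a> elements.
const sampleLinks = [
  { href: 'https://example.com/song.mid', textContent: 'Song' },
  { href: 'https://example.com/song2.mid', textContent: 'Song (remix)' },
  { href: 'https://example.com/about.html', textContent: 'About' },
];

// Only the first link survives: the second has "(" in its text,
// and the third is not a .mid file.
const kept = sampleLinks.filter(isPlainMidiLink).map(link => link.href);
console.log(kept);
```

Keeping the condition in a named function makes it easy to add further checks later without touching the browser code.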

You can, of course, add more conditions to filter the data and read other DOM elements like inputs, checkboxes, headings, captions and so on.

Packages for Web scraping

JavaScript

Cheerio
Playwright
Puppeteer

Python

Beautiful Soup
Selenium
Scrapy

Edge cases — Websites with dynamic class names

There are cases where content-based websites try to block web scraping by giving each DOM element a dynamic class name.

In that case, we are left with the options mentioned below.

1. Look for alternative attributes: If the class name is changing dynamically, look for alternative attributes like id, name, data-*, etc., that remain constant over time. You can use these attributes to identify the elements you want to scrape.
2. Use CSS selectors: CSS selectors can select HTML elements based on their attributes. You can use attribute selectors such as [class*="..."] (contains), [class^="..."] (starts with) and [class$="..."] (ends with) to select elements by the stable part of their dynamic class names.
3. Use regular expressions: If the dynamic class names follow a particular pattern, you can use regular expressions to match the pattern and select the elements.
4. Use a web scraping tool: There are many web scraping tools like Beautiful Soup, Scrapy, Selenium, etc., that can handle dynamic class names. These tools have built-in functions that can select elements based on their attributes, match patterns using regular expressions, etc.
5. Monitor the website: If the dynamic class names are changing frequently, you can monitor the website to identify the patterns in the changes. This will help you to update your scraping code accordingly.
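The first three options above can be sketched in a few lines. Everything named below is an assumption for illustration: the class names, the `price__` prefix, the `--active` suffix and the `data-testid` attribute are made up, and a real site will have its own naming.

```javascript
// Hypothetical class names as a build tool might generate them:
// a stable prefix followed by a random-looking suffix.
const classNames = ['price__a8f3k', 'title__zz91q', 'price__b72lm', 'footer'];

// Regular expression approach: match the stable prefix, ignore the suffix.
const pricePattern = /^price__[a-z0-9]+$/;
const priceClasses = classNames.filter(name => pricePattern.test(name));
console.log(priceClasses);

// CSS approach: attribute selectors usable with page.$$eval or
// document.querySelectorAll. Note that ^= and $= compare against the whole
// class attribute value, not against each individual class in it.
const selectors = {
  stableAttribute: '[data-testid="price"]', // option 1: an attribute that stays constant
  contains: '[class*="price__"]',           // option 2: class attribute contains the prefix
  startsWith: '[class^="price__"]',         // option 2: class attribute starts with the prefix
  endsWith: '[class$="--active"]',          // option 2: class attribute ends with the suffix
};
console.log(selectors.contains);
```

When an element carries several classes, the contains form is usually the safer choice, because the dynamic class may not be the first or last one in the attribute.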

Application

Data scraping has a lot of applications, as listed below.

Price comparison among E-commerce platforms
Data indexing
SEO analysis
Data extraction … and a lot more

Data is the new oil, so wherever we get it from, it is worth money.

Earn from Data Scraping

Sell your scraped data
Make tools to scrape websites and sell those tools
Become a web scraping developer, freelance or full-time

There are multiple ways, but one good way is to sell scraped data. Make sure you understand the client’s requirements regarding the data first, and then sell it.

You can also make AI-powered scraping tools that scrape the platforms a user chooses, or even sell your scraping algorithm or codebase.

Conclusion

Data scraping is useful in many ways: as an individual developer, you can scrape, filter and sell data, and as a company, you can analyse SEO and compare competitor prices.

Node JS Roadmap Template — Comprehensive guide from beginner to advanced level for Node JS

Until next time, have a good day, people

Shrey

iHateReading
