Unleash the Data Goldmine with Web Scraping

Kritebh Lagan Bibhakar

Data is the new oil, and to extract that oil from the digital desert we can use web scraping, which means extracting data from web pages. There are various ways to achieve this, with many popular packages such as Selenium, Scrapy, and Beautiful Soup.

But in this episode, we will be scraping data with Puppeteer, a well-known package in the Node.js ecosystem that provides a user-friendly API to interact with web pages.

Another package famous for scraping is Cheerio, but I am not going to use it here because it has some limitations: it uses jQuery syntax to select elements, which not every developer finds friendly, and it does not work on CSR (Client-Side Rendering) websites.

Crafting Our Extractor's Den

Let's start by preparing our coding playground. Create a folder with a fun name like "scrape-book". Next, let's initiate our project: simply type npm init -y. Now, to summon Puppeteer, enter npm i puppeteer. Easy-peasy!

Now, let's get ready to code by creating a file named "scrape.js". Voilà!

mkdir scrape-book
cd scrape-book
npm init -y
npm i puppeteer
touch scrape.js

For this episode I am going to scrape the book collection at https://books.toscrape.com.

Now let's import Puppeteer.

Next, I will make a function scrapeBooks which takes a link parameter.

Inside this function, let's initialize the browser and open a new page with the given link.

const puppeteer = require("puppeteer");
const scrapeBooks = async (link) => {
    const browser = await puppeteer.launch({headless: false}); //pass "new" for headless
    const page = await browser.newPage();
    await page.setViewport({ width: 1366, height: 768, deviceScaleFactor: 1 });
    await page.goto(link);

    await browser.close();
};

A quick tip: there is a headless option that can be passed to puppeteer.launch. With headless: false the browser GUI opens, which is handy while developing locally so we can see what's happening; once things work, you can switch to the "new" headless mode so no window opens at all.
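Here is what both modes look like (a minimal sketch; whether "new" or true is the right value depends on your Puppeteer version):

// visible browser window - handy while developing locally
const browser = await puppeteer.launch({ headless: false });

// no GUI - better for servers and CI
// const browser = await puppeteer.launch({ headless: "new" });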

And now the timeout challenge! By default, Puppeteer gives page.goto 30 seconds to open the page. If it's taking too long, we can speed things up by skipping images and other unnecessary resources, and if the page is heavy with scripts, we can skip those too.

//enable interception first, otherwise request.abort() will throw
await page.setRequestInterception(true);

//abort the requests for image, CSS and JavaScript files
page.on("request", (request) => {
    if (
        request.resourceType() === "image" ||
        request.resourceType() === "stylesheet" ||
        request.resourceType() === "script"
    ) {
        request.abort();
    } else {
        request.continue();
    }
});

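If a slow page still hits the 30-second limit, you can also simply raise it. A small sketch (60000 ms is an illustrative value, not from the original code):

//raise the default navigation timeout for all page.goto calls
page.setDefaultNavigationTimeout(60000);

//or override it for a single navigation
await page.goto(link, { timeout: 60000 });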

Okay, let's continue.

Now we can execute some queries on the page. For that I will make another function called script. This function will be executed on the page we just opened, so think of it as a frontend environment and don't try to include any Node.js modules in it.

They won't work, because the function runs in a different environment (the browser, not Node), which is why I keep this logic in a separate function.
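A tiny illustration of that boundary (the title log is just an example, not part of the scraper):

//the callback runs inside the page, so the DOM is available
const pageTitle = await page.evaluate(() => document.title);
console.log(pageTitle);

//this would fail: Node modules don't exist inside the page
//await page.evaluate(() => require("fs")); // ReferenceError: require is not defined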

Now grab all the product elements on the page and run a loop to extract all the information we need.

I am going to store all of them in an array and then return it.

const script = () => {
    let allProductsNode = document.querySelectorAll(".product_pod");

    let allProducts = [];

    allProductsNode.forEach((p) => {
        let title = p.querySelector("h3 a").innerHTML;
        let price = p.querySelector(".price_color").innerHTML;
        let availability = p
            .querySelector(".instock.availability")
            .textContent.trim();
        let imageUrl = p.querySelector("img").src;
        let bookLink =
            `https://books.toscrape.com/catalogue/` +
            p.querySelector(".image_container a").getAttribute("href");
        allProducts.push({
            title,
            price,
            availability,
            imageUrl,
            bookLink,
        });
    });
    return allProducts;
};

We also have to update our scrapeBooks function: we wait until the product elements are available on the page and then run that script using page.evaluate.

    //wait until the product elements are available
    await page.waitForSelector(".product_pod");

    //execute the script to grab all the data
    let allBooks = await page.evaluate(script);

The website lists 1,000 books, 20 books per page, so we can loop through all 50 pages to get the details of every book.

Let's write another function which will extract data from each page.

I will simply run a loop from 1 to 50 and call the scrapeBooks function, and I will also need a global array to store all the results.

After grabbing all the data, we can store it in a file; this is easy with the fs module. Just import the promises API of fs and write a new file, books.json.

 await fs.writeFile("./books.json", JSON.stringify(data), "utf-8");

Our book object will look like this:
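Here is a representative example (a sketch with illustrative values; the real titles, prices, and URLs come from the page, and the imageUrl path is shortened here):

{
    "title": "A Light in the Attic",
    "price": "£51.77",
    "availability": "In stock",
    "imageUrl": "https://books.toscrape.com/media/cache/.../a-light-in-the-attic.jpg",
    "bookLink": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}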

Instead of writing it into a file, you can also store this data directly in a database.

Note: sometimes it is not a good idea to keep all that data in a single array, since it can occupy a lot of memory. You can instead write to a file on each iteration and afterwards merge that file into a single array, as sketched below.
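A minimal sketch of that approach, assuming a books.jsonl file with one JSON object per line (the file name is my choice, not from the original code):

//inside scrapeBooks, right after page.evaluate:
//append this page's books instead of growing the global array
await fs.appendFile(
    "./books.jsonl",
    allBooks.map((book) => JSON.stringify(book)).join("\n") + "\n",
    "utf-8"
);

Afterwards you can read books.jsonl line by line and wrap the parsed objects in a single array.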

Now the whole code looks like this:

const puppeteer = require("puppeteer");
const fs = require("fs/promises");
let data = [];
const script = () => {
    let allProductsNode = document.querySelectorAll(".product_pod");

    let allProducts = [];

    allProductsNode.forEach((p) => {
        let title = p.querySelector("h3 a").innerHTML;
        let price = p.querySelector(".price_color").innerHTML;
        let availability = p
            .querySelector(".instock.availability")
            .textContent.trim();
        let imageUrl = p.querySelector("img").src;
        let bookLink =
            `https://books.toscrape.com/catalogue/` +
            p.querySelector(".image_container a").getAttribute("href");
        allProducts.push({
            title,
            price,
            availability,
            imageUrl,
            bookLink,
        });
    });
    return allProducts;
};

const scrapeBooks = async (link) => {
    const browser = await puppeteer.launch({ headless: false }); //pass "new" for headless
    const page = await browser.newPage();
    await page.setViewport({ width: 1600, height: 1000, deviceScaleFactor: 1 });
    await page.goto(link);

    //wait until products class is available
    await page.waitForSelector(".product_pod");

    //execute script to grab all the data
    let allBooks = await page.evaluate(script);

    data = [...data, ...allBooks];

    await browser.close();
};

async function scrap() {
    try {
        for (let i = 1; i <= 50; i++) {
            await scrapeBooks(`https://books.toscrape.com/catalogue/page-${i}.html`);
        }

        await fs.writeFile("./books.json", JSON.stringify(data), "utf-8");
    } catch (error) {
        console.log(error);
    }
}

scrap();

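Run it with:

node scrape.js

Once all 50 pages have been scraped, you should find a books.json file with the details of all 1,000 books next to the script.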

If you have any suggestions or questions, please let me know in the comment section.
