Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

#webdev #javascript #web #programming

Imagine building an e-commerce platform where we can easily fetch product data in real-time from major stores like eBay, Amazon, and Flipkart. Sure, there’s Shopify and similar services, but let's be honest—it can feel a bit cumbersome to buy a subscription just for a project. So, I thought, why not scrape these sites and store the products directly in our database? It would be an efficient and cost-effective way to get products for our e-commerce projects.

What is Web Scraping?

Web scraping involves extracting data from websites by parsing the HTML of web pages to read and collect content. It often involves automating a browser or sending HTTP requests to the site, and then analyzing the HTML structure to retrieve specific pieces of information like text, links, or images.Puppeteer is one library used to scrape the websites.

🟢What is Puppeteer?

Puppeteer is a Node.js library.It provides a high-level API for controlling headless Chrome or Chromium browsers.Headless Chrome is a version of chrome that runs everything without an UI(perfect for running things in the background).

We can automate various tasks using puppeteer,such as:

Web Scraping: Extracting content from websites involves interacting with the page's HTML and JavaScript. We typically retrieve the content by targeting the CSS selectors.
PDF Generation: Converting web pages into PDFs programmatically is ideal when you want to directly generate a PDF from a web page, rather than taking a screenshot and then converting the screenshot to a PDF. (P.S. Apologies if you already have workarounds for this).
Automated Testing: Running tests on web pages by simulating user actions like clicking buttons, filling out forms, and taking screenshots. This eliminates the tedious process of manually going through long forms to ensure everything is in place.

🌟How to get started with puppetter?

Firstly we have to install the library,go ahead and do this.
Using npm:

npm i puppeteer # Downloads compatible Chrome during installation.
npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.

Using yarn:

yarn add puppeteer // Downloads compatible Chrome during installation.
yarn add puppeteer-core // Alternatively, install as a library, without downloading Chrome.

Using pnpm:

pnpm add puppeteer # Downloads compatible Chrome during installation.
pnpm add puppeteer-core # Alternatively, install as a library, without downloading Chrome.

🛠 Example to demonstrate the use of puppeteer

Here is an example of how to scrape a website. (P.S. I used this code to retrieve products from the Myntra website for my e-commerce project.)

const puppeteer = require("puppeteer");
const CategorySchema = require("./models/Category");

// Define the scrape function as a named async function
const scrape = async () => {
    // Launch a new browser instance
    const browser = await puppeteer.launch({ headless: false });

    // Open a new page
    const page = await browser.newPage();

    // Navigate to the target URL and wait until the DOM is fully loaded
    await page.goto('https://www.myntra.com/mens-sport-wear?rawQuery=mens%20sport%20wear', { waitUntil: 'domcontentloaded' });

    // Wait for additional time to ensure all content is loaded
    await new Promise((resolve) => setTimeout(resolve, 25000));

    // Extract product details from the page
    const items = await page.evaluate(() => {
        // Select all product elements
        const elements = document.querySelectorAll('.product-base');
        const elementsArray = Array.from(elements);

        // Map each element to an object with the desired properties
        const results = elementsArray.map((element) => {
            const image = element.querySelector(".product-imageSliderContainer img")?.getAttribute("src");
            return {
                image: image ?? null,
                brand: element.querySelector(".product-brand")?.textContent,
                title: element.querySelector(".product-product")?.textContent,
                discountPrice: element.querySelector(".product-price .product-discountedPrice")?.textContent,
                actualPrice: element.querySelector(".product-price .product-strike")?.textContent,
                discountPercentage: element.querySelector(".product-price .product-discountPercentage")?.textContent?.split(' ')[0]?.slice(1, -1),
                total: 20, // Placeholder value, adjust as needed
                available: 10, // Placeholder value, adjust as needed
                ratings: Math.round((Math.random() * 5) * 10) / 10 // Random rating for demonstration
            };
        });

        return results; // Return the list of product details
    });

    // Close the browser
    await browser.close();

    // Prepare the data for saving
    const data = {
        category: "mens-sport-wear",
        subcategory: "Mens",
        list: items
    };

    // Create a new Category document and save it to the database
    // Since we want to store product information in our e-commerce store, we use a schema and save it to the database.
    // If you don't need to save the data, you can omit this step.
    const category = new CategorySchema(data);
    console.log(category);
    await category.save();

    // Return the scraped items
    return items;
};

// Export the scrape function as the default export
module.exports = scrape;

🌄Explanation:

In this code, we are using Puppeteer to scrape product data from a website. After extracting the details, we create a schema (CategorySchema) to structure and save this data into our database. This step is particularly useful if we want to integrate the scraped products into our e-commerce store. If storing the data in a database is not required, you can omit the schema-related code.
Before scraping, it's important to understand the HTML structure of the page and identify which CSS selectors contain the content you want to extract.
In my case, I used the relevant CSS selectors identified on the Myntra website to extract the content I was targeting.