oteri

Posted on Jul 6, 2023 • Edited on Jul 8, 2023 • Originally published at hackernoon.com

Web scraping using a headless browser in NodeJS

#programming #node #webscraping #codenewbie

Web scraping extracts unstructured data from a website and converts it into a structured, more readable format such as JSON or CSV. Many organizations publish guidelines on which endpoints may be scraped.
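
As a small illustration of what "structured format" means here, the sketch below turns a list of records into JSON and CSV text. The records and the helper names (escapeCsvField, toCsv) are made up for this example and are not part of any library.

```javascript
// Hypothetical records, standing in for data extracted from a page.
const records = [
  { title: "Wireless Mouse", price: 19.99 },
  { title: 'Monitor, 27"', price: 189.5 },
];

// Quote a CSV field when it contains a comma, quote, or line break.
function escapeCsvField(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build CSV text using the keys of the first record as the header row.
function toCsv(rows) {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);
  const lines = rows.map((row) =>
    headers.map((key) => escapeCsvField(row[key])).join(",")
  );
  return [headers.join(","), ...lines].join("\n");
}

const asJson = JSON.stringify(records, null, 2);
const asCsv = toCsv(records);

console.log(asJson);
console.log(asCsv);
```

The CSV helper assumes every record has the same keys as the first one, which is usually good enough for a quick export.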

When you scrape a website for personal use, manually changing the code every time the site updates its defenses can be stressful, as most big-brand websites want people to refrain from scraping their public data. Common obstacles include CAPTCHAs, user-agent blocking (rules for allowed and disallowed endpoints), IP blocking, and the need to set up a proxy network.
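
Sites commonly list their allowed and disallowed endpoints in a robots.txt file. The sketch below is a deliberately simplified check, assuming plain prefix matching against the rules for "User-agent: *"; it ignores Allow rules, wildcards, and grouped user agents, and the function names and sample file are hypothetical.

```javascript
// Collect the Disallow paths that apply to every crawler ("User-agent: *").
function disallowedPaths(robotsTxt) {
  const paths = [];
  let appliesToAll = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const separator = line.indexOf(":");
    if (separator === -1) continue;
    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    if (field === "user-agent") appliesToAll = value === "*";
    else if (field === "disallow" && appliesToAll && value) paths.push(value);
  }
  return paths;
}

// A path counts as allowed when no Disallow rule is a prefix of it.
function isPathAllowed(robotsTxt, path) {
  return !disallowedPaths(robotsTxt).some((rule) => path.startsWith(rule));
}

// Hypothetical robots.txt content.
const sampleRobots = [
  "User-agent: *",
  "Disallow: /checkout",
  "Disallow: /account/ # private pages",
  "",
  "User-agent: ExampleBot",
  "Disallow: /",
].join("\n");

console.log(isPathAllowed(sampleRobots, "/products/42"));
console.log(isPathAllowed(sampleRobots, "/checkout/cart"));
```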

A practical use case of web scraping is notifying users of price changes for an item on sites like Amazon, eBay, etc.
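
To make the price-change use case concrete, here is a minimal sketch of the comparison step only. The function name and the prices are hypothetical, and it assumes the previous price is a non-zero number; fetching the prices and sending the notification are left out.

```javascript
// Compare the last stored price with a freshly scraped one.
// Returns null when the price is unchanged, otherwise a change summary.
function detectPriceChange(previousPrice, currentPrice) {
  if (previousPrice === currentPrice) return null;
  const difference = currentPrice - previousPrice;
  return {
    direction: difference < 0 ? "drop" : "increase",
    previousPrice,
    currentPrice,
    percentChange: Number(((difference / previousPrice) * 100).toFixed(2)),
  };
}

// Hypothetical prices; in practice these would come from storage and a scrape.
const change = detectPriceChange(200, 150);
if (change) {
  console.log(
    `Price ${change.direction}: ${change.previousPrice} -> ${change.currentPrice} (${change.percentChange}%)`
  );
}
```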

In this article, you will learn how to use Bright Data’s Scraping Browser, whose built-in unlocking capabilities let you scrape websites at scale without being blocked.

Sandbox

Test and run the complete code in this CodeSandbox.

Prerequisites

You need the following to complete this tutorial:

Basic knowledge of JavaScript
Node.js installed on your local machine, which is required to install the dependencies
A code editor such as VS Code

What is Bright Data?

Bright Data is a data collection and aggregation service with a massive network of IP addresses and proxies for scraping information from websites. This gives it the resources to avoid detection by the anti-bot systems that companies use to prevent data scraping.

In essence, Bright Data does the heavy lifting in the background through its proxy infrastructure and the datasets available on the platform, which removes the worry of being blocked while accessing website data.

What is a headless browser?

A headless browser is a browser that operates without a graphical user interface (GUI). Modern web browsers like Google Chrome, Safari, Brave, and Mozilla Firefox all have a graphical interface for interactivity and displaying visual content. A headless browser instead runs in the background and is controlled by scripts that developers write or by commands from the command line interface (CLI).

A headless browser is well suited to web scraping because it simulates user behavior, which lets you extract data from public websites, including pages that render their content with JavaScript.

Headless browsers are suitable for the following:

Automated testing
Web scraping

Benefits of Puppeteer

Puppeteer is a Node.js library for controlling Chrome or Chromium, which it runs in headless mode by default. The following are some of the benefits of using Puppeteer in web scraping:

Crawling single-page applications (SPAs)
Automated testing of website code
Clicking on page elements
Downloading data
Generating screenshots and PDFs of pages
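
Clicking on an element and reading what appears afterward could look like the sketch below. It assumes `page` is a Puppeteer Page object; the helper name and both selectors are hypothetical, while waitForSelector, click, and $eval are Puppeteer Page methods.

```javascript
// Click an element, then wait for a result element and return its text.
// `page` is expected to be a Puppeteer Page; both selectors are hypothetical.
async function clickAndRead(page, buttonSelector, resultSelector) {
  await page.waitForSelector(buttonSelector); // wait until the button exists
  await page.click(buttonSelector);
  await page.waitForSelector(resultSelector); // wait for content rendered by the click
  return page.$eval(resultSelector, (element) => element.textContent.trim());
}

// Example call, once a page is open:
// const text = await clickAndRead(page, "#load-more", ".results");
```

Waiting for the result selector matters on single-page applications, where content is rendered after the click rather than on page load.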

Installation

Create a new folder for this app, and run the command below inside it to initialize a Node.js project.

    npm init -y

The command initializes the project and creates a package.json file that holds the project information and, as you install packages, the list of dependencies. The -y flag accepts all the defaults during initialization.

With the initialization complete, let’s install nodemon as a development dependency with this command:

    npm install -D nodemon

Nodemon is a tool that automatically restarts the Node application when a file changes.

In the package.json, update the scripts object with this code:

package.json

    {
      ...
      "scripts": {
        "start": "node index.js",
        "start:dev": "nodemon index.js"
      },
      ...
    }

Next, create a file named index.js in the project's root directory. It is the entry point for the script.

The other package to install is puppeteer-core, a version of the automation library that does not download a browser. It is the right choice when connecting to a remote browser.

    npm install puppeteer-core

Building with Bright Data’s Scraping Browser

Create an account on Bright Data to access all its services. For this project, the focus is on the Scraping Browser functionality.

On your admin dashboard, click Proxies and Scraping Infra.

Scroll to the bottom of the page and select the Scraping Browser. After that, click the Get started button from the proxy products listed.

On opening the tool, give the proxy a name and click the Add Proxy button. When prompted about creating a new zone, select Yes.

The next screen displays the host, username, and password for the new zone.

Now, click on the button </> Check out code and integration examples and on the next screen, select Node.js as the language of choice for this app.

Creating environment variables

Environment variables hold values such as secret keys and credentials outside the source code. They should not be shared, hosted, or pushed to GitHub, to prevent unauthorized access.

Before creating the .env file in the root of the directory, let’s install the dotenv package with this command:

    npm install dotenv

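Paste this code into the .env file, and replace each placeholder inside the quotation marks with the value from your Access parameters tab. Because index.js places UNAME before the @ in the WebSocket URL, this value normally needs to hold both credentials in the form username:password.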

.env

    UNAME="<user-name>"
    HOST="<host>"
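
A missing variable otherwise only shows up later as a failed connection, so it can help to check the values first. The sketch below is one way to do that; buildEndpoint is a hypothetical helper, the sample values are placeholders, and the wss:// format is the one used in index.js in this article.

```javascript
// Build the WebSocket endpoint from environment values, failing early
// with a readable message when a variable is missing.
function buildEndpoint(env) {
  const missing = ["UNAME", "HOST"].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return `wss://${env.UNAME}@${env.HOST}`;
}

// Placeholder values for illustration; real ones come from process.env.
console.log(buildEndpoint({ UNAME: "user:pass", HOST: "example.host:9222" }));
```

In the real script, the call would be buildEndpoint(process.env) after dotenv has loaded the .env file.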

Creating a web scraper using Puppeteer

Back in the entry point file, index.js, paste this code. You can then run it with npm run start:dev (or npm start):

index.js

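    // puppeteer-core drives a browser but does not download one
    const puppeteer = require("puppeteer-core");
    // Load the variables from .env into process.env
    require("dotenv").config();

    const auth = process.env.UNAME;
    const host = process.env.HOST;

    async function run() {
      let browser;
      try {
        // Connect to the remote Scraping Browser over WebSocket
        browser = await puppeteer.connect({
          browserWSEndpoint: `wss://${auth}@${host}`,
        });

        const page = await browser.newPage();
        // Allow up to 2 minutes for each navigation
        page.setDefaultNavigationTimeout(2 * 60 * 1000);

        await page.goto("http://lumtest.com/myip.json");
        // Read the rendered HTML of the page
        const html = await page.content();

        console.log(html);
      } catch (e) {
        console.error("run failed", e);
      } finally {
        // Close the session even when an error occurred
        await browser?.close();
      }
    }

    // Run only when this file is executed directly
    if (require.main === module) run();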

The code above does the following:

Imports the puppeteer-core and dotenv modules
Reads the credentials and host from the environment into the auth and host variables
Defines the asynchronous run function
In the try block, connects Puppeteer to the remote browser by passing the WebSocket URL in the browserWSEndpoint option
Opens a new page (tab) in that browser, which gives programmatic access to page elements and events
Raises the default navigation timeout to 2 minutes with setDefaultNavigationTimeout, since loading a page through a remote browser can take longer than Puppeteer's 30-second default
Navigates to the URL with the goto function, then reads the rendered HTML with the page.content() method
Closes the browser in the finally block, so the remote session ends whether or not the scrape succeeded
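
Because page.content() returns HTML, the JSON served by the test URL arrives wrapped in markup. The sketch below shows one rough way to get an object back out, assuming the browser renders the JSON as plain text inside the body (for example, in a pre element). The helper name and the sample HTML are hypothetical, and the approach does not decode HTML entities or cope with JSON values that contain angle brackets.

```javascript
// Strip HTML tags from a rendered page and parse what is left as JSON.
// Assumes the body contains only the JSON text (for example, inside <pre>).
function extractJson(html) {
  const text = html.replace(/<[^>]*>/g, "").trim();
  return JSON.parse(text);
}

// Hypothetical page content, shaped like a browser-rendered JSON response.
const sampleHtml =
  '<html><head></head><body><pre>{"ip":"203.0.113.7","country":"US"}</pre></body></html>';

const data = extractJson(sampleHtml);
console.log(data.ip, data.country);
```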

If you want to expand this project, you can take screenshots of the web pages in PNG format or save them as PDF files.
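
As a starting point, the sketch below saves both formats with Puppeteer's page.screenshot() and page.pdf() methods. It assumes `page` is a Puppeteer Page object; capturePage and the file names are hypothetical, and whether PDF generation is available on a given remote browser is something to verify in its documentation.

```javascript
// Save the current page as a PNG screenshot and as a PDF.
// `page` is expected to be a Puppeteer Page; the file names are placeholders.
async function capturePage(page, baseName) {
  const screenshotPath = `${baseName}.png`;
  const pdfPath = `${baseName}.pdf`;
  await page.screenshot({ path: screenshotPath, fullPage: true });
  await page.pdf({ path: pdfPath, format: "A4" });
  return { screenshotPath, pdfPath };
}

// Example call, after page.goto(...) has finished:
// await capturePage(page, "myip");
```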

Check out the documentation to learn more.

Conclusion

Scraping the web with Bright Data's infrastructure makes the process quicker, because the unblocking and proxy handling are already taken care of and you do not have to write those parts of your scripts from scratch.

Try it today to explore the benefits of Bright Data over traditional web scraping tools, which are restricted by their proxy networks and make it challenging to work with large datasets.

Resources

Scraping Browser documentation
Scrape at scale with Bright Data Scraping Browser
