Web Scraping With JavaScript And Node JS - An Ultimate Guide

Introduction

JavaScript has become one of the most preferred languages for web scraping. Its ability to extract data from SPAs (Single-Page Applications) is boosting its popularity. Developers can easily automate their tasks with the help of libraries like Puppeteer and Cheerio, which are available in JavaScript.

In this blog, we are going to discuss the various web scraping libraries available in JavaScript, weigh their advantages and disadvantages, determine the best among them, and at the end, answer whether Node JS is a good choice for web scraping.


Web Scraping With Node JS

Before starting with the tutorial, let us learn some basics of web scraping.

What is Web Scraping?

Web Scraping is the process of extracting data from one or many websites: you make HTTP requests to the website's server to access the raw HTML of a particular webpage, and then convert that HTML into the format you want.


There are various uses of Web Scraping:

  • SEO — Web Scraping can be used to scrape Google Search Results for objectives like SERP monitoring, keyword tracking, etc.
  • News Monitoring — Web Scraping enables access to a large number of articles from various media agencies, which can be used to keep track of current news and events.
  • Lead Generation — Web Scraping helps extract the contact details of potential customers.
  • Price Comparison — Web Scraping can be used to gather product pricing from multiple online sellers for price comparison.

Best Web Scraping Libraries in Node JS

The best web scraping libraries present in Node JS are:

  1. Unirest
  2. Axios
  3. SuperAgent
  4. Cheerio
  5. Puppeteer
  6. Playwright
  7. Nightmare

Let us start discussing these various web scraping libraries.

HTTP Clients

HTTP client libraries are used to interact with website servers by sending requests and retrieving the response. In the following sections, we will discuss several libraries that can be utilized for making HTTP requests.

Unirest

Unirest is a lightweight HTTP request library available in multiple languages, built and maintained by Kong. It supports various HTTP methods like GET, POST, DELETE, and HEAD, which can easily be added to your applications, making it a preferable choice for simple use cases.

Unirest is one of the most popular JavaScript libraries for extracting the valuable data available on the internet.

Let us take an example of how we can do it. Before starting, I am assuming that you have already set up your Node JS project with a working directory.

First, install Unirest JS by running the below command in your project terminal.

npm i unirest

Now, we will request the target URL to extract the raw response data.

const unirest = require("unirest");

const getData = async () => {
  try {
    const response = await unirest.get("https://www.reddit.com/r/programming.json");
    console.log(response.body); // JSON, since this endpoint returns a .json resource
  } catch (e) {
    console.log(e);
  }
};
getData();

This is how you can create a basic scraper with Unirest.

Advantages:

  1. All HTTP methods are supported, including GET, POST, DELETE, etc. (a POST sketch follows this list).
  2. It is very fast for web scraping tasks and can handle a large load without any problem.
  3. It makes file transfer over a server much simpler.
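Here is a minimal sketch of the POST support mentioned in the first advantage. The endpoint and payload are illustrative only: httpbin.org simply echoes back what it receives.

const unirest = require("unirest");

// A hedged sketch of a POST request with Unirest; the endpoint and
// payload are made-up examples, not part of a real scraping target.
const postData = async () => {
  try {
    const response = await unirest
      .post("https://httpbin.org/post")
      .headers({ "Content-Type": "application/json" })
      .send({ query: "web scraping" });
    console.log(response.body); // echo of the payload we sent
  } catch (e) {
    console.log(e);
  }
};
postData();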

Axios

Axios is a promise-based HTTP client for both Node JS and browsers. Axios is widely used in the developer community because of its wide range of methods, simplicity, and active maintenance. It also supports features like request cancellation, automatic transforms for JSON data, etc.

You can install the Axios library by running the below command in your terminal.

npm i axios

Making an HTTP request with Axios is quite simple.

const axios = require("axios");

const getData = async () => {
  try {
    const response = await axios.get("https://books.toscrape.com/");
    console.log(response.data); // HTML
  } catch (e) {
    console.log(e);
  }
};
getData();

Advantages:

  1. It can intercept an HTTP request and modify it (see the interceptor sketch after this list).
  2. It has large community support and is actively maintained, making it a reliable option for making HTTP requests.
  3. It can transform request and response data.
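As a quick illustration of the first advantage, here is a minimal sketch that registers a request interceptor to attach a custom User-Agent header before every request goes out. The header value is just an example.

const axios = require("axios");

// The interceptor runs before every request is sent and can modify
// the outgoing config; here it adds an illustrative User-Agent header.
axios.interceptors.request.use((config) => {
  config.headers["User-Agent"] = "Mozilla/5.0 (compatible; MyScraper/1.0)";
  return config;
});

axios
  .get("https://books.toscrape.com/")
  .then((response) => console.log(response.status)) // 200 on success
  .catch((e) => console.log(e.message));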

SuperAgent

SuperAgent is another lightweight HTTP client library for both Node JS and the browser. It supports many high-level HTTP client features. It offers an API similar to Axios and supports both promise and async/await syntax for handling responses.

You can install SuperAgent by running the following command.

npm i superagent

You can make an HTTP request using async/await with SuperAgent like this:

const superagent = require("superagent");

const getData = async () => {
  try {
    const response = await superagent.get("https://books.toscrape.com/");
    console.log(response.text); // HTML
  } catch (e) {
    console.log(e);
  }
};
getData();

Advantages:

  1. SuperAgent can be easily extended via various plugins (see the sketch after this list).
  2. It works in both the browser and Node.
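To sketch the plugin mechanism: a SuperAgent plugin is just a function that receives the request object, so you can write your own and attach it with use(). The plugin below is a made-up example that sets a custom User-Agent header.

const superagent = require("superagent");

// A plugin is simply a function that receives the request object;
// this illustrative one sets a custom User-Agent header.
const withUserAgent = (req) => {
  req.set("User-Agent", "Mozilla/5.0 (compatible; MyScraper/1.0)");
  return req;
};

superagent
  .get("https://books.toscrape.com/")
  .use(withUserAgent)
  .then((res) => console.log(res.status))
  .catch((e) => console.log(e.message));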

Disadvantages:

  1. Fewer features compared to other HTTP client libraries like Axios.
  2. Its documentation is not very detailed.

Web Parsing Libraries


Web parsing libraries are used to extract the required data from raw HTML or XML documents. There are various web parsing libraries available in JavaScript, including Cheerio, JSONPath, html-parse-stringify2, etc. In the following section, we will discuss Cheerio, the most popular web parsing library in JavaScript.

Cheerio

Cheerio is a lightweight web parsing library based on the powerful API of jQuery that can be used to parse and extract data from HTML and XML documents.

Cheerio is blazingly fast at parsing, manipulating, and rendering HTML, as it works with a simple, consistent DOM model. It is not a web browser: it can't produce visual rendering, apply CSS, or execute JavaScript. For scraping SPAs (Single-Page Applications), we need full browser automation tools like Puppeteer and Playwright, which we will discuss in a bit.

Let us scrape the title of the book Sharp Objects from its product page on Books to Scrape.


First, we will install the Cheerio library.

npm i cheerio

Then, we can extract the title by running the below code.

const unirest = require("unirest");
const cheerio = require("cheerio");

const getData = async () => {
  try {
    const response = await unirest.get("https://books.toscrape.com/catalogue/sharp-objects_997/index.html");
    const $ = cheerio.load(response.body);
    console.log("Book Title: " + $("h1").text()); // "Book Title: Sharp Objects"
  } catch (e) {
    console.log(e);
  }
};
getData();

The process is quite similar to what we did in the Unirest section, but with a little difference. In the above code, we load the extracted HTML into a Cheerio instance, and then we use the CSS selector of the title to extract the required data.

Advantages:

  1. Faster than most other web parsing libraries.
  2. Cheerio has a very simple, jQuery-like syntax, which allows developers to scrape web pages easily.
  3. Cheerio can be integrated with HTTP client libraries like Unirest and Axios, which makes a great combo for scraping a website (a combined sketch follows this list).
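Here is a minimal sketch of that combo: Axios fetches the listing page, and Cheerio collects every book title from the article h3 a elements, matching the structure of books.toscrape.com used earlier.

const axios = require("axios");
const cheerio = require("cheerio");

// Fetch the listing page with Axios, then parse it with Cheerio and
// collect the title attribute of every book link.
const getTitles = async () => {
  try {
    const response = await axios.get("https://books.toscrape.com/");
    const $ = cheerio.load(response.data);
    const titles = [];
    $("article h3 a").each((i, el) => {
      titles.push($(el).attr("title"));
    });
    console.log(titles); // the 20 titles on the first page
  } catch (e) {
    console.log(e);
  }
};
getTitles();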

Disadvantages:

  1. It cannot execute JavaScript.

Headless Browsers


Nowadays, website development has become more advanced, and developers prefer more dynamic content on their websites, which is made possible by JavaScript. But content rendered by JavaScript is not accessible when scraping with a simple HTTP GET request.

The only way to scrape such dynamic content is by using headless browsers. Let us discuss the libraries that can help with scraping it.

Puppeteer

Puppeteer is a Node JS library developed by Google that provides a high-level API for controlling Chrome or Chromium browsers.

Features associated with Puppeteer JS:

  1. Puppeteer gives you better control over Chrome.
  2. It can generate screenshots and PDFs of web pages.
  3. It can be used to scrape web pages that use JavaScript to load content dynamically.

Let us scrape all the book titles and their links from books.toscrape.com.

But first, we will install the Puppeteer library.

npm i puppeteer

Now, we will prepare a script to scrape the required information.


Write the below code in your JS file.

const puppeteer = require("puppeteer");

const getData = async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.goto("https://books.toscrape.com/index.html", {
    waitUntil: "domcontentloaded",
  });

Step-by-step explanation:

  1. First, we imported Puppeteer and launched the browser with headless mode set to false, which allows us to see exactly what is happening.
  2. Then, we created a new page in the headless browser.
  3. After that, we navigated to our target URL and waited until the DOM content was loaded.

Now, we will parse the HTML.

  let data = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("article h3")).map((el) => {
      return {
        title: el.querySelector("a").getAttribute("title"),
        link: el.querySelector("a").getAttribute("href"),
      };
    });
  });

The page.evaluate() method executes the JavaScript within the current page context. The document.querySelectorAll() selects all the elements matching the article h3 selector. The document.querySelector() works the same way, but it selects only a single HTML element.

Great! Now, we will print the data and close the browser.

  console.log(data);
  await browser.close();
};
getData();

This will give you 20 titles and links to the books present on the web page.

Advantages:

  1. We can perform various activities on the web page, like clicking buttons and links, navigating between pages, scrolling the page, etc.
  2. It can be used to take screenshots of web pages (see the sketch after this list).
  3. The evaluate() function in Puppeteer helps you execute JavaScript.
  4. You don't need an external driver to run tests.
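Here is a minimal sketch of the screenshot feature mentioned in the second advantage; the output filename is arbitrary.

const puppeteer = require("puppeteer");

// Capture a full-page PNG of the target page; "books.png" is an
// arbitrary output filename chosen for this example.
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://books.toscrape.com/index.html");
  await page.screenshot({ path: "books.png", fullPage: true });
  await browser.close();
})();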

Disadvantages:

  1. It requires very high CPU usage to run.
  2. It currently supports only the Chrome browser.

Playwright

Playwright is a test automation framework for automating web browsers like Chrome, Firefox, and Safari, with an API similar to Puppeteer. It was developed by the same team that worked on Puppeteer. Like Puppeteer, Playwright can run in both headless and non-headless modes, making it suitable for a wide range of uses, from task automation to web scraping and web crawling.

Major Differences between Playwright and Puppeteer:

  1. Playwright is compatible with Chrome, Firefox, and Safari, while Puppeteer supports only the Chrome browser.
  2. Playwright provides a wide range of options to control the browser in headless mode.
  3. Puppeteer is limited to JavaScript, while Playwright supports various languages like C#/.NET, Java, and Python.

Let us install Playwright now.

npm i playwright

We will now prepare a basic script to scrape the prices and stock availability from the same website which we used in the Puppeteer section.


The syntax is quite similar to Puppeteer.

const playwright = require("playwright");

const getData = async () => {
  const browser = await playwright.chromium.launch({ headless: false });
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto("https://books.toscrape.com/index.html");

The newContext() method creates a new, isolated browser context.

Now, we will prepare our parser.

  const articles = await page.$$("article");

  const data = [];
  for (const article of articles) {
    data.push({
      price: await article.$eval("p.price_color", (el) => el.textContent),
      availability: await article.$eval("p.availability", (el) => el.textContent),
    });
  }

Then, we will print the data and close our browser.

  console.log(data);
  await browser.close();
};
getData();

Advantages:

  1. It supports multiple languages like Python, Java, .NET, and JavaScript.
  2. It is among the fastest web browser automation libraries.
  3. It supports multiple web browsers like Chrome, Firefox, and Safari through a single API (illustrated after this list).
  4. Its documentation is well-written, which makes it easy for developers to learn and use.
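As a sketch of the cross-browser advantage, the same code can drive Chromium, Firefox, and WebKit just by switching the browser type. This assumes the browser binaries have been installed (for example, via npx playwright install).

const playwright = require("playwright");

// The same API drives all three engines; only the browser type changes.
(async () => {
  for (const type of ["chromium", "firefox", "webkit"]) {
    const browser = await playwright[type].launch();
    const page = await browser.newPage();
    await page.goto("https://books.toscrape.com/index.html");
    console.log(type, await page.title()); // page title per engine
    await browser.close();
  }
})();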

Nightmare JS

Nightmare is a high-level web automation library designed to automate browsing, web scraping, and various other tasks. It uses Electron (similar to PhantomJS, but roughly twice as fast), which provides it with a headless browser, making it efficient and easy to use. It is predominantly used for UI testing and crawling.

It can be used to mimic user actions such as navigating to a website, clicking a button or a link, typing, etc., with an API that provides a smooth experience for each script block.

Install Nightmare JS by running the following command.

npm i nightmare

Now, we will search for the results of “Serpdog” on duckduckgo.com.

const Nightmare = require("nightmare");
const nightmare = Nightmare();

nightmare
  .goto("https://duckduckgo.com")
  .type("#search_form_input_homepage", "Serpdog")
  .click("#search_button_homepage")
  .wait(".nrn-react-div")
  .evaluate(() => {
    return Array.from(document.querySelectorAll(".nrn-react-div")).map((el) => {
      return {
        title: el.querySelector("h2").innerText.replace("\n", ""),
        link: el.querySelector("h2 a").href,
      };
    });
  })
  .end()
  .then((data) => {
    console.log(data);
  })
  .catch((error) => {
    console.error("Search failed:", error);
  });

In the above code, we first declared an instance of Nightmare. Then, we navigated to the DuckDuckGo search page.

Then, we used the type() method to type Serpdog into the search field and submitted the form by clicking the search button on the homepage using the click() method. We make our scraper wait until the search results have loaded, after which we extract the search results present on the web page with the help of their CSS selectors.

Advantages:

  1. It is faster than Puppeteer.
  2. Fewer resources are needed to run the program.

Disadvantages:

  1. It doesn't have community support as strong as Puppeteer's. Also, some unresolved issues exist in Electron that can allow a malicious website to execute code on your computer.

Other libraries

In this section, we will discuss some alternatives to the previously discussed libraries.

Node Fetch

Node Fetch is a lightweight library that brings the Fetch API to Node JS, allowing efficient HTTP requests in the Node JS environment.

Features:

  1. It allows the use of promises and async functions.
  2. It implements the Fetch API functionality in Node JS.
  3. It has a simple API that is regularly maintained and easy to use and understand.

You can install Node Fetch by running the following command.

npm i node-fetch

Here is how you can use Node Fetch for web scraping.

// Note: this CommonJS require works with node-fetch v2;
// node-fetch v3 is ESM-only and must be imported instead.
const fetch = require("node-fetch");

const getData = async () => {
  const response = await fetch("https://en.wikipedia.org/wiki/JavaScript");
  const body = await response.text();

  console.log(body); // HTML
};
getData();

Osmosis

Osmosis is a web scraping library used for extracting data from HTML and XML documents.

Features:

  1. Unlike jQuery and Cheerio, it has no large dependencies.
  2. It has a clean, promise-like interface.
  3. It offers fast parsing and a small memory footprint.

Advantages:

  1. It supports retry and redirect limits.
  2. It supports single and multiple proxies.
  3. It supports form submission, session cookies, etc.
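Since no example was shown above, here is a hedged sketch of what scraping with Osmosis typically looks like, using its find/set/data chain; the selectors reuse the books.toscrape.com structure from earlier sections. You can install it with npm i osmosis.

const osmosis = require("osmosis");

// Find every book link and emit its title attribute and href;
// the .data() callback fires once per matched item.
osmosis
  .get("https://books.toscrape.com/")
  .find("article h3 a")
  .set({ title: "@title", link: "@href" })
  .data((item) => console.log(item))
  .error((e) => console.log(e));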

Is Node JS good for web scraping?

Yes, Node JS is good for web scraping. It has various powerful libraries, like Axios and Puppeteer, which make it a preferred choice for data extraction. Also, the ease of extracting data from websites that use JavaScript to load dynamic content makes it a great option for web scraping tasks.

In the end, the great community support available for Node JS will never let you down!

Conclusion

In this tutorial, we learned about various libraries in Node JS that can be used for scraping, and we also examined their advantages and disadvantages.

If you think we can complete your web scraping tasks and help you collect data, feel free to contact us.

I hope this tutorial gave you a complete overview of web scraping with Node JS. Please do not hesitate to message me if I missed something. Follow me on Twitter. Thanks for reading!

Additional Resources

I have prepared a complete list of blogs on scraping Google with Node JS, which can give you an idea of how to gather data from advanced websites like Google.

Top comments (2)

Mohanraj

I am using the puppeteer library to scrape data from a website URL. I got the scraped data, but it is in an improper format. I need to convert this scraped data into a relevant question-and-answer format in my Next.js project.

Note: during the web scraping process, when I type any URL into the text field, I need the scraped data in question-and-answer format.

ApiForSeo

Which URL are you trying to scrape?