Michael Burrows

Posted on Aug 21, 2020 • Edited on Mar 25, 2022 • Originally published at w3collective.com

Scrape server-side rendered HTML content with JavaScript

#javascript #node #tutorial

Note: An updated, working version of this tutorial can be found here.

“Scraping” can be used to collect and analyse data from sources that don’t have APIs.

In this tutorial we’ll scrape content using JavaScript from a website that’s rendered server-side.

You’ll need to have Node.js and npm installed if you haven’t already.

Let’s start by creating a project folder and initialising it with a package.json file:

mkdir scraper
cd scraper
npm init -y

We’ll be using two packages to build our scraper script.

axios – Promise based HTTP client for the browser and node.js.
cheerio – Implementation of jQuery designed for the server (makes it easy to work with the DOM).

Install the packages by running the following command:

npm install axios cheerio --save

Next create a file called scrape.js and include the packages we just installed:

const axios = require("axios");
const cheerio = require("cheerio");

In this example I’ll be using https://lobste.rs/ as the data source to be scraped.

Inspecting the page source shows that the site name in the header has a cur_url class, so let’s see if we can scrape its text.

Add the following to scrape.js to fetch the HTML and log the title text if successful:

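axios("https://lobste.rs/")
  .then((response) => {
    // response.data holds the raw HTML returned by the server
    const html = response.data;
    // Load the HTML into cheerio so it can be queried with jQuery-style selectors
    const $ = cheerio.load(html);
    const title = $(".cur_url").text();
    console.log(title);
  })
  .catch(console.error);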

Run the script with the following command and you should see Lobsters logged in the terminal:

node scrape.js

If everything’s working we can proceed to scrape some actual content from the website.

Let’s get the titles, domains and points for each of the stories on the homepage by updating scrape.js:

axios("https://lobste.rs/")
  .then((response) => {
    const html = response.data;
    const $ = cheerio.load(html);
    // Select every element with the story class
    const storyItem = $(".story");
    const stories = [];
    // A regular function (not an arrow function) is used so that `this` is the current element
    storyItem.each(function () {
      const title = $(this).find(".u-url").text();
      const domain = $(this).find(".domain").text();
      const points = $(this).find(".score").text();
      stories.push({
        title,
        domain,
        points,
      });
    });
    // Log the array of { title, domain, points } objects
    console.log(stories);
  })
  .catch(console.error);
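
Every value returned by .text() is a string, so the points come back as text and any field may carry stray whitespace. As a rough sketch of one way to tidy the results, the hypothetical cleanStory helper below trims each field and converts points to a number. The sample data is invented, and the null fallback is an assumption about how you might want to handle a score that isn’t numeric.

```javascript
// Hypothetical helper (not part of axios or cheerio): tidy one scraped story object.
function cleanStory(story) {
  const points = Number.parseInt(story.points, 10);
  return {
    title: story.title.trim(),
    domain: story.domain.trim(),
    // Fall back to null when the score text is not a number.
    points: Number.isNaN(points) ? null : points,
  };
}

// Invented sample data shaped like the objects pushed into `stories`.
const sample = [
  { title: "  Example story ", domain: "example.com", points: "12" },
  { title: "Another story", domain: " example.org ", points: "~" },
];

const cleaned = sample.map(cleanStory);
console.log(cleaned);
```

In scrape.js this could be applied with stories.map(cleanStory) just before the console.log call.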