In a previous tutorial I wrote about scraping server-side rendered HTML content. Many modern websites, however, are rendered client-side, so a different approach to scraping them is required.
Enter Puppeteer, a Node.js library for running a headless Chrome browser. It allows us to scrape content from a URL after it has been rendered as it would be in a standard browser.
Before beginning you’ll need to have Node.js installed.
Let’s get started by creating a project folder, initialising the project and installing the required dependencies by running the following commands in a terminal:
mkdir scraper
cd scraper
npm init -y
npm install puppeteer cheerio
cheerio – is an implementation of core jQuery designed specifically for the server. It makes selecting elements from the DOM easier, as we can use the familiar jQuery syntax.
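As a quick illustration of what that jQuery-style syntax looks like, here's a minimal sketch (the HTML string and .item class are purely made up for this example):
const cheerio = require("cheerio");

// Load an HTML string and query it with familiar jQuery selectors
const $ = cheerio.load('<ul><li class="item">One</li><li class="item">Two</li></ul>');
$(".item").each(function () {
  console.log($(this).text()); // "One", then "Two"
});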
Next create a new file called scrape.js and load in the dependencies:
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");
const fs = require("fs");
fs – is a Node.js module that enables interacting with the file system, which we'll use to save the scraped data into a JSON file.
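Roughly speaking, that boils down to a single call like the sketch below (the file name and data here are placeholders, not part of the final script):
const fs = require("fs");

// Serialise a JavaScript array to JSON and write it to disk
const example = [{ title: "Hello" }];
fs.writeFileSync("example.json", JSON.stringify(example));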
Then add a getData() function that will launch a browser using Puppeteer, fetch the contents of a URL and call a processData() function that'll process the page content:
async function getData() {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the URL and grab the fully rendered HTML
  await page.goto("https://www.reddit.com/r/webdev/");
  const data = await page.content();
  await browser.close();

  processData(data);
}
getData();
With the page content scraped, let's set up the processData() function. Here we use cheerio to select only the content we require (username, post title and number of votes):
function processData(data) {
  console.log("Processing Data...");
  const $ = cheerio.load(data);
  const posts = [];

  // Each Reddit post is wrapped in an element with the .Post class
  $(".Post").each(function () {
    posts.push({
      user: $("._2tbHP6ZydRpjI44J3syuqC", this).text(),
      title: $("._eYtD2XCVieq6emjKBH3m", this).text(),
      votes: $("._1E9mcoVn4MYnuBQSVDt1gC", this).first().text(),
    });
  });

  // Save the results to a JSON file
  fs.writeFileSync("data.json", JSON.stringify(posts));
  console.log("Complete");
}
This code loops through each of the .Post elements, grabs the data we specified (Reddit doesn't use human readable class names, hence the long strings of random characters), and pushes it to a posts array.
Once each of the posts has been processed, a data.json file is created using fs.writeFileSync. You can now run the script using node scrape.js. It'll take a little while to complete; once finished, browse to the project folder and you'll see the data.json file complete with data.