Antoine Mesnil

Posted on Oct 28, 2022 • Edited on Nov 4, 2022 • Originally published at antoinemesnil.com

Practical intro to scraping with Puppeteer: fetch your country's climate data

#tutorial #scraping #puppeteer #node

Introduction

I will show how to set up Puppeteer with Node.js, then use a few essential functions to navigate between pages and search their content. Our goal is to scrape the climate data of your country's main cities from Wikipedia.
You only need Node.js and an IDE (like VS Code) installed.

1. What is Puppeteer and why choose it

Puppeteer is a popular Node.js library (80k stars on GitHub) maintained by the Chrome DevTools team.
With Puppeteer you can control a headless Chrome browser through an easy API and a simple setup.
If you are looking for alternatives, Playwright is a good match, but it focuses much more on testing. There is also Selenium, which has the benefit of working with many browsers and languages but comes with a more complex setup and API.

2. Set up the project

At the end of this part, you should have a running script that automatically opens Chrome and a new page.

Run these commands:

npm init
npm install puppeteer

Create a file named scrapeWikipedia.js (or any name of your choice) and copy-paste this:

const puppeteer = require("puppeteer");

const script = async () => {
  const browser = await puppeteer.launch({
    //headless: false opens a visible Chromium window, which is useful to see what is going on while you test; set it to true for the finalized script
    headless: false,
  });
  const page = await browser.newPage();
  //
  //your code will go here
  //
  await browser.close()
}
script()

Now you can run:

node scrapeWikipedia.js
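
One thing to keep in mind once you add your own code: if anything between browser.newPage() and browser.close() throws, browser.close() is skipped. Below is a minimal sketch of one way to guard against that. The helper name withBrowser and its parameters are hypothetical (they are not part of Puppeteer), and the launch function is passed in as an argument.

```javascript
// Hypothetical helper, not part of Puppeteer.
// `launch` is any async function that returns a browser,
// e.g. () => puppeteer.launch({ headless: false }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    // run the scraping code and hand its result back to the caller
    return await task(page);
  } finally {
    // executed whether `task` succeeded or threw
    await browser.close();
  }
}

// Possible usage (commented out so the snippet has no side effects):
// const puppeteer = require("puppeteer");
// withBrowser(() => puppeteer.launch({ headless: false }), async (page) => {
//   await page.goto("https://en.wikipedia.org");
// }).catch((error) => console.error(error));
```

The try/finally keeps the cleanup in one place, so the scraping code itself does not need to remember to close the browser.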

3. Search within a page and scrape data

Now we get to the core of the script: we will search for our target content across multiple pages.

To do that we are going to use three functions:

page.goto - navigates to a page

page.evaluate - takes a callback that is executed as JavaScript inside the page; in this introduction we use it to find and handle elements with common DOM methods like document.querySelectorAll

page.waitForNavigation - as its name suggests, it waits for a navigation to finish so the content is loaded; there is also page.waitForSelector, which waits for a given element to appear and is the one used below
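
Here is a minimal sketch of how these calls fit together, using waitForSelector for the waiting step. The function name readHeading and the "h1" selector are assumptions chosen for illustration; the page argument is the object returned by browser.newPage().

```javascript
// Hypothetical helper: navigate, wait for an element, then read it.
async function readHeading(page, url) {
  // load the page at the given URL
  await page.goto(url);
  // wait until the element we need exists in the DOM
  await page.waitForSelector("h1");
  // the callback runs inside the browser, so `document` is available there
  return page.evaluate(() => document.querySelector("h1").innerText);
}
```

Note that the callback given to page.evaluate cannot see variables from the Node.js script, which is why the country name is declared inside the callback in the code that follows.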

First, we are going to fetch the list of cities of your country:

  //Pick the page that contains your country among these: A-B • C-D-E-F • G-H-I-J-K • L-M-N-O • P-Q-R-S • T-U-V-W-Y-Z
  await page.goto(
    "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"
  )
  const cityLinks = await page.evaluate(() => {
    //set your country here
    const country = "YOUR COUNTRY"

    //remove the elements between the countries and the cities to make the scraping easier
    const thumbs = document.querySelectorAll(".thumb")
    thumbs.forEach((thumb) => thumb.remove())

    //get the list of countries and cities
    const countries = document.querySelectorAll(".mw-headline")
    const countryIndex = Array.from(countries).findIndex((item) =>
      item.innerText.includes(country)
    )
    const cityTables = document.querySelectorAll("h2 + table.wikitable")
    const cityList = cityTables[countryIndex].querySelectorAll("tbody tr")

    return Array.from(cityList).map((row) => row.querySelector("a")?.href)
  })
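
Rows without a link (such as header rows) produce empty entries in this array, and the same city could in principle appear more than once. Below is a small sketch of a clean-up step. The helper name cleanLinks is hypothetical, and the "/wiki/" check is an assumption about how Wikipedia article URLs look.

```javascript
// Hypothetical helper: keep only usable, unique article links.
function cleanLinks(links) {
  const seen = new Set();
  return links.filter((link) => {
    // rows without an <a> element produce an empty entry
    if (typeof link !== "string" || link === "") return false;
    // assumption: city articles live under /wiki/ on Wikipedia
    if (!link.includes("/wiki/")) return false;
    // drop duplicates while keeping the original order
    if (seen.has(link)) return false;
    seen.add(link);
    return true;
  });
}
```

You could call it as cleanLinks(cityLinks) before looping over the links.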

At this point you should have a list of cities with a link for each of them.
Now we are going to visit every page and get the data from its climate table:

  const data = []
  await page.waitForSelector("table.wikitable > tbody")
  for (let link of cityLinks.filter((item) => !!item)) {
    await page.goto(link)
    const cityData = await page.evaluate(() => {
      const name = document.querySelector("h1").innerText
      const values = { name }

      //there are more types of data if you need
      const labels = ["Average high", "Average precipitation", "sunshine hours"]
      const tables = Array.from(
        document.querySelectorAll("table.wikitable > tbody")
      )
      const table = tables.find((item) => item.innerText.includes("Climate"))
      labels.forEach((label) => {
        //find the table row whose text contains the label
        const data = Array.from(table?.children || []).find((item) =>
          item?.innerText?.includes(label)
        )

        //pair each cell of that row with the matching cell of the table's second row, which holds the column headers
        const dataValues = Array.from(data?.children || []).map(
          (item, index) => ({
            value: item?.innerText || "",
            time:
              table?.children?.[1]?.children[index]?.innerText || "",
          })
        )
        values[label] = dataValues
      })
      return values
    })
    data.push(cityData)
  }
  console.log(data)
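
Each label ends up as a list of { value, time } pairs whose values are still raw cell text. Below is a sketch of one way to turn such a list into numbers keyed by month. The helper name toMonthlyMap is hypothetical and the sample data is made up; the handling of the Unicode minus sign is an assumption about how negative values are written in the tables.

```javascript
// Hypothetical helper: turn [{ value, time }, ...] into { Jan: 7.2, ... }.
function toMonthlyMap(entries) {
  const result = {};
  for (const { value, time } of entries) {
    // assumption: negative numbers may use the Unicode minus sign (U+2212)
    const normalized = String(value).replace(/\u2212/g, "-");
    // parseFloat reads the leading number and ignores what follows
    const number = parseFloat(normalized);
    // skip the label cell and anything else that is not numeric
    if (time && !Number.isNaN(number)) {
      result[time] = number;
    }
  }
  return result;
}

// Made-up sample in the shape produced by the script above
const sample = [
  { value: "Average high °C (°F)", time: "Month" },
  { value: "7.2\n(45.0)", time: "Jan" },
  { value: "\u22121.5\n(29.3)", time: "Feb" },
];
console.log(toMonthlyMap(sample)); // { Jan: 7.2, Feb: -1.5 }
```

Values written with thousands separators would need extra handling, since parseFloat stops at the first comma.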

Get the full code here: Gist of the code

This should log an array with one object per city: its name, plus a list of { value, time } pairs for each label.

4. Conclusion

You can adapt this script to get pretty much anything: all you have to do is study the structure of the pages you want the data from and go through some trial and error. As for myself, I used this method to get the initial data for a tool I'm working on: https://dreamclimate.city