Introduction
I will show how to set up Puppeteer with Node.js, and we will use a few essential functions to navigate between pages and search for content on them. Our goal is to scrape the climate data of your country's main cities from Wikipedia.
You only need Node.js and an IDE (like VS Code) installed.
1. What is Puppeteer and why choose it
Puppeteer is a popular Node.js library (over 80k stars on GitHub) maintained by the Chrome DevTools team.
With Puppeteer you can control a headless Chrome browser through an easy API and a simple setup.
If you are looking for alternatives, Playwright is a good match, but it focuses much more on testing. There is also Selenium, which has the benefit of working with many browsers and languages but comes with a more complex setup and API.
2. Set up the project
At the end of this part, you should have a running script that automatically opens Chrome and a new page.
Run those commands:
npm init
npm install puppeteer
Create a file named scrapeWikipedia.js (or a name of your choice) and copy-paste this:
const puppeteer = require("puppeteer");

const script = async () => {
  const browser = await puppeteer.launch({
    // set this to false to open a visible Chromium window, which is useful
    // to see what is going on and test things before finalizing the script
    headless: true,
  });
  const page = await browser.newPage();
  //
  // your code will go here
  //
  await browser.close();
};

script();
Now you can run:
node scrapeWikipedia.js
3. Search within a page and scrape data
Now we get to the core of this script: we will search for our targeted content across multiple pages.
To do that we are going to use 3 functions:
page.goto - navigates to a page
page.evaluate - takes a callback that executes JavaScript inside the page; in this introduction we use it to find and handle elements with common DOM methods like document.querySelectorAll
page.waitForNavigation - as its name suggests, waits for the next page to load; there is also waitForSelector, which waits for a specific element to appear and can also be useful
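To make the flow concrete, here is a minimal sketch of how these calls typically chain together. The page object below is a hand-written stub, not the real Puppeteer API, and the URL and selector are placeholders; the point is only the order of the awaited calls.

```javascript
// A stub "page" that records the order of calls, to illustrate the typical
// navigate -> wait -> evaluate sequence. This is NOT real Puppeteer, just a
// mock; the URL and selector are placeholders.
const calls = [];
const page = {
  goto: async (url) => calls.push(`goto:${url}`),
  waitForSelector: async (selector) => calls.push(`wait:${selector}`),
  evaluate: async (callback) => {
    calls.push("evaluate");
    return callback();
  },
};

const run = async () => {
  await page.goto("https://example.org"); // 1. navigate to the page
  await page.waitForSelector("table"); // 2. wait for the content to appear
  // 3. run code "inside" the page and return a serializable value
  return page.evaluate(() => ["some", "scraped", "data"]);
};
```

Calling run() resolves with the scraped value after the three calls have happened in order; with real Puppeteer the same sequence drives an actual browser.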
First, we are going to fetch the list of cities in your country.
//Get the page where your country is, between those: A-B • C-D-E-F • G-H-I-J-K • L-M-N-O • P-Q-R-S • T-U-V-W-Y-Z
await page.goto(
  "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"
);
const cityLinks = await page.evaluate(() => {
  // set your country here
  const country = "YOUR COUNTRY";
  // remove the elements between the countries and the cities to make the scraping easier
  const thumbs = document.querySelectorAll(".thumb");
  thumbs.forEach((thumb) => thumb.remove());
  // get the list of country headings and find the index of yours
  const countries = document.querySelectorAll(".mw-headline");
  const countryIndex = Array.from(countries).findIndex((item) =>
    item.innerText.includes(country)
  );
  // the city table that directly follows each country heading
  const cityTables = document.querySelectorAll("h2 + table.wikitable");
  const cityList = cityTables[countryIndex].querySelectorAll("tbody tr");
  return Array.from(cityList).map((row) => row.querySelector("a")?.href);
});
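The matching above relies on the country headings and the city tables appearing in the same document order, so the index found among the .mw-headline elements can be reused to pick the right wikitable. Here is that index-matching idea sketched on plain arrays (the headline texts and cities are made-up examples, not scraped data):

```javascript
// Made-up headline texts standing in for the .mw-headline elements
const headlines = ["Afghanistan", "Albania", "Algeria", "Argentina"];
// One city list per country, in the same order as the headings
const cityTables = [
  ["Herat", "Kabul"],
  ["Tirana"],
  ["Algiers", "Oran"],
  ["Buenos Aires", "Córdoba"],
];

const country = "Algeria";
// Same pattern as in the page.evaluate callback: find the index of the
// matching heading, then use it to select the corresponding table
const countryIndex = headlines.findIndex((item) => item.includes(country));
const cities = cityTables[countryIndex];

console.log(countryIndex, cities); // → 2 [ 'Algiers', 'Oran' ]
```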
At this point you should have a list of cities with a link for each of them.
Now we are going to visit every page and get the data from their climate tables.
const data = [];
for (let link of cityLinks.filter((item) => !!item)) {
  await page.goto(link);
  // wait until the page's tables are rendered before evaluating
  await page.waitForSelector("table.wikitable > tbody");
  const cityData = await page.evaluate(() => {
    const name = document.querySelector("h1").innerText;
    const values = { name };
    // there are more types of data if you need them
    const labels = ["Average high", "Average precipitation", "sunshine hours"];
    const tables = Array.from(
      document.querySelectorAll("table.wikitable > tbody")
    );
    const table = tables.find((item) => item.innerText.includes("Climate"));
    labels.forEach((label) => {
      // the row whose text matches the label
      const row = Array.from(table?.children || []).find((item) =>
        item?.innerText?.includes(label)
      );
      // pair each cell with the month header from the table's second row
      const dataValues = Array.from(row?.children || []).map((item, index) => ({
        value: item?.innerText || "",
        time: table?.children?.[1]?.children?.[index]?.innerText || "",
      }));
      values[label] = dataValues;
    });
    return values;
  });
  data.push(cityData);
}
console.log(data);
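The extraction inside page.evaluate boils down to two array lookups: find the row whose text contains the label, then pair each of its cells with the header cell at the same index. The same logic can be sketched on plain nested arrays standing in for the table's children (all values below are made up for illustration):

```javascript
// A mock of the climate table's structure: index 0 is the caption row,
// index 1 holds the month headers, later rows hold one label plus values.
// The numbers are invented for this example.
const tableChildren = [
  ["Climate data for Example City"],
  ["Month", "Jan", "Feb", "Mar"],
  ["Average high", "5", "7", "12"],
  ["Average precipitation", "40", "35", "30"],
];

// Same pairing logic as in the scraper: find the row for a label, then
// pair each cell with the header cell at the same index
const extract = (label) => {
  const row = tableChildren.find((item) => item[0].includes(label));
  return (row || []).map((value, index) => ({
    value,
    time: tableChildren[1][index] || "",
  }));
};

console.log(extract("Average high"));
// → [ { value: 'Average high', time: 'Month' },
//     { value: '5', time: 'Jan' }, ... ]
```

Note that an unknown label yields an empty array rather than an error, which mirrors how the optional chaining in the real script tolerates cities whose pages lack a climate table.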
Get the full code here: Gist of the code
This should give you an array of objects, one per city, each containing the city name and the requested climate rows.
4. Conclusion
You can adapt this script to get pretty much anything: all you have to do is study the structure of the pages you want the data from and do some trial and error. As for myself, I used this method to get the initial data for a tool I'm working on: https://dreamclimate.city