Jason F

Web Scraping With Puppeteer for Total Noobs: Part 3

Hello and welcome to the third and final post of this series. If you missed the previous posts, feel free to give them a quick read before proceeding. In this post I'll show you how I set up a GitHub Action to automatically scrape some weather data on a schedule.

Before we begin I must tell you that I changed the source of the weather data due to some complications with weather.com. I'm sure it's possible to get around the issues I was having, but for the sake of moving forward without the extra kerfuffle, I changed the source to one that I believe will not cause you any grief.

Note: This post assumes you're comfortable with creating and updating a GitHub repository.

Catching up

In the previous post I shared with you the scrape function that we used to scrape weather.com and console.log the output. I'll go ahead and paste the updated scrape function, with the new source below:

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto(
    "https://www.weathertab.com/en/d/united-states/texas/austin/"
  );

  const weatherData = await page.evaluate(() =>
    Array.from(document.querySelectorAll("tr.fct_day"), (e) => ({
      dayOfMonth: e.querySelector("td > div.text-left > .fct-day-of-month")
        .innerText,
      dayName: e.querySelector("td > div.text-left > .fct-day-of-week")
        .innerText,
      weatherIcon: e.querySelector("td.text-center > .fct_daily_icon")
        .classList[1],
      weatherIconPercent: e
        .querySelector("td.text-center")
        .getElementsByTagName("div")[1].innerText,
      highTempRange: e.querySelector(
        "td.text-center > .F > div > .label-danger"
      ).innerText,
      lowTempRange: e
        .querySelector("td.text-center > .F")
        .getElementsByTagName("div")[1].innerText,
    }))
  );

  await browser.close();
  return weatherData;
}

As you can see, the source is now https://www.weathertab.com/en/d/united-states/texas/austin/. Of course, you can change this to whatever city you fancy. The logic is practically the same (we're still building an array of data), but since we switched sources, we have to target different elements than we did on the weather.com page. Rather than a 10-day forecast, we're now getting the forecast for the entire current month.
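
One caveat: these selectors assume the page structure stays exactly the same. If weathertab.com changes its markup, querySelector will return null and the .innerText access will throw. A more defensive variant of a single field (just a sketch, using optional chaining with a fallback) could look like this:

      // Defensive version of one field: if the selector no longer matches,
      // fall back to null instead of throwing a TypeError.
      dayOfMonth:
        e.querySelector("td > div.text-left > .fct-day-of-month")?.innerText ??
        null,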

The data is a bit odd without some context. However, you could build some kind of map or object to translate fields like the weather icon and its percentage into something meaningful.
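
For example, a small lookup object could turn the scraped icon class into a readable label. The class names below are placeholders I made up, not the real values from weathertab.com; you'd fill them in after inspecting your own scraped output:

// Hypothetical mapping from scraped icon classes to readable labels.
// The keys are illustrative only; replace them with the actual classList
// values you see in your weatherdata.json output.
const iconLabels = {
  "icon-sunny": "Sunny",
  "icon-showers": "Showers likely",
  "icon-thunderstorms": "Thunderstorms possible",
};

const readableForecast = weatherData.map((day) => ({
  ...day,
  weatherLabel: iconLabels[day.weatherIcon] ?? "Unknown",
}));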

Anyways, on with the show.

Preparing to write the data

In order for us to reach the end goal of writing this scraped data to a .json file in a GitHub repository, we'll need to add some final touches to the scraper.js file. Admittedly, I'm no expert when it comes to interacting with the file system using Node.js (or any language, for that matter). However, I can finagle my way around. I'll share what I've got, and if you're more knowledgeable than me on the subject, please feel free to fill in the blanks.

We want to write our scraped data to a file in our repository that we'll name weatherdata.json.

To do so, we'll add this to the end of our scraper.js file:

// execute and persist data
scrape().then((data) => {
  // persist data
  fs.writeFileSync(path.resolve(pathToData), JSON.stringify(data, null, 2));
});

The writeFileSync method is part of the Node.js file system module. You can learn more about this method here. Essentially, we're passing the path to our file, weatherdata.json, as the first parameter and our scraped data as the second. writeFileSync will create the file if it doesn't exist, or overwrite it with the new data if it does.

As I mentioned, the first parameter is the path to the weatherdata.json file, which is passed to writeFileSync like so: path.resolve(pathToData).

Since we are using ES Modules, __filename and __dirname are not readily available. I shamelessly cut and pasted the code below (from this source) in order to create the pathToData variable.

const __filename = fileURLToPath(import.meta.url);

const __dirname = path.dirname(__filename);

const pathToData = path.join(__dirname, "weatherdata.json");

Having the __dirname available helps us find the path to our weatherdata.json file.
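
If you want a quick sanity check after running node scraper.js locally, you can read the file back in and confirm it parses. This is just an optional snippet and assumes the scraper has run at least once so weatherdata.json exists:

// Optional sanity check: read weatherdata.json back in and confirm it parses.
const saved = JSON.parse(fs.readFileSync(pathToData, "utf8"));
console.log(`Loaded ${saved.length} days of forecast data`);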

That wraps up the changes to the scraper.js file, aside from the import statements at the top. Here's the whole file, so it's easier to read:

import puppeteer from "puppeteer";
import fs from "fs";
import path from "path";
import { fileURLToPath } from "url";

const __filename = fileURLToPath(import.meta.url);

const __dirname = path.dirname(__filename);

const pathToData = path.join(__dirname, "weatherdata.json");

async function scrape() {
  // Launch a headless browser and open a fresh page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the monthly forecast page for Austin, TX
  await page.goto(
    "https://www.weathertab.com/en/d/united-states/texas/austin/"
  );

  // Build one object per forecast row (tr.fct_day) in the table
  const weatherData = await page.evaluate(() =>
    Array.from(document.querySelectorAll("tr.fct_day"), (e) => ({
      dayOfMonth: e.querySelector("td > div.text-left > .fct-day-of-month")
        .innerText,
      dayName: e.querySelector("td > div.text-left > .fct-day-of-week")
        .innerText,
      weatherIcon: e.querySelector("td.text-center > .fct_daily_icon")
        .classList[1],
      weatherIconPercent: e
        .querySelector("td.text-center")
        .getElementsByTagName("div")[1].innerText,
      highTempRange: e.querySelector(
        "td.text-center > .F > div > .label-danger"
      ).innerText,
      lowTempRange: e
        .querySelector("td.text-center > .F")
        .getElementsByTagName("div")[1].innerText,
    }))
  );

  // Close the browser and return the scraped data to the caller
  await browser.close();
  return weatherData;
}

// execute and persist data
scrape().then((data) => {
  // persist data
  fs.writeFileSync(path.resolve(pathToData), JSON.stringify(data, null, 2));
});
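
One optional tweak, in case you want clearer feedback when the scrape fails: a rejected promise in Node 16 will already crash the script with a non-zero exit code, but adding a catch gives you a readable error in the logs. This is just a sketch of an alternative ending for the file:

// execute and persist data, with an explicit failure path for CI
scrape()
  .then((data) => {
    fs.writeFileSync(pathToData, JSON.stringify(data, null, 2));
  })
  .catch((error) => {
    console.error("Scrape failed:", error);
    process.exitCode = 1; // make the GitHub Action step fail
  });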

Setting Up the GitHub Action

The goal of our GitHub Action is to scrape our weather source on a weekly basis and save the latest data in a .json file.

If you've never created an action, I'll walk you through the steps.

First, find the Actions tab in your GitHub repository.

[Screenshot: the Actions tab in a GitHub repository]

Click through to the Actions page, then click the "New workflow" button. This opens a page of options, but we just want to select the "Set up a workflow yourself" option.

[Screenshot: the "Set up a workflow yourself" option on the new workflow page]

This will drop you into an editor. Feel free to name the YAML file whatever you'd like, or leave it as main.yml.

Go ahead and paste this in:

name: Resources
on:
  schedule:
    - cron: "0 13 * * 1"
  # workflow_dispatch:
permissions:
  contents: write
jobs:
  resources:
    name: Scrape
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-node@v3
        with:
          node-version: 16
      - run: npm ci

      - name: Fetch resources
        run: node ./scraper.js

      - name: Update resources
        uses: test-room-7/action-update-file@v1
        with:
          file-path: weatherdata.json
          commit-msg: Update resources
          github-token: ${{ secrets.GITHUB_TOKEN }}

After pasting it in, you can commit the new YAML file by clicking the "Start commit" button and entering a commit message.

So, there are a few things to note about this action:

  1. The action runs on a schedule using the cron property. The value I have in place triggers it every Monday at 13:00 UTC (GitHub Actions schedules are evaluated in UTC). There's a field-by-field breakdown of the expression just after this list.
  2. Make sure the JavaScript file with your scrape function is named scraper.js; if it isn't, update the action and replace scraper.js with your file name.
  3. We are using test-room-7/action-update-file@v1 to help us commit the updated weatherdata.json file. Look it up on GitHub if you'd like to know more about that action.
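
For reference, here's how that cron expression breaks down. GitHub evaluates schedules in UTC, so adjust the hour if you care about a particular local time:

# minute  hour  day-of-month  month  day-of-week
#    0     13        *          *         1        -> 13:00 UTC every Monday
- cron: "0 13 * * 1"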

If you'd rather run the action whenever you please, comment out or delete these two lines:

schedule:
    - cron: "0 13 * * 1"

and uncomment this line:

# workflow_dispatch:
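
After those two edits, the on: block of the workflow would look something like this:

on:
  workflow_dispatch: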

Then you'll see a "Run workflow" button on the workflow's page under the Actions tab that lets you run the job whenever you'd like. This works well for testing out actions.

You can see my repository here.

Conclusion

I have learned a whole heck of a lot in this series. I stepped out of my comfort zone and was able to get something working. There were some hurdles along the way, but at the end of the day, it's all about getting my hands dirty and figuring things out. I hope you learned something as well. If you found these posts interesting, boring, lacking, great, etc., please let me know. Thank you!
