
Sam Thorogood


Automate Reading Form Results with 🤖 Chrome

So, I have an upcoming internet upgrade and I want to check its 'coming soon' status. 100/40 compared to what I have now is nothing to sneeze at, so I'm reasonably excited and, of course, I've been checking the status page every few days. 🇫️5️⃣🇫️5️⃣🇫️5️⃣

Let's automate this instead so I can save my sanity. There are two options for this kind of thing and I want to go through both.

1. Send a raw HTTP request

First, I've opened the "check my address" page and opened Chrome's DevTools (or I guess Edgium's DevTools too, now) to the Network tab. I've found my address and submitted the form. Let's look at the requests.

Network requests

Some APIs are intended to be used publicly. I've spent a bit of time on this one though, and it's a pain: it needs a valid cookie to be set, and that's hard to get right. 😡
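
For comparison, a raw request might look roughly like the sketch below. Everything specific in it is an assumption: the endpoint URL, the `address` field name, and the cookie value are placeholders, and you'd copy the real ones from the Network tab. It only builds the arguments for `fetch` (built into Node 18+). The hard part, getting a cookie the server will accept, isn't solved here.

```javascript
// Hypothetical endpoint and field name; copy the real ones from DevTools.
const ENDPOINT = 'https://example.com/check-address';

function buildAddressRequest(address, cookie) {
  // Form posts are commonly sent as URL-encoded key=value pairs.
  const body = new URLSearchParams({address}).toString();
  return {
    url: ENDPOINT,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': cookie,  // the session cookie the server expects
      },
      body,
    },
  };
}

// Usage (Node 18+ has a global fetch):
//   const {url, init} = buildAddressRequest('Your Address', 'session=...');
//   const response = await fetch(url, init);
//   console.info(response.status, await response.text());
```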

Let's instead be lazy, and use Chrome's headless mode!

2. Using Chrome and Puppeteer

Instead of trying to match the HTTP request ourselves, we can just pretend to be a real user and go through the form flows programmatically. Let's start:

$ yarn add puppeteer
# or, if you use npm:
$ npm i puppeteer

And create a tiny script (run.js) to get started:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://www.google.com/', {waitUntil: 'networkidle2'});
})();  // call the async function right away, otherwise nothing runs

Great! Save and run (node run.js). You'll see Chromium launch and open Google. Notably, we've set {headless: false}—this is useful during development so you can see what's going on—but you might turn it off when you deploy. 📴

Hit Ctrl-C in your terminal when you're done marvelling at Google. You should replace the URL with whatever form you'd like to scrape.

a. Page Interaction

For my example, I need to put my address in an input box first. Open your target page in a normal browser, right-click on it, "Inspect Element", and check it out.

the Find Address box

Notably, it has an ID, which is great! We can use a single CSS selector to find it. Let's type some text into it, inside our main function:

  await page.goto('https://example.com/', {waitUntil: 'networkidle2'});
  await page.type('#findAddress', 'Your Address');

Rinse and repeat until you've entered all your user data.
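
If there are several fields, the "rinse and repeat" can be wrapped in a small helper. This is just a sketch: `fillForm` is a name I've made up, and any selector other than `#findAddress` is invented for illustration.

```javascript
// Type a value into each field, one after another.
// `fields` maps a CSS selector to the text to enter.
async function fillForm(page, fields) {
  for (const [selector, text] of Object.entries(fields)) {
    await page.type(selector, text);
  }
}

// Usage inside the main function (the #postcode selector is made up):
//   await fillForm(page, {
//     '#findAddress': 'Your Address',
//     '#postcode': '2000',
//   });
```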

For some pages, you might need to click a button to submit a form. In my case, I must wait for my matched address to appear. By doing this manually, you can find out what selector to click on:

Matched address

You can instruct Puppeteer to wait for a certain element to appear on the page (because it's being added by the page's JS when an operation finishes), then click it:

  const target = '.ui-autocomplete a.ui-corner-all';
  await page.waitForSelector(target);
  await page.click(target);
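
By default, `page.waitForSelector` throws if the element doesn't show up in time. If you'd rather handle that case yourself, something like this sketch could work. `clickWhenReady` and the 10 second default are my own choices, not anything Puppeteer requires.

```javascript
// Wait for an element to appear, then click it. Returns false instead of
// throwing if waiting fails (for example, because it timed out).
async function clickWhenReady(page, selector, timeout = 10000) {
  try {
    await page.waitForSelector(selector, {timeout});
  } catch (e) {
    console.warn(`Gave up waiting for ${selector}: ${e.message}`);
    return false;
  }
  await page.click(selector);
  return true;
}

// Usage:
//   const clicked = await clickWhenReady(page, '.ui-autocomplete a.ui-corner-all');
```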

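Remember, you can run your script with {headless: false} as much as you like. Every instance of Chrome it starts will be hermetic: each launch gets a fresh temporary profile by default, so cookies and history from earlier runs don't carry over.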

b. Getting Data

Once you submit your final form, you can probably wait for the results using page.waitForSelector, or perhaps another waiting option.
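
One such waiting option, if submitting the form loads a whole new page, is `page.waitForNavigation`. Here's a sketch, assuming a submit button you can click. `clickAndWaitForLoad` is a made-up helper name and `#submit` is a placeholder selector.

```javascript
// Click something that triggers a page load, and wait for that load.
// The wait is started before the click so a fast navigation isn't missed.
async function clickAndWaitForLoad(page, selector) {
  await Promise.all([
    page.waitForNavigation({waitUntil: 'networkidle2'}),
    page.click(selector),
  ]);
}

// Usage:
//   await clickAndWaitForLoad(page, '#submit');
```

If the results are added to the same page by JS instead, page.waitForSelector is probably the better fit.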

To extract data from the page, we can run page.evaluate, or in our case, a derivative, page.$eval, which accepts a selector and passes the matching element as the first argument to your function. In my case, I'm looking for:

  const results = await page.$eval('.poi_results tbody', (tbody) => {
    // do stuff
  });

It's worth noting that Puppeteer's API is actually serializing the method you pass to the page (the whole (tbody) => { ... }). This means you can't access variables from outside that function's scope. If you need to pass more values, you can add them to $eval, like this:

   await page.$eval('.selector', (selectorResult, arg1, arg2) => {
     // arg1, arg2 (and more?) are brought in from outside
   }, arg1, arg2);
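
To see why outside variables go missing, here's a Puppeteer-free sketch of roughly what serializing a function does. `sendToPage` is a stand-in I've made up; it isn't how Puppeteer is implemented, it just turns a function into source text and rebuilds it somewhere else.

```javascript
function serializationDemo() {
  const greeting = 'hello';  // a local variable, outside the functions below

  // Stand-in for sending a function to the page: turn it into source text,
  // then rebuild it somewhere that can't see this function's local variables.
  const sendToPage = (fn) => new Function(`return (${fn.toString()});`)();

  const usesOutside = sendToPage(() => greeting.toUpperCase());
  const usesArgument = sendToPage((text) => text.toUpperCase());

  let outsideResult;
  try {
    outsideResult = usesOutside();
  } catch (e) {
    outsideResult = e.name;  // `greeting` doesn't exist on the other side
  }
  return {outsideResult, argumentResult: usesArgument(greeting)};
}

console.info(serializationDemo());
```

The version that reaches for `greeting` directly fails with a ReferenceError, while the version that receives it as an argument works. That's the same reason extra values have to be passed through $eval.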

For me, my final method looks like this, because I'm reading from a table with keys and values in each row:

  // returns [{key: 'Ready Date', value: '14 June 2019'}, ... ]
  const results = await page.$eval('.poi_results tbody', (tbody) => {
    return Array.from(tbody.children).map((tr) => {
      const key = tr.firstElementChild;
      const value = tr.lastElementChild;
      return {
        key: key.textContent,
        value: value.textContent,
      };
    });
  });
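
If you'd rather look values up by name than loop over the array, the results can be turned into a plain object. This is an optional sketch (`toLookup` is a made-up name), and it assumes every key in the table is unique.

```javascript
// Turn [{key, value}, ...] into a plain object, trimming the stray
// whitespace that table cells often contain.
function toLookup(results) {
  return Object.fromEntries(
    results.map(({key, value}) => [key.trim(), value.trim()])
  );
}

// Usage:
//   const status = toLookup(results);
//   console.info(status['Ready Date']);
```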

c. Diff

To put it together, we can save the result to a file and determine what's changed when you run it. Add some dependencies:

const fs = require('fs');
const jsdiff = require('diff');  // yarn add diff / npm i diff

And compare the output:

  const out = results.map(({key, value}) => {
    return `${key}: ${value}\n`;
  }).join('');

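  let prev = '';
  try {
    // read as text; without an encoding this returns a Buffer
    prev = fs.readFileSync('status.txt', 'utf8');
  } catch (e) {
    // no previous run yet, so compare against an empty string
  }

  const changes = jsdiff.diffTrimmedLines(prev, out);
  console.info(changes);
  fs.writeFileSync('status.txt', out);  // save for the next run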

JSDiff produces a list of individual changes. I'll leave formatting them to the reader. For me, my script ended up generating something like:

Final output
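
If you want a starting point for the formatting, here's one possible sketch. It relies on each change object having a `value` string plus `added` or `removed` flags, which is how JSDiff describes its results. The +/- layout and the `formatChanges` name are my own choices, and this isn't necessarily how I produced the output above.

```javascript
// Format a list of JSDiff change objects as +/- lines.
// Each change has a `value` string (possibly several lines) plus `added`
// or `removed` set when that chunk was inserted or deleted.
function formatChanges(changes) {
  const lines = [];
  for (const change of changes) {
    const prefix = change.added ? '+ ' : change.removed ? '- ' : '  ';
    for (const line of change.value.split('\n')) {
      if (line !== '') {  // skip empty lines, including the trailing one
        lines.push(prefix + line);
      }
    }
  }
  return lines.join('\n');
}

// Usage:
//   console.info(formatChanges(changes));
```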

d. Close the Browser

Be sure to close the browser once you're done, so the script can end:

  await browser.close();
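
If something throws halfway through (say, a selector never appears), that close call is never reached and the script hangs around. One way to guard against that is try/finally. This is a sketch: `withBrowser` is a made-up helper, and it takes `puppeteer` as a parameter only to keep the example self-contained.

```javascript
// Run `task(page)` in a fresh browser and always close the browser
// afterwards, even if the task throws.
async function withBrowser(puppeteer, task) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    await browser.close();
  }
}

// Usage:
//   const puppeteer = require('puppeteer');
//   withBrowser(puppeteer, async (page) => {
//     await page.goto('https://example.com/', {waitUntil: 'networkidle2'});
//     // ... fill in the form and read the results ...
//   });
```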

This might also be a good time to remove {headless: false} from the top of the program, so that your automated tool can actually... be automated.
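
Rather than editing the file each time you want to watch it work, you could switch on an environment variable. This is a small sketch, and `SHOW_BROWSER` is a variable name I've invented, not something Puppeteer reads itself.

```javascript
// Decide launch options from the environment: headless by default, with a
// visible window only when SHOW_BROWSER=1 is set (a made-up variable name).
function launchOptions(env = process.env) {
  return {headless: env.SHOW_BROWSER !== '1'};
}

// Usage:
//   const browser = await puppeteer.launch(launchOptions());
//
//   $ SHOW_BROWSER=1 node run.js   # watch it work
//   $ node run.js                  # run unattended
```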

e. Run Every Day

For me, I run this script every day via a crontab on a Linux server I own, and the results are emailed to me. It's also possible to run Puppeteer on Firebase Functions, App Engine, or your cloud service of choice.
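
If your cron setup emails whatever the job prints (a common configuration, but an assumption about your server), one option is to print only when something actually changed, so you only hear about news. Here's a sketch with a made-up `hasChanges` helper that works on the list from the Diff step:

```javascript
// True if any JSDiff change object marks an addition or a removal.
function hasChanges(changes) {
  return changes.some((change) => change.added || change.removed);
}

// Usage, after computing `changes`:
//   if (hasChanges(changes)) {
//     console.info(changes);
//   }
```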

Digression

I'm in Australia 🇦🇺, and this upgrade is part of an absolute mess of a government infrastructure project known as the NBN. Functionally it's an Ethernet bridge between you and your ISP, provided by the government (since the "last mile" is a natural monopoly).

Thanks!

I hope you've learned something about Puppeteer and scraping! Puppeteer is most commonly used for automated testing, or for using features of the browser like generating PDFs, and you'll find plenty more articles online.
