
Automate Reading Form Results with Chrome

Sam Thorogood (4 min read)

Blog-A-Day in June (19 Part Series)

1) Rebuild only when necessary in Node
2) Civilization is a game you never lose
3) Arrow functions break JavaScript parsers
4) Detecting Select All on the Web
5) Declaring JS Variables in 2019
6) Sam's dotfiles highlights
7) Automate Reading Form Results with Chrome
8) Beyond appendChild: Better convenience methods for HTML
9) AMA, Sam 10-yr Googler in Web DevRel
10) Disable a HTML form while in-flight using fieldset
11) PWAs that download like apps
12) Matching elements with selectors in JS
13) Install This PWA To Continue
14) Google Assistant now supports "Open/Close" devices
15) Modern Web Components
16) What To Expect When You're Expecting To Drop IE11
17) Divert Vertical Scroll To The Side
18) Graceful Shutdown Is A Lie
19) Progress Indicator With Fetch

So, I have an upcoming internet upgrade and I want to check its 'coming soon' status. Because (well, 100/40 compared to what I have now is nothing to sneeze at) I'm reasonably excited, I've of course been checking the status page every few days.

Let's automate this instead so I can save my sanity. There are two options for this kind of thing and I want to go through both.

1. Send a raw HTTP request

First, I've opened the "check my address" page and opened Chrome's DevTools (or I guess Edgium's DevTools too, now) to the Network tab. I've found my address and submitted the form. Let's look at the requests.

Network requests

Some APIs are intended to be used publicly. I've spent a bit of time on this one though, and it's a pain: it needs a valid cookie to be set, and that's hard to get right.
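
For comparison, a raw request might look roughly like the sketch below. The URL, the form field name and the cookie value are all placeholders, not the real service's API; in practice you would copy them from the request shown in the Network tab.

```javascript
// Sketch only: the endpoint, field name and cookie are hypothetical placeholders.
function buildStatusRequest(address, cookie) {
  return {
    url: 'https://example.com/check-address',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        // This is the hard part: the value must be a cookie the server accepts.
        'Cookie': cookie,
      },
      // Encode the form fields the same way a browser form submission would.
      body: new URLSearchParams({address}).toString(),
    },
  };
}

const {url, options} = buildStatusRequest('1 Example St', 'session=abc123');
console.info(url, options);
// With Node 18+, you could then send it with: fetch(url, options)
```

Building the request is the easy part; obtaining a cookie that the server will accept is what makes this approach painful.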

Let's instead be lazy, and use Chrome's headless mode!

2. Using Chrome and Puppeteer

Instead of trying to match the HTTP request ourselves, we can just pretend to be a real user and go through the form flows programmatically. Let's start:

$ yarn add puppeteer  # or, with npm:
$ npm i puppeteer

And create a tiny script (run.js) to get started:

const puppeteer = require('puppeteer');

(async () => {
  // headless: false opens a visible browser window
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  // networkidle2 waits until network activity has mostly settled
  await page.goto('https://www.google.com/', {waitUntil: 'networkidle2'});
})();  // note the trailing (), which actually invokes the function

Great! Save and run (node run.js). You'll see Chromium launch and open Google. Notably, we've set {headless: false}. This is useful during development so you can see what's going on, but you might remove it when you deploy.

Hit Ctrl-C in your terminal when you're done marvelling at Google. You should replace the URL with whatever form you'd like to scrape.

a. Page Interaction

For my example, I need to put my address in an input box first. Open your target page in a normal browser, right-click on it, "Inspect Element", and check it out.

the Find Address box

Notably, it has an ID, which is great! We can use a single CSS selector to find it. Let's type some text into it, inside our main function:

  await page.goto('https://example.com/', {waitUntil: 'networkidle2'});
  await page.type('#findAddress', 'Your Address');

Rinse and repeat until you've entered all your user data.
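
If there are several fields, the "rinse and repeat" step could be wrapped in a small helper. This is only a sketch: fillForm is a made-up name, and the selectors in the usage comment are hypothetical (only #findAddress comes from my page).

```javascript
// Types each value into the field matched by its selector, in order.
// `page` is a Puppeteer Page; `fields` maps CSS selectors to the text to type.
async function fillForm(page, fields) {
  for (const [selector, text] of Object.entries(fields)) {
    // Wait first, in case the field is added by the page's JS.
    await page.waitForSelector(selector);
    await page.type(selector, text);
  }
}

// Usage, inside the main async function:
//   await fillForm(page, {
//     '#findAddress': 'Your Address',
//     '#postcode': '2000',  // hypothetical second field
//   });
```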

For some pages, you might need to click a button to submit a form. In my case, I must wait for my matched address to appear. By doing this manually, you can find out what selector to click on:

Matched address

You can instruct Puppeteer to wait for a certain element to appear on the page (because it's being added by the page's JS when an operation finishes), then click it:

  const target = '.ui-autocomplete a.ui-corner-all';
  await page.waitForSelector(target);
  await page.click(target);
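
If the element might never appear (say, the address has no match), it can help to give up gracefully rather than let the script crash. Here's a minimal sketch, assuming a helper name (clickWhenReady) and a ten-second timeout that are my own choices for illustration:

```javascript
// Waits for `selector` to appear, then clicks it. Returns false instead of
// throwing if waiting fails (for example, because the timeout elapsed).
async function clickWhenReady(page, selector, timeout = 10000) {
  try {
    await page.waitForSelector(selector, {timeout});
  } catch (e) {
    console.warn(`Gave up waiting for ${selector}: ${e.message}`);
    return false;
  }
  await page.click(selector);
  return true;
}

// Usage, inside the main async function:
//   const found = await clickWhenReady(page, '.ui-autocomplete a.ui-corner-all');
```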

Remember, you can run your script with {headless: false} as much as you like. Every instance of Chrome it starts will be hermetic.

b. Getting Data

Once you submit your final form, you can probably wait for the results using page.waitForSelector, or perhaps another waiting option.

To extract data from the page, we can run page.evaluate, or in our case, a derivative, page.$eval, which accepts a selector and passes the matching element as the first argument to your function. In my case, I'm looking for:

  const results = await page.$eval('.poi_results tbody', (tbody) => {
    // do stuff
  });

It's worth noting that Puppeteer's API is actually serializing the method you pass to the page (the whole (tbody) => { ... }). This means you can't access variables from outside that function's scope. If you need to pass more values, you can add them to $eval, like this:

   await page.$eval('.selector', (selectorResult, arg1, arg2) => {
     // arg1, arg2 (and more?) are brought in from outside
   }, arg1, arg2);
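
To see why outside variables go missing, here's a rough simulation in plain Node. It is not Puppeteer's actual implementation, just the same idea: only the function's source text travels, so anything from the enclosing scope is left behind. shipToPage and demo are made-up names.

```javascript
// Rebuilds a function from its source text, outside of its original scope,
// then calls it. This loosely mimics sending a function to the page.
function shipToPage(fn, ...args) {
  const rebuilt = new Function(`return (${fn.toString()})`)();
  return rebuilt(...args);
}

function demo() {
  const suffix = ' (checked)';

  let closureResult;
  try {
    // This function relies on `suffix` from the enclosing scope.
    closureResult = shipToPage((text) => text + suffix, 'Ready');
  } catch (e) {
    closureResult = e.name;  // the rebuilt function can't see `suffix`
  }

  // Passing the value explicitly works, like the extra $eval arguments.
  const argResult = shipToPage((text, suffix) => text + suffix, 'Ready', suffix);
  return {closureResult, argResult};
}

console.info(demo());
// { closureResult: 'ReferenceError', argResult: 'Ready (checked)' }
```

As far as I know, Puppeteer also serializes the extra arguments themselves, so it's safest to stick to plain values like strings, numbers and simple objects.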

For me, my final method looks like this, because I'm reading from a table with keys and values in each row:

  // returns [{key: 'Ready Date', value: '14 June 2019'}, ... ]
  const results = await page.$eval('.poi_results tbody', (tbody) => {
    return Array.from(tbody.children).map((tr) => {
      const key = tr.firstElementChild;
      const value = tr.lastElementChild;
      return {
        key: key.textContent,
        value: value.textContent,
      };
    });
  });
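
Once the rows are back in Node, they're ordinary objects. As a small sketch (findValue is a made-up helper, and the sample rows are written by hand in the same shape), this is one way to pull out a single field:

```javascript
// Finds the value for a given key in the scraped rows, trimming the stray
// whitespace that textContent often carries. Returns undefined if absent.
function findValue(results, wantedKey) {
  const row = results.find(({key}) => key.trim() === wantedKey);
  return row ? row.value.trim() : undefined;
}

// Hand-written sample in the same shape as the scraped results.
const sample = [
  {key: ' Ready Date ', value: '14 June 2019\n'},
];
console.info(findValue(sample, 'Ready Date'));  // prints "14 June 2019"
```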

c. Diff

To put it together, we can save the result to a file and determine what's changed when you run it. Add some dependencies:

const fs = require('fs');
const jsdiff = require('diff');  // yarn add diff / npm i diff

And compare the output:

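  const out = results.map(({key, value}) => {
    return `${key}: ${value}\n`;
  }).join('');

  let prev = '';
  try {
    // read as text; without an encoding, readFileSync returns a Buffer
    prev = fs.readFileSync('status.txt', 'utf8');
  } catch (e) {
    // no previous run, so compare against an empty string
  }

  const changes = jsdiff.diffTrimmedLines(prev, out);
  console.info(changes);

  // save the latest result for the next run
  fs.writeFileSync('status.txt', out);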

JSDiff produces a list of individual changes. I'll leave formatting them to the reader. For me, my script ended up generating something like:

Final output
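
If you'd like a starting point for the formatting, here's a minimal sketch. It assumes each change is an object with a value string plus optional added and removed flags, which is the shape JSDiff returns. The sample data is hand-written for illustration (the second date is made up), and formatChanges is my own name.

```javascript
// Formats jsdiff-style change objects ({value, added, removed}) as lines
// prefixed with "+" or "-", skipping anything that stayed the same.
function formatChanges(changes) {
  const lines = [];
  for (const {value, added, removed} of changes) {
    if (!added && !removed) {
      continue;  // unchanged text
    }
    const prefix = added ? '+ ' : '- ';
    for (const line of value.split('\n')) {
      if (line.trim()) {
        lines.push(prefix + line.trim());
      }
    }
  }
  return lines.join('\n');
}

// Hand-written sample in the shape that jsdiff returns.
const sampleChanges = [
  {value: 'Status: Coming soon\n'},
  {value: 'Ready Date: 14 June 2019\n', removed: true},
  {value: 'Ready Date: 21 June 2019\n', added: true},
];
console.info(formatChanges(sampleChanges));
// - Ready Date: 14 June 2019
// + Ready Date: 21 June 2019
```

An empty string means nothing changed, so you can skip printing entirely in that case. That's handy if the script runs under cron, which typically only sends mail when a job produces output.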

d. Close the Browser

Be sure to close the browser once you're done, so the script can end:

  await browser.close();

This might also be a good time to remove {headless: false} from the top of the program, so that your automated tool can actually... be automated.
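
One catch: if any step throws (a selector never appears, say), the close call is never reached and the script may hang. A try/finally wrapper is one way around that. This is a sketch, with withBrowser being a made-up helper name; it takes the launch function as an argument so it doesn't depend on how you start the browser.

```javascript
// Runs `task` with a fresh browser and always closes it afterwards, even if
// the task throws. `launch` is any function that resolves to a browser,
// e.g. () => puppeteer.launch().
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Usage sketch:
//   const puppeteer = require('puppeteer');
//   withBrowser(() => puppeteer.launch(), async (browser) => {
//     const page = await browser.newPage();
//     // ... go to the page, fill in the form, read the results ...
//   });
```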

e. Run Every Day

For me, I run this script every day via a crontab on a Linux server I own, and the results are emailed to me. It's also possible to run Puppeteer on Firebase Functions, App Engine, or your cloud service of choice.

Digression

I'm in Australia, and this upgrade is part of an absolute mess of a government infrastructure project known as the NBN. Functionally it's an Ethernet bridge between you and your ISP, provided by the government (since the "last mile" is a natural monopoly).

Thanks!

I hope you've learned something about Puppeteer and scraping! Puppeteer is most commonly used for automated testing, or for browser features like generating PDFs, and you'll find plenty more articles about it online.


Sam Thorogood

@samthor

Developer Relations for Web at Google.
