Write the scraping script

Gonçalo Morais
UX Engineer, @recursecenter alumnus, ESTJ. Getting into Ember at the moment, previously Rails and Vue. Runner and climber. I grow a beard most of the time. 🤘 light themes 🤘
Originally published at blog.gnclmorais.com

I’ve had a few situations in the past where I was waiting for something to get updated on a website and just kept refreshing the page every so often… But when you don’t know when that update is going to happen, this can get tedious and hey, we’re programmers, we can build something to do this for us!

Puppeteer is “a Node library which provides a high-level API to control Chrome” and it’s the one I usually reach for because it makes building a basic web scraper super simple. Let’s dig in and build a Minimum Viable Product that, for the sake of this example, grabs the top news from The New York Times’ Today’s Paper.

Project start

Begin by creating a package.json that will hold the project’s dependencies. You can use npm init for this, but for simplicity’s sake, I’ll create a stripped-down version:

// package.json
{
  "name": "web-scraper-with-puppeteer",
  "version": "1.0.0",
  "private": true
}

Now we add our only dependency, Puppeteer. Run this on the terminal:

npm install puppeteer

Your package.json has changed a bit now; here’s the difference:

 {
   "name": "web-scraper-with-puppeteer",
   "version": "1.0.0",
- "private": true
+ "private": true,
+ "dependencies": {
+   "puppeteer": "^9.1.1"
+ }
 }

Let’s start with our main script now. Open up a brand new index.js and write the following:

// index.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();

  await page.goto(
    'https://nytimes.com/section/todayspaper'
  );
  await browser.close();
})();

This is a simple script that you can already run with node index.js to check that everything works so far. You should see a Chrome window open (because we specified headless: false) and close as soon as the page finishes loading. So far so good! Let’s now grab the first article on the page from the DOM.

Add the next lines to your script to grab the first article and just output its HTML, so we can see if we’re retrieving the right thing:

   await page.goto(
     'https://nytimes.com/section/todayspaper'
   );
+
+ const firstArticle = await page.$eval(
+   'article:first-of-type',
+   e => e.outerHTML
+ );
+
+ console.log(firstArticle);
+
   await browser.close();
 })();

Run your script with node index.js and you should see a lot of HTML inside an <article> tag on your console. We’re almost there!

Now, we don’t want the full article, only its headline and summary. Looking closer at the HTML we get, we see an h2 and the first p that look promising. Let’s refactor our code a bit to have firstArticle as a variable we can use, create a function to be used for both the header and the summary, and pluck both of them to show on the console:

     'https://nytimes.com/section/todayspaper'
   );

- const firstArticle = await page.$eval(
-   'article:first-of-type',
-   e => e.outerHTML
- );
+ const firstArticle = await page.$('article:first-of-type');
+
+ const getText = (parent, selector) => {
+   return parent.$eval(selector, el => el.innerText);
+ };
+
+ const header = await getText(firstArticle, 'h2');
+ const summary = await getText(firstArticle, 'p:first-of-type');

- console.log(firstArticle);
+ console.log(`${header}\n${summary}`);

   await browser.close();
 })();

Go ahead, run that on the terminal and you should see two lines: the top one is the header and the bottom one is the summary of the article!
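One thing to keep in mind: if the page markup changes, page.$ resolves to null when nothing matches, and $eval throws when its selector finds no element. Here is a minimal sketch of a more forgiving getText, assuming you would rather get a fallback string than a crash. The name safeGetText and its fallback parameter are made up for this example; they are not part of Puppeteer.

```javascript
// Hypothetical helper (not part of Puppeteer): returns the element's text,
// or `fallback` when the parent handle is missing or nothing matches.
const safeGetText = async (parent, selector, fallback = '') => {
  // page.$ resolves to null when the selector matches nothing
  if (!parent) return fallback;

  try {
    return await parent.$eval(selector, el => el.innerText);
  } catch (error) {
    // $eval throws when no element matches the selector.
    // Note: this also swallows any other error raised by $eval.
    return fallback;
  }
};
```

You could then swap it in for getText, for example: const header = await safeGetText(firstArticle, 'h2', '(no headline)');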

To be honest, that’s it! 🎉 A web scraper doesn’t need to be fancy or complicated; it really depends on what you are trying to fetch from a page. I had one running for a few days a while back (which I’ll write about in a following article) and it was basically doing much the same on another page, just checking whether a specific string of text had changed yet.
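That “check whether a string has changed” idea can be sketched in a few lines. This is only an illustration, not the script I actually ran: waitForChange, sleep, intervalMs and maxChecks are names invented for this example, and fetchText stands for any async function you supply that returns the text you care about.

```javascript
// Hypothetical helpers for illustration; none of these names come from Puppeteer.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Calls fetchText once to get a baseline, then polls it every intervalMs.
// Resolves with the new text as soon as it differs from the baseline,
// or with null if maxChecks polls pass without a change.
const waitForChange = async (
  fetchText,
  { intervalMs = 60000, maxChecks = Infinity } = {}
) => {
  const initial = await fetchText();

  for (let checks = 0; checks < maxChecks; checks++) {
    await sleep(intervalMs);
    const current = await fetchText();
    if (current !== initial) return current;
  }

  return null;
};
```

In the script above, fetchText could be something like async () => { await page.reload(); return page.$eval('h2', el => el.innerText); }, so that every check reloads the page and reads the headline again.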

Having said that, there is so much more you can do with Puppeteer; the sky is the limit. Check its documentation for the available methods and for official examples of wild things you can use it for. You can even use it to automate performance work!

See you around soon for the second part of this article…
