
How to scrape that web page with Node.js and puppeteer

Francesco Napoletano ・3 min read

If you're like me, sometimes you badly want to scrape a web page. You probably want some data in a readable format, or just need a way to re-crunch that data for other purposes.

I solemnly swear that I am up to no good.

I've found my optimal setup after many tries with Guzzle, BeautifulSoup, etc... Here it is:

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

What does it mean? It means you can run a Chrome instance and put it at your service. Cool, isn't it?

Let's see how to do it.

Setup

Yes, the usual setup. Fire up your terminal, create a folder for your project and run npm init in the folder.

When you're set up you'll have a package.json file. We're good to go. Now run npm i -S puppeteer to install Puppeteer.

A little warning: Puppeteer will download a full version of Chromium into your node_modules folder.

Don't worry: since version 1.7.0 Google publishes the puppeteer-core package, a version of Puppeteer that doesn't download Chromium by default.

So, if you're willing to try it, just run npm i -S puppeteer-core

puppeteer-core is intended to be a lightweight version of puppeteer for launching an existing browser installation or for connecting to a remote one.
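Because puppeteer-core ships without a browser, the launch call has to point at one. A minimal sketch, assuming Chrome or Chromium is already installed on the machine; the helper names and the CHROME_PATH environment variable are made up for this example, while executablePath is the launch option Puppeteer documents for this purpose:

```javascript
// Build the launch options for an already-installed browser.
// The path is supplied by the caller; nothing is downloaded.
function launchOptions(executablePath) {
  if (!executablePath) {
    throw new Error('puppeteer-core needs the path of an installed browser');
  }
  return { headless: true, executablePath: executablePath };
}

// Launch the existing browser (hypothetical helper, not part of Puppeteer).
async function openBrowser(executablePath) {
  // Required lazily so launchOptions can be used without the package installed.
  const puppeteer = require('puppeteer-core');
  return puppeteer.launch(launchOptions(executablePath));
}
```

Something like openBrowser(process.env.CHROME_PATH) would then give you the same kind of browser object that the full puppeteer package returns from launch().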

Ok, we're good to go now.

Your first scraper

Touch an index.js file in the project folder and paste this code in it.

That's all you need to set up a web scraper. You can also find it in my repo https://github.com/napolux/puppy.

Let's dig a bit into the code

For the sake of our example we'll just grab all the post titles and URLs from my blog homepage. To add a nice touch we'll change our user-agent in order to look like a good old iPhone while browsing the webpage we're scraping.

And because we're lazy, we'll inject jQuery into the page in order to use its wonderful CSS selectors.

So... Let's go line by line:

  • Lines 1-2: we require Puppeteer and set the URL of the website we're going to scrape.
  • Line 4: we launch Puppeteer. Please remember we're in the kingdom of Lord Asynchronous, so everything is a Promise, is async, or has to wait for something else ;) As you can see, the conf is self-explanatory: we're telling the script to run Chromium headless (no UI).
  • Lines 5-10: the browser is up, so we create a new page, set the viewport size to a mobile screen, set a fake user-agent and open the webpage we want to scrape. To be sure that the page is loaded, we wait for the selector body.blog to be there.
  • Line 11: as I said, we inject jQuery into the page.
  • Lines 13-28: here is where the magic happens: we evaluate our page and run some jQuery code to extract the data we need. Nothing fancy, if you ask me.
  • Lines 31-37: we're done, so we close the browser and print out our data.

Run from the project folder node index.js and you should end up with something like...

Post: Blah blah 1? URL: https://coding.napolux.com/blah1/
Post: Blah blah 2? URL: https://coding.napolux.com/blah2/
Post: Blah blah 3? URL: https://coding.napolux.com/blah3/

Recap

So, welcome to the world of web scraping. It was easier than expected, right? Just remember that web scraping is a controversial matter: please scrape only websites you're authorized to scrape.

No. As the owner of https://coding.napolux.com I don't authorize you.

I'll leave it to you to figure out how to scrape AJAX-based webpages ;)

Originally published @ https://coding.napolux.com

Discussion

Gon

This is a great and concisely well-explained article. I decided to try using the whole block within lines 13-28, and I keep getting errors of

< (node:65901) UnhandledPromiseRejectionWarning: Error: Evaluation failed: ReferenceError: reject is not defined

How could I resolve this error?

Francesco Napoletano Author

Well, the puppeteer.launch().then(async browser => { etc... is a promise itself, so the reject is there.

Just tried the code and it still works.

USB-internet

Hi,
Francesco Napoletano,
Your code is great !!!

But, I can not save data to a .txt file. It reports an Undefined error. Help me fix it. Why use:

for(var i = 0; i < result.length; i++) {
  console.log('Post: ' + result[i].title + ' URL: ' + result[i].url);
}

I can not export the value, it just seems to print to the screen
If exported to .txt file, it appears Undefined error. Please help me export the .txt file

!!! Thanks

NO ERROR BUT devnew Undefined !!!
var devnew = result.title ;

fs.writeFile('devnew.txt',devnew,'utf8');

Mihail Malo

Title says scrap instead of scrape

Menj

how to save the result on a MySQL database?