Scraping dev.to with Puppeteer: Devices and Search

#puppeteer #node #showdev #scraping

Introduction

In the last article, we started using Puppeteer in a small command-line application. We took screenshots, simulated clicks, and generated PDFs. In this second article, we'll keep building on that application. This time, we will add the following features:

Given a device, we will take a screenshot of the dev.to homepage displayed on that device.
Given a search query, we will retrieve the titles, authors, reactions, and comments of the articles dev.to displays for that query.

Screenshot devices

First, let's create a folder called screenshots-devices, where we will store the screenshots.
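If you'd rather not create the folder by hand, the script could create it at startup. Here is a minimal sketch using Node's built-in fs and path modules. The ensureDir helper is a hypothetical name and is not part of the original application.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper: create the folder if it is missing and return its absolute path.
// With { recursive: true }, mkdirSync does not throw when the folder already exists.
const ensureDir = dir => {
    fs.mkdirSync(dir, { recursive: true })
    return path.resolve(dir)
}
```

Calling ensureDir('screenshots-devices') once before taking a screenshot should be enough, and it is safe to call on every run.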

The concept is the following: Puppeteer provides a list of devices that it can emulate, so we can see what our application will look like on each of them.

Next, let's add a case to our switch statement to handle the new feature. The function will be called getScreenshotDevice, and the command-line argument that triggers it will be getScreenDevice.

switch (process.argv[2]) {
    case 'getScreen':
        getScreenshot(process.argv[3])
        break
    case 'getPDF':
        getPDF(process.argv[3])
        break
    case 'getScreenDevice':
        getScreenshotDevice(process.argv[3])
        break
    default:
        console.log('Wrong argument!')
}

We now need to create the getScreenshotDevice function.

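const getScreenshotDevice = async device => {
    // Look up the device descriptor (viewport and user agent) by its label
    const d = puppeteer.devices[device]
    if (!d) {
        console.log(`Unknown device: ${device}`)
        return
    }
    let browser
    try {
        browser = await puppeteer.launch()
        const page = await browser.newPage()
        // Apply the device's viewport and user agent before navigating
        await page.emulate(d)
        await page.goto('https://dev.to')
        await page.screenshot({
            path: `screenshots-devices/${device}.png`,
            fullPage: true
        })
    } catch (e) {
        console.log(e)
    } finally {
        // Close the browser even if one of the steps above failed
        if (browser) await browser.close()
    }
}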

The function takes one argument: the device on which we want to display the dev.to homepage. The devices Puppeteer can emulate are available through puppeteer.devices. Some examples:

iPhone 6
iPhone X
iPad
Pixel 2 landscape

All the devices supported can be found here.

After retrieving the proper device information from Puppeteer, we call page.emulate(d) so that the page uses that device's viewport and user agent. The rest is pretty much what we did for the other screenshot features. We just save the result in a different folder.
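Device labels are case-sensitive, so a typo such as 'iphone x' finds nothing. As a rough sketch, a more forgiving lookup could look like the code below. Everything in it is an assumption made for illustration: findDevice is a hypothetical helper, sampleDevices is a stand-in for puppeteer.devices (assumed here to be a plain object keyed by label), and the descriptor values are illustrative rather than copied from Puppeteer.

```javascript
// Stand-in for puppeteer.devices so the sketch runs without launching a browser.
// The shape (userAgent + viewport) mirrors what page.emulate expects;
// the values below are illustrative only.
const sampleDevices = {
    'iPhone X': {
        userAgent: 'placeholder-user-agent',
        viewport: { width: 375, height: 812, deviceScaleFactor: 3, isMobile: true, hasTouch: true }
    },
    iPad: {
        userAgent: 'placeholder-user-agent',
        viewport: { width: 768, height: 1024, deviceScaleFactor: 2, isMobile: true, hasTouch: true }
    }
}

// Hypothetical helper: case-insensitive lookup that fails with the list of valid labels.
const findDevice = (devices, label) => {
    const wanted = String(label).trim().toLowerCase()
    const match = Object.keys(devices).find(name => name.toLowerCase() === wanted)
    if (!match) {
        throw new Error(
            `Unknown device "${label}". Available: ${Object.keys(devices).join(', ')}`
        )
    }
    return { name: match, descriptor: devices[match] }
}

console.log(findDevice(sampleDevices, 'iphone x').name) // iPhone X
```

The returned descriptor is what would be handed to page.emulate, and the canonical name is a safer choice for the screenshot file name than the raw user input.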

Note: As you can see, some devices have a space in their label. To make sure the entire label will be considered as one argument in our command line, we need to use quotes. Of course, if the label is a single word, quotes can be omitted.

node index.js getScreenDevice 'iPhone X'
node index.js getScreenDevice 'iPhone 6'
node index.js getScreenDevice iPad
node index.js getScreenDevice 'Pixel 2 landscape'
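To see why the quotes matter, remember that the shell splits the command on spaces before Node receives it. The sketch below simulates the resulting process.argv arrays. readArgs is a hypothetical helper, and the first two entries are shortened here (in a real run they are absolute paths).

```javascript
// Hypothetical helper: read the command and its first argument from an argv-style array.
// argv[0] is the node binary and argv[1] the script path, so user input starts at index 2.
const readArgs = argv => ({ command: argv[2], device: argv[3] })

// What Node receives for: node index.js getScreenDevice 'iPhone X'
const quoted = ['node', 'index.js', 'getScreenDevice', 'iPhone X']
// What Node receives for: node index.js getScreenDevice iPhone X
const unquoted = ['node', 'index.js', 'getScreenDevice', 'iPhone', 'X']

console.log(readArgs(quoted).device)   // iPhone X
console.log(readArgs(unquoted).device) // iPhone
```

Without the quotes, only the first word reaches the function, so the device lookup fails.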

By running those commands, you'll get a screenshot of the dev.to homepage on the device specified. This can be a great little tool to see how your application is displayed on a particular device.

Search query

This time, we will give our tool a string and use it as a search query on dev.to. We will then retrieve the information that the search returned.

What we'll do:

Get a string query from the user
Navigate to dev.to/search?q=myStringQuery
Read the elements displayed

So, first things first, we need to add a new case to handle the new argument. Let's call it query and call the function getQueryResults.

switch (process.argv[2]) {
    case 'getScreen':
        getScreenshot(process.argv[3])
        break
    case 'getPDF':
        getPDF(process.argv[3])
        break
    case 'getScreenDevice':
        getScreenshotDevice(process.argv[3])
        break
    case 'query':
        getQueryResults(process.argv.slice(3))
        break
    default:
        console.log('Wrong argument!')
}

Notice that we give process.argv.slice(3) as the function argument. Just like the devices before, I want to be able to use several words in my search query. There are two ways we can do that:

Put the words inside quotes, like we did before.
Put all the words in an array by using slice.

This time, we'll group all the words given in the command line after the query command in an array.
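Both approaches end up with the same search words, as this small sketch suggests. queryWords is a hypothetical helper that wraps the slice call, and the arrays simulate what process.argv would contain (with the first two entries shortened).

```javascript
// Hypothetical helper: everything after the command (index 2) is treated as a search word.
const queryWords = argv => argv.slice(3)

// node index.js query puppeteer pdf
const unquoted = ['node', 'index.js', 'query', 'puppeteer', 'pdf']
// node index.js query 'puppeteer pdf'
const quoted = ['node', 'index.js', 'query', 'puppeteer pdf']

console.log(queryWords(unquoted))           // [ 'puppeteer', 'pdf' ]
console.log(queryWords(quoted))             // [ 'puppeteer pdf' ]
console.log(queryWords(unquoted).join(' ')) // puppeteer pdf
```

Because slice always returns an array, the function receives the same kind of value whether or not the user quoted the query.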

So, let's create our getQueryResults function.

const getQueryResults = async query => {
    console.log(`Query results:\n -------------------`)
    try {
        const browser = await puppeteer.launch()
        const page = await browser.newPage()
        await page.goto(`https://dev.to/search?q=${query.join('%20')}`)
        await page.waitForSelector('.single-article')

        const articles = await page.$$('.single-article')

        for (let i = 0; i < articles.length; i++) {
            let title = await articles[i].$eval('h3', t => t.textContent)
            let author = await articles[i].$eval(
                'h4',
                a => a.textContent.split('・')[0]
            )
            let tag = ''
            let numberOfReactions = 0
            let numberOfComments = 0
            if (title.startsWith('#')) {
                tag = await articles[i].$eval('span.tag-identifier', s => s.textContent)
            }
            title = title.substring(tag.length)

            let likes = await articles[i].$('.reactions-count')
            let comments = await articles[i].$('.comments-count')
            if (likes) {
                numberOfReactions = await likes.$eval(
                    '.engagement-count-number',
                    span => span.innerHTML
                )
            }

            if (comments) {
                numberOfComments = await comments.$eval(
                    '.engagement-count-number',
                    span => span.innerHTML
                )
            }

            console.log(
                `${i + 1}) ${title} by ${author} has ${numberOfReactions} reactions and ${numberOfComments} comments.`
            )
        }

        await browser.close()
    } catch (e) {
        console.log(e)
    }
}

To achieve this, we need to study the HTML structure a bit. But first, we join every element in the array with %20 (the URL-encoded form of a space) so that the query can be used in the URL. We then navigate to the appropriate dev.to search page (/search?q=...).
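Joining with %20 takes care of spaces, but it leaves characters such as & or + untouched. One possible alternative, sketched below, is to join the words with a plain space and let encodeURIComponent escape the whole string. buildSearchUrl is a hypothetical helper and is not what the function above does.

```javascript
// Hypothetical helper, not part of the original code: build the search URL from the words.
// encodeURIComponent escapes spaces as %20 and also characters such as & or +,
// which would otherwise change the meaning of the query string.
const buildSearchUrl = words =>
    `https://dev.to/search?q=${encodeURIComponent(words.join(' '))}`

console.log(buildSearchUrl(['puppeteer', 'pdf'])) // https://dev.to/search?q=puppeteer%20pdf
console.log(buildSearchUrl(['c++', 'tips']))      // https://dev.to/search?q=c%2B%2B%20tips
```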

So far, so good. Every result is contained in an element with the single-article class. We wait for them to load (waitForSelector), then retrieve them with the page.$$ function, which takes a selector as its argument and returns every matching element. We now have all the results in the articles variable.

This is the part where we have to study the HTML markup to know where to look for the information we need.

The title lives in an h3 tag. However, I don't want tags like #showdev or #discuss in it. So, when one is present, we read it from the span with the tag-identifier class and strip it from the start of the title.
The author lives in the h4 tag, which also contains the date the article was published. A simple String.split call gets us the author name we need.
The reactions and comments follow the same logic. They live inside a div with the reactions-count class and the comments-count class, respectively. The $ method returns the element, or null if none exists. If there are reactions or comments, we retrieve their number from the content of the span with the engagement-count-number class.
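The string handling described above can be tried out on plain strings, without opening a browser. The helpers below are a hypothetical sketch: the names parseAuthor, stripTag, and toCount are made up for illustration, and the sample inputs (including the date format) are assumptions about what the page returns.

```javascript
// Hypothetical pure helpers that mirror the string handling described above.

// "Damien Cosset・Jan 1" -> "Damien Cosset"
const parseAuthor = text => text.split('・')[0].trim()

// Remove a leading tag such as "#showdev" from the title, if one was found
const stripTag = (rawTitle, tag = '') => {
    const title = rawTitle.trim()
    const cleanTag = tag.trim()
    return cleanTag && title.startsWith(cleanTag)
        ? title.slice(cleanTag.length).trim()
        : title
}

// Turn the text of a counter into a number, falling back to 0
const toCount = text => {
    const n = Number.parseInt(text, 10)
    return Number.isNaN(n) ? 0 : n
}

console.log(parseAuthor('Damien Cosset・Jan 1'))        // Damien Cosset
console.log(stripTag('#showdev My article', '#showdev')) // My article
console.log(toCount(' 191 '))                            // 191
```

Trimming in each helper guards against stray whitespace around the text, and converting the counters to numbers makes it possible to sort or sum them later.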

Aaaaaaand, finally, we just print out the information to the console.

So, if I run node index.js query puppeteer for example, I will get the following results:

Query results:
 -------------------
1) Generate a PDF from HTML with puppeteer by Damien Cosset has 191 reactions and 11 comments.
2) Front End Development automation with Puppeteer. Part 1 by Jaime Rios has 102 reactions and 0 comments.
3) An introduction to Puppeteer and Headless Chrome by Mohamed Oun has 33 reactions and 2 comments.
4) Generating PDF from HTML with Node.js and Puppeteer by Mate Boer  has 95 reactions and 6 comments.
5) Front End Development Automation with Puppeteer. Part 3 by Jaime Rios has 41 reactions and 4 comments.
6) Mocha and puppeteer on circleCI by Md. Abu Taher 👨‍💻 has 39 reactions and 0 comments.
7) Build a Car Price Scraper-Optimizer Using Puppeteer by Lex Martinez has 23 reactions and 3 comments.
8) Front End Development Automation with Puppeteer. Part 2 by Jaime Rios has 34 reactions and 0 comments.

... more results

That's it for the second article. You can find the code on GitHub.

Happy coding <3