
Image downloader with puppeteer and the fetch API

Caleb David ・ 6 min read

In this tutorial, we are going to build a webpage image downloader. Say you visit a webpage and notice that the images on that page are cool, and you want your own copies without saving them one by one. This simple tool we will build is going to be a life saver for you. This little project is also a good way to practice and hone your webscraping skills.

We will create a new directory called image-downloader and navigate into it. Pop open your terminal window and type in the following commands.

mkdir image-downloader && cd image-downloader

I will assume that you have Node.js and npm installed on your machine. We will then initialize this directory with the standard package.json file by running npm init -y, and then install two dependencies, namely puppeteer and node-fetch. Run the following command to get them installed.

npm install --save puppeteer node-fetch --verbose

You probably just noticed a new npm flag, --verbose. When installing puppeteer, npm also downloads a Chromium browser behind the scenes because it is a dependency of puppeteer. This download is usually large, and we are using the --verbose flag to see the progress of the installation. Nothing fancy, but let's use it because we can.

One more thing to do before getting our hands dirty with code is to create a directory where we want all our images to be downloaded. Let's name that directory images. We will also create an index.js file where all the app's logic will go.

mkdir images && touch index.js

Actually, it's great to clearly outline our thought process before writing a single line of code.

  1. Get all image tags from the page and extract the src property from each of these image tags
  2. Make requests to those src links and store the images in the images directory (saving images to disk)

Step 1: Getting all image tags and their src property

'use strict';

const puppeteer = require('puppeteer');
const fetch = require('node-fetch');
const fs = require('fs')

// Extract all imageLinks from the page
async function extractImageLinks(){
    const browser = await puppeteer.launch({
        headless: false
    })

    const page = await browser.newPage()

    // Get the page url from the user
    let baseURL = process.argv[2] ? process.argv[2] : "https://stocksnap.io"

    try {
        await page.goto(baseURL, {waitUntil: 'networkidle0'})
        await page.waitForSelector('body')

        let imageBank = await page.evaluate(() => {
            let imgTags = Array.from(document.querySelectorAll('img'))

            let imageArray = []

            imgTags.forEach((image) => {
                let src = image.src

                let srcArray = src.split('/')
                let pos = srcArray.length - 1
                let filename = srcArray[pos]

                imageArray.push({
                    src,
                    filename
                })
            })

            return imageArray
        })

        await browser.close()
        return imageBank

    } catch (err) {
        console.log(err)
    }
}

Now let me explain what is happening here. First, we created an async function called extractImageLinks. In that function, we created an instance of a browser page using puppeteer and stored it in the page constant. Think of this page as the new tab you get after launching your Chrome browser, except that we now control it from our code. We then get the URL of the page we want to download the images from via the command line and store it in a variable named baseURL. We then navigate to that URL using the page.goto() function. The {waitUntil: 'networkidle0'} object passed as the second argument to this function ensures that we wait until network activity has died down before we proceed with parsing the page. page.waitForSelector('body') tells puppeteer to wait for the html body tag to render before we start extracting anything from the page.

The page.evaluate() function allows us to run JavaScript code in that page instance as if we were in the Chrome DevTools console. To get all image tags from the page, we call document.querySelectorAll("img"). However, this function returns a NodeList and not an array, so to convert it to an array, we wrapped the call in the Array.from() method. Now we have a real array to work with.
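As a quick illustration (runnable anywhere, using a plain array-like object standing in for a NodeList), Array.from turns anything with a length and indexed entries into a real array that supports map, filter, and friends:

```javascript
// An array-like object, similar in shape to the NodeList
// returned by document.querySelectorAll (illustrative only)
const nodeListLike = { length: 3, 0: 'img1', 1: 'img2', 2: 'img3' };

// A NodeList has no .map, but the array produced by Array.from does
const asArray = Array.from(nodeListLike);

console.log(asArray.map(name => name.toUpperCase()));
// [ 'IMG1', 'IMG2', 'IMG3' ]
```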

We then store all the image tags in the imgTags variable and initialize the imageArray variable as a placeholder for all the src values. Since imgTags is now a real array, we loop through every tag in it and extract the src property from each image tag.

Now time for a little hack: we want to download the image from the webpage while maintaining the original filename as it appears on the webpage. For instance, say we have this image src: https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg. We want to get green-leaf_BVKZ4QW8LS.jpg from that URL. One way to do this is to split the string using the "/" delimiter. We then end up with something like this:

let src = `https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg`.split("/")

// Output
["https:", "", "cdn.stocksnap.io", "img-thumbs", "960w", "green-leaf_BVKZ4QW8LS.jpg"]

Now the last element of the array, after running the split method on the image source, contains the image's name and its extension as well. Awesome!

Note: to get the last item from any array, we subtract 1 from the length of that array, like so:

let arr = [40,61,12] 
let lastItemIndex = arr.length - 1 // This is the index of the last item

console.log(lastItemIndex)
// Output
2

console.log(arr[lastItemIndex])
// Output
12

So we store the index of the last item in the pos variable and the name of the file in the filename variable. Now that we have the source and the filename of the current image in the loop, we push these values as an object into the imageArray variable. Once the loop is done, we return imageArray, which by now has been populated. Back outside page.evaluate(), we close the browser and return the imageBank variable, which now contains the image links (sources) and the filenames.
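As an aside, Array.prototype.pop() gives the same result in one step, since it returns the last element of an array. Which version you prefer is a matter of taste:

```javascript
const src = 'https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg';

// pop() returns the last element of the split array,
// i.e. the filename with its extension
const filename = src.split('/').pop();

console.log(filename); // green-leaf_BVKZ4QW8LS.jpg
```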

Saving images to disk

function saveImageToDisk(url, filename){
    fetch(url)
    .then(res => {
        const dest = fs.createWriteStream(filename);
        res.body.pipe(dest)
    })
    .catch((err) => {
        console.log(err)
    })
}


// Run the script on auto-pilot
(async function(){
    let imageLinks = await extractImageLinks()
    console.log(imageLinks)

    imageLinks.forEach((image) => {
        let filename = `./images/${image.filename}`
        saveImageToDisk(image.src, filename)
    })
})()

Now let's decipher this little piece. In the anonymous IIFE, we run extractImageLinks to get the array containing the src and filename pairs. Since the function returns an array, we loop over it and pass the required parameters (url and filename) to saveImageToDisk. saveImageToDisk then uses the fetch API to make a GET request to that url, and as the response comes down the wire, we pipe it straight into the filename destination, in this case a writable stream on our filesystem. This is very efficient because we are not waiting for the image to be fully loaded in memory before saving it to disk; instead, we write every chunk of the response directly as it arrives.
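One caveat: the IIFE kicks off all the downloads but never waits for the write streams to finish. If saveImageToDisk returned a promise, Promise.all could run the downloads concurrently and tell us when every one of them is done. Here is a self-contained sketch of that pattern, using a hypothetical downloadOne stand-in (a timer instead of a real network request) so the snippet runs on its own:

```javascript
// Hypothetical stand-in for a promise-returning saveImageToDisk:
// resolves with the filename once the "download" completes
const downloadOne = (filename) =>
    new Promise((resolve) => setTimeout(() => resolve(filename), 10));

async function downloadAll(filenames) {
    // Promise.all starts every download concurrently and resolves
    // only when all of them have finished
    return Promise.all(filenames.map(downloadOne));
}

downloadAll(['a.jpg', 'b.jpg']).then(console.log); // [ 'a.jpg', 'b.jpg' ]
```

The same shape would apply to the real downloader: return the fetch chain from saveImageToDisk and resolve once the write stream emits its 'finish' event.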

Let's run the code, cross our fingers and check out our images directory

node index.js https://stocksnap.io

We should see some cool images in there. Wooo! You can add this to your portfolio. There are many improvements that can be made to this little program, such as allowing the user to specify the directory to download the images into, handling Data URI images, proper error handling, code refactoring, and creating a standalone CLI utility for it (hint: use the commander npm package for that). You can go ahead and extend this app and I'll be glad to see what improvements you make to it.

Full code

'use strict';

const puppeteer = require('puppeteer');
const fetch = require('node-fetch');
const fs = require('fs')

// Browser and page instance
async function instance(){
    const browser = await puppeteer.launch({
        headless: false
    })

    const page = await browser.newPage()
    return {page, browser}
}

// Extract all imageLinks from the page
async function extractImageLinks(){
    const {page, browser} = await instance()

    // Get the page url from the user
    let baseURL = process.argv[2] ? process.argv[2] : "https://stocksnap.io"

    try {
        await page.goto(baseURL, {waitUntil: 'networkidle0'})
        await page.waitForSelector('body')

        let imageLinks = await page.evaluate(() => {
            let imgTags = Array.from(document.querySelectorAll('img'))

            let imageArray = []

            imgTags.forEach((image) => {
                let src = image.src

                let srcArray = src.split('/')
                let pos = srcArray.length - 1
                let filename = srcArray[pos]

                imageArray.push({
                    src,
                    filename
                })
            })

            return imageArray
        })

        await browser.close()
        return imageLinks

    } catch (err) {
        console.log(err)
    }
}

(async function(){
    console.log("Downloading images...")

    let imageLinks = await extractImageLinks()

    imageLinks.forEach((image) => {
        let filename = `./images/${image.filename}`
        saveImageToDisk(image.src, filename)
    })

    console.log("Download complete, check the images folder")
})()

function saveImageToDisk(url, filename){
    fetch(url)
    .then(res => {
        const dest = fs.createWriteStream(filename);
        res.body.pipe(dest)
    })
    .catch((err) => {
        console.log(err)
    })
}

Shameless plug 😊

If you enjoyed this article and are feeling super pumped, I run 🔗 webscrapingzone.com where I teach advanced webscraping techniques by building real-world projects and how you can monetize your webscraping skills instantly without even being hired. It's still in beta stage but you can join the waiting list and get 💥 50% 💥 off when the course is released.

You can follow me on twitter - @microworlds

Thank you for your time 👍

Discussion

abelardoit

Hi there,

An improvement to this code would be to automatically create the "images" folder, making it totally transparent for the user.

Thanks for your amazing & wonderful article.

Following you. :)

Warmest regards.

Caleb David (Author)

That's very thoughtful, I will update the article and source code as soon as I'm less busy. Thanks for the kind words mate, really appreciate. 👍

Loouis Low

I built a similar image scraper but using Selenium.

Caleb David (Author)

Yours is incredibly robust and is handling so many edge cases. Great job mate, I have gotten more inspiration from that repo 😄

Paweł Kowalski

I have a weird feeling that a strategic use of Promise.all would make it order of magnitude faster :)

Caleb David (Author)

Yes you are very correct. I just took the naive approach building it. Performance is something that can definitely be improved. I'll create a repo for this project and then welcome pull requests from the community 😄