DEV Community

Mikhail Zub for SerpApi


Web Scraping Google Images with Nodejs

Intro

In this blog post, I want to show how you can extract images from Google Images. I'll show you two different DIY solutions and a ready-made solution from SerpApi and explain the difference between these solutions.

What will be scraped

[Image: what will be scraped]

Process

First of all, we need to extract data from HTML elements. Finding the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.

We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to learn a little more about them.

The GIF below illustrates the approach of selecting different parts of the results.

[GIF: selecting parts of the results with SelectorGadget]

Solution one (using the raw HTML from request)

📌Note: this solution is fast, but it returns images only from the first results page.

If you don't need an explanation, have a look at the full code example in the online IDE

const cheerio = require("cheerio");
const axios = require("axios");

const searchString = "bugatti chiron"; // what we want to search

const AXIOS_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
  }, // adding the User-Agent header as one way to prevent the request from being blocked
  params: {
    q: searchString, // our encoded search string
    hl: "en", // parameter defines the language to use for the Google search
    gl: "us", // parameter defines the country to use for the Google search
    tbm: "isch", // parameter defines the type of search you want to do (isch - Google Images)
  },
};

function getGoogleImagesResults() {
  return axios.get("http://google.com/search", AXIOS_OPTIONS).then(function ({ data }) {
    let $ = cheerio.load(data);

    const imagesRawPattern = /AF_initDataCallback\((?<images>[^<]+)\);/gm; //https://regex101.com/r/74JN5w/1
    let imagesRaw = [...data.matchAll(imagesRawPattern)].map(({ groups }) => groups.images);

    imagesRaw = imagesRaw.length > 1 ? imagesRaw[1] : imagesRaw[0];

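    // eval() executes the matched text as JavaScript to turn it into an object; only do this with responses you trust
    eval(`imagesRaw = ${imagesRaw}`);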

    imagesRaw = JSON.stringify(imagesRaw);

    const imagesPattern = /\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(?<images>.*),\"All\",/gm; //https://regex101.com/r/qXqmKz/1
    let images = [...imagesRaw.matchAll(imagesPattern)].map(({ groups }) => groups.images)[0];

    const thumbnailsPattern = /\[\"(?<thumbnail>https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]/gm; // https://regex101.com/r/Ms61BF/1
    const thumbnails = [...images.matchAll(thumbnailsPattern)].map(({ groups }) => groups.thumbnail);

    images = images.replace(thumbnailsPattern, "");

    const originalsPattern = /('|,)\[\"(?<original>https|http.*?)\",\d+,\d+\]/gm; // https://regex101.com/r/sA9I4E/1
    const originals = [...images.matchAll(originalsPattern)].map(({ groups }) => groups.original);

    return Array.from($(".PNCib.MSM1fd")).map((el, i) => ({
      title: $(el).find(".VFACy").attr("title"),
      link: $(el).find(".VFACy").attr("href"),
      source: $(el).find(".fxgdke").text(),
      original: originals[i],
      thumbnail: thumbnails[i],
    }));
  });
}

getGoogleImagesResults().then((result) => console.dir(result, { depth: null }));

Preparation

First, we need to create a Node.js* project and add the npm packages cheerio, to parse parts of the HTML markup, and axios, to make requests to a website.

To do this, in the directory with our project, open the command line and enter npm init -y, and then npm i cheerio axios.

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

Code explanation

Declare constants from cheerio and axios libraries:

const cheerio = require("cheerio");
const axios = require("axios");

Next, we write what we want to search for, the HTTP headers with a User-Agent, and the necessary parameters for making a request. The User-Agent header makes the request look like a "real" user visit: the default axios User-Agent looks like axios/0.27.2, so websites can tell that a script is sending the request and might block it (check what your user-agent is):

const searchString = "bugatti chiron"; // what we want to search

const AXIOS_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
  }, // adding the User-Agent header as one way to prevent the request from being blocked
  params: {
    q: searchString, // our encoded search string
    hl: "en", // parameter defines the language to use for the Google search
    gl: "us", // parameter defines the country to use for the Google search
    tbm: "isch", // parameter defines the type of search you want to do (isch - Google Images)
  },
};
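To make the effect of the params object more concrete, here is a minimal sketch of the query string those parameters produce. It uses Node's built-in URLSearchParams as a stand-in; axios serializes params with its own logic, so the exact encoding may differ in small details. The buildSearchUrl helper is hypothetical and not part of the original code.

```javascript
// Hypothetical helper (not from the article): builds a Google Images search URL
// from the same parameters that are passed to axios via `params`.
function buildSearchUrl(searchString) {
  const params = new URLSearchParams({
    q: searchString, // URLSearchParams encodes the value for us
    hl: "en", // language
    gl: "us", // country
    tbm: "isch", // search type: Google Images
  });
  return `https://www.google.com/search?${params.toString()}`;
}

console.log(buildSearchUrl("bugatti chiron"));
// https://www.google.com/search?q=bugatti+chiron&hl=en&gl=us&tbm=isch
```

Characters that have a special meaning in URLs, such as & in the search string, are escaped as well, so they cannot be mistaken for parameter separators.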

Next, we write a function that makes the request and returns the received data. The axios response has a data key, which we destructure and parse with cheerio:

function getGoogleImagesResults() {
  return axios.get("http://google.com/search", AXIOS_OPTIONS).then(function ({ data }) {
    let $ = cheerio.load(data);
    ...
  })
}

Then, we need to get and clean up the part of the HTML that contains the image data. First, we define imagesRawPattern; then, using spread syntax, we make an array from the iterator of matches returned by the matchAll method.

Next, we take the first or second match (depending on the page context), execute it with eval() to get an object from the string, and convert that object to a valid JSON string. We need to do this because the initial imagesRaw string contains encoded symbols (e.g. '\u0026'), and executing it decodes them:

const imagesRawPattern = /AF_initDataCallback\((?<images>[^<]+)\);/gm; //https://regex101.com/r/74JN5w/1
let imagesRaw = [...data.matchAll(imagesRawPattern)].map(({ groups }) => groups.images);

imagesRaw = imagesRaw.length > 1 ? imagesRaw[1] : imagesRaw[0];

// eval() executes the matched text as JavaScript to turn it into an object; only do this with responses you trust
eval(`imagesRaw = ${imagesRaw}`);

imagesRaw = JSON.stringify(imagesRaw);
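The decoding step can be seen in isolation with a small sketch. The sample string below is hypothetical and happens to be valid JSON, so JSON.parse() is enough to decode it; the real AF_initDataCallback payload is a JavaScript object literal rather than strict JSON, which is presumably why the code above falls back to eval().

```javascript
// Hypothetical fragment, written as it might appear in raw page source:
// the "&" is escaped as the six characters \u0026.
const rawFragment = String.raw`["https://example.com/image.jpg?w=100\u0026h=50",100,50]`;

// Parsing turns the escape sequence into a real "&" character.
const decoded = JSON.parse(rawFragment);

// Serializing again yields a string that later regex patterns can match
// without having to account for escaped symbols.
const normalized = JSON.stringify(decoded);

console.log(rawFragment.includes("\\u0026")); // true
console.log(decoded[0]); // https://example.com/image.jpg?w=100&h=50
console.log(normalized.includes("\\u0026")); // false
```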

Then, using different regex patterns, we get the images string and the thumbnails and originals arrays (to find the originals, we first need to remove the thumbnail links from the images string with the replace() method):

const imagesPattern = /\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(?<images>.*),\"All\",/gm; //https://regex101.com/r/qXqmKz/1
let images = [...imagesRaw.matchAll(imagesPattern)].map(({ groups }) => groups.images)[0];

const thumbnailsPattern = /\[\"(?<thumbnail>https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]/gm; // https://regex101.com/r/Ms61BF/1
const thumbnails = [...images.matchAll(thumbnailsPattern)].map(({ groups }) => groups.thumbnail);

images = images.replace(thumbnailsPattern, "");

const originalsPattern = /('|,)\[\"(?<original>https|http.*?)\",\d+,\d+\]/gm; // https://regex101.com/r/sA9I4E/1
const originals = [...images.matchAll(originalsPattern)].map(({ groups }) => groups.original);
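To see how these patterns behave, the sketch below runs the thumbnail and original patterns from the code above against a short, hypothetical sample string. The sample only imitates the shape of the real data (a URL followed by two numbers), so treat it as an illustration rather than real Google output.

```javascript
// Hypothetical sample imitating the shape of the extracted images string
let images =
  '["https://encrypted-tbn0.gstatic.com/images?q=tbn:AAA",183,275],' +
  '["https://example.com/a.jpg",800,1200],"x",' +
  '["https://encrypted-tbn0.gstatic.com/images?q=tbn:BBB",168,300],' +
  '["http://example.org/b.png",600,900]';

// Same patterns as in the article
const thumbnailsPattern = /\[\"(?<thumbnail>https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]/gm;
const originalsPattern = /('|,)\[\"(?<original>https|http.*?)\",\d+,\d+\]/gm;

// Collect every thumbnail URL via the named capture group
const thumbnails = [...images.matchAll(thumbnailsPattern)].map(({ groups }) => groups.thumbnail);

// Remove the thumbnail entries so that only the original image entries remain
images = images.replace(thumbnailsPattern, "");

// Collect the remaining URLs, which point to the full-size images
const originals = [...images.matchAll(originalsPattern)].map(({ groups }) => groups.original);

console.log(thumbnails.length, originals.length); // 2 2
```

Removing the thumbnails first matters because the originals pattern accepts any URL that starts with http, so it would otherwise also pick up the gstatic.com thumbnail links.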

And finally, we get the title, link and source of each image from the HTML selectors (using the $(), find(), attr() and text() methods) and combine them with original and thumbnail:

return Array.from($(".PNCib.MSM1fd")).map((el, i) => ({
  title: $(el).find(".VFACy").attr("title"),
  link: $(el).find(".VFACy").attr("href"),
  source: $(el).find(".fxgdke").text(),
  original: originals[i],
  thumbnail: thumbnails[i],
}));
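The index-based merge can be sketched without cheerio. The example below uses hypothetical plain arrays in place of the parsed HTML elements and shows that the fields are matched purely by position, so a missing entry in one array surfaces as undefined.

```javascript
// Hypothetical stand-ins for the data scraped from the HTML and from the regexes
const cards = [
  { title: "First image", link: "https://example.com/1", source: "example.com" },
  { title: "Second image", link: "https://example.org/2", source: "example.org" },
];
const originals = ["https://example.com/1.jpg", "https://example.org/2.jpg"];
const thumbnails = ["https://encrypted-tbn0.gstatic.com/images?q=tbn:AAA"]; // one entry missing on purpose

// Merge by array index, the same way the cheerio results are combined with the regex results
const results = cards.map((card, i) => ({
  ...card,
  original: originals[i],
  thumbnail: thumbnails[i],
}));

console.log(results[0].original); // https://example.com/1.jpg
console.log(results[1].thumbnail); // undefined
```

Because nothing ties an element to its URLs except the position, it may be worth checking that the three collections have the same length before trusting the merged output.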

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

[
   {
      "title":"Bugatti Chiron - Wikipedia",
      "link":"https://en.wikipedia.org/wiki/Bugatti_Chiron",
      "source":"en.wikipedia.org",
      "original":"https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Bugatti_Chiron_%2836559710091%29.jpg/1200px-Bugatti_Chiron_%2836559710091%29.jpg",
      "thumbnail":"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQtn09W1RkuFRfj8Gbfj16Jt_ZnQ8vsvZRGBOfO3gOnUPrprSUH3nuFcOz-VdKk1bHGgdI&usqp=CAU"
   },
   {
      "title":"Bugatti Chiron: Breaking new dimensions",
      "link":"https://www.bugatti.com/chiron/",
      "source":"bugatti.com",
      "original":"https://www.bugatti.com/fileadmin/_processed_/sei/p54/se-image-4799f9106491ebb58ca3351f6df5c44a.jpg",
      "thumbnail":"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ-usS_C6OB_O4_0QgQsrQlTZXl4n_ouoyrpKBHObK0OgvEgNB0W3lS9EIJdLnm_1WrJy0&usqp=CAU"
   },
   ... and other results
]

Solution two (using browser automation with Puppeteer)

📌Note: this solution is much slower, but it returns images from all results pages (using the infinite scroll).

If you don't need an explanation, have a look at the full code example in the online IDE

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const searchQuery = "bugatti chiron";

async function getImagesData(page) {
  const imagesResults = [];
  let iterationsLength = 0;
  while (true) {
    const images = await page.$$(".OcgH4b .PNCib.MSM1fd");
    for (; iterationsLength < images.length; iterationsLength++) {
      await images[iterationsLength].click(); // wait for the click to be dispatched before pausing
      await page.waitForTimeout(2000);
      imagesResults.push(
        await page.evaluate(
          (iterationsLength) => ({
            thumbnail: document.querySelectorAll(".OcgH4b .PNCib.MSM1fd")[iterationsLength].querySelector(".Q4LuWd")?.getAttribute("src"),
            source: document.querySelectorAll(".OcgH4b .PNCib.MSM1fd")[iterationsLength].querySelector(".VFACy div")?.textContent.trim(),
            title: document.querySelectorAll(".OcgH4b .PNCib.MSM1fd")[iterationsLength].querySelector("h3")?.textContent.trim(),
            link: document.querySelectorAll(".OcgH4b .PNCib.MSM1fd")[iterationsLength].querySelector(".VFACy")?.getAttribute("href"),
            original: Array.from(document.querySelectorAll(".eHAdSb .n3VNCb"))
              .find((el) => !el.getAttribute("src").includes("data:image") && !el.getAttribute("src").includes("gstatic.com"))
              ?.getAttribute("src"),
          }),
          iterationsLength
        )
      );
    }
    await page.waitForTimeout(5000);
    const newImages = await page.$$(".OcgH4b .PNCib.MSM1fd");
    if (newImages.length === images.length) break;
  }
  return imagesResults;
}

async function getGoogleImagesResults() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const URL = `https://www.google.com/search?q=${encodeURIComponent(searchQuery)}&tbm=isch&hl=en&gl=es`; // encodeURIComponent also escapes characters such as & and # in the query

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);
  await page.waitForSelector(".PNCib");

  const imagesResults = await getImagesData(page);

  await browser.close();

  return imagesResults;
}

getGoogleImagesResults().then((result) => console.dir(result, { depth: null }));

Preparation

First, we need to create a Node.js* project and add npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but now we work only with Chromium which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, in the directory with our project, open the command line and enter npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to prevent websites from detecting that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.

[Screenshot: Chrome headless test results with and without the stealth plugin]

Code explanation

Declare puppeteer to control Chromium browser from puppeteer-extra library and StealthPlugin to prevent website detection that you are using web driver from puppeteer-extra-plugin-stealth library:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin and write what we want to search for:

puppeteer.use(StealthPlugin());

const searchQuery = "bugatti chiron";

Next, we write a function to get images data from the Google Search page:

async function getImagesData(page) {
  ...
}

In this function, we first create an empty imagesResults array and set iterationsLength to 0:

const imagesResults = [];
let iterationsLength = 0;

Next, we use a while loop in which we get all images (using the $$ method), click on each image (using the click() method), wait 2 seconds (using the waitForTimeout method) and add the image data to the end of the imagesResults array (using the push() method):

  while (true) {
    const images = await page.$$(".OcgH4b .PNCib.MSM1fd");
    for (; iterationsLength < images.length; iterationsLength++) {
      await images[iterationsLength].click();
      await page.waitForTimeout(2000);
      imagesResults.push(
        ...
      );
    }
    ...
  }
  return imagesResults;

Then, we get all image data from the page using evaluate() method and pass iterationsLength variable to the page context:

await page.evaluate((iterationsLength) => ({
...
}), iterationsLength)

Next, we get the needed information from the HTML selectors. We use querySelectorAll() and querySelector() to access the right HTML elements, and textContent with trim() to get the raw text and remove white space from both ends of the string. To get links, we use the getAttribute() method to read the "href" or "src" attribute of an element, and finally the find() method to pick the right element from an array of elements that match the same selector:

    thumbnail: document.querySelectorAll(".OcgH4b .PNCib.MSM1fd")[iterationsLength].querySelector(".Q4LuWd")?.getAttribute("src"),
    source: document.querySelectorAll(".OcgH4b .PNCib.MSM1fd")[iterationsLength].querySelector(".VFACy div")?.textContent.trim(),
    title: document.querySelectorAll(".OcgH4b .PNCib.MSM1fd")[iterationsLength].querySelector("h3")?.textContent.trim(),
    link: document.querySelectorAll(".OcgH4b .PNCib.MSM1fd")[iterationsLength].querySelector(".VFACy")?.getAttribute("href"),
    original: Array.from(document.querySelectorAll(".eHAdSb .n3VNCb"))
        .find((el) => !el.getAttribute("src").includes("data:image") && !el.getAttribute("src").includes("gstatic.com"))
        ?.getAttribute("src"),
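The selection logic for the original image can be tried out on its own. In this sketch, pickOriginal is a hypothetical helper that applies the same two conditions to a plain array of src values instead of DOM elements.

```javascript
// Hypothetical helper: returns the first src that is neither an inline
// base64 image nor a Google-hosted thumbnail.
function pickOriginal(srcs) {
  return srcs.find((src) => !src.includes("data:image") && !src.includes("gstatic.com"));
}

// Sample values are made up for illustration
const candidates = [
  "data:image/jpeg;base64,/9j/4AAQSkZJRg",
  "https://encrypted-tbn0.gstatic.com/images?q=tbn:AAA",
  "https://example.com/full-size.jpg",
];

console.log(pickOriginal(candidates)); // https://example.com/full-size.jpg
console.log(pickOriginal(candidates.slice(0, 2))); // undefined
```

In the page context, getAttribute("src") returns null for an element without a src attribute, and calling includes() on null throws, so a null check may be worth adding there.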

Next, we wait 5 seconds for new images to load and get all images again. If the length of the newImages array is the same as the length of the images array, no new images have appeared and we stop the loop; otherwise, we repeat it:

await page.waitForTimeout(5000);
const newImages = await page.$$(".OcgH4b .PNCib.MSM1fd");
if (newImages.length === images.length) break;
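The stop condition is easier to follow in a simulation. The sketch below replaces the browser with a hypothetical in-memory data source: getItems stands in for page.$$() and wait stands in for the 5-second pause during which more results load. None of these names come from Puppeteer.

```javascript
// Generic version of the loop: process new items, wait, and stop once the
// number of items no longer grows.
async function collectUntilStable(getItems, wait) {
  const results = [];
  let processed = 0; // index of the next unprocessed item, kept across iterations
  while (true) {
    const items = await getItems();
    for (; processed < items.length; processed++) {
      results.push(items[processed]);
    }
    await wait();
    const newItems = await getItems();
    if (newItems.length === items.length) break; // nothing new appeared
  }
  return results;
}

// Hypothetical data source that reveals two more items after every wait
function createFakePage(allItems) {
  let visible = 2;
  return {
    getItems: async () => allItems.slice(0, visible),
    wait: async () => {
      visible = Math.min(visible + 2, allItems.length);
    },
  };
}

const fakePage = createFakePage(["a", "b", "c", "d", "e"]);
collectUntilStable(fakePage.getItems, fakePage.wait).then((items) => console.log(items.length)); // 5
```

Keeping the processed counter outside the inner loop is what prevents items from being handled twice when the list grows.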

Next, write a function to control the browser, and get information:

async function getGoogleImagesResults() {
  ...
}

In this function, we first define browser using the puppeteer.launch({options}) method with the options headless: false and args: ["--no-sandbox", "--disable-setuid-sandbox"].

These options mean that the browser runs in non-headless mode (with a visible window), and that we pass an array of arguments which allow the browser process to launch in the online IDE. Then we open a new page:

const browser = await puppeteer.launch({
  headless: false,
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();

Next, we change the default navigation timeout (30 sec) to 60000 ms (1 min) for slow internet connections with the .setDefaultNavigationTimeout() method, go to the URL with the .goto() method and use the .waitForSelector() method to wait until the selector has loaded:

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector(".PNCib");

Then, we wait until the getImagesData function has finished and save its results to the imagesResults constant:

const imagesResults = await getImagesData(page);

And finally, we close the browser, and return the received data:

await browser.close();

return imagesResults;

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

📌Note: some of the thumbnails are present in base64 format and some are links to the thumbnail.

[
   {
      "thumbnail":"",
      "source":"caranddriver.com",
      "title":"El Bugatti Chiron Super Sport se pone a tono: Objetivo, los ¡440 km/h!",
      "link":"https://www.caranddriver.com/es/coches/planeta-motor/a36812435/bugatti-chiron-super-sport-pruebas-velocidad/",
      "original":"https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/02-hispeed-css-1624449602.jpg"
   },
   {
      "thumbnail":"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS3xUD75WQBPu68wKkS9yGm7Yla62hRKuv4kQ&usqp=CAU",
      "source":"motor.es",
      "title":"El Bugatti Chiron Super Sport, a la caza de nuevos clientes en la Costa Azul",
      "link":"https://www.motor.es/noticias/bugatti-chiron-super-sport-costa-azul-202180151.html",
      "original":"https://static.motor.es/fotos-noticias/2021/08/bugatti-chiron-super-sport-202180151-1628105578_1.jpg"
   }
   ... and other results
]
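If the two thumbnail formats need to be handled differently downstream, a check like the following may help. describeThumbnail is a hypothetical helper, and the base64 sample is shortened for illustration.

```javascript
// Hypothetical helper: classifies a thumbnail value from the results
function describeThumbnail(thumbnail) {
  if (!thumbnail) return "missing"; // empty string, null or undefined
  if (thumbnail.startsWith("data:image")) return "base64"; // inline data URI
  return "link"; // regular URL
}

console.log(describeThumbnail("data:image/jpeg;base64,/9j/4AAQSkZJRg")); // base64
console.log(describeThumbnail("https://encrypted-tbn0.gstatic.com/images?q=tbn:AAA")); // link
console.log(describeThumbnail("")); // missing
```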

Using Google Images API from SerpApi

This section compares the DIY solutions with our solution.

The biggest difference is that you don't need to use browser automation to scrape all results, create the parser from scratch, or maintain it.

This solution combines the advantages of the solutions shown above, namely the speed and completeness of obtaining results.

There's also a chance that the request might be blocked by Google at some point. We handle that on our backend, so there's no need to figure out how to do it yourself or which CAPTCHA-solving or proxy provider to use.

First, we need to install google-search-results-nodejs:

npm i google-search-results-nodejs

Here's the full code example, if you don't need an explanation:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); //your API key from serpapi.com

const searchQuery = "bugatti chiron";

const params = {
  q: searchQuery, // what we want to search
  engine: "google", // search engine
  hl: "en", // parameter defines the language to use for the Google search
  gl: "us", // parameter defines the country to use for the Google search
  tbm: "isch", // parameter defines the type of search you want to do (isch - Google Images)
};

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};

const getResults = async () => {
  const imagesResults = [];
  while (true) {
    const json = await getJson();
    if (json.images_results) {
      imagesResults.push(...json.images_results);
      params.ijn ? (params.ijn += 1) : (params.ijn = 1);
    } else break;
  }
  return imagesResults;
};

getResults().then((result) => console.dir(result, { depth: null }));

Code explanation

First, we need to declare SerpApi from google-search-results-nodejs library and define new search instance with your API key from SerpApi:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); // your API key from serpapi.com

Next, we write what we want to search and the necessary parameters for making a request:

const searchQuery = "bugatti chiron";

const params = {
  q: searchQuery, // what we want to search
  engine: "google", // search engine
  hl: "en", // parameter defines the language to use for the Google search
  gl: "us", // parameter defines the country to use for the Google search
  tbm: "isch", // parameter defines the type of search you want to do (isch - Google Images)
};

Next, we wrap the search method from the SerpApi library in a promise to further work with the search results:

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};
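The same wrapping technique can be demonstrated with a stand-in for the library. In this sketch, fakeSearch.json is a hypothetical function that, like search.json in the code above, passes its result to a callback as the only argument.

```javascript
// Hypothetical stand-in for the SerpApi client: calls back with a result object
const fakeSearch = {
  json(params, callback) {
    queueMicrotask(() => callback({ search_parameters: params }));
  },
};

const params = { q: "bugatti chiron", tbm: "isch" };

// Wrap the callback-style call in a promise so that it can be awaited
const getJson = () => new Promise((resolve) => fakeSearch.json(params, resolve));

getJson().then((json) => console.log(json.search_parameters.q)); // bugatti chiron
```

A manual wrapper is used here because Node's util.promisify expects callbacks whose first argument is an error, which does not match this callback shape.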

And finally, we declare the getResults function that gets data from the page and returns it:

const getResults = async () => {
  ...
};

In this function, we first declare an empty imagesResults array to hold the results:

const imagesResults = [];

Next, we use a while loop. In this loop we get the JSON with results, check whether results are present on the page, push them to the imagesResults array, increase the page number (ijn parameter) and repeat until no more results are present:

while (true) {
  const json = await getJson();
  if (json.images_results) {
    imagesResults.push(...json.images_results);
    params.ijn ? (params.ijn += 1) : (params.ijn = 1);
  } else break;
}
return imagesResults;
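The pagination loop can be exercised without an API key by swapping in a fake data source. In the sketch below, fakeGetJson and fakePages are hypothetical: the function returns images_results for pages 0 to 2 and omits the key afterwards, which is the condition the loop uses to stop.

```javascript
const params = { q: "bugatti chiron" };

// Hypothetical pages, indexed by the ijn (page number) value
const fakePages = [["img1", "img2"], ["img3", "img4"], ["img5"]];

// Stand-in for getJson: reads the current page number from params
const fakeGetJson = async () => {
  const page = fakePages[params.ijn || 0];
  return page ? { images_results: page } : {};
};

const getResults = async () => {
  const imagesResults = [];
  while (true) {
    const json = await fakeGetJson();
    if (json.images_results) {
      imagesResults.push(...json.images_results);
      params.ijn ? (params.ijn += 1) : (params.ijn = 1); // move on to the next page
    } else break; // no results key: we are past the last page
  }
  return imagesResults;
};

const allResults = getResults();
allResults.then((results) => console.log(results.length)); // 5
```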

After, we run the getResults function and print all the received information in the console with the console.dir method, which allows you to use an object with the necessary parameters to change default output options:

getResults().then((result) => console.dir(result, { depth: null }));
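The effect of the depth option is visible on any deeply nested object. The sketch below uses Node's util.inspect, which console.dir relies on for formatting, on a hypothetical nested object.

```javascript
const util = require("util");

// Hypothetical object nested four levels deep
const nested = { a: { b: { c: { d: 1 } } } };

// With the default depth of 2, the innermost object is collapsed to [Object]
const shallow = util.inspect(nested);

// With depth: null, every level is printed
const full = util.inspect(nested, { depth: null });

console.log(shallow.includes("[Object]")); // true
console.log(full.includes("d: 1")); // true
```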

Output

[
   {
      "position":1,
      "thumbnail":"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT-6pU_hnLtboVfXcuihXvzyP8xRsDJ8fs3zw&usqp=CAU",
      "source":"topspeed.com",
      "title":"2018 Bugatti Chiron | Top Speed",
      "link":"https://www.topspeed.com/cars/bugatti/2018-bugatti-chiron-ar163150.html",
      "original":"https://pictures.topspeed.com/IMG/crop_webp/201602/2018-bugatti-chiron-10_1920x1080.webp",
      "is_product":false
   },
   {
      "position":2,
      "thumbnail":"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR_1JAV4au_ZTCjcPk41Ul9blo0XHsJwxt50Q&usqp=CAU",
      "source":"auto-data.net",
      "title":"2021 Bugatti Chiron Super Sport 8.0 W16 (1600 Hp) AWD DSG | Technical  specs, data, fuel consumption, Dimensions",
      "link":"https://www.auto-data.net/en/bugatti-chiron-super-sport-8.0-w16-1600hp-awd-dsg-43596",
      "original":"https://www.auto-data.net/images/f115/Bugatti-Chiron-Super-Sport.jpg",
      "is_product":false
   },
   ... and other results
]

If you want to see some projects made with SerpApi, please write me a message.


Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
