Web scraping Google Hotels with Node.js

#webscraping #node #google

Currently, we don't have an API that supports extracting data from the Google Hotels page. It is in development, however, and you can track its progress on our public roadmap.

This blog post shows how you can do it yourself with the DIY solution below while we work on releasing a proper API.

What will be scraped

If you don't need an explanation, have a look at the full code example in the online IDE.

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const searchQuery = "Honolulu";

const URL = `https://www.google.com/travel/hotels/${encodeURI(searchQuery)}`;

async function getHotelsInfo(page) {
  let lastHeight = await page.evaluate(`document.querySelector(".zQTmif").scrollHeight`);
  while (true) {
    await page.waitForTimeout(500);
    await page.keyboard.press("End");
    await page.waitForTimeout(500);
    await page.keyboard.press("PageUp");
    await page.waitForTimeout(5000);
    let newHeight = await page.evaluate(`document.querySelector(".zQTmif").scrollHeight`);
    if (newHeight === lastHeight) {
      break;
    }
    lastHeight = newHeight;
  }
  return await page.evaluate(() =>
    Array.from(document.querySelectorAll(".TNNk1.nzwZbc")).map((el) => {
      const adFrom = el.querySelector(".hVE5 .ogfYpf")?.textContent.trim();
      return {
        link: `https://www.google.com/${el.querySelector(".PVOOXe")?.getAttribute("href")}`,
        images: Array.from(el.querySelectorAll(".pb2I5 img"))
          .map((el) => el.getAttribute("src"))
          .filter((el) => el),
        title: el.querySelector(".BgYkof")?.textContent.trim(),
        price: el.querySelector(".xquSSe .kixHKb > span:first-child")?.textContent.trim(),
        rating: parseFloat(el.querySelector(".KFi5wf")?.textContent.trim()) || "No rating",
        reviews:
          parseInt(
            el
              .querySelector(".jdzyld")
              ?.textContent.trim()
              .replace(/[()\s]/g, "")
          ) || "No reviews",
        stars: parseInt(el.querySelector(".UqrZme")?.textContent.trim()),
        options: Array.from(el.querySelectorAll(".XX3dkb > .LtjZ2d ")).map((el) => el.textContent.trim()),
        adFrom,
      };
    })
  );
}

async function getHotelsResults() {
  const browser = await puppeteer.launch({
    headless: true, // set this option to false (the boolean) to watch what the browser is doing
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);

  await page.waitForSelector(".TNNk1.nzwZbc");

  const hotels = await getHotelsInfo(page);

  await browser.close();

  return hotels;
}

getHotelsResults().then(console.log);

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox; here we only use Chromium, which is the default) over the DevTools Protocol in headless or non-headless mode.

To do this, in the directory with our project, open the command line and enter:

$ npm init -y

And then:

$ npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to prevent websites from detecting that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.

The Node.js environment for our project is now set up, so we can move on to the step-by-step code explanation.

Process

We need to extract data from HTML elements. Finding the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.

We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.

The GIF below illustrates the approach of selecting different parts of the results using SelectorGadget.

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, then define the search query and the search URL:

puppeteer.use(StealthPlugin());

const searchQuery = "Honolulu";

const URL = `https://www.google.com/travel/hotels/${encodeURI(searchQuery)}`;
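
For a plain place name like "Honolulu", encodeURI() is enough. It leaves characters such as "/", "?" and "&" untouched, though, so a query containing them would change the structure of the URL. A minimal sketch of a stricter variant, assuming the whole query should be treated as a single path segment (buildHotelsUrl and BASE_URL are hypothetical names, not part of the original script):

```javascript
// Hypothetical helper (not part of the original script): builds the search URL
// and escapes characters that would otherwise change the URL structure.
const BASE_URL = "https://www.google.com/travel/hotels/";

function buildHotelsUrl(query) {
  // encodeURIComponent also escapes "/", "?", "&" and "#", which encodeURI leaves as-is
  return BASE_URL + encodeURIComponent(query.trim());
}

console.log(buildHotelsUrl("Honolulu"));
console.log(buildHotelsUrl("New York"));
```

For simple queries both functions produce the same URL, so this only matters if the search text comes from user input.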

Next, we write a function to get hotels info from the page:

async function getHotelsInfo(page) {
  ...
}

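Inside this function we do the following:

First, we need to scroll the page to load all results. To do this, we read the scrollHeight of the results container into the lastHeight variable, then scroll in a while loop by pressing the "End" and "PageUp" keys with pauses in between. When the new scrollHeight equals the previous one, nothing more was loaded and we stop the loop. Note that page.waitForTimeout() has been removed from recent Puppeteer releases; there, a plain new Promise((resolve) => setTimeout(resolve, ms)) does the same job: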

let lastHeight = await page.evaluate(`document.querySelector(".zQTmif").scrollHeight`);
while (true) {
  await page.waitForTimeout(500);
  await page.keyboard.press("End");
  await page.waitForTimeout(500);
  await page.keyboard.press("PageUp");
  await page.waitForTimeout(5000);
  let newHeight = await page.evaluate(`document.querySelector(".zQTmif").scrollHeight`);
  if (newHeight === lastHeight) {
    break;
  }
  lastHeight = newHeight;
}
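
The loop above runs until the height stops changing, so it has no upper bound if the page keeps growing. A minimal sketch of the same idea with a cap on the number of rounds; scrollUntilStable, getHeight, step and maxRounds are hypothetical names, and the browser is replaced by a stand-in so the sketch runs on its own:

```javascript
// Hypothetical helper: repeats `step` until `getHeight` stops changing
// or `maxRounds` is reached, whichever comes first.
async function scrollUntilStable(getHeight, step, maxRounds = 20) {
  let lastHeight = await getHeight();
  for (let round = 1; round <= maxRounds; round++) {
    await step();
    const newHeight = await getHeight();
    if (newHeight === lastHeight) {
      return { rounds: round, height: newHeight, stable: true };
    }
    lastHeight = newHeight;
  }
  return { rounds: maxRounds, height: lastHeight, stable: false };
}

// Stand-in for the browser: the "page" grows three times, then stops.
const heights = [1000, 2000, 3000, 4000];
let index = 0;
const fakeGetHeight = async () => heights[Math.min(index, heights.length - 1)];
const fakeStep = async () => {
  index += 1;
};

scrollUntilStable(fakeGetHeight, fakeStep).then(console.log);
// prints { rounds: 4, height: 4000, stable: true }
```

With Puppeteer, getHeight would wrap the page.evaluate() call shown above and step would hold the key presses and pauses.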

Then, we get and return all hotels info from the page (using evaluate() method):

return await page.evaluate(() =>
  Array.from(document.querySelectorAll(".TNNk1.nzwZbc")).map((el) => {
    const adFrom = el.querySelector(".hVE5 .ogfYpf")?.textContent.trim();
    return {
      link: `https://www.google.com/${el.querySelector(".PVOOXe")?.getAttribute("href")}`,
      images: Array.from(el.querySelectorAll(".pb2I5 img"))
        .map((el) => el.getAttribute("src"))
        .filter((el) => el),
      title: el.querySelector(".BgYkof")?.textContent.trim(),
      price: el.querySelector(".xquSSe .kixHKb > span:first-child")?.textContent.trim(),
      rating: parseFloat(el.querySelector(".KFi5wf")?.textContent.trim()) || "No rating",
      reviews:
        parseInt(
          el
            .querySelector(".jdzyld")
            ?.textContent.trim()
            .replace(/[()\s]/g, "") // this RegEx matches "(", ")", or any whitespace
        ) || "No reviews",
      stars: parseInt(el.querySelector(".UqrZme")?.textContent.trim()),
      options: Array.from(el.querySelectorAll(".XX3dkb > .LtjZ2d ")).map((el) => el.textContent.trim()),
      adFrom,
    };
  })
);
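
Two details of this extraction are worth a closer look. Prepending "https://www.google.com/" to an href that already starts with "/" yields a double slash, and parseInt() stops at the first non-digit, so a text like "(1,234)" would be read as 1. A minimal sketch of two helpers that avoid both, assuming the review count is rendered as a full number with optional separators (toAbsoluteUrl and parseReviewCount are hypothetical names):

```javascript
// Hypothetical helpers, shown outside the browser context for clarity.

// Resolves a relative href against the Google origin without doubling the slash.
function toAbsoluteUrl(href, base = "https://www.google.com") {
  return href ? new URL(href, base).href : undefined;
}

// Keeps only digits, so "(1,234)" becomes 1234 rather than 1.
function parseReviewCount(text) {
  const digits = (text ?? "").replace(/\D/g, "");
  return digits ? parseInt(digits, 10) : "No reviews";
}

console.log(toAbsoluteUrl("/travel/hotels/Honolulu")); // https://www.google.com/travel/hotels/Honolulu
console.log(parseReviewCount(" (1,234) ")); // 1234
console.log(parseReviewCount(undefined)); // No reviews
```

Functions defined in Node.js are not visible inside page.evaluate(), so to use these in the scraper they would have to be declared inside the evaluate() callback, or applied to the returned data afterwards.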

Next, we write a function to control the browser and get the hotels information:

async function getHotelsResults() {
  ...
}

In this function, we first define browser using the puppeteer.launch({options}) method with the options headless: true and args: ["--no-sandbox", "--disable-setuid-sandbox"].

These options mean that we use headless mode, and the args array allows the browser process to launch in the online IDE. Then we open a new page:

const browser = await puppeteer.launch({
  headless: true, // set this option to false (the boolean) to watch what the browser is doing
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();

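Next, we raise the default navigation timeout from 30 seconds to 60000 ms (1 min) with the .setDefaultNavigationTimeout() method to allow for slow internet connections, then open the URL with the .goto() method. This setting affects navigation only; waitForSelector() keeps its own 30-second default: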

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);

Then we wait until the ".TNNk1.nzwZbc" selector has loaded (waitForSelector() method) and save the hotels information from the page in the hotels constant:

await page.waitForSelector(".TNNk1.nzwZbc");

const hotels = await getHotelsInfo(page);

And finally, we close the browser, and return the received data:

await browser.close();

return hotels;
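
If waitForSelector() or the extraction throws, the code above never reaches browser.close(). A minimal sketch of one way to guard against that with try/finally; withBrowser, launch and task are hypothetical names, and a stand-in object replaces the real browser so the sketch runs without Puppeteer installed:

```javascript
// Hypothetical wrapper: `launch` is any function that resolves to an object
// with a close() method, e.g. () => puppeteer.launch({ headless: true }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs whether task resolved or threw, so the browser is closed either way
    await browser.close();
  }
}

// Stand-in browser so the sketch runs on its own.
const fakeLaunch = async () => ({
  closed: false,
  async close() {
    this.closed = true;
  },
});

withBrowser(fakeLaunch, async () => {
  throw new Error("selector not found");
}).catch((err) => console.log("task failed:", err.message));
// prints "task failed: selector not found"
```

In the scraper, the task callback would contain the newPage(), goto(), waitForSelector() and getHotelsInfo() calls.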

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

[
   {
      "link":"https://www.google.com//travel/hotels/Honolulu/entity/CgoIsva7guWXuqkpEAEae0FBQm5CM2xEcERidFBBWHRBV2x0TWt0UnhmQW4xVjJMWjJDVExrWWd0eG9JVWRfZXlnWWxteHBpaEh1cXptb3pDTWpqUlFkSmJpdmlHRlRLT3NJbER6QW1vMUV5c3VOOW5VR3FFb0xZTlgxLWQ2UllDVnVmNE04QWVRSQ?utm_campaign=sharing&utm_medium=link&utm_source=htls&ved=2ahUKEwjw8ur3wJr9AhVyBRwAHcDoCHoQyvcEegQIAxAr&ts=CAESABogCgIaABIaEhQKBwjnDxACGBMSBwjnDxACGBQYATICEAAqCQoFOgNVU0QaAA&rp=OAE",
      "images":[
         "https://lh5.googleusercontent.com/p/AF1QipNLY_J7UdmiBu8U1ivimm1OFZGASPJxxsdRPXlv=w592-h404-n-k-no-v1",
         "https://lh5.googleusercontent.com/p/AF1QipOkC6UOW2plYus-TxCYqhMtBMi-0HkqyUYmrMzu=w592-h404-n-k-no-v1"
      ],
      "title":"Hilton Waikiki Beach",
      "rating":4.1,
      "reviews":823,
      "stars":4,
      "options":[
         "Breakfast ($)",
         "Free Wi-Fi"
      ],
      "adFrom":"From  Hilton Waikiki Beach"
   },
   {
      "link":"https://www.google.com//travel/hotels/Honolulu/entity/CgoI866GmIeXmos7EAEafkFBQm5CM2xuTDdXWDI2cDRoWllQXzVVazI3Vkd3OWVDbnczM215U05wRmhOOGhiMHpOaGQ4TTRoQ2ZfZVEzbU5yWDZ2U2t2THRDMy1sMlc1YWhLel9GUE9TWEhpd1ZRclgxYjhZY0VFM2ZlOEwzNTlONlJiVDNnd0k2SHpsdw?utm_campaign=sharing&utm_medium=link&utm_source=htls&ved=2ahUKEwjw8ur3wJr9AhVyBRwAHcDoCHoQyvcEegQIAxBE&ts=CAESABogCgIaABIaEhQKBwjnDxACGBMSBwjnDxACGBQYATICEAAqCQoFOgNVU0QaAA&rp=OAE",
      "images":[
         "https://lh5.googleusercontent.com/p/AF1QipMVQjGzCM8CwfDJrdrmY_CCdzwa-LQeSuF2bs5O=w592-h404-n-k-no-v1",
         "https://lh5.googleusercontent.com/p/AF1QipNSGmtQ44aKHkdXo9yU1jqZgjcu9azk7gZ4An1J=w592-h404-n-k-no-v1"
      ],
      "title":"Coconut Waikiki Hotel",
      "rating":4.2,
      "reviews":802,
      "stars":3,
      "options":[
         "Free Wi-Fi",
         "Parking ($)"
      ],
      "adFrom":"From  Booking.com"
   },
  ... and other hotels
]
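
Because rating falls back to the string "No rating", the field can hold either a number or a string, which matters when you post-process the results. A minimal sketch of sorting by rating; the first two entries are trimmed from the output above, the third is made up to show the fallback, and sortByRating is a hypothetical helper:

```javascript
// Sample shaped like the output above (only the fields used here);
// the third entry is made up to show the "No rating" fallback.
const hotels = [
  { title: "Hilton Waikiki Beach", rating: 4.1, reviews: 823 },
  { title: "Coconut Waikiki Hotel", rating: 4.2, reviews: 802 },
  { title: "Example Hotel (hypothetical)", rating: "No rating", reviews: "No reviews" },
];

// Hypothetical helper: keeps hotels with a numeric rating and sorts best-first.
function sortByRating(list) {
  return list
    .filter((hotel) => typeof hotel.rating === "number")
    .sort((a, b) => b.rating - a.rating);
}

console.log(sortByRating(hotels).map((hotel) => hotel.title));
// prints [ 'Coconut Waikiki Hotel', 'Hilton Waikiki Beach' ]
```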