Serpdog

Posted on Aug 13, 2022 • Updated on Oct 12, 2022 • Originally published at serpdog.io

How to scrape Google Maps Reviews?

#javascript #webscraping #node #tutorial

In this post, we will learn to scrape Google Maps Reviews.

Requirements:

Before we begin, we need a Node JS project with two packages installed: Unirest JS and Cheerio JS. Both are available on npm and can be installed with `npm i unirest cheerio`. Method 2 additionally uses Puppeteer (`npm i puppeteer`).

We will use Unirest JS for extracting our raw HTML data and Cheerio JS for parsing our extracted HTML data.

Target:

Eiffel Tower Google Maps Results

We will scrape the user reviews of the Eiffel Tower.

Process:

Method 1 - Using Google Maps Network URL

With the setup done, we can write our scraper. We will use the NPM library Unirest JS to make a GET request to our target URL and get the raw HTML data. Then we will use Cheerio JS to parse the extracted HTML.

We will target this URL:

`https://www.google.com/async/reviewDialog?hl=en_us&async=feature_id:${data_ID},next_page_token:${next_page_token},sort_by:qualityScore,start_index:,associated_topic:,_fmt:pc`

Where,
data_ID - Data ID is a unique ID given to a particular location in Google Maps.
next_page_token - The next_page_token is used to get the next page results.
sort_by - It is used for sorting and filtering results.

The various values of sort_by are:

qualityScore - the most relevant reviews.
newestFirst - the most recent reviews.
ratingHigh - the highest-rated reviews.
ratingLow - the lowest-rated reviews.
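
As a minimal sketch, the target URL can be assembled from these parameters with a small helper. The function name `buildReviewsUrl` and its option names are hypothetical, and leaving `start_index` and `associated_topic` empty simply mirrors the URL template shown above.

```javascript
// Hypothetical helper: assembles the reviewDialog URL from its parameters.
const SORT_VALUES = ["qualityScore", "newestFirst", "ratingHigh", "ratingLow"];

function buildReviewsUrl({ dataId, nextPageToken = "", sortBy = "qualityScore" }) {
  if (!SORT_VALUES.includes(sortBy)) {
    throw new Error(`Unknown sort_by value: ${sortBy}`);
  }
  // The parameters are comma-separated inside the single "async" query value,
  // so the string is built by hand instead of with URLSearchParams.
  const asyncParams = [
    `feature_id:${dataId}`,
    `next_page_token:${nextPageToken}`,
    `sort_by:${sortBy}`,
    "start_index:",
    "associated_topic:",
    "_fmt:pc",
  ].join(",");
  return `https://www.google.com/async/reviewDialog?hl=en_us&async=${asyncParams}`;
}

console.log(
  buildReviewsUrl({
    dataId: "0x47e66e2964e34e2d:0x8ddca9ee380ef7e0",
    sortBy: "newestFirst",
  })
);
```

Rejecting unknown `sort_by` values up front is a design choice of this sketch, not something the endpoint is known to require.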

Now, how do we get the Data ID of a place? Search for the place on Google Maps and look at the URL in the address bar:

https://www.google.com/maps/place/Eiffel+Tower/@48.8583701,2.2922926,17z/data=!4m7!3m6!1s0x47e66e2964e34e2d:0x8ddca9ee380ef7e0!8m2!3d48.8583701!4d2.2944813!9m1!1b1

In this URL, the part after !4m7!3m6!1s and before !8m2! is the Data ID.
So, in this case, the Data ID is 0x47e66e2964e34e2d:0x8ddca9ee380ef7e0
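
The same extraction can be done in code. The following is a small sketch, assuming the Data ID always follows `!1s` and has the `0x...:0x...` shape seen in the URL above; the helper name `extractDataId` is hypothetical.

```javascript
// Hypothetical helper: pulls the Data ID out of a Google Maps place URL.
function extractDataId(mapsUrl) {
  // The Data ID sits right after "!1s" and looks like 0x<hex>:0x<hex>.
  const match = mapsUrl.match(/!1s(0x[0-9a-f]+:0x[0-9a-f]+)/i);
  return match ? match[1] : null;
}

const placeUrl =
  "https://www.google.com/maps/place/Eiffel+Tower/@48.8583701,2.2922926,17z/data=!4m7!3m6!1s0x47e66e2964e34e2d:0x8ddca9ee380ef7e0!8m2!3d48.8583701!4d2.2944813!9m1!1b1";

console.log(extractDataId(placeUrl)); // 0x47e66e2964e34e2d:0x8ddca9ee380ef7e0
```

The function returns `null` when the URL contains no Data ID, so the caller can decide how to handle that case.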

You can also use Serpdog's Google Maps Data ID API to retrieve the Data ID of any place.

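  const axios = require('axios');

  // Passing the query through `params` lets axios URL-encode the spaces in "Statue Of Liberty".
  axios.get('https://api.serpdog.io/dataId', {
    params: { api_key: 'APIKEY', q: 'Statue Of Liberty', gl: 'us' }
  })
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log(error);
  });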

  Result:
  {
  "meta": {
    "api_key": "APIKEY",
    "q": "Statue Of Liberty",
    "gl": "us"
  },
  "placeDetails": [
    {
      "Address": " New York, NY 10004"
    },
    {
      "Phone": " (212) 363-3200"
    },
    {
      "dataId": "0x89c25090129c363d:0x40c6a5770d25022b"
    }
  ]
  }
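
Because `placeDetails` is an array of single-key objects, reading the Data ID takes one extra step. Here is a hedged sketch that works on a copy of the sample result above rather than a live response; the helper name `flattenPlaceDetails` is hypothetical.

```javascript
// Sample response copied from the result above (not fetched live).
const sampleResponse = {
  meta: { api_key: "APIKEY", q: "Statue Of Liberty", gl: "us" },
  placeDetails: [
    { Address: " New York, NY 10004" },
    { Phone: " (212) 363-3200" },
    { dataId: "0x89c25090129c363d:0x40c6a5770d25022b" },
  ],
};

// Hypothetical helper: merges the single-key objects into one flat object
// and trims the leading spaces seen in the sample values.
function flattenPlaceDetails(response) {
  const merged = Object.assign({}, ...(response.placeDetails || []));
  return Object.fromEntries(
    Object.entries(merged).map(([key, value]) => [
      key,
      typeof value === "string" ? value.trim() : value,
    ])
  );
}

const details = flattenPlaceDetails(sampleResponse);
console.log(details.dataId);
```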

Our target URL should look like this:

https://www.google.com/async/reviewDialog?hl=en_us&async=feature_id:0x47e66e2964e34e2d:0x8ddca9ee380ef7e0,next_page_token:,sort_by:qualityScore,start_index:,associated_topic:,_fmt:pc

Paste this URL into your browser and press Enter. A text file will be downloaded. Rename it with an .html extension and open it in your code editor. In this HTML file, we will search for the elements we want from the response.

We will first parse the location information of the place, which contains - the location name, address, average rating, and total reviews.

![Scrape Google Maps Reviews 3](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nuy0kcnu39swwwau71uk.png)

From the above image, the selector for the location name is .P5Bobd, for the address .T6pBCe, for the average rating span.Aq14fc, and for the total number of reviews span.z5jxId.

With the location information done, we will move on to parsing the Data ID and the next_page_token.

Search for the class .lcorif, which you can find in the second line of the above image. Under this element, the Data ID is on .loris (in its data-fid attribute) and the next_page_token is on .gws-localreviews__general-reviews-block (in its data-next-page-token attribute).

Now, we will search for the elements which contain data about the user and their review.
Search for the class .gws-localreviews__google-review.

This element contains all the information about a user and their review.
We will now parse the extracted HTML for the user's name, link, thumbnail, number of reviews, rating, review, and the images posted by the user, which makes our code look like this:

const unirest = require("unirest");
const cheerio = require("cheerio");

const getReviewsData = () => {
  return unirest
    .get("https://www.google.com/async/reviewDialog?hl=en_us&async=feature_id:0x47e66e2964e34e2d:0x8ddca9ee380ef7e0,next_page_token:,sort_by:qualityScore,start_index:,associated_topic:,_fmt:pc")
    .headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    })
    .then((response) => {
      const $ = cheerio.load(response.body);

      let user = [], location_info, data_id, token;

      $(".lcorif").each((i, el) => {
        data_id = $(".loris").attr("data-fid");
        token = $(".gws-localreviews__general-reviews-block").attr(
          "data-next-page-token"
        );
        location_info = {
          title: $(".P5Bobd").text(),
          address: $(".T6pBCe").text(),
          avgRating: $("span.Aq14fc").text(),
          totalReviews: $("span.z5jxId").text(),
        };
      });

      // Each .gws-localreviews__google-review element holds one user's review.
      $(".gws-localreviews__google-review").each((i, el) => {
        user.push({
          name: $(el).find(".TSUbDb").text(),
          link: $(el).find(".TSUbDb a").attr("href"),
          thumbnail: $(el).find(".lDY1rd").attr("src"),
          numOfreviews: $(el).find(".Msppse").text(),
          rating: $(el).find(".EBe2gf").attr("aria-label"),
          review: $(el).find(".Jtu6Td").text(),
          // Each image is a CSS background: strip the 21-character
          // "background-image:url(" prefix and the closing ")".
          images: $(el)
            .find(".EDblX .JrO5Xe")
            .toArray()
            .map((img) => $(img).attr("style") || "")
            .map((style) => style.substring(21, style.lastIndexOf(")"))),
        });
      });

      console.log("LOCATION INFO: ");
      console.log(location_info);
      console.log("DATA ID:");
      console.log(data_id);
      console.log("TOKEN:");
      console.log(token);
      console.log("USER:");
      console.log(user);
    });
};

getReviewsData();
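
The `substring(21, ...)` call in the `images` field relies on the style attribute starting with exactly `background-image:url(`, which is 21 characters long. A regular expression is a slightly more forgiving alternative. The sketch below is only an illustration with made-up style strings, and the helper name `extractBackgroundUrl` is hypothetical.

```javascript
// Hypothetical helper: returns the URL inside a CSS url(...) value, or null.
function extractBackgroundUrl(style) {
  if (!style) return null;
  // Capture whatever sits inside url(...), with or without quotes.
  const match = style.match(/url\(\s*["']?(.*?)["']?\s*\)/);
  return match ? match[1] : null;
}

// Made-up sample values in the two common notations.
console.log(extractBackgroundUrl("background-image:url(https://example.com/photo.jpg)"));
console.log(extractBackgroundUrl('background-image: url("https://example.com/photo.jpg")'));
```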

You can also check some of my other Google scrapers in my Git Repository: https://github.com/Darshan972/GoogleScrapingBlogs

Result:

The code prints the location info, the Data ID, the next-page token, and the user reviews.
These are the first ten reviews. To get the next ten, put the token returned by our code into the URL below:

https://www.google.com/async/reviewDialog?hl=en_us&async=feature_id:0x47e66e2964e34e2d:0x8ddca9ee380ef7e0,next_page_token:tokenFromResponse,sort_by:qualityScore,start_index:,associated_topic:,_fmt:pc

In this case, the token is CAESBkVnSUlDZw==.
You can get the reviews for each subsequent page using the token from the previous page.
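
Following tokens from page to page is a simple loop. Here is a hedged sketch of it: `fetchPage(token)` is an assumed function (not part of the code above) that resolves to `{ reviews, token }` for one page, and treating an empty token as "no more pages" is also an assumption. A fake `fetchPage` stands in for the real scraper so the sketch runs without network access.

```javascript
// Hypothetical pagination loop over next_page_token values.
async function collectReviews(fetchPage, maxPages = 3) {
  const all = [];
  let token = ""; // an empty token requests the first page
  for (let pageNo = 0; pageNo < maxPages; pageNo++) {
    const { reviews, token: nextToken } = await fetchPage(token);
    all.push(...reviews);
    if (!nextToken) break; // assumed: no token means there is no further page
    token = nextToken;
  }
  return all;
}

// Stand-in for the real scraper: two fake pages keyed by token.
const fakePages = {
  "": { reviews: ["review 1", "review 2"], token: "CAESBkVnSUlDZw==" },
  "CAESBkVnSUlDZw==": { reviews: ["review 3"], token: "" },
};
const fakeFetchPage = async (token) => fakePages[token];

collectReviews(fakeFetchPage).then((reviews) => console.log(reviews));
```

To use this with the scraper above, `getReviewsData` would need to accept the token as a parameter and return the `user` array and `token` instead of only logging them.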

Method 2 - Using Puppeteer Infinite Scrolling

Another method for scraping Google Maps Reviews is infinite scrolling with Puppeteer. First, let us open the reviews page of Google Maps in our browser.

Here is the URL:

https://www.google.com/maps/place/Eiffel+Tower/@48.8583701,2.2944813,15z/data=!4m7!3m6!1s0x0:0x8ddca9ee380ef7e0!8m2!3d48.8583701!4d2.2944813!9m1!1b1

Now, we will write the main function, in which we first navigate to the target URL and extract the average rating, the total number of reviews, and the number of reviews per star rating.

 const getMapsData = async () => {
    try {
        let url =
        "https://www.google.com/maps/place/Eiffel+Tower/@48.8583701,2.2944813,15z/data=!4m7!3m6!1s0x0:0x8ddca9ee380ef7e0!8m2!3d48.8583701!4d2.2944813!9m1!1b1";
        // Declare the browser with const instead of creating an implicit global.
        const browser = await puppeteer.launch({
        args: ["--disable-setuid-sandbox", "--no-sandbox"],
        headless: false
        });
        const page = await browser.newPage();

        await page.goto(url, { waitUntil: "domcontentloaded" , timeout: 60000});
        await page.waitForTimeout(3000);

        let ratings = await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".PPCwl")).map((el) => {
            return {
            avg_rating: el.querySelector(".fontDisplayLarge")?.textContent.trim(),
            total_reviews: el.querySelector(".fontBodySmall")?.textContent.trim(),
            five_stars: el.querySelector(".ExlQHd tbody tr:nth-child(1)").getAttribute("aria-label").split("stars, ")[1].trim(),
            four_stars: el.querySelector(".ExlQHd tbody tr:nth-child(2)").getAttribute("aria-label").split("stars, ")[1].trim(),
            three_stars: el.querySelector(".ExlQHd tbody tr:nth-child(3)").getAttribute("aria-label").split("stars, ")[1].trim(),
            two_stars: el.querySelector(".ExlQHd tbody tr:nth-child(4)").getAttribute("aria-label").split("stars, ")[1].trim(),
            one_stars: el.querySelector(".ExlQHd tbody tr:nth-child(5)").getAttribute("aria-label").split("stars, ")[1].trim(),
            };
        });
        });

        console.log(ratings)

        let data =  await scrollPage(page,'.DxyBCb', 10);

        console.log(data);
        await browser.close();
    } catch (e) {
        console.log(e);
    }
   };

Step-by-step explanation:

puppeteer.launch() - This method will launch the Chromium browser with the options we have set in our code. In our case, we are launching our browser in non-headless mode.
browser.newPage() - This will open a new page or tab in the browser.
page.goto() - This will navigate the page to the specified target URL.
page.waitForTimeout() - This pauses execution for the number of milliseconds passed as a parameter before doing further operations.
scrollPage() - Finally, we call our infinite scroller with the page, the selector of the scrollable div, and the number of items we want as parameters.
browser.close() - This will close the browser.
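
A note on `page.waitForTimeout()`: it was deprecated and then removed in more recent Puppeteer releases, so depending on the installed version it may not exist. A plain promise-based delay is a common stand-in. The helper name `delay` below is just an illustration.

```javascript
// Resolves after `ms` milliseconds; can stand in for `await page.waitForTimeout(ms)`.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  const started = Date.now();
  await delay(100);
  console.log(`waited about ${Date.now() - started} ms`);
})();
```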

After this, we will move to our infinite scroller function.

    const scrollPage = async (page, scrollContainer, itemTargetCount) => {
        let items = [];
        while (itemTargetCount > items.length) {
            items = await extractItems(page);
            // Remember the container height before scrolling.
            const previousHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
            // Scroll the container to its bottom to trigger loading of more reviews.
            await page.evaluate(`document.querySelector("${scrollContainer}").scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
            // waitForFunction (unlike evaluate) actually waits until the container has grown.
            await page.waitForFunction(`document.querySelector("${scrollContainer}").scrollHeight > ${previousHeight}`);
            await page.waitForTimeout(2000);
        }
        return items;
    }

Step-by-step explanation:

previousHeight - The scroll height of the container before we scroll.
extractItems() - Function to parse the scraped HTML.
Next, we scroll the container down to its current scroll height, that is, to the bottom.
Finally, we wait until the container's scroll height grows beyond previousHeight, which means more reviews have loaded, and then pause for two seconds.
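
One caveat: if the page holds fewer reviews than `itemTargetCount`, a loop like this cannot end normally. Below is a hedged sketch of the same idea with a stop condition, written against two injected callbacks so it can run without a browser. The names `scrollUntil`, `getItems`, `scrollOnce`, and `maxIdleRounds` are hypothetical; in the scraper, `getItems` would wrap `extractItems(page)` and `scrollOnce` would wrap the `scrollTo` call plus the pause.

```javascript
// Hypothetical scroll loop that gives up after several rounds without new items.
async function scrollUntil({ getItems, scrollOnce, target, maxIdleRounds = 3 }) {
  let items = await getItems();
  let idleRounds = 0;
  while (items.length < target && idleRounds < maxIdleRounds) {
    await scrollOnce();
    const next = await getItems();
    // Count consecutive rounds that loaded nothing new.
    idleRounds = next.length > items.length ? 0 : idleRounds + 1;
    items = next;
  }
  return items.slice(0, target);
}

// Fake "page" that only has 5 reviews and loads 2 more per scroll.
let loaded = 2;
const fake = {
  getItems: async () => Array.from({ length: loaded }, (_, i) => `review ${i + 1}`),
  scrollOnce: async () => {
    loaded = Math.min(loaded + 2, 5);
  },
};

scrollUntil({ ...fake, target: 10 }).then((items) => console.log(items.length)); // 5
```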

After this, we will parse the HTML in the extractItems function.

    async function extractItems(page) {
        const reviews = await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".jftiEf")).map((el) => {
        return {
            user: {
            name: el.querySelector(".d4r55")?.textContent.trim(),
            thumbnail: el.querySelector("a.WEBjve img")?.getAttribute("src"),
            localGuide: el.querySelector(".RfnDt span:nth-child(1)")?.style.display === "none" ?  false : true,
            reviews: el.querySelector(".RfnDt span:nth-child(2)")?.textContent.replace("·", "").replace("reviews", "").trim(),
            link: el.querySelector("a.WEBjve")?.getAttribute("href"),
            },
            rating: el.querySelector(".kvMYJc")?.getAttribute("aria-label").trim(),
            date: el.querySelector(".rsqaWe")?.textContent,
            review: el.querySelector(".wiI7pd")?.textContent.trim(),
            images: Array.from(el.querySelectorAll(".KtCyie button")).length
            ? Array.from(el.querySelectorAll(".KtCyie button")).map((el) => {
                return {
                thumbnail: getComputedStyle(el).backgroundImage.split('")')[0].replace('url("',""),
                };
            })
            : "",
          };
            });
        });
        return reviews;
        }

Step-by-step explanation:

document.querySelectorAll() - It will return all the elements that match the specified CSS selector. In our case, it is .jftiEf.
getAttribute() - This will return the attribute value of the specified element.
textContent - It returns the text content inside the selected HTML element.
split() - Used to split a string into substrings with the help of a specified separator and return them as an array.
trim() - Removes whitespace from the start and end of the string.
replace() - Replaces the first occurrence of the specified pattern in the string.
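
To see these string methods in isolation, here is a short sketch that applies the same calls to sample strings. The sample values are assumptions shaped like what the selectors appear to return, not captured page data.

```javascript
// Sample strings in the assumed formats of the scraped values.
const ariaLabel = "5 stars, 243,237 reviews";
const backgroundImage = 'url("https://example.com/photo.jpg")';
const reviewCount = " · 554 reviews";

// split() + trim(): keep only the part after "stars, ".
const fiveStars = ariaLabel.split("stars, ")[1].trim();

// split() + replace(): strip the url("...") wrapper from the CSS value.
const thumbnail = backgroundImage.split('")')[0].replace('url("', "");

// replace() only swaps the first match, which is enough for a single "·".
const reviews = reviewCount.replace("·", "").replace("reviews", "").trim();

console.log(fiveStars); // 243,237 reviews
console.log(thumbnail); // https://example.com/photo.jpg
console.log(reviews); // 554
```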

Here is the full code:

    const puppeteer = require("puppeteer");

    async function extractItems(page) {
        const reviews = await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".jftiEf")).map((el) => {
            return {
            user: {
                name: el.querySelector(".d4r55")?.textContent.trim(),
                thumbnail: el.querySelector("a.WEBjve img")?.getAttribute("src"),
                localGuide: el.querySelector(".RfnDt span:nth-child(1)")?.style.display === "none" ?  false : true,
                reviews: el.querySelector(".RfnDt span:nth-child(2)")?.textContent.replace("·", "").replace("reviews", "").trim(),
                link: el.querySelector("a.WEBjve")?.getAttribute("href"),
            },
            rating: el.querySelector(".kvMYJc")?.getAttribute("aria-label").trim(),
            date: el.querySelector(".rsqaWe")?.textContent,
            review: el.querySelector(".wiI7pd")?.textContent.trim(),
            images: Array.from(el.querySelectorAll(".KtCyie button")).length ? Array.from(el.querySelectorAll(".KtCyie button")).map((el) => {
                return {
                    thumbnail: getComputedStyle(el).backgroundImage.split('")')[0].replace('url("',""),
                };
                })
            : "",
            };
        });
        });
        return reviews;
    }

    const scrollPage = async (page, scrollContainer, itemTargetCount) => {
        let items = [];
        while (itemTargetCount > items.length) {
            items = await extractItems(page);
            // Remember the container height before scrolling.
            const previousHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
            // Scroll the container to its bottom to trigger loading of more reviews.
            await page.evaluate(`document.querySelector("${scrollContainer}").scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
            // waitForFunction (unlike evaluate) actually waits until the container has grown.
            await page.waitForFunction(`document.querySelector("${scrollContainer}").scrollHeight > ${previousHeight}`);
            await page.waitForTimeout(2000);
        }
        return items;
    }

    const getMapsData = async () => {
        try {
        let url =
            "https://www.google.com/maps/place/Eiffel+Tower/@48.8583701,2.2944813,15z/data=!4m7!3m6!1s0x0:0x8ddca9ee380ef7e0!8m2!3d48.8583701!4d2.2944813!9m1!1b1";
        // Declare the browser with const instead of creating an implicit global.
        const browser = await puppeteer.launch({
            args: ["--disable-setuid-sandbox", "--no-sandbox"],
            headless: false
        });
        const [page] = await browser.pages();

        await page.goto(url, { waitUntil: "domcontentloaded" , timeout: 60000});
        await page.waitForTimeout(3000);

        let ratings = await page.evaluate(() => {
            return Array.from(document.querySelectorAll(".PPCwl")).map((el) => {
            return {
                avg_rating: el.querySelector(".fontDisplayLarge")?.textContent.trim(),
                total_reviews: el.querySelector(".fontBodySmall")?.textContent.trim(),
                five_stars: el.querySelector(".ExlQHd tbody tr:nth-child(1)").getAttribute("aria-label").split("stars, ")[1].trim(),
                four_stars: el.querySelector(".ExlQHd tbody tr:nth-child(2)").getAttribute("aria-label").split("stars, ")[1].trim(),
                three_stars: el.querySelector(".ExlQHd tbody tr:nth-child(3)").getAttribute("aria-label").split("stars, ")[1].trim(),
                two_stars: el.querySelector(".ExlQHd tbody tr:nth-child(4)").getAttribute("aria-label").split("stars, ")[1].trim(),
                one_stars: el.querySelector(".ExlQHd tbody tr:nth-child(5)").getAttribute("aria-label").split("stars, ")[1].trim(),
            };
            });
        });

        console.log(ratings)

        let data =  await scrollPage(page,'.DxyBCb',10);

        console.log(data);
        await browser.close();
        } catch (e) {
        console.log(e);
        }
    };
    getMapsData();

Our results should look like this 👇🏻:

  [
   {
    avg_rating: '4.6',
    total_reviews: '3,10,611 reviews',
    five_stars: '243,237 reviews',
    four_stars: '42,702 reviews',
    three_stars: '13,474 reviews',
    two_stars: '4,163 reviews',
    one_stars: '7,035 reviews'
   }
  ]
  [
   {
    user: {
      name: 'Wagner Castro',
      thumbnail: 'https://lh3.googleusercontent.com/a-/ACNPEu9wP6T1uyo2ga98cVBzIW0uH6NMyA2vX7KWB26hFeQ=w36-h36-p-c0x00000000-rp-mo-ba6-br100',
      localGuide: true,
      reviews: '554',
      link: 'https://www.google.com/maps/contrib/113391288797697364105/reviews?hl=en-US'
    },
    rating: '5 stars',
    date: '2 months ago',
    review: 'Paris is an incredible experience with innumerable museums, parks, restaurants and  beautiful sites but the Eiffel Tower is one of the most interesting places to visit. …',
    images: [
      [Object], [Object], [Object],
      [Object], [Object], [Object],
      [Object], [Object], [Object],
      [Object], [Object], [Object],
      [Object], [Object], [Object],
      [Object], [Object], [Object],
      [Object], [Object], [Object],
      [Object], [Object], [Object],
      [Object], [Object], [Object],
      [Object], [Object]
    ]
  },
  .......
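
The counts and ratings in this output are still display strings. If numbers are needed, a small post-processing step can convert them. This is a sketch only: the helper names `toCount` and `toRating` are hypothetical, and the sample values are taken from the output above.

```javascript
// Strips everything except digits, so "3,10,611 reviews" and "243,237 reviews" both parse.
const toCount = (text) => Number(text.replace(/\D/g, ""));

// Reads the leading number from strings such as "5 stars" or "4.6".
const toRating = (text) => parseFloat(text);

console.log(toCount("3,10,611 reviews")); // 310611
console.log(toCount("243,237 reviews")); // 243237
console.log(toRating("5 stars")); // 5
console.log(toRating("4.6")); // 4.6
```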

The main disadvantage of this method is that it is quite slow. If you want to scrape a very large number of reviews, I do not recommend it, as it can easily crash the browser.

With Google Maps Reviews API:

Serpdog | Google Search API offers you 100 free requests on sign-up.
Scraping can take a lot of time, and ready-made structured JSON data can save you much of it.

const axios = require('axios');

axios.get('https://api.serpdog.io/reviews?api_key=APIKEY&data_id=0x89c25090129c363d:0x40c6a5770d25022b')
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log(error);
  });