Mikhail Zub for SerpApi

Web scraping Yelp Filters with Nodejs

What will be scraped

[Image: the Yelp search filters that will be scraped]

Full code

If you don't need an explanation, have a look at the full code example in the online IDE

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const serchQuery = "pizza"; //Parameter defines the query you want to search
const location = "Seattle, WA"; //Parameter defines from where you want the search to originate

const searchParams = {
  query: encodeURI(serchQuery),
  location: encodeURI(location),
};

const URL = `https://www.yelp.com/search?find_desc=${searchParams.query}&find_loc=${searchParams.location}`;

async function getFiltersFromPage(page) {
  const priceAndDistance = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("aside[aria-labelledby='search-vertical-filter-panel-label'] > div > div")).reduce((result, el) => {
      if (!el.querySelector(":scope > div > div:nth-child(2)")) {
        return {
          ...result,
          price: Array.from(el.querySelectorAll(":scope > div > div:nth-child(1) button")).map((el) => {
            const text = el.querySelector("span").textContent;
            return {
              text,
              value: `RestaurantsPriceRange2.${text.length}`,
            };
          }),
        };
      } else {
        const filterTitle = el.querySelector(":scope > div > div:nth-child(1) p").textContent;
        if (filterTitle === "Distance") {
          return {
            ...result,
            distance: Array.from(el.querySelectorAll(":scope > div > div:nth-child(2) label")).map((el) => ({
              text: el.querySelector("span").textContent,
              value: el.querySelector("input").value,
            })),
          };
        } else return result;
      }
    }, {});
  });
  const filters = { ...priceAndDistance };
  const seeAllButtons = await page.$$("aside[aria-labelledby='search-vertical-filter-panel-label'] > div > div a");
  for (const button of seeAllButtons) {
    await button.click();
    await page.waitForTimeout(2000);
    const filterTitle = await page.evaluate(() =>
      document.querySelector("#modal-portal-container div[aria-modal] div[role='presentation'] h4").textContent.split(" ")[1].toLowerCase()
    );
    filters[`${filterTitle}`] = await page.evaluate(() => {
      return Array.from(document.querySelectorAll("#modal-portal-container div[aria-modal] div[role='presentation'] li")).map((el) => ({
        text: el.querySelector("span").textContent,
        value: el.querySelector("input").value,
      }));
    });
    await page.click("#modal-portal-container div[aria-modal] div[role='presentation'] button[aria-label='Close']");
    await page.waitForTimeout(2000);
  }
  return filters;
}

async function getFilters() {
  const browser = await puppeteer.launch({
    headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();
  await page.setViewport({ width: 1600, height: 800 });
  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);

  const filters = await getFiltersFromPage(page);

  await browser.close();

  return filters;
}

getFilters().then((result) => console.dir(result, { depth: null }));

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (Puppeteer can also drive Chrome or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, in the directory with our project, open the command line and enter:

$ npm init -y

And then:

$ npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth, so that websites are less likely to detect that you are using headless Chromium or a web driver. You can check it on the Chrome headless tests website. The screenshot below shows the difference.

[Image: Chrome headless test results with and without the stealth plugin]

Process

We need to extract data from HTML elements. Finding the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.

We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.

The Gif below illustrates the approach of selecting different parts of the results using SelectorGadget.

[GIF: selecting parts of the results with SelectorGadget]

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to keep websites from detecting that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, define what we want to search for (the serchQuery constant) and the search location, encode both with the encodeURI method to make the search parameters, and build the search URL:

puppeteer.use(StealthPlugin());

const serchQuery = "pizza"; //Parameter defines the query you want to search
const location = "Seattle, WA"; //Parameter defines from where you want the search to originate

const searchParams = {
  query: encodeURI(serchQuery),
  location: encodeURI(location),
};

const URL = `https://www.yelp.com/search?find_desc=${searchParams.query}&find_loc=${searchParams.location}`;
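One caveat with the snippet above: encodeURI leaves characters such as & and , unescaped, so a query like "pizza & pasta" would be split into two URL parameters. Below is a minimal sketch of an alternative, assuming the standard URL and URLSearchParams classes that ship with Node.js. The buildSearchUrl helper is hypothetical and not part of the original code:

```javascript
// Hypothetical helper (not in the original code): builds the Yelp search URL
// from raw, unencoded values and lets URLSearchParams escape them.
function buildSearchUrl(query, location) {
  const url = new URL("https://www.yelp.com/search");
  url.searchParams.set("find_desc", query);
  url.searchParams.set("find_loc", location);
  return url.toString();
}

console.log(buildSearchUrl("pizza & pasta", "Seattle, WA"));
// https://www.yelp.com/search?find_desc=pizza+%26+pasta&find_loc=Seattle%2C+WA
```

With this approach the raw strings are passed in as they are, without calling encodeURI first, otherwise they would be encoded twice.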

Next, we write a function to get filters from the page:

async function getFiltersFromPage(page) {
  ...
}

Then, we get price and distance filters info from the page context (using evaluate() method) and save it in the priceAndDistance object:

const priceAndDistance = await page.evaluate(() => {
    ...
});

Next, we make a new array (Array.from() method) from all elements matching the "aside[aria-labelledby='search-vertical-filter-panel-label'] > div > div" selector (querySelectorAll() method) and, using the reduce method, build an object from that array and return it:

return Array.from(document.querySelectorAll("aside[aria-labelledby='search-vertical-filter-panel-label'] > div > div")).reduce((result, el) => {
    ...
}, {});

In the reduce callback we check whether the ":scope > div > div:nth-child(2)" selector is absent (using the querySelector() method). If it is, the element is the price block, and we return the price filters (using the querySelectorAll() method and the textContent property).

Otherwise (else statement), we get the filter category title and return only the "Distance" filters, because the other categories show only a few of their filters in the panel and hide the rest; we get those in full later:

if (!el.querySelector(":scope > div > div:nth-child(2)")) {
  return {
    ...result,
    price: Array.from(el.querySelectorAll(":scope > div > div:nth-child(1) button")).map((el) => {
      const text = el.querySelector("span").textContent;
      return {
        text,
        value: `RestaurantsPriceRange2.${text.length}`,
      };
    }),
  };
} else {
  const filterTitle = el.querySelector(":scope > div > div:nth-child(1) p").textContent;
  if (filterTitle === "Distance") {
    return {
      ...result,
      distance: Array.from(el.querySelectorAll(":scope > div > div:nth-child(2) label")).map((el) => ({
        text: el.querySelector("span").textContent,
        value: el.querySelector("input").value,
      })),
    };
  } else return result;
}
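The accumulator pattern may be easier to follow without the DOM calls. The sketch below runs the same reduce logic over plain objects. The shape of the blocks array is an assumption made for this illustration and does not mirror Yelp's markup:

```javascript
// Plain-data stand-in for the filter blocks in the sidebar.
// The object shape is an assumption made for this sketch.
const blocks = [
  { title: null, options: ["$", "$$", "$$$", "$$$$"] }, // the price block has no title
  { title: "Distance", options: ["Bird's-eye View", "Driving (5 mi.)"] },
  { title: "Features", options: ["Reservations"] }, // skipped here, collected later
];

const priceAndDistance = blocks.reduce((result, block) => {
  if (block.title === null) {
    // Same rule as the scraper: the value is derived from the number of "$" signs.
    const price = block.options.map((text) => ({
      text,
      value: `RestaurantsPriceRange2.${text.length}`,
    }));
    return { ...result, price };
  }
  if (block.title === "Distance") {
    return { ...result, distance: block.options.map((text) => ({ text })) };
  }
  return result; // any other category leaves the accumulator unchanged
}, {});

console.log(Object.keys(priceAndDistance)); // [ 'price', 'distance' ]
```

Each iteration returns either a copy of the accumulator with one more key or the accumulator itself, so categories that are skipped never appear in the result.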

Next, we copy priceAndDistance into the filters constant (using spread syntax) and get the "See all" buttons of the other filter categories with the $$() method:

const filters = { ...priceAndDistance };
const seeAllButtons = await page.$$("aside[aria-labelledby='search-vertical-filter-panel-label'] > div > div a");

Next, we iterate over seeAllButtons (for...of loop), click each button (element.click() method), wait 2 seconds (using the waitForTimeout method), get the filter category title from the modal that opens, and add the filters listed in that modal to the filters object under this title. Then we click the "Close" button (page.click() method), wait 2 seconds and repeat the loop for the next category.

To get data from the page we again use the evaluate(), querySelector() and querySelectorAll() methods:

for (const button of seeAllButtons) {
  await button.click();
  await page.waitForTimeout(2000);
  const filterTitle = await page.evaluate(() =>
    document.querySelector("#modal-portal-container div[aria-modal] div[role='presentation'] h4").textContent.split(" ")[1].toLowerCase()
  );
  filters[`${filterTitle}`] = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("#modal-portal-container div[aria-modal] div[role='presentation'] li")).map((el) => ({
      text: el.querySelector("span").textContent,
      value: el.querySelector("input").value,
    }));
  });
  await page.click("#modal-portal-container div[aria-modal] div[role='presentation'] button[aria-label='Close']");
  await page.waitForTimeout(2000);
}
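The key under which each modal's filters are stored comes from the second word of the modal heading. Here is a small sketch of that step in isolation. The heading text "More Features" is an assumption about what Yelp renders, and headingToKey is a hypothetical helper that also guards against a heading with fewer than two words, which the one-liner in the loop does not:

```javascript
// Hypothetical helper: turns a modal heading into a key for the filters object.
function headingToKey(heading) {
  const words = heading.trim().split(/\s+/);
  // Fall back to the first word when there is no second one.
  return (words[1] ?? words[0]).toLowerCase();
}

console.log(headingToKey("More Features")); // features
console.log(headingToKey("Neighborhoods")); // neighborhoods
```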

Next, write a function to control the browser, and get information:

async function getFilters() {
  ...
}

In this function we first define browser using the puppeteer.launch({options}) method with the following options: headless: true and args: ["--no-sandbox", "--disable-setuid-sandbox"].

These options mean that we use headless mode and pass an array of arguments that allows the browser process to launch in the online IDE. Then we open a new page and set the page viewport resolution (setViewport() method) so that the filters panel is shown:

const browser = await puppeteer.launch({
  headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();
await page.setViewport({ width: 1600, height: 800 });

Next, we raise the default navigation timeout from 30 seconds to 60000 ms (1 min) with the .setDefaultNavigationTimeout() method to allow for a slow internet connection, and go to the URL with the .goto() method:

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);

And finally, we get filters from the page, close the browser, and return the received data:

const filters = await getFiltersFromPage(page);

await browser.close();

return filters;

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

{
   "price":[
      {
         "text":"$",
         "value":"RestaurantsPriceRange2.1"
      },
      {
         "text":"$$",
         "value":"RestaurantsPriceRange2.2"
      },
        ... and other items
   ],
   "distance":[
      {
         "text":"Bird's-eye View",
         "value":"g:-122.43782043457031,47.55614031294337,-122.23320007324219,47.69497434186282"
      },
      {
         "text":"Driving (5 mi.)",
         "value":"g:-122.38666534423828,47.590651847264034,-122.28435516357422,47.6600691664467"
      },
        ... and other items
   ],
   "categories":[
      {
         "text":"Restaurants",
         "value":"restaurants"
      },
      {
         "text":"Pizza",
         "value":"pizza"
      },
      ... and other items
   ],
   "features":[
      {
         "text":"Reservations",
         "value":"OnlineReservations"
      },
      {
         "text":"Waitlist",
         "value":"OnlineWaitlistReservation"
      },
     ... and other items
   ],
   "neighborhoods":[
      {
         "text":"Admiral",
         "value":"WA:Seattle::Admiral"
      },
      {
         "text":"Alki",
         "value":"WA:Seattle::Alki"
      },
    ... and other items
   ]
}
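Since every entry pairs a human-readable text with the value Yelp expects in the URL, a small lookup is often all that is needed to use this output. The sketch below works on a trimmed copy of the result shown above; findFilterValue is a hypothetical helper, not part of the scraper:

```javascript
// A trimmed copy of the scraper output shown above.
const filters = {
  price: [
    { text: "$", value: "RestaurantsPriceRange2.1" },
    { text: "$$", value: "RestaurantsPriceRange2.2" },
  ],
  features: [
    { text: "Reservations", value: "OnlineReservations" },
    { text: "Waitlist", value: "OnlineWaitlistReservation" },
  ],
};

// Hypothetical helper: returns the URL value for a filter label,
// or undefined when the category or the label is missing.
function findFilterValue(allFilters, category, text) {
  const match = (allFilters[category] ?? []).find((item) => item.text === text);
  return match?.value;
}

console.log(findFilterValue(filters, "price", "$$")); // RestaurantsPriceRange2.2
console.log(findFilterValue(filters, "features", "Patio")); // undefined
```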

How to apply filters

You can apply the scraped filters to a Yelp search with the following URL. To do so, change the searchParams constant in the DIY solution section of our Web scraping Yelp Organic Results with Nodejs and Web scraping Yelp Ads Results with Nodejs blog posts:

const serchQuery = "pizza"; //Parameter defines the query you want to search
const location = "Seattle, WA"; //Parameter defines from where you want the search to originate
const priceAndFeaturesFilter = "RestaurantsPriceRange2.1,OnlineReservations"; // for price and features filters
const categoryFilter = "restaurants"; // for category filters
const locationFilter = "g:-122.43782043457031,47.55614031294337,-122.23320007324219,47.69497434186282"; // for neighborhoods or distance filters (distance and neighborhoods filters can't be used together)


const searchParams = {
  query: encodeURI(serchQuery),
  location: encodeURI(location),
  priceAndFeaturesFilter: encodeURI(priceAndFeaturesFilter),
  categoryFilter: encodeURI(categoryFilter),
  locationFilter: encodeURI(locationFilter),
};

const URL = `https://www.yelp.com/search?find_desc=${searchParams.query}&find_loc=${searchParams.location}&attrs=${searchParams.priceAndFeaturesFilter}&cflt=${searchParams.categoryFilter}&l=${searchParams.locationFilter}`;
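When only some of the filters are needed, the URL can be assembled so that unused parameters are left out. The following is a sketch under stated assumptions: buildFilteredUrl and its option names are hypothetical, it relies on the standard URL class in Node.js for escaping, and it takes raw values rather than ones already passed through encodeURI:

```javascript
// Hypothetical helper: builds a filtered Yelp search URL and skips filters that are not set.
function buildFilteredUrl({ query, location, attrs = [], category, locationFilter }) {
  const url = new URL("https://www.yelp.com/search");
  url.searchParams.set("find_desc", query);
  url.searchParams.set("find_loc", location);
  if (attrs.length > 0) url.searchParams.set("attrs", attrs.join(",")); // price and features filters
  if (category) url.searchParams.set("cflt", category); // category filter
  if (locationFilter) url.searchParams.set("l", locationFilter); // distance or neighborhood filter
  return url.toString();
}

console.log(
  buildFilteredUrl({
    query: "pizza",
    location: "Seattle, WA",
    attrs: ["RestaurantsPriceRange2.1", "OnlineReservations"],
    category: "restaurants",
  })
);
// https://www.yelp.com/search?find_desc=pizza&find_loc=Seattle%2C+WA&attrs=RestaurantsPriceRange2.1%2COnlineReservations&cflt=restaurants
```

Price and features values share the attrs parameter, so they are passed as one array and joined with a comma.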

Using Yelp Filters API from SerpApi

This section compares the DIY solution with our solution.

The biggest difference is that you don't need to create the parser from scratch and maintain it.

There's also a chance that the request might be blocked at some point by Yelp. We handle that on our backend, so there's no need to figure out how to do it yourself or which CAPTCHA solver or proxy provider to use.

First, we need to install google-search-results-nodejs:

npm i google-search-results-nodejs

Here's the full code example, if you don't need an explanation:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); //your API key from serpapi.com

const params = {
  engine: "yelp", // search engine
  device: "desktop", //Parameter defines the device to use to get the results. It can be set to "desktop" (default), "tablet", or "mobile"
  find_loc: "Seattle, WA", //Parameter defines from where you want the search to originate.
  find_desc: "pizza", // Parameter defines the query you want to search
};

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};

const getResults = async () => {
  const json = await getJson();
  return json.filters;
};

getResults().then((result) => console.dir(result, { depth: null }));

Code explanation

First, we need to declare SerpApi from the google-search-results-nodejs library and define a new search instance with your API key from SerpApi:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); //your API key from serpapi.com

Next, we write the necessary parameters for making a request:

const params = {
  engine: "yelp", // search engine
  device: "desktop", //Parameter defines the device to use to get the results. It can be set to "desktop" (default), "tablet", or "mobile"
  find_loc: "Seattle, WA", //Parameter defines from where you want the search to originate.
  find_desc: "pizza", // Parameter defines the query you want to search
};

Next, we wrap the search method from the SerpApi library in a promise to further work with the search results:

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};
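To see the wrapper work without an API key, it can be tried against a stand-in object. In the sketch below, fakeSearch is a hypothetical stub that imitates only the json(params, callback) call shape used above, and the canned response is made up for illustration:

```javascript
// Hypothetical stub imitating the json(params, callback) call shape used above.
const fakeSearch = {
  json(params, callback) {
    // Respond asynchronously with a canned result, as a network call would.
    const response = {
      engine: params.engine,
      filters: { price: [{ text: "$", value: "RestaurantsPriceRange2.1" }] },
    };
    setTimeout(() => callback(response), 0);
  },
};

// Same wrapping technique as getJson, with the client passed in as an argument.
const getJsonFrom = (search, params) =>
  new Promise((resolve) => {
    search.json(params, resolve);
  });

getJsonFrom(fakeSearch, { engine: "yelp" }).then((json) => console.dir(json.filters, { depth: null }));
```

Passing the client in as an argument is a design choice for this sketch only; it keeps the wrapper testable without touching the network.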

And finally, we declare the getResults function, which gets the data and returns it:

const getResults = async () => {
  ...
};

In this function we get the JSON with the results and return the filters from it:

const json = await getJson();
return json.filters;

After that, we run the getResults function and print all the received information in the console with the console.dir method, which accepts an options object for changing the default output settings:

getResults().then((result) => console.dir(result, { depth: null }));

Output

{
   "neighborhoods":{
      "value":"p:WA:Seattle::",
      "list":[
         {
            "text":"Waterfront",
            "value":"Waterfront"
         },
         {
            "text":"Fremont",
            "value":"Fremont"
         },
        ... and other items
      ]
   },
   "distance":[
      {
         "text":"Bird's-eye View",
         "value":"g:-122.43782043457031,47.55614031294337,-122.23320007324219,47.69497434186282"
      },
      {
         "text":"Driving (5 mi.)",
         "value":"g:-122.38666534423828,47.590651847264034,-122.28435516357422,47.6600691664467"
      },
        ... and other items
   ],
   "price":[
      {
         "text":"$",
         "value":"RestaurantsPriceRange2.1"
      },
      {
         "text":"$$",
         "value":"RestaurantsPriceRange2.2"
      },
        ... and other items
   ],
   "category":[
      {
         "text":"Cheesesteaks",
         "value":"cheesesteaks"
      },
      {
         "text":"Middle Eastern",
         "value":"mideastern"
      },
      ... and other items
   ],
   "features":[
      {
         "text":"Waiter Service",
         "value":"RestaurantsTableService"
      },
      {
         "text":"Open to All",
         "value":"BusinessOpenToAll"
      },
      ... and other items
   ]
}

How to apply filters

You can apply the scraped filters to a Yelp search by changing the params constant in the SerpApi solution section of our Web scraping Yelp Organic Results with Nodejs and Web scraping Yelp Ads Results with Nodejs blog posts:

const params = {
  engine: "yelp", // search engine
  device: "desktop", //Parameter defines the device to use to get the results. It can be set to "desktop" (default), "tablet", or "mobile"
  find_loc: "Seattle, WA", //Parameter defines from where you want the search to originate.
  find_desc: "pizza", // Parameter defines the query you want to search
  cflt: "restaurants", // for category filters
  attrs: "RestaurantsPriceRange2.1,OnlineReservations", // for price and features filters
  l: "g:-122.43782043457031,47.55614031294337,-122.23320007324219,47.69497434186282", // for neighborhoods or distance filters (distance and neighborhoods filters can't be used together)
};

If you want other functionality added to this blog post or if you want to see some projects made with SerpApi, write me a message.


Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
