Mikhail Zub for SerpApi

Web scraping YouTube autocomplete with Nodejs

What will be scraped

what

📌Note: For now, we don't have an API that supports extracting autocomplete data.

This blog post shows how you can do it yourself in the meantime, while we work on releasing a proper API. We'll let you know on our Twitter once this API is released.

Full code

If you don't need an explanation, have a look at the full code example in the online IDE

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const queries = ["javascript", "node", "web scraping"];
const URL = "https://www.youtube.com";

async function getYoutubeAutocomplete() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);
  await page.waitForSelector("#contents");

  const autocompleteResults = [];
  for (const query of queries) {
    await page.click("#search-input");
    await page.keyboard.type(query);
    await page.waitForTimeout(5000);
    const results = {
      query,
      autocompleteResults: await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".sbdd_a li"))
          .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
          .filter((el) => el);
      }),
    };
    autocompleteResults.push(results);
    await page.click("#search-clear-button");
    await page.waitForTimeout(2000);
  }

  await browser.close();

  return autocompleteResults;
}

getYoutubeAutocomplete().then(console.log);

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, open the command line in the project directory and enter npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth so that websites are less likely to detect that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.

stealth

Process

The SelectorGadget Chrome extension was used to grab CSS selectors by clicking on the desired element in the browser. If you have trouble understanding this, we have a dedicated Web Scraping with CSS Selectors blog post at SerpApi.

The GIF below illustrates the approach of selecting different parts of the results.

how

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, and define the search queries and the YouTube URL:

puppeteer.use(StealthPlugin());

const queries = ["javascript", "node", "web scraping"];
const URL = "https://www.youtube.com";

Next, write a function to control the browser, and get information:

async function getYoutubeAutocomplete() {
  ...
}

In this function, we first define browser using the puppeteer.launch({options}) method with the options headless: false and args: ["--no-sandbox", "--disable-setuid-sandbox"].

headless: false means the browser runs in non-headless mode, with a visible window. The args array holds the arguments that allow the browser process to launch in the online IDE. Then we open a new page:

  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();
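
As a rough illustration of the headless option, the sketch below builds the launch options from an environment variable, so the same script could run with a visible window on your machine and without one elsewhere. The buildLaunchOptions helper and the HEADLESS variable name are assumptions made for this example; they are not part of the original code.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// The HEADLESS environment variable name is an assumption for this sketch.
function buildLaunchOptions(env = process.env) {
  return {
    // headless: false opens a visible browser window; true hides it
    headless: env.HEADLESS === "true",
    // sandbox flags from the article, used to let the browser start in the online IDE
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  };
}

console.log(buildLaunchOptions({ HEADLESS: "true" }));
```

With this helper, the launch call would become puppeteer.launch(buildLaunchOptions()).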

Next, we change the default navigation timeout from 30 seconds to 60000 ms (1 min) with the .setDefaultNavigationTimeout() method to allow for a slow internet connection, go to URL with the .goto() method, and use the .waitForSelector() method to wait until an element matching the #contents selector appears on the page:

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);
  await page.waitForSelector("#contents");
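
Puppeteer keeps separate defaults for navigation and for other waits: page.setDefaultNavigationTimeout() affects navigation calls such as page.goto(), while page.setDefaultTimeout() also covers waiting methods such as page.waitForSelector(). A minimal sketch, assuming you want both raised together (the configureTimeouts helper name is made up for this example):

```javascript
// Hypothetical helper: raises both Puppeteer default timeouts on a page.
function configureTimeouts(page, ms = 60000) {
  // applies to navigation calls such as page.goto()
  page.setDefaultNavigationTimeout(ms);
  // applies to other waiting methods such as page.waitForSelector()
  page.setDefaultTimeout(ms);
  return page;
}
```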

Then, we define an array for the results, called autocompleteResults, and start a for...of loop to iterate over all queries:

  const autocompleteResults = [];
  for (const query of queries) {
    ...
  }

Next, in the loop we click on #search-input (.click() method), type the current query with the page.keyboard.type(query) method, and wait 5 seconds using the .waitForTimeout(5000) method:

    await page.click("#search-input");
    await page.keyboard.type(query);
    await page.waitForTimeout(5000);
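
A fixed 5-second pause is simple but wastes time when suggestions appear sooner. As one possible alternative, the sketch below waits for the first suggestion item instead. The typeAndWaitForSuggestions name is an assumption for this example, and it reuses the .sbdd_a li selector from the article.

```javascript
// Hypothetical helper: types a query and waits for the suggestion list
// instead of sleeping for a fixed 5 seconds.
async function typeAndWaitForSuggestions(page, query, timeout = 5000) {
  await page.click("#search-input");
  await page.keyboard.type(query);
  try {
    // resolves as soon as at least one suggestion item is visible
    await page.waitForSelector(".sbdd_a li", { visible: true, timeout });
    return true;
  } catch (err) {
    // waitForSelector rejected (typically a timeout); treated here as "no suggestions"
    return false;
  }
}
```

The function returns true when suggestions showed up and false otherwise, so the caller can decide whether to skip the query.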

Then, we build the results object, which has query and autocompleteResults keys. We get autocompleteResults using the page.evaluate() method, which runs the function passed to it in the browser context.

There we use the .querySelectorAll() method, which returns a static NodeList of the document's elements that match the CSS selectors in the brackets, and convert the result to an array with the Array.from() method so we can iterate over it.

After that, for each of the .sbdd_a li elements we find the child element with the class name sbqs_c (.querySelector() method), get its raw text (textContent property), and remove whitespace from both ends of the string with the .trim() method. Because we sometimes find empty nodes at the end, we filter the array and keep only the truthy values (.filter((el) => el)):

    const results = {
      query,
      autocompleteResults: await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".sbdd_a li"))
          .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
          .filter((el) => el);
      }),
    };
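
To see the map and filter chain in isolation, here is a small sketch that runs the same logic outside the browser. The extractSuggestions name and the plain objects standing in for DOM elements are assumptions for illustration only.

```javascript
// Hypothetical stand-alone version of the callback passed to page.evaluate().
// `items` is any iterable of objects exposing querySelector(), like DOM elements.
function extractSuggestions(items) {
  return Array.from(items)
    // ?. yields undefined when an item has no .sbqs_c child
    .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
    // drop undefined values and empty strings
    .filter((text) => text);
}

// Fake elements: one with text, one without the child, one with blank text.
const fakeItems = [
  { querySelector: () => ({ textContent: "  node js  " }) },
  { querySelector: () => null },
  { querySelector: () => ({ textContent: "   " }) },
];

console.log(extractSuggestions(fakeItems)); // [ 'node js' ]
```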

Next, we push the results object from the current iteration to the autocompleteResults array, click #search-clear-button to clear the search input, and wait 2 seconds before the next iteration:

    autocompleteResults.push(results);
    await page.click("#search-clear-button");
    await page.waitForTimeout(2000);

And finally, we close the browser and return received data:

  await browser.close();

  return autocompleteResults;
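
If a selector is missing or a wait times out, the function throws before browser.close() runs and the browser process stays open. One way to guard against that is a try...finally wrapper. This is a sketch, not the article's code: withBrowser is a hypothetical helper name, and launch is any async function that returns an object with a close() method, for example () => puppeteer.launch({ headless: false }).

```javascript
// Hypothetical wrapper: runs `task` with a browser and always closes it.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // runs whether task resolved or threw
    await browser.close();
  }
}
```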

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

[
   {
      "query":"javascript",
      "autocompleteResults":[
         "javascript",
         "javascript tutorial for beginners",
         "javascript full course",
         "javascript tutorial",
         "javascript dom",
         "javascript mastery",
         "javascript course",
         "javascript interview questions and answers",
         "javascript for beginners",
         "javascript с нуля",
         "javascript project",
         "javascript ninja",
         "javascript game",
         "javascript interview"
      ]
   },
   {
      "query":"node",
      "autocompleteResults":[
         "node js",
         "node js tutorial",
         "node",
         "node js project",
         "node js express",
         "node js interview",
         "node video tutorial",
         "node video",
         "node js interview questions",
         "node js event loop",
         "node js уроки",
         "nodemailer",
         "node red",
         "nodemcu"
      ]
   },
   {
      "query":"web scraping",
      "autocompleteResults":[
         "web scraping weather data python",
         "web scraping",
         "web scraping python",
         "web scraping javascript",
         "web scraping amazon product",
         "web scraping amazon price",
         "web scraping amazon",
         "web scraping amazon reviews",
         "web scraping amazon reviews python",
         "web scraping indeed",
         "web scraping flight prices",
         "web scraping using python",
         "web scraping tutorial"
      ]
   }
]

Extract suggestions from Google Autocomplete Client

The previous example was the "hard" way. You can also get the data from the following URL, which returns the suggestions as a text file:

"https://clients1.google.com/complete/search?client=youtube&hl=en&q=minecraft"
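
The text returned by this URL still has to be parsed. The sketch below assumes the body is a JSONP-style call wrapping a JSON array whose second element lists the suggestions. That shape is an assumption, so check a live response before relying on it. The parseYoutubeSuggestions name and the sample string are hypothetical.

```javascript
// Assumption: the endpoint answers with a JSONP-style text such as
//   window.google.ac.h(["query",[["suggestion",0,[...]], ...],{...}])
// Check a live response before relying on this shape.
function parseYoutubeSuggestions(body) {
  const start = body.indexOf("(");
  const end = body.lastIndexOf(")");
  // no wrapper found: nothing to parse
  if (start === -1 || end <= start) return [];
  // the second array element holds the suggestion entries
  const [, suggestions = []] = JSON.parse(body.slice(start + 1, end));
  // the first item of each entry is the suggestion text
  return suggestions.map((entry) => entry[0]);
}

// Hypothetical sample text, not captured output
const sample =
  'window.google.ac.h(["minecraft",[["minecraft",0,[512]],["minecraft song",0,[512]]],{"k":1}])';

console.log(parseYoutubeSuggestions(sample)); // [ 'minecraft', 'minecraft song' ]
```

In Node.js 18 or later, the body could be fetched with the built-in fetch(), for example const body = await (await fetch(url)).text(), and then passed to this function.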

If you want to see some projects made with SerpApi, please write me a message.


Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
