DEV Community

Mohan Ganesan

Posted on • Originally published at proxiesapi.com

How To Scrape Quora Using Puppeteer

In this example, we will load a Quora answers page, scroll down until we reach the end of the content, and then take a screenshot of the page to local disk. We will also scrape all the answers and save them as a JSON file on your drive.

We are going to scrape this page: https://www.quora.com/Which-one-is-the-best-data-scraping-services

It has more than 20 answers running into multiple pages, but not all of them load unless you scroll down.

Quora uses an infinite scroll page. Pages with endless scroll are rendered using AJAX: the page calls back to the server for extra content as the user scrolls down.

One way to scrape data like this is to simulate the browser, allow the JavaScript to fire the AJAX calls, and also simulate the page scroll.
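The core idea is simple: scroll a fixed distance repeatedly, and stop once the cumulative distance scrolled passes the page height, which stops growing when no new content arrives. As a sketch, that stop condition can be simulated outside the browser; the `fakePage` object below is a hypothetical stand-in for the real DOM, whose height grows twice (as if AJAX loaded more answers) and then stays fixed:

```javascript
// Simulate the auto-scroll stop condition outside the browser.
function autoScrollSim(page, distance = 100) {
  let totalHeight = 0;
  let steps = 0;
  while (true) {
    const scrollHeight = page.scrollHeight();
    page.scrollBy(distance);
    totalHeight += distance;
    steps++;
    // Once our cumulative scroll passes the (now static) page
    // height, no new content is loading, so we stop.
    if (totalHeight >= scrollHeight) break;
  }
  return steps;
}

// Hypothetical stand-in for the page: grows by 200px twice, then stops.
const fakePage = {
  height: 300,
  loadsLeft: 2,
  scrollHeight() { return this.height; },
  scrollBy(_d) {
    if (this.loadsLeft > 0) { this.height += 200; this.loadsLeft--; }
  },
};

console.log(autoScrollSim(fakePage)); // number of scroll steps taken
```

This is the same logic the real script applies inside the browser, except there the scrolling and height come from `window` and `document.body`.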

Puppeteer is the best tool to do that. It controls the Chromium browser behind the scenes.

Let’s install Puppeteer first.

mkdir quora_scraper
cd quora_scraper
npm install --save puppeteer

Then create a file in the quora_scraper folder and call it quora_scroll.js:

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.goto('https://www.quora.com/Which-one-is-the-best-data-scraping-services');
    await page.setViewport({
        width: 1200,
        height: 800
    });

    await autoScroll(page); // keep scrolling until we reach the bottom

    await page.screenshot({
        path: 'quora.png',
        fullPage: true
    });

    await browser.close();
})();

async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                // the last few scrolling attempts have brought no new
                // data, so the distance we have tried to scroll is now
                // greater than the actual page height itself
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer); // stop the timer
                    resolve();
                }
            }, 100);
        });
    });
}

Now run it with the following command.

node quora_scroll.js

It should open the Chromium browser, and you should be able to see the page scroll in action.

Once done, you will find a rather large file called quora.png in your folder.

Now let’s add some more code to scrape the HTML collected after all the scrolling, extracting the answers and the details of the users who posted them.

We need to find the elements containing the user’s name and the answer itself. If you inspect the HTML in Chrome’s Inspect tool, you will find that the elements with the class names user and ui_qtext_rendered_qtext contain the user’s name and their answer, respectively.

Puppeteer allows you to use CSS selectors to extract data using the querySelectorAll method, like this.

var answers = await page.evaluate(() => {
    var Answerrers = document.querySelectorAll('.user'); // gets the users' names
    var Answers = document.querySelectorAll('.ui_qtext_rendered_qtext'); // gets the answers

    var titleLinkArray = [];
    for (var i = 0; i < Answerrers.length; i++) {
        titleLinkArray[i] = {
            Answerrer: Answerrers[i].innerText.trim(),
            Answer: Answers[i].innerText.trim(),
        };
    }
    return titleLinkArray;
});
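Since the two querySelectorAll calls return separate lists, the loop pairs them by index. That pairing logic can be factored into a small helper and tried outside the browser; this is a sketch, not part of the original script, and plain arrays stand in for the NodeLists:

```javascript
// Pair a list of user names with a list of answers by index.
// If the lists differ in length (e.g. a collapsed answer), the
// shorter one wins, so we never read past the end of either.
function pairAnswers(answerers, answers) {
  const out = [];
  const n = Math.min(answerers.length, answers.length);
  for (let i = 0; i < n; i++) {
    out.push({
      Answerrer: answerers[i].trim(),
      Answer: answers[i].trim(),
    });
  }
  return out;
}

console.log(pairAnswers(
  ['Alice ', 'Bob'],
  ['Use an API. ', 'Scrape politely.', 'Orphan answer']
));
```

Guarding on the shorter length avoids a crash when the two selectors match different numbers of elements, which can happen on pages with collapsed or deleted answers.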

We can put this code right after the page scrolling has finished, so the whole script now looks like this.

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.goto('https://www.quora.com/Which-one-is-the-best-data-scraping-services');
    //await page.goto('https://www.quora.com/Is-data-scraping-easy');
    await page.setViewport({
        width: 1200,
        height: 800
    });

    await autoScroll(page); // keep scrolling until we reach the bottom

    var answers = await page.evaluate(() => {
        var Answerrers = document.querySelectorAll('.user');
        var Answers = document.querySelectorAll('.ui_qtext_rendered_qtext');

        var titleLinkArray = [];
        for (var i = 0; i < Answerrers.length; i++) {
            titleLinkArray[i] = {
                Answerrer: Answerrers[i].innerText.trim(),
                Answer: Answers[i].innerText.trim(),
            };
        }
        return titleLinkArray;
    });
    console.log(answers);

    await page.screenshot({
        path: 'quora.png',
        fullPage: true
    });
    console.log("The screenshot has been saved!");

    await browser.close();
})();

async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                // the last few scrolling attempts have brought no new data,
                // so the distance we have tried to scroll is now greater
                // than the actual page height itself
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer); // stop the timer
                    resolve();
                }
            }, 100);
        });
    });
}

Now run it with

node quora_scroll.js

When you run it, it will print the scraped answers to the console.

Now let’s go further and save it as a JSON file…

fs.writeFile("quora_answers.json", JSON.stringify(answers), function(err) {
    if (err) throw err;
    console.log("The answers have been saved!");
});

And putting it all together:

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.goto('https://www.quora.com/Which-one-is-the-best-data-scraping-services');
    //await page.goto('https://www.quora.com/Is-data-scraping-easy');
    await page.setViewport({
        width: 1200,
        height: 800
    });

    await autoScroll(page); // keep scrolling until we reach the bottom

    var answers = await page.evaluate(() => {
        var Answerrers = document.querySelectorAll('.user');
        var Answers = document.querySelectorAll('.ui_qtext_rendered_qtext');

        var titleLinkArray = [];
        for (var i = 0; i < Answerrers.length; i++) {
            titleLinkArray[i] = {
                Answerrer: Answerrers[i].innerText.trim(),
                Answer: Answers[i].innerText.trim(),
            };
        }
        return titleLinkArray;
    });
    console.log(answers);

    fs.writeFile("quora_answers.json", JSON.stringify(answers), function(err) {
        if (err) throw err;
        console.log("The answers have been saved!");
    });

    await page.screenshot({
        path: 'quora.png',
        fullPage: true
    });
    console.log("The screenshot has been saved!");

    await browser.close();
})();

async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                // the last few scrolling attempts have brought no new data,
                // so the distance we have tried to scroll is now greater
                // than the actual page height itself
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer); // stop the timer
                    resolve();
                }
            }, 100);
        });
    });
}

Now run it.

node quora_scroll.js

Once it has run, you will find the file quora_answers.json in your folder, containing the scraped answers.

The author is the founder of Proxies API, a proxy rotation API service.
