Zygimantas Sniurevicius for Product Hackers

Posted on Feb 15, 2021

Using Puppeteer to scrape answers in Stackoverflow

#webdev #webscraping #node #javascript

What is Puppeteer

Puppeteer is a Node library that lets us control a Chrome browser programmatically. It's one of the most used tools for web scraping because it makes automating browser actions easy.

What are we doing

Today we'll learn how to set up Puppeteer to scrape Google's top results when searching for a problem on Stack Overflow. Let's see how it will work:

First we run the script with the question

node index "how to exit vim"

Now we google the top results from Stack Overflow.
We collect all the links that match more than half of the words in our question.

[
  {
    keywordMatch: 4,
    url: 'https://stackoverflow.com/questions/31595411/how-to-clear-the-screen-after-exit-vim/51330580'
  }
]

Create a folder for the question asked.
Visit each URL and look for the answer.
Make a screenshot of the answer if there is one.
Save it in our folder previously created.

Repository

I'm not going to cover every detail of the code in this blog post. Things like how to create folders with Node.js, how to loop through the array of URLs, and how to accept arguments in the script are all in my GitHub repository.

You can find the full code here

Explaining the code

Now that we've seen the steps in the previous section, it's time to build it ourselves.

Let's begin by initializing Puppeteer inside an async function.

A headless browser is a web browser without a user interface. Here we pass headless: false, so a visible browser window opens and we can watch what the script does.

It's recommended to use a try/catch block because it's difficult to control errors that happen while the browser is running.


const puppeteer = require("puppeteer");

(async () => {
  try {
    // headless: false opens a visible browser window
    const browser = await puppeteer.launch({
      headless: false,
    });

    const page = await browser.newPage();
  } catch (error) {
    console.log("Error " + error.toString());
  }
})();
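The snippet above never closes the browser, so a failed run can leave a Chrome window behind. One possible way to handle this is to wrap the work in try/finally, as sketched below. The helper name withBrowser and the injected launch function are assumptions made for illustration and are not part of the original script.

```javascript
// Hypothetical helper: runs `task` with a browser and always closes it afterwards.
// `launch` is any function that resolves to an object with a close() method,
// e.g. () => puppeteer.launch({ headless: false }) in the real script.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs whether task succeeded or threw
    await browser.close();
  }
}
```

With Puppeteer installed, this could be called as withBrowser(() => puppeteer.launch({ headless: false }), async (browser) => { /* scraping steps */ }).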

To get all the results from a specific website, we need to append +site:stackoverflow.com to the search query in the URL.

page.goto accepts two parameters: a string for the URL and an object for the options. In our case, we use the waitUntil option to wait until the page has completely loaded before moving on.

const googleUrl = `https://www.google.com/search?q=how%20to%20exit%20vim+site%3Astackoverflow.com`;

await page.goto(googleUrl, {
  waitUntil: ["load", "domcontentloaded", "networkidle0"],
});
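The URL above is hard-coded for one question. A minimal sketch of building it from any question could look like this; the function name buildGoogleUrl is an assumption and does not come from the original script.

```javascript
// Hypothetical helper: builds the Google search URL for a question,
// restricted to stackoverflow.com.
function buildGoogleUrl(question) {
  // encodeURIComponent turns spaces into %20 and escapes reserved characters
  const query = encodeURIComponent(question.trim());
  return `https://www.google.com/search?q=${query}+site%3Astackoverflow.com`;
}

console.log(buildGoogleUrl("how to exit vim"));
// https://www.google.com/search?q=how%20to%20exit%20vim+site%3Astackoverflow.com
```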

Getting the URLs

After navigating to the Google search page, it's time to collect all the href links that point to https://stackoverflow.com/questions.

Inside the page.evaluate method we can access the DOM through the document object. This means we can use selectors to find the information we need easily with document.querySelector or document.querySelectorAll.

Remember that document.querySelectorAll doesn't return an Array but a NodeList, which is why we convert it to an Array before filtering.

Then we map through all the elements and return the URLs.


const queryUrl = "how%20to%20exit%20vim";

const validUrls = await page.evaluate((queryUrl) => {
  const hrefElementsList = Array.from(
    document.querySelectorAll(
      `div[data-async-context='query:${queryUrl}%20site%3Astackoverflow.com'] a[href]`
    )
  );

  const filterElementsList = hrefElementsList.filter((elem) =>
    elem
      .getAttribute("href")
      .startsWith("https://stackoverflow.com/questions")
  );

  const stackOverflowLinks = filterElementsList.map((elem) =>
    elem.getAttribute("href")
  );

  return stackOverflowLinks;
}, queryUrl);
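To see why the Array.from step matters without opening a browser, the sketch below uses a plain array-like object as a stand-in for a NodeList. The fakeNodeList object and the keepQuestionLinks name are made up for this illustration.

```javascript
// A stand-in for a NodeList: indexed entries and a length, but no filter() or map().
const fakeNodeList = {
  0: "https://stackoverflow.com/questions/31595411/how-to-clear-the-screen-after-exit-vim/51330580",
  1: "https://example.com/about",
  length: 2,
};

// Hypothetical helper: keeps only links to Stack Overflow questions.
function keepQuestionLinks(hrefs) {
  // Array.from copies any array-like or iterable value into a real Array
  return Array.from(hrefs).filter((href) =>
    href.startsWith("https://stackoverflow.com/questions")
  );
}

console.log(typeof fakeNodeList.filter); // undefined
console.log(keepQuestionLinks(fakeNodeList).length); // 1
```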

Matching the url

With our verified URLs in a variable called validUrls, it's time to check whether any of them roughly match what we are looking for.

We split the question into an Array and loop over each word. If the word is inside the Stack Overflow URL, we increment our wordCounter variable. Once this is done, we check whether more than half of the words match the URL.


const queryWordArray = [ 'how', 'to', 'exit', 'vim' ]
const keywordLikeability = [];

validUrls.forEach((url) => {
  let wordCounter = 0;

  queryWordArray.forEach((word) => {
     if (url.indexOf(word) > -1) {
       wordCounter = wordCounter + 1;
     }
  });

  if (queryWordArray.length / 2 < wordCounter) {
    keywordLikeability.push({
      keywordMatch: wordCounter,
      url: url,
    });
  }
});
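The loop above can also be written as a small reusable function that starts from the raw question instead of a hard-coded array. This is only a sketch: the name matchUrls is an assumption, and splitting on whitespace is the simplest possible way to get the words.

```javascript
// Hypothetical helper: keeps the URLs that contain more than half of the question's words.
function matchUrls(question, urls) {
  const words = question.split(/\s+/).filter(Boolean);

  return urls
    .map((url) => ({
      // Count how many words appear anywhere in the URL (case-sensitive substring match)
      keywordMatch: words.filter((word) => url.includes(word)).length,
      url,
    }))
    .filter((match) => match.keywordMatch > words.length / 2);
}

console.log(
  matchUrls("how to exit vim", [
    "https://stackoverflow.com/questions/31595411/how-to-clear-the-screen-after-exit-vim/51330580",
  ])
);
```

Because this is a substring match, short words such as "to" can also match inside longer, unrelated words, so the score is only a rough signal.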

Capturing the answer

Finally, we need a function that visits the Stack Overflow page and checks whether there is an accepted answer. If there is, it takes a screenshot of the element and saves it.

We start by going to the Stack Overflow URL and closing the popup, because otherwise it will appear in our screenshot and we don't want that.

To find the popup's close button we use an XPath selector. It's like a weird cousin of our beloved CSS selector, but for XML/HTML.

With the popup gone, it's time to see if we even have an answer. If we do, we take a screenshot and save it.

await acceptedAnswer.screenshot({
 path: `./howtoexitvim.png`,
 clip: { x: 0, y: 0, width: 1024, height: 800 },
});

Take care when using the screenshot method, because its results are not always consistent. For a smoother experience, try to get the DOM element's size and location and use them for the clip option shown above.
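One way to get the element's size and location is ElementHandle.boundingBox(), which resolves to an object with x, y, width and height, or to null if the element is not visible. The sketch below turns such a box into a clip rectangle with some padding. The helper name clipFromBox and the padding default are assumptions for illustration.

```javascript
// Hypothetical helper: converts a bounding box into a padded clip rectangle.
// `box` has the shape returned by elementHandle.boundingBox(): { x, y, width, height }.
function clipFromBox(box, padding = 8) {
  if (!box) return null; // boundingBox() resolves to null for invisible elements

  // Keep the top-left corner inside the page
  const x = Math.max(0, box.x - padding);
  const y = Math.max(0, box.y - padding);

  return {
    x,
    y,
    width: box.width + (box.x - x) + padding,
    height: box.height + (box.y - y) + padding,
  };
}

console.log(clipFromBox({ x: 100, y: 50, width: 600, height: 300 }));
// { x: 92, y: 42, width: 616, height: 316 }
```

Assuming the page has not been scrolled, the result could then be passed as the clip option of page.screenshot, using the box from await acceptedAnswer.boundingBox().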


const getAnswerFromQuestion = async (website, page) => {
  console.log("Website", website);
  await page.goto(website, {
    waitUntil: ["load", "domcontentloaded", "networkidle0"],
  });
  const popUp = (await page.$x("//button[@title='Dismiss']"))[0];
  if (popUp) await popUp.click();

  const acceptedAnswer = await page.$(".accepted-answer");

  if (!acceptedAnswer) return;

  await acceptedAnswer.screenshot({
    path: `./howtoexitvim.png`,
  });
};
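The function above always writes to ./howtoexitvim.png. A possible way to derive the folder and file name from the question is sketched below. The names slugify and screenshotPathFor are hypothetical, and the full code in the repository may do this differently.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: "How to exit Vim?" -> "how-to-exit-vim"
function slugify(question) {
  return question
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Hypothetical helper: creates <baseDir>/<slug>/ if needed and returns a PNG path inside it.
function screenshotPathFor(baseDir, question, index) {
  const folder = path.join(baseDir, slugify(question));
  // recursive: true means no error is thrown when the folder already exists
  fs.mkdirSync(folder, { recursive: true });
  return path.join(folder, `answer-${index}.png`);
}
```

The returned string could be used as the path option of the screenshot call, so each question gets its own folder.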

Call the function we just created with the URL and the page, and we are done!


await getAnswerFromQuestion(keywordLikeability[0].url, page);
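The call above only visits the first match. A possible sketch for looping through every matched URL, best match first, follows. The name visitAll is an assumption, and visit stands for any async function that takes a URL, such as (url) => getAnswerFromQuestion(url, page).

```javascript
// Hypothetical helper: visits every matched URL one at a time, best match first.
async function visitAll(matches, visit) {
  // Copy before sorting so the caller's array is left untouched
  const ranked = [...matches].sort((a, b) => b.keywordMatch - a.keywordMatch);

  const visited = [];
  for (const { url } of ranked) {
    // Sequential on purpose: a single page can only navigate to one URL at a time
    await visit(url);
    visited.push(url);
  }
  return visited;
}
```

Since getAnswerFromQuestion as written always saves to the same file, each visit would need its own screenshot path to avoid overwriting earlier results.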

Here is the final result: we can finally exit Vim!

Final remarks

I hope you learned something today. Please check out the repository I set up, which has all the code. Thanks for reading, and stay awesome ❤️

DEV Community