Mikhail Zub for SerpApi

Web scraping Google Books Ngram Viewer with Nodejs

Intro

Currently, we don't have an API that supports extracting data from the Google Books Ngram Viewer page.

This blog post shows how you can do it yourself with the DIY solution below while we're working on releasing our proper API.

The solution can be used for personal use, as it doesn't include the Legal US Shield that we offer for our paid Production and above plans. It also has its limitations, such as the need to bypass blocks, for example, CAPTCHA.

You can check our public roadmap to track the progress of this API.

What will be scraped

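(Image: the Google Books Ngram Viewer chart to be scraped)

Compare it with the chart built from the scraped data:

(Image: chart built from the scraped data)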

Full code

const axios = require("axios");
const fs = require("fs");
const { ChartJSNodeCanvas } = require("chartjs-node-canvas");

const searchString = "Albert Einstein,Sherlock Holmes,Frankenstein,Steve Jobs,Taras Shevchenko,William Shakespeare"; // what we want to get
const startYear = 1800; // the start year of the search
const endYear = 2019; // the end year of the search

const AXIOS_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
  }, // adding the User-Agent header as one way to prevent the request from being blocked
  params: {
    content: searchString, // what we want to search
    year_start: startYear, // parameter defines the start year of the search
    year_end: endYear, // parameter defines the end year of the search
  },
};

async function saveChart(chartData) {
  const width = 1920; //chart width in pixels
  const height = 1080; //chart height in pixels
  const backgroundColour = "white"; // Uses https://www.w3schools.com/tags/canvas_fillstyle.asp
  const chartJSNodeCanvas = new ChartJSNodeCanvas({ width, height, backgroundColour });

  const labels = new Array(endYear - startYear + 1).fill(startYear).map((el, i) => el + i);

  const configuration = {
    type: "line", // for line chart
    data: {
      labels,
      datasets: chartData?.map((el) => {
        const data = el.timeseries.map((value) => value * 100); // convert fractions to percentages
        return {
          label: el.ngram,
          data,
          borderColor: [`rgb(${parseInt(Math.random() * 255)}, ${parseInt(Math.random() * 255)}, ${parseInt(Math.random() * 255)})`],
        };
      }),
    },
    options: {
      scales: {
        y: {
          title: {
            display: true,
            text: "%",
          },
        },
      },
    },
  };

  const base64Image = await chartJSNodeCanvas.renderToDataURL(configuration);

  const base64Data = base64Image.replace(/^data:image\/png;base64,/, "");

  fs.writeFile("chart.png", base64Data, "base64", function (err) {
    if (err) {
      console.log(err);
    }
  });
}

function getChart() {
  return axios.get(`https://books.google.com/ngrams/json`, AXIOS_OPTIONS).then(({ data }) => data);
}

getChart().then(saveChart);

Preparation

First, we need to create a Node.js* project and add the npm packages: axios to make requests to a website, chart.js to build a chart from the received data, and chartjs-node-canvas to render the chart with Chart.js on a canvas.

To do this, open the command line in the project directory and enter:

$ npm init -y

And then:

$ npm i axios chart.js chartjs-node-canvas

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

Process

We'll receive the Books Ngram data in JSON format, so we only need to handle the received data and create our own chart (if needed):

Request:

axios.get(`https://books.google.com/ngrams/json`, AXIOS_OPTIONS).then(({ data }) => data);

Response JSON:

[
  {
    "ngram": "Albert Einstein",
    "parent": "",
    "type": "NGRAM",
    "timeseries": [
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9.077474010561153e-10, 9.077474010561153e-10, 9.077474010561153e-10,
      ...and other chart data
      ]
  },
  {
    "ngram": "Sherlock Holmes",
    "parent": "",
    "type": "NGRAM",
    "timeseries": [
      4.731798064483428e-9, 3.785438451586742e-9, 3.154532042988952e-9, 2.7038846082762446e-9, 0, 2.47730296593878e-10,
      ...and other chart data
    ]
  },
  ...and other Books Ngram data
]
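As a rough illustration of handling this response, the sketch below finds the peak year for each ngram. It assumes that the first timeseries value belongs to the start year and that there is one value per year. The sample values and the findPeaks helper are made up for this example.

```javascript
// Hypothetical sample shaped like the response above (values are made up).
const firstYear = 1800;
const sample = [
  { ngram: "Albert Einstein", timeseries: [0, 0, 2e-9, 5e-9, 3e-9] },
  { ngram: "Sherlock Holmes", timeseries: [4e-9, 1e-9, 0, 0, 2e-9] },
];

// For each ngram, find the year with the highest relative frequency.
// Assumes timeseries[0] belongs to startYear and each next value to the next year.
function findPeaks(chartData, startYear) {
  return chartData.map(({ ngram, timeseries }) => {
    const peakValue = Math.max(...timeseries);
    const peakYear = startYear + timeseries.indexOf(peakValue);
    return { ngram, peakYear, peakValue };
  });
}

const peaks = findPeaks(sample, firstYear);
console.log(peaks);
```

The same index-to-year mapping is what makes the "x" axis labels line up with the data in the chart.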

Code explanation

First, we import axios, fs (the fs module allows you to work with the file system on your computer), and ChartJSNodeCanvas from the chartjs-node-canvas library:

const axios = require("axios");
const fs = require("fs");
const { ChartJSNodeCanvas } = require("chartjs-node-canvas");

Next, we write what we want to search for, along with the start year and the end year:

const searchString = "Albert Einstein,Sherlock Holmes,Frankenstein,Steve Jobs,Taras Shevchenko,William Shakespeare";
const startYear = 1800;
const endYear = 2019;

Next, we write the request options: HTTP headers with a User-Agent, which makes the request look like a "real" user visit, and the parameters needed for the request.

The default axios User-Agent is axios/<axios_version>, so a website can tell that the request was sent by a script and might block it. Check what your user-agent is and pass it in the headers:

const AXIOS_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
  }, // adding the User-Agent header as one way to prevent the request from being blocked
  params: {
    content: searchString, // what we want to search
    year_start: startYear, // parameter defines the start year of the search
    year_end: endYear, // parameter defines the end year of the search
  },
};
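To get a feel for what these params turn into, here is a small sketch that builds a comparable URL with Node's built-in URL class. It is only for inspection: axios serializes params on its own, and its exact escaping may differ from what URL produces. The shortened search string is just for the example.

```javascript
// Same parameter names as in the article; the URL is built only to inspect it.
const searchString = "Albert Einstein,Sherlock Holmes";
const params = { content: searchString, year_start: 1800, year_end: 2019 };

const url = new URL("https://books.google.com/ngrams/json");
for (const [key, value] of Object.entries(params)) {
  // Query values are always strings, so numbers are converted explicitly.
  url.searchParams.set(key, String(value));
}

console.log(url.toString());
```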

Next, we write a function that handles the received data and saves it to a ".png" file:

async function saveChart(chartData) {
    ...
}

In this function, we need to declare the canvas width, height, and backgroundColour, then build the canvas using chartjs-node-canvas:

const width = 1920; //chart width in pixels
const height = 1080; //chart height in pixels
const backgroundColour = "white"; // Uses https://www.w3schools.com/tags/canvas_fillstyle.asp
const chartJSNodeCanvas = new ChartJSNodeCanvas({ width, height, backgroundColour });

Then, we need to define and create the "x" axis labels. To do this, we create a new array whose length equals the number of years from startYear to endYear (we add 1 because both the start year and the end year are included).

Then we fill the array with startYear and add the element position (i) to each value (using the map() method):

const labels = new Array(endYear - startYear + 1)
  .fill(startYear)
  .map((el, i) => el + i);
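The same labels can also be built in a single step with Array.from. This is just an alternative sketch, and the yearRange helper name is made up for this example:

```javascript
// Build an inclusive list of years, e.g. 1800..2019.
function yearRange(startYear, endYear) {
  return Array.from({ length: endYear - startYear + 1 }, (_, i) => startYear + i);
}

const yearLabels = yearRange(1800, 2019);
console.log(yearLabels.length, yearLabels[0], yearLabels[yearLabels.length - 1]); // 220 1800 2019
```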

Next, we need to create the configuration object for the chart.js library. In this object, we define the chart type, data, and options.

In the chart data, we define the main axis labels and build the datasets from the received chartData: for each line we set a label, the data (multiplied by 100 to show percentages), and a random color (using the Math.random() and parseInt() methods).

In the chart options, we set the 'y' axis title and enable it (display property):

const configuration = {
  type: "line", // for line chart
  data: {
    labels,
    datasets: chartData?.map((el) => {
      const data = el.timeseries.map((value) => value * 100); // convert fractions to percentages
      return {
        label: el.ngram,
        data,
        borderColor: [`rgb(${parseInt(Math.random() * 255)}, ${parseInt(Math.random() * 255)}, ${parseInt(Math.random() * 255)})`],
      };
    }),
  },
  options: {
    scales: {
      y: {
        title: {
          display: true,
          text: "%",
        },
      },
    },
  },
};
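Random colors can come out very similar to each other, or change on every run. As an optional variation (not what the code above does), the sketch below spreads the hues evenly around the color wheel so every line gets a distinct, repeatable color. The toDatasets helper and the sample values are made up for this example.

```javascript
// Turn the Ngram response into Chart.js datasets with evenly spaced HSL hues.
function toDatasets(chartData) {
  return (chartData ?? []).map(({ ngram, timeseries }, index, all) => {
    const hue = Math.round((360 * index) / all.length);
    return {
      label: ngram,
      data: timeseries.map((value) => value * 100), // fractions to percentages
      borderColor: `hsl(${hue}, 70%, 45%)`,
    };
  });
}

// Hypothetical sample data (values are made up and not realistic frequencies).
const datasets = toDatasets([
  { ngram: "Albert Einstein", timeseries: [0, 0.25] },
  { ngram: "Sherlock Holmes", timeseries: [0.5, 0] },
  { ngram: "Frankenstein", timeseries: [0.25, 0.25] },
]);
console.log(datasets.map((dataset) => dataset.borderColor));
```

The result can be used in place of the datasets array in the configuration object.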

Next, we wait for the chart to be rendered as a base64 data URL, remove the data type prefix from the string (replace() method), and save the "chart.png" file with the writeFile() method:

const base64Image = await chartJSNodeCanvas.renderToDataURL(configuration);

const base64Data = base64Image.replace(/^data:image\/png;base64,/, "");

fs.writeFile("chart.png", base64Data, "base64", function (err) {
  if (err) {
    console.log(err);
  }
});
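If you'd rather work with raw bytes than with a base64 string, the prefix handling can be wrapped in a small helper. The sketch below is one possible way to do it. The dataUrlToBuffer helper, the placeholder bytes, and the output file name are all made up for this example, and the file is written to the temp directory so it doesn't touch your project.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Split a base64 data URL into its MIME type and raw bytes.
function dataUrlToBuffer(dataUrl) {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (!match) {
    throw new Error("Not a base64 data URL");
  }
  return { mimeType: match[1], buffer: Buffer.from(match[2], "base64") };
}

// Placeholder bytes stand in for a rendered chart; they are not a real PNG.
const fakeImage = Buffer.from("not a real png");
const dataUrl = `data:image/png;base64,${fakeImage.toString("base64")}`;

const { mimeType, buffer } = dataUrlToBuffer(dataUrl);
const outputPath = path.join(os.tmpdir(), "chart-example.bin");
fs.writeFileSync(outputPath, buffer); // synchronous write keeps the example short
console.log(mimeType, buffer.length, outputPath);
```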

Then, we write a function that makes the request and returns the received data. The axios response has a data key, which we destructure and return:

function getChart() {
  return axios
    .get(`https://books.google.com/ngrams/json`, AXIOS_OPTIONS)
    .then(({ data }) => data);
}
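Since a blocked request may return something other than the expected JSON (for example, an HTML page, which is an assumption here), it can help to check the shape of the data before charting it. Below is a minimal sketch of such a check, based only on the response fields shown earlier. The isNgramResponse helper name is made up for this example.

```javascript
// Return true only if the data looks like the Ngram JSON shown above:
// an array of objects with a string "ngram" and a numeric "timeseries" array.
function isNgramResponse(data) {
  return (
    Array.isArray(data) &&
    data.every(
      (item) =>
        item !== null &&
        typeof item === "object" &&
        typeof item.ngram === "string" &&
        Array.isArray(item.timeseries) &&
        item.timeseries.every((value) => typeof value === "number")
    )
  );
}

console.log(isNgramResponse([{ ngram: "Frankenstein", timeseries: [0, 1e-9] }]));
console.log(isNgramResponse("<html>blocked</html>"));
```

Note that an empty array passes this check, so you may want to handle that case separately before building the chart.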

And finally, we need to run our functions:

getChart().then(saveChart);

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Saved file

(Image: the saved chart.png)

If you want to see some projects made with SerpApi, write me a message.


Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
