Serpdog

Originally published at serpdog.io

Scrape Google Scholar Results

This article shows how to scrape Google Scholar result pages with Node.js, using Unirest and Cheerio.


Table of Contents

  1. Requirements
  2. Scraping Google Organic Scholar Results
  3. Scraping Google Scholar Profiles
  4. Scraping Google Scholar Cite Results
  5. Scraping Google Scholar Author Profile
  6. Conclusion
  7. Additional Resources

Requirements:

Web Parsing with CSS selectors

Searching raw HTML for the right tags is both difficult and time-consuming. It is better to use the CSS Selector Gadget to pick the right tags and make your web scraping easier.

This gadget can help you come up with a suitable CSS selector for your needs. Here is the link to the tutorial, which shows how to use the gadget to choose the best CSS selectors.

User Agents

The User-Agent header identifies the application, operating system, vendor, and version of the requesting user agent. Sending a browser-like User-Agent helps a scraper's request to Google look like a visit from a real user.
You can also rotate User-Agents; read more in this article: How to fake and rotate User Agents using Python 3.

If you want to further safeguard your IP from being blocked by Google, you can try these 10 Tips to avoid getting Blocked while Scraping Google.
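As a minimal sketch of User-Agent rotation in Node.js, the snippet below picks a random entry from a small pool for each request. The pool and the helper names `randomUserAgent` and `buildHeaders` are assumptions for illustration and are not part of the scraper in this article; the two strings are the ones used in the code samples further down.

```javascript
// Hypothetical pool of User-Agent strings; extend it with current browser versions.
const USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
];

// Return a random User-Agent from the pool (illustrative helper).
const randomUserAgent = (pool = USER_AGENTS) =>
    pool[Math.floor(Math.random() * pool.length)];

// Build the headers object that would be passed to unirest's .headers() call.
const buildHeaders = () => ({ "User-Agent": randomUserAgent() });

console.log(buildHeaders());
```

Calling `buildHeaders()` once per request means consecutive requests do not always carry the same User-Agent string.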

Install Libraries

Before we begin, install these libraries so we can prepare our scraper:

  1. Unirest JS
  2. Cheerio JS

Run the commands below in your project terminal to install them:

npm i unirest
npm i cheerio

We will use Unirest JS to fetch the HTML and Cheerio JS to parse it.

Scraping Google Organic Scholar Results:


We will scrape the title, title link, id, displayed link, snippet, cited-by count and link, and versions count and link.
Here is the full code to scrape the Google Organic Scholar Results 👇🏻:

const cheerio = require("cheerio");
const unirest = require("unirest");

const getScholarData = async () => {
    try {
        const url = "https://www.google.com/scholar?q=IIT+MUMBAI&hl=en";

        // Send a browser-like User-Agent so the request resembles a normal visit.
        // Awaiting the request lets the try/catch below handle request errors.
        const response = await unirest.get(url).headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
        });

        const $ = cheerio.load(response.body);
        const scholar_results = [];

        // Each organic result is wrapped in a .gs_ri element.
        $(".gs_ri").each((i, el) => {
            const cited = $(el).find(".gs_nph+ a");
            const versions = $(el).find("a~ a+ .gs_nph");

            scholar_results.push({
                title: $(el).find(".gs_rt").text(),
                title_link: $(el).find(".gs_rt a").attr("href"),
                id: $(el).find(".gs_rt a").attr("id"),
                displayed_link: $(el).find(".gs_a").text(),
                snippet: $(el).find(".gs_rs").text().replace(/\n/g, ""),
                cited_by_count: cited.text(),
                cited_link: cited.text() ? "https://scholar.google.com" + cited.attr("href") : "",
                versions_count: versions.text(),
                versions_link: versions.text() ? "https://scholar.google.com" + versions.attr("href") : "",
            });
        });

        // Drop keys that came back empty or undefined.
        for (const result of scholar_results) {
            Object.keys(result).forEach((key) => {
                if (result[key] === "" || result[key] === undefined) delete result[key];
            });
        }

        console.log(scholar_results);
    } catch (e) {
        console.log(e);
    }
};

getScholarData();

Our result should look like this 👇🏻:

[
    {
        title: 'Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study.',
        title_link: 'https://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=22295984&AN=108373670&h=bqlRj0gjNNQoSuJb5zZxtrAWRoe7e4cT7cfMNTEYxWbUdYAXdv0An55XKjithW%2FT3A9v3vC8m87cvR3EXu%2BdkA%3D%3D&crl=c',
        id: 'TPhPjzP8H_MJ',
        displayed_link: 'SK Gupta, S Sharma - International Journal of Information …, 2015 - search.ebscohost.com',
        snippet: "The rapid advancement in information technology has changed the resources and services of a library. Now day's libraries are not confined only to print resources and traditional library …",
        cited_by_count: 'Cited by 19',
        cited_link: 'https://scholar.google.com/scholar?cites=17518998373872433228&as_sdt=2005&sciodt=0,5&hl=en',
        versions_count: 'All 5 versions',
        versions_link: 'https://scholar.google.com/scholar?cluster=17518998373872433228&hl=en&as_sdt=0,5'
    },
    {
        title: '[PDF][PDF] Design of Solar powered vehicle. project III, Industrial Design Center, IIT Mumbai',
        title_link: 'https://dsource.in/sites/default/files/case-study/solar-powered-rickshaw/introduction/file/solar-powered-rickshaw.pdf',
        id: '_w_nBYVUe8AJ',
        displayed_link: 'UA Athavankar, SR Singh - 2016 - dsource.in',
        snippet: 'The greatest problem that faces the world today is Global warming. It is more apparent here in India than anywhere else, specially Rajasthan where temperatures over the last few years …',
        cited_by_count: 'Cited by 2',
        cited_link: 'https://scholar.google.com/scholar?cites=13869772407723986943&as_sdt=2005&sciodt=0,5&hl=en'
    },
    ....
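The `cited_by_count` and `versions_count` fields in the output above are label strings such as 'Cited by 19' and 'All 5 versions'. If you need them as numbers, a small helper can pull out the first integer. This is a sketch; the name `extractCount` is an assumption and not part of the original scraper.

```javascript
// Extract the first integer from strings such as "Cited by 19" or "All 5 versions".
// Returns null when the string is missing or contains no digits.
const extractCount = (text) => {
    const match = /\d+/.exec(text || "");
    return match ? Number(match[0]) : null;
};

console.log(extractCount("Cited by 19"));    // 19
console.log(extractCount("All 5 versions")); // 5
console.log(extractCount(undefined));        // null
```

Returning null instead of 0 keeps "no citation label on the page" distinguishable from a real count of zero.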

Scraping Google Scholar Profiles


Now we will scrape each author's name, profile link, position in the organization, email, departments, and cited-by count.
Here is our code 👇🏻:

const unirest = require("unirest");
const cheerio = require("cheerio");

const getScholarProfiles = async () => {
    try {
        const url = "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=IIT+MUMBAI";

        // Awaiting the request lets the try/catch below handle request errors.
        const response = await unirest.get(url).headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
        });

        const $ = cheerio.load(response.body);
        const scholar_profiles = [];

        // Each author card is wrapped in a .gsc_1usr element.
        $(".gsc_1usr").each((i, el) => {
            scholar_profiles.push({
                name: $(el).find(".gs_ai_name").text(),
                name_link: "https://scholar.google.com" + $(el).find(".gs_ai_name a").attr("href"),
                position: $(el).find(".gs_ai_aff").text(),
                email: $(el).find(".gs_ai_eml").text(),
                departments: $(el).find(".gs_ai_int").text(),
                // The label reads "Cited by N", so the count is the third word.
                cited_by_count: $(el).find(".gs_ai_cby").text().split(" ")[2],
            });
        });

        // Drop keys that came back empty or undefined.
        for (const profile of scholar_profiles) {
            Object.keys(profile).forEach((key) => {
                if (profile[key] === "" || profile[key] === undefined) delete profile[key];
            });
        }

        console.log(scholar_profiles);
    } catch (e) {
        console.log(e);
    }
};

getScholarProfiles();

Our results should look like this 👇🏻:


  [
    {
        name: 'Piyali Banerjee',
        name_link: 'https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ',
        position: 'Postdoctoral Researcher in Physics, IIT Bombay',
        email: 'Verified email at iitb.ac.in',
        departments: 'Experimental High Energy Physics Phenomenology ',
        cited_by_count: '230769'
    },
    {
        name: 'Archana Pai',
        name_link: 'https://scholar.google.com/citations?hl=en&user=2Dw4Y9AAAAAJ',
        position: 'IIT Bombay',
        email: 'Verified email at phy.iitb.ac.in',
        departments: 'Gravitational Wave Astronomy Statistical Signal Processing Multimessenger astronomy ',
        cited_by_count: '70703'
    },
    {
        name: 'Krithi Ramamritham',
        name_link: 'https://scholar.google.com/citations?hl=en&user=LFLG5pcAAAAJ',
        position: 'Sai University, Chennai, India (retired from IIT Bombay)',
        email: 'Verified email at iitb.ac.in',
        departments: 'databases real-time systems ICT based  solutions for society ',
        cited_by_count: '23765'
    },
    ....
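Both scrapers so far end with the same cleanup loop that deletes empty or undefined keys. As a sketch, that step could be factored into one reusable function. The name `removeEmptyFields` is an assumption, and unlike the loop it returns a new object instead of deleting keys in place.

```javascript
// Return a copy of `obj` without keys whose value is "" or undefined.
const removeEmptyFields = (obj) =>
    Object.fromEntries(
        Object.entries(obj).filter(([, value]) => value !== "" && value !== undefined)
    );

const sample = { name: "Example Author", email: "", cited_by_count: undefined };
console.log(removeEmptyFields(sample)); // { name: 'Example Author' }
```

With this helper, the loop could be replaced by something like `scholar_profiles.map(removeEmptyFields)`.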

Scraping Google Scholar Cite Results


The code below scrapes the cite results for one organic Scholar search result.

const cheerio = require("cheerio");
const unirest = require("unirest");

const getData = async () => {
    try {
        const url =
            "https://scholar.google.com/scholar?q=info:TPhPjzP8H_MJ:scholar.google.com&output=cite";

        // Awaiting the request lets the try/catch below handle request errors.
        const response = await unirest.get(url).headers({});
        const $ = cheerio.load(response.body);

        // One table row per citation style (MLA, APA, ...).
        const cite_results = [];
        $("#gs_citt tr").each((i, el) => {
            cite_results.push({
                title: $(el).find(".gs_cith").text(),
                snippet: $(el).find(".gs_citr").text(),
            });
        });

        // Export links (BibTeX, EndNote, ...).
        const links = [];
        $("#gs_citi .gs_citi").each((i, el) => {
            links.push({
                name: $(el).text(),
                link: $(el).attr("href"),
            });
        });

        console.log(cite_results);
        console.log(links);
    } catch (e) {
        console.log(e);
    }
};

getData();

Look at the target URL: the string after info: is the id we got when scraping the Google Scholar organic results.
Our result should look like this 👇🏻:


  [
    {
        title: 'MLA',
        snippet: 'Gupta, Sanjay Kumar, and Sanjeev Sharma. "Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study." International Journal of Information Dissemination & Technology 5.1 (2015).'
    },
    {
        title: 'APA',
        snippet: 'Gupta, S. K., & Sharma, S. (2015). Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology, 5(1).'
    },
    {
        title: 'Chicago',
        snippet: 'Gupta, Sanjay Kumar, and Sanjeev Sharma. "Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study." International Journal of Information Dissemination & Technology 5, no. 1 (2015).'
    },
    {
        title: 'Harvard',
        snippet: 'Gupta, S.K. and Sharma, S., 2015. Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology, 5(1).'
    },
    {
        title: 'Vancouver',
        snippet: 'Gupta SK, Sharma S. Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology. 2015 Jan 1;5(1).'
    }
  ]
  [
    {
        name: 'BibTeX',
        link: 'https://scholar.googleusercontent.com/scholar.bib?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=4&ct=citation&cd=-1&hl=en'
    },
    {
        name: 'EndNote',
        link: 'https://scholar.googleusercontent.com/scholar.enw?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=3&ct=citation&cd=-1&hl=en'
    },
    {
        name: 'RefMan',
        link: 'https://scholar.googleusercontent.com/scholar.ris?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=2&ct=citation&cd=-1&hl=en'
    },
    {
        name: 'RefWorks',
        link: 'https://scholar.googleusercontent.com/scholar.rfw?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=1&ct=citation&cd=-1&hl=en'
    }
  ]

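Since the cite URL differs only in the result id, it can be generated from any id returned by the organic results scraper. The sketch below follows the URL pattern used in the code above; the helper name `buildCiteUrl` is an assumption.

```javascript
// Build the cite URL for a result id taken from the organic results.
const buildCiteUrl = (resultId) =>
    `https://scholar.google.com/scholar?q=info:${encodeURIComponent(resultId)}:scholar.google.com&output=cite`;

console.log(buildCiteUrl("TPhPjzP8H_MJ"));
// https://scholar.google.com/scholar?q=info:TPhPjzP8H_MJ:scholar.google.com&output=cite
```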

Scraping Google Scholar Author Profile


We will now work to scrape the Google Scholar Author Profile.
First, we will scrape the author's name, position, email, and department.


const unirest = require("unirest");
const cheerio = require("cheerio");

const getAuthorProfileData = async () => {
    try {
        const url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ";

        // Awaiting the request lets the try/catch below handle request errors.
        const response = await unirest.get(url).headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        });

        const $ = cheerio.load(response.body);
        const author_results = {};

        // The profile header exposes each field under its own id.
        author_results.name = $("#gsc_prf_in").text();
        author_results.position = $("#gsc_prf_inw+ .gsc_prf_il").text();
        author_results.email = $("#gsc_prf_ivh").text();
        author_results.departments = $("#gsc_prf_int").text();

        console.log(author_results);
    } catch (e) {
        console.log(e);
    }
};

getAuthorProfileData();

Our result should look like this 👇🏻:

  {
    name: 'Piyali Banerjee',
    position: 'Postdoctoral Researcher in Physics, IIT Bombay',
    email: 'Verified email at iitb.ac.in',
    departments: 'Experimental High Energy PhysicsPhenomenology'
  }

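The profile URL requested above carries a `user` id, and the same id appears in the `name_link` values returned by the profile search earlier. As a sketch, the id can be read from a link and turned back into a profile URL. The helper names `getAuthorId` and `buildProfileUrl` are assumptions for illustration.

```javascript
// Read the `user` query parameter from a profile link.
const getAuthorId = (nameLink) => new URL(nameLink).searchParams.get("user");

// Build an author profile URL from an author id.
const buildProfileUrl = (authorId, hl = "en") => {
    const url = new URL("https://scholar.google.com/citations");
    url.searchParams.set("hl", hl);
    url.searchParams.set("user", authorId);
    return url.toString();
};

const id = getAuthorId("https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ");
console.log(id);                  // cOsxSDEAAAAJ
console.log(buildProfileUrl(id)); // https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ
```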

Now we will scrape the articles written by the author from their profile.

let articles = [];

// Each article row keeps its title, authors, and publication in a .gsc_a_t cell.
$(".gsc_a_t").each((i, el) => {
    articles.push({
        title: $(el).find(".gsc_a_at").text(),
        // .gsc_a_at is itself the link element, so read href from it directly.
        link: "https://scholar.google.com" + $(el).find(".gsc_a_at").attr("href"),
        authors: $(el).find(".gsc_a_at+ .gs_gray").text(),
        publication: $(el).find(".gs_gray+ .gs_gray").text()
    });
});

// Drop keys that came back empty or undefined.
for (let i = 0; i < articles.length; i++) {
    Object.keys(articles[i]).forEach((key) =>
        articles[i][key] === "" || articles[i][key] === undefined
            ? delete articles[i][key]
            : {}
    );
}

And the results should look like this:

 [
  {
    title: 'Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC',
    link: 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cOsxSDEAAAAJ&citation_for_view=cOsxSDEAAAAJ:u5HHmVD_uO8C',
    authors: 'G Aad, T Abajyan, B Abbott, J Abdallah, SA Khalek, AA Abdelalim, ...',
    publication: 'Physics Letters B 716 (1), 1-29, 2012'
  },
  {
    title: 'The ATLAS simulation infrastructure',
    link: 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cOsxSDEAAAAJ&citation_for_view=cOsxSDEAAAAJ:d1gkVwhDpl0C',
    authors: 'G Aad, B Abbott, J Abdallah, AA Abdelalim, A Abdesselam, B Abi, ...',
    publication: 'The European Physical Journal C 70 (3), 823-874, 2010'
  },

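In the sample output above, each `publication` string ends with a four-digit year. Assuming that holds for the rows you scrape, the year can be extracted with a short helper. The name `extractYear` is hypothetical and the trailing-year pattern is an assumption based only on these two samples.

```javascript
// Take the trailing four-digit year from a publication string, if present.
// Returns null when the string is missing or does not end with four digits.
const extractYear = (publication) => {
    const match = /(\d{4})\s*$/.exec(publication || "");
    return match ? Number(match[1]) : null;
};

console.log(extractYear("Physics Letters B 716 (1), 1-29, 2012"));                // 2012
console.log(extractYear("The European Physical Journal C 70 (3), 823-874, 2010")); // 2010
```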

Now we will scrape the Cited By table on the Google Scholar author profile, which covers citations, h-index, and i10-index, both overall and since 2017.
Here is the code 👇🏻:

let cited_by = {};

cited_by.table = [];
cited_by.table[0] = {};
cited_by.table[0].citations = {};
cited_by.table[0].citations.all = $("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text();
cited_by.table[0].citations.since_2017 = $("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text();
cited_by.table[1] = {};
cited_by.table[1].h_index = {};
cited_by.table[1].h_index.all = $("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text();
cited_by.table[1].h_index.since_2017 = $("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text();
cited_by.table[2] = {};
cited_by.table[2].i_index = {};
cited_by.table[2].i_index.all = $("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text();
cited_by.table[2].i_index.since_2017 = $("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text();

And the result will look like this 👇🏻:

[
  { citations: { all: '230769', since_2017: '105070' } },
  { h_index: { all: '185', since_2017: '133' } },
  { i_index: { all: '1154', since_2017: '706' } }
]
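The values in the Cited By table come back as strings. As a sketch, the rows can be merged into a single object with numeric values, which is easier to compare or store. The function name `toNumericMetrics` is an assumption, and the sample input copies the result shown above.

```javascript
// Sample input in the shape of cited_by.table from the scraper.
const table = [
    { citations: { all: "230769", since_2017: "105070" } },
    { h_index: { all: "185", since_2017: "133" } },
    { i_index: { all: "1154", since_2017: "706" } },
];

// Merge the rows into one object and convert every value to a number.
const toNumericMetrics = (rows) =>
    Object.assign(
        {},
        ...rows.map((row) =>
            Object.fromEntries(
                Object.entries(row).map(([metric, values]) => [
                    metric,
                    Object.fromEntries(
                        Object.entries(values).map(([period, value]) => [period, Number(value)])
                    ),
                ])
            )
        )
    );

console.log(toNumericMetrics(table).h_index); // { all: 185, since_2017: 133 }
```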

Here is the full code to scrape the complete Google Scholar Author Profile page 👇🏻:

const cheerio = require("cheerio");
const unirest = require("unirest");

const getAuthorProfileData = async () => {
    try {
        const url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ";

        // Awaiting the request lets the try/catch below handle request errors.
        const response = await unirest.get(url).headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        });

        const $ = cheerio.load(response.body);

        // Basic profile details.
        const author_results = {};
        author_results.name = $("#gsc_prf_in").text();
        author_results.position = $("#gsc_prf_inw+ .gsc_prf_il").text();
        author_results.email = $("#gsc_prf_ivh").text();
        author_results.departments = $("#gsc_prf_int").text();

        // Articles listed on the profile; this must be an array for push() to work.
        const articles = [];
        $("#gsc_a_b .gsc_a_t").each((i, el) => {
            articles.push({
                title: $(el).find(".gsc_a_at").text(),
                link: "https://scholar.google.com" + $(el).find(".gsc_a_at").attr("href"),
                authors: $(el).find(".gsc_a_at+ .gs_gray").text(),
                publication: $(el).find(".gs_gray+ .gs_gray").text(),
            });
        });

        // Drop keys that came back empty or undefined.
        for (const article of articles) {
            Object.keys(article).forEach((key) => {
                if (article[key] === "" || article[key] === undefined) delete article[key];
            });
        }

        // Cited By table: citations, h-index, and i10-index (all time and since 2017).
        const cited_by = {};
        cited_by.table = [];
        cited_by.table[0] = {};
        cited_by.table[0].citations = {};
        cited_by.table[0].citations.all = $("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text();
        cited_by.table[0].citations.since_2017 = $("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text();
        cited_by.table[1] = {};
        cited_by.table[1].h_index = {};
        cited_by.table[1].h_index.all = $("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text();
        cited_by.table[1].h_index.since_2017 = $("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text();
        cited_by.table[2] = {};
        cited_by.table[2].i_index = {};
        cited_by.table[2].i_index.all = $("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text();
        cited_by.table[2].i_index.since_2017 = $("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text();

        console.log(author_results);
        console.log(articles);
        console.log(cited_by.table);
    } catch (e) {
        console.log(e);
    }
};

getAuthorProfileData();

Conclusion:

In this tutorial, we learned to scrape Google Scholar results using Node.js. Feel free to message me if I missed something or if you need clarification on anything. Follow me on Twitter. Thanks for reading!

Additional Resources

  1. Scrape Google Organic Search Result
  2. Scrape Google Images Results
  3. Scrape Google News Results
  4. Scrape Google Maps Reviews

Author:

My name is Darshan and I am the founder of serpdog.io.
