Dmitriy Zub ☀️

Posted on • Updated on • Originally published at serpapi.com

Scrape Google Scholar Case Law Results to CSV with Python and SerpApi

What will be scraped

(Screenshot: Google Scholar case law results that will be scraped)

Prerequisites

Separate virtual environment

If you're on Linux:

python -m venv env && source env/bin/activate

If you're on Windows and using Git Bash:

python -m venv env && source env/Scripts/activate

If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar.

In short, a virtual environment creates an independent set of installed libraries, and even different Python versions, that can coexist on the same system, preventing library and Python version conflicts.

Install libraries:

pip install pandas google-search-results  

Scrape and save Google Scholar Case Law results to CSV

If you don't need an explanation, try it in the online IDE.

import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd


def case_law_results():
    print("Extracting case law results..")

    params = {
        "api_key": os.getenv("API_KEY"),  # SerpApi API key
        "engine": "google_scholar",       # Google Scholar search results
        "q": "minecraft education",       # search query
        "hl": "en",                       # language
        "start": "0",                     # first page
        "as_sdt": "6"                     # case law results. Weird, huh? Try without it.
    }
    search = GoogleSearch(params)

    case_law_results_data = []

    while True:
        results = search.get_dict()

        if "error" in results:
            break

        print(f"Currently extracting page #{results.get('serpapi_pagination', {}).get('current')}..")

        for result in results["organic_results"]:
            title = result.get("title")
            publication_info_summary = result["publication_info"]["summary"]
            result_id = result.get("result_id")
            link = result.get("link")
            result_type = result.get("type")
            snippet = result.get("snippet")

            try:
                file_title = result["resources"][0]["title"]
            except (KeyError, IndexError):
                file_title = None

            try:
                file_link = result["resources"][0]["link"]
            except (KeyError, IndexError):
                file_link = None

            try:
                file_format = result["resources"][0]["file_format"]
            except (KeyError, IndexError):
                file_format = None

            cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
            cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
            cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
            total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
            all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
            all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})

            case_law_results_data.append({
                "page_number": results['serpapi_pagination']['current'],
                "position": result["position"] + 1,
                "result_type": result_type,
                "title": title,
                "link": link,
                "result_id": result_id,
                "publication_info_summary": publication_info_summary,
                "snippet": snippet,
                "cited_by_count": cited_by_count,
                "cited_by_link": cited_by_link,
                "cited_by_id": cited_by_id,
                "total_versions": total_versions,
                "all_versions_link": all_versions_link,
                "all_versions_id": all_versions_id,
                "file_format": file_format,
                "file_title": file_title,
                "file_link": file_link
            })

        if "next" in results.get("serpapi_pagination", {}):
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            break

    return case_law_results_data


def save_case_law_results_to_csv():
    print("Waiting for case law results to save..")
    pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)

    print("Case Law Results Saved.")


if __name__ == "__main__":
    save_case_law_results_to_csv()

Code explanation

Import libraries:

import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd
  • pandas will be used to easily save extracted data to CSV file.
  • urllib will be used in the pagination process.
  • os is used to return the value of the SerpApi API key environment variable.

Create and pass search parameters to SerpApi, and create a temporary list() to store the extracted data:

params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",       # Google Scholar search results
    "q": "minecraft education",       # search query
    "hl": "en",                       # language
    "start": "0",                     # first page
    "as_sdt": "6"                     # case law results
}
search = GoogleSearch(params)

case_law_results_data = []

as_sdt is used to determine and filter which Court(s) are targeted in an API call. Refer to supported SerpApi Google Scholar Courts or select courts on Google Scholar and pass it to as_sdt parameter.

Note: if you want to search results for the Missouri Court Of Appeals, the as_sdt parameter would become as_sdt=4,204. Pay attention to the leading 4, — without it, article results will appear instead.
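As a small sketch, targeting the Missouri Court Of Appeals would only change the as_sdt value in the params dictionary from earlier (the other values stay the same):

```python
# Search parameters for targeting a specific court.
# "4,204" is the SerpApi-supported value for Missouri Court Of Appeals;
# the leading "4," is required, otherwise article results are returned.
params = {
    "engine": "google_scholar",   # Google Scholar search results
    "q": "minecraft education",   # search query
    "hl": "en",                   # language
    "as_sdt": "4,204"             # Missouri Court Of Appeals case law
}

print(params["as_sdt"])  # -> 4,204
```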

Set up a while loop, add an if statement to be able to exit the loop:

while True:
    results = search.get_dict()

    # if any backend service error or search fail
    if "error" in results:
      break

    # extraction code here... 

    # if next page is present -> update previous results to new page results.
    # if next page is not present -> exit the while loop.
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        break

urlsplit() and parse_qsl() split the next page URL into its parts, and search.params_dict.update() merges the updated search parameter values back into the existing GoogleSearch parameters as a dictionary.
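To see what that one-liner produces, here is a standalone sketch with a made-up next page URL (the real one comes from results["serpapi_pagination"]["next"]):

```python
from urllib.parse import urlsplit, parse_qsl

# hypothetical next-page URL, shaped like the one SerpApi returns
next_page_url = "https://serpapi.com/search.json?engine=google_scholar&q=minecraft+education&start=10&as_sdt=6"

query = urlsplit(next_page_url).query  # keeps only the query string part of the URL
new_params = dict(parse_qsl(query))    # [(key, value), ...] pairs -> {key: value}

print(new_params["start"])  # -> 10 (the offset of the next results page)
```

These updated values are exactly what gets merged into search.params_dict, so the next search.get_dict() call fetches the following page.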

Extract results in a for loop and handle exceptions:

for result in results["organic_results"]:
    title = result.get("title")
    publication_info_summary = result["publication_info"]["summary"]
    result_id = result.get("result_id")
    link = result.get("link")
    result_type = result.get("type")
    snippet = result.get("snippet")

    try:
        file_title = result["resources"][0]["title"]
    except (KeyError, IndexError):
        file_title = None

    try:
        file_link = result["resources"][0]["link"]
    except (KeyError, IndexError):
        file_link = None

    try:
        file_format = result["resources"][0]["file_format"]
    except (KeyError, IndexError):
        file_format = None

    # .get() returns the {} default whenever a key is missing, so the chained calls never raise
    cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
    cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
    cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
    total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
    all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
    all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})
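A quick illustration of that default-value behavior, using a minimal made-up result dictionary in place of a real SerpApi organic result:

```python
# minimal stand-in for one organic result (assumed shape)
result = {"inline_links": {"cited_by": {"total": 17}}}

# present key chain: resolves to the actual value
cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})

# missing key chain: every .get() falls back to {} instead of raising KeyError
total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})

print(cited_by_count)  # -> 17
print(total_versions)  # -> {} ("versions" is missing, so the default is returned)
```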

Append results to temporary list() as a dictionary {}:

case_law_results_data.append({
    "page_number": results['serpapi_pagination']['current'],
    "position": result["position"] + 1,
    "result_type": result_type,
    "title": title,
    "link": link,
    "result_id": result_id,
    "publication_info_summary": publication_info_summary,
    "snippet": snippet,
    "cited_by_count": cited_by_count,
    "cited_by_link": cited_by_link,
    "cited_by_id": cited_by_id,
    "total_versions": total_versions,
    "all_versions_link": all_versions_link,
    "all_versions_id": all_versions_id,
    "file_format": file_format,
    "file_title": file_title,
    "file_link": file_link
})

Return extracted data:

return case_law_results_data

Save the data returned by case_law_results() with to_csv():

pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)
  • data argument inside DataFrame is your data.
  • encoding='utf-8' argument makes sure everything is saved correctly. I used it explicitly even though it's the default value.
  • index=False argument drops the default pandas row numbers.
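As a minimal sketch of that step with stand-in data (two hand-written rows instead of the real case_law_results() output, and a hypothetical example_results.csv filename):

```python
import pandas as pd

# stand-in for the list of dicts returned by case_law_results()
data = [
    {"title": "Some case", "cited_by_count": 17},
    {"title": "Another case", "cited_by_count": 3},
]

# same call shape as the article: UTF-8, no index column
pd.DataFrame(data=data).to_csv("example_results.csv", encoding="utf-8", index=False)

# read it back to confirm the header row and the two data rows
saved = pd.read_csv("example_results.csv")
print(len(saved))           # -> 2
print(list(saved.columns))  # -> ['title', 'cited_by_count']
```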

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
