
How to Scrape Google Patents

Google Patents hosts a vast database of patents granted worldwide and lets you explore them for a particular query. Its collection includes patents issued by the United States Patent and Trademark Office (USPTO), the European Patent Office (EPO), and many other patent offices globally.

Scrape Google Patents

In this article, we will access and scrape Google Patents, a valuable source of patent information on the internet.

Let’s Start Scraping Google Patents Using Python

We’ll mainly focus on scraping organic results from Google Patents.

Google Patents Target Page

Let’s get started by installing the required libraries, which we will use throughout this tutorial.

pip install selenium
pip install beautifulsoup4
  1. Selenium — Web Driver for opening the URL in the Chrome browser.

  2. Beautiful Soup — For parsing the HTML data.

Next, we will import these libraries into our program file.

import time
import json
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
from selenium import webdriver

After that, we will initialize the Service with the path where ChromeDriver is installed.

SERVICE_PATH = r"E:\chromedriver.exe"

service = Service(SERVICE_PATH)
driver = webdriver.Chrome(service=service)
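Note: if you are on Selenium 4.6 or newer, Selenium Manager can download and manage the matching ChromeDriver for you, so the explicit Service path is optional:

# Selenium 4.6+ resolves the driver automatically, no Service path needed
driver = webdriver.Chrome()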

Then, we will define our target URL, make a GET request with the driver, and wait two seconds until the page has loaded completely.

driver.get("https://patents.google.com/?q=(ball+bearings)&oq=ball+bearings")
time.sleep(2)
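A fixed time.sleep(2) is enough for this tutorial, but the results may take longer to render on a slow connection. As a sketch, an explicit wait on the search-result-item element (the same selector we use below) is a more reliable alternative:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one patent result to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "search-result-item"))
)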

After that, we will create a Beautiful Soup instance to parse the extracted HTML data.

soup = BeautifulSoup(driver.page_source, "html.parser")

Then, we will inspect the target web page to find the respective tags of the required data points.

Google Patents Organic Results

As you can see, every patent result is inside a search-result-item tag. We will use this tag as a reference to find the other elements and build up the loop body in the sections that follow.

patent_results = []
for el in soup.select("search-result-item"):

Scraping Patent Title

Google Patent Title

The title is present under the h3 tag. Add this line inside the for loop block.

title = el.select_one("h3").get_text().strip()

Scraping Patent Link

In the above image, we can see that the anchor link sits under the a tag inside the state-modifier element.

Let us study this link before adding it to our code.

https://patents.google.com/patent/US10945783B2/en?q=(ball+bearings)&oq=ball+bearings&peid=61296253f5d80%3Af%3A51b23c37

If you remove the noisy query parameters, you will get this as the link:

https://patents.google.com/patent/US10945783B2/

The unique part of this URL is the patent ID, US10945783B2. So, we need the patent ID to build the link.

If you inspect the title again, you will find that the h3 header is inside the state-modifier tag, whose data-result attribute contains the patent path with the patent ID.

Google Patent Link

We will add this to our code to create a link.

link = "https://patents.google.com/" + element.select_one("state-modifier")['data-result']
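The data-result value is a path such as patent/US10945783B2/en (this exact format is an assumption based on the link we inspected above). If you prefer the canonical link without the language suffix, a small sketch like this would rebuild it from the patent ID:

# Assumed data-result format: "patent/<PATENT_ID>/<lang>", e.g. "patent/US10945783B2/en"
result_path = el.select_one("state-modifier")["data-result"]
patent_id = result_path.split("/")[1]
canonical_link = "https://patents.google.com/patent/" + patent_id + "/"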

Scraping Metadata and Dates

Metadata is the additional information Google Patents shows for categorization, indexing, and search purposes. You can see it below the title of each organic patent result.

Google Patents Metadata

Let us add this to our code as well.

metadata = ' '.join(el.select_one('h4.metadata').get_text().split())

Similarly, we will select the dates, which live in the same h4 tag but with a different class name, dates.

Google Patents Dates

dates = el.select_one('h4.dates').get_text().strip()

Finally, we will extract the snippet part of the patent.

Scraping Snippet

The snippet is contained inside the span tag with the id htmlContent.

Google Patents Snippet

At last, we will add this to our code.

snippet = el.select_one('span#htmlContent').get_text()

We have parsed every data point we need from the web page.
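Note that select_one returns None when an element is missing, so a result that lacks one of these fields would raise an AttributeError. A minimal defensive helper, assuming you would rather keep partial results than crash, could look like this:

def safe_text(parent, selector, default=""):
    # Return the stripped text of the first match, or a default if it is missing
    node = parent.select_one(selector)
    return node.get_text().strip() if node else default

# Usage inside the loop, for example:
# title = safe_text(el, "h3")
# snippet = safe_text(el, "span#htmlContent")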

At the end, we will append these data points to our patent_results list.

for el in soup.select('search-result-item'):
    title = el.select_one('h3').get_text().strip()
    link = "https://patents.google.com/" + el.select_one("state-modifier")['data-result']
    metadata = ' '.join(el.select_one('h4.metadata').get_text().split())
    date = el.select_one('h4.dates').get_text().strip()
    snippet = el.select_one('span#htmlContent').get_text()

    patent_results.append({
        'title': title,
        'link': link,
        'metadata': metadata,
        'date': date,
        'snippet': snippet
    })

Complete Code

You can modify the code below as per your requirements, but for this tutorial, we will go with this version:

import time
import json
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
from selenium import webdriver

SERVICE_PATH = r"E:\chromedriver.exe"

service = Service(SERVICE_PATH)
driver = webdriver.Chrome(service=service)

driver.get("https://patents.google.com/?q=(ball+bearings)&oq=ball+bearings")
time.sleep(2)

soup = BeautifulSoup(driver.page_source, "html.parser")
patent_results = []

for el in soup.select('search-result-item'):
    title = el.select_one('h3').get_text().strip()
    link = "https://patents.google.com/" + el.select_one("state-modifier")['data-result']
    metadata = ' '.join(el.select_one('h4.metadata').get_text().split())
    date = el.select_one('h4.dates').get_text().strip()
    snippet = el.select_one('span#htmlContent').get_text()

    patent_results.append({
        'title': title,
        'link': link,
        'metadata': metadata,
        'date': date,
        'snippet': snippet
    })

print(json.dumps(patent_results, indent=2))

driver.quit()

Run this program in your terminal, and you will get results like the ones below.

[
  {
    "title": "Surgical instrument with modular shaft and end effector",
    "link": "https://patents.google.com/patent/US10945783B2/en",
    "metadata": "WO EP US CN JP AU CA US10945783B2 Kevin L. Houser Ethicon Llc",
    "date": "Priority 2010-11-05 • Filed 2019-08-02 • Granted 2021-03-16 • Published 2021-03-16",
    "snippet": "Surgical instrument with modular shaft and end effectorKevin L. HouserEthicon Llc A surgical instrument operable to sever tissue includes a body assembly and a selectively coupleable end effector assembly. The end effector assembly may include a transmission assembly, an end effector, and a rotational knob operable to rotate the transmission assembly and the end effector. The …"
  },
  {
    "title": "Disk drive executing jerk seeks to rotate pivot ball bearings relative to races",
    "link": "https://patents.google.com/patent/US8780479B1/en",
    "metadata": "US CN HK US8780479B1 Daniel L. Helmick Western Digital Technologies, Inc.",
    "date": "Priority 2013-05-17 • Filed 2013-06-24 • Granted 2014-07-15 • Published 2014-07-15",
    "snippet": "Disk drive executing jerk seeks to rotate pivot ball bearings relative to racesDaniel L. HelmickWestern Digital Technologies, Inc. a voice coil motor (VCM) operable to rotate the actuator arm about a pivot bearing including a race and a plurality of ball bearings; and control circuitry operable to: control the VCM to execute a first jerk seek in a first radial direction so that the ball bearings slip within the race by a first …"
  },
  ...
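If you want to reuse the scraped data later, you can also write patent_results to a file instead of (or in addition to) printing it:

# Save the results to a JSON file for later analysis
with open("patent_results.json", "w", encoding="utf-8") as f:
    json.dump(patent_results, f, ensure_ascii=False, indent=2)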

Conclusion

In a nutshell, scraping Google Patents can be a powerful tool for researchers who seek to create a patent database for market research, data analysis, academic projects, and much more.

In this article, we learned how to scrape Google Patents results using Python. We will keep updating this article to add more insights into scraping Google Patents data more efficiently.

Feel free to contact us for anything you need clarification on. Follow me on Twitter. Thanks for reading!

Additional Resources

I have prepared a complete list of blogs on Google scraping, which can help you in your data extraction journey:

  1. Scraping Google News

  2. Scrape Google Maps

  3. Scrape Google Maps Reviews

  4. Scrape Google Jobs
