DEV Community

Artur Chukhrai for SerpApi

Posted on • Updated on • Originally published at serpapi.com

Scrape Google Product Page with Python

What will be scraped

[Image: the parts of the Google Shopping product page that will be scraped]

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

import requests, json
from parsel import Selector


def get_product_page_results(url, params, headers):
    html = requests.get(url, params=params, headers=headers)
    selector = Selector(html.text)

    title = selector.css('.sh-t__title::text').get()
    prices = [price.css('::text').get() for price in selector.css('.MLYgAb .g9WBQb')]
    low_price = selector.css('.KaGvqb .qYlANb::text').get()
    high_price = selector.css('.xyYTQb .qYlANb::text').get()
    shown_price = selector.css('.FYiaub').xpath('normalize-space()').get()
    reviews = int(selector.css('.YVQvvd .HiT7Id span::text').get()[1:-1].replace(',', ''))
    rating = float(selector.css('.uYNZm::text').get())
    extensions = [extension.css('::text').get() for extension in selector.css('.OA4wid')]
    description = selector.css('.sh-ds__trunc-txt::text').get()
    media = [image.css('::attr(src)').get() for image in selector.css('.sh-div__image')]
    highlights = [highlight.css('::text').get() for highlight in selector.css('.KgL16d span')]

    data = {
        'title': title,
        'prices': prices,
        'typical_prices': {
            'low': low_price,
            'high': high_price,
            'shown_price': shown_price
        },
        'reviews': reviews,
        'rating': rating,
        'extensions': extensions,
        'description': description,
        'media': media,
        'highlights': highlights
    }

    return data


def main():
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        'product_id': '16230039729797264158',   # product id
        'hl': 'en',                             # language
        'gl': 'us'                              # country of the search, US -> USA
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    URL = f'https://www.google.com/shopping/product/{params["product_id"]}?hl={params["hl"]}&gl={params["gl"]}'

    product_page_results = get_product_page_results(URL, params, headers)

    print(json.dumps(product_page_results, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()

Preparation

Install libraries:

pip install requests parsel

Reduce the chance of being blocked

Make sure you pass a user-agent in the request headers so that the request looks like a visit from a "real" user. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. You can check what your user-agent is.

The blog post on how to reduce the chance of being blocked while web scraping can get you familiar with basic and more advanced approaches.

Code Explanation

Import libraries:

import requests, json
from parsel import Selector

Library | Purpose
--- | ---
requests | to make a request to the website.
json | to convert extracted data to a JSON string.
Selector | XML/HTML parser that has full XPath and CSS selectors support.

At the beginning of the main() function, the request parameters and headers are defined, and the URL is built from the parameters. If you want to send other parameters or headers with the request, add them to the params and headers dictionaries:

def main():
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        'product_id': '16230039729797264158',   # product id
        'hl': 'en',                             # language
        'gl': 'us'                              # country of the search, US -> USA
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    URL = f'https://www.google.com/shopping/product/{params["product_id"]}?hl={params["hl"]}&gl={params["gl"]}'
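
Note that the URL above already carries hl and gl in its query string, and the same params dictionary is later passed to requests.get(), which appends its params to the URL as well. A minimal sketch, using only the standard library, of building the query string once from a dictionary. build_product_url is a hypothetical helper for illustration, not part of the original code:

```python
from urllib.parse import urlencode


def build_product_url(product_id, query):
    """Return the product page URL with `query` encoded as the query string.

    Hypothetical helper: the base URL is the one used in this post.
    """
    base = f'https://www.google.com/shopping/product/{product_id}'
    # urlencode() escapes the values and joins the pairs with '&'
    return f'{base}?{urlencode(query)}' if query else base


url = build_product_url('16230039729797264158', {'hl': 'en', 'gl': 'us'})
print(url)
# https://www.google.com/shopping/product/16230039729797264158?hl=en&gl=us
```

With a URL built this way, the request can be sent without also passing params, so each parameter appears in the final URL only once.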

Next, the URL, params and headers are passed to the get_product_page_results(URL, params, headers) function, which returns the extracted data as a dictionary that is stored in product_page_results. At the end of main(), the data is printed in JSON format:

product_page_results = get_product_page_results(URL, params, headers)

print(json.dumps(product_page_results, indent=2, ensure_ascii=False))

This code follows the common convention of guarding the entry point with the __name__ == "__main__" check:

if __name__ == "__main__":
    main()

The condition is true only when the file is run directly. If the file is imported into another module, __name__ holds the module's name instead, so main() is not called. You can watch the video Python Tutorial: if __name__ == '__main__' for more details.

Let's take a look at the get_product_page_results(url, params, headers) function mentioned earlier.

This function takes the url, params and headers arguments and uses them to make the request. The HTML of the response is then passed to Selector from the Parsel package, which parses it so that data can be extracted with CSS and XPath selectors:

def get_product_page_results(url, params, headers):
    html = requests.get(url, params=params, headers=headers)
    selector = Selector(html.text)

Data like title, low_price, high_price and description are pretty easy to retrieve. You need to find the selector and get the value:

title = selector.css('.sh-t__title::text').get()
low_price = selector.css('.KaGvqb .qYlANb::text').get()
high_price = selector.css('.xyYTQb .qYlANb::text').get()
description = selector.css('.sh-ds__trunc-txt::text').get()

Code | Explanation
--- | ---
css() | to access elements by the passed selector.
::text or ::attr(<attribute>) | to extract textual or attribute data from the node.
get() | to actually extract the textual data.

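Extracting shown_price differs from the previous ones: the text has to be taken not only from the selected element but also from the elements nested in it. The XPath normalize-space() function returns all of that text with leading and trailing whitespace trimmed and internal runs of whitespace collapsed into single spaces: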

shown_price = selector.css('.FYiaub').xpath('normalize-space()').get()
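
To see why the element's own text is not enough here, the following sketch reproduces the idea on a small fragment. It uses xml.etree.ElementTree from the standard library as a stand-in for Parsel, and the markup is made up for illustration; the real structure inside .FYiaub may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the price and the seller sit in nested elements
snippet = '<span class="FYiaub"><b>$499.00</b>\n   at <a>EvQ</a></span>'
node = ET.fromstring(snippet)

# The element has no text of its own before its first child
print(repr(node.text))  # None

# Collect the text of the element and all of its descendants,
# then collapse whitespace the way normalize-space() does
full_text = ' '.join(''.join(node.itertext()).split())
print(full_text)  # $499.00 at EvQ
```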

Data such as reviews and rating must be converted to a numeric data type. Note that reviews is retrieved in this format: (63,413). To convert it to a number, you need to remove the parentheses and the comma:

reviews = int(selector.css('.YVQvvd .HiT7Id span::text').get()[1:-1].replace(',', ''))
rating = float(selector.css('.uYNZm::text').get())
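
The two lines above raise an exception when a selector matches nothing, because get() then returns None. A minimal sketch of a more defensive conversion; parse_reviews and parse_rating are hypothetical helpers, not part of the original code:

```python
def parse_reviews(raw):
    """Convert a string such as '(63,413)' to 63413; return None if it cannot be parsed."""
    if not raw:
        return None
    # Drop surrounding whitespace and parentheses, then the thousands separator
    digits = raw.strip().strip('()').replace(',', '')
    return int(digits) if digits.isdigit() else None


def parse_rating(raw):
    """Convert a string such as '4.7' to a float; return None if it cannot be parsed."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


print(parse_reviews('(63,413)'))  # 63413
print(parse_rating('4.7'))        # 4.7
print(parse_reviews(None))        # None
```

The helpers would be called with the result of get(), for example parse_reviews(selector.css('.YVQvvd .HiT7Id span::text').get()).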

The selectors for prices, extensions, media and highlights match multiple elements, so these values are extracted with list comprehensions:

prices = [price.css('::text').get() for price in selector.css('.MLYgAb .g9WBQb')]
extensions = [extension.css('::text').get() for extension in selector.css('.OA4wid')]
media = [image.css('::attr(src)').get() for image in selector.css('.sh-div__image')]
highlights = [highlight.css('::text').get() for highlight in selector.css('.KgL16d span')]

After extracting all the data, the data dictionary is formed:

data = {
    'title': title,
    'prices': prices,
    'typical_prices': {
        'low': low_price,
        'high': high_price,
        'shown_price': shown_price
    },
    'reviews': reviews,
    'rating': rating,
    'extensions': extensions,
    'description': description,
    'media': media,
    'highlights': highlights
}

At the end of the function, the data dictionary is returned.

return data

Output:

{
  "title": "Sony PlayStation 5 - Standard",
  "prices": [
    "$499.00",
    "$700.00",
    "$729.00"
  ],
  "typical_prices": {
    "low": "$499.00",
    "high": "$719.75",
    "shown_price": "$499.00 at EvQ"
  },
  "reviews": 63413,
  "rating": 4.7,
  "extensions": [
    "Blu-ray Compatible",
    "4K Capable",
    "Backward Compatible",
    "Standard Edition",
    "With Motion Control",
    "Bluetooth",
    "Wi-Fi"
  ],
  "description": "Experience lightning-fast loading with an ultra-high-speed SSD, deeper immersion with support for haptic feedback, adaptive triggers and 3D audio, and a next generation of incredible PlayStation games.",
  "media": [
    "https://encrypted-tbn3.gstatic.com/shopping?q=tbn:ANd9GcSbKnqqdMH6hYKh8mzk9kje2m3KI-bRktHWihZ_LYAHQF0BNIXyfzjjusW0XMVpuUk13pFiHLVztP7Rk7GDgxBUnC6hFY84sQ&usqp=CAY",
    "https://encrypted-tbn1.gstatic.com/shopping?q=tbn:ANd9GcTa0aWvl4ZCffiyfM3sBvdYLk1K8SkMIo6ZkmN3ASkW7GPgVmB_XMOFCBgmW-AMOspQ9KFLJjKN9uPZbj0ScCVOizsmX8Fegg&usqp=CAY",
    "https://encrypted-tbn2.gstatic.com/shopping?q=tbn:ANd9GcRJRtfshsdgf4JJGzS-QzvYXjzOy4NKV-y_0yQn-W6n109ziyqOzTvDcX-YXNmr3rPu4cHKpo7OVV2fkDzodE7LK6Pxh63l&usqp=CAY",
    "https://encrypted-tbn3.gstatic.com/shopping?q=tbn:ANd9GcS6lCLgdUU42DbmP2Y8o5MPMHF_j1LFpMvdBHNTPLIfBOn8bnpC-xPBYl14wDMiPK7lQ1YL_BEeOm5vqVmfJpBLnOomYoXy&usqp=CAY"
  ],
  "highlights": [
    "Integrated I/O: Marvel at incredible graphics and experience new PS5 features.",
    "Ultra-high speed SSD: Maximize your play sessions with near-instant load times for installed PS5 games.",
    "HDR technology: With an HDR TV, supported PS5 games display an unbelievably vibrant and lifelike range of colors.",
    "8K output: PS5 consoles support an 8K output, so you can play games on your 4320p resolution display.",
    "4K TV gaming: Play your favorite PS5 games on your stunning 4K TV. Up to 120 fps with 120Hz output"
  ]
}

Using Google Product Page API from SerpApi

This section compares the DIY solution with our solution.

The main difference is that it's a quicker approach: Google Product Page API will bypass blocks from search engines, and you don't have to create the parser from scratch and maintain it.

First, we need to install google-search-results:

pip install google-search-results

Import the necessary libraries for work:

from serpapi import GoogleSearch
import os, json

Next, we define the parameters needed for making a request:

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),        # your SerpApi API key
    'engine': 'google_product',             # SerpApi search engine 
    'product_id': '16230039729797264158',   # product id
    'hl': 'en',                             # language
    'gl': 'us'                              # country of the search, US -> USA
}

We then create a search object; the data extraction itself happens on the SerpApi backend. The get_dict() method returns the JSON response as a Python dictionary, which we store in results:

search = GoogleSearch(params)   # where data extraction happens on the SerpApi backend
results = search.get_dict()     # JSON -> Python dict

Retrieving the data is simple: we just need to access the 'product_results' key.

product_results = results['product_results']
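
Indexing with results['product_results'] raises a KeyError if the key is absent, for example when the request fails. A minimal sketch of a guarded lookup, assuming that a failed request yields a dictionary with a top-level 'error' key; extract_product_results is a hypothetical helper, and the sample dictionary stands in for a real response:

```python
def extract_product_results(results):
    """Return the 'product_results' part of a response dictionary.

    Assumption: an unsuccessful response carries an 'error' message
    at the top level instead of 'product_results'.
    """
    if 'error' in results:
        raise RuntimeError(f"SerpApi returned an error: {results['error']}")
    # Fall back to an empty dict when the key is simply absent
    return results.get('product_results', {})


# Stand-in for search.get_dict(), so the sketch runs without an API key
sample = {'product_results': {'title': 'Sony PlayStation 5 - Standard'}}
print(extract_product_results(sample)['title'])  # Sony PlayStation 5 - Standard
```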

Example code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),        # your SerpApi API key
    'engine': 'google_product',             # SerpApi search engine 
    'product_id': '16230039729797264158',   # product id
    'hl': 'en',                             # language
    'gl': 'us'                              # country of the search, US -> USA
}


search = GoogleSearch(params)               # where data extraction happens on the SerpApi backend
results = search.get_dict()                 # JSON -> Python dict

product_results = results['product_results']

print(json.dumps(product_results, indent=2, ensure_ascii=False))

Output:

{
  "product_id": 16230039729797264158,
  "title": "Sony PlayStation 5 - Standard",
  "prices": [
    "$499.99",
    "$499.00",
    "$700.00"
  ],
  "conditions": [
    "New",
    "New",
    "New"
  ],
  "typical_prices": {
    "low": "$499.00",
    "high": "$719.75",
    "shown_price": "$499.99 at Gamestop"
  },
  "reviews": 63413,
  "rating": 4.7,
  "extensions": [
    "Blu-ray Compatible",
    "4K Capable",
    "Backward Compatible",
    "Standard Edition",
    "With Motion Control",
    "Bluetooth",
    "Wi-Fi"
  ],
  "description": "Experience lightning-fast loading with an ultra-high-speed SSD, deeper immersion with support for haptic feedback, adaptive triggers and 3D audio, and a next generation of incredible PlayStation games.",
  "media": [
    {
      "type": "image",
      "link": "https://encrypted-tbn3.gstatic.com/shopping?q=tbn:ANd9GcRoN7Gg6r9ZxPZGkfTEbukowBuBvalGRrJG44Dwnw8_PAmLUNjt&usqp=CAY"
    },
    {
      "type": "image",
      "link": "https://encrypted-tbn1.gstatic.com/shopping?q=tbn:ANd9GcQOuj8omxssTbuSixiKmldKmSOCllkb1jLSqYHbThqgR3l78gjS&usqp=CAY"
    },
    {
      "type": "image",
      "link": "https://encrypted-tbn2.gstatic.com/shopping?q=tbn:ANd9GcQmw7DOYYmm5nQSQoEhAaE78a5IyNW3tHoCE1VRI2cxTHn9QGg&usqp=CAY"
    },
    {
      "type": "image",
      "link": "https://encrypted-tbn3.gstatic.com/shopping?q=tbn:ANd9GcRsejj3qFlCeXGkvHMG7yGdM6gR_AbzoT_fWZUcYrhS3QKxpHI&usqp=CAY"
    }
  ],
  "highlights": [
    "Integrated I/O: Marvel at incredible graphics and experience new PS5 features.",
    "Ultra-high speed SSD: Maximize your play sessions with near-instant load times for installed PS5 games.",
    "HDR technology: With an HDR TV, supported PS5 games display an unbelievably vibrant and lifelike range of colors.",
    "8K output: PS5 consoles support an 8K output, so you can play games on your 4320p resolution display.",
    "4K TV gaming: Play your favorite PS5 games on your stunning 4K TV. Up to 120 fps with 120Hz output"
  ]
}

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
