Artur Chukhrai for SerpApi

Originally published at serpapi.com

Scrape Google Product Online Sellers with Python

What will be scraped

[Image: wwbs-google-online-sellers]

📌 Note: In this image, I demonstrate that the data will be received with pagination. Therefore, I only show 5 sellers, and not all, as the image could take up a lot of space.

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

import requests, json
from parsel import Selector


def get_online_sellers_results(url, headers):
    data = []

    while True:
        html = requests.get(url, headers=headers)
        selector = Selector(html.text)

        for result in selector.css('.sh-osd__offer-row'):
            name = result.css('.kjM2Bf::text, .b5ycib::text').get() 
            link = 'https://www.google.com' + result.css('.b5ycib::attr(href)').get() if result.css('.b5ycib') else None
            base_price = result.css('.fObmGc::text').get()
            shipping = result.css('.SuutWb tr:nth-child(2) td:nth-child(2)::text').get() 
            tax = result.css('.SuutWb tr:nth-child(3) td:nth-child(2)::text').get()
            total_price = result.css('.drzWO::text').get()

            data.append({
                'name': name,
                'link': link,
                'base_price': base_price,
                'additional_price': {
                    'shipping': shipping,
                    'tax': tax
                },
                'total_price': total_price
            })

        # get(default='') avoids a TypeError when the pagination link is missing
        if 'Next' in selector.css('.R9e18b .internal-link::text').get(default=''):
            url = 'https://www.google.com' + selector.css('.R9e18b .internal-link::attr(data-url)').get()
        else:
            break

    return data


def main():
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    URL = 'https://www.google.com/shopping/product/14019378181107046593/offers?hl=en&gl=us'

    online_sellers = get_online_sellers_results(URL, headers)

    print(json.dumps(online_sellers, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()

Preparation

Install libraries:

pip install requests parsel

Reduce the chance of being blocked

Make sure you pass a user-agent in the request headers so that the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what your user-agent is.

The blog post on how to reduce the chance of being blocked while web scraping can get you familiar with basic and more advanced approaches.

Code Explanation

Import libraries:

import requests, json
from parsel import Selector

Library | Purpose
--- | ---
requests | to make a request to the website.
json | to convert extracted data to a JSON-formatted string.
Selector | XML/HTML parser that has full XPath and CSS selector support.

At the beginning of the main() function, the headers and URL are defined. This data is then passed to the get_online_sellers_results(URL, headers) function to form a request and extract information.

The online_sellers list holds the data returned by get_online_sellers_results(). At the end of main(), the data is printed in JSON format:

def main():
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    URL = 'https://www.google.com/shopping/product/14019378181107046593/offers?hl=en&gl=us'

    online_sellers = get_online_sellers_results(URL, headers)

    print(json.dumps(online_sellers, indent=2, ensure_ascii=False))

This code follows the common convention of guarding the entry point with the __name__ == "__main__" check:

if __name__ == "__main__":
    main()

The condition is true only when the file is run directly. When the file is imported into another module, __name__ is set to the module's name, so the condition is false and main() is not called. You can watch the video Python Tutorial: if name == 'main' for more details.

Let's take a look at the get_online_sellers_results(url, headers) function mentioned earlier. This function takes url and headers parameters to create a request. At the beginning of the function, the data list in which the data will be stored is defined:

def get_online_sellers_results(url, headers):
    data = []

Next, the HTML returned by the request is passed to the Selector class from the Parsel package for parsing.

Up to 20 sellers fit on one page. If there are more than 20, the remaining sellers are placed on additional pages. To scrape Google Product Online Sellers with pagination, check for the presence of the Next button. While the Next button exists, take the URL of the next page from it and request that page. If the Next button is not present, break out of the while loop:

while True:
    html = requests.get(url, headers=headers)
    selector = Selector(html.text)

    # data extraction from current page will be here

    # get(default='') avoids a TypeError when the pagination link is missing
    if 'Next' in selector.css('.R9e18b .internal-link::text').get(default=''):
        url = 'https://www.google.com' + selector.css('.R9e18b .internal-link::attr(data-url)').get()
    else:
        break
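
The control flow of this loop can be tried without touching the network. The sketch below is only an illustration under stated assumptions: FAKE_PAGES, fetch_page() and collect_all_pages() are hypothetical names that stand in for requests.get() and Selector, and the max_pages cap is an extra safeguard that is not part of the original code.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.google.com'

# Hypothetical stand-in for the network:
# URL -> (sellers on that page, relative URL of the next page or None)
FAKE_PAGES = {
    'https://www.google.com/offers?page=1': (['Seller A', 'Seller B'], '/offers?page=2'),
    'https://www.google.com/offers?page=2': (['Seller C'], None),
}


def fetch_page(url):
    """Hypothetical replacement for requests.get() followed by Selector()."""
    return FAKE_PAGES[url]


def collect_all_pages(start_url, max_pages=50):
    """Follow the 'next page' link until it disappears or max_pages is reached."""
    sellers = []
    url = start_url
    for _ in range(max_pages):              # cap so a markup change cannot cause an endless loop
        page_sellers, next_path = fetch_page(url)
        sellers.extend(page_sellers)
        if next_path is None:               # no Next button: this was the last page
            break
        url = urljoin(BASE_URL, next_path)  # relative link -> absolute URL
    return sellers


all_sellers = collect_all_pages('https://www.google.com/offers?page=1')
print(all_sellers)
```

Using urljoin() instead of string concatenation is a design choice here: it also handles the case where the next-page link is already absolute.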

To retrieve data, you first need to select the .sh-osd__offer-row elements, one per seller, and iterate over them:

for result in selector.css('.sh-osd__offer-row'):
    # data extraction from each seller will be here

The name, base_price, shipping, tax and total_price are retrieved for each seller. Note that not every seller has a link, so a ternary expression is used when extracting it:

name = result.css('.kjM2Bf::text, .b5ycib::text').get() 
link = 'https://www.google.com' + result.css('.b5ycib::attr(href)').get() if result.css('.b5ycib') else None
base_price = result.css('.fObmGc::text').get()
shipping = result.css('.SuutWb tr:nth-child(2) td:nth-child(2)::text').get() 
tax = result.css('.SuutWb tr:nth-child(3) td:nth-child(2)::text').get()
total_price = result.css('.drzWO::text').get()

Code | Explanation
--- | ---
css() | to access elements by the passed selector.
::text or ::attr(<attribute>) | to extract textual or attribute data from the node.
get() | to return the first matched result as a string, or None if nothing matched.
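
The link values collected this way are Google redirect URLs of the form https://www.google.com/url?q=<seller URL>&..., as the output later in this post shows. If you need the seller's own URL, it can be read from the q parameter with the standard library. This is a minimal sketch, not part of the original scraper: extract_seller_url() is a hypothetical helper, and the sample link is made up and shortened.

```python
from urllib.parse import urlsplit, parse_qs


def extract_seller_url(google_link):
    """Return the URL stored in the `q` query parameter, or None if it is absent."""
    if not google_link:
        return None
    query = parse_qs(urlsplit(google_link).query)  # parse_qs also percent-decodes the values
    targets = query.get('q')
    return targets[0] if targets else None


# Made-up link in the same shape as the scraped ones
link = 'https://www.google.com/url?q=https://example.com/item%3Fid%3D42%26ref%3DNS&sa=U'
seller_url = extract_seller_url(link)
print(seller_url)
```

The helper returns None for sellers without a link, which matches the None that the ternary expression above produces.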

After extracting all data about the seller, a dictionary with this data is appended to the data list:

data.append({
    'name': name,
    'link': link,
    'base_price': base_price,
    'additional_price': {
        'shipping': shipping,
        'tax': tax
    },
    'total_price': total_price
})

At the end of the function, the data list is returned.

return data

Output:

[
  {
    "name": "Best Buy",
    "link": "https://www.google.com/url?q=https://www.bestbuy.com/site/steelseries-aerox-3-2022-edition-lightweight-wired-optical-gaming-mouse-onyx/6485231.p%3FskuId%3D6485231%26ref%3DNS%26loc%3D101&sa=U&ved=0ahUKEwiSuKKm1r_7AhWESDABHQvhDGwQ2ykIJA&usg=AOvVaw37TQlxlXfUf7Aow3-oj3Wr",
    "base_price": "$34.99",
    "additional_price": {
      "shipping": "$0.00",
      "tax": "$3.11"
    },
    "total_price": "$38.10"
  },
  ... other sellers
  {
    "name": "Network Hardwares",
    "link": "https://www.google.com/url?q=https://www.networkhardwares.com/products/aerox-3-wireless-2022-edition-62611%3Fcurrency%3DUSD%26variant%3D41025510441165%26utm_medium%3Dcpc%26utm_source%3Dgoogle%26utm_campaign%3DGoogle%2520Shopping%26srsltid%3DAYJSbAeM3Wi-nx6CPNXcQIZqlFcEv3uyBEgwTXa36ijEua1hx_LNmAm5EiM&sa=U&ved=0ahUKEwiSuKKm1r_7AhWESDABHQvhDGwQ2ykImgE&usg=AOvVaw1rOVOsiroUgnyyTT2JBN61",
    "base_price": "$64.51",
    "additional_price": {
      "shipping": "$0.00",
      "tax": "$5.73"
    },
    "total_price": "$70.24"
  }
]
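
All prices come back as strings, so any arithmetic on them needs a conversion first. The sketch below checks that base price, shipping and tax add up to the reported total. It assumes US-dollar strings such as "$34.99"; other formats would need extra handling. to_decimal() and total_matches() are hypothetical helper names, and the sample values are taken from the output above.

```python
from decimal import Decimal


def to_decimal(price):
    """Convert a price string such as '$1,234.50' to Decimal; return None if missing."""
    if price is None:
        return None
    return Decimal(price.replace('$', '').replace(',', '').strip())


def total_matches(seller):
    """Check that base price + shipping + tax equals the reported total."""
    parts = [
        to_decimal(seller['base_price']),
        to_decimal(seller['additional_price']['shipping']),
        to_decimal(seller['additional_price']['tax']),
    ]
    total = to_decimal(seller['total_price'])
    if total is None or any(part is None for part in parts):
        return None  # not enough data to decide
    return sum(parts) == total


best_buy = {
    'name': 'Best Buy',
    'base_price': '$34.99',
    'additional_price': {'shipping': '$0.00', 'tax': '$3.11'},
    'total_price': '$38.10',
}
print(total_matches(best_buy))
```

Decimal is used instead of float so that cent values are compared exactly.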

Using Google Online Sellers API from SerpApi

This section compares the DIY solution with our solution.

The main difference is that it's a quicker approach: the Google Online Sellers API bypasses blocks from search engines, and you don't have to create a parser from scratch and maintain it.

First, we need to install google-search-results:

pip install google-search-results

Import the necessary libraries for work:

from serpapi import GoogleSearch
import os, json

Next, we write the necessary parameters for making a request:

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),        # your SerpApi API key
    'engine': 'google_product',             # SerpApi search engine
    'product_id': '14019378181107046593',   # product id
    'offers': True,                         # more offers; can also be set to '1', which is the same as True
    'hl': 'en',                             # language
    'gl': 'us'                              # country of the search, US -> USA
}
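
Note that os.getenv() returns None when the variable is not set, so a missing key would only surface later as a failed request. A small guard can make that failure explicit. This is a sketch under assumptions: require_env() is a hypothetical helper, and the placeholder value exists only so the example runs on its own.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f'Environment variable {name} is not set')
    return value


# Demo only: set a placeholder so the example runs; export your real key in practice
os.environ.setdefault('API_KEY', 'placeholder-key')
api_key = require_env('API_KEY')
```

The returned value can then be used for the 'api_key' entry in params.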

We then create a search object where the data is retrieved from the SerpApi backend. In the results dictionary we get data from JSON:

search = GoogleSearch(params)   # where data extraction happens on the SerpApi backend
results = search.get_dict()     # JSON -> Python dict

Retrieving the data is quite simple. We just need to access the 'sellers_results' key and then the 'online_sellers' key:

online_sellers = results['sellers_results']['online_sellers']

After reviewing the playground, you will be able to see which keys are available in this JSON structure.
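
Direct indexing raises a KeyError if one of the keys is missing from the response. A more defensive lookup might look like the sketch below. It is an illustration only: get_online_sellers() is a hypothetical helper, the sample dictionaries are made up, and the presence of an 'error' key in failed responses is an assumption to verify against the SerpApi documentation.

```python
def get_online_sellers(results):
    """Return the list of online sellers, or an empty list if the keys are missing."""
    if 'error' in results:  # assumption: failed searches carry an 'error' message
        raise RuntimeError(results['error'])
    return results.get('sellers_results', {}).get('online_sellers', [])


# Made-up response dictionaries for illustration
ok = {'sellers_results': {'online_sellers': [{'position': 1, 'name': 'Best Buy'}]}}
empty = {'search_metadata': {}}

print(get_online_sellers(ok))
print(get_online_sellers(empty))
```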

Example code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),        # your SerpApi API key
    'engine': 'google_product',             # SerpApi search engine
    'product_id': '14019378181107046593',   # product id
    'offers': True,                         # more offers; can also be set to '1', which is the same as True
    'hl': 'en',                             # language
    'gl': 'us'                              # country of the search, US -> USA
}

search = GoogleSearch(params)               # where data extraction happens on the backend
results = search.get_dict()                 # JSON -> Python dict

online_sellers = results['sellers_results']['online_sellers']

print(json.dumps(online_sellers, indent=2, ensure_ascii=False))

Output:

[
  {
    "position": 1,
    "name": "Best Buy",
    "link": "https://www.google.com/url?q=https://www.bestbuy.com/site/steelseries-aerox-3-2022-edition-lightweight-wired-optical-gaming-mouse-onyx/6485231.p%3FskuId%3D6485231%26ref%3DNS%26loc%3D101&sa=U&ved=0ahUKEwiYt4fxyb_7AhXGFlkFHQZoCLMQ2ykIJA&usg=AOvVaw198AdAmbpUT5YEupYrp_iH",
    "base_price": "$34.99",
    "additional_price": {
      "shipping": "$0.00",
      "tax": "$3.02"
    },
    "total_price": "$38.01"
  },
  ... other sellers
  {
    "position": 38,
    "name": "Network Hardwares",
    "link": "https://www.google.com/url?q=https://www.networkhardwares.com/products/aerox-3-wireless-2022-edition-62611%3Fcurrency%3DUSD%26variant%3D41025510441165%26utm_medium%3Dcpc%26utm_source%3Dgoogle%26utm_campaign%3DGoogle%2520Shopping%26srsltid%3DAYJSbAdn6Cgm7HKsOdgiZ1_T8TK8NyOtSJpq2EC5meylVz982o4QDNcuTfA&sa=U&ved=0ahUKEwiYt4fxyb_7AhXGFlkFHQZoCLMQ2ykI5wE&usg=AOvVaw18MAXohnYThkG5Ip4Igqx-",
    "base_price": "$64.51",
    "additional_price": {
      "shipping": "$0.00",
      "tax": "$5.56"
    },
    "total_price": "$70.07"
  }
]

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞