
Artur Chukhrai for SerpApi

Posted on • Updated on • Originally published at serpapi.com

 

Scrape Google Lens with Python

What will be scraped

wwbs-google-lens

Using Google Lens API from SerpApi

If you don't need an explanation, have a look at the full code example in the online IDE.

from serpapi import GoogleSearch
import json

params = {
    'api_key': '...',
    'engine': 'google_lens',
    'url': 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg',
    'hl': 'en',
}

search = GoogleSearch(params)                   # data extraction on the SerpApi backend
google_lens_results = search.get_dict()         # JSON -> Python dict

del google_lens_results['search_metadata']
del google_lens_results['search_parameters']

print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))

Why use an API?

There are a couple of reasons to use an API, ours in particular:

  • No need to create a parser from scratch and maintain it.
  • No need to bypass blocks from Google by solving CAPTCHAs or IP blocks.
  • No need to pay for proxies and CAPTCHA solvers.
  • No need to use browser automation.

SerpApi handles everything on the backend, with fast response times (under ~4.3 seconds per request) and no browser automation, which makes scraping much faster. Response times and success rates are shown on the SerpApi Status page:


Head to the Google Lens playground for a live and interactive demo.

Preparation

Install library:

pip install google-search-results

google-search-results is SerpApi's Python API package.

Code Explanation

Import libraries:

from serpapi import GoogleSearch
import json
Library purposes:

  • GoogleSearch: scrapes and parses Google results using the SerpApi web scraping library.
  • json: converts extracted data to a JSON object.

The parameters are defined for generating the URL. If you want to pass other parameters to the URL, you can do so using the params dictionary:

params = {
    'api_key': '...',
    'engine': 'google_lens',
    'url': 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg',
    'hl': 'en',
}
Parameters explanation:

  • api_key: defines the SerpApi private key to use. You can find it under your account -> API key.
  • engine: set to google_lens to use the Google Lens API engine.
  • url: defines the URL of an image to perform the Google Lens search on.
  • hl: defines the language to use for the Google Lens search. It's a two-letter language code. Head to the Google languages page for a full list of supported Google languages.

πŸ“ŒNote: You can also add other API Parameters.

Then we create a search object where the data is retrieved from the SerpApi backend. The get_dict() method converts the JSON response into the google_lens_results Python dictionary:

search = GoogleSearch(params)               # data extraction on the SerpApi backend
google_lens_results = search.get_dict()     # JSON -> Python dict

The google_lens_results dictionary, in addition to the necessary data, contains information about the request. The request information is not needed, so we remove the corresponding keys using the del statement:

del google_lens_results['search_metadata']
del google_lens_results['search_parameters']
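If you are not sure a key is present (for example, when reusing this snippet with other engines), dict.pop() with a default removes it without risking a KeyError. A small sketch with a truncated sample response:

```python
# Truncated sample standing in for a real SerpApi response.
google_lens_results = {
    'search_metadata': {'status': 'Success'},
    'search_parameters': {'engine': 'google_lens'},
    'visual_matches': [],
}

# pop() removes the key if present and returns the default otherwise,
# so a missing key never raises a KeyError.
google_lens_results.pop('search_metadata', None)
google_lens_results.pop('search_parameters', None)

print(list(google_lens_results))  # only the data keys remain
```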

After all the data is retrieved, it is output in JSON format:

print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))
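The ensure_ascii=False argument keeps non-ASCII characters (titles in other languages, for instance) readable instead of escaping them. A quick illustration:

```python
import json

data = {'title': 'Café'}

print(json.dumps(data))                      # escaped: {"title": "Caf\u00e9"}
print(json.dumps(data, ensure_ascii=False))  # readable: {"title": "Café"}
```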

Output

{
  "reverse_image_search": {
    "link": "https://www.google.com/search?tbs=sbi:AMhZZiurdULpuTy4_1HSkPv2ZrEBN9afXDH2j7s2drhaSQmdFuOJlf9HaxhrjxEfBrWzj1xi-ZONFSwWi3UlhnMtRXlu68S24Kv5fLuNstTqFQfpUQXGbPBuplF8jDJuvLTDAJow06N44R7keGB1GOU5fRzsc4rirzA"
  },
  "knowledge_graph": [
    {
      "title": "Black cat",
      "link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&hl=en&gl=US",
      "more_images": {
        "link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&ved=0EOTpBwgAKAAwAA&source=.lens.button&tbm=isch&hl=en&gl=US",
        "serpapi_link": "https://serpapi.com/search.json?device=desktop&engine=google&gl=US&google_domain=google.com&hl=en&q=Black+cat&tbm=isch"
      },
      "thumbnail": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
      "images": [
        {
          "title": "Image #1 for Black cat",
          "source": "https://vbspca.com/tag/stigma/",
          "link": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
          "size": {
            "width": 293,
            "height": 172
          }
        },
        ... other images
      ]
    },
    ... other knowledge graph results
  ],
  "visual_matches": [
    {
      "position": 1,
      "title": "Pet Talk: Smoke can create problems quickly for your cat | VailDaily.com",
      "link": "https://www.vaildaily.com/opinion/pet-talk-smoke-can-create-problems-quickly-for-your-cat/",
      "source": "vaildaily.com",
      "source_icon": "https://encrypted-tbn0.gstatic.com/favicon-tbn?q=tbn:ANd9GcSXpzpJuQgYt20Jd-moiGdOr6HoDpS-WQ_vjcfrNvtLJy_gjDrYJIs3abOVeBb7g24x5kLNBg2T-KGdiQ_NkFkcBjt2s7exhkQg46swp-DMTF3S1_lemg",
      "thumbnail": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQCjR3dx5H8xz9fSevbe6JqPtBlakSxJwrECbaMS64UcP05CwC4"
    },
    ... other visual matches results
  ]
}
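Once parsed, the response is an ordinary nested dictionary, so individual fields can be read with normal indexing. A sketch with a truncated stand-in for the response above (the values are placeholders):

```python
# Truncated stand-in for the response shown above.
google_lens_results = {
    'visual_matches': [
        {'position': 1, 'title': 'Pet Talk: ...', 'source': 'vaildaily.com'},
    ],
}

# Collect (position, source) pairs from every visual match.
matches = [
    (m['position'], m['source'])
    for m in google_lens_results['visual_matches']
]
print(matches)
```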

DIY solution

This section shows the comparison between our solution and the DIY solution.

When you open a regular Google Lens link, it redirects to another link. The GIF below shows this:

google-lens

The data is correspondingly different, and there is no way to extract it without reverse engineering. For simplicity, the DIY solution uses playwright, which helps extract data from the modified link.

The data extraction itself is done with selectolax because it ships the Lexbor parser, which is incredibly fast. In terms of syntax, it is very similar to both bs4 and parsel, making it easy to use. Please note that selectolax does not currently support XPath.

Example code to integrate:

from playwright.sync_api import sync_playwright
from selectolax.lexbor import LexborHTMLParser
import json


def run(playwright):
    image_url = 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg'

    page = playwright.chromium.launch(headless=True).new_page()
    page.goto(f'https://lens.google.com/uploadbyurl?url={image_url}&hl=en')

    parser = LexborHTMLParser(page.content())
    page.close()

    reverse_image_search = {
        'link': parser.root.css_first('.kuwdsf .VfPpkd-RLmnJb').attributes['href']
    }

    knowledge_graph = {
        'title': parser.root.css_first('.DeMn2d').text(),
        'subtitle': parser.root.css_first('.XNTym').text() if parser.root.css_first('.XNTym') else None,
        'link': parser.root.css_first('.OCDsub .VfPpkd-RLmnJb').attributes['href'],
        'more_images': parser.root.css_first('[aria-label="More Images"]').attributes['href'],
        'thumbnail': parser.root.css_first('.oLfv5c .FH8DCc').attributes['src'],
        'images': [
            {
                'title': image.attributes['aria-label'],
                'source': image.attributes['href'],
                'link': image.css_first('.wETe9b').attributes['src']
            }
            for image in parser.root.css('.Y02Gld a')
        ]
    }

    visual_matches = [
        {
            'title': result.css_first('.UAiK1e').text(),
            'link': result.css_first('.GZrdsf').attributes['href'],
            'source': result.css_first('.fjbPGe').text(),
            'source_icon': result.css_first('.KRdrw').attributes['src'],
            'thumbnail': result.css_first('.jFVN1').attributes['src']
        }
        for result in parser.root.css('.xuQ19b')
    ]

    google_lens_results = {
        'reverse_image_search': reverse_image_search,
        'knowledge_graph': knowledge_graph,
        'visual_matches': visual_matches
    }

    print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))


with sync_playwright() as playwright:
    run(playwright)

πŸ“ŒNote: In the online IDE this code does not work because the Replit does not support the playwright. You can do all the manipulations described below to check how the DIY solution works.

Preparation

Install libraries:

pip install playwright selectolax

Install the required browser:

playwright install chromium

Code Explanation

Import libraries:

from playwright.sync_api import sync_playwright
from selectolax.lexbor import LexborHTMLParser
import json
Library purposes:

  • sync_playwright: provides the synchronous API; playwright also has an asynchronous API via the asyncio module.
  • LexborHTMLParser: a fast HTML5 parser with CSS selectors, built on the Lexbor engine.
  • json: converts extracted data to a JSON object.

Declare a function:

def run(playwright):
    # further code ...

The image_url variable is defined, which contains the URL of the image:

image_url = 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg'

Initialize playwright, connect to chromium, launch() a browser, open a new_page() and goto() the given URL:

page = playwright.chromium.launch(headless=True).new_page()
page.goto(f'https://lens.google.com/uploadbyurl?url={image_url}&hl=en')
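One caveat with the f-string above: if the image URL itself contains characters like & or spaces, they should be percent-encoded first. A defensive sketch using the standard library (the image URL here is a made-up example):

```python
from urllib.parse import urlencode

image_url = 'https://example.com/cat photo.jpg'  # note the space

# urlencode() percent-encodes the whole query string, so special
# characters inside the image URL cannot break the request.
query = urlencode({'url': image_url, 'hl': 'en'})
lens_url = f'https://lens.google.com/uploadbyurl?{query}'
print(lens_url)
```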
Parameters explanation:

  • playwright.chromium: a connection to the Chromium browser instance.
  • launch(): launches the browser; the headless argument runs it without a visible window (the default is True).
  • new_page(): creates a new page in a new browser context.
  • page.goto(): makes a request to the provided website.

After the page has loaded, pass HTML content to Lexbor and close the browser:

parser = LexborHTMLParser(page.content())
page.close()

The first thing to extract is the reverse image search link. To do this, you need to pass the .kuwdsf .VfPpkd-RLmnJb selector that is responsible for this element to the css_first() method. Then extract the value of the href attribute from attributes:

reverse_image_search = {
    'link': parser.root.css_first('.kuwdsf .VfPpkd-RLmnJb').attributes['href']
}

The algorithm for extracting data from the knowledge graph works similarly. There is a difference in extracting title and subtitle. For them, the text content is retrieved, so the corresponding text() method is used. Sometimes there may not be a subtitle, so a ternary expression is used for such cases:

knowledge_graph = {
    'title': parser.root.css_first('.DeMn2d').text(),
    'subtitle': parser.root.css_first('.XNTym').text() if parser.root.css_first('.XNTym') else None,
    'link': parser.root.css_first('.OCDsub .VfPpkd-RLmnJb').attributes['href'],
    'more_images': parser.root.css_first('[aria-label="More Images"]').attributes['href'],
    'thumbnail': parser.root.css_first('.oLfv5c .FH8DCc').attributes['src'],
    'images': [
        {
            'title': image.attributes['aria-label'],
            'source': image.attributes['href'],
            'link': image.css_first('.wETe9b').attributes['src']
        }
        for image in parser.root.css('.Y02Gld a')
    ]
}
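The ternary pattern above can be factored into a small helper so every optional field is handled the same way. This helper is not part of the original code; it works with any object exposing a text() method, demonstrated here with a stub instead of a real selectolax node:

```python
def text_or_none(node):
    """Return the node's text content, or None if the selector matched nothing."""
    return node.text() if node else None

# Stub standing in for a selectolax node, for demonstration only.
class FakeNode:
    def text(self):
        return 'Black cat'

print(text_or_none(FakeNode()))  # -> Black cat
print(text_or_none(None))        # -> None
```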

For both knowledge graph images and visual matches, list comprehensions are used to provide a concise way to create lists. To find multiple elements and iterate over them, the css() method is used:

visual_matches = [
    {
        'title': result.css_first('.UAiK1e').text(),
        'link': result.css_first('.GZrdsf').attributes['href'],
        'source': result.css_first('.fjbPGe').text(),
        'source_icon': result.css_first('.KRdrw').attributes['src'],
        'thumbnail': result.css_first('.jFVN1').attributes['src']
    }
    for result in parser.root.css('.xuQ19b')
]
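The SerpApi output shown earlier includes a position field for each visual match; if you want the DIY output to line up, enumerate() can number the results. A sketch with dummy data:

```python
# Dummy matches standing in for the dicts built by the list comprehension.
raw_matches = [
    {'title': 'First match'},
    {'title': 'Second match'},
]

# enumerate(..., start=1) numbers results the way SerpApi does.
visual_matches = [
    {'position': position, **match}
    for position, match in enumerate(raw_matches, start=1)
]

print([m['position'] for m in visual_matches])  # -> [1, 2]
```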

The google_lens_results dictionary is created and previously extracted data is added to the corresponding keys:

google_lens_results = {
    'reverse_image_search': reverse_image_search,
    'knowledge_graph': knowledge_graph,
    'visual_matches': visual_matches
}

After all the data is retrieved, it is output in JSON format:

print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))

Run your code using a context manager:

with sync_playwright() as playwright:
    run(playwright)

Output

{
  "reverse_image_search": {
    "link": "https://www.google.com/search?tbs=sbi:AMhZZivbhNZ5ZFwCBpcEUAlEHVFDQnaZIC-4PcD5za7g6xuScvksUbf8osCVDaAg70m3b2eMkaodmPSm_1PiNZgCOEV5wma9PX1piaCV3GtLReFcsjRlP7On4aF3HUJAyPinMnEYGIATNPvQ7PLMoMZlmUXj4uQ1xHw"
  },
  "knowledge_graph": {
    "title": "Black cat",
    "subtitle": null,
    "link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&hl=en&gl=US",
    "more_images": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&ved=0EOTpBwgAKAAwAA&source=.lens.button&tbm=isch&hl=en&gl=US",
    "thumbnail": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
    "images": [
      {
        "title": "Image #1 for Black cat",
        "source": "https://vbspca.com/tag/stigma/",
        "link": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w"
      },
      ... other images
    ]
  },
  "visual_matches": [
    {
      "title": "Pet Talk: Smoke can create problems quickly for your cat | VailDaily.com",
      "link": "https://www.vaildaily.com/opinion/pet-talk-smoke-can-create-problems-quickly-for-your-cat/",
      "source": "vaildaily.com",
      "source_icon": "https://encrypted-tbn0.gstatic.com/favicon-tbn?q=tbn:ANd9GcSXpzpJuQgYt20Jd-moiGdOr6HoDpS-WQ_vjcfrNvtLJy_gjDrYJIs3abOVeBb7g24x5kLNBg2T-KGdiQ_NkFkcBjt2s7exhkQg46swp-DMTF3S1_lemg",
      "thumbnail": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQCjR3dx5H8xz9fSevbe6JqPtBlakSxJwrECbaMS64UcP05CwC4"
    },
    ... other visual matches results
  ]
}

Join us on Twitter | YouTube

Add a Feature RequestπŸ’« or a Bug🐞
