
Artur Chukhrai for SerpApi

Posted on • Updated on • Originally published at serpapi.com

 

Scrape Google Lens with Python

What will be scraped

wwbs-google-lens

Using Google Lens API from SerpApi

If you don't need an explanation, have a look at the full code example in the online IDE.

from serpapi import GoogleSearch
import json

params = {
    'api_key': '...',
    'engine': 'google_lens',
    'url': 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg',
    'hl': 'en',
}

search = GoogleSearch(params)                   # data extraction on the SerpApi backend
google_lens_results = search.get_dict()         # JSON -> Python dict

del google_lens_results['search_metadata']
del google_lens_results['search_parameters']

print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))

Why use an API?

There are a couple of reasons to use an API, ours in particular:

  • No need to create a parser from scratch and maintain it.
  • No need to bypass blocks from Google by solving CAPTCHAs or IP blocks.
  • No need to pay for proxies and CAPTCHA solvers.
  • No need to use browser automation.

SerpApi handles everything on the backend, with fast response times (under ~4.3 seconds per request) and no browser automation, which makes scraping much faster. Response times and success rates are shown on the SerpApi Status page:


Head to the Google Lens playground for a live and interactive demo.

Preparation

Install library:

pip install google-search-results

google-search-results is SerpApi's Python API package.

Code Explanation

Import libraries:

from serpapi import GoogleSearch
import json
Library purposes:

  • GoogleSearch: scrapes and parses Google results using the SerpApi web scraping library.
  • json: converts extracted data to a JSON object.

The parameters are defined for generating the URL. If you want to pass other parameters to the URL, you can do so using the params dictionary:

params = {
    'api_key': '...',
    'engine': 'google_lens',
    'url': 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg',
    'hl': 'en',
}
Parameters explanation:

  • api_key: defines the SerpApi private key to use. You can find it under your account -> API key.
  • engine: set to google_lens to use the Google Lens API engine.
  • url: defines the URL of an image to perform the Google Lens search on.
  • hl: defines the language to use for the Google Lens search. It's a two-letter language code. Head to the Google languages page for a full list of supported Google languages.

πŸ“ŒNote: You can also add other API Parameters.

Then we create a search object where the data is retrieved from the SerpApi backend. The get_dict() method converts the JSON response into the google_lens_results Python dictionary:

search = GoogleSearch(params)               # data extraction on the SerpApi backend
google_lens_results = search.get_dict()     # JSON -> Python dict

The google_lens_results dictionary, in addition to the necessary data, contains information about the request. The request information is not needed, so we remove the corresponding keys using the del statement:

del google_lens_results['search_metadata']
del google_lens_results['search_parameters']
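If you are not sure a key is present (for example, when reusing this snippet with other engines), dict.pop() with a default removes it without risking a KeyError. A small sketch with a truncated sample response:

```python
# Truncated sample standing in for a real SerpApi response.
google_lens_results = {
    'search_metadata': {'status': 'Success'},
    'search_parameters': {'engine': 'google_lens'},
    'visual_matches': [],
}

# pop() removes the key if present and returns the default otherwise,
# so a missing key never raises a KeyError.
google_lens_results.pop('search_metadata', None)
google_lens_results.pop('search_parameters', None)

print(list(google_lens_results))  # only the data keys remain
```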

After all the data is retrieved, it is output in JSON format:

print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))
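The ensure_ascii=False argument keeps non-ASCII characters (titles in other languages, for instance) readable instead of escaping them. A quick illustration:

```python
import json

data = {'title': 'Café'}

print(json.dumps(data))                      # escaped: {"title": "Caf\u00e9"}
print(json.dumps(data, ensure_ascii=False))  # readable: {"title": "Café"}
```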

Output

{
  "reverse_image_search": {
    "link": "https://www.google.com/search?tbs=sbi:AMhZZiurdULpuTy4_1HSkPv2ZrEBN9afXDH2j7s2drhaSQmdFuOJlf9HaxhrjxEfBrWzj1xi-ZONFSwWi3UlhnMtRXlu68S24Kv5fLuNstTqFQfpUQXGbPBuplF8jDJuvLTDAJow06N44R7keGB1GOU5fRzsc4rirzA"
  },
  "knowledge_graph": [
    {
      "title": "Black cat",
      "link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&hl=en&gl=US",
      "more_images": {
        "link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&ved=0EOTpBwgAKAAwAA&source=.lens.button&tbm=isch&hl=en&gl=US",
        "serpapi_link": "https://serpapi.com/search.json?device=desktop&engine=google&gl=US&google_domain=google.com&hl=en&q=Black+cat&tbm=isch"
      },
      "thumbnail": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
      "images": [
        {
          "title": "Image #1 for Black cat",
          "source": "https://vbspca.com/tag/stigma/",
          "link": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
          "size": {
            "width": 293,
            "height": 172
          }
        },
        ... other images
      ]
    },
    ... other knowledge graph results
  ],
  "visual_matches": [
    {
      "position": 1,
      "title": "Pet Talk: Smoke can create problems quickly for your cat | VailDaily.com",
      "link": "https://www.vaildaily.com/opinion/pet-talk-smoke-can-create-problems-quickly-for-your-cat/",
      "source": "vaildaily.com",
      "source_icon": "https://encrypted-tbn0.gstatic.com/favicon-tbn?q=tbn:ANd9GcSXpzpJuQgYt20Jd-moiGdOr6HoDpS-WQ_vjcfrNvtLJy_gjDrYJIs3abOVeBb7g24x5kLNBg2T-KGdiQ_NkFkcBjt2s7exhkQg46swp-DMTF3S1_lemg",
      "thumbnail": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQCjR3dx5H8xz9fSevbe6JqPtBlakSxJwrECbaMS64UcP05CwC4"
    },
    ... other visual matches results
  ]
}
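Once parsed, the response is an ordinary nested dictionary, so individual fields can be read with normal indexing. A sketch with a truncated stand-in for the response above (the values are placeholders):

```python
# Truncated stand-in for the response shown above.
google_lens_results = {
    'visual_matches': [
        {'position': 1, 'title': 'Pet Talk: ...', 'source': 'vaildaily.com'},
    ],
}

# Collect (position, source) pairs from every visual match.
matches = [
    (m['position'], m['source'])
    for m in google_lens_results['visual_matches']
]
print(matches)
```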

DIY solution

This section shows the comparison between our solution and the DIY solution.

When you open a regular Google Lens link, it redirects to another link. The GIF below shows this:

google-lens

The data is correspondingly different, and there is no way to extract it without reverse engineering. For simplicity, the DIY solution uses playwright, which helps extract data from the modified link.

The data extraction itself is done with selectolax because it ships the Lexbor parser, which is incredibly fast. In terms of syntax, it is very similar to both bs4 and parsel, making it easy to use. Please note that selectolax does not currently support XPath.

Example code to integrate:

from playwright.sync_api import sync_playwright
from selectolax.lexbor import LexborHTMLParser
import json


def run(playwright):
    image_url = 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg'

    page = playwright.chromium.launch(headless=True).new_page()
    page.goto(f'https://lens.google.com/uploadbyurl?url={image_url}&hl=en')

    parser = LexborHTMLParser(page.content())
    page.close()

    reverse_image_search = {
        'link': parser.root.css_first('.kuwdsf .VfPpkd-RLmnJb').attributes['href']
    }

    knowledge_graph = {
        'title': parser.root.css_first('.DeMn2d').text(),
        'subtitle': parser.root.css_first('.XNTym').text() if parser.root.css_first('.XNTym') else None,
        'link': parser.root.css_first('.OCDsub .VfPpkd-RLmnJb').attributes['href'],
        'more_images': parser.root.css_first('[aria-label="More Images"]').attributes['href'],
        'thumbnail': parser.root.css_first('.oLfv5c .FH8DCc').attributes['src'],
        'images': [
            {
                'title': image.attributes['aria-label'],
                'source': image.attributes['href'],
                'link': image.css_first('.wETe9b').attributes['src']
            }
            for image in parser.root.css('.Y02Gld a')
        ]
    }

    visual_matches = [
        {
            'title': result.css_first('.UAiK1e').text(),
            'link': result.css_first('.GZrdsf').attributes['href'],
            'source': result.css_first('.fjbPGe').text(),
            'source_icon': result.css_first('.KRdrw').attributes['src'],
            'thumbnail': result.css_first('.jFVN1').attributes['src']
        }
        for result in parser.root.css('.xuQ19b')
    ]

    google_lens_results = {
        'reverse_image_search': reverse_image_search,
        'knowledge_graph': knowledge_graph,
        'visual_matches': visual_matches
    }

    print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))


with sync_playwright() as playwright:
    run(playwright)

πŸ“ŒNote: In the online IDE this code does not work because the Replit does not support the playwright. You can do all the manipulations described below to check how the DIY solution works.

Preparation

Install libraries:

pip install playwright selectolax

Install the required browser:

playwright install chromium

Code Explanation

Import libraries:

from playwright.sync_api import sync_playwright
from selectolax.lexbor import LexborHTMLParser
import json
Library purposes:

  • sync_playwright: provides the synchronous API; playwright also has an asynchronous API via the asyncio module.
  • LexborHTMLParser: a fast HTML5 parser with CSS selectors, built on the Lexbor engine.
  • json: converts extracted data to a JSON object.

Declare a function:

def run(playwright):
    # further code ...

The image_url variable is defined, which contains the URL of the image:

image_url = 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg'

Initialize playwright, connect to chromium, launch() a browser, open a new_page() and goto() the given URL:

page = playwright.chromium.launch(headless=True).new_page()
page.goto(f'https://lens.google.com/uploadbyurl?url={image_url}&hl=en')
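One caveat with the f-string above: if the image URL itself contains characters like & or spaces, they should be percent-encoded first. A defensive sketch using the standard library (the image URL here is a made-up example):

```python
from urllib.parse import urlencode

image_url = 'https://example.com/cat photo.jpg'  # note the space

# urlencode() percent-encodes the whole query string, so special
# characters inside the image URL cannot break the request.
query = urlencode({'url': image_url, 'hl': 'en'})
lens_url = f'https://lens.google.com/uploadbyurl?{query}'
print(lens_url)
```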
Parameters explanation:

  • playwright.chromium: a connection to the Chromium browser instance.
  • launch(): launches the browser; the headless argument runs it without a visible window (the default is True).
  • new_page(): creates a new page in a new browser context.
  • page.goto(): makes a request to the provided website.

After the page has loaded, pass HTML content to Lexbor and close the browser:

parser = LexborHTMLParser(page.content())
page.close()

The first thing to extract is the reverse image search link. To do this, you need to pass the .kuwdsf .VfPpkd-RLmnJb selector that is responsible for this element to the css_first() method. Then extract the value of the href attribute from attributes:

reverse_image_search = {
    'link': parser.root.css_first('.kuwdsf .VfPpkd-RLmnJb').attributes['href']
}

The algorithm for extracting data from the knowledge graph works similarly. There is a difference in extracting title and subtitle. For them, the text content is retrieved, so the corresponding text() method is used. Sometimes there may not be a subtitle, so a ternary expression is used for such cases:

knowledge_graph = {
    'title': parser.root.css_first('.DeMn2d').text(),
    'subtitle': parser.root.css_first('.XNTym').text() if parser.root.css_first('.XNTym') else None,
    'link': parser.root.css_first('.OCDsub .VfPpkd-RLmnJb').attributes['href'],
    'more_images': parser.root.css_first('[aria-label="More Images"]').attributes['href'],
    'thumbnail': parser.root.css_first('.oLfv5c .FH8DCc').attributes['src'],
    'images': [
        {
            'title': image.attributes['aria-label'],
            'source': image.attributes['href'],
            'link': image.css_first('.wETe9b').attributes['src']
        }
        for image in parser.root.css('.Y02Gld a')
    ]
}
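The ternary pattern above can be factored into a small helper so every optional field is handled the same way. This helper is not part of the original code; it works with any object exposing a text() method, demonstrated here with a stub instead of a real selectolax node:

```python
def text_or_none(node):
    """Return the node's text content, or None if the selector matched nothing."""
    return node.text() if node else None

# Stub standing in for a selectolax node, for demonstration only.
class FakeNode:
    def text(self):
        return 'Black cat'

print(text_or_none(FakeNode()))  # -> Black cat
print(text_or_none(None))        # -> None
```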

For both knowledge graph images and visual matches, list comprehensions are used to provide a concise way to create lists. To find multiple elements and iterate over them, the css() method is used:

visual_matches = [
    {
        'title': result.css_first('.UAiK1e').text(),
        'link': result.css_first('.GZrdsf').attributes['href'],
        'source': result.css_first('.fjbPGe').text(),
        'source_icon': result.css_first('.KRdrw').attributes['src'],
        'thumbnail': result.css_first('.jFVN1').attributes['src']
    }
    for result in parser.root.css('.xuQ19b')
]
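The SerpApi output shown earlier includes a position field for each visual match; if you want the DIY output to line up, enumerate() can number the results. A sketch with dummy data:

```python
# Dummy matches standing in for the dicts built by the list comprehension.
raw_matches = [
    {'title': 'First match'},
    {'title': 'Second match'},
]

# enumerate(..., start=1) numbers results the way SerpApi does.
visual_matches = [
    {'position': position, **match}
    for position, match in enumerate(raw_matches, start=1)
]

print([m['position'] for m in visual_matches])  # -> [1, 2]
```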

The google_lens_results dictionary is created and previously extracted data is added to the corresponding keys:

google_lens_results = {
    'reverse_image_search': reverse_image_search,
    'knowledge_graph': knowledge_graph,
    'visual_matches': visual_matches
}

After all the data is retrieved, it is output in JSON format:

print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))

Run your code using a context manager:

with sync_playwright() as playwright:
    run(playwright)

Output

{
  "reverse_image_search": {
    "link": "https://www.google.com/search?tbs=sbi:AMhZZivbhNZ5ZFwCBpcEUAlEHVFDQnaZIC-4PcD5za7g6xuScvksUbf8osCVDaAg70m3b2eMkaodmPSm_1PiNZgCOEV5wma9PX1piaCV3GtLReFcsjRlP7On4aF3HUJAyPinMnEYGIATNPvQ7PLMoMZlmUXj4uQ1xHw"
  },
  "knowledge_graph": {
    "title": "Black cat",
    "subtitle": null,
    "link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&hl=en&gl=US",
    "more_images": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&ved=0EOTpBwgAKAAwAA&source=.lens.button&tbm=isch&hl=en&gl=US",
    "thumbnail": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
    "images": [
      {
        "title": "Image #1 for Black cat",
        "source": "https://vbspca.com/tag/stigma/",
        "link": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w"
      },
      ... other images
    ]
  },
  "visual_matches": [
    {
      "title": "Pet Talk: Smoke can create problems quickly for your cat | VailDaily.com",
      "link": "https://www.vaildaily.com/opinion/pet-talk-smoke-can-create-problems-quickly-for-your-cat/",
      "source": "vaildaily.com",
      "source_icon": "https://encrypted-tbn0.gstatic.com/favicon-tbn?q=tbn:ANd9GcSXpzpJuQgYt20Jd-moiGdOr6HoDpS-WQ_vjcfrNvtLJy_gjDrYJIs3abOVeBb7g24x5kLNBg2T-KGdiQ_NkFkcBjt2s7exhkQg46swp-DMTF3S1_lemg",
      "thumbnail": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQCjR3dx5H8xz9fSevbe6JqPtBlakSxJwrECbaMS64UcP05CwC4"
    },
    ... other visual matches results
  ]
}

Join us on Twitter | YouTube

Add a Feature RequestπŸ’« or a Bug🐞
