kayYOLO


Scrape multiple images on the web


This article shows how to scrape multiple images from a web page. The basic requirement is to download every image on the page into a local folder. An additional requirement is to save each image under its title, so that the files are easy to manage or process later.

Some download tools can save all images from a web page into a folder, but they mostly name the files with IDs or random strings that are hard to understand. I therefore implement this scenario with the Python module Clicknium, which is easy to get started with and works well for capturing a list of similar elements.

Let's look at the web page that contains the image list. Each item contains an image, a title, and a price. The expected result is a folder containing all the images, each named after its title.


We will cover the scraping in 3 parts as below:

  • Development tool preparation
  • Capture locator for the image
  • Write automation code

Development tool preparation

  • Install Visual Studio Code and the Clicknium extension.
  • Follow the quick start document in the Clicknium extension to complete the setup.

Capture locator for the image

After setting up the development environment, open an empty folder in VSCode and create a new .py file.

  • Start capturing the locator by clicking the capture button or pressing "Ctrl+F10".

  • Once the "Clicknium Recorder" is invoked, click the "Similar elements" button in the recorder to capture a locator for the image list.

  • After you click the button, a wizard pops up and guides you through generating a locator that matches all expected images:

    • Hover the mouse cursor over an element and add it as the first target by pressing "Ctrl+Click". It can be any image in the list.
    • Once the element is added, the wizard shows how many similar elements the current locator matches. With only one element added, it matches just that element, so capture another image from the list to match more.
    • After adding 3 images to the wizard, the auto-generated locator matches 21 elements.
    • The web page has 22 images in total, so keep adding image elements until the auto-generated locator matches all 22. (If the matched number is not what you expect, you can always add more elements.)
    • Click the "Save" button to complete the wizard.

After capturing the locator, we can open it in Visual Studio Code to see its details. The properties can be updated manually if the locator can be optimized further.

From the locator editor panel, we can also click the "Validate" button to make sure that all 22 matched elements are the expected ones. After clicking "Validate", a wizard lets us locate the target elements one by one. If any of them is incorrect, we may

  • recapture the locator by going through the wizard again
  • or manually modify the locator in the locator editor panel.
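
The same count check can also be repeated in code before any download starts. The following is a minimal sketch, assuming the elements have already been fetched into a list (for example with `tab.find_elements(...)`, as in the script later in this article). `check_match_count` is a hypothetical helper name, not part of Clicknium, and the expected count of 22 is specific to this demo page.

```python
def check_match_count(elements, expected):
    """Return True when the locator matched exactly `expected` elements.

    `elements` can be any sequence, such as the list returned by
    tab.find_elements(...). Hypothetical helper, not a Clicknium API.
    """
    found = len(elements)
    if found != expected:
        print(f"Locator matched {found} elements, expected {expected}; "
              "recapture or edit the locator.")
        return False
    return True


# Stand-in data instead of real UI elements, so the sketch runs on its own
print(check_match_count(["img"] * 21, 22))  # locator still too narrow
print(check_match_count(["img"] * 22, 22))  # locator matches every image
```

Running such a check first makes a too-narrow locator visible immediately, instead of showing up later as missing files.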

Capture the image titles in the same way, which gives a second locator for the title elements.

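
Because the images and the titles come from two separate locators, the script in the next section pairs them by position, which only works when both lists have the same length. The sketch below shows one way to make that assumption explicit. `pair_images_with_titles` is a hypothetical helper, and the sample URLs and titles are made up for illustration.

```python
def pair_images_with_titles(srcs, titles):
    """Pair each image URL with its title, refusing mismatched lists."""
    if len(srcs) != len(titles):
        raise ValueError(
            f"{len(srcs)} images but {len(titles)} titles; "
            "the two locators do not match the same items."
        )
    return list(zip(srcs, titles))


# Made-up sample values standing in for get_property("src") and get_text()
srcs = ["//cdn.example.com/a.png", "//cdn.example.com/b.png"]
titles = ["Jacket", "Hoodie"]

for src, title in pair_images_with_titles(srcs, titles):
    print(title, "->", src)
```

Failing early here is a design choice: a length mismatch usually means one of the locators needs to be recaptured, and continuing would save images under the wrong titles.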

Write automation code

With the locators in place, we can write code that does the following:

  • Get the images and titles
  • Download each image and save it with its title as the file name

import os
import shutil

import requests
from clicknium import clicknium as cc, locator

# Attach to the already opened browser tab (the URL is a fake site)
tab = cc.edge.attach_by_title_url(url="https://gallerydemo.com/pages/outerwear")

# Get the image and title elements matched by the captured locators
imgs = tab.find_elements(locator.msedge.gallerydept.img_out)
titles = tab.find_elements(locator.msedge.gallerydept.span_out)

# Make sure the target folder exists before writing to it
folder = "c:\\test\\gallery"
os.makedirs(folder, exist_ok=True)

# Iterate over every image element together with its title
for x, (img, title) in enumerate(zip(imgs, titles)):
    src = img.get_property("src")
    tstr = title.get_text()

    # Prepend the scheme when src is protocol-relative ("//...")
    url = "https:" + src if src.startswith("//") else src

    # Download the image and save it to the folder with the title as its name
    res = requests.get(url, stream=True)
    if res.status_code == 200:
        file = os.path.join(folder, tstr + ".png")
        # Use a different name if the title is duplicated
        if os.path.exists(file):
            file = os.path.join(folder, tstr + str(x) + ".png")
        # Let urllib3 undo any gzip/deflate encoding before copying raw bytes
        res.raw.decode_content = True
        with open(file, "wb") as f:
            shutil.copyfileobj(res.raw, f)
        print("Image successfully downloaded:", tstr)
    else:
        print("Image couldn't be retrieved:", tstr)
  • The complete code can be found on GitHub.
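
The script builds the download URL by prepending "https:" to the src value, which fits a protocol-relative src such as "//host/path". A slightly more general sketch, using only the standard library, resolves whatever form src takes against the page URL. `absolute_image_url` is a hypothetical helper name, and the `cdn.example.com` addresses are made-up sample values.

```python
from urllib.parse import urljoin

# Page URL taken from the script above (a fake demo site)
PAGE_URL = "https://gallerydemo.com/pages/outerwear"


def absolute_image_url(src, page_url=PAGE_URL):
    """Resolve an image src against the page URL.

    Handles protocol-relative ("//host/a.png"), root-relative ("/a.png"),
    and already absolute ("https://host/a.png") values.
    """
    return urljoin(page_url, src)


for sample in ("//cdn.example.com/img/a.png",
               "/img/b.png",
               "https://cdn.example.com/img/c.png"):
    print(absolute_image_url(sample))
```

With this approach, the result can be passed directly to `requests.get` without caring which form the page happens to use.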

After the script runs, the images are saved in the folder c:\test\gallery, each named after its title as shown on the web page.

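
Using titles as file names can fail when a title contains characters that Windows does not allow in file names, such as ":" or "/". The sketch below shows one possible way to clean a title and to pick an unused file name when titles repeat. `safe_filename` and `unique_path` are hypothetical helpers, not part of the original script, and the sample title is made up.

```python
import os
import re
import tempfile

# Characters that Windows does not allow in file names
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, fallback="untitled"):
    """Replace invalid characters and trim trailing dots and spaces."""
    cleaned = _INVALID.sub("_", title).strip(" .")
    return cleaned or fallback


def unique_path(folder, name, ext=".png"):
    """Return a path in `folder` that does not exist yet."""
    candidate = os.path.join(folder, name + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{name}_{counter}{ext}")
        counter += 1
    return candidate


# Demo in a temporary folder so nothing is written to c:\test\gallery
with tempfile.TemporaryDirectory() as folder:
    name = safe_filename('Coat: "Winter" 50/50')
    first = unique_path(folder, name)
    open(first, "wb").close()  # pretend the first download was saved
    second = unique_path(folder, name)
    print(os.path.basename(first))
    print(os.path.basename(second))
```

Compared with appending the loop index, the counter keeps working when the same title appears more than twice.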

Conclusion

In this article, I demonstrated how to scrape images from the web. With Clicknium's "Similar elements" function, it is easy to locate the images with a few mouse clicks and then write simple code against the generated locator.

The important part is capturing the similar elements: the more elements you add, the more accurate the auto-generated locator becomes. A good practice is to add elements from different locations, such as different columns and rows, so that the wizard has broader coverage from which to generate a correct locator.

Check the Clicknium documentation for more details.
