In these tough times, each of us should share knowledge and collaborate. I was trying to build a dataset of people wearing masks and people without masks, and I have collected a small amount of data so far. Here I am sharing how you can scrape Google Images to do the same. Here is my video explaining the concept.
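First of all, we need Selenium and a WebDriver, e.g. ChromeDriver for Chrome/Chromium. Note that the code below uses the find_element_by_* helper methods, which were removed in Selenium 4.3; on newer versions the equivalent calls are find_element(By.NAME, ...), find_element(By.LINK_TEXT, ...) and find_elements(By.XPATH, ...).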
Here is the code:
    import os
    import time
    import urllib.request

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Chrome("C:\\Users\\Sourabh\\chromedriver.exe")
    driver.get('https://www.google.com/')  # opens up Google
    search = driver.find_element_by_name('q')  # 'q' is the name of the search box
    search.send_keys('people wearing mask', Keys.ENTER)
Now we need to go to the Images section:
    elem = driver.find_element_by_link_text('Images')
    elem.get_attribute('href')
    elem.click()

    for i in range(50):  # scrolls the page 50 times
        # scroll down by 100 pixels so that more thumbnails get loaded
        driver.execute_script('window.scrollBy(0, 100);')
        time.sleep(4)  # give the new images time to load
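As an alternative to a fixed number of scrolls, here is a rough sketch that keeps scrolling until the page height stops growing. The function name scroll_to_bottom and the parameters pause and max_rounds are illustrative names, not part of the code above, and the sketch assumes the page simply loads more thumbnails on scroll (it does not click any "show more results" button).

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    `driver` is any object with Selenium's execute_script(script) method.
    Returns the number of scroll rounds that were performed.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page.
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # wait for new thumbnails to load
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was loaded, so stop
        last_height = new_height
    return rounds
```

It would be called as scroll_to_bottom(driver) right after clicking the Images link. The max_rounds cap is there so the loop cannot run forever on an endlessly growing page.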
Now we need to find the class or id of the img tag so that we can read its src attribute. As of now, the img tags in Google Images carry three classes. Keep in mind that Google changes them periodically, so this may stop working after a few weeks.
    elements = driver.find_elements_by_xpath(
        '//img[contains(@class,"rg_i") and contains(@class, "Q4LuWd") and contains(@class, "tx8vtf")]'
    )

    try:
        os.mkdir('peoplewithmask')
    except FileExistsError:
        pass
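Since the class names are the part most likely to break, one option is to keep them in a list and build the XPath from it, so only one line needs updating when Google changes them. This is just a small sketch; img_xpath and IMG_CLASSES are hypothetical names introduced here, and the class values are simply the ones quoted above.

```python
def img_xpath(class_names):
    """Build an XPath that matches <img> tags carrying all the given classes."""
    if not class_names:
        return '//img'  # no filter: match every img tag
    conditions = ' and '.join(
        'contains(@class, "{}")'.format(name) for name in class_names
    )
    return '//img[{}]'.format(conditions)


# Class names as observed at the time of writing; update these when they change.
IMG_CLASSES = ['rg_i', 'Q4LuWd', 'tx8vtf']
XPATH = img_xpath(IMG_CLASSES)
```

The resulting XPATH string can be passed to the same find-elements call used above.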
Finally, we need to retrieve the image URLs and download the images:
    count = 0
    for i in elements:
        src = i.get_attribute('src')
        if src is None:  # some img tags have no src yet; skip them
            continue
        count += 1
        # save into the 'peoplewithmask' folder created above
        urllib.request.urlretrieve(str(src), os.path.join('peoplewithmask', 'image' + str(count) + '.jpg'))
        if count % 10 == 0:
            print("downloaded", count, "images")
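One thing to watch for: a src value is not always a normal link. It can also be an inline data: URI (for example data:image/png;base64,...), in which case naming every file .jpg may be wrong. Below is a hedged sketch of a helper that handles both cases. The name save_image is hypothetical, and it assumes any data: URI is base64-encoded, which may not hold for every image.

```python
import base64
import os
import urllib.request


def save_image(src, folder, index):
    """Save one image src (data: URI or http(s) URL) into folder; return the path."""
    os.makedirs(folder, exist_ok=True)
    if src.startswith('data:'):
        # Header looks like "data:image/png;base64"; payload is the encoded image.
        header, _, payload = src.partition(',')
        subtype = header[len('data:'):].split(';')[0].split('/')[-1] or 'jpg'
        ext = 'jpg' if subtype == 'jpeg' else subtype
        path = os.path.join(folder, 'image{}.{}'.format(index, ext))
        with open(path, 'wb') as f:
            f.write(base64.b64decode(payload))  # assumption: base64 payload
    else:
        # Regular URL: download it as in the loop above.
        path = os.path.join(folder, 'image{}.jpg'.format(index))
        urllib.request.urlretrieve(src, path)
    return path
```

Inside the download loop, the urlretrieve call could then be swapped for save_image(src, 'peoplewithmask', count).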
Done. That is all for today. Feel free to reach out if you need help.
I did not explain how to inspect the page and find the class, id, etc., because I feel most developers already know how. Still, if you run into problems, please refer to this video tutorial.