DEV Community

Sammy Deprez

Data Gathering For Image Recognition

The other day I was reading an interesting blog post from Henk Boelman (a Microsoft Advocate). He described how you can build an image classifier with Azure Machine Learning Studio, using a dataset that he had downloaded from Kaggle. But what if you can't find what you need on Kaggle? This blog post discusses how you can use Bing Image Search to generate your own dataset.

First of all, some explanation. The "Bing Image Search API" is part of the Cognitive Services on Azure and lets you search for images just like you would on bing.com/images. The API has a search function that lets you search for a specific keyword and also filter the results by size (height/width), file size, license, color, ...

More info and a free, easy trial can be found on Microsoft's product page:
https://azure.microsoft.com/en-us/services/cognitive-services/bing-image-search-api/

Since we are talking about Machine Learning, all the code discussed here is in Python. That way you can just copy it into your Jupyter/Azure Notebook.

Step 1: Install/Import necessary packages

Bing Image Search has its own Python SDK, which is very handy. But if you want, you can also make use of classic HTTP requests.

!pip install azure-cognitiveservices-search-imagesearch

from azure.cognitiveservices.search.imagesearch import ImageSearchClient
from msrest.authentication import CognitiveServicesCredentials
import pandas as pd
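For readers who prefer plain HTTP over the SDK, here is a minimal sketch of how such a request could be built with the standard library. The path `/bing/v7.0/images/search`, the `Ocp-Apim-Subscription-Key` header and the example endpoint are assumptions based on the v7 REST API; the helper name `build_image_search_request` is made up for this illustration. The request is only constructed, not sent.

```python
import urllib.parse
import urllib.request


def build_image_search_request(endpoint, key, query, count=150, **filters):
    """Build (but do not send) a GET request for the image search REST API."""
    # Assumed path of the v7 image search operation on the resource endpoint.
    url = endpoint.rstrip("/") + "/bing/v7.0/images/search"
    params = {"q": query, "count": count}
    # Optional filters, for example imageType, size, color or license.
    params.update(filters)
    full_url = url + "?" + urllib.parse.urlencode(params)
    # The subscription key travels in a request header, not in the URL.
    headers = {"Ocp-Apim-Subscription-Key": key}
    return urllib.request.Request(full_url, headers=headers)


request = build_image_search_request(
    "https://example.cognitiveservices.azure.com",  # hypothetical endpoint
    "[YourKeyHere]",
    "apples",
    count=10,
    imageType="Photo",
    license="Public",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual call and return JSON; the SDK used in the rest of this post wraps the same operation.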

Step 2: Configure subscription key + endpoint

The subscription key and endpoint can be found in the Azure Portal after you create a Cognitive Services resource. You don't need to look for a specific Bing Search resource: for a while now, Bing Search and many other cognitive services have been bundled into one resource, which means one endpoint and one key for all of them.

I also added the search term here.

subscription_key = "[YourKeyHere]"

subscription_endpoint = "[YourEndpointHere]"
search_term = "apples"
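If the notebook is going to be shared, it may be safer to keep the key out of the code. The sketch below reads both values from environment variables; the variable names `BING_SEARCH_KEY` and `BING_SEARCH_ENDPOINT` and the helper `load_search_settings` are hypothetical, so pick whatever names suit your environment.

```python
import os


def load_search_settings(env=os.environ):
    """Return (key, endpoint) read from environment variables."""
    # Both variable names are assumptions made for this example.
    key = env.get("BING_SEARCH_KEY", "")
    endpoint = env.get("BING_SEARCH_ENDPOINT", "")
    if not key or not endpoint:
        raise RuntimeError("Set BING_SEARCH_KEY and BING_SEARCH_ENDPOINT first")
    return key, endpoint


# Usage in the notebook, once the variables are set:
# subscription_key, subscription_endpoint = load_search_settings()
```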

Step 3: Configure Client

In this step we create two new objects. The first is the credentials object, which holds the subscription key we just configured. The second is the client we will use to make the call; it also needs the endpoint.

credentials = CognitiveServicesCredentials(subscription_key)

client = ImageSearchClient(endpoint=subscription_endpoint, credentials=credentials)

Step 4: Search and gather the results

Searching is very easy: call the search function on the client object. It has a number of parameters you can configure (such as color, size, ...). In this case we only set what we are searching for and how many results we want back.

# Search for images

image_results = client.images.search(query=search_term,count=150)
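A single call returns at most one page of results (150 per request is, as far as I know, the documented maximum for `count`). If you need more images, paging with an offset is one option. The sketch below assumes the search function accepts an `offset` keyword and that each response exposes its items as `.value`, as in the call above; `search_all` is a hypothetical helper, and the simple "advance by the number of items received" strategy is an assumption rather than the official paging recipe.

```python
def search_all(search, query, total, page_size=150):
    """Collect up to `total` results by calling `search` page by page."""
    results = []
    offset = 0
    while len(results) < total:
        # Never ask for more than is still missing.
        count = min(page_size, total - len(results))
        page = search(query=query, count=count, offset=offset)
        items = page.value or []
        if not items:
            # No more results available for this query.
            break
        results.extend(items)
        offset += len(items)
    return results


# Usage with the client from Step 3:
# images = search_all(client.images.search, search_term, total=400)
```

Note that pages can overlap in practice, so dropping duplicates afterwards (for example on `image_id`) is probably a good idea.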

Step 5: Convert to dataframe

The result we receive from the API is a list of objects. This might be handy in a C# application or similar, but here we want to work with plain data, so the code below converts the results to a dataframe. You will notice we get quite a lot of information back. Some of it is still stored as a nested object (for example the "thumbnail" column, which holds a dictionary), but we don't need that data here, so we leave it as it is.

# Convert to dataframe

df = pd.DataFrame([x.as_dict() for x in image_results.value])

df.head()

  _type accent_color content_size content_url date_published encoding_format height host_page_display_url host_page_url image_id image_insights_token insights_metadata name thumbnail thumbnail_url web_search_url width
0 ImageObject 9D2E39 224643 B https://upload.wikimedia.org/wikipedia/commons... 2019-09-14T23:40:00.0000000Z jpeg 1200 https://en.wikipedia.org/wiki/Apple https://en.wikipedia.org/wiki/Apple 6AEAF1C3894ED7D563916634549ED00175C970D3 ccid_pfp3ysAm*mid_6AEAF1C3894ED7D563916634549E... {} Apple -- Wikipedia {'_type': 'ImageObject', 'width': 474, 'height... https://tse3.mm.bing.net/th?id=OIP.pfp3ysAmXA6... https://www.bing.com/images/search?view=detail... 1200
1 ImageObject 4F300B 161831 B https://upload.wikimedia.org/wikipedia/commons... 2019-12-21T20:48:00.0000000Z jpeg 1083 https://en.wikipedia.org/wiki/Honeycrisp https://en.wikipedia.org/wiki/Honeycrisp 4531DA369849BECBA2980492B1F021A9EF15BB99 ccid_wPWDNJmR*mid_4531DA369849BECBA2980492B1F0... {} Honeycrisp -- Wikipedia {'_type': 'ImageObject', 'width': 474, 'height... https://tse1.mm.bing.net/th?id=OIP.wPWDNJmRmNP... https://www.bing.com/images/search?view=detail... 1200
2 ImageObject C47807 172706 B https://www.tasteofhome.com/wp-content/uploads... 2018-08-15T00:12:00.0000000Z jpeg 1200 https://www.tasteofhome.com/collection/new-typ... https://www.tasteofhome.com/collection/new-typ... EC1EBFE716FD1033C0E0983931FEC881E8A1177D ccid_7jSg14s4*mid_EC1EBFE716FD1033C0E0983931FE... {} 15 New Types of Apples You Should Be Buying ... {'_type': 'ImageObject', 'width': 474, 'height... https://tse2.mm.bing.net/th?id=OIP.7jSg14s46gp... https://www.bing.com/images/search?view=detail... 1200
3 ImageObject 7E0F24 4462721 B http://www.michiganapples.com/portals/0/MAC%20... 2019-11-15T14:57:00.0000000Z jpeg 3738 www.michiganapples.com/About/Varieties http://www.michiganapples.com/About/Varieties 753A53E3C6A184B4C55A074E275F120D20D4914C ccid_3X+TRdGR*mid_753A53E3C6A184B4C55A074E275F... {} Michigan Apple Varieties Michigan Apple Comm... {'_type': 'ImageObject', 'width': 474, 'height... https://tse2.mm.bing.net/th?id=OIP.3X-TRdGReOQ... https://www.bing.com/images/search?view=detail... 3738
4 ImageObject 4F1E0F 419131 B http://www.flinchbaughsorchard.com/wp-content/... 2019-11-01T04:51:00.0000000Z jpeg 1691 www.flinchbaughsorchard.com/apple-varieties http://www.flinchbaughsorchard.com/apple-varie... 7480E3C7DDE4776D891D003001872D954786935C ccid_GBWUKwPQ*mid_7480E3C7DDE4776D891D00300187... {} Apple Varieties Flinchbaugh's Orchard & Farm... {'_type': 'ImageObject', 'width': 474, 'height... https://tse1.mm.bing.net/th?id=OIP.GBWUKwPQyNo... https://www.bing.com/images/search?view=detail... 1800

Step 6: Generate new filenames

Since the results come from many different websites, some filenames may be identical. In addition, some URLs may not contain a file extension at all, because of routing or on-the-fly image generation. To fix this we add a function that guesses the MIME type from the URL and looks up the extension that belongs to it. Combined with a generated UUID, this gives us a unique filename.

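import mimetypes
import uuid

def getFileName(contentUrl):
    # Guess the MIME type from the URL, then look up the matching extension
    mt = mimetypes.guess_type(contentUrl)
    ext = mimetypes.guess_extension(mt[0]) if mt[0] is not None else None
    if ext is None:
        # MIME type (or its extension) unknown: signal this with an empty string
        return ""
    # A freshly generated UUID plus the extension gives a unique filename
    return str(uuid.uuid1()) + ext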

By applying the above function to each row, we add an extra column to the dataframe that contains the newly generated filename. NOTE: you might have noticed that the function can return an empty string. This only happens when it can't figure out the MIME type (and therefore the extension). We filter those results out.

df['fileName'] = df.apply(lambda x: getFileName(x.content_url), axis=1)

df = df[df['fileName'] != ""]

df.head()

  _type accent_color content_size content_url date_published encoding_format height host_page_display_url host_page_url image_id image_insights_token insights_metadata name thumbnail thumbnail_url web_search_url width fileName
0 ImageObject 9D2E39 224643 B https://upload.wikimedia.org/wikipedia/commons... 2019-09-14T23:40:00.0000000Z jpeg 1200 https://en.wikipedia.org/wiki/Apple https://en.wikipedia.org/wiki/Apple 6AEAF1C3894ED7D563916634549ED00175C970D3 ccid_pfp3ysAm*mid_6AEAF1C3894ED7D563916634549E... {} Apple -- Wikipedia {'_type': 'ImageObject', 'width': 474, 'height... https://tse3.mm.bing.net/th?id=OIP.pfp3ysAmXA6... https://www.bing.com/images/search?view=detail... 1200 11c6ddea-3197-11ea-87d5-000d3aaa7d6e.jpe
1 ImageObject 4F300B 161831 B https://upload.wikimedia.org/wikipedia/commons... 2019-12-21T20:48:00.0000000Z jpeg 1083 https://en.wikipedia.org/wiki/Honeycrisp https://en.wikipedia.org/wiki/Honeycrisp 4531DA369849BECBA2980492B1F021A9EF15BB99 ccid_wPWDNJmR*mid_4531DA369849BECBA2980492B1F0... {} Honeycrisp -- Wikipedia {'_type': 'ImageObject', 'width': 474, 'height... https://tse1.mm.bing.net/th?id=OIP.wPWDNJmRmNP... https://www.bing.com/images/search?view=detail... 1200 11c6e2ea-3197-11ea-87d5-000d3aaa7d6e.jpe
2 ImageObject C47807 172706 B https://www.tasteofhome.com/wp-content/uploads... 2018-08-15T00:12:00.0000000Z jpeg 1200 https://www.tasteofhome.com/collection/new-typ... https://www.tasteofhome.com/collection/new-typ... EC1EBFE716FD1033C0E0983931FEC881E8A1177D ccid_7jSg14s4*mid_EC1EBFE716FD1033C0E0983931FE... {} 15 New Types of Apples You Should Be Buying ... {'_type': 'ImageObject', 'width': 474, 'height... https://tse2.mm.bing.net/th?id=OIP.7jSg14s46gp... https://www.bing.com/images/search?view=detail... 1200
3 ImageObject 7E0F24 4462721 B http://www.michiganapples.com/portals/0/MAC%20... 2019-11-15T14:57:00.0000000Z jpeg 3738 www.michiganapples.com/About/Varieties http://www.michiganapples.com/About/Varieties 753A53E3C6A184B4C55A074E275F120D20D4914C ccid_3X+TRdGR*mid_753A53E3C6A184B4C55A074E275F... {} Michigan Apple Varieties Michigan Apple Comm... {'_type': 'ImageObject', 'width': 474, 'height... https://tse2.mm.bing.net/th?id=OIP.3X-TRdGReOQ... https://www.bing.com/images/search?view=detail... 3738
4 ImageObject 4F1E0F 419131 B http://www.flinchbaughsorchard.com/wp-content/... 2019-11-01T04:51:00.0000000Z jpeg 1691 www.flinchbaughsorchard.com/apple-varieties http://www.flinchbaughsorchard.com/apple-varie... 7480E3C7DDE4776D891D003001872D954786935C ccid_GBWUKwPQ*mid_7480E3C7DDE4776D891D00300187... {} Apple Varieties Flinchbaugh's Orchard & Farm... {'_type': 'ImageObject', 'width': 474, 'height... https://tse1.mm.bing.net/th?id=OIP.GBWUKwPQyNo... https://www.bing.com/images/search?view=detail... 1800
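The MIME-type lookup drops every row whose URL has no recognizable extension, and on some Python versions it yields the unusual `.jpe` seen in the output above. As a possible alternative (a different technique from the one used in this post), the sketch below takes the extension straight from the URL path and falls back to the `encoding_format` column when the path has none. The function name `file_name_for` is hypothetical, and `uuid4` is my own choice here.

```python
import os
import uuid
from urllib.parse import urlparse


def file_name_for(content_url, encoding_format=None):
    """Return a unique filename, or "" when no extension can be determined."""
    # Look only at the path, so query strings such as "?w=800" are ignored.
    ext = os.path.splitext(urlparse(content_url).path)[1].lower()
    if not ext and encoding_format:
        # Fall back to the format reported by the API, for example "jpeg".
        ext = "." + encoding_format.lower()
    if not ext:
        return ""
    return str(uuid.uuid4()) + ext


# Usage on the dataframe from Step 5:
# df['fileName'] = df.apply(
#     lambda x: file_name_for(x.content_url, x.encoding_format), axis=1)
```

One weakness of this approach: a URL ending in something like `.php` would keep that extension, so a whitelist of image extensions may be worth adding.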

Step 7: Create folder to save the images

This is a folder on your local or cloud environment.

import os

dirName = "image_gathering"

os.makedirs(dirName, exist_ok=True)  # no error if the folder already exists

Requirement already satisfied: wget in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (3.2)
Directory image_gathering/apples already exists. Folder will be cleared
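The output above hints at a sub-folder per search term that is cleared when it already exists. The code for that is not shown in this post, so the following is only a sketch of how it could be done; `prepare_folder` is a hypothetical helper, not the function from the original notebook.

```python
import os
import shutil


def prepare_folder(base_dir, search_term):
    """Create an empty folder base_dir/search_term and return its path."""
    target = os.path.join(base_dir, search_term)
    if os.path.isdir(target):
        print("Directory " + target + " already exists. Folder will be cleared")
        # Remove the folder with everything in it before recreating it.
        shutil.rmtree(target)
    os.makedirs(target)
    return target


# Usage:
# dirName = prepare_folder("image_gathering", search_term)
```

Clearing the folder keeps reruns reproducible, at the cost of downloading everything again.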

Step 8: Download images

First we create a function that tries to download the file to the new destination under the new filename. The Bing search index might be outdated, so some downloads can fail (e.g. NotFound, NotAuthenticated, ...). Therefore the function returns True or False depending on whether the download was successful.

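# Requires the wget package (pip install wget)
import wget

def downloadFile(dirName, contentUrl, fileName):
    try:
        # Save the image under its generated filename inside the target folder
        wget.download(contentUrl, dirName + "/" + fileName)
        return True
    except Exception:
        # Any failure (dead link, access denied, ...) marks the download as failed
        return False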

By applying this function to every row, every result is downloaded, and we add an extra column that keeps track of whether the download was successful.

df['fileDownloaded'] = df.apply(lambda x: downloadFile(dirName, x.content_url, x.fileName), axis=1)
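If you would rather not depend on the wget package, or want a timeout so that one slow host does not stall the whole run, a standard-library variant could look roughly like this. It is a sketch, not a drop-in replacement: the name `download_file`, the 10-second default and the set of caught exceptions are my own assumptions.

```python
import http.client
import shutil
import urllib.request


def download_file(content_url, destination, timeout=10):
    """Download content_url to destination; return True on success."""
    try:
        with urllib.request.urlopen(content_url, timeout=timeout) as response, \
                open(destination, "wb") as target:
            # Stream the body to disk instead of loading it into memory.
            shutil.copyfileobj(response, target)
        return True
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, timeouts, unreachable hosts and malformed URLs.
        return False


# Usage on the dataframe:
# df['fileDownloaded'] = df.apply(
#     lambda x: download_file(x.content_url, dirName + "/" + x.fileName), axis=1)
```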

THE END

It's not that hard, and quite fast, to gather your own data based on results from Bing Search. If you want images for different search terms, you will need to run this script multiple times, or you can convert it into a function that accepts an array of search strings.
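As a rough idea of what such a function could look like, the sketch below loops over several search terms and tags every result with the term it came from. It assumes the search function behaves like `client.images.search` from Step 4 (keyword arguments `query` and `count`, results in `.value`, items offering `as_dict()`); `gather_rows` is a hypothetical name.

```python
def gather_rows(search, search_terms, count=150):
    """Run one search per term and return all results as a list of dicts."""
    rows = []
    for term in search_terms:
        results = search(query=term, count=count)
        for image in results.value or []:
            row = image.as_dict()
            # Remember which term produced this image; handy as a label later.
            row["search_term"] = term
            rows.append(row)
    return rows


# Usage with the client from Step 3:
# df = pd.DataFrame(gather_rows(client.images.search, ["apples", "pears"]))
```

The extra `search_term` column can then double as the class label for the image classifier.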

Extra -- Move data to datastore

In case you are using Azure Machine Learning Studio, it's a good idea to move your images to a datastore. That way other data scientists can also make use of them. The code below uploads the folder of images to a folder with the same name on your datastore.

import azureml.core

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

datastore_name = 'workspaceblobstore'
datastore = Datastore.get(ws, datastore_name)
datastore.upload(dirName, dirName)

The full Jupyter Notebook can be found on my GitHub

https://github.com/sammydeprez/PythonExamples/blob/master/BingImageSearch/BingImageSearch.ipynb

Enjoy gathering new data!

Originally posted on https://www.datafish.eu/article/data-gathering-for-image-recognition/
