The other day I was reading an interesting blog post by Henk Boelman (a Microsoft Advocate). He described how you can build an image classifier with Azure Machine Learning Studio, using a dataset he had downloaded from Kaggle. But what if you can't find what you need on Kaggle? This blog post discusses how you can use Bing Image Search to generate your own dataset.
So first of all, some explanation. The "Bing Image Search API" is part of Cognitive Services on Azure and lets you search for images just as you would on bing.com/images. The API has a search function that lets you search for a specific keyword and also filter the results by size (height/width), file size, license, color, ...
More info and a free, easy tryout can be found on Microsoft's product page:
https://azure.microsoft.com/en-us/services/cognitive-services/bing-image-search-api/
Since we are talking about Machine Learning, all the code discussed here is in Python. That way you can just copy it into your Jupyter/Azure Notebook.
Step 1: Install/Import necessary packages
Bing Image Search has its own Python SDK, which is very handy. But if you prefer, you can also use classic HTTP requests.
!pip install azure-cognitiveservices-search-imagesearch
from azure.cognitiveservices.search.imagesearch import ImageSearchClient
from msrest.authentication import CognitiveServicesCredentials
import pandas as pd
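For readers who prefer the plain HTTP route over the SDK, here is a minimal sketch of how such a request could be put together. The path `/bing/v7.0/images/search`, the example endpoint host and the helper name `build_image_search_request` are assumptions for illustration; check the path and parameters against your own resource before relying on them.

```python
from urllib.parse import urlencode

# Assumption: REST path of the Bing Image Search v7 API.
SEARCH_PATH = "/bing/v7.0/images/search"


def build_image_search_request(endpoint, subscription_key, query, count=150, offset=0):
    """Return (url, headers) for a GET request to the image search endpoint."""
    params = {"q": query, "count": count, "offset": offset}
    url = endpoint.rstrip("/") + SEARCH_PATH + "?" + urlencode(params)
    # The key travels in a request header, not in the URL
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    return url, headers


# Hypothetical endpoint host, used only to show the shape of the result
url, headers = build_image_search_request(
    "https://example.cognitiveservices.azure.com/", "[YourKeyHere]", "red apples", count=50
)
print(url)
```

The resulting `url` and `headers` can then be passed to any HTTP client, for example `requests.get(url, headers=headers)`. The rest of this post sticks with the SDK.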
Step 2: Configure subscription key + endpoint
The subscription key and endpoint can be found in the Azure Portal after you create a Cognitive Services resource. You don't need to look for a specific Bing Search resource: for a while now, Bing Search and many other cognitive services have been bundled into one resource, which means one endpoint and one key to use them all.
I also set the search term here.
subscription_key = "[YourKeyHere]"
subscription_endpoint = "[YourEndpointHere]"
search_term = "apples"
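If you plan to share the notebook, you may not want the key to sit in a cell. A possible alternative, sketched below, is to read both values from environment variables. The variable names `BING_SEARCH_KEY` and `BING_SEARCH_ENDPOINT` and the helper `load_setting` are made up for this example.

```python
import os


def load_setting(name, default=None):
    """Read a setting from the environment, failing loudly when it is missing."""
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable names; the defaults are the same placeholders as above
subscription_key = load_setting("BING_SEARCH_KEY", default="[YourKeyHere]")
subscription_endpoint = load_setting("BING_SEARCH_ENDPOINT", default="[YourEndpointHere]")
```

Dropping the `default` argument makes a missing variable fail immediately instead of silently using a placeholder.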
Step 3: Configure Client
In this step we create two new objects. One is the credentials object, which contains the subscription key we just configured. The other is the client that we will use to make the call; the client also needs the endpoint.
credentials = CognitiveServicesCredentials(subscription_key)
client = ImageSearchClient(endpoint=subscription_endpoint, credentials=credentials)
Step 4: Search and gather the results
Searching for items is very easy: call the search function on the client object. It has a bunch of parameters that you can configure (like color, size, ...). In this case we only configure what we are searching for and how many results we want back.
# Search for images
image_results = client.images.search(query=search_term,count=150)
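A single call returns one page of results, so a larger dataset needs several calls with an increasing offset. The sketch below shows one way to do that. It is written against a generic `search_page(offset, count)` callable (a name introduced here) so it can be tried without a key, and it assumes each result is a dict with a `content_url` entry.

```python
def search_all(search_page, total, page_size=150):
    """Collect up to `total` unique results by paging through search_page(offset, count)."""
    results = []
    seen_urls = set()
    offset = 0
    while len(results) < total:
        page = search_page(offset, min(page_size, total - len(results)))
        if not page:
            break  # the service has no more results
        added = 0
        for item in page:
            url = item["content_url"]
            if url not in seen_urls:  # pages can overlap, so skip duplicates
                seen_urls.add(url)
                results.append(item)
                added += 1
        if added == 0:
            break  # only duplicates came back; stop instead of looping forever
        offset += len(page)
    return results


# Stand-in for the real service: 400 fake results
fake_index = [{"content_url": f"https://example.com/{i}.jpg"} for i in range(400)]


def fake_search_page(offset, count):
    return fake_index[offset:offset + count]


items = search_all(fake_search_page, total=320)
print(len(items))  # 320
```

To use it with the real client, `search_page` could wrap `client.images.search(query=search_term, count=count, offset=offset)` and return `[x.as_dict() for x in result.value]`. That the SDK method accepts an `offset` argument is an assumption you should verify against your installed version.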
Step 5: Convert to dataframe
The result we receive from the API is a list of objects. This might be handy in a C# application or similar, but in this case we want to work with plain tabular data, so the code below converts the result to a dataframe. You will notice we get quite a lot of information back. Some of it is still stored as a nested object, like the "thumbnail" column, but we don't need that data here, so we leave it as it is.
# Convert to dataframe
df = pd.DataFrame([x.as_dict() for x in image_results.value])
df.head()
| | _type | accent_color | content_size | content_url | date_published | encoding_format | height | host_page_display_url | host_page_url | image_id | image_insights_token | insights_metadata | name | thumbnail | thumbnail_url | web_search_url | width |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ImageObject | 9D2E39 | 224643 B | https://upload.wikimedia.org/wikipedia/commons... | 2019-09-14T23:40:00.0000000Z | jpeg | 1200 | https://en.wikipedia.org/wiki/Apple | https://en.wikipedia.org/wiki/Apple | 6AEAF1C3894ED7D563916634549ED00175C970D3 | ccid_pfp3ysAm*mid_6AEAF1C3894ED7D563916634549E... | {} | Apple -- Wikipedia | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse3.mm.bing.net/th?id=OIP.pfp3ysAmXA6... | https://www.bing.com/images/search?view=detail... | 1200 |
| 1 | ImageObject | 4F300B | 161831 B | https://upload.wikimedia.org/wikipedia/commons... | 2019-12-21T20:48:00.0000000Z | jpeg | 1083 | https://en.wikipedia.org/wiki/Honeycrisp | https://en.wikipedia.org/wiki/Honeycrisp | 4531DA369849BECBA2980492B1F021A9EF15BB99 | ccid_wPWDNJmR*mid_4531DA369849BECBA2980492B1F0... | {} | Honeycrisp -- Wikipedia | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse1.mm.bing.net/th?id=OIP.wPWDNJmRmNP... | https://www.bing.com/images/search?view=detail... | 1200 |
| 2 | ImageObject | C47807 | 172706 B | https://www.tasteofhome.com/wp-content/uploads... | 2018-08-15T00:12:00.0000000Z | jpeg | 1200 | https://www.tasteofhome.com/collection/new-typ... | https://www.tasteofhome.com/collection/new-typ... | EC1EBFE716FD1033C0E0983931FEC881E8A1177D | ccid_7jSg14s4*mid_EC1EBFE716FD1033C0E0983931FE... | {} | 15 New Types of Apples You Should Be Buying \| ... | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse2.mm.bing.net/th?id=OIP.7jSg14s46gp... | https://www.bing.com/images/search?view=detail... | 1200 |
| 3 | ImageObject | 7E0F24 | 4462721 B | http://www.michiganapples.com/portals/0/MAC%20... | 2019-11-15T14:57:00.0000000Z | jpeg | 3738 | www.michiganapples.com/About/Varieties | http://www.michiganapples.com/About/Varieties | 753A53E3C6A184B4C55A074E275F120D20D4914C | ccid_3X+TRdGR*mid_753A53E3C6A184B4C55A074E275F... | {} | Michigan Apple Varieties \| Michigan Apple Comm... | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse2.mm.bing.net/th?id=OIP.3X-TRdGReOQ... | https://www.bing.com/images/search?view=detail... | 3738 |
| 4 | ImageObject | 4F1E0F | 419131 B | http://www.flinchbaughsorchard.com/wp-content/... | 2019-11-01T04:51:00.0000000Z | jpeg | 1691 | www.flinchbaughsorchard.com/apple-varieties | http://www.flinchbaughsorchard.com/apple-varie... | 7480E3C7DDE4776D891D003001872D954786935C | ccid_GBWUKwPQ*mid_7480E3C7DDE4776D891D00300187... | {} | Apple Varieties \| Flinchbaugh's Orchard & Farm... | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse1.mm.bing.net/th?id=OIP.GBWUKwPQyNo... | https://www.bing.com/images/search?view=detail... | 1800 |
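Columns such as `content_size`, `width` and `height` can be used to drop images that are too small or too heavy before downloading anything. The sketch below works on plain dicts shaped like the rows above. The thresholds and the helper names are arbitrary choices for illustration, and it assumes `content_size` always looks like `"224643 B"`.

```python
def parse_content_size(text):
    """Turn a string such as '224643 B' into the integer 224643."""
    number, unit = text.split()
    if unit != "B":
        raise ValueError(f"Unexpected unit in content_size: {text!r}")
    return int(number)


def keep_record(record, min_side=224, max_bytes=2_000_000):
    """Keep images whose shortest side and file size fall within the limits."""
    return (
        min(record["width"], record["height"]) >= min_side
        and parse_content_size(record["content_size"]) <= max_bytes
    )


# Two rows taken from the table above, reduced to the columns we need
records = [
    {"content_size": "224643 B", "width": 1200, "height": 1200},
    {"content_size": "4462721 B", "width": 3738, "height": 3738},
]
kept = [r for r in records if keep_record(r)]
print(len(kept))  # 1
```

On the dataframe itself the same check could be applied with `df[df.apply(keep_record, axis=1)]`.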
Step 6: Generate new filenames
Since the data comes from many different websites, there is a chance that some filenames are identical. Also, some files might not have an extension in the URI, because of routing or on-the-fly image generation. To fix this we add a function that guesses the MIME type and looks up the matching extension. Combined with a generated GUID, this gives us a unique filename.
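import mimetypes
import uuid
def getFileName(contentUrl):
    # Guess the MIME type from the URL, then look up a matching extension
    mt = mimetypes.guess_type(contentUrl)
    ext = mimetypes.guess_extension(mt[0]) if mt[0] is not None else None
    if ext is not None:
        # A GUID keeps the filename unique across websites
        return str(uuid.uuid1()) + ext
    # No usable extension: signal it with an empty string
    return ""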
By applying the above function to each row we add an extra column to the dataframe that contains the newly generated fileName. NOTE: you might have noticed that the function can return an empty string. This only happens when it can't work out an extension for the file. We filter those results out.
df['fileName'] = df.apply(lambda x: getFileName(x.content_url), axis=1)
df = df[df['fileName'] != ""]
df.head()
| | _type | accent_color | content_size | content_url | date_published | encoding_format | height | host_page_display_url | host_page_url | image_id | image_insights_token | insights_metadata | name | thumbnail | thumbnail_url | web_search_url | width | fileName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ImageObject | 9D2E39 | 224643 B | https://upload.wikimedia.org/wikipedia/commons... | 2019-09-14T23:40:00.0000000Z | jpeg | 1200 | https://en.wikipedia.org/wiki/Apple | https://en.wikipedia.org/wiki/Apple | 6AEAF1C3894ED7D563916634549ED00175C970D3 | ccid_pfp3ysAm*mid_6AEAF1C3894ED7D563916634549E... | {} | Apple -- Wikipedia | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse3.mm.bing.net/th?id=OIP.pfp3ysAmXA6... | https://www.bing.com/images/search?view=detail... | 1200 | 11c6ddea-3197-11ea-87d5-000d3aaa7d6e.jpe |
| 1 | ImageObject | 4F300B | 161831 B | https://upload.wikimedia.org/wikipedia/commons... | 2019-12-21T20:48:00.0000000Z | jpeg | 1083 | https://en.wikipedia.org/wiki/Honeycrisp | https://en.wikipedia.org/wiki/Honeycrisp | 4531DA369849BECBA2980492B1F021A9EF15BB99 | ccid_wPWDNJmR*mid_4531DA369849BECBA2980492B1F0... | {} | Honeycrisp -- Wikipedia | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse1.mm.bing.net/th?id=OIP.wPWDNJmRmNP... | https://www.bing.com/images/search?view=detail... | 1200 | 11c6e2ea-3197-11ea-87d5-000d3aaa7d6e.jpe |
| 2 | ImageObject | C47807 | 172706 B | https://www.tasteofhome.com/wp-content/uploads... | 2018-08-15T00:12:00.0000000Z | jpeg | 1200 | https://www.tasteofhome.com/collection/new-typ... | https://www.tasteofhome.com/collection/new-typ... | EC1EBFE716FD1033C0E0983931FEC881E8A1177D | ccid_7jSg14s4*mid_EC1EBFE716FD1033C0E0983931FE... | {} | 15 New Types of Apples You Should Be Buying \| ... | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse2.mm.bing.net/th?id=OIP.7jSg14s46gp... | https://www.bing.com/images/search?view=detail... | 1200 |
| 3 | ImageObject | 7E0F24 | 4462721 B | http://www.michiganapples.com/portals/0/MAC%20... | 2019-11-15T14:57:00.0000000Z | jpeg | 3738 | www.michiganapples.com/About/Varieties | http://www.michiganapples.com/About/Varieties | 753A53E3C6A184B4C55A074E275F120D20D4914C | ccid_3X+TRdGR*mid_753A53E3C6A184B4C55A074E275F... | {} | Michigan Apple Varieties \| Michigan Apple Comm... | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse2.mm.bing.net/th?id=OIP.3X-TRdGReOQ... | https://www.bing.com/images/search?view=detail... | 3738 |
| 4 | ImageObject | 4F1E0F | 419131 B | http://www.flinchbaughsorchard.com/wp-content/... | 2019-11-01T04:51:00.0000000Z | jpeg | 1691 | www.flinchbaughsorchard.com/apple-varieties | http://www.flinchbaughsorchard.com/apple-varie... | 7480E3C7DDE4776D891D003001872D954786935C | ccid_GBWUKwPQ*mid_7480E3C7DDE4776D891D00300187... | {} | Apple Varieties \| Flinchbaugh's Orchard & Farm... | {'_type': 'ImageObject', 'width': 474, 'height... | https://tse1.mm.bing.net/th?id=OIP.GBWUKwPQyNo... | https://www.bing.com/images/search?view=detail... | 1800 |
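Filtering on the guessed MIME type throws away every image whose URL has no recognizable extension. Since the results also carry an `encoding_format` column, one possible variation is to fall back on that value. This is a sketch, not the approach used in the notebook: the function name `file_name_with_fallback` is made up, and it uses `uuid.uuid4()` (random) where the original uses `uuid.uuid1()`.

```python
import mimetypes
import uuid


def file_name_with_fallback(content_url, encoding_format=None):
    """Build a unique filename, using encoding_format when the URL gives no hint."""
    mime_type, _ = mimetypes.guess_type(content_url)
    ext = mimetypes.guess_extension(mime_type) if mime_type else None
    if ext is None and encoding_format:
        # Fall back on the format reported by the search result, e.g. "jpeg"
        ext = "." + encoding_format.lower()
    if ext is None:
        return ""  # still unknown: let the caller filter this row out
    return str(uuid.uuid4()) + ext


print(file_name_with_fallback("https://example.com/a.png"))
print(file_name_with_fallback("https://example.com/render/12345", "jpeg"))
print(repr(file_name_with_fallback("https://example.com/render/12345")))
```

On the dataframe it could be applied as `df.apply(lambda x: file_name_with_fallback(x.content_url, x.encoding_format), axis=1)`.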
Step 7: Create folder to save the images
This is a folder in your local or cloud environment.
import os
dirName = "image_gathering"
# exist_ok avoids an error when the folder is already there
os.makedirs(dirName, exist_ok=True)
Step 8: Download images
First we create a function that tries to download the file to the new destination under the new filename. Bing's index might be outdated, so some downloads may fail (e.g. NotFound, NotAuthenticated, ...). Therefore the function returns True or False depending on whether the download was successful.
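!pip install wget
import wget
def downloadFile(dirName, contentUrl, fileName):
    try:
        # Save the image under its generated filename
        wget.download(contentUrl, dirName + "/" + fileName)
        return True
    except Exception:
        # Any failure (HTTP error, timeout, ...) marks the row as not downloaded
        return False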
By applying this function to every row, every result is downloaded, and we add an extra column that keeps track of whether the download was successful.
df['fileDownloaded'] = df.apply(lambda x: downloadFile(dirName, x.content_url, x.fileName), axis=1)
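A download that "succeeds" is not always an image: some sites answer with an HTML error page instead. As an optional extra check, the sketch below inspects the first bytes of each file for a few well-known image signatures. The signature list is deliberately short and not exhaustive, and `looks_like_image` is a name introduced for this example.

```python
import os
import tempfile

# Leading bytes of a few common image formats (not an exhaustive list)
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
    b"BM",                 # BMP
)


def looks_like_image(path):
    """Return True when the file starts with a known image signature."""
    with open(path, "rb") as f:
        header = f.read(16)
    return header.startswith(IMAGE_SIGNATURES)


# Small demonstration with two throwaway files
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.png")
    bad = os.path.join(tmp, "bad.jpg")
    with open(good, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
    with open(bad, "wb") as f:
        f.write(b"<html>Not Found</html>")
    print(looks_like_image(good), looks_like_image(bad))  # True False
```

Files that fail the check can be deleted before training, so they don't end up in the dataset.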
THE END
It's not that hard, and quite fast, to gather your own data from Bing Search results. If you want images for several search terms, you will need to run this script multiple times, or you can convert it into a function that accepts an array of search strings.
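As a rough idea of what the multi-term version could look like, the sketch below loops over a list of search strings and prepares one sub-folder per term. The search itself is passed in as a function so the example runs without a key. `gather_for_terms` and `fake_search` are names made up here, not part of the notebook.

```python
import os
import tempfile


def gather_for_terms(search_terms, search, base_dir):
    """Create one sub-folder per term and return {term: (folder, urls)}."""
    plan = {}
    for term in search_terms:
        # Spaces are replaced so the term is a convenient folder name
        folder = os.path.join(base_dir, term.replace(" ", "_"))
        os.makedirs(folder, exist_ok=True)
        plan[term] = (folder, search(term))
    return plan


def fake_search(term):
    # Stand-in for the real Bing call: returns three made-up URLs per term
    return [f"https://example.com/{term}/{i}.jpg" for i in range(3)]


with tempfile.TemporaryDirectory() as base_dir:
    plan = gather_for_terms(["apples", "green pears"], fake_search, base_dir)
    for term, (folder, urls) in plan.items():
        print(term, os.path.basename(folder), len(urls))
```

In a real run, `search` would call the client for each term and return the content URLs, and each URL would then be downloaded into the folder created for its term.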
Extra -- Move data to datastore
In case you are using Azure Machine Learning Studio, it's a good idea to move your images to a datastore. That way other data scientists can also make use of them. The code below uploads the folder of images to the same folder on your datastore.
import azureml.core
from azureml.core import Workspace, Datastore
ws = Workspace.from_config()
datastore_name = 'workspaceblobstore'
datastore = Datastore.get(ws, datastore_name)
datastore.upload(dirName, dirName)
The full Jupyter Notebook can be found on my GitHub:
https://github.com/sammydeprez/PythonExamples/blob/master/BingImageSearch/BingImageSearch.ipynb
Enjoy gathering new data!
Originally posted on https://www.datafish.eu/article/data-gathering-for-image-recognition/