Creating image datasets can be a hectic process. To my understanding, it basically consists of the pipeline below:
- Model Bias.
- What's your model's goal?
- Ways to collect images.
- Cleaning the data.
- Resizing the images.
Can you solve this riddle?
A man and his son are in a terrible accident and are rushed to the hospital in critical care. The doctor looks at the boy and exclaims "I can't operate on this boy, he's my son!" How could this be?
Most people's first guess is the same as mine was 😃: this is an example of human bias (the doctor is the boy's mother; many of us assume a doctor must be a man).
If you train your model mostly on cat images and then expect it to perform well at detecting both cats and dogs, this is what happens:
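A quick way to catch this kind of skew before training is to look at the label distribution. A minimal sketch (the 900/100 cat-versus-dog split is made up purely for illustration):

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the dataset so skew is
    visible before training starts."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: round(n / total, 2) for cls, n in counts.items()}

# e.g. 900 cat images vs. 100 dog images:
labels = ["cat"] * 900 + ["dog"] * 100
print(class_balance(labels))  # {'cat': 0.9, 'dog': 0.1}
```

If one class dominates like this, either collect more of the minority class or rebalance before training.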
source: Sidney Harris
For more details on data bias, you can go through these excellent CS224n slides: Bias in the Vision and Language of Artificial Intelligence.
Here's just a sample list of sources for collecting image data:
- Search engines 🔍 (Google, Bing, Yandex, DuckDuckGo)
- Social media, through hashtags #️⃣
- YouTube videos and Flickr 📹
- Take a camera/mobile and go collect data yourself.
- Trash the images which can't be loaded or are corrupted.
- Find duplicate images (they pile up when you pull from several search engines).
- Do what's necessary...
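The duplicate-finding step can start with something as simple as hashing file bytes. This only catches byte-identical copies (near-duplicates need a tool like imagededup, covered later), but it's a cheap first pass; a minimal stdlib sketch:

```python
import hashlib
from pathlib import Path

def find_exact_duplicates(folder):
    """Group files by the SHA-256 hash of their bytes; identical
    bytes mean exact duplicates, which is common when the same
    image is returned by several search engines."""
    groups = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups.setdefault(digest, []).append(path.name)
    # Keep only hashes seen more than once.
    return [names for names in groups.values() if len(names) > 1]
```

From each group, keep one file and trash the rest.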
- Resize while maintaining the aspect ratio.
- If your images come in different sizes, resize with padding (fill the leftover pixels with black/white) so every image reaches the target size without distortion.
- Smaller images >>> faster model training.
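The resize-with-padding arithmetic above can be sketched without any image library (the 224x224 target here is just a common example size, not a requirement):

```python
def fit_with_padding(src_w, src_h, target_w, target_h):
    """Scale (src_w, src_h) to fit inside (target_w, target_h)
    while keeping the aspect ratio; return the scaled size and
    the offsets where the image sits on the padded canvas."""
    scale = min(target_w / src_w, target_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # Centre the image; the remaining border gets the pad colour.
    off_x = (target_w - new_w) // 2
    off_y = (target_h - new_h) // 2
    return (new_w, new_h), (off_x, off_y)

# A 640x480 photo fit into a 224x224 canvas:
size, offset = fit_with_padding(640, 480, 224, 224)
print(size, offset)  # (224, 168) (0, 28)
```

In practice you'd hand these numbers to your image library of choice: scale to `size`, then paste onto a black/white canvas at `offset`.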
Some of these require ChromeDriver and Selenium.
For downloading images via search engines:
Image Downloader by sczhengyabin [google | bing]
google images download by hardikvasa
yandex images download by bobokvsky
Bulk Bing Image downloader by ostrolucky
Flickr image-scraping software developed by Ultralytics LLC
For downloading from Instagram based on hashtags:
- Instagram-scraper by arc298
For duplicate images cleaning:
- Imagededup by idealo ✨😎
This imagededup package uses a Convolutional Neural Network (CNN) and hashing algorithms to find duplicate (and near-duplicate) images.
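imagededup does the heavy lifting for you, but the idea behind the hashing side is easy to sketch. Below is a toy difference hash (dHash) over a tiny grayscale matrix; this is not imagededup's actual implementation, just the concept: each bit records whether a pixel is brighter than its right-hand neighbour, so near-identical images produce hashes with a small Hamming distance.

```python
def dhash_bits(gray):
    """Toy difference hash over a small grayscale matrix (in a
    real pipeline the image is first shrunk so each row is one
    pixel wider than the hash width): bit = 1 where a pixel is
    brighter than its right-hand neighbour."""
    return [int(row[x] > row[x + 1])
            for row in gray
            for x in range(len(row) - 1)]

def hamming(a, b):
    """Count differing bits; small distance => likely duplicates."""
    return sum(x != y for x, y in zip(a, b))

# Two rows of pixel brightness: one rising, one falling.
print(dhash_bits([[10, 20, 30], [30, 20, 10]]))  # [0, 0, 1, 1]
```

Because only brightness *differences* matter, the hash survives small changes like re-compression or slight resizing, which is exactly why it beats byte-level hashing for near-duplicates.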