building an image dataset

ashish (ash11sh) ・2 min read

Creating an image dataset is a bit of a hectic process. To my understanding, it basically consists of the pipeline below.

  1. Model Bias.
  2. What's your model's goal?
  3. Ways to collect images.
  4. Cleaning the data.
  5. Resizing the images.

Model Bias

Can you solve this riddle?

A man and his son are in a terrible accident and are rushed to the hospital in critical care. The doctor looks at the boy and exclaims "I can't operate on this boy, he's my son!" How could this be?

At first, most people think the way I did 😃 and assume the doctor must be a man, when the doctor is simply the boy's mother. This is an example of human bias.

If you train your model mostly on cat images and still expect it to perform well at detecting both cats and dogs, this happens:

source: Sidney Harris

For more details on data bias, you can go through these excellent slides from CS224n: Bias in the Vision and Language of Artificial Intelligence
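
A quick way to catch the cats-versus-dogs kind of imbalance described above is to count images per class before training. The sketch below assumes a hypothetical layout with one sub-folder per class (for example `dataset/cat/` and `dataset/dog/`); the extension list and the function names are made up for illustration and are not from the slides.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; extend it to match your own data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}


def count_images_per_class(root):
    """Count image files in each immediate sub-folder of `root`.

    Assumes a hypothetical layout with one folder per class label.
    """
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if not class_dir.is_dir():
            continue
        counts[class_dir.name] = sum(
            1
            for path in class_dir.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


def imbalance_ratio(counts):
    """Return largest class size divided by smallest (1.0 means balanced)."""
    sizes = list(counts.values())
    if not sizes:
        raise ValueError("no class folders found")
    if min(sizes) == 0:
        return float("inf")  # at least one class has no images at all
    return max(sizes) / min(sizes)
```

A ratio far above 1 is a hint to collect more images for the smaller classes before training.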

Ways to collect data

Here's just a sample list of sources for collecting image data:

  • Search engines 🔍 (Google, Bing, Yandex, DuckDuckGo)
  • Social media, through hashtags #️⃣
  • YouTube videos and Flickr 📹
  • Take a camera or phone and go around collecting data yourself.

Cleaning the data

  1. Trash images that can't be loaded or are corrupted.
  2. Find duplicate images (the same image often comes back from several search engines).
  3. Do what's necessary...
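
The first two steps above can be sketched roughly as follows. This assumes Pillow is installed, and the function names are hypothetical. The duplicate check only catches byte-for-byte copies by hashing file contents; resized or re-encoded copies need a perceptual method instead. Nothing is deleted here, the sketch only reports what it finds.

```python
import hashlib
from pathlib import Path

from PIL import Image  # Pillow, assumed installed: pip install Pillow


def is_loadable(path):
    """Return True if Pillow can fully decode the file at `path`."""
    try:
        with Image.open(path) as img:
            img.load()  # force a full decode so truncated files fail too
        return True
    except Exception:
        # Pillow raises several exception types for bad files, so this
        # sketch simply treats any failure as "not loadable".
        return False


def find_exact_duplicates(paths):
    """Group files whose bytes are identical, keyed by SHA-256 digest."""
    by_digest = {}
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        by_digest.setdefault(digest, []).append(Path(path))
    return [group for group in by_digest.values() if len(group) > 1]


def scan_folder(folder):
    """Return (corrupted, duplicate_groups) for files directly in `folder`."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    corrupted = [p for p in files if not is_loadable(p)]
    corrupted_set = set(corrupted)
    loadable = [p for p in files if p not in corrupted_set]
    return corrupted, find_exact_duplicates(loadable)
```

From each duplicate group you would typically keep one file and trash the rest, after a quick look at what was flagged.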

Resizing the images

  • Resize while maintaining the aspect ratio.
  • If you have images of different sizes, try resizing with padding (filling the extra pixels with black or white).
  • The smaller your images, the faster your model trains.
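
Here is a minimal sketch of resizing with padding, assuming Pillow is installed. The function names, the 224-pixel target, and the black fill are illustrative choices only; pick whatever your model expects.

```python
from PIL import Image  # Pillow, assumed installed: pip install Pillow


def fit_within(width, height, target):
    """Scale (width, height) to fit inside a target x target square,
    keeping the aspect ratio. Returns the new (width, height)."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_with_padding(img, target=224, fill=(0, 0, 0)):
    """Resize `img` to fit a target x target canvas and pad the rest.

    `target=224` and the black `fill` are illustrative defaults only.
    """
    new_w, new_h = fit_within(img.width, img.height, target)
    resized = img.convert("RGB").resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("RGB", (target, target), fill)
    # Centre the resized image; the uncovered border keeps the fill colour.
    canvas.paste(resized, ((target - new_w) // 2, (target - new_h) // 2))
    return canvas
```

For example, a 400x200 image with `target=224` is scaled to 224x112 and centred, leaving 56 rows of padding above and below. Pillow's `ImageOps.pad` offers similar behaviour if you would rather not write this by hand.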

Code you need (💪 open source)

Some of these require chromedriver and Selenium.

For downloading images from search engines:

For downloading from Instagram by hashtag:

For cleaning up duplicate images:

  • Imagededup by idealo✨ 😎

The "imagededup" package uses a convolutional neural network (CNN) and hashing algorithms to find duplicate images.
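
To give a feel for the hashing side, here is a toy average-hash sketch in plain Python. It is not imagededup's implementation and does not use its API. It works on small grayscale grids (lists of lists of 0-255 values) that are assumed to be already decoded and downscaled, and the function names and the distance threshold are made up for illustration.

```python
def average_hash(gray):
    """Return a tuple of bits: 1 where a pixel is above the grid's mean."""
    pixels = [value for row in gray for value in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if value > mean else 0 for value in pixels)


def hamming_distance(hash_a, hash_b):
    """Count positions where two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def looks_duplicate(gray_a, gray_b, max_distance=1):
    """Treat two grids as near-duplicates if their hashes differ by at
    most `max_distance` bits (an arbitrary threshold for this sketch)."""
    distance = hamming_distance(average_hash(gray_a), average_hash(gray_b))
    return distance <= max_distance


# A tiny 2x2 "image", a uniformly brighter copy, and a different pattern.
original = [[10, 200], [220, 30]]
brighter = [[value + 20 for value in row] for row in original]
different = [[200, 10], [30, 220]]
```

Because each pixel is compared with the grid's own mean, the uniformly brighter copy hashes to the same bits as the original, while the different pattern does not. That tolerance to small changes is what lets hash-based methods catch near-duplicates that a byte-for-byte comparison would miss.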
