Creating image datasets can be a hectic process. To my understanding, it basically consists of the pipeline below:
- Model Bias.
- What's your model's goal?
- Ways to collect images.
- Cleaning the data.
- Resizing the images.
Can you solve this riddle?
A man and his son are in a terrible accident and are rushed to the hospital in critical care. The doctor looks at the boy and exclaims "I can't operate on this boy, he's my son!" How could this be?
Most people's first guess is the same as mine was 😃: this is an example of human bias (the doctor is the boy's mother; many of us assume a doctor must be a man).
If you train your model mostly on cat images and then expect it to perform well at detecting both cats and dogs, this is what happens:
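A quick way to catch this kind of skew before training is to look at the label distribution. A minimal sketch (the 900/100 cat-versus-dog split is made up purely for illustration):

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the dataset so skew is
    visible before training starts."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: round(n / total, 2) for cls, n in counts.items()}

# e.g. 900 cat images vs. 100 dog images:
labels = ["cat"] * 900 + ["dog"] * 100
print(class_balance(labels))  # {'cat': 0.9, 'dog': 0.1}
```

If one class dominates like this, either collect more of the minority class or rebalance before training.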
source: Sidney Harris
For more details on data bias, you can go through these excellent CS224n slides: Bias in the Vision and Language of Artificial Intelligence.
Here's just a sample list of sources for collecting image data:
- Search engines 🔍 (Google, Bing, Yandex, DuckDuckGo)
- Social media, through hashtags #️⃣
- YouTube videos and Flickr 📹
- Take a camera/mobile and go collect data yourself.
- Trash the images which can't be loaded or are corrupted.
- Find duplicate images (they pile up when you pull from several search engines).
- Do what's necessary...
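The duplicate-finding step can start with something as simple as hashing file bytes. This only catches byte-identical copies (near-duplicates need a tool like imagededup, covered later), but it's a cheap first pass; a minimal stdlib sketch:

```python
import hashlib
from pathlib import Path

def find_exact_duplicates(folder):
    """Group files by the SHA-256 hash of their bytes; identical
    bytes mean exact duplicates, which is common when the same
    image is returned by several search engines."""
    groups = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups.setdefault(digest, []).append(path.name)
    # Keep only hashes seen more than once.
    return [names for names in groups.values() if len(names) > 1]
```

From each group, keep one file and trash the rest.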
- Resize while maintaining the aspect ratio.
- If your images come in different sizes, resize with padding (fill the leftover pixels with black/white) so every image reaches the target size without distortion.
- Smaller images >>> faster model training.
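The resize-with-padding arithmetic above can be sketched without any image library (the 224x224 target here is just a common example size, not a requirement):

```python
def fit_with_padding(src_w, src_h, target_w, target_h):
    """Scale (src_w, src_h) to fit inside (target_w, target_h)
    while keeping the aspect ratio; return the scaled size and
    the offsets where the image sits on the padded canvas."""
    scale = min(target_w / src_w, target_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # Centre the image; the remaining border gets the pad colour.
    off_x = (target_w - new_w) // 2
    off_y = (target_h - new_h) // 2
    return (new_w, new_h), (off_x, off_y)

# A 640x480 photo fit into a 224x224 canvas:
size, offset = fit_with_padding(640, 480, 224, 224)
print(size, offset)  # (224, 168) (0, 28)
```

In practice you'd hand these numbers to your image library of choice: scale to `size`, then paste onto a black/white canvas at `offset`.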
Some of these require ChromeDriver and Selenium.
For downloading images via search engines:
Image Downloader by sczhengyabin [google | bing]
google images download by hardikvasa
yandex images download by bobokvsky
Bulk Bing Image downloader by ostrolucky
Flickr image-scraping software developed by Ultralytics LLC
For downloading from Instagram based on hashtags:
- Instagram-scraper by arc298
For duplicate images cleaning:
- Imagededup by idealo ✨😎
This imagededup package uses a Convolutional Neural Network (CNN) and hashing algorithms to find duplicate (and near-duplicate) images.
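imagededup does the heavy lifting for you, but the idea behind the hashing side is easy to sketch. Below is a toy difference hash (dHash) over a tiny grayscale matrix; this is not imagededup's actual implementation, just the concept: each bit records whether a pixel is brighter than its right-hand neighbour, so near-identical images produce hashes with a small Hamming distance.

```python
def dhash_bits(gray):
    """Toy difference hash over a small grayscale matrix (in a
    real pipeline the image is first shrunk so each row is one
    pixel wider than the hash width): bit = 1 where a pixel is
    brighter than its right-hand neighbour."""
    return [int(row[x] > row[x + 1])
            for row in gray
            for x in range(len(row) - 1)]

def hamming(a, b):
    """Count differing bits; small distance => likely duplicates."""
    return sum(x != y for x, y in zip(a, b))

# Two rows of pixel brightness: one rising, one falling.
print(dhash_bits([[10, 20, 30], [30, 20, 10]]))  # [0, 0, 1, 1]
```

Because only brightness *differences* matter, the hash survives small changes like re-compression or slight resizing, which is exactly why it beats byte-level hashing for near-duplicates.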