A colleague showed me something today that I had never come across before, and I was impressed 👍:
A search engine (powered by Google, who aren't too bad at that search thing) that returns results as a semi-curated list of datasets 📚 available on the web, regardless of where they are hosted!
(It's been around for quite a while now too!)
One of the biggest problems with both learning and understanding topics like machine learning and big data analytics is getting access to large datasets.
Lots of sites (such as Kaggle) have made awesome inroads into making datasets more accessible but they can't possibly host everything.
And that's where proper indexing and search can help.
Google has a good history of making popular search engines. But it's the approach behind dataset search that I'm more interested in:
Standardisation 📋 - it's up to dataset owners to make their dataset indexable in a specific format, so it can be found more easily and more precisely.
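To make that a bit more concrete: as far as I know, the format in question is schema.org's `Dataset` markup, usually embedded in the dataset's landing page as JSON-LD. Here's a minimal sketch of what generating that markup could look like in Python. The function name, the example dataset and the `example.com` URLs are all made up for illustration, and a real listing would probably want more properties (licence, creator and so on).

```python
import json


def build_dataset_jsonld(name, description, url, download_url,
                         encoding_format="text/csv"):
    """Return a schema.org Dataset description as a JSON-LD string.

    Hypothetical helper for illustration only; the property names come
    from schema.org, everything else is an assumption.
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # Where the data itself can be downloaded, and in what format.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }
    return json.dumps(record, indent=2)


# Placeholder values, not a real dataset.
markup = build_dataset_jsonld(
    name="Programming language survey responses",
    description="Made-up example: anonymised answers to a developer survey.",
    url="https://example.com/datasets/programming-survey",
    download_url="https://example.com/datasets/programming-survey.csv",
)

# The JSON-LD is typically placed in a script tag on the landing page.
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

The idea is that the metadata lives with the dataset owner's own page, so the search engine only has to crawl and index it rather than host the data.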
Okay, try searching for "programming":
What do we see?
- 🗂️ Three different datasets.
- 💡 Three potential project ideas.
- 🌍 Three different data sources.
It's that last one that works for me - I don't need to go through curated lists of dataset sources or validate the security of a dataset found in a Reddit post. I can just search.
- Have you used dataset search?
- Where do you get your datasets?
- What do you use public datasets for?
🧡 Tom Anderson