DEV Community

Jonathan Fetterolf
7 Favorite Sites to Find Datasets

Introduction

So, you're looking for the next perfect dataset to use in your upcoming project? Look no further. Well, look a little further... I share my favorite sites below.

I'm always looking for the next dataset to help solidify a new concept I'm studying. This list should cut down the time it takes to find that perfect dataset. All of the sites included offer free datasets.

Kaggle

Kaggle is an online community platform for data scientists and machine learning enthusiasts. Kaggle hosts over 186,000 datasets on topics ranging from games to death rates and everything in between.

What is nice about Kaggle?
Kaggle suggests categories of datasets such as trending, music, business, computer science, and classification so you can quickly get to a dataset you're interested in. If you know exactly what you're looking for (or what you don't want included), Kaggle also lets you filter by file size, file type, and license type. My favorite feature is the "Usability" rating each dataset receives, which takes into account the completeness, credibility, and compatibility of the dataset.

Data Is Plural

Data Is Plural is a weekly newsletter of useful datasets published by Jeremy Singer-Vine. There have been over 300 editions of this newsletter; the earliest is from October 21, 2015.

Pros and Cons for Data Is Plural:
This is a super fun site for discovering useful or curious datasets. It's not the easiest site to navigate, but once you open the full archive as a spreadsheet, you can search it by keyword (using Ctrl+F).
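If you'd rather search a downloaded copy of the archive than use Ctrl+F, the same keyword lookup can be sketched in a few lines of Python. This is only a rough sketch: the sample rows and the column names (`edition`, `headline`, `text`) are made up for illustration and are assumptions, not the real archive's layout.

```python
import csv
import io

# Made-up sample standing in for an exported copy of the archive.
# The column names are assumptions; the real spreadsheet may differ.
SAMPLE_ARCHIVE = """edition,headline,text
edition-1,Example headline A,Records of meteorite landings worldwide.
edition-2,Example headline B,Survey of city tree inventories.
edition-3,Example headline C,Meteorite and fireball sighting reports.
"""


def search_archive(csv_file, keyword):
    """Return the rows where any field contains keyword, ignoring case."""
    needle = keyword.lower()
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if any(needle in (value or "").lower() for value in row.values())
    ]


matches = search_archive(io.StringIO(SAMPLE_ARCHIVE), "meteorite")
for row in matches:
    print(row["edition"], "-", row["headline"])
```

To try it on a real export, you could pass an opened file (for example `open("archive.csv", newline="")`, where the filename is hypothetical) in place of the `io.StringIO` sample.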

FiveThirtyEight

FiveThirtyEight shares the data behind some of their articles for you to use to create your own stories and visualizations. FiveThirtyEight offers datasets used in articles about culture, economics, politics, science & health, and sports.

Pros and Cons for FiveThirtyEight:
Datasets are listed by most recently updated, with those still being updated shown first. The most prominent link takes you to the article(s) that reference the dataset. The best way to explore what a dataset offers is to click the info link, which leads to the GitHub repository that houses all of FiveThirtyEight's available datasets (Repository found here).

UCI Machine Learning Repository

The University of California, Irvine hosts its own Machine Learning Repository with over 600 datasets. The site is very intuitive but looks a bit dated. It has an open beta for its new site that you can access here.

Pros and Cons for UCI Machine Learning Repository:
Most of its datasets are intended for classification tasks. The site is very easy to use, and its beta site has an updated look and adds filtering features to the search option.

Data.gov

Data.gov hosts over 245,000 datasets and is managed and hosted by the U.S. General Services Administration, Technology Transformation Service. Recently, I used Data.gov to find a dataset for Meteorite Mania, my personal project that uses geopandas to plot known meteorite landing sites on a world map.

Pros and Cons for Data.gov:
Data.gov has a basic layout that's easy to use and intuitive. It is easy to filter by location, formats, publishers, and bureaus.
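As a rough idea of the prep work a mapping project like the meteorite one involves, here is a minimal sketch that turns tabular rows with coordinates into GeoJSON points. It is not the code from Meteorite Mania: the sample rows and the column names (`name`, `lat`, `lon`) are made up for illustration, and it uses only the standard library rather than geopandas.

```python
import json

# Made-up rows for illustration; the column names are assumptions.
SAMPLE_ROWS = [
    {"name": "Sample A", "lat": "50.775", "lon": "6.08333"},
    {"name": "Sample B", "lat": "", "lon": ""},          # missing coordinates
    {"name": "Sample C", "lat": "95.0", "lon": "10.0"},  # latitude out of range
    {"name": "Sample D", "lat": "-33.16667", "lon": "-64.95"},
]


def to_feature(row):
    """Convert one row to a GeoJSON point feature, or None if unusable."""
    try:
        lat, lon = float(row["lat"]), float(row["lon"])
    except (KeyError, TypeError, ValueError):
        return None
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": row.get("name")},
    }


def to_feature_collection(rows):
    """Build a FeatureCollection, skipping rows without valid coordinates."""
    features = [f for f in map(to_feature, rows) if f is not None]
    return {"type": "FeatureCollection", "features": features}


collection = to_feature_collection(SAMPLE_ROWS)
print(json.dumps(collection, indent=2))
```

Saving the result to a `.geojson` file gives you something that mapping tools, geopandas included, should be able to read and plot.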

BuzzFeedNews

Who doesn't love a good BuzzFeed article? BuzzFeedNews has made its datasets available with links to their repositories and articles so you can see what is actually going on with the data behind the scenes. If you haven't looked into this source, I highly recommend it.

Just Love for BuzzFeedNews:

I love that you can see the notebooks behind the articles. This lets you see exactly how the authors used the data to reach their stated conclusions. The datasets have already been analyzed, so you may struggle to come up with a new spin on them, but it's a great source for learning new methods.

AwesomeData

AwesomeData is a topic-centric list of links to open datasets. This is a great place to explore a topic through the lens of data.

Pros and Cons for AwesomeData:

If you're looking for a quick and to-the-point link to a dataset, this isn't it. If you're looking to explore and find some unique and interesting places with datasets, this one is for you!

Conclusion

Kaggle is my go-to for finding a usable dataset as quickly as possible. Data.gov is a very close second. If you're looking for a bit of a detour that may take you on a tangent but ends up somewhere fun, Data Is Plural is where it's at. BuzzFeedNews is great if you want to learn some new techniques or see the 'how' behind some number crunching. Any of these resources should get you to a usable dataset; it just depends on how you want to get there...