If “data is the new oil,” then there is a lot of free oil just waiting to be used. And you can do some pretty interesting things with that data, like answering the question: is Buffalo, New York really that cold in the winter?
There is plenty of free data out there, ready to be used for school projects, market research, or just for fun. Before you go crazy, however, you should be aware of the quality of the data you find. Here are some great sources of free data and some ways to determine their quality.
All of these dataset sources have strengths, weaknesses, and specialties. All in all, they are great resources, and you can easily spend hours going down rabbit holes.
But if you want to stay focused and find what you need, it’s important to understand the nuances of each source and use their strengths to your advantage.
- Google Dataset Search
As the name suggests, Google Dataset Search is “a dataset search engine,” whose primary audience includes journalists and data researchers.
Google Dataset Search has the most datasets of any option listed here, with 25 million datasets available when it exited beta in January 2020. As you’d expect from a Google product, the search function is powerful, and if you need to be really specific, there are plenty of filters to narrow down the results.
When it comes to finding free public datasets, you can’t do much better than Google Dataset Search right now. Keep in mind, though, that the Google Graveyard (the phenomenon where Google cancels a service or product on short notice) is a pervasive danger to Google products large and small, so it’s good to know the other options.
- Kaggle
Kaggle is a popular data science competition website that provides free public datasets that you can use to learn more about artificial intelligence (AI) and machine learning (ML).
Organizations use Kaggle to post a prompt (such as cassava leaf disease classification), and teams from around the world compete against each other to solve it using algorithms (and win a cash prize).
Kaggle is quite prominent in the data science community because it provides a way to test and demonstrate your skills; your performance in Kaggle competitions sometimes comes up in job interviews for AI/ML positions.
After these competitions, the datasets are made available for use. At the time of writing, Kaggle hosts a collection of over 68,000 datasets, which it organizes using a system of tags, usability scores, and upvotes.
Kaggle has a strong community on their site, with discussion boards within each dataset and within each competition. There are also active communities outside of Kaggle, such as r/kaggle, which share tips and tutorials.
All of this is to say that Kaggle is more than just a free dataset distributor; it’s also a way to test your skills as a data scientist. Free datasets are a side benefit that anyone can take advantage of.
- GitHub
GitHub is the global standard for collaborative and open-source online code repositories, and many of the projects it hosts have datasets you can use. There is a specific project for public datasets aptly called Awesome Public Datasets.
Like Kaggle, the datasets available on GitHub are a side benefit of the site’s real purpose, which in GitHub’s case is hosting code repositories. It is not a data repository optimized for discovering datasets, so you might need to get a little creative to find what you’re looking for, and it won’t have the same variety as Google or Kaggle.
- Government Sources
Many government agencies make their data freely available online, allowing anyone to download and use public datasets. You can find a wide variety of government data from municipal, state, federal, and international sources.
These datasets are great for students and for anyone focusing on the environment, the economy, healthcare (there is a lot of this kind of data thanks to COVID-19), or demographics. Keep in mind that these aren’t the most stylish sites out there; they mostly focus on function rather than style.
- FiveThirtyEight
FiveThirtyEight is a data journalism website that occasionally makes its datasets available. Its original focus was sports, but it has since expanded to pop culture, science, and (most famously) politics.
The datasets made available by FiveThirtyEight are highly organized and specific to their journalistic output. Unlike the other options on this list, you’ll likely end up browsing their inventory rather than searching it. And you might come across some fun and interesting datasets, like 50 years of World Cup doppelgangers.
- Data.world
Data.world is a data catalog service that simplifies collaboration on data projects. Most of these projects make their datasets available free of charge.
Anyone can use data.world to create a workspace or a project that hosts a dataset. A wide variety of data is available, but it is not easy to navigate. You will need to know what you are looking for to see results.
Data.world requires a login to access its free community plan, which lets you create your own projects and datasets and gives you access to others’. You will need to pay for access to additional projects, datasets, and repositories.
- Newsdata.io news datasets
Newsdata.io is a news API that collects worldwide news data daily and offers it through that API. They also provide free news datasets, and, best of all, you can build a news dataset tailored to your requirements with the Newsdata.io news API in Python, though this can take a while when you are fetching large amounts of data.
- AWS Public Data sets
Amazon makes large datasets available on its Amazon Web Services platform. You can download the data and use it on your own computer, or analyze it in the cloud using EC2 and Hadoop via EMR.
Amazon has a page that lists all the datasets available to browse. You will need an AWS account, although Amazon provides a free tier for new accounts that lets you explore the data at no cost.
- Wikipedia
Wikipedia is a free, online, community-edited encyclopedia. Wikipedia contains an astonishing expanse of knowledge, with pages on everything from the Ottoman-Habsburg wars to Leonard Nimoy.
As part of Wikipedia’s commitment to the advancement of knowledge, they offer all of their content free of charge and regularly generate dumps of all articles on the site.
In addition, Wikipedia offers a history of changes and activities, which allows you to follow the evolution of a page on a topic over time and to know who contributes to it. You can find different ways to download the data on the Wikipedia site. You will also find scripts to reformat the data in various ways.
- UCI Machine Learning Repository
The UCI Machine Learning Repository is one of the oldest sources of datasets on the web. While the datasets are user-supplied and therefore have varying levels of documentation and cleanliness, the vast majority are clean and ready to use. UCI is a great first stop when looking for interesting datasets.
The data can be downloaded directly from the UCI Machine Learning repository, without registration. These datasets tend to be quite small and don’t have a lot of nuance, but they are useful for machine learning.
Quality data gives you quality work
Free data is great; high-quality free data is better. If you want to do great work with the data you find, you need to do your due diligence and make sure it’s good-quality data by asking a few questions.
Should I trust the data source?
First, consider the overall reputation of your data source. Ultimately, datasets are created by humans, and those humans may have specific agendas or biases that can translate into your work.
All of the data sources we have listed here are reliable, but plenty of data sources out there are not. The one caveat with our list is that community-provided collections, such as those on data.world or GitHub, can vary in quality. If you have doubts about the reputation of your data source, compare it with similar sources on the same topic.
Could the Data Be Incorrect?
Next, examine your data set for any inaccuracies. Again, humans create these datasets and humans are not perfect. There may be errors in the data which, using a few quick tips, you can quickly identify and correct.
First tip: estimate a reasonable minimum and maximum for each of your columns. Then use the filtering and sorting options to check whether any values in your dataset fall outside that range.
Let’s say you have a small dataset on used car prices. You would expect the price data to fall somewhere between $7,000 and $20,000 or so. When you sort the price column from low to high, the lowest price probably shouldn’t be far below $7,000.
But humans make mistakes and enter data incorrectly: instead of $11,000.00, someone might type $1,100.00 or $110.00. Another common example is that people sometimes don’t want to provide real data for things like phone numbers, so you can end up with a lot of 9999999999 or 0000000000 in those columns.
Also, pay attention to the column headings. A field might be titled “% occupied” and contain entries like 0.80 or 80. Both could mean 80% but would show up differently in the final dataset.
Then check for errors. If they are simple and obvious mistakes, correct them; if an entry is clearly wrong and can’t be fixed, remove it from the dataset so it doesn’t skew your results.
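The range and placeholder checks above can be sketched in a few lines of Python; the records, column names, and thresholds below are hypothetical, and in practice you would load the rows from a CSV file:

```python
# Hypothetical used-car records (normally loaded from a CSV file).
cars = [
    {"model": "sedan", "price": 11000.00, "phone": "5551234567"},
    {"model": "coupe", "price": 1100.00, "phone": "5559876543"},   # typo: meant $11,000
    {"model": "wagon", "price": 14500.00, "phone": "9999999999"},  # placeholder phone
]

PRICE_MIN, PRICE_MAX = 7_000, 20_000         # expected price range for this dataset
PLACEHOLDERS = {"9999999999", "0000000000"}  # common junk entries for phone numbers

# Flag rows whose price falls outside the expected range.
out_of_range = [c for c in cars if not PRICE_MIN <= c["price"] <= PRICE_MAX]

# Flag rows with placeholder phone numbers.
junk_phones = [c for c in cars if c["phone"] in PLACEHOLDERS]

print(len(out_of_range))  # 1 (the $1,100.00 entry)
print(len(junk_phones))   # 1 (the 9999999999 entry)
```

The same idea scales to real datasets with a spreadsheet filter or a library like pandas, where filtering and sorting take one line per column.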
Could the Data Be Incomplete?
It is very common for a dataset to have missing data. Before you start working with a dataset, it is a good idea to check for null or missing values. If there are a lot of NULL values, the dataset is incomplete and may not be good to use.
In Excel, you can do this using the COUNTBLANK function; for example, COUNTBLANK(B1:B3) returns the number of blank cells in that range.
Too many null values probably mean an incomplete dataset. If there are some null values, but not too many, you can replace them with 0 using SQL, or do it manually.
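The same count-blanks-then-fill check can be sketched in plain Python; the rows and column name here are invented for illustration, with None playing the role of NULL:

```python
# Hypothetical rows with one missing value (None stands in for NULL).
rows = [
    {"city": "Buffalo", "avg_winter_temp": -4.6},
    {"city": "Miami", "avg_winter_temp": None},   # missing value
    {"city": "Denver", "avg_winter_temp": -1.3},
]

# Count the blanks, like COUNTBLANK(B1:B3) over the same range in Excel.
blanks = sum(1 for r in rows if r["avg_winter_temp"] is None)
print(blanks)  # 1

# A few nulls can be replaced with 0, as with the SQL approach mentioned above;
# too many, and the dataset is probably not worth using.
filled = [
    {**r, "avg_winter_temp": 0 if r["avg_winter_temp"] is None else r["avg_winter_temp"]}
    for r in rows
]
```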
How Do You Know if the Data Is Skewed?
Understanding how your dataset is skewed will help you choose the right data to analyze. It is helpful to use visualizations to see how skewed your dataset is, since it is not always obvious just from looking at the numbers.
For numeric columns, use a histogram to see each column’s distribution (normal, left-skewed, right-skewed, uniform, bimodal, etc.).
It is hard to make strict recommendations about what to do next, since that depends on the dataset, but overall the way it is skewed will give you a general idea of the quality of the data and suggest which columns to use in the analysis. You can then use this general idea to avoid misrepresenting the data.
For non-numeric columns, use a frequency table to see how many times each value appears. In particular, check whether one value dominates; if so, your analysis may be limited by the low diversity of values. Again, this is just to give you a general idea of the quality of the data and point to the relevant columns to use.
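Both checks can be sketched with Python’s standard library; the column values below are invented for illustration:

```python
from collections import Counter

# Non-numeric column: a frequency table reveals whether one value dominates.
conditions = ["used", "used", "used", "new", "used", "certified", "used"]
freq = Counter(conditions)
print(freq.most_common(1))  # [('used', 5)], so "used" dominates the column

# Numeric column: a crude text histogram shows the shape of the distribution.
prices = [7500, 8200, 8900, 9100, 9400, 12000, 12500, 19800]
bins = Counter((p // 5000) * 5000 for p in prices)  # bucket prices into $5,000 bins
for lo in sorted(bins):
    print(f"{lo:>6}-{lo + 4999:>6}: {'#' * bins[lo]}")  # longest bar = most common bin
```

Here the price histogram tails off to the right (right-skewed), which is typical for price data; a spreadsheet chart or BI tool would show the same shape graphically.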
You can create these visuals and frequency tables in Excel or Google Sheets using CSV, but you might want to turn to a Business Intelligence (BI) tool for complex data sets.
Use free datasets
Once you have your data and are confident in its quality, it’s time to put it to work. You can go a long way with tools like Excel, Google Sheets, and Google Data Studio, but if you really want to do your best work with data, you need to be familiar with the real deal: a BI platform.
A BI platform will provide powerful data visualization capabilities for any data set, from small CSVs to large data sets hosted in data warehouses, such as Google BigQuery or Amazon Redshift. You can play around with your data to create dashboards and even collaborate with others.