DEV Community

Jingles (Hong Jing)


Clean your data

Before you can tell any data stories, you need data. If you are working in an organisation, on a class assignment, or with a Kaggle dataset, you already have the data; what remains is to figure out the story you want to tell.

If you don’t have data yet, define your hypothesis before collecting any. Your hypothesis should be clear and measurable; it will guide you to find or collect a suitable dataset for analysis. Try searching for an openly available dataset that might answer your key question. If your question is niche, however, you may have to build your own data collection system.

Before you can extract any insights from your data, you have to make sure the data is correct. This process is called data cleaning. Typically, you clean data that is incomplete, inaccurate, inconsistent, or duplicated so that your results are accurate.

Identify bad data

Imagine someone spotting an error during your presentation; it would make your work less credible. For example, if you have a dataset that contains human ages, it wouldn’t make sense for someone to be 5,000 years old. You might have to remove such records first.
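As a minimal sketch of that check, the snippet below separates records with an implausible age from the rest. The records, the field names, and the 0 to 120 bounds are all assumptions made for illustration; pick bounds that fit your own data.

```python
# Hypothetical records; field names and values are made up for illustration.
records = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 5000},
    {"name": "Cy", "age": 71},
]

# Assumed plausible range for a human age.
MIN_AGE, MAX_AGE = 0, 120


def is_plausible_age(age):
    """Return True if the age falls inside the assumed valid range."""
    return MIN_AGE <= age <= MAX_AGE


# Keep the rejected rows around so you can inspect them before discarding.
clean = [r for r in records if is_plausible_age(r["age"])]
rejected = [r for r in records if not is_plausible_age(r["age"])]

print(len(clean), "kept,", len(rejected), "rejected")
```

Keeping the rejected rows in a separate list, rather than deleting them outright, makes it easier to check whether they are typos you can correct or records you really need to drop.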

Identify missing values

Missing values can appear as empty values or as out-of-range placeholders, such as “-1” or “-99” for human age. Your job is to identify and handle them. You may have to drop columns or rows that have too many missing values.
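One way to approach this, sketched here with made-up rows, is to convert the placeholder codes to a single "missing" marker and then measure how much of each column is missing. The set of placeholder codes and the column names are assumptions; a code like “-1” may be a legitimate value in other columns, so in real data you would usually define placeholders per column.

```python
# Assumed placeholder codes that stand for "missing" in this toy dataset.
SENTINELS = {"", "-1", "-99"}

# Hypothetical raw rows, as they might come out of a CSV file.
rows = [
    {"age": "34", "city": "Seattle"},
    {"age": "-99", "city": "Boston"},
    {"age": "", "city": ""},
    {"age": "-1", "city": "Austin"},
]


def normalise(value):
    """Map empty strings and placeholder codes to None."""
    return None if value.strip() in SENTINELS else value


cleaned = [{col: normalise(val) for col, val in row.items()} for row in rows]

# Share of missing values per column, to help decide whether to drop a column.
missing_share = {
    col: sum(row[col] is None for row in cleaned) / len(cleaned)
    for col in cleaned[0]
}

# Drop rows in which every field is missing.
kept = [row for row in cleaned if any(v is not None for v in row.values())]

print(missing_share)
print(len(kept), "rows kept")
```

What counts as "too many" missing values is a judgment call that depends on the dataset and the question you are asking.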

Look for outliers

Outliers are data points whose values fall outside the normal range. Unlike bad data, outliers are valid values. For example, in the Seattle Airbnb dataset, one host set the minimum rental nights to 1,000. Outliers may offer interesting stories and insights, but they may also skew your results. You have to identify them and decide how to deal with them.
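The article does not prescribe a detection method. One common convention is the 1.5 × IQR rule, which flags values far outside the middle half of the data. The sketch below applies it to a hypothetical list of minimum-night values (not the real Seattle Airbnb data) and compares the mean with the median to show how a single extreme value can skew a summary.

```python
from statistics import mean, median, quantiles

# Hypothetical minimum-night values; the 1,000 mimics the host in the example.
minimum_nights = [1, 2, 2, 3, 1, 2, 4, 3, 2, 1000]

# Quartiles, then the interquartile range (IQR).
q1, _, q3 = quantiles(minimum_nights, n=4)
iqr = q3 - q1

# 1.5 x IQR fences: a common convention, not the only possible choice.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in minimum_nights if x < lower or x > upper]

print("outliers:", outliers)
print("mean:", mean(minimum_nights), "median:", median(minimum_nights))
```

Flagging a value is only the first step. Whether you keep it, cap it, or report it separately depends on the story you want to tell.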
