Omale Happiness Ojone

Data Cleaning

Data cleaning is the process of identifying errors in a dataset and then rectifying them.
The main aim of data cleaning is to identify and remove errors and duplicate records in order to create a reliable dataset.
We will use the fish dataset as the basis for this tutorial.

Fish Dataset

The Fish dataset is a machine learning dataset in which the task is to predict the weight of a fish.
You can access the dataset here:
https://www.kaggle.com/aungpyaeap/fish-market

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the fish dataset into a DataFrame
fish = pd.read_csv("Fish.csv")
```

What does the data look like?

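To get a first look at the data, we can print the first few rows with head(). A minimal sketch:

```python
# Preview the first five rows of the dataset
print(fish.head())

# Column types and non-null counts
fish.info()
```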

Fill-Out Missing Values

One of the first steps in fixing errors in your dataset is to find incomplete values and fill them in. Much of the data you work with can be categorized, and in most cases it is best to fill in missing values based on those categories, or to create an entirely new category for the missing values.
If a column is numerical, you can fill its missing values with the mean or median.
Let's check our dataset:

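We can count the missing values in each column with isnull(). The fillna() line below is a hedged sketch: it uses the numeric Weight column from the Kaggle dataset and is a no-op when nothing is missing:

```python
# Count missing values in each column
print(fish.isnull().sum())

# If a numeric column did contain missing values, one option is to
# fill them with the column median (a no-op here, since nothing is missing)
fish["Weight"] = fish["Weight"].fillna(fish["Weight"].median())
```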

As you can see, in this case, we do not have missing values.

Removing rows with missing values

One of the simplest things to do in data cleaning is to remove rows with missing values. This may not be ideal if a large share of your training data is affected.
If only a few rows have missing values, removing them can be the right approach. You will have to be very sure that the rows you are deleting do not contain information that is missing from the remaining rows of the training data.
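
In pandas this is a one-liner with dropna(). A minimal sketch:

```python
# Drop every row that contains at least one missing value
fish_clean = fish.dropna(axis=0)

# Compare row counts before and after
print(fish.shape, fish_clean.shape)
```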

Note: As you can see, in this case, we do not have missing values. However, this is not always the case.

Fixing errors in the Dataset

Ensure there are no typographical errors or inconsistencies in capitalization.
Go through your dataset, identify such errors, and fix them so that your training set is as error-free as possible. This will help your machine learning models yield better results.
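
As a sketch, a text column such as Species can be normalized with pandas string methods; the exact steps depend on the errors you actually find in your data:

```python
# Strip stray whitespace and standardize capitalization in Species
fish["Species"] = fish["Species"].str.strip().str.capitalize()
print(fish["Species"].unique())
```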

Identify Columns That Contain a Single Value

Columns that contain only a single observation or value are probably useless for modeling.
Such columns are referred to as zero-variance predictors: because the predictor displays no variation, its variance (the average squared deviation from the mean) is zero.
You can detect columns with this property using the nunique() pandas function, which reports the number of unique values in each column.

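A minimal sketch of this check:

```python
# Report the number of unique values in each column
print(fish.nunique())
```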

Delete Columns That Contain a Single Value

Variables or columns that contain a single value should probably be removed from your dataset.
Suppose, for example, that the output above showed a column such as Species containing only a single unique value.
Columns are relatively easy to remove from a NumPy array or pandas DataFrame.
One approach is to record all columns that have a single unique value, then delete them from the DataFrame by calling the drop() function.

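Putting the two steps together, a minimal sketch:

```python
# Record all columns that contain a single unique value
counts = fish.nunique()
to_drop = [col for col in counts.index if counts[col] == 1]
print(to_drop)

# Delete those columns from the DataFrame
fish = fish.drop(columns=to_drop)
print(fish.shape)
```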

Identify Rows That Contain Duplicate Data

Rows that contain identical data are probably useless, if not dangerously misleading, during model evaluation.
A duplicate row is one whose values match another row in every column.
The pandas function duplicated() reports whether each row is a duplicate: a row is marked False if it is not a duplicate and True if it is. By default, the first occurrence of a duplicated row is marked False, as we might expect.

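A minimal sketch of the duplicate check:

```python
# Mark each row as a duplicate (True) or not (False)
dups = fish.duplicated()
print(dups.any())   # True if any duplicate rows exist
print(fish[dups])   # show the duplicate rows, if any
```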

First, the presence of any duplicate rows is reported, and in this case we can see that there are none (False).
If there were duplicates, we could use the pandas function drop_duplicates() to drop the duplicate rows.
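
A minimal sketch of that step:

```python
# Remove duplicate rows, keeping the first occurrence
fish = fish.drop_duplicates()
print(fish.shape)
```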

Conclusion

Data cleaning is a critical step in the success of any machine learning project. For most machine learning projects, about 80 percent of the effort is spent on data cleaning and preparation. We have discussed some of the key steps here.
