GharamElhendy

Most Common Issues with Real-Life Data: How to Check for Them, and How to Fix Them

1. Missing Data

How to Check?

import pandas as pd

df = pd.read_csv('name_of_csv_file.csv')
df.info()


The RangeIndex shows the total number of rows, and beside each column you'll find its non-null count. If a count doesn't match the total, you have missing data in that column.
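For illustration, the output might look something like this (the column names and counts here are made up):

&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   duration   985 non-null    float64
 1   timestamp  1000 non-null   object

Here "duration" has only 985 non-null values out of 1000 rows, so 15 values are missing.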

How to Deal with It?

This varies according to the situation at hand: why is the data missing, and do the gaps appear to be random or systematic?
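To see exactly which columns are affected and by how much, a quick check with standard pandas on the df loaded above:

# count the missing values in each column
df.isnull().sum()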

One way to go about this issue is to fill the missing values with the column's mean.

For example, suppose you have missing values for the duration a user viewed a product on your website, stored in a variable named "duration".

mean = df['duration'].mean()
df['duration'] = df['duration'].fillna(mean)


The second line can be written as:

df['duration'].fillna(mean, inplace=True)


Both apply the change (the values you just calculated) to the original DataFrame. Note, though, that recent pandas versions warn against calling fillna with inplace=True on a single column; assigning back, as in the first form, is the safer pattern.

2. Duplicates

How to Check?

df.duplicated()


This displays "False" for every row that isn't a duplicate, and "True" for every row that duplicates an earlier one.

I.e. the first occurrence is marked "False", while the second occurrence (the duplicate) is marked "True".
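If you want to flag every copy, including the first occurrence, duplicated() takes a keep parameter (standard pandas):

# mark all copies of a duplicated row, not just the later ones
df.duplicated(keep=False)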

You can also check with:

sum(df.duplicated())


This is handier for bigger data sets: instead of a long column of booleans, it shows you just how many duplicate rows you have.

How to Deal with It?

df.drop_duplicates(inplace=True)


Again, (inplace=True) is used to apply changes to the original data set.
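By default, a row counts as a duplicate only when all of its columns match. If duplicates should be judged on specific columns instead, drop_duplicates takes a subset parameter (the column name here is hypothetical):

# keep the first occurrence, dropping rows with a repeated user_id
df.drop_duplicates(subset=['user_id'], keep='first', inplace=True)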

3. Incorrect Data Types

How to Check?

df = pd.read_csv('name_of_csv_file.csv')
df.info()


For example, if beside the variable "timestamp" you find "object", your data set is treating the timestamp as a string (str), which is not ideal. The proper representation is a datetime object.

In this case, we'll use:

df['timestamp'] = pd.to_datetime(df['timestamp'])


Note: Data type corrections aren't saved back to the CSV file. So, the next time you parse the file, make sure to apply them again.
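Alternatively, read_csv can do the conversion at load time via its parse_dates parameter, which saves the extra step (standard pandas):

# parse the timestamp column as datetime while reading the file
df = pd.read_csv('name_of_csv_file.csv', parse_dates=['timestamp'])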


Top comments (3)

sudarshan

Nice one!
But for df['duration'] = df['duration'].fillna(mean), I'd suggest SimpleImputer() instead of fillna(). Or, if you still go with fillna(), use the median as the filling value.

GharamElhendy

I will read more into this, but it would be great if you could explain SimpleImputer()'s advantage.

Also, why do you think using the median is better? Are there certain data sets with which the mean is better and ones with which the median is better?

Thanks in advance! :)

sudarshan

Regarding your first question:
sklearn.preprocessing.Imputer (since replaced by sklearn.impute.SimpleImputer()) is used for imputing missing values based on the values already present in the dataset.
Take a look here => scikit-learn.org/stable/modules/ge...
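A minimal sketch of that approach, assuming the "duration" column from the post:

from sklearn.impute import SimpleImputer

# impute missing durations with the column median
imputer = SimpleImputer(strategy='median')
df[['duration']] = imputer.fit_transform(df[['duration']])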

For the second question:
The median is a more robust estimate than the mean, because the mean can be dragged drastically off by outliers in your dataset, like age = 198.
The median, being the middle of the sorted values, is barely affected by such extremes. Hence, I think you should prefer the median over the mean.
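A quick demonstration of the difference (the numbers are made up):

import pandas as pd

ages = pd.Series([25, 30, 35, 40, 198])  # one outlier
print(ages.mean())    # 65.6, dragged up by the outlier
print(ages.median())  # 35.0, unaffected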

Hope it will help :)