Header photo by Emily Morter on Unsplash
A key part of data science is cleaning the data before working with it. For example, it's common to find missing values in a dataset, either because that data was not collected or because it simply doesn't exist.
Depending on which data is missing, and even on why it is missing, we handle it in different ways. So we need to analyze each dataset column carefully and decide whether to remove all rows with missing values, remove the column itself, or fill the cells with some data (and decide which data that could be).
Count missing values
Before handling the missing values, we should know exactly how many of them we have. The code below counts all missing values in a pandas DataFrame df and prints the count for each column.
df.isnull().sum()
Output:
budget 0
genres 28
homepage 3091
id 0
keywords 412
original_language 0
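If the dataset has many columns, it can also help to sort these counts so the columns with the most missing values appear first. A small variation, still assuming the same df:
df.isnull().sum().sort_values(ascending=False)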
Check the proportion of missing data
So, we know we have missing values in our dataset, and now we may figure out the proportion of missing values relative to all the data. To do this we sum the missing values, divide by the total number of cells, and multiply by 100, as we can see in the code below.
import numpy as np
import pandas as pd

def percent_missing(df: pd.DataFrame) -> float:
    # total cells = number of rows x number of columns
    total_data_points = np.prod(df.shape)
    total_missing = df.isnull().sum().sum()
    return (total_missing / total_data_points) * 100

print(f"{percent_missing(df)}%")
Output:
3.863904504129364%
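If we prefer the proportion per column instead of for the whole DataFrame, pandas can do this in one line, since isnull().mean() gives the fraction of missing cells in each column. A quick sketch:
df.isnull().mean() * 100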
Drop rows with missing values
One approach to handling missing data is simply to drop all rows containing missing values. Again, the decision should only be made after careful analysis, but if we decide to drop the rows, the code is quite simple.
df.dropna()
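Keep in mind that dropna() returns a new DataFrame instead of modifying df in place, so we usually assign the result back. We can also restrict the check to specific columns with the subset parameter; here is a sketch using the homepage column from the output above:
df = df.dropna(subset=['homepage'])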
Drop columns with missing values
If we decide to drop the columns instead, we just tell the dropna() method to do so with the axis parameter.
df.dropna(axis=1)
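By default this drops every column that has at least one missing value, which may be too aggressive. If we only want to drop columns where all the values are missing, the how parameter handles that:
df.dropna(axis=1, how='all')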
Fill in missing values
If we don't want to drop the rows or columns, we may fill the cells with some data instead. As always, we need to analyze the data and decide what to fill it with. Some approaches are: filling with 0, with a blank space, with the most common value observed in that column, with the last or the next valid value observed, or with statistical measures like the mean or median.
Using pandas
Pandas has the fillna() method to deal with it. We specify the value we want to fill with and pass it as an argument to the method. In the code below we are filling all missing cells with 0.
df.fillna(0)
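The same idea covers some of the other approaches mentioned earlier. As a sketch (assuming budget from the output above is a numeric column), we could propagate the last valid value forward or fill a column with its mean:
df.ffill()  # propagate the last valid value forward
df['budget'] = df['budget'].fillna(df['budget'].mean())  # fill with the column mean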
Using Scikit-Learn
With scikit-learn, we have more sophisticated ways to deal with the problem. For example, we may fill a categorical column with the most frequent value observed in it, or fill a numerical column with the mean of the data. The sklearn SimpleImputer class has many strategies for imputing data, as you can see in its documentation. The code below uses the most_frequent strategy, which replaces missing values using the most frequent value along each column and can be used with strings or numeric data.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
new_df = imputer.fit_transform(df)
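One detail worth knowing: fit_transform() returns a NumPy array, so the column names are lost. If we want to keep working with a DataFrame, we can rebuild it from the original columns, as in this small sketch:
import pandas as pd
new_df = pd.DataFrame(new_df, columns=df.columns)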
In a future article, I will cover the use of scikit-learn in more detail. See you!
Documentation
Pandas dropna()
Pandas fillna()
Scikit-learn SimpleImputer()