Alisha Rana

Finding out the Missing Values Using Missingno and Pandas

The first step in data cleaning for me is typically looking for missing data. Missing data can have different sources: maybe it isn't available, maybe it got lost, or maybe it got damaged. Normally that's not an issue, because we can fill it in with the average or something similar. But I think missing data is often very informative in itself.
For instance, if you have an online clothing store and a customer has never clicked on the baby category, it is likely that they do not have children. You can learn a lot simply from the information that is not there.
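As a rough illustration of filling gaps with the average while still keeping the "this was missing" information, here is a minimal sketch. The tiny `toy` dataframe and the `total_bedrooms_missing` column name are made up for this example and are not part of the housing data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, np.nan, 280.0]})

# Record where the value was missing before filling, so that information is not lost
toy["total_bedrooms_missing"] = toy["total_bedrooms"].isna()

# Fill the gaps with the column mean (mean() skips missing values by default)
mean_value = toy["total_bedrooms"].mean()
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(mean_value)

print(toy)
```

The extra Boolean column is one possible way to keep the missingness available as a feature after the fill.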

The missingno Library
Missingno is a great Python module that provides a set of visualisations to help you understand the presence and distribution of missing data within a pandas dataframe. This can take the shape of a dendrogram, heatmap, barplot, or matrix plot.
We can determine where missing values occur, the magnitude of the missingness, and whether any of the missing values are associated with each other using these graphs.
Using the pip command, you may install the missingno library:

pip install missingno

Importing Libraries and Loading the Data

import pandas as pd
import missingno as msno

# Load the housing dataset into a dataframe
df = pd.read_csv('housing.csv')
# Preview the first five rows
df.head()

Quick Analysis with Pandas
Before we utilise the missingno library, there are a few features in the pandas library that can provide us with an idea of how much missing data there is.

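The first option is the .describe() method. It returns a table of summary statistics for the numeric columns, such as the mean, maximum, and minimum values. Its count row gives the number of non-null values per column, so a column with a lower count than the others contains missing data.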

df.describe()

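To make the link between .describe() and missing data concrete, here is a small sketch that reads the count row and compares it with the number of rows. The `toy` dataframe is a hypothetical stand-in, not the real housing.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing dataset
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# The "count" row of describe() counts non-null values only
counts = toy.describe().loc["count"]

# Subtracting from the number of rows gives the missing values per column
missing = len(toy) - counts
print(missing)
```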
Using the .info() method, we can go one step further. It gives you a summary of the dataframe, including the number of non-null values in each column.

df.info()
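Note that .info() prints its report instead of returning it. If you want the non-null counts as data, .count() returns them as a Series. A minimal sketch, again on a made-up `toy` dataframe:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing dataset
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# info() writes text to a buffer (stdout by default); buf lets us capture it
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# count() returns the same non-null counts as a Series we can work with
non_null = toy.count()
print(non_null)
```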

Yet another quick technique is to count the missing values in each column:

df.isna().sum()

This produces a count of the missing values in each column of the dataframe. The isna() function finds missing values and returns a Boolean result for each element in the dataframe, and sum() adds up the True values column by column.
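The same Boolean mask can also give the share of missing values, since the mean of a Boolean column is the fraction of True values. Here is a short sketch on a hypothetical `toy` dataframe; the `summary` table and its column names are my own choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing dataset
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

mask = toy.isna()                   # Boolean dataframe, True where a value is missing
missing_count = mask.sum()          # True counts as 1, so this is a per-column count
missing_percent = mask.mean() * 100 # fraction of True values, as a percentage

summary = pd.DataFrame({"missing": missing_count, "percent": missing_percent})
print(summary)
```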

Using missingno to Identify Missing Data
There are four types of plots in the missingno library for visualising data completeness: barplots, matrix plots, heatmaps, and dendrogram plots.

msno.matrix(df)

In the resulting matrix plot, the total_bedrooms column shows some missing data. White gaps in a column mark the rows where the value is missing.

msno.bar(df)

The barplot provides a simple plot where each bar represents a column within the dataframe. The height of the bar indicates how complete that column is, i.e., how many non-null values are present.

Notice that the bar for total_bedrooms is shorter than the others.
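The heatmap and dendrogram plots are about whether missing values in one column tend to occur together with missing values in another. As a rough sketch of that idea using pandas only, you can correlate the Boolean missingness masks. The `toy` dataframe below is made up, and this is an approximation of what a nullity heatmap displays, not missingno's exact implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: total_bedrooms and households are missing in the same rows
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan, 280.0],
    "households": [126.0, np.nan, 177.0, np.nan, 259.0],
    "median_income": [8.3, 8.3, np.nan, 5.6, 3.8],
})

# 1 where a value is missing, 0 otherwise
mask = toy.isna().astype(int)

# Correlation is undefined for constant columns (fully complete or fully empty),
# so keep only columns where the mask actually varies
varying = mask.loc[:, mask.nunique() > 1]

# Values near 1 mean two columns tend to be missing in the same rows
nullity_corr = varying.corr()
print(nullity_corr)

# With missingno installed, msno.heatmap(toy) draws a plot of this kind
```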

Summary
Identifying missing data before using machine learning is a critical step in the data quality pipeline. A few pandas calls and the visualisations in the missingno library make this straightforward.

Thank you for your time!
