
Visualizing the patterns of missing value occurrence with Python

By Tomoyuki Aota
(A Japanese translation is available here.)

During data analysis, we need to deal with missing values. Handling missing data is a deep enough subject to fill an entire book. However, before doing anything about missing values, we need to know the pattern in which they occur. This article describes easy techniques for visualizing missing value occurrence with Python. The techniques are useful in the early stages of exploratory data analysis.

I've uploaded a Jupyter notebook in my GitHub repo. You can run it using Binder by clicking the badge below.

Binder

Prerequisite

I'm using the Titanic train dataset from Kaggle as an example. The rest of this article assumes that the following code has already been executed.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
# Confirm the number of missing values in each column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
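
df.info() reports non-null counts, so the number of missing values has to be worked out by subtraction. As a minimal sketch, the missing counts and ratios can be tabulated directly with pandas. The small `sample` frame below is a hypothetical stand-in, not the real Titanic data, so that the snippet runs without train.csv. In practice you would use the `df` loaded above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data (made-up values, not from train.csv).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# isnull() gives a boolean frame; summing counts the True values per column,
# and the mean of booleans is the fraction of missing values.
summary = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "ratio": sample.isnull().mean(),
})

# Put the columns with the most missing values first.
summary = summary.sort_values("missing", ascending=False)
print(summary)
```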

Method 1: seaborn.heatmap

The first method uses seaborn.heatmap. The following one-liner visualizes the locations of missing values.

sns.heatmap(df.isnull(), cbar=False)

seaborn_heatmap.png

Reading down the index (the vertical axis), I can see that

  • the Age column has missing values scattered throughout,
  • the Cabin column is almost entirely missing, with the few valid values scattered throughout, and
  • the Embarked column has very few missing values (2 of 891 rows, per the df.info() output above).

This is not the case for the Titanic dataset, but with time series data in particular, we need to know whether missing values are scattered sparsely or concentrated in one large chunk. The heatmap shows such tendencies immediately. Also, if two or more columns are correlated in where their values are missing, that correlation becomes visible. (Again, this is not the case here, but knowing that no such correlation exists in this dataset is itself important.)

This single line of code tells us a lot about how missing values occur.
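
To put a number on the "sparse versus one big chunk" question, one option is to measure the length of each run of consecutive missing values. The following is a minimal sketch with pandas only. The series `s` is made-up data, and the run-labelling trick is a common pandas idiom, not something seaborn provides.

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing runs of length 2, 1 and 3.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0, np.nan, np.nan, np.nan])

is_missing = s.isnull()

# A new run starts wherever the missing/present flag differs from the previous row,
# so the cumulative sum of those change points labels each run with its own id.
run_id = (is_missing != is_missing.shift()).cumsum()

# Sum the flags within each run: runs of valid values sum to 0, runs of
# missing values sum to their own length.
run_lengths = is_missing.astype(int).groupby(run_id).sum()
run_lengths = run_lengths[run_lengths > 0]

print(run_lengths.tolist())
print("longest gap:", run_lengths.max())
```

Many short runs suggest sparsely scattered gaps, while one long run suggests a contiguous block of missing data.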

Method 2: missingno module

If you want to go further, the missingno module is useful.
To begin with, install and import it.

pip install missingno
import missingno as msno

For a result similar to the seaborn.heatmap described earlier, use missingno.matrix.

msno.matrix(df)

missingno_matrix

In addition to the matrix itself, there is a sparkline on the right side of this diagram. It is a line plot of each row's data completeness. In this dataset, every row has 10 to 12 valid values, and hence 0 to 2 missing values.
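
The per-row completeness that this line plot shows can also be computed directly, which is handy for checking the range read off the diagram. This is a minimal sketch with pandas only. The `sample` frame is a hypothetical stand-in with made-up values, and with the real data you would use `df` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data (made-up values, not from train.csv).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Count the valid (non-null) values in each row.
valid_per_row = sample.notnull().sum(axis=1)

# The minimum and maximum correspond to the extremes of the completeness plot.
print("min valid:", valid_per_row.min(), "max valid:", valid_per_row.max())

# How many rows have each level of completeness.
print(valid_per_row.value_counts().sort_index())
```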

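Also, missingno.heatmap visualizes the correlation matrix of missing value locations between columns, that is, how strongly the presence or absence of a value in one column is related to the presence or absence of a value in another.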

msno.heatmap(df)

missingno_heatmap
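
For intuition about the numbers in such a heatmap, a similar correlation matrix can be computed with pandas alone. This is a sketch under the assumption that the plot is based on the correlation of the boolean "is missing" flags. It is not missingno's actual implementation, and the `toy` frame with columns a to d is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a and b are always missing together,
# c is missing exactly where a is present, and d is complete.
toy = pd.DataFrame({
    "a": [np.nan, 1.0, np.nan, 4.0],
    "b": [np.nan, "x", np.nan, "y"],
    "c": [1.0, np.nan, 3.0, np.nan],
    "d": [1, 2, 3, 4],
})

nullity = toy.isnull()

# Columns that are fully present (or fully missing) have zero variance,
# so their correlation is undefined. Keep only the columns that vary.
varying = nullity.loc[:, nullity.nunique() > 1]

# Correlate the 0/1 missing flags between columns.
corr = varying.astype(int).corr()
print(corr)
```

A value near 1 means two columns tend to be missing in the same rows, and a value near -1 means one tends to be missing where the other is present.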

The missingno module has more features, such as a bar chart of data completeness in each column (missingno.bar) and a dendrogram generated from the correlation of missing value locations (missingno.dendrogram). For more information, the README is a good primer.

Closing

This article described two easy visualization methods. seaborn.heatmap is the first choice because it requires only seaborn, but if you need more, the missingno module will help you.

Discussion

Rauan Akylzhanov

You still need to call plt.show(), right?

balaranga33

Actually, no. If you use the magic command "%matplotlib inline" in a Jupyter notebook, you don't need to call plt.show().

radekpjanik

How would you plot the missingno package plots into 3 subplots? E.g. have 3 subplots, one with the matrix, one with the heatmap and one with the dendrogram?

Thanks!

Prateek Srivastava

fig, ax = plt.subplots(figsize=(25, 15), nrows=1, ncols=2)
# Visualize the number of missing values as a bar chart
msno.bar(df, ax=ax[0])
# Visualize the correlation between the number of missing values in different columns as a heatmap
msno.heatmap(df, ax=ax[1])

Maybe you can try something like this..