(A Japanese translation is available here.)
During data analysis, we need to deal with missing values. Handling missing data is a deep enough subject to fill an entire book. However, before doing anything to missing values, we need to know the pattern in which they occur. This article describes easy techniques for visualizing missing value occurrence with Python. The techniques are useful in the early stages of exploratory data analysis.
I've uploaded a Jupyter notebook to my GitHub repo. You can run it using Binder by clicking the badge below.
I'm using the Titanic train dataset from Kaggle as an example. To begin with, the following code is assumed to have been executed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
# Confirm the number of missing values in each column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
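df.info() reports non-null counts, so the number of missing values has to be worked out by subtraction. As a minimal sketch, the counts can also be obtained directly with pandas. The small DataFrame below is a made-up stand-in for train.csv, and the names missing_count, missing_ratio and summary are only illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv (values are made up for illustration).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Count missing values per column and express them as a share of all rows.
missing_count = df.isnull().sum()
missing_ratio = df.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
print(summary)
```

On the real data, the same two lines can be applied to the df loaded from train.csv.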
Method 1: seaborn.heatmap
The first method uses seaborn.heatmap. A single line of code visualizes the location of missing values.
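A likely form of that one-liner is sketched here, assuming the common idiom of passing the boolean mask from df.isnull() to seaborn.heatmap. The DataFrame is a made-up stand-in for train.csv, and the cbar=False option and the Agg backend line are my assumptions, not necessarily what the notebook uses:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for train.csv (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(30, 10, 50),
    "Fare": rng.uniform(5, 100, 50),
    "Cabin": [f"C{i}" for i in range(50)],
})
df.loc[rng.choice(50, size=10, replace=False), "Age"] = np.nan  # scattered gaps
df.loc[5:44, "Cabin"] = np.nan  # label slice is inclusive: 40 rows become missing

# Rows follow the index, columns follow the DataFrame columns,
# and each missing cell is drawn in a contrasting colour.
ax = sns.heatmap(df.isnull(), cbar=False)
```

With the Titanic data, the same call would be applied to the df loaded from train.csv. Outside a notebook, the figure still has to be shown or saved explicitly.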
Reading the Titanic heatmap against the index, I can see that
- the Age column has missing values scattered throughout,
- the Cabin column is almost entirely missing, with the gaps also scattered throughout, and
- the Embarked column has a few missing values in the beginning part.
This is not the case for the Titanic dataset, but with time series data in particular, we need to know whether missing values are scattered sparsely or concentrated in one big chunk. The heatmap shows such a tendency immediately. Also, if the locations of missing values in two or more columns are correlated, that correlation becomes visible. (Again, this is not the case here, but it is important to know that no such correlation exists in this dataset.)
This single line of code tells us a lot about how missing values occur.
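For time series, the sparse-versus-chunk distinction can also be checked numerically. As a minimal sketch on a made-up daily series (the data and the names is_missing, run_id and gap_lengths are hypothetical), the lengths of consecutive runs of missing values can be computed with plain pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical time series: one isolated gap and one block of three consecutive gaps.
values = pd.Series(
    [1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0, 9.0, 10.0],
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)

is_missing = values.isnull()
# A new run starts whenever the missing / not-missing state changes.
run_id = (is_missing != is_missing.shift()).cumsum()
# Keep only the runs made of missing values and measure their lengths.
gap_lengths = is_missing[is_missing].groupby(run_id[is_missing]).size()

print(gap_lengths.tolist())
```

Many runs of length 1 suggest sparse gaps, while a single long run corresponds to the big chunk that shows up as a solid band in the heatmap.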
Method 2: missingno module
If you want to go further, the missingno module will be useful.
To begin with, install and import it.
pip install missingno
import missingno as msno
If you want a result similar to the seaborn.heatmap described earlier, use missingno.matrix.
In addition to the heatmap, there is a narrow strip on the right side of this diagram. It is a line plot of each row's data completeness. In this dataset, every row has 10 to 12 valid values and hence 0 to 2 missing values.
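The quantity traced by that completeness line can be reproduced with plain pandas. This is a minimal sketch on a made-up DataFrame (the data and the name completeness are hypothetical). The plot itself would come from msno.matrix(df), which is left as a comment so that the sketch runs without missingno installed:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv (values are made up for illustration).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of non-missing values in each row: what the line on the right traces.
completeness = df.notnull().sum(axis=1)
print(completeness.min(), completeness.max())

# msno.matrix(df)  # draws the matrix plot; requires `import missingno as msno`
```

The minimum and maximum of this Series are the two extremes of row completeness.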
missingno.heatmap visualizes the correlation matrix of the locations of missing values across columns.
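To see the kind of numbers behind such a plot, the correlation of missing value locations can be computed with pandas alone. This is a sketch under assumptions: the data is made up, and dropping columns whose missingness never varies is my understanding of how missingno prepares the matrix, not something stated in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical data: A and B are missing in the same rows, C is missing
# exactly where A is present, and D has no missing values at all.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "B": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "C": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
    "D": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1 where a value is missing, 0 otherwise.
nullity = df.isnull().astype(int)
# Columns that are always present (or always missing) have no defined correlation.
nullity = nullity.loc[:, nullity.nunique() > 1]

# Pearson correlation between the missingness indicators of each pair of columns.
nullity_corr = nullity.corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together, and a value near -1 means one tends to be missing where the other is present.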
The missingno module has more features, such as a bar chart of the number of missing values in each column and a dendrogram generated from the correlation of missing value locations. For more information, the README is a good primer.
This article described two easy visualization methods. seaborn.heatmap is the first choice because it requires only seaborn, but if you need more, the missingno module will help you.
Top comments (5)
You still need to call plt.show(), right?
Actually no. If you use the magic function "%matplotlib inline" in a Jupyter notebook, you don't need to call plt.show().
How would you plot the missingno package plots into 3 subplots? E.g. have 3 subplots, one with matrix, one with heatmap and one with dendrogram?
fig, ax = plt.subplots(figsize=(25, 15), nrows=1, ncols=2)
# Visualize the number of missing values as a bar chart
# Visualize the correlation between the number of missing values in different columns as a heatmap
Maybe you can try something like this..
On the seaborn.heatmap, is there a way to show only the index of null rows on the left side of the graph?
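One possible approach, as a sketch rather than a definitive answer: turn off seaborn's own row labels and place ticks only at the rows that contain a missing value. The DataFrame and the name null_positions are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data with missing values in rows 1, 3 and 5.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan, 7.0, 8.0],
    "B": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0],
})

# 0-based positions of rows that contain at least one missing value.
null_positions = np.flatnonzero(df.isnull().any(axis=1).to_numpy())

ax = sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
# seaborn draws row i between y = i and y = i + 1, so its centre is at i + 0.5.
ax.set_yticks(null_positions + 0.5)
ax.set_yticklabels(df.index[null_positions].astype(str), rotation=0)
```

With many null rows the labels will overlap, so this works best when missing values are rare.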