The pandas.MultiIndex
is a powerful tool for handling multi-level indexing in pandas.DataFrames
giving us increased flexibility for manipulating, preparing, and analyzing complex datasets
from pandas import MultiIndex
Let's investigate how we can leverage MultiIndex
to complete missing combinations of categories in our datasets with an incredibly elegant solution!
Table of Contents
- Understanding missing combinations of categories
- What is MultiIndex?
- Creating a MultiIndex with all possible combinations of categories
- Reindexing our DataFrame to align with the MultiIndex
- Conclusion
- Additional resources
Understanding missing combinations of categories
When it comes to data preparation, a common scenario we encounter is categorical data that spans across multiple levels of our dataset. While working with this data it's important to understand and identify where our categories might contain missing combinations
A good example of this could be sales data that spans several geographic regions and product categories.
A missing combination might indicate that:
- a particular product is not sold in a certain region
- OR that the data itself for that region-product combination is missing!
Let's take this a step further and look at a simple dataset of spam
and eggs
sales in New York
, Texas
, and California
- can you tell which combinations of region and product are missing sales data?:
region product sales
0 New York spam 9
1 New York eggs 2
2 Texas eggs 5 <- ?
3 California spam 7 <- ?
In this case, Texas
is missing a row for spam
sales and California
is missing a row for eggs
!
Some important questions to consider:
- Were there zero sales so those rows were ommitted?
- Are those products respectively not offered in Texas or California?
- How will these missing combinations affect our analyses?
And instead of just leaving these combinations out it might be imperative to complete the missing categories and fill the associated values with zero (or NA
) to give us a more complete picture of our dataset
region product sales
0 New York spam 9
1 New York eggs 2
2 Texas spam 0 <- :D
3 Texas eggs 5
4 California spam 7
5 California eggs 0 <- :D
What is MultiIndex?
As mentioned in the introduction, MultiIndex
is a powerful tool for managing DataFrame
's that contain nested layers, categories, and/or segmentations
In our sales data example from the prior section, we have multiple indices that categorize our sales data by:
region
product
By leveraging MultiIndex
, we're able to encode this hierarchy into a DataFrame
and gain access to an elegant toolkit for manipulating, preparing, and analyzing the different levels of our data
Creating a MultiIndex with all possible combinations of categories
Let's take our spam
and eggs
raw sales data from earlier now and store it in a DataFrame
(let's call it sales_df
)
import pandas as pd
sales_df = pd.DataFrame({
"region": ["New York", "New York", "Texas", "California"],
"product": ["spam", "eggs", "eggs", "spam"],
"sales": [9, 2, 5, 7]
})
To create a MultiIndex
that contains every possible combination of the unique values in region
and product
, we can leverage the pd.MultiIndex.from_product
method by passing as arguments:
- a list of lists containing every unique
region
and every uniqueproduct
in our dataset - and a list of strings containing the
names
of our columns
unique_categories = [
sales_df['region'].unique(),
sales_df['product'].unique()
]
names = ["region", "product"]
multiindex = pd.MultiIndex.from_product(
unique_categories,
names=names
)
>>> print(multiindex)
MultiIndex([( 'New York', 'spam'),
( 'New York', 'eggs'),
( 'Texas', 'spam'), <- wow!
( 'Texas', 'eggs'),
('California', 'spam'),
('California', 'eggs')], <- great!
names=['region', 'product'])
Check it out! By taking the cross-product of our unique categories, MultiIndex
went ahead and created every possible combination of region
and product
for us
Thanks MultiIndex, you're the best!
Reindexing our DataFrame to align with the MultiIndex
And for the grand finale! We will reindex
our DataFrame
to align with our MultiIndex
completing missing combinations and filling them with zero in the process!
To do this, we will:
-
set our
DataFrame
's index on columnsregion
andproduct
using ournames
list from the previous section -
reindex our
DataFrame
using ourmultiindex
and filling missing values with zero - and reset the index to remove the encoded hierarchy
sales_df = (
sales_df
.set_index(names)
.reindex(multiindex, fill_value = 0)
.reset_index()
)
>>> print(sales_df)
region product sales
0 New York spam 9
1 New York eggs 2
2 Texas spam 0 <- yay!
3 Texas eggs 5
4 California spam 7
5 California eggs 0 <- fantastic!
Conclusion
And just like that, we've learned how we can use pandas.MultiIndex
to complete missing combinations with an incredibly elegant solution
While our product-region example was trivial, this will scale to an arbitrary amount of categorizations (including time series i.e. dates, weeks, and months!)
So get out there and go complete those missing combinations, I believe in you!
If you want to take this a step further and practice with sample code and data, I've pulled together a full working example for you to explore on GitHub!
Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website π
Cheers!
Top comments (2)
Hey thank you so much for the feedback my friend :D
Definitely one of my favorite techniques when it comes to prepping data, super clean