Completing missing combinations of categories in our data with pandas.MultiIndex!

#python #datascience #tutorial #pandas

The pandas.MultiIndex is a powerful tool for handling multi-level indexing in pandas.DataFrames giving us increased flexibility for manipulating, preparing, and analyzing complex datasets

from pandas import MultiIndex

Let's investigate how we can leverage MultiIndex to complete missing combinations of categories in our datasets with an incredibly elegant solution!

Understanding missing combinations of categories
What is MultiIndex?
Creating a MultiIndex with all possible combinations of categories
Reindexing our DataFrame to align with the MultiIndex
Conclusion
Additional resources

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

christophergreening.com

Understanding missing combinations of categories

When it comes to data preparation, a common scenario we encounter is categorical data that spans across multiple levels of our dataset. While working with this data it's important to understand and identify where our categories might contain missing combinations

A good example of this could be sales data that spans several geographic regions and product categories.

A missing combination might indicate that:

a particular product is not sold in a certain region
OR that the data itself for that region-product combination is missing!

Let's take this a step further and look at a simple dataset of spam and eggs sales in New York, Texas, and California - can you tell which combinations of region and product are missing sales data?:

       region product  sales
0    New York    spam      9
1    New York    eggs      2
2       Texas    eggs      5  <- ?
3  California    spam      7  <- ?

In this case, Texas is missing a row for spam sales and California is missing a row for eggs!

Some important questions to consider:

Were there zero sales so those rows were ommitted?
Are those products respectively not offered in Texas or California?
How will these missing combinations affect our analyses?

And instead of just leaving these combinations out it might be imperative to complete the missing categories and fill the associated values with zero (or NA) to give us a more complete picture of our dataset

       region product  sales
0    New York    spam      9
1    New York    eggs      2
2       Texas    spam      0  <- :D
3       Texas    eggs      5
4  California    spam      7
5  California    eggs      0  <- :D

What is MultiIndex?

As mentioned in the introduction, MultiIndex is a powerful tool for managing DataFrame's that contain nested layers, categories, and/or segmentations

In our sales data example from the prior section, we have multiple indices that categorize our sales data by:

region
product

By leveraging MultiIndex, we're able to encode this hierarchy into a DataFrame and gain access to an elegant toolkit for manipulating, preparing, and analyzing the different levels of our data

Creating a MultiIndex with all possible combinations of categories

Let's take our spam and eggs raw sales data from earlier now and store it in a DataFrame (let's call it sales_df)

import pandas as pd

sales_df = pd.DataFrame({
    "region": ["New York", "New York", "Texas", "California"],
    "product": ["spam", "eggs", "eggs", "spam"],
    "sales": [9, 2, 5, 7]
})

To create a MultiIndex that contains every possible combination of the unique values in region and product, we can leverage the pd.MultiIndex.from_product method by passing as arguments:

a list of lists containing every unique region and every unique product in our dataset
and a list of strings containing the names of our columns

unique_categories = [
    sales_df['region'].unique(), 
    sales_df['product'].unique()
]
names = ["region", "product"]
multiindex = pd.MultiIndex.from_product(
    unique_categories,
    names=names
)

>>> print(multiindex)
MultiIndex([(  'New York', 'spam'),
            (  'New York', 'eggs'),
            (     'Texas', 'spam'),  <- wow!
            (     'Texas', 'eggs'),
            ('California', 'spam'),
            ('California', 'eggs')], <- great!
           names=['region', 'product'])

Check it out! By taking the cross-product of our unique categories, MultiIndex went ahead and created every possible combination of region and product for us

Thanks MultiIndex, you're the best!

Reindexing our DataFrame to align with the MultiIndex

And for the grand finale! We will reindex our DataFrame to align with our MultiIndex completing missing combinations and filling them with zero in the process!

To do this, we will:

set our DataFrame's index on columns region and product using our names list from the previous section
reindex our DataFrame using our multiindex and filling missing values with zero
and reset the index to remove the encoded hierarchy

sales_df = (
    sales_df
    .set_index(names)
    .reindex(multiindex, fill_value = 0)
    .reset_index()
)

>>> print(sales_df)
       region product  sales
0    New York    spam      9
1    New York    eggs      2
2       Texas    spam      0  <- yay!
3       Texas    eggs      5
4  California    spam      7
5  California    eggs      0  <- fantastic!

Conclusion

And just like that, we've learned how we can use pandas.MultiIndex to complete missing combinations with an incredibly elegant solution

While our product-region example was trivial, this will scale to an arbitrary amount of categorizations (including time series i.e. dates, weeks, and months!)

So get out there and go complete those missing combinations, I believe in you!

If you want to take this a step further and practice with sample code and data, I've pulled together a full working example for you to explore on GitHub!

Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website 😄

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

christophergreening.com

Cheers!