DEV Community

Cover image for Completing missing combinations of categories in our data with pandas.MultiIndex!
Chris Greening
Chris Greening

Posted on • Updated on

Completing missing combinations of categories in our data with pandas.MultiIndex!

The pandas.MultiIndex is a powerful tool for handling multi-level indexing in pandas.DataFrames giving us increased flexibility for manipulating, preparing, and analyzing complex datasets

from pandas import MultiIndex
Enter fullscreen mode Exit fullscreen mode

Let's investigate how we can leverage MultiIndex to complete missing combinations of categories in our datasets with an incredibly elegant solution!

Table of Contents

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

favicon christophergreening.com

Understanding missing combinations of categories

When it comes to data preparation, a common scenario we encounter is categorical data that spans across multiple levels of our dataset. While working with this data it's important to understand and identify where our categories might contain missing combinations

A good example of this could be sales data that spans several geographic regions and product categories.

A missing combination might indicate that:

  • a particular product is not sold in a certain region
  • OR that the data itself for that region-product combination is missing!

Let's take this a step further and look at a simple dataset of spam and eggs sales in New York, Texas, and California - can you tell which combinations of region and product are missing sales data?:

       region product  sales
0    New York    spam      9
1    New York    eggs      2
2       Texas    eggs      5  <- ?
3  California    spam      7  <- ?
Enter fullscreen mode Exit fullscreen mode

In this case, Texas is missing a row for spam sales and California is missing a row for eggs!

Some important questions to consider:

  • Were there zero sales so those rows were ommitted?
  • Are those products respectively not offered in Texas or California?
  • How will these missing combinations affect our analyses?

And instead of just leaving these combinations out it might be imperative to complete the missing categories and fill the associated values with zero (or NA) to give us a more complete picture of our dataset

       region product  sales
0    New York    spam      9
1    New York    eggs      2
2       Texas    spam      0  <- :D
3       Texas    eggs      5
4  California    spam      7
5  California    eggs      0  <- :D
Enter fullscreen mode Exit fullscreen mode

What is MultiIndex?

As mentioned in the introduction, MultiIndex is a powerful tool for managing DataFrame's that contain nested layers, categories, and/or segmentations

In our sales data example from the prior section, we have multiple indices that categorize our sales data by:

  • region
  • product

By leveraging MultiIndex, we're able to encode this hierarchy into a DataFrame and gain access to an elegant toolkit for manipulating, preparing, and analyzing the different levels of our data


Creating a MultiIndex with all possible combinations of categories

Let's take our spam and eggs raw sales data from earlier now and store it in a DataFrame (let's call it sales_df)

import pandas as pd

sales_df = pd.DataFrame({
    "region": ["New York", "New York", "Texas", "California"],
    "product": ["spam", "eggs", "eggs", "spam"],
    "sales": [9, 2, 5, 7]
})
Enter fullscreen mode Exit fullscreen mode

To create a MultiIndex that contains every possible combination of the unique values in region and product, we can leverage the pd.MultiIndex.from_product method by passing as arguments:

  1. a list of lists containing every unique region and every unique product in our dataset
  2. and a list of strings containing the names of our columns
unique_categories = [
    sales_df['region'].unique(), 
    sales_df['product'].unique()
]
names = ["region", "product"]
multiindex = pd.MultiIndex.from_product(
    unique_categories,
    names=names
)

>>> print(multiindex)
MultiIndex([(  'New York', 'spam'),
            (  'New York', 'eggs'),
            (     'Texas', 'spam'),  <- wow!
            (     'Texas', 'eggs'),
            ('California', 'spam'),
            ('California', 'eggs')], <- great!
           names=['region', 'product'])
Enter fullscreen mode Exit fullscreen mode

Check it out! By taking the cross-product of our unique categories, MultiIndex went ahead and created every possible combination of region and product for us

Thanks MultiIndex, you're the best!


Reindexing our DataFrame to align with the MultiIndex

And for the grand finale! We will reindex our DataFrame to align with our MultiIndex completing missing combinations and filling them with zero in the process!

To do this, we will:

  1. set our DataFrame's index on columns region and product using our names list from the previous section
  2. reindex our DataFrame using our multiindex and filling missing values with zero
  3. and reset the index to remove the encoded hierarchy
sales_df = (
    sales_df
    .set_index(names)
    .reindex(multiindex, fill_value = 0)
    .reset_index()
)

>>> print(sales_df)
       region product  sales
0    New York    spam      9
1    New York    eggs      2
2       Texas    spam      0  <- yay!
3       Texas    eggs      5
4  California    spam      7
5  California    eggs      0  <- fantastic!
Enter fullscreen mode Exit fullscreen mode

Conclusion

And just like that, we've learned how we can use pandas.MultiIndex to complete missing combinations with an incredibly elegant solution

While our product-region example was trivial, this will scale to an arbitrary amount of categorizations (including time series i.e. dates, weeks, and months!)

So get out there and go complete those missing combinations, I believe in you!

GIF showing spam and eggs sales in New York, Texas, and California with missing combinations of categories getting completed and filled with zero

If you want to take this a step further and practice with sample code and data, I've pulled together a full working example for you to explore on GitHub!

Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website 😄

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

favicon christophergreening.com

Cheers!



Additional resources

Top comments (2)

Collapse
 
Sloan, the sloth mascot
Comment deleted
Collapse
 
chrisgreening profile image
Chris Greening

Hey thank you so much for the feedback my friend :D

Definitely one of my favorite techniques when it comes to prepping data, super clean