DEV Community

Riley Molloy
Riley Molloy

Posted on • Originally published at rpm4real.github.io

In-line Renaming of Pandas Aggregates

The Problem

When working with aggregating dataframes in pandas, I've found myself frustrated with how the results of aggregated columns are named. By default, they inherit the name of the column of which you're aggregating. For example,

import pandas as pd 
import numpy as np

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

iris.groupby('species').agg({
    'sepal_length': np.mean
}).round(2)
sepal_length
species
setosa 5.01
versicolor 5.94
virginica 6.59

So obviously, we as the writers of the above code know that we took a mean of sepal length. But just looking at the output we have no idea what was done to the sepal length value. We can get around this if we enclose the aggregate function in a list:

iris.groupby('species').agg({
    'sepal_length': [np.mean]
}).round(2)
sepal_length
mean
species
setosa 5.01
versicolor 5.94
virginica 6.59

Pandas adds a row (technically adds a level, creating a multiIndex) to tell us the different aggregate functions we applied to the column. In this case, we only applied one, but you could see how it would work for multiple aggregation expressions.

This approach works well. If you want to collapse the multiIndex to create more accessible columns, you can leverage a concatenation approach, inspired by this stack overflow post (note that other implementations similarly use .ravel()):

df = iris.groupby('species').agg({
    'sepal_length': [np.mean]
}).round(2)
df.columns = ['_'.join(gp) for gp in df.columns.values]
df
sepal_length_mean
species
setosa 5.01
versicolor 5.94
virginica 6.59

Both of these solutions have a few immediate issues:

  • Column names can still be far from readable English;
  • The concatenation approach may not scale for all applications;
  • Pandas takes the __name__ attribute of any custom functions and uses it for the column name here. In the case of aggregating with custom functions or lambda functions, it's not likely the column names will make sense in these formats.

A Different Solution

We can leverage the __name__ attribute to create a clearer column name and maybe even one others can make sense of. 👍

To be clear: we could obviously rename any of these columns after the dataframe is returned, but in this case I wanted a solution where I could set column names on the fly.

Taking Advantage of the __name__ Attribute

If you're unfamiliar, the __name__ attribute is something every function you or someone else defines in python comes along with.

def this_function():
    pass 

print(this_function.__name__)

this_function

We can change this attribute after we define it:

def this_function():
    pass 

this_function.__name__ = 'that.'

print(this_function.__name__)

that.

There are also some great options for adjusting a function __name__ as you define the function using decorators. More about that here.

Returning to our application, lets examine the following situation:

def my_agg(x): 
    return (x/20).sum()

iris.groupby('species').agg({
    'sepal_length': [my_agg],
    'sepal_width': [my_agg]
}).round(2)
sepal_length sepal_width
my_agg my_agg
species
setosa 12.52 8.57
versicolor 14.84 6.92
virginica 16.47 7.44

We could add a line adjusting the __name__ of my_agg() before we start our aggregation. But what if we could rename the function as we were aggregating? Similar to how we can rename columns in a SQL statement as we define them.

Higher-order Renaming Function

To solve this problem, we can define a higher-order function which returns a copy of our original function, but with the name attribute changed. It looks like this:

def renamer(agg_func,desired_name):
    def return_func(x):
        return agg_func(x)
    return_func.__name__ = desired_name
    return return_func

We can apply this function outside of our application of my_agg to reset the __name__ on-the-fly:

iris.groupby('species').agg({
    'sepal_length': [renamer(my_agg,'Cool Name')],
    'sepal_width': [renamer(my_agg,'Better Name')]
}).round(2)
sepal_length sepal_width
Cool Name Better Name
species
setosa 12.52 8.57
versicolor 14.84 6.92
virginica 16.47 7.44

Realistic Example

Here's a perfect scenario to utilize this solution:

from numpy import percentile

iris.groupby('species').agg({
    'sepal_length': [renamer(lambda x: percentile(x,25),'25th Percentile')],
    'sepal_width': [renamer(lambda x: percentile(x,75),'75th Percentile')]
}).round(2) 
sepal_length sepal_width
25th Percentile 75th Percentile
species
setosa 4.80 3.68
versicolor 5.60 3.00
virginica 6.22 3.18

In order to get various percentiles of sepal widths and lengths, we can leverage lambda functions and not have to bother defining our own. We use the renamer to fix give these lambda functions understandable names.

To take this a step further, we can include the column name in the rename string and drop the top level of the column multiIndex:

from numpy import percentile

df3 = iris.groupby('species').agg({
    'sepal_length': [renamer(lambda x: percentile(x,25),'Length 25th Percentile')],
    'sepal_width': [renamer(lambda x: percentile(x,75),'Width 75th Percentile')]
}).round(2) 

df3.columns = df3.columns.droplevel()

df3
Length 25th Percentile Width 75th Percentile
species
setosa 4.80 3.68
versicolor 5.60 3.00
virginica 6.22 3.18

Final Thoughts

There are many ways to skin a cat when working with pandas dataframes, but I'm constantly looking for ways to simplify and speed-up my work-flow. This solution helps me work through aggregation steps and easily create sharable tables. It certainly won't work for all situations, but consider using it the next time you get frustrated with unhelpful column names!

Top comments (0)