Chris Greening

Posted on Jan 29, 2023 • Edited on Aug 24, 2023

Leveraging the pipe method to write beautiful and concise data transformations in pandas

#python #datascience #tutorial #codequality

When it comes to data science and analysis, being able to prepare and transform our data is a critical component of any successful project

So let's learn how we can leverage the pandas pipe method in Python to abstract complex data transformations into easy-to-read, self-documenting operations!

Overview of the .pipe() method
A concrete example of the pipe method
The benefits of using the pipe method
Conclusion
Additional resources

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

christophergreening.com

Overview of the .pipe() method

import pandas as pd

The pipe method allows us to chain Series or DataFrame data transformations together in a semantically continuous pipeline of inputs and outputs

It accomplishes this by leveraging Python's support for higher-order functions - the ability to pass a function as an argument to another function. Calling df.pipe(func) simply calls func(df) and returns whatever func returns
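To make that a little more tangible, here's a minimal sketch of what pipe does. The add_constant function and the tiny DataFrame are made up purely for illustration - the point is just that pipe hands the DataFrame to the function as its first argument, along with any extra arguments:

```python
import pandas as pd

# Hypothetical sample data, just for illustration
df = pd.DataFrame({"a": [1, 2, 3]})


def add_constant(frame: pd.DataFrame, amount: int = 0) -> pd.DataFrame:
    """Hypothetical transformation: add `amount` to every value."""
    return frame + amount


# Calling the function directly and calling it through pipe give the same result
direct = add_constant(df, amount=10)
piped = df.pipe(add_constant, amount=10)

assert piped.equals(direct)
```

Because each call returns a DataFrame, several of these calls can be chained one after another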

Let's take a look at a simple example (NOTE: assume the functions and DataFrame are pre-defined offscreen):

transformed_df = (
    df
    .pipe(_select_columns)
    .pipe(_multiply_columns_by_two)
    .pipe(_filter_segments)
)

In the code snippet above, each pipe call:

Takes the output of the previous pipe as its input
Performs a transformation (e.g. selecting columns)
Passes its output along as the input to the next pipe
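If it helps, here's one way those offscreen pieces could look. The sample data, column names, and function bodies below are hypothetical stand-ins, so treat this as a sketch rather than the actual implementation:

```python
import pandas as pd

# Hypothetical sample data; the column names are made up for illustration
df = pd.DataFrame(
    {
        "segment": ["a", "b", "c"],
        "value": [1, 2, 3],
        "unused": ["x", "y", "z"],
    }
)


def _select_columns(frame):
    """Keep only the columns the later steps need."""
    return frame[["segment", "value"]]


def _multiply_columns_by_two(frame):
    """Return a new DataFrame with the numeric column doubled."""
    return frame.assign(value=frame["value"] * 2)


def _filter_segments(frame):
    """Keep only the segments of interest (arbitrarily "a" and "c" here)."""
    return frame[frame["segment"].isin(["a", "c"])]


transformed_df = (
    df
    .pipe(_select_columns)
    .pipe(_multiply_columns_by_two)
    .pipe(_filter_segments)
)

# The same work written as nested calls has to be read inside-out
nested_df = _filter_segments(_multiply_columns_by_two(_select_columns(df)))
assert transformed_df.equals(nested_df)
```

The piped version reads top to bottom in the order the steps run, while the nested version puts the first step in the middle of the line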

"Wait I still don't understand what any of this means!!! Can we take a look at a more concrete example?!"

No worries! Yeah - let's take a look at a more concrete example in the next section

A concrete example of the pipe method

Let's pretend we have a DataFrame, let's call it town_df, that contains weekly time-series data for how much electricity every single town in the United States consumes

import pandas as pd
town_df = pd.read_csv("time_series_data_for_every_single_town_in_the_united_states.csv")

And let's say we want to perform these specific transformations in this specific order:

select relevant columns
filter date range
approximate missing values
map town to state
aggregate up to week and state
upsample week frequency to daily
interpolate daily values

Wouldn't it be great if we could implement each of those steps as its own self-contained function and then *pipe* those functions together in an explicitly obvious chain of transformations?...

Well I'm glad you asked (😉)! Check this out:

transformed_df = (
    town_df
    .pipe(_select_relevant_columns)
    .pipe(_filter_date_range)
    .pipe(_approximate_missing_values)
    .pipe(_map_town_to_state)
    .pipe(_aggregate_up_to_week_and_state)
    .pipe(_upsample_week_frequency_to_daily)
    .pipe(_interpolate_daily_values)
)

And that's it!

A clear and concise chain of immediately obvious data transformations - let's talk about some of the benefits of writing our code like this
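For the curious, here's a rough sketch of what could sit behind those function names. Everything in it is an assumption made for illustration: the column names (week, town, kwh), the TOWN_TO_STATE lookup, the tiny in-memory stand-in for the CSV, and the choice of forward-filling and linear interpolation. A real dataset would likely need different logic:

```python
import pandas as pd

# Tiny hypothetical stand-in for town_df; all names and values are made up
town_df = pd.DataFrame(
    {
        "week": pd.to_datetime(
            ["2023-01-01", "2023-01-08", "2023-01-01", "2023-01-08"]
        ),
        "town": ["Albany", "Albany", "Newark", "Newark"],
        "kwh": [100.0, None, 50.0, 70.0],
        "notes": ["", "", "", ""],
    }
)

# Assumed lookup table; a real project would load this from somewhere
TOWN_TO_STATE = {"Albany": "NY", "Newark": "NJ"}


def _select_relevant_columns(df):
    return df[["week", "town", "kwh"]]


def _filter_date_range(df, start="2023-01-01", end="2023-12-31"):
    return df[df["week"].between(pd.Timestamp(start), pd.Timestamp(end))]


def _approximate_missing_values(df):
    # One possible approach: carry the last known value forward within each town
    return df.assign(kwh=df.groupby("town")["kwh"].ffill())


def _map_town_to_state(df):
    return df.assign(state=df["town"].map(TOWN_TO_STATE))


def _aggregate_up_to_week_and_state(df):
    return df.groupby(["state", "week"], as_index=False)["kwh"].sum()


def _upsample_week_frequency_to_daily(df):
    # Resample each state separately so the daily rows don't mix states
    daily_frames = []
    for state, group in df.groupby("state"):
        daily = group.set_index("week")[["kwh"]].resample("D").asfreq()
        daily_frames.append(daily.assign(state=state).reset_index())
    return pd.concat(daily_frames, ignore_index=True)


def _interpolate_daily_values(df):
    # Linear interpolation between the known weekly values, per state
    return df.assign(
        kwh=df.groupby("state")["kwh"].transform(lambda s: s.interpolate())
    )


transformed_df = (
    town_df
    .pipe(_select_relevant_columns)
    .pipe(_filter_date_range)
    .pipe(_approximate_missing_values)
    .pipe(_map_town_to_state)
    .pipe(_aggregate_up_to_week_and_state)
    .pipe(_upsample_week_frequency_to_daily)
    .pipe(_interpolate_daily_values)
)
```

Notice that the chain at the bottom is unchanged - all of the detail lives inside the individual functions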

The benefits of using the pipe method

You may have noticed that I did not explicitly reveal any of the implementation details behind any of the piped functions

And yet you probably didn't have a hard time understanding (at least from a top-level view) what transformations were taking place behind the scenes!

I bet you could even show this to someone who has never written a single line of code in their life and even they'd be able to get the overall gist of what's happening to the dataset

While some simple transformations can be accomplished in a single line of code, more complex transformations might take dozens, hundreds, or even thousands of lines before we can move on to the "next" transformation

transformed_df = (
    df
    .pipe(_some_oneliner_transformation)
    .pipe(_some_million_lines_of_code_transformation_but_guess_what_you_dont_have_to_know_how_its_implemented)
)

So being able to abstract the implementation details under a well-defined unit or block of code removes the cognitive overhead of having to read every single line to know what's going on - you can just focus on the big picture

And when something (inevitably) does go wrong, you're able to isolate, test, and debug each step's inputs and outputs because they're already logically isolated into well-defined units

Conclusion

If you want to take this a step further and practice with sample code and data, I've pulled together a full working example for you to explore on GitHub!

Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website 😄

Cheers!

Additional resources

Latest comments (6)

JulieS • Feb 19 '23

May I ask a question? Is the underscore at the beginning of the function name a must (for example, the _ in _select_relevant_columns or _filter_date_range)? Thank you!

Chris Greening • Feb 19 '23 • Edited

Hey Julie, please always ask questions - I love to help! :D

Fantastic question!! It is not a requirement at all, it's more a matter of personal preference (and a little bit of convention) - as long as Python considers it a valid function definition then you can pass it into pipe

Prefixing functions with an underscore indicates to other users that those functions are intended to be private and are more for internal implementation details than for external users to call upon. Python does not enforce this rule; it's just a convention that some developers follow for readability

I personally really like doing it because I'm often working with dozens of files and thousands of lines of Python and it's useful to know when a function is intended for internal use only versus importing into other modules

I hope this answers your question, always feel free to reach out!

JulieS • Feb 22 '23

Thanks a lot, Chris! Very helpful!

When I read your post again, I find that the pipe() method in pandas is a little different from that in scikit-learn and pytorch. In those libraries the pipe() method is used as pipe(function1, function2). Here in pandas the pipe() method is used as a general interface to control the data flow (df.pipe(function1), df.pipe(function2)).

JulieS • Feb 19 '23

Nice Work! I've seen people use pipe() method in scikit-learn, pytorch, etc, but it didn't occur to me that pipe() method can also be used in pandas until your post. Thank you Chris! By the way, the design of your website is amazing!

Chris Greening • Feb 19 '23

No problem Julie, so glad I could help! pipe is one of my favorite tools to use in pandas, it can help sooo much with readability/maintainability and I actually only discovered it fairly recently! Def one of my favorite tools to use nowadays when working in Python/pandas

And haha omg thank you so much for checking my site out!! A lot of sweat and tears went into it 😅