
Chris Greening


Leveraging the pipe method to write beautiful and concise data transformations in pandas

When it comes to data science and analysis, being able to prepare and transform our data is a critical component of any successful project

So let's learn how we can leverage the pandas pipe method in Python to abstract complex data transformations into easy-to-read, self-documenting operations!


Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

christophergreening.com

Overview of the .pipe() method

import pandas as pd

The pipe method allows us to chain Series or DataFrame data transformations together in a semantically continuous pipeline of inputs and outputs

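It accomplishes this by leveraging Python's support for higher-order functions - the ability to pass a function as an argument to another function. Calling df.pipe(func) simply calls func(df) and returns the result, which is what lets us keep chaining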
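To make that idea concrete, here's a minimal sketch showing that piping a function is equivalent to calling it directly. The functions `_add_one` and `_add_n` and the sample data are hypothetical, made up purely for illustration:

```python
import pandas as pd

# Hypothetical sample data, made up for this sketch
df = pd.DataFrame({"a": [1, 2, 3]})


def _add_one(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: return a new DataFrame with 1 added to every value."""
    return frame + 1


def _add_n(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Hypothetical transformation that needs an extra argument."""
    return frame + n


# pipe passes the DataFrame as the first argument to the function...
piped = df.pipe(_add_one)

# ...so it gives the same result as calling the function ourselves
called_directly = _add_one(df)

# Any extra arguments given to pipe are forwarded to the function
piped_with_arg = df.pipe(_add_n, n=10)
```

Because each function takes a DataFrame in and hands a DataFrame back, the result of one pipe can always be fed straight into the next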

Let's take a look at a simple example (NOTE: assume the functions and DataFrame are pre-defined offscreen):

transformed_df = (
    df
    .pipe(_select_columns)
    .pipe(_multiply_columns_by_two)
    .pipe(_filter_segments)
)

The code snippet above shows each pipe method:

  1. Inputting the output from the previous pipe
  2. Performing a transformation (e.g. selecting columns)
  3. Chaining the output into the input of the next pipe
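If you'd like to run that snippet yourself, here's one possible way to fill in the offscreen pieces. The column names, the sample values, and the bodies of the three functions are all assumptions for the sake of a runnable sketch, not the original implementations:

```python
import pandas as pd

# Hypothetical sample data standing in for the offscreen DataFrame
df = pd.DataFrame({
    "segment": ["a", "b", "a", "c"],
    "sales": [10, 20, 30, 40],
    "cost": [1, 2, 3, 4],
    "notes": ["w", "x", "y", "z"],
})


def _select_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns we care about."""
    return frame[["segment", "sales", "cost"]]


def _multiply_columns_by_two(frame: pd.DataFrame) -> pd.DataFrame:
    """Double the numeric columns, returning a new DataFrame."""
    return frame.assign(sales=frame["sales"] * 2, cost=frame["cost"] * 2)


def _filter_segments(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows that belong to segment 'a'."""
    return frame[frame["segment"] == "a"]


transformed_df = (
    df
    .pipe(_select_columns)
    .pipe(_multiply_columns_by_two)
    .pipe(_filter_segments)
)

# The same thing written as nested calls reads inside-out instead of top-down
nested_df = _filter_segments(_multiply_columns_by_two(_select_columns(df)))
```

Both versions produce the same DataFrame, but the piped one lists the steps in the order they actually happen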

"Wait I still don't understand what any of this means!!! Can we take a look at a more concrete example?!"

No worries! Yeah - let's take a look at a more concrete example in the next section


A concrete example of the pipe method

Let's pretend we have a DataFrame, let's call it town_df, that contains weekly time-series data for how much electricity every single town in the United States consumes

import pandas as pd
town_df = pd.read_csv("time_series_data_for_every_single_town_in_the_united_states.csv")

And let's say we want to perform these specific transformations in this specific order:

  1. select relevant columns
  2. filter date range
  3. approximate missing values
  4. map town to state
  5. aggregate up to week and state
  6. upsample week frequency to daily
  7. interpolate daily values

Wouldn't it be great if we could implement each of those steps as its own self-contained function and then *pipe* those functions together in an explicitly obvious chain of transformations?...

Well I'm glad you asked (😉)! Check this out:

transformed_df = (
    town_df
    .pipe(_select_relevant_columns)
    .pipe(_filter_date_range)
    .pipe(_approximate_missing_values)
    .pipe(_map_town_to_state)
    .pipe(_aggregate_up_to_week_and_state)
    .pipe(_upsample_week_frequency_to_daily)
    .pipe(_interpolate_daily_values)
)

And that's it!

A clear and concise chain of immediately obvious data transformations - let's talk about some of the benefits of writing our code like this
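For the curious, here's a rough sketch of what those seven functions could look like on a tiny made-up dataset. Everything here is an assumption for illustration: the column names (`town`, `week`, `kwh`, `source`), the `TOWN_TO_STATE` mapping, the sample values, and the strategy chosen for each step (for example, filling gaps with each town's mean is just one of many ways to approximate missing values):

```python
import pandas as pd

# Hypothetical data: column names, values, and mapping are all assumptions
town_df = pd.DataFrame({
    "town": ["Albany", "Buffalo", "Newark", "Albany", "Buffalo", "Newark"],
    "week": pd.to_datetime(["2023-01-01"] * 3 + ["2023-01-08"] * 3),
    "kwh": [100.0, None, 50.0, 120.0, 80.0, 70.0],
    "source": ["meter"] * 6,
})
TOWN_TO_STATE = {"Albany": "NY", "Buffalo": "NY", "Newark": "NJ"}


def _select_relevant_columns(df):
    return df[["town", "week", "kwh"]]


def _filter_date_range(df, start, end):
    # Keep rows whose week falls inside [start, end], inclusive
    return df[df["week"].between(start, end)]


def _approximate_missing_values(df):
    # Fill each town's gaps with that town's mean consumption
    town_mean = df.groupby("town")["kwh"].transform("mean")
    return df.assign(kwh=df["kwh"].fillna(town_mean))


def _map_town_to_state(df):
    return df.assign(state=df["town"].map(TOWN_TO_STATE))


def _aggregate_up_to_week_and_state(df):
    return df.groupby(["state", "week"], as_index=False)["kwh"].sum()


def _upsample_week_frequency_to_daily(df):
    # For each state, reindex onto a daily calendar; the new days start out as NaN
    pieces = []
    for state, group in df.groupby("state"):
        days = pd.date_range(group["week"].min(), group["week"].max(), freq="D")
        daily = (
            group.set_index("week")["kwh"]
            .reindex(days)
            .rename_axis("date")
            .reset_index()
            .assign(state=state)
        )
        pieces.append(daily)
    return pd.concat(pieces, ignore_index=True)


def _interpolate_daily_values(df):
    # Linearly interpolate the NaN days within each state
    filled = df.groupby("state")["kwh"].transform(lambda s: s.interpolate())
    return df.assign(kwh=filled)


transformed_df = (
    town_df
    .pipe(_select_relevant_columns)
    .pipe(_filter_date_range, start=pd.Timestamp("2023-01-01"), end=pd.Timestamp("2023-01-08"))
    .pipe(_approximate_missing_values)
    .pipe(_map_town_to_state)
    .pipe(_aggregate_up_to_week_and_state)
    .pipe(_upsample_week_frequency_to_daily)
    .pipe(_interpolate_daily_values)
)
```

Note that this sketch passes `start` and `end` through pipe as keyword arguments, which is a small departure from the argument-free chain above. Each function also returns a new DataFrame rather than modifying its input, so `town_df` is left untouched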

The benefits of using the pipe method

You may have noticed that I did not explicitly reveal any of the implementation details behind any of the piped functions

And yet you probably didn't have a hard time understanding (at least from a top-level view) what transformations were taking place behind the scenes!

I bet you could even show this to someone who has never written a single line of code in their life and even they'd be able to get the overall gist of what's happening to the dataset

While some simple transformations can be accomplished in a single line of code, more complex transformations might take dozens, hundreds, or even thousands of lines before we can move on to the "next" transformation

transformed_df = (
    df
    .pipe(_some_oneliner_transformation)
    .pipe(_some_million_lines_of_code_transformation_but_guess_what_you_dont_have_to_know_how_its_implemented)
)

So being able to abstract the implementation details under a well-defined unit or block of code removes the cognitive overhead of having to read every single line to know what's going on - you can just focus on the big picture

And when something (inevitably) does go wrong you're able to isolate, test, and debug your inputs and outputs because they're already logically isolated into well-defined units


Conclusion

If you want to take this a step further and practice with sample code and data, I've pulled together a full working example for you to explore on GitHub!

Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website 😄

Cheers!




Latest comments (6)

JulieS

May I ask a question? Is the underscore at the beginning of the function name a must? (For example, the _ in _select_relevant_columns or _filter_date_range.) Thank you!

Chris Greening • Edited

Hey Julie, please always ask questions - I love to help! :D

Fantastic question!! It is not a requirement at all, it's more a matter of personal preference (and a little bit of convention) - as long as Python considers it a valid function definition then you can pass it into pipe

Prefixing functions with an underscore indicates to other users that those functions are intended to be private and are more for internal implementation details than for external users to call upon. Python does not enforce this rule; it's just a convention that some developers follow for readability

I personally really like doing it because I'm often working with dozens of files and thousands of lines of Python and it's useful to know when a function is intended for internal use only versus importing into other modules

I hope this answers your question, always feel free to reach out!

JulieS

Thanks a lot, Chris! Very helpful!

When I read your post again, I noticed that the pipe() method in pandas is a little different from that in scikit-learn and pytorch. In those libraries the pipe() method is used as pipe(function1, function2). Here in pandas the pipe() method is used as a general interface to control the data flow (df.pipe(function1), df.pipe(function2)).

JulieS

Nice Work! I've seen people use pipe() method in scikit-learn, pytorch, etc, but it didn't occur to me that pipe() method can also be used in pandas until your post. Thank you Chris! By the way, the design of your website is amazing!

Chris Greening

No problem Julie, so glad I could help! pipe is one of my favorite tools to use in pandas, it can help sooo much with readability/maintainability and I actually only discovered it fairly recently! Def one of my favorite tools to use nowadays when working in Python/pandas

And haha omg thank you so much for checking my site out!! A lot of sweat and tears went into it 😅

JulieS

Thank you for your introduction. I'll give the pipe method a try. It really makes the process clearer.

I like and admire your website design, the home page game (I may need more time to figure it out haha) and the photo planet. Wow! So wonderful!