Pandas to Pipelines

Elias Elikem Ifeanyi Dzobo

Introduction

If you've ever wrangled data using pandas, you know it's a powerful tool. But as your projects grow, so do the challenges. Enter: pipelines. Think of moving from pandas to pipelines as upgrading from riding a bicycle to driving a car on a highway. It's all about efficiency, scalability, and getting to your destination faster. Let's explore how data preparation evolves from pandas to pipelines.

The Pandas Approach: Manual Labor

Using pandas for data preparation is like cooking a meal from scratch every single time. You chop the veggies, sauté the onions, and season to taste. It's a hands-on process that works well for small, one-off tasks. With pandas, you load your data, clean it, transform it, and merge it all within a few lines of code. For a single data scientist working on a small project, this might seem just fine.

The Drawbacks

However, as your project scales, the drawbacks of this approach become apparent:

  1. Repetition: Every time you run your analysis, you have to manually execute the same steps. This repetition is not only time-consuming but also error-prone.
  2. Lack of Modularity: Your code can become a tangled mess of transformations, making it hard to maintain and debug.
  3. Scalability Issues: Pandas operates in-memory, which can be a bottleneck when dealing with large datasets.
  4. Collaboration Challenges: Sharing your work with others or deploying it in a production environment can be cumbersome without a structured workflow.

Enter Pipelines: Automation and Efficiency

Pipelines are like having a professional kitchen with a team of chefs, each responsible for a specific task, working in harmony to prepare a gourmet meal. In the context of ML, a pipeline automates the sequence of data processing steps, ensuring each step is executed correctly and efficiently.

How Pipelines Transform Your Workflow

  1. Automation: Pipelines automate repetitive tasks, reducing manual intervention and minimizing errors.
  2. Modularity: Each step in a pipeline is a separate component, making your code more organized and easier to debug.
  3. Scalability: Pipelines can handle large datasets by processing data in chunks or leveraging distributed computing.
  4. Collaboration and Deployment: Pipelines provide a clear structure, making it easier for teams to collaborate and deploy models in production environments.

Let's transform a data preparation script into a pipeline using Prefect.

Transforming a Data Preparation Script into a Prefect Pipeline

Here's a simple script that reads a CSV file, cleans the data, and saves the cleaned data to a new CSV file.

Step 1: Original Pandas Script

import pandas as pd

# Load data
data = pd.read_csv('data.csv')

# Clean data
data.dropna(inplace=True)  # Drop missing values
data['date'] = pd.to_datetime(data['date'])  # Convert date column to datetime
data = data[data['value'] > 0]  # Filter out non-positive values

# Save cleaned data
data.to_csv('cleaned_data.csv', index=False)

Step 2: Install Prefect

First, install Prefect if you haven't already. The examples below use the Prefect 2.x API:
pip install prefect

Step 3: Transform the Script into a Prefect Flow

We'll break down the script into individual tasks and then combine them into a Prefect flow.

Import Prefect and Define Tasks

import pandas as pd
from prefect import flow, task

# Define tasks
@task
def load_data(filepath):
    data = pd.read_csv(filepath)
    return data

@task
def clean_data(data):
    data = data.dropna()  # Drop rows with missing values
    data['date'] = pd.to_datetime(data['date'])  # Convert date column to datetime
    data = data[data['value'] > 0]  # Filter out non-positive values
    return data

@task
def save_data(data, output_filepath):
    data.to_csv(output_filepath, index=False)

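Tasks can also be configured declaratively. As a small sketch using the retries and retry_delay_seconds arguments of Prefect 2.x's @task decorator (the retry values here are arbitrary), you could make the load step resilient to transient failures:

@task(retries=3, retry_delay_seconds=10)
def load_data(filepath):
    # Prefect retries this task up to 3 times, waiting 10 seconds
    # between attempts, before marking the task run as failed.
    data = pd.read_csv(filepath)
    return data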

Create a Prefect Flow

Now, we'll create a Prefect flow that chains these tasks together.

# Create a Prefect flow
@flow
def run():
    filepath = 'data.csv'
    output_filepath = 'cleaned_data.csv'

    data = load_data(filepath)
    cleaned_data = clean_data(data)
    save_data(cleaned_data, output_filepath)

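Prefect also gives you structured logging out of the box. As a minimal variant of the flow above (assuming Prefect 2.x's get_run_logger), you can log row counts at each stage and see them in the UI:

from prefect import flow, get_run_logger

@flow
def run():
    logger = get_run_logger()
    filepath = 'data.csv'
    output_filepath = 'cleaned_data.csv'

    data = load_data(filepath)
    logger.info("Loaded %d rows", len(data))

    cleaned_data = clean_data(data)
    logger.info("%d rows remain after cleaning", len(cleaned_data))

    save_data(cleaned_data, output_filepath)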

Step 4: Run the Flow

Finally, run your flow. You can run it locally or use Prefect Cloud for additional features like monitoring and logging.

Run Locally

if __name__ == "__main__":
    run()

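Beyond running once locally, Prefect 2.x can schedule flows through deployments. As a hedged sketch (assuming a 2.x release that includes Flow.serve; the deployment name is invented for illustration), you could run the cleaning flow every hour:

from datetime import timedelta

if __name__ == "__main__":
    # Serve the flow as a long-running deployment that Prefect
    # triggers on an hourly schedule; stop it with Ctrl+C.
    run.serve(name="clean-data-hourly", interval=timedelta(hours=1))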

You can then use Prefect's UI to monitor and manage your flow; running prefect server start launches the local dashboard.
By transforming your pandas script into a Prefect flow, you gain automation, scalability, and improved error handling. Prefect makes it easy to manage your data workflows and integrate them into production environments.

Conclusion: From Pandas to Pipelines

Transitioning from pandas to pipelines is like moving from a bicycle to a car—it's about embracing efficiency, scalability, and the ability to tackle bigger challenges. By investing in pipeline tools like Airflow, Mage, and Prefect, you're setting yourself up for success in the world of machine learning production.

So, the next time you're prepping your data, remember: you can chop those veggies yourself, or you can let a team of chefs handle it for you. Happy coding, and stay tuned for more insights on bringing your ML projects to life in production!
