Dare Johnson

How to Streamline Machine Learning Projects with Scikit-learn Pipeline

This is a follow-up tutorial. We will go through how to use the Scikit-learn Pipeline module in addition to modularization. If you need to revisit the previous tutorial on code modularization in data science, check here.

As a quick recap, the first tutorial introduced the industry practice of separating the different stages of a machine learning project into modular scripts, where each script handles a different process (data loading, data preprocessing, model selection, etc.).

Table of Contents

  1. Scikit-learn Pipeline
  2. Advantages of Scikit-learn Pipeline
  3. Illustration of Scikit-learn Pipeline
  4. Conclusion

The advantages of script modularization include code organization, code reusability, collaboration and teamwork, maintainability, and debugging.

Now, let's talk about the Scikit-learn Pipeline module briefly.

Scikit-learn Pipeline

A Scikit-learn (Sklearn) pipeline is a powerful tool for streamlining, simplifying, and organizing machine learning workflows. It's essentially a way to automate a sequence of data processing and modeling steps into a single, cohesive unit.

The Pipeline allows chaining together multiple data processing and modeling steps into a single, unified object.
Modular Script + Pipeline Implementation = Best Industry Practices
(The essence of this tutorial is to show how we can use the Sklearn Pipeline module within modular scripts for an even more streamlined, industry-standard workflow.)

Here's a list of aspects of the machine learning process where Scikit-learn Pipeline can be used:

1. Data Preprocessing

  • Imputing missing values.
  • Scaling and standardizing features.
  • Encoding categorical variables.
  • Handling outliers.

2. Feature Engineering

  • Creating new features or transforming existing ones.
  • Applying dimensionality reduction techniques (e.g., PCA)

3. Model Training and Evaluation

  • Constructing a sequence of data preprocessing and modeling steps.
  • Cross-validation and evaluation.

4. Hyperparameter Tuning

  • Using Grid Search or Random Search to find optimal hyperparameters.

5. Model Selection

  • Comparing different models with the same preprocessing steps.
  • Selecting the best model based on cross-validation results.

6. Prediction and Inference

  • Applying the entire preprocessing and modeling pipeline to new data.

7. Deployment and Production

  • Packaging the entire pipeline into a deployable unit.

8. Pipeline Serialization and Persistence

  • Saving the entire pipeline to disk for future use.

Before demonstrating a few of these steps with a sample dataset, it is worth mentioning that a Scikit-learn Pipeline workflow loosely tends to follow this pattern (a minimal sketch follows the list):

  • Importing Necessary Libraries
  • Creating Individual Transformers and Estimators
  • Constructing the Pipeline
  • Fitting and Transforming Data
  • Fitting the Final Estimator
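
For example, a minimal sketch of that pattern might look like this (the data is synthetic and the choice of LogisticRegression is arbitrary, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Individual transformers and a final estimator chained into one Pipeline
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # transformer 1: fill missing values
    ('scaler', StandardScaler()),                 # transformer 2: standardize features
    ('model', LogisticRegression())               # final estimator
])

# Fitting the Pipeline runs each step in order; scoring applies the same chain
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))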

Advantages of Using Scikit-learn Pipeline

Recall that the purpose of the previous article was to demonstrate modular code in data science projects; in this follow-up, we add more enhancements to make the code robust through the use of Pipeline.

The Scikit-learn Pipeline offers a range of advantages:

  1. Simplicity and Readability of Code: This cannot be overemphasized. A sequence of data processing steps can be organized and structured into a single unit of code, as we will soon see with an example. This makes the code readable and easy to maintain. Assembling multiple steps (transformers and an estimator) in this way results in a seamless workflow.

  2. Prevention of Data Leakage: A Scikit-learn Pipeline helps prevent data leakage. To be clear about what data leakage is: it occurs when your machine learning model gets access to information during training that it should not have, an unintended exposure or mixing of data between the training and test datasets. As a result, model performance looks great during training but drops on new, unseen data. Using the Scikit-learn Pipeline is one way to avoid data leakage (see the short sketch after this list).

  3. Ease of Deployment and Productionization: Pipelines also ease the transition from model development to model deployment, known as the productionization of machine learning models. The Pipeline ensures that all preprocessing steps and the model training are applied consistently, in the same order, and encapsulates the entire workflow from preprocessing to deployment.
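
To make the data-leakage point concrete, here is a small sketch (the data is randomly generated purely for illustration): because the scaler lives inside the Pipeline, cross_val_score re-fits it on each training fold only, so no statistics from the validation fold leak into training.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Randomly generated data, purely for illustration
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = (X[:, 0] > 0.5).astype(int)

# The scaler is part of the Pipeline, so it is re-fitted on each training fold only
leak_free = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

print(cross_val_score(leak_free, X, y, cv=5).mean())

Had we scaled X once on the full dataset before cross-validation, the scaler would already have "seen" the validation rows, which is exactly the kind of leakage a Pipeline avoids.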

Illustration of Scikit-learn Pipeline with Tips Dataset

For illustration, we will use the Tips dataset that ships with Seaborn (a deliberately simple dataset) to demo a few of these steps:

  1. Data Preprocessing (Imputation, Scaling, Encoding)
  2. Model Training and Evaluation (including Cross-Validation)
  3. Hyperparameter Tuning (GridSearchCV)
  4. Prediction and Inference

About the dataset

A few things about the data we will be making use of:

  1. The Tips dataset is part of the Seaborn data repository which serves as an illustrative example within the Seaborn package.
  2. You can easily load the Tips dataset using the sns.load_dataset("tips") command.
  3. The dataset contains information related to restaurant tips and comes with these variables/features:
    • total_bill: The total bill amount.
    • tip: The tip amount.
    • sex: The sex of the bill payer.
    • smoker: Whether the customer is a smoker.
    • day: The day of the week.
    • time: Whether the meal was during lunch or dinner.
    • size: The size of the dining party.

We want to turn this dataset into a (binary) classification problem: given the features, we will predict whether the tip is greater than 5. If the tip is greater than 5, the target (y) we want to predict will be 1; if it is 5 or less, y will be 0.
Let's get to it.

Step 1: Load the dataset

We will load the dataset described above with the code snippet below and save it as load_data.py. The aim is to use modular code and the Scikit-learn Pipeline together to show how to approach machine learning problems.

import seaborn as sns #import the seaborn library

#load the tips dataset from seaborn library
def load_tips_data():
    df = sns.load_dataset('tips')
    return df

if __name__ == "__main__":
    tips_data = load_tips_data()
    print(tips_data.head())

Output:

[Image: the first five rows of the Tips dataset, printed by load_data.py]

Step 2: Preprocessing Pipeline

At this stage, we apply a preprocessing Pipeline to the data we loaded in load_data.py. We will save this script as preprocess_data.py.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from load_data import load_tips_data # load_tips_data comes from the load_data.py script

class PreprocessingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.numeric_features = ['total_bill', 'size']
        self.categorical_features = ['sex', 'smoker', 'day', 'time']

        self.numeric_preprocessor = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ])

        self.categorical_preprocessor = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ])

        self.preprocessor = ColumnTransformer(transformers=[
            ('numeric', self.numeric_preprocessor, self.numeric_features),
            ('categorical', self.categorical_preprocessor, self.categorical_features)
        ])

    def fit(self, X, y=None):
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        return self.preprocessor.transform(X)

# Create an instance of PreprocessingTransformer
preprocessor_instance = PreprocessingTransformer()

# Load and preprocess data
tips_data = load_tips_data() # load_tips_data from the first script
X = tips_data.drop('tip', axis=1)
y = (tips_data['tip'] > 5).astype(int)

# Fit the preprocessor & transform the data
preprocessor_instance.fit(X, y)
transformed_data = preprocessor_instance.transform(X)

Let us see what our code has done to the Tips dataset so far.

The preprocessing steps are as follows:

  1. Feature Selection: The features are divided into numeric (total_bill, size) and categorical (sex, smoker, day, time) features.
  2. Numeric Preprocessing: The numeric features are preprocessed using a pipeline that includes:
    • Imputation: Missing values in the numeric features are replaced with the mean value of the respective feature.
    • Scaling: The numeric features are scaled to have zero mean and unit variance. This is done using the StandardScaler which standardizes features by removing the mean and scaling to unit variance.
  3. Categorical Preprocessing: The categorical features are preprocessed using a pipeline that includes:
    • Imputation: Missing values in the categorical features are replaced with the constant string ‘missing’.
    • One-Hot Encoding: The categorical features are one-hot encoded, meaning each unique value in each categorical feature is turned into its own binary (0/1) column (a small standalone sketch after this list shows what this looks like).
  4. Column Transformation: The ColumnTransformer applies the appropriate preprocessing pipeline to each subset of features (numeric and categorical).
  5. Fitting and Transforming: The preprocessing transformer is fitted to the data, learning any parameters necessary for imputation and scaling. It then transforms the data according to the fitted parameters.
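
As a quick standalone illustration of the one-hot step (separate from the scripts above, and assuming scikit-learn 1.0 or newer for get_feature_names_out), encoding only the day column of the Tips dataset produces one 0/1 column per unique day:

import seaborn as sns
from sklearn.preprocessing import OneHotEncoder

tips = sns.load_dataset('tips')

# Fit the encoder on a single categorical column
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(tips[['day']]).toarray()

print(encoder.get_feature_names_out())  # one encoded column name per unique day, e.g. 'day_Fri', 'day_Sat', ...
print(encoded[:3])                      # first three rows, one 0/1 column per day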

To see the output, we can add print(transformed_data); it looks as below:

[Image: the transformed NumPy array printed by preprocess_data.py]

The output of the code is a NumPy array (transformed_data) representing the preprocessed features of the dataset. This data has been scaled and encoded, which is why it does not look as easily interpretable as the original Tips dataset we began with.
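
If we want that array to be easier to read, one option (assuming a fairly recent scikit-learn, roughly 1.1 or newer, where the fitted ColumnTransformer and its inner steps expose get_feature_names_out) is to append a few lines like these to preprocess_data.py to put the generated column names back on it:

import pandas as pd

# The fitted ColumnTransformer inside our custom transformer can report
# the names of the columns it produced (scaled numerics + one-hot columns)
feature_names = preprocessor_instance.preprocessor.get_feature_names_out()

transformed_df = pd.DataFrame(transformed_data, columns=feature_names)
print(transformed_df.head())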

Let's move on to training on the preprocessed data. (If we were dealing with other datasets that required more preprocessing steps, the code above would have to be extended to accommodate all the necessary steps.)

Step 3: Training Pipeline

In this third step, we introduce the training Pipeline, with the two previous scripts as its inputs. We will save this as train_model.py.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from load_data import load_tips_data # accessing the load_data script
from preprocess_data import PreprocessingTransformer # accessing the preprocess_data script
from sklearn.pipeline import Pipeline

class TrainAndEvaluateModelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        self.model = RandomForestClassifier(random_state=42)
        self.model.fit(X_train, y_train)

        return self

    def transform(self, X):
        y_pred = self.model.predict(X)
        return y_pred

if __name__ == "__main__":
    tips_data = load_tips_data()  # Load and preprocess data
    preprocessor = PreprocessingTransformer()  # Create an instance without passing data
    preprocessed_data = preprocessor.fit_transform(tips_data)  # Fit and transform
    X = preprocessed_data
    y = (tips_data['tip'] > 5).astype(int)

    # Create a pipeline with TrainAndEvaluateModelTransformer
    model_pipeline = Pipeline([('model', TrainAndEvaluateModelTransformer())]) 

    # Fit and evaluate the model
    y_pred = model_pipeline.fit_transform(X, y)
    accuracy = accuracy_score(y, y_pred)
    print(f"Model Accuracy: {accuracy}")

Here is a summary of what happened in the train_model.py script:

  1. Create Model Pipeline: A pipeline is created with TrainAndEvaluateModelTransformer. This transformer splits the data into training and testing sets, trains a RandomForestClassifier on the training data, and then uses this trained model to make predictions.
  2. Fit and Evaluate the Model: The model pipeline is fitted to the data and used to make predictions. The accuracy of these predictions, the proportion of correct predictions, is then printed to the console as a decimal between 0 and 1, with 1 indicating perfect accuracy. Note that this accuracy is computed on the full dataset, including the rows used for training, so it is an optimistic estimate; the cross-validation sketch below gives a more honest one.
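
Because the accuracy above is optimistic, here is a short, hedged sketch of how the "Cross-validation and evaluation" step could be carried out instead, putting the preprocessor and the model into one Pipeline and handing it to cross_val_score (this reuses load_tips_data and PreprocessingTransformer from the earlier scripts; the 5-fold split is an arbitrary choice):

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer

# Raw features and binary target, exactly as in the scripts above
tips_data = load_tips_data()
X = tips_data.drop('tip', axis=1)
y = (tips_data['tip'] > 5).astype(int)

# Preprocessing and model chained into a single Pipeline
cv_pipeline = Pipeline([
    ('preprocessor', PreprocessingTransformer()),
    ('model', RandomForestClassifier(random_state=42))
])

# 5-fold cross-validation; each fold re-fits the preprocessor on its training part only
scores = cross_val_score(cv_pipeline, X, y, cv=5, scoring='accuracy')
print("Cross-validated accuracy:", scores.mean())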

Let’s take a look at the hyperparameter tuning Pipeline.

Step 4: Hyperparameter Pipeline

Even though the model accuracy in the last step was not bad, this stage demonstrates how we can also use a Pipeline for hyperparameter tuning. We will save the code as hyperparameter_tuning.py.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer  # Import the PreprocessingTransformer
from sklearn.pipeline import Pipeline

if __name__ == "__main__":
    # Load and preprocess data
    tips_data = load_tips_data()
    X = tips_data.drop('tip', axis=1)
    y = (tips_data['tip'] > 5).astype(int)

    # Create a pipeline for preprocessing and model training
    model_pipeline = Pipeline([
        ('preprocessor', PreprocessingTransformer()),  # Use the custom preprocessing transformer
        ('model', RandomForestClassifier(random_state=42))
    ])

    # Define hyperparameter grid
    param_grid = {
        'model__n_estimators': [50, 100, 150],
        'model__max_depth': [None, 10, 20],
        'model__min_samples_split': [2, 5, 10]
    }

    # Perform GridSearchCV
    grid_search = GridSearchCV(model_pipeline, param_grid, cv=5)
    grid_search.fit(X, y)

    # Get the best tuned model
    best_tuned_model = grid_search.best_estimator_

    print("Best Hyperparameters:", best_tuned_model.named_steps['model'].get_params())

Briefly, the hyperparameter tuning code defines the hyperparameter grid (several candidate values for each hyperparameter), performs the grid search (trying the different combinations of hyperparameters and keeping the one with the best cross-validated score), retrieves the best model with that combination, and finally prints out its hyperparameters.

This is our output, the hyperparameters of the best model, as printed by the last line of code (to print only the tuned values, we could use grid_search.best_params_ instead):

[Image: the best hyperparameters printed by hyperparameter_tuning.py]
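
Since best_tuned_model is a full pipeline (preprocessing plus model), it can also serve the "Prediction and Inference" step from the earlier list. Here is a hedged sketch that assumes best_tuned_model from the script above is in scope (for example, placed just after its final print statement); the values in new_order are made up purely for illustration:

import pandas as pd

# A single made-up order with the same raw columns the pipeline was fitted on
new_order = pd.DataFrame([{
    'total_bill': 45.50,
    'sex': 'Female',
    'smoker': 'No',
    'day': 'Sun',
    'time': 'Dinner',
    'size': 3
}])

# The tuned pipeline preprocesses and predicts in one call:
# 1 means a predicted tip above 5, 0 means a tip of 5 or less
print(best_tuned_model.predict(new_order))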

Step 5: Serialize the model

Finally, the whole workflow can be serialized into a pickled file so that another program can use it (in production). We will save the script as serialized_model.py.

import joblib # for saving and loading Python objects to/from disk
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer
from train_model import TrainAndEvaluateModelTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def serialize_model(model, filename):
    joblib.dump(model, filename)
    print(f"Model serialized and saved as '{filename}'")

if __name__ == "__main__":
    data = load_tips_data()
    X = PreprocessingTransformer().fit_transform(data)
    y = (data['tip'] > 5).astype(int)
    trained_model = TrainAndEvaluateModelTransformer().fit(X, y)
    tuned_model = GridSearchCV(RandomForestClassifier(random_state=42), param_grid={
        'n_estimators': [50, 100, 150],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }, cv=5).fit(X, y)
    serialize_model(tuned_model, "best_model.pkl")

This last script takes a bit longer to execute because it runs all the steps (the Pipelines) and then serializes ("pickles") the best model into a file.
The output of the code is a print statement indicating that the model has been serialized and saved as 'best_model.pkl'. We won't see the model itself in the console, but a new file named 'best_model.pkl' will be created in our current working directory; this file contains our trained and tuned model. (To load the model back into memory, we can use joblib.load('best_model.pkl'), which returns the model object; see the sketch below.)
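
For completeness, here is a small sketch of that loading step. One caveat: as written, best_model.pkl holds only the tuned GridSearchCV model, which was fitted on already-preprocessed features, so new data has to pass through the same preprocessing first. (An alternative design choice would be to pickle a Pipeline that bundles the PreprocessingTransformer and the model together, so raw rows could be passed straight to predict.)

import joblib
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer

# Load the pickled model back into memory
loaded_model = joblib.load('best_model.pkl')

# The pickled model expects preprocessed features, so preprocess first
data = load_tips_data()
X = PreprocessingTransformer().fit_transform(data)

print(loaded_model.predict(X)[:10])  # predictions for the first ten rows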

Conclusion

In conclusion, we have shown how to use modular scripts in conjunction with the Scikit-learn Pipeline module (and its classes), going beyond what many online courses demonstrate, where the Jupyter Notebook is portrayed as the only and best way to run data science projects. Machine learning projects can be far better streamlined by following the industry's best practices.
The code, the approaches, and the structures used here are not cast in stone; depending on the aim, nature, and goal of a data science or machine learning project, flexibility is always very much allowed.
