DEV Community

Dare Johnson


Streamlining Machine Learning Projects with Scikit-Learn Pipeline

In this follow-up tutorial, I want to demonstrate how to use the Scikit-Learn Pipeline module together with modularization.

If you need to go through the previous tutorial, check here

As a recap, the first tutorial introduced the industry-standard practice of separating the different stages of a machine learning project into modular scripts, where each script handles one process (data loading, data preprocessing, model selection, etc.).

The advantages of script modularization include code organization, code reusability, collaboration and teamwork, maintainability, and easier debugging.

Now, let's talk briefly about the Scikit-Learn Pipeline module.
A Scikit-Learn (Sklearn) pipeline is a powerful tool for streamlining, simplifying and organizing machine learning workflows. It chains multiple data processing and modeling steps into a single, unified object, so the whole sequence can be fitted and applied as one unit.

Modular Script + Pipeline Implementation = Best Industry Practices

(The essence of this tutorial is to show how we can use the Pipeline module in modular scripts for an even more streamlined, industry-standard workflow.)

Here's a list of aspects of the machine learning process where the Scikit-Learn Pipeline can be used:

Data Preprocessing

  • Imputing missing values.
  • Scaling and standardizing features.
  • Encoding categorical variables.
  • Handling outliers.

Feature Engineering

  • Creating new features or transforming existing ones.
  • Applying dimensionality reduction techniques (e.g., PCA)

Model Training and Evaluation

  • Constructing a sequence of data preprocessing and modeling steps.
  • Cross-validation and evaluation.

Hyperparameter Tuning

  • Using Grid Search or Random Search to find optimal hyperparameters.

Model Selection

  • Comparing different models with the same preprocessing steps.
  • Selecting the best model based on cross-validation results.

Prediction and Inference

  • Applying the entire preprocessing and modeling pipeline to new data.

Deployment and Production

  • Packaging the entire pipeline into a deployable unit.

Pipeline Serialization and Persistence

  • Saving the entire pipeline to disk for future use.
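As one illustration of the list above, the "Model Selection" item can be sketched as follows. This is a minimal sketch, not part of the tutorial's scripts: the synthetic data from `make_classification`, the two candidate models and the step names are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical candidates; any classifiers could be compared this way.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Every candidate gets the same preprocessing step in front of it.
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(pipe, X, y, cv=5).mean()

best_name = max(scores, key=scores.get)
print(scores, "-> best:", best_name)
```

Because only the final step changes between candidates, any difference in the scores comes from the model rather than from the preprocessing.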

Before demonstrating some of these steps, it is worth mentioning that a Scikit-Learn Pipeline workflow loosely follows this pattern:

  • Import Necessary Libraries
  • Create Individual Transformers and Estimators
  • Construct the Pipeline
  • Fit and Transform Data
  • Fit the Final Estimator
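The pattern above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy NumPy data, the choice of imputer, scaler, PCA and logistic regression, and the step names are all made up for the example.

```python
# 1. Import necessary libraries
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with a few missing values (made up for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[::10, 2] = np.nan  # knock out every tenth value in one column

# 2. Create individual transformers and the final estimator
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
pca = PCA(n_components=2)
classifier = LogisticRegression()

# 3. Construct the pipeline
pipe = Pipeline(steps=[
    ("imputer", imputer),
    ("scaler", scaler),
    ("pca", pca),
    ("clf", classifier),
])

# 4 and 5. One fit call fits and applies each transformer in order,
# then fits the final estimator on the transformed data.
pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions[:10])
```

Calling `predict` later sends new data through the same fitted transformers before it reaches the classifier.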

For illustration purposes, I will use the tips dataset that ships with seaborn (it is quite a simple one) to demo a few of these steps:

  1. Data Preprocessing (Imputation, Scaling, Encoding)
  2. Model Training and Evaluation (including Cross-Validation)
  3. Hyperparameter Tuning (GridSearchCV)
  4. Prediction and Inference

load_data.py script:

import seaborn as sns

def load_tips_data():
    df = sns.load_dataset('tips')
    return df

if __name__ == "__main__":
    tips_data = load_tips_data()

preprocess_data.py script; here pipeline is being used

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from load_data import load_tips_data #first .py script

class PreprocessingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.numeric_features = ['total_bill', 'size']
        self.categorical_features = ['sex', 'smoker', 'day', 'time']

        # Numeric columns: fill missing values with the mean, then standardize
        self.numeric_preprocessor = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ])

        # Categorical columns: fill missing values with a constant, then one-hot encode
        self.categorical_preprocessor = Pipeline(steps=[
            ('imputer', SimpleImputer(fill_value='missing', strategy='constant')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ])

        # Route each group of columns to its own sub-pipeline
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('numeric', self.numeric_preprocessor, self.numeric_features),
                ('categorical', self.categorical_preprocessor, self.categorical_features)
            ])

    def fit(self, X, y=None):
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        return self.preprocessor.transform(X)

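# Runs only when this script is executed directly, not when it is imported
if __name__ == "__main__":
    # Create an instance of PreprocessingTransformer
    preprocessor_instance = PreprocessingTransformer()

    # Load the data and separate the features from the target
    tips_data = load_tips_data() # load_tips_data from the first script
    X = tips_data.drop('tip', axis=1)
    y = (tips_data['tip'] > 5).astype(int)  # 1 if the tip is above 5, else 0

    # Fit the preprocessor & transform the data
    preprocessor_instance.fit(X, y)
    transformed_data = preprocessor_instance.transform(X)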

train_model.py - pipeline also used

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from load_data import load_tips_data #calling a script
from preprocess_data import PreprocessingTransformer #calling a script
from sklearn.pipeline import Pipeline

class TrainAndEvaluateModelTransformer(BaseEstimator, TransformerMixin):
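    def __init__(self):
        self.model = None  # set in fit()

    def fit(self, X, y):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Train the classifier on the training split only
        self.model = RandomForestClassifier(random_state=42)
        self.model.fit(X_train, y_train)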

        return self

    def transform(self, X):
        y_pred = self.model.predict(X)
        return y_pred

if __name__ == "__main__":
    # Load and preprocess data
    tips_data = load_tips_data()
    preprocessor = PreprocessingTransformer()  # Create an instance without passing data
    preprocessed_data = preprocessor.fit_transform(tips_data)  # Fit and transform
    X = preprocessed_data
    y = (tips_data['tip'] > 5).astype(int)

    # Create a pipeline with TrainAndEvaluateModelTransformer
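    model_pipeline = Pipeline([
        ('model', TrainAndEvaluateModelTransformer())
    ])

    # Fit and evaluate the model
    # Note: accuracy is computed on the full dataset, which includes the rows
    # the model was trained on, so it overstates performance on unseen data.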
    y_pred = model_pipeline.fit_transform(X, y)
    accuracy = accuracy_score(y, y_pred)
    print(f"Model Accuracy: {accuracy}")


When the above is run, the result is shown as:

Image depicting train model pipeline created

hyperparameter_tuning.py - another Pipeline also used:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer  # Import the PreprocessingTransformer
from sklearn.pipeline import Pipeline

if __name__ == "__main__":
    # Load and preprocess data
    tips_data = load_tips_data()
    X = tips_data.drop('tip', axis=1)
    y = (tips_data['tip'] > 5).astype(int)

    # Create a pipeline for preprocessing and model training
    model_pipeline = Pipeline([
        ('preprocessor', PreprocessingTransformer()),  # Use the custom preprocessing transformer
        ('model', RandomForestClassifier(random_state=42))
    ])

    # Define hyperparameter grid ("model__" targets the 'model' step of the pipeline)
    param_grid = {
        'model__n_estimators': [50, 100, 150],
        'model__max_depth': [None, 10, 20],
        'model__min_samples_split': [2, 5, 10]
    }

    # Perform GridSearchCV
    grid_search = GridSearchCV(model_pipeline, param_grid, cv=5)
    grid_search.fit(X, y)

    # Get the best tuned model
    best_tuned_model = grid_search.best_estimator_

    print("Best Hyperparameters:", best_tuned_model.named_steps['model'].get_params())


The use of a pipeline in the hyperparameter tuning stage is shown here:

Image depicting hyperparameter tuning pipeline created
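The grid in the script above only tunes the model step. As a rough sketch of how the same `step__parameter` naming can also reach a preprocessing step, the example below tunes an imputation strategy alongside a model parameter. The synthetic data, the step names and the grid values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=25, random_state=0)),
])

# "<step name>__<parameter>" addresses a parameter of any step in the pipeline.
param_grid = {
    "imputer__strategy": ["mean", "median"],  # a preprocessing parameter
    "model__max_depth": [None, 5],            # a model parameter
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

# best_params_ holds only the winning values from the grid.
print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_params_` reports just the tuned values, which is usually easier to read than the full `get_params()` output of the model step.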

Finally, the output of these modular scripts can be serialized as a pickled file, ready to be used by another programme (production):

import joblib
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer
from train_model import TrainAndEvaluateModelTransformer
from sklearn.model_selection import GridSearchCV  # import from scikit-learn directly
from sklearn.ensemble import RandomForestClassifier

def serialize_model(model, filename):
    joblib.dump(model, filename)
    print(f"Model serialized and saved as '{filename}'")

if __name__ == "__main__":
    data = load_tips_data()
    X = PreprocessingTransformer().fit_transform(data)
    y = (data['tip'] > 5).astype(int)
    trained_model = TrainAndEvaluateModelTransformer().fit(X, y)
    tuned_model = GridSearchCV(RandomForestClassifier(random_state=42), param_grid={
        'n_estimators': [50, 100, 150],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }, cv=5).fit(X, y)
    serialize_model(tuned_model, "best_model.pkl")

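The script above saves the tuned model on its own, so a consuming programme would still need to preprocess its inputs the same way. As an alternative sketch, the preprocessing steps and the model can be saved together as one pipeline, which also covers the prediction and inference step. The made-up numeric data, the step names and the temporary file name `full_pipeline.pkl` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric features and a binary target (not the tips data).
rng = np.random.default_rng(42)
X_train = rng.normal(loc=20.0, scale=8.0, size=(60, 2))
y_train = (X_train[:, 0] > 25).astype(int)

full_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])
full_pipeline.fit(X_train, y_train)

# Save preprocessing and model together in one file.
path = os.path.join(tempfile.mkdtemp(), "full_pipeline.pkl")
joblib.dump(full_pipeline, path)

# A separate programme would only need to load the file...
loaded = joblib.load(path)

# ...and pass raw rows straight to predict(); the missing value is
# handled by the imputer that was saved inside the pipeline.
new_rows = np.array([[48.0, np.nan]])
print(loaded.predict(new_rows))
```

A pickled pipeline should be loaded with the same scikit-learn version that saved it, and only from a trusted source.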

This is already a rather long tutorial, but I believe the main idea has come across: using the Scikit-Learn Pipeline in the modular scripts of a machine learning workflow.

Lastly, what do you think are some of the benefits of implementing Scikit Learn Pipeline in your data science stages?

  1. Code Simplicity: With Pipeline, you can combine all the steps of the ML pipeline into a single object, making the code more concise and easier to understand.
  2. Data Leakage Prevention: Pipeline fits the data transformation steps on the training data only and then applies them to the test data, avoiding data leakage between the training and test sets.
  3. Hyperparameter Grid Search: When combined with GridSearchCV, Pipeline allows for hyperparameter tuning across the entire pipeline, including data preprocessing and model parameters.
  4. Model Deployment: Using the Pipeline, you can export the entire pipeline, including all preprocessing steps and the trained model, for deployment in a production environment.
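The data leakage point in the list above can be checked directly. In this minimal sketch (synthetic data and step names are assumptions), the scaler inside the pipeline is fitted on the training split only, so its learned mean matches the training rows and the test rows never influence it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# fit() sees only the training split, so the scaler learns from it alone.
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["scaler"]
learned_from_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(learned_from_train_only)

# score() reuses the already-fitted scaler on the test split; it does not refit it.
test_accuracy = pipe.score(X_test, y_test)
print(round(test_accuracy, 3))
```

The same holds inside `cross_val_score` and `GridSearchCV`, where the pipeline is refitted on the training portion of each fold.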

In conclusion, the aim has been to show that machine learning projects can be better streamlined, in line with industry best practices, by using modular scripts in conjunction with the Sklearn Pipeline module (and its classes). This goes well beyond what is demonstrated in many online courses, where the Jupyter Notebook seems to be the "actual thing".
The code, the approaches and the structures used here are not cast in stone; depending on the aim, nature and end goal of a data science/machine learning project, flexibility is very much allowed.
