In this follow-up tutorial, I want to demonstrate how to use the Scikit Learn Pipeline module together with modular scripts.
If you need to go through the previous tutorial, check here.
To recap, the first tutorial introduced the industry standard of separating the stages of a machine learning project into modular scripts, where each script handles a different part of the process (data loading, data preprocessing, model selection, etc.).
The advantages of script modularization include better code organization, code reusability, easier collaboration and teamwork, and simpler maintenance and debugging.
Now, let's talk briefly about the Scikit Learn Pipeline module.
A Scikit Learn (sklearn) Pipeline is a powerful tool for streamlining, simplifying and organizing machine learning workflows. It lets you chain multiple data processing and modeling steps into a single, cohesive object, so the whole sequence can be fitted, applied and reused as one unit.
Modular Script + Pipeline Implementation = Best Industry Practices
(The essence of this tutorial is to show how the Pipeline module can be used inside modular scripts for an even more streamlined, industry-standard workflow.)
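To make this concrete, here is a minimal, illustrative sketch of a two-step pipeline (using sklearn's built-in iris data purely for demonstration, not the dataset used later in this tutorial):

# Minimal illustrative sketch: chain a scaler and a classifier into one object.
# Uses sklearn's built-in iris data purely for demonstration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),                  # step 1: standardize features
    ('model', LogisticRegression(max_iter=1000))   # step 2: fit the estimator
])

pipe.fit(X_train, y_train)         # fits the scaler, then the model, in one call
print(pipe.score(X_test, y_test))  # applies the same scaling before scoring

Calling fit once runs every step in order, and the same fitted object can later be used for prediction, evaluation, or serialization.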
Here's a list of aspects of the machine learning process where the Scikit Learn Pipeline can be used:
Data Preprocessing
- Imputing missing values.
- Scaling and standardizing features.
- Encoding categorical variables.
- Handling outliers.
Feature Engineering
- Creating new features or transforming existing ones.
- Applying dimensionality reduction techniques (e.g., PCA).
Model Training and Evaluation
- Constructing a sequence of data preprocessing and modeling steps.
- Cross-validation and evaluation.
Hyperparameter Tuning
- Using Grid Search or Random Search to find optimal hyperparameters.
Model Selection
- Comparing different models with the same preprocessing steps (see the sketch right after this list).
- Selecting the best model based on cross-validation results.
Prediction and Inference
- Applying the entire preprocessing and modeling pipeline to new data.
Deployment and Production
- Packaging the entire pipeline into a deployable unit.
Pipeline Serialization and Persistence
- Saving the entire pipeline to disk for future use.
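As a quick, hedged sketch of the model selection point above (this is not part of the modular scripts later in this post, and the three candidate models are arbitrary choices), the same preprocessing can be reused while different models are compared via cross-validation:

# Illustrative sketch: comparing different models with the same preprocessing
# steps (model selection). Uses the seaborn tips dataset, as in the rest of
# this tutorial; the three candidate models are arbitrary choices.
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

tips = sns.load_dataset('tips')
X = tips.drop('tip', axis=1)
y = (tips['tip'] > 5).astype(int)

# Shared preprocessing: scale numeric columns, one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ('numeric', StandardScaler(), ['total_bill', 'size']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'),
     ['sex', 'smoker', 'day', 'time'])
])

candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(random_state=42),
    'random_forest': RandomForestClassifier(random_state=42),
}

for name, model in candidates.items():
    pipe = Pipeline([('preprocessor', preprocessor), ('model', model)])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")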
Before demonstrating some of these steps, it is worth mentioning that a Scikit Learn Pipeline workflow loosely tends to follow this pattern:
- Import Necessary Libraries
- Create Individual Transformers and Estimators
- Construct the Pipeline
- Fit and Transform Data
- Fit the Final Estimator
For illustration purposes, I will use the tips dataset that ships with seaborn to demo a few of these steps (as this dataset is quite a simple one):
- Data Preprocessing (Imputation, Scaling, Encoding)
- Model Training and Evaluation (including Cross-Validation)
- Hyperparameter Tuning (GridSearchCV)
- Prediction and Inference
load_data.py script
import seaborn as sns

def load_tips_data():
    # Load the built-in 'tips' dataset shipped with seaborn
    df = sns.load_dataset('tips')
    return df

if __name__ == "__main__":
    tips_data = load_tips_data()
    print(tips_data.head())
preprocess_data.py script; here the Pipeline is used inside a custom transformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from load_data import load_tips_data  # first .py script

class PreprocessingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.numeric_features = ['total_bill', 'size']
        self.categorical_features = ['sex', 'smoker', 'day', 'time']
        # Numeric columns: impute missing values with the mean, then standardize
        self.numeric_preprocessor = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ])
        # Categorical columns: impute with a constant, then one-hot encode
        self.categorical_preprocessor = Pipeline(steps=[
            ('imputer', SimpleImputer(fill_value='missing', strategy='constant')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ])
        # Route each group of columns through its own sub-pipeline
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('numeric', self.numeric_preprocessor, self.numeric_features),
                ('categorical', self.categorical_preprocessor, self.categorical_features)
            ]
        )

    def fit(self, X, y=None):
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        return self.preprocessor.transform(X)

if __name__ == "__main__":
    # Create an instance of PreprocessingTransformer
    preprocessor_instance = PreprocessingTransformer()

    # Load the data (load_tips_data from the first script)
    tips_data = load_tips_data()
    X = tips_data.drop('tip', axis=1)
    y = (tips_data['tip'] > 5).astype(int)

    # Fit the preprocessor & transform the data
    preprocessor_instance.fit(X, y)
    transformed_data = preprocessor_instance.transform(X)
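As an optional sanity check (not in the original script), you can add a couple of lines at the end of preprocess_data.py to inspect the transformed output; this assumes a reasonably recent scikit-learn (1.0+), where ColumnTransformer exposes get_feature_names_out:

# Optional, illustrative check of the preprocessed output (assumes scikit-learn >= 1.0)
print(transformed_data.shape)  # rows x (scaled numeric + one-hot encoded columns)
print(preprocessor_instance.preprocessor.get_feature_names_out())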
train_model.py script - a Pipeline is also used here
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from load_data import load_tips_data  # calling a script
from preprocess_data import PreprocessingTransformer  # calling a script
from sklearn.pipeline import Pipeline

class TrainAndEvaluateModelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y):
        # Hold out 20% of the data, then fit a random forest on the rest
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        self.model = RandomForestClassifier(random_state=42)
        self.model.fit(X_train, y_train)
        return self

    def transform(self, X):
        # Return predictions so this step can be used inside a Pipeline
        y_pred = self.model.predict(X)
        return y_pred

if __name__ == "__main__":
    # Load and preprocess data
    tips_data = load_tips_data()
    preprocessor = PreprocessingTransformer()  # Create an instance without passing data
    preprocessed_data = preprocessor.fit_transform(tips_data)  # Fit and transform
    X = preprocessed_data
    y = (tips_data['tip'] > 5).astype(int)

    # Create a pipeline with TrainAndEvaluateModelTransformer
    model_pipeline = Pipeline([
        ('model', TrainAndEvaluateModelTransformer())
    ])

    # Fit and evaluate the model
    y_pred = model_pipeline.fit_transform(X, y)
    accuracy = accuracy_score(y, y_pred)
    print(f"Model Accuracy: {accuracy}")
When the above train_model.py script is run, the result is shown below:
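Note that this accuracy is computed on the full dataset, most of which the model has already seen during fitting, so it is an optimistic estimate. Since cross-validation is part of this stage, here is a hedged sketch (not part of train_model.py itself) of evaluating the full preprocessing-plus-model pipeline with cross_val_score:

# Illustrative cross-validation of the full pipeline (preprocessing + model),
# reusing the classes defined in the modular scripts above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer

tips_data = load_tips_data()
X = tips_data.drop('tip', axis=1)
y = (tips_data['tip'] > 5).astype(int)

cv_pipeline = Pipeline([
    ('preprocessor', PreprocessingTransformer()),
    ('model', RandomForestClassifier(random_state=42))
])

scores = cross_val_score(cv_pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")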
hyperparameter_tuning.py script - a Pipeline is used here as well
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer  # Import the PreprocessingTransformer
from sklearn.pipeline import Pipeline

if __name__ == "__main__":
    # Load and preprocess data
    tips_data = load_tips_data()
    X = tips_data.drop('tip', axis=1)
    y = (tips_data['tip'] > 5).astype(int)

    # Create a pipeline for preprocessing and model training
    model_pipeline = Pipeline([
        ('preprocessor', PreprocessingTransformer()),  # Use the custom preprocessing transformer
        ('model', RandomForestClassifier(random_state=42))
    ])

    # Define hyperparameter grid (the 'model__' prefix targets the pipeline step named 'model')
    param_grid = {
        'model__n_estimators': [50, 100, 150],
        'model__max_depth': [None, 10, 20],
        'model__min_samples_split': [2, 5, 10]
    }

    # Perform GridSearchCV over the whole pipeline
    grid_search = GridSearchCV(model_pipeline, param_grid, cv=5)
    grid_search.fit(X, y)

    # Get the best tuned model
    best_tuned_model = grid_search.best_estimator_
    print("Best Hyperparameters:", best_tuned_model.named_steps['model'].get_params())
The output of running the pipeline at the hyperparameter tuning stage is shown here:
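As an optional follow-up (not in the script above), you can also print the best cross-validation score and just the hyperparameters that were tuned, which is usually more readable than dumping every parameter of the model:

# Optional follow-up to the GridSearchCV run above (illustrative)
print("Best CV score:", grid_search.best_score_)
print("Best tuned hyperparameters:", grid_search.best_params_)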
Finally, the trained model from these modular scripts can be serialized as a pickled file, ready to be used by another programme (e.g., in production):
serialized_model.py
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer
from train_model import TrainAndEvaluateModelTransformer

def serialize_model(model, filename):
    # Persist the fitted model to disk with joblib
    joblib.dump(model, filename)
    print(f"Model serialized and saved as '{filename}'")

if __name__ == "__main__":
    data = load_tips_data()
    X = PreprocessingTransformer().fit_transform(data)
    y = (data['tip'] > 5).astype(int)

    trained_model = TrainAndEvaluateModelTransformer().fit(X, y)

    # Tune a random forest on the preprocessed features and keep the best fit
    tuned_model = GridSearchCV(RandomForestClassifier(random_state=42), param_grid={
        'n_estimators': [50, 100, 150],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }, cv=5).fit(X, y)

    serialize_model(tuned_model, "best_model.pkl")
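One caveat: best_model.pkl above was fitted on already-preprocessed features, so new data would first have to be transformed with the same fitted PreprocessingTransformer before calling predict(). An often simpler alternative is to pickle a full pipeline that bundles preprocessing and the model (for example, the best_estimator_ from hyperparameter_tuning.py). The sketch below assumes that approach, with a hypothetical file name and made-up example rows:

# Illustrative inference sketch: assumes the pickled object is a full pipeline
# (preprocessing + model), e.g. grid_search.best_estimator_ saved with
# joblib.dump(grid_search.best_estimator_, "best_pipeline.pkl").
import joblib
import pandas as pd

loaded_pipeline = joblib.load("best_pipeline.pkl")

# Made-up rows for demonstration, with the same raw columns as the tips data
new_rows = pd.DataFrame({
    'total_bill': [24.50, 10.10],
    'size': [3, 2],
    'sex': ['Female', 'Male'],
    'smoker': ['No', 'Yes'],
    'day': ['Sun', 'Fri'],
    'time': ['Dinner', 'Lunch']
})

print(loaded_pipeline.predict(new_rows))  # 1 = tip greater than $5, 0 = otherwise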
This tutorial is already a bit long, but I believe the main idea has come across: using the Scikit Learn Pipeline within the modular scripts of a machine learning workflow.
Lastly, what do you think are some of the benefits of implementing Scikit Learn Pipeline in your data science stages?
- Code Simplicity: With Pipeline, you can combine all the steps of the ML pipeline into a single object, making the code more concise and easier to understand.
- Data Leakage Prevention: Pipeline ensures that the data transformation steps are applied only to the appropriate dataset, avoiding data leakage between the training and test sets.
- Hyperparameter Grid Search: When combined with GridSearchCV, Pipeline allows for hyperparameter tuning across the entire pipeline, including data preprocessing and model parameters.
- Model Deployment: Using the Pipeline, you can export the entire pipeline, including all preprocessing steps and the trained model, for deployment in a production environment.
In conclusion, the aim has been to show how machine learning projects can be better streamlined, in line with industry best practices, by using modular scripts in conjunction with the sklearn Pipeline module (and its classes), going well beyond what is demonstrated in many online courses, where the Jupyter Notebook often seems to be the "actual thing".
The code, approaches and structures used here are not cast in stone; depending on the aim, nature and end goal of a data science/machine learning project, there is plenty of room for flexibility.