This is why you should use Pipeline

Do you feel there are workflows in your ML projects that could be automated?

Then you should read this blog.

Once you have done enough modeling and moved past the beginner stage, you will find yourself doing the same few steps over and over again in the same analysis. You need a tool to automate those repeated steps.

And guess what?

Python's scikit-learn has just such a tool: Pipelines, which help you clearly define and automate these workflows.

A Pipeline allows a linear sequence of data transforms to be chained together.

Scikit-learn's Pipeline class is a useful tool for encapsulating multiple transformers alongside an estimator in one object, so that you only have to call your important methods (fit(), predict(), etc.) once.

For a better understanding, let us consider a simple example of a machine learning workflow where we generate features from text data using a count vectorizer and a tf-idf transformer, and then fit a random forest classifier on those features.

But to understand how much Pipeline helps in a project, we should first do the same process without Pipeline and then compare it with the Pipeline version.

Without pipeline:

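from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# X_train, X_test (raw text documents) and y_train (labels) are assumed to exist already
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = RandomForestClassifier()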

# train classifier
X_train_vect = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_vect)
clf.fit(X_train_tfidf, y_train)

# predict on test data
X_test_vect = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_vect)
y_pred = clf.predict(X_test_tfidf)

What are CountVectorizer() and TfidfTransformer()?

What is RandomForestClassifier()?

What are the transformers and the estimator that we saw in the Pipeline definition?

TRANSFORMER: A transformer is a specific type of estimator that has a fit method to learn from training data, and then a transform method to apply a transformation model to new data. These transformations can include cleaning, reducing, expanding, or generating features.

In the example above, CountVectorizer and TfidfTransformer are transformers.

That's why we called vect.fit_transform on the training data and vect.transform on the test data.
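To make the fit/transform split concrete, here is a minimal sketch with CountVectorizer alone. The documents in train_docs and test_docs are made up for illustration and are not part of the original example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for illustration only.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat barked loudly"]

vect = CountVectorizer()

# fit_transform learns the vocabulary from the training documents
# and returns their word-count matrix in one step.
X_train_counts = vect.fit_transform(train_docs)

# transform reuses the learned vocabulary; "loudly" was never seen
# during fit, so it is simply ignored.
X_test_counts = vect.transform(test_docs)

print(sorted(vect.vocabulary_))   # words learned from the training data
print(X_train_counts.shape)       # (number of documents, vocabulary size)
print(X_test_counts.toarray())    # counts over the same vocabulary
```

Notice that the test documents never influence the vocabulary: the transformer only learns during fit.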

ESTIMATOR: An estimator is any object that learns from data. It may be a classification, regression, or clustering algorithm, or a transformer that extracts or filters useful features from raw data. Since estimators learn from data, each must have a fit method that takes a dataset.
In the example, all three objects are estimators, RandomForestClassifier included, because each has a fit method.

PREDICTOR: A predictor is a specific type of estimator that has a predict method to predict on test data based on a supervised learning algorithm, and has a fit method to train the model on training data. The final estimator, RandomForestClassifier, in the example is a predictor.
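One rough way to see these three roles is to check which methods each object exposes. This is only a sketch: the roles dictionary and the hasattr checks are my own illustration, not an official scikit-learn classification.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

steps = {
    "CountVectorizer": CountVectorizer(),
    "TfidfTransformer": TfidfTransformer(),
    "RandomForestClassifier": RandomForestClassifier(),
}

# Hypothetical helper structure: map each object to the roles it can play,
# judged only by which methods it has.
roles = {}
for name, obj in steps.items():
    roles[name] = {
        "estimator": hasattr(obj, "fit"),          # learns from data
        "transformer": hasattr(obj, "transform"),  # can transform data
        "predictor": hasattr(obj, "predict"),      # can make predictions
    }
    print(name, roles[name])
```

All three objects have fit, the two text steps also have transform, and only the random forest has predict.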

Fortunately, we can automate all of this fitting, transforming, and predicting, by chaining these estimators together into a single estimator object. That single estimator would be scikit-learn's Pipeline.

To create this pipeline, we just need a list of (key, value) pairs, where the key is a string containing what you want to name the step, and the value is the estimator object.

With pipeline:


from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])

# train classifier (fit needs the labels as well as the documents)
pipeline.fit(X_train, y_train)

# run all steps on the test set and predict
predicted = pipeline.predict(X_test)

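Now, when we call fit() on the pipeline with the training data, it runs fit_transform on each transformer in turn and then fits the classifier on the result. Calling predict() applies each transform to the test data before calling the classifier's predict. We get the same result as in the previous example without Pipeline (up to the randomness of the random forest), but the code is shorter and simpler.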
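If you want to convince yourself, here is a small self-contained check that runs both versions side by side. The tiny dataset is made up for illustration, and random_state=0 and n_estimators=20 are my own choices so that both random forests are built identically.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only.
X_train = ["good movie", "great film", "bad movie",
           "awful film", "great acting", "bad acting"]
y_train = [1, 1, 0, 0, 1, 0]
X_test = ["great movie", "awful acting"]

# Manual workflow: fit and apply each step by hand.
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = RandomForestClassifier(n_estimators=20, random_state=0)
X_train_tfidf = tfidf.fit_transform(vect.fit_transform(X_train))
clf.fit(X_train_tfidf, y_train)
manual_pred = clf.predict(tfidf.transform(vect.transform(X_test)))

# Same steps, same settings, wrapped in a Pipeline.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# With a fixed random_state, both versions should agree.
print(np.array_equal(manual_pred, pipeline_pred))
```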

But to build a pipeline, each step has to be a transformer, except for the last step, which can be any type of estimator. In our example, since the final estimator of our pipeline is a classifier, the pipeline object can be used as a classifier, taking on the fit and predict methods of its last step. Alternatively, if the last estimator were a transformer, then the pipeline would be a transformer.
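Here is a sketch of that second case: a pipeline made only of the two text steps, which then behaves like a transformer. The name text_features and the toy documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy documents, made up for illustration only.
docs = ["the cat sat", "the dog barked", "the cat barked"]

# No classifier at the end: the last step is a transformer.
text_features = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])

# The pipeline now offers fit_transform/transform instead of predict.
X_tfidf = text_features.fit_transform(docs)

print(X_tfidf.shape)                        # (number of documents, vocabulary size)
print(hasattr(text_features, "transform"))  # the pipeline can transform
print(hasattr(text_features, "predict"))    # but it cannot predict
```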

Isn't this cool?

Pipeline makes our code simple and convenient.

Chaining all of your steps into one estimator allows you to fit and predict on all steps of your sequence automatically with one call. Pipeline handles the smaller steps, so we can focus on higher-level changes, and the workflow becomes easier to understand.
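One example of such a higher-level change is adjusting a parameter of a single step through the pipeline itself. The sketch below uses named_steps and set_params with the "step name, double underscore, parameter name" convention; the values 50 and False are arbitrary and only there for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier()),
])

# Each step can be looked up by the name given in its (name, estimator) pair.
print(type(pipeline.named_steps["clf"]).__name__)

# Parameters of a step are addressed as "<step name>__<parameter name>".
pipeline.set_params(clf__n_estimators=50, vect__lowercase=False)

print(pipeline.named_steps["clf"].n_estimators)
print(pipeline.named_steps["vect"].lowercase)
```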

When the whole Pipeline is passed to a cross-validation routine, all transformations for data preparation and feature extraction are refit within each fold of the cross-validation process. This prevents a common mistake: letting the training process be influenced by your test data.
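A minimal sketch of this, assuming a tiny made-up dataset (docs and labels below are for illustration only, so the scores themselves mean nothing): the whole pipeline is handed to cross_val_score, which refits every step on the training part of each fold.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only.
docs = ["good movie", "great film", "great acting", "good story",
        "bad movie", "awful film", "bad acting", "awful story"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# For every fold, a fresh copy of the pipeline is fit on the training part only,
# so the vocabulary and IDF weights never see the held-out documents.
scores = cross_val_score(pipeline, docs, labels, cv=4)

print(scores)         # one accuracy value per fold
print(scores.mean())
```

Had we vectorized all the documents first and cross-validated only the classifier, the vocabulary and IDF weights would have been learned from the held-out documents too.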

So instead of repeating the same steps, let's use Pipeline.

For more information on Pipeline: