A Quickstart Guide to Auto-Sklearn (AutoML) for Machine Learning Practitioners

Patrycja Jenkner ・Originally published at neptune.ai ・8 min read

This article was originally written by MJ Bahmani and posted on the Neptune blog.


Using AutoML frameworks in the real world is becoming a regular thing for machine learning practitioners. People often ask: does automated machine learning (AutoML) replace data scientists?

Not really. If you're eager to find out what AutoML is and how it works, join me in this article. I'm going to show you auto-sklearn, a state-of-the-art and open-source AutoML framework.

To do this, I had to do some research:

  • Read the papers for auto-sklearn V1 and V2.
  • Took a deep dive into the auto-sklearn documentation and examples.
  • Checked the official Auto Sklearn blog post.
  • Did some experiments on my own.

I also do AutoML research, and I've learned quite a lot so far. After reading this post, you'll know more about:

  • What is AutoML, and who is AutoML for?
  • Why does auto-sklearn matter to the ML community?
  • How to use auto-sklearn in practice?
  • What are the main features of auto-sklearn?
  • A use-case of auto-sklearn with result tracking in Neptune.

Automated Machine Learning

AutoML is a young field. The AutoML community wants to build an automated workflow that could take raw data as input, and produce a prediction automatically.

This automated workflow should automatically do preprocessing, model selection, hyperparameter tuning, and all other stages of the ML process. For example, take a look at the image below to see how Microsoft Azure uses AutoML.

AutoML can improve the quality of work for data scientists, but it's not going to remove them from the cycle.

Experts could use AutoML to increase their job performance by focusing on the best-performing pipelines, and non-experts could use AutoML systems without a broad ML education. If you have 15 minutes to spare, the conversation below might help you understand what AutoML is all about.

What is AutoML: A conversation between Josh Starmer and Ioannis Tsamardinos

AutoML frameworks

There are different types of AutoML frameworks, each with unique features. Each of them automates a few steps of a full machine learning workflow, from preprocessing to model development. In this table, I summed up only a few that are worth mentioning:

Auto-sklearn

auto-sklearn is an AutoML framework built on top of scikit-learn. It's state of the art and open source.

auto-sklearn combines powerful methods and techniques, which helped its creators win the first and second international AutoML challenges.

auto-sklearn is based on defining AutoML as a CASH problem.

CASH = Combined Algorithm Selection and Hyperparameter optimization. Put simply, we want to find the best ML model and its hyperparameters for a dataset within a vast search space that includes plenty of classifiers and a lot of hyperparameters. In the figure below, you can see a representation of auto-sklearn provided by its authors.
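To make the CASH idea concrete, here's a minimal sketch in plain Python. It is not how auto-sklearn searches: I swapped in simple random search for Bayesian optimization, and the toy dataset, the two tiny "algorithms", and every name in the block are made up for illustration. The point is only that the algorithm and its hyperparameters are sampled and scored together.

```python
import random

# Toy 1-D dataset (made up): the label is 1 when x > 0.6.
rng = random.Random(0)
data = [(x, int(x > 0.6)) for x in (rng.random() for _ in range(200))]
train, valid = data[:150], data[150:]


def knn_predict(x, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(sum(label for _, label in nearest) * 2 > k)


def threshold_predict(x, cut):
    """Predict 1 when x exceeds a fixed cut-off."""
    return int(x > cut)


# Joint search space: each algorithm brings its own hyperparameter sampler.
SEARCH_SPACE = {
    "knn": (knn_predict, lambda r: {"k": r.choice([1, 3, 5, 7, 9])}),
    "threshold": (threshold_predict, lambda r: {"cut": r.uniform(0.0, 1.0)}),
}


def validation_accuracy(predict, params):
    hits = sum(predict(x, **params) == label for x, label in valid)
    return hits / len(valid)


def cash_random_search(n_trials, seed=0):
    """Sample (algorithm, hyperparameters) pairs and keep the best one."""
    r = random.Random(seed)
    best = (-1.0, None, None)
    for _ in range(n_trials):
        name = r.choice(sorted(SEARCH_SPACE))
        predict, sample_params = SEARCH_SPACE[name]
        params = sample_params(r)
        score = validation_accuracy(predict, params)
        if score > best[0]:
            best = (score, name, params)
    return best


best_score, best_algorithm, best_params = cash_random_search(n_trials=30)
print(best_algorithm, best_params, round(best_score, 3))
```

The search never tunes "k" in isolation: every trial is a full (algorithm, hyperparameters) pair, which is what makes the problem combined.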

auto-sklearn can solve classification and regression problems. The first version of auto-sklearn was introduced in the paper "Efficient and robust automated machine learning" in 2015, at the 28th International Conference on Neural Information Processing Systems. The second version was presented in the paper "auto-sklearn 2.0: The Next Generation" in 2020.

Auto-sklearn features

What can auto-sklearn do for users? It has several valuable features, helpful for both novices and experts.

By writing just five lines of Python code, beginners can get predictions, and experts can boost their productivity. Here are some of the main features of auto-sklearn:

  • Written in Python, on top of the most popular ML library (scikit-learn).
  • Useful for many tasks, such as classification, regression, multi-label classification.
  • Consists of several preprocessing methods (handling missing values, normalizing data).
  • Searches for optimal ML pipelines among a considerable search space (15 classifiers, more than 150 hyperparameters are searched).
  • State of the art thanks to using meta-learning, Bayesian optimization, ensemble techniques.

How does auto-sklearn work?

Auto-sklearn can solve classification and regression problems, but how? There's a lot that goes into a machine learning pipeline. In general, auto-sklearn V1 has three main components:

  1. Meta-learning
  2. Bayesian optimization
  3. Ensemble building

So when we apply classification or regression to a new dataset, auto-sklearn starts by extracting the dataset's meta-features and, relying on meta-learning, uses them to find similar datasets in its knowledge base.
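Here's a rough sketch of that meta-learning step, under loud assumptions: the knowledge base, the three meta-features, the configurations, and the Euclidean distance are all hypothetical choices of mine, not auto-sklearn's actual internals. It only illustrates the idea of looking up the most similar known datasets and reusing what worked on them as starting points.

```python
import math

# Hypothetical knowledge base: meta-features of previously seen datasets and
# the configuration that worked best on each (all values are made up).
KNOWLEDGE_BASE = [
    {"meta": {"n_samples": 1_000, "n_features": 20, "class_ratio": 0.5},
     "best_config": {"classifier": "random_forest", "n_estimators": 200}},
    {"meta": {"n_samples": 200_000, "n_features": 200, "class_ratio": 0.1},
     "best_config": {"classifier": "gradient_boosting", "learning_rate": 0.1}},
    {"meta": {"n_samples": 500, "n_features": 5, "class_ratio": 0.4},
     "best_config": {"classifier": "k_nearest_neighbors", "n_neighbors": 5}},
]


def to_vector(meta):
    """Put meta-features on comparable scales (log for the size features)."""
    return (
        math.log10(meta["n_samples"]),
        math.log10(meta["n_features"]),
        meta["class_ratio"],
    )


def warm_start_configs(new_meta, n_configs=2):
    """Return the best configs of the datasets most similar to the new one."""
    target = to_vector(new_meta)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda entry: math.dist(to_vector(entry["meta"]), target),
    )
    return [entry["best_config"] for entry in ranked[:n_configs]]


# A hypothetical new dataset described only by its meta-features.
new_dataset = {"n_samples": 10_000, "n_features": 200, "class_ratio": 0.1}
for config in warm_start_configs(new_dataset):
    print(config)
```

The returned configurations would then seed the optimizer, so the search starts from pipelines that already worked on similar data instead of from scratch.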

In the next step, once meta-learning has narrowed the search space enough, Bayesian optimization tries to find the best-performing ML pipelines. In the last step, auto-sklearn builds an ensemble from the best ML pipelines found in the search space.
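The ensemble step can be sketched as greedy ensemble selection: repeatedly add, with replacement, whichever pipeline most improves the averaged prediction on validation data. Treat this as a simplified illustration rather than auto-sklearn's implementation; the pipeline names, labels, and probabilities below are made up.

```python
def accuracy(probabilities, labels):
    """Share of examples where thresholding at 0.5 matches the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probabilities, labels))
    return hits / len(labels)


def average(prediction_sets):
    """Element-wise mean of several lists of predicted probabilities."""
    return [sum(column) / len(column) for column in zip(*prediction_sets)]


def greedy_ensemble(model_predictions, labels, ensemble_size):
    """Repeatedly add (with replacement) the model that helps the ensemble most."""
    chosen = []
    for _ in range(ensemble_size):
        best_name = max(
            sorted(model_predictions),
            key=lambda name: accuracy(
                average([model_predictions[m] for m in chosen + [name]]), labels
            ),
        )
        chosen.append(best_name)
    return chosen


# Made-up validation labels and predicted probabilities from three pipelines.
y_valid = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "pipeline_a": [0.9, 0.2, 0.45, 0.8, 0.3, 0.65, 0.7, 0.1],
    "pipeline_b": [0.6, 0.4, 0.7, 0.3, 0.2, 0.3, 0.9, 0.4],
    "pipeline_c": [0.3, 0.6, 0.8, 0.6, 0.7, 0.2, 0.4, 0.6],
}
members = greedy_ensemble(predictions, y_valid, ensemble_size=3)
ensemble_probabilities = average([predictions[m] for m in members])
print(members, accuracy(ensemble_probabilities, y_valid))
```

In this toy data no single pipeline is perfect on its own, yet the averaged ensemble classifies every validation example correctly, which is the reason for building an ensemble instead of keeping only the single best pipeline.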

Auto-sklearn v2: the new generation

Recently the second version of auto-sklearn went public. Let's review what's changed in the new generation. Based on the official blog post and original paper, there are four improvements:

  • Early stopping: each ML pipeline can now use an early-stopping strategy across the whole search space. This improves performance on large datasets, although it's mostly useful for tree-based classifiers.
  • Improved model selection strategy: one vital step in auto-sklearn is how models are selected. In auto-sklearn V2, the authors used a multi-fidelity optimization method such as BOHB. However, they showed that a single model selection strategy does not fit all types of problems, so they integrated several strategies. To get familiar with the new optimization methods, you can read this article: "HyperBand and BOHB: Understanding State of the Art Hyperparameter Optimization Algorithms."
  • Portfolio building: instead of using meta-features to find similar datasets in the knowledge base, V2 builds a portfolio. You can see this improvement in the image below.
  • Automated policy selection: built on top of the previous improvements to select the best strategy.
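To give a feel for what "multi-fidelity" means, here's a small sketch of successive halving, the building block that HyperBand-style methods rely on. It is an illustration under my own assumptions, not the optimizer shipped in auto-sklearn V2: the toy learning curve, the candidate configurations, and all names are hypothetical.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Score all configs on a small budget, keep the best 1/eta, raise the budget, repeat."""
    if not configs:
        raise ValueError("need at least one configuration")
    survivors = list(configs)
    budget = min_budget
    while True:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        if len(survivors) == 1:
            return survivors[0], budget
        budget *= eta


def toy_evaluate(config, budget):
    """Hypothetical learning curve: the score approaches config['quality'] as budget grows."""
    return config["quality"] * (1 - 0.5 ** budget)


rng = random.Random(0)
candidates = [{"id": i, "quality": rng.random()} for i in range(8)]
winner, final_budget = successive_halving(candidates, toy_evaluate)
print(winner, final_budget)
```

With eight candidates and eta=2, the rounds evaluate 8, 4, and 2 configurations on budgets 1, 2, and 4, so most of the compute goes to the candidates that looked promising early on.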

Auto-sklearn main parameters

Although auto-sklearn might be able to find a well-performing pipeline without you setting any parameters, there are some parameters that you can use to boost your productivity. To check all parameters, visit the official page.

Now let's apply what we learned in a case study and perform some experiments!

Track Auto-sklearn experiments on Neptune

I prepared some notebooks that you can easily download to run the experiments on your own. To walk through all the steps together, follow along below.

Check all the experiments in Neptune

First, you need to install auto-sklearn on your machine. Simply use pip3 for this:

pip3 install auto-sklearn

If you get an error, you may need to install some dependencies first, so please check the official installation page. You can also use the notebooks I prepared for you in Neptune. Then run the following code to make sure the installation works:

import autosklearn
print(autosklearn.__version__)
# 0.12.1

Let's tackle some classification and regression problems.

Auto-sklearn for classification

For the classification problem, I chose a well-loved Kaggle competition: Santander Customer Transaction Prediction. Please download the dataset and randomly select 10,000 records. Then follow the experiments in the first notebook:

# load the dataset and split it into training and validation folds
import pandas as pd
import autosklearn.classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("./sample_train_Santander.csv")
X = train.drop(["ID_code", "target"], axis=1)
y = train["target"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.33, random_state=42
)
# define the model
automl = autosklearn.classification.AutoSklearnClassifier()
# train the model
automl.fit(X_train, y_train)
# predict class probabilities
y_pred = automl.predict_proba(X_val)
# score: ROC AUC on the positive-class probabilities
score = roc_auc_score(y_val, y_pred[:, 1])
print(score)
# collect all models and the run statistics
show_models_str = automl.show_models()
sprint_statistics_str = automl.sprint_statistics()
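If you're wondering how to draw the random 10,000-record sample used above, here's one way to sketch it with only the standard library. The helper name, the seed, and the tiny in-memory CSV (including the var_0 column) are my own stand-ins so the block runs anywhere; for the real data you'd pass an open file handle and n_rows=10000.

```python
import csv
import io
import random


def sample_csv_rows(source, n_rows, seed=42):
    """Return the header plus n_rows data rows drawn without replacement."""
    reader = csv.reader(source)
    header = next(reader)
    rows = list(reader)
    return header, random.Random(seed).sample(rows, n_rows)


# Tiny in-memory stand-in for the downloaded training file (made-up values).
demo = io.StringIO(
    "ID_code,target,var_0\n"
    + "".join(f"train_{i},{i % 2},{i * 0.5}\n" for i in range(100))
)
header, sampled = sample_csv_rows(demo, n_rows=10)
print(header, len(sampled))
```

If the data is already in a pandas DataFrame, calling its sample method with n=10000 and a fixed random_state does the same job in one line.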

We also need to define some configurations to gain more insight into auto-sklearn:

To use the above configuration, you could define the automl object as follows:

import autosklearn.classification
import autosklearn.metrics

# define the model
TIME_BUDGET = 60  # seconds
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=TIME_BUDGET,
    metric=autosklearn.metrics.roc_auc,
    n_jobs=-1,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)
# train the model
automl.fit(X_train, y_train)
# with resampling_strategy='cv', call automl.refit(X_train, y_train) before predicting

Since I tried plenty of different configurations, I tracked them in Neptune. You can see one of them in the image, and check all of them in Neptune.

Once the auto-sklearn model is fitted, you can inspect the best-performing pipelines with PipelineProfiler (pip install pipelineprofiler). To do that, you need to run the following code:

import PipelineProfiler
# automl is the fitted auto-sklearn object created earlier
profiler_data = PipelineProfiler.import_autosklearn(automl)
PipelineProfiler.plot_pipeline_matrix(profiler_data)

Your output should be like this:

On the other hand, I also ran some experiments based on auto-sklearn V2. The result was fascinating. You can see the outcome below:

To use auto-sklearn V2, you can use the following code:

import autosklearn.experimental.askl2
import autosklearn.metrics

TIME_BUDGET = 60  # seconds
automl = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    time_left_for_this_task=TIME_BUDGET,
    n_jobs=-1,
    metric=autosklearn.metrics.roc_auc,
)

Auto-sklearn for regression

The second type of problem which auto-sklearn can solve is regression. I ran some experiments based on the official example in the auto-sklearn documentation.

import autosklearn.regression
from sklearn.metrics import r2_score

# X_train, X_test, y_train, y_test are assumed to be a train/test split
# of the Boston housing data, as in the official example
TIME_BUDGET = 60  # seconds
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=TIME_BUDGET,
    n_jobs=-1,
)
automl.fit(X_train, y_train, dataset_name='boston')
y_pred = automl.predict(X_test)
score = r2_score(y_test, y_pred)
print(score)
# show all models and the run statistics
show_models_str = automl.show_models()
sprint_statistics_str = automl.sprint_statistics()
print(show_models_str)
print(sprint_statistics_str)

I only varied the time budget to see how performance changes with the time limit. The image below shows the results.

Final thoughts

Overall, auto-sklearn is still a new technology. Because auto-sklearn is built on top of scikit-learn, many ML practitioners can quickly try it and see how it works.

The most important advantage of this framework is that it saves experts a lot of time. Its one weakness is that it acts as a black box and doesn't explain how it makes its decisions.

All in all, it's a pretty interesting tool, so it's worth giving auto-sklearn a look.




