Simplifying Machine Learning with AutoGluon

#machinelearning #datascience #automl #tutorial

Machine Learning

We have all seen machine learning (ML) in use, but not everyone recognizes it as machine learning. Some examples we use or encounter every day are:

Recommendation engines: Have you ever wondered how Netflix and other streaming services always show TV shows or movies you might like? Machine learning algorithms use your viewing behavior to discover trends and recommend the titles most relevant to you. Because recommendation is a marketing tool, it isn't limited to streaming services. It is also used on Amazon and even LinkedIn.
Speech recognition: This application of ML uses natural language processing (NLP) to translate human speech into written text. Most of us have encountered this with voice search and virtual personal assistants like Siri, Cortana, and Google Assistant.
Email filtering: When we get mail that doesn't end up in our inbox, it is probably in the promotions or spam folder. The mail service uses machine learning to filter messages, so spam or promotional emails we might not care about end up in those folders while important mail ends up in the inbox.

How does machine learning work?

The applications listed above use different classes of machine learning and different algorithms, but they generally follow the same procedure:

Data collection and preprocessing.
Model selection and training.
Model evaluation and hyperparameter tuning.
Prediction.
Deployment.
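The steps above can be sketched in a few lines of plain Python. This is a toy illustration, not how AutoGluon works internally: the data points are made up, the "model" is a one-variable least-squares line with no hyperparameters to tune, and "deployment" is reduced to wrapping the fitted model in a function.

```python
import math

# 1. Data collection and preprocessing: made-up (temperature, rentals) pairs.
data = [(10, 25), (12, 31), (15, 38), (18, 47), (20, 52), (22, 59), (25, 66), (28, 76)]
train, valid = data[:6], data[6:]  # hold out the last two rows for evaluation


def fit_line(rows):
    """2. Model selection and training: ordinary least squares for y = a*x + b."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
        (x - mean_x) ** 2 for x, _ in rows
    )
    return slope, mean_y - slope * mean_x


def rmse(rows, slope, intercept):
    """3. Evaluation: root mean squared error on the held-out rows."""
    return math.sqrt(sum((slope * x + intercept - y) ** 2 for x, y in rows) / len(rows))


slope, intercept = fit_line(train)
validation_rmse = rmse(valid, slope, intercept)


def predict(temperature):
    """4-5. Prediction and 'deployment': expose the fitted model as a function."""
    return slope * temperature + intercept


print(f"validation RMSE: {validation_rmse:.2f}")
print(f"predicted rentals at 30 degrees: {predict(30):.1f}")
```

In a real project each of these steps is repeated many times with different models and settings, which is the part AutoML tools aim to automate.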

This is an iterative process and can sometimes become very complicated. It might be difficult for even ML experts to keep up with all the best practices in modeling. This is where AutoML comes in.

Understanding AutoML

Automated Machine Learning, or AutoML, is a cutting-edge technology that automates various steps of the machine learning workflow, making it easier for both novices and experts to develop high-performing models. AutoML aims to reduce the human effort required in designing, training, and deploying machine learning models, thus democratizing the use of machine learning across various domains.
AutoGluon is a powerful AutoML framework that automates several critical steps of the machine-learning workflow.

So what problems can AutoGluon solve?

AutoGluon automates the data processing, model creation, and tuning stages of the ML workflow, so that anyone can build and deploy models even without much ML experience.

Some specific problems that AutoGluon can solve are:

Tabular data prediction. AutoGluon can be used to build models for tasks like predicting customer churn or the likelihood of a loan default.
Object detection. AutoGluon can be used to build models for tasks like identifying faces in images, tracking objects, or detecting cars in traffic.
Image classification. AutoGluon can be used to build models for tasks like identifying what objects are in an image or classifying medical images.
Text classification. AutoGluon can be used to build models for tasks like detecting hate speech or classifying emails as spam.
Multimodal prediction. AutoGluon can be used to build models for tasks that combine tabular, text, and image data, such as detecting food in an image and returning a matching recipe.

AutoGluon in Action: Predicting Bike-Sharing Demand

To demonstrate AutoGluon's capabilities, I used the library to tackle the bike-sharing demand competition on Kaggle. By predicting demand, companies like Uber and Lyft can better prepare for spikes in service usage.

First, I imported the essential libraries: pandas for data manipulation and AutoGluon's TabularPredictor for automating the machine learning workflow. I then used the Kaggle command-line tool to download the dataset and parsed the datetime column as dates so that time-related features can be handled properly.

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# Download the dataset from Kaggle and parse the datetime column.
# The lines starting with "!" are shell commands and only run in a Jupyter/IPython notebook.
!kaggle competitions download -c bike-sharing-demand
!unzip -o bike-sharing-demand.zip
train = pd.read_csv("train.csv", parse_dates=["datetime"])
```
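Parsing the datetime column as a real date type is what makes time-based features easy to derive. Here is a minimal sketch, assuming a tiny made-up DataFrame in place of train.csv. The derived column names (hour, dayofweek, month) are illustrative choices and are not taken from the original notebook.

```python
import io

import pandas as pd

# Two made-up rows standing in for train.csv (column names follow the Kaggle data).
csv_text = """datetime,temp,count
2011-01-01 08:00:00,9.84,16
2011-07-15 17:00:00,30.34,412
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

# With a datetime dtype, the .dt accessor exposes calendar components.
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
df["month"] = df["datetime"].dt.month

print(df[["datetime", "hour", "dayofweek", "month"]])
```

Features like these give a tabular model direct access to daily and seasonal patterns that are hard to learn from a raw timestamp.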

The unzipped archive contains separate CSV files for training and testing, plus a sample submission file to be filled in with predicted values. After loading and preprocessing the test data the same way, I trained AutoGluon's tabular prediction model with this short piece of code.

predictor = TabularPredictor(label='count', eval_metric='root_mean_squared_error', problem_type='regression', learner_kwargs={'ignored_columns': ['casual', 'registered']}).fit(train_data=train, presets='best_quality', time_limit=600)

I defined 'count' as the target variable (label) since the bike rental count is what we want to predict. The 'casual' and 'registered' columns are ignored because they are not included in the test.csv file. I set time_limit to 600 seconds, so training runs for at most 10 minutes. I chose the best_quality preset, which favors predictive accuracy over training and inference speed. I also set the evaluation metric to root mean squared error (RMSE), a standard choice for regression.
AutoGluon takes care of preprocessing the data, identifying data types, and generating relevant features automatically.
It then tries a variety of models and parameters until the time limit is reached.
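For reference, the RMSE metric mentioned above is the square root of the mean squared difference between predicted and actual values. The sketch below computes it in plain Python on made-up numbers. It also shows RMSLE, the log-scaled variant that, as far as I know, Kaggle uses to score this competition, so leaderboard scores are not directly comparable to the RMSE that AutoGluon optimizes during training.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rmsle(actual, predicted):
    """Root mean squared logarithmic error; all values must be >= 0."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


# Made-up hourly rental counts and predictions, for illustration only.
actual = [16, 40, 32, 13]
predicted = [20, 35, 30, 10]

print(f"RMSE:  {rmse(actual, predicted):.3f}")
print(f"RMSLE: {rmsle(actual, predicted):.3f}")
```

RMSE penalizes errors in absolute terms, while RMSLE penalizes relative errors, so a miss of 5 rentals matters more at a quiet hour than at a busy one.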

After training, I reviewed the model's performance with the fit summary:
predictor.fit_summary()

We can then generate predictions for the test set with a single line of code:
predictions = predictor.predict(test)
With the predictions ready, I saved them to the submission file to submit to Kaggle for evaluation:

submission_new_features["count"] = predictions
submission_new_features.to_csv("submission_new_features.csv", index=False)
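One detail worth checking before submitting: a regression model can output negative counts, which make no sense for rentals and which a log-based metric cannot score. Here is a minimal sketch, assuming made-up predictions and a two-row stand-in for the sample submission file. In the real workflow, `predictions` would come from `predictor.predict(test)`.

```python
import io

import pandas as pd

# Stand-ins for the sample submission file and the model output (made-up values).
submission = pd.DataFrame(
    {"datetime": ["2011-01-20 00:00:00", "2011-01-20 01:00:00"], "count": [0, 0]}
)
predictions = pd.Series([12.7, -3.1])

# Replace negative predictions with zero, then fill the submission column.
submission["count"] = predictions.clip(lower=0)

# Written to an in-memory buffer here; pass a file path to save to disk instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```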

This submission scored 0.7 (lower is better), which was already promising.
However, it can be improved further with some hyperparameter tuning.

The tuning step uses AutoGluon's automated hyperparameter tuning for the GBM and XGB models, with the goal of a lower root mean squared error (RMSE). It relies on a dictionary called 'hyperparameters' that maps each model ('GBM' and 'XGB') to its respective hyperparameter options. Better hyperparameters should improve the models' predictive performance and reduce prediction errors.
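The dictionary described above might look like the following. This is a sketch under assumptions, not the configuration from the original run: every search range, the number of trials, and the function name `tune_gbm_and_xgb` are illustrative. It also assumes a recent AutoGluon release in which search spaces live in `autogluon.common.space`.

```python
def tune_gbm_and_xgb(train, time_limit=600):
    """Fit a TabularPredictor with a small search space for GBM and XGB.

    `train` is the training DataFrame loaded earlier. All ranges below are
    illustrative assumptions, not the values used in the original experiment.
    """
    # Imported inside the function so the sketch can be loaded
    # even where AutoGluon is not installed.
    from autogluon.common import space
    from autogluon.tabular import TabularPredictor

    hyperparameters = {
        "GBM": {  # LightGBM
            "num_boost_round": 100,
            "num_leaves": space.Int(lower=26, upper=66, default=36),
            "learning_rate": space.Real(0.01, 0.3, default=0.05, log=True),
        },
        "XGB": {  # XGBoost
            "n_estimators": 100,
            "max_depth": space.Int(lower=3, upper=10, default=6),
            "learning_rate": space.Real(0.01, 0.3, default=0.1, log=True),
        },
    }
    hyperparameter_tune_kwargs = {
        "num_trials": 5,  # configurations tried per model
        "scheduler": "local",
        "searcher": "auto",
    }
    return TabularPredictor(
        label="count",
        eval_metric="root_mean_squared_error",
        problem_type="regression",
        learner_kwargs={"ignored_columns": ["casual", "registered"]},
    ).fit(
        train_data=train,
        time_limit=time_limit,
        hyperparameters=hyperparameters,
        hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    )
```

Calling `tune_gbm_and_xgb(train)` would return a fitted predictor restricted to these two model families. Widening the ranges or raising `num_trials` makes the search more thorough but also slower.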

After retraining, the model's performance improved: the score dropped from 0.7 to 0.49. With further optimization, it could drop even more. This demonstrates the power of AutoGluon in efficiently optimizing models for accurate predictions.

Although AutoGluon provides remarkable advantages, such as automating model selection and tuning, it also has limitations:

Difficult to interpret. AutoGluon offers a leaderboard of the trained models and a permutation-based feature_importance() method, but it remains hard to understand why a particular model or ensemble performs best or how its hyperparameters contribute.
Difficult to debug. Because AutoGluon is highly integrated and abstracts many underlying processes, it can be hard to pinpoint the cause of subpar performance.
Resource-intensive hyperparameter tuning. AutoGluon searches over many hyperparameter configurations to find the best combination. This requires substantial computational power, especially for complex datasets and large model architectures, so the size of the search space has to be balanced against practical constraints such as time and hardware.

To gain further insights into AutoGluon's incredible capabilities, I recommend exploring its official documentation at https://auto.gluon.ai/stable/tutorials/timeseries/forecasting-quick-start.html.