Darth Espressius

Posted on

# Gradient Boosting on the GPU

Decisions, decisions, it seems like every data-centric problem simmers down to making some sort of choice. Whether it be choosing a class of object present in an image, or modeling churn prediction by choosing a probability, solving data-related problem is typically centred around making decisions.

## Decision Trees

Source: Song et. al., 2015

Decision and regression trees are a form of supervised learning. These algorithms infer implicit rules from structured data to predict some given outcome.
The first regression tree algorithm was published in a 1963 paper, whilst the first decision tree algorithm was published in this 1972 paper The premise of a decision tree to recursively divide the input attributes in smaller, "purer" subsets, which can then more accurately define a given output.

In other words, say you are attempting to predict what type of personal computer someone may purchase. You have a list of previous purchases of persons given their age and occupation. Naturally, you may first group your input data by age (the younger folk may make a different sort of purchasing decision, aiming for portability, performance or flashier features, whilst the elderly may aim on a larger, more tactile keyboard and included software).

After you split your data by age, you end up with two groups; put another way, your decision making process has created its first 'branch'. These two groups can then be further split by occupation. Persons working in software or AI may look to more performance oriented options, while accountants, managers and writers may focus on ergonomics and portability. This multi-way split into various occupation then splits your original two groups further, resulting in "purer" input attributes.

In other terms, your resulting groups or leaves are of one or a few age groups and one or a few occupations. Yes, a few, it may not be necessary to split on every single occupation, owing to a concept known as overfitting, where your model does not generalize well owing to its incredibly high specificity. To the stats folks, this is when your model exhibits unreasonably high variance owing to unconstrained optimization.

This results in an incredibly interpretable model, since the decision made at each branch is easily found. However, a decision tree may leave some room for accuracy improvement, and can require complete re-training if data changes. The latter point is a significant consideration, as organizations may have large amounts of data, where data drift may result in models slowly becoming less effective over time

## Strength In Numbers

A decision tree ensemble is a group of decision tree classifiers/regressors. A data tuple is mapped to an output leaf for a series of decision trees, and the average of the output is taken. This can assist in improving accuracy. A specific implementation of the ensemble method is the gradient-boosted tree.

The decision tree ensemble is trained to optimize some given loss function

$\mathcal{L} = \sum_{i}l(\hat{y_{i}}, y_{i}) + \sum_{k}\Omega(f_{k})$

The $\Omega$ term above penalizes the impact of any single decision tree. This may sound strange at first, however the situation may arise where two decision trees in the ensemble decide to over- and under- weight any given input attribute, resulting in a net zero effect. For example, if one tree thinks that age positively correlates very strongly with a person buying a less portable machine (i.e. the older folk buy desktops instead of tablets), another decision tree added to the model may instigate an exactly opposite relationship to effectively cancel out the first tree's impact.

This is a phenomena known as parameter explosion and is a serious issue in statistical learning algorithms. (This regularization technique is not limited to Decision-Trees, Regression has its variants in the Lasso and Ridge variants, whilst Deep Learning employs dropout, etc)

An additive function (a new tree) is added greedily to some decision making function to minimize your objective function. In other terms, a gradient-boosting scheme trains and adds the next-best-performing tree at each iteration.
In more mathematical terms, at each iteration, a new newly trained tree which most optimally minimizes a chosen objective function is added to the overall ensemble.

There are a few methods by which trees split input data, the following list does not claim to be exhaustive, but rather gives a general introduction to popular split-finding techniques

#### Exact Greedy

Every possible split of input data (for a given attribute) is enumerated, and the split resulting in the maximal increase in Information Gain is chosen. Information Gain represents the difference in entropy before and after your data is split, whilst entropy measures how "impure" (how many different values occur in the split) your attribute is. For example, following our example above, if we split based on occupation, and any given group of data after the split contains multiple, unrelated occupations, this data is said to be high-entropy, as no one consistent theme is present for the occupation attribute. If, however, the split data contains a single occupation per split, the data is low-entropy.

This implies then, that if the difference in entropy before and after a split is high, the data has become "purified", or has experienced positive information gain. As stated in the name for this method, the next split is chosen greedily (which means the next best option is chosen out of all possible options).

This technique can be computationally expensive to enumerate all splits for continuous features however, since it includes sorting all values for a given attribute and accumulating these gradient statistics for every possible split.

#### Approximate Greedy

Exact greedy above cannot work for data which is not held is main system memory. When dealing with terra- and peta-bytes of data, it is unreasonable to expect all data to be held in memory to perform any sort of statistical analysis. The approximate greedy method of split-finding proposes split points based on percentiles of a given feature by means of samples from the main database. Continuous features are additionally bucketed using these candidate points, and the best solution is found based on aggregated statistics.

#### Sparsity Aware

Data is never perfect. That's a fact of life. In the library we will be using, a default split is defined for missing data based on existing data. This ensures that all data within your input database is used to train the model, and ensures robustness against future missing data. This paper found an exponential decrease in classification time when using sparsity-aware methods. In other words, the algorithm did not have to estimate adjusted gradients at time of classification when subject to missing features, as a pre-baked option was already selected. This can be customised depending on application need.

## The GPU

Or as I like to call it, a concurrency nerd's DREAM. I'm planning on writing an article completely dedicate to the wonders of GPU processing, and how to do some cool things using NVIDIA CUDA, but for now, we will be using NVIDIA's RAPIDS library.

RAPIDS allows execution of end-to-end pipeline entirely on GPUs, and is built on CUDA primitives (a blend of C++ and C code). This allows workloads that are highly parallelizable (single-instruction-multiple-data) to be scaled outwards across multiple GPU cores. Whilst your GPU's individual cores may be much simpler than a CPU's, a GPU typically has hundreds (to thousands) of these cores. For example, the lowest end NVIDIA GPU is currently upwards of EIGHT HUNDRED. The lowest end GPU has orders of magnitude more cores than many high-end desktop CPUs.

Additionally RAPIDS is incredibly compliant with typical Pandas' function calls. What this translates to is a very natural transition from typical Pandas' function calls (which run on your CPU) to RAPIDS API calls, which run by default on your GPU. See here for a complete reference the RAPIDS' direct analog to Pandas.

# GPU-powered API
import cudf as cd

# usual import
import pandas as pd

# does exactly the same thing from a programmer's perspective
cd.DataFrame(data)
pd.DataFrame(data)



Now the fun part, since gradient-boosting involves iteratively adding decision trees to a main model, at first it may seem completely counter-intuitive to attempt to run this on GPU. However, we are not parallelizing tree creation, RAPIDS works to parallelized across data. Data points for each iteration will be scaled across your multiple GPUs cores in the background to essentially "unroll" the summation expressed in the additive equation above.

Enough talk, it's time to code.
Note, this will require you to have a RAPIDS-compatible GPU, the latest version of the CUDA Toolkit and RAPIDS installed. See the following links on how to check for compatibility with RAPIDS, and how to install both RAPIDS and the CUDA toolkit:

(If you have a 10-series NVIDIA GPU or newer, you should be fine)

This does not claim to be a fully end-to-end tutorial, this highlights some of the main features of the RAPIDS and XGBoost APIs.

Let's ensure we have our libraries imported

import cupy as cp
import xgboost as xgb


Let's also say we have sorted our data set, and have numpy arrays of the training and validation data. From these arrays, we need to wrap this data in XGBoost's DMatrix format. This according the XGBoost Documentation is a more optimized data wrapper for Extreme Gradient boosting.

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalidation = xgb.DMatrix(X_validation, label=y_validation)


#### Tilling the Soil

Now we need to specify some parameters for our Gradient-Boosted Tree. I'm ignoring some safety-checking for the purposes of clarity. If this were being deployed, the code would first check for the existence of a GPU, then proceed to create the tree

params = {
'tree_method':'gpu_hist',
'n_gpus': 1,
'eval_metric': # your choice of metric
'objective': # some objective function
}


The most important part of the above is the assignment 'tree_method': 'gpu_hist'. This tells XGBoost to use a CUDA-accelerated GPU-based tree construction method.

#### Light, Water and Love

Now it's time to grow our tree. XGBoost and RAPIDS make this incredibly simple:

num_round = 10
bast = xgb.train(params, dtrain, num_round)


And that's it! You have successfully grown a Gradient-boosted tree on the GPU.