Divyanshu Katiyar for Kern AI


Hyperparameter Optimisation for Machine Learning - Bayesian Optimisation

When building a machine learning or deep learning model, have you ever run into the dilemma of setting parameters that directly affect your model, such as how many layers to stack? How many units or filters each layer should have? Which activation function to use? These architecture-level parameters are called hyperparameters, and tuning them plays an important role in finding the configuration that yields the best model performance.
In this blog we will cover the concepts behind how Bayesian optimisation works and how it compares in speed to the random search and grid search hyperparameter optimisation methods.

Introduction

Algorithms today serve as proxies for decision-making processes traditionally performed by humans. As we automate these processes, we must repeatedly ask whether our algorithms are behaving as intended, and if not, how far they fall short of state-of-the-art standards. Sometimes even the best-trained model underperforms because of the inaccuracies it produces on the validation data. Hyperparameter optimisation is the process of tuning the parameters of a machine learning model that are not learned from the data but set manually by the user, in order to improve its performance. For example, if the model in question is a neural network (NN), the learning rate set by the user counts as a hyperparameter. More complex machine learning models usually have more parameters to fine-tune. The goal is to find the set of hyperparameters that results in the best performance of the model.

Grid Search

One of the most common methods for this kind of problem is grid search, where a pre-defined set of values is used to train and evaluate the model. The model is trained over and over again for every combination of hyperparameter values, and the combination that delivers the best performance is chosen. However, this can be extremely time-consuming when the number of hyperparameters and the number of candidate values are large. For example, suppose we have 3 hyperparameters, each of which takes a list of 3 values. The number of combinations to process would be 3 x 3 x 3 = 27. More generally, if 'm' is the number of hyperparameters to be optimised and each of them has a list of 'n' values, then the number of combinations would be n^m, which blows up quickly as 'm' and 'n' grow. We need a more efficient way to reduce the time complexity of this fine-tuning.
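
As a quick illustration (the classifier and parameter names below are hypothetical placeholders, not part of refinery or bricks), this is how an exhaustive grid search could look with scikit-learn's GridSearchCV:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid: 3 hyperparameters with 3 values each -> 3^3 = 27 combinations,
# every single one of which gets trained and cross-validated.
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
# grid.fit(X, y)  # X, y: your feature matrix and labels
# print(grid.best_params_)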

Random Search

Random search is another method, where random values are chosen from a pre-defined range for each hyperparameter. The model is trained and evaluated for every set of random values. It is much faster than grid search since it relies on random sampling; however, it can still be very time-consuming and might not find the best set of hyperparameters. Let us assume a hypervolume (a measure of the size of the feasible region in a multi-dimensional space) v_{\epsilon} in which the objective function takes values within 1 - \epsilon of its maximum. Random search will then sample this hypervolume with probability

P(\epsilon) = \frac{v_{\epsilon}}{V}

where 'V' is the total search space volume. If the total search space scales as R^d and the hypervolume scales as r^d ('d' being the input dimension), then random search would need a number of samples on the order of \left(\frac{r}{R}\right)^{-d} to land in it. If r << R, this becomes really expensive!
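
For comparison with the grid search sketch above, here is a minimal random search sketch with scikit-learn's RandomizedSearchCV; again, the estimator and parameter ranges are hypothetical placeholders. The key difference is that n_iter caps the number of sampled combinations instead of enumerating the whole grid:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space: combinations are sampled at random rather than enumerated.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(3, 20),
    "min_samples_split": randint(2, 11),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,  # only 20 sampled combinations, regardless of how large the space is
    cv=5,
    random_state=42,
)
# search.fit(X, y)  # X, y: your feature matrix and labels
# print(search.best_params_)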

Bayesian Optimisation

To mitigate the aforementioned problem, there is yet another method, called Bayesian optimisation. It works faster and more efficiently than the grid search and random search algorithms. It uses Bayesian inference to model the unknown function that maps the hyperparameter values to the performance of the model. The major advantage is that it uses information about past iterations to inform the next set of iterations. Recall Bayes' theorem: the conditional probability of an event A, given that event B has already occurred, is

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

We can further simplify this by dropping the normalising value P(B), which leaves a proportionality:
P(A|B) \propto P(B|A) \cdot P(A)

What we have calculated here is known as the posterior probability, and it is obtained from the reverse conditional probability P(B|A), also called the likelihood, and the prior probability P(A). Suppose we have some sample values x = \{x_1, x_2, \ldots, x_n\} that we evaluate using an objective function f(x), and we build a dataset D out of the samples and the values the function returns on them. The likelihood in this case is the probability of observing the data given the function, P(D|f). By iteratively maximising this likelihood, the search is steered towards the best hyperparameters in the sample space.
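
To make the idea of "using past iterations to inform the next one" concrete, here is a minimal, self-contained sketch of a Bayesian optimisation loop on a toy one-dimensional objective. It uses a Gaussian process as the surrogate model and expected improvement as the acquisition function; this is an illustration of the general technique, not the internals of any particular library:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # Toy objective; in practice this would be the model's validation score.
    return -(x - 2.0) ** 2 + 3.0

# A few initial evaluations form the dataset D = {(x_i, f(x_i))}.
X_obs = np.array([[0.0], [1.0], [4.0]])
y_obs = objective(X_obs).ravel()

candidates = np.linspace(-1.0, 5.0, 200).reshape(-1, 1)

for _ in range(10):
    # Surrogate model: posterior belief about the objective given the data so far.
    gp = GaussianProcessRegressor(alpha=1e-6).fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement: balances the predicted mean against the uncertainty.
    best = y_obs.max()
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # Evaluate the objective at the most promising candidate and extend D.
    x_next = candidates[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next[0]))

print(X_obs[np.argmax(y_obs)][0])  # best x found, close to the true maximum at x = 2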

Applications in NLP and bricks

In NLP, hyperparameter optimisation can be used to tune the hyperparameters of a word embedding model such as word2vec, GloVe, etc. We can train the embedding models with different sets of hyperparameters and evaluate their performance on basic NLP tasks like text classification, named entity recognition, sentiment analysis, etc. Bayesian optimisation will soon be available as an active learner in bricks, alongside the other active learners like random search and grid search. Since it is an active learner, there is no live runtime environment for it in bricks; instead, you copy the code from the code snippet and paste it into refinery. In refinery, we use sentence embeddings (one vector per sentence) to train the ML models, which can be obtained from large language models such as BERT, or from simpler methods like TF-IDF/BoW. Word2vec/GloVe produce word embeddings (one vector per word), which are more useful if you are interested in learning about the relationships between words and phrases; a small sketch of the difference follows below. You can find more information about active transfer learning and its applications in refinery in this blog.
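
To illustrate the difference between the two kinds of embeddings (the example texts below are placeholders), a sentence embedding gives you one vector per text, here produced with TF-IDF, whereas word2vec gives you one vector per word:

from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the service was great", "the food was awful"]  # placeholder sentences

# Sentence embeddings: one vector per sentence (here via TF-IDF).
sentence_vectors = TfidfVectorizer().fit_transform(texts)
print(sentence_vectors.shape)  # (2, vocabulary size)

# Word embeddings: one vector per word (here via word2vec).
tokenised = [t.split() for t in texts]
w2v = Word2Vec(tokenised, vector_size=50, min_count=1)
print(w2v.wv["food"].shape)  # (50,)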

Here is an example that shows an implementation of Bayesian optimisation for a text classification task:

import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization

def bayesian_optimization(sentences, labels, n_iter=10, cv=5, random_state=42):
    """
    Perform Bayesian optimisation to tune the hyperparameters of a word2vec model for text classification.
    """

    def cross_validation_score(dim, window, negative):
        """
        Evaluate a set of word2vec hyperparameters on the text classification task via cross-validation.
        """
        model = Word2Vec(sentences, vector_size=int(dim), window=int(window), negative=int(negative), min_count=1)
        # represent each (tokenised) sentence by the average of its word vectors
        X = np.array([
            np.mean([model.wv[word] for word in sentence if word in model.wv]
                    or [np.zeros(int(dim))], axis=0)
            for sentence in sentences
        ])
        clf = RandomForestClassifier(random_state=random_state)
        score = cross_val_score(clf, X, labels, cv=cv).mean()
        return score

    optimizer = BayesianOptimization(
        f=cross_validation_score,
        pbounds={
            "dim": (50, 150),      # (lower, upper) bounds for each hyperparameter
            "window": (2, 6),
            "negative": (5, 15)
        },
        random_state=random_state
    )
    optimizer.maximize(init_points=5, n_iter=n_iter)  # maximise the cross-validation score
    best_params = optimizer.max["params"]
    best_params["dim"] = int(best_params["dim"])
    best_params["window"] = int(best_params["window"])
    best_params["negative"] = int(best_params["negative"])
    return best_params
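
A hypothetical call could look like the following, where sentences is a list of tokenised texts and labels holds the corresponding classes (placeholder data, shown only to illustrate the expected input format):

# sentences = [["the", "service", "was", "great"], ["the", "food", "was", "awful"], ...]
# labels = [1, 0, ...]
best_params = bayesian_optimization(sentences, labels, n_iter=10)
print(best_params)  # e.g. {'dim': 112, 'window': 3, 'negative': 8}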

To conclude, grid search is great if you already know which hyperparameters work well and their search space is not too big. Random search works well if you have little prior knowledge of the hyperparameters to fine-tune, but try not to use it on very large search spaces, for the reasons explained above. Bayesian optimisation is a powerful method that can be used more efficiently and effectively than random search and grid search; in comparison, it can be almost 10 times faster than the grid search method. It serves many crucial use cases in the domain of natural language processing, and with its flexibility in the choice of acquisition function, it can be tailored to a wide range of problems.

Citations:

  1. An active approach to Bayesian optimisation
  2. Bayes probability and likelihood
  3. gensim Python package
