Hi everyone!

This is the third of the series of articles on cross-validation and hyperparameter tuning. Here are the links to the first two if you’d like to give them a read before going ahead with reading this blog post.

- Role of Cross-Validation in Model Validation and Hyperparameter Search
- Hyperparameter Tuning: Understanding Grid Search

Let us quickly outline the key ideas that we’ve covered thus far.

In the first article, we observed how

`train/test split`

depends heavily on the particular data records that end up in the training and test sets, which led us to understand the need for cross-validation as a way to effectively evaluate a model’s performance.We then saw how cross-validation can be used in hyperparameter search; If you can remember, we decided to choose that value of the hyperparameters that yielded the highest cross-validated accuracy.

In the second article, we looked at searching for the best hyperparameters more conveniently using

`GridSearchCV`

in`scikit-learn`

. If you remember, towards the end of the article, we touched upon the fact that performing Grid Search to tune hyperparameters is actually computationally inefficient.

We shall now try to rephrase it better, in a more formal way.

- Say we have to search for
`M`

parameters; Let`p_1, p_2,p_3, …, p_M`

be the`M`

parameters. Let the number of values that we would like to search over for`p_1`

be`n1`

, for`p_2`

be`n2`

, and so on, with`nM`

values for`p_M`

. Grid Search considers all possible hyperparameter settings (combinations) into account and creates a model for each possible setting to choose the best model with optimal hyperparameters.

To understand it better, assume that out of M parameters, we decide to freeze the values of all hyperparameters except one, say the M_th parameter

`p_M`

. So, Grid Search involves searching through the list of nM values for the M_th hyperparameter; So, there are`nM`

models created.Suppose we now freeze the values of all hyperparameters except two, say the last two

`(p_M and p_(M-1))`

. We now have to search through**all possible combinations**of`p_M`

and`p_(M-1)`

, each having`nM`

and`n_(M-1)`

possible values that we could search over.We now take a step back and freeze the value of

`p_M-1`

and search through all values for`p_M`

; To account for all possible combinations, we should repeat the procedure for all`n_M-1`

values for`p_M-1`

. So, this process would leave us with`n_(M-1) * nM`

models.

Hope it’s clear how the complexity scales with increasing values of the number of values each hyperparameter could take. For the above example with M hyperparameters, we would have `n1*n2*n3*…*n_M`

models.This is why we said that things could scale up quickly and become computationally intractable with Grid Search.

With this motivation to make hyperparameter search computationally more efficient, let us proceed to understand Randomized Search.

## Understanding RandomizedSearchCV

In contrast to `GridSearchCV`

, not all parameter values are tried out in `RandomizedSearchCV`

, but rather a fixed number of parameter settings is sampled from the specified distributions/ list of parameters.

If some of the hyperparameters that we’re searching for are continuous, then we should specify the distribution rather than the list of values, while defining the parameter grid. How do we define the fixed number of parameter settings?

The number of parameter settings that are tried is given by `n_iter`

. There’s quality vs computational cost trade-off in picking `n_iter`

.

A very small value of

`n_iter`

would imply that we’re more likely to find a suboptimal solution, because we are actually considering too few combinations.A very high value of

`n_iter`

would mean we can ideally get closer to finding the best hyperparameters that yield the best model, but this again comes with a high computation cost as before. In fact, if we set`n_iter= n1*n2*n3*…*n_M`

from the previous example, then, we’re essentially considering all possible hyperparameter combinations and now Randomized Search and Grid Search are equivalent.

Let us build on the same example of `KNNClassifier`

from the previous blog posts; In addition to `n_neighbors`

, let us also search for the `optimal weighting strategy`

— ‘`uniform`

’ where all points are weighted equally and ‘`distance`

’ option weights points by the inverse of their distance. And now, let us implement Randomized Search in `scikit-learn`

and do the following steps, as we did for Grid Search.

**1. Import RandomizedSearchCV class**

```
from sklearn.model_selection import RandomizedSearchCV
```

**2. Define the parameter grid**

```
# specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=k_range, weights=weight_options)
```

**3. Instantiate the grid; Set n_iter=10, Fit the grid & View the results**

```
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5, return_train_score=False)
rand.fit(X, y)
pd.DataFrame(rand.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
#DataFrame
mean_test_score std_test_score params
0 0.973333 0.032660 {'weights': 'distance', 'n_neighbors': 16}
1 0.966667 0.033333 {'weights': 'uniform', 'n_neighbors': 22}
2 0.980000 0.030551 {'weights': 'uniform', 'n_neighbors': 18}
3 0.966667 0.044721 {'weights': 'uniform', 'n_neighbors': 27}
4 0.953333 0.042687 {'weights': 'uniform', 'n_neighbors': 29}
5 0.973333 0.032660 {'weights': 'distance', 'n_neighbors': 10}
6 0.966667 0.044721 {'weights': 'distance', 'n_neighbors': 22}
7 0.973333 0.044222 {'weights': 'uniform', 'n_neighbors': 14}
8 0.973333 0.044222 {'weights': 'distance', 'n_neighbors': 12}
9 0.973333 0.032660 {'weights': 'uniform', 'n_neighbors': 15}
```

**4. Examine the best score and best hyperparameters**

```
# examine the best model
print(rand.best_score_)
print(rand.best_params_)
# Output
0.9800000000000001
{'weights': 'uniform', 'n_neighbors': 18}
```

**Parameters of the best model**

- Surprisingly, we see that the highest accuracy score obtained in this case, where we only looked at 10 different parameter settings instead of 60 in Grid Search, is the same as before: 0.98 ✔
- And the value for n_neighbors= 18, which is also one of the optimal values that we got when we initially searched for the optimal value of n_neighbors. (Recall from the earlier blog post with the same example)

Maybe we just got lucky?

What is the guarantee that we will always get the best results?

Ah, this question makes perfect sense, doesn’t it?

Let us do the following now: Let us run `RandomizedSearchCV`

for multiple times and see how many times we really end up getting lucky!

➡️Run RandomizedSearchCV 20 times and see what happens; We log the best_score_ for every run.

```
# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for _ in range(20):
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, return_train_score=False)
rand.fit(X, y)
best_scores.append(round(rand.best_score_, 3))
Let us examine all the 20 best scores now.
print(best_scores)
# Output: Best Scores
[0.973, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.973, 0.98, 0.973, 0.973, 0.98, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.98, 0.973]
```

Upon examining the best scores above for all the 20 runs, we see that we get the best accuracy score of 0.98 about 13 times.

Looks like we’re lucky indeed! What about the other 7 times when we didn't quite get the best accuracy score? These accuracy scores are around 0.973 which is pretty close to 0.98.

This observation convinces us that even though Randomized Search may not always give the hyperparameters of the best performing model, the models obtained by using these hyperparameters do not perform much worse compared to the best model obtained from Grid Search. This means, the best models thus obtained, with the hyperparameters from randomized search are clearly very close to the optimal model.

In essence, these may not be the best hyperparameters, but certainly close to the best hyperparameters, except that these are found under resource-constrained settings. Hope you all understood how we could use Randomized Search for hyperparameter tuning.

Thanks for reading! Happy Learning! Until next time!

## References

[1] Here’s the link to the Google Colab notebook for the example discussed above

[2] Introduction to Machine Learning in Python with scikit-learn by DataSchool.

[3] Scikit-learn documentation- RandomizedSearchCV

http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Photo by Chris Lawton on Unsplash

## Discussion (0)