Sean Atukorala

Cross-validation in Machine Learning Explained

Figure 1: Scenery of Newfoundland

Cross-validation is a technique for evaluating machine learning models by training several models on subsets of the available input data and evaluating them on the complementary subsets of the data.

In other words, cross-validation judges the performance and accuracy of a given machine learning model by using several different splits of the data into training and testing sets, sometimes across multiple rounds of validation, and then aggregating the results into a single, more reliable evaluation of the model.

In some types of cross-validation, such as k-fold cross-validation, the subsets used for training and the subset used for validation are rotated over multiple rounds in order to produce a more accurate conclusion about the performance of the given machine learning model.

While the main benefit of cross-validation is testing a machine learning model's ability to predict new data, it is also very effective at detecting overfitting and selection bias.

Cross-validation allows us to compare the results of various machine learning models and get a sense of how well each one of them will work in practice.
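
For example, here is a minimal sketch of comparing two models with scikit-learn's `cross_val_score`. The synthetic dataset, the two model choices, and the five-fold setup are all arbitrary illustrative assumptions:

```python
# A minimal sketch of comparing two models with cross-validation,
# using scikit-learn and a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=42))]:
    # 5-fold cross-validation: five accuracy scores, one per held-out fold
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```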

Figure 2: General process of training models

General process for Cross-validation:

Although there are various types of cross-validation, all of them adhere to the general workflow listed below (a short code sketch follows the list):

  1. Divide the dataset into two different sections: the training section and the testing section
  2. Train the model using the training dataset
  3. Validate the model using the testing dataset
  4. Repeat steps 2 and 3. The exact number of repetitions depends on the specific type of cross-validation that you choose to use
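
Here is a rough sketch of that workflow written out by hand with scikit-learn's `KFold`, so each step is visible. The iris dataset, the five repetitions, and the logistic regression model are arbitrary, illustrative choices:

```python
# Steps 1-4 above, spelled out with an explicit k-fold loop.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Step 1: divide the dataset into training and testing sections
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Step 2: train the model using the training section
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Step 3: validate the model using the testing section
    scores.append(model.score(X_test, y_test))
# Step 4: after repeating, aggregate the per-round scores
print("mean accuracy:", np.mean(scores))
```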

Figure 3: General process for Cross-validation

Different types of Cross-validation:

There are numerous types of cross-validation. Here are brief descriptions of the most common ones:

  • K-fold cross-validation: This type of cross-validation divides the dataset into k groups (folds); each fold takes a turn as the test set while the remaining k-1 folds are used for training, following the general process above (see the splitter sketch after this list).
  • Hold-out cross-validation: Hold-out cross-validation randomly divides the entire dataset into a training set and a validation set. Its advantage is faster execution: since the dataset is split into only two sets, the model is built just one time.
  • Stratified k-fold cross-validation: This is identical to k-fold cross-validation except that it ensures each fold has approximately the same proportion of samples for each categorical value (class). This makes it highly useful when dealing with imbalanced datasets. Keep in mind that stratified k-fold cross-validation is not suitable for time series data, since the samples are selected in random order.
  • Leave-p-out cross-validation: In this type of cross-validation, p data points are used as the test set while the remaining data points are used for training. Its advantage is that it uses the whole dataset for both training and testing and therefore yields much less biased results, but it is not suitable for large datasets, because exhaustively enumerating every possible size-p test set is extremely time-consuming and computationally expensive.
  • Monte Carlo (shuffle-split) cross-validation: This type of cross-validation splits the dataset randomly into training and testing sets, and the number of iterations is not fixed but is decided by the analyst. In contrast to k-fold cross-validation, where each fold serves as the test set exactly once, shuffle-split draws a fresh random training and test set in each round, so a given sample may appear in the test set several times or not at all (more on this in these Stack Overflow answers: link1, link2). Shuffle-split is a good cross-validation technique for large datasets but is not suitable for imbalanced datasets.
  • Time series (rolling) cross-validation: In this type of cross-validation the test set consists of only a single observation/sample, while the training set consists of all the observations/samples that occur before it in time. When the cross-validation process first starts, the test results are not reliable, because the model is trained on only a small training set. Therefore, to leverage the full power of time series cross-validation, the model's accuracy should be judged mainly on the rounds whose test sets come from later observations (a rolling-window sketch follows the list, after the splitter example).
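
As a quick orientation, here is a rough sketch of how several of the strategies above map onto scikit-learn's splitter classes. The toy dataset, the split counts, and the test sizes are arbitrary choices for illustration:

```python
# Counting the validation rounds produced by different splitters.
import numpy as np
from sklearn.model_selection import (KFold, LeavePOut, ShuffleSplit,
                                     StratifiedKFold, train_test_split)

# Tiny toy dataset: 10 samples, 2 balanced classes
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# Hold-out: one random train/validation split, so the model is built once
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

splitters = {
    "k-fold": KFold(n_splits=5),
    "stratified k-fold": StratifiedKFold(n_splits=5),  # preserves class ratios
    "leave-p-out (p=2)": LeavePOut(p=2),  # every possible size-2 test set
    "shuffle-split": ShuffleSplit(n_splits=5, test_size=0.3, random_state=0),
}
for name, splitter in splitters.items():
    # Materialize the splits to count the rounds each strategy produces
    n_rounds = len(list(splitter.split(X, y)))
    print(f"{name}: {n_rounds} rounds")
```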
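
And the rolling scheme from the last bullet can be sketched with scikit-learn's `TimeSeriesSplit`; the number of observations and of splits here are again illustrative assumptions:

```python
# Rolling (time series) cross-validation: the training window grows
# forward in time and the test set always comes after it.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(8, 1)  # 8 observations ordered in time

# test_size=1 keeps a single observation per test set, matching the
# description above (the test_size argument needs scikit-learn >= 0.24)
tscv = TimeSeriesSplit(n_splits=4, test_size=1)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "-> test:", test_idx)
# Early rounds train on very little data, so the later rounds are the
# more reliable ones to judge the model's accuracy by.
```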

Advantages and Disadvantages of Cross-validation:

Overall, I believe that there are more advantages to cross-validation than disadvantages.

Some pros of cross-validation include:

  • Minimizes overfitting issues: This is accomplished by splitting the dataset into multiple sections and training the model on a different combination of sections each round. Doing so makes the evaluation more robust and therefore helps detect and reduce overfitting.
  • Increases efficiency of data usage: This is because almost all the observations/samples in the dataset are used for both training and testing purposes
  • Model accuracy tends to increase: This is due to all the data in the dataset (or almost all of it) being utilized to train the model.

Here are some disadvantages of cross-validation:

  • Expensive computational cost: Because multiple folds of the dataset are used for training and testing, computational cost inevitably increases.
  • High training time: Because a model must be trained on multiple training sets, using cross-validation can incur higher-than-average training times.

Conclusion

That's it for this blog post on Cross-validation.

Thanks for reading this blog post!

If you have any questions or concerns, please feel free to post a comment on this post and I will get back to you if I find the time.

If you found this article helpful please share it and make sure to follow me on Twitter and GitHub, connect with me on LinkedIn and subscribe to my YouTube channel.
