Learn how to split your data for training and testing your machine learning models with K Fold Cross Validation.
- Conceptual example
- How to code
- Magical no-code solution ✨🔮
K-fold cross-validation is a data partitioning technique that splits an entire dataset into k groups. We then train and test k different models using different combinations of those groups, and use the results from these k models to check the model’s overall performance and how well it generalizes.
In the context of machine learning, a fold is a set of rows in a dataset. The k in k-fold is the number of groups we decide to partition the data into. With 20 rows, for example, we can split them into 2 folds with 10 rows each, 4 folds with 5 rows each, or 10 folds with 2 rows each.
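As a quick check on the arithmetic above, here is a minimal sketch in plain Python. The helper name `split_into_folds` is made up for illustration, and it assumes the number of rows divides evenly by k.

```python
def split_into_folds(n_rows, k):
    """Split row indices 0..n_rows-1 into k consecutive folds of equal size."""
    if n_rows % k != 0:
        raise ValueError("this sketch assumes n_rows is divisible by k")
    fold_size = n_rows // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]


# The three ways of folding 20 rows mentioned above
for k in (2, 4, 10):
    folds = split_into_folds(20, k)
    print(f"k={k}: {len(folds)} folds with {len(folds[0])} rows each")
```

Real libraries also handle row counts that don't divide evenly by making some folds one row larger; this sketch skips that case to keep the idea visible.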
A simple explanation of how k-fold cross validation scores a model’s performance is:
The entire dataset is randomly split into k equally sized, non-overlapping folds, so no row appears in more than one fold.
We use k-1 folds for model training, and once that model is complete, we test it using the remaining 1 fold to obtain a score of the model’s performance.
We repeat this process k times, each time holding out a different fold, so we end up with k models and one score for each.
Lastly, we take the mean of the k scores to evaluate the model’s performance.
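The steps above can be sketched in plain Python. This is only an illustration under loud assumptions: the "model" is a stand-in that always predicts the most common label it saw during training, the labels are made up, and the function name `k_fold_scores` is hypothetical. A real workflow would plug in an actual learning algorithm and dataset.

```python
import random
from statistics import mean


def k_fold_scores(labels, k, seed=0):
    """Return one accuracy score per fold using a majority-class 'model'."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)        # random split
    folds = [indices[i::k] for i in range(k)]   # k non-overlapping folds

    scores = []
    for test_fold in folds:                     # repeat k times
        held_out = set(test_fold)
        train_labels = [labels[i] for i in indices if i not in held_out]
        # "Training": remember the most common label in the k-1 training folds.
        prediction = max(set(train_labels), key=train_labels.count)
        # "Testing": accuracy of that constant prediction on the held-out fold.
        correct = sum(labels[i] == prediction for i in test_fold)
        scores.append(correct / len(test_fold))
    return scores


labels = [0] * 14 + [1] * 6   # made-up binary target with 20 rows
scores = k_fold_scores(labels, k=4)
print("score per fold:", scores)
print("mean score:", mean(scores))
```

Each row lands in exactly one test fold, so every row contributes to the final mean exactly once.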
To improve your understanding twice-fold 😏, consider this analogy about k-fold cross validation with Twice, a K-pop girl group. Say we are trying to see how well a model can dance by inviting different subsets of Twice girls (called folds) as training and test samples.
If the entire dataset has 9 girls, which are our data points, then we need to manually choose how many folds to split our data into. I’m going with 3 for our example, but there are strategies to pick the best k.
Since we need an equal amount of data in each fold, we randomly pick 3 girls from Twice for each of the three folds, with no overlaps.
With these 3 folds, we will train and evaluate 3 models (because we picked k=3), training each one on 2 folds (k-1 folds) and using the remaining fold as the test set. We pick a different combination of folds for each of the 3 models we’re evaluating.
Model 1: Trained on Fold 1 + Fold 2, Tested on Fold 3
Model 2: Trained on Fold 2 + Fold 3, Tested on Fold 1
Model 3: Trained on Fold 1 + Fold 3, Tested on Fold 2
The performance scores would get skewed if the same Twice girls who taught you how to dance were also your judges. So whichever six girls (data points) the model learns from, the remaining three girls would judge and score it.
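To make the teacher/judge separation concrete, here is a small sketch of the 9-member, 3-fold setup. The fold assignment is random and purely illustrative (the seed is arbitrary), so it won't necessarily match the folds described above.

```python
import random

members = ["Nayeon", "Jeongyeon", "Momo", "Sana", "Jihyo",
           "Mina", "Dahyun", "Chaeyoung", "Tzuyu"]

shuffled = members[:]
random.Random(1).shuffle(shuffled)   # arbitrary seed, just for repeatability
folds = [shuffled[i:i + 3] for i in range(0, 9, 3)]   # 3 folds of 3 members

for test_index, judges in enumerate(folds):
    # Everyone outside the held-out fold acts as a teacher (training data).
    teachers = [m for i, fold in enumerate(folds) if i != test_index for m in fold]
    # No one both teaches and judges in the same round.
    assert not set(teachers) & set(judges)
    print(f"Round {test_index + 1}: taught by {teachers}, judged by {judges}")
```

Across the three rounds, every member judges exactly once and teaches twice.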
Now that you have 3 models and their scores, we can choose a model evaluation method (discussed in another lesson) to determine, in general, whether this model dances well. Combining the scores also ensures that the opinions of all 9 judges (test samples) are included in one metric.
The resulting evaluation metric would tell us whether we did a good job at dancing. So did we do a good job?
Let’s try to evaluate how well a model learns to predict whether customers of a tourism company flake on their plans or not using Tejashvi’s dataset. Maybe this model could tell us whether we’d follow through with our dreams of vacationing overseas this year, too?
```python
import pandas as pd

df = pd.read_csv("Customertravel.csv")

df
```
Since scikit-learn takes numpy arrays, we’d first have to use Pandas to convert our data frame into a numpy array.
Then, we can use the “KFold” class to configure our evaluation. Our next step is to choose the number of folds to split our rows of data into. Above, we can see that our dataset has 954 rows, which divides nicely into 9 folds with 106 rows of data each.
This means we’d build and evaluate 9 models total, using 8 folds as training and 1 for scoring each.
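The fold arithmetic is easy to double-check. This small sketch uses only the row count quoted above; the variable names are just for illustration.

```python
n_rows, k = 954, 9

# Rows per fold, and any rows left over if the split were uneven
fold_size, leftover = divmod(n_rows, k)

# Each model trains on k-1 folds and is scored on the remaining one
train_rows = fold_size * (k - 1)
test_rows = fold_size

print(f"{k} folds of {fold_size} rows, {leftover} left over")
print(f"each model: {train_rows} training rows, {test_rows} test rows")
```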
```python
from sklearn.model_selection import KFold

# convert the DataFrame loaded earlier into a NumPy array
np_array = df.to_numpy()

# 2nd + 3rd param: shuffle data before splitting into folds
kfold = KFold(n_splits=9, shuffle=True, random_state=1)

model = 1
# displaying indices for the rows that will be for training/testing
for train, test in kfold.split(np_array):
    print('Model #%d:' % model)
    print('train: %s, test: %s' % (train, test))
    model = model + 1
```
Now that we’re done splitting our data into 9 folds, we’re ready to move on to the next lesson: evaluating the model!
To skip all those configuration steps for K-fold cross validation, Mage provides an easy, no-code experience of training and testing a dataset. We, as users, can’t customize how the data is split; instead, Mage uses an algorithm to decide. For this dataset, Mage decided on approximately a 9:1 training to testing split.
You can find further details about the training/test split under “Review > Statistics” on our Mage web application.
Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮