# Model selection, underfitting, and overfitting

In the previous experiments on the Fashion-MNIST dataset, we evaluated the performance of machine learning models on both the training data set and the test data set. If you changed the model structure or the hyperparameters in those experiments, you may have noticed that a model that is more accurate on the training data set is not necessarily more accurate on the test data set. Why?

## Training error and generalization error

Before explaining this phenomenon, we need to distinguish between training error and generalization error. Generally speaking, the former is the error of the model on the training data set, and the latter is the expected error of the model on an arbitrary test sample, which is often approximated by the error on the test data set. Both can be computed with the loss functions introduced earlier, such as the squared loss used in linear regression and the cross-entropy loss used in softmax regression.
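To make the two definitions concrete, they can be written as formulas. The notation here is an assumption introduced for illustration and does not appear in the original text: $f$ is a model, $\ell$ is a loss function, $(x_i, y_i)$ for $i = 1, \ldots, n$ are the training samples, and $P$ is the unknown distribution that generates the data.

$$
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big), \qquad R(f) = \mathbb{E}_{(x, y) \sim P}\big[\ell\big(f(x), y\big)\big]
$$

Here $\hat{R}_n(f)$ is the training error and $R(f)$ is the generalization error. Averaging the loss over a held-out test set gives an estimate of $R(f)$.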

Let's use the college entrance examination to explain training error and generalization error intuitively. The training error can be regarded as the error rate on past years' examination papers (training questions), and the generalization error can be approximated by the error rate when actually sitting the examination (test questions).

Suppose that the training questions and the test questions are randomly sampled from the same huge, unknown question bank according to the same syllabus. If a primary school pupil who has not studied any secondary school material answers the questions, the error rates on the test questions and on the training questions will probably be very similar. However, for a final-year high school student who has practiced the training questions repeatedly, an error rate of 0 on the training questions does not mean that the result in the real examination will be equally good.

In machine learning, we usually assume that every sample in the training data set (training questions) and the test data set (test questions) is generated independently from the same probability distribution. Under this assumption, for any fixed machine learning model (including its parameters), the expectation of its training error equals its generalization error.

For example, if we set the model parameters to random values (the primary school pupil), the training error and the generalization error will be very close. However, as we learned in the previous sections, the model parameters are learned by training on the training data set, and they are chosen to minimize the training error (the final-year student).

Therefore, the expectation of the training error is less than or equal to the generalization error. In other words, the parameters learned from the training data set will generally make the model perform at least as well on the training data set as on the test data set. Since the generalization error cannot be estimated from the training error, blindly reducing the training error does not mean that the generalization error will be reduced.
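The inequality can be sketched in a few lines. This is an idealized argument under assumptions that are not in the original text: $\hat{R}_n(f)$ denotes the training error of a model $f$ on $n$ independent samples, $R(f)$ denotes its generalization error, $\hat{f}$ is the exact minimizer of the training error within the model class, and $f^\star$ is a minimizer of the generalization error within the same class. Expectations are taken over the random draw of the training set.

$$
\mathbb{E}\big[\hat{R}_n(\hat{f})\big] \le \mathbb{E}\big[\hat{R}_n(f^\star)\big] = R(f^\star) \le \mathbb{E}\big[R(\hat{f})\big]
$$

The first step holds because $\hat{f}$ minimizes the training error, the equality holds because $f^\star$ does not depend on the training samples, and the last step holds because $f^\star$ minimizes the generalization error. In practice, training only approximates $\hat{f}$, so this should be read as an intuition rather than a guarantee.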

Machine learning models should focus on reducing the generalization error.

## Model selection

In machine learning, we usually need to evaluate several candidate models and choose among them. This process is called model selection. The candidate models can be models of the same type with different hyperparameters.

Taking the multi-layer perceptron as an example, we can choose the number of hidden layers, the number of hidden units in each hidden layer, and the activation function. Obtaining an effective model usually takes considerable effort in model selection. Next, we describe the validation data set that is often used for model selection.

### Validation dataset

Strictly speaking, the test set may be used only once, after all hyperparameters and model parameters have been chosen. The test data must not be used for model selection, for example for tuning hyperparameters. Since the generalization error cannot be estimated from the training error, we should not rely on the training data alone to select a model either. For this reason, we can set aside some data that belongs to neither the training data set nor the test data set and use it for model selection. This data is called the validation data set, or validation set for short.

For example, we can randomly pick a small part of a given training set as the validation set and use the rest as the actual training set.
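A hold-out split of this kind can be sketched with the standard library alone. The function names `train_valid_split` and `select_best`, the 20% validation ratio, and the validation errors in the example are all hypothetical and only serve to illustrate the idea.

```python
import random

def train_valid_split(num_samples, valid_ratio=0.2, seed=0):
    """Randomly split sample indices into a training part and a validation part."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    num_valid = max(1, int(num_samples * valid_ratio))
    return indices[num_valid:], indices[:num_valid]

def select_best(valid_errors):
    """Return the candidate whose validation error is lowest."""
    return min(valid_errors, key=valid_errors.get)

train_idx, valid_idx = train_valid_split(100, valid_ratio=0.2)

# Made-up validation errors for three candidate polynomial orders.
valid_errors = {1: 0.90, 3: 0.01, 9: 0.05}
best_order = select_best(valid_errors)
print(len(train_idx), len(valid_idx), best_order)
```

Each candidate is trained on the samples in `train_idx` and scored on those in `valid_idx`, and the test set is left untouched until a candidate has been chosen.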

In practice, however, data is hard to obtain, so test data is rarely used once and then discarded. As a result, the boundary between the validation data set and the test data set can be blurry.

Strictly speaking, unless otherwise specified, the test sets used in the experiments in this book should be regarded as validation sets, and the reported test results (such as test accuracy) should be regarded as validation results (such as validation accuracy).

### $K$-fold cross-validation

Because the validation data set does not take part in model training, setting aside a large amount of validation data is too costly when training data is scarce. One improvement is $K$-fold cross-validation.

In $K$-fold cross-validation, we divide the original training data set into $K$ non-overlapping sub-datasets and then perform model training and validation $K$ times. Each time, we use one sub-dataset to validate the model and the other $K-1$ sub-datasets to train it. Across the $K$ rounds of training and validation, a different sub-dataset is used for validation each time. Finally, we average the $K$ training errors and the $K$ validation errors separately.
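The procedure can be sketched in plain Python as follows. The helper names `k_fold_indices` and `k_fold_average` are hypothetical, and `dummy_train_and_eval` is a placeholder that stands in for real model training. The sketch splits the indices into contiguous folds, so it assumes the data has already been shuffled.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_idx, valid_idx) for each of the k folds."""
    fold_size = num_samples // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder.
        stop = num_samples if i == k - 1 else start + fold_size
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, num_samples))
        yield train_idx, valid_idx

def k_fold_average(num_samples, k, train_and_eval):
    """Average the (train_error, valid_error) pairs returned for each fold."""
    train_sum, valid_sum = 0.0, 0.0
    for train_idx, valid_idx in k_fold_indices(num_samples, k):
        train_err, valid_err = train_and_eval(train_idx, valid_idx)
        train_sum += train_err
        valid_sum += valid_err
    return train_sum / k, valid_sum / k

def dummy_train_and_eval(train_idx, valid_idx):
    """Placeholder: a real version would train on train_idx and evaluate on valid_idx."""
    return 0.1, 0.2

folds = list(k_fold_indices(10, 5))
avg_train, avg_valid = k_fold_average(10, 5, dummy_train_and_eval)
print(len(folds), avg_train, avg_valid)
```

Every sample appears in exactly one validation fold, so all of the data contributes to both training and validation across the $K$ rounds.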

## Underfitting and overfitting

Next, we explore two typical problems that often occur in model training. The first is that the model cannot reach a low training error, which we call underfitting. The second is that the training error of the model is much smaller than its error on the test data set, which we call overfitting. In practice, we should guard against both underfitting and overfitting as much as possible.

Although there are many factors that may lead to these two fitting problems, here we focus on two factors: model complexity and training data set size.

For a detailed theoretical analysis of how model complexity and training set size affect learning, please refer to my blog post on the topic.

### Model complexity

To explain model complexity, we take polynomial function fitting as an example. Given a training data set consisting of scalar features $x$ and corresponding scalar labels $y$, the goal of polynomial function fitting is to find a $K$-th order polynomial function

$$\hat{y} = b + \sum_{k=1}^K x^k w_k$$

to approximate $y$. In the formula above, $w_k$ is the weight parameter of the model and $b$ is the bias parameter. Like linear regression, polynomial function fitting uses the squared loss function. In particular, first-order polynomial function fitting is also called linear function fitting.

Because a higher-order polynomial function has more model parameters and a larger space of candidate functions, it is more complex than a lower-order polynomial function.

Therefore, on the same training data set, a higher-order polynomial function reaches a low training error more easily than a lower-order one. For a given training data set, the typical relationship between model complexity and error is shown in Figure 3.4. If the model complexity is too low, underfitting is likely; if it is too high, overfitting is likely. One way to deal with underfitting and overfitting is to choose a model whose complexity suits the data set.
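The claim that higher-order polynomials reach lower training errors can be checked with a small sketch. It uses only the standard library and fits each polynomial by closed-form least squares (the normal equations) rather than by gradient descent. The helper names, the uniform sampling of $x$ in $[-1, 1]$, and the noise level of 0.1 are assumptions made for this illustration.

```python
import random

def poly_features(x, degree):
    """Return [1, x, x^2, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]

def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

def fit_poly(xs, ys, degree):
    """Least-squares fit via the normal equations (X^T X) w = X^T y."""
    rows = [poly_features(x, degree) for x in xs]
    d = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    return solve_linear_system(xtx, xty)

def mse(w, xs, ys):
    """Mean squared error of the polynomial with coefficients w."""
    degree = len(w) - 1
    preds = [sum(wk * f for wk, f in zip(w, poly_features(x, degree))) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(30)]
ys = [1.2 * x - 3.4 * x**2 + 5.6 * x**3 + 5 + rng.gauss(0, 0.1) for x in xs]

train_errors = {deg: mse(fit_poly(xs, ys, deg), xs, ys) for deg in (1, 2, 3, 4)}
for deg, err in train_errors.items():
    print(f"order {deg}: training MSE = {err:.4f}")
```

Because each lower-order model is a special case of the higher-order one, the least-squares training error cannot increase with the order. This says nothing about the error on unseen data, which is where overfitting shows up.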

### Training dataset size

Another important factor affecting underfitting and overfitting is the size of the training data set. Generally speaking, if the training data set contains too few samples, especially fewer than the number of model parameters (counted element-wise), overfitting is more likely to occur. In addition, the generalization error does not increase as the number of training samples grows.

Therefore, within the limits of the available computing resources, we usually want a larger training data set, especially when the model complexity is high, as with deep learning models that have many layers.

## Polynomial function fitting experiment

To understand how model complexity and training data set size affect underfitting and overfitting, we run an experiment with polynomial function fitting.

First, import the package or module required for the experiment:

```python
%matplotlib inline
import torch
import numpy as np
import sys
sys.path.append("..")
import d2lzh_pytorch as d2l
```

### Generate dataset

We will generate an artificial data set. In both the training data set and the test data set, given the sample feature $x$, we use the following third-order polynomial function to generate the label of the sample:

$$y = 1.2x - 3.4x^2 + 5.6x^3 + 5 + \epsilon,$$

where the noise term $\epsilon$ follows a normal distribution with a mean of 0 and a standard deviation of 0.01. The training data set and the test data set each contain 100 samples.

```python
n_train, n_test, true_w, true_b = 100, 100, [1.2, -3.4, 5.6], 5
features = torch.randn((n_train + n_test, 1))
poly_features = torch.cat((features, torch.pow(features, 2), torch.pow(features, 3)), 1)
labels = (true_w[0] * poly_features[:, 0] + true_w[1] * poly_features[:, 1]
          + true_w[2] * poly_features[:, 2] + true_b)
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)
```

Take a look at the first two samples of the generated dataset.

```python
features[:2], poly_features[:2], labels[:2]
```

Output:

```
(tensor([[-1.0613],
         [-0.8386]]),
 tensor([[-1.0613,  1.1264, -1.1954],
         [-0.8386,  0.7032, -0.5897]]),
 tensor([-6.8037, -1.7054]))
```

### Define, train, and test models

We first define the plotting function `semilogy`, in which the $y$-axis uses a logarithmic scale.

```python
# This function has been saved in the d2lzh_pytorch package for later use
def semilogy(x_vals, y_vals, x_label, y_label, x2_vals=None, y2_vals=None,
             legend=None, figsize=(3.5, 2.5)):
    d2l.set_figsize(figsize)
    d2l.plt.xlabel(x_label)
    d2l.plt.ylabel(y_label)
    d2l.plt.semilogy(x_vals, y_vals)
    if x2_vals and y2_vals:
        d2l.plt.semilogy(x2_vals, y2_vals, linestyle=':')
        d2l.plt.legend(legend)
```

Like linear regression, polynomial function fitting uses the squared loss function.

Because we will fit the generated data set with models of different complexity, we put the model definition inside the `fit_and_plot` function.

The training and testing steps of polynomial function fitting are similar to those of softmax regression described in section 3.6 (implementation of softmax regression from scratch).

```python
num_epochs, loss = 100, torch.nn.MSELoss()

def fit_and_plot(train_features, test_features, train_labels, test_labels):
    net = torch.nn.Linear(train_features.shape[-1], 1)
    # According to the Linear documentation, PyTorch has already initialized
    # the parameters, so we don't initialize them manually here

    batch_size = min(10, train_labels.shape[0])
    dataset = torch.utils.data.TensorDataset(train_features, train_labels)
    train_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True)

    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
    train_ls, test_ls = [], []
    for _ in range(num_epochs):
        for X, y in train_iter:
            l = loss(net(X), y.view(-1, 1))
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
        train_labels = train_labels.view(-1, 1)
        test_labels = test_labels.view(-1, 1)
        train_ls.append(loss(net(train_features), train_labels).item())
        test_ls.append(loss(net(test_features), test_labels).item())
    print('final epoch: train loss', train_ls[-1], 'test loss', test_ls[-1])
    semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
             range(1, num_epochs + 1), test_ls, ['train', 'test'])
    print('weight:', net.weight.data,
          '\nbias:', net.bias.data)
```

### Third-order polynomial function fitting (normal)

We first use a third-order polynomial function, the same order as the data generation function. The experiment shows that both the training error of this model and its error on the test data set are low. The trained model parameters are also close to the true values: $w_1 = 1.2, w_2 = -3.4, w_3 = 5.6, b = 5$.

```python
fit_and_plot(poly_features[:n_train, :], poly_features[n_train:, :],
             labels[:n_train], labels[n_train:])
```

Output:

```
final epoch: train loss 0.00010175639908993617 test loss 9.790256444830447e-05
weight: tensor([[ 1.1982, -3.3992,  5.6002]])
bias: tensor([5.0014])
```

### Linear function fitting (underfitting)

Next, let's try linear function fitting. The training error drops during the early iterations but is then hard to reduce further, and it is still very high after the last iteration. Linear models easily underfit data sets generated by nonlinear models (such as third-order polynomial functions).

```python
fit_and_plot(features[:n_train, :], features[n_train:, :],
             labels[:n_train], labels[n_train:])
```

Output:

```
final epoch: train loss 249.35157775878906 test loss 168.37705993652344
weight: tensor([[19.4123]])
bias: tensor([0.5805])
```

### Insufficient training samples (overfitting)

In fact, even a third-order polynomial model, the same order as the data generation model, easily overfits when there are too few training samples. Let's train the model with only two samples. This is clearly too few, fewer even than the number of model parameters. The model is then too complex for the data and is easily affected by the noise in the training samples. During the iterations, the training error is relatively low, but the error on the test data set is very high. This is typical overfitting.

```python
fit_and_plot(poly_features[0:2, :], poly_features[n_train:, :],
             labels[0:2], labels[n_train:])
```

Output:

```
final epoch: train loss 1.198514699935913 test loss 166.037109375
weight: tensor([[1.4741, 2.1198, 2.5674]])
bias: tensor([3.1207])
```

We will continue to discuss overfitting and the methods for dealing with it in the next two sections.

## Summary

- Since the generalization error cannot be estimated from the training error, blindly reducing the training error does not mean that the generalization error will be reduced. Machine learning models should focus on reducing the generalization error.
- A validation data set can be used for model selection.
- Underfitting means that the model cannot reach a low training error, and overfitting means that the training error of the model is much smaller than its error on the test data set.
- Choose a model of appropriate complexity and avoid training with too few samples.

Note: apart from the code, this section is essentially the same as the original book.

I quote the content of the book here for learning and non-commercial purposes only. I recommend reading the book and studying along with it!
