Prevent the Overfitting through Regularization: An example by Ridge Regression

#python #datascience #machinelearning

Description

In my post on Polynomial Curve Fitting was discussed that adding more examples is one of the possible ways to prevent overfitting, the phenomenon that occurs in the figure above, where is a gap between the training (lower) and validation (higher) errors.

Another approach used to control the overfitting is Regularization, which involves adding a penalty term to the error function to discourage the coefficients from reaching large values, how to introduce Bishop in the Pattern Recognition and Machine Learning book.

This post continues the polynomial curve fitting analysis but through the Regularization, known as Ridge Regression instead of Linear Regression.

Regularization

To apply the regularization in the previous analysis, it is necessary to modify the Sum-of-Squares Error (SSE) function adding the regularization parameter λ, as shown in the equation below:

SSE = \frac{1}{2}\sum_{n=1}^{N}\left ( y_{pred}-y \right )^{2}+\frac{\lambda}{2}\left| w \right|^{2}

Where ||w||² is equivalent to w.T * w, and the parameter or coefficient λ conducts the relative importance of the regularization term compared with the SSE term.

As before, instead use some optimization algorithms, like gradient descent, it is used the adapted normal equation to obtain the coefficients w, as shown below:

W=\left ( X^{T}X + \lambda I \right )^{-1}X^{T}y

Where λ is the regularization parameter, I is the identity matrix of size M + 1 and M is the order of the polynomial. The coefficients obtained through normal equation are given by the function below:

def normal_equation_ridge(x, y, M, L):

  # Normal equation: w = ((x'*x)^-1 + L*I) *x'*y

  I = np.identity(M+1)

  xT = x.T
  w = np.dot(np.dot(np.linalg.pinv(np.dot(xT, x) + L*I), xT), y)

  return w

Choosing the regularization parameter

To exemplify the regularization, we used the overfitted model of M = 9, which obtained a Root-Mean Square Error (RMSE) of 0.0173 and 6.1048 in training and validation sets, respectively. The regularization parameter λ values ranging from -40 to 0 were investigated, but to better illustrate, the values are displayed in terms of the natural log, between -40 ≤ ln(λ) ≤ 0, where the value λ = exp(L) is input from the above function and L is the value in range. The RMSE of the validation set is used to choose the parameter λ. The figure below shows the analysis done to choose the parameter λ.

For the value ln(λ) = -40, in the figure above, the RMSE is approximately the value without regularization (6.1042), as the value of λ tends to zero (λ = exp(-40) = 4.2483e-18). The best parameter found (red dashed) is ln(λ) = -11.43 (λ = exp(-11.43) = 1.0880e-5), reaching 0.1218 of RMSE of validation set, while the training set got 0.0637.

The table below compares the coefficient values for ln(λ) = -∞ and ln(λ) = -11.43. Note that ln(λ) = -∞ corresponds to the model without regularization and ln(λ) = -11.43 the model that has the smallest validation error with regularization. It is possible to notice that the coefficients of ln(λ) = -∞ are large, while the values ln(λ) = -11.43 are smaller due to the addition of the penalty term.

Polynomial order with normalization

As the post of Polynomial Curve Fitting, the error analysis is performed in the training and validation sets by the order of the polynomial, but now with regularization.

The figure below shows the training and validation RMSE by the order of the polynomial, where the prevention of overfitting due to the use of regularization in each analyzed order (M= [0, 1, 3 and 9]) is noted.