DEV Community

Timothy Cummins

Overfit vs Underfit: The Modeling War

Introduction

Overfitting and underfitting are key concepts when building machine learning models. An underfit model fails to pick up the important patterns in your data and performs poorly on both your training and testing sets, while an overfit model looks great on your training data but performs poorly on your testing data. To create a model that returns good predictions you need to find a balance between the two, so let's go over these concepts.

To provide some examples I will be coding some visuals, so if you want to follow along here are the libraries and data that I am using:

import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic: keep this line only when running in a notebook
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

To create our data I am going to draw random x values, compute y from a quadratic function, and then add some Gaussian noise to help it resemble real data (to a point).

np.random.seed(14)
x = np.random.uniform(0, 11, 20)
x = np.sort(x)
y = (-x+4) * (x-9) + np.random.normal(0, 3,20)
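Expanding the factored form gives y = -x² + 13x - 36 plus noise: a downward-opening parabola that crosses zero at x = 4 and x = 9. The sketch below is a quick sanity check of that expansion rather than part of the original walkthrough. The names `y_clean`, `coeffs`, `vertex_x`, and `vertex_y` are introduced here for illustration only.

```python
import numpy as np

# Same x values as in the post (same seed and call order).
np.random.seed(14)
x = np.sort(np.random.uniform(0, 11, 20))

# Noise-free version of the target, written in the same factored form.
y_clean = (-x + 4) * (x - 9)

# np.polyfit returns coefficients from the highest power down,
# so for this data we expect approximately [-1, 13, -36].
coeffs = np.polyfit(x, y_clean, 2)
print(np.round(coeffs, 6))

# The vertex of a*x^2 + b*x + c sits at x = -b / (2a).
a, b, c = coeffs
vertex_x = -b / (2 * a)
vertex_y = a * vertex_x**2 + b * vertex_x + c
print(round(vertex_x, 3), round(vertex_y, 3))
```

Knowing the true curve is a 2nd degree polynomial peaking near x = 6.5 makes it easier to judge the fits that follow.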

Underfitting

Now let's get started by creating a visual of underfitting so we can see what it looks like. To do this we are going to split our data into training and testing sets, then fit a linear model to the training data.

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.40, random_state=14)
reg = LinearRegression().fit(X_train.reshape(-1, 1), y_train)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, color='blue',label='Train Data')
plt.plot(X_train.reshape(-1, 1), reg.predict(X_train.reshape(-1, 1)),label='Underfit Model')
plt.legend(loc=[0.17, 0.1])

plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, color='green',label='Test Data')
plt.plot(X_train.reshape(-1, 1), reg.predict(X_train.reshape(-1, 1)),label='Underfit Model')
plt.legend(loc=[0.17, 0.1]);

[Figure: the underfit linear model plotted over the training data (left) and the test data (right)]

Looking at the visual above, we can see that the model is not really picking up the relationship between x and y, or in a more typical case the relationships between the features in your data and the target. The nice thing about underfitting is that it usually shows up as poor performance on your training data. It can be caused by a model that is not complex enough for the data (as in our case here) or by having too few informative features. Another way to describe underfitting is to say that the model has high bias, which I like to think of as people ignoring the data that does not match their point of view.
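To put a number on "poor performance on your training data", one option is to score the linear model on both splits. The following is a minimal sketch that rebuilds the same data and split so it runs on its own. The use of R² and mean squared error is my choice of metrics here, not something the walkthrough above prescribes, and the `*_2d`, `*_r2`, and `*_mse` names are introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Rebuild the post's data and split so this block is self-contained.
np.random.seed(14)
x = np.sort(np.random.uniform(0, 11, 20))
y = (-x + 4) * (x - 9) + np.random.normal(0, 3, 20)
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.40, random_state=14
)

# scikit-learn expects a 2D feature matrix, hence the reshape.
X_train_2d = X_train.reshape(-1, 1)
X_test_2d = X_test.reshape(-1, 1)
reg = LinearRegression().fit(X_train_2d, y_train)

# R^2 near 1 means the model explains most of the variation in y;
# values near 0 (or below 0 on unseen data) mean it explains little.
train_r2 = reg.score(X_train_2d, y_train)
test_r2 = reg.score(X_test_2d, y_test)
train_mse = mean_squared_error(y_train, reg.predict(X_train_2d))
test_mse = mean_squared_error(y_test, reg.predict(X_test_2d))

print(f"train R^2 = {train_r2:.3f}, MSE = {train_mse:.2f}")
print(f"test  R^2 = {test_r2:.3f}, MSE = {test_mse:.2f}")
```

An underfit model would be expected to score badly on both lines of output, which is what separates it from the overfit case.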

Overfitting

So now to show overfitting let's try fitting an 8th degree polynomial to our quadratic dataset.

poly = PolynomialFeatures(8)
x_fin = poly.fit_transform(X_train.reshape(-1, 1))
reg_poly = LinearRegression().fit(x_fin, y_train)

X_linspace = np.linspace(0, 11, 30)
X_linspace_fin = poly.transform(X_linspace.reshape(-1, 1))  # reuse the transformer fitted on the training data
y_poly_pred = reg_poly.predict(X_linspace_fin)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, color='blue',label='Train Data')
plt.plot(X_linspace, y_poly_pred,label='Overfit Model')
plt.ylim(bottom=-50,top=20)
plt.legend(loc=[0.17, 0.1])

plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, color='green',label='Test Data')
plt.plot(X_linspace, y_poly_pred,label='Overfit Model')
plt.ylim(bottom=-50,top=20)
plt.legend(loc=[0.17, 0.1]);

[Figure: the overfit 8th degree polynomial plotted over the training data (left) and the test data (right)]

Overfitting is a little trickier. As you can see from the graphic above, the model follows the training set very closely (on the left) but misses many of the points in the testing set (on the right). This is because the algorithm is modeling the noise in the training set rather than the underlying relationship, which is known as having high variance. Overfitting can normally be traced to a model that is too complex for the dataset (in our case an 8th degree polynomial on data generated from a 2nd degree one), to outliers or errors in the data, or simply to not having enough data.
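One way to check for "modeling the noise" is to compare the training error with the noise level we put into the data. The noise was drawn with a standard deviation of 3, so its variance is 9, and a training error well below that suggests the curve is bending to fit individual noisy points. The sketch below assumes the same data and split as above and uses `make_pipeline` for convenience, which is a small departure from the step-by-step code in the post. The names `noise_std` and `overfit` are introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the post's data; noise_std is the 3 passed to np.random.normal.
np.random.seed(14)
x = np.sort(np.random.uniform(0, 11, 20))
noise_std = 3
y = (-x + 4) * (x - 9) + np.random.normal(0, noise_std, 20)
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.40, random_state=14
)

# A pipeline keeps the feature expansion and the regression together,
# so the transformer is fitted once, on the training data only.
overfit = make_pipeline(PolynomialFeatures(8), LinearRegression())
overfit.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, overfit.predict(X_train))
test_mse = mean_squared_error(y_test, overfit.predict(X_test))

# With noise variance 9, no model should be expected to score much
# better than an MSE of about 9 on data it has not seen.
print(f"noise variance : {noise_std ** 2}")
print(f"train MSE      : {train_mse:.2f}")
print(f"test MSE       : {test_mse:.2f}")
```

A large gap between the two MSE values, in favor of the training set, is the usual signature of overfitting. With only 8 test points the exact numbers are noisy, so treat them as an indication rather than a measurement.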

Conclusion

For a final image let's take a look at how the model should look, by fitting a 2nd degree polynomial.

poly = PolynomialFeatures(2)
# Fit on the training split only, so the test points stay unseen
x_fin = poly.fit_transform(X_train.reshape(-1, 1))
reg_poly = LinearRegression().fit(x_fin, y_train)
X_linspace = np.linspace(0, 11, 30)
# Reuse the fitted transformer instead of refitting it
X_linspace_fin = poly.transform(X_linspace.reshape(-1, 1))
y_poly_pred = reg_poly.predict(X_linspace_fin)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, color='blue',label='Train Data')
plt.plot(X_linspace, y_poly_pred,label='Ideal Model')
plt.legend(loc=[0.17, 0.1])

plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, color='green',label='Test Data')
plt.plot(X_linspace, y_poly_pred,label='Ideal Model')
plt.legend(loc=[0.17, 0.1]);

[Figure: the 2nd degree polynomial model plotted over the training data (left) and the test data (right)]

Now we can see that while our model does not fit the data perfectly on either the training or testing set, because of the random noise, it fits both sets well. If you compare the causes of the two problems you might notice that they are opposites. As you increase the complexity of your model its variance goes up, and as you decrease it the bias goes up. This is called the Bias-Variance Tradeoff, and the key to building a good machine learning model lies somewhere in between.
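To see the tradeoff across the whole range rather than at three hand-picked degrees, one option is to sweep the polynomial degree and record the error on each split. This is a rough sketch under the same data and split as above. The `results` dictionary and `best_degree` are names introduced here, and picking a degree by its test error is only for illustration: in practice a separate validation set or cross-validation would normally be used for that choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the post's data and split so this block is self-contained.
np.random.seed(14)
x = np.sort(np.random.uniform(0, 11, 20))
y = (-x + 4) * (x - 9) + np.random.normal(0, 3, 20)
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.40, random_state=14
)

# Map each degree to its (train MSE, test MSE) pair.
results = {}
print(f"{'degree':>6} {'train MSE':>10} {'test MSE':>10}")
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"{degree:>6} {train_mse:>10.2f} {test_mse:>10.2f}")

# Degree with the lowest test error on this particular split.
best_degree = min(results, key=lambda d: results[d][1])
print("lowest test MSE at degree", best_degree)
```

Training error can only go down as the degree grows, because each higher degree model contains the lower ones as special cases. The test column is where the tradeoff shows up: it would be expected to drop sharply once the model can bend at all and then drift upward as extra flexibility starts chasing noise, though on a split this small the pattern may be uneven.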
