The final step in the linear regression model is creating an optimizer function to improve our weights and bias. In this article I'm going to explain how gradient descent works, and also give you a quick explanation of what a derivative and a partial derivative are, so that you can follow the process even if you haven't studied calculus yet. You won't become a calculus expert by reading this article (I'm certainly not one), but I think you'll be able to follow the process of gradient descent a little bit better.
I will also link to an article that helped me understand the math behind partial derivatives, which you can read to fill in some details I won't be covering here.
The goal of gradient descent is to minimize the loss. In an ideal world we want our loss to be 0 (but keep in mind that this isn't realistically possible). We minimize the loss by improving the parameters of the model, which are the weight w and the bias b in linear regression. We improve those parameters either by making them larger, or smaller - whichever makes the loss go down.
Gradient descent is an iterative process - this is just a fancy way to say that the process repeats over and over again until you reach some condition for ending it.
The condition for ending could be:
- We are tired of waiting: i.e. we let gradient descent run for a certain number of iterations and then tell it to stop
- The loss is minimized as much as we need for our problem: i.e. the loss is equal to or less than a certain number that we decide on
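These two stopping conditions can be sketched as a small loop. The `loss_and_step` helper here is hypothetical (not from the article) and stands in for one full update of the model that returns the new loss:

```python
# A sketch of the two stopping conditions, assuming a hypothetical
# loss_and_step() function that performs one update and returns the new loss.
def run_gradient_descent(loss_and_step, max_iterations=1000, target_loss=0.01):
    loss = float("inf")
    for i in range(max_iterations):   # condition 1: we stop after a fixed number of iterations
        loss = loss_and_step()
        if loss <= target_loss:       # condition 2: the loss is "good enough" for our problem
            break
    return loss
```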
This is where derivatives come in.
For a given function:
- It tells us how much a change in the input (for example, the weight) will change the output of the function

For example, for the MSE loss function:
- How much will changing w a little bit change the loss?
- Basically, the derivative tells us the slope of the function's curve at a given point
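As a quick illustration (not from the article), we can estimate that slope numerically by nudging the input a tiny amount and seeing how much the output changes:

```python
# Estimate the slope of f at x with a tiny finite difference.
# For f(x) = x**2 the true derivative is f'(x) = 2x, so at x = 3 we expect ~6.
def slope(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(slope(lambda x: x ** 2, 3.0))  # ≈ 6.0
```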
A partial derivative is a derivative of a function that has more than one variable, taken with respect to one of those variables. In the linear regression equation we have w and b, which both can change, so there are two variables that can affect the loss. We want to isolate each of those variables so that we can figure out how much w affects the loss and how much b affects the loss separately.
- So we measure the derivative, or the slope, one variable at a time
- Whichever variable we are not measuring, we treat as a constant - its rate of change is 0, so it drops out of the derivative
- First we calculate the derivative of the loss with respect to w
- And then we calculate the derivative of the loss with respect to b
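The "one variable at a time" idea can be sketched numerically. Using a made-up toy loss (not the MSE from the article), we nudge only w while holding b fixed, and then only b while holding w fixed:

```python
# Estimate the partial derivative with respect to w: only w is nudged, b is held fixed.
def partial_w(loss, w, b, h=1e-6):
    return (loss(w + h, b) - loss(w - h, b)) / (2 * h)

# Estimate the partial derivative with respect to b: only b is nudged, w is held fixed.
def partial_b(loss, w, b, h=1e-6):
    return (loss(w, b + h) - loss(w, b - h)) / (2 * h)

# A made-up toy loss with known partials: 2*(w - 1) and 2*(b + 2).
toy_loss = lambda w, b: (w - 1) ** 2 + (b + 2) ** 2

print(partial_w(toy_loss, 0.0, 0.0))  # ≈ -2.0
print(partial_b(toy_loss, 0.0, 0.0))  # ≈ 4.0
```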
Rather than illustrating the formula for partial derivatives of MSE here (which I am still learning to understand myself), I am going to include a link to a very helpful article that goes through the mathematical formula step by step for finding the partial derivatives of mean squared error. The author basically does what I was hoping to do in this article before I became a little overwhelmed by the amount of background I would need to provide.
Now that we have calculated the derivatives we need to actually use them to update the parameters w and b.
We will use something called the Learning Rate to tell us how big of a step to take in our gradient descent. It is called the learning rate, because it affects how quickly our model will learn the patterns in the data. What do we do with it? We use it to multiply the derivative with respect to w and b when we update w and b in each iteration of training our model.
So, in short, it's a number that controls how quickly our parameters w and b change. A lower learning rate will cause w and b to change slowly (the model learns slower), and a higher learning rate will cause w and b to change more quickly (the model learns faster).
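A minimal illustration of that arithmetic, using made-up numbers for the parameter and the gradient:

```python
# Made-up values, just to show how the learning rate scales one update step.
w = 1.0
gradient = 0.5

small_step = w - 0.01 * gradient  # low learning rate: w barely moves
big_step = w - 1.0 * gradient     # high learning rate: w changes a lot

print(small_step)  # 0.995
print(big_step)    # 0.5
```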
Remember in my overview of linear regression article I discussed how after we find the loss we'll need to use that information to update our weight and bias to minimize the loss? Well we're finally ready for that step.
A quick summary before we get started with the code. We have a forward pass, where we calculate our predictions and our current loss, based on those predictions. Then we have a backward pass, where we calculate the partial derivative of the loss with respect to each of our parameters (w and b). Then, using those gradients that we gained through calculating the derivatives, we train the model by updating our parameters in the direction that reduces the loss. We use the learning rate to control how much those parameters are changed at a time in each iteration of training.
This is called the forward pass:
- So we initialize our parameters
- we can start them off at 0
- or we can start them off at random numbers (but I've decided to start them at 0, to simplify the code)
- We calculate linear regression with our current weight and bias
- We calculate the current loss, based on the current values for w and b
```python
# import the Python library used for scientific computing
import numpy as np

# predict function, based on the y = wx + b equation
def predict(X, w, b):
    return X * w + b

# loss function, based on the MSE equation
def mse(X, Y, w, b):
    return np.average((predict(X, w, b) - Y) ** 2)
```
This part is called the backward pass:
- Using the current loss we calculate the derivative of the loss with respect to w...
- ...and with respect to b
```python
# calculate the gradients of the loss with respect to w and b
def gradients(X, Y, w, b):
    w_gradient = np.average(2 * X * (predict(X, w, b) - Y))
    b_gradient = np.average(2 * (predict(X, w, b) - Y))
    return (w_gradient, b_gradient)
```
- Then we update the weight and bias in the direction that minimizes the loss, by subtracting each derivative multiplied by the learning rate
- Then we repeat that process for as long as we want (set by the number of iterations) to reduce the loss as much as we want
```python
# train the model
# lr stands for learning rate
def train(X, Y, iterations, lr):
    # initialize w and b to 0
    w = 0
    b = 0
    # empty lists to keep track of the parameters' current values and the loss
    log = []
    losses = []
    # the training loop
    for i in range(iterations):
        w_gradient, b_gradient = gradients(X, Y, w, b)
        # update w and b
        w -= w_gradient * lr
        b -= b_gradient * lr
        # record the parameters and recalculate the loss to see our progress
        log.append((w, b))
        losses.append(mse(X, Y, w, b))
    return w, b, log, losses
```
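To see the pieces working together, here is a self-contained toy run. The functions are repeated from the article so the snippet runs on its own, and the data and hyperparameters (points from y = 2x + 1, 5000 iterations, a learning rate of 0.01) are my own made-up choices:

```python
import numpy as np

# Repeating the article's functions so this example is self-contained.
def predict(X, w, b):
    return X * w + b

def mse(X, Y, w, b):
    return np.average((predict(X, w, b) - Y) ** 2)

def gradients(X, Y, w, b):
    w_gradient = np.average(2 * X * (predict(X, w, b) - Y))
    b_gradient = np.average(2 * (predict(X, w, b) - Y))
    return (w_gradient, b_gradient)

def train(X, Y, iterations, lr):
    w, b = 0.0, 0.0
    log, losses = [], []
    for i in range(iterations):
        w_gradient, b_gradient = gradients(X, Y, w, b)
        w -= w_gradient * lr
        b -= b_gradient * lr
        log.append((w, b))
        losses.append(mse(X, Y, w, b))
    return w, b, log, losses

# Toy data generated from y = 2x + 1, so we expect w to land near 2 and b near 1.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2 * X + 1

w, b, log, losses = train(X, Y, iterations=5000, lr=0.01)
print(w, b)                     # should be close to 2 and 1
print(losses[0], losses[-1])    # the loss should shrink over the iterations
```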
A parting note: there are tricks to avoid using explicit loops in your code, so that it will run faster when we start to train on very large datasets. But to give an idea of what is going on, I thought it made sense to write the train function as an explicit loop.
I hope you enjoyed this overview of Gradient Descent. My code might not be very elegant, but hopefully it gives you an idea of what's going on here.
If you like this style of building up the functions used in machine learning models a little bit at a time, you may enjoy this book, Programming Machine Learning, whose code I relied on in preparing this article.