Rubens Barbosa

Gradient Descent, an Optimization Method used in Machine Learning

In this article we are going to use some concepts from Differential Calculus, mainly partial derivatives and the chain rule. 📚

[Figure: a function with several local minima and maxima]

Optimization

A mathematical optimization problem, or just optimization problem, has a single goal: to find the best element from a set of candidates according to a function known as the cost function, loss function, or objective function.

Mathematically, an unconstrained optimization problem with decision variables θ and cost function L has the following form:

$$\min_{\theta \in \mathbb{R}^n} L(\theta)$$

We are interested in finding a vector θ that leads to the lowest value of the cost function. If it exists, this vector is called the optimal solution or global minimum and is denoted θ*, as follows:

$$\theta^{*} = \arg\min_{\theta} L(\theta)$$

Note that, depending on the function, the problem above might not have an optimal solution. Sometimes the function may have more than one local minimum or maximum, like the graph at the top of this article. However, we want to find a global minimum. In this context, let's assume that our problem has an optimal solution.

Assuming L is differentiable and convex, a necessary and sufficient condition for a vector θ* to be an optimal solution is

$$\nabla L(\theta^{*}) = 0$$

where ∇ is the gradient operator:

$$\nabla = \left( \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \ldots, \frac{\partial}{\partial \theta_n} \right)$$

Moreover, if the function is strictly convex, θ* is the unique optimal solution to the optimization problem.
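For intuition, take a small strictly convex function of my own choosing, L(θ) = (θ − 2)² + 1. Its gradient is ∇L(θ) = 2(θ − 2), and setting it to zero gives the unique optimal solution:

$$\nabla L(\theta^{*}) = 2(\theta^{*} - 2) = 0 \;\Rightarrow\; \theta^{*} = 2$$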

Cost Function

In Machine Learning it is common to compute a cost function in order to know how our model is performing, because our purpose is to find the best hypothesis. The cost function measures the error over the entire training data set. So, we want to minimize the cost function, i.e., minimize the error of the machine learning model.

There are different types of cost functions, and we choose one depending on the model we are going to work with. In this article we use the linear regression model, and our cost function is the mean squared error, as shown below

$$L(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where m is the number of training examples, h_θ(x⁽ⁱ⁾) is the model's prediction for the i-th example, and y⁽ⁱ⁾ is its true value.
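To make the formula concrete, here is a minimal sketch of this cost function in Python with NumPy; the function and variable names are illustrative choices of my own, not from any particular library:

```python
import numpy as np

def mse_cost(theta0, theta1, x, y):
    """Mean squared error of the linear hypothesis h(x) = theta0 + theta1 * x."""
    m = len(y)                          # number of training examples
    predictions = theta0 + theta1 * x   # h_theta(x^(i)) for every example
    errors = predictions - y            # residuals
    return np.sum(errors ** 2) / m      # average of the squared residuals
```

For example, with `x = np.array([1.0, 2.0, 3.0])` and `y = np.array([2.0, 4.0, 6.0])`, `mse_cost(0.0, 2.0, x, y)` returns 0.0, since that hypothesis fits the data exactly.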

Gradient Descent

Gradient descent is probably one of the simplest and most widely used iterative algorithms for optimizing continuous, differentiable functions. The basic idea behind the method is to start from a randomly chosen initial point and improve it repeatedly, taking a small step in the direction opposite to the gradient at each iteration.

That is, we start with an initial guess θ⁽⁰⁾ and at each iteration t = 0, 1, 2, ... we apply the update rule:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla L(\theta^{(t)})$$

The choice of the learning rate α is crucial for the convergence of gradient descent. In practice, it is common to adopt a constant value for α.

Note that −∇L(θ) is the direction in which L decreases at θ: the negative gradient points in the direction of steepest descent.
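To see the rule in action, here is one hand-computed iteration on a simple quadratic chosen for illustration, with α = 0.1:

$$L(\theta) = \theta^2, \quad \nabla L(\theta) = 2\theta, \quad \theta^{(0)} = 1 \;\Rightarrow\; \theta^{(1)} = 1 - 0.1 \cdot 2 \cdot 1 = 0.8$$

Repeating the step keeps shrinking θ towards the global minimum at θ = 0.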

Applying gradient descent to the cost function

We are going to use the linear regression model represented below

$$h_\theta(x) = \theta_0 + \theta_1 x$$

Find the component of the gradient with respect to θ₀, applying the chain rule:

$$\frac{\partial L}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} 2\left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_0} = \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

since ∂h_θ(x)/∂θ₀ = 1.

Find the component of the gradient with respect to θ₁:

$$\frac{\partial L}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} 2\left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_1} = \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

since ∂h_θ(x)/∂θ₁ = x.
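A minimal sketch of these two partial derivatives in code, assuming the same NumPy setup and naming as the cost function above:

```python
import numpy as np

def mse_gradients(theta0, theta1, x, y):
    """Partial derivatives of the MSE cost with respect to theta0 and theta1."""
    m = len(y)
    errors = (theta0 + theta1 * x) - y          # h_theta(x^(i)) - y^(i)
    d_theta0 = (2.0 / m) * np.sum(errors)       # dL/d(theta0)
    d_theta1 = (2.0 / m) * np.sum(errors * x)   # dL/d(theta1)
    return d_theta0, d_theta1
```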

We can visualize our cost function with the illustration below. Observe that the function is convex, so the stationary point θ* is a global minimum.

[Figure: the convex MSE cost function]

Gradient Descent Algorithm

Below you can find the pseudocode of gradient descent

[Image: pseudocode of the gradient descent algorithm]
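Here is a minimal runnable sketch of that loop in Python for the linear regression model above; the fixed number of iterations, the toy data, and all names are illustrative assumptions rather than a canonical implementation:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, n_iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by minimizing the MSE cost with gradient descent."""
    theta0, theta1 = 0.0, 0.0                       # initial guess
    m = len(y)
    for _ in range(n_iterations):
        errors = (theta0 + theta1 * x) - y          # h_theta(x^(i)) - y^(i)
        d_theta0 = (2.0 / m) * np.sum(errors)       # partial derivative w.r.t. theta0
        d_theta1 = (2.0 / m) * np.sum(errors * x)   # partial derivative w.r.t. theta1
        theta0 -= alpha * d_theta0                  # step in the direction opposite
        theta1 -= alpha * d_theta1                  # to the gradient
    return theta0, theta1

# Toy data generated from y = 1 + 2x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)

print(gradient_descent(x, y))   # the estimates should approach (1.0, 2.0)
```

A more careful implementation would usually monitor the cost and stop once it no longer decreases, instead of running a fixed number of iterations.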

Conclusion

In this article, we learned the theoretical part of gradient descent applied to the linear regression model. However, gradient descent can be used in other machine learning and deep learning algorithms. One important aspect of gradient descent is the choice of the learning rate. Large values of the learning rate can result in overshooting: if we update the parameters with a learning rate that is too large, the algorithm may fail to converge to the global minimum. On the other hand, small values of the learning rate require too many iterations to converge.
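Using the same illustrative quadratic L(θ) = θ² as before makes this concrete: each update multiplies θ by (1 − 2α), so with α = 0.1 the parameter shrinks by a factor of 0.8 per step and converges, while with α = 1.5 it is multiplied by −2 per step and diverges, which is exactly the overshooting effect described above.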

Although I only talked about plain gradient descent, there are other versions of it, such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. I encourage you to look these variants up and try to understand when you might use each one of them.

I hope you enjoyed reading this article. Thank you! 🙂
