Like everything in this world, finding the right path to an ambitious goal can become tedious if you don't have the right tools. Each objective and environment has different requirements and must be treated differently. Take traveling as an example: using a car to go to the grocery store might be the fastest and most comfortable way to get there. On the other hand, if we want to travel abroad, it might be a better idea to get on an airplane (unless you are one of those people who love driving for hours).
But we are not here to talk about the different types of transportation. We are here to talk about how to improve the training of your neural networks by choosing the best optimizer based on the memory it uses, its complexity, and its speed.
Training a deep neural network can be very slow, but there are multiple ways to improve the speed of convergence. By improving the update rule of the optimizer, we can make the network learn faster (at some computational and memory cost).
The simplest optimizer out there is Stochastic Gradient Descent (SGD). It works by computing the gradient of the error with respect to each weight through backpropagation and then moving each weight a small step against its gradient, with the step scaled by the learning rate.
Speed: because it is the most basic implementation, it is the fastest per step.
Memory: it is also the one that uses the least memory, since it only needs to store the gradient of each weight from backpropagation.
Performance: it converges very slowly but often generalizes better than most methods.
Usage: this optimizer can be used in PyTorch by providing the model's parameters (weights) and the learning rate; the rest of the parameters are optional.
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
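To make the update rule concrete, here is a minimal pure-Python sketch of a single SGD step on a toy quadratic loss. The names `sgd_step` and `toy_grad` are made up for this illustration and are not part of PyTorch; the real optimizer works on tensors and supports the extra options listed in the signature above.

```python
def sgd_step(weights, grads, lr):
    """One plain SGD update: move each weight against its gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]


def toy_grad(weights):
    """Gradient of the toy loss f(w) = sum(w_i ** 2), which is 2 * w_i."""
    return [2.0 * w for w in weights]


weights = [1.0, -2.0]
for _ in range(100):
    weights = sgd_step(weights, toy_grad(weights), lr=0.1)

print(weights)  # both entries end up very close to 0, the minimum of the toy loss
```

Note that nothing is carried over from one step to the next except the weights themselves, which is why plain SGD is so light on memory.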
Momentum optimization is a variant of SGD that incorporates the previous update into the current one, as if the weights carried momentum. This momentum smooths the training trajectory. The value of the momentum is usually between 0.5 and 1.0, with 0.9 being a common choice.
Speed: very fast, since it only adds one multiplication and one addition per weight.
Memory: this optimization requires more memory, since it needs to store the previous update (the velocity) for every weight.
Performance: very useful, since it averages and smooths the trajectory during convergence. It promotes faster convergence and helps roll past shallow local optima. It almost always converges faster than plain SGD.
Usage: to activate the momentum, set the momentum parameter to a value greater than zero (0.9 is used here only as an example).
torch.optim.SGD(params, lr=<required parameter>, momentum=0.9, dampening=0, weight_decay=0, nesterov=False)
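The sketch below shows one way to write the momentum update in plain Python. It follows the form used in the PyTorch documentation for SGD with zero dampening (the velocity accumulates gradients, and the learning rate is applied when updating the weights). The names `momentum_step` and `toy_grad` are hypothetical helpers for this example only.

```python
def momentum_step(weights, grads, velocity, lr, momentum):
    """One SGD-with-momentum update; returns (new_weights, new_velocity)."""
    # Blend the previous update direction into the current gradient.
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    # Step along the blended direction instead of the raw gradient.
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def toy_grad(weights):
    """Gradient of the toy loss f(w) = sum(w_i ** 2), which is 2 * w_i."""
    return [2.0 * w for w in weights]


weights = [1.0, -2.0]
velocity = [0.0, 0.0]  # the extra per-weight buffer that costs memory
for _ in range(300):
    weights, velocity = momentum_step(
        weights, toy_grad(weights), velocity, lr=0.1, momentum=0.9
    )

print(weights)  # close to 0 after oscillating around the minimum
```

The `velocity` list has one entry per weight, which is exactly the additional memory mentioned above.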
A variant of momentum optimization, Nesterov accelerated gradient, measures the gradient not at the current position but slightly ahead, in the direction of the momentum.
Speed: one additional sum is needed to apply the momentum to the parameters before measuring the gradient.
Memory: no extra memory is used compared to plain momentum.
Performance: it usually works better than plain momentum, since the momentum vector generally points towards the optimum, so a gradient measured slightly ahead in that direction is more accurate. In general, it converges faster than the original momentum.
Usage: to use Nesterov momentum, set the nesterov flag to True and give the optimizer a momentum greater than zero (PyTorch also requires dampening to be 0 in this case; 0.9 is used here only as an example).
torch.optim.SGD(params, lr=<required parameter>, momentum=0.9, dampening=0, weight_decay=0, nesterov=True)
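The look-ahead idea can be sketched in a few lines of plain Python. This version follows the textbook description (measure the gradient at the position the momentum is about to carry us to); PyTorch's own implementation is formulated somewhat differently, so treat this as an illustration of the idea and not as a copy of the library's code. `nesterov_step` and `toy_grad` are hypothetical names.

```python
def nesterov_step(weights, velocity, grad_fn, lr, momentum):
    """One Nesterov update; returns (new_weights, new_velocity)."""
    # The additional sum: peek ahead in the direction of the momentum.
    lookahead = [w + momentum * v for w, v in zip(weights, velocity)]
    # Measure the gradient at the look-ahead point, not at the current weights.
    grads = grad_fn(lookahead)
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def toy_grad(weights):
    """Gradient of the toy loss f(w) = sum(w_i ** 2), which is 2 * w_i."""
    return [2.0 * w for w in weights]


weights = [1.0, -2.0]
velocity = [0.0, 0.0]  # same buffer as plain momentum, nothing extra
for _ in range(300):
    weights, velocity = nesterov_step(
        weights, velocity, toy_grad, lr=0.1, momentum=0.9
    )

print(weights)  # very close to 0
```

The only state kept between steps is still the velocity, which matches the point that Nesterov needs no more memory than plain momentum.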
Adagrad stands for adaptive gradient. It adapts the learning rate of each parameter individually: every parameter's step is divided by the square root of the sum of all its squared gradients so far. Parameters along steep directions therefore have their learning rate reduced faster than parameters along gentle ones, which points the updates more directly towards the minimum. A benefit of using this optimizer is that we don't need to concern ourselves too much with tuning the learning rate manually. The learning rate adapts based on all the gradients seen so far in the current training.
Speed: it is slower per step, since each update also squares the gradient, accumulates it, and takes a square root and a division for every parameter.
Memory: it needs one extra value per parameter to store the accumulated squared gradients.
Performance: in general, it works well for simple quadratic problems, but it often stops too early when training neural networks, since the learning rate gets scaled down so much that it never reaches the minimum. It is not recommended for deep neural networks, but it may be efficient for simpler problems.
Usage: Adagrad can be used with its default parameters.
torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
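The shrinking step size is easy to see in a small pure-Python sketch. The helper `adagrad_step` below is a hypothetical name, and the sketch ignores `lr_decay` and `weight_decay`. The gradient is held constant on purpose, so that any change in the step comes only from the accumulator.

```python
import math


def adagrad_step(weights, grads, accum, lr=0.01, eps=1e-10):
    """One Adagrad update; returns (new_weights, new_accum)."""
    # Running sum of squared gradients, one value per weight.
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    # Divide each step by the square root of that weight's accumulated sum.
    new_weights = [
        w - lr * g / (math.sqrt(a) + eps)
        for w, g, a in zip(weights, grads, new_accum)
    ]
    return new_weights, new_accum


weights, accum = [1.0], [0.0]
step_sizes = []
for _ in range(5):
    grads = [2.0]  # constant gradient, to isolate the effect of the accumulator
    new_weights, accum = adagrad_step(weights, grads, accum, lr=0.1)
    step_sizes.append(weights[0] - new_weights[0])
    weights = new_weights

print(step_sizes)  # every step is smaller than the one before it
```

Because the accumulator only ever grows, the steps keep shrinking even though the gradient stays the same, which is the behaviour that makes Adagrad stop too early on long training runs.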
RMSprop is a variant of the Adagrad algorithm that fixes its tendency to stop too early. It does so by accumulating only the gradients from the most recent iterations, using an exponentially decaying average instead of a running sum.
Speed: it is very similar to Adagrad.
Memory: it uses the same memory as Adagrad.
Performance: it converges much faster than Adagrad and does not stop before reaching a local minimum. It was used by machine learning researchers for a long time before Adam came out. It does not perform very well on very simple problems.
Usage: you might notice there is a new hyperparameter, alpha, which controls how quickly old gradients are forgotten, but the default values usually work well. This technique can also be combined with momentum.
torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
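A minimal pure-Python sketch of the update, with momentum and centering left out, could look like this. `rmsprop_step` is a hypothetical helper, and the gradient is again held constant so that only the effect of the decaying average is visible.

```python
import math


def rmsprop_step(weights, grads, square_avg, lr=0.01, alpha=0.99, eps=1e-8):
    """One RMSprop update; returns (new_weights, new_square_avg)."""
    # Exponentially decaying average of squared gradients: old ones fade out.
    new_square_avg = [
        alpha * s + (1.0 - alpha) * g * g for s, g in zip(square_avg, grads)
    ]
    new_weights = [
        w - lr * g / (math.sqrt(s) + eps)
        for w, g, s in zip(weights, grads, new_square_avg)
    ]
    return new_weights, new_square_avg


weights, square_avg = [1.0], [0.0]
step_sizes = []
for _ in range(1000):
    grads = [2.0]  # constant gradient, to isolate the effect of the average
    new_weights, square_avg = rmsprop_step(weights, grads, square_avg)
    step_sizes.append(weights[0] - new_weights[0])
    weights = new_weights

print(step_sizes[0], step_sizes[-1])  # the step settles near lr instead of shrinking to 0
```

With a constant gradient the average levels off, so the step size settles at roughly the learning rate instead of decaying forever as it would with a running sum.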
Adam is a relatively new gradient descent optimization method. It stands for adaptive moment estimation, and it is a mix between momentum optimization and RMSprop.
Speed: it is the most expensive per step of the optimizers covered here, since it combines two methods.
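Memory: it keeps two extra values per parameter (a running average of the gradients and one of the squared gradients), which is the same as RMSprop combined with momentum.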
Performance: it usually performs better than RMSprop, since it is a combination of techniques aimed at converging faster on the training data.
Usage: Adam usually works well with the default parameters. Because it is an adaptive method that adjusts the step size automatically, the default learning rate is a reasonable starting point and often needs less tuning than with SGD.
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
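The following pure-Python sketch shows how the two ingredients fit together: a momentum-like average of the gradients and an RMSprop-like average of their squares, both corrected for starting at zero. `adam_step` is a hypothetical helper that leaves out `weight_decay` and `amsgrad`.

```python
import math


def adam_step(weights, grads, m, v, t, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update at step t (starting from 1); returns (weights, m, v)."""
    beta1, beta2 = betas
    # Momentum part: decaying average of the gradients.
    new_m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grads)]
    # RMSprop part: decaying average of the squared gradients.
    new_v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grads)]
    new_weights = []
    for w, mi, vi in zip(weights, new_m, new_v):
        # Bias correction: both averages start at 0 and are too small early on.
        m_hat = mi / (1.0 - beta1 ** t)
        v_hat = vi / (1.0 - beta2 ** t)
        new_weights.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_weights, new_m, new_v


for grad in (2.0, 200.0):
    w, m, v = adam_step([1.0], [grad], [0.0], [0.0], t=1)
    print(grad, 1.0 - w[0])  # the first step is close to lr for both gradients
```

The first step has roughly the size of the learning rate no matter how large the gradient is, which is one reason the default learning rate tends to be a sensible starting point. The two buffers `m` and `v` are the per-parameter state Adam has to keep in memory.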
In the end, which optimization algorithm should you use? It depends. Adaptive algorithms are very popular nowadays, but they require more computation per step and, most of the time, more memory. It has been observed empirically that simple SGD often gets better results on the validation set, as it tends to generalize better; adaptive algorithms seem to optimize the training set too aggressively, ending with high variance and overfitting the data. The problem with SGD is that it might take a long time to reach a minimum, so the total computational resources needed are much higher than for adaptive optimizers. So in the end, if you have a lot of computing resources, you should consider using SGD with momentum, as it tends to generalize better. On the other hand, if your resources, especially time, are limited, Adam is your best choice.