
Super Kai (Kazuya Ito)


Batch, Mini-Batch & Stochastic Gradient Descent in PyTorch


*My post explains optimizers in PyTorch.

Batch Gradient Descent (BGD), Mini-Batch Gradient Descent (MBGD) and Stochastic Gradient Descent (SGD) are the ways of taking data from a dataset to do gradient descent with optimizers such as Adam(), SGD(), RMSprop(), Adadelta(), Adagrad(), etc. in PyTorch.

*Memos:

  • SGD() in PyTorch is just the basic gradient descent with no special features (Classic Gradient Descent (CGD)), not Stochastic Gradient Descent (SGD).
  • For example, using the ways below, you can flexibly do batch, mini-batch or stochastic gradient descent with Adam (Adam()), CGD (SGD()), RMSprop (RMSprop()), Adadelta (Adadelta()), Adagrad (Adagrad()), etc. in PyTorch (see the sketch after this list).
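A minimal sketch, assuming a hypothetical model nn.Linear(3, 1): each of the optimizers mentioned above is created the same way, and which of BGD, MBGD or SGD you get is decided by how you feed the data (shown in the sections below), not by the optimizer class.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)  # hypothetical model, used only for illustration

# Any of these optimizers can be combined with BGD, MBGD or SGD below,
# because the data-taking strategy is decided by the DataLoader, not the optimizer.
adam = torch.optim.Adam(model.parameters(), lr=0.001)
cgd = torch.optim.SGD(model.parameters(), lr=0.01)          # plain (classic) gradient descent
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01)
adadelta = torch.optim.Adadelta(model.parameters(), lr=1.0)
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
```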


(1) Batch Gradient Descent (BGD):

  • can do gradient descent with the whole dataset, taking only one step per epoch. For example, if the whole dataset has 100 samples (1x100), gradient descent happens only once per epoch, which means the model's parameters are updated only once per epoch (see the sketch after this list).
  • uses the average over the whole dataset, so each sample stands out (is emphasized) less than with MBGD and SGD. As a result, the convergence is more stable (fluctuates less) than MBGD and SGD and is also more robust to noise (noisy data) than MBGD and SGD, causing less overshooting than MBGD and SGD and creating a more accurate model than MBGD and SGD if it doesn't get stuck in local minima. However, BGD escapes local minima or saddle points less easily than MBGD and SGD, precisely because the convergence is more stable (fluctuates less). *Memos:
    • Convergence means the initial weights move towards the global minimum of a function by gradient descent.
    • Noise (noisy data) means outliers and anomalies.
    • Overshooting means jumping over the global minimum of a function.
  • 's pros:
    • The convergence is more stable (fluctuates less) than MBGD and SGD.
    • It's more robust to noise (noisy data) than MBGD and SGD.
    • It causes less overshooting than MBGD and SGD.
    • It creates a more accurate model than MBGD and SGD if it doesn't get stuck in local minima.
  • 's cons:
    • It's not good for a large dataset or online learning because it takes a lot of memory, slowing down the convergence. *Online learning is the way in which a model incrementally learns from a stream of data in real time.
    • It needs the repreparation of the whole dataset if you want to update the model.
    • It escapes local minima or saddle points less easily than MBGD and SGD.
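
A minimal sketch of BGD in PyTorch, assuming a hypothetical dataset of 100 samples (3 features, 1 target each) and Adam(); the key point is batch_size=len(dataset), so the inner loop runs only once per epoch.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

X = torch.randn(100, 3)  # hypothetical 100 samples, 3 features each
y = torch.randn(100, 1)  # hypothetical 100 targets
dataset = TensorDataset(X, y)

# batch_size=len(dataset) -> the whole dataset is one batch, i.e. BGD.
dataloader = DataLoader(dataset, batch_size=len(dataset))

model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for batch_X, batch_y in dataloader:          # runs only once per epoch
        optimizer.zero_grad()
        loss = loss_fn(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()                         # parameters updated once per epoch
```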

(2) Mini-Batch Gradient Descent (MBGD):

  • can do gradient descent with a split dataset (the small batches of the whole dataset), one small batch at a time, taking the same number of steps per epoch as there are small batches. For example, if the whole dataset of 100 samples (1x100) is split into 5 small batches (5x20), gradient descent happens 5 times per epoch, which means the model's parameters are updated 5 times per epoch (see the sketch after this list).
  • uses the average of each small batch split from the whole dataset, so each sample stands out (is emphasized) more than with BGD. *Splitting the whole dataset into smaller batches makes each sample stand out more and more. As a result, the convergence is less stable (fluctuates more) than BGD and is also less robust to noise (noisy data) than BGD, causing more overshooting than BGD and creating a less accurate model than BGD even if it doesn't get stuck in local minima. However, MBGD escapes local minima or saddle points more easily than BGD, precisely because the convergence is less stable (fluctuates more).
  • 's pros:
    • It's better for a large dataset or online learning than BGD because it takes less memory than BGD, slowing down the convergence less than BGD.
    • It doesn't need the repreparation of the whole dataset if you want to update the model.
    • It escapes local minima or saddle points more easily than BGD.
  • 's cons:
    • The convergence is less stable (fluctuates more) than BGD.
    • It's less robust to noise (noisy data) than BGD.
    • It causes more overshooting than BGD.
    • It creates a less accurate model than BGD even if it doesn't get stuck in local minima.
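
A minimal sketch of MBGD in PyTorch under the same assumptions (a hypothetical dataset of 100 samples); the only real change from the BGD sketch is batch_size=20, which gives 5 small batches and therefore 5 parameter updates per epoch.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

dataset = TensorDataset(torch.randn(100, 3), torch.randn(100, 1))  # hypothetical data

# batch_size=20 -> 100 samples / 20 = 5 small batches, i.e. MBGD.
dataloader = DataLoader(dataset, batch_size=20, shuffle=True)

model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for batch_X, batch_y in dataloader:          # runs 5 times per epoch
        optimizer.zero_grad()
        loss = loss_fn(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()                         # parameters updated 5 times per epoch
```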

(3) Stochastic Gradient Descent (SGD):

  • can do gradient descent with every single sample of the whole dataset, one sample at a time, taking the same number of steps per epoch as there are samples. For example, if the whole dataset has 100 samples (1x100), gradient descent happens 100 times per epoch, which means the model's parameters are updated 100 times per epoch (see the sketch after this list).
  • uses every single sample of the whole dataset one sample at a time, not an average, so each sample stands out (is emphasized) more than with MBGD. As a result, the convergence is less stable (fluctuates more) than MBGD and is also less robust to noise (noisy data) than MBGD, causing more overshooting than MBGD and creating a less accurate model than MBGD even if it doesn't get stuck in local minima. However, SGD escapes local minima or saddle points more easily than MBGD, precisely because the convergence is less stable (fluctuates more).
  • 's pros:
    • It's better for a large dataset or online learning than MBGD because it takes less memory than MBGD, slowing down the convergence less than MBGD.
    • It doesn't need the repreparation of the whole dataset if you want to update the model.
    • It escapes local minima or saddle points more easily than MBGD.
  • 's cons:
    • The convergence is less stable (fluctuates more) than MBGD.
    • It's less robust to noise (noisy data) than MBGD.
    • It causes more overshooting than MBGD.
    • It creates a less accurate model than MBGD even if it doesn't get stuck in local minima.
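
A minimal sketch of SGD in PyTorch under the same assumptions; here batch_size=1, so each of the 100 samples triggers its own parameter update, i.e. 100 updates per epoch.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

dataset = TensorDataset(torch.randn(100, 3), torch.randn(100, 1))  # hypothetical data

# batch_size=1 -> one sample per step, i.e. SGD.
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for batch_X, batch_y in dataloader:          # runs 100 times per epoch
        optimizer.zero_grad()
        loss = loss_fn(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()                         # parameters updated 100 times per epoch
```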
