Batch gradient descent vs Stochastic gradient descent

Stochastic gradient descent (SGD, or "on-line" gradient descent) typically converges much faster than batch (or "standard") gradient descent because it updates the weights more frequently.

Batch gradient descent computes the gradient using the whole dataset. SGD, also known as incremental gradient descent, instead estimates the gradient at each iteration from a single randomly picked training example, so the error it follows is typically noisier than in batch gradient descent.

However, this can also have the advantage that stochastic gradient descent can escape shallow local minima more easily.

To obtain accurate results with stochastic gradient descent, the training examples should be presented in random order, which is why we shuffle the training set at every epoch.

picture source : https://wikidocs.net/3413

Stochastic gradient descent

The cost function used to learn the weights for Adaline (Adaptive Linear Neuron) is the sum of squared errors between the target $y^{(i)}$ and the model output for the $i$-th observation, taken over all $N$ observations in the training data set:

$$ J(\mathbf w) = \frac {1}{2} \sum_{i=1}^N \left( y^{(i)}-\phi (\mathbf w^T \mathbf x^{(i)}) \right )^2 $$

where $\phi$ is the activation function (for Adaline, the linear identity function $\phi(z) = z$).
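As a minimal sketch of the cost function above, the following plain-Python snippet evaluates $J(\mathbf w)$ for an identity activation. The names `net_input` and `sse_cost` and the toy data are assumptions made for illustration, and plain lists are used only to keep the example dependency-free; an array library would be the usual choice in practice.

```python
def net_input(w, x):
    """Linear combination w^T x (the first feature can act as a bias input)."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def sse_cost(w, X, y):
    """J(w) = 1/2 * sum_i (y_i - phi(w^T x_i))^2 with identity phi."""
    return 0.5 * sum((y_i - net_input(w, x_i)) ** 2 for x_i, y_i in zip(X, y))


# Hypothetical toy data: the targets equal the second feature exactly.
X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
y = [2.0, 3.0, 5.0]

cost_zero = sse_cost([0.0, 0.0], X, y)   # all-zero weights
cost_exact = sse_cost([0.0, 1.0], X, y)  # weights that reproduce y exactly
print(cost_zero, cost_exact)
```

With all-zero weights every error equals the target itself, while the weights `[0.0, 1.0]` fit this toy data perfectly and give zero cost.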

To find the weights that minimize our cost function, we can use an optimization algorithm called gradient descent:

picture source: Python Machine Learning by Sebastian Raschka

The weight change $\Delta \mathbf w$ is defined as the negative gradient multiplied by the learning rate $\eta $:

$$ \Delta \mathbf w = - \eta \nabla J(\mathbf w) = \eta \sum_{i=1}^N \left( y^{(i)}-\phi (\mathbf w^T \mathbf x^{(i)}) \right ) \mathbf x^{(i)}$$

In batch gradient descent, the gradient used to minimize the cost function is calculated from the whole training set at every step (this is why this approach is referred to as "batch").

If we have a huge dataset with millions of data points, running the batch gradient descent can be quite costly since we need to reevaluate the whole training dataset each time we take one step towards the global minimum.
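The batch update can be sketched as follows, assuming an identity activation and a small toy data set. The function names, data, and learning rate are illustrative assumptions, not taken from the original implementation. Note that every single step loops over all training examples, which is the cost discussed above.

```python
def sse_cost(w, X, y):
    """Sum-of-squared-errors cost with identity activation."""
    return 0.5 * sum(
        (y_i - sum(w_j * x_j for w_j, x_j in zip(w, x_i))) ** 2
        for x_i, y_i in zip(X, y)
    )


def batch_step(w, X, y, eta):
    """One batch update: w + eta * sum_i (y_i - w^T x_i) * x_i."""
    delta = [0.0] * len(w)
    for x_i, y_i in zip(X, y):  # the whole training set is visited per step
        error = y_i - sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        for j, x_j in enumerate(x_i):
            delta[j] += eta * error * x_j
    return [w_j + d_j for w_j, d_j in zip(w, delta)]


# Hypothetical toy data and learning rate.
X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
y = [2.0, 3.0, 5.0]

w_start = [0.0, 0.0]
w_one = batch_step(w_start, X, y, eta=0.01)
cost_before = sse_cost(w_start, X, y)
cost_after = sse_cost(w_one, X, y)
print(w_one, cost_before, cost_after)
```

On this toy data a single step with the assumed learning rate already lowers the cost; a larger learning rate could overshoot instead.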

So, in the stochastic gradient descent method, instead of updating the weights based on the sum of the accumulated errors over all samples $\mathbf x^{(i)}$ via the $\Delta \mathbf w$ defined above, we use the gradient of the cost contributed by a single sample, $J^{(i)}(\mathbf w) = \frac{1}{2} \left( y^{(i)}-\phi (\mathbf w^T \mathbf x^{(i)}) \right )^2$:

$$ \Delta \mathbf w = - \eta \nabla J^{(i)}(\mathbf w) = \eta \left( y^{(i)}-\phi (\mathbf w^T \mathbf x^{(i)}) \right ) \mathbf x^{(i)}$$

Note that we now update the weights incrementally, one training sample at a time, rather than once per pass over the whole training set.
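A single stochastic update can be sketched like this, again assuming an identity activation. The name `sgd_step` and the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def sgd_step(w, x_i, y_i, eta):
    """Update w from one example: w + eta * (y_i - w^T x_i) * x_i."""
    error = y_i - sum(w_j * x_j for w_j, x_j in zip(w, x_i))
    return [w_j + eta * error * x_j for w_j, x_j in zip(w, x_i)]


# Starting from zero weights, the error on this example is 2.0,
# so each weight moves by eta * 2.0 * x_j.
w_new = sgd_step([0.0, 0.0], [1.0, 2.0], 2.0, eta=0.1)
print(w_new)  # approximately [0.2, 0.4]
```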

As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges (source: https://en.wikipedia.org/wiki/Stochastic_gradient_descent).
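One simple way to vary the learning rate is to let it shrink as the number of updates grows. The schedule below, $\eta_t = c_1 / (t + c_2)$, is only one possible choice; the function name `decayed_eta` and the constants `c1` and `c2` are assumptions made for this sketch, not part of the quoted source.

```python
def decayed_eta(step, c1=1.0, c2=100.0):
    """Learning rate that decreases with the number of updates so far."""
    return c1 / (step + c2)


# The rate starts at c1 / c2 and keeps shrinking as training proceeds.
etas = [decayed_eta(t) for t in (0, 100, 900)]
print(etas)
```

Large early steps move the weights quickly, and the smaller later steps reduce the noise of the single-sample updates near the minimum.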

Here is the pseudocode:

Choose an initial vector of parameters $\mathbf w$ and learning rate $\eta$.
Repeat until an approximate minimum is obtained:
1. Randomly shuffle examples in the training set.
2. For $i = 1, 2, \dots, N$ do:
  $\mathbf w := \mathbf w + \Delta \mathbf w$, where $\Delta \mathbf w$ is computed from the $i$-th example only
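The pseudocode above can be turned into a short runnable sketch. It assumes an identity activation, a fixed number of epochs in place of a convergence test, and a fixed learning rate; the name `sgd_fit`, the toy data, and the default parameters are illustrative assumptions.

```python
import random


def sgd_fit(X, y, eta=0.01, epochs=50, seed=0):
    """Stochastic gradient descent for a linear unit with identity activation."""
    rng = random.Random(seed)  # seeded so the shuffling is reproducible
    w = [0.0] * len(X[0])
    indices = list(range(len(X)))
    costs = []
    for _ in range(epochs):
        rng.shuffle(indices)  # step 1: new random order for every pass
        epoch_cost = 0.0
        for i in indices:  # step 2: one weight update per example
            error = y[i] - sum(w_j * x_j for w_j, x_j in zip(w, X[i]))
            w = [w_j + eta * error * x_j for w_j, x_j in zip(w, X[i])]
            epoch_cost += 0.5 * error ** 2
        costs.append(epoch_cost / len(X))  # average cost seen in this epoch
    return w, costs


# Hypothetical toy data: the targets equal the second feature exactly.
X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
y = [2.0, 3.0, 5.0]

w, costs = sgd_fit(X, y)
print(w, costs[0], costs[-1])
```

The per-epoch average cost is recorded before each update, so it is a running estimate rather than the exact cost of the final weights, but it is cheap to track and shows whether training is making progress.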

Here is the output from the Single Layer Neural Network : Adaptive Linear Neuron using a linear (identity) activation function with stochastic gradient descent (SGD):

Mini-batch gradient descent

A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (called a "mini-batch") at each step. This can perform significantly better than true stochastic gradient descent because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples (source: https://en.wikipedia.org/wiki/Stochastic_gradient_descent).
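A minimal mini-batch sketch, under the same assumptions as before (identity activation, fixed learning rate, toy data), might look like the following. The names `minibatch_fit` and `batch_size` are hypothetical. The example uses plain Python loops to stay dependency-free, so it illustrates only the update rule, not the speed-up that a vectorization library would provide.

```python
import random


def sse_cost(w, X, y):
    """Sum-of-squared-errors cost with identity activation."""
    return 0.5 * sum(
        (y_i - sum(w_j * x_j for w_j, x_j in zip(w, x_i))) ** 2
        for x_i, y_i in zip(X, y)
    )


def minibatch_fit(X, y, eta=0.01, epochs=50, batch_size=2, seed=0):
    """Mini-batch gradient descent: one update per group of examples."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            delta = [0.0] * len(w)
            for i in batch:  # accumulate the update over the mini-batch
                error = y[i] - sum(w_j * x_j for w_j, x_j in zip(w, X[i]))
                for j, x_j in enumerate(X[i]):
                    delta[j] += eta * error * x_j
            w = [w_j + d_j for w_j, d_j in zip(w, delta)]
    return w


# Hypothetical toy data: the targets equal the second feature exactly.
X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
y = [2.0, 3.0, 5.0]

w = minibatch_fit(X, y)
initial_cost = sse_cost([0.0, 0.0], X, y)
final_cost = sse_cost(w, X, y)
print(w, initial_cost, final_cost)
```

Setting `batch_size=1` recovers the single-example update, and setting it to the size of the training set recovers batch gradient descent.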