Single Layer Neural Network : Adaptive Linear Neuron using linear (identity) activation function with stochastic gradient descent (SGD)
In the previous tutorial (Single Layer Neural Network - Adaptive Linear Neuron using linear (identity) activation function with batch gradient descent method), we minimized a cost function (objective function) by taking steps in the opposite direction of a gradient computed from the whole training set; that is batch gradient descent.
Running batch gradient descent on a huge dataset can be very costly because we need to reevaluate the whole training dataset for every single update step.
A popular alternative to the batch gradient descent algorithm is stochastic gradient descent (SGD), also known as incremental gradient descent.
The picture below (from the Wikipedia article on stochastic gradient descent) shows a typical result of SGD:
Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.
In our previous tutorial (Single Layer Neural Network - Adaptive Linear Neuron using linear (identity) activation function with batch gradient descent method), we updated the weights based on the sum of the accumulated errors over all samples $x^{(i)}$:
$$ \Delta w = \eta \sum_i (y^{(i)}-\phi(z^{(i)}))x^{(i)}$$

With SGD, we update the weights incrementally for each training sample:
$$ \Delta w = \eta (y^{(i)}-\phi(z^{(i)}))x^{(i)}$$

Stochastic gradient descent (SGD) is an approximation of gradient descent; however, it usually reaches convergence much faster because of the more frequent weight updates.
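To make the contrast concrete, here is a minimal NumPy sketch of the two update rules. The toy feature matrix X, label vector y, weight vector w (bias omitted for brevity), and learning rate eta are only illustrative and are not the tutorial's dataset:

import numpy as np

# toy data: 4 samples, 2 features (illustrative values only)
X = np.array([[0.5, -1.2], [1.0, 0.3], [-0.7, 0.8], [0.2, -0.4]])
y = np.array([1, -1, 1, -1])
w = np.zeros(X.shape[1])   # weight vector (bias term left out for brevity)
eta = 0.01                 # learning rate

# batch gradient descent: ONE update computed from the errors of all samples
errors = y - X.dot(w)                # phi(z) = z for the linear (identity) activation
w_batch = w + eta * X.T.dot(errors)

# stochastic gradient descent: one update PER training sample
w_sgd = w.copy()
for xi, target in zip(X, y):
    error = target - xi.dot(w_sgd)
    w_sgd += eta * error * xi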
Since each gradient is calculated from a single training example, the error surface is noisier than in batch gradient descent. This noise also has an advantage: stochastic gradient descent can escape shallow local minima more readily.
To obtain accurate results with stochastic gradient descent, it is important to present the training data in a random order, which is why we shuffle the training set before every epoch to prevent cycles.
Here are the steps of SGD in pseudo code (a short Python sketch of this loop follows the list):
- Choose an initial vector of parameters $w$ and learning rate $\eta$.
- Repeat until an approximate minimum is obtained:
- Randomly shuffle examples in the training set.
- For $i=1,2,\dots,n$ do: $$w := w - \eta \nabla J_i(w)$$ where $J_i(w)$ is the cost function evaluated on the $i$-th training sample.
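Filling in the pseudo code for our Adaline case, where the per-sample gradient of the cost is $-(y^{(i)}-\phi(z^{(i)}))x^{(i)}$, gives a loop along these lines. The function name sgd_train and its arguments are only for illustration and are not part of the class we build below:

import numpy as np

def sgd_train(X, y, eta=0.01, n_epochs=15, seed=1):
    """Plain-function sketch of the SGD pseudo code for Adaline."""
    rng = np.random.RandomState(seed)
    w = np.zeros(1 + X.shape[1])               # w[0] is the bias unit
    for _ in range(n_epochs):
        r = rng.permutation(len(y))            # shuffle every epoch to prevent cycles
        for xi, target in zip(X[r], y[r]):
            output = np.dot(xi, w[1:]) + w[0]  # linear (identity) activation
            error = target - output
            w[1:] += eta * error * xi          # w := w - eta * grad(J_i)
            w[0] += eta * error
    return w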
Since we've already implemented Adaline with batch gradient descent (Single Layer Neural Network - Adaptive Linear Neuron using linear (identity) activation function with batch gradient descent method), we only need to make a few adjustments to the existing learning algorithm to update the weights via stochastic gradient descent.
Inside the fit method, we will now update the weights after each training sample.
Furthermore, we will implement an additional partial_fit method, which does not reinitialize the weights, for on-line learning.
In order to check whether our algorithm converged during training, we will calculate the cost as the average cost over the training samples in each epoch.
Also, we will shuffle the training data before each epoch to avoid cycles while optimizing the cost function; the random_state parameter seeds the random number generator so the shuffling is reproducible.
Here is the source code using SGD:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import seed
import pandas as pd


# Stochastic Gradient Descent
class SGD(object):
    def __init__(self, rate=0.01, niter=10, shuffle=True, random_state=None):
        self.rate = rate
        self.niter = niter
        self.weight_initialized = False

        # If True, shuffles training data every epoch
        self.shuffle = shuffle

        # Set random state for shuffling and initializing the weights
        if random_state:
            seed(random_state)

    def fit(self, X, y):
        """Fit training data
        X : Training vectors, X.shape : [#samples, #features]
        y : Target values, y.shape : [#samples]
        """
        # weights
        self.initialize_weights(X.shape[1])

        # Cost function
        self.cost = []

        for i in range(self.niter):
            if self.shuffle:
                X, y = self.shuffle_set(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self.update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost.append(avg_cost)
        return self

    def partial_fit(self, X, y):
        """Fit training data without reinitializing the weights"""
        if not self.weight_initialized:
            self.initialize_weights(X.shape[1])
        if y.ravel().shape[0] > 1:
            for xi, target in zip(X, y):
                self.update_weights(xi, target)
        else:
            # single sample
            self.update_weights(X, y)
        return self

    def shuffle_set(self, X, y):
        """Shuffle training data"""
        r = np.random.permutation(len(y))
        return X[r], y[r]

    def initialize_weights(self, m):
        """Initialize weights to zeros"""
        self.weight = np.zeros(1 + m)
        self.weight_initialized = True

    def update_weights(self, xi, target):
        """Apply SGD learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.weight[1:] += self.rate * xi.dot(error)
        self.weight[0] += self.rate * error
        cost = 0.5 * error**2
        return cost

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.weight[1:]) + self.weight[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)


def plot_decision_regions(X, y, classifier, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)


df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)
X = df.iloc[0:100, [0, 2]].values

# standardize
X_std = np.copy(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()

sgd = SGD(niter=15, rate=0.01, random_state=1)
sgd.fit(X_std, y)

plot_decision_regions(X_std, y, classifier=sgd)
plt.title('SGD - Stochastic Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')
plt.show()

plt.plot(range(1, len(sgd.cost) + 1), sgd.cost, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Average Cost')
plt.show()
The np.random.permutation function in shuffle_set() generates a random permutation of the integers from 0 to len(y) - 1 (0 to 99 in our case), which we then use to reindex X and y.
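For example, permuting just five indices shows what the function returns (the exact ordering is random and will differ between runs):

import numpy as np

r = np.random.permutation(5)
print(r)        # e.g. [3 0 4 1 2] -- a random ordering of the indices 0..4
# X[r] and y[r] then reorder the samples and their labels consistently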
As we can see from the plots, the average cost goes down quite quickly, and the final decision boundary after 15 epochs looks similar to the result from batch gradient descent in the previous tutorial.
If we want to update our model in an on-line learning scenario with streaming data, we can simply call the partial_fit method on individual samples:
sgd.partial_fit(X_std[0, :], y[0])
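If new labeled samples keep arriving, we can keep calling partial_fit as they come in. The snippet below simply reuses a few more rows of X_std to stand in for such a stream; no new data is involved:

# feed a few more (sample, label) pairs one at a time, as if they were streamed in
for xi, target in zip(X_std[1:4], y[1:4]):
    sgd.partial_fit(xi, target)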