(Batch) gradient descent algorithm
Gradient descent is an optimization algorithm that searches the parameter space, the intercept ($\theta_0$) and slope ($\theta_1$) in the case of linear regression, by repeatedly applying the following update rule:
\[\begin{aligned} \theta := \theta -\alpha \frac{\partial}{\partial \theta}J(\theta). \end{aligned} \]Note that we use '$:=$' to denote an assignment, i.e., an update of the parameter.
Here \(J(\theta)\) is the cost function and \(\alpha\) is the learning rate, a free parameter. In this tutorial, we're going to use a least squares cost function defined as follows:
\[\begin{aligned} J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2, \end{aligned} \]where \(m\) is the total number of training examples and $h_\theta(x^{(i)})$ is the hypothesis function defined like this:
\[\begin{aligned} h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)} \end{aligned} \]where the superscript $(i)$ denotes the $i^{th}$ sample (we'll reserve subscripts for indexing features when we deal with multiple features).
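As a quick illustration, the hypothesis and the cost can be evaluated directly with NumPy. The toy data and parameter values below are made up purely for this example:

import numpy as np

# toy data (made up for illustration): m = 4 samples of a single feature
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
m = x.shape[0]

theta0, theta1 = 1.0, 2.0           # an arbitrary starting guess

h = theta0 + theta1 * x             # hypothesis h_theta(x) for every sample
J = np.sum((h - y) ** 2) / (2 * m)  # least squares cost J(theta)

print('h =', h)
print('J = %.4f' % J)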
We need derivatives for both $\theta_0$ and $\theta_1$:
\[\begin{aligned} \frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1) &= \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}) = \frac{1}{m} \sum_{i=1}^m (\theta_0 + \theta_1 x^{(i)}-y^{(i)}) \\ \frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1) &= \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x^{(i)} = \frac{1}{m} \sum_{i=1}^m (\theta_0 + \theta_1 x^{(i)}-y^{(i)})x^{(i)} \end{aligned} \]The following code runs until it converges or reaches the maximum number of iterations, and returns $\theta_0$ and $\theta_1$:
import numpy as np
from sklearn.datasets import make_regression
import pylab
from scipy import stats

def gradient_descent(alpha, x, y, ep=0.0001, max_iter=10000):
    converged = False
    iter = 0
    m = x.shape[0]  # number of samples

    # initial theta
    t0 = np.random.random(x.shape[1])
    t1 = np.random.random(x.shape[1])

    # total error, J(theta) (sum of squared errors)
    J = sum([(t0 + t1*x[i] - y[i])**2 for i in range(m)])

    # iterate until convergence
    while not converged:
        # for each training sample, compute the gradient (d/d_theta j(theta))
        grad0 = 1.0/m * sum([(t0 + t1*x[i] - y[i]) for i in range(m)])
        grad1 = 1.0/m * sum([(t0 + t1*x[i] - y[i])*x[i] for i in range(m)])

        # compute the temporary thetas from the old thetas
        temp0 = t0 - alpha * grad0
        temp1 = t1 - alpha * grad1

        # simultaneous update of theta
        t0 = temp0
        t1 = temp1

        # sum of squared errors with the new thetas
        e = sum([(t0 + t1*x[i] - y[i])**2 for i in range(m)])

        if abs(J - e) <= ep:
            print('Converged, iterations:', iter, '!!!')
            converged = True

        J = e       # update error
        iter += 1   # update iteration counter

        if iter == max_iter:
            print('Max iterations exceeded!')
            converged = True

    return t0, t1

if __name__ == '__main__':
    x, y = make_regression(n_samples=100, n_features=1, n_informative=1,
                           random_state=0, noise=35)
    print('x.shape = %s y.shape = %s' % (x.shape, y.shape))

    alpha = 0.01  # learning rate
    ep = 0.01     # convergence criterion

    # call gradient descent, and get intercept(=theta0) and slope(=theta1)
    theta0, theta1 = gradient_descent(alpha, x, y, ep, max_iter=1000)
    print('theta0 = %s theta1 = %s' % (theta0, theta1))

    # check with scipy linear regression
    slope, intercept, r_value, p_value, slope_std_error = stats.linregress(x[:, 0], y)
    print('intercept = %s slope = %s' % (intercept, slope))

    # plot the data and the fitted regression line
    y_predict = theta0 + theta1*x
    pylab.plot(x, y, 'o')
    pylab.plot(x, y_predict, 'k-')
    pylab.show()
    print("Done!")
Output:
x.shape = (100, 1) y.shape = (100,)
Converged, iterations: 641 !!!
theta0 = [-2.81943944] theta1 = [ 43.1387759]
intercept = -2.84963639461 slope = 43.2042438802
The regression line in the plot produced by the code above confirms that we got the right result from our gradient descent algorithm. In addition, SciPy's stats.linregress returns nearly identical intercept and slope values, reassuring us that our theta0 and theta1 are correct.
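We can also sanity-check the analytic gradient formulas above against a numerical finite-difference approximation of J. This is a minimal sketch with made-up data; the cost helper and the epsilon value are our own choices, not part of the original code:

import numpy as np

def cost(t0, t1, x, y):
    # least squares cost J(theta) as defined above
    m = x.shape[0]
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

# made-up data and parameters for the check
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
m = x.shape[0]
t0, t1 = 0.5, 1.5
eps = 1e-6

# analytic gradients from the formulas above
grad0 = np.sum(t0 + t1 * x - y) / m
grad1 = np.sum((t0 + t1 * x - y) * x) / m

# numerical (central difference) approximations
num0 = (cost(t0 + eps, t1, x, y) - cost(t0 - eps, t1, x, y)) / (2 * eps)
num1 = (cost(t0, t1 + eps, x, y) - cost(t0, t1 - eps, x, y)) / (2 * eps)

print('grad0 %.6f vs %.6f' % (grad0, num0))
print('grad1 %.6f vs %.6f' % (grad1, num1))

The two pairs of numbers should agree to several decimal places, which gives us confidence in the derivative expressions before relying on them inside the loop.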
We need to be a little careful when we update $\theta_0$ and $\theta_1$: first we compute the temporary values $temp0$ and $temp1$ using the old $\theta_0$ and $\theta_1$, and only then do we assign the new $\theta_0$ and $\theta_1$ from $temp0$ and $temp1$:
\[\begin{aligned} temp0 &:= \theta_0 -\alpha \frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1) \\ temp1 &:= \theta_1 -\alpha \frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1) \\ \theta_0 &:= temp0 \\ \theta_1 &:= temp1. \end{aligned} \]The following code is almost the same as the code in the previous section, but simpler because it makes better use of NumPy's vectorized operations. It runs for a fixed maximum number of iterations and returns $\theta$, i.e., $\theta_0$ and $\theta_1$:
import numpy as np
from sklearn.datasets import make_regression
import pylab

def gradient_descent_2(alpha, x, y, numIterations):
    m = x.shape[0]  # number of samples
    theta = np.ones(2)
    x_transpose = x.transpose()
    for iter in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        J = np.sum(loss ** 2) / (2 * m)  # cost
        print("iter %s | J: %.3f" % (iter, J))
        gradient = np.dot(x_transpose, loss) / m
        theta = theta - alpha * gradient  # simultaneous update of both parameters
    return theta

if __name__ == '__main__':
    x, y = make_regression(n_samples=100, n_features=1, n_informative=1,
                           random_state=0, noise=35)
    m, n = np.shape(x)
    x = np.c_[np.ones(m), x]  # insert a column of ones for the intercept term
    alpha = 0.01  # learning rate

    theta = gradient_descent_2(alpha, x, y, 1000)
    print('theta =', theta)

    # plot the data and the fitted regression line
    y_predict = np.dot(x, theta)   # theta[0]*1 + theta[1]*x
    pylab.plot(x[:, 1], y, 'o')
    pylab.plot(x[:, 1], y_predict, 'k-')
    pylab.show()
    print("Done!")
Output:
iter 0 | J: 1604.873
iter 1 | J: 1586.636
iter 2 | J: 1568.768
iter 3 | J: 1551.261
...
iter 997 | J: 699.300
iter 998 | J: 699.300
iter 999 | J: 699.300
theta = [ -2.84837957  43.20234847]
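As with the first version, we can cross-check theta against SciPy's stats.linregress. Below is a minimal, self-contained sketch that regenerates the same data (same random_state as above) and fits it:

from scipy import stats
from sklearn.datasets import make_regression

# regenerate the same data used in the script above
x, y = make_regression(n_samples=100, n_features=1, n_informative=1,
                       random_state=0, noise=35)
slope, intercept, r_value, p_value, std_err = stats.linregress(x[:, 0], y)
print('intercept = %s slope = %s' % (intercept, slope))
# these should be close to theta[0] and theta[1] printed by gradient_descent_2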
We may also want to see how batch gradient descent is applied to the well-known Iris dataset: Single Layer Neural Network - Adaptive Linear Neuron using linear (identity) activation function with batch gradient method.