Artificial Neural Network (ANN) 6 - Training via BFGS
Continued from Artificial Neural Network (ANN) 5 - Checking gradient, where we computed the gradient of our cost function, checked the accuracy of that computation, and added helper functions to our neural network class so that we are ready to train our Neural Network.
In this article, we're going to train the network with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm, a quasi-Newton relative of gradient descent.
The BFGS algorithm overcomes some of the limitations of plain gradient descent by estimating the second derivative (curvature) of the cost function and using it to search for a stationary point.
"For such problems, a necessary condition for optimality is that the gradient be zero. Newton's method and the BFGS methods are not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum. These methods use both the first and second derivatives of the function. However, BFGS has proven to have good performance even for non-smooth optimizations." (Wikipedia: Broyden-Fletcher-Goldfarb-Shanno algorithm)
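For reference, the core of BFGS is a quasi-Newton update of an approximation $B_k$ to the Hessian of the cost. With $s_k = x_{k+1} - x_k$ and $y_k = \nabla J(x_{k+1}) - \nabla J(x_k)$, the standard update is

$$B_{k+1} = B_k + \frac{y_k y_k^{T}}{y_k^{T} s_k} - \frac{B_k s_k s_k^{T} B_k}{s_k^{T} B_k s_k}$$

and each iteration moves along the search direction $p_k = -B_k^{-1} \nabla J(x_k)$, so the method only ever needs the gradients we already know how to compute.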
Once the network is trained, we'll use the trained parameters instead of the random ones.
However, we're not going to write the BFGS algorithm ourselves; we'll use SciPy's optimize package (scipy.optimize.minimize) instead.
Here is the plan for a "Trainer" class.
To use BFGS, the minimize function needs an objective function that accepts a vector of parameters plus the input and output data, and returns both the cost and the gradients.
So the class is a thin wrapper around our ANN code, and it also implements a callback function that allows us to track the cost function value as we train the network.
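A minimal sketch of such a class, assuming the neural network object exposes the helper methods from part 5 (getParams(), setParams(), costFunction(), and computeGradients(); adjust the names to whatever your class uses):

from scipy import optimize

class Trainer(object):
    def __init__(self, N):
        # keep a local reference to the network being trained
        self.N = N

    def costFunctionWrapper(self, params, X, y):
        # objective for BFGS: load the parameters, return cost and gradient
        self.N.setParams(params)
        cost = self.N.costFunction(X, y)
        grad = self.N.computeGradients(X, y)
        return cost, grad

    def callbackF(self, params):
        # called once per iteration: record the current cost
        self.N.setParams(params)
        self.J.append(self.N.costFunction(self.X, self.y))

    def train(self, X, y):
        self.X, self.y = X, y          # needed by the callback
        self.J = []                    # history of cost values
        params0 = self.N.getParams()   # start from the current (random) weights
        options = {'maxiter': 200, 'disp': True}
        _res = optimize.minimize(self.costFunctionWrapper, params0,
                                 jac=True, method='BFGS',
                                 args=(X, y), options=options,
                                 callback=self.callbackF)
        self.N.setParams(_res.x)       # load the trained parameters back
        self.optimizationResults = _res

The heart of the train() method is the call to optimize.minimize: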
_res = optimize.minimize(self.costFunctionWrapper, params0,
                         jac=True, method='BFGS',
                         args=(X, y), options=options,
                         callback=self.callbackF)
Note that we pass in the initial parameters, set jac=True since our wrapper computes the gradient within our neural network class and returns it along with the cost, set the method to 'BFGS', and pass in our input and output data through args.
Let's train our neural network:
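A minimal usage sketch (the class name NeuralNetwork and the training arrays X and y stand for the network and data built up in the earlier parts of this series; treat the exact names as illustrative):

NN = NeuralNetwork()    # the network class from parts 2-5
T = Trainer(NN)         # wrap it in the Trainer sketched above
T.train(X, y)           # X: hours of sleep/study, y: test scores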
Plot the cost($J$) against the number of iterations through training:
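A plotting sketch, assuming matplotlib and the cost history T.J recorded by the callback above:

import matplotlib.pyplot as plt

plt.plot(T.J)              # one cost value per BFGS iteration
plt.grid(True)
plt.xlabel('Iterations')
plt.ylabel('Cost ($J$)')
plt.show()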
As we can see, the cost is a monotonically decreasing function of the number of iterations.
Also, the optimizer takes fewer than 100 iterations, which is far more efficient than the brute-force search we used in part 3.
Note that as we approach the solution, the curve becomes flatter and the gradient of $J$ gets smaller and smaller. We can check the values of $\frac {\partial J}{\partial W^{(1)}}$ and $\frac {\partial J}{\partial W^{(2)}}$:
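For example, using the gradient helper from the earlier backpropagation article (called costFunctionPrime() there; adjust the name if your class differs):

dJdW1, dJdW2 = NN.costFunctionPrime(X, y)   # gradients at the trained weights
print(dJdW1)    # all entries should now be close to zero
print(dJdW2)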
Finally, we have a trained network that can predict our score on a test based on how many hours we sleep and how many hours we study the night before!
If we run our training data through our forward method (forwardPropagation()), we get the network's predictions:
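A sketch of that check (assuming numpy is imported as np and X, y are the training arrays):

yHat = NN.forwardPropagation(X)    # predictions of the trained network
print(np.hstack((yHat, y)))        # predictions next to the target values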
We can see that our predictions ($\hat y$) from the forward method are pretty close to our target values ($y$).
Now we can go one step further and explore the input space, built with numpy's linspace, for various combinations of hours sleeping and hours studying:
hoursSleep = linspace(0, 10, 100)
hoursStudy = linspace(0, 5, 100)
Here is our test code:
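Here is a sketch of such test code. It assumes the grid inputs are normalized the same way as the training data in this series (sleep hours divided by their maximum of 10, study hours by their maximum of 5; adjust to your own scaling) and uses the forwardPropagation() method of the trained network:

import numpy as np
import matplotlib.pyplot as plt

# hoursSleep and hoursStudy are the linspace grids defined above
hoursSleepNorm = hoursSleep / 10.0
hoursStudyNorm = hoursStudy / 5.0

# every combination of (sleep, study) on a 100 x 100 grid
a, b = np.meshgrid(hoursSleepNorm, hoursStudyNorm)

# flatten the grid into a single (10000, 2) input matrix
allInputs = np.zeros((a.size, 2))
allInputs[:, 0] = a.ravel()
allInputs[:, 1] = b.ravel()

# run the trained network on every grid point
allOutputs = NN.forwardPropagation(allInputs)

# contour plot of the predicted score (rescaled to 0-100)
xx, yy = np.meshgrid(hoursSleep, hoursStudy)
CS = plt.contour(xx, yy, 100 * allOutputs.reshape(100, 100))
plt.clabel(CS, inline=1, fontsize=10)
plt.xlabel('Hours Sleep')
plt.ylabel('Hours Study')
plt.show()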
The plot looks like this:
From the contour plot we may find an optimal combination of the two for our next test!
We can draw a 3-D plot as well:
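A sketch of the 3-D surface plot, reusing xx, yy, and allOutputs from the contour sketch above:

from mpl_toolkits.mplot3d import Axes3D  # enables the 3-D projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(xx, yy, 100 * allOutputs.reshape(100, 100), cmap='jet')
ax.set_xlabel('Hours Sleep')
ax.set_ylabel('Hours Study')
ax.set_zlabel('Test Score')
plt.show()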
It looks like our sleep hours actually have a bigger impact on our grade than our study hours!
Our trained neural network looks pretty good. However, we are not guaranteed to get quality results on real-world data, because we trained on a very small dataset. Even if we train our model on a large dataset, it may still be overfitted and may not give us the desired results on new data.
(Picture source: Python Machine Learning by Sebastian Raschka)
So, in our next article, we'll deal with this issue.
Next:
7. Overfitting & Regularization
Artificial Neural Networks (ANN)
[Note] Sources are available at Github - Jupyter notebook files
1. Introduction
2. Forward Propagation
3. Gradient Descent
4. Backpropagation of Errors
5. Checking gradient
6. Training via BFGS
7. Overfitting & Regularization
8. Deep Learning I : Image Recognition (Image uploading)
9. Deep Learning II : Image Recognition (Image classification)
10. Deep Learning III : Theano, TensorFlow, and Keras