Artificial Neural Network (ANN) 7 - Overfitting & Regularization
Continued from Artificial Neural Network (ANN) 6 - Training via BFGS, where we trained our neural network using the BFGS optimization method.
We saw that our neural network gave pretty good predictions of a test score based on how many hours we slept and how many hours we studied the night before.
In this article, we want to check how well our model reflects real-world data.
We want our model to fit the signal but not the noise, so that we can avoid overfitting.
Picture source: Python Machine Learning by Sebastian Raschka
First, we'll work on diagnosing overfitting, and then we'll work on fixing it.
Let's start with the input data for training our neural network:
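As in the earlier articles of this series, the inputs are hours of sleep and hours of study, and the output is the test score. A minimal sketch of such a data set (the exact values here are illustrative assumptions):

import numpy as np

# Input: (hours of sleep, hours of study); output: test score
X = np.array(([3, 5], [5, 1], [10, 2], [6, 1.5]), dtype=float)
y = np.array(([75], [82], [93], [70]), dtype=float)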
Here is the plot for our input data, scores vs hours of sleep/study:
To train our model, we need to normalize the training data:
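A sketch of the normalization, assuming the X and y arrays above: each input column is scaled by its maximum and the scores by the maximum possible score of 100.

# Normalize inputs by the per-column maximum and scores by 100
X = X / np.amax(X, axis=0)
y = y / 100.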
Let's start training our network with the normalized data set:
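Assuming the NeuralNetwork class and the BFGS-based Trainer class from the previous article (ANN 6) are available, training and plotting the cost per iteration might look like the following sketch (matplotlib and the maxiter setting are assumptions):

import matplotlib.pyplot as plt

NN = NeuralNetwork()
T = Trainer(NN)
T.train(X, y)

# Cost per BFGS iteration, recorded by the Trainer's callback
plt.plot(T.J)
plt.grid(1)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()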
The cost function ($J$) plot vs iterations looks like this:
Now we want to generate a grid of new input data using numpy.linspace():
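A sketch of probing the trained network on a dense grid of inputs, normalized the same way as the training data (reusing NN and plt from the sketches above); the contour plot below is built from this kind of output:

# Probe the trained model across the whole input space
hoursSleep = np.linspace(0, 10, 100)
hoursStudy = np.linspace(0, 5, 100)

# Normalize the grid the same way as the training data
hoursSleepNorm = hoursSleep / 10.
hoursStudyNorm = hoursStudy / 5.

# All combinations of sleep/study hours
a, b = np.meshgrid(hoursSleepNorm, hoursStudyNorm)
allInputs = np.zeros((a.size, 2))
allInputs[:, 0] = a.ravel()
allInputs[:, 1] = b.ravel()

allOutputs = NN.forwardPropagation(allInputs)

# Contour of predicted score (%) over the original (unnormalized) axes
xx = np.dot(hoursSleep.reshape(100, 1), np.ones((1, 100))).T
yy = np.dot(hoursStudy.reshape(100, 1), np.ones((1, 100)))
CS = plt.contour(xx, yy, 100 * allOutputs.reshape(100, 100))
plt.clabel(CS, inline=1, fontsize=10)
plt.xlabel('Hours Sleep')
plt.ylabel('Hours Study')
plt.show()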
Contour for the newly generated data looks like this:
From the picture, we can see our model is overfitting, but how do we know for sure?
In general, we want to split our data into 2 portions: training and testing. We won't touch our testing data while training the model, and only use it to see how we're doing since our testing data is a simulation of the real world.
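A sketch of such a split for our toy problem, with a separate held-out test set. The exact values are illustrative assumptions, and so is the detail that the test data is normalized with the training maxima so both sets live on the same scale:

# Training data
trainX = np.array(([3, 5], [5, 1], [10, 2], [6, 1.5]), dtype=float)
trainY = np.array(([75], [82], [93], [70]), dtype=float)

# Testing data (held out; our stand-in for the real world)
testX = np.array(([4, 5.5], [4.5, 1], [9, 2.5], [6, 2]), dtype=float)
testY = np.array(([70], [89], [85], [75]), dtype=float)

# Normalize both sets with the training maxima
maxX = np.amax(trainX, axis=0)
trainX = trainX / maxX
testX = testX / maxX
trainY = trainY / 100.
testY = testY / 100.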
We'll modify our Trainer class a bit so that it also tracks the testing error during training:
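A sketch of the modified Trainer, based on the BFGS trainer from the previous article; the new pieces are the testX/testY arguments and the testJ list updated in the callback (the maxiter option is an assumption):

from scipy import optimize

class Trainer(object):
    def __init__(self, N):
        # Store a reference to the network being trained
        self.N = N

    def costFunctionWrapper(self, params, X, y):
        self.N.setParams(params)
        cost = self.N.costFunction(X, y)
        grad = self.N.computeGradients(X, y)
        return cost, grad

    def callbackF(self, params):
        self.N.setParams(params)
        self.J.append(self.N.costFunction(self.X, self.y))
        # New: also record the cost on the held-out test set
        self.testJ.append(self.N.costFunction(self.testX, self.testY))

    def train(self, trainX, trainY, testX, testY):
        self.X, self.y = trainX, trainY
        self.testX, self.testY = testX, testY

        self.J = []
        self.testJ = []

        params0 = self.N.getParams()
        options = {'maxiter': 200, 'disp': True}
        res = optimize.minimize(self.costFunctionWrapper, params0, jac=True,
                                method='BFGS', args=(trainX, trainY),
                                options=options, callback=self.callbackF)
        self.N.setParams(res.x)
        self.optimizationResults = res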
Let's train our model with the new data:
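A minimal training call, assuming the data split and the modified Trainer sketched above:

NN = NeuralNetwork()
T = Trainer(NN)
T.train(trainX, trainY, testX, testY)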
We can plot the error on our training and testing sets as we train our model and identify the exact point at which overfitting begins.
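For example, using the J and testJ lists recorded by the Trainer above (matplotlib imported as plt as before):

plt.plot(T.J, label='Training')
plt.plot(T.testJ, label='Testing')
plt.grid(1)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.legend()
plt.show()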
As we can see from the picture above, our cost function on the test data ($\color{green}{J}$), which stands in for real-world data, rises sharply around iteration 125, while the $\color{blue}{J}$ on the training data keeps getting smaller and smaller.
Now we know we have an overfitting issue, but how do we fix it?
A simple rule of thumb is that we should have at least 10 times as many examples as the degrees of freedom in our model. For us, since we have 9 weights that can change, we would need 90 observations, which we certainly don't have.
One of the most popular and effective ways of mitigating the overfitting issue is to use a technique called regularization.
One way to implement regularization is to add a term to our cost function that penalizes overly complex models.
A simple but effective way to do this is to add the sum of the squares of our weights to our cost function, so that models with larger weight magnitudes cost more.
We'll also need to normalize the other part of our cost function by the number of examples, to ensure that the ratio of the two error terms does not change with the number of examples.
We're going to introduce a regularization hyperparameter, $\lambda$, that allows us to tune the relative cost: higher values of $\lambda$ impose bigger penalties for high model complexity.
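Written out, the regularized cost (matching the code that follows, with $N$ the number of training examples) is

$$J = \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \frac{\lambda}{2}\left(\sum \left(W^{(1)}\right)^2 + \sum \left(W^{(2)}\right)^2\right)$$

and each gradient picks up an extra $\lambda W$ term (using the $\delta$ notation from the backpropagation article):

$$\frac{\partial J}{\partial W^{(2)}} = \frac{1}{N}\,(a^{(2)})^T \delta^{(3)} + \lambda W^{(2)}, \qquad \frac{\partial J}{\partial W^{(1)}} = \frac{1}{N}\,X^T \delta^{(2)} + \lambda W^{(1)}$$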
We need to make changes to costFunction and costFunctionPrime as well as the __init__():
# New complete class, with changes:
class NeuralNetwork(object):
    def __init__(self, Lambda=0):
        # Define hyperparameters
        self.inputLayerSize = 2
        self.outputLayerSize = 1
        self.hiddenLayerSize = 3

        # Weights (parameters)
        self.W1 = np.random.randn(self.inputLayerSize, self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize, self.outputLayerSize)

        # Regularization parameter:
        self.Lambda = Lambda

    def forwardPropagation(self, X):
        # Propagate inputs through the network
        self.z2 = np.dot(X, self.W1)
        self.a2 = self.sigmoid(self.z2)
        self.z3 = np.dot(self.a2, self.W2)
        yHat = self.sigmoid(self.z3)
        return yHat

    def sigmoid(self, z):
        # Apply sigmoid activation function to scalar, vector, or matrix
        return 1/(1+np.exp(-z))

    def sigmoidPrime(self, z):
        # Gradient of sigmoid
        return np.exp(-z)/((1+np.exp(-z))**2)

    def costFunction(self, X, y):
        # Compute cost for given X, y; use weights already stored in class.
        self.yHat = self.forwardPropagation(X)
        J = 0.5*np.sum((y-self.yHat)**2)/X.shape[0] \
            + (self.Lambda/2)*(np.sum(self.W1**2)+np.sum(self.W2**2))
        return J

    def costFunctionPrime(self, X, y):
        # Compute derivatives with respect to W1 and W2 for a given X and y:
        self.yHat = self.forwardPropagation(X)

        delta3 = np.multiply(-(y-self.yHat), self.sigmoidPrime(self.z3))
        # Add gradient of regularization term:
        dJdW2 = np.dot(self.a2.T, delta3)/X.shape[0] + self.Lambda*self.W2

        delta2 = np.dot(delta3, self.W2.T)*self.sigmoidPrime(self.z2)
        # Add gradient of regularization term:
        dJdW1 = np.dot(X.T, delta2)/X.shape[0] + self.Lambda*self.W1

        return dJdW1, dJdW2

    # Helper functions for interacting with other methods/classes
    def getParams(self):
        # Get W1 and W2 rolled into a vector:
        params = np.concatenate((self.W1.ravel(), self.W2.ravel()))
        return params

    def setParams(self, params):
        # Set W1 and W2 using a single parameter vector:
        W1_start = 0
        W1_end = self.hiddenLayerSize*self.inputLayerSize
        self.W1 = np.reshape(params[W1_start:W1_end],
                             (self.inputLayerSize, self.hiddenLayerSize))
        W2_end = W1_end + self.hiddenLayerSize*self.outputLayerSize
        self.W2 = np.reshape(params[W1_end:W2_end],
                             (self.hiddenLayerSize, self.outputLayerSize))

    def computeGradients(self, X, y):
        dJdW1, dJdW2 = self.costFunctionPrime(X, y)
        return np.concatenate((dJdW1.ravel(), dJdW2.ravel()))
Since we made some changes, let's make sure our gradients are still correct:
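A sketch of a numerical gradient check using central differences, in the spirit of the earlier "Checking gradient" article; the computeNumericalGradient helper and the $\lambda$ value here are assumptions:

def computeNumericalGradient(N, X, y):
    # Central-difference approximation of dJ/dparams
    paramsInitial = N.getParams()
    numgrad = np.zeros(paramsInitial.shape)
    perturb = np.zeros(paramsInitial.shape)
    e = 1e-4

    for p in range(len(paramsInitial)):
        # Perturb one parameter at a time
        perturb[p] = e
        N.setParams(paramsInitial + perturb)
        loss2 = N.costFunction(X, y)
        N.setParams(paramsInitial - perturb)
        loss1 = N.costFunction(X, y)
        numgrad[p] = (loss2 - loss1) / (2 * e)
        perturb[p] = 0

    N.setParams(paramsInitial)
    return numgrad

NN = NeuralNetwork(Lambda=0.0001)
numgrad = computeNumericalGradient(NN, trainX, trainY)
grad = NN.computeGradients(trainX, trainY)

# Should be a very small number (around 1e-8) if the analytic gradient is correct
print(np.linalg.norm(grad - numgrad) / np.linalg.norm(grad + numgrad))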
Let's train our model again.
Here is the data set we're going to use:
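A hedged sketch of the retraining step; the $\lambda$ value here is an assumption (a small value such as 1e-4 works for this toy example), and the train/test arrays are the ones defined above:

# Train a regularized network on the same train/test split
NN = NeuralNetwork(Lambda=0.0001)
T = Trainer(NN)
T.train(trainX, trainY, testX, testY)

# Plot T.J and T.testJ as before to compare training and testing error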
Now our training and testing errors are much closer, which indicates that we have succeeded in reducing overfitting on this dataset.
Let's see our contour plot for test scores against sleep/study hours:
3-D plot:
We see that the fit is still good, but our model is no longer as concerned with fitting our training data exactly.
To reduce the overfitting further, we may want to increase the regularization parameter, $\lambda$.
Here is the plot of the 6 hidden-layer weights ($W^{(1)}$) and the 3 output-layer weights ($W^{(2)}$) of our neural network:
Note: this weight-update picture was plotted later from a separate run, so it does not correspond exactly to the pictures in the previous sections, though it shows the general trend of how the weights are updated over the iterations.
To get the $W$ values at each iteration, we need to modify our Trainer class slightly, as sketched below:
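The exact modification used for the plot above is not reproduced here; one hedged way to achieve the same thing is to extend the Trainer so that its callback also records a copy of the parameter vector at every iteration, and then plot those trajectories:

class TrainerWithHistory(Trainer):
    # Same as Trainer, but also records the weight vector at every iteration.
    def callbackF(self, params):
        super(TrainerWithHistory, self).callbackF(params)
        self.weightsHistory.append(params.copy())

    def train(self, trainX, trainY, testX, testY):
        self.weightsHistory = []
        super(TrainerWithHistory, self).train(trainX, trainY, testX, testY)

T = TrainerWithHistory(NN)
T.train(trainX, trainY, testX, testY)

W_history = np.array(T.weightsHistory)       # shape: (iterations, 9)
plt.plot(W_history[:, :6])                   # the 6 weights of W^(1)
plt.plot(W_history[:, 6:], linestyle='--')   # the 3 weights of W^(2)
plt.grid(1)
plt.xlabel('Iterations')
plt.ylabel('Weight value')
plt.show()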
For the complete notebooks, please visit Github: Artificial-Neural-Networks-with-Jupyter.
Next:
8. Artificial Neural Network (ANN) 8 - Deep Learning I : Image Recognition (Image uploading)