Logistic Regression, Overfitting & regularization
Logistic regression is a generalized linear model that uses the same underlying linear formula, but instead of predicting a continuous output, it models the probability of a categorical outcome.
In other words, it deals with a single outcome variable that takes one of two states, either 0 or 1.
The following picture compares logistic regression with other linear models:
Here are sample cases of three linear models applied to credit analysis:
picture source: Caltech : Lecture 09 - The Linear Model II
The signal $s$ in the figure is defined as the following:
$$ s = \sum_{i=0}^n w_i x_i = \mathbf w^T \mathbf x$$Note that linear regression does nothing to the signal, while logistic regression passes the signal through the added non-linear function $\theta$, and the output of logistic regression is interpreted as a probability.
For example, we can think of the $\theta(s)$ as the "probability of a heart attack" and the signal $s$ as a "risk factor".
Usually, the logistic function (sigmoid) is given like this:
$$ \theta (s) = \frac {e^s}{1+e^s} $$The likelihood for a single data point $(\mathbf x,y)$, with $y = \pm 1$, becomes:
$$ P(y|\mathbf x) = \theta(y\mathbf w^T \mathbf x)$$For a given whole data set, the likelihood should be like this:
$$ \prod_{n=1}^N \theta(y_n \color{purple}{\mathbf w}^T \mathbf x_n)$$Now we want to maximize this likelihood with respect to our parameter $\color{purple}{\mathbf w}$, which in turn becomes a problem of minimizing the "in-sample error" defined as follows:
$$ E_{in}(\color{purple}{\mathbf w}) = \frac {1}{N} \sum_{n=1}^N \ln \left( \frac{1}{\theta(y_n \color{purple}{\mathbf w}^T \mathbf x_n)} \right) $$If we use the sigmoid, the in-sample error for logistic regression becomes:
$$ E_{in}(\color{purple}{\mathbf w}) = \frac {1}{N} \sum_{n=1}^N \underbrace{ \ln \left( 1 + e^{-y_n \color{purple}{\mathbf w}^T \mathbf x_n} \right) }_{\text{ "cross-entropy" error}}$$At this point, we can compare it with the one for linear regression:
$$ E_{in}(\color{purple}{\mathbf w}) = \frac {1}{N} \sum_{n=1}^N \left( \color{purple}{\mathbf w}^T \mathbf x_n - y_n \right)^2$$To minimize the error, we use a general method for nonlinear optimization called gradient descent.
The change $\Delta E_{in}$ for a step of size $\eta$ along a unit direction is bounded as follows:
$$ \Delta E_{in} \ge -\eta \Vert {\nabla E_{in} \left( \mathbf w(0) \right)} \Vert$$where $\eta$ is the step size, and the bound is attained by moving along the negative gradient.
So, the unit vector of steepest descent ($\hat n$) is given as:
$$ \hat n = - \frac {\nabla E_{in} \left( \mathbf w(0) \right)} { \Vert {\nabla E_{in} \left( \mathbf w(0) \right)} \Vert } $$$ \Delta \mathbf w$ becomes:
$$ \Delta \mathbf w = -\eta \frac {\nabla E_{in} \left( \mathbf w(0) \right)} { \Vert {\nabla E_{in} \left( \mathbf w(0) \right)} \Vert } $$If we use learning rate ($\eta_{learn}$):
$$ \Delta \mathbf w = -\eta_{learn} \nabla E_{in} \left( \mathbf w(0) \right) $$So, in each iteration, the weight($w$) can be updated like this:
$$ w(t+1) = w(t)-\eta_{learn} \nabla E_{in}$$where $\nabla E_{in}$ is:
$$ \nabla E_{in} = -\frac{1}{N} \sum_{n=1}^N \frac {y_n \mathbf x_n}{1+e^{y_n \mathbf w^T \mathbf x_n}}$$In Maximum Likelihood Estimation (MLE), with labels $y^{(i)} \in \{0,1\}$ and $\phi$ the sigmoid of $z^{(i)} = \mathbf w^T \mathbf x^{(i)}$, we get the following cost function:
$$ J(\mathbf w) = \sum_{i=1}^n \left[ -y^{(i)}\log\left(\phi(z^{(i)})\right)-\left(1-y^{(i)}\right)\log\left(1-\phi(z^{(i)})\right) \right]$$We can implement this cost function for our own logistic regression.
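Before switching to scikit-learn, here is a minimal NumPy sketch of what such an implementation could look like. It is purely illustrative: the helper names (sigmoid, cost, gradient, fit_logreg), the toy data, and the learning-rate and iteration settings are our own choices. It uses the 0/1-label form of the cost above, whose gradient is $X^T(\phi(z)-y)$, together with the plain batch update $\mathbf w(t+1) = \mathbf w(t) - \eta \nabla J(\mathbf w(t))$:

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: phi(z) = 1 / (1 + e^(-z)) = e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    # J(w) = sum_i [-y_i*log(phi(z_i)) - (1 - y_i)*log(1 - phi(z_i))],
    # with z_i = w^T x_i and labels y_i in {0, 1}
    phi = sigmoid(X.dot(w))
    return np.sum(-y * np.log(phi) - (1.0 - y) * np.log(1.0 - phi))

def gradient(w, X, y):
    # gradient of J(w) for 0/1 labels: X^T (phi(z) - y)
    return X.T.dot(sigmoid(X.dot(w)) - y)

def fit_logreg(X, y, eta=0.01, n_iter=1000):
    # plain batch gradient descent: w(t+1) = w(t) - eta * grad J(w(t))
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w -= eta * gradient(w, X, y)
    return w

# tiny toy example: a bias column plus one feature, two well-separated clusters
rng = np.random.RandomState(0)
X_toy = np.c_[np.ones(20), np.r_[rng.normal(-2, 1, 10), rng.normal(2, 1, 10)]]
y_toy = np.r_[np.zeros(10), np.ones(10)]
w_hat = fit_logreg(X_toy, y_toy)
print(w_hat, cost(w_hat, X_toy, y_toy))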
scikit-learn, however, implements a highly optimized version of logistic regression that also supports multiclass settings off the shelf, so we will skip our own implementation and use the sklearn.linear_model.LogisticRegression class instead.
For the iris dataset, as we've done before, we split the data into separate training and test datasets: we randomly split the X and y arrays into 30 percent test data (45 samples, indices 105-149 of the combined array) and 70 percent training data (105 samples, indices 0-104).
We also applied feature scaling for optimal performance of our algorithm using the StandardScaler class from scikit-learn's preprocessing module.
Also, by using the fit method, StandardScaler estimated the parameters $\mu$ (sample mean) and $\sigma$ (standard deviation) for each feature dimension from the training data.
Then, by calling the transform method, we standardized the training data using those estimated $\mu$ and $\sigma$.
For the test data, we used the same scaling parameters so that the values in the training and test datasets are comparable to each other.
Here is the code for the scikit-learn's logistic regression:
# scikit-learn logistic regression
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)

# Decision region drawing
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    facecolors='none', edgecolors='black', alpha=1.0,
                    linewidth=1, marker='o', s=55, label='test set')

X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.show()
As we can see from the code, we used the LogisticRegression model.
In a later section, we'll learn about the parameter "C" in:
lr = LogisticRegression(C=1000.0, random_state=0)
Also, we're going to go over concepts such as overfitting and regularization.
After fitting the model on the training data, we plotted the decision regions, training samples and test samples. Here is the output from the run:
We can predict the class-membership probability of the samples via the predict_proba method.
For example, we can predict the probabilities of the first Iris sample like this:
>>> lr.predict_proba(X_test_std[0, :].reshape(1, -1))
This returns the following array:
array([[ 2.05743774e-11, 6.31620264e-02, 9.36837974e-01]])
The array tells us that the model predicts a 93.7 percent chance that the sample belongs to the Iris-Virginica class, and a 6.3 percent chance that the sample is an Iris-Versicolor flower. We can check that the first test sample indeed belongs to the Iris-Virginica class (label 2):
>>> y_test
array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 1, 1, 1, 2, 0, 2, 0, 0])
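As a quick usage sketch (not part of the original listing, and assuming the lr model and X_test_std array from the code above), the column of predict_proba with the highest probability corresponds to the label returned by predict:

import numpy as np

# probabilities for the first test sample (one row per sample, one column per class)
probs = lr.predict_proba(X_test_std[0, :].reshape(1, -1))

# the most probable column maps back to a class label via lr.classes_
print(lr.classes_[np.argmax(probs, axis=1)])        # [2] -> Iris-Virginica
print(lr.predict(X_test_std[0, :].reshape(1, -1)))  # same label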
Overfitting is a common problem in machine learning, where a model performs well on training data but does not generalize well to unseen data (test data).
Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
picture from wiki
Conversely, our model can also suffer from underfitting (high bias), which means that our model is not complex enough to capture the pattern in the training data well and therefore also suffers from low performance on unseen data.
In order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, regularization, early stopping, pruning, or Bayesian priors).
Regularization is a way of finding a good bias-variance tradeoff by tuning the complexity of the model. It is a very useful method to handle collinearity (high correlation among features), filter out noise from data, and eventually prevent overfitting.
The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter weights.
The most common form of regularization is the so-called L2 regularization, which can be written as follows:
$$ \frac {\lambda}{2} {\Vert w \Vert}^2 = \frac {\lambda}{2} \sum_{j=1}^m w_j^2 $$where $\lambda$ is the regularization parameter.
picture from wiki - Regularization
To apply regularization to our logistic regression, we just need to add the regularization term to the cost function to shrink the weights:
$$ J(\mathbf w) = \left[\sum_{i=1}^n -y^{(i)}\log\left(\phi(z^{(i)})\right)-\left(1-y^{(i)}\right)\log\left(1-\phi(z^{(i)})\right) \right] + \frac {\lambda}{2} {\Vert w \Vert}^2$$Via the regularization parameter $\lambda$, we can then control how well we fit the training data while keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength.
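Continuing the earlier NumPy sketch (again purely illustrative, reusing the hypothetical cost and gradient helpers defined above), the penalty is simply added on top of the unregularized cost, and the gradient picks up an extra $\lambda w$ term; by convention the bias weight $w_0$ is excluded from the penalty:

def regularized_cost(w, X, y, lam=1.0):
    # J(w) + (lambda / 2) * ||w||^2, excluding the bias weight w[0] from the penalty
    return cost(w, X, y) + (lam / 2.0) * np.sum(w[1:] ** 2)

def regularized_gradient(w, X, y, lam=1.0):
    # gradient of the regularized cost: X^T(phi(z) - y) + lambda * w (bias excluded)
    grad = gradient(w, X, y)
    grad[1:] = grad[1:] + lam * w[1:]
    return grad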
The parameter C implemented for the LogisticRegression class in scikit-learn comes from a convention in support vector machines; C is the inverse of the regularization parameter $\lambda$:
$$ C = \frac {1}{\lambda} $$As we can see in the following plot, the weight coefficients shrink if we decrease the parameter C (increase the regularization strength, $\lambda$):
In the picture, we fitted ten logistic regression models with different values for the inverse-regularization parameter C. The code for the plot looks like this:
# scikit-learn logistic regression
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

from sklearn.linear_model import LogisticRegression
weights, params = [], []
for c in np.arange(-5, 5):
    lr = LogisticRegression(C=10.**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)
weights = np.array(weights)

# plot the weight coefficients for different values of C
import matplotlib.pyplot as plt
plt.plot(params, weights[:, 0], color='blue', marker='x', label='petal length')
plt.plot(params, weights[:, 1], color='green', marker='o', label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='right')
plt.xscale('log')
plt.show()
With virtually identical code in Jupyter: