scikit-learn : Random Decision Forests Classification


Random-Decision-Forests

Basically, a random forest is an ensemble of decision trees.

Thanks to their good classification performance, scalability, and ease of use, random forests have gained huge popularity in machine learning.

Algorithm

The random forest algorithm can be summarized in the following steps (ref: Python Machine Learning by Sebastian Raschka):

1. Draw a random bootstrap sample of size $n$ (randomly choose $n$ samples from the training set with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
   a. Randomly select $d$ features without replacement.
   b. Split the node using the feature that provides the best split according to the objective function, for instance, by maximizing the information gain.
3. Repeat steps 1 to 2 $k$ times.
4. Aggregate the predictions of the trees to assign the class label by majority vote.
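The steps above can be sketched with scikit-learn's DecisionTreeClassifier as the base learner. This is a minimal illustration, not the library's implementation: the helper names fit_forest and predict_forest are made up for this example, and the per-node feature subsampling is delegated to the tree's max_features parameter. Note also that scikit-learn's own RandomForestClassifier averages the trees' predicted class probabilities instead of taking a hard majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, k=10, d='sqrt', seed=0):
    """Fit k trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        # Bootstrap sample: draw n row indices with replacement.
        idx = rng.randint(0, n, size=n)
        # max_features=d makes the tree look at a random subset of
        # d features at every node when searching for the best split.
        tree = DecisionTreeClassifier(criterion='entropy', max_features=d,
                                      random_state=rng.randint(0, 2**31 - 1))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Majority vote over the per-tree predictions (hypothetical helper)."""
    # votes has shape (k, n_samples); labels are assumed to be
    # non-negative integers, as in the Iris dataset.
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])


X, y = load_iris(return_X_y=True)
trees = fit_forest(X, y, k=10)
pred = predict_forest(trees, X)
print('training accuracy: %.3f' % np.mean(pred == y))
```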

Random forests don't provide the same level of interpretability as decision trees. However, a big advantage of random forests is that we don't have to worry so much about selecting good hyper-parameter values.

Because the ensemble model is quite robust and resistant to noise from the individual decision trees, we typically don't need to prune the random forest, and the only parameter we care about is the number of trees $k$ (step 3).

Typically, the larger the number of trees, the better the performance of the random forest classifier, at the cost of increased computation.
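One way to see this trade-off is to fit forests of different sizes and compare their test accuracy. The sketch below assumes the Iris dataset and an arbitrary set of values for $k$; on such a small dataset the accuracy may level off quickly or fluctuate from one random seed to another.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for k in (1, 5, 25, 100):
    # n_estimators is the number of trees k in the forest
    forest = RandomForestClassifier(n_estimators=k, random_state=1)
    forest.fit(X_train, y_train)
    scores[k] = forest.score(X_test, y_test)
    print('k = %3d  test accuracy = %.3f' % (k, scores[k]))
```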

In scikit-learn's RandomForestClassifier implementation, the size of the bootstrap sample is by default chosen to be equal to the number of samples in the original training set.
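Even when the bootstrap sample is as large as the training set, sampling with replacement means that each tree sees only a part of the distinct training samples (about 63% for large $n$, since the chance that a given sample is never drawn approaches $1/e$). A small NumPy sketch, using an assumed training set size of 105:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 105  # assumed training set size (70% of the 150 Iris samples)

# Bootstrap sample: n draws with replacement from the indices 0..n-1
idx = rng.randint(0, n, size=n)

unique_fraction = np.unique(idx).size / n
print('distinct samples in this bootstrap: %.2f' % unique_fraction)
print('expected fraction for this n: %.3f' % (1 - (1 - 1 / n) ** n))
```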

Note that by choosing a larger value for the sample size $n$, we decrease the randomness and thus the forest is more likely to overfit.

On the other hand, we can reduce the degree of overfitting by choosing smaller values for $n$ at the expense of the model performance.
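In recent scikit-learn releases (0.22 and later, as far as we know), the bootstrap sample size can be controlled with the max_samples parameter of RandomForestClassifier. The sketch below compares training and test accuracy for a few sample fractions; the fractions and the number of trees are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
# A float is read as a fraction of the training set;
# None (the default) draws as many samples as the training set has.
for frac in (0.1, 0.5, None):
    forest = RandomForestClassifier(n_estimators=50, max_samples=frac,
                                    random_state=1)
    forest.fit(X_train, y_train)
    results[frac] = (forest.score(X_train, y_train),
                     forest.score(X_test, y_test))
    print('max_samples = %s  train = %.3f  test = %.3f'
          % (frac, results[frac][0], results[frac][1]))
```

A large gap between the training and the test accuracy is a sign of overfitting.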

For the number of features $d$ at each split, we want to choose a value that is smaller than the total number of features in the training set. By default, scikit-learn's RandomForestClassifier uses $d = \sqrt{m}$, where $m$ is the number of features in the training set.
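As a quick check of this rule, the following sketch fits a forest on all four Iris features with max_features='sqrt' and reads the number of features actually considered per split from one of the fitted trees. The variable names m, d and d_used are only for this example.

```python
import math

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
m = X.shape[1]          # total number of features (4 for Iris)
d = int(math.sqrt(m))   # d = sqrt(m), rounded down

forest = RandomForestClassifier(n_estimators=10, max_features='sqrt',
                                random_state=1)
forest.fit(X, y)

# Each fitted tree stores the resolved number of features per split
d_used = forest.estimators_[0].max_features_
print('m = %d, d = %d, used by the trees: %d' % (m, d, d_used))
```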

From the picture, we can see the decision regions created by the ensemble of trees in the random forest.

We trained a random forest from 10 decision trees via the n_estimators parameter and used the entropy criterion as an impurity measure to split the nodes. Although we are growing a very small random forest from a very small training dataset, we used the n_jobs parameter for demonstration purposes, which allows us to parallelize the model training using 2 cores.

The code for the picture looks like this:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
   # setup marker generator and color map
   markers = ('s', 'x', 'o', '^', 'v')
   colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
   cmap = ListedColormap(colors[:len(np.unique(y))])

   # plot the decision surface
   x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
   x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
   xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                          np.arange(x2_min, x2_max, resolution))
   Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
   Z = Z.reshape(xx1.shape)
   plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
   plt.xlim(xx1.min(), xx1.max())
   plt.ylim(xx2.min(), xx2.max())

   # plot all samples, one marker and color per class
   for idx, cl in enumerate(np.unique(y)):
      plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
               alpha=0.8, color=colors[idx],
               marker=markers[idx], label=cl)

   # highlight test samples with hollow circles
   if test_idx is not None:
      X_test = X[test_idx, :]
      plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none',
               edgecolors='black', alpha=1.0, linewidth=1, marker='o',
               s=55, label='test set')

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(criterion='entropy',
                               n_estimators=10, random_state=1, n_jobs=2)

forest.fit(X_train, y_train)

X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))

plot_decision_regions(X_combined, y_combined,
         classifier=forest, test_idx=range(105,150))

plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.show()