scikit-learn : Data Preprocessing III - Dimensionality reduction via Sequential feature selection / Assessing feature importance via random forests
In this tutorial, we'll use the pandas Wine DataFrame we built in the previous section: scikit-learn : Data Preprocessing II - Partitioning a dataset / Feature scaling / Feature Selection / Regularization.
Sequential feature selection is one of the dimensionality reduction techniques we can use to avoid overfitting by reducing the complexity of a model.
A sequential feature selection algorithm learns which features are most informative at each step, and then chooses the next feature to select or discard depending on the features already chosen.
Another approach to dimensionality reduction is feature extraction, where we derive information from the feature set to construct a new feature subspace.
Sequential feature selection algorithms are a family of greedy search algorithms that reduce an initial $d$-dimensional feature space to a $k$-dimensional feature subspace where $k < d$.
They automatically select the subset of features that is most relevant to the problem, either to reduce the generalization error or to improve the computational efficiency of the model by removing irrelevant features or noise; this can be useful for algorithms that don't support regularization.
The Sequential Backward Selection (SBS) algorithm removes features sequentially from the full feature set until the new feature subspace contains the desired number of features.
In order to decide which feature to remove at each step, we define a criterion function $J$ that we want to minimize; the criterion can simply be the difference in performance of the classifier before and after the removal of a particular feature.
Then, at each stage, we eliminate the feature that causes the least performance loss when it is removed.
The steps are:
- Initialize the algorithm with $k = d$, where $d$ is the dimensionality of the full feature space $X_d$.
- Determine the feature $x^-$ that maximizes the criterion, $x^- = \mathrm{argmax}\, J(X_k - x)$, where $x \in X_k$.
- Remove the feature $x^-$ from the feature set, $X_{k-1} = X_k - x^-$, and decrease $k$ by one.
- Terminate if $k$ equals the number of desired features; otherwise, go to step 2.
The SBS algorithm is not yet implemented in scikit-learn.
Here is the code for the sequential feature selection algorithm, adapted from "Python Machine Learning" by Sebastian Raschka:
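The sketch below follows the book's SBS class; the one assumption made here is that train_test_split is imported from sklearn.model_selection, which is where it lives in current scikit-learn releases:

```python
from itertools import combinations
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SBS():
    """Sequential Backward Selection."""

    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # internal split so the original test set never leaks into training
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)

        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, X_test, y_test,
                                 self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores, subsets = [], []
            # try removing each remaining feature in turn
            for p in combinations(self.indices_, r=dim - 1):
                scores.append(self._calc_score(X_train, y_train,
                                               X_test, y_test, p))
                subsets.append(p)

            best = np.argmax(scores)      # keep the best-scoring subset
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])

        self.k_score_ = self.scores_[-1]
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        return self.scoring(y_test, y_pred)
```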
Now, it's time to see the SBS implementation in action using the KNN classifier from scikit-learn:
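A minimal usage sketch, assuming the standardized training data X_train_std and labels y_train from the previous section are still in memory:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2)

# shrink the feature set all the way down to a single feature,
# recording the best subset and its validation score at every size
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)
```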
Although our SBS implementation already splits the data into training and test subsets inside its fit method, we still fed it only our training data (the standardized X_train_std).
The SBS fit method then creates new subsets for training and for testing (validation), which is why this internal test set is also called the validation dataset.
This approach is necessary to prevent our original test set from becoming part of the training data.
Recall that our SBS algorithm collects the scores of the best feature subset at each stage, so let's move on to the more exciting part of our implementation and plot the classification accuracy of the KNN classifier that was calculated on the validation dataset:
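A plotting sketch (matplotlib assumed) that reads the subset sizes and validation scores recorded by the SBS object:

```python
import matplotlib.pyplot as plt

# number of features in each recorded subset (13 down to 1)
k_feat = [len(k) for k in sbs.subsets_]

plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.1])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.show()
```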
As we can see in the picture above, the accuracy of the KNN classifier improved on the validation dataset as we reduced the number of features, which is likely due to a decrease of the curse of dimensionality. Also note that the classifier achieved 100 percent accuracy for $k=\{5, 6, 7, 8, 9, 10\}$.
Let's see which five features yielded such a good performance on the validation dataset:
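A sketch, assuming the pandas DataFrame df_wine from the previous section (whose first column is the class label):

```python
# subsets_ runs from 13 features down to 1, so the 5-feature
# subset sits at index 8 (the 9th position)
k5 = list(sbs.subsets_[8])
print(df_wine.columns[1:][k5])
```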
Note that we obtained the column indices of the 5-feature subset from the 9th position of the sbs.subsets_ attribute and returned the corresponding feature names from the column index of the pandas Wine DataFrame.
Now we want to evaluate the performance of the KNN classifier on the original test set:
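Again a sketch, assuming the standardized test data X_test_std and labels y_test from the previous section:

```python
# fit on all 13 (standardized) features
knn.fit(X_train_std, y_train)
print('Training accuracy:', knn.score(X_train_std, y_train))
print('Test accuracy:', knn.score(X_test_std, y_test))
```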
In the code, we used the complete feature set and obtained ~98.4 percent accuracy on the training dataset. However, the accuracy on the test dataset was slightly lower (~94.4 percent), which is an indicator of a slight degree of overfitting.
This time, let's use the selected 5-feature subset and see how well KNN performs:
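Restricting both the training and test data to the selected columns, roughly:

```python
# fit on the 5-feature subset found by SBS
knn.fit(X_train_std[:, k5], y_train)
print('Training accuracy:', knn.score(X_train_std[:, k5], y_train))
print('Test accuracy:', knn.score(X_test_std[:, k5], y_test))
```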
Using fewer than half of the original features in the Wine dataset, we improved the prediction accuracy on the test set by almost 2 percent. We also reduced overfitting, which we can tell from the small gap between the test (~96.3 percent) and training (~96.0 percent) accuracy.
Random forests (or random decision forests) are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time and outputting the class chosen by most trees (classification) or the mean prediction of the individual trees (regression).
In the previous sections, we used L1 regularization with logistic regression to remove irrelevant features and used the SBS algorithm for feature selection; the random forest is yet another approach for selecting relevant features from a dataset.
Unlike the previous approaches, where we either relied on L1 regularization or implemented our own selection algorithm, the random forest implementation in scikit-learn already collects feature importances for us. So, all we have to do is access them via the feature_importances_ attribute after fitting a RandomForestClassifier.
Let's train a forest of 10,000 trees on the Wine dataset.
We're going to rank the 13 features by their respective importance measures.
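A sketch of the training and ranking step, assuming the unscaled X_train/y_train split and df_wine from the previous section (tree ensembles don't require feature scaling):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=10000,
                                random_state=0,
                                n_jobs=-1)
forest.fit(X_train, y_train)

# importances are normalized so that they sum to 1.0
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]    # most important first

for f in range(X_train.shape[1]):
    print("%2d) %-30s %f" % (f + 1,
                             feat_labels[indices[f]],
                             importances[indices[f]]))
```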
Here is the plot:
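The bar chart itself can be produced along these lines, reusing importances, indices, and feat_labels from the snippet above:

```python
import matplotlib.pyplot as plt

plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()
```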
In the picture, the different features in the Wine dataset are ranked by their relative importance. Note that feature importances are normalized so that they sum up to 1.0.
From the plot we can see that the alcohol content of wine is the most discriminative feature in the dataset based on the average impurity decrease in the 10,000 decision trees.
"The random forest technique comes with an important gotcha that is worth mentioning. For instance, if two or more features are highly correlated, one feature may be ranked very highly while the information of the other feature(s) may not be fully captured. On the other hand, we don't need to be concerned about this problem if we are merely interested in the predictive performance of a model rather than the interpretation of feature importances." - Python Machine Learning by Sebastian Raschka
Source is available from bogotobogo-Machine-Learning.