scikit-learn : Data Preprocessing III - Dimensionality reduction via Sequential feature selection / Assessing feature importance via random forests
In this tutorial, we'll use the pandas Wine DataFrame we built in the previous section: scikit-learn : Data Preprocessing II - Partitioning a dataset / Feature scaling / Feature Selection / Regularization.
Sequential feature selection is one of the dimensionality reduction techniques we can use to avoid overfitting by reducing the complexity of a model.
A sequential feature selection algorithm learns which features are most informative at each step, and then chooses the next feature to select or discard depending on the features already chosen.
Another approach to dimensionality reduction is feature extraction, where we derive information from the feature set to construct a new feature subspace.
Sequential feature selection algorithms are a family of greedy search algorithms that reduce an initial $d$-dimensional feature space to a $k$-dimensional feature subspace where $k < d$.
They automatically select the subset of features that is most relevant to the problem, either to reduce the generalization error or to improve the computational efficiency of the model by removing irrelevant features or noise; this can be useful for algorithms that don't support regularization.
The Sequential Backward Selection (SBS) algorithm removes features sequentially from the full feature set until the new feature subspace contains the desired number of features.
In order to decide which feature to remove at each step, we define a criterion function $J$ that we want to minimize; the criterion can simply be the difference in performance of the classifier before and after the removal of a particular feature.
Then, at each stage, we eliminate the feature that causes the least performance loss when it is removed.
The steps are:
- Initialize the algorithm with $k = d$, where $d$ is the dimensionality of the full feature space $X_d$.
- Determine the feature $x^-$ that maximizes the criterion, $x^- = \mathrm{argmax}\, J(X_k - x)$, where $x \in X_k$.
- Remove the feature $x^-$ from the feature set, $X_{k-1} = X_k - x^-$, and decrease $k$ by one.
- Terminate if $k$ equals the number of desired features; otherwise, go to step 2.
The SBS algorithm is not yet implemented in scikit-learn.
Here is the code for the sequential feature selection algorithm, adapted from "Python Machine Learning" by Sebastian Raschka:
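The sketch below follows the book's SBS class; the one assumption made here is that train_test_split is imported from sklearn.model_selection, which is where it lives in current scikit-learn releases:

```python
from itertools import combinations
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SBS():
    """Sequential Backward Selection."""

    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # internal split so the original test set never leaks into training
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)

        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, X_test, y_test,
                                 self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores, subsets = [], []
            # try removing each remaining feature in turn
            for p in combinations(self.indices_, r=dim - 1):
                scores.append(self._calc_score(X_train, y_train,
                                               X_test, y_test, p))
                subsets.append(p)

            best = np.argmax(scores)      # keep the best-scoring subset
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])

        self.k_score_ = self.scores_[-1]
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        return self.scoring(y_test, y_pred)
```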
Now, it's time to see the SBS implementation in action using the KNN classifier from scikit-learn:
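A minimal usage sketch, assuming the standardized training data X_train_std and labels y_train from the previous section are still in memory:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2)

# shrink the feature set all the way down to a single feature,
# recording the best subset and its validation score at every size
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)
```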
Although our SBS implementation already splits the data into training and test subsets inside its fit method, we still fed it only our training data (the standardized X_train_std).
The SBS fit method then creates new subsets for training and for testing (validation), which is why this internal test set is also called the validation dataset.
This approach is necessary to prevent our original test set from becoming part of the training data.
Recall that our SBS algorithm collects the scores of the best feature subset at each stage, so let's move on to the more exciting part of our implementation and plot the classification accuracy of the KNN classifier that was calculated on the validation dataset:
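A plotting sketch (matplotlib assumed) that reads the subset sizes and validation scores recorded by the SBS object:

```python
import matplotlib.pyplot as plt

# number of features in each recorded subset (13 down to 1)
k_feat = [len(k) for k in sbs.subsets_]

plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.1])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.show()
```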
As we can see in the picture above, the accuracy of the KNN classifier improved on the validation dataset as we reduced the number of features, which is likely due to a decrease of the curse of dimensionality. Also note that the classifier achieved 100 percent accuracy for $k=\{5, 6, 7, 8, 9, 10\}$.
Let's see which five features yielded such a good performance on the validation dataset:
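A sketch, assuming the pandas DataFrame df_wine from the previous section (whose first column is the class label):

```python
# subsets_ runs from 13 features down to 1, so the 5-feature
# subset sits at index 8 (the 9th position)
k5 = list(sbs.subsets_[8])
print(df_wine.columns[1:][k5])
```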
Note that we obtained the column indices of the 5-feature subset from the 9th position of the sbs.subsets_ attribute and returned the corresponding feature names from the column index of the pandas Wine DataFrame.
Now we want to evaluate the performance of the KNN classifier on the original test set:
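Again a sketch, assuming the standardized test data X_test_std and labels y_test from the previous section:

```python
# fit on all 13 (standardized) features
knn.fit(X_train_std, y_train)
print('Training accuracy:', knn.score(X_train_std, y_train))
print('Test accuracy:', knn.score(X_test_std, y_test))
```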
In the code, we used the complete feature set and obtained ~98.4 percent accuracy on the training dataset. However, the accuracy on the test dataset was slightly lower (~94.4 percent), which is an indicator of a slight degree of overfitting.
This time, let's use the selected 5-feature subset and see how well KNN performs:
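Restricting both the training and test data to the selected columns, roughly:

```python
# fit on the 5-feature subset found by SBS
knn.fit(X_train_std[:, k5], y_train)
print('Training accuracy:', knn.score(X_train_std[:, k5], y_train))
print('Test accuracy:', knn.score(X_test_std[:, k5], y_test))
```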
Using fewer than half of the original features in the Wine dataset, we improved the prediction accuracy on the test set by almost 2 percent. We also reduced overfitting, which we can tell from the small gap between the test (~96.3 percent) and training (~96.0 percent) accuracy.
Random forests (or random decision forests) are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time and outputting the class chosen by most trees (classification) or the mean prediction of the individual trees (regression).
In the previous sections, we used L1 regularization with logistic regression to remove irrelevant features and used the SBS algorithm for feature selection; the random forest is yet another approach for selecting relevant features from a dataset.
Unlike the previous approaches, where we either relied on L1 regularization or implemented our own selection algorithm, the random forest implementation in scikit-learn already collects feature importances for us. So, all we have to do is access them via the feature_importances_ attribute after fitting a RandomForestClassifier.
Let's train a forest of 10,000 trees on the Wine dataset.
We're going to rank the 13 features by their respective importance measures.
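A sketch of the training and ranking step, assuming the unscaled X_train/y_train split and df_wine from the previous section (tree ensembles don't require feature scaling):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=10000,
                                random_state=0,
                                n_jobs=-1)
forest.fit(X_train, y_train)

# importances are normalized so that they sum to 1.0
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]    # most important first

for f in range(X_train.shape[1]):
    print("%2d) %-30s %f" % (f + 1,
                             feat_labels[indices[f]],
                             importances[indices[f]]))
```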
Here is the plot:
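The bar chart itself can be produced along these lines, reusing importances, indices, and feat_labels from the snippet above:

```python
import matplotlib.pyplot as plt

plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()
```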
In the picture, the different features in the Wine dataset are ranked by their relative importance. Note that feature importances are normalized so that they sum up to 1.0.
From the plot we can see that the alcohol content of wine is the most discriminative feature in the dataset based on the average impurity decrease in the 10,000 decision trees.
"The random forest technique comes with an important gotcha that is worth mentioning. For instance, if two or more features are highly correlated, one feature may be ranked very highly while the information of the other feature(s) may not be fully captured. On the other hand, we don't need to be concerned about this problem if we are merely interested in the predictive performance of a model rather than the interpretation of feature importances." - Python Machine Learning by Sebastian Raschka
Source is available from bogotobogo-Machine-Learning.