scikit-learn : k-Nearest Neighbors (k-NN) Algorithm
The k-Nearest Neighbor (k-NN) classifier is a supervised learning algorithm, and it is a lazy learner. It is called a lazy algorithm because it doesn't learn a discriminative function from the training data but memorizes the training dataset instead.
The k-NN classifier therefore offers an alternative approach to classification: lazy learning lets us make predictions without any model training, but at the cost of a computationally expensive prediction step.
(Figure: illustration of k-NN classification; picture source - wiki)
As shown in the description of the picture above, the k-NN algorithm can be summarized by the following steps (a minimal code sketch of these steps appears right after the list):
- Choose the number of $k$ and a distance metric.
- Find the $k$ nearest neighbors of the sample that we want to classify.
- Assign the class label by majority vote.
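As an illustration of these three steps, here is a minimal NumPy sketch (not the scikit-learn implementation, just my own toy version for a single query point) that assigns the label by majority vote among the k nearest training samples:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # step 1: k and the distance metric (Euclidean here) are chosen up front
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # step 2: find the k nearest neighbors of the query sample
    nearest_idx = np.argsort(distances)[:k]
    # step 3: assign the class label by majority vote
    return Counter(y_train[nearest_idx]).most_common(1)[0][0]

# tiny toy example: two clusters of 2D points
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # prints 0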
Pros and cons:
- Advantage:
The main advantage of such a memory-based approach is that the classifier immediately adapts as we collect new training data.
- Downside:
The computational complexity for classifying new samples grows linearly with the number of samples in the training dataset in the worst-case scenario.
Moreover, we can't discard training samples since no training step is involved. Therefore, storage space can become a challenge if we work with large datasets.
Here is the output of a k-NN model in scikit-learn using a Euclidean distance metric. With 5 neighbors in the k-NN model for this dataset, we obtain a relatively smooth decision boundary:
The implemented code looks like this:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=[cmap(idx)],
                    marker=markers[idx], label=cl)

    # highlight test samples with hollow circles
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    facecolors='none', edgecolors='black', alpha=1.0,
                    linewidths=1, marker='o', s=55, label='test set')

# use petal length and petal width of the Iris dataset as the two features
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# standardize: fit the scaler on the training set only, then apply it to both sets
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))

# 5 neighbors; p=2 with the 'minkowski' metric is the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)

plot_decision_regions(X_combined_std, y_combined, classifier=knn, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.show()
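As a quick sanity check (not part of the original listing; it reuses knn, X_test_std, and y_test from the code above), the accuracy on the held-out test set can be printed with score():

# classification accuracy on the standardized test samples
print('Test accuracy: %.2f' % knn.score(X_test_std, y_test))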
The value of $k$ is crucial for finding a good balance between overfitting and underfitting.
The distance metric should also be appropriate for the features in the dataset. A simple Euclidean distance metric is commonly used for real-valued samples, such as the flowers in our Iris dataset, whose features are measured in centimeters.
However, if we are using a Euclidean distance measure, it is also important to standardize the data so that each feature contributes equally to the distance.
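To tie both points together, here is a small sketch (my own addition, not from the original article) that performs the standardization inside a pipeline and compares cross-validated accuracy for a few candidate values of $k$ on the training data; the particular k values and cv=10 are arbitrary illustrative choices:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# assumes X_train and y_train from the listing above
for k in [1, 3, 5, 7, 9, 15]:
    # the scaler is fit inside each cross-validation fold, avoiding information leakage
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X_train, y_train, cv=10)
    print('k = %2d  mean CV accuracy: %.3f' % (k, scores.mean()))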
The 'minkowski' distance that we used in the code is just a generalization of the Euclidean and Manhattan distance:
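Written out (this is the standard definition; metric='minkowski' with p=2 in the code above gives exactly the Euclidean case), the Minkowski distance between two samples $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$ is

$$d\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \sqrt[p]{\sum_{m} \left| x_m^{(i)} - x_m^{(j)} \right|^p}$$

which reduces to the Euclidean distance for $p = 2$ and to the Manhattan distance for $p = 1$.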
k-NN is also very susceptible to overfitting due to the curse of dimensionality.
The curse of dimensionality happens when the feature space becomes increasingly sparse for an increasing number of dimensions of a fixed-size training dataset.
Basically, in a high-dimensional space, even the closest neighbors can be too far away to give a good estimate.
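A small numerical illustration of this effect (my own sketch with uniform random points, not from the original text): for a fixed number of samples, the average distance to the nearest neighbor keeps growing as dimensions are added.

import numpy as np

rng = np.random.default_rng(0)
n_samples = 100
for d in [2, 10, 100, 1000]:
    X = rng.random((n_samples, d))               # fixed-size dataset in d dimensions
    diffs = X[:, None, :] - X[None, :, :]        # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # Euclidean distance matrix
    np.fill_diagonal(dists, np.inf)              # ignore each point's distance to itself
    print('d = %4d  mean nearest-neighbor distance: %.2f' % (d, dists.min(axis=1).mean()))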
In such cases, we use feature selection and dimensionality reduction techniques to mitigate the curse of dimensionality.
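As a sketch of what this can look like in scikit-learn (again my own illustrative addition; PCA with two components is an arbitrary choice here), dimensionality reduction can simply be chained in front of the k-NN classifier:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# use all four Iris features this time, then project them down to two components
iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                          test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), PCA(n_components=2),
                     KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)
print('Test accuracy with PCA(2) + k-NN: %.2f' % pipe.score(X_te, y_te))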
To make good predictions, we need informative and discriminative features.
So, in subsequent articles, we will further discuss data preprocessing, dimensionality reduction, and feature selection for building good machine learning models.
Data pre-processing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. - https://en.wikipedia.org/wiki/Data_pre-processing
In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, via obtaining a set of principal variables. It can be divided into feature selection and feature extraction. - https://en.wikipedia.org/wiki/Dimensionality_reduction
Ref: Python Machine Learning by Sebastian Raschka