scikit-learn : Unsupervised Learning - Clustering

"Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics." - wiki : Cluster analysis
Clustering performs the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure (such as the Euclidean distance).
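As a small illustration of such a dissimilarity measure, the sketch below computes the Euclidean distance between a few points using only the standard library. The points and the helper name euclidean are made up for this example and are not part of the iris data or of scikit-learn.

```python
import math

# Hypothetical 2D samples, invented for illustration only.
a = (0.0, 0.0)
b = (3.0, 4.0)
c = (0.5, 0.5)


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


d_ab = euclidean(a, b)  # 5.0
d_ac = euclidean(a, c)  # about 0.707

# A smaller distance means "more similar": of b and c, c is the one closer to a.
closest_to_a = min([b, c], key=lambda p: euclidean(a, p))
```

A clustering algorithm built on this measure would tend to put a and c in the same group before it grouped a with b.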
In this section, we'll use the KMeans algorithm, one of the simplest clustering algorithms. We will reuse the output of the 2D PCA of the iris dataset from the previous chapter (scikit-learn : PCA dimensionality reduction with iris dataset) and try to find 3 groups of samples:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X = iris.data      # assumed: value missing in the original; the script below uses load_iris().data
>>> y = iris.target    # assumed: value missing in the original; the species labels
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2, whiten=True).fit(X)
>>> X_pca = pca.transform(X)
>>> from sklearn.cluster import KMeans
>>> from numpy.random import RandomState
>>> rng = RandomState(42)
>>> kmeans = KMeans(n_clusters=3, random_state=rng).fit(X_pca)
>>> import numpy as np
>>> np.round(kmeans.cluster_centers_, decimals=2)
array([[ 1.02, -0.71],
       [ 0.33,  0.89],
       [-1.29, -0.44]])
>>> kmeans.labels_[:10]
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> kmeans.labels_[-10:]
array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1], dtype=int32)
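To make the idea behind KMeans more concrete, here is a minimal NumPy sketch of the classic iteration (Lloyd's algorithm): assign each sample to its nearest center, move each center to the mean of its samples, and repeat until the centers stop moving. It is an illustration only, not scikit-learn's implementation, which is more elaborate (for example, it uses k-means++ initialization by default). The names kmeans_sketch and assign, the plain random initialization, and the toy data are assumptions made for this example.

```python
import numpy as np


def assign(X, centers):
    """Return, for each sample, the index of its nearest center."""
    # Squared Euclidean distance from every sample to every center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct samples picked at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Update step: each center becomes the mean of its samples;
        # an empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return centers, assign(X, centers)


# Hypothetical toy data: three small groups of 2D points.
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
                [0.0, 5.0], [0.2, 5.1], [0.1, 4.8]])
centers, labels = kmeans_sketch(toy, k=3)
```

Like any run of this iteration, the result depends on the starting centers and may be only a local optimum, which is one reason the session above fixes random_state.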
The code looks like this:
from itertools import cycle

import matplotlib.pyplot as pl  # pyplot instead of the discouraged pylab module
from numpy.random import RandomState
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA


class clustering:
    def __init__(self):
        self.plot(load_iris().data)

    def plot(self, X):
        # Project the 4D iris measurements onto 2 whitened principal components.
        pca = PCA(n_components=2, whiten=True).fit(X)
        X_pca = pca.transform(X)
        # Cluster the projected samples into 3 groups with a fixed seed.
        kmeans = KMeans(n_clusters=3, random_state=RandomState(42)).fit(X_pca)
        plot_2D(X_pca, kmeans.labels_, ["c0", "c1", "c2"])


def plot_2D(data, target, target_names):
    colors = cycle('rgbcmykw')
    target_ids = range(len(target_names))
    pl.figure()
    # One scatter call per cluster, so each gets its own color and legend entry.
    for i, c, label in zip(target_ids, colors, target_names):
        pl.scatter(data[target == i, 0], data[target == i, 1], c=c, label=label)
    pl.legend()
    pl.show()  # needed to display the figure when run as a script


if __name__ == '__main__':
    c = clustering()
The code draws the picture below, "KMeans cluster assignments on 2D PCA iris data":
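Because the iris species are known, the cluster assignments can also be checked with numbers rather than only by eye. The sketch below builds a small contingency table of true labels against cluster labels. Cluster numbers are arbitrary: in the session above, the first ten samples (all of species 0) land in cluster 2. For that reason the table is more informative than a direct equality test. The helper name contingency and the demo arrays are hypothetical and chosen only for illustration.

```python
import numpy as np


def contingency(true_labels, cluster_labels):
    """Count how many samples of each true class fall into each cluster."""
    n_true = true_labels.max() + 1
    n_clusters = cluster_labels.max() + 1
    table = np.zeros((n_true, n_clusters), dtype=int)
    # Add 1 at (true class, cluster) for every sample, including repeated pairs.
    np.add.at(table, (true_labels, cluster_labels), 1)
    return table


# Hypothetical labels for nine samples (not real KMeans output).
y_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_demo = np.array([2, 2, 2, 1, 1, 0, 0, 0, 1])

table = contingency(y_demo, labels_demo)
# Rows are true classes, columns are clusters:
# [[0 0 3]
#  [1 2 0]
#  [2 1 0]]
```

With the real data, the call would be contingency(iris.target, kmeans.labels_). A row whose counts are concentrated in a single column corresponds to a species that the clustering recovered cleanly.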

