Clustering

Sanjiv R. Das

In [1]:
%pylab inline
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

Overview

  • Grouping individuals, firms, projects, etc.

  • Cluster analysis comprises a group of techniques that use distance metrics to group data into categories.

  • Two approaches:

  1. Partitioning or Top-down: In this approach, the entire set of $n$ entities is assumed to be divided into $k$ clusters, and each entity is then assigned to a cluster.

  2. Agglomerative or Hierarchical or Bottom-up: Here each entity starts in its own cluster, so we begin with $n$ clusters. Entities are then merged into clusters based on a given distance metric between each pair of entities. This builds up a hierarchy of clusters, from which the researcher can choose the preferred grouping.

K-means

  1. Form a distance matrix.
  2. Initialize cluster centroids with evenly spaced items.
  3. Assign each observation to the closest cluster (by distance to the centroid, or to the closest or farthest member).
  4. Recompute centroids and re-assign observations; repeat until the assignments stabilize (see the sketch after this list).
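
Steps 2-4 form a loop (often called Lloyd's algorithm) that is short enough to write out in NumPy. The sketch below is a toy illustration of the idea, not the scikit-learn implementation used later, and is practical only for small arrays since it materializes the full point-to-centroid distance array:

import numpy as np

def kmeans_sketch(X, k, n_iter=20):
    # Step 2: initialize centroids with evenly spaced rows of X
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(n_iter):
        # Step 3: assign each observation to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute centroids; stop once assignments stabilize
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage on random 2-D data (illustrative only)
labels, cents = kmeans_sketch(np.random.rand(200, 2), k=4)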

SBA dataset

In [2]:
import pickle
SBAdata = pickle.load(open("data/SBAdata.p", "rb"))
X = SBAdata["sba"]
print(X.shape)
X.head()
(527700, 24)
Out[2]:
GrossApproval ApprovalFiscalYear InitialInterestRate TermInMonths RevolverStatus JobsSupported Community Advantage Initiative Community Express Contract Guaranty EXPORT IMPORT HARMONIZATION ... Lender Advantage Initiative Patriot Express Revolving Line of Credit Exports - Sec. 7(a) (14) Rural Lender Advantage Seasonal Line of Credit Small Asset Based Small General Contractors - Sec. 7(a) (9) Standard Asset Based INDIVIDUAL PARTNERSHIP
0 50000 2006 11.25 84 1 4 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 35000 2006 12.00 84 0 3 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 15000 2006 12.00 84 0 4 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
3 16000 2006 11.50 84 0 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 16000 2006 11.50 84 0 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 24 columns

In [3]:
#FIT MODEL

#Normalize data, min-max scaling
Xn = (X-X.min())/(X.max()-X.min())

from sklearn.cluster import KMeans
model = KMeans(n_clusters=4).fit(Xn)
In [4]:
#GET CLUSTERS
clusters = model.labels_
X["Cluster"] = clusters
X.groupby(["Cluster"]).mean()
Out[4]:
GrossApproval ApprovalFiscalYear InitialInterestRate TermInMonths RevolverStatus JobsSupported Community Advantage Initiative Community Express Contract Guaranty EXPORT IMPORT HARMONIZATION ... Lender Advantage Initiative Patriot Express Revolving Line of Credit Exports - Sec. 7(a) (14) Rural Lender Advantage Seasonal Line of Credit Small Asset Based Small General Contractors - Sec. 7(a) (9) Standard Asset Based INDIVIDUAL PARTNERSHIP
Cluster
0 703702.161108 2011.367123 6.200822 172.779829 0.000290 16.809084 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.090780 0.029103
1 75825.438249 2010.224700 8.704848 70.136531 1.000000 7.705401 0.000000 0.000000 0.001754 0.000062 ... 0.000000 0.012489 0.004524 0.000005 0.000605 0.000251 0.000385 0.010217 0.178651 0.013530
2 78997.048929 2009.815696 8.286286 77.099233 0.000000 7.785070 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.213993 0.015966
3 120454.275497 2011.021555 7.682084 101.503900 0.019229 8.345480 0.029091 0.389474 0.000558 0.000078 ... 0.351094 0.115357 0.003753 0.065284 0.000264 0.000202 0.000853 0.001396 0.207018 0.022066

4 rows × 24 columns
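
A quick sanity check before interpreting these means is how many loans each cluster contains. This one-liner is an addition (not in the original cells), assuming the Cluster column created above:

print(X["Cluster"].value_counts().sort_index())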

Clustering on PCA-reduced data

In [5]:
#FOUR CLUSTERS
from sklearn.decomposition import PCA
reduced_data = PCA(n_components=2).fit_transform(Xn)
kmeans = KMeans(init='k-means++', n_clusters=4, n_init=10)
kmeans.fit(reduced_data)
Out[5]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
In [6]:
#PLOT SET UP
# Step size of the mesh; decrease it to increase the quality of the VQ.
h = .02     # the mesh covers [x_min, x_max] x [y_min, y_max]

# Plot the decision boundary by assigning a color to each point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
In [7]:
figure(1)
imshow(Z, interpolation='nearest', extent=(xx.min(), xx.max(), yy.min(), 
           yy.max()), cmap=cm.Paired, aspect='auto', origin='lower')
plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
scatter(centroids[:, 0], centroids[:, 1], marker='x', s=169, linewidths=3, 
        color='w', zorder=10)
xlim(x_min, x_max); ylim(y_min, y_max)
xticks(()); yticks(())
show()
In [8]:
#ELEVEN CLUSTERS
kmeans = KMeans(init='k-means++', n_clusters=11, n_init=10)
kmeans.fit(reduced_data)
#PLOT SET UP
# Step size of the mesh; decrease it to increase the quality of the VQ.
h = .02     # the mesh covers [x_min, x_max] x [y_min, y_max]

# Plot the decision boundary by assigning a color to each point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
In [9]:
figure(1)
imshow(Z, interpolation='nearest', extent=(xx.min(), xx.max(), yy.min(), 
           yy.max()), cmap=cm.Paired, aspect='auto', origin='lower')
plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
scatter(centroids[:, 0], centroids[:, 1], marker='x', s=169, linewidths=3, 
        color='w', zorder=10)
xlim(x_min, x_max); ylim(y_min, y_max)
xticks(()); yticks(())
show()
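
The choices of 4 and 11 clusters above are ad hoc. One common heuristic for picking $k$, sketched here as an addition to the original analysis, is to plot the within-cluster sum of squares (exposed by scikit-learn as inertia_) against $k$ and look for an "elbow" where the improvement levels off. Note that this refits KMeans eleven times and can be slow on the full dataset:

inertias = [KMeans(n_clusters=k, n_init=10).fit(reduced_data).inertia_
            for k in range(1, 12)]
plot(range(1, 12), inertias, 'o-')
xlabel('k (number of clusters)'); ylabel('within-cluster sum of squares')
show()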

Hierarchical Clustering (bottom-up)

  1. Compute the distance matrix for the $n$ observations (as sketched below). Each observation starts in its own cluster.
  2. Merge the two closest observations into a cluster. Now we have $(n-1)$ clusters.
  3. Recalculate the centroids.
  4. Repeat to build up the hierarchical structure.
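
Step 1's distance matrix can be formed explicitly with scipy's pdist. A minimal sketch on a small random array (the 5x3 data here is purely illustrative):

import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.random.rand(5, 3)                        # 5 toy observations, 3 features
D = squareform(pdist(pts, metric='euclidean'))    # symmetric 5 x 5 distance matrix
print(D.round(2))

The linkage function used below accepts either the raw observation matrix or the condensed distances returned by pdist.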

NCAA dataset

In [10]:
ncaa = pd.read_table("data/ncaa.txt")
yy = append(ones(32), zeros(32))
ncaa["y"] = yy
ncaa.head()
Out[10]:
No NAME GMS PTS REB AST TO A/T STL BLK PF FG FT 3P y
0 1. NorthCarolina 6 84.2 41.5 17.8 12.8 1.39 6.7 3.8 16.7 0.514 0.664 0.417 1.0
1 2. Illinois 6 74.5 34.0 19.0 10.2 1.87 8.0 1.7 16.5 0.457 0.753 0.361 1.0
2 3. Louisville 5 77.4 35.4 13.6 11.0 1.24 5.4 4.2 16.6 0.479 0.702 0.376 1.0
3 4. MichiganState 5 80.8 37.8 13.0 12.6 1.03 8.4 2.4 19.8 0.445 0.783 0.329 1.0
4 5. Arizona 4 79.8 35.0 15.8 14.5 1.09 6.0 6.5 13.3 0.542 0.759 0.397 1.0
In [11]:
#CREATE FEATURES
X = ncaa.iloc[:,2:13]
X.head()
Out[11]:
PTS REB AST TO A/T STL BLK PF FG FT 3P
0 84.2 41.5 17.8 12.8 1.39 6.7 3.8 16.7 0.514 0.664 0.417
1 74.5 34.0 19.0 10.2 1.87 8.0 1.7 16.5 0.457 0.753 0.361
2 77.4 35.4 13.6 11.0 1.24 5.4 4.2 16.6 0.479 0.702 0.376
3 80.8 37.8 13.0 12.6 1.03 8.4 2.4 19.8 0.445 0.783 0.329
4 79.8 35.0 15.8 14.5 1.09 6.0 6.5 13.3 0.542 0.759 0.397
In [12]:
#HIERARCHICAL CLUSTERING
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward')
In [13]:
#DENDROGRAM
figure(figsize=(20, 5))
title('Hierarchical Clustering Dendrogram')
xlabel('sample index')
ylabel('distance')
dendrogram(Z, 
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
show()
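
The dendrogram shows the hierarchy but not a flat assignment. The R code below cuts the tree into $k=2$ groups with cutree; the scipy equivalent is fcluster, sketched here against the linkage matrix Z computed above (the crosstab against the tournament indicator y is an illustrative addition):

from scipy.cluster.hierarchy import fcluster
groups_py = fcluster(Z, t=2, criterion='maxclust')   # flat cluster labels in {1, 2}
print(pd.crosstab(groups_py, ncaa["y"]))             # how well do the two clusters track y?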

Redo Hierarchical Clustering in R

In [2]:
%load_ext rpy2.ipython
In [3]:
%%R
#CLUSTER
ncaa = read.table("data/ncaa.txt",header=TRUE)
d = dist(ncaa[,3:14], method="euclidean")
fit = hclust(d, method="ward.D")
plot(fit,main="NCAA Teams")
groups = cutree(fit, k=2)
rect.hclust(fit, k=2, border="blue")
In [4]:
%%R
#CLUSTER PLOT
library(cluster)
clusplot(ncaa[,3:14],groups,color=TRUE,shade=TRUE,labels=2,lines=0)