Clustering

Sanjiv R. Das

Overview

  1. Partitioning or Top-down: In this approach, the entire set of $n$ entities is assumed to be divided into $k$ clusters. Then entities are assigned to clusters.

  2. Agglomerative or Hierarchical or Bottom-up: In this case we begin with all entities in the analysis being given their own cluster, so that we start with $n$ clusters. Then, entities are grouped into clusters based on a given distance metric between each pair of entities. In this way a hierarchy of clusters is built up and the researcher can choose which grouping is preferred.

K-means

  1. Form a distance matrix.
  2. Initialize cluster centroids with evenly spaced items.
  3. Assign each observation to the closest cluster (distance measured to the cluster's centroid, or to its closest or farthest member).
  4. Recompute the centroids and re-assign, repeating until the assignments stabilize.
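
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the datasets in these notes: the function name `kmeans`, the `max_iter` cap, and the toy `points` are all assumptions made for the example. It measures distance to the centroid (one of the options in step 3) and computes distances on the fly rather than storing a full distance matrix.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of equal-length tuples (illustrative helper)."""
    n = len(points)
    # Step 2: seed the centroids with k evenly spaced observations.
    centroids = [points[(i * n) // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each observation to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

With real data one would more likely reach for a library routine such as `sklearn.cluster.KMeans`, which also offers better initialization schemes than evenly spaced seeds.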

SBA dataset

Clustering on PCA reduced data

Hierarchical Clustering (bottom up)

  1. Compute the distance matrix for the $n$ observations, each of which starts in its own cluster.
  2. Merge the two closest observations into a single cluster. Now we have $(n-1)$ clusters.
  3. Recalculate centroids.
  4. Repeat to get hierarchical structure.
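
As a rough sketch of the procedure above, the following plain-Python function merges the two closest clusters, recalculates the merged centroid, and repeats until one cluster remains. Reading step 3 as centroid linkage is an assumption, as are the function name `agglomerate`, the convention of giving each merged cluster a fresh id, and the toy `points`. Distances are recomputed at each pass rather than cached in a matrix, which is fine only for small examples.

```python
import math

def agglomerate(points):
    """Centroid-linkage agglomerative clustering; returns the merge history."""
    # Step 1: every observation starts in its own cluster.
    clusters = {i: [i] for i in range(len(points))}
    centroids = {i: tuple(p) for i, p in enumerate(points)}
    merges = []
    next_id = len(points)  # ids for merged clusters continue after n - 1
    while len(clusters) > 1:
        # Step 2: find the pair of clusters whose centroids are closest.
        ids = sorted(clusters)
        a, b = min(
            ((i, j) for idx, i in enumerate(ids) for j in ids[idx + 1:]),
            key=lambda pair: math.dist(centroids[pair[0]], centroids[pair[1]]),
        )
        # Merge the pair, then (step 3) recalculate the new centroid.
        members = clusters.pop(a) + clusters.pop(b)
        del centroids[a], centroids[b]
        clusters[next_id] = members
        centroids[next_id] = tuple(
            sum(points[m][d] for m in members) / len(members)
            for d in range(len(points[0]))
        )
        merges.append((a, b, sorted(members)))
        next_id += 1  # step 4: repeat until a single cluster remains
    return merges

# Toy data (made up for illustration).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerate(points)
for a, b, members in merges:
    print(f"merge {a} + {b} -> {members}")
```

Each entry of `merges` records which two clusters were joined and the observations in the result, so reading the list top to bottom traces the hierarchy from $n$ singletons to one cluster. For real work, `scipy.cluster.hierarchy.linkage` provides this and other linkage rules along with dendrogram plotting.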

NCAA dataset

Redo Hierarchical Clustering in R