# Clustering

Sanjiv R. Das

## Overview

• Grouping individuals, firms, projects, etc.

• Cluster analysis comprises a group of techniques that uses distance metrics to bunch data into categories.

• Two approaches.

1. Partitioning or Top-down: In this approach, the entire set of $n$ entities is assumed to be divided into $k$ clusters, with $k$ chosen in advance. Entities are then assigned to clusters.

2. Agglomerative or Hierarchical or Bottom-up: In this case we begin with all entities in the analysis being given their own cluster, so that we start with $n$ clusters. Then, entities are grouped into clusters based on a given distance metric between each pair of entities. In this way a hierarchy of clusters is built up and the researcher can choose which grouping is preferred.
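Both approaches rest on a distance metric between entities. As a minimal sketch, assuming Euclidean distance and NumPy (neither is specified in the notes), the pairwise distance matrix for a small, made-up data set can be computed as follows. The function name `distance_matrix` and the sample points are illustrative only.

```python
import numpy as np

def distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X (n x n, symmetric)."""
    # Broadcasting builds an (n, n, p) array of coordinate differences.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Three hypothetical entities described by two features each.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
D = distance_matrix(X)
print(D)
```

Entry `D[i, j]` is the distance between entities `i` and `j`, so the diagonal is zero. Other metrics (for example Manhattan distance) could be substituted without changing the clustering logic.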

## K-means

1. Form a distance matrix.
2. Initialize cluster centroids with evenly spaced items.
3. Assign each observation to the closest cluster, where closeness is measured to the cluster's centroid or to its closest or farthest member.
4. Recompute the centroids and re-assign, repeating until the assignments stabilize.
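The steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a definitive implementation: it uses Euclidean distance, measures closeness to the centroid (one of the options in step 3), and computes observation-to-centroid distances on each pass rather than a full pairwise matrix. The function name `kmeans`, the `max_iter` cap, and the sample data are hypothetical.

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    """Basic k-means with evenly spaced initial centroids (illustrative sketch)."""
    n = X.shape[0]
    # Initialize centroids with k evenly spaced observations.
    idx = np.round(np.linspace(0, n - 1, k)).astype(int)
    centroids = X[idx].astype(float)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Distance from every observation to every centroid (n x k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        # Stop once no observation changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two visibly separated groups of made-up points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, different initializations can lead to different final groupings on less clearly separated data.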

## Hierarchical Clustering (bottom up)

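1. Compute the distance matrix for the $n$ observations, each of which starts in its own cluster.
2. Club the two closest observations into a cluster. Now we have $(n-1)$ clusters.
3. Recalculate the centroids and the distances between clusters.
4. Repeat, merging the two closest clusters each time, until a single cluster remains. The sequence of merges is the hierarchical structure.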
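The bottom-up procedure can be sketched as a naive loop that records each merge. This is a minimal illustration, assuming Euclidean distance between cluster centroids as the linkage rule; the function name `agglomerate`, the merge-record format, and the sample data are hypothetical, and the loop recomputes all centroid distances on every pass, so it is only suitable for small $n$.

```python
import numpy as np

def agglomerate(X):
    """Centroid-based bottom-up clustering; returns the list of merges."""
    # Each observation starts in its own cluster: cluster id -> member row indices.
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        ids = list(clusters)
        # Recalculate the centroid of every current cluster.
        cents = np.array([X[clusters[c]].mean(axis=0) for c in ids])
        d = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)  # ignore a cluster's distance to itself
        a, b = np.unravel_index(d.argmin(), d.shape)
        ca, cb = ids[a], ids[b]
        # Record (first cluster id, second cluster id, centroid distance).
        merges.append((ca, cb, float(d[a, b])))
        # The merged cluster receives a fresh id.
        clusters[next_id] = clusters.pop(ca) + clusters.pop(cb)
        next_id += 1
    return merges

# Four made-up observations on a line.
X = np.array([[0.0], [1.0], [5.0], [7.0]])
merges = agglomerate(X)
for step in merges:
    print(step)
```

Each record names the two clusters that were joined and how far apart their centroids were. Cutting the sequence after any given merge yields one of the groupings in the hierarchy, which is how the researcher chooses the preferred level.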