In [1]:

```
%pylab inline
import pandas as pd
from ipypublish import nb_setup
%load_ext rpy2.ipython
```

Populating the interactive namespace from numpy and matplotlib

There are many aspects of data analysis that call for grouping individuals, firms, projects, etc. These fall under the rubric of what may be termed **classification** analysis. Cluster analysis comprises a group of techniques that use distance metrics to bunch data into categories.

There are two broad approaches to cluster analysis:

- Agglomerative (hierarchical, or bottom-up): In this case we begin with every entity in its own cluster, so that we start with $n$ clusters. Then, entities are grouped into clusters based on a given distance metric between each pair of entities. In this way a **hierarchy** of clusters is built up, and the researcher can choose which grouping is preferred.

- Partitioning (top-down): In this approach, the entire set of $n$ entities is initially treated as a single cluster. It is then progressively partitioned into smaller and smaller clusters.

We will employ both clustering approaches and examine their properties with various data sets as examples.

The $k$-means approach is top-down. If we have a sample of $n$ observations to be allocated to $k$ clusters, then we can initialize the clusters in many ways. One approach is to treat each of $k$ chosen observations as a cluster unto itself. We then proceed by taking each of the remaining observations and allocating it to the nearest cluster using a distance metric. At the outset, we would simply allocate an observation to its nearest neighbor. The steps are as follows:

1. Suppose you have $n$ observations for clustering.
2. $K$-means first identifies $k$ reasonable starting nodes as the sole members of $k$ clusters.
3. It then takes the remaining $(n-k)$ observations one by one and allocates each to the cluster with the closest centroid. After this, all $n$ observations are allocated to one of the $k$ clusters.
4. Each of the $n$ observations is then re-examined to see if it is closer to the centroid of another cluster, and if so, reassigned.
5. Repeat step 4 as many times as desired (or until no reassignments occur).
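
The steps above can be sketched in NumPy (a minimal illustration, not R's **kmeans** implementation; the function name `kmeans_sketch` and the toy data are ours):

```python
import numpy as np

def kmeans_sketch(X, k, iters=10, seed=0):
    """Minimal k-means: pick k starting nodes, assign, recompute centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: k observations chosen as the initial cluster centroids
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Steps 3-4: allocate every observation to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its members
        # (keep the old centroid if a cluster happens to be empty)
        centers = np.array([X[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels, centers

# Two well-separated bunches, mimicking the kmeans example used later
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(1, 0.3, (50, 2))])
labels, centers = kmeans_sketch(X, k=2)
```

With the bunches this far apart, the two recovered clusters line up almost perfectly with the two generating distributions.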

How is nearness measured? We need a distance metric, and a common one is Euclidean distance. Suppose we have two observations $x_i$ and $x_j$, each represented by a vector of attributes. Suppose our observations are people, and the attributes are height, weight, and IQ, so that $x_i = \{h_i, w_i, I_i\}$ for the $i$-th individual. Then the Euclidean distance between two individuals $i$ and $j$ is

$$ d_{ij} = \sqrt{(h_i-h_j)^2 + (w_i-w_j)^2 + (I_i - I_j)^2} $$

It is usually computed using normalized variables, so that no single variable of large scale dominates the distance calculation. (Normalization is the process of subtracting the mean from each observation and then dividing by the standard deviation.)

In contrast, the "Manhattan" distance is given by (when is this more appropriate?)

$$ d_{ij} = |h_i-h_j| + |w_i-w_j| + |I_i - I_j| $$

We may use other metrics, such as the cosine distance or the Mahalanobis distance. An $n \times n$ matrix of all the $d_{ij}$ values is called the **distance matrix**. Using this distance metric we assign nodes to clusters or attach them to their nearest neighbors. After a few iterations, clusters are no longer made up of singleton observations, the number of clusters falls to the preset number $k$, and all nodes are assigned to one of these $k$ clusters. As we examine each observation we then assign it (or reassign it) to the nearest cluster, where the distance is measured from the observation to some representative node of the cluster.
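
The two metrics can be compared side by side on normalized attributes (a small sketch; the {height, weight, IQ} numbers are made up for illustration):

```python
import numpy as np

# Hypothetical {height (cm), weight (kg), IQ} for three people
X = np.array([[180.0, 80.0, 110.0],
              [165.0, 60.0, 125.0],
              [175.0, 95.0, 100.0]])

# Normalize each attribute: subtract its mean and divide by its standard
# deviation, so no large-scale variable (e.g., height) dominates the distance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

i, j = 0, 1
euclidean = np.sqrt(((Z[i] - Z[j]) ** 2).sum())
manhattan = np.abs(Z[i] - Z[j]).sum()
```

One answer to the rhetorical question above: because the Manhattan metric does not square differences, a single attribute with an extreme discrepancy influences the total distance less than under the Euclidean metric, so it is more robust to outlying attributes.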

Some common choices of the representative node in a cluster are:

- Centroid of the cluster. This is the mean of the observations in the cluster, taken attribute by attribute. The centroid of the two observations above is the average vector $\{(h_i+h_j)/2, (w_i+w_j)/2, (I_i + I_j)/2\}$. This is often called the **center** of the cluster. If there are more nodes, then the centroid is the average of each coordinate across all nodes.
- Closest member of the cluster.
- Furthest member of the cluster.

The algorithm converges when no reassignments of observations to clusters occur. Note that $k$-means is a randomized algorithm, and may not return the same clusters every time it is run. Also, one needs to specify the number of clusters at the outset, and there may be no a priori way to ascertain the correct number; hence, trial and error and examination of the results are called for. Finally, the algorithm aims for balanced clusters, which may not always be appropriate.

In R, we may construct the distance matrix using the **dist** function. Using the NCAA data we are already familiar with, we have:

In [2]:

```
%%R
ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
print(names(ncaa))
d = dist(ncaa[,3:14], method="euclidean")
print(head(d))
print(dim(as.matrix(d)))
```

Examining this object will show that it contains $n(n-1)/2$ elements, i.e., one for each pair of nodes. Only the lower triangle of the distance matrix is stored in $d$.
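
A quick NumPy check of this count (a sketch with random stand-in data, since the NCAA table itself is read on the R side; 64 teams with 12 attributes mirrors `ncaa[,3:14]`):

```python
import numpy as np

n = 64                                     # e.g., 64 NCAA teams
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 12))               # 12 attributes per team

# Full pairwise Euclidean distance matrix
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Keep only the lower triangle (i > j), since d_ij = d_ji and d_ii = 0;
# this is the condensed form that R's dist() stores
d = D[np.tril_indices(n, k=-1)]
```

The condensed vector `d` has exactly $n(n-1)/2$ entries, one per pair.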

Clustering takes many observations with their characteristics and then allocates them into buckets or clusters based on their similarity. In finance, we may use cluster analysis to determine groups of similar firms.

Unlike regression analysis, cluster analysis uses only the right-hand side variables, and there is no dependent variable required. We group observations purely on their overall similarity across characteristics. Hence, it is closely linked to the notion of **communities** that we studied in network analysis, though that concept lives primarily in the domain of networks.

Here we use the example from the **kmeans** function to see how the clusters appear. This function is standard issue, i.e., it comes with the **stats** package, which is included in the base R distribution and does not need to be separately installed. The data is randomly generated but has two bunches of items with different means, so we should be easily able to see two separate clusters. You will need the **graphics** package which is also in the base installation.

In [3]:

```
%%R
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
```

In [4]:

```
%%R
#PLOTTING CLUSTERS
print(names(cl))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=4)
```

In [5]:

```
%%R
#REDO ANALYSIS WITH 5 CLUSTERS
## random starts do help here with too many clusters
(cl <- kmeans(x, 5, nstart = 25))
```

In [6]:

```
%%R
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)
```

We revisit our NCAA data set, and form clusters there.

In [7]:

```
%%R
ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
print(names(ncaa))
fit = kmeans(ncaa[,3:14],4)
print(fit$size)
print(fit$centers)
```

In [8]:

```
%%R
#Since there are more than two attributes of each observation in the data,
#we picked two of them {AST, PTS} and plotted the clusters against those.
idx = c(4,6)
plot(ncaa[,idx],col=fit$cluster)
```

Hierarchical clustering may be carried out top-down (divisive) or bottom-up (agglomerative). At the top level there is just one cluster. A level below, this may be broken down into a few clusters, which are then further broken down into more sub-clusters a level below, and so on. This clustering approach is computationally expensive, and the divisive approach in particular is exponentially expensive in $n$, the number of entities being clustered: the algorithm is ${\cal O}(2^n)$.
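
The bottom-up (agglomerative) idea can be sketched in a few lines of NumPy (a naive single-linkage illustration on toy data; this is not the algorithm behind R's **hclust**, which below uses Ward's method):

```python
import numpy as np

def agglomerative(X, k):
    """Naive bottom-up clustering: repeatedly merge the two closest clusters
    (single linkage) until only k clusters remain."""
    clusters = [[i] for i in range(len(X))]        # start: one cluster per entity
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-linkage distance
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)             # merge cluster b into a
    return clusters

# Two tight, well-separated bunches of 5 points each
rng = np.random.default_rng(0)
Xh = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(2, 0.1, (5, 2))])
groups = agglomerative(Xh, k=2)
```

Cutting the merge process at $k=2$ recovers the two generating bunches; this nested merge history is exactly what the dendrogram plots below display.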

The function for clustering is **hclust** and is included in the **stats** package in the base R distribution.

We begin by first computing the distance matrix. Then we call the **hclust** function, and the **plot** function applied to object **fit** gives what is known as a **dendrogram** plot, showing the cluster hierarchy. We may pick clusters at any level. In this case, we chose a **cut** level such that we get three clusters, and the **rect.hclust** function allows us to superimpose boxes on the clusters so we can see the grouping more clearly. The result is plotted in the Figure below.

In [9]:

```
%%R
d = dist(ncaa[,3:14], method="euclidean")
fit = hclust(d, method="ward.D")
names(fit)
```

[1] "merge" "height" "order" "labels" "method" [6] "call" "dist.method"

In [10]:

```
%%R
plot(fit,main="NCAA Teams")
groups = cutree(fit, k=3)
rect.hclust(fit, k=3, border="blue")
```

We can also visualize the clusters loaded on to the top two principal components as follows, using the **clusplot** function that resides in package **cluster**. The result is plotted in the Figure below.

In [11]:

```
%%R
print(groups)
library(cluster)
clusplot(ncaa[,3:14],groups,color=TRUE,shade=TRUE,labels=2,lines=0)
```

In [12]:

```
%%R
#Using the correlation matrix as a proxy for distance
x = t(as.matrix(ncaa[,3:14]))
d = as.dist((1-cor(x))/2)
fit = hclust(d, method="ward.D")
plot(fit,main="NCAA Teams")
groups = cutree(fit, k=3)
rect.hclust(fit, k=3, border="red")
```

In [13]:

```
%%R
print(groups)
library(cluster)
clusplot(ncaa[,3:14],groups,color=TRUE,shade=TRUE,labels=2,lines=0)
```

The $k$-nearest-neighbors ($k$-NN) classifier is one of the simplest algorithms for classification and grouping. Simply define a distance metric over a set of observations, each with $M$ characteristics, i.e., $x_1, x_2,..., x_M$. Compute the pairwise distance between each pair of observations, using any of the metrics above.

Next, fix $k$, the number of nearest neighbors to be considered. Finally, assign to the case we are trying to classify the category held by the majority of its $k$ nearest neighbors.
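
The majority-vote rule can be sketched directly (a minimal NumPy illustration with made-up two-class data; the function name `knn_predict` is ours, not the **class** package's **knn**):

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    d = np.linalg.norm(x_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(d)[:k]                   # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two toy classes centered at (0,0) and (3,3)
rng = np.random.default_rng(0)
x_train = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y_train = np.array(["A"] * 20 + ["B"] * 20)

pred = knn_predict(x_train, y_train, np.array([2.8, 3.1]), k=3)
```

A query point near the second cluster's center is voted into class "B" by its three nearest neighbors.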

We see an example in R using the **iris** data set that we examined before.

In [14]:

```
%%R
library(class)
data(iris)
print(head(iris))
#SAMPLE A SUBSET
idx = seq(1,length(iris[,1]))
train_idx = sample(idx,100)
test_idx = setdiff(idx,train_idx)
x_train = iris[train_idx,1:4]
x_test = iris[test_idx,1:4]
y_train = iris[train_idx,5]
y_test = iris[test_idx,5]
#RUN knn
res = knn(x_train, x_test, y_train, k = 3, prob = FALSE, use.all = TRUE)
res
```

In [15]:

```
%%R
table(res,y_test)
```

```
            y_test
res          setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         12         1
  virginica       0          3        20
```

Prediction trees are a natural outcome of recursive partitioning of the data. Hence, they are a particular form of clustering at different levels. Usual cluster analysis results in a **flat** partition, but prediction trees develop a multi-level hierarchy of clusters. The term used here is CART, which stands for classification and regression trees. But prediction trees differ from vanilla clustering in an important way -- there is a dependent variable, i.e., a category or a range of values (e.g., a score) that one is attempting to predict.

Prediction trees are of two types: (a) classification trees, where the leaves of the tree are different categories of discrete outcomes; and (b) regression trees, where the leaves are continuous outcomes. We may think of the former as a generalized form of limited dependent variable models, and the latter as a generalized form of regression analysis.

To set ideas, suppose we want to predict the credit score of an individual using age, income, and education as explanatory variables. Assume that income is the best explanatory variable of the three. Then, at the top of the tree, there will be income as the branching variable, i.e., if income is less than some threshold, then we go down the left branch of the tree, else we go down the right. At the next level, it may be that we use education to make the next bifurcation, and then at the third level we use age. A variable may even be repeatedly used at more than one level. This leads us to several leaves at the bottom of the tree that contain the average values of the credit scores that may be reached. For example if we get an individual of young age, low income, and no education, it is very likely that this path down the tree will lead to a low credit score on average. Instead of credit score (an example of a regression tree), consider credit ratings of companies (an example of a classification tree). These ideas will become clearer once we present some examples.

Recursive partitioning is the main algorithmic construct behind prediction trees. We take the data, and using a single explanatory variable, we try to bifurcate it into two categories such that the split yields better **information** than before. For example, suppose we are trying to predict who will make donations and who will not, using a single variable -- income. If we have a sample of people and have not yet analyzed their incomes, we only have the raw frequency $p$ of how many people made donations, i.e., a number between 0 and 1. The **information** of the predicted likelihood $p$ is inversely related to the sum of squared errors (SSE) between this value $p$ and the 0 and 1 values of the observations:

$$ SSE_1 = \sum_{i=1}^n (x_i - p)^2 $$

where $x_i \in \{0,1\}$, depending on whether person $i$ made a donation or not.

Now, if we bifurcate the sample based on income, say to the left we have people with income less than $K$, and to the right, people with incomes greater than or equal to $K$. If we find that the proportion of people on the left making donations is $p_L < p$ and on the right is $p_R > p$, our new information is:

$$ SSE_2 = \sum_{i, Income < K} (x_i - p_L)^2 + \sum_{i, Income \geq K} (x_i - p_R)^2 $$

By choosing $K$ correctly, our recursive partitioning algorithm will maximize the gain, i.e., $\delta = (SSE_1 - SSE_2)$. We stop branching further when, at a given tree level, $\delta$ is less than a pre-specified threshold.
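
The threshold search can be sketched in Python (a toy illustration with made-up income data; the helpers `sse` and `best_split` are ours, not part of **rpart**):

```python
import numpy as np

def sse(x, p):
    """Sum of squared errors between observations x and predicted likelihood p."""
    return ((x - p) ** 2).sum()

def best_split(income, donated):
    """Scan candidate thresholds K and return the one maximizing SSE_1 - SSE_2."""
    p = donated.mean()
    sse1 = sse(donated, p)
    best_K, best_gain = None, -np.inf
    for K in np.unique(income)[1:]:          # candidate cut points
        left, right = donated[income < K], donated[income >= K]
        sse2 = sse(left, left.mean()) + sse(right, right.mean())
        gain = sse1 - sse2
        if gain > best_gain:
            best_K, best_gain = K, gain
    return best_K, best_gain

# Toy data: donations become likely above an income of about 45 (made-up numbers)
income = np.array([10, 20, 30, 40, 45, 55, 60, 70, 80, 90])
donated = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
K, gain = best_split(income, donated)     # K = 45, gain = SSE_1 - SSE_2 = 5/3
```

Here $p = 0.5$ gives $SSE_1 = 2.5$; cutting at $K = 45$ puts all non-donors but one on the left ($p_L = 0$, $SSE = 0$) and mostly donors on the right ($p_R = 5/6$), for the largest gain.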

We note that as $n$ gets large, the computation of binary splits on any variable is expensive, i.e., of order ${\cal O}(2^n)$. But as we go down the tree, and use smaller subsamples, the algorithm becomes faster and faster. In general, this is quite an efficient algorithm to implement.

The motivation of prediction trees is to emulate a decision tree. It also helps make sense of complicated regression scenarios where there are lots of variable interactions over many variables, when it becomes difficult to interpret the meaning and importance of explanatory variables in a prediction scenario. By proceeding in a hierarchical manner on a tree, the decision analysis becomes transparent, and can also be used in practical settings to make decisions.

To demonstrate this, let's use a data set that is already in R. We use the **kyphosis** data set which contains data on children who have had spinal surgery. The model we wish to fit is to predict whether a child has a post-operative deformity or not (variable: Kyphosis = {absent, present}). The variables we use are Age in months, number of vertebrae operated on (Number), and the beginning of the range of vertebrae operated on (Start). The package used is called **rpart** which stands for **recursive partitioning**.

In [16]:

```
%%R
library(rpart)
data(kyphosis)
head(kyphosis)
```

In [17]:

```
%%R
fit = rpart(Kyphosis~Age+Number+Start, method="class", data=kyphosis)
printcp(fit)
summary(kyphosis)
```

In [18]:

```
%%R
summary(fit)
```

We can plot the tree as well, using the **plot** command. The dendrogram-like tree shows the allocation of the $n=81$ cases to the various branches of the tree.

In [19]:

```
%%R
plot(fit, uniform=TRUE)
text(fit, use.n=TRUE, all=TRUE, cex=0.8)
```

This classifier also follows recursive partitioning, as in the previous case, but instead of minimizing the sum of squared errors between the sample data $x$ and the fitted value $p$ at each level, here the goal is to minimize entropy, which maximizes the information gain. The natural entropy ($H$) of the data $x$ is defined as

$$ H = -\sum_x\; f(x) \cdot \ln f(x) $$

where $f(x)$ is the probability density of $x$. This is intuitive because after the optimal split in recursing down the tree, the distribution of $x$ becomes narrower, lowering entropy. This measure is also often known as "differential entropy."

To see this let's do a quick example. We compute entropy for two distributions of varying spread (standard deviation).

In [20]:

```
%%R
dx = 0.001
x = seq(-5,5,dx)
H2 = -sum(dnorm(x,sd=2)*log(dnorm(x,sd=2))*dx)
print(H2)
H3 = -sum(dnorm(x,sd=3)*log(dnorm(x,sd=3))*dx)
print(H3)
```

```
[1] 2.042076
[1] 2.111239
```

Therefore, we see that entropy increases as the normal distribution becomes wider.
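
These numerical sums can be checked against the closed-form differential entropy of a normal distribution, $H = \frac{1}{2}\ln(2\pi e \sigma^2)$, a standard result:

```python
import numpy as np

def normal_entropy(sd):
    """Closed-form differential entropy of N(0, sd^2): 0.5 * ln(2*pi*e*sd^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sd ** 2)

H2 = normal_entropy(2)   # approximately 2.112
H3 = normal_entropy(3)   # approximately 2.518
```

Entropy grows with the spread, as the R computation shows. The numerical values above come out slightly below the analytic ones because the grid truncates the tails at $\pm 5$, which matters more for the wider ($\sigma = 3$) distribution.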

We move from classification trees (discrete outcomes) to regression trees (scored or continuous outcomes). Again, we use data that already exists in R, namely the **cu.summary** data frame of automobile data. Let's load it up.

In [21]:

```
%%R
data(cu.summary)
print(names(cu.summary))
print(head(cu.summary))
print(tail(cu.summary))
print(dim(cu.summary))
print(unique(cu.summary$Type))
print(unique(cu.summary$Country))
```

We will try to predict Mileage using the other variables. (Note: if we tried to predict Reliability, we would be back in the realm of classification trees; here we are looking at regression trees.)

In [22]:

```
%%R
library(rpart)
fit <- rpart(Mileage~Price + Country + Reliability + Type, method="anova", data=cu.summary)
print(summary(fit))
```

In [23]:

```
%%R
plot(fit, uniform=TRUE)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
```

This example is taken from a data set posted by Cosma Shalizi at CMU. We use a different package here, called **tree**, though most of its functionality has since been subsumed by **rpart**, used earlier. The analysis is as follows:

In [24]:

```
%%R
library(tree)
cahomes = read.table("DSTMAA_data/cahomedata.txt",header=TRUE)
print(dim(cahomes))
head(cahomes)
```

In [25]:

```
%%R
summary(cahomes)
mhv = as.matrix(as.numeric(cahomes$MedianHouseValue))
logmhv = log(mhv)
lat = as.matrix(as.numeric(cahomes$Latitude))
lon = as.matrix(as.numeric(cahomes$Longitude))
summary(lm(mhv~lat+lon))
```

In [26]:

```
%%R
fit = tree(logmhv~lon+lat)
plot(fit)
text(fit,cex=0.8)
```