Chapter 12 In the Same Boat: Cluster Analysis and Prediction Trees

12.1 Overview

There are many aspects of data analysis that call for grouping individuals, firms, projects, etc. These fall under the rubric of what may be termed classification analysis. Cluster analysis comprises a group of techniques that use distance metrics to bunch data into categories.

There are two broad approaches to cluster analysis:

  1. Agglomerative or Hierarchical or Bottom-up: In this case we begin with all entities in the analysis being given their own cluster, so that we start with \(n\) clusters. Then, entities are grouped into clusters based on a given distance metric between each pair of entities. In this way a hierarchy of clusters is built up and the researcher can choose which grouping is preferred.

  2. Partitioning or Top-down: In this approach, the entire set of \(n\) entities is assumed to be a cluster. Then it is progressively partitioned into smaller and smaller clusters.

We will employ both clustering approaches and examine their properties with various data sets as examples.

12.2 k-MEANS

This approach begins bottom-up. If we have a sample of \(n\) observations to be allocated to \(k\) clusters, we can initialize the clusters in many ways. One approach is to assume that each observation is a cluster unto itself. We proceed by taking each observation and allocating it to the nearest cluster using a distance metric; at the outset, this simply means allocating an observation to its nearest neighbor.

How is nearness measured? We need a distance metric, and a common one is Euclidean distance. Suppose we have two observations \(x_i\) and \(x_j\), each represented by a vector of attributes. Suppose our observations are people, with attributes {height, weight, IQ}, so that \(x_i = \{h_i, w_i, I_i\}\) for the \(i\)-th individual. Then the Euclidean distance between two individuals \(i\) and \(j\) is

\[ d_{ij} = \sqrt{(h_i-h_j)^2 + (w_i-w_j)^2 + (I_i - I_j)^2} \]

It is usually computed using normalized variables, so that no single variable of large size dominates the distance calculation. (Normalization is the process of subtracting the mean from each observation and then dividing by the standard deviation.)

In contrast, the “Manhattan” distance is given by (when is this more appropriate?)

\[ d_{ij} = |h_i-h_j| + |w_i-w_j| + |I_i - I_j| \]
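As a quick illustration, here is a minimal sketch in R with made-up attribute values for two hypothetical individuals; the final comment notes the normalization step discussed above.

xi = c(h=180, w=80, I=110)      #hypothetical individual i: height, weight, IQ
xj = c(h=165, w=60, I=125)      #hypothetical individual j
print(sqrt(sum((xi - xj)^2)))   #Euclidean distance
print(sum(abs(xi - xj)))        #Manhattan distance
#In practice each attribute would first be normalized, e.g., with scale(),
#which subtracts the column mean and divides by the column standard deviation.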

We may use other metrics, such as the cosine distance or the Mahalanobis distance. The \(n \times n\) matrix of all pairwise values \(d_{ij}\) is called the distance matrix. Using this distance metric we assign nodes to clusters or attach them to nearest neighbors. After a few iterations, clusters are no longer made up of singleton observations; once the number of clusters falls to \(k\), the preset number required, every node is assigned to one of these \(k\) clusters. As we examine each observation we then assign it (or re-assign it) to the nearest cluster, where the distance is measured from the observation to some representative node of the cluster. Some common choices of the representative node in a cluster are:

  1. Centroid of the cluster. This is the mean of the observations in the cluster for each attribute. The centroid of the two observations above is the average vector \(\{(h_i+h_j)/2, (w_i+w_j)/2, (I_i + I_j)/2\}\). This is often called the center of the cluster. If there are more nodes, the centroid is the coordinate-wise average over all nodes in the cluster.
  2. Closest member of the cluster.
  3. Furthest member of the cluster.

The algorithm converges when no re-assignments of observations to clusters occur. Note that \(k\)-means is a random algorithm, and may not return the same clusters every time it is run. Also, one needs to specify the number of clusters up front, and there may be no a priori way to ascertain the correct number; hence, trial and error and examination of the results is called for. Finally, the algorithm tends to produce balanced clusters, which may not always be appropriate.
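To make the mechanics concrete, here is a bare-bones sketch of the standard assign-and-update iteration on simulated two-dimensional data. This is only an illustration; the kmeans function used below does all of this (with better initialization) for us.

#Bare-bones k-means: assign to nearest centroid, update centroids, repeat
set.seed(42)
x = rbind(matrix(rnorm(40, mean=0), ncol=2), matrix(rnorm(40, mean=3), ncol=2))
k = 2
centers = x[sample(nrow(x), k), ]     #random initial centroids
assign_old = rep(0, nrow(x))
repeat {
  d = as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]   #distances to each centroid
  assign_new = apply(d, 1, which.min)                   #nearest centroid
  if (all(assign_new == assign_old)) break              #converged: no re-assignments
  assign_old = assign_new
  centers = t(sapply(1:k, function(j)
    colMeans(x[assign_new == j, , drop=FALSE])))        #recompute centroids
}
print(centers)
print(table(assign_new))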

In R, we may construct the distance matrix using the dist function. Using the NCAA data we are already familiar with, we have:

ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
print(names(ncaa))
##  [1] "No"   "NAME" "GMS"  "PTS"  "REB"  "AST"  "TO"   "A.T"  "STL"  "BLK" 
## [11] "PF"   "FG"   "FT"   "X3P"
d = dist(ncaa[,3:14], method="euclidean")
print(head(d))
## [1] 12.842301 10.354557  7.996641  9.588546 15.892854 20.036546

Examining this matrix will show that it contains \(n(n-1)/2\) elements, i.e., the number of pairs of nodes. Only the lower triangular matrix of \(d\) is populated.
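A quick sanity check on the object d computed above (a sketch; n is the number of teams):

n = nrow(ncaa)
print(n*(n-1)/2)               #number of pairs of teams
print(length(d))               #the dist object stores exactly these values
print(as.matrix(d)[1:4,1:4])   #expand to a full symmetric matrix if needed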

Clustering takes many observations with their characteristics and then allocates them into buckets or clusters based on their similarity. In finance, we may use cluster analysis to determine groups of similar firms.

Unlike regression analysis, cluster analysis uses only the right-hand side variables, and there is no dependent variable required. We group observations purely on their overall similarity across characteristics. Hence, it is closely linked to the notion of communities that we studied in network analysis, though that concept lives primarily in the domain of networks.

12.2.1 Example: Randomly generated data in k-means

Here we use the example from the kmeans function to see how the clusters appear. This function is standard issue, i.e., it comes with the stats package, which is included in the base R distribution and does not need to be separately installed. The data is randomly generated but has two bunches of items with different means, so we should be easily able to see two separate clusters. You will need the graphics package which is also in the base installation.

# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
## K-means clustering with 2 clusters of sizes 49, 51
## 
## Cluster means:
##             x          y
## 1  1.04959200 1.05894643
## 2 -0.01334206 0.02180248
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 11.059361  9.740516
##  (between_SS / total_SS =  72.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
#PLOTTING CLUSTERS
print(names(cl))
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=4)

#REDO ANALYSIS WITH 5 CLUSTERS
## random starts do help here with too many clusters
(cl <- kmeans(x, 5, nstart = 25))
## K-means clustering with 5 clusters of sizes 24, 16, 23, 27, 10
## 
## Cluster means:
##            x          y
## 1  0.1426836  0.3005998
## 2  1.3211293  0.8482919
## 3  0.7201982  0.9970443
## 4 -0.1520315 -0.2260174
## 5  1.3727382  1.5383686
## 
## Clustering vector:
##   [1] 1 1 1 4 4 1 4 1 4 4 4 4 1 1 4 1 1 1 4 4 4 1 1 4 4 1 1 4 4 1 4 4 4 4 4
##  [36] 4 1 1 4 1 4 4 4 1 4 1 1 1 1 4 1 2 5 5 3 3 3 3 2 3 5 3 3 3 3 3 3 2 2 2
##  [71] 5 2 2 3 2 3 5 2 2 3 5 2 5 5 3 3 3 3 3 3 3 2 3 3 2 2 5 2 2 5
## 
## Within cluster sum of squares by cluster:
## [1] 2.6542258 1.2278786 1.2401518 2.4590282 0.7752739
##  (between_SS / total_SS =  89.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)

12.2.2 Example: NCAA teams

We revisit our NCAA data set, and form clusters there.

ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
print(names(ncaa))
##  [1] "No"   "NAME" "GMS"  "PTS"  "REB"  "AST"  "TO"   "A.T"  "STL"  "BLK" 
## [11] "PF"   "FG"   "FT"   "X3P"
fit = kmeans(ncaa[,3:14],4)
print(fit$size)
## [1]  6 27 14 17
print(fit$centers)
##        GMS      PTS      REB       AST       TO       A.T      STL
## 1 1.000000 50.33333 28.83333 10.333333 12.50000 0.9000000 6.666667
## 2 1.777778 68.39259 33.17407 13.596296 12.83704 1.1107407 6.822222
## 3 3.357143 80.12857 34.15714 16.357143 13.70714 1.2357143 6.821429
## 4 1.529412 60.24118 38.76471  9.282353 16.45882 0.5817647 6.882353
##        BLK       PF        FG        FT       X3P
## 1 2.166667 19.33333 0.3835000 0.6565000 0.2696667
## 2 2.918519 18.68519 0.4256296 0.7071852 0.3263704
## 3 2.514286 18.48571 0.4837143 0.7042143 0.4035714
## 4 2.882353 18.51176 0.3838824 0.6683529 0.3091765
#Since there are more than two attributes of each observation in the data, 
#we picked two of them {AST, PTS} and plotted the clusters against those. 
idx = c(4,6)
plot(ncaa[,idx],col=fit$cluster)

12.3 Hierarchical Clustering

Hierarchical clustering may be carried out top-down (divisive) or bottom-up (agglomerative). At the top level there is just one cluster. A level below, this may be broken down into a few clusters, which are then further broken down into more sub-clusters a level below, and so on. Hierarchical clustering is computationally expensive, and the divisive approach in particular is exponentially expensive in \(n\), the number of entities being clustered; that algorithm is \({\cal O}(2^n)\).

The function for clustering is hclust and is included in the stats package in the base R distribution.

We begin by first computing the distance matrix. Then we call the hclust function, and the plot function applied to the object fit gives what is known as a dendrogram plot, showing the cluster hierarchy. We may pick clusters at any level. In this case, we choose a cut level such that we get three clusters, and the rect.hclust function allows us to superimpose boxes on the clusters so we can see the grouping more clearly. The result is plotted in the Figure below.

d = dist(ncaa[,3:14], method="euclidean")
fit = hclust(d, method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
names(fit)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"
plot(fit,main="NCAA Teams")
groups = cutree(fit, k=3)
rect.hclust(fit, k=3, border="blue")

We can also visualize the clusters loaded on to the top two principal components as follows, using the clusplot function that resides in package cluster. The result is plotted in the Figure below.

print(groups)
##  [1] 1 1 1 1 1 2 1 1 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 2 1 2 2
## [36] 2 2 3 1 2 3 3 3 2 2 2 3 2 1 2 2 3 1 2 3 2 2 2 2 3 3 3 3 2
library(cluster)
clusplot(ncaa[,3:14],groups,color=TRUE,shade=TRUE,labels=2,lines=0)

#Using the correlation matrix as a proxy for distance
x = t(as.matrix(ncaa[,3:14]))
d = as.dist((1-cor(x))/2)
fit = hclust(d, method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(fit,main="NCAA Teams")
groups = cutree(fit, k=3)
rect.hclust(fit, k=3, border="red")

print(groups)
##  [1] 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 2 1 1 2 1 1 1 2 2 1 2 1 2 2 2 1 1 1
## [36] 2 3 3 1 1 3 3 3 3 2 1 3 2 1 2 2 2 1 1 2 2 2 2 3 2 1 3 1 2
library(cluster)
clusplot(ncaa[,3:14],groups,color=TRUE,shade=TRUE,labels=2,lines=0)

12.4 k Nearest Neighbors

This is one of the simplest algorithms for classification and grouping. Simply define a distance metric over a set of observations, each with \(M\) characteristics, i.e., \(x_1, x_2,..., x_M\). Compute the pairwise distance between each pair of observations, using any of the metrics above.

Next, fix \(k\), the number of nearest neighbors to be considered. Finally, assign to the case we are trying to classify the category that holds the majority among its \(k\) nearest neighbors.

We see an example in R using the iris data set that we examined before.
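Before calling the packaged routine, here is a minimal hand-rolled sketch of the vote for a single hypothetical test point (with made-up measurements); the knn function in the class package, used next, does this for every test observation at once.

data(iris)
x_new = c(6.0, 3.0, 4.5, 1.5)                   #hypothetical new flower
d = sqrt(colSums((t(iris[,1:4]) - x_new)^2))    #distance to every observation
nearest = order(d)[1:3]                         #k = 3 nearest neighbors
votes = table(iris$Species[nearest])
print(votes)
print(names(which.max(votes)))                  #majority class wins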

library(class)
data(iris)
print(head(iris))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#SAMPLE A SUBSET
idx = seq(1,length(iris[,1]))
train_idx = sample(idx,100)
test_idx = setdiff(idx,train_idx)
x_train = iris[train_idx,1:4]
x_test = iris[test_idx,1:4]
y_train = iris[train_idx,5]
y_test = iris[test_idx,5]

#RUN knn
res = knn(x_train, x_test, y_train, k = 3, prob = FALSE, use.all = TRUE)
res
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  versicolor versicolor versicolor versicolor virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica  virginica  virginica  virginica 
## [49] virginica  virginica 
## Levels: setosa versicolor virginica
table(res,y_test)
##             y_test
## res          setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         19         0
##   virginica       0          1        15

12.5 Prediction Trees

Prediction trees are a natural outcome of recursive partitioning of the data. Hence, they are a particular form of clustering at different levels. Usual cluster analysis results in a flat partition, but prediction trees build a multi-level hierarchy of partitions. The term used here is CART, which stands for classification and regression trees. But prediction trees differ from vanilla clustering in an important way: there is a dependent variable, i.e., a category or a range of values (e.g., a score) that one is attempting to predict.

Prediction trees are of two types: (a) classification trees, where the leaves of the tree are different categories of discrete outcomes; and (b) regression trees, where the leaves are continuous outcomes. We may think of the former as a generalized form of limited dependent variable models, and the latter as a generalized form of regression analysis.

To set ideas, suppose we want to predict the credit score of an individual using age, income, and education as explanatory variables. Assume that income is the best explanatory variable of the three. Then, at the top of the tree, there will be income as the branching variable, i.e., if income is less than some threshold, then we go down the left branch of the tree, else we go down the right. At the next level, it may be that we use education to make the next bifurcation, and then at the third level we use age. A variable may even be repeatedly used at more than one level. This leads us to several leaves at the bottom of the tree that contain the average values of the credit scores that may be reached. For example if we get an individual of young age, low income, and no education, it is very likely that this path down the tree will lead to a low credit score on average. Instead of credit score (an example of a regression tree), consider credit ratings of companies (an example of a classification tree). These ideas will become clearer once we present some examples.

12.5.1 Fitting the tree

Recursive partitioning is the main algorithmic construct behind prediction trees. We take the data and, using a single explanatory variable, try to bifurcate it into two categories such that the split yields an information gain over the unsplit data. For example, suppose we are trying to predict who will make donations and who will not using a single variable, income. If we have a sample of people and have not yet analyzed their incomes, we only have the raw frequency \(p\) of how many people made donations, i.e., a number between 0 and 1. The information content of the predicted likelihood \(p\) is inversely related to the sum of squared errors (SSE) between this value \(p\) and the 0 and 1 values of the observations.

\[ SSE_1 = \sum_{i=1}^n (x_i - p)^2 \]

where \(x_i = \{0,1\}\), depending on whether person \(i\) made a donation or not. Now, if we bifurcate the sample based on income, say to the left we have people with income less than \(K\), and to the right, people with incomes greater than or equal to \(K\). If we find that the proportion of people on the left making donations is \(p_L < p\) and on the right is \(p_R > p\), our new information is:

\[ SSE_2 = \sum_{i, Income < K} (x_i - p_L)^2 + \sum_{i, Income \geq K} (x_i - p_R)^2 \]

By choosing \(K\) correctly, our recursive partitioning algorithm will maximize the gain, i.e., \(\delta = (SSE_1 - SSE_2)\). We stop branching further when at a given tree level \(\delta\) is less than a pre-specified threshold.
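A small simulation makes the split search concrete. The data below are hypothetical (simulated incomes and donation indicators); we scan candidate thresholds \(K\) and pick the one with the largest gain \(\delta\).

set.seed(1)
income = runif(200, 10, 100)                          #hypothetical incomes
donate = as.integer(runif(200) < pmin(income/120,1))  #richer people donate more often
p = mean(donate)
SSE1 = sum((donate - p)^2)
Kvals = sort(unique(income))[-1]                      #candidate thresholds
gain = sapply(Kvals, function(K) {
  L = donate[income < K]; R = donate[income >= K]
  SSE2 = sum((L - mean(L))^2) + sum((R - mean(R))^2)
  SSE1 - SSE2
})
print(Kvals[which.max(gain)])   #best single split on income
print(max(gain))                #the gain delta at that split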

We note that searching for the best binary split can be expensive: for a categorical variable, the number of possible binary groupings of its levels grows exponentially in the number of levels, while for a numeric variable the number of candidate thresholds grows with the number of distinct values. But as we go down the tree and use smaller subsamples, the algorithm becomes faster and faster. In general, this is quite an efficient algorithm to implement.

The motivation for prediction trees is to emulate a decision tree. They also help make sense of complicated regression settings with many interactions among variables, where it becomes difficult to interpret the meaning and importance of explanatory variables in a prediction scenario. By proceeding in a hierarchical manner down a tree, the decision analysis becomes transparent, and can also be used in practical settings to make decisions.

12.6 Classification Trees

To demonstrate this, let's use the kyphosis data set that ships with the rpart package; it contains data on children who have had spinal surgery. The model we wish to fit predicts whether a child has a post-operative deformity or not (variable: Kyphosis = {absent, present}). The explanatory variables are Age in months, the number of vertebrae operated on (Number), and the beginning of the range of vertebrae operated on (Start). The package used is called rpart, which stands for recursive partitioning.

library(rpart)
data(kyphosis)
head(kyphosis)
##   Kyphosis Age Number Start
## 1   absent  71      3     5
## 2   absent 158      3    14
## 3  present 128      4     5
## 4   absent   2      5     1
## 5   absent   1      4    15
## 6   absent   1      2    16
fit = rpart(Kyphosis~Age+Number+Start, method="class", data=kyphosis)
printcp(fit)
## 
## Classification tree:
## rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, 
##     method = "class")
## 
## Variables actually used in tree construction:
## [1] Age   Start
## 
## Root node error: 17/81 = 0.20988
## 
## n= 81 
## 
##         CP nsplit rel error xerror    xstd
## 1 0.176471      0   1.00000 1.0000 0.21559
## 2 0.019608      1   0.82353 1.2353 0.23200
## 3 0.010000      4   0.76471 1.2353 0.23200
summary(kyphosis)
##     Kyphosis       Age             Number           Start      
##  absent :64   Min.   :  1.00   Min.   : 2.000   Min.   : 1.00  
##  present:17   1st Qu.: 26.00   1st Qu.: 3.000   1st Qu.: 9.00  
##               Median : 87.00   Median : 4.000   Median :13.00  
##               Mean   : 83.65   Mean   : 4.049   Mean   :11.49  
##               3rd Qu.:130.00   3rd Qu.: 5.000   3rd Qu.:16.00  
##               Max.   :206.00   Max.   :10.000   Max.   :18.00
summary(fit)
## Call:
## rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, 
##     method = "class")
##   n= 81 
## 
##           CP nsplit rel error   xerror      xstd
## 1 0.17647059      0 1.0000000 1.000000 0.2155872
## 2 0.01960784      1 0.8235294 1.235294 0.2320031
## 3 0.01000000      4 0.7647059 1.235294 0.2320031
## 
## Variable importance
##  Start    Age Number 
##     64     24     12 
## 
## Node number 1: 81 observations,    complexity param=0.1764706
##   predicted class=absent   expected loss=0.2098765  P(node) =1
##     class counts:    64    17
##    probabilities: 0.790 0.210 
##   left son=2 (62 obs) right son=3 (19 obs)
##   Primary splits:
##       Start  < 8.5  to the right, improve=6.762330, (0 missing)
##       Number < 5.5  to the left,  improve=2.866795, (0 missing)
##       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
##   Surrogate splits:
##       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
## 
## Node number 2: 62 observations,    complexity param=0.01960784
##   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
##     class counts:    56     6
##    probabilities: 0.903 0.097 
##   left son=4 (29 obs) right son=5 (33 obs)
##   Primary splits:
##       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
##       Age    < 55   to the left,  improve=0.6848635, (0 missing)
##       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
##   Surrogate splits:
##       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
##       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
## 
## Node number 3: 19 observations
##   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
##     class counts:     8    11
##    probabilities: 0.421 0.579 
## 
## Node number 4: 29 observations
##   predicted class=absent   expected loss=0  P(node) =0.3580247
##     class counts:    29     0
##    probabilities: 1.000 0.000 
## 
## Node number 5: 33 observations,    complexity param=0.01960784
##   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
##     class counts:    27     6
##    probabilities: 0.818 0.182 
##   left son=10 (12 obs) right son=11 (21 obs)
##   Primary splits:
##       Age    < 55   to the left,  improve=1.2467530, (0 missing)
##       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
##       Number < 3.5  to the right, improve=0.1753247, (0 missing)
##   Surrogate splits:
##       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
##       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
## 
## Node number 10: 12 observations
##   predicted class=absent   expected loss=0  P(node) =0.1481481
##     class counts:    12     0
##    probabilities: 1.000 0.000 
## 
## Node number 11: 21 observations,    complexity param=0.01960784
##   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
##     class counts:    15     6
##    probabilities: 0.714 0.286 
##   left son=22 (14 obs) right son=23 (7 obs)
##   Primary splits:
##       Age    < 111  to the right, improve=1.71428600, (0 missing)
##       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
##       Number < 3.5  to the right, improve=0.07142857, (0 missing)
## 
## Node number 22: 14 observations
##   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
##     class counts:    12     2
##    probabilities: 0.857 0.143 
## 
## Node number 23: 7 observations
##   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
##     class counts:     3     4
##    probabilities: 0.429 0.571

We can plot the tree as well using the plot command. The dendrogram-like tree shows the allocation of the \(n=81\) cases to the various branches of the tree.

plot(fit, uniform=TRUE)
text(fit, use.n=TRUE, all=TRUE, cex=0.8)

12.7 C4.5 Classifier

This classifier also follows recursive partitioning as in the previous case, but instead of minimizing the sum of squared errors between the sample data \(x\) and the fitted value \(p\) at each level, here the goal is to minimize entropy, which maximizes the information gain. The entropy \(H\) of the data \(x\) is defined as

\[ H = -\sum_x\; f(x) \cdot \ln f(x) \]

where \(f(x)\) is the probability density of \(x\). This is intuitive because after the optimal split in recursing down the tree, the distribution of \(x\) becomes narrower, lowering entropy. When \(x\) is continuous with a density, this measure is known as "differential entropy."

To see this let’s do a quick example. We compute entropy for two distributions of varying spread (standard deviation).

dx = 0.001
x = seq(-5,5,dx)
H2 = -sum(dnorm(x,sd=2)*log(dnorm(x,sd=2))*dx)
print(H2)
## [1] 2.042076
H3 = -sum(dnorm(x,sd=3)*log(dnorm(x,sd=3))*dx)
print(H3)
## [1] 2.111239

Therefore, we see that entropy increases as the normal distribution becomes wider.
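As a cross-check, the differential entropy of a normal distribution has the closed form \(\frac{1}{2}\ln(2\pi e \sigma^2)\); the grid sums above fall slightly short of this because the range \([-5,5]\) truncates the tails, especially for sd = 3.

H_exact = function(sigma) 0.5*log(2*pi*exp(1)*sigma^2)
print(H_exact(2))   #approximately 2.112
print(H_exact(3))   #approximately 2.518

The C4.5 classifier is available in R as the J48 function in the RWeka package; we apply it to the iris data below.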

library(RWeka)
data(iris)
print(head(iris))
res = J48(Species~.,data=iris)
print(res)
summary(res)

12.8 Regression Trees

We move from classification trees (discrete outcomes) to regression trees (scored or continuous outcomes). Again, we use an example that ships with R: the car data in the cu.summary data frame, supplied with the rpart package. Let's load it up.

data(cu.summary)
print(names(cu.summary))
## [1] "Price"       "Country"     "Reliability" "Mileage"     "Type"
print(head(cu.summary))
##                 Price Country Reliability Mileage  Type
## Acura Integra 4 11950   Japan Much better      NA Small
## Dodge Colt 4     6851   Japan        <NA>      NA Small
## Dodge Omni 4     6995     USA  Much worse      NA Small
## Eagle Summit 4   8895     USA      better      33 Small
## Ford Escort   4  7402     USA       worse      33 Small
## Ford Festiva 4   6319   Korea      better      37 Small
print(tail(cu.summary))
##                      Price Country Reliability Mileage Type
## Ford Aerostar V6     12267     USA     average      18  Van
## Mazda MPV V6         14944   Japan Much better      19  Van
## Mitsubishi Wagon 4   14929   Japan        <NA>      20  Van
## Nissan Axxess 4      13949   Japan        <NA>      20  Van
## Nissan Van 4         14799   Japan        <NA>      19  Van
## Volkswagen Vanagon 4 14080 Germany        <NA>      NA  Van
print(dim(cu.summary))
## [1] 117   5
print(unique(cu.summary$Type))
## [1] Small   Sporty  Compact Medium  Large   Van    
## Levels: Compact Large Medium Small Sporty Van
print(unique(cu.summary$Country))
##  [1] Japan     USA       Korea     Japan/USA Mexico    Brazil    Germany  
##  [8] France    Sweden    England  
## 10 Levels: Brazil England France Germany Japan Japan/USA Korea ... USA

We will try to predict Mileage using the other variables. (Note: if we tried to predict Reliability, we would be back in the realm of classification trees; here we are looking at regression trees.)

library(rpart)
fit <- rpart(Mileage~Price + Country + Reliability + Type, method="anova", data=cu.summary)
print(summary(fit))
## Call:
## rpart(formula = Mileage ~ Price + Country + Reliability + Type, 
##     data = cu.summary, method = "anova")
##   n=60 (57 observations deleted due to missingness)
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.62288527      0 1.0000000 1.0278364 0.17665513
## 2 0.13206061      1 0.3771147 0.5199982 0.10233496
## 3 0.02544094      2 0.2450541 0.4095695 0.08549195
## 4 0.01160389      3 0.2196132 0.4195450 0.09312124
## 5 0.01000000      4 0.2080093 0.4171213 0.08786038
## 
## Variable importance
##   Price    Type Country 
##      48      42      10 
## 
## Node number 1: 60 observations,    complexity param=0.6228853
##   mean=24.58333, MSE=22.57639 
##   left son=2 (48 obs) right son=3 (12 obs)
##   Primary splits:
##       Price       < 9446.5  to the right, improve=0.6228853, (0 missing)
##       Type        splits as  LLLRLL,      improve=0.5044405, (0 missing)
##       Reliability splits as  LLLRR,       improve=0.1263005, (11 missing)
##       Country     splits as  --LRLRRRLL,  improve=0.1243525, (0 missing)
##   Surrogate splits:
##       Type    splits as  LLLRLL,     agree=0.950, adj=0.750, (0 split)
##       Country splits as  --LLLLRRLL, agree=0.833, adj=0.167, (0 split)
## 
## Node number 2: 48 observations,    complexity param=0.1320606
##   mean=22.70833, MSE=8.498264 
##   left son=4 (23 obs) right son=5 (25 obs)
##   Primary splits:
##       Type        splits as  RLLRRL,      improve=0.43853830, (0 missing)
##       Price       < 12154.5 to the right, improve=0.25748500, (0 missing)
##       Country     splits as  --RRLRL-LL,  improve=0.13345700, (0 missing)
##       Reliability splits as  LLLRR,       improve=0.01637086, (10 missing)
##   Surrogate splits:
##       Price   < 12215.5 to the right, agree=0.812, adj=0.609, (0 split)
##       Country splits as  --RRLRL-RL,  agree=0.646, adj=0.261, (0 split)
## 
## Node number 3: 12 observations
##   mean=32.08333, MSE=8.576389 
## 
## Node number 4: 23 observations,    complexity param=0.02544094
##   mean=20.69565, MSE=2.907372 
##   left son=8 (10 obs) right son=9 (13 obs)
##   Primary splits:
##       Type    splits as  -LR--L,      improve=0.515359600, (0 missing)
##       Price   < 14962   to the left,  improve=0.131259400, (0 missing)
##       Country splits as  ----L-R--R,  improve=0.007022107, (0 missing)
##   Surrogate splits:
##       Price < 13572   to the right, agree=0.609, adj=0.1, (0 split)
## 
## Node number 5: 25 observations,    complexity param=0.01160389
##   mean=24.56, MSE=6.4864 
##   left son=10 (14 obs) right son=11 (11 obs)
##   Primary splits:
##       Price       < 11484.5 to the right, improve=0.09693168, (0 missing)
##       Reliability splits as  LLRRR,       improve=0.07767167, (4 missing)
##       Type        splits as  L--RR-,      improve=0.04209834, (0 missing)
##       Country     splits as  --LRRR--LL,  improve=0.02201687, (0 missing)
##   Surrogate splits:
##       Country splits as  --LLLL--LR, agree=0.80, adj=0.545, (0 split)
##       Type    splits as  L--RL-,     agree=0.64, adj=0.182, (0 split)
## 
## Node number 8: 10 observations
##   mean=19.3, MSE=2.21 
## 
## Node number 9: 13 observations
##   mean=21.76923, MSE=0.7928994 
## 
## Node number 10: 14 observations
##   mean=23.85714, MSE=7.693878 
## 
## Node number 11: 11 observations
##   mean=25.45455, MSE=3.520661 
## 
## n=60 (57 observations deleted due to missingness)
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 60 1354.58300 24.58333  
##    2) Price>=9446.5 48  407.91670 22.70833  
##      4) Type=Large,Medium,Van 23   66.86957 20.69565  
##        8) Type=Large,Van 10   22.10000 19.30000 *
##        9) Type=Medium 13   10.30769 21.76923 *
##      5) Type=Compact,Small,Sporty 25  162.16000 24.56000  
##       10) Price>=11484.5 14  107.71430 23.85714 *
##       11) Price< 11484.5 11   38.72727 25.45455 *
##    3) Price< 9446.5 12  102.91670 32.08333 *
plot(fit, uniform=TRUE)
text(fit, use.n=TRUE, all=TRUE, cex=.8)

12.8.1 Example: California Home Data

This example is taken from a data set posted by Cosma Shalizi at CMU. We use a different package here, called tree, though most of its functionality has since been subsumed by the rpart package used earlier. The analysis is as follows:

library(tree)
cahomes = read.table("DSTMAA_data/cahomedata.txt",header=TRUE)
print(dim(cahomes))
## [1] 20640     9
head(cahomes)
##   MedianHouseValue MedianIncome MedianHouseAge TotalRooms TotalBedrooms
## 1           452600       8.3252             41        880           129
## 2           358500       8.3014             21       7099          1106
## 3           352100       7.2574             52       1467           190
## 4           341300       5.6431             52       1274           235
## 5           342200       3.8462             52       1627           280
## 6           269700       4.0368             52        919           213
##   Population Households Latitude Longitude
## 1        322        126    37.88   -122.23
## 2       2401       1138    37.86   -122.22
## 3        496        177    37.85   -122.24
## 4        558        219    37.85   -122.25
## 5        565        259    37.85   -122.25
## 6        413        193    37.85   -122.25
summary(cahomes)
##  MedianHouseValue  MedianIncome     MedianHouseAge    TotalRooms   
##  Min.   : 14999   Min.   : 0.4999   Min.   : 1.00   Min.   :    2  
##  1st Qu.:119600   1st Qu.: 2.5634   1st Qu.:18.00   1st Qu.: 1448  
##  Median :179700   Median : 3.5348   Median :29.00   Median : 2127  
##  Mean   :206856   Mean   : 3.8707   Mean   :28.64   Mean   : 2636  
##  3rd Qu.:264725   3rd Qu.: 4.7432   3rd Qu.:37.00   3rd Qu.: 3148  
##  Max.   :500001   Max.   :15.0001   Max.   :52.00   Max.   :39320  
##  TotalBedrooms      Population      Households        Latitude    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   :32.54  
##  1st Qu.: 295.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.:33.93  
##  Median : 435.0   Median : 1166   Median : 409.0   Median :34.26  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   :35.63  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.:37.71  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :41.95  
##    Longitude     
##  Min.   :-124.3  
##  1st Qu.:-121.8  
##  Median :-118.5  
##  Mean   :-119.6  
##  3rd Qu.:-118.0  
##  Max.   :-114.3
mhv = as.matrix(as.numeric(cahomes$MedianHouseValue))
logmhv = log(mhv)
lat = as.matrix(as.numeric(cahomes$Latitude))
lon = as.matrix(as.numeric(cahomes$Longitude))
summary(lm(mhv~lat+lon))
## 
## Call:
## lm(formula = mhv ~ lat + lon)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -316022  -67502  -22903   46042  483381 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5829397.0    82092.2  -71.01   <2e-16 ***
## lat           -69551.0      859.6  -80.91   <2e-16 ***
## lon           -71209.4      916.4  -77.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100400 on 20637 degrees of freedom
## Multiple R-squared:  0.2424, Adjusted R-squared:  0.2423 
## F-statistic:  3302 on 2 and 20637 DF,  p-value: < 2.2e-16
fit = tree(logmhv~lon+lat)
plot(fit)
text(fit,cex=0.8)

price.deciles = quantile(mhv,0:10/10)

cut.prices = cut(mhv,price.deciles,include.lowest=TRUE)
plot(lon, lat, col=grey(10:2/11)[cut.prices],pch=20,xlab="Longitude",ylab="Latitude")
partition.tree(fit,ordvars=c("lon","lat"),add=TRUE,cex=0.8)

12.9 Random Forests

A random forest model is an extension of the CART class of models. In CART, at each decision node all variables in the feature set are considered, and the best one is chosen for the bifurcation rule at that node. This approach tends to overfit the model to the training data.

To ameliorate overfitting, Breiman (2001) suggested generating classification and regression trees using a random subset of the feature set at each node. One at a time, a random tree is grown. By building a large set of random trees (the default number in R is 500), we get a "random forest" of decision trees, and when the algorithm is run, each tree in the forest classifies the input. The output classification is determined by taking the modal classification across all trees.

The number of variables sampled at each split defaults to \(p/3\), rounded down, for a regression tree, and \(\sqrt{p}\), rounded down, for a classification tree, where \(p\) is the size of the feature set.

Reference: Breiman (2001)
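As a small check on these defaults, for the \(p = 11\) NCAA performance variables used below both formulas work out to 3, which matches the "No. of variables tried at each split" line in the randomForest output that follows.

p = 11                 #number of NCAA performance variables used below
print(floor(p/3))      #default mtry for a regression forest
print(floor(sqrt(p)))  #default mtry for a classification forest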

For the NCAA data, take the top 32 teams and make their dependent variable 1, and that of the bottom 32 teams zero.

ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)

y1 = 1:32
y1 = y1*0+1
y2 = y1*0
y = c(y1,y2)
print(y)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x = as.matrix(ncaa[4:14])
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
yf = factor(y)
res = randomForest(x,yf)
print(res)
## 
## Call:
##  randomForest(x = x, y = yf) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 28.12%
## Confusion matrix:
##    0  1 class.error
## 0 24  8      0.2500
## 1 10 22      0.3125
print(importance(res))
##     MeanDecreaseGini
## PTS         4.625922
## REB         1.605147
## AST         1.999855
## TO          3.882536
## A.T         3.880554
## STL         2.026178
## BLK         1.951694
## PF          1.756469
## FG          4.159391
## FT          3.258861
## X3P         2.354894
res = randomForest(x,yf,mtry=3)
print(res)
## 
## Call:
##  randomForest(x = x, y = yf, mtry = 3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 31.25%
## Confusion matrix:
##    0  1 class.error
## 0 23  9     0.28125
## 1 11 21     0.34375
print(importance(res))
##     MeanDecreaseGini
## PTS         4.576616
## REB         1.379877
## AST         2.158874
## TO          3.847833
## A.T         3.674293
## STL         1.983024
## BLK         2.089959
## PF          1.621722
## FG          4.408469
## FT          3.562817
## X3P         2.191143
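As a follow-on sketch using the res and y objects above, out-of-bag (OOB) class predictions can be extracted with predict (no newdata supplied), and variable importance can be visualized with varImpPlot.

yhat_oob = predict(res)      #OOB class predictions
print(table(y, yhat_oob))    #should mirror the confusion matrix above
varImpPlot(res, main="NCAA: Variable Importance")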

12.10 Top Ten Algorithms in Data Science

The top voted algorithms in machine learning are: C4.5, k-means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, CART. (This is just from one source, and differences of opinion will remain.)

12.11 Boosting

Boosting is an immensely popular machine learning technique. It is an iterative approach that takes weak learning algorithms and "boosts" them into strong learners. The method is intuitive. Start out with a classification algorithm such as logit for binary classification and run one pass to fit the model. Check which cases are correctly predicted in-sample, and which are incorrect. In the next classification pass (also known as a round), reweight the misclassified observations versus the correctly classified ones, by overweighting the former and underweighting the latter. This forces the learner to "focus" more on the tougher cases, and adjust so that it gets these classified more accurately. Through multiple rounds, the results are boosted to higher levels of accuracy. Because there are many different weighting schemes, data scientists have evolved many different boosting algorithms. AdaBoost is one such popular algorithm; see, for example, Schapire and Singer (1999).
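To make the reweighting idea concrete, here is a minimal sketch on simulated data, using one-split rpart "stumps" as the weak learner. The weight-update and vote formulas are the AdaBoost ones, but this is an illustration on made-up data, not a full implementation.

library(rpart)
set.seed(1)
n = 200
df = data.frame(x1=rnorm(n), x2=rnorm(n))
df$y = factor(as.integer(df$x1 + df$x2 + rnorm(n, sd=0.75) > 0))  #simulated labels
w = rep(1/n, n)                                #equal observation weights to start
for (m in 1:5) {
  fit = rpart(y~x1+x2, data=df, weights=w, method="class",
              control=rpart.control(maxdepth=1, cp=0, minsplit=2))
  yhat = predict(fit, df, type="class")
  miss = (yhat != df$y)
  err = sum(w*miss)/sum(w)                     #weighted error this round
  alpha = 0.5*log((1-err)/max(err, 1e-10))     #vote weight for this learner
  w = w*exp(alpha*miss)                        #upweight the misclassified cases
  w = w/sum(w)
  cat("round", m, ": weighted error =", round(err,4), "\n")
}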

In recent times, these boosting algorithms have improved in their computer implementation, mostly through parallelization to speed them up when using huge data sets. Such versions are known as "extreme gradient" boosting algorithms. In R, the package xgboost contains an easy-to-use implementation. We illustrate with an example.

We use the sample data that comes with the xgboost package. Read in the data for the model.

library(xgboost)
## Warning: package 'xgboost' was built under R version 3.3.2
data("agaricus.train")
print(names(agaricus.train))
## [1] "data"  "label"
print(dim(agaricus.train$data))
## [1] 6513  126
print(length(agaricus.train$label))
## [1] 6513
data("agaricus.test")
print(names(agaricus.test))
## [1] "data"  "label"
print(dim(agaricus.test$data))
## [1] 1611  126
print(length(agaricus.test$label))
## [1] 1611

Fit the model. All that is needed is a single-line call to the xgboost function.

res = xgboost(data=agaricus.train$data, label=agaricus.train$label, objective = "binary:logistic", nrounds=5)
## [1]  train-error:0.000614 
## [2]  train-error:0.001228 
## [3]  train-error:0.000614 
## [4]  train-error:0.000614 
## [5]  train-error:0.000000
print(names(res))
## [1] "handle"         "raw"            "niter"          "evaluation_log"
## [5] "call"           "params"         "callbacks"

Undertake prediction using the predict function and then examine the confusion matrix for performance.

#In sample
yhat = predict(res,agaricus.train$data)
print(head(yhat,50))
##  [1] 0.8973738 0.1030030 0.1030030 0.8973738 0.1018238 0.1030030 0.1030030
##  [8] 0.8973738 0.1030030 0.1030030 0.1030030 0.1030030 0.1018238 0.1058771
## [15] 0.1018238 0.8973738 0.8973738 0.8973738 0.1030030 0.8973738 0.1030030
## [22] 0.1030030 0.8973738 0.1030030 0.1030030 0.8973738 0.1030030 0.1057071
## [29] 0.1030030 0.1144627 0.1058771 0.1139800 0.1030030 0.1057071 0.1058771
## [36] 0.1030030 0.1030030 0.1030030 0.1030030 0.1057071 0.1057071 0.1030030
## [43] 0.1030030 0.8973738 0.1030030 0.1030030 0.1057071 0.1058771 0.1030030
## [50] 0.1030030
cm = table(agaricus.train$label,as.integer(round(yhat)))
print(cm)
##    
##        0    1
##   0 3373    0
##   1    0 3140
print(chisq.test(cm))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cm
## X-squared = 6509, df = 1, p-value < 2.2e-16
#Out of sample
yhat = predict(res,agaricus.test$data)
print(head(yhat,50))
##  [1] 0.1030030 0.8973738 0.1030030 0.1030030 0.1058771 0.1139800 0.8973738
##  [8] 0.1030030 0.8973738 0.1057071 0.8973738 0.1030030 0.1018238 0.1030030
## [15] 0.1018238 0.1057071 0.1030030 0.8973738 0.1058771 0.1030030 0.1030030
## [22] 0.1057071 0.1030030 0.1030030 0.1057071 0.8973738 0.1139800 0.1030030
## [29] 0.1030030 0.1018238 0.1030030 0.1030030 0.1057071 0.1058771 0.1030030
## [36] 0.1030030 0.1139800 0.8973738 0.1030030 0.1030030 0.1058771 0.1030030
## [43] 0.1030030 0.1030030 0.1030030 0.1144627 0.1057071 0.1144627 0.1058771
## [50] 0.1030030
cm = table(agaricus.test$label,as.integer(round(yhat)))
print(cm)
##    
##       0   1
##   0 835   0
##   1   0 776
print(chisq.test(cm))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cm
## X-squared = 1607, df = 1, p-value < 2.2e-16

There are many objective functions that may be used with boosting; see the documentation of the xgboost function in R. Here are some of the options (a short multiclass sketch follows the list).

  • reg:linear, linear regression (Default).
  • reg:logistic, logistic regression.
  • binary:logistic, logistic regression for binary classification. Output probability.
  • binary:logitraw, logistic regression for binary classification, output score before logistic transformation.
  • multi:softmax, set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to num_class - 1.
  • multi:softprob, same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
  • rank:pairwise set xgboost to do ranking task by minimizing the pairwise loss.
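As an illustration of the multiclass objectives, here is a sketch on the iris data; note that labels for multi:softmax must be coded 0 to num_class - 1.

data(iris)
X = as.matrix(iris[,1:4])
y_iris = as.integer(iris$Species) - 1     #classes coded 0, 1, 2
res_multi = xgboost(data=X, label=y_iris, objective="multi:softmax",
                    num_class=3, nrounds=5, verbose=0)
print(table(iris$Species, predict(res_multi, X)))   #in-sample confusion matrix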

Let’s repeat the exercise using the NCAA data.

ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
y = as.matrix(c(rep(1,32),rep(0,32)))
x = as.matrix(ncaa[4:14])
res = xgboost(data=x,label=y,objective = "binary:logistic", nrounds=10)
## [1]  train-error:0.109375 
## [2]  train-error:0.062500 
## [3]  train-error:0.031250 
## [4]  train-error:0.046875 
## [5]  train-error:0.046875 
## [6]  train-error:0.031250 
## [7]  train-error:0.015625 
## [8]  train-error:0.015625 
## [9]  train-error:0.015625 
## [10] train-error:0.000000
yhat = predict(res,x)
print(yhat)
##  [1] 0.93651539 0.91299230 0.94973743 0.92731959 0.88483542 0.78989410
##  [7] 0.87560666 0.90532523 0.86085796 0.83430755 0.91133112 0.77964365
## [13] 0.65978771 0.91299230 0.93371087 0.91403663 0.78532064 0.80347157
## [19] 0.60545647 0.79564470 0.84763408 0.86694145 0.79334742 0.91165835
## [25] 0.80980736 0.76779360 0.90779346 0.88314682 0.85020524 0.77409834
## [31] 0.85503411 0.92695338 0.49809304 0.15059802 0.13718443 0.30433667
## [37] 0.35902274 0.08057866 0.16935477 0.06189578 0.08516480 0.12777112
## [43] 0.06224639 0.18913418 0.07675765 0.33156753 0.06586388 0.13792981
## [49] 0.22327985 0.08479820 0.16396984 0.10236575 0.16346745 0.27498406
## [55] 0.10642117 0.07299758 0.15809764 0.15259050 0.07768227 0.15006000
## [61] 0.08349544 0.06932075 0.10376420 0.11887703
cm = table(y,as.integer(round(yhat)))
print(cm)
##    
## y    0  1
##   0 32  0
##   1  0 32
print(chisq.test(cm))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cm
## X-squared = 60.062, df = 1, p-value = 9.189e-15

References

Breiman, Leo. 2001. "Random Forests." Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324.

Schapire, Robert E., and Yoram Singer. 1999. "Improved Boosting Algorithms Using Confidence-Rated Predictions." Machine Learning 37 (3): 297–336.