Decision Trees

Sanjiv R. Das

Prediction Trees

Recursive Partitioning

$$ SSE_1 = \sum_{i=1}^n (x_i - p)^2 $$

C4.5 Classifier

Recursive partitioning as in the previous case, but instead of minimizing the sum of squared errors between the sample data $x$ and the true value $p$ at each level, here the goal is to minimize entropy. This improves the information gain. Natural entropy ($H$) of the data $x$ is defined as

$$ H = -\sum_x\; f(x) \cdot ln \;f(x) $$

where $f(x)$ is the probability density of $x$. This is intuitive because after the optimal split in recursing down the tree, the distribution of $x$ becomes narrower, lowering entropy. This measure is also often known as "differential entropy."

NCAA Dataset

Gini coefficient

The Gibi measures the quality of the split. It is defined as

$$ Gini = 1 - \sum_{c=1}^C P_c^2 $$

where $c$ indexes $C$ categories and $P_c$ is the proportion of the split in category $c$, or simply, the probability of splitting to $c$.

Here you have a binary split, so you need just two probabilities, left and right. For example look at the 3rd row in the tree, second box from left (blue color)

$$ Gini = 1 - (2/21)^2 - (19/21)^2 = 0.172 $$

The smaller the Gini the better the split. Notice at the top node the Gini is 0.5.

Credit Card Dataset