Decision Trees

Sanjiv R. Das

In [1]:

%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

Prediction Trees

Prediction trees are a natural outcome of recursive partitioning of the data.
The best-known approach is CART, which stands for classification and regression trees.
Prediction trees are of two types: (a) classification trees, where the leaves of the tree are different categories of discrete outcomes, and (b) regression trees, where the leaves are continuous outcomes.
We may think of the former as a generalized form of limited dependent variables, and the latter as a generalized form of regression analysis.

Recursive Partitioning

Bifurcate the data into two categories such that the split yields more information about the outcome than was available before the binary split.
Example: let $x_i \in \{0,1\}$ record whether person $i$ made a donation, and let $p$ be the raw frequency of donors, i.e., a number between 0 and 1. The information in the predicted likelihood $p$ is inversely related to the sum of squared errors (SSE) between $p$ and the observed values $x_i$.

$$ SSE_1 = \sum_{i=1}^n (x_i - p)^2 $$

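After bifurcating the data on income at a cutoff $K$, with $p_L$ and $p_R$ the donation frequencies in the left and right branches:

$$ SSE_2 = \sum_{i, Income < K} (x_i - p_L)^2 + \sum_{i, Income \geq K} (x_i - p_R)^2 $$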
By choosing $K$ correctly, our recursive partitioning algorithm will maximize the gain, i.e., $\delta = (SSE_1 - SSE_2)$. We stop branching further when at a given tree level $\delta$ is less than a pre-specified threshold.
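The search for $K$ can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm used by any particular library: the helper names `sse` and `best_split` and the toy income and donation data are hypothetical, and the candidate cutoffs are simply the observed income values.

```python
import numpy as np

def sse(x):
    """Sum of squared errors of the 0/1 outcomes x around their frequency p."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return 0.0
    return float(np.sum((x - x.mean()) ** 2))

def best_split(income, x):
    """Scan candidate cutoffs K and return the (K, gain) that maximizes SSE_1 - SSE_2."""
    income = np.asarray(income, dtype=float)
    x = np.asarray(x, dtype=float)
    sse1 = sse(x)
    best_K, best_gain = None, 0.0
    # Skip the smallest income so that both branches are non-empty.
    for K in np.unique(income)[1:]:
        left, right = x[income < K], x[income >= K]
        gain = sse1 - (sse(left) + sse(right))
        if gain > best_gain:
            best_K, best_gain = K, gain
    return best_K, best_gain

# Hypothetical toy data: donations (1) are more common at higher incomes.
income = np.array([20, 25, 30, 35, 60, 65, 70, 80])
donated = np.array([0, 0, 0, 1, 1, 1, 0, 1])
K, gain = best_split(income, donated)
print("Best cutoff K:", K, " gain:", round(gain, 3))
```

On this toy sample $SSE_1 = 2.0$, and the cutoff $K = 35$ gives the largest gain, $\delta = 1.2$. Applying `best_split` again within each branch, and stopping when the gain falls below a threshold, gives the recursive procedure described above.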

C4.5 Classifier

Recursive partitioning as in the previous case, but instead of minimizing the sum of squared errors between the sample data $x$ and the fitted value $p$ at each level, here the goal is to minimize entropy. Lower entropy after a split means higher information gain. Natural entropy ($H$) of the data $x$ is defined as

$$ H = -\sum_x\; f(x) \cdot \ln f(x) $$

where $f(x)$ is the probability of outcome $x$. This is intuitive because after the optimal split in recursing down the tree, the distribution of $x$ becomes narrower, lowering entropy. For continuous $x$ with density $f(x)$, the sum becomes an integral, and the measure is known as "differential entropy."
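As a rough illustration of the entropy formula above, the following sketch computes $H$ from class frequencies and the entropy reduction from a binary split. The function names `entropy` and `information_gain` and the toy labels are hypothetical. Weighting each child's entropy by its share of the observations is an assumption of this sketch, and C4.5 itself normalizes the gain further (the gain ratio), which is not shown here.

```python
import numpy as np

def entropy(labels):
    """Natural-log entropy H = -sum f(x) ln f(x) over the observed classes."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float(-np.sum(f * np.log(f)))

def information_gain(labels, mask):
    """Parent entropy minus the size-weighted entropy of the two children."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    n = labels.size
    left, right = labels[mask], labels[~mask]
    children = (left.size / n) * entropy(left) + (right.size / n) * entropy(right)
    return entropy(labels) - children

# Hypothetical toy labels: four of each class.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
perfect = np.array([True, True, True, True, False, False, False, False])
useless = np.array([True, False, True, False, True, False, True, False])

print("Parent entropy:", entropy(labels))                    # ln 2 for a 50/50 mix
print("Gain, perfect split:", information_gain(labels, perfect))
print("Gain, useless split:", information_gain(labels, useless))
```

A split that separates the classes perfectly removes all of the parent entropy ($\ln 2$ here), while a split that leaves each branch with the same 50/50 mix gains nothing.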

In [2]:

#METRICS FOR EVALUATING PREDICTIONS
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import confusion_matrix

NCAA Dataset

In [3]:

ncaa = pd.read_table("data/ncaa.txt")
yy = append(list(ones(32)), list(zeros(32)))
ncaa["y"] = yy
ncaa.head()

Out[3]:

	No NAME	GMS	PTS	REB	AST	TO	A/T	STL	BLK	PF	FG	FT	3P	y
0	1. NorthCarolina	6	84.2	41.5	17.8	12.8	1.39	6.7	3.8	16.7	0.514	0.664	0.417	1.0
1	2. Illinois	6	74.5	34.0	19.0	10.2	1.87	8.0	1.7	16.5	0.457	0.753	0.361	1.0
2	3. Louisville	5	77.4	35.4	13.6	11.0	1.24	5.4	4.2	16.6	0.479	0.702	0.376	1.0
3	4. MichiganState	5	80.8	37.8	13.0	12.6	1.03	8.4	2.4	19.8	0.445	0.783	0.329	1.0
4	5. Arizona	4	79.8	35.0	15.8	14.5	1.09	6.0	6.5	13.3	0.542	0.759	0.397	1.0

In [4]:

#CREATE FEATURES
y = ncaa['y']
X = ncaa.iloc[:,2:13]
X.head()

Out[4]:

	PTS	REB	AST	TO	A/T	STL	BLK	PF	FG	FT	3P
0	84.2	41.5	17.8	12.8	1.39	6.7	3.8	16.7	0.514	0.664	0.417
1	74.5	34.0	19.0	10.2	1.87	8.0	1.7	16.5	0.457	0.753	0.361
2	77.4	35.4	13.6	11.0	1.24	5.4	4.2	16.6	0.479	0.702	0.376
3	80.8	37.8	13.0	12.6	1.03	8.4	2.4	19.8	0.445	0.783	0.329
4	79.8	35.0	15.8	14.5	1.09	6.0	6.5	13.3	0.542	0.759	0.397

In [19]:

#FIT MODEL
from sklearn.tree import DecisionTreeClassifier as CART
model = CART()
model.fit(X,y)
#In-sample predictions: the same data were used to fit the tree
ypred = model.predict(X)

In [20]:

#CONFUSION MATRIX
cm = confusion_matrix(y, ypred)
cm

Out[20]:

array([[32,  0],
       [ 0, 32]])

In [21]:

#ACCURACY
accuracy_score(y,ypred)

Out[21]:

1.0

In [22]:

#CLASSIFICATION REPORT
print(classification_report(y, ypred))

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00        32
        1.0       1.00      1.00      1.00        32

avg / total       1.00      1.00      1.00        64

In [23]:

#ROC, AUC
y_score = model.predict_proba(X)[:,1]
fpr, tpr, _ = roc_curve(y, y_score)

title('ROC curve')
xlabel('FPR (1 - Specificity)')
ylabel('TPR (Recall)')

plot(fpr,tpr)
plot((0,1), ls='dashed',color='black')
plt.show()
print('Area under curve (AUC): ', auc(fpr,tpr))

Area under curve (AUC):  1.0

In [24]:

from io import StringIO  #sklearn.externals.six is not available in newer scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

In [27]:

Image(graph.create_png())

Out[27]:

Credit Card Dataset

In [10]:

#LOAD IN CREDIT CARD DATA
import pickle
CCdata = pickle.load(open("data/CCdata.p", "rb"))
X_train = CCdata['X_train']
y_train = CCdata['y_train']
X_test = CCdata['X_test']
y_test = CCdata['y_test']

In [11]:

#FIT MODEL
from sklearn.tree import DecisionTreeClassifier as CART
model = CART()
model.fit(X_train,y_train)

Out[11]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [12]:

#CONFUSION MATRIX
ypred = model.predict(X_test)
cm = confusion_matrix(y_test, ypred)
cm

Out[12]:

array([[93650,   177],
       [   34,   126]])

In [13]:

#ACCURACY
accuracy_score(y_test,ypred)

Out[13]:

0.997755008671412

In [14]:

#CLASSIFICATION REPORT
print(classification_report(y_test, ypred))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     93827
          1       0.42      0.79      0.54       160

avg / total       1.00      1.00      1.00     93987
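The class-1 row of the report can be recovered by hand from the confusion matrix shown earlier. The sketch below hard-codes that matrix and relies on scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix from the test set (rows: true class, columns: predicted class).
cm = np.array([[93650, 177],
               [34, 126]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found (TPR)
false_positive_rate = fp / (fp + tn)  # share of actual negatives flagged as positive
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print("precision:", round(precision, 2))
print("recall:   ", round(recall, 2))
print("f1-score: ", round(f1, 2))
print("FPR:      ", false_positive_rate)
print("accuracy: ", accuracy)
```

The high accuracy is driven almost entirely by the 93,827 negatives. For the rare positive class, fewer than half of the flagged cases are true positives, which is why precision and recall are more informative than accuracy on data this imbalanced.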

In [15]:

#ROC, AUC
y_score = model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_score)

title('ROC curve')
xlabel('FPR (1 - Specificity)')
ylabel('TPR (Recall)')

plot(fpr,tpr)
plot((0,1), ls='dashed',color='black')
plt.show()
print('Area under curve (AUC): ', auc(fpr,tpr))

Area under curve (AUC):  0.8928067747023779

In [16]:

dot_data = StringIO()
export_graphviz(model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
#graph.write('images/tree.dot')

Out[16]:

True

In [17]:

Image(graph.create_png())

Out[17]: