Decision Trees

Sanjiv R. Das

In [1]:
%pylab inline
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

Prediction Trees

  • A natural outcome of recursive partitioning of the data.
  • Also known as CART, which stands for classification and regression trees.
  • Prediction trees are of two types: (a) classification trees, where the leaves of the tree are different categories of discrete outcomes, and (b) regression trees, where the leaves are continuous outcomes.
  • We may think of the former as a generalized form of limited dependent variables, and the latter as a generalized form of regression analysis.

Recursive Partitioning

  • Bifurcate the data into two groups such that the split yields more information about the outcome than was available before the split.
  • Example: let $p$ be the raw frequency of people who made donations, i.e., a number between 0 and 1. The information in the predicted likelihood $p$ is inversely related to the sum of squared errors (SSE) between this value $p$ and the observed values $x_i \in \{0,1\}$.

$$ SSE_1 = \sum_{i=1}^n (x_i - p)^2 $$

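  • Second bifurcation: split the data on Income at a threshold $K$, with $p_L$ and $p_R$ the donation frequencies on the two sides of the split: $$ SSE_2 = \sum_{i, Income < K} (x_i - p_L)^2 + \sum_{i, Income \geq K} (x_i - p_R)^2 $$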

  • By choosing $K$ correctly, our recursive partitioning algorithm will maximize the gain, i.e., $\delta = (SSE_1 - SSE_2)$. We stop branching further when at a given tree level $\delta$ is less than a pre-specified threshold.
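
The search for $K$ can be sketched in a few lines. This is a minimal illustration, not a full CART implementation: the function names `sse` and `best_split` and the toy income and donation values are hypothetical, and only a single split on one variable is considered.

```python
import numpy as np

def sse(x):
    """Sum of squared errors of 0/1 outcomes around their mean frequency p."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return 0.0
    return float(np.sum((x - x.mean()) ** 2))

def best_split(income, x):
    """Return (K, delta) maximizing delta = SSE_1 - SSE_2 over candidate thresholds."""
    income = np.asarray(income, dtype=float)
    x = np.asarray(x, dtype=float)
    sse1 = sse(x)
    best_K, best_delta = None, 0.0
    # Candidate thresholds: every distinct income except the smallest,
    # so that both sides of the split are non-empty.
    for K in np.unique(income)[1:]:
        left = x[income < K]      # observations with Income < K
        right = x[income >= K]    # observations with Income >= K
        delta = sse1 - (sse(left) + sse(right))
        if delta > best_delta:
            best_K, best_delta = K, delta
    return best_K, best_delta

# Hypothetical toy data: income (in thousands) and whether the person donated.
income = np.array([20, 25, 30, 35, 60, 65, 70, 80])
donated = np.array([0, 0, 0, 1, 1, 1, 1, 1])

K, delta = best_split(income, donated)
print("Best threshold K:", K, " gain delta:", delta)
```

Applying `best_split` again to each of the two resulting subsets, and stopping once `delta` falls below the chosen threshold, gives the recursive procedure described above.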

C4.5 Classifier

Recursive partitioning proceeds as in the previous case, but instead of minimizing the sum of squared errors between the sample data $x$ and the true value $p$ at each level, here the goal is to minimize entropy. The reduction in entropy from a split is the information gain. Natural entropy ($H$) of the data $x$ is defined as

$$ H = -\sum_x\; f(x) \cdot \ln f(x) $$

where $f(x)$ is the probability of outcome $x$ (a probability density when $x$ is continuous, in which case the sum becomes an integral). This is intuitive because after the optimal split in recursing down the tree, the distribution of $x$ becomes narrower, lowering entropy. In the continuous case this measure is often known as "differential entropy."
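
As a rough illustration of the entropy criterion, the sketch below computes $H$ from empirical class frequencies and the entropy reduction from a candidate split. The helper names `entropy` and `information_gain` and the toy labels are assumptions made for this example. It uses plain information gain; C4.5 itself normalizes this quantity into a gain ratio, which is not shown here.

```python
import numpy as np

def entropy(labels):
    """H = -sum_x f(x) ln f(x), with f(x) the empirical class frequencies."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float(-np.sum(f * np.log(f)))

def information_gain(labels, mask):
    """Parent entropy minus the size-weighted entropy of the two children."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    n = labels.size
    left, right = labels[mask], labels[~mask]
    weighted = (left.size / n) * entropy(left) + (right.size / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy data: a balanced binary outcome and one numeric feature.
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
feature = np.array([1, 2, 3, 4, 5, 6, 7, 8])

h_parent = entropy(y_toy)                                # ln(2) for a 50/50 outcome
gain_good = information_gain(y_toy, feature < 5)         # separates the classes
gain_poor = information_gain(y_toy, feature % 2 == 0)    # leaves both children 50/50
print(h_parent, gain_good, gain_poor)
```

The split that separates the classes recovers all of the parent entropy, while the split that leaves each child as mixed as the parent gains nothing.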

In [2]:
#IMPORT EVALUATION METRICS
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import confusion_matrix

NCAA Dataset

In [3]:
ncaa = pd.read_table("data/ncaa.txt")
yy = append(list(ones(32)), list(zeros(32)))
ncaa["y"] = yy
ncaa.head()
Out[3]:
No NAME GMS PTS REB AST TO A/T STL BLK PF FG FT 3P y
0 1. NorthCarolina 6 84.2 41.5 17.8 12.8 1.39 6.7 3.8 16.7 0.514 0.664 0.417 1.0
1 2. Illinois 6 74.5 34.0 19.0 10.2 1.87 8.0 1.7 16.5 0.457 0.753 0.361 1.0
2 3. Louisville 5 77.4 35.4 13.6 11.0 1.24 5.4 4.2 16.6 0.479 0.702 0.376 1.0
3 4. MichiganState 5 80.8 37.8 13.0 12.6 1.03 8.4 2.4 19.8 0.445 0.783 0.329 1.0
4 5. Arizona 4 79.8 35.0 15.8 14.5 1.09 6.0 6.5 13.3 0.542 0.759 0.397 1.0
In [4]:
#CREATE FEATURES
y = ncaa['y']
X = ncaa.iloc[:,2:13]
X.head()
Out[4]:
PTS REB AST TO A/T STL BLK PF FG FT 3P
0 84.2 41.5 17.8 12.8 1.39 6.7 3.8 16.7 0.514 0.664 0.417
1 74.5 34.0 19.0 10.2 1.87 8.0 1.7 16.5 0.457 0.753 0.361
2 77.4 35.4 13.6 11.0 1.24 5.4 4.2 16.6 0.479 0.702 0.376
3 80.8 37.8 13.0 12.6 1.03 8.4 2.4 19.8 0.445 0.783 0.329
4 79.8 35.0 15.8 14.5 1.09 6.0 6.5 13.3 0.542 0.759 0.397
In [19]:
#FIT MODEL
from sklearn.tree import DecisionTreeClassifier as CART
model = CART()
model.fit(X,y)
ypred = model.predict(X)
In [20]:
#CONFUSION MATRIX
cm = confusion_matrix(y, ypred)
cm
Out[20]:
array([[32,  0],
       [ 0, 32]])
In [21]:
#ACCURACY
accuracy_score(y,ypred)
Out[21]:
1.0
In [22]:
#CLASSIFICATION REPORT
print(classification_report(y, ypred))
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00        32
        1.0       1.00      1.00      1.00        32

avg / total       1.00      1.00      1.00        64
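
The perfect scores above are in-sample: the model is evaluated on the same 64 rows it was fit to, and a tree with no depth limit can keep splitting until every leaf is pure. The sketch below illustrates the point on synthetic data rather than on the NCAA file. The dataset from `make_classification`, the 25% hold-out fraction, and the noise level are assumptions chosen only for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (NOT the NCAA data): 11 features, noisy labels.
X_syn, y_syn = make_classification(n_samples=200, n_features=11,
                                   n_informative=4, flip_y=0.2,
                                   random_state=0)

# Hold out a quarter of the rows for out-of-sample evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25,
                                          random_state=0, stratify=y_syn)

# Unrestricted tree, as in the cell above (default max_depth=None).
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

acc_train = accuracy_score(y_tr, tree.predict(X_tr))
acc_test = accuracy_score(y_te, tree.predict(X_te))
print("In-sample accuracy: ", acc_train)
print("Hold-out accuracy:  ", acc_test)
```

The in-sample figure says little about predictive power, which is why the credit card example later in these notes scores the model on a separate test set.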

In [23]:
#ROC, AUC
y_score = model.predict_proba(X)[:,1]
fpr, tpr, _ = roc_curve(y, y_score)

title('ROC curve')
xlabel('FPR (1 - Specificity)')
ylabel('TPR (Recall)')

plot(fpr,tpr)
plot((0,1), ls='dashed',color='black')
plt.show()
print('Area under curve (AUC): ', auc(fpr,tpr))
Area under curve (AUC):  1.0
In [24]:
from io import StringIO  # sklearn.externals.six is not available in recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
In [27]:
Image(graph.create_png())
Out[27]:

Credit Card Dataset

In [10]:
#LOAD IN CREDIT CARD DATA
import pickle
CCdata = pickle.load(open("data/CCdata.p", "rb"))
X_train = CCdata['X_train']
y_train = CCdata['y_train']
X_test = CCdata['X_test']
y_test = CCdata['y_test']
In [11]:
#FIT MODEL
from sklearn.tree import DecisionTreeClassifier as CART
model = CART()
model.fit(X_train,y_train)
Out[11]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [12]:
#CONFUSION MATRIX
ypred = model.predict(X_test)
cm = confusion_matrix(y_test, ypred)
cm
Out[12]:
array([[93650,   177],
       [   34,   126]])
In [13]:
#ACCURACY
accuracy_score(y_test,ypred)
Out[13]:
0.997755008671412
In [14]:
#CLASSIFICATION REPORT
print(classification_report(y_test, ypred))
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     93827
          1       0.42      0.79      0.54       160

avg / total       1.00      1.00      1.00     93987
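
The class-1 figures in the report can be reproduced by hand from the confusion matrix in Out[12]. The short check below uses scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are chosen for this example only.

```python
import numpy as np

# Confusion matrix from Out[12]: rows = true class, columns = predicted class.
cm_cc = np.array([[93650, 177],
                  [34, 126]])
tn, fp, fn, tp = cm_cc.ravel()

accuracy = (tp + tn) / cm_cc.sum()                    # share of all predictions that are correct
precision = tp / (tp + fp)                            # 126 / (126 + 177)
recall = tp / (tp + fn)                               # 126 / (126 + 34), the TPR
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
fpr = fp / (fp + tn)                                  # false positive rate, the ROC x-axis

print(round(accuracy, 6), round(precision, 2), round(recall, 2), round(f1, 2))
```

Because only 160 of the 93,987 test cases are positive, accuracy is dominated by the majority class, while the class-1 precision shows that most of the flagged cases are false alarms.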

In [15]:
#ROC, AUC
y_score = model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_score)

title('ROC curve')
xlabel('FPR (1 - Specificity)')
ylabel('TPR (Recall)')

plot(fpr,tpr)
plot((0,1), ls='dashed',color='black')
plt.show()
print('Area under curve (AUC): ', auc(fpr,tpr))
Area under curve (AUC):  0.8928067747023779
In [16]:
dot_data = StringIO()
export_graphviz(model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
#graph.write('images/tree.dot')
Out[16]:
True
In [17]:
Image(graph.create_png())
Out[17]: