K Nearest Neighbors

Sanjiv R. Das

In [1]:
%pylab inline
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

What is kNN?

  • This is one of the simplest algorithms for classification and grouping.

  • Simply define a distance metric over a set of observations, each with $M$ characteristics, i.e., $x_1,x_2,...,x_M$.

  • Compute the distance between each pair of observations, using any standard metric. For example, the Euclidean distance between observations $x$ and $y$:

$$ d = \sqrt{\sum_{i=1}^M (x_i - y_i)^2} $$

  • Next, fix $k$, the number of nearest neighbors to consider.

  • Finally, assign to the new case the category that holds the majority among its $k$ nearest neighbors; a from-scratch sketch of this procedure follows below.
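
The steps above condense into a few lines of NumPy. The sketch below is purely illustrative (the function knn_predict and the binary 0/1 labels are assumptions for this example), not the scikit-learn implementation used later in the notebook:

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from x_new to every training observation
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # positions of the k smallest distances
    nearest = np.argsort(d)[:k]
    # majority vote over the neighbors' binary 0/1 labels
    return 1 if y_train[nearest].sum() > k / 2 else 0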

Classified Neighborhoods

In [2]:
#METRICS FOR EVALUATING PREDICTIONS
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_curve, auc, confusion_matrix

NCAA Dataset

In [3]:
ncaa = pd.read_table("data/ncaa.txt")
# label the first 32 teams 1 and the remaining 32 teams 0
yy = append(ones(32), zeros(32))
ncaa["y"] = yy
ncaa.head()
Out[3]:
   No           NAME  GMS   PTS   REB   AST    TO   A/T  STL  BLK    PF     FG     FT     3P    y
0  1.  NorthCarolina    6  84.2  41.5  17.8  12.8  1.39  6.7  3.8  16.7  0.514  0.664  0.417  1.0
1  2.       Illinois    6  74.5  34.0  19.0  10.2  1.87  8.0  1.7  16.5  0.457  0.753  0.361  1.0
2  3.     Louisville    5  77.4  35.4  13.6  11.0  1.24  5.4  4.2  16.6  0.479  0.702  0.376  1.0
3  4.  MichiganState    5  80.8  37.8  13.0  12.6  1.03  8.4  2.4  19.8  0.445  0.783  0.329  1.0
4  5.        Arizona    4  79.8  35.0  15.8  14.5  1.09  6.0  6.5  13.3  0.542  0.759  0.397  1.0
In [4]:
#CREATE FEATURES
y = ncaa['y']
X = ncaa.iloc[:,2:13]   # the eleven performance statistics, PTS through 3P
X.head()
Out[4]:
    PTS   REB   AST    TO   A/T  STL  BLK    PF     FG     FT     3P
0  84.2  41.5  17.8  12.8  1.39  6.7  3.8  16.7  0.514  0.664  0.417
1  74.5  34.0  19.0  10.2  1.87  8.0  1.7  16.5  0.457  0.753  0.361
2  77.4  35.4  13.6  11.0  1.24  5.4  4.2  16.6  0.479  0.702  0.376
3  80.8  37.8  13.0  12.6  1.03  8.4  2.4  19.8  0.445  0.783  0.329
4  79.8  35.0  15.8  14.5  1.09  6.0  6.5  13.3  0.542  0.759  0.397
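
One caveat before fitting: kNN distances are scale-sensitive, so columns measured on large scales (PTS, around 80) will dominate columns measured on small ones (FG, around 0.5). A common remedy, not applied in this notebook, is to standardize the features first, for example with scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# rescale each column to mean 0 and standard deviation 1 (not done below)
X_scaled = StandardScaler().fit_transform(X)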
In [36]:
#FIT MODEL
from sklearn.neighbors import NearestNeighbors as KNN
model = KNN(n_neighbors=5, algorithm='ball_tree')
model.fit(X)
# find the 5 nearest neighbors of every observation in the training set
distances, indices = model.kneighbors(X)
print(indices[:6])
[[ 0 23  3  4 22]
 [ 1 15 10  2 32]
 [ 2 26  4 27  3]
 [ 3  2  0 26 27]
 [ 4  2 38  3  0]
 [ 5 16 45 32 10]]
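
Because the query set here is the training set itself, the first column above shows each point as its own nearest neighbor (at distance zero), so the vote in the next cell includes each observation's own label. One way to exclude it, sketched with the same kneighbors call (the variable names are illustrative):

# request one extra neighbor, then drop the first column (the point itself)
distances6, indices6 = model.kneighbors(X, n_neighbors=6)
neighbors_only = indices6[:, 1:]

Alternatively, calling model.kneighbors() with no arguments makes scikit-learn exclude each training point from its own neighbor list.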
In [24]:
# sum of the five neighbors' labels for each observation
nn_sum = [sum(y[j]) for j in indices]
# majority vote: predict 1 when at least 3 of the 5 neighbors are labeled 1
ypred = [1 if j > 2 else 0 for j in nn_sum]
print(ypred)
[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
In [25]:
#CONFUSION MATRIX
cm = confusion_matrix(y, ypred)
cm
Out[25]:
array([[25,  7],
       [ 4, 28]])
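
In scikit-learn's convention the rows of the confusion matrix are the true classes and the columns the predicted classes: of the 32 class-0 teams, 25 are classified correctly and 7 are mislabeled as class 1; of the 32 class-1 teams, 28 are correct and 4 are mislabeled.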
In [26]:
#ACCURACY
accuracy_score(y,ypred)
Out[26]:
0.828125
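
Accuracy is simply the diagonal of the confusion matrix divided by the total number of cases, which can be verified directly:

# (25 + 28) / 64 = 0.828125, matching accuracy_score above
print(cm.trace() / cm.sum())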
In [29]:
#CLASSIFICATION REPORT
print(classification_report(y, ypred))
             precision    recall  f1-score   support

        0.0       0.86      0.78      0.82        32
        1.0       0.80      0.88      0.84        32

avg / total       0.83      0.83      0.83        64

Credit Card Dataset

In [31]:
#LOAD IN CREDIT CARD DATA
import pickle
CCdata = pickle.load(open("data/CCdata.p", "rb"))
X_train = CCdata['X_train']
y_train = CCdata['y_train']
X_test = CCdata['X_test']
y_test = CCdata['y_test']
In [43]:
#FIT MODEL
# use the supervised classifier this time; it handles the neighbor vote internally
from sklearn.neighbors import KNeighborsClassifier as KNNC
model = KNNC(n_neighbors=5, algorithm='ball_tree')
model.fit(X_train, y_train)
Out[43]:
KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
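
Note from the printed parameters that the default metric='minkowski' with p=2 is exactly the Euclidean distance defined at the start of this section.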
In [44]:
#CONFUSION MATRIX
ypred = model.predict(X_test)
cm = confusion_matrix(y_test, ypred)
cm
Out[44]:
array([[86086,  7741],
       [   74,    86]])
In [45]:
#ACCURACY
accuracy_score(y_test,ypred)
Out[45]:
0.9168502026876058
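
A 91.7% accuracy sounds strong, but the test set is heavily imbalanced: only 160 of its 93,987 cases are in class 1 (see the support column in the report below), so a trivial rule that always predicts 0 would score about 99.8%. A quick check, assuming y_test holds numeric 0/1 labels:

# base rate of the majority class: what always predicting 0 would score
print(1 - y_test.mean())   # roughly 0.9983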
In [46]:
#CLASSIFICATION REPORT
print(classification_report(y_test, ypred))
             precision    recall  f1-score   support

          0       1.00      0.92      0.96     93827
          1       0.01      0.54      0.02       160

avg / total       1.00      0.92      0.95     93987

In [47]:
#ROC, AUC
# predicted probability of class 1 for each test case
y_score = model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_score)

title('ROC curve')
xlabel('False Positive Rate (FPR)')
ylabel('True Positive Rate (TPR, Recall)')

plot(fpr, tpr)
plot((0,1), ls='dashed', color='black')   # 45-degree line of a random classifier
plt.show()
print('Area under curve (AUC): ', auc(fpr,tpr))
Area under curve (AUC):  0.7396254542935403
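
An AUC of roughly 0.74 means that a randomly chosen class-1 case receives a higher predicted score than a randomly chosen class-0 case about 74% of the time. For a binary problem the same number can be obtained in one call with scikit-learn's roc_auc_score:

from sklearn.metrics import roc_auc_score

# equivalent to auc(fpr, tpr) computed above
print(roc_auc_score(y_test, y_score))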