8. Machine Learning: A Quick Introduction#

A quick recap of machine learning before we get started with NLP.

8.1. Jupyter Extensions#

Jupyter extensions allow the notebooks to be used in myriad ways; see: https://blog.jupyter.org/99-ways-to-extend-the-jupyter-ecosystem-11e5dab7c54

from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
from IPython.display import Image
Image("NLP_images/ML_AI.png", width=700)

https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12

8.2. 5 Applications of ML in Finance#

https://www.verypossible.com/blog/5-applications-of-machine-learning-in-finance

https://towardsdatascience.com/machine-learning-in-finance-why-what-how-d524a2357b56

Image("NLP_images/ML_use_cases.jpg", width=600)
Image("NLP_images/AI_solutions.jpg", width=600)

8.3. Many Applications of ML in Finance#

  1. Fraud prevention

  2. Risk management

  3. Investment predictions

  4. Customer service

  5. Digital assistants

  6. Marketing

  7. Network security

  8. Loan underwriting

  9. Algorithmic trading

  10. Process automation

  11. Document interpretation

  12. Content creation

  13. Trade settlements

  14. Money-laundering prevention

  15. Custom machine learning solutions

https://igniteoutsourcing.com/fintech/machine-learning-in-finance/

For projects, start looking at Kaggle for finance datasets you may be able to use.

8.4. J.P. Morgan Guide to ML in Finance#

https://news.efinancialcareers.com/uk-en/285249/machine-learning-and-big-data-j-p-morgan

You won’t need to be a machine learning expert; you will need to be an excellent quant and an excellent programmer.

J.P. Morgan says the skillset for the role of data scientist is virtually the same as for any other quantitative researcher. Existing buy-side and sell-side quants with backgrounds in computer science, statistics, maths, financial engineering, econometrics, and the natural sciences should therefore be able to reinvent themselves. Expertise in quantitative trading strategies will be the crucial skill. “It is much easier for a quant researcher to change the format/size of a dataset, and employ better statistical and Machine Learning tools, than for an IT expert, silicon valley entrepreneur, or academic to learn how to design a viable trading strategy,” say Kolanovic and Krishnamachari.

8.5. ML Tasks in Finance#

Image("NLP_images/JPMorgan-machine-learning-2.jpg", width=600)

8.6. ML with NLP#

8.7. scikit-learn: Python’s one-stop shop for ML#

https://scikit-learn.org/stable/

8.8. Supervised Learning Models#

  1. Linear Models

  2. Logistic Regression

  3. Discriminant Analysis

  4. Bayes Classifier

  5. Support Vector Machines

  6. Nearest Neighbors (kNN)

  7. Decision Trees

  8. Neural Networks
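All of these models share the same fit/score interface in scikit-learn, which is what makes it a one-stop shop. A minimal sketch, using a synthetic dataset and three of the classifiers listed above purely for illustration (the data and parameters are made up, not part of the SBA example below):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative assumption, not the SBA data used later)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Every classifier exposes the same .fit() / .score() interface
for clf in [LogisticRegression(max_iter=1000), KNeighborsClassifier(), DecisionTreeClassifier()]:
    clf.fit(Xtr, ytr)
    print(type(clf).__name__, clf.score(Xte, yte))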

8.9. Unsupervised Learning Models#

8.9.1. Clustering#

  • K-means

  • Hierarchical clustering

8.9.2. Dimension Reduction#

  • Principal Components Analysis

  • Factor Analysis

  • Autoencoders
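The unsupervised tools follow the same pattern. A minimal sketch on random data, purely to show the calls (the data and parameter choices are illustrative assumptions):

from numpy.random import default_rng
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = default_rng(0)
Z = rng.normal(size=(500, 8))   # 500 observations, 8 features (made-up data)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)  # cluster assignments
Z2 = PCA(n_components=2).fit_transform(Z)                                # first two principal components
print(labels[:10], Z2.shape)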

8.10. Ensemble Methods#

  1. Bagging

  2. Stacking

  3. Boosting
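Each of these ensemble strategies is likewise a one-liner in scikit-learn. A hedged sketch on synthetic data (the dataset, base learners, and settings are illustrative, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

ensembles = {
    "bagging":  BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier([("lr", LogisticRegression(max_iter=1000)),
                                    ("dt", DecisionTreeClassifier())]),
}
for name, clf in ensembles.items():
    print(name, cross_val_score(clf, X_demo, y_demo, cv=5).mean())  # mean 5-fold CV accuracy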

8.11. Small Business Association (SBA) Loans Dataset#

#Import the SBA Loans dataset
pd.set_option('display.max_columns', 500)
sba = pd.read_csv("NLP_data/SBA.csv")
print(sba.columns)
print(sba.shape)
sba.head()
Index(['LoanID', 'GrossApproval', 'SBAGuaranteedApproval', 'subpgmdesc',
       'ApprovalFiscalYear', 'InitialInterestRate', 'TermInMonths',
       'ProjectState', 'BusinessType', 'LoanStatus', 'RevolverStatus',
       'JobsSupported'],
      dtype='object')
(527700, 12)
LoanID GrossApproval SBAGuaranteedApproval subpgmdesc ApprovalFiscalYear InitialInterestRate TermInMonths ProjectState BusinessType LoanStatus RevolverStatus JobsSupported
0 733784 50000 25000 FA$TRK (Small Loan Express) 2006 11.25 84 IN CORPORATION CANCLD 1 4
1 733785 35000 17500 FA$TRK (Small Loan Express) 2006 12.00 84 IL CORPORATION CANCLD 0 3
2 733786 15000 7500 FA$TRK (Small Loan Express) 2006 12.00 84 WV INDIVIDUAL CANCLD 0 4
3 733787 16000 13600 Community Express 2006 11.50 84 MD CORPORATION PIF 0 1
4 733788 16000 13600 Community Express 2006 11.50 84 MD CORPORATION CANCLD 0 1

8.12. Feature Engineering#

#Feature engineering
#Fraction of the gross loan amount guaranteed by the SBA
sba["GuaranteePct"] = sba.SBAGuaranteedApproval.astype("float")/sba.GrossApproval.astype("float")
X = sba[['ApprovalFiscalYear', 'InitialInterestRate', 'TermInMonths',
        'RevolverStatus','JobsSupported','GuaranteePct']]

#One-hot encode the loan subprogram description
x1 = pd.get_dummies(sba.subpgmdesc, dtype=int)
X = pd.concat([X,x1],axis=1)

#One-hot encode the business type
x2 = pd.get_dummies(sba.BusinessType, dtype=int)
X = pd.concat([X,x2],axis=1)

X.head()
ApprovalFiscalYear InitialInterestRate TermInMonths RevolverStatus JobsSupported GuaranteePct 509 - DEALER FLOOR PLAN Community Advantage Initiative Community Express Contract Guaranty EXPORT IMPORT HARMONIZATION FA$TRK (Small Loan Express) Guaranty Gulf Opportunity International Trade - Sec, 7(a) (16) Lender Advantage Initiative Patriot Express Revolving Line of Credit Exports - Sec. 7(a) (14) Rural Lender Advantage Seasonal Line of Credit Small Asset Based Small General Contractors - Sec. 7(a) (9) Standard Asset Based CORPORATION INDIVIDUAL PARTNERSHIP
0 2006 11.25 84 1 4 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
1 2006 12.00 84 0 3 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2 2006 12.00 84 0 4 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
3 2006 11.50 84 0 1 0.85 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
4 2006 11.50 84 0 1 0.85 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

8.13. Logistic Regression (Logit)#

8.13.1. Limited Dependent Variables#

  • The dependent variable may be discrete, and could be binomial or multinomial. That is, the dependent variable is limited. In such cases, we need a different approach.

  • Discrete dependent variables are a special case of limited dependent variables. The Logit model we look at here is a discrete dependent variable model. Such models are also often called qualitative response (QR) models.

8.14. The Logistic Function#

\[ y = \frac{1}{1+e^{-f(x_1,x_2,...,x_n)}} \in (0,1) \]

where

\[ f(x_1,x_2,...,x_n) = a_0 + a_1 x_1 + ... + a_n x_n \in (-\infty,+\infty) \]
#Sigmoid (logistic) function: exp(fx)/(1+exp(fx)) = 1/(1+exp(-fx))
def logit(fx):
    return exp(fx)/(1+exp(fx))

fx = linspace(-4,4,100)
y = logit(fx)
plot(fx,y)
xlabel('f(x)')
ylabel('Logit value')
grid()
(Figure: the logistic (sigmoid) curve, with f(x) on the horizontal axis and the logit value on the vertical axis.)
#Dependent categorical variable
y = pd.get_dummies(sba.LoanStatus, dtype=int)
y.head()
CANCLD CHGOFF EXEMPT PIF
0 1 0 0 0
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
#Prepare the X and y variables for chargeoffs vs paid in full
idx1 = list(where(y.CHGOFF==1)[0])
idx2 = list(where(y.PIF==1)[0])
idx = append(idx1,idx2)
print(len(idx))
X = X.iloc[idx]
# X["Intercept"] = 1.0

y = y.CHGOFF.iloc[idx]

#Save for later
y_SBA = y.copy()
X_SBA = X.copy()
223647
y_SBA
CHGOFF
5 1
8 1
9 1
11 1
14 1
... ...
508757 0
510524 0
510828 0
511814 0
514919 0

223647 rows × 1 columns


X_SBA
ApprovalFiscalYear InitialInterestRate TermInMonths RevolverStatus JobsSupported GuaranteePct 509 - DEALER FLOOR PLAN Community Advantage Initiative Community Express Contract Guaranty EXPORT IMPORT HARMONIZATION FA$TRK (Small Loan Express) Guaranty Gulf Opportunity International Trade - Sec, 7(a) (16) Lender Advantage Initiative Patriot Express Revolving Line of Credit Exports - Sec. 7(a) (14) Rural Lender Advantage Seasonal Line of Credit Small Asset Based Small General Contractors - Sec. 7(a) (9) Standard Asset Based CORPORATION INDIVIDUAL PARTNERSHIP
5 2006 11.50 28 0 2 0.85 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
8 2006 11.25 46 1 6 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
9 2006 13.25 18 0 7 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
11 2006 12.25 32 0 2 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
14 2006 13.25 68 0 1 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
508757 2015 5.00 12 1 5 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
510524 2015 5.50 6 0 0 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
510828 2015 6.00 60 0 12 0.85 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
511814 2015 5.21 84 0 4 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
514919 2015 8.75 84 1 0 0.50 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0

223647 rows × 26 columns

%%time

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score

# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression(penalty=None, max_iter=1000, tol=0.001) # higher number of iterations needed if the convergence rate is slow
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)
CPU times: user 25.8 s, sys: 7.69 s, total: 33.5 s
Wall time: 21.8 s
0.8277464039311951
#Show the coefficients
pd.DataFrame({'X':X.columns, 'Coeff':model.coef_[0]})
X Coeff
0 ApprovalFiscalYear -0.000784
1 InitialInterestRate 0.333744
2 TermInMonths -0.042977
3 RevolverStatus -0.340903
4 JobsSupported -0.000136
5 GuaranteePct -0.158797
6 509 - DEALER FLOOR PLAN -0.054545
7 Community Advantage Initiative -0.011115
8 Community Express 1.568193
9 Contract Guaranty -0.306265
10 EXPORT IMPORT HARMONIZATION -0.027013
11 FA$TRK (Small Loan Express) -0.040360
12 Guaranty 1.654685
13 Gulf Opportunity -1.292674
14 International Trade - Sec, 7(a) (16) -0.009668
15 Lender Advantage Initiative -0.496368
16 Patriot Express 1.574268
17 Revolving Line of Credit Exports - Sec. 7(a) (14) -1.705447
18 Rural Lender Advantage -0.096996
19 Seasonal Line of Credit -0.048645
20 Small Asset Based -0.037303
21 Small General Contractors - Sec. 7(a) (9) -0.133505
22 Standard Asset Based -0.508500
23 CORPORATION 0.180144
24 INDIVIDUAL 0.208657
25 PARTNERSHIP -0.358629
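Logit coefficients are on the log-odds scale, so they are easier to read after exponentiation: \(e^{a_i}\) is the multiplicative change in the odds of charge-off for a one-unit increase in feature \(i\), holding the others fixed. A small sketch of this conversion using the fitted model above:

# Convert the logit coefficients to odds ratios: exp(coefficient) multiplies the odds of charge-off
pd.DataFrame({'X': X.columns,
              'Coeff': model.coef_[0],
              'OddsRatio': exp(model.coef_[0])})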

8.15. Training, validation, and testing data: Underfitting, Overfitting, Cross-Validation#

As we will see below, we split our data sample into training and testing subsets. The training data is used to fit the model, and the same model is then applied to the test data (out-of-sample) to make sure it performs well on data it has not already seen. The model is also applied back to the data on which it was trained. So we get two accuracy scores, one for the training data set and another for the test data set. One hopes that a model performs as accurately on the test data as it does on the training data.

When accuracy is low on the training data, the model “underfits” the data. Conversely, a model may show very high accuracy on the training data, which is a good thing unless it achieves much lower accuracy on the test data. In that case we say the model is “overfitted”: it is so specifically attuned to the training data that it is of little use for data outside the training set. This typically occurs when the model has too many free parameters, so that it effectively “memorizes” the training set, which explains why it performs poorly on data it has not seen before. An analogy is students who memorize math homework problems without understanding the underlying concepts; faced with a slightly different problem on the exam, they fail miserably.

We often break a data sample down into three subsets: training, validation, and testing data. Say we keep 20% of the data aside for testing; this is also known as “holdout” data. Of the remaining 80%, we may randomly sample 75% to train the model and check that it performs well on the remaining 25%. Then we randomly sample a different 75% and train on it, validating against the new remaining 25%, either starting from the current model or afresh. This is also called “rotation sampling”. If we repeat this \(n\) times, we are said to undertake “\(n\)-fold cross-validation”, and the results are averaged to assess fit. Once a model has been trained through this cross-validated process, it is applied to the test data to assess how well it performs and to determine the extent of overfitting, if any.
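The cross_val_score function imported above automates this rotation. A minimal sketch of 5-fold cross-validation on the SBA data, using the same logit specification as above:

# 5-fold cross-validation of the same logit specification on the SBA data (X, y as built above)
cv_model = LogisticRegression(penalty=None, max_iter=1000, tol=0.001)
cv_scores = cross_val_score(cv_model, X, y, cv=5)
print(cv_scores, cv_scores.mean())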

The figure below provides a visual depiction of under- and over-fitting.

Image("NLP_images/overfitting.png", width=1000)
# Evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression(penalty=None, max_iter=1000, tol=0.001)
model2.fit(X_train, y_train)
LogisticRegression(max_iter=1000, penalty=None, tol=0.001)
# Predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)
[0 0 0 ... 1 1 1]
# Generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)
[[0.93046744 0.06953256]
 [0.9527771  0.0472229 ]
 [0.83383113 0.16616887]
 ...
 [0.29892397 0.70107603]
 [0.41446059 0.58553941]
 [0.22209138 0.77790862]]
model2.score(X_test, y_test)
0.8288546091362993

8.16. Metrics#

  1. Accuracy: the fraction of class labels predicted correctly.

  2. TPR = sensitivity or recall = TP/(TP+FN)

  3. FPR = (1 − specificity) = FP/(FP+TN)

# Confusion Matrix
CM = confusion_matrix(predicted, y_test) # predicted values on rows, true values on columns
CM
array([[43674,  7969],
       [ 3514, 11938]])
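Because the rows of CM hold predicted labels and the columns hold true labels, these rates can be read straight off the matrix; a small sketch:

# Rows of CM are predicted labels, columns are true labels (see the comment above), so:
# CM[1,1] = TP, CM[0,0] = TN, CM[1,0] = FP (predicted 1, truly 0), CM[0,1] = FN (predicted 0, truly 1)
TP, TN, FP, FN = CM[1,1], CM[0,0], CM[1,0], CM[0,1]
print("Accuracy =", (TP + TN) / (TP + TN + FP + FN))
print("TPR (recall) =", TP / (TP + FN))
print("FPR =", FP / (FP + TN))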

8.17. More Metrics#

  1. Precision = \(\frac{TP}{TP+FP}\)

  2. Recall = \(\frac{TP}{TP+FN}\)

  3. F1 score = \(\frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}\)

(F1 is the harmonic mean of precision and recall.)
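These can also be computed directly from the confusion matrix entries; a small sketch for the positive (charge-off) class:

# Precision, recall, and F1 for the positive (charge-off) class, from the confusion matrix above
TP, FP, FN = CM[1,1], CM[1,0], CM[0,1]
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 / (1/precision + 1/recall)   # harmonic mean of precision and recall
print(precision, recall, f1)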

print(classification_report(predicted, y_test))  # note: sklearn expects (y_true, y_pred); with predicted passed first, the precision and recall columns are interchanged relative to the usual convention
              precision    recall  f1-score   support

           0       0.93      0.85      0.88     51643
           1       0.60      0.77      0.68     15452

    accuracy                           0.83     67095
   macro avg       0.76      0.81      0.78     67095
weighted avg       0.85      0.83      0.84     67095

8.18. The Matthews Correlation Coefficient#

https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

This is a useful classification metric that is not widely used. See: https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a

#Matthews Correlation Coefficient computed from the confusion matrix entries
def MCC(tp,tn,fp,fn):
  return (tp*tn - fp*fn)/sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))

Let’s take the confusion matrix from above and apply the numbers therein to compute MCC.

print(CM)
mcc = MCC(CM[1,1], CM[0,0], CM[0,1], CM[1,0])
print("MCC =", mcc)
[[43674  7969]
 [ 3514 11938]]
MCC = 0.5698522319148718
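scikit-learn also provides this metric as matthews_corrcoef, which should agree with the hand-computed value above:

from sklearn.metrics import matthews_corrcoef
print("MCC =", matthews_corrcoef(y_test, predicted))   # built-in cross-check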

8.19. ROC and AUC#

The Receiver-Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) for different levels of the cut-off posterior probability. This is an essential trade-off in all classification systems.
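Each choice of cut-off yields one (FPR, TPR) point on the curve. A small sketch that sweeps a few cut-offs over the predicted probabilities computed above (the particular thresholds are arbitrary):

# Each probability cut-off gives one (FPR, TPR) point; the thresholds below are arbitrary examples
for cutoff in [0.3, 0.5, 0.7]:
    pred_c = (probs[:, 1] > cutoff).astype(int)
    cm_c = confusion_matrix(y_test, pred_c)          # note: true labels on rows, predicted on columns here
    tpr_c = cm_c[1, 1] / (cm_c[1, 1] + cm_c[1, 0])
    fpr_c = cm_c[0, 1] / (cm_c[0, 1] + cm_c[0, 0])
    print(cutoff, round(tpr_c, 3), round(fpr_c, 3))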

Image("NLP_images/roc_example.jpg", width=600)
# generate evaluation metrics
print('Accuracy =', accuracy_score(y_test, predicted))
print('AUC =', roc_auc_score(y_test, probs[:, 1]))
Accuracy = 0.8288546091362993
AUC = 0.863604458068324
#ROC, AUC
from sklearn.metrics import roc_curve, auc
y_score = model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_score)

title('ROC curve')
xlabel('FPR (1 - Specificity)')
ylabel('TPR (Recall)')

plot(fpr,tpr)
plot((0,1), ls='dashed',color='black')
plt.show()
print('Area under curve (AUC): ', auc(fpr,tpr))
(Figure: ROC curve plotting TPR against FPR, with the 45-degree diagonal shown for reference.)
Area under curve (AUC):  0.8634953095597164

8.20. All In One#

A terrific article in Scientific American on ROC curves by Swets, Dawes, and Monahan (2000) is available as a pdf. Also see Dawes (1979) on the use of “Improper Linear Models”.

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Image("NLP_images/all_metrics.png", width=800)

8.21. ML Comic#