Sanjiv R. Das
These extensions allow the notebooks to be used in myriad ways; see: https://blog.jupyter.org/99-ways-to-extend-the-jupyter-ecosystem-11e5dab7c54
%pylab inline
import pandas as pd
import os
%load_ext rpy2.ipython
Populating the interactive namespace from numpy and matplotlib
%%capture
!pip install ipypublish
from ipypublish import nb_setup
# Basic lines of code needed to import a data file with permissions from Google Drive
from google.colab import drive
# drive.mount("/content/drive", force_remount=True)
drive.mount('/content/drive')
os.chdir("drive/My Drive/Teaching/OnlineMSFA_FNCE2431/FNCE2431_Machine Learning for Finance/3_Course Content/Notebooks/")
Mounted at /content/drive
nb_setup.images_hconcat(["DSTMAA_images/ML_AI.png"], width=700)
nb_setup.images_hconcat(["DSTMAA_images/ML_use_cases.jpg"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/AI_solutions.jpg"], width=600)
https://igniteoutsourcing.com/fintech/machine-learning-in-finance/
For projects, start looking at Kaggle for finance datasets you may be able to use.
https://news.efinancialcareers.com/uk-en/285249/machine-learning-and-big-data-j-p-morgan
"You won't need to be a machine learning expert, you will need to be an excellent quant and an excellent programmer
J.P. Morgan says the skillset for the role of data scientists is virtually the same as for any other quantitative researchers. Existing buy side and sell side quants with backgrounds in computer science, statistics, maths, financial engineering, econometrics and natural sciences should therefore be able to reinvent themselves. Expertise in quantitative trading strategies will be the crucial skill. "It is much easier for a quant researcher to change the format/size of a dataset, and employ better statistical and Machine Learning tools, than for an IT expert, silicon valley entrepreneur, or academic to learn how to design a viable trading strategy," say Kolanovic and Krishnamacharc."
nb_setup.images_hconcat(["DSTMAA_images/JPMorgan-machine-learning-2.jpg"], width=600)
Credit scoring, sentiment analysis, document search: https://emerj.com/ai-sector-overviews/natural-language-processing-applications-in-finance-3-current-applications/
Gather real-time intelligence on specific stocks; Provide key hire alerts; Monitor company sentiment; Anticipate client concerns; Upgrade quality of analyst reporting; Understand and respond to news events; Detect insider trading: https://www.ibm.com/blogs/watson/2016/06/natural-language-processing-transforming-financial-industry-2/
Ravenpack: https://www.ravenpack.com/; https://www.ravenpack.com/research/browse/
#Import the SBA Loans dataset
sba = pd.read_csv("DSTMAA_data/SBA.csv")
print(sba.columns)
print(sba.shape)
sba.head()
Index(['LoanID', 'GrossApproval', 'SBAGuaranteedApproval', 'subpgmdesc',
       'ApprovalFiscalYear', 'InitialInterestRate', 'TermInMonths',
       'ProjectState', 'BusinessType', 'LoanStatus', 'RevolverStatus',
       'JobsSupported'],
      dtype='object')
(527700, 12)
| | LoanID | GrossApproval | SBAGuaranteedApproval | subpgmdesc | ApprovalFiscalYear | InitialInterestRate | TermInMonths | ProjectState | BusinessType | LoanStatus | RevolverStatus | JobsSupported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 733784 | 50000 | 25000 | FA$TRK (Small Loan Express) | 2006 | 11.25 | 84 | IN | CORPORATION | CANCLD | 1 | 4 |
| 1 | 733785 | 35000 | 17500 | FA$TRK (Small Loan Express) | 2006 | 12.00 | 84 | IL | CORPORATION | CANCLD | 0 | 3 |
| 2 | 733786 | 15000 | 7500 | FA$TRK (Small Loan Express) | 2006 | 12.00 | 84 | WV | INDIVIDUAL | CANCLD | 0 | 4 |
| 3 | 733787 | 16000 | 13600 | Community Express | 2006 | 11.50 | 84 | MD | CORPORATION | PIF | 0 | 1 |
| 4 | 733788 | 16000 | 13600 | Community Express | 2006 | 11.50 | 84 | MD | CORPORATION | CANCLD | 0 | 1 |
#Feature engineering
#GuaranteePct = fraction of the loan guaranteed by the SBA
sba["GuaranteePct"] = sba.SBAGuaranteedApproval.astype("float")/sba.GrossApproval.astype("float")
X = sba[['ApprovalFiscalYear', 'InitialInterestRate', 'TermInMonths',
         'RevolverStatus', 'JobsSupported', 'GuaranteePct']]
#One-hot encode the categorical variables as dummy columns
x1 = pd.get_dummies(sba.subpgmdesc)
X = pd.concat([X, x1], axis=1)
x2 = pd.get_dummies(sba.BusinessType)
X = pd.concat([X, x2], axis=1)
X.head()
| | ApprovalFiscalYear | InitialInterestRate | TermInMonths | RevolverStatus | JobsSupported | GuaranteePct | 509 - DEALER FLOOR PLAN | Community Advantage Initiative | Community Express | Contract Guaranty | EXPORT IMPORT HARMONIZATION | FA$TRK (Small Loan Express) | Guaranty | Gulf Opportunity | International Trade - Sec, 7(a) (16) | Lender Advantage Initiative | Patriot Express | Revolving Line of Credit Exports - Sec. 7(a) (14) | Rural Lender Advantage | Seasonal Line of Credit | Small Asset Based | Small General Contractors - Sec. 7(a) (9) | Standard Asset Based | CORPORATION | INDIVIDUAL | PARTNERSHIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | 11.25 | 84 | 1 | 4 | 0.50 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2006 | 12.00 | 84 | 0 | 3 | 0.50 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2006 | 12.00 | 84 | 0 | 4 | 0.50 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 2006 | 11.50 | 84 | 0 | 1 | 0.85 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 2006 | 11.50 | 84 | 0 | 1 | 0.85 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
The dependent variable may be discrete, either binomial or multinomial; that is, it is a limited dependent variable. In such cases, we need a different approach.
Discrete dependent variables are a special case of limited dependent variables. The Logit model we look at here is a discrete dependent variable model. Such models are also often called qualitative response (QR) models.
The Logit model posits that the probability of the outcome is a sigmoid transform of a linear function of the features:

$$ y = \frac{e^{f(x_1,x_2,...,x_n)}}{1+e^{f(x_1,x_2,...,x_n)}} \in (0,1) $$

where

$$ f(x_1,x_2,...,x_n) = a_0 + a_1 x_1 + ... + a_n x_n \in (-\infty,+\infty) $$

#Sigmoid Function
def logit(fx):
    return exp(fx)/(1+exp(fx))   # maps (-inf,+inf) into (0,1)
fx = linspace(-4,4,100)
y = logit(fx)
plot(fx,y)
xlabel('f(x)')
ylabel('Logit value')
grid()
#Dependent categorical variable
y = pd.get_dummies(sba.LoanStatus)
y.head()
| | CANCLD | CHGOFF | EXEMPT | PIF |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 |
#Prepare the X and y variables for chargeoffs vs paid in full
idx1 = list(where(y.CHGOFF==1)[0])   # row indices of charged-off loans
idx2 = list(where(y.PIF==1)[0])      # row indices of loans paid in full
idx = append(idx1, idx2)
print(len(idx))
X = X.iloc[idx]
X["Intercept"] = 1.0   # explicit intercept column (sklearn's LogisticRegression also fits its own intercept by default)
y = y.CHGOFF.iloc[idx]
#Save for later
y_SBA = y
X_SBA = X
223647
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression(max_iter=10000) # higher number of iterations needed if the convergence rate is slow
model = model.fit(X, y)
# check the accuracy on the training set
model.score(X, y)
0.8278000599158495
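For context, this in-sample accuracy should be compared with the base rate from always predicting the majority class. A minimal check, using only the variables already defined (no numbers are asserted here; run the cell to see the base rate):

#Base-rate benchmark: accuracy from always predicting the majority class
print(max(y.mean(), 1 - y.mean()))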
#Show the coefficients
pd.DataFrame({'X':X.columns, 'Coeff':model.coef_[0]})
| | X | Coeff |
|---|---|---|
| 0 | ApprovalFiscalYear | -0.000790 |
| 1 | InitialInterestRate | 0.334618 |
| 2 | TermInMonths | -0.042856 |
| 3 | RevolverStatus | -0.348371 |
| 4 | JobsSupported | -0.000127 |
| 5 | GuaranteePct | -0.142879 |
| 6 | 509 - DEALER FLOOR PLAN | -0.052420 |
| 7 | Community Advantage Initiative | -0.010575 |
| 8 | Community Express | 1.525305 |
| 9 | Contract Guaranty | -0.294455 |
| 10 | EXPORT IMPORT HARMONIZATION | -0.025977 |
| 11 | FA$TRK (Small Loan Express) | -0.067749 |
| 12 | Guaranty | 1.595878 |
| 13 | Gulf Opportunity | -1.240564 |
| 14 | International Trade - Sec, 7(a) (16) | -0.009282 |
| 15 | Lender Advantage Initiative | -0.476432 |
| 16 | Patriot Express | 1.513717 |
| 17 | Revolving Line of Credit Exports - Sec. 7(a) (14) | -1.638254 |
| 18 | Rural Lender Advantage | -0.091847 |
| 19 | Seasonal Line of Credit | -0.046767 |
| 20 | Small Asset Based | -0.035860 |
| 21 | Small General Contractors - Sec. 7(a) (9) | -0.128380 |
| 22 | Standard Asset Based | -0.488682 |
| 23 | CORPORATION | 0.178202 |
| 24 | INDIVIDUAL | 0.193262 |
| 25 | PARTNERSHIP | -0.342430 |
| 26 | Intercept | 0.027656 |
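In a logistic regression, each coefficient is the change in the log-odds of chargeoff per unit change in that feature, so exponentiating a coefficient gives an odds ratio. A minimal sketch (the OddsRatio column name is introduced here for illustration):

#Odds ratios: exp(coef) = multiplicative change in the odds per unit increase in the feature
pd.DataFrame({'X': X.columns, 'OddsRatio': exp(model.coef_[0])})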
As we will see below, we split our data sample into training and testing subsets. Training data is used to fit (train) a model, and the same model is then applied to the test data (out-of-sample) to make sure it performs well on data it has not already seen. The model is also applied back to the data on which it was trained. We therefore get two accuracy scores, one for the training data set and another for the test data set. One hopes that a model performs as accurately on the test data as it does on the training data.
When accuracy is low on the training data, the model "underfits" the data. Conversely, a model may show a very high level of accuracy on the training data. This is a good thing, unless it achieves a very low accuracy level on the test data, in which case we say that the model is "overfitted" to the data. An overfitted model is so specifically attuned to the training data that it is useless for data outside the training data set. This occurs when the model has so many free parameters that it almost "memorizes" the training data, which explains why it performs poorly on data it has not seen before. An analogy is students who memorize math homework problems without understanding the underlying concepts: when faced with a slightly different problem on the exam, they fail miserably.
We often break a data sample down into 3 types of data: training, validation, and testing data. Say we keep 20% of our data aside for testing; this is also known as "holdout" data. Of the remaining 80% of the data, we may randomly sample 75% of it and train the model so that it performs well on the remaining 25%. Then we randomly sample a different 75% and train to fit the remaining 25%, starting either from the current model or afresh. This is also called "rotation sampling". If we repeat this $n$ times to get the best model, we are said to undertake "$n$-fold cross-validation", and the results are averaged to assess fit. Once a model has been trained through this cross-validated process, it is taken to the test data to assess how well it performs, and a determination is made as to the extent of overfitting, if any.
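Since cross_val_score was already imported above, here is a minimal sketch of 5-fold cross-validation on the chargeoff data (cv_model is a fresh estimator introduced here; the fold scores will depend on the run):

#5-fold cross-validation: fit on 4 folds, score on the held-out fold, repeat 5 times
cv_model = LogisticRegression(max_iter=10000)
scores = cross_val_score(cv_model, X, y, cv=5, scoring='accuracy')
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())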
The figure below provides a visual depiction of under- and over-fitting.
nb_setup.images_vconcat(["DSTMAA_images/overfitting.png"], width=600)
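To make overfitting concrete, here is a tiny simulated illustration (not part of the SBA analysis, and using only the %pylab namespace): a degree-9 polynomial fit to 10 noisy points from a straight line matches the training points almost exactly, but oscillates in between, so it would predict poorly on new points.

#Overfitting demo: as many parameters (10) as data points
x = linspace(0, 1, 10)
ynoisy = 2*x + 0.1*randn(10)
p = polyfit(x, ynoisy, 9)        # degree-9 polynomial interpolates the sample
xgrid = linspace(0, 1, 100)
plot(x, ynoisy, 'o')
plot(xgrid, polyval(p, xgrid))
grid()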
# Evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression(max_iter=10000)
model2.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=10000, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
# Predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)
[0 0 0 ... 1 1 1]
# Generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)
[[0.93356543 0.06643457]
 [0.95553833 0.04446167]
 [0.83625796 0.16374204]
 ...
 [0.34094718 0.65905282]
 [0.47334767 0.52665233]
 [0.19629927 0.80370073]]
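In the binary case, predict is equivalent to applying a 0.5 cutoff to the class-1 probability. A quick sketch to confirm this (labels_05 is a name introduced here for illustration):

#predict() is equivalent to thresholding the class-1 probability at 0.5
labels_05 = (probs[:,1] > 0.5).astype(int)
print((labels_05 == predicted).mean())   # prints 1.0 if the two agree everywhere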
Accuracy: the fraction of class values predicted correctly.
TPR = sensitivity or recall = TP/(TP+FN)
FPR = (1 − specificity) = FP/(FP+TN)
# Confusion Matrix: sklearn's convention is confusion_matrix(y_true, y_pred),
# so rows are actual classes and columns are predicted classes
print(confusion_matrix(y_test, predicted))
[[43693  3495]
 [ 7996 11911]]
Precision = $\frac{TP}{TP+FP}$
Recall = $\frac{TP}{TP+FN}$
F1 score = $\frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$
(F1 is the harmonic mean of precision and recall.)
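As a sanity check, these metrics can be recomputed by hand; the TP/TN/FP/FN values below are read off the confusion matrix above (rows = actual, columns = predicted):

#Hand computation of the metrics from the confusion matrix
tn, fp = 43693, 3495
fn, tp = 7996, 11911
precision = tp/(tp+fp)
recall = tp/(tp+fn)              # also the TPR
fpr = fp/(fp+tn)                 # the FPR used in the ROC curve below
f1 = 2/(1/precision + 1/recall)
print(precision, recall, fpr, f1)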
print(classification_report(y_test, predicted))
              precision    recall  f1-score   support

           0       0.85      0.93      0.88     47188
           1       0.77      0.60      0.67     19907

    accuracy                           0.83     67095
   macro avg       0.81      0.76      0.78     67095
weighted avg       0.82      0.83      0.82     67095
https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
This is a useful classification metric that is not as widely used as it deserves to be. See: https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a
def MCC(tp,tn,fp,fn):
    return (tp*tn - fp*fn)/sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))
Let's take the confusion matrix from above and apply the numbers therein to compute MCC.
mcc = MCC(11911, 43693, 3495, 7996)   # tp, tn, fp, fn read off the matrix above
print("MCC =", mcc)
MCC = 0.5694125590112834
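sklearn also provides this metric directly via matthews_corrcoef, so we can cross-check the hand-rolled version:

from sklearn.metrics import matthews_corrcoef
print(matthews_corrcoef(y_test, predicted))   # should agree with the MCC computed above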
The Receiver-Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) for different levels of the cut-off posterior probability. This is an essential trade-off in all classification systems.
nb_setup.images_hconcat(["DSTMAA_images/roc_example.jpg"], width=600)
# generate evaluation metrics
print('Accuracy =', accuracy_score(y_test, predicted))
print('AUC =', roc_auc_score(y_test, probs[:, 1]))
#ROC, AUC
from sklearn.metrics import roc_curve, auc
y_score = model2.predict_proba(X_test)[:,1]   # use model2, which was fit on the training set only
fpr, tpr, _ = roc_curve(y_test, y_score)
title('ROC curve')
xlabel('FPR (1 - Specificity)')
ylabel('TPR (Recall)')
plot(fpr,tpr)
plot((0,1), ls='dashed',color='black')
plt.show()
print('Area under curve (AUC): ', auc(fpr,tpr))
Area under curve (AUC): 0.8634222585901785
A terrific article in Scientific American on ROC curves: Swets, Dawes, and Monahan (2000). See also Dawes (1979) on the use of "Improper Linear Models".
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
nb_setup.images_hconcat(["DSTMAA_images/all_metrics.png"], width=600)