Machine Learning: A Quick Introduction

Sanjiv R. Das

Jupyter Extensions

These extensions allow the notebooks to be used in myriad ways; see: https://blog.jupyter.org/99-ways-to-extend-the-jupyter-ecosystem-11e5dab7c54

https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12

5 Applications of ML in Finance

https://www.verypossible.com/blog/5-applications-of-machine-learning-in-finance

https://towardsdatascience.com/machine-learning-in-finance-why-what-how-d524a2357b56

Many Applications of ML in Finance

  1. Fraud prevention
  2. Risk management
  3. Investment predictions
  4. Customer service
  5. Digital assistants
  6. Marketing
  7. Network security
  8. Loan underwriting
  9. Algorithmic trading
  10. Process automation
  11. Document interpretation
  12. Content creation
  13. Trade settlements
  14. Money-laundering prevention
  15. Custom machine learning solutions

https://igniteoutsourcing.com/fintech/machine-learning-in-finance/

For projects, start by looking at Kaggle for finance datasets you may be able to use.

J.P. Morgan Guide to ML in Finance

https://news.efinancialcareers.com/uk-en/285249/machine-learning-and-big-data-j-p-morgan

"You won't need to be a machine learning expert, you will need to be an excellent quant and an excellent programmer

J.P. Morgan says the skillset for the role of data scientists is virtually the same as for any other quantitative researchers. Existing buy side and sell side quants with backgrounds in computer science, statistics, maths, financial engineering, econometrics and natural sciences should therefore be able to reinvent themselves. Expertise in quantitative trading strategies will be the crucial skill. "It is much easier for a quant researcher to change the format/size of a dataset, and employ better statistical and Machine Learning tools, than for an IT expert, silicon valley entrepreneur, or academic to learn how to design a viable trading strategy," say Kolanovic and Krishnamacharc."

ML Tasks in Finance

ML with NLP

scikit-learn: Python's one-stop shop for ML

https://scikit-learn.org/stable/

Supervised Learning Models

  1. Linear Models
  2. Logistic Regression
  3. Discriminant Analysis
  4. Bayes Classifier
  5. Support Vector Machines
  6. Nearest Neighbors (kNN)
  7. Decision Trees
  8. Neural Networks
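All of these supervised models are exposed through scikit-learn's common estimator interface (fit, predict, score), so swapping one model for another is largely a one-line change. A minimal sketch, using a synthetic dataset rather than any of the finance data discussed here:

```python
# Sketch of scikit-learn's common estimator interface on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

for model in [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=4),
              KNeighborsClassifier(n_neighbors=5)]:
    model.fit(X, y)                                   # same call for every estimator
    print(type(model).__name__, model.score(X, y))    # in-sample accuracy
```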

Unsupervised Learning Models

Clustering

Dimension Reduction

Ensemble Methods

  1. Bagging
  2. Stacking
  3. Boosting
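Each of these ensemble ideas is available directly in scikit-learn's ensemble module (stacking via StackingClassifier in the same module). A minimal sketch of bagging and boosting, again on a synthetic dataset for illustration only:

```python
# Bagging vs. boosting sketch using scikit-learn's ensemble module.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: average many trees, each fit on a bootstrap sample of the data.
bag = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: fit trees sequentially, each correcting the errors of the last.
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```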

Small Business Administration (SBA) Loans Dataset

Feature Engineering

Logistic Regression (Logit)

Limited Dependent Variables

The Logistic Function

$$ y = \frac{1}{1+e^{-f(x_1,x_2,...,x_n)}} \in (0,1) $$

where

$$ f(x_1,x_2,...,x_n) = a_0 + a_1 x_1 + ... + a_n x_n \in (-\infty,+\infty) $$
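A minimal sketch of this transformation in Python, with illustrative (hypothetical) coefficients $a_0, a_1, a_2$ and feature values, showing that any real-valued score $f$ is squashed into $(0,1)$:

```python
import numpy as np

def logistic(f):
    """Map a real-valued score f to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))

# Hypothetical coefficients and feature values, purely for illustration.
a0, a1, a2 = -1.0, 0.5, 2.0
x1, x2 = 3.0, -0.8
f = a0 + a1 * x1 + a2 * x2     # f can be any real number (here -1.1)
print(logistic(f))             # about 0.25, always strictly between 0 and 1
```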

Training, validation, and testing data: Underfitting, Overfitting, Cross-Validation

As we will see below, we split the data sample into training and testing subsets. The training data is used to fit a model, and that model is then applied to the test data (out-of-sample) to check that it performs well on data it has not already seen. The same model is also applied back to the data on which it was trained, so we obtain two accuracy scores: one on the training data set and another on the test data set. One hopes that the model is about as accurate on the test data as it is on the training data.

When accuracy is low on the training data, the model "underfits" the data. Conversely, the model may show a very high level of accuracy on the training data, which is a good thing unless it achieves a very low accuracy level on the test data. In that case, we say that the model is "overfitted" to the data: it is so specifically attuned to the training data that it is useless for data outside the training data set. This occurs when the model has too many free parameters, so that it almost "memorizes" the training data set, which explains why it performs poorly on data it has not seen before. An analogy is a student who memorizes math homework problems without understanding the underlying concepts; when faced with a slightly different problem on the exam, the student fails miserably.

We often break a data sample down into three subsets: training, validation, and testing data. Say we keep 20% of our data aside for testing; this is also known as "holdout" data. Of the remaining 80%, we may randomly sample 75% of it, train the model, and check that it performs well on the remaining 25%. Then we randomly sample a different 75% and train to fit the remaining 25%, starting from the current model or afresh. This is also called "rotation sampling". If we repeat this $n$ times to get the best model, we are said to undertake "$n$-fold cross-validation". The results are averaged to assess fit. Once a model has been trained through this cross-validated process, it is then taken to the test data to assess how well it performs, and a determination is made as to the extent of overfitting, if any.
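A minimal sketch of this workflow in scikit-learn, on a synthetic dataset (not the SBA data): hold out 20% of the data for testing, then run 4-fold cross-validation on the remaining 80%, so each fold trains on 75% and validates on 25%:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=15, random_state=1)

# Hold out 20% of the data for final (out-of-sample) testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = LogisticRegression(max_iter=1000)

# 4-fold cross-validation on the training data: each fold fits on 75%, validates on 25%.
cv_scores = cross_val_score(model, X_train, y_train, cv=4)
print("cross-validated accuracy:", cv_scores.mean())

# Refit on all training data and compare in-sample vs. out-of-sample accuracy.
model.fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test  accuracy:", model.score(X_test, y_test))
```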

The figure below provides a visual depiction of under- and over-fitting.

Metrics

  1. Accuracy: the fraction of correctly predicted class values, i.e., (TP+TN)/(TP+TN+FP+FN).

  2. TPR = sensitivity or recall = TP/(TP+FN)

  3. FPR = (1 − specificity) = FP/(FP+TN)
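A minimal sketch of how these quantities fall out of a confusion matrix; the labels below are hypothetical, purely to show the arithmetic:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)        # sensitivity / recall
fpr = fp / (fp + tn)        # 1 - specificity
print(accuracy, tpr, fpr)   # here: 0.8, 0.8, 0.2
```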

More Metrics

  1. Precision = $\frac{TP}{TP+FP}$

  2. Recall = $\frac{TP}{TP+FN}$

  3. F1 score = $\frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$

(F1 is the harmonic mean of precision and recall.)
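These three metrics are available directly in scikit-learn; a minimal sketch, using the same hypothetical labels as in the confusion-matrix example above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
print(p, r, f1)                       # here: 0.8, 0.8, 0.8
```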

The Matthews Correlation Coefficient

https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

This is a useful classification metric that is not as widely used as it deserves to be. See: https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a

Let's take the confusion matrix from above and apply the numbers therein to compute MCC.
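For reference, the MCC combines all four cells of the confusion matrix:

$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \in [-1,+1] $$

It is also available directly in scikit-learn. A minimal sketch, using the same hypothetical labels as in the earlier metric examples (not the confusion matrix referred to above):

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(matthews_corrcoef(y_true, y_pred))   # here: 0.6
```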

ROC and AUC

The Receiver-Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) for different levels of the cut-off posterior probability. This is an essential trade-off in all classification systems. The Area Under the Curve (AUC) summarizes the ROC curve in a single number: a perfect classifier has AUC = 1, while a random one has AUC = 0.5.
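A minimal sketch of computing the ROC curve and AUC from predicted probabilities, again on a synthetic dataset for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]          # posterior probability of class 1

fpr, tpr, thresholds = roc_curve(y_test, probs)    # TPR vs. FPR at each cut-off
print("AUC:", roc_auc_score(y_test, probs))
```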

All In One

A terrific article in Scientific American on ROC curves by Swets, Dawes, and Monahan (2000) (pdf); see also Dawes (1979) on the use of "Improper Linear Models".

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

ML Comic