REFERENCE BOOK: http://srdas.github.io/DLBook2
“You are my creator, but I am your master; Obey!”
― Mary Shelley, Frankenstein
!pip install ipypublish
%pylab inline
import pandas as pd
from IPython.external import mathjax
from ipypublish import nb_setup
Populating the interactive namespace from numpy and matplotlib
from google.colab import drive
drive.mount('/content/drive') # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/ML_Book/')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
nb_setup.images_hconcat(["DSTMAA_images/ML_AI.png"], width=600)
Interesting short book: https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12
The Universal Approximation Theorem: https://medium.com/analytics-vidhya/you-dont-understand-neural-networks-until-you-understand-the-universal-approximation-theorem-85b3e7677126
nb_setup.images_hconcat(["DSTMAA_images/DL_PatternRecognition.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/NN_diagram.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/NN_subset.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/Activation_functions.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/Softmax.png"], width=700)
#The Softmax function
#Assume 10 output nodes with randomly generated values
z = randn(32) #inputs from last hidden layer of 32 nodes to the output layer
w = rand(32*10).reshape((10,32)) #weights for the output layer
b = rand(10) #bias terms at output layer
a = w.dot(z) + b #Net input at output layer
e = exp(a)
softmax_output = e/sum(e)
print(softmax_output.round(3))
print('final tag =',where(softmax_output==softmax_output.max())[0][0])
[0.198 0.033 0.267 0.144 0.043 0.155 0.023 0.039 0.035 0.063] final tag = 2
nb_setup.images_hconcat(["DSTMAA_images/Loss_function.png"], width=600)
https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
Notation from the previous slides:
Log Loss is the special case of Cross Entropy for binary outcomes.
y = [0.33, 0.33, 0.34]
bits = log2(y)
print(bits)
entropy = -sum(y*bits)
print(entropy)
[-1.59946207 -1.59946207 -1.55639335] 1.58481870497303
y = [0.2, 0.3, 0.5]
entropy = -sum(y*log2(y))
print(entropy)
1.4854752972273344
y = [0.1, 0.1, 0.8]
print(log2(y))
entropy = -sum(y*log2(y))
print(entropy)
[-3.32192809 -3.32192809 -0.32192809] 0.9219280948873623
Cross-entropy:
$$ C = - \frac{1}{n} \sum_i [y_i \ln a_i] $$
where $a_i = {\hat y_i}$.
Note that $C \ge E$ always, with equality only when the predicted distribution equals the true one.
#Correct prediction
y = [0, 0, 1]
yhat = [0.1, 0.1, 0.8]
crossentropy = -sum(y*log2(yhat))
print(crossentropy)
0.3219280948873623
#Wrong prediction
yhat = [0.1, 0.6, 0.3]
crossentropy = -sum(y*log2(yhat))
print(crossentropy)
1.736965594166206
The Kullback-Leibler (KL) divergence measures the extra bits required when the wrong distribution is used.
#Correct prediction
y = [0, 0, 1.0]
yhat = [0.1, 0.1, 0.8]
KL = sum(y[2]*log2(y[2]/yhat[2]))
print(KL)
#Wrong prediction
yhat = [0.1, 0.6, 0.3]
KL = sum(y[2]*log2(y[2]/yhat[2]))
print(KL)
0.32192809488736235 1.736965594166206
nb_setup.images_hconcat(["DSTMAA_images/Gradient_descent.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/Chain_rule.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/Delta_values.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/Output_layer.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/Feedforward_Backprop.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/Recap.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/Backprop_one_slide.png"], width=600)
Given that $\frac{\partial D}{\partial a_j} = e^{a_j}$,
$$ \frac{\partial \mathcal L}{\partial a_j} = h_j \Big(\sum_i T_i\Big) - T_j = h_j - T_j $$
nb_setup.images_hconcat(["DSTMAA_images/batch_stochastic_gradient.png"], width=700)
1. Initialize all the weight and bias parameters $(w_{ij}^{(r)},b_i^{(r)})$ (this is a critical step).
2. For $q = 0,...,{M\over B}-1$, repeat the following steps (2a) - (2f):
a. For the training inputs $X_i(m), qB\le m\le (q+1)B$, compute the model predictions $y(m)$ given by
$$ a_i^{(r)}(m) = \sum_{j=1}^{P^{r-1}} w_{ij}^{(r)}z_j^{(r-1)}(m)+b_i^{(r)} \quad \mbox{and} \quad z_i^{(r)}(m) = f(a_i^{(r)}(m)), \quad 2 \leq r \leq R, 1 \leq i \leq P^r $$
and for $r=1$,
$$ a_i^{(1)}(m) = \sum_{j=1}^{N} w_{ij}^{(1)}x_j(m)+b_i^{(1)} \quad \mbox{and} \quad z_i^{(1)}(m) = f(a_i^{(1)}(m)), \quad 1 \leq i \leq P^1 $$
The logits and classification probabilities are computed using
$$ a_i^{(R+1)}(m) = \sum_{j=1}^{P^R} w_{ij}^{(R+1)}z_j^{(R)}(m)+b_i^{(R+1)} $$
and
$$ y_i(m) = \frac{\exp(a_i^{(R+1)}(m))}{\sum_{k=1}^K \exp(a_k^{(R+1)}(m))}, \quad 1 \leq i \leq K $$
This step constitutes the forward pass of the algorithm.
b. Evaluate the gradients $\delta_k^{(R+1)}(m)$ for the logit layer nodes using
$$ \delta_k^{(R+1)}(m) = y_k(m) - t_k(m),\ \ 1\le k\le K $$
This step and the following one constitute the start of the backward pass of the algorithm, in which we compute the gradients $\delta_k^{(r)}, 1 \leq k \leq P^r, 1\le r\le R$ for all the hidden nodes.
c. Back-propagate the $\delta$s using the following equation to obtain $\delta_j^{(r)}(m), 1 \leq r \leq R, 1 \leq j \leq P^r$ for each hidden node in the network:
$$ \delta_j^{(r)}(m) = f'(a_j^{(r)}(m)) \sum_k w_{kj}^{(r+1)} \delta_k^{(r+1)}(m), \quad 1 \leq r \leq R $$
d. Compute the gradients of the Cross Entropy Function $\mathcal L(m)$ for the $m$-th training vector $(X{(m)}, T{(m)})$ with respect to all the weight and bias parameters using
$$ \frac{\partial\mathcal L(m)}{\partial w_{ij}^{(r+1)}} = \delta_i^{(r+1)}(m) z_j^{(r)}(m) $$
and
$$ \frac{\partial \mathcal L(m)}{\partial b_i^{(r+1)}} = \delta_i^{(r+1)}(m), \quad 0 \leq r \leq R $$
e. Change the model weights according to
$$ w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \frac{\eta}{B}\sum_{m=qB}^{(q+1)B} \frac{\partial\mathcal L(m)}{\partial w_{ij}^{(r)}} $$
$$ b_i^{(r)} \leftarrow b_i^{(r)} - \frac{\eta}{B}\sum_{m=qB}^{(q+1)B} \frac{\partial{\mathcal L}(m)}{\partial b_i^{(r)}} $$
f. Increment $q\leftarrow (q+1)\mod {M\over B}$ and go back to step (a).
3. Compute the Loss Function $L$ over the Validation Dataset, given by
$$ L = -{1\over V}\sum_{m=1}^V\sum_{k=1}^K t_k{(m)} \log y_k{(m)} $$
If $L$ has dropped below some threshold, then stop. Otherwise go back to Step 2.
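The algorithm above maps directly onto code. The following is a minimal NumPy sketch (not the book's implementation) of one pass over the training set for a network with a single sigmoid hidden layer and a softmax output layer; the layer sizes, the learning rate, and the synthetic arrays `X` and `T` are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0/(1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))    # numerically stabilized softmax
    return e/e.sum(axis=1, keepdims=True)

# Illustrative sizes: M samples, N inputs, one hidden layer of P nodes, K classes, batch size B
M, N, P, K, B, eta = 600, 20, 32, 3, 50, 0.1
rng = np.random.default_rng(0)
X = rng.normal(size=(M, N))
T = np.eye(K)[rng.integers(0, K, size=M)]           # one-hot targets

# Step 1: initialize the weight and bias parameters
W1, b1 = rng.normal(scale=0.1, size=(N, P)), np.zeros(P)
W2, b2 = rng.normal(scale=0.1, size=(P, K)), np.zeros(K)

# Step 2: loop over the mini-batches q = 0,...,M/B - 1
for q in range(M // B):
    Xb, Tb = X[q*B:(q+1)*B], T[q*B:(q+1)*B]
    # (a) forward pass
    A1 = Xb @ W1 + b1               # hidden-layer pre-activations
    Z1 = sigmoid(A1)                # hidden-layer activations
    Y = softmax(Z1 @ W2 + b2)       # classification probabilities
    # (b) logit-layer deltas
    D2 = Y - Tb
    # (c) back-propagate through the hidden layer; for the sigmoid, f'(a) = f(a)(1 - f(a))
    D1 = (D2 @ W2.T) * Z1 * (1 - Z1)
    # (d)+(e) batch-averaged gradients and the weight/bias updates
    W2 -= eta * (Z1.T @ D2) / B
    b2 -= eta * D2.mean(axis=0)
    W1 -= eta * (Xb.T @ D1) / B
    b1 -= eta * D1.mean(axis=0)
```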
nb_setup.images_hconcat(["DSTMAA_images/Gradient_Descent_Scheme.png"], width=600)
def f(x):
    return 3*x**2 - 5*x + 10

x = linspace(-4,4,100)
plot(x,f(x))
grid()

dx = 0.001
eta = 0.05    #learning rate
x = -3
for j in range(20):
    df_dx = (f(x+dx)-f(x))/dx    #finite-difference approximation to the gradient
    x = x - eta*df_dx
    print(x,f(x))
-1.8501500000001698 29.519915067502733
-1.0452550000002532 18.503949045077853
-0.4818285000003115 13.105618610239208
-0.08742995000019249 10.460081738472072
0.18864903499989083 9.163520200219716
0.3819043244999847 8.528031116715445
0.5171830271499616 8.216519714966186
0.6118781190049631 8.06379390252634
0.6781646833034642 7.988898596522943
0.7245652783124417 7.95215813604575
0.7570456948186948 7.934126078037087
0.7797819863730542 7.925269906950447
0.7956973904611502 7.920916059254301
0.8068381733227685 7.918772647178622
0.8146367213259147 7.917715356568335
0.8200957049281836 7.917192371084045
0.8239169934497053 7.916932669037079
0.8265918954147882 7.9168030076222955
0.8284643267903373 7.916737788340814
0.8297750287532573 7.916704651261121
In large problems, gradients can vanish before training has reached an acceptable level of accuracy.
There are several issues with gradient descent that need to be handled, and several fixes that can be applied; these are described below.
Learning rate $\eta$ may be too large or too small.
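As a quick illustration, re-running the finite-difference descent loop from above on the same quadratic (a hypothetical experiment, with the learning rates chosen arbitrarily) shows all three regimes:

```python
def f(x):
    return 3*x**2 - 5*x + 10        # same convex function as above, minimum at x = 5/6

def descend(eta, steps=20, x=-3.0, dx=1e-3):
    for _ in range(steps):
        x = x - eta * (f(x + dx) - f(x)) / dx   # finite-difference gradient step
    return x

print(descend(eta=0.05))    # well-chosen rate: approaches 0.833
print(descend(eta=0.001))   # too small: after 20 steps x has barely moved from -3
print(descend(eta=0.40))    # too large: the iterates overshoot and diverge
```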
nb_setup.images_hconcat(["DSTMAA_images/LearningRate_Matters.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/GD_MultipleDimensions.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/GD_saddle.png"], width=400)
The idea is to start with a high learning rate and then adaptively reduce it as we get closer to the minimum of the loss function.
nb_setup.images_hconcat(["DSTMAA_images/LearningRateAnnealing.png"], width=500)
Adjust the learning rate as a step function when the reduction in the loss function begins to plateau.
nb_setup.images_hconcat(["DSTMAA_images/LearningRateAnnealing2.png"], width=500)
Improve the speed of convergence (for example the Momentum, Nesterov Momentum, and Adam algorithms).
Adapt the effective Learning Rate as the training progresses (for example the ADAGRAD, RMSPROP and Adam algorithms).
Momentum addresses the problem that arises when the gradient is multidimensional, with steep components along some axes and shallow components along others.
At the end of the $n^{th}$ iteration of the Backprop algorithm, define a sequence $v(n)$ by
$$ v(n) = \rho\; v(n-1) - \eta \; g(n), \quad \quad v(0)= -\eta g(0) $$
where $\rho$ is a new hyper-parameter called the "momentum" parameter, and $g(n)$ is the gradient evaluated at the parameter value $w(n)$.
$g(n)$ is defined by
$$ g(n) = \frac{\partial {\mathcal L(n)}}{\partial w} $$
for Stochastic Gradient Descent and
$$ g(n) = {1\over B}\sum_{m=nB}^{(n+1)B}\frac{\partial {\mathcal L(m)}}{\partial w} $$
for Batch Stochastic Gradient Descent (note that in this case $n$ indexes the mini-batch).
The change in parameter values on each iteration is now defined as
$$ w(n+1) = w(n) + v(n) $$
It can be shown from these equations that $v(n)$ can be written as
$$ v(n) = - \eta\sum_{i=0}^{n} \rho^{n-i} g(i) $$
so that
$$ w(n+1) = w(n) - \eta\sum_{i=0}^{n} \rho^{n-i} g(i) $$
When the momentum parameter $\rho = 0$, this equation reduces to the usual Stochastic Gradient Descent iteration. On the other hand, when $\rho > 0$, we get some interesting behaviors:
Note that
$$ \sum_{i=0}^{n} \rho^{n-i}g(i) \le {g_{max}\over 1-\rho} $$
$\rho$ is usually set in the neighborhood of $0.9$, and from the above equation it follows that $\sum_{i=0}^n \rho^{n-i}g(i)\approx 10g$, assuming all the $g(i)$ are approximately equal to $g$. Hence the effective gradient is ten times the value of the actual gradient. This results in an "overshoot", in which the parameter value shoots past the minimum point to the other side of the bowl and then reverses itself. This is desirable behavior, since the momentum prevents the algorithm from getting stuck at a saddle point or a local minimum by carrying it out of these areas.
This appears circular: to take the step we would ideally evaluate the gradient at the new point $w(n+1)$, which is not yet known. Nesterov Momentum uses the lookahead approximation
$$ w(n+1)\approx w(n) + \rho v(n-1) $$
and evaluates the gradient at this point instead.
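A minimal sketch of the Momentum update, and of the Nesterov lookahead variant, applied to the same one-dimensional quadratic used in the gradient descent example above (the hyper-parameter values are illustrative):

```python
def grad(w):
    return 6*w - 5          # gradient of f(w) = 3w^2 - 5w + 10, minimum at w = 5/6

eta, rho, steps = 0.05, 0.9, 200

# Plain Momentum: v(n) = rho*v(n-1) - eta*g(n),  w(n+1) = w(n) + v(n)
w, v = -3.0, 0.0
for _ in range(steps):
    v = rho*v - eta*grad(w)
    w = w + v
print("momentum:", w)       # approaches 5/6, overshooting and reversing along the way

# Nesterov Momentum: evaluate the gradient at the lookahead point w + rho*v
w, v = -3.0, 0.0
for _ in range(steps):
    v = rho*v - eta*grad(w + rho*v)
    w = w + v
print("nesterov:", w)
```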
Parameter update rule:
$$ w(n+1) = w(n) - \frac{\eta}{\sqrt{\sum_{i=1}^n g(i)^2+\epsilon}}\; g(n) $$
The constant $\epsilon$ has been added to better condition the denominator and is usually set to a small number such as $10^{-7}$.
Each parameter gets its own adaptive Learning Rate, such that large gradients have smaller learning rates and small gradients have larger learning rates ($\eta$ is usually defaulted to $0.01$). As a result the progress along each dimension evens out over time, which helps the training process.
The change in rates happens automatically as part of the parameter update equation.
Downside: the accumulation of squared gradients in the denominator continuously decreases the Learning Rates, which can eventually halt training in large networks that require many iterations.
Note that
$$ E[g^2]_n = (1-\rho)\sum_{i=0}^n \rho^{n-i} g(i)^2 \le g_{max}^2 $$
which shows that the parameter $\rho$ prevents the sum from blowing up, and a large value of $\rho$ is equivalent to using a larger window of previous gradients in computing the sum.
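A minimal NumPy sketch of the ADAGRAD and RMSPROP per-parameter updates on a toy two-dimensional quadratic whose gradient is steep along one axis and shallow along the other (the toy function and the constants $\eta$, $\rho$, $\epsilon$ are illustrative assumptions):

```python
import numpy as np

def grad(w):
    # gradient of f(w) = 5*w[0]**2 + 0.05*w[1]**2: steep in w[0], shallow in w[1]
    return np.array([10.0*w[0], 0.1*w[1]])

eta, rho, eps, steps = 0.01, 0.9, 1e-7, 200

# ADAGRAD: accumulate the sum of squared gradients; each step is divided by its square root,
# so progress along the steep and shallow directions evens out over time
w, G = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(steps):
    g = grad(w)
    G += g**2
    w -= eta * g / np.sqrt(G + eps)
print("adagrad:", w)

# RMSPROP: replace the running sum with an exponentially weighted moving average E[g^2]
w, Eg2 = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(steps):
    g = grad(w)
    Eg2 = rho*Eg2 + (1 - rho)*g**2
    w -= eta * g / np.sqrt(Eg2 + eps)
print("rmsprop:", w)
```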
nb_setup.images_hconcat(["DSTMAA_images/sigmoid_activation.png"], width=700)
nb_setup.images_hconcat(["DSTMAA_images/tanh_activation.png"], width=500)
Unless the input is in the neighborhood of zero, the function enters its saturated regime.
It is superior to the sigmoid in one respect: its output is zero-centered, which speeds up the training process.
The $\tanh$ function is rarely used in modern DLNs, the exception being a type of DLN called the LSTM.
nb_setup.images_hconcat(["DSTMAA_images/relu_activation.png"], width=400)
No saturation problem.
Gradients $\frac{\partial L}{\partial w}$ propagate undiminished through the network, provided all the nodes are active.
nb_setup.images_hconcat(["DSTMAA_images/dead_relu.png"], width=400)
The dotted line in this figure shows a case in which the weight parameters $w_i$ are such that the hyperplane $\sum w_i z_i = 0$ does not intersect the "data cloud" of possible input activations. No possible input values can lead to $\sum w_i z_i > 0$, so the neuron's output activation will always be zero, and it will kill all gradients backpropagating down from higher layers.
Vary initialization to correct this.
nb_setup.images_hconcat(["DSTMAA_images/leaky_relu.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/prelu.png"], width=500)
Note that each neuron $i$ now has its own parameter $\beta_i, 1\le i\le S$, where $S$ is the number of nodes in the network. These parameters are iteratively estimated using Backprop.
$$ \frac{\partial\mathcal L}{\partial\beta_i} = \frac{\partial\mathcal L}{\partial z_i}\frac{\partial z_i}{\partial\beta_i},\ \ 1\le i\le S $$
Substituting the value for $\frac{\partial z_i}{\partial\beta_i}$ we obtain
$$ \frac{\partial\mathcal L}{\partial\beta_i} = a_i\frac{\partial\mathcal L}{\partial z_i}\ \ \mbox{if}\ a_i \le 0,\ \ \mbox{and} \ \ 0 \ \ \mbox{otherwise} $$
which is then used to update $\beta_i$ using $\beta_i\rightarrow\beta_i - \eta\frac{\partial\mathcal L}{\partial\beta_i}$.
Once training is complete, the PreLU based DLN network ends up with a different value of $\beta_i$ at each neuron, which increases the flexibility of the network at the cost of an increase in the number of parameters.
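A minimal NumPy sketch of the PReLU forward pass and of the two gradients used by Backprop, following the equations above (the array shapes and the upstream gradient `dL_dz` are illustrative):

```python
import numpy as np

def prelu_forward(a, beta):
    # z = a if a > 0, else beta * a  (one beta per neuron, broadcast across the batch)
    return np.where(a > 0, a, beta * a)

def prelu_backward(a, beta, dL_dz):
    dL_da = dL_dz * np.where(a > 0, 1.0, beta)                   # gradient w.r.t. the pre-activation a
    dL_dbeta = np.sum(dL_dz * np.where(a <= 0, a, 0.0), axis=0)  # dz/dbeta = a when a <= 0, else 0
    return dL_da, dL_dbeta

a = np.array([[-1.0, 2.0], [0.5, -3.0]])   # pre-activations: 2 samples, 2 neurons
beta = np.array([0.1, 0.2])                # one learnable beta per neuron
dL_dz = np.ones_like(a)                    # upstream gradients (illustrative)
z = prelu_forward(a, beta)
dL_da, dL_dbeta = prelu_backward(a, beta, dL_dz)
print(z)
print(dL_dbeta)                            # used in the update beta <- beta - eta * dL_dbeta
```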
nb_setup.images_hconcat(["DSTMAA_images/maxout.png"], width=600)
Generalizes Leaky ReLU.
$$ z'_i = \max\left(c\big[\sum_j w_{ij}z_j +b_i\big],\ \sum_j w_{ij}z_j +b_i\right) $$
We may allow the two hyperplanes to be independent with their own set of parameters, as shown in the Figure above.
In practice, the DLN weight parameters are initialized with random values drawn from Gaussian or Uniform distributions and the following rules are used:
Gaussian Initialization: If the weight is between layers with $n_{in}$ input neurons and $n_{out}$ output neurons, then it is initialized using a Gaussian random distribution with mean zero and standard deviation $\sqrt{2\over n_{in}+n_{out}}$.
Uniform Initialization: In the same configuration as above, the weights should be initialized using a Uniform distribution between $-r$ and $r$, where $r = \sqrt{6\over n_{in}+n_{out}}$.
When using the ReLU or its variants, these rules have to be modified slightly:
Gaussian Initialization: If the weight is between layers with $n_{in}$ input neurons and $n_{out}$ output neurons, then it is initialized using a Gaussian random distribution with mean zero and standard deviation $\sqrt{4\over n_{in}+n_{out}}$.
Uniform Initialization: In the same configuration as above, the weights should be initialized using a Uniform distribution between $-r$ and $r$, where $r = \sqrt{12\over n_{in}+n_{out}}$.
The reasoning behind scaling down the initialization values as the number of incident weights increases is to prevent saturation of the node activations during the forward pass of the Backprop algorithm, as well as large values of the gradients during backward pass.
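A minimal NumPy sketch of these initialization rules; `n_in` and `n_out` denote the fan-in and fan-out of a layer, and the weight-matrix shape follows the $w_{ij}$ convention used above (rows index the layer's own nodes):

```python
import numpy as np
rng = np.random.default_rng(0)

def gaussian_init(n_in, n_out, relu=False):
    # std = sqrt(2/(n_in+n_out)) for sigmoid/tanh layers, sqrt(4/(n_in+n_out)) for ReLU layers
    c = 4.0 if relu else 2.0
    return rng.normal(0.0, np.sqrt(c/(n_in + n_out)), size=(n_out, n_in))

def uniform_init(n_in, n_out, relu=False):
    # r = sqrt(6/(n_in+n_out)) for sigmoid/tanh layers, sqrt(12/(n_in+n_out)) for ReLU layers
    r = np.sqrt((12.0 if relu else 6.0)/(n_in + n_out))
    return rng.uniform(-r, r, size=(n_out, n_in))

W = gaussian_init(784, 100)     # e.g. the first layer of the MNIST network used later
print(W.std())                  # close to sqrt(2/884), roughly 0.048
```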
nb_setup.images_hconcat(["DSTMAA_images/data_preprocessing.png"], width=600)
Centering: This is also sometimes called Mean Subtraction, and is the most common form of preprocessing. Given an input dataset consisting of $M$ vectors $X(m) = (x_1(m),...,x_N(m)), m = 1,...,M$, it consists of subtracting the mean across each individual input component $x_i, 1\leq i\leq N$ such that $$ x_i(m) \leftarrow x_i(m) - \frac{\sum_{s=1}^{M}x_i(s)}{M},\ \ 1\leq i\leq N, 1\le m\le M $$
Scaling: After the data has been centered, it can be scaled in one of two ways:
By Normalizing each dimension so that the min and max along each axis are -1 and +1 respectively.
In general Scaling helps optimization because it balances out the rate at which the weights connected to the input nodes learn.
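A minimal sketch of centering and scaling an input matrix `X` of shape (M, N). The min/max normalization to [-1, +1] follows the bullet above; dividing each centered dimension by its standard deviation is shown as an assumed alternative scaling option:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(5, 10, size=(100, 3))     # illustrative raw inputs, all positive

# Centering (mean subtraction) along each input dimension
Xc = X - X.mean(axis=0)

# Scaling option 1: normalize each dimension so its min and max are -1 and +1
Xn = 2*(X - X.min(axis=0))/(X.max(axis=0) - X.min(axis=0)) - 1

# Scaling option 2 (assumed): divide each centered dimension by its standard deviation
Xs = Xc / Xc.std(axis=0)

print(Xn.min(axis=0), Xn.max(axis=0))
print(Xs.mean(axis=0).round(3), Xs.std(axis=0).round(3))
```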
Recall that for a K-ary Linear Classifier, the parameter update equation is given by:
$$ w_{kj} \leftarrow w_{kj} - \eta x_j(y_k-t_k),\ \ 1\le k\le K,\ \ 1\le j\le N $$
If the training sample is such that $t_q = 1$ and $t_k = 0, k\ne q$, then the update becomes:
$$ w_{qj} \leftarrow w_{qj} - \eta x_j(y_q-1) $$
and
$$ w_{kj} \leftarrow w_{kj} - \eta x_j y_k,\ \ k\ne q $$
Let's assume that the input data is not centered, so that $x_j\ge 0, j=1,...,N$. Since $0\le y_k\le 1$ it follows that
$$ \Delta w_{kj} = -\eta x_jy_k <0, \quad k\ne q $$
and
$$ \Delta w_{qj} = -\eta x_j(y_q - 1) > 0 $$
i.e., the update results in all the weights moving in the same direction, except for one. This is shown graphically in the Figure above, in which the system is trying to move in the direction of the blue arrow, which is the quickest path to the minimum. However, if the input data is not centered, it is forced to move in a zig-zag fashion, as shown by the red curve. The zig-zag motion arises because all the parameters move in the same direction at each step, due to the lack of zero-centering in the input data.
nb_setup.images_hconcat(["DSTMAA_images/zero_centering_helps.png"], width=400)
Normalization applied to the hidden layers.
nb_setup.images_hconcat(["DSTMAA_images/batch_normalization.png"], width=600)
Higher learning rates: In a non-normalized network, a large learning rate can lead to oscillations and cause the loss function to increase rather than decrease.
Better Gradient Propagation through the network, enabling DLNs with more hidden layers.
Reduces strong dependencies on the parameter initialization values.
Helps to regularize the model.
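A minimal NumPy sketch of the Batch Normalization forward pass during training; the learnable scale `gamma` and shift `beta` and the small constant `eps` follow the standard formulation, and the mini-batch `A` is illustrative:

```python
import numpy as np

def batchnorm_forward(A, gamma, beta, eps=1e-5):
    # A: pre-activations of one hidden layer for a mini-batch, shape (B, P)
    mu = A.mean(axis=0)                     # per-node batch mean
    var = A.var(axis=0)                     # per-node batch variance
    A_hat = (A - mu) / np.sqrt(var + eps)   # normalize to zero mean and unit variance
    return gamma * A_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
A = rng.normal(3.0, 5.0, size=(32, 100))    # a mini-batch of 32 vectors, layer width 100
out = batchnorm_forward(A, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0)[:3].round(3), out.std(axis=0)[:3].round(3))
```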
## Under and Over-fitting
nb_setup.images_hconcat(["DSTMAA_images/underoverfitting.png"], width=600)
Early Stopping
L1 Regularization
L2 Regularization
Dropout Regularization
Training Data Augmentation
Batch Normalization
nb_setup.images_hconcat(["DSTMAA_images/early_stopping.png"], width=600)
L2 Regularization is a commonly used technique in ML systems and is also sometimes referred to as "Weight Decay". It works by adding a quadratic term to the Cross Entropy Loss Function $\mathcal L$, called the Regularization Term, which results in a new Loss Function $\mathcal L_R$ given by:
\begin{equation} \mathcal L_R = {\mathcal L} + \frac{\lambda}{2} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} (w_{ij}^{(r)})^2 \end{equation}
L2 Regularization also leads to more "diffuse" weight parameters; in other words, it encourages the network to use all of its inputs a little rather than some of its inputs a lot.
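To see why this is called Weight Decay, take the gradient of $\mathcal L_R$ with respect to a single weight (a standard one-line derivation, using the notation above): the penalty term shrinks every weight by a multiplicative factor on each update,
$$ \frac{\partial \mathcal L_R}{\partial w_{ij}^{(r)}} = \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} + \lambda w_{ij}^{(r)} \quad\Longrightarrow\quad w_{ij}^{(r)} \leftarrow (1-\eta\lambda)\,w_{ij}^{(r)} - \eta\,\frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} $$
which is the multiplicative, weight-proportional reduction discussed below.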
L1 Regularization uses a Regularization Function which is the sum of the absolute value of all the weights in DLN, resulting in the following loss function ($\mathcal L$ is the usual Cross Entropy loss):
$$ \mathcal L_R = \mathcal L + {\lambda} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} |w_{ij}^{(r)}| $$
At a high level L1 Regularization is similar to L2 Regularization, since it also leads to smaller weights.
Both L1 and L2 Regularizations lead to a reduction in the weights with each iteration. However the way the weights drop is different:
In L2 Regularization the weight reduction is multiplicative and proportional to the value of the weight, so it is faster for large weights and decelerates as the weights get smaller.
In L1 Regularization on the other hand, the weights are reduced by a fixed amount in every iteration, irrespective of the value of the weight. Hence for larger weights L2 Regularization is faster than L1, while for smaller weights the reverse is true.
As a result, L1 Regularization leads to DLNs in which the weights of most of the connections tend towards zero, with a few larger weights left over. The DLN that results after the application of L1 Regularization is said to be "sparse".
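In Keras (the library used in the examples below), both penalties can be attached to a layer through `keras.regularizers`; a minimal sketch, mirroring the breast-cancer network further down, with the penalty strengths chosen arbitrarily:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers

model = Sequential()
# L2 (Weight Decay) penalty on the first hidden layer's weights
model.add(Dense(32, activation='relu', input_dim=9,
                kernel_regularizer=regularizers.l2(0.01)))
# L1 penalty on the second hidden layer's weights -- encourages sparse weights
model.add(Dense(32, activation='relu',
                kernel_regularizer=regularizers.l1(0.01)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
```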
nb_setup.images_hconcat(["DSTMAA_images/dropout.png"], width=600)
The basic idea behind Dropout is to run each iteration of the Backprop algorithm on randomly modified versions of the original DLN. The random modifications are carried out to the topology of the DLN using the following rule: on each iteration, every hidden node (together with its incoming and outgoing connections) is retained with some probability $p$ and dropped otherwise.
After the Backprop is complete, we have effectively trained a collection of up to $2^s$ thinned DLNs all of which share the same weights, where $s$ is the total number of hidden nodes in the DLN.
In order to test the network, strictly speaking we should average the results from all of these thinned models; however, a simple approximate averaging method works quite well.
The main idea is to use the complete DLN as the test network.
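A minimal NumPy sketch of the "inverted" dropout variant for one hidden layer during training; scaling the surviving activations by $1/p$ is what allows the complete DLN to be used unchanged as the test network, and the keep probability `p` and the activations `Z` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(Z, p=0.5, train=True):
    if not train:
        return Z                            # test time: the complete network is used as-is
    mask = (rng.random(Z.shape) < p) / p    # keep each node with probability p, scale survivors by 1/p
    return Z * mask

Z = np.ones((4, 8))                         # illustrative hidden-layer activations
print(dropout_forward(Z))                   # roughly half the entries zeroed, the rest doubled
```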
nb_setup.images_hconcat(["DSTMAA_images/bagging.png"], width=500)
nb_setup.images_hconcat(["DSTMAA_images/TensorFlow_playground.png"], width=600)
import pandas as pd
## Read in the data set
data = pd.read_csv("DSTMAA_data/BreastCancer.csv")
data.head()
| | Id | Cl.thickness | Cell.size | Cell.shape | Marg.adhesion | Epith.c.size | Bare.nuclei | Bl.cromatin | Normal.nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | benign |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | benign |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | benign |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | benign |
x = data.loc[:,'Cl.thickness':'Mitoses']
print(x.head())
y = data.loc[:,'Class']
print(y.head())
Cl.thickness Cell.size Cell.shape ... Bl.cromatin Normal.nucleoli Mitoses 0 5 1 1 ... 3 1 1 1 5 4 4 ... 3 2 1 2 3 1 1 ... 3 1 1 3 6 8 8 ... 3 7 1 4 4 1 1 ... 3 1 1 [5 rows x 9 columns] 0 benign 1 benign 2 benign 3 benign 4 benign Name: Class, dtype: object
## Convert the class variable into binary numeric
ynum = zeros((len(x),1))
for j in arange(len(y)):
    if y[j]=="malignant":
        ynum[j]=1
ynum[:10]
array([[0.], [0.], [0.], [0.], [0.], [1.], [0.], [0.], [0.], [0.]])
## Make label data one-hot (1 = malignant)
from keras import utils
y.labels = utils.to_categorical(ynum, num_classes=2)
#x = x.as_matrix()
print(y.labels[:10])
print(shape(x))
print(shape(y.labels))
print(shape(ynum))
Using TensorFlow backend.
[[1. 0.] [1. 0.] [1. 0.] [1. 0.] [1. 0.] [0. 1.] [1. 0.] [1. 0.] [1. 0.] [1. 0.]] (683, 9) (683, 2) (683, 1)
## Define the neural net and compile it
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=9))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
## Fit/train the model (x,y need to be matrices)
model.fit(x, ynum, epochs=25, batch_size=32, verbose=2, validation_split=0.3)
Train on 478 samples, validate on 205 samples Epoch 1/25 - 2s - loss: 0.5969 - accuracy: 0.5439 - val_loss: 0.5223 - val_accuracy: 0.9073 Epoch 2/25 - 0s - loss: 0.4930 - accuracy: 0.8473 - val_loss: 0.4246 - val_accuracy: 0.9610 Epoch 3/25 - 0s - loss: 0.4308 - accuracy: 0.8912 - val_loss: 0.3669 - val_accuracy: 0.9610 Epoch 4/25 - 0s - loss: 0.3815 - accuracy: 0.9121 - val_loss: 0.3117 - val_accuracy: 0.9659 Epoch 5/25 - 0s - loss: 0.3443 - accuracy: 0.9121 - val_loss: 0.2514 - val_accuracy: 0.9854 Epoch 6/25 - 0s - loss: 0.3159 - accuracy: 0.9121 - val_loss: 0.2656 - val_accuracy: 0.9561 Epoch 7/25 - 0s - loss: 0.2954 - accuracy: 0.9121 - val_loss: 0.2374 - val_accuracy: 0.9610 Epoch 8/25 - 0s - loss: 0.2755 - accuracy: 0.9310 - val_loss: 0.1803 - val_accuracy: 0.9951 Epoch 9/25 - 0s - loss: 0.2575 - accuracy: 0.9226 - val_loss: 0.1933 - val_accuracy: 0.9854 Epoch 10/25 - 0s - loss: 0.2369 - accuracy: 0.9351 - val_loss: 0.1463 - val_accuracy: 0.9902 Epoch 11/25 - 0s - loss: 0.2223 - accuracy: 0.9414 - val_loss: 0.1606 - val_accuracy: 0.9902 Epoch 12/25 - 0s - loss: 0.2072 - accuracy: 0.9414 - val_loss: 0.1613 - val_accuracy: 0.9854 Epoch 13/25 - 0s - loss: 0.2144 - accuracy: 0.9519 - val_loss: 0.1073 - val_accuracy: 0.9951 Epoch 14/25 - 0s - loss: 0.1904 - accuracy: 0.9519 - val_loss: 0.1032 - val_accuracy: 0.9951 Epoch 15/25 - 0s - loss: 0.1771 - accuracy: 0.9540 - val_loss: 0.1048 - val_accuracy: 0.9951 Epoch 16/25 - 0s - loss: 0.1711 - accuracy: 0.9519 - val_loss: 0.0820 - val_accuracy: 0.9951 Epoch 17/25 - 0s - loss: 0.1614 - accuracy: 0.9603 - val_loss: 0.0964 - val_accuracy: 0.9902 Epoch 18/25 - 0s - loss: 0.1603 - accuracy: 0.9477 - val_loss: 0.0855 - val_accuracy: 0.9902 Epoch 19/25 - 0s - loss: 0.1542 - accuracy: 0.9582 - val_loss: 0.0762 - val_accuracy: 0.9951 Epoch 20/25 - 0s - loss: 0.1462 - accuracy: 0.9603 - val_loss: 0.0636 - val_accuracy: 0.9902 Epoch 21/25 - 0s - loss: 0.1321 - accuracy: 0.9707 - val_loss: 0.0679 - val_accuracy: 0.9902 Epoch 22/25 - 0s - loss: 0.1379 - accuracy: 0.9623 - val_loss: 0.0612 - val_accuracy: 0.9902 Epoch 23/25 - 0s - loss: 0.1258 - accuracy: 0.9644 - val_loss: 0.0631 - val_accuracy: 0.9902 Epoch 24/25 - 0s - loss: 0.1250 - accuracy: 0.9665 - val_loss: 0.0523 - val_accuracy: 0.9902 Epoch 25/25 - 0s - loss: 0.1210 - accuracy: 0.9644 - val_loss: 0.0564 - val_accuracy: 0.9902
<keras.callbacks.callbacks.History at 0x7f24f71ea7f0>
## Accuracy
yhat = model.predict_classes(x, batch_size=32)
acc = sum(yhat==ynum)
print("Accuracy = ",acc/len(ynum))
## Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(yhat,ynum)
Accuracy = 0.9765739385065886
array([[431, 3], [ 13, 236]])
nb_setup.images_hconcat(["DSTMAA_images/MNIST.png"], width=800)
## Read in the data set
train = pd.read_csv("DSTMAA_data/train.csv", header=None)
test = pd.read_csv("DSTMAA_data/test.csv", header=None)
print(shape(train))
print(shape(test))
(60000, 785) (10000, 785)
train.shape
(60000, 785)
## Reformat the data
X_train = train.loc[:,:783]
Y_train = train.loc[:,784]
print(shape(X_train))
print(shape(Y_train))
X_test = test.loc[:,:783]
Y_test = test.loc[:,784]
print(shape(X_test))
print(shape(Y_test))
y.labels = utils.to_categorical(Y_train, num_classes=10)
print(shape(y.labels))
print(y.labels[1:5,:])
print(Y_train[1:5])
(60000, 784) (60000,) (10000, 784) (10000,) (60000, 10) [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]] 1 3 2 0 3 0 4 2 Name: 784, dtype: int64
## Define the neural net and compile it
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from tensorflow.keras.utils import plot_model
data_dim = shape(X_train)[1]
model = Sequential([
    Dense(100, input_shape=(784,)),
    Activation('sigmoid'),
    Dense(100),
    Activation('sigmoid'),
    Dense(100),
    Activation('sigmoid'),
    Dense(100),
    Activation('sigmoid'),
    Dense(10),
    Activation('softmax'),
])
#model = Sequential()
#model.add(Dense(100, activation='sigmoid', input_dim=data_dim))
#model.add(Dropout(0.25))
#model.add(Dense(100, activation='sigmoid'))
#model.add(Dropout(0.25))
#model.add(Dense(100, activation='sigmoid'))
#model.add(Dropout(0.25))
#model.add(Dense(100, activation='sigmoid'))
#model.add(Dropout(0.25))
#model.add(Dense(10, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential_5" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_20 (Dense) (None, 100) 78500 _________________________________________________________________ activation_16 (Activation) (None, 100) 0 _________________________________________________________________ dense_21 (Dense) (None, 100) 10100 _________________________________________________________________ activation_17 (Activation) (None, 100) 0 _________________________________________________________________ dense_22 (Dense) (None, 100) 10100 _________________________________________________________________ activation_18 (Activation) (None, 100) 0 _________________________________________________________________ dense_23 (Dense) (None, 100) 10100 _________________________________________________________________ activation_19 (Activation) (None, 100) 0 _________________________________________________________________ dense_24 (Dense) (None, 10) 1010 _________________________________________________________________ activation_20 (Activation) (None, 10) 0 ================================================================= Total params: 109,810 Trainable params: 109,810 Non-trainable params: 0 _________________________________________________________________
plot_model(model)
## Fit/train the model (x,y need to be matrices)
model.fit(X_train, y.labels, epochs=10, batch_size=32, verbose=2, validation_split=0.2)
Train on 48000 samples, validate on 12000 samples Epoch 1/10 - 6s - loss: 0.2277 - accuracy: 0.9320 - val_loss: 0.2233 - val_accuracy: 0.9314 Epoch 2/10 - 5s - loss: 0.2188 - accuracy: 0.9334 - val_loss: 0.2216 - val_accuracy: 0.9336 Epoch 3/10 - 6s - loss: 0.2135 - accuracy: 0.9353 - val_loss: 0.2231 - val_accuracy: 0.9333 Epoch 4/10 - 6s - loss: 0.2106 - accuracy: 0.9360 - val_loss: 0.2071 - val_accuracy: 0.9379 Epoch 5/10 - 5s - loss: 0.2034 - accuracy: 0.9382 - val_loss: 0.2023 - val_accuracy: 0.9375 Epoch 6/10 - 5s - loss: 0.1954 - accuracy: 0.9406 - val_loss: 0.2177 - val_accuracy: 0.9348 Epoch 7/10 - 5s - loss: 0.1915 - accuracy: 0.9423 - val_loss: 0.1890 - val_accuracy: 0.9423 Epoch 8/10 - 6s - loss: 0.1856 - accuracy: 0.9434 - val_loss: 0.1951 - val_accuracy: 0.9427 Epoch 9/10 - 5s - loss: 0.1794 - accuracy: 0.9460 - val_loss: 0.1944 - val_accuracy: 0.9414 Epoch 10/10 - 6s - loss: 0.1786 - accuracy: 0.9464 - val_loss: 0.1918 - val_accuracy: 0.9427
<keras.callbacks.callbacks.History at 0x7f23dc93af98>
## In Sample
yhat = model.predict_classes(X_train, batch_size=32)
## Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(yhat,Y_train)
print(" ")
print(cm)
##
acc = sum(diag(cm))/len(Y_train)
print("Accuracy = ",acc)
[[5758    1   23    4   13   20   28   16   18   25]
 [   1 6608   35   25   18    9   10   30  118   17]
 [  24   33 5660  103   40   28   24   41   63   11]
 [  10   24   35 5656    0  130    2   12   93   50]
 [   4    9   27    2 5557   11   12   35   31  253]
 [  28   15   17  143    2 5017   38    6   70   30]
 [  61    4   54   15   81   86 5784    1   55    3]
 [   1   16   51   72    7    7    0 6016   12   63]
 [  27   23   54   70    8   76   20    9 5301   40]
 [   9    9    2   41  116   37    0   99   90 5457]]
Accuracy =  0.9469
## Out of Sample
yhat = model.predict_classes(X_test, batch_size=32)
## Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(yhat,Y_test)
print(" ")
print(cm)
##
acc = sum(diag(cm))/len(Y_test)
print("Accuracy = ",acc)
[[ 960    0    9    0    1    7    9    1    3    4]
 [   0 1119    3    1    1    4    3   10   11    8]
 [   1    3  968   19    5    1    3   18    5    1]
 [   2    4   14  941    0   31    2    3   13    7]
 [   0    1    3    0  940    3    1    1    7   44]
 [   2    0    4   21    0  811    6    0   16    8]
 [  13    3   10    1   17   14  928    0    8    0]
 [   1    2    9   10    2    3    0  978    7    6]
 [   1    2   11   13    2   12    5    1  894   10]
 [   0    1    1    4   14    6    1   16   10  921]]
Accuracy =  0.946
Slides on Image Processing using Deep Learning by Subir Varma: https://drive.google.com/file/d/19xPCf2M66Dws06XxXLgEhuqJFiKmUEJ1/view?usp=sharing
Image processing with transfer learning: https://drive.google.com/file/d/1D3Cg288wVY-e5BHuDNHec3wd-LStflMr/view?usp=sharing. From: Practical Deep Learning for Cloud and Mobile (O'Reilly) by Anirudh Koul, Siddha Ganju & Meher Kasam.
Recognizing Images using NNs: https://drive.google.com/file/d/1OMQOZuEmnw0Kxvo5O1C9kf6mbQnPXYvj/view?usp=sharing. (Build your first Convolutional Neural Network to recognize images: A step-by-step guide to building your own image recognition software with Convolutional Neural Networks using Keras on CIFAR-10 images! by Joseph Lee Wei En.): https://medium.com/intuitive-deep-learning/build-your-first-convolutional-neural-network-to-recognize-images-84b9c78fe0ce
See: Hutchinson, Lo, and Poggio (1994).
import math
from scipy.stats import norm

def BSM(S,K,T,sig,rf,dv,cp):    #cp = {+1.0 (calls), -1.0 (puts)}
    d1 = (math.log(S/K)+(rf-dv+0.5*sig**2)*T)/(sig*math.sqrt(T))
    d2 = d1 - sig*math.sqrt(T)
    return cp*S*math.exp(-dv*T)*norm.cdf(d1*cp) - cp*K*math.exp(-rf*T)*norm.cdf(d2*cp)
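As a quick check of the function just defined, here is an illustrative at-the-money call (the parameter values are arbitrary, not taken from the dataset):

```python
# S=100, K=100, T=1 year, sigma=20%, r=5%, no dividends, cp=+1 for a call
print(BSM(100, 100, 1.0, 0.20, 0.05, 0.0, 1.0))   # approximately 10.45
```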
df = pd.read_csv('DSTMAA_data/BS_training.csv')
$C$ is homogeneous of degree one, so $$ aC(S,K) = C(aS,aK) $$ This means we can normalize spot and call prices and remove a variable by dividing by $K$. $$ \frac{C(S,K)}{K} = C(S/K,1) $$
df['Stock Price'] = df['Stock Price']/df['Strike Price']
df['Call Price'] = df['Call Price'] /df['Strike Price']
n = 300000
n_train = (int)(0.8 * n)
train = df[0:n_train]
X_train = train[['Stock Price', 'Maturity', 'Dividends', 'Volatility', 'Risk-free']].values
y_train = train['Call Price'].values
test = df[n_train+1:n]
X_test = test[['Stock Price', 'Maturity', 'Dividends', 'Volatility', 'Risk-free']].values
y_test = test['Call Price'].values
#Import libraries
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LeakyReLU
from keras import backend
def custom_activation(x):
    return backend.exp(x)
nodes = 120
model = Sequential()
model.add(Dense(nodes, input_dim=X_train.shape[1]))
#model.add("relu")
model.add(Dropout(0.25))
model.add(Dense(nodes, activation='elu'))
model.add(Dropout(0.25))
model.add(Dense(nodes, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(nodes, activation='elu'))
model.add(Dropout(0.25))
model.add(Dense(1))
model.add(Activation(custom_activation))
model.compile(loss='mse',optimizer='rmsprop')
model.fit(X_train, y_train, batch_size=64, epochs=10, validation_split=0.1, verbose=2)
Train on 216000 samples, validate on 24000 samples Epoch 1/10 - 10s - loss: 0.0053 - val_loss: 0.0022 Epoch 2/10 - 10s - loss: 0.0015 - val_loss: 0.0012 Epoch 3/10 - 10s - loss: 0.0011 - val_loss: 1.5032e-04 Epoch 4/10 - 10s - loss: 9.4730e-04 - val_loss: 1.1840e-04 Epoch 5/10 - 10s - loss: 8.2619e-04 - val_loss: 3.4301e-04 Epoch 6/10 - 10s - loss: 7.5072e-04 - val_loss: 5.3885e-04 Epoch 7/10 - 10s - loss: 6.9686e-04 - val_loss: 6.3596e-04 Epoch 8/10 - 10s - loss: 6.6734e-04 - val_loss: 3.9628e-04 Epoch 9/10 - 10s - loss: 6.4379e-04 - val_loss: 3.7819e-04 Epoch 10/10 - 10s - loss: 6.2233e-04 - val_loss: 2.2964e-04
<keras.callbacks.callbacks.History at 0x7f23db54ff98>
def CheckAccuracy(y,y_hat):
    stats = dict()
    stats['diff'] = y - y_hat
    stats['mse'] = mean(stats['diff']**2)
    print("Mean Squared Error: ", stats['mse'])
    stats['rmse'] = sqrt(stats['mse'])
    print("Root Mean Squared Error: ", stats['rmse'])
    stats['mae'] = mean(abs(stats['diff']))
    print("Mean Absolute Error: ", stats['mae'])
    stats['mpe'] = sqrt(stats['mse'])/mean(y)
    print("Mean Percent Error: ", stats['mpe'])

    #plots
    mpl.rcParams['agg.path.chunksize'] = 100000
    figure(figsize=(10,3))
    plt.scatter(y, y_hat, color='black', linewidth=0.3, alpha=0.4, s=0.5)
    plt.xlabel('Actual Price', fontsize=20, fontname='Times New Roman')
    plt.ylabel('Predicted Price', fontsize=20, fontname='Times New Roman')
    plt.show()

    figure(figsize=(10,3))
    plt.hist(stats['diff'], bins=50, edgecolor='black', color='white')
    plt.xlabel('Diff')
    plt.ylabel('Density')
    plt.show()

    return stats
y_train_hat = model.predict(X_train)
#reduce dim (240000,1) -> (240000,) to match y_train's dim
y_train_hat = squeeze(y_train_hat)
CheckAccuracy(y_train, y_train_hat)
Mean Squared Error:  0.00023008994372701063
Root Mean Squared Error:  0.015168715955116657
Mean Absolute Error:  0.011827044126932673
Mean Percent Error:  0.05670397821662379
{'diff': array([0.01986485, 0.0093275 , 0.0173008 , ..., 0.01663176, 0.0257388 , 0.00174695]), 'mae': 0.011827044126932673, 'mpe': 0.05670397821662379, 'mse': 0.00023008994372701063, 'rmse': 0.015168715955116657}
y_test_hat = model.predict(X_test)
y_test_hat = squeeze(y_test_hat)
test_stats = CheckAccuracy(y_test, y_test_hat)
Mean Squared Error:  0.00023010178562453136
Root Mean Squared Error:  0.015169106289578545
Mean Absolute Error:  0.011837113037171105
Mean Percent Error:  0.0567868417818699
A Random Forest uses several decision trees to make hypotheses about regions within subsamples of the data, then makes predictions based on the majority vote of these trees. This safeguards against overfitting/memorization of the training data.
n = 300000
n_train = (int)(0.8 * n)
train = df[0:n_train]
X_train = train[['Stock Price', 'Maturity', 'Dividends', 'Volatility', 'Risk-free']].values
y_train = train['Call Price'].values
test = df[n_train+1:n]
X_test = test[['Stock Price', 'Maturity', 'Dividends', 'Volatility', 'Risk-free']].values
y_test = test['Call Price'].values
def CheckAccuracy(y,y_hat):
    stats = dict()
    stats['diff'] = y - y_hat
    stats['mse'] = mean(stats['diff']**2)
    print("Mean Squared Error: ", stats['mse'])
    stats['rmse'] = sqrt(stats['mse'])
    print("Root Mean Squared Error: ", stats['rmse'])
    stats['mae'] = mean(abs(stats['diff']))
    print("Mean Absolute Error: ", stats['mae'])
    stats['mpe'] = sqrt(stats['mse'])/mean(y)
    print("Mean Percent Error: ", stats['mpe'])

    #plots
    mpl.rcParams['agg.path.chunksize'] = 100000
    #figure(figsize=(14,10))
    plt.scatter(y, y_hat, color='black', linewidth=0.3, alpha=0.4, s=0.5)
    plt.xlabel('Actual Price', fontsize=20, fontname='Times New Roman')
    plt.ylabel('Predicted Price', fontsize=20, fontname='Times New Roman')
    plt.show()

    #figure(figsize=(14,10))
    plt.hist(stats['diff'], bins=50, edgecolor='black', color='white')
    plt.xlabel('Diff')
    plt.ylabel('Density')
    plt.show()

    return stats
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
forest = forest.fit(X_train, y_train)
y_test_hat = forest.predict(X_test)
stats = CheckAccuracy(y_test, y_test_hat)
Mean Squared Error:  3.432145217739371e-05
Root Mean Squared Error:  0.00585845134633665
Mean Absolute Error:  0.004268844808400679
Mean Percent Error:  0.021931611746946578
Deep Learning specifically, and machine learning more generally, have been criticized by econometricians as being weaker than causal models; that is, correlation is not causality. Here is an article about a recent development in taking NNs in the direction of causal models: https://medium.com/mit-technology-review/deep-learning-could-reveal-why-the-world-works-the-way-it-does-9be8b5fbfe4f; https://drive.google.com/file/d/1r4UPFQQv-vutQXdlmpmCyB9_nO14FZMe/view?usp=sharing