Deep Learning: Introduction

Sanjiv R. Das

Professor of Finance and Data Science

Santa Clara University

REFERENCE BOOK: http://srdas.github.io/DLBook2

“You are my creator, but I am your master; Obey!”

― Mary Shelley, Frankenstein

Interesting short book: https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12

The Universal Approximation Theorem: https://medium.com/analytics-vidhya/you-dont-understand-neural-networks-until-you-understand-the-universal-approximation-theorem-85b3e7677126

Net input

$$ a_j^{(r)} = \sum_{i=1}^{n^{(r-1)}} W_{ij} Z_i^{(r-1)} + b_j^{(r)} $$
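A minimal NumPy sketch of this computation (the layer sizes and variable names are illustrative, not from the slides): the net inputs $a_j^{(r)}$ for all neurons in a layer can be formed in a single matrix product.

```python
import numpy as np

n_prev, n_curr = 4, 3                      # nodes in layers r-1 and r (assumed sizes)
Z_prev = np.random.rand(n_prev)            # activations Z_i^(r-1)
W = np.random.randn(n_prev, n_curr)        # weights W_ij
b = np.zeros(n_curr)                       # biases b_j^(r)

a = Z_prev @ W + b                         # a_j^(r) = sum_i W_ij Z_i^(r-1) + b_j^(r)
print(a)
```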

Examples of Different Types of Neurons

Sigmoid

$$ Z_j^{(r)} = \frac{1}{1+\exp(-a_j^{(r)})} \in (0,1) $$

ReLU (rectified linear unit)

$$ Z_j^{(r)} = \max[0, a_j^{(r)}] \in [0,\infty) $$

TanH (hyperbolic tangent)

$$ Z_j^{(r)} = \tanh[a_j^{(r)}] \in (-1,+1) $$
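A small NumPy sketch of the three activation functions; the output ranges are noted in the comments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))        # output in (0, 1)

def relu(a):
    return np.maximum(0.0, a)              # output in [0, infinity)

def tanh(a):
    return np.tanh(a)                      # output in (-1, +1)

a = np.linspace(-3, 3, 7)
print(sigmoid(a), relu(a), tanh(a), sep="\n")
```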

Output Layer

More on Cross Entropy

https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/

Notation from the previous slides:

Log Loss is the special case of Cross Entropy for binary outcomes.

Entropy

$$ E = - \frac{1}{n} \sum_i [y_i \ln y_i] $$

Cross-entropy:

$$ C = - \frac{1}{n} \sum_i [y_i \ln a_i] $$

where $a_i = {\hat y_i}$.

Note that $C \geq E$ always, with equality only when $\hat y_i = y_i$ for all $i$.

Kullback-Leibler Divergence

$$ KL = \sum_i y_i \cdot \log \left(\frac{y_i}{\hat{y_i}} \right) $$

Measures the extra bits required when outcomes drawn from $y$ are encoded using a code optimized for $\hat y$; equivalently, it is the gap between the cross-entropy and the entropy of $y$.
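A small numerical check of these three quantities for a single pair of distributions (the probabilities are made up, and the $1/n$ averaging over training samples is dropped here):

```python
import numpy as np

y    = np.array([0.7, 0.2, 0.1])           # true distribution y_i
yhat = np.array([0.6, 0.3, 0.1])           # predicted distribution (a_i = yhat_i)

E  = -np.sum(y * np.log(y))                # entropy of y
C  = -np.sum(y * np.log(yhat))             # cross-entropy of yhat relative to y
KL = np.sum(y * np.log(y / yhat))          # KL divergence

print(E, C, KL)                            # C >= E, and KL equals C - E
```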

SoftMax Function

$$ L = -\sum_i T_i \log(h_i) $$
$$ h_i = \frac{e^{a_i}}{\sum_j e^{a_j}} = \frac{e^{a_i}}{D} $$

Delta of Softmax

$$ \frac{\partial L}{\partial a_j} = -\sum_i T_i \frac{\partial \log h_i}{\partial a_j} = -\sum_i T_i \frac{\partial}{\partial a_j}[a_i - \log D] $$
$$ = \sum_i T_i \frac{\partial \log D}{\partial a_j} - \sum_i T_i \frac{\partial a_i}{\partial a_j} = \sum_i T_i \frac{1}{D}\frac{\partial D}{\partial a_j} - \sum_i T_i \frac{\partial a_i}{\partial a_j} $$

Given that $\frac{\partial D}{\partial a_j} = e^{a_j}$,

$$ \frac{\partial L}{\partial a_j} = h_j \left(\sum_i T_i\right) - T_j = h_j - T_j $$

since $\sum_i T_i = 1$ for one-hot targets.
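A small sketch verifying the result $\partial L/\partial a_j = h_j - T_j$ numerically (the logits and target are made-up values):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())                 # subtract the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 2.0, 0.5])               # logits a_i
T = np.array([0.0, 1.0, 0.0])               # one-hot target T_i

h = softmax(a)
analytic = h - T                             # the delta derived above

# central-difference check of each component of the gradient
eps, numeric = 1e-6, np.zeros_like(a)
for j in range(len(a)):
    ap, am = a.copy(), a.copy()
    ap[j] += eps
    am[j] -= eps
    numeric[j] = (-np.sum(T * np.log(softmax(ap)))
                  + np.sum(T * np.log(softmax(am)))) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```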

Batch Stochastic Gradient

nb_setup.images_hconcat(["DSTMAA_images/batch_stochastic_gradient.png"], width=700)

Fitting the NN

  1. Initialize all the weight and bias parameters $(w_{ij}^{(r)},b_i^{(r)})$ (this is a critical step).

  2. For $q = 0,...,{M\over B}-1$, repeat the following steps (2a) - (2f):

a. For the training inputs $X_i(m), qB\le m\le (q+1)B-1$, compute the model predictions $y(m)$ given by

$$ a_i^{(r)}(m) = \sum_{j=1}^{P^{r-1}} w_{ij}^{(r)}z_j^{(r-1)}(m)+b_i^{(r)} \quad \mbox{and} \quad z_i^{(r)}(m) = f(a_i^{(r)}(m)), \quad 2 \leq r \leq R, 1 \leq i \leq P^r $$

and for $r=1$,

$$ a_i^{(1)}(m) = \sum_{j=1}^{N} w_{ij}^{(1)}x_j(m)+b_i^{(1)} \quad \mbox{and} \quad z_i^{(1)}(m) = f(a_i^{(1)}(m)), \quad 1 \leq i \leq P^1 $$

The logits and classification probabilities are computed using

$$ a_i^{(R+1)}(m) = \sum_{j=1}^K w_{ij}^{(R+1)}z_j^{(R)}(m)+b_i^{(R+1)} $$

and

$$ y_i(m) = \frac{\exp(a_i^{(R+1)}(m))}{\sum_{k=1}^K \exp(a_k^{(R+1)}(m))}, \quad 1 \leq i \leq K $$

This step constitutes the forward pass of the algorithm.

b. Evaluate the gradients $\delta_k^{(R+1)}(m)$ for the logit layer nodes using

$$ \delta_k^{(R+1)}(m) = y_k(m) - t_k(m),\ \ 1\le k\le K $$

This step and the following one constitute the start of the backward pass of the algorithm, in which we compute the gradients $\delta_k^{(r)}, 1 \leq k \leq P^r, 1\le r\le R$ for all the hidden nodes.

c. Back-propagate the $\delta$s using the following equation to obtain the $\delta_j^{(r)}(m), 1 \leq r \leq R, 1 \leq j \leq P^r$ for each hidden node in the network.

$$ \delta_j^{(r)}(m) = f'(a_j^{(r)}(m)) \sum_k w_{kj}^{(r+1)} \delta_k^{(r+1)}(m), \quad 1 \leq r \leq R $$

d. Compute the gradients of the Cross Entropy Function $\mathcal L(m)$ for the $m$-th training vector $(X{(m)}, T{(m)})$ with respect to all the weight and bias parameters using the following equation.

$$ \frac{\partial\mathcal L(m)}{\partial w_{ij}^{(r+1)}} = \delta_i^{(r+1)}(m) z_j^{(r)}(m) $$

and

$$ \frac{\partial \mathcal L(m)}{\partial b_i^{(r+1)}} = \delta_i^{(r+1)}(m), \quad 0 \leq r \leq R $$

e. Change the model weights according to the formula

$$ w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \frac{\eta}{B}\sum_{m=qB}^{(q+1)B-1} \frac{\partial\mathcal L(m)}{\partial w_{ij}^{(r)}} $$
$$ b_i^{(r)} \leftarrow b_i^{(r)} - \frac{\eta}{B}\sum_{m=qB}^{(q+1)B-1} \frac{\partial{\mathcal L}(m)}{\partial b_i^{(r)}} $$

f. Increment $q\leftarrow \left((q+1)\mod \frac{M}{B}\right)$, and go back to step $(a)$.

Compute the Loss Function $L$ over the Validation Dataset given by

$$ L = -{1\over V}\sum_{m=1}^V\sum_{k=1}^K t_k{(m)} \log y_k{(m)} $$

If $L$ has dropped below some threshold, then stop. Otherwise go back to Step 2.
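A minimal NumPy sketch of the full loop for a network with a single ReLU hidden layer ($R = 1$) and a softmax output; the synthetic data, layer sizes, and learning rate are assumptions for illustration, not the slides' example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P1, K, B, eta = 512, 4, 16, 3, 32, 0.1           # samples, inputs, hidden nodes, classes, batch size, learning rate
X = rng.normal(size=(M, N))
labels = np.argmax(X[:, :K], axis=1)                   # a learnable synthetic target
T = np.eye(K)[labels]                                  # one-hot targets t_k(m)

# Step 1: initialize the weight and bias parameters
W1, b1 = rng.normal(0, 1/np.sqrt(N), (N, P1)), np.zeros(P1)
W2, b2 = rng.normal(0, 1/np.sqrt(P1), (P1, K)), np.zeros(K)

def softmax(A):
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(20):
    for q in range(M // B):                            # Step 2: loop over batches
        xb, tb = X[q*B:(q+1)*B], T[q*B:(q+1)*B]
        a1 = xb @ W1 + b1                              # (a) forward pass
        z1 = np.maximum(0.0, a1)                       #     ReLU hidden layer
        y  = softmax(z1 @ W2 + b2)                     #     classification probabilities
        d2 = (y - tb) / B                              # (b) logit-layer deltas y - t (batch-averaged)
        d1 = (d2 @ W2.T) * (a1 > 0)                    # (c) back-propagate the deltas
        W2 -= eta * z1.T @ d2                          # (d)-(e) gradients and parameter updates
        b2 -= eta * d2.sum(axis=0)
        W1 -= eta * xb.T @ d1
        b1 -= eta * d1.sum(axis=0)
    # Step 3: loss over the full set (stands in for a validation set here)
    probs = softmax(np.maximum(X @ W1 + b1, 0.0) @ W2 + b2)
    L = -np.mean(np.sum(T * np.log(probs + 1e-12), axis=1))
print(round(L, 4))                                     # should fall well below ln(K) ~= 1.10
```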

Gradient Descent Example

nb_setup.images_hconcat(["DSTMAA_images/Gradient_Descent_Scheme.png"], width=600)

Vanishing Gradients

Gradients in Multiple Dimensions

Saddle Points

Effect of the Learning Rate

The idea is to start with a high learning rate and then adaptively reduce it as we get closer to the minimum of the loss function.

Annealing

Adjust the learning rate as a step function when the reduction in the loss function begins to plateau.
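A tiny sketch of one such schedule (the factor, patience, and tolerance are assumptions): the learning rate is cut by a fixed factor whenever the validation loss stops improving.

```python
def anneal(eta, val_losses, factor=0.5, patience=3, tol=1e-4):
    """Cut eta by `factor` if the loss has not improved for `patience` epochs."""
    if len(val_losses) > patience and \
       min(val_losses[-patience:]) > min(val_losses[:-patience]) - tol:
        return eta * factor
    return eta

print(anneal(0.1, [0.90, 0.70, 0.71, 0.70, 0.70]))    # plateau detected -> 0.05
```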

Learning Rate Algorithms

Momentum

Addresses the problem that arises when the gradient is multidimensional, with steep components along some axes and shallow components along others.

At the end of the $n^{th}$ iteration of the Backprop algorithm, define a sequence $v(n)$ by

$$ v(n) = \rho\; v(n-1) - \eta \; g(n), \quad \quad v(0)= -\eta g(0) $$

where $\rho$ is a new hyper-parameter called the "momentum" parameter, and $g(n)$ is the gradient evaluated at the parameter value $w(n)$.

$g(n)$ is defined by

$$ g(n) = \frac{\partial {\mathcal L(n)}}{\partial w} $$

for Stochastic Gradient Descent and

$$ g(n) = {1\over B}\sum_{m=nB}^{(n+1)B-1}\frac{\partial {\mathcal L(m)}}{\partial w} $$

for Batch Stochastic Gradient Descent (note that in this case $n$ is an index into the batch number).

The change in parameter values on each iteration is now defined as

$$ w(n+1) = w(n) + v(n) $$

It can be shown from these equations that $v(n)$ can be written as

$$ v(n) = - \eta\sum_{i=0}^{n} \rho^{n-i} g(i) $$

so that

$$ w(n+1) = w(n) - \eta\sum_{i=0}^{n} \rho^{n-i} g(i) $$
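A minimal sketch of these updates on a toy quadratic loss $L(w)=\frac{1}{2}\lVert w\rVert^2$ (the loss, starting point, and hyper-parameters are assumptions):

```python
import numpy as np

def grad(w):                                # gradient of L(w) = 0.5 * ||w||^2
    return w

w, v = np.array([5.0, -3.0]), np.zeros(2)
eta, rho = 0.1, 0.9
for n in range(200):
    g = grad(w)
    v = rho * v - eta * g                   # v(n) = rho*v(n-1) - eta*g(n)
    w = w + v                               # w(n+1) = w(n) + v(n)
print(np.round(w, 4))                       # approaches the minimum at the origin
```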

Behavior of Momentum Parameter

When the momentum parameter $\rho = 0$, then this equation reduces to the usual Stochastic Gradient Descent iteration. On the other hand, when $\rho > 0$, then we get some interesting behaviors:

Properties of the Momentum algorithm

Note that $$ \sum_{i=0}^{n} \rho^{n-i}g(i) \le {g_{max}\over 1-\rho} $$ $\rho$ is usually set in the neighborhood of $0.9$, and from the above equation it follows that $\sum_{i=0}^n \rho^{n-i}g(i)\approx 10g$, assuming all the $g(i)$ are approximately equal to $g$. Hence the effective gradient is ten times the value of the actual gradient. This results in an "overshoot", where the parameter value shoots past the minimum point to the other side of the bowl and then reverses itself. This is desirable behavior, since the momentum carries the algorithm past saddle points and shallow local minima rather than letting it get stuck there.

Nesterov Momentum

  1. Recall that the Momentum parameter update equations can be written as: $$ v(n) = \rho\; v(n-1) - \eta \; g(w(n)) $$ $$ w(n+1) = w(n) + v(n) $$
  2. These equations can be improved by evaluating the gradient at the parameter value $w(n+1)$ instead.

This looks circular: in order to compute $w(n+1)$ we first need to compute $g(w(n))$. The resolution is to use the approximation $$ w(n+1)\approx w(n) + \rho v(n-1) $$

  1. This gives the velocity update equation for Nesterov Momentum $$ v(n) = \rho\; v(n-1) - \eta \; g(w(n)+\rho v(n-1)) $$ where $g(w(n)+\rho v(n-1))$ denotes the gradient computed at parameter values $w(n) + \rho v(n-1)$.
  2. The Gradient Descent process speeds up considerably compared to the plain Momentum method (a small sketch follows below).
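The same toy example with the Nesterov look-ahead gradient (a sketch under the same assumptions as the Momentum example above):

```python
import numpy as np

def grad(w):                                # gradient of L(w) = 0.5 * ||w||^2
    return w

w, v = np.array([5.0, -3.0]), np.zeros(2)
eta, rho = 0.1, 0.9
for n in range(200):
    g = grad(w + rho * v)                   # gradient at the look-ahead point w(n) + rho*v(n-1)
    v = rho * v - eta * g
    w = w + v
print(np.round(w, 4))
```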

The ADAGRAD Algorithm

  1. Parameter update rule: $$ w(n+1) = w(n) - \frac{\eta}{\sqrt{\sum_{i=1}^n g(i)^2+\epsilon}}\; g(n) $$

  2. The constant $\epsilon$ is added to better condition the denominator and is usually set to a small number such as $10^{-7}$.

  3. Each parameter gets its own adaptive Learning Rate, such that parameters with large gradients get smaller learning rates and parameters with small gradients get larger ones ($\eta$ is usually defaulted to $0.01$). As a result, the progress along each dimension evens out over time, which helps the training process.

  4. The change in rates happens automatically as part of the parameter update equation.

  5. Downside: the accumulation of squared gradients in the denominator leads to a continuous decrease in Learning Rates, which can halt training in large networks that require a greater number of iterations (the small numerical sketch below illustrates this slowdown).
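A small sketch of the ADAGRAD update on the same toy quadratic loss used earlier (assumed setup); note how slowly the iterate moves as the accumulated sum of squared gradients grows.

```python
import numpy as np

def grad(w):                                # gradient of L(w) = 0.5 * ||w||^2
    return w

w = np.array([5.0, -3.0])
eta, eps = 0.01, 1e-7
sum_sq = np.zeros_like(w)                   # running sum of g(i)^2
for n in range(500):
    g = grad(w)
    sum_sq += g**2                          # accumulate squared gradients
    w -= eta / np.sqrt(sum_sq + eps) * g    # per-parameter adaptive step
print(np.round(w, 3))                       # progress stalls as sum_sq keeps growing
```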

The RMSPROP Algorithm

  1. Accumulates the sum of gradients using a sliding window: $$ E[g^2]_n = \rho E[g^2]_{n-1} + (1-\rho) g(n)^2 $$ where $\rho$ is a decay constant (usually set to $0.9$). This operation (called a Low Pass Filter) has a windowing effect, since it forgets gradients that are far back in time.
  2. The quantity $RMS[g]_n$, defined by $$ RMS[g]_n = \sqrt{E[g^2]_n + \epsilon}, $$ is then used to scale the parameter update: $$ w(n+1) = w(n) - \frac{\eta}{RMS[g]_n}\; g(n) $$

Note that $$ E[g^2]_n = (1-\rho)\sum_{i=0}^n \rho^{n-i} g(i)^2 \le g^2_{max} $$ which shows that the parameter $\rho$ prevents the sum from blowing up, and a larger value of $\rho$ is equivalent to using a larger window of previous gradients in computing the sum.
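A small sketch of the RMSPROP update on the same toy loss (assumed setup); the low-pass filter keeps the effective step size from collapsing the way ADAGRAD's does.

```python
import numpy as np

def grad(w):                                # gradient of L(w) = 0.5 * ||w||^2
    return w

w = np.array([5.0, -3.0])
eta, rho, eps = 0.01, 0.9, 1e-7
Eg2 = np.zeros_like(w)                      # E[g^2]_n
for n in range(2000):
    g = grad(w)
    Eg2 = rho * Eg2 + (1 - rho) * g**2      # sliding-window (low-pass) accumulation
    w -= eta / np.sqrt(Eg2 + eps) * g       # RMS-scaled step
print(np.round(w, 3))                       # hovers near the minimum (within roughly eta)
```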

The ADAM Algorithm

  1. The Adaptive Moment Estimation (Adam) algorithm combines the best of algorithms such as Momentum that speed up the training process, with algorithms such as RMSPROP that adaptively vary the effective Learning Rate.
$$ \Lambda(n) = \beta\Lambda(n-1) +(1-\beta)g(n),\ \ \ {\hat\Lambda}(n) = \frac{\Lambda(n)}{1-\beta^n} $$
$$ \Delta(n) = \alpha\Delta(n-1) + (1-\alpha) g(n)^2,\ \ \ {\hat\Delta}(n) = \frac{\Delta(n)}{1-\alpha^n} $$
$$ w(n+1) = w(n) - \eta\frac{\hat\Lambda(n)}{\sqrt{\hat\Delta(n) + \epsilon}} $$
  2. $\Delta(n)$ is identical to $E[g^2]_n$ in RMSPROP, and it serves an identical purpose, i.e., the effective rates for parameters with larger gradients are equalized with those for parameters with smaller gradients.
  3. The sequence $\Lambda(n)$ is used to provide "Momentum" to the updates, and works in a fashion similar to the velocity sequence $v(n)$ in the Momentum algorithm:
$$ \Lambda(n) = (1-\beta)\sum_{i=0}^n \beta^{n-i} g(i) \le g_{max} $$
  4. The parameters $\beta$ and $\alpha$ are usually defaulted to $0.9$ and $0.999$ respectively, while $\epsilon$ is set to a small value such as $10^{-8}$ (a small numerical sketch follows below).
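A small sketch of the ADAM update on the same toy loss (assumed setup), including the $1/(1-\beta^n)$ and $1/(1-\alpha^n)$ bias corrections:

```python
import numpy as np

def grad(w):                                # gradient of L(w) = 0.5 * ||w||^2
    return w

w = np.array([5.0, -3.0])
eta, beta, alpha, eps = 0.01, 0.9, 0.999, 1e-8
Lam, Delta = np.zeros_like(w), np.zeros_like(w)       # first and second moments
for n in range(1, 2001):
    g = grad(w)
    Lam   = beta  * Lam   + (1 - beta)  * g           # momentum-like sequence Lambda(n)
    Delta = alpha * Delta + (1 - alpha) * g**2        # RMSPROP-like sequence Delta(n)
    Lam_hat   = Lam   / (1 - beta**n)                 # bias corrections
    Delta_hat = Delta / (1 - alpha**n)
    w = w - eta * Lam_hat / np.sqrt(Delta_hat + eps)
print(np.round(w, 3))                                 # ends up in a small neighborhood of the origin
```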

Back to Activation Functions (Vanishing Gradients)

tanh Activation

  1. Unless the input is in the neighborhood of zero, the function enters its saturated regime.

  2. It is superior to the sigmoid in one respect, i.e., its output is zero centered. This speeds up the training process.

  3. The $\tanh$ function is rarely used in modern DLNs, the exception being a type of DLN called the LSTM.

ReLU Activation

$$ z = \max(0,a) $$
  1. No saturation problem.

  2. Gradients $\frac{\partial L}{\partial w}$ propagate undiminished through the network, provided all the nodes are active.

Dead ReLU Problem

  1. The dotted line in this figure shows a case in which the weight parameters $w_i$ are such that the hyperplane $\sum w_i z_i$ does not intersect the "data cloud" of possible input activations. In that case no input values can produce $\sum w_i z_i > 0$, so the neuron's output activation is always zero, and it kills all gradients backpropagating down from higher layers.

  2. Vary initialization to correct this.

Leaky ReLU

$$ z = \max(ca,a),\ \ 0\le c<1 $$

PReLU

$$ z_i = \max(\beta_i a_i,a_i), \quad 1 \le i \le S $$
  1. Instead of deciding on the value of $c$ through experimentation, why not determine it using Backpropagation as well? This is the idea behind the Parametric ReLU (PReLU) function.

Note that each neuron $i$ now has its own parameter $\beta_i, 1\le i\le S$, where $S$ is the number of nodes in the network. These parameters are iteratively estimated using Backprop.

$$ \frac{\partial\mathcal L}{\partial\beta_i} = \frac{\partial\mathcal L}{\partial z_i}\frac{\partial z_i}{\partial\beta_i},\ \ 1\le i\le S $$

Substituting the value for $\frac{\partial z_i}{\partial\beta_i}$ we obtain $$ \frac{\partial\mathcal L}{\partial\beta_i} = a_i\frac{\partial\mathcal L}{\partial z_i}\ \ if\ a_i \le 0\ \ \mbox{and} \ \ 0 \ \ \mbox{otherwise} $$

which is then used to update $\beta_i$ using $\beta_i\rightarrow\beta_i - \eta\frac{\partial\mathcal L}{\partial\beta_i}$.

Once training is complete, the PReLU-based DLN ends up with a different value of $\beta_i$ at each neuron, which increases the flexibility of the network at the cost of an increase in the number of parameters.
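A small NumPy sketch of Leaky ReLU and of the PReLU gradient $\partial\mathcal L/\partial\beta_i$ derived above (the activations, upstream gradients, and learning rate are made-up values):

```python
import numpy as np

def leaky_relu(a, c=0.01):
    return np.maximum(c * a, a)              # z = max(c*a, a), 0 <= c < 1

def prelu(a, beta):
    return np.maximum(beta * a, a)           # each neuron i has its own beta_i

a     = np.array([-2.0, -0.5, 1.0, 3.0])     # net inputs a_i
beta  = np.full_like(a, 0.25)                # one beta_i per neuron
dL_dz = np.array([0.1, -0.3, 0.2, 0.4])      # upstream gradients dL/dz_i

dL_dbeta = np.where(a <= 0, a * dL_dz, 0.0)  # dL/dbeta_i = a_i * dL/dz_i if a_i <= 0, else 0
beta -= 0.1 * dL_dbeta                       # beta_i <- beta_i - eta * dL/dbeta_i
print(leaky_relu(a), prelu(a, beta), dL_dbeta, sep="\n")
```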

Maxout

Generalizes Leaky ReLU.

$$ z'_i = \max(c\big[\sum_j w_{ij}z_j +b_i\big],\sum_j w_{ij}z_j +b_i), $$

We may allow the two hyperplanes to be independent, each with its own set of parameters, as shown in the Figure above.
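A tiny sketch of a single Maxout unit with two independent hyperplanes (the shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.random(5)                            # inputs z_j to the unit
W1, b1 = rng.normal(size=5), 0.0             # first hyperplane's parameters
W2, b2 = rng.normal(size=5), 0.0             # second hyperplane's parameters

z_out = max(W1 @ z + b1, W2 @ z + b2)        # Maxout: the larger of the two hyperplanes
print(z_out)
```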

Initializing Weights

In practice, the DLN weight parameters are initialized with random values drawn from Gaussian or Uniform distributions and the following rules are used:

When using the ReLU or its variants, these rules have to be modified slightly:

The reasoning behind scaling down the initialization values as the number of incident weights increases is to prevent saturation of the node activations during the forward pass of the Backprop algorithm, as well as large values of the gradients during backward pass.
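A hedged sketch of this scaling idea: the initial standard deviation shrinks with the number of incoming connections (fan-in). The specific constants shown are the commonly used Xavier/Glorot and He rules, stated here as an illustration rather than the slides' exact prescription.

```python
import numpy as np

fan_in, fan_out = 256, 128                                            # assumed layer sizes
W_tanh = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)           # common rule for tanh/sigmoid units
W_relu = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)     # modified rule often used with ReLU units
print(W_tanh.std(), W_relu.std())                                     # std shrinks as fan_in grows
```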

Data Preprocessing

Centering: This is also sometimes called Mean Subtraction, and is the most common form of preprocessing. Given an input dataset consisting of $M$ vectors $X(m) = (x_1(m),...,x_N(m)), m = 1,...,M$, it consists of subtracting the mean across each individual input component $x_i, 1\leq i\leq N$ such that $$ x_i(m) \leftarrow x_i(m) - \frac{\sum_{s=1}^{M}x_i(s)}{M},\ \ 1\leq i\leq N, 1\le m\le M $$

Scaling: After the data has been centered, it can be scaled in one of two ways:

$$ x_i(m) \leftarrow \frac{x_i(m) - \frac{\sum_{s=1}^{M}x_i(s)}{M}}{\sigma_i},\ \ 1\leq i\leq N, 1\le m\le M $$
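A minimal NumPy sketch of centering and scaling, applied per input component (the small matrix is made up); this is the same transformation that `sklearn.preprocessing.StandardScaler` performs.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 240.0]])                          # M x N input matrix (assumed values)

X_centered = X - X.mean(axis=0)                       # mean subtraction per component x_i
X_scaled   = X_centered / X.std(axis=0)               # divide by the per-component sigma_i
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))    # approximately 0 and 1 per column
```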

Zero-Centering

Recall that for a K-ary Linear Classifier, the parameter update equation is given by:

$$ w_{kj} \leftarrow w_{kj} - \eta x_j(y_k-t_k),\ \ 1\le k\le K,\ \ 1\le j\le N $$

If the training sample is such that $t_q = 1$ and $t_k = 0, k\ne q$, then the update becomes:

$$ w_{qj} \leftarrow w_{qj} - \eta x_j(y_q-1) $$

and

$$ w_{kj} \leftarrow w_{kj} - \eta x_j(y_k),\ \ k\ne q $$

Let's assume that the input data is not centered, so that $x_j\ge 0, j=1,...,N$. Since $0\le y_k\le 1$ it follows that

$$ \Delta w_{kj} = -\eta x_jy_k <0, k\ne q $$

and

$$ \Delta w_{qj} = -\eta x_j(y_q - 1) > 0 $$

i.e., the update results in all the weights moving in the same direction, except for one. This is shown graphically in the Figure above, in which the system is trying to move in the direction of the blue arrow, which is the quickest path to the minimum. However, if the input data is not centered, it is forced to move in a zig-zag fashion, as shown in the red curve. The zig-zag motion arises because all the parameters move in the same direction at each step, due to the lack of zero-centering in the input data.
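A tiny numerical check of this argument (the inputs, outputs, and target are made-up values): with non-negative inputs, every row of the update except row $q$ has the same sign.

```python
import numpy as np

x = np.array([0.5, 1.2, 2.0])                # non-negative inputs x_j
y = np.array([0.2, 0.5, 0.3])                # classifier outputs, 0 <= y_k <= 1
t = np.array([0.0, 1.0, 0.0])                # one-hot target, q = 1
eta = 0.1

dW = -eta * np.outer(y - t, x)               # Delta w_kj = -eta * x_j * (y_k - t_k)
print(dW)                                    # row q=1 is positive; all other rows are negative
```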

Zero Centering Helps

Batch Normalization

The same normalization idea, applied to the activations of the hidden layers rather than just to the inputs.
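A small illustrative Keras sketch (the layer sizes are assumptions): `BatchNormalization` layers normalize the hidden-layer activations between the Dense layers.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(30,)),
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),             # normalize the hidden-layer activations
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```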

Regularization

Early Stopping

L2 Regularization

L2 Regularization is a commonly used technique in ML systems and is also sometimes referred to as "Weight Decay". It works by adding a quadratic term to the Cross Entropy Loss Function $\mathcal L$, called the Regularization Term, which results in a new Loss Function $\mathcal L_R$ given by:

\begin{equation} \mathcal L_R = {\mathcal L} + \frac{\lambda}{2} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} (w_{ij}^{(r)})^2 \end{equation}

L2 Regularization also leads to more "diffuse" weight parameters, in other words, it encourages the network to use all its inputs a little rather than some of its inputs a lot.

L1 Regularization

L1 Regularization uses a Regularization Function which is the sum of the absolute values of all the weights in the DLN, resulting in the following loss function ($\mathcal L$ is the usual Cross Entropy loss):

$$ \mathcal L_R = \mathcal L + {\lambda} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} |w_{ij}^{(r)}| $$

At a high level L1 Regularization is similar to L2 Regularization since it leads to smaller weights.

  1. Both L1 and L2 Regularizations lead to a reduction in the weights with each iteration. However the way the weights drop is different:

  2. In L2 Regularization the weight reduction is multiplicative and proportional to the value of the weight, so it is faster for large weights and decelerates as the weights get smaller.

  3. In L1 Regularization on the other hand, the weights are reduced by a fixed amount in every iteration, irrespective of the value of the weight. Hence for larger weights L2 Regularization is faster than L1, while for smaller weights the reverse is true.

  4. As a result, L1 Regularization leads to DLNs in which the weights of most connections tend towards zero, with a few larger weights left over. The DLN that results after applying L1 Regularization is said to be "sparse" (a Keras sketch of both penalties follows below).
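A hedged Keras sketch (the $\lambda$ values and layer sizes are assumptions) showing how the two penalties are attached to a layer's weights; note that Keras's `l2(l)` adds $l\sum w^2$ to the loss, so $l$ corresponds to $\lambda/2$ in the formula above.

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(30,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3)),   # adds 1e-3 * sum(w^2) to the loss
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-4)),   # adds 1e-4 * sum(|w|) to the loss
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```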

Dropout regularization

The basic idea behind Dropout is to run each iteration of the Backprop algorithm on randomly modified versions of the original DLN. The random modifications are carried out to the topology of the DLN using the following rules:

$$ e_j^{(r)} \sim Bernoulli(p^{(r)}), \quad 0 \leq r \leq R,\ \ 1 \leq j \leq P^r $$
\begin{equation} \hat x_j = e_j^{(0)} x_j, \quad 1 \leq j \leq N \end{equation} \begin{equation} \hat z_j^{(r)} = e_j^{(r)} z_j^{(r)}, \quad 1 \leq r \leq R,\ \ 1 \leq j \leq P^r \end{equation}
  1. After the Backprop is complete, we have effectively trained a collection of up to $2^s$ thinned DLNs all of which share the same weights, where $s$ is the total number of hidden nodes in the DLN.

  2. In order to test the network, strictly speaking we should average the results from all these thinned models; however, a simple approximate averaging method works quite well.

  3. The main idea is to use the complete DLN as the test network, with each weight scaled by the retention probability $p$ so that the expected activations match those seen during training (a sketch follows below).
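A small sketch of Dropout, first as the Bernoulli mask $e_j^{(r)}$ described above and then as the standard Keras layer (the retention probability is an assumption; note that Keras's `rate` is the probability of dropping a unit, i.e. $1-p$):

```python
import numpy as np
from tensorflow.keras import layers

p = 0.5                                       # retention probability p^(r)
z = np.random.rand(8)                         # hidden-layer activations z_j^(r)
e = np.random.binomial(1, p, size=z.shape)    # e_j ~ Bernoulli(p)
z_hat = e * z                                 # thinned activations used in this Backprop iteration
print(z_hat)

dropout_layer = layers.Dropout(rate=1 - p)    # Keras drops each unit with probability `rate`
```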

Bagging (Ensemble Learning)

TensorFlow Playground

http://playground.tensorflow.org/

Pattern Recognition: Cancer

One-Hot Encoding

Keras to define DLN

Train the Model
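A hedged sketch of this workflow (the dataset, layer sizes, and training settings are assumptions, not necessarily those used in the slides): one-hot encode the labels, define a small Keras DLN, compile it, and fit.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from tensorflow.keras import layers, models, utils

X, y = load_breast_cancer(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)      # center and scale the inputs
Y = utils.to_categorical(y, num_classes=2)    # one-hot encoding of the labels

model = models.Sequential([
    layers.Input(shape=(X.shape[1],)),
    layers.Dense(32, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X, Y, epochs=20, batch_size=32, validation_split=0.2, verbose=0)
print(model.evaluate(X, Y, verbose=0))        # in-sample [loss, accuracy]
```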

Another Canonical Example: Digit Recognition (MNIST)

https://www.cs.toronto.edu/~kriz/cifar.html

Image Processing, Transfer Learning

Learning the Black-Scholes Equation

See: Hutchinson, Lo, and Poggio (1994)

Normalizing spot and call prices

$C$ is homogeneous of degree one, so $$ aC(S,K) = C(aS,aK) $$ This means we can normalize spot and call prices, and remove a variable, by dividing by $K$: $$ \frac{C(S,K)}{K} = C(S/K,1) $$
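A small check of this homogeneity property using Black-Scholes prices computed with SciPy (the maturity, rate, and volatility are assumed values):

```python
import numpy as np
from scipy.stats import norm

def bs_call(S, K, T=1.0, r=0.02, sigma=0.2):
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

S, K = 105.0, 100.0
print(bs_call(S, K) / K, bs_call(S / K, 1.0))   # the two normalized prices agree
```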

Data, libraries, activation functions

Set up, compile and fit the model

Predict and check accuracy (in-sample)

Predict and check accuracy (validation-sample)

Random Forest of decision trees

A Random Forest fits many decision trees, each on a bootstrapped subsample of the data, and then predicts by averaging the trees' outputs (or taking their majority vote for classification). This safeguards against overfitting/memorization of the training data.

Prepare Data

Fit Random Forest
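A hedged sketch of this step on synthetic data (the simulated grid of moneyness and maturity, and the hyper-parameters, are assumptions): fit a Random Forest regressor to normalized Black-Scholes prices and check its validation $R^2$.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def bs_call_norm(m, T, r=0.02, sigma=0.2):             # normalized price C(S/K, 1), with m = S/K
    d1 = (np.log(m) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    return m * norm.cdf(d1) - np.exp(-r * T) * norm.cdf(d1 - sigma * np.sqrt(T))

rng = np.random.default_rng(0)
m, T = rng.uniform(0.7, 1.3, 5000), rng.uniform(0.1, 2.0, 5000)
X, y = np.column_stack([m, T]), bs_call_norm(m, T)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))                            # validation R^2
```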

Relevant Applications

Causal Models

Deep Learning, specifically, and machine learning, more generally, have been criticized by econometricians as being weaker than causal models. That is, correlation is not causality. Here is an article about a recent development in taking NNs in the direction of causal models: https://medium.com/mit-technology-review/deep-learning-could-reveal-why-the-world-works-the-way-it-does-9be8b5fbfe4f; https://drive.google.com/file/d/1r4UPFQQv-vutQXdlmpmCyB9_nO14FZMe/view?usp=sharing

THE END