Quick Reference

For quick reference, we collect here the full set of notation used in this monograph.

  • $X^{(i)} \in {\cal R}^n$: input vector for observation $i$. Think of each observation as a row (example) in the training or testing data.

  • $n$: size of the feature set, i.e., the number of input variables per observation.

  • $M$: number of observations in the data (sample size).

  • $T^{(i)}$: true class (category, value) of observation $i$.

  • $Y^{(i)} = h_W(X^{(i)})$: model function, parameterized by weights $W$, that generates the predicted (estimated) output $Y^{(i)}$ for observation $i$.

  • $L(Y^{(i)},T^{(i)})$: loss function that penalizes the algorithm for the error between the predicted value $Y^{(i)}$ and the true value $T^{(i)}$ for observation $i$. It is therefore also known as the "error" function.

  • $\sigma(a) = \frac{e^a}{1+e^a}$: sigmoid function that generates an output value in $(0,1)$. The sigmoid function is applied at the output layer in networks with a single output node.

  • $Y = \sigma(\sum_{i=1}^n W_i X_i + b)$: generic form of the sigmoid function applied to weighted input data.

  • $W_{ij}^{(r)}$: weight parameter on input $j$ at the $i$-th node in hidden layer $r$ of the neural network.

  • $b_i^{(r)}$: bias parameter at the $i$-th node in layer $r$.

  • $Y_k, k = 1,\ldots,K$: outputs for a multiclass network with $K$ outputs.

  • $K$: number of nodes at the output layer.

  • $Y_k = \frac{e^{a_k}}{\sum_{i=1}^K e^{a_i}}, k=1,2,\ldots,K$: "softmax" function giving output $k$ at the output layer of the multiclass network. This is also known as the "normalized exponential" function.

  • $\Phi(X_i)$: basis function of input $i$, used to modify the sigmoid and introduce nonlinearity. The modified sigmoid function is as follows:

$$ \sigma \left( \sum_{i=1}^n W_i \Phi(X_i) + b \right) $$
  • $Z_j^{(r)} = f(a_j^{(r)})$: "activation" function $f$ applied at node $j$ in hidden layer $r$, where
$$ a_j^{(r)} = \sum_{i=1}^n W_{ji}^{(r)} X_i + b_j^{(r)}, \quad 1 \leq j \leq n_r $$
  • $n_r$: number of nodes at hidden layer $r$.

  • $R$: number of hidden layers.

  • $\min_{{\bf W,b}} \sum_{m=1}^M L_m \left[h_W(X^{(m)}),T^{(m)} \right]$: optimization required to fit the deep learning network.

  • $s \ll M$: random batch size, much smaller than the sample size $M$, used in Stochastic Batch Gradient Descent.

  • $\frac{\partial L_m}{\partial W_{ij}^{(r+1)}} = \delta_i^{(r+1)} \cdot Z_j^{(r)}$: gradient of ${\bf W}$ for optimization of the model.

  • $\frac{\partial L_m}{\partial b_{i}^{(r+1)}} = \delta_i^{(r+1)}$: gradient of ${\bf b}$ for optimization of the model.

  • $\delta_i^{(r)} = \frac{\partial L_m}{\partial a_{i}^{(r)}}$: link derivative for the backpropagation ("Backprop") procedure. This is the key to making the deep learning net optimization facile and analytical. Without backpropagation, the computations on a deep learning net grow exponentially in the number of hidden layers $R$.

  • $\left[ n \cdot n_1 + \sum_{i=1}^{R-1} n_i \cdot n_{i+1} + n_R \cdot K \right] + \left[ \sum_{i=1}^R n_i + K \right]$: number of parameters to be estimated in a deep learning net that has $n$ input variables, $R$ hidden layers, and $K$ output nodes. The first bracket counts the total number of ${\bf W}$ weights and the second bracket counts the bias parameters. This is also the number of gradients that need to be computed for Gradient Descent.
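As an illustration of the sigmoid and softmax definitions above, here is a minimal pure-Python sketch (the function names and test values are ours, not from the text):

```python
import math

def sigmoid(a):
    """Sigmoid: sigma(a) = e^a / (1 + e^a), output in (0, 1)."""
    return math.exp(a) / (1.0 + math.exp(a))

def softmax(a):
    """Normalized exponential: Y_k = e^{a_k} / sum_i e^{a_i}."""
    exps = [math.exp(ak) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))                    # 0.5
p = softmax([1.0, 2.0, 3.0])
print(p)                               # three values in (0, 1)
print(sum(p))                          # softmax outputs sum to 1
```

Note that the softmax outputs always sum to one, which is what lets them be read as class probabilities in the multiclass network.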
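The activations $a_j^{(r)}$ and $Z_j^{(r)}$ defined above can be computed layer by layer. A small sketch of one such forward step, using the sigmoid as the activation $f$ and hypothetical weights and inputs of our own choosing:

```python
import math

def sigmoid(a):
    return math.exp(a) / (1.0 + math.exp(a))

def layer_forward(X, W, b):
    """Compute a_j = sum_i W[j][i] * X[i] + b[j], then Z_j = f(a_j), at each node j."""
    a = [sum(Wji * Xi for Wji, Xi in zip(Wj, X)) + bj for Wj, bj in zip(W, b)]
    return [sigmoid(aj) for aj in a]

# Hypothetical example: n = 2 inputs, one hidden layer with n_1 = 3 nodes.
X = [0.5, -1.0]
W1 = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]   # W[j][i]: weight on input i at node j
b1 = [0.0, 0.1, -0.1]
Z1 = layer_forward(X, W1, b1)
print(Z1)   # three activations, each in (0, 1)
```

Stacking such steps, with the output of each layer fed as the input to the next and the softmax applied at the final layer, reproduces $h_W(X^{(i)})$.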

Data for all the code in the book

The data is downloadable from this link.
