# Glossary

## Quick Reference

For quick reference, we present the entire set of notation that is used in this monograph.

• $X^{(i)} \in \mathbb{R}^n$: input vector for observation $i$. Think of each observation as a row (example) in the training or testing data.

• $n$: size of the feature set, i.e., the number of input variables per observation.

• $M$: number of observations in the data (sample size).

• $T^{(i)}$: true class (category, value) of observation $i$.

• $Y^{(i)} = h_W(X^{(i)})$: predicted (estimated) output for observation $i$, generated by the model function $h_W$ with weights $W$.

• $L(Y^{(i)},T^{(i)})$: loss function that penalizes the algorithm for the error between the predicted value $Y^{(i)}$ and the true value $T^{(i)}$ for observation $i$. It is therefore also known as the "error" function.
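
As a concrete illustration (the glossary does not fix a particular form of $L$ here; squared error is one common choice for a scalar output), a loss function in code might look like:

```python
def squared_error(Y, T):
    # L(Y, T) = (Y - T)^2: penalizes the gap between the
    # predicted value Y and the true value T
    return (Y - T) ** 2

print(squared_error(0.8, 1.0))  # close to 0.04
```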

• $\sigma(a) = \frac{e^a}{1+e^a}$: sigmoid function that generates an output value in $(0,1)$. The sigmoid function is applied at the output layer in networks with a single output node.

• $Y = \sigma(\sum_{i=1}^n W_i X_i + b)$: generic form of the sigmoid function applied to weighted input data.
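
As a small sketch in plain Python (the function and variable names here are our own), the sigmoid of a weighted sum of inputs can be computed as:

```python
import math

def sigmoid(a):
    # sigma(a) = e^a / (1 + e^a), equivalently 1 / (1 + e^{-a})
    return 1.0 / (1.0 + math.exp(-a))

def predict(X, W, b):
    # Y = sigma(sum_i W_i * X_i + b)
    a = sum(w * x for w, x in zip(W, X)) + b
    return sigmoid(a)

print(predict([1.0, 2.0], [0.5, -0.25], 0.0))  # sigmoid(0) = 0.5
```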

• $W_{i j}^{(r)}$: weight parameter on input $j$ at the $i$-th node in hidden layer $r$ of the neural network.

• $b_i^{(r)}$: bias parameter at the $i$-th node in layer $r$.

• $Y_k, k = 1,\ldots,K$: outputs for a multiclass network with $K$ outputs.

• $K$: number of nodes at the output layer.

• $Y_k = \frac{e^{a_k}}{\sum_{i=1}^K e^{a_i}}, k=1,2,\ldots,K$: "softmax" function giving the output at node $k$ of the multiclass network's output layer; the $K$ outputs are nonnegative and sum to one. This is also known as the "normalized exponential" function.
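
A minimal softmax in plain Python (the max-subtraction step is a standard numerical-stability trick, not part of the definition above; it leaves the outputs unchanged):

```python
import math

def softmax(a):
    # Y_k = e^{a_k} / sum_i e^{a_i}
    m = max(a)
    exps = [math.exp(v - m) for v in a]  # shift by max(a) to avoid overflow
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # three probabilities that sum to 1
```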

• $\Phi(X_i)$: basis function applied to input $i$; used to transform the inputs and introduce nonlinearity before the sigmoid is applied. The modified sigmoid function is as follows:

$$\sigma \left( \sum_{i=1}^n W_i \Phi(X_i) + b \right)$$
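
For instance (a hypothetical choice of basis; the glossary does not fix a particular $\Phi$), taking $\Phi(x) = x^2$ makes the model nonlinear in the inputs:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# illustrative basis: Phi(x) = x^2 introduces nonlinearity in the inputs
phi = lambda x: x * x
Y = sigmoid(sum(w * phi(x) for w, x in zip([0.5, -0.25], [1.0, 2.0])) + 0.1)
print(Y)  # sigmoid(0.5*1 - 0.25*4 + 0.1) = sigmoid(-0.4)
```
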
• $Z_j^{(r)} = f(a_j^{(r)})$: "activation" function $f$ at node $j$ in hidden layer $r$, where
$$a_j^{(r)} = \sum_{i=1}^n W_{ji}^{(r)} X_i + b_j^{(r)}, \quad 1 \leq j \leq n_r$$
(written here for the first hidden layer; at deeper layers the inputs $X_i$ are replaced by the previous layer's activations $Z_i^{(r-1)}$).
• $n_r$: number of nodes at hidden layer $r$.

• $R$: number of hidden layers.
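
Putting the pieces above together, a forward pass through the layers can be sketched as follows (plain Python with sigmoid activations throughout; the names and toy numbers are our own):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(X, weights, biases, f=sigmoid):
    # weights[r][j][i] and biases[r][j] for each layer r;
    # at each layer: a_j = sum_i W_ji * Z_i + b_j, then Z_j = f(a_j)
    Z = X
    for W, b in zip(weights, biases):
        Z = [f(sum(w * z for w, z in zip(Wj, Z)) + bj)
             for Wj, bj in zip(W, b)]
    return Z

# toy network: 2 inputs, one hidden layer with 2 nodes, 1 output node
out = forward([1.0, -1.0],
              [[[0.5, 0.5], [1.0, -1.0]], [[1.0, 1.0]]],
              [[0.0, 0.0], [0.0]])
print(out)
```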

• $\min_{{\bf W,b}} \sum_{m=1}^M L_m \left[h_W(X^{(m)}),T^{(m)} \right]$: optimization required to fit the deep learning network.

• $s \ll M$: smaller random batch size for Stochastic Batch Gradient Descent.
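
A sketch of one such stochastic step, on a deliberately simple one-parameter model $Y = wX$ with squared-error loss (our own toy setup, chosen so the gradient is easy to write by hand):

```python
import random

def sgd_step(data, params, grad_fn, s, lr):
    # sample a random batch of size s << M and take one gradient step
    batch = random.sample(data, s)
    g = [0.0] * len(params)
    for X, T in batch:
        for k, gk in enumerate(grad_fn(params, X, T)):
            g[k] += gk / s                 # average gradient over the batch
    return [p - lr * gk for p, gk in zip(params, g)]

# toy data generated with true weight w = 2 (M = 100 observations)
random.seed(0)
data = [(i / 100, 2.0 * i / 100) for i in range(1, 101)]
grad = lambda p, x, t: [2.0 * (p[0] * x - t) * x]   # d/dw (w*x - t)^2
w = [0.0]
for _ in range(200):
    w = sgd_step(data, w, grad, s=8, lr=0.5)
print(w[0])   # approaches 2.0
```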

• $\frac{\partial L_m}{\partial W_{ij}^{(r+1)}} = \delta_i^{(r+1)} \cdot Z_j^{(r)}$: gradient of ${\bf W}$ for optimization of the model.

• $\frac{\partial L_m}{\partial b_{i}^{(r+1)}} = \delta_i^{(r+1)}$: gradient of ${\bf b}$ for optimization of the model.

• $\delta_i^{(r)} = \frac{\partial L_m}{\partial a_{i}^{(r)}}$: link derivative for the Backprop procedure. This is the key to making the deep learning net optimization tractable and analytical. Without backpropagation, the computations on a deep learning net grow exponentially in the number of hidden layers $R$.
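
To make the two gradient formulas above concrete, here is a pure-Python sketch of backpropagation for a one-hidden-layer network with sigmoid activations and squared-error loss (our own illustrative choices; the glossary itself does not fix $f$ or $L$):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop(X, T, W1, b1, W2, b2):
    """Gradients for a 1-hidden-layer sigmoid net with loss L = (Y - T)^2,
    computed via the link derivatives delta = dL/da."""
    # forward pass
    a1 = [sum(w * x for w, x in zip(Wj, X)) + bj for Wj, bj in zip(W1, b1)]
    Z1 = [sigmoid(a) for a in a1]
    a2 = sum(w * z for w, z in zip(W2, Z1)) + b2
    Y = sigmoid(a2)
    # backward pass: delta at the output, then propagated back through W2
    d2 = 2.0 * (Y - T) * Y * (1.0 - Y)                      # dL/da2
    d1 = [d2 * w * z * (1.0 - z) for w, z in zip(W2, Z1)]   # dL/da1_j
    # dL/dW_ij = delta_i * Z_j (with Z_j = X_j at the first layer); dL/db_i = delta_i
    gW1 = [[d * x for x in X] for d in d1]
    gb1 = d1
    gW2 = [d2 * z for z in Z1]
    gb2 = d2
    return gW1, gb1, gW2, gb2

gW1, gb1, gW2, gb2 = backprop([1.0, -0.5], 1.0,
                              [[0.2, -0.1], [0.4, 0.3]], [0.0, 0.1],
                              [0.5, -0.3], 0.05)
print(gb2)  # dL/db at the output node equals delta at that node
```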

• $\left[ n \cdot n_1 + \sum_{i=1}^{R-1} n_i \cdot n_{i+1} + n_R \cdot K \right] + \left[ \sum_{i=1}^R n_i + K \right]$: number of parameters to be estimated in a deep learning net that has $n$ input variables, $R$ hidden layers, and $K$ output nodes. The first bracket counts the ${\bf W}$ weights and the second counts the bias parameters. This is also the number of gradients that need to be computed for Gradient Descent.
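
The parameter count translates directly into a short function (a sketch; the function name is our own):

```python
def num_parameters(n, hidden, K):
    """Count W weights and b biases for a net with n inputs,
    hidden = [n_1, ..., n_R] hidden-layer widths, and K outputs."""
    sizes = [n] + list(hidden) + [K]
    # weights connect consecutive layers; each non-input node has one bias
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

print(num_parameters(3, [4, 5], 2))  # (3*4 + 4*5 + 5*2) + (4 + 5 + 2) = 53
```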

## Data for all the code in the book