Chapter 15 Glossary
For quick reference, we collect here the full set of notation used in this monograph.
\(X^{(i)} \in {\cal R}^n\): input vector for observation \(i\). Think of each observation as a row (example) in the training or testing data.
\(n\): size of the feature set, i.e., the number of input variables per observation.
\(M\): number of observations in the data (sample size).
\(T^{(i)}\): true class (category, value) of observation \(i\).
\(Y^{(i)} = h_W(X^{(i)})\): predicted (estimated) value of the output for observation \(i\), generated by the model function \(h_W\).
\(L(Y^{(i)},T^{(i)})\): loss function that penalizes the algorithm for the discrepancy between the predicted value \(Y^{(i)}\) and the true value \(T^{(i)}\) for observation \(i\); it is therefore also known as the “error” function.
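Since the glossary leaves \(L\) generic, a minimal sketch of two common choices may help; squared error and binary cross-entropy are shown here only as illustrative examples, not the only losses the text admits.

```python
import numpy as np

# Two common instances of the loss L(Y, T), shown only as illustrations.
def squared_error(Y, T):
    return 0.5 * (Y - T) ** 2

def binary_cross_entropy(Y, T, eps=1e-12):
    # Y in (0, 1) is the predicted probability; T in {0, 1} is the true class.
    Y = np.clip(Y, eps, 1 - eps)
    return -(T * np.log(Y) + (1 - T) * np.log(1 - Y))
```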
\(\sigma(a) = \frac{e^a}{1+e^a}\): sigmoid function that generates an output value in \((0,1)\). The sigmoid function is applied at the output layer in networks with a single output node.
\(Y = \sigma(\sum_{i=1}^n W_i X_i + b)\): generic form of the sigmoid function applied to weighted input data.
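A minimal NumPy sketch of this single-node computation; the input, weight, and bias values below are made up purely for illustration.

```python
import numpy as np

def sigmoid(a):
    """Sigmoid sigma(a) = e^a / (1 + e^a), mapping any real a into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# Y = sigma(sum_i W_i X_i + b) for one observation with n = 3 features.
X = np.array([0.5, -1.2, 3.0])   # illustrative input vector
W = np.array([0.1, 0.4, -0.2])   # illustrative weights W_1, ..., W_n
b = 0.05                         # illustrative bias
Y = sigmoid(W @ X + b)           # output lies in (0, 1)
print(Y)
```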
\(W_{ij}^{(r)}\): weight parameter on input \(j\) at the \(i\)-th node in hidden layer \(r\) of the neural network.
\(b_i^{(r)}\): bias parameter at the \(i\)-th node in layer \(r\).
\(Y_k, k = 1,\ldots,K\): outputs for a multiclass network with \(K\) outputs.
\(K\): number of nodes at the output layer.
\(Y_k = \frac{e^{a_k}}{\sum_{i=1}^K e^{a_i}}, k=1,2,\ldots,K\): “softmax” function for the \(K\) outputs at each node of the multiclass network output layer. This is also known as the “normalized exponential” function.
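A short sketch of the softmax computation; subtracting the maximum pre-activation is a standard numerical-stability device and does not change the result.

```python
import numpy as np

def softmax(a):
    """Y_k = e^{a_k} / sum_{i=1}^K e^{a_i}; shifting by max(a) avoids overflow."""
    z = np.exp(a - np.max(a))
    return z / z.sum()

# Illustrative pre-activations at K = 3 output nodes.
a = np.array([2.0, 1.0, 0.1])
Y = softmax(a)
print(Y, Y.sum())   # the K outputs are positive and sum to 1
```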
\(\Phi(X_i)\): basis function applied to input \(i\), used to modify the sigmoid and introduce nonlinearity. The modified sigmoid function is as follows:
\[ \sigma \left( \sum_{i=1}^n W_i \Phi(X_i) + b \right) \]
\(Z_j^{(r)} = f(a_j^{(r)})\): “activation” function \(f\), at node \(j\) in hidden layer \(r\), where
\[ a_j^{(r)} = \sum_{i=1}^n W_{ji}^{(r)} X_i + b_j^{(r)}, \quad 1 \leq j \leq n_r \]
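The forward step at one hidden layer can be sketched as follows; the tanh activation and the layer sizes are illustrative choices, since the text leaves \(f\) generic.

```python
import numpy as np

def hidden_layer(x, W, b, f=np.tanh):
    """Compute a_j^{(r)} = sum_i W_{ji}^{(r)} x_i + b_j^{(r)} and Z_j^{(r)} = f(a_j^{(r)})."""
    a = W @ x + b      # pre-activations, one per node j = 1, ..., n_r
    return f(a)        # activations Z^{(r)}

# Illustrative sizes: n = 3 inputs feeding a layer with n_r = 4 nodes.
rng = np.random.default_rng(0)
x = rng.normal(size=3)           # inputs to the layer
W = rng.normal(size=(4, 3))      # W_{ji}^{(r)}
b = np.zeros(4)                  # b_j^{(r)}
print(hidden_layer(x, W, b))
```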
\(n_r\): number of nodes at hidden layer \(r\).
\(R\): number of hidden layers.
\(\min_{{\bf W,b}} \sum_{m=1}^M L_m \left[h_W(X^{(m)}),T^{(m)} \right]\): optimization required to fit the deep learning network.
\(s \ll M\): size of the random batch, much smaller than the full sample size \(M\), used in Stochastic Batch Gradient Descent.
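A hedged sketch of one stochastic step with batch size \(s \ll M\); the gradient routine `grad_fn` is a hypothetical placeholder standing in for the Backprop gradients defined below.

```python
import numpy as np

def sgd_step(params, grad_fn, X, T, s, lr=0.01, rng=np.random.default_rng()):
    """One step of stochastic (batch) gradient descent: estimate the gradient of
    the loss on a random batch of s << M observations and move downhill."""
    M = X.shape[0]
    idx = rng.choice(M, size=s, replace=False)   # random batch of s observations
    grads = grad_fn(params, X[idx], T[idx])      # hypothetical batch-gradient routine
    return [p - lr * g for p, g in zip(params, grads)]
```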
\(\frac{\partial L_m}{\partial W_{ij}^{(r+1)}} = \delta_i^{(r+1)} \cdot Z_j^{(r)}\): gradient of the loss with respect to the weights \({\bf W}\), used in the optimization of the model.
\(\frac{\partial L_m}{\partial b_{i}^{(r+1)}} = \delta_i^{(r+1)}\): gradient of the loss with respect to the biases \({\bf b}\), used in the optimization of the model.
\(\delta_i^{(r)} = \frac{\partial L_m}{\partial a_{i}^{(r)}}\): link derivative for the backpropagation (Backprop) procedure. This is the key to making the optimization of a deep learning net tractable and analytical. Without backpropagation, the computations on a deep learning net grow exponentially in the number of hidden layers \(R\).
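To make the chain of link derivatives concrete, here is a sketch of Backprop for a tiny network with one tanh hidden layer and a linear output under squared error; the architecture and loss are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    """Gradients for a one-hidden-layer net (tanh hidden units, linear output,
    squared-error loss), using the delta recursions above."""
    # Forward pass
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)                          # Z^{(1)}
    y = W2 @ z1 + b2                          # output layer (linear)
    # Backward pass: delta^{(r)} = dL/da^{(r)}
    delta2 = y - t                            # output delta for L = 0.5*(y - t)^2
    delta1 = (W2.T @ delta2) * (1 - z1**2)    # chain rule through tanh
    # Gradients follow dL/dW = delta * Z and dL/db = delta
    return np.outer(delta1, x), delta1, np.outer(delta2, z1), delta2
```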
\(\left[ n \cdot n_1 + \sum_{i=1}^{R-1} n_i \cdot n_{i+1} + n_R \cdot K \right] + \left[ \sum_{i=1}^R n_i + K \right]\): number of parameters to be estimated in a deep learning net with \(n\) input variables, \(R\) hidden layers, and \(K\) output nodes. The first square bracket counts the \({\bf W}\) weights and the second square bracket counts the bias parameters. This is also the number of gradients that must be computed for Gradient Descent.
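A small helper makes the parameter count easy to check; the layer sizes in the example are made up.

```python
def n_params(n, hidden, K):
    """Total weights and biases for n inputs, hidden layers of sizes
    hidden = [n_1, ..., n_R], and K outputs, matching the expression above."""
    sizes = [n] + list(hidden) + [K]
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(hidden) + K
    return weights + biases

# Example: n = 10 inputs, hidden layers of 5 and 5 nodes, K = 3 outputs.
# Weights: 10*5 + 5*5 + 5*3 = 90; biases: 5 + 5 + 3 = 13; total = 103.
print(n_params(10, [5, 5], 3))   # 103
```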