# Chapter 15 Glossary

For quick reference, we present the notation used throughout this monograph.

• $$X^{(i)} \in {\cal R}^n$$: input vector for observation $$i$$. Think of each observation as a row (example) in the training or testing data.

• $$n$$: size of the feature set, i.e., the number of input variables per observation.

• $$M$$: number of observations in the data (sample size).

• $$T^{(i)}$$: true class (category, value) of observation $$i$$.

• $$Y^{(i)} = h_W(X^{(i)})$$: predicted (estimated) value of the output for observation $$i$$, generated by the network function $$h_W$$ with parameters $$W$$.

• $$L(Y^{(i)},T^{(i)})$$: loss function that penalizes the algorithm for the error between the predicted value $$Y^{(i)}$$ and the true value $$T^{(i)}$$ for observation $$i$$. Therefore, this is also known as the “error” function.
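Two common concrete choices of $$L$$ are sketched below; the monograph's $$L$$ is generic, so the squared-error and cross-entropy forms here are illustrative assumptions.

```python
import numpy as np

def squared_error(Y, T):
    """One common choice of L(Y, T): the squared prediction error."""
    return (Y - T) ** 2

def cross_entropy(Y, T):
    """Another common choice of L(Y, T) for a (0,1)-valued prediction of a binary class T."""
    return -(T * np.log(Y) + (1 - T) * np.log(1 - Y))

print(squared_error(0.8, 1.0), cross_entropy(0.8, 1.0))  # penalties for predicting 0.8 when T = 1
```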

• $$\sigma(a) = \frac{e^a}{1+e^a}$$: sigmoid function that generates an output value in $$(0,1)$$. The sigmoid function is applied at the output layer in networks with a single output node.

• $$Y = \sigma(\sum_{i=1}^n W_i X_i + b)$$: generic form of the sigmoid function applied to weighted input data.
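A minimal NumPy sketch of the two formulas above; the particular values of $$X$$, $$W$$, and $$b$$ are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = e^a / (1 + e^a), mapping any real a into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))  # algebraically identical form, numerically safer for large a

# Single-output prediction Y = sigma(sum_i W_i X_i + b)
X = np.array([0.5, -1.2, 3.0])   # one observation with n = 3 features
W = np.array([0.1, 0.4, -0.2])   # weights W_i
b = 0.05                         # bias b
Y = sigmoid(W @ X + b)           # scalar output in (0, 1)
```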

• $$W_{ij}^{(r)}$$: weight on input $$j$$ at the $$i$$-th node in hidden layer $$r$$ of the neural network.

• $$b_i^{(r)}$$: bias parameter at the $$i$$-th node in layer $$r$$.

• $$Y_k, k = 1,\ldots,K$$: outputs for a multiclass network with $$K$$ outputs.

• $$K$$: number of nodes at the output layer.

• $$Y_k = \frac{e^{a_k}}{\sum_{i=1}^K e^{a_i}}, k=1,2,\ldots,K$$: “softmax” function for the $$K$$ outputs at each node of the multiclass network output layer. This is also known as the “normalized exponential” function.
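A sketch of the softmax computation; subtracting the maximum activation before exponentiating is a standard numerical-stability device, not part of the notation above.

```python
import numpy as np

def softmax(a):
    """Y_k = e^{a_k} / sum_i e^{a_i} for a vector of K output activations."""
    e = np.exp(a - np.max(a))  # shifting by max(a) leaves the ratios unchanged but avoids overflow
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])  # K = 3 output activations
Y = softmax(a)                 # K probabilities that sum to 1
```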

• $$\Phi(X_i)$$: basis function applied to input $$i$$. This is used to modify the sigmoid and introduce nonlinearity. The modified sigmoid function is as follows:

$\sigma \left( \sum_{i=1}^n W_i \Phi(X_i) + b \right)$
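A sketch of the modified sigmoid; the quadratic basis $$\Phi(x) = x^2$$ is an arbitrary illustrative assumption, since any fixed nonlinear transform of the inputs can play the role of $$\Phi$$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

Phi = np.square                  # illustrative basis function Phi(x) = x^2

X = np.array([0.5, -1.2, 3.0])
W = np.array([0.1, 0.4, -0.2])
b = 0.05
Y = sigmoid(W @ Phi(X) + b)      # sigma(sum_i W_i Phi(X_i) + b)
```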

• $$Z_j^{(r)} = f(a_j^{(r)})$$: “activation” function $$f$$ applied at node $$j$$ in hidden layer $$r$$, where

$a_j^{(r)} = \sum_{i=1}^n W_{ji}^{(r)} X_i + b_j^{(r)}, \quad 1 \leq j \leq n_r$
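A sketch of one hidden layer's forward pass under these definitions; the layer sizes, random weights, and the choice of $$f = \tanh$$ are assumptions for illustration.

```python
import numpy as np

n, n_r = 3, 4                        # n inputs, n_r nodes in hidden layer r
rng = np.random.default_rng(0)
W_r = rng.normal(size=(n_r, n))      # W^{(r)}_{ji}: row j holds the weights into node j
b_r = np.zeros(n_r)                  # biases b^{(r)}_j
X = np.array([0.5, -1.2, 3.0])       # one input observation

a_r = W_r @ X + b_r                  # a^{(r)}_j = sum_i W^{(r)}_{ji} X_i + b^{(r)}_j
Z_r = np.tanh(a_r)                   # Z^{(r)}_j = f(a^{(r)}_j), here with f = tanh
```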

• $$n_r$$: number of nodes at hidden layer $$r$$.

• $$R$$: number of hidden layers.

• $$\min_{{\bf W,b}} \sum_{m=1}^M L_m \left[h_W(X^{(m)}),T^{(m)} \right]$$: optimization required to fit the deep learning network.
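In code, this objective is just the sum of per-observation losses over the training data; the squared-error loss and the single-sigmoid stand-in for $$h_W$$ below are assumptions for illustration.

```python
import numpy as np

def h_W(X, W, b):
    """Stand-in for the full network forward pass h_W; here a single sigmoid unit."""
    return 1.0 / (1.0 + np.exp(-(W @ X + b)))

def total_loss(Xs, Ts, W, b):
    """sum_{m=1}^{M} L_m[h_W(X^{(m)}), T^{(m)}], with a squared-error L_m."""
    return sum((h_W(X, W, b) - T) ** 2 for X, T in zip(Xs, Ts))

# Tiny usage example: M = 2 observations with n = 3 features each
Xs = np.array([[0.5, -1.2, 3.0], [1.0, 0.0, -0.5]])
Ts = np.array([1.0, 0.0])
loss = total_loss(Xs, Ts, W=np.zeros(3), b=0.0)
```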

• $$s \ll M$$: size of the smaller random batch used in Stochastic Batch Gradient Descent.
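A self-contained sketch of a stochastic batch update on a single sigmoid unit: each step uses only $$s$$ randomly drawn observations out of the $$M$$ available. The synthetic data, cross-entropy loss, and learning rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, s = 1000, 3, 32                                 # sample size, features, batch size (s << M)
X_train = rng.normal(size=(M, n))
T_train = (X_train @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # synthetic binary labels

W, b, lr = np.zeros(n), 0.0, 0.1
for step in range(500):
    idx = rng.choice(M, size=s, replace=False)        # random batch of s observations
    Xb, Tb = X_train[idx], T_train[idx]
    Y = 1.0 / (1.0 + np.exp(-(Xb @ W + b)))           # forward pass through the sigmoid unit
    grad_a = Y - Tb                                   # dL/da for cross-entropy loss with a sigmoid
    W -= lr * Xb.T @ grad_a / s                       # gradient step on the weights W
    b -= lr * grad_a.mean()                           # gradient step on the bias b
```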

• $$\frac{\partial L_m}{\partial W_{ij}^{(r+1)}} = \delta_i^{(r+1)} \cdot Z_j^{(r)}$$: gradient of the loss with respect to the weights $${\bf W}$$, used to optimize the model.

• $$\frac{\partial L_m}{\partial b_{i}^{(r+1)}} = \delta_i^{(r+1)}$$: gradient of the loss with respect to the biases $${\bf b}$$, used to optimize the model.

• $$\delta_i^{(r)} = \frac{\partial L_m}{\partial a_{i}^{(r)}}$$: link derivative used in the Backprop procedure. This is the key to making optimization of the deep learning net tractable and analytical. Without backpropagation, the computations in a deep learning net grow exponentially in the number of hidden layers $$R$$.
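A minimal backpropagation sketch for one observation through a network with a single hidden layer and one sigmoid output; the tanh hidden activation and cross-entropy loss are assumptions, chosen so that the output-layer delta simplifies to $$Y - T$$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n1 = 3, 4                               # inputs and hidden-layer nodes
X, T = rng.normal(size=n), 1.0             # one observation and its true class
W1, b1 = rng.normal(size=(n1, n)), np.zeros(n1)
W2, b2 = rng.normal(size=n1), 0.0

# Forward pass
a1 = W1 @ X + b1                           # pre-activations a^{(1)} at the hidden layer
Z1 = np.tanh(a1)                           # Z^{(1)} = f(a^{(1)})
Y = 1.0 / (1.0 + np.exp(-(W2 @ Z1 + b2)))  # sigmoid output

# Backward pass, using the link derivatives delta = dL/da
delta2 = Y - T                             # output-layer delta (cross-entropy + sigmoid)
dW2 = delta2 * Z1                          # dL/dW^{(2)}_j = delta^{(2)} * Z^{(1)}_j
db2 = delta2                               # dL/db^{(2)}   = delta^{(2)}
delta1 = (1.0 - Z1**2) * (W2 * delta2)     # hidden deltas: f'(a^{(1)}) times the back-propagated signal
dW1 = np.outer(delta1, X)                  # dL/dW^{(1)}_{ji} = delta^{(1)}_j * X_i
db1 = delta1                               # dL/db^{(1)}_j   = delta^{(1)}_j
```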

• $$\left[ n \cdot n_1 + \sum_{i=1}^{R-1} n_i \cdot n_{i+1} + n_R \cdot K \right] + \left[ \sum_{i=1}^R n_i + K \right]$$: number of parameters to be estimated in a deep learning net that has $$n$$ input variables, $$R$$ hidden layers, and $$K$$ output nodes. The first bracket counts the $${\bf W}$$ weights and the second bracket counts the bias parameters. This is also the number of gradients that need to be computed for Gradient Descent.
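A sketch that tallies the parameter count from the bracketed formula; the particular layer sizes in the usage line are illustrative assumptions.

```python
def num_parameters(n, hidden_sizes, K):
    """Count the W weights and bias parameters for n inputs, the given hidden-layer sizes, and K outputs."""
    sizes = [n] + list(hidden_sizes) + [K]
    weights = sum(sizes[i] * sizes[i + 1] for i in range(len(sizes) - 1))  # first bracket
    biases = sum(hidden_sizes) + K                                         # second bracket
    return weights + biases

# Example: n = 10 inputs, hidden layers of sizes 32 and 16 (R = 2), K = 3 outputs
print(num_parameters(10, [32, 16], 3))  # (10*32 + 32*16 + 16*3) + (32 + 16 + 3) = 931
```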