For quick reference, we present the entire set of notation that is used in this monograph.
$X^{(i)} \in {\cal R}^n$: input vector for observation $i$. Think of each observation as a row (example) in the training or testing data.
$n$: size of the feature set, i.e., the number of input variables per observation.
$M$: number of observations in the data (sample size).
$T^{(i)}$: true class (category, value) of observation $i$.
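To fix ideas, here is a minimal sketch of how this data layout might be held in code, assuming NumPy arrays and illustrative sizes ($M = 100$, $n = 4$) with binary class labels; the sizes and the use of NumPy are assumptions for illustration only:

```python
import numpy as np

M, n = 100, 4                          # illustrative sample size and feature count (assumed)
X = np.random.randn(M, n)              # row X[i] is the input vector X^(i) in R^n
T = np.random.randint(0, 2, size=M)    # T[i] is the true class of observation i (binary here)
```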
$Y^{(i)} = h_W(X^{(i)})$: model function $h_W$ that generates the predicted (estimated) value of the output $Y^{(i)}$ for observation $i$.
$L(Y^{(i)},T^{(i)})$: loss function that penalizes the algorithm for the error between the predicted value $Y^{(i)}$ and the true value $T^{(i)}$ for observation $i$. Therefore, this is also known as the "error" function.
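As one concrete illustration (not necessarily the loss used later in this monograph), the binary cross-entropy loss for a single observation with a sigmoid output could be coded as:

```python
import numpy as np

def binary_cross_entropy(y_pred, t_true, eps=1e-12):
    """L(Y, T) for one observation: penalizes the gap between prediction and truth."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # guard against log(0)
    return -(t_true * np.log(y_pred) + (1 - t_true) * np.log(1 - y_pred))
```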
$\sigma(a) = \frac{e^a}{1+e^a}$: sigmoid function that generates an output value in $(0,1)$. The sigmoid function is applied at the output layer in networks with a single output node.
$Y = \sigma(\sum_{i=1}^n W_i X_i + b)$: generic form of the sigmoid function applied to weighted input data.
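A minimal sketch of this single-node computation, using the algebraically equivalent form $\sigma(a) = \frac{1}{1+e^{-a}}$ and assuming NumPy:

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = e^a / (1 + e^a) = 1 / (1 + e^{-a}); output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def node_output(X, W, b):
    """Y = sigma(sum_i W_i X_i + b) for one input vector X and weight vector W."""
    return sigmoid(np.dot(W, X) + b)
```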
$W_{ij}^{(r)}$: weight parameter on input $j$ at the $i$-th node in hidden layer $r$ of the neural network.
$b_i^{(r)}$: bias parameter at the $i$-th node in layer $r$.
$Y_k, k = 1,\ldots,K$: outputs for a multiclass network with $K$ outputs.
$K$: number of nodes at the output layer.
$Y_k = \frac{e^{a_k}}{\sum_{i=1}^K e^{a_i}}, k=1,2,\ldots,K$: "softmax" function that produces the output $Y_k$ at node $k$ of the multiclass network's output layer. This is also known as the "normalized exponential" function.
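A numerically stable sketch of the softmax over the $K$ output-layer activations $a_1,\ldots,a_K$ (shifting by the maximum does not change the result but avoids overflow):

```python
import numpy as np

def softmax(a):
    """Y_k = e^{a_k} / sum_i e^{a_i} for a vector a of K output activations."""
    e = np.exp(a - np.max(a))   # subtract max(a) for numerical stability
    return e / e.sum()
```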
$\Phi(X_i)$: basis function of input $i$. This is used to modify the sigmoid and introduce nonlinearity. The modified sigmoid function is $Y = \sigma\left(\sum_{i=1}^n W_i \Phi(X_i) + b\right)$.
$n_r$: number of nodes at hidden layer $r$.
$R$: number of hidden layers.
$\min_{{\bf W,b}} \sum_{m=1}^M L_m \left[h_W(X^{(m)}),T^{(m)} \right]$: optimization required to fit the deep learning network.
$s \ll M$: smaller random batch size for Stochastic Batch Gradient Descent.
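A sketch of one step of this batch optimization, assuming the weights and biases are packed into flat arrays, a hypothetical routine `gradients` that returns $\partial L / \partial {\bf W}$ and $\partial L / \partial {\bf b}$ for the sampled batch, and an assumed learning rate `eta`:

```python
import numpy as np

def sgd_step(W, b, X, T, gradients, s=32, eta=0.01):
    """One stochastic batch gradient descent step on a random batch of size s << M."""
    M = X.shape[0]
    batch = np.random.choice(M, size=s, replace=False)    # random subset of observations
    grad_W, grad_b = gradients(W, b, X[batch], T[batch])  # batch-loss gradients (assumed routine)
    W = W - eta * grad_W                                  # descend along the W gradient
    b = b - eta * grad_b                                  # descend along the b gradient
    return W, b
```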
$\frac{\partial L_m}{\partial W_{ij}^{(r+1)}} = \delta_i^{(r+1)} \cdot Z_j^{(r)}$: gradient of ${\bf W}$ for optimization of the model, where $Z_j^{(r)}$ is the output of node $j$ at layer $r$.
$\frac{\partial L_m}{\partial b_{i}^{(r+1)}} = \delta_i^{(r+1)}$: gradient of ${\bf b}$ for optimization of the model.
$\delta_i^{(r)} = \frac{\partial L_m}{\partial a_{i}^{(r)}}$: link derivative for the Backprop procedure. This is the key to making the deep learning net optimization facile and analytical. Without backpropagation, the computations on a deep learning net grow exponentially in the number of hidden layers $R$.
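A sketch of the backward recursion $\delta_j^{(r)} = \sigma'(a_j^{(r)}) \sum_i W_{ij}^{(r+1)} \delta_i^{(r+1)}$ for a fully connected network with sigmoid hidden units, assuming Python lists `W` and `a` indexed by layer (all names here are assumptions, not the monograph's code):

```python
import numpy as np

def sigmoid_prime(a):
    """Derivative of the sigmoid: sigma(a) * (1 - sigma(a))."""
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

def backprop_deltas(W, a, delta_out):
    """Propagate delta_i^(r) = dL/da_i^(r) backward through the hidden layers.

    W[r] is the weight matrix feeding layer r (shape: nodes in layer r by nodes
    in layer r-1); a[r] holds the pre-activations of layer r (a[0] is an unused
    placeholder for the input layer); delta_out is dL/da at the output layer.
    """
    L = len(a) - 1                      # index of the output layer
    delta = [None] * (L + 1)
    delta[L] = delta_out
    for r in range(L - 1, 0, -1):       # hidden layers, back to front
        delta[r] = (W[r + 1].T @ delta[r + 1]) * sigmoid_prime(a[r])
    return delta
```

The weight and bias gradients then follow directly from the two relations above ($\partial L_m/\partial W_{ij}^{(r+1)} = \delta_i^{(r+1)} Z_j^{(r)}$ and $\partial L_m/\partial b_i^{(r+1)} = \delta_i^{(r+1)}$), so the cost of one backward pass grows linearly, not exponentially, in $R$.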
$\left[ n \cdot n_1 + \sum_{i=1}^{R-1} n_i \cdot n_{i+1} + n_R \cdot K \right] + \left[ \sum_{i=1}^R n_i + K \right]$: number of parameters to be estimated in a deep learning net that has $n$ input variables, $R$ hidden layers, and $K$ output nodes. The first square bracket counts the ${\bf W}$ weights and the second counts the bias parameters. This is also the number of gradients that need to be computed for Gradient Descent.
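The count is easy to verify in code; a small sketch with an assumed configuration ($n=4$ inputs, hidden widths $n_1=3$, $n_2=2$, and $K=2$ outputs):

```python
def count_parameters(n, hidden, K):
    """Total W weights plus biases for n inputs, hidden = [n_1, ..., n_R], K outputs."""
    widths = [n] + list(hidden) + [K]
    n_weights = sum(widths[i] * widths[i + 1] for i in range(len(widths) - 1))
    n_biases = sum(hidden) + K
    return n_weights + n_biases

# weights: 4*3 + 3*2 + 2*2 = 22, biases: 3 + 2 + 2 = 7, total 29
print(count_parameters(4, [3, 2], 2))   # -> 29
```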