%pylab inline
from ipypublish import nb_setup
The Linear Models that we discussed in Chapter LinearLearningModels work well if the input dataset is approximately linearly separable, but they have limited accuracy for complex datasets. Some of the issues with Linear Models are the following:
If the input data is not linearly separable, then the designer has to expend a lot of effort in finding an appropriate feature map that makes it so. It would be nice to have a model that solves this problem automatically, by learning the best feature map from the data itself.
We showed that the model weight parameters could be regarded as a filter, so that for $K$ classes, the operation of the system is equivalent to trying to match the input with $K$ different filters. The limitations of this approach can be seen in the filter for the "horse" class in Figure LC2. The filter looks like a horse with two heads, since it is trying its best to match with a horse image, irrespective of the direction in which the horse is facing. This type of filtering will clearly not work for cases in which the horse is standing in some other orientation, or if it is located in a corner of the image. The fact that the best accuracy that can be achieved with linear classifiers and the CIFAR-10 Dataset is only about 40% is a reflection of this shortcoming. The linear system tries to do classification by taking each and every pixel into account, which is a difficult task. What if it were possible to create representations for higher level features in the image, say the head of the horse or its legs, and then use these for classification instead? This would enable the system to identify a horse irrespective of its orientation and its location in the image. This is precisely what Deep Learning systems do.
In general a way to make any model more powerful is by increasing the number of parameters. However in a Linear Model the number of parameters is constrained to $KN + K$ by the sizes of the input data and the number of output classes, which limits its modeling power.
#LC2
nb_setup.images_hconcat(["DL_images/LC2.png"], width=600)
Dense Feed Forward Networks were designed with the objective of overcoming these shortcomings. As Figure DFN1 shows, we are looking for a functional block between the input vector $(x_1,...,x_N)$ and the output logits $(a_1,...,a_K)$, that can create a new representation vector $(z_1,...,z_P)$ which satisfies the approximate linear separability property. One way to do this is shown in Figure DFN2, which is a Deep Feed Forward Network with a single Hidden Layer. Note the following:
The Input and Output Layers are as before, but we have added a third layer, the so-called Hidden Layer, in between. The Input Layer is fully connected to the Hidden Layer, i.e., each node in the Input Layer is connected to every node in the Hidden Layer, and the same holds true for connections between the Hidden Layer and the Output Layer. DLNs with these characteristics are called Dense Feed Forward Neural Networks. Later in this monograph we will come across examples of DLNs where these properties don’t apply, either because the fully connected property does not hold (as in Convolutional Neural Networks), or because the DLN incorporates feedback loops (as in Recurrent Neural Networks).
The $j$-th node in the Hidden Layer performs the following computation on the input variables $x_i$ to generate an output $z_j^{(1)}, 1 \leq j \leq P$ given by $$ a_j^{(1)} = \sum_{i=1}^N w_{ji}^{(1)} x_i + b_j^{(1)} $$ $$ z_j^{(1)} = f(a_j^{(1)}) $$ The vector $(a^{(1)}_1,...,a_P^{(1)})$, which we call the Pre-Activation, is computed as a simple linear combination of the Input Vector. The output of the Hidden Layer $(z^{(1)}_1,...,z_P^{(1)})$ which we call the Activation, is computed as an elementwise non-linear function of the Pre-Activations.
The Output Layer operates on the Activations $z_j^{(1)}$ from the Hidden Layer, and computes the logits for the $K$ classes $(a_1^{(2)},...,a_K^{(2)})$. $$ a_k^{(2)} = \sum_{i=1}^P w_{ki}^{(2)} z_i^{(1)} + b_k^{(2)}, \ \ 1\le k\le K $$ The classification probabilities $y_k, 1\le k\le K$ are obtained by applying the Softmax function to the logits. $$ y_k = \frac{\exp(a_k^{(2)})}{\sum_{j=1}^K \exp(a_j^{(2)})}, \ \ 1\le k\le K $$ Note that the logit and classification probability computations are identical to those done in Linear Systems, with the inputs $X$ now replaced by the activations $Z$.
The weight parameters $w_{ij}^{(1)}, 1\le i\le P,1\le j\le N; w_{ij}^{(2)}, 1\le i\le K,1\le j\le P$ and the bias parameters $b_i^{(1)}, 1\le i\le P; b_i^{(2)}, 1\le i\le K$ have to be learnt using the training data, as in Linear Models. The total number of parameters needed to describe this network is given by $NP + P + PK + K$, which is now dependent on the number of nodes in the Hidden Layer $P$. Hence we can build a Dense Feed Forward model with more powerful classification ability by increasing the number of nodes in the Hidden Layer, which is an option that does not exist in Linear Systems.
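The computations in the list above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes $N=4$, $P=3$, $K=2$, the random weights, and the helper names `forward` and `count_params` are all hypothetical, and ReLU is assumed for the Activation Function $f$.

```python
import numpy as np

def relu(a):
    # Assumed choice for the activation function f
    return np.maximum(a, 0.0)

def softmax(a):
    # Subtracting the max does not change the result but avoids overflow
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    a1 = W1 @ x + b1      # pre-activations of the Hidden Layer, shape (P,)
    z1 = relu(a1)         # activations of the Hidden Layer, shape (P,)
    a2 = W2 @ z1 + b2     # logits, shape (K,)
    return softmax(a2)    # classification probabilities, shape (K,)

def count_params(N, P, K):
    # NP + P parameters in the Hidden Layer, PK + K in the Output Layer
    return N * P + P + P * K + K

# Hypothetical small sizes, chosen only for illustration
N, P, K = 4, 3, 2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(P, N)), np.zeros(P)
W2, b2 = rng.normal(size=(K, P)), np.zeros(K)

y = forward(rng.normal(size=N), W1, b1, W2, b2)
print(y, y.sum())              # K probabilities that sum to 1
print(count_params(N, P, K))   # 23, versus K*N + K = 10 for a Linear Model
```

Increasing `P` in this sketch grows the parameter count without touching the input or output sizes, which is the extra degree of freedom described above.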
#DFN1
nb_setup.images_hconcat(["DL_images/DFN1.png"], width=600)
#DFN2
nb_setup.images_hconcat(["DL_images/DFN2.png"], width=600)
The activations $(z^{(1)}_1,...,z_P^{(1)})$ correspond to the new data representation that we are looking for. They filter the input and create higher layer representations, which are then used by the logit layer for classification. Note that the filtering done by the Hidden Layer is non-linear due to the presence of the non-linear Function $f$. This function is called the Activation Function, and plays an important role in system performance. The most popular Activation Function in use is called the Rectified Linear Unit, or ReLU, and is shown in Figure DFN3. It simply passes on the pre-activations that are greater than zero, and blocks those that are less.
The presence of the Activation Function is critical to the functioning of the DLN, and it can be easily shown that if it were omitted, then the Hidden and Output layers could be collapsed together so that the resulting model would be equivalent to a Linear Model. Indeed the presence of Activation Functions gives the system its modeling power, and in general we will see later in the book that DLN systems can be made more powerful by increasing the amount of non-linear processing. The appropriate choice of Activation Function has a big influence on the performance of the DLN, and the discovery of more effective Activation Functions such as the ReLU has helped make DLNs easier to train.
nb_setup.images_hconcat(["DL_images/DFN3.png"], width=600)
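The claim that the layers collapse without an Activation Function can be checked numerically. The sketch below uses hypothetical random weights to confirm that two stacked linear layers equal a single linear layer with weights $W^{(2)}W^{(1)}$ and bias $W^{(2)}b^{(1)} + b^{(2)}$. It then uses a hand-picked two-node Hidden Layer (an illustrative assumption, not taken from the text) for which the ReLU version computes $|x|$, a function that no Linear Model can represent.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def identity(a):
    return a

def net(x, W1, b1, W2, b2, f):
    # Single Hidden Layer network, returning the logits (no Softmax)
    return W2 @ f(W1 @ x + b1) + b2

# 1. Random weights: without an activation, the two layers collapse into one
rng = np.random.default_rng(1)
N, P, K = 5, 4, 3
W1, b1 = rng.normal(size=(P, N)), rng.normal(size=P)
W2, b2 = rng.normal(size=(K, P)), rng.normal(size=K)
x = rng.normal(size=N)

two_layer = net(x, W1, b1, W2, b2, identity)
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
collapsed = np.allclose(two_layer, one_layer)
print(collapsed)

# 2. Hand-picked weights: the Hidden Layer sees +x and -x, the output sums them
W1h, b1h = np.array([[1.0], [-1.0]]), np.zeros(2)
W2h, b2h = np.array([[1.0, 1.0]]), np.zeros(1)
xh = np.array([-2.0])

linear_out = net(xh, W1h, b1h, W2h, b2h, identity)  # x - x = 0 for every x
relu_out = net(xh, W1h, b1h, W2h, b2h, relu)        # relu(x) + relu(-x) = |x|
print(linear_out, relu_out)
```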
The system shown in Figure DFN2 incorporates only a single Hidden Layer. Why not continue the process and enable the model to create higher level representations by adding additional hidden layers? This is certainly possible and the resulting network is shown in Figure DFN4. It shows a Dense Feed Forward Network with $R$ hidden layers, such that layer $r$ consists of $P^r$ nodes. The equations describing this network can be written as:
With each successive Hidden Layer, this network creates representations at higher levels of abstraction.
Using matrix notation, these equations can be written compactly as (with $Z^{(0)} = X$):
$$ A^{(r)} = W^{(r)}Z^{(r-1)} + B^{(r)},\ \ Z^{(r)} = f(A^{(r)}),\ \ 1\le r\le R $$
$$ A^{(R+1)} = W^{(R+1)}Z^{(R)} + B^{(R+1)},\ \ Y = h(A^{(R+1)}) $$
In these equations $f$ and $h$ represent the Activation and Softmax functions respectively. The Activation Function $f$ is applied elementwise across all the matrix entries, while the Softmax function $h$ normalizes across the $K$ logits belonging to each input sample.
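The matrix form of the forward pass maps directly onto a short loop. The following NumPy sketch makes several assumptions that are not in the text: each column of $X$ holds one input sample, the bias $B^{(r)}$ is stored as a column vector that is broadcast across samples, ReLU is used for $f$, and the layer sizes are hypothetical.

```python
import numpy as np

def relu(A):
    return np.maximum(A, 0.0)

def softmax_columns(A):
    # Normalize over the K logits of each sample (one sample per column)
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def forward(X, weights, biases):
    Z = X                                         # Z^(0) = X
    for W, B in zip(weights[:-1], biases[:-1]):   # hidden layers r = 1..R
        Z = relu(W @ Z + B)                       # Z^(r) = f(W^(r) Z^(r-1) + B^(r))
    A_out = weights[-1] @ Z + biases[-1]          # logits A^(R+1)
    return softmax_columns(A_out)                 # Y = h(A^(R+1))

# Hypothetical sizes: N = 8 inputs, R = 3 hidden layers, K = 3 classes
layer_sizes = [8, 6, 5, 4, 3]
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]

X = rng.normal(size=(layer_sizes[0], 7))   # a batch of 7 samples
Y = forward(X, weights, biases)
print(Y.shape)          # (3, 7): K probabilities for each of the 7 samples
print(Y.sum(axis=0))    # each column sums to 1
```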
nb_setup.images_hconcat(["DL_images/DFN4.png"], width=600)
We have introduced two degrees of freedom in DLN design in this chapter: (1) The number of Hidden Layers, and (2) The number of nodes per Hidden Layer. This leads to the following questions:
Unfortunately there are few theoretical results in this area that give definite answers to these questions. However there is one interesting theorem regarding Deep Feed Forward Networks with a single Hidden Layer, whose proof was given by Cybenko in 1989:
Given an arbitrary continuous function $g$ of $n$ variables such as
$$ y = g(x_1,...,x_n) $$
it is always possible to find a Deep Feed Forward Network with a single Hidden Layer, such that the output of the network approximates $g$, and the approximation can be made as close as we want by adding nodes to the Hidden Layer.
This property is of course dependent on the form of the Activation Function used, but it has been proven to be true for the most commonly used functions. Hence it should be possible, in principle, to solve any classification problem with a Dense Feed Forward Network containing a single Hidden Layer. However the theorem does not specify the number of hidden nodes needed for a particular problem, nor does it say how to find the corresponding weights.
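The effect of adding hidden nodes can be illustrated with a small experiment. The sketch below is not Cybenko's construction: as a simplifying assumption, the Hidden Layer weights are drawn at random and only the output weights are fitted by least squares, with $\tanh$ as the Activation Function and $\sin(2x)$ as an arbitrary target. The hidden nodes are nested (the first $P$ of a fixed pool), so the fitting error should not increase as $P$ grows, apart from numerical round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(2 * x)             # arbitrary continuous function to approximate

# A fixed pool of random hidden nodes; the network with P nodes uses the first P
P_max = 100
w = rng.normal(scale=2.0, size=P_max)
b = rng.normal(scale=2.0, size=P_max)
H_full = np.tanh(np.outer(x, w) + b)          # hidden activations, shape (200, P_max)

def fit_rms(P):
    # Fit only the output weights and output bias by least squares
    H = np.column_stack([H_full[:, :P], np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.sqrt(np.mean((H @ coef - target) ** 2))

errors = {P: fit_rms(P) for P in (1, 5, 20, 100)}
for P, err in errors.items():
    print(P, err)
```

The printed errors measure the fit on the sampled points only, so they say nothing about how such a network would be trained in practice or how well it generalizes.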
In practice, it has been observed that adding Hidden Layers is an advantageous way to increase the modeling power of a DLN, for the following reasons:
More layers allow the model to develop a hierarchical representation of the input data, which simplifies the task of the linear classifier in the final layer.
Having additional layers increases the amount of non-linearity and thus the modeling capacity.
This still leaves open the question of how wide the network should be. There has been some progress on this more recently [Li, Xu, et al.](https://arxiv.org/pdf/1712.09913.pdf), and their key finding is shown in Figure convnet46.
#convnet46
nb_setup.images_hconcat(["DL_images/convnet46.png"], width=700)
As illustrated in the figure, the width of the network has a critical effect on the smoothness of its Loss Function. The figure shows four contour plots for the Loss Function of an increasingly wider network, and as can be seen the Loss Function landscape becomes progressively smoother as we move from left to right. This makes the optimization task much easier. This effect is more pronounced for the very deep networks with hundreds of layers that we will study later in the course, and less of an issue in a network with only a few layers.
If the Loss Function is highly chaotic as in the leftmost plot, then the optimization becomes highly dependent on the initialization values, since a bad initialization can cause the trajectory to get caught in the ups and downs of the uneven loss landscape. Increasing the width of the network promotes flat minimizers and prevents the transition to chaotic behavior, which also improves the generalization ability of the network.
The other question that we raised is whether the DLN performance keeps improving as we add more and more Hidden Layers. This is actually not the case, since the model performance is constrained by the following factors:
The Vanishing Gradient Problem: In order to train a multilayer Deep Feed Forward Network, the gradients $\frac{\partial L}{\partial w^{(r)}_{ij}}$ and $\frac{\partial L}{\partial b^{(r)}_i}$ have to be computed. It turns out that if the number of layers is large, the gradients of the weights that are either in the first few layers or the last few layers converge towards zero as the training progresses. Once this happens, the corresponding weights stop adapting to new training data, and thus the training process grinds to a halt. This phenomenon is known as the Vanishing Gradient problem, and its causes are explained in detail in Chapter GradientDescentTechniques. In addition, adding more layers makes the Loss Landscape more chaotic, as shown in Figure convnet46, which makes optimization very difficult. This problem constrains the number of layers that can be added to the network to about 20 or so, without degrading the training process. In order to get around this problem, we can increase the width of the network as explained above, or use a recent advance in DLN architecture called Residual Connections, which allows much deeper networks containing hundreds of layers.
The Overfitting Problem: Larger models with more layers have a larger number of parameters, and this in turn requires larger training datasets. As explained in Chapter ImprovingModelGeneralization, modeling is an exercise in matching the Capacity of the Model with the Complexity of the Dataset. If the Capacity of the Model is greater than the Complexity of the Dataset (which can happen if we add more layers than necessary), then it leads to overfitting. This problem constrains the model's generalization ability.
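The Vanishing Gradient effect mentioned in the first item can be seen in a deliberately simplified setting. The sketch below assumes a chain with one node per layer, a fixed weight $w = 1$ in every layer and a sigmoid Activation Function; none of these choices come from the text. By the chain rule, the derivative of the final activation with respect to the input is a product of per-layer factors $w\,f'(a_r)$, and since the sigmoid derivative never exceeds $0.25$, this product shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def input_gradient(depth, w=1.0, x=0.5):
    # Derivative of the last activation with respect to the input x
    # for a chain z_r = sigmoid(w * z_{r-1}) with a single node per layer
    z, grad = x, 1.0
    for _ in range(depth):
        z = sigmoid(w * z)
        grad *= w * z * (1.0 - z)   # chain rule factor: w * sigmoid'(a_r)
    return grad

for depth in (1, 5, 10, 20):
    print(depth, input_gradient(depth))
```

The printed values drop by several orders of magnitude as the depth grows, which is why the layers far from the output receive almost no training signal in this setting.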
As this discussion shows, there is no formula or theoretical result which tells us the number of layers or the nodes per layer to use in the model. These numbers, which are also called hyper-parameters, are a function of the dataset that we are trying to model, and the only way to find the best numbers is by trial and error. Hence when building the model, the designer has to do several trial runs with different values for these hyper-parameters before settling on the best ones.
In Chapter ImprovingModelGeneralization we provide some guidelines that can be used to make this process more efficient.
There are two ways to define a Dense Feed Forward Network in Keras:
The code shown below uses the Layers Module to define a Dense Feed Forward Network with two hidden layers with 20 and 15 nodes respectively. The first hidden layer is constrained to accept input tensors of shape (32 * 32 * 3,), i.e., flattened vectors of 3072 pixel values. Note that the batch dimension is not part of this shape and is left unspecified, which allows the system to feed this layer with batches of any size. The input tensor is transformed into a tensor of shape (20,) by the first hidden layer, and this tensor is then processed by the second hidden layer with 15 nodes. There is no need to specify an input shape argument for the second layer, since Keras automatically infers it from the output of the first layer.
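As a quick sanity check on the architecture just described, the number of trainable parameters can be worked out by hand: a dense layer with $n_{in}$ inputs and $n_{out}$ nodes has $n_{in} n_{out} + n_{out}$ parameters. The helper `dense_param_count` below is a hypothetical name used only for this sketch; the per-layer values are what one would expect a model summary to report for this network.

```python
def dense_param_count(layer_sizes):
    # One weight per (input, node) pair plus one bias per node, for each layer
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Flattened CIFAR-10 input, two hidden layers (20 and 15 nodes), 10 output classes
sizes = [32 * 32 * 3, 20, 15, 10]
per_layer = dense_param_count(sizes)
print(per_layer)        # [61460, 315, 160]
print(sum(per_layer))   # 61935
```

Almost all of the parameters sit in the first layer, because the flattened image has 3072 entries.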
Comparing the results of the Linear Model from the previous chapter and the Dense Feed Forward Model, the accuracy increased from about 40% to 45%. This is a significant jump, but still not good enough. One of the main factors holding back the Dense Feed Forward model is that it is only able to process images after they have been flattened into a vector. Thus a lot of information that is present in the original 3D image shape is lost, especially information about which pixels are close to each other in the original image. In order to process images in their native 3D shape, we will need a more sophisticated Neural Network model called the Convolutional Neural Network, which is discussed in one of the later chapters.
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import cifar10
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
train_images = train_images.reshape((50000, 32 * 32 * 3))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 32 * 32 * 3))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
network = models.Sequential()
network.add(layers.Dense(20, activation='relu', input_shape=(32 * 32 * 3,)))
network.add(layers.Dense(15, activation='relu'))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='sgd',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)