from ipypublish import nb_setup
Once the DLN model has been trained, its true test is how well it is able to classify inputs that it has not seen before, which is also known as its Generalization Ability. There are two kinds of problems that can afflict ML models in general:
Even after the model has been fully trained such that its training error is small, it exhibits a high test error rate. This is known as the problem of Overfitting.
The training error fails to come down in spite of several epochs of training. This is known as the problem of Underfitting.
#UnderfittingOverfitting
nb_setup.images_hconcat(["DL_images/UnderfittingOverfitting.png"], width=600)
The problems of Underfitting and Overfitting are best visualized in the context of the Regression problem of fitting a curve to the training data; see Figure UnderfittingOverfitting. In this figure, the crosses denote the training data while the solid curve is the ML model that tries to fit this data. As shown, three types of models were used to fit the data: a straight line in the left hand figure, a second-degree polynomial in the middle figure, and a polynomial of degree 6 in the right hand figure. The left figure shows that the straight line is not able to capture the curve in the training data, which leads to the problem of Underfitting, since even adding more training data would not reduce the error. Increasing the complexity of the model to a second-degree polynomial helps a lot, as the middle figure shows. On the other hand, if we increase the complexity of the model further by using a sixth-degree polynomial, then it backfires, as illustrated in the right hand figure. In this case the curve fits all the training points exactly, but fails to fit test data points, which illustrates the problem of Overfitting.
This example shows that it is critical to choose a model whose complexity fits the training data; choosing a model that is either too simple or too complex results in poor performance. A full scale DLN model can have millions of parameters, and in this case a formal definition of its complexity remains an open problem. A concept that is often used in this context is that of "Model Capacity", which is defined as the ability of the model to handle complex datasets. This can be formally computed for simple Binary Classification models and is known as the Vapnik-Chervonenkis or VC dimension. Such a formula does not exist for a DLN model in general, but later in this chapter we will give guidelines that can be used to estimate the effect of a model's hyper-parameters on its capacity.
The following factors determine how well a model is able to generalize from the training dataset to the test dataset:
The model capacity and its relation to data complexity: In general if the model capacity is less than the data complexity then it leads to underfitting, while if the converse is true, then it can lead to overfitting. Hence we should try to choose a model whose capacity matches the complexity of the training and test datasets.
Even if the model capacity and the data complexity are well matched, we can still encounter the overfitting problem. This is caused by an insufficient amount of training data.
Based on these observations, a useful rule of thumb is the following: The chances of encountering the overfitting problem increase as the model capacity grows, but decrease as more training data becomes available. Note that we are attempting to reduce the training error (to avoid the underfitting problem) and the test error (to avoid the overfitting problem) at the same time. This leads to conflicting demands on the model capacity, since the training error reduces monotonically as the model capacity increases, but the test error starts to increase if the model capacity is too high. In general if we plot the test error as a function of model capacity, it exhibits a characteristic U shaped curve. The ideal model capacity is the point at which the test error starts to increase. This criterion is used extensively in DLNs to determine the best set of hyper-parameters to use.
#HPO6
nb_setup.images_hconcat(["DL_images/HPO6.png"], width=600)
Figure HPO6 illustrates the relationship between model capacity and the concepts of underfitting and overfitting by plotting the training and test errors as a function of model capacity. When the capacity is low, both the training and test errors are high. As the capacity increases, the training error steadily decreases, while the test error initially decreases but then starts to increase due to overfitting. Hence the optimal model capacity is the one at which the test error is at a minimum.
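The U shaped behavior of the test error can be reproduced with a small numerical experiment. The following is a minimal sketch (the data generating function, noise level, and polynomial degrees below are our own illustrative choices, not taken from the figure): we fit polynomials of increasing degree to a small noisy training sample and compare training and test errors.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data generating process: a smooth curve plus noise
def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(2 * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # held-out test set

for degree in [1, 2, 6, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)   # fit a polynomial of the given degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print("degree %d: train MSE = %.3f, test MSE = %.3f" % (degree, train_mse, test_mse))
In such an experiment the training error decreases monotonically with the degree (the model capacity), while the test error typically traces out the U shaped curve of Figure HPO6.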
This discussion of the generalization ability of DLN models relies on a very important assumption, which is that both the training and test datasets are generated by the same probabilistic model. In practice this means that if we train the model to recognize a certain type of object, human faces for example, then we cannot expect it to perform well if the test data consists entirely of cat faces. There is a famous result called the No Free Lunch Theorem which states that if this assumption is not satisfied, i.e., the training and test dataset distributions are unconstrained, then every classification algorithm has the same error rate when classifying previously unobserved points. Hence the only way to do better is by constraining the training and test datasets to a narrower class of data that is relevant to the problem being solved.
#RL2
nb_setup.images_hconcat(["DL_images/RL2.png"], width=600)
We introduced the validation dataset in Chapter PatternRecognition, and also used it in Chapters LinearLearningModels and TrainingNNsBackprop when describing the Gradient Descent algorithm. The reader may recall that the rule of thumb is to split the data between the training and test datasets in the ratio 80:20, and then further set aside 20% of the resulting training data for the validation dataset (see Figure RL2). We now provide some justification for why the validation dataset is needed.
During the course of this chapter we will often perform experiments whose objective is to determine optimal values of one or more hyper-parameters. We could track the variation of the error rate or classification accuracy in these experiments using the test dataset, but this is not a good idea. The reason is that doing so causes information about the test data to leak into the training process. Hence we can end up choosing hyper-parameters that are well suited to a particular choice of test data, but won't work well for others. Using the validation dataset ensures that this leakage does not happen.
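As a minimal sketch of this rule of thumb (the array names and sizes are illustrative), the two splits can be carried out with plain NumPy indexing; in the Keras examples later in this chapter the second split is obtained more conveniently with the validation_split=0.2 argument to fit().
import numpy as np

# Illustrative dataset: 1000 samples with 20 features and 10 classes
X = np.random.rand(1000, 20)
y = np.random.randint(0, 10, size=1000)

# Shuffle once so that the splits are random
idx = np.random.permutation(len(X))
X, y = X[idx], y[idx]

# 80:20 split into training and test data
n_test = int(0.2 * len(X))
X_test, y_test = X[:n_test], y[:n_test]
X_train_full, y_train_full = X[n_test:], y[n_test:]

# Set aside 20% of the remaining training data for validation
n_val = int(0.2 * len(X_train_full))
X_val, y_val = X_train_full[:n_val], y_train_full[:n_val]
X_train, y_train = X_train_full[n_val:], y_train_full[n_val:]

print(len(X_train), len(X_val), len(X_test))    # 640 160 200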
#RL1
nb_setup.images_hconcat(["DL_images/RL1.png"], width=600)
If a DLN exhibits the following symptom: its training error does not asymptote to zero even when it is trained over a large number of epochs, then it is showing signs of underfitting. This means that the capacity of the model is not sufficiently high to be able to classify even the training data with a low probability of error. In other words, the degree of non-linearity in the training data is higher than the amount of non-linearity the DLN is capable of capturing. An example of the output from a model which suffers from underfitting is shown in Figure RL1. This example is taken from Chapter NN Deep Learning, and it shows the training and validation curves for the CIFAR-10 dataset, using a Dense Feed Forward model with 2 hidden layers. As the figure shows, both the training error and the validation error closely track each other, and flatten out to a large error value with an increasing number of epochs. Hence this Dense Feed Forward Network does not have sufficiently high capacity to capture the complexity of CIFAR-10.
In order to remedy this situation, the modeler can increase the model capacity by increasing the number of hidden layers, adding more nodes per hidden layer, changing the regularization parameters (these are introduced in Section Regularization) or the Learning Rate, or changing the type of model being used. For the CIFAR-10 example, we will have to replace the Dense Feed Forward model with a Convolutional Neural Network to remedy underfitting, as shown in Chapter ConvNets. If none of these steps solves the problem, then it points to bad quality training data.
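As an illustrative sketch of the first two remedies (the layer widths below are arbitrary choices, not tuned recommendations), model capacity can be raised in Keras simply by adding Dense layers and widening them. The input shape corresponds to a flattened CIFAR-10 image.
from keras import models
from keras import layers

# Low capacity model: a single small hidden layer
small_model = models.Sequential()
small_model.add(layers.Dense(32, activation='relu', input_shape=(32 * 32 * 3,)))
small_model.add(layers.Dense(10, activation='softmax'))

# Higher capacity variant: more hidden layers and more nodes per layer
large_model = models.Sequential()
large_model.add(layers.Dense(512, activation='relu', input_shape=(32 * 32 * 3,)))
large_model.add(layers.Dense(256, activation='relu'))
large_model.add(layers.Dense(128, activation='relu'))
large_model.add(layers.Dense(10, activation='softmax'))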
#RL3
nb_setup.images_hconcat(["DL_images/RL3.png"], width=600)
Overfitting is one of the major problems that plagues ML models. When this problem occurs, the model fits the training data very well, but fails to make good predictions in situations it hasn’t been exposed to before. The causes for overfitting were discussed in the introduction to this chapter, and can be boiled down to either a mismatch between the model capacity and the data complexity and/or insufficient training data. DLNs exhibit the following symptoms when overfitting happens (see Figure RL3): The classification accuracy for the training data increases with the number of epochs and may approach 100%, but the test accuracy diverges and plateaus at a much lower value thus opening up a large gap between the two curves.
The following example illustrates the overfitting problem, using the Fashion MNIST dataset, which also comes pre-packaged with Keras. Like the older MNIST dataset, it comes with 60,000 training and 10,000 test examples, consisting of grayscale images of fashion items that can be classified into 10 categories.
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images.shape
len(train_labels)
item = train_images[100]
import matplotlib.pyplot as plt
import numpy as np
plt.imshow(item, cmap = plt.cm.binary)
plt.show()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
from keras import models
from keras import layers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)
history_dict = history.history
history_dict.keys()
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
In the previous section Regularization was introduced as a technique used to combat the Overfitting problem. We describe some popular Regularization algorithms in this section, which have proven to be very effective in practice. Because of the importance of this topic, there is a huge amount of work that has been done in this area. Dropout based Regularization, which is described here, is one of the factors that has led to the resurgence of interest in DLNs in the last few years.
There are a wide variety of techniques that are used for Regularization, and in general the one characteristic that unites them is that these techniques reduce the effective capacity of a model, i.e., the ability of the model to handle more complex classification tasks. This makes sense since the basic cause of Overfitting is that the model capacity exceeds the requirements for the problem.
DLNs also exhibit a little understood feature called Self Regularization. For example, for a given amount of training data, if we increase the complexity of the model by adding additional Hidden Layers, then we should start to see overfitting, as per the arguments that we just presented. However, interestingly enough, increased model complexity leads to higher test data classification accuracy, i.e., the increased complexity somehow self-regularizes the model; see Bishop (1995). Hence when using DLN models, it is a good idea to start with a more complex model than the problem may warrant, and then add Regularization techniques if overfitting is detected.
Some commonly used Regularization techniques include:
Early Stopping
L1 Regularization
L2 Regularization
Dropout Regularization
Training Data Augmentation
Batch Normalization
The first three techniques are well known from Machine Learning days, and continue to be used for DLN models. The last three techniques on the other hand have been specially designed for DLNs, and were discovered in the last few years. They also tend to be more effective than the older ML techniques. Batch Normalization was already described in Chapter GradientDescentTechniques as a way of Normalizing activations within a model, and it is also very effective as a Regularization technique.
These techniques are discussed in the next few sub-sections.
#RL4
nb_setup.images_hconcat(["DL_images/RL4.png"], width=600)
Early Stopping is one of the most popular, and also effective, techniques to prevent overfitting. The basic idea is simple and is illustrated in Figure RL4: use the validation data set to compute the loss function at the end of each training epoch, and once the loss stops decreasing, stop the training and use the test data to compute the final classification accuracy. In practice it is more robust to wait until the validation loss has stopped decreasing for four or five successive epochs before stopping. The justification for this rule is quite simple: The point at which the validation loss starts to increase is when the model starts to overfit the training data, since from this point onwards its generalization ability starts to decrease. Early Stopping can be used by itself or in combination with other Regularization techniques.
Note that the Optimal Stopping Point can be considered to be a hyper-parameter, hence effectively we are testing out multiple values of this hyper-parameter during the course of a single training run. This makes Early Stopping more efficient than other hyper-parameter optimization techniques, which typically require a complete run of the model to test out a single hyper-parameter value. Another advantage of Early Stopping is that it is a fairly unobtrusive form of Regularization, since it does not require any changes to the model or objective function which could change the learning dynamics of the system.
The following example shows how to implement Early Stopping in Keras.
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
We use two pre-defined Keras callbacks to implement Early Stopping:
The callback EarlyStopping monitors the validation accuracy, and interrupts training if this quantity stops improving for more than patience epochs.
The callback ModelCheckpoint saves the model and its parameters after every epoch in the specified .h5 file. The save_best_only flag ensures that it doesn't overwrite the model file unless val_loss has become smaller, which allows us to save the best model seen during training.
from keras import models
from keras import layers
from keras import callbacks
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
callbacks_list = [
keras.callbacks.EarlyStopping(
monitor = 'val_accuracy',
patience = 4,
),
keras.callbacks.ModelCheckpoint(
filepath = 'Models/fashion.h5',
monitor = 'val_loss',
save_best_only = True,
)
]
network.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=100, batch_size=128,
callbacks = callbacks_list, validation_split=0.2)
L2 Regularization is a commonly used technique in ML systems, and is also sometimes referred to as “Weight Decay”. It works by adding a quadratic term to the Cross Entropy Loss Function $\mathcal L$, called the Regularization Term, which results in a new Loss Function $\mathcal L_R$ given by:
\begin{equation} \mathcal L_R = {\mathcal L} + \frac{\lambda}{2} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} (w_{ij}^{(r)})^2 \quad \quad (**L2reg**) \end{equation}The Regularization Term consists of the sum of the squares of all the link weights in the DLN, multiplied by a parameter $\lambda$ called the Regularization Parameter. This is another Hyperparameter whose appropriate value needs to be chosen as part of the training process by using the validation data set. By choosing a value for this parameter, we decide on the relative importance of the Regularization Term vs the Loss Function term. Note that the Regularization Term does not include the biases, since in practice it has been found that their inclusion does not make much of a difference to the final result.
In order to gain an intuitive understanding of how L2 Regularization works against overfitting, note that the net effect of adding the Regularization Term is to bias the algorithm towards choosing smaller values for the link weight parameters. The value of the parameter $\lambda$ governs the relative importance of the Cross Entropy term vs the regularization term and as $\lambda$ increases, the system tends to favor smaller and smaller weight values.
L2 Regularization also leads to more "diffuse" weight parameters, in other words, it encourages the network to use all its inputs a little rather than some of its inputs a lot. How does this help? A complete answer to this question will have to await some sort of theory of regularization, which does not exist at present. But in general, going back to the example of overfitting in the context of Linear Regression in Figure UnderfittingOverfitting, it is observed that when overfitting occurs as in the right hand part of that figure, the parameters of the model (which in this case are coefficients to the powers of $x$) begin to assume very large values in an attempt to fit all of the training data. Hence one of the signs of overfitting is that the model parameters, whether they are DLN weights or polynomial coefficients, blow up in value during training, which results in the model giving too much importance to the idiosyncrasies of the training dataset. This line of argument leads to the conclusion that smaller values of the model parameters enable the model to generalize better, and hence do a better job of classifying patterns it has not seen before. This increase in the values of the model parameters always seems to occur in the later stages of the training process. Hence one of the effects of the Early Stopping rule is to restrain the growth in the model parameter values. Therefore, in some sense, Early Stopping is also a form of Regularization, and indeed it can be shown that L2 Regularization and Early Stopping are mathematically equivalent.
In order to get further insight into L2 Regularization, we investigate its effect on the Gradient Descent based update equations (Wijr)-(bir) for the weight and bias parameters. Taking the derivative on both sides of equation (L2reg), we obtain
\begin{equation} \frac{\partial \mathcal L_R}{\partial w_{ij}^{(r)}} = \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} + {\lambda}\; w_{ij}^{(r)} \quad \quad (**LWM**) \end{equation}Substituting equation (LWM) back into equation (GradDesc), the weight update rule becomes:
\begin{equation} w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \eta \; \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} - {\eta \lambda} \; w_{ij}^{(r)} \\ = \left(1 - {\eta \lambda} \right)w_{ij}^{(r)} - \eta \; \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} \quad \quad (**L2W**) \end{equation}Comparing the preceding equation (L2W) with equation (GradDesc), it follows that the net effect of the Regularization Term on the Gradient Descent rule is to rescale the weight $w_{ij}^{(r)}$ by a factor of $(1-{\eta \lambda})$ before applying the gradient to it. This is called “weight decay” since it causes the weight to become smaller with each iteration.
If Stochastic Gradient Descent with batch size $B$ is used then the weight update rule becomes
\begin{equation} w_{ij}^{(r)}\leftarrow \left(1 - {\eta \lambda} \right) w_{ij}^{(r)} - \frac{\eta}{B} \; \sum_{m=1}^B \frac{\partial \mathcal L(m)}{\partial w_{ij}^{(r)}} \quad \quad (**L2wur**) \end{equation}In both preceding equations (L2W) and (L2wur) the gradients $\frac{\partial L}{\partial w_{ij}^{(r)}}$ are computed using the usual Backprop algorithm.
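The following is a minimal NumPy sketch of the update rule in equation (L2W); the gradient array here is a random stand-in for the quantity that Backprop would actually compute, and the layer shape and parameter values are illustrative.
import numpy as np

eta, lam = 0.01, 0.001               # learning rate and regularization parameter
W = np.random.randn(512, 784)        # weight matrix of one layer (illustrative shape)
grad_W = np.random.randn(512, 784)   # stand-in for dL/dW computed by Backprop

# L2-regularized (weight decay) update, equation (L2W):
# shrink the weights by (1 - eta * lam), then apply the usual gradient step
W = (1 - eta * lam) * W - eta * grad_W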
In the following example, we apply L2 Regularization to the Fashion MNIST dataset. From the results we can see that this was not very effective in improving the validation accuracy for the system. Indeed it seems that the system went from a state of Overfitting with no Regularization, to a state of Underfitting with L2 Regularization, since the Training Accuracy also decreased by quite a bit. Since Regularization has the effect of reducing the Model Capacity, it is quite plausible that the resulting decrease in Capacity pushed the system to the Underfitting state.
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
from keras import models
from keras import layers
from keras import regularizers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', kernel_regularizer = regularizers.l2(0.001),
input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=100, batch_size=128,
validation_split=0.2)
history_dict = history.history
history_dict.keys()
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
L1 Regularization uses a Regularization Function which is the sum of the absolute values of all the weights in the DLN, resulting in the following loss function ($\mathcal L$ is the usual Cross Entropy loss):
\begin{equation} \mathcal L_R = \mathcal L + {\lambda} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} |w_{ij}^{(r)}| \quad \quad (**L1reg**) \end{equation}At a high level L1 Regularization is similar to L2 Regularization since it leads to smaller weights. It results in the following weight update equation when using Stochastic Gradient Descent (where $sgn$ is the sign function, such that $sgn(w) = +1$ if $w > 0$, $sgn(w) = -1$ if $w< 0$, and $sgn(0) = 0$):
\begin{equation} w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - {\eta \lambda}\; sgn(w_{ij}^{(r)}) - {\eta}\; \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} \quad \quad (**Wsgn**) \end{equation}Comparing equations (Wsgn) and (L2W) we can see that both L1 and L2 Regularizations lead to a reduction in the weights with each iteration. However the way the weights drop is different: In L2 Regularization the weight reduction is multiplicative and proportional to the value of the weight, so it is faster for large weights and decelerates as the weights get smaller. In L1 Regularization on the other hand, the weights are reduced by a fixed amount in every iteration, irrespective of the value of the weight. Hence for larger weights L2 Regularization is faster than L1, while for smaller weights the reverse is true. As a result L1 Regularization leads to DLNs in which the weight of most of the connections tends towards zero, with a few connections with larger weights left over. The DLN that results after the application of L1 Regularization is said to be “sparse”.
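In Keras, L1 Regularization is added in the same way as L2, by swapping in regularizers.l1 (the coefficient 0.001 below is just an illustrative value); regularizers.l1_l2 can be used to apply both penalties together.
from keras import models
from keras import layers
from keras import regularizers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu',
                         kernel_regularizer=regularizers.l1(0.001),   # L1 penalty on this layer's weights
                         input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))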
Dropout is one of the most effective Regularization techniques to have emerged in the last few years, see Srivastava (2013); Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov (2014). We first describe the algorithm and then discuss reasons for its effectiveness.
#dropoutRegularization
nb_setup.images_hconcat(["DL_images/dropoutRegularization.png"], width=600)
The basic idea behind Dropout is to run each iteration of the Backprop algorithm on randomly modified versions of the original DLN. The random modifications are carried out to the topology of the DLN using the following rules:
Assign probability values $p^{(r)}, 0 \leq r \leq R$, where $p^{(r)}$ is defined as the probability that a node in layer $r$ is present in the model, and use these to generate $\{0,1\}$-valued Bernoulli random variables $e_j^{(r)}$: $$ e_j^{(r)} \sim Bernoulli(p^{(r)}), \quad 0 \leq r \leq R,\ \ 1 \leq j \leq P^r $$
Modify the input vector as follows: \begin{equation} \hat x_j = e_j^{(0)} x_j, \quad 1 \leq j \leq N \quad \quad (**Xj0**) \end{equation}
Modify the activations $z_j^{(r)}$ of the hidden layer r as follows: \begin{equation} \hat z_j^{(r)} = e_j^{(r)} z_j^{(r)}, \quad 1 \leq r \leq R,\ \ 1 \leq j \leq P^r \quad \quad (**Zj0**) \end{equation}
The net effect of equation (Xj0) is that for each iteration of the Backprop algorithm, instead of the entire input vector $x_j, 1 \leq j \leq N$, a randomly selected subset $\hat x_j$ is used. Similarly the net effect of equation (Zj0) is that, in each iteration of Backprop, a randomly selected subset of the nodes in each of the hidden layers is erased. Note that since the random subset chosen for erasure in each layer changes from iteration to iteration, we are effectively running Backprop on a different and thinned version of the original DLN network each time. This is illustrated in Figure dropoutRegularization, with part (a) showing the original DLN, and part (b) showing the DLN after a random subset of the nodes has been erased by applying Dropout. Also note that the erasure probabilities are the same for all nodes within a layer, but can vary from layer to layer.
#weightAdjustment
nb_setup.images_hconcat(["DL_images/weightAdjustment.png"], width=600)
After the Backprop is complete, we have effectively trained a collection of up to $2^s$ thinned DLNs all of which share the same weights, where $s$ is the total number of hidden nodes in the DLN. In order to test the network, strictly speaking we should be averaging the results from all these thinned models; however, a simple approximate averaging method works quite well. The main idea is to use the complete DLN as the test network, but modify each of the weights as shown in Figure weightAdjustment. Hence if a node is retained with probability $p$ during training, then the outgoing weights from that node are multiplied by $p$ at test time. This modification is needed since during the training period only a fraction $p$ of the nodes are active at any one time, hence the average activation is only $p$ times the value for the case when all the nodes are active, which is the case during the test phase. Hence by multiplying the test phase weights by $p$, we ensure that the average activation values are the same as during the training phase. Note that Dropout can be combined with other forms of Regularization such as L2 or L1, as well as with techniques that improve Stochastic Gradient Descent such as momentum.
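The following is a minimal NumPy sketch of equation (Zj0) together with the test time weight adjustment of Figure weightAdjustment; the layer sizes and retention probability are illustrative. (Most libraries, including the Keras Dropout layer used later in this chapter, implement the equivalent "inverted dropout", which instead scales the retained activations up by $1/p$ during training so that no adjustment is needed at test time.)
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                # probability that a node is retained
z = rng.normal(size=100)               # activations of one hidden layer (illustrative)

# Training: multiply the activations by Bernoulli(p) erasure variables, equation (Zj0)
e = rng.binomial(1, p, size=z.shape)
z_thinned = e * z

# Testing: keep all the nodes, but scale the outgoing weights by p (Figure weightAdjustment)
W_out = rng.normal(size=(10, 100))     # outgoing weight matrix (illustrative shape)
W_test = p * W_out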
The complete Batch Stochastic Gradient Descent algorithm for Deep Feed Forward Networks using Dropout is given below:
1. Initialize all the weight and bias parameters $(w_{ij}^{(r)},b_i^{(r)})$.
2. Repeat the following Step 3 $E$ times (once per epoch):
3. For $q = 0,...,{M\over B}-1$ repeat the following steps (a) - (f):
a. For the training inputs $X_i(m), qB\le m\le (q+1)B$, compute the model predictions $y(m)$ given by
$$ a_i^{(r)}(m) = \sum_{j=1}^{P^{r-1}} w_{ij}^{(r)}z_j^{(r-1)}(m)+b_i^{(r)}, \quad 2 \leq r \leq R,\ \ 1 \leq i \leq P^r $$and
$$ z_i^{(r)}(m) = e_i^{(r)}(m)f(a_i^{(r)}(m)), \quad 2 \leq r \leq R,\ \ 1 \leq i \leq P^r $$and for $r=1$,
$$ a_i^{(1)}(m) = \sum_{j=1}^{N} w_{ij}^{(1)}e_j^{(0)}(m)x_j(m)+b_i^{(1)} \quad \mbox{and} \quad z_i^{(1)}(m) = e_i^{(1)}(m)f(a_i^{(1)}(m)), \quad 1 \leq i \leq P^1 $$The logits and classification probabilities are computed as before using
$$ a_i^{(R+1)}(m) = \sum_{j=1}^{P^R} w_{ij}^{(R+1)}z_j^{(R)}(m)+b_i^{(R+1)}, \quad 1 \leq i \leq K $$and
$$ \quad y_i(m) = \frac{\exp(a_i^{(R+1)}(m))}{\sum_{k=1}^K \exp(a_k^{(R+1)}(m))}, \quad 1 \leq i \leq K $$Note that for each $m$ a different set of erasure numbers $e_i^{(r)}(m)$ is generated, so that the outputs $Y(m),\ qB\le m\le (q+1)B$ are effectively generated using $B$ different networks.
b. Evaluate the gradients $\delta_k^{(R+1)}(m)$ for the logit layer nodes using
$$ \delta_k^{(R+1)}(m) = y_k(m) - t_k(m),\ \ 1\le k\le K $$c. Back-propagate the $\delta$s using the following equation to obtain the $\delta_j^{(r)}(m), 1 \leq r \leq R, 1 \leq j \leq P^r$ for each hidden node in the network.
$$ \delta_j^{(r)}(m) = e_j^{(r)}(m)f'(a_j^{(r)}(m)) \sum_k w_{kj}^{(r+1)} \delta_k^{(r+1)}(m), \quad 1 \leq r \leq R $$d. Compute the gradients of the Cross Entropy Function $\mathcal L(m)$ for the $m$-th training vector $(X{(m)}, T{(m)})$ with respect to all the weight and bias parameters using the following equation.
$$ \frac{\partial\mathcal L(m)}{\partial w_{ij}^{(r+1)}} = \delta_i^{(r+1)}(m) z_j^{(r)}(m) \quad \mbox{and} \quad \frac{\partial \mathcal L(m)}{\partial b_i^{(r+1)}} = \delta_i^{(r+1)}(m), \quad 0 \leq r \leq R $$e. Change the model weights according to the formula
$$ w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \frac{\eta}{B}\sum_{m=qB}^{(q+1)B} \frac{\partial\mathcal L(m)}{\partial w_{ij}^{(r)}}, $$$$ b_i^{(r)} \leftarrow b_i^{(r)} - \frac{\eta}{B}\sum_{m=qB}^{(q+1)B} \frac{\partial{\mathcal L}(m)}{\partial b_i^{(r)}}, $$Once again, due to the random erasures, we are averaging the gradients of $B$ different networks in this step.
f. Increment $q \leftarrow (q+1)\mod {M\over B}$ and go back to step $(a)$.
4. Compute the Loss Function $L$ over the Validation Dataset given by
$$ L = -{1\over V}\sum_{m=1}^V\sum_{k=1}^K t_k{(m)} \log y_k{(m)} $$where the outputs are computed using the formulae:
$$ a_i^{(r)}(m) = p^{(r-1)}\sum_{j=1}^{P^{r-1}} w_{ij}^{(r)}z_j^{(r-1)}(m)+b_i^{(r)}, \quad 2 \leq r \leq R,\ \ 1 \leq i \leq P^r $$and
$$ z_i^{(r)}(m) = f(a_i^{(r)}(m)), \quad 2 \leq r \leq R,\ \ 1 \leq i \leq P^r $$and for $r=1$,
$$ a_i^{(1)}(m) = p^{(0)}\sum_{j=1}^{N} w_{ij}^{(1)}x_j(m)+b_i^{(1)} \quad \mbox{and} \quad z_i^{(1)}(m) = f(a_i^{(1)}(m)), \quad 1 \leq i \leq P^1 $$The logits and classification probabilities are computed using
$$ a_i^{(R+1)}(m) = p^{(R)}\sum_{j=1}^{P^R} w_{ij}^{(R+1)}z_j^{(R)}(m)+b_i^{(R+1)}, \quad 1 \leq i \leq K $$and
$$ \quad y_i(m) = \frac{\exp(a_i^{(R+1)}(m))}{\sum_{k=1}^K \exp(a_k^{(R+1)}(m))}, \quad 1 \leq i \leq K $$If $L$ has dropped below some threshold, then stop. Otherwise go back to Step 2.
In most applications of Dropout, the number of parameters to be chosen is reduced to one, by making the following choices:
$$ p^{(0)} = 1, \ \ p^{(r)} = p,\ \ r = 1, 2,...,R $$Under these assumptions only a single parameter $p$ need be chosen, and it is usually defaulted to $p = 0.5$.
Why does Dropout work? Once again we lack a complete theory of Regularization in DLNs that can be used to answer this question, but some plausible reasons are as follows:
A popular technique that is often used in ML models is called “bagging”. In this technique several models for the problem under consideration are developed, each of them with their own parameters and trained on a separate slice of data. The test is then run on each model as well, and then the average of the outputs is used as the final result. The reader may notice that the Dropout algorithm accomplishes the same effect as training several different models, each with a different subset of the hidden nodes active, which may explain its effectiveness. Note that Dropout is only approximate bagging, since unlike in full bagging, all the models share their parameters and also some of their inputs.
Another explanation for Dropout’s effectiveness is the following: During Dropout training, a hidden node cannot rely on the presence of other hidden nodes in order to be able to accomplish its task. Hence each hidden node must perform well irrespective of the presence of other hidden nodes, which makes them very robust and able to work in a variety of different networks, without getting too adapted to any one of them. This property helps in generalizing the classification accuracy from training to test data sets.
#RL6
nb_setup.images_hconcat(["DL_images/RL6.png"], width=600)
Since the invention of Dropout, researchers have introduced other Regularization techniques that work along similar lines. One such technique, called DropConnect, is shown in Figure RL6. In this case, instead of randomly dropping nodes during training, random connections are dropped instead.
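A plain NumPy sketch of the DropConnect idea is shown below (the shapes and retention probability are illustrative): the Bernoulli mask is applied to the individual entries of the weight matrix rather than to the node activations.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # probability that a connection is retained
W = rng.normal(size=(512, 784))          # weight matrix of one layer (illustrative shape)
x = rng.normal(size=784)                 # input vector to the layer

# DropConnect: randomly zero out individual weights for this training iteration
mask = rng.binomial(1, p, size=W.shape)
a = (mask * W) @ x                       # pre-activations computed with the dropped connections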
In the example below, we apply Dropout Regularization to the Fashion MNIST Dataset.
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
from keras import models
from keras import layers
from keras import regularizers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dropout(0.4))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=100, batch_size=128,
validation_split=0.2)
history_dict = history.history
history_dict.keys()
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
We observe that there is some flattening of the validation loss curve (i.e., a reduction in overfitting), but there is no appreciable improvement in the Validation Accuracy.
#RL5
nb_setup.images_hconcat(["DL_images/RL5.png"], width=600)
If the data set used for training is not large enough, which is often the case for many real life datasets, then it can lead to overfitting. A simple technique to get around this problem is to artificially expand the training set. In the case of image data this is fairly straightforward: for example, we can subject an image to transformations such as flips, rotations, zooms, and translations without changing its classification (see Figure RL5 for some examples).
All these transformations are of the type that the human eye is used to experiencing. However there are other augmentation techniques that don’t fall into this class, such as adding random noise to the training data set which is also very effective as long as it is done carefully.
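As a minimal sketch of noise based augmentation (the noise scale below is an arbitrary choice that would need tuning), zero mean Gaussian noise can be added to copies of the Fashion MNIST training images prepared earlier in this chapter:
import numpy as np

# train_images as prepared above: shape (60000, 784), pixel values in [0, 1]
noise = np.random.normal(loc=0.0, scale=0.05, size=train_images.shape)
noisy_images = np.clip(train_images + noise, 0.0, 1.0)   # keep pixel values in the valid range

# Augmented training set: originals plus noisy copies (labels are simply duplicated)
aug_images = np.concatenate([train_images, noisy_images])
aug_labels = np.concatenate([train_labels, train_labels])
print(aug_images.shape, aug_labels.shape)    # (120000, 784) (120000, 10)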
All these techniques work quite well in practice. Later in Chapter ConvNets we will describe a DLN architecture called Convolutional Neural Networks (CNN) in which Translational Invariance is built into the structure of the network itself.
import os, shutil, pathlib
train_dir = 'DL_data/CatsDogs_train'
validation_dir = 'DL_data/CatsDogs_validation'
train_cats_dir = 'DL_data/CatsDogs_train/cats'
train_dogs_dir = 'DL_data/CatsDogs_train/dogs'
validation_cats_dir = 'DL_data/CatsDogs_validation/cats'
validation_dogs_dir = 'DL_data/CatsDogs_validation/dogs'
new_base_dir = pathlib.Path("/Users/subirvarma/Desktop/DL_Book_new_v2/DL_data")
print('total training cat images:', len(os.listdir(train_cats_dir)))
print('total training dog images:', len(os.listdir(train_dogs_dir)))
print('total validation cat images:', len(os.listdir(validation_cats_dir)))
print('total validation dog images:', len(os.listdir(validation_dogs_dir)))
from tensorflow.keras.utils import image_dataset_from_directory
train_dataset = image_dataset_from_directory(
new_base_dir / "CatsDogs_train",
image_size=(150, 150),
batch_size=20)
validation_dataset = image_dataset_from_directory(
new_base_dir / "CatsDogs_validation",
image_size=(150, 150),
batch_size=20)
import keras
import tensorflow
keras.__version__
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras import optimizers
from tensorflow.keras import Model
from tensorflow.keras import layers
from tensorflow.keras import Input
data_augmentation = tensorflow.keras.Sequential(
[
layers.RandomFlip("horizontal"),
layers.RandomRotation(0.1),
layers.RandomZoom(0.2),
]
)
from keras.layers import Embedding, Flatten, Dense
from tensorflow.keras import optimizers
from keras import Model
from keras import layers
from keras import Input
input_tensor = Input(shape=(150,150,3,))
x = data_augmentation(input_tensor)
x = layers.Flatten()(x)
x = layers.Rescaling(1./255)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
output_tensor = layers.Dense(1, activation='sigmoid')(x)
model = Model(input_tensor, output_tensor)
model.compile(loss='binary_crossentropy',
optimizer=optimizers.RMSprop(learning_rate=1e-4),
metrics=['acc'])
model.summary()
history = model.fit(
train_dataset,
epochs=10, #Use 100 epochs for good results
validation_data=validation_dataset)
history_dict = history.history
history_dict.keys()
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
#Bagging
nb_setup.images_hconcat(["DL_images/Bagging.png"], width=600)
Model Averaging is a well known technique from Machine Learning that is used to boost model accuracy. It relies on a result from statistics which states the following: Given independent random variables $X_1,...,X_n$ with common mean $\mu$ and variance $\sigma^2$, the variance of the average $\frac{\sum X_i}{n}$ of these variables is given by $\sigma^2\over n$, i.e., the average becomes more deterministic as the number of random variables increases. Figure Bagging shows how this result can be applied to ML models: If we train $m$ models using $m$ separate input datasets, and then use a simple majority vote to decide on the output, then it can lead to a 2-5% improvement in model accuracy. This method can be further simplified in the following ways:
Dividing up the input dataset reduces the number of samples per model, which is not desirable. An alternative technique is to create an input dataset for each model by sampling with replacement from the original dataset. This results in each model having the same number of inputs as the original dataset, even though some samples may occur multiple times while others may be absent from a given model's dataset. This technique is commonly called "bagging" (a minimal sketch appears after the Keras example below).
An even simpler technique is to use the original input dataset in each of the models. This also works fairly well, since the randomness introduced by factors such as the random initialization and the stochastic gradient algorithm is sufficient to de-correlate the models. The Keras example below illustrates one way of doing this: three small Dense networks share the same input, and their softmax outputs are combined using the Average layer.
import keras
keras.__version__
from keras import Sequential, Model
from keras import layers
from keras import Input
from keras.datasets import cifar10
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
train_images = train_images.reshape((50000, 32 * 32 * 3))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 32 * 32 * 3))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
input_tensor = Input(shape=(32 * 32 * 3,))
x = layers.Dense(20, activation='relu')(input_tensor)
x = layers.Dense(15, activation='relu')(x)
output_tensor1 = layers.Dense(10, activation='softmax')(x)
x = layers.Dense(20, activation='relu')(input_tensor)
x = layers.Dense(15, activation='relu')(x)
output_tensor2 = layers.Dense(10, activation='softmax')(x)
x = layers.Dense(20, activation='relu')(input_tensor)
x = layers.Dense(15, activation='relu')(x)
output_tensor3 = layers.Dense(10, activation='softmax')(x)
output_tensor = layers.Average()([output_tensor1, output_tensor2, output_tensor3])
model = Model(input_tensor, output_tensor)
model.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)
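For comparison with the shared-dataset approach above, the following is a minimal sketch of bagging proper, i.e., sampling with replacement; it reuses the CIFAR-10 arrays loaded above, and each member model would be trained on its own bootstrap resample.
import numpy as np

n = len(train_images)
n_models = 3

for k in range(n_models):
    # Sample n indices with replacement: some samples repeat, others are left out
    idx = np.random.choice(n, size=n, replace=True)
    boot_images, boot_labels = train_images[idx], train_labels[idx]
    # A member model would now be trained on (boot_images, boot_labels)
    print("model %d: %d distinct samples out of %d" % (k, len(np.unique(idx)), n))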
We end this section by making the observation that several Regularization techniques work by means of a two step procedure:
Step 1: Introduce noise into the training procedure, which makes the system work harder in order to learn from the training samples. Examples include:
Dropout: Randomly drop nodes during training.
Batch Normalization: The data from a training sample is affected by the other samples in its batch as a result of the normalization procedure.
Drop Connect: Randomly drop connections during training.
Model Averaging: Generate multiple models from the same training data, each of which has a different set of parameters, due to the use of random initializations and stochastic gradient descent.
Step 2: At Test time, average out the randomness to generate the final classification decision. Examples include:
Dropout: The final model used for testing is an approximation to the average of the randomly generated training models.
Batch Normalization: At test time the normalization uses statistics estimated over the entire training dataset (via running averages), as opposed to the per-batch statistics used during training.
Drop Connect: The final model is an approximation to the average of the randomly generated training models.
Model Averaging: Use a voting mechanism at test time to decide on the best classification.