In [18]:

```
from ipypublish import nb_setup
```

Once the DLN model has been trained, its true test is how well it is able to classify inputs that it has not seen before, which is also known as its Generalization Ability. There are two kinds of problems that can afflict ML models in general:

Even after the model has been fully trained such that its training error is small, it exhibits a high test error rate. This is known as the problem of Overfitting.

The training error fails to come down in-spite of several epochs of training. This is known as the problem of Underfitting.

In [2]:

```
#UnderfittingOverfitting
nb_setup.images_hconcat(["DL_images/UnderfittingOverfitting.png"], width=600)
```

Out[2]:

The problems of Underfitting and Overfitting are best visualized in the context of the Regression problem of fitting a curve to the training data, see Figure **UnderfittingOverfitting**. In this figure, the crosses denote the training data while the solid curve is the ML model that tries to fit this data. As shown, three models were used to fit the data: a straight line in the left-hand figure, a second-degree polynomial in the middle figure, and a polynomial of degree 6 in the right-hand figure. The left figure shows that the straight line is not able to capture the curve in the training data, which leads to the problem of Underfitting, since even adding more training data won't reduce the error. Increasing the complexity of the model by making the polynomial second-degree helps a lot, as the middle figure shows. On the other hand, if we increase the complexity of the model further by using a sixth-degree polynomial, then it backfires, as illustrated in the right-hand figure. In this case the curve fits all the training points exactly, but fails to fit test data points, which illustrates the problem of Overfitting.

This example shows that it is critical to choose a model whose complexity matches the training data; choosing a model that is either too simple or too complex results in poor performance. A full-scale DLN model can have millions of parameters, and in this case a formal definition of its complexity remains an open problem. A concept that is often used in this context is that of "Model Capacity", which is defined as the ability of the model to handle complex datasets. This can be formally computed for simple Binary Classification models and is known as the Vapnik-Chervonenkis or VC dimension. Such a formula does not exist for a DLN model in general, but later in this chapter we will give guidelines that can be used to estimate the effect of a model's hyper-parameters on its capacity.

The following factors determine how well a model is able to generalize from the training dataset to the test dataset:

The model capacity and its relation to data complexity: In general if the model capacity is less than the data complexity then it leads to underfitting, while if the converse is true, then it can lead to overfitting. Hence we should try to choose a model whose capacity matches the complexity of the training and test datasets.

Even if the model capacity and the data complexity are well matched, we can still encounter the overfitting problem. This is caused by an insufficient amount of training data.

Based on these observations, a useful rule of thumb is the following: The chances of encountering the overfitting problem increase as the model capacity grows, but decrease as more training data is available. Note that we are attempting to reduce the training error (to avoid the underfitting problem) and the test error (to avoid the overfitting problem) *at the same time*. This leads to conflicting demands on the model capacity, since training error reduces monotonically as the model capacity increases, but the test error starts to increase if the model capacity is too high. In general if we plot the test error as a function of model capacity, it exhibits a characteristic U shaped curve. The ideal model capacity is the point at which the test error starts to increase. This criterion is used extensively in DLNs to determine the best set of hyper-parameters to use.
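
This capacity trade-off can be reproduced numerically using the polynomial models of Figure **UnderfittingOverfitting**. The following sketch uses a made-up noisy quadratic dataset (the sample sizes and noise level are illustrative assumptions) and fits polynomials of degree 1, 2 and 6: the training error falls monotonically with degree, while the degree-1 model underfits both datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: a noisy quadratic, mimicking the crosses in the figure
x_train = np.linspace(-1, 1, 10)
y_train = x_train**2 + 0.05 * rng.standard_normal(10)
x_test = np.linspace(-1, 1, 50)
y_test = x_test**2 + 0.05 * rng.standard_normal(50)

errors = {}
for degree in (1, 2, 6):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train)**2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test)**2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Note that the training error can only decrease as the degree grows, while the test error of the degree-1 model stays large: exactly the U-shaped behavior described above.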

In [3]:

```
#HPO6
nb_setup.images_hconcat(["DL_images/HPO6.png"], width=600)
```

Out[3]:

Figure **HPO6** illustrates the relationship between model capacity and the concepts of underfitting and overfitting by plotting the training and test errors as a function of model capacity. When the capacity is low, both the training and test errors are high. As the capacity increases, the training error steadily decreases, while the test error initially decreases but then starts to increase due to overfitting. Hence the optimal model capacity is the one at which the test error is at a minimum.

This discussion on the generalization ability of DLN models relies on a very important assumption, which is that both the training and test datasets can be generated using the same probabilistic model. In practice this means that if we train the model to recognize a certain type of object, human faces for example, we cannot expect it to perform well if the test data consists entirely of cat faces. There is a famous result called the **No Free Lunch Theorem** which states that if this assumption is not satisfied, i.e., the training and test dataset distributions are unconstrained, then every classification algorithm has the same error rate when classifying previously unobserved points. Hence the only way to do better is by constraining the training and test datasets to a narrower class of data that is relevant to the problem being solved.

In [4]:

```
#RL2
nb_setup.images_hconcat(["DL_images/RL2.png"], width=600)
```

Out[4]:

We introduced the validation dataset in Chapter **PatternRecognition**, and also used it in Chapters **LinearLearningModels** and **TrainingNNsBackprop** when describing the Gradient Descent algorithm. The reader may recall that the rule of thumb is to split the data between the training and test datasets in the ratio 80:20, and then further set aside 20% of the resulting training data for the validation dataset (see Figure **RL2**). We now provide some justification for why the validation dataset is needed.

During the course of this chapter we will often perform experiments whose objective is to determine optimal values of one or more hyper-parameters. We could track the variation of the error rate or classification accuracy in these experiments using the test dataset; however, this is not a good idea. The reason is that doing so causes information about the test data to leak into the training process. Hence we can end up choosing hyper-parameters that are well suited for a particular choice of test data, but won't work well for others. Using the validation dataset ensures that this leakage does not happen.
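
The 80:20 rule of thumb can be sketched with plain array slicing (the dataset below is a made-up placeholder of 1000 samples; in Keras the second split is usually done implicitly via the `validation_split` argument to `fit`):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal((1000, 8))   # hypothetical dataset of 1000 samples

# 80:20 split between training and test data
n_test = int(0.2 * len(data))
test_data = data[:n_test]
trainval_data = data[n_test:]

# Set aside 20% of the remaining training data for validation
n_val = int(0.2 * len(trainval_data))
val_data = trainval_data[:n_val]
train_data = trainval_data[n_val:]

print(len(train_data), len(val_data), len(test_data))   # 640 160 200
```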

In [5]:

```
#RL1
nb_setup.images_hconcat(["DL_images/RL1.png"], width=600)
```

Out[5]:

If a DLN exhibits the following symptom: its training error does not asymptote to zero even when it is trained over a large number of epochs, then it is showing signs of underfitting. This means that the capacity of the model is not sufficiently high to classify even the training data with a low probability of error. In other words, the degree of non-linearity in the training data is higher than the amount of non-linearity the DLN is capable of capturing. An example of the output from a model which suffers from underfitting is shown in Figure **RL1**. This example is taken from Chapter **NN Deep Learning**, and it shows the training and validation curves for the CIFAR-10 dataset, using a Dense Feed Forward model with 2 hidden layers. As the figure shows, the training error and the validation error closely track each other, and flatten out to a large error value with increasing number of epochs. Hence this Dense Feed Forward Network does not have sufficiently high capacity to capture the complexity in CIFAR-10.

In order to remedy this situation, the modeler can increase the model capacity by increasing the number of hidden layers, adding more nodes per hidden layer, changing the regularization parameters (these are introduced in Section **Regularization**) or the Learning Rate, or changing the type of model being used. For the CIFAR-10 example, we will have to replace the Dense Feed Forward model with a Convolutional Neural Network to remedy underfitting, as shown in Chapter **ConvNets**. If none of these steps solves the problem, then it points to bad quality training data.
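
As a rough sketch of the first two remedies, the Dense Feed Forward model could be given more capacity by widening the first layer and inserting an extra hidden layer. The layer widths below are illustrative choices, not tuned values; the input size $32 \times 32 \times 3 = 3072$ corresponds to flattened CIFAR-10 images.

```python
from keras import models
from keras import layers

# Illustrative higher-capacity Dense Feed Forward model for flattened
# CIFAR-10 inputs; the widths (1024, 512) are assumptions, not tuned values.
network = models.Sequential()
network.add(layers.Dense(1024, activation='relu', input_shape=(32 * 32 * 3,)))
network.add(layers.Dense(512, activation='relu'))   # extra hidden layer adds capacity
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```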

In [6]:

```
#RL3
nb_setup.images_hconcat(["DL_images/RL3.png"], width=600)
```

Out[6]:

Overfitting is one of the major problems that plagues ML models. When this problem occurs, the model fits the training data very well, but fails to make good predictions in situations it hasn’t been exposed to before. The causes for overfitting were discussed in the introduction to this chapter, and can be boiled down to either a mismatch between the model capacity and the data complexity and/or insufficient training data. DLNs exhibit the following symptoms when overfitting happens (see Figure **RL3**): The classification accuracy for the training data increases with the number of epochs and may approach 100%, but the test accuracy diverges and plateaus at a much lower value thus opening up a large gap between the two curves.

The following example illustrates the overfitting problem, using the Fashion MNIST dataset, which also comes pre-packaged with Keras. Like the older MNIST dataset, it comes with 60,000 training and 10,000 test examples, and has grayscale images of fashion items that can be classified into 10 categories.

In [1]:

```
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
```

In [3]:

```
train_images.shape
```

Out[3]:

In [4]:

```
len(train_labels)
```

Out[4]:

In [6]:

```
item = train_images[100]
import matplotlib.pyplot as plt
import numpy as np
plt.imshow(item, cmap = plt.cm.binary)
plt.show()
```

In [7]:

```
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
```

In [8]:

```
from keras import models
from keras import layers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```

In [9]:

```
history = network.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)
```

In [10]:

```
history_dict = history.history
history_dict.keys()
```

Out[10]:

In [11]:

```
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```

In [12]:

```
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```

In the previous section Regularization was introduced as a technique used to combat the Overfitting problem. In this section we describe some popular Regularization algorithms, which have proven to be very effective in practice. Because of the importance of this topic, a huge amount of work has been done in this area. Dropout based Regularization, which is described here, is one of the factors that has led to the resurgence of interest in DLNs in the last few years.

There are a wide variety of techniques that are used for Regularization, and in general the one characteristic that unites them is that these techniques reduce the effective capacity of a model, i.e., the ability for the model to handle more complex classification tasks. This makes sense since the basic cause of Overfitting is that the model capacity exceeds the requirements for the problem.

DLNs also exhibit a little understood feature called Self Regularization. For a given amount of Training Set data, if we increase the complexity of the model, by adding additional Hidden Layers for example, then we should start to see overfitting, as per the arguments that we just presented. However, interestingly enough, the increased model complexity leads to higher test data classification accuracy, i.e., the increased complexity somehow self-regularizes the model, see Bishop (1995). Hence when using DLN models, it is a good idea to start with a more complex model than the problem may warrant, and then add Regularization techniques if overfitting is detected.

Some commonly used Regularization techniques include:

Early Stopping

L1 Regularization

L2 Regularization

Dropout Regularization

Training Data Augmentation

Batch Normalization

The first three techniques are well known from Machine Learning days, and continue to be used for DLN models. The last three techniques on the other hand have been specially designed for DLNs, and were discovered in the last few years. They also tend to be more effective than the older ML techniques. Batch Normalization was already described in Chapter **GradientDescentTechniques** as a way of Normalizing activations within a model, and it is also very effective as a Regularization technique.

These techniques are discussed in the next few sub-sections.

In [19]:

```
#RL4
nb_setup.images_hconcat(["DL_images/RL4.png"], width=600)
```

Out[19]:

Early Stopping is one of the most popular, and also effective, techniques to prevent overfitting. The basic idea is simple and is illustrated in Figure **RL4**: use the validation dataset to compute the loss function at the end of each training epoch, and once the loss stops decreasing, stop the training and use the test data to compute the final classification accuracy. In practice it is more robust to wait until the validation loss has stopped decreasing for four or five successive epochs before stopping. The justification for this rule is quite simple: the point at which the validation loss starts to increase is when the model starts to overfit the training data, since from this point onwards its generalization ability starts to decrease. Early Stopping can be used by itself or in combination with other Regularization techniques.

Note that the Optimal Stopping Point can be considered to be a hyper-parameter, hence effectively we are testing out multiple values of the hyper-parameter during the course of a single training run. This makes Early Stopping more efficient than other hyper-parameter optimization techniques, which typically require a complete run of the model to test out a single hyper-parameter value. Another advantage of Early Stopping is that it is a fairly unobtrusive form of Regularization, since it does not require any changes to the model or objective function which could change the learning dynamics of the system.

The following example shows how to implement Early Stopping in Keras.

In [20]:

```
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
```

We use two pre-defined Keras callbacks to implement Early Stopping:

The callback *EarlyStopping* monitors the validation accuracy, and interrupts the execution of the model if this quantity stops increasing for more than *patience* epochs.

The callback *ModelCheckpoint* saves the model and its parameters after every epoch in the specified .h5 file. The *save_best_only* flag ensures that it doesn't override the model file unless *val_loss* has become smaller, which allows us to save the best model seen during training.

In [21]:

```
from keras import models
from keras import layers
from keras import callbacks
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor = 'val_accuracy',
        patience = 4,
    ),
    keras.callbacks.ModelCheckpoint(
        filepath = 'Models/fashion.h5',
        monitor = 'val_loss',
        save_best_only = True,
    )
]
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```

In [17]:

```
history = network.fit(train_images, train_labels, epochs=100, batch_size=128,
                      callbacks = callbacks_list, validation_split=0.2)
```

L2 Regularization is a commonly used technique in ML systems, and is also sometimes referred to as "Weight Decay". It works by adding a quadratic term to the Cross Entropy Loss Function $\mathcal L$, called the Regularization Term, which results in a new Loss Function $\mathcal L_R$ given by:

\begin{equation} \mathcal L_R = {\mathcal L} + \frac{\lambda}{2} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} (w_{ij}^{(r)})^2 \quad \quad (**L2reg**) \end{equation}

The Regularization Term consists of the sum of the squares of all the link weights in the DLN, multiplied by a parameter $\lambda$ called the Regularization Parameter. This is another hyper-parameter whose appropriate value needs to be chosen as part of the training process by using the validation dataset. By choosing a value for this parameter, we decide on the relative importance of the Regularization Term vs the Loss Function term. Note that the Regularization Term does not include the biases, since in practice it has been found that their inclusion does not make much of a difference to the final result.

In order to gain an intuitive understanding of how L2 Regularization works against overfitting, note that the net effect of adding the Regularization Term is to bias the algorithm towards choosing smaller values for the link weight parameters. The value of the parameter $\lambda$ governs the relative importance of the Cross Entropy term vs the regularization term and as $\lambda$ increases, the system tends to favor smaller and smaller weight values.

L2 Regularization also leads to more "diffuse" weight parameters, in other words, it encourages the network to use all its inputs a little rather than some of its inputs a lot. How does this help? A complete answer to this question will have to await some sort of theory of regularization, which does not exist at present. But in general, going back to the example of overfitting in the context of Linear Regression in Figure **UnderfittingOverfitting**, it is observed that when overfitting occurs as in the right hand part of that figure, the parameters of the model (which in this case are coefficients to the powers of $x$) begin to assume very large values in an attempt to fit all of the training data. Hence one of the signs of overfitting is that the model parameters, whether they are DLN weights or polynomial coefficients, blow up in value during training, which results in the model giving too much importance to the idiosyncrasies of the training dataset. This line of argument leads to the conclusion that smaller values of the model parameters enable the model to generalize better, and hence do a better job of classifying patterns it has not seen before. This increase in the values of the model parameters always seems to occur in the later stages of the training process. Hence one of the effects of the Early Stopping rule is to restrain the growth in the model parameter values. Therefore, in some sense, Early Stopping is also a form of Regularization, and indeed it can be shown that L2 Regularization and Early Stopping are mathematically equivalent.

In order to get further insight into L2 Regularization, we investigate its effect on the Gradient Descent based update equations (**Wijr**)-(**bir**) for the weight and bias parameters. Taking the derivative on both sides of equation (**L2reg**), we obtain

\begin{equation} \frac{\partial \mathcal L_R}{\partial w_{ij}^{(r)}} = \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} + \lambda w_{ij}^{(r)} \quad \quad (**LWM**) \end{equation}

Substituting equation (**LWM**) back into equation (**GradDesc**), the weight update rule becomes:

\begin{equation} w_{ij}^{(r)} \leftarrow \left(1 - {\eta \lambda}\right) w_{ij}^{(r)} - \eta\, \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} \quad \quad (**L2W**) \end{equation}

Comparing the preceding equation (**L2W**) with equation (**GradDesc**), it follows that the net effect of the Regularization Term on the Gradient Descent rule is to rescale the weight $w_{ij}^{(r)}$ by a factor of $(1-{\eta \lambda})$ before applying the gradient to it. This is called “weight decay” since it causes the weight to become smaller with each iteration.

If Stochastic Gradient Descent with batch size $B$ is used then the weight update rule becomes

\begin{equation} w_{ij}^{(r)}\leftarrow \left(1 - {\eta \lambda} \right) w_{ij}^{(r)} - \frac{\eta}{B} \; \sum_{m=1}^B \frac{\partial \mathcal L(m)}{\partial w_{ij}^{(r)}} \quad \quad (**L2wur**) \end{equation}

In both preceding equations (**L2W**) and (**L2wur**) the gradients $\frac{\partial \mathcal L}{\partial w_{ij}^{(r)}}$ are computed using the usual Backprop algorithm.
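
The weight decay effect can be checked numerically. The sketch below applies the L2-regularized update to a random weight matrix with the data gradient set to zero, so that only the decay term acts; the values of $\eta$ and $\lambda$ are illustrative choices.

```python
import numpy as np

# Numeric check of the weight-decay factor in the L2 update: with the data
# gradient zeroed out, each update just rescales w by (1 - eta*lam).
rng = np.random.default_rng(0)
eta, lam = 0.1, 0.01
w = rng.standard_normal((4, 3))

w0_norm = np.linalg.norm(w)
for _ in range(100):
    grad = np.zeros_like(w)               # isolate the decay term
    w = (1 - eta * lam) * w - eta * grad
ratio = np.linalg.norm(w) / w0_norm
print(ratio)                              # (1 - 0.001)**100, about 0.905
```

This shows why the technique is called "weight decay": absent any gradient signal, the weights shrink geometrically toward zero.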

In the following example, we apply L2 Regularization to the Fashion MNIST dataset. From the results we can see that this was not very effective in improving the validation accuracy of the system. Indeed it seems that the system went from a state of Overfitting with no Regularization to a state of Underfitting with L2 Regularization, since the Training Accuracy also decreased by quite a bit. Since Regularization has the effect of reducing the Model Capacity, it is quite plausible that the resulting decrease in Capacity pushed the system into the Underfitting state.

In [22]:

```
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
from keras import models
from keras import layers
from keras import regularizers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', kernel_regularizer = regularizers.l2(0.001),
                         input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=100, batch_size=128,
                      validation_split=0.2)
```

In [23]:

```
history_dict = history.history
history_dict.keys()
```

Out[23]:

In [24]:

```
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```

In [25]:

```
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```

L1 Regularization uses a Regularization Function which is the sum of the absolute value of all the weights in DLN, resulting in the following loss function ($\mathcal L$ is the usual Cross Entropy loss):

\begin{equation} \mathcal L_R = \mathcal L + {\lambda} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} |w_{ij}^{(r)}| \quad \quad (**L1reg**) \end{equation}

At a high level L1 Regularization is similar to L2 Regularization since it also leads to smaller weights. It results in the following weight update equation when using Stochastic Gradient Descent (where $sgn$ is the sign function, such that $sgn(w) = +1$ if $w > 0$, $sgn(w) = -1$ if $w < 0$, and $sgn(0) = 0$):

\begin{equation} w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - {\eta \lambda}\; sgn(w_{ij}^{(r)}) - {\eta}\; \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} \quad \quad (**Wsgn**) \end{equation}

Comparing equations (**Wsgn**) and (**L2W**) we can see that both L1 and L2 Regularization lead to a reduction in the weights with each iteration. However the way the weights drop is different: in L2 Regularization the weight reduction is multiplicative and proportional to the value of the weight, so it is faster for large weights and decelerates as the weights get smaller. In L1 Regularization on the other hand, the weights are reduced by a fixed amount in every iteration, irrespective of the value of the weight. Hence for larger weights L2 Regularization is faster than L1, while for smaller weights the reverse is true. As a result L1 Regularization leads to DLNs in which the weights of most of the connections tend towards zero, with a few connections with larger weights left over. The type of DLN that results after the application of L1 Regularization is said to be "sparse".
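
This difference between the two shrinkage rules can be made concrete with a small numeric sketch. With the data gradient set to zero, the L1 update subtracts a constant step while the L2 update rescales multiplicatively; the values of $\eta$, $\lambda$ and the starting weights below are illustrative.

```python
import numpy as np

# Shrinkage-only comparison of the L1 and L2 updates (data gradient zeroed).
eta, lam = 0.1, 0.5
step = eta * lam                          # fixed L1 step size (0.05)
w_l1 = np.array([5.0, 0.05, -3.0, -0.02])
w_l2 = w_l1.copy()

for _ in range(50):
    w_l1 = w_l1 - step * np.sign(w_l1)    # constant-size step toward zero
    w_l1[np.abs(w_l1) < step] = 0.0       # clamp overshoot past zero
    w_l2 = (1 - step) * w_l2              # proportional (multiplicative) decay

print(w_l1)   # small weights driven exactly to zero -> sparse solution
print(w_l2)   # every weight shrunk, but none exactly zero
```

After 50 iterations the two small L1 weights sit exactly at zero while the large ones survive, whereas under L2 all four weights are merely scaled down: precisely the sparsity contrast described above.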

Dropout is one of the most effective Regularization techniques to have emerged in the last few years, see Srivastava (2013); Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov (2014). We first describe the algorithm and then discuss reasons for its effectiveness.

In [32]:

```
#dropoutRegularization
nb_setup.images_hconcat(["DL_images/dropoutRegularization.png"], width=600)
```

Out[32]:

The basic idea behind Dropout is to run each iteration of the Backprop algorithm on randomly modified versions of the original DLN. The random modifications are carried out to the topology of the DLN using the following rules:

Assign probability values $p^{(r)}, 0 \leq r \leq R$, where $p^{(r)}$ is defined as the probability that a node in layer $r$ is present in the model, and use these to generate $\{0,1\}$-valued Bernoulli random variables $e_j^{(r)}$: $$ e_j^{(r)} \sim Bernoulli(p^{(r)}), \quad 0 \leq r \leq R,\ \ 1 \leq j \leq P^r $$

Modify the input vector as follows: \begin{equation} \hat x_j = e_j^{(0)} x_j, \quad 1 \leq j \leq N \quad \quad (**Xj0**) \end{equation}

Modify the activations $z_j^{(r)}$ of the hidden layer r as follows: \begin{equation} \hat z_j^{(r)} = e_j^{(r)} z_j^{(r)}, \quad 1 \leq r \leq R,\ \ 1 \leq j \leq P^r \quad \quad (**Zj0**) \end{equation}

The net effect of equation (**Xj0**) is that for each iteration of the Backprop algorithm, instead of using the entire input vector $x_j, 1 \leq j \leq N$, a randomly selected subset $\hat x_j$ is used instead. Similarly the net effect of equation (**Zj0**) is that, in each iteration of Backprop, a randomly selected subset of the nodes in each of the hidden layers is erased. Note that since the random subset chosen for erasure in each layer changes from iteration to iteration, we are effectively running Backprop on a different and thinned version of the original DLN network each time. This is illustrated in Figure **dropoutRegularization**, with part (a) showing the original DLN, and part (b) showing the DLN after a random subset of the nodes has been erased by applying Dropout. Also note that the retention probabilities are the same for all nodes within a layer, but can vary from layer to layer.
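
The masking in equations (**Xj0**)-(**Zj0**) amounts to an elementwise multiply by a Bernoulli mask, as in this minimal sketch (the layer shape and retention probability $p$ are illustrative values):

```python
import numpy as np

# Sketch of equation (**Zj0**): multiply hidden-layer activations by Bernoulli
# masks; p is the probability that a node is retained.
rng = np.random.default_rng(0)
p = 0.5
z = rng.standard_normal((4, 10))        # activations: 4 samples x 10 hidden nodes
e = rng.binomial(1, p, size=z.shape)    # e_j ~ Bernoulli(p), redrawn every iteration
z_hat = e * z                           # erased nodes contribute exactly zero

print((z_hat == 0).mean())              # close to 1 - p of activations are zeroed
```

Redrawing `e` on every Backprop iteration is what produces the "thinned" networks of Figure **dropoutRegularization**.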

In [33]:

```
#weightAdjustment
nb_setup.images_hconcat(["DL_images/weightAdjustment.png"], width=600)
```

Out[33]: