In [18]:

```
from ipypublish import nb_setup
```

Once the DLN model has been trained, its true test is how well it is able to classify inputs that it has not seen before, which is also known as its Generalization Ability. There are two kinds of problems that can afflict ML models in general:

Even after the model has been fully trained such that its training error is small, it exhibits a high test error rate. This is known as the problem of Overfitting.

The training error fails to come down in-spite of several epochs of training. This is known as the problem of Underfitting.

In [2]:

```
#UnderfittingOverfitting
nb_setup.images_hconcat(["DL_images/UnderfittingOverfitting.png"], width=600)
```

Out[2]:

The problems of Underfitting and Overfitting are best visualized in the context of the Regression problem of fitting a curve to the training data, see Figure **UnderfittingOverfitting**. In this figure, the crosses denote the training data while the solid curve is the ML model that tries to fit this data. As shown, three models were used to fit the data: a straight line in the left-hand figure, a second-degree polynomial in the middle figure, and a polynomial of degree 6 in the right-hand figure. The left figure shows that the straight line is not able to capture the curve in the training data, which leads to the problem of Underfitting, since even adding more training data won't reduce the error. Increasing the complexity of the model by making the polynomial second-degree helps a lot, as the middle figure shows. On the other hand, if we increase the complexity of the model further by using a sixth-degree polynomial, then it backfires, as illustrated in the right-hand figure. In this case the curve fits all the training points exactly, but fails to fit test data points, which illustrates the problem of Overfitting.

This example shows that it is critical to choose a model whose complexity matches the training data; choosing a model that is either too simple or too complex results in poor performance. A full-scale DLN model can have millions of parameters, and in this case a formal definition of its complexity remains an open problem. A concept that is often used in this context is that of "Model Capacity", which is defined as the ability of the model to handle complex datasets. This can be formally computed for simple Binary Classification models and is known as the Vapnik-Chervonenkis or VC dimension. Such a formula does not exist for a DLN model in general, but later in this chapter we will give guidelines that can be used to estimate the effect of a model's hyper-parameters on its capacity.

The following factors determine how well a model is able to generalize from the training dataset to the test dataset:

The model capacity and its relation to data complexity: In general if the model capacity is less than the data complexity then it leads to underfitting, while if the converse is true, then it can lead to overfitting. Hence we should try to choose a model whose capacity matches the complexity of the training and test datasets.

Even if the model capacity and the data complexity are well matched, we can still encounter the overfitting problem. This is caused by an insufficient amount of training data.

Based on these observations, a useful rule of thumb is the following: The chances of encountering the overfitting problem increase as the model capacity grows, but decrease as more training data is available. Note that we are attempting to reduce the training error (to avoid the underfitting problem) and the test error (to avoid the overfitting problem) *at the same time*. This leads to conflicting demands on the model capacity, since training error reduces monotonically as the model capacity increases, but the test error starts to increase if the model capacity is too high. In general if we plot the test error as a function of model capacity, it exhibits a characteristic U shaped curve. The ideal model capacity is the point at which the test error starts to increase. This criterion is used extensively in DLNs to determine the best set of hyper-parameters to use.
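
This capacity trade-off can be reproduced numerically using the polynomial models of Figure **UnderfittingOverfitting**. The following sketch uses a made-up noisy quadratic dataset (the sample sizes and noise level are illustrative assumptions) and fits polynomials of degree 1, 2 and 6: the training error falls monotonically with degree, while the degree-1 model underfits both datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: a noisy quadratic, mimicking the crosses in the figure
x_train = np.linspace(-1, 1, 10)
y_train = x_train**2 + 0.05 * rng.standard_normal(10)
x_test = np.linspace(-1, 1, 50)
y_test = x_test**2 + 0.05 * rng.standard_normal(50)

errors = {}
for degree in (1, 2, 6):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train)**2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test)**2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Note that the training error can only decrease as the degree grows, while the test error of the degree-1 model stays large: exactly the U-shaped behavior described above.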

In [3]:

```
#HPO6
nb_setup.images_hconcat(["DL_images/HPO6.png"], width=600)
```

Out[3]:

Figure **HPO6** illustrates the relationship between model capacity and the concepts of underfitting and overfitting by plotting the training and test errors as a function of model capacity. When the capacity is low, both the training and test errors are high. As the capacity increases, the training error steadily decreases, while the test error initially decreases but then starts to increase due to overfitting. Hence the optimal model capacity is the one at which the test error is at a minimum.

This discussion on the generalization ability of DLN models relies on a very important assumption, which is that both the training and test datasets can be generated using the same probabilistic model. In practice this means that if we train the model to recognize a certain type of object, human faces for example, we cannot expect it to perform well if the test data consists entirely of cat faces. There is a famous result called the **No Free Lunch Theorem** which states that if this assumption is not satisfied, i.e., the training and test dataset distributions are unconstrained, then every classification algorithm has the same error rate when classifying previously unobserved points. Hence the only way to do better is by constraining the training and test datasets to a narrower class of data that is relevant to the problem being solved.

In [4]:

```
#RL2
nb_setup.images_hconcat(["DL_images/RL2.png"], width=600)
```

Out[4]:

We introduced the validation dataset in Chapter **PatternRecognition**, and also used it in Chapters **LinearLearningModels** and **TrainingNNsBackprop** when describing the Gradient Descent algorithm. The reader may recall that the rule of thumb is to split the data between the training and test datasets in the ratio 80:20, and then further set aside 20% of the resulting training data for the validation dataset (see Figure **RL2**). We now provide some justification for why the validation dataset is needed.

During the course of this chapter we will often perform experiments whose objective is to determine optimal values of one or more hyper-parameters. We could track the variation of the error rate or classification accuracy in these experiments using the test dataset; however, this is not a good idea. The reason is that doing so causes information about the test data to leak into the training process. Hence we can end up choosing hyper-parameters that are well suited for a particular choice of test data, but won't work well for others. Using the validation dataset ensures that this leakage does not happen.
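
The 80:20 rule of thumb can be sketched with plain array slicing (the dataset below is a made-up placeholder of 1000 samples; in Keras the second split is usually done implicitly via the `validation_split` argument to `fit`):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal((1000, 8))   # hypothetical dataset of 1000 samples

# 80:20 split between training and test data
n_test = int(0.2 * len(data))
test_data = data[:n_test]
trainval_data = data[n_test:]

# Set aside 20% of the remaining training data for validation
n_val = int(0.2 * len(trainval_data))
val_data = trainval_data[:n_val]
train_data = trainval_data[n_val:]

print(len(train_data), len(val_data), len(test_data))   # 640 160 200
```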

In [5]:

```
#RL1
nb_setup.images_hconcat(["DL_images/RL1.png"], width=600)
```

Out[5]:

If a DLN exhibits the following symptom: its training error does not asymptote to zero even when it is trained over a large number of epochs, then it is showing signs of underfitting. This means that the capacity of the model is not sufficiently high to classify even the training data with a low probability of error. In other words, the degree of non-linearity in the training data is higher than the amount of non-linearity the DLN is capable of capturing. An example of the output from a model which suffers from underfitting is shown in Figure **RL1**. This example is taken from Chapter **NN Deep Learning**, and it shows the training and validation curves for the CIFAR-10 dataset, using a Dense Feed Forward model with 2 hidden layers. As the figure shows, the training error and the validation error closely track each other, and flatten out to a large error value with increasing number of epochs. Hence this Dense Feed Forward Network does not have sufficiently high capacity to capture the complexity in CIFAR-10.

In order to remedy this situation, the modeler can increase the model capacity by increasing the number of hidden layers, adding more nodes per hidden layer, changing the regularization parameters (these are introduced in Section **Regularization**) or the Learning Rate, or changing the type of model being used. For the CIFAR-10 example, we will have to replace the Dense Feed Forward model with a Convolutional Neural Network to remedy underfitting, as shown in Chapter **ConvNets**. If none of these steps solves the problem, then it points to bad quality training data.
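
As a rough sketch of the first two remedies, the Dense Feed Forward model could be given more capacity by widening the first layer and inserting an extra hidden layer. The layer widths below are illustrative choices, not tuned values; the input size $32 \times 32 \times 3 = 3072$ corresponds to flattened CIFAR-10 images.

```python
from keras import models
from keras import layers

# Illustrative higher-capacity Dense Feed Forward model for flattened
# CIFAR-10 inputs; the widths (1024, 512) are assumptions, not tuned values.
network = models.Sequential()
network.add(layers.Dense(1024, activation='relu', input_shape=(32 * 32 * 3,)))
network.add(layers.Dense(512, activation='relu'))   # extra hidden layer adds capacity
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```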

In [6]:

```
#RL3
nb_setup.images_hconcat(["DL_images/RL3.png"], width=600)
```

Out[6]:

Overfitting is one of the major problems that plagues ML models. When this problem occurs, the model fits the training data very well, but fails to make good predictions in situations it hasn’t been exposed to before. The causes for overfitting were discussed in the introduction to this chapter, and can be boiled down to either a mismatch between the model capacity and the data complexity and/or insufficient training data. DLNs exhibit the following symptoms when overfitting happens (see Figure **RL3**): The classification accuracy for the training data increases with the number of epochs and may approach 100%, but the test accuracy diverges and plateaus at a much lower value thus opening up a large gap between the two curves.

The following example illustrates the overfitting problem, using the Fashion MNIST dataset, which also comes pre-packaged with Keras. Like the older MNIST dataset, it comes with 60,000 training and 10,000 test examples, and has grayscale images of fashion items that can be classified into 10 categories.

In [1]:

```
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
```

In [3]:

```
train_images.shape
```

Out[3]:

In [4]:

```
len(train_labels)
```

Out[4]:

In [6]:

```
item = train_images[100]
import matplotlib.pyplot as plt
import numpy as np
plt.imshow(item, cmap = plt.cm.binary)
plt.show()
```

In [7]:

```
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
```

In [8]:

```
from keras import models
from keras import layers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```

In [9]:

```
history = network.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)
```

In [10]:

```
history_dict = history.history
history_dict.keys()
```

Out[10]:

In [11]:

```
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```

In [12]:

```
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```

In the previous section Regularization was introduced as a technique used to combat the Overfitting problem. In this section we describe some popular Regularization algorithms, which have proven to be very effective in practice. Because of the importance of this topic, a huge amount of work has been done in this area. Dropout based Regularization, which is described here, is one of the factors that has led to the resurgence of interest in DLNs in the last few years.

There are a wide variety of techniques that are used for Regularization, and in general the one characteristic that unites them is that these techniques reduce the effective capacity of a model, i.e., the ability for the model to handle more complex classification tasks. This makes sense since the basic cause of Overfitting is that the model capacity exceeds the requirements for the problem.

DLNs also exhibit a little understood feature called Self Regularization. For a given amount of Training Set data, if we increase the complexity of the model, by adding additional Hidden Layers for example, then we should start to see overfitting, as per the arguments that we just presented. However, interestingly enough, the increased model complexity leads to higher test data classification accuracy, i.e., the increased complexity somehow self-regularizes the model, see Bishop (1995). Hence when using DLN models, it is a good idea to start with a more complex model than the problem may warrant, and then add Regularization techniques if overfitting is detected.

Some commonly used Regularization techniques include:

Early Stopping

L1 Regularization

L2 Regularization

Dropout Regularization

Training Data Augmentation

Batch Normalization

The first three techniques are well known from Machine Learning days, and continue to be used for DLN models. The last three techniques on the other hand have been specially designed for DLNs, and were discovered in the last few years. They also tend to be more effective than the older ML techniques. Batch Normalization was already described in Chapter **GradientDescentTechniques** as a way of Normalizing activations within a model, and it is also very effective as a Regularization technique.

These techniques are discussed in the next few sub-sections.

In [19]:

```
#RL4
nb_setup.images_hconcat(["DL_images/RL4.png"], width=600)
```

Out[19]:

Early Stopping is one of the most popular, and also effective, techniques to prevent overfitting. The basic idea is simple and is illustrated in Figure **RL4**: use the validation dataset to compute the loss function at the end of each training epoch, and once the loss stops decreasing, stop the training and use the test data to compute the final classification accuracy. In practice it is more robust to wait until the validation loss has stopped decreasing for four or five successive epochs before stopping. The justification for this rule is quite simple: the point at which the validation loss starts to increase is when the model starts to overfit the training data, since from this point onwards its generalization ability starts to decrease. Early Stopping can be used by itself or in combination with other Regularization techniques.

Note that the Optimal Stopping Point can be considered to be a hyper-parameter, hence effectively we are testing out multiple values of the hyper-parameter during the course of a single training run. This makes Early Stopping more efficient than other hyper-parameter optimization techniques, which typically require a complete run of the model to test out a single hyper-parameter value. Another advantage of Early Stopping is that it is a fairly unobtrusive form of Regularization, since it does not require any changes to the model or objective function which could change the learning dynamics of the system.

The following example shows how to implement Early Stopping in Keras.

In [20]:

```
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
```

We use two pre-defined Keras callbacks to implement Early Stopping:

The callback *EarlyStopping* monitors the validation accuracy, and interrupts the execution of the model if this quantity stops increasing for more than *patience* epochs.

The callback *ModelCheckpoint* saves the model and its parameters after every epoch in the specified .h5 file. The *save_best_only* flag ensures that it doesn't override the model file unless *val_loss* has become smaller, which allows us to save the best model seen during training.

In [21]:

```
from keras import models
from keras import layers
from keras import callbacks
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor = 'val_accuracy',
        patience = 4,
    ),
    keras.callbacks.ModelCheckpoint(
        filepath = 'Models/fashion.h5',
        monitor = 'val_loss',
        save_best_only = True,
    )
]
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```

In [17]:

```
history = network.fit(train_images, train_labels, epochs=100, batch_size=128,
                      callbacks = callbacks_list, validation_split=0.2)
```

L2 Regularization is a commonly used technique in ML systems, and is also sometimes referred to as "Weight Decay". It works by adding a quadratic term to the Cross Entropy Loss Function $\mathcal L$, called the Regularization Term, which results in a new Loss Function $\mathcal L_R$ given by:

\begin{equation} \mathcal L_R = {\mathcal L} + \frac{\lambda}{2} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} (w_{ij}^{(r)})^2 \quad \quad (**L2reg**) \end{equation}

The Regularization Term consists of the sum of the squares of all the link weights in the DLN, multiplied by a parameter $\lambda$ called the Regularization Parameter. This is another hyper-parameter whose appropriate value needs to be chosen as part of the training process by using the validation dataset. By choosing a value for this parameter, we decide on the relative importance of the Regularization Term vs the Loss Function term. Note that the Regularization Term does not include the biases, since in practice it has been found that their inclusion does not make much of a difference to the final result.

In order to gain an intuitive understanding of how L2 Regularization works against overfitting, note that the net effect of adding the Regularization Term is to bias the algorithm towards choosing smaller values for the link weight parameters. The value of the parameter $\lambda$ governs the relative importance of the Cross Entropy term vs the regularization term and as $\lambda$ increases, the system tends to favor smaller and smaller weight values.

L2 Regularization also leads to more "diffuse" weight parameters, in other words, it encourages the network to use all its inputs a little rather than some of its inputs a lot. How does this help? A complete answer to this question will have to await some sort of theory of regularization, which does not exist at present. But in general, going back to the example of overfitting in the context of Linear Regression in Figure **UnderfittingOverfitting**, it is observed that when overfitting occurs as in the right hand part of that figure, the parameters of the model (which in this case are coefficients to the powers of $x$) begin to assume very large values in an attempt to fit all of the training data. Hence one of the signs of overfitting is that the model parameters, whether they are DLN weights or polynomial coefficients, blow up in value during training, which results in the model giving too much importance to the idiosyncrasies of the training dataset. This line of argument leads to the conclusion that smaller values of the model parameters enable the model to generalize better, and hence do a better job of classifying patterns it has not seen before. This increase in the values of the model parameters always seems to occur in the later stages of the training process. Hence one of the effects of the Early Stopping rule is to restrain the growth in the model parameter values. Therefore, in some sense, Early Stopping is also a form of Regularization, and indeed it can be shown that L2 Regularization and Early Stopping are mathematically equivalent.

In order to get further insight into L2 Regularization, we investigate its effect on the Gradient Descent based update equations (**Wijr**)-(**bir**) for the weight and bias parameters. Taking the derivative on both sides of equation (**L2reg**), we obtain

\begin{equation} \frac{\partial \mathcal L_R}{\partial w_{ij}^{(r)}} = \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} + \lambda w_{ij}^{(r)} \quad \quad (**LWM**) \end{equation}

Substituting equation (**LWM**) back into equation (**GradDesc**), the weight update rule becomes:

\begin{equation} w_{ij}^{(r)} \leftarrow \left(1 - {\eta \lambda}\right) w_{ij}^{(r)} - \eta\, \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} \quad \quad (**L2W**) \end{equation}

Comparing the preceding equation (**L2W**) with equation (**GradDesc**), it follows that the net effect of the Regularization Term on the Gradient Descent rule is to rescale the weight $w_{ij}^{(r)}$ by a factor of $(1-{\eta \lambda})$ before applying the gradient to it. This is called “weight decay” since it causes the weight to become smaller with each iteration.

If Stochastic Gradient Descent with batch size $B$ is used then the weight update rule becomes

\begin{equation} w_{ij}^{(r)}\leftarrow \left(1 - {\eta \lambda} \right) w_{ij}^{(r)} - \frac{\eta}{B} \; \sum_{m=1}^B \frac{\partial \mathcal L(m)}{\partial w_{ij}^{(r)}} \quad \quad (**L2wur**) \end{equation}

In both preceding equations (**L2W**) and (**L2wur**) the gradients $\frac{\partial \mathcal L}{\partial w_{ij}^{(r)}}$ are computed using the usual Backprop algorithm.
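
The weight decay effect can be checked numerically. The sketch below applies the L2-regularized update to a random weight matrix with the data gradient set to zero, so that only the decay term acts; the values of $\eta$ and $\lambda$ are illustrative choices.

```python
import numpy as np

# Numeric check of the weight-decay factor in the L2 update: with the data
# gradient zeroed out, each update just rescales w by (1 - eta*lam).
rng = np.random.default_rng(0)
eta, lam = 0.1, 0.01
w = rng.standard_normal((4, 3))

w0_norm = np.linalg.norm(w)
for _ in range(100):
    grad = np.zeros_like(w)               # isolate the decay term
    w = (1 - eta * lam) * w - eta * grad
ratio = np.linalg.norm(w) / w0_norm
print(ratio)                              # (1 - 0.001)**100, about 0.905
```

This shows why the technique is called "weight decay": absent any gradient signal, the weights shrink geometrically toward zero.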

In the following example, we apply L2 Regularization to the Fashion MNIST dataset. From the results we can see that this was not very effective in improving the validation accuracy of the system. Indeed it seems that the system went from a state of Overfitting with no Regularization to a state of Underfitting with L2 Regularization, since the Training Accuracy also decreased by quite a bit. Since Regularization has the effect of reducing the Model Capacity, it is quite plausible that the resulting decrease in Capacity pushed the system into the Underfitting state.

In [22]:

```
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
from keras import models
from keras import layers
from keras import regularizers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', kernel_regularizer = regularizers.l2(0.001),
                         input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=100, batch_size=128,
                      validation_split=0.2)
```

In [23]:

```
history_dict = history.history
history_dict.keys()
```

Out[23]:

In [24]:

```
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```

In [25]:

```
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```

L1 Regularization uses a Regularization Function which is the sum of the absolute value of all the weights in DLN, resulting in the following loss function ($\mathcal L$ is the usual Cross Entropy loss):

\begin{equation} \mathcal L_R = \mathcal L + {\lambda} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} |w_{ij}^{(r)}| \quad \quad (**L1reg**) \end{equation}

At a high level L1 Regularization is similar to L2 Regularization since it also leads to smaller weights. It results in the following weight update equation when using Stochastic Gradient Descent (where $sgn$ is the sign function, such that $sgn(w) = +1$ if $w > 0$, $sgn(w) = -1$ if $w < 0$, and $sgn(0) = 0$):

\begin{equation} w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - {\eta \lambda}\; sgn(w_{ij}^{(r)}) - {\eta}\; \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} \quad \quad (**Wsgn**) \end{equation}

Comparing equations (**Wsgn**) and (**L2W**) we can see that both L1 and L2 Regularization lead to a reduction in the weights with each iteration. However the way the weights drop is different: in L2 Regularization the weight reduction is multiplicative and proportional to the value of the weight, so it is faster for large weights and decelerates as the weights get smaller. In L1 Regularization on the other hand, the weights are reduced by a fixed amount in every iteration, irrespective of the value of the weight. Hence for larger weights L2 Regularization is faster than L1, while for smaller weights the reverse is true. As a result L1 Regularization leads to DLNs in which the weights of most of the connections tend towards zero, with a few connections with larger weights left over. The type of DLN that results after the application of L1 Regularization is said to be "sparse".
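
This difference between the two shrinkage rules can be made concrete with a small numeric sketch. With the data gradient set to zero, the L1 update subtracts a constant step while the L2 update rescales multiplicatively; the values of $\eta$, $\lambda$ and the starting weights below are illustrative.

```python
import numpy as np

# Shrinkage-only comparison of the L1 and L2 updates (data gradient zeroed).
eta, lam = 0.1, 0.5
step = eta * lam                          # fixed L1 step size (0.05)
w_l1 = np.array([5.0, 0.05, -3.0, -0.02])
w_l2 = w_l1.copy()

for _ in range(50):
    w_l1 = w_l1 - step * np.sign(w_l1)    # constant-size step toward zero
    w_l1[np.abs(w_l1) < step] = 0.0       # clamp overshoot past zero
    w_l2 = (1 - step) * w_l2              # proportional (multiplicative) decay

print(w_l1)   # small weights driven exactly to zero -> sparse solution
print(w_l2)   # every weight shrunk, but none exactly zero
```

After 50 iterations the two small L1 weights sit exactly at zero while the large ones survive, whereas under L2 all four weights are merely scaled down: precisely the sparsity contrast described above.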

Dropout is one of the most effective Regularization techniques to have emerged in the last few years, see Srivastava (2013); Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov (2014). We first describe the algorithm and then discuss reasons for its effectiveness.

In [32]:

```
#dropoutRegularization
nb_setup.images_hconcat(["DL_images/dropoutRegularization.png"], width=600)
```

Out[32]:

The basic idea behind Dropout is to run each iteration of the Backprop algorithm on randomly modified versions of the original DLN. The random modifications are carried out to the topology of the DLN using the following rules:

Assign probability values $p^{(r)}, 0 \leq r \leq R$, where $p^{(r)}$ is defined as the probability that a node in layer $r$ is present in the model, and use these to generate $\{0,1\}$-valued Bernoulli random variables $e_j^{(r)}$: $$ e_j^{(r)} \sim Bernoulli(p^{(r)}), \quad 0 \leq r \leq R,\ \ 1 \leq j \leq P^r $$

Modify the input vector as follows: \begin{equation} \hat x_j = e_j^{(0)} x_j, \quad 1 \leq j \leq N \quad \quad (**Xj0**) \end{equation}

Modify the activations $z_j^{(r)}$ of the hidden layer r as follows: \begin{equation} \hat z_j^{(r)} = e_j^{(r)} z_j^{(r)}, \quad 1 \leq r \leq R,\ \ 1 \leq j \leq P^r \quad \quad (**Zj0**) \end{equation}

The net effect of equation (**Xj0**) is that for each iteration of the Backprop algorithm, instead of using the entire input vector $x_j, 1 \leq j \leq N$, a randomly selected subset $\hat x_j$ is used instead. Similarly the net effect of equation (**Zj0**) is that, in each iteration of Backprop, a randomly selected subset of the nodes in each of the hidden layers is erased. Note that since the random subset chosen for erasure in each layer changes from iteration to iteration, we are effectively running Backprop on a different and thinned version of the original DLN network each time. This is illustrated in Figure **dropoutRegularization**, with part (a) showing the original DLN, and part (b) showing the DLN after a random subset of the nodes has been erased by applying Dropout. Also note that the retention probabilities are the same for all nodes within a layer, but can vary from layer to layer.
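
The masking in equations (**Xj0**)-(**Zj0**) amounts to an elementwise multiply by a Bernoulli mask, as in this minimal sketch (the layer shape and retention probability $p$ are illustrative values):

```python
import numpy as np

# Sketch of equation (**Zj0**): multiply hidden-layer activations by Bernoulli
# masks; p is the probability that a node is retained.
rng = np.random.default_rng(0)
p = 0.5
z = rng.standard_normal((4, 10))        # activations: 4 samples x 10 hidden nodes
e = rng.binomial(1, p, size=z.shape)    # e_j ~ Bernoulli(p), redrawn every iteration
z_hat = e * z                           # erased nodes contribute exactly zero

print((z_hat == 0).mean())              # close to 1 - p of activations are zeroed
```

Redrawing `e` on every Backprop iteration is what produces the "thinned" networks of Figure **dropoutRegularization**.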

In [33]:

```
#weightAdjustment
nb_setup.images_hconcat(["DL_images/weightAdjustment.png"], width=600)
```

Out[33]: