Training Neural Networks Part 2

In [18]:
from ipypublish import nb_setup

Generalization

Once the DLN model has been trained, its true test is how well it is able to classify inputs that it has not seen before, which is also known as its Generalization Ability. There are two kinds of problems that can afflict ML models in general:

  1. Even after the model has been fully trained such that its training error is small, it exhibits a high test error rate. This is known as the problem of Overfitting.

  2. The training error fails to come down in spite of several epochs of training. This is known as the problem of Underfitting.

In [2]:
#UnderfittingOverfitting
nb_setup.images_hconcat(["DL_images/UnderfittingOverfitting.png"], width=600)
Out[2]:

Underfitting and Overfitting

The problems of Underfitting and Overfitting are best visualized in the context of the Regression problem of fitting a curve to the training data, see Figure UnderfittingOverfitting. In this figure, the crosses denote the training data while the solid curve is the ML model that tries to fit this data. Three types of models were used to fit the data: a straight line in the left hand figure, a second degree polynomial in the middle figure and a polynomial of degree 6 in the right hand figure. The left figure shows that the straight line is not able to capture the curvature in the training data, which leads to the problem of Underfitting; adding more training data would not reduce this error. Increasing the complexity of the model to a second-degree polynomial helps a lot, as the middle figure shows. On the other hand, if we increase the complexity of the model further by using a sixth-degree polynomial, then it backfires, as illustrated in the right hand figure. In this case the curve fits all the training points exactly, but fails to fit test data points, which illustrates the problem of Overfitting.
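
As an illustration, the following sketch (not part of the original figure code, and using synthetic data) reproduces the three fits in Figure UnderfittingOverfitting with NumPy polynomial regression: a degree-1 fit that underfits, a degree-2 fit that matches the data, and a degree-6 fit that overfits.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = np.linspace(-1, 1, 10)
y = x**2 + 0.1 * rng.randn(10)       # noisy quadratic training data (the "crosses")
x_plot = np.linspace(-1, 1, 200)

plt.figure(figsize=(12, 3))
for i, degree in enumerate([1, 2, 6]):
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    plt.subplot(1, 3, i + 1)
    plt.plot(x, y, 'x')                          # training points
    plt.plot(x_plot, np.polyval(coeffs, x_plot)) # fitted model
    plt.title('degree = %d' % degree)
plt.show()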

Model Capacity

This example shows that it is critical to choose a model whose complexity fits the training data: choosing a model that is either too simple or too complex results in poor performance. A full scale DLN model can have millions of parameters, and in this case a formal definition of its complexity remains an open problem. A concept that is often used in this context is that of "Model Capacity", which is defined as the ability of the model to handle complex datasets. This can be formally computed for simple Binary Classification models and is known as the Vapnik-Chervonenkis or VC dimension. Such a formula does not exist for a DLN model in general, but later in this chapter we will give guidelines that can be used to estimate the effect of the model's hyper-parameters on its capacity.

The following factors determine how well a model is able to generalize from the training dataset to the test dataset:

  • The model capacity and its relation to data complexity: In general if the model capacity is less than the data complexity then it leads to underfitting, while if the converse is true, then it can lead to overfitting. Hence we should try to choose a model whose capacity matches the complexity of the training and test datasets.

  • Even if the model capacity and the data complexity are well matched, we can still encounter the overfitting problem. This is caused by an insufficient amount of training data.

Based on these observations, a useful rule of thumb is the following: the chances of encountering the overfitting problem increase as the model capacity grows, but decrease as more training data becomes available. Note that we are attempting to reduce the training error (to avoid the underfitting problem) and the test error (to avoid the overfitting problem) at the same time. This leads to conflicting demands on the model capacity, since the training error decreases monotonically as the model capacity increases, but the test error starts to increase if the model capacity is too high. In general, if we plot the test error as a function of model capacity, it exhibits a characteristic U shaped curve. The ideal model capacity is the point at which the test error starts to increase. This criterion is used extensively in DLNs to determine the best set of hyper-parameters to use.

In [3]:
#HPO6
nb_setup.images_hconcat(["DL_images/HPO6.png"], width=600)
Out[3]:

Figure HPO6 illustrates the relationship between model capacity and the concepts of underfitting and overfitting by plotting the training and test errors as a function of model capacity. When the capacity is low, both the training and test errors are high. As the capacity increases, the training error steadily decreases, while the test error initially decreases but then starts to increase due to overfitting. Hence the optimal model capacity is the one at which the test error is at a minimum.
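
The U shaped test error curve can be reproduced with a simple experiment, sketched below using synthetic data and polynomial degree as a stand-in for model capacity (the data generating function and degree range are illustrative choices, not from the text):

import numpy as np

rng = np.random.RandomState(1)
def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.2 * rng.randn(n)

x_train, y_train = make_data(30)      # small training set
x_test, y_test = make_data(200)       # held-out test set

for degree in range(1, 12):           # sweep "model capacity"
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print('degree %2d  train MSE %.3f  test MSE %.3f' % (degree, train_mse, test_mse))

The training MSE decreases monotonically with the degree, while the test MSE typically bottoms out at a moderate degree and then climbs, tracing out the U shaped curve.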

This discussion on the generalization ability of DLN models relies on a very important assumption, which is that both the training and test datasets are generated by the same probabilistic model. In practice this means that if we train the model to recognize a certain type of object, human faces for example, we cannot expect it to perform well when the test data consists entirely of cat faces. There is a famous result called the No Free Lunch Theorem which states that if this assumption is not satisfied, i.e., the training and test dataset distributions are un-constrained, then every classification algorithm has the same error rate when classifying previously unobserved points. Hence the only way to do better is by constraining the training and test datasets to a narrower class of data that is relevant to the problem being solved.

The Validation Dataset

In [4]:
#RL2
nb_setup.images_hconcat(["DL_images/RL2.png"], width=600)
Out[4]:

We introduced the validation dataset in Chapter PatternRecognition, and also used it in Chapters LinearLearningModels and TrainingNNsBackprop when describing the Gradient Descent algorithm. The reader may recall that the rule of thumb is to split the data between the training and test datasets in the ratio 80:20, and then further set aside 20% of the resulting training data for the validation dataset (see Figure RL2). We now provide some justification for why the validation dataset is needed.

During the course of this chapter we will often perform experiments whose objective is to determine optimal values of one or more hyper-parameters. We could track the variation of the error rate or classification accuracy in these experiments using the test dataset; however, this is not a good idea. The reason is that doing so causes information about the test data to leak into the training process. Hence we can end up choosing hyper-parameters that are well suited for a particular choice of test data, but won't work well for others. Using the validation dataset ensures that this leakage does not happen.
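
A minimal sketch of the 80:20 rule of thumb, using plain NumPy shuffling on dummy arrays (the array names and sizes are illustrative), is shown below; in the Keras examples later in this chapter the same effect is obtained with the validation_split=0.2 argument to fit().

import numpy as np

X = np.random.rand(1000, 20)            # 1000 samples, 20 features (dummy data)
y = np.random.randint(0, 10, 1000)

idx = np.random.permutation(len(X))     # shuffle before splitting
X, y = X[idx], y[idx]

n_test = int(0.2 * len(X))              # 20% held out as the test set
X_test, y_test = X[:n_test], y[:n_test]
X_rest, y_rest = X[n_test:], y[n_test:]

n_val = int(0.2 * len(X_rest))          # 20% of the remaining training data for validation
X_val, y_val = X_rest[:n_val], y_rest[:n_val]
X_train, y_train = X_rest[n_val:], y_rest[n_val:]

print(X_train.shape, X_val.shape, X_test.shape)   # (640, 20) (160, 20) (200, 20)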

Detecting Underfitting

In [5]:
#RL1
nb_setup.images_hconcat(["DL_images/RL1.png"], width=600)
Out[5]:

If a DLN exhibits the following symptom: its training error does not asymptote to zero (equivalently, its training accuracy does not approach 100%), even when it is trained over a large number of epochs, then it is showing signs of underfitting. This means that the capacity of the model is not sufficiently high to be able to classify even the training data with a low probability of error. In other words, the degree of non-linearity in the training data is higher than the amount of non-linearity the DLN is capable of capturing. An example of the output from a model which suffers from underfitting is shown in Figure RL1. This example is taken from Chapter NN Deep Learning, and it shows the training and validation curves for the CIFAR-10 dataset, using a Dense Feed Forward model with 2 hidden layers. As the figure shows, both the training error and the validation error closely track each other, and flatten out to a large error value with increasing number of epochs. Hence this Dense Feed Forward Network does not have sufficiently high capacity to capture the complexity in CIFAR-10.

In order to remedy this situation, the modeler can increase the model capacity by increasing the number of hidden layers, adding more nodes per hidden layer, changing the regularization parameters (these are introduced in Section Regularization) or the Learning Rate, or changing the type of model being used. For the CIFAR-10 example, we will have to replace the Dense Feed Forward model with a Convolutional Neural Network to remedy underfitting, as shown in Chapter ConvNets. If none of these steps solves the problem, then it points to bad quality training data.
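
As a hedged sketch of the first remedy, the Dense Feed Forward model used later in this chapter could be given more capacity by widening and deepening it (the layer sizes below are illustrative choices, not a recommendation):

from keras import models, layers

network = models.Sequential()
network.add(layers.Dense(1024, activation='relu', input_shape=(28 * 28,)))  # wider first layer
network.add(layers.Dense(512, activation='relu'))                           # extra hidden layer
network.add(layers.Dense(256, activation='relu'))                           # extra hidden layer
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])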

Detecting Overfitting

In [6]:
#RL3
nb_setup.images_hconcat(["DL_images/RL3.png"], width=600)
Out[6]:

Overfitting is one of the major problems that plagues ML models. When this problem occurs, the model fits the training data very well, but fails to make good predictions in situations it hasn’t been exposed to before. The causes for overfitting were discussed in the introduction to this chapter, and can be boiled down to either a mismatch between the model capacity and the data complexity and/or insufficient training data. DLNs exhibit the following symptoms when overfitting happens (see Figure RL3): The classification accuracy for the training data increases with the number of epochs and may approach 100%, but the test accuracy diverges and plateaus at a much lower value thus opening up a large gap between the two curves.

The following example illustrates the overfitting problem, using the Fashion MNIST dataset, which also comes pre-packaged with Keras. Like the older MNIST dataset, it comes with 60,000 training and 10,000 test examples, and consists of grayscale images of fashion items that can be classified into 10 categories.

In [1]:
import keras
keras.__version__
from keras import models
from keras import layers

from keras.datasets import fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
In [3]:
train_images.shape
Out[3]:
(60000, 28, 28)
In [4]:
len(train_labels)
Out[4]:
60000
In [6]:
item = train_images[100]

import matplotlib.pyplot as plt
import numpy as np
plt.imshow(item, cmap = plt.cm.binary)
plt.show()
In [7]:
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

from tensorflow.keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
In [8]:
from keras import models
from keras import layers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
In [9]:
history = network.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)
Epoch 1/100
375/375 [==============================] - 3s 8ms/step - loss: 0.5923 - accuracy: 0.7911 - val_loss: 0.4043 - val_accuracy: 0.8528
Epoch 2/100
375/375 [==============================] - 3s 7ms/step - loss: 0.3981 - accuracy: 0.8546 - val_loss: 0.3612 - val_accuracy: 0.8674
Epoch 3/100
375/375 [==============================] - 3s 7ms/step - loss: 0.3465 - accuracy: 0.8723 - val_loss: 0.3512 - val_accuracy: 0.8734
Epoch 4/100
375/375 [==============================] - 3s 7ms/step - loss: 0.3211 - accuracy: 0.8809 - val_loss: 0.3370 - val_accuracy: 0.8792
Epoch 5/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2978 - accuracy: 0.8891 - val_loss: 0.3537 - val_accuracy: 0.8711
Epoch 6/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2826 - accuracy: 0.8954 - val_loss: 0.3244 - val_accuracy: 0.8848
Epoch 7/100
375/375 [==============================] - 3s 8ms/step - loss: 0.2695 - accuracy: 0.9011 - val_loss: 0.3163 - val_accuracy: 0.8867
Epoch 8/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2577 - accuracy: 0.9038 - val_loss: 0.3538 - val_accuracy: 0.8787
Epoch 9/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2457 - accuracy: 0.9084 - val_loss: 0.3293 - val_accuracy: 0.8879
Epoch 10/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2374 - accuracy: 0.9115 - val_loss: 0.3760 - val_accuracy: 0.8693
Epoch 11/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2271 - accuracy: 0.9156 - val_loss: 0.3391 - val_accuracy: 0.8850
Epoch 12/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2183 - accuracy: 0.9178 - val_loss: 0.3545 - val_accuracy: 0.8832
Epoch 13/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2107 - accuracy: 0.9215 - val_loss: 0.3221 - val_accuracy: 0.8920
Epoch 14/100
375/375 [==============================] - 3s 7ms/step - loss: 0.2039 - accuracy: 0.9246 - val_loss: 0.3372 - val_accuracy: 0.8925
Epoch 15/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1979 - accuracy: 0.9275 - val_loss: 0.3360 - val_accuracy: 0.8938
Epoch 16/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1920 - accuracy: 0.9291 - val_loss: 0.3451 - val_accuracy: 0.8910
Epoch 17/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1866 - accuracy: 0.9315 - val_loss: 0.3842 - val_accuracy: 0.8888
Epoch 18/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1798 - accuracy: 0.9345 - val_loss: 0.3676 - val_accuracy: 0.8922
Epoch 19/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1740 - accuracy: 0.9353 - val_loss: 0.3579 - val_accuracy: 0.8982
Epoch 20/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1707 - accuracy: 0.9364 - val_loss: 0.3772 - val_accuracy: 0.8926
Epoch 21/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1648 - accuracy: 0.9394 - val_loss: 0.3889 - val_accuracy: 0.8845
Epoch 22/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1614 - accuracy: 0.9386 - val_loss: 0.3688 - val_accuracy: 0.8983
Epoch 23/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1575 - accuracy: 0.9416 - val_loss: 0.4003 - val_accuracy: 0.8840
Epoch 24/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1539 - accuracy: 0.9429 - val_loss: 0.3893 - val_accuracy: 0.8955
Epoch 25/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1486 - accuracy: 0.9444 - val_loss: 0.4050 - val_accuracy: 0.8917
Epoch 26/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1437 - accuracy: 0.9467 - val_loss: 0.3898 - val_accuracy: 0.8991
Epoch 27/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1377 - accuracy: 0.9505 - val_loss: 0.4051 - val_accuracy: 0.8944
Epoch 28/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1382 - accuracy: 0.9492 - val_loss: 0.4106 - val_accuracy: 0.8931
Epoch 29/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1361 - accuracy: 0.9505 - val_loss: 0.4171 - val_accuracy: 0.8960
Epoch 30/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1319 - accuracy: 0.9523 - val_loss: 0.4534 - val_accuracy: 0.8913
Epoch 31/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1288 - accuracy: 0.9532 - val_loss: 0.4164 - val_accuracy: 0.8926
Epoch 32/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1236 - accuracy: 0.9542 - val_loss: 0.4401 - val_accuracy: 0.8989
Epoch 33/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1228 - accuracy: 0.9555 - val_loss: 0.4454 - val_accuracy: 0.8976
Epoch 34/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1183 - accuracy: 0.9563 - val_loss: 0.4887 - val_accuracy: 0.8952
Epoch 35/100
375/375 [==============================] - 3s 8ms/step - loss: 0.1166 - accuracy: 0.9576 - val_loss: 0.4749 - val_accuracy: 0.8884
Epoch 36/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1140 - accuracy: 0.9593 - val_loss: 0.4941 - val_accuracy: 0.8908
Epoch 37/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1131 - accuracy: 0.9590 - val_loss: 0.4810 - val_accuracy: 0.8936
Epoch 38/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1107 - accuracy: 0.9605 - val_loss: 0.4970 - val_accuracy: 0.8935
Epoch 39/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1074 - accuracy: 0.9618 - val_loss: 0.4855 - val_accuracy: 0.8913
Epoch 40/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1061 - accuracy: 0.9614 - val_loss: 0.4933 - val_accuracy: 0.8917
Epoch 41/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1046 - accuracy: 0.9622 - val_loss: 0.5009 - val_accuracy: 0.8957
Epoch 42/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1006 - accuracy: 0.9636 - val_loss: 0.5501 - val_accuracy: 0.8876
Epoch 43/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0993 - accuracy: 0.9649 - val_loss: 0.5277 - val_accuracy: 0.8938
Epoch 44/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0964 - accuracy: 0.9649 - val_loss: 0.5159 - val_accuracy: 0.8966
Epoch 45/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0976 - accuracy: 0.9653 - val_loss: 0.5511 - val_accuracy: 0.8940
Epoch 46/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0924 - accuracy: 0.9669 - val_loss: 0.5445 - val_accuracy: 0.8963
Epoch 47/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0927 - accuracy: 0.9682 - val_loss: 0.5381 - val_accuracy: 0.8964
Epoch 48/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0916 - accuracy: 0.9673 - val_loss: 0.5862 - val_accuracy: 0.8875
Epoch 49/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0890 - accuracy: 0.9678 - val_loss: 0.6001 - val_accuracy: 0.8906
Epoch 50/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0888 - accuracy: 0.9684 - val_loss: 0.6575 - val_accuracy: 0.8881
Epoch 51/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0869 - accuracy: 0.9694 - val_loss: 0.5718 - val_accuracy: 0.8953
Epoch 52/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0836 - accuracy: 0.9703 - val_loss: 0.6245 - val_accuracy: 0.8933
Epoch 53/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0859 - accuracy: 0.9697 - val_loss: 0.6129 - val_accuracy: 0.8934
Epoch 54/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0822 - accuracy: 0.9709 - val_loss: 0.6032 - val_accuracy: 0.8952
Epoch 55/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0784 - accuracy: 0.9725 - val_loss: 0.6191 - val_accuracy: 0.8963
Epoch 56/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0806 - accuracy: 0.9723 - val_loss: 0.6220 - val_accuracy: 0.8969
Epoch 57/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0772 - accuracy: 0.9732 - val_loss: 0.6070 - val_accuracy: 0.8971
Epoch 58/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0775 - accuracy: 0.9729 - val_loss: 0.6231 - val_accuracy: 0.8955
Epoch 59/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0796 - accuracy: 0.9721 - val_loss: 0.6412 - val_accuracy: 0.8967
Epoch 60/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0745 - accuracy: 0.9745 - val_loss: 0.6518 - val_accuracy: 0.8966
Epoch 61/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0746 - accuracy: 0.9746 - val_loss: 0.6416 - val_accuracy: 0.8928
Epoch 62/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0710 - accuracy: 0.9749 - val_loss: 0.6986 - val_accuracy: 0.8941
Epoch 63/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0718 - accuracy: 0.9759 - val_loss: 0.6977 - val_accuracy: 0.8889
Epoch 64/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0702 - accuracy: 0.9762 - val_loss: 0.6681 - val_accuracy: 0.8934
Epoch 65/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0674 - accuracy: 0.9767 - val_loss: 0.6653 - val_accuracy: 0.8939
Epoch 66/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0687 - accuracy: 0.9763 - val_loss: 0.7002 - val_accuracy: 0.8951
Epoch 67/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0656 - accuracy: 0.9778 - val_loss: 0.7144 - val_accuracy: 0.8936
Epoch 68/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0647 - accuracy: 0.9776 - val_loss: 0.7465 - val_accuracy: 0.8907
Epoch 69/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0676 - accuracy: 0.9776 - val_loss: 0.7334 - val_accuracy: 0.8928
Epoch 70/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0644 - accuracy: 0.9774 - val_loss: 0.7309 - val_accuracy: 0.8956
Epoch 71/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0645 - accuracy: 0.9783 - val_loss: 0.7459 - val_accuracy: 0.8950
Epoch 72/100
375/375 [==============================] - 2s 7ms/step - loss: 0.0623 - accuracy: 0.9789 - val_loss: 0.7609 - val_accuracy: 0.8954
Epoch 73/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0609 - accuracy: 0.9785 - val_loss: 0.7495 - val_accuracy: 0.8953
Epoch 74/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0620 - accuracy: 0.9791 - val_loss: 0.7812 - val_accuracy: 0.8900
Epoch 75/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0614 - accuracy: 0.9794 - val_loss: 0.7490 - val_accuracy: 0.8930
Epoch 76/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0586 - accuracy: 0.9802 - val_loss: 0.7639 - val_accuracy: 0.8969
Epoch 77/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0612 - accuracy: 0.9799 - val_loss: 0.8097 - val_accuracy: 0.8927
Epoch 78/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0630 - accuracy: 0.9797 - val_loss: 0.8375 - val_accuracy: 0.8928
Epoch 79/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0599 - accuracy: 0.9803 - val_loss: 0.8467 - val_accuracy: 0.8925
Epoch 80/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0612 - accuracy: 0.9803 - val_loss: 0.8122 - val_accuracy: 0.8950
Epoch 81/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0574 - accuracy: 0.9810 - val_loss: 0.8761 - val_accuracy: 0.8930
Epoch 82/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0573 - accuracy: 0.9811 - val_loss: 0.8276 - val_accuracy: 0.8936
Epoch 83/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0580 - accuracy: 0.9814 - val_loss: 0.8373 - val_accuracy: 0.8921
Epoch 84/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0574 - accuracy: 0.9814 - val_loss: 0.8793 - val_accuracy: 0.8947
Epoch 85/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0535 - accuracy: 0.9820 - val_loss: 0.9012 - val_accuracy: 0.8905
Epoch 86/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0566 - accuracy: 0.9814 - val_loss: 0.9478 - val_accuracy: 0.8847
Epoch 87/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0553 - accuracy: 0.9826 - val_loss: 0.8579 - val_accuracy: 0.8948
Epoch 88/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0529 - accuracy: 0.9830 - val_loss: 0.8626 - val_accuracy: 0.8944
Epoch 89/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0503 - accuracy: 0.9837 - val_loss: 0.9221 - val_accuracy: 0.8923
Epoch 90/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0509 - accuracy: 0.9838 - val_loss: 0.8517 - val_accuracy: 0.8937
Epoch 91/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0486 - accuracy: 0.9846 - val_loss: 0.9109 - val_accuracy: 0.8971
Epoch 92/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0533 - accuracy: 0.9833 - val_loss: 0.9352 - val_accuracy: 0.8892
Epoch 93/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0504 - accuracy: 0.9844 - val_loss: 0.9123 - val_accuracy: 0.8976
Epoch 94/100
375/375 [==============================] - 2s 7ms/step - loss: 0.0483 - accuracy: 0.9840 - val_loss: 0.9532 - val_accuracy: 0.8903
Epoch 95/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0502 - accuracy: 0.9842 - val_loss: 0.9666 - val_accuracy: 0.8898
Epoch 96/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0515 - accuracy: 0.9837 - val_loss: 0.9376 - val_accuracy: 0.8938
Epoch 97/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0470 - accuracy: 0.9849 - val_loss: 0.9374 - val_accuracy: 0.8917
Epoch 98/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0473 - accuracy: 0.9854 - val_loss: 0.9197 - val_accuracy: 0.8899
Epoch 99/100
375/375 [==============================] - 3s 7ms/step - loss: 0.0464 - accuracy: 0.9850 - val_loss: 1.0070 - val_accuracy: 0.8866
Epoch 100/100
375/375 [==============================] - 3s 8ms/step - loss: 0.0485 - accuracy: 0.9854 - val_loss: 0.9458 - val_accuracy: 0.8923
In [10]:
history_dict = history.history
history_dict.keys()
Out[10]:
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
In [11]:
import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()
In [12]:
plt.clf()   # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Regularization

In the previous section Regularization was introduced as a technique used to combat the Overfitting problem. In this section we describe some popular Regularization algorithms, which have proven to be very effective in practice. Because of the importance of this topic, a huge amount of work has been done in this area. Dropout based Regularization, which is described here, is one of the factors that has led to the resurgence of interest in DLNs in the last few years.

A wide variety of techniques are used for Regularization, and the one characteristic that unites them is that they reduce the effective capacity of a model, i.e., the ability of the model to handle more complex classification tasks. This makes sense since the basic cause of Overfitting is that the model capacity exceeds the requirements of the problem.

DLNs also exhibit a little understood feature called Self Regularization. For a given amount of Training Set data, if we increase the complexity of the model, by adding additional Hidden Layers for example, then we should start to see overfitting, as per the arguments that we just presented. However, interestingly enough, increased model complexity leads to higher test data classification accuracy, i.e., the increased complexity somehow self-regularizes the model, see Bishop (1995). Hence when using DLN models, it is a good idea to start with a more complex model than the problem may warrant, and then add Regularization techniques if overfitting is detected.

Some commonly used Regularization techniques include:

  • Early Stopping

  • L1 Regularization

  • L2 Regularization

  • Dropout Regularization

  • Training Data Augmentation

  • Batch Normalization

The first three techniques are well known from Machine Learning days, and continue to be used for DLN models. The last three techniques on the other hand have been specially designed for DLNs, and were discovered in the last few years. They also tend to be more effective than the older ML techniques. Batch Normalization was already described in Chapter GradientDescentTechniques as a way of Normalizing activations within a model, and it is also very effective as a Regularization technique.

These techniques are discussed in the next few sub-sections.

Early Stopping

In [19]:
#RL4
nb_setup.images_hconcat(["DL_images/RL4.png"], width=600)
Out[19]:

Early Stopping is one of the most popular, and also effective, techniques to prevent overfitting. The basic idea is simple and is illustrated in Figure RL4. Use the validation data set to compute the loss function at the end of each training epoch, and once the loss stops decreasing, stop the training and use the test data to compute the final classification accuracy. In practice it is more robust to wait until the validation loss has stopped decreasing for four or five successive epochs before stopping. The justification for this rule is quite simple: the point at which the validation loss starts to increase is when the model starts to overfit the training data, since from this point onwards its generalization ability starts to decrease. Early Stopping can be used by itself or in combination with other Regularization techniques.

Note that the Optimal Stopping Point can be considered to be a hyper-parameter, hence effectively we are testing out multiple values of this hyper-parameter during the course of a single training run. This makes Early Stopping more efficient than other hyper-parameter optimization techniques, which typically require a complete run of the model to test out a single hyper-parameter value. Another advantage of Early Stopping is that it is a fairly unobtrusive form of Regularization, since it does not require any changes to the model or objective function that could change the learning dynamics of the system.

The following example shows how to implement Early Stopping in Keras.

In [20]:
import keras
keras.__version__
from keras import models
from keras import layers

from keras.datasets import fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

from tensorflow.keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

We use two pre-defined Keras callbacks to implement Early Stopping:

  • The callback EarlyStopping monitors the validation accuracy, and interrupts the execution of the model if this quantity stops increasing for more than patience epochs.

  • The callback ModelCheckpoint saves the model and its parameters after every epoch in the specified .h5 file. The save_best_only flag ensures that it doesn't overwrite the model file unless val_loss has become smaller, which allows us to save the best model seen during training.

In [21]:
from keras import models
from keras import layers
from keras import callbacks

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor = 'val_accuracy',
        patience = 4,
    ),
    keras.callbacks.ModelCheckpoint(
        filepath = 'Models/fashion.h5',
        monitor = 'val_loss',
        save_best_only = True,
    )
]

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
In [17]:
history = network.fit(train_images, train_labels, epochs=100, batch_size=128, 
                      callbacks = callbacks_list, validation_split=0.2)
Epoch 1/100
375/375 [==============================] - 3s 8ms/step - loss: 0.1903 - accuracy: 0.9296 - val_loss: 0.3448 - val_accuracy: 0.8942
Epoch 2/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1841 - accuracy: 0.9308 - val_loss: 0.3519 - val_accuracy: 0.8942
Epoch 3/100
375/375 [==============================] - 3s 7ms/step - loss: 0.1793 - accuracy: 0.9332 - val_loss: 0.3526 - val_accuracy: 0.8921
Epoch 4/100
375/375 [==============================] - 3s 8ms/step - loss: 0.1739 - accuracy: 0.9359 - val_loss: 0.3971 - val_accuracy: 0.8838
Epoch 5/100
375/375 [==============================] - 3s 8ms/step - loss: 0.1661 - accuracy: 0.9379 - val_loss: 0.4165 - val_accuracy: 0.8827

L2 Regularization

L2 Regularization is a commonly used technique in ML systems and is also sometimes referred to as “Weight Decay”. It works by adding a quadratic term to the Cross Entropy Loss Function $\mathcal L$, called the Regularization Term, which results in a new Loss Function $\mathcal L_R$ given by:

\begin{equation} \mathcal L_R = {\mathcal L} + \frac{\lambda}{2} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} (w_{ij}^{(r)})^2 \quad \quad (**L2reg**) \end{equation}

The Regularization Term consists of the sum of the squares of all the link weights in the DLN, multiplied by a parameter $\lambda$ called the Regularization Parameter. This is another Hyperparameter whose appropriate value needs to be chosen as part of the training process by using the validation data set. By choosing a value for this parameter, we decide on the relative importance of the Regularization Term vs the Loss Function term. Note that the Regularization Term does not include the biases, since in practice it has been found that their inclusion does not make much of a difference to the final result.

In order to gain an intuitive understanding of how L2 Regularization works against overfitting, note that the net effect of adding the Regularization Term is to bias the algorithm towards choosing smaller values for the link weight parameters. The value of the parameter $\lambda$ governs the relative importance of the Cross Entropy term vs the regularization term and as $\lambda$ increases, the system tends to favor smaller and smaller weight values.

L2 Regularization also leads to more "diffuse" weight parameters; in other words, it encourages the network to use all its inputs a little rather than some of its inputs a lot. How does this help? A complete answer to this question will have to await some sort of theory of regularization, which does not exist at present. But in general, going back to the example of overfitting in the context of Linear Regression in Figure UnderfittingOverfitting, it is observed that when overfitting occurs as in the right hand part of that figure, the parameters of the model (which in this case are the coefficients of the powers of $x$) begin to assume very large values in an attempt to fit all of the training data. Hence one of the signs of overfitting is that the model parameters, whether they are DLN weights or polynomial coefficients, blow up in value during training, which results in the model giving too much importance to the idiosyncrasies of the training dataset. This line of argument leads to the conclusion that smaller values of the model parameters enable the model to generalize better, and hence do a better job of classifying patterns it has not seen before. This increase in the values of the model parameters always seems to occur in the later stages of the training process. Hence one of the effects of the Early Stopping rule is to restrain the growth in the model parameter values. Therefore, in some sense, Early Stopping is also a form of Regularization, and indeed it can be shown that L2 Regularization and Early Stopping are mathematically equivalent.

In order to get further insight into L2 Regularization, we investigate its effect on the Gradient Descent based update equations (Wijr)-(bir) for the weight and bias parameters. Taking the derivative on both sides of equation (L2reg), we obtain

\begin{equation} \frac{\partial \mathcal L_R}{\partial w_{ij}^{(r)}} = \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} + {\lambda}\; w_{ij}^{(r)} \quad \quad (**LWM**) \end{equation}

Substituting equation (LWM) back into equation (GradDesc), the weight update rule becomes:

\begin{equation} w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \eta \; \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} - {\eta \lambda} \; w_{ij}^{(r)} \\ = \left(1 - {\eta \lambda} \right)w_{ij}^{(r)} - \eta \; \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} \quad \quad (**L2W**) \end{equation}

Comparing the preceding equation (L2W) with equation (GradDesc), it follows that the net effect of the Regularization Term on the Gradient Descent rule is to rescale the weight $w_{ij}^{(r)}$ by a factor of $(1-{\eta \lambda})$ before applying the gradient to it. This is called “weight decay” since it causes the weight to become smaller with each iteration.

If Stochastic Gradient Descent with batch size $B$ is used then the weight update rule becomes

\begin{equation} w_{ij}^{(r)}\leftarrow \left(1 - {\eta \lambda} \right) w_{ij}^{(r)} - \frac{\eta}{B} \; \sum_{m=1}^B \frac{\partial \mathcal L(m)}{\partial w_{ij}^{(r)}} \quad \quad (**L2wur**) \end{equation}

In both preceding equations (L2W) and (L2wur) the gradients $\frac{\partial \mathcal L}{\partial w_{ij}^{(r)}}$ are computed using the usual Backprop algorithm.
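
The following NumPy fragment is a minimal sketch of the weight decay interpretation in equation (L2W): the weight matrix is first shrunk by the factor $(1 - \eta \lambda)$ and then the usual gradient step is applied (the values of $\eta$, $\lambda$ and the random gradient are placeholders).

import numpy as np

eta, lam = 0.1, 0.01                  # learning rate and regularization parameter (placeholders)
w = np.random.randn(5, 3)             # a layer's weight matrix
grad_L = np.random.randn(5, 3)        # gradient of the un-regularized loss, from Backprop

w = (1 - eta * lam) * w - eta * grad_L   # equation (L2W): weight decay, then gradient step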

In the following example, we apply L2 Regularization to the Fashion MNIST dataset. From the results we can see that this was not very effective in improving the validation accuracy of the system. Indeed it seems that the system went from a state of Overfitting with no Regularization to a state of Underfitting with L2 Regularization, since the Training Accuracy also decreased by quite a bit. Since Regularization has the effect of reducing the Model Capacity, it is quite plausible that the resulting decrease in Capacity pushed the system into the Underfitting state.

In [22]:
import keras
keras.__version__
from keras import models
from keras import layers

from keras.datasets import fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

from tensorflow.keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

from keras import models
from keras import layers
from keras import regularizers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', kernel_regularizer = regularizers.l2(0.001),
                               input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

history = network.fit(train_images, train_labels, epochs=100, batch_size=128, 
                    validation_split=0.2)
Epoch 1/100
375/375 [==============================] - 4s 9ms/step - loss: 0.9372 - accuracy: 0.7810 - val_loss: 0.7519 - val_accuracy: 0.7936
Epoch 2/100
375/375 [==============================] - 3s 9ms/step - loss: 0.6167 - accuracy: 0.8319 - val_loss: 0.6080 - val_accuracy: 0.8343
Epoch 3/100
375/375 [==============================] - 3s 9ms/step - loss: 0.5379 - accuracy: 0.8432 - val_loss: 0.5223 - val_accuracy: 0.8495
Epoch 4/100
375/375 [==============================] - 3s 9ms/step - loss: 0.5000 - accuracy: 0.8497 - val_loss: 0.4730 - val_accuracy: 0.8645
Epoch 5/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4842 - accuracy: 0.8533 - val_loss: 0.4772 - val_accuracy: 0.8558
Epoch 6/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4679 - accuracy: 0.8565 - val_loss: 0.4944 - val_accuracy: 0.8537
Epoch 7/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4590 - accuracy: 0.8607 - val_loss: 0.4607 - val_accuracy: 0.8555
Epoch 8/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4502 - accuracy: 0.8626 - val_loss: 0.4692 - val_accuracy: 0.8512
Epoch 9/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4430 - accuracy: 0.8650 - val_loss: 0.4649 - val_accuracy: 0.8546
Epoch 10/100
375/375 [==============================] - 4s 9ms/step - loss: 0.4389 - accuracy: 0.8663 - val_loss: 0.5050 - val_accuracy: 0.8388
Epoch 11/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4304 - accuracy: 0.8682 - val_loss: 0.4567 - val_accuracy: 0.8624
Epoch 12/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4265 - accuracy: 0.8699 - val_loss: 0.4253 - val_accuracy: 0.8730
Epoch 13/100
375/375 [==============================] - 4s 9ms/step - loss: 0.4236 - accuracy: 0.8693 - val_loss: 0.4304 - val_accuracy: 0.8692
Epoch 14/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4203 - accuracy: 0.8709 - val_loss: 0.4329 - val_accuracy: 0.8702
Epoch 15/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4190 - accuracy: 0.8721 - val_loss: 0.4507 - val_accuracy: 0.8656
Epoch 16/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4149 - accuracy: 0.8742 - val_loss: 0.4498 - val_accuracy: 0.8652
Epoch 17/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4095 - accuracy: 0.8746 - val_loss: 0.4082 - val_accuracy: 0.8812
Epoch 18/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4080 - accuracy: 0.8755 - val_loss: 0.4296 - val_accuracy: 0.8718
Epoch 19/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4072 - accuracy: 0.8754 - val_loss: 0.4104 - val_accuracy: 0.8800
Epoch 20/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3998 - accuracy: 0.8779 - val_loss: 0.4238 - val_accuracy: 0.8689
Epoch 21/100
375/375 [==============================] - 3s 9ms/step - loss: 0.4013 - accuracy: 0.8785 - val_loss: 0.5479 - val_accuracy: 0.8257
Epoch 22/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3990 - accuracy: 0.8794 - val_loss: 0.4211 - val_accuracy: 0.8734
Epoch 23/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3960 - accuracy: 0.8799 - val_loss: 0.4208 - val_accuracy: 0.8749
Epoch 24/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3911 - accuracy: 0.8812 - val_loss: 0.4590 - val_accuracy: 0.8567
Epoch 25/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3915 - accuracy: 0.8812 - val_loss: 0.4102 - val_accuracy: 0.8773
Epoch 26/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3865 - accuracy: 0.8835 - val_loss: 0.4190 - val_accuracy: 0.8743
Epoch 27/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3833 - accuracy: 0.8848 - val_loss: 0.4119 - val_accuracy: 0.8784
Epoch 28/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3829 - accuracy: 0.8843 - val_loss: 0.4882 - val_accuracy: 0.8435
Epoch 29/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3806 - accuracy: 0.8843 - val_loss: 0.4367 - val_accuracy: 0.8701
Epoch 30/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3763 - accuracy: 0.8860 - val_loss: 0.4011 - val_accuracy: 0.8805
Epoch 31/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3737 - accuracy: 0.8874 - val_loss: 0.4130 - val_accuracy: 0.8767
Epoch 32/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3729 - accuracy: 0.8867 - val_loss: 0.4130 - val_accuracy: 0.8734
Epoch 33/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3729 - accuracy: 0.8853 - val_loss: 0.4237 - val_accuracy: 0.8742
Epoch 34/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3701 - accuracy: 0.8862 - val_loss: 0.4396 - val_accuracy: 0.8686
Epoch 35/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3686 - accuracy: 0.8883 - val_loss: 0.3959 - val_accuracy: 0.8806
Epoch 36/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3658 - accuracy: 0.8891 - val_loss: 0.4732 - val_accuracy: 0.8550
Epoch 37/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3652 - accuracy: 0.8907 - val_loss: 0.3917 - val_accuracy: 0.8808
Epoch 38/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3600 - accuracy: 0.8920 - val_loss: 0.3969 - val_accuracy: 0.8822
Epoch 39/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3650 - accuracy: 0.8898 - val_loss: 0.4093 - val_accuracy: 0.8778
Epoch 40/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3624 - accuracy: 0.8903 - val_loss: 0.4599 - val_accuracy: 0.8662
Epoch 41/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3590 - accuracy: 0.8923 - val_loss: 0.4229 - val_accuracy: 0.8712
Epoch 42/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3591 - accuracy: 0.8920 - val_loss: 0.4786 - val_accuracy: 0.8602
Epoch 43/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3588 - accuracy: 0.8924 - val_loss: 0.4013 - val_accuracy: 0.8806
Epoch 44/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3577 - accuracy: 0.8911 - val_loss: 0.4109 - val_accuracy: 0.8752
Epoch 45/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3554 - accuracy: 0.8930 - val_loss: 0.3904 - val_accuracy: 0.8830
Epoch 46/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3565 - accuracy: 0.8942 - val_loss: 0.4002 - val_accuracy: 0.8799
Epoch 47/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3543 - accuracy: 0.8925 - val_loss: 0.3936 - val_accuracy: 0.8798
Epoch 48/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3551 - accuracy: 0.8934 - val_loss: 0.4109 - val_accuracy: 0.8783
Epoch 49/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3545 - accuracy: 0.8944 - val_loss: 0.4550 - val_accuracy: 0.8658
Epoch 50/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3487 - accuracy: 0.8966 - val_loss: 0.4015 - val_accuracy: 0.8793
Epoch 51/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3504 - accuracy: 0.8945 - val_loss: 0.4254 - val_accuracy: 0.8739
Epoch 52/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3536 - accuracy: 0.8935 - val_loss: 0.4528 - val_accuracy: 0.8583
Epoch 53/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3463 - accuracy: 0.8967 - val_loss: 0.4257 - val_accuracy: 0.8721
Epoch 54/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3482 - accuracy: 0.8954 - val_loss: 0.4159 - val_accuracy: 0.8774
Epoch 55/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3472 - accuracy: 0.8980 - val_loss: 0.4313 - val_accuracy: 0.8697
Epoch 56/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3474 - accuracy: 0.8957 - val_loss: 0.5067 - val_accuracy: 0.8475
Epoch 57/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3481 - accuracy: 0.8954 - val_loss: 0.3993 - val_accuracy: 0.8847
Epoch 58/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3493 - accuracy: 0.8956 - val_loss: 0.4208 - val_accuracy: 0.8728
Epoch 59/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3451 - accuracy: 0.8973 - val_loss: 0.4028 - val_accuracy: 0.8834
Epoch 60/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3425 - accuracy: 0.8971 - val_loss: 0.3937 - val_accuracy: 0.8830
Epoch 61/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3435 - accuracy: 0.8974 - val_loss: 0.4318 - val_accuracy: 0.8697
Epoch 62/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3417 - accuracy: 0.8990 - val_loss: 0.4004 - val_accuracy: 0.8798
Epoch 63/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3411 - accuracy: 0.8988 - val_loss: 0.4146 - val_accuracy: 0.8782
Epoch 64/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3381 - accuracy: 0.8989 - val_loss: 0.4203 - val_accuracy: 0.8731
Epoch 65/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3417 - accuracy: 0.8977 - val_loss: 0.3919 - val_accuracy: 0.8832
Epoch 66/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3403 - accuracy: 0.8978 - val_loss: 0.4582 - val_accuracy: 0.8657
Epoch 67/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3394 - accuracy: 0.8994 - val_loss: 0.3928 - val_accuracy: 0.8818
Epoch 68/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3369 - accuracy: 0.8997 - val_loss: 0.4492 - val_accuracy: 0.8633
Epoch 69/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3395 - accuracy: 0.8996 - val_loss: 0.4053 - val_accuracy: 0.8823
Epoch 70/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3377 - accuracy: 0.8990 - val_loss: 0.4256 - val_accuracy: 0.8734
Epoch 71/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3365 - accuracy: 0.9001 - val_loss: 0.4205 - val_accuracy: 0.8781
Epoch 72/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3397 - accuracy: 0.8994 - val_loss: 0.4118 - val_accuracy: 0.8753
Epoch 73/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3339 - accuracy: 0.9008 - val_loss: 0.4409 - val_accuracy: 0.8712
Epoch 74/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3367 - accuracy: 0.9009 - val_loss: 0.4364 - val_accuracy: 0.8708
Epoch 75/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3346 - accuracy: 0.9012 - val_loss: 0.4493 - val_accuracy: 0.8637
Epoch 76/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3334 - accuracy: 0.9016 - val_loss: 0.4579 - val_accuracy: 0.8643
Epoch 77/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3341 - accuracy: 0.9023 - val_loss: 0.3982 - val_accuracy: 0.8824
Epoch 78/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3306 - accuracy: 0.9012 - val_loss: 0.4202 - val_accuracy: 0.8752
Epoch 79/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3359 - accuracy: 0.8990 - val_loss: 0.3888 - val_accuracy: 0.8880
Epoch 80/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3340 - accuracy: 0.9004 - val_loss: 0.3998 - val_accuracy: 0.8835
Epoch 81/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3329 - accuracy: 0.9019 - val_loss: 0.4017 - val_accuracy: 0.8817
Epoch 82/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3315 - accuracy: 0.9025 - val_loss: 0.3926 - val_accuracy: 0.8809
Epoch 83/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3309 - accuracy: 0.9019 - val_loss: 0.4474 - val_accuracy: 0.8643
Epoch 84/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3336 - accuracy: 0.9002 - val_loss: 0.4136 - val_accuracy: 0.8758
Epoch 85/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3284 - accuracy: 0.9032 - val_loss: 0.4205 - val_accuracy: 0.8787
Epoch 86/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3325 - accuracy: 0.9011 - val_loss: 0.4305 - val_accuracy: 0.8735
Epoch 87/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3285 - accuracy: 0.9030 - val_loss: 0.3921 - val_accuracy: 0.8831
Epoch 88/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3301 - accuracy: 0.9021 - val_loss: 0.4464 - val_accuracy: 0.8701
Epoch 89/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3313 - accuracy: 0.9020 - val_loss: 0.4274 - val_accuracy: 0.8769
Epoch 90/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3296 - accuracy: 0.9030 - val_loss: 0.4441 - val_accuracy: 0.8754
Epoch 91/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3308 - accuracy: 0.9024 - val_loss: 0.4112 - val_accuracy: 0.8818
Epoch 92/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3302 - accuracy: 0.9032 - val_loss: 0.4363 - val_accuracy: 0.8719
Epoch 93/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3300 - accuracy: 0.9024 - val_loss: 0.4600 - val_accuracy: 0.8624
Epoch 94/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3284 - accuracy: 0.9023 - val_loss: 0.5266 - val_accuracy: 0.8391
Epoch 95/100
375/375 [==============================] - 3s 8ms/step - loss: 0.3298 - accuracy: 0.9034 - val_loss: 0.4257 - val_accuracy: 0.8739
Epoch 96/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3297 - accuracy: 0.9027 - val_loss: 0.4017 - val_accuracy: 0.8800
Epoch 97/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3249 - accuracy: 0.9031 - val_loss: 0.4460 - val_accuracy: 0.8692
Epoch 98/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3308 - accuracy: 0.9010 - val_loss: 0.4293 - val_accuracy: 0.8752
Epoch 99/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3281 - accuracy: 0.9026 - val_loss: 0.4153 - val_accuracy: 0.8761
Epoch 100/100
375/375 [==============================] - 3s 9ms/step - loss: 0.3269 - accuracy: 0.9037 - val_loss: 0.4569 - val_accuracy: 0.8661
In [23]:
history_dict = history.history
history_dict.keys()
Out[23]:
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
In [24]:
import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()
In [25]:
plt.clf()   # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()
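
Since $\lambda = 0.001$ appears to have over-regularized the model, a natural next step is to treat $\lambda$ as a hyper-parameter and select it on the validation set. The following is a hedged sketch of such a sweep (the candidate values and the short 10-epoch budget are illustrative choices); it re-uses the train_images and train_labels arrays defined above.

from keras import models, layers, regularizers

def build_l2_model(lam):
    net = models.Sequential()
    net.add(layers.Dense(512, activation='relu',
                         kernel_regularizer=regularizers.l2(lam),
                         input_shape=(28 * 28,)))
    net.add(layers.Dense(10, activation='softmax'))
    net.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                metrics=['accuracy'])
    return net

for lam in [1e-5, 1e-4, 1e-3]:                    # candidate regularization parameters
    net = build_l2_model(lam)
    hist = net.fit(train_images, train_labels, epochs=10, batch_size=128,
                   validation_split=0.2, verbose=0)
    print('lambda = %g  best val_accuracy = %.4f' % (lam, max(hist.history['val_accuracy'])))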

L1 Regularization

L1 Regularization uses a Regularization Function which is the sum of the absolute values of all the weights in the DLN, resulting in the following loss function ($\mathcal L$ is the usual Cross Entropy loss):

\begin{equation} \mathcal L_R = \mathcal L + {\lambda} \sum_{r=1}^{R+1} \sum_{j=1}^{P^{r-1}} \sum_{i=1}^{P^r} |w_{ij}^{(r)}| \quad \quad (**L1reg**) \end{equation}

At a high level L1 Regularization is similar to L2 Regularization since it leads to smaller weights. It results in the following weight update equation when using Stochastic Gradient Descent (where $sgn$ is the sign function, such that $sgn(w) = +1$ if $w > 0$, $sgn(w) = -1$ if $w< 0$, and $sgn(0) = 0$):

\begin{equation} w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - {\eta \lambda}\; sgn(w_{ij}^{(r)}) - {\eta}\; \frac{\partial \mathcal L}{\partial w_{ij}^{(r)}} \quad \quad (**Wsgn**) \end{equation}

Comparing equations (Wsgn) and (L2W) we can see that both L1 and L2 Regularization lead to a reduction in the weights with each iteration. However the way the weights drop is different: in L2 Regularization the weight reduction is multiplicative and proportional to the value of the weight, so it is faster for large weights and decelerates as the weights get smaller. In L1 Regularization, on the other hand, the weights are reduced by a fixed amount in every iteration, irrespective of the value of the weight. Hence for larger weights L2 Regularization is faster than L1, while for smaller weights the reverse is true. As a result L1 Regularization leads to DLNs in which the weights of most of the connections tend towards zero, with a few connections with larger weights left over. The type of DLN that results after the application of L1 Regularization is said to be “sparse”.
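
In Keras, L1 Regularization can be applied in the same way as the L2 example above, by swapping in regularizers.l1; a hedged sketch follows (the value 0.001 is an illustrative choice that would need to be tuned on the validation set).

from keras import models, layers, regularizers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu',
                         kernel_regularizer=regularizers.l1(0.001),
                         input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
# After training, many of the first layer's weights should be driven close to zero,
# which is the "sparse" network described above.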

Dropout Regularization

Dropout is one of the most effective Regularization techniques to have emerged in the last few years, see Srivastava (2013); Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov (2014). We first describe the algorithm and then discuss reasons for its effectiveness.

In [32]:
#dropoutRegularization
nb_setup.images_hconcat(["DL_images/dropoutRegularization.png"], width=600)
Out[32]:

The basic idea behind Dropout is to run each iteration of the Backprop algorithm on randomly modified versions of the original DLN. The random modifications are carried out to the topology of the DLN using the following rules:

  • Assign probability values $p^{(r)}, 0 \leq r \leq R$, where $p^{(r)}$ is defined as the probability that a node in layer $r$ is present in the model, and use these to generate $\{0,1\}$-valued Bernoulli random variables $e_j^{(r)}$: $$ e_j^{(r)} \sim Bernoulli(p^{(r)}), \quad 0 \leq r \leq R,\ \ 1 \leq j \leq P^r $$

  • Modify the input vector as follows: \begin{equation} \hat x_j = e_j^{(0)} x_j, \quad 1 \leq j \leq N \quad \quad (**Xj0**) \end{equation}

  • Modify the activations $z_j^{(r)}$ of the hidden layer r as follows: \begin{equation} \hat z_j^{(r)} = e_j^{(r)} z_j^{(r)}, \quad 1 \leq r \leq R,\ \ 1 \leq j \leq P^r \quad \quad (**Zj0**) \end{equation}

The net effect of equation (Xj0) is that for each iteration of the Backprop algorithm, instead of using the entire input vector $x_j, 1 \leq j \leq N$, a randomly selected subset $\hat x_j$ is used instead. Similarly the net effect of equation (Zj0) is that, in each iteration of Backprop, a randomly selected subset of the nodes in each of the hidden layers is erased. Note that since the random subset chosen for erasure in each layer changes from iteration to iteration, we are effectively running Backprop on a different and thinned version of the original DLN network each time. This is illustrated in Figure dropoutRegularization, with part (a) showing the original DLN, and part (b) showing the DLN after a random subset of the nodes has been erased by applying Dropout. Also note that the retention probability $p^{(r)}$ is the same for all nodes within a layer, but can vary from layer to layer.
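
A minimal NumPy sketch of the thinning rules in equations (Xj0) and (Zj0) is shown below: a fresh Bernoulli mask is sampled on every iteration and multiplied into the activations (the layer width and retention probability are illustrative).

import numpy as np

p = 0.8                                      # probability that a node is retained
z = np.random.randn(128)                     # activations of some hidden layer
e = np.random.binomial(1, p, size=z.shape)   # {0,1}-valued Bernoulli erasure variables
z_thinned = e * z                            # roughly 20% of the nodes are erased

# In Keras the same thinning is obtained with a Dropout layer, e.g. layers.Dropout(0.2),
# whose argument is the drop probability 1 - p.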

Weight Adjustments in Dropout

In [33]:
#weightAdjustment
nb_setup.images_hconcat(["DL_images/weightAdjustment.png"], width=600)
Out[33]: