%pylab inline
from ipypublish import nb_setup
import keras
keras.__version__
from keras import models
from keras import layers
The Backprop algorithm was known by the mid-1980s, but it took almost two more decades before the field of Deep Learning entered the mainstream. There were several reasons for this delay, including the fact that the processing power was not yet there, but the main reason was that Backprop simply did not work for the large models that arise in practical problems. In these cases it was observed that the gradients died away before the training was complete, thus limiting the accuracy of the model. In this chapter and in the next, our objective is to go through the individual elements of the Gradient Descent algorithm and make improvements so that it is able to work for large models. We start by examining the parameter update equation and present several modifications that improve upon it in Sections LearningRateSelection and ParameterUpdate. We then investigate the role of the Activation Function in Section ActivationLossFunctions, and also come up with a number of alternative functions that improve performance. The correct initialization of weights at the start of the algorithm is also a huge issue, and is discussed in Section InitializingWeights. In Section DataPreprocessing we show that if the training data is pre-conditioned before being fed into the model, then it leads to several benefits in the training process. We end the chapter with a discussion of Batch Normalization, a recently discovered technique to improve the training process which has already had a big impact on the field.
#TNN1
nb_setup.images_hconcat(["DL_images/TNN1.png"], width=600)
Recall that the Gradient Descent based parameter update equation in one dimension is given by:
$$
w\leftarrow w - \eta\frac{\partial {\mathcal L}}{\partial w}
$$

As shown in Figure TNN1, $\frac{\partial {\mathcal L}}{\partial w} > 0$ for points on the curve to the right of the minimum. This causes the value of $w$ to decrease until it converges to the minimum point, at which the gradient is zero. Conversely, $\frac{\partial {\mathcal L}}{\partial w} < 0$ for points to the left of the minimum, which causes $w$ to increase with each iteration. There are a number of cases in which this simple iteration does not work very well, and we will describe these next.
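Before looking at those problem cases, the following short sketch (a toy example constructed for illustration, not taken from the text) makes the basic update rule concrete by running it on the one-dimensional loss $\mathcal L(w) = (w-3)^2$, whose minimum is at $w=3$:

import numpy as np

# Toy example: minimize L(w) = (w - 3)^2, whose gradient is dL/dw = 2(w - 3)
w = 0.0            # initial parameter value
eta = 0.1          # Learning Rate
for i in range(50):
    grad = 2 * (w - 3)      # dL/dw at the current w
    w = w - eta * grad      # the Gradient Descent update
print(w)                    # w has moved very close to the minimum at 3

After 50 iterations $w$ is within a rounding error of the minimum; making $\eta$ larger or smaller changes how quickly (or whether) this happens, which is the issue we turn to next.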
#TNN2
nb_setup.images_hconcat(["DL_images/TNN2.png"], width=600)
#TNN4
nb_setup.images_hconcat(["DL_images/TNN4.png"], width=600)
#TNN3
nb_setup.images_hconcat(["DL_images/TNN3.png"], width=600)
#TNN8
nb_setup.images_hconcat(["DL_images/TNN8.png"], width=600)
All the issues with Gradient Descent that were raised in this section are addressed by techniques described in the next two sections.
#TNN6
nb_setup.images_hconcat(["DL_images/TNN6.png"], width=600)
As we mentioned in the previous section, the Learning Rate parameter has a big influence on the effectiveness of the Gradient Descent algorithm. If it is set to a large value then the algorithm moves quickly at the start of the iteration, but the large step size can cause the parameter to overshoot as the system approaches the minimum, which can lead to oscillations. If it is set too small then the algorithm converges with high likelihood, but it can take a very long time to do so (see Fig. TNN2). Hence ideally $\eta$ should be set adaptively, such that it is large in the initial stages of the optimization and becomes smaller as it gets closer to the minimum.
Figure TNN6 illustrates the effect of the Learning Rate on the Loss Function during training, and can be used as a quick check on the suitability of the rate being used. A very high Learning Rate can cause the Loss Function to start to increase after a few iterations, while a moderately high rate causes the Loss to plateau at a high value after an initial rapid decrease. A very low Learning Rate, on the other hand, can be identified by a slow decrease in the Loss Function over the training epochs. A good Learning Rate, in contrast, combines a quick decrease during the initial epochs with a lower steady-state value.
#TNN7
nb_setup.images_hconcat(["DL_images/TNN7.png"], width=400)
A well known technique for achieving the best Learning Rate behavior is called Learning Rate Annealing. This is the strategy of reducing the Learning Rate as the system approaches the minimum (see Figure TNN7), such that the rate is high at the start of the training and gradually falls as the training progresses. This reduction can be done in several ways; popular approaches include:
Track the validation accuracy and decrease the Learning Rate when it appears to plateau.
Automatically anneal the Learning Rate based on the number of epochs that the Gradient Descent algorithm has been through (a sketch of this approach using Keras's LearningRateScheduler callback is shown below).
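As a quick sketch of the second approach (the halving schedule below is an arbitrary choice, used only for illustration), the Learning Rate can be computed from the epoch number with Keras's built-in LearningRateScheduler callback:

from tensorflow.keras.callbacks import LearningRateScheduler

# Halve the Learning Rate every 20 epochs (arbitrary schedule, for illustration only)
def step_decay(epoch, lr):
    if epoch > 0 and epoch % 20 == 0:
        return lr * 0.5
    return lr

lr_schedule = LearningRateScheduler(step_decay, verbose=1)
# Pass [lr_schedule] in the callbacks argument of model.fit(),
# just as is done for ReduceLROnPlateau in the example further below.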
In the next section we will learn about techniques that, instead of using the same Learning Rate for every parameter, tailor the rate to each individual parameter. Thus parameters possessing a steep gradient get lower rates compared to parameters with a smaller gradient.
Keras provides a feature called callbacks which can be used to implement Learning Rate Annealing. A callback is an object that is passed to the fit routine and is then invoked by the model at various points while the training is in progress.
Callbacks can be used to do Learning Rate Annealing, as illustrated in the example below. We use one of the built-in callbacks called ReduceLROnPlateau, which takes three main parameters: (1) the performance measure to be monitored, (2) the factor by which the Learning Rate is reduced every time the callback is triggered, and (3) the number of epochs for which the performance measure is seen to be stationary before the callback is triggered.
import keras
keras.__version__
from keras import models
from keras import layers
from keras import callbacks
from tensorflow.keras import optimizers
from keras.datasets import cifar10
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
train_images = train_images.reshape((50000, 32 * 32 * 3))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 32 * 32 * 3))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
network = models.Sequential()
network.add(layers.Dense(20, activation='relu', input_shape=(32 * 32 * 3,)))
network.add(layers.Dense(15, activation='relu'))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
callbacks_list = [
keras.callbacks.ReduceLROnPlateau(
monitor = 'val_loss',
factor = 0.1,
patience = 5,
)
]
sgd = optimizers.SGD(learning_rate=0.01, decay=1e-6, momentum=0.9, nesterov=True)
network.compile(optimizer = sgd,
loss='categorical_crossentropy',
metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=500, batch_size=128,
callbacks = callbacks_list, validation_split=0.2)
history_dict = history.history
history_dict.keys()
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
In the next few sections we present a number of modifications to the base parameter update equation $w\leftarrow w - \eta\frac{\partial {\mathcal L}}{\partial w}$, which help to improve the performance of the Gradient Descent algorithm. Some of these algorithms automatically adapt the effective Learning Rate as the training progresses (for example the ADAGRAD, RMSPROP and Adam algorithms), while others improve the speed of convergence (for example the Momentum, Nesterov Momentum and Adam algorithms).
Momentum is one of the most popular techniques used to improve the speed of convergence of the Gradient Descent algorithm. The basic idea behind Momentum is the following: some Loss Functions are characterized by the shape shown on the LHS of Figure TNN4, where the gradient along one of the dimensions is very large, while along the other dimension it is small. If we run the Gradient Descent iteration for this system, then the parameter on the steep side fluctuates from one side of the "canyon" to the other, while the parameter on the shallow side progresses very slowly down the canyon. This behavior slows down the speed of convergence quite a lot. An ingenious but simple technique that counteracts this behavior is to replace the Gradient Descent iteration by the following:
At the end of the $n^{th}$ iteration of the Backprop algorithm, define a sequence $v(n)$ by
$$
v(n) = \rho\; v(n-1) - \eta \; g(n)
$$

with

$$
v(0) = -\eta \; g(0)
$$

where $\rho$ is a new hyper-parameter called the "momentum" parameter, and $g(n)$ is the gradient evaluated at parameter value $w(n)$, defined by

$$
g(n) = \frac{\partial {\mathcal L(n)}}{\partial w}
$$

for Stochastic Gradient Descent and

$$
g(n) = {1\over B}\sum_{m=nB}^{(n+1)B-1}\frac{\partial {\mathcal L(m)}}{\partial w}
$$

for Batch Stochastic Gradient Descent (note that in this case $n$ is an index into the batch number). The change in parameter values on each iteration is now defined as

\begin{equation}
w(n+1) = w(n) + v(n) \quad \quad (**Wn1**)
\end{equation}

It can be shown from these equations that $v(n)$ can be written as

\begin{equation}
v(n) = - \eta\sum_{i=0}^n \rho^{n-i} g(i) \quad \quad (**Wn2**)
\end{equation}

so that

\begin{equation}
w(n+1) = w(n) - \eta\sum_{i=0}^n \rho^{n-i} g(i) \quad \quad (**Wn3**)
\end{equation}

When the momentum parameter $\rho = 0$, this equation reduces to the usual Stochastic Gradient Descent iteration. On the other hand, when $\rho > 0$, we get some interesting behaviors:
If the gradients $g(i)$ are such that they change sign frequently (as in the steep side of Figure TNN4), then the stepsize $\sum_{i=0}^n \rho^{n-i}g(i)$ will be small, and thus the change in these parameters over the iterations will be limited.
If the gradients $g(i)$ are such that they maintain their sign (as in the shallow portion of Figure TNN4), then the stepsize $\sum_{i=0}^n \rho^{n-i}g(i)$ will be large. This means that if the gradients maintain their sign then the corresponding parameters will take bigger and bigger steps as the algorithm progresses, even though the individual gradients may be small.
#TNN5
nb_setup.images_hconcat(["DL_images/TNN5.png"], width=2000)
The Momentum algorithm thus accelerates parameter convergence for parameters whose gradients consistently point in the same direction, and slows parameter change for parameters whose gradient changes sign frequently, resulting in faster convergence overall (this is shown on the RHS of Figure TNN4). The variable $v(n)$ is analogous to velocity in a dynamical system, while the parameter $1-\rho$ plays the role of the coefficient of friction. The value of $\rho$ determines the degree of momentum, with the momentum becoming stronger as $\rho$ approaches $1$. Note that

$$
\sum_{i=0}^{n} \rho^{n-i}g(i) \le {g_{max}\over 1-\rho}
$$

$\rho$ is usually set in the neighborhood of $0.9$, and from the above equation it follows that $\sum_{i=0}^n \rho^{n-i}g(i)\approx 10g$, assuming all the $g(i)$ are approximately equal to $g$. Hence the effective gradient in Equation (Wn3) is ten times the value of the actual gradient. This results in an "overshoot", where the value of the parameter shoots past the minimum point to the other side of the bowl and then reverses itself. This is a desirable behavior, since it prevents the algorithm from getting stuck at a saddle point or in a local minimum, since the momentum carries it out of these areas (see Figure TNN5).
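The velocity and parameter updates in Equations (Wn1) are easy to implement directly. The following NumPy sketch (a toy two-parameter quadratic loss, with hyper-parameter values chosen only for illustration) mimics the situation in Figure TNN4, with one steep and one shallow dimension:

import numpy as np

# Toy loss L(w) = 0.5 * (100 * w1^2 + w2^2): steep along w1, shallow along w2
def grad(w):
    return np.array([100.0 * w[0], 1.0 * w[1]])

w = np.array([1.0, 1.0])     # initial parameter values
v = np.zeros(2)              # velocity v(n)
eta, rho = 0.01, 0.9
for n in range(100):
    g = grad(w)
    v = rho * v - eta * g    # v(n) = rho * v(n-1) - eta * g(n)
    w = w + v                # w(n+1) = w(n) + v(n)
print(w)                     # both coordinates have moved close to the minimum at (0, 0)

The oscillations along the steep $w_1$ direction largely cancel out in the velocity, while the small but consistent gradients along $w_2$ accumulate, so both coordinates make steady progress.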
Nesterov Momentum is a variation on the plain Momentum method described above. Note that the Momentum parameter update equations can be written as:
$$
v(n) = \rho\; v(n-1) - \eta \; g(w(n))
$$

$$
w(n+1) = w(n) + v(n)
$$

In the first equation we have explicitly written out the fact that the gradient $g$ is computed at parameter value $w(n)$. These equations can be improved by evaluating the gradient at parameter value $w(n+1)$ instead. This may seem like circular reasoning, since in order to compute $w(n+1)$ we first need to compute $g(w(n))$. However, note that $w(n+1)\approx w(n) + \rho v(n-1)$. This leads to the velocity update equation for Nesterov Momentum

$$
v(n) = \rho\; v(n-1) - \eta \; g(w(n)+\rho v(n-1))
$$

where $g(w(n)+\rho v(n-1))$ denotes the gradient computed at parameter values $w(n) + \rho v(n-1)$. By using a slightly more accurate estimate of the gradient in each step, it has been observed in practice that the Gradient Descent process speeds up considerably when compared to the plain Momentum method.
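In code the only change from the plain Momentum sketch above is the point at which the gradient is evaluated (this sketch reuses the toy grad function defined in that example):

# Nesterov Momentum: evaluate the gradient at the look-ahead point w + rho * v
w = np.array([1.0, 1.0])
v = np.zeros(2)
eta, rho = 0.01, 0.9
for n in range(100):
    g = grad(w + rho * v)    # gradient at w(n) + rho * v(n-1)
    v = rho * v - eta * g
    w = w + v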
The Momentum and Nesterov Momentum algorithms help to improve the speed of convergence, however we still have the issue of optimally varying the Learning Rate parameter (see Section LearningRateSelection). It would be nice if this could be done automatically as part of the parameter update equation and this is precisely what the ADAGRAD algorithm does. This algorithm replaces the parameter update rule with the following equation:
\begin{equation}
w(n+1) = w(n) - \frac{\eta}{\sqrt{\sum_{i=1}^n g(i)^2+\epsilon}}\; g(n) \quad \quad (**Wng**)
\end{equation}

The constant $\epsilon$ has been added to better condition the denominator and is usually set to a small number such as $10^{-7}$.
Equation (Wng) leads to the following benefit: each parameter gets its own adaptive Learning Rate, such that parameters with large gradients get smaller learning rates and parameters with small gradients get larger ones ($\eta$ is usually defaulted to $0.01$). As a result the progress along each dimension evens out over time, which helps the training process. This is a type of Learning Rate annealing, but it is more powerful, since the rate is adapted separately for each parameter based on its own gradient history.
Also, the accumulation of gradients in the denominator leads to smaller Learning Rates over time, which has the same effect as annealing. This is a double-edged sword, since the continuous decrease in Learning Rates can bring training to a halt in large networks that require a greater number of iterations. This problem is addressed by the RMSPROP algorithm, which is described next.
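Before moving on, here is a minimal sketch of the ADAGRAD update in Equation (Wng), again on the toy quadratic loss from the Momentum example; note how the growing denominator steadily shrinks the effective Learning Rate:

# ADAGRAD: per-parameter Learning Rates from accumulated squared gradients
w = np.array([1.0, 1.0])
sum_sq = np.zeros(2)         # running sum of g(i)^2, one entry per parameter
eta, eps = 0.01, 1e-7
for n in range(100):
    g = grad(w)
    sum_sq += g ** 2
    w = w - eta * g / np.sqrt(sum_sq + eps)    # Equation (Wng)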
The RMSPROP algorithm accumulates the squared gradients using an exponentially weighted moving average, which acts like a sliding window, using the following formula:

$$
E[g^2]_n = \rho E[g^2]_{n-1} + (1-\rho) g(n)^2
$$

where $\rho$ is a decay constant (usually set to $0.9$). This operation (called a Low Pass Filter) has a windowing effect, since it forgets gradients that are far back in time. The quantity $RMS[g]_n$ defined by

$$
RMS[g]_n = \sqrt{E[g^2]_n + \epsilon}
$$

is used in the denominator of equation (Wng), resulting in the following parameter update equation:

\begin{equation}
w(n+1) = w(n) - \frac{\eta}{RMS[g]_n}\; g(n) \quad \quad (**WRMS**)
\end{equation}

Note that

$$
E[g^2]_n = (1-\rho)\sum_{i=0}^n \rho^{n-i} g(i)^2 \le g_{max}^2
$$

which shows that the parameter $\rho$ prevents the sum from blowing up, and a large value of $\rho$ is equivalent to using a larger window of previous gradients in computing the sum. Hence RMSPROP retains the benefits of ADAGRAD while avoiding the decay of the Learning Rate to zero.
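The corresponding sketch for RMSPROP simply replaces ADAGRAD's running sum with the exponentially weighted average $E[g^2]_n$ (toy loss as before):

# RMSPROP: exponentially weighted average of squared gradients
w = np.array([1.0, 1.0])
Eg2 = np.zeros(2)
eta, rho, eps = 0.001, 0.9, 1e-7
for n in range(100):
    g = grad(w)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2       # E[g^2]_n
    w = w - eta * g / np.sqrt(Eg2 + eps)       # Equation (WRMS)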
The Adaptive Moment Estimation (Adam) algorithm combines the best of algorithms such as Momentum that speed up the training process, with algorithms such as RMSPROP that adaptively vary the effective Learning Rate. The update equations for Adam are as follows:

$$
\Lambda(n) = \beta\Lambda(n-1) +(1-\beta)g(n),\ \ \ {\hat\Lambda}(n) = \frac{\Lambda(n)}{1-\beta^n}
$$

$$
\Delta(n) = \alpha\Delta(n-1) + (1-\alpha) g(n)^2,\ \ \ {\hat\Delta}(n) = \frac{\Delta(n)}{1-\alpha^n}
$$

$$
w(n+1) = w(n) - \eta\frac{\hat\Lambda(n)}{\sqrt{\hat\Delta(n) + \epsilon}}
$$

The definition of the sequence $\Delta(n)$ is identical to that of $E[g^2]_n$ in RMSPROP, and it serves an identical purpose, i.e., it is used to customize the effective Learning Rate on a per-parameter basis, so that the rates for parameters with larger gradients are equalized with those for parameters with smaller gradients.
The sequence $\Lambda(n)$ is used to provide "Momentum" to the updates, and works in a fashion similar to the velocity sequence $v(n)$ in the Momentum algorithm. It is easy to show that
$$
\Lambda(n) = (1-\beta)\sum_{i=0}^n \beta^{n-i} g(i) \le g_{max}
$$

which shows that $\Lambda(n)$ is a weighted sum of the previous $n$ gradients (compare this with the expression for $v(n)$ in Equation (Wn2)). Since $\Lambda(n)$ and $\Delta(n)$ are initialized as vectors of 0s, they are biased towards 0 at the start of the iteration. These biases are counteracted by computing the estimates $\hat\Lambda(n)$ and $\hat\Delta(n)$. The parameters $\beta$ and $\alpha$ are usually defaulted to $0.9$ and $0.999$ respectively, while $\epsilon$ is set to a small value such as $10^{-8}$.
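Putting the two pieces together, a minimal sketch of the Adam update in the notation of this section ($\beta$ for the first moment, $\alpha$ for the second, toy loss as in the earlier examples) is:

# Adam: Momentum-style first moment plus RMSPROP-style second moment, with bias correction
w = np.array([1.0, 1.0])
Lam, Delta = np.zeros(2), np.zeros(2)
eta, beta, alpha, eps = 0.001, 0.9, 0.999, 1e-8
for n in range(1, 101):
    g = grad(w)
    Lam = beta * Lam + (1 - beta) * g              # Lambda(n)
    Delta = alpha * Delta + (1 - alpha) * g ** 2   # Delta(n)
    Lam_hat = Lam / (1 - beta ** n)                # bias-corrected estimates
    Delta_hat = Delta / (1 - alpha ** n)
    w = w - eta * Lam_hat / np.sqrt(Delta_hat + eps)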
Adam serves as the default choice for the parameter update rule, since it combines the best features of the other update algorithms.
Optimizers can either be instantiated before being invoked as in:
from keras.models import Sequential
from keras.layers import Dense, Activation
from tensorflow.keras import optimizers

model = Sequential()
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(Activation('softmax'))
sgd = optimizers.SGD(learning_rate=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)
Or they can be called by name, in which case default parameters are used, as in:
model.compile(loss='mean_squared_error', optimizer='sgd')
The following optimizers are available in Keras:
# Stochastic Gradient Descent
keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False)
# RMSProp
keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
#AdaGrad
keras.optimizers.Adagrad(learning_rate=0.01)
#Adadelta
keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95)
#Adam
keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
If any of the optimizers are invoked by name, then Keras supplies default values for all the relevant parameters, which usually work quite well in practice.
Keras has added a new feature called the KerasTuner, which as the name implies, can be used to automatically tune and find the best parameters. More information about the KerasTuner, along with a short tutorial, can be found at the following webpages: https://www.tensorflow.org/tutorials/keras/keras_tuner, https://keras.io/keras_tuner/.
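As a quick illustration (a minimal sketch modeled on those tutorials; the layer width, Learning Rate grid, and trial count below are arbitrary choices), the KerasTuner can search over hyper-parameters such as the Learning Rate for the CIFAR-10 model used earlier in this chapter:

import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    # The number of units and the Learning Rate are the hyper-parameters being tuned
    model = keras.Sequential()
    model.add(layers.Dense(hp.Int('units', min_value=16, max_value=64, step=16),
                           activation='relu', input_shape=(32 * 32 * 3,)))
    model.add(layers.Dense(10, activation='softmax'))
    lr = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=5)
tuner.search(train_images, train_labels, epochs=5, validation_split=0.2)
best_model = tuner.get_best_models(num_models=1)[0]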
#AF1
nb_setup.images_hconcat(["DL_images/AF1.png"], width=600)
The choice of Activation Function has a major influence on the training and operation of DLN systems. When DLNs were first proposed, the activation of choice was the sigmoid, probably as a result of what was then known about biological neurons. This turned out to be an unfortunate choice, as illustrated in Figure AF1. The figure shows a single neuron with a sigmoid activation. Using the Gradient Flow rules from Chapter TrainingNNsBackprop, we can see that the backpropagation of the gradient $\frac{\partial\mathcal L}{\partial z}$ results in
$$
\frac{\partial\mathcal L}{\partial a} = z(1-z)\frac{\partial\mathcal L}{\partial z}
$$

If the neuron is saturated, i.e., $a$ lies significantly away from the origin, then from the shape of the sigmoid it follows that either $z$ or $(1-z)$ is close to zero, which implies that $\frac{\partial\mathcal L}{\partial a} \approx 0$. As a result the gradients flowing back to the earlier layers of nodes will also be zero. The neuron in this state is said to be "dead". Once a neuron is dead, it stays dead, since in order to get it back into the active state, the input weights (shown on the LHS of Figure AF1) need to change. However, the weights cannot change, since the gradient with respect to the weights is given by $\frac{\partial\mathcal L}{\partial w} = z'\delta$ and $\delta = 0$.
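The severity of the problem is easy to check numerically (a small illustrative computation): the local gradient factor $z(1-z)$ collapses as soon as the pre-activation $a$ moves a few units away from the origin:

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

for a in [0.0, 2.0, 5.0, 10.0]:
    z = sigmoid(a)
    print(a, z * (1 - z))   # prints roughly 0.25, 0.105, 0.0066, 0.000045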
Thus the choice of the sigmoid function contributed to the Vanishing Gradient problem that plagued the first DLN systems. Interestingly enough, a suitable replacement for the sigmoid (the ReLU function) was not proposed until 2010. But once it was in place, it contributed to the rapid advances in the field since then. Our objective in this section is to survey some of the many Activation Functions that have been proposed in the last few years.
#AF2
nb_setup.images_hconcat(["DL_images/AF2.png"], width=600)