from ipypublish import nb_setup
Recurrent Neural Nets or RNNs are systems that are designed to detect patterns present in data sequences. This makes them better suited to solve "prediction" problems, as compared to other types of DLNs. An example of a prediction problem is predicting the next word in a sentence, which is a fundamental problem in Language Modeling. The solution to this problem requires that the system take into account the variable number of words that came before, i.e., it should be able to "remember" previous data in the sequence. Just like ConvNets, RNNs were discovered in the late 1980s, but lay dormant until recently due to the difficulty in training them. These problems have been overcome in recent years, with the use of a type of RNN called Long Short Term Memory or LSTMs, as well as the increase in processing power and size of training datasets. Today RNNs are at the forefront of exciting new discoveries in Deep Learning, and some of the most important recent work in DLNs falls in the RNN domain.
#rnn36
nb_setup.images_hconcat(["DL_images/rnn36.png"], width=600)
In order to motivate the need for RNNs, consider the following: The DLN architectures that we have seen so far implement a static mapping between input and output vectors, of the type $Y=f(X)$, as shown in Part (a) of Figure rnn36.
Instead, consider a system that is evolving with time, i.e., it is a Dynamical System. In this scenario, the system is subject to an input sequence $X_1,...,X_n$ in which successive values of $X_i$ are dependent (for example if each $X_i$ were an image, then the sequence represents a video clip). Furthermore the output vector $Y_i$ at time $i$ depends not just on the input vector $X_i$, but on past values of $X$ as well, i.e., $Y_i = f_i(X_1,...,X_i)$ as shown in Part (b) of Figure rnn36. If we were to try to solve this problem using a traditional Neural Network, then we would need to find a different function $f_i$ for each value of $i$, i.e., the function would depend on the size of the input sequence. Obviously this would be a formidable task, and it would be much nicer if we could find a solution which is not dependent on the size of the input sequence; this is precisely what RNNs do.
Systems whose behavior is time dependent are commonly encountered in practice and are usually studied by postulating a Hidden Variable or Hidden State $Z_i$ (also called a State Variable). The evolution of the system state obeys the following recursion:
$$ Z_{i+1} = w_{i+1}(Z_i,X_i) $$

while the output is a function of the current state and is given by

$$ Y_{i+1} = v_{i+1}(Z_{i+1}) $$

The Hidden Variable sequence $Z_i$ captures the lossy history of the $X_i$ sequence, and hence serves as a type of memory. If we assume that the functions $v$ and $w$ do not change with time, then these equations reduce to:

$$ \begin{equation} Z_{i+1} = w(Z_i,X_i) \quad \quad (**eqn1**) \end{equation} $$

$$ \begin{equation} Y_{i+1} = v(Z_{i+1}) \quad \quad (**eqn2**) \end{equation} $$

Traditional Systems Theory makes the additional simplifying assumption that these functions are linear, so that

$$ Z_{i+1} = WZ_i + UX_i $$

$$ Y_{i+1} = VZ_{i+1} $$

The use of DLNs enables us to approximate non-linear functions to a high degree of accuracy using the training dataset, so that this linearity assumption is no longer needed. We do however assume that the vectors $Z_i$ and $X_i$ can be combined linearly, so that the resulting equations are:

$$ \begin{equation} Z_{i+1} = f(W Z_i + UX_i) \quad \quad (**eqn3**) \end{equation} $$

$$ \begin{equation} Y_{i+1} = h(V Z_{i+1}) \quad \quad (**eqn4**) \end{equation} $$

Equations (eqn3) and (eqn4) are in a form in which they can be implemented using Neural Networks, with the functions $f$ and $h$ serving as the Activation Functions, as shown in Figure rnn1.
#rnn1
nb_setup.images_hconcat(["DL_images/rnn1.png"], width=600)
The figure shows three types of nodes:
$X(n)$: This represents the value of the input vector at time $n$.
$Z(n)$: This represents the value of the Hidden Layer vector at time $n$.
$Y(n)$: This represents the value of the output vector at time $n$.
The LHS of Figure rnn1 shows a RNN with connections between the Hidden Layers at times $n$ and $n+1$ shown explicitly. The RHS of Figure rnn1 is a simplified representation of the same RNN. Each of the boxes in this figure represents a row of Neural Network nodes belonging to a single layer, of the type shown on the LHS, which have been omitted for the sake of clarity. Note that:
The fact that all these weight matrices do not change with time is a result of the time invariance assumption.
#rnn1c
nb_setup.images_hconcat(["DL_images/rnn1c.png"], width=600)
While Figure rnn1 accurately captures the fact that the RNN design incorporates feedback, it does not lend itself easily to training of the type we have seen in Dense Feed-Forward Networks. In order to facilitate the reuse of techniques we have developed in previous chapters, we convert Figure rnn1 into an equivalent Dense Feed Forward network shown in Figure rnn1c, by a process called "unfolding". The unfolded network in the RHS of Figure rnn1c basically shows snapshots of the RNN at various points in time, and the fact that there is feedback between the nodes in the Hidden Layer is captured by the connections involving the weight matrix $W$.
As a result of the unfolding operation, the system becomes amenable to optimization using the Backprop algorithm. In addition, the depth or number of layers of the RNN is now dependent on the length of the input data sequence. This can result in a RNN with hundreds of Hidden Layers, hence the problems of training deep models with Backprop have a special resonance here. We also mentioned earlier that the Hidden Layer in RNNs keeps a "Lossy Memory" of the input sequence - the reason why it is lossy is because this single layer has to capture data from an entire input sequence, which cannot be done without compressing the data in some fashion.
In order to gain an intuitive understanding of the way in which the RNN shown in Figure rnn1c operates, note the following: There is an analogy between the way a ConvNet operates by looking for patterns in localized patches of an image, and the way a RNN operates by looking for patterns in localized intervals in time. Just as a ConvNet slides a single Filter over the entire image, a RNN slides its own filter represented by the weight matrix U over the entire input sequence, one input at a time. Hence RNNs implement a form of Translational Invariance, but they do this over the temporal axis, as opposed to spatial Translational Invariance seen in ConvNets. As a result of this property, the important part of the sequence can occur anywhere along the time axis, but still can be detected using a single weight matrix repeated at each point in time. There is however a crucial difference in the way a RNN operates compared to a ConvNet: Due to the feedback loop, the RNN pattern detector in the Hidden Layer is able to take advantage of the information that was learnt in the prior time steps from older data in the sequence. As a result the RNN is able to detect patterns that are spread over time, something that a ConvNet cannot do (in the spatial sense).
There is a fundamental difference between the internal representations that are created in the hidden layers of a ConvNet vs those that are created in the hidden layers of a RNN. The hidden layers in a ConvNet create a hierarchical representation, in which the representation at layer $r+1$ is at a higher level of abstraction compared to that for layer $r$. The hidden layer in a RNN on the other hand, does not add to the level of abstraction in the representation with successive time steps. Instead it captures patterns that are spread in time, but at the same level of abstraction. It is indeed possible to combine DLNs and RNNs and create deep RNNs (see the next section for examples), in which case higher level hidden layers capture time dependent patterns at different levels of abstraction.
The rest of this chapter is organized as follows: RNNs can be configured in several useful ways depending upon the problem they are being used to solve; some of these configurations are described in Section Examples. In Section Training we discuss the training of RNNs and the problems that arise when doing so. In particular the Backprop algorithm can lead to issues such as the Vanishing Gradient Problem or the Exploding Gradient Problem, and in Section LSTMs we discuss a modified RNN architecture called the LSTM (Long Short Term Memory), which was designed to solve these problems. In the final section we show how to model RNNs and LSTMs using Keras.
In prior chapters we built models to classify IMDB Movie Reviews using Dense Feed Forward Networks (Chapter NNDeepLearning) and 1-D Convnets (Chapter ConvNetsPart1). We now show how to do this using a RNN.
We first address the problem of how to feed data into a RNN: As shown in Figure rnn39, Keras requires that data inputs into each RNN stage be in the form of a 1-D vector. Hence as Part (a) of the figure shows, a single training sample into an RNN is a 2-D matrix of shape $time\times features$, such that the $i^{th}$ row of the matrix represents the data vector that is fed into the $i^{th}$ stage of the RNN. As shown in Part (b) of the figure, a batch of samples into the RNN becomes a 3D tensor of shape $sample\times time\times features$.
All datasets to be fed into an RNN have to be first formatted into this shape. This raises the question of how to feed non-vector data into a RNN, the most common example of which are 3-D images. A common way of doing so is by first passing the image data through a ConvNet, and then using the image feature vector that occurs at the end of the convolutional layers as an input. This enables us to feed a video clip into a RNN, such that images in the clip are fed into successive stages of the RNN.
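As a concrete illustration of this idea, here is a minimal sketch (not taken from the text; the clip dimensions and layer sizes are assumptions) of a Keras model that applies the same small ConvNet to every frame of a clip using the TimeDistributed wrapper, and then feeds the resulting per-frame feature vectors into successive stages of a SimpleRNN:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, TimeDistributed, SimpleRNN, Dense

frames, height, width, channels = 20, 64, 64, 3   # assumed clip dimensions

model = Sequential()
# TimeDistributed applies the same ConvNet to every frame in the clip
model.add(TimeDistributed(Conv2D(16, (3, 3), activation='relu'),
                          input_shape=(frames, height, width, channels)))
model.add(TimeDistributed(MaxPooling2D((2, 2))))
model.add(TimeDistributed(GlobalAveragePooling2D()))   # one feature vector per frame
model.add(SimpleRNN(32))                               # input shape (batch, frames, 16) -> output (batch, 32)
model.add(Dense(1, activation='sigmoid'))
model.summary()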
#rnn39
nb_setup.images_hconcat(["DL_images/rnn39.png"], width=600)
We use the IMDB dataset that comes with Keras. The data has already been split into training and test samples and also tokenized, such that all words have been converted into integers. When loading this data, we limit ourselves to the top 10,000 words that occur in the dataset by means of the max_features parameter. Furthermore each review is truncated after 500 words, with shorter reviews padded with zeroes, using the pad_sequences command and the maxlen parameter. At the end of these operations, each review is in the form of a 1-D vector of size 500 and the entire IMDB dataset is a matrix of shape $samples\times word\ tokens$.
from keras.datasets import imdb
from keras.preprocessing import sequence
max_features = 10000 # number of words to consider as features
maxlen = 500 # cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
The RNN model for the IMDB problem is shown below. It is a basic RNN with 500 stages (since each movie review sample consists of 500 words) and a single Logit node for doing the binary classification.
In order to input a single word into the model, Keras does the following: A word is represented using a 1-Hot vector of size 10,000 (corresponding to the 10,000 words in the vocabulary), which then passes through the Embedding Layer that converts it into its vector representation of size 32 by multiplying it with a matrix of size $10,000\times 32$ (note that this multiplication is not needed in the actual implementation; Keras just picks the $i^{th}$ row of the matrix if the token of the input word is $i$, as illustrated in the sketch below). Note that we are not using a pre-trained Embedding Layer, hence the best embedding for each word is learnt as part of the training process.
Keras inputs a single review into the model by presenting each of its 500 words in turn to successive stages of the RNN.
Keras inputs a batch of reviews into the model by creating a 3-D tensor of shape $(batch\ size, review\ size, feature\ size)$, which is (128, 500, 32) for our example. Hence during the forward pass, 128 reviews are fed in parallel into the model.
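The snippet below is a small numpy illustration (not part of the Keras implementation; the embedding matrix here is just a random placeholder) of why the 1-Hot multiplication reduces to a simple row lookup:
import numpy as np

vocab_size, embed_dim = 10000, 32
E = np.random.randn(vocab_size, embed_dim)   # stand-in for the learnt embedding matrix

token = 42                                   # integer token of a word
one_hot = np.zeros(vocab_size)
one_hot[token] = 1.0

via_matmul = one_hot @ E                     # multiply the 1-Hot vector by the matrix
via_lookup = E[token]                        # just pick the 42nd row

assert np.allclose(via_matmul, via_lookup)   # both give the same 32-dim word vector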
#rnn40
nb_setup.images_hconcat(["DL_images/rnn40.png"], width=600)
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(input_train, y_train,
epochs=100,
batch_size=128,
validation_split=0.2)
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
The Accuracy and Loss curves show that the model starts to overfit after about 8-10 epochs, achieving a maximum accuracy of about 85%. This is close to the accuracy that was achieved with the 1-D convnet (see Chapter ConvNetsPart1), even though that model used an embedding of size 128. In order to increase the accuracy we may increase the number of words per review beyond 500. However this also increases the number of RNN stages, and later in this chapter we will show that this can cause problems with the Backprop algorithm.
#rnn48
nb_setup.images_hconcat(["DL_images/rnn48.png"], width=600)
Figure rnn48 shows a RNN with multiple hidden layers, usually referred to as a Deep RNN. This system incorporates ideas from Dense Feed Forward Networks into the RNN, and enables the system to simultaneously create:
The Keras code for a RNN Model with 4 stacked layers is shown below. Stacked recurrent layers require the use of the return_sequences flag, which is set to True in the intermediate layers as shown. This causes Keras to return the full sequence of outputs, one per timestep, which is needed to feed the following layer. The model.summary() command shows the output shape from the intermediate layers as a 3D tensor of size $(batch\ size, timesteps, output\ features)$.
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32)) # This last layer only returns the last outputs.
model.summary()
#rnn38
nb_setup.images_hconcat(["DL_images/rnn38.png"], width=600)
Figure rnn38 shows an important RNN architecture called the Bi-Directional RNN. As the name implies, this system incorporates two sets of hidden layers. The first set is the conventional hidden layer, with connections going from left to right, while the second set of Hidden Layers has connections going in the opposite direction, from right to left. Unlike the first layer, the second layer is able to spot patterns in the reversed sequence. The final output of each stage incorporates information from both the forward and backward hidden layers, usually by concatenating them. Such a design can be used when the input sequence is not being generated in real time; it is most useful when processing text sequences, and has been shown to substantially improve performance in many cases. Note that this architecture cannot be used for real time data sequences, since at test time future values are not yet available.
The code snippet below shows how to program a Bi-Directional RNN in Keras. The Bi-Directional layer is invoked using a recurrent layer instance as its first argument, which processes the input sequence in the forward order. It creates a second layer which processes the input sequence in the reverse order, and then the final states of the two layers are concatenated and fed into the Dense layer.
from keras import layers
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.Bidirectional(layers.SimpleRNN(32)))
model.add(layers.Dense(1,activation='sigmoid'))
model.summary()
Recall that the Dropout technique is used to reduce the amount of Overfitting in a DLN model. Initial applications of Dropout to RNNs were not successful, and it was later discovered that this was because a different random Dropout Mask was applied at successive stages of the RNN. It was subsequently shown that if the Dropout Mask is kept fixed across the stages for a single forward/backward pass through the RNN, then the amount of Overfitting is indeed reduced. Keras supports this type of RNN Dropout, and it can be turned on by using two flags in the parameter list (see below): The dropout flag specifies the Dropout rate applied to the inputs of the recurrent layer, while the recurrent_dropout flag specifies the Dropout rate applied to the recurrent connections as well.
Earlier in this chapter we saw that the SimpleRNN model, when applied to the IMDB Movie Review problem, resulted in a good deal of overfitting. In the example below we use Dropout as part of the model. The results show that the Overfitting is indeed pushed out to later in the training process, after 60-70 epochs as opposed to 10 epochs. However there is not much improvement in the best accuracy level, which has actually decreased slightly.
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32,
dropout = 0.2,
recurrent_dropout = 0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
epochs=100,
batch_size=128,
validation_split=0.2)
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
The Backprop algorithm that was described in Chapter TrainingNNsBackprop can be modified in order to train RNNs, and this forms the subject of this section. We start by setting up the optimization problem, which as before, is to minimize the Loss Function $L$. Recall that $L$ is a function of the labeled Training Sequence $\{(X(m),T(m))\}_{m=1}^{M}$ and the network parameters $(W,b)$, while $\mathcal L(m)$ is the Cross Entropy Loss Function for the $m^{th}$ training pair, given by
$$ \mathcal L(m) = -\sum_{k=1}^{K}t_k(m)\log y_k(m) $$

In this equation $K$ is the number of output categories, $t_k(m), 1\le k\le K$ is the Ground Truth vector and $y_k(m), 1\le k\le K$ are the classification probabilities for the $m^{th}$ training pair, given by

$$ y_k(m) = \frac{\exp(a_k(m))}{\sum_{j=1}^K \exp(a_j(m))},\ \ 1\le k\le K $$

where $a_k(m), 1\le k\le K$ are the corresponding logits. The overall Loss Function can then be written as

$$ L={1\over M}\sum_{m=1}^M \mathcal L(m)= -{1\over M} \sum_{m=1}^M\sum_{k=1}^K t_k(m)\log y_k(m) $$

The modification to the Backprop algorithm to train RNNs is called Back Propagation Through Time (BPTT), and in the following two sub-sections we describe the forward and backward passes for this algorithm.
#rnn41
nb_setup.images_hconcat(["DL_images/rnn41.png"], width=600)
Consider a RNN with R stages, such that the (R+1)st layer is the Logit Layer. The forward pass of the BPTT algorithm for a single sample of the input data $(X_1,...,X_R)$ proceeds as follows (refer to Figure rnn41):
The first input vector in the sample, $X_1$, is applied to the system, resulting in the hidden state activation vector $Z_1$ as per the formula:

$$ Z_1 = f(UX_1) $$

The second input vector $X_2$ and the hidden state vector $Z_1$ are applied to the system, resulting in the hidden state activation vector $Z_2$:

$$ Z_2 = f(WZ_1 + UX_2) $$

Step 2 is repeated an additional $R-2$ times to compute the vectors $Z_r, 3\le r\le R$, until the remaining input vectors from the sample $(X_3,...,X_R)$ have been processed.

After the last stage, the Logits $A_{R+1}$ are computed as a function of the final Hidden State $Z_R$:

$$ A_{R+1} = VZ_R $$

and the output probabilities are given by

$$ Y = softmax(A_{R+1}) $$

This constitutes the forward pass of the BPTT algorithm for a single sample of the input dataset. In practice $B$ samples are sent through the RNN at the same time when the calculations are done in batch mode.
#rnn42
nb_setup.images_hconcat(["DL_images/rnn42.png"], width=600)
Once again consider a RNN with R stages, such that the (R+1)st layer is the Logit Layer. Define the following gradients for the R stages in the RNN, with P nodes per stage:
$$ \Delta_r = ({{\partial\mathcal L}\over{\partial a^1_r}},...,{{\partial\mathcal L}\over{\partial a^P _r}}) = {{\partial\mathcal L}\over{\partial A_r}},\ 1\le r\le R $$

Since there are K nodes in the Logit Layer, its gradient is defined by:

$$ \Delta_{R+1} = ({{\partial\mathcal L}\over{\partial a^1_{R+1}}},...,{{\partial\mathcal L}\over{\partial a^K _{R+1}}}) = {{\partial\mathcal L}\over{\partial A_{R+1}}} $$

The Backward Pass starts with the computation of the gradient for the Logit Layer, given by:

$$ \Delta_{R+1} = Y - T $$

The gradients for the other stages are then computed iteratively in a backwards fashion, from stage $R$ down to stage 1, using the equation

$$ \Delta_{r} = f'(A_r)\odot (W^{(r+1)})^T \Delta_{r+1},\ \ 1\le r\le R $$

where $W^{(r+1)}$ is the weight matrix feeding stage $r+1$, i.e., $W^{(r+1)} = V$ for the Logit Layer ($r = R$) and $W^{(r+1)} = W$ for the Hidden Layers ($1\le r\le R-1$). The derivation of these equations is the same as for Dense Feed Forward Networks and was explained in Chapter TrainingNNsBackprop.

The gradient of the Loss Function with respect to the weight matrices $U, V, W$ is then given by the sum of the gradients across each of the stages:

$$ \frac{\partial L}{\partial U} = X_1^T\Delta_1 + ... + X_R^T\Delta_R $$

$$ \frac{\partial L}{\partial V} = Z_R^T\Delta_{R+1} $$

$$ \frac{\partial L}{\partial W} = Z_1^T\Delta_2 + ... + Z_{R-1}^T\Delta_R $$

These calculations are illustrated for a RNN with $R=3$ in Figure rnn42.
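The following is a minimal numpy sketch of these forward and backward passes for a single sample, written under the assumption that $f$ is the tanh function and using made-up sizes; it follows the chapter's equations but stores vectors as columns, so the per-stage gradient sums appear as outer products:
import numpy as np

R, D, P, K = 3, 4, 5, 2                  # stages, input size, hidden nodes, output classes
rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(P, D))        # input-to-hidden weights
W = 0.1 * rng.normal(size=(P, P))        # hidden-to-hidden weights
V = 0.1 * rng.normal(size=(K, P))        # hidden-to-logit weights
X = rng.normal(size=(R, D))              # one input sequence X_1,...,X_R
T = np.array([1.0, 0.0])                 # one-hot target vector

# Forward pass: Z_r = f(W Z_{r-1} + U X_r), A_{R+1} = V Z_R, Y = softmax(A_{R+1})
Z = []
z_prev = np.zeros(P)
for r in range(R):
    z_prev = np.tanh(W @ z_prev + U @ X[r])
    Z.append(z_prev)
logits = V @ Z[-1]
Y = np.exp(logits) / np.exp(logits).sum()

# Backward pass: Delta_{R+1} = Y - T, then iterate back through the stages
delta_logit = Y - T
deltas = [(1 - Z[-1] ** 2) * (V.T @ delta_logit)]          # Delta_R, using f'(a) = 1 - tanh(a)^2
for r in range(R - 2, -1, -1):
    deltas.insert(0, (1 - Z[r] ** 2) * (W.T @ deltas[0]))  # Delta_r

# Weight gradients are sums of per-stage outer products, as in the text
dU = sum(np.outer(deltas[r], X[r]) for r in range(R))
dW = sum(np.outer(deltas[r + 1], Z[r]) for r in range(R - 1))
dV = np.outer(delta_logit, Z[-1])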
#rnn20
nb_setup.images_hconcat(["DL_images/rnn20.png"], width=800)
Given the structure of RNNs, it is possible to process a single training sample of size $R$, $(X_r,T_r), r = 1,..,R$, using a RNN with R stages. However if R is large, the RNN is not able to learn from data at the start of the input sample. In practice, once R exceeds 20-30 stages, the model performance no longer improves. This issue, which is due to the Vanishing Gradient problem, is treated in greater detail in the next section, but for now we describe a training technique, called Truncated BPTT, that is commonly used to mitigate this problem.
This technique is illustrated in Figure rnn20 and the basic idea is to break up a large sample into smaller segments, and then run BPTT on each segment separately. Part (a) of the figure shows a RNN with 6 stages. In order to apply Truncated BPTT we split it into two segments of size 3, and Parts (b) and (c) show BPTT being applied to these segments. However, in order to create a connection between successive segments, the final Hidden State vector $Z$ from segment $n$ is used to initialize the first Hidden State in segment $n+1$, as shown. It has been observed experimentally that this technique is effective in detecting correlations between inputs that are located more than a segment-size apart.
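One way to realize this scheme in Keras is sketched below (the shapes and training loop are illustrative assumptions, not taken from the text): a stateful SimpleRNN carries its final Hidden State from one segment over to the next, while Backprop itself runs only within each segment.
import numpy as np
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

segments, seg_len, features, batch = 4, 30, 8, 16

model = Sequential()
# stateful=True makes Keras carry the final hidden state of one batch over to the next
model.add(SimpleRNN(32, stateful=True, batch_input_shape=(batch, seg_len, features)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy')

long_x = np.random.randn(batch, segments * seg_len, features)    # synthetic long sequences
y = np.random.randint(0, 2, size=(batch, 1)).astype('float32')   # one label per sequence

for s in range(segments):                          # BPTT runs within each segment only
    segment = long_x[:, s * seg_len:(s + 1) * seg_len, :]
    model.train_on_batch(segment, y)               # same label reused per segment (illustrative)
model.reset_states()                               # clear the carried state before new sequences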
#rnn43
nb_setup.images_hconcat(["DL_images/rnn43.png"], width=800)
When the BPTT algorithm is applied to a RNN with a large number of stages, we run into some problems that were first encountered when training Dense Feed Forward Networks, but which have more serious implications for RNNs:
The Vanishing Gradient Problem: The gradients $\Delta_i$ in the initial stages of the network become progressively smaller and get close to zero during Backprop, as the number of RNN stages increases.
The Exploding Gradient Problem: This is the opposite effect, i.e., the gradients $\Delta_i$ become larger and larger as we add more RNN stages.
The causes of these problems can be understood by analysing the recursion equations for the gradients, and we proceed to do this next. The gist of the argument can be understood by considering a simple scalar sequence $\{x_n\}$ generated by the recursion:
$$ x_{n+1} = ax_{n} \quad \mbox{so that} \quad x_{n} = a^n x_0,\ \ n=1,2,... $$

Clearly if $a<1$ then the sequence converges to zero, while if $a>1$ then the sequence blows up to infinity. In the case of RNNs, the BPTT recursion for the gradients $\Delta_t = \frac{\partial L}{\partial A_t}$ is given by the following equation:

$$ \Delta_t = f'(A_t)\odot W^T\Delta_{t+1} $$

where $W^T$ is the transpose of the weight matrix $W$. Ignoring the derivative $f'$ (a reasonable simplification, since with ReLUs $f'$ is either $1$ or $0$), it follows that for $\tau$ RNN stages

$$ \Delta_{\tau-n} = (W^T)^n\ \Delta_{\tau},\ \ n = 1,2,...,\tau-1 $$

If the matrix $W$ can be decomposed as

$$ W = R\Lambda R^T $$

where $R$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix with the eigenvalue $\lambda_j$ at the $j^{th}$ position along the diagonal, then the (backwards) recursion becomes

$$ \Delta_{\tau-n} = R\Lambda^n R^T\ \Delta_{\tau},\ \ n = 1, 2,...,\tau-1 $$

Note that the diagonal matrix $\Lambda^n$ has diagonal components $\lambda_j^n$, and as $n$ increases they either converge to $0$ if $|\lambda_j| < 1$ (resulting in Vanishing Gradients) or go to infinity if $|\lambda_j| > 1$ (resulting in Exploding Gradients). Hence for larger values of $\tau$, the gradients $\Delta_i$ of the early stages either converge to zero or blow up.

In the previous section we derived the following expression for the gradient of the Loss Function with respect to the weight matrix $U$:

$$ \frac{\partial L}{\partial U} = \sum_{t=1}^{\tau} X_t^T\Delta_t $$

Combining this equation with the prior one, it follows that the inputs $X_t$ from the early stages of the network will have limited or no effect on the gradient $\frac{\partial L}{\partial U}$. This is known as the problem of Long Term Dependencies in RNNs, i.e., in a RNN with a large number of stages, the inputs from the early stages are not able to influence the gradients, and thus the learning process.
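The small numerical experiment below (with arbitrary illustrative values) makes this concrete: a gradient vector is multiplied repeatedly by $W^T$, where $W$ has all its eigenvalues equal to either 0.9 or 1.1, and its norm shrinks or grows geometrically with the number of stages:
import numpy as np

rng = np.random.default_rng(3)
delta = rng.normal(size=4)                           # gradient at the last stage

for scale, label in [(0.9, 'vanishing'), (1.1, 'exploding')]:
    Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal matrix R
    W = Q @ np.diag([scale] * 4) @ Q.T               # W = R Lambda R^T, all eigenvalues = scale
    g = delta.copy()
    for _ in range(50):                              # backpropagate through 50 stages
        g = W.T @ g
    print(label, np.linalg.norm(g))                  # roughly 0.9^50 or 1.1^50 times the original norm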
Figure rnn43 shows an example of this analysis for a 3-stage RNN.
#rnn4
nb_setup.images_hconcat(["DL_images/rnn4.png"], width=600)
The Exploding Gradient problem is usually addressed by a technique known as Gradient Clipping. As the name implies, this involves clipping the gradient $\frac{\partial L}{\partial W}$ whenever it explodes, i.e., if its norm exceeds some pre-set threshold, it is rescaled back to that threshold. Defining $$ {\hat g}=\frac{\partial L}{\partial W} $$ if $||{\hat g}|| \ge threshold$ then $$ {\hat g} \leftarrow \frac{threshold}{||{\hat g}||}{\hat g} $$ The effect of Gradient Clipping is illustrated in Figure rnn4, in which the Loss Function is plotted as a function of the network parameters $(w,b)$. The solid blue arrows show the progression of the Loss Function with successive optimization steps. When the optimization hits the high Loss Function wall, the next gradient is extremely large, and in the absence of Gradient Clipping the parameters are pushed off to a distant location on the Loss Function surface. With Gradient Clipping, on the other hand, the optimization proceeds along the dashed line, and the Loss Function is pulled back to an area that is closer to its original location.
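In Keras, this form of Gradient Clipping can be requested through the optimizer; a short sketch (with an arbitrarily chosen threshold value) is shown below:
from keras import optimizers

# clipnorm rescales any gradient whose L2 norm exceeds the threshold, as in the update above
clipped_rmsprop = optimizers.RMSprop(lr=0.001, clipnorm=1.0)
# model.compile(optimizer=clipped_rmsprop, loss='binary_crossentropy', metrics=['acc'])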
The Vanishing Gradient problem can be (partially) addressed by using Truncated BPTT which restricts the number of stages over which BPTT is run to twenty or less. This technique doesn't work well in all cases and in general the Vanishing Gradient problem is more difficult to solve using traditional RNNs. In the next section we describe an alternative RNN architecture which was designed with the objective of solving this problem.
#rnn44
nb_setup.images_hconcat(["DL_images/rnn44.png"], width=1500)
LSTMs were designed with the objective of solving the Vanishing Gradient Problem that plagues RNNs. They have been very successful in this regard; indeed almost all of the successful applications of RNNs in recent years have been with LSTMs rather than with plain RNNs. The reason why RNNs suffer from the Vanishing Gradient problem is that successive hidden states in the RNN are computed using a (matrix) multiplication operation, as in $A_2 = WZ_1+UX_2$, as illustrated in Part (a) of Figure rnn44. The basic change that the creators of the LSTM made was to replace this multiplication by an addition. This is illustrated in Part (b) of Figure rnn44 and explained next:
In addition to the basic increment and decrement operations, we would also like to increase the flexibility of our architecture by supporting features such as (1) Holding the Cell State to a constant value, or (2) Resetting the Cell State value. Implementation of these requires the addition of Gates to the architecture and this is explained next.
#rnn23
nb_setup.images_hconcat(["DL_images/rnn23.png"], width=600)
There are three types of Gates that are used in LSTMs:
Input Gate: This is illustrated in Part (a) of Figure rnn45. The Input Gate is implemented using the vector $i_t$ (at stage $t$). This vector is generated by using the equation $$ i_{t+1} = \sigma(W_iZ_t + U_iX_{t+1}) $$ Recall that the sigmoid function varies in the range $[0,1]$, and assuming that it is operating in the saturation mode, we can see how $i_t$ can act as a switch. This is an intelligent switch whose operation is governed by the previous Hidden State $Z_t$ and the current input $X_{t+1}$, as well as by the parameters $W_i$ and $U_i$, whose values are learnt as part of the training process. This switch controls the Cell State propagation using the equation $$ C_{t+1} = C_t + i_{t+1}\odot\tanh(WZ_t + UX_{t+1}) $$ When the Input Gate is closed, i.e., $i_{t+1} = 0$, then $C_{t+1} = C_t$, i.e., all the components of the Cell State vector are passed on unchanged to the next stage. On the other hand when the Input Gate is open, i.e., $i_{t+1} = 1$, then $C_{t+1} = C_t + \tanh(WZ_t + UX_{t+1})$, so that the Cell State vector changes in response to the Hidden State and Input Vector values by +1 or -1.
Forget Gate: This is illustrated in Part (b) of Figure rnn45 (along with the Input Gate). The Forget Gate is implemented using the vector $f_t$, which is generated using the equation $$ f_{t+1} = \sigma(W_fZ_t + U_fX_{t+1}) $$ The Forget Gate controls the Cell State propagation using the equation (for simplicity we have left out the Input Gate) $$ C_{t+1} = f_{t+1}\odot C_t + \tanh(WZ_t + UX_{t+1}) $$ When the Forget Gate is closed, i.e., $f_{t+1} = 0$, then $C_{t+1} = \tanh(WZ_t + UX_{t+1})$, i.e., the Cell State vector is reset to +1 or -1 and it loses memory of the prior Cell State $C_t$. On the other hand when the Forget Gate is open, i.e., $f_{t+1} = 1$, then $C_{t+1} = C_t + \tanh(WZ_t + UX_{t+1})$, so that the Cell State vector changes in response to both the prior state $C_t$, as well as the Hidden State and Input Vector values.
Output Gate: This is illustrated in Part (c) of Figure rnn45 (along with the Input and Forget Gates). The Output Gate is implemented using the vector $o_t$, which is generated using the equation $$ o_{t+1} = \sigma(W_oZ_t + U_oX_{t+1}) $$ Unlike the Input and Forget Gates, the Output Gate does not directly influence the Cell State propagation, instead it controls whether the Cell State is copied onto the Hidden State, by using the equation $$ Z_{t+1} = o_{t+1}\odot \tanh(VC_{t+1}) $$ Of these three gates, the Input and Forget Gates have been found essential to the operation of the LSTM, while the Output Gate is of lesser importance.
#rnn45
nb_setup.images_hconcat(["DL_images/rnn45.png"], width=1200)
Putting all the equations for the LSTM together:
The Cell State evolves according to
$$ C_{t+1} = f_{t+1}\odot C_t + i_{t+1}\odot\tanh(WZ_t + UX_{t+1}) $$

The corresponding Hidden State is given by

$$ Z_{t+1} = o_{t+1}\odot \tanh(VC_{t+1}) $$

where the gates are given by:

$$ i_{t+1} = \sigma(W_iZ_t + U_iX_{t+1}) $$

$$ f_{t+1} = \sigma(W_fZ_t + U_fX_{t+1}) $$

$$ o_{t+1} = \sigma(W_oZ_t + U_oX_{t+1}) $$

Going back to Figure rnn23, we can readily see how the flat and reset portions of the curve can be generated by an appropriate choice of the Input and Forget Gates.
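The numpy sketch below steps a single LSTM cell through a short input sequence using exactly these equations (including the matrix $V$ applied to the Cell State, which follows the chapter's notation rather than the more common formulation); all weights are random placeholders with assumed sizes:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

P, D = 4, 3                                                  # cell/hidden size and input size (assumed)
rng = np.random.default_rng(1)
W,  U  = rng.normal(size=(P, P)), rng.normal(size=(P, D))    # candidate update weights
Wi, Ui = rng.normal(size=(P, P)), rng.normal(size=(P, D))    # Input Gate weights
Wf, Uf = rng.normal(size=(P, P)), rng.normal(size=(P, D))    # Forget Gate weights
Wo, Uo = rng.normal(size=(P, P)), rng.normal(size=(P, D))    # Output Gate weights
V = rng.normal(size=(P, P))

def lstm_step(c_prev, z_prev, x):
    i = sigmoid(Wi @ z_prev + Ui @ x)                        # Input Gate
    f = sigmoid(Wf @ z_prev + Uf @ x)                        # Forget Gate
    o = sigmoid(Wo @ z_prev + Uo @ x)                        # Output Gate
    c = f * c_prev + i * np.tanh(W @ z_prev + U @ x)         # additive Cell State update
    z = o * np.tanh(V @ c)                                   # Hidden State
    return c, z

c, z = np.zeros(P), np.zeros(P)
for x in rng.normal(size=(5, D)):                            # run the cell over a short input sequence
    c, z = lstm_step(c, z, x)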
From this analysis it is clear that LSTMs are much more effective than RNNs in retaining memory of past inputs. Their structure gives them a fine level of control over each component of the Cell State vector. They can propagate a state value indefinitely into the future by setting the gates to $f_t = 1, i_t = 0$, as the analysis shows.
An example of this is shown in Figure rnn25, which shows the value of a single component of the Cell State vector in a LSTM system that was trained to generate sentences in English (the inputs into this network are the characters that have already been generated, so that it uses this information to generate the next character). This Cell State component was in the ON state (shown in blue) whenever the system was generating characters that lie in between quotes, and was OFF otherwise (shown in red). This illustrates the LSTM's ability to change its state in response to the input values, and then hold it.
#rnn25
nb_setup.images_hconcat(["DL_images/rnn25.png"], width=600)
There is an interesting similarity between LSTMs and ResNets, one of the best performing ConvNet architectures (see Chapter ConvNetsPart2), since both feature an additive interaction design. Not all aspects of why this is beneficial are understood at this point, and this topic forms an active area of research.
#rnn24
nb_setup.images_hconcat(["DL_images/rnn24.png"], width=600)
#rnn6
nb_setup.images_hconcat(["DL_images/rnn6.png"], width=600)
GRUs are a recently proposed alternative to LSTMs that shares their good properties, i.e., they are also designed to avoid the Vanishing Gradient problem (see Figure rnn6). Even though they feature fewer parameters than LSTMs, their performance has been experimentally observed to be about the same. The main difference from LSTMs is that GRUs don't have a cell memory state $C_t$, but instead are able to function effectively using the Hidden State $Z_t$ alone.
The design of the GRU features two gates, the Update Gate $o_t$, given by: \begin{equation} o_t = \sigma(W^o X_t + U^o Z_{t-1}) \quad \quad (**GRU1**) \end{equation} and the Reset Gate $r_t$ given by: \begin{equation} r_t = \sigma(W^r X_t + U^r Z_{t-1}) \quad \quad (**GRU2**) \end{equation}
The Hidden State vector $Z_{t-1}$ from the prior stage is combined with the Input vector $X_t$ from the current stage, to generate an intermediate Hidden State value $\tilde Z_t$ given by: \begin{equation} {\tilde Z}_t = \tanh(r_t\odot U Z_{t-1} + W X_t) \quad \quad (**GRU3**) \end{equation}
Finally the new Hidden State $Z_t$ is generated by combining the prior Hidden State $Z_{t-1}$ with the intermediate value $\tilde Z_t$ as follows: \begin{equation} Z_t = (1 - o_t)\odot {\tilde Z}_t + o_t\odot Z_{t-1} \quad \quad (**GRU4**) \end{equation}
In order to understand the workings of the GRU, we once again assume that both the Update and Reset Gates are operating in the saturated regime, so that they assume values in the set $\{0,1\}$. It is easy to see that each component of the $Z_t$ vector takes on values from the set $\{+1,0,-1\}$, and from Equations (GRU3) and (GRU4) one can see that it can exhibit the following behaviors:
The reader will notice that there are similarities in the behavior of $Z_t$ in GRUs and LSTMs, in the sense that both have modes in which $Z_t$ maintains its prior value, as well as modes in which it changes value based on the current input or gets reset and loses memory of prior states. However GRUs are able to do this using a smaller number of parameters. The GRU design also provides a path for the gradients with respect to $Z$ to flow backwards without being subject to matrix multiplication, as an application of the Gradient Flow rules to Figure rnn6 reveals. This gradient flow is gated by $o_t$, such that if $o_t = 0$ at a cell, then the gradient is not propagated backwards.
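For comparison, here is an analogous numpy sketch of one GRU step following Equations (GRU1)-(GRU4), again with random placeholder weights and assumed sizes:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

P, D = 4, 3                                                  # hidden size and input size (assumed)
rng = np.random.default_rng(2)
W,  U  = rng.normal(size=(P, D)), rng.normal(size=(P, P))    # candidate state weights
Wo, Uo = rng.normal(size=(P, D)), rng.normal(size=(P, P))    # Update Gate weights
Wr, Ur = rng.normal(size=(P, D)), rng.normal(size=(P, P))    # Reset Gate weights

def gru_step(z_prev, x):
    o = sigmoid(Wo @ x + Uo @ z_prev)                        # Update Gate (GRU1)
    r = sigmoid(Wr @ x + Ur @ z_prev)                        # Reset Gate (GRU2)
    z_tilde = np.tanh(r * (U @ z_prev) + W @ x)              # intermediate Hidden State (GRU3)
    return (1 - o) * z_tilde + o * z_prev                    # new Hidden State (GRU4)

z = np.zeros(P)
for x in rng.normal(size=(5, D)):                            # run the cell over a short input sequence
    z = gru_step(z, x)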
LSTMs can be invoked in Keras simply by replacing SimpleRNN with LSTM (or GRU). We solve the IMDB Movie Review problem using a single layer LSTM model below. Note the following:
#rnn47
nb_setup.images_hconcat(["DL_images/rnn47.png"], width=600)
from keras.datasets import imdb
from keras.preprocessing import sequence
max_features = 10000 # number of words to consider as features
maxlen = 500 # cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
from keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['acc'])
history = model.fit(input_train, y_train,
epochs=50,
batch_size=128,
validation_split=0.2)
model.summary()
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()