In [1]:

```
from ipypublish import nb_setup
```

Recurrent Neural Nets or RNNs are systems that are designed to detect patterns present in data sequences. This makes them better suited to solve "prediction" problems, as compared to other types of DLNs. An example of a prediction problem is predicting the next word in a sentence, which is a fundamental problem in Language Modeling. The solution to this problem requires that the system take into account the variable number of words that came before, i.e., it should be able to "remember" previous data in the sequence. Just like ConvNets, RNNs were discovered in the late 1980s, but lay dormant until recently due to the difficulty in training them. These problems have been overcome in recent years, with the use of a type of RNN called Long Short Term Memory or LSTMs, as well as the increase in processing power and size of training datasets. Today RNNs are at the forefront of exciting new discoveries in Deep Learning, and some of the most important recent work in DLNs falls in the RNN domain.

In [2]:

```
#rnn36
nb_setup.images_hconcat(["DL_images/rnn36.png"], width=600)
```

Out[2]:

In order to motivate the need for RNNs, consider the following: The DLN architectures that we have seen so far implement a static mapping between input and output vectors, of the type $Y=f(X)$ as shown in Part (a) of Figure **rnn36**.

Instead, consider a system that is evolving with time, i.e., it is a *Dynamical System*. In this scenario, the system is subject to an input sequence $X_1,...,X_n$ in which successive values of $X_i$ are dependent (for example if each $X_i$ were an image, then the sequence represents a video clip). Furthermore the output vector $Y_i$ at time $i$ depends not just on the input vector $X_i$, but on past values of $X$ as well, i.e., $Y_i = f_i(X_1,...,X_i)$ as shown in Part (b) of Figure **rnn36**. If we were to try to solve this problem using a traditional Neural Network, then we would need to find a different function $f_i$ for each value of $i$, i.e., the function would depend on the size of the input sequence. Obviously this would be a formidable task, and it would be much nicer if we could find a solution which is not dependent on the size of the input sequence; this is precisely what RNNs do.

Systems whose behavior is time dependent are commonly encountered in practice and are usually studied by postulating a *Hidden Variable or Hidden State $Z_i$* (also called a State Variable). The evolution of the system state obeys the following recursion:

$$ Z_{i+1} = w_{i+1}(Z_i,X_i) $$

while the output is a function of the current state and is given by

$$ Y_{i+1} = v_{i+1}(Z_{i+1}) $$

The Hidden Variable sequence $Z_i$ captures a lossy history of the $X_i$ sequence, and hence serves as a type of memory. If we assume that the functions $v$ and $w$ do not change with time, then these equations reduce to:

$$ \begin{equation} Z_{i+1} = w(Z_i,X_i) \quad \quad (**eqn1**) \end{equation} $$

$$ \begin{equation} Y_{i+1} = v(Z_{i+1}) \quad \quad (**eqn2**) \end{equation} $$

Traditional Systems Theory makes the additional simplifying assumption that these functions are linear so that

$$ Z_{i+1} = WZ_i + UX_i $$

$$ Y_{i+1} = VZ_{i+1} $$

The use of DLNs enables us to approximate non-linear functions to a high degree of accuracy using the training dataset, so that this linearity assumption is no longer needed. We do however assume that the vectors $Z_i$ and $X_i$ can be combined linearly, so that the resulting equations are:

$$ \begin{equation} Z_{i+1} = f(W Z_i + UX_i) \quad \quad (**eqn3**) \end{equation} $$

$$ \begin{equation} Y_{i+1} = h(V Z_{i+1}) \quad \quad (**eqn4**) \end{equation} $$

Equations (**eqn3**) and (**eqn4**) are in a form in which they can be implemented using Neural Networks, with the functions $f$ and $h$ serving as the Activation Functions, as shown in Figure **rnn1**.
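The recursion in Equations (**eqn3**) and (**eqn4**) can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the text above: $f$ is taken to be tanh, $h$ is taken to be the sigmoid, the initial state is zero, and the function name `rnn_forward` and all dimensions are made up for the example.

```python
import numpy as np

def rnn_forward(X, Z0, U, W, V):
    """Apply eqn3 and eqn4 to a sequence X of shape (time, input_dim)."""
    Z = Z0
    outputs = []
    for x in X:
        Z = np.tanh(W @ Z + U @ x)           # eqn3, with f = tanh (assumed)
        Y = 1.0 / (1.0 + np.exp(-(V @ Z)))   # eqn4, with h = sigmoid (assumed)
        outputs.append(Y)
    return np.stack(outputs), Z

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 3, 2   # illustrative sizes
U = rng.normal(size=(hidden_dim, input_dim))
W = rng.normal(size=(hidden_dim, hidden_dim))
V = rng.normal(size=(output_dim, hidden_dim))

# The same U, W and V handle sequences of different lengths.
Y_short, _ = rnn_forward(rng.normal(size=(5, input_dim)), np.zeros(hidden_dim), U, W, V)
Y_long, _ = rnn_forward(rng.normal(size=(50, input_dim)), np.zeros(hidden_dim), U, W, V)
print(Y_short.shape, Y_long.shape)
```

Note that nothing in the function depends on the sequence length, which is the property that a separate function $f_i$ per time step would not have.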

In [3]:

```
#rnn1
nb_setup.images_hconcat(["DL_images/rnn1.png"], width=600)
```

Out[3]:

The figure shows three types of nodes:

$X{(n)}$: This represents the value of the input vector at time $n$.

$Z{(n)}$: This represents the value of the Hidden Layer vector at time $n$.

$Y{(n)}$: This represents the value of the output vector at time $n$.

The LHS of Figure **rnn1** shows a RNN with connections between the Hidden Layers at times $n$ and $n+1$ shown explicitly. The RHS of Figure **rnn1** is a simplified representation of the same RNN. Each of the boxes in this figure represents a row of Neural Network nodes belonging to a single layer, of the type shown on the LHS, which have been omitted for the sake of clarity. Note that:

- The weight matrix $U$ connects the nodes in the Input Layer with those in the Hidden Layer
- The weight matrix $W$ connects the nodes in the Hidden Layer with the same set of nodes, but at the previous time instant. The square block on the self-loop to the Hidden Layer represents a time delay of one unit, and represents the fact that at time $n$, the nodes in that layer have as one of their inputs, the values that they had at time $n-1$
- The weight matrix $V$ connects the nodes in the Hidden Layer with those in the Output Layer.

The fact that all these weight matrices do not change with time is a result of the time invariance assumption.

In [28]:

```
#rnn1c
nb_setup.images_hconcat(["DL_images/rnn1c.png"], width=600)
```

Out[28]:

While Figure **rnn1** accurately captures the fact that the RNN design incorporates feedback, it does not lend itself easily to training of the type we have seen in Dense Feed-Forward Networks. In order to facilitate the reuse of techniques we have developed in previous chapters, we convert Figure **rnn1** into an equivalent Dense Feed Forward network shown in Figure **rnn1c**, by a process called "unfolding". The unfolded network in the RHS of Figure **rnn1c** basically shows snapshots of the RNN at various points in time, and the fact that there is feedback between the nodes in the Hidden Layer is captured by the connections involving the weight matrix $W$.

As a result of the unfolding operation, the system becomes amenable to optimization using the Backprop algorithm. In addition, the depth or number of layers of the RNN is now dependent on the length of the input data sequence. This can result in a RNN with hundreds of Hidden Layers, hence the problems of training deep models with Backprop have a special resonance here. We also mentioned earlier that the Hidden Layer in RNNs keeps a "Lossy Memory" of the input sequence. It is lossy because this single layer has to capture data from an entire input sequence, which cannot be done without compressing the data in some fashion.

In order to gain an intuitive understanding of the way in which the RNN shown in Figure **rnn1c** operates, note the following: There is an analogy between the way a ConvNet operates by looking for patterns in localized patches of an image, and the way a RNN operates by looking for patterns in localized intervals in time. Just as a ConvNet slides a single Filter over the entire image, a RNN slides its own filter represented by the weight matrix U over the entire input sequence, one input at a time. Hence RNNs implement a form of Translational Invariance, but they do this over the temporal axis, as opposed to spatial Translational Invariance seen in ConvNets. As a result of this property, the important part of the sequence can occur anywhere along the time axis, but still can be detected using a single weight matrix repeated at each point in time. There is however a crucial difference in the way a RNN operates compared to a ConvNet: Due to the feedback loop, the RNN pattern detector in the Hidden Layer is able to take advantage of the information that was learnt in the prior time steps from older data in the sequence. As a result the RNN is able to detect patterns that are spread over time, something that a ConvNet cannot do (in the spatial sense).

There is a fundamental difference between the internal representations that are created in the hidden layers of a ConvNet vs those that are created in the hidden layers of a RNN. The hidden layers in a ConvNet create a hierarchical representation, in which the representation at layer $r+1$ is at a higher level of abstraction compared to that for layer $r$. The hidden layer in a RNN on the other hand, does not add to the level of abstraction in the representation with successive time steps. Instead it captures patterns that are spread in time, but at the same level of abstraction. It is indeed possible to combine DLNs and RNNs and create deep RNNs (see the next section for examples), in which case higher level hidden layers capture time dependent patterns at different levels of abstraction.

The rest of this chapter is organized as follows: RNNs can be configured in several useful ways depending upon the problem they are being used to solve; some of these are described in Section **Examples**. In Section **Training** we discuss the training of RNNs and the problems that arise when doing so. In particular the Backprop algorithm can lead to issues such as the Vanishing Gradients Problem or the Exploding Gradients Problem, and in Section **LSTMs** we discuss a modified RNN architecture called LSTM (Long Short Term Memory) which was designed to solve these problems. In the final section we show how to model RNNs and LSTMs using Keras.

In prior chapters we built models to classify IMDB Movie Reviews using Dense Feed Forward Networks (Chapter **NNDeepLearning**) and 1-D Convnets (Chapter **ConvNetsPart1**). We now show how to do this using a RNN.

We first address the problem of how to feed data into a RNN: As shown in Figure **rnn39**, Keras requires that data inputs into each RNN stage be in the form of a 1-D vector. Hence as Part (a) of the figure shows, a single training sample into an RNN is a 2-D matrix of shape $time\times features$, such that the $i^{th}$ row of the matrix represents the data vector that is fed into the $i^{th}$ stage of the RNN. As shown in Part (b) of the figure, a batch of samples into the RNN becomes a 3D tensor of shape $sample\times time\times features$.

All datasets to be fed into an RNN have to be first formatted into this shape. This raises the question of how to feed non-vector data into a RNN, the most common example of which is 3-D image data. A common way of doing so is by first passing the image data through a ConvNet, and then using the image feature vector that occurs at the end of the convolutional layers as an input. This enables us to feed a video clip into a RNN, such that images in the clip are fed into successive stages of the RNN.
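The input layout described above can be illustrated with a small NumPy sketch. The sizes (6 time steps, 4 features, 3 samples) and the variable names are arbitrary choices made for this example only.

```python
import numpy as np

time_steps, features = 6, 4   # illustrative sizes

# A single sample is a 2-D matrix: row i is the vector fed into stage i.
sample = np.arange(time_steps * features, dtype=np.float32).reshape(time_steps, features)

# A batch stacks samples along a new leading axis: (samples, time, features).
batch = np.stack([sample, sample + 100.0, sample + 200.0])

print(sample.shape)   # (6, 4)
print(batch.shape)    # (3, 6, 4)

# What stage 2 of the RNN sees for the whole batch: one vector per sample.
stage_2_input = batch[:, 2, :]
print(stage_2_input.shape)   # (3, 4)
```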

In [11]:

```
#rnn39
nb_setup.images_hconcat(["DL_images/rnn39.png"], width=600)
```

Out[11]:

We use the IMDB dataset that comes with Keras. The data has already been split into training and test samples, and also tokenized such that all words have been converted into integers. When loading this data, we limit ourselves to the top 10,000 words that occur in the dataset by passing the *max_features* variable as the *num_words* argument. Furthermore each review is truncated to 500 words, with shorter reviews padded with zeroes, using the *pad_sequences* command and the *maxlen* parameter (by default *pad_sequences* both pads and truncates at the start of a sequence). At the end of these operations, each review is in the form of a 1-D vector of size 500 and the entire IMDB dataset is a matrix of shape $samples\times word\ tokens$.

In [2]:

```
from keras.datasets import imdb
from keras.preprocessing import sequence
max_features = 10000 # number of words to consider as features
maxlen = 500 # cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
```
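To make the effect of padding and truncation concrete, the sketch below mimics the default behavior of *pad_sequences* (padding and truncating at the start of each sequence) in plain NumPy. The helper `pad_or_truncate` is a hypothetical stand-in written for this illustration, not a Keras function, and it assumes `maxlen` is positive.

```python
import numpy as np

def pad_or_truncate(seq, maxlen, value=0):
    """Hypothetical helper mimicking pad_sequences with padding='pre', truncating='pre'."""
    seq = list(seq)[-maxlen:]                      # keep only the last maxlen tokens
    padding = [value] * (maxlen - len(seq))        # zeros go in front of short sequences
    return np.array(padding + seq, dtype="int32")

short_review = [11, 12, 13]              # shorter than maxlen: padded in front
long_review = [1, 2, 3, 4, 5, 6, 7]      # longer than maxlen: first tokens dropped

padded = np.stack([pad_or_truncate(r, maxlen=5) for r in (short_review, long_review)])
print(padded.tolist())   # [[0, 0, 11, 12, 13], [3, 4, 5, 6, 7]]
```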

The RNN model for the IMDB problem is shown below. It is a basic RNN with 500 stages (since each movie review sample consists of 500 words) and a single Logit node for doing the binary classification.

In order to input a single word into the model, Keras does the following: A word is represented using a 1-Hot vector of size 10,000 (corresponding to the 10,000 words in the vocabulary), and it then passes through the Embedding Layer which converts it into its vector representation of size 32 by multiplying it with a matrix of size $10,000\times 32$ (note that this multiplication is not needed in the actual implementation, Keras just picks the $i^{th}$ row of the matrix if the token of the input word is $i$). Note that we are not using a pre-trained Embedding Layer, hence the best embedding for each word is learnt as part of the training process.
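The equivalence between the 1-Hot multiplication and the row lookup can be checked with a small NumPy sketch. The vocabulary size and embedding size are shrunk to 10 and 4 for readability, and the matrix `E` is filled with random numbers as a stand-in for a learnt embedding.

```python
import numpy as np

vocab_size, embed_dim = 10, 4          # small stand-ins for 10,000 and 32
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # stand-in embedding matrix

token = 7
one_hot = np.zeros(vocab_size)
one_hot[token] = 1.0

via_matmul = one_hot @ E    # 1-Hot vector times the embedding matrix
via_lookup = E[token]       # simply pick row number `token`
print(np.allclose(via_matmul, via_lookup))   # True

# A whole review (a sequence of tokens) becomes a (time, features) matrix.
review_tokens = np.array([3, 7, 7, 1])
embedded = E[review_tokens]
print(embedded.shape)       # (4, 4)
```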

Keras inputs a single review into the model by presenting each of its 500 words in turn to successive stages of the RNN.

Keras inputs a batch of reviews into the model by creating a 3-D tensor of shape $(batch\ size, review\ size, feature\ size)$, which is (128, 500, 32) for our example. Hence during the forward pass, 128 reviews are fed in parallel into the model.

In [5]:

```
#rnn40
nb_setup.images_hconcat(["DL_images/rnn40.png"], width=600)
```

Out[5]:

In [5]:

```
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```

In [6]:

```
model.summary()
```
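As a rough sanity check on the parameter counts, the sizes of the weight matrices can be worked out by hand. The arithmetic below assumes the Keras defaults, in which the SimpleRNN and Dense layers each add a bias vector (a term that does not appear in Equations **eqn3** and **eqn4**); the variable names are chosen for this example.

```python
max_features, embed_dim, units = 10000, 32, 32

# Embedding: one row of size embed_dim per word in the vocabulary.
embedding_params = max_features * embed_dim

# SimpleRNN: U (input -> hidden), W (hidden -> hidden) and a bias vector.
rnn_params = units * embed_dim + units * units + units

# Dense logit node: one weight per hidden unit plus a bias.
dense_params = units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 320000 2080 33 322113
```

None of these counts depends on the number of RNN stages, since the same matrices are reused at each of the 500 time steps.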

In [7]:

```
history = model.fit(input_train, y_train,
                    epochs=100,
                    batch_size=128,
                    validation_split=0.2)
```

In [9]:

```
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
```

The Accuracy and Loss curves show that the model starts to overfit after about 8-10 epochs, achieving a maximum accuracy of about 85%. This is close to the accuracy that was achieved with the 1-D ConvNet (see Chapter **ConvNetsPart1**), even though that was done using an embedding of size 128. In order to increase the accuracy we may increase the number of words per review to more than 500. However this also increases the number of RNN stages, and later in this chapter we will show that this can cause problems with the Backprop algorithm.

In [12]:

```
#rnn48
nb_setup.images_hconcat(["DL_images/rnn48.png"], width=600)
```

Out[12]:

Figure **rnn48** shows a RNN with multiple hidden layers, usually referred to as a Deep RNN. This system incorporates ideas from Dense Feed Forward Networks into the RNN, and enables the system to simultaneously create:

- Higher level hierarchical representations of the input data, and
- Representations of the temporal patterns in the data at each level of the hierarchy.

This architecture is used quite commonly in current RNN systems.

The Keras code for a RNN Model with 4 stacked recurrent layers is shown below. Stacked layers require the *return_sequences* flag to be set to *True* in the intermediate layers, as shown. This causes Keras to return the output at every timestep rather than only the last one, since the full sequence is needed as input to the following layer. The model.summary() command shows the output shape from the intermediate layers as a 3D tensor of size $(batch\ size, timesteps, output\ features)$.

In [10]:

```
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32)) # This last layer only returns the last outputs.
model.summary()
```
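The role of *return_sequences* can be illustrated with a two-layer NumPy sketch for a single sample. This is not how Keras implements the layer; the function `simple_rnn`, the tanh activation, the zero initial state and all sizes are assumptions made for the example.

```python
import numpy as np

def simple_rnn(X, U, W, return_sequences=False):
    """Minimal recurrent layer: X has shape (timesteps, features)."""
    Z = np.zeros(W.shape[0])
    states = []
    for x in X:
        Z = np.tanh(W @ Z + U @ x)
        states.append(Z)
    # Either every hidden state (timesteps, units) or only the last one (units,).
    return np.stack(states) if return_sequences else Z

rng = np.random.default_rng(0)
timesteps, features, units = 5, 3, 4   # illustrative sizes
X = rng.normal(size=(timesteps, features))
U1, W1 = rng.normal(size=(units, features)), rng.normal(size=(units, units))
U2, W2 = rng.normal(size=(units, units)), rng.normal(size=(units, units))

# The lower layer must hand the full sequence to the layer above it.
H1 = simple_rnn(X, U1, W1, return_sequences=True)
out = simple_rnn(H1, U2, W2)           # the top layer keeps only its last state

print(H1.shape, out.shape)             # (5, 4) (4,)
```

If the lower layer returned only its last state, the upper layer would receive a single vector instead of a sequence and would have nothing to recur over.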

In [10]:

```
#rnn38
nb_setup.images_hconcat(["DL_images/rnn38.png"], width=600)
```

Out[10]:

Figure **rnn38** shows an important RNN architecture called Bi-Directional RNNs. As the name implies this system incorporates two sets of hidden layers. The first set is the conventional hidden layer, with connections going from left to right, while the second set of Hidden Layers has connections going in the opposite direction, from right to left. Unlike the first layer, the second layer is able to spot patterns in the reversed sequence.
The final output of each layer incorporates information from both the forward and backward hidden layers, usually by concatenating their outputs. Such a design can be used when the input sequence is not being generated in real time. It is most useful when processing text sequences, and has been shown to substantially improve performance in many cases. Note that this architecture cannot be used for real time data sequences, since at test time future values are not yet available.

The code snippet below shows how to program a Bi-Directional RNN in Keras. The Bi-Directional layer is invoked using a recurrent layer instance as its first argument, which processes the input sequence in the forward order. It creates a second layer which processes the input sequence in the reverse order, and then the final states of the two layers are concatenated and fed into the Dense layer.

In [11]:

```
from keras import layers
from keras.models import Sequential
model = Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.Bidirectional(layers.SimpleRNN(32)))
model.add(layers.Dense(1,activation='sigmoid'))
model.summary()
```
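The forward/backward computation described above can be sketched in NumPy for a single sample. As before this is an illustration rather than the Keras implementation: the helper `last_state`, the tanh activation, the zero initial state and the sizes are assumptions, and concatenation is used to merge the two directions.

```python
import numpy as np

def last_state(X, U, W):
    """Run a simple recurrent layer over X (timesteps, features); return the final state."""
    Z = np.zeros(W.shape[0])
    for x in X:
        Z = np.tanh(W @ Z + U @ x)
    return Z

rng = np.random.default_rng(0)
timesteps, features, units = 6, 3, 4   # illustrative sizes
X = rng.normal(size=(timesteps, features))

# Each direction has its own, separately learnt, weight matrices.
U_f, W_f = rng.normal(size=(units, features)), rng.normal(size=(units, units))
U_b, W_b = rng.normal(size=(units, features)), rng.normal(size=(units, units))

forward = last_state(X, U_f, W_f)          # reads the sequence left to right
backward = last_state(X[::-1], U_b, W_b)   # reads the sequence right to left
merged = np.concatenate([forward, backward])

print(merged.shape)   # (8,)
```

With 32 units per direction, as in the Keras model above, the concatenated vector fed into the Dense layer has size 64.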

Recall that the Dropout technique is used to reduce the amount of Overfitting in a DLN model. Initial applications of Dropout to RNNs were not successful, and it was later discovered that this was because a different random Dropout Mask was applied to successive stages of the RNN. It was subsequently shown that if the Dropout Mask is kept fixed across all stages for a single forward/backward pass through the RNN, then the amount of Overfitting is indeed reduced. Keras supports this type of RNN Dropout, and it can be turned on by using two parameters of the recurrent layer (see below): The *dropout* parameter specifies the fraction of units to drop in the transformation of the layer's inputs, while the *recurrent_dropout* parameter specifies the fraction of units to drop in the transformation of the recurrent state.
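The idea of a Dropout Mask that stays fixed across time steps can be sketched in NumPy as follows. This is a simplified illustration and not the Keras implementation: the function `rnn_with_fixed_dropout`, the tanh activation, the inverted-dropout scaling by $1/(1-rate)$ and all sizes are assumptions made for the example.

```python
import numpy as np

def rnn_with_fixed_dropout(X, U, W, rate, rng):
    """Simple RNN pass in which each mask is sampled once and reused at every step."""
    units = W.shape[0]
    # Sample the masks once per sequence, not once per time step.
    in_mask = (rng.random(X.shape[1]) >= rate) / (1.0 - rate)   # applied to the inputs
    rec_mask = (rng.random(units) >= rate) / (1.0 - rate)       # applied to the state
    Z = np.zeros(units)
    for x in X:
        Z = np.tanh(W @ (Z * rec_mask) + U @ (x * in_mask))
    return Z, in_mask, rec_mask

rng = np.random.default_rng(0)
timesteps, features, units = 6, 3, 4   # illustrative sizes
X = rng.normal(size=(timesteps, features))
U = rng.normal(size=(units, features))
W = rng.normal(size=(units, units))

Z, in_mask, rec_mask = rnn_with_fixed_dropout(X, U, W, rate=0.2, rng=rng)
print(Z.shape, in_mask.shape, rec_mask.shape)   # (4,) (3,) (4,)
```

Sampling a new mask inside the loop instead would correspond to the per-stage masks that the paragraph above describes as unsuccessful.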

Earlier in this chapter we saw that the SimpleRNN model when applied to the IMDB Movie Review problem resulted in a good deal of overfitting. In the example below we use Dropout as part of the model. The results show that the Overfitting is indeed pushed out to later in the training process, after 60-70 epochs as opposed to 10 epochs. However the best accuracy level does not improve, and in fact decreases.

In [12]:

```
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32,
                    dropout=0.2,
                    recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```

In [13]:

```
history = model.fit(input_train, y_train,
                    epochs=100,
                    batch_size=128,
                    validation_split=0.2)
```

In [14]:

```
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
```