Natural Language Processing

In [ ]:
%%capture
from ipypublish import nb_setup

Introduction

Natural Language Processing or NLP is an important application area for Neural Networks. Just as images are full of patterns in pixel space which can be detected by Neural Networks, human language also presents a rich panorama of patterns which Neural Networks can exploit. In order to detect these patterns, words and sentences have to be encoded and then fed into the model. We start by describing how this is done in the section on Word Embeddings. It is shown there that words can be represented using 1-D tensors or vectors, the contents of which are analogous to the RGB pixels in an image. These vectors, called Word Embeddings, are arranged in an N-dimensional vector space in such a way that words with similar meanings have embeddings that cluster together. Once we know how to embed words, the next step is to encode entire sentences or even larger units. In order to do this we use the Neural Network models that we developed in the last few chapters, in particular RNNs and LSTMs (including GRUs). The process of finding representations for a sequence of words is called Language Modeling, and we show how this can be accomplished using Self-Supervised Learning. Once we have such a representation, it can be used for a variety of useful tasks, including:

  • Machine Translation
  • Text Categorization
  • Speech Transcription
  • Question-Answering
  • Text Summarization

Since NLP is a commercially important technology, decades of work went into it in the pre-Deep Learning era. Within a few short years, Neural Network models have begun to equal or better these older techniques in tasks such as Speech Transcription and Machine Translation, and steady progress is being made in other areas as well.

In this chapter we will make extensive use of Neural Network models as Generators, i.e., they will be used to generate sentences (in contrast to earlier chapters in which Neural Networks were used mainly for classification). Generative Modeling is an emerging area in Deep Learning, and has also been used to generate images as shown in Chapter ConvNetsPart2 and in the chapter on Unsupervised Learning. However, unlike in Image Processing, Generative Modeling is integral to performing important NLP applications such as Machine Translation and Speech Transcription.

In the next chapter we will introduce a class of Neural Network models called Transformers that were introduced recently (in 2017). They constitute another way of modeling sequences, beyond RNNs and LSTMs, and outperform these older models in most NLP tasks. Transformers can also be pre-trained and then used to do Transfer Learning for NLP tasks, just as ConvNets are used for Transfer Learning for images. This is an exciting emerging area which is still in its research phase.

Word Embeddings

The coding rule used to encode words in a text document is known as Word Embedding and it is used to convert a vocabulary of words into points in a high dimensional co-ordinate space. In previous chapters we used Word Embeddings to encode the text in the IMDB Movie Review dataset, and the reader may recall that there were two ways in which it could be done:

  1. By using pre-trained Word Embeddings, such as Word2Vec or Glove
  2. By using Gradient Descent to find the embeddings as part of the model training process

Method 1 works better if the training data is limited, since the Word Embeddings are already pre-trained on a much larger vocabulary. If there is enough data available then Method 2 should be tried, since it learns embeddings that are more appropriate for the problem at hand. In this section we focus on Method 1 and show how pre-training is done for the Word2Vec model.

The most straightforward embedding rule is 1-of-K Encoding (also known as One-Hot Encoding), which works as follows: If the vocabulary under consideration consists of a certain number of words, say 20,000, then the system encodes them using a 20,000-dimensional vector in which all the co-ordinates are zero, except for the co-ordinate at the word's index, which is set to one (a short sketch of this scheme follows the list below). There are two problems with this scheme:

  • It results in extremely high dimensional input vectors, and the dimension increases with increasing number of words in the vocabulary.
  • The coding scheme does not incorporate any semantic information about the words, since the representations are all located equi-distant from each other at the vertices of a K-dimensional cube.
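
Here is a minimal sketch of the 1-of-K scheme itself, using a made-up four-word vocabulary (the words and sizes are illustrative and not taken from any dataset in this chapter):

In [ ]:
import numpy as np

# 1-of-K (One-Hot) encoding for a toy vocabulary
vocab = ["the", "dog", "chased", "cat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0   # only the co-ordinate at the word's index is set to one
    return v

print(one_hot("dog"))   # [0. 1. 0. 0.]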

In order to remedy these problems, researchers have come up with several alternative algorithms for doing Word Embeddings, one of the most popular of which is Word2Vec. This representation is created by making use of the words that occur frequently in the neighborhood of the target word, and it has been shown to capture many linguistic regularities.

In [ ]:
#rnn50
nb_setup.images_hconcat(["DL_images/rnn50.png"], width=1000)
Out[ ]:

In order to understand how the algorithm works, consider the table in Figure rnn50, which considers the problem of encoding the words "kitten", "cat" and "dog". In order to capture the meaning of these three words, we represent them using a vector of size four as shown. The co-ordinates of this vector encode the following characteristics of the original three words: bite, cute, furry and loud. For example a kitten can be characterized as being cute, while a cat is furry and loud in addition to being cute. Representing words using this vector scheme captures semantic information about them, and it is conceivable that words that are closer in meaning will have vectors that lie closer to each other. One way of capturing this similarity is by using the inner product or cosine between the vectors, and as shown above, by this measure cats and kittens are more similar than dogs and kittens.
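
The cosine computation referred to above can be sketched in a few lines of NumPy; the vectors below are hypothetical hand-assigned values over the features (bite, cute, furry, loud), in the spirit of Figure rnn50 rather than copied from it:

In [ ]:
import numpy as np

# Hypothetical feature vectors over (bite, cute, furry, loud)
kitten = np.array([0.0, 1.0, 0.0, 0.0])
cat    = np.array([0.0, 1.0, 1.0, 1.0])
dog    = np.array([1.0, 0.0, 1.0, 1.0])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# kitten is closer to cat than to dog by the cosine measure
print(cosine(kitten, cat), cosine(kitten, dog))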

Word2Vec encodes words into vectors by automating the procedure described above. Instead of having to manually specify the characteristics of a word, Word2Vec automatically infers this information from the other words that occur frequently in the same sentence as the target word (for example, sentences that contain the words "kitten" and "cute" occur more frequently than sentences with the words "kitten" and "bite").

In [ ]:
#CBOW
nb_setup.images_hconcat(["DL_images/rnn28.png"], width=1000)
Out[ ]:

Word2Vec uses two algorithms to generate Word Embeddings, namely Continuous Bag of Words (CBOW) and Skip-Gram. Both these algorithms use shallow Neural Networks that are trained in a self-supervised manner, so they are extremely fast. Figure CBOW shows the CBOW network, and it works as follows: The Input Layer consists of 1 of K encoded representations for all the words in a sentence, except for the target word. In the example shown, for the sentence "the dog chased the cat", the target word is set to "dog" and the words "chased" and "cat" are the inputs into the network. The main idea of the algorithm is to train the network with backprop while using the target word as the label, and this is done in two linear stages:

  • In the first stage, each of the 1-Hot-Encoded word vectors (which are of size V, where V is the number of words in the vocabulary) is transformed into a smaller vector of size N, where N is the dimension of the embedding, by using a matrix multiplication (the same matrix parameters are used for all the inputs).
  • In the second stage, the results of the first stage are added together to form a vector of size N, and then passed through another matrix multiplication into V output nodes, the contents of which are sent through a softmax layer to generate the classification probabilities.

This network is trained using all the sentences in the corpus, with each word in a sentence taking turns as the target word. Note that this is an example of Self-Supervised Learning, since the labels are generated automatically using the input dataset itself. Once the network is fully trained, the first stage of the network serves as the transform to generate an embedding.
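
The following is a minimal Keras sketch of the CBOW network just described; it is not the original Word2Vec implementation, and the sizes V, N and C (number of context words per sample) are assumptions. The one-hot-vector-times-matrix operation of the first stage is implemented here with an Embedding layer, which is mathematically equivalent.

In [ ]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

V, N, C = 20000, 100, 4    # vocabulary size, embedding dimension, context words per sample

context = keras.Input(shape=(C,), dtype="int32")                       # indices of the context words
embedded = layers.Embedding(V, N)(context)                             # first stage: shared V x N matrix
summed = layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(embedded)   # add the context embeddings together
outputs = layers.Dense(V, activation="softmax")(summed)                # second stage: V output nodes + softmax
cbow = keras.Model(context, outputs)
cbow.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
# After training, cbow.layers[1].get_weights()[0] is the V x N embedding matrix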

In [ ]:
#Skip-Gram
nb_setup.images_hconcat(["DL_images/rnn49.png"], width=1000)
Out[ ]:

The operation of the Skip-Gram network, shown in Figure Skip-Gram, is in some ways the inverse of the CBOW network. As we saw earlier, CBOW tries to predict a target word by making use of the context provided by the surrounding words. On the other hand, given a sentence, Skip-Gram tries to predict the surrounding words with the target word now serving as an input into the network. This is illustrated in the figure using the sentence "The dog chased the cat": The word "dog" is input into the Skip-Gram network while the words "cat" and "chased" serve as target labels for training the network. Note that this is an example of multi-label classification since the network is being used to do multiple classifications based on a single input. The network itself is a simple Dense Feedforward network with a single Hidden Layer and two transform stages, just as for the CBOW network. The transform carried out in the first stage serves as the Embedding Matrix, while the second stage transform is used to generate the Logit Values for classification.
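
A corresponding Skip-Gram sketch is shown below, again as an illustration rather than the original Word2Vec code. Each training sample here is a single (target word, context word) pair, so the multi-label aspect is handled by presenting the context words one at a time as labels; the sizes are assumptions.

In [ ]:
from tensorflow import keras
from tensorflow.keras import layers

V, N = 20000, 100    # vocabulary size, embedding dimension

target = keras.Input(shape=(1,), dtype="int32")             # index of the target word
x = layers.Embedding(V, N)(target)                          # first stage: the Embedding Matrix
x = layers.Flatten()(x)                                     # (batch, 1, N) -> (batch, N)
context_probs = layers.Dense(V, activation="softmax")(x)    # second stage: Logit Values + softmax
skipgram = keras.Model(target, context_probs)
skipgram.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")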

In [ ]:
#rnn51
nb_setup.images_hconcat(["DL_images/rnn51.png"], width=600)
Out[ ]:

In general words that are semantically close to each other tend to cluster together in the embedded space, as shown in Figure rnn51. As an example, consider the following vector operation which makes use of Word Embeddings (vec('x') stands for the Word Embedding representation for the word 'x'):

$$ vec('Paris') - vec('France') = vec('Rome') - vec('Italy') $$

Hence Word Embeddings incorporate the meaning of words into their representations, which gives better results when these vectors are used in various operations involving words and language.
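
A sketch of this analogy computation is shown below, assuming embeddings is a dictionary mapping words to their (pre-trained) vectors; the dictionary itself is not built here.

In [ ]:
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, embeddings):
    # Return the word x that best completes  vec(a) - vec(b) = vec(x) - vec(c)
    query = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], query))

# e.g. analogy('Paris', 'France', 'Italy', embeddings) would be expected to return 'Rome'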

It should also be noted that instead of encoding words, we could encode at the character level instead. The benefit of doing this is that the number of characters in a dataset is limited to a few hundred at most, while the number of words can easily run into the tens of thousands. The tradeoff is that there is less semantic structure at the character level, so model performance may not be optimal. Hence 1-of-K encoding is almost always the default choice for character level embedding.

Text Classification

In [ ]:
#rnn73
nb_setup.images_hconcat(["DL_images/rnn73.png"], width=800)
Out[ ]:

Text Classification is an important application of NLP. Here is a list of ways in which it can be used:

  • Detecting spam emails
  • Sentiment Analysis: For example classifying investor sentiment (Buy or Sell) based on classification of news articles or tweets
  • Classifying the topic of an article based on its text
  • Classifying the language for an article
  • Classifying the author of an article

The chapter on RNNs contains several examples of techniques for doing text classification, some of which are shown in Figure rnn73. Part (a) of this figure shows the most straightforward way of doing text classification, while (b) and (c) show more complex networks which can improve the accuracy, either by incorporating multiple layers as in (b) or a bi-directional architecture as in (c). The latter two network types can also be combined easily in Keras into a multi-layer bi-directional RNN.

In [ ]:
#rnn74
nb_setup.images_hconcat(["DL_images/rnn74.png"], width=600)
Out[ ]:

Figure rnn74 shows an example of multi-label text classification using a multi-headed RNN. Note that we are trying to classify into just two categories for each class, i.e., present or not-present, hence we use the Binary Cross Entropy Loss Function for each. These are then aggregated across all the categories to obtain the final Loss Function, as shown in the figure.
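
A minimal Keras sketch of this multi-label setup (one sigmoid output per category, with Binary Cross Entropy applied to each and aggregated by Keras into a single scalar loss) is shown below; the vocabulary size, embedding dimension and number of categories are illustrative assumptions rather than values from Figure rnn74.

In [ ]:
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, num_categories = 20000, 32, 5

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = layers.SimpleRNN(32)(x)
# one present / not-present probability per category
outputs = layers.Dense(num_categories, activation="sigmoid")(x)
multilabel_model = keras.Model(inputs, outputs)
multilabel_model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])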

The following example, taken from Chollet Section 7.1.3, shows an application of multi-label classification applied to tweet inputs, in order to predict the following about the tweeter:

  1. Age: This is predicted using Mean Square Error based regression, hence the MSE Loss Function
  2. Income: This is classified into a number of discrete categories, hence uses the Categorical Cross Entropy Loss Function
  3. Gender: Since there are only 2 categories, we use the Binary Cross Entropy Loss Function
In [ ]:
#rnn75
nb_setup.images_hconcat(["DL_images/rnn75.png"], width=800)
Out[ ]:

In order to code the multi-headed model, we use the Keras Functional API. Each of the three prediction types gets its own Logit Layer. Note that since the Age prediction is being done using Regression, only a single node is needed for its Logit.

In [ ]:
from keras import layers
from keras import Input
from keras.models import Model

vocabulary_size = 50000
num_income_groups = 10
posts_input = Input(shape=(None,), dtype='int32', name='posts')
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)

x = layers.SimpleRNN(32)(embedded_posts)
x = layers.Dense(128, activation='relu')(x)

age_prediction = layers.Dense(1, name='age')(x)
income_prediction = layers.Dense(num_income_groups, activation='softmax', name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)

model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])

Even though each output type gets its own loss function, all three of them need to be combined together into a single scalar value in order to do Gradient Descent optimization. This can be done during model compilation as shown below. Since the three Loss Functions can differ in value by quite a bit, it is recommended that they be appropriately weighted, which can be done using the loss_weights field. The MSE Loss Function assumes values around 3-5 while the Binary Cross Entropy Loss Function assumes values which can be less than 1, which justifies the weights shown in the example.

In [ ]:
model.compile(optimizer='rmsprop',
            loss={'age': 'mse',
                  'income': 'categorical_crossentropy',
                  'gender': 'binary_crossentropy'},
            loss_weights={'age': 0.25,
                          'income': 1.,
                          'gender': 10.})

Assuming that the model input posts and the outputs age_targets, income_targets and gender_targets are already in the form of Numpy arrays, they can be passed to the fit command as shown.

In [ ]:
model.fit(posts, {'age': age_targets,
                  'income': income_targets,
                  'gender': gender_targets},
          epochs=10, batch_size=64)

In the following example we use an LSTM model to classify the movie reviews in the IMDB dataset. We start by invoking the text_dataset_from_directory function to create the training, validation and test samples.

In [ ]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = '/Users/subirvarma/handson-ml/datasets/aclImdb'

train_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)     
Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.

The TextVectorization layer is invoked to convert the text of each sample review into a sequence of integers. Each review is restricted to 600 tokens or less, and the vocabulary is restricted to the 20,000 most frequently occurring words.

In [ ]:
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

The main model is a single layer LSTM with 32 nodes per cell.

In [ ]:
vocab_size = 20000
embed_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)

x = layers.LSTM(32)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_6 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_6 (Embedding)      (None, None, 32)          640000    
_________________________________________________________________
lstm_5 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 33        
=================================================================
Total params: 648,353
Trainable params: 648,353
Non-trainable params: 0
_________________________________________________________________
In [ ]:
history = model.fit(int_train_ds, validation_data=int_val_ds, epochs=15)
Epoch 1/15
625/625 [==============================] - 143s 224ms/step - loss: 0.6935 - accuracy: 0.4949 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 2/15
625/625 [==============================] - 133s 212ms/step - loss: 0.6922 - accuracy: 0.5025 - val_loss: 0.6875 - val_accuracy: 0.5124
Epoch 3/15
625/625 [==============================] - 134s 215ms/step - loss: 0.6851 - accuracy: 0.5102 - val_loss: 0.6977 - val_accuracy: 0.5084
Epoch 4/15
625/625 [==============================] - 134s 215ms/step - loss: 0.6817 - accuracy: 0.5163 - val_loss: 0.6873 - val_accuracy: 0.5086
Epoch 5/15
625/625 [==============================] - 133s 213ms/step - loss: 0.6706 - accuracy: 0.5787 - val_loss: 0.6773 - val_accuracy: 0.6232
Epoch 6/15
625/625 [==============================] - 133s 213ms/step - loss: 0.6098 - accuracy: 0.6816 - val_loss: 0.6522 - val_accuracy: 0.6336
Epoch 7/15
625/625 [==============================] - 133s 212ms/step - loss: 0.5918 - accuracy: 0.7066 - val_loss: 0.5835 - val_accuracy: 0.7388
Epoch 8/15
625/625 [==============================] - 133s 213ms/step - loss: 0.5767 - accuracy: 0.7246 - val_loss: 0.6021 - val_accuracy: 0.7166
Epoch 9/15
625/625 [==============================] - 134s 215ms/step - loss: 0.5742 - accuracy: 0.7232 - val_loss: 0.5960 - val_accuracy: 0.7408
Epoch 10/15
625/625 [==============================] - 138s 221ms/step - loss: 0.5820 - accuracy: 0.7329 - val_loss: 0.5793 - val_accuracy: 0.7436
Epoch 11/15
625/625 [==============================] - 138s 220ms/step - loss: 0.6003 - accuracy: 0.6848 - val_loss: 0.7171 - val_accuracy: 0.5056
Epoch 12/15
625/625 [==============================] - 137s 220ms/step - loss: 0.5919 - accuracy: 0.6920 - val_loss: 0.5772 - val_accuracy: 0.7496
Epoch 13/15
625/625 [==============================] - 138s 221ms/step - loss: 0.5871 - accuracy: 0.7046 - val_loss: 0.6061 - val_accuracy: 0.7044
Epoch 14/15
625/625 [==============================] - 138s 220ms/step - loss: 0.5708 - accuracy: 0.7199 - val_loss: 0.6384 - val_accuracy: 0.6780
Epoch 15/15
625/625 [==============================] - 138s 220ms/step - loss: 0.5928 - accuracy: 0.7007 - val_loss: 0.5868 - val_accuracy: 0.7408

Language Models

Language Modeling is a fundamental task in NLP. It enables the NLP model to learn the underlying statistical structure of a language, which in turn enables it to generate valid sentences that are grammatically correct and likely to have been spoken in that language. Language generation based on Language Models is a very powerful tool that has been put to use in a number of commercially important applications such as Machine Translation and Speech Transcription.

There are two equivalent definitions of a Language Model that we find in the literature:

  1. Given a sequence of words $(w_1,...,w_N)$, a Language Model can be used to generate the most probable next word $w_{N+1}$ in the sequence.
  2. Given a sequence of words $(w_1,...,w_N)$, a Language Model can be used to compute the probability $p(w_1,...,w_N)$ of that sequence occurring in the language (see Figure rnn52).
In [ ]:
#rnn52
nb_setup.images_hconcat(["DL_images/rnn52.png"], width=600)
Out[ ]:

Figure rnn53 shows an application of the first definition of Language Models that we see on an everyday basis, i.e., search engines with the ability to automatically generate the next few words in the search term before they are typed. Another common application is the Auto Reply feature for emails.

In [ ]:
#rnn53
nb_setup.images_hconcat(["DL_images/rnn53.png"], width=600)
Out[ ]:

The second definition of a Language Model is fundamental to the task of language generation. As we will see shortly, given a context (such as an image or a sentence in another language), the Language Model can be used to compute the probability $p(w_1,...,w_N)$ of candidate word sequences that follow from that context. This allows the system to choose the sequence of words $(w_1,...,w_N)$ with the highest joint probability, thus resulting in a caption if the context was an image, or a translation if the input was a sentence in another language.

In [ ]:
#rnn76
nb_setup.images_hconcat(["DL_images/rnn76.png"], width=600)
Out[ ]:

A Language Model of Type 1 using an RNN architecture is shown in Figure rnn76. As shown in the figure, this model is trained by using N consecutive words from the training corpus as input, with the $(N+1)^{st}$ word serving as the target. This is an example of Self-Supervised Learning, since the training labels are generated automatically from the input data. Once the model is fully trained, it can be used to generate new text after being initialized with the first $N$ words. An example Keras program for this type of model is included at the end of this section.

In order to come up with a model for Language Models of Type 2, consider the following:

Recall that the objective of Language Model 2 is to compute the probability $p(X_1,...,X_N)$ of a sequence of words $(X_1,...,X_N)$ from some language. Consider the RNN shown below in Figure Training: Its input is the sequence of words $(X_1,...,X_N)$ whose joint probability we are trying to compute. Unlike the RNN in Figure rnn76, this RNN outputs a prediction for the next word after each RNN stage. As shown in the figure, we train this model by using the word $X_2$ as the label at the first stage, $X_3$ as the label for the second stage, all the way to $X_{N+1} = EoS$ (the end of sentence token) as the label for the $N^{th}$ stage. In other words, at stage $n$ with input $X_n$, the model tries to predict the next word $X_{n+1}$ in the sequence.

As shown in Figure rnn54, once this model has been trained, it will output the prediction $p(Y_1|X_1)$ at the first stage, $p(Y_2|X_1,X_2)$ at the second stage, all the way to $p(Y_N|X_1,X_2,...,X_N)$ at the $N^{th}$ stage. Note that each of these is a probability distribution over the entire dictionary of words that is being used for the model. In order to compute the probability $p(X_1,X_2,...,X_N)$ for an input sequence, we use the Law of Conditional Probabilities to decompose it into a product of conditional probabilities:

$$ p(X_1,...,X_N) = p(X_1)p(X_2|X_1)p(X_3|X_1,X_2)...p(X_{N+1}|X_1,X_2,...,X_N) $$

This implies that the probability of a sentence can be computed by finding the conditional probabilities $p(X_i|X_1,...,X_{i-1}), i=2,...,N+1$. These probabilities can be computed using the trained RNN model described earlier as shown in Figure Inference. In particular

$$
\begin{aligned}
p(X_2|X_1) &= p(Y_1=X_2|X_1) \\
p(X_3|X_1,X_2) &= p(Y_2=X_3|X_1,X_2) \\
&\;\;\vdots \\
p(X_{N+1}|X_1,...,X_N) &= p(Y_N=X_{N+1}|X_1,X_2,...,X_N)
\end{aligned}
$$

Hence the probability $p(X_1,X_2,...,X_N)$ for any sequence of words can be readily computed by using this technique.
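
The sketch below shows how this product would be computed (in log form, for numerical stability) from the per-stage softmax outputs of such a trained model; stepwise_probs and token_ids are assumed inputs and are not produced by any model defined in this chapter.

In [ ]:
import numpy as np

def sequence_log_prob(stepwise_probs, token_ids):
    # stepwise_probs: array of shape (N, V); row i holds the conditional distribution
    #                 p(Y_{i+1} | X_1,...,X_{i+1}) over the vocabulary of size V
    # token_ids: the integer indices of the actual next words X_2,...,X_{N+1}
    # The sum of the log conditional probabilities is the log of the product above.
    return float(np.sum([np.log(stepwise_probs[i, t]) for i, t in enumerate(token_ids)]))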

In [ ]:
#Training
nb_setup.images_hconcat(["DL_images/rnn55.png"], width=600)
Out[ ]:
In [ ]:
#rnn54
nb_setup.images_hconcat(["DL_images/rnn54.png"], width=500)
Out[ ]:
In [ ]:
#Inference
nb_setup.images_hconcat(["DL_images/rnn56.png"], width=600)
Out[ ]:

Not only can this procedure be used to compute the probability of the occurrence of a sentence, it can also be used to actually generate a sentence. This is done by reducing the problem of generating a sentence consisting of $N$ words to solving $N$ successive classification problems as shown in Figure rnn8. To generate a sequence, we seed it using the start of sequence token $X_1$, and then:

  • The second word $X_2$ is chosen by sampling from the distribution $P(Y_1|X_1)$
  • The third word $X_3$ is chosen by sampling from the distribution $P(Y_2|X_1,X_2)$
  • and so on, until we generate the End of Sentence token

This type of recursive model, in which the input to each stage is the output from the prior stage, is called an auto-regressive model. This ability of the Language Model to serve as a Generative Model is fundamental to its use in tasks such as Machine Translation.

In [ ]:
#rnn8
nb_setup.images_hconcat(["DL_images/rnn8.png"], width=600)
Out[ ]:

Next Word Generation

In the language generation example in the previous section, we chose the next word by sampling from the conditional distribution $P(Y_i|X_1,X_2,...,X_i)$, which is an example of a Next Word Generation algorithm. However, sampling is just one way of doing Next Word Generation, and in this section we survey other ways in which this can be done, with better performance.

In [ ]:
#rnn57
nb_setup.images_hconcat(["DL_images/rnn57.png"], width=1000)
Out[ ]:

Four different Next Word Generation techniques are illustrated in Figure rnn57; we describe these next (the examples are taken from the blog post https://huggingface.co/blog/how-to-generate):

  • Greedy Search: The Greedy Search algorithm chooses the word with the highest probability as the next word. An example of this is shown in Part (a) of the figure. The word 'The' is used to seed the sentence, and this is followed by the word 'nice' since it has the highest probability, and then 'woman', resulting in the sentence 'The nice woman'. Another example of a sentence generated by Greedy Search is shown at the bottom of the figure (with the phrase 'I enjoy walking with my cute dog' used as the seed), and the reader will note that it suffers from the fact that it starts to repeat itself. Part of the reason why it does not work well is that it is not able to choose words that are deeper in the tree but have a high probability, for example the word 'has' at level 2 of the tree.

  • Beam Search: Beam Search tries to remedy Greedy Search by generating $B$ sentences simultaneously (where $B$ is a model parameter). In Stage 1, instead of choosing a single word as output, we choose the $B$ words whose output probabilities $(p_1^{1},...,p_1^{B})$ are the largest. Each one of these $B$ words is then fed as the input at the next step, resulting in a total of $B^2$ candidate sequences at the end of the second stage. This list is then pruned back to size $B$ by choosing the $B$ sequences for which the product $p_1^{i}p_2^{j}$ is largest. These are then fed as the input into the third step, and the process continues. Part (b) of the figure shows an example of this process for $B=2$:

    • At the first stage the sentences 'The nice' and 'The dog' are generated since these have the highest probabilities of 0.5 and 0.4 respectively.
    • At the second stage the sentences 'The nice woman' and 'The dog has' are generated, with joint probabilities of 0.5×0.4 = 0.2 and 0.4×0.9 = 0.36 respectively. Note that this algorithm has overcome the problem in Greedy Search by being able to make use of the high probability word 'has' at the second stage, thus resulting in a sentence with a higher joint probability. In general Beam Search works better than Greedy Search; however, in practice it still suffers from the repeating-words problem, as shown in the example at the bottom of the figure.
  • Sampling: As the name implies, at each stage of the Language Model we sample from the output distribution to generate the next word. In the example shown in Part (c) of the figure, the word 'car' is chosen at the first stage even though it has the lowest probability, since the choice is now entirely random. In practice this results in sentences that are a bit weird, as shown in the example at the bottom of the figure.

  • Sampling with Softmax Temperature: The output of Sampling can be improved by modulating the probability distribution using a new parameter called Temperature. It changes the output probability distribution at each stage of the RNN to the following:

$$ P(b_i) = \frac{\exp\left({a_i \over T}\right)}{\sum_j \exp\left({a_j \over T}\right)} $$

where the $a_i$ are the model's outputs (logits) and the $b_i$ form the modified probability distribution at Temperature $T$. Lowering the Temperature causes the peaks in the distribution to become more pronounced, while increasing it moves the distribution towards equal values. In our running example we choose $T=0.7$, which results in the distribution shown in Part (d) of the figure, with bigger peaks. Sampling from this distribution pretty much eliminates the word 'car' in the first stage and makes the output more coherent, as shown in the examples at the bottom of the figure. Note that the limit $T \to 0$ corresponds to using the Greedy Search method. A minimal sketch comparing Greedy Search with Temperature-based sampling is shown below.
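
The sketch below illustrates these two methods on a single decoding step; the word list and probabilities are made-up values, and the character-level example later in this chapter applies the same temperature trick in its sample function.

In [ ]:
import numpy as np

words = ["nice", "dog", "car", "woman"]      # illustrative candidate next words
probs = np.array([0.5, 0.4, 0.05, 0.05])     # illustrative model output distribution

# Greedy Search: always pick the most probable word
greedy_choice = words[int(np.argmax(probs))]

# Sampling with Softmax Temperature: re-scale the distribution, then sample from it
def temperature_sample(probs, T=0.7):
    logits = np.log(probs) / T
    scaled = np.exp(logits) / np.sum(np.exp(logits))
    return int(np.random.choice(len(probs), p=scaled))

sampled_choice = words[temperature_sample(probs, T=0.7)]
print(greedy_choice, sampled_choice)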

In [ ]:
#rnn58
nb_setup.images_hconcat(["DL_images/rnn58.png"], width=1000)
Out[ ]:

We end this section with the description of two other techniques that have been discovered in the last 2 years, which often work better:

  • Top-K Sampling: This is another technique in which the output probability distribution is modulated, followed by sampling. In this case the modulation is done by choosing the K words with the highest probability values, and then re-distributing the probabilities among these K words so that they normalize to one. The example in Part (e) of Figure rnn58 shows the application of this algorithm for $K=6$. Note that this technique eliminates words that have a low probability value, but which could still get selected if sampling were done over the entire vocabulary. Top-K sampling results in output that sounds more human-like compared to the other techniques, as shown in the example at the bottom of the figure. Top-K sampling has the following shortcoming: If the probability distribution is relatively flat, as on the LHS of the figure, it tends to eliminate many words that are promising candidates. On the other hand, if the probability distribution is peaked, as on the RHS of the figure, it tends to include too many low probability words.

  • Top-p Sampling: This technique corrects for the shortcoming of the Top-K sampling by restricting the sampling to the subset of words whose aggregate probability is $p$ or less. This set of words expands if the distribution is flat and contracts if the distribution is peaked, thus correcting the problem. The sentences synthesized by using this technique are further improved as the example shows.

It is also possible to combine the Top-p and Top-K sampling methods together.
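
The sketch below illustrates Top-K and Top-p filtering applied to an assumed next-word distribution; the probabilities are made-up, and the Top-p variant shown keeps the smallest set of top words whose cumulative probability reaches p (one common formulation).

In [ ]:
import numpy as np

probs = np.array([0.30, 0.25, 0.20, 0.10, 0.08, 0.04, 0.02, 0.01])   # illustrative distribution

def top_k_filter(probs, k=3):
    keep = np.argsort(probs)[::-1][:k]      # indices of the K most probable words
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()        # re-normalize so the kept probabilities sum to one

def top_p_filter(probs, p=0.75):
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[:np.searchsorted(cumulative, p) + 1]   # smallest set of top words reaching mass p
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# sample the next word index from the filtered distributions
next_word_topk = int(np.random.choice(len(probs), p=top_k_filter(probs)))
next_word_topp = int(np.random.choice(len(probs), p=top_p_filter(probs)))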

Example of a Character based Language Model

The following example, taken from Chollet Section 8.1, is for a character based Language Model of Type 1.

We start by downloading the training dataset, which is a selection of the works of Nietzsche, and converting it to lowercase. It consists of 600,893 characters of text.

In [ ]:
import keras
import tensorflow
import numpy as np

path = tensorflow.keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))
Corpus length: 600893

The code below does the following:

  • The entire text is broken up into 60-character segments and stored in the list sentences, which results in a total of 200,278 segments. Furthermore, for each segment, the $61^{st}$ character serves as the training target and is stored in the list next_chars.

  • The unique characters in the text are extracted and stored in the sorted list chars, and a dictionary is created that maps these characters to their integer index in chars.

  • We create and then populate the input tensor X and the target tensor y. The input tensor X has a 3-dimensional structure, consisting of len(sentences) (= 200,278) 2-D samples each of which is a matrix of size (maxlen, len(chars)) (= (60,57)). This matrix has a binary structure, with each row representing a single character and consisting of zeroes, except for a one at the integer index for that character in the dictionary mapping (see Figure rnn78).

In [ ]:
#rnn78
nb_setup.images_hconcat(["DL_images/rnn78.png"], width=600)
Out[ ]:
In [ ]:
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
Number of sequences: 200278
Unique characters: 57
Vectorization...
In [ ]:
#rnn77
nb_setup.images_hconcat(["DL_images/rnn77.png"], width=800)
Out[ ]:

The model for the system is shown in Figure rnn77 and coded below. It is a straightforward LSTM model with a single layer consisting of 128 nodes.

In [ ]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

optimizer = tensorflow.keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

The code block below implements the Sampling based Next Character generation algorithm, modulated by the parameter temperature.

In [ ]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
In [ ]:
import random
import sys

for epoch in range(1, 60):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

The output shows that $T=0.5$ seems to work better for this model.

Conditional Language Models

The power of the Language Model formulation is realized when it is used to generate Conditional Language, which is defined as sentences that are generated in response to a context, such as an image, a video clip or a sentence in another language, as shown in Figure rnn79. This results in an image caption, a video description and a translation respectively.

In [ ]:
#rnn79
nb_setup.images_hconcat(["DL_images/rnn79.png"], width=600)
Out[ ]:

Some other examples of Conditional Language Models are shown in the table below.

In [ ]:
#rnn80
nb_setup.images_hconcat(["DL_images/rnn80.png"], width=600)
Out[ ]:

Conditional Language Models are created using a class of systems called Encoder-Decoder Systems. Figure EncDec shows the high level architecture of these systems, which consist of two parts:

  • Part 1 consists of a Neural Network model whose job is to process the Input Context and create a high level representation for it. This can be done using any one of the systems that we have discussed in this book, namely Dense Feed Forward networks, ConvNets or RNNs.
  • Part 2 consists of a Language Model which synthesizes a sentence in the target language, by using the representation from Part 1.
In [ ]:
#EncDec
nb_setup.images_hconcat(["DL_images/rnn81.png"], width=600)
Out[ ]:

Figure rnn66 shows some details for two examples of Encoder-Decoder Systems:

  • Part (a) of the figure shows the case when the system is used for doing Machine Translation. The input $(T_1,T_2,T_3)$ is a sentence in Language 1, which is transformed into its vector representation using an RNN (shown as the final state $Z_3$). This representation is then fed into the Language Model, which outputs a translation $(X_1,X_2,X_3,X_4)$ in Language 2.

  • Part (b) of the figure shows the case when the system is used to generate captions for images. The image is first converted into a high level representation using a ConvNet, followed by the Language Model that generates the caption.

There are a number of commercially important applications for this system, including:

  • Machine Translation: The input corresponds to a sentence from Language A and the output is its translation into Language B.
  • Email Auto-Reply: The input corresponds to the contents of an email and the output is the appropriate reply to that email.
  • Document Summarization: The input corresponds to the words in a document and the output is a shorter summary of its content.
  • Question Answering Systems: These usually have two sets of inputs, for the document and Question respectively, while the output is the Answer.

One of the benefits of the Encoder-Decoder architecture is that the input and the output need not be of the same length. Before the advent of RNNs, this was a significant restriction for these types of systems. Indeed, as the Caption Generation architecture will show, the two parts of the system need not even process the same type of media! Later in this chapter we will show an example in which the input is a speech waveform and the output is its transcription.

In [ ]:
#rnn66
nb_setup.images_hconcat(["DL_images/rnn66.png"], width=800)
Out[ ]:

In the following two sections we get into the details of the Neural Machine Translation and Image Captioning systems.

Neural Machine Translation

In [ ]:
#rnn82
nb_setup.images_hconcat(["DL_images/rnn82.png"], width=800)
Out[ ]:

Machine Translation is the process of taking a sentence from Language A as input, and generating its translation in Language B as the output. The traditional way of doing this was a Bayesian ML algorithm called Statistical Machine Translation (SMT). However in the last few years, Neural Machine Translation (NMT) systems have surpassed SMT in their accuracy, and as a result popular websites such as Google have replaced SMT with NMT in their production systems.

We will use the Encoder-Decoder architecture shown in Figure rnn82 to do Machine Translation. It is designed to map a variable length input word sequence $T_i, i=1,...,L_{in}$ to a variable length output word sequence $X_i, i=1,...,L_{out}$.

The Training Phase for this model is shown in Part (a) of the figure. During this phase, the model accepts two input sequences, the sentence $T_1,T_2,T_3$ and its corresponding translation $X_1,X_2,X_3$. The latter sequence also serves as the target for training the model, after it has been shifted by one word (so that the target for a word $X_i$ is the next word in the sequence, $X_{i+1}$).

During the Inference Phase, shown in Part (b) of the figure, the trained model is used to generate translations for inputs $(T_1,T_2,T_3)$, which are fed into the Encoder part of the model. The Decoder part of the model uses the final representation of the input to generate the translated sentence one word at a time, and at each stage the word generated in stage n serves as the input for stage n+1, which is known as Auto-Regression.

The following example, taken from Chollet Chapter 11, uses the Encoder-Decoder architecture with a GRU based model to do English to Spanish translation. We start by downloading the dataset, creating a list each element of which is an English sentence followed by its Spanish translation, appending "[start]" and "[end]" tokens to the beginning and end of each Spanish sentence, and then storing the English-Spanish pairs in a list called text_pairs:

In [ ]:
text_file = "/Users/subirvarma/handson-ml/datasets/spa-eng/spa.txt"
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    english, spanish = line.split("\t")
    spanish = "[start] " + spanish + " [end]"
    text_pairs.append((english, spanish))

Here is what a randomly selected sample from the text_pairs list looks like:

In [ ]:
import random
print(random.choice(text_pairs))
("My sister's getting married.", '[start] Mi hermana se va a casar. [end]')

The elements of the text_pairs list are randomly shuffled, and then split into training, validation and test datasets.

In [ ]:
import random
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

In this code block, the vectorization functions for the English and Spanish texts are defined, using a maximum sequence length of 20 words per sentence and a vocabulary consisting of the 20,000 most frequently used words. Before doing this, we remove the special characters from the text and convert all text to lowercase. The adapt command creates a mapping between the words in the vocabulary and their corresponding integer codes.

In [ ]:
import tensorflow as tf
from tensorflow.keras import layers
import string
import re

strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(strip_chars)}]", "")

vocab_size = 20000
sequence_length = 20

source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    #standardize=custom_standardization,
)
train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
source_vectorization.adapt(train_english_texts)
print(random.choice(train_english_texts))
print(random.choice(train_spanish_texts))
target_vectorization.adapt(train_spanish_texts)
Who told you to give that to me?
[start] Es un trabajo muy ingrato. [end]

The main task of the following code is to create the training and validation datasets for the model. Note that the input into the model consists of the encoded English sentence (to be fed into the Encoder) together with the encoded Spanish sentence (to be fed into the Decoder), while the target consists of the Spanish sentence only (at the output of the Decoder). Note, however, that in order to create the target, the Spanish sentence has to be shifted by one word, so that each Spanish word in the Decoder input is mapped to the next Spanish word in that sentence. This is accomplished in the format_dataset function. The tf.data.Dataset.from_tensor_slices function creates a list of pairs of English and Spanish sentences, which are then vectorized and formatted into the correct input and target tensors by means of the format_dataset function.

In [ ]:
batch_size = 64

def format_dataset(eng, spa):
    eng = source_vectorization(eng)
    spa = target_vectorization(spa)
    return ({
        "english": eng,
        "spanish": spa[:, :-1],
    }, spa[:, 1:])

def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.shuffle(2048).prefetch(16).cache()

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
In [ ]:
for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets.shape: {targets.shape}")
inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)

GRU-based encoder

As shown below, the Encoder module uses a bi-directional GRU with a hidden state of size 1024 and 20 (= sequence_length) stages (the latter quantity is not specified in the code below; Keras infers it from the shape of the input tensors).

The output of the Encoder module consists of the final hidden states of the forward and backward GRUs, which are added together to create the final output.

In [ ]:
from tensorflow import keras
from tensorflow.keras import layers

embed_dim = 256
latent_dim = 1024

source = keras.Input(shape=(None,), dtype="int64", name="english")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = layers.Bidirectional(
    layers.GRU(latent_dim), merge_mode="sum")(x)

GRU-based decoder and the end-to-end model

As in the Encoder, the Decoder uses a GRU with a hidden state of size 1024. The GRU cell in the Decoder's first stage is initialized using the encoded_source tensor from the last stage of the Encoder. Since the return_sequences flag is set to True, the Hidden State vectors from each of its stages are sent to a Dense Feed Forward Layer with vocab_size nodes in order to predict the next word in the sequence (which is then compared with the target word during Training, in order to generate the error signal).

The Encoder and Decoder modules are combined together to create the full Encoder-Decoder model called seq2seq_rnn, with two input tensors (source, past_target) and a single output tensor target_next_step.

In [ ]:
past_target = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
x = decoder_gru(x, initial_state=encoded_source)
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
seq2seq_rnn = keras.Model([source, past_target], target_next_step)

The model summary shows that the model has a total of 42,554,912 parameters, of which 20.5 million are in the Dense layer at the output of the Decoder and just over 10 million are in the Embedding layers of the Encoder and Decoder. The two GRUs themselves account for about 12 million parameters.

In [ ]:
seq2seq_rnn.summary()
Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
english (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
spanish (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_8 (Embedding)         (None, None, 256)    5120000     english[0][0]                    
__________________________________________________________________________________________________
embedding_9 (Embedding)         (None, None, 256)    5120000     spanish[0][0]                    
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (None, 1024)         7876608     embedding_8[0][0]                
__________________________________________________________________________________________________
gru_1 (GRU)                     (None, None, 1024)   3938304     embedding_9[0][0]                
                                                                 bidirectional[0][0]              
__________________________________________________________________________________________________
dropout (Dropout)               (None, None, 1024)   0           gru_1[0][0]                      
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, None, 20000)  20500000    dropout[0][0]                    
==================================================================================================
Total params: 42,554,912
Trainable params: 42,554,912
Non-trainable params: 0
__________________________________________________________________________________________________

Training our recurrent sequence-to-sequence model

In [ ]:
seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)
Epoch 1/15
1302/1302 [==============================] - 4549s 3s/step - loss: 1.6484 - accuracy: 0.4164 - val_loss: 1.3216 - val_accuracy: 0.5033
Epoch 2/15
1302/1302 [==============================] - 4510s 3s/step - loss: 1.3277 - accuracy: 0.5258 - val_loss: 1.1627 - val_accuracy: 0.5666
Epoch 3/15
1302/1302 [==============================] - 4577s 4s/step - loss: 1.1860 - accuracy: 0.5755 - val_loss: 1.0826 - val_accuracy: 0.5969
Epoch 4/15
1302/1302 [==============================] - 4566s 4s/step - loss: 1.0971 - accuracy: 0.6062 - val_loss: 1.0501 - val_accuracy: 0.6149
Epoch 5/15
1302/1302 [==============================] - 4569s 4s/step - loss: 1.0493 - accuracy: 0.6305 - val_loss: 1.0341 - val_accuracy: 0.6244
Epoch 6/15
1302/1302 [==============================] - 4600s 4s/step - loss: 1.0189 - accuracy: 0.6486 - val_loss: 1.0302 - val_accuracy: 0.6289
Epoch 7/15
1302/1302 [==============================] - 4579s 4s/step - loss: 0.9997 - accuracy: 0.6616 - val_loss: 1.0314 - val_accuracy: 0.6321
Epoch 8/15
1302/1302 [==============================] - 4591s 4s/step - loss: 0.9853 - accuracy: 0.6727 - val_loss: 1.0316 - val_accuracy: 0.6356
Epoch 9/15
1302/1302 [==============================] - 4593s 4s/step - loss: 0.9762 - accuracy: 0.6800 - val_loss: 1.0364 - val_accuracy: 0.6372
Epoch 10/15
1302/1302 [==============================] - 8982s 7s/step - loss: 0.9690 - accuracy: 0.6853 - val_loss: 1.0380 - val_accuracy: 0.6381
Epoch 11/15
1302/1302 [==============================] - 4560s 4s/step - loss: 0.9640 - accuracy: 0.6896 - val_loss: 1.0409 - val_accuracy: 0.6387
Epoch 12/15
1302/1302 [==============================] - 4552s 3s/step - loss: 0.9602 - accuracy: 0.6924 - val_loss: 1.0432 - val_accuracy: 0.6393
Epoch 13/15
1302/1302 [==============================] - 4645s 4s/step - loss: 0.9570 - accuracy: 0.6952 - val_loss: 1.0453 - val_accuracy: 0.6403
Epoch 14/15
1302/1302 [==============================] - 4671s 4s/step - loss: 0.9567 - accuracy: 0.6961 - val_loss: 1.0468 - val_accuracy: 0.6392
Epoch 15/15
1302/1302 [==============================] - 4686s 4s/step - loss: 0.9566 - accuracy: 0.6968 - val_loss: 1.0476 - val_accuracy: 0.6396
Out[ ]:
<keras.callbacks.History at 0x1560275c0>

We are now going to use the trained model to do English to Spanish translation. We set up the inference models for both the Encoder and the Decoder sub-systems. The Encoder inference model is the same as was defined for the training phase. The Decoder inference model on the other hand is going to be run on a stage by stage basis, such that the input into a stage is the same as the output from the previous stage, i.e., the Decoder is run in the Auto-Regressive mode.

Before the model can be run, we create a dictionary which maps each word index to its corresponding Spanish word. The Spanish sentence is generated one word at a time, starting with the token "[start]" and ending when the token "[end]" is sampled. During each stage the model predicts the probabilities of the 20,000 possible Spanish words. The output probabilities for the $i^{th}$ stage are converted into a word by first choosing the word index that has the maximum probability, and then using the lookup dictionary to convert that index into the corresponding word.

In [ ]:
import numpy as np
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))
-
Tom doesn't know anything about Australia.
[start] tom no sabe nada de australia end             
-
He was easily deceived and gave her some money.
[start] Él se [UNK] y le he [UNK] algo de dinero end         
-
The patient fainted at the sight of blood.
[start] el profesor se puso al lado de la [UNK] de la [UNK] end       
-
I don't want to die yet.
[start] no quiero todavía [UNK] end               
-
Is it about ten o'clock?
[start] ¿está a diez minutos end               
-
I just don't like it.
[start] lo que no me gusta end              
-
What do you want to protect us from?
[start] ¿qué quieres [UNK] de nosotros end              
-
Take the bags upstairs.
[start] [UNK] las usted de las cinco end             
-
There's a fine line between what's acceptable and what's unacceptable.
[start] hay una gran qué es una [UNK] entre lo que y es lo [UNK] end     
-
This hat is too small for me.
[start] este sombrero es muy grande para mí end            
-
Would you prefer to speak in English?
[start] hablar a hablar inglés end               
-
Who are you going to vote for?
[start] ¿a quién vas a a [UNK] end             
-
According to the newspapers, he will be here today.
[start] para él hoy él se la aquí para estar hoy end         
-
A ball hit the back of my head while I was playing soccer.
[start] una [UNK] de la puerta se me [UNK] cuando se [UNK] el dolor de cabeza end    
-
Tom grinds his teeth in his sleep.
[start] tom se [UNK] los ojos en su solo end           
-
The system is rigged.
[start] el a está qué está [UNK] end             
-
Tom has his own room.
[start] tom tiene su habitación end               
-
I know I can make it.
[start] sé que puedo hacerlo end               
-
Air is mainly composed of nitrogen and oxygen.
[start] el [UNK] está [UNK] por el tiempo y de la [UNK] end        
-
The people voted in November.
[start] la gente está en [UNK] end              

Image Captioning

In [ ]:
#rnn62
nb_setup.images_hconcat(["DL_images/rnn62.png"], width=800)
Out[ ]:

The Machine Translation application of the Encoder-Decoder architecture featured the same media type on both sides. In this section we describe an application in which two different media are involved: an image on the encoding side, and its corresponding description (or caption) in a language such as English on the decoding side. Creating an image description involves the following tasks:

  • The description should capture the most significant objects in the image.
  • It should express how these objects relate to each other, as well as their attributes and the activities they are involved in.
  • This semantic knowledge has to be converted into a natural language, which implies the use of a Language Model to do so.

This is an inherently difficult problem to solve, and before the advent of Encoder-Decoder systems it was typically approached by stitching together a solution to the object recognition problem with pre-existing caption templates. These types of systems were rigid in their text generation, and were demonstrated to work well only in limited domains such as traffic scenes or sports. The Encoder-Decoder based solution, on the other hand, was inspired by the Machine Translation systems described earlier, and uses a single joint model that takes an Image $I$ as input, and produces a caption $S$ that maximizes the likelihood $p(S|I)$.

Figure rnn62 shows a proposed design for solving the problem:

  • The Encoder part of the system is implemented using a ConvNet which generates a rich representation of the input image by embedding it in a fixed-length vector. The last Dense layer that precedes the Logit layer is used as the embedded image representation vector. The ConvNet is pre-trained on the image classification task before it is inserted into the Encoder-Decoder network, and then trained again end-to-end for the image captioning task.
  • The fixed length vector representing the input image is fed into the Decoder part of the system as shown in the figure. Note that the image is fed into the LSTM only once in the beginning, which actually works better than if the image were to be fed at every step.
  • The Language Model in the Decoder works exactly as in the Machine Translation example. During training the entire caption is fed into the Decoder, and Backprop is done on the basis of the error signals. During inference, the Decoder generates words one at a time until it reaches the end-of-sentence marker. A minimal sketch of this architecture is shown below.
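
A minimal Keras sketch of this design follows, assuming an InceptionV3 encoder, a caption vocabulary of 20,000 words and an embedding/state size of 256; these choices, and the decision to feed the image embedding in as the initial state of the LSTM, are illustrative assumptions rather than the exact configuration of the published systems.

In [ ]:
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20000    # assumed caption vocabulary size
embed_dim = 256       # assumed embedding / LSTM state size

# Encoder: a pre-trained ConvNet whose pooled features are projected into a
# fixed-length image embedding
cnn = keras.applications.InceptionV3(include_top=False, pooling="avg", weights="imagenet")
image_input = keras.Input(shape=(299, 299, 3))
image_features = cnn(image_input)                          # shape (batch, 2048)
image_embedding = layers.Dense(embed_dim)(image_features)  # fixed-length image vector

# Decoder: an LSTM Language Model; the image is fed in only once, here as the
# initial state of the LSTM
caption_input = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_input)
x = layers.LSTM(embed_dim, return_sequences=True)(
    x, initial_state=[image_embedding, image_embedding])
next_word_probs = layers.Dense(vocab_size, activation="softmax")(x)

caption_model = keras.Model([image_input, caption_input], next_word_probs)
caption_model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")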

Figure rnn83 shows some examples of captions generated by a model that was trained on 330,000 images in the Microsoft CoCo dataset.

In [ ]:
#rnn83
nb_setup.images_hconcat(["DL_images/rnn83.png"], width=800)
Out[ ]:

The Attention Mechanism

The Attention Mechanism was originally proposed in the context of Encoder-Decoder systems, but since then has been expanded to other kinds of neural networks. Its most significant impact on Deep Learning has been the role it played in the discovery of Transformer Networks (the paper that proposed Transformers was titled, "Attention Is All You Need"!). In this section we describe this technique in the context of Machine Translation, Image Captioning and Speech Transcription. In the next chapter we will describe Attention as it is used in Transformer Networks.

In order to understand why Attention is useful, let's consider the example of an Encoder-Decoder based Machine Translation system of the type described in the previous section. A central premise of this architecture is that the system is able to compress sufficient information about the variable-size Input Sequence within the final Hidden State of the Encoder part of the network. This information is then used to generate the entire output sequence without any further assistance from the Encoder. However, as the Input Sequence grows, this raises the question of how efficiently variable-size input information can be captured within a fixed number of nodes in the final Hidden State. In practice it has been observed that as the size of the input sequence grows, especially if it is larger than the size of the sentences used in the Training Data, the performance of the system deteriorates rapidly. This points to an architectural weakness which needs to be addressed. The Attention Mechanism was designed to address this issue, since it enables the Decoder to focus on specific stages of the Encoder network while doing its decoding.

In order to motivate the design of the Attention Mechanism, consider Figure rnn84. Part (a) of this figure shows a possible way in which the context information in the final stage of the Encoder Network can be supplied to all the Decoder stages. In practice this design sometimes works better than the vanilla Encoder-Decoder system, but it can be further enhanced, as shown in Part (b) of the figure. This figure shows a system in which ALL the stages in the Encoder Network contribute to the Context information, which is done by adding or concatenating the Encoder's Hidden State vectors together. The Attention Mechanism builds on this design and improves it further.

In [ ]:
#rnn84
nb_setup.images_hconcat(["DL_images/rnn84.png"], width=600)
Out[ ]:

The design shown in Part (b) of Figure rnn84 suffers from the issue that when trying to generate the next word in its decoder, it takes into consideration the information in all of the Encoder stages equally. This is in contrast to how humans would perform a similar task: for example, if we are asked to translate a sentence from English to French, we only consider the locally relevant words in the English sentence when trying to generate its French translation. This strategy can be mimicked as shown in Figure rnn64. Once again the system takes all the stages in the Encoder into consideration when generating the final context, but now it pays a different amount of attention to each word, as captured by the multipliers $a_1,a_2,a_3$ (these are normalized to sum to one). For example, when the system generates the first word $X_2$, it may pay more attention to the vector $Z_1$, which is captured by having $a_1$ larger than $a_2$ and $a_3$. This focus shifts when generating successive words in the translation.

In [ ]:
#rnn64
nb_setup.images_hconcat(["DL_images/rnn64.png"], width=600)
Out[ ]:

Figure rnn65 shows the details of how Attention is computed. The computations in the Encoder part of the network are unchanged from before, and result in the Hidden State vectors $(Z_1,Z_2,Z_3)$ in response to the input $(T_1,T_2,T_3)$. The computations for the Decoder part proceed as follows:

  • Let's focus on Part (b) of the figure, which shows the computations for generating the second output $X_3$. We compute the Attention Scores $(e^1_1,e^1_2,e^1_3)$ by taking the inner product of the previous decoder state $H_1$ with each of the encoder states $(Z_1, Z_2, Z_3)$. The reasoning behind this operation is that the state $H_1$ contains information about which part of the Encoder network the decoding should be focusing on, and the inner product quantifies this. The Attention Scores are then transformed into weights $(a^1_1,a^1_2,a^1_3)$ between 0 and 1 by using the Softmax function, and these are used to create a Context Vector $B_1$ whose value is computed by $$ B_1 = a^1_1 Z_1 + a^1_2 Z_2 + a^1_3 Z_3 $$ The Context Vector $B_1$ encodes the most useful information in the Hidden States of the Encoder network that is relevant to generating the next output $X_3$. As shown in the figure, along with $H_1$ and $X_2$, $B_1$ is also fed into the second stage of the Decoder to generate the next state $H_2$ (and subsequently $X_3$).

  • Part (a) of the figure shows the generation of $X_2$. The computations are the same as for $X_3$, except for the fact that since there is no prior decoder state to do the inner product with the encoder states, the last encoder state $Z_3$ is used instead.

  • Part (c) of the figure shows the generation of $X_4$, which is exactly the same as that for $X_3$.

Using recently established terminology, when generating the output $X_i$, the decoder state $H_{i-1}$ is called the Query, while the encoder states $(Z_1,Z_2,Z_3)$ are called the Keys. By taking the inner product of the Query with each Key, the system is trying to identify the Keys that are most similar to the Query, and the corresponding encoder Hidden State is assigned more weight.
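
The following NumPy sketch illustrates this inner-product Attention computation for a single decoding step; the names query and keys follow the terminology just introduced, and the arrays are random stand-ins rather than the states of a trained model.

In [ ]:
import numpy as np

def attention_context(query, keys):
    """query: the previous decoder state H_{i-1}, shape (d,)
       keys:  the encoder states (Z_1,...,Z_T), shape (T, d)
       Returns the Context Vector B and the Attention weights a."""
    e = keys @ query                  # Attention Scores: inner product of the Query with each Key
    a = np.exp(e - e.max())           # Softmax, numerically stabilized
    a = a / a.sum()
    B = a @ keys                      # Context Vector: weighted sum of the encoder states
    return B, a

# Toy example with 3 encoder states (Z_1, Z_2, Z_3) of dimension 4
Z = np.random.randn(3, 4)
H1 = np.random.randn(4)               # previous decoder state, i.e., the Query
B1, a1 = attention_context(H1, Z)     # a1 corresponds to (a^1_1, a^1_2, a^1_3)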

In the original formulation of Attention, the Context Vector is computed by taking the weighted sum of the Key values $(Z_1,Z_2,Z_3)$. In later formulations of Attention, the Context Vector is computed using another set of vectors $(A_1,A_2,A_3)$ which are referred to as Values, and this is known as the Key-Value formulation of Attention. In both cases the Keys and Values are derived from the Encoder Hidden States using matrix transformations. We will cover Key-Value Attention when discussing Transformers.

In [ ]:
#rnn65
nb_setup.images_hconcat(["DL_images/rnn65.png"], width=1000)
Out[ ]:

One of the benefits of using Attention is that it gives insights into the workings of the model. Figure rnn85 plots the Attention Weights ${a^i_j}$ on a row by row basis when doing translation from English to French. The grayscale squares are shaded so that $a^i_j=1$ is white and 0 is black. One can clearly see the Attention shifting from left to right as the translation proceeds. Also, for cases in which the order of words is different in the two languages, for example 'European Economic' vs 'economique europeenne', we can see that Attention remains focused on the correct English word while its French translation is being generated.

In [ ]:
#rnn85
nb_setup.images_hconcat(["DL_images/rnn85.png"], width=1000)
Out[ ]:
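
A plot like the one in Figure rnn85 can be produced directly from the matrix of Attention Weights. The following matplotlib sketch uses made-up weights and word lists purely for illustration.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical Attention Weights a[i, j]: output (French) word i attending to input (English) word j
eng_words = ["the", "European", "Economic", "Area"]
fra_words = ["la", "zone", "economique", "europeenne"]
a = np.random.dirichlet(np.ones(len(eng_words)), size=len(fra_words))  # each row sums to 1

plt.imshow(a, cmap="gray", vmin=0, vmax=1)        # white = 1, black = 0
plt.xticks(range(len(eng_words)), eng_words, rotation=90)
plt.yticks(range(len(fra_words)), fra_words)
plt.xlabel("source sentence (English)")
plt.ylabel("generated translation (French)")
plt.show()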

In addition to the Inner Product, there are other ways in which Attention Scores can be computed:

  • Multiplicative Attention: The Attention Score is computed using $$ e^j_i = H_j^T M Z_i $$ where $H_j$ is the Query, $Z_i$ is the Key and $M$ is a parameter matrix whose values are computed as part of the training.

  • Additive Attention: The Attention Score is computed using $$ e^j_i = v^T \tanh(MH_j + NZ_i) $$ where $M$ and $N$ are parameter matrices and $v$ is a parameter vector.

Additive and Multiplicative Attention are similar in complexity, although Multiplicative Attention is faster and more space-efficient in practice since it can be implemented using highly optimized matrix multiplication routines. The two variants perform similarly when the dimensionality of the decoder states is small, but Additive Attention performs better for larger dimensions. One way to mitigate this for Multiplicative Attention is to scale the Attention Score by $\frac{1}{\sqrt{d}}$, where $d$ is the dimension of the Query Vector (this is commonly done in Transformers).
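
For concreteness, the two scoring functions can be written out as in the following NumPy sketch; the matrices $M$, $N$ and the vector $v$ would be learned during training but are random stand-ins here, and the dimension $d=8$ is an arbitrary choice.

In [ ]:
import numpy as np

d = 8                                    # dimension of the Query and Key vectors (assumed equal)
H_j = np.random.randn(d)                 # Query: decoder state
Z_i = np.random.randn(d)                 # Key:   encoder state

# Multiplicative Attention: e = H_j^T M Z_i, here with the 1/sqrt(d) scaling used in Transformers
M = np.random.randn(d, d)                # learned parameter matrix (random stand-in)
e_mult = (H_j @ M @ Z_i) / np.sqrt(d)

# Additive Attention: e = v^T tanh(M H_j + N Z_i)
M_a = np.random.randn(d, d)              # learned parameter matrices (random stand-ins)
N_a = np.random.randn(d, d)
v = np.random.randn(d)
e_add = v @ np.tanh(M_a @ H_j + N_a @ Z_i)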

Image Captioning with Attention

In [ ]:
#rnn61
nb_setup.images_hconcat(["DL_images/rnn61.png"], width=1000)
Out[ ]:

Xu et al. (2015) applied the Attention Mechanism to the Image Captioning problem and obtained excellent results with a system that they called "Show, Attend and Tell". At a high level, the Attention Mechanism generates the caption by sequentially focusing on different parts of the image as the description progresses, and generating the word that is most relevant to the attended part (see Figure rnn13 for examples). We now go over the steps that are needed to apply this algorithm.

We first need to choose the vectors that will serve as the Keys in the Encoder. This is a critical design decision since it determines how the system will focus its attention on specific parts of the input image. Recall that in the Image Captioning systems discussed earlier, the image representation is conveyed by the vector formed by the last Dense layer (before the Logit layer), which unfortunately doesn't convey any local information about the image. An ingenious way in which this problem was solved is shown in Part (a) of Figure rnn61: consider the last Convolutional Layer in the system, of dimensions $H\times W\times C$, so that it contains $H\times W$ Activation vectors, each of depth $C$. Each of these vectors contains local information about the image, and can serve as a Key. Part (b) of the figure shows the operation of the Attention Layer. The Query vectors are set to the successive Hidden States in the Decoder, just as in Machine Translation.
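
The Key extraction step can be sketched in Keras as follows: the idea is simply to keep the last convolutional feature map rather than the pooled Dense features. The use of VGG16 and the resulting $7\times 7\times 512$ grid are illustrative assumptions, not necessarily the exact network or layer used in the original paper.

In [ ]:
import tensorflow as tf
from tensorflow import keras

# Keep the last convolutional feature map (H x W x C) instead of the pooled Dense features
cnn = keras.applications.VGG16(include_top=False, weights="imagenet")
image = tf.random.normal((1, 224, 224, 3))       # stand-in for a preprocessed input image
feature_map = cnn(image)                         # shape (1, 7, 7, 512) for a 224x224 input

# Flatten the spatial grid into H*W Key vectors, each of depth C
H, W, C = feature_map.shape[1:]
keys = tf.reshape(feature_map, (1, H * W, C))    # (1, 49, 512): one Key per image location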

As shown in Figure rnn13, the attention mechanism works quite well in practice. In each pair of images, the image on the right shows the area of the photo on which attention is being focused (by lighting up the picture in proportion to the Attention Weights). The underlined word in the generated caption is the word that was generated while Attention was focused on that part of the picture, and it can be seen that there is a very good correspondence between the word and the image region under focus.

In [ ]:
#rnn13
nb_setup.images_hconcat(["DL_images/rnn13.png"], width=600)
Out[ ]:

Speech Transcription with Attention

In [ ]:
#rnn34
nb_setup.images_hconcat(["DL_images/rnn34.png"], width=600)
Out[ ]:

Speech Transcription or Recognition is the process of converting the sound waveform of a spoken language into text. It is a commercially important problem, for obvious reasons, and over the years a tremendous amount of effort has gone into designing systems that can perform this task well. The process by which a speech waveform is converted into vectors that can be fed into a speech recognition model is shown in Figure rnn34 and consists of the following steps:

  • The speech waveform is segmented into smaller pieces of about 20 ms each.
  • Each of these segments is then processed using a Fast Fourier Transform (FFT) to extract its spectral power components. These components, organized by frequency, constitute the feature vector that is fed into the speech recognition model (a simplified sketch of this step is shown below).
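
The following NumPy sketch illustrates the basic framing and FFT steps; real front-ends typically use overlapping frames and mel filterbanks, so this should be read as a simplified illustration rather than a production feature extractor.

In [ ]:
import numpy as np

def speech_features(waveform, sample_rate=16000, frame_ms=20):
    """Split a waveform into ~20 ms frames and compute each frame's spectral power with an FFT."""
    frame_len = int(sample_rate * frame_ms / 1000)            # samples per frame (320 here)
    num_frames = len(waveform) // frame_len
    frames = waveform[:num_frames * frame_len].reshape(num_frames, frame_len)
    spectrum = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
    return np.abs(spectrum) ** 2                              # spectral power per frequency bin

# One second of synthetic "speech" in place of a real recording
features = speech_features(np.random.randn(16000))             # shape (50, 161)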

The first Speech Recognition models were built in the 1970s using a generative probabilistic model called GMM-HMM (Gaussian Mixture Model - Hidden Markov Model). The GMM part of the model captured the probability distribution of the speech feature vectors, while the HMM part modeled the sequence information. This model was fairly complex, and made a number of assumptions about the underlying probability distributions. During the late 1990s this model was replaced by an NN-HMM model in which a Dense Feed Forward Network was used instead of the GMM. This led to an improvement in performance, and also allowed the system to start using less processed input signals.

Speech Transcription can be classified as a Pattern Recognition problem; however, the patterns now occur in time rather than over space. From this point of view, RNNs are the perfect tool for the Speech Transcription problem, since they are designed to recognize patterns in sequences that unfold in time. Figure rnn35 shows an RNN-based speech recognition model with speech feature vectors as its input. The application of RNNs to this problem did not work initially, for the following two reasons:

  • The difficulty in training RNNs.

  • Speech Transcription systems exhibit a big difference in the input and the output sequence sizes. This is due to the fact that the input sequence consists of vectors generated from filtered features of the audio waveform, which may be generated every 10 or 20 ms; while the corresponding output sequence may consist of just a few words. Until recently RNN architectures could not handle this disparity in input and output sequence sizes.

Both of these problems have been solved in recent years: the training problem was solved with the use of LSTMs and GRUs, while the sequence mismatch problem was solved by using Encoder-Decoder architectures, since they are designed to handle input and output sequences of differing lengths.

In [ ]:
#rnn35
nb_setup.images_hconcat(["DL_images/rnn35.png"], width=600)
Out[ ]:

Figure rnn14 shows an architecture based on the Encoder-Decoder model for doing Speech Transcription. This system was designed by the Google Brain team and is called Listen, Attend and Spell (LAS). Unlike the older models, all aspects of the speech recognition system, including the acoustic, pronunciation and language models, are captured within a single framework. This system learns to transcribe an audio signal into a word sequence, one character at a time. The Encoder part of the system is called the Listener, and the Decoder part is called the Speller; these are described next.

In [ ]:
#rnn14
nb_setup.images_hconcat(["DL_images/rnn14.png"], width=600)
Out[ ]:

The Listen Subsystem: The Listen system, which is the Encoder part of the network, uses a Bi-directional LSTM (BLSTM) with a pyramidal structure and encodes the speech signal into higher-level representations (see the bottom half of Figure rnn14). The reason for this design is the following: unlike other Encoder-Decoder systems, Speech Recognition systems exhibit a big difference between the input and output sequence sizes due to the sequence mismatch problem. The Google Brain team observed that if the Encoder-Decoder architecture is implemented without the pyramidal structure, then it converges slowly and produces inferior results even after a training period lasting a month. This may be due to the fact that the Decoder finds it difficult to extract the relevant information from the large number of input steps. The pyramidal BLSTM addresses this problem by reducing the time resolution by a factor of 2 with each layer. Since the Encoder uses a 4 layer stack, it reduces the time resolution by a factor of 8, which allows the Attention Model to extract the relevant information from a smaller number of time steps.
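
The time-resolution reduction can be implemented by concatenating pairs of adjacent time steps before passing them to the next BLSTM layer. The Keras sketch below shows one such pyramidal layer and stacks three resolution-halving layers (which, together with a first non-reducing layer, would give the 4-layer stack and the overall factor of 8 mentioned above); the layer sizes and input shapes are illustrative assumptions.

In [ ]:
import tensorflow as tf
from tensorflow.keras import layers

def pyramidal_blstm(x, units=256):
    """One pyramidal BLSTM layer: merge pairs of adjacent time steps (halving the
    time resolution), then run a Bi-directional LSTM over the shorter sequence."""
    batch, time, depth = x.shape                  # sketch only: assumes static shapes
    time = time - time % 2                        # drop a trailing odd frame, if any
    x = tf.reshape(x[:, :time, :], (batch, time // 2, 2 * depth))
    return layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)

# Three resolution-halving layers reduce 400 input frames to 50 encoder time steps
features = tf.random.normal((1, 400, 40))         # (batch, time steps, spectral features)
h = features
for _ in range(3):
    h = pyramidal_blstm(h)
print(h.shape)                                     # (1, 50, 512)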

The Attend and Spell Subsystem: At each step, the Speller uses its hidden state $Z_t$ to guide an Attention Mechanism that computes a Context Vector $B_t$ from the Listener's encoded higher-level features $\{h_1,...,h_{L_{in}}\}$, which are set to the top-level Hidden State vectors (see the top half of Figure rnn14). The details of the computation of the Attention Context function are as described earlier in this chapter. The Speller uses this Context Vector to update its internal state as well as to predict the next character in the output sequence.

A recent paper has extended the "Listen, Attend and Spell" model to a system that also incorporates lip reading (see Figure rnn15). This system, called "Listen, Watch, Attend and Spell", has two Encoders feeding the Decoder: Encoder 1 (called "Listen") processes the sound waveform and produces the sound Context Set vectors $o^s$. The new Encoder 2 (called "Watch") processes a video of the person talking. The Watch system consists of a ConvNet module based on VGGNet that generates image features for every input time step; these are then fed into an LSTM that produces the video Context Set vectors $o^v$. The Spell module then uses an Attention model that combines the information in both the sound and video Context Sets, as shown in the figure. The system is capable of operating with just the Listen module, with just the Watch module, or with both. Indeed, the researchers found that the word error rate decreased significantly when the Watch module was used in addition to the Listen module, and the Lip Reading performance with only the Watch module in operation surpassed that of professional lip readers.

In [ ]:
#rnn15
nb_setup.images_hconcat(["DL_images/rnn15.png"], width=600)
Out[ ]: