Deep Feed Forward Networks¶

%pylab inline
from ipypublish import nb_setup

Populating the interactive namespace from numpy and matplotlib

Non-Linear Filters¶

The Linear Models that we discussed in Chapter LinearLearningModels work well if the input dataset is approximately linearly separable, but they have limited accuracy for complex datasets. Some of the issues with Linear Models are the following:

If the input data is not linearly separable, then the designer has to expend a lot of effort in finding an appropriate feature map that makes it so. It would be nice to have a model that solves this problem automatically, by learning the best feature map from the data itself.
We showed that the model weight parameters could be regarded as a filter, so that for $K$ classes, the operation of the system is equivalent to trying to match the input with $K$ different filters. The limitations of this approach can be seen in the filter for the "horse" class in Figure LC2. The filter looks like a horse with two heads, since it is trying its best to match with a horse image, irrespective of the direction in which the horse is facing. This type of filtering will clearly not work for cases in which the horse is standing in some other orientation, or is located in a corner of the image. The fact that the best accuracy that can be achieved with linear classifiers on the CIFAR-10 Dataset is only about 40% is a reflection of this shortcoming. The linear system tries to do classification by taking each and every pixel into account, which is a difficult task. What if it were possible to create representations for higher level features in the image, say the head of the horse or its legs, and then use these for classification instead? This would enable the system to identify a horse irrespective of its orientation and its location in the image. This is precisely what Deep Learning systems do.
In general a way to make any model more powerful is by increasing the number of parameters. However in a Linear Model the number of parameters is constrained to $KN + K$ by the sizes of the input data and the number of output classes, which limits its modeling power.

#LC2
nb_setup.images_hconcat(["DL_images/LC2.png"], width=600)

Dense Feed Forward Networks¶

Dense Feed Forward Networks were designed with the objective of overcoming these shortcomings. As Figure DFN1 shows, we are looking for a functional block between the input vector $(x_1,...,x_N)$ and the output logits $(a_1,...,a_K)$ that can create a new representation vector $(z_1,...,z_P)$ which satisfies the approximate linear separability property. One way to do this is shown in Figure DFN2, which is a Deep Feed Forward Network with a single Hidden Layer. Note the following:

The Input and Output Layers are as before, but we have added a third layer, the so-called Hidden Layer, in between. The Input Layer is fully connected to the Hidden Layer, i.e., each node in the Input Layer is connected to every node in the Hidden Layer, and the same holds true for connections between the Hidden Layer and the Output Layer. DLNs with these characteristics are called Dense Feed Forward Neural Networks. Later in this monograph we will come across examples of DLNs where these properties don’t apply, either because the fully connected property does not hold (as in Convolutional Neural Networks), or because the DLN incorporates feedback loops (as in Recurrent Neural Networks).
The $j$-th node in the Hidden Layer performs the following computation on the input variables $x_i$ to generate an output $z_j^{(1)}, 1 \leq j \leq P$ given by $$ a_j^{(1)} = \sum_{i=1}^N w_{ji}^{(1)} x_i + b_j^{(1)} $$ $$ z_j^{(1)} = f(a_j^{(1)}) $$ The vector $(a^{(1)}_1,...,a_P^{(1)})$, which we call the Pre-Activation, is computed as a simple linear combination of the Input Vector. The output of the Hidden Layer $(z^{(1)}_1,...,z_P^{(1)})$ which we call the Activation, is computed as an elementwise non-linear function of the Pre-Activations.
The Output Layer operates on the Activations $z_j^{(1)}$ from the Hidden Layer, and computes the logits for the $K$ classes $(a_1^{(2)},...,a_K^{(2)})$. $$ a_k^{(2)} = \sum_{i=1}^P w_{ki}^{(2)} z_i^{(1)} + b_k^{(2)}, \ \ 1\le k\le K $$ The classification probabilities $y_k, 1\le k\le K$ are obtained by applying the Softmax function to the logits. $$ y_k = \frac{\exp(a_k^{(2)})}{\sum_{j=1}^K \exp(a_j^{(2)})}, \ \ 1\le k\le K $$ Note that the logit and classification probability computations are identical to those done in Linear Systems, with the inputs $X$ now replaced by the activations $Z$.
The weight parameters $w_{ij}^{(1)}, 1\le i\le P,1\le j\le N; w_{ij}^{(2)}, 1\le i\le K,1\le j\le P$ and the bias parameters $b_i^{(1)}, 1\le i\le P; b_i^{(2)}, 1\le i\le K$ have to be learnt using the training data, as in Linear Models. The total number of parameters needed to describe this network is given by $NP + P + PK + K$, which is now dependent on the number of nodes in the Hidden Layer $P$. Hence we can build a Dense Feed Forward model with more powerful classification ability by increasing the number of nodes in the Hidden Layer, which is an option that does not exist in Linear Systems.
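The computations listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the sizes `N`, `P`, `K`, the random weights, and the helper names `relu`, `softmax` and `forward_single_hidden` are assumptions made for the example, and the Activation Function $f$ is taken to be the ReLU that is introduced later in this section.

```python
import numpy as np

def relu(a):
    # Elementwise ReLU: keeps positive pre-activations, zeroes the rest.
    return np.maximum(a, 0.0)

def softmax(a):
    # Subtracting the largest logit avoids overflow and leaves the result unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward_single_hidden(x, W1, b1, W2, b2):
    a1 = W1 @ x + b1      # Pre-Activations of the Hidden Layer, shape (P,)
    z1 = relu(a1)         # Activations of the Hidden Layer, shape (P,)
    a2 = W2 @ z1 + b2     # logits, shape (K,)
    return softmax(a2)    # classification probabilities, shape (K,)

# Illustrative sizes and random parameters (assumptions, not from the text).
N, P, K = 8, 5, 3
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(P, N)), np.zeros(P)
W2, b2 = rng.normal(size=(K, P)), np.zeros(K)

y = forward_single_hidden(rng.normal(size=N), W1, b1, W2, b2)

# Total parameter count, which should equal NP + P + PK + K.
n_params = W1.size + b1.size + W2.size + b2.size
print(y, n_params)
```

The probabilities in `y` sum to one, and `n_params` grows with `P`, which is the extra degree of freedom that the Linear Model lacks.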

#DFN1
nb_setup.images_hconcat(["DL_images/DFN1.png"], width=600)

#DFN2
nb_setup.images_hconcat(["DL_images/DFN2.png"], width=600)

The activations $(z^{(1)}_1,...,z_P^{(1)})$ correspond to the new data representation that we are looking for. They filter the input and create higher layer representations, which are then used by the logit layer for classification. Note that the filtering done by the Hidden Layer is non-linear due to the presence of the non-linear Function $f$. This function is called the Activation Function, and plays an important role in system performance. The most popular Activation Function in use is called the Rectified Linear Unit, or ReLU, and is shown in Figure DFN3. It simply passes on the pre-activations that are greater than zero, and blocks those that are less.

The presence of the Activation Function is critical to the functioning of the DLN, and it can be easily shown that if it were omitted, then the Hidden and Output layers could be collapsed together so that the resulting model would be equivalent to a Linear Model. Indeed the presence of Activation Functions gives the system its modeling power, and in general we will see later in the book that DLN systems can be made more powerful by increasing the amount of non-linear processing. The appropriate choice of Activation Functions has a big influence on the performance of the DLN, and the discovery of more effective Activation Functions such as the ReLU has helped make DLNs easier to train.
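The collapse into a Linear Model can be checked numerically. With $f$ omitted, the logits become $W^{(2)}(W^{(1)}x + b^{(1)}) + b^{(2)} = (W^{(2)}W^{(1)})x + (W^{(2)}b^{(1)} + b^{(2)})$, which is a single linear layer. The sketch below verifies this identity on random parameters; the sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, K = 6, 4, 3   # illustrative sizes (assumptions)
W1, b1 = rng.normal(size=(P, N)), rng.normal(size=P)
W2, b2 = rng.normal(size=(K, P)), rng.normal(size=K)
x = rng.normal(size=N)

# Hidden and Output Layers with the Activation Function omitted.
two_layer_logits = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2
linear_logits = W @ x + b

print(np.allclose(two_layer_logits, linear_logits))
```

The collapsed matrix `W` has shape `(K, N)`, the same as the weight matrix of the Linear Model, so no modeling power is gained from the extra layer in this case.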

nb_setup.images_hconcat(["DL_images/DFN3.png"], width=600)

The system shown in Figure DFN2 incorporates only a single Hidden Layer. Why not continue the process and enable the model to create higher level representations by adding additional hidden layers? This is certainly possible and the resulting network is shown in Figure DFN4. It shows a Dense Feed Forward Network with $R$ hidden layers, such that layer $r$ consists of $P^r$ nodes. The equations describing this network can be written as:

The activations for the first Hidden Layer: $$ a_j^{(1)} = \sum_{i=1}^N w_{ji}^{(1)} x_i + b_j^{(1)},\ \ 1\le j\le P^1 $$ $$ z_j^{(1)} = f(a_j^{(1)}),\ \ 1\le j\le P^1 $$
The Activations for Hidden Layer 2 to Hidden Layer R: $$ a_j^{(r+1)} = \sum_{i=1}^{P^r} w_{ji}^{(r+1)} z_i^{(r)} + b_j^{(r+1)},\ \ 1\le r\le R-1, 1\le j\le P^{r+1} $$ $$ z_j^{(r+1)} = f(a_j^{(r+1)}),\ \ 1\le r\le R-1, 1\le j\le P^{r+1} $$
The logits and the classification probabilities: $$ a_k^{(R+1)} = \sum_{i=1}^{P^R} w_{ki}^{(R+1)} z_i^{(R)} + b_k^{(R+1)},\ \ 1\le k\le K $$ $$ y_k = \frac{\exp(a_k^{(R+1)})}{\sum_{j=1}^K \exp(a_j^{(R+1)})}, \ \ 1\le k\le K $$

With each successive Hidden Layer, this network creates representations at higher levels of abstraction.

Using matrix notation, these equations can be compactly written as (with $Z^{(0)} = X$):

$$ A^{(r)} = W^{(r)}Z^{(r-1)} + B^{(r)},\ \ Z^{(r)} = f(A^{(r)}),\ \ 1\le r\le R $$$$ A^{(R+1)} = W^{(R+1)}Z^{(R)} + B^{(R+1)},\ \ Y = h(A^{(R+1)}) $$

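In these equations $f$ and $h$ represent the Activation and Softmax functions respectively. The Activation Function $f$ is applied elementwise across all the matrix entries, while the Softmax function $h$ exponentiates the logits and normalizes them over the $K$ classes, as in the Softmax equation given earlier.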
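The matrix form maps directly onto a loop over layers. The following NumPy sketch assumes ReLU for $f$, stores one input sample per column of $X$, and uses made-up layer sizes and random weights; the function and variable names are illustrative only.

```python
import numpy as np

def relu(A):
    # Elementwise Activation Function f.
    return np.maximum(A, 0.0)

def softmax(A):
    # Softmax h: normalise over the class axis (rows), one column per sample.
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def forward(X, weights, biases):
    """X has shape (N, M) with one sample per column.

    weights[r] and biases[r] hold W^{(r+1)} and B^{(r+1)}, since Python
    lists are indexed from zero.
    """
    Z = X                                    # Z^{(0)} = X
    for W, B in zip(weights[:-1], biases[:-1]):
        Z = relu(W @ Z + B)                  # Hidden Layers 1..R
    A_out = weights[-1] @ Z + biases[-1]     # logits A^{(R+1)}
    return softmax(A_out)                    # Y = h(A^{(R+1)})

# Illustrative sizes (assumptions): N = 8 inputs, Hidden Layers of 6 and 4 nodes, K = 3.
sizes = [8, 6, 4, 3]
rng = np.random.default_rng(0)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

Y = forward(rng.normal(size=(8, 5)), weights, biases)
print(Y.shape, Y.sum(axis=0))
```

Each column of `Y` holds the class probabilities for one input sample, so every column sums to one.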

nb_setup.images_hconcat(["DL_images/DFN4.png"], width=600)

Nodes vs Layers¶

We have introduced two degrees of freedom in DLN design in this chapter: (1) The number of Hidden Layers, and (2) The number of nodes per Hidden Layer. This leads to the following questions:

To get a better performing model, is it preferable to increase the number of layers, or is it better to increase the number of nodes per layer (while keeping the number of layers fixed)?
Does the system performance keep improving as we add more and more layers, or are there limits that the model runs into?

Unfortunately there are not many theoretical results in this area that give definite answers to these questions. However there is one interesting theorem regarding Deep Feed Forward Networks with a single Hidden Layer, whose proof was given by Cybenko in 1989:

Given an arbitrary continuous function $g$ of $n$ variables such as

$$ y = g(x_1,...,x_n) $$

it is always possible to find a Deep Feed Forward Network with a single Hidden Layer, such that the output of the network approximates $g$, and the approximation can be made as close as we want by adding nodes to the Hidden Layer.

This property is of course dependent on the form of the Activation Function used, but it has been proven to be true for the most commonly used functions. Hence it should be possible to solve any classification problem with a Dense Feed Forward Network containing a single Hidden Layer. However the theorem does not specify the number of hidden nodes needed for a particular problem.
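The flavor of this result can be illustrated numerically, though the sketch below is not the construction used in the proof. It assumes a one-dimensional target $g(x) = \sin(2x)$, a Hidden Layer of ReLU nodes whose weights are drawn at random and left untrained, and output weights fitted by least squares; all of these choices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 400)
g = np.sin(2.0 * x)                  # target function (an arbitrary choice)

P_max = 200
w = rng.normal(size=P_max)           # random hidden weights, not trained
b = rng.uniform(-3.0, 3.0, size=P_max)
H = np.maximum(np.outer(x, w) + b, 0.0)   # Hidden Layer activations, shape (400, P_max)

errors = {}
for P in (5, 20, 200):
    # Use the first P hidden nodes plus a constant column for the output bias.
    features = np.column_stack([H[:, :P], np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(features, g, rcond=None)
    errors[P] = float(np.sqrt(np.mean((features @ coef - g) ** 2)))

print(errors)
```

Because each larger Hidden Layer contains the smaller one, the fitting error on these sample points cannot increase as nodes are added. The sketch says nothing about how many nodes a given problem needs, which matches the limitation of the theorem noted above.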

In practice, it has been observed that in order to increase the modeling power of a DLN, it is advantageous to add Hidden Layers, for the following reasons:

More layers allow the model to develop a hierarchical representation of the input data, which simplifies the task of the linear classifier in the final layer.
Having additional layers increases the amount of non-linearity and thus the modeling capacity.

This still leaves the question of how wide the network should be. There has been some progress on this more recently by [Li, Xu, et al.](https://arxiv.org/pdf/1712.09913.pdf), and their key finding is shown in Figure convnet46.

#convnet46
nb_setup.images_hconcat(["DL_images/convnet46.png"], width=700)

As illustrated in the figure, the width of the network has a critical effect on the smoothness of its Loss Function. The figure shows four contour plots for the Loss Function of an increasingly wider network, and as can be seen the Loss Function landscape becomes progressively smoother as we move from left to right. This makes the optimization task much easier. This effect is more pronounced for the very deep networks with hundreds of layers that we will study later in the course, and less of an issue in a network with only a few layers.

If the Loss Function is highly chaotic as in the leftmost plot, then the optimization becomes highly dependent on the initialization values, since a bad initialization can cause the trajectory to get caught in the ups and downs of the uneven loss landscape. Increasing the width of the network promotes flat minimizers and prevents the transition to chaotic behavior, which also improves the generalization ability of the network.

Performance as a function of layers¶

The other question that we raised is whether the DLN performance keeps improving as we add more and more Hidden Layers. This is actually not the case; the model performance is constrained by the following factors:

The Vanishing Gradient Problem: In order to train a multilayer Deep Feed Forward Network, the gradients $\frac{\partial L}{\partial w^{(r)}_{ij}}$ and $\frac{\partial L}{\partial b^{(r)}_i}$ have to be computed. It turns out that if the number of layers is large, the gradients of the weights that are either in the first few layers or the last few layers converge towards zero as the training progresses. Once this happens, the corresponding weights stop adapting to new training data, and thus the training process grinds to a halt. This phenomenon is known as the Vanishing Gradient problem, and its causes are explained in detail in Chapter GradientDescentTechniques. In addition, adding more layers makes the Loss Landscape more chaotic, as shown in Figure convnet46, which makes optimization very difficult. This problem constrains the number of layers that can be added to the network to about 20 or so, without degrading the training process. In order to get around this problem, we can increase the width of the network as explained above, or use a recent advance in DLN architecture called Residual Connections, which allows much deeper networks containing hundreds of layers.
The Overfitting Problem: Larger models with more layers have a larger number of parameters, and this in turn requires larger training datasets. As explained in Chapter ImprovingModelGeneralization, modeling is an exercise in matching the Capacity of the Model with the Complexity of the Dataset. If the Capacity of the Model is greater than the Complexity of the Dataset (which can happen if we add more layers than necessary), then it leads to overfitting. This problem constrains the model's generalization ability.

As this discussion shows, there is no formula or theoretical result which tells us the number of layers or the nodes per layer to use in the model. These numbers, which are also called hyper-parameters, are a function of the dataset that we are trying to model, and the only way to find the best numbers is by trial and error. Hence when building the model, the designer has to do several trial runs with different values for these hyper-parameters before settling on the best ones.
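One simple way to organize such trial runs is to enumerate a grid of candidate architectures up front. The sketch below only lists the candidates and their parameter counts; the grid values (`depths`, `widths`) and the helper `dense_param_count` are hypothetical, and each candidate would still have to be trained and compared on validation data.

```python
from itertools import product

def dense_param_count(n_inputs, hidden_sizes, n_classes):
    """Number of weights plus biases in a Dense Feed Forward Network."""
    sizes = [n_inputs, *hidden_sizes, n_classes]
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# Hypothetical search grid (an assumption for illustration).
depths = (1, 2, 3)      # number of Hidden Layers
widths = (16, 64)       # nodes per Hidden Layer

candidates = []
for depth, width in product(depths, widths):
    hidden = [width] * depth
    # 32 * 32 * 3 inputs and 10 classes correspond to flattened CIFAR-10 images.
    candidates.append((hidden, dense_param_count(32 * 32 * 3, hidden, 10)))

for hidden, n_params in candidates:
    print(hidden, n_params)
```

Listing the parameter counts alongside each candidate gives a rough sense of model Capacity before any training time is spent.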

In Chapter ImprovingModelGeneralization we provide some guidelines that can be used to make this process more efficient.

Example of a Dense Feed Forward Network in Keras¶

Models Using the Keras Layers Module¶

There are two ways to define a Dense Feed Forward Network in Keras:

Using the Keras Layers Module
Using the Keras Functional API

The code shown below uses the Layers Module to define a Dense Feed Forward Network with two hidden layers with 20 and 15 nodes respectively. The first hidden layer is constrained to accept input tensors of shape (32 * 32 * 3, ). Note that the batch dimension is left unspecified in this shape, which allows the system to feed this layer with batches of data of any size. The input tensor is transformed into a tensor of shape (20, ) by the first hidden layer, and this tensor is then processed by the second hidden layer with 15 nodes. There is no need to specify an input shape argument for the second layer, since Keras automatically infers it from the output of the first layer.
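The shape bookkeeping described above can be traced with plain NumPy, without building the Keras model. This is a sketch under stated assumptions: the weights are random placeholders, the helper `shapes_through_network` is made up for the example, and the kernels are stored as (inputs, units), which is the layout used by Keras Dense layers.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [32 * 32 * 3, 20, 15, 10]

# Placeholder kernels stored as (inputs, units), plus one bias per unit.
kernels = [rng.normal(scale=0.01, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def shapes_through_network(batch):
    """Return the tensor shape after the input and after each Dense layer."""
    shapes = [batch.shape]
    z = batch
    for i, (W, b) in enumerate(zip(kernels, biases)):
        a = z @ W + b
        # ReLU on the hidden layers; the Softmax on the last layer is
        # omitted because it does not change the shape.
        z = np.maximum(a, 0.0) if i < len(kernels) - 1 else a
        shapes.append(z.shape)
    return shapes

# The same weights accept any batch size.
for batch_size in (1, 128):
    print(shapes_through_network(np.zeros((batch_size, 32 * 32 * 3))))

n_params = sum(W.size + b.size for W, b in zip(kernels, biases))
print(n_params)
```

Only the leading batch axis changes between the two runs, and the parameter total should agree with the count that Keras reports for the model defined below.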

Comparing the results of the Linear Model from the previous chapter with those of the Dense Feed Forward Model, the accuracy increased from about 40% to 45%. This is a significant jump, but still not good enough. One of the main factors holding back the Dense Feed Forward model is that it can only process images after they have been flattened into vectors. Thus a lot of the information present in the original 3D image shape is lost, especially information about pixels that are close to each other in the original image. In order to process images in their native 3D shape, we will need a more sophisticated Neural Network model called a Convolutional Neural Network, which is discussed in one of the later chapters.

import keras
keras.__version__
from keras import models
from keras import layers

from keras.datasets import cifar10

(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

train_images = train_images.reshape((50000, 32 * 32 * 3))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 32 * 32 * 3))
test_images = test_images.astype('float32') / 255

from tensorflow.keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

network = models.Sequential()
network.add(layers.Dense(20, activation='relu', input_shape=(32 * 32 * 3,)))
network.add(layers.Dense(15, activation='relu'))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='sgd',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

history = network.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)

Epoch 1/100
313/313 [==============================] - 2s 4ms/step - loss: 2.2224 - accuracy: 0.1804 - val_loss: 2.1399 - val_accuracy: 0.2087
Epoch 2/100
313/313 [==============================] - 1s 3ms/step - loss: 2.0539 - accuracy: 0.2482 - val_loss: 2.0086 - val_accuracy: 0.2662
Epoch 3/100
313/313 [==============================] - 1s 3ms/step - loss: 1.9258 - accuracy: 0.2991 - val_loss: 1.9078 - val_accuracy: 0.3113
Epoch 4/100
313/313 [==============================] - 1s 4ms/step - loss: 1.8651 - accuracy: 0.3219 - val_loss: 1.9120 - val_accuracy: 0.3135
Epoch 5/100
313/313 [==============================] - 1s 3ms/step - loss: 1.8258 - accuracy: 0.3395 - val_loss: 1.8388 - val_accuracy: 0.3335
Epoch 6/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7963 - accuracy: 0.3514 - val_loss: 1.8283 - val_accuracy: 0.3377
Epoch 7/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7743 - accuracy: 0.3606 - val_loss: 1.8082 - val_accuracy: 0.3454
Epoch 8/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7570 - accuracy: 0.3694 - val_loss: 1.8109 - val_accuracy: 0.3515
Epoch 9/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7396 - accuracy: 0.3746 - val_loss: 1.7487 - val_accuracy: 0.3742
Epoch 10/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7257 - accuracy: 0.3796 - val_loss: 1.7517 - val_accuracy: 0.3771
Epoch 11/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7108 - accuracy: 0.3873 - val_loss: 1.7341 - val_accuracy: 0.3764
Epoch 12/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6988 - accuracy: 0.3925 - val_loss: 1.7676 - val_accuracy: 0.3726
Epoch 13/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6872 - accuracy: 0.3965 - val_loss: 1.7758 - val_accuracy: 0.3589
Epoch 14/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6760 - accuracy: 0.4033 - val_loss: 1.7202 - val_accuracy: 0.3817
Epoch 15/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6647 - accuracy: 0.4072 - val_loss: 1.7044 - val_accuracy: 0.3949
Epoch 16/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6586 - accuracy: 0.4096 - val_loss: 1.7036 - val_accuracy: 0.3925
Epoch 17/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6495 - accuracy: 0.4121 - val_loss: 1.7121 - val_accuracy: 0.3844
Epoch 18/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6409 - accuracy: 0.4155 - val_loss: 1.6819 - val_accuracy: 0.3980
Epoch 19/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6302 - accuracy: 0.4207 - val_loss: 1.7588 - val_accuracy: 0.3775
Epoch 20/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6251 - accuracy: 0.4232 - val_loss: 1.6870 - val_accuracy: 0.3967
Epoch 21/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6198 - accuracy: 0.4241 - val_loss: 1.6598 - val_accuracy: 0.4138
Epoch 22/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6110 - accuracy: 0.4288 - val_loss: 1.6686 - val_accuracy: 0.4023
Epoch 23/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6078 - accuracy: 0.4282 - val_loss: 1.6444 - val_accuracy: 0.4140
Epoch 24/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5993 - accuracy: 0.4342 - val_loss: 1.6792 - val_accuracy: 0.4041
Epoch 25/100
313/313 [==============================] - 1s 4ms/step - loss: 1.5948 - accuracy: 0.4354 - val_loss: 1.6517 - val_accuracy: 0.4141
Epoch 26/100
313/313 [==============================] - 1s 4ms/step - loss: 1.5892 - accuracy: 0.4360 - val_loss: 1.6861 - val_accuracy: 0.4057
Epoch 27/100
313/313 [==============================] - 1s 4ms/step - loss: 1.5856 - accuracy: 0.4378 - val_loss: 1.6292 - val_accuracy: 0.4195
Epoch 28/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5785 - accuracy: 0.4387 - val_loss: 1.6437 - val_accuracy: 0.4194
Epoch 29/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5749 - accuracy: 0.4388 - val_loss: 1.6746 - val_accuracy: 0.4025
Epoch 30/100
313/313 [==============================] - 1s 4ms/step - loss: 1.5716 - accuracy: 0.4433 - val_loss: 1.6429 - val_accuracy: 0.4153
Epoch 31/100
313/313 [==============================] - 1s 4ms/step - loss: 1.5676 - accuracy: 0.4444 - val_loss: 1.6169 - val_accuracy: 0.4300
Epoch 32/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5602 - accuracy: 0.4462 - val_loss: 1.6660 - val_accuracy: 0.4040
Epoch 33/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5568 - accuracy: 0.4484 - val_loss: 1.6150 - val_accuracy: 0.4316
Epoch 34/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5539 - accuracy: 0.4473 - val_loss: 1.6274 - val_accuracy: 0.4196
Epoch 35/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5496 - accuracy: 0.4490 - val_loss: 1.6318 - val_accuracy: 0.4210
Epoch 36/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5487 - accuracy: 0.4513 - val_loss: 1.6406 - val_accuracy: 0.4150
Epoch 37/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5427 - accuracy: 0.4514 - val_loss: 1.6197 - val_accuracy: 0.4251
Epoch 38/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5387 - accuracy: 0.4526 - val_loss: 1.6142 - val_accuracy: 0.4296
Epoch 39/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5369 - accuracy: 0.4541 - val_loss: 1.6024 - val_accuracy: 0.4350
Epoch 40/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5327 - accuracy: 0.4557 - val_loss: 1.6138 - val_accuracy: 0.4321
Epoch 41/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5293 - accuracy: 0.4556 - val_loss: 1.6303 - val_accuracy: 0.4185
Epoch 42/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5265 - accuracy: 0.4566 - val_loss: 1.6417 - val_accuracy: 0.4231
Epoch 43/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5208 - accuracy: 0.4602 - val_loss: 1.5818 - val_accuracy: 0.4385
Epoch 44/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5198 - accuracy: 0.4587 - val_loss: 1.6336 - val_accuracy: 0.4313
Epoch 45/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5188 - accuracy: 0.4622 - val_loss: 1.5858 - val_accuracy: 0.4422
Epoch 46/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5124 - accuracy: 0.4620 - val_loss: 1.6163 - val_accuracy: 0.4313
Epoch 47/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5152 - accuracy: 0.4608 - val_loss: 1.6237 - val_accuracy: 0.4336
Epoch 48/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5112 - accuracy: 0.4639 - val_loss: 1.6017 - val_accuracy: 0.4327
Epoch 49/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5043 - accuracy: 0.4640 - val_loss: 1.5864 - val_accuracy: 0.4383
Epoch 50/100
313/313 [==============================] - 1s 4ms/step - loss: 1.5021 - accuracy: 0.4669 - val_loss: 1.5868 - val_accuracy: 0.4364
Epoch 51/100
313/313 [==============================] - 1s 4ms/step - loss: 1.5021 - accuracy: 0.4686 - val_loss: 1.6660 - val_accuracy: 0.4187
Epoch 52/100
313/313 [==============================] - 1s 4ms/step - loss: 1.4997 - accuracy: 0.4681 - val_loss: 1.6058 - val_accuracy: 0.4336
Epoch 53/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4990 - accuracy: 0.4670 - val_loss: 1.6451 - val_accuracy: 0.4165
Epoch 54/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4935 - accuracy: 0.4687 - val_loss: 1.6530 - val_accuracy: 0.4162
Epoch 55/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4888 - accuracy: 0.4712 - val_loss: 1.6777 - val_accuracy: 0.4080
Epoch 56/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4911 - accuracy: 0.4719 - val_loss: 1.5867 - val_accuracy: 0.4401
Epoch 57/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4866 - accuracy: 0.4739 - val_loss: 1.5865 - val_accuracy: 0.4377
Epoch 58/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4858 - accuracy: 0.4743 - val_loss: 1.5604 - val_accuracy: 0.4449
Epoch 59/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4837 - accuracy: 0.4745 - val_loss: 1.5733 - val_accuracy: 0.4451
Epoch 60/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4825 - accuracy: 0.4732 - val_loss: 1.5519 - val_accuracy: 0.4519
Epoch 61/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4802 - accuracy: 0.4738 - val_loss: 1.6047 - val_accuracy: 0.4249
Epoch 62/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4790 - accuracy: 0.4734 - val_loss: 1.6213 - val_accuracy: 0.4341
Epoch 63/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4749 - accuracy: 0.4787 - val_loss: 1.5718 - val_accuracy: 0.4447
Epoch 64/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4743 - accuracy: 0.4744 - val_loss: 1.6211 - val_accuracy: 0.4181
Epoch 65/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4705 - accuracy: 0.4776 - val_loss: 1.5992 - val_accuracy: 0.4454
Epoch 66/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4704 - accuracy: 0.4753 - val_loss: 1.6006 - val_accuracy: 0.4418
Epoch 67/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4679 - accuracy: 0.4808 - val_loss: 1.5763 - val_accuracy: 0.4438
Epoch 68/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4702 - accuracy: 0.4786 - val_loss: 1.6170 - val_accuracy: 0.4435
Epoch 69/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4661 - accuracy: 0.4763 - val_loss: 1.5853 - val_accuracy: 0.4381
Epoch 70/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4639 - accuracy: 0.4790 - val_loss: 1.5461 - val_accuracy: 0.4583
Epoch 71/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4614 - accuracy: 0.4821 - val_loss: 1.5972 - val_accuracy: 0.4349
Epoch 72/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4617 - accuracy: 0.4803 - val_loss: 1.5614 - val_accuracy: 0.4451
Epoch 73/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4609 - accuracy: 0.4797 - val_loss: 1.5435 - val_accuracy: 0.4581
Epoch 74/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4550 - accuracy: 0.4836 - val_loss: 1.5703 - val_accuracy: 0.4484
Epoch 75/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4534 - accuracy: 0.4819 - val_loss: 1.6055 - val_accuracy: 0.4230
Epoch 76/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4491 - accuracy: 0.4853 - val_loss: 1.5379 - val_accuracy: 0.4609
Epoch 77/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4525 - accuracy: 0.4850 - val_loss: 1.5882 - val_accuracy: 0.4395
Epoch 78/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4483 - accuracy: 0.4859 - val_loss: 1.5354 - val_accuracy: 0.4519
Epoch 79/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4476 - accuracy: 0.4868 - val_loss: 1.5558 - val_accuracy: 0.4502
Epoch 80/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4462 - accuracy: 0.4881 - val_loss: 1.5957 - val_accuracy: 0.4283
Epoch 81/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4465 - accuracy: 0.4875 - val_loss: 1.5976 - val_accuracy: 0.4265
Epoch 82/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4412 - accuracy: 0.4872 - val_loss: 1.5740 - val_accuracy: 0.4539
Epoch 83/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4423 - accuracy: 0.4882 - val_loss: 1.5424 - val_accuracy: 0.4482
Epoch 84/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4395 - accuracy: 0.4901 - val_loss: 1.5657 - val_accuracy: 0.4527
Epoch 85/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4374 - accuracy: 0.4898 - val_loss: 1.5821 - val_accuracy: 0.4400
Epoch 86/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4352 - accuracy: 0.4887 - val_loss: 1.5355 - val_accuracy: 0.4559
Epoch 87/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4325 - accuracy: 0.4909 - val_loss: 1.5663 - val_accuracy: 0.4445
Epoch 88/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4336 - accuracy: 0.4886 - val_loss: 1.5803 - val_accuracy: 0.4517
Epoch 89/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4333 - accuracy: 0.4908 - val_loss: 1.5887 - val_accuracy: 0.4470
Epoch 90/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4337 - accuracy: 0.4916 - val_loss: 1.5615 - val_accuracy: 0.4496
Epoch 91/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4297 - accuracy: 0.4909 - val_loss: 1.5893 - val_accuracy: 0.4411
Epoch 92/100
313/313 [==============================] - 1s 4ms/step - loss: 1.4267 - accuracy: 0.4934 - val_loss: 1.5940 - val_accuracy: 0.4385
Epoch 93/100
313/313 [==============================] - 1s 4ms/step - loss: 1.4287 - accuracy: 0.4904 - val_loss: 1.5847 - val_accuracy: 0.4457
Epoch 94/100
313/313 [==============================] - 1s 4ms/step - loss: 1.4260 - accuracy: 0.4946 - val_loss: 1.5986 - val_accuracy: 0.4404
Epoch 95/100
313/313 [==============================] - 1s 4ms/step - loss: 1.4230 - accuracy: 0.4927 - val_loss: 1.5516 - val_accuracy: 0.4576
Epoch 96/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4226 - accuracy: 0.4933 - val_loss: 1.5507 - val_accuracy: 0.4449
Epoch 97/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4215 - accuracy: 0.4940 - val_loss: 1.5242 - val_accuracy: 0.4610
Epoch 98/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4214 - accuracy: 0.4936 - val_loss: 1.6298 - val_accuracy: 0.4308
Epoch 99/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4191 - accuracy: 0.4955 - val_loss: 1.5531 - val_accuracy: 0.4500
Epoch 100/100
313/313 [==============================] - 1s 3ms/step - loss: 1.4159 - accuracy: 0.4959 - val_loss: 1.6001 - val_accuracy: 0.4522

history_dict = history.history
history_dict.keys()

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

plt.clf()   # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Models Using the Keras Functional API¶

#DFN5
nb_setup.images_hconcat(["DL_images/DFN5.png"], width=600)

The previous model was built using the Keras Sequential Layers Module, and it is compactly represented in Part (a) of Figure DFN5. As the name implies, the flow of data in this model is strictly sequential from left to right, with each layer working on the output of the previous layer in the model. However, there are important cases in which the data flow is not sequential, as shown in Parts (b) to (e) of the figure:

Part (b) shows an example of what is known as Model Ensembling, which is used to improve the performance of a network by averaging the outputs of multiple copies of the same network (this is discussed in more detail in Chapter Training Neural Networks Part 2). In order to implement this architecture, the same input is fed into multiple copies of the network with different initializations, and then the output from each is further processed to get the final output.
Part (c) shows an example of a model that features Residual Connections, which provide a path for a copy of the signal to travel to a deeper part of the network after bypassing the intervening layers. It is then added to the rest of the signal that propagates using the usual sequential processing modules. This mechanism has been shown to improve the flow of gradients during training, and has enabled the use of networks with hundreds of layers.
Part (d) shows an example of a model in which the final output is a function of multiple types of input datasets; in this case it depends on a mixture of tabular, image and text data. Each dataset is processed along its own branch (perhaps using different types of sub-networks, optimal for its type of data) and the branches are then combined to give the final output.
Part (e) is an example of Multi-Label Classification. In this case an image contains multiple objects, all of which have to be classified from the single image. This is done by having multiple output nodes, each of which makes a yes/no decision on the presence of one of the objects in the image.
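The difference between the multi-label output of Part (e) and the single-label softmax output used so far can be illustrated with a small sketch in plain Python. The logit values, the label names and the 0.5 threshold below are made up for illustration; the point is only that each output node applies its own sigmoid, so the yes/no decisions are independent and need not sum to one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from the last layer for the labels (cat, dog, car)
logits = [2.0, 1.5, -3.0]

# Multi-label: one independent sigmoid per output node, thresholded at 0.5
multi_label_probs = [sigmoid(z) for z in logits]
present = [p > 0.5 for p in multi_label_probs]

# Single-label: softmax makes the outputs compete, so they sum to one
single_label_probs = softmax(logits)

print(present)
print(sum(multi_label_probs), sum(single_label_probs))
```

In this toy case both "cat" and "dog" are reported as present, which a softmax output could not express since it assigns most of its mass to a single class.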

All these models use non-sequential flow of data, which can be modeled using the Keras Functional API. In this system, tensors are manipulated directly and layers are used as functions to take tensors and return tensors. As an example, we take the CIFAR-10 sequential model and recast it in functional form:

import keras
keras.__version__
from keras import Sequential, Model
from keras import layers
from keras import Input

from keras.datasets import cifar10

(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

train_images = train_images.reshape((50000, 32 * 32 * 3))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 32 * 32 * 3))
test_images = test_images.astype('float32') / 255

from tensorflow.keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

input_tensor = Input(shape=(32 * 32 * 3,))
x = layers.Dense(20, activation='relu')(input_tensor)
y = layers.Dense(15, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(y)

model = Model(input_tensor, output_tensor)

model.compile(optimizer='sgd',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, batch_size=128, validation_split=0.2)

Epoch 1/10
313/313 [==============================] - 1s 3ms/step - loss: 2.2097 - accuracy: 0.1641 - val_loss: 2.1336 - val_accuracy: 0.2005
Epoch 2/10
313/313 [==============================] - 1s 3ms/step - loss: 2.0446 - accuracy: 0.2392 - val_loss: 2.0668 - val_accuracy: 0.2506
Epoch 3/10
313/313 [==============================] - 1s 4ms/step - loss: 1.9528 - accuracy: 0.2894 - val_loss: 1.9645 - val_accuracy: 0.2721
Epoch 4/10
313/313 [==============================] - 1s 3ms/step - loss: 1.8951 - accuracy: 0.3194 - val_loss: 1.8963 - val_accuracy: 0.3048
Epoch 5/10
313/313 [==============================] - 1s 3ms/step - loss: 1.8515 - accuracy: 0.3381 - val_loss: 1.9140 - val_accuracy: 0.3155
Epoch 6/10
313/313 [==============================] - 1s 3ms/step - loss: 1.8180 - accuracy: 0.3519 - val_loss: 1.8279 - val_accuracy: 0.3437
Epoch 7/10
313/313 [==============================] - 1s 3ms/step - loss: 1.7938 - accuracy: 0.3600 - val_loss: 1.8138 - val_accuracy: 0.3574
Epoch 8/10
313/313 [==============================] - 1s 3ms/step - loss: 1.7727 - accuracy: 0.3693 - val_loss: 1.7951 - val_accuracy: 0.3522
Epoch 9/10
313/313 [==============================] - 1s 3ms/step - loss: 1.7517 - accuracy: 0.3777 - val_loss: 1.8063 - val_accuracy: 0.3621
Epoch 10/10
313/313 [==============================] - 1s 3ms/step - loss: 1.7375 - accuracy: 0.3801 - val_loss: 1.7434 - val_accuracy: 0.3808

model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 3072)]            0         
_________________________________________________________________
dense_3 (Dense)              (None, 20)                61460     
_________________________________________________________________
dense_4 (Dense)              (None, 15)                315       
_________________________________________________________________
dense_5 (Dense)              (None, 10)                160       
=================================================================
Total params: 61,935
Trainable params: 61,935
Non-trainable params: 0
_________________________________________________________________
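The parameter counts in the summary can be checked by hand: a Dense layer with n_in inputs and n_out outputs has n_in x n_out weights plus n_out biases. The short sketch below (the helper name dense_params is ours, not part of Keras) reproduces the three per-layer counts and the total.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Input size followed by the widths of the three Dense layers
layer_sizes = [32 * 32 * 3, 20, 15, 10]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # counts for dense_3, dense_4, dense_5
print(total)
```

Almost all of the parameters sit in the first layer, since it is the only one connected to the 3072-dimensional input.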

Exercise: Change the code in the model shown above so that it conforms to the architectures shown in Parts (b) and (c) of Figure DFN5.

Ingesting Data into Keras Models¶

#DFN6
nb_setup.images_hconcat(["DL_images/DFN6.png"], width=600)

In the Keras code that we have seen so far, the input data was already formatted into a tensor form that could be fed into the model directly. In practical applications the data exists in raw form that has to be processed before being fed into Keras. Some examples of this are shown in Figure DFN6 for three of the most common types of datasets:

Image Datasets: These usually exist as image files in the png or jpg format. These have to be converted into the RGB format and then cropped so that they all have the same dimensions.
Text Datasets: The words and characters in these datasets have to be pre-processed to remove unnecessary elements, and then vectorized.
TimeSeries Datasets: The data in this case is already in numerical form and so requires the least amount of pre-processing.

Recent releases of Keras provide a number of dataset pre-processing functions that can be used for this purpose. In particular:

For Image Datasets: image_dataset_from_directory function
For Text Datasets: text_dataset_from_directory function
For TimeSeries Datasets: timeseries_dataset_from_array function

In each of these cases, the dataset function creates the training, validation or test datasets by pairing the actual data with its appropriate labels. For the Image and Text datasets, where the data resides in directories, these labels are created by making use of the directory structure for the data, for example all images belonging to a particular category are placed in the same directory. Also note that in each of these cases, the training datasets are not pre-defined and stored, but instead are created on the fly during the training phase (the same applies for validation and test datasets). This has the following benefits:

For image datasets, this allows the possibility of modifying the image on the fly before feeding it into the model. This is an important technique that is used to improve the performance of image processing models.
For TimeSeries Datasets, this leads to a reduction in the amount of memory required to store the dataset, since there is usually a great deal of overlap between neighboring elements.
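The directory-based labelling described above can be sketched with the standard library alone. This is not the Keras implementation, only an illustration of the idea: each sub-directory name becomes a class, and every file inherits the integer index of the directory it sits in. The directory names pos and neg and the helper labeled_files are hypothetical, and the alphabetical ordering of class names is an assumption of this sketch (to our understanding Keras also orders class names alphanumerically).

```python
import tempfile
from pathlib import Path

def labeled_files(root):
    """Pair every file with the index of the sub-directory that holds it."""
    class_names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    pairs = []
    for label, name in enumerate(class_names):
        for f in sorted((Path(root) / name).iterdir()):
            pairs.append((f.name, label))
    return class_names, pairs

# Build a throwaway directory tree with one sub-directory per class
with tempfile.TemporaryDirectory() as root:
    for name, n_files in [("pos", 2), ("neg", 3)]:
        class_dir = Path(root) / name
        class_dir.mkdir()
        for i in range(n_files):
            (class_dir / f"{name}_{i}.txt").write_text("placeholder review")
    class_names, pairs = labeled_files(root)

print(class_names)
for file_name, label in pairs:
    print(file_name, label)
```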

If the data is already in the form of a tensor, and the objective is to predict a missing row element, then the Dataset.from_tensor_slices function is used to do the pairing between the data and the corresponding label (as shown in the following section).

Ingesting Tabular Data: Predicting a Missing Row Element¶

In this example we feed the model with data in tabular format with 303 rows. Each row of the table consists of features recorded for a single patient, with the last column (target) indicating whether the patient has heart disease (with a label of 1 in case they do); this is the heart.csv file downloaded below. The objective of the model is to predict this label from the patient features.

The following table has a description of each feature. Some of the features are numerical, but several other features are categorical, while one of the features is both numerical and categorical. In order to feed this data into the Neural Network, the categorical features are converted into one-hot encoded values.

#cancer_ds
nb_setup.images_hconcat(["DL_images/cancer_ds.png"], width=600)

import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

We start by downloading the data and storing it in a Pandas dataframe.

file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)
dataframe.shape

(303, 14)

A preview of the samples:

dataframe.head()

The data is randomly split into validation and training sets:

val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

print(
    "Using %d samples for training and %d for validation"
    % (len(train_dataframe), len(val_dataframe))
)

Using 242 samples for training and 61 for validation

The following procedure removes the target column from the dataframe to form the labels, and then invokes Dataset.from_tensor_slices to pair each label with the rest of the data in its row. This results in the formation of the training and validation datasets.

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)

Input: {'age': <tf.Tensor: shape=(), dtype=int64, numpy=62>, 'sex': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'cp': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=120>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=267>, 'fbs': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'restecg': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=99>, 'exang': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=1.8>, 'slope': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'ca': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'thal': <tf.Tensor: shape=(), dtype=string, numpy=b'reversible'>}
Target: tf.Tensor(0, shape=(), dtype=int64)

The datasets are batched:

train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
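Batching determines the number of steps that Keras reports per epoch, since the last batch is allowed to be smaller than the rest. The sketch below (the helper steps_per_epoch is ours) works this out for the 242 training and 61 validation rows used here, and also for the earlier CIFAR-10 runs, which trained on 80% of 50,000 images with a batch size of 128.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    # The final batch may be partially filled, hence the ceiling
    return math.ceil(num_samples / batch_size)

train_steps = steps_per_epoch(242, 32)
val_steps = steps_per_epoch(61, 32)
cifar_steps = steps_per_epoch(int(50000 * 0.8), 128)

print(train_steps, val_steps, cifar_steps)
```

The values for the training sets agree with the step counters shown in the training logs (8/8 for this dataset and 313/313 for CIFAR-10).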

We define the following two procedures for pre-processing the data:

The encode_numerical_feature procedure invokes the Normalization function to normalize a column containing numerical features.
The encode_categorical_feature procedure converts categorical features into either integers or one-hot encoded features (we use the latter in this example). If the categorical feature is an integer, then the IntegerLookup function is invoked, and if it is a string, then the StringLookup function is invoked. Both these functions use a table lookup method to do the mapping.
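What the two kinds of encoding compute can be shown with a plain-Python sketch. It is a simplification and not the Keras implementation: the helper names and the sample values are made up, the vocabulary is sorted alphabetically for readability, and index 0 is reserved for values not seen while fitting (which, to our understanding, mirrors the default out-of-vocabulary slot of the lookup layers).

```python
import statistics

def fit_normalizer(values):
    """Learn mean and (population) standard deviation, return a scaler."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return lambda v: (v - mean) / std

def fit_one_hot(values):
    """Learn a lookup table, return a function producing one-hot vectors."""
    vocab = sorted(set(values))
    index = {v: i + 1 for i, v in enumerate(vocab)}  # slot 0: unseen values
    size = len(vocab) + 1

    def encode(v):
        vec = [0] * size
        vec[index.get(v, 0)] = 1
        return vec

    return encode

# Made-up training columns
ages = [50, 60, 70]
thal_values = ["normal", "fixed", "reversible", "normal"]

normalize_age = fit_normalizer(ages)
encode_thal = fit_one_hot(thal_values)

print(normalize_age(60), normalize_age(70))
print(encode_thal("reversible"))
print(encode_thal("unknown"))
```

As in the procedures that follow, the statistics and the lookup table are learnt from the training data only and are then applied unchanged to any new input.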

from tensorflow.keras.layers import IntegerLookup
from tensorflow.keras.layers import Normalization
from tensorflow.keras.layers import StringLookup

def encode_numerical_feature(feature, name, dataset):
    # create a Normalization layer for our feature
    normalizer = Normalization()

    # Prepare a Dataset that only yields our feature
    # expand_dims returns a tensor with a length 1 axis inserted at index axis.
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the statistics of the data
    normalizer.adapt(feature_ds)

    # Normalize the input feature
    encoded_feature = normalizer(feature)
    return encoded_feature


def encode_categorical_feature(feature, name, dataset, is_string):
    lookup_class = StringLookup if is_string else IntegerLookup
    # Create a lookup layer which will turn strings into integer indices
    lookup = lookup_class(output_mode="one_hot")

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the set of possible string values and assign them a fixed integer index
    lookup.adapt(feature_ds)

    # Turn the string input into one-hot indices
    encoded_feature = lookup(feature)
    return encoded_feature

# Categorical features encoded as integers
sex = keras.Input(shape=(1,), name="sex", dtype="int64")
cp = keras.Input(shape=(1,), name="cp", dtype="int64")
fbs = keras.Input(shape=(1,), name="fbs", dtype="int64")
restecg = keras.Input(shape=(1,), name="restecg", dtype="int64")
exang = keras.Input(shape=(1,), name="exang", dtype="int64")
ca = keras.Input(shape=(1,), name="ca", dtype="int64")

# Categorical feature encoded as string
thal = keras.Input(shape=(1,), name="thal", dtype="string")

# Numerical features
age = keras.Input(shape=(1,), name="age")
trestbps = keras.Input(shape=(1,), name="trestbps")
chol = keras.Input(shape=(1,), name="chol")
thalach = keras.Input(shape=(1,), name="thalach")
oldpeak = keras.Input(shape=(1,), name="oldpeak")
slope = keras.Input(shape=(1,), name="slope")

all_inputs = [
    sex,
    cp,
    fbs,
    restecg,
    exang,
    ca,
    thal,
    age,
    trestbps,
    chol,
    thalach,
    oldpeak,
    slope,
]

# Integer categorical features
sex_encoded = encode_categorical_feature(sex, "sex", train_ds, False)
cp_encoded = encode_categorical_feature(cp, "cp", train_ds, False)
fbs_encoded = encode_categorical_feature(fbs, "fbs", train_ds, False)
restecg_encoded = encode_categorical_feature(restecg, "restecg", train_ds, False)
exang_encoded = encode_categorical_feature(exang, "exang", train_ds, False)
ca_encoded = encode_categorical_feature(ca, "ca", train_ds, False)

# String categorical features
thal_encoded = encode_categorical_feature(thal, "thal", train_ds, True)

# Numerical features
age_encoded = encode_numerical_feature(age, "age", train_ds)
trestbps_encoded = encode_numerical_feature(trestbps, "trestbps", train_ds)
chol_encoded = encode_numerical_feature(chol, "chol", train_ds)
thalach_encoded = encode_numerical_feature(thalach, "thalach", train_ds)
oldpeak_encoded = encode_numerical_feature(oldpeak, "oldpeak", train_ds)
slope_encoded = encode_numerical_feature(slope, "slope", train_ds)

all_features = layers.concatenate(
    [
        sex_encoded,
        cp_encoded,
        fbs_encoded,
        restecg_encoded,
        exang_encoded,
        slope_encoded,
        ca_encoded,
        thal_encoded,
        age_encoded,
        trestbps_encoded,
        chol_encoded,
        thalach_encoded,
        oldpeak_encoded,
    ]
)
x = layers.Dense(32, activation="relu")(all_features)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])

history = model.fit(train_ds, epochs=50, validation_data=val_ds)

Epoch 1/50
8/8 [==============================] - 1s 45ms/step - loss: 0.6270 - accuracy: 0.6364 - val_loss: 0.5491 - val_accuracy: 0.7705
Epoch 2/50
8/8 [==============================] - 0s 6ms/step - loss: 0.5883 - accuracy: 0.6736 - val_loss: 0.5141 - val_accuracy: 0.8033
Epoch 3/50
8/8 [==============================] - 0s 5ms/step - loss: 0.5707 - accuracy: 0.7107 - val_loss: 0.4863 - val_accuracy: 0.8361
Epoch 4/50
8/8 [==============================] - 0s 5ms/step - loss: 0.5493 - accuracy: 0.7355 - val_loss: 0.4640 - val_accuracy: 0.8197
Epoch 5/50
8/8 [==============================] - 0s 9ms/step - loss: 0.5206 - accuracy: 0.7438 - val_loss: 0.4456 - val_accuracy: 0.8033
Epoch 6/50
8/8 [==============================] - 0s 6ms/step - loss: 0.4862 - accuracy: 0.7314 - val_loss: 0.4322 - val_accuracy: 0.8197
Epoch 7/50
8/8 [==============================] - 0s 5ms/step - loss: 0.4894 - accuracy: 0.7727 - val_loss: 0.4204 - val_accuracy: 0.8197
Epoch 8/50
8/8 [==============================] - 0s 5ms/step - loss: 0.4617 - accuracy: 0.7479 - val_loss: 0.4113 - val_accuracy: 0.8361
Epoch 9/50
8/8 [==============================] - 0s 5ms/step - loss: 0.4821 - accuracy: 0.7521 - val_loss: 0.4034 - val_accuracy: 0.8197
Epoch 10/50
8/8 [==============================] - 0s 6ms/step - loss: 0.4646 - accuracy: 0.7603 - val_loss: 0.3969 - val_accuracy: 0.8197
Epoch 11/50
8/8 [==============================] - 0s 8ms/step - loss: 0.4360 - accuracy: 0.7934 - val_loss: 0.3925 - val_accuracy: 0.8197
Epoch 12/50
8/8 [==============================] - 0s 5ms/step - loss: 0.4386 - accuracy: 0.7934 - val_loss: 0.3881 - val_accuracy: 0.8033
Epoch 13/50
8/8 [==============================] - 0s 5ms/step - loss: 0.4192 - accuracy: 0.8099 - val_loss: 0.3842 - val_accuracy: 0.8033
Epoch 14/50
8/8 [==============================] - 0s 6ms/step - loss: 0.4363 - accuracy: 0.7686 - val_loss: 0.3810 - val_accuracy: 0.8033
Epoch 15/50
8/8 [==============================] - 0s 6ms/step - loss: 0.4059 - accuracy: 0.8182 - val_loss: 0.3772 - val_accuracy: 0.8033
Epoch 16/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3886 - accuracy: 0.8306 - val_loss: 0.3749 - val_accuracy: 0.8033
Epoch 17/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3827 - accuracy: 0.8306 - val_loss: 0.3729 - val_accuracy: 0.8197
Epoch 18/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3630 - accuracy: 0.8595 - val_loss: 0.3710 - val_accuracy: 0.8361
Epoch 19/50
8/8 [==============================] - 0s 7ms/step - loss: 0.3643 - accuracy: 0.8182 - val_loss: 0.3698 - val_accuracy: 0.8361
Epoch 20/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3738 - accuracy: 0.8430 - val_loss: 0.3689 - val_accuracy: 0.8361
Epoch 21/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3526 - accuracy: 0.8595 - val_loss: 0.3678 - val_accuracy: 0.8197
Epoch 22/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3438 - accuracy: 0.8512 - val_loss: 0.3671 - val_accuracy: 0.8197
Epoch 23/50
8/8 [==============================] - 0s 5ms/step - loss: 0.3805 - accuracy: 0.8388 - val_loss: 0.3667 - val_accuracy: 0.8197
Epoch 24/50
8/8 [==============================] - 0s 7ms/step - loss: 0.3658 - accuracy: 0.8512 - val_loss: 0.3665 - val_accuracy: 0.8197
Epoch 25/50
8/8 [==============================] - 0s 5ms/step - loss: 0.3304 - accuracy: 0.8636 - val_loss: 0.3667 - val_accuracy: 0.8197
Epoch 26/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3110 - accuracy: 0.8760 - val_loss: 0.3668 - val_accuracy: 0.8197
Epoch 27/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3539 - accuracy: 0.8388 - val_loss: 0.3662 - val_accuracy: 0.8197
Epoch 28/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3512 - accuracy: 0.8430 - val_loss: 0.3663 - val_accuracy: 0.8197
Epoch 29/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3358 - accuracy: 0.8388 - val_loss: 0.3655 - val_accuracy: 0.8197
Epoch 30/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3201 - accuracy: 0.8554 - val_loss: 0.3656 - val_accuracy: 0.8197
Epoch 31/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3294 - accuracy: 0.8719 - val_loss: 0.3650 - val_accuracy: 0.8197
Epoch 32/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3302 - accuracy: 0.8430 - val_loss: 0.3658 - val_accuracy: 0.8197
Epoch 33/50
8/8 [==============================] - 0s 5ms/step - loss: 0.3125 - accuracy: 0.8636 - val_loss: 0.3670 - val_accuracy: 0.8033
Epoch 34/50
8/8 [==============================] - 0s 9ms/step - loss: 0.3359 - accuracy: 0.8512 - val_loss: 0.3684 - val_accuracy: 0.8033
Epoch 35/50
8/8 [==============================] - 0s 5ms/step - loss: 0.3145 - accuracy: 0.8554 - val_loss: 0.3697 - val_accuracy: 0.8033
Epoch 36/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3145 - accuracy: 0.8512 - val_loss: 0.3706 - val_accuracy: 0.7869
Epoch 37/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3019 - accuracy: 0.8843 - val_loss: 0.3715 - val_accuracy: 0.7869
Epoch 38/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3053 - accuracy: 0.8595 - val_loss: 0.3721 - val_accuracy: 0.7869
Epoch 39/50
8/8 [==============================] - 0s 6ms/step - loss: 0.2955 - accuracy: 0.8760 - val_loss: 0.3723 - val_accuracy: 0.7869
Epoch 40/50
8/8 [==============================] - 0s 5ms/step - loss: 0.2824 - accuracy: 0.8926 - val_loss: 0.3729 - val_accuracy: 0.7869
Epoch 41/50
8/8 [==============================] - 0s 6ms/step - loss: 0.2906 - accuracy: 0.8719 - val_loss: 0.3745 - val_accuracy: 0.7869
Epoch 42/50
8/8 [==============================] - 0s 6ms/step - loss: 0.2896 - accuracy: 0.8926 - val_loss: 0.3746 - val_accuracy: 0.7869
Epoch 43/50
8/8 [==============================] - 0s 6ms/step - loss: 0.2975 - accuracy: 0.8760 - val_loss: 0.3754 - val_accuracy: 0.7869
Epoch 44/50
8/8 [==============================] - 0s 6ms/step - loss: 0.3036 - accuracy: 0.8595 - val_loss: 0.3752 - val_accuracy: 0.7869
Epoch 45/50
8/8 [==============================] - 0s 6ms/step - loss: 0.2879 - accuracy: 0.8802 - val_loss: 0.3751 - val_accuracy: 0.7869
Epoch 46/50
8/8 [==============================] - 0s 5ms/step - loss: 0.2967 - accuracy: 0.8554 - val_loss: 0.3752 - val_accuracy: 0.7869
Epoch 47/50
8/8 [==============================] - 0s 5ms/step - loss: 0.2827 - accuracy: 0.8760 - val_loss: 0.3755 - val_accuracy: 0.7869
Epoch 48/50
8/8 [==============================] - 0s 5ms/step - loss: 0.2819 - accuracy: 0.8636 - val_loss: 0.3766 - val_accuracy: 0.7869
Epoch 49/50
8/8 [==============================] - 0s 5ms/step - loss: 0.2868 - accuracy: 0.8967 - val_loss: 0.3767 - val_accuracy: 0.7869
Epoch 50/50
8/8 [==============================] - 0s 5ms/step - loss: 0.2809 - accuracy: 0.8802 - val_loss: 0.3777 - val_accuracy: 0.7869

history_dict = history.history
history_dict.keys()

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

plt.clf()   # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Ingesting Tabular Data: Time Series Analysis¶

An important class of problems in Machine Learning has to do with predicting the next term in a sequence. In this case the DLN has to learn the pattern of data as it evolves in time, an example of which is shown in Figure DFN8 (this example is taken from Section 10.2 of Chollet; the dataset can be downloaded from http://www.bgc-jena.mpg.de/wetter/). The table consists of 14 pieces of meteorological data collected once every 10 minutes over the course of several years (the file used below covers 2009 to 2016). The objective is to predict the temperature (which is one of the variables) one day in the future, based on all the data collected over a time period called lookback, which spans several days.

#DFN8
nb_setup.images_hconcat(["DL_images/DFN8.png"], width=600)

Once again we start by reading in the data file and storing the information in the data structure 'lines'. Since most of the processing is very similar to that described for the previous example, we will limit our comments to places where the two are different.

import tensorflow 
import keras
keras.__version__
from keras import models
from keras import layers

import os

#data_dir = '/home/ubuntu/data/'
data_dir = '/Users/subirvarma/handson-ml/datasets/'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

with open(fname) as f:
    data = f.read()

lines = data.split("\n")
header = lines[0].split(",")
lines = lines[1:]
print(header)
print(len(lines))

['Date Time', 'p (mbar)', 'T (degC)', 'Tpot (K)', 'Tdew (degC)', 'rh (%)', 'VPmax (mbar)', 'VPact (mbar)', 'VPdef (mbar)', 'sh (g/kg)', 'H2OC (mmol/mol)', 'rho (g/m**3)', 'wv (m/s)', 'max. wv (m/s)', 'wd (deg)']
420551

import numpy as np
temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(",")[1:]]
    temperature[i] = values[1]
    raw_data[i, :] = values[:]

The variation of temperature with time is plotted below. There is a periodicity in this pattern that reflects the variation of temperature over the course of a year.

from matplotlib import pyplot as plt

temp = raw_data[:, 1]  # temperature (in degrees Celsius), not yet normalized
plt.plot(range(len(temp)), temp)
plt.show()

Half the data in the table is used for training, and the remainder is split equally between validation and testing.

num_train_samples = int(0.5 * len(raw_data))
num_val_samples = int(0.25 * len(raw_data))
num_test_samples = len(raw_data) - num_train_samples - num_val_samples
print("num_train_samples:", num_train_samples)
print("num_val_samples:", num_val_samples)
print("num_test_samples:", num_test_samples)

num_train_samples: 210275
num_val_samples: 105137
num_test_samples: 105139

In the next step we normalize all the variables individually by subtracting their mean and dividing by their standard deviation (both computed on the training portion of the data only). Normalization is discussed in greater detail in Chapter Gradient Descent Techniques; it equalizes variables whose values are very different in magnitude, such as temperature and pressure, which improves the training process.

mean = raw_data[:num_train_samples].mean(axis=0)
raw_data -= mean
std = raw_data[:num_train_samples].std(axis=0)
raw_data /= std

We invoke the timeseries_dataset_from_array procedure to create the training, validation and test datasets for the model. Note the following:

Since the sampling_rate is set to 6, the model uses a single sample of the data per hour.
The model uses a sequence of the prior 120 hours of data (i.e. 5 days) in order to predict the temperature 24 hours after the end of the sequence.
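The way the sampling rate, the sequence length and the delay fit together is easier to see on a toy series. The sketch below is a plain-Python imitation of the windowing idea, not the Keras function itself: the helper make_windows and the small parameter values are ours, and the toy series holds its own indices so that the alignment between a window and its target can be read off directly.

```python
def make_windows(data, targets, sequence_length, sampling_rate):
    """Return (window, target) pairs; window i starts at raw index i."""
    span = (sequence_length - 1) * sampling_rate  # raw steps covered by a window
    pairs = []
    for i in range(len(data) - span):
        window = data[i : i + span + 1 : sampling_rate]
        pairs.append((window, targets[i]))
    return pairs

# Toy series whose values equal their own index
series = list(range(20))
sampling_rate = 2     # keep every 2nd raw sample
sequence_length = 3   # samples per window
horizon = 1           # predict 1 sampled step past the end of the window
delay = sampling_rate * (sequence_length + horizon - 1)

# Same slicing pattern as in the text: data[:-delay] paired with targets[delay:]
pairs = make_windows(series[:-delay], series[delay:], sequence_length, sampling_rate)
first_window, first_target = pairs[0]

print(first_window, first_target)
print(len(pairs))
```

Here the first window holds raw indices 0, 2 and 4 and its target is index 6, one sampled step after the end of the window. With the values used in this section, the delay is 6 x (120 + 24 - 1) = 858 raw steps, while the last sample of a window sits 6 x 119 = 714 raw steps after its start; the difference of 144 raw steps is exactly 24 hours at one reading every 10 minutes.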

sampling_rate = 6
sequence_length = 120
delay = sampling_rate * (sequence_length + 24 - 1)
batch_size = 256

train_dataset = tensorflow.keras.utils.timeseries_dataset_from_array(
    raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=sequence_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=0,
    end_index=num_train_samples)

val_dataset = tensorflow.keras.utils.timeseries_dataset_from_array(
    raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=sequence_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=num_train_samples,
    end_index=num_train_samples + num_val_samples)

test_dataset = tensorflow.keras.utils.timeseries_dataset_from_array(
    raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=sequence_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=num_train_samples + num_val_samples)

We are going to use a Dense Feed Forward Network to process the data, hence the 2D input tensor has to be flattened into a 1D shape using the Flatten layer in Keras before it can be fed into the rest of the network.

from keras.models import Sequential
from keras import layers
from tensorflow.keras.optimizers import RMSprop

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Flatten()(inputs)
x = layers.Dense(16, activation="relu")(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
                    epochs=10,
                    validation_data=val_dataset)

Epoch 1/10
819/819 [==============================] - 27s 33ms/step - loss: 15.1945 - mae: 2.9845 - val_loss: 11.1979 - val_mae: 2.6432
Epoch 2/10
819/819 [==============================] - 26s 32ms/step - loss: 9.3913 - mae: 2.4082 - val_loss: 10.4822 - val_mae: 2.5596
Epoch 3/10
819/819 [==============================] - 26s 31ms/step - loss: 8.6098 - mae: 2.3082 - val_loss: 10.2772 - val_mae: 2.5387
Epoch 4/10
819/819 [==============================] - 26s 32ms/step - loss: 8.1438 - mae: 2.2450 - val_loss: 12.3309 - val_mae: 2.7806
Epoch 5/10
819/819 [==============================] - 27s 33ms/step - loss: 7.8681 - mae: 2.2087 - val_loss: 11.7845 - val_mae: 2.7328
Epoch 6/10
819/819 [==============================] - 27s 33ms/step - loss: 7.6416 - mae: 2.1800 - val_loss: 10.3400 - val_mae: 2.5350
Epoch 7/10
819/819 [==============================] - 28s 33ms/step - loss: 7.4663 - mae: 2.1557 - val_loss: 10.6747 - val_mae: 2.5802
Epoch 8/10
819/819 [==============================] - 27s 32ms/step - loss: 7.2823 - mae: 2.1296 - val_loss: 10.7393 - val_mae: 2.5840
Epoch 9/10
819/819 [==============================] - 26s 32ms/step - loss: 7.1593 - mae: 2.1128 - val_loss: 11.3575 - val_mae: 2.6641
Epoch 10/10
819/819 [==============================] - 26s 32ms/step - loss: 7.0561 - mae: 2.0997 - val_loss: 11.3461 - val_mae: 2.6645

Ingesting Text Data¶

This example is taken from Section 11.3 in Chollet.

The objective of this model is to classify movie reviews as either positive or negative, given the text of the review. The IMDB dataset which is used in this example can be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/. There are a total of 50,000 reviews in the dataset, of which we will use 25,000 reviews for training and the remainder for testing. Of all the reviews in this dataset, 50% are positive.

We will go through the following steps in pre-processing the text data before it can be fed into the Neural Network:

Map the 20,000 most frequently occurring words in the reviews to integers such that each review becomes a vector (of variable length). Standardize the size of each review to a fixed length, say 600, by padding and truncating.
Map each word in the review to an embedding vector, so now each review assumes a 2D shape (600, embedding_size)
Flatten over all the word embeddings in a review, resulting in a 1D input tensor of shape (600 x embedding_size), and feed this into the Dense Feed Forward Neural Network

The downloaded dataset already has the reviews sorted into Training and Test directories, and within each directory they are further sorted into negative and positive reviews. The code below also reads a validation directory (val), which is assumed to have been set aside from the training reviews beforehand. We create the datasets with the Keras utility text_dataset_from_directory, which yields batches made up of two parts:

The review texts, with each item containing a single review
The corresponding labels for each review. The label value is inferred from the directory in which the review is placed

import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = '/Users/subirvarma/handson-ml/datasets/aclImdb'
#val_dir = base_dir/"val"
#train_dir = base_dir/"train"

train_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 70000 files belonging to 3 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
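
The first line of the output above reports 70,000 training files in 3 classes rather than 2. A likely explanation (an assumption, since the directory listing is not shown here) is that the folder of unlabeled reviews that ships with the download, named unsup, is still present under train, so it is treated as a third class. The sketch below shows one way to remove such a folder before building the datasets. The helper name drop_unlabeled_reviews is hypothetical, and the demonstration runs on a throwaway directory tree rather than on the real dataset.

```python
import pathlib
import shutil
import tempfile

def drop_unlabeled_reviews(train_dir, unlabeled_name="unsup"):
    """Delete the unlabeled sub-directory, if present, and return the remaining class folders."""
    train_dir = pathlib.Path(train_dir)
    unlabeled = train_dir / unlabeled_name
    if unlabeled.is_dir():
        shutil.rmtree(unlabeled)
    return sorted(p.name for p in train_dir.iterdir() if p.is_dir())

# Demonstration on a temporary directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    demo_train = pathlib.Path(tmp) / "train"
    for name in ("neg", "pos", "unsup"):
        (demo_train / name).mkdir(parents=True)
    remaining_classes = drop_unlabeled_reviews(demo_train)

print(remaining_classes)
```

With only two class folders left, the labels inferred from the directory names would be restricted to 0 and 1, which is what the binary cross-entropy loss used later expects.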

#DFN10
nb_setup.images_hconcat(["DL_images/DFN10.png"], width=600)

The next piece of code invokes the TextVectorization layer, which takes each review and converts it from text to integers. It does so by restricting the vocabulary to the 20,000 most frequently occurring words (specified by the parameter max_tokens), and then mapping each word to a unique integer in the range 0 to 19,999 (after removing all punctuation). Index 0 is reserved for padding and index 1 for words that fall outside the vocabulary. It furthermore truncates each review to a maximum of max_length = 600 words, and pads the reviews with fewer than 600 words with zeros. The resulting 2D array is illustrated in Figure DFN10.
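
To make the mapping concrete, here is a minimal pure-Python sketch of the same idea: build a vocabulary from the most frequent words, map words to integers, then truncate or pad to a fixed length. It is an illustration only, not the Keras implementation. The helper names (tokenize, build_vocabulary, vectorize) and the toy reviews are made up for this example, and the convention of 0 for padding and 1 for unknown words is assumed to mirror the layer's behavior.

```python
from collections import Counter
import string

def tokenize(text):
    """Lowercase, strip punctuation and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocabulary(texts, max_tokens):
    """Map the most frequent words to integers; 0 is padding and 1 is 'unknown word'."""
    counts = Counter(word for text in texts for word in tokenize(text))
    # Two indices are reserved, so at most max_tokens - 2 real words are kept
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}

def vectorize(text, vocabulary, max_length):
    """Convert a review to a fixed-length list of integers."""
    ids = [vocabulary.get(word, 1) for word in tokenize(text)]
    ids = ids[:max_length]                        # truncate long reviews
    return ids + [0] * (max_length - len(ids))    # pad short reviews with zeros

toy_reviews = ["A great, great movie!", "Not a great plot."]
toy_vocab = build_vocabulary(toy_reviews, max_tokens=4)
toy_vectors = [vectorize(r, toy_vocab, max_length=6) for r in toy_reviews]
print(toy_vocab)     # {'great': 2, 'a': 3}
print(toy_vectors)   # [[3, 2, 2, 1, 0, 0], [1, 3, 2, 1, 0, 0]]
```

In the toy output, the words movie, not and plot fall outside the two-word vocabulary and are all mapped to 1, while the trailing zeros are padding.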

from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

The Dense Feed Forward network model is defined next. Note that the first layer of the model is an Embedding Layer. Its function is to take a sample of the input review, which is a 1D array of shape max_length (see Figure DFN9), and convert it into a 2D array of shape (max_length, embedding_dim). Hence after this transformation, each review is represented by a matrix, which is then fed into the rest of the network. Since a Dense Feed Forward network can only accept 1D vectors, the matrix is flattened before it is forwarded on, as shown in Figure DFN9.
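
The shape changes described above can be traced with a small NumPy sketch. This only illustrates the lookup-then-flatten idea under the sizes used in this section. The embedding table and the review indices are random placeholders, not trained weights or real data.

```python
import numpy as np

max_length, max_tokens, embedding_dim = 600, 20000, 100
rng = np.random.default_rng(0)

# Stand-in for the trainable embedding table (random values, for illustration only)
embedding_table = rng.normal(size=(max_tokens, embedding_dim))

# Stand-in for one vectorized review: 600 word indices
review_ids = rng.integers(0, max_tokens, size=max_length)

embedded = embedding_table[review_ids]   # one row of the table per word
flattened = embedded.reshape(-1)         # all rows laid end to end

print(embedded.shape)    # (600, 100)
print(flattened.shape)   # (60000,)
```

The embedding step is just a row lookup, so the table has max_tokens x embedding_dim entries, and the flattened vector of length 60,000 is what the first Dense layer receives.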

#DFN9
nb_setup.images_hconcat(["DL_images/DFN9.png"], width=600)

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

embedding_dim = 100

model = Sequential()
model.add(Embedding(max_tokens, embedding_dim, input_length=max_length))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 600, 100)          2000000   
_________________________________________________________________
flatten_3 (Flatten)          (None, 60000)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 32)                1920032   
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 33        
=================================================================
Total params: 3,920,065
Trainable params: 3,920,065
Non-trainable params: 0
_________________________________________________________________
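
The parameter counts in the summary above can be checked by hand. The short sketch below redoes the arithmetic, assuming each Dense layer has one weight per input-unit pair plus one bias per unit, and that the Embedding layer stores one vector per vocabulary entry.

```python
max_tokens, embedding_dim, max_length = 20000, 100, 600

# Embedding layer: one embedding_dim-sized vector per word index
embedding_params = max_tokens * embedding_dim

# First Dense layer: 60000 flattened inputs feeding 32 units, plus 32 biases
flattened_size = max_length * embedding_dim
hidden_params = flattened_size * 32 + 32

# Output layer: 32 inputs feeding 1 unit, plus 1 bias
output_params = 32 * 1 + 1

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
# 2000000 1920032 33 3920065
```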

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10)

Epoch 1/10
2188/2188 [==============================] - 81s 37ms/step - loss: -1713891.7500 - acc: 0.1429 - val_loss: 5708997.5000 - val_acc: 0.5000
Epoch 2/10
2188/2188 [==============================] - 80s 36ms/step - loss: -22681902.0000 - acc: 0.1429 - val_loss: 41613524.0000 - val_acc: 0.5000
Epoch 3/10
2188/2188 [==============================] - 86s 39ms/step - loss: -94023472.0000 - acc: 0.1429 - val_loss: 135895872.0000 - val_acc: 0.5000
Epoch 4/10
2188/2188 [==============================] - 75s 34ms/step - loss: -247544624.0000 - acc: 0.1429 - val_loss: 316518464.0000 - val_acc: 0.5000
Epoch 5/10
2188/2188 [==============================] - 75s 34ms/step - loss: -515071904.0000 - acc: 0.1429 - val_loss: 611338240.0000 - val_acc: 0.5000
Epoch 6/10
2188/2188 [==============================] - 84s 38ms/step - loss: -928301440.0000 - acc: 0.1429 - val_loss: 1048544256.0000 - val_acc: 0.5000
Epoch 7/10
2188/2188 [==============================] - 80s 37ms/step - loss: -1518881280.0000 - acc: 0.1429 - val_loss: 1654583680.0000 - val_acc: 0.5000
Epoch 8/10
2188/2188 [==============================] - 74s 34ms/step - loss: -2318575872.0000 - acc: 0.1429 - val_loss: 2460200192.0000 - val_acc: 0.5000
Epoch 9/10
2188/2188 [==============================] - 86s 39ms/step - loss: -3360411392.0000 - acc: 0.1429 - val_loss: 3492725504.0000 - val_acc: 0.5000
Epoch 10/10
2188/2188 [==============================] - 228s 104ms/step - loss: -4674383872.0000 - acc: 0.1429 - val_loss: 4777116160.0000 - val_acc: 0.5000

<keras.callbacks.History at 0x155c8bd68>

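Exercise: The performance of this model is not very good. Part of the reason for this is that Dense Feed Forward Networks are not very well suited for processing sequences such as those that arise in NLP. Later in this book we will study other models such as Recurrent Neural Networks, LSTMs and Transformers that are better at this task. Note also that the log above shows a negative loss and a training accuracy stuck at 0.1429, which is what one would expect when a third label value reaches the binary cross-entropy loss (the training directory was read as 3 classes), so the input data should be checked before drawing conclusions about the model.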

Try out the following changes to see whether any of them help improve it:

Replace the Flatten Layer with other ways in which the word embeddings in a review can be combined, for example Averaging (look up the section on Merge Layers in Keras documentation).
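
As a starting point for this exercise, the NumPy sketch below contrasts flattening with averaging over the word axis. It uses random placeholder values rather than trained embeddings, and it is only one possible way to combine the embeddings. In Keras, the GlobalAveragePooling1D layer performs this kind of average over the sequence dimension.

```python
import numpy as np

max_length, embedding_dim, hidden_units = 600, 100, 32
rng = np.random.default_rng(0)

# Placeholder for one embedded review of shape (600, 100)
embedded_review = rng.normal(size=(max_length, embedding_dim))

flattened = embedded_review.reshape(-1)    # keeps every word vector: 60000 values
averaged = embedded_review.mean(axis=0)    # one mean vector: 100 values

# Size of the first Dense layer (weights plus biases) for each choice
flatten_dense_params = flattened.size * hidden_units + hidden_units
average_dense_params = averaged.size * hidden_units + hidden_units

print(flattened.shape, flatten_dense_params)   # (60000,) 1920032
print(averaged.shape, average_dense_params)    # (100,) 3232
```

Averaging discards word order but shrinks the first Dense layer considerably, which is one reason it is worth trying here.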

Ingesting Image Data¶

This example is taken from Section 8.2 of Chollet. The dataset is from a Kaggle competition and consists of 25,000 labeled images, evenly divided between those of cats and dogs (the dataset can be downloaded from https://www.kaggle.com/c/dogs-vs-cats/data). The first step after downloading the data is to split up the images into training, validation and test directories, and furthermore within each of these, create separate sub-directories for cat and dog images (see Figure DFN13). The images in this dataset are labeled using their file names, which cannot be directly used during the training process. The Keras dataset utility assumes that each image category occupies its own sub-directory and then generates the training labels by making use of this information.

See the example in Chollet on how to create and populate the directory structure. We will assume that this step has already been done.

#DFN13
nb_setup.images_hconcat(["DL_images/DFN13.png"], width=600)

import os, shutil, pathlib

train_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/train'
validation_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/validation'

train_cats_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/train/cats'
train_dogs_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/train/dogs'

validation_cats_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/validation/cats'
validation_dogs_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/validation/dogs'

new_base_dir = pathlib.Path("/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small")

We will use only a small subset of the images in the dataset: 2000 images for training and 1000 for validation.

print('total training cat images:', len(os.listdir(train_cats_dir)))
print('total training dog images:', len(os.listdir(train_dogs_dir)))

total training cat images: 1000
total training dog images: 1000

print('total validation cat images:', len(os.listdir(validation_cats_dir)))
print('total validation dog images:', len(os.listdir(validation_dogs_dir)))

total validation cat images: 500
total validation dog images: 500

We now make use of the Keras utility called image_dataset_from_directory which carries out the following functions:

Decodes the jpg images into the RGB format
Converts these into floating point tensors
Resizes the images so that they all have a size of (150,150) pixels, giving tensors of shape (150,150,3)
Creates batches of image data (of size 20 in this case) and stores them in tensors of shape (20,150,150,3), which are then fed into the model during execution.
Creates labels for each image. With the default settings used here, this is done by assigning a different integer label (0 or 1) to the images in each sub-directory, so that a batch of labels has shape (20,).

Another Dataset is defined for the Validation data, with exactly the same structure.

from tensorflow.keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    new_base_dir / "train",
    image_size=(150, 150),
    batch_size=20)
validation_dataset = image_dataset_from_directory(
    new_base_dir / "validation",
    image_size=(150, 150),
    batch_size=20)

Found 2000 files belonging to 2 classes.
Found 1000 files belonging to 2 classes.

The output of one of these Dataset objects looks as follows:

for data_batch, labels_batch in train_dataset:
    print('data batch shape:', data_batch.shape)
    print('labels batch shape:', labels_batch.shape)
    break

data batch shape: (20, 150, 150, 3)
labels batch shape: (20,)

We now define a Dense Feed Forward model with four hidden layers to process the data. Since this model can only process 1D tensors, we use the Flatten layer to convert the (150,150,3) image tensor into a 1D tensor of size 67,500. The model has more than 4 million parameters, almost all of which are concentrated in the first layer of weights, due to the large number of image pixels.
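
The claim that almost all of the parameters sit in the first layer can be checked with a few lines of arithmetic. The sketch below assumes the layer widths described above (a flattened input, four hidden layers of 64 units, and one output unit) and counts weights plus biases for each Dense layer.

```python
# Layer widths: flattened image, four hidden layers, one output unit
layer_sizes = [150 * 150 * 3, 64, 64, 64, 64, 1]

# Each Dense layer has (inputs x units) weights plus one bias per unit
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)
first_layer_share = params_per_layer[0] / total_params

print(layer_sizes[0])       # 67500
print(params_per_layer)     # [4320064, 4160, 4160, 4160, 65]
print(total_params)         # 4332609
print(round(first_layer_share, 3))
```

The first Dense layer accounts for well over 99% of the total, which is a direct consequence of feeding every pixel value into every unit of that layer.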

from keras.layers import Embedding, Flatten, Dense
from tensorflow.keras import optimizers

from keras import Model
from keras import layers
from keras import Input

input_tensor = Input(shape=(150,150,3,))
a = layers.Flatten()(input_tensor)
a = layers.Rescaling(1./255)(a)
layer_1 = layers.Dense(64, activation='relu')(a)
layer_2 = layers.Dense(64, activation='relu')(layer_1)
layer_3 = layers.Dense(64, activation='relu')(layer_2)
layer_4 = layers.Dense(64, activation='relu')(layer_3)
output_tensor = layers.Dense(1, activation='sigmoid')(layer_4)

model = Model(input_tensor, output_tensor)

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=1e-4),
              metrics=['acc'])

model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(None, 150, 150, 3)]     0         
_________________________________________________________________
flatten (Flatten)            (None, 67500)             0         
_________________________________________________________________
rescaling (Rescaling)        (None, 67500)             0         
_________________________________________________________________
dense_6 (Dense)              (None, 64)                4320064   
_________________________________________________________________
dense_7 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_8 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_9 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 65        
=================================================================
Total params: 4,332,609
Trainable params: 4,332,609
Non-trainable params: 0
_________________________________________________________________

We are finally ready to run the model, which is done using the fit command. The execution of the model is illustrated in Figure DFN14: The training dataset samples batches of 20 images, does the pre-processing and conversion into tensors, and feeds them into the model.

The Dataset Generator computes the number of batches to be fed into the model using the formula

$$ batches\ per\ epoch = {{total\ sample\ size\ per\ epoch}\over{batch\ size}} $$
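
As a quick check of this formula against the training logs in this section, the sketch below computes the number of batches for the two datasets used here. Rounding up is an assumption that a final, smaller batch is still fed to the model when the sample count is not a multiple of the batch size. The helper name batches_per_epoch is made up for this example.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover every sample once."""
    return math.ceil(num_samples / batch_size)

image_batches = batches_per_epoch(2000, 20)    # cats and dogs training set
text_batches = batches_per_epoch(70000, 32)    # review files found in the train directory

print(image_batches)   # 100
print(text_batches)    # 2188
```

These values match the step counters shown in the progress bars (100/100 for the images and 2188/2188 for the reviews).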

In a previous version of Keras, there was a steps_per_epoch argument in the fit command that could be used to control the number of batches per epoch. This allowed more images to be fed into the model than actually existed in the directory. This was very useful when images were augmented with random changes before entering the model, thus enabling an increase in the effective number of images available for training. The steps_per_epoch parameter is still present, but it is not clear whether it still performs this function.

The Validation Generator feeds the validation data into the model using a similar process.

#DFN14
nb_setup.images_hconcat(["DL_images/DFN14.png"], width=600)

history = model.fit(
      train_dataset,
      epochs=100,
      validation_data=validation_dataset)

Epoch 1/100
100/100 [==============================] - 9s 87ms/step - loss: 0.7008 - acc: 0.5410 - val_loss: 0.7397 - val_acc: 0.4940
Epoch 2/100
100/100 [==============================] - 9s 89ms/step - loss: 0.6834 - acc: 0.5670 - val_loss: 0.6949 - val_acc: 0.5190
Epoch 3/100
100/100 [==============================] - 8s 82ms/step - loss: 0.6739 - acc: 0.5925 - val_loss: 0.6717 - val_acc: 0.5750
Epoch 4/100
100/100 [==============================] - 8s 80ms/step - loss: 0.6641 - acc: 0.6135 - val_loss: 0.6820 - val_acc: 0.5560
Epoch 5/100
100/100 [==============================] - 8s 81ms/step - loss: 0.6594 - acc: 0.6055 - val_loss: 0.7199 - val_acc: 0.5550
Epoch 6/100
100/100 [==============================] - 8s 79ms/step - loss: 0.6483 - acc: 0.6245 - val_loss: 0.6925 - val_acc: 0.5680
Epoch 7/100
100/100 [==============================] - 8s 84ms/step - loss: 0.6440 - acc: 0.6345 - val_loss: 0.7019 - val_acc: 0.5310
Epoch 8/100
100/100 [==============================] - 8s 82ms/step - loss: 0.6330 - acc: 0.6340 - val_loss: 0.6794 - val_acc: 0.5870
Epoch 9/100
100/100 [==============================] - 8s 83ms/step - loss: 0.6250 - acc: 0.6385 - val_loss: 0.7515 - val_acc: 0.5200
Epoch 10/100
100/100 [==============================] - 8s 83ms/step - loss: 0.6100 - acc: 0.6655 - val_loss: 0.6555 - val_acc: 0.6120
Epoch 11/100
100/100 [==============================] - 8s 82ms/step - loss: 0.6078 - acc: 0.6715 - val_loss: 0.6563 - val_acc: 0.6290
Epoch 12/100
100/100 [==============================] - 8s 82ms/step - loss: 0.5938 - acc: 0.6830 - val_loss: 0.7480 - val_acc: 0.5400
Epoch 13/100
100/100 [==============================] - 8s 83ms/step - loss: 0.5892 - acc: 0.6725 - val_loss: 0.6942 - val_acc: 0.6080
Epoch 14/100
100/100 [==============================] - 8s 83ms/step - loss: 0.5843 - acc: 0.6875 - val_loss: 0.6721 - val_acc: 0.6180
Epoch 15/100
100/100 [==============================] - 8s 83ms/step - loss: 0.5755 - acc: 0.6990 - val_loss: 0.6734 - val_acc: 0.6230
Epoch 16/100
100/100 [==============================] - 9s 84ms/step - loss: 0.5676 - acc: 0.6955 - val_loss: 0.6688 - val_acc: 0.5990
Epoch 17/100
100/100 [==============================] - 8s 83ms/step - loss: 0.5642 - acc: 0.7030 - val_loss: 0.6933 - val_acc: 0.6110
Epoch 18/100
100/100 [==============================] - 9s 85ms/step - loss: 0.5461 - acc: 0.7110 - val_loss: 0.7006 - val_acc: 0.6120
Epoch 19/100
100/100 [==============================] - 9s 84ms/step - loss: 0.5447 - acc: 0.7150 - val_loss: 0.7412 - val_acc: 0.5750
Epoch 20/100
100/100 [==============================] - 9s 85ms/step - loss: 0.5265 - acc: 0.7320 - val_loss: 0.6935 - val_acc: 0.5910
Epoch 21/100
100/100 [==============================] - 9s 84ms/step - loss: 0.5185 - acc: 0.7310 - val_loss: 0.6878 - val_acc: 0.6350
Epoch 22/100
100/100 [==============================] - 9s 84ms/step - loss: 0.5178 - acc: 0.7500 - val_loss: 0.7187 - val_acc: 0.6160
Epoch 23/100
100/100 [==============================] - 9s 84ms/step - loss: 0.5113 - acc: 0.7480 - val_loss: 0.7871 - val_acc: 0.6180
Epoch 24/100
100/100 [==============================] - 8s 83ms/step - loss: 0.5020 - acc: 0.7475 - val_loss: 0.7394 - val_acc: 0.5900
Epoch 25/100
100/100 [==============================] - 8s 83ms/step - loss: 0.4948 - acc: 0.7550 - val_loss: 0.7122 - val_acc: 0.6190
Epoch 26/100
100/100 [==============================] - 9s 84ms/step - loss: 0.4918 - acc: 0.7605 - val_loss: 0.7258 - val_acc: 0.6110
Epoch 27/100
100/100 [==============================] - 9s 84ms/step - loss: 0.4717 - acc: 0.7655 - val_loss: 0.7045 - val_acc: 0.6080
Epoch 28/100
100/100 [==============================] - 9s 84ms/step - loss: 0.4774 - acc: 0.7640 - val_loss: 0.7097 - val_acc: 0.5950
Epoch 29/100
100/100 [==============================] - 9s 84ms/step - loss: 0.4511 - acc: 0.7885 - val_loss: 0.8833 - val_acc: 0.6090
Epoch 30/100
100/100 [==============================] - 8s 83ms/step - loss: 0.4527 - acc: 0.7890 - val_loss: 0.7444 - val_acc: 0.6020
Epoch 31/100
100/100 [==============================] - 8s 83ms/step - loss: 0.4414 - acc: 0.7995 - val_loss: 0.7414 - val_acc: 0.6140
Epoch 32/100
100/100 [==============================] - 8s 84ms/step - loss: 0.4407 - acc: 0.7880 - val_loss: 0.7325 - val_acc: 0.6090
Epoch 33/100
100/100 [==============================] - 9s 85ms/step - loss: 0.4222 - acc: 0.8115 - val_loss: 0.8333 - val_acc: 0.6060
Epoch 34/100
100/100 [==============================] - 9s 84ms/step - loss: 0.4163 - acc: 0.8135 - val_loss: 0.7561 - val_acc: 0.6290
Epoch 35/100
100/100 [==============================] - 8s 83ms/step - loss: 0.4214 - acc: 0.7995 - val_loss: 0.7751 - val_acc: 0.6220
Epoch 36/100
100/100 [==============================] - 8s 83ms/step - loss: 0.4133 - acc: 0.8040 - val_loss: 0.7669 - val_acc: 0.6180
Epoch 37/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3890 - acc: 0.8235 - val_loss: 0.8635 - val_acc: 0.5730
Epoch 38/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3938 - acc: 0.8140 - val_loss: 0.8289 - val_acc: 0.5990
Epoch 39/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3818 - acc: 0.8280 - val_loss: 0.7739 - val_acc: 0.6100
Epoch 40/100
100/100 [==============================] - 8s 82ms/step - loss: 0.3714 - acc: 0.8240 - val_loss: 0.8929 - val_acc: 0.5990
Epoch 41/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3697 - acc: 0.8340 - val_loss: 0.8194 - val_acc: 0.6130
Epoch 42/100
100/100 [==============================] - 8s 82ms/step - loss: 0.3714 - acc: 0.8275 - val_loss: 0.8689 - val_acc: 0.5910
Epoch 43/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3657 - acc: 0.8310 - val_loss: 0.8323 - val_acc: 0.6150
Epoch 44/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3593 - acc: 0.8400 - val_loss: 0.9547 - val_acc: 0.6000
Epoch 45/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3410 - acc: 0.8445 - val_loss: 0.8186 - val_acc: 0.6160
Epoch 46/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3481 - acc: 0.8420 - val_loss: 0.9103 - val_acc: 0.6060
Epoch 47/100
100/100 [==============================] - 8s 82ms/step - loss: 0.3343 - acc: 0.8525 - val_loss: 0.8843 - val_acc: 0.6090
Epoch 48/100
100/100 [==============================] - 8s 84ms/step - loss: 0.3283 - acc: 0.8610 - val_loss: 1.0220 - val_acc: 0.5990
Epoch 49/100
100/100 [==============================] - 8s 82ms/step - loss: 0.3088 - acc: 0.8705 - val_loss: 0.9585 - val_acc: 0.6090
Epoch 50/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3139 - acc: 0.8660 - val_loss: 1.0529 - val_acc: 0.5970
Epoch 51/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3077 - acc: 0.8645 - val_loss: 1.0934 - val_acc: 0.6050
Epoch 52/100
100/100 [==============================] - 9s 85ms/step - loss: 0.3188 - acc: 0.8625 - val_loss: 0.9094 - val_acc: 0.6080
Epoch 53/100
100/100 [==============================] - 8s 83ms/step - loss: 0.3000 - acc: 0.8715 - val_loss: 1.0016 - val_acc: 0.5980
Epoch 54/100
100/100 [==============================] - 8s 83ms/step - loss: 0.2840 - acc: 0.8825 - val_loss: 1.1385 - val_acc: 0.5990
Epoch 55/100
100/100 [==============================] - 9s 90ms/step - loss: 0.2999 - acc: 0.8760 - val_loss: 1.0129 - val_acc: 0.5930
Epoch 56/100
100/100 [==============================] - 10s 98ms/step - loss: 0.3026 - acc: 0.8815 - val_loss: 0.9271 - val_acc: 0.5960
Epoch 57/100
100/100 [==============================] - 9s 92ms/step - loss: 0.2772 - acc: 0.8910 - val_loss: 0.9399 - val_acc: 0.6080
Epoch 58/100
100/100 [==============================] - 9s 88ms/step - loss: 0.2669 - acc: 0.8895 - val_loss: 1.1200 - val_acc: 0.5980
Epoch 59/100
100/100 [==============================] - 9s 88ms/step - loss: 0.2704 - acc: 0.8885 - val_loss: 1.0651 - val_acc: 0.6100
Epoch 60/100
100/100 [==============================] - 9s 89ms/step - loss: 0.2593 - acc: 0.8930 - val_loss: 1.0122 - val_acc: 0.6020
Epoch 61/100
100/100 [==============================] - 9s 89ms/step - loss: 0.2543 - acc: 0.8930 - val_loss: 1.2026 - val_acc: 0.5930
Epoch 62/100
100/100 [==============================] - 9s 88ms/step - loss: 0.2616 - acc: 0.8940 - val_loss: 1.1461 - val_acc: 0.5880
Epoch 63/100
100/100 [==============================] - 8s 82ms/step - loss: 0.2492 - acc: 0.9050 - val_loss: 0.9533 - val_acc: 0.6070
Epoch 64/100
100/100 [==============================] - 8s 82ms/step - loss: 0.2347 - acc: 0.9045 - val_loss: 0.9974 - val_acc: 0.6130
Epoch 65/100
100/100 [==============================] - 8s 82ms/step - loss: 0.2366 - acc: 0.9070 - val_loss: 1.1292 - val_acc: 0.5990
Epoch 66/100
100/100 [==============================] - 8s 83ms/step - loss: 0.2522 - acc: 0.9025 - val_loss: 1.0516 - val_acc: 0.6000
Epoch 67/100
100/100 [==============================] - 8s 83ms/step - loss: 0.2305 - acc: 0.9005 - val_loss: 1.0898 - val_acc: 0.5810
Epoch 68/100
100/100 [==============================] - 8s 81ms/step - loss: 0.2263 - acc: 0.9110 - val_loss: 1.6380 - val_acc: 0.5590
Epoch 69/100
100/100 [==============================] - 8s 81ms/step - loss: 0.2261 - acc: 0.9220 - val_loss: 1.0847 - val_acc: 0.6070
Epoch 70/100
100/100 [==============================] - 8s 82ms/step - loss: 0.2263 - acc: 0.9105 - val_loss: 1.1228 - val_acc: 0.6000
Epoch 71/100
100/100 [==============================] - 8s 82ms/step - loss: 0.2204 - acc: 0.9160 - val_loss: 1.2296 - val_acc: 0.5960
Epoch 72/100
100/100 [==============================] - 8s 83ms/step - loss: 0.2044 - acc: 0.9200 - val_loss: 1.1348 - val_acc: 0.6050
Epoch 73/100
100/100 [==============================] - 8s 82ms/step - loss: 0.2170 - acc: 0.9145 - val_loss: 1.1839 - val_acc: 0.6010
Epoch 74/100
100/100 [==============================] - 8s 81ms/step - loss: 0.2001 - acc: 0.9320 - val_loss: 1.2802 - val_acc: 0.5970
Epoch 75/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1967 - acc: 0.9265 - val_loss: 1.1830 - val_acc: 0.6030
Epoch 76/100
100/100 [==============================] - 8s 82ms/step - loss: 0.2021 - acc: 0.9250 - val_loss: 1.7612 - val_acc: 0.5830
Epoch 77/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1961 - acc: 0.9255 - val_loss: 1.2493 - val_acc: 0.6040
Epoch 78/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1707 - acc: 0.9350 - val_loss: 1.2734 - val_acc: 0.5940
Epoch 79/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1685 - acc: 0.9395 - val_loss: 1.4240 - val_acc: 0.5680
Epoch 80/100
100/100 [==============================] - 8s 82ms/step - loss: 0.1793 - acc: 0.9325 - val_loss: 1.2201 - val_acc: 0.6100
Epoch 81/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1903 - acc: 0.9400 - val_loss: 1.3858 - val_acc: 0.6130
Epoch 82/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1866 - acc: 0.9300 - val_loss: 1.5490 - val_acc: 0.5630
Epoch 83/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1996 - acc: 0.9335 - val_loss: 1.2914 - val_acc: 0.6060
Epoch 84/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1770 - acc: 0.9430 - val_loss: 1.3917 - val_acc: 0.6150
Epoch 85/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1608 - acc: 0.9425 - val_loss: 1.3262 - val_acc: 0.5940
Epoch 86/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1653 - acc: 0.9415 - val_loss: 1.3497 - val_acc: 0.5990
Epoch 87/100
100/100 [==============================] - 8s 82ms/step - loss: 0.1765 - acc: 0.9400 - val_loss: 1.3567 - val_acc: 0.6070
Epoch 88/100
100/100 [==============================] - 8s 81ms/step - loss: 0.1505 - acc: 0.9475 - val_loss: 3.3329 - val_acc: 0.5570
Epoch 89/100
100/100 [==============================] - 9s 86ms/step - loss: 0.1746 - acc: 0.9420 - val_loss: 1.4463 - val_acc: 0.6090
Epoch 90/100
100/100 [==============================] - 9s 85ms/step - loss: 0.1799 - acc: 0.9385 - val_loss: 1.3703 - val_acc: 0.6020
Epoch 91/100
100/100 [==============================] - 8s 83ms/step - loss: 0.1383 - acc: 0.9560 - val_loss: 1.3967 - val_acc: 0.6000
Epoch 92/100
100/100 [==============================] - 8s 83ms/step - loss: 0.1460 - acc: 0.9465 - val_loss: 2.1687 - val_acc: 0.5920
Epoch 93/100
100/100 [==============================] - 8s 83ms/step - loss: 0.1554 - acc: 0.9430 - val_loss: 1.4623 - val_acc: 0.5980
Epoch 94/100
100/100 [==============================] - 9s 84ms/step - loss: 0.1727 - acc: 0.9400 - val_loss: 1.3355 - val_acc: 0.6040
Epoch 95/100
100/100 [==============================] - 8s 83ms/step - loss: 0.1363 - acc: 0.9520 - val_loss: 1.5705 - val_acc: 0.5970
Epoch 96/100
100/100 [==============================] - 8s 83ms/step - loss: 0.1409 - acc: 0.9500 - val_loss: 1.7217 - val_acc: 0.6020
Epoch 97/100
100/100 [==============================] - 8s 83ms/step - loss: 0.1446 - acc: 0.9510 - val_loss: 1.5901 - val_acc: 0.6030
Epoch 98/100
100/100 [==============================] - 8s 83ms/step - loss: 0.1476 - acc: 0.9435 - val_loss: 1.5564 - val_acc: 0.5760
Epoch 99/100
100/100 [==============================] - 8s 82ms/step - loss: 0.1370 - acc: 0.9630 - val_loss: 1.7769 - val_acc: 0.5990
Epoch 100/100
100/100 [==============================] - 8s 83ms/step - loss: 0.1480 - acc: 0.9505 - val_loss: 1.5526 - val_acc: 0.6080

history_dict = history.history
history_dict.keys()

dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])
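
Besides plotting, the history dictionary can be inspected directly, for example to find the epoch with the best validation results. The sketch below is a minimal illustration that uses the validation values of the first 12 epochs copied from the log above. With the real history object, history.history['val_loss'] and history.history['val_acc'] would take the place of the two hard-coded lists.

```python
# Validation metrics for epochs 1 to 12, copied from the training log above
val_loss = [0.7397, 0.6949, 0.6717, 0.6820, 0.7199, 0.6925,
            0.7019, 0.6794, 0.7515, 0.6555, 0.6563, 0.7480]
val_acc = [0.4940, 0.5190, 0.5750, 0.5560, 0.5550, 0.5680,
           0.5310, 0.5870, 0.5200, 0.6120, 0.6290, 0.5400]

# Epochs are numbered from 1, list positions from 0
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_acc_epoch = val_acc.index(max(val_acc)) + 1

print(best_loss_epoch, min(val_loss))   # 10 0.6555
print(best_acc_epoch, max(val_acc))     # 11 0.629
```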

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

plt.clf()   # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

The Validation Accuracy for this system is about 60%, which is not great, while the Training Accuracy climbs to about 95%, a sign of overfitting. There are several ways to improve this:

We used a small training sample of 2000 which limits the accuracy.
The Dense Feed Forward model is not very well suited to process images. Later we will make use of Convolutional Neural Networks (ConvNets) which yield much better results. The Kaggle competition for this dataset was won by an entry that achieved an accuracy of over 90% using ConvNets.

References and Slides¶

Slides

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
0	63	1	1	145	233	1	2	150	0	2.3	3	0	fixed	0
1	67	1	4	160	286	0	2	108	1	1.5	2	3	normal	1
2	67	1	4	120	229	0	2	129	1	2.6	2	2	reversible	0
3	37	1	3	130	250	0	0	187	0	3.5	3	0	normal	0
4	41	0	2	130	204	0	2	172	0	1.4	1	0	normal	0