%matplotlib inline
import os
from ipypublish import nb_setup
There is a lot of excitement surrounding the fields of Neural Networks (NN) and Deep Learning (DL), due to the numerous well-publicized successes that these systems have achieved in the last few years. The objective of this monograph is to provide a concise survey of this fast-developing field, with special emphasis on more recent developments. We will use the nomenclature Deep Learning Networks (DLN) for Neural Networks that use Deep Learning algorithms.
DLNs form a subfield within the broader area of Machine Learning (ML). ML systems are defined as those that are able to train (or program) themselves, either by using a set of labeled training data (called Supervised Learning), or even in the absence of training data (called Un-Supervised Learning). In addition there is a related field called Reinforcement Learning, in which algorithms are trained not on labeled examples, but on a sequence of control actions and rewards. Even though ML systems are trained on a finite set of training data, their usefulness arises from the fact that they are able to generalize from this data and process inputs that they have not seen before.
Most of the recent breakthroughs in DLNs have been in Supervised Learning systems, which are now being widely used in numerous industrial and consumer applications, some of which are enumerated below:
Image Recognition and Object Detection: DLNs enable the detection and classification of objects in images into one of more than a thousand different categories. This has hundreds of applications, ranging from diagnosis in medical imaging to security-related image analysis.
Speech Transcription: DLN systems are used to transcribe speech into text with a high level of accuracy. Advances in this area have enabled products such as Amazon's Alexa and Apple's Siri.
Machine Translation: Large-scale translation systems, such as the one deployed by Google, have switched over to using DLN algorithms (Sutskever, Vinyals, and Le (2014)).
FinTech Applications: DLN systems are being used to automate applications such as Trading and Portfolio Management. Indeed, the majority of transactions on Wall Street today are carried out by automated trading programs, many of which are DLN based.
Image Recognition belongs to the class of applications that involve classifying the input into one of several categories. The Speech Transcription and Machine Translation applications involve not just recognition of the input pattern, but also the generation of patterns as part of the output (so-called Generative Models).
For a comprehensive report on the State of AI, see Benaich and Hogarth (2019); pdf.
And yes, AI may be biased. Handle Deep Learning with care. See "Unmasking AI's Bias Problem"; pdf.
nb_setup.images_hconcat(["DL_images/AI_WorksORNot.png"])
nb_setup.images_hconcat(["DL_images/AI_Movies.png"])
#MLDef
nb_setup.images_hconcat(["DL_images/MLDef.png"], width=500)
Deep Learning is an important sub-field within the general area of Machine Learning, which raises the question: What is Machine Learning?
Machine Learning is the science of designing systems that can learn to solve problems from user-supplied data (also called training data). The contrast between Classical Programming and Machine Learning is shown in Figure MLDef. In Classical Programming the programmer has the task of coming up with Algorithmic Rules that, when applied to Data, result in the desired Answer. This procedure works for problems whose solution is well enough understood that the appropriate Rules can be discovered.
However, there exists a class of problems that are so difficult to analyze that coming up with programming rules for them is not feasible. Machine Learning systems are designed to solve this type of problem, which often arises when processing high-dimensional perceptual data such as images, speech or sound. As shown in the bottom part of the figure, Machine Learning systems derive the Rules automatically by processing Data for which the appropriate Answer is already known (called Training Data). Once the Rules are obtained, the Machine Learning model can be used, as shown in the top part of the figure, to process new data that was not part of the original Training Dataset. The success or failure of this procedure is a function of the quality of the Training Data and the sophistication of the Model used.
Now that we know what Machine Learning is, let's turn back to the definition of Deep Learning. In order to do so, we have to understand the concept of Data Representations.
#Representations
nb_setup.images_hconcat(["DL_images/Representations.png"], width=500)
Data Representations are fundamental to the field of Artificial Intelligence. In order to understand why this is so, consider the task of classification or categorization of input data, which is a basic task in AI. Assume that the input data can be represented by a point in N-dimensional space (we will later see that even complex inputs such as images or a natural language sentence can be represented in this way). As an example consider Part (a) of Figure Representations: In this example N = 2, and there are two categories of input data, namely (blue) circles and (green) triangles. Machine Learning systems perform classification by using a straight line (in 2 dimensions), or a hyperplane (in higher dimensions), to separate out the input data. In the picture on the LHS of (a), we can see that this data representation does not lend itself to classification in this way, since the two classes form concentric circles around the origin. However, as shown on the RHS, if we change the representation from Cartesian to Polar co-ordinates, then the two classes arrange themselves in vertical strips, which can then be readily classified by drawing a straight line (as shown in Part (b)).
Traditional Machine Learning systems are not capable of changing data representations in this way, and rely on the designer to find the right data representation that can be processed using linear separators. Deep Learning systems, on the other hand, learn not just the best linear discriminator, but ALSO the appropriate data representation that transforms the input data into a space which is easier to process.
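To make this concrete, here is a minimal sketch of the Cartesian-to-Polar example from Figure Representations, assuming numpy and scikit-learn are available (the dataset itself is synthetic, generated purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Two classes that form concentric circles around the origin
rng = np.random.default_rng(0)
n = 200
r = np.concatenate([rng.uniform(0.0, 1.0, n),   # class 0: inner disk
                    rng.uniform(2.0, 3.0, n)])  # class 1: outer ring
theta = rng.uniform(0.0, 2 * np.pi, 2 * n)
X_cart = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(n), np.ones(n)])

# A linear classifier fails on the Cartesian representation ...
print(LogisticRegression().fit(X_cart, y).score(X_cart, y))    # near 0.5

# ... but succeeds once the representation is changed to Polar co-ordinates
X_polar = np.column_stack([np.hypot(X_cart[:, 0], X_cart[:, 1]),
                           np.arctan2(X_cart[:, 1], X_cart[:, 0])])
print(LogisticRegression().fit(X_polar, y).score(X_polar, y))  # near 1.0

A DLN, in effect, learns a transformation of this kind on its own, rather than having it supplied by the designer.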
#Stages
nb_setup.images_hconcat(["DL_images/Stages.png"], width=500)
Deep Learning systems accomplish the task of transforming the input data into a new representation by passing the data through multiple stages, which accounts for the "Deep" in Deep Learning (this is shown in Figure Stages). The parameters that define each of these stages are learnt during the training process. The number of stages needed to solve a particular problem is a function of how intricately the various categories are intertwined in the input.
DLNs differentiate themselves by their effectiveness in solving problems in which the input data has hundreds of thousands of dimensions. A good example of such an input is a color image: consider an image which consists of three layers (RGB) of 224 x 224 pixels each. The total number of pixels in such an image is 150,528, any of which may carry useful information about the contents. If we want to process this image and extract useful information from it (such as classifying all objects or describing the scene), the problem is very difficult to solve using non-DLN systems. How are DLNs able to solve it? In order to answer this question consider Figure ImageTransform below. As the figure shows, the DLN can be considered to be a function that transforms the input representation, which has about 150K dimensions, to an output representation that has 4K dimensions. This output representation has useful structural properties, and in particular captures semantic information regarding the input.
An example of a structural property that enables the system to classify the input image into one of several categories is the following: if we plot the coordinates of the output representation, then it can be verified that the points representing images that are similar to each other also tend to cluster together. An example of this is seen in Figure MNISTTransform, in which we plot the transformed representations for a dataset called MNIST. This dataset consists of 784-dimensional (28 x 28 pixel) handwritten images of the digits 0 to 9 (see the MNIST Database). The figure shows the 4096-dimensional output representations corresponding to the images of digits after they have passed through a DLN. These points are projected from 4096-dimensional to 2-dimensional space using a technique called t-SNE, in order to aid human comprehension. Note that the representations of each particular digit are clustered together. This clustering is a very useful property to have, since it enables the subsequent classification of these points into categories by using a simple linear classifier. The key point to note is that the clustered points now carry semantic information about the input image.
#ImageTransform
nb_setup.images_hconcat(["DL_images/Image_Transform.png"], width=500)
#MNISTTransform
nb_setup.images_hconcat(["DL_images/MNIST_Transform.png"], width=500)
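The t-SNE projection itself is easy to experiment with. The following minimal sketch, assuming scikit-learn and matplotlib are available, uses scikit-learn's small 8 x 8 digits dataset (64 dimensions) as a stand-in for true MNIST images or learnt DLN features; the clustering-by-digit effect it produces is the same in spirit as in Figure MNISTTransform:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the 64-dimensional digit images down to 2 dimensions with t-SNE
digits = load_digits()
X_2d = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

# Points belonging to the same digit tend to cluster together
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='tab10', s=8)
plt.colorbar(label='digit')
plt.show()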
The process just described is called Representation Learning, since the DLN transforms the input representation of each image (in terms of RGB pixels) to an output representation that is more suitable for tasks such as classification. Before the advent of DLN systems, tasks such as image classification required the designer to come up with a suitable output representation herself. This was a time consuming job requiring a lot of expertise, and it typically took several years to arrive at a good representation for a particular class of problems. DLN systems have automated this process, such that the system learns the best output representation for a problem on its own, by using the training dataset.
Another way to understand the process by which DLN systems create representations and do classification for images is shown in Figure ImageRep. The figure shows a DLN with 4 so-called Hidden Layers followed by the final Output Layer, which classifies the image into one of three categories (these layers are formally defined in the next chapter). The first Hidden Layer acts upon the raw pixels and serves as a detector for edges in the image, such that each node in this layer is responsible for detecting edges in a different orientation. Note that edges can be detected fairly easily by comparing the brightness levels of neighboring pixels. The second Hidden Layer acts on the output of the first Hidden Layer and detects features that are composed of multiple edges, such as corners and contours. The third Hidden Layer further builds on this, and detects object parts that are composed of corners and contours. This functions as the output representation of the input image, and the final Layer (which is also called the Logit Layer) acts on this representation to put together the whole object, and thus carries out the classification task with a linear classifier. As this description shows, each successive layer in a DLN system builds on the layer immediately preceding it, and thus creates representations for objects in a hierarchical fashion, by starting with the detection of simple features and then going on to more complex ones.
#ImageRep
nb_setup.images_hconcat(["DL_images/Image_Processing.png"], width=500)
The process that we described above, by which DLN systems create higher level representations, works not just for images but also for language, i.e., DLNs can be used to create high level representations for words and sentences which capture semantic content. In particular it is possible to create DLN based representations for words by mapping them to vectors. Figure WordRep graphs the 2-D representation of English words (once again using t-SNE to project onto 2 dimensions), after they have been processed by a DLN system. As in the case of images, the figure shows that the representation captures semantic information by clustering together words that are similar in meaning. This representation for words is widely used in Natural Language Processing (NLP) for tasks such as Machine Translation, Sentiment Analysis, etc.
#WordRep
nb_setup.images_hconcat(["DL_images/Words.png"], width=500)
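As a toy illustration of word representations, the sketch below measures the similarity between word vectors using the cosine of the angle between them. The three vectors here are made-up stand-ins for embeddings that a real DLN would learn from a large text corpus:

import numpy as np

# Made-up 3-dimensional "embeddings"; real learnt vectors have hundreds
# of dimensions, but the similarity computation is the same
embeddings = {
    'king':  np.array([0.80, 0.65, 0.10]),
    'queen': np.array([0.78, 0.70, 0.15]),
    'apple': np.array([0.10, 0.05, 0.90]),
}

def cosine(u, v):
    # Cosine similarity: close to 1 for words with similar meanings
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(embeddings['king'], embeddings['queen']))  # high
print(cosine(embeddings['king'], embeddings['apple']))  # low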
#SL
nb_setup.images_hconcat(["DL_images/SL.png"], width=500)
Almost all the DLN systems that have been found useful in solving real world problems belong to the category of Supervised Learning systems, sometimes also called Teacher based learning. These systems are trained by presenting the DLN model with an input X and its corresponding label T, which is assumed to be known in advance. As shown in Figure SL, the system learns by comparing its output Y with T, and then feeding the error signal (Y-T) back into itself. This error signal is used by the Backprop algorithm to adjust the model parameters W in such a way that the error (Y-T) decreases the next time the same input is presented. By repeating this process over many passes through labeled datasets consisting of thousands of training examples, the error signal gradually goes to zero as the model gets better trained. However, the proof of the pudding is when the model is presented with data that it has not seen before, and is still able to classify it correctly.
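The following minimal sketch captures the loop in Figure SL under simplifying assumptions: a linear model stands in for the DLN, the loss is squared error, and the data is synthetic. For this model the Backprop computation reduces to a one-line gradient:

import numpy as np

# Minimal sketch of the Supervised Learning loop in Figure SL
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 training inputs, 3 features each
T = X @ np.array([1.0, -2.0, 0.5])   # labels T produced by a "true" rule
W = np.zeros(3)                      # model parameters, initially zero
lr = 0.01                            # learning rate

for epoch in range(500):
    Y = X @ W                        # model output
    error = Y - T                    # error signal (Y - T)
    W -= lr * X.T @ error / len(X)   # adjust W so the error shrinks

print(W)  # approaches [1.0, -2.0, 0.5] as the error signal goes to zero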
A lot of the effort in building Supervised Learning systems goes into the process of creating the labels, which is manual and time consuming. Fortunately there exist public datasets that are pre-labeled, and also models which have been pre-trained using these datasets. If the problem under consideration has similar data, then by using a technique called Transfer Learning it is possible to largely reuse the pre-trained model, which allows the user to get by with a much smaller labeled training dataset.
There are certain problems whose datasets allow the labels to be generated automatically from the data itself. The process of training these kinds of systems is known as Self Supervised Learning. The most important example of this type of system arises in the field of Natural Language Processing (NLP). One of the main tasks in NLP systems is that of learning a Language Model, which is simply a Supervised Learning system that is trained to predict the next word in a sentence that is given to it as input. Given a corpus of training text, such as the contents of a book, generating labels for this task becomes very easy, since each label is simply the identity of the next word in the sentence. Other examples of Self Supervised Learning arise in applications in which the task is to predict the next value in a sequence, which is common in Time Series Analysis.
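The sketch below shows how easily such labels fall out of the data itself: for a Language Model, each training pair consists of a context and the word that follows it (the sentence here is just a made-up example):

# Self Supervised label generation for a Language Model: the label for
# each training example is simply the next word in the corpus
corpus = "the cat sat on the mat".split()
pairs = [(corpus[:i], corpus[i]) for i in range(1, len(corpus))]
for context, label in pairs:
    print(context, '->', label)
# ['the'] -> cat, ['the', 'cat'] -> sat, ...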
Un-Supervised Learning is the science of training models with datasets that do not have labels. Before the advent of Deep Learning, almost all classical Machine Learning algorithms fell into this category. Examples include:
Data Clustering: This is the process of grouping the input data into several clusters, by means of algorithms such as K-Means (a code sketch follows this list).
Principal Component Analysis (PCA): This is equivalent to doing a co-ordinate transformation on the input data, such that the main sources of variation in the input lie along the new co-ordinate axes.
Auto-Encoders: These systems learn new representations of the input data by using the input itself as the training target, so in a sense they can also be classified as Self Supervised Learning systems. The learnt representation can then be used for performing other tasks on the input, such as classification.
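The sketch below, assuming scikit-learn is available and using a synthetic two-blob dataset, illustrates the first two of these algorithms: K-Means recovers the two groups without being given any labels, and PCA reports how the variation in the data is distributed along the new co-ordinate axes:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabeled data consisting of two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[4, 4], scale=0.5, size=(50, 2))])

# K-Means discovers the two groups without any labels
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[-5:])        # the two blobs get different ids

# PCA finds the directions of greatest variation
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # most variance along one new axis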
The modern approach to Un-Supervised Learning has shifted towards finding the Probability Density Function (PDF) of the dataset. Once this is known, it can be used to generate new data by sampling from the PDF. The PDF estimation can be explicit, and algorithms such as Pixel-RNN and Variational Auto-Encoders (VAE) belong to this category. However, the most popular way of doing Density Estimation is an implicit technique called Generative Adversarial Networks, or GANs. This theory is still new, and applications are still being discovered. Today GANs are mostly used as a way to generate realistic looking images by sampling from the learnt implicit PDF, which is represented by the weights of a Deep Learning model.
#RL
nb_setup.images_hconcat(["DL_images/RL.png"], width=500)
Reinforcement Learning (RL) is the science of making optimal decisions. The main objective is to design an RL Agent which takes Actions and interacts with the surrounding Environment in order to achieve an objective. The setup also includes a Reward Function, such that the Reward is maximized when the objective is achieved. The Environment evolves in time in response to the Agent's Actions, as well as its own dynamics (which could have a random component). As shown in the above figure, when Deep Learning is used in RL, the Agent is modeled using a DLN. The input into the DLN is the State of the Environment, while the output is the Agent's Action. The Agent's objective is to choose a series of actions that maximize its Long Term Reward. Note that, unlike in a Supervised Learning setup, the DLN (or the Agent) is not told what the best Action is for a given state; instead it has to infer this indirectly from the Rewards that its Actions produce. This makes the solution to the RL problem more involved, and tools from the theory of Markov Decision Processes and Dynamic Programming have to be invoked.
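The sketch below shows the bare Agent-Environment loop from the figure above. The environment is a made-up one-dimensional walk and the policy is just random; a real RL system would replace the random choice with the Action output by a trained DLN:

import random

# A made-up Environment: the Agent walks along a line, earns a reward
# for each step to the right, and the episode ends at position 10
def step(state, action):
    next_state = state + action           # environment dynamics
    reward = 1.0 if action == 1 else 0.0  # reward function
    return next_state, reward

state, total_reward = 0, 0.0
for t in range(100):                      # bounded episode length
    action = random.choice([-1, 1])       # placeholder (random) policy
    state, reward = step(state, action)
    total_reward += reward
    if state >= 10:                       # objective achieved, episode ends
        break
print(state, total_reward)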
#MPNeuron
nb_setup.images_hconcat(["DL_images/MP_neuron.png"], width=500)
NNs have existed since the 1940s, when they were first proposed by McCulloch and Pitts (1943) as a model for biological neurons. As shown in Figure MPNeuron above, this model consisted of a set of inputs, each of which was multiplied by a weight and then summed over. If the sum exceeded a threshold, then the network output a $+1$, and $-1$ otherwise. This was a simple model in which the weights were fixed and the output was restricted to $\{+1,-1\}$. The idea of Hebbian learning was discovered around the same time, in which learning is accomplished by changing the weights. The idea of NNs was then taken up by Rosenblatt in the 1950s and 60s, see Rosenblatt (1958) and Rosenblatt (1962), who combined the ideas of the McCulloch-Pitts neuron with that of Hebbian learning, and referred to his model as a Perceptron.
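The McCulloch-Pitts neuron is simple enough to state in a few lines of code. In this sketch the weights and threshold are fixed by hand (chosen here so that the neuron computes a 2-input AND), exactly as in the original model where no learning takes place:

import numpy as np

# The McCulloch-Pitts neuron: a weighted sum of the inputs followed by a
# hard threshold, with the output restricted to {+1, -1}
def mp_neuron(x, w, threshold):
    return 1 if np.dot(w, x) > threshold else -1

# With these fixed weights the neuron behaves like a 2-input AND gate
w = np.array([1.0, 1.0])
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mp_neuron(np.array(x), w, threshold=1.5))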
Perceptrons achieved some early successes, but ran into problems when researchers tried to use them for more sophisticated classification problems. In particular, the Perceptron learning algorithm fails to converge in cases in which the data is not strictly linearly separable. These problems, along with the publication of a book by Minsky and Papert (1969) in which these issues were highlighted, led to a drop in academic interest in NNs which persisted until the early 80s. Two advances moved the field beyond Perceptrons and led to the modern idea of DLNs:
The incorporation of techniques from Statistical Estimation and Inference into ML: A critical step was the replacement of deterministic 0-1 classification by probabilistic classifiers based on Maximum Likelihood estimation (a minimal sketch follows this list). This ensured that the learning algorithms converged even for cases in which the classes were not strictly linearly separable.
The discovery of the Backprop algorithm: This was a critical advance that enabled the training of large multi-layer DLNs. Even today Backprop serves as the main learning algorithm for all DLN models.
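A minimal sketch of the first advance is shown below: the Perceptron's hard 0-1 output is replaced by a probability (via the sigmoid function), and the weights are fitted by Maximum Likelihood using gradient ascent. The overlapping synthetic dataset is one on which the original Perceptron rule would never converge:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two overlapping classes: not linearly separable, so the Perceptron
# rule would oscillate forever on this data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.5, size=(100, 2)),
               rng.normal(+1, 1.5, size=(100, 2))])
t = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(2)
for epoch in range(1000):
    p = sigmoid(X @ w)                 # predicted class probabilities
    w += 0.1 * X.T @ (t - p) / len(X)  # Maximum Likelihood gradient step

print(w)  # the weights settle down instead of oscillating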
#NumResearchReports
nb_setup.images_hconcat(["DL_images/NN_num_research_reports.png"], width=500)
As shown in Figure NumResearchReports, these discoveries resulted in renewed interest in NNs from the mid-80s onwards, which persisted till the mid-90s. Around that time, attempts to use the Backprop algorithm to train very large NNs to solve more realistic problems led to an issue known as the Vanishing Gradient problem, discussed in Chapter GradientDescentTechniques. As a result of this problem, Backprop based systems stopped learning after some number of iterations, i.e., their weights stopped adapting to new training data. At the same time, a new ML algorithm called the Support Vector Machine (SVM) was discovered, which worked much better than NNs for large classification problems. As a result, interest in DLNs once again dipped in the late 90s and didn't recover until recently.
As the figure above shows, growth in DLN related research has exploded in the last 8-10 years, a period which is often referred to as the “Neural Networks Renaissance”. Several factors have contributed to this, among them:
The increase in processing power and the use of specialized processors optimized for NN computations: Researchers discovered that processors built for Graphics Processing, or GPUs, were very effective in doing DLN computations as well. A new generation of processors built especially for DLN processing is also coming to market.
The availability of massive crowd-sourced labeled datasets that can be used for training large DLNs. This has been mainly due to the rise of the Internet and its database of millions of images.
Algorithmic and Architectural Advances: Among the algorithmic advances are (a) Techniques to solve the Vanishing Gradient problem for large DLNs, Hahnloser, Sarpeshkar, Mahowald, Douglas, and Seung (2000), Nair and Hinton (2010), and (b) Techniques to improve the generalization ability of DLNs, Srivastava (2013). Among the architectural advances are the discovery of new DLN configurations such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CNNs are well suited for processing images, while RNNs are designed for learning patterns in sequences of data, such as language or video streams.
The rest of this monograph is organized as follows. In Chapter PatternRecognition we introduce the DLN architecture by going through an example of the processing of MNIST images. In Chapter SupervisedLearning we introduce the concepts of supervised and un-supervised learning, and set up the basic mathematical framework for training DLN models. In Chapter LinearLearningModels we solve the simplest DLN models, i.e., those involving linear parameterization. Chapter NNDeepLearning describes Fully Connected Neural Networks. Chapter TrainingNNsBackprop delves into the details of training DLNs, including a full description of the Backprop algorithm. Chapter GradientDescentTechniques covers techniques that can be used to improve the Gradient Descent algorithm. In Chapter ImprovingModelGeneralization we discuss the idea of Model Capacity, and its relation to the concepts of Model Underfitting and Model Overfitting. We discuss techniques used to optimally set the Learning Rate parameter, and other Hyper-Parameters, in Chapter HyperParameterSelection.