Function Approximations in RL

In [1]:
from ipypublish import nb_setup

Recap: 3 model-free algorithms:

  1. MC Control: On-Policy; the policy remains the same for the full Rollout.
  2. TD Control: On-Policy, known as SARSA.
  3. TD Control: Off-Policy, known as Q-Learning.

SARSA does more exploration, while Q-Learning does more exploitation; depending on the problem setting, one may be better than the other.
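For reference, the two TD update rules differ only in how they bootstrap from the next state:

$$
\begin{aligned}
\text{SARSA:} \quad & Q(S,A) \leftarrow Q(S,A) + \alpha\,\big[R + \gamma\, Q(S',A') - Q(S,A)\big] \\
\text{Q-Learning:} \quad & Q(S,A) \leftarrow Q(S,A) + \alpha\,\big[R + \gamma\, \max_{a} Q(S',a) - Q(S,A)\big]
\end{aligned}
$$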

We now transition to using deep learning to solve RL problems, an important inflexion point in the course.

Tabular RL

For small problems it is easy to represent the States and Actions in tabular form (or as a higher-dimensional tensor). Here is a depiction.

In [2]:
nb_setup.images_vconcat(["RL_images/Tabular_RL.png"], width=600)
Out[2]:

Neural Nets for RL

But as the state space grows this becomes untenable. So we need to work with neural networks (NNs) as continuous function approximators. Note that NNs are universal function approximators.

  1. Here is a visual proof.

  2. A quick introduction to NNs.

  3. A deeper dive into Deep Learning.

Here is a schematic of (supervised) Machine Learning, where the mapping from data (with labels) to outputs is learned by a NN.

We can compare this with RL which looks similar in structure as shown below.

In [3]:
nb_setup.images_hconcat(["RL_images/ML_Schematic.png","RL_images/RL_Schematic.png"], width=900)
Out[3]:

Deep RL

Deep RL (DRL) is the combination of Deep NNs with RL. Here the objective function is the Total Reward, which is to be maximized.

We feed the Rollouts into a NN rather than a Table: instead of updating tabular values, we now update the weights of the NN with each episode.

In [4]:
nb_setup.images_hconcat(["RL_images/Deep_RL.png"], width=500)
Out[4]:

Two Deep RL Approaches

  1. Use the network to approximate the $Q(S,A)$ function, where $S$ and $A$ may be continuous. This is called Deep Q-Learning (DQN).
  2. Map directly from the state to the action, i.e., learn the policy. These are called Policy Gradient Methods (PGM).

The graphic below is for DQN on the left and PGM on the right.

We may also call this Functional RL, or Approximate DP; these are just alternate names for the same ideas.

In [5]:
nb_setup.images_hconcat(["RL_images/DQN.png","RL_images/PGM.png"], width=900)
Out[5]:

Deep Q Nets (DQN)

  • Run sample episodes, i.e., sequences of $S,A,R$ triplets. A random policy may be used to generate the episode paths.
  • Use MC, SARSA, or QL to get a mapping $(S,A) \rightarrow Q(S,A)$. This is the training data, with $Q(S,A)$ as the label.
  • The NN has weights $W$ and we approximate (and eventually maximize) $Q(S,A,W)$.

So we start with an initial set of weights, run the DQN to generate data $S,A,R$ and $Q(S,A)$, then use this data to update the weights $W$ and obtain a new $Q$ function. Repeat. (A minimal code sketch of this loop follows the figure below.)

In [6]:
nb_setup.images_hconcat(["RL_images/DQN_Process.png"], width=500)
Out[6]:
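Below is a minimal, illustrative sketch of this loop in NumPy. A toy 1-D chain environment, one-hot features, and a linear approximator $Q(S,A,W) = W^\top \phi(S,A)$ stand in for a full deep network; the environment, rewards, and hyperparameters are all assumptions for illustration.

import numpy as np

n_states, n_actions = 5, 2          # toy chain: action 0 = move left, 1 = move right
gamma, alpha, eps = 0.9, 0.05, 0.1  # discount, learning rate, exploration rate

def features(s, a):
    """One-hot feature vector phi(S, A) for the (state, action) pair."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

W = np.zeros(n_states * n_actions)  # weights of Q(S, A, W) = W . phi(S, A)

def Q(s, a):
    return W @ features(s, a)

for episode in range(500):
    s = 0
    for t in range(50):
        # epsilon-greedy action from the current approximate Q
        if np.random.rand() < eps:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax([Q(s, b) for b in range(n_actions)]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = (s_next == n_states - 1)   # reaching the right end terminates the episode
        r = 1.0 if done else 0.0
        # the Q-Learning target acts as the "label" for the approximator
        target = r if done else r + gamma * max(Q(s_next, b) for b in range(n_actions))
        # gradient step on the squared error (target - Q(S, A, W))^2
        W += alpha * (target - Q(s, a)) * features(s, a)
        s = s_next
        if done:
            break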

Policy Gradient Methods (PGM)

  • Run sample episodes, i.e., sequences of $S,A,R$ triplets. A random policy may be used to generate the episode paths.
  • Use MC, SARSA, or QL to get a mapping $S \rightarrow \pi(A|S)$. This is the training data. (A minimal code sketch follows the figure below.)

In [7]:
nb_setup.images_hconcat(["RL_images/PGM_Process.png"], width=500)
Out[7]:
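As a companion to the DQN sketch above, here is a minimal, illustrative REINFORCE-style policy-gradient update on the same toy chain environment; the tabular-softmax policy and all constants are assumptions for illustration.

import numpy as np

n_states, n_actions = 5, 2
gamma, alpha = 0.9, 0.05
theta = np.zeros((n_states, n_actions))   # policy parameters

def policy(s):
    """Softmax policy pi(A|S) parameterized by theta."""
    prefs = theta[s] - theta[s].max()     # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Toy chain dynamics: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r, s_next == n_states - 1

for episode in range(500):
    # 1. Roll out one episode with the current policy.
    s, trajectory = 0, []
    for t in range(50):
        a = int(np.random.choice(n_actions, p=policy(s)))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    # 2. Compute the returns G_t backwards and take a gradient ascent step
    #    on G_t * grad log pi(a_t | s_t) for every step of the episode.
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G
        grad_log_pi = -policy(s)          # d/dtheta[s] log pi(a|s) = onehot(a) - pi(.|s)
        grad_log_pi[a] += 1.0
        theta[s] += alpha * G * grad_log_pi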

Functional Choices from NNs

The model $f_W(\cdot)$ shown below will depend on which of these functional models is used.

In [8]:
nb_setup.images_hconcat(["RL_images/QSAW.png"], width=500)
Out[8]:

The output could also be $K$-ary rather than unary. For example, if the Action space consists of $K$ finite values, then we might want the network to output the $Q(S,A_k,W)$ values for all $k$.
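For instance, here is a minimal sketch of such a $K$-ary output network, assuming TensorFlow/Keras is available; the state dimension, layer width, and $K$ are illustrative choices.

import tensorflow as tf
from tensorflow.keras import layers

state_dim, K = 4, 3   # e.g. a 4-dimensional state and K = 3 discrete actions

# The final Dense layer has K units: one Q(S, A_k, W) value per action.
q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(K),
])

s = tf.random.normal((1, state_dim))   # a batch containing a single state
q_values = q_net(s)                    # shape (1, K)
greedy_action = tf.argmax(q_values, axis=1)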

See the slides for a brief recap of functional models, a very fast run-through of ideas from Deep Learning. Advice: take the DL course for a fascinating technical education into the world of AI. Remember, it is best to learn theory while in school; it is much harder to do so on the job.

Reward Maximization

Standard gradient descent is applied.

In [9]:
nb_setup.images_hconcat(["RL_images/GradientDescent_RewardMax.png","RL_images/Gradient_Descent.png"], width=900)
Out[9]:
In [10]:
nb_setup.images_hconcat(["RL_images/Gradient_Descent_K.png"], width=500)
Out[10]:
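For reference, the generic gradient-descent update on the weights is

$$
W \leftarrow W - \eta\,\frac{\partial L}{\partial W},
$$

where $\eta$ is the learning rate and $L$ is the loss; for reward maximization the sign flips and we ascend the gradient of the Total Reward instead.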

Logistic Regression

In [11]:
nb_setup.images_hconcat(["RL_images/Logistic_Regression.png","RL_images/Cross_Entropy.png"], width=900)
Out[11]:

Here is a plot of the cross entropy when $t=1$, also known as the Log-Loss plot. The function decreases from $\infty$ (when the predicted probability is near 0) to $0$ (when the predicted probability is 1).
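A short sketch that reproduces this Log-Loss curve (assuming Matplotlib is available; purely illustrative):

import numpy as np
import matplotlib.pyplot as plt

y = np.linspace(0.01, 1.0, 200)        # predicted probability
plt.plot(y, -np.log(y))                # cross entropy when t = 1
plt.xlabel(r"predicted probability $y$")
plt.ylabel(r"log-loss $-\log(y)$")
plt.title(r"Cross entropy for $t = 1$")
plt.show()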

Gradients are obtained through Back Propagation.

As a mathematical coincidence, the gradient of the cross-entropy loss (with a sigmoid output) has the same form as the gradient of the mean-squared-error loss (with a linear output)! The derivation uses the Chain Rule from calculus.
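To see this, take a single example with input $x$, label $t$, and prediction $y = \sigma(z)$ with $z = w^\top x$; the chain rule gives

$$
\begin{aligned}
L &= -\big[t \log y + (1-t)\log(1-y)\big],\\
\frac{\partial L}{\partial w} &= \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial z}\,\frac{\partial z}{\partial w}
= \frac{y-t}{y(1-y)}\;\cdot\;y(1-y)\;\cdot\;x = (y - t)\,x,
\end{aligned}
$$

which is the same $(y-t)\,x$ form obtained from the mean-squared error $\tfrac{1}{2}(y-t)^2$ with a linear output $y = w^\top x$.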

In [12]:
nb_setup.images_hconcat(["RL_images/GradientDescent_CrossEntropy.png","RL_images/GradientDescent_ChainRule.png"], width=900)
Out[12]:

Implementation Details

  1. Weight initialization.
  2. Setting the Learning rate.
  3. Early Stopping rule.
  4. Improving Generalization Error (i.e., no overfitting).

Setting parameters for all these is known as Hyperparameter Tuning.
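As a rough sketch of where each of these four choices appears in code, assuming TensorFlow/Keras is available (all concrete values are illustrative hyperparameters, and the fit call is hypothetical since no data is defined here):

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                        # illustrative input size
    # 1. Weight initialization
    layers.Dense(64, activation="relu",
                 kernel_initializer="he_normal",
                 # 4. L2 regularization to improve generalization
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(1),
])

# 2. Learning rate
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mse")

# 3. Early stopping monitored on a validation split
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=5,
                                              restore_best_weights=True)

# Hypothetical training call (X_train, y_train are not defined here):
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])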

Weight Update Formulas

The gradients needed to update the weights of the approximating function for various models are shown below.

In [13]:
nb_setup.images_hconcat(["RL_images/WeightUpdate_Regression.png","RL_images/WeightUpdate_Logit.png",
                         "RL_images/WeightUpdate_BackProp.png"], width=1300)
Out[13]:
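As a standard recap, for a single training example $(x, t)$ with learning rate $\eta$, the per-example stochastic-gradient updates take the forms

$$
\begin{aligned}
\text{Regression (MSE, linear output } y = w^\top x\text{):}\quad & w \leftarrow w - \eta\,(y - t)\,x,\\
\text{Logit (cross entropy, sigmoid output } y = \sigma(w^\top x)\text{):}\quad & w \leftarrow w - \eta\,(y - t)\,x,\\
\text{Multi-layer NN:}\quad & w^{(\ell)} \leftarrow w^{(\ell)} - \eta\,\frac{\partial L}{\partial w^{(\ell)}},
\end{aligned}
$$

where the hidden-layer gradients $\partial L / \partial w^{(\ell)}$ are obtained via the chain rule (Back Propagation).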

Reading and References