from ipypublish import nb_setup
Recap: we covered three model-free algorithms.
SARSA is on-policy, so its estimates reflect the exploration the agent actually performs; Q-Learning is off-policy and learns the greedy, more exploitative policy directly. Depending on the problem setting, one may be better than the other.
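As a reminder, the two tabular update rules differ only in the bootstrap term (standard notation, with step size $\alpha$ and discount $\gamma$):

$$\text{SARSA:}\quad Q(S,A) \leftarrow Q(S,A) + \alpha\left[R + \gamma\, Q(S',A') - Q(S,A)\right]$$

$$\text{Q-Learning:}\quad Q(S,A) \leftarrow Q(S,A) + \alpha\left[R + \gamma \max_{a'} Q(S',a') - Q(S,A)\right]$$

SARSA bootstraps from the action $A'$ it actually takes next (on-policy), while Q-Learning bootstraps from the greedy action (off-policy).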
We now transition to using deep learning to solve RL problems, an important inflection point in the course.
For small problems it is easy to represent the states and actions in tabular form (or as a higher-dimensional tensor). Here is a depiction.
nb_setup.images_vconcat(["RL_images/Tabular_RL.png"], width=600)
But as the state space grows this becomes untenable, so we work instead with neural networks (NNs) as continuous function approximators. Note that NNs are universal function approximators; here is a visual proof.
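As a first concrete step, here is a minimal sketch of replacing a Q-table with a network that maps a state vector to Q-values. It uses PyTorch with hypothetical sizes; this is an illustration, not the course's code.

import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # hypothetical sizes

q_net = nn.Sequential(
    nn.Linear(state_dim, 64),    # hidden layer replaces the table's rows
    nn.ReLU(),
    nn.Linear(64, n_actions),    # one Q-value per action replaces the table's cells
)

state = torch.randn(state_dim)         # a continuous state; no table index needed
q_values = q_net(state)                # approximate Q(S, A) for every action A
action = int(torch.argmax(q_values))   # greedy action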
A deeper dive into Deep Learning.
Here is a schematic of (supervised) machine learning, where a NN learns the mapping from input data to outputs using labels.
We can compare this with RL, which has a similar structure, as shown below.
nb_setup.images_hconcat(["RL_images/ML_Schematic.png","RL_images/RL_Schematic.png"], width=900)
Deep RL (DRL) is the combination of deep NNs with RL. Here the objective function is the total reward, which is to be maximized.
We feed the rollouts into a NN rather than a table: instead of updating tabular values, we now update the weights of the NN over each episode. A code sketch follows the figure below.
nb_setup.images_hconcat(["RL_images/Deep_RL.png"], width=500)
The graphic below shows DQN (Deep Q-Network) on the left and PGM (Policy Gradient Method) on the right.
We may call this functional RL, or approximate DP; these are just alternative names for the same idea.
nb_setup.images_hconcat(["RL_images/DQN.png","RL_images/PGM.png"], width=900)
So we start with an initial set of weights, run the DQN to generate data $S, A, R$ and the values $Q(S,A)$, then use these to update the weights $W$ and obtain a new $Q$ function. Repeat. Code sketches of the DQN and PGM loops follow the two figures below.
nb_setup.images_hconcat(["RL_images/DQN_Process.png"], width=500)
nb_setup.images_hconcat(["RL_images/PGM_Process.png"], width=500)
The form of the $f_W(\cdot)$ model below depends on which of these approaches is used.
nb_setup.images_hconcat(["RL_images/QSAW.png"], width=500)
The output could also be $K$-ary rather than unary. For example, if the action space consists of $K$ discrete values, we might want the network to output $Q(S,A_k,W)$ for all $k$ at once.
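Here is a hedged sketch of the two output conventions with hypothetical sizes: a unary network scores one $(S, A)$ pair per call, while a $K$-ary network returns all $K$ values in one forward pass.

import torch
import torch.nn as nn

state_dim, K = 4, 3   # hypothetical sizes

q_unary = nn.Sequential(nn.Linear(state_dim + 1, 32), nn.ReLU(), nn.Linear(32, 1))
q_kary = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, K))

s = torch.randn(state_dim)
a = torch.tensor([1.0])                # the action encoded as an input feature
print(q_unary(torch.cat([s, a])))      # a single value Q(S, A, W)
print(q_kary(s))                       # all K values Q(S, A_k, W) at once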
See the slides for a brief recap of functional models, a very fast run-through of ideas from Deep Learning. Advice: take the DL course for a fascinating technical education in the world of AI. Remember, it is best to learn theory while in school; it is much harder to do so on the job.
Standard gradient descent is applied.
nb_setup.images_hconcat(["RL_images/GradientDescent_RewardMax.png","RL_images/Gradient_Descent.png"], width=900)
nb_setup.images_hconcat(["RL_images/Gradient_Descent_K.png"], width=500)
nb_setup.images_hconcat(["RL_images/Logistic_Regression.png","RL_images/Cross_Entropy.png"], width=900)
Here is a plot of the cross-entropy when $t=1$, also known as the log-loss plot. The loss decreases from $\infty$ (as the prediction tends to $0$) to $0$ (as the prediction tends to $1$).
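A few values of the log-loss, confirming the shape of the plot:

import numpy as np

# Log-loss for target t = 1: L(y) = -log(y). It blows up as y -> 0
# and vanishes as y -> 1.
y = np.array([0.01, 0.1, 0.5, 0.9, 0.99])
print(np.round(-np.log(y), 3))   # [4.605 2.303 0.693 0.105 0.01 ]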
Gradients are obtained through back propagation, which applies the chain rule from calculus.
As a mathematical coincidence, the gradient of the cross-entropy loss (with a sigmoid output) has the same form as the gradient of the mean-squared-error loss (with a linear output)! The short derivation follows the figures below.
nb_setup.images_hconcat(["RL_images/GradientDescent_CrossEntropy.png","RL_images/GradientDescent_ChainRule.png"], width=900)
Choosing the settings for all of these (learning rate, network size, and so on) is known as hyperparameter tuning.
The gradients needed to update the weights of the approximating function for various models are shown below, followed by a small numeric sketch.
nb_setup.images_hconcat(["RL_images/WeightUpdate_Regression.png","RL_images/WeightUpdate_Logit.png",
"RL_images/WeightUpdate_BackProp.png"], width=1300)