from ipypublish import nb_setup
Recap: Full Sweep and Full Backup mean that all states are updated in each iteration. In MC and TD learning we sample from the MDP, so the updates are not Full Backups.
Previously we covered model-free policy evaluation; now we address model-free "control", by which we mean optimization of the policy given the MDP. This can be done by Policy Iteration, either Full Policy Iteration or Generalized Policy Iteration.
So, here we may undertake Policy Iteration, where we replace the value $v_{\pi}(s)$ with its estimate $V(s)$.
nb_setup.images_vconcat(["RL_images/ModelFreeControl_Uses.png"], width=600)
That is, we replace $v_{\pi}$ with its estimate $V$.
Will $V$ be a solution for $v$? This is not guaranteed, so the approach may not work. Why? Estimating $V_{\pi}$ alone is not enough for control: acting greedily with respect to $V$ requires a model of the MDP to evaluate the actions, and $Q_{\pi}(s,a)$ is not known. Therefore we estimate the $q$ values instead of the $v$ values.
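As a minimal illustration (assuming a hypothetical tabular setup with $Q$ stored as a NumPy array indexed by state and action), greedy policy improvement from $Q$ is just an argmax over actions and needs no transition model, which is exactly why we estimate the $q$ values rather than the $v$ values:

import numpy as np

# Hypothetical tabular setup: Q[s, a] holds the current action-value estimates.
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))

def greedy_action(Q, s):
    # Policy improvement from Q needs no model of the MDP:
    # simply pick the action with the highest estimated value in state s.
    return int(np.argmax(Q[s]))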
So we do an MC Backup for the Action Value Function: run the episode to the end of the path and only then update the $Q$ values. Policy Improvement is made only at the end of each episode. We record each $(S,A)$ along the path and update all of them once the return $G$ is known.
$$ Q(S,A) \leftarrow Q(S,A) + \alpha[G - Q(S,A)] $$
In each iteration we use the policy from the latest update, i.e., the one implied by the newest $Q(S,A)$ values. See Section 5.3 of SnB. Along the path, we always take the greedy action, the one with the highest $Q$ value.
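A minimal sketch of this every-visit MC backup, assuming (hypothetically) that an episode is stored as a list of (state, action, reward) triples and that Q is a defaultdict keyed by (state, action):

from collections import defaultdict

def mc_q_update(Q, episode, alpha=0.1, gamma=1.0):
    # Work backwards through the episode so the return G accumulates correctly,
    # then move each Q(S, A) along the path toward G.
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        Q[(s, a)] += alpha * (G - Q[(s, a)])
    return Q

# Example usage with a toy two-step episode:
Q = mc_q_update(defaultdict(float), [("s0", "a1", 0.0), ("s1", "a0", 1.0)])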
Below, at the bottom of the next graphic, we see the $(S,A)$ pairs. The consequent rewards are shown in green.
nb_setup.images_vconcat(["RL_images/QSA_Evaluation.png"], width=600)
Always taking the best action may not fully explore the space of actions; we need to visit all $(S,A)$ pairs. One remedy is Exploring Starts: use each and every $(S,A)$ pair as the starting point of some episode.
This ensures exploration of all $(S,A)$ pairs while the greedy policy continues to exploit, balancing exploration and exploitation. (See examples in the Slides.)
However, Exploring Starts may not always be feasible. So what can we do? Solution: instead of a purely greedy policy, we choose a "random greedy policy", i.e., act randomly with some probability $\epsilon$ and greedily otherwise. This is formally known as an $\epsilon$-greedy policy.
nb_setup.images_vconcat(["RL_images/e_greedy.png"], width=600)
We also need to gradually reduce the randomness in the policy. The solution is GLIE (Greedy in the Limit with Infinite Exploration): $\epsilon$ must eventually go to zero at a specified rate, while every $(S,A)$ pair is still visited infinitely often; see below.
nb_setup.images_vconcat(["RL_images/GLIE.png","RL_images/GLIE_MC.png"], width=600)
Section 6.4 of SnB.
Recall TD Learning, which is based on sampling:
$$ V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] $$
The update is made at each step $(S,A,R,S')$ of the episode, not at the end of the episode.
Likewise, we may also update the $Q$ function with SARSA, i.e., using the tuple $(S,A,R,S',A')$, as follows.
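A one-function sketch of this TD(0) backup for a tabular $V$ (the dictionary-based V table is an assumption):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    # Move V(s) toward the bootstrapped target r + gamma * V(s') after a single step.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V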
$$ Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma Q(S',A') - Q(S,A)] $$
nb_setup.images_vconcat(["RL_images/SARSA.png","RL_images/SARSA_algo.png"], width=600)
Since the action $A'$ used in the update is chosen from the same $\epsilon$-greedy policy that we are improving, and the improved policy is used right away to generate the next step, the approach is called On-Policy.
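A minimal sketch of the SARSA backup itself, assuming a defaultdict Q table; the next action $A'$ would be drawn from the same $\epsilon$-greedy policy being improved (e.g. the epsilon_greedy sketch above):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    # On-policy backup: a_next is the action the current epsilon-greedy policy
    # actually takes at s_next, so behavior and learning target coincide.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q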
nb_setup.images_vconcat(["RL_images/OffvsOn_Policy.png"], width=600)
TD is better than MC control because it has lower variance, can be used online after every step, and can learn from incomplete episodes.
SARSA converges under GLIE conditions on the policy, together with Robbins-Monro conditions on the step sizes $\alpha_t$ (see Slides).
When we go off-policy, the value we back up from the next state is not based on the action the existing policy would take, but on what may be best. It is like watching someone else's experience (the Action Agent) but choosing what is best for you (the Q-Learning Agent).
So in the update we do not use the action $A'$ from the current policy at state $S'$, but instead the best possible action at $S'$ (i.e., we act greedily, not $\epsilon$-greedy). Action $A$ is chosen randomly or $\epsilon$-greedily (this is the Behavior Policy), while $A'$ is chosen greedily (this is the Target Policy). Because the Target Policy differs from the Behavior Policy, the method is Off-Policy.
The Off-Policy agent observes the behavior (On-Policy) agent take action $A$, but then learns from the optimal action $A'$. It's like watching someone make mistakes and then doing better in the future.
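A minimal sketch of the Q-Learning backup, assuming Q is a NumPy array indexed by state and action; the behavior action a comes from an $\epsilon$-greedy policy, while the target bootstraps from the greedy value at the next state:

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    # Off-policy backup: the behavior policy chose action a (e.g. epsilon-greedy),
    # but the target uses max_a' Q(s', a'), i.e. the greedy Target Policy.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q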
nb_setup.images_vconcat(["RL_images/Q_Learning.png","RL_images/Q_Learning2.png"], width=600)
nb_setup.images_vconcat(["RL_images/Q_Learning3.png","RL_images/Q_Learning4.png"], width=600)
nb_setup.images_vconcat(["RL_images/Q_Learning5.png","RL_images/Q_Learning6.png"], width=600)
nb_setup.images_vconcat(["RL_images/OffPolicy_Uses.png"], width=600)
Q-Learning can also be implemented in Batch Mode.
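Because the Q-Learning target does not depend on how the data was generated, stored transitions can simply be replayed; a sketch of such a batch-mode pass (the transition list, table sizes, and sweep count are hypothetical):

import numpy as np

def batch_q_learning(transitions, n_states, n_actions, alpha=0.1, gamma=1.0, sweeps=10):
    # Replay a fixed batch of (s, a, r, s_next) transitions several times,
    # applying the Q-Learning update to each; no further interaction with the MDP is needed.
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next in transitions:
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q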
On Policy Monte Carlo Control: This involves a Policy Evaluation step followed by a Policy Improvement step. Policy Evaluation is done based on the data from a complete episode of the MDP. This is followed by Policy Improvement using the new Q values, and the new policy is used to generate the next episode.
On Policy Temporal Difference Control (SARSA): This also involves Policy Evaluation followed by Policy Improvement. However, the Policy Evaluation is based on the data from a single step of the MDP. This is immediately followed by Policy Improvement, and the improved policy is used to generate the next step of the MDP.
Off Policy Temporal Difference Control (Q Learning): In this case the Agent taking the Actions (using the Behavior Policy) is different from the Agent computing the optimal Q function (using the Target Policy). The Behavior Policy uses some randomness to traverse the MDP, and the Target Policy uses the data generated from this traversal to compute the optimal Q function.
nb_setup.images_vconcat(["RL_images/ModelFree_Summary.png"], width=600)