from ipypublish import nb_setup
Two types of algorithms are considered here: Monte Carlo (MC) learning and Temporal Difference (TD) learning.
Recall the Policy Evaluation equation from the Policy Iteration algorithm:
$$ v_{k+1}(s) = \sum_{a \in A} \pi(a|s) \left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right) $$
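For reference, here is a minimal sketch of this backup on a small hypothetical MDP (two states, two actions; the matrices P, rewards R, policy pi and discount gamma below are made-up illustrations, not taken from the text):
import numpy as np

# A hypothetical 2-state, 2-action MDP -- all numbers are made up for illustration.
# P[a] is the state-transition matrix under action a; R[a] is the expected reward vector.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.5, 2.0])}
pi = np.array([[0.5, 0.5], [0.5, 0.5]])   # pi[s, a]: the equiprobable policy
gamma = 0.9

# Iterate the Bellman expectation backup until (approximate) convergence.
v = np.zeros(2)
for k in range(200):
    v = sum(pi[:, a] * (R[a] + gamma * P[a] @ v) for a in (0, 1))
print(v)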
Let's explore the case where we do not know the Model, i.e., we do not know transition probabilities $P_{ss'}^a$.
It may also be hard to get an MDP Model $P$ if the number of states is very large.
In other words, we need to learn from experience: we will compute the Value Function directly from sample sequences $(S, A, R, S')$, without first estimating the Model.
Experience may be real or simulated.
Instead of a Full Backup, we do an MC Backup, also called a Sample Backup. The update, made at the end of each sample path, is
$$ V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] $$
nb_setup.images_vconcat(["RL_images/MonteCarloBackup.png"], width=600)
Example
Note that we run 100 steps and then stop, since a path can go on indefinitely. Truncation is not a problem, especially when $\gamma < 1$, because later rewards are heavily discounted. The average across all 1000 sample paths is the value function estimate $V(S_0)$. Capital $V$ is used for the sample average, lower-case $v$ for the expectation.
In the example, we get the value from the Bellman Expectation Equation (BEE) for the particular state from which we start, in this case state 0.
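A minimal simulation along these lines, on a hypothetical 3-state chain under a fixed policy (the transition matrix P, rewards r and the helper sample_return are illustrative assumptions): run 1000 sample paths of 100 steps from state 0 and average the discounted returns.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A hypothetical 3-state chain under a fixed policy; P[s, s'] and r[s] are made up.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
r = np.array([0.0, 1.0, 2.0])

def sample_return(s0, steps=100):
    """Discounted return G along one sample path, truncated at a fixed number of steps."""
    s, G, discount = s0, 0.0, 1.0
    for _ in range(steps):
        G += discount * r[s]
        discount *= gamma
        s = rng.choice(3, p=P[s])
    return G

# Average the returns of 1000 sample paths started from state 0.
returns = [sample_return(0) for _ in range(1000)]
print(np.mean(returns))   # MC estimate of V(S_0)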
nb_setup.images_vconcat(["RL_images/MC_MDP_Returns.png","RL_images/MC_PolicyEvaluation.png"], width=600)
In each simulation run, count the return ($G$) from a state only following the first time it occurs; if the state occurs again in the same run, ignore the later occurrences for that state (see Ch 5.1 of SnB). Since repeat visits within a run are ignored, this requires less bookkeeping per run.
If instead we count the return every time the state occurs, we get Every-Visit MC: a single path may then contribute several $G$ values for a state, and all of them enter the average.
See the Blackjack Example 5.1 in SnB.
nb_setup.images_vconcat(["RL_images/FirstVisit_MC.png"], width=600)
The sample average can be computed incrementally:
$$V_N(s) = V_{N-1}(s) + \frac{1}{N}\left[G_N(s) - V_{N-1}(s)\right], \quad \forall s $$
We may also want some "burn-in" time, i.e., let the experience accumulate for a while and only then start evaluating the average that gives the value function.
Another way to do this is to use Exponential Smoothing, i.e.,
$$ V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] $$
This leads to exponential forgetting of older returns. It is the MC Backup equation with a constant step size $\alpha$.
nb_setup.images_vconcat(["RL_images/FirstVisit_MC_ExponentialSmoothing.png"], width=600)
TD learning is useful because MC RL does not work well in some cases, for example when episodes are very long or never terminate.
Here the update is made at every step, using the tuple $(S_t, A_t, R_{t+1}, S_{t+1})$ and the following equation:
$$ V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] $$
Or,
$$ V(S_t) \leftarrow (1-\alpha) \cdot V(S_t) + \alpha \cdot [R_{t+1} + \gamma V(S_{t+1}) ] $$
Basically, we have replaced $G_t$ in MC learning with $R_{t+1} + \gamma V(S_{t+1}) $.
TD learns from incomplete episodes as we go along, continually updating the value function for each state. TD updates a guess towards a guess.
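To make the substitution concrete, here is a sketch of the two update rules side by side (the helper names mc_update and td_update and all numbers are hypothetical):
def mc_update(V, s, G, alpha=0.1):
    """MC backup: move V(s) towards the full sampled return G (end of episode)."""
    V[s] += alpha * (G - V[s])

def td_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """TD(0) backup: move V(s) towards the bootstrapped target reward + gamma * V(s')."""
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])

V = [0.0, 0.0, 0.0]                       # a hypothetical table of state values
td_update(V, s=0, reward=1.0, s_next=1)   # applied right after one step (S_t, A_t, R_{t+1}, S_{t+1})
print(V)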
The next graphic shows the key difference and link between MC learning and TD learning.
MC can work without the Markov property, but TD exploits the Markov property.
nb_setup.images_vconcat(["RL_images/DerivationTDrule.png"], width=600)
Below we see the full version of the MC vs TD relation. The simplest version of TD is called $TD(0)$, and the terminology of TD target and TD error is introduced below.
The TD algorithm below is probably one of the most important algorithms in RL. When we transition to Deep RL, this is what we will be using.
In TD we update after every step in the episode, but in MC we make an update only at the end of the episode.
nb_setup.images_vconcat(["RL_images/TD0.png","RL_images/TD_algorithm.png"], width=600)
Below we see a comparison.
nb_setup.images_vconcat(["RL_images/MCvsTD.png"], width=600)
In practice, TD often converges faster than MC, because the TD target uses $V(S_{t+1})$, which is an average and therefore has lower variance than the full Return.
The TD target has lower variance than the Return because it depends on only one random action, transition, and reward, whereas the Return depends on the whole remaining sequence of them.
TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_{\pi}(S_t)$.
Return $G_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-1} R_T$ is an unbiased estimate of $v_{\pi}(S_t)$.
The true TD target $R_{t+1} + \gamma v_{\pi}(S_{t+1})$ is an unbiased estimate of $v_{\pi}(S_t)$.
nb_setup.images_vconcat(["RL_images/TDMC_BiasvsVariance.png"], width=600)
Bootstrapping means that the current estimate is used to generate the next estimate. We note that DP and TD bootstrap, whereas MC does not.
Sampling occurs when the algorithm is based on random generation of paths: MC and TD sample, whereas DP backs up over all successor states.
nb_setup.images_vconcat(["RL_images/RL_UnifiedView.png"], width=600)
Sutton and Barto, Chs 5, 6 & 7.