Model Free Prediction

In [1]:
from ipypublish import nb_setup

Two types of algorithms to be considered here:

  1. Monte Carlo (MC) RL
  2. Temporal Difference (TD) Algorithms

Recall the Policy Iteration equation:

$$ v_{k+1}(s) = \sum_{a \in A} \pi(a|s) \left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right) $$

Let's explore the case where we do not know the Model, i.e., we do not know transition probabilities $P_{ss'}^a$.

It may also be hard to get an MDP Model $P$ if the number of states is very large.

In other words, instead of relying on a known Model, we must learn from experience. We will compute the Value Function from sample sequences of tuples $(S, A, R, S')$.

Experience may be real or simulated.
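As a concrete (and purely hypothetical) illustration of simulated experience, the sketch below builds a tiny 3-state MDP whose dynamics and rewards are hidden inside a simulator, so the learner only ever sees sampled tuples $(S, A, R, S')$. All numbers, and the names step and sample_episode, are assumptions made for this example.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)

N_STATES = 3
ACTIONS = [0, 1]

# Hidden dynamics P[a][s, s'] and rewards R[a][s]; purely illustrative numbers.
# The learner never reads these directly -- it only calls the simulator.
P = {0: np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]]),
     1: np.array([[0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.4, 0.1]])}
R = {0: np.array([1.0, 0.0, -1.0]),
     1: np.array([0.5, 0.5,  0.0])}

def step(s, a):
    """Sample one transition (reward, next state) from the hidden model."""
    s_next = rng.choice(N_STATES, p=P[a][s])
    return R[a][s], s_next

def sample_episode(policy, s0=0, T=100):
    """Generate one sample path of (S, A, R, S') tuples under a policy."""
    episode, s = [], s0
    for _ in range(T):
        a = rng.choice(ACTIONS, p=policy[s])
        r, s_next = step(s, a)
        episode.append((s, a, r, s_next))
        s = s_next
    return episode

# A uniform random policy: pi(a|s) = 0.5 for both actions in every state.
uniform_policy = np.full((N_STATES, len(ACTIONS)), 0.5)
print(sample_episode(uniform_policy, T=5))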

Monte Carlo Backup

Instead of a Full Backup, we do an MC Backup, also called a Sample Backup. The update, applied at the end of each sample path, is

$$ V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] $$

In [2]:
nb_setup.images_vconcat(["RL_images/MonteCarloBackup.png"], width=600)
Out[2]:

Example

Note that we run each path for 100 steps and then stop, since the process could go on indefinitely; the truncation error is small when $\gamma < 1$. The average return across all 1000 sample paths is the value function estimate $V(S_0)$. Capital $V$ is used for the sample average, lowercase $v$ for the expectation.

In the example, we obtain the Bellman Expectation Equation (BEE) value only for the particular state from which we start, in this case state 0.
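A minimal sketch of this calculation, reusing sample_episode and uniform_policy from the simulator above; the path length (100) and the number of paths (1000) follow the description in the text, while $\gamma = 0.9$ is an arbitrary illustrative choice.

In [ ]:
import numpy as np

gamma = 0.9

def discounted_return(episode, gamma):
    """G = R_1 + gamma*R_2 + gamma^2*R_3 + ... for one truncated sample path."""
    return sum((gamma ** t) * r for t, (_, _, r, _) in enumerate(episode))

# Average the discounted return over 1000 sample paths of 100 steps each.
returns = [discounted_return(sample_episode(uniform_policy, s0=0, T=100), gamma)
           for _ in range(1000)]
V_s0 = np.mean(returns)   # capital V: the sample-average estimate of v(S_0)
print(f"Estimated V(S_0) = {V_s0:.3f}")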

In [3]:
nb_setup.images_vconcat(["RL_images/MC_MDP_Returns.png","RL_images/MC_PolicyEvaluation.png"], width=600)
Out[3]:

First Visit and Every Visit MC Evaluation

In each simulation run, record the return ($G$) following a state only the first time that state occurs. If the same state occurs again later in the run, the return from that later visit is ignored, so each run contributes at most one return sample per state. See Ch. 5.1 of SnB.

If we record the return every time the state occurs, we have Every Visit MC. A single path can then contribute several $G$ values for a state, and all of them enter the average.

See the Blackjack Example 5.1 in SnB.
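A sketch of both variants, again reusing the sample_episode simulator and uniform_policy defined above; this is an illustration of the idea, not the book's exact pseudocode.

In [ ]:
from collections import defaultdict
import numpy as np

def mc_prediction(policy, n_episodes=1000, T=100, gamma=0.9, first_visit=True):
    """Tabular MC prediction: first-visit or every-visit depending on the flag."""
    returns = defaultdict(list)              # state -> list of sampled G values
    for _ in range(n_episodes):
        episode = sample_episode(policy, s0=0, T=T)
        states = [s for (s, _, _, _) in episode]
        G = 0.0
        # Walk backwards so G accumulates the return that follows each step.
        for t in reversed(range(len(episode))):
            s, _, r, _ = episode[t]
            G = r + gamma * G
            if first_visit and s in states[:t]:
                continue                     # state seen earlier: not the first visit
            returns[s].append(G)
    return {s: np.mean(g) for s, g in returns.items()}

print("First-visit:", mc_prediction(uniform_policy, first_visit=True))
print("Every-visit:", mc_prediction(uniform_policy, first_visit=False))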

In [4]:
nb_setup.images_vconcat(["RL_images/FirstVisit_MC.png"], width=600)
Out[4]:

Incremental Mean Update

$$V_N(s) = V_{N-1}(s) + \frac{1}{N}[G_N(s) - V_{N-1}(s)], \forall s $$

We may, however, want some "burn-in" time: let the experience accumulate for a while, and only then start averaging to estimate the value function.

Another way to do this is to use Exponential Smoothing, i.e.,

$$ V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] $$

A constant step size $\alpha$ leads to exponential forgetting of older returns. This is the MC Backup equation stated earlier.
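The sketch below contrasts the two updates on a simulated stream of returns for a single state; the distribution of the returns and the value of $\alpha$ are arbitrary choices for illustration.

In [ ]:
import numpy as np

rng2 = np.random.default_rng(1)
G_stream = rng2.normal(loc=5.0, scale=2.0, size=2000)   # hypothetical returns G_N(s)

V_mean, V_exp, alpha = 0.0, 0.0, 0.05
for N, G in enumerate(G_stream, start=1):
    V_mean += (1.0 / N) * (G - V_mean)   # incremental mean: exact running average
    V_exp  += alpha * (G - V_exp)        # constant alpha: exponential forgetting

print(f"Incremental mean: {V_mean:.3f}, exponential smoothing: {V_exp:.3f}")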

In [5]:
nb_setup.images_vconcat(["RL_images/FirstVisit_MC_ExponentialSmoothing.png"], width=600)
Out[5]:

Temporal Difference (TD) Reinforcement Learning

TD is useful because MC RL does not work well in some cases:

  1. We have to wait until the end of an episode to compute the value function update.
  2. Something might fail or change before the episode ends.
  3. The MDP may be continuing (never-ending), so there is no episode end at all.

Here the update is made at every step, using the transition $(S_t, A_t, R_{t+1}, S_{t+1})$ and the following equation:

$$ V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] $$

Or,

$$ V(S_t) \leftarrow (1-\alpha) \cdot V(S_t) + \alpha \cdot [R_{t+1} + \gamma V(S_{t+1}) ] $$

Basically, we have replaced $G_t$ in MC learning with $R_{t+1} + \gamma V(S_{t+1}) $.

TD learns from incomplete episodes as we go along, continually updating the value function for each state. TD updates a guess towards a guess.

The next graphic shows the key difference and link between MC learning and TD learning.

MC can work without the Markov property, but TD exploits the Markov property.

In [6]:
nb_setup.images_vconcat(["RL_images/DerivationTDrule.png"], width=600)
Out[6]:

TD(0)

Below we see the full version of the MC vs TD relation. The simplest version of TD is called $TD(0)$, and the terminology of TD target and TD error is introduced below.

The TD algorithm below is probably one of the most important in RL. When we transition to Deep RL this is what we will be using.

In TD we update after every step in the episode, but in MC we make an update only at the end of the episode.
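A minimal sketch of tabular TD(0) prediction, reusing step, rng, ACTIONS, N_STATES, and uniform_policy from the earlier MC sketches; the values of $\alpha$ and $\gamma$ are illustrative.

In [ ]:
import numpy as np

def td0_prediction(policy, n_episodes=1000, T=100, alpha=0.05, gamma=0.9):
    """Tabular TD(0): update V(S_t) after every single step of every path."""
    V = np.zeros(N_STATES)
    for _ in range(n_episodes):
        s = 0
        for _ in range(T):
            a = rng.choice(ACTIONS, p=policy[s])
            r, s_next = step(s, a)
            td_target = r + gamma * V[s_next]      # R_{t+1} + gamma * V(S_{t+1})
            V[s] += alpha * (td_target - V[s])     # move V(S_t) toward the TD target
            s = s_next
    return V

print("TD(0) estimate:", td0_prediction(uniform_policy))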

In [7]:
nb_setup.images_vconcat(["RL_images/TD0.png","RL_images/TD_algorithm.png"], width=600)
Out[7]:

MC vs TD

Below we see a comparison.

In [8]:
nb_setup.images_vconcat(["RL_images/MCvsTD.png"], width=600)
Out[8]:
  • In practice, TD often converges faster than MC, because the TD target uses $V(S_{t+1})$, which is an average and therefore has lower variance than the full Return.

  • The TD Target has lower variance than the Return because it depends on only one random action, transition, and reward; a small simulation after this list illustrates the variance gap.

  • TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_{\pi}(S_t)$.

  • Return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_{\pi}(S_t)$.

  • The true TD target $R_{t+1} + \gamma v_{\pi}(S_{t+1})$ is an unbiased estimate of $v_{\pi}(S_t)$.
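As the illustrative check promised above, the sketch below compares the empirical variance of the full return $G_t$ with that of the TD target $R_{t+1} + \gamma V(S_{t+1})$ from the same start state, reusing sample_episode, discounted_return, td0_prediction, and uniform_policy from the earlier cells; the specific numbers depend entirely on the toy MDP assumed there.

In [ ]:
import numpy as np

V = td0_prediction(uniform_policy)   # any reasonable value estimate will do here
gamma = 0.9

mc_samples, td_samples = [], []
for _ in range(2000):
    episode = sample_episode(uniform_policy, s0=0, T=100)
    mc_samples.append(discounted_return(episode, gamma))   # full (truncated) return
    _, _, r, s_next = episode[0]
    td_samples.append(r + gamma * V[s_next])               # one-step TD target

print(f"Variance of MC return: {np.var(mc_samples):.3f}")
print(f"Variance of TD target: {np.var(td_samples):.3f}")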

Bias vs Variance of TD and MC

In [9]:
nb_setup.images_vconcat(["RL_images/TDMC_BiasvsVariance.png"], width=600)
Out[9]:

Example

  • See Random Walk Example in the Slides. This is Example 6.2 in SnB.
  • The rate of convergence is also seen to be better for TD than MC.

Bootstrapping and Sampling

Bootstrapping means that the current estimate is used as part of the target for the next estimate. We note that

  • DP bootstraps.
  • MC does not bootstrap.
  • TD bootstraps.

Sampling occurs when the RL algorithm is based on random generation of paths.

  • DP does not sample.
  • MC samples.
  • TD samples.

Unified View of RL

In [10]:
nb_setup.images_vconcat(["RL_images/RL_UnifiedView.png"], width=600)
Out[10]:

References and Readings