Model-Based Control: Policy and Value Iteration

In [1]:
from ipypublish import nb_setup

Principle of Optimality

Decompose the problem into a smaller subproblem that is easy to solve (the first step) and a larger remaining subproblem whose optimal solution is assumed already known, then combine the two solutions to solve the full problem.

Bellman Expectation Equation

$$ v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \left[R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\pi}(s') \right] $$
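To make the notation concrete, the expectation equation maps directly onto array operations. Below is a minimal sketch of a single backup at one state, assuming a hypothetical toy MDP stored as arrays P[s, a, s'], R[s, a] and a policy pi[s, a] (all numbers invented purely for illustration, not taken from the slides or SnB).

In [ ]:
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (numbers made up for illustration)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s'] = P_ss'^a
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                  # R[s, a] = R_s^a
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],                 # pi[s, a] = pi(a|s): uniform random policy
               [0.5, 0.5]])
gamma = 0.9

def bellman_expectation_backup(v, s):
    """One application of the Bellman Expectation Equation at state s."""
    return sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ v)
               for a in range(P.shape[1]))

v = np.zeros(2)                              # arbitrary starting guess for v_pi
print(bellman_expectation_backup(v, 0))      # backed-up value of state 0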

In [2]:
nb_setup.images_hconcat(["RL_images/PrincipleOptimality.png"], width=600)
Out[2]:
In [3]:
nb_setup.images_hconcat(["RL_images/PrincipleOptimality2.png"], width=600)
Out[3]:

Finding the Optimal Policy

Noting that

$$ q_*(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s') $$

we have that

$$ \pi_*(s) = \mbox{argmax}_a \, q_*(s,a) $$
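In code, reading the optimal policy off $q_*$ is just an argmax over the action dimension. Here is a minimal sketch, assuming a hypothetical array q_star[s, a] of already-computed optimal action values.

In [ ]:
import numpy as np

# Hypothetical optimal action values q_star[s, a] for 3 states and 2 actions
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 4.0]])

pi_star = q_star.argmax(axis=1)   # deterministic greedy policy: best action in each state
print(pi_star)                    # [1 0 0]; ties are broken in favor of the first maximizer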

Deficiencies of Standard Dynamic Programming

  1. Assumes we know the Model perfectly. This is clearly not always valid.
  2. The state-space may be too large to implement standard DP. So DP may not scale.

RL is a way to get around these deficiencies by learning from experience (empirically) rather than by direct DP based on a complete theoretical model of the environment. Still, DP is a good starting point, because the RL algorithms that do scale are built on the ideas of DP.

Policy Evaluation (Prediction)

  1. MDP is known.
  2. Policy is known.

$$ v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \left[R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\pi}(s') \right] $$

$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a'|s')\, q_{\pi}(s',a') $$

Solution approaches (both are sketched in code below):

  1. Matrix inversion. Because $v_{\pi}$ appears linearly in the equation, it can be solved exactly, but solving the resulting linear system costs roughly $O(|S|^3)$ and is impractical for very large state spaces.
  2. Iteration.
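As a sketch of both approaches: averaging out the policy gives $v_{\pi} = R_{\pi} + \gamma P_{\pi} v_{\pi}$, which can be solved directly (matrix inversion) or by repeated backups (iteration). The arrays below are the same hypothetical toy MDP used above, not an example from the slides or SnB.

In [ ]:
import numpy as np

# Hypothetical toy MDP: P[s, a, s'], R[s, a], and policy pi[s, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
gamma = 0.9

# Average over the policy: P_pi[s, s'] and R_pi[s]
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)

# 1. Direct solution of (I - gamma P_pi) v = R_pi -- feasible only for small state spaces
v_direct = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

# 2. Iterative solution: apply the Bellman expectation backup until it (effectively) converges
v_iter = np.zeros_like(R_pi)
for _ in range(1000):
    v_iter = R_pi + gamma * P_pi @ v_iter

print(v_direct, v_iter)   # the two answers agree up to iteration error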
In [4]:
nb_setup.images_hconcat(["RL_images/IterativePolicyEvaluation.png"], width=600)
Out[4]:

The iterative approach uses the equation above to update the value function from iteration $k$ to iteration $k+1$, where the iteration index runs $k = 0, 1, 2, \ldots$. Each state takes its turn being on the LHS of the equation.

Two types of iteration (the value function is usually initialized to zero):

  1. Full backup: to compute the value of the root state above, use all of the successor states under the summation. That is, keep two full arrays, $v_k$ and $v_{k+1}$, and use $v_k$ to compute the new $v_{k+1}$. This is synchronous iteration.
  2. Full sweep: here, use just one array and overwrite its values "in place", so that new values generated earlier in a sweep also influence the updates later in the same sweep. This is called in-place iteration. It is potentially more efficient.

Note: both equations above come from the same pair of one-step relations, $v_{\pi}(s) = \sum_{a \in A} \pi(a|s)\, q_{\pi}(s,a)$ and $q_{\pi}(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\pi}(s')$. Substituting the second relation into the first gives the $v_{\pi}$ equation; substituting the first into the second gives the $q_{\pi}$ equation.

Example: Grid World, see Sutton and Barto (a.k.a. SnB), Example 4.1. The abbreviated example is in the slides; read the full example there or in SnB. The policy being evaluated is a very simple one: move N, S, E, or W with equal probability, and any action that would take the agent off the grid leaves it in the same cell (call this the "random policy"). What is the value function under this (non-optimal) policy? To answer this, we do "policy evaluation", as in the sketch below.
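Here is a minimal sketch of iterative policy evaluation for this 4x4 GridWorld under the random policy, with an in_place flag that switches between the synchronous (two-array) and in-place variants described above. The indexing scheme and helper function are my own, not SnB's code.

In [ ]:
import numpy as np

def gridworld_policy_evaluation(in_place=False, gamma=1.0, tol=1e-4):
    """Evaluate the equiprobable random policy on the 4x4 GridWorld of SnB Example 4.1.
    States 0 and 15 are terminal, every transition earns reward -1, and any action
    that would leave the grid keeps the agent in the same cell."""
    def step(s, a):                           # deterministic successor for action a in {N, S, E, W}
        r, c = divmod(s, 4)
        if   a == 0: r = max(r - 1, 0)
        elif a == 1: r = min(r + 1, 3)
        elif a == 2: c = min(c + 1, 3)
        else:        c = max(c - 1, 0)
        return 4 * r + c

    v = np.zeros(16)
    while True:
        v_src = v if in_place else v.copy()   # synchronous variant reads from a frozen copy
        delta = 0.0
        for s in range(1, 15):                # skip the terminal states 0 and 15
            new_v = sum(0.25 * (-1 + gamma * v_src[step(s, a)]) for a in range(4))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v.reshape(4, 4)

print(np.round(gridworld_policy_evaluation(), 1))   # compare SnB Figure 4.1 (k -> infinity): 0, -14, -20, -22, ...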

Policy Iteration (Optimization)

In the previous section, we iteratively evaluated a policy that was given to us. But what we really want is to find the optimal policy, improving the policy while evaluating the current one.

Revisit the GridWorld example. Now, in each cell, take the action that is greedy with respect to the current value function, i.e., the one that maximizes the immediate reward plus the discounted value of the successor cell (the value function already accounts for everything that happens afterwards). In cells where several actions are equally good, move to any of the maximizing cells at random. See SnB, Figure 4.1.

So the process goes as follows. Start with a policy $\pi$. Then:

  1. Evaluate the policy: $v_{\pi}(s) = E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$.

  2. Improve the policy by "greedy" choice of action given $v_{\pi}$.

$$ \pi' = \mbox{greedy}(v_{\pi}) $$

i.e., compute $\pi' \geq \pi$ (greedy policy improvement), where in each state the greedy action maximizes

$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\pi}(s') $$

Rinse and repeat 1 and 2.
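Putting steps 1 and 2 together gives the policy iteration loop. Below is a minimal sketch, assuming a known MDP stored as generic arrays P[s, a, s'] and R[s, a] (hypothetical names); the evaluation step uses the direct linear solve shown earlier, which is only one of several possible choices.

In [ ]:
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial deterministic policy
    while True:
        # 1. Policy evaluation: solve v = R_pi + gamma * P_pi v for the current policy
        P_pi = P[np.arange(n_states), policy]     # P_pi[s, s'] under the deterministic policy
        R_pi = R[np.arange(n_states), policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 2. Policy improvement: act greedily with respect to q_pi
        q = R + gamma * P @ v                     # q[s, a] = R_s^a + gamma * sum_s' P_ss'^a v(s')
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # policy stable => stop
            return policy, v
        policy = new_policy

On the toy MDP defined earlier, policy_iteration(P, R) returns the greedy policy together with its value function.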

Policy iteration always converges to $\pi_*$.
In [5]:
nb_setup.images_hconcat(["RL_images/OptimalPolicyIteration.png"],width=600)
Out[5]:

Proof of Convergence

The next diagram is an excellent depiction of how we iterate between policy evaluation and policy improvement until convergence. See Section 4.3 of SnB.

In [6]:
nb_setup.images_hconcat(["RL_images/OptimalPolicyValuationIteration.png"],width=600)
Out[6]:
In [7]:
nb_setup.images_vconcat(["RL_images/ProofPolicyImprovementConvergence.png","RL_images/ProofPolicyImprovementConvergence2.png"], width=600)
Out[7]:

Value Iteration

Instead of running several policy evaluation sweeps between improvements, can we do just one sweep of evaluation per improvement? If so, this is called Value Iteration. It saves a good deal of computation.

It turns the Bellman Optimality Equation (BOE) into an iterative update. Recall the BOE:

$$ v_{k+1}(s) = \max_{a \in A} \left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right) $$

So value iteration is essentially this equation executed repeatedly; it subsumes one sweep of policy evaluation and policy improvement in a single iterative update.
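A minimal sketch of this update, again assuming a known MDP stored as hypothetical arrays P[s, a, s'] and R[s, a]:

In [ ]:
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Apply the Bellman Optimality backup repeatedly until the value function stops changing."""
    n_states = R.shape[0]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v               # q[s, a] = R_s^a + gamma * sum_s' P_ss'^a v_k(s')
        v_new = q.max(axis=1)               # v_{k+1}(s) = max_a q[s, a]
        if np.max(np.abs(v_new - v)) < tol:
            return q.argmax(axis=1), v_new  # read a greedy policy off the converged values
        v = v_new

Note that no explicit policy is maintained inside the loop; a greedy policy is read off only once at the end.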

In the example below we can see that we go directly from one value function to the next, without constructing an explicit policy at each step, rather than alternating back and forth between value function and policy.

In [8]:
nb_setup.images_hconcat(["RL_images/Example_ShortestPaths.png"], width=600)
Out[8]:

To illustrate, note that in the figure above, the bottom right cell changes from -4 in $V_5$ to -5 in $V_6$ by the application of the BOE as follows (two actions are possible, up or down):

$$ \max \left(-1 + (-4), -1 + (-4) \right) = -5 $$

It's pretty simple, actually. The transition probabilities do not appear explicitly because movement is deterministic once the action is chosen: each action moves you to exactly one cell with certainty.

In [9]:
nb_setup.images_hconcat(["RL_images/PolicyValueIteration.png"], width=600)
Out[9]:

Convergence is not always faster for one method versus the other; it depends on the problem.

To summarize, see the table below.

In [10]:
nb_setup.images_hconcat(["RL_images/Summary_VP.png"], width=600)
Out[10]:

Asynchronous Dynamic Programming

Both the Full Backup and Full Sweep approaches back up all states in each iteration. If instead we back up only some of the states in each iteration, in whatever order we like, we get asynchronous dynamic programming. This reduces the computation per iteration and still converges, provided no state is permanently ignored, i.e., every state continues to be selected for backup. See Section 4.5 of SnB.

Two approaches are commonly used to choose which states to back up in a given iteration.

  1. Real Time Dynamic Programming. Instead of naively updating all states, spend more time updating the states that the agent actually visits. This is akin to importance sampling.

  2. Prioritized Sweeping. Update the states where the value function would change the most, i.e., where the magnitude of the Bellman error (the difference between the backed-up value of a state and its current value) is largest. A sketch follows below.
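As an illustration of the second idea, here is a minimal sketch of prioritized sweeping for a known model, keyed on the magnitude of the Bellman error. It is a simplification for illustration, not the exact algorithm in SnB; the arrays P[s, a, s'] and R[s, a] follow the same hypothetical format as above.

In [ ]:
import heapq
import numpy as np

def prioritized_sweeping(P, R, gamma=0.9, tol=1e-6, max_backups=10_000):
    """Back up states in order of absolute Bellman error rather than sweeping them all."""
    n_states = R.shape[0]
    v = np.zeros(n_states)

    def bellman_error(s):
        # difference between the Bellman optimality backup at s and the current value
        return np.max(R[s] + gamma * P[s] @ v) - v[s]

    # priorities are negated because heapq implements a min-heap
    heap = [(-abs(bellman_error(s)), s) for s in range(n_states)]
    heapq.heapify(heap)

    for _ in range(max_backups):
        if not heap:
            break
        neg_priority, s = heapq.heappop(heap)
        if -neg_priority < tol:              # largest remaining error is negligible => done
            break
        v[s] += bellman_error(s)             # apply the Bellman optimality backup at state s
        # re-prioritize the predecessors of s, whose errors may have changed
        for s2 in range(n_states):
            if np.any(P[s2, :, s] > 0):
                heapq.heappush(heap, (-abs(bellman_error(s2)), s2))
    return v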

References and Reading