from ipypublish import nb_setup
Decompose the problem into a smaller problem that is easy to solve and a larger problem that has already been solved, then combine the two to solve the full problem.
Bellman Expectation Equation
$$ v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \left[R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\pi}(s') \right] $$
nb_setup.images_hconcat(["RL_images/PrincipleOptimality.png"], width=600)
nb_setup.images_hconcat(["RL_images/PrincipleOptimality2.png"], width=600)
Noting that
$$ q_*(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s') $$
we have that
$$ \pi_*(s) = \mbox{argmax}_{a \in A} \; q_*(s,a) $$
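For instance, the optimal policy can be read off from a table of $q_*$ values by taking the argmax over actions in each state. The Q-table in this minimal sketch is made up purely for illustration:

import numpy as np

# Hypothetical optimal action-value table q_*(s, a): 3 states x 2 actions.
# The numbers are made up purely to illustrate the argmax step.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 4.2]])

# pi_*(s) = argmax_a q_*(s, a): pick the best action in every state.
pi_star = np.argmax(q_star, axis=1)
print(pi_star)   # [1 0 1]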
RL is a way to get around these deficiencies using experimental/experiential (empirical) learning rather than direct DP based on a theoretical view of the environment. But still, DP is a good starting point because the RL algorithms that do scale are based on the ideas in DP.
$$ v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \left[R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\pi}(s') \right] $$
$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a'|s') q_{\pi}(s',a') $$
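To make these equations concrete, here is a small sketch on a made-up two-state, two-action MDP (all numbers are assumptions, chosen only for illustration). With the policy fixed, the first equation is linear in $v_{\pi}$, so for a tiny MDP it can be solved directly and then checked against both equations:

import numpy as np

gamma = 0.9

# Made-up 2-state, 2-action MDP, purely to check the two equations above.
# P[a, s, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[ 1.0,  0.0],
              [ 2.0, -1.0]])

# A fixed stochastic policy pi(a|s): rows are states, columns are actions.
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# Under pi the MDP collapses to a Markov reward process with
# R_pi(s) = sum_a pi(a|s) R(s,a) and P_pi(s,s') = sum_a pi(a|s) P(a,s,s'),
# so the first equation is linear and can be solved directly for v_pi.
R_pi = np.sum(pi * R, axis=1)
P_pi = np.einsum('sa,ast->st', pi, P)
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi from the second equation, and a consistency check of the first:
q_pi = R + gamma * np.einsum('ast,t->sa', P, v_pi)
print(v_pi)
print(np.allclose(v_pi, np.sum(pi * q_pi, axis=1)))   # True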
Solution approaches:
nb_setup.images_hconcat(["RL_images/IterativePolicyEvaluation.png"], width=600)
This is solved by iteration: we use the equation above to update the value function from iteration $k$ to iteration $k+1$, with the iteration index $k = 0, 1, 2, \ldots$. Each state takes its turn being on the LHS of the equation.
Two types of iteration (usually we set $v_1 = 0$):
Note: it is not immediately clear how the second equation follows from the first.
Example: Grid World, see Sutton and Barto (henceforth SnB), Example 4.1. The abbreviated example is in the slides; one should read the full example there or in SnB. Here, the policy being evaluated is a very simple one: randomly walk NSEW with equal probability, and if a move would take you off the edge of the grid, you stay in the cell you started from (call this the "random policy"). What is the value function under this (non-optimal) policy? To answer this, we do "policy evaluation".
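As a minimal sketch of this policy evaluation, assuming the usual setup of SnB Example 4.1 (two terminal corner cells, reward $-1$ per move, no discounting; consult SnB for the exact details), iterative sweeps with the random policy look like this:

import numpy as np

# 4x4 gridworld in the spirit of SnB Example 4.1 (details are assumptions):
# the top-left and bottom-right cells are terminal, every move costs -1,
# there is no discounting, and moves off the grid leave the state unchanged.
N = 4
gamma = 1.0
terminal = {(0, 0), (N - 1, N - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # N, S, W, E

def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s   # bounce off the edge

v = np.zeros((N, N))
for k in range(1000):                             # policy evaluation sweeps
    v_new = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            if (r, c) in terminal:
                continue
            # Random policy: each of the four moves with probability 1/4.
            v_new[r, c] = sum(0.25 * (-1 + gamma * v[step((r, c), a)])
                              for a in actions)
    delta = np.max(np.abs(v_new - v))
    v = v_new
    if delta < 1e-8:
        break

print(np.round(v))    # compare with the random-policy values in SnB Figure 4.1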
In the previous section we evaluated a given policy iteratively. But what we really want is to find the optimal policy, by repeatedly improving on the current policy.
Revisit the GridWorld example. Now, in each cell, take the action with the best one-step return, i.e., the immediate reward plus the value of the next cell (this is also the best long-term choice, because the value function already accounts for all future rewards). For cells where several actions are equally good, choose among them at random. See SnB, Figure 4.1.
So the process goes as follows. Start with a policy $\pi$. Then:
1. Evaluate the policy: $v_{\pi}(s) = E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$.
2. Improve the policy by a "greedy" choice of action given $v_{\pi}$:
$$ \pi' = \mbox{greedy}(v_{\pi}) $$
That is, compute a new policy $\pi' \geq \pi$ (greedy policy improvement), choosing in each state the action that maximizes
$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\pi}(s') $$
3. Rinse and repeat steps 1 and 2 (a code sketch of this loop follows below).
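Here is a minimal sketch of this evaluate-improve loop on a small made-up tabular MDP (the transition and reward arrays are random and purely illustrative). For such a tiny MDP, the evaluation step can be done exactly by solving the linear Bellman expectation equation instead of running iterative sweeps:

import numpy as np

gamma = 0.9
nS, nA = 3, 2

# Made-up tabular MDP, purely for illustration:
# P[a, s, s'] = transition probabilities, R[s, a] = expected rewards.
rng = np.random.default_rng(0)
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(nS, nA))

def evaluate(policy):
    # Step 1: solve v_pi = R_pi + gamma * P_pi v_pi exactly (fine for a tiny MDP).
    R_pi = R[np.arange(nS), policy]
    P_pi = P[policy, np.arange(nS), :]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

def improve(v):
    # Step 2: act greedily w.r.t. q_pi(s,a) = R(s,a) + gamma * sum_s' P v(s').
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    return np.argmax(q, axis=1)

policy = np.zeros(nS, dtype=int)          # start with an arbitrary policy
while True:                               # step 3: rinse and repeat
    v = evaluate(policy)
    new_policy = improve(v)
    if np.array_equal(new_policy, policy):
        break                             # greedy policy is stable
    policy = new_policy

print(policy, np.round(v, 3))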
nb_setup.images_hconcat(["RL_images/OptimalPolicyIteration.png"],width=600)
The next diagram is an excellent depiction of how we iterate between policy evaluation and policy improvement until convergence. See Section 4.3 of SnB.
nb_setup.images_hconcat(["RL_images/OptimalPolicyValuationIteration.png"],width=600)
nb_setup.images_vconcat(["RL_images/ProofPolicyImprovementConvergence.png","RL_images/ProofPolicyImprovementConvergence2.png"], width=600)
Instead of running policy evaluation to convergence before each policy improvement, can we get away with a single evaluation sweep per improvement? If so, this is called Value Iteration, and it saves a good deal of computation.
Value Iteration turns the Bellman Optimality Equation (BOE) into an iterative update:
$$ v_{k+1}(s) = \max_{a \in A} \left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right) $$
So it's basically a repeated application of this equation, and it subsumes policy evaluation and policy improvement in a single iterative update.
In the example below we can see that we go directly from one value function to the next, rather than alternating back and forth between value function and policy.
nb_setup.images_hconcat(["RL_images/Example_ShortestPaths.png"], width=600)
To illustrate, note that in the figure above, the bottom right cell changes from -4 in $V_5$ to -5 in $V_6$ by the application of the BOE as follows (two actions are possible, up or down):
$$ \max \left(-1 + (-4), -1 + (-4) \right) = -5 $$
It's pretty simple actually. The transition probabilities are not needed here because the movement is deterministic: once an action is chosen, it moves you to exactly one cell with certainty.
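To see a full sweep in code, here is a minimal value-iteration sketch on a shortest-path gridworld of the kind shown above. The specifics (a 4x4 grid, goal in the top-left corner, reward $-1$ per step, deterministic moves, no discounting) are assumptions made for illustration:

import numpy as np

# Shortest-path gridworld of the kind shown above (details are assumptions):
# goal in the top-left corner, every step costs -1, moves are deterministic,
# undiscounted, and moves off the grid leave you where you are.
N = 4
goal = (0, 0)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

v = np.zeros((N, N))
for k in range(20):
    v_new = np.zeros((N, N))
    for r in range(N):
        for c in range(N):
            if (r, c) == goal:
                continue
            # Bellman optimality backup: best immediate reward plus next value.
            v_new[r, c] = max(-1 + v[step((r, c), a)] for a in actions)
    if np.array_equal(v_new, v):
        break                              # converged
    v = v_new

print(v)    # for this 4x4 grid the bottom-right cell settles at -6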
nb_setup.images_hconcat(["RL_images/PolicyValueIteration.png"], width=600)
Neither approach always converges faster than the other; it depends on the problem.
To summarize, see the table below.
nb_setup.images_hconcat(["RL_images/Summary_VP.png"], width=600)
Both of these full-backup, full-sweep approaches back up every state in each iteration. If instead we back up only some of the states in each iteration (chosen in some, possibly random, fashion), the method is known as asynchronous dynamic programming. This reduces computation, but to converge it must not permanently ignore any state: convergence holds as long as every state continues to be selected. See Section 4.5 of SnB.
There are two approaches used here to choose which states to back up in any iteration.
Real Time Dynamic Programming. Instead of naively updating all states, spend more time updating the states that the agent actually visits. This is akin to importance sampling.
Prioritized Sweeping. Update first the states where the change in the value function from one iteration to the next is largest. (The change in a state's value between two successive backups is known as the "Bellman error".)
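As a minimal sketch of the prioritized-sweeping idea (not the full algorithm), at each step back up the single state with the largest Bellman error. A real implementation maintains a priority queue and only re-prioritizes the predecessors of the state that changed; this sketch recomputes all priorities by brute force on a made-up tabular MDP, purely for clarity:

import numpy as np

gamma = 0.9
nS, nA = 5, 2

# Made-up tabular MDP, purely for illustration: P[a, s, s'], R[s, a].
rng = np.random.default_rng(1)
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(nS, nA))

def backup(s, v):
    # One Bellman optimality backup for state s.
    return max(R[s, a] + gamma * P[a, s] @ v for a in range(nA))

v = np.zeros(nS)
for _ in range(500):
    # Bellman error of every state under the current estimates.
    errors = np.array([abs(backup(s, v) - v[s]) for s in range(nS)])
    if errors.max() < 1e-8:
        break
    s = int(np.argmax(errors))        # highest-priority state first
    v[s] = backup(s, v)               # asynchronous: update just this one state

print(np.round(v, 3))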