from ipypublish import nb_setup
nb_setup.images_hconcat(["RL_images/Model_Based_Overview.png","RL_images/Model_Based_Overview2.png"], height=400)
Model-Free RL is called "Learning" (learning from experience), whereas Model-Based RL is called "Planning" (planning with a model). Model-Based RL makes efficient use of supervised learning and can reason about model uncertainty. But it has two sources of approximation error: one from learning the model and one from approximating the value function.
A model $M$ is a representation of an MDP $(S,A,P,R)$ parameterized by $\eta$. Assuming the state space $S$ and action space $A$ are known, a model $M=(P_\eta,R_\eta)$ represents the state transitions and rewards:
$$ S_{t+1} \sim P_\eta(S_{t+1}|S_t,A_t) $$
$$ R_{t+1} = R_\eta(R_{t+1}|S_t,A_t) $$
Estimate model $M$ from experience: $\{S_1,A_1,R_2,...,S_T\}$.
Supervised learning problem: map $S_t,A_t \rightarrow R_{t+1},S_{t+1}$.
Mapping $s,a \rightarrow r$ is a regression problem.
Mapping $s,a \rightarrow s'$ is a density estimation problem.
Pick a loss function and find parameters $\eta$ that minimize empirical loss.
See the Table Lookup AB Example in the slides.
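In the table-lookup case the model is estimated simply by counting: the transition probability is the fraction of times action $a$ in state $s$ led to $s'$, and the reward is the mean observed reward for $(s,a)$. A minimal sketch of such a model is below (the class and method names are illustrative, not from the slides):
import numpy as np
from collections import defaultdict

class TableLookupModel:
    """Table-lookup model M = (P_eta, R_eta): count transitions, average rewards."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # N(s, a, s')
        self.reward_sum = defaultdict(float)                   # cumulative reward for (s, a)
        self.visits = defaultdict(int)                         # N(s, a)

    def update(self, s, a, r, s_next):
        """Supervised 'training' step: record the observed (s, a) -> (r, s')."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def sample(self, s, a):
        """Sample S' ~ P_eta(.|s, a) and return the mean reward R_eta(s, a).
        Assumes (s, a) has been visited at least once."""
        next_states = list(self.counts[(s, a)].keys())
        probs = np.array([self.counts[(s, a)][s2] for s2 in next_states], dtype=float)
        probs /= probs.sum()
        s_next = next_states[np.random.choice(len(next_states), p=probs)]
        r = self.reward_sum[(s, a)] / self.visits[(s, a)]
        return s_next, r
For the AB example, each observed transition of the recorded episodes would be fed to `update`, after which `sample` draws simulated experience from the learned model.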
Sample-based planning uses knowledge of the model only to generate sample experience $S_{t+1},R_{t+1}$.
Then apply model-free learning to the samples via MC control, Sarsa, or Q-learning. These sample-based methods are often very efficient. One version of this is Tabular Q-Planning, shown in the figure and sketched below.
nb_setup.images_hconcat(["RL_images/Tabular_Q_Planning.png"], width=600)
Background Planning: use simulated experience to gradually improve the value function or policy. Not focused on any particular state. Example: the Dyna algorithm.
Decision Time Planning: focuses on finding the best action for the current state; the algorithm is run afresh for each new state encountered. Example: Monte Carlo Tree Search (MCTS).
Sampled experience could be "Real" (sampled from the environment, i.e., from the true MDP)
$$ S' \sim P_{s,s'}^a, \quad R = R_s^a $$
or "Simulated" (sampled from the model, i.e., from an approximate MDP)
$$ S' \sim P_\eta(S' | S,A), \quad R = R_\eta(R | S,A) $$
Dyna uses both real and simulated experience to improve the action-value function $Q(S,A)$, as shown in the figures and the sketch below.
nb_setup.images_hconcat(["RL_images/Dyna_Architecture.png","RL_images/Dyna_Q_Algorithm.png"], width=900)
Decision-time planning focuses on a particular state $S$: choose the best action to take there, then stop the planning process. The planning process is restarted for each new state.
"Forward Search" algorithms select the best action by lookahead. Only solve the best action for the MDP starting in a given search. It is like backward recursion but in a forward pass approach. Run simulations going forward from a specific node and solve for that and then rollback a node or level of the tree one at a time.
nb_setup.images_hconcat(["RL_images/Forward_Search.png","RL_images/Simple_MC_Search.png"], width=900)
Here we do not evaluate the whole tree; instead we grow a selective search tree around the most promising actions, adding and evaluating nodes incrementally. See the slides for full details. There are four steps: Selection, Expansion, Simulation, and Backup. MCTS has good properties: it is a highly selective, best-first search that evaluates states dynamically and uses sampling to break the curse of dimensionality. A UCT-style sketch follows the figure below.
The slides have an example of playing Go with MCTS and Deep RL.
nb_setup.images_hconcat(["RL_images/MCTS.png"], width=600)
nb_setup.images_hconcat(["RL_images/Reducing_Action_Candidates1.png","RL_images/Reducing_Action_Candidates2.png"], width=900)
nb_setup.images_hconcat(["RL_images/AlphaGoZero.png"], width=600)