Model-Based Planning

In [1]:
from ipypublish import nb_setup
In [2]:
nb_setup.images_hconcat(["RL_images/Model_Based_Overview.png","RL_images/Model_Based_Overview2.png"], height=400)
Out[2]:

Model-Free RL is called "Learning" (from experience), whereas Model-Based RL is called "Planning" (from a model). Model-Based RL can learn the model efficiently with supervised learning and can reason about model uncertainty, but it has two sources of approximation error: one from learning the model and the other from approximating the value function.

What is a Model?

A model $M$ is a representation of an MDP $(S,A,P,R)$, parameterized by $\eta$. Assuming the state space $S$ and action space $A$ are known, a model $M=(P_\eta,R_\eta)$ represents the state transitions and rewards:

$$ S_{t+1} \sim P_\eta(S_{t+1}|S_t,A_t) $$

$$ R_{t+1} \sim R_\eta(R_{t+1}|S_t,A_t) $$
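As a concrete illustration, for a small discrete state and action space the model can be stored as a transition table and a reward table. The class below is a hypothetical sketch (the name `LearnedModel` and the tabular parameterization of $\eta$ are illustrative, not from the slides); its `sample` method implements the two equations above.

In [ ]:
import numpy as np

class LearnedModel:
    """Illustrative model M = (P_eta, R_eta) for a small discrete MDP.
    P_eta[s, a] is a probability vector over next states;
    R_eta[s, a] is the expected immediate reward."""
    def __init__(self, n_states, n_actions, seed=0):
        self.P_eta = np.full((n_states, n_actions, n_states), 1.0 / n_states)
        self.R_eta = np.zeros((n_states, n_actions))
        self.rng = np.random.default_rng(seed)

    def sample(self, s, a):
        """Simulate one step: S_{t+1} ~ P_eta(.|s,a), reward from R_eta(s,a)."""
        s_next = self.rng.choice(len(self.P_eta[s, a]), p=self.P_eta[s, a])
        return int(s_next), float(self.R_eta[s, a])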

Model Learning

Estimate model $M$ from experience: $\{S_1,A_1,R_2,...,S_T\}$.

Supervised learning problem: map $S_t,A_t \rightarrow R_{t+1},S_{t+1}$.

Mapping $s,a \rightarrow r$ is a regression problem.

Mapping $s,a \rightarrow s'$ is a density estimation problem.

Pick a loss function and find parameters $\eta$ that minimize empirical loss.

See the Table Lookup AB Example in the slides.
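For small discrete problems such as the AB example, a table-lookup model can be fit by simple counting: $\hat{P}(s'|s,a)$ is the empirical fraction of observed transitions and $\hat{R}(s,a)$ is the mean observed reward. The function below is a minimal sketch of this count-based estimator; the input format (a list of episodes, each a list of $(s, a, r, s')$ tuples) is an assumption, not something fixed by the slides.

In [ ]:
from collections import defaultdict

def fit_table_lookup_model(episodes):
    """Count-based (table lookup) model estimate from experience."""
    counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': count}
    reward_sum = defaultdict(float)                  # (s,a) -> total reward
    visits = defaultdict(int)                        # (s,a) -> N(s,a)
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r
            visits[(s, a)] += 1
    P_hat = {sa: {sn: c / visits[sa] for sn, c in d.items()}
             for sa, d in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat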

Sample-Based Planning

Use the model only to generate samples of experience $S_{t+1}, R_{t+1}$.

Then apply model-free learning to the sampled experience, e.g., Monte Carlo control, Sarsa, or Q-learning. Sample-based planning methods are often highly efficient. One version of this is Tabular Q-Planning.
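The sketch below shows one common formulation (random-sample one-step tabular Q-planning): repeatedly pick a state-action pair at random, query the model for a simulated reward and next state, and apply a one-step Q-learning backup. It assumes discrete states and actions indexed from 0 and a model object exposing a `sample(s, a)` method, as in the earlier sketch.

In [ ]:
import numpy as np

def tabular_q_planning(model, n_states, n_actions, n_steps=10_000,
                       alpha=0.1, gamma=0.95, seed=0):
    """Tabular Q-planning: Q-learning backups on purely simulated transitions."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_steps):
        s = int(rng.integers(n_states))          # random state
        a = int(rng.integers(n_actions))         # random action
        s_next, r = model.sample(s, a)           # simulated experience only
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q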

In [3]:
nb_setup.images_hconcat(["RL_images/Tabular_Q_Planning.png"], width=600)
Out[3]:

Types of Planning Algorithms

  1. Background Planning: Use simulated experience from the model to gradually improve the value function or policy. Not focused on any particular state. Example: the Dyna algorithm.

  2. Decision Time Planning: Focuses on finding the best action for the current state. The algorithm is run separately for each new state encountered. Example: Monte Carlo Tree Search (MCTS).

Background Planning

Sampled experience could be "Real" (sampled from the environment, i.e., from the true MDP)

$$ S' \sim P_{s,s'}^a, \quad R = R_s^a $$

or "Simulated" (sampled from the model, i.e., from an approximate MDP)

$$ S' \sim P_\eta(S' | S,A), \quad R \sim R_\eta(R | S,A) $$

Use the simulated experience to improve the action-value function $Q(S,A)$.
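The sketch below compresses tabular Dyna-Q into one function. It is illustrative rather than a reference implementation: it assumes discrete states and actions indexed from 0, a classic Gym-style `env` whose `reset()` returns a state and whose `step(a)` returns `(state, reward, done, info)`, and a deterministic table-lookup model stored as a dictionary.

In [ ]:
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q sketch: direct RL from real experience, model learning,
    and n_planning simulated backups per real step."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    model = {}                                   # (s, a) -> (r, s')
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection in the real environment
            a = (int(rng.integers(n_actions)) if rng.random() < epsilon
                 else int(Q[s].argmax()))
            s_next, r, done, *_ = env.step(a)
            # (a) direct RL update from real experience
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            # (b) model learning (deterministic table lookup)
            model[(s, a)] = (r, s_next)
            # (c) planning: Q-learning backups on simulated transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = list(model.items())[
                    int(rng.integers(len(model)))]
                Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])
            s = s_next
    return Q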

In [4]:
nb_setup.images_hconcat(["RL_images/Dyna_Architecture.png","RL_images/Dyna_Q_Algorithm.png"], width=900)
Out[4]:

Decision Time Planning

Decision time planning focuses on a particular state $S$: choose the best action to take from that state, then stop planning. The planning process is restarted for each new state encountered.

"Forward Search" algorithms select the best action by lookahead. Only solve the best action for the MDP starting in a given search. It is like backward recursion but in a forward pass approach. Run simulations going forward from a specific node and solve for that and then rollback a node or level of the tree one at a time.

In [5]:
nb_setup.images_hconcat(["RL_images/Forward_Search.png","RL_images/Simple_MC_Search.png"], width=900)
Out[5]:

Monte Carlo Tree Search (MCTS)

Here we do not evaluate the whole tree; instead the search is focused on the most promising nodes. See the slides for full details. There are four steps: Selection, Expansion, Simulation, and Backup. MCTS has good properties:

  • Highly selective best-first search.
  • Evaluates states dynamically, unlike DP.
  • Sampling breaks curse of dimensionality.
  • Works for "black-box" models, only requires samples.
  • Computationally efficient, anytime, parallelizable.

The slides have an example of playing Go with MCTS and Deep RL.
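The core loop can be sketched in a few dozen lines. The code below is a generic UCT-style implementation, not the exact algorithm from the slides: it assumes the hypothetical `model.sample(s, a)` interface used earlier, a uniformly random rollout policy, and (near-)deterministic transitions so that each child node can store a single successor state.

In [ ]:
import math
import numpy as np

class _Node:
    """One search-tree node: visit count N, total return W, children by action."""
    def __init__(self, state):
        self.state, self.N, self.W, self.children = state, 0, 0.0, {}

def mcts(model, root_state, n_actions, n_iter=500, rollout_depth=20,
         gamma=0.95, c_uct=1.4, seed=0):
    """Minimal UCT-style MCTS: select by UCB1, expand one child, evaluate it
    with a random rollout, and back the return up the visited path."""
    rng = np.random.default_rng(seed)
    root = _Node(root_state)
    for _ in range(n_iter):
        node, path, total, discount = root, [root], 0.0, 1.0
        # 1. Selection: descend through fully expanded nodes via UCB1.
        while len(node.children) == n_actions:
            a = max(node.children, key=lambda act:
                    node.children[act].W / max(node.children[act].N, 1)
                    + c_uct * math.sqrt(math.log(node.N + 1)
                                        / (node.children[act].N + 1e-9)))
            _, r = model.sample(node.state, a)   # deterministic assumption:
            node = node.children[a]              # reuse the stored child state
            path.append(node)
            total += discount * r
            discount *= gamma
        # 2. Expansion: add one child for an untried action.
        a = int(rng.choice([b for b in range(n_actions) if b not in node.children]))
        s_next, r = model.sample(node.state, a)
        child = _Node(s_next)
        node.children[a] = child
        path.append(child)
        total += discount * r
        discount *= gamma
        # 3. Simulation: random rollout from the newly added node.
        s = s_next
        for _ in range(rollout_depth):
            s, r = model.sample(s, int(rng.integers(n_actions)))
            total += discount * r
            discount *= gamma
        # 4. Backup: update statistics along the visited path.
        for n in path:
            n.N += 1
            n.W += total
    # Recommend the most-visited root action.
    return max(root.children, key=lambda act: root.children[act].N)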

In [6]:
nb_setup.images_hconcat(["RL_images/MCTS.png"], width=600)
Out[6]:

Reducing Action Candidates using Policy Gradients

In [7]:
nb_setup.images_hconcat(["RL_images/Reducing_Action_Candidates1.png","RL_images/Reducing_Action_Candidates2.png"], width=900)
Out[7]:
In [8]:
nb_setup.images_hconcat(["RL_images/AlphaGoZero.png"], width=600)
Out[8]:

Readings and References