Markov Decision Processes (MDPs)

In [2]:
from ipypublish import nb_setup

Overview, Model-Free vs Model-Based RL

There are 4 elements of an MDP:

  1. State ($S=s$).
  2. Action ($A=a$).
  3. Reward ($R(s,a)$).
  4. Transition Probability $T = P[s'|s,a]$.

Call this the ${\bf S,A,R,P}$ model.
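
As a concrete illustration, here is one possible encoding of a tiny ${\bf S,A,R,P}$ model in Python with NumPy; the two states, two actions, and all numbers are made up for this sketch.

In [ ]:
# A minimal, made-up S,A,R,P model: 2 states x 2 actions (illustrative only)
import numpy as np

states  = ["s0", "s1"]                # S
actions = ["a0", "a1"]                # A

R = np.array([[ 1.0, 0.0],            # R[s, a]: reward for action a in state s
              [-1.0, 2.0]])

P = np.array([[[0.9, 0.1],            # P[s, a, s']: transition probability P[s'|s, a]
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])

assert np.allclose(P.sum(axis=2), 1.0)   # each P[s, a, :] is a valid distribution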

Goal: Find a policy $\pi$ that maximizes expected future (possibly discounted) reward. The stream of rewards along one path of the model from time $t$ to the end is called the Return, denoted $G_t$.

Result: A Value Function $v_{\pi}(s)$ for the policy $\pi(s)$, which gives the value of applying this policy in every state $s$. We want to find the optimal value function, denoted $v_*(s)$, under the optimal policy $\pi_*(s)$.

  • If you know all of $S,A,R,P$, then you can calculate the policy $\pi(s)$ for all states $s$ without interacting with the environment, which makes this a planning problem. The agent solves for an optimal policy using various approaches that we will see later, e.g., policy iteration and value iteration.

  • In true RL problems, the agent does not have all 4 components above. Usually knowledge of $R$ and $P$ is missing, so interaction with the environment is needed. The agent may run a series of trials, interacting with the environment, and thus learn $P$ and $R$ by observing which transitions and rewards occur when it takes action $a$ in state $s$. Once it has estimated $P$ and $R$, i.e., a model of the system, the agent can stop interacting with the environment and proceed to solve a planning problem. This is called model-based RL. Model-based RL allows the agent to use the model to determine the optimal value function $v_*(s)$ and policy $\pi_*(s)$ directly.

  • When the agent interacts with the environment and does not build a transition probability model, but only estimates the Return $G$ for each state $s$ and action $a$ under policy $\pi$, known as the "quality" function $q_{\pi}(s,a)$, we call it model-free RL. The nomenclature comes from the absence of a model of state transitions.

Recap (Intuitive)

  • Fully observable environment: model-based RL. Mental models might be a starting point.
  • Partially observable environment: model-free RL.
In [3]:
nb_setup.images_hconcat(["RL_images/Model_Free_Based.png"], width=600)
Out[3]:

Markov Property

A state is Markov if the future depends only on the present state, not on the full history:

$$ P[S_{t+1} | S_t] = P[S_{t+1} | S_t, S_{t-1}, \ldots] $$

  • A Markov process (MP) is a situation where there are no Rewards or Actions and the world just evolves. This is different from an MDP.

  • A Markov chain is a tuple $\langle S,P \rangle$. The chain can be represented by a transition probability matrix $P$ over a finite set of states. This matrix completely characterizes the model, and each of its rows sums to 1.

  • Tree representation (also called a "Rollout"): starting from the root, the chain evolves downward, with the same states repeated as nodes at each level. So the same Markov chain can also be represented as a tree, which shows time explicitly and gives a picture of the evolution of states over time (simulated in the sketch below).
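
A minimal sketch, assuming a made-up 3-state chain: the matrix rows sum to 1, and one sampled rollout is exactly the tree picture unrolled over time.

In [ ]:
# Hypothetical 3-state Markov chain <S, P>; each row of P sums to 1
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
assert np.allclose(P.sum(axis=1), 1.0)

rng = np.random.default_rng(0)
s = 0
path = [s]
for t in range(10):
    s = rng.choice(3, p=P[s])         # sample the next state from row s
    path.append(int(s))
print(path)                           # one rollout: the evolution of states over time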

MDP

  • An MDP includes Rewards and Actions, which affect the State; the difference between a Markov process and an MDP is like the difference between watching a movie and playing a video game.
  • Add $A_t, R_t$ and a discount factor $\gamma$.
  • The transition probability is now represented as a tensor, which can be sliced by current state, next state, or action (see the slicing sketch after the equation below).

$$ P_{ss'}^a = P[S_{t+1}=s' | S_t=s, A_t=a] $$
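
The same object in code: a hypothetical transition tensor stored as P[s, a, s'], which can be sliced by current state, next state, or action.

In [ ]:
# Hypothetical transition tensor P[s, a, s'] for 2 states and 2 actions
import numpy as np

P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])

print(P[:, 0, :])     # slice by action a0: an |S| x |S| transition matrix
print(P[1, :, :])     # slice by current state s1: rows indexed by action
print(P[:, :, 0])     # slice by next state s' = s0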

  • Below is a diagram of the MDP transitions. There are two types of transitions and arrows (red and blue). Note also the Rewards attached to the Actions. The transitions are random but known, so this is the model-based setting.
In [4]:
#MDPdiagram
nb_setup.images_hconcat(["RL_images/MDPdiagram.png"], width=600)
Out[4]:
In [6]:
nb_setup.images_hconcat(["RL_images/MDPmatrix.png"], width=600)
Out[6]:
In [7]:
nb_setup.images_hconcat(["RL_images/MDPmatrix2.png"], width=600)
Out[7]:

MDP Tree Representation

This representation mimics the Matrix Representation and is often called a "Rollout".

In [9]:
#MDPTree_Rollout
nb_setup.images_hconcat(["RL_images/MDPTree_Rollout.png"], width=600)
Out[9]:

Policies

  • A policy is a distribution over Actions given States (note that here $\pi(\cdot)$ is a probability).

$$ \pi(a|s) = P[A_t=a | S_t = s] $$

  • Fully defines the behavior of the agent.

  • Written $\pi(s) = a$ if deterministic; otherwise different Actions have probabilities attached. A deterministic policy gives a transition probability matrix over states.

  • MDP policies depend on the current state (not history).

  • Policies are stationary (time-independent).

  • For a deterministic policy, note that the $\pi(a|s)$ matrix is all binary (0 or 1). Not so for a random (stochastic) policy, as in the sketch and figures below.
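
A small sketch of the two kinds of policy matrices, and of how a policy combines with a (made-up) transition tensor to induce a state-to-state transition matrix.

In [ ]:
# Hypothetical policy matrices: rows = states, columns = actions
import numpy as np

pi_det  = np.array([[1.0, 0.0],       # deterministic: all entries are 0 or 1
                    [0.0, 1.0]])
pi_rand = np.array([[0.6, 0.4],       # stochastic: each row is a distribution over actions
                    [0.3, 0.7]])

P = np.array([[[0.9, 0.1],            # transition tensor P[s, a, s'] (made up)
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])

# A policy induces P_pi[s, s'] = sum_a pi(a|s) P[s, a, s']
P_pi = np.einsum("sa,sap->sp", pi_rand, P)
print(P_pi, P_pi.sum(axis=1))         # rows of the induced matrix still sum to 1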

In [10]:
#Policy matrices (deterministic)
nb_setup.images_hconcat(["RL_images/Policy_deterministic.png"], width=600)
Out[10]:
In [11]:
#Policy matrix (random)
nb_setup.images_hconcat(["RL_images/Policy_random.png"], width=600)
Out[11]:

MDP Rewards

  • The Reward is defined by the original State, the Action, and the next State, and can be represented as a tensor.

$$ r(s_1; s_0,a_1) $$

  • But we can also write the reward as a function of $s_0, a_1$ only, which can be represented in a matrix. This is an expectation over the preceding tensor.

$$ r(s_0, a_1) = E[r(s_1; s_0,a_1)] $$

  • This is illustrated in the sketch and diagrams below.
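
A minimal sketch of this expectation in code, with a made-up reward tensor r[s, a, s'] and transition tensor P[s, a, s'].

In [ ]:
# Collapse the reward tensor to a matrix: r(s, a) = E[r(s'; s, a)]
import numpy as np

P = np.array([[[0.9, 0.1],            # P[s, a, s'] (made up)
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
r_sas = np.array([[[ 1.0, 0.0],       # r[s, a, s'] (made up)
                   [ 0.5, 2.0]],
                  [[-1.0, 1.0],
                   [ 0.0, 3.0]]])

r_sa = (P * r_sas).sum(axis=2)        # expectation over next states s'
print(r_sa)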
In [12]:
#MDP Tree Representation (Rewards)
nb_setup.images_hconcat(["RL_images/MDPrewards.png"], width=600)
Out[12]:
In [13]:
#MDP rewards matrix
nb_setup.images_hconcat(["RL_images/MDPrewards_matrix.png"], width=600)
Out[13]:

MDP Return

  • The Return $G_t$ is the total discounted reward from time $t$ out into the future. $G$ is different depending on the path taken; it is a random variable.

$$ G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

  • Discounting is used to time-normalize rewards (as in finance); observed behavior shows discounting; and it is mathematically convenient for infinite-horizon problems. If $\gamma=1$, make sure reward sequences terminate.

  • Run several paths; the mean Return from each state is the value (function) of that state (see the sketch below).
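
A quick sketch of computing the Return for one (made-up) reward path; averaging such Returns over many sampled paths from a state estimates that state's value.

In [ ]:
# Discounted return G_t = sum_k gamma^k R_{t+k+1} for one hypothetical reward sequence
gamma = 0.9
rewards = [1.0, 0.0, -2.0, 5.0]       # made-up rewards R_{t+1}, R_{t+2}, ...
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)
# Averaging G over many sampled paths starting in a state estimates that state's value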

In [15]:
#Example MDP Returns
nb_setup.images_hconcat(["RL_images/MDPreturns.png"], width=800)
Out[15]:

State Value Function

  • The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$.

$$ v_{\pi}(s) = E_{\pi}[G_t | S_t=s ] $$

In [16]:
#State Value Function Fire Policy
nb_setup.images_hconcat(["RL_images/StateValueFunction.png"], width=600)
Out[16]:

Action Value Function

  • The action-value function $q_{\pi}(s,a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$.

$$ q_{\pi}(s,a) = E_{\pi}[G_t | S_t=s, A_t=a ] $$

Bellman Expectation Equation (BEE)

  • Gives the value function for a given policy.

  • Given a policy $\pi$, compute the value functions $v_{\pi}(s)$ and $q_{\pi}(s,a)$.

  • Bellman Expectation Equations:

$$ v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t=s] $$

$$ q_{\pi}(s,a) = E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1},A_{t+1}) | S_t=s, A_t=a] $$

  • Implementation (known as a "Backup")

$$ v_{\pi}(s) = \sum_a \pi(a|s) q_{\pi}(s,a) $$

In [20]:
nb_setup.images_hconcat(["RL_images/BEE_SVF.png"], width=300)
Out[20]:

Function $q_{\pi}(s,a)$

$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') $$

In [21]:
nb_setup.images_hconcat(["RL_images/BEE_AVF.png"], width=300)
Out[21]:

BEE for State Value Function $v_{\pi}(s)$

  • Putting these together, we get the Bellman Expectation Equation for the value functions

$$ v_{\pi}(s) = \sum_a \pi(a|s) \left(R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') \right) $$

  • Note that $\pi(a|s)$ is a probability.
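
A minimal iterative policy-evaluation sketch that applies this backup repeatedly until the values stop changing; the policy, rewards, and transition probabilities are made-up toy numbers.

In [ ]:
# Iterative policy evaluation: repeatedly apply the Bellman Expectation backup
import numpy as np

gamma = 0.9
pi = np.array([[0.5, 0.5],            # pi(a|s): a made-up stochastic policy
               [0.5, 0.5]])
R  = np.array([[ 1.0, 0.0],           # R[s, a] (made up)
               [-1.0, 2.0]])
P  = np.array([[[0.9, 0.1],           # P[s, a, s'] (made up)
                [0.2, 0.8]],
               [[0.5, 0.5],
                [0.0, 1.0]]])

v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * P @ v             # q(s,a) = R_s^a + gamma * sum_s' P_ss'^a v(s')
    v_new = (pi * q).sum(axis=1)      # v(s) = sum_a pi(a|s) q(s,a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)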
In [28]:
nb_setup.images_hconcat(["RL_images/BEE.png"], width=500)
Out[28]:

BEE for the Action Value Function $q_{\pi}(s,a)$

$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \sum_{a'} \pi(a'|s') q_{\pi}(s',a') $$

In [29]:
nb_setup.images_hconcat(["RL_images/BEEq.png"], width=500)
Out[29]:

Example

In [30]:
#Example BEE for Fire Policy
#Solve 3 equations in 3 unknowns
nb_setup.images_hconcat(["RL_images/BEE_SVF_Example.png"], width=500)
Out[30]:

Easy to get the $q_{\pi}(s,a)$ values once we know the $v_{\pi}(s)$ values.

$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') $$

In [31]:
nb_setup.images_hconcat(["RL_images/BEE_AVF_Example.png"], width=600)
Out[31]:

BEE in Matrix Form

  • The equations above are linear and can be solved using Linear Algebra.

  • We can stack up the equations as follows:

$$ v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi} $$

  • The solution follows easily through matrix inversion.

$$ v_{\pi} = (I-\gamma P^{\pi})^{-1} R^{\pi} $$
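
In code, assuming the same kind of toy numbers as the earlier sketches, $v_{\pi}$ follows from one linear solve, and $q_{\pi}(s,a)$ then falls out from the relation above.

In [ ]:
# Direct solution of the BEE: v_pi = (I - gamma * P_pi)^(-1) R_pi  (toy numbers)
import numpy as np

gamma = 0.9
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])
R  = np.array([[ 1.0, 0.0],
               [-1.0, 2.0]])
P  = np.array([[[0.9, 0.1],
                [0.2, 0.8]],
               [[0.5, 0.5],
                [0.0, 1.0]]])

R_pi = (pi * R).sum(axis=1)                 # R_pi[s]    = sum_a pi(a|s) R[s, a]
P_pi = np.einsum("sa,sap->sp", pi, P)       # P_pi[s,s'] = sum_a pi(a|s) P[s, a, s']
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

q_pi = R + gamma * P @ v_pi                 # q_pi(s,a) = R_s^a + gamma sum_s' P_ss'^a v_pi(s')
print(v_pi)
print(q_pi)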

Bellman Optimality Equation (BOE)

  • The optimal state-value function $v_*(s)$ is the maximum value over all policies

$$ v_*(s) = \max_{\pi} v_{\pi}(s) $$

  • Still does not tell us what policy to follow.

Optimal Action Value Function

$$ q_*(s,a) = \max_{\pi} q_{\pi}(s,a) $$

Can we get the optimal policy from $q_*(s,a)$? Yes, because in each state it tells us the best Action, which is then our optimal policy (see the sketch below).
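
A tiny sketch of reading the optimal policy off a hypothetical table of $q_*(s,a)$ values (the numbers are made up).

In [ ]:
# Greedy policy extraction from a hypothetical q_*(s, a) table
import numpy as np

q_star = np.array([[1.3, 2.1],        # rows = states, columns = actions (made up)
                   [0.4, 0.2]])
pi_star = q_star.argmax(axis=1)       # best action in each state is the optimal policy
print(pi_star)                        # [1 0]: take a1 in s0, a0 in s1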

In [32]:
#Example of optimal policy
nb_setup.images_hconcat(["RL_images/OptimalPolicy.png"], width=600)
Out[32]:

Optimal Policy

  • Define a partial ordering over policies: $\pi \geq \pi'$ if $v_{\pi}(s) \geq v_{\pi'}(s), \forall s$.

  • Existence theorem (from the slides): for any MDP there exists an optimal policy $\pi_*$ that is better than or equal to all other policies, and all optimal policies achieve the optimal value functions $v_*(s)$ and $q_*(s,a)$.

  • $\pi_*(a|s) = 1$ if $a = \mbox{argmax}_{a'} q_*(s,a')$, and zero otherwise.

BOE for $v_*$

$$ v_*(s) = \max_a q_*(s,a) $$

and recall that $q_*(s,a) = \max_{\pi} q_{\pi}(s,a)$.

$$ q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') $$

and of course then

$$ v_*(s) = \max_a \left(R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') \right) $$
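
A minimal value-iteration sketch, applying this optimality backup repeatedly on a made-up toy MDP (the numbers are illustrative, not from the figures).

In [ ]:
# Value iteration: repeatedly apply the Bellman Optimality backup (toy numbers)
import numpy as np

gamma = 0.9
R = np.array([[ 1.0, 0.0],            # R[s, a]
              [-1.0, 2.0]])
P = np.array([[[0.9, 0.1],            # P[s, a, s']
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])

v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * P @ v             # q(s,a) = R_s^a + gamma * sum_s' P_ss'^a v(s')
    v_new = q.max(axis=1)             # v_*(s) = max_a q(s,a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
pi_star = q.argmax(axis=1)            # greedy policy w.r.t. the converged values
print(v, pi_star)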

In [36]:
#Bellman Optimality Equation for v_*
nb_setup.images_hconcat(["RL_images/BOE_SVF.png"], width=600)
Out[36]:

BOE for $q_*$

$$ q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') $$

In [35]:
nb_setup.images_hconcat(["RL_images/BOE_AVF.png"], width=500)
Out[35]:

BOE Final Expressions

These are then combined with the equations above to give the final BOE expressions for both $v_*(s)$ and $q_*(s,a)$; for example, substituting gives

$$ q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q_*(s',a') $$

In [38]:
nb_setup.images_hconcat(["RL_images/BOE_SVF2.png"], width=600)
Out[38]:
In [39]:
nb_setup.images_hconcat(["RL_images/BOE_AVF2.png"], width=600)
Out[39]:
In [40]:
#Example Optimal Value Function
nb_setup.images_hconcat(["RL_images/Example_optimalSVF.png"], width=600)
Out[40]:
In [41]:
#Example Optimal Action-Value Function
nb_setup.images_hconcat(["RL_images/Example_optimalAVF.png"], width=600)
Out[41]:
In [45]:
#Summary, BEE, BOE
nb_setup.images_hconcat(["RL_images/Summary_BEE_BOE.png","RL_images/Summary_OptimalPolicy.png"], width=1000)
Out[45]:

Solving the BOE

  1. The BOE is nonlinear.
  2. There is no closed-form solution in general.
  3. Iterative solution methods:
    • Value iteration
    • Policy iteration
    • Q-learning (a model-free sketch follows below)
    • SARSA
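
For contrast with the planning methods above, here is a minimal tabular Q-learning sketch. It is model-free: the learning update never reads $P$ or $R$ as a model, only sampled transitions, which are simulated here from the same kind of made-up toy model used in the earlier sketches. The hyperparameters (alpha, eps) are illustrative choices.

In [ ]:
# Tabular Q-learning sketch: learn q_* from sampled transitions only (model-free)
import numpy as np

gamma, alpha, eps = 0.9, 0.1, 0.1
R = np.array([[ 1.0, 0.0],              # used only to simulate the environment
              [-1.0, 2.0]])
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])

rng = np.random.default_rng(0)
Q = np.zeros((2, 2))                    # Q[s, a] estimates q_*(s, a)
s = 0
for _ in range(20000):
    a = rng.integers(2) if rng.random() < eps else Q[s].argmax()   # epsilon-greedy action
    s_next = rng.choice(2, p=P[s, a])                              # sampled transition
    r = R[s, a]                                                    # observed reward (deterministic in this toy model)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])     # Q-learning backup
    s = s_next
print(Q, Q.argmax(axis=1))              # learned q-values and the greedy policy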

Readings and References