from ipypublish import nb_setup
There are 4 elements of an MDP: a set of States $S$, a set of Actions $A$, Rewards $R$, and state Transition Probabilities $P$.
Call this the ${\bf S,A,R,P}$ model.
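As a concrete sketch (the array shapes and numbers below are illustrative, not taken from the figures), a small finite MDP can be stored as a transition probability tensor $P$ and an expected reward matrix $R$:

import numpy as np

# Toy MDP with 3 states and 2 actions (illustrative numbers only).
n_states, n_actions = 3, 2

# P[s, a, s'] = probability of moving to s' when taking action a in state s.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.5, 0.5], [0.7, 0.3, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [0.0, 0.0],
])

assert np.allclose(P.sum(axis=2), 1.0)    # each (s, a) slice is a probability distribution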
Goal: Find a policy $\pi$ that maximizes future expected (possibly discounted) reward. The stream of reward along one path of the model from time $t$ to the end is called the Return, denoted $G_t$.
Result: A Value Function $v_{\pi}(s)$ for the policy $\pi(s)$, which gives the value of applying this policy starting from each state $s$. We want to find the optimal value function, denoted $v_*(s)$, under the optimal policy $\pi_*(s)$.
If you know all of $S,A,R,P$, then you can calculate the policy $\pi(s)$ for all states $s$ without interacting with the environment. This makes the situation that of a planning problem. The agent solves for an optimal policy using various approaches that we will see later, e.g., policy iteration and value iteration.
In true RL problems, the agent does not have all 4 components above. Usually it is missing knowledge of $R$ and $P$, so interaction with the environment is needed. The agent may run a series of trials, interacting with the environment, and thus learn $P$ and $R$ by seeing what transitions and rewards occur when it takes action $a$ in state $s$. Once it has obtained a distribution for $P$ and $R$, i.e., a model of the system, the agent can stop interacting with the environment and proceed to solve a planning problem. This is called model-based RL. Model-based RL allows the agent to use the model to determine the optimal value function $v_*(s)$ and policy $\pi_*(s)$ directly.
When the agent interacts with the environment and does not build a transition probability model, but only gets an idea of the Return $G$ at each state $s$ and action $a$ under policy $\pi$, known as the "quality" function $q_{\pi}(s,a)$, then we call it model-free RL. The nomenclature comes from the absence of a model of state transitions.
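A minimal sketch of the model-based idea, under the assumption that we have logged sample transitions $(s, a, r, s')$ from interacting with the environment: estimate $P$ by normalized counts and $R$ by the average observed reward (the function name estimate_model is ours, not from a library).

import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate P[s, a, s'] and R[s, a] from a list of (s, a, r, s') samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)                        # times (s, a) was tried
    P_hat = counts / np.maximum(visits, 1)[:, :, None] # unvisited (s, a) pairs stay all-zero
    R_hat = reward_sum / np.maximum(visits, 1)
    return P_hat, R_hat

# Example: two logged transitions in a 3-state, 2-action MDP (made-up data).
samples = [(0, 1, 1.0, 2), (0, 1, 0.0, 1)]
P_hat, R_hat = estimate_model(samples, n_states=3, n_actions=2)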
Recap (Intuitive)
nb_setup.images_hconcat(["RL_images/Model_Free_Based.png"], width=600)
A state is Markov if the future depends only on the present state, not on the full history:
$$ P[S_{t+1} | S_t] = P[S_{t+1} | S_t, S_{t-1}, \ldots] $$
A Markov process (MP) is a situation where there are no Rewards or Actions and the world just evolves. This is different from an MDP.
A Markov chain is a tuple $\langle S,P \rangle$. The chain can be represented by a transition probability matrix $P$ over a finite set of states. This matrix completely characterizes the model. Each row of the matrix sums to 1.
Tree representation (also called a "Rollout"): the same Markov chain can also be represented as a tree. From the root you evolve downwards, and the same states reappear as nodes at different depths. This representation also shows time, giving a picture of the evolution of states over time.
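A minimal sketch of sampling one such rollout, given only a transition probability matrix (the matrix below is illustrative):

import numpy as np

def rollout(P, start_state, n_steps, seed=0):
    """Sample one path of a Markov chain with transition matrix P (rows sum to 1)."""
    rng = np.random.default_rng(seed)
    path = [start_state]
    for _ in range(n_steps):
        path.append(rng.choice(len(P), p=P[path[-1]]))
    return path

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
assert np.allclose(P.sum(axis=1), 1.0)     # each row is a probability distribution
print(rollout(P, start_state=0, n_steps=5))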
$$ P_{ss'}^a = P[S_{t+1}=s' | S_t=s, A_t=a] $$
#MDPdiagram
nb_setup.images_hconcat(["RL_images/MDPdiagram.png"], width=600)
nb_setup.images_hconcat(["RL_images/MDPmatrix.png"], width=600)
nb_setup.images_hconcat(["RL_images/MDPmatrix2.png"], width=600)
This tree representation conveys the same information as the matrix representation and is often called a "Rollout".
#MDPTree_Rollout
nb_setup.images_hconcat(["RL_images/MDPTree_Rollout.png"], width=600)
$$ \pi(a|s) = P[A_t=a | S_t = s] $$
The policy fully defines the behavior of the agent.
If deterministic, the policy is given by $\pi(s) = a$; otherwise it assigns probabilities to several possible Actions. A deterministic policy, combined with the transition probabilities $P$, gives a transition probability matrix over states.
MDP policies depend on the current state (not history).
Policies are stationary (time-independent).
For a deterministic policy, note that the $\pi(s)$ matrix is binary (every entry is 0 or 1). Not so for a random (stochastic) policy.
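A small illustrative sketch of the two kinds of policy matrix, with an action sampled from the row for the current state (toy numbers):

import numpy as np

# pi[s, a] = probability of taking action a in state s (3 states, 2 actions).
pi_deterministic = np.array([[1.0, 0.0],     # each row is one-hot: entries are 0 or 1
                             [0.0, 1.0],
                             [1.0, 0.0]])
pi_random = np.array([[0.7, 0.3],            # each row is a probability distribution
                      [0.4, 0.6],
                      [0.5, 0.5]])

rng = np.random.default_rng(0)
s = 1
a = rng.choice(2, p=pi_random[s])            # sample an action from the stochastic policy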
#Policy matrices (deterministic)
nb_setup.images_hconcat(["RL_images/Policy_deterministic.png"], width=600)
#Policy matrix (random)
nb_setup.images_hconcat(["RL_images/Policy_random.png"], width=600)
$$ r(s_1; s_0,a_1) $$
$$ r(s_0, a_1) = E[r(s_1; s_0,a_1)] $$
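A sketch of this expectation in code, assuming the reward is attached to the realized transition and stored as an array r_next[s, a, s'] (all numbers illustrative):

import numpy as np

# r_next[s, a, s'] = reward received when action a in state s lands in s'.
r_next = np.zeros((3, 2, 3))
r_next[0, 1, 2] = 10.0                    # illustrative: a bonus for reaching state 2

# P[s, a, s'] from the MDP model; here a toy tensor with uniform transitions.
P = np.full((3, 2, 3), 1.0 / 3.0)

# R[s, a] = E[ r(S_{t+1}; s, a) ] = sum_{s'} P[s, a, s'] * r(s'; s, a)
R = (P * r_next).sum(axis=2)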
#MDP Tree Representation (Rewards)
nb_setup.images_hconcat(["RL_images/MDPrewards.png"], width=600)
#MDP rewards matrix
nb_setup.images_hconcat(["RL_images/MDPrewards_matrix.png"], width=600)
$$ G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
The discount factor is used to time-normalize rewards (as in finance); observed behavior exhibits discounting; and it is a mathematical convenience that keeps the sum finite in infinite-horizon problems. If $\gamma=1$, then make sure the reward sequences terminate.
Run several sample paths; the mean Return observed in each state is the value (function) of that state.
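A sketch of both ideas: compute the discounted Return $G_t$ for one sampled reward sequence, then estimate a state's value by averaging the Returns of several sampled paths (the reward sequences below are made up):

import numpy as np

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for one sampled path."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Monte Carlo estimate of a state's value: average the Returns of many
# sampled paths that start in that state.
paths = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 0.0, 0.0]]   # illustrative reward sequences
gamma = 0.9
v_estimate = np.mean([discounted_return(p, gamma) for p in paths])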
#Example MDP Returns
nb_setup.images_hconcat(["RL_images/MDPreturns.png"], width=800)
$$ v_{\pi}(s) = E_{\pi}[G_t | S_t=s ] $$
#State Value Function Fire Policy
nb_setup.images_hconcat(["RL_images/StateValueFunction.png"], width=600)
$$ q_{\pi}(s,a) = E_{\pi}[G_t | S_t=s, A_t=a ] $$
Both $v_{\pi}$ and $q_{\pi}$ give value functions for a given, fixed policy $\pi$.
Given a policy $\pi$, compute the value functions $v_{\pi}(s)$ and $q_{\pi}(s,a)$.
Bellman Expectation Equations:
$$ v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t=s] $$
$$ q_{\pi}(s,a) = E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1},A_{t+1}) | S_t=s, A_t=a] $$
$$ v_{\pi}(s) = \sum_a \pi(a|s) q_{\pi}(s,a) $$
nb_setup.images_hconcat(["RL_images/BEE_SVF.png"], width=300)
$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') $$
nb_setup.images_hconcat(["RL_images/BEE_AVF.png"], width=300)
$$ v_{\pi}(s) = \sum_a \pi(a|s) \left(R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') \right) $$
nb_setup.images_hconcat(["RL_images/BEE.png"], width=500)
$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \sum_{a'} \pi(a'|s') q_{\pi}(s',a') $$
nb_setup.images_hconcat(["RL_images/BEEq.png"], width=500)
#Example BEE for Fire Policy
#Solve 3 equations in 3 unknowns
nb_setup.images_hconcat(["RL_images/BEE_SVF_Example.png"], width=500)
Easy to get the $q_{\pi}(s,a)$ values once we know the $v_{\pi}(s)$ values.
$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') $$
nb_setup.images_hconcat(["RL_images/BEE_AVF_Example.png"], width=600)
The equations above are linear and can be solved using Linear Algebra.
We can stack up the equations as follows:
$$ v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi} $$
$$ v_{\pi} = (I-\gamma P^{\pi})^{-1} R^{\pi} $$
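A sketch of this direct solve (and of recovering $q_{\pi}$ from $v_{\pi}$ as above), assuming the same P[s, a, s'], R[s, a], pi[s, a] array layout as in the earlier snippets:

import numpy as np

def solve_v_pi(P, R, pi, gamma):
    """Solve v_pi = (I - gamma * P_pi)^{-1} R_pi exactly."""
    P_pi = np.einsum('sa,sat->st', pi, P)    # P^pi[s, s'] = sum_a pi(a|s) P[s, a, s']
    R_pi = (pi * R).sum(axis=1)              # R^pi[s]     = sum_a pi(a|s) R[s, a]
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def q_from_v(P, R, v, gamma):
    """q_pi(s,a) = R[s,a] + gamma * sum_s' P[s,a,s'] v_pi(s')."""
    return R + gamma * P @ v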
$$ v_*(s) = \max_{\pi} v_{\pi}(s) $$
$$ q_*(s,a) = \max_{\pi} q_{\pi}(s,a) $$
Can we get the optimal policy from $q_*(s,a)$? Yes, because it tells us the best Action in each state, and choosing that Action in each state is the optimal policy.
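A sketch of reading the optimal policy off $q_*$; the q_star table below is a placeholder, not values from the example figure:

import numpy as np

q_star = np.array([[1.0, 3.0],      # placeholder optimal action-values, q_*[s, a]
                   [2.0, 0.5],
                   [0.0, 0.0]])

greedy_actions = q_star.argmax(axis=1)              # best action in each state
pi_star = np.eye(q_star.shape[1])[greedy_actions]   # one-hot (deterministic) policy matrix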
#Example of optimal policy
nb_setup.images_hconcat(["RL_images/OptimalPolicy.png"], width=600)
Define a partial ordering over policies: $\pi \geq \pi'$ if $v_{\pi}(s) \geq v_{\pi'}(s), \forall s$.
Theorem (existence of an optimal policy, from the slides): for any MDP there exists an optimal policy $\pi_*$ with $\pi_* \geq \pi$ for all policies $\pi$, and every optimal policy achieves the optimal value functions, $v_{\pi_*}(s) = v_*(s)$ and $q_{\pi_*}(s,a) = q_*(s,a)$.
An optimal deterministic policy can be read off $q_*$: $\pi_*(a|s) = 1$ if $a = \mbox{argmax}_{a'} q_*(s,a')$, and zero otherwise.
$$ v_*(s) = \max_a q_*(s,a) $$
and recall that $q_*(s,a) = \max_{\pi} q_{\pi}(s,a)$.
$$ q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') $$
and of course then
$$ v_*(s) = \max_a \left(R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') \right) $$
#Bellman Optimality Equation for v_*
nb_setup.images_hconcat(["RL_images/BOE_SVF.png"], width=600)
$$ q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') $$
nb_setup.images_hconcat(["RL_images/BOE_AVF.png"], width=500)
These are then combined with the equations above to give the final BOE expressions for both $v_*(s)$ and $q_*(s,a)$.
nb_setup.images_hconcat(["RL_images/BOE_SVF2.png"], width=600)
nb_setup.images_hconcat(["RL_images/BOE_AVF2.png"], width=600)
#Example Optimal Value Function
nb_setup.images_hconcat(["RL_images/Example_optimalSVF.png"], width=600)
#Example Optimal Action-Value Function
nb_setup.images_hconcat(["RL_images/Example_optimalAVF.png"], width=600)
#Summary, BEE, BOE
nb_setup.images_hconcat(["RL_images/Summary_BEE_BOE.png","RL_images/Summary_OptimalPolicy.png"], width=1000)