
Markov Decision Processes (MDPs)

In [2]:
from ipypublish import nb_setup

Overview, Model-Free vs Model-Based RL

There are 4 elements of an MDP:

  1. State ($S=s$).
  2. Action ($A=a$).
  3. Reward ($R(s,a)$).
  4. Transition Probability $T = P[s'|s,a]$.

Call this the S,A,R,P model.
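As a concrete illustration, here is a minimal sketch of the S,A,R,P components as plain numpy arrays for a hypothetical two-state, two-action problem (all numbers are made up for illustration only):

In [ ]:
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions; all numbers are illustrative.
S = ["s0", "s1"]                      # state space
A = ["stay", "move"]                  # action space

R = np.array([[ 1.0, 0.0],            # R[s, a]: expected reward for taking action a in state s
              [-1.0, 2.0]])

P = np.array([[[0.9, 0.1],            # P[a, s, s']: transition probability tensor
               [0.2, 0.8]],           #   first slice  = action "stay"
              [[0.3, 0.7],
               [0.6, 0.4]]])          #   second slice = action "move"

assert np.allclose(P.sum(axis=-1), 1.0)   # each (a, s) row is a probability distribution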

Goal: Find a policy $\pi$ that maximizes future expected (possibly discounted) reward. The stream of rewards along one path of the model from time $t$ to the end is called the Return, denoted $G_t$.

Result: A Value Function $v_\pi(s)$ for the policy $\pi(s)$, which gives the value of applying this policy in every state $s$. So we want to find the optimal value function, denoted $v_*(s)$, under the optimal policy $\pi_*(s)$.

  • If you know all of S,A,R,P, then you can calculate the policy π(s) for all states s without interacting with the environment. This makes the situation that of a planning problem. The agent solves for an optimal policy using various approaches that we will see later, e.g., policy iteration and value iteration.

  • In true RL problems, the agent does not have all the 4 components above. Usually it is missing knowledge of $R$ and $P$, so interaction with the environment is needed. The agent may run a series of trials, interact with the environment, and thus learn $P$ and $R$ by seeing what transitions and rewards occur when it takes action $a$ in state $s$. Once it has obtained a distribution of $P$ and $R$, i.e., a model of the system, the agent can stop interacting with the environment and proceed to solve a planning problem. This is called model-based RL. Model-based RL allows the agent to use the model to determine the optimal value function $v_*(s)$ and policy $\pi_*(s)$ directly.

  • When the agent interacts with the environment and does not build a transition probability model, but only gets an idea of the Return G at each state s and action a under policy π, known as the "quality" function qπ(s,a), then we call it model-free RL. The nomenclature comes from the absence of a model of state transitions.

Recap (Intuitive)

  • Fully observable, model-based. Mental models might be a starting point.
  • Partially observable, model-free.
In [3]:
nb_setup.images_hconcat(["RL_images/Model_Free_Based.png"], width=600)
Out[3]:

Markov Property

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_t, S_{t-1}, \ldots]$$

  • A Markov process (MP) is a situation where there are no Rewards or Actions and the world just evolves. This is different from an MDP.

  • A Markov chain is a tuple $\langle S, P \rangle$. The chain can be represented by a transition probability matrix $P$ over a set of finite states. This matrix completely characterizes the model. Each row of the matrix sums to 1 (see the sketch after this list).

  • Tree representation (also called a "Rollout"): from the root you evolve downwards, but nodes may repeat the same state. So the same Markov chain can also be represented as a tree, which also shows time and gives a picture of the evolution of states over time.
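To make this concrete, here is a minimal sketch of a Markov chain $\langle S, P \rangle$ with a hypothetical 3-state transition matrix and one sampled rollout (numbers are illustrative only):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state Markov chain; each row of P sums to 1.
P = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
assert np.allclose(P.sum(axis=1), 1.0)

# Sample one rollout (path) of the chain starting in state 0.
state, path = 0, [0]
for _ in range(10):
    state = int(rng.choice(3, p=P[state]))
    path.append(state)
print(path)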

MDP

  • Includes Rewards and Actions. These affect the State. Difference between watching a movie and playing a video game.
  • Add $A_t$, $R_t$ and a discount factor $\gamma$.
  • Now the transition probability is represented as a tensor, so you can slice it by current state, next state, or action (see the sketch below the matrices).

$$P_{ss'}^a = P[S_{t+1}=s' \mid S_t=s, A_t=a]$$

  • Below is the MDP transition diagram. There are two types of transitions and arrows (red and blue). Also note the Rewards on the Actions. The transitions are random but known, so this is the model-based setting.
In [4]:
#MDPdiagram
nb_setup.images_hconcat(["RL_images/MDPdiagram.png"], width=600)
Out[4]:
In [6]:
nb_setup.images_hconcat(["RL_images/MDPmatrix.png"], width=600)
Out[6]:
In [7]:
nb_setup.images_hconcat(["RL_images/MDPmatrix2.png"], width=600)
Out[7]:
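To illustrate the slicing point above, here is a hypothetical transition tensor indexed as P[a, s, s'], sliced by action, by current state, and by next state (values are made up):

In [ ]:
import numpy as np

# Hypothetical transition tensor indexed as P[action, state, next_state].
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.3, 0.7],
               [0.6, 0.4]]])

P_a0   = P[0]          # slice by action: the 2x2 transition matrix under action 0
P_s0   = P[:, 0, :]    # slice by current state: next-state distribution for each action
P_to_1 = P[:, :, 1]    # slice by next state: probability of landing in state 1

assert np.allclose(P_a0.sum(axis=1), 1.0)   # each per-action slice is row-stochastic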

MDP Tree Representation

This representation mimics the Matrix Representation and is often called a "Rollout".

In [9]:
#MDPTree_Rollout
nb_setup.images_hconcat(["RL_images/MDPTree_Rollout.png"], width=600)
Out[9]:

Policies

  • A policy is a distribution over Actions given States (note that here $\pi(\cdot)$ is a probability).

$$\pi(a|s) = P[A_t = a \mid S_t = s]$$

  • Fully defines the behavior of the agent.

  • Given by $\pi(s)=a$ if deterministic; otherwise different Actions have probabilities attached. A deterministic policy gives rise to a transition probability matrix over states.

  • MDP policies depend on the current state (not history).

  • Policies are stationary (time-independent).

  • For a deterministic policy, note that the $\pi(a|s)$ matrix is all binary (0 or 1). Not so for a random (stochastic) policy, as in the sketch below.
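Here is a minimal sketch of the two kinds of policy matrices for a hypothetical 2-state, 2-action MDP: deterministic rows are one-hot, stochastic rows are general probability distributions, and a deterministic policy induces a transition matrix over states:

In [ ]:
import numpy as np

# pi[s, a] = probability of taking action a in state s (hypothetical values).
pi_det = np.array([[1.0, 0.0],      # deterministic: one-hot rows (0/1 entries)
                   [0.0, 1.0]])
pi_rand = np.array([[0.7, 0.3],     # stochastic: any probabilities summing to 1
                    [0.4, 0.6]])
for pi in (pi_det, pi_rand):
    assert np.allclose(pi.sum(axis=1), 1.0)

# A deterministic policy applied to P[a, s, s'] induces a transition matrix over states.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
P_pi = np.einsum('sa,asn->sn', pi_det, P)   # P^pi[s, s'] = sum_a pi(a|s) P[a, s, s']
print(P_pi)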

In [10]:
#Policy matrices (deterministic)
nb_setup.images_hconcat(["RL_images/Policy_deterministic.png"], width=600)
Out[10]:
In [11]:
#Policy matrix (random)
nb_setup.images_hconcat(["RL_images/Policy_random.png"], width=600)
Out[11]:

MDP Rewards

  • The Reward is defined by the original State, the Action, and the next State, and can be represented in a tensor.

$$r(s_1; s_0, a_1)$$

  • But we can also write the reward as a function of $s_0, a_1$ only, which can be represented in a matrix. This becomes an expectation over the preceding tensor (a small sketch follows the diagrams).

$$r(s_0, a_1) = E[r(s_1; s_0, a_1)]$$

  • As shown in the diagram below.
In [12]:
#MDP Tree Representation (Rewards)
nb_setup.images_hconcat(["RL_images/MDPrewards.png"], width=600)
Out[12]:
In [13]:
#MDP rewards matrix
nb_setup.images_hconcat(["RL_images/MDPrewards_matrix.png"], width=600)
Out[13]:
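The expectation above can be sketched numerically: assuming hypothetical arrays P[a, s, s'] and r(s'; s, a), the reward matrix r(s, a) is the probability-weighted average over next states (values are illustrative only):

In [ ]:
import numpy as np

# Hypothetical arrays indexed as [action, state, next_state].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r_full = np.array([[[ 1.0, 0.0], [ 0.0, 2.0]],      # r(s'; s, a)
                   [[-1.0, 3.0], [ 1.0, 0.0]]])

# r(s, a) = sum over s' of P(s'|s,a) * r(s'; s, a), stored as R[s, a].
R = np.einsum('asn,asn->sa', P, r_full)
print(R)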

MDP Return

  • The Return $G_t$ is the total discounted reward from time $t$ out into the future. $G_t$ differs depending on the path taken; it is a random variable (a numeric sketch follows this list).

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • The discount is used to time-normalize rewards (as in finance), observed behavior shows discounting, and it is a mathematical convenience for infinite-horizon problems. If $\gamma = 1$, then make sure sequences terminate.

  • Run several paths; the mean return from each state is the value (function) of that state.
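A minimal numeric sketch of the discounted return for one sampled (made-up) reward sequence:

In [ ]:
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]    # R_{t+1}, R_{t+2}, ... along one sampled path

# G_t = sum_k gamma^k * R_{t+k+1}
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)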

In [15]:
#Example MDP Returns
nb_setup.images_hconcat(["RL_images/MDPreturns.png"], width=800)
Out[15]:

State Value Function

  • The state-value function vπ(s) of an MDP is the expected return starting from state s and then following policy π.

$$v_\pi(s) = E_\pi[G_t \mid S_t = s]$$

In [16]:
#State Value Function Fire Policy
nb_setup.images_hconcat(["RL_images/StateValueFunction.png"], width=600)
Out[16]:

Action Value Function

  • The action-value function $q_\pi(s,a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$.

$$q_\pi(s,a) = E_\pi[G_t \mid S_t = s, A_t = a]$$

Bellman Expectation Equation (BEE)

  • Gives the value function for a given policy.

  • Given a policy π compute the value functions vπ(s) and qπ(s,a).

  • Bellman Expectation Equations:

$$v_\pi(s) = E_\pi\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]$$

$$q_\pi(s,a) = E_\pi\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]$$

  • Implementation (known as a "Backup")

$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$$

In [20]:
nb_setup.images_hconcat(["RL_images/BEE_SVF.png"], width=300)
Out[20]:

Function qπ(s,a)

$$q_\pi(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_\pi(s')$$

In [21]:
nb_setup.images_hconcat(["RL_images/BEE_AVF.png"], width=300)
Out[21]:

BEE for State Value Function vπ(s)

  • Putting these together, we get the Bellman Expectation Equation for the value functions; a one-step backup for $v_\pi$ is sketched in code below.

$$v_\pi(s) = \sum_a \pi(a|s) \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a v_\pi(s') \right)$$

  • Note that π(a|s) is a probability.
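Here is a minimal sketch of this backup, repeated until convergence (i.e., iterative policy evaluation), assuming the hypothetical P[a, s, s'], R[s, a], and pi[s, a] arrays used in the earlier sketches:

In [ ]:
import numpy as np

gamma = 0.9
P  = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s']  (hypothetical)
               [[0.3, 0.7], [0.6, 0.4]]])
R  = np.array([[1.0, 0.0], [-1.0, 2.0]])    # R[s, a]      (hypothetical)
pi = np.array([[0.7, 0.3], [0.4, 0.6]])     # pi[s, a]     (hypothetical)

v = np.zeros(2)
for _ in range(200):                                   # repeat the backup to convergence
    q = R + gamma * np.einsum('asn,n->sa', P, v)       # q_pi(s,a) = R_s^a + gamma sum_s' P v(s')
    v = (pi * q).sum(axis=1)                           # v_pi(s)   = sum_a pi(a|s) q_pi(s,a)
print(v)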
In [28]:
nb_setup.images_hconcat(["RL_images/BEE.png"], width=500)
Out[28]:

BEE for the Action Value Function qπ(s,a)

$$q_\pi(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \sum_{a'} \pi(a'|s')\, q_\pi(s',a')$$

In [29]:
nb_setup.images_hconcat(["RL_images/BEEq.png"], width=500)
Out[29]:

Example

In [30]:
#Example BEE for Fire Policy
#Solve 3 equations in 3 unknowns
nb_setup.images_hconcat(["RL_images/BEE_SVF_Example.png"], width=500)
Out[30]:

Easy to get the qπ(s,a) values once we know the vπ(s) values.

$$q_\pi(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_\pi(s')$$

In [31]:
nb_setup.images_hconcat(["RL_images/BEE_AVF_Example.png"], width=600)
Out[31]:

BEE in Matrix Form

  • The equations above are linear and can be solved using Linear Algebra.

  • We can stack up the equations as follows:

$$v_\pi = R^\pi + \gamma P^\pi v_\pi$$

  • The solution follows easily through matrix inversion, as in the sketch below.

$$v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$$
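Here is a minimal sketch of this closed-form solution for a hypothetical 2-state chain under a fixed policy (values are illustrative); solving the linear system is preferable to inverting the matrix explicitly:

In [ ]:
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],     # P^pi[s, s'] = sum_a pi(a|s) P[a, s, s']  (hypothetical)
                 [0.3, 0.7]])
R_pi = np.array([1.0, -0.5])     # R^pi[s]    = sum_a pi(a|s) R[s, a]       (hypothetical)

# Solve (I - gamma P^pi) v = R^pi; equivalent to the matrix inverse, but more stable.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(v_pi)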

Bellman Optimality Equation (BOE)

  • The optimal state-value function $v_*(s)$ is the maximum value over all policies:

$$v_*(s) = \max_\pi v_\pi(s)$$

  • Still does not tell us what policy to follow.

Optimal Action Value Function

$$q_*(s,a) = \max_\pi q_\pi(s,a)$$

Can we get the optimal policy from $q_*(s,a)$? Yes, because it tells us the best Action in each state, which then defines our optimal policy.
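A minimal sketch of reading the optimal policy off a hypothetical $q_*(s,a)$ table by taking the argmax action in each state:

In [ ]:
import numpy as np

# Hypothetical optimal action-value table q_star[s, a].
q_star = np.array([[4.3, 7.1],
                   [6.0, 2.5],
                   [8.2, 8.0]])

pi_star = q_star.argmax(axis=1)    # best action in each state = optimal (deterministic) policy
v_star  = q_star.max(axis=1)       # v_*(s) = max_a q_*(s, a)
print(pi_star, v_star)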

In [32]:
#Example of optimal policy
nb_setup.images_hconcat(["RL_images/OptimalPolicy.png"], width=600)
Out[32]:

Optimal Policy

  • $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s), \; \forall s$.

  • Theorem (from the slides): for any MDP there exists an optimal policy $\pi_*$ that is better than or equal to all other policies. This is an existence theorem.

  • $\pi_*(a|s) = 1$ if $a = \arg\max_{a \in A} q_*(s,a)$, and zero otherwise.

BOE for $v_*$

$$v_*(s) = \max_a q_*(s,a)$$

and recall that $q_*(s,a) = \max_\pi q_\pi(s,a)$. Also,

$$q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s')$$

and of course then

$$v_*(s) = \max_a \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') \right)$$

In [36]:
#Bellman Optimality Equation for v_*
nb_setup.images_hconcat(["RL_images/BOE_SVF.png"], width=600)
Out[36]:

BOE for $q_*$

$$q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s')$$

In [35]:
nb_setup.images_hconcat(["RL_images/BOE_AVF.png"], width=500)
Out[35]:

BOE Final Expressions

This is then combined with the equations above to give the final BOE expressions for both $v_*(s)$ and $q_*(s,a)$.

In [38]:
nb_setup.images_hconcat(["RL_images/BOE_SVF2.png"], width=600)
Out[38]:
In [39]:
nb_setup.images_hconcat(["RL_images/BOE_AVF2.png"], width=600)
Out[39]:
In [40]:
#Example Optimal Value Function
nb_setup.images_hconcat(["RL_images/Example_optimalSVF.png"], width=600)
Out[40]:
In [41]:
#Example Optimal Action-Value Function
nb_setup.images_hconcat(["RL_images/Example_optimalAVF.png"], width=600)
Out[41]:
In [45]:
#Summary, BEE, BOE
nb_setup.images_hconcat(["RL_images/Summary_BEE_BOE.png","RL_images/Summary_OptimalPolicy.png"], width=1000)
Out[45]:

Solving the BOE

  1. Nonlinear
  2. No closed form
  3. Iterative solution methods
    • Value iteration (a minimal sketch follows this list)
    • Policy iteration
    • Q learning
    • SARSA
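For example, here is a minimal value iteration sketch that repeatedly applies the Bellman optimality backup to the hypothetical P[a, s, s'] and R[s, a] arrays used in the earlier sketches (an illustration, not the course's own implementation):

In [ ]:
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[a, s, s']  (hypothetical)
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [-1.0, 2.0]])     # R[s, a]      (hypothetical)

v = np.zeros(2)
while True:
    q = R + gamma * np.einsum('asn,n->sa', P, v)   # Bellman optimality backup
    v_new = q.max(axis=1)                          # v(s) <- max_a q(s, a)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
pi_star = q.argmax(axis=1)                         # greedy policy from the converged q
print(v_new, pi_star)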

Readings and References