from ipypublish import nb_setup
There are 4 elements of an MDP: a set of States $S$, a set of Actions $A$, Rewards $R$, and state Transition Probabilities $P$.
Call this the ${\bf S,A,R,P}$ model.
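As a concrete sketch (the array shapes and numbers below are illustrative, not taken from the figures), a small finite MDP can be stored as a transition probability tensor $P$ and an expected reward matrix $R$:

import numpy as np

# Toy MDP with 3 states and 2 actions (illustrative numbers only).
n_states, n_actions = 3, 2

# P[s, a, s'] = probability of moving to s' when taking action a in state s.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.5, 0.5], [0.7, 0.3, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [0.0, 0.0],
])

assert np.allclose(P.sum(axis=2), 1.0)    # each (s, a) slice is a probability distribution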
Goal: Find a policy $\pi$ that maximizes future expected (possibly discounted) reward. The stream of reward along one path of the model from time $t$ to the end is called the Return, denoted $G_t$.
Result: A Value Function $v_{\pi}(s)$ for the policy $\pi(s)$, which gives the value of applying this policy starting from each state $s$. We want to find the optimal value function, denoted $v_*(s)$, under the optimal policy $\pi_*(s)$.
If you know all of $S,A,R,P$, then you can calculate the policy $\pi(s)$ for all states $s$ without interacting with the environment. This makes the situation that of a planning problem. The agent solves for an optimal policy using various approaches that we will see later, e.g., policy iteration and value iteration.
In true RL problems, the agent does not have all 4 components above. Usually it is missing knowledge of $R$ and $P$, so interaction with the environment is needed. The agent may run a series of trials, interacting with the environment, and thus learn $P$ and $R$ by seeing what transitions and rewards occur when it takes action $a$ in state $s$. Once it has obtained a distribution for $P$ and $R$, i.e., a model of the system, the agent can stop interacting with the environment and proceed to solve a planning problem. This is called model-based RL. Model-based RL allows the agent to use the model to determine the optimal value function $v_*(s)$ and policy $\pi_*(s)$ directly.
When the agent interacts with the environment and does not build a transition probability model, but only gets an idea of the Return $G$ at each state $s$ and action $a$ under policy $\pi$, known as the "quality" function $q_{\pi}(s,a)$, then we call it model-free RL. The nomenclature comes from the absence of a model of state transitions.
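A minimal sketch of the model-based idea, under the assumption that we have logged sample transitions $(s, a, r, s')$ from interacting with the environment: estimate $P$ by normalized counts and $R$ by the average observed reward (the function name estimate_model is ours, not from a library).

import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate P[s, a, s'] and R[s, a] from a list of (s, a, r, s') samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)                        # times (s, a) was tried
    P_hat = counts / np.maximum(visits, 1)[:, :, None] # unvisited (s, a) pairs stay all-zero
    R_hat = reward_sum / np.maximum(visits, 1)
    return P_hat, R_hat

# Example: two logged transitions in a 3-state, 2-action MDP (made-up data).
samples = [(0, 1, 1.0, 2), (0, 1, 0.0, 1)]
P_hat, R_hat = estimate_model(samples, n_states=3, n_actions=2)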
Recap (Intuitive)
nb_setup.images_hconcat(["RL_images/Model_Free_Based.png"], width=600)
A state is Markov if the future depends only on the present state, not on the full history:
$$ P[S_{t+1} | S_t] = P[S_{t+1} | S_t, S_{t-1}, \ldots] $$
A Markov process (MP) is a situation where there are no Rewards or Actions and the world just evolves. This is different from an MDP.
A Markov chain is a tuple $\langle S,P \rangle$. The chain can be represented by a transition probability matrix $P$ over a finite set of states. This matrix completely characterizes the model. Each row of the matrix sums to 1.
Tree representation (also called a "Rollout"): the same Markov chain can also be represented as a tree. From the root you evolve downwards, and the same states reappear as nodes at different depths. This representation also shows time, giving a picture of the evolution of states over time.
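A minimal sketch of sampling one such rollout, given only a transition probability matrix (the matrix below is illustrative):

import numpy as np

def rollout(P, start_state, n_steps, seed=0):
    """Sample one path of a Markov chain with transition matrix P (rows sum to 1)."""
    rng = np.random.default_rng(seed)
    path = [start_state]
    for _ in range(n_steps):
        path.append(rng.choice(len(P), p=P[path[-1]]))
    return path

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
assert np.allclose(P.sum(axis=1), 1.0)     # each row is a probability distribution
print(rollout(P, start_state=0, n_steps=5))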
$$ P_{ss'}^a = P[S_{t+1}=s' | S_t=s, A_t=a] $$
#MDPdiagram
nb_setup.images_hconcat(["RL_images/MDPdiagram.png"], width=600)
nb_setup.images_hconcat(["RL_images/MDPmatrix.png"], width=600)
nb_setup.images_hconcat(["RL_images/MDPmatrix2.png"], width=600)
This tree representation conveys the same information as the matrix representation and is often called a "Rollout".
#MDPTree_Rollout
nb_setup.images_hconcat(["RL_images/MDPTree_Rollout.png"], width=600)
$$ \pi(a|s) = P[A_t=a | S_t = s] $$
The policy fully defines the behavior of the agent.
If deterministic, the policy is given by $\pi(s) = a$; otherwise it assigns probabilities to several possible Actions. A deterministic policy, combined with the transition probabilities $P$, gives a transition probability matrix over states.
MDP policies depend on the current state (not history).
Policies are stationary (time-independent).
For a deterministic policy, note that the $\pi(s)$ matrix is binary (every entry is 0 or 1). Not so for a random (stochastic) policy.
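A small illustrative sketch of the two kinds of policy matrix, with an action sampled from the row for the current state (toy numbers):

import numpy as np

# pi[s, a] = probability of taking action a in state s (3 states, 2 actions).
pi_deterministic = np.array([[1.0, 0.0],     # each row is one-hot: entries are 0 or 1
                             [0.0, 1.0],
                             [1.0, 0.0]])
pi_random = np.array([[0.7, 0.3],            # each row is a probability distribution
                      [0.4, 0.6],
                      [0.5, 0.5]])

rng = np.random.default_rng(0)
s = 1
a = rng.choice(2, p=pi_random[s])            # sample an action from the stochastic policy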
#Policy matrices (deterministic)
nb_setup.images_hconcat(["RL_images/Policy_deterministic.png"], width=600)
#Policy matrix (random)
nb_setup.images_hconcat(["RL_images/Policy_random.png"], width=600)
$$ r(s_1; s_0,a_1) $$
$$ r(s_0, a_1) = E[r(s_1; s_0,a_1)] $$
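A sketch of this expectation in code, assuming the reward is attached to the realized transition and stored as an array r_next[s, a, s'] (all numbers illustrative):

import numpy as np

# r_next[s, a, s'] = reward received when action a in state s lands in s'.
r_next = np.zeros((3, 2, 3))
r_next[0, 1, 2] = 10.0                    # illustrative: a bonus for reaching state 2

# P[s, a, s'] from the MDP model; here a toy tensor with uniform transitions.
P = np.full((3, 2, 3), 1.0 / 3.0)

# R[s, a] = E[ r(S_{t+1}; s, a) ] = sum_{s'} P[s, a, s'] * r(s'; s, a)
R = (P * r_next).sum(axis=2)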
#MDP Tree Representation (Rewards)
nb_setup.images_hconcat(["RL_images/MDPrewards.png"], width=600)
#MDP rewards matrix
nb_setup.images_hconcat(["RL_images/MDPrewards_matrix.png"], width=600)
$$ G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
The discount factor is used to time-normalize rewards (as in finance); observed behavior exhibits discounting; and it is a mathematical convenience that keeps the sum finite in infinite-horizon problems. If $\gamma=1$, then make sure the reward sequences terminate.
Run several sample paths; the mean Return observed in each state is the value (function) of that state.
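A sketch of both ideas: compute the discounted Return $G_t$ for one sampled reward sequence, then estimate a state's value by averaging the Returns of several sampled paths (the reward sequences below are made up):

import numpy as np

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for one sampled path."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Monte Carlo estimate of a state's value: average the Returns of many
# sampled paths that start in that state.
paths = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 0.0, 0.0]]   # illustrative reward sequences
gamma = 0.9
v_estimate = np.mean([discounted_return(p, gamma) for p in paths])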
#Example MDP Returns
nb_setup.images_hconcat(["RL_images/MDPreturns.png"], width=800)
$$ v_{\pi}(s) = E_{\pi}[G_t | S_t=s ] $$
#State Value Function Fire Policy
nb_setup.images_hconcat(["RL_images/StateValueFunction.png"], width=600)
$$ q_{\pi}(s,a) = E_{\pi}[G_t | S_t=s, A_t=a ] $$
Both $v_{\pi}$ and $q_{\pi}$ give value functions for a given, fixed policy $\pi$.
Given a policy $\pi$, compute the value functions $v_{\pi}(s)$ and $q_{\pi}(s,a)$.
Bellman Expectation Equations:
$$ v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t=s] $$
$$ q_{\pi}(s,a) = E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1},A_{t+1}) | S_t=s, A_t=a] $$
$$ v_{\pi}(s) = \sum_a \pi(a|s) q_{\pi}(s,a) $$
nb_setup.images_hconcat(["RL_images/BEE_SVF.png"], width=300)
$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') $$
nb_setup.images_hconcat(["RL_images/BEE_AVF.png"], width=300)
$$ v_{\pi}(s) = \sum_a \pi(a|s) \left(R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') \right) $$
nb_setup.images_hconcat(["RL_images/BEE.png"], width=500)
$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \sum_{a'} \pi(a'|s') q_{\pi}(s',a') $$
nb_setup.images_hconcat(["RL_images/BEEq.png"], width=500)
#Example BEE for Fire Policy
#Solve 3 equations in 3 unknowns
nb_setup.images_hconcat(["RL_images/BEE_SVF_Example.png"], width=500)
Easy to get the $q_{\pi}(s,a)$ values once we know the $v_{\pi}(s)$ values.
$$ q_{\pi}(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') $$
nb_setup.images_hconcat(["RL_images/BEE_AVF_Example.png"], width=600)
The equations above are linear and can be solved using Linear Algebra.
We can stack up the equations as follows:
$$ v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi} $$
$$ v_{\pi} = (I-\gamma P^{\pi})^{-1} R^{\pi} $$
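A sketch of this direct solve (and of recovering $q_{\pi}$ from $v_{\pi}$ as above), assuming the same P[s, a, s'], R[s, a], pi[s, a] array layout as in the earlier snippets:

import numpy as np

def solve_v_pi(P, R, pi, gamma):
    """Solve v_pi = (I - gamma * P_pi)^{-1} R_pi exactly."""
    P_pi = np.einsum('sa,sat->st', pi, P)    # P^pi[s, s'] = sum_a pi(a|s) P[s, a, s']
    R_pi = (pi * R).sum(axis=1)              # R^pi[s]     = sum_a pi(a|s) R[s, a]
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def q_from_v(P, R, v, gamma):
    """q_pi(s,a) = R[s,a] + gamma * sum_s' P[s,a,s'] v_pi(s')."""
    return R + gamma * P @ v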
$$ v_*(s) = \max_{\pi} v_{\pi}(s) $$
$$ q_*(s,a) = \max_{\pi} q_{\pi}(s,a) $$
Can we get the optimal policy from $q_*(s,a)$? Yes, because it tells us the best Action in each state, and choosing that Action in each state is the optimal policy.
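A sketch of reading the optimal policy off $q_*$; the q_star table below is a placeholder, not values from the example figure:

import numpy as np

q_star = np.array([[1.0, 3.0],      # placeholder optimal action-values, q_*[s, a]
                   [2.0, 0.5],
                   [0.0, 0.0]])

greedy_actions = q_star.argmax(axis=1)              # best action in each state
pi_star = np.eye(q_star.shape[1])[greedy_actions]   # one-hot (deterministic) policy matrix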
#Example of optimal policy
nb_setup.images_hconcat(["RL_images/OptimalPolicy.png"], width=600)
Define a partial ordering over policies: $\pi \geq \pi'$ if $v_{\pi}(s) \geq v_{\pi'}(s), \forall s$.
Theorem (existence of an optimal policy, from the slides): for any MDP there exists an optimal policy $\pi_*$ with $\pi_* \geq \pi$ for all policies $\pi$, and every optimal policy achieves the optimal value functions, $v_{\pi_*}(s) = v_*(s)$ and $q_{\pi_*}(s,a) = q_*(s,a)$.
An optimal deterministic policy can be read off $q_*$: $\pi_*(a|s) = 1$ if $a = \mbox{argmax}_{a'} q_*(s,a')$, and zero otherwise.
$$ v_*(s) = \max_a q_*(s,a) $$
and recall that $q_*(s,a) = \max_{\pi} q_{\pi}(s,a)$.
$$ q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') $$
and of course then
$$ v_*(s) = \max_a \left(R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') \right) $$
#Bellman Optimality Equation for v_*
nb_setup.images_hconcat(["RL_images/BOE_SVF.png"], width=600)
$$ q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s') $$
nb_setup.images_hconcat(["RL_images/BOE_AVF.png"], width=500)
These are then combined with the equations above to give the final BOE expressions for both $v_*(s)$ and $q_*(s,a)$.
nb_setup.images_hconcat(["RL_images/BOE_SVF2.png"], width=600)
nb_setup.images_hconcat(["RL_images/BOE_AVF2.png"], width=600)
#Example Optimal Value Function
nb_setup.images_hconcat(["RL_images/Example_optimalSVF.png"], width=600)
#Example Optimal Action-Value Function
nb_setup.images_hconcat(["RL_images/Example_optimalAVF.png"], width=600)
#Summary, BEE, BOE
nb_setup.images_hconcat(["RL_images/Summary_BEE_BOE.png","RL_images/Summary_OptimalPolicy.png"], width=1000)