%pylab inline
from ipypublish import nb_setup
MC learning, SARSA, and Q-learning are examples of value function approximation, where we fit the weights of functions approximating the state value function $V(S)$ or the state-action value function $Q(S,A)$. The policy function $A=\pi(S)$ is implicit: at each state it picks the action that maximizes the state-action value functions $Q_k(S,A_k,W_k)$ over the discrete actions $A_k, k=1...K$, where $W_k$ are the weights of the $K$ distinct state-action value functions.
But it may be easier to fit the policy function directly with a functional approximation than to fit the value function. We input the state variables $S$ into this function, with weights $W_k$, to obtain action probabilities
$$ \pi(A_k,S,W_k) = P(A_k | S,W_k), \quad \forall k=1...K $$
These action probabilities form a distribution over actions and must sum to 1.
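As an illustrative sketch (not part of the original notes), a linear-softmax policy mapping a state vector to a distribution over $K$ actions might look like this; the names and shapes here are hypothetical.

import numpy as np

def softmax_policy(state, W, b):
    # Returns pi(A_k | S, W): one probability per action, summing to 1
    logits = W @ state + b                 # one score per action
    logits = logits - logits.max()         # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Example: a 4-dimensional state and K = 3 actions (shapes are illustrative)
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
probs = softmax_policy(rng.normal(size=4), W, b)
print(probs, probs.sum())                  # the probabilities sum to 1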
Policy-based RL has advantages: it handles high-dimensional and continuous action spaces naturally, it can learn stochastic policies, and it often has better convergence properties. But it can converge to a local rather than global optimum, and its gradient estimates tend to have high variance.
The algorithm used is called "Reinforce".
Actor-Critic algorithms are a combination of value function and policy function approaches.
nb_setup.images_vconcat(["RL_images/Value_vs_Policy_RL.png"], width=600)
The game of Pong is a simple example where the policy (action) is easier to model directly. The input to the game is an $80 \times 80$ pixel frame, which is then fed into a NN to get a binary action, i.e., move up or down. Here is an example of a NN with a single hidden layer of 5 nodes and an output layer of 1 node. The number of weights (including the bias/intercept) in the hidden layer will be $5 \times (80 \times 80 + 1) = 32005$ and the number of weights in the output layer will be $1 \times (5 + 1) = 6$, for a total of 32,011.
nb_setup.images_vconcat(["RL_images/Pong.png"], width=600)
Fitting this network needs labels $t(j)$, i.e., knowing whether the binary action taken (up or down) was correct (1) or not (0). The network generates a probability $y(j) \in (0,1)$ of an up action, where $j$ indexes the action in a sequence of actions. So if we had labels, we could go ahead and compute the Log Loss, the special case of Cross Entropy for binary outcomes, as follows:
$$ L(W) = \frac{1}{M}\sum_{j=1}^M [t(j) \log y(j) + (1-t(j)) \log(1-y(j))] $$
We recap Log Loss below. Note that it decreases as the predicted probability of the true outcome approaches 1, which is why we normally minimize log loss. However, in the expression above we have left off the minus sign, so here we want to maximize $L(W)$.
def LogLoss(true_label, predicted):
    # Negative log-likelihood of the predicted probability for a binary label
    # (log, linspace, plot, etc. come from %pylab inline)
    p = predicted
    if true_label == 1:
        return -log(p)
    else:
        return -log(1 - p)

p = linspace(0.01, 1, 50)
plot(p, LogLoss(1, p))
xlabel('Prob of true outcome'); ylabel('Log Loss')
grid()
Back to Pong. Note that we do not have labels, so we need to play the game to generate data. Use whatever the current policy is. At the end of each episode of the game, we know if we won or lost. Label each action in the episode with 1 for a win and 0 for a loss.
nb_setup.images_hconcat(["RL_images/Pong_labels1.png","RL_images/Pong_labels2.png"], width=1000)
Andrej Karpathy has a GitHub repository with the Pong code in a little over 100 lines. He has also written a terrific blog post on RL and Pong.
nb_setup.images_vconcat(["RL_images/Pong_py.png"], width=600)
If the outcome is not binary (1/0), then we instead weight the objective function by the return (sum of rewards) $G = \sum_{j=1}^T R_j$, i.e., $J(W) = G \cdot L(W)$. The gradients then become
$$ \frac{\partial J}{\partial W} = G \cdot \frac{\partial L}{\partial W} $$
After each episode, we update the weights as follows:
$$ w \leftarrow w + \eta \cdot \frac{\partial J}{\partial W} = w + \eta G \cdot \frac{\partial L}{\partial W} $$
These are the weights of the policy function, which in the case of Pong should emit a binary choice (up or down). So the output of this function needs to be chosen carefully; for Pong, a sigmoid output satisfies the requirement of producing a probability for the binary action.
The update equation has a plus sign because we are doing a maximization of the objective function.
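A minimal sketch of this return-weighted gradient-ascent update (variable names are hypothetical; grad_L stands for $\partial L/\partial W$ computed for one episode):

def reinforce_update(weights, grad_L, G, eta=1e-3):
    # Gradient ascent on J(W) = G * L(W):  w <- w + eta * G * dL/dw
    # weights and grad_L are dicts of arrays (e.g. {'W1': ..., 'W2': ...}); G is the episode return
    return {k: weights[k] + eta * G * grad_L[k] for k in weights}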
nb_setup.images_vconcat(["RL_images/MC_Policy_Gradients.png"], width=600)
Note that no epsilon-greedy exploration is needed: the network outputs action probabilities from which we sample, so the policy itself generates the randomness.
The policy $\pi$ is a function of the state and this function may be approximated by a NN.
Fix the policy and then generate episodes
$$ (s_1,a_1,r_1), (s_2,a_2,r_2),...,(s_M,a_M,r_M) $$
The total expected reward is
$$ J(W) = E_{\pi}[\sum_{t=1}^M r(S_t,A_t)] = E_{\pi}[R(\tau_e)] $$
where $\tau_e$ is the notation for a single episode.
This expectation over episodes can be written in detail as the probability-weighted sum across episodes
$$ J(W) = \sum_{e=1}^{\cal E} P^{\pi}(\tau_e,w) R(\tau_e) $$
where $P^{\pi}(\tau_e,w)$ is the probability of generating episode $\tau_e$ under policy $\pi$. Because each sampled episode is independent, the expectation can be approximated by the simple mean over all sampled episodes:
$$ J(W) = \sum_{e=1}^{\cal E} P^{\pi}(\tau_e,w) R(\tau_e) \approx \frac{1}{\cal E} \sum_{e=1}^{\cal E} R(\tau_e) $$
So,
\begin{eqnarray*} \frac{\partial J(W)}{\partial w} &=& \sum_{e=1}^{\cal E} \frac{\partial P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e) \\ &=& \sum_{e=1}^{\cal E} P^{\pi}(\tau_e,w) \cdot \frac{1}{P^{\pi}(\tau_e,w)} \cdot \frac{\partial P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e)\\ &=& \sum_{e=1}^{\cal E} P^{\pi}(\tau_e,w) \cdot \frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e)\\ &=& E \left[\frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e) \right]\\ &\approx& \frac{1}{\cal E} \sum_{e=1}^{\cal E} \frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e) \end{eqnarray*}
Our problem reduces to computing the log probability gradient using sample data from generated episodes.
$$ \frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} $$
The probability of an episode depends on the state transition probabilities and the policy's action probabilities along that episode, i.e.,
$$ P^{\pi}(\tau_e,w) = \prod_{i=1}^T P(s_{i+1}|s_i,a_i) \cdot \pi_W(a_i|s_i) $$
Log probability is
$$ \log P^{\pi}(\tau_e,w) = \sum_{i=1}^T \log P(s_{i+1}|s_i,a_i) + \sum_{i=1}^T \log \pi_W(a_i|s_i) $$
The first part of this equation is the "model" as it contains the transition probabilities and the second part contains the policy function. Since we are using a functional approximation for the policy, the reward gradient only comes from the second part of the equation above, i.e., the gradient is "model-free".
$$ \frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} = \sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} $$
We can see that the Log Probability gradient is the sum over Log Policy gradients.
The gradient is evaluated by sampling without knowledge of the model itself.
\begin{eqnarray*} \frac{\partial J(W)}{\partial w} &=& E \left[\frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e) \right] \\ &=& E \left[\sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} \cdot R(\tau_e) \right]\\ &\approx& \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} \cdot R(\tau_e) \right] \end{eqnarray*}
Define $L(s_i,a_i) = L_i \equiv \log \pi_W(a_i|s_i)$. Then
\begin{eqnarray*} \frac{\partial J(W)}{\partial w} &=& \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot R(\tau_e) \right] \\ &=& \frac{1}{\cal E} \sum_{e=1}^{\cal E}G_e \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \right] \end{eqnarray*}
where $G_e \equiv R(\tau_e)$ is the return of episode $e$.
We know that $L(W) = \sum_{k=1}^K t_k(j) \log \pi_k(j)$. Note that $K$ is the number of actions.
nb_setup.images_vconcat(["RL_images/Reinforce.png"], width=600)
We need to estimate $\frac{\partial L_i}{\partial w}$. How? Use the $L(W)$ function above with the episode data and an automatic differentiation engine, such as TensorFlow.
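A minimal sketch of doing this with TensorFlow's GradientTape (this is not the code in the slide image; the Keras model `policy_model` with $K$ output logits and all variable names are assumptions):

import tensorflow as tf

def policy_gradient(policy_model, states, actions, returns):
    # policy_model: a Keras model mapping states (N, state_dim) to K action logits
    # actions: (N,) integer actions taken; returns: (N,) returns used to weight the log-policies
    actions = tf.cast(actions, tf.int32)
    returns = tf.cast(returns, tf.float32)
    with tf.GradientTape() as tape:
        logp = tf.nn.log_softmax(policy_model(states))              # log pi_W(a|s) for all actions
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        logp_a = tf.gather_nd(logp, idx)                            # log pi_W(a_i|s_i)
        loss = -tf.reduce_mean(logp_a * returns)                    # minimize -J(W)
    return tape.gradient(loss, policy_model.trainable_variables)

# A typical update step with a Keras optimizer:
# opt = tf.keras.optimizers.Adam(1e-3)
# opt.apply_gradients(zip(policy_gradient(model, S, A, G), model.trainable_variables))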
Issues
nb_setup.images_vconcat(["RL_images/PolicyGradient_tf.png"], width=600)
nb_setup.images_vconcat(["RL_images/Discounting.png"], width=600)
The idea here is that an action taken at time $i$ can only influence rewards received at time $i$ and later. We therefore amend the gradient from
$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot R(\tau_e) \right] $$
to
$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot \left( \sum_{j=i}^T R_j \right) \right] $$
equivalent to
$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot G_i \right] $$
where recall that $L_i = \log \pi_i$.
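A minimal sketch (hypothetical names) of computing the reward-to-go $G_i = \sum_{j=i}^T R_j$ for each step of an episode:

import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    # G_i = sum_{j >= i} gamma^(j-i) * R_j; gamma = 1 gives the plain sum used above
    G = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        G[i] = running
    return G

print(rewards_to_go([0, 0, 1, 0, -1]))   # [ 0.  0.  0. -1. -1.]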
Further revise the previous equation to
$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot (G_i-b) \right] $$
where
$$ b = \frac{1}{T{\cal E}} \sum_{i=1}^{T{\cal E}} G_i $$
So the gradient is based on the mean-differenced reward. There is a proof, shown in the slides, that subtracting such a baseline does not bias the gradient.
The baseline may also be a function of state, i.e., $b(S_i)$.
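A minimal sketch (hypothetical names) of the constant-baseline version: subtract the mean reward-to-go, taken over all steps of all episodes, before weighting the log-policy gradients.

import numpy as np

def mean_baseline_advantages(all_G):
    # all_G: list of reward-to-go arrays, one per episode.
    # b is the average G_i over every step of every episode; return the centered G_i - b
    b = np.concatenate(all_G).mean()
    return [G - b for G in all_G]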
Because the gradient expression contains the total reward, we need a terminating (episodic) sequence, which is not always available. If the sequence has a finite horizon $T$, then the following expression works.
$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} \cdot (R_i(s_i,a_i)+R_{i+1}(s_{i+1},a_{i+1})+...+R_T(s_T,a_T)) \right] $$
So the variance of the sum $R_i(s_i,a_i)+R_{i+1}(s_{i+1},a_{i+1})+...+R_T(s_T,a_T)$ matters and can be high. An average has lower variance than the sum, so we may exploit this idea by replacing the total reward with the $Q$ function, which is an expectation (average) of this sum.
$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} \cdot q_{\pi}(s_i,a_i) \right] $$
Below we see graphical intuition for this averaging, and for the baseline under Actor-Critic (A-C).
nb_setup.images_hconcat(["RL_images/Q_averaging.png","RL_images/Q_baseline.png"], width=1000)
In the image above, we denote $A_{\pi}(s,a) = q_{\pi}(s,a) - V_{\pi}(s)$ as the Advantage Function.
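A minimal sketch (hypothetical names) of computing advantages for an actor-critic style update: here the reward-to-go $G_i$ serves as a sample estimate of $q_{\pi}(s_i,a_i)$, and a separate critic supplies the state values $V_{\pi}(s_i)$.

import numpy as np

def advantages(G, V):
    # G: reward-to-go per step (a sample estimate of q_pi(s_i, a_i))
    # V: the critic's value estimates V_pi(s_i) for the same states
    return np.asarray(G) - np.asarray(V)   # A_pi(s_i, a_i) = q - V, a state-dependent baseline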
nb_setup.images_vconcat(["RL_images/AdvantageFunction.png"], width=600)
nb_setup.images_vconcat(["RL_images/ActorCritic_System.png"], width=600)