Policy Gradient Methods

In [1]:
%pylab inline
from ipypublish import nb_setup
Populating the interactive namespace from numpy and matplotlib

The MC learning, SARSA, and Q-learning models are examples of value function approximation, where we fit the weights of functions approximating the state value function $V(S)$ or the state-action value function $Q(S,A)$. The policy function $A=\pi(S)$ is implicit: at each state it selects the action with the maximum value among the $Q$ functions $Q_k(S,A_k,W_k)$, where $A_k, k=1,\ldots,K$ are the discrete actions and $W_k$ are the weights of the $K$ distinct state-action value functions.

But it may be easier to fit the policy function directly, rather than the value function, using a functional approximation. We input the state variables $S$ to this function, which has weights $W_k$, to get the action probabilities

$$ \pi(A_k,S,W_k) = P(A_k | S,W_k), \quad \forall k=1...K $$

These action probabilities form a distribution over actions and must sum to 1.
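
As a concrete illustration, here is a minimal numpy sketch (not from the original notebook) of such a policy function: a single linear layer followed by a softmax maps a state vector to a probability distribution over $K$ actions. The names `policy_probs`, `W`, and `b`, and the sizes used, are purely illustrative.

import numpy as np

def policy_probs(s, W, b):
    """pi(A_k | S=s, W): a linear layer followed by a softmax over K actions."""
    z = W @ s + b                      # K scores, one per action
    z = z - z.max()                    # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()                 # non-negative and sums to 1

rng = np.random.default_rng(0)
K, S_dim = 3, 4                        # 3 actions, 4 state variables (illustrative)
W, b = rng.normal(size=(K, S_dim)), np.zeros(K)
print(policy_probs(rng.normal(size=S_dim), W, b).sum())   # 1.0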

Policy-based RL has advantages:

  • May have better convergence.
  • Works well in high-dimensional or continuous action spaces.
  • Can learn stochastic policies.

But it also has drawbacks:

  • May converge to a local optimum rather than the global one.
  • Policy evaluation may be inefficient.
  • High variance in outcomes.
  • Typically requires about 10x the number of episodes to converge compared to DQN.

The algorithm used is called "Reinforce".

Actor-Critic algorithms are a combination of value function and policy function approaches.

In [2]:
nb_setup.images_vconcat(["RL_images/Value_vs_Policy_RL.png"], width=600)
Out[2]:

Playing Pong

The game of Pong is a simple example where the policy (action) is easier to model directly. The input to the game is an $80 \times 80$ pixel frame, which is then fed into a NN to get a binary action, i.e., move up or down. Here is an example of a NN with a single hidden layer of 5 nodes and an output layer of 1 node. The number of weights (including the bias/intercept) in the hidden layer will be $5 \times (80 \times 80 + 1) = 32005$ and the number of weights in the output layer will be $1 \times (5 + 1) = 6$, for a total of 32,011.
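
A minimal numpy sketch of this parameter count and of the forward pass through such a network; the layer sizes follow the text, a ReLU hidden activation is assumed, and the sigmoid output gives the probability of moving up. All names are illustrative.

import numpy as np

n_in, n_hidden = 80 * 80, 5
W1 = np.zeros((n_hidden, n_in + 1))    # hidden weights incl. bias: 5 x 6401 = 32005
W2 = np.zeros((1, n_hidden + 1))       # output weights incl. bias: 1 x 6 = 6
print(W1.size + W2.size)               # 32011 parameters in total

def prob_up(frame):
    """Probability of the 'up' action for an 80x80 frame (ReLU hidden layer assumed)."""
    x = np.append(frame.ravel(), 1.0)             # flatten and append the bias input
    h = np.maximum(0.0, W1 @ x)                   # hidden layer activations
    z = W2 @ np.append(h, 1.0)                    # output layer pre-activation
    return 1.0 / (1.0 + np.exp(-z[0]))            # sigmoid => P(up)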

In [3]:
nb_setup.images_vconcat(["RL_images/Pong.png"], width=600)
Out[3]:

Fitting this network needs labels $t(j)$, i.e., knowing whether the binary action taken (up or down) at step $j$ of the action sequence was correct (1) or not (0). The network generates a probability $y(j) \in (0,1)$ of an up action. So if we had labels, we could go ahead and compute the Log Loss, the special case of Cross Entropy for binary outcomes, as follows:

$$ L(W) = \frac{1}{M}\sum_{j=1}^M [t(j) \log y(j) + (1-t(j))\log(1-y(j))] $$

We recap Log Loss below. Note that it decreases as the predicted probability of the true outcome approaches 1, which is why we would normally minimize Log Loss. However, in the expression above we have left off the minus sign, so in fact we want to maximize $L(W)$.

In [4]:
def LogLoss(true_label, predicted):
    # Binary cross entropy (Log Loss) for a single prediction
    p = predicted
    if true_label == 1:
        return -log(p)        # loss when the true outcome is 1
    else:
        return -log(1 - p)    # loss when the true outcome is 0

p = linspace(0.01, 1, 50)
plot(p, LogLoss(1, p))
xlabel('Prob of true outcome'); ylabel('Log Loss')
grid()

Back to Pong. We do not have labels, so we must play the game under the current policy to generate data. At the end of each episode we know whether we won or lost, so we label every action in that episode with 1 for a win and 0 for a loss, as sketched below.
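
A minimal sketch (not from the notebook) of this labeling step, assuming the sampled up/down actions of an episode and the final outcome have been recorded; the name `label_episode` is illustrative.

def label_episode(actions, won):
    """Give every action of an episode the same pseudo-label: 1 for a win, 0 for a loss."""
    t = 1 if won else 0
    return [(a, t) for a in actions]

# Example: a won episode of three sampled actions (1 = up, 0 = down)
print(label_episode([1, 0, 1], won=True))   # [(1, 1), (0, 1), (1, 1)]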

In [2]:
nb_setup.images_hconcat(["RL_images/Pong_labels1.png","RL_images/Pong_labels2.png"], width=1000)
Out[2]:

Andrej Karpathy has posted the Pong code on GitHub in a little over 100 lines. He has also written a terrific blog post on RL and Pong.

In [3]:
nb_setup.images_vconcat(["RL_images/Pong_py.png"], width=600)
Out[3]:

Reinforce: Monte Carlo Policy Gradients

If the outcome is not binary (1,0), then instead of 1/0 labels we weight the objective function by the return (sum of rewards) $G = \sum_{j=1}^T R_j$, i.e., $J(W) = G \cdot L(W)$. The gradients now become

$$ \frac{\partial J}{\partial W} = G \cdot \frac{\partial L}{\partial W} $$

After each episode, we update the weights as follows:

$$ w \leftarrow w + \eta \cdot \frac{\partial J}{\partial W} = w + \eta G \cdot \frac{\partial L}{\partial W} $$

These are the weights of the policy function, which in the case of Pong must emit a binary choice (up or down), so the output function needs to be chosen carefully. For Pong, a sigmoid output satisfies the requirement of producing a binary action probability.

The update equation has a plus sign because we are doing a maximization of the objective function.
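
A minimal numpy sketch (not Karpathy's code) of one such update for a logistic policy $\pi_w(\text{up} \mid s) = \sigma(w \cdot s)$, assuming the states and sampled actions of one episode and its return $G$ have been stored; all names and the learning rate are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_update(w, states, actions, G, eta=1e-3):
    """w <- w + eta * G * dL/dw for a logistic policy pi(up|s) = sigmoid(w . s)."""
    grad = np.zeros_like(w)
    for s, a in zip(states, actions):
        y = sigmoid(w @ s)            # predicted probability of 'up'
        grad += (a - y) * s           # d log pi(a|s) / dw, using the sampled action a
    return w + eta * G * grad         # plus sign: gradient ascent on J(W) = G * L(W)

The `(a - y) * s` term is the gradient of $t(j) \log y(j) + (1-t(j))\log(1-y(j))$ with the sampled action used as the label $t(j)$, so the update raises the probability of actions taken in high-return episodes and lowers it otherwise.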

In [6]:
nb_setup.images_vconcat(["RL_images/MC_Policy_Gradients.png"], width=600)
Out[6]:

Note that no epsilon-greedy exploration is needed, since sampling actions from the stochastic policy itself generates the randomness.

Reward Function for a Policy

The policy $\pi$ is a function of the state and this function may be approximated by a NN.

Fix the policy and then generate episodes

$$ (s_1,a_1,r_1), (s_2,a_2,r_2),\ldots,(s_T,a_T,r_T) $$

The total expected reward is

$$ J(W) = E_{\pi}\left[\sum_{t=1}^T r(S_t,A_t)\right] = E_{\pi}[R(\tau_e)] $$

where $\tau_e$ is the notation for a single episode.

This expectation over episodes can be written in detail as the probability-weighted sum across ${\cal E}$ episodes

$$ J(W) = \sum_{e=1}^{\cal E} P^{\pi}(\tau_e,w) R(\tau_e) \approx \frac{1}{\cal E} \sum_{e=1}^{\cal E} R(\tau_e) $$

where $P^{\pi}(\tau_e,w)$ is the probability of generating an episode $\tau_e$ under policy $\pi$. Because each episode is independent, the expectation is the simple mean over all episodes.

Reward Gradient

$$ J(W) = \sum_{e=1}^{\cal E} P^{\pi}(\tau_e,w) R(\tau_e) \approx \frac{1}{\cal E} \sum_{e=1}^{\cal E} R(\tau_e) $$

So,

\begin{eqnarray*} \frac{\partial J(W)}{\partial w} &=& \sum_{e=1}^{\cal E} \frac{\partial P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e) \\ &=& \sum_{e=1}^{\cal E} P^{\pi}(\tau_e,w) \cdot \frac{1}{P^{\pi}(\tau_e,w)} \cdot \frac{\partial P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e)\\ &=& \sum_{e=1}^{\cal E} P^{\pi}(\tau_e,w) \cdot \frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e)\\ &=& E \left[\frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e) \right]\\ &=& \frac{1}{\cal E} \sum_{e=1}^{\cal E} \frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e) \end{eqnarray*}
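
As a sanity check (not in the original notes), here is a minimal numpy experiment on a one-step, two-action problem where $P^{\pi}(\tau,w)$ reduces to $\pi_w(a)$ with $\pi_w(a=1)=\sigma(w)$: the sampled average of $\frac{\partial \log P^{\pi}(\tau,w)}{\partial w} \cdot R(\tau)$ matches the exact gradient of $J(w)$. The rewards and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(0)
w = 0.3
sig = 1.0 / (1.0 + np.exp(-w))            # pi_w(a=1); pi_w(a=0) = 1 - sig
# One-step episodes: reward 1 for action 1, reward 0 for action 0
exact = sig * (1.0 - sig)                 # d/dw of J(w) = sig*1 + (1-sig)*0

a = (rng.random(100_000) < sig).astype(float)    # sampled actions
grad_logp = a - sig                       # d log pi_w(a) / dw for this policy
rewards = np.where(a == 1.0, 1.0, 0.0)
estimate = (grad_logp * rewards).mean()

print(exact, estimate)                    # the Monte Carlo estimate is close to the exact gradient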

Log Probability Gradient

Our problem reduces to computing the log probability gradient using sample data from generated episodes.

$$ \frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} $$

The probability of an episode depends on the state transitions in that episode, i.e.,

$$ P^{\pi}(\tau_e,w) = \prod_{i=1}^T P(s_{i+1}|s_i,a_i) \cdot \pi_W(a_i|s_i) $$

Log probability is

$$ \log P^{\pi}(\tau_e,w) = \sum_{i=1}^T \log P(s_{i+1}|s_i,a_i) + \sum_{i=1}^T \log \pi_W(a_i|s_i) $$

The first part of this equation is the "model" as it contains the transition probabilities and the second part contains the policy function. Since we are using a functional approximation for the policy, the reward gradient only comes from the second part of the equation above, i.e., the gradient is "model-free".

$$ \frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} = \sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} $$

We can see that the Log Probability gradient is the sum over Log Policy gradients.

Log Policy Gradient

The gradient is evaluated by sampling without knowledge of the model itself.

\begin{eqnarray*} \frac{\partial J(W)}{\partial w} &=& E \left[\frac{\partial \log P^{\pi}(\tau_e,w)}{\partial w} \cdot R(\tau_e) \right] \\ &=& E \left[\sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} \cdot R(\tau_e) \right]\\ &=& \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} \cdot R(\tau_e) \right] \end{eqnarray*}

Define $L(s_i,a_i) = L_i \equiv \log \pi_W(a_i|s_i)$. Then

\begin{eqnarray*} \frac{\partial J(W)}{\partial w} &=& \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot R(\tau_e) \right] \\ &=& \frac{1}{\cal E} \sum_{e=1}^{\cal E}G_e \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \right] \end{eqnarray*}

We know that for a single step $j$, $L(W) = \sum_{k=1}^K t_k(j) \log \pi_k(j)$, the Cross Entropy, where $K$ is the number of actions and $t_k(j)$ is the one-hot indicator of the action taken.
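
A minimal numpy sketch of the bracketed per-episode term $G_e \sum_i \partial L_i/\partial w$ for a linear softmax policy $\pi_W(\cdot \mid s) = \mathrm{softmax}(Ws)$; the episode's states, sampled actions, and return $G_e$ are assumed given, and the names are illustrative.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def episode_gradient(W, states, actions, G_e):
    """G_e * sum_i d log pi_W(a_i|s_i) / dW for a linear softmax policy."""
    grad = np.zeros_like(W)
    for s, a in zip(states, actions):
        p = softmax(W @ s)                       # current action probabilities at s_i
        one_hot = np.zeros_like(p); one_hot[a] = 1.0
        grad += np.outer(one_hot - p, s)         # d log pi_W(a_i|s_i) / dW
    return G_e * grad

Averaging these per-episode gradients over the ${\cal E}$ sampled episodes gives the Reinforce update direction.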

In [7]:
nb_setup.images_vconcat(["RL_images/Reinforce.png"], width=600)
Out[7]:

Gradient Estimation

We need to estimate $\frac{\partial L_i}{\partial w}$. How? Use the $L(W)$ function above with the sampled data and an automatic differentiation engine such as TensorFlow.
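
For example, here is a minimal sketch (not the notebook's code) of how the return-weighted log-likelihood could be set up in TensorFlow 2 so that the gradient is computed automatically; the network shape, variable names, and learning rate are all illustrative assumptions.

import tensorflow as tf

state_dim, num_actions = 4, 2          # illustrative sizes
policy = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(num_actions, activation='softmax'),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def reinforce_step(states, actions, returns):
    """One policy-gradient step on a batch of (state, action, return) samples."""
    states = tf.convert_to_tensor(states, dtype=tf.float32)
    returns = tf.convert_to_tensor(returns, dtype=tf.float32)
    with tf.GradientTape() as tape:
        probs = policy(states)                                  # pi_W(.|s_i), shape (N, K)
        logp = tf.reduce_sum(
            tf.one_hot(actions, num_actions) * tf.math.log(probs + 1e-8), axis=1)
        loss = -tf.reduce_mean(returns * logp)                  # minimizing -J(W) maximizes J(W)
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))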

Issues

  • High variance
  • Slow convergence
  • On-policy, so data cannot be reused
In [8]:
nb_setup.images_vconcat(["RL_images/PolicyGradient_tf.png"], width=600)
Out[8]:

Variance Reduction

  1. Discounting
  2. Exploiting causality: Reward-to-go
  3. Baselines
  4. Actor-Critic Algorithms

Discounting

In [9]:
nb_setup.images_vconcat(["RL_images/Discounting.png"], width=600)
Out[9]:

Reward-to-go

The idea here is that an action taken at time $i$ can only influence rewards from time $i$ onward. Amend the gradient from

$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot R(\tau_e) \right] $$

to

$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot \left( \sum_{j=i}^T R_j \right) \right] $$

equivalent to

$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot G_i \right] $$

where, recall, $L_i = \log \pi_W(a_i|s_i)$ and $G_i = \sum_{j=i}^T R_j$ is the reward-to-go from step $i$.
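
A minimal numpy sketch of computing the reward-to-go $G_i$ for one episode, with an optional discount factor $\gamma$ as in the Discounting slide; the function name is illustrative.

import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """G_i = sum_{j >= i} gamma^(j-i) * R_j, computed by a single backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        G[i] = running
    return G

print(rewards_to_go([0, 0, 1], gamma=0.9))   # [0.81, 0.9, 1.0]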

Baselines

Further revise the previous equation to

$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial L_i}{\partial w} \cdot (G_i-b) \right] $$

where

$$ b = \frac{1}{T{\cal E}} \sum_{i=1}^{T{\cal E}} G_i $$

So the gradient is based on the mean-differenced reward-to-go. A proof that subtracting such a baseline leaves the gradient unbiased is shown in the slides.

The baseline may also be a function of state, i.e., $b(S_i)$.
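
A minimal sketch of the baseline correction, reusing the `rewards_to_go` helper from the sketch above: pool the reward-to-go values over all steps of all episodes and subtract their mean.

import numpy as np

def centered_rewards_to_go(episode_rewards, gamma=1.0):
    """Pool G_i across episodes and subtract the scalar baseline b = mean(G_i)."""
    G = np.concatenate([rewards_to_go(r, gamma) for r in episode_rewards])
    b = G.mean()                       # baseline: average reward-to-go over all T*E steps
    return G - b                       # the (G_i - b) factors used in the gradient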

Actor-Critic Algorithms

Because the gradient expression contains the total (reward-to-go) rewards, we need a terminating sequence, which is not always available. If the sequence has a finite horizon $T$, then the following expression works.

$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} \cdot (R_i(s_i,a_i)+R_{i+1}(s_{i+1},a_{i+1})+\ldots+R_T(s_T,a_T)) \right] $$

So the variance of the sum $R_i(s_i,a_i)+R_{i+1}(s_{i+1},a_{i+1})+\ldots+R_T(s_T,a_T)$ matters and can be high. Its expectation has much lower variance, and we can exploit this by replacing the sampled reward-to-go with the $Q$ function, which is exactly that expectation (average).

$$ \frac{\partial J(W)}{\partial w} = \frac{1}{\cal E} \sum_{e=1}^{\cal E} \left[\sum_{i=1}^T \frac{\partial \log \pi_W(a_i|s_i)}{\partial w} \cdot q_{\pi}(s_i,a_i) \right] $$

Below we see the graphical intuition for this averaging, and the baseline under Actor-Critic.

In [10]:
nb_setup.images_hconcat(["RL_images/Q_averaging.png","RL_images/Q_baseline.png"], width=1000)
Out[10]:

Advantage Function

In the image above, we denote $A_{\pi}(s,a) = q_{\pi}(s,a) - V_{\pi}(s)$ as the Advantage Function.
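
The advantage can be estimated from the critic's value function alone using the one-step TD error, since $q_{\pi}(s,a) \approx r + \gamma V_{\pi}(s')$. A minimal sketch, assuming a fitted value-function approximator `V(s)` is available; the names and discount are illustrative.

def td_advantage(V, s, r, s_next, gamma=0.99, done=False):
    """One-step TD estimate of the advantage: A(s,a) ~ r + gamma * V(s') - V(s)."""
    td_target = r if done else r + gamma * V(s_next)
    return td_target - V(s)

The actor then scales $\frac{\partial \log \pi_W(a|s)}{\partial w}$ by this advantage, while the critic is trained to move $V(s)$ toward the TD target.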

In [11]:
nb_setup.images_vconcat(["RL_images/AdvantageFunction.png"], width=600)
Out[11]:
In [12]:
nb_setup.images_vconcat(["RL_images/ActorCritic_System.png"], width=600)
Out[12]:

References and Readings