Deep RL Value Function Approximation

In [1]:
from ipypublish import nb_setup

Instead of a tabular representation, we introduce the idea of a functional representation, often using neural nets. This brings benefits in saving memory, computation, and training runs (experience). We now need to find the weights of the approximating function rather than filling in a table. So instead of training the $Q(S,A)$ values directly, we will find the weights that functionally approximate the $Q$s. The fitting procedures will use gradient descent ideas from deep learning and machine learning.

Many of the NN-based algorithms are coded in Python, and the code is freely available. A good repository for examples is OpenAI, which has excellent documentation.

See also DeepMind -- an excellent blog and resource.

In [5]:
nb_setup.images_vconcat(["RL_images/Functional_Representation.png"], width=600)
Out[5]:

Types of Value Function Approximation

The diagram on the left below shows value function approximation. The middle and right diagrams are both cases of approximating the Q function. The right-hand diagram takes in only the state and produces a separate Q value for each action. This approach is more apt when the action space is discrete and finite, as opposed to continuous and large.

In [7]:
nb_setup.images_vconcat(["RL_images/ValueFnApprox_Types.png"], width=600)
Out[7]:

Approximating Value Functions (Prediction)

  1. Run sample episodes from the MDP.
  2. Use MC or TD Learning to get estimates of the value function $v(S)$. (Model-free approach.)
  3. Use these samples to get the weights of the NN. This is done using gradient descent.

$$ S \rightarrow h_W(S) \rightarrow V(S,W) $$

Recall again that we use $V$ for the estimate and $v$ for the true value function. The NN generates $V(S,W)$ and we compare it to the true value $v(S)$. But we don't know $v(S)$!

Below is an example for a linear function.

In [9]:
nb_setup.images_vconcat(["RL_images/RLGradDescent.png"], width=600)
Out[9]:

MC Learning with NNs

Since we do not know $v(S)$, we obtain a proxy for it from the episodes.

When we use MC learning this is $v(S) \approx G(S)$.

$$ w_j \leftarrow w_j - \eta x_j(i)[V(S_i,W) - G(S_i)] $$

($i$ indexes the state.)

After running an episode, we update the weights using the formula shown above (and below). See the slides for the full program flow of the implementation.
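
As a concrete illustration, here is a minimal NumPy sketch of this episode-level MC update for a linear approximator $V(S,W) = \sum_j w_j x_j(S)$. The feature map `phi`, the one-hot features, and the hyperparameters are illustrative assumptions, not part of the slides.

In [ ]:
import numpy as np

# Sketch: MC prediction with a linear approximator V(S,W) = w . x(S).
def phi(s, n_features=8):
    x = np.zeros(n_features)
    x[s % n_features] = 1.0            # toy one-hot features (assumption)
    return x

def mc_value_update(w, episode, gamma=0.99, eta=0.05):
    """episode is a list of (state, reward) pairs collected under the policy."""
    G = 0.0
    # Walk backwards so G accumulates the discounted return G(S_i) for each state.
    for s, r in reversed(episode):
        G = r + gamma * G
        x = phi(s)
        V = w @ x                      # V(S_i, W)
        w = w - eta * x * (V - G)      # w_j <- w_j - eta * x_j(i) * [V(S_i,W) - G(S_i)]
    return w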

In [11]:
nb_setup.images_vconcat(["RL_images/MC_FunctionalApprox.png","RL_images/MC_ValueFnAlgo.png"], width=600)
Out[11]:

TD Learning with NNs

The value function is proxied by $v(S) \approx R + \gamma V(S',W)$ and the update equation is

$$ w_j \leftarrow w_j - \eta x_j(i)[V(S_i,W) - (R + \gamma V(S_{i+1},W))] $$

where, as before, $x_j(i)$ denotes feature $j$ of state $S_i$ (the state is defined by its feature vector $x$).

Here the update is done after every step.

Note that, in the equation above, the NN approximation of the value function $V(S,W)$ is called twice to determine the error $[V(S_i,W) - (R + \gamma V(S_{i+1},W))]$, in states $S_i$ and $S_{i+1}$. This is weird and interesting, because as you change the weights $W$, the target $V(S_{i+1},W)$ also changes. We are chasing a moving target. How do we know that such an algorithm is stable and will converge? (This is not an issue for MC learning.)

Another interesting issue is that when we do an update, the value function changes in all states. This is different from what happens in tabular TD learning. This may also affect stability and convergence. (MC also has this issue.)

Note that in the algorithm shown below, the policy $\pi$ is given -- we are only doing evaluation (prediction) with the NN.
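
For comparison with the MC sketch above, here is a minimal per-step TD(0) update for the same linear approximator; note that the same weight vector produces both the estimate and the bootstrap target. The feature map and hyperparameters are again illustrative assumptions.

In [ ]:
import numpy as np

# Sketch: TD(0) prediction with a linear approximator, updated after every step.
def phi(s, n_features=8):
    x = np.zeros(n_features)
    x[s % n_features] = 1.0                     # toy one-hot features (assumption)
    return x

def td0_step(w, s, r, s_next, done, gamma=0.99, eta=0.05):
    x = phi(s)
    V = w @ x                                   # V(S_i, W)
    V_next = 0.0 if done else w @ phi(s_next)   # bootstrap uses the same weights W
    target = r + gamma * V_next                 # R + gamma V(S_{i+1}, W): a moving target
    return w - eta * x * (V - target)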

In [12]:
nb_setup.images_vconcat(["RL_images/TD_FunctionalApprox.png","RL_images/TD_ValueFnAlgo.png"], width=600)
Out[12]:

Convergence

In [13]:
nb_setup.images_vconcat(["RL_images/VF_convergence.png"], width=600)
Out[13]:

Approximating Value Functions (Control)

  1. As before, instead of estimating $V(S)$ we estimate $Q(S,A)$.
  2. At each step, use $\epsilon$-greedy policy improvement.

So we approximate the optimal action-value function using generalized policy iteration: the evaluation step now updates the weights $W$, and the improvement step is $\epsilon$-greedy.

  1. Run sample episodes from the MDP.
  2. Use MC, TD, or Q learning to get estimates of $Q(S,A)$.
  3. Train the NN

$$ S \rightarrow f_W(\cdot) \rightarrow Q_k(S,A_k,W) $$

for a discrete set of $A_1,...,A_k,...,A_K$. (In the diagram below $T$ should be $T_k$.)
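
A minimal sketch of the "state in, $K$ Q-values out" architecture shown in the diagram below, using a single linear layer $Q_k(S,A_k,W) = \sum_j w_{jk} x_j(S)$: the weight matrix has one column per action. The feature map, sizes, and $\epsilon$ are illustrative assumptions; the control sketches that follow reuse this layout.

In [ ]:
import numpy as np

# Sketch: one column of weights per action, so a state maps to K Q-values at once.
n_features, n_actions = 8, 4
W = np.zeros((n_features, n_actions))

def phi(s):
    x = np.zeros(n_features)
    x[s % n_features] = 1.0            # toy one-hot features (assumption)
    return x

def q_values(s, W):
    return phi(s) @ W                  # vector (Q_1, ..., Q_K) for state s

def epsilon_greedy(s, W, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(q_values(s, W)))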

In [15]:
nb_setup.images_vconcat(["RL_images/K-ary_control.png"], width=600)
Out[15]:

MC NN Control

The update equation will be

$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - G] $$

Choose $A_k$ using $\epsilon$-greedy. Weight updates are made at the end of each episode.
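
A minimal sketch of this end-of-episode update with the linear $K$-output layout above (feature map, sizes, and hyperparameters are illustrative assumptions); only the column of $W$ for the action actually taken is updated at each visited state.

In [ ]:
import numpy as np

# Sketch: MC control update applied once per episode, after the return is known.
n_features, n_actions = 8, 4
phi = lambda s: np.eye(n_features)[s % n_features]   # toy one-hot features (assumption)

def mc_control_update(W, episode, gamma=0.99, eta=0.05):
    """episode is a list of (state, action, reward) triples from an eps-greedy run."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        x = phi(s)
        Q_sa = x @ W[:, a]                           # Q_k(S, A_k, W) for the chosen action
        W[:, a] = W[:, a] - eta * x * (Q_sa - G)     # w_jk <- w_jk - eta x_j [Q_k - G]
    return W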

TD NN Control (On Policy SARSA)

The update equation will be a version of SARSA:

$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma Q(S',A',W))] $$

Choose $A'$ using $\epsilon$-greedy (so the update is on-policy). Weight updates are made at each time step.
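
A minimal per-step sketch of this on-policy update (features, $\epsilon$, and hyperparameters are illustrative assumptions); both $A$ and $A'$ come from the $\epsilon$-greedy behaviour policy.

In [ ]:
import numpy as np

# Sketch: one on-policy SARSA step with the linear K-output approximator.
n_features, n_actions = 8, 4
phi = lambda s: np.eye(n_features)[s % n_features]   # toy one-hot features (assumption)

def epsilon_greedy(s, W, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(phi(s) @ W))

def sarsa_step(W, s, a, r, s_next, a_next, done, gamma=0.99, eta=0.05):
    x = phi(s)
    # Target uses the action A' actually chosen by the eps-greedy policy.
    target = r if done else r + gamma * (phi(s_next) @ W[:, a_next])
    W[:, a] = W[:, a] - eta * x * (x @ W[:, a] - target)
    return W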

In [18]:
nb_setup.images_vconcat(["RL_images/SARSA_ValueFnAlgo.png"], width=600)
Out[18]:

TD NN Control (Off Policy Q-Learning)

The update equation will be a version of Q learning:

$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma \max_{A'} Q(S',A',W))] $$

Choose $A'$ greedily (the maximizing action), so we go off-policy. Weight updates are made at each time step.

So now, to update $Q$ values, we need to update weights.
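
A minimal per-step sketch of the off-policy update (features and hyperparameters are illustrative assumptions); the behaviour action $A$ can come from any policy, while the target always takes the max over $A'$.

In [ ]:
import numpy as np

# Sketch: one off-policy Q-learning step with the linear K-output approximator.
n_features, n_actions = 8, 4
phi = lambda s: np.eye(n_features)[s % n_features]   # toy one-hot features (assumption)

def q_learning_step(W, s, a, r, s_next, done, gamma=0.99, eta=0.05):
    x = phi(s)
    # Target uses the greedy (max) action at S', regardless of the behaviour policy.
    target = r if done else r + gamma * np.max(phi(s_next) @ W)
    W[:, a] = W[:, a] - eta * x * (x @ W[:, a] - target)
    return W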

Stability and Convergence

Recall 2 issues:

(1) Correlated Samples. All value functions update at the same time. Supervised learning needs iid samples, but the TD approach generates correlated samples, since successive states in a trajectory are similar. So we may end up sampling only some regions of the state space. We have to hope that $\epsilon$-greedy exploration mitigates the problem.

(2) Moving Target. The target is not fixed, so we chase a moving target.

The situation is tenuous. Tuning of these algorithms is needed to make sure they are stable.

Two big ideas from DeepMind address these problems: experience replay and target networks.

In [19]:
nb_setup.images_vconcat(["RL_images/VF_convergence2.png"], width=600)
Out[19]:

Deep Q Networks, Experience Replay

Experience replay. Experience is a collection of one-step transitions from $(S,A,R)$ to $(S',A',R')$. We randomly sample from these transitions (stored in a "replay buffer") and update

$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma \max_{A'} Q(S',A',W))] $$

This decorrelates the samples. It also does away with the notion of an episode, since updates are no longer tied to the trajectory just experienced.

Can we jump to a state $S'$ that is not accessible from the current state $S$? No; we should not sample arbitrarily, only from stored transitions. Replay works with Q-learning because it is off-policy: it does not care which policy generated the transitions. It would not work with an on-policy algorithm such as SARSA.

The idea of a replay buffer came along about a decade before DeepMind used it for the Atari games.
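
A minimal replay-buffer sketch (buffer size, batch size, and the use of the earlier `q_learning_step` sketch are illustrative assumptions): store one-step transitions, then train on a uniformly sampled minibatch rather than on the most recent step.

In [ ]:
import random
from collections import deque

# Sketch: a replay buffer of one-step transitions, sampled uniformly at random.
buffer = deque(maxlen=100_000)                 # old transitions fall off the end

def store(s, a, r, s_next, done):
    buffer.append((s, a, r, s_next, done))

def sample_batch(batch_size=32):
    return random.sample(buffer, batch_size)   # uniform sampling breaks correlation

# Training-loop fragment (q_learning_step as sketched earlier):
# for s, a, r, s_next, done in sample_batch():
#     W = q_learning_step(W, s, a, r, s_next, done)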

In [20]:
nb_setup.images_vconcat(["RL_images/Q_Learning_ReplayBuffer.png"], width=600)
Out[20]:

Deep Q Networks, Target Network

Target Network. This is used to stabilize the target, addressing the moving-target problem depicted below.

We use two networks: a Behavior NN and a Target NN. We update the weights $W$ of the Behavior NN every step, but not the Target NN's weights $W'$. By keeping the Target NN more static, the target is perforce not moving all the time, but only occasionally! And this works well. The update rule now becomes

$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma \max_{A'} Q(S',A',W'))] $$

How do we update $W'$? The answer is we don't. Instead, we simply copy $W$ over to $W'$ every $n$ steps or so. It's that simple!
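
A minimal sketch of the two-network idea with the linear layout used above (features, sizes, sync interval, and hyperparameters are illustrative assumptions): the target is computed from the frozen weights $W'$, which are refreshed only by copying $W$ every few thousand steps.

In [ ]:
import numpy as np

# Sketch: behaviour weights W updated every step; target weights W' copied occasionally.
n_features, n_actions = 8, 4
phi = lambda s: np.eye(n_features)[s % n_features]   # toy one-hot features (assumption)

W = np.zeros((n_features, n_actions))        # behaviour network weights W
W_target = W.copy()                          # target network weights W'
sync_every = 1000                            # copy interval n (assumption)

def dqn_step(W, W_target, s, a, r, s_next, done, gamma=0.99, eta=0.05):
    x = phi(s)
    # The target uses the frozen weights W', so it does not move at every update.
    target = r if done else r + gamma * np.max(phi(s_next) @ W_target)
    W[:, a] = W[:, a] - eta * x * (x @ W[:, a] - target)
    return W

# In the training loop:
# if step % sync_every == 0:
#     W_target = W.copy()                    # copy W over to W' every n steps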

In [23]:
nb_setup.images_vconcat(["RL_images/TD_MovingTargetProblem.png","RL_images/Target_Networks.png"], width=600)
Out[23]:

Below we see a combination of Replay Buffer and Target Network.

In [25]:
nb_setup.images_vconcat(["RL_images/Q_Learning_TargetNetwork.png","RL_images/DQN_Algo.png"], width=600)
Out[25]:

Atari Game

See the paper Mnih et al. (2015), "Human-level control through deep reinforcement learning".

In [26]:
nb_setup.images_vconcat(["RL_images/Atari_Network_Architecture.png"], width=600)
Out[26]:
In [27]:
nb_setup.images_vconcat(["RL_images/DQN_BehaviorAgent.png","RL_images/DQN_TargetAgent.png"], width=600)
Out[27]:

Continuous Action Spaces

See Lillicrap et al. (2015), "Continuous control with deep reinforcement learning".

Deterministic Policy Gradients (DPG)

We need to compute $\max_a Q(S,a)$, but the action is continuous, so we differentiate the Q function to get the gradient $\frac{\partial Q}{\partial a}$ and use it to find $A_* = \mathrm{argmax}_a\, Q(S,a)$ (and likewise $A_*'$ at the next state $S'$). We then use this in the update equation:

$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma Q(S',A_*',W'))] $$
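
A toy sketch of the gradient-on-the-action idea (the quadratic critic, step size, and iteration count are illustrative assumptions, not the Lillicrap et al. architecture): with a continuous action we cannot enumerate $\max_a Q(S,a)$, so we climb $\partial Q/\partial a$ instead; in DPG the actor network is trained to follow this gradient.

In [ ]:
# Sketch: find argmax_a Q(S,a) for a continuous action by gradient ascent on Q.
def Q(s, a):
    return -(a - 0.3 * s) ** 2            # toy critic, maximized at a = 0.3 s (assumption)

def dQ_da(s, a, h=1e-5):
    return (Q(s, a + h) - Q(s, a - h)) / (2 * h)   # numerical dQ/da

def approx_argmax_a(s, a=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        a = a + lr * dQ_da(s, a)          # gradient ascent on Q(S, a)
    return a

print(approx_argmax_a(2.0))               # ~0.6, the true maximizer 0.3 * 2.0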

In [29]:
nb_setup.images_vconcat(["RL_images/DPG_CriticActor.png","RL_images/DPG_Algo.png"], width=600)
Out[29]:

Actor-Critic

This is called the "Actor-Critic Policy" approach; see below. Reference: the A3C (Asynchronous Advantage Actor-Critic) algorithm, Mnih et al. (2016).

This approach may be enhanced to make the model asynchronous, which has the following advantages (a minimal sketch of the parallel-update idea follows the list):

  • Can multithread on standard CPUs.
  • Parallelize many agents.
  • NN parameters may be shared amongst threads.
  • Parallelism mitigates the correlation problem and offers an alternative to the experience replay approach. The effect of multiple workers applying online updates in parallel is less correlated than a single agent applying online updates.
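
Here is the minimal threading sketch of the parallel-update idea (the toy regression loss, sizes, and the lock are illustrative assumptions; A3C itself applies actor-critic gradients, often without a lock): several workers apply online updates to a shared parameter vector, and because each worker sees its own samples, the combined stream of updates is less correlated.

In [ ]:
import threading
import numpy as np

# Sketch: parallel workers applying online updates to shared parameters.
shared_w = np.zeros(4)
true_w = np.array([1.0, -2.0, 0.5, 3.0])   # toy target the workers learn jointly (assumption)
lock = threading.Lock()

def worker(worker_id, steps=1000, eta=0.01):
    rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        x = rng.normal(size=4)                       # each worker sees its own data
        target = x @ true_w
        with lock:
            grad = x * (shared_w @ x - target)       # gradient of a toy squared error
            shared_w[:] = shared_w - eta * grad      # online update to shared weights

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_w)   # approaches true_w
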
In [32]:
nb_setup.images_hconcat(["RL_images/A3C_Architecture.png","RL_images/A3C_Workflow.png"], width=1000)
Out[32]:

References and Reading