from ipypublish import nb_setup
Instead of a tabular representation, we introduced the idea of functional representations, often using neural nets. This saves memory, computation, and training runs (experience). So now we need to find the weights that approximate the function rather than filling in a table: instead of training the $Q(S,A)$ values directly, we find the weights that functionally approximate the $Q$s. The fitting procedures will use gradient descent ideas from deep learning and machine learning.
Many of the NN-based algorithms are coded in Python and the code is freely available. A good repository for examples is OpenAI, which has excellent documentation.
See also DeepMind -- an excellent blog and resource.
nb_setup.images_vconcat(["RL_images/Functional_Representation.png"], width=600)
The diagram on the left below shows value function approximation. The middle and right diagrams are both approximations of the Q function. The right-hand diagram takes in only the state and produces a separate Q value for each action. It may be more apt to use this approach when the action space is discrete and finite, as opposed to continuous and large.
nb_setup.images_vconcat(["RL_images/ValueFnApprox_Types.png"], width=600)
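As a concrete sketch of the right-hand architecture (state features in, one Q value per action out), here is a minimal two-layer network in plain NumPy. The layer sizes, activation, and random initialization are purely illustrative choices, not part of any particular algorithm.

import numpy as np

def init_qnet(n_features, n_hidden, n_actions, seed=0):
    """Randomly initialize the weights W of a small two-layer Q network."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(0, 0.1, (n_features, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.1, (n_hidden, n_actions)), "b2": np.zeros(n_actions)}

def q_values(W, s):
    """Forward pass: state feature vector s -> vector of K Q values, one per action."""
    h = np.tanh(s @ W["W1"] + W["b1"])
    return h @ W["W2"] + W["b2"]

W = init_qnet(n_features=4, n_hidden=16, n_actions=3)
print(q_values(W, np.ones(4)))   # three Q values for a single state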
$$ S \rightarrow h_W(S) \rightarrow V(S,W) $$
Recall again that we use $V$ for the estimate and $v$ for the true value of the function. The NN generates $V(S,W)$ and we compare it to the true value $v(S)$. But we don't know $v(S)$!
Below is an example for a linear function.
nb_setup.images_vconcat(["RL_images/RLGradDescent.png"], width=600)
Since we do not know $v(S)$, we obtain a proxy for it from the episodes.
When we use MC learning this is $v(S) \approx G(S)$.
$$ w_j \leftarrow w_j - \eta x_j(i)[V(S_i,W) - G(S_i)] $$
($i$ indexes the state.)
After running an episode, we update the weights using the formula shown above (and below). See the slides for more detail on the full program flow of the implementation.
nb_setup.images_vconcat(["RL_images/MC_FunctionalApprox.png","RL_images/MC_ValueFnAlgo.png"], width=600)
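To make the MC update concrete, here is a minimal sketch assuming a linear approximator $V(S,W)=\sum_j w_j x_j(S)$; the feature vectors and returns in the toy episode are made up purely for illustration.

import numpy as np

def mc_value_update(w, episode, eta=0.01):
    """One gradient-MC pass for linear V(S,W) = w . x(S).

    `episode` is a list of (x, G) pairs: the feature vector of each visited
    state and the return G observed from that state onward.
    """
    for x, G in episode:
        V = w @ x                      # current estimate V(S_i, W)
        w -= eta * (V - G) * x         # w_j <- w_j - eta * x_j(i) * [V(S_i,W) - G(S_i)]
    return w

# toy example: 3 features, a fake two-state episode with known returns
w = np.zeros(3)
episode = [(np.array([1.0, 0.0, 1.0]), 5.0), (np.array([0.0, 1.0, 1.0]), 2.0)]
print(mc_value_update(w, episode))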
The value function is proxied by $v(S) \approx R + \gamma V(S',W)$ and the update equation is
$$ w_j \leftarrow w_j - \eta x_j(S_i)[V(S_i,W) - (R + \gamma V(S_{i+1},W))] $$
where $x_j(S_i)$ are the features that define the state $S_i$.
Here the update is done after every step.
Note that, in the equation above, the NN approximation of the value function $V(S,W)$ is called twice to determine the error $[V(S_i,W) - (R + \gamma V(S_{i+1},W))]$, at states $S_i$ and $S_{i+1}$. This is weird and interesting: as you change the weights $W$, the target $V(S_{i+1},W)$ also changes. We are chasing a moving target. How do we know that such an algorithm is stable and will converge? (This is not an issue for MC learning.)
Another interesting issue is that when we do an update, the value function changes in all states. This is different from what happens in tabular TD learning. This may also affect stability and convergence. (MC also has this issue.)
Note that in the algorithm shown below, the policy $\pi$ is given -- we are only doing evaluation (prediction) with the NN.
nb_setup.images_vconcat(["RL_images/TD_FunctionalApprox.png","RL_images/TD_ValueFnAlgo.png"], width=600)
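Here is a sketch of the corresponding semi-gradient TD(0) loop, again with a linear $V$. The environment interface (`env.reset()`, `env.step(a)` returning `(s_next, r, done)`), the fixed `policy`, and the `features` map are hypothetical placeholders.

import numpy as np

def td0_prediction(env, policy, features, n_features, n_episodes=100, eta=0.01, gamma=0.99):
    """Semi-gradient TD(0) prediction for a given policy, with linear V(S,W) = w . x(S)."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)                # assumed interface
            x = features(s)
            target = r if done else r + gamma * (w @ features(s_next))  # R + gamma V(S',W)
            w -= eta * (w @ x - target) * x              # update after every step
            s = s_next
    return w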
nb_setup.images_vconcat(["RL_images/VF_convergence.png"], width=600)
So we approximate the optimal action-value function using generalized policy iteration; the $\epsilon$-greedy algorithm now updates the weights $W$.
$$ S \rightarrow f_W(\cdot) \rightarrow Q_k(S,A_k,W) $$
for a discrete set of $A_1,...,A_k,...,A_K$. (In the diagram below $T$ should be $T_k$.)
nb_setup.images_vconcat(["RL_images/K-ary_control.png"], width=600)
The update equation will be
$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - G] $$
Choose $A_k$ using $\epsilon$-greedy. Weight updates are made at the end of each episode.
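A sketch of these two pieces under a linear $Q$ with one weight column per action, so the matrix `Wq` holds the entries $w_{jk}$. The episode format (feature vector, chosen action index, observed return) is a hypothetical convention.

import numpy as np

def eps_greedy(Wq, x, eps, rng):
    """Behavior policy: random action with probability eps, else argmax_k Q_k(S,W)."""
    if rng.random() < eps:
        return int(rng.integers(Wq.shape[1]))
    return int(np.argmax(x @ Wq))                        # Q_k(S,W) = sum_j x_j w_jk

def mc_control_update(Wq, episode, eta=0.01):
    """End-of-episode update: w_jk <- w_jk - eta * x_j * [Q_k(S,A_k,W) - G]."""
    for x, k, G in episode:                              # (features, action index, return)
        Wq[:, k] -= eta * (x @ Wq[:, k] - G) * x
    return Wq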
The update equation will be a version of SARSA:
$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma Q(S',A',W))] $$
Choose $A'$ using $\epsilon$-greedy. Weight updates are made at each time step.
nb_setup.images_vconcat(["RL_images/SARSA_ValueFnAlgo.png"], width=600)
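A minimal sketch of one SARSA step with the same linear $Q$; note the discount $\gamma$ in the target and that `a_next` is the action actually chosen $\epsilon$-greedily in $S'$ (on-policy). The argument names are placeholders.

import numpy as np

def sarsa_step(Wq, x, a, r, x_next, a_next, done, eta=0.01, gamma=0.99):
    """On-policy update: the target uses the Q value of the action A' we will actually take."""
    target = r if done else r + gamma * (x_next @ Wq[:, a_next])
    Wq[:, a] -= eta * (x @ Wq[:, a] - target) * x        # w_jk <- w_jk - eta x_j [Q_k - target]
    return Wq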
The update equation will be a version of Q learning:
$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma \max_{A'} Q(S',A',W))] $$
Choose $A'$ as the maximizing (greedy) action, so we go off-policy. Weight updates are made at each time step.
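The Q-learning step (sketched below under the same linear $Q$ assumption) is identical except for the target, which maximizes over actions in $S'$ regardless of which action the behavior policy then takes.

import numpy as np

def q_learning_step(Wq, x, a, r, x_next, done, eta=0.01, gamma=0.99):
    """Off-policy update: the target uses max_k Q_k(S',W), not the action actually taken."""
    target = r if done else r + gamma * np.max(x_next @ Wq)
    Wq[:, a] -= eta * (x @ Wq[:, a] - target) * x
    return Wq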
So now, to update $Q$ values, we need to update weights.
Recall 2 issues:
(1) Correlated Samples. All value functions update at the same time. Supervised learning assumes iid samples, but the TD approach generates correlated samples, so we may end up sampling only some regions of the state space. We have to hope that $\epsilon$-greedy mitigates the problem.
(2) Moving Target. The target is not fixed, so we chase a moving target.
Stabilization of Q-learning with deep networks: DQN by DeepMind (2013).
Stabilization of SARSA: the A3C (Asynchronous Advantage Actor Critic) algorithm, Mnih et al. (2016).
The situation is tenuous. Tuning of these algorithms is needed to make sure they are stable.
Two big ideas from DeepMind address these issues: experience replay and target networks.
nb_setup.images_vconcat(["RL_images/VF_convergence2.png"], width=600)
Experience replay. Experience is a collection of one-step transitions from $(S,A,R)$ to $(S',A',R')$. We randomly sample from these transitions (stored in a "replay buffer") and update
$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma \max_{A'} Q(S',A',W))] $$
This makes the samples uncorrelated, and it dispenses with the concept of an episode.
Can we jump to a state $S'$ that is not accessible from the current state $S$? No -- we do not sample arbitrarily; we only replay transitions that actually occurred. This works with Q-learning because it is off-policy: it does not care which policy generated the transitions. It would not work with SARSA, which is on-policy.
The idea of a replay buffer predates DeepMind's use of it for the Atari games by about a decade.
nb_setup.images_vconcat(["RL_images/Q_Learning_ReplayBuffer.png"], width=600)
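A minimal replay-buffer sketch; the transition layout (features, action, reward, next features, done flag), the capacity, and the batch size are illustrative choices.

import random
from collections import deque

class ReplayBuffer:
    """Stores one-step transitions and returns (roughly) uncorrelated random minibatches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)             # oldest transitions are dropped first

    def add(self, x, a, r, x_next, done):
        self.buffer.append((x, a, r, x_next, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), batch_size)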
Target Network. This stabilizes the target, addressing the moving-target problem depicted below.
This uses two networks: a Behavior NN and a Target NN. We update the weights $W$ of the Behavior NN every step, but not the Target NN's weights $W'$. By keeping the Target NN more static, the target is perforce not moving all the time, only occasionally! And this works well. The update rule now becomes
$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma \max_{A'} Q(S',A',W'))] $$
How do we update $W'$? The answer is, we don't. Instead, we simply copy $W$ over to $W'$ every $n$ steps or so. It's that simple!
nb_setup.images_vconcat(["RL_images/TD_MovingTargetProblem.png","RL_images/Target_Networks.png"], width=600)
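A sketch of the target-network bookkeeping: the behavior weights $W$ are trained every step, while $W'$ is a frozen copy refreshed every $n$ steps. The copy interval and the linear-Q form of the target are illustrative choices.

import copy
import numpy as np

def maybe_sync_target(W, W_target, step, n=1000):
    """Every n steps, copy the behavior weights W into the target weights W'."""
    return copy.deepcopy(W) if step % n == 0 else W_target

def td_target(W_target, r, x_next, done, gamma=0.99):
    """Target R + gamma * max_k Q_k(S', W'), computed with the frozen weights W'."""
    return r if done else r + gamma * np.max(x_next @ W_target)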
Below we see a combination of Replay Buffer and Target Network.
nb_setup.images_vconcat(["RL_images/Q_Learning_TargetNetwork.png","RL_images/DQN_Algo.png"], width=600)
See the paper Mnih et al. (2015), "Human-Level Control Through Deep Reinforcement Learning".
nb_setup.images_vconcat(["RL_images/Atari_Network_Architecture.png"], width=600)
nb_setup.images_vconcat(["RL_images/DQN_BehaviorAgent.png","RL_images/DQN_TargetAgent.png"], width=600)
See Lillicrap et al. (2015), "Continuous control with deep reinforcement learning."
We need to compute $\max_a Q(S,a)$, for which we differentiate the Q function to get the gradient $\frac{\partial Q}{\partial a}$. This gives $A_* = \mbox{argmax}_a Q(S,a)$, which we then use in the update equation:
$$ w_{jk} \leftarrow w_{jk} - \eta x_j [Q_k(S,A_k,W) - (R + \gamma Q(S',A_*',W'))] $$
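A sketch of finding $A_*$ by gradient ascent on the action, using a finite-difference estimate of $\partial Q/\partial a$ (in practice a deep-learning framework supplies this gradient directly). The quadratic $Q$ below is just a stand-in for the critic network.

import numpy as np

def grad_ascent_action(Q, s, a0, lr=0.1, n_steps=50, h=1e-4):
    """Climb dQ/da from a starting action a0 to approximate A* = argmax_a Q(s, a)."""
    a = np.array(a0, dtype=float)
    for _ in range(n_steps):
        grad = np.array([(Q(s, a + h * e) - Q(s, a - h * e)) / (2 * h)
                         for e in np.eye(len(a))])       # finite-difference dQ/da
        a += lr * grad
    return a

# stand-in critic: a concave quadratic in the action, peaked at a = [1, -1]
Q = lambda s, a: -np.sum((a - np.array([1.0, -1.0])) ** 2)
print(grad_ascent_action(Q, s=None, a0=[0.0, 0.0]))      # approximately [1, -1]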
nb_setup.images_vconcat(["RL_images/DPG_CriticActor.png","RL_images/DPG_Algo.png"], width=600)
This is called the "Actor-Critic" approach; see below. Reference: the A3C (Asynchronous Advantage Actor Critic) algorithm, Mnih et al. (2016).
This approach may be enhanced to make the model asynchronous. It has the advantages shown in the figures below:
nb_setup.images_hconcat(["RL_images/A3C_Architecture.png","RL_images/A3C_Workflow.png"], width=1000)