The notes and index here are based on a class taught by Professor Subir Varma at SCU in the Summer of 2019. Much of the material is taken from and based on the well-known book by Sutton and Barto. No claim of originality is made here. The goal is to offer a distillation of the main ideas in RL.
#Install ipypublish in case it is not there.
#!pip install ipypublish
from ipypublish import nb_setup
RL is the science of making decisions by interacting with the environment.
It is the third pillar of machine learning, alongside supervised learning and unsupervised learning.
In supervised learning there is a label; here there is a reward instead. There is no supervisor, only a reward signal.
Feedback is delayed, not instantaneous.
Time matters (sequential, non iid data).
Agent's actions affect the subsequent data it receives.
Examples:
Goal: select actions to maximize total future reward.
Actions may have long-term consequences.
Reward may be delayed.
It may be better to sacrifice immediate rewards for long term rewards.
Examples:
History $H_t$: the sequence of observations, actions, and rewards up to time $t$. The state is a function of the history:
$S_t = f(H_t)$
Types of state: Environment State, $S_t^e$. The agent may not have access to this, since it may only observe $O$ and $R$.
Agent State $S_t^a$: what the agent knows. It's some function of the history.
Useful property for State: Markov Property. $P[S_{t+1} | S_1,...,S_t] = P[S_{t+1} | S_t]$
We prefer to choose states that have the Markov property.
$$O_t = S_t^a = S_t^e$$
Agent State equals Environment State equals Information State.
This fully observable case is a Markov Decision Process (MDP).
Partially observable environments: in driving, we can't see everything around us; a poker-playing agent only sees some of the cards.
$$ A_t = \pi(S_t) $$
The action is a function of the current state because of the Markov property. A policy can be deterministic or stochastic (different actions chosen with defined probabilities).
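A minimal sketch (with made-up states and actions, not from the course material) contrasting the two kinds of policy: a deterministic policy maps each state to one action, while a stochastic policy maps each state to a probability distribution over actions.
import numpy as np

# Hypothetical example: 3 states, 2 actions
actions = ['left', 'right']

# Deterministic policy: A_t = pi(S_t), one action per state
deterministic_pi = {0: 'left', 1: 'right', 2: 'left'}

# Stochastic policy: pi(a|s), a probability for each action in each state
stochastic_pi = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}

def act(state, stochastic=False):
    # Select an action for the given state under the chosen policy
    if stochastic:
        return np.random.choice(actions, p=stochastic_pi[state])
    return deterministic_pi[state]

print(act(0))                   # always 'left'
print(act(1, stochastic=True))  # 'left' or 'right', each with probability 0.5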
Value function: how good is each state/action. A prediction of future reward. Used to evaluate the goodness or badness of states, so as to select between actions. A value function specifies what is good in the long run.
$$ v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s] $$
Discounting with $\gamma < 1$ ensures the infinite sum converges when rewards are bounded.
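As a concrete illustration (made-up rewards, not from the notes), a short computation of the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...$, whose expectation under the policy is $v_{\pi}(s)$.
def discounted_return(rewards, gamma=0.9):
    # Sum of gamma^k * R_{t+1+k} over a finite reward sequence starting at R_{t+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical reward sequence R_{t+1}, R_{t+2}, ...
print(discounted_return([1, 0, 0, 5], gamma=0.9))  # 1 + 0.9**3 * 5 = 4.645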
Model: the agent's representation of the environment. A model predicts what the environment will do next. If the agent does not know the model, it is "model-free" RL. But if the agent figures out what the environment is doing, it can build a model, and this is called "model-based" RL: the agent can then interact with its model of the real world rather than only with the environment itself.
Model gives the probabilistic next state and reward.
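A minimal sketch of a tabular model for a hypothetical two-state, two-action environment: the model stores estimated transition probabilities and expected rewards, so the agent can predict the next state and reward without touching the real environment.
import random

# Hypothetical model: P[s][a] gives next-state probabilities, R[s][a] the expected reward
P = {0: {'stay': {0: 1.0}, 'move': {0: 0.2, 1: 0.8}},
     1: {'stay': {1: 1.0}, 'move': {0: 0.8, 1: 0.2}}}
R = {0: {'stay': 0.0, 'move': 1.0},
     1: {'stay': 0.5, 'move': 0.0}}

def model_step(state, action):
    # Use the model (not the real environment) to predict the next state and reward
    next_states = list(P[state][action].keys())
    probs = list(P[state][action].values())
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[state][action]

print(model_step(0, 'move'))  # e.g. (1, 1.0) with probability 0.8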
Computation of Value Function
Computation of the Policy Function
Challenges: Model not known, Large state spaces, Continuous state spaces.
Example of the Maze:
Value-based: no policy (implicit); learn the value function.
Policy-based: no value function; learn the policy function.
Actor Critic: both policy and value function based.
Also:
Model Free: find the best policy or value function without trying to figure out the model; no model is learned.
Model Based: find the best policy or value function; the agent tries to figure out the model.
nb_setup.images_hconcat(["RL_images/RL_Agent_Taxonomy.png"], width=400)
Learning and Planning: the agent can learn by RL (model free) or by planning (model based). Planning can use a tree-based algorithm.
Exploration and Exploitation, usually an issue in model-free systems: discover a good policy without losing too much reward along the way, and then exploit it. Examples: (i) restaurant selection; (ii) online banner ads; (iii) oil drilling; (iv) game playing.
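One common way to trade off the two (a sketch of a standard technique, not the only option) is $\epsilon$-greedy action selection: take a random action with probability $\epsilon$, otherwise exploit the current best estimate.
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a random action (explore), otherwise the greedy one (exploit)
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

# Hypothetical action-value estimates, e.g. expected ratings of three restaurants
q_values = [1.2, 0.4, 2.0]
print(epsilon_greedy(q_values))  # usually 2, occasionally a random index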
Prediction and Control. Prediction means evaluating the future under a given policy, $v_{\pi}(s)$. Control means optimizing the future, $v_*(s)$.
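A small sketch of the distinction on a hypothetical two-state, two-action MDP: prediction evaluates a fixed (here uniform random) policy to obtain $v_{\pi}$, while control runs value iteration to obtain $v_*$.
import numpy as np

# Hypothetical MDP: P[s, a, s'] are transition probabilities, R[s, a] expected rewards
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.8, 0.2]]])
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])
gamma = 0.9

# Prediction: evaluate the uniform random policy (v_pi)
v = np.zeros(2)
for _ in range(200):
    v = np.mean(R + gamma * P @ v, axis=1)

# Control: optimize via value iteration (v_*)
v_star = np.zeros(2)
for _ in range(200):
    v_star = np.max(R + gamma * P @ v_star, axis=1)

print(v, v_star)  # v_* is at least as large as v_pi in every state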
Needed for much larger state spaces. We cannot enumerate the entire state space in a table, so we use neural nets to approximate the value function (or policy) over it.
This is different from tabular RL.
Represent the value function by a neural network (NN).
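A minimal sketch (assuming TensorFlow/Keras is available and a hypothetical 4-dimensional state) of representing $v(s)$ with a small neural network instead of a table; the training targets here are made-up stand-ins for, e.g., Monte Carlo returns.
import numpy as np
import tensorflow as tf

# Small network mapping a state vector to a scalar value estimate v(s)
value_net = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),          # hypothetical 4-dimensional state
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)             # estimated value of the state
])
value_net.compile(optimizer='adam', loss='mse')

# Made-up states and value targets, just to show the training call
states = np.random.randn(64, 4).astype('float32')
targets = np.random.randn(64, 1).astype('float32')
value_net.fit(states, targets, epochs=1, verbose=0)

print(value_net.predict(states[:1], verbose=0))  # v(s) for one state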
Sutton and Barto (2018); pdf: Ch 1 and Ch 3.1-3.4.
Reinforcement Learning with OpenAI, Keras, and TensorFlow, Nandy and Biswas (2017); pdf.
Spriteworld is a python-based RL environment that consists of a 2-dimensional arena with simple shapes that can be moved freely. Spriteworld was developed for the COBRA agent introduced in the paper "COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration" (Watters et al., 2019).
bsuite: Behaviour Suite for Reinforcement Learning.