The notes and index here are based on a class taught by Professor Subir Varma at SCU in the Summer of 2019. Much of the material is taken from and based on the well-known book by Sutton and Barto. No claim of originality is made here. The goal is to offer a distillation of the main ideas in RL.
#Install ipypublish in case it is not there.
#!pip install ipypublish
from ipypublish import nb_setup
RL is the science of making decisions by interacting with the environment.
It is the third pillar of machine learning, alongside supervised learning and unsupervised learning.
In supervised learning there is a label; here there is a reward instead. There is no supervisor, only a reward signal.
Feedback is delayed, not instantaneous.
Time matters (sequential, non iid data).
Agent's actions affect the subsequent data it receives.
Examples:
Goal: select actions to maximize total future reward.
Actions may have long-term consequences.
Reward may be delayed.
It may be better to sacrifice immediate rewards for long term rewards.
Examples:
History $H_t$: the sequence of observations, actions, and rewards up to time $t$. The state is a function of the history:
$S_t = f(H_t)$
Types of state: Environment State, $S_t^e$. The agent may not have access to this, since it may only observe $O$ and $R$.
Agent State $S_t^a$: what the agent knows. It's some function of the history.
Useful property for State: Markov Property. $P[S_{t+1} | S_1,...,S_t] = P[S_{t+1} | S_t]$
We prefer to choose states that have the Markov property.
$$O_t = S_t^a = S_t^e$$
Agent State equals Environment State equals Information State.
This fully observable case is a Markov Decision Process (MDP).
Partially observable environments: in driving, we can't see everything around us; a poker-playing agent only sees some of the cards.
$$ A_t = \pi(S_t) $$
The action is a function of the current state because of the Markov property. A policy can be deterministic or stochastic (different actions chosen with defined probabilities).
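A minimal sketch (with made-up states and actions, not from the course material) contrasting the two kinds of policy: a deterministic policy maps each state to one action, while a stochastic policy maps each state to a probability distribution over actions.
import numpy as np

# Hypothetical example: 3 states, 2 actions
actions = ['left', 'right']

# Deterministic policy: A_t = pi(S_t), one action per state
deterministic_pi = {0: 'left', 1: 'right', 2: 'left'}

# Stochastic policy: pi(a|s), a probability for each action in each state
stochastic_pi = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}

def act(state, stochastic=False):
    # Select an action for the given state under the chosen policy
    if stochastic:
        return np.random.choice(actions, p=stochastic_pi[state])
    return deterministic_pi[state]

print(act(0))                   # always 'left'
print(act(1, stochastic=True))  # 'left' or 'right', each with probability 0.5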
Value function: how good is each state/action. A prediction of future reward. Used to evaluate the goodness or badness of states, so as to select between actions. A value function specifies what is good in the long run.
$$ v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s] $$
Discounting with $\gamma < 1$ ensures the infinite sum converges when rewards are bounded.
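As a concrete illustration (made-up rewards, not from the notes), a short computation of the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...$, whose expectation under the policy is $v_{\pi}(s)$.
def discounted_return(rewards, gamma=0.9):
    # Sum of gamma^k * R_{t+1+k} over a finite reward sequence starting at R_{t+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical reward sequence R_{t+1}, R_{t+2}, ...
print(discounted_return([1, 0, 0, 5], gamma=0.9))  # 1 + 0.9**3 * 5 = 4.645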
Model: the agent's representation of the environment. A model predicts what the environment will do next. If the agent does not know the model, it is "model-free" RL. But if the agent figures out what the environment is doing, it can build a model, and this is called "model-based" RL: the agent can then interact with its model of the real world rather than only with the environment itself.
Model gives the probabilistic next state and reward.
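A minimal sketch of a tabular model for a hypothetical two-state, two-action environment: the model stores estimated transition probabilities and expected rewards, so the agent can predict the next state and reward without touching the real environment.
import random

# Hypothetical model: P[s][a] gives next-state probabilities, R[s][a] the expected reward
P = {0: {'stay': {0: 1.0}, 'move': {0: 0.2, 1: 0.8}},
     1: {'stay': {1: 1.0}, 'move': {0: 0.8, 1: 0.2}}}
R = {0: {'stay': 0.0, 'move': 1.0},
     1: {'stay': 0.5, 'move': 0.0}}

def model_step(state, action):
    # Use the model (not the real environment) to predict the next state and reward
    next_states = list(P[state][action].keys())
    probs = list(P[state][action].values())
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[state][action]

print(model_step(0, 'move'))  # e.g. (1, 1.0) with probability 0.8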
Computation of Value Function
Computation of the Policy Function
Challenges: Model not known, Large state spaces, Continuous state spaces.
Example of the Maze:
Value-based: no policy (implicit); learn the value function.
Policy-based: no value function; learn the policy function.
Actor Critic: both policy and value function based.
Also:
Model Free: find the best policy or value function without trying to figure out the model; no model is learned.
Model Based: find the best policy or value function; the agent tries to figure out the model.
nb_setup.images_hconcat(["RL_images/RL_Agent_Taxonomy.png"], width=400)
Learning and Planning: the agent can learn by RL (model free) or by planning (model based). Planning can use a tree-based algorithm.
Exploration and Exploitation, usually an issue in model-free systems: discover a good policy without losing too much reward along the way, and then exploit it. Examples: (i) restaurant selection; (ii) online banner ads; (iii) oil drilling; (iv) game playing.
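One common way to trade off the two (a sketch of a standard technique, not the only option) is $\epsilon$-greedy action selection: take a random action with probability $\epsilon$, otherwise exploit the current best estimate.
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a random action (explore), otherwise the greedy one (exploit)
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

# Hypothetical action-value estimates, e.g. expected ratings of three restaurants
q_values = [1.2, 0.4, 2.0]
print(epsilon_greedy(q_values))  # usually 2, occasionally a random index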
Prediction and Control. Prediction means evaluating the future under a given policy, $v_{\pi}(s)$. Control means optimizing the future, $v_*(s)$.
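A small sketch of the distinction on a hypothetical two-state, two-action MDP: prediction evaluates a fixed (here uniform random) policy to obtain $v_{\pi}$, while control runs value iteration to obtain $v_*$.
import numpy as np

# Hypothetical MDP: P[s, a, s'] are transition probabilities, R[s, a] expected rewards
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.8, 0.2]]])
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])
gamma = 0.9

# Prediction: evaluate the uniform random policy (v_pi)
v = np.zeros(2)
for _ in range(200):
    v = np.mean(R + gamma * P @ v, axis=1)

# Control: optimize via value iteration (v_*)
v_star = np.zeros(2)
for _ in range(200):
    v_star = np.max(R + gamma * P @ v_star, axis=1)

print(v, v_star)  # v_* is at least as large as v_pi in every state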
Needed for much larger state spaces. We cannot enumerate the entire state space in a table, so we use neural nets to approximate the value function (or policy) over it.
This is different from tabular RL.
Represent the value function by a neural network (NN).
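A minimal sketch (assuming TensorFlow/Keras is available and a hypothetical 4-dimensional state) of representing $v(s)$ with a small neural network instead of a table; the training targets here are made-up stand-ins for, e.g., Monte Carlo returns.
import numpy as np
import tensorflow as tf

# Small network mapping a state vector to a scalar value estimate v(s)
value_net = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),          # hypothetical 4-dimensional state
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)             # estimated value of the state
])
value_net.compile(optimizer='adam', loss='mse')

# Made-up states and value targets, just to show the training call
states = np.random.randn(64, 4).astype('float32')
targets = np.random.randn(64, 1).astype('float32')
value_net.fit(states, targets, epochs=1, verbose=0)

print(value_net.predict(states[:1], verbose=0))  # v(s) for one state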
Sutton and Barto (2018); pdf: Ch 1 and Ch 3.1-3.4.
Reinforcement Learning with OpenAI, Keras, and TensorFlow, Nandy and Biswas (2017); pdf.
Spriteworld is a python-based RL environment that consists of a 2-dimensional arena with simple shapes that can be moved freely. Spriteworld was developed for the COBRA agent introduced in the paper "COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration" (Watters et al., 2019).
bsuite: Behaviour Suite for Reinforcement Learning.