Introduction to Reinforcement Learning (RL)

The notes and index here are based on a class taught by Professor Subir Varma at SCU in the Summer of 2019. Much of the material is taken from, and based on, the well-known book by Sutton and Barto. No claim of originality is made here. The goal is to offer a distillation of the main ideas in RL.

In [1]:
# Install ipypublish in case it is not already present.
#!pip install ipypublish
from ipypublish import nb_setup

RL is the science of making decisions by interacting with the environment.

  • The agent makes decisions, i.e., takes actions.
  • Each action has consequences: (i) observations; (ii) rewards.

It is the third pillar of machine learning, alongside supervised learning and unsupervised learning.

Characteristics of RL

  • There is no supervisor and no label as in supervised learning; there is only a reward signal.

  • Feedback is delayed, not instantaneous.

  • Time matters (sequential, non-iid data).

  • The agent's actions affect the subsequent data it receives.

Rewards

  • A reward $R_t$ is a scalar feedback signal.
  • It tells the agent how well it is doing at step $t$.
  • The agent tries to maximize expected cumulative reward.

Examples:

  • Making a humanoid robot walk: positive reward for walking, negative reward for falling over.
  • Playing video games: positive reward for winning, negative reward for losing.
  • Managing an investment portfolio: reward is the size of the bank balance.
  • Controlling a power station: positive reward for producing power, negative reward for exceeding safety thresholds.
  • Playing Backgammon: positive reward for winning, negative reward for losing.

Sequential Decision Making

Goal: select actions to maximize total future reward.

Actions may have long-term consequences.

Reward may be delayed.

It may be better to sacrifice an immediate reward in order to gain a larger long-term reward.

Examples:

  • Financial investment.
  • Refuelling a helicopter in time.
  • Blocking opponent moves.

Environment: Action, Observation, Reward

  • Agent looks at Observation $O_t$ and Reward $R_t$ and takes an Action $A_t$.
  • Action $A_t$ leads to Observation $O_{t+1}$ and/or Reward $R_{t+1}$ (a sketch of this loop follows below).
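
As a concrete illustration of this loop, here is a minimal Python sketch. The Environment and RandomAgent classes, their method names, and the action set are hypothetical placeholders for illustration, not part of any library or of the lecture.

In [ ]:
import random

class Environment:
    """Hypothetical environment: step() returns (observation, reward, done)."""
    def step(self, action):
        observation = random.random()              # O_{t+1}
        reward = 1.0 if action == "N" else 0.0     # R_{t+1}
        done = random.random() < 0.1               # the episode ends eventually
        return observation, reward, done

class RandomAgent:
    """Hypothetical agent: maps the latest (O_t, R_t) to an action A_t."""
    def act(self, observation, reward):
        return random.choice(["N", "S", "E", "W"])

env, agent = Environment(), RandomAgent()
observation, reward, done = 0.0, 0.0, False
while not done:
    action = agent.act(observation, reward)        # A_t from O_t, R_t
    observation, reward, done = env.step(action)   # O_{t+1}, R_{t+1}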

History and State

  • The history $H_t$ is the sequence of observations, actions, and rewards seen up to time $t$.

  • The state is a summary of the history that the agent uses to decide what to do next: $S_t = f(H_t)$

  • Types of state: the Environment State, $S_t^e$. The agent may not have access to this, since it may only see $O$ and $R$.

  • The Agent State, $S_t^a$: what the agent knows; it is some function of the history.

  • Useful property for State: Markov Property. $P[S_{t+1} | S_1,...,S_t] = P[S_{t+1} | S_t]$

  • We prefer to choose states that have the Markov property. A small sketch of building the state from the history follows below.
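
The relationship $S_t = f(H_t)$ can be made concrete with a small sketch. The choice of $f$ below (keeping only the latest observation) is just one illustrative assumption; any function of the history that makes the resulting state Markov would do.

In [ ]:
# History H_t as a list of (observation, reward, action) tuples (made-up data).
history = [("o1", 0.0, "N"), ("o2", -1.0, "E"), ("o3", 0.0, "S")]

def agent_state(history):
    # S_t = f(H_t): here f keeps only the most recent observation,
    # assuming that the latest observation summarizes the past.
    return history[-1][0] if history else None

print(agent_state(history))   # 'o3'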

Fully Observable Environments

$$O_t = S_t^a = S_t^e$$

  • Agent State equals Environment State equals Information State.

  • This setting is a Markov Decision Process (MDP); a tiny made-up MDP is sketched after this list.

  • Partially observable environments: when driving, we cannot see everything around us; a poker-playing agent sees only some of the cards.
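
To make the MDP idea concrete, here is a tiny MDP written out as explicit transition probabilities and expected rewards. The states, actions, and numbers are made up for illustration and are not from the lecture.

In [ ]:
states = ["s0", "s1"]
actions = ["stay", "go"]

# P[(s, a)] maps each next state to its probability, i.e., P[s' | s, a].
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}

print(P[("s0", "go")], R[("s0", "go")])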

RL Agent Components

  • Policy: agent's behavior function. Map from State to Action.

$$ A_t = \pi(S_t) $$

  • The policy is a function of the current State only, because of the Markov property. It can be deterministic or stochastic (choosing among actions with specified probabilities); see the sketch at the end of this section.

  • Value function: how good each state (or action) is; a prediction of future reward. It is used to evaluate the goodness or badness of states so as to select between actions, and it specifies what is good in the long run.

$$ v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s] $$

  • The discount factor $\gamma < 1$ ensures that the sum converges.

  • Model: the agent's representation of the environment; it predicts what the environment will do next. If the agent does not know the model, this is "model-free" RL. If the agent figures out what the environment is doing, it can build a model and interact with that model of the real world; this is "model-based" RL.

  • The model gives the probabilistic next state and reward.
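
The sketch below pulls these three components together in Python: a deterministic and a stochastic policy as lookup tables, the discounted return from the value-function definition above, and a model as a table of predicted next states and rewards. The state and action names, the tables, and $\gamma = 0.9$ are illustrative assumptions, not values from the lecture.

In [ ]:
import random

gamma = 0.9   # discount factor (illustrative value)

# Deterministic policy: a table mapping state -> action.
policy = {"s0": "E", "s1": "N"}

# Stochastic policy: state -> probability distribution over actions.
stochastic_policy = {"s0": {"E": 0.8, "W": 0.2}}

def sample_action(state):
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

def discounted_return(rewards, gamma=gamma):
    # R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Model: predicted (next state, reward) for each (state, action) pair.
model = {("s0", "E"): ("s1", -1.0)}

print(sample_action("s0"))
print(discounted_return([1.0, 1.0, 1.0]))   # 1 + 0.9 + 0.81 = 2.71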

Central Problems of RL

  • Computation of Value Function

  • Computation of the Policy Function

  • Challenges: Model not known, Large state spaces, Continuous state spaces.

Example of the Maze:

  • Reward: $-1$ per time step.
  • Actions: N, S, E, W.
  • State: the agent's location.
  • Policy: an action for each state.
  • Value function: the (negative) number of steps needed to reach the goal, so each cell in the maze is labeled with a negative number equal to the steps remaining (see the sketch below).
  • Model: the agent's internal map of the maze, containing only the states it has visited and the immediate rewards it observed there.
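
The value function for such a maze can be computed with a short value-iteration sketch. The maze layout below is a made-up example (not the one from the lecture), with reward $-1$ per step and no discounting; walls are '#' and the goal is 'G'.

In [ ]:
maze = ["S..#",
        ".#.G",
        "...."]
rows, cols = len(maze), len(maze[0])
goal = next((r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "G")

def neighbors(r, c):
    # Reachable cells via N, S, W, E moves (walls '#' are blocked).
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != "#":
            yield nr, nc

# Value iteration: V(s) = max_a [ -1 + V(s') ], with V(goal) = 0.
V = {(r, c): 0.0 for r in range(rows) for c in range(cols) if maze[r][c] != "#"}
for _ in range(rows * cols):
    for s in V:
        if s != goal:
            V[s] = max(-1.0 + V[n] for n in neighbors(*s))

# Each free cell ends up with minus the number of steps remaining to the goal.
for r in range(rows):
    print(" ".join(f"{V[(r, c)]:4.0f}" if maze[r][c] != "#" else "   #"
                   for c in range(cols)))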

Categorizing RL Agents

  • Value-based: learn the value function; the policy is implicit.

  • Policy-based: learn the policy function directly; no value function.

  • Actor-Critic: learn both a policy and a value function.

Also:

  • Model-Free: find the best policy or value function without trying to figure out the model. No model is built.

  • Model-Based: find the best policy or value function by first having the agent figure out a model of the environment.

RL Agent Taxonomy

In [3]:
nb_setup.images_hconcat(["RL_images/RL_Agent_Taxonomy.png"], width=400)
Out[3]: [Figure: RL Agent Taxonomy]

Sub Problems in RL

  • Learning and Planning: the agent can learn by RL (model-free) or by planning (model-based). Planning can use a tree-based algorithm.

  • Exploration and Exploitation: usually an issue in model-free systems. The agent must discover a good policy by exploring, without losing too much reward, and then exploit what it has learned (see the epsilon-greedy sketch after this list). Examples: (i) restaurant selection; (ii) online banner ads; (iii) oil drilling; (iv) game playing.

  • Prediction and Control: prediction means evaluating the future for a given policy, $v_{\pi}(s)$; control means optimizing the future, $v_*(s)$.
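
A standard way to trade off exploration and exploitation is an epsilon-greedy rule, sketched below. The action set, the value table Q, and $\epsilon = 0.1$ are illustrative assumptions, not part of the lecture.

In [ ]:
import random

actions = ["N", "S", "E", "W"]
Q = {a: 0.0 for a in actions}   # estimated value of each action
epsilon = 0.1

def epsilon_greedy(Q, epsilon=epsilon):
    if random.random() < epsilon:
        return random.choice(actions)   # explore: try a random action
    return max(Q, key=Q.get)            # exploit: take the best-looking action

print(epsilon_greedy(Q))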

Deep RL

  • Needed for much larger state spaces. We cannot enumerate the entire state space in a table, so we use neural networks to approximate the value function over the state space instead.

  • This is different from tabular RL.

  • Represent the value function by a neural network (see the sketch below).
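
Below is a minimal sketch of representing the value function by a neural network, written with PyTorch purely for illustration (the lecture does not prescribe a framework). The state dimension, layer sizes, learning rate, and the dummy training data are all assumptions.

In [ ]:
import torch
import torch.nn as nn

# v(s) approximated by a small network: state features -> scalar value.
value_net = nn.Sequential(
    nn.Linear(4, 64),   # 4 state features (assumed) -> hidden layer
    nn.ReLU(),
    nn.Linear(64, 1),   # hidden layer -> scalar value estimate
)

optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def update(state, target_return):
    # Regress the network's value estimate toward a sampled/bootstrapped return.
    value = value_net(state)
    loss = nn.functional.mse_loss(value, target_return)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy data standing in for real transitions.
state = torch.randn(1, 4)
target = torch.tensor([[2.71]])
print(update(state, target))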

Reading

DeepMind Open Sources RL