Reinforcement Learning (Class Notes, 2019)

Professor Subir Varma

Scribe notes by: Professor Sanjiv Das

1_Introduction2RL

(NB HTML) | Characteristics of RL | Rewards | Sequential Decision Making | Environment: Action, Observation, Reward | History and State | Fully Observable Environments | RL Agent Components | Central Problems of RL | Categorizing RL Agents | RL Agent Taxonomy | Sub Problems in RL | Deep RL | Reading | DeepMind Open Sources RL

2_Markov_Decision_Processes

(NB HTML) | Overview, Model-Free vs Model-Based RL | Markov Property | MDP | MDP Tree Representation | Policies | MDP Rewards | MDP Return | State Value Function | Action Value Function | Bellman Expectation Equation (BEE) | Function $q_{\pi}(s,a)$ | BEE for State Value Function $v_{\pi}(s)$ | BEE for the Action Value Function $q_{\pi}(s,a)$ | Example | BEE in Matrix Form | Bellman Optimality Equation (BOE) | Optimal Action Value Function | Optimal Policy | BOE for $v_*$ | BOE for $q_*$ | BOE Final Expressions | Solving the BOE | Readings and References
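
As a quick preview of the chapter's central equations, the Bellman Expectation Equation and Bellman Optimality Equation for the state value function take the standard form below (written with reward function $R(s,a)$ and transition probabilities $P(s'|s,a)$, the usual MDP notation):

$$
v_{\pi}(s) = \sum_{a} \pi(a|s)\Big[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, v_{\pi}(s') \Big],
\qquad
v_{*}(s) = \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, v_{*}(s') \Big].
$$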

3_Model_Based_Control

(NB HTML) | Principle of Optimality | Finding the Optimal Policy | Deficiencies of Standard Dynamic Programming | Policy Evaluation (Prediction) | Policy Iteration (Optimization) | Proof of Convergence | Value Iteration | Asynchronous Dynamic Programming | References and Reading
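
Value Iteration, listed above, repeatedly applies the Bellman optimality backup until the values converge; a standard statement of the update (same notation as the MDP chapter) is:

$$
v_{k+1}(s) = \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, v_{k}(s') \Big].
$$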

4_Model_Free_Prediction

(NB HTML) | Monte Carlo Backup | First Visit and Every Visit MC Evaluation | Incremental Mean Update | Temporal Difference (TD) Reinforcement Learning | TD(0) | MC vs TD | Bias vs Variance of TD and MC | Example | Bootstrapping and Sampling | Unified View of RL | References and Readings
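
The two prediction updates compared in this chapter can be summarized in their usual incremental form, where $G_t$ is the full sampled return and the TD(0) target bootstraps from the current estimate:

$$
\text{MC:}\quad V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big),
\qquad
\text{TD(0):}\quad V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big).
$$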

5_Model_Free_Control

(NB HTML) | Uses of Model-Free Control | Monte Carlo Control | Exploring Starts ($\epsilon$-greedy actions) | On-Policy TD Control (SARSA algorithm) | On-Policy vs Off-Policy | TD >> MC Control | Off-Policy TD Control: Q-Learning | Uses of Off-Policy Learning | Summary: TD vs MC, On vs Off Policy | References and Readings
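
The off-policy Q-Learning update listed above, $Q(S,A) \leftarrow Q(S,A) + \alpha\big(R + \gamma \max_{a'} Q(S',a') - Q(S,A)\big)$, can be sketched in tabular form as follows; the environment interface (`env.n_actions`, `env.reset()`, `env.step()`) is a hypothetical stand-in, not something defined in the notes:

```python
import numpy as np
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch with an epsilon-greedy behaviour policy.

    Assumes a hypothetical discrete-action environment exposing
    env.n_actions, env.reset() -> state, and
    env.step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(lambda: np.zeros(env.n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (exploration vs exploitation)
            if np.random.rand() < epsilon:
                a = np.random.randint(env.n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy TD target: bootstrap from the greedy (max) action value
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```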

6_Function_Approximations

(NB HTML) | Tabular RL | Neural Nets for RL | Deep RL | Two Deep RL Approaches | Deep Q Nets (DQN) | Policy Gradient Methods (PGM) | Functional Choices from NNs | Reward Maximization | Logistic Regression | Implementation Details | Weight Update Formulas | Reading and References
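
A representative example of the weight update formulas listed above: with a parameterized value estimate $\hat{v}(s, \mathbf{w})$ (e.g., a neural net), stochastic gradient descent on the squared error against a target $G_t$ gives

$$
\mathbf{w} \leftarrow \mathbf{w} + \alpha\big(G_t - \hat{v}(S_t, \mathbf{w})\big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}),
$$

with $G_t$ the Monte Carlo return in the MC case and a bootstrapped target in the TD case.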

7_DeepRL_Value_Q_Function_Approximation

(NB HTML) | Types of Value Function Approximation | Approximating Value Functions (Prediction) | MC Learning with NNs | TD Learning with NNs | Convergence | Approximating Value Functions (Control) | MC NN Control | TD NN Control (On Policy SARSA) | TD NN Control (Off Policy Q-Learning) | Stability and Convergence | Deep Q Networks, Experience Replay | Deep Q Networks, Target Network | Atari Game | Continuous Action Spaces | Deterministic Policy Gradients (DPG) | Actor-Critic | References and Reading
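
The DQN loss that ties together Experience Replay and the Target Network can be written in its standard form, where $\mathcal{D}$ is the replay buffer and $\mathbf{w}^{-}$ are the periodically frozen target-network weights:

$$
L(\mathbf{w}) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s',a';\mathbf{w}^{-}) - Q(s,a;\mathbf{w})\big)^{2}\Big].
$$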

8_Policy_Gradient_Methods

(NB HTML) | Playing Pong | Reinforce: Monte Carlo Policy Gradients | Reward Function for a Policy | Reward Gradient | Log Probability Gradient | Log Policy Gradient | Gradient Estimation | Variance Reduction | Discounting | Reward-to-go | Baselines | Actor-Critic Algorithms | Advantage Function | References and Readings
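
The policy gradient at the heart of this chapter, with reward-to-go $G_t$ and a baseline $b(S_t)$ for variance reduction, is commonly written as

$$
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\Big[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t)\,\big(G_t - b(S_t)\big)\Big];
$$

replacing $G_t - b(S_t)$ with an estimated advantage $A(S_t, A_t)$ gives the actor-critic form.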

9_Model_Based_Planning

(NB HTML) | What is a Model? | Model Learning | Sample-Based Planning | Types of Planning Algorithms | Background Planning | Decision Time Planning | Monte Carlo Tree Search (MCTS) | Reducing Action Candidates using Policy Gradients | Readings and References
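
A standard ingredient of the MCTS material referenced above is the UCT rule used to select actions during tree traversal, balancing the current action-value estimate against an exploration bonus based on visit counts $N(s)$ and $N(s,a)$:

$$
a^{*} = \arg\max_{a}\Big[ Q(s,a) + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}}\, \Big].
$$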