Reinforcement Learning

Instructor: Prof. Balaraman Ravindran, Department of Computer Science and Engineering, IIT Madras. Reinforcement learning is a paradigm that models the trial-and-error learning needed in problem situations where explicit instructive signals are not available. It has roots in operations research, behavioral psychology, and AI. The course introduces the basic mathematical foundations of reinforcement learning and highlights some recent directions of research. (from nptel.ac.in)

Lecture 07 - Introduction to Immediate RL

In this lecture we discuss a sub-problem of reinforcement learning known as immediate RL, or multi-armed bandits. We cover the exploration-exploitation dilemma in the context of bandit problems and discuss practical sampling issues that are useful when simulating bandit problems.
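As a rough illustration of the exploration-exploitation trade-off and of sampling rewards in a simulated bandit, the sketch below runs an epsilon-greedy agent on a Gaussian bandit. It is not taken from the lecture: the function name `run_epsilon_greedy`, the unit-variance Gaussian rewards, the arm means, and the default values for `epsilon` and `steps` are all assumptions chosen for the example.

```python
import random


def run_epsilon_greedy(true_means, steps=2000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Gaussian bandit (illustrative only).

    true_means: hypothetical mean reward of each arm (unknown to the agent).
    Returns the per-arm pull counts and the per-arm reward estimates.
    """
    rng = random.Random(seed)          # local generator so runs are repeatable
    n_arms = len(true_means)
    counts = [0] * n_arms              # number of times each arm was pulled
    estimates = [0.0] * n_arms         # running sample mean of each arm's reward

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate
            # (ties go to the lowest index).
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Sample a noisy reward; unit standard deviation is an assumption.
        reward = rng.gauss(true_means[arm], 1.0)

        # Incremental update of the sample mean for the pulled arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return counts, estimates


if __name__ == "__main__":
    counts, estimates = run_epsilon_greedy([0.0, 0.5, 2.0])
    print("pull counts:", counts)
    print("estimated means:", [round(q, 2) for q in estimates])
```

Setting `epsilon` to 0 makes the agent purely greedy, and setting it to 1 makes it purely exploratory; comparing the pull counts across those settings is one simple way to see the dilemma in a simulation.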

Preparatory Material

Lecture 01 - Probability Basics 1

Lecture 02 - Probability Basics 2

Lecture 03 - Linear Algebra 1

Lecture 04 - Linear Algebra 2

Introduction to RL and Immediate RL

Lecture 05 - Introduction to RL

Lecture 06 - RL Framework and Applications

Lecture 07 - Introduction to Immediate RL

Lecture 08 - Bandit Optimalities

Lecture 09 - Value Function based Methods

Bandit Algorithms

Lecture 10 - Upper Confidence Bound 1 (UCB 1)

Lecture 11 - Concentration Bounds

Lecture 12 - UCB 1 Theorem

Lecture 13 - Probably Approximately Correct (PAC) Bounds

Lecture 14 - Median Elimination

Lecture 15 - Thompson Sampling

Policy Gradient Methods and Introduction to Full RL

Lecture 16 - Policy Search

Lecture 17 - REINFORCE

Lecture 18 - Contextual Bandits

Lecture 19 - Full RL Introduction

Lecture 20 - Returns, Value Functions and Markov Decision Processes (MDPs)

MDP Formulation, Bellman Equations and Optimality Proofs

Lecture 21 - MDP Modelling

Lecture 22 - Bellman Equation

Lecture 23 - Bellman Optimality Equation

Lecture 24 - Cauchy Sequence and Green's Equation

Lecture 25 - Banach Fixed Point Theorem

Lecture 26 - Convergence Proof

Dynamic Programming and Monte Carlo Methods

Lecture 27 - Lpi Convergence

Lecture 28 - Value Iteration

Lecture 29 - Policy Iteration

Lecture 30 - Dynamic Programming

Lecture 31 - Monte Carlo

Lecture 32 - Control in Monte Carlo

Monte Carlo and Temporal Difference Methods

Lecture 33 - Off Policy MC

Lecture 34 - UCT (Upper Confidence Bound 1 applied to Trees)

Lecture 35 - TD (0)

Lecture 36 - TD (0) Control

Lecture 37 - Q-Learning

Lecture 38 - Afterstate

Eligibility Traces

Lecture 39 - Eligibility Traces

Lecture 40 - Backward View of Eligibility Traces

Lecture 41 - Eligibility Trace Control

Lecture 42 - Thompson Sampling Recap

Function Approximation

Lecture 43 - Function Approximation

Lecture 44 - Linear Parameterization

Lecture 45 - State Aggregation Methods

Lecture 46 - Function Approximation and Eligibility Traces

Lecture 47 - Least-Squares Temporal Difference (LSTD) and LSTDQ

Lecture 48 - LSPI and Fitted Q

DQN, Fitted Q and Policy Gradient Approaches

Lecture 49 - DQN and Fitted Q-Iteration

Lecture 50 - Policy Gradient Approach

Lecture 51 - Actor Critic and REINFORCE

Lecture 52 - REINFORCE (cont.)

Lecture 53 - Policy Gradient with Function Approximation

Hierarchical Reinforcement Learning

Lecture 54 - Hierarchical Reinforcement Learning

Lecture 55 - Types of Optimality

Lecture 56 - Semi Markov Decision Processes

Lecture 57 - Options

Lecture 58 - Learning with Options

Lecture 59 - Hierarchical Abstract Machines

Hierarchical RL: MAXQ

Lecture 60 - MAXQ

Lecture 61 - MAXQ Value Function Decomposition

Lecture 62 - Option Discovery

POMDPs (Partially Observable Markov Decision Processes)

Lecture 63 - POMDP Introduction

Lecture 64 - Solving POMDP