
CSE510 Deep Reinforcement Learning (Lecture 2)

Introduction and Markov Decision Processes (MDPs)

What is reinforcement learning (RL)

  • A general computational framework for behavior learning through reinforcement/trial and error
  • Deep RL: combining deep learning with RL for complex problems
  • Showing a promise for artificial general intelligence (AGI)

What RL can do now

Backgammon

Neuro-Gammon

Developed by Gerald Tesauro in 1989 at IBM Research.

Trained to mimic expert demonstrations using supervised learning.

Achieved the playing strength of an intermediate human player.

TD-Gammon (Temporal Difference Learning)

Developed by Gerald Tesauro in 1992 at IBM Research.

A neural network that trains itself to be an evaluation function by playing against itself starting from random weights.

Achieved performance close to top human players of its time.
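
TD-Gammon's core mechanism is temporal-difference learning: the evaluation of the current position is nudged toward the evaluation of the next position (or the final game outcome). Below is a minimal tabular TD(0) sketch of that update; TD-Gammon replaces the table with a neural network trained on positions generated by self-play, and the names here are purely illustrative, not Tesauro's code.

```python
# Minimal tabular TD(0) value update (illustrative sketch).
# TD-Gammon uses a neural network V_theta instead of a table and generates
# (state, reward, next_state) transitions by playing against itself.
from collections import defaultdict

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, terminal=False):
    """Move V(state) toward the one-step TD target r + gamma * V(next_state)."""
    target = reward if terminal else reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])

V = defaultdict(float)  # value estimates, default 0.0
# Hypothetical final transition of a self-play game that the agent wins (+1):
td0_update(V, state="final_position", reward=1.0, next_state=None, terminal=True)
```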

DeepMind Atari

Uses deep Q-learning to play Atari games.

Without human demonstrations, it learns to play many of the games at a superhuman level.
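
Deep Q-learning trains a network Q(s, a) toward one-step bootstrapped targets computed from stored transitions. The sketch below shows only the target computation, with made-up values; details such as the replay buffer and target network are omitted, and this is not DeepMind's implementation.

```python
import numpy as np

def q_learning_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    return reward if done else reward + gamma * float(np.max(next_q_values))

# Hypothetical Q-values for the 4 joystick actions available in the next frame:
next_q = np.array([0.2, 1.3, -0.5, 0.7])
target = q_learning_target(reward=1.0, next_q_values=next_q)  # 1 + 0.99 * 1.3
# The network's prediction Q(s, a) is then regressed toward `target`
# with a squared-error loss and stochastic gradient descent.
```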

AlphaGo

Combines Monte Carlo Tree Search with learned policy and value networks that prune the search tree, trained from expert demonstrations and self-play on Google's TPUs.

Video Games

OpenAI Five for Dota 2

Won a best-of-three series of 5v5 games against top human players.

Deepmind AlphaStar for StarCraft

Supervised training followed by league-based competition training.

AlphaTensor

Discovers faster matrix multiplication algorithms with reinforcement learning.

For 5×5 matrix multiplication, AlphaTensor found an algorithm using 76 multiplications vs. Strassen's 80.

Training LLMs

For verifiable tasks (coding, math, etc.), RL can be used to train a model to perform the task without human supervision.
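
What makes these tasks amenable to RL is that the reward can be computed automatically by a checker rather than a human rater. A hypothetical reward function of this kind is sketched below; the exact-match check is an illustrative assumption, and for code the analogous check would run a test suite instead.

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0.
    Purely illustrative: real pipelines use richer checkers (unit tests, provers, etc.)."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# Hypothetical math task: the checker needs only the ground-truth answer, not a human rater.
reward = verifiable_reward(model_answer="42", reference_answer="42")  # -> 1.0
```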

Robotics

Unitree Go, Atlas by Boston Dynamics, etc.

What are the challenges of RL in real world applications?

Beating the human champion is “easier” than physically placing the Go stones.

Known vs. unknown environments

Known environments (known entities and dynamics) vs. unknown environments (unknown entities and dynamics).

Need for behaviors to transfer/generalize across environmental variations since the real world is very diverse.

State estimation

To be able to act, you first need to be able to see: detect the objects you interact with and detect whether you have achieved the goal.

Most work falls between two extremes:

  • Assume the world model is known (object locations, shapes, and physical properties obtained via AR tags or manual tuning) and use planners to search for the action sequence that achieves a desired goal.

  • Do not attempt to detect any objects, and instead learn to map RGB images directly to actions.

Behavior learning is challenging because state estimation is challenging; in other words, because computer vision/perception is challenging.

Interesting direction: leveraging DRL and vision-language models

Efficiency

Cheap vs. expensive to get experience samples

DRL Sample Efficiency

Humans after 15 minutes of play tend to outperform DDQN after 115 hours of play.

Reinforcement Learning in Human

Humans appear to learn to act (e.g., to walk) from “very few examples” of trial and error. How they do so is an open question…

Possible answers:

  • Hardware: 230 million years of bipedal movement data
  • Imitation: Observation of other humans walking (imitation learning, episodic memory, and semantic memory)
  • Algorithms: Better than backpropagation and stochastic gradient descent

Discrete and continuous action spaces

Computation is discrete, but the real action space is continuous.

One-goal vs. Multi-goal

Life is a multi-goal problem, involving infinitely many possible games.

Automatically detecting rewards

Our curiosity is a reward.

And more

  • Transfer learning
  • Generalization
  • Long horizon reasoning
  • Model-based RL
  • Sparse rewards
  • Reward design/learning
  • Planning/Learning
  • Lifelong learning
  • Safety
  • Interpretability
  • etc.

What is the course about?

To teach you RL models and algorithms.

  • To be able to tackle real-world problems.

To excite you about RL.

  • To provide a primer for you to launch advanced studies.

Schedule:

  • RL Model and basic algorithms
    • Markov Decision Process (MDP)
    • Passive RL: ADP and TD-learning
    • Active RL: Q-Learning and SARSA
  • Deep RL algorithms
    • Value-Based methods
    • Policy Gradient Methods
    • Model-Based methods
  • Advanced Topics
    • Offline RL, Multi-Agent RL, etc.

Reinforcement Learning Algorithms

Model-Based

  • Learn a model of the world, then plan using the model (a toy sketch follows this list)
  • Update the model often
  • Re-plan often
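
A toy sketch of this loop, assuming a small discrete environment; the counting model and names are illustrative, not a specific algorithm from the course.

```python
from collections import defaultdict

# Toy model-based loop: estimate the transition model from experience,
# then "plan" with greedy one-step lookahead over the estimated model.
counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = visit count
rewards = defaultdict(float)                    # running mean reward for (s, a)

def update_model(s, a, r, s_next):
    counts[(s, a)][s_next] += 1
    n = sum(counts[(s, a)].values())
    rewards[(s, a)] += (r - rewards[(s, a)]) / n  # incremental mean

def plan(s, actions, V, gamma=0.9):
    """Pick the action with the best model-predicted one-step return."""
    def q(a):
        n = sum(counts[(s, a)].values()) or 1
        expected_next = sum(c / n * V.get(s2, 0.0) for s2, c in counts[(s, a)].items())
        return rewards[(s, a)] + gamma * expected_next
    return max(actions, key=q)

# Hypothetical usage: one observed transition, then re-plan from state "s0".
update_model("s0", "a0", 1.0, "s1")
best = plan("s0", actions=["a0", "a1"], V={"s1": 0.5})  # -> "a0"
```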

Value-Based

  • Learn the state or state-action value function
  • Act by choosing the best action in each state
  • Exploration is a necessary add-on (e.g., ε-greedy; see the sketch below)
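
A minimal sketch of acting with a learned value function, with ε-greedy exploration bolted on (illustrative names only):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Exploit the learned Q-values most of the time; explore uniformly with prob. epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)                          # exploration add-on
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # greedy exploitation

# Hypothetical Q-table with two actions in state 0:
a = epsilon_greedy({(0, "left"): 0.4, (0, "right"): 0.9}, state=0, actions=["left", "right"])
```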

Policy-Based

  • Learn a stochastic policy that maps states to actions
  • Act by sampling from the policy (see the sketch below)
  • Exploration is baked in
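
By contrast, a policy-based agent samples its action from the learned distribution, so exploration comes from the policy's own stochasticity. A sketch using a softmax over action preferences (illustrative only):

```python
import math
import random

def sample_action(preferences):
    """Sample an action index from a softmax over learned action preferences (logits)."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]

action = sample_action([1.2, 0.3, -0.8])  # exploration is baked into the sampling
```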

From better to worse sample efficiency:

  • Model-Based
  • Off-policy/Q-learning
  • Actor-critic
  • On-policy/Policy gradient
  • Evolutionary/Gradient-free

What is RL?

RL model: Markov Decision Process (MDP)
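
As a preview of the next lectures, the standard MDP formalization (notation may differ slightly from the course's):

```latex
% An MDP is a tuple (S, A, P, R, \gamma):
%   S: states, A: actions,
%   P(s' \mid s, a): transition probabilities,
%   R(s, a): reward function, \gamma \in [0, 1): discount factor.
% The agent seeks a policy \pi(a \mid s) that maximizes expected discounted return:
\[
  \pi^{*} = \arg\max_{\pi} \;
  \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t) \right].
\]
```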
