CSE510 Deep Reinforcement Learning (Lecture 2)
Introduction and Markov Decision Processes (MDPs)
What is reinforcement learning (RL)?
- A general computational framework for behavior learning through reinforcement/trial and error
- Deep RL: combining deep learning with RL for complex problems
- Showing promise for artificial general intelligence (AGI)
What can RL do now?
Backgammon
Neuro-Gammon
Developed by Gerald Tesauro in 1989 at IBM’s research center.
Trained to mimic expert demonstrations using supervised learning.
Reached the level of an intermediate human player.
TD-Gammon (Temporal Difference Learning)
Developed by Gerald Tesauro in 1992 at IBM’s research center.
A neural network that trains itself to be an evaluation function by playing against itself starting from random weights.
Achieved performance close to top human players of its time.
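The heart of this self-training is temporal difference learning: nudge the value of the current position toward the value of the next one. Below is a minimal sketch of a TD(0) update using a hypothetical linear value function (TD-Gammon used a neural network, but the update rule is the same idea; all names here are illustrative):

```python
import numpy as np

# Hypothetical linear value function V(s) = w . features(s).
def td0_update(w, features_s, features_s_next, reward, alpha=0.1, gamma=1.0):
    """One TD(0) step: move V(s) toward reward + gamma * V(s')."""
    v_s = w @ features_s
    v_s_next = w @ features_s_next
    td_error = reward + gamma * v_s_next - v_s
    # For a linear V, the gradient w.r.t. w is just features_s.
    return w + alpha * td_error * features_s
```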
DeepMind Atari
Used deep Q-learning (DQN) to play Atari games.
Without human demonstrations, it learned to play many of the games at a superhuman level.
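For reference, here is the underlying Q-learning update in tabular form; DQN regresses a convolutional network onto the same target, adding a replay buffer and a target network for stability. This is a simplified sketch, not DeepMind’s implementation:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step; Q maps (state, action) pairs to values.

    DQN fits a neural network to the same target,
    r + gamma * max_a' Q(s', a'), instead of updating a lookup table.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```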
AlphaGo
Combined Monte Carlo Tree Search with learned policy and value networks for pruning the search tree, plus expert demonstrations, self-play, and Google’s TPUs.
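One way to see how the learned networks prune the search: at each tree node, a PUCT-style rule trades off the value estimates against the policy network’s priors, so moves the policy considers unpromising are rarely explored. A simplified sketch with illustrative names, not AlphaGo’s exact implementation:

```python
import math

def select_action(N, Q, P, actions, c_puct=1.0):
    """PUCT-style selection: exploit high-value moves, but explore moves
    the policy prior P considers promising and that are rarely visited.

    N: visit counts, Q: mean values, P: policy-network priors
    (all dicts keyed by action).
    """
    total_visits = sum(N[a] for a in actions)
    def score(a):
        u = c_puct * P[a] * math.sqrt(total_visits) / (1 + N[a])
        return Q[a] + u
    return max(actions, key=score)
```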
Video Games
OpenAI Five for Dota 2
won a 5v5 best-of-three match against top human players.
DeepMind AlphaStar for StarCraft II
supervised training followed by league-based competition training.
AlphaTensor
discovering faster matrix multiplication algorithms with reinforcement learning.
AlphaTensor: 76 multiplications vs. the previous Strassen-based best of 80 for 5x5 matrix multiplication.
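The quantity being minimized is the number of scalar multiplications. Strassen’s classic construction multiplies 2x2 matrices with 7 multiplications instead of 8, and AlphaTensor searches for analogous decompositions at larger sizes. A quick numerical check of Strassen’s identities:

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply 2x2 matrices with 7 scalar multiplications (Strassen, 1969)."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)
```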
Training LLMs
For verifiable tasks (coding, math, etc.), RL can train a model to perform the task without human supervision, since correctness can be checked automatically.
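A hypothetical sketch of why “verifiable” matters: the reward comes from an automatic checker rather than a human grader, so it scales with compute instead of labeling effort. The task format below is made up for illustration:

```python
def math_reward(model_answer: str, ground_truth: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the final numeric answer
    matches the ground truth, 0.0 otherwise. No human grades responses."""
    try:
        return float(float(model_answer.strip()) == float(ground_truth))
    except ValueError:
        return 0.0  # unparsable answer earns no reward
```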
Robotics
Unitree Go, Atlas by Boston Dynamics, etc.
What are the challenges of RL in real world applications?
Beating the human champion at Go is “easier” than physically placing the Go stones.
- State estimation
- Known environments (known entities and dynamics) vs. unknown environments (unknown entities and dynamics)
- Need for behaviors to transfer/generalize across environmental variations, since the real world is very diverse
State estimation
To be able to act, you first need to be able to see: detect the objects you interact with, and detect whether you have achieved the goal.
Most works fall between two extremes:
- Assume the world model is known (object locations, shapes, and physical properties obtained via AR tags or manual tuning) and use planners to search for the action sequence that achieves a desired goal.
- Do not attempt to detect any objects, and instead learn to map RGB images directly to actions.
Behavior learning is challenging because state estimation is challenging; in other words, because computer vision/perception is challenging.
Interesting direction: leveraging DRL and vision-language models
Efficiency
Cheap vs. expensive to obtain experience samples
DRL Sample Efficiency
Humans after 15 minutes of play tend to outperform DDQN after 115 hours of training.
Reinforcement Learning in Humans
Humans appear to learn to act (e.g., walk) through “very few examples” of trial and error. How they do so is an open question…
Possible answers:
- Hardware: 230 million years of bipedal movement data
- Imitation Learning: Observation of other humans walking (e.g., imitation learning, episodic memory and semantic memory)
- Algorithms: Better than backpropagation and stochastic gradient descent
Discrete and continuous action spaces
Computation is discrete, but the real action space is continuous.
One-goal vs. Multi-goal
Life is a multi-goal problem, involving infinitely many possible games.
Given rewards vs. automatically detecting rewards
Our curiosity is a reward.
And more
- Transfer learning
- Generalization
- Long horizon reasoning
- Model-based RL
- Sparse rewards
- Reward design/learning
- Planning/Learning
- Lifelong learning
- Safety
- Interpretability
- etc.
What is the course about?
To teach you RL models and algorithms.
- To be able to tackle real-world problems.
To excite you about RL.
- To provide a primer for you to launch advanced studies.
Schedule:
- RL model and basic algorithms
  - Markov Decision Process (MDP)
  - Passive RL: ADP and TD-learning
  - Active RL: Q-learning and SARSA
- Deep RL algorithms
  - Value-based methods
  - Policy gradient methods
  - Model-based methods
- Advanced topics
  - Offline RL, multi-agent RL, etc.
Reinforcement Learning Algorithms
Model-Based
- Learn the model of the world, then plan using the model
- Update model often
- Re-plan often
Value-Based
- Learn the state or state-action value
- Act by choosing best action in state
- Exploration is a necessary add-on
Policy-based
- Learn the stochastic policy function that maps state to action
- Act by sampling policy
- Exploration is baked in
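The exploration contrast between value-based and policy-based methods can be made concrete: one bolts exploration on via epsilon-greedy, the other explores by sampling from its stochastic policy. A minimal sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)

def act_value_based(q_values, epsilon=0.1):
    """Value-based: greedy w.r.t. Q; exploration is added on (epsilon-greedy)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # random exploratory action
    return int(np.argmax(q_values))              # best action in this state

def act_policy_based(logits):
    """Policy-based: sample from the stochastic policy; exploration is baked in."""
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```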
From better to worse sample efficiency:
- Model-Based
- Off-policy/Q-learning
- Actor-critic
- On-policy/Policy gradient
- Evolutionary/Gradient-free