CSE510 Deep Reinforcement Learning (Lecture 26)
Continued: Real-World Practical Challenges for RL
Factored multi-agent RL
- Sample efficiency -> Shared Learning
- Complexity -> High-Order Factorization
- Partial Observability -> Communication Learning
- Sparse reward -> Coordinated Exploration
Parameter Sharing vs. Diversity
- Parameter sharing is critical for deep MARL methods (a minimal sketch follows below)
- However, agents tend to acquire homogeneous behaviors
- Diversity is essential for exploration and practical tasks
Example benchmark: Google Research Football (see the linked paper)
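A minimal sketch of parameter sharing with identity conditioning, assuming a simple MLP policy and one-hot agent IDs; the names and sizes here are illustrative, not taken from the CDS paper:

```python
import torch
import torch.nn as nn

class SharedPolicy(nn.Module):
    """One network shared by all agents; a one-hot agent ID is appended to each
    observation so the shared weights can still produce agent-specific behavior."""
    def __init__(self, obs_dim, n_agents, n_actions, hidden=64):
        super().__init__()
        self.n_agents = n_agents
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_agents, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, agent_id):
        # obs: (batch, obs_dim); agent_id: (batch,) integer indices
        one_hot = torch.nn.functional.one_hot(agent_id, self.n_agents).float()
        return self.net(torch.cat([obs, one_hot], dim=-1))  # action logits

# All agents share the same parameters; only the ID input differs.
policy = SharedPolicy(obs_dim=8, n_agents=3, n_actions=5)
logits = policy(torch.randn(4, 8), torch.tensor([0, 1, 2, 0]))
```

Sharing one set of weights improves sample efficiency, but as noted above it pushes agents toward homogeneous behavior, which is what CDS counteracts.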
Schematics of Our Approach: Celebrating Diversity in Shared MARL (CDS)
- In representation, CDS allows MARL to adaptively decide when to share learning
- Encouraging Diversity in Optimization
In optimization, CDS maximizes an information-theoretic objective to achieve identity-aware diversity; the objective decomposes into an action-diversity term and an observation-diversity term.
Summary
- MARL plays a critical role in AI, but is still at an early stage
- Value factorization enables scalable MARL
- Linear factorization is sometimes surprisingly effective (see the sketch after this list)
- Non-linear factorization shows promise in offline settings
- Parameter sharing plays an important role for deep MARL
- Diversity and dynamic parameter sharing can be critical for complex cooperative tasks
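A minimal sketch of linear value factorization in the VDN style, where the joint value is the sum of per-agent utilities; network sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class PerAgentQ(nn.Module):
    """Individual utility Q_i(o_i, a_i) for one agent."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)

def vdn_joint_q(agent_nets, observations, actions):
    """Linear (additive) factorization: Q_tot = sum_i Q_i(o_i, a_i).
    Each agent can greedily maximize its own Q_i, so decentralized execution
    stays consistent with maximizing the joint value."""
    q_tot = 0.0
    for net, obs, act in zip(agent_nets, observations, actions):
        q_values = net(obs)                      # (batch, n_actions)
        q_tot = q_tot + q_values.gather(1, act)  # (batch, 1) chosen-action value
    return q_tot

agents = [PerAgentQ(obs_dim=8, n_actions=5) for _ in range(3)]
obs = [torch.randn(4, 8) for _ in range(3)]
acts = [torch.randint(0, 5, (4, 1)) for _ in range(3)]
q_tot = vdn_joint_q(agents, obs, acts)  # trained end-to-end with a TD loss on Q_tot
```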
Challenges and open problems in DRL
Overview of Reinforcement Learning Algorithms
Recall from lecture 2
From more sample-efficient to less sample-efficient:
- Model-based
- Off-policy/Q-learning
- Actor-critic
- On-policy/Policy gradient
- Evolutionary/Gradient-free
Model-Based
- Learn a model of the world, then plan using the model (see the MPC sketch after this list)
- Update model often
- Re-plan often
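A minimal sketch of the plan/act/re-plan loop using random-shooting model predictive control; the dynamics and reward models here are placeholders standing in for learned ones:

```python
import numpy as np

def random_shooting_mpc(model, reward_fn, state, horizon=10, n_candidates=100, action_dim=2):
    """Sample candidate action sequences, roll each out through the learned model,
    and return the first action of the best sequence. Re-planning every step keeps
    model errors from compounding for too long."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s = model(s, a)              # learned dynamics: s' = f_theta(s, a)
            total += reward_fn(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

# Placeholder model/reward just to make the sketch runnable.
model = lambda s, a: s + 0.1 * a           # pretend learned dynamics
reward_fn = lambda s, a: -np.sum(s ** 2)   # drive the state toward the origin
action = random_shooting_mpc(model, reward_fn, state=np.ones(2))
```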
Value-Based
- Learn the state or state-action value
- Act by choosing best action in state
- Exploration is a necessary add-on, e.g. epsilon-greedy (see the sketch after this list)
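A minimal tabular sketch of Q-learning with an epsilon-greedy exploration add-on; the environment interface (reset/step/sample_action) is assumed, not from any specific library:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: act greedily w.r.t. Q, except with probability
    epsilon take a random action (exploration is bolted on, not learned)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = env.sample_action() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD target bootstraps from the best next-state value
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
    return Q
```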
Policy-based
- Learn the stochastic policy function that maps state to action
- Act by sampling from the policy
- Exploration is baked in (see the REINFORCE sketch after this list)
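A minimal sketch of the policy-gradient idea (REINFORCE): act by sampling the stochastic policy, so exploration comes from the policy itself. The network sizes and the environment interface are illustrative:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_episode(env, gamma=0.99):
    """Sample actions from the policy distribution, then increase the
    log-probability of actions in proportion to the return that followed."""
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                      # exploration is baked in
        obs, reward, done = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # discounted return-to-go for each step
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```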
Where are we?
Deep RL has achieved impressive results in games, robotics, control, and decision systems.
But it is still far from a general, reliable, and efficient learning paradigm.
Today: what limits Deep RL, what’s being worked on, and what’s still open.
Outline of challenges
- Offline RL
- Multi-Agent complexity
- Sample efficiency & data reuse
- Stability & reproducibility
- Generalization & distribution shift
- Scalable model-based RL
- Safety
- Theory gaps & evaluation
Sample inefficiency
Model-free deep RL often needs millions or billions of steps
- Humans with 15 minutes of experience tend to outperform DDQN trained on 115 hours of play
- OpenAI Five for Dota 2: ~180 years of game play per day of training
Real-world systems can’t afford this
Root causes: high-variance gradients, weak priors, poor credit assignment.
Open direction for sample efficiency
- Better data reuse: off-policy learning & replay improvements (see the replay-buffer sketch after this list)
- Self-supervised representation learning for control (learning from interacting with the environment)
- Hybrid model-based/model-free approaches
- Transfer & pre-training on large datasets
- Knowledge-driven RL: leveraging pre-trained models
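A minimal sketch of data reuse via an experience replay buffer, the standard off-policy ingredient; class and method names are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so each environment step can be reused
    in many gradient updates, improving sample efficiency."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

# Typical usage: one environment step feeds the buffer, then several update
# steps draw minibatches from it (off-policy reuse of old data).
```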
Knowledge-Driven RL: Motivation
Current LLMs are not good at decision making
Pros: rich knowledge
Cons: auto-regressive decoding lacks long-horizon memory
Reinforcement learning in decision making
Pros: Go beyond human intelligence
Cons: sample inefficiency
Instability & the Deadly Triad
Function approximation + bootstrapping + off-policy learning can diverge (see the sketch below)
Even algorithms designed for stability (e.g., PPO) can be unstable in practice
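A minimal sketch of the three ingredients together: linear function approximation, bootstrapping, and an off-policy correction. Nothing here is tuned to diverge, but this is the update pattern for which divergence examples (e.g., Baird's counterexample) exist; the feature vectors and importance ratio below are made up:

```python
import numpy as np

def off_policy_td0_update(w, phi_s, phi_s_next, reward, rho, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) step.
    - Function approximation: V(s) ~= w . phi(s)
    - Bootstrapping: the target uses the current estimate of V(s')
    - Off-policy: rho is the importance ratio pi(a|s) / behavior(a|s)
    Combining all three is the 'deadly triad' and can make w diverge."""
    td_target = reward + gamma * np.dot(w, phi_s_next)    # bootstrapped target
    td_error = td_target - np.dot(w, phi_s)
    return w + alpha * rho * td_error * phi_s             # semi-gradient step

w = np.zeros(4)
w = off_policy_td0_update(w, phi_s=np.array([1.0, 0.0, 0.0, 1.0]),
                          phi_s_next=np.array([0.0, 1.0, 1.0, 0.0]),
                          reward=1.0, rho=2.0)
```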
Open direction for Stability
Better optimization landscapes + regularization
Calibration/monitoring tools for RL training
Architectures with built-in inductive biases (e.g., equivariance)
Reproducibility & Evaluation
Results often depend on random seeds, codebase, and compute budget
Benchmarks can be overfit; comparisons are often apples-to-oranges
Offline evaluation is especially tricky
Toward Better Evaluation
- Robustness checks and ablations, e.g. multi-seed evaluation (see the sketch after this list)
- Out-of-distribution test suites
- Realistic benchmarks beyond games (e.g., science and healthcare)
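A minimal sketch of one robustness check: evaluating across many seeds and reporting a trimmed (interquartile-style) mean rather than a single best run. The `train_and_eval` function is a hypothetical stand-in for your experiment:

```python
import numpy as np

def trimmed_mean(scores):
    """Drop the lowest and highest n//4 runs and average the middle;
    less sensitive to lucky or unlucky seeds than the best-run score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    k = len(scores) // 4
    return float(scores[k:len(scores) - k].mean())

def evaluate_across_seeds(train_and_eval, seeds=range(10)):
    # train_and_eval(seed) -> final return; assumed to exist in your codebase
    scores = [train_and_eval(seed) for seed in seeds]
    return {"trimmed_mean": trimmed_mean(scores),
            "mean": float(np.mean(scores)),
            "std": float(np.std(scores))}
```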
Generalization & Distribution Shift
Policies overfit to training environments and fail under small changes
Sim-to-real gap, sensor noise, morphology changes, domain drift.
Requires learning invariance and robust decision rules.
Open direction for Generalization
- Domain randomization + system identification (see the sketch after this list)
- Robust / risk-sensitive RL
- Representation learning for invariance
- Meta-RL and fast adaptation
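A minimal sketch of domain randomization: resample simulator parameters every episode so the policy cannot overfit to one environment instance. The parameter names, ranges, and the `make_env` constructor are illustrative:

```python
import numpy as np

def randomize_sim_params(rng):
    """Draw a fresh set of physical parameters for each training episode."""
    return {
        "mass": rng.uniform(0.8, 1.2),          # +/- 20% around the nominal mass
        "friction": rng.uniform(0.5, 1.5),
        "sensor_noise_std": rng.uniform(0.0, 0.05),
        "action_delay_steps": int(rng.integers(0, 3)),
    }

rng = np.random.default_rng(0)
for episode in range(3):
    params = randomize_sim_params(rng)
    # env = make_env(**params)   # hypothetical simulator constructor
    # ...run one training episode in this randomized environment...
    print(params)
```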
Model-based RL: Promise & Pitfalls
- Learned models enable planning and sample efficiency
- But distribution mismatch and model exploitation can break policies
- Long-horizon imagination amplifies errors (short branched rollouts, sketched after this list, are one mitigation)
- Model learning itself is challenging
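A minimal sketch of one common mitigation: keep model rollouts short and branch them from real states (in the spirit of MBPO), so imagination errors cannot compound over long horizons. The model, policy, and buffer interfaces are assumptions:

```python
def generate_model_rollouts(model, policy, real_states, model_buffer, horizon=5):
    """Branch short imagined trajectories from states actually visited in the
    real environment. Limiting `horizon` bounds how far model errors compound."""
    for s in real_states:
        for _ in range(horizon):
            a = policy(s)
            s_next, r = model(s, a)          # learned dynamics + reward prediction
            model_buffer.add(s, a, r, s_next)
            s = s_next                       # stay inside the model only briefly
```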
Safety, alignment, and constraints
Reward mis-specification -> unsafe or unintended behavior
Need to respect constraints: energy, collisions, ethics, regulation
Exploration itself may be unsafe
Open direction for Safety RL
- Constrained RL (Lagrangian methods, control barrier functions, shielding); see the sketch below
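A minimal sketch of the Lagrangian idea for constrained RL: train the policy on a penalized reward while a dual variable is adjusted so the expected cost stays under a budget. The update is schematic and the names are illustrative:

```python
class LagrangeMultiplier:
    """Dual variable lambda >= 0 for a constraint E[cost] <= budget.
    The policy is trained on reward - lambda * cost, while lambda is
    increased whenever the measured cost exceeds the budget."""
    def __init__(self, budget, lr=0.01):
        self.budget = budget
        self.lr = lr
        self.lam = 0.0

    def penalized_reward(self, reward, cost):
        return reward - self.lam * cost

    def update(self, mean_episode_cost):
        # dual ascent: grow lambda if the constraint is violated, shrink otherwise
        self.lam = max(0.0, self.lam + self.lr * (mean_episode_cost - self.budget))

lagrange = LagrangeMultiplier(budget=25.0)
lagrange.update(mean_episode_cost=40.0)   # constraint violated -> lambda increases
```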
Theory Gaps & Evaluation
Deep RL lacks strong general guarantees.
We don’t fully understand when/why it works
Bridging theory and practice remains an open challenge
Promising theory directions
Optimization theory of RL objectives
Generalization and representation learning bounds
Finite-sample analyses
Connection to foundation models
- Pre-training on large-scale experience
- World models as sequence predictors
- RLHF/preference optimization for alignment (see the sketch after this list)
- Open problems: grounding
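A minimal sketch of the preference-optimization ingredient of RLHF: a Bradley-Terry loss that trains a reward model to score a preferred sample above a rejected one. The reward-model architecture and feature inputs here are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def preference_loss(features_chosen, features_rejected):
    """Bradley-Terry preference loss: maximize the probability that the
    chosen sample scores higher than the rejected one."""
    r_chosen = reward_model(features_chosen)      # (batch, 1)
    r_rejected = reward_model(features_rejected)  # (batch, 1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

loss = preference_loss(torch.randn(8, 16), torch.randn(8, 16))
loss.backward()   # the trained reward model then guides RL fine-tuning of the policy
```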
What to expect in the next 3-5 years
Unified model-based offline + safe RL stacks
Large pretrained decision models
Deployment in high-stakes domains