
CSE510 Deep Reinforcement Learning (Lecture 26)

Continuing on Real-World Practical Challenges for RL

Factored multi-agent RL

  • Sample efficiency -> Shared Learning
  • Complexity -> High-Order Factorization
  • Partial Observability -> Communication Learning
  • Sparse reward -> Coordinated Exploration

Parameter Sharing vs. Diversity

  • Parameter Sharing is critical for deep MARL methods
  • However, agents tend to acquire homogenous behaviors
  • Diversity is essential for exploration and practical tasks

Link to paper: Google Research Football

Schematics of Our Approach: Celebrating Diversity in Shared MARL (CDS)

  • In representation, CDS allows agents to adaptively decide when to share learning
  • In optimization, CDS encourages identity-aware diversity

In optimization, CDS maximizes an information-theoretic objective to achieve identity-aware diversity:

$$
\begin{aligned}
I^\pi(\tau_T; id) &= H(\tau_T) - H(\tau_T \mid id) = \mathbb{E}_{id,\,\tau_T\sim\pi}\left[\log \frac{p(\tau_T \mid id)}{p(\tau_T)}\right]\\
&= \mathbb{E}_{id,\,\tau}\left[\log \frac{p(o_0 \mid id)}{p(o_0)} + \sum_{t=0}^{T-1}\left(\log\frac{p(a_t \mid \tau_t, id)}{p(a_t \mid \tau_t)} + \log\frac{p(o_{t+1} \mid \tau_t, a_t, id)}{p(o_{t+1} \mid \tau_t, a_t)}\right)\right]
\end{aligned}
$$

Here $\sum_{t=0}^{T-1}\log\frac{p(a_t \mid \tau_t, id)}{p(a_t \mid \tau_t)}$ represents the action diversity,

and $\log\frac{p(o_{t+1} \mid \tau_t, a_t, id)}{p(o_{t+1} \mid \tau_t, a_t)}$ represents the observation diversity.
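To make the objective concrete, the action-diversity term can be estimated per step and used as an intrinsic reward. Below is a minimal PyTorch sketch under assumed interfaces (logits of an identity-conditioned policy vs. an identity-averaged shared policy); it is illustrative, not the CDS authors' implementation.

```python
import torch
import torch.nn.functional as F

def action_diversity_bonus(logits_agent, logits_shared, actions):
    """Per-step intrinsic reward ~ log p(a_t | tau_t, id) - log p(a_t | tau_t).

    logits_agent:  [B, A] logits of the identity-conditioned policy p(a | tau, id)
    logits_shared: [B, A] logits of the identity-averaged policy p(a | tau)
    actions:       [B]    actions actually taken
    (Names and shapes are assumptions for illustration.)
    """
    logp_id = F.log_softmax(logits_agent, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    logp_shared = F.log_softmax(logits_shared, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    return logp_id - logp_shared  # larger when the agent deviates from the shared behavior

# toy usage: batch of 2, 3 actions (placeholder values)
bonus = action_diversity_bonus(torch.randn(2, 3), torch.randn(2, 3), torch.tensor([0, 2]))
```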

Summary

  • MARL plays a critical role for AI, but is still at an early stage
  • Value factorization enables scalable MARL
    • Linear factorization is sometimes surprisingly effective
    • Non-linear factorization shows promise in offline settings
  • Parameter sharing plays an important role for deep MARL
  • Diversity and dynamic parameter sharing can be critical for complex cooperative tasks

Challenges and open problems in DRL

Overview of Reinforcement Learning Algorithms

Recall from lecture 2

From more sample efficient to less sample efficient:

  • Model-based
  • Off-policy/Q-learning
  • Actor-critic
  • On-policy/Policy gradient
  • Evolutionary/Gradient-free

Model-Based

  • Learn a model of the world, then plan using the model
  • Update model often
  • Re-plan often

Value-Based

  • Learn the state or state-action value
  • Act by choosing best action in state
  • Exploration is a necessary add-on

Policy-based

  • Learn the stochastic policy function that maps state to action
  • Act by sampling policy
  • Exploration is baked in (see the contrast sketch below)
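The practical difference in how the two families act can be seen in a minimal sketch (Q-values, logits, and epsilon are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Value-based: act greedily w.r.t. learned Q-values; exploration must be added (epsilon-greedy).
def act_value_based(q_values, epsilon=0.1):
    if rng.random() < epsilon:              # exploration is a necessary add-on
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))         # otherwise pick the best action in this state

# Policy-based: learn a stochastic policy and act by sampling it; exploration is baked in.
def act_policy_based(logits):
    probs = np.exp(logits - logits.max())   # softmax over action preferences
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

q = np.array([0.1, 0.5, 0.2])
print(act_value_based(q), act_policy_based(q))
```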

Where are we?

Deep RL has achieved impressive results in games, robotics, control, and decision systems.

But it is still far from a general, reliable, and efficient learning paradigm.

Today: what limits Deep RL, what’s being worked on, and what’s still open.

Outline of challenges

  • Offline RL
  • Multi-Agent complexity
  • Sample efficiency & data reuse
  • Stability & reproducibility
  • Generalization & distribution shift
  • Scalable model-based RL
  • Safety
  • Theory gaps & evaluation

Sample inefficiency

Model-free deep RL often needs millions or billions of steps

  • Humans with 15 minutes of experience tend to outperform DDQN trained on 115 hours of gameplay
  • OpenAI Five for Dota 2: 180 years of playing time per day

Real-world systems can’t afford this

Root causes: high-variance gradients, weak priors, poor credit assignment.

Open direction for sample efficiency

  • Better data reuse: off-policy learning & replay improvements (see the sketch after this list)
  • Self-supervised representation learning for control (learning from interacting with the environment)
  • Hybrid model-based/model-free approaches
  • Transfer & pre-training on large datasets
    • Knowledge-driven RL: leveraging pre-trained models
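A minimal sketch of the data-reuse idea behind off-policy replay (capacity and batch size are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store past transitions so each environment step can be reused many times."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(list, zip(*batch))
        return states, actions, rewards, next_states, dones

# toy usage with dummy transitions
buf = ReplayBuffer()
for i in range(100):
    buf.add(i, 0, 1.0, i + 1, False)
states, actions, rewards, next_states, dones = buf.sample(8)
```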

Knowledge-Driven RL: Motivation

Current LLMs are not good at decision making

Pros: rich knowledge

Cons: auto-regressive decoding, lack of long-term memory

Reinforcement learning in decision making

Pros: can go beyond human intelligence

Cons: sample inefficiency

Instability & the Deadly Triad

Function approximation + bootstrapping + off-policy learning can diverge

Even nominally stable algorithms (e.g., PPO) can be unstable
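The divergence is easy to reproduce with a textbook-style two-state example (linear function approximation, bootstrapping, and updates only on one off-policy transition); the constants below are arbitrary:

```python
# V(s1) = 1 * w, V(s2) = 2 * w, deterministic transition s1 -> s2 with reward 0.
# True values are zero, yet repeatedly applying semi-gradient TD(0) only to this
# transition makes w grow without bound whenever 2 * gamma > 1: a minimal deadly triad.
w, alpha, gamma = 1.0, 0.1, 0.99
for step in range(100):
    td_error = 0.0 + gamma * (2 * w) - (1 * w)  # r + gamma * V(s2) - V(s1)
    w += alpha * td_error * 1.0                  # semi-gradient update, feature phi(s1) = 1
    if step % 20 == 0:
        print(step, round(w, 2))                 # w keeps increasing instead of converging to 0
```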

Open direction for Stability

Better optimization landscapes + regularization

Calibration/monitoring tools for RL training

Architectures with built-in inductive biases (e.g., equivariance)
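One widely used stabilizer in this spirit is a slowly moving target network (Polyak averaging), which keeps bootstrapped targets from chasing the online network; a minimal PyTorch sketch with an arbitrary tau:

```python
import torch

@torch.no_grad()
def polyak_update(online_net, target_net, tau=0.005):
    """Move each target parameter a small step toward the corresponding online parameter."""
    for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)

# toy usage with small linear networks
online = torch.nn.Linear(4, 2)
target = torch.nn.Linear(4, 2)
target.load_state_dict(online.state_dict())
polyak_update(online, target)
```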

Reproducibility & Evaluation

Results often depend on random seeds, codebase, and compute budget

Benchmarks can be overfit, and comparisons are often apples-to-oranges

Offline evaluation is especially tricky
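One way to make seed sensitivity visible is to report a bootstrapped confidence interval of the mean return over several seeds instead of a single best run; the returns below are placeholders:

```python
import numpy as np

def bootstrap_ci(returns_per_seed, n_boot=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval of the mean return across seeds."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns_per_seed, dtype=float)
    means = [rng.choice(returns, size=len(returns), replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return returns.mean(), (lo, hi)

# e.g., final returns from 5 seeds of the same algorithm (placeholder numbers)
print(bootstrap_ci([212.0, 305.0, 198.0, 401.0, 250.0]))
```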

Toward Better Evaluation

  • Robustness checks and ablations
  • Out-of-distribution test suites
  • Realistic benchmarks beyond games (e.g., science and healthcare)

Generalization & Distribution Shift

Policies overfit to training environments and fail under small changes

Sim-to-real gap, sensor noise, morphology changes, domain drift.

Requires learning invariance and robust decision rules.

Open direction for Generalization

  • Domain randomization + system identification (see the sketch after this list)
  • Robust / risk-sensitive RL
  • Representation learning for invariance
  • Meta-RL and fast adaptation
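Domain randomization can be as simple as re-sampling simulator parameters at every reset so the policy cannot latch onto one fixed dynamics; the config below is a stand-in for a real simulator's settings:

```python
from dataclasses import dataclass
import numpy as np

rng = np.random.default_rng(0)

@dataclass
class SimConfig:          # stand-in for a real simulator's physics settings (illustrative)
    mass: float
    friction: float
    obs_noise_std: float

def sample_randomized_config():
    """Re-sample physics and noise parameters per episode to force invariant behavior."""
    return SimConfig(
        mass=rng.uniform(0.8, 1.2),            # +/- 20% around a nominal mass of 1.0
        friction=rng.uniform(0.5, 1.5),
        obs_noise_std=rng.uniform(0.0, 0.05),  # sensor noise level
    )

print(sample_randomized_config())
```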

Model-based RL: Promise & Pitfalls

  • Learned models enable planning and sample efficiency
  • But distribution mismatch and model exploitation can break policies
  • Long-horizon imagination amplifies errors (see the short-rollout sketch after this list)
  • Model learning is itself challenging
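A common mitigation is to keep imagined rollouts short and branch them from real states, so model errors have little room to compound (in the spirit of MBPO-style methods). A minimal sketch; `model_step` and `policy` are assumed callables:

```python
def short_model_rollouts(real_states, model_step, policy, horizon=5):
    """Branch short imagined rollouts from real states to limit compounding model error.

    model_step(s, a) -> (next_state, reward)   # learned dynamics model (assumed interface)
    policy(s) -> a                             # current policy (assumed interface)
    """
    imagined = []
    for s in real_states:
        for _ in range(horizon):               # keep the horizon short: error grows with depth
            a = policy(s)
            s_next, r = model_step(s, a)
            imagined.append((s, a, r, s_next))
            s = s_next
    return imagined

# toy usage with stand-in model and policy
rollouts = short_model_rollouts([0.0], lambda s, a: (s + a, 1.0), lambda s: 0.1)
```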

Safety, alignment, and constraints

Reward mis-specification -> unsafe or unintended behavior

Need to respect constraints: energy, collisions, ethics, regulation

Exploration itself may be unsafe

Open direction for Safe RL

  • Constrained RL (Lagrangian methods, CBFs, shielding); see the Lagrangian sketch below
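The Lagrangian variant can be sketched as dual ascent on a multiplier that prices constraint violations; the update below is illustrative (single constraint, scalar multiplier, placeholder rollout statistics):

```python
def lagrangian_objective(reward_return, cost_return, lam, cost_limit):
    """Penalized objective: maximize reward while paying lam per unit of constraint violation."""
    return reward_return - lam * (cost_return - cost_limit)

def update_multiplier(lam, cost_return, cost_limit, lr=0.01):
    """Dual ascent: raise lam when the constraint is violated, decay it otherwise (kept >= 0)."""
    return max(0.0, lam + lr * (cost_return - cost_limit))

lam = 0.0
for epoch in range(3):
    reward_return, cost_return = 100.0, 30.0                 # placeholder rollout statistics
    obj = lagrangian_objective(reward_return, cost_return, lam, cost_limit=25.0)
    lam = update_multiplier(lam, cost_return, cost_limit=25.0)
    print(epoch, round(obj, 2), round(lam, 3))
```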

Theory Gaps & Evaluation

Deep RL lacks strong general guarantees.

We don’t fully understand when/why it works

Bridging theory and practice remains an open problem

Promising theory directions

Optimization theory of RL objectives

Generalization and representation learning bounds

Finite-sample analysis

Connection to foundation models

  • Pre-training on large scale experience
  • World models as sequence predictors
  • RLHF/preference optimization for alignment (see the sketch after this list)
  • Open problems: grounding
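The preference-optimization piece typically reduces to a Bradley-Terry style loss on a learned reward model; a minimal PyTorch sketch with placeholder reward tensors (a real pipeline would score trajectory or response pairs):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss: push the reward of the preferred sample above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy usage with placeholder reward-model outputs for two preference pairs
loss = preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
print(loss.item())
```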

What to expect in the next 3-5 years

Unified model-based offline + safe RL stacks

Large pretrained decision models

Deployment in high-stakes domains
