
CSE510 Deep Reinforcement Learning (Lecture 26)

Continuing on Real-World Practical Challenges for RL

Factored multi-agent RL

  • Sample efficiency -> Shared Learning
  • Complexity -> High-Order Factorization
  • Partial Observability -> Communication Learning
  • Sparse reward -> Coordinated Exploration

Parameter Sharing vs. Diversity

  • Parameter Sharing is critical for deep MARL methods
  • However, agents tend to acquire homogenous behaviors
  • Diversity is essential for exploration and practical tasks

Link to paper: Google Research Football

Schematics of Our Approach: Celebrating Diversity in Shared MARL (CDS)

  • In representation, CDS allows agents to adaptively decide when to share learning
  • In optimization, CDS encourages identity-aware diversity

In optimization, CDS maximizes an information-theoretic objective to achieve identity-aware diversity:

$$
\begin{aligned}
I^\pi(\tau_T; id) &= H(\tau_T) - H(\tau_T \mid id) = \mathbb{E}_{id,\,\tau_T\sim\pi}\left[\log \frac{p(\tau_T \mid id)}{p(\tau_T)}\right]\\
&= \mathbb{E}_{id,\,\tau}\left[\log \frac{p(o_0 \mid id)}{p(o_0)} + \sum_{t=0}^{T-1}\left(\log\frac{p(a_t \mid \tau_t, id)}{p(a_t \mid \tau_t)} + \log\frac{p(o_{t+1} \mid \tau_t, a_t, id)}{p(o_{t+1} \mid \tau_t, a_t)}\right)\right]
\end{aligned}
$$

Here $\sum_{t=0}^{T-1}\log\frac{p(a_t \mid \tau_t, id)}{p(a_t \mid \tau_t)}$ represents the action diversity,

and $\log\frac{p(o_{t+1} \mid \tau_t, a_t, id)}{p(o_{t+1} \mid \tau_t, a_t)}$ represents the observation diversity.
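To make the objective concrete, the action-diversity term can be estimated per step and used as an intrinsic reward. Below is a minimal PyTorch sketch under assumed interfaces (logits of an identity-conditioned policy vs. an identity-averaged shared policy); it is illustrative, not the CDS authors' implementation.

```python
import torch
import torch.nn.functional as F

def action_diversity_bonus(logits_agent, logits_shared, actions):
    """Per-step intrinsic reward ~ log p(a_t | tau_t, id) - log p(a_t | tau_t).

    logits_agent:  [B, A] logits of the identity-conditioned policy p(a | tau, id)
    logits_shared: [B, A] logits of the identity-averaged policy p(a | tau)
    actions:       [B]    actions actually taken
    (Names and shapes are assumptions for illustration.)
    """
    logp_id = F.log_softmax(logits_agent, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    logp_shared = F.log_softmax(logits_shared, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    return logp_id - logp_shared  # larger when the agent deviates from the shared behavior

# toy usage: batch of 2, 3 actions (placeholder values)
bonus = action_diversity_bonus(torch.randn(2, 3), torch.randn(2, 3), torch.tensor([0, 2]))
```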

Summary

  • MARL plays a critical role for AI, but is still at an early stage
  • Value factorization enables scalable MARL
    • Linear factorization is sometimes surprisingly effective
    • Non-linear factorization shows promise in offline settings
  • Parameter sharing plays an important role for deep MARL
  • Diversity and dynamic parameter sharing can be critical for complex cooperative tasks

Challenges and open problems in DRL

Overview of Reinforcement Learning Algorithms

Recall from lecture 2

From more sample efficient to less sample efficient:

  • Model-based
  • Off-policy/Q-learning
  • Actor-critic
  • On-policy/Policy gradient
  • Evolutionary/Gradient-free

Model-Based

  • Learn a model of the world, then plan using the model
  • Update model often
  • Re-plan often

Value-Based

  • Learn the state or state-action value
  • Act by choosing best action in state
  • Exploration is a necessary add-on

Policy-based

  • Learn the stochastic policy function that maps state to action
  • Act by sampling policy
  • Exploration is baked in (see the contrast sketch below)
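The practical difference in how the two families act can be seen in a minimal sketch (Q-values, logits, and epsilon are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Value-based: act greedily w.r.t. learned Q-values; exploration must be added (epsilon-greedy).
def act_value_based(q_values, epsilon=0.1):
    if rng.random() < epsilon:              # exploration is a necessary add-on
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))         # otherwise pick the best action in this state

# Policy-based: learn a stochastic policy and act by sampling it; exploration is baked in.
def act_policy_based(logits):
    probs = np.exp(logits - logits.max())   # softmax over action preferences
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

q = np.array([0.1, 0.5, 0.2])
print(act_value_based(q), act_policy_based(q))
```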

Where are we?

Deep RL has achieved impressive results in games, robotics, control, and decision systems.

But it is still far from a general, reliable, and efficient learning paradigm.

Today: what limits Deep RL, what’s being worked on, and what’s still open.

Outline of challenges

  • Offline RL
  • Multi-Agent complexity
  • Sample efficiency & data reuse
  • Stability & reproducibility
  • Generalization & distribution shift
  • Scalable model-based RL
  • Safety
  • Theory gaps & evaluation

Sample inefficiency

Model-free deep RL often needs millions or billions of steps

  • Humans with 15 minutes of experience tend to outperform DDQN trained on 115 hours of gameplay
  • OpenAI Five for Dota 2: 180 years of playing time per day

Real-world systems can’t afford this

Root causes: high-variance gradients, weak priors, poor credit assignment.

Open direction for sample efficiency

  • Better data reuse: off-policy learning & replay improvements (see the sketch after this list)
  • Self-supervised representation learning for control (learning from interacting with the environment)
  • Hybrid model-based/model-free approaches
  • Transfer & pre-training on large datasets
    • Knowledge-driven RL: leveraging pre-trained models
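A minimal sketch of the data-reuse idea behind off-policy replay (capacity and batch size are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store past transitions so each environment step can be reused many times."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(list, zip(*batch))
        return states, actions, rewards, next_states, dones

# toy usage with dummy transitions
buf = ReplayBuffer()
for i in range(100):
    buf.add(i, 0, 1.0, i + 1, False)
states, actions, rewards, next_states, dones = buf.sample(8)
```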

Knowledge-Driven RL: Motivation

Current LLMs are not good at decision making

Pros: rich knowledge

Cons: auto-regressive decoding, lack of long-term memory

Reinforcement learning in decision making

Pros: can go beyond human intelligence

Cons: sample inefficiency

Instability & the Deadly Triad

Function approximation + bootstrapping + off-policy learning can diverge

Even nominally stable algorithms (e.g., PPO) can be unstable
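The divergence is easy to reproduce with a textbook-style two-state example (linear function approximation, bootstrapping, and updates only on one off-policy transition); the constants below are arbitrary:

```python
# V(s1) = 1 * w, V(s2) = 2 * w, deterministic transition s1 -> s2 with reward 0.
# True values are zero, yet repeatedly applying semi-gradient TD(0) only to this
# transition makes w grow without bound whenever 2 * gamma > 1: a minimal deadly triad.
w, alpha, gamma = 1.0, 0.1, 0.99
for step in range(100):
    td_error = 0.0 + gamma * (2 * w) - (1 * w)  # r + gamma * V(s2) - V(s1)
    w += alpha * td_error * 1.0                  # semi-gradient update, feature phi(s1) = 1
    if step % 20 == 0:
        print(step, round(w, 2))                 # w keeps increasing instead of converging to 0
```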

Open direction for Stability

Better optimization landscapes + regularization

Calibration/monitoring tools for RL training

Architectures with built-in inductive biases (e.g., equivariance)
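One widely used stabilizer in this spirit is a slowly moving target network (Polyak averaging), which keeps bootstrapped targets from chasing the online network; a minimal PyTorch sketch with an arbitrary tau:

```python
import torch

@torch.no_grad()
def polyak_update(online_net, target_net, tau=0.005):
    """Move each target parameter a small step toward the corresponding online parameter."""
    for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)

# toy usage with small linear networks
online = torch.nn.Linear(4, 2)
target = torch.nn.Linear(4, 2)
target.load_state_dict(online.state_dict())
polyak_update(online, target)
```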

Reproducibility & Evaluation

Results often depend on random seeds, codebase, and compute budget

Benchmarks can be overfit, and comparisons are often apples-to-oranges

Offline evaluation is especially tricky
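One way to make seed sensitivity visible is to report a bootstrapped confidence interval of the mean return over several seeds instead of a single best run; the returns below are placeholders:

```python
import numpy as np

def bootstrap_ci(returns_per_seed, n_boot=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval of the mean return across seeds."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns_per_seed, dtype=float)
    means = [rng.choice(returns, size=len(returns), replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return returns.mean(), (lo, hi)

# e.g., final returns from 5 seeds of the same algorithm (placeholder numbers)
print(bootstrap_ci([212.0, 305.0, 198.0, 401.0, 250.0]))
```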

Toward Better Evaluation

  • Robustness checks and ablations
  • Out-of-distribution test suites
  • Realistic benchmarks beyond games (e.g., science and healthcare)

Generalization & Distribution Shift

Policies overfit to training environments and fail under small changes

Sim-to-real gap, sensor noise, morphology changes, domain drift.

Requires learning invariance and robust decision rules.

Open direction for Generalization

  • Domain randomization + system identification (see the sketch after this list)
  • Robust / risk-sensitive RL
  • Representation learning for invariance
  • Meta-RL and fast adaptation
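Domain randomization can be as simple as re-sampling simulator parameters at every reset so the policy cannot latch onto one fixed dynamics; the config below is a stand-in for a real simulator's settings:

```python
from dataclasses import dataclass
import numpy as np

rng = np.random.default_rng(0)

@dataclass
class SimConfig:          # stand-in for a real simulator's physics settings (illustrative)
    mass: float
    friction: float
    obs_noise_std: float

def sample_randomized_config():
    """Re-sample physics and noise parameters per episode to force invariant behavior."""
    return SimConfig(
        mass=rng.uniform(0.8, 1.2),            # +/- 20% around a nominal mass of 1.0
        friction=rng.uniform(0.5, 1.5),
        obs_noise_std=rng.uniform(0.0, 0.05),  # sensor noise level
    )

print(sample_randomized_config())
```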

Model-based RL: Promise & Pitfalls

  • Learned models enable planning and sample efficiency
  • But distribution mismatch and model exploitation can break policies
  • Long-horizon imagination amplifies errors (see the short-rollout sketch after this list)
  • Model learning is itself challenging
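A common mitigation is to keep imagined rollouts short and branch them from real states, so model errors have little room to compound (in the spirit of MBPO-style methods). A minimal sketch; `model_step` and `policy` are assumed callables:

```python
def short_model_rollouts(real_states, model_step, policy, horizon=5):
    """Branch short imagined rollouts from real states to limit compounding model error.

    model_step(s, a) -> (next_state, reward)   # learned dynamics model (assumed interface)
    policy(s) -> a                             # current policy (assumed interface)
    """
    imagined = []
    for s in real_states:
        for _ in range(horizon):               # keep the horizon short: error grows with depth
            a = policy(s)
            s_next, r = model_step(s, a)
            imagined.append((s, a, r, s_next))
            s = s_next
    return imagined

# toy usage with stand-in model and policy
rollouts = short_model_rollouts([0.0], lambda s, a: (s + a, 1.0), lambda s: 0.1)
```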

Safety, alignment, and constraints

Reward mis-specification -> unsafe or unintended behavior

Need to respect constraints: energy, collisions, ethics, regulation

Exploration itself may be unsafe

Open direction for Safe RL

  • Constrained RL (Lagrangian methods, CBFs, shielding); see the Lagrangian sketch below
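The Lagrangian variant can be sketched as dual ascent on a multiplier that prices constraint violations; the update below is illustrative (single constraint, scalar multiplier, placeholder rollout statistics):

```python
def lagrangian_objective(reward_return, cost_return, lam, cost_limit):
    """Penalized objective: maximize reward while paying lam per unit of constraint violation."""
    return reward_return - lam * (cost_return - cost_limit)

def update_multiplier(lam, cost_return, cost_limit, lr=0.01):
    """Dual ascent: raise lam when the constraint is violated, decay it otherwise (kept >= 0)."""
    return max(0.0, lam + lr * (cost_return - cost_limit))

lam = 0.0
for epoch in range(3):
    reward_return, cost_return = 100.0, 30.0                 # placeholder rollout statistics
    obj = lagrangian_objective(reward_return, cost_return, lam, cost_limit=25.0)
    lam = update_multiplier(lam, cost_return, cost_limit=25.0)
    print(epoch, round(obj, 2), round(lam, 3))
```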

Theory Gaps & Evaluation

Deep RL lacks strong general guarantees.

We don’t fully understand when/why it works

Bridging theory and practice remains an open problem

Promising theory directions

Optimization theory of RL objectives

Generalization and representation learning bounds

Finite-sample analysis

Connection to foundation models

  • Pre-training on large scale experience
  • World models as sequence predictors
  • RLHF/preference optimization for alignment (see the sketch after this list)
  • Open problems: grounding
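The preference-optimization piece typically reduces to a Bradley-Terry style loss on a learned reward model; a minimal PyTorch sketch with placeholder reward tensors (a real pipeline would score trajectory or response pairs):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss: push the reward of the preferred sample above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy usage with placeholder reward-model outputs for two preference pairs
loss = preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
print(loss.item())
```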

What to expect in the next 3-5 years

Unified model-based offline + safe RL stacks

Large pretrained decision models

Deployment in high-stakes domains
