CSE510 Deep Reinforcement Learning (Lecture 24)

Cooperative Multi-Agent Reinforcement Learning (MARL)

This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.

Multi-Agent Coordination Under Uncertainty

In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP).

A transition is defined as:

$$P(s' \mid s, a_{1}, \dots, a_{n})$$

Parameter explanations:

  • $s$: current global state.
  • $s'$: next global state.
  • $a_i$: action taken by agent $i$.
  • $P(\cdot)$: environment transition function.

The shared return is:

$$\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]$$

Parameter explanations:

  • $\gamma$: discount factor.
  • $T$: horizon length.
  • $r_t$: shared team reward at time $t$.
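
As a quick worked example of the return above, the following snippet (illustrative Python, not from the lecture) computes the discounted shared return for a short reward sequence:

```python
# Minimal sketch: discounted shared team return for one episode.
# `rewards` and `gamma` are illustrative names, not from the lecture.
def discounted_return(rewards, gamma=0.99):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example with gamma = 0.9 and rewards r_0 = 1, r_1 = 0, r_2 = 2:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```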

CTDE: Centralized Training, Decentralized Execution

During training, the learner can use global information such as the full state (centralized); at execution time, each agent acts only on its own local observations (decentralized). This separation is critical for real-world deployment.

Joint vs Factored Q-Learning

Joint Q-Learning

In joint-action learning, one learns a full joint Q-function:

$$Q_{tot}(s, a_{1}, \dots, a_{n})$$

Parameter explanations:

  • $Q_{tot}$: joint value for the entire team.
  • $(a_1, \dots, a_n)$: joint action vector across agents.

Problem:

  • The joint action space grows exponentially with the number of agents $n$ (roughly $|A|^{n}$ for $|A|$ actions per agent).
  • Learning a single joint Q-function therefore does not scale, as the sketch below illustrates.
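
To make the blowup concrete, here is a small illustrative calculation; the per-agent action count is an assumption chosen only for the example:

```python
# Size of the joint action space |A|^n versus the number of agents n,
# assuming (for illustration) every agent has the same 5 discrete actions.
n_actions_per_agent = 5
for n_agents in (2, 4, 8):
    print(n_agents, n_actions_per_agent ** n_agents)
# 2 agents -> 25 joint actions, 4 -> 625, 8 -> 390625:
# enumerating joint actions quickly becomes intractable.
```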

Value Factorization

Instead of learning $Q_{tot}$ directly, we factorize it into individual utility functions:

$$Q_{tot}(s, \mathbf{a}) = f\big(Q_{1}(s,a_{1}), \dots, Q_{n}(s,a_{n})\big)$$

Parameter explanations:

  • $\mathbf{a}$: joint action vector.
  • $f(\cdot)$: mixing network combining individual Q-values.

The goal is to enable decentralized greedy action selection.

Individual-Global-Max (IGM) Condition

The IGM condition enables decentralized optimal action selection:

$$\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a}) = \big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)$$

Parameter explanations:

  • $\arg\max_{\mathbf{a}}$: search for the best joint action.
  • $\arg\max_{a_i}$: best local action for agent $i$.
  • $Q_i(s,a_i)$: individual utility for agent $i$.

IGM makes decentralized execution optimal with respect to the learned factorized value.
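
A minimal sketch of what IGM buys at execution time: every agent just takes its own greedy action. The tensors and function name below are illustrative, not from a specific codebase.

```python
import torch

def decentralized_greedy(per_agent_qs):
    """Each agent picks the argmax of its own utility; under IGM this
    tuple of local choices also maximizes the factorized Q_tot."""
    return [int(torch.argmax(q)) for q in per_agent_qs]

q1 = torch.tensor([0.1, 0.7, 0.2])  # agent 1 utilities over its 3 actions
q2 = torch.tensor([0.5, 0.3, 0.9])  # agent 2 utilities over its 3 actions
print(decentralized_greedy([q1, q2]))  # [1, 2]
```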

VDN (Value Decomposition Networks)

VDN assumes:

$$Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} Q_{i}(s,a_{i})$$

Parameter explanations:

  • $Q_i(s,a_i)$: value of agent $i$'s action.
  • $\sum_{i=1}^{n}$: linear sum over agents.

Pros:

  • Very simple, satisfies IGM.
  • Fully decentralized execution.

Cons:

  • Limited representation capacity.
  • Cannot model non-linear teamwork interactions.
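
For reference, a minimal VDN-style mixer (a sketch with illustrative shapes, not a full implementation) is just a sum over the per-agent utilities of the chosen actions:

```python
import torch

def vdn_mix(chosen_qs: torch.Tensor) -> torch.Tensor:
    # chosen_qs: [batch, n_agents] utilities of the actions actually taken
    return chosen_qs.sum(dim=1)  # Q_tot: [batch]

chosen_qs = torch.tensor([[0.2, 0.5, 0.1],
                          [1.0, -0.3, 0.4]])
print(vdn_mix(chosen_qs))  # tensor([0.8000, 1.1000])
```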

QMIX: Monotonic Value Factorization

QMIX uses a state-conditioned mixing network enforcing monotonicity:

$$\frac{\partial Q_{tot}}{\partial Q_{i}} \ge 0$$

Parameter explanations:

  • $\partial Q_{tot} / \partial Q_i$: partial derivative of the global Q-value with respect to agent $i$'s individual Q-value.
  • $\ge 0$: non-negativity of these derivatives ensures the monotonicity required for IGM.

The mixing function is:

$$Q_{tot}(s,\mathbf{a}) = f_{mix}(Q_{1}, \dots, Q_{n}; s)$$

Parameter explanations:

  • $f_{mix}$: neural mixing network with non-negative weights.
  • $s$: global state conditioning the mixing process.

Benefits:

  • More expressive than VDN.
  • Supports CTDE while keeping decentralized greedy execution.
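
Below is a sketch of a QMIX-style monotonic mixer: state-conditioned hypernetworks produce the mixing weights, and taking their absolute value enforces the non-negative derivatives above. Layer sizes and names are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: map the global state to mixing weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: [batch, n_agents] chosen per-agent Q-values; state: [batch, state_dim]
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # [b, 1, embed]
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)  # Q_tot: [batch]

mixer = MonotonicMixer(n_agents=3, state_dim=8)
print(mixer(torch.rand(4, 3), torch.rand(4, 8)).shape)  # torch.Size([4])
```

Because the weights applied to the agent Q-values pass through an absolute value, each partial derivative of $Q_{tot}$ with respect to an individual $Q_i$ is non-negative, which is exactly the monotonicity constraint.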

Theoretical Issues With Linear and Monotonic Factorization

Limitations:

  • Linear models (VDN) cannot represent complex coordination.
  • QMIX monotonicity limits representation power for tasks requiring non-monotonic interactions.
  • Off-policy training can diverge in some factorizations.

QPLEX: Duplex Dueling Multi-Agent Q-Learning

QPLEX introduces a dueling architecture that satisfies IGM while providing full representation capacity within the IGM class.

QPLEX Advantage Factorization

QPLEX factorizes:

$$Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} \max_{a_i'} Q_i(s,a_i') + \sum_{i=1}^{n} \lambda_i(s,\mathbf{a}) \Big(Q_i(s,a_i) - \max_{a_i'} Q_i(s,a_i')\Big)$$

Parameter explanations:

  • $\lambda_i(s,\mathbf{a})$: positive mixing coefficients.
  • $Q_i(s,a_i)$: individual utility.
  • $\max_{a_i'} Q_i(s,a_i')$: per-agent baseline (greedy) value.
  • $\sum_{i=1}^{n} \max_{a_i'} Q_i(s,a_i')$: joint baseline, equal to $\max_{\mathbf{a}} \sum_{i=1}^{n} Q_i(s,a_i)$ since the maximization decomposes across agents.
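
A minimal sketch of this dueling mixing step, assuming the per-agent utilities and the positive coefficients $\lambda_i$ are already available (in QPLEX they are produced by a learned attention-based network; here they are passed in as inputs, and all names are illustrative):

```python
import torch

def qplex_mix(agent_qs, chosen_actions, lambdas):
    # agent_qs:       [batch, n_agents, n_actions] per-agent utilities Q_i
    # chosen_actions: [batch, n_agents] integer actions actually taken
    # lambdas:        [batch, n_agents] positive mixing coefficients
    v_i = agent_qs.max(dim=2).values                                    # max_{a'} Q_i(s, a')
    q_i = agent_qs.gather(2, chosen_actions.unsqueeze(-1)).squeeze(-1)  # Q_i(s, a_i)
    adv_i = q_i - v_i                                                   # per-agent advantage (<= 0)
    return v_i.sum(dim=1) + (lambdas * adv_i).sum(dim=1)                # Q_tot: [batch]

agent_qs = torch.rand(2, 3, 4)
actions = torch.randint(0, 4, (2, 3))
lambdas = torch.ones(2, 3)
print(qplex_mix(agent_qs, actions, lambdas).shape)  # torch.Size([2])
```

Since every $\lambda_i$ is positive and every per-agent advantage is non-positive, $Q_{tot}$ is maximized exactly when each agent plays its own greedy action, which is the IGM property.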

QPLEX Properties:

  • Fully satisfies IGM.
  • Has full representation capacity for all IGM-consistent Q-functions.
  • Enables stable off-policy training.

QPLEX Training Objective

QPLEX minimizes a TD loss over QtotQ_{tot}:

$$L = \mathbb{E}\Big[\big(r + \gamma \max_{\mathbf{a}'} Q_{tot}(s',\mathbf{a}') - Q_{tot}(s,\mathbf{a})\big)^{2}\Big]$$

Parameter explanations:

  • $r$: shared team reward.
  • $\gamma$: discount factor.
  • $s'$: next state.
  • $\mathbf{a}'$: next joint action evaluated in the TD target.
  • $Q_{tot}$: QPLEX global value estimate.
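
A sketch of this objective, assuming $Q_{tot}(s,\mathbf{a})$ and the bootstrapped term $\max_{\mathbf{a}'} Q_{tot}(s',\mathbf{a}')$ have already been computed by the mixer (names are illustrative; in practice the bootstrapped term comes from a separate target network):

```python
import torch

def td_loss(q_tot, reward, next_q_tot_max, gamma=0.99):
    # One-step TD target; gradients do not flow through the target.
    target = reward + gamma * next_q_tot_max
    return ((target.detach() - q_tot) ** 2).mean()

q_tot = torch.tensor([1.0, 0.5], requires_grad=True)
reward = torch.tensor([0.0, 1.0])
next_q = torch.tensor([1.2, 0.8])
print(td_loss(q_tot, reward, next_q, gamma=0.9))
```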

Role of Credit Assignment

Credit assignment addresses: “Which agent contributed what to the team reward?”

Value factorization supports implicit credit assignment:

  • Gradients into each $Q_i$ act as counterfactual signals.
  • Dueling architectures allow each agent to learn its influence.
  • QPLEX provides clean marginal contributions implicitly.

Performance on SMAC Benchmarks

QPLEX outperforms:

  • QTRAN
  • QMIX
  • VDN
  • Other CTDE baselines

Key reasons:

  • Effective realization of IGM.
  • Strong representational capacity.
  • Off-policy stability.

Extensions: Diversity and Shared Parameter Learning

Parameter sharing improves sample efficiency, but can lead to homogeneous agent behavior.

Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce:

  • Identity-aware diversity.
  • Information-based intrinsic rewards for agent differentiation.
  • Balanced sharing vs agent specialization.

These techniques improve exploration and cooperation in complex multi-agent tasks.

Summary of Lecture 24

Key points:

  • Cooperative MARL requires scalable value decomposition.
  • IGM enables decentralized action selection from centralized training.
  • QMIX introduces monotonic non-linear factorization.
  • QPLEX achieves full IGM representational capacity.
  • Implicit credit assignment arises naturally from factorization.
  • Diversity methods allow richer multi-agent coordination strategies.