
CSE510 Deep Reinforcement Learning (Lecture 25)


Linear Value Factorization

link to paper 

Why Does Linear Factorization Work?

  • Multi-agent reinforcement learning research is mostly empirical
  • Theoretical Model: Factored Multi-Agent Fitted Q-Iteration (FMA-FQI)

Theorem 1

Linear value factorization realizes a counterfactual credit assignment mechanism.

Agent $i$:

$$Q_i^{(t+1)}(s,a_i)=\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]-\frac{n-1}{n}\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]$$

Here $\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]$ is the evaluation of $a_i$, and $\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]$ is the baseline.

The target $Q$-value: $y^{(t)}(s,a)=r+\gamma\max_{a'}Q_{tot}^{(t)}(s',a')$.
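
As a concrete illustration of Theorem 1, here is a minimal NumPy sketch of one update for a single state, assuming a uniform distribution over the other agents' actions (the function name `fma_fqi_update` and the uniform-sampling assumption are illustrative, not from the lecture):

```python
import numpy as np

def fma_fqi_update(y, n_agents):
    """One FMA-FQI step for a single state (Theorem 1).

    y: array of shape (A_1, ..., A_n) holding the targets
       y^(t)(s, a) = r + gamma * max_a' Q_tot^(t)(s', a') for every joint action.
    Returns a list of per-agent tables Q_i^(t+1)(s, a_i): the counterfactual
    evaluation of a_i minus (n-1)/n times the baseline, assuming a uniform
    distribution over the other agents' actions.
    """
    baseline = y.mean()  # E_{a'}[y^(t)(s, a')]
    q_next = []
    for i in range(n_agents):
        other_axes = tuple(j for j in range(n_agents) if j != i)
        eval_ai = y.mean(axis=other_axes)  # E_{a_{-i}'}[y^(t)(s, a_i ⊕ a_{-i}')]
        q_next.append(eval_ai - (n_agents - 1) / n_agents * baseline)
    return q_next

# Toy usage: 2 agents, 3 actions each, arbitrary targets for one state
rng = np.random.default_rng(0)
y = rng.normal(size=(3, 3))
q1, q2 = fma_fqi_update(y, n_agents=2)
q_tot_next = q1[:, None] + q2[None, :]  # next iterate under linear factorization
```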

Theorem 2

Linear value factorization converges locally with on-policy training.

Limitations of Linear Factorization

Linear: $Q_{tot}(s,a)=\sum_{i=1}^{n}Q_i(s,a_i)$

Limited representation: the factorized policy can be suboptimal (e.g., in the Prisoner's-Dilemma-style matrix game below)

| a_1 \ a_2 | Action 1 | Action 2 |
| --- | --- | --- |
| Action 1 | 8 | -12 |
| Action 2 | -12 | 0 |

After linear factorization:

| a_1 \ a_2 | Action 1 | Action 2 |
| --- | --- | --- |
| Action 1 | -6.5 | -5 |
| Action 2 | -5 | -3.5 |
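
To see why the linear class cannot represent this game exactly, here is a small NumPy check (an illustration, not the lecture's derivation); the table above shows the values that fitted Q-iteration actually converges to, which need not equal a plain least-squares fit:

```python
import numpy as np

# Payoff matrix of the matrix game above (agent 1 indexes rows, agent 2 columns)
M = np.array([[8.0, -12.0],
              [-12.0, 0.0]])

# An exact additive form Q1(a1) + Q2(a2) would force
# M[0,0] + M[1,1] == M[0,1] + M[1,0]; here 8 != -24, so no exact fit exists.
print(M[0, 0] + M[1, 1], M[0, 1] + M[1, 0])

# Closest additive fit in the least-squares sense: row mean + column mean - grand mean
fit = M.mean(axis=1, keepdims=True) + M.mean(axis=0, keepdims=True) - M.mean()
print(M - fit)  # nonzero residual: the interaction between agents is lost
```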

Theorem 3

Linear value factorization may diverge with off-policy training.

Perfect Alignment: IGM Factorization

  • Individual-Global Maximization (IGM) Constraint (a small consistency check is sketched after this list):
$$\argmax_{a}Q_{tot}(s,a)=\left(\argmax_{a_1}Q_1(s,a_1), \dots, \argmax_{a_n}Q_n(s,a_n)\right)$$
  • IGM Factorization: $Q_{tot}(s,a)=f(Q_1(s,a_1), \dots, Q_n(s,a_n))$

    • The factorization function $f$ realizes all functions satisfying IGM.
  • FQI-IGM: Fitted Q-Iteration with IGM Factorization
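
A minimal sketch of checking the IGM constraint for a single state (the helper `satisfies_igm` is an illustrative name, not from the lecture or the paper):

```python
import numpy as np

def satisfies_igm(q_tot, q_locals):
    """Check the IGM constraint for one state (ties ignored for simplicity).

    q_tot:    array of shape (A_1, ..., A_n) with joint values Q_tot(s, a).
    q_locals: list of n arrays, q_locals[i][a_i] = Q_i(s, a_i).
    Returns True if the greedy joint action of Q_tot equals the tuple of
    individually greedy actions (argmax_{a_i} Q_i).
    """
    joint_greedy = tuple(int(k) for k in np.unravel_index(np.argmax(q_tot), q_tot.shape))
    local_greedy = tuple(int(np.argmax(q_i)) for q_i in q_locals)
    return joint_greedy == local_greedy

# A linear Q_tot built from the local Q's satisfies IGM by construction
q1 = np.array([1.0, -2.0])
q2 = np.array([0.5, 3.0])
assert satisfies_igm(q1[:, None] + q2[None, :], [q1, q2])
```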

Theorem 4

Convergence & optimality. FQI-IGM globally converges to the optimal value function in multi-agent MDPs.

QPLEX: Multi-Agent Q-Learning with IGM Factorization

link to paper 

IGM: $$\argmax_a Q_{tot}(s,a)=\begin{pmatrix} \argmax_{a_1}Q_1(s,a_1) \\ \vdots \\ \argmax_{a_n}Q_n(s,a_n) \end{pmatrix}$$

Core idea:

  • Fit the values of optimal actions accurately
  • Approximate the values of non-optimal actions

QPLEX Mixing Network:

$$Q_{tot}(s,a)=\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')+\sum_{i=1}^{n} \lambda_i(s,a)\left(Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i')\right)$$

Here $\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')$ is the baseline $\max_a Q_{tot}(s,a)$,

and $Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i')$ is the “advantage” of agent $i$.

Coefficients: $\lambda_i(s,a)>0$, easily realized and learned with neural networks.
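
A minimal tabular sketch of this mixing rule for one state (the helper `qplex_mixing` and the softplus parameterization of $\lambda_i$ are illustrative assumptions; in QPLEX the coefficients are produced by a learned neural network):

```python
import numpy as np

def qplex_mixing(q_locals, joint_action, lambdas):
    """QPLEX-style dueling mixing for one state (tabular sketch).

    q_locals:     list of n arrays, q_locals[i][a_i] = Q_i(s, a_i).
    joint_action: tuple (a_1, ..., a_n) of selected actions.
    lambdas:      array of n positive coefficients lambda_i(s, a).
    """
    baseline = sum(q_i.max() for q_i in q_locals)  # equals max_a Q_tot(s, a)
    advantages = np.array([q_i[a_i] - q_i.max()
                           for q_i, a_i in zip(q_locals, joint_action)])
    return baseline + float(lambdas @ advantages)

# Toy usage: two agents; softplus keeps the coefficients strictly positive
lambdas = np.log1p(np.exp(np.array([0.3, -0.1])))
q1 = np.array([1.0, 4.0])
q2 = np.array([2.0, 0.5])
q_tot = qplex_mixing([q1, q2], joint_action=(0, 1), lambdas=lambdas)
# At the greedy joint action (1, 0) the advantages vanish, so Q_tot = baseline.
```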

Continued next time…
