
CSE510 Deep Reinforcement Learning (Lecture 25)


Linear Value Factorization

link to paper 

Why Does Linear Factorization Work?

  • Multi-agent reinforcement learning research is mostly empirical
  • Theoretical Model: Factored Multi-Agent Fitted Q-Iteration (FMA-FQI)

Theorem 1

Linear value factorization realizes a counterfactual credit assignment mechanism.

Agent $i$:

$$Q_i^{(t+1)}(s,a_i)=\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]-\frac{n-1}{n}\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]$$

Here $\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]$ is the evaluation of $a_i$, and $\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]$ is the baseline.

The target $Q$-value: $y^{(t)}(s,a)=r+\gamma\max_{a'}Q_{tot}^{(t)}(s',a')$.
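
As a concrete illustration of Theorem 1, here is a minimal NumPy sketch of one update for a single state, assuming a uniform distribution over the other agents' actions (the function name `fma_fqi_update` and the uniform-sampling assumption are illustrative, not from the lecture):

```python
import numpy as np

def fma_fqi_update(y, n_agents):
    """One FMA-FQI step for a single state (Theorem 1).

    y: array of shape (A_1, ..., A_n) holding the targets
       y^(t)(s, a) = r + gamma * max_a' Q_tot^(t)(s', a') for every joint action.
    Returns a list of per-agent tables Q_i^(t+1)(s, a_i): the counterfactual
    evaluation of a_i minus (n-1)/n times the baseline, assuming a uniform
    distribution over the other agents' actions.
    """
    baseline = y.mean()  # E_{a'}[y^(t)(s, a')]
    q_next = []
    for i in range(n_agents):
        other_axes = tuple(j for j in range(n_agents) if j != i)
        eval_ai = y.mean(axis=other_axes)  # E_{a_{-i}'}[y^(t)(s, a_i ⊕ a_{-i}')]
        q_next.append(eval_ai - (n_agents - 1) / n_agents * baseline)
    return q_next

# Toy usage: 2 agents, 3 actions each, arbitrary targets for one state
rng = np.random.default_rng(0)
y = rng.normal(size=(3, 3))
q1, q2 = fma_fqi_update(y, n_agents=2)
q_tot_next = q1[:, None] + q2[None, :]  # next iterate under linear factorization
```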

Theorem 2

Linear value factorization converges locally with on-policy training.

Limitations of Linear Factorization

Linear: $Q_{tot}(s,a)=\sum_{i=1}^{n}Q_i(s,a_i)$

Limited representation: the factorized policy can be suboptimal (e.g., in the Prisoner's-Dilemma-style matrix game below)

| a_1 \ a_2 | Action 1 | Action 2 |
| --- | --- | --- |
| Action 1 | 8 | -12 |
| Action 2 | -12 | 0 |

After linear factorization:

| a_1 \ a_2 | Action 1 | Action 2 |
| --- | --- | --- |
| Action 1 | -6.5 | -5 |
| Action 2 | -5 | -3.5 |
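
To see why the linear class cannot represent this game exactly, here is a small NumPy check (an illustration, not the lecture's derivation); the table above shows the values that fitted Q-iteration actually converges to, which need not equal a plain least-squares fit:

```python
import numpy as np

# Payoff matrix of the matrix game above (agent 1 indexes rows, agent 2 columns)
M = np.array([[8.0, -12.0],
              [-12.0, 0.0]])

# An exact additive form Q1(a1) + Q2(a2) would force
# M[0,0] + M[1,1] == M[0,1] + M[1,0]; here 8 != -24, so no exact fit exists.
print(M[0, 0] + M[1, 1], M[0, 1] + M[1, 0])

# Closest additive fit in the least-squares sense: row mean + column mean - grand mean
fit = M.mean(axis=1, keepdims=True) + M.mean(axis=0, keepdims=True) - M.mean()
print(M - fit)  # nonzero residual: the interaction between agents is lost
```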

Theorem 3

Linear value factorization may diverge with off-policy training.

Perfect Alignment: IGM Factorization

  • Individual-Global Maximization (IGM) Constraint (a small consistency check is sketched after this list):
$$\argmax_{a}Q_{tot}(s,a)=\left(\argmax_{a_1}Q_1(s,a_1), \dots, \argmax_{a_n}Q_n(s,a_n)\right)$$
  • IGM Factorization: $Q_{tot}(s,a)=f(Q_1(s,a_1), \dots, Q_n(s,a_n))$

    • The factorization function $f$ realizes all functions satisfying IGM.
  • FQI-IGM: Fitted Q-Iteration with IGM Factorization
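
A minimal sketch of checking the IGM constraint for a single state (the helper `satisfies_igm` is an illustrative name, not from the lecture or the paper):

```python
import numpy as np

def satisfies_igm(q_tot, q_locals):
    """Check the IGM constraint for one state (ties ignored for simplicity).

    q_tot:    array of shape (A_1, ..., A_n) with joint values Q_tot(s, a).
    q_locals: list of n arrays, q_locals[i][a_i] = Q_i(s, a_i).
    Returns True if the greedy joint action of Q_tot equals the tuple of
    individually greedy actions (argmax_{a_i} Q_i).
    """
    joint_greedy = tuple(int(k) for k in np.unravel_index(np.argmax(q_tot), q_tot.shape))
    local_greedy = tuple(int(np.argmax(q_i)) for q_i in q_locals)
    return joint_greedy == local_greedy

# A linear Q_tot built from the local Q's satisfies IGM by construction
q1 = np.array([1.0, -2.0])
q2 = np.array([0.5, 3.0])
assert satisfies_igm(q1[:, None] + q2[None, :], [q1, q2])
```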

Theorem 4

Convergence & optimality. FQI-IGM globally converges to the optimal value function in multi-agent MDPs.

QPLEX: Multi-Agent Q-Learning with IGM Factorization

link to paper 

IGM: $$\argmax_a Q_{tot}(s,a)=\begin{pmatrix} \argmax_{a_1}Q_1(s,a_1) \\ \vdots \\ \argmax_{a_n}Q_n(s,a_n) \end{pmatrix}$$

Core idea:

  • Fit the values of optimal actions accurately
  • Approximate the values of non-optimal actions

QPLEX Mixing Network:

$$Q_{tot}(s,a)=\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')+\sum_{i=1}^{n} \lambda_i(s,a)\left(Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i')\right)$$

Here $\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')$ is the baseline $\max_a Q_{tot}(s,a)$,

and $Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i')$ is the “advantage” of agent $i$.

Coefficients: $\lambda_i(s,a)>0$, easily realized and learned with neural networks.
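
A minimal tabular sketch of this mixing rule for one state (the helper `qplex_mixing` and the softplus parameterization of $\lambda_i$ are illustrative assumptions; in QPLEX the coefficients are produced by a learned neural network):

```python
import numpy as np

def qplex_mixing(q_locals, joint_action, lambdas):
    """QPLEX-style dueling mixing for one state (tabular sketch).

    q_locals:     list of n arrays, q_locals[i][a_i] = Q_i(s, a_i).
    joint_action: tuple (a_1, ..., a_n) of selected actions.
    lambdas:      array of n positive coefficients lambda_i(s, a).
    """
    baseline = sum(q_i.max() for q_i in q_locals)  # equals max_a Q_tot(s, a)
    advantages = np.array([q_i[a_i] - q_i.max()
                           for q_i, a_i in zip(q_locals, joint_action)])
    return baseline + float(lambdas @ advantages)

# Toy usage: two agents; softplus keeps the coefficients strictly positive
lambdas = np.log1p(np.exp(np.array([0.3, -0.1])))
q1 = np.array([1.0, 4.0])
q2 = np.array([2.0, 0.5])
q_tot = qplex_mixing([q1, q2], joint_action=(0, 1), lambdas=lambdas)
# At the greedy joint action (1, 0) the advantages vanish, so Q_tot = baseline.
```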

Continued next time…
