CSE510 Deep Reinforcement Learning (Lecture 25)
Linear Value Factorization
Why does Linear Factorization work?
- Multi-agent reinforcement learning methods are mostly empirical
- Theoretical Model: Factored Multi-Agent Fitted Q-Iteration (FMA-FQI)
Theorem 1
Linear value factorization in FMA-FQI implicitly realizes a counterfactual credit assignment mechanism.
Agent i: Q_i^{t+1}(s, a_i) = \mathbb{E}_{a_{-i}\sim\mu}[y^t(s, a_i, a_{-i})] - \frac{n-1}{n}\,\mathbb{E}_{a\sim\mu}[y^t(s, a)]
Here \mathbb{E}_{a_{-i}\sim\mu}[y^t(s, a_i, a_{-i})] is the counterfactual evaluation of agent i's action a_i,
and \frac{n-1}{n}\,\mathbb{E}_{a\sim\mu}[y^t(s, a)] is the baseline.
The target Q-value: y^t(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s'}[\max_{a'} Q_{tot}^t(s', a')]
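Theorem 1 can be checked numerically in the stateless case. The sketch below is my own illustration (not the lecture's code): it builds a random two-agent matrix game, computes the least-squares projection onto Q_1(a_1) + Q_2(a_2) that one FMA-FQI step performs under a uniform data distribution, and confirms that the resulting Q_tot matches the counterfactual form above.

```python
# Minimal numerical check of the counterfactual form in Theorem 1 (sketch only):
# for a single-state, 2-agent matrix game with uniform data distribution, one
# FMA-FQI step projects the target y(a1, a2) onto the family Q1(a1) + Q2(a2).
import numpy as np

rng = np.random.default_rng(0)
A = 3                                  # actions per agent
y = rng.normal(size=(A, A))            # target values y(a1, a2)

# Least-squares fit of y(a1, a2) ~ Q1(a1) + Q2(a2) under uniform weights.
X = np.zeros((A * A, 2 * A))
for a1 in range(A):
    for a2 in range(A):
        X[a1 * A + a2, a1] = 1.0       # indicator feature for a1
        X[a1 * A + a2, A + a2] = 1.0   # indicator feature for a2
theta, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
Q_tot_fit = (X @ theta).reshape(A, A)

# Closed form from Theorem 1 with n = 2 agents:
# Q_i(a_i) = E_{a_-i}[y] - (n-1)/n * E[y]
Q1 = y.mean(axis=1) - 0.5 * y.mean()
Q2 = y.mean(axis=0) - 0.5 * y.mean()
Q_tot_formula = Q1[:, None] + Q2[None, :]

print(np.allclose(Q_tot_fit, Q_tot_formula))   # True
```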
Theorem 2
Linear factorization (FMA-FQI) achieves local convergence with on-policy training.
Limitations of Linear Factorization
Linear: Q_{tot}(s, a) = \sum_{i=1}^{n} Q_i(s, a_i)
Limited Representation: Suboptimal (Prisoner’s Dilemma)
| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | 8 | -12 |
| Action 2 | -12 | 0 |
After linear factorization:
| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | -6.5 | -5 |
| Action 2 | -5 | -3.5 |
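The suboptimality can be illustrated with a short sketch (my own hypothetical code; the exact values in the table above come from actual training). Under a product data distribution that mostly plays (Action 2, Action 2), e.g. epsilon-greedy around the current greedy pair, the best additive fit Q_1(a_1) + Q_2(a_2) still ranks (Action 2, Action 2) highest, so on-policy training with linear factorization never escapes the suboptimal joint action.

```python
# Why linear factorization can lock onto the suboptimal joint action in the
# matrix game above (illustration only, not the lecture's code).
import numpy as np

payoff = np.array([[8.0, -12.0],
                   [-12.0, 0.0]])      # rows: a_1, cols: a_2

eps = 0.3                               # exploration rate around greedy Action 2
p1 = np.array([eps / 2, 1 - eps / 2])   # P(a_1): mostly Action 2
p2 = np.array([eps / 2, 1 - eps / 2])   # P(a_2): mostly Action 2
joint = np.outer(p1, p2)                # product data distribution

# Weighted least-squares additive fit = sum of marginal "main effects"
# (valid because the data distribution is a product distribution).
mean = (joint * payoff).sum()
q1 = (payoff * p2[None, :]).sum(axis=1)          # E[payoff | a_1]
q2 = (payoff * p1[:, None]).sum(axis=0)          # E[payoff | a_2]
q_tot = q1[:, None] + q2[None, :] - mean         # fitted Q1(a_1) + Q2(a_2)

print(np.round(q_tot, 2))
print("greedy joint action:", np.unravel_index(q_tot.argmax(), q_tot.shape))
# -> (1, 1), i.e. (Action 2, Action 2): the suboptimal pair keeps winning.
```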
Theorem 3
Linear factorization (FMA-FQI) may diverge with off-policy training.
Perfect Alignment: IGM Factorization
- Individual-Global Maximization (IGM) Constraint
- \argmax_{a} Q_{tot}(s, a) = \left(\argmax_{a_1} Q_1(s, a_1), \dots, \argmax_{a_n} Q_n(s, a_n)\right)
IGM Factorization:
- The factorization function realizes all functions satisfying IGM.
- Q_{tot}(s, a) = \sum_{i=1}^{n} V_i(s) + \sum_{i=1}^{n} \lambda_i(s, a)\,A_i(s, a_i), where V_i(s) = \max_{a_i} Q_i(s, a_i), A_i(s, a_i) = Q_i(s, a_i) - V_i(s), and \lambda_i(s, a) > 0
FQI-IGM: Fitted Q-Iteration with IGM Factorization
Theorem 4
Convergence & optimality. FQI-IGM globally converges to the optimal value function in multi-agent MDPs.
QPLEX: Multi-Agent Q-Learning with IGM Factorization
IGM: \argmax_a Q_{tot}(s,a)=\begin{pmatrix} \argmax_{a_1}Q_1(s,a_1) \\ \dots \\ \argmax_{a_n}Q_n(s,a_n) \end{pmatrix}
Core idea:
- Fit the values of optimal actions accurately
- Approximate the values of non-optimal actions
QPLEX Mixing Network:
Q_{tot}(s, a) = \sum_{i=1}^{n} V_i(s) + \sum_{i=1}^{n} \lambda_i(s, a)\,A_i(s, a_i)
Here \sum_{i=1}^{n} V_i(s) is the baseline, with V_i(s) = \max_{a_i} Q_i(s, a_i),
and A_i(s, a_i) = Q_i(s, a_i) - V_i(s) is the "advantage".
Coefficients: \lambda_i(s, a) > 0, easily realized and learned with neural networks
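A minimal sketch of the mixing step, assuming per-agent utilities of shape (n_agents, n_actions) and arbitrary positive coefficients standing in for QPLEX's learned lambda network (names and shapes are my own, not the official implementation):

```python
# Sketch of the QPLEX-style mixing computation for one state.  Each A_i is
# non-positive and equals zero exactly at agent i's greedy action, so any
# positive lambda_i keeps the joint argmax aligned with the individual
# argmaxes (IGM), while the lambda_i add capacity for non-optimal actions.
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_actions = 3, 4

Q_i = rng.normal(size=(n_agents, n_actions))      # per-agent utilities Q_i(s, a_i)
V_i = Q_i.max(axis=1, keepdims=True)              # per-agent baselines V_i(s)
A_i = Q_i - V_i                                   # per-agent advantages, <= 0

def q_tot(joint_action, lam):
    """Q_tot(s,a) = sum_i V_i(s) + sum_i lambda_i(s,a) * A_i(s,a_i), lambda_i > 0."""
    adv = A_i[np.arange(n_agents), joint_action]
    return V_i.sum() + (lam * adv).sum()

# In QPLEX the lambda_i(s, a) come from a small mixing/attention network;
# here we just draw arbitrary positive values for two joint actions.
best = Q_i.argmax(axis=1)                         # individual greedy actions
worse = (best + 1) % n_actions                    # some other joint action
lam = np.abs(rng.normal(size=n_agents)) + 1e-3    # arbitrary positive coefficients

print(q_tot(best, lam) >= q_tot(worse, lam))      # True: IGM is preserved
```

This is exactly the "core idea" above: the baseline term pins down the value of the optimal joint action, and the positive coefficients only reshape the values of non-optimal actions.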
Continue next time…