
CSE510 Deep Reinforcement Learning (Lecture 14)

Advanced Policy Gradient Methods

Trust Region Policy Optimization (TRPO)

“Recall” from last lecture

\max_{\pi'} \mathbb{E}_{s\sim d^{\pi},\, a\sim \pi} \left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]

such that

\mathbb{E}_{s\sim d^{\pi}}\, D_{KL}\left(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\right)\leq \delta

Unconstrained penalized objective:

d^*=\arg\max_{d}\; J(\theta+d)-\lambda\left(D_{KL}\left[\pi_\theta\,\|\,\pi_{\theta+d}\right]-\delta\right)

\theta_{new}=\theta_{old}+d

First-order Taylor expansion of the objective and second-order expansion of the KL term:

\approx \arg\max_{d}\; J(\theta_{old})+\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}\, d-\frac{1}{2}\lambda\left(d^\top\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}\,\|\,\pi_{\theta}\right]\big|_{\theta=\theta_{old}}\, d\right)+\lambda \delta

If you are really interested, try to fill in the details of the “Solving the KL-Constrained Problem” section.

Natural Gradient Descent

Setting the gradient with respect to d to zero (the Hessian of the KL at θ_old is the Fisher information matrix F(θ_old)):

\begin{aligned} 0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}\, d+\frac{1}{2}\lambda\, d^\top F(\theta_{old})\, d\right)\\ &=-\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}+\lambda\, F(\theta_{old})\, d \end{aligned}

d=\frac{1}{\lambda}\, F^{-1}(\theta_{old})\,\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}

The natural gradient is

\tilde{\nabla}J(\theta)=F^{-1}(\theta_{old})\,\nabla_\theta J(\theta)

and the update is

\theta_{new}=\theta_{old}+\alpha\, F^{-1}(\theta_{old})\,\hat{g}

Using the quadratic approximation of the KL,

D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^\top F(\theta_{old})\,(\theta-\theta_{old})

we choose the step size α so that the constraint holds with equality (here g_N = F^{-1}(θ_old) ĝ is the natural gradient direction):

\frac{1}{2}(\alpha g_N)^\top F(\theta_{old})\,(\alpha g_N)=\delta \quad\Rightarrow\quad \alpha=\sqrt{\frac{2\delta}{g_N^\top F(\theta_{old})\, g_N}}
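As a concrete illustration, here is a minimal NumPy sketch of one natural-gradient update, assuming the policy gradient estimate g and the Fisher matrix F have already been estimated (the toy values below are made up; in practice F⁻¹g is computed with conjugate gradients rather than an explicit solve):

```python
# Minimal sketch of a single natural-gradient step (toy values, not a real policy).
import numpy as np

delta = 0.01                        # KL trust-region size
F = np.array([[2.0, 0.3],           # toy Fisher information matrix (symmetric PSD)
              [0.3, 1.0]])
g = np.array([0.5, -0.2])           # toy policy-gradient estimate g_hat

g_N = np.linalg.solve(F, g)         # natural gradient direction: F^{-1} g
alpha = np.sqrt(2 * delta / (g_N @ F @ g_N))   # step size from the KL constraint

theta_old = np.zeros(2)
theta_new = theta_old + alpha * g_N # natural-gradient update
```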

However, due to the quadratic approximation, the KL constraint may be violated.

We do a line search for the best step size, making sure that we are:

  • Improving the objective
  • Satisfying the KL constraint

TRPO = NPG + line search + monotonic improvement theorem
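A minimal sketch of the backtracking line search, assuming hypothetical surrogate_fn and kl_fn callables supplied by the user that estimate the surrogate objective and the KL between the old and new policies:

```python
# Backtracking line search: shrink the natural-gradient step until the surrogate
# objective improves and the KL constraint is satisfied.
def line_search(theta_old, full_step, surrogate_fn, kl_fn, delta,
                max_backtracks=10, shrink=0.5):
    old_obj = surrogate_fn(theta_old)
    for i in range(max_backtracks):
        theta_new = theta_old + (shrink ** i) * full_step
        if surrogate_fn(theta_new) > old_obj and kl_fn(theta_old, theta_new) <= delta:
            return theta_new      # accept: objective improved and KL <= delta
    return theta_old              # no acceptable step found: keep the old policy
```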

Summary of TRPO

Pros

  • (Approximately) monotonic policy improvement, giving stable updates
  • Robust to the choice of step size thanks to the trust region

Cons

  • Poor scalability
    • Second-order optimization: computing the Fisher Information Matrix and its inverse for the current policy at every update is expensive
  • Not very sample efficient
    • Requires a large batch of rollouts to approximate the objective and constraint accurately

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. — OpenAI

Link to paper: “Proximal Policy Optimization Algorithms” (Schulman et al., 2017, arXiv:1707.06347)

Idea:

  • The KL constraint helps the training process. However, maybe it does not need to be a hard constraint:
  • Does it matter if we break the constraint just a few times?

What if we treat it as a “soft” constraint and add a proximity penalty to the objective function?

PPO with Adaptive KL Penalty

\max_{\theta}\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta\, \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t),\,\pi_{\theta}(\cdot|s_t)\right]\right]

Use an adaptive β value.

L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta\, \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t),\,\pi_{\theta}(\cdot|s_t)\right]\right]

Compute d=\hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t),\,\pi_{\theta}(\cdot|s_t)\right]\right]

  • If d < d_target / 1.5, set β ← β/2
  • If d > d_target × 1.5, set β ← β×2
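A minimal sketch of this adaptive-β rule (the function name update_beta is just an illustrative choice; d is the measured expected KL after an update and d_target is the user-chosen target):

```python
# Adaptive KL penalty coefficient: keep the measured KL close to the target.
def update_beta(beta, d, d_target):
    if d < d_target / 1.5:
        beta = beta / 2.0   # KL much smaller than target: weaken the penalty
    elif d > d_target * 1.5:
        beta = beta * 2.0   # KL much larger than target: strengthen the penalty
    return beta
```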

PPO with Clipped Objective

\max_{\theta}\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]

r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
  • Here, r_t(θ) measures how much the new policy changes the probability of taking action a_t in state s_t:
    • If r_t > 1: the action becomes more likely under the new policy.
    • If r_t < 1: the action becomes less likely.
  • We’d like r_t A_t to increase if A_t > 0 (good actions become more probable) and decrease if A_t < 0.
  • But if r_t changes too much, the update becomes unstable, just like in vanilla PG.

We limit r_t(θ) to be in a range:

L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]

Trust Region Policy Optimization (TRPO): don’t move further than δ in KL. Proximal Policy Optimization (PPO): don’t let r_t(θ) drift further than ε from 1.
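A minimal PyTorch sketch of the clipped surrogate objective, assuming per-timestep log-probabilities under the new and old policies and advantage estimates are already available as tensors:

```python
# Clipped surrogate objective L^CLIP (to be maximized).
import torch

def clipped_surrogate(log_probs_new, log_probs_old, advantages, eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages   # keep r_t in [1-eps, 1+eps]
    return torch.min(unclipped, clipped).mean()
```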

PPO in Practice

L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta)-c_1 L_t^{VF}(\theta)+c_2 S[\pi_\theta](s_t)\right]

Here L_t^{CLIP}(θ) is the clipped surrogate objective.

L_t^{VF}(θ) is a squared-error loss for the “critic”: (V_θ(s_t) − V_t^target)^2.

S[π_θ](s_t) is an entropy bonus to ensure sufficient exploration; it encourages diversity of actions.

c_1 and c_2 are trade-off coefficients; in the paper, c_1 = 1 and c_2 = 0.01.
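Putting the pieces together, here is a minimal PyTorch sketch of the combined objective, assuming values, returns, and entropies come from the user’s own actor-critic network and return estimator (an illustrative sketch, not a reference implementation):

```python
# Combined PPO objective: clipped surrogate - c1 * value loss + c2 * entropy bonus.
import torch

def ppo_objective(log_probs_new, log_probs_old, advantages,
                  values, returns, entropy, c1=1.0, c2=0.01, eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)                        # r_t(theta)
    l_clip = torch.min(ratio * advantages,
                       torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    l_vf = ((values - returns) ** 2).mean()                                 # critic squared error
    objective = l_clip - c1 * l_vf + c2 * entropy.mean()                    # quantity to maximize
    return -objective                                                       # negate so an optimizer can minimize it
```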

Summary for Policy Gradient Methods

Trust region policy optimization (TRPO)

  • Optimization problem formulation
  • Natural gradient ascent + monotonic improvement + line search
  • But requires second-order optimization

Proximal policy optimization (PPO)

  • Clipped objective
  • Simple yet effective

Take-away:

  • Proper step size is critical for policy gradient methods
  • Sample efficiency can be improved by using importance sampling