
CSE510 Deep Reinforcement Learning (Lecture 14)

Advanced Policy Gradient Methods

Trust Region Policy Optimization (TRPO)

“Recall” from last lecture

\max_{\pi'} \mathbb{E}_{s\sim d^{\pi},\, a\sim \pi} \left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]

such that

\mathbb{E}_{s\sim d^{\pi}}\, D_{KL}\left(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\right)\leq \delta

Unconstrained penalized objective:

d^*=\arg\max_{d}\; J(\theta+d)-\lambda\left(D_{KL}\left[\pi_\theta\,\|\,\pi_{\theta+d}\right]-\delta\right)

\theta_{new}=\theta_{old}+d

First-order Taylor expansion of the objective and second-order expansion of the KL term:

\approx \arg\max_{d}\; J(\theta_{old})+\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}\, d-\frac{1}{2}\lambda\left(d^\top\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}\,\|\,\pi_{\theta}\right]\big|_{\theta=\theta_{old}}\, d\right)+\lambda \delta

If you are really interested, try to fill in the details of the “Solving the KL-Constrained Problem” section.

Natural Gradient Descent

Setting the gradient with respect to d to zero (the Hessian of the KL at θ_old is the Fisher information matrix F(θ_old)):

\begin{aligned} 0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}\, d+\frac{1}{2}\lambda\, d^\top F(\theta_{old})\, d\right)\\ &=-\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}+\lambda\, F(\theta_{old})\, d \end{aligned}

d=\frac{1}{\lambda}\, F^{-1}(\theta_{old})\,\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}

The natural gradient is

\tilde{\nabla}J(\theta)=F^{-1}(\theta_{old})\,\nabla_\theta J(\theta)

and the update is

\theta_{new}=\theta_{old}+\alpha\, F^{-1}(\theta_{old})\,\hat{g}

Using the quadratic approximation of the KL,

D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^\top F(\theta_{old})\,(\theta-\theta_{old})

we choose the step size α so that the constraint holds with equality (here g_N = F^{-1}(θ_old) ĝ is the natural gradient direction):

\frac{1}{2}(\alpha g_N)^\top F(\theta_{old})\,(\alpha g_N)=\delta \quad\Rightarrow\quad \alpha=\sqrt{\frac{2\delta}{g_N^\top F(\theta_{old})\, g_N}}
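As a concrete illustration, here is a minimal NumPy sketch of one natural-gradient update, assuming the policy gradient estimate g and the Fisher matrix F have already been estimated (the toy values below are made up; in practice F⁻¹g is computed with conjugate gradients rather than an explicit solve):

```python
# Minimal sketch of a single natural-gradient step (toy values, not a real policy).
import numpy as np

delta = 0.01                        # KL trust-region size
F = np.array([[2.0, 0.3],           # toy Fisher information matrix (symmetric PSD)
              [0.3, 1.0]])
g = np.array([0.5, -0.2])           # toy policy-gradient estimate g_hat

g_N = np.linalg.solve(F, g)         # natural gradient direction: F^{-1} g
alpha = np.sqrt(2 * delta / (g_N @ F @ g_N))   # step size from the KL constraint

theta_old = np.zeros(2)
theta_new = theta_old + alpha * g_N # natural-gradient update
```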

However, due to the quadratic approximation, the KL constraint may be violated.

We do a line search for the best step size, making sure that we are:

  • Improving the objective
  • Satisfying the KL constraint

TRPO = NPG + line search + monotonic improvement theorem
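A minimal sketch of the backtracking line search, assuming hypothetical surrogate_fn and kl_fn callables supplied by the user that estimate the surrogate objective and the KL between the old and new policies:

```python
# Backtracking line search: shrink the natural-gradient step until the surrogate
# objective improves and the KL constraint is satisfied.
def line_search(theta_old, full_step, surrogate_fn, kl_fn, delta,
                max_backtracks=10, shrink=0.5):
    old_obj = surrogate_fn(theta_old)
    for i in range(max_backtracks):
        theta_new = theta_old + (shrink ** i) * full_step
        if surrogate_fn(theta_new) > old_obj and kl_fn(theta_old, theta_new) <= delta:
            return theta_new      # accept: objective improved and KL <= delta
    return theta_old              # no acceptable step found: keep the old policy
```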

Summary of TRPO

Pros

  • (Approximately) monotonic policy improvement, giving stable updates
  • Robust to the choice of step size thanks to the trust region

Cons

  • Poor scalability
    • Second-order optimization: computing the Fisher Information Matrix and its inverse for the current policy at every update is expensive
  • Not very sample efficient
    • Requires a large batch of rollouts to approximate the objective and constraint accurately

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. — OpenAI

Link to paper: “Proximal Policy Optimization Algorithms” (Schulman et al., 2017, arXiv:1707.06347)

Idea:

  • The KL constraint helps the training process. However, maybe it does not need to be a hard constraint:
  • Does it matter if we break the constraint just a few times?

What if we treat it as a “soft” constraint and add a proximity penalty to the objective function?

PPO with Adaptive KL Penalty

\max_{\theta}\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta\, \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t),\,\pi_{\theta}(\cdot|s_t)\right]\right]

Use an adaptive β value.

L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta\, \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t),\,\pi_{\theta}(\cdot|s_t)\right]\right]

Compute d=\hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t),\,\pi_{\theta}(\cdot|s_t)\right]\right]

  • If d < d_target / 1.5, set β ← β/2
  • If d > d_target × 1.5, set β ← β×2
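A minimal sketch of this adaptive-β rule (the function name update_beta is just an illustrative choice; d is the measured expected KL after an update and d_target is the user-chosen target):

```python
# Adaptive KL penalty coefficient: keep the measured KL close to the target.
def update_beta(beta, d, d_target):
    if d < d_target / 1.5:
        beta = beta / 2.0   # KL much smaller than target: weaken the penalty
    elif d > d_target * 1.5:
        beta = beta * 2.0   # KL much larger than target: strengthen the penalty
    return beta
```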

PPO with Clipped Objective

\max_{\theta}\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]

r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
  • Here, r_t(θ) measures how much the new policy changes the probability of taking action a_t in state s_t:
    • If r_t > 1: the action becomes more likely under the new policy.
    • If r_t < 1: the action becomes less likely.
  • We’d like r_t A_t to increase if A_t > 0 (good actions become more probable) and decrease if A_t < 0.
  • But if r_t changes too much, the update becomes unstable, just like in vanilla PG.

We limit r_t(θ) to be in a range:

L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]

Trust Region Policy Optimization (TRPO): don’t move further than δ in KL. Proximal Policy Optimization (PPO): don’t let r_t(θ) drift further than ε from 1.
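A minimal PyTorch sketch of the clipped surrogate objective, assuming per-timestep log-probabilities under the new and old policies and advantage estimates are already available as tensors:

```python
# Clipped surrogate objective L^CLIP (to be maximized).
import torch

def clipped_surrogate(log_probs_new, log_probs_old, advantages, eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages   # keep r_t in [1-eps, 1+eps]
    return torch.min(unclipped, clipped).mean()
```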

PPO in Practice

L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta)-c_1 L_t^{VF}(\theta)+c_2 S[\pi_\theta](s_t)\right]

Here L_t^{CLIP}(θ) is the clipped surrogate objective.

L_t^{VF}(θ) is a squared-error loss for the “critic”: (V_θ(s_t) − V_t^target)^2.

S[π_θ](s_t) is an entropy bonus to ensure sufficient exploration; it encourages diversity of actions.

c_1 and c_2 are trade-off coefficients; in the paper, c_1 = 1 and c_2 = 0.01.
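Putting the pieces together, here is a minimal PyTorch sketch of the combined objective, assuming values, returns, and entropies come from the user’s own actor-critic network and return estimator (an illustrative sketch, not a reference implementation):

```python
# Combined PPO objective: clipped surrogate - c1 * value loss + c2 * entropy bonus.
import torch

def ppo_objective(log_probs_new, log_probs_old, advantages,
                  values, returns, entropy, c1=1.0, c2=0.01, eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)                        # r_t(theta)
    l_clip = torch.min(ratio * advantages,
                       torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    l_vf = ((values - returns) ** 2).mean()                                 # critic squared error
    objective = l_clip - c1 * l_vf + c2 * entropy.mean()                    # quantity to maximize
    return -objective                                                       # negate so an optimizer can minimize it
```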

Summary for Policy Gradient Methods

Trust region policy optimization (TRPO)

  • Optimization problem formulation
  • Natural gradient ascent + monotonic improvement + line search
  • But requires second-order optimization

Proximal policy optimization (PPO)

  • Clipped objective
  • Simple yet effective

Take-away:

  • Proper step size is critical for policy gradient methods
  • Sample efficiency can be improved by using importance sampling