
CSE510 Deep Reinforcement Learning (Lecture 13)

Recap from last lecture

For any differentiable policy $\pi_\theta(s,a)$, and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$,

The policy gradient is

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)\right]$$
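For reference, a minimal sketch of this estimator in code (assuming a tabular softmax policy, illustrative $(s, a, \hat{Q})$ samples, and an arbitrary learning rate; none of this code is from the lecture):

```python
import numpy as np

# Minimal sketch: a tabular softmax policy pi_theta(a|s) and the
# score-function gradient estimate E[ grad log pi(s,a) * Q(s,a) ],
# approximated by averaging over sampled (s, a, Q_hat) triples.

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))          # policy parameters

def policy(s):
    """Softmax policy pi_theta(.|s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    """grad_theta log pi_theta(s,a) for the tabular softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

def policy_gradient_estimate(samples):
    """samples: list of (s, a, Q_hat) triples collected under pi_theta."""
    return np.mean([grad_log_pi(s, a) * q for s, a, q in samples], axis=0)

# Example usage with made-up samples (Q_hat would come from returns or a critic):
samples = [(0, 1, 1.5), (2, 0, -0.3), (0, 2, 0.7)]
theta += 0.1 * policy_gradient_estimate(samples)  # one gradient ascent step
```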

Problems with policy gradient methods

Data Inefficiency

  • On-policy method: for each new policy, we need to generate a completely new trajectory
  • The data is thrown out after just one gradient update
  • As complex neural networks need many updates, this makes the training process very slow

Unstable updates: the step size is very important

  • If the step size is too large:
    • A large step → a bad policy
    • The next batch is generated from the current bad policy → bad samples are collected
    • Bad samples → an even worse policy (in contrast to supervised learning, where the correct labels and data in the following batches can correct the model)
  • If the step size is too small: the learning process is slow

Deriving the optimization objective function of Trust Region Policy Optimization (TRPO)

Objective of Policy Gradient Methods

Policy Objective

$$J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

where $\tau$ is a trajectory generated by the policy $\pi_\theta$.

The new policy objective can be written in terms of the old one:

$$J(\pi_{\theta'})-J(\pi_{\theta})=\mathbb{E}_{\tau \sim \pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_\theta}(s_t,a_t)\right]$$

Equivalently, in more succinct notation:

$$J(\pi')-J(\pi)=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)\right]$$

Proof

Using $A^{\pi}(s_t,a_t)=r_t+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)$,

$$\begin{aligned} \mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t,a_t)\right] &=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t r_t+\sum_{t=0}^{\infty} \gamma^{t+1}V^{\pi}(s_{t+1})-\sum_{t=0}^{\infty} \gamma^{t}V^\pi(s_t)\right]\\ &=J(\pi')+\mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=1}^{\infty} \gamma^{t}V^{\pi}(s_t)-\sum_{t=0}^{\infty} \gamma^{t}V^\pi(s_t)\right]\\ &=J(\pi')-\mathbb{E}_{\tau\sim\pi'}\left[V^{\pi}(s_0)\right]\\ &=J(\pi')-J(\pi), \end{aligned}$$

where the last step uses the fact that the initial state distribution does not depend on the policy, so $\mathbb{E}_{\tau\sim\pi'}[V^{\pi}(s_0)]=J(\pi)$.

Importance Sampling

Estimate an expectation under one distribution by sampling from another distribution:

$$\begin{aligned} \mathbb{E}_{x\sim p}[f(x)]&=\int f(x)p(x)\,dx =\int f(x)\frac{p(x)}{q(x)}q(x)\,dx\\ &=\mathbb{E}_{x\sim q}\left[f(x)\frac{p(x)}{q(x)}\right]\\ &\approx \frac{1}{N}\sum_{i=1,\;x^i\sim q}^N f(x^i)\frac{p(x^i)}{q(x^i)} \end{aligned}$$
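A small numerical sanity check of this identity (the Gaussian distributions $p$, $q$ and the function $f(x)=x^2$ are illustrative choices, not from the lecture):

```python
import numpy as np

# Estimate E_{x~p}[f(x)] using samples drawn from a different
# distribution q, reweighted by the importance weights p(x)/q(x).

rng = np.random.default_rng(0)

def p_pdf(x):           # target distribution p = N(0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q_pdf(x):           # proposal distribution q = N(1, 2^2)
    return np.exp(-0.5 * ((x - 1) / 2) ** 2) / (2 * np.sqrt(2 * np.pi))

f = lambda x: x**2      # E_{x~p}[x^2] = 1 for a standard normal

x = rng.normal(1.0, 2.0, size=200_000)   # samples x ~ q
weights = p_pdf(x) / q_pdf(x)            # importance weights p(x)/q(x)
estimate = np.mean(f(x) * weights)       # should be close to 1.0

print(estimate)
```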

Estimating the objective with importance sampling

Discounted state visitation distribution:

$$d^\pi(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t P(s_t=s\mid\pi)$$

$$\begin{aligned} J(\pi')-J(\pi)&=\mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)\right]\\ &=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\, a\sim \pi'}\left[A^{\pi}(s,a)\right]\\ &=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right] \end{aligned}$$

We approximate this by using the old policy $\pi$ to sample states, instead of the new policy $\pi'$ that we are trying to optimize:

$$L_\pi(\pi')=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]$$
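In code, this surrogate reduces to a probability ratio times the old-policy advantage, averaged over old-policy samples. A minimal sketch (PyTorch-style; the function and its inputs are illustrative, and the constant factor $\frac{1}{1-\gamma}$ is dropped since it does not affect the maximizer):

```python
import torch

# Surrogate objective L_pi(pi'): states and actions come from the *old*
# policy; the new policy enters only through the probability ratio
# pi'(a|s) / pi(a|s), multiplied by the old-policy advantage estimate.

def surrogate_loss(new_logp, old_logp, advantages):
    """new_logp, old_logp: log pi'(a|s) and log pi(a|s) on old-policy data.
    advantages: A^pi(s,a) estimates for the same (s, a) pairs."""
    ratio = torch.exp(new_logp - old_logp)   # pi'(a|s) / pi(a|s)
    return (ratio * advantages).mean()       # quantity to maximize

# Example usage with illustrative numbers:
old_logp = torch.tensor([-1.2, -0.7, -2.0])
new_logp = torch.tensor([-1.0, -0.9, -1.8])
advantages = torch.tensor([0.5, -0.3, 1.2])
print(surrogate_loss(new_logp, old_logp, advantages))
```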

Lower bound on the objective

Note

KL (Kullback-Leibler) divergence is a measure of the difference between two probability distributions.

$$D_{KL}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)=\int_a \pi(a|s)\log \frac{\pi(a|s)}{\pi'(a|s)}\,da$$
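For a discrete action space this is a short computation; a small illustrative example (the probability vectors are made up):

```python
import numpy as np

# KL divergence between two discrete action distributions pi(.|s) and
# pi'(.|s) at a single state, matching the definition above.

def kl_divergence(pi, pi_new):
    pi, pi_new = np.asarray(pi), np.asarray(pi_new)
    return np.sum(pi * np.log(pi / pi_new))

pi_old = [0.5, 0.3, 0.2]
pi_new = [0.4, 0.4, 0.2]
print(kl_divergence(pi_old, pi_new))   # small positive number; 0 iff equal
```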

$$J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)$$

where $C$ is a constant.

Optimizing the objective function:

$$\max_{\pi'}\; J(\pi')-J(\pi)$$

We do this by maximizing the lower bound:

$$\max_{\pi'}\; L_\pi(\pi')-C\max_{s\in S}D_{KL}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)$$

Monotonic Improvement Theorem

Proof of improvement guarantee: Suppose $\pi_{k+1}$ and $\pi_k$ are related by

$$\pi_{k+1}=\arg\max_{\pi'}\; L_{\pi_k}(\pi')-C\max_{s\in S}D_{KL}\big(\pi_k(\cdot|s)\,\|\,\pi'(\cdot|s)\big)$$

$\pi_{k}$ is a feasible point, and the objective at $\pi_k$ is equal to $0$:

$$L_{\pi_k}(\pi_{k})\propto \mathbb{E}_{s\sim d^{\pi_k},\,a\sim \pi_k}\left[A^{\pi_k}(s,a)\right]=0,\qquad D_{KL}\big(\pi_k(\cdot|s)\,\|\,\pi_k(\cdot|s)\big)=0\quad \forall s\in S$$

The optimal value of the objective is therefore $\geq 0$.

By the performance bound, $J(\pi_{k+1})-J(\pi_k)\geq 0$.

Final objective function

$$\max_{\pi'}\;\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]-C\max_{s\in S}D_{KL}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)$$

Approximating the maximum over states with an expectation:

$$\max_{\pi'}\;\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]-C\,\mathbb{E}_{s\sim d^{\pi}}\left[D_{KL}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)\right]$$

By Lagrangian duality, this penalized objective is mathematically equivalent to the following constrained form with a trust region:

$$\max_{\pi'}\; L_\pi(\pi')$$

such that

$$\mathbb{E}_{s\sim d^{\pi}}\left[D_{KL}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)\right]\leq \delta$$

$C$ becomes very large when $\gamma$ is close to one, and the corresponding gradient step size becomes too small:

$$C\propto \frac{\epsilon \gamma}{(1-\gamma)^2}$$
  • Empirical results show that the penalty coefficient needs to be more adaptive.
  • But tuning $C$ is hard (it needs a trick like the adaptive KL penalty used in PPO; see the sketch below).
  • TRPO instead uses a trust region constraint and makes $\delta$ a tunable hyperparameter.
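As a point of reference for the "trick" mentioned above, the adaptive-KL variant of PPO adjusts the penalty coefficient based on the measured KL; a sketch (the target $d_{targ}$ and the factors $1.5$ and $2$ are the commonly used heuristics, not values from this lecture):

```python
# Illustrative sketch of an adaptive KL-penalty coefficient, in the
# spirit of the adaptive-KL variant of PPO (not part of TRPO itself).

def update_penalty_coef(beta, measured_kl, d_targ):
    if measured_kl > 1.5 * d_targ:
        beta *= 2.0          # updates too large -> penalize KL more
    elif measured_kl < d_targ / 1.5:
        beta /= 2.0          # updates too conservative -> penalize less
    return beta
```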

Trust Region Policy Optimization (TRPO)

$$\max_{\pi'}\; L_\pi(\pi')$$

such that

$$\mathbb{E}_{s\sim d^{\pi}}\left[D_{KL}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)\right]\leq \delta$$

Make a linear approximation to $L_{\pi_{\theta_{old}}}$ and a quadratic approximation to the KL term.

Maximize

$$g^\top(\theta-\theta_{old})-\frac{\beta}{2}(\theta-\theta_{old})^\top F(\theta-\theta_{old})$$

where $g=\frac{\partial}{\partial \theta}L_{\pi_{\theta_{old}}}(\pi_{\theta})\big\vert_{\theta=\theta_{old}}$ and $F=\frac{\partial^2}{\partial \theta^2}\overline{KL}_{\pi_{\theta_{old}}}(\pi_{\theta})\big\vert_{\theta=\theta_{old}}$, with $\overline{KL}$ denoting the KL divergence averaged over states.
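In the trust-region form, maximizing the linearized objective $g^\top d$ subject to the quadratic KL estimate $\frac{1}{2}d^\top F d\leq\delta$ gives the natural-gradient step $d=\sqrt{2\delta/(x^\top F x)}\,x$ with $x=F^{-1}g$. A minimal NumPy sketch, assuming $g$ and $F$ have already been estimated (practical implementations use conjugate gradient with Fisher-vector products instead of forming $F$, plus a backtracking line search; the toy numbers are illustrative):

```python
import numpy as np

# Minimal sketch of the TRPO step, given estimates of the surrogate
# gradient g and the KL Hessian (Fisher matrix) F at theta_old.

def trpo_step(theta_old, g, F, delta=0.01):
    x = np.linalg.solve(F, g)                        # natural gradient direction F^{-1} g
    step_size = np.sqrt(2.0 * delta / (x @ F @ x))   # largest step with quadratic KL ~ delta
    return theta_old + step_size * x

# Example usage on a toy 2-parameter problem:
theta_old = np.zeros(2)
g = np.array([0.5, -0.2])
F = np.array([[2.0, 0.1], [0.1, 1.0]])
theta_new = trpo_step(theta_old, g, F)
```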

Taylor Expansion of KL Term

$$D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\approx D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta_{old}})+d^\top \nabla_\theta D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}+\frac{1}{2}d^\top \nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}\,d$$

where $d=\theta-\theta_{old}$. The zeroth-order term $D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta_{old}})$ is zero, and the first-order term also vanishes:

$$\begin{aligned} \nabla_\theta D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}&=-\nabla_\theta \mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\log P_\theta(x)\right]\big\vert_{\theta=\theta_{old}}\\ &=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\nabla_\theta \log P_\theta(x)\right]\big\vert_{\theta=\theta_{old}}\\ &=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\frac{1}{P_{\theta_{old}}(x)}\nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\right]\\ &=-\int_x P_{\theta_{old}}(x)\frac{1}{P_{\theta_{old}}(x)}\nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\, dx\\ &=-\int_x \nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\, dx\\ &=-\nabla_\theta \int_x P_\theta(x)\, dx\,\Big\vert_{\theta=\theta_{old}}\\ &=0 \end{aligned}$$

Only the quadratic term remains, and its coefficient is the Fisher information matrix:

$$\begin{aligned} \nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\nabla_\theta^2 \log P_\theta(x)\right]\big\vert_{\theta=\theta_{old}}\\ &=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\nabla_\theta \left(\frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\right)\right]\bigg\vert_{\theta=\theta_{old}}\\ &=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\frac{\nabla_\theta^2 P_\theta(x)}{P_\theta(x)}-\frac{\nabla_\theta P_\theta(x)\,\nabla_\theta P_\theta(x)^\top}{P_\theta(x)^2}\right]\bigg\vert_{\theta=\theta_{old}}\\ &=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\frac{\nabla_\theta^2 P_\theta(x)\big\vert_{\theta=\theta_{old}}}{P_{\theta_{old}}(x)}\right]+\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\nabla_\theta \log P_\theta(x)\,\nabla_\theta \log P_\theta(x)^\top\right]\big\vert_{\theta=\theta_{old}}\\ &=\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\nabla_\theta\log P_\theta(x)\,\nabla_\theta\log P_\theta(x)^\top\right]\big\vert_{\theta=\theta_{old}} \end{aligned}$$

The first term in the fourth line vanishes because $\mathbb{E}_{x\sim \pi_{\theta_{old}}}\!\left[\nabla_\theta^2 P_\theta(x)/P_{\theta_{old}}(x)\right]=\nabla_\theta^2 \int_x P_\theta(x)\,dx=0$, by the same argument as above.
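The identity above says that $F$ is the Fisher information matrix. This can be checked numerically on a small example; the 3-outcome softmax distribution below is an illustrative choice:

```python
import numpy as np

# Numerical check that the Hessian of D_KL(P_theta_old || P_theta) at
# theta = theta_old equals E_{x ~ P_theta_old}[grad log P grad log P^T]
# (the Fisher information), using a 3-outcome softmax distribution.

def probs(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def kl(theta_old, theta):
    p, q = probs(theta_old), probs(theta)
    return np.sum(p * np.log(p / q))

theta_old = np.array([0.2, -0.5, 0.1])
p = probs(theta_old)

# Fisher matrix: for a softmax, grad_theta log p_theta(x) = e_x - p,
# so E[grad log p grad log p^T] = diag(p) - p p^T.
fisher = np.diag(p) - np.outer(p, p)

# Finite-difference Hessian of the KL with respect to theta at theta_old.
eps, n = 1e-4, len(theta_old)
hess = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
        hess[i, j] = (kl(theta_old, theta_old + ei + ej)
                      - kl(theta_old, theta_old + ei - ej)
                      - kl(theta_old, theta_old - ei + ej)
                      + kl(theta_old, theta_old - ei - ej)) / (4 * eps**2)

print(np.max(np.abs(hess - fisher)))   # ~ 0 up to finite-difference error
```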