CSE510 Deep Reinforcement Learning (Lecture 16)

Deterministic Policy Gradient (DPG)

Learning Deterministic Policies

  • Deterministic policy gradients [Silver et al., ICML 2014]
    • Explicitly learn a deterministic policy.
    • $a = \mu_\theta(s)$
  • Advantages
    • An optimal deterministic policy exists for MDPs.
    • Naturally handles continuous action spaces.
    • Expected to be more efficient than learning stochastic policies:
      • The stochastic policy gradient integrates over both the state and action spaces, so estimating it requires more samples.
      • The deterministic policy gradient integrates over the state space only.

Deterministic Policy Gradient

The objective function is:

$$J(\theta)=\int_{s\in S} \rho^{\mu}(s)\, r(s,\mu_\theta(s))\, ds$$

where $\rho^{\mu}(s)$ is the (discounted) stationary state distribution under the policy $\mu_\theta$.

By the deterministic policy gradient theorem, the gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta Q^{\mu_\theta}(s,\mu_\theta(s))\big]=\mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\big\vert_{a=\mu_\theta(s)}\big]$$
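In practice this gradient is computed by backpropagating through the critic into the actor: automatic differentiation chains $\nabla_a Q(s,a)\vert_{a=\mu_\theta(s)}$ with $\nabla_\theta \mu_\theta(s)$. A minimal PyTorch sketch, assuming a learned critic that approximates $Q^{\mu_\theta}$; the network names, sizes, and placeholder batch are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

# Hypothetical actor and critic networks (names and sizes are illustrative).
state_dim, action_dim = 8, 2
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1))
actor_opt = torch.optim.Adam(mu_net.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)    # placeholder batch of states s ~ rho^mu
actions = mu_net(states)               # a = mu_theta(s)
q_values = q_net(torch.cat([states, actions], dim=1))

# Gradient ascent on E[Q(s, mu_theta(s))]: backprop applies the chain rule
# grad_theta mu_theta(s) * grad_a Q(s, a) |_{a = mu_theta(s)}.
actor_loss = -q_values.mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```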

Issues for DPG

The formulation so far can only use on-policy data.

A deterministic policy by itself can hardly guarantee sufficient exploration.

  • Solution: Off-policy training using a stochastic behavior policy.

Off-Policy Deterministic Policy Gradient (Off-DPG)

Use a stochastic behavior policy $\beta(a|s)$. The modified objective function is:

$$J(\mu_\theta)=\int_{s\in S} \rho^{\beta}(s)\, Q^{\mu_\theta}(s,\mu_\theta(s))\, ds$$

The gradients are:

$$\begin{aligned} \nabla_\theta J(\mu_\theta) &\approx \int_{s\in S} \rho^{\beta}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\big\vert_{a=\mu_\theta(s)}\, ds\\ &= \mathbb{E}_{s\sim \rho^{\beta}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\big\vert_{a=\mu_\theta(s)}\big] \end{aligned}$$

Importance sampling is avoided in the actor because the gradient no longer involves an integral over actions.

Policy Evaluation in DPG

Importance sampling can also be avoided in the critic.

Gradient-TD-style algorithms can be applied directly to the critic, which minimizes the squared TD error:

$$\mathcal{L}_{\text{critic}}(w) = \mathbb{E}\big[\big(r_t+\gamma Q^w(s_{t+1},a_{t+1})-Q^w(s_t,a_t)\big)^2\big]$$

Off-Policy Deterministic Actor-Critic

$$\delta_t=r_t+\gamma Q^w(s_{t+1},a_{t+1})-Q^w(s_t,a_t)$$

$$w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^w(s_t,a_t)$$

$$\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^{\mu_\theta}(s_t,a_t)\big\vert_{a=\mu_\theta(s_t)}$$
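To make these update rules concrete, here is a minimal sketch with linear function approximation. The feature map `phi`, the one-dimensional action, and the choice $a_{t+1}=\mu_\theta(s_{t+1})$ for the bootstrap target are simplifying assumptions for illustration only, not details from the lecture.

```python
import numpy as np

# Minimal sketch: linear critic Q^w(s, a) = w . phi(s, a) and
# linear deterministic actor mu_theta(s) = theta . s with a 1-D action.
state_dim = 4
alpha_w, alpha_theta, gamma = 0.01, 0.001, 0.99
w = np.zeros(state_dim + 1)       # critic weights over features phi(s, a)
theta = np.zeros(state_dim)       # actor weights

def phi(s, a):
    return np.concatenate([s, [a]])   # simple state-action features (assumption)

def mu(s):
    return float(theta @ s)           # deterministic action a = mu_theta(s)

def q(s, a):
    return float(w @ phi(s, a))       # Q^w(s, a)

def update(s, a, r, s_next):
    """One off-policy deterministic actor-critic update from a transition."""
    global w, theta
    a_next = mu(s_next)                              # bootstrap with the current policy
    delta = r + gamma * q(s_next, a_next) - q(s, a)  # TD error delta_t
    w = w + alpha_w * delta * phi(s, a)              # critic: semi-gradient TD update
    # Actor: grad_theta mu_theta(s) = s and grad_a Q^w(s, a) = w[-1]
    # (the coefficient of a in the features), evaluated at a = mu_theta(s).
    theta = theta + alpha_theta * s * w[-1]
```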

Deep Deterministic Policy Gradient (DDPG)

Insights from DQN + Deterministic Policy Gradients

  • Use a replay buffer.
  • Critic is updated every timestep (sample a minibatch from the buffer):

$$\mathcal{L}_{\text{critic}}(w) = \mathbb{E}\big[\big(r_t+\gamma Q^w(s_{t+1},a_{t+1})-Q^w(s_t,a_t)\big)^2\big]$$

Actor is updated every timestep:

$$\nabla_a Q(s_t,a;w)\big\vert_{a=\mu_\theta(s_t)}\, \nabla_\theta \mu_\theta(s_t)$$

Target networks are updated at every timestep with a soft update:

$$w'_{t+1} = \tau\, w_t + (1-\tau)\, w'_t \qquad \theta'_{t+1} = \tau\, \theta_t + (1-\tau)\, \theta'_t$$

Exploration: add noise to the action selection: $a_t = \mu_\theta(s_t) + \mathcal{N}_t$

Batch normalization is used when training the networks. A sketch that puts these pieces together follows.
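A minimal sketch of one DDPG update, assuming batched tensors `s`, `a`, `r`, `s_next`, `done` sampled from a replay buffer (with `r` and `done` of shape `(batch, 1)`); following the DDPG paper, the critic target is built from the target networks $Q^{w'}$ and $\mu_{\theta'}$. Network names and sizes are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim, gamma, tau = 8, 2, 0.99, 0.005
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done):
    # Critic: regress Q(s, a) onto the bootstrap target built from the target networks.
    with torch.no_grad():
        a_next = actor_targ(s_next)
        y = r + gamma * (1 - done) * critic_targ(torch.cat([s_next, a_next], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, mu_theta(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks: w' <- tau * w + (1 - tau) * w'.
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)
```

Exploration noise is added only when acting in the environment, not inside this update.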

Extension of DDPG

Overestimation bias is an issue in Q-learning: maximizing over a noisy value estimate systematically overestimates the true value. The DDPG actor update performs a similar (approximate) maximization of the learned critic:

$$\text{DDPG:}\quad \nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\big\vert_{a=\mu_\theta(s)}\big]$$

Double DQN is not enough

Because the policy changes slowly in an actor-critic setting,

  • the current and target value estimates remain too similar to avoid maximization bias.
  • Target value in the Double DQN style: $r_t + \gamma Q^{w'}(s_{t+1},\mu_\theta(s_{t+1}))$

TD3: Twin Delayed Deep Deterministic policy gradient

Address overestimation bias:

  • Double Q-learning is unbiased in tabular settings, but slight overestimation remains with function approximation.

$$y_1 = r + \gamma Q^{\theta_2'}(s', \pi_{\phi_1}(s')) \qquad y_2 = r + \gamma Q^{\theta_1'}(s', \pi_{\phi_2}(s'))$$

It is possible that $Q^{\theta_2}(s, \pi_{\phi_1}(s)) > Q^{\theta_1}(s, \pi_{\phi_1}(s))$.

Clipped double Q-learning:

$$y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_1}(s'))$$
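A minimal sketch of this clipped target, assuming two target critics and a target actor as hypothetical PyTorch modules that take concatenated state-action inputs:

```python
import torch

def clipped_double_q_target(r, s_next, done, actor_targ, q1_targ, q2_targ, gamma=0.99):
    # Bootstrap from the smaller of the two target critics to suppress overestimation.
    with torch.no_grad():
        a_next = actor_targ(s_next)
        sa_next = torch.cat([s_next, a_next], dim=1)
        q_min = torch.min(q1_targ(sa_next), q2_targ(sa_next))
        return r + gamma * (1 - done) * q_min
```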

High-variance estimates provide a noisy gradient.

Techniques in TD3 to reduce the variance:

  • Update the policy at a lower frequency than the value network.
  • Smoothing the value estimate: $y=r+\gamma\, \mathbb{E}_{\epsilon}\big[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)\big]$

Update target:

$$y=r+\gamma\, \mathbb{E}_{\epsilon}\big[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)\big]$$

where $\epsilon\sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c)$.
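Putting the two TD3 target tricks together: a sketch that adds clipped Gaussian noise to the target action (target policy smoothing) before taking the clipped double-Q minimum. The hyperparameters `sigma`, `c`, and `act_limit` and the module names are illustrative assumptions.

```python
import torch

def td3_target(r, s_next, done, actor_targ, q1_targ, q2_targ,
               gamma=0.99, sigma=0.2, c=0.5, act_limit=1.0):
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise.
        a_targ = actor_targ(s_next)
        noise = torch.clamp(sigma * torch.randn_like(a_targ), -c, c)
        a_next = torch.clamp(a_targ + noise, -act_limit, act_limit)
        sa_next = torch.cat([s_next, a_next], dim=1)
        # Clipped double Q: bootstrap from the smaller of the two target critics.
        q_min = torch.min(q1_targ(sa_next), q2_targ(sa_next))
        return r + gamma * (1 - done) * q_min
```

The remaining TD3 ingredient, delayed policy updates, simply updates the actor and the target networks only once every few critic updates.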

Other methods

  • Generalizable Episodic Memory for Deep Reinforcement Learning
  • Distributed Distributional Deep Deterministic Policy Gradient
    • Distributional critic
    • N-step returns are used to update the critic
    • Multiple distributed parallel actors
    • Prioritized experience replay