CSE510 Deep Reinforcement Learning (Lecture 12)

Policy Gradient Theorem

For any differentiable policy \pi_\theta(s,a), and for any of the policy objective functions J = J_1, J_{avR}, or \frac{1}{1-\gamma} J_{avV},

the policy gradient is

\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]
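As an illustration of this expectation (not part of the lecture slides), here is a minimal sketch for a linear softmax policy: grad_log_pi computes the score \nabla_\theta \log \pi_\theta(s,a), and the gradient is estimated by averaging over sampled state-action pairs, with sampled values q standing in for Q^{\pi_\theta}(s,a). The feature matrix phi_s and the sample format are assumptions of the example.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi_theta(.|s) for a linear softmax policy; phi_s has shape (num_actions, d)."""
    logits = phi_s @ theta
    logits = logits - logits.max()          # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def grad_log_pi(theta, phi_s, a):
    """Score function: grad_theta log pi_theta(a|s) = phi(s,a) - sum_b pi_theta(b|s) phi(s,b)."""
    pi = softmax_policy(theta, phi_s)
    return phi_s[a] - pi @ phi_s

def policy_gradient_estimate(theta, samples):
    """Monte Carlo estimate of E_pi[grad_theta log pi_theta(s,a) Q^pi(s,a)],
    where each sample is (phi_s, a, q) and q approximates Q^pi(s,a)."""
    return np.mean([grad_log_pi(theta, phi_s, a) * q for phi_s, a, q in samples], axis=0)
```

A gradient ascent step would then be theta = theta + alpha * policy_gradient_estimate(theta, samples).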

Policy Gradient Methods

Advantages of Policy-Based RL

Advantages:

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies

Disadvantages:

  • Typically converge to a local rather than global optimum
  • Evaluating a policy is typically inefficient and has high variance

Actor-Critic Methods

Q Actor-Critic

Reducing Variance Using a Critic

Monte-Carlo Policy Gradient still has high variance.

We use a critic to estimate the action-value function Q_w(s,a)\approx Q^{\pi_\theta}(s,a).

Actor-critic algorithms maintain two sets of parameters:

Critic: updates action-value function parameters w

Actor: updates policy parameters \theta, in the direction suggested by the critic.

Actor-critic algorithms follow an approximate policy gradient:

\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)\right]
\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)

Action-Value Actor-Critic

  • Simple actor-critic algorithm based on action-value critic
  • Using linear value function approximation Q_w(s,a)=\phi(s,a)^\top w

Critic: updates w by linear TD(0)
Actor: updates \theta by policy gradient

def q_actor_critic(s, theta, w, alpha, beta, gamma, num_steps):
    """Q Actor-Critic (QAC) with a linear critic Q_w(s,a) = phi(s,a)^T w.

    Helper functions sample_action, step_environment, Q_w, grad_log_pi, and phi
    are assumed to be defined elsewhere."""
    a = sample_action(s, theta)                   # a ~ pi_theta(.|s)
    for _ in range(num_steps):
        r, s_next = step_environment(s, a)        # sample reward and transition
        a_next = sample_action(s_next, theta)     # a' ~ pi_theta(.|s')
        # TD(0) error for the action-value critic
        delta = r + gamma * Q_w(s_next, a_next, w) - Q_w(s, a, w)
        # Actor: policy gradient step in the direction suggested by the critic
        theta = theta + alpha * grad_log_pi(s, a, theta) * Q_w(s, a, w)
        # Critic: linear TD(0) update of w
        w = w + beta * delta * phi(s, a)
        s, a = s_next, a_next
    return theta, w

Advantage Actor-Critic

Reducing variance using a baseline

  • We subtract a baseline function B(s) from the policy gradient
  • This can reduce the variance without changing the expectation
\begin{aligned} \mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)\right]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\ &=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\ &=0 \end{aligned}

A good baseline is the state value function B(s)=V^{\pi_\theta}(s)

So we can rewrite the policy gradient using the advantage function A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)

\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)\right]
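As a quick numerical check of the baseline identity above (toy numbers; the random features and the linear softmax policy are assumptions of the example), the expectation of \nabla_\theta \log \pi_\theta(s,a)B(s) over a \sim \pi_\theta comes out as the zero vector for any fixed state:

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, d = 4, 3
theta = rng.normal(size=d)
phi_s = rng.normal(size=(num_actions, d))   # assumed features phi(s,.) for one fixed state s
baseline = 2.5                              # any value of a state-dependent baseline B(s)

logits = phi_s @ theta
pi = np.exp(logits - logits.max())
pi = pi / pi.sum()                          # pi_theta(.|s) for a linear softmax policy

# Rows of `score` are grad_theta log pi_theta(a|s) = phi(s,a) - sum_b pi_theta(b|s) phi(s,b)
score = phi_s - pi @ phi_s

# E_{a ~ pi_theta}[ grad_theta log pi_theta(s,a) * B(s) ]
expectation = (pi[:, None] * score * baseline).sum(axis=0)
print(expectation)                          # ~ [0. 0. 0.] up to floating-point error
```
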
Estimating the Advantage function

Method 1: direct estimation

This may increase the variance.

The advantage function can significantly reduce the variance of the policy gradient

So the critic should really estimate the advantage function

For example, by estimating both V^{\pi_\theta}(s) and Q^{\pi_\theta}(s,a)

Using two function approximators and two parameter vectors,

V_v(s)\approx V^{\pi_\theta}(s)\\ Q_w(s,a)\approx Q^{\pi_\theta}(s,a)\\ A(s,a)=Q_w(s,a)-V_v(s)

And updating both value functions by e.g. TD learning
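A minimal sketch of this two-approximator setup with linear critics; the feature maps phi_v and phi_q, the transition format, and the single shared step size beta are assumptions made for illustration:

```python
def two_critic_td_update(v, w, phi_v, phi_q, transition, gamma, beta):
    """TD(0)-style updates for V_v(s) = phi_v(s)^T v and Q_w(s,a) = phi_q(s,a)^T w,
    followed by the advantage estimate A(s,a) = Q_w(s,a) - V_v(s).

    transition = (s, a, r, s_next, a_next), with a_next sampled from the current policy;
    v, w and the feature vectors are numpy arrays.
    """
    s, a, r, s_next, a_next = transition

    # TD targets for the two critics
    v_target = r + gamma * (phi_v(s_next) @ v)
    q_target = r + gamma * (phi_q(s_next, a_next) @ w)

    # Semi-gradient TD(0) updates
    v = v + beta * (v_target - phi_v(s) @ v) * phi_v(s)
    w = w + beta * (q_target - phi_q(s, a) @ w) * phi_q(s, a)

    # Advantage estimate handed to the actor
    advantage = phi_q(s, a) @ w - phi_v(s) @ v
    return v, w, advantage
```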

Method 2: using the TD error

We can prove that the TD error is an unbiased estimate of the advantage function.

For the true value function V^{\pi_\theta}(s), the TD error \delta^{\pi_\theta}

\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)

is an unbiased estimate of the advantage function

\begin{aligned} \mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}| s,a]&=\mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') |s,a]-V^{\pi_\theta}(s)\\ &=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\ &=A^{\pi_\theta}(s,a) \end{aligned}

So we can use the TD error to compute the policy gradient

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}]

In practice, we can use an approximate TD error \delta_v=r+\gamma V_v(s')-V_v(s) to compute the policy gradient
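A minimal sketch of the resulting TD actor-critic update; the linear critic V_v(s)=\phi_v(s)^\top v and the grad_log_pi helper (the actor's score function) are assumptions for illustration. Only a state-value critic is needed, and the same \delta_v drives both the actor and the critic:

```python
def td_actor_critic_step(theta, v, transition, alpha, beta, gamma, phi_v, grad_log_pi):
    """One TD actor-critic update: the approximate TD error delta_v
    stands in for the advantage in the policy-gradient step."""
    s, a, r, s_next = transition

    # Approximate TD error: delta_v = r + gamma * V_v(s') - V_v(s)
    delta = r + gamma * (phi_v(s_next) @ v) - phi_v(s) @ v

    # Actor: theta <- theta + alpha * grad_theta log pi_theta(s,a) * delta_v
    theta = theta + alpha * grad_log_pi(s, a, theta) * delta

    # Critic: semi-gradient TD(0) update of V_v
    v = v + beta * delta * phi_v(s)
    return theta, v
```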

Summary of policy gradient algorithms

The policy gradient has many equivalent forms.

\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) v_t] && \text{REINFORCE} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)] && \text{Q Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)] && \text{Advantage Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}] && \text{TD Actor-Critic} \end{aligned}

Each leads to a stochastic gradient ascent algorithm.

The critic uses policy evaluation to estimate Q^\pi(s,a), A^\pi(s,a), or V^\pi(s).

Compatible Function Approximation

If the following two conditions are satisfied:

  1. The value function approximator is compatible with the policy: \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)
  2. The value function parameters w minimize the mean-squared error \epsilon = \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s,a)-Q_w(s,a))^2]. Note that \epsilon need not be zero; it only needs to be minimized.

Then the policy gradient is exact

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)]

Remember:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)]
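To see why the two expressions coincide (a short derivation following the standard argument; it is not spelled out above): at a minimum of \epsilon the gradient with respect to w vanishes, and substituting condition 1 into that stationarity condition gives

\begin{aligned} \nabla_w \epsilon = 0 &\Rightarrow \mathbb{E}_{\pi_\theta}\left[(Q^{\pi_\theta}(s,a)-Q_w(s,a))\nabla_w Q_w(s,a)\right]=0\\ &\Rightarrow \mathbb{E}_{\pi_\theta}\left[(Q^{\pi_\theta}(s,a)-Q_w(s,a))\nabla_\theta \log \pi_\theta(s,a)\right]=0\\ &\Rightarrow \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)\right]=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]=\nabla_\theta J(\theta) \end{aligned}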

Challenges with Policy Gradient Methods

  • Data Inefficiency
    • On-policy method: for each new policy, we need to generate a completely new trajectory
    • The data is thrown out after just one gradient update
    • As complex neural networks need many updates, this makes the training process very slow
  • Unstable update: step size is very important
    • If step size is too large:
      • Large step -> bad policy
      • Next batch is generated from current bad policy -> collect bad samples
      • Bad samples -> worse policy (compared to supervised learning, where the correct labels and data in subsequent batches may correct it)
    • If step size is too small: the learning process is slow