
CSE510 Deep Reinforcement Learning (Lecture 11)

Materials Used

  • Much of the material and slides for this lecture were taken from Chapter 13 of the Sutton & Barto textbook.

  • Some slides are borrowed from Rich Sutton’s RL class and David Silver’s Deep RL tutorial

Problem: often the feature-based policies that work well (win games, maximize utilities) aren’t the ones that approximate V / Q best

  • Q-learning’s priority: get Q-values close (modeling)
  • Action selection priority: get ordering of Q-values right (prediction)

Value functions can often be much more complex to represent than the corresponding policy

  • Do we really care about knowing $Q(s,\text{left}) = 0.3554$ and $Q(s,\text{right}) = 0.533$, or just that right is better than left in state $s$?

Motivates searching directly in a parameterized policy space

  • Bypass learning value function and “directly” optimize the value of a policy

Examples

Rock-Paper-Scissors

  • Two-player game of rock-paper-scissors
    • Scissors beats paper
    • Rock beats scissors
    • Paper beats rock
  • Consider policies for iterated rock-paper-scissors
    • A deterministic policy is easily exploited
    • A uniform random policy is optimal (i.e., Nash equilibrium)
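
As a quick numerical check of the last claim, the following minimal sketch (the payoff-matrix encoding and variable names are illustrative, not from the lecture) shows that against the uniform random policy every pure strategy earns the same expected payoff, so it cannot be exploited, while a deterministic policy can be:

```python
import numpy as np

# Payoff matrix for the row player: rows/cols are (rock, paper, scissors),
# entries are +1 win, 0 draw, -1 loss.
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

uniform = np.array([1/3, 1/3, 1/3])

# Expected payoff of each pure strategy against the uniform policy.
# All three are 0, so no opponent response does better than any other.
print(payoff @ uniform)          # -> [0. 0. 0.]

# A deterministic policy (always rock) is exploited by "always paper".
always_rock = np.array([1.0, 0.0, 0.0])
print(payoff[1] @ always_rock)   # opponent plays paper -> +1 every round
```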

Partially Observable GridWorld

The agent cannot differentiate between the two grey states.

Consider features of the following form (for all $N, E, S, W$ actions):

$$\phi(s,a) = \mathbf{1}(\text{wall to } N,\ a = \text{move } E)$$

Compare value-based RL, using an approximate value function

$$Q_\theta(s,a) = f(\phi(s,a), \theta)$$

To policy-based RL, using a parameterized policy

$$\pi_\theta(s,a) = g(\phi(s,a), \theta)$$

Under aliasing, an optimal deterministic policy will either

  • move $W$ in both grey states (shown by red arrows)
  • move $E$ in both grey states

Either way, it can get stuck and never reach the money

  • Value-based RL learns a near-deterministic policy
    • e.g. greedy or $\epsilon$-greedy

So it will traverse the corridor for a long time

An optimal stochastic policy will randomly move $E$ or $W$ in the grey cells.

$$\pi_\theta(\text{wall to } N \text{ and } S, \text{move } E) = 0.5$$
$$\pi_\theta(\text{wall to } N \text{ and } S, \text{move } W) = 0.5$$

It will reach the goal state in a few steps with high probability

Policy-based RL can learn the optimal stochastic policy

RL via Policy Gradient Ascent

The policy gradient approach has the following schema:

  1. Select a space of parameterized policies (i.e., function class)
  2. Compute the gradient of the value of current policy wrt parameters
  3. Move parameters in the direction of the gradient
  4. Repeat these steps until we reach a local maximum

So we must answer the following questions:

  • How should we represent and evaluate parameterized policies?
  • How can we compute the gradient?

Policy learning objective

Goal: given a policy $\pi_\theta(s,a)$ with parameters $\theta$, find the best $\theta$

In episodic environments we can use the start value:

$$J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$$

In continuing environments we can use the average value:

$$J_{avV}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$$

Or the average reward per time-step:

$$J_{avR}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) \sum_{a\in A} \pi_\theta(s,a)\, \mathcal{R}(s,a)$$

where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by policy $\pi_\theta$.
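
To make these objectives concrete, here is a minimal sketch (assuming a small tabular MDP with hypothetical transition probabilities `P`, rewards `R`, and a fixed stochastic policy `pi`, none of which come from the lecture) that computes the stationary distribution $d^{\pi_\theta}$ and the average-reward objective $J_{avR}(\theta)$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                  # R[s, a] expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],                 # pi[s, a] action probabilities
               [0.4, 0.6]])

# State-to-state transition matrix under the policy: P_pi[s, s'] = sum_a pi(s,a) P(s'|s,a)
P_pi = np.einsum('sa,sax->sx', pi, P)

# Stationary distribution d^pi: left eigenvector of P_pi with eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d = d / d.sum()

# Average reward per time-step: J_avR = sum_s d(s) sum_a pi(s,a) R(s,a)
J_avR = np.sum(d * np.sum(pi * R, axis=1))
print(d, J_avR)
```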

Policy optimization

Policy-based reinforcement learning is an optimization problem:

Find $\theta$ that maximises $J(\theta)$

Some approaches do not use gradients:

  • Hill climbing
  • Simplex / amoeba / Nelder Mead
  • Genetic algorithms

Greater efficiency is often possible using gradient information:

  • Gradient descent
  • Conjugate gradient
  • Quasi-Newton

We focus on gradient ascent; many extensions are possible

And on methods that exploit sequential structure

Policy gradient

Let $J(\theta)$ be any policy objective function.

Policy gradient algorithms search for a local maximum of $J(\theta)$ by ascending the gradient of the objective with respect to the policy parameters $\theta$:

$$\Delta\theta = \alpha \nabla_\theta J(\theta)$$

where $\nabla_\theta J(\theta)$ is the policy gradient

$$\nabla_\theta J(\theta) = \begin{pmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{pmatrix}$$

and $\alpha$ is the step-size parameter.
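
Before deriving the analytic gradient, note that $\nabla_\theta J(\theta)$ can also be estimated numerically by perturbing each parameter in turn. A minimal finite-difference sketch, assuming a user-supplied (hypothetical) routine `estimate_J(theta)` that returns a Monte-Carlo estimate of $J(\theta)$:

```python
import numpy as np

def finite_difference_gradient(estimate_J, theta, eps=1e-2):
    """Estimate the policy gradient by perturbing each parameter separately.

    estimate_J: callable returning a (possibly noisy) estimate of J(theta).
    Uses one evaluation of J per dimension of theta: simple but slow and noisy.
    """
    grad = np.zeros_like(theta)
    J0 = estimate_J(theta)
    for k in range(len(theta)):
        theta_k = theta.copy()
        theta_k[k] += eps
        grad[k] = (estimate_J(theta_k) - J0) / eps
    return grad

# One gradient-ascent step would then be: theta <- theta + alpha * grad
# theta = theta + alpha * finite_difference_gradient(estimate_J, theta)
```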

Policy gradient methods

The main method we will introduce is Monte-Carlo policy gradient (REINFORCE).

Score Function

Assume the policy $\pi_\theta$ is differentiable and non-zero, and that we know the gradient $\nabla_\theta \pi_\theta(s,a)$ for all $s\in S$ and $a\in A$.

We can compute the policy gradient analytically

The likelihood ratio trick rewrites the gradient as:

$$\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a)\, \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)$$

The Score Function is:

$$\nabla_\theta \log \pi_\theta(s,a)$$

Example

Take the softmax policy as an example:

Weight actions using a linear combination of features $\phi(s,a)^\top\theta$.

Probability of an action is proportional to the exponentiated weight:

$$\pi_\theta(s,a) \propto \exp(\phi(s,a)^\top\theta)$$

The score function is

$$\begin{aligned} \nabla_\theta \ln\left[\frac{\exp(\phi(s,a)^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)}\right] &= \nabla_\theta\left(\phi(s,a)^\top\theta\right) - \nabla_\theta \ln \sum_{a'\in A}\exp(\phi(s,a')^\top\theta) \\ &= \phi(s,a) - \frac{\sum_{a'\in A}\phi(s,a')\exp(\phi(s,a')^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)} \\ &= \phi(s,a) - \sum_{a'\in A}\pi_\theta(s,a')\,\phi(s,a') \\ &= \phi(s,a) - \mathbb{E}_{a'\sim\pi_\theta(s,\cdot)}[\phi(s,a')] \end{aligned}$$
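
The last line can be computed directly. A minimal sketch (feature values and parameters are illustrative) that evaluates the softmax probabilities and the score $\phi(s,a) - \mathbb{E}_{a'\sim\pi_\theta}[\phi(s,a')]$:

```python
import numpy as np

def softmax_score(phi_sa, theta, a):
    """Score function for a linear softmax policy.

    phi_sa: array of shape (num_actions, num_features), rows are phi(s, a').
    Returns grad_theta log pi_theta(s, a) = phi(s,a) - E_{a'~pi}[phi(s,a')].
    """
    logits = phi_sa @ theta
    probs = np.exp(logits - logits.max())      # subtract max for numerical stability
    probs /= probs.sum()
    return phi_sa[a] - probs @ phi_sa

# Illustrative numbers only.
phi_sa = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([0.5, -0.2])
print(softmax_score(phi_sa, theta, a=0))
```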

In continuous action spaces, a Gaussian policy is natural

Mean is a linear combination of state features: $\mu(s) = \phi(s)^\top\theta$

Variance may be fixed at $\sigma^2$, or can also be parameterized

Policy is Gaussian: $a \sim \mathcal{N}(\mu(s), \sigma^2)$

The score function is

$$\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}$$
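
A corresponding sketch for the Gaussian case (the feature values, $\sigma$, and the sampled action are illustrative), which also checks the analytic score against a finite-difference approximation of $\nabla_\theta \log\pi_\theta(s,a)$:

```python
import numpy as np

def gaussian_log_prob(a, phi_s, theta, sigma):
    """log N(a; mu(s) = phi(s)^T theta, sigma^2) for a scalar action a."""
    mu = phi_s @ theta
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def gaussian_score(a, phi_s, theta, sigma):
    """Analytic score: (a - mu(s)) * phi(s) / sigma^2."""
    return (a - phi_s @ theta) * phi_s / sigma**2

phi_s, theta, sigma, a = np.array([1.0, 2.0]), np.array([0.3, -0.1]), 0.5, 0.4

# Finite-difference check of the analytic score.
eps = 1e-6
numeric = np.array([
    (gaussian_log_prob(a, phi_s, theta + eps * e, sigma)
     - gaussian_log_prob(a, phi_s, theta - eps * e, sigma)) / (2 * eps)
    for e in np.eye(len(theta))])
print(gaussian_score(a, phi_s, theta, sigma), numeric)   # should match closely
```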

Policy Gradient Theorem

For any differentiable policy $\pi_\theta(s,a)$,

for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$, the policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$$

Proof

Take $\phi(s) = \sum_{a\in A} \nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a)$ as shorthand to simplify the proof (this $\phi(s)$ is not the feature vector used earlier).

$$\begin{aligned} \nabla_\theta V^{\pi}(s) &= \nabla_\theta \left(\sum_{a\in A} \pi_\theta(a|s)\, Q^{\pi}(s,a)\right) \\ &= \sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a) + \pi_\theta(a|s)\, \nabla_\theta Q^{\pi}(s,a)\right) && \text{product rule} \\ &= \sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a) + \pi_\theta(a|s)\, \nabla_\theta \sum_{s',r} P(s',r|s,a)\left(r + V^{\pi}(s')\right)\right) && \text{expand } Q^{\pi} \text{ via one-step lookahead} \\ &= \sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a) + \pi_\theta(a|s) \sum_{s',r} P(s',r|s,a)\, \nabla_\theta V^{\pi}(s')\right) && r \text{ does not depend on } \theta \\ &= \phi(s) + \sum_{a\in A} \pi_\theta(a|s) \sum_{s'\in S} P(s'|s,a)\, \nabla_\theta V^{\pi}(s') \\ &= \phi(s) + \sum_{s'\in S} \sum_{a\in A} \pi_\theta(a|s)\, P(s'|s,a)\, \nabla_\theta V^{\pi}(s') \\ &= \phi(s) + \sum_{s'\in S} \rho(s\to s',1)\, \nabla_\theta V^{\pi}(s') && \text{notice the recurrence} \\ &= \phi(s) + \sum_{s'\in S} \rho(s\to s',1)\left[\phi(s') + \sum_{s''\in S} \rho(s'\to s'',1)\, \nabla_\theta V^{\pi}(s'')\right] \\ &= \phi(s) + \sum_{s'\in S} \rho(s\to s',1)\,\phi(s') + \sum_{s''\in S} \rho(s\to s'',2)\, \nabla_\theta V^{\pi}(s'') \\ &= \cdots \\ &= \sum_{x\in S}\sum_{k=0}^{\infty} \rho(s\to x,k)\,\phi(x) \end{aligned}$$

Here $\rho(s\to x,k)$ is the probability of reaching state $x$ from state $s$ in $k$ steps under $\pi_\theta$: $\rho(s\to s',1) = \sum_{a\in A}\pi_\theta(a|s)\,P(s'|s,a)$, and $\rho(s\to x,k+1) = \sum_{s'\in S}\rho(s\to s',k)\,\rho(s'\to x,1)$.

Let $\eta(s) = \sum_{k=0}^{\infty}\rho(s_0\to s,k)$ be the expected number of visits to state $s$ starting from state $s_0$.

Note that $\sum_{s\in S}\eta(s)$ is a constant that depends only on the initial state $s_0$ and the policy $\pi_\theta$.

So $d^{\pi_\theta}(s) = \frac{\eta(s)}{\sum_{s'\in S}\eta(s')}$ is the stationary distribution of the Markov chain for policy $\pi_\theta$.

$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta V^{\pi}(s_0) \\ &= \sum_{s\in S}\sum_{k=0}^{\infty}\rho(s_0\to s,k)\,\phi(s) \\ &= \sum_{s\in S}\eta(s)\,\phi(s) \\ &= \left(\sum_{s'\in S}\eta(s')\right)\sum_{s\in S}\frac{\eta(s)}{\sum_{s'\in S}\eta(s')}\,\phi(s) \\ &\propto \sum_{s\in S}\frac{\eta(s)}{\sum_{s'\in S}\eta(s')}\,\phi(s) \\ &= \sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A}\nabla_\theta\pi_\theta(a|s)\,Q^{\pi_\theta}(s,a) \\ &= \sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A}\pi_\theta(a|s)\,\nabla_\theta\log\pi_\theta(a|s)\,Q^{\pi_\theta}(s,a) \\ &= \mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)\right] \end{aligned}$$

Monte-Carlo Policy Gradient

We can use the score function to compute the policy gradient from samples: taking the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$, each step of an episode contributes the update $\Delta\theta_t = \alpha\,\nabla_\theta \log\pi_\theta(s_t,a_t)\,v_t$ (REINFORCE).
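
A minimal sketch of this update with a linear softmax policy. The feature function `phi(s, a)` and the classic gym-style environment interface (`reset()`, `step(a)`, `action_space.n`) are assumptions for illustration, not part of the lecture notes:

```python
import numpy as np

def reinforce_episode(env, theta, phi, alpha=0.01, gamma=0.99):
    """One episode of Monte-Carlo policy gradient with a linear softmax policy.

    env: assumed to follow a classic gym-like reset()/step(a) interface.
    phi(s, a): feature vector of a state-action pair (user-supplied).
    """
    def policy(s, num_actions):
        logits = np.array([phi(s, a) @ theta for a in range(num_actions)])
        p = np.exp(logits - logits.max())
        return p / p.sum()

    # Roll out one episode under the current policy.
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        p = policy(s, env.action_space.n)
        a = np.random.choice(len(p), p=p)
        s_next, r, done, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # Compute returns v_t backwards and apply the REINFORCE update at each step.
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        p = policy(states[t], env.action_space.n)
        score = phi(states[t], actions[t]) - sum(p[a] * phi(states[t], a)
                                                 for a in range(len(p)))
        theta = theta + alpha * G * score   # Delta theta = alpha * score * v_t
    return theta
```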

Actor-Critic methods

Q Actor-Critic

Advantage Actor-Critic
