CSE510 Deep Reinforcement Learning (Lecture 11)
Materials Used
Much of the material and the slides for this lecture are taken from Chapter 13 of the Sutton & Barto textbook.
Some slides are borrowed from Rich Sutton's RL class and David Silver's Deep RL tutorial.
Problem: often the feature-based policies that work well (win games, maximize utilities) aren’t the ones that approximate V / Q best
- Q-learning’s priority: get Q-values close (modeling)
- Action selection priority: get ordering of Q-values right (prediction)
Value functions can often be much more complex to represent than the corresponding policy
- Do we really care about knowing $Q(s,a)$, or just that "right is better than left in state $s$"?
Motivates searching directly in a parameterized policy space
- Bypass learning a value function and "directly" optimize the value of a policy
Examples
Rock-Paper-Scissors
- Two-player game of rock-paper-scissors
- Scissors beats paper
- Rock beats scissors
- Paper beats rock
- Consider policies for iterated rock-paper-scissors
- A deterministic policy is easily exploited
- A uniform random policy is optimal (i.e., Nash equilibrium)
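A minimal numerical illustration of this point (an illustrative sketch using the standard ±1 payoff convention):

```python
import numpy as np

# Payoff matrix from the row player's perspective; 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def expected_payoff(our_mix, opp_mix):
    """Expected payoff when both players draw actions from the given distributions."""
    return our_mix @ PAYOFF @ opp_mix

always_rock = np.array([1.0, 0.0, 0.0])             # deterministic policy
uniform     = np.array([1/3, 1/3, 1/3])             # uniform random policy (Nash)
best_response_to_rock = np.array([0.0, 1.0, 0.0])   # opponent always plays paper

print(expected_payoff(always_rock, best_response_to_rock))  # -1.0: fully exploited
print(expected_payoff(uniform, best_response_to_rock))      #  0.0: unexploitable
```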
Partially Observable GridWorld

The agent cannot differentiate the grey states
Consider features of the following form (for all actions $a$): $\phi(s,a) = \mathbf{1}(\text{wall to N},\ a = \text{move E})$
Compare value-based RL, using an approximate value function $Q_\theta(s,a) = f(\phi(s,a), \theta)$,
to policy-based RL, using a parameterized policy $\pi_\theta(s,a) = g(\phi(s,a), \theta)$
Under aliasing, an optimal deterministic policy will either
- move W in both grey states (shown by red arrows), or
- move E in both grey states
Either way, it can get stuck and never reach the money
- Value-based RL learns a near-deterministic policy
- e.g. greedy or $\epsilon$-greedy
So it will traverse the corridor for a long time
An optimal stochastic policy will randomly move E or W in the grey cells.
It will reach the goal state in a few steps with high probability
Policy-based RL can learn the optimal stochastic policy
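A minimal sketch (feature values and parameters are made up) of why a parameterized stochastic policy handles aliasing while a greedy value-based one cannot: both grey states share the same feature vector, so a greedy policy must pick the same action in both, whereas a softmax policy can put equal probability on moving E and W.

```python
import numpy as np

phi_grey = np.array([1.0, 0.0])   # shared features of the two aliased (grey) states
theta = np.zeros((2, 2))          # one weight column per action: [E, W]

logits = phi_grey @ theta         # identical logits in both grey states
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)                      # [0.5 0.5]: the optimal "coin flip" in the grey states
```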
RL via Policy Gradient Ascent
The policy gradient approach has the following schema:
- Select a space of parameterized policies (i.e., function class)
- Compute the gradient of the value of current policy wrt parameters
- Move parameters in the direction of the gradient
- Repeat these steps until we reach a local maximum (a minimal sketch of this loop follows the questions below)
So we must answer the following questions:
- How should we represent and evaluate parameterized policies?
- How can we compute the gradient?
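As a hedged illustration (not a specific algorithm from the lecture), the schema above boils down to a loop like the following, where `estimate_policy_value_gradient` is a hypothetical placeholder for whatever estimator of $\nabla_\theta J(\theta)$ we derive below:

```python
import numpy as np

def policy_gradient_ascent(theta_init, estimate_policy_value_gradient,
                           alpha=0.01, n_iterations=1000):
    theta = np.array(theta_init, dtype=float)
    for _ in range(n_iterations):
        grad = estimate_policy_value_gradient(theta)  # gradient of the policy's value
        theta += alpha * grad                         # step in the gradient direction
    return theta                                      # ideally near a local maximum
```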
Policy learning objective
Goal: given a policy $\pi_\theta(s,a)$ with parameters $\theta$, find the best $\theta$
In episodic environments we can use the start value:
$J_1(\theta) = V^{\pi_\theta}(s_1)$
In continuing environments we can use the average value:
$J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$
Or the average reward per time-step:
$J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \mathcal{R}_s^a$
where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain for policy $\pi_\theta$.
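As a hedged illustration of how the start-value objective $J_1(\theta)$ can be estimated in practice, here is a minimal Monte-Carlo sketch; `rollout` is a hypothetical helper that runs one episode from $s_1$ under $\pi_\theta$ and returns its (discounted) return:

```python
import numpy as np

def estimate_start_value(rollout, theta, n_episodes=100):
    """Monte-Carlo estimate of J_1(theta) = V^{pi_theta}(s_1)."""
    returns = [rollout(theta) for _ in range(n_episodes)]
    return float(np.mean(returns))
```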
Policy optimization
Policy-based reinforcement learning is an optimization problem
Find $\theta$ that maximises $J(\theta)$
Some approaches do not use gradient
- Hill climbing
- Simplex / amoeba / Nelder Mead
- Genetic algorithms
Greater efficiency often possible using gradient
- Gradient descent
- Conjugate gradient
- Quasi-Newton
We focus on gradient ascent (on $J(\theta)$); many extensions are possible
And on methods that exploit sequential structure
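As a hedged illustration of the gradient-free alternatives listed above, here is a minimal hill-climbing sketch; `evaluate_policy` stands for a hypothetical (possibly noisy) estimate of $J(\theta)$, e.g. the average return over a few rollouts:

```python
import numpy as np

def hill_climb(evaluate_policy, theta_dim, n_iterations=500, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(theta_dim)
    best_value = evaluate_policy(theta)
    for _ in range(n_iterations):
        candidate = theta + noise_scale * rng.standard_normal(theta_dim)
        value = evaluate_policy(candidate)
        if value > best_value:                  # keep the perturbation only if it helps
            theta, best_value = candidate, value
    return theta
```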
Policy gradient
Let $J(\theta)$ be any policy objective function.
Policy gradient algorithms search for a local maximum of $J(\theta)$ by ascending the gradient of the policy objective with respect to the parameters $\theta$:
$\Delta\theta = \alpha \nabla_\theta J(\theta)$
where $\nabla_\theta J(\theta)$ is the policy gradient
and $\alpha$ is the step-size parameter.
Policy gradient methods
The main method we will introduce is the Monte-Carlo policy gradient.
Score Function
Assume the policy $\pi_\theta$ is differentiable wherever it is non-zero, and that we know the gradient $\nabla_\theta \pi_\theta(s,a)$ for all $s$ and $a$.
We can then compute the policy gradient analytically.
The likelihood ratio trick rewrites the gradient as:
$\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a)\, \dfrac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)$
The score function is $\nabla_\theta \log \pi_\theta(s,a)$.
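As a quick sanity check (an illustrative sketch, not part of the original slides), the likelihood-ratio identity can be verified numerically with a toy softmax policy and finite differences:

```python
import numpy as np

def pi(theta, phi, a):
    """Softmax policy probability of action a, with logits phi(s,.) @ theta."""
    logits = phi @ theta
    p = np.exp(logits - logits.max())
    return (p / p.sum())[a]

rng = np.random.default_rng(0)
phi = rng.standard_normal((3, 4))   # toy phi(s,a) for 3 actions, 4 features
theta = rng.standard_normal(4)
a, eps, I = 1, 1e-6, np.eye(4)

# Finite-difference gradients of pi(s,a) and of log pi(s,a)
grad_pi  = np.array([(pi(theta + eps*e, phi, a) - pi(theta - eps*e, phi, a)) / (2*eps) for e in I])
grad_log = np.array([(np.log(pi(theta + eps*e, phi, a)) - np.log(pi(theta - eps*e, phi, a))) / (2*eps) for e in I])

# grad pi(s,a) == pi(s,a) * grad log pi(s,a)
print(np.allclose(grad_pi, pi(theta, phi, a) * grad_log))  # True
```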
Example
Take the softmax policy as an example:
- Weight actions using a linear combination of features: $\phi(s,a)^\top \theta$
- The probability of an action is proportional to the exponentiated weight: $\pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta}$
- The score function is $\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$
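A minimal sketch of the softmax score function in code (feature values are illustrative):

```python
import numpy as np

def softmax_probs(theta, phi):
    logits = phi @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def softmax_score(theta, phi, a):
    """grad_theta log pi(s,a) = phi(s,a) - E_{b~pi}[phi(s,b)] for the linear softmax policy."""
    p = softmax_probs(theta, phi)
    return phi[a] - p @ phi            # phi(s,a) minus the expected feature vector

rng = np.random.default_rng(1)
phi = rng.standard_normal((3, 4))      # toy phi(s,a) for 3 actions, 4 features
theta = rng.standard_normal(4)
print(softmax_score(theta, phi, a=0))
```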
In continuous action spaces, a Gaussian policy is natural:
- The mean is a linear combination of state features: $\mu(s) = \phi(s)^\top \theta$
- The variance may be fixed at $\sigma^2$, or can also be parameterized
- The policy is Gaussian: $a \sim \mathcal{N}(\mu(s), \sigma^2)$
- The score function is $\nabla_\theta \log \pi_\theta(s,a) = \dfrac{(a - \mu(s))\,\phi(s)}{\sigma^2}$
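A corresponding sketch for the Gaussian-policy score function (values are illustrative):

```python
import numpy as np

def gaussian_score(theta, phi_s, a, sigma=1.0):
    """grad_theta log pi(a|s) = (a - mu(s)) * phi(s) / sigma^2, with mu(s) = phi(s)^T theta."""
    mu = phi_s @ theta                 # linear mean
    return (a - mu) * phi_s / sigma**2

phi_s = np.array([1.0, 0.5, -2.0])     # toy state features
theta = np.array([0.2, -0.1, 0.3])
print(gaussian_score(theta, phi_s, a=0.7))
```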
Policy Gradient Theorem
For any differentiable policy $\pi_\theta(s,a)$,
and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$
Proof
Take the episodic start-value objective $J_1$ with $\gamma = 1$ to simplify the proof.
Note that $\Pr(s \to x, k, \pi_\theta)$ is the probability of reaching state $x$ from state $s$ in $k$ steps under policy $\pi_\theta$.
Let $\eta(s) = \sum_{k=0}^{\infty} \Pr(s_1 \to s, k, \pi_\theta)$ be the expected number of time steps spent in state $s$ during an episode that starts in state $s_1$.
Note that $\sum_{s'} \eta(s')$ is a constant that depends solely on the initial state and the policy $\pi_\theta$.
So $d^{\pi_\theta}(s) = \eta(s) / \sum_{s'} \eta(s')$ is the stationary (on-policy) distribution of the Markov chain for policy $\pi_\theta$.
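A hedged sketch of the resulting derivation (following the Sutton & Barto Chapter 13 argument, with the symbols defined above):

```latex
\begin{align*}
\nabla_\theta V^{\pi_\theta}(s)
  &= \sum_a \Big[ \nabla_\theta \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)
     + \pi_\theta(s,a)\, \nabla_\theta Q^{\pi_\theta}(s,a) \Big] \\
  &= \sum_{x} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi_\theta)
     \sum_a \nabla_\theta \pi_\theta(x,a)\, Q^{\pi_\theta}(x,a)
     \qquad \text{(unrolling the recursion)} \\
\nabla_\theta J_1(\theta) = \nabla_\theta V^{\pi_\theta}(s_1)
  &= \sum_s \eta(s) \sum_a \nabla_\theta \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a) \\
  &\propto \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\,
     \nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)
   = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a) \right]
\end{align*}
```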
Monte-Carlo Policy Gradient
We can use the score function to compute the policy gradient.
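A hedged sketch of how the score function turns into a Monte-Carlo policy gradient (REINFORCE-style) update with a linear softmax policy. `env` and `features` are hypothetical stand-ins, assumed to behave as: `env.reset() -> state`, `env.step(a) -> (state, reward, done)`, and `features(state) -> (n_actions, d)` array of $\phi(s,a)$:

```python
import numpy as np

def monte_carlo_policy_gradient(env, features, d, n_actions,
                                alpha=0.01, gamma=0.99, n_episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(n_episodes):
        # Generate one episode by sampling from the current softmax policy.
        trajectory, state, done = [], env.reset(), False
        while not done:
            phi = features(state)                      # (n_actions, d)
            logits = phi @ theta
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            a = rng.choice(n_actions, p=probs)
            state, reward, done = env.step(a)
            trajectory.append((phi, probs, a, reward))
        # Accumulate the gradient: sum_t gamma^t * G_t * grad log pi(s_t, a_t).
        G, grad = 0.0, np.zeros(d)
        for t, (phi, probs, a, reward) in reversed(list(enumerate(trajectory))):
            G = reward + gamma * G                     # Monte-Carlo return from step t
            grad += (gamma ** t) * G * (phi[a] - probs @ phi)
        theta += alpha * grad                          # one gradient-ascent step per episode
    return theta
```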