CSE510 Deep Reinforcement Learning (Lecture 12)
Policy Gradient Theorem
For any differentiable policy $\pi_\theta(s,a)$, and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$$
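As a concrete illustration (not from the lecture), here is a minimal NumPy sketch of the score-function form of this gradient for a linear-softmax policy over a small discrete action set; the feature matrix phi_s and the sampled returns G used in place of Q(s,a) are illustrative assumptions.

import numpy as np

def softmax_policy(theta, phi_s):
    # pi_theta(a|s) for a linear-softmax policy; phi_s is an (num_actions x d) feature matrix phi(s, .)
    logits = phi_s @ theta                 # one logit per action
    logits = logits - logits.max()         # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi_s, a):
    # Score function: grad_theta log pi_theta(a|s) = phi(s,a) - sum_b pi(b|s) phi(s,b)
    p = softmax_policy(theta, phi_s)
    return phi_s[a] - p @ phi_s

def policy_gradient_estimate(theta, samples):
    # Monte-Carlo estimate of E_pi[ grad log pi_theta(s,a) * Q(s,a) ],
    # with Q(s,a) replaced by a sampled return G (REINFORCE-style).
    g = np.zeros_like(theta)
    for phi_s, a, G in samples:            # samples: list of (phi_s, a, G) tuples
        g += grad_log_pi(theta, phi_s, a) * G
    return g / max(len(samples), 1)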
Policy Gradient Methods
Advantages of Policy-Based RL
Advantages:
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converge to a local rather than global optimum
- Evaluating a policy is typically inefficient and has high variance
Actor-Critic Methods
Q Actor-Critic
Reducing Variance Using a Critic
Monte-Carlo Policy Gradient still has high variance.
We use a critic to estimate the action-value function: $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$.
Actor-critic algorithms maintain two sets of parameters:
Critic: updates the action-value function parameters $w$.
Actor: updates the policy parameters $\theta$, in the direction suggested by the critic.
Actor-critic algorithms follow an approximate policy gradient:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)\right], \qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)$$
Action-Value Actor-Critic
- Simple actor-critic algorithm based on action-value critic
- Using linear value function approximation: $Q_w(s,a) = \phi(s,a)^\top w$
- Critic: updates $w$ by linear TD(0)
- Actor: updates $\theta$ by policy gradient
def q_actor_critic(s, theta, w, alpha, beta, gamma, num_steps):
    # Q Actor-Critic (QAC): the critic is a linear action-value function
    # Q_w(s, a) = phi(s, a)^T w, and the actor follows the policy gradient.
    # (sample_action, sample_reward, sample_transition, Q_w, phi, grad_log_pi are assumed helpers.)
    a = sample_action(s, theta)                        # a ~ pi_theta(. | s)
    for _ in range(num_steps):
        r = sample_reward(s, a)                        # sample reward r
        s_next = sample_transition(s, a)               # sample transition s' ~ P(. | s, a)
        a_next = sample_action(s_next, theta)          # a' ~ pi_theta(. | s')
        delta = r + gamma * Q_w(s_next, a_next, w) - Q_w(s, a, w)        # TD error
        theta = theta + alpha * grad_log_pi(s, a, theta) * Q_w(s, a, w)  # actor update
        w = w + beta * delta * phi(s, a)               # critic update: linear TD(0)
        s, a = s_next, a_next
    return theta, w
Advantage Actor-Critic
Reducing Variance Using a Baseline
- We subtract a baseline function $B(s)$ from the policy gradient
- This can reduce the variance without changing the expectation
A good baseline is the state value function: $B(s) = V^{\pi_\theta}(s)$.
So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, A^{\pi_\theta}(s,a)\right]$$
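A one-line check of why subtracting a state-dependent baseline leaves the expectation unchanged (the standard argument, written out here since the slide's derivation did not survive extraction; $d^{\pi_\theta}$ denotes the state distribution under $\pi_\theta$):
$$\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, B(s)\right] = \sum_s d^{\pi_\theta}(s)\, B(s) \sum_a \nabla_\theta \pi_\theta(s,a) = \sum_s d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_a \pi_\theta(s,a) = 0,$$
since $\sum_a \pi_\theta(s,a) = 1$ for every state $s$.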
Estimating the Advantage Function
Method 1: direct estimation (using two separate estimates may increase the variance)
The advantage function can significantly reduce the variance of the policy gradient,
so the critic should really estimate the advantage function,
for example by estimating both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s,a)$,
using two function approximators and two parameter vectors:
$$V_v(s) \approx V^{\pi_\theta}(s), \qquad Q_w(s,a) \approx Q^{\pi_\theta}(s,a), \qquad A(s,a) = Q_w(s,a) - V_v(s),$$
and updating both value functions by e.g. TD learning, as sketched below.
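A minimal sketch of Method 1 in the same linear-feature setting as the QAC pseudocode above; the feature maps phi(s, a) and psi(s) are assumed helpers for the example, not part of the lecture.

import numpy as np

def init_critics(d_q, d_v):
    # d_q, d_v: dimensions of the assumed feature maps phi(s, a) and psi(s).
    return np.zeros(d_q), np.zeros(d_v)

def advantage(s, a, w, v):
    # A(s, a) = Q_w(s, a) - V_v(s), with linear critics Q_w(s,a) = phi(s,a)^T w and V_v(s) = psi(s)^T v.
    return phi(s, a) @ w - psi(s) @ v

def critic_td_update(s, a, r, s_next, a_next, w, v, beta, gamma):
    # One TD(0) step for each critic, using bootstrapped targets as in QAC above.
    delta_q = r + gamma * phi(s_next, a_next) @ w - phi(s, a) @ w
    delta_v = r + gamma * psi(s_next) @ v - psi(s) @ v
    w = w + beta * delta_q * phi(s, a)
    v = v + beta * delta_v * psi(s)
    return w, v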
Method 2: using the TD error
We can prove that the TD error is an unbiased estimate of the advantage function.
For the true value function $V^{\pi_\theta}(s)$, the TD error
$$\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$$
is an unbiased estimate of the advantage function:
$$\mathbb{E}_{\pi_\theta}\left[\delta^{\pi_\theta} \mid s,a\right] = \mathbb{E}_{\pi_\theta}\left[r + \gamma V^{\pi_\theta}(s') \mid s,a\right] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s,a)$$
So we can use the TD error to compute the policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, \delta^{\pi_\theta}\right]$$
In practice, we can use an approximate TD error $\delta_v = r + \gamma V_v(s') - V_v(s)$ to compute the policy gradient:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, \delta_v\right]$$
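A minimal sketch of the resulting TD actor-critic step, mirroring the QAC pseudocode above but with the approximate TD error in place of Q_w(s, a); psi and grad_log_pi are the assumed helpers used earlier.

def td_actor_critic_step(s, a, r, s_next, theta, v, alpha, beta, gamma):
    # Critic: approximate TD error delta_v = r + gamma * V_v(s') - V_v(s),
    # with a linear state-value function V_v(s) = psi(s)^T v.
    delta = r + gamma * psi(s_next) @ v - psi(s) @ v
    # Actor: use the TD error in place of Q_w(s, a) in the policy gradient step.
    theta = theta + alpha * grad_log_pi(s, a, theta) * delta
    # Critic: TD(0) update of v.
    v = v + beta * delta * psi(s)
    return theta, v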
Summary of policy gradient algorithms
The policy gradient has many equivalent forms.
Each leads to a stochastic gradient ascent algorithm.
The critic uses policy evaluation (e.g. MC or TD learning) to estimate $Q^{\pi}(s,a)$, $A^{\pi}(s,a)$, or $V^{\pi}(s)$.
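For reference, the equivalent forms referred to above, written out since the original formulas did not survive extraction (this is the standard summary; $v_t$ denotes the sampled return and $\delta$ the TD error):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, v_t\right] \qquad \text{REINFORCE}$$
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)\right] \qquad \text{Q actor-critic}$$
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, A_w(s,a)\right] \qquad \text{advantage actor-critic}$$
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, \delta\right] \qquad \text{TD actor-critic}$$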
Compatible Function Approximation
If the following two conditions are satisfied:
- Value function approximator is compatible with the policy: $\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)$
- Value function parameters $w$ minimize the mean-squared error $\varepsilon = \mathbb{E}_{\pi_\theta}\left[\left(Q^{\pi_\theta}(s,a) - Q_w(s,a)\right)^2\right]$ (note that $\varepsilon$ need not be zero, it only needs to be minimized)
then the policy gradient is exact:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)\right]$$
Remember: exactness needs both conditions; with an arbitrary critic $Q_w$, the actor-critic gradient of the previous sections is only approximate.
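A short sketch of the standard argument (reconstructed here, since the proof itself is not in these notes): minimizing $\varepsilon$ sets its gradient with respect to $w$ to zero, and the compatibility condition lets us replace $\nabla_w Q_w$ by the score function.
$$\nabla_w \varepsilon = -2\,\mathbb{E}_{\pi_\theta}\!\left[\left(Q^{\pi_\theta}(s,a) - Q_w(s,a)\right)\nabla_w Q_w(s,a)\right] = 0$$
$$\Rightarrow\; \mathbb{E}_{\pi_\theta}\!\left[\left(Q^{\pi_\theta}(s,a) - Q_w(s,a)\right)\nabla_\theta \log \pi_\theta(s,a)\right] = 0$$
$$\Rightarrow\; \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)\right] = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right] = \nabla_\theta J(\theta).$$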
Challenges with Policy Gradient Methods
- Data inefficiency
  - On-policy method: for each new policy, we need to generate a completely new trajectory
  - The data is thrown out after just one gradient update
  - Since complex neural networks need many updates, this makes the training process very slow
- Unstable updates: the step size is very important
  - If the step size is too large:
    - Large step -> bad policy
    - The next batch is generated from the current bad policy -> bad samples are collected
    - Bad samples -> even worse policy (unlike supervised learning, where the correct labels in the following batches may correct it)
  - If the step size is too small: the learning process is slow