CSE510 Deep Reinforcement Learning (Lecture 14)
Advanced Policy Gradient Methods
Trust Region Policy Optimization (TRPO)
“Recall” from last lecture
TRPO solves the constrained optimization problem
$$\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\, A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\right]$$
such that
$$\mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}\!\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big)\right] \le \delta$$
Unconstrained penalized objective:
$$\max_\theta \; \mathbb{E}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\, A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\right] - \beta\, \mathbb{E}\!\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big)\right]$$
First-order Taylor expansion for the loss and second-order for the KL (around $\theta_{\mathrm{old}}$, where the KL and its gradient both vanish):
$$L_{\theta_{\mathrm{old}}}(\theta) \approx g^\top(\theta - \theta_{\mathrm{old}}), \qquad \bar{D}_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta) \approx \tfrac{1}{2}(\theta - \theta_{\mathrm{old}})^\top F\, (\theta - \theta_{\mathrm{old}}),$$
where $g$ is the gradient of the surrogate objective at $\theta_{\mathrm{old}}$ and $F$ is the Fisher information matrix.
If you are really interested, try to fill in the details of the "Solving the KL-Constrained Problem" derivation yourself.
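A sketch of that derivation, using the approximations above and a Lagrange multiplier $\lambda$ for the quadratic KL constraint:

```latex
% Locally approximated problem:
%   maximize   g^T (theta - theta_old)
%   subject to (1/2) (theta - theta_old)^T F (theta - theta_old) <= delta
\begin{align*}
\mathcal{L}(\theta,\lambda)
  &= g^\top(\theta-\theta_{\mathrm{old}})
     - \lambda\!\left(\tfrac{1}{2}(\theta-\theta_{\mathrm{old}})^\top F (\theta-\theta_{\mathrm{old}}) - \delta\right) \\
\nabla_\theta \mathcal{L} = 0
  \;&\Longrightarrow\; \theta - \theta_{\mathrm{old}} = \tfrac{1}{\lambda} F^{-1} g \\
\text{constraint active at the optimum}
  \;&\Longrightarrow\; \theta = \theta_{\mathrm{old}} + \sqrt{\tfrac{2\delta}{g^\top F^{-1} g}}\; F^{-1} g
\end{align*}
```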
Natural Gradient Descent
Setting the gradient of the Lagrangian to zero:
$$\theta - \theta_{\mathrm{old}} = \frac{1}{\lambda} F^{-1} g$$
The natural gradient is
$$\tilde{\nabla} J = F^{-1} g, \qquad \theta = \theta_{\mathrm{old}} + \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\; F^{-1} g$$
However, due to the quadratic approximation, the KL constraint may be violated.
Line Search
We do a line search for the best step size, making sure that we are:
- Improving the objective
- Satisfying the KL constraint
TRPO = NPG + line search + monotonic improvement theorem
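A minimal sketch of one such update in Python, assuming hypothetical helpers: `fvp(v)` returns the Fisher-vector product $Fv$, `surrogate(theta)` evaluates the surrogate objective, and `mean_kl(theta_old, theta)` returns the average KL between the two policies.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual
    p = r.copy()                      # search direction
    r_dot = r @ r
    for _ in range(iters):
        Ap = fvp(p)
        alpha = r_dot / (p @ Ap + 1e-12)
        x += alpha * p
        r -= alpha * Ap
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step(theta, g, fvp, surrogate, mean_kl, delta=0.01, backtrack=0.5, max_tries=10):
    """One natural-gradient step with a backtracking line search on the step size."""
    step_dir = conjugate_gradient(fvp, g)                      # approx. F^{-1} g
    step_size = np.sqrt(2.0 * delta / (step_dir @ fvp(step_dir) + 1e-12))
    full_step = step_size * step_dir
    old_obj = surrogate(theta)
    for i in range(max_tries):                                 # shrink step until both checks pass
        theta_new = theta + (backtrack ** i) * full_step
        if surrogate(theta_new) > old_obj and mean_kl(theta, theta_new) <= delta:
            return theta_new                                   # objective improved, KL satisfied
    return theta                                               # reject the update
```

The conjugate-gradient solve only needs Fisher-vector products, so $F$ is never formed or inverted explicitly.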
Summary of TRPO
Pros
- Proper learning step
- Monotonic improvement guarantee
Cons
- Poor scalability
- Second-order optimization: computing Fisher Information Matrix and its inverse every time for the current policy model is expensive
- Not quite sample efficient
- Requiring a large batch of rollouts to approximate accurately
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) "perform[s] comparably or better than state-of-the-art approaches while being much simpler to implement and tune." — OpenAI
Idea:
- The KL constraint helps the training process. However, maybe it does not have to be a hard constraint:
- Does it matter if we break the constraint only a few times?
What if we treat it as a "soft" constraint and add a proximity penalty to the objective function?
PPO with Adaptive KL Penalty
Use an adaptive penalty coefficient $\beta$: optimize the KL-penalized objective
$$\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}\, A_t - \beta\, D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s_t)\,\|\,\pi_\theta(\cdot\mid s_t)\big) \right]$$
Compute $d = \mathbb{E}_t\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s_t)\,\|\,\pi_\theta(\cdot\mid s_t)\big) \right]$ after each update:
- If $d < d_{\mathrm{targ}} / 1.5$, then $\beta \leftarrow \beta / 2$
- If $d > d_{\mathrm{targ}} \times 1.5$, then $\beta \leftarrow \beta \times 2$
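A minimal sketch of this adaptation rule in Python (the names `beta`, `mean_kl`, and `kl_target` are illustrative):

```python
def update_kl_penalty(beta, mean_kl, kl_target):
    """Adapt the KL-penalty coefficient beta after each policy update."""
    if mean_kl < kl_target / 1.5:
        beta /= 2.0   # policy moved too little -> relax the penalty
    elif mean_kl > kl_target * 1.5:
        beta *= 2.0   # policy moved too much -> strengthen the penalty
    return beta
```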
PPO with Clipped Objective
- Here, $r_t(\theta) = \dfrac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$ measures how much the new policy changes the probability of taking action $a_t$ in state $s_t$:
- If $r_t(\theta) > 1$: the action becomes more likely under the new policy.
- If $r_t(\theta) < 1$: the action becomes less likely.
- We'd like to increase $r_t(\theta)$ if $A_t > 0$ (good actions become more probable) and decrease it if $A_t < 0$.
- But if $r_t(\theta)$ changes too much, the update becomes unstable, just like in vanilla PG.
We limit $r_t(\theta)$ to the range $[1-\epsilon,\, 1+\epsilon]$:
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\big( r_t(\theta)\, A_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t \big) \right]$$
Trust Region Policy Optimization (TRPO): don't move further than $\delta$ in KL. Proximal Policy Optimization (PPO): don't let $r_t(\theta)$ drift further than $\epsilon$ from 1.
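A minimal PyTorch sketch of the clipped objective, written as a loss to minimize (tensor names are illustrative):

```python
import torch

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, negated so an optimizer can minimize it."""
    ratio = torch.exp(log_probs_new - log_probs_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```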
PPO in Practice
Here is the full surrogate objective:
$$L_t^{\mathrm{CLIP+VF+S}}(\theta) = \mathbb{E}_t\!\left[ L_t^{\mathrm{CLIP}}(\theta) - c_1\, L_t^{\mathrm{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) \right]$$
$L_t^{\mathrm{VF}} = \big(V_\theta(s_t) - V_t^{\mathrm{targ}}\big)^2$ is a squared-error loss for the "critic" $V_\theta$.
$S[\pi_\theta](s_t)$ is an entropy bonus that ensures sufficient exploration and encourages diversity of actions.
$c_1$ and $c_2$ are trade-off parameters; the paper uses $c_1 = 1$ and $c_2 = 0.01$ in its Atari experiments.
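A minimal PyTorch sketch of this combined loss (again with illustrative tensor names, negated for minimization):

```python
import torch

def ppo_loss(log_probs_new, log_probs_old, advantages,
             values, value_targets, entropy, c1=1.0, c2=0.01, eps=0.2):
    """Combined PPO objective L^CLIP - c1*L^VF + c2*S, returned as a loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)                  # r_t(theta)
    clip_obj = torch.min(ratio * advantages,
                         torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    value_loss = ((values - value_targets) ** 2).mean()               # L_t^{VF}
    entropy_bonus = entropy.mean()                                    # S[pi_theta](s_t)
    return -(clip_obj - c1 * value_loss + c2 * entropy_bonus)
```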
Summary for Policy Gradient Methods
Trust region policy optimization (TRPO)
- Optimization problem formulation
- Natural gradient ascent + monotonic improvement + line search
- But require second-order optimization
Proximal policy optimization (PPO)
- Clipped objective
- Simple yet effective
Take-away:
- Proper step size is critical for policy gradient methods
- Sample efficiency can be improved by using importance sampling