CSE510 Deep Reinforcement Learning (Lecture 13)
Recap from last lecture
For any differentiable policy $\pi_\theta(a \mid s)$, and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right].$$
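As a concrete illustration, here is a minimal sketch of the resulting REINFORCE-style update, assuming PyTorch; the network architecture, the batch shapes, and the use of raw returns in place of $Q^{\pi_\theta}$ are illustrative assumptions, not part of the lecture.

```python
import torch
from torch.distributions import Categorical

# Hypothetical small policy network: maps 4-dim states to logits over 2 actions.
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def policy_gradient_step(states, actions, returns):
    """One REINFORCE-style update: ascend E[ grad log pi(a|s) * return ].

    states:  (N, 4) float tensor, actions: (N,) long tensor,
    returns: (N,) float tensor of discounted returns, used here in place of Q.
    """
    dist = Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    loss = -(log_probs * returns).mean()  # minimize the negative objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```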
Problems with policy gradient methods
Data Inefficiency
- On-policy method: for each new policy, we need to generate a completely new trajectory
- The data is thrown out after just one gradient update
- As complex neural networks need many updates, this makes the training process very slow
Unstable update: step size is very important
- If step size is too large:
- Large step -> bad policy
- Next batch is generated from current bad policy → collect bad samples
- Bad samples -> worse policy (unlike supervised learning, where the correct labels in later batches can still correct the model)
- If step size is too small: the learning process is slow
Deriving the optimization objective function of Trust Region Policy Optimization (TRPO)
Objective of Policy Gradient Methods
Policy Objective
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right],$$
where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory generated by the policy $\pi_\theta$.
The new policy's objective can be written in terms of the old one:
$$J(\theta') = J(\theta) + \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[ \sum_{t=0}^{\infty} \gamma^t A^{\pi_\theta}(s_t, a_t) \right].$$
Equivalently, for succinctness:
$$J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[ \sum_{t} \gamma^t A^{\pi_\theta}(s_t, a_t) \right].$$
Proof
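A sketch of the standard telescoping argument, assuming the advantage expands as $A^{\pi_\theta}(s,a) = \mathbb{E}_{s'}\big[r(s,a) + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)\big]$ and that $J(\theta) = \mathbb{E}_{s_0}\big[V^{\pi_\theta}(s_0)\big]$:
\begin{align*}
\mathbb{E}_{\tau \sim \pi_{\theta'}}\Big[\textstyle\sum_{t \ge 0} \gamma^t A^{\pi_\theta}(s_t, a_t)\Big]
 &= \mathbb{E}_{\tau \sim \pi_{\theta'}}\Big[\textstyle\sum_{t \ge 0} \gamma^t \big(r(s_t, a_t) + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\
 &= J(\theta') + \mathbb{E}_{\tau \sim \pi_{\theta'}}\Big[\textstyle\sum_{t \ge 1} \gamma^t V^{\pi_\theta}(s_t) - \sum_{t \ge 0} \gamma^t V^{\pi_\theta}(s_t)\Big] \\
 &= J(\theta') - \mathbb{E}_{s_0}\big[V^{\pi_\theta}(s_0)\big] \\
 &= J(\theta') - J(\theta).
\end{align*}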
Importance Sampling
Estimate an expectation under one distribution by sampling from another distribution:
$$\mathbb{E}_{x \sim p}\big[f(x)\big] = \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\, f(x)\right].$$
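A minimal numerical sketch of this identity, assuming NumPy; the two Gaussian distributions and the function $f(x) = x^2$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p = N(1, 1), proposal q = N(0, 2); f(x) = x**2 (all illustrative choices).
def p_pdf(x): return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from q
weights = p_pdf(x) / q_pdf(x)                      # importance ratios p/q
estimate = np.mean(weights * x ** 2)               # E_p[x^2] estimated from q-samples

print(estimate)  # close to the true value E_p[x^2] = 1 + 1^2 = 2
```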
Estimating the objective with importance sampling:
$$J(\theta') - J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_{\theta'}},\, a \sim \pi_{\theta'}}\big[A^{\pi_\theta}(s,a)\big] \;\approx\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}\, A^{\pi_\theta}(s,a)\right] \;=\; \mathcal{L}_{\theta}(\theta').$$
Discounted state visitation distribution:
$$d^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\, P(s_t = s \mid \pi).$$
The approximation comes from using the old policy $\pi_\theta$ to sample states in place of the new policy $\pi_{\theta'}$ that we are trying to optimize.
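A minimal sketch of the sample-based surrogate $\mathcal{L}_\theta(\theta')$, assuming PyTorch; the advantage estimates and the stored old log-probabilities are assumed to come from rollouts collected with the old policy.

```python
import torch

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """Sample estimate of L_theta(theta'): mean of (pi_new / pi_old) * A^pi_old.

    new_log_probs: log pi_theta'(a|s) under the policy being optimized,
    old_log_probs: log pi_theta(a|s) recorded when the data was collected,
    advantages:    advantage estimates A^pi_theta(s, a) from the old policy.
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return (ratio * advantages).mean()
```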
Lower Bound of the Optimization
The Kullback-Leibler (KL) divergence is a measure of the difference between two probability distributions:
$$D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right].$$
The surrogate objective bounds the true improvement from below:
$$J(\theta') - J(\theta) \;\ge\; \mathcal{L}_\theta(\theta') - C \max_{s} D_{KL}\big(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta'}(\cdot \mid s)\big),$$
where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ is a constant, with $\epsilon = \max_{s,a} |A^{\pi_\theta}(s,a)|$.
Optimizing the objective function:
$$\theta_{k+1} = \arg\max_{\theta'} \;\Big[\mathcal{L}_{\theta_k}(\theta') - C \max_s D_{KL}\big(\pi_{\theta_k}(\cdot \mid s)\,\|\,\pi_{\theta'}(\cdot \mid s)\big)\Big].$$
By maximizing the lower bound at each update, the true objective can never decrease, as shown next.
Monotonic Improvement Theorem
Proof of improvement guarantee: Suppose $\theta_k$ and $\theta_{k+1}$ are related by
$$\theta_{k+1} = \arg\max_{\theta'} \;\Big[\mathcal{L}_{\theta_k}(\theta') - C \max_s D_{KL}\big(\pi_{\theta_k}(\cdot \mid s)\,\|\,\pi_{\theta'}(\cdot \mid s)\big)\Big].$$
$\theta' = \theta_k$ is a feasible point, and the objective at $\theta' = \theta_k$ is equal to $0$ (both $\mathcal{L}_{\theta_k}(\theta_k)$ and the KL term vanish).
Hence the optimal value satisfies $\mathcal{L}_{\theta_k}(\theta_{k+1}) - C \max_s D_{KL} \ge 0$.
By the performance bound, $J(\theta_{k+1}) - J(\theta_k) \ge \mathcal{L}_{\theta_k}(\theta_{k+1}) - C \max_s D_{KL} \ge 0$.
Final objective function:
$$\max_{\theta'}\; \mathcal{L}_{\theta_k}(\theta') - C\, \mathbb{E}_{s \sim d^{\pi_{\theta_k}}}\!\big[D_{KL}\big(\pi_{\theta_k}(\cdot \mid s)\,\|\,\pi_{\theta'}(\cdot \mid s)\big)\big],$$
where, by approximation, the intractable maximum over states is replaced by the mean KL divergence over visited states.
By Lagrangian duality, this penalized objective is mathematically equivalent (for a suitable $\delta$) to the following constrained problem with a trust region constraint:
$$\max_{\theta'}\; \mathcal{L}_{\theta_k}(\theta') \quad \text{such that} \quad \mathbb{E}_{s \sim d^{\pi_{\theta_k}}}\!\big[D_{KL}\big(\pi_{\theta_k}(\cdot \mid s)\,\|\,\pi_{\theta'}(\cdot \mid s)\big)\big] \le \delta.$$
The penalty coefficient $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ gets very large when $\gamma$ is close to one, and the corresponding gradient step size becomes too small (see the worked example after this list).
- Empirical results show that the coefficient needs to be more adaptive.
- But tuning it is hard (it needs extra tricks, much like PPO).
- TRPO instead uses the trust region constraint and makes $\delta$ a tunable hyperparameter.
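As a rough worked example of the scale involved (using the constant $C = 4\epsilon\gamma/(1-\gamma)^2$ quoted above and an illustrative discount factor $\gamma = 0.99$):
$$C = \frac{4\epsilon\gamma}{(1-\gamma)^2} = \frac{4 \times 0.99\,\epsilon}{(0.01)^2} = 39600\,\epsilon,$$
so the KL penalty dwarfs the surrogate gain unless the policy change is minuscule, which is why a fixed theoretical $C$ forces tiny steps.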
Trust Region Policy Optimization (TRPO)
$$\max_{\theta}\; \mathcal{L}_{\theta_k}(\theta) \quad \text{such that} \quad \overline{D}_{KL}(\theta_k, \theta) := \mathbb{E}_{s \sim d^{\pi_{\theta_k}}}\!\big[D_{KL}\big(\pi_{\theta_k}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big)\big] \le \delta.$$
Make a linear approximation to $\mathcal{L}_{\theta_k}(\theta)$ and a quadratic approximation to the KL term.
Maximize
$$g^{\top}(\theta - \theta_k) \quad \text{such that} \quad \tfrac{1}{2}\,(\theta - \theta_k)^{\top} H\, (\theta - \theta_k) \le \delta,$$
where $g = \nabla_\theta \mathcal{L}_{\theta_k}(\theta)\big|_{\theta=\theta_k}$ and $H = \nabla^2_\theta \overline{D}_{KL}(\theta_k, \theta)\big|_{\theta=\theta_k}$ (the Fisher information matrix).
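A minimal NumPy sketch of solving this subproblem in closed form; the tiny explicit $g$, $H$, and $\delta$ are placeholder values (in TRPO proper, $g$ and Hessian-vector products with $H$ come from automatic differentiation, $H^{-1}g$ is computed by conjugate gradient, and a backtracking line search follows).

```python
import numpy as np

# Placeholder gradient of the surrogate and KL Hessian (Fisher matrix) at theta_k.
g = np.array([0.5, -0.2, 0.1])
H = np.array([[2.0, 0.1, 0.0],
              [0.1, 1.5, 0.2],
              [0.0, 0.2, 1.0]])
delta = 0.01  # trust region size (the tunable hyperparameter)

# Solve: max g^T s  subject to  0.5 * s^T H s <= delta.
# The maximizer is s* = sqrt(2*delta / (g^T H^{-1} g)) * H^{-1} g.
x = np.linalg.solve(H, g)                     # x = H^{-1} g (natural gradient direction)
step = np.sqrt(2.0 * delta / (g @ x)) * x     # scale the step to the trust-region boundary

print("update direction:", step)
print("KL quadratic at step:", 0.5 * step @ H @ step)  # equals delta at the boundary
```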