CSE510 Deep Reinforcement Learning (Lecture 15)
Motivation
For policy gradient methods over stochastic policies
Advantages
- Can potentially learn optimal solutions in multi-agent settings
- Can handle partially observable settings
- Sufficient exploration
Disadvantages
- Cannot learn a deterministic policy
- Extension to continuous action space is not straightforward
On-Policy vs. Off-Policy Policy Gradients
On-Policy Policy Gradients:
- Training samples are collected according to the current policy.
Off-Policy Algorithms:
- Enable the reuse of past experience.
- Samples can be collected by an exploratory behavior policy.
How to design off-policy policy gradients?
- By using importance sampling (a minimal sketch follows below)
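As a refresher, here is a minimal sketch (not from the lecture) of the importance-sampling identity $\mathbb{E}_{a \sim \pi}[f(a)] = \mathbb{E}_{a \sim \beta}\!\big[\tfrac{\pi(a)}{\beta(a)} f(a)\big]$ on a toy discrete action space; the policies and the function $f$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete action space with a target policy pi and a behavior policy beta
# (both are assumptions for illustration, not distributions from the lecture).
actions = np.arange(4)
pi = np.array([0.1, 0.2, 0.3, 0.4])        # target policy pi(a)
beta = np.array([0.25, 0.25, 0.25, 0.25])  # behavior policy beta(a)
f = np.array([1.0, 2.0, 0.5, 3.0])         # some function of the action, e.g. a return

# Sample actions from the behavior policy only.
a = rng.choice(actions, size=100_000, p=beta)

# Importance-sampling estimate of E_{a~pi}[f(a)] using samples drawn from beta.
weights = pi[a] / beta[a]
is_estimate = np.mean(weights * f[a])

print("IS estimate :", is_estimate)
print("true value  :", np.sum(pi * f))
```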
Off-Policy Actor-Critic (OffPAC)
Stochastic behavior policy for exploration:
- Used for collecting data; denoted $\beta(a \mid s)$.
The objective function is
$$J_\beta(\pi_\theta) = \sum_{s} d^\beta(s)\, V^{\pi_\theta}(s) = \mathbb{E}_{s \sim d^\beta}\!\left[ V^{\pi_\theta}(s) \right],$$
where $d^\beta(s)$ is the stationary distribution under the behavior policy $\beta$.
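Below is a minimal sketch, under simplifying assumptions (tabular softmax policy, a critic estimate of $Q^{\pi}$ assumed given), of one importance-weighted actor update of the OffPAC flavor, i.e. $\nabla_\theta J_\beta(\theta) \approx \mathbb{E}_{\beta}\!\big[\tfrac{\pi_\theta(a \mid s)}{\beta(a \mid s)} \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a)\big]$; function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def offpac_actor_step(theta, batch, beta_probs, q_values, lr=0.01):
    """One OffPAC-style actor update on a tabular softmax policy.

    theta      : (n_states, n_actions) policy logits
    batch      : list of (s, a) pairs collected by the behavior policy beta
    beta_probs : (n_states, n_actions) behavior policy probabilities
    q_values   : (n_states, n_actions) critic estimate of Q^pi (assumed given)
    """
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for s, a in batch:
        rho = pi[s, a] / beta_probs[s, a]      # importance weight pi / beta
        # gradient of log pi(a|s) for a softmax policy: one-hot(a) - pi(.|s)
        glog = -pi[s].copy()
        glog[a] += 1.0
        grad[s] += rho * glog * q_values[s, a]
    return theta + lr * grad / len(batch)      # gradient ascent step
```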
Solving the Off-Policy Policy Gradient
To compute the off-policy policy gradient, the critic $Q^{\pi_\theta}(s,a)$ must be estimated from data collected by the behavior policy $\beta$.
Common solution:
- Importance sampling
- Tree backup
- Gradient temporal-difference learning
- Retrace [Munos et al., 2016], a variant of which (V-trace) is used in IMPALA (see the sketch below)
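As a complement to the list above, here is a minimal sketch of a Retrace($\lambda$)-style target computation with truncated importance weights $c_t = \lambda \min(1, \pi(a_t \mid s_t)/\beta(a_t \mid s_t))$; the array layout and function name are assumptions for illustration.

```python
import numpy as np

def retrace_targets(rewards, q_sa, exp_q_next, rho, gamma=0.99, lam=1.0):
    """Retrace(lambda)-style targets for Q(s_t, a_t) along one trajectory.

    rewards    : (T,) rewards r_{t+1}
    q_sa       : (T,) critic values Q(s_t, a_t)
    exp_q_next : (T,) expectations E_{a~pi} Q(s_{t+1}, a); 0 at the terminal step
    rho        : (T,) importance ratios pi(a_t|s_t) / beta(a_t|s_t)
    """
    T = len(rewards)
    c = lam * np.minimum(1.0, rho)               # truncated importance weights
    delta = rewards + gamma * exp_q_next - q_sa  # TD errors under the critic
    g = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):                 # backward recursion over the trajectory
        acc = delta[t] + gamma * (c[t + 1] if t + 1 < T else 0.0) * acc
        g[t] = acc
    return q_sa + g                              # corrected targets Q_ret(s_t, a_t)
```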
Importance Sampling
Assume that samples come in the form of episodes.
Let $n_s$ be the number of episodes containing state $s$, and let $t(s,k)$ be the first time $s$ appears in episode $k$.
The first-visit importance sampling estimator of $V^\pi(s)$ is
$$\hat{V}^\pi(s) = \frac{1}{n_s} \sum_{k=1}^{n_s} \rho_k \, G_k,$$
where $G_k = \sum_{t=t(s,k)}^{T_k - 1} \gamma^{\,t - t(s,k)}\, r_{t+1}$ is the return following the first visit to $s$ in episode $k$ (with $T_k$ the length of episode $k$), and $\rho_k$ is the importance sampling weight
$$\rho_k = \prod_{t = t(s,k)}^{T_k - 1} \frac{\pi(a_t \mid s_t)}{\beta(a_t \mid s_t)}.$$
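A minimal sketch of the first-visit importance-sampling estimator above, assuming episodes are stored as lists of $(s_t, a_t, r_{t+1})$ tuples and that $\pi$ and $\beta$ are given as probability functions (both assumptions for illustration):

```python
import numpy as np

def first_visit_is_value(episodes, s, pi, beta, gamma=1.0):
    """Ordinary first-visit importance-sampling estimate of V^pi(s).

    episodes : list of trajectories [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]
               collected under the behavior policy beta
    pi, beta : callables giving pi(a|s) and beta(a|s) probabilities
    """
    estimates = []
    for episode in episodes:
        # first time s appears in this episode (skip episodes that never visit s)
        first = next((t for t, (st, _, _) in enumerate(episode) if st == s), None)
        if first is None:
            continue
        rho, ret = 1.0, 0.0
        for k, (st, at, rt1) in enumerate(episode[first:]):
            rho *= pi(at, st) / beta(at, st)   # whole-trajectory importance weight
            ret += gamma ** k * rt1            # return following the first visit
        estimates.append(rho * ret)
    return np.mean(estimates) if estimates else 0.0
```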
Per-decision algorithm
Consider the weights used in the estimator above: every reward in the return $G_k$ is multiplied by the full-trajectory weight $\rho_k$.
Intuitively, the reward $r_{t+1}$ should not depend on the actions taken after time $t$.
This gives the per-decision importance sampling estimator:
$$\hat{V}^\pi(s) = \frac{1}{n_s} \sum_{k=1}^{n_s} \sum_{t = t(s,k)}^{T_k - 1} \gamma^{\,t - t(s,k)}\, r_{t+1} \prod_{\tau = t(s,k)}^{t} \frac{\pi(a_\tau \mid s_\tau)}{\beta(a_\tau \mid s_\tau)}.$$
The per-decision importance sampling estimator is a consistent and unbiased estimator of $V^\pi(s)$.
Proof as exercise.
Hints
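Not a proof hint, but for comparison with the first-visit estimator, here is a minimal sketch of the per-decision estimator under the same illustrative data layout:

```python
import numpy as np

def per_decision_is_value(episodes, s, pi, beta, gamma=1.0):
    """Per-decision first-visit importance-sampling estimate of V^pi(s).

    Each reward r_{t+1} is weighted only by the ratios of the actions
    taken up to and including time t, not by the whole-trajectory weight.
    """
    estimates = []
    for episode in episodes:
        first = next((t for t, (st, _, _) in enumerate(episode) if st == s), None)
        if first is None:
            continue
        rho, est = 1.0, 0.0
        for k, (st, at, rt1) in enumerate(episode[first:]):
            rho *= pi(at, st) / beta(at, st)   # product of ratios up to time t
            est += gamma ** k * rho * rt1      # reward weighted per decision
        estimates.append(est)
    return np.mean(estimates) if estimates else 0.0
```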
Deterministic Policy Gradient (DPG)
The objective function is
$$J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ r(s, \mu_\theta(s)) \right],$$
where $\rho^{\mu}(s)$ is the (discounted) stationary state distribution under the policy $\mu_\theta$.
The deterministic policy gradient theorem gives
$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a = \mu_\theta(s)} \right].$$
The proof follows along the same lines as the standard policy gradient theorem.
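A minimal sketch of one DPG actor update for a linear deterministic policy $\mu_\theta(s) = W s$, assuming the critic gradient $\nabla_a Q(s,a)$ is available as a callable (all names are illustrative):

```python
import numpy as np

def dpg_actor_step(W, states, grad_q, lr=1e-3):
    """One deterministic-policy-gradient step for a linear policy mu(s) = W @ s.

    W      : (action_dim, state_dim) policy parameters
    states : (N, state_dim) batch of states visited while following the policy
    grad_q : callable returning the critic gradient dQ/da at (s, a) (assumed given)
    """
    grad_W = np.zeros_like(W)
    for s in states:
        a = W @ s                              # deterministic action mu_theta(s)
        grad_W += np.outer(grad_q(s, a), s)    # chain rule: dQ/da * dmu/dW
    return W + lr * grad_W / len(states)       # gradient ascent step
```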
Issues for DPG
The formulations up to now can only use on-policy data.