CSE510 Deep Reinforcement Learning (Lecture 16)
Deterministic Policy Gradient (DPG)
Learning Deterministic Policies
- Deterministic policy gradients [Silver et al., ICML 2014]
- Explicitly learn a deterministic policy.
- Advantages
- An optimal deterministic policy exists for MDPs
- Naturally handles continuous action spaces
- Expected to be more efficient than learning stochastic policies
- Computing the stochastic policy gradient requires more samples, as it integrates over both the state and action spaces.
- The deterministic gradient is preferable because it integrates over the state space only.
Deterministic Policy Gradient
The objective function is:

J(\mu_\theta) = \int_S \rho^\mu(s) \, Q^\mu(s, \mu_\theta(s)) \, ds = E_{s \sim \rho^\mu}\big[Q^\mu(s, \mu_\theta(s))\big]

where \rho^\mu(s) is the stationary state distribution under the behavior policy \mu_\theta.

For comparison, the standard (stochastic) policy gradient theorem gives

\nabla_\theta J(\pi_\theta) = E_{s \sim \rho^\pi, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s) \, Q^\pi(s, a)\big]

which integrates over both states and actions. The deterministic policy gradient theorem instead gives

\nabla_\theta J(\mu_\theta) = E_{s \sim \rho^\mu}\big[\nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}\big]

which integrates over states only.
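In deep RL implementations the chain-rule product \nabla_\theta \mu_\theta(s) \nabla_a Q(s, a) is obtained automatically by plugging the actor's output into the critic and backpropagating. Below is a minimal PyTorch-style sketch; the `actor` and `critic` modules are assumed, not part of the original slides.

```python
# Minimal sketch of the deterministic policy gradient via autodiff (PyTorch).
# `actor` and `critic` are assumed nn.Module networks: actor(s) -> a, critic(s, a) -> Q.
import torch

def dpg_actor_loss(actor, critic, states):
    """Negative mean Q-value of the actor's actions.

    Minimizing this loss ascends E_s[Q(s, mu_theta(s))]; backpropagation through
    the critic yields grad_theta mu_theta(s) * grad_a Q(s, a)|_{a=mu_theta(s)}.
    """
    actions = actor(states)             # a = mu_theta(s), differentiable w.r.t. theta
    q_values = critic(states, actions)  # Q(s, mu_theta(s))
    return -q_values.mean()
```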
Issues for DPG
The formulations so far can only use on-policy data.
A deterministic policy by itself can hardly guarantee sufficient exploration.
- Solution: off-policy training using a stochastic behavior policy (one common construction is sketched below).
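For example (and this is the choice DDPG makes later), the behavior policy can be the deterministic actor perturbed by exploration noise:

\beta(\cdot \mid s) = \mathcal{N}\big(\mu_\theta(s), \sigma^2 I\big), \qquad \text{i.e. } a = \mu_\theta(s) + \epsilon, \ \epsilon \sim \mathcal{N}(0, \sigma^2 I)

(The original DDPG paper uses Ornstein-Uhlenbeck noise rather than Gaussian noise.)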
Off-Policy Deterministic Policy Gradient (Off-DPG)
Use a stochastic behavior policy \beta(a|s) to generate the data. The modified objective function averages over the behavior policy's state distribution:

J_\beta(\mu_\theta) = \int_S \rho^\beta(s) \, Q^\mu(s, \mu_\theta(s)) \, ds = E_{s \sim \rho^\beta}\big[Q^\mu(s, \mu_\theta(s))\big]

The gradient is:

\nabla_\theta J_\beta(\mu_\theta) \approx E_{s \sim \rho^\beta}\big[\nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}\big]

Importance sampling is avoided in the actor because there is no integral over actions.
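For contrast, the off-policy stochastic actor-critic (Off-PAC, Degris et al., 2012) does need an importance weight, because its gradient estimate averages over actions sampled from \beta:

\nabla_\theta J_\beta(\pi_\theta) \approx E_{s \sim \rho^\beta, a \sim \beta}\Big[\frac{\pi_\theta(a|s)}{\beta(a|s)} \, \nabla_\theta \log \pi_\theta(a|s) \, Q^\pi(s, a)\Big]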
Policy Evaluation in DPG
Importance sampling can also be avoided in the critic.
A gradient-TD-like algorithm can be applied directly to the critic.
Off-Policy Deterministic Actor-Critic
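A sketch of the resulting update rules, following the off-policy deterministic actor-critic (OPDAC) of Silver et al. (2014), with critic Q^w and step sizes \alpha_w, \alpha_\theta:

\delta_t = r_t + \gamma Q^w(s_{t+1}, \mu_\theta(s_{t+1})) - Q^w(s_t, a_t)
w_{t+1} = w_t + \alpha_w \, \delta_t \, \nabla_w Q^w(s_t, a_t)
\theta_{t+1} = \theta_t + \alpha_\theta \, \nabla_\theta \mu_\theta(s_t) \, \nabla_a Q^w(s_t, a_t)\big|_{a=\mu_\theta(s_t)}

The critic is trained with Q-learning on transitions generated by the behavior policy, and the actor follows the deterministic policy gradient estimated with the learned critic.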
Deep Deterministic Policy Gradient (DDPG)
Insights from DQN + Deterministic Policy Gradients
- Use a replay buffer.
- The critic is updated at every timestep on a minibatch sampled from the buffer (a PyTorch-style sketch of one update step follows this list):

y_i = r_i + \gamma Q^{w'}(s_{i+1}, \mu_{\theta'}(s_{i+1})), \qquad L(w) = \frac{1}{N}\sum_i \big(y_i - Q^w(s_i, a_i)\big)^2

- The actor is updated at every timestep with the sampled deterministic policy gradient:

\nabla_\theta J \approx \frac{1}{N}\sum_i \nabla_\theta \mu_\theta(s_i) \, \nabla_a Q^w(s_i, a)\big|_{a=\mu_\theta(s_i)}

- Smoothing (soft) target updates at every timestep:

\theta' \leftarrow \tau \theta + (1 - \tau)\theta', \qquad w' \leftarrow \tau w + (1 - \tau) w'

- Exploration: add noise to the action selection:

a_t = \mu_\theta(s_t) + \mathcal{N}_t

- Batch normalization is used for training the networks.
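A minimal single-update sketch of these DDPG steps in PyTorch. All names (`actor`, `critic`, the target copies, the minibatch layout) and hyperparameter values are illustrative assumptions, not the original implementation.

```python
# Minimal DDPG update sketch (PyTorch). All names are illustrative assumptions:
# `actor`, `critic`, `actor_targ`, `critic_targ` are nn.Module networks,
# (s, a, r, s2, done) is a minibatch sampled from the replay buffer,
# gamma is the discount and tau the soft-update rate.
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch

    # Critic: regress Q^w(s, a) toward the bootstrapped target built from the
    # target networks, y = r + gamma * Q^{w'}(s', mu_{theta'}(s')).
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_targ(s2, actor_targ(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: sampled deterministic policy gradient, i.e. ascend Q^w(s, mu_theta(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Smoothing (soft) target updates: theta' <- tau*theta + (1 - tau)*theta'.
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.mul_(1.0 - tau).add_(tau * p)
```

Exploration noise is added to `actor(s)` only when acting in the environment, not inside this update.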
Extension of DDPG
Overestimation bias is an issue in Q-learning: maximization over a noisy value estimate induces a consistent overestimation of the true value.
Double DQN is not enough:
- Because the policy changes slowly in an actor-critic setting, the current and target value estimates remain too similar to avoid maximization bias.
- Target value of Double DQN in this setting: r_t + \gamma Q^{w'}(s_{t+1}, \mu_\theta(s_{t+1}))
TD3: Twin Delayed Deep Deterministic Policy Gradient
Address overestimation bias:
- Double Q-learning is unbiased in tabular settings, but slight overestimation remains with function approximation.
- With a pair of critics Q^{w_1}, Q^{w_2}, the two estimates are not fully independent (they share the replay buffer and appear in each other's targets), so it is possible that Q^{w_2}(s, \mu_\theta(s)) > Q^{w_1}(s, \mu_\theta(s)) for some states, i.e., the supposedly independent estimate still overestimates.
- Clipped double Q-learning therefore uses the minimum of the two critics in the target:

y = r + \gamma \min_{i=1,2} Q^{w_i'}\big(s', \mu_{\theta'}(s')\big)
High-variance estimates provide a noisy gradient.
Techniques in TD3 to reduce the variance:
- Update the policy at a lower frequency than the value network.
- Smoothing the value estimate (target policy smoothing): add clipped noise to the target action. Update target (a code sketch of the full target computation follows this list):

y = r + \gamma \, Q^{w'}\big(s', \mu_{\theta'}(s') + \epsilon\big)

where \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \tilde{\sigma}), -c, c\big).
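A PyTorch-style sketch of the full TD3 target computation, combining clipped double Q-learning with target policy smoothing. Network names, the `done` mask, and the hyperparameter values are illustrative assumptions.

```python
# TD3 target computation sketch (PyTorch): clipped double Q-learning combined
# with target policy smoothing. Names and hyperparameter values are illustrative.
import torch

def td3_target(critic1_targ, critic2_targ, actor_targ, r, s2, done,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise.
        a2 = actor_targ(s2)
        eps = torch.clamp(sigma * torch.randn_like(a2), -noise_clip, noise_clip)
        a2 = torch.clamp(a2 + eps, -act_limit, act_limit)

        # Clipped double Q: take the element-wise minimum of the two target critics.
        q_targ = torch.min(critic1_targ(s2, a2), critic2_targ(s2, a2))
        return r + gamma * (1.0 - done) * q_targ
```

Both critics are regressed toward this same target; the actor and the target networks are then updated only once every d critic updates (d = 2 in the TD3 paper).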
Other methods
- Generalizable Episode Memory for Deep Reinforcement Learning
- Distributed Distributional Deep Deterministic Policy Gradient (D4PG)
- Distributional critic
- N-step returns are used to update the critic
- Multiple distributed parallel actors
- Prioritized experience replay
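For reference, the N-step target used for the distributional critic Z^{w'} in D4PG has the form (sketch; distributional projection details omitted):

Y_t = \sum_{k=0}^{N-1} \gamma^k r_{t+k} + \gamma^N Z^{w'}\big(s_{t+N}, \mu_{\theta'}(s_{t+N})\big)

where the critic is trained to match the distribution of Y_t, as in distributional RL.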