CSE510 Deep Reinforcement Learning (Lecture 15)

Motivation

For policy gradient methods over stochastic policies

$$\pi_\theta(a|s) = P[a|s,\theta]$$

Advantages

  • Can potentially learn optimal solutions in multi-agent settings
  • Can handle partially observable settings
  • Provide sufficient exploration

Disadvantages

  • Cannot learn a deterministic policy
  • Extension to continuous action spaces is not straightforward

On-Policy vs. Off-Policy Policy Gradients

On-Policy Policy Gradients:

  • Training samples are collected according to the current policy.

Off-Policy Algorithms:

  • Enable the reuse of past experience.
  • Samples can be collected by an exploratory behavior policy.

How do we design an off-policy policy gradient?

  • Using importance sampling
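Importance sampling rests on the identity $\mathbb{E}_{x\sim p}[f(x)] = \mathbb{E}_{x\sim q}\left[\frac{p(x)}{q(x)} f(x)\right]$: an expectation under a target distribution $p$ can be estimated from samples drawn under a behavior distribution $q$. A minimal numerical sanity check, using made-up discrete distributions (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution p and behavior distribution q over 3 actions (made-up numbers).
p = np.array([0.7, 0.2, 0.1])
q = np.array([1/3, 1/3, 1/3])
f = np.array([1.0, 5.0, -2.0])   # arbitrary function of the action

# Exact expectation under p.
exact = np.dot(p, f)

# Monte Carlo estimate from samples drawn under q, reweighted by p/q.
samples = rng.choice(3, size=100_000, p=q)
is_estimate = np.mean((p[samples] / q[samples]) * f[samples])

print(exact, is_estimate)   # the two values should be close
```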

Off-Policy Actor-Critic (OffPAC)

A stochastic behavior policy is used for exploration.

  • It collects the training data and is denoted $\beta(a|s)$.

The objective function is:

$$J(\theta)=\mathbb{E}_{s\sim d^\beta}\left[V^{\pi}(s)\right] = \sum_{s\in S} d^\beta(s) \sum_{a\in A} \pi_\theta(a|s)\, Q^{\pi}(s,a)$$

$d^\beta(s)$ is the stationary distribution under the behavior policy $\beta(a|s)$.

Solving the Off-Policy Policy Gradient

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta\, \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \pi_\theta(a|s)\, Q^{\pi}(s,a)\right]\\
&= \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a) + \pi_\theta(a|s)\, \nabla_\theta Q^{\pi}(s,a)\right]\\
&\approx \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a)\right]\\
&= \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \beta(a|s)\, \frac{1}{\beta(a|s)}\, \nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a)\right]\\
&= \mathbb{E}_{\beta}\left[\frac{1}{\beta(a|s)}\, \nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a)\right]\\
&= \mathbb{E}_{\beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)}\, Q^{\pi}(s,a)\, \nabla_\theta \log \pi_\theta(a|s)\right]
\end{aligned}$$

The term containing $\nabla_\theta Q^{\pi}(s,a)$ is dropped in the third line, so the resulting off-policy gradient is an approximation.
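The final expression can be estimated from transitions collected under $\beta$. A minimal PyTorch-style sketch of one gradient step, where the batch of states, actions, behavior probabilities, and critic estimates of $Q^{\pi}(s,a)$ are placeholder tensors and the tiny linear policy is an illustrative assumption (none of this comes from the lecture):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_states, n_actions = 4, 3
policy = nn.Linear(n_states, n_actions)            # logits of pi_theta(.|s)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

# Hypothetical batch collected under a behavior policy beta (placeholder data).
states = torch.randn(32, n_states)                 # s ~ d^beta
actions = torch.randint(n_actions, (32,))          # a ~ beta(.|s)
beta_probs = torch.full((32,), 1.0 / n_actions)    # beta(a|s), here uniform
q_values = torch.randn(32)                         # critic estimates of Q^pi(s,a)

log_pi = torch.log_softmax(policy(states), dim=-1)            # log pi_theta(.|s)
log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a|s)
ratio = (log_pi_a.exp() / beta_probs).detach()                # pi_theta/beta, treated as a constant

# Minimizing this loss follows the estimator E_beta[(pi/beta) * Q * grad log pi].
loss = -(ratio * q_values * log_pi_a).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the importance ratio is detached, the gradient of this loss matches the estimator $\mathbb{E}_{\beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)} Q^{\pi}(s,a)\nabla_\theta \log \pi_\theta(a|s)\right]$ averaged over the batch.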

To compute the off-policy policy gradient, $Q^{\pi}(s,a)$ is estimated from data collected by $\beta$.

Common solution:

  • Importance sampling
  • Tree backup
  • Gradient temporal-difference learning
  • Retrace [Munos et al., 2016] and its V-trace variant used in IMPALA

Importance Sampling

Assume that samples come in the form of episodes.

Let $M$ be the number of episodes containing $(s,a)$, and let $t_m$ be the first time at which $(s,a)$ appears in episode $m$.

The first-visit importance sampling estimator of $Q^{\pi}(s,a)$ is:

$$Q^{IS}(s,a)\coloneqq \frac{1}{M}\sum_{m=1}^M R_m w_m$$

$R_m$ is the return following $(s,a)$ in episode $m$:

$$R_m\coloneqq r_{t_m+1}+\gamma r_{t_m+2}+\cdots+\gamma^{T_m-t_m-1} r_{T_m}$$

$w_m$ is the importance sampling weight:

$$w_m\coloneqq \frac{\pi(a_{t_m}|s_{t_m})}{\beta(a_{t_m}|s_{t_m})}\,\frac{\pi(a_{t_m+1}|s_{t_m+1})}{\beta(a_{t_m+1}|s_{t_m+1})}\cdots\frac{\pi(a_{T_m-1}|s_{T_m-1})}{\beta(a_{T_m-1}|s_{T_m-1})}$$
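To make the estimator concrete, here is a minimal NumPy sketch of $Q^{IS}(s,a)$. It assumes each episode is stored as a dict of equal-length arrays ('states', 'actions', 'rewards', 'pi_probs', 'beta_probs'), where rewards[t] is the reward received after taking actions[t] in states[t]; this data layout is an assumption for illustration, not part of the lecture.

```python
import numpy as np

def q_is(episodes, s, a, gamma=0.99):
    """First-visit importance sampling estimate of Q^pi(s, a)."""
    estimates = []
    for ep in episodes:
        # t_m: first time (s, a) appears in this episode; skip episodes without it.
        hits = [t for t, (st, at) in enumerate(zip(ep['states'], ep['actions']))
                if st == s and at == a]
        if not hits:
            continue
        t_m = hits[0]
        rewards = np.asarray(ep['rewards'][t_m:], dtype=float)
        # R_m = r_{t_m+1} + gamma r_{t_m+2} + ... (discounted return after (s, a))
        R_m = np.sum(gamma ** np.arange(len(rewards)) * rewards)
        # w_m = product of pi/beta ratios for every action from t_m to the end of the episode
        w_m = np.prod(np.asarray(ep['pi_probs'][t_m:], dtype=float) /
                      np.asarray(ep['beta_probs'][t_m:], dtype=float))
        estimates.append(R_m * w_m)
    return np.mean(estimates) if estimates else 0.0
```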

Per-decision algorithm

Consider the product $R_m w_m$ used in the importance sampling estimator:

$$R_m w_m=\sum_{i=t_m+1}^{T_m}\gamma^{i-t_m-1}\, r_i\, \frac{\pi(a_{t_m}|s_{t_m})}{\beta(a_{t_m}|s_{t_m})}\cdots \frac{\pi(a_{i-1}|s_{i-1})}{\beta(a_{i-1}|s_{i-1})}\,\frac{\pi(a_{i}|s_{i})}{\beta(a_{i}|s_{i})}\cdots \frac{\pi(a_{T_m-1}|s_{T_m-1})}{\beta(a_{T_m-1}|s_{T_m-1})}$$

Intuitively, the reward $r_i$ does not depend on the actions taken from time $i$ onward, so their importance ratios can be dropped.

This gives the per-decision importance sampling estimator:

$$Q^{PD}(s,a)\coloneqq \frac{1}{M}\sum_{m=1}^M \sum_{k=1}^{T_m-t_m} \gamma^{k-1}\, r_{t_m+k}\prod_{i=t_m}^{t_m+k-1} \frac{\pi(a_{i}|s_{i})}{\beta(a_{i}|s_{i})}$$
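Continuing the sketch above (same assumed episode layout), the per-decision estimator weights each reward only by the ratios of the actions taken up to and including the one that produced it:

```python
import numpy as np

def q_pd(episodes, s, a, gamma=0.99):
    """Per-decision importance sampling estimate of Q^pi(s, a).

    Uses the same assumed episode layout as q_is above: rewards[t] is the
    reward received after taking actions[t] in states[t].
    """
    estimates = []
    for ep in episodes:
        hits = [t for t, (st, at) in enumerate(zip(ep['states'], ep['actions']))
                if st == s and at == a]
        if not hits:
            continue
        t_m = hits[0]
        ratios = (np.asarray(ep['pi_probs'][t_m:], dtype=float) /
                  np.asarray(ep['beta_probs'][t_m:], dtype=float))
        rewards = np.asarray(ep['rewards'][t_m:], dtype=float)
        total = 0.0
        # The reward k steps after (s, a) is weighted only by the ratios of the actions up to it.
        for k in range(len(rewards)):
            total += gamma ** k * rewards[k] * np.prod(ratios[:k + 1])
        estimates.append(total)
    return np.mean(estimates) if estimates else 0.0
```

The truncated products leave the expectation unchanged while typically reducing variance, which is the point of the per-decision form.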

The per-decision importance sampling estimator is a consistent and unbiased estimator of $Q^{\pi}(s,a)$.

Proof as exercise.

Hints

  • Show that the expectation of $Q^{PD}(s,a)$ equals that of $Q^{IS}(s,a)$.
  • $Q^{IS}(s,a)$ is a consistent and unbiased estimator of $Q^{\pi}(s,a)$.

Deterministic Policy Gradient (DPG)

The objective function is:

$$J(\theta)=\int_{s\in S} \rho^{\mu}(s)\, r(s,\mu_\theta(s))\, ds$$

where $\rho^{\mu}(s)$ is the stationary state distribution induced by the deterministic policy $\mu_\theta(s)$.

The proof follows the same lines as the standard policy gradient theorem.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\mu_\theta}\left[\nabla_\theta Q^{\mu_\theta}(s,a)\right]=\mathbb{E}_{s\sim \rho^{\mu}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\big\vert_{a=\mu_\theta(s)}\right]$$
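In practice this gradient is obtained by backpropagating through a critic: feeding $a=\mu_\theta(s)$ into $Q(s,a)$ and differentiating with respect to $\theta$ applies the chain rule $\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s,a)$ automatically. A minimal PyTorch-style sketch of the actor update, with made-up network sizes and a randomly initialized critic standing in for $Q^{\mu_\theta}$ (illustrative assumptions, not the lecture's setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())          # mu_theta(s)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                              # Q(s, a) stand-in
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)              # placeholder batch, s ~ rho^mu

# Actor loss: maximize Q(s, mu_theta(s)); autograd applies
# grad_theta mu_theta(s) * grad_a Q(s, a)|_{a = mu_theta(s)}.
actions = actor(states)                          # a = mu_theta(s)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor_opt.zero_grad()
actor_loss.backward()                            # only the actor is stepped here
actor_opt.step()
```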

Issues for DPG

The formulations up to now can only use on-policy data.

Deep Deterministic Policy Gradient (DDPG)
