CSE510 Deep Reinforcement Learning (Lecture 19)

Model learning with high-dimensional observations

  • Learning model in a latent space with observation reconstruction
  • Learning model in a latent space without reconstruction

Learn in Latent Space: Dreamer

Learning an embedding of images and a dynamics model (jointly)

Dreamer

Representation model: $p_\theta(s_t|s_{t-1}, a_{t-1}, o_t)$

Observation model: $q_\theta(o_t|s_t)$

Reward model: $q_\theta(r_t|s_t)$

Transition model: $q_\theta(s_t|s_{t-1}, a_{t-1})$

Variational evidence lower bound (ELBO) objective:

$$\mathcal{J}_{REC}\doteq \mathbb{E}_{p}\left(\sum_t\left(\mathcal{J}_O^t+\mathcal{J}_R^t+\mathcal{J}_D^t\right)\right)$$

where

$$\mathcal{J}_O^t\doteq \ln q(o_t|s_t), \qquad \mathcal{J}_R^t\doteq \ln q(r_t|s_t), \qquad \mathcal{J}_D^t\doteq -\beta \operatorname{KL}\big(p(s_t|s_{t-1}, a_{t-1}, o_t)\,\|\,q(s_t|s_{t-1}, a_{t-1})\big)$$
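To make the objective concrete, the sketch below (PyTorch, with hypothetical simplified Gaussian heads named `repr_net`, `trans_net`, `obs_net`, and `reward_net`) computes the three terms for a single time step. The actual Dreamer world model uses an RSSM with a deterministic recurrent state, so treat this only as an illustration of the loss structure under those simplifying assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

# Hypothetical, simplified stand-ins for Dreamer's components:
# each maps its conditioning variables to the mean/std of a diagonal Gaussian.
class GaussianHead(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Linear(in_dim, 2 * out_dim)

    def forward(self, *inputs):
        mean, log_std = self.net(torch.cat(inputs, dim=-1)).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())

S, A, O = 32, 4, 64                        # toy latent / action / observation sizes
repr_net   = GaussianHead(S + A + O, S)    # p_theta(s_t | s_{t-1}, a_{t-1}, o_t)
trans_net  = GaussianHead(S + A, S)        # q_theta(s_t | s_{t-1}, a_{t-1})
obs_net    = GaussianHead(S, O)            # q_theta(o_t | s_t)
reward_net = GaussianHead(S, 1)            # q_theta(r_t | s_t)

def elbo_terms(s_prev, a_prev, o_t, r_t, beta=1.0):
    """One-step contribution J_O^t + J_R^t + J_D^t (to be maximized)."""
    posterior = repr_net(s_prev, a_prev, o_t)        # representation model
    prior     = trans_net(s_prev, a_prev)            # transition model
    s_t = posterior.rsample()                        # reparameterized latent sample

    j_obs    = obs_net(s_t).log_prob(o_t).sum(-1)    # ln q(o_t | s_t)
    j_reward = reward_net(s_t).log_prob(r_t).sum(-1) # ln q(r_t | s_t)
    j_div    = -beta * kl_divergence(posterior, prior).sum(-1)
    return j_obs + j_reward + j_div, s_t

# Toy usage for one batch element and one step; a real model sums over time.
s_prev, a_prev = torch.zeros(1, S), torch.zeros(1, A)
o_t, r_t = torch.randn(1, O), torch.randn(1, 1)
objective, s_t = elbo_terms(s_prev, a_prev, o_t, r_t)
loss = -objective.mean()                             # minimize the negative ELBO
```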

More versions of Dreamer

The latest version is DreamerV3 (link to the paper).

Learn in Latent Space

  • Pros
    • Learns visual skills efficiently (using relatively simple networks)
  • Cons
    • The autoencoder might not recover the right representation
    • The learned representation is not necessarily suitable for model-based methods
    • The embedding is often not a good state representation unless a history of observations is used

Planning with Value Prediction Network (VPN)

Idea: generate trajectories by following an $\epsilon$-greedy policy based on the planning method below

The Q-value computed from $d$-step planning is defined as:

$$Q_\theta^d(s,o)=r+\gamma V_\theta^{d}(s')$$

$$V_\theta^{d}(s)=\begin{cases} V_\theta(s) & \text{if } d=1\\ \frac{1}{d}V_\theta(s)+\frac{d-1}{d}\max_{o} Q_\theta^{d-1}(s,o) & \text{if } d>1 \end{cases}$$
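As a sketch, this recursion can be written directly as follows (plain Python). Here `value(s)` stands for the learned value network $V_\theta$ and `step(s, o)` for a placeholder model returning the predicted reward, discount, and next latent state; these names are assumptions, and the b-best option pruning used in the VPN paper is omitted for clarity.

```python
# Minimal sketch of d-step planning, assuming placeholder callables:
#   value(s)   -> V_theta(s), the learned value of latent state s
#   step(s, o) -> (r, gamma, s_next): predicted reward, discount, next state
#   options    -> the discrete set of options o

def q_plan(s, o, d, value, step, options):
    """Q_theta^d(s, o) = r + gamma * V_theta^d(s')."""
    r, gamma, s_next = step(s, o)
    return r + gamma * v_plan(s_next, d, value, step, options)

def v_plan(s, d, value, step, options):
    """V_theta^d(s): mix the direct value estimate with a deeper planning backup."""
    if d == 1:
        return value(s)
    backup = max(q_plan(s, o, d - 1, value, step, options) for o in options)
    return (1.0 / d) * value(s) + ((d - 1.0) / d) * backup

def plan_option(s, d, value, step, options):
    """Greedy option under d-step planning; epsilon-greedy exploration goes on top."""
    return max(options, key=lambda o: q_plan(s, o, d, value, step, options))
```

Note how $V_\theta^d$ puts weight $\tfrac{d-1}{d}$ on the deeper backup, so longer planning horizons rely more on the learned model and less on the raw value estimate.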

VPN

Given an $n$-step trajectory $x_1, o_1, r_1, \gamma_1, x_2, o_2, r_2, \gamma_2, \dots, x_{n+1}$ generated by the $\epsilon$-greedy policy, the $k$-step predictions are defined as follows:

$$s_t^k=\begin{cases} f^{enc}_\theta(x_t) & \text{if } k=0\\ f^{trans}_\theta(s_{t-1}^{k-1},o_{t-1}) & \text{if } k>0 \end{cases}$$

$$v_t^k=f^{value}_\theta(s_t^k)$$

$$r_t^k,\ \gamma_t^k=f^{out}_\theta(s_t^{k-1},o_t)$$

$$\mathcal{L}_t=\sum_{l=1}^k\left[(R_t-v_t^l)^2+(r_t-r_t^l)^2+(\gamma_t-\gamma_t^l)^2\right]\quad\text{where}\quad R_t=\begin{cases} r_t+\gamma_t R_{t+1} & \text{if } t\leq n\\ \max_{o} Q_{\theta^-}^d(s_{n+1},o) & \text{if } t=n+1 \end{cases}$$
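Below is a minimal sketch of this training signal (PyTorch, with hypothetical modules `f_enc`, `f_trans`, `f_value`, `f_out` following the interfaces implied above). It unrolls the model once along the executed options; the paper's loss $\mathcal{L}_t$ additionally sums, for each step $t$, over predictions made at every depth $l = 1,\dots,k$, which is omitted here for brevity.

```python
import torch

def vpn_rollout_loss(x0, opts, rewards, discounts, returns,
                     f_enc, f_trans, f_value, f_out):
    """Simplified k-step prediction loss along one trajectory segment.

    opts[i], rewards[i], discounts[i] describe step i of the trajectory;
    returns has one extra entry, with returns[i] the bootstrapped n-step
    return R for step i and returns[-1] the target-network bootstrap
    max_o Q_{theta^-}(s_{n+1}, o), all assumed to be precomputed tensors.
    """
    s = f_enc(x0)                              # latent state for the first step
    loss = torch.zeros(())
    for i, o in enumerate(opts):
        r_pred, g_pred = f_out(s, o)           # predicted reward and discount for o
        s = f_trans(s, o)                      # predicted next latent state
        v_pred = f_value(s)                    # predicted value of the reached state
        loss = loss + (rewards[i] - r_pred) ** 2 \
                    + (discounts[i] - g_pred) ** 2 \
                    + (returns[i + 1] - v_pred) ** 2
    return loss
```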

MuZero

Matches or exceeds AlphaZero's performance in Go, chess, and shogi while learning its model entirely from experience, without being given the game rules.
