CSE510 Deep Reinforcement Learning (Lecture 18)

Model-based RL framework

Model Learning with High-Dimensional Observations

  • Learning model in a latent space with observation reconstruction
  • Learning model in a latent space without observation reconstruction
  • Learning model in the observation space (i.e., videos)

Naive approach:

If we knew $f(s_t, a_t) = s_{t+1}$ (or $p(s_{t+1} \mid s_t, a_t)$ in the stochastic case), we could use the tools from last week.

So we can learn $f(s_t, a_t)$ from data, and then plan through it.
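
Concretely, planning "through" a known deterministic model means choosing the action sequence that maximizes cumulative reward along the model's rollout; a reward function $r(s_t, a_t)$ and horizon $T$ are assumed here, as in last week's planning setup:

$$
a_1, \dots, a_T \;=\; \arg\max_{a_1, \dots, a_T} \sum_{t=1}^{T} r(s_t, a_t)
\quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)
$$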

Model-based reinforcement learning version 0.5:

  1. Run base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$
  2. Learn dynamics model $f(s_t, a_t)$ to minimize $\sum_{i}\|f(s_i, a_i) - s_{i+1}\|^2$
  3. Plan through $f(s_t, a_t)$ to choose actions $a_t$ (see the sketch after this list)
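
A minimal sketch of version 0.5, assuming a Gymnasium-style `env` with continuous states and actions and a known per-step reward function `reward_fn(s, a)`; the network sizes, horizon, and the random-shooting planner are illustrative choices, not prescribed by the lecture.

```python
# Minimal sketch of version 0.5: random data -> fit dynamics -> plan through the model.
import numpy as np
import torch
import torch.nn as nn

def collect_random_data(env, num_steps):
    """Step 1: run a random base policy and store (s, a, s') transitions."""
    data, (s, _) = [], env.reset()
    for _ in range(num_steps):
        a = env.action_space.sample()
        s_next, r, terminated, truncated, _ = env.step(a)
        data.append((s, a, s_next))
        s = env.reset()[0] if (terminated or truncated) else s_next
    return data

def fit_dynamics(data, state_dim, action_dim, epochs=200):
    """Step 2: fit f(s, a) to minimize sum_i ||f(s_i, a_i) - s_{i+1}||^2."""
    f = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                      nn.Linear(256, state_dim))
    opt = torch.optim.Adam(f.parameters(), lr=1e-3)
    S  = torch.tensor(np.array([s  for s, a, sn in data]), dtype=torch.float32)
    A  = torch.tensor(np.array([a  for s, a, sn in data]), dtype=torch.float32)
    Sn = torch.tensor(np.array([sn for s, a, sn in data]), dtype=torch.float32)
    for _ in range(epochs):
        loss = ((f(torch.cat([S, A], dim=-1)) - Sn) ** 2).sum(dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return f

def plan_random_shooting(f, s0, reward_fn, action_dim, horizon=15, n_candidates=1000):
    """Step 3: score random action sequences under the model, return the first action of the best."""
    A = torch.rand(n_candidates, horizon, action_dim) * 2 - 1       # actions assumed in [-1, 1]
    s = torch.tensor(s0, dtype=torch.float32).repeat(n_candidates, 1)
    returns = torch.zeros(n_candidates)
    with torch.no_grad():
        for t in range(horizon):
            returns += reward_fn(s, A[:, t])
            s = f(torch.cat([s, A[:, t]], dim=-1))                  # roll the learned model forward
    return A[returns.argmax(), 0]
```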

Sometimes, it does work!

  • Essentially how system identification works in classical robotics
  • Some care should be taken to design a good base policy
  • Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters

However, the distribution mismatch problem becomes worse as we use more expressive model classes.

Version 0.5: collect random samples, train dynamics, plan

  • Pro: simple, no iterative procedure
  • Con: distribution mismatch problem

Version 1.0: iteratively collect data, replan, collect data

  • Pro: simple, solves distribution mismatch
  • Con: the open-loop plan might perform poorly, especially in stochastic domains (see the sketch below)
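
A rough sketch of the version 1.0 loop, reusing the hypothetical helpers above plus an assumed `plan_open_loop` variant of the planner that returns the whole best action sequence; note that the entire open-loop plan is executed before the model is refit, which is exactly where stochastic dynamics hurt.

```python
# Sketch of version 1.0 (illustrative): alternate between refitting the model and
# executing a full open-loop plan; plan_open_loop is an assumed helper.
def version_1_0(env, reward_fn, state_dim, action_dim, iterations=10):
    data = collect_random_data(env, num_steps=1000)
    for _ in range(iterations):
        f = fit_dynamics(data, state_dim, action_dim)
        s, _ = env.reset()
        for a in plan_open_loop(f, s, reward_fn, action_dim, horizon=50):
            s_next, r, terminated, truncated, _ = env.step(a.numpy())   # no replanning here
            data.append((s, a.numpy(), s_next))
            s = s_next
            if terminated or truncated:
                break
    return f
```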

Version 1.5: iteratively collect data using MPC (replan at each step)

  • Pro: robust to small model errors
  • Con: computationally expensive, and requires a planning algorithm to be available at every step (see the MPC sketch below)
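
The version 1.5 loop differs only in the innermost step: replan from the current state at every timestep and execute just the first action of each plan. Same hypothetical helpers and assumptions as in the sketches above.

```python
# Sketch of version 1.5 (MPC): replan at every step, execute only the first action.
def version_1_5(env, reward_fn, state_dim, action_dim, iterations=10, steps_per_iter=200):
    data = collect_random_data(env, num_steps=1000)
    for _ in range(iterations):
        f = fit_dynamics(data, state_dim, action_dim)
        s, _ = env.reset()
        for _ in range(steps_per_iter):
            a = plan_random_shooting(f, s, reward_fn, action_dim)   # replan from the current state
            s_next, r, terminated, truncated, _ = env.step(a.numpy())
            data.append((s, a.numpy(), s_next))
            s = s_next if not (terminated or truncated) else env.reset()[0]
    return f
```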

Version 2.0: backpropagate directly into policy

  • Pro: computationally cheap at runtime
  • Con: can be numerically unstable, especially in stochastic domains
  • Solution: model-free RL + model-based RL

Final version:

  1. Run base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$
  2. Learn dynamics model $f(s_t, a_t)$ to minimize $\sum_{i}\|f(s_i, a_i) - s_{i+1}\|^2$
  3. Backpropagate through $f(s_t, a_t)$ into the policy to optimize $\pi_\theta(a_t \mid s_t)$ (see the sketch after this list)
  4. Run the policy $\pi_\theta(a_t \mid s_t)$ to collect more data $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$
  5. Go to step 2
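
A minimal sketch of the final version under the same assumptions as above (deterministic learned model, known differentiable `reward_fn`, continuous actions squashed with `tanh`): the policy is optimized by unrolling the learned model and backpropagating the summed reward, then used to collect fresh data before refitting.

```python
# Sketch of the final version: backpropagate through the learned dynamics into the policy,
# then recollect data with that policy. Sizes and the tanh squashing are assumptions.
def train_policy_through_model(f, policy, reward_fn, start_states, horizon=15, epochs=200):
    """Step 3: unroll the learned model under the policy and ascend the total reward."""
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        s, total_reward = start_states, 0.0
        for _ in range(horizon):
            a = torch.tanh(policy(s))                       # differentiable action
            total_reward = total_reward + reward_fn(s, a).mean()
            s = f(torch.cat([s, a], dim=-1))                # gradient flows through the model
        (-total_reward).backward()                          # maximize reward
        opt.step(); opt.zero_grad()
    return policy

def final_version(env, reward_fn, state_dim, action_dim, iterations=10):
    policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                           nn.Linear(256, action_dim))
    data = collect_random_data(env, num_steps=1000)                      # step 1
    for _ in range(iterations):                                          # step 5: go to step 2
        f = fit_dynamics(data, state_dim, action_dim)                    # step 2
        starts = torch.tensor(np.array([s for s, a, sn in data[-256:]]), dtype=torch.float32)
        train_policy_through_model(f, policy, reward_fn, starts)         # step 3
        s, _ = env.reset()
        for _ in range(200):                                             # step 4
            with torch.no_grad():
                a = torch.tanh(policy(torch.as_tensor(s, dtype=torch.float32)))
            s_next, r, terminated, truncated, _ = env.step(a.numpy())
            data.append((s, a.numpy(), s_next))
            s = s_next if not (terminated or truncated) else env.reset()[0]
    return policy
```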
