CSE510 Deep Reinforcement Learning (Lecture 18)

Model-based RL framework

Model Learning with High-Dimensional Observations

  • Learning model in a latent space with observation reconstruction
  • Learning model in a latent space without observation reconstruction
  • Learning model in the observation space (i.e., videos)

Naive approach:

If we knew $f(s_t, a_t) = s_{t+1}$ (or $p(s_{t+1} \mid s_t, a_t)$ in the stochastic case), we could use the tools from last week.

So we can learn $f(s_t, a_t)$ from data, and then plan through it.
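
Concretely, planning "through" a known deterministic model means choosing the action sequence that maximizes cumulative reward along the model's rollout; a reward function $r(s_t, a_t)$ and horizon $T$ are assumed here, as in last week's planning setup:

$$
a_1, \dots, a_T \;=\; \arg\max_{a_1, \dots, a_T} \sum_{t=1}^{T} r(s_t, a_t)
\quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)
$$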

Model-based reinforcement learning version 0.5:

  1. Run base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$
  2. Learn dynamics model $f(s_t, a_t)$ to minimize $\sum_{i}\|f(s_i, a_i) - s_{i+1}\|^2$
  3. Plan through $f(s_t, a_t)$ to choose actions $a_t$ (see the sketch after this list)
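
A minimal sketch of version 0.5, assuming a Gymnasium-style `env` with continuous states and actions and a known per-step reward function `reward_fn(s, a)`; the network sizes, horizon, and the random-shooting planner are illustrative choices, not prescribed by the lecture.

```python
# Minimal sketch of version 0.5: random data -> fit dynamics -> plan through the model.
import numpy as np
import torch
import torch.nn as nn

def collect_random_data(env, num_steps):
    """Step 1: run a random base policy and store (s, a, s') transitions."""
    data, (s, _) = [], env.reset()
    for _ in range(num_steps):
        a = env.action_space.sample()
        s_next, r, terminated, truncated, _ = env.step(a)
        data.append((s, a, s_next))
        s = env.reset()[0] if (terminated or truncated) else s_next
    return data

def fit_dynamics(data, state_dim, action_dim, epochs=200):
    """Step 2: fit f(s, a) to minimize sum_i ||f(s_i, a_i) - s_{i+1}||^2."""
    f = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                      nn.Linear(256, state_dim))
    opt = torch.optim.Adam(f.parameters(), lr=1e-3)
    S  = torch.tensor(np.array([s  for s, a, sn in data]), dtype=torch.float32)
    A  = torch.tensor(np.array([a  for s, a, sn in data]), dtype=torch.float32)
    Sn = torch.tensor(np.array([sn for s, a, sn in data]), dtype=torch.float32)
    for _ in range(epochs):
        loss = ((f(torch.cat([S, A], dim=-1)) - Sn) ** 2).sum(dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return f

def plan_random_shooting(f, s0, reward_fn, action_dim, horizon=15, n_candidates=1000):
    """Step 3: score random action sequences under the model, return the first action of the best."""
    A = torch.rand(n_candidates, horizon, action_dim) * 2 - 1       # actions assumed in [-1, 1]
    s = torch.tensor(s0, dtype=torch.float32).repeat(n_candidates, 1)
    returns = torch.zeros(n_candidates)
    with torch.no_grad():
        for t in range(horizon):
            returns += reward_fn(s, A[:, t])
            s = f(torch.cat([s, A[:, t]], dim=-1))                  # roll the learned model forward
    return A[returns.argmax(), 0]
```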

Sometimes, it does work!

  • Essentially how system identification works in classical robotics
  • Some care should be taken to design a good base policy
  • Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters

However, the distribution mismatch problem becomes worse as we use more expressive model classes.

Version 0.5: collect random samples, train dynamics, plan

  • Pro: simple, no iterative procedure
  • Con: distribution mismatch problem

Version 1.0: iteratively collect data, replan, collect data

  • Pro: simple, solves distribution mismatch
  • Con: the open-loop plan might perform poorly, especially in stochastic domains (see the sketch below)
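
A rough sketch of the version 1.0 loop, reusing the hypothetical helpers above plus an assumed `plan_open_loop` variant of the planner that returns the whole best action sequence; note that the entire open-loop plan is executed before the model is refit, which is exactly where stochastic dynamics hurt.

```python
# Sketch of version 1.0 (illustrative): alternate between refitting the model and
# executing a full open-loop plan; plan_open_loop is an assumed helper.
def version_1_0(env, reward_fn, state_dim, action_dim, iterations=10):
    data = collect_random_data(env, num_steps=1000)
    for _ in range(iterations):
        f = fit_dynamics(data, state_dim, action_dim)
        s, _ = env.reset()
        for a in plan_open_loop(f, s, reward_fn, action_dim, horizon=50):
            s_next, r, terminated, truncated, _ = env.step(a.numpy())   # no replanning here
            data.append((s, a.numpy(), s_next))
            s = s_next
            if terminated or truncated:
                break
    return f
```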

Version 1.5: iteratively collect data using MPC (replan at each step)

  • Pro: robust to small model errors
  • Con: computationally expensive, and requires a planning algorithm to be available at every step (see the MPC sketch below)
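
The version 1.5 loop differs only in the innermost step: replan from the current state at every timestep and execute just the first action of each plan. Same hypothetical helpers and assumptions as in the sketches above.

```python
# Sketch of version 1.5 (MPC): replan at every step, execute only the first action.
def version_1_5(env, reward_fn, state_dim, action_dim, iterations=10, steps_per_iter=200):
    data = collect_random_data(env, num_steps=1000)
    for _ in range(iterations):
        f = fit_dynamics(data, state_dim, action_dim)
        s, _ = env.reset()
        for _ in range(steps_per_iter):
            a = plan_random_shooting(f, s, reward_fn, action_dim)   # replan from the current state
            s_next, r, terminated, truncated, _ = env.step(a.numpy())
            data.append((s, a.numpy(), s_next))
            s = s_next if not (terminated or truncated) else env.reset()[0]
    return f
```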

Version 2.0: backpropagate directly into policy

  • Pro: computationally cheap at runtime
  • Con: can be numerically unstable, especially in stochastic domains
  • Solution: model-free RL + model-based RL

Final version:

  1. Run base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$
  2. Learn dynamics model $f(s_t, a_t)$ to minimize $\sum_{i}\|f(s_i, a_i) - s_{i+1}\|^2$
  3. Backpropagate through $f(s_t, a_t)$ into the policy to optimize $\pi_\theta(a_t \mid s_t)$ (see the sketch after this list)
  4. Run the policy $\pi_\theta(a_t \mid s_t)$ to collect more data $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$
  5. Go to step 2
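
A minimal sketch of the final version under the same assumptions as above (deterministic learned model, known differentiable `reward_fn`, continuous actions squashed with `tanh`): the policy is optimized by unrolling the learned model and backpropagating the summed reward, then used to collect fresh data before refitting.

```python
# Sketch of the final version: backpropagate through the learned dynamics into the policy,
# then recollect data with that policy. Sizes and the tanh squashing are assumptions.
def train_policy_through_model(f, policy, reward_fn, start_states, horizon=15, epochs=200):
    """Step 3: unroll the learned model under the policy and ascend the total reward."""
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        s, total_reward = start_states, 0.0
        for _ in range(horizon):
            a = torch.tanh(policy(s))                       # differentiable action
            total_reward = total_reward + reward_fn(s, a).mean()
            s = f(torch.cat([s, a], dim=-1))                # gradient flows through the model
        (-total_reward).backward()                          # maximize reward
        opt.step(); opt.zero_grad()
    return policy

def final_version(env, reward_fn, state_dim, action_dim, iterations=10):
    policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                           nn.Linear(256, action_dim))
    data = collect_random_data(env, num_steps=1000)                      # step 1
    for _ in range(iterations):                                          # step 5: go to step 2
        f = fit_dynamics(data, state_dim, action_dim)                    # step 2
        starts = torch.tensor(np.array([s for s, a, sn in data[-256:]]), dtype=torch.float32)
        train_policy_through_model(f, policy, reward_fn, starts)         # step 3
        s, _ = env.reset()
        for _ in range(200):                                             # step 4
            with torch.no_grad():
                a = torch.tanh(policy(torch.as_tensor(s, dtype=torch.float32)))
            s_next, r, terminated, truncated, _ = env.step(a.numpy())
            data.append((s, a.numpy(), s_next))
            s = s_next if not (terminated or truncated) else env.reset()[0]
    return policy
```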
