CSE510 Deep Reinforcement Learning (Lecture 23)
Note: because my attention lapsed during this lecture, these notes were generated by ChatGPT as a continuation of the previous lecture's notes.
Offline Reinforcement Learning Part II: Advanced Approaches
Lecture 23 continues with advanced topics in offline RL, expanding on model-based imagination methods and on credit-assignment structures relevant to both single-agent and multi-agent offline settings.
Reverse Model-Based Imagination (ROMI)
ROMI is a method for augmenting an offline dataset with additional transitions generated by imagining trajectories backward from desirable states. Unlike forward model rollouts, backward imagination stays grounded in real data because imagined transitions always terminate in dataset states.
Reverse Dynamics Model
ROMI learns a reverse dynamics model:

$\tilde{p}_\psi(\hat{s}_t \mid s_{t+1}, a_t)$

Parameter explanations:
- $\tilde{p}_\psi$: learned reverse transition model.
- $\psi$: parameter vector for the reverse model.
- $s_{t+1}$: next state (from the dataset).
- $a_t$: action that hypothetically leads into $s_{t+1}$.
- $\hat{s}_t$: predicted predecessor state.
ROMI also learns a reverse policy to sample actions that likely lead into known states:

$\tilde{\pi}_\phi(a_t \mid s_{t+1})$

Parameter explanations:
- $\tilde{\pi}_\phi$: reverse policy distribution.
- $a_t$: action sampled for backward trajectory generation.
- $s_{t+1}$: state whose predecessors are being imagined.
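To make the two learned components concrete, here is a minimal sketch (an assumption, not the lecture's exact architecture) that represents both the reverse dynamics model and the reverse policy as small diagonal-Gaussian MLPs in PyTorch, trained by maximum likelihood on dataset transitions. The names `GaussianMLP`, `reverse_model`, `reverse_policy`, and the placeholder dimensions are illustrative.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # placeholder dimensions, for illustration only

class GaussianMLP(nn.Module):
    """Small MLP producing a diagonal Gaussian over its prediction target."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, out_dim)
        self.log_std = nn.Linear(hidden, out_dim)

    def dist(self, x):
        h = self.trunk(x)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mu(h), std)

# Reverse dynamics model p_psi(s_t | s_{t+1}, a_t): conditioned on (next state, action).
reverse_model = GaussianMLP(in_dim=obs_dim + act_dim, out_dim=obs_dim)
# Reverse policy pi_phi(a_t | s_{t+1}): conditioned on the next state only.
reverse_policy = GaussianMLP(in_dim=obs_dim, out_dim=act_dim)

def reverse_nll_losses(s, a, s_next):
    """Maximum-likelihood (negative log-likelihood) losses on real transitions (s, a, s')."""
    model_loss = -reverse_model.dist(torch.cat([s_next, a], -1)).log_prob(s).sum(-1).mean()
    policy_loss = -reverse_policy.dist(s_next).log_prob(a).sum(-1).mean()
    return model_loss, policy_loss
```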
Reverse Imagination Process
To generate imagined transitions (see the code sketch after this list):
- Select a goal or high-value state $s_{t+1}$ from the offline dataset.
- Sample $a_t \sim \tilde{\pi}_\phi(\cdot \mid s_{t+1})$.
- Predict $\hat{s}_t \sim \tilde{p}_\psi(\cdot \mid s_{t+1}, a_t)$.
- Form an imagined transition $(\hat{s}_t, a_t, \hat{r}_t, s_{t+1})$.
- Repeat backward to obtain a longer imagined trajectory.
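A minimal sketch of this backward rollout loop, reusing the hypothetical `reverse_model` and `reverse_policy` objects from the earlier sketch and assuming some learned `reward_model(s, a)` callable; batching and termination handling are omitted.

```python
import torch

@torch.no_grad()
def reverse_rollout(s_goal, horizon, reward_model):
    """Imagine a trajectory that terminates in the real dataset state s_goal.

    Returns imagined transitions (s_t, a_t, r_t, s_{t+1}), ordered forward in
    time, each of which is anchored to a grounded successor state.
    """
    transitions = []
    s_next = s_goal
    for _ in range(horizon):
        # Sample an action that plausibly leads into s_next.
        a = reverse_policy.dist(s_next).sample()
        # Predict the predecessor state.
        s_prev = reverse_model.dist(torch.cat([s_next, a], -1)).sample()
        # Reward for the imagined step, from a separately learned reward model.
        r = reward_model(s_prev, a)
        transitions.append((s_prev, a, r, s_next))
        s_next = s_prev  # continue stepping backward in time
    transitions.reverse()
    return transitions
```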
Benefits:
- Imagined states remain grounded by terminating in real dataset states.
- Helps propagate reward signals backward through states not originally visited.
- Avoids runaway model error that occurs in forward model rollouts.
ROMI effectively fills in missing gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms.
Implicit Credit Assignment via Value Factorization Structures
Although value factorization was initially studied for multi-agent systems, its insights also improve offline RL by providing structured credit-assignment signals.
Counterfactual Credit Assignment Insight
A factored value function structure of the form

$Q_{tot}(s, a_1, \ldots, a_n) = f_{mix}\big(Q_1(s, a_1), \ldots, Q_n(s, a_n); s\big)$

can implicitly implement counterfactual credit assignment.
Parameter explanations:
- $Q_{tot}$: global value function.
- $Q_i$: individual component value for agent or subsystem $i$.
- $f_{mix}$: mixing function combining the components.
- $s$: environment state.
- $a_i$: action taken by entity $i$.
In architectures designed for IGM (Individual-Global-Max) consistency, gradients backpropagated through $f_{mix}$ isolate the marginal effect of each component $Q_i$. This implicitly gives each agent or subsystem a counterfactual advantage signal.
Even in single-agent structured RL, similar factorizations allow credit to flow into components representing skills, modes, or action groups, enabling better temporal and structural decomposition.
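As one concrete instance of such a structure (an illustrative sketch, not the lecture's exact architecture), a QMIX-style mixer constrains its state-conditioned mixing weights to be non-negative, so $Q_{tot}$ is monotone in each component $Q_i$ and gradients through the mixer carry each component's marginal contribution:

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: Q_tot = f_mix(Q_1, ..., Q_n; s), monotone in each Q_i."""
    def __init__(self, n_components, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks produce state-conditioned mixing weights and biases.
        self.w1 = nn.Linear(state_dim, n_components * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                nn.Linear(embed_dim, 1))
        self.n, self.d = n_components, embed_dim

    def forward(self, q_components, state):
        # q_components: (batch, n), state: (batch, state_dim)
        w1 = torch.abs(self.w1(state)).view(-1, self.n, self.d)  # non-negative weights
        b1 = self.b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(q_components.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(-1, self.d, 1)       # non-negative weights
        b2 = self.b2(state).unsqueeze(1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(-1)  # (batch,)
```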
Model-Based vs Model-Free Offline RL
Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL.
Forward Model-Based Rollouts
Forward imagination using a learned model:

$\hat{s}_{t+1} \sim \hat{p}_\theta(\cdot \mid s_t, a_t)$

Parameter explanations:
- $\hat{p}_\theta$: learned forward dynamics model.
- $\theta$: parameters of the forward model.
- $\hat{s}_{t+1}$: predicted next state.
- $s_t$: current state.
- $a_t$: action taken in the current state.
Problems:
- Forward rollouts drift away from dataset support.
- Model error compounds with each step.
- Leads to training instability if used without penalties.
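For contrast, a minimal forward rollout sketch (assuming a `forward_model` and `policy` with the same Gaussian interface as the earlier sketches) shows where compounding error comes from: each step feeds a model prediction back into the model.

```python
import torch

@torch.no_grad()
def forward_rollout(forward_model, policy, s0, k):
    """Roll the learned model forward k steps from a real dataset state s0.

    Each step conditions on the previous prediction, so model errors compound
    and the rollout can drift off the dataset support.
    """
    traj, s = [], s0
    for _ in range(k):
        a = policy.dist(s).sample()
        s_next = forward_model.dist(torch.cat([s, a], -1)).sample()
        traj.append((s, a, s_next))
        s = s_next  # the model's prediction becomes the next input
    return traj
```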
Penalty Methods (MOPO, MOReL)
Augmented reward:

$\tilde{r}(s, a) = r(s, a) - \lambda\, u(s, a)$

Parameter explanations:
- $\tilde{r}(s, a)$: penalized reward for model-generated steps.
- $u(s, a)$: uncertainty score of the model for the state-action pair.
- $\lambda$: penalty coefficient.
- $r(s, a)$: original reward.
These penalties discourage the learned policy from exploiting regions where the model is uncertain.
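A common way to instantiate the uncertainty score $u(s, a)$, used here purely as an assumption, is disagreement across an ensemble of forward models; the penalized reward is then a one-liner:

```python
import torch

def penalized_reward(reward, ensemble_next_state_means, lam):
    """r_tilde(s, a) = r(s, a) - lam * u(s, a).

    ensemble_next_state_means: tensor of shape (n_models, batch, obs_dim)
    holding each ensemble member's predicted next-state mean; the norm of
    their standard deviation across members serves as u(s, a).
    """
    u = ensemble_next_state_means.std(dim=0).norm(dim=-1)  # (batch,)
    return reward - lam * u
```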
ROMI vs Forward Rollouts
- Forward methods expand state space beyond dataset.
- ROMI expands backward, staying consistent with known good future states.
- ROMI reduces error accumulation because future anchors are real.
Combining ROMI With Conservative Offline RL
ROMI is typically combined with:
- CQL (Conservative Q-Learning)
- IQL (Implicit Q-Learning)
- BCQ and BEAR (policy constraint methods)
Workflow (see the code sketch after this list):
- Generate imagined transitions via ROMI.
- Add them to dataset.
- Train Q-function or policy using conservative losses.
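A high-level sketch of this workflow, assuming the `reverse_rollout` helper from above, a generic `q_net(s, a)` critic, and a simplified version of CQL's log-sum-exp regularizer; every name here is illustrative rather than a fixed API.

```python
import torch

def cql_penalty(q_net, s, a_dataset, num_random=10, act_dim=6):
    """Simplified CQL regularizer: push Q down on random actions, up on dataset actions."""
    batch = s.shape[0]
    rand_a = torch.empty(batch, num_random, act_dim).uniform_(-1.0, 1.0)
    s_rep = s.unsqueeze(1).expand(-1, num_random, -1).reshape(batch * num_random, -1)
    q_rand = q_net(s_rep, rand_a.reshape(batch * num_random, -1)).view(batch, num_random)
    q_data = q_net(s, a_dataset)
    return torch.logsumexp(q_rand, dim=1).mean() - q_data.mean()

# Workflow sketch (illustrative names, steps given as comments):
# 1. imagined = [reverse_rollout(s, horizon=5, reward_model=reward_model)
#                for s in high_value_dataset_states]
# 2. buffer = dataset_transitions + [t for traj in imagined for t in traj]
# 3. for batch in sample(buffer):
#        loss = bellman_error(q_net, batch) + alpha * cql_penalty(q_net, batch.s, batch.a)
#        take an optimizer step on loss
```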
Benefits:
- Better coverage of reward-relevant states.
- Greater policy improvement over the dataset's behavior policy.
- More stable Q-learning backups.
Summary of Lecture 23
Key points:
- Offline RL can be improved via structured imagination.
- ROMI creates safe imagined transitions by reversing dynamics.
- Reverse imagination avoids pitfalls of forward model error.
- Factored value structures provide implicit counterfactual credit assignment.
- Combining ROMI with conservative learners yields state-of-the-art performance.