CSE510 Deep Reinforcement Learning (Lecture 22)

Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.

Offline Reinforcement Learning: Introduction and Challenges

Offline reinforcement learning (offline RL), also called batch RL, aims to learn an optimal policy without interacting with the environment. Instead, the agent is given a fixed dataset of transitions collected by an unknown behavior policy.

The Offline RL Dataset

We are given a static dataset:

D = \{ (s_i, a_i, s'_i, r_i) \}_{i=1}^{N}

Parameter explanations:

  • s_i: state sampled from the behavior policy's state distribution.
  • a_i: action selected by the behavior policy \pi_{\beta}.
  • s'_i: next state sampled from the environment dynamics p(s'|s,a).
  • r_i: reward observed for the transition (s_i, a_i).
  • N: total number of transitions in the dataset.
  • D: full offline dataset used for training.
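
For concreteness, here is a minimal sketch of how such a dataset might be held in memory. The class and the `sample_batch` helper are illustrative, not part of the lecture:

```python
import numpy as np

class OfflineDataset:
    """Fixed buffer of transitions (s, a, s', r) collected by an unknown behavior policy."""

    def __init__(self, states, actions, next_states, rewards):
        # All arrays share the same leading dimension N (number of transitions).
        self.states = np.asarray(states, dtype=np.float32)
        self.actions = np.asarray(actions, dtype=np.float32)
        self.next_states = np.asarray(next_states, dtype=np.float32)
        self.rewards = np.asarray(rewards, dtype=np.float32)
        self.N = len(self.rewards)

    def sample_batch(self, batch_size, rng=np.random):
        # Uniformly sample indices; no new environment interaction ever happens.
        idx = rng.randint(0, self.N, size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.next_states[idx], self.rewards[idx])
```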

The goal is to learn a new policy \pi that maximizes the expected discounted return using only D:

\max_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\Big]

Parameter explanations:

  • \pi: policy we want to learn.
  • r(s,a): reward received for a state-action pair.
  • \gamma: discount factor controlling the weight of future rewards.
  • T: horizon or trajectory length.
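
As a quick numerical illustration of the objective, the discounted return of a single trajectory of rewards can be computed as follows (a hypothetical helper, not from the lecture):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{T} gamma^t * r_t for one trajectory."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Example: three steps of reward 1.0 with gamma = 0.9 gives 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```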

Why Offline RL Is Difficult

Offline RL is fundamentally harder than online RL because:

  • The agent cannot try new actions to fix wrong value estimates.
  • The policy may choose out-of-distribution actions not present in D.
  • Q-value estimates for unseen actions can be arbitrarily incorrect.
  • Bootstrapping on wrong Q-values can cause divergence.

This leads to two major failure modes:

  1. Distribution shift: new policy actions differ from dataset actions.
  2. Extrapolation error: the Q-function guesses values for unseen actions.

Extrapolation Error Problem

In standard Q-learning, the Bellman backup is:

Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')

Parameter explanations:

  • Q(s,a): estimated value of taking action a in state s.
  • \max_{a'}: maximum over possible next actions.
  • a': candidate next action evaluated in the backup step.

If a' was rarely or never taken in the dataset, Q(s', a') is poorly estimated, so Q-learning bootstraps off invalid values, causing instability.
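
A minimal sketch of this naive backup for discrete actions, showing where the out-of-distribution max enters (PyTorch; the network names are illustrative):

```python
import torch
import torch.nn.functional as F

def naive_q_backup_loss(q_net, target_net, s, a, r, s_next, gamma=0.99):
    """Standard Q-learning loss applied directly to offline data.

    The max over a' is taken over ALL actions, including actions the
    behavior policy never took in state s'; those Q-values are pure
    extrapolation and can be arbitrarily wrong.
    """
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_{a'} Q_target(s', a'): the source of extrapolation error offline.
        target = r + gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```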

Behavior Cloning (BC): The Safest Baseline

The simplest offline method is to imitate the behavior policy:

\max_{\phi} \; \mathbb{E}_{(s,a) \sim D}\big[\log \pi_{\phi}(a|s)\big]

Parameter explanations:

  • \phi: neural network parameters of the cloned policy.
  • \pi_{\phi}: learned policy approximating the behavior policy.
  • \log \pi_{\phi}(a|s): log-likelihood of the dataset action; maximizing it is equivalent to minimizing the negative log-likelihood loss.
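
A sketch of one behavior-cloning training loss, assuming a Gaussian policy whose network returns a mean and log-standard-deviation (these interface details are assumptions, not from the lecture):

```python
import torch

def bc_loss(policy_net, s, a):
    """Negative log-likelihood of dataset actions under the learned policy.

    policy_net(s) is assumed to return (mean, log_std) of a Gaussian over
    continuous actions; only dataset (s, a) pairs are ever used.
    """
    mean, log_std = policy_net(s)
    dist = torch.distributions.Normal(mean, log_std.exp())
    # Maximizing log pi(a|s) over D is the same as minimizing this loss.
    return -dist.log_prob(a).sum(dim=-1).mean()
```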

Pros:

  • Does not suffer from extrapolation error.
  • Extremely stable.

Cons:

  • Cannot outperform the behavior policy.
  • Ignores reward information entirely.

Naive Offline Q-Learning Fails

Directly applying off-policy Q-learning to D generally leads to:

  • Overestimation of unseen actions.
  • Divergence due to extrapolation error.
  • Policies worse than behavior cloning.

Strategies for Safe Offline RL

There are two primary families of solutions:

  1. Policy constraint methods
  2. Conservative value estimation methods

1. Policy Constraint Methods

These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.

Advantage Weighted Regression (AWR / AWAC)

Policy update:

\pi(a|s) \propto \pi_{\beta}(a|s)\exp\left(\frac{1}{\lambda}A(s,a)\right)

Parameter explanations:

  • \pi_{\beta}: behavior policy used to collect the dataset.
  • A(s,a): advantage function derived from Q or V estimates.
  • \lambda: temperature controlling the strength of advantage weighting.
  • \exp(\cdot): positive weighting on high-advantage actions.

Properties:

  • Uses advantages to filter good and bad actions.
  • Improves beyond behavior policy while staying safe.
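
A minimal sketch of the corresponding policy loss: weighted behavior cloning, where the weights come from the exponentiated advantages. The Gaussian policy interface and the clipping constant are assumptions for illustration:

```python
import torch

def awr_policy_loss(policy_net, s, a, advantages, lam=1.0, max_weight=20.0):
    """Advantage-weighted regression: behavior cloning with advantage weights.

    Dataset actions with high advantage get exponentially larger weight, so
    the policy improves on the behavior policy while only ever imitating
    actions that actually appear in D.
    """
    mean, log_std = policy_net(s)
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_prob = dist.log_prob(a).sum(dim=-1)
    # exp(A / lambda), clipped for numerical stability (a common practical trick).
    weights = torch.clamp(torch.exp(advantages / lam), max=max_weight)
    return -(weights.detach() * log_prob).mean()
```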

Batch-Constrained Q-learning (BCQ)

BCQ constrains the policy using a generative model:

  1. Train a VAE G_{\omega} to model a given s.
  2. Train a small perturbation model \xi.
  3. Limit the policy to a = G_{\omega}(s) + \xi(s).

Parameter explanations:

  • G_{\omega}(s): VAE-generated action similar to dataset actions.
  • \omega: VAE parameters.
  • \xi(s): small correction to generated actions.
  • a: final policy action constrained near the dataset distribution.

BCQ avoids selecting unseen actions and strongly reduces extrapolation.
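
A sketch of BCQ-style action selection under these assumptions: sample several candidate actions from the VAE decoder, add the small perturbation \xi(s) from the lecture's formulation, and let the Q-network choose only among those in-distribution candidates. The module names and the `latent_dim` attribute are illustrative:

```python
import torch

def bcq_select_action(q_net, vae_decoder, perturb_net, s, num_candidates=10):
    """BCQ-style constrained action selection (sketch)."""
    batch = s.shape[0]
    # Repeat each state so several candidate actions can be evaluated per state.
    s_rep = s.repeat_interleave(num_candidates, dim=0)
    z = torch.randn(s_rep.shape[0], vae_decoder.latent_dim)
    candidates = vae_decoder(s_rep, z)            # a ~ G_omega(s): near-dataset actions
    candidates = candidates + perturb_net(s_rep)  # a = G_omega(s) + xi(s)
    # Pick, for each state, the candidate with the highest Q-value.
    q_vals = q_net(s_rep, candidates).view(batch, num_candidates)
    best = q_vals.argmax(dim=1)
    candidates = candidates.view(batch, num_candidates, -1)
    return candidates[torch.arange(batch), best]
```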

BEAR (Bootstrapping Error Accumulation Reduction)

BEAR adds explicit constraints:

D_{MMD}\left(\pi(a|s), \pi_{\beta}(a|s)\right) < \epsilon

Parameter explanations:

  • D_{MMD}: Maximum Mean Discrepancy distance between action distributions.
  • \epsilon: threshold restricting policy deviation from the behavior policy.

BEAR controls distribution shift more tightly than BCQ.
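
For reference, a minimal sketch of a Gaussian-kernel MMD estimate between action samples; BEAR penalizes the policy whenever this exceeds \epsilon. The bandwidth value is an arbitrary illustrative choice:

```python
import torch

def mmd_squared(x, y, sigma=10.0):
    """Gaussian-kernel estimate of MMD^2 between two sets of action samples.

    x: actions sampled from the learned policy pi(.|s).
    y: dataset (behavior policy) actions at the same states.
    """
    def kernel(a, b):
        # Pairwise squared distances -> Gaussian kernel values.
        d = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(-1)
        return torch.exp(-d / (2 * sigma))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```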

2. Conservative Value Function Methods

These methods modify Q-learning so that Q-values of unseen actions are underestimated, preventing the policy from exploiting overestimated values.

Conservative Q-Learning (CQL)

One formulation is:

J(Q) = J_{TD}(Q) + \alpha\big(\mathbb{E}_{a\sim\pi(\cdot|s)}[Q(s,a)] - \mathbb{E}_{a\sim D}[Q(s,a)]\big)

Parameter explanations:

  • J_{TD}: standard Bellman TD loss.
  • \alpha: weight of the conservatism penalty.
  • \mathbb{E}_{a\sim\pi(\cdot|s)}: expectation over policy-chosen actions.
  • \mathbb{E}_{a\sim D}: expectation over dataset actions.

Effect:

  • Increases Q-values of dataset actions.
  • Decreases Q-values of out-of-distribution actions.
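
A sketch of this formulation for continuous actions, approximating the expectation over \pi with a few sampled actions. The Gaussian policy interface and the `td_loss` argument are assumptions for illustration:

```python
import torch

def cql_loss(q_net, policy_net, s, a_data, td_loss, alpha=1.0, num_samples=10):
    """TD loss plus the conservative penalty from the lecture's formulation.

    Minimizing this pushes Q down on actions the current policy would take
    and pushes Q up on actions that actually appear in the dataset.
    """
    mean, log_std = policy_net(s)
    dist = torch.distributions.Normal(mean, log_std.exp())
    # E_{a ~ pi(.|s)} Q(s, a), approximated with a few policy samples.
    a_pi = dist.sample((num_samples,))                      # [num_samples, B, act_dim]
    s_rep = s.unsqueeze(0).expand(num_samples, *s.shape)
    q_pi = q_net(s_rep.reshape(-1, s.shape[-1]), a_pi.reshape(-1, a_pi.shape[-1]))
    q_pi = q_pi.view(num_samples, -1).mean()
    # E_{a ~ D} Q(s, a): Q evaluated on dataset actions.
    q_data = q_net(s, a_data).mean()
    return td_loss + alpha * (q_pi - q_data)
```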

Implicit Q-Learning (IQL)

IQL avoids constraints entirely by using expectile regression:

Value regression:

V(s) = \arg\min_{v} \; \mathbb{E}\big[\rho_{\tau}(Q(s,a) - v)\big]

Parameter explanations:

  • v: scalar value estimate for state s.
  • \rho_{\tau}(x): expectile regression loss.
  • \tau: expectile parameter controlling conservatism.
  • Q(s,a): Q-value estimate.

Key idea:

  • For \tau < 1, IQL reduces sensitivity to large (possibly incorrect) Q-values.
  • Implicitly conservative without special constraints.

IQL often achieves state-of-the-art performance due to simplicity and stability.
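
A minimal sketch of the expectile loss \rho_{\tau} used in the value regression above (function name and default \tau are illustrative):

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """Expectile regression loss rho_tau for IQL's value update.

    For tau > 0.5 the value estimate leans toward the upper range of Q-values
    observed in the dataset, without ever querying out-of-distribution
    actions; tau < 1 keeps it from chasing the largest (possibly wrong) Q.
    """
    diff = q_values - v_values
    # Asymmetric weights: tau on positive errors, (1 - tau) on negative errors.
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1 - tau))
    return (weight * diff ** 2).mean()
```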

Model-Based Offline RL

Forward Model-Based RL

Train a dynamics model:

p_{\theta}(s'|s,a)

Parameter explanations:

  • p_{\theta}: learned transition model.
  • \theta: parameters of the transition model.

We can generate synthetic transitions using p_{\theta}, but model error accumulates.
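
A minimal sketch of one training step for the dynamics model, here simplified to a deterministic next-state predictor trained with MSE (a probabilistic p_{\theta} would instead output a mean and variance and maximize log-likelihood):

```python
import torch
import torch.nn.functional as F

def dynamics_loss(model, s, a, s_next):
    """Supervised loss for a deterministic forward model s' = f_theta(s, a)."""
    pred_next = model(s, a)
    # Regress the predicted next state onto the next state observed in the dataset.
    return F.mse_loss(pred_next, s_next)
```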

Penalty-Based Model Approaches (MOPO, MOReL)

Add uncertainty penalty:

r_{model}(s,a) = r(s,a) - \beta\, u(s,a)

Parameter explanations:

  • r_{model}: penalized reward for model rollouts.
  • u(s,a): model uncertainty estimate.
  • \beta: penalty coefficient.

These methods limit exploration into unknown model regions.
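
A sketch of the penalized reward, assuming (as is common but not stated in the lecture) that u(s,a) is estimated from the disagreement of an ensemble of dynamics models:

```python
import torch

def penalized_reward(reward, ensemble_preds, beta=1.0):
    """r_model(s, a) = r(s, a) - beta * u(s, a).

    ensemble_preds: next-state predictions from an ensemble of dynamics models,
    shape [ensemble_size, batch, state_dim]. Disagreement between ensemble
    members serves as the uncertainty estimate u(s, a).
    """
    # Standard deviation across ensemble members, averaged over state dimensions.
    u = ensemble_preds.std(dim=0).mean(dim=-1)
    return reward - beta * u
```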

Reverse Model-Based Imagination (ROMI)

ROMI generates new training data by backward imagination.

Reverse Dynamics Model

ROMI learns:

p_{\psi}(s_{t} \mid s_{t+1}, a_{t})

Parameter explanations:

  • \psi: parameters of the reverse dynamics model.
  • s_{t+1}: later state.
  • a_{t}: action taken leading to s_{t+1}.
  • s_{t}: predicted predecessor state.

ROMI also learns a reverse policy for sampling likely predecessor actions.

Reverse Imagination Process

Given a goal state s_{g}:

  1. Sample a_{t} from the reverse policy.
  2. Predict s_{t} from the reverse dynamics model.
  3. Form the imagined transition (s_{t}, a_{t}, s_{t+1}).
  4. Repeat to build longer imagined trajectories, as sketched below.
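
A minimal sketch of this backward rollout loop; `reverse_policy` and `reverse_model` are illustrative names for the learned reverse policy and reverse dynamics model:

```python
def reverse_imagine(reverse_policy, reverse_model, s_goal, horizon=5):
    """Roll backward from a real dataset state to build imagined transitions.

    reverse_policy(s_next) samples a plausible preceding action a_t, and
    reverse_model(s_next, a_t) predicts the predecessor state s_t. Every
    imagined transition ends in a state that connects to real data, which
    keeps the rollout grounded.
    """
    transitions = []
    s_next = s_goal
    for _ in range(horizon):
        a_t = reverse_policy(s_next)       # action likely to have led to s_next
        s_t = reverse_model(s_next, a_t)   # predicted predecessor state
        transitions.append((s_t, a_t, s_next))
        s_next = s_t                       # continue imagining further backward
    # Reverse the list so the trajectory reads forward in time.
    return list(reversed(transitions))
```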

Benefits:

  • Imagined transitions end in real states, ensuring grounding.
  • Completes missing parts of dataset.
  • Helps propagate reward backward reliably.

ROMI combined with conservative RL often outperforms standard offline methods.

Summary of Lecture 22

Offline RL requires balancing:

  • Improvement beyond dataset behavior.
  • Avoiding unsafe extrapolation to unseen actions.

Three major families of solutions:

  1. Policy constraints (BCQ, BEAR, AWR)
  2. Conservative Q-learning (CQL, IQL)
  3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)

Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.
