CSE510 Deep Reinforcement Learning (Lecture 21)
Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
Exploration in RL: Information-Based Exploration (Intrinsic Curiosity)
Computational Curiosity
- “The direct goal of curiosity and boredom is to improve the world model.”
- Curiosity encourages agents to seek experiences that better predict or explain the environment.
- A “curiosity unit” gives reward based on the mismatch between current model predictions and actual outcomes.
- Intrinsic reward is high when the agent’s prediction fails, that is, when it encounters surprising outcomes.
- This yields positive intrinsic reinforcement when the internal predictive model errs, causing the agent to repeat actions that lead to prediction errors.
- The agent is effectively motivated to create situations where its model fails.
Model Prediction Error as Intrinsic Reward
We augment the reward with an intrinsic bonus based on model prediction error (a minimal code sketch follows the key ideas below):

$$r^{+}(s_t, a_t) = r(s_t, a_t) + \mathcal{B}\big(\lVert \hat{f}_\phi(s_t, a_t) - s_{t+1} \rVert\big)$$

Parameter explanations:
- $s_t$: current state of the agent.
- $a_t$: action taken by the agent in state $s_t$.
- $s_{t+1}$: next state resulting from executing action $a_t$ in state $s_t$.
- $r(s_t, a_t)$: extrinsic environment reward for transition $(s_t, a_t, s_{t+1})$.
- $\hat{f}_\phi$: learned dynamics model with parameters $\phi$ that predicts the next state.
- $\phi$: parameter vector of the predictive dynamics model $\hat{f}_\phi$.
- $\lVert \hat{f}_\phi(s_t, a_t) - s_{t+1} \rVert$: prediction error magnitude between predicted next state and actual next state.
- $\mathcal{B}(\cdot)$: function converting prediction error magnitude into an intrinsic reward bonus.
- $r^{+}(s_t, a_t)$: total reward, sum of extrinsic reward and intrinsic curiosity bonus.
Key ideas:
- The agent receives an intrinsic reward when the actual outcome differs from what its world model predicts.
- Initially many transitions are surprising, encouraging broad exploration.
- As the model improves, familiar transitions yield smaller error and smaller intrinsic reward.
- Exploration becomes focused on less-known parts of the state space.
- Intrinsic motivation is non-stationary: as the agent learns, previously novel states lose their intrinsic reward.
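As a concrete illustration, here is a minimal sketch of the prediction-error bonus. It assumes PyTorch, hypothetical state/action dimensions, and a simple linear choice of the bonus function $\mathcal{B}$; none of these specifics are prescribed by the lecture.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2   # hypothetical dimensions

# Learned dynamics model \hat{f}_phi: predicts the next state from (s, a).
dynamics_model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
    nn.Linear(64, STATE_DIM),
)

def intrinsic_bonus(s, a, s_next, scale=0.1):
    """B(||f_phi(s, a) - s_{t+1}||): map prediction-error magnitude to a bonus."""
    with torch.no_grad():
        pred_next = dynamics_model(torch.cat([s, a], dim=-1))
    error = torch.norm(pred_next - s_next, dim=-1)   # ||predicted - actual||
    return scale * error                             # simple linear choice of B

def total_reward(extrinsic_r, s, a, s_next):
    """r^+ = extrinsic reward plus curiosity bonus."""
    return extrinsic_r + intrinsic_bonus(s, a, s_next)
```

In a full training loop the same prediction error would also be minimized to update $\phi$, which is exactly why the bonus shrinks on familiar transitions.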
Avoiding Trivial Curiosity Traps
Naively defining the prediction error $\lVert \hat{f}_\phi(s_t, a_t) - s_{t+1} \rVert$ directly in raw observation space can lead to trivial curiosity traps.
Examples:
- The agent may purposely cause chaotic or noisy observations (like flickering pixels) that are impossible to predict.
- The model cannot reduce prediction error on pure noise, so the agent is rewarded for meaningless randomness.
- This yields high intrinsic reward without meaningful learning or progress toward task goals.
To prevent this, we restrict prediction to a more informative feature space:

$$r^{+}(s_t, a_t) = r(s_t, a_t) + \mathcal{B}\big(\lVert \hat{f}_\phi(g_\theta(s_t), a_t) - g_\theta(s_{t+1}) \rVert\big)$$

Parameter explanations:
- $g_\theta$: learned encoder mapping raw state $s_t$ into a feature vector $g_\theta(s_t)$.
- $\theta$: parameter vector of the encoder $g_\theta$.
- $\hat{f}_\phi(g_\theta(s_t), a_t)$: forward model predicting the next feature representation from the encoded state and action.
- $g_\theta(s_{t+1})$: encoded feature representation of the next state $s_{t+1}$.
- $\mathcal{B}\big(\lVert \hat{f}_\phi(g_\theta(s_t), a_t) - g_\theta(s_{t+1}) \rVert\big)$: intrinsic reward based on prediction error in feature space.
Key ideas:
- The encoder is trained so that features capture aspects of the state that are controllable by the agent.
- One approach is to train $g_\theta$ via an inverse dynamics model that predicts $a_t$ from $(g_\theta(s_t), g_\theta(s_{t+1}))$.
- This encourages $g_\theta$ to keep only information necessary to infer actions, discarding irrelevant noise.
- Measuring prediction error in feature space ignores unpredictable environmental noise.
- Intrinsic reward focuses on errors due to lack of knowledge about controllable dynamics.
- The agent’s curiosity is directed toward aspects of the environment it can influence and learn.
A practical implementation is the Intrinsic Curiosity Module (ICM) by Pathak et al. (2017):
- The encoder and forward model are trained jointly.
- The loss includes both forward prediction error and inverse dynamics error.
- Intrinsic reward is set to the forward prediction error in feature space.
- This drives exploration of states where the agent cannot yet predict the effect of its actions.
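Below is a minimal ICM-style sketch, not the authors' reference implementation: it assumes PyTorch, discrete actions, and hypothetical layer sizes, and shows how the forward-model error in feature space becomes the intrinsic reward while the inverse-dynamics loss shapes the encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, FEAT_DIM = 8, 4, 32   # hypothetical sizes

encoder = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, FEAT_DIM))
forward_model = nn.Sequential(nn.Linear(FEAT_DIM + N_ACTIONS, 64), nn.ReLU(), nn.Linear(64, FEAT_DIM))
inverse_model = nn.Sequential(nn.Linear(2 * FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

def icm_losses_and_bonus(s, a, s_next):
    """Return (forward loss, inverse loss, intrinsic reward) for a batch."""
    a_onehot = F.one_hot(a, N_ACTIONS).float()
    phi, phi_next = encoder(s), encoder(s_next)

    # Forward model in feature space; features are detached so the forward loss
    # does not push the encoder toward trivial (constant) representations.
    phi_next_pred = forward_model(torch.cat([phi.detach(), a_onehot], dim=-1))
    forward_err = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)

    # Inverse model: predict the action from (phi, phi_next); this loss is what
    # shapes the encoder to keep agent-controllable information.
    action_logits = inverse_model(torch.cat([phi, phi_next], dim=-1))
    inverse_loss = F.cross_entropy(action_logits, a)

    intrinsic_reward = forward_err.detach()   # curiosity bonus added to extrinsic reward
    return forward_err.mean(), inverse_loss, intrinsic_reward
```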
Random Network Distillation (RND)
Random Network Distillation (RND) provides a simpler curiosity bonus without learning a dynamics model.
Basic idea:
- Use a fixed random neural network $f$ that maps states to feature vectors.
- Train a predictor network $\hat{f}_\psi$ to approximate $f$ on visited states.
- The intrinsic reward is the prediction error between $\hat{f}_\psi(s)$ and $f(s)$.
Typical form of the intrinsic reward:

$$r^{i}(s) = \lVert \hat{f}_\psi(s) - f(s) \rVert^{2}$$

Parameter explanations:
- $f$: fixed random neural network generating target features for each state.
- $\hat{f}_\psi$: trainable predictor network with parameters $\psi$.
- $\psi$: parameter vector for the predictor network.
- $s$: state input to both networks.
- $\lVert \hat{f}_\psi(s) - f(s) \rVert^{2}$: squared error between predictor and target features.
- $r^{i}(s)$: intrinsic reward based on prediction error in random feature space.
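A minimal RND sketch follows, assuming PyTorch and hypothetical network sizes; the published method also normalizes observations and intrinsic rewards, which is omitted here.

```python
import torch
import torch.nn as nn

STATE_DIM, FEAT_DIM = 8, 32   # hypothetical sizes

# Fixed random target network f: random init, never trained.
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, FEAT_DIM))
for p in target_net.parameters():
    p.requires_grad_(False)

# Trainable predictor \hat{f}_psi that learns to match f on visited states.
predictor_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, FEAT_DIM))
optimizer = torch.optim.Adam(predictor_net.parameters(), lr=1e-4)

def rnd_bonus_and_update(s):
    """Intrinsic reward r^i(s) = ||f_psi(s) - f(s)||^2; also trains the predictor."""
    target = target_net(s)                       # fixed random features
    pred = predictor_net(s)
    error = (pred - target).pow(2).sum(dim=-1)   # per-state squared error
    optimizer.zero_grad()
    error.mean().backward()                      # predictor learns to match f
    optimizer.step()
    return error.detach()                        # bonus decays as states recur
```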
Key properties:
- For novel or rarely visited states, $\hat{f}_\psi$ has not yet learned to match $f$, so the error is high.
- For frequently visited states, prediction error becomes small, and intrinsic reward decays.
- The target network is random and fixed, so it does not adapt to the policy.
- This provides a stable novelty signal without explicit dynamics learning.
- RND achieves strong exploration performance in challenging environments, such as hard-exploration Atari games.
Efficacy of Curiosity-Driven Exploration
Empirical observations:
- Curiosity-driven intrinsic rewards often lead to significantly higher extrinsic returns in sparse-reward environments compared to agents trained only on extrinsic rewards.
- Intrinsic rewards act as a proxy objective that guides the agent toward interesting or informative regions of the state space.
- In some experiments, agents trained with only intrinsic rewards (no extrinsic reward during training) still learn behaviors that later achieve high task scores when extrinsic rewards are measured.
- Using random features for curiosity (as in RND) can perform nearly as well as using learned features in many domains.
- Simple surprise signals are often sufficient to drive effective exploration.
- Learned feature spaces may generalize better to truly novel scenarios but are not always necessary.
Historical context:
- The concept of learning from intrinsic rewards alone is not new.
- Itti and Baldi (2005) studied “Bayesian surprise” as a driver of human attention.
- Schmidhuber (1991, 2010) formalized curiosity, creativity, and fun as intrinsic motivations in learning agents.
- Singh et al. (2004) proposed intrinsically motivated reinforcement learning frameworks.
- These early works laid the conceptual foundation for modern curiosity-driven deep RL methods.
For further reading on intrinsic curiosity methods:
- Pathak et al., “Curiosity-driven Exploration by Self-supervised Prediction”, 2017.
- Burda et al., “Exploration by Random Network Distillation”, 2018.
- Schmidhuber, “Formal Theory of Creativity, Fun, and Intrinsic Motivation”, 2010.
Exploration via Posterior Sampling
While optimistic and curiosity bonus methods modify the reward function, posterior sampling approaches handle exploration by maintaining uncertainty over models or value functions and sampling from this uncertainty.
These methods are rooted in Thompson Sampling and naturally balance exploration and exploitation.
Posterior Sampling in Multi-Armed Bandits (Thompson Sampling)
In a multi-armed bandit problem (no state transitions), Thompson Sampling works as follows:
- Maintain a prior and posterior distribution over the reward parameters for each arm.
- At each time step, sample reward parameters for all arms from their current posterior.
- Select the arm with the highest sampled mean reward.
- Observe the reward, update the posterior, and repeat.
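For concreteness, here is a minimal sketch of Thompson Sampling for Bernoulli-reward bandits with Beta(1, 1) priors; numpy and the conjugate Beta-Bernoulli model are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def thompson_sampling(pull_arm, n_arms, n_steps, seed=0):
    """pull_arm(k) should return a 0/1 reward for arm k."""
    rng = np.random.default_rng(seed)
    successes = np.ones(n_arms)   # Beta posterior alpha (starts at prior value 1)
    failures = np.ones(n_arms)    # Beta posterior beta  (starts at prior value 1)
    for _ in range(n_steps):
        sampled_means = rng.beta(successes, failures)  # one posterior draw per arm
        k = int(np.argmax(sampled_means))              # act greedily w.r.t. the sample
        r = pull_arm(k)                                # observe 0/1 reward
        successes[k] += r                              # conjugate posterior update
        failures[k] += 1 - r
    return successes, failures

# Example: three arms whose true success probabilities are unknown to the agent.
env_rng = np.random.default_rng(1)
probs = [0.2, 0.5, 0.8]
thompson_sampling(lambda k: int(env_rng.random() < probs[k]), n_arms=3, n_steps=1000)
```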
Intuition:
- Each action is selected with probability equal to the posterior probability that it is optimal.
- Arms with high uncertainty are more likely to be sampled as optimal in some posterior draws.
- Exploration arises naturally from uncertainty, without explicit epsilon-greedy noise or bonus terms.
- Over time, the posterior concentrates on the true reward means, and the algorithm shifts toward exploitation.
Theoretical properties:
- Thompson Sampling attains near-optimal regret bounds in many bandit settings.
- It often performs as well as or better than upper confidence bound algorithms in practice.
Posterior Sampling for Reinforcement Learning (PSRL)
In reinforcement learning with states and transitions, posterior sampling generalizes to sampling entire MDP models.
Posterior Sampling for Reinforcement Learning (PSRL) operates as follows:
- Maintain a posterior distribution over environment dynamics and rewards, based on observed transitions.
- At the beginning of an episode, sample an MDP model from this posterior.
- Compute the optimal policy for the sampled MDP (for example, by value iteration).
- Execute this policy in the real environment for the whole episode.
- Use the observed transitions to update the posterior, then repeat.
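A tabular PSRL sketch under illustrative assumptions (numpy, Dirichlet posteriors over transition distributions, Beta posteriors over Bernoulli rewards) might look like this:

```python
import numpy as np

def sample_mdp(trans_counts, rew_alpha, rew_beta, rng):
    """Draw one MDP (P, R) from the current posterior."""
    S, A, _ = trans_counts.shape
    P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)] for s in range(S)])
    R = rng.beta(rew_alpha, rew_beta)        # sampled mean reward for each (s, a)
    return P, R

def value_iteration(P, R, gamma=0.95, n_iters=200):
    """Optimal policy of the sampled MDP."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * P @ V                # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                  # greedy policy of the sampled MDP

# Initialize trans_counts = np.ones((S, A, S)) and rew_alpha = rew_beta = np.ones((S, A)).
# After each observed transition (s, a, r, s_next):
#   trans_counts[s, a, s_next] += 1;  rew_alpha[s, a] += r;  rew_beta[s, a] += 1 - r
# At the start of each episode: sample an MDP, run value_iteration,
# then execute the resulting policy for the whole episode.
```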
Key advantages:
- The agent commits to a sampled model’s policy for an extended duration, which induces deep exploration.
- If a sampled model is optimistic in unexplored regions, the corresponding policy will deliberately visit those regions.
- Exploration is coherent across time within an episode, unlike per-step randomization in epsilon-greedy.
- The method does not require ad hoc exploration bonuses; exploration is an emergent property of the posterior.
Challenges:
- Maintaining an exact posterior over high-dimensional MDPs is usually intractable.
- Practical implementations use approximations.
Approximate Posterior Sampling with Ensembles (Bootstrapped DQN)
A common approximate posterior method in deep RL is Bootstrapped DQN.
Basic idea:
- Train an ensemble of $K$ Q-networks (heads), $Q_1, \dots, Q_K$.
- Each head is trained on a different bootstrap sample or masked subset of experience.
- At the start of each episode, sample a head index $k$ uniformly from $\{1, \dots, K\}$.
- For the entire episode, act greedily with respect to $Q_k$.
Parameter definitions for the ensemble:
- $K$: number of Q-network heads in the ensemble.
- $Q_k(s, a)$: Q-value estimate for head $k$ at state-action pair $(s, a)$.
- $k$: index of the sampled head used for the current episode.
- $(s, a)$: state and action arguments to Q-value functions.
Implementation details:
- A shared feature backbone network processes state inputs, feeding into all heads.
- Each head has its own final layers, allowing diverse value estimates.
- Masking or bootstrapping assigns different subsets of transitions to different heads during training.
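A minimal sketch of such an architecture (assuming PyTorch and hypothetical sizes; the bootstrap-mask bookkeeping is only indicated in comments):

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, K = 8, 4, 10   # hypothetical sizes; K heads in the ensemble

class BootstrappedQNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU())  # shared features
        self.heads = nn.ModuleList([nn.Linear(128, N_ACTIONS) for _ in range(K)])

    def forward(self, s, head=None):
        h = self.backbone(s)
        if head is None:                                               # all heads (training)
            return torch.stack([q(h) for q in self.heads], dim=1)      # (batch, K, A)
        return self.heads[head](h)                                     # single head (acting)

qnet = BootstrappedQNet()

# At the start of each episode: sample one head and act greedily with it all episode.
head = int(torch.randint(K, (1,)))

def act(s):
    with torch.no_grad():
        return int(qnet(s.unsqueeze(0), head=head).argmax(dim=-1))

# During training, a bootstrap mask m ~ Bernoulli(p) per (transition, head)
# decides which heads receive each transition's TD loss.
```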
Benefits:
- Each head approximates a different plausible Q-function, analogous to a sample from a posterior.
- When a head is optimistic about certain under-explored actions, its greedy policy will explore them deeply.
- Exploration behavior is temporally consistent within an episode.
- No modification of the reward function is required; exploration arises from policy randomization via multiple heads.
Comparison to epsilon-greedy:
- Epsilon-greedy adds per-step random actions, which can be inefficient for long-horizon exploration.
- Bootstrapped DQN commits to a strategy for an episode, enabling the agent to execute complete exploratory plans.
- This can dramatically increase the probability of discovering long sequences needed to reach sparse rewards.
Other approximate posterior approaches:
- Bayesian neural networks for Q-functions (explicit parameter distributions).
- Using Monte Carlo dropout at inference to sample Q-functions.
- Randomized prior functions added to Q-networks to maintain exploration.
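As one example of the last item, a randomized prior function can be sketched as follows (assuming PyTorch; the additive prior scale beta and layer sizes are hypothetical choices):

```python
import torch.nn as nn

def make_member(state_dim=8, n_actions=4, beta=3.0):
    """One ensemble member: trainable network plus a fixed random prior."""
    prior = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    for p in prior.parameters():
        p.requires_grad_(False)          # the prior stays fixed forever
    trainable = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def q_values(s):
        return trainable(s) + beta * prior(s)   # fixed prior keeps members diverse
    return q_values, trainable                  # optimize only the trainable part
```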
Theoretical insights:
- Posterior sampling methods can enjoy strong regret bounds in some RL settings.
- They can have better asymptotic constants than optimism-based methods in certain problems.
- Coherent, temporally extended exploration is essential in environments with delayed rewards and complex goals.
For further reading:
- Osband et al., “Deep Exploration via Bootstrapped DQN”, 2016.
- Osband and Van Roy, “Why Is Posterior Sampling Better Than Optimism for Reinforcement Learning?”, 2017.
- Chapelle and Li, “An Empirical Evaluation of Thompson Sampling”, 2011.