CSE510 Deep Reinforcement Learning (Lecture 20)
Exploration in RL
Motivations
Exploration vs. Exploitation Dilemma
Online decision-making involves a fundamental choice:
- Exploration: trying out new things (new behaviors), with the hope of discovering higher rewards
- Exploitation: doing what you know will yield the highest reward
The best long-term strategy may involve short-term sacrifices
Gather enough knowledge early to make the best long-term decisions
Examples
Restaurant Selection
- Exploitation: Go to your favorite restaurant
- Exploration: Try a new restaurant
Oil Drilling
- Exploitation: Drill at the best known location
- Exploration: Drill at a new location
Game Playing
- Exploitation: Play the move you believe is best
- Exploration: Play an experimental move
Breakout vs. Montezuma’s Revenge
| Property | Breakout | Montezuma’s Revenge |
|---|---|---|
| Reward frequency | Dense (every brick hit gives points) | Extremely sparse (only after collecting key or treasure) |
| State space | Simple (ball, paddle, bricks) | Complex (many rooms, objects, ladders, timing) |
| Action relevance | Almost any action affects reward soon | Most actions have no immediate feedback |
| Exploration depth | Shallow (few steps to reward) | Deep (dozens/hundreds of steps before reward) |
| Determinism | Mostly deterministic dynamics | Deterministic but requires long sequences of precise actions |
| Credit assignment | Easy — short time gap | Very hard — long delay from cause to effect |
Motivation
- Motivation: “Forces” that energize an organism to act and that direct its activity
- Extrinsic Motivation: being motivated to do something because of some external reward ($, a prize, food, water, etc.)
- Intrinsic Motivation: being motivated to do something because it is inherently enjoyable (curiosity, exploration, novelty, surprise, incongruity, complexity…)
Intuitive Exploration Strategy
- Intrinsic motivation drives the exploration of the unknown.
- Intuitively, we explore efficiently once we know what we do not know, and can target our exploration efforts at the unknown part of the space.
- All non-naive exploration methods rely on some form of uncertainty estimation: over which states (or state-action pairs) have been visited, over the transition dynamics, or over the Q-functions.
- Methods that are provably optimal in small settings do not work directly in large settings, but they can inspire practical methods there.
- Practical methods may rely on heuristics ("hacks").
Classes of Exploration Methods in Deep RL
- Optimistic exploration
  - Uncertainty about states
  - Visiting novel states (state visitation counting)
- Information state search
  - Uncertainty about state transitions or dynamics
  - Dynamics prediction error or information gain for dynamics learning
- Posterior sampling
  - Uncertainty about Q-value functions or policies
  - Selecting actions according to the probability that they are best
Optimistic Exploration
Count-Based Exploration in Small MDPs
Book-keep state visitation counts $N(s)$ and add exploration reward bonuses that encourage policies to visit states with fewer counts:
$$r^{+}(s, a) = r(s, a) + \mathcal{B}(N(s)),$$
where $\mathcal{B}(N(s))$ is the intrinsic exploration reward bonus, chosen to decrease as $N(s)$ grows. Common choices (a minimal code sketch follows the list below):
- UCB: $\mathcal{B}(N(s)) = \sqrt{\frac{2 \ln n}{N(s)}}$ (more aggressive exploration), with $n$ the total number of visits so far
- MBIE-EB (Strehl & Littman): $\mathcal{B}(N(s)) = \sqrt{\frac{1}{N(s)}}$
- BEB (Kolter & Ng): $\mathcal{B}(N(s)) = \frac{1}{N(s)}$
- We want a bonus that rewards states we have not visited often.
- But in large MDPs, we rarely visit the same state twice!
- We need to capture a notion of state similarity, and reward states that are most dissimilar to what we have seen so far, as opposed to merely different (as new states will always be different).
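To make the bookkeeping concrete, here is a minimal sketch of count-based bonuses in a tabular setting. This is illustrative code rather than the lecture's reference implementation; the function names and the `beta` scaling coefficient are assumptions.

```python
import math
from collections import defaultdict

# Illustrative sketch: tabular count-based exploration bonuses.
# States are assumed to be hashable (e.g., tuples); `beta` scales the bonus.

visit_counts = defaultdict(int)   # N(s): per-state visitation counts
total_visits = 0                  # n: total number of visits so far

def exploration_bonus(n_s: int, n: int, kind: str = "mbie_eb", beta: float = 0.1) -> float:
    """Intrinsic bonus B(N(s)) that shrinks as a state is visited more often."""
    if kind == "ucb":       # sqrt(2 ln n / N(s)): more aggressive exploration
        return beta * math.sqrt(2.0 * math.log(max(n, 2)) / n_s)
    if kind == "mbie_eb":   # sqrt(1 / N(s))
        return beta * math.sqrt(1.0 / n_s)
    if kind == "beb":       # 1 / N(s)
        return beta / n_s
    raise ValueError(f"unknown bonus kind: {kind}")

def augmented_reward(state, reward: float, kind: str = "mbie_eb") -> float:
    """Return r+(s, a) = r(s, a) + B(N(s)); call once per observed transition."""
    global total_visits
    visit_counts[state] += 1
    total_visits += 1
    return reward + exploration_bonus(visit_counts[state], total_visits, kind)
```

The bonus is simply added to the environment reward before the transition is used by any standard RL algorithm (e.g., Q-learning on `augmented_reward(s, r)`).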
Fitting Generative Models
Idea: fit a density model $p_\theta(s)$ (or $p_\theta(s, a)$).
$p_\theta(s)$ might be high even for a new $s$.
If $s$ is similar to previously seen states, can we use $p_\theta(s)$ to get a "pseudo-count" for $s$?
In a small MDP, the true visitation probability is
$$P(s) = \frac{N(s)}{n},$$
where $N(s)$ is the number of times $s$ has been visited and $n$ is the total number of states visited. After we visit $s$ once more, it becomes
$$P'(s) = \frac{N(s) + 1}{n + 1}.$$
Pseudo-count algorithm:
1. Fit a density model $p_\theta(s)$ to all states seen so far.
2. Take a step $i$ and observe $s_i$.
3. Fit a new model $p_{\theta'}(s)$ to all states seen so far, including $s_i$.
4. Use $p_\theta(s_i)$ and $p_{\theta'}(s_i)$ to estimate the "pseudo-count" $\hat{N}(s_i)$.
5. Set $r_i^{+} = r_i + \mathcal{B}(\hat{N}(s_i))$.
6. Go to 1.
How do we get $\hat{N}(s_i)$ and $\hat{n}$? Use the two equations
$$p_\theta(s_i) = \frac{\hat{N}(s_i)}{\hat{n}}, \qquad p_{\theta'}(s_i) = \frac{\hat{N}(s_i) + 1}{\hat{n} + 1},$$
two equations in two unknowns, with the closed-form solution
$$\hat{N}(s_i) = \hat{n}\, p_\theta(s_i), \qquad \hat{n} = \frac{1 - p_{\theta'}(s_i)}{p_{\theta'}(s_i) - p_\theta(s_i)}.$$
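The closed-form solution above maps directly to code. Below is a minimal sketch under the assumption of a hypothetical `density_model` object exposing `log_prob(state)` and `update(state)`; the bonus form and the `beta` coefficient are illustrative choices, not prescribed by the lecture.

```python
import math

def pseudo_count(p_before: float, p_after: float) -> float:
    """Solve p_before = N/n and p_after = (N+1)/(n+1) for the pseudo-count N."""
    n_hat = (1.0 - p_after) / max(p_after - p_before, 1e-8)  # hat{n}
    return max(n_hat * p_before, 0.0)                        # hat{N} = hat{n} * p_theta(s)

def intrinsic_bonus(state, density_model, beta: float = 0.05) -> float:
    """Density-based exploration bonus B(hat{N}(s)) for one observed state."""
    p_before = math.exp(density_model.log_prob(state))  # p_theta(s_i) before the update
    density_model.update(state)                         # refit/update the model on s_i
    p_after = math.exp(density_model.log_prob(state))   # p_theta'(s_i) after the update
    n_hat = pseudo_count(p_before, p_after)
    return beta / math.sqrt(n_hat + 1e-8)               # e.g., an MBIE-EB-style bonus
```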
Density models
State Counting with Deep Hashing
- We still count states (images), but in a compressed latent space rather than in pixel space.
- Compress each state into a latent code, then count occurrences of the code.
- How do we get the image encoding? E.g., using autoencoders.
- There is no guarantee that such a reconstruction loss will capture the important features that make two states similar.
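As an illustration, here is a minimal sketch of hash-based state counting that uses a fixed random projection to produce binary codes (in the spirit of SimHash-style counting); the code length, `beta` coefficient, and class interface are assumptions, and the random projection could be swapped for a learned encoder such as an autoencoder bottleneck.

```python
import numpy as np
from collections import defaultdict

class HashCounter:
    """Illustrative sketch: count states via binary codes from a random projection."""

    def __init__(self, obs_dim: int, code_bits: int = 32, beta: float = 0.1, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((code_bits, obs_dim))  # fixed random projection
        self.counts = defaultdict(int)                      # N(phi(s)): counts per code
        self.beta = beta

    def _code(self, obs: np.ndarray) -> bytes:
        # Binary latent code: sign pattern of the projected (flattened) observation.
        bits = (self.A @ obs.ravel().astype(np.float64)) > 0
        return np.packbits(bits).tobytes()

    def bonus(self, obs: np.ndarray) -> float:
        """Exploration bonus beta / sqrt(N(phi(s))), counting this observation."""
        code = self._code(obs)
        self.counts[code] += 1
        return float(self.beta / np.sqrt(self.counts[code]))
```

In use, each environment reward would be augmented as `r + counter.bonus(obs)` before being passed to the RL algorithm.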