CSE510 Deep Reinforcement Learning (Lecture 5)
Passive Reinforcement Learning
New twist: don’t know T or R
- i.e. we don’t know which states are good or what the actions do
- Must actually try actions and states out to learn
Passive learning and active learning
Passive Learning
- The agent has a fixed policy and tries to learn the utilities of states by observing the world go by
- Analogous to policy evaluation
- Often serves as a component of active learning algorithms
- Often inspires active learning algorithms
Active Learning
- The agent attempts to find an optimal (or at least good) policy by acting in the world
- Analogous to solving the underlying MDP, but without first being given the MDP model
Model-based vs. Model-free RL
Model-based RL
- Learn the MDP model, or an approximation of it
- Use it for policy evaluation or to find the optimal policy
Example as a human: learning to navigate by exploring the environment
Model-free RL
- Derive the optimal policy without explicitly learning the model
- Useful when model is difficult to represent and/or learn
Example as a human: learning to walk or talk through practice and feedback (without needing to know the laws of physics).
Small vs. Huge MDPs
We will first cover RL methods for small MDPs
- MDPs where the number of states and actions is reasonably small
- These algorithms will inspire more advanced methods
Later we will cover algorithms for huge MDPs
- Function Approximation Methods
- Policy Gradient Methods
- Actor-Critic Methods
Problem settings
Suppose we are given a stationary policy π and want to determine how good it is.
We want to estimate the value function V^π(s), but we are not given:
- the transition model T(s, a, s′)
- the reward function R(s)
Monte Carlo direct estimation (model-free)
Also called Direct Estimation.
def monte_carlo_direct_estimation(policy, env, num_episodes, gamma=1.0):
    # Accumulated reward-to-go and visit counts per state
    returns_sum, visits = {}, {}
    for _ in range(num_episodes):
        state, done = env.reset(), False
        trajectory = []  # (state, reward) pairs along this episode
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state
        # Work backwards to compute the (discounted) reward-to-go of each visit
        G = 0.0
        for s, r in reversed(trajectory):
            G = r + gamma * G
            returns_sum[s] = returns_sum.get(s, 0.0) + G
            visits[s] = visits.get(s, 0) + 1
    # V(s) = average reward-to-go over all visits to s
    return {s: returns_sum[s] / visits[s] for s in returns_sum}
Estimate V^π(s) by the average total reward (reward-to-go) of the episodes containing state s.
The reward-to-go of a state s is the sum of the (discounted) rewards from that state until a terminal state is reached.
Drawback:
- Need large number of episodes to get accurate estimate
- Does not exploit Bellman constraints on policy values.
Adaptive Dynamic Programming (ADP) (model-based)
- Follow the policy for a while
- Estimate transition model based on observations
- Learn reward function
- Use estimated model to compute utilities of policy
def adaptive_dynamic_programming(policy, env, num_episodes, gamma=0.9, eval_sweeps=100):
    counts, rewards = {}, {}  # transition counts under the policy, observed rewards R(s)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            counts.setdefault(state, {})
            counts[state][next_state] = counts[state].get(next_state, 0) + 1
            rewards[state] = reward  # assumes rewards depend only on the state, as in R(s)
            state = next_state
    # Policy evaluation on the estimated model: V(s) = R(s) + gamma * sum_s' T_hat(s'|s) V(s')
    V = {s: 0.0 for s in counts}
    for _ in range(eval_sweeps):
        for s in counts:
            total = sum(counts[s].values())
            V[s] = rewards[s] + gamma * sum(
                (n / total) * V.get(s2, 0.0) for s2, n in counts[s].items())
    return V
Drawback:
- Still requires a full DP policy evaluation on the estimated model whenever utilities are recomputed, which is expensive.
Temporal difference learning (model-free)
- Do local updates of utility/value function on a per-action basis
- Don’t try to estimate the entire transition function
- For each observed transition from s to s′, update:
  V^π(s) ← V^π(s) + α [ R(s) + γ V^π(s′) − V^π(s) ]
Here α is the learning rate and γ is the discount factor.
def temporal_difference_learning(policy, env, num_episodes, alpha=0.1, gamma=0.9):
    V = {}  # value estimates; unseen states default to 0
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # Move V(state) toward the one-step target: reward + gamma * V(next_state)
            td_target = reward + gamma * V.get(next_state, 0.0)
            V[state] = V.get(state, 0.0) + alpha * (td_target - V.get(state, 0.0))
            state = next_state
    return V
Drawback:
- requires more training experience (epochs) than ADP but much less computation per epoch
- Choice depends on relative cost of experience vs. computation
Online Mean Estimation algorithm
Suppose we want to incrementally compute the mean of a stream of numbers.
Given a new sample x_{n+1}, the new mean is the old estimate (based on the first n samples) plus a weighted difference between the new sample and the old estimate:
μ_{n+1} = μ_n + (1 / (n + 1)) · (x_{n+1} − μ_n)
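A minimal sketch of this update (the function name running_mean is just illustrative). Note that the TD rule above has exactly the same form, with a fixed learning rate α playing the role of 1/(n+1):

def running_mean(samples):
    # mu_{n+1} = mu_n + (1 / (n + 1)) * (x_{n+1} - mu_n)
    mu, n = 0.0, 0
    for x in samples:
        n += 1
        mu += (x - mu) / n
    return mu

running_mean([2.0, 4.0, 6.0])  # returns 4.0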
Summary of passive RL
Monte-Carlo Direct Estimation (model-free)
- Simple to implement
- Each update is fast
- Does not exploit Bellman constraints
- Converges slowly
Adaptive Dynamic Programming (model-based)
- Harder to implement
- Each update is a full policy evaluation (expensive)
- Fully exploits Bellman constraints
- Fast convergence (in terms of updates)
Temporal Difference Learning (model-free)
- Update speed and implementation similar to direct estimation
- Partially exploits Bellman constraints: adjusts a state's value to ‘agree’ with its observed successor
  - Not all possible successors, as in ADP
- Convergence speed in between direct estimation and ADP
Between ADP and TD
- Moving TD toward ADP
- At each step, perform TD updates based on both the observed transition and “imagined” transitions (model-based); see the sketch after this list
- Imagined transitions are generated using the estimated model
- The more imagined transitions used, the more like ADP
- Making estimate more consistent with next state distribution
- Converges in the limit of infinite imagined transitions to ADP
- Trade-off computational and experience efficiency
- More imagined transitions require more time per step, but fewer steps of actual experience
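A minimal sketch of this idea in the spirit of Dyna, assuming the same env/policy interface as the earlier code; the names model, num_imagined, and td_update are illustrative, not from the lecture:

import random

def td_with_imagined_transitions(policy, env, num_episodes,
                                 num_imagined=5, alpha=0.1, gamma=0.9):
    V = {}      # value estimates; unseen states default to 0
    model = {}  # model[s] = list of observed (reward, next_state) samples

    def td_update(s, r, s2):
        V[s] = V.get(s, 0.0) + alpha * (r + gamma * V.get(s2, 0.0) - V.get(s, 0.0))

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            model.setdefault(state, []).append((reward, next_state))
            td_update(state, reward, next_state)  # update from real experience
            # Extra TD updates on transitions sampled ("imagined") from the estimated
            # model; the more of these per step, the closer the behavior is to ADP
            for _ in range(num_imagined):
                s = random.choice(list(model))
                r, s2 = random.choice(model[s])
                td_update(s, r, s2)
            state = next_state
    return V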
Active Reinforcement Learning
Naive Model-Based Approach
1. Act randomly for a (long) time
   - Or systematically explore all possible actions
2. Learn
   - Transition function
   - Reward function
3. Use value iteration, policy iteration, …
4. Follow the resulting policy
This will work if step 1 runs long enough and the environment has no dead ends that the exploration phase can get trapped in.
Drawback:
- Long time to converge
Revision of Naive Approach
1. Start with an initial (uninformed) model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Execute an action suggested by the policy in the current state
4. Update the estimated model based on the observed transition
5. Go to step 2
This is just like ADP, but we follow the greedy policy suggested by the current value estimate.
Will this work?
No. It can get stuck in a local minimum: by always exploiting its current (possibly inaccurate) model, the agent may never visit the states that would reveal something better.
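A minimal sketch of this greedy loop, assuming a gym-style env, a finite actions list, and the lecture's R(s) reward convention; the helpers plan and q_value and the count dictionaries are illustrative assumptions, not part of the lecture:

import random

def greedy_model_based_agent(env, actions, num_steps, gamma=0.9, vi_sweeps=30):
    counts = {}   # counts[(s, a)][s2] = number of observed s --a--> s2 transitions
    rewards = {}  # rewards[s] = observed reward in s (lecture's R(s) convention)

    def q_value(s, a, V):
        sa = counts.get((s, a))
        if not sa:                      # untried action: default value of 0
            return 0.0
        total = sum(sa.values())
        return sum(n / total * V.get(s2, 0.0) for s2, n in sa.items())

    def plan():
        # Value iteration on the current estimated model (visited states only)
        states = {s for (s, a) in counts}
        V = {s: 0.0 for s in states}
        for _ in range(vi_sweeps):
            for s in states:
                V[s] = rewards[s] + gamma * max(q_value(s, a, V) for a in actions)
        return V

    state = env.reset()
    for _ in range(num_steps):
        V = plan()
        # Greedy step: always take the action that looks best under the estimated model
        if counts:
            action = max(actions, key=lambda a: q_value(state, a, V))
        else:
            action = random.choice(actions)
        next_state, reward, done = env.step(action)
        counts.setdefault((state, action), {})
        counts[(state, action)][next_state] = counts[(state, action)].get(next_state, 0) + 1
        rewards[state] = reward
        state = env.reset() if done else next_state
    return plan()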
Exploration vs. Exploitation
Two reasons to take an action in RL:
- Exploitation: To try to get reward. We exploit our current knowledge to get a payoff.
- Exploration: Get more information about the world. How do we know if there is not a pot of gold around the corner?
- To explore we typically need to take actions that do not seem best according to our current model
- Managing the trade-off between exploration and exploitation is a critical issue in RL
- Basic intuition behind most approaches:
  - Explore more when knowledge is weak
  - Exploit more as we gain knowledge
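One simple way to encode this intuition (a standard technique, not taken from this slide; the function name, Q table, and visit counts are illustrative assumptions) is ε-greedy action selection with an ε that decays as a state is visited more often:

import random

def epsilon_greedy_action(state, actions, Q, visits, min_epsilon=0.05):
    # Explore a lot when the state is poorly known, exploit more as knowledge grows
    n = visits.get(state, 0)
    epsilon = max(min_epsilon, 1.0 / (n + 1))
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit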