CSE510 Deep Reinforcement Learning (Lecture 5)

Passive Reinforcement Learning

New twist: don’t know T or R

  • i.e. we don’t know which states are good or what the actions do
  • Must actually try actions and states out to learn

Passive learning and active learning

Passive Learning

  • The agent has a fixed policy and tries to learn the utilities of states by observing the world go by
  • Analogous to policy evaluation
  • Often serves as a component of active learning algorithms
  • Often inspires active learning algorithms

Active Learning

  • The agent attempts to find an optimal (or at least good) policy by acting in the world
  • Analogous to solving the underlying MDP, but without first being given the MDP model

Model-based vs. Model-free RL

Model-based RL

  • Learn the MDP model, or an approximation of it
  • Use it for policy evaluation or to find the optimal policy

Example (as a human): learning to navigate by exploring the environment.

Model-free RL

  • Derive the optimal policy without explicitly learning the model
  • Useful when model is difficult to represent and/or learn

Example (as a human): learning to walk or talk through practice and reflection (no need to know the laws of physics).

Small vs. Huge MDPs

We will first cover RL methods for small MDPs

  • MDPs where the number of states and actions is reasonably small
  • These algorithms will inspire more advanced methods

Later we will cover algorithms for huge MDPs

  • Function Approximation Methods
  • Policy Gradient Methods
  • Actor-Critic Methods

Problem settings

Suppose we are given a stationary policy and want to determine how good it is.

We want to estimate V^\pi(s), but we are not given:

  • the transition model T
  • the reward function R

Monte Carlo direct estimation (model-free)

Also called Direct Estimation.

from collections import defaultdict

def monte_carlo_direct_estimation(policy, env, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)   # total reward-to-go observed for each state
    returns_count = defaultdict(int)   # number of visits to each state
    for _ in range(num_episodes):
        # Roll out one episode under the fixed policy.
        trajectory = []
        state = env.reset()
        done = False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state
        # Accumulate the (discounted) reward-to-go for every visited state.
        G = 0.0
        for s, r in reversed(trajectory):
            G = r + gamma * G
            returns_sum[s] += G
            returns_count[s] += 1
    # Estimate V^pi(s) as the average reward-to-go over all visits to s.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

Estimate V^\pi(s) as the average reward-to-go observed from state s over the episodes that visit s.

The reward-to-go of a state s is the sum of the (discounted) rewards received from that state until a terminal state is reached.

Drawback:

  • Needs a large number of episodes to get accurate estimates
  • Does not exploit the Bellman constraints on policy values

Adaptive Dynamic Programming (ADP) (model-based)

  • Follow the policy for a while
  • Estimate the transition model based on observations
  • Learn reward function
  • Use estimated model to compute utilities of policy
V^\pi(s) = R(s) + \gamma \sum_{s'\in S} T(s,\pi(s),s')\, V^\pi(s')
from collections import defaultdict

def adaptive_dynamic_programming(policy, env, num_episodes, gamma=0.9, eval_sweeps=100):
    trans_counts = defaultdict(lambda: defaultdict(int))  # s -> {s': count} under pi(s)
    reward_sum, visit_count = defaultdict(float), defaultdict(int)
    # Follow the policy and record observed transitions and rewards.
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy.act(state))
            trans_counts[state][next_state] += 1
            reward_sum[state] += reward
            visit_count[state] += 1
            state = next_state
    # Estimated model: R(s) and T(s, pi(s), s'); then iterative policy evaluation.
    R = {s: reward_sum[s] / visit_count[s] for s in visit_count}
    V = defaultdict(float)
    for _ in range(eval_sweeps):
        for s, counts in trans_counts.items():
            n = sum(counts.values())
            V[s] = R[s] + gamma * sum(c / n * V[s2] for s2, c in counts.items())
    return dict(V)

Drawback:

  • Still requires a full DP policy evaluation (expensive) whenever the estimated model is updated

Temporal difference learning (model-free)

  • Do local updates of utility/value function on a per-action basis
  • Don’t try to estimate the entire transition function
  • For each transition from s to s', update: V^\pi(s) \gets V^\pi(s) + \alpha \left( R(s) + \gamma V^\pi(s') - V^\pi(s) \right)

Here \alpha is the learning rate and \gamma is the discount factor.

from collections import defaultdict

def temporal_difference_learning(policy, env, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)  # value estimates default to 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # TD(0) update toward the one-step bootstrapped target.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return dict(V)

Drawback:

  • Requires more training experience (epochs) than ADP, but much less computation per epoch
  • Choice depends on relative cost of experience vs. computation

Online Mean Estimation algorithm

Suppose we want to incrementally compute the mean of a stream of numbers.

(x_1, x_2, \ldots)

Given a new sample x_{n+1}, the new mean is the old estimate (for n samples) plus a weighted difference between the new sample and the old estimate.

\begin{aligned}
\hat{X}_{n+1} &= \frac{1}{n+1} \sum_{i=1}^{n+1} x_i \\
&= \frac{1}{n+1} \left( x_{n+1} + \sum_{i=1}^{n} x_i \right) \\
&= \frac{1}{n+1} \left( x_{n+1} + n \hat{X}_n \right) \\
&= \hat{X}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{X}_n \right)
\end{aligned}
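
As a quick illustration, here is a minimal Python sketch of this incremental update (the function name online_mean is ours, not from the lecture). The TD update above has the same form, with the fixed learning rate \alpha playing the role of 1/(n+1).

def online_mean(stream):
    # Incrementally compute the mean of a stream of numbers.
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        # new mean = old mean + (1/n) * (new sample - old mean)
        mean += (x - mean) / n
    return mean

For example, online_mean([1, 2, 3, 4]) returns 2.5, matching the batch mean.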

Summary of passive RL

Monte-Carlo Direct Estimation (model-free)

  • Simple to implement
  • Each update is fast
  • Does not exploit Bellman constraints
  • Converges slowly

Adaptive Dynamic Programming (model-based)

  • Harder to implement
  • Each update is a full policy evaluation (expensive)
  • Fully exploits Bellman constraints
  • Fast convergence (in terms of updates)

Temporal Difference Learning (model-free)

  • Update speed and implementation similar to direct estimation
  • Partially exploits Bellman constraints: adjusts a state’s value to ‘agree’ with its observed successor
    • Not all possible successors as in ADP
  • Convergence speed in between direct estimation and ADP

Between ADP and TD

  • Moving TD toward ADP
    • At each step, perform TD updates based on the observed transition and on “imagined” transitions generated from a model estimated from earlier trajectories (this makes the method model-based); see the sketch after this list
    • Imagined transitions are generated using the estimated model
  • The more imagined transitions used, the more ADP-like the method becomes
    • Makes the estimates more consistent with the next-state distribution
    • In the limit of infinitely many imagined transitions, this converges to ADP
  • Trade-off between computational and experience efficiency
    • More imagined transitions require more computation per step, but fewer steps of actual experience
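
Below is a minimal sketch of this idea, in the spirit of Dyna-style planning. The function name td_with_imagined_transitions, the num_imagined parameter, and the sample-based model are illustrative assumptions, not from the lecture: after each real TD update, the agent performs a few extra TD updates on transitions replayed from its estimated (empirical) model.

import random
from collections import defaultdict

def td_with_imagined_transitions(policy, env, num_episodes,
                                 alpha=0.1, gamma=0.9, num_imagined=5):
    V = defaultdict(float)
    model = {}  # state -> list of observed (reward, next_state) samples under the policy
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # Real TD update and model update.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            model.setdefault(state, []).append((reward, next_state))
            # Imagined TD updates on transitions sampled from the estimated model.
            for _ in range(num_imagined):
                s = random.choice(list(model.keys()))
                r, s2 = random.choice(model[s])
                V[s] += alpha * (r + gamma * V[s2] - V[s])
            state = next_state
    return dict(V)

Setting num_imagined = 0 recovers plain TD; increasing it moves the method toward ADP at the cost of more computation per step.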

Active Reinforcement Learning

Naive Model-Based Approach

  1. Act randomly for a (long) time
    • Or systematically explore all possible actions
  2. Learn
    • Transition function
    • Reward function
  3. Use value iteration, policy iteration, …
  4. Follow resulting policy

This will work if step 1 runs long enough and the environment contains no dead ends that trap the agent during exploration.

Drawback:

  • Long time to converge

Revision of Naive Approach

  1. Start with an initial (uninformed) model
  2. Solve for the optimal policy given the current model (using value or policy iteration)
  3. Execute an action suggested by the policy in the current state
  4. Update the estimated model based on the observed transition
  5. Goto 2

This is just like ADP, except that we follow the greedy policy suggested by the current value estimate.
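
Below is a minimal sketch of this loop under the same env/policy-style interface assumed in the earlier snippets. The function name greedy_model_based_rl, the explicit actions list, and the fixed number of value-iteration sweeps are illustrative assumptions, not part of the lecture: the agent keeps an empirical model (transition counts and average per-state rewards), re-solves for values after each observed transition, and always acts greedily on its current estimates.

import random
from collections import defaultdict

def greedy_model_based_rl(env, actions, num_steps, gamma=0.9, vi_sweeps=50):
    trans = defaultdict(lambda: defaultdict(int))      # (s, a) -> {s': count}
    reward_sum, visits = defaultdict(float), defaultdict(int)
    V = defaultdict(float)
    state = env.reset()
    for _ in range(num_steps):
        # Step 3: execute the greedy action w.r.t. the current model and values
        # (state-action pairs with no data fall back to a random action).
        best_q, action = float('-inf'), random.choice(actions)
        for a in actions:
            counts = trans[(state, a)]
            n = sum(counts.values())
            if n > 0:
                q = sum(c / n * V[s2] for s2, c in counts.items())
                if q > best_q:
                    best_q, action = q, a
        next_state, reward, done = env.step(action)
        # Step 4: update the estimated model from the observed transition.
        trans[(state, action)][next_state] += 1
        reward_sum[state] += reward
        visits[state] += 1
        # Step 2: re-solve for values on the estimated model (value-iteration sweeps).
        for _ in range(vi_sweeps):
            for s in visits:
                q_vals = []
                for a in actions:
                    counts = trans[(s, a)]
                    n = sum(counts.values())
                    if n > 0:
                        q_vals.append(sum(c / n * V[s2] for s2, c in counts.items()))
                V[s] = reward_sum[s] / visits[s] + gamma * (max(q_vals) if q_vals else 0.0)
        state = env.reset() if done else next_state
    return dict(V)

Because the action choice is purely greedy, the agent can keep exploiting a region that looks good under its current model and never gather the experience needed to discover a better one, which is exactly the failure mode discussed next.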

Will this work?

No. It can get stuck with a suboptimal policy: by always acting greedily with respect to the current model, the agent may never explore the states it would need to visit to correct that model.

Exploration vs. Exploitation

Two reasons to take an action in RL:

  • Exploitation: To try to get reward. We exploit our current knowledge to get a payoff.

  • Exploration: Get more information about the world. How do we know if there is not a pot of gold around the corner?

  • To explore we typically need to take actions that do not seem best according to our current model

  • Managing the trade-off between exploration and exploitation is a critical issue in RL

  • Basic intuition behind most approaches (see the sketch after this list):

    • Explore more when knowledge is weak
    • Exploit more as we gain knowledge
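
A common way to manage this trade-off is an ε-greedy action rule with a decaying ε: explore with probability ε, which shrinks as experience accumulates, and otherwise exploit the greedy action. Below is a minimal sketch; the q_values table, the actions list, and the particular decay schedule are illustrative assumptions, not from the lecture.

import random

def epsilon_greedy_action(q_values, state, actions, step,
                          eps_start=1.0, eps_end=0.05, decay=0.001):
    # Explore more when knowledge is weak (early steps), exploit more later.
    epsilon = eps_end + (eps_start - eps_end) / (1.0 + decay * step)
    if random.random() < epsilon:
        return random.choice(actions)   # explore: random action
    # exploit: greedy action under the current value estimates
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

Here ε decays from eps_start toward eps_end as the step count grows; any schedule that starts high and decreases toward a small floor captures the same intuition.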