CSE510 Deep Reinforcement Learning (Lecture 5)

Passive Reinforcement Learning

New twist: don’t know T or R

  • i.e. we don’t know which states are good or what the actions do
  • Must actually try actions and states out to learn

Passive learning and active learning

Passive Learning

  • The agent has a fixed policy and tries to learn the utilities of states by observing the world go by
  • Analogous to policy evaluation
  • Often serves as a component of active learning algorithms
  • Often inspires active learning algorithms

Active Learning

  • The agent attempts to find an optimal (or at least good) policy by acting in the world
  • Analogous to solving the underlying MDP, but without first being given the MDP model

Model-based vs. Model-free RL

Model-based RL

  • Learn the MDP model, or an approximation of it
  • Use it for policy evaluation or to find the optimal policy

Example (as a human): learning to navigate by exploring the environment.

Model-free RL

  • Derive the optimal policy without explicitly learning the model
  • Useful when model is difficult to represent and/or learn

Example (as a human): learning to walk or talk through practice and reflection (no need to know the laws of physics).

Small vs. Huge MDPs

We will first cover RL methods for small MDPs

  • MDPs where the number of states and actions is reasonably small
  • These algorithms will inspire more advanced methods

Later we will cover algorithms for huge MDPs

  • Function Approximation Methods
  • Policy Gradient Methods
  • Actor-Critic Methods

Problem settings

Suppose we are given a stationary policy and want to determine how good it is.

We want to estimate V^\pi(s), but we are not given:

  • the transition model T
  • the reward function R

Monte Carlo direct estimation (model-free)

Also called Direct Estimation.

from collections import defaultdict

def monte_carlo_direct_estimation(policy, env, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)   # total reward-to-go observed for each state
    returns_count = defaultdict(int)   # number of visits to each state
    for _ in range(num_episodes):
        # Roll out one episode under the fixed policy.
        trajectory = []
        state = env.reset()
        done = False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state
        # Accumulate the (discounted) reward-to-go for every visited state.
        G = 0.0
        for s, r in reversed(trajectory):
            G = r + gamma * G
            returns_sum[s] += G
            returns_count[s] += 1
    # Estimate V^pi(s) as the average reward-to-go over all visits to s.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

Estimate V^\pi(s) as the average reward-to-go observed from state s over the episodes that visit s.

The reward-to-go of a state s is the sum of the (discounted) rewards received from that state until a terminal state is reached.

Drawback:

  • Needs a large number of episodes to get accurate estimates
  • Does not exploit the Bellman constraints on policy values

Adaptive Dynamic Programming (ADP) (model-based)

  • Follow the policy for a while
  • Estimate the transition model based on observations
  • Learn reward function
  • Use estimated model to compute utilities of policy
V^\pi(s) = R(s) + \gamma \sum_{s'\in S} T(s,\pi(s),s')\, V^\pi(s')
from collections import defaultdict

def adaptive_dynamic_programming(policy, env, num_episodes, gamma=0.9, eval_sweeps=100):
    trans_counts = defaultdict(lambda: defaultdict(int))  # s -> {s': count} under pi(s)
    reward_sum, visit_count = defaultdict(float), defaultdict(int)
    # Follow the policy and record observed transitions and rewards.
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy.act(state))
            trans_counts[state][next_state] += 1
            reward_sum[state] += reward
            visit_count[state] += 1
            state = next_state
    # Estimated model: R(s) and T(s, pi(s), s'); then iterative policy evaluation.
    R = {s: reward_sum[s] / visit_count[s] for s in visit_count}
    V = defaultdict(float)
    for _ in range(eval_sweeps):
        for s, counts in trans_counts.items():
            n = sum(counts.values())
            V[s] = R[s] + gamma * sum(c / n * V[s2] for s2, c in counts.items())
    return dict(V)

Drawback:

  • Still requires a full DP policy evaluation (expensive) whenever the estimated model is updated

Temporal difference learning (model-free)

  • Do local updates of utility/value function on a per-action basis
  • Don’t try to estimate the entire transition function
  • For each transition from s to s', update: V^\pi(s) \gets V^\pi(s) + \alpha \left( R(s) + \gamma V^\pi(s') - V^\pi(s) \right)

Here \alpha is the learning rate and \gamma is the discount factor.

from collections import defaultdict

def temporal_difference_learning(policy, env, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)  # value estimates default to 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # TD(0) update toward the one-step bootstrapped target.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return dict(V)

Drawback:

  • Requires more training experience (epochs) than ADP, but much less computation per epoch
  • Choice depends on relative cost of experience vs. computation

Online Mean Estimation algorithm

Suppose we want to incrementally compute the mean of a stream of numbers.

(x_1, x_2, \ldots)

Given a new sample x_{n+1}, the new mean is the old estimate (for n samples) plus a weighted difference between the new sample and the old estimate.

\begin{aligned}
\hat{X}_{n+1} &= \frac{1}{n+1} \sum_{i=1}^{n+1} x_i \\
&= \frac{1}{n+1} \left( x_{n+1} + \sum_{i=1}^{n} x_i \right) \\
&= \frac{1}{n+1} \left( x_{n+1} + n \hat{X}_n \right) \\
&= \hat{X}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{X}_n \right)
\end{aligned}
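
As a quick illustration, here is a minimal Python sketch of this incremental update (the function name online_mean is ours, not from the lecture). The TD update above has the same form, with the fixed learning rate \alpha playing the role of 1/(n+1).

def online_mean(stream):
    # Incrementally compute the mean of a stream of numbers.
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        # new mean = old mean + (1/n) * (new sample - old mean)
        mean += (x - mean) / n
    return mean

For example, online_mean([1, 2, 3, 4]) returns 2.5, matching the batch mean.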

Summary of passive RL

Monte-Carlo Direct Estimation (model-free)

  • Simple to implement
  • Each update is fast
  • Does not exploit Bellman constraints
  • Converges slowly

Adaptive Dynamic Programming (model-based)

  • Harder to implement
  • Each update is a full policy evaluation (expensive)
  • Fully exploits Bellman constraints
  • Fast convergence (in terms of updates)

Temporal Difference Learning (model-free)

  • Update speed and implementation similar to direct estimation
  • Partially exploits Bellman constraints: adjusts a state’s value to ‘agree’ with its observed successor
    • Not all possible successors as in ADP
  • Convergence speed in between direct estimation and ADP

Between ADP and TD

  • Moving TD toward ADP
    • At each step, perform TD updates based on the observed transition and on “imagined” transitions generated from a model estimated from earlier trajectories (this makes the method model-based); see the sketch after this list
    • Imagined transitions are generated using the estimated model
  • The more imagined transitions used, the more ADP-like the method becomes
    • Makes the estimates more consistent with the next-state distribution
    • In the limit of infinitely many imagined transitions, this converges to ADP
  • Trade-off between computational and experience efficiency
    • More imagined transitions require more computation per step, but fewer steps of actual experience
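
Below is a minimal sketch of this idea, in the spirit of Dyna-style planning. The function name td_with_imagined_transitions, the num_imagined parameter, and the sample-based model are illustrative assumptions, not from the lecture: after each real TD update, the agent performs a few extra TD updates on transitions replayed from its estimated (empirical) model.

import random
from collections import defaultdict

def td_with_imagined_transitions(policy, env, num_episodes,
                                 alpha=0.1, gamma=0.9, num_imagined=5):
    V = defaultdict(float)
    model = {}  # state -> list of observed (reward, next_state) samples under the policy
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # Real TD update and model update.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            model.setdefault(state, []).append((reward, next_state))
            # Imagined TD updates on transitions sampled from the estimated model.
            for _ in range(num_imagined):
                s = random.choice(list(model.keys()))
                r, s2 = random.choice(model[s])
                V[s] += alpha * (r + gamma * V[s2] - V[s])
            state = next_state
    return dict(V)

Setting num_imagined = 0 recovers plain TD; increasing it moves the method toward ADP at the cost of more computation per step.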

Active Reinforcement Learning

Naive Model-Based Approach

  1. Act randomly for a (long) time
    • Or systematically explore all possible actions
  2. Learn
    • Transition function
    • Reward function
  3. Use value iteration, policy iteration, …
  4. Follow resulting policy

This will work if step 1 runs long enough and the environment contains no dead ends that trap the agent during exploration.

Drawback:

  • Long time to converge

Revision of Naive Approach

  1. Start with an initial (uninformed) model
  2. Solve for the optimal policy given the current model (using value or policy iteration)
  3. Execute an action suggested by the policy in the current state
  4. Update the estimated model based on the observed transition
  5. Goto 2

This is just like ADP, except that we follow the greedy policy suggested by the current value estimate.
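
Below is a minimal sketch of this loop under the same env/policy-style interface assumed in the earlier snippets. The function name greedy_model_based_rl, the explicit actions list, and the fixed number of value-iteration sweeps are illustrative assumptions, not part of the lecture: the agent keeps an empirical model (transition counts and average per-state rewards), re-solves for values after each observed transition, and always acts greedily on its current estimates.

import random
from collections import defaultdict

def greedy_model_based_rl(env, actions, num_steps, gamma=0.9, vi_sweeps=50):
    trans = defaultdict(lambda: defaultdict(int))      # (s, a) -> {s': count}
    reward_sum, visits = defaultdict(float), defaultdict(int)
    V = defaultdict(float)
    state = env.reset()
    for _ in range(num_steps):
        # Step 3: execute the greedy action w.r.t. the current model and values
        # (state-action pairs with no data fall back to a random action).
        best_q, action = float('-inf'), random.choice(actions)
        for a in actions:
            counts = trans[(state, a)]
            n = sum(counts.values())
            if n > 0:
                q = sum(c / n * V[s2] for s2, c in counts.items())
                if q > best_q:
                    best_q, action = q, a
        next_state, reward, done = env.step(action)
        # Step 4: update the estimated model from the observed transition.
        trans[(state, action)][next_state] += 1
        reward_sum[state] += reward
        visits[state] += 1
        # Step 2: re-solve for values on the estimated model (value-iteration sweeps).
        for _ in range(vi_sweeps):
            for s in visits:
                q_vals = []
                for a in actions:
                    counts = trans[(s, a)]
                    n = sum(counts.values())
                    if n > 0:
                        q_vals.append(sum(c / n * V[s2] for s2, c in counts.items()))
                V[s] = reward_sum[s] / visits[s] + gamma * (max(q_vals) if q_vals else 0.0)
        state = env.reset() if done else next_state
    return dict(V)

Because the action choice is purely greedy, the agent can keep exploiting a region that looks good under its current model and never gather the experience needed to discover a better one, which is exactly the failure mode discussed next.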

Will this work?

No. It can get stuck with a suboptimal policy: by always acting greedily with respect to the current model, the agent may never explore the states it would need to visit to correct that model.

Exploration vs. Exploitation

Two reasons to take an action in RL:

  • Exploitation: To try to get reward. We exploit our current knowledge to get a payoff.

  • Exploration: Get more information about the world. How do we know if there is not a pot of gold around the corner?

  • To explore we typically need to take actions that do not seem best according to our current model

  • Managing the trade-off between exploration and exploitation is a critical issue in RL

  • Basic intuition behind most approaches (see the sketch after this list):

    • Explore more when knowledge is weak
    • Exploit more as we gain knowledge
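
A common way to manage this trade-off is an ε-greedy action rule with a decaying ε: explore with probability ε, which shrinks as experience accumulates, and otherwise exploit the greedy action. Below is a minimal sketch; the q_values table, the actions list, and the particular decay schedule are illustrative assumptions, not from the lecture.

import random

def epsilon_greedy_action(q_values, state, actions, step,
                          eps_start=1.0, eps_end=0.05, decay=0.001):
    # Explore more when knowledge is weak (early steps), exploit more later.
    epsilon = eps_end + (eps_start - eps_end) / (1.0 + decay * step)
    if random.random() < epsilon:
        return random.choice(actions)   # explore: random action
    # exploit: greedy action under the current value estimates
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

Here ε decays from eps_start toward eps_end as the step count grows; any schedule that starts high and decreases toward a small floor captures the same intuition.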