
CSE510 Deep Reinforcement Learning (Lecture 17)

Why Model-Based RL?

  • Sample efficiency
  • Generalization and transferability
  • Support efficient exploration in large-scale RL problems
  • Explainability
  • Super-human performance in practice
    • Video games, Go, Algorithm discovery, etc.
Note

A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition $T(s' \mid s, a)$ and the reward $R(s, a)$.
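
As a purely illustrative picture of that definition, a tabular model for a small discrete MDP can simply store these two quantities (the sizes and uniform initialization below are made up):

```python
import numpy as np

# Illustrative tabular model for a small discrete MDP (sizes are arbitrary).
n_states, n_actions = 5, 2
T = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # T[s, a, s'] = T(s' | s, a)
R = np.zeros((n_states, n_actions))                           # R[s, a]    = R(s, a)
rng = np.random.default_rng(0)

def predict(s, a):
    """Use the model to predict the environment's response to taking action a in state s."""
    s_next = rng.choice(n_states, p=T[s, a])
    return s_next, R[s, a]
```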

For ADP-based (model-based) RL (a code sketch follows these steps):

  1. Start with initial model
  2. Solve for optimal policy given current model
    • (using value or policy iteration)
  3. Take action according to an exploration/exploitation policy
    • Explores more early on and gradually uses policy from 2
  4. Update estimated model based on observed transition
  5. Goto 2
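
A minimal tabular sketch of this loop, assuming a small discrete environment with a simplified `reset()`/`step()` interface; the counts-based model estimate, the value-iteration planner, and the decaying epsilon schedule are illustrative choices, not the lecture's code:

```python
import numpy as np

def adp_loop(env, n_states, n_actions, episodes=200, gamma=0.95, eps=0.3):
    counts = np.zeros((n_states, n_actions, n_states))    # transition counts N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))          # summed rewards for estimating R(s, a)

    def plan(n_sweeps=100):
        # Step 2: value iteration on the current estimated model.
        visits = np.maximum(counts.sum(axis=2), 1)        # avoid division by zero
        T_hat = counts / visits[:, :, None]               # estimated T(s' | s, a)
        R_hat = reward_sum / visits                       # estimated R(s, a)
        V = np.zeros(n_states)
        for _ in range(n_sweeps):
            Q = R_hat + gamma * T_hat @ V                 # Q(s,a) = R_hat + gamma * E[V(s')]
            V = Q.max(axis=1)
        return Q

    for ep in range(episodes):
        Q = plan()
        s, done = env.reset(), False
        while not done:
            # Step 3: epsilon-greedy exploration, exploring more in early episodes.
            explore = np.random.rand() < eps * (1 - ep / episodes)
            a = np.random.randint(n_actions) if explore else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)                 # simplified step() interface
            counts[s, a, s_next] += 1                     # Step 4: update the estimated model
            reward_sum[s, a] += r
            s = s_next
        # Step 5: go back to step 2 (replan with the updated model).
    return Q
```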

Problems in Large Scale Model-Based RL

  • New planning methods given a model
    • Model is large and not perfect
  • Model learning
    • Requiring generalization
  • Exploration/exploitation strategy
    • Requiring generalization and attention

Large Scale Model-Based RL

  • New optimal planning methods (Today)
    • Model is large and not perfect
  • Model learning (Next Lecture)
    • Requiring generalization
  • Exploration/exploitation strategy (Next week)
    • Requiring generalization and attention

Model-based RL

Deterministic Environment: Cross-Entropy Method

Stochastic Optimization

Abstracting away the details, optimal control/planning can be written as a generic optimization problem:

$$a_1, \ldots, a_T = \arg\max_{a_1, \ldots, a_T} J(a_1, \ldots, a_T), \qquad A = \arg\max_{A} J(A)$$

Simplest method (guess and check): the “random shooting method”, sketched below:

  • Pick $A_1, A_2, \ldots, A_n$ from some distribution (e.g., uniform)
  • Choose $A_i$ based on $\arg\max_i J(A_i)$
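
A sketch of random shooting, assuming an objective `J` that scores a whole open-loop action sequence (e.g., by rolling it out in the learned model); the bounds and candidate count are arbitrary illustrative values:

```python
import numpy as np

def random_shooting(J, horizon, action_dim, n_candidates=1000, low=-1.0, high=1.0):
    # Pick A_1, ..., A_n from a uniform distribution over action sequences.
    candidates = np.random.uniform(low, high, size=(n_candidates, horizon, action_dim))
    # Evaluate J(A_i) for every candidate and return the argmax.
    scores = np.array([J(A) for A in candidates])
    return candidates[np.argmax(scores)]
```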

Cross-Entropy Method (CEM) with continuous-valued inputs

Cross-entropy method with continuous-valued inputs (a code sketch follows the steps):

  1. Sample $A_1, A_2, \ldots, A_n$ from some distribution $p(A)$
  2. Evaluate $J(A_1), J(A_2), \ldots, J(A_n)$
  3. Pick the elites $A_{i_1}, A_{i_2}, \ldots, A_{i_m}$ with the highest $J(A_i)$, where $m < n$
  4. Update the distribution $p(A)$ to be more likely to sample the elites, then repeat from step 1
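
A minimal sketch of these four steps, assuming a diagonal Gaussian for $p(A)$ over flattened action sequences and the same kind of objective `J` as above; all hyperparameters are illustrative:

```python
import numpy as np

def cross_entropy_method(J, horizon, action_dim, n_samples=500, n_elites=50,
                         n_iters=10, init_std=1.0):
    dim = horizon * action_dim
    mean, std = np.zeros(dim), np.full(dim, init_std)     # p(A) = N(mean, diag(std^2))
    for _ in range(n_iters):
        # 1. Sample A_1, ..., A_n from p(A).
        samples = mean + std * np.random.randn(n_samples, dim)
        # 2. Evaluate J(A_i) for each sample.
        scores = np.array([J(A.reshape(horizon, action_dim)) for A in samples])
        # 3. Pick the m < n elites with the highest scores.
        elites = samples[np.argsort(scores)[-n_elites:]]
        # 4. Refit p(A) to the elites so they become more likely, then repeat.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean.reshape(horizon, action_dim)
```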

Pros:

  • Very fast to run if parallelized
  • Extremely simple to implement

Cons:

  • Very harsh dimensionality limit
  • Only open-loop planning
  • Suboptimal in stochastic environments

Discrete Case: Monte Carlo Tree Search (MCTS)

Discrete planning as a search problem

Closed-loop planning:

  • At each state, iteratively build a search tree to evaluate actions, select the best-first action, and then move to the next state.

Use the model as a simulator to evaluate actions.

MCTS Algorithm Overview

  1. Selection: Starting at the root, descend the search tree with the tree policy until reaching a node that is not fully expanded
  2. Expansion: Add a new child node to the search tree
  3. Simulation: From the new node, simulate the rest of the episode with the default (rollout) policy
  4. Backpropagation: Update the values and visit counts of the nodes along the path back to the root

Policies in MCTS

Tree policy:

  • Select/create leaf node
  • Selection and Expansion
  • Bandit problem!

Default policy / rollout policy:

  • Play the game until the end
  • Simulation

Decision policy:

  • Selecting the final action

Upper Confidence Bound on Trees (UCT)

Selecting a Child Node - a Multi-Armed Bandit Problem

UCB1 is applied for each child selection:

$$UCT = \overline{X_j} + 2C_p\sqrt{\frac{2\ln n}{n_j}}$$
  • where $\overline{X_j}$ is the mean reward of selecting this position
    • normalized to $[0, 1]$
  • $n$ is the number of times the current (parent) node has been visited
  • $n_j$ is the number of times child node $j$ has been visited
    • unvisited children get an infinite UCT value, so each child is guaranteed to be explored at least once
  • $C_p$ is some constant $> 0$

Each child has non-zero probability of being selected

We can adjust $C_p$ to change the exploration vs. exploitation trade-off
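
Read as code, the selection rule might look like the helper below; the `value`/`visits` bookkeeping fields and the default $C_p$ are assumptions for illustration:

```python
import math

def uct_score(child, parent_visits, c_p=1.0):
    # Unvisited children get an infinite score, so each child is tried at least once.
    if child.visits == 0:
        return float("inf")
    mean_reward = child.value / child.visits      # X_bar_j, assumed to lie in [0, 1]
    bonus = 2 * c_p * math.sqrt(2 * math.log(parent_visits) / child.visits)
    return mean_reward + bonus                    # larger c_p means more exploration
```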

Decision Policy: Final Action Selection

Selecting the best child

  • Max (highest weight)
  • Robust (most visits)
  • Max-Robust (max of the two)
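
Putting the pieces together (UCT as the tree policy, a random rollout as the default policy, and the robust child as the decision policy), a compact single-agent MCTS sketch might look as follows; the `game` interface with `legal_actions`, `step`, `is_terminal`, and a terminal `reward` in $[0, 1]$ is an assumed stand-in for the learned model used as a simulator:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None, actions=()):
        self.state, self.parent, self.action = state, parent, action
        self.untried = list(actions)              # actions not yet expanded from this node
        self.children = []
        self.visits, self.value = 0, 0.0

    def uct(self, c_p=1.0):
        if self.visits == 0:
            return float("inf")                   # try every child at least once
        return (self.value / self.visits +
                2 * c_p * math.sqrt(2 * math.log(self.parent.visits) / self.visits))

def mcts(game, root_state, n_iterations=1000):
    root = Node(root_state, actions=game.legal_actions(root_state))
    for _ in range(n_iterations):
        node = root
        # Selection: the tree policy descends through fully expanded nodes via UCT.
        while not node.untried and node.children:
            node = max(node.children, key=lambda c: c.uct())
        # Expansion: add one new child for an untried action.
        if node.untried and not game.is_terminal(node.state):
            a = node.untried.pop()
            s = game.step(node.state, a)
            child = Node(s, parent=node, action=a, actions=game.legal_actions(s))
            node.children.append(child)
            node = child
        # Simulation: the default (random rollout) policy plays the game until the end.
        state = node.state
        while not game.is_terminal(state):
            state = game.step(state, random.choice(game.legal_actions(state)))
        reward = game.reward(state)
        # Backpropagation: update visit counts and values along the path back to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Decision policy: "robust" child (most visits).
    return max(root.children, key=lambda c: c.visits).action
```

Switching the final `max` key from `c.visits` to `c.value / c.visits` gives the max (highest value) decision policy instead of the robust one.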

Advantages and disadvantages of MCTS

Advantages:

  • Proven to converge to the minimax solution
  • Domain-independent
  • Anytime algorithm
  • Performs well even with a large branching factor

Disadvantages:

  • The basic version converges very slowly
  • Can miss low-probability but critical moves, leading to small-probability failures

Example usage of MCTS

AlphaGo vs Lee Sedol, Game 4

  • White's move 78 (Lee): an unexpected move that even other professional players didn't see coming, a needle in the haystack
  • AlphaGo failed to explore this move in its MCTS

Imitation learning from MCTS: train a fast reactive policy to imitate the actions that MCTS selects, using the search results as supervision.
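
As a hedged sketch of that idea, assuming visited states and their MCTS visit counts have been logged during search, a small PyTorch policy network can be trained with a cross-entropy loss against the normalized visit counts (the architecture, dimensions, and loss choice below are illustrative assumptions, not the lecture's setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative policy network: 64-dimensional state features, 10 discrete actions.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def imitation_step(states, visit_counts):
    """states: (B, 64) float tensor; visit_counts: (B, 10) tensor of MCTS visit counts."""
    targets = visit_counts / visit_counts.sum(dim=1, keepdim=True)   # pi_MCTS(a | s)
    logits = policy(states)
    # Cross-entropy between the network's policy and the MCTS visit distribution.
    loss = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```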

Continuous Case: Trajectory Optimization

Linear Quadratic Regulator (LQR)

Non-linear iterative LQR (iLQR) / Differential Dynamic Programming (DDP)
