
CSE510 Deep Reinforcement Learning (Lecture 4)

Markov Decision Process (MDP) Part II

Recall from last lecture

A finite MDP is defined by:

  • A finite set of states $s \in S$
  • A finite set of actions $a \in A$
  • A transition function $T(s, a, s')$
    • The probability that taking action $a$ in state $s$ leads to $s'$, i.e., $P(s' \mid s, a)$
    • Also called the model or the dynamics
  • A reward function $R(s)$ (sometimes $R(s,a)$ or $R(s,a,s')$)
  • A start state
  • Maybe a terminal state

A model for sequential decision-making problems under uncertainty

Optimal Policy and Bellman Optimality Equation

The goal for an MDP is to compute or learn an optimal policy.

  • An optimal policy is one that achieves the highest value at any state
$$\pi^* = \arg\max_\pi V^\pi(s)$$
  • We define the optimal value function using the Bellman Optimality Equation
$$V^*(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^*(s')$$
  • The optimal policy is (a small code sketch of this backup follows below)
$$\pi^*(s) = \arg\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^*(s')$$
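To make these two equations concrete, here is a minimal sketch (not from the lecture) of a single Bellman optimality backup and the induced greedy action choice, using a made-up two-state, two-action MDP stored as arrays `P[a, s, s']` and `R[s]`:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers made up for illustration).
# P[a, s, s2] = P(s2 | s, a); R[s] = reward received in state s.
P = np.array([[[0.9, 0.1],    # action 0 from state 0
               [0.2, 0.8]],   # action 0 from state 1
              [[0.5, 0.5],    # action 1 from state 0
               [0.0, 1.0]]])  # action 1 from state 1
R = np.array([0.0, 1.0])
gamma = 0.9

V = np.array([0.5, 2.0])      # some current estimate of V*
Q = R + gamma * (P @ V)       # Q[a, s] = R(s) + gamma * sum_{s'} P(s'|s,a) V(s')
V_backup = Q.max(axis=0)      # right-hand side of the Bellman Optimality Equation
policy = Q.argmax(axis=0)     # greedy action at each state, as in the pi* formula
print(V_backup, policy)
```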

The Existence of the Optimal Policy

Theorem: For any Markov Decision Process

  • There exists an optimal policy
  • There can be many optimal policies, but all optimal policies achieve the same optimal value function
  • There is always a deterministic optimal policy for any MDP

Solving MDPs

Value Iteration

Repeatedly update an estimate of the optimal value function according to the Bellman Optimality Equation (a generic sketch follows the two steps below).

  1. Initialize an estimate of the value function arbitrarily (e.g., to zero):
$$\hat{V}(s) \gets 0, \quad \forall s \in S$$
  2. Repeat, updating every state:
$$\hat{V}(s) \gets R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) \hat{V}(s'), \quad \forall s \in S$$
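Before the grid example, here is a compact sketch of these two steps for a generic tabular MDP; the arrays `P[a, s, s']` and `R[s]` are assumed inputs for illustration, not something defined in the lecture:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[a, s, s2] = P(s2 | s, a); R[s] = reward in state s. Returns an estimate of V*."""
    V = np.zeros(R.shape[0])                     # step 1: initialize V_hat (here, to zero)
    while True:
        V_new = R + gamma * (P @ V).max(axis=0)  # step 2: Bellman optimality update, all states
        if np.max(np.abs(V_new - V)) < eps:      # stop once the update barely changes V_hat
            return V_new
        V = V_new
```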

Example

Suppose we have a robot that can move in a 2D grid, with the following dynamics:

  • with 80% probability, the robot moves in the direction of the action
  • with 10% probability, the robot slips 90° to the left of the action's direction
  • with 10% probability, the robot slips 90° to the right of the action's direction
  • if a move would take the robot off the grid (or into the blocked cell *), it stays in place

The grid ($V^0(s)$) is:

| 0 | 0 | 0 |    1 |
| 0 | * | 0 | -100 |
| 0 | 0 | 0 |    0 |

If we run value iteration with $\gamma = 0.9$, we can update the value function as follows:

$$V^1(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^0(s')$$

At state $(3,3)$, the best action is to move right, toward the goal state, so:

$$\begin{aligned} V^1((3,3)) &= R((3,3)) + \gamma \sum_{s'\in S} P(s'|(3,3),\text{right})\, V^0(s') \\ &= 0 + 0.9 \times 0.8 \times 1 = 0.72 \end{aligned}$$

At state $(3,4)$, the best action is to move up, so that the robot stays on the $+1$ square with $90\%$ probability:

$$\begin{aligned} V^1((3,4)) &= R((3,4)) + \gamma \sum_{s'\in S} P(s'|(3,4),\text{up})\, V^0(s') \\ &= 1 + 0.9 \times (0.8+0.1) \times 1 = 1.81 \end{aligned}$$

At $t=1$, the values on the grid are:

| 0 | 0 | 0.72 |   1.81 |
| 0 | * |    0 | -99.91 |
| 0 | 0 |    0 |      0 |

The general algorithm can be written as:

```python
# Value iteration for the grid world above.
# grid[i][j] holds the reward R(s); '*' marks the blocked cell.
grid = [
    [0, 0, 0, 1],
    [0, '*', 0, -100],
    [0, 0, 0, 0],
]
m, n = len(grid), len(grid[0])

# Actions as (row, column) offsets.
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def next_state(i, j, di, dj):
    """Move from (i, j) by (di, dj); stay in place if the move leaves the grid or hits '*'."""
    ni, nj = i + di, j + dj
    if 0 <= ni < m and 0 <= nj < n and grid[ni][nj] != '*':
        return ni, nj
    return i, j

def expected_value(V, i, j, action):
    """sum_{s'} P(s'|s,a) V(s'): 80% intended direction, 10% each perpendicular slip."""
    di, dj = action
    moves = [(0.8, (di, dj)), (0.1, (-dj, di)), (0.1, (dj, -di))]
    total = 0.0
    for prob, (ddi, ddj) in moves:
        ni, nj = next_state(i, j, ddi, ddj)
        total += prob * V[ni][nj]
    return total

def value_iteration(gamma, eps=1e-6):
    # V^0: the grid itself (0 for the blocked cell), as in the example above.
    V = [[0.0 if grid[i][j] == '*' else float(grid[i][j]) for j in range(n)] for i in range(m)]
    while True:
        V_new = [[0.0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                if grid[i][j] == '*':
                    continue  # the blocked cell keeps value 0 and is never entered
                V_new[i][j] = grid[i][j] + gamma * max(
                    expected_value(V, i, j, a) for a in ACTIONS.values())
        if max(abs(V_new[i][j] - V[i][j]) for i in range(m) for j in range(n)) < eps:
            return V_new
        V = V_new

V = value_iteration(gamma=0.9)
for row in V:
    print([round(v, 2) for v in row])
```

Convergence of Value Iteration

Theorem: Value Iteration converges to the optimal value function, $\hat{V}\to V^*$, as the number of iterations $t\to\infty$.

Proof

For any estimate of the value function $\hat{V}$, we define the Bellman backup operator $\operatorname{B}:\mathbb{R}^{|S|}\to \mathbb{R}^{|S|}$ by

$$\operatorname{B}(\hat{V})(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) \hat{V}(s')$$

Note that $\operatorname{B}(V^*) = V^*$, since $V^*$ satisfies the Bellman Optimality Equation.

Since $|\max_{x\in X}f(x)-\max_{x\in X}g(x)|\leq \max_{x\in X}|f(x)-g(x)|$, for any value functions $V_1$ and $V_2$ we have

$$\begin{aligned} |\operatorname{B}(V_1)(s)-\operatorname{B}(V_2)(s)| &= \gamma \left|\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V_1(s')-\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V_2(s')\right|\\ &\leq \gamma \max_{a\in A} \left|\sum_{s'\in S} P(s'|s,a) V_1(s')-\sum_{s'\in S} P(s'|s,a) V_2(s')\right|\\ &\leq \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a)\, |V_1(s')-V_2(s')|\\ &\leq \gamma \max_{s'\in S}|V_1(s')-V_2(s')| \end{aligned}$$

So $\operatorname{B}$ is a $\gamma$-contraction in the sup norm, and repeated application of $\operatorname{B}$ converges to its unique fixed point $V^*$.

Assume $0\leq \gamma < 1$ and that the reward $R(s)$ is bounded by $R_{\max}$.

Then

$$V^*(s)\leq \sum_{t=0}^\infty \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}$$

Let $V^k$ be the value function after $k$ iterations of Value Iteration.

$$\max_{s\in S}|V^k(s)-V^*(s)|\leq \frac{R_{\max}}{1-\gamma}\gamma^k$$
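This bound can be obtained from the contraction property above: assuming the initialization $V^0 = 0$, applying the contraction inequality $k$ times gives

$$\max_{s\in S}|V^k(s)-V^*(s)| = \|\operatorname{B}^k V^0 - \operatorname{B}^k V^*\|_\infty \leq \gamma^k \|V^0 - V^*\|_\infty \leq \gamma^k \frac{R_{\max}}{1-\gamma}$$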

Stopping condition

We can construct a policy whose value is arbitrarily close to the optimal value function.

If $\|V^k-V^{k+1}\|<\epsilon$, then $\|V^{k+1}-V^*\|\leq \frac{\epsilon\gamma}{1-\gamma}$.

So we can select a small $\epsilon$ to stop the iteration.
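A short derivation of this bound, using the contraction property and the triangle inequality:

$$\|V^{k+1}-V^*\| = \|\operatorname{B}V^k - \operatorname{B}V^*\| \leq \gamma\|V^k - V^*\| \leq \gamma\left(\|V^k - V^{k+1}\| + \|V^{k+1} - V^*\|\right) < \gamma\epsilon + \gamma\|V^{k+1}-V^*\|$$

Rearranging gives $\|V^{k+1}-V^*\| < \frac{\epsilon\gamma}{1-\gamma}$.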

Greedy Policy

Given a $V^k$ that is close to the optimal value function $V^*$, the greedy policy is:

$$\pi_{g}(s) = \arg\max_{a\in A} \sum_{s'\in S} T(s,a,s')\, V^k(s')$$

Here $T(s,a,s')$ is the transition function, i.e., the probability of reaching $s'$ from $s$ under action $a$.

This selects the action that looks best if we assume that we receive the value $V^k$ after one step.
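As a quick illustration, the greedy policy can be read off cell by cell for the grid world; this sketch reuses `grid`, `m`, `n`, `ACTIONS`, `expected_value`, and `value_iteration` from the value-iteration code above (so it is not standalone):

```python
# Greedy policy for the grid world, using the helpers defined earlier.
V = value_iteration(gamma=0.9)

policy = [['*' if grid[i][j] == '*' else
           max(ACTIONS, key=lambda a: expected_value(V, i, j, ACTIONS[a]))
           for j in range(n)]
          for i in range(m)]

for row in policy:
    print(row)
```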

Value of a greedy policy

If we define $V_g$ to be the value function of the greedy policy, then $V_g$ is not necessarily optimal, but it is a good approximation.

In the homework, we will prove that if $\|V^k-V^*\|<\lambda$, then $\|V_g-V^*\|\leq \frac{2\lambda\gamma}{1-\gamma}$.

So we can set the stopping condition so that $V_g$ achieves any desired accuracy relative to $V^*$.

In fact, since the number of policies is finite, there is a finite $\epsilon$ below which the greedy policy is exactly optimal.

Problems with Value Iteration

  • It is slow: each iteration costs $O(|S|^2|A|)$
  • The max action at each state rarely changes
  • The policy converges before the value function

Policy Iteration

Interleave policy evaluation and policy improvement (a sketch follows the steps below).

  1. Initialize a random policy $\pi$
  2. Policy evaluation: compute the value function $V^{\pi}$
  3. Policy improvement: update $\pi$ to be the greedy policy with respect to $V^{\pi}$: $\pi(s)\gets \arg\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^{\pi}(s')$
  4. Repeat until convergence
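A minimal numpy sketch of this loop, with the tabular arrays `P[a, s, s']` and `R[s]` assumed as inputs; here policy evaluation is done approximately with repeated Bellman backups (the next subsection shows an exact alternative):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma, sweeps=100):
    """Approximate V^pi by repeatedly applying V <- R + gamma * P^pi V."""
    S = R.shape[0]
    P_pi = P[pi, np.arange(S)]          # P_pi[s, s2] = P(s2 | s, pi(s))
    V = np.zeros(S)
    for _ in range(sweeps):
        V = R + gamma * P_pi @ V
    return V

def policy_iteration(P, R, gamma=0.9):
    """P[a, s, s2] = P(s2 | s, a); R[s] = reward in state s."""
    S = R.shape[0]
    pi = np.zeros(S, dtype=int)                  # step 1: start from an arbitrary policy
    while True:
        V = evaluate_policy(P, R, pi, gamma)     # step 2: policy evaluation
        pi_new = (P @ V).argmax(axis=0)          # step 3: greedy policy improvement
        if np.array_equal(pi_new, pi):           # step 4: stop once the policy is stable
            return pi, V
        pi = pi_new
```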

Exact Policy Evaluation by Linear Solver

Let $V^{\pi}\in \mathbb{R}^{|S|}$ be the vector of values for each state and $r\in \mathbb{R}^{|S|}$ the vector of rewards for each state.

Let $P^{\pi}\in \mathbb{R}^{|S|\times |S|}$ be the transition matrix under policy $\pi$:

$$P^{\pi}_{ij} = P(s_{t+1}=j \mid s_t=i, a_t=\pi(s_t))$$

The Bellman equation for the policy can be written in vector form as:

$$\begin{aligned} V^{\pi} &= r + \gamma P^{\pi} V^{\pi} \\ (I-\gamma P^{\pi})V^{\pi} &= r \\ V^{\pi} &= (I-\gamma P^{\pi})^{-1} r \end{aligned}$$
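A hedged numpy sketch of this exact evaluation step (`P_pi` and `r` are assumed inputs, with `P_pi[i, j]` = $P(s_{t+1}=j \mid s_t=i)$ following the convention above):

```python
import numpy as np

def evaluate_policy_exact(P_pi, r, gamma=0.9):
    """Solve (I - gamma * P_pi) V = r for V^pi."""
    S = r.shape[0]
    # Solving the linear system is preferable to forming the matrix inverse explicitly.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r)

# Tiny made-up example: a 2-state policy with transition matrix P_pi and rewards r.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r = np.array([0.0, 1.0])
print(evaluate_policy_exact(P_pi, r))
```

This replaces the approximate inner evaluation loop in the policy-iteration sketch above with a single linear solve, at roughly $O(|S|^3)$ cost.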
Convergence of Policy Iteration

  • The proof involves showing that each iteration is also a contraction and monotonically improves the policy
  • Policy Iteration converges to the exact optimal policy
    • The number of policies is finite

In practice, policy iteration is usually faster than value iteration.

Policy Iteration Complexity

  • Each iteration runs in polynomial time in the number of states and actions
  • There are at most $|A|^n$ policies (where $n = |S|$), and PI never repeats a policy
    • So at most an exponential number of iterations
    • Not a very good complexity bound
  • Empirically, $O(n)$ iterations are required
    • Challenge: try to construct an MDP that requires more than $n$ iterations

Generalized Policy Iteration

  • Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement
  • Independent of the granularity and other details of the two processes

Summary

Policy Iteration vs Value Iteration

  • PI has two loops: an inner loop (evaluate $V^{\pi}$) and an outer loop (improve $\pi$)

  • VI has one loop: repeatedly apply $V^{k+1}(s) = \max_{a\in A} \left[r(s,a) + \gamma \sum_{s'\in S} P(s'|s,a) V^k(s')\right]$

  • Trade-offs:

    • PI converges in few outer iterations if policy evaluation can be done quickly and accurately;
    • VI avoids expensive exact evaluation, doing cheaper but many Bellman optimality updates.
  • Modified Policy Iteration: partial evaluation + improvement.

