
CSE510 Deep Reinforcement Learning (Lecture 4)

Markov Decision Process (MDP) Part II

Recall from last lecture

A finite MDP is defined by:

  • A finite set of states $s \in S$
  • A finite set of actions $a \in A$
  • A transition function $T(s, a, s')$
    • The probability that taking action $a$ in state $s$ leads to $s'$, i.e., $P(s' \mid s, a)$
    • Also called the model or the dynamics
  • A reward function $R(s)$ (sometimes $R(s,a)$ or $R(s,a,s')$)
  • A start state
  • Maybe a terminal state

A model for sequential decision-making problems under uncertainty

Optimal Policy and Bellman Optimality Equation

The goal for an MDP is to compute or learn an optimal policy.

  • An optimal policy is one that achieves the highest value at any state
$$\pi^* = \arg\max_\pi V^\pi(s)$$
  • We define the optimal value function using the Bellman Optimality Equation
$$V^*(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^*(s')$$
  • The optimal policy is (a small code sketch of this backup follows below)
$$\pi^*(s) = \arg\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^*(s')$$
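To make these two equations concrete, here is a minimal sketch (not from the lecture) of a single Bellman optimality backup and the induced greedy action choice, using a made-up two-state, two-action MDP stored as arrays `P[a, s, s']` and `R[s]`:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers made up for illustration).
# P[a, s, s2] = P(s2 | s, a); R[s] = reward received in state s.
P = np.array([[[0.9, 0.1],    # action 0 from state 0
               [0.2, 0.8]],   # action 0 from state 1
              [[0.5, 0.5],    # action 1 from state 0
               [0.0, 1.0]]])  # action 1 from state 1
R = np.array([0.0, 1.0])
gamma = 0.9

V = np.array([0.5, 2.0])      # some current estimate of V*
Q = R + gamma * (P @ V)       # Q[a, s] = R(s) + gamma * sum_{s'} P(s'|s,a) V(s')
V_backup = Q.max(axis=0)      # right-hand side of the Bellman Optimality Equation
policy = Q.argmax(axis=0)     # greedy action at each state, as in the pi* formula
print(V_backup, policy)
```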

The Existence of the Optimal Policy

Theorem: For any Markov Decision Process

  • There exists an optimal policy
  • There can be many optimal policies, but all optimal policies achieve the same optimal value function
  • There is always a deterministic optimal policy for any MDP

Solving MDPs

Value Iteration

Repeatedly update an estimate of the optimal value function according to the Bellman Optimality Equation (a generic sketch follows the two steps below).

  1. Initialize an estimate of the value function arbitrarily (e.g., to zero):
$$\hat{V}(s) \gets 0, \quad \forall s \in S$$
  2. Repeat, updating every state:
$$\hat{V}(s) \gets R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) \hat{V}(s'), \quad \forall s \in S$$
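Before the grid example, here is a compact sketch of these two steps for a generic tabular MDP; the arrays `P[a, s, s']` and `R[s]` are assumed inputs for illustration, not something defined in the lecture:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[a, s, s2] = P(s2 | s, a); R[s] = reward in state s. Returns an estimate of V*."""
    V = np.zeros(R.shape[0])                     # step 1: initialize V_hat (here, to zero)
    while True:
        V_new = R + gamma * (P @ V).max(axis=0)  # step 2: Bellman optimality update, all states
        if np.max(np.abs(V_new - V)) < eps:      # stop once the update barely changes V_hat
            return V_new
        V = V_new
```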

Example

Suppose we have a robot that can move in a 2D grid, with the following dynamics:

  • with 80% probability, the robot moves in the direction of the action
  • with 10% probability, the robot slips 90° to the left of the action's direction
  • with 10% probability, the robot slips 90° to the right of the action's direction
  • if a move would take the robot off the grid (or into the blocked cell *), it stays in place

The grid ($V^0(s)$) is:

| 0 | 0 | 0 |    1 |
| 0 | * | 0 | -100 |
| 0 | 0 | 0 |    0 |

If we run value iteration with $\gamma = 0.9$, we can update the value function as follows:

$$V^1(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^0(s')$$

At state $(3,3)$, the best action is to move right, toward the goal state, so:

$$\begin{aligned} V^1((3,3)) &= R((3,3)) + \gamma \sum_{s'\in S} P(s'|(3,3),\text{right})\, V^0(s') \\ &= 0 + 0.9 \times 0.8 \times 1 = 0.72 \end{aligned}$$

At state $(3,4)$, the best action is to move up, so that the robot stays on the $+1$ square with $90\%$ probability:

$$\begin{aligned} V^1((3,4)) &= R((3,4)) + \gamma \sum_{s'\in S} P(s'|(3,4),\text{up})\, V^0(s') \\ &= 1 + 0.9 \times (0.8+0.1) \times 1 = 1.81 \end{aligned}$$

At $t=1$, the values on the grid are:

| 0 | 0 | 0.72 |   1.81 |
| 0 | * |    0 | -99.91 |
| 0 | 0 |    0 |      0 |

The general algorithm can be written as:

```python
# Value iteration for the grid world above.
# grid[i][j] holds the reward R(s); '*' marks the blocked cell.
grid = [
    [0, 0, 0, 1],
    [0, '*', 0, -100],
    [0, 0, 0, 0],
]
m, n = len(grid), len(grid[0])

# Actions as (row, column) offsets.
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def next_state(i, j, di, dj):
    """Move from (i, j) by (di, dj); stay in place if the move leaves the grid or hits '*'."""
    ni, nj = i + di, j + dj
    if 0 <= ni < m and 0 <= nj < n and grid[ni][nj] != '*':
        return ni, nj
    return i, j

def expected_value(V, i, j, action):
    """sum_{s'} P(s'|s,a) V(s'): 80% intended direction, 10% each perpendicular slip."""
    di, dj = action
    moves = [(0.8, (di, dj)), (0.1, (-dj, di)), (0.1, (dj, -di))]
    total = 0.0
    for prob, (ddi, ddj) in moves:
        ni, nj = next_state(i, j, ddi, ddj)
        total += prob * V[ni][nj]
    return total

def value_iteration(gamma, eps=1e-6):
    # V^0: the grid itself (0 for the blocked cell), as in the example above.
    V = [[0.0 if grid[i][j] == '*' else float(grid[i][j]) for j in range(n)] for i in range(m)]
    while True:
        V_new = [[0.0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                if grid[i][j] == '*':
                    continue  # the blocked cell keeps value 0 and is never entered
                V_new[i][j] = grid[i][j] + gamma * max(
                    expected_value(V, i, j, a) for a in ACTIONS.values())
        if max(abs(V_new[i][j] - V[i][j]) for i in range(m) for j in range(n)) < eps:
            return V_new
        V = V_new

V = value_iteration(gamma=0.9)
for row in V:
    print([round(v, 2) for v in row])
```

Convergence of Value Iteration

Theorem: Value Iteration converges to the optimal value function, $\hat{V}\to V^*$, as the number of iterations $t\to\infty$.

Proof

For any estimate of the value function $\hat{V}$, we define the Bellman backup operator $\operatorname{B}:\mathbb{R}^{|S|}\to \mathbb{R}^{|S|}$ by

$$\operatorname{B}(\hat{V})(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) \hat{V}(s')$$

Note that $\operatorname{B}(V^*) = V^*$, since $V^*$ satisfies the Bellman Optimality Equation.

Since $|\max_{x\in X}f(x)-\max_{x\in X}g(x)|\leq \max_{x\in X}|f(x)-g(x)|$, for any value functions $V_1$ and $V_2$ we have

$$\begin{aligned} |\operatorname{B}(V_1)(s)-\operatorname{B}(V_2)(s)| &= \gamma \left|\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V_1(s')-\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V_2(s')\right|\\ &\leq \gamma \max_{a\in A} \left|\sum_{s'\in S} P(s'|s,a) V_1(s')-\sum_{s'\in S} P(s'|s,a) V_2(s')\right|\\ &\leq \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a)\, |V_1(s')-V_2(s')|\\ &\leq \gamma \max_{s'\in S}|V_1(s')-V_2(s')| \end{aligned}$$

So $\operatorname{B}$ is a $\gamma$-contraction in the sup norm, and repeated application of $\operatorname{B}$ converges to its unique fixed point $V^*$.

Assume $0\leq \gamma < 1$ and that the reward $R(s)$ is bounded by $R_{\max}$.

Then

$$V^*(s)\leq \sum_{t=0}^\infty \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}$$

Let $V^k$ be the value function after $k$ iterations of Value Iteration.

$$\max_{s\in S}|V^k(s)-V^*(s)|\leq \frac{R_{\max}}{1-\gamma}\gamma^k$$
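This bound can be obtained from the contraction property above: assuming the initialization $V^0 = 0$, applying the contraction inequality $k$ times gives

$$\max_{s\in S}|V^k(s)-V^*(s)| = \|\operatorname{B}^k V^0 - \operatorname{B}^k V^*\|_\infty \leq \gamma^k \|V^0 - V^*\|_\infty \leq \gamma^k \frac{R_{\max}}{1-\gamma}$$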

Stopping condition

We can construct a policy whose value is arbitrarily close to the optimal value function.

If $\|V^k-V^{k+1}\|<\epsilon$, then $\|V^{k+1}-V^*\|\leq \frac{\epsilon\gamma}{1-\gamma}$.

So we can select a small $\epsilon$ to stop the iteration.
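A short derivation of this bound, using the contraction property and the triangle inequality:

$$\|V^{k+1}-V^*\| = \|\operatorname{B}V^k - \operatorname{B}V^*\| \leq \gamma\|V^k - V^*\| \leq \gamma\left(\|V^k - V^{k+1}\| + \|V^{k+1} - V^*\|\right) < \gamma\epsilon + \gamma\|V^{k+1}-V^*\|$$

Rearranging gives $\|V^{k+1}-V^*\| < \frac{\epsilon\gamma}{1-\gamma}$.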

Greedy Policy

Given a $V^k$ that is close to the optimal value function $V^*$, the greedy policy is:

$$\pi_{g}(s) = \arg\max_{a\in A} \sum_{s'\in S} T(s,a,s')\, V^k(s')$$

Here $T(s,a,s')$ is the transition function, i.e., the probability of reaching $s'$ from $s$ under action $a$.

This selects the action that looks best if we assume that we receive the value $V^k$ after one step.
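As a quick illustration, the greedy policy can be read off cell by cell for the grid world; this sketch reuses `grid`, `m`, `n`, `ACTIONS`, `expected_value`, and `value_iteration` from the value-iteration code above (so it is not standalone):

```python
# Greedy policy for the grid world, using the helpers defined earlier.
V = value_iteration(gamma=0.9)

policy = [['*' if grid[i][j] == '*' else
           max(ACTIONS, key=lambda a: expected_value(V, i, j, ACTIONS[a]))
           for j in range(n)]
          for i in range(m)]

for row in policy:
    print(row)
```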

Value of a greedy policy

If we define $V_g$ to be the value function of the greedy policy, then $V_g$ is not necessarily optimal, but it is a good approximation.

In the homework, we will prove that if $\|V^k-V^*\|<\lambda$, then $\|V_g-V^*\|\leq \frac{2\lambda\gamma}{1-\gamma}$.

So we can set the stopping condition so that $V_g$ achieves any desired accuracy relative to $V^*$.

In fact, since the number of policies is finite, there is a finite $\epsilon$ below which the greedy policy is exactly optimal.

Problems with Value Iteration

  • It is slow: each iteration costs $O(|S|^2|A|)$
  • The max action at each state rarely changes
  • The policy converges before the value function

Policy Iteration

Interleave policy evaluation and policy improvement (a sketch follows the steps below).

  1. Initialize a random policy $\pi$
  2. Policy evaluation: compute the value function $V^{\pi}$
  3. Policy improvement: update $\pi$ to be the greedy policy with respect to $V^{\pi}$: $\pi(s)\gets \arg\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^{\pi}(s')$
  4. Repeat until convergence
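A minimal numpy sketch of this loop, with the tabular arrays `P[a, s, s']` and `R[s]` assumed as inputs; here policy evaluation is done approximately with repeated Bellman backups (the next subsection shows an exact alternative):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma, sweeps=100):
    """Approximate V^pi by repeatedly applying V <- R + gamma * P^pi V."""
    S = R.shape[0]
    P_pi = P[pi, np.arange(S)]          # P_pi[s, s2] = P(s2 | s, pi(s))
    V = np.zeros(S)
    for _ in range(sweeps):
        V = R + gamma * P_pi @ V
    return V

def policy_iteration(P, R, gamma=0.9):
    """P[a, s, s2] = P(s2 | s, a); R[s] = reward in state s."""
    S = R.shape[0]
    pi = np.zeros(S, dtype=int)                  # step 1: start from an arbitrary policy
    while True:
        V = evaluate_policy(P, R, pi, gamma)     # step 2: policy evaluation
        pi_new = (P @ V).argmax(axis=0)          # step 3: greedy policy improvement
        if np.array_equal(pi_new, pi):           # step 4: stop once the policy is stable
            return pi, V
        pi = pi_new
```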

Exact Policy Evaluation by Linear Solver

Let $V^{\pi}\in \mathbb{R}^{|S|}$ be the vector of values for each state and $r\in \mathbb{R}^{|S|}$ the vector of rewards for each state.

Let $P^{\pi}\in \mathbb{R}^{|S|\times |S|}$ be the transition matrix under policy $\pi$:

$$P^{\pi}_{ij} = P(s_{t+1}=j \mid s_t=i, a_t=\pi(s_t))$$

The Bellman equation for the policy can be written in vector form as:

$$\begin{aligned} V^{\pi} &= r + \gamma P^{\pi} V^{\pi} \\ (I-\gamma P^{\pi})V^{\pi} &= r \\ V^{\pi} &= (I-\gamma P^{\pi})^{-1} r \end{aligned}$$
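A hedged numpy sketch of this exact evaluation step (`P_pi` and `r` are assumed inputs, with `P_pi[i, j]` = $P(s_{t+1}=j \mid s_t=i)$ following the convention above):

```python
import numpy as np

def evaluate_policy_exact(P_pi, r, gamma=0.9):
    """Solve (I - gamma * P_pi) V = r for V^pi."""
    S = r.shape[0]
    # Solving the linear system is preferable to forming the matrix inverse explicitly.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r)

# Tiny made-up example: a 2-state policy with transition matrix P_pi and rewards r.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r = np.array([0.0, 1.0])
print(evaluate_policy_exact(P_pi, r))
```

This replaces the approximate inner evaluation loop in the policy-iteration sketch above with a single linear solve, at roughly $O(|S|^3)$ cost.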
Convergence of Policy Iteration

  • The proof involves showing that each iteration is also a contraction and monotonically improves the policy
  • Policy Iteration converges to the exact optimal policy
    • The number of policies is finite

In practice, policy iteration is usually faster than value iteration.

Policy Iteration Complexity

  • Each iteration runs in polynomial time in the number of states and actions
  • There are at most $|A|^n$ policies (where $n = |S|$), and PI never repeats a policy
    • So at most an exponential number of iterations
    • Not a very good complexity bound
  • Empirically, $O(n)$ iterations are required
    • Challenge: try to construct an MDP that requires more than $n$ iterations

Generalized Policy Iteration

  • Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement
  • Independent of the granularity and other details of the two processes

Summary

Policy Iteration vs Value Iteration

  • PI has two loops: an inner loop (evaluate $V^{\pi}$) and an outer loop (improve $\pi$)

  • VI has one loop: repeatedly apply $V^{k+1}(s) = \max_{a\in A} \left[r(s,a) + \gamma \sum_{s'\in S} P(s'|s,a) V^k(s')\right]$

  • Trade-offs:

    • PI converges in few outer iterations if policy evaluation can be done quickly and accurately;
    • VI avoids expensive exact evaluation, doing cheaper but many Bellman optimality updates.
  • Modified Policy Iteration: partial evaluation + improvement.

