CSE510 Deep Reinforcement Learning (Lecture 4)
Markov Decision Process (MDP) Part II
Recall from last lecture
A finite MDP is defined by:
- A finite set of states $\mathcal{S}$
- A finite set of actions $\mathcal{A}$
- A transition function $P(s' \mid s, a)$
  - Probability that taking action $a$ in state $s$ leads to $s'$, i.e., $P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
  - Also called the model or the dynamics
- A reward function $R(s, a, s')$ (sometimes $R(s)$ or $R(s, a)$)
- A start state $s_0$
- Maybe a terminal state
A model for sequential decision-making problems under uncertainty
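To make these ingredients concrete, the sketch below writes a tiny finite MDP as plain Python data. The two states, two actions, and all numbers are made up purely for illustration; they are not from the lecture.

# A minimal sketch of a finite MDP as plain Python data (illustrative numbers).
# P maps (s, a) to a dict of next-state probabilities; R maps (s, a, s') to a reward.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
P = {  # transition function P(s' | s, a)
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}
R = {  # reward function R(s, a, s')
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s0"):   0.0,
    ("s0", "go", "s1"):   1.0,
    ("s1", "stay", "s1"): 0.5,
    ("s1", "go", "s0"):   0.0,
    ("s1", "go", "s1"):   0.5,
}
gamma = 0.9          # discount factor
start_state = "s0"   # start state; terminal states could be listed similarly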
Optimal Policy and Bellman Optimality Equation
The goal for an MDP is to compute or learn an optimal policy.
- An optimal policy $\pi^*$ is one that achieves the highest value at every state
- We define the optimal value function $V^*$ using the Bellman Optimality Equation: $V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V^*(s')]$
- The optimal policy is $\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V^*(s')]$
The Existence of the Optimal Policy
Theorem: For any Markov Decision Process:
- There exists an optimal policy
- There can be many optimal policies, but all optimal policies achieve the same optimal value function
- There is always a deterministic optimal policy for any MDP
Solve MDP
Value Iteration
Repeatedly update an estimate of the optimal value function according to the Bellman Optimality Equation.
- Initialize an estimate for the value function arbitrarily, e.g., $V_0(s) = 0$ for all $s$
- Repeat, update: $V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V_k(s')]$ for all $s$ (a generic sketch follows below)
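A minimal tabular sketch of this loop, assuming an MDP stored in the dictionary format of the illustrative sketch above (the lecture's grid-world example below uses its own, more specialized code):

def tabular_value_iteration(states, actions, P, R, gamma, tol=1e-6):
    # P[(s, a)] is a dict {s': probability}; R[(s, a, s')] is a scalar reward
    V = {s: 0.0 for s in states}    # arbitrary initialization
    while True:
        V_new = {}
        for s in states:
            # Bellman optimality backup at state s
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions if (s, a) in P)
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# e.g., with the illustrative STATES, ACTIONS, P, R, gamma defined earlier:
# V_star = tabular_value_iteration(STATES, ACTIONS, P, R, gamma)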
Example
Suppose we have a robot that can move in a 2D grid, with the following dynamics:
- with 80% probability, the robot moves in the direction of the action
- with 10% probability, the robot slips in the direction rotated one step to the left of the action
- with 10% probability, the robot slips in the direction rotated one step to the right of the action
- if the resulting move would leave the grid or enter the wall cell *, the robot stays in place
The grid (showing the reward $R(s)$ of each cell, with * a wall) is:
|0|0|0|1|
|0|*|0|-100|
|0|0|0|0|
If we run value iteration with $\gamma = 0.9$, we can update the value function as follows:
On the point just left of the +1 goal (row 0, column 2), the best action is to move right toward the goal state, so:
$V(0,2) = 0 + 0.9 \times (0.8 \times 1 + 0.1 \times 0 + 0.1 \times 0) = 0.72$
On the +1 cell itself (row 0, column 3), the best action is to move up, which keeps you in place (bumping into the grid boundary) with 90% probability, so:
$V(0,3) = 1 + 0.9 \times (0.8 \times 1 + 0.1 \times 0 + 0.1 \times 1) = 1.81$
After this first round of updates, the values on the grid are:
|0|0|0.72|1.81|
|0|*|0|-99.91|
|0|0|0|0|
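As a quick sanity check of the two backups above (plain arithmetic with the 80/10/10 dynamics and $\gamma = 0.9$):

gamma = 0.9
# cell just left of the +1 goal: move right (80% reach +1, 10% bump up and stay at 0, 10% slip down to 0)
v_left_of_goal = 0 + gamma * (0.8 * 1 + 0.1 * 0 + 0.1 * 0)
print(v_left_of_goal)   # ~0.72
# the +1 cell itself: move up (80% bump and stay at +1, 10% slip left to 0, 10% slip right, bump, stay at +1)
v_goal_cell = 1 + gamma * (0.8 * 1 + 0.1 * 0 + 0.1 * 1)
print(v_goal_cell)      # ~1.81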
The general algorithm can be written as:
# Suppose we define the grid as in the previous example.
# '*' marks the wall cell; every other entry is the reward R(s) of that cell.
grid = [
    [0, 0, 0, 1],
    [0, '*', 0, -100],
    [0, 0, 0, 0]
]
m, n = len(grid), len(grid[0])
# (row, column) offsets for each action
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
# with probability 0.1 each, the robot slips to the action rotated left or right
ROT_LEFT = {'up': 'left', 'left': 'down', 'down': 'right', 'right': 'up'}
ROT_RIGHT = {'up': 'right', 'right': 'down', 'down': 'left', 'left': 'up'}
gamma = 0.9

def next_value(V, action, i, j):
    # value of the cell reached from (i, j) by `action`; stay in place if blocked
    di, dj = ACTIONS[action]
    ni, nj = i + di, j + dj
    if 0 <= ni < m and 0 <= nj < n and grid[ni][nj] != '*':
        return V[ni][nj]
    return V[i][j]

def expected_next_value(V, action, i, j):
    # expected next-state value: 80% intended direction, 10% each perpendicular slip
    return (0.8 * next_value(V, action, i, j)
            + 0.1 * next_value(V, ROT_LEFT[action], i, j)
            + 0.1 * next_value(V, ROT_RIGHT[action], i, j))

def value_iteration(gamma, ACTIONS, grid):
    # initialize V(s) = R(s); the wall cell keeps value 0
    V = [[0 if grid[i][j] == '*' else grid[i][j] for j in range(n)] for i in range(m)]
    while True:
        V_new = [[0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                if grid[i][j] == '*':
                    continue
                # Bellman optimality backup: R(s) + gamma * max_a E[V(s')]
                V_new[i][j] = grid[i][j] + gamma * max(
                    expected_next_value(V, action, i, j) for action in ACTIONS)
        if max(abs(V_new[i][j] - V[i][j]) for i in range(m) for j in range(n)) < 1e-6:
            return V_new
        V = V_new

V = value_iteration(gamma, ACTIONS, grid)
print(V)
Convergence of Value Iteration
Theorem: Value Iteration converges to the optimal value function $V^*$ as the number of iterations $k \to \infty$.
Proof
For any estimate of the value function $V$, we define the Bellman backup operator $B$ by
$(BV)(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$
Note that $BV^* = V^*$.
Since $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$, for any value functions $V$ and $V'$ we have
$\|BV - BV'\|_\infty \le \gamma\,\|V - V'\|_\infty$
Assume $\gamma < 1$, and the reward is bounded by $R_{\max}$. Then $\|V^*\|_\infty \le \frac{R_{\max}}{1-\gamma}$, so $\|V_0 - V^*\|_\infty$ is finite for any bounded initialization.
Let $V_k$ be the value function after $k$ iterations of Value Iteration. Then
$\|V_k - V^*\|_\infty = \|BV_{k-1} - BV^*\|_\infty \le \gamma\,\|V_{k-1} - V^*\|_\infty \le \dots \le \gamma^k\,\|V_0 - V^*\|_\infty \to 0$ as $k \to \infty$.
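The contraction property can also be checked numerically. The sketch below uses a made-up two-state MDP and two random value functions; the final comparison should always print True.

import random

# hypothetical 2-state, 2-action MDP (made-up numbers, just to test the bound)
states, actions, gamma = [0, 1], [0, 1], 0.9
P = {(0, 0): {0: 1.0}, (0, 1): {0: 0.3, 1: 0.7},
     (1, 0): {1: 1.0}, (1, 1): {0: 0.6, 1: 0.4}}
R = {(s, a, s2): random.uniform(-1, 1) for (s, a), d in P.items() for s2 in d}

def backup(V):
    # one Bellman optimality backup B applied to the value function V (a list indexed by state)
    return [max(sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                for a in actions)
            for s in states]

V1 = [random.uniform(-10, 10) for _ in states]
V2 = [random.uniform(-10, 10) for _ in states]
lhs = max(abs(x - y) for x, y in zip(backup(V1), backup(V2)))   # ||B V1 - B V2||_inf
rhs = max(abs(x - y) for x, y in zip(V1, V2))                   # ||V1 - V2||_inf
print(lhs <= gamma * rhs + 1e-12)   # contraction: should be True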
Stopping condition
We can obtain a value estimate (and hence a policy) arbitrarily close to the optimal value function by stopping once successive estimates are close.
If $\|V_{k+1} - V_k\|_\infty \le \epsilon$, then $\|V_{k+1} - V^*\|_\infty \le \frac{\epsilon \gamma}{1 - \gamma}$.
So we can select a small $\epsilon$ to decide when to stop the iteration.
Greedy Policy
Given a value function $V$ that is close to the optimal value function $V^*$, the greedy policy is:
$\pi_{\text{greedy}}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$
Here $P(s' \mid s, a)$ is the transition probability from state $s$ to state $s'$ under action $a$.
This selects the action that looks best if we assume that we obtain value $V(s')$ after one step.
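A minimal sketch of extracting this greedy policy, again assuming the dictionary-format MDP used in the earlier sketches (not the lecture's grid code):

def greedy_policy(states, actions, P, R, gamma, V):
    # one-step lookahead: in each state, pick the action with the best backed-up value
    pi = {}
    for s in states:
        pi[s] = max(
            (a for a in actions if (s, a) in P),
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                              for s2, p in P[(s, a)].items()),
        )
    return pi

# e.g., pi = greedy_policy(STATES, ACTIONS, P, R, gamma, V_star)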
Value of a greedy policy
If we define $V^{\text{greedy}}$ to be the value function of the greedy policy, then it satisfies the Bellman equation for $\pi_{\text{greedy}}$:
$V^{\text{greedy}}(s) = \sum_{s'} P(s' \mid s, \pi_{\text{greedy}}(s))\,[R(s, \pi_{\text{greedy}}(s), s') + \gamma V^{\text{greedy}}(s')]$
This is not necessarily optimal, but it is a good approximation.
In the homework, we will prove that if $\|V - V^*\|_\infty \le \lambda$, then $\|V^{\text{greedy}} - V^*\|_\infty \le \frac{2 \lambda \gamma}{1 - \gamma}$.
So we can set the stopping condition so that $V^{\text{greedy}}$ has the desired accuracy with respect to $V^*$.
There is a finite $k$ such that the greedy policy with respect to $V_k$ is $\epsilon$-optimal.
Problems of Value Iteration (Motivation for Policy Iteration)
- It is slow
- The max action at each state rarely changes
- The policy converges before the value function
Policy Iteration
Interleaving policy evaluation and policy improvement.
- Initialize a random policy $\pi_0$
- Compute the value function $V^{\pi_i}$ (policy evaluation)
- Update the policy to be the greedy policy with respect to $V^{\pi_i}$ (policy improvement)
- Repeat until convergence, i.e., until the policy no longer changes (a compact sketch follows below)
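A compact sketch of this loop in the same dictionary format, reusing the greedy_policy helper sketched in the Greedy Policy section; the evaluation step here is iterative, and the next subsection shows how to do it exactly with a linear solver.

def policy_evaluation(states, P, R, gamma, pi, tol=1e-8):
    # iteratively evaluate a fixed policy pi (a dict: state -> action)
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                        for s2, p in P[(s, pi[s])].items())
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def policy_iteration(states, actions, P, R, gamma):
    # start from an arbitrary deterministic policy
    pi = {s: next(a for a in actions if (s, a) in P) for s in states}
    while True:
        V = policy_evaluation(states, P, R, gamma, pi)            # policy evaluation
        new_pi = greedy_policy(states, actions, P, R, gamma, V)   # policy improvement
        if new_pi == pi:     # policy stable: no improvement possible, so it is optimal
            return pi, V
        pi = new_pi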
Exact Policy Evaluation by Linear Solver
Let $V^\pi$ be a vector of values for each state, and $R$ be a vector of rewards for each state.
Let $P^\pi$ be the transition matrix for the policy $\pi$, with entries $P^\pi_{s s'} = P(s' \mid s, \pi(s))$.
The Bellman equation for the policy $\pi$ can be written in vector form as:
$V^\pi = R + \gamma P^\pi V^\pi$, which gives $V^\pi = (I - \gamma P^\pi)^{-1} R$, so $V^\pi$ can be computed exactly by solving this linear system.
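A minimal numpy sketch of this exact evaluation for a fixed policy; the two-state transition matrix and reward vector below are made-up numbers.

import numpy as np

gamma = 0.9
# hypothetical 2-state MDP under a fixed policy pi:
# P_pi[s, s'] = P(s' | s, pi(s)),  R_pi[s] = reward in state s under pi
P_pi = np.array([[0.3, 0.7],
                 [0.6, 0.4]])
R_pi = np.array([1.0, -0.5])

# Bellman equation in vector form: V = R_pi + gamma * P_pi @ V
# rearranged: (I - gamma * P_pi) V = R_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V_pi)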
- The convergence proof involves showing that each iteration is also a contraction and monotonically improves the policy
- Policy Iteration converges to the exact optimal policy
- The number of policies is finite, so it terminates after finitely many iterations
In real-world problems, policy iteration is usually faster than value iteration.
Policy Iteration Complexity
- Each iteration runs in polynomial time in the number of states and actions
- There are at most |A|^n policies (with n states), and PI never repeats a policy
- So at most an exponential number of iterations
- Not a very good complexity bound
- Empirically O(n) iterations are required
- Challenge: try to generate an MDP that requires more than n iterations
Generalized Policy Iteration
- Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement
- independent of their granularity and other details of the two processes
Summary
Policy Iteration vs Value Iteration
- PI has two loops: an inner loop (evaluate $V^\pi$) and an outer loop (improve $\pi$)
- VI has one loop: repeatedly apply the Bellman optimality backup $B$
- Trade-offs:
  - PI converges in few outer steps if you can evaluate $V^\pi$ quickly/accurately
  - VI avoids expensive exact evaluation, doing cheaper but many Bellman optimality updates
- Modified Policy Iteration: partial evaluation + improvement (see the sketch below)
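A sketch of this modified variant, reusing the dictionary-format helpers from the earlier sketches: run only k backups under the current policy instead of evaluating it to convergence (k = 5 and the sweep count are arbitrary choices).

def modified_policy_iteration(states, actions, P, R, gamma, k=5, sweeps=100):
    # partial evaluation (k backups under the current policy) + greedy improvement
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        pi = greedy_policy(states, actions, P, R, gamma, V)       # improvement step
        for _ in range(k):                                        # partial evaluation: only k backups
            V = {s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                        for s2, p in P[(s, pi[s])].items())
                 for s in states}
    return greedy_policy(states, actions, P, R, gamma, V), V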