
CSE510 Deep Reinforcement Learning (Lecture 9)

Large state spaces

RL algorithms presented so far have little chance to solve real-world problems when the state (or action) space is large.

  • We can no longer represent the V or Q function as an explicit table

Even if we had enough memory

  • Never enough training data
  • Learning takes too long

What about large state spaces?

We will now study three other approaches

  • Value function approximation
  • Policy gradient methods
  • Actor-critic methods

RL with Function Approximation

Solution for large MDPs:

  • Estimate the value function using a function approximator

Value function approximation (VFA) replaces the table with a general parameterized form:

\hat{V}(s, \theta) \approx V_\pi(s)

or

\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)

Benefit:

  • Generalization: those functions can be trained to map similar states to similar values
    • Reduce memory usage
    • Reduce computation time
    • Reduce experience needed to learn the V/Q

Linear Function Approximation

Define a set of state features f_1(s),\ldots,f_n(s)

  • The features are used as our representation of the state
  • States with similar feature values will be considered similar

A common approximation is to represent V(s) as a linear combination of the features:

\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)

The approximation accuracy is fundamentally limited by the information provided by the features

Can we always define features that allow for a perfect linear approximation?

  • Yes. Assign each state an indicator feature: the i-th feature is 1 if and only if s is the i-th state, and \theta_i represents the value of the i-th state (see the sketch after this list).
  • However, this requires a feature for each state, which is impractical for large state spaces. (no generalization)
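As a quick illustration (a minimal Python sketch, not from the lecture), indicator features make the linear form reduce to an ordinary lookup table; the per-state values used below are hypothetical:

```python
import numpy as np

n_states = 5

def indicator_features(s):
    """One-hot feature vector: f_i(s) = 1 if and only if s is the i-th state."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

# One weight per state (theta_0 = 0 for simplicity); the values are hypothetical.
theta = np.array([2.0, -1.0, 0.5, 3.0, 0.0])

def V_hat(s):
    # V_hat(s) = sum_i theta_i * f_i(s) simply reads out theta[s]
    return theta @ indicator_features(s)

assert V_hat(3) == theta[3]  # exactly the tabular value -- no generalization
```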

Example

Grid with no obstacles, deterministic actions U/D/L/R, no discounting, -1 reward everywhere except +10 at goal.

The grid, with the value V(s) of each state shown in its cell, is:

 4  5  6  7  8  9 10
 3  4  5  6  7  8  9
 2  3  4  5  6  7  8
 1  2  3  4  5  6  7
 0  1  2  3  4  5  6
 0  0  1  2  3  4  5
 0  0  0  1  2  3  4
Features for state s=(x, y): f_1(s) = x, f_2(s) = y (just 2 features)

V(s) = \theta_0 + \theta_1 x + \theta_2 y

Is there a good linear approximation?

  • Yes.
  • \theta_0 = 10, \theta_1 = -1, \theta_2 = -1
  • (note: the upper-right corner is the origin)
V(s) = 10 - x - y

This subtracts the Manhattan distance to the goal from the goal reward; a quick check is given below.
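For example, taking the upper-right cell as the origin with x increasing to the left and y increasing downward (an assumption consistent with the note above), the state two columns to the left and one row below the goal has

V(s) = \theta_0 + \theta_1 \cdot 2 + \theta_2 \cdot 1 = 10 - 2 - 1 = 7,

which matches the corresponding entry in the grid.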


However, for a different grid, V(s) = \theta_0 + \theta_1 x + \theta_2 y is not a good approximation:

 4  5  6  7  6  5  4
 5  6  7  8  7  6  5
 6  7  8  9  8  7  6
 7  8  9 10  9  8  7
 6  7  8  9  8  7  6
 5  6  7  8  7  6  5
 4  5  6  7  6  5  4

But we can include a new feature z = |3-x| + |3-y| to get a good approximation.

V(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z
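As a sanity check (a small sketch, not from the lecture), one choice of weights consistent with the center-goal grid above is \theta_0 = 10, \theta_1 = \theta_2 = 0, \theta_3 = -1, i.e. V(s) = 10 - z; the snippet below reconstructs the grid from these assumed weights:

```python
import numpy as np

# Assumed weights: theta = (theta_0, theta_1, theta_2, theta_3) = (10, 0, 0, -1),
# so that V(s) = 10 - z with z = |3 - x| + |3 - y| (goal at the center (3, 3)).
theta = np.array([10.0, 0.0, 0.0, -1.0])

def features(x, y):
    """f(s) = [1, x, y, z]; the leading 1 carries the bias weight theta_0."""
    z = abs(3 - x) + abs(3 - y)
    return np.array([1.0, x, y, z])

values = np.array([[theta @ features(x, y) for x in range(7)] for y in range(7)])
print(values.astype(int))  # reproduces the 7x7 grid shown above, peaking at 10
```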

Usually, we need to define different approximations for different problems.

Learning with Linear Function Approximation

Define a set of features f_1(s),\ldots,f_n(s)

  • The features are used as our representation of the state
  • States with similar feature values will be treated similarly
  • More complex functions require more features
\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)

Our goal is to learn good parameter values that approximate the value function well

  • How can we do this?
  • Use TD-based RL and somehow update parameters based on each experience

TD-based learning with function approximators

  1. Start with initial parameter values
  2. Take action according to an exploration/exploitation policy
  3. Update estimated model
  4. Perform TD update for each parameter: \theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
  5. Goto 2

The TD update for each parameter is:

\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)

Derivation from Gradient Descent

Our goal is to minimize the squared error between our estimated value function and the target value at each example:

E_j(\theta) = \frac{1}{2} \left(\hat{V}_\theta(s_j)-v(s_j)\right)^2

Here E_j(\theta) is the squared error of example j.

\hat{V}_\theta(s_j) is our estimated value function at state s_j.

v(s_j) is the true target value at state s_j.

After seeing the j-th example, the gradient descent rule tells us that we can decrease the error E_j(\theta) by

\theta_i \gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}

Here \alpha is the learning rate.

By the chain rule, we have:

\begin{aligned}
\theta_i &\gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i} \\
\theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i} &= \theta_i - \alpha \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} \\
&= \theta_i - \alpha \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)
\end{aligned}

Note that \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)} = \hat{V}_\theta(s_j) - v(s_j)

and \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} = f_i(s_j)

For the linear approximation function

\hat{V}_\theta(s_j) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s_j)

we have \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} = f_i(s_j).

Thus the TD update for each parameter is:

\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)

For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate.

What do we use for the target value v(s_j)?

Use the TD prediction based on the next state s_{j+1} (bootstrap learning):

v(s) = R(s) + \gamma \hat{V}_\theta(s')

So the TD update for each parameter is:

\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
Note

Initially, the value function may be full of zeros. It is often better to use another, denser reward signal to initialize the value function.
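Putting the steps above together, here is a minimal Python sketch of TD(0) with a linear value approximator; the gym-style environment interface, the feature map, and the behavior policy are assumptions for illustration, not part of the lecture:

```python
import numpy as np

def td0_linear(env, features, policy, n_features, alpha=0.01, gamma=0.99, episodes=500):
    """TD(0) with a linear approximator V_hat(s) = theta @ features(s).

    Assumes: env.reset() -> s and env.step(a) -> (s', r, done, info),
    features(s) -> length-n_features array (with a constant 1 for the bias),
    and policy(s) -> an action from an exploration/exploitation policy.
    """
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            f = features(s)
            # Bootstrapped TD target: r + gamma * V_hat(s'); zero past terminal states.
            v_next = 0.0 if done else theta @ features(s_next)
            td_error = r + gamma * v_next - theta @ f
            theta += alpha * td_error * f  # theta_i += alpha * delta * f_i(s)
            s = s_next
    return theta
```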

Q-function approximation

Instead of f(s), we use features f(s, a) to approximate Q(s, a):

State-action pairs with similar feature values will be treated similarly.

More complex functions require more complex features.

\hat{Q}(s, a, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s, a)

Features are a function of state and action.

Just as for TD, we can generalize Q-learning to update the parameters of the Q-function approximation.

Q-learning with Linear Approximators:

  1. Start with initial parameter values
  2. Take an action according to an exploration/exploitation policy, transitioning from s to s'
  3. Perform TD update for each parameter: \theta_i \gets \theta_i + \alpha \left(R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')- \hat{Q}_\theta(s,a)\right)f_i(s,a)
  4. Goto 2
Warning

Typically the error surface has many local minima, and convergence is no longer guaranteed. However, it often works in practice.

Here R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a') is the estimate of Q(s,a) based on an observed transition.

Note that f_i(s,a) = \frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}; this needs to be computed in closed form.
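A corresponding minimal Python sketch of Q-learning with a linear approximator is below; again the environment interface, action set, and feature map f(s, a) are assumptions for illustration:

```python
import numpy as np

def q_learning_linear(env, features, actions, n_features, alpha=0.01,
                      gamma=0.99, epsilon=0.1, episodes=500):
    """Q-learning with Q_hat(s, a) = theta @ features(s, a)."""
    theta = np.zeros(n_features)

    def q(s, a):
        return theta @ features(s, a)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration/exploitation policy.
            if np.random.rand() < epsilon:
                a = actions[np.random.randint(len(actions))]
            else:
                a = max(actions, key=lambda act: q(s, act))
            s_next, r, done, _ = env.step(a)
            # TD target uses the max over next actions; zero past terminal states.
            q_next = 0.0 if done else max(q(s_next, act) for act in actions)
            td_error = r + gamma * q_next - q(s, a)
            theta += alpha * td_error * features(s, a)  # gradient of Q_hat is f(s, a)
            s = s_next
    return theta
```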

Deep Q-network (DQN)

DQN is a non-linear function approximator that uses deep neural networks to approximate the value function.

The goal is to obtain a single agent that can solve any human-level control problem.

  • RL defines the objective (the Q-value function)
  • DL learns the hierarchical feature representation

Use a deep network to represent the value function.
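For instance, a minimal PyTorch sketch might look like the following (the framework choice and network sizes are assumptions; experience replay and a target network, which full DQN uses, are omitted for brevity):

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99  # hypothetical problem sizes

# Q-network: maps a state to one Q-value per action.
q_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_step(s, a, r, s_next, done):
    """One gradient step on the squared TD error for a single transition."""
    q_sa = q_net(s)[a]                                   # Q_hat(s, a)
    with torch.no_grad():                                # do not differentiate the target
        target = r + gamma * (1.0 - done) * q_net(s_next).max()
    loss = (target - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example usage with a dummy transition.
td_step(torch.randn(state_dim), a=1, r=1.0, s_next=torch.randn(state_dim), done=0.0)
```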
