
CSE510 Deep Reinforcement Learning (Lecture 9)

Large state spaces

RL algorithms presented so far have little chance to solve real-world problems when the state (or action) space is large.

  • We can no longer represent the V or Q function as an explicit table

Even if we had enough memory

  • Never enough training data
  • Learning takes too long

What about large state spaces?

We will now study three other approaches

  • Value function approximation
  • Policy gradient methods
  • Actor-critic methods

RL with Function Approximation

Solution for large MDPs:

  • Estimate the value function using a function approximator

Value function approximation (VFA) replaces the table with a general parameterized form:

\hat{V}(s, \theta) \approx V_\pi(s)

or

\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)

Benefit:

  • Generalization: those functions can be trained to map similar states to similar values
    • Reduce memory usage
    • Reduce computation time
    • Reduce experience needed to learn the V/Q

Linear Function Approximation

Define a set of state features f_1(s),\ldots,f_n(s)

  • The features are used as our representation of the state
  • States with similar feature values will be considered similar

A common approximation is to represent V(s) as a linear combination of the features:

\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)

The approximation accuracy is fundamentally limited by the information provided by the features

Can we always define features that allow for a perfect linear approximation?

  • Yes. Assign each state an indicator feature: the i-th feature is 1 if and only if s is the i-th state, and \theta_i represents the value of the i-th state (see the sketch after this list).
  • However, this requires a feature for each state, which is impractical for large state spaces. (no generalization)
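As a quick illustration (a minimal Python sketch, not from the lecture), indicator features make the linear form reduce to an ordinary lookup table; the per-state values used below are hypothetical:

```python
import numpy as np

n_states = 5

def indicator_features(s):
    """One-hot feature vector: f_i(s) = 1 if and only if s is the i-th state."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

# One weight per state (theta_0 = 0 for simplicity); the values are hypothetical.
theta = np.array([2.0, -1.0, 0.5, 3.0, 0.0])

def V_hat(s):
    # V_hat(s) = sum_i theta_i * f_i(s) simply reads out theta[s]
    return theta @ indicator_features(s)

assert V_hat(3) == theta[3]  # exactly the tabular value -- no generalization
```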

Example

Grid with no obstacles, deterministic actions U/D/L/R, no discounting, -1 reward everywhere except +10 at goal.

The grid, with the value V(s) of each state shown in its cell, is:

 4  5  6  7  8  9 10
 3  4  5  6  7  8  9
 2  3  4  5  6  7  8
 1  2  3  4  5  6  7
 0  1  2  3  4  5  6
 0  0  1  2  3  4  5
 0  0  0  1  2  3  4
Features for state s=(x, y): f_1(s) = x, f_2(s) = y (just 2 features)

V(s) = \theta_0 + \theta_1 x + \theta_2 y

Is there a good linear approximation?

  • Yes.
  • \theta_0 = 10, \theta_1 = -1, \theta_2 = -1
  • (note: the upper-right corner is the origin)
V(s) = 10 - x - y

This subtracts the Manhattan distance to the goal from the goal reward; a quick check is given below.
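For example, taking the upper-right cell as the origin with x increasing to the left and y increasing downward (an assumption consistent with the note above), the state two columns to the left and one row below the goal has

V(s) = \theta_0 + \theta_1 \cdot 2 + \theta_2 \cdot 1 = 10 - 2 - 1 = 7,

which matches the corresponding entry in the grid.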


However, for a different grid, V(s) = \theta_0 + \theta_1 x + \theta_2 y is not a good approximation:

 4  5  6  7  6  5  4
 5  6  7  8  7  6  5
 6  7  8  9  8  7  6
 7  8  9 10  9  8  7
 6  7  8  9  8  7  6
 5  6  7  8  7  6  5
 4  5  6  7  6  5  4

But we can include a new feature z = |3-x| + |3-y| to get a good approximation.

V(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z
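As a sanity check (a small sketch, not from the lecture), one choice of weights consistent with the center-goal grid above is \theta_0 = 10, \theta_1 = \theta_2 = 0, \theta_3 = -1, i.e. V(s) = 10 - z; the snippet below reconstructs the grid from these assumed weights:

```python
import numpy as np

# Assumed weights: theta = (theta_0, theta_1, theta_2, theta_3) = (10, 0, 0, -1),
# so that V(s) = 10 - z with z = |3 - x| + |3 - y| (goal at the center (3, 3)).
theta = np.array([10.0, 0.0, 0.0, -1.0])

def features(x, y):
    """f(s) = [1, x, y, z]; the leading 1 carries the bias weight theta_0."""
    z = abs(3 - x) + abs(3 - y)
    return np.array([1.0, x, y, z])

values = np.array([[theta @ features(x, y) for x in range(7)] for y in range(7)])
print(values.astype(int))  # reproduces the 7x7 grid shown above, peaking at 10
```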

Usually, we need to define different approximations for different problems.

Learning with Linear Function Approximation

Define a set of features f_1(s),\ldots,f_n(s)

  • The features are used as our representation of the state
  • States with similar feature values will be treated similarly
  • More complex functions require more features
\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)

Our goal is to learn good parameter values that approximate the value function well

  • How can we do this?
  • Use TD-based RL and somehow update parameters based on each experience

TD-based learning with function approximators

  1. Start with initial parameter values
  2. Take action according to an exploration/exploitation policy
  3. Update estimated model
  4. Perform TD update for each parameter: \theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
  5. Goto 2

The TD update for each parameter is:

\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)

Derivation from Gradient Descent

Our goal is to minimize the squared error between our estimated value function and the target value at each example:

E_j(\theta) = \frac{1}{2} \left(\hat{V}_\theta(s_j)-v(s_j)\right)^2

Here E_j(\theta) is the squared error of example j.

\hat{V}_\theta(s_j) is our estimated value function at state s_j.

v(s_j) is the true target value at state s_j.

After seeing the j-th example, the gradient descent rule tells us that we can decrease the error E_j(\theta) by

\theta_i \gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}

Here \alpha is the learning rate.

By the chain rule, we have:

\begin{aligned}
\theta_i &\gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i} \\
\theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i} &= \theta_i - \alpha \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} \\
&= \theta_i - \alpha \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)
\end{aligned}

Note that \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)} = \hat{V}_\theta(s_j) - v(s_j)

and \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} = f_i(s_j)

For the linear approximation function

\hat{V}_\theta(s_j) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s_j)

we have \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} = f_i(s_j).

Thus the TD update for each parameter is:

\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)

For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate.

What do we use for the target value v(s_j)?

Use the TD prediction based on the next state s_{j+1} (bootstrap learning):

v(s) = R(s) + \gamma \hat{V}_\theta(s')

So the TD update for each parameter is:

\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
Note

Initially, the value function may be full of zeros. It is often better to use another, denser reward signal to initialize the value function.
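Putting the steps above together, here is a minimal Python sketch of TD(0) with a linear value approximator; the gym-style environment interface, the feature map, and the behavior policy are assumptions for illustration, not part of the lecture:

```python
import numpy as np

def td0_linear(env, features, policy, n_features, alpha=0.01, gamma=0.99, episodes=500):
    """TD(0) with a linear approximator V_hat(s) = theta @ features(s).

    Assumes: env.reset() -> s and env.step(a) -> (s', r, done, info),
    features(s) -> length-n_features array (with a constant 1 for the bias),
    and policy(s) -> an action from an exploration/exploitation policy.
    """
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            f = features(s)
            # Bootstrapped TD target: r + gamma * V_hat(s'); zero past terminal states.
            v_next = 0.0 if done else theta @ features(s_next)
            td_error = r + gamma * v_next - theta @ f
            theta += alpha * td_error * f  # theta_i += alpha * delta * f_i(s)
            s = s_next
    return theta
```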

Q-function approximation

Instead of f(s), we use features f(s, a) to approximate Q(s, a):

State-action pairs with similar feature values will be treated similarly.

More complex functions require more complex features.

\hat{Q}(s, a, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s, a)

Features are a function of state and action.

Just as for TD, we can generalize Q-learning to update the parameters of the Q-function approximation.

Q-learning with Linear Approximators:

  1. Start with initial parameter values
  2. Take an action according to an exploration/exploitation policy, transitioning from s to s'
  3. Perform TD update for each parameter: \theta_i \gets \theta_i + \alpha \left(R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')- \hat{Q}_\theta(s,a)\right)f_i(s,a)
  4. Goto 2
Warning

Typically the error surface has many local minima, and convergence is no longer guaranteed. However, it often works in practice.

Here R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a') is the estimate of Q(s,a) based on an observed transition.

Note that f_i(s,a) = \frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}; this needs to be computed in closed form.
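A corresponding minimal Python sketch of Q-learning with a linear approximator is below; again the environment interface, action set, and feature map f(s, a) are assumptions for illustration:

```python
import numpy as np

def q_learning_linear(env, features, actions, n_features, alpha=0.01,
                      gamma=0.99, epsilon=0.1, episodes=500):
    """Q-learning with Q_hat(s, a) = theta @ features(s, a)."""
    theta = np.zeros(n_features)

    def q(s, a):
        return theta @ features(s, a)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration/exploitation policy.
            if np.random.rand() < epsilon:
                a = actions[np.random.randint(len(actions))]
            else:
                a = max(actions, key=lambda act: q(s, act))
            s_next, r, done, _ = env.step(a)
            # TD target uses the max over next actions; zero past terminal states.
            q_next = 0.0 if done else max(q(s_next, act) for act in actions)
            td_error = r + gamma * q_next - q(s, a)
            theta += alpha * td_error * features(s, a)  # gradient of Q_hat is f(s, a)
            s = s_next
    return theta
```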

Deep Q-network (DQN)

DQN is a non-linear function approximator that uses deep neural networks to approximate the value function.

The goal is to obtain a single agent that can solve any human-level control problem.

  • RL defines the objective (the Q-value function)
  • DL learns the hierarchical feature representation

Use a deep network to represent the value function.
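For instance, a minimal PyTorch sketch might look like the following (the framework choice and network sizes are assumptions; experience replay and a target network, which full DQN uses, are omitted for brevity):

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99  # hypothetical problem sizes

# Q-network: maps a state to one Q-value per action.
q_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_step(s, a, r, s_next, done):
    """One gradient step on the squared TD error for a single transition."""
    q_sa = q_net(s)[a]                                   # Q_hat(s, a)
    with torch.no_grad():                                # do not differentiate the target
        target = r + gamma * (1.0 - done) * q_net(s_next).max()
    loss = (target - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example usage with a dummy transition.
td_step(torch.randn(state_dim), a=1, r=1.0, s_next=torch.randn(state_dim), done=0.0)
```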
