CSE510 Deep Reinforcement Learning (Lecture 9)
Large state spaces
RL algorithms presented so far have little chance to solve real-world problems when the state (or action) space is large.
- We can no longer represent the V or Q functions as explicit tables
Even if we had enough memory
- Never enough training data
- Learning takes too long
What about large state spaces?
We will now study three other approaches
- Value function approximation
- Policy gradient methods
- Actor-critic methods
RL with Function Approximation
Solution for large MDPs:
- Estimate value function using a function approximator
Value function approximation (VFA) replaces the table with a general parameterized form:
$\hat{V}_\theta(s) \approx V^\pi(s)$ or $\hat{Q}_\theta(s, a) \approx Q^\pi(s, a)$
Benefit:
- Generalization: those functions can be trained to map similar states to similar values
- Reduce memory usage
- Reduce computation time
- Reduce experience needed to learn the V/Q
Linear Function Approximation
Define a set of state features $f_1(s), \ldots, f_n(s)$
- The features are used as our representation of the state
- States with similar feature values will be considered similar
A common approximation is to represent $\hat{V}_\theta(s)$ as a linear combination of the features:
$\hat{V}_\theta(s) = \theta_0 + \theta_1 f_1(s) + \theta_2 f_2(s) + \cdots + \theta_n f_n(s)$
The approximation accuracy is fundamentally limited by the information provided by the features
Can we always define features that allow for a perfect linear approximation?
- Yes. Assign each state an indicator feature. (The $i$th feature is 1 if and only if the current state is the $i$th state, and $\theta_i$ represents the value of the $i$th state.)
- However, this requires a feature for each state, which is impractical for large state spaces. (no generalization)
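To make the indicator-feature argument concrete, here is a minimal sketch (plain NumPy; the 4-state table of values is made up for illustration) showing that one-hot features let the linear form reproduce any value table exactly, at the cost of one parameter per state:

```python
import numpy as np

# Hypothetical 4-state example: one indicator (one-hot) feature per state.
n_states = 4
true_values = np.array([1.0, -2.5, 0.3, 7.0])   # arbitrary "table" of V(s)

def features(s):
    """Indicator features: f_i(s) = 1 iff s is the i-th state."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

# With theta_i = V(s_i), the linear form theta . f(s) reproduces the table exactly.
theta = true_values.copy()
for s in range(n_states):
    assert np.isclose(theta @ features(s), true_values[s])
# ...but this needs one parameter per state, so nothing is gained (no generalization).
```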
Example
Grid with no obstacles, deterministic actions U/D/L/R, no discounting, -1 reward everywhere except +10 at goal.
The values over the grid are:
| 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|
| 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| 0 | 0 | 1 | 2 | 3 | 4 | 5 |
| 0 | 0 | 0 | 1 | 2 | 3 | 4 |
Features for a state $s$ at grid position $(x, y)$: $f_1(s) = x$ and $f_2(s) = y$ (just 2 features)
Is there a good linear approximation?
- Yes.
- $\hat{V}_\theta(s) = 10 - x - y$, i.e. $\theta_0 = 10$, $\theta_1 = \theta_2 = -1$ (note the upper right, the goal, is the origin)
- This subtracts the Manhattan distance to the goal from the goal reward
However, for a different grid (below), this is not a good approximation.
| 4 | 5 | 6 | 7 | 6 | 5 | 4 |
|---|---|---|---|---|---|---|
| 5 | 6 | 7 | 8 | 7 | 6 | 5 |
| 6 | 7 | 8 | 9 | 8 | 7 | 6 |
| 7 | 8 | 9 | 10 | 9 | 8 | 7 |
| 6 | 7 | 8 | 9 | 8 | 7 | 6 |
| 5 | 6 | 7 | 8 | 7 | 6 | 5 |
| 4 | 5 | 6 | 7 | 6 | 5 | 4 |
But we can include a new feature (e.g., the Manhattan distance to the central goal) to get a good approximation.
Usually, we need to define different approximations (feature sets) for different problems.
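As a concrete check, the sketch below (NumPy; the choice of the extra feature, the Manhattan distance to the central goal, is my reading of "a new feature") fits the second grid's values by least squares with features $(1, x, y)$ and then with the added distance feature:

```python
import numpy as np

# Values of the second grid (goal with +10 in the center).
V = np.array([
    [4, 5, 6, 7, 6, 5, 4],
    [5, 6, 7, 8, 7, 6, 5],
    [6, 7, 8, 9, 8, 7, 6],
    [7, 8, 9, 10, 9, 8, 7],
    [6, 7, 8, 9, 8, 7, 6],
    [5, 6, 7, 8, 7, 6, 5],
    [4, 5, 6, 7, 6, 5, 4],
], dtype=float)

ys, xs = np.meshgrid(np.arange(7), np.arange(7), indexing="ij")
x, y, v = xs.ravel(), ys.ravel(), V.ravel()

def fit_error(F):
    """Least-squares fit v ~ F @ theta; return the max absolute residual."""
    theta, *_ = np.linalg.lstsq(F, v, rcond=None)
    return np.max(np.abs(F @ theta - v))

bias = np.ones_like(x, dtype=float)
plain = np.column_stack([bias, x, y])            # features (1, x, y)
dist = np.abs(x - 3) + np.abs(y - 3)             # assumed extra feature:
extra = np.column_stack([bias, x, y, dist])      # Manhattan distance to the goal

print(fit_error(plain))   # large residual: (x, y) alone cannot represent this V
print(fit_error(extra))   # ~0: adding the distance feature gives an exact fit
```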
Learning with Linear Function Approximation
Define a set of features
- The features are used as our representation of the state
- States with similar feature values will be treated similarly
- More complex functions require more features
Our goal is to learn good parameter values that approximate the value function well
- How can we do this?
- Use TD-based RL and somehow update parameters based on each experience
TD-based learning with function approximators
1. Start with initial parameter values $\theta$
2. Take action according to an exploration/exploitation policy, observing the transition from $s$ to $s'$
3. Update estimated model
4. Perform the TD update for each parameter $\theta_i$
5. Goto 2
The TD update for each parameter is:
$\theta_i \leftarrow \theta_i + \alpha\,\big(v(s) - \hat{V}_\theta(s)\big)\, f_i(s)$
where $v(s)$ is the target value for state $s$ (discussed below).
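A one-function sketch of this update rule (NumPy; the function name and interface are mine), assuming the target value $v(s)$ is supplied by the caller:

```python
import numpy as np

def td_update(theta, f_s, target, alpha=0.1):
    """One TD update of all parameters: theta_i += alpha * (v(s) - V_hat(s)) * f_i(s).

    theta  : parameter vector (include theta_0 by adding a constant bias feature)
    f_s    : feature vector f(s) for the current state
    target : the target value v(s) for this state
    """
    v_hat = theta @ f_s                  # linear value estimate V_hat_theta(s)
    return theta + alpha * (target - v_hat) * f_s
```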
Derivation via Gradient Descent
Our goal is to minimize the squared error between our estimated value function and the target value at each example:
$E_j(\theta) = \frac{1}{2}\big(\hat{V}_\theta(s_j) - v(s_j)\big)^2$
Here $E_j(\theta)$ is the squared error of example $j$.
$\hat{V}_\theta(s_j)$ is our estimated value function at state $s_j$.
$v(s_j)$ is the target value at state $s_j$.
After seeing the $j$th state, the gradient descent rule tells us that we can decrease the error with respect to $\theta_i$ by:
$\theta_i \leftarrow \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}$
Here $\alpha$ is the learning rate.
By the chain rule, we have:
$\frac{\partial E_j(\theta)}{\partial \theta_i} = \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)} \cdot \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}$
Note that
$\frac{\partial E_j}{\partial \hat{V}_\theta(s_j)} = \hat{V}_\theta(s_j) - v(s_j)$
and so the update becomes
$\theta_i \leftarrow \theta_i + \alpha\,\big(v(s_j) - \hat{V}_\theta(s_j)\big)\,\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}$
For the linear approximation function
$\hat{V}_\theta(s) = \theta_0 + \theta_1 f_1(s) + \cdots + \theta_n f_n(s)$
we have
$\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} = f_i(s_j)$
Thus the TD update for each parameter is:
$\theta_i \leftarrow \theta_i + \alpha\,\big(v(s) - \hat{V}_\theta(s)\big)\, f_i(s)$
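As a sanity check on the derivation, the following sketch (NumPy; all numbers are arbitrary toy values) compares the analytic gradient $\big(\hat{V}_\theta(s_j) - v(s_j)\big) f_i(s_j)$ against finite differences of the squared error:

```python
import numpy as np

# Finite-difference check of the gradient derived above (toy, arbitrary numbers).
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
f_s = rng.normal(size=4)      # features f(s_j)
v_tgt = 2.0                   # target value v(s_j)

def error(th):
    """Squared error E_j(theta) = 0.5 * (V_hat_theta(s_j) - v(s_j))**2."""
    return 0.5 * (th @ f_s - v_tgt) ** 2

analytic = (theta @ f_s - v_tgt) * f_s          # (V_hat - v) * f_i(s_j), per the chain rule
numeric = np.array([
    (error(theta + 1e-6 * np.eye(4)[i]) - error(theta - 1e-6 * np.eye(4)[i])) / 2e-6
    for i in range(4)
])
assert np.allclose(analytic, numeric, atol=1e-4)
```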
For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate.
What do we use for the target value $v(s)$?
Use the TD prediction based on the next state $s'$ (bootstrapped learning):
$v(s) = R(s) + \gamma \hat{V}_\theta(s')$
So the TD update for each parameter is:
$\theta_i \leftarrow \theta_i + \alpha\,\big(R(s) + \gamma \hat{V}_\theta(s') - \hat{V}_\theta(s)\big)\, f_i(s)$
Initially, the value function may be all zeros; it can help to use other, denser rewards to initialize the value function.
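Putting the pieces together, here is a minimal TD(0) sketch with a linear approximator on a toy random-walk chain (the environment, features, and hyperparameters are illustrative choices, not from the lecture):

```python
import numpy as np

# Toy 10-state random-walk chain: move left/right at random, +1 reward on
# reaching the right end, 0 elsewhere; both ends are terminal.
n_states, gamma, alpha = 10, 1.0, 0.05
rng = np.random.default_rng(0)

def features(s):
    """Simple 2-feature representation: a bias and the (scaled) position."""
    return np.array([1.0, s / (n_states - 1)])

theta = np.zeros(2)
for episode in range(2000):
    s = n_states // 2
    while s not in (0, n_states - 1):                 # until a terminal state
        s_next = s + rng.choice([-1, 1])              # random policy
        r = 1.0 if s_next == n_states - 1 else 0.0
        v_next = 0.0 if s_next in (0, n_states - 1) else theta @ features(s_next)
        target = r + gamma * v_next                   # bootstrapped TD target
        theta += alpha * (target - theta @ features(s)) * features(s)
        s = s_next

print(theta)   # V_hat(s) = theta . f(s) approximates the true ramp V(s) = s/(n_states-1)
```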
Q-function approximation
Instead of $\hat{V}_\theta(s)$, we use a parameterized $\hat{Q}_\theta(s, a)$ to approximate $Q(s, a)$:
$\hat{Q}_\theta(s, a) = \theta_0 + \theta_1 f_1(s, a) + \cdots + \theta_n f_n(s, a)$
State-action pairs with similar feature values will be treated similarly.
More complex functions require more complex features.
Features are a function of state and action.
Just as for TD, we can generalize Q-learning to update the parameters of the Q-function approximation.
Q-learning with Linear Approximators:
1. Start with initial parameter values $\theta$
2. Take action $a$ according to an exploration/exploitation policy, transitioning from $s$ to $s'$
3. Perform the TD update for each parameter:
   $\theta_i \leftarrow \theta_i + \alpha\,\big(R(s) + \gamma \max_{a'} \hat{Q}_\theta(s', a') - \hat{Q}_\theta(s, a)\big)\, f_i(s, a)$
4. Goto 2
Typically the error surface has many local minima and convergence is no longer guaranteed. However, this often works well in practice.
Here $R(s) + \gamma \max_{a'} \hat{Q}_\theta(s', a')$ is the estimate of $Q(s, a)$ based on an observed transition.
Note that the term $\max_{a'} \hat{Q}_\theta(s', a')$ needs to be computed, e.g., in closed form or by enumerating the actions.
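A compact sketch of the loop above (NumPy; the `env.reset()`/`env.step(a)` interface and the epsilon-greedy schedule are assumptions made for illustration):

```python
import numpy as np

def q_learning_linear(env, features, n_features, n_actions,
                      alpha=0.01, gamma=0.99, epsilon=0.1, n_episodes=500):
    """Q-learning with a linear approximator Q_hat(s, a) = theta . f(s, a).

    `features(s, a)` must return an n_features vector; the max over a' is
    computed by enumerating the (small, discrete) action set.
    """
    rng = np.random.default_rng(0)
    theta = np.zeros(n_features)
    q = lambda s, a: theta @ features(s, a)

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy exploration/exploitation policy.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax([q(s, b) for b in range(n_actions)]))
            s_next, r, done = env.step(a)
            # TD target: r + gamma * max_a' Q_hat(s', a') (just r at terminal states).
            target = r if done else r + gamma * max(q(s_next, b) for b in range(n_actions))
            theta += alpha * (target - q(s, a)) * features(s, a)
            s = s_next
    return theta
```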
Deep Q-network (DQN)
This is a non-linear function approximator that uses a deep neural network to approximate the value function.
The goal is a single agent that can solve any human-level control task.
- RL defines the objective (the Q-value function)
- DL learns the hierarchical feature representation
Use a deep network to represent the value function:
$\hat{Q}(s, a; \theta) \approx Q^*(s, a)$
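For instance, a minimal Q-network sketch in PyTorch (layer sizes and the framework choice are mine; DQN additionally relies on experience replay and a target network, which are omitted here):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action: Q_hat(s, .; theta)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Usage: greedy action for a single (illustrative) state.
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.zeros(1, 4)                       # batch of one state vector
action = q_net(state).argmax(dim=1).item()
```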