CSE510 Deep Reinforcement Learning (Lecture 7)
Large Scale RL
So far we have represented value functions by a lookup table
- Every state s has an entry V(s), or
- Every state-action pair (s, a) has an entry Q(s, a)
Reinforcement learning can be used to solve large problems, e.g.
- Backgammon: 10^20 states
- Computer Go: 10^170 states
- Helicopter, robot, …: enormous continuous state space
Tabular methods clearly cannot handle this. Why?
- There are too many states and/or actions to store in memory
- It is too slow to learn the value of each state individually
- A table cannot generalize across states!
Value Function Approximation (VFA)
Solution for large MDPs:
- Estimate the value function using a function approximator
Value function approximation (VFA) replaces the table with a general parameterized form:
V̂(s, w) ≈ V_π(s)
or
Q̂(s, a, w) ≈ Q_π(s, a)
where w is the parameter vector of the function approximator.
Benefit:
- Can generalize across states
- Save memory (only need to store the function approximator parameters)
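The idea can be sketched with a linear approximator. This is a minimal illustration, not the course's reference implementation: the feature map `phi` and the toy TD(0)-style update below are hypothetical choices made for the example.

```python
import numpy as np

def phi(state, n_features=4):
    """Hypothetical feature map: polynomial features of a scalar state."""
    return np.array([state ** k for k in range(n_features)], dtype=float)

def v_hat(state, w):
    """Approximate value: V(s, w) = w . phi(s) -- a few weights replace a whole table."""
    return w @ phi(state)

w = np.zeros(4)                       # only these parameters are stored, not one entry per state
alpha, gamma = 0.1, 0.99              # step size and discount factor
s, r, s_next = 0.5, 1.0, 0.6          # one toy transition

# TD(0)-style update: move V(s, w) toward the bootstrapped target r + gamma * V(s', w).
target = r + gamma * v_hat(s_next, w)
w += alpha * (target - v_hat(s, w)) * phi(s)   # gradient of the squared error w.r.t. w
```

Because nearby states share features, updating `w` for one state also changes the estimate for similar states, which is exactly the generalization a lookup table cannot provide.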
End-to-End RL
End-to-end RL methods replace the hand-designed state representation with raw observations.
- Good: We get rid of manual design of state representations
- Bad: we need far more data to train the network, since O_t is usually much higher-dimensional than a hand-designed S_t
Function Approximation
- Linear function approximation
- Neural network function approximation
- Decision tree function approximation
- Nearest neighbor
- …
In this course, we will focus on linear combinations of features and neural networks.
Today we will cover deep neural networks (fully connected and convolutional).
Artificial Neural Networks
Neuron
x_1, …, x_n are the inputs, w_1, …, w_n are the weights, and b is the bias.
The neuron computes the weighted sum z = Σ_i w_i x_i + b, then applies an activation function g (usually non-linear): a = g(z).
Activation functions
ReLU (rectified linear unit):
- Bounded below by 0.
- Non-vanishing gradient.
- No upper bound.
Sigmoid:
- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.
A common practice is to use ReLU in the hidden layers and sigmoid at the output layer.
For shallow fully connected networks, using sigmoid in more layers can also work, since vanishing gradients are less of a problem with few layers.
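The two activations above can be written directly from their definitions; the small checks mirror the listed properties:

```python
import numpy as np

def relu(z):
    """ReLU: bounded below by 0, no upper bound; gradient is 1 for z > 0 (non-vanishing)."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Sigmoid: always positive, bounded between 0 and 1, strictly increasing."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)        # negative inputs clipped to 0, positive inputs passed through
sig_mid = sigmoid(0.0)    # 0.5: the midpoint of the (0, 1) range
```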
We can use parallel computing techniques to speed up the computation.
Universal Approximation Theorem
Any continuous function (on a compact domain) can be approximated to arbitrary accuracy by a neural network with a single hidden layer.
(The catch: that single hidden layer may need to be extremely wide.)
Why use deep neural networks?
Motivation from Biology
- Visual Cortex
Motivation from circuit theory
- Compact representation
Modularity
- Uses data more efficiently
In Practice: works better for many domains
- Hard to argue with results.
Training Neural Networks
- Loss function
- Model
- Optimization
Empirical loss minimization framework:
min_θ (1/N) Σ_{i=1}^N l(f(x_i; θ), y_i) + λ Ω(θ)
where l is the loss function, f is the model, θ are the parameters, Ω(θ) is the regularization term, and λ is the regularization parameter.
Learning is cast as optimization.
- For classification problems, we would like to minimize classification error, e.g., logistic or cross entropy loss.
- For regression problems, we would like to minimize regression error, e.g. L1 or L2 distance from groundtruth.
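The two loss families can be made concrete with one-line definitions (per-example versions, for illustration):

```python
import numpy as np

def cross_entropy(p_true_class):
    """Classification loss: -log of the predicted probability assigned to the true class.
    It is 0 when the model is confidently correct and grows without bound as p -> 0."""
    return -np.log(p_true_class)

def l2_loss(y_pred, y_true):
    """Regression loss: squared (L2) distance from the ground truth."""
    return (y_pred - y_true) ** 2

ce_perfect = cross_entropy(1.0)   # confidently correct prediction
ce_poor = cross_entropy(0.1)      # low probability on the true class -> large loss
reg = l2_loss(2.0, 2.5)
```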
Stochastic Gradient Descent
Perform updates after seeing each example:
- Initialize: θ ← θ_0
- For t = 1, 2, …:
  - For each training example (x_i, y_i):
    - Compute gradient: g ← ∇_θ [l(f(x_i; θ), y_i) + λ Ω(θ)]
    - Update: θ ← θ − η g, where η is the learning rate
Training a neural network, we need:
- Loss function
- Procedure to compute the gradient
- Regularization term
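Putting the three ingredients together, here is a minimal SGD sketch for L2-regularized linear regression (the model, loss, and toy data are chosen only for illustration):

```python
import numpy as np

def sgd(X, y, lr=0.01, lam=1e-3, epochs=100, seed=0):
    """SGD on per-example loss (w . x_i - y_i)^2 + lam * ||w||^2."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                       # initialize parameters
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # visit examples in random order
            grad = 2 * (X[i] @ w - y[i]) * X[i] + 2 * lam * w   # loss + regularizer gradient
            w -= lr * grad                         # update after seeing each example
    return w

# Toy data generated from y = 3x; SGD should recover a weight near 3.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
w = sgd(X, y)
```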
Mini-batch and Momentum
Make updates based on a mini-batch of examples (instead of a single example)
- the gradient is computed on the average regularized loss for that mini-batch
- can give a more accurate estimate of the gradient
Momentum keeps an exponential average of previous gradients.
It can get past plateaus more quickly by "gaining momentum."
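Both ideas can be grafted onto the earlier SGD sketch. This is an illustrative variant (the heavy-ball-style velocity update and the toy data are assumptions of the example, not a prescribed algorithm):

```python
import numpy as np

def sgd_momentum(X, y, lr=0.05, beta=0.9, batch_size=2, epochs=200, seed=0):
    """Mini-batch SGD with momentum for unregularized linear regression."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)                       # velocity: exponential average of gradients
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # Average gradient over the mini-batch: a less noisy estimate than one example.
            grad = 2 * (X[batch] @ w - y[batch]) @ X[batch] / len(batch)
            v = beta * v + (1 - beta) * grad   # accumulate momentum
            w -= lr * v                        # step along the smoothed direction
    return w

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X[:, 0]                              # noiseless y = 3x
w = sgd_momentum(X, y)
```

Averaging over the mini-batch reduces gradient noise, while the velocity term keeps the update moving in a consistent direction even when individual batch gradients are small, which is what helps on plateaus.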
Convolutional Neural Networks
Overview of history:
- CNN
- MLP
- RNN/LSTM/GRU (Gated Recurrent Unit)
- Transformer