CSE510 Deep Reinforcement Learning (Lecture 7)
Large Scale RL
So far we have represented value functions by a lookup table
- Every state s has an entry V(s), or
- Every state-action pair (s, a) has an entry Q(s, a)
Reinforcement learning can be used to solve large problems, e.g.
- Backgammon: 10^20 states
- Computer Go: 10^170 states
- Helicopter, robot, …: enormous continuous state space
Tabular methods clearly cannot handle this. Why?
- There are too many states and/or actions to store in memory
- It is too slow to learn the value of each state individually
- A table cannot generalize across states!
Value Function Approximation (VFA)
Solution for large MDPs:
- Estimate the value function using a function approximator
Value function approximation (VFA) replaces the table with a general parameterized form:
V̂(s, w) ≈ V_π(s)
or
Q̂(s, a, w) ≈ Q_π(s, a)
where w is the parameter vector of the function approximator.
Benefit:
- Can generalize across states
- Save memory (only need to store the function approximator parameters)
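The idea can be sketched with a linear approximator. This is a minimal illustration, not the course's reference implementation: the feature map `phi` and the toy TD(0)-style update below are hypothetical choices made for the example.

```python
import numpy as np

def phi(state, n_features=4):
    """Hypothetical feature map: polynomial features of a scalar state."""
    return np.array([state ** k for k in range(n_features)], dtype=float)

def v_hat(state, w):
    """Approximate value: V(s, w) = w . phi(s) -- a few weights replace a whole table."""
    return w @ phi(state)

w = np.zeros(4)                       # only these parameters are stored, not one entry per state
alpha, gamma = 0.1, 0.99              # step size and discount factor
s, r, s_next = 0.5, 1.0, 0.6          # one toy transition

# TD(0)-style update: move V(s, w) toward the bootstrapped target r + gamma * V(s', w).
target = r + gamma * v_hat(s_next, w)
w += alpha * (target - v_hat(s, w)) * phi(s)   # gradient of the squared error w.r.t. w
```

Because nearby states share features, updating `w` for one state also changes the estimate for similar states, which is exactly the generalization a lookup table cannot provide.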
End-to-End RL
End-to-end RL methods replace the hand-designed state representation with raw observations.
- Good: We get rid of manual design of state representations
- Bad: we need far more data to train the network, since O_t is usually much higher-dimensional than a hand-designed S_t
Function Approximation
- Linear function approximation
- Neural network function approximation
- Decision tree function approximation
- Nearest neighbor
- …
In this course, we will focus on linear combinations of features and neural networks.
Today we will cover deep neural networks (fully connected and convolutional).
Artificial Neural Networks
Neuron
x_1, …, x_n are the inputs, w_1, …, w_n are the weights, and b is the bias.
The neuron computes the weighted sum z = Σ_i w_i x_i + b, then applies an activation function g (usually non-linear): a = g(z).
Activation functions
ReLU (rectified linear unit):
- Bounded below by 0.
- Non-vanishing gradient.
- No upper bound.
Sigmoid:
- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.
A common practice is to use ReLU in the hidden layers and sigmoid at the output layer.
For shallow fully connected networks, using sigmoid in more layers can also work, since vanishing gradients are less of a problem with few layers.
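The two activations above can be written directly from their definitions; the small checks mirror the listed properties:

```python
import numpy as np

def relu(z):
    """ReLU: bounded below by 0, no upper bound; gradient is 1 for z > 0 (non-vanishing)."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Sigmoid: always positive, bounded between 0 and 1, strictly increasing."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)        # negative inputs clipped to 0, positive inputs passed through
sig_mid = sigmoid(0.0)    # 0.5: the midpoint of the (0, 1) range
```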
We can use parallel computing techniques to speed up the computation.
Universal Approximation Theorem
Any continuous function (on a compact domain) can be approximated to arbitrary accuracy by a neural network with a single hidden layer.
(The catch: that single hidden layer may need to be extremely wide.)
Why use deep neural networks?
Motivation from Biology
- Visual Cortex
Motivation from circuit theory
- Compact representation
Modularity
- Uses data more efficiently
In Practice: works better for many domains
- Hard to argue with results.
Training Neural Networks
- Loss function
- Model
- Optimization
Empirical loss minimization framework:
min_θ (1/N) Σ_{i=1}^N l(f(x_i; θ), y_i) + λ Ω(θ)
where l is the loss function, f is the model, θ are the parameters, Ω(θ) is the regularization term, and λ is the regularization parameter.
Learning is cast as optimization.
- For classification problems, we would like to minimize classification error, e.g., logistic or cross entropy loss.
- For regression problems, we would like to minimize regression error, e.g. L1 or L2 distance from groundtruth.
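The two loss families can be made concrete with one-line definitions (per-example versions, for illustration):

```python
import numpy as np

def cross_entropy(p_true_class):
    """Classification loss: -log of the predicted probability assigned to the true class.
    It is 0 when the model is confidently correct and grows without bound as p -> 0."""
    return -np.log(p_true_class)

def l2_loss(y_pred, y_true):
    """Regression loss: squared (L2) distance from the ground truth."""
    return (y_pred - y_true) ** 2

ce_perfect = cross_entropy(1.0)   # confidently correct prediction
ce_poor = cross_entropy(0.1)      # low probability on the true class -> large loss
reg = l2_loss(2.0, 2.5)
```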
Stochastic Gradient Descent
Perform updates after seeing each example:
- Initialize: θ ← θ_0
- For t = 1, 2, …:
  - For each training example (x_i, y_i):
    - Compute gradient: g ← ∇_θ [l(f(x_i; θ), y_i) + λ Ω(θ)]
    - Update: θ ← θ − η g, where η is the learning rate
Training a neural network, we need:
- Loss function
- Procedure to compute the gradient
- Regularization term
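Putting the three ingredients together, here is a minimal SGD sketch for L2-regularized linear regression (the model, loss, and toy data are chosen only for illustration):

```python
import numpy as np

def sgd(X, y, lr=0.01, lam=1e-3, epochs=100, seed=0):
    """SGD on per-example loss (w . x_i - y_i)^2 + lam * ||w||^2."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                       # initialize parameters
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # visit examples in random order
            grad = 2 * (X[i] @ w - y[i]) * X[i] + 2 * lam * w   # loss + regularizer gradient
            w -= lr * grad                         # update after seeing each example
    return w

# Toy data generated from y = 3x; SGD should recover a weight near 3.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
w = sgd(X, y)
```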
Mini-batch and Momentum
Make updates based on a mini-batch of examples (instead of a single example)
- the gradient is computed on the average regularized loss for that mini-batch
- can give a more accurate estimate of the gradient
Momentum keeps an exponential average of previous gradients.
It can get past plateaus more quickly by "gaining momentum."
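Both ideas can be grafted onto the earlier SGD sketch. This is an illustrative variant (the heavy-ball-style velocity update and the toy data are assumptions of the example, not a prescribed algorithm):

```python
import numpy as np

def sgd_momentum(X, y, lr=0.05, beta=0.9, batch_size=2, epochs=200, seed=0):
    """Mini-batch SGD with momentum for unregularized linear regression."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)                       # velocity: exponential average of gradients
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # Average gradient over the mini-batch: a less noisy estimate than one example.
            grad = 2 * (X[batch] @ w - y[batch]) @ X[batch] / len(batch)
            v = beta * v + (1 - beta) * grad   # accumulate momentum
            w -= lr * v                        # step along the smoothed direction
    return w

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X[:, 0]                              # noiseless y = 3x
w = sgd_momentum(X, y)
```

Averaging over the mini-batch reduces gradient noise, while the velocity term keeps the update moving in a consistent direction even when individual batch gradients are small, which is what helps on plateaus.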
Convolutional Neural Networks
Overview of history:
- CNN
- MLP
- RNN/LSTM/GRU (Gated Recurrent Unit)
- Transformer