CSE510 Deep Reinforcement Learning (Lecture 6)

Active reinforcement learning

Exploration vs. Exploitation

Exploitation: To try to get reward. We exploit our current knowledge to get a payoff.
Exploration: Get more information about the world. How do we know if there is not a pot of gold around the corner?
To explore we typically need to take actions that do not seem best according to our current model
Managing the trade-off between exploration and exploitation is a critical issue in RL
Basic intuition behind most approaches
- Explore more when knowledge is weak
- Exploit more as we gain knowledge

ADP-based RL

Model based

Start with an initial (uninformed) model
solve for optimal policy given the current model (using value or policy iteration)
Take action according to an exploration/exploitation policy
Update estimated model based on observed transition
Goto 2

Exploration/Exploitation policy

Greedy action is the action maximizing estimated $Q$ value

Q(s,a) = R(s) + \gamma \max_{s'\in S} P(s,a,s')V(s')

where $V$ is current optimal value function estimate (based on current model), and $R, T$ are current estimates of model
$Q(s,a)$ is the expected value of taking action $a$ in state $s$ and then getting estimated value $V(s')$ for the next state $s'$

Want an exploration policy that is greedy in the limit of infinite exploration (GLIE)

Try each action in each state and unbounded number of times
Guarantees convergence

GLIE: Greedy in the limit of infinite exploration

Greedy Policy 1

On time step $t$ select random action with probability $p(t)$ and greedy action with probability $1-p(t)$

$p(t) = \frac{1}{t}$ will lead to convergence, but is slow.

Tip

In practice, it’s common to simply set $p(t) = \epsilon$ for all $t$ .

Greedy Policy 2

Boltzmann exploration

Selection action with probability,

Pr(a\mid s)=\frac{\exp(Q(s,a)/T)}{\sum_{a'\in A}\exp(Q(s,a')/T)}

$T$ is the temperature. Large $T$ means that each action has about the same probability. Small $T$ leads to more greedy behavior.

Typically start with large $T$ and decrease with time.

Example: impact of temperature

Suppose we have two actions and that $Q(s,a_1) = 1$ and $Q(s,a_2) = 0$ .

When $T=10$ , we have

Pr(a_1\mid s)=\frac{\exp(1/10)}{\exp(1/10)+\exp(2/10)}=0.48

Pr(a_2\mid s)=\frac{\exp(2/10)}{\exp(1/10)+\exp(2/10)}=0.52

When $T=1$ , we have

Pr(a_1\mid s)=\frac{\exp(1/1)}{\exp(1/1)+\exp(2/1)}=0.27

Pr(a_2\mid s)=\frac{\exp(2/1)}{\exp(1/1)+\exp(2/1)}=0.73

When $T=0.1$ , we have

Pr(a_1\mid s)=\frac{\exp(1/0.1)}{\exp(1/0.1)+\exp(2/0.1)}=0.02

Pr(a_2\mid s)=\frac{\exp(2/0.1)}{\exp(1/0.1)+\exp(2/0.1)}=0.98

(Alternative Model-Based RL) Optimistic Exploration: Rmax [Brafman & Tennenholtz, 2002]

Start with an optimistic model
- (assign largest possible reward to “unexplored states”)
- (actions from “unexplored states” only self transition)
Solve for optimal policy in optimistic model (standard VI)
Take greedy action according to the computed policy
Update optimistic estimated model
- (if a state becomes “known” then use its true statistics)
Goto 2

Agent always acts greedily according to a model that assumes all “unexplored” states are maximally rewarding

Implementation for optimistic model

Keep track of number of times a state-action pair is tried
If $N(s, a) < N_e$ then $T(s,a,s)=1$ and $R(s) = Rmax$ in optimistic model,
Otherwise, $T(s,a,s’)$ and $R(s)$ are based on estimates obtained from the $N_e$ experiences (the estimate of true model)
$N_e$ can be determined by using Chernoff Bound
An optimal policy for this optimistic model will try to reach unexplored states (those with unexplored actions) since it can stay at those states and accumulate maximum reward
Never explicitly explores. Is always greedy, but with respect to an optimistic outlook.


Algorithm: (for Infinite horizon RL problems)
Initialize $\hat{p}, \hat{r}$, and $N(s,a)$ For $t = 1, 2, ...$
1. Build an optimistic reward model $(Q(s,a))_{s,a}$ from $\hat{p}, \hat{r}$, and $N(s,a)$
2. Select action $a(t)$ maximizing $Q(s(t),a)$ over $A_{s(t)}$
3. Observe the transition to $s(t+1)$ and collect reward $r(s(t),a(t))$ according to $\hat{p}$
4. Update $\hat{p}, \hat{r}$, and $N(s,a)$

Efficiency of Rmax

If the model is very completely learned (i.e. $N(s, a) = N_e$ for all $s, a$ ), then Rmax will be near optimal.

Results how that this will happen “quickly” in terms of number of steps.

General proof strategy: PAC Guarantee (Roughly speaking): There is a value $N_e$ , such that with high probability the Rmax algorithm will select at most a polynomial number of actions with value less than $\epsilon$ of optimal.

RL can be solved in poly-time in number of actions, number of states, and discount factor.

TD-based Active RL

Start with initial value function
Take action from an exploration/exploitation policygiving new state $s'$ (should converge to optimal policy)
Update estimated model (To compute the exploration/exploitation policy.)
Perform TD update $V(s) \gets V(s) + \alpha (R(s) + \gamma V(s') - V(s))$ $V(s)$ is new estimate of optimal value function at state $s$ .
Goto 2

Given the usual assumptions about learning rate and GLIE, TD will converge to an optimal value function!

Exploration/Exploitation policy requires computing $argmax Q(s, a)$ $a r g ma x Q (s, a)$ for the exploitation part of the policy
- Computing $argmax Q(s, a)$ requires $T$ in addition to $V$
Thus TD-learning must still maintain an estimated model for action selection
It is computationally more efficient at each step compared to Rmax (i.e., optimistic exploration)
- TD-update vs. Value Iteration
- But model requires much more memory than value function
Can we get a model-fee variant?

Q-learning

Instead of learning the optimal value function $V$ , directly learn the optimal $Q$ function.

Recall $Q(s, a)$ is the expected value of taking action $a$ in state $s$ and then following the optimal policy thereafter

Given the $Q$ function we can act optimally by selecting action greedily according to $Q(s, a)$ without a model

The optimal $Q$ -function satisfies $V(s) = \max_{a'\in A} Q(s, a')$ which gives:

\begin{aligned} Q(s,a) &= R(s) + \gamma \sum_{s'\in S} T(s,a,s') V(s')\\ &= R(s) + \gamma \sum_{s'\in S} T(s,a,s') \max_{a'\in A} Q(s',a')\\ \end{aligned}

How can we learn the $Q$ -function directly?

Q-learning implementation

Model-free reinforcement learning

Start with initial Q-values (e.g. all zeros)
Take action from an exploration/exploitation policy giving new state $s'$ (should converge to optimal policy)
Perform TD update $Q(s,a) \gets Q(s,a) + \alpha (R(s) + \gamma \max_{a'\in A} Q(s',a') - Q(s,a))$ $Q(s,a)$ is current estimate of optimal Q-value for state $s$ and action $a$ .
Goto 2

Does not require model since we learn the Q-value function directly
Use explicit $|S|\times |A|$ table to store Q-values
Off-policy learning: the update does not depend on the actual next action
The exploration/exploitation policy directly uses $Q$ -values

Convergence of Q-learning

Q-learning converges to the optimal Q-value in the limit with probability 1 if:

Every state-action pair is visited infinitely often
Learning rate decays just so: $\sum_{t=1}^{\infty} \alpha(t) = \infty$ and $\sum_{t=1}^{\infty} \alpha(t)^2 < \infty$

Speedup for Goal-Based Problems

Goal-Based Problem: receive big reward in goal state and then transition to terminal state
Initializing $Q(s, a)$ $Q (s, a)$ for all $s \in S$ $s \in S$ and $a \in A$ $a \in A$ to zeros and then observing the following sequence of (state, reward, action) triples
- $(s0, 0, a0) (s1, 0, a1) (s2, 10, a2) (terminal,0)$
The sequence of Q-value updates would result in: $Q(s0, a0) = 0$ , $Q(s1, a1) =0$ , $Q(s2, a2)=10$
So nothing was learned at $s0$ $s 0$ and $s1$ $s 1$
- Next time this trajectory is observed we will get non-zero for $Q(s1, a1)$ but still $Q(s0, a0)=0$

From the example we see that it can take many learning trials for the final reward to “back propagate” to early state-action pairs

Two approaches for addressing this problem:
1. Trajectory replay: store each trajectory and do several iterations of Q-updates on each one
2. Reverse updates: store trajectory and do Q-updates in reverse order
In our example (with learning rate and discount factor equal to 1 for ease of illustration) reverse updates would give
- $Q(s2,a2) = 10$ , $Q(s1,a1) = 10$ , $Q(s0,a0)=10$

Off-policy vs on-policy RL

SARSA

Start with initial Q-values (e.g. all zeros)
Take action $a_n$ on state $s_n$ from an $\epsilon$ -greedy policy giving new state $s_{n+1}$
Take action $a_{n+1}$ on state $s_{n+1}$ from an $\epsilon$ -greedy
Perform TD update $Q(s_n,a_n) \gets Q(s_n,a_n) + \alpha (R(s_n) + \gamma Q(s_{n+1},a_{n+1}) - Q(s_n,a_n))$
Goto 2

Note

Compared with Q-learning, SARSA (on-policy) usually takes more “safer” actions.