CSE510 Deep Reinforcement Learning (Lecture 6)

Active reinforcement learning

Exploration vs. Exploitation

  • Exploitation: To try to get reward. We exploit our current knowledge to get a payoff.

  • Exploration: Get more information about the world. How do we know if there is not a pot of gold around the corner?

  • To explore we typically need to take actions that do not seem best according to our current model

  • Managing the trade-off between exploration and exploitation is a critical issue in RL

  • Basic intuition behind most approaches

    • Explore more when knowledge is weak
    • Exploit more as we gain knowledge

ADP-based RL

Model based

  1. Start with an initial (uninformed) model
  2. Solve for the optimal policy given the current model (using value or policy iteration)
  3. Take action according to an exploration/exploitation policy
  4. Update estimated model based on observed transition
  5. Goto 2
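A minimal Python sketch of this loop (tabular case), assuming a gym-like `env` with `reset()` / `step(a)` and using maximum-likelihood estimates for the model; all names here are illustrative, not from the lecture:

```python
import numpy as np

def value_iteration(T, R, gamma, iters=100):
    """Standard value iteration on an estimated model T[s, a, s'], R[s]."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q = R[:, None] + gamma * (T @ V)      # Q[s,a] = R(s) + gamma * sum_s' T(s,a,s') V(s')
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)                # value estimate and greedy policy

def adp_rl(env, n_states, n_actions, gamma=0.95, steps=10_000, epsilon=0.1):
    """Model-based (ADP) RL: estimate the model from experience and replan each step."""
    counts = np.zeros((n_states, n_actions, n_states))      # transition counts N(s, a, s')
    r_sum, r_cnt = np.zeros(n_states), np.zeros(n_states)   # reward statistics per state
    s = env.reset()
    for _ in range(steps):
        # 2. Solve for the optimal policy of the current estimated model.
        T_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
        R_hat = r_sum / np.maximum(r_cnt, 1)
        V, greedy = value_iteration(T_hat, R_hat, gamma)
        # 3. Act from an exploration/exploitation policy (epsilon-greedy here).
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else greedy[s]
        s_next, r, done = env.step(a)                        # assumed gym-like interface
        # 4. Update the estimated model with the observed transition.
        counts[s, a, s_next] += 1
        r_sum[s] += r
        r_cnt[s] += 1
        s = env.reset() if done else s_next                  # 5. goto 2
    return V
```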

Exploration/Exploitation policy

The greedy action is the action maximizing the estimated $Q$-value:

$Q(s,a) = R(s) + \gamma \sum_{s'\in S} T(s,a,s')V(s')$
  • where $V$ is the current optimal value function estimate (based on the current model), and $R, T$ are the current estimates of the model
  • $Q(s,a)$ is the expected value of taking action $a$ in state $s$ and then getting estimated value $V(s')$ for the next state $s'$

Want an exploration policy that is greedy in the limit of infinite exploration (GLIE)

  • Try each action in each state an unbounded number of times
  • Guarantees convergence

GLIE: Greedy in the limit of infinite exploration

Greedy Policy 1

On time step $t$, select a random action with probability $p(t)$ and the greedy action with probability $1-p(t)$.

$p(t) = \frac{1}{t}$ will lead to convergence, but is slow.

Tip

In practice, it’s common to simply set $p(t) = \epsilon$ for all $t$.
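A small sketch of such a policy over a tabular Q array; the `decay` flag switches between the GLIE schedule $p(t) = 1/t$ and the constant-$\epsilon$ shortcut from the tip (the function name and arguments are illustrative):

```python
import numpy as np

def glie_epsilon_greedy(Q, s, t, decay=True, epsilon=0.1):
    """Random action with probability p(t), greedy action otherwise."""
    p = 1.0 / max(t, 1) if decay else epsilon   # p(t) = 1/t satisfies GLIE; constant epsilon is the shortcut
    if np.random.rand() < p:
        return np.random.randint(Q.shape[1])    # explore: uniform random action
    return int(np.argmax(Q[s]))                 # exploit: greedy w.r.t. current estimates
```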

Greedy Policy 2

Boltzmann exploration

Select action $a$ with probability:

$\Pr(a\mid s)=\frac{\exp(Q(s,a)/T)}{\sum_{a'\in A}\exp(Q(s,a')/T)}$

$T$ is the temperature. Large $T$ means that each action has about the same probability. Small $T$ leads to more greedy behavior.

Typically start with a large $T$ and decrease it over time.
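A small sketch of Boltzmann (softmax) action selection over one state's Q-values; subtracting the maximum is a standard numerical-stability trick, not part of the formula above:

```python
import numpy as np

def boltzmann_action(q_values, temperature):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                                # shifting does not change the probabilities
    probs = np.exp(z) / np.exp(z).sum()
    return np.random.choice(len(probs), p=probs), probs
```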

Example: impact of temperature

Suppose we have two actions, with $Q(s,a_1) = 1$ and $Q(s,a_2) = 2$.

When $T=10$, we have

$\Pr(a_1\mid s)=\frac{\exp(1/10)}{\exp(1/10)+\exp(2/10)}\approx 0.48 \qquad \Pr(a_2\mid s)=\frac{\exp(2/10)}{\exp(1/10)+\exp(2/10)}\approx 0.52$

When $T=1$, we have

$\Pr(a_1\mid s)=\frac{\exp(1/1)}{\exp(1/1)+\exp(2/1)}\approx 0.27 \qquad \Pr(a_2\mid s)=\frac{\exp(2/1)}{\exp(1/1)+\exp(2/1)}\approx 0.73$

When $T=0.1$, we have

$\Pr(a_1\mid s)=\frac{\exp(1/0.1)}{\exp(1/0.1)+\exp(2/0.1)}\approx 0.00005 \qquad \Pr(a_2\mid s)=\frac{\exp(2/0.1)}{\exp(1/0.1)+\exp(2/0.1)}\approx 0.99995$
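The numbers above are easy to reproduce; here is a quick check (a throwaway script, not part of the lecture) using the Q-values 1 and 2 that appear in the formulas:

```python
import numpy as np

q = np.array([1.0, 2.0])                      # Q(s, a1) = 1, Q(s, a2) = 2
for T in (10.0, 1.0, 0.1):
    p = np.exp(q / T) / np.exp(q / T).sum()
    print(T, p.round(5))
# T=10  -> ~[0.475, 0.525]
# T=1   -> ~[0.269, 0.731]
# T=0.1 -> ~[0.00005, 0.99995]
```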

(Alternative Model-Based RL) Optimistic Exploration: Rmax [Brafman & Tennenholtz, 2002]

  1. Start with an optimistic model
    • (assign largest possible reward to “unexplored states”)
    • (actions from “unexplored states” only self transition)
  2. Solve for optimal policy in optimistic model (standard VI)
  3. Take greedy action according to the computed policy
  4. Update optimistic estimated model
    • (if a state becomes “known” then use its true statistics)
  5. Goto 2

Agent always acts greedily according to a model that assumes all “unexplored” states are maximally rewarding

Implementation for optimistic model

  • Keep track of number of times a state-action pair is tried
  • If $N(s, a) < N_e$ then $T(s,a,s)=1$ and $R(s) = R_{\max}$ in the optimistic model,
  • Otherwise, $T(s,a,s')$ and $R(s)$ are based on estimates obtained from the $N_e$ experiences (the estimate of the true model)
  • $N_e$ can be determined using the Chernoff bound
  • An optimal policy for this optimistic model will try to reach unexplored states (those with unexplored actions) since it can stay at those states and accumulate maximum reward
  • Never explicitly explores. Is always greedy, but with respect to an optimistic outlook.
Algorithm (for infinite-horizon RL problems):

Initialize $\hat{p}$, $\hat{r}$, and $N(s,a)$. For $t = 1, 2, \ldots$

  1. Build an optimistic model $(Q(s,a))_{s,a}$ from $\hat{p}$, $\hat{r}$, and $N(s,a)$
  2. Select action $a(t)$ maximizing $Q(s(t),a)$ over $A_{s(t)}$
  3. Observe the transition to $s(t+1)$ and collect the reward $r(s(t),a(t))$
  4. Update $\hat{p}$, $\hat{r}$, and $N(s,a)$
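A minimal sketch of the optimistic-model construction described above, with rewards tracked per state-action pair for convenience (the slides write $R(s)$); running standard value iteration on the returned model and acting greedily completes steps 2-3:

```python
import numpy as np

def rmax_optimistic_model(counts, reward_sum, n_e, r_max):
    """Build the Rmax optimistic model from visit counts.

    counts[s, a, s'] holds observed transition counts N(s, a, s');
    reward_sum[s, a] holds summed observed rewards.
    Pairs with fewer than n_e visits are "unknown": they self-transition and earn r_max.
    """
    n_states, n_actions, _ = counts.shape
    T = np.zeros_like(counts, dtype=float)
    R = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            n_sa = counts[s, a].sum()
            if n_sa < n_e:                      # unknown: optimistic self-loop with maximum reward
                T[s, a, s] = 1.0
                R[s, a] = r_max
            else:                               # known: use empirical estimates
                T[s, a] = counts[s, a] / n_sa
                R[s, a] = reward_sum[s, a] / n_sa
    return T, R
```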

Efficiency of Rmax

If the model is completely learned (i.e., $N(s, a) \ge N_e$ for all $s, a$), then Rmax will be near optimal.

Results show that this will happen “quickly” in terms of the number of steps.

General proof strategy: PAC guarantee (roughly speaking): there is a value $N_e$ such that, with high probability, the Rmax algorithm will select at most a polynomial number of actions whose value is more than $\epsilon$ below optimal.

RL can be solved in poly-time in the number of actions, the number of states, and the horizon determined by the discount factor.

TD-based Active RL

  1. Start with initial value function
  2. Take an action from an exploration/exploitation policy, giving new state $s'$ (should converge to the optimal policy)
  3. Update the estimated model (needed to compute the exploration/exploitation policy)
  4. Perform TD update $V(s) \gets V(s) + \alpha (R(s) + \gamma V(s') - V(s))$, where $V(s)$ is the new estimate of the optimal value function at state $s$
  5. Goto 2
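The TD update in step 4 as a one-line sketch over a tabular value array (names are illustrative):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """TD(0): move V(s) toward the bootstrapped target R(s) + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```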

Given the usual assumptions about learning rate and GLIE, TD will converge to an optimal value function!

  • The exploration/exploitation policy requires computing $\arg\max_a Q(s, a)$ for the exploitation part of the policy
    • Computing $\arg\max_a Q(s, a)$ requires $T$ in addition to $V$
  • Thus TD-learning must still maintain an estimated model for action selection
  • It is computationally more efficient at each step compared to Rmax (i.e., optimistic exploration)
    • TD-update vs. Value Iteration
    • But model requires much more memory than value function
  • Can we get a model-free variant?

Q-learning

Instead of learning the optimal value function $V$, directly learn the optimal $Q$ function.

Recall that $Q(s, a)$ is the expected value of taking action $a$ in state $s$ and then following the optimal policy thereafter.

Given the $Q$ function we can act optimally by selecting actions greedily according to $Q(s, a)$, without a model.

The optimal $Q$-function satisfies $V(s) = \max_{a'\in A} Q(s, a')$, which gives:

$$\begin{aligned} Q(s,a) &= R(s) + \gamma \sum_{s'\in S} T(s,a,s') V(s')\\ &= R(s) + \gamma \sum_{s'\in S} T(s,a,s') \max_{a'\in A} Q(s',a') \end{aligned}$$

How can we learn the $Q$-function directly?

Q-learning implementation

Model-free reinforcement learning

  1. Start with initial Q-values (e.g. all zeros)
  2. Take an action from an exploration/exploitation policy giving new state $s'$ (should converge to the optimal policy)
  3. Perform TD update $Q(s,a) \gets Q(s,a) + \alpha (R(s) + \gamma \max_{a'\in A} Q(s',a') - Q(s,a))$, where $Q(s,a)$ is the current estimate of the optimal Q-value for state $s$ and action $a$
  4. Goto 2
  • Does not require model since we learn the Q-value function directly
  • Use an explicit $|S|\times |A|$ table to store Q-values
  • Off-policy learning: the update does not depend on the actual next action
  • The exploration/exploitation policy directly uses $Q$-values
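A minimal tabular Q-learning sketch under these assumptions (explicit $|S| \times |A|$ table, $\epsilon$-greedy behavior policy, and an assumed gym-like `env` interface):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))                   # 1. initialize Q-values (all zeros)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # 2. act from an exploration/exploitation policy (epsilon-greedy over Q)
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)                 # assumed gym-like interface
            # 3. off-policy TD update toward the greedy bootstrapped target
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```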

Convergence of Q-learning

Q-learning converges to the optimal Q-value in the limit with probability 1 if:

  • Every state-action pair is visited infinitely often
  • The learning rate decays appropriately: $\sum_{t=1}^{\infty} \alpha(t) = \infty$ and $\sum_{t=1}^{\infty} \alpha(t)^2 < \infty$
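For example, $\alpha(t) = \frac{1}{t}$ satisfies both conditions, since $\sum_{t=1}^{\infty} \frac{1}{t} = \infty$ while $\sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty$; a constant learning rate satisfies the first condition but not the second.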

Speedup for Goal-Based Problems

  • Goal-Based Problem: receive big reward in goal state and then transition to terminal state
  • Initializing $Q(s, a)$ to zero for all $s \in S$ and $a \in A$ and then observing the following sequence of (state, reward, action) triples:
    • $(s_0, 0, a_0),\ (s_1, 0, a_1),\ (s_2, 10, a_2),\ (\text{terminal}, 0)$
  • The sequence of Q-value updates would result in: $Q(s_0, a_0) = 0$, $Q(s_1, a_1) = 0$, $Q(s_2, a_2) = 10$
  • So nothing was learned at $s_0$ and $s_1$
    • The next time this trajectory is observed we will get a non-zero value for $Q(s_1, a_1)$, but still $Q(s_0, a_0) = 0$

From the example we see that it can take many learning trials for the final reward to “back propagate” to early state-action pairs

  • Two approaches for addressing this problem:
    1. Trajectory replay: store each trajectory and do several iterations of Q-updates on each one
    2. Reverse updates: store trajectory and do Q-updates in reverse order
  • In our example (with learning rate and discount factor equal to 1 for ease of illustration) reverse updates would give
    • $Q(s_2,a_2) = 10$, $Q(s_1,a_1) = 10$, $Q(s_0,a_0) = 10$
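A sketch of the reverse-update idea on a stored trajectory, assuming the same tabular Q table as above (the tuple format is illustrative):

```python
import numpy as np

def reverse_q_updates(Q, trajectory, alpha=1.0, gamma=1.0):
    """Replay a stored trajectory backwards so the final reward propagates in one pass.

    trajectory: list of (s, a, r, s_next, done) tuples in the order they occurred.
    With alpha = gamma = 1 this reproduces the example above:
    Q[s2, a2] = Q[s1, a1] = Q[s0, a0] = 10 after a single backward pass.
    """
    for s, a, r, s_next, done in reversed(trajectory):
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
    return Q
```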

Off-policy vs on-policy RL

SARSA

  1. Start with initial Q-values (e.g. all zeros)
  2. Take action $a_n$ in state $s_n$ from an $\epsilon$-greedy policy, giving new state $s_{n+1}$
  3. Take action $a_{n+1}$ in state $s_{n+1}$ from an $\epsilon$-greedy policy
  4. Perform TD update $Q(s_n,a_n) \gets Q(s_n,a_n) + \alpha (R(s_n) + \gamma Q(s_{n+1},a_{n+1}) - Q(s_n,a_n))$
  5. Goto 2
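A minimal tabular SARSA sketch under the same assumptions as the Q-learning sketch; the only difference is that the bootstrap uses the action actually selected by the behavior policy:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """Behavior policy, used both for acting and for the bootstrap target."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q, s, epsilon)                  # action from the behavior policy
        while not done:
            s_next, r, done = env.step(a)                  # assumed gym-like interface
            a_next = epsilon_greedy(Q, s_next, epsilon)    # next action from the *same* policy (on-policy)
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```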
Note

Compared with Q-learning, SARSA (on-policy) usually takes safer actions.
