Lecture Reinforcement Learning
Philipp Koehn
16 April 2024
● States s ∈ S, actions a ∈ A
● Q-learning
– learns action-utility function (Q(s, a) function)
– does not need to model outcomes of actions
– function provides the expected utility of taking a given action in a given state (see the sketch after this list)
● Reflex agent
– learns policy that maps states to actions
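● Below, a minimal sketch of a tabular Q(s, a) function with its derived reflex policy and one Q-learning update; the action set, learning rate, and discount value are illustrative assumptions, not from the lecture

from collections import defaultdict

# Tabular action-utility function Q(s, a); unseen pairs default to 0.0
Q = defaultdict(float)

ACTIONS = ["up", "down", "left", "right"]   # assumed action set
ALPHA = 0.1    # learning rate (assumed value)
GAMMA = 0.9    # reward decay (assumed value)

def q_update(s, a, reward, s_next):
    """One Q-learning update: only the observed transition (s, a, reward, s_next)
    is needed; the outcomes of actions are never modeled explicitly."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])

def reflex_policy(s):
    """Reflex agent: map a state directly to the action with the highest Q-value."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])

● Calling q_update on every observed transition and acting with reflex_policy is the whole agent; no transition model P(s′ ∣ s, a) is ever estimated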
Unknown information
● Given a policy
(Figure: grid world under the given policy; each non-terminal state has reward −0.04, the goal state reward +1)
● Sample of reward to go
U^π(s) = E[ ∑_{t=0}^{∞} γ^t R(S_t) ]
● Start at state S0 = s
● Reward to go:
– (1,1) one sample: 0.72
– (1,2) two samples: 0.76, 0.84
– (1,3) two samples: 0.80, 0.88
● Reward to go will converge to utility of state
(γ = reward decay)
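● A sketch of this direct utility estimation, assuming each trial arrives as a list of (state, reward) pairs; the trial format is an assumption, and the sample values above are consistent with γ = 1

from collections import defaultdict

GAMMA = 1.0   # reward decay; the sampled values above correspond to gamma = 1

totals = defaultdict(float)   # sum of observed rewards to go per state
counts = defaultdict(int)     # number of samples per state

def record_trial(trial):
    """trial: list of (state, reward) pairs from one run under the fixed policy.
    Every visit of a state yields one sample of its reward to go."""
    for i, (state, _) in enumerate(trial):
        reward_to_go = sum(GAMMA**t * r for t, (_, r) in enumerate(trial[i:]))
        totals[state] += reward_to_go
        counts[state] += 1

def utility_estimate(state):
    """Average of the sampled rewards to go; converges to U^pi(state)."""
    return totals[state] / counts[state]

● After the trial above, utility_estimate((1,2)) would be the average of its two samples 0.76 and 0.84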
Need to learn:
P(s′ ∣ s, π(s)) = count(s, s′) / ∑_{s′′} count(s, s′′)
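● A sketch of estimating this transition model from observed transitions under the fixed policy; the data structures and function names are assumptions

from collections import defaultdict

# count[s][s2] = number of observed transitions s -> s2 under the fixed policy
count = defaultdict(lambda: defaultdict(int))

def observe_transition(s, s_next):
    count[s][s_next] += 1

def transition_prob(s, s_next):
    """Maximum-likelihood estimate P(s' | s, pi(s)) = count(s, s') / sum_s'' count(s, s'')."""
    total = sum(count[s].values())
    return count[s][s_next] / total if total > 0 else 0.0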
● Idea: do not model P(s′ ∣ s, π(s)); directly adjust utilities U(s) for all visited states
● Temporal difference update: U(s) ← U(s) + α (R(s) + γ U(s′) − U(s))
(α = learning rate)
● Learning rate may decrease when a state has been visited often
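● A sketch of this temporal-difference update, with a learning rate that decays with the visit count of a state; the decay schedule and discount value are assumptions

from collections import defaultdict

GAMMA = 0.9   # reward decay (assumed value)
U = defaultdict(float)     # utility estimates, adjusted directly
visits = defaultdict(int)  # visit counts, used to decay the learning rate

def td_update(s, reward, s_next):
    """Adjust U(s) toward reward + gamma * U(s'), without modeling P(s' | s, pi(s))."""
    visits[s] += 1
    alpha = 1.0 / visits[s]   # learning rate decreases for frequently visited states
    U[s] += alpha * (reward + GAMMA * U[s_next] - U[s])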
3. Optimize policy
4. Go to Step 2
● Spend coin on
– presumed optimal payoff
– exploration (new lever)
● If independent
– Gittins index: formula for solution
– uses payoff / number of times used
● Simple strategy
– with probability 1/t take a random action (see the sketch after this list)
– initially (t small) focus on exploration
– later (t big) focus on optimal policy
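● A sketch of this simple strategy, assuming a time step t ≥ 1 and tabular Q estimates as in the earlier sketch; the function name and arguments are illustrative

import random

def choose_action(s, t, actions, Q):
    """With probability 1/t take a random action, otherwise the action that
    currently looks optimal; exploration dominates while t is small,
    the learned policy dominates as t grows."""
    if random.random() < 1.0 / t:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])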
f(u, n) = { R+  if n < Nc
          { u   otherwise
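● A sketch of this exploration function, on the assumption (standard in AIMA-style presentations) that u is the current value estimate, n counts how often the state-action pair has been tried, R+ is an optimistic reward bound, and Nc is a visit-count threshold; the numeric values are assumptions

R_PLUS = 1.0   # optimistic estimate R+ of the best possible reward (assumed value)
N_C = 5        # Nc: try each state-action pair at least this many times (assumed value)

def exploration_value(u, n):
    """f(u, n): treat rarely tried actions as optimistically good."""
    return R_PLUS if n < N_C else u

● Used in place of the plain value estimate when choosing actions, this makes the agent try every action in every state at least Nc times before trusting its learned value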
● For our example, Q-learning is slower, but there are successful applications (TD-Gammon)
Generalization in reinforcement learning
● Our example
– ∆θ1 = −µ (Ûθ(f1, f2) − u(f1, f2)) f1 = −µ (0.8 − 0.4) ⋅ 1 = −0.4µ
– ∆θ2 = −µ (Ûθ(f1, f2) − u(f1, f2)) f2 = −µ (0.8 − 0.4) ⋅ 1 = −0.4µ
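● A sketch of this gradient update for a linear utility approximation over two features; the learning rate and example values mirror the (0.8 − 0.4) example above, and all names are assumptions

theta = [0.0, 0.0, 0.0]   # parameters: bias, weight for f1, weight for f2
MU = 0.1                  # learning rate mu (assumed value)

def u_hat(f1, f2):
    """Linear approximation U_theta(f1, f2) = theta0 + theta1 * f1 + theta2 * f2."""
    return theta[0] + theta[1] * f1 + theta[2] * f2

def gradient_update(f1, f2, u_observed):
    """One gradient-descent step on the squared error (U_theta - u)^2:
    delta_theta_i = -mu * (U_theta - u) * f_i, with f_0 = 1 for the bias."""
    error = u_hat(f1, f2) - u_observed
    for i, f in enumerate([1.0, f1, f2]):
        theta[i] -= MU * error * f

● With Ûθ = 0.8, u = 0.4, and f1 = f2 = 1 as above, each of θ1 and θ2 changes by −0.4µ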
● Our toy example: utility related to Manhattan distance from goal (xgoal, ygoal)
● Gradient descent training can also be used for temporal difference learning
Policy search
● If ρ(θ) is differentiable
⇒ gradient descent by following the policy gradient
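● A sketch of gradient-based policy search under the assumption that ρ(θ) can only be evaluated (e.g. estimated from trials), so the policy gradient is approximated by finite differences; the toy ρ and all parameter values are illustrative

def policy_value(theta):
    """rho(theta): value of the policy with parameters theta.
    Toy stand-in here; in practice it is estimated by running the policy."""
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2

def policy_search(theta, step=0.05, delta=1e-4, iterations=200):
    """Follow the policy gradient, estimating d rho / d theta_i by finite differences."""
    for _ in range(iterations):
        base = policy_value(theta)
        grad = []
        for i in range(len(theta)):
            perturbed = list(theta)
            perturbed[i] += delta
            grad.append((policy_value(perturbed) - base) / delta)
        theta = [t + step * g for t, g in zip(theta, grad)]  # move toward higher rho
    return theta

print(policy_search([0.0, 0.0]))   # approaches [1, -2], the maximizer of the toy rho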
Examples
● Policy search