Artificial Intelligence
Lecture 10 – Reinforcement Learning
Prof. Shivanjali Khare
skhare@newhaven.edu
Reinforcement Learning
Double Bandits
Double-Bandit MDP
• Actions: Blue, Red
• States: Win (W), Lose (L)
• No discount; 10 time steps
• Both states have the same value
[Diagram: from either state, Red pays $2 with probability 0.75 and $0 with probability 0.25; Blue pays $1 with probability 1.0]
Offline Planning
• Solving MDPs is offline planning
• You determine all quantities through computation
• You need to know the details of the MDP
• You do not actually play the game!
• No discount; 10 time steps; both states have the same value
[Diagram: the same double-bandit MDP as before]
Value
• Play Red: 15
• Play Blue: 10
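A quick sanity check on those two values (a worked computation of my own, using the payoffs shown in the diagram above):

```python
# Expected total payoff of each fixed policy over 10 undiscounted plays
steps = 10
value_red = steps * (0.75 * 2 + 0.25 * 0)   # 15.0
value_blue = steps * (1.0 * 1)              # 10.0
print(value_red, value_blue)
```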
Let’s Play!
$2 $2 $0 $2 $2
$2 $2 $0 $0 $0
Online Planning
• Rules changed! Red’s win chance is different.
[Diagram: the same double-bandit MDP, but Red’s payoff probabilities are now unknown (??); Blue still pays $1 with probability 1.0]
Let’s Play!
$2 $2 $2 $0 $0 $0 $0 $0 $0 $2
What Just Happened?
• That wasn’t planning, it was learning!
• Specifically, reinforcement learning
• There was an MDP, but you couldn’t solve it with just computation
• You needed to actually act to figure it out
• Important ideas in reinforcement learning that came up
• Exploration: you have to try unknown actions to get information
• Exploitation: eventually, you have to use what you know
• Regret: even if you learn intelligently, you make mistakes
• Sampling: because of chance, you have to try things repeatedly
• Difficulty: learning can be much harder than solving a known MDP
Reinforcement Learning
• Still assume a Markov decision process (MDP):
• A set of states s ∈ S
• A set of actions (per state) a ∈ A
• A model T(s,a,s’)
• A reward function R(s,a,s’)
• Still looking for a policy π(s)
• New twist: don’t know T or R
• I.e., we don’t know which states are good or what the actions do
• Must actually try actions and states out to learn
Reinforcement Learning
[Diagram: Agent-Environment loop: the agent observes state s and reward r, and sends actions a to the environment]
• Basic idea:
• Receive feedback in the form of rewards
• Agent’s utility is defined by the reward function
• Must (learn to) act so as to maximize expected rewards
• All learning is based on observed samples of outcomes!
Example: Learning to Walk
Initial | A Learning Trial | After Learning [1K Trials]
[Kohl and Stone, ICRA 2004]
Example: Learning to Walk
[Kohl and Stone, ICRA 2004]
Initial
[Video: AIBO WALK – initial]
Example: Learning to Walk
[Kohl and Stone, ICRA 2004]
Training
[Video: AIBO WALK – training]
Example: Learning to Walk
[Kohl and Stone, ICRA 2004]
Finished
[Video: AIBO WALK – finished]
The Crawler!
[Demo: Crawler Bot (L10D1)] [You, in Project 3]
Video of Demo Crawler Bot
DeepMind Atari
Reinforcement Learning
• Still assume a Markov decision process (MDP):
• A set of states s ∈ S
• A set of actions (per state) a ∈ A
• A model T(s,a,s’)
• A reward function R(s,a,s’)
• Still looking for a policy π(s)
• New twist: don’t know T or R
• I.e., we don’t know which states are good or what the actions do
• Must actually try actions and states out to learn
Offline (MDPs) vs. Online (RL)
Offline Solution | Online Learning
Passive Reinforcement Learning
Model-Based Learning
Model-Based Learning
• Model-Based Idea:
• Learn an approximate model based on experiences
• Solve for values as if the learned model were correct
• Step 1: Learn empirical MDP model
• Count outcomes s’ for each s, a
• Normalize to give an estimate of T̂(s,a,s’)
• Discover each R̂(s,a,s’) when we experience (s, a, s’)
• Step 2: Solve the learned MDP
• For example, use value iteration, as before
Example: Model-Based Learning
Input Policy π (Assume: γ = 1)
[Gridworld: A above C, B C D in a row, E below C; arrows show the fixed policy]

Observed Episodes (Training)
Episode 1: B, east, C, -1 | C, east, D, -1 | D, exit, x, +10
Episode 2: B, east, C, -1 | C, east, D, -1 | D, exit, x, +10
Episode 3: E, north, C, -1 | C, east, D, -1 | D, exit, x, +10
Episode 4: E, north, C, -1 | C, east, A, -1 | A, exit, x, -10

Learned Model
T̂(s,a,s’):
• T̂(B, east, C) = 1.00
• T̂(C, east, D) = 0.75
• T̂(C, east, A) = 0.25
• …
R̂(s,a,s’):
• R̂(B, east, C) = -1
• R̂(C, east, D) = -1
• R̂(D, exit, x) = +10
• …
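As a concrete illustration of Step 1 (a sketch of my own, not code from the lecture or Project 3), the following counts outcomes over the four episodes above and normalizes them; the name `estimate_model` and the data layout are assumptions for this example:

```python
from collections import defaultdict

# Episodes from the slide: lists of (s, a, s', r) transitions
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

def estimate_model(episodes):
    """Count outcomes s' for each (s, a), normalize to get T-hat, and record R-hat."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N
    rewards = {}                                     # rewards[(s, a, s')] = r
    for episode in episodes:
        for s, a, s2, r in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)] = r
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s2, n in outcomes.items():
            T_hat[(s, a, s2)] = n / total
    return T_hat, rewards

T_hat, R_hat = estimate_model(episodes)
print(T_hat[("C", "east", "D")])   # 0.75, matching the learned model on the slide
```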
Analogy: Expected Age
Goal: Compute expected age of AI students

Known P(A):
E[A] = Σ_a P(a) · a

Without P(A), instead collect samples [a1, a2, … aN]

Unknown P(A): “Model Based”
P̂(a) = num(a) / N,  E[A] ≈ Σ_a P̂(a) · a
Why does this work? Because eventually you learn the right model.

Unknown P(A): “Model Free”
E[A] ≈ (1/N) Σ_i a_i
Why does this work? Because samples appear with the right frequencies.
Model-Free Learning
Passive Reinforcement Learning
• Simplified task: policy evaluation
• Input: a fixed policy π(s)
• You don’t know the transitions T(s,a,s’)
• You don’t know the rewards R(s,a,s’)
• Goal: learn the state values V^π(s)
• In this case:
• Learner is “along for the ride”
• No choice about what actions to take
• Just execute the policy and learn from experience
• This is NOT offline planning! You actually take actions in the world.
Direct Evaluation
• Goal: Compute values for each state under π
• Idea: Average together observed sample values
• Act according to π
• Every time you visit a state, write down what the
sum of discounted rewards turned out to be
• Average those samples
• This is called direct evaluation
Example: Direct Evaluation
Input Policy π (Assume: γ = 1)
[Same gridworld and policy as before: A above C, B C D in a row, E below C]

Observed Episodes (Training)
Episode 1: B, east, C, -1 | C, east, D, -1 | D, exit, x, +10
Episode 2: B, east, C, -1 | C, east, D, -1 | D, exit, x, +10
Episode 3: E, north, C, -1 | C, east, D, -1 | D, exit, x, +10
Episode 4: E, north, C, -1 | C, east, A, -1 | A, exit, x, -10

Output Values
V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2

If B and E both go to C under this policy, how can their values be different?
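A minimal sketch of direct evaluation on these four episodes (my own illustration): record the discounted return that follows each state visit, then average per state. The result matches the output values above.

```python
from collections import defaultdict

gamma = 1.0
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

returns = defaultdict(list)
for episode in episodes:
    for t, (s, _, _, _) in enumerate(episode):
        # Sum of discounted rewards observed from this visit onward
        G = sum(gamma**k * r for k, (_, _, _, r) in enumerate(episode[t:]))
        returns[s].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)   # {'B': 8.0, 'C': 4.0, 'D': 10.0, 'E': -2.0, 'A': -10.0}
```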
Problems with Direct Evaluation
• What’s good about direct evaluation?
• It’s easy to understand
• It doesn’t require any knowledge of T, R
• It eventually computes the correct average values, using just sample transitions
• What’s bad about it?
• It wastes information about state connections
• Each state must be learned separately
• So, it takes a long time to learn
[Output values, repeated from the previous slide: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2]
If B and E both go to C under this policy, how can their values be different?
Why Not Use Policy Evaluation?
• Simplified Bellman updates calculate V for a fixed policy:
• Each round, replace V with a one-step-look-ahead layer over V:
V_0^π(s) = 0
V_{k+1}^π(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V_k^π(s’) ]
[Diagram: state s, q-state (s, π(s)), successors s’]
• This approach fully exploited the connections between the states
• Unfortunately, we need T and R to do it!
• Key question: how can we do this update to V without knowing T and R?
• In other words, how do we take a weighted average without knowing the weights?
Sample-Based Policy Evaluation?
• We want to improve our estimate of V by computing these averages:
V_{k+1}^π(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V_k^π(s’) ]
• Idea: Take samples of outcomes s’ (by doing the action!) and average:
sample_1 = R(s, π(s), s’_1) + γ V_k^π(s’_1)
sample_2 = R(s, π(s), s’_2) + γ V_k^π(s’_2)
…
sample_n = R(s, π(s), s’_n) + γ V_k^π(s’_n)
V_{k+1}^π(s) ← (1/n) Σ_i sample_i
[Diagram: state s with sampled successors s’_1, s’_2, s’_3]
Almost! But we can’t rewind time to get sample after sample from state s.
Temporal Difference Learning
• Big idea: learn from every experience!
• Update V(s) each time we experience a transition (s, a, s’, r)
• Likely outcomes s’ will contribute updates more often
• Temporal difference learning of values
• Policy still fixed, still doing evaluation!
• Move values toward value of whatever successor occurs: running average
Sample of V(s):  sample = R(s, π(s), s’) + γ V^π(s’)
Update to V(s):  V^π(s) ← (1 − α) V^π(s) + α · sample
Same update:     V^π(s) ← V^π(s) + α (sample − V^π(s))
[Diagram: transition from s through (s, π(s)) to the observed successor s’]
Exponential Moving Average
• Exponential moving average
• The running interpolation update:  x̄_n = (1 − α) · x̄_{n−1} + α · x_n
• Makes recent samples more important:
x̄_n = [ x_n + (1 − α) x_{n−1} + (1 − α)^2 x_{n−2} + … ] / [ 1 + (1 − α) + (1 − α)^2 + … ]
• Forgets about the past (distant past values were wrong anyway)
• Decreasing learning rate (alpha) can give converging averages
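A quick numeric check (my own illustration, reusing the payoffs from the first “Let’s Play!” round as samples): with a decreasing learning rate α_n = 1/n, the running interpolation update reproduces the ordinary sample average.

```python
# Samples: the payoffs from the first "Let's Play!" round
samples = [2, 2, 0, 2, 2, 2, 2, 0, 0, 0]

x_bar = 0.0
for n, x in enumerate(samples, start=1):
    alpha = 1.0 / n                          # decreasing learning rate
    x_bar = (1 - alpha) * x_bar + alpha * x  # running interpolation update

print(x_bar, sum(samples) / len(samples))    # both 1.2
```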
Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2
States: [Gridworld: A above C, B C D in a row, E below C]

Observed transitions and value estimates (initially V(D) = 8, all others 0):
Initial:                 V(B) = 0,  V(C) = 0, V(D) = 8
After (B, east, C, -2):  V(B) = -1, V(C) = 0, V(D) = 8
After (C, east, D, -2):  V(B) = -1, V(C) = 3, V(D) = 8
(V(A) and V(E) stay at 0 throughout.)
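A minimal sketch of the TD update that reproduces this trace (my own illustration, not the course’s project code):

```python
gamma, alpha = 1.0, 0.5
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

def td_update(s, s2, r):
    """V(s) <- (1 - alpha) * V(s) + alpha * [r + gamma * V(s')]"""
    sample = r + gamma * V[s2]
    V[s] = (1 - alpha) * V[s] + alpha * sample

td_update("B", "C", -2)   # V(B): 0 -> -1
td_update("C", "D", -2)   # V(C): 0 -> 3
print(V)
```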
Problems with TD Value Learning
• TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
• However, if we want to turn values into a (new) policy, we’re sunk:
π(s) = argmax_a Q(s,a),  where  Q(s,a) = Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]
[Diagram: state s, q-state (s, a), successor s’]
• Idea: learn Q-values, not values
• Makes action selection model-free too!
Detour: Q-Value Iteration
• Value iteration: find successive (depth-limited) values
• Start with V_0(s) = 0, which we know is right
• Given V_k, calculate the depth k+1 values for all states:
V_{k+1}(s) ← max_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V_k(s’) ]
• But Q-values are more useful, so compute them instead
• Start with Q_0(s,a) = 0, which we know is right
• Given Q_k, calculate the depth k+1 q-values for all q-states:
Q_{k+1}(s,a) ← Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Q_k(s’,a’) ]
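A minimal sketch of one round of the Q-value iteration update above, assuming a known model stored as dictionaries T[(s, a)] = {s': prob} and R[(s, a, s')] = reward (a layout of my own, not from the lecture):

```python
def q_value_iteration_step(Q, T, R, gamma, actions):
    """Q_{k+1}(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * max_a' Q_k(s',a')]."""
    Q_next = {}
    for (s, a), outcomes in T.items():
        total = 0.0
        for s2, prob in outcomes.items():
            # Best q-value available from the successor under the current estimates
            best_next = max((Q.get((s2, a2), 0.0) for a2 in actions), default=0.0)
            total += prob * (R[(s, a, s2)] + gamma * best_next)
        Q_next[(s, a)] = total
    return Q_next
```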
Q-Learning
• Q-Learning: sample-based Q-value iteration
Q_{k+1}(s,a) ← Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Q_k(s’,a’) ]
• Learn Q(s,a) values as you go
• Receive a sample (s, a, s’, r)
• Consider your old estimate: Q(s,a)
• Consider your new sample estimate: sample = r + γ max_{a’} Q(s’,a’)
(note: this is no longer policy evaluation!)
• Incorporate the new estimate into a running average:
Q(s,a) ← (1 − α) Q(s,a) + α · sample
[Demo: Q-learning – gridworld (L10D2)]
[Demo: Q-learning – crawler (L10D3)]
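A minimal sketch of the tabular Q-learning update above (my own illustration; the α and γ settings are assumptions, not values from the slides):

```python
from collections import defaultdict

gamma, alpha = 0.9, 0.5      # assumed settings for the sketch
Q = defaultdict(float)       # Q[(s, a)] defaults to 0

def q_learning_update(s, a, s2, r, actions):
    """Blend the old estimate with the sampled estimate r + gamma * max_a' Q(s', a')."""
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```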
Video of Demo Q-Learning -- Gridworld
Video of Demo Q-Learning -- Crawler
Active Reinforcement Learning
Q-Learning: act according to the current optimal policy (and also explore…)
• Full reinforcement learning: optimal policies (like value iteration)
• You don’t know the transitions T(s,a,s’)
• You don’t know the rewards R(s,a,s’)
• You choose the actions now
• Goal: learn the optimal policy / values
• In this case:
• Learner makes choices!
• Fundamental tradeoff: exploration vs. exploitation
• This is NOT offline planning! You actually take actions in the world and
find out what happens…
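One common way to balance “act according to the current optimal policy” with exploration is epsilon-greedy action selection; this is a sketch of my own, since the slides here do not name a specific exploration scheme:

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore); otherwise pick the best-known action (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```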
Q-Learning Properties
• Amazing result: Q-learning converges to the optimal policy -- even if you’re acting suboptimally!
• This is called off-policy learning
• Caveats:
• You have to explore enough
• You have to eventually make the learning rate
small enough
• … but not decrease it too quickly
• Basically, in the limit, it doesn’t matter how you select actions (!)
Model-Based Learning
Input Policy
[Gridworld: A above C, B C D in a row, E below C]
Act according to the current optimal policy (and also explore!)
Discussion: Model-Based vs. Model-Free RL