
Reinforcement Learning (RL)

CE-717: Machine Learning


Sharif University of Technology

M. Soleymani
Fall 2020

Most slides have been taken from Klein and Abbeel, CS188, UC Berkeley.
Reinforcement Learning (RL)
 Learning as a result of interaction with an environment
 to improve the agent’s ability to behave optimally in the future
to achieve the goal.

 The first idea when we think about the nature of learning

 Examples:
 Baby movements
 Learning to drive a car
 Environment’s response affects our subsequent actions
 We find out the effects of our actions later

2
Paradigms of learning
 Supervised learning
 Training data: features and labels for N samples, \{ (x^{(n)}, y^{(n)}) \}_{n=1}^{N}
 Unsupervised learning
 Training data: only features for N samples, \{ x^{(n)} \}_{n=1}^{N}
 Reinforcement learning
 Training data: a sequence of (s, a, r)
 (state, action, reward)
 Agent acts on its environment and receives some evaluation of its action via a reinforcement signal
 It is not told which action is the correct one to achieve its goal

3
Reinforcement Learning (RL)

 𝑆: Set of states
 𝐴: Set of actions

 Goal: Learning an optimal policy (a mapping from states to actions) in order to maximize the long-term reward
 The agent's objective is to maximize the amount of reward it receives over time.

4
Reinforcement Learning: Example
 Chess game (a deterministic game)
 Learning task: choose moves at arbitrary board states
 Training signal: final win or loss
 Training: e.g., play n games against itself

5
Main characteristics and applications of RL
 Main characteristics of RL
 Learning is a multistage decision making process
 Actions influence later perceptions (inputs)
 Delayed reward: actions may affect not only the immediate reward but
also subsequent rewards
 agent must learn from interactions with environment
 Agent must be able to learn from its own experience
 Not entirely supervised, but interactive
 by trial-and-error
 Opportunity for active exploration
 Needs trade-off between exploration and exploitation
 Goal: map states to actions, so as to maximize reward over
time (optimal policy)

6
Main elements of RL
 A reward function
 It maps each state (or, state-action pair) to a real number,
called reward.

 Policy: A map from state space to action space.


 May be stochastic.

 A value function
 Value of a state (or state-action) is the total expected reward,
starting from that state (or state-action).

7
RL problem: deterministic environment
 Deterministic
 Transition and reward functions

 At time 𝑡:
 Agent observes state 𝑠𝑡 ∈ 𝑆
 Then chooses action 𝑎𝑡 ∈ 𝐴
 Then receives reward 𝑟𝑡 , and state changes to 𝑠𝑡+1

 Learn to choose the action a_t in state s_t that maximizes the discounted return:

R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ = \sum_{k=0}^{∞} γ^k r_{t+k},   0 < γ ≤ 1

Upon visiting the sequence of states s_t, s_{t+1}, … with actions a_t, a_{t+1}, …, R_t gives the total payoff.
8
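To make the return concrete, here is a minimal Python sketch (illustrative only; the reward list and the γ value are made up) that computes the discounted return of a finite episode:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward sequence."""
    R = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        R = r + gamma * R
    return R

# Example: rewards 1, 0, 10 with gamma = 0.9 give 1 + 0.9*0 + 0.81*10 = 9.1
print(discounted_return([1.0, 0.0, 10.0]))
```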
RL problem: stochastic environment

 Stochastic environment
 Stochastic transition and/or reward function

 Learn to choose a policy that maximizes the expected discounted return starting from state s_t:

E[R_t] = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ ],   where R_t = \sum_{k=0}^{∞} γ^k r_{t+k}

9
RL deterministic world: Example
 Example: Robot grid world
 Deterministic and known reward and transitions

10
Non-deterministic world: Example

[Russell, AIMA, 2010]

Reward: −0.04 for all other actions

 What is the policy that achieves the maximum reward?

11
Optimal policy

(Optimal policies shown for different step rewards: r = −0.04, r = −0.4, and r = −4 for other actions.)


12
[Russell, AIMA, 2010]
RL: Autonomous Agent
 Execute actions in environment, observe results, and learn
 Learn a (perhaps stochastic) policy that maximizes
E[ \sum_{k=0}^{∞} γ^k r_{t+k} | s_t = s, π ]   for every state s ∈ S

 Function to be learned is the policy π: S × A → [0, 1]
(when the policy is deterministic, π: S → A)
 Training examples in supervised learning: (s, a) pairs
 RL training data shows the amount of reward for a pair (s, a):
 training data are of the form ((s, a), r)

13
Markov Decision Process (RL Setting)
 We encounter a multistage decision making process.

 Markov assumption:
P(s_{t+1}, r_t | s_t, a_t, r_{t-1}, s_{t-1}, a_{t-1}, r_{t-2}, …) = P(s_{t+1}, r_t | s_t, a_t)

 Markov property: Transition probabilities depend on the state only, not on the path to the state.

 Goal: for every possible state s ∈ S, learn a policy π for choosing actions that maximizes the (infinite-horizon discounted) reward:

E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ | s_t = s, π ],   0 < γ ≤ 1
14
Markov Decision Process
 MDPs provide a way to think about how we can act
optimally under uncertainty

 Components:
 states
 actions
 state transition probabilities
 reward function
 discount factor

15
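As an illustration of these components, a finite MDP can be written down directly as data. The sketch below is only one possible representation (nothing in the slides prescribes it); it borrows the high/low states of the recycling-robot example that follows, with made-up probabilities and rewards:

```python
# P[s][a] is a list of (probability, next_state, reward) triples.
mdp = {
    "states": ["high", "low"],
    "actions": {"high": ["search", "wait"],
                "low": ["search", "wait", "recharge"]},
    "gamma": 0.9,
    "P": {
        "high": {"search": [(0.7, "high", 4.0), (0.3, "low", 4.0)],
                 "wait":   [(1.0, "high", 1.0)]},
        "low":  {"search": [(0.6, "low", 4.0), (0.4, "high", -3.0)],
                 "wait":   [(1.0, "low", 1.0)],
                 "recharge": [(1.0, "high", 0.0)]},
    },
}
```

The later sketches reuse this (probability, next_state, reward) convention.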
(Search-tree diagram: from state s, an action a leads to the q-state (s, a); a transition (s, a, s') leads to the next state s'.)

16
MDP: Recycling Robot example
 S = {high, low}
 A = {search, wait, recharge}
 𝒜(high) = {search, wait}   (available actions in the 'high' state)
 𝒜(low) = {search, wait, recharge}
 ℛ^{search} > ℛ^{wait}

P(s_{t+1} = high | s_t = high, a_t = search)   (a transition probability labelled in the figure)

17
State-value function for policy 𝜋
 Given a policy 𝜋, define value function

V^π(s) = E[ \sum_{k=0}^{∞} γ^k r_{t+k} | s_t = s, π ]

 V^π(s): how good it is for the agent to be in state s when its policy is π
 It is simply the expected sum of discounted rewards upon
starting in state s and taking actions according to 𝜋

18
Known Environments

 We first assume that the MDP is known, although in RL problems we do not know it
 𝒫^a_{ss'} and ℛ^a_{ss'} are known

 Dynamic programming methods are used for solving the problem when the MDP is known
 Value iteration and Policy iteration are two classic approaches to this problem.

19
Recursive definition for 𝑉 𝜋 (𝑆)
V^π(s) = E[ \sum_{k=0}^{∞} γ^k r_{t+k} | s_t = s, π ]

       = E[ r_t + γ \sum_{k=1}^{∞} γ^{k−1} r_{t+k} | s_t = s, π ]

       = E[ r_t + γ \sum_{k=0}^{∞} γ^{k} r_{t+k+1} | s_t = s, π ]

       = \sum_{s'} 𝒫^{π(s)}_{ss'} [ ℛ^{π(s)}_{ss'} + γ E[ \sum_{k=0}^{∞} γ^k r_{t+k+1} | s_{t+1} = s', π ] ]

where 𝒫^a_{ss'} = P(s_{t+1} = s' | s_t = s, a_t = a),  ℛ^a_{ss'} = E[ r_t | s_t = s, a_t = a, s_{t+1} = s' ],
and the last expectation is just V^π(s').

Bellman equations:   V^π(s) = \sum_{s'} 𝒫^{π(s)}_{ss'} [ ℛ^{π(s)}_{ss'} + γ V^π(s') ]

Base equation for dynamic programming approaches


20
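Because the Bellman equations for a fixed policy are linear in V^π, a small known MDP can be evaluated exactly. A minimal sketch, assuming the dictionary representation from the earlier MDP sketch and NumPy:

```python
import numpy as np

def evaluate_policy_exact(states, policy, P, gamma):
    """Solve V = R_pi + gamma * P_pi V, the Bellman equations for a fixed policy."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((n, n))          # transition matrix under the policy
    R_pi = np.zeros(n)               # expected immediate reward under the policy
    for s in states:
        for prob, s_next, r in P[s][policy[s]]:
            P_pi[idx[s], idx[s_next]] += prob
            R_pi[idx[s]] += prob * r
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    return {s: V[idx[s]] for s in states}
```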
Backup diagram

V^π(s) = \sum_{s'} 𝒫^{π(s)}_{ss'} [ ℛ^{π(s)}_{ss'} + γ V^π(s') ]

(Backup diagram: state s, action π(s), q-state (s, π(s)), transition (s, π(s), s'), successor state s'.)

21
State-action value function for policy 𝜋

Q^π(s, a) = E[ \sum_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a, π ]
          = \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ E[ \sum_{k=0}^{∞} γ^k r_{t+k+1} | s_{t+1} = s', π ] ]

where the inner expectation is V^π(s'), so

Q^π(s, a) = \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ V^π(s') ]

(Backup diagram: state s, action a, q-state (s, a), transition (s, a, s'), successor state s'.)

22
State-value function for policy 𝜋: example
 Deterministic example (γ = 0.9):

V^π(s) = ℛ^{π(s)}_{ss'} + γ V^π(s')

(Grid-world figure with state values 66, 73, 81, 90, and 100 under the shown policy.)

23
Grid-world: value function example

[Russell, AIMA, 2010]

24
Optimal policy
 Policy π is better than (or equal to) π′ (i.e., π ≥ π′) iff
V^π(s) ≥ V^{π′}(s),   ∀s ∈ S

 Optimal policy π^* is better than (or equal to) all other policies (∀π, π^* ≥ π)

 Optimal policy π^*:
π^* = argmax_π V^π(s),   ∀s ∈ S

25
MDP: Optimal policy
state-value and action-value functions
 Optimal policies share the same optimal state-value function (V^{π^*}(s) will be abbreviated as V^*(s)):
V^*(s) = \max_{π} V^π(s),   ∀s ∈ S

 And the same optimal action-value function:
Q^*(s, a) = \max_{π} Q^π(s, a),   ∀s ∈ S, a ∈ 𝒜(s)

 For any MDP, a deterministic optimal policy exists!

26
Optimal Quantities

▪ The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

▪ The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

▪ The optimal policy:
π*(s) = optimal action from state s

(Diagram: s is a state, (s, a) is a q-state, and (s, a, s') is a transition.)

27
Bellman optimality equation

V^*(s) = \max_{a ∈ 𝒜(s)} Q^*(s, a)

Q^*(s, a) = \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ V^*(s') ]

V^*(s) = \max_{a ∈ 𝒜(s)} \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ V^*(s') ]

Q^*(s, a) = \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ \max_{a'} Q^*(s', a') ]

28
Snapshot of Demo – Gridworld V Values

Noise = 0.2
Discount = 0.9
Living reward = 0
29
Snapshot of Demo – Gridworld Q Values

Noise = 0.2
Discount = 0.9
Living reward = 0
30
Optimal policy
 If we have V^*(s) and P(s_{t+1} | s_t, a_t), we can compute π^*(s):

π^*(s) = argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^*(s') ]

 It can also be computed as:

π^*(s) = argmax_{a ∈ 𝒜(s)} Q^*(s, a)

 The optimal policy has the interesting property that it is the optimal policy for all states.
 All optimal policies share the same optimal state-value function
 It does not depend on the initial state.
 Use the same policy no matter what the initial state of the MDP is

31
Value Iteration algorithm
Consider only MDPs with finite state and action spaces:

1) Initialize all V(s) to zero
2) Repeat until convergence
    for s ∈ S
       V^{new}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]
    V ← V^{new}
3) for s ∈ S
    π(s) ← argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]

V(s) converges to V^*(s)

Asynchronous: Instead of updating the values of all states at once in each iteration, it can update them state by state, or update some states more often than others.

32
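A minimal sketch of the algorithm above, assuming the (probability, next_state, reward) transition lists of the earlier MDP sketch; the in-place update corresponds to the asynchronous variant mentioned on the slide:

```python
def value_iteration(states, actions, P, gamma, theta=1e-8):
    """Iterate the Bellman optimality update until values change by less than theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in actions[s]]
            v_new = max(q_values)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # asynchronous (in-place) update
        if delta < theta:
            break
    policy = {s: max(actions[s],
                     key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in states}
    return V, policy
```

With the dictionary MDP above, it would be called as value_iteration(mdp["states"], mdp["actions"], mdp["P"], mdp["gamma"]).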
Value Iteration

 Bellman equations characterize the optimal values:

V^*(s) = \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^*(s') ]

 Value iteration computes them:

V_{k+1}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]

 Value iteration is just a fixed-point solution method
 … though the V_k vectors are also interpretable as time-limited values

33
Value Iteration

 Start with V_0(s) = 0: no time steps left means an expected reward sum of zero

 Given the vector of V_k(s) values, do one ply of expectimax from each state:

V_{k+1}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]

 Repeat until convergence

 Complexity of each iteration: O(S^2 A)

 Theorem: will converge to unique optimal values
 Basic idea: approximations get refined towards optimal values
 Policy may converge long before values do

34
Gridworld value-iteration demo: snapshots of the values for k = 0, 1, 2, …, 12, and k = 100
(Noise = 0.2, Discount = 0.9, Living reward = 0)
Convergence*
 Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values

 Case 2: If the discount is less than 1
 Sketch: For any state, V_k and V_{k+1} can be viewed as depth k+1 expectimax results on nearly identical search trees
 The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
 That last layer is at best all R_MAX
 It is at worst R_MIN
 But everything is discounted by γ^k that far out
 So V_k and V_{k+1} are at most γ^k max|R| different
 So as k increases, the values converge

\max_{s ∈ 𝒮} | V_k(s) − V^*(s) | ≤ γ^k max|R| / (1 − γ)

49
Convergence and contractions
 Let ‖·‖ denote the max norm: ‖V‖ = \max_{s} |V(s)|

 Theorem: For any two approximations V and V′:
‖V_{i+1} − V′_{i+1}‖ ≤ γ ‖V_i − V′_i‖   (the Bellman operator is a contraction)

 Any two approximations get closer to each other after a Bellman update.

 Since V^* is a fixed point of the Bellman update (V^*_{i+1} = V^*_i), any approximation gets closer to V^*
 This implies existence and uniqueness of the fixed point

50
Value Iteration
 Value iteration works even if we randomly traverse the
environment instead of looping through each state and action
(update asynchronously)
 but we must still visit each state infinitely often

 Value Iteration
 It is expensive in time and memory

51
Problems with Value Iteration
 Value iteration repeats the Bellman updates:

V_{k+1}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]

 Problem 1: It's slow – O(S^2 A) per iteration

 Problem 2: The "max" at each state rarely changes

 Problem 3: The policy often converges long before the values

52
Convergence: Example

[Russell, AIMA, 2010]

53
Computing Actions from Values

 Let's imagine we have the optimal values V^*(s)

 How should we act?
 It's not obvious!

 We need to do a one-step lookahead:

π^*(s) = argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^*(s') ]

 This is called policy extraction, since it gets the policy implied by the values

54
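A minimal sketch of policy extraction, contrasting extraction from V* (which needs the model, as above) with extraction from Q* (which, as the next slide notes, is just an argmax); the data layout is the same assumption as in the earlier sketches:

```python
def policy_from_values(states, actions, P, V_star, gamma):
    """One-step lookahead over the model, using the optimal values V*."""
    return {s: max(actions[s],
                   key=lambda a: sum(p * (r + gamma * V_star[s2]) for p, s2, r in P[s][a]))
            for s in states}

def policy_from_q_values(states, actions, Q_star):
    """With Q*, no model is needed: just pick the argmax action in each state."""
    return {s: max(actions[s], key=lambda a: Q_star[(s, a)]) for s in states}
```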
Computing Actions from Q-Values
 Let's imagine we have the optimal q-values Q^*(s, a)

 How should we act?
 Completely trivial to decide: π^*(s) = argmax_a Q^*(s, a)

 Important lesson: actions are easier to select from q-values than from values!

55
Policy Iteration Algorithm
1) Initialize π(s) arbitrarily
2) Repeat until convergence
    Compute the value function for the current policy π (i.e., V^π)
    V ← V^π
    for s ∈ S
       π(s) ← argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]

The inner loop updates the policy (greedily) using the current value function.
π(s) converges to π^*(s)

56
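A minimal sketch of the algorithm above, reusing evaluate_policy_exact from the earlier sketch for the evaluation step (an assumption; any policy-evaluation routine would do):

```python
def policy_iteration(states, actions, P, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    policy = {s: actions[s][0] for s in states}        # arbitrary initial policy
    while True:
        V = evaluate_policy_exact(states, policy, P, gamma)
        stable = True
        for s in states:
            best = max(actions[s],
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                     # policy unchanged: it is optimal
            return policy, V
```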
Fixed Policies Evaluation

 Do the optimal action: max over all actions to compute the optimal values
 Do what π says to do: with some fixed policy π(s), the tree is simpler – only one action per state

(Backup diagrams: on the left, state s with all actions a, q-states (s, a), and transitions (s, a, s'); on the right, state s with only the action π(s), q-state (s, π(s)), and transitions (s, π(s), s').)

57
Utilities for a Fixed Policy

 Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π

 Recursive relation (one-step look-ahead / Bellman equation):

V^π(s) = \sum_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π(s') ]

 V^π(s) = expected total discounted rewards starting in s and following π

58
Policy Evaluation
 How do we calculate the V's for a fixed policy π?

 Idea 1: Turn the recursive Bellman equations into updates (like value iteration):

V^π_{k+1}(s) ← \sum_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π_k(s') ]

 Efficiency: O(S^2) per iteration

 Idea 2: Without the maxes, the Bellman equations are just a linear system

59
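A minimal sketch of Idea 1 (the fixed-policy Bellman update applied repeatedly); Idea 2 is what the earlier evaluate_policy_exact sketch does by solving the linear system directly:

```python
def evaluate_policy_iterative(states, policy, P, gamma, sweeps=1000):
    """Repeated synchronous sweeps of the fixed-policy Bellman update."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
             for s in states}
    return V
```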
Policy Iteration
 Evaluation: For the fixed current policy π_i, find values with policy evaluation:
 Iterate until values converge:

V^{π_i}_{k+1}(s) ← \sum_{s'} P(s' | s, π_i(s)) [ R(s, π_i(s), s') + γ V^{π_i}_k(s') ]

 Improvement: For fixed values, get a better policy using policy extraction
 One-step look-ahead:

π_{i+1}(s) ← argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^{π_i}_k(s') ]

60
Policy Iteration
 Repeat steps until policy converges
 Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal
utilities!) until convergence
 Step 2: Policy improvement: update policy using one-step look-ahead with
resulting converged (but not optimal!) utilities as future values

 This is policy iteration


 It’s still optimal!
 It is guaranteed to converge and at convergence, the current policy and its value function are
the optimal policy and the optimal value function.
 Can converge (much) faster under some conditions

61
Policy Iteration Optimality: Proof Sketch
 Convergence:
 Every step improves the policy, and each policy is achieved at most once. Thus, the number of iterations is at most as large as the number of different policies (i.e., |𝒜|^{|𝒮|})

 Optimal at convergence:
 At convergence, ∀s: π_{k+1}(s) = π_k(s).
 Indeed, ∀s: V^{π_k}(s) = \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^{π_k}(s') ]
 Therefore, V^{π_k} satisfies the Bellman optimality equation and thus V^{π_k} = V^*

62
When to stop iterations:

63 [Russell, AIMA, 2010]


Comparison
 Both value iteration and policy iteration compute the same thing (all
optimal values)

 In value iteration:
 Every iteration updates both the values and (implicitly) the policy
 We don’t track the policy, but taking the max over actions implicitly recomputes
it

 In policy iteration:
 We do several passes that update utilities with fixed policy (each pass is fast
because we consider only one action, not all of them)
 After the policy is evaluated, a new policy is chosen (slow like a value iteration
pass)
 The new policy will be better (or we’re done)

 Both are dynamic programs for solving MDPs


64
MDP Algorithms: Summary
 So you want to….
 Compute optimal values: use value iteration or policy iteration
 Compute values for a particular policy: use policy evaluation
 Turn your values into a policy: use policy extraction (one-step lookahead)

65
Unknown transition model
 So far: learning the optimal policy when we know 𝒫^a_{ss'} and ℛ^a_{ss'}
 it requires prior knowledge of the environment's dynamics

 If a model is not available, then it is particularly useful to estimate action values rather than state values

66
Reinforcement Learning

 Still assume a Markov decision process (MDP):


 A set of states s ∈ S
 A set of actions (per state) A
 A model P(s'|a,s)
 A reward function R(s,a,s')
 Still looking for a policy π(s)

 New twist: don’t know P or R


 I.e. we don’t know which states are good or what the actions do
 Must actually try actions and states out to learn

67
Reinforcement Learning

(Agent–environment loop: the agent observes state s, takes actions a, and receives rewards r from the environment.)

 Basic idea:
 Receive feedback in the form of rewards
 Agent’s utility is defined by the reward function
 Must (learn to) act so as to maximize expected rewards
 All learning is based on observed samples of outcomes!

68
What Just Happened?
 That wasn’t planning, it was learning!
 Specifically, reinforcement learning
 There was an MDP, but you couldn’t solve it with just computation
 You needed to actually act to figure it out

 Important ideas in reinforcement learning that came up


 Sampling: because of chance, you have to try things repeatedly
 Difficulty: learning can be much harder than solving a known MDP
 Exploration: you have to try unknown actions to get information
 Exploitation: eventually, you have to use what you know
 Regret: even if you learn intelligently, you make mistakes

69
RL algorithms
 Model-based (passive)
 Learn model of environment (transition and reward
probabilities)
 Then, value iteration or policy iteration algorithms

 Model-free (active)

70
Model-Based Learning
 Model-Based Idea:
 Learn an approximate model based on experiences
 Solve for values as if the learned model were correct

 Step 1: Learn empirical MDP model


 Count outcomes s' for each (s, a)
 Normalize to give an estimate of \hat{P}(s' | s, a)
 Discover each reward R(s, a, s') when we experience (s, a, s')

 Step 2: Solve the learned MDP


 For example, use value iteration, as before

71
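A minimal sketch of Step 1, building an empirical model from observed (s, a, r, s') samples in the same (probability, next_state, reward) format as the earlier sketches, so that Step 2 can reuse the value_iteration sketch:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Empirical MDP model: P_hat[s][a] is a list of (prob, next_state, avg_reward)."""
    counts = defaultdict(int)
    reward_sums = defaultdict(float)
    for s, a, r, s2 in transitions:
        counts[(s, a, s2)] += 1
        reward_sums[(s, a, s2)] += r
    totals = defaultdict(int)
    for (s, a, s2), c in counts.items():
        totals[(s, a)] += c
    P_hat = defaultdict(lambda: defaultdict(list))
    for (s, a, s2), c in counts.items():
        P_hat[s][a].append((c / totals[(s, a)], s2, reward_sums[(s, a, s2)] / c))
    return P_hat
```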
Model free approaches to solve RL problems
 These methods can be categorized into:
 Monte Carlo methods
 Temporal-difference learning
 Q-learning is a more recent approach to this problem.

72
Direct Evaluation of a Policy

 Before discussing the Monte Carlo approach for finding an optimal policy, we first solve a simpler problem, "policy evaluation", by Monte Carlo

 Goal: Compute values for each state under π

 Idea: Average together observed sample values
 Act according to π
 Every time you visit a state, write down what the sum of discounted rewards turned out to be
 Average those samples

73
Example: Direct Policy Evaluation

Input Policy π — Observed Episodes (Training) — Output Values

Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
Episode 3: E, north, C, -1;  C, east, D, -1;  D, exit, x, +10
Episode 4: E, north, C, -1;  C, east, A, -1;  A, exit, x, -10

Assume: γ = 1

Output values:  A = -10,  B = +8,  C = +4,  D = +10,  E = -2

74
Monte Carlo methods
 do not assume complete knowledge of the environment

 require only experience


 sample sequences of states, actions, and rewards from on-line
or simulated interaction with an environment

 are based on averaging sample returns


 are defined for episodic tasks

75
Monte Carlo policy evaluation
1) Initialize Q arbitrarily and Returns(s, a) to empty lists
2) Repeat
    Generate an episode using π and exploring starts
    for each pair (s, a) appearing in the episode
       R ← return following the first occurrence of (s, a)
       Append R to Returns(s, a)
       Q(s, a) ← average(Returns(s, a))

76
Monte Carlo RL algorithm
1) Initialize Q and π arbitrarily and Returns(s, a) to empty lists
2) Repeat
    Generate an episode using π and exploring starts
    for each pair (s, a) appearing in the episode
       R ← return following the first occurrence of (s, a)
       Append R to Returns(s, a)
       Q(s, a) ← average(Returns(s, a))
    for each s in the episode
       π(s) ← argmax_a Q(s, a)

77
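A minimal sketch of the first-visit Monte Carlo update above; the episode format (a list of (state, action, reward) tuples from an exploring-start rollout) is an assumption:

```python
from collections import defaultdict

def mc_update(episode, Q, returns, gamma=1.0):
    """First-visit Monte Carlo: update Q(s, a) from one episode of (s, a, r) tuples."""
    # Return following each time step, computed by scanning backwards.
    G_after = [0.0] * (len(episode) + 1)
    for t in range(len(episode) - 1, -1, -1):
        _, _, r = episode[t]
        G_after[t] = r + gamma * G_after[t + 1]
    seen = set()
    for t, (s, a, _) in enumerate(episode):
        if (s, a) not in seen:                     # first occurrence of (s, a)
            seen.add((s, a))
            returns[(s, a)].append(G_after[t])
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q

def greedy_policy(Q, actions):
    """Policy improvement step: pi(s) = argmax_a Q(s, a)."""
    return {s: max(acts, key=lambda a: Q.get((s, a), 0.0)) for s, acts in actions.items()}

# Typical usage: Q, returns = {}, defaultdict(list); call mc_update once per generated episode.
```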
Problems with Direct Evaluation

 What's good about direct evaluation?
 It's easy to understand
 It doesn't require any knowledge of P, R
 It eventually computes the correct average values, using just sample transitions

 What's bad about it?
 It wastes information about state connections
 Each state must be learned separately
 So, it takes a long time to learn

(Output values from the previous example: A = -10, B = +8, C = +4, D = +10, E = -2. If B and E both go to C under this policy, how can their values be different?)

78
Connections between states

 Simplified Bellman updates calculate V for a fixed policy:
 Each round, replace V with a one-step-look-ahead layer over V

V^π_{k+1}(s) ← \sum_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π_k(s') ]

 This approach fully exploited the connections between the states
 Unfortunately, we need T and R to do it!

 Key question: how can we do this update to V without knowing T and R?
 In other words, how do we take a weighted average without knowing the weights?

79
Connections between states

 We want to improve our estimate of V by computing these averages:

V^π_{k+1}(s) ← \sum_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π_k(s') ]

 Idea: Take samples of outcomes s' (by doing the action!) and average

(Diagram: from state s, take action π(s) and observe sample successors s_1', s_2', s_3', …)
Almost! But we can't rewind time to get sample after sample from state s.

80
Temporal Difference Learning
 Big idea: learn from every experience!
 Update V(s) each time we experience a transition (s, a, s', r)
 Likely outcomes s' will contribute updates more often

 Temporal difference learning of values
 Policy still fixed, still doing evaluation!
 Move values toward the value of whatever successor occurs: running average

Sample of V(s):   sample = r + γ V^π(s')
Update to V(s):   V^π(s) ← (1 − α) V^π(s) + α · sample
Same update:      V^π(s) ← V^π(s) + α (sample − V^π(s))

81
Exponential Moving Average

 Exponential moving average
 The running interpolation update:   \bar{x}_n = (1 − α) · \bar{x}_{n−1} + α · x_n
 Makes recent samples more important than older ones
 Forgets about the past (distant past values were wrong anyway)

 Decreasing the learning rate (alpha) can give converging averages

82
Example: Temporal Difference Learning

States and observed transitions (assume γ = 1, α = 1/2):   B, east, C, -2   then   C, east, D, -2

Values before and after the two updates:
  A: 0 → 0 → 0
  B: 0 → -1 → -1
  C: 0 → 0 → 3
  D: 8 → 8 → 8
  E: 0 → 0 → 0

83
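A minimal sketch of the TD(0) update, checked against the example above (γ = 1, α = 1/2):

```python
def td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """TD(0): move V(s) toward the observed sample r + gamma * V(s')."""
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))
    return V

V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}
td0_update(V, "B", -2, "C")    # V(B): 0 -> -1
td0_update(V, "C", -2, "D")    # V(C): 0 -> 3
```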
Temporal Difference Methods
 TD learning is a combination of MC and DP (i.e., Bellman equation) ideas.
 Like MC methods, it can learn directly from raw experience without a model of the environment's dynamics.

 Like DP, it updates estimates based in part on other learned estimates, without waiting for a final outcome.

84
Problems with TD Value Learning
 TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages

 However, if we want to turn values into a (new) policy, we're sunk.

85
Unknown transition model: New policy
 With a model, state values alone are sufficient to determine a policy
 simply look ahead one step and choose whichever action leads to the best combination of reward and next state:

π^*(s) = argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^*(s') ]

 Without a model, state values alone are not sufficient.

 However, if the agent knows Q(s, a), it can choose the optimal action without knowing P and R:

π^*(s) = argmax_a Q^*(s, a)

 Idea: learn Q-values, not state values
 Makes action selection model-free too!

86
Detour: Q-Value Iteration

 Value iteration: find successive (depth-limited) values
 Start with V_0(s) = 0, which we know is right
 Given V_k, calculate the depth k+1 values for all states:

V_{k+1}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]

 But Q-values are more useful, so compute them instead
 Start with Q_0(s, a) = 0, which we know is right
 Given Q_k, calculate the depth k+1 q-values for all q-states:

Q_{k+1}(s, a) ← \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ \max_{a'} Q_k(s', a') ]

87
Q-Learning
 We'd like to do Q-value updates to each Q-state:

Q_{k+1}(s, a) ← \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ \max_{a'} Q_k(s', a') ]

 But we can't compute this update without knowing P, R

 Instead, compute the average as we go
 Receive a sample transition (s, a, r, s')
 This sample suggests Q(s, a) ≈ r + γ \max_{a'} Q(s', a')
 But we want to average over results from (s, a) (Why?)
 So keep a running average

88
Q-Learning

 Learn Q(s, a) values as you go
 Receive a sample (s, a, s', r)
 Consider your old estimate: Q(s, a)
 Consider your new sample estimate: sample = r + γ \max_{a'} Q(s', a')

 Incorporate the new estimate into a running average:
Q(s, a) ← (1 − α) Q(s, a) + α · sample

89
Model-Free Learning

 Model-free (temporal difference) learning
 Experience the world through episodes (s, a, r, s', a', r', s'', …)
 Update estimates at each transition
 Over time, the updates will mimic Bellman updates
s’’

90
Video of Demo Q-Learning Auto Cliff Grid

91
Exploration/exploitation tradeoff
 Exploitation: High rewards from trying previously-well-
rewarded actions

 Exploration: Which actions are best?


 Must try ones not tried before

92
How to Explore?

 Several schemes for forcing exploration
 Simplest: random actions (ε-greedy)
 Every time step, flip a coin
 With (small) probability ε, act randomly
 With (large) probability 1−ε, act on the current policy

 Problems with random actions?
 You do eventually explore the space, but keep thrashing around once learning is done
 One solution: lower ε over time
 Another solution: exploration functions

93
Regret
 Even if you learn the optimal policy, you still make mistakes along the way!
 Regret is a measure of your total mistake cost: the difference between
your (expected) rewards, including youthful suboptimality, and optimal
(expected) rewards
 Minimizing regret goes beyond learning to be optimal – it requires
optimally learning to be optimal
 Example: random exploration and exploration functions both end up
optimal, but random exploration has higher regret

94
Q-learning: Policy
 Greedy action selection:
π(s) = argmax_a \hat{Q}(s, a)

 ε-greedy: greedy most of the time, occasionally take a random action

 Softmax policy: give a higher probability to the actions that currently have better utility, e.g.,

π(s, a) = b^{\hat{Q}(s, a)} / \sum_{a'} b^{\hat{Q}(s, a')}

 After learning Q^*, is the policy greedy?

95
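A minimal sketch of ε-greedy and softmax action selection over a tabular Q estimate (a dictionary keyed by (state, action) is an assumption):

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def softmax_action(Q, s, actions, b=math.e):
    """Sample an action with probability proportional to b ** Q(s, a)."""
    prefs = [b ** Q.get((s, a), 0.0) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```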
Q-learning Algorithm
 Initialize \hat{Q}(s, a) arbitrarily
 Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
       Choose a from s using a policy derived from \hat{Q} (e.g., ε-greedy, softmax, …)
       Take action a, receive reward r, observe new state s'
       \hat{Q}(s, a) ← \hat{Q}(s, a) + α [ r + γ \max_{a'} \hat{Q}(s', a') − \hat{Q}(s, a) ]
       s ← s'
    until s is terminal

96
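A minimal sketch of the algorithm above with an ε-greedy behaviour policy (reusing the epsilon_greedy sketch); the env.reset() / env.step(a) -> (s', r, done) interface is an assumption, not a specific library:

```python
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning; Q is a dict keyed by (state, action)."""
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions[s], eps)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q.get((s_next, a2), 0.0)
                                                    for a2 in actions[s_next])
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s = s_next
    return Q
```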
Q-learning convergence
 Q-learning converges to optimal Q-values if
 Every state is visited infinitely often
 The policy for action selection becomes greedy as time
approaches infinity
 The step size parameter is chosen appropriately

 Stochastic approximation conditions


 The learning rate is decreased fast enough but not too fast

97
Q-Learning Properties
 Amazing result: Q-learning converges to optimal policy --
even if you’re acting suboptimally!

 This is called off-policy learning

 Caveats:
 You have to explore enough
 You have to eventually make the learning rate small enough
 … but not decrease it too quickly
 Basically, in the limit, it doesn’t matter how you select actions (!)

98
SARSA (State-Action-Reward-State-Action)
 Initialize \hat{Q}(s, a)
 Repeat (for each episode):
    Initialize s
    Choose a from s using a policy derived from \hat{Q} (e.g., greedy, ε-greedy)
    Repeat (for each step of episode):
       Take action a, receive reward r, observe new state s'
       Choose a' from s' using a policy derived from \hat{Q}
       \hat{Q}(s, a) ← \hat{Q}(s, a) + α [ r + γ \hat{Q}(s', a') − \hat{Q}(s, a) ]
       s ← s'
       a ← a'
    until s is terminal

99
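A minimal SARSA sketch under the same assumed environment interface; note that the bootstrap uses the action actually chosen in s', which is what makes it on-policy:

```python
def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular SARSA with an epsilon-greedy policy for both acting and bootstrapping."""
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions[s], eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                a_next = epsilon_greedy(Q, s_next, actions[s_next], eps)
                target = r + gamma * Q.get((s_next, a_next), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            if not done:
                s, a = s_next, a_next
    return Q
```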
Q-learning vs. SARSA

 Q-learning is an off-policy learning algorithm

 SARSA is an on-policy learning algorithm

100
Tabular methods: Problem
 All of the introduced methods maintain a table

 Table size can be very large for complex environments


 Too many states to visit them all in training
 We may not even visit some states

 Too many states to hold the q-tables in memory


 So there is both a computation and a memory problem

101
Generalizing Across States

 Basic Q-Learning keeps a table of all q-values

 In realistic situations, we cannot possibly learn


about every single state!

 Instead, we want to generalize:


 Learn about some small number of training states from
experience
 Generalize that experience to new, similar situations
 This is a fundamental idea in machine learning, and we’ll
see it over and over again

102
Example: Pacman

Let's say we discover through experience that this state is bad. In naïve q-learning, we know nothing about this state, or even this one!

103
Feature-Based Representations

 Solution: describe a state using a vector of


features (properties)
 Features are functions from states to real numbers
(often 0/1) that capture important properties of the
state
 Example features:
 Distance to closest ghost
 Distance to closest dot
 Number of ghosts
 1 / (dist to dot)^2
 Is Pacman in a tunnel? (0/1)
 …… etc.
 Is it the exact state on this slide?
 Can also describe a q-state (s, a) with features (e.g.
action moves closer to food)

104
Linear Value Functions
 Using a feature representation, we can write a q function (or value function) for any state using a few weights:

Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ⋯ + w_n f_n(s, a)

 Advantage:
 our experience is summed up in a few powerful numbers
 We can generalize from visited states to unvisited ones,
 in addition to the smaller space requirement

 Disadvantage: states may share features but actually be very different in value!

105
Least square error: minimization
Imagine we had only one point x, with features f(x), target value y, and weights w:

error(w) = (1/2) ( y − \sum_k w_k f_k(x) )^2

 Q-learning with linear Q-functions:

w_k ← w_k + α [ "target" − "prediction" ] f_k(s, a),  where the "target" is r + γ \max_{a'} Q(s', a') and the "prediction" is Q(s, a)

 Intuitive interpretation:
 Adjust weights of active features
 E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

106
Least square error: minimization
 In the general case (e.g., when Q_θ(s, a) is modeled by a neural network):

L(θ) = E_{s,a,s',r} [ ( target − Q_θ(s, a) )^2 ]

∇_θ L(θ) = E_{s,a,s',r} [ ( target − Q_θ(s, a) ) ∇_θ Q_θ(s, a) ]

∇_θ L(θ) = E_{s,a,s',r} [ ( r + γ \max_{a'} Q_θ(s', a') − Q_θ(s, a) ) ∇_θ Q_θ(s, a) ]

107
Adjusting function weights

θ ← θ + α [ r + γ V_θ(s') − V_θ(s) ] ∇_θ V_θ(s)

or

θ ← θ + α [ r + γ \max_{a'} Q_θ(s', a') − Q_θ(s, a) ] ∇_θ Q_θ(s, a)

108
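A minimal sketch of the second update for the linear case, where ∇_θ Q_θ(s, a) is simply the feature vector f(s, a); the features(s, a) function returning a NumPy array is an assumption:

```python
import numpy as np

def q_update_linear(theta, features, s, a, r, s_next, actions_next,
                    alpha=0.01, gamma=0.9, done=False):
    """One gradient-style Q-learning step for Q_theta(s, a) = theta . f(s, a)."""
    q_sa = theta @ features(s, a)
    if done:
        target = r
    else:
        target = r + gamma * max(theta @ features(s_next, a2) for a2 in actions_next)
    # For a linear model, grad_theta Q_theta(s, a) is just the feature vector f(s, a).
    theta += alpha * (target - q_sa) * features(s, a)
    return theta
```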
Deep RL
 Q(s, a; θ): a neural network with parameters θ

 A single feedforward pass computes, for the current state s, the Q-values of all actions

109
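A minimal sketch of such a Q-network, assuming PyTorch and an arbitrary hidden size; one forward pass returns the Q-values of all actions for the given state:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action in a single forward pass."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)      # shape: (batch, n_actions)

# Greedy action for a batch of states: q_net(states).argmax(dim=1)
```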
Playing Atari Games

 State: Raw pixels of images


 Action: e.g., R, L, U, D
 Reward: score

 The last layer has a 4-dimensional output (if there are 4 actions):
 Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)

110
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
DeepMind Atari (©Two Minute Lectures)

111
References
 T. Mitchell, Machine Learning, McGraw-Hill, 1997. [Chapter 13]

 R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. [Chapters 1, 3, 4, 6]

112
