M. Soleymani
Fall 2020
Most slides have been taken from Klein and Abbeel, CS188, UC Berkeley.
Reinforcement Learning (RL)
Learning as a result of interaction with an environment, in order to improve the agent's ability to behave optimally in the future and to achieve its goal.
Examples:
Baby movements
Learning to drive a car
Environment’s response affects our subsequent actions
We find out the effects of our actions later
2
Paradigms of learning
Supervised learning
Training data: features and labels for $N$ samples, $\{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$
Unsupervised learning
Training data: only features for $N$ samples, $\{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$
Reinforcement learning
Training data: a sequence of (state, action, reward) triples (s, a, r)
The agent acts on its environment and receives an evaluation of its action via a reinforcement signal
It is not told which action is the correct one for achieving its goal
3
Reinforcement Learning (RL)
𝑆: Set of states
𝐴: Set of actions
4
Reinforcement Learning: Example
Chess game (deterministic game)
Learning task: choose moves at arbitrary board states
Training signal: final win or loss
Training: e.g., played n games against itself
5
Main characteristics and applications of RL
Main characteristics of RL
Learning is a multistage decision making process
Actions influence later perceptions (inputs)
Delayed reward: actions may affect not only the immediate reward but
also subsequent rewards
agent must learn from interactions with environment
Agent must be able to learn from its own experience
Not entirely supervised, but interactive
by trial-and-error
Opportunity for active exploration
Needs trade-off between exploration and exploitation
Goal: map states to actions, so as to maximize reward over
time (optimal policy)
6
Main elements of RL
A reward function
It maps each state (or, state-action pair) to a real number,
called reward.
A value function
Value of a state (or state-action) is the total expected reward,
starting from that state (or state-action).
7
RL problem: deterministic environment
Deterministic
Transition and reward functions
At time 𝑡:
Agent observes state 𝑠𝑡 ∈ 𝑆
Then chooses action 𝑎𝑡 ∈ 𝐴
Then receives reward 𝑟𝑡 , and state changes to 𝑠𝑡+1
Stochastic environment
Stochastic transition and/or reward function
9
RL deterministic world: Example
Example: Robot grid world
Deterministic and known reward and transitions
10
Non-deterministic world: Example
[Russell, AIMA, 2010]
Reward for all other (non-terminal) transitions: −0.04
11
Optimal policy
13
Markov Decision Process (RL Setting)
We encounter a multistage decision making process.
Markov assumption:
$$P(s_{t+1}, r_t \mid s_t, a_t, r_{t-1}, s_{t-1}, a_{t-1}, r_{t-2}, \ldots) = P(s_{t+1}, r_t \mid s_t, a_t)$$
Components:
states
actions
state transition probabilities
reward function
discount factor
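To make these components concrete, here is a minimal sketch (not from the slides) of how a small MDP could be represented in Python; the three-state chain, its transition probabilities, and its rewards are made up purely for illustration.

```python
# A tiny hypothetical MDP, represented with plain Python dictionaries.
# P[s][a][s2] is the transition probability P(s' | s, a);
# R[s][a][s2] is the expected immediate reward for that transition.

GAMMA = 0.9                      # discount factor
STATES = ["s0", "s1", "s2"]
ACTIONS = ["stay", "go"]

P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.8, "s0": 0.2}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s2": 0.8, "s1": 0.2}},
    "s2": {"stay": {"s2": 1.0}, "go": {"s2": 1.0}},
}

R = {
    "s0": {"stay": {"s0": 0.0}, "go": {"s1": 0.0, "s0": 0.0}},
    "s1": {"stay": {"s1": 0.0}, "go": {"s2": 1.0, "s1": 0.0}},
    "s2": {"stay": {"s2": 0.0}, "go": {"s2": 0.0}},
}
```

Later sketches in this section assume this same dictionary layout for (P, R, GAMMA).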
15
[Figure: MDP search tree, from state s to action a, Q-state (s, a), transition (s, a, s′), and next state s′]
16
MDP: Recycling Robot example
$S = \{\text{high}, \text{low}\}$
$A = \{\text{search}, \text{wait}, \text{recharge}\}$
$\mathcal{A}(\text{high}) = \{\text{search}, \text{wait}\}$ (available actions in the 'high' state)
$\mathcal{A}(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$
$\mathcal{R}_{\text{search}} > \mathcal{R}_{\text{wait}}$
17
State-value function for policy 𝜋
Given a policy $\pi$, define the value function
$$V^{\pi}(s) = E\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, \pi\right]$$
18
Known Environments
19
Recursive definition for $V^{\pi}(s)$
$$V^{\pi}(s) = E\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, \pi\right]$$
$$= E\!\left[r_t + \gamma \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \,\middle|\, s_t = s, \pi\right]$$
$$= E\!\left[r_t + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, \pi\right]$$
$$= \sum_{s'} \mathcal{P}_{ss'}^{\pi(s)} \left( \mathcal{R}_{ss'}^{\pi(s)} + \gamma\, E\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_{t+1} = s', \pi\right] \right)$$
where
$$\mathcal{P}_{ss'}^{a} = P(s_{t+1} = s' \mid s_t = s, a_t = a), \qquad \mathcal{R}_{ss'}^{a} = E[r_t \mid s_t = s, a_t = a, s_{t+1} = s'],$$
and the inner expectation is $V^{\pi}(s')$.
Bellman equations:
$$V^{\pi}(s) = \sum_{s'} \mathcal{P}_{ss'}^{\pi(s)} \left( \mathcal{R}_{ss'}^{\pi(s)} + \gamma V^{\pi}(s') \right)$$
[Figure: one-step look-ahead tree for a fixed policy: state s, action π(s), transition (s, π(s), s′), next state s′]
21
State-action value function for policy 𝜋
$$Q^{\pi}(s, a) = E\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, a_t = a, \pi\right]$$
$$= \sum_{s'} \mathcal{P}_{ss'}^{a} \left( \mathcal{R}_{ss'}^{a} + \gamma\, E\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_{t+1} = s', \pi\right] \right)$$
$$Q^{\pi}(s, a) = \sum_{s'} \mathcal{P}_{ss'}^{a} \left( \mathcal{R}_{ss'}^{a} + \gamma V^{\pi}(s') \right)$$
[Figure: Q-state look-ahead tree: state s, action a, Q-state (s, a), transition (s, a, s′), next state s′]
22
State-value function for policy 𝜋: example
Deterministic example:
$$V^{\pi}(s) = \mathcal{R}_{ss'}^{\pi(s)} + \gamma V^{\pi}(s'), \qquad \gamma = 0.9$$
[Figure: grid world showing state values under π, including 66, 73, 81, 90, and 100]
23
Grid-world: value function example
24
Optimal policy
Policy $\pi$ is better than (or equal to) $\pi'$ (i.e., $\pi \ge \pi'$) iff
$$V^{\pi}(s) \ge V^{\pi'}(s), \quad \forall s \in S$$
Optimal policy $\pi^*$:
$$\pi^*(s) = \arg\max_{\pi} V^{\pi}(s), \quad \forall s \in S$$
25
MDP: Optimal policy
state-value and action-value functions
Optimal policies share the same optimal state-value function ($V^{\pi^*}(s)$ will be abbreviated as $V^*(s)$):
$$V^*(s) = \max_{\pi} V^{\pi}(s), \quad \forall s \in S$$
26
Optimal Quantities
27
Bellman optimality equation
$$V^*(s) = \max_{a \in \mathcal{A}(s)} Q^*(s, a)$$
$$Q^*(s, a) = \sum_{s'} \mathcal{P}_{ss'}^{a} \left( \mathcal{R}_{ss'}^{a} + \gamma V^*(s') \right)$$
$$V^*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s'} \mathcal{P}_{ss'}^{a} \left( \mathcal{R}_{ss'}^{a} + \gamma V^*(s') \right)$$
$$Q^*(s, a) = \sum_{s'} \mathcal{P}_{ss'}^{a} \left( \mathcal{R}_{ss'}^{a} + \gamma \max_{a'} Q^*(s', a') \right)$$
28
Snapshot of Demo – Gridworld V Values
Noise = 0.2
Discount = 0.9
Living reward = 0
29
Snapshot of Demo – Gridworld Q Values
Noise = 0.2
Discount = 0.9
Living reward = 0
30
Optimal policy
If we have 𝑉 ∗ (𝑠) and 𝑃(𝑠𝑡+1 |𝑠𝑡 , 𝑎𝑡 ) we can compute 𝜋 ∗ (𝑠)
31
Value Iteration algorithm
Consider only MDPs with finite state and action spaces:
Asynchronous: Instead of updating values for all states at once in each iteration,
it can update them state by state, or more often to some states than others.
32
Value Iteration
33
Value Iteration
Start with V0(s) = 0: no time steps left means an expected reward sum of zero
Given vector of Vk(s) values, do one ply of expectimax from each state:
$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V_k(s') \right]$$
Repeat until convergence.
Complexity of each iteration: O(S2A)
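A minimal sketch of this update in Python, assuming the dictionary-based MDP layout introduced earlier (P[s][a][s2], R[s][a][s2]) and that every action is available in every state; the stopping threshold is an arbitrary choice.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Iterate V_{k+1}(s) = max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma * V_k(s')]."""
    V = {s: 0.0 for s in states}              # V_0(s) = 0
    while True:
        delta = 0.0
        V_new = {}
        for s in states:
            V_new[s] = max(
                sum(p * (R[s][a][s2] + gamma * V[s2])
                    for s2, p in P[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < tol:                       # repeat until convergence
            return V
```

Each sweep touches every state, action, and successor once, matching the O(S²A) per-iteration cost noted above.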
34
[Figures, slides 35-48: value iteration snapshots on the gridworld for k = 0, 1, 2, …, 12 and k = 100; in all snapshots Noise = 0.2, Discount = 0.9, Living reward = 0]
Convergence*
Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values.
Case 2: If the discount γ is less than 1, the error shrinks geometrically:
$$\max_{s \in \mathcal{S}} \left| V_k(s) - V^*(s) \right| \le \frac{\gamma^{k} \max |R|}{1 - \gamma}$$
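As a quick worked illustration of the bound (assuming, purely for the sake of the example, rewards bounded by $\max|R| = 1$ and the demo's discount $\gamma = 0.9$):
$$\max_{s} \left| V_k(s) - V^*(s) \right| \le \frac{0.9^{k} \cdot 1}{1 - 0.9} = 10 \cdot 0.9^{k}, \qquad 10 \cdot 0.9^{50} \approx 0.05,$$
so roughly 50 sweeps suffice for an accuracy of about 0.05 in this hypothetical setting.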
49
Convergence and contractions
Let $\|\cdot\|$ denote the max norm: $\|V\| = \max_{s} |V(s)|$
50
Value Iteration
Value iteration works even if we randomly traverse the
environment instead of looping through each state and action
(update asynchronously)
but we must still visit each state infinitely often
Value iteration is time- and memory-expensive
51
Problems with Value Iteration
Value iteration repeats the Bellman updates:
$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V_k(s') \right]$$
Problem 1: it is slow (O(S²A) per iteration)
Problem 2: the "max" at each state rarely changes
Problem 3: the policy often converges long before the values
52
Convergence: Example
53
Computing Actions from Values
Given values V*, act by a one-step look-ahead:
$$\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^*(s') \right]$$
This is called policy extraction, since it gets the policy implied by the values.
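A minimal policy extraction sketch, under the same hypothetical dictionary-based MDP layout used in the earlier sketches:

```python
def extract_policy(states, actions, P, R, V, gamma=0.9):
    """One-step look-ahead: pick the action maximizing expected reward + gamma * V(s')."""
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                              for s2, p in P[s][a].items()),
        )
    return policy
```

With Q-values, as the next slide notes, extraction is even simpler: no model is needed, only an argmax over the stored Q(s, a) entries.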
54
Computing Actions from Q-Values
Let's imagine we have the optimal Q-values; then acting is trivial:
$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
Important lesson: actions are easier to select from Q-values than from values!
55
Policy Iteration Algorithm
1) Initialize 𝜋(𝑠) arbitrarily
2) Repeat until convergence:
   Compute the value function for the current policy 𝜋 (i.e., $V^{\pi}$):
   $V \leftarrow V^{\pi}$
   For each $s \in S$:
   $$\pi(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V(s') \right]$$
56
Fixed Policies Evaluation
[Figure: expectimax tree over all actions (s, a, s′) next to the simpler tree for a fixed policy (s, π(s), s′)]
Expectimax trees max over all actions to compute the optimal values. If we fix some policy π(s), the tree becomes simpler: only one action per state.
57
Utilities for a Fixed Policy
58
Policy Evaluation
How do we calculate the V's for a fixed policy π?
Idea 1: turn the recursive Bellman equations into updates (as in value iteration):
$$V_{k+1}^{\pi}(s) \leftarrow \sum_{s'} P(s' \mid s, \pi(s))\left[ R(s, \pi(s), s') + \gamma V_k^{\pi}(s') \right]$$
Idea 2: without the maxes, the Bellman equations are just a linear system, which can be solved directly.
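Following Idea 2, $V^{\pi}$ can be obtained in closed form by solving $(I - \gamma P^{\pi}) V = R^{\pi}$. A sketch with NumPy, again assuming the hypothetical dictionary MDP layout from earlier:

```python
import numpy as np

def evaluate_policy_exactly(states, P, R, policy, gamma=0.9):
    """Solve (I - gamma * P_pi) V = R_pi for the fixed policy pi."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((n, n))      # P_pi[i, j] = P(s_j | s_i, pi(s_i))
    R_pi = np.zeros(n)           # expected one-step reward under pi
    for s in states:
        a = policy[s]
        for s2, p in P[s][a].items():
            P_pi[idx[s], idx[s2]] = p
            R_pi[idx[s]] += p * R[s][a][s2]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    return {s: V[idx[s]] for s in states}
```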
59
Policy Iteration
Evaluation: for the fixed current policy $\pi_i$, find values with policy evaluation; iterate until the values converge:
$$V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s'} P(s' \mid s, \pi_i(s))\left[ R(s, \pi_i(s), s') + \gamma V_k^{\pi_i}(s') \right]$$
Improvement: for fixed values, get a better policy using policy extraction (one-step look-ahead):
$$\pi_{i+1}(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{\pi_i}(s') \right]$$
60
Policy Iteration
Repeat steps until policy converges
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal
utilities!) until convergence
Step 2: Policy improvement: update policy using one-step look-ahead with
resulting converged (but not optimal!) utilities as future values
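Putting the two steps together, here is a minimal policy iteration sketch (iterative evaluation variant), assuming the same hypothetical MDP dictionaries as the earlier sketches:

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eval_tol=1e-6):
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        # Step 1: policy evaluation (iterate the fixed-policy Bellman update)
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = sum(p * (R[s][a][s2] + gamma * V[s2])
                        for s2, p in P[s][a].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_tol:
                break
        # Step 2: policy improvement (one-step look-ahead)
        new_policy = {
            s: max(actions,
                   key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                     for s2, p in P[s][a].items()))
            for s in states
        }
        if new_policy == policy:                       # policy converged
            return policy, V
        policy = new_policy
```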
61
Policy Iteration Optimality: Proof Sketch
Convergence:
Every step improves the policy, and each policy is reached at most once.
Thus, the number of iterations is at most the number of different policies (i.e., $|\mathcal{A}|^{|\mathcal{S}|}$).
Optimal at convergence:
At convergence, $\forall s:\ \pi_{k+1}(s) = \pi_k(s)$.
Indeed, $\forall s:\ V^{\pi_k}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]$
Therefore, $V^{\pi_k}$ satisfies the Bellman optimality equation and thus $V^{\pi_k} = V^*$
62
When to stop iterations:
In value iteration:
Every iteration updates both the values and (implicitly) the policy
We don’t track the policy, but taking the max over actions implicitly recomputes
it
In policy iteration:
We do several passes that update utilities with fixed policy (each pass is fast
because we consider only one action, not all of them)
After the policy is evaluated, a new policy is chosen (slow like a value iteration
pass)
The new policy will be better (or we’re done)
65
Unknown transition model
So far: learning the optimal policy when we know $\mathcal{P}_{ss'}^{a}$ and $\mathcal{R}_{ss'}^{a}$
66
Reinforcement Learning
67
Reinforcement Learning
[Figure: the agent-environment loop: the agent observes state s, takes action a, and receives reward r from the environment]
Basic idea:
Receive feedback in the form of rewards
Agent’s utility is defined by the reward function
Must (learn to) act so as to maximize expected rewards
All learning is based on observed samples of outcomes!
68
What Just Happened?
That wasn’t planning, it was learning!
Specifically, reinforcement learning
There was an MDP, but you couldn’t solve it with just computation
You needed to actually act to figure it out
69
RL algorithms
Model-based (passive)
Learn model of environment (transition and reward
probabilities)
Then, value iteration or policy iteration algorithms
Model-free (active)
70
Model-Based Learning
Model-Based Idea:
Learn an approximate model based on experiences
Solve for values as if the learned model were correct
71
Model free approaches to solve RL problems
These methods can be categorized into:
Monte Carlo methods
Temporal-difference learning
Q-learning is a more recent approach to this problem.
72
Direct Evaluation of a Policy
73
Example: Direct Policy Evaluation
74
Monte Carlo methods
do not assume complete knowledge of the environment
75
Monte Carlo policy evaluation
1) Initialize 𝑄 arbitrarily and 𝑅𝑒𝑡𝑢𝑟𝑛𝑠(s, a) to empty lists
2) Repeat:
   Generate an episode using 𝜋 and exploring starts
   For each pair (s, a) appearing in the episode: append the return following its first occurrence to 𝑅𝑒𝑡𝑢𝑟𝑛𝑠(s, a), and set 𝑄(s, a) to the average of 𝑅𝑒𝑡𝑢𝑟𝑛𝑠(s, a) (see the sketch below)
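A minimal first-visit Monte Carlo evaluation sketch. It assumes complete episodes are already available as lists of (state, action, reward) triples, e.g., generated by a simulator with exploring starts; all names are illustrative.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    """First-visit Monte Carlo estimate of Q(s, a).

    Each episode is a list of (state, action, reward) triples obtained by
    following pi (with exploring starts).
    """
    returns = defaultdict(list)                      # Returns(s, a)
    Q = {}
    for episode in episodes:
        # Discounted return following each time step, computed backwards.
        G, G_from = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = r + gamma * G
            G_from[t] = G
        # First-visit update: use the return after the first occurrence of (s, a).
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in seen:
                seen.add((s, a))
                returns[(s, a)].append(G_from[t])
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```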
76
Monte Carlo RL algorithm
1) Initialize 𝑄 and 𝜋 arbitrarily and 𝑅𝑒𝑡𝑢𝑟𝑛𝑠(s, a) to empty lists
2) Repeat:
   Generate an episode using 𝜋 and exploring starts
   For each pair (s, a) in the episode: append the observed return to 𝑅𝑒𝑡𝑢𝑟𝑛𝑠(s, a) and set 𝑄(s, a) to the average of 𝑅𝑒𝑡𝑢𝑟𝑛𝑠(s, a)
   For each s in the episode: 𝜋(s) ← argmax_a 𝑄(s, a)
77
Problems with Direct Evaluation
78
Connections between states
79
Connections between states
80
Temporal Difference Learning
Big idea: learn from every experience!
Update V(s) each time we experience a transition (s, a, s′, r)
Likely outcomes s′ will contribute updates more often
Temporal difference learning of values:
The policy is still fixed; we are still doing evaluation!
Move values toward the value of whatever successor actually occurs: a running average
Sample of V(s): $\text{sample} = r + \gamma V^{\pi}(s')$
Update to V(s): $V^{\pi}(s) \leftarrow (1 - \alpha)\, V^{\pi}(s) + \alpha \cdot \text{sample}$
Same update: $V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\, (\text{sample} - V^{\pi}(s))$
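A minimal TD(0) value-learning sketch for a fixed policy; `env_reset`, `env_step`, and `policy` are hypothetical callables standing in for the environment and the fixed policy π.

```python
from collections import defaultdict

def td0_evaluation(env_reset, env_step, policy, num_episodes=1000,
                   alpha=0.1, gamma=0.9):
    """TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    env_reset() -> initial state
    env_step(s, a) -> (next_state, reward, done)
    policy(s) -> action chosen by the fixed policy pi
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * V[s_next])  # sample of V(s)
            V[s] += alpha * (target - V[s])                    # move toward the sample
            s = s_next
    return V
```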
81
Exponential Moving Average
The running average $\bar{x}_n = (1 - \alpha)\, \bar{x}_{n-1} + \alpha\, x_n$ makes recent samples more important and forgets about the past (distant past values were wrong anyway). Decreasing the learning rate α over time can give converging averages.
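A tiny sketch showing the running average in code (α and the sample values are made up for illustration):

```python
def exponential_moving_average(samples, alpha=0.5):
    """x_bar_n = (1 - alpha) * x_bar_{n-1} + alpha * x_n."""
    x_bar = samples[0]
    for x in samples[1:]:
        # Each older sample's contribution decays by a factor of (1 - alpha).
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar

print(exponential_moving_average([0.0, 0.0, 8.0], alpha=0.5))  # prints 4.0
```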
82
Example: Temporal Difference Learning
[Figure: states A, B, C, D, E with value estimates at three successive stages; A and E stay at 0, while the middle row (B, C, D) evolves from (0, 0, 8) to (−1, 0, 8) to (−1, 3, 8)]
Assume: γ = 1, α = 1/2
83
Temporal Difference Methods
TD learning is a combination of MC and DP (i.e., Bellman equation) ideas.
Like MC methods, can learn directly from raw experience
without a model of the environment's dynamics.
84
Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation,
mimicking Bellman updates with running sample averages
85
Unknown transition model: New policy
With a model, state values alone are sufficient to determine a policy:
simply look ahead one step and choose whichever action leads to the best
combination of reward and next-state value.
86
Detour: Q-Value Iteration
Value iteration can equivalently be run over Q-values, using the Bellman optimality equation for Q:
$$Q_{k+1}(s, a) \leftarrow \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$$
87
Q-Learning
We'd like to do Q-value updates to each Q-state, but we cannot compute the expectation over s′ without the transition model; instead, Q-learning estimates it from observed samples (s, a, r, s′) using a running average.
88
Q-Learning
89
Model-Free Learning
90
Video of Demo Q-Learning Auto Cliff Grid
91
Exploration/exploitation tradeoff
Exploitation: obtaining high rewards by trying previously well-rewarded actions
Exploration: trying new actions whose payoff is unknown, which may lead to higher rewards later
92
How to Explore?
93
Regret
Even if you learn the optimal policy, you still make mistakes along the way!
Regret is a measure of your total mistake cost: the difference between
your (expected) rewards, including youthful suboptimality, and optimal
(expected) rewards
Minimizing regret goes beyond learning to be optimal – it requires
optimally learning to be optimal
Example: random exploration and exploration functions both end up
optimal, but random exploration has higher regret
94
Q-learning: Policy
Greedy action selection:
$$\pi(s) = \arg\max_{a} Q(s, a)$$
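A minimal ε-greedy selection sketch, the standard answer to the earlier "How to Explore?" question; the value of ε is an arbitrary illustrative choice.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit
```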
95
Q-learning Algorithm
Initialize 𝑄(𝑠, 𝑎) arbitrarily
Repeat (for each episode):
    Initialize 𝑠
    Repeat (for each step of the episode):
        Choose 𝑎 from 𝑠 using a policy derived from 𝑄 (e.g., ε-greedy, softmax, …)
        Take action 𝑎, receive reward 𝑟, observe new state 𝑠′
        $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
        𝑠 ← 𝑠′
    until 𝑠 is terminal
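A minimal tabular Q-learning sketch following the pseudocode above, with ε-greedy action selection inlined; `env_reset` and `env_step` are hypothetical environment hooks.

```python
import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, num_episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    env_reset() -> initial state
    env_step(s, a) -> (next_state, reward, done)
    """
    Q = defaultdict(float)                        # Q(s, a), initialized to 0
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            # Choose a from s using a policy derived from Q (epsilon-greedy).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env_step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```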
96
Q-learning convergence
Q-learning converges to optimal Q-values if
Every state is visited infinitely often
The policy for action selection becomes greedy as time
approaches infinity
The step size parameter is chosen appropriately
97
Q-Learning Properties
Amazing result: Q-learning converges to optimal policy --
even if you’re acting suboptimally!
Caveats:
You have to explore enough
You have to eventually make the learning rate small enough
… but not decrease it too quickly
Basically, in the limit, it doesn’t matter how you select actions (!)
98
SARSA (State-Action-Reward-State-Action)
Initialize 𝑄(𝑠, 𝑎) arbitrarily
Repeat (for each episode):
    Initialize 𝑠
    Choose 𝑎 from 𝑠 using a policy derived from 𝑄 (e.g., greedy, ε-greedy)
    Repeat (for each step of the episode):
        Take action 𝑎, receive reward 𝑟, observe new state 𝑠′
        Choose 𝑎′ from 𝑠′ using a policy derived from 𝑄
        $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$
        𝑠 ← 𝑠′
        𝑎 ← 𝑎′
    until 𝑠 is terminal
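The on-policy SARSA update differs from Q-learning only in using the action actually chosen in s′ for the target; a minimal sketch with the same hypothetical environment hooks as before:

```python
import random
from collections import defaultdict

def sarsa(env_reset, env_step, actions, num_episodes=5000,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    """On-policy SARSA: the target uses Q(s', a') for the a' actually chosen."""
    Q = defaultdict(float)

    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda act: Q[(s, act)])

    for _ in range(num_episodes):
        s = env_reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = choose(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```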
99
Q-learning vs. SARSA
100
Tabular methods: Problem
All of the methods introduced so far maintain a table of values (V or Q), with one entry per state or state-action pair, which does not scale to large or continuous state spaces.
101
Generalizing Across States
102
Example: Pacman
103
Feature-Based Representations
104
Linear Value Functions
Using a feature representation, we can write a Q-function (or value function) for any state using a few weights:
$$V_{\mathbf{w}}(s) = w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s)$$
$$Q_{\mathbf{w}}(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + \dots + w_n f_n(s, a)$$
Advantages:
our experience is summed up in a few powerful numbers
we can generalize from visited states to unvisited ones
much less space is required
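A minimal sketch of a linear Q-function; the feature function and the Pacman-style feature names are hypothetical stand-ins for whatever hand-designed features one chooses.

```python
import numpy as np

def linear_q(weights, features, state, action):
    """Q_w(s, a) = w_1 f_1(s,a) + ... + w_n f_n(s,a) = w . f(s, a)."""
    return float(np.dot(weights, features(state, action)))

# Made-up features: [bias, 1/distance-to-food, number of ghosts nearby]
features = lambda s, a: np.array([1.0,
                                  s.get("inv_food_dist", 0.0),
                                  s.get("ghosts_near", 0.0)])
w = np.array([0.5, 2.0, -5.0])      # illustrative weights

print(linear_q(w, features, {"inv_food_dist": 0.25, "ghosts_near": 1.0}, "north"))
# prints -4.0
```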
105
Least square error: minimization
Imagine we had only one point x, with features f(x), target value y, and weights w. The squared error is
$$\text{error}(\mathbf{w}) = \frac{1}{2} \left( y - \sum_k w_k f_k(x) \right)^2,$$
and following its negative gradient gives the update $w_k \leftarrow w_k + \alpha \left( y - \sum_k w_k f_k(x) \right) f_k(x)$: "target" minus "prediction", times the feature value.
Intuitive interpretation:
Adjust weights of active features
E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states
with that state’s features
106
Least square error: minimization
In the general case (e.g., when 𝑄𝜃 (𝑠, 𝑎) is modeled by a neural
network):
$$L(\theta) = E_{s,a,s',r}\left[ \left( \text{target} - Q_\theta(s, a) \right)^2 \right], \qquad \text{target} = r + \gamma \max_{a'} Q_\theta(s', a')$$
Treating the target as a constant, the (semi-)gradient used for the update is
$$E_{s,a,s',r}\left[ \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right) \nabla_\theta Q_\theta(s, a) \right]$$
107
Adjusting function weights
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \left[ r + \gamma V_{\boldsymbol{\theta}}(s') - V_{\boldsymbol{\theta}}(s) \right] \nabla_{\boldsymbol{\theta}} V_{\boldsymbol{\theta}}(s)$$
or
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \left[ r + \gamma \max_{a'} Q_{\boldsymbol{\theta}}(s', a') - Q_{\boldsymbol{\theta}}(s, a) \right] \nabla_{\boldsymbol{\theta}} Q_{\boldsymbol{\theta}}(s, a)$$
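For the linear case, $\nabla_{\mathbf{w}} Q_{\mathbf{w}}(s, a) = f(s, a)$, so the update above reduces to adjusting each weight in proportion to its feature value. A minimal sketch (all names are illustrative):

```python
import numpy as np

def linear_q_update(w, features, s, a, r, s_next, actions,
                    alpha=0.01, gamma=0.9, done=False):
    """w <- w + alpha * [r + gamma * max_a' Q_w(s',a') - Q_w(s,a)] * f(s,a)."""
    q_sa = np.dot(w, features(s, a))
    best_next = 0.0 if done else max(np.dot(w, features(s_next, a2))
                                     for a2 in actions)
    td_error = r + gamma * best_next - q_sa        # "target" minus "prediction"
    return w + alpha * td_error * features(s, a)   # gradient of Q_w is f(s, a)
```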
108
Deep RL
𝑄(𝑠, 𝑎; 𝜃): neural network with parameter 𝜃
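A minimal sketch of such a network in PyTorch; the architecture, sizes, and optimizer settings are arbitrary illustrative choices, not those of the DQN paper cited below.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, a; theta): maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

# One semi-gradient step on a single (s, a, r, s', done) sample:
q_net = QNetwork(state_dim=4, num_actions=2)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

s, s_next = torch.randn(1, 4), torch.randn(1, 4)
a, r, done = 0, 1.0, False

with torch.no_grad():                                  # target treated as constant
    target = r + (0.0 if done else 0.9 * q_net(s_next).max().item())
loss = (target - q_net(s)[0, a]) ** 2                  # squared TD error
optimizer.zero_grad(); loss.backward(); optimizer.step()
```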
109
Playing Atari Games
110
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
DeepMind Atari (©Two Minute Lectures)
111
References
T. Mitchell, Machine Learning, McGraw-Hill, 1997. [Chapter 13]
112