
DS 102 Data, Inference, and Decisions Spring 2020

Lecture 24: Reinforcement Learning


Lecturer: Jacob Steinhardt

Last time, we reviewed dynamic programming and Markov Decision Processes, and then saw how to
combine these two ideas into the Value Iteration algorithm. Today, we will discuss Value Iteration in
more detail, as well as another useful variant of the algorithm called Q-iteration. We will then see
how to improve these approaches to incorporate learning from data and handle large state spaces.
Finally, having discussed all of the key conceptual ideas underlying reinforcement learning (RL),
we will briefly discuss two more advanced approaches: temporal difference learning and policy
gradient.

24.1 Recap: Value Iteration

Recall that, for Value Iteration, we have a Markov Decision Process (MDP) with transition probabilities
P(s' | a, s) (note that in some places in the literature, these transition probabilities are
instead denoted as T(s', a, s)), rewards R(s', a, s), and possibly a discount parameter γ. Our goal
is to compute the value function V(s), which gives the expected reward starting from state s under
the optimal policy. We can define V recursively,
\[
V(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid a, s)\bigl(R(s', a, s) + \gamma V(s')\bigr),
\]

but to avoid computational issues with this circular definition, we also add a time component, V_t(s),
where V_0(s) = 0 and
\[
V_{t+1}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid a, s)\bigl(R(s', a, s) + \gamma V_t(s')\bigr).
\]
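
To make the recursion concrete, here is a minimal sketch of Value Iteration for a small tabular MDP, assuming the dynamics are given as arrays P[s, a, s'] and R[s, a, s'] (the array names, shapes, and argument order are illustrative choices, not something specified in the lecture):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=100):
    """Tabular Value Iteration sketch.

    P[s, a, s2] is the probability of landing in s2 after taking action a in state s;
    R[s, a, s2] is the reward for that transition.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                                   # V_0(s) = 0
    for _ in range(n_iters):
        # expected return of each (s, a): sum_{s'} P(s' | a, s) * (R + gamma * V_t(s'))
        Q = np.einsum('saz,saz->sa', P, R + gamma * V[None, None, :])
        V = Q.max(axis=1)                                    # V_{t+1}(s) = max_a Q(s, a)
    return V
```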

24.2 Q-iteration

Recall that, in addition to the value function V, we defined the Q function, where Q(s, a) is the
expected reward if we start from state s, take action a, and then follow the optimal policy thereafter.
Compared to V, Q is useful since it helps us understand which actions are optimal. The
value and Q functions are very closely related:

\[
V(s) = \max_{a \in A} Q(s, a),
\]


and
\[
\begin{aligned}
Q(s, a) &= \sum_{s' \in S} P(s' \mid a, s)\bigl(R(s', a, s) + \gamma V(s')\bigr) \\
        &= \sum_{s' \in S} P(s' \mid a, s)\Bigl(R(s', a, s) + \gamma \max_{a' \in A} Q(s', a')\Bigr).
\end{aligned}
\]

We can approach computing Q the same way we computed V in the previous section, namely by
adding a time component and using dynamic programming: Q_0(s, a) = 0 and
\[
Q_{t+1}(s, a) = \sum_{s' \in S} P(s' \mid a, s)\Bigl(R(s', a, s) + \gamma \max_{a' \in A} Q_t(s', a')\Bigr).
\]
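
A corresponding Q-iteration sketch differs only in what we store between iterations: the full table Q_t(s, a) rather than V_t(s). The same illustrative array layout is assumed as in the Value Iteration sketch above:

```python
import numpy as np

def q_iteration(P, R, gamma=0.9, n_iters=100):
    """Tabular Q-iteration sketch; P[s, a, s2] and R[s, a, s2] as in value_iteration."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))                      # Q_0(s, a) = 0
    for _ in range(n_iters):
        V = Q.max(axis=1)                                    # max_{a'} Q_t(s', a') for every s'
        # Q_{t+1}(s, a) = sum_{s'} P(s' | a, s) * (R + gamma * max_{a'} Q_t(s', a'))
        Q = np.einsum('saz,saz->sa', P, R + gamma * V[None, None, :])
    return Q
```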

24.3 Q-Learning

Value and Q-Iteration both assume that the MDP is fully known. What if the transition dynamics
P(s' | a, s) are unknown? It turns out that, even when the transition probabilities are unknown, we
can still estimate Q from data. This is called Q-learning.
Suppose we have data about several different trajectories, where a trajectory is a sequence

\[
s_0, a_0, s_1, a_1, \ldots, s_i, a_i, \ldots
\]

that comes from some policy π : S → A. After each state-action pair (s_t, a_t) in a trajectory, we can
update our estimate of Q:

\[
Q(s_t, a_t) = (1 - \alpha)\, Q_{\mathrm{old}}(s_t, a_t) + \alpha \Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s_{t+1}, a')\Bigr).
\]

This update is a weighted average of our previous estimate of the Q value and something that looks
similar to the update we used in Q-iteration. In fact, it turns out that the right-hand term of this
Q-learning update is equal to the Q-iteration update in expectation:
\[
\mathbb{E}_{s_{t+1}}\Bigl[R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s_{t+1}, a')\Bigr] = \sum_{s' \in S} P(s' \mid a_t, s_t)\Bigl(R(s', a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s', a')\Bigr).
\]

Thus, our Q-learning update slowly averages in new estimates of the Q-function (by way of “stochastic”
Q-iteration updates), while the weighted average with our previous estimate of the Q value
helps stabilize our estimates.
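
As a rough sketch, the update above amounts to a single in-place change to a Q table after each observed transition; the function below is an illustration of this idea (the argument names are ours, not the lecture's):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update from an observed transition.

    Q is an (n_states, n_actions) array and r plays the role of R(s_{t+1}, a_t, s_t).
    """
    target = r + gamma * Q[s_next].max()          # stochastic estimate of the Q-iteration target
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q
```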
There are convergence theorems which specify conditions under which this Q-learning algorithm
will converge to the true Q-value function for the optimal policy. These theorems have two im-
portant requirements. First, they typically require that α → 0 over time. In practice, however, it
usually works to fix α to a small value. Second, both in theory and in practice we need to make
sure to explore the state space enough. There is an analogy to the multi-armed bandit setting
from Lectures 17 and 21: just as we needed to pull every arm sufficiently often there, here we need
to visit all of the states sufficiently often. One way to
accomplish this is to explicitly define an exploration policy that does a good job of visiting all of

the states, but it can be challenging to design such a policy by hand if there are some states that
can only be reached by very specific actions/trajectories. An alternative approach is to follow the
induced policy from the current estimate of Q(s, a), which means taking the action that looks the
best given the current estimate of the Q function. However, this approach can get stuck and end
up not exploring enough; it is better to make sure that some fraction of the time we take a random
action. It often also helps to initialize Q_0(s, a) to some optimistic (i.e., very large) value, which is
similar in spirit to the UCB algorithm in that it incentivizes exploring states that have not yet been
visited very often.
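
One common way to implement "take a random action some fraction of the time" is an ε-greedy rule on top of the current Q estimate; the sketch below also shows an optimistic initialization. The specific constants and sizes are illustrative:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    """Follow the policy induced by Q, but take a random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))      # explore: uniformly random action
    return int(np.argmax(Q[s]))                   # exploit: best action under the current Q estimate

# Optimistic initialization (sizes are illustrative): every unvisited pair looks
# attractive until it has actually been tried, which encourages exploration.
n_states, n_actions = 20, 4
Q0 = np.full((n_states, n_actions), 10.0)
```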

24.4 Function Approximation

Q-learning allowed us to improve upon the Q-Iteration algorithm by incorporating learning, but
both of these approaches will do poorly at handling large state spaces. In particular, each Q-learning
update only updates Q(s, a) for a single state-action pair, so if we have a very large number of states
we will not be able to sufficiently explore them all. Intuitively, we would like to somehow update
Q(s, a) for all similar state-action pairs at once. We accomplish this by using function approxima-
tion: we parameterize Q as Qθ (s, a). For example, we often model Q as a neural network, and θ
are the parameters of the network.
Recall our Q-learning update,
\[
\begin{aligned}
Q(s_t, a_t) &= (1 - \alpha)\, Q_{\mathrm{old}}(s_t, a_t) + \alpha \Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s_{t+1}, a')\Bigr) \\
&= Q_{\mathrm{old}}(s_t, a_t) + \alpha \Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s_{t+1}, a') - Q_{\mathrm{old}}(s_t, a_t)\Bigr),
\end{aligned}
\]

where the second form of the update emphasizes the similarity to a stochastic gradient descent
update with step size α. Instead of updating Q directly as in Q-learning, our new update will
update θ:
 
\[
\theta = \theta_{\mathrm{old}} + \alpha \Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\theta_{\mathrm{old}}}(s_{t+1}, a') - Q_{\theta_{\mathrm{old}}}(s_t, a_t)\Bigr)\nabla_\theta Q_{\theta_{\mathrm{old}}}(s_t, a_t).
\]

This function approximation approach is an adaptation of Q-learning which is much better able to
handle large state spaces.
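
For concreteness, here is a sketch of the θ update with a simple linear parameterization Q_θ(s, a) = θᵀφ(s, a), where the feature map φ is an assumption made for illustration; with a neural network, ∇_θ Q_θ would instead come from automatic differentiation:

```python
import numpy as np

def q_update_linear(theta, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """One Q-learning step with a linear approximation Q_theta(s, a) = theta @ phi(s, a).

    phi(s, a) is an assumed feature map returning a vector of the same length as theta.
    """
    q_sa = theta @ phi(s, a)                                 # Q_{theta_old}(s_t, a_t)
    q_next = max(theta @ phi(s_next, b) for b in actions)    # max_{a'} Q_{theta_old}(s_{t+1}, a')
    td_error = r + gamma * q_next - q_sa                     # bracketed term in the update
    # For a linear Q, the gradient grad_theta Q_theta(s, a) is just phi(s, a).
    return theta + alpha * td_error * phi(s, a)
```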
One (not entirely rigorous) way to intuit the form of the update rule for θ is to imagine that we
are performing a prediction task where we would like to predict the Q value and we are using
squared error as our loss:
\[
\Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\theta_{\mathrm{old}}}(s_{t+1}, a') - Q_{\theta_{\mathrm{old}}}(s_t, a_t)\Bigr)^2.
\]

If we imagine that we are only updating the θ_old of the last term Q_{θ_old}(s_t, a_t) and take a gradient,
we get
\[
-2\Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\theta_{\mathrm{old}}}(s_{t+1}, a') - Q_{\theta_{\mathrm{old}}}(s_t, a_t)\Bigr)\nabla_\theta Q_{\theta_{\mathrm{old}}}(s_t, a_t),
\]

so that taking one gradient step with unit step size would be equivalent to using our update rule for θ with α = 2.

24.5 Temporal Difference Learning

It is often the case that we only see a reward at the very end of a trajectory. For example, if we want
to use RL to play chess or Go, we do not see rewards until the end of the game when we either
win or lose. This leads to a credit assignment problem, wherein we have to determine which of the
actions we took along the way was really responsible for the reward obtained at the end of the trajectory.
In regular Q-learning, this will slow down convergence immensely: we need T updates for a
trajectory of length T to even get away from our initialization Q_0. Temporal Difference Learning
(TD(λ)) is a more advanced RL approach that helps deal with these situations by propagating
rewards backwards to all the previous states in the trajectory in an exponentially
decaying way (according to the parameter λ), so that each update for a trajectory is more efficient.
This approach was most famously used in TD-Gammon, an RL-trained program that learned to play
backgammon.
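
One standard way to implement this backwards, exponentially decaying propagation is with eligibility traces. The sketch below shows tabular TD(λ) for estimating the value function of a fixed policy; the data format and constants are illustrative assumptions:

```python
import numpy as np

def td_lambda(trajectories, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with (accumulating) eligibility traces for value estimation.

    Each trajectory is a list of (s, r, s_next) tuples generated by a fixed policy.
    """
    V = np.zeros(n_states)
    for traj in trajectories:
        e = np.zeros(n_states)                     # eligibility trace per state
        for s, r, s_next in traj:
            delta = r + gamma * V[s_next] - V[s]   # temporal-difference error
            e[s] += 1.0                            # mark the current state as eligible
            V += alpha * delta * e                 # credit all recently visited states
            e *= gamma * lam                       # decay traces exponentially (the lambda part)
    return V
```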

24.6 Policy Gradient

Policy gradient is an alternative way to approach RL problems. In Q-learning, we tried to build
some approximate Q function, Q_θ, and then optimize θ to get good Q values. Policy gradient takes
the alternative approach of learning a policy π_θ(a|s), a probability distribution over actions
given the state. The goal is

\[
\max_\theta \; \mathbb{E}_{\pi_\theta}\bigl[R(s_0, a_0, s_1) + \gamma R(s_1, a_1, s_2) + \cdots\bigr].
\]

To maximize this, we want the gradient of this quantity with respect to θ. Taking this gradient is a
bit tricky since θ appears in the distribution with respect to which we are taking the expectation.
To handle this, we can use the log-derivative trick, which says

\[
\nabla_\theta \, \mathbb{E}_{\pi_\theta}[R] = \mathbb{E}_{\pi_\theta}\bigl[R \, \nabla_\theta \log \pi_\theta\bigr],
\]

thereby allowing us to compute a gradient we can use to update the parameters θ.
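
As a sketch of how this gradient can be used in practice, the code below implements a REINFORCE-style update for a softmax policy with a linear score θᵀφ(s, a), using the total trajectory reward as R. The feature map φ, the data format, and the constants are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(a | s) proportional to exp(theta @ phi(s, a))."""
    scores = np.array([theta @ phi(s, b) for b in actions])
    scores -= scores.max()                          # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def policy_gradient_step(theta, phi, trajectory, actions, alpha=0.01):
    """One REINFORCE-style update: theta += alpha * R * grad_theta log pi_theta(trajectory).

    trajectory is a list of (s, a, r) tuples sampled from the current policy.
    """
    R = sum(r for _, _, r in trajectory)            # total trajectory reward (undiscounted here)
    grad = np.zeros_like(theta)
    for s, a, _ in trajectory:
        probs = softmax_policy(theta, phi, s, actions)
        feats = np.array([phi(s, b) for b in actions])
        # For a softmax policy, grad_theta log pi_theta(a | s) = phi(s, a) - E_{b ~ pi}[phi(s, b)].
        grad += phi(s, a) - probs @ feats
    return theta + alpha * R * grad
```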
