Mathematical Foundations
of
Reinforcement Learning
Shiyu Zhao
August, 2023
Contents

Preface
1 Basic Concepts
1.1 A grid world example
1.2 State and action
1.3 State transition
1.4 Policy
1.5 Reward
1.6 Trajectories, returns, and episodes
1.7 Markov decision processes
1.8 Summary
1.9 Q&A
Bibliography
Symbols
Index
Preface
This book aims to provide a mathematical but friendly introduction to the fundamental
concepts, basic problems, and classic algorithms in reinforcement learning. Some essential
features of this book are highlighted as follows.
The book introduces reinforcement learning from a mathematical point of view. Hopefully, readers will not only know the procedure of an algorithm but also understand why the algorithm was designed in the first place and why it works effectively.

The depth of the mathematics is carefully controlled to an adequate level. The mathematics is also presented in a carefully designed manner to ensure that the book is friendly to read. Readers can read the materials presented in gray boxes selectively according to their interests.

Many illustrative examples are given to help readers better understand the topics. All the examples in this book are based on a grid world task, which is easy to understand and helpful for illustrating concepts and algorithms.

When introducing an algorithm, the book aims to separate its core idea from complications that may be distracting. In this way, readers can better grasp the core idea of an algorithm.

The contents of the book are coherently organized. Each chapter is built based on the preceding chapter and lays a necessary foundation for the subsequent one.
This book is designed for senior undergraduate students, graduate students, researchers, and practitioners interested in reinforcement learning. It does not require readers to
have any background in reinforcement learning because it starts by introducing the most
basic concepts. If the reader already has some background in reinforcement learning, I
believe the book can help them understand some topics more deeply or provide different perspectives. This book, however, requires the reader to have some knowledge of
probability theory and linear algebra. Some basics of the required mathematics are also
included in the appendix of this book.
I have been teaching a graduate-level course on reinforcement learning since 2019. I
want to thank the students in my class for their feedback on my teaching. I put the draft
of this book online in August 2022. Up to now, I have received valuable feedback from
many readers. I want to express my gratitude to these readers. Moreover, I would like
to thank my research assistant, Jialing Lv, for her excellent support in editing the book
and my lecture videos; my teaching assistants, Jianan Li and Yize Mi, for their help in
my teaching; my Ph.D. student Canlun Zheng for his help in the design of a picture in
the book; and my family for their wonderful support. Finally, I would like to thank the
editors of this book, Mr. Sai Guo and Dr. Lanlan Chang from Tsinghua University Press
and Springer Nature Press, for their great support.
I sincerely hope this book can help readers smoothly enter the exciting field of rein-
forcement learning.
Shiyu Zhao
Overview of this Book
[Figure 1: A map of this book, relating the fundamental tools (Chapter 1: Basic Concepts; Chapter 2: Bellman Equation; Chapter 3: Bellman Optimality Equation) to the algorithms/methods chapters, which move from tabular representations to function representations (e.g., Chapter 7: Temporal-Difference Methods; Chapter 8: Value Function Approximation).]
Before we start the journey, it is important to look at the “map” of the book shown
in Figure 1. This book contains ten chapters, which can be classified into two parts: the
first part is about basic tools, and the second part is about algorithms. The ten chapters
are highly correlated. In general, it is necessary to study the earlier chapters first before
the later ones.
Next, please follow me on a quick tour through the ten chapters. Two aspects of each chapter will be covered. The first aspect is the contents introduced in each chapter, and the second aspect is its relationships with the previous and subsequent chapters. A note before reading this overview: its purpose is to give you an impression of the contents and structure of this book. It is all right if you encounter many concepts that you do not yet understand. Hopefully, this overview can help you make a proper study plan.
Chapter 1 introduces the basic concepts such as states, actions, rewards, returns, and
policies, which are widely used in the subsequent chapters. These concepts are first
introduced based on a grid world example, where a robot aims to reach a prespecified
target. Then, the concepts are introduced in a more formal manner based on the
framework of Markov decision processes.
Chapter 2 introduces two key elements. The first is a key concept, and the second is
a key tool. The key concept is the state value, which is defined as the expected return
that an agent can obtain when starting from a state if it follows a given policy. The
greater the state value is, the better the corresponding policy is. Thus, state values
can be used to evaluate whether a policy is good or not.
The key tool is the Bellman equation, which can be used to analyze state values. In
a nutshell, the Bellman equation describes the relationship between the values of all
states. By solving the Bellman equation, we can obtain the state values. Such a
process is called policy evaluation, which is a fundamental concept in reinforcement
learning. Finally, this chapter introduces the concept of action values.
Chapter 3 also introduces two key elements. The first is a key concept, and the
second is a key tool. The key concept is the optimal policy. An optimal policy has the
greatest state values compared to other policies. The key tool is the Bellman optimality
equation. As its name suggests, the Bellman optimality equation is a special Bellman
equation.
Here is a fundamental question: what is the ultimate goal of reinforcement learn-
ing? The answer is to obtain optimal policies. The Bellman optimality equation is
important because it can be used to obtain optimal policies. We will see that the
Bellman optimality equation is elegant and can help us thoroughly understand many
fundamental problems.
The first three chapters constitute the first part of this book. This part lays the
necessary foundations for the subsequent chapters. Starting in Chapter 4, the book
introduces algorithms for learning optimal policies.
Chapter 4 introduces three algorithms: value iteration, policy iteration, and truncated
policy iteration. The three algorithms have close relationships with each other. First,
the value iteration algorithm is exactly the algorithm introduced in Chapter 3 for
solving the Bellman optimality equation. Second, the policy iteration algorithm is
an extension of the value iteration algorithm. It is also the foundation for Monte
Carlo (MC) algorithms introduced in Chapter 5. Third, the truncated policy iteration
algorithm is a unified version that includes the value iteration and policy iteration
algorithms as special cases.
The three algorithms share the same structure. That is, every iteration has two steps.
One step is to update the value, and the other step is to update the policy. The idea
of the interaction between value and policy updates widely exists in reinforcement
learning algorithms. This idea is also known as generalized policy iteration. In ad-
dition, the algorithms introduced in this chapter are actually dynamic programming
algorithms, which require system models. By contrast, all the algorithms introduced
in the subsequent chapters do not require models. It is important to understand the contents of this chapter well before proceeding to the subsequent ones.
Starting in Chapter 5, we introduce model-free reinforcement learning algorithms that
do not require system models. Since this is the first time that model-free algorithms appear in this book, we must first fill a knowledge gap: how can we find optimal policies without models? The philosophy is simple. If we do not have a model, we must have
some data. If we do not have data, we must have a model. If we have neither, then we
can do nothing. The “data” in reinforcement learning refer to the experience samples
generated when the agent interacts with the environment.
This chapter introduces three algorithms based on MC estimation that can learn
optimal policies from experience samples. The first and simplest algorithm is MC
Basic, which can be readily obtained by extending the policy iteration algorithm
introduced in Chapter 4. Understanding the MC Basic algorithm is important for
grasping the fundamental idea of MC-based reinforcement learning. By extending
this algorithm, we further introduce two more complicated but more efficient MC-
based algorithms. The fundamental trade-off between exploration and exploitation is
also elaborated in this chapter.
Up to this point, the reader may have noticed that the contents of these chapters are
highly correlated. For example, if we want to study the MC algorithms (Chapter 5), we
must first understand the policy iteration algorithm (Chapter 4). To study the policy
iteration algorithm, we must first know the value iteration algorithm (Chapter 4). To
comprehend the value iteration algorithm, we first need to understand the Bellman opti-
mality equation (Chapter 3). To understand the Bellman optimality equation, we need
to study the Bellman equation (Chapter 2) first. Therefore, it is highly recommended to
study the chapters one by one. Otherwise, it may be difficult to understand the contents
in the later chapters.
There is a knowledge gap when we move from Chapter 5 to Chapter 7: the algorithms
in Chapter 7 are incremental, but the algorithms in Chapter 5 are non-incremental.
Chapter 6 is designed to fill this knowledge gap by introducing the stochastic ap-
proximation theory. Stochastic approximation refers to a broad class of stochastic
iterative algorithms for solving root-finding or optimization problems. The classic
Robbins-Monro and stochastic gradient descent algorithms are special stochastic ap-
proximation algorithms. Although this chapter does not introduce any reinforcement
It is recommended to properly understand the contents in Chapters 8 and 9 before studying Chapter 10.
Chapter 1
Basic Concepts
This chapter introduces the basic concepts of reinforcement learning. These concepts
are important because they will be widely used in this book. We first introduce these
concepts using examples and then formalize them in the framework of Markov decision
processes.
only occupy a single cell. The white cells are accessible for entry, and the orange cells
are forbidden. There is a target cell that the robot would like to reach. We will use such
grid world examples throughout this book since they are intuitive for illustrating new
concepts and algorithms.
Figure 1.2: The grid world example used throughout the book, with a start cell, forbidden cells, and a target cell.
The ultimate goal of the agent is to find a “good” policy that enables it to reach
the target cell when starting from any initial cell. How can the “goodness” of a policy
be defined? The idea is that the agent should reach the target without entering any
forbidden cells, taking unnecessary detours, or colliding with the boundary of the grid.
It would be trivial to plan a path to reach the target cell if the agent knew the map of
the grid world. The task becomes nontrivial if the agent does not know any information
about the environment in advance. Then, the agent must interact with the environment
to find a good policy by trial and error. To do that, the concepts presented in the rest of
the chapter are necessary.
Figure 1.3: Illustrations of the state and action concepts. (a) There are nine states {s1, . . . , s9}. (b) Each state has five possible actions {a1, a2, a3, a4, a5}.
1.3 State transition
The state transition process can be intuitively represented as a table, where each row corresponds to a state and each column corresponds to an action. Each cell indicates the next state to which the agent transitions after taking an action at the corresponding state. The state transition process can also be described mathematically by conditional probabilities. For example, for state s1 and action a2, we have
p(s1 |s1 , a2 ) = 0,
p(s2 |s1 , a2 ) = 1,
p(s3 |s1 , a2 ) = 0,
p(s4 |s1 , a2 ) = 0,
p(s5 |s1 , a2 ) = 0,
which indicates that, when taking a2 at s1 , the probability of the agent moving to s2
is one, and the probabilities of the agent moving to other states are zero. As a result,
taking action a2 at s1 will certainly cause the agent to transition to s2 . The preliminaries
of conditional probability are given in Appendix A. Readers are strongly advised to be
familiar with probability theory since it is necessary for studying reinforcement learning.
Although it is intuitive, the tabular representation is only able to describe deterministic state transitions. In general, state transitions can be stochastic and must be described by conditional probability distributions. For instance, when random wind gusts are applied across the grid, taking action a2 at s1 may blow the agent to s5 instead of
s2 . We have p(s5 |s1 , a2 ) > 0 in this case. Nevertheless, we merely consider deterministic
state transitions in the grid world examples for simplicity in this book.
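To make the two representations concrete, here is a minimal Python sketch (not from the book). The 3×3 layout, the state names s1–s9, and the interpretation of a1–a5 as up, right, down, left, and stay are assumptions matching the figures above.

```python
# A minimal sketch of the grid world's state transitions (illustration only).
# Assumption: a 3x3 grid with states s1..s9 and actions a1..a5 interpreted as
# up, right, down, left, stay -- the convention used in the figures above.

STATES = [f"s{i}" for i in range(1, 10)]          # s1 ... s9
ACTIONS = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1), "a5": (0, 0)}

def next_state(s: str, a: str) -> str:
    """Deterministic transition: move in the grid, bounce back at the boundary."""
    idx = int(s[1:]) - 1
    row, col = divmod(idx, 3)
    drow, dcol = ACTIONS[a]
    new_row, new_col = row + drow, col + dcol
    if not (0 <= new_row < 3 and 0 <= new_col < 3):   # attempted to leave the grid
        return s                                       # bounced back to the same state
    return f"s{new_row * 3 + new_col + 1}"

def p(s_next: str, s: str, a: str) -> float:
    """Conditional probability p(s'|s, a) for the deterministic model above."""
    return 1.0 if next_state(s, a) == s_next else 0.0

print(next_state("s1", "a2"))   # s2
print(p("s2", "s1", "a2"))      # 1.0
print(p("s5", "s1", "a2"))      # 0.0
```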
1.4 Policy
A policy tells the agent which actions to take at every state. Intuitively, policies can
be depicted as arrows (see Figure 1.4(a)). Following a policy, the agent can generate a
trajectory starting from an initial state (see Figure 1.4(b)).
Figure 1.4: A policy represented by arrows and some trajectories obtained by starting from different
initial states.
Mathematically, a policy can be described by conditional probabilities π(a|s), which give the probability of taking action a at state s. For the deterministic policy shown in Figure 1.4, at state s1 we have

π(a1 |s1 ) = 0,
π(a2 |s1 ) = 1,
π(a3 |s1 ) = 0,
π(a4 |s1 ) = 0,
π(a5 |s1 ) = 0,
which indicates that the probability of taking action a2 at state s1 is one, and the prob-
abilities of taking other actions are zero.
The above policy is deterministic. Policies may be stochastic in general. For example,
the policy shown in Figure 1.5 is stochastic: at state s1 , the agent may take actions to
go either rightward or downward. The probabilities of taking these two actions are both equal to 0.5. In particular, we have
π(a1 |s1 ) = 0,
π(a2 |s1 ) = 0.5,
π(a3 |s1 ) = 0.5,
π(a4 |s1 ) = 0,
π(a5 |s1 ) = 0.
Figure 1.5: A stochastic policy. At state s1, the agent may move rightward or downward with equal probabilities of 0.5.
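As a small illustration (not from the book), a policy can be stored directly as the conditional probabilities π(a|s) and actions can be sampled from it. The numbers below mirror the stochastic policy at s1 shown in Figure 1.5; the dictionary format is an assumption for this sketch.

```python
import random

# The stochastic policy of Figure 1.5 at state s1 (other states omitted for brevity):
# moving rightward (a2) or downward (a3), each with probability 0.5.
policy = {
    "s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0},
}

def sample_action(pi: dict, s: str) -> str:
    """Sample an action a with probability pi[s][a]."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s1"))   # prints 'a2' or 'a3', each with probability 0.5
```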
1.5 Reward
Reward is one of the most unique concepts in reinforcement learning.
Special attention should be given to the target state s9 . The reward process does not
have to terminate after the agent reaches s9 . If the agent takes action a5 at s9 , the next
state is again s9 , and the reward is rtarget = +1. If the agent takes action a2 , the next
state is also s9 , but the reward is rboundary = −1.
A reward can be interpreted as a human-machine interface, with which we can guide
the agent to behave as we expect. For example, with the rewards designed above, we can
expect that the agent tends to avoid exiting the boundary or stepping into the forbidden
cells. Designing appropriate rewards is an important step in reinforcement learning. This
step is, however, nontrivial for complex tasks since it may require the user to understand
the given problem well. Nevertheless, it may still be much easier than solving the problem
with other approaches that require a professional background or a deep understanding of
the given problem.
The process of getting a reward after executing an action can be intuitively represented
as a table, as shown in Table 1.3. Each row of the table corresponds to a state, and each
column corresponds to an action. The value in each cell of the table indicates the reward
that can be obtained by taking an action at a state.
One question that beginners may ask is as follows: if given the table of rewards, can
we find good policies by simply selecting the actions with the greatest rewards? The
answer is no. That is because these rewards are immediate rewards that can be obtained
after taking an action. To determine a good policy, we must consider the total reward
obtained in the long run (see Section 1.6 for more information). An action with the
greatest immediate reward may not lead to the greatest total reward.
Although intuitive, the tabular representation is only able to describe deterministic
reward processes. A more general approach is to use conditional probabilities p(r|s, a) to
describe reward processes. For example, for state s1 and action a1, we have p(r = −1|s1, a1) = 1 and p(r ≠ −1|s1, a1) = 0.
This indicates that, when taking a1 at s1 , the agent obtains r = −1 with certainty. In
this example, the reward process is deterministic. In general, it can be stochastic. For
example, if a student studies hard, he or she would receive a positive reward (e.g., higher
grades on exams), but the specific value of the reward may be uncertain.
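The same idea carries over to code. Below is a hedged sketch (not from the book) that stores reward processes as p(r|s, a) and samples from them; the deterministic entries follow the convention used above (hitting the boundary gives −1), while the stochastic 80/20 entry is a purely hypothetical illustration.

```python
import random

# p(r|s, a) stored as {(state, action): {reward: probability}}.
# Deterministic case: taking a1 (upward) at s1 hits the boundary, so r = -1 with certainty.
reward_model = {
    ("s1", "a1"): {-1: 1.0},
    ("s1", "a2"): {0: 1.0},
}

# A stochastic reward simply lists several rewards with their probabilities
# (the 0.8/0.2 split here is a hypothetical example, not from the book).
reward_model[("s1", "a3")] = {0: 0.8, -1: 0.2}

def sample_reward(s: str, a: str) -> int:
    """Draw a reward r with probability p(r|s, a)."""
    rewards, probs = zip(*reward_model[(s, a)].items())
    return random.choices(rewards, weights=probs, k=1)[0]

print(sample_reward("s1", "a1"))   # always -1
print(sample_reward("s1", "a3"))   # 0 most of the time, occasionally -1
```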
1.6 Trajectories, returns, and episodes

Figure 1.6: Trajectories obtained by following two policies: (a) policy 1 and its trajectory; (b) policy 2 and its trajectory. The trajectories are indicated by red dashed lines.
Following the policy in Figure 1.6(a), the agent starting from s1 obtains the trajectory

s1 →(a2, r=0) s2 →(a3, r=0) s5 →(a3, r=0) s8 →(a2, r=1) s9.
The return of this trajectory is defined as the sum of all the rewards collected along the
trajectory:
return = 0 + 0 + 0 + 1 = 1. (1.1)
Following the policy in Figure 1.6(b), the agent starting from s1 obtains the trajectory

s1 →(a3, r=0) s4 →(a3, r=−1) s7 →(a2, r=0) s8 →(a2, r=+1) s9.

The corresponding return is

return = 0 − 1 + 0 + 1 = 0. (1.2)
The returns in (1.1) and (1.2) indicate that the left policy is better than the right one
since its return is greater. This mathematical conclusion is consistent with the intuition
that the right policy is worse since it passes through a forbidden cell.
A return consists of an immediate reward and future rewards. Here, the immediate
reward is the reward obtained after taking an action at the initial state; the future
rewards refer to the rewards obtained after leaving the initial state. It is possible that the
immediate reward is negative while the future reward is positive. Thus, which actions to
take should be determined by the return (i.e., the total reward) rather than the immediate
reward to avoid short-sighted decisions.
The return in (1.1) is defined for a finite-length trajectory. Return can also be defined
for infinitely long trajectories. For example, the trajectory in Figure 1.6 stops after
reaching s9 . Since the policy is well defined for s9 , the process does not have to stop after
the agent reaches s9 . We can design a policy so that the agent stays unchanged after
reaching s9 . Then, the policy would generate the following infinitely long trajectory:
s1 →(a2, r=0) s2 →(a3, r=0) s5 →(a3, r=0) s8 →(a2, r=1) s9 →(a5, r=1) s9 →(a5, r=1) s9 → ⋯
return = 0 + 0 + 0 + 1 + 1 + 1 + · · · = ∞,
which unfortunately diverges. Therefore, we must introduce the concept of the discounted return for infinitely long trajectories. In particular, the discounted return is the sum of the discounted rewards:

discounted return = 0 + γ·0 + γ²·0 + γ³·1 + γ⁴·1 + γ⁵·1 + ⋯, (1.3)

where γ ∈ (0, 1) is called the discount rate. When γ ∈ (0, 1), the value of (1.3) can be
calculated as

discounted return = γ³(1 + γ + γ² + ⋯) = γ³ · 1/(1 − γ).
The introduction of the discount rate is useful for the following reasons. First, it
removes the stop criterion and allows for infinitely long trajectories. Second, the dis-
count rate can be used to adjust the emphasis placed on near- or far-future rewards. In
particular, if γ is close to 0, then the agent places more emphasis on rewards obtained in
the near future. The resulting policy would be short-sighted. If γ is close to 1, then the
agent places more emphasis on the far future rewards. The resulting policy is far-sighted
and dares to take risks of obtaining negative rewards in the near future. These points
will be demonstrated in Section 3.5.
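These effects can be checked numerically. The short sketch below (not from the book) computes the discounted return of the reward sequence 0, 0, 0, 1, 1, 1, . . . discussed above, truncated after many steps, and compares it with the closed-form value γ³/(1 − γ).

```python
def discounted_return(rewards, gamma: float) -> float:
    """Sum of gamma**t * r_t over the given reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Reward sequence of the infinitely long trajectory above: 0, 0, 0, 1, 1, 1, ...
# Truncated after 1000 steps (a small truncation error remains for gamma close to 1).
rewards = [0, 0, 0] + [1] * 997

for gamma in (0.5, 0.9, 0.99):
    numerical = discounted_return(rewards, gamma)
    closed_form = gamma ** 3 / (1 - gamma)
    print(f"gamma={gamma}: numerical={numerical:.4f}, closed form={closed_form:.4f}")
```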
One important notion that was not explicitly mentioned in the above discussion is the
episode. When interacting with the environment by following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a trial ). If the
environment or policy is stochastic, we obtain different episodes when starting from the
same state. However, if everything is deterministic, we always obtain the same episode
when starting from the same state.
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called
episodic tasks. However, some tasks may have no terminal states, meaning that the pro-
cess of interacting with the environment will never end. Such tasks are called continuing
tasks. In fact, we can treat episodic and continuing tasks in a unified mathematical
manner by converting episodic tasks to continuing ones. To do that, we need to properly define
the process after the agent reaches the terminal state. Specifically, after reaching the
terminal state in an episodic task, the agent can continue taking actions in the following
two ways.
First, if we treat the terminal state as a special state, we can specifically design its
action space or state transition so that the agent stays in this state forever. Such
states are called absorbing states, meaning that the agent never leaves a state once
reached. For example, for the target state s9 , we can specify A(s9 ) = {a5 } or set
A(s9 ) = {a1 , . . . , a5 } with p(s9 |s9 , ai ) = 1 for all i = 1, . . . , 5.
Second, if we treat the terminal state as a normal state, we can simply set its action
space to the same as the other states, and the agent may leave the state and come
back again. Since a positive reward of r = 1 can be obtained every time s9 is reached,
the agent will eventually learn to stay at s9 forever to collect more rewards. Notably,
when an episode is infinitely long and the reward received for staying at s9 is positive,
a discount rate must be used to calculate the discounted return to avoid divergence.
In this book, we consider the second scenario where the target state is treated as a normal
state whose action space is A(s9 ) = {a1 , . . . , a5 }.
1.7 Markov decision processes

The concepts introduced above can be formalized in the framework of Markov decision processes (MDPs). The key ingredients of an MDP include sets (the state space S, the action space A(s) for each state s, and the reward set R(s, a) for each state-action pair) and a model (the state transition probabilities p(s′|s, a) and the reward probabilities p(r|s, a)). An MDP must satisfy the Markov property:

p(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = p(s_{t+1} | s_t, a_t),
p(r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = p(r_{t+1} | s_t, a_t), (1.4)

where t represents the current time step and t + 1 represents the next time step.
Equation (1.4) indicates that the next state or reward depends merely on the current
state and action and is independent of the previous ones. The Markov property is
important for deriving the fundamental Bellman equation of MDPs, as shown in the
next chapter.
Here, p(s′|s, a) and p(r|s, a) for all (s, a) are called the model or dynamics. The
model can be either stationary or nonstationary (or in other words, time-invariant or
time-variant). A stationary model does not change over time; a nonstationary model
may vary over time. For instance, in the grid world example, if a forbidden area may pop
up or disappear sometimes, the model is nonstationary. In this book, we only consider
stationary models.
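To fix ideas, here is a hedged Python sketch (not from the book) of a container holding the elements of a finite, stationary MDP; the class and field names are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action, Reward = str, str, float

@dataclass
class FiniteMDP:
    """A finite, stationary MDP: the sets plus the model p(s'|s,a) and p(r|s,a)."""
    states: List[State]
    actions: Dict[State, List[Action]]                         # A(s)
    p_state: Dict[Tuple[State, Action], Dict[State, float]]    # p(s'|s,a)
    p_reward: Dict[Tuple[State, Action], Dict[Reward, float]]  # p(r|s,a)
    gamma: float                                               # discount rate

    def check(self) -> None:
        """Sanity check: each conditional distribution sums to one."""
        for table in (self.p_state, self.p_reward):
            for key, dist in table.items():
                assert abs(sum(dist.values()) - 1.0) < 1e-9, f"bad distribution at {key}"
```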
One may have heard about the Markov processes (MPs). What is the difference
between an MDP and an MP? The answer is that, once the policy in an MDP is fixed,
the MDP degenerates into an MP. For example, the grid world example in Figure 1.7 can
be abstracted as a Markov process. In the literature on stochastic processes, a Markov
process is also called a Markov chain if it is a discrete-time process and the number of
states is finite or countable [1]. In this book, the terms “Markov process” and “Markov
chain” are used interchangeably when the context is clear. Moreover, this book mainly
considers finite MDPs where the numbers of states and actions are finite. This is the
simplest case that should be fully understood.
Figure 1.7: Abstraction of the grid world example as a Markov process. Here, the circles represent states and the links with arrows represent state transitions, labeled with their probabilities.
1.8 Summary
This chapter introduced the basic concepts that will be widely used in the remainder of
the book. We used intuitive grid world examples to demonstrate these concepts and then
formalized them in the framework of MDPs. For more information about MDPs, readers
can see [1, 2].
1.9 Q&A
Q: Can we set all the rewards as negative or positive?
A: In this chapter, we mentioned that a positive reward would encourage the agent
to take an action and that a negative reward would discourage the agent from taking
the action. In fact, it is the relative reward values instead of the absolute values that
determine encouragement or discouragement.
More specifically, we set rboundary = −1, rforbidden = −1, rtarget = +1, and rother = 0 in
this chapter. We can also add a common value to all these values without changing
the resulting optimal policy. For example, we can add −2 to all the rewards to obtain
rboundary = −3, rforbidden = −3, rtarget = −1, and rother = −2. Although the rewards
are all negative, the resulting optimal policy is unchanged. That is because optimal
policies are invariant to affine transformations of the rewards. Details will be given in Section 3.5.
Q: Is the reward a function of the next state?
A: We mentioned that the reward r depends only on s and a but not on the next state s′.
However, this may be counterintuitive since it is the next state that determines the
reward in many cases. For example, the reward is positive when the next state is the
target state. As a result, a question that naturally follows is whether a reward should
depend on the next state. A mathematical rephrasing of this question is whether we
should use p(r|s, a, s′), where s′ is the next state, rather than p(r|s, a). The answer is that r depends on s, a, and s′. However, since s′ also depends on s and a, we can equivalently write r as a function of s and a: p(r|s, a) = Σ_{s′} p(r|s, a, s′) p(s′|s, a). In this way, the Bellman equation can be easily established, as shown in Chapter 2.
Chapter 2
This chapter introduces a core concept and an important tool. The core concept is the state value, which is defined as the expected return that an agent can obtain when starting from a state and following a given policy. The greater the state value is, the better the corresponding
policy is. State values can be used as a metric to evaluate whether a policy is good or
not. While state values are important, how can we analyze them? The answer is the
Bellman equation, which is an important tool for analyzing state values. In a nutshell,
the Bellman equation describes the relationships between the values of all states. By
solving the Bellman equation, we can obtain the state values. This process is called
policy evaluation, which is a fundamental concept in reinforcement learning. Finally, this chapter extends the concept of state values to action values.
2.1 Motivating example 1: Why are returns important?
Figure 2.2: Examples for demonstrating the importance of returns. The three examples have different policies for s1.
Consider the three policies shown in Figure 2.2. It can be seen that the three policies
are different at s1 . Which is the best and which is the worst? Intuitively, the leftmost
policy is the best because the agent starting from s1 can avoid the forbidden area. The
middle policy is intuitively worse because the agent starting from s1 moves to the forbid-
den area. The rightmost policy is in between the others because it has a probability of
0.5 to go to the forbidden area.
While the above analysis is based on intuition, a question that immediately follows is
whether we can use mathematics to describe such intuition. The answer is yes and relies
on the return concept. In particular, suppose that the agent starts from s1 .
Following the first policy, the trajectory is s1 → s3 → s4 → s4 → ⋯. The corresponding discounted return is

return1 = 0 + γ·1 + γ²·1 + ⋯ = γ(1 + γ + γ² + ⋯) = γ/(1 − γ).
Following the second policy, the trajectory is s1 → s2 → s4 → s4 → ⋯. The corresponding discounted return is

return2 = −1 + γ·1 + γ²·1 + ⋯ = −1 + γ(1 + γ + γ² + ⋯) = −1 + γ/(1 − γ).
Following the third policy, two trajectories can possibly be obtained. One is s1 →
s3 → s4 → s4 · · · , and the other is s1 → s2 → s4 → s4 · · · . The probability of either
of the two trajectories is 0.5. Then, the average return that can be obtained starting
from s1 is
return3 = 0.5(−1 + γ/(1 − γ)) + 0.5(γ/(1 − γ)) = −0.5 + γ/(1 − γ).
Comparing the three returns, we have

return1 > return3 > return2 (2.1)

for any value of γ. Inequality (2.1) suggests that the first policy is the best because its
return is the greatest, and the second policy is the worst because its return is the smallest.
This mathematical conclusion is consistent with the aforementioned intuition: the first
policy is the best since it can avoid entering the forbidden area, and the second policy is
the worst because it leads to the forbidden area.
The above examples demonstrate that returns can be used to evaluate policies: a
policy is better if the return obtained by following that policy is greater. Finally, it is
notable that return3 does not strictly comply with the definition of returns because it is
more like an expected value. It will become clear later that return3 is actually a state
value.
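The ordering in (2.1) can be verified numerically. The sketch below is not part of the book; it simply evaluates the three closed-form returns derived above for a few values of γ.

```python
def returns(gamma: float):
    """Closed-form returns of the three policies when starting from s1 (Section 2.1)."""
    r1 = gamma / (1 - gamma)          # first policy: avoids the forbidden area
    r2 = -1 + gamma / (1 - gamma)     # second policy: enters the forbidden area
    r3 = -0.5 + gamma / (1 - gamma)   # third policy: enters it with probability 0.5
    return r1, r2, r3

for gamma in (0.5, 0.9, 0.99):
    r1, r2, r3 = returns(gamma)
    assert r1 > r3 > r2               # the ordering in (2.1) holds for any gamma in (0, 1)
    print(f"gamma={gamma}: return1={r1:.2f}, return3={r3:.2f}, return2={r2:.2f}")
```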
2.2 Motivating example 2: How to calculate returns?

While returns are important for evaluating policies, how can they be calculated? There are two ways.

The first is simply by definition: a return equals the discounted sum of all the rewards collected along a trajectory. Consider the example in Figure 2.3. Let vi denote the return obtained by starting from si for i = 1, 2, 3, 4. Then, the returns obtained when starting from the four states can be written as
Figure 2.3: An example for demonstrating how to calculate returns. There are no target or forbidden cells in this example.
v1 = r1 + γr2 + γ²r3 + ⋯,
v2 = r2 + γr3 + γ²r4 + ⋯,
v3 = r3 + γr4 + γ²r1 + ⋯, (2.2)
v4 = r4 + γr1 + γ²r2 + ⋯.
The second way, which is more important, is based on the idea of bootstrapping. By
observing the expressions of the returns in (2.2), we can rewrite them as

v1 = r1 + γv2,
v2 = r2 + γv3,
v3 = r3 + γv4, (2.3)
v4 = r4 + γv1.
The above equations indicate an interesting phenomenon that the values of the returns
rely on each other. More specifically, v1 relies on v2 , v2 relies on v3 , v3 relies on v4 ,
and v4 relies on v1 . This reflects the idea of bootstrapping, which is to obtain the
values of some quantities from themselves.
At first glance, bootstrapping is an endless loop because the calculation of an unknown
value relies on another unknown value. In fact, bootstrapping is easier to understand
if we view it from a mathematical perspective. In particular, the equations in (2.3)
can be reformed into a linear matrix-vector equation:
⎡v1⎤   ⎡r1⎤   ⎡γv2⎤   ⎡r1⎤     ⎡0 1 0 0⎤ ⎡v1⎤
⎢v2⎥ = ⎢r2⎥ + ⎢γv3⎥ = ⎢r2⎥ + γ ⎢0 0 1 0⎥ ⎢v2⎥
⎢v3⎥   ⎢r3⎥   ⎢γv4⎥   ⎢r3⎥     ⎢0 0 0 1⎥ ⎢v3⎥
⎣v4⎦   ⎣r4⎦   ⎣γv1⎦   ⎣r4⎦     ⎣1 0 0 0⎦ ⎣v4⎦
  v      r              r          P       v

which can be written compactly as
v = r + γP v.
In fact, (2.3) is the Bellman equation for this simple example. Although it is simple,
(2.3) demonstrates the core idea of the Bellman equation: the return obtained by starting
from one state depends on those obtained when starting from other states. The idea of
bootstrapping and the Bellman equation for general scenarios will be formalized in the
following sections.
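The bootstrapping idea can be made concrete with a short NumPy sketch (not from the book) that solves v = r + γPv for the four-state example above, both in closed form and by repeated substitution. The numerical rewards used below are illustrative placeholders, since Figure 2.3 leaves r1, . . . , r4 unspecified.

```python
import numpy as np

gamma = 0.9
r = np.array([0.0, 1.0, 1.0, 1.0])        # illustrative values for r1..r4
P = np.array([[0, 1, 0, 0],               # s1 -> s2
              [0, 0, 1, 0],               # s2 -> s3
              [0, 0, 0, 1],               # s3 -> s4
              [1, 0, 0, 0]], dtype=float) # s4 -> s1

# Closed-form solution: v = (I - gamma * P)^{-1} r
v_closed = np.linalg.solve(np.eye(4) - gamma * P, r)

# Bootstrapping / iterative solution: v <- r + gamma * P v
v = np.zeros(4)
for _ in range(1000):
    v = r + gamma * P @ v

print(np.round(v_closed, 4))
print(np.round(v, 4))                      # matches the closed-form values
```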
2.3 State values

Consider a sequence of time steps t = 0, 1, 2, . . .. At time t, the agent is at state St and takes action At following a policy π; it then transitions to the next state St+1 and receives a reward Rt+1. This process can be expressed as

St →(At) St+1, Rt+1.
Note that St , St+1 , At , Rt+1 are all random variables. Moreover, St , St+1 ∈ S, At ∈ A(St ),
and Rt+1 ∈ R(St , At ).
Starting from t, we can obtain a state-action-reward trajectory:
St →(At) St+1, Rt+1 →(At+1) St+2, Rt+2 →(At+2) St+3, Rt+3 → ⋯ .

The discounted return along the trajectory is defined as

Gt ≐ Rt+1 + γRt+2 + γ²Rt+3 + ⋯,
where γ ∈ (0, 1) is the discount rate. Note that Gt is a random variable since Rt+1 , Rt+2 , . . .
are all random variables.
Since Gt is a random variable, we can calculate its expected value (also called the
expectation or mean):
vπ(s) ≐ E[Gt | St = s].
Here, vπ (s) is called the state-value function or simply the state value of s. Some impor-
tant remarks are given below.
The relationship between state values and returns is further clarified as follows. When
both the policy and the system model are deterministic, starting from a state always leads
to the same trajectory. In this case, the return obtained starting from a state is equal
to the value of that state. By contrast, when either the policy or the system model is
stochastic, starting from the same state may generate different trajectories. In this case,
the returns of different trajectories are different, and the state value is the mean of these
returns.
Although returns can be used to evaluate policies as shown in Section 2.1, it is more
formal to use state values to evaluate policies: policies that generate greater state values
are better. Therefore, state values constitute a core concept in reinforcement learning.
While state values are important, a question that immediately follows is how to calculate
them. This question is answered in the next section.
2.4 Bellman equation

We next derive the Bellman equation. First, note that the return Gt can be rewritten as

Gt = Rt+1 + γ(Rt+2 + γRt+3 + ⋯) = Rt+1 + γGt+1,

where Gt+1 = Rt+2 + γRt+3 + ⋯. This equation establishes the relationship between Gt and Gt+1. Then, the state value can be written as

vπ(s) = E[Gt | St = s] = E[Rt+1 | St = s] + γE[Gt+1 | St = s].
The first term, E[Rt+1 |St = s], is the expectation of the immediate rewards. By using
the law of total expectation (Appendix A), it can be calculated as
E[Rt+1 | St = s] = Σ_{a∈A} π(a|s) E[Rt+1 | St = s, At = a]
                 = Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a) r. (2.5)
Here, A and R are the sets of possible actions and rewards, respectively. It should be
noted that A may be different for different states. In this case, A should be written as
A(s). Similarly, R may also depend on (s, a). We drop the dependence on s or (s, a)
for the sake of simplicity in this book. Nevertheless, the conclusions are still valid in
the presence of dependence.
The second term, E[Gt+1 |St = s], is the expectation of the future rewards. It can be
calculated as
E[Gt+1 | St = s] = Σ_{s′∈S} E[Gt+1 | St = s, St+1 = s′] p(s′|s)
                 = Σ_{s′∈S} E[Gt+1 | St+1 = s′] p(s′|s)   (due to the Markov property)
                 = Σ_{s′∈S} vπ(s′) p(s′|s)
                 = Σ_{s′∈S} vπ(s′) Σ_{a∈A} p(s′|s, a) π(a|s). (2.6)
The above derivation uses the fact that E[Gt+1 |St = s, St+1 = s0 ] = E[Gt+1 |St+1 = s0 ],
which is due to the Markov property that the future rewards depend merely on the
present state rather than the previous ones.
Substituting (2.5) and (2.6) into vπ(s) = E[Rt+1 | St = s] + γE[Gt+1 | St = s] yields

vπ(s) = Σ_{a∈A} π(a|s) [ Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) vπ(s′) ],   for all s ∈ S. (2.7)

This equation is the Bellman equation, which characterizes the relationships among state values. It is a fundamental tool for designing and analyzing reinforcement learning algorithms.
The Bellman equation seems complex at first glance. In fact, it has a clear structure.
Some remarks are given below.
vπ(s) and vπ(s′) are unknown state values to be calculated. It may be confusing to beginners how to calculate the unknown vπ(s) given that it relies on another unknown vπ(s′). It must be noted that the Bellman equation refers to a set of linear equations for
all states rather than a single equation. If we put these equations together, it becomes
clear how to calculate all the state values. Details will be given in Section 2.7.
π(a|s) is a given policy. Since state values can be used to evaluate a policy, solving
the state values from the Bellman equation is a policy evaluation process, which is an
important process in many reinforcement learning algorithms, as we will see later in
the book.
p(r|s, a) and p(s′|s, a) represent the system model. We will first show how to calculate
the state values with this model in Section 2.7, and then show how to do that without
the model by using model-free algorithms later in this book.
In addition to the expression in (2.7), readers may also encounter other expressions
of the Bellman equation in the literature. We next introduce two equivalent expressions.
First, it follows from the law of total probability that
p(s′|s, a) = Σ_{r∈R} p(s′, r|s, a),
p(r|s, a) = Σ_{s′∈S} p(s′, r|s, a).
2.5 Examples for illustrating the Bellman equation
Figure 2.4: An example for demonstrating the Bellman equation. The policy in this example is deterministic.
Consider the first example shown in Figure 2.4, where the policy is deterministic. We
next write out the Bellman equation and then solve the state values from it.
First, consider state s1. Under the policy, the probabilities of taking the actions are π(a = a3|s1) = 1 and π(a ≠ a3|s1) = 0. The state transition probabilities are p(s′ = s3|s1, a3) = 1 and p(s′ ≠ s3|s1, a3) = 0. The reward probabilities are p(r = 0|s1, a3) = 1 and p(r ≠ 0|s1, a3) = 0. Substituting these values into (2.7) gives

vπ(s1) = 0 + γvπ(s3).
Interestingly, although the expression of the Bellman equation in (2.7) seems complex,
the expression for this specific state is very simple.
Similarly, it can be obtained that

vπ(s2) = 1 + γvπ(s4),
vπ(s3) = 1 + γvπ(s4),
vπ(s4) = 1 + γvπ(s4).

We can solve the state values from these equations. Since the equations are simple, we can solve them manually. More complicated equations can be solved by the algorithms
presented in Section 2.7. Here, the state values can be solved as
vπ(s4) = 1/(1 − γ),
vπ(s3) = 1/(1 − γ),
vπ(s2) = 1/(1 − γ),
vπ(s1) = γ/(1 − γ).
If we further set γ = 0.9, then

vπ(s4) = 1/(1 − 0.9) = 10,
vπ(s3) = 1/(1 − 0.9) = 10,
vπ(s2) = 1/(1 − 0.9) = 10,
vπ(s1) = 0.9/(1 − 0.9) = 9.
Figure 2.5: An example for demonstrating the Bellman equation. The policy in this example is stochastic.
Consider the second example shown in Figure 2.5, where the policy is stochastic. We
next write out the Bellman equation and then solve the state values from it.
At state s1, the probabilities of going rightward and downward both equal 0.5. Mathematically, we have π(a = a2|s1) = 0.5 and π(a = a3|s1) = 0.5. The state transition probabilities are deterministic since p(s′ = s3|s1, a3) = 1 and p(s′ = s2|s1, a2) = 1. The reward probabilities are also deterministic since p(r = 0|s1, a3) = 1 and p(r = −1|s1, a2) = 1. Substituting these values into (2.7) gives

vπ(s1) = 0.5[0 + γvπ(s3)] + 0.5[−1 + γvπ(s2)].

The Bellman equations for s2, s3, and s4 are the same as in the previous example. The state values can be solved from the above equations. Since the equations are simple, we can solve them manually, which gives

vπ(s4) = 1/(1 − γ),
vπ(s3) = 1/(1 − γ),
vπ(s2) = 1/(1 − γ),
vπ(s1) = 0.5[0 + γvπ(s3)] + 0.5[−1 + γvπ(s2)] = −0.5 + γ/(1 − γ).
Substituting γ = 0.9 yields

vπ(s4) = 10,
vπ(s3) = 10,
vπ(s2) = 10,
vπ(s1) = −0.5 + 0.9/(1 − 0.9) = −0.5 + 9 = 8.5.
If we compare the state values of the two policies in the above examples, it can be seen that they are equal at s2, s3, and s4, while vπ(s1) = 9 for the first policy is greater than vπ(s1) = 8.5 for the second. This indicates that the policy in Figure 2.4 is better because it has greater state values.
This mathematical conclusion is consistent with the intuition that the first policy is better
because it can avoid entering the forbidden area when the agent starts from s1 . As a
result, the above two examples demonstrate that state values can be used to evaluate
policies.
2.6 Matrix-vector form of the Bellman equation

The Bellman equation in (2.7) is in an elementwise form; it holds for every state. To derive its matrix-vector form, we first rewrite (2.7) as

vπ(s) = rπ(s) + γ Σ_{s′∈S} pπ(s′|s) vπ(s′), (2.8)

where

rπ(s) ≐ Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a) r,
pπ(s′|s) ≐ Σ_{a∈A} π(a|s) p(s′|s, a).

Here, rπ(s) denotes the mean of the immediate rewards, and pπ(s′|s) is the probability of transitioning from s to s′ under policy π.
Suppose that the states are indexed as si with i = 1, . . . , n, where n = |S|. For state si, (2.8) can be written as

vπ(si) = rπ(si) + γ Σ_{sj∈S} pπ(sj|si) vπ(sj). (2.9)
Let vπ = [vπ(s1), . . . , vπ(sn)]^T ∈ R^n, rπ = [rπ(s1), . . . , rπ(sn)]^T ∈ R^n, and Pπ ∈ R^{n×n} with [Pπ]_{ij} = pπ(sj|si). Then, (2.9) can be written in the following matrix-vector form:

vπ = rπ + γPπ vπ. (2.10)
Figure 2.6: An example for demonstrating the matrix-vector form of the Bellman equation.
2.7 Solving state values from the Bellman equation

The matrix-vector form (2.10) can be solved in closed form as vπ = (I − γPπ)^{−1} rπ. Since inverting the matrix is costly for large state spaces, the state values can also be computed iteratively by

vk+1 = rπ + γPπ vk,   k = 0, 1, 2, . . . , (2.11)

which generates a sequence {vk} that converges to vπ. To see this, define the error δk ≐ vk − vπ. Subtracting vπ = rπ + γPπ vπ from (2.11) gives δk+1 = vk+1 − vπ = γPπ(vk − vπ) = γPπ δk.
As a result,

δk+1 = γPπ δk = γ²Pπ² δk−1 = ⋯ = γ^{k+1} Pπ^{k+1} δ0.

Since every entry of Pπ is nonnegative and no greater than one, every entry of Pπ^{k+1} is also no greater than one. On the other hand, since γ < 1, we have γ^{k+1} → 0, and hence δk+1 = γ^{k+1} Pπ^{k+1} δ0 → 0 as k → ∞.
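The geometric decay of the error δk can also be observed numerically. Below is a small sketch (not from the book); the policy-specific quantities rπ and Pπ are random placeholders rather than values from the grid world example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# Placeholder model for some policy: a random stochastic matrix P_pi and reward vector r_pi.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # rows sum to one
r_pi = rng.random(n)

v_true = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # v_pi = (I - gamma P_pi)^{-1} r_pi

v = np.zeros(n)
for k in range(1, 11):
    v = r_pi + gamma * P_pi @ v           # v_{k+1} = r_pi + gamma * P_pi * v_k
    error = np.max(np.abs(v - v_true))    # ||delta_k||_inf
    print(f"k={k:2d}  error={error:.6f}") # shrinks roughly by a factor of gamma each step
```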
Figure 2.7: (a) Two "good" policies and their state values. The state values of the two policies are the same, but the two policies are different at the top two states in the fourth column. (b) Two "bad" policies and their state values. The state values are smaller than those of the "good" policies.
Figure 2.7(a) shows two “good” policies and their corresponding state values obtained
by (2.11). The two policies have the same state values but differ at the top two states in
the fourth column. Therefore, we know that different policies may have the same state
values.
Figure 2.7(b) shows two “bad” policies and their corresponding state values. These
two policies are bad because the actions of many states are intuitively unreasonable.
Such intuition is supported by the obtained state values. As can be seen, the state values
of these two policies are negative and much smaller than those of the good policies in
Figure 2.7(a).
2.8 From state value to action value

The action value of a state-action pair (s, a) under a policy π is defined as

qπ(s, a) ≐ E[Gt | St = s, At = a].
As can be seen, the action value is defined as the expected return that can be obtained
after taking an action at a state. It must be noted that qπ (s, a) depends on a state-action
pair (s, a) rather than an action alone. It may be more rigorous to call this value a
state-action value, but it is conventionally called an action value for simplicity.
What is the relationship between action values and state values?
First, it follows from the definitions of the state value and the action value that

vπ(s) = Σ_{a∈A} π(a|s) qπ(s, a). (2.13)

As a result, a state value is the expectation of the action values associated with that state.
Second, since the state value is given by

vπ(s) = Σ_{a∈A} π(a|s) [ Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) vπ(s′) ],
comparing it with (2.13) leads to

qπ(s, a) = Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) vπ(s′). (2.14)
It can be seen that the action value consists of two terms. The first term is the mean
of the immediate rewards, and the second term is the mean of the future rewards.
Both (2.13) and (2.14) describe the relationship between state values and action val-
ues. They are the two sides of the same coin: (2.13) shows how to obtain state values
from action values, whereas (2.14) shows how to obtain action values from state values.
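Both relationships translate directly into code. The sketch below is not from the book: it computes qπ(s, a) from given state values via (2.14) and recovers vπ(s) via (2.13). The numbers mirror the stochastic policy at s1 from Figure 2.8 with γ = 0.9, while the dictionary format is an assumption of this sketch.

```python
GAMMA = 0.9

# Model, policy, and state values at s1 (mirroring the example with gamma = 0.9).
p_reward = {("s1", "a2"): {-1.0: 1.0}, ("s1", "a3"): {0.0: 1.0}}
p_state = {("s1", "a2"): {"s2": 1.0}, ("s1", "a3"): {"s3": 1.0}}
policy = {"s1": {"a2": 0.5, "a3": 0.5}}
v = {"s1": 8.5, "s2": 10.0, "s3": 10.0}

def q_value(s: str, a: str) -> float:
    """(2.14): q_pi(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v_pi(s')."""
    immediate = sum(prob * r for r, prob in p_reward[(s, a)].items())
    future = sum(prob * v[s_next] for s_next, prob in p_state[(s, a)].items())
    return immediate + GAMMA * future

def state_value(s: str) -> float:
    """(2.13): v_pi(s) = sum_a pi(a|s) q_pi(s,a)."""
    return sum(pi_a * q_value(s, a) for a, pi_a in policy[s].items())

print(q_value("s1", "a2"))   # -1 + 0.9 * 10 = 8.0
print(q_value("s1", "a3"))   #  0 + 0.9 * 10 = 9.0
print(state_value("s1"))     # 0.5 * 8.0 + 0.5 * 9.0 = 8.5, consistent with v["s1"]
```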
Figure 2.8: An example for demonstrating the process of calculating action values.
We next present an example to illustrate the process of calculating action values and
discuss a common mistake that beginners may make.
Consider the stochastic policy shown in Figure 2.8. We next only examine the actions
of s1. The other states can be examined similarly. The action value of (s1, a2) is

qπ(s1, a2) = −1 + γvπ(s2),

since taking a2 at s1 moves the agent into the forbidden cell s2 (immediate reward −1), after which it continues from s2 by following π. Similarly, qπ(s1, a3) = 0 + γvπ(s3).
A common mistake that beginners may make is about the values of the actions that
the given policy does not select. For example, the policy in Figure 2.8 can only select
a2 or a3 and cannot select a1 , a4 , a5 . One may argue that since the policy does not
select a1 , a4 , a5 , we do not need to calculate their action values, or we can simply set
qπ (s1 , a1 ) = qπ (s1 , a4 ) = qπ (s1 , a5 ) = 0. This is wrong.
First, even if an action would not be selected by a policy, it still has an action value.
In this example, although policy π does not take a1 at s1 , we can still calculate its
action value by observing what we would obtain after taking this action. Specifically,
after taking a1 , the agent is bounced back to s1 (hence, the immediate reward is −1)
and then continues moving in the state space starting from s1 by following π (hence,
the future reward is γvπ(s1)). As a result, the action value of (s1, a1) is

qπ(s1, a1) = −1 + γvπ(s1).
Similarly, for a4 and a5, which cannot possibly be selected by the given policy either, we have

qπ(s1, a4) = −1 + γvπ(s1),
qπ(s1, a5) = 0 + γvπ(s1).
Second, why do we care about the actions that the given policy would not select?
Although some actions cannot be possibly selected by a given policy, this does not
mean that these actions are not good. It is possible that the given policy is not good,
so it cannot select the best action. The purpose of reinforcement learning is to find
optimal policies. To that end, we must keep exploring all actions to determine better
actions for each state.
Finally, after computing the action values, we can also calculate the state value according to (2.13):

vπ(s1) = 0.5 qπ(s1, a2) + 0.5 qπ(s1, a3).

Moreover, substituting (2.13) into (2.14) yields

qπ(s, a) = Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) Σ_{a′∈A} π(a′|s′) qπ(s′, a′),
which is an equation of action values. The above equation is valid for every state-action
pair. If we put all these equations together, their matrix-vector form is
qπ = r̃ + γP Πqπ , (2.15)
where qπ is the action value vector indexed by the state-action pairs: its (s, a)th element is [qπ]_(s,a) = qπ(s, a); r̃ is the immediate reward vector indexed by the state-action pairs: [r̃]_(s,a) = Σ_{r∈R} p(r|s, a) r. The matrix P is the probability transition matrix, whose
row is indexed by the state-action pairs and whose column is indexed by the states: [P]_{(s,a),s′} = p(s′|s, a). Moreover, Π is a block diagonal matrix in which each block is a 1 × |A| vector: [Π]_{s′,(s′,a′)} = π(a′|s′), and the other entries of Π are zero.
Compared to the Bellman equation defined in terms of state values, the equation
defined in terms of action values has some unique features. For example, r̃ and P are
independent of the policy and are merely determined by the system model. The policy
is embedded in Π. It can be verified that (2.15) is also a contraction mapping and has a
unique solution that can be iteratively solved. More details can be found in [5].
2.9 Summary
The most important concept introduced in this chapter is the state value. Mathematically,
a state value is the expected return that the agent can obtain by starting from a state.
The values of different states are related to each other. That is, the value of state s
relies on the values of some other states, which may further rely on the value of state s
itself. This phenomenon might be the most confusing part of this chapter for beginners.
It is related to an important concept called bootstrapping, which involves calculating
something from itself. Although bootstrapping may be intuitively confusing, it is clear if
we examine the matrix-vector form of the Bellman equation. In particular, the Bellman
equation is a set of linear equations that describe the relationships between the values of
all states.
Since state values can be used to evaluate whether a policy is good or not, the process
of solving the state values of a policy from the Bellman equation is called policy evalu-
ation. As we will see later in this book, policy evaluation is an important step in many
reinforcement learning algorithms.
Another important concept, action value, was introduced to describe the value of
taking one action at a state. As we will see later in this book, action values play a
more direct role than state values when we attempt to find optimal policies. Finally, the
Bellman equation is not restricted to the reinforcement learning field. Instead, it widely
exists in many fields such as control theory and operations research. In different fields,
the Bellman equation may have different expressions. In this book, the Bellman equation
is studied under discrete Markov decision processes. More information about this topic
can be found in [2].
2.10 Q&A
Q: What is the relationship between state values and returns?
A: The value of a state is the mean of the returns that can be obtained if the agent
starts from that state.
Chapter 3
The ultimate goal of reinforcement learning is to seek optimal policies. It is, therefore,
necessary to define what optimal policies are. In this chapter, we introduce a core concept
and an important tool. The core concept is the optimal state value, based on which we
can define optimal policies. The important tool is the Bellman optimality equation, from
which we can solve the optimal state values and policies.
The relationship between the previous, present, and subsequent chapters is as follows.
The previous chapter (Chapter 2) introduced the Bellman equation of any given policy.
The present chapter introduces the Bellman optimality equation, which is a special Bell-
man equation whose corresponding policy is optimal. The next chapter (Chapter 4) will
introduce an important algorithm called value iteration, which is exactly the algorithm
for solving the Bellman optimality equation as introduced in the present chapter.
Be prepared that this chapter is slightly mathematically intensive. However, it is
worth it because many fundamental questions can be clearly answered.
3.1 Motivating example: How to improve policies?

[Figure 3.2: the policy considered in this example, shown on a 2×2 grid world with states s1, . . . , s4 and rewards r = −1 and r = 1 on the relevant transitions.]
Consider the policy shown in Figure 3.2. Here, the orange and blue cells represent the
forbidden and target areas, respectively. The policy here is not good because it selects a2
(rightward) at state s1 . How can we improve the given policy to obtain a better policy?
The answer lies in state values and action values.
Intuition: It is intuitively clear that the policy can improve if it selects a3 (downward)
instead of a2 (rightward) at s1 . This is because moving downward enables the agent
to avoid entering the forbidden area.
Mathematics: The above intuition can be realized based on the calculation of state
values and action values.
First, we calculate the state values of the given policy. In particular, the Bellman
equation of this policy is
This example illustrates that we can obtain a better policy if we update the policy to select the action with the greatest action value. This is the basic idea of many
reinforcement learning algorithms.
This example is very simple in the sense that the given policy is only not good for
state s1 . If the policy is also not good for the other states, will selecting the action with
the greatest action value still generate a better policy? Moreover, do optimal policies always exist? What does an optimal policy look like? We will answer all of
these questions in this chapter.
3.2 Optimal state values and optimal policies

If vπ1(s) ≥ vπ2(s) for all s ∈ S, then π1 is said to be better than π2. Furthermore, if a policy is better than all the other possible policies, then this policy is optimal. This is formally stated below.
Definition 3.1 (Optimal policy and optimal state value). A policy π ∗ is optimal if
vπ∗ (s) ≥ vπ (s) for all s ∈ S and for any other policy π. The state values of π ∗ are the
optimal state values.
The above definition indicates that an optimal policy has the greatest state value
for every state compared to all the other policies. This definition also leads to many
questions:
Existence: Does the optimal policy exist?
Uniqueness: Is the optimal policy unique?
Stochasticity: Is the optimal policy stochastic or deterministic?
Algorithm: How to obtain the optimal policy and the optimal state values?
These fundamental questions must be clearly answered to thoroughly understand
optimal policies. For example, regarding the existence of optimal policies, if optimal
policies do not exist, then we do not need to bother to design algorithms to find them.
We will answer all these questions in the remainder of this chapter.
3.3 Bellman optimality equation

The Bellman optimality equation (BOE) is, for every s ∈ S,

v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) v(s′) )
     = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) q(s, a), (3.1)

where

q(s, a) ≐ Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) v(s′).
Here, π(s) denotes a policy for state s, and Π(s) is the set of all possible policies for s.
The BOE is an elegant and powerful tool for analyzing optimal policies. However,
it may be nontrivial to understand this equation. For example, this equation has two
unknown variables v(s) and π(a|s). It may be confusing to beginners how to solve two
unknown variables from one equation. Moreover, the BOE is actually a special Bellman
equation. However, it is nontrivial to see that since its expression is quite different from
that of the Bellman equation. We also need to answer the following fundamental questions
about the BOE.
Example 3.1. To illustrate how to handle a maximization on the right-hand side of an equation, consider the equation

x = max_{y∈R} (2x − 1 − y²).
The first step is to solve y on the right-hand side of the equation. Regardless of the value
of x, we always have maxy (2x − 1 − y 2 ) = 2x − 1, where the maximum is achieved when
y = 0. The second step is to solve x. When y = 0, the equation becomes x = 2x − 1,
which leads to x = 1. Therefore, y = 0 and x = 1 are the solutions of the equation.
We now turn to the maximization problem on the right-hand side of the BOE. The BOE in (3.1) can be written concisely as

v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) q(s, a),   s ∈ S.
Inspired by Example 3.1, we can first solve the optimal π on the right-hand side. How to
do that? The following example demonstrates its basic idea.
Example 3.2. Given q1, q2, q3 ∈ R, we would like to find the optimal values of c1, c2, c3 to maximize

Σ_{i=1}^{3} ci qi = c1q1 + c2q2 + c3q3,

where c1 + c2 + c3 = 1 and c1, c2, c3 ≥ 0.
Without loss of generality, suppose that q3 ≥ q1, q2. Then, the optimal solution is c3∗ = 1 and c1∗ = c2∗ = 0. This is because

q3 = (c1 + c2 + c3)q3 = c1q3 + c2q3 + c3q3 ≥ c1q1 + c2q2 + c3q3

for any c1, c2, c3.
Inspired by the above example, since Σ_{a∈A} π(a|s) = 1, we have

Σ_{a∈A} π(a|s) q(s, a) ≤ Σ_{a∈A} π(a|s) max_{a∈A} q(s, a) = max_{a∈A} q(s, a),

where the equality is achieved when π(a|s) = 1 for a = a∗ and π(a|s) = 0 otherwise. Here, a∗ = arg max_a q(s, a). In summary, the optimal policy π(s) is the one that selects the action that has the greatest value of q(s, a).
The BOE can also be written in a matrix-vector form:

v = max_{π∈Π} (rπ + γPπ v), (3.2)

where the elements of rπ and Pπ are

[rπ]_s ≐ Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a) r,   [Pπ]_{s,s′} ≐ pπ(s′|s) = Σ_{a∈A} π(a|s) p(s′|s, a).
Since the optimal value of π is determined by v, the right-hand side of (3.2) is a function
of v, denoted as

f(v) ≐ max_{π∈Π} (rπ + γPπ v).

Then, the BOE becomes

v = f(v). (3.3)
In the remainder of this section, we show how to solve this nonlinear equation.
We next introduce the concepts of fixed point and contraction mapping, which are key tools for analyzing the BOE.
Consider a function f (x), where x ∈ Rd and f : Rd → Rd . A point x∗ is called a fixed
point if
f (x∗ ) = x∗ .
The interpretation of the above equation is that the map of x∗ is itself. This is the
reason why x∗ is called “fixed”. The function f is a contraction mapping (or contractive
function) if there exists γ ∈ (0, 1) such that

‖f(x1) − f(x2)‖ ≤ γ‖x1 − x2‖

for any x1 and x2.
Example 3.3. We present three examples to demonstrate fixed points and contraction
mappings.
x = f (x) = 0.5x, x ∈ R.
It is easy to verify that x = 0 is a fixed point since 0 = 0.5 · 0. Moreover, f(x) = 0.5x is a contraction mapping because ‖0.5x1 − 0.5x2‖ = 0.5‖x1 − x2‖ ≤ γ‖x1 − x2‖ for any γ ∈ [0.5, 1).
As a result, |0.5 sin x1 − 0.5 sin x2 | ≤ 0.5|x1 − x2 | and hence f (x) = 0.5 sin x is a
contraction mapping.
The relationship between a fixed point and the contraction property is characterized
by the following classic theorem.
Theorem 3.1 (Contraction mapping theorem). For any equation that has the form x =
f (x) where x and f (x) are real vectors, if f is a contraction mapping, then the following
properties hold.
1) Existence: there exists a fixed point x∗ satisfying f(x∗) = x∗.
2) Uniqueness: the fixed point x∗ is unique.
3) Algorithm: consider the iterative process

xk+1 = f(xk),

where x0 is an arbitrary initial guess. Then, xk converges to x∗ as k → ∞. Moreover, the convergence rate is exponentially fast.
The contraction mapping theorem can not only tell whether the solution of a nonlinear equation exists but also suggest a numerical algorithm for solving the equation. The proof of the theorem is given in Box 3.1.
The following example demonstrates how to calculate the fixed points of some equa-
tions using the iterative algorithm suggested by the contraction mapping theorem.
Example 3.4. Let us revisit the examples above: $x = 0.5x$, $x = Ax$, and $x = 0.5\sin x$. Since the right-hand sides of these three equations are all contraction mappings, it follows from the contraction mapping theorem that each equation has a unique fixed point, which can be easily verified to be $x^* = 0$. Moreover, the fixed points of the three equations can be iteratively computed by
$$x_{k+1} = 0.5 x_k, \qquad x_{k+1} = A x_k, \qquad x_{k+1} = 0.5 \sin x_k,$$
respectively, starting from an arbitrary initial guess $x_0$.
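To make the iterative procedure concrete, the following short Python sketch (illustrative, not from the book) applies the iteration $x_{k+1} = f(x_k)$ to the scalar example $f(x) = 0.5\sin x$ and prints the error $|x_k - x^*|$ with $x^* = 0$; the exponential decay of the error is exactly what the contraction mapping theorem predicts.

import math

def fixed_point_iteration(f, x0, num_iters=20):
    """Iterate x_{k+1} = f(x_k) and return the whole trajectory."""
    xs = [x0]
    for _ in range(num_iters):
        xs.append(f(xs[-1]))
    return xs

if __name__ == "__main__":
    f = lambda x: 0.5 * math.sin(x)   # a contraction mapping with gamma = 0.5
    for k, x in enumerate(fixed_point_iteration(f, x0=3.0)):
        print(f"k = {k:2d}, x_k = {x: .6e}, error = {abs(x):.6e}")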
Box 3.1: Proof of the contraction mapping theorem

Part 1: We prove that the sequence $\{x_k\}_{k=1}^{\infty}$ with $x_k = f(x_{k-1})$ is convergent.

The proof relies on Cauchy sequences. A sequence $x_1, x_2, \ldots \in \mathbb{R}$ is called Cauchy if for any small $\varepsilon > 0$, there exists $N$ such that $\|x_m - x_n\| < \varepsilon$ for all $m, n > N$. The intuitive interpretation is that there exists a finite integer $N$ such that all the elements after $N$ are sufficiently close to each other. Cauchy sequences are important because it is guaranteed that a Cauchy sequence converges to a limit. This convergence property will be used to prove the contraction mapping theorem. Note that we must have $\|x_m - x_n\| < \varepsilon$ for all $m, n > N$; requiring only $x_{n+1} - x_n \to 0$ is insufficient to conclude that a sequence is Cauchy. For example, $x_{n+1} - x_n \to 0$ holds for $x_n = \sqrt{n}$, but $x_n = \sqrt{n}$ diverges.

We next show that $\{x_k = f(x_{k-1})\}_{k=1}^{\infty}$ is a Cauchy sequence and hence converges.
As a result, for any $\varepsilon$, we can always find $N$ such that $\|x_m - x_n\| < \varepsilon$ for all $m, n > N$. Therefore, this sequence is Cauchy and hence converges to a limit point denoted as $x^* = \lim_{k\to\infty} x_k$.

Part 2: We show that the limit $x^* = \lim_{k\to\infty} x_k$ is a fixed point.

Moreover, the convergence rate is characterized by
$$\|x^* - x_n\| = \lim_{m\to\infty}\|x_m - x_n\| \le \frac{\gamma^n}{1-\gamma}\,\|x_1 - x_0\|,$$
which shows that the error decays to zero exponentially fast.
Theorem 3.2 (Contraction property of f(v)). The function $f(v)$ on the right-hand side of the BOE in (3.3) is a contraction mapping. In particular, for any $v_1, v_2 \in \mathbb{R}^{|\mathcal{S}|}$, it holds that
$$\|f(v_1) - f(v_2)\|_\infty \le \gamma\,\|v_1 - v_2\|_\infty,$$
where $\gamma \in (0,1)$ is the discount rate, and $\|\cdot\|_\infty$ is the maximum norm, which is the maximum absolute value of the elements of a vector.
The proof of the theorem is given in Box 3.2. This theorem is important because we
can use the powerful contraction mapping theorem to analyze the BOE.
Box 3.2: Proof of Theorem 3.2

Let $\pi_1^*$ and $\pi_2^*$ denote policies attaining the maxima $f(v_1) = \max_{\pi}(r_\pi + \gamma P_\pi v_1)$ and $f(v_2) = \max_{\pi}(r_\pi + \gamma P_\pi v_2)$, respectively. Define
$$z \doteq \max\big(|\gamma P_{\pi_2^*}(v_1 - v_2)|,\ |\gamma P_{\pi_1^*}(v_1 - v_2)|\big) \in \mathbb{R}^{|\mathcal{S}|},$$
where the maximum and the absolute value are taken elementwise, which implies
$$|f(v_1) - f(v_2)| \le z.$$
Since every row $p_i^T$ of the matrices $P_{\pi_1^*}$ and $P_{\pi_2^*}$ has nonnegative elements that sum to one, it follows that $|p_i^T(v_1 - v_2)| \le \|v_1 - v_2\|_\infty$, and hence every entry of $z$ is bounded by $\gamma\|v_1 - v_2\|_\infty$. Consequently, $\|f(v_1) - f(v_2)\|_\infty \le \gamma\|v_1 - v_2\|_\infty$.
3.4 Solving an optimal policy from the BOE

Solving $v^*$: The BOE can be written as
$$v^* = \max_{\pi\in\Pi}\,(r_\pi + \gamma P_\pi v^*),$$
where $v^*$ denotes its solution.
Theorem 3.3 (Existence, uniqueness, and algorithm). For the BOE $v = f(v) = \max_{\pi\in\Pi}(r_\pi + \gamma P_\pi v)$, there always exists a unique solution $v^*$, which can be solved iteratively by
$$v_{k+1} = f(v_k) = \max_{\pi\in\Pi}\,(r_\pi + \gamma P_\pi v_k), \quad k = 0, 1, 2, \ldots,$$
where the sequence $\{v_k\}$ converges to $v^*$ exponentially fast given any initial guess $v_0$.

The proof of this theorem directly follows from the contraction mapping theorem since $f(v)$ is a contraction mapping. This theorem is important because it answers some fundamental questions regarding the existence, uniqueness, and computation of the solution.
Solving $\pi^*$: Once the value of $v^*$ has been obtained, we can easily obtain $\pi^*$ by solving
$$\pi^* = \arg\max_{\pi\in\Pi}\,(r_\pi + \gamma P_\pi v^*). \quad (3.6)$$
The value of $\pi^*$ will be given in Theorem 3.5. Substituting (3.6) into the BOE yields
$$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*.$$
Therefore, $v^* = v_{\pi^*}$ is the state value of $\pi^*$, and the BOE is a special Bellman equation whose corresponding policy is $\pi^*$.
At this point, although we can solve v ∗ and π ∗ , it is still unclear whether the solution
is optimal. The following theorem reveals the optimality of the solution.
Theorem 3.4 (Optimality of v ∗ and π ∗ ). The solution v ∗ is the optimal state value, and
π ∗ is an optimal policy. That is, for any policy π, it holds that
$$v^* = v_{\pi^*} \ge v_\pi,$$
where $v_\pi$ is the state value of $\pi$.
Now, it is clear why we must study the BOE: its solution corresponds to optimal state
values and optimal policies. The proof of the above theorem is given in the following box.
For any policy $\pi$, it holds that
$$v_\pi = r_\pi + \gamma P_\pi v_\pi.$$
Since
$$v^* = \max_{\pi'\in\Pi}\,(r_{\pi'} + \gamma P_{\pi'} v^*) \ge r_\pi + \gamma P_\pi v^*,$$
we have
$$v^* - v_\pi \ge (r_\pi + \gamma P_\pi v^*) - (r_\pi + \gamma P_\pi v_\pi) = \gamma P_\pi (v^* - v_\pi).$$
Repeatedly applying this inequality gives $v^* - v_\pi \ge \gamma P_\pi(v^* - v_\pi) \ge \gamma^2 P_\pi^2(v^* - v_\pi) \ge \cdots$, and hence
$$v^* - v_\pi \ge \lim_{n\to\infty} \gamma^n P_\pi^n (v^* - v_\pi) = 0,$$
where the last equality is true because $\gamma < 1$ and $P_\pi^n$ is a nonnegative matrix with all its elements less than or equal to 1 (because $P_\pi^n \mathbf{1} = \mathbf{1}$). Therefore, $v^* \ge v_\pi$ for any $\pi$.
We next examine π ∗ in (3.6) more closely. In particular, the following theorem shows
that there always exists a deterministic greedy policy that is optimal.
Theorem 3.5 (Greedy optimal policy). For any $s \in \mathcal{S}$, the deterministic greedy policy
$$\pi^*(a|s) = \begin{cases} 1, & a = a^*(s), \\ 0, & a \neq a^*(s), \end{cases} \quad (3.7)$$
is an optimal policy for solving the BOE. Here, $a^*(s) = \arg\max_a q^*(s,a)$, where
$$q^*(s,a) \doteq \sum_{r\in\mathcal{R}} p(r|s,a)\, r + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\, v^*(s').$$
While the matrix-vector form of the optimal policy is $\pi^* = \arg\max_{\pi\in\Pi}(r_\pi + \gamma P_\pi v^*)$, its elementwise form is
$$\pi^*(s) = \arg\max_{\pi\in\Pi} \sum_{a\in\mathcal{A}} \pi(a|s) \underbrace{\left(\sum_{r\in\mathcal{R}} p(r|s,a)\, r + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\, v^*(s')\right)}_{q^*(s,a)}, \quad s \in \mathcal{S}.$$
It is clear that $\sum_{a\in\mathcal{A}} \pi(a|s)\, q^*(s,a)$ is maximized if $\pi(s)$ selects the action with the greatest value of $q^*(s,a)$, which is exactly the greedy policy in (3.7).

The policy in (3.7) is called greedy because it seeks the action with the greatest $q^*(s,a)$. Finally, we discuss two important properties of $\pi^*$.
Uniqueness of optimal policies: Although the value of v ∗ is unique, the optimal policy
that corresponds to v ∗ may not be unique. This can be easily verified by counterex-
amples. For example, the two policies shown in Figure 3.3 are both optimal.
Stochasticity of optimal policies: An optimal policy can be either stochastic or de-
terministic, as demonstrated in Figure 3.3. However, it is certain that there always
exists a deterministic optimal policy according to Theorem 3.5.
Figure 3.3: Examples for demonstrating that optimal policies may not be unique. The two policies are
different but are both optimal.
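To illustrate Theorem 3.5, the following Python sketch (illustrative, not from the book) extracts a deterministic greedy policy from a given optimal state value $v^*$ and a known model. The arrays p_s, p_r, and rewards are hypothetical placeholders for $p(s'|s,a)$, $p(r|s,a)$, and the reward values.

import numpy as np

def greedy_policy_from_v(v, p_s, p_r, rewards, gamma=0.9):
    """Compute q*(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s')
    and return a*(s) = argmax_a q*(s,a) together with the q-table.
    p_s: (S, A, S) transition probabilities, p_r: (S, A, R) reward probabilities."""
    n_states, n_actions = p_s.shape[0], p_s.shape[1]
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            expected_reward = p_r[s, a] @ rewards       # sum_r p(r|s,a) r
            expected_next_value = p_s[s, a] @ v         # sum_s' p(s'|s,a) v(s')
            q[s, a] = expected_reward + gamma * expected_next_value
    return q.argmax(axis=1), q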
3.5 Factors that influence optimal policies
The optimal state value and optimal policy are determined by the following param-
eters: 1) the immediate reward r, 2) the discount rate γ, and 3) the system model
p(s'|s, a), p(r|s, a). While the system model is fixed, we next discuss how the optimal
policy varies when we change the values of r and γ. All the optimal policies presented
in this section can be obtained via the algorithm in Theorem 3.3. The implementation
details of the algorithm will be given in Chapter 4. The present chapter mainly focuses
on the fundamental properties of optimal policies.
A baseline example
Consider the example in Figure 3.4. The reward settings are rboundary = rforbidden = −1
and rtarget = 1. In addition, the agent receives a reward of rother = 0 for every movement
step. The discount rate is selected as γ = 0.9.
With the above parameters, the optimal policy and optimal state values are given in
Figure 3.4(a). It is interesting that the agent is not afraid of passing through forbidden
areas to reach the target area. More specifically, starting from the state at (row=4,
column=1), the agent has two options for reaching the target area. The first option is to
avoid all the forbidden areas and travel a long distance to the target area. The second
option is to pass through forbidden areas. Although the agent obtains negative rewards
when entering forbidden areas, the cumulative reward of the second trajectory is greater
than that of the first trajectory. Therefore, the optimal policy is far-sighted due to the
relatively large value of γ.
If we change the discount rate from γ = 0.9 to γ = 0.5 and keep other parameters
unchanged, the optimal policy becomes the one shown in Figure 3.4(b). It is interesting
that the agent does not dare to take risks anymore. Instead, it would travel a long
distance to reach the target while avoiding all the forbidden areas. This is because the
optimal policy becomes short-sighted due to the relatively small value of γ.
In the extreme case where γ = 0, the corresponding optimal policy is shown in
Figure 3.4(c). In this case, the agent is not able to reach the target area. This is
(b) The discount rate is changed to γ = 0.5. The other parameters are the same as those in (a).
(c) The discount rate is changed to γ = 0. The other parameters are the same as those in (a).
(d) rforbidden is changed from −1 to −10. The other parameters are the same as those in (a).
Figure 3.4: The optimal policies and optimal state values given different parameter values.
because the optimal policy for each state is extremely short-sighted and merely selects
the action with the greatest immediate reward instead of the greatest total reward.
In addition, the spatial distribution of the state values exhibits an interesting pattern:
the states close to the target have greater state values, whereas those far away have lower
values. This pattern can be observed in all the examples shown in Figure 3.4. It can be explained by the discount rate: if the agent must travel along a longer trajectory from a state to reach the target, the value of that state is smaller due to discounting.
If we want to strictly prohibit the agent from entering any forbidden area, we can increase
the punishment received for doing so. For instance, if rforbidden is changed from −1 to
−10, the resulting optimal policy can avoid all the forbidden areas (see Figure 3.4(d)).
However, changing the rewards does not always lead to different optimal policies.
One important fact is that optimal policies are invariant to affine transformations of the
rewards. In other words, if we scale all the rewards or add the same value to all the
rewards, the optimal policy remains the same.
Theorem 3.6 (Optimal policy invariance). Consider a Markov decision process with
v ∗ ∈ R|S| as the optimal state value satisfying v ∗ = maxπ∈Π (rπ + γPπ v ∗ ). If every reward
r ∈ R is changed by an affine transformation to αr + β, where α, β ∈ R and α > 0, then
the corresponding optimal state value $v'$ is also an affine transformation of $v^*$:
$$v' = \alpha v^* + \frac{\beta}{1-\gamma}\,\mathbf{1}, \quad (3.8)$$
where $\gamma \in (0,1)$ is the discount rate and $\mathbf{1} = [1, \ldots, 1]^T$. Consequently, the optimal policy derived from $v'$ is invariant to the affine transformation of the reward values.
For the transformed rewards $\alpha r + \beta$, the reward vector becomes $\alpha r_\pi + \beta\mathbf{1}$, so the corresponding BOE is
$$v' = \max_{\pi\in\Pi}\big(\alpha r_\pi + \beta\mathbf{1} + \gamma P_\pi v'\big). \quad (3.9)$$
We next solve the new BOE in (3.9) by showing that $v' = \alpha v^* + c\mathbf{1}$ with $c = \beta/(1-\gamma)$ is a solution of (3.9). In particular, substituting $v' = \alpha v^* + c\mathbf{1}$ into (3.9) gives
$$\alpha v^* + c\mathbf{1} = \max_{\pi\in\Pi}\big(\alpha r_\pi + \beta\mathbf{1} + \gamma P_\pi(\alpha v^* + c\mathbf{1})\big) = \max_{\pi\in\Pi}\big(\alpha r_\pi + \alpha\gamma P_\pi v^*\big) + \beta\mathbf{1} + c\gamma\mathbf{1},$$
where the last equality is due to the fact that $P_\pi\mathbf{1} = \mathbf{1}$. The above equation can be reorganized as
$$\alpha v^* = \alpha\max_{\pi\in\Pi}\,(r_\pi + \gamma P_\pi v^*) + \beta\mathbf{1} + c\gamma\mathbf{1} - c\mathbf{1},$$
which, since $v^* = \max_{\pi\in\Pi}(r_\pi + \gamma P_\pi v^*)$, is equivalent to
$$\beta\mathbf{1} + c\gamma\mathbf{1} - c\mathbf{1} = 0.$$
Since $c = \beta/(1-\gamma)$, the above equation is valid and hence $v' = \alpha v^* + c\mathbf{1}$ is a solution of (3.9). Since (3.9) is a BOE, $v'$ is also the unique solution. Finally, since $v'$ is an affine transformation of $v^*$, the relative relationships between the action values remain the same. Hence, the greedy optimal policy derived from $v'$ is the same as that derived from $v^*$: $\arg\max_{\pi\in\Pi}(r_\pi + \gamma P_\pi v')$ is the same as $\arg\max_{\pi\in\Pi}(r_\pi + \gamma P_\pi v^*)$.
Readers may refer to [9] for a further discussion on the conditions under which mod-
ifications to the reward values preserve the optimal policy.
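As a quick numerical check of Theorem 3.6, the following Python sketch (illustrative; it uses a small randomly generated MDP rather than the grid world) solves the BOE by the iterative algorithm of Theorem 3.3 for the original rewards and for the affinely transformed rewards αr + β, and then verifies (3.8) and the invariance of the greedy policy.

import numpy as np

def solve_boe(P, R, gamma, iters=1000):
    """Fixed-point iteration v <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') v(s') ].
    P: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = R + gamma * P @ v        # shape (S, A)
        v = q.max(axis=1)
    return v, q.argmax(axis=1)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

alpha, beta = 2.0, -1.0
v_star, pi_star = solve_boe(P, R, gamma)
v_prime, pi_prime = solve_boe(P, alpha * R + beta, gamma)

print(np.allclose(v_prime, alpha * v_star + beta / (1 - gamma)))  # True, as in (3.8)
print(np.array_equal(pi_star, pi_prime))                          # True: same greedy policy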
In the reward setting, the agent receives a reward of rother = 0 for every movement
step (unless it enters a forbidden area or the target area or attempts to go beyond the
boundary). Since a zero reward is not a punishment, would the optimal policy take
meaningless detours before reaching the target? Should we set rother to be negative to
encourage the agent to reach the target as quickly as possible?
Figure 3.5: Examples illustrating that optimal policies do not take meaningless detours due to the
discount rate.
Consider the examples in Figure 3.5, where the bottom-right cell is the target area
to reach. The two policies here are the same except for state s2 . By the policy in
Figure 3.5(a), the agent moves downward at s2 and the resulting trajectory is s2 → s4 .
By the policy in Figure 3.5(b), the agent moves leftward and the resulting trajectory is
s2 → s1 → s3 → s4 .
It is notable that the second policy takes a detour before reaching the target area. If
we merely consider the immediate rewards, taking this detour does not matter because
no negative immediate rewards will be obtained. However, if we consider the discounted
return, then this detour matters. In particular, for the first policy, the discounted return is
$$\text{return} = 1 + \gamma(1) + \gamma^2(1) + \cdots = \frac{1}{1-\gamma} = 10,$$
whereas for the second policy, the discounted return is
$$\text{return} = 0 + \gamma(0) + \gamma^2(1) + \gamma^3(1) + \cdots = \frac{\gamma^2}{1-\gamma} = 8.1.$$
It is clear that the shorter the trajectory is, the greater the return is. Therefore, although
the immediate reward of every step does not encourage the agent to approach the target
as quickly as possible, the discount rate does encourage it to do so.
A misunderstanding that beginners may have is that adding a negative reward (e.g.,
−1) on top of the rewards obtained for every movement is necessary to encourage the
agent to reach the target as quickly as possible. This is a misunderstanding because
adding the same reward on top of all rewards is an affine transformation, which preserves
the optimal policy. Moreover, optimal policies do not take meaningless detours due to
the discount rate, even though a detour may not receive any immediate negative rewards.
3.6 Summary
The core concepts in this chapter include optimal policies and optimal state values. In
particular, a policy is optimal if its state values are greater than or equal to those of any
other policy. The state values of an optimal policy are the optimal state values. The BOE
is the core tool for analyzing optimal policies and optimal state values. This equation
is a nonlinear equation with a nice contraction property. We can apply the contraction
mapping theorem to analyze this equation. It was shown that the solutions of the BOE
correspond to the optimal state value and optimal policy. This is the reason why we need
to study the BOE.
The contents of this chapter are important for thoroughly understanding many funda-
mental ideas of reinforcement learning. For example, Theorem 3.3 suggests an iterative
algorithm for solving the BOE. This algorithm is exactly the value iteration algorithm
that will be introduced in Chapter 4. A further discussion about the BOE can be found
in [2].
3.7 Q&A
Q: What is the definition of optimal policies?
A: A policy is optimal if its corresponding state values are greater than or equal to those of any other policy.
It should be noted that this specific definition of optimality is valid only for tabular
reinforcement learning algorithms. When the values or policies are approximated by
functions, different metrics must be used to define optimal policies. This will become
clearer in Chapters 8 and 9.
Q: Why is the Bellman optimality equation important?
A: It is important because it characterizes both optimal policies and optimal state
values. Solving this equation yields an optimal policy and the corresponding optimal
state value.
Q: Is the Bellman optimality equation a Bellman equation?
A: Yes. The Bellman optimality equation is a special Bellman equation whose corre-
sponding policy is optimal.
Q: Is the solution of the Bellman optimality equation unique?
A: The Bellman optimality equation has two unknown variables. The first unknown
variable is a value, and the second is a policy. The value solution, which is the optimal
state value, is unique. The policy solution, which is an optimal policy, may not be
unique.
Q: What is the key property of the Bellman optimality equation for analyzing its
solution?
A: The key property is that the right-hand side of the Bellman optimality equation is
a contraction mapping. As a result, we can apply the contraction mapping theorem
to analyze its solution.
Q: Do optimal policies exist?
A: Yes. Optimal policies always exist according to the analysis of the BOE.
Q: Are optimal policies unique?
A: No. There may exist multiple or even infinitely many optimal policies that share the same optimal state values.
Q: Are optimal policies stochastic or deterministic?
A: An optimal policy can be either deterministic or stochastic. A nice fact is that
there always exist deterministic greedy optimal policies.
Chapter 4
With the preparation in the previous chapters, we are now ready to present the first
algorithms that can find optimal policies. This chapter introduces three algorithms that
are closely related to each other. The first is the value iteration algorithm, which is
exactly the algorithm suggested by the contraction mapping theorem for solving the
Bellman optimality equation as discussed in the last chapter. We focus more on the
implementation details of this algorithm in the present chapter. The second is the policy
iteration algorithm, whose idea is widely used in reinforcement learning algorithms. The
third is the truncated policy iteration algorithm, which is a unified algorithm that includes
the value iteration and policy iteration algorithms as special cases.
The algorithms introduced in this chapter are called dynamic programming algorithms
[10, 11], which require the system model. These algorithms are important foundations of
the model-free reinforcement learning algorithms introduced in the subsequent chapters.
For example, the Monte Carlo algorithms introduced in Chapter 5 can be immediately
obtained by extending the policy iteration algorithm introduced in this chapter.
4.1 Value iteration

As suggested by the contraction mapping theorem in the last chapter, the Bellman optimality equation can be solved by the iterative algorithm
$$v_{k+1} = \max_{\pi\in\Pi}\,(r_\pi + \gamma P_\pi v_k), \quad k = 0, 1, 2, \ldots,$$
where $v_0$ is an arbitrary initial guess. This algorithm is called value iteration. It is guaranteed by Theorem 3.3 that $v_k$ and $\pi_k$ converge to the optimal state value and an optimal policy as $k \to \infty$, respectively.

This algorithm is iterative and has two steps in every iteration.

The first step in every iteration is a policy update step. Mathematically, it aims to find a policy that can solve the following optimization problem:
$$\pi_{k+1} = \arg\max_{\pi\in\Pi}\,(r_\pi + \gamma P_\pi v_k),$$
where $v_k$ is given.

The second step is a value update step, in which the new value is computed as
$$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k.$$

To implement the algorithm, we need the elementwise forms of the two steps.
First, the elementwise form of the policy update step πk+1 = arg maxπ (rπ + γPπ vk ) is
$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a|s) \underbrace{\left(\sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s')\right)}_{q_k(s,a)}, \quad s \in \mathcal{S}.$$
We showed in Section 3.3.1 that the optimal policy that can solve the above optimiza-
tion problem is
$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s), \\ 0, & a \neq a_k^*(s), \end{cases} \quad (4.2)$$
where $a_k^*(s) = \arg\max_a q_k(s,a)$. If $\arg\max_a q_k(s,a)$ has multiple solutions,
we can select any of them without affecting the convergence of the algorithm. Since
the new policy πk+1 selects the action with the greatest qk (s, a), such a policy is called
greedy.
Second, the elementwise form of the value update step vk+1 = rπk+1 + γPπk+1 vk is
$$v_{k+1}(s) = \sum_{a} \pi_{k+1}(a|s) \underbrace{\left(\sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s')\right)}_{q_k(s,a)}, \quad s \in \mathcal{S}.$$
The above procedure can be summarized as
$$v_k(s) \;\rightarrow\; q_k(s,a) \;\rightarrow\; \text{new greedy policy } \pi_{k+1}(s) \;\rightarrow\; \text{new value } v_{k+1}(s) = \max_a q_k(s,a).$$
Initialization: The probability models p(r|s,a) and p(s'|s,a) for all (s,a) are known. Initial guess v0.
Goal: Search for the optimal state value and an optimal policy for solving the Bellman optimality equation.
While vk has not converged in the sense that ‖vk − vk−1‖ is greater than a predefined small threshold, for the kth iteration, do
  For every state s ∈ S, do
    For every action a ∈ A(s), do
      q-value: qk(s,a) = Σ_r p(r|s,a) r + γ Σ_{s'} p(s'|s,a) vk(s')
    Maximum action value: a*k(s) = arg max_a qk(s,a)
    Policy update: πk+1(a|s) = 1 if a = a*k(s), and πk+1(a|s) = 0 otherwise
    Value update: vk+1(s) = max_a qk(s,a)
q-table a1 a2 a3 a4 a5
s1 −1 + γv(s1 ) −1 + γv(s2 ) 0 + γv(s3 ) −1 + γv(s1 ) 0 + γv(s1 )
s2 −1 + γv(s2 ) −1 + γv(s2 ) 1 + γv(s4 ) 0 + γv(s1 ) −1 + γv(s2 )
s3 0 + γv(s1 ) 1 + γv(s4 ) −1 + γv(s3 ) −1 + γv(s3 ) 0 + γv(s3 )
s4 −1 + γv(s2 ) −1 + γv(s4 ) −1 + γv(s4 ) 0 + γv(s3 ) 1 + γv(s4 )
Table 4.1: The expression of q(s, a) for the example as shown in Figure 4.2.
Consider the example shown in Figure 4.2, where the target area is s4. The reward settings are rboundary = rforbidden = −1
rtarget = 1. The discount rate is γ = 0.9.
Figure 4.2: An example for demonstrating the implementation of the value iteration algorithm.
The expression of the q-value for each state-action pair is shown in Table 4.1.
k = 0:
Without loss of generality, select the initial values as v0 (s1 ) = v0 (s2 ) = v0 (s3 ) =
v0 (s4 ) = 0.
q-value calculation: Substituting v0 (si ) into Table 4.1 gives the q-values shown in
Table 4.2.
q-table a1 a2 a3 a4 a5
s1 −1 −1 0 −1 0
s2 −1 −1 1 0 −1
s3 0 1 −1 −1 0
s4 −1 −1 −1 0 1
Table 4.2: The value of q(s, a) at k = 0.
q-table a1 a2 a3 a4 a5
s1 −1 + γ0 −1 + γ1 0 + γ1 −1 + γ0 0 + γ0
s2 −1 + γ1 −1 + γ1 1 + γ1 0 + γ0 −1 + γ1
s3 0 + γ0 1 + γ1 −1 + γ1 −1 + γ1 0 + γ1
s4 −1 + γ1 −1 + γ1 −1 + γ1 0 + γ1 1 + γ1
Table 4.3: The value of q(s, a) at k = 1.
Policy update: π1 is obtained by selecting the actions with the greatest q-values for every state:
$$\pi_1(a_5|s_1) = 1, \quad \pi_1(a_3|s_2) = 1, \quad \pi_1(a_2|s_3) = 1, \quad \pi_1(a_5|s_4) = 1.$$
This policy is visualized in Figure 4.2 (the middle subfigure). It is clear that this policy
is not optimal because it selects to stay unchanged at s1 . Notably, the q-values for
(s1 , a5 ) and (s1 , a3 ) are actually the same, and we can randomly select either action.
Value update: v1 is obtained by updating the v-value to the greatest q-value for each
state:
v1 (s1 ) = 0, v1 (s2 ) = 1, v1 (s3 ) = 1, v1 (s4 ) = 1.
k = 1:
q-value calculation: Substituting v1(si) into Table 4.1 yields the q-values shown in Table 4.3.
Policy update: π2 is obtained by selecting the actions with the greatest q-values:
$$\pi_2(a_3|s_1) = 1, \quad \pi_2(a_3|s_2) = 1, \quad \pi_2(a_2|s_3) = 1, \quad \pi_2(a_5|s_4) = 1.$$
This policy is optimal.
k = 2, 3, 4, . . .: The iterative process continues in the same manner. Since the policy obtained at k = 1 is already optimal,
we only need to run two iterations to obtain an optimal policy in this simple example. For
more complex examples, we need to run more iterations until the value of vk converges
(e.g., until ‖vk+1 − vk‖ is smaller than a pre-specified threshold).
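The worked example above can be reproduced with a short Python sketch (illustrative; the model of the 2×2 grid is encoded directly from Table 4.1, with states and actions 0-indexed):

import numpy as np

gamma = 0.9

# next_state[s][a] and reward[s][a] encode Table 4.1 (states s1..s4, actions a1..a5).
next_state = np.array([[0, 1, 2, 0, 0],
                       [1, 1, 3, 0, 1],
                       [0, 3, 2, 2, 2],
                       [1, 3, 3, 2, 3]])
reward = np.array([[-1, -1, 0, -1, 0],
                   [-1, -1, 1, 0, -1],
                   [0, 1, -1, -1, 0],
                   [-1, -1, -1, 0, 1]])

v = np.zeros(4)                                   # initial guess v0 = 0
for k in range(100):
    q = reward + gamma * v[next_state]            # q_k(s, a)
    v_new, policy = q.max(axis=1), q.argmax(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-6:
        break
    v = v_new

print(policy + 1)   # greedy action (1-based) for each state
print(v)            # approaches the optimal state values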
4.2 Policy iteration

The policy iteration algorithm starts from an arbitrary initial policy π0, and each of its iterations has two steps.

The first is a policy evaluation step. As its name suggests, this step evaluates a given policy by calculating the corresponding state value. That is to solve the following Bellman equation:
$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}, \quad (4.3)$$
where $\pi_k$ is the policy obtained in the last iteration and $v_{\pi_k}$ is the state value to be calculated. The values of $r_{\pi_k}$ and $P_{\pi_k}$ can be obtained from the system model.

The second is a policy improvement step. As its name suggests, this step is used to improve the policy. In particular, once $v_{\pi_k}$ has been calculated in the first step, a new policy $\pi_{k+1}$ can be obtained as
$$\pi_{k+1} = \arg\max_{\pi\in\Pi}\,(r_\pi + \gamma P_\pi v_{\pi_k}).$$

This algorithm raises several questions.
In the policy evaluation step, how to solve the state value vπk ?
In the policy improvement step, why is the new policy πk+1 better than πk ?
Why can this algorithm finally converge to an optimal policy?
We introduced two methods in Chapter 2 for solving the Bellman equation in (4.3).
We next revisit the two methods briefly. The first method is a closed-form solution:
vπk = (I−γPπk )−1 rπk . This closed-form solution is useful for theoretical analysis purposes,
but it is inefficient to implement since it requires other numerical algorithms to compute
the matrix inverse. The second method is an iterative algorithm that can be easily
implemented:
$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \ldots \quad (4.4)$$
where $v_{\pi_k}^{(j)}$ denotes the $j$th estimate of $v_{\pi_k}$. Starting from any initial guess $v_{\pi_k}^{(0)}$, it is ensured that $v_{\pi_k}^{(j)} \to v_{\pi_k}$ as $j \to \infty$. Details can be found in Section 2.7.
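A minimal Python sketch of this iterative policy evaluation (illustrative; the model vectors r_pi and P_pi are assumed inputs):

import numpy as np

def policy_evaluation(r_pi, P_pi, gamma, tol=1e-8, max_iters=10_000):
    """Iterate v^{(j+1)} = r_pi + gamma * P_pi @ v^{(j)} until convergence.
    r_pi: (S,) reward vector of the policy; P_pi: (S, S) state transition matrix."""
    v = np.zeros_like(r_pi, dtype=float)     # arbitrary initial guess
    for _ in range(max_iters):
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Example with hypothetical model vectors (a policy that always moves left in a two-state world):
# r_pi = np.array([-1.0, 0.0]); P_pi = np.array([[1.0, 0.0], [1.0, 0.0]])
# print(policy_evaluation(r_pi, P_pi, gamma=0.9))   # approaches [-10, -9]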
Interestingly, policy iteration is an iterative algorithm with another iterative algorithm
(4.4) embedded in the policy evaluation step. In theory, this embedded iterative algorithm
requires an infinite number of steps (that is, j → ∞) to converge to the true state value
vπk . This is, however, impossible to realize. In practice, the iterative process terminates
when a certain criterion is satisfied. For example, the termination criterion can be that $\|v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)}\|$ is less than a prespecified threshold or that $j$ exceeds a prespecified value.
If we do not run an infinite number of iterations, we can only obtain an imprecise value
of vπk , which will be used in the subsequent policy improvement step. Would this cause
problems? The answer is no. The reason will become clear when we introduce the
truncated policy iteration algorithm later in Section 4.3.
The policy improvement step can improve the given policy, as shown below.
Lemma 4.1 (Policy improvement). If πk+1 = arg maxπ (rπ + γPπ vπk ), then vπk+1 ≥ vπk .
Here, vπk+1 ≥ vπk means that vπk+1 (s) ≥ vπk (s) for all s. The proof of this lemma is
given in Box 4.1.
Therefore, $v_{\pi_{k+1}} \ge v_{\pi_k}$.
Why can the policy iteration algorithm eventually find an optimal policy?
The policy iteration algorithm generates two sequences. The first is a sequence of policies:
{π0 , π1 , . . . , πk , . . . }. The second is a sequence of state values: {vπ0 , vπ1 , . . . , vπk , . . . }.
Suppose that v ∗ is the optimal state value. Then, vπk ≤ v ∗ for all k. Since the policies
are continuously improved according to Lemma 4.1, we know that
$$v_{\pi_0} \le v_{\pi_1} \le v_{\pi_2} \le \cdots \le v_{\pi_k} \le \cdots \le v^*.$$
Since vπk is nondecreasing and always bounded from above by v ∗ , it follows from the
monotone convergence theorem [12] (Appendix C) that vπk converges to a constant value,
denoted as v∞ , when k → ∞. The following analysis shows that v∞ = v ∗ .
Theorem 4.1 (Convergence of policy iteration). The state value sequence $\{v_{\pi_k}\}_{k=0}^{\infty}$ generated by the policy iteration algorithm converges to the optimal state value $v^*$. As a result, the policy sequence $\{\pi_k\}_{k=0}^{\infty}$ converges to an optimal policy.
The proof of this theorem is given in Box 4.2. The proof not only shows the conver-
gence of the policy iteration algorithm but also reveals the relationship between the policy
iteration and value iteration algorithms. Loosely speaking, if both algorithms start from
the same initial guess, policy iteration will converge faster than value iteration due to
the additional iterations embedded in the policy evaluation step. This point will become
clearer when we introduce the truncated policy iteration algorithm in Section 4.3.
Box 4.2: Proof of Theorem 4.1

To prove the convergence of $\{v_{\pi_k}\}$, we introduce an auxiliary sequence $\{v_k\}$ generated by $v_{k+1} = \max_{\pi\in\Pi}(r_\pi + \gamma P_\pi v_k)$. This iterative algorithm is exactly the value iteration algorithm, so we already know that $v_k$ converges to $v^*$ when given any initial value $v_0$.
For k = 0, we can always find a v0 such that vπ0 ≥ v0 for any π0 .
We next show that vk ≤ vπk ≤ v ∗ for all k by induction.
For k ≥ 0, suppose that vπk ≥ vk .
For $k+1$, it can be shown that
$$v_{\pi_{k+1}} - v_{k+1} \ge \gamma P_{\pi_k'}\,(v_{\pi_k} - v_k),$$
where $\pi_k'$ denotes the greedy policy computed in the value update step. Since $v_{\pi_k} - v_k \ge 0$ and $P_{\pi_k'}$ is nonnegative, we have $P_{\pi_k'}(v_{\pi_k} - v_k) \ge 0$ and hence $v_{\pi_{k+1}} - v_{k+1} \ge 0$.
Therefore, we can show by induction that vk ≤ vπk ≤ v ∗ for any k ≥ 0. Since vk
converges to v ∗ , vπk also converges to v ∗ .
First, the policy evaluation step solves $v_{\pi_k}$ from $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$ by using the iterative algorithm in (4.4), whose elementwise form is
$$v_{\pi_k}^{(j+1)}(s) = \sum_{a} \pi_k(a|s)\left(\sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}^{(j)}(s')\right), \quad s \in \mathcal{S}, \quad j = 0, 1, 2, \ldots$$
Initialization: The system model, p(r|s,a) and p(s'|s,a) for all (s,a), is known. Initial guess π0.
Goal: Search for the optimal state value and an optimal policy.
While vπk has not converged, for the kth iteration, do
  Policy evaluation:
    Initialization: an arbitrary initial guess v^{(0)}_{πk}
    While v^{(j)}_{πk} has not converged, for the jth iteration, do
      For every state s ∈ S, do
        v^{(j+1)}_{πk}(s) = Σ_a πk(a|s) [ Σ_r p(r|s,a) r + γ Σ_{s'} p(s'|s,a) v^{(j)}_{πk}(s') ]
  Policy improvement:
    For every state s ∈ S, do
      For every action a ∈ A, do
        qπk(s,a) = Σ_r p(r|s,a) r + γ Σ_{s'} p(s'|s,a) vπk(s')
      a*k(s) = arg max_a qπk(s,a)
      πk+1(a|s) = 1 if a = a*k(s), and πk+1(a|s) = 0 otherwise
Second, the policy improvement step solves πk+1 = arg maxπ (rπ + γPπ vπk ). The
elementwise form of this equation is
$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a|s) \underbrace{\left(\sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s')\right)}_{q_{\pi_k}(s,a)}, \quad s \in \mathcal{S},$$
where qπk (s, a) is the action value under policy πk . Let a∗k (s) = arg maxa qπk (s, a).
Then, the greedy optimal policy is
$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s), \\ 0, & a \neq a_k^*(s). \end{cases}$$
Consider the simple example shown in Figure 4.3. There are two states and three possible actions: $\mathcal{A} = \{a_\ell, a_0, a_r\}$, which represent moving leftward, staying unchanged, and moving rightward, respectively. The reward settings are rboundary = −1 and rtarget = 1. The discount rate is γ = 0.9.
Figure 4.3: An example for illustrating the implementation of the policy iteration algorithm.
First, in the policy evaluation step, we need to solve the Bellman equation of the initial policy π0 shown in Figure 4.3(a):
$$v_{\pi_0}(s_1) = -1 + \gamma v_{\pi_0}(s_1), \qquad v_{\pi_0}(s_2) = 0 + \gamma v_{\pi_0}(s_1).$$
In practice, the equation can be solved by the iterative algorithm in (4.4). For example,
select the initial state values as $v_{\pi_0}^{(0)}(s_1) = v_{\pi_0}^{(0)}(s_2) = 0$. It follows from (4.3) that
$$v_{\pi_0}^{(1)}(s_1) = -1 + \gamma v_{\pi_0}^{(0)}(s_1) = -1, \qquad v_{\pi_0}^{(1)}(s_2) = 0 + \gamma v_{\pi_0}^{(0)}(s_1) = 0,$$
$$v_{\pi_0}^{(2)}(s_1) = -1 + \gamma v_{\pi_0}^{(1)}(s_1) = -1.9, \qquad v_{\pi_0}^{(2)}(s_2) = 0 + \gamma v_{\pi_0}^{(1)}(s_1) = -0.9,$$
$$v_{\pi_0}^{(3)}(s_1) = -1 + \gamma v_{\pi_0}^{(2)}(s_1) = -2.71, \qquad v_{\pi_0}^{(3)}(s_2) = 0 + \gamma v_{\pi_0}^{(2)}(s_1) = -1.71,$$
$$\vdots$$
With more iterations, we can see the trend: $v_{\pi_0}^{(j)}(s_1) \to v_{\pi_0}(s_1) = -10$ and $v_{\pi_0}^{(j)}(s_2) \to v_{\pi_0}(s_2) = -9$ as $j$ increases.
Second, in the policy improvement step, the key is to calculate qπ0 (s, a) for each
state-action pair. The following q-table can be used to demonstrate such a process:
q_{π_k}(s,a)     a_ℓ                    a_0                    a_r
s_1              −1 + γv_{π_k}(s_1)     0 + γv_{π_k}(s_1)      1 + γv_{π_k}(s_2)
s_2              0 + γv_{π_k}(s_1)      1 + γv_{π_k}(s_2)      −1 + γv_{π_k}(s_2)
Table 4.4: The expression of qπk (s, a) for the example in Figure 4.3.
Substituting vπ0 (s1 ) = −10, vπ0 (s2 ) = −9 obtained in the previous policy evaluation
step into Table 4.4 yields Table 4.5.
q_{π_0}(s,a)     a_ℓ       a_0       a_r
s_1              −10       −9        −7.1
s_2              −9        −7.1      −9.1
Table 4.5: The value of qπk (s, a) when k = 0.
By seeking the greatest value of qπ0 for each state, the improved policy π1 can be obtained as
$$\pi_1(a_r|s_1) = 1, \qquad \pi_1(a_0|s_2) = 1.$$
This policy is illustrated in Figure 4.3(b). It is clear that this policy is optimal.
The above process shows that a single iteration is sufficient for finding the optimal
policy in this simple example. More iterations are required for more complex examples.
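This two-state example can be reproduced with the following Python sketch (illustrative; the model is encoded from Table 4.4):

import numpy as np

gamma = 0.9
# Model of the two-state example (actions a_l, a_0, a_r, 0-indexed).
next_state = np.array([[0, 0, 1],      # from s1: left bumps the boundary, stay, right -> s2
                       [0, 1, 1]])     # from s2: left -> s1, stay, right bumps the boundary
reward = np.array([[-1.0, 0.0, 1.0],
                   [ 0.0, 1.0, -1.0]])

policy = np.array([0, 0])              # pi_0: move left in both states (Figure 4.3(a))

# Policy evaluation: iterate v <- r_pi + gamma * P_pi v.
v = np.zeros(2)
for _ in range(1000):
    v = reward[np.arange(2), policy] + gamma * v[next_state[np.arange(2), policy]]
print(v)                               # approximately [-10, -9]

# Policy improvement: greedy with respect to q_pi0.
q = reward + gamma * v[next_state]
print(q)                               # matches Table 4.5
print(q.argmax(axis=1))                # [2 1]: move right at s1, stay at s2 (the optimal policy)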
We next demonstrate the policy iteration algorithm using a more complicated example
shown in Figure 4.4. The reward settings are rboundary = −1, rforbidden = −10, and rtarget =
1. The discount rate is γ = 0.9. The policy iteration algorithm can converge to the
optimal policy (Figure 4.4(h)) when starting from a random initial policy (Figure 4.4(a)).
Two interesting phenomena are observed during the iteration process.
First, if we observe how the policy evolves, an interesting pattern is that the states
that are close to the target area find the optimal policies earlier than those far away.
Only if the close states can find trajectories to the target first, can the farther states
find trajectories passing through the close states to reach the target.
Second, the spatial distribution of the state values exhibits an interesting pattern: the
states that are located closer to the target have greater state values. The reason for
this pattern is that an agent starting from a farther state must travel for many steps
to obtain a positive reward. Such rewards would be severely discounted and hence
relatively small.
Figure 4.4: The evolution processes of the policies generated by the policy iteration algorithm.
4.3 Truncated policy iteration

We first compare the procedures of the policy iteration and value iteration algorithms.
Policy iteration: Select an arbitrary initial policy π0. In the kth iteration, do the following two steps: policy evaluation (PE), which solves $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$, and policy improvement (PI), which computes $\pi_{k+1} = \arg\max_\pi(r_\pi + \gamma P_\pi v_{\pi_k})$.

Value iteration: Select an arbitrary initial value v0. In the kth iteration, do the following two steps: policy update (PU), which computes $\pi_{k+1} = \arg\max_\pi(r_\pi + \gamma P_\pi v_k)$, and value update (VU), which computes $v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$.
Policy iteration:  π0 --PE--> vπ0 --PI--> π1 --PE--> vπ1 --PI--> π2 --PE--> vπ2 --PI--> ⋯
Value iteration:   v0 --PU--> π1' --VU--> v1 --PU--> π2' --VU--> v2 --PU--> ⋯
It can be seen that the procedures of the two algorithms are very similar.
We examine their value steps more closely to see the difference between the two
algorithms. In particular, let both algorithms start from the same initial condition:
v0 = vπ0 . The procedures of the two algorithms are listed in Table 4.6. In the first
three steps, the two algorithms generate the same results since v0 = vπ0 . They become
Step 3 (policy update): policy iteration computes π1 = arg max_π(r_π + γP_π v_{π0}); value iteration computes π1 = arg max_π(r_π + γP_π v0). The two policies are the same since v0 = v_{π0}.
Step 4 (value update): policy iteration solves v_{π1} = r_{π1} + γP_{π1} v_{π1}; value iteration computes v1 = r_{π1} + γP_{π1} v0. Here v_{π1} ≥ v1 since v_{π1} ≥ v_{π0}.
Step 5 (policy update): policy iteration computes π2 = arg max_π(r_π + γP_π v_{π1}); value iteration computes π2' = arg max_π(r_π + γP_π v1).
⋮
Table 4.6: A comparison between the implementation steps of policy iteration and value iteration.
different in the fourth step. During the fourth step, the value iteration algorithm executes
v1 = rπ1 + γPπ1 v0 , which is a one-step calculation, whereas the policy iteration algorithm
solves vπ1 = rπ1 + γPπ1 vπ1 , which requires an infinite number of iterations. If we explicitly
write out the iterative process for solving vπ1 = rπ1 +γPπ1 vπ1 in the fourth step, everything
becomes clear. By letting $v_{\pi_1}^{(0)} = v_0$, we have
$$\begin{aligned}
v_{\pi_1}^{(0)} &= v_0, \\
\text{value iteration} \leftarrow v_1 \longleftarrow v_{\pi_1}^{(1)} &= r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(0)}, \\
v_{\pi_1}^{(2)} &= r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(1)}, \\
&\;\;\vdots \\
\text{truncated policy iteration} \leftarrow \bar{v}_1 \longleftarrow v_{\pi_1}^{(j)} &= r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(j-1)}, \\
&\;\;\vdots \\
\text{policy iteration} \leftarrow v_{\pi_1} \longleftarrow v_{\pi_1}^{(\infty)} &= r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(\infty)}.
\end{aligned}$$
As a result, the value iteration and policy iteration algorithms can be viewed as two extreme cases of the truncated policy iteration algorithm: value iteration terminates the above iterative process after a single step ($j = 1$), whereas policy iteration runs it until convergence ($j \to \infty$). The truncated policy iteration algorithm terminates after a finite number of steps $j_{\text{truncate}}$.
Initialization: The probability models p(r|s,a) and p(s'|s,a) for all (s,a) are known. Initial guess π0.
Goal: Search for the optimal state value and an optimal policy.
While vk has not converged, for the kth iteration, do
  Policy evaluation:
    Initialization: select the initial guess as v^{(0)}_k = v_{k−1}. The maximum number of iterations is set as j_truncate.
    While j < j_truncate, do
      For every state s ∈ S, do
        v^{(j+1)}_k(s) = Σ_a πk(a|s) [ Σ_r p(r|s,a) r + γ Σ_{s'} p(s'|s,a) v^{(j)}_k(s') ]
    Set vk = v^{(j_truncate)}_k
  Policy improvement:
    For every state s ∈ S, do
      For every action a ∈ A(s), do
        qk(s,a) = Σ_r p(r|s,a) r + γ Σ_{s'} p(s'|s,a) vk(s')
      a*k(s) = arg max_a qk(s,a)
      πk+1(a|s) = 1 if a = a*k(s), and πk+1(a|s) = 0 otherwise
Figure 4.5: An illustration of the relationships between the value iteration, policy iteration, and truncated policy iteration algorithms.

Proposition 4.1 (Value improvement). Consider the iterative algorithm in the policy evaluation step:
$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \ldots$$
If the initial guess is selected as $v_{\pi_k}^{(0)} = v_{\pi_{k-1}}$, it holds that
$$v_{\pi_k}^{(j+1)} \ge v_{\pi_k}^{(j)} \quad \text{for } j = 0, 1, 2, \ldots$$
The proof is as follows. First, it follows from the iterative equation that
$$v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)} = \gamma P_{\pi_k}\big(v_{\pi_k}^{(j)} - v_{\pi_k}^{(j-1)}\big) = \cdots = \gamma^j P_{\pi_k}^j\big(v_{\pi_k}^{(1)} - v_{\pi_k}^{(0)}\big). \quad (4.5)$$
Second, since $v_{\pi_k}^{(0)} = v_{\pi_{k-1}}$, we have
$$v_{\pi_k}^{(1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(0)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_{k-1}} \ge r_{\pi_{k-1}} + \gamma P_{\pi_{k-1}} v_{\pi_{k-1}} = v_{\pi_{k-1}} = v_{\pi_k}^{(0)},$$
where the inequality is due to $\pi_k = \arg\max_\pi(r_\pi + \gamma P_\pi v_{\pi_{k-1}})$. Substituting $v_{\pi_k}^{(1)} \ge v_{\pi_k}^{(0)}$ into (4.5) yields $v_{\pi_k}^{(j+1)} \ge v_{\pi_k}^{(j)}$.
Notably, Proposition 4.1 requires the assumption that $v_{\pi_k}^{(0)} = v_{\pi_{k-1}}$. However, $v_{\pi_{k-1}}$ is unavailable in practice, and only $v_{k-1}$ is available. Nevertheless, Proposition 4.1 still sheds
light on the convergence of the truncated policy iteration algorithm. A more in-depth
discussion of this topic can be found in [2, Section 6.5].
Up to now, the advantages of truncated policy iteration are clear. Compared to the
policy iteration algorithm, the truncated one merely requires a finite number of iterations
in the policy evaluation step and hence is more computationally efficient. Compared to
value iteration, the truncated policy iteration algorithm can speed up its convergence
rate by running for a few more iterations in the policy evaluation step.
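A minimal Python sketch of truncated policy iteration (illustrative; next_state and reward are tabular model arrays of the kind used in the earlier sketches, and j_truncate is the truncation parameter):

import numpy as np

def truncated_policy_iteration(next_state, reward, gamma=0.9, j_truncate=5, num_iters=100):
    """Truncated policy iteration for a deterministic tabular model.
    next_state[s, a] and reward[s, a] describe the model. j_truncate controls how many
    policy evaluation sweeps are run per iteration: 1 corresponds to value iteration,
    while a very large value approaches policy iteration."""
    n_states = reward.shape[0]
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    idx = np.arange(n_states)
    for _ in range(num_iters):
        # Truncated policy evaluation, initialized with the value from the last iteration.
        for _ in range(j_truncate):
            v = reward[idx, policy] + gamma * v[next_state[idx, policy]]
        # Policy improvement (greedy with respect to the current value estimate).
        policy = (reward + gamma * v[next_state]).argmax(axis=1)
    return v, policy

# Usage: v_opt, pi_opt = truncated_policy_iteration(next_state, reward)
# with, e.g., the model arrays of the earlier two-state or 2x2 grid sketches.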
4.4 Summary
This chapter introduced three algorithms that can be used to find optimal policies.
Value iteration: The value iteration algorithm is the same as the algorithm suggested
by the contraction mapping theorem for solving the Bellman optimality equation. It
can be decomposed into two steps: value update and policy update.
Policy iteration: The policy iteration algorithm is slightly more complicated than the
value iteration algorithm. It also contains two steps: policy evaluation and policy
improvement.
Truncated policy iteration: The value iteration and policy iteration algorithms can
be viewed as two extreme cases of the truncated policy iteration algorithm.
A common property of the three algorithms is that every iteration has two steps.
One step is to update the value, and the other step is to update the policy. The idea
of interaction between value and policy updates widely exists in reinforcement learning
algorithms. This idea is also called generalized policy iteration [3].
Finally, the algorithms introduced in this chapter require the system model. Starting
in Chapter 5, we will study model-free reinforcement learning algorithms. We will see that
model-free algorithms can be obtained by extending the algorithms introduced in this chapter.
4.5 Q&A
Q: Is the value iteration algorithm guaranteed to find optimal policies?
A: Yes. This is because value iteration is exactly the algorithm suggested by the
contraction mapping theorem for solving the Bellman optimality equation in the last
chapter. The convergence of this algorithm is guaranteed by the contraction mapping
theorem.
Q: Are the intermediate values generated by the value iteration algorithm state values?
A: No. These values are not guaranteed to satisfy the Bellman equation of any policy.
Q: What steps are included in the policy iteration algorithm?
A: Each iteration of the policy iteration algorithm contains two steps: policy evalu-
ation and policy improvement. In the policy evaluation step, the algorithm aims to
solve the Bellman equation to obtain the state value of the current policy. In the
policy improvement step, the algorithm aims to update the policy so that the newly
generated policy has greater state values.
Q: Is another iterative algorithm embedded in the policy iteration algorithm?
A: Yes. In the policy evaluation step of the policy iteration algorithm, an iterative
algorithm is required to solve the Bellman equation of the current policy.
Q: Are the intermediate values generated by the policy iteration algorithm state val-
ues?
A: Yes. This is because these values are the solutions of the Bellman equation of the
current policy.
Q: Is the policy iteration algorithm guaranteed to find optimal policies?
A: Yes. We have presented a rigorous proof of its convergence in this chapter.
Q: What is the relationship between the truncated policy iteration and policy iteration
algorithms?
A: As its name suggests, the truncated policy iteration algorithm can be obtained
from the policy iteration algorithm by simply executing a finite number of iterations
during the policy evaluation step.
Q: What is the relationship between truncated policy iteration and value iteration?
A: Value iteration can be viewed as an extreme case of truncated policy iteration,
where a single iteration is run during the policy evaluation step.
Q: Are the intermediate values generated by the truncated policy iteration algorithm
state values?
A: No. Only if we run an infinite number of iterations in the policy evaluation step,
can we obtain true state values. If we run a finite number of iterations, we can only
obtain approximations of the true state values.
Q: How many iterations should we run in the policy evaluation step of the truncated
policy iteration algorithm?
A: The general guideline is to run a few iterations but not too many. The use of a few
iterations in the policy evaluation step can speed up the overall convergence rate, but
running too many iterations would not significantly speed up the convergence rate.
Q: What is generalized policy iteration?
A: Generalized policy iteration is not a specific algorithm. Instead, it refers to the
general idea of the interaction between value and policy updates. This idea is root-
ed in the policy iteration algorithm. Most of the reinforcement learning algorithms
introduced in this book fall into the scope of generalized policy iteration.
Q: What are model-based and model-free reinforcement learning?
A: Although the algorithms introduced in this chapter can find optimal policies, they
are usually called dynamic programming algorithms rather than reinforcement learn-
ing algorithms because they require the system model. Reinforcement learning al-
gorithms can be classified into two categories: model-based and model-free. Here,
“model-based” does not refer to the requirement of the system model. Instead, model-
based reinforcement learning uses data to estimate the system model and uses this
model during the learning process. By contrast, model-free reinforcement learning
does not involve model estimation during the learning process. All the reinforce-
ment learning algorithms introduced in this book are model-free algorithms. More
information about model-based reinforcement learning can be found in [13–16].
Chapter 5
In the previous chapter, we introduced algorithms that can find optimal policies based
on the system model. In this chapter, we start introducing model-free reinforcement
learning algorithms that do not presume system models.
Since this is the first time that model-free algorithms are introduced in this book, we must fill a knowledge gap: how can we find optimal policies without models?
is simple: If we do not have a model, we must have some data. If we do not have data,
we must have a model. If we have neither, then we are not able to find optimal policies.
The “data” in reinforcement learning usually refers to the agent’s interaction experiences
with the environment.
To demonstrate how to learn from data rather than a model, we start this chapter by
introducing the mean estimation problem, where the expected value of a random variable
is estimated from some samples. Understanding this problem is crucial for understanding
the fundamental idea of learning from data.
Then, we introduce three algorithms based on Monte Carlo (MC) methods. These
algorithms can learn optimal policies from experience samples. The first and simplest
algorithm is called MC Basic, which can be readily obtained by modifying the policy itera-
tion algorithm introduced in the last chapter. Understanding this algorithm is important
for grasping the fundamental idea of MC-based reinforcement learning. By extending
this algorithm, we further introduce another two algorithms that are more complicated
but more efficient.
5.1 Motivating example: Mean estimation

Consider a random variable $X$ that takes values from a finite set $\mathcal{X}$. Suppose that our task is to compute the mean (i.e., the expected value) $\mathbb{E}[X]$. There are two approaches.

The first approach is model-based. Here, the model refers to the probability distribution of $X$. If the model is known, then the mean can be directly calculated based on the definition of the expected value:
$$\mathbb{E}[X] = \sum_{x\in\mathcal{X}} p(x)\, x.$$
In this book, we use the terms expected value, mean, and average interchangeably.
The second approach is model-free. When the probability distribution (i.e., the model) of $X$ is unknown, suppose that we have some samples $\{x_1, x_2, \ldots, x_n\}$ of $X$. Then, the mean can be approximated as
$$\mathbb{E}[X] \approx \bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j.$$
This is guaranteed by the law of large numbers: the average of a large number of
samples is close to the expected value. The law of large numbers is introduced in
Box 5.1.
The following example illustrates the two approaches described above. Consider a
coin flipping game. Let random variable X denote which side is showing when the coin
lands. X has two possible values: X = 1 when the head is showing, and X = −1 when
the tail is showing. Suppose that the true probability distribution (i.e., the model) of X
is $p(X = 1) = 0.5$ and $p(X = -1) = 0.5$. If the probability distribution is known in advance, we can directly calculate the mean as $\mathbb{E}[X] = 0.5 \cdot 1 + 0.5 \cdot (-1) = 0$.
If the probability distribution is unknown, then we can flip the coin many times and
record the sampling results {xi }ni=1 . By calculating the average of the samples, we can
obtain an estimate of the mean. As shown in Figure 5.2, the estimated mean becomes
increasingly accurate as the number of samples increases.
Figure 5.2: An example for demonstrating the law of large numbers. Here, the samples are drawn from
{+1, −1} following a uniform distribution. The average of the samples gradually converges to zero, which
is the true expected value, as the number of samples increases.
It is worth mentioning that the samples used for mean estimation must be independent
and identically distributed (i.i.d. or iid). Otherwise, if the sampling values correlate, it
may be impossible to correctly estimate the expected value. An extreme case is that
all the sampling values are the same as the first one, whatever the first one is. In this
case, the average of the samples is always equal to the first sample, no matter how many
samples we use.
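The law of large numbers can be illustrated with a few lines of Python (an illustrative sketch reproducing the spirit of Figure 5.2):

import numpy as np

rng = np.random.default_rng(1)
samples = rng.choice([1, -1], size=200)        # i.i.d. coin flips: +1 (head) or -1 (tail)
running_mean = np.cumsum(samples) / np.arange(1, len(samples) + 1)

print(running_mean[[9, 49, 199]])   # the estimate approaches the true mean E[X] = 0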
Box 5.1: Law of large numbers

For a random variable $X$, suppose that $\{x_i\}_{i=1}^n$ are some i.i.d. samples. Let $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ be the average of the samples. Then,
$$\mathbb{E}[\bar{x}] = \mathbb{E}[X], \qquad \operatorname{var}[\bar{x}] = \frac{1}{n}\operatorname{var}[X].$$
The above two equations indicate that $\bar{x}$ is an unbiased estimate of $\mathbb{E}[X]$ and that its variance decreases to zero as $n$ increases to infinity.

The proof is given below. First, $\mathbb{E}[\bar{x}] = \mathbb{E}\big[\sum_{i=1}^n x_i/n\big] = \sum_{i=1}^n \mathbb{E}[x_i]/n = \mathbb{E}[X]$, where the last equality is due to the fact that the samples are identically distributed (that is, $\mathbb{E}[x_i] = \mathbb{E}[X]$). Second, $\operatorname{var}[\bar{x}] = \operatorname{var}\big[\sum_{i=1}^n x_i/n\big] = \sum_{i=1}^n \operatorname{var}[x_i]/n^2 = (n\cdot\operatorname{var}[X])/n^2 = \operatorname{var}[X]/n$, where the second equality is due to the fact that the samples are independent, and the third equality is a result of the samples being identically distributed (that is, $\operatorname{var}[x_i] = \operatorname{var}[X]$).
5.2 MC Basic: The simplest MC-based algorithm

To develop the first model-free algorithm, recall that each iteration of the policy iteration algorithm introduced in Chapter 4 has two steps: policy evaluation and policy improvement. It must be noted that the action values lie at the core of these two steps. Specifically, in the first step, the state values are calculated for the purpose of calculating the action
values. In the second step, the new policy is generated based on the calculated action
values. Let us reconsider how we can calculate the action values. Two approaches are
available.
The first is a model-based approach. This is the approach adopted by the policy
iteration algorithm. In particular, we can first calculate the state value vπk by solving
the Bellman equation. Then, we can calculate the action values by using
$$q_{\pi_k}(s,a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s'). \quad (5.1)$$
This approach requires the system model {p(r|s,a), p(s'|s,a)} to be known.
The second is a model-free approach. Recall that the definition of an action value is
$$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a],$$
which is the expected return obtained when starting from $(s,a)$.
expectation, it can be estimated by MC methods as demonstrated in Section 5.1. To
do that, starting from (s, a), the agent can interact with the environment by following
policy πk and then obtain a certain number of episodes. Suppose that there are n
episodes and that the return of the $i$th episode is $g_{\pi_k}^{(i)}(s,a)$. Then, $q_{\pi_k}(s,a)$ can be approximated as
$$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{n}\sum_{i=1}^{n} g_{\pi_k}^{(i)}(s,a). \quad (5.2)$$
We already know that, if the number of episodes n is sufficiently large, the approxi-
mation will be sufficiently accurate according to the law of large numbers.
Step 1: Policy evaluation. This step is used to estimate qπk (s, a) for all (s, a). Specif-
ically, for every (s, a), we collect sufficiently many episodes and use the average of the
returns, denoted as qk (s, a), to approximate qπk (s, a).
Step 2: Policy improvement. This step solves $\pi_{k+1}(s) = \arg\max_\pi \sum_a \pi(a|s)\, q_k(s,a)$ for all $s \in \mathcal{S}$. The greedy optimal policy is $\pi_{k+1}(a_k^*|s) = 1$, where $a_k^* = \arg\max_a q_k(s,a)$.
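A minimal Python sketch of the MC Basic idea (illustrative; sample_episode(env, s, a, policy, length) is an assumed helper that returns the list of rewards collected when starting from (s, a) and then following the current policy in some simulator env):

import numpy as np

def mc_basic_iteration(env, policy, n_states, n_actions,
                       episodes_per_pair=50, episode_length=100, gamma=0.9):
    """One iteration of MC Basic: MC policy evaluation followed by greedy improvement."""
    q = np.zeros((n_states, n_actions))
    # Step 1: policy evaluation by averaging sampled returns for every (s, a).
    for s in range(n_states):
        for a in range(n_actions):
            returns = []
            for _ in range(episodes_per_pair):
                rewards = sample_episode(env, s, a, policy, episode_length)  # assumed helper
                g = sum(gamma**t * r for t, r in enumerate(rewards))
                returns.append(g)
            q[s, a] = np.mean(returns)
    # Step 2: policy improvement (greedy with respect to the estimated q-values).
    return q.argmax(axis=1), q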
(A 3×3 grid world with states s1 to s9 is used in the following example; the current policy π0 is evaluated by examining the episodes starting from state s1 under its five actions.)

Starting from (s1, a1), every step of the resulting episode yields a reward of −1, so the action value equals the discounted return of the episode:
$$q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots = \frac{-1}{1-\gamma}.$$
Starting from (s1, a2), the episode is $s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_3} \cdots$. The action value equals the discounted return of the episode:
$$q_{\pi_0}(s_1, a_2) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3(1) + \gamma^4(1) + \cdots = \frac{\gamma^3}{1-\gamma}.$$
Starting from (s1, a3), the episode is $s_1 \xrightarrow{a_3} s_4 \xrightarrow{a_2} s_5 \xrightarrow{a_3} \cdots$. The action value equals the discounted return of the episode:
$$q_{\pi_0}(s_1, a_3) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3(1) + \gamma^4(1) + \cdots = \frac{\gamma^3}{1-\gamma}.$$
Starting from (s1, a4), the episode is $s_1 \xrightarrow{a_4} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. The action value equals the discounted return of the episode:
$$q_{\pi_0}(s_1, a_4) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots = \frac{-1}{1-\gamma}.$$
Starting from (s1, a5), the episode is $s_1 \xrightarrow{a_5} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. The action value equals the discounted return of the episode:
$$q_{\pi_0}(s_1, a_5) = 0 + \gamma(-1) + \gamma^2(-1) + \cdots = \frac{-\gamma}{1-\gamma}.$$
By comparing the five action values, we see that
$$q_{\pi_0}(s_1, a_2) = q_{\pi_0}(s_1, a_3) = \frac{\gamma^3}{1-\gamma} > 0$$
are the maximum values. As a result, the new policy can be obtained as
$$\pi_1(a_2|s_1) = 1 \quad \text{or} \quad \pi_1(a_3|s_1) = 1.$$
(Panels (a)–(h) of Figure 5.4 show the final values and policies obtained with episode lengths 1, 2, 3, 4, 14, 15, 30, and 100, respectively.)
Figure 5.4: The policies and state values obtained by the MC Basic algorithm when given different
episode lengths. Only if the length of each episode is sufficiently long, can the state values be accurately
estimated.
nonzero values, and all the other states have zero values since each episode is too short
to reach the target or get positive rewards (see Figure 5.4(a)). As the episode length
increases, the policy and value estimates gradually approach the optimal ones (see Fig-
ure 5.4(h)).
As the episode length increases, an interesting spatial pattern emerges. That is, the
states that are closer to the target possess nonzero values earlier than those that are
farther away. The reason for this phenomenon is as follows. Starting from a state, the
agent must travel at least a certain number of steps to reach the target state and then
receive positive rewards. If the length of an episode is less than the minimum desired
number of steps, it is certain that the return is zero, and so is the estimated state value. In
this example, the episode length must be no less than 15, which is the minimum number
of steps required to reach the target when starting from the bottom-left state.
While the above analysis suggests that each episode must be sufficiently long, the
episodes are not necessarily infinitely long. As shown in Figure 5.4(g), when the length
is 30, the algorithm can find an optimal policy, although the value estimate is not yet
optimal.
The above analysis is related to an important reward design problem, sparse reward,
which refers to the scenario in which no positive rewards can be obtained unless the target
is reached. The sparse reward setting requires long episodes that can reach the target.
This requirement is challenging to satisfy when the state space is large. As a result,
the sparse reward problem downgrades the learning efficiency. One simple technique for
solving this problem is to design nonsparse rewards. For instance, in the above grid world
example, we can redesign the reward setting so that the agent can obtain a small positive
reward when reaching the states near the target. In this way, an “attractive field” can
be formed around the target so that the agent can find the target more easily. More
information about sparse reward problems can be found in [17–19].
5.3 MC Exploring Starts

We next extend the MC Basic algorithm to use samples more efficiently. Consider the following episode obtained by following a policy:
$$s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_4} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots \quad (5.3)$$
where the subscripts refer to the state or action indexes rather than time steps. Every
time a state-action pair appears in an episode, it is called a visit of that state-action pair.
Different strategies can be employed to utilize the visits.
The first and simplest strategy is to use the initial visit. That is, an episode is only
used to estimate the action value of the initial state-action pair that the episode starts
from. For the example in (5.3), the initial-visit strategy merely estimates the action
value of (s1 , a2 ). The MC Basic algorithm utilizes the initial-visit strategy. However, this
strategy is not sample-efficient because the episode also visits many other state-action
pairs such as (s2 , a4 ), (s2 , a3 ), and (s5 , a1 ). These visits can also be used to estimate the
corresponding action values. In particular, we can decompose the episode in (5.3) into
multiple subepisodes:
$$\begin{aligned}
&s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_4} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots && \text{[original episode]}\\
&s_2 \xrightarrow{a_4} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots && \text{[subepisode starting from } (s_2, a_4)\text{]}\\
&s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots && \text{[subepisode starting from } (s_1, a_2)\text{]}\\
&s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots && \text{[subepisode starting from } (s_2, a_3)\text{]}\\
&s_5 \xrightarrow{a_1} \cdots && \text{[subepisode starting from } (s_5, a_1)\text{]}
\end{aligned}$$
The trajectory generated after the visit of a state-action pair can be viewed as a new
episode. These new episodes can be used to estimate more action values. In this way,
the samples in the episode can be utilized more efficiently.
Moreover, a state-action pair may be visited multiple times in an episode. For exam-
ple, (s1 , a2 ) is visited twice in the episode in (5.3). If we only count the first-time visit,
this is called a first-visit strategy. If we count every visit of a state-action pair, such a
strategy is called every-visit [20].
In terms of sample usage efficiency, the every-visit strategy is the best. If an episode
is sufficiently long such that it can visit all the state-action pairs many times, then this
single episode may be sufficient for estimating all the action values using the every-visit
strategy. However, the samples obtained by the every-visit strategy are correlated because
the trajectory starting from the second visit is merely a subset of the trajectory starting
from the first visit. Nevertheless, the correlation would not be strong if the two visits are
far away from each other in the trajectory.
The first strategy is, in the policy evaluation step, to collect all the episodes starting
from the same state-action pair and then approximate the action value using the
average return of these episodes. This strategy is adopted in the MC Basic algorithm.
Initialization: Initial policy π_0(a|s) and initial value q(s, a) for all (s, a). Returns(s, a) = 0 and Num(s, a) = 0 for all (s, a).
Goal: Search for an optimal policy.
For each episode, do
    Episode generation: Select a starting state-action pair (s_0, a_0) and ensure that all pairs can possibly be selected (this is the exploring-starts condition). Following the current policy, generate an episode of length T: s_0, a_0, r_1, . . . , s_{T−1}, a_{T−1}, r_T.
    Initialization for each episode: g ← 0
    For each step of the episode, t = T − 1, T − 2, . . . , 0, do
        g ← γg + r_{t+1}
        Returns(s_t, a_t) ← Returns(s_t, a_t) + g
        Num(s_t, a_t) ← Num(s_t, a_t) + 1
        Policy evaluation:
            q(s_t, a_t) ← Returns(s_t, a_t)/Num(s_t, a_t)
        Policy improvement:
            π(a|s_t) = 1 if a = arg max_a q(s_t, a), and π(a|s_t) = 0 otherwise
The drawback of this strategy is that the agent must wait until all the episodes have
been collected before the estimate can be updated.
The second strategy, which can overcome this drawback, is to use the return of a
single episode to approximate the corresponding action value. In this way, we can
immediately obtain a rough estimate when we receive an episode. Then, the policy
can be improved in an episode-by-episode fashion.
Since the return of a single episode cannot accurately approximate the corresponding
action value, one may wonder whether the second strategy is good. In fact, this strategy
falls into the scope of generalized policy iteration introduced in the last chapter. That is,
we can still update the policy even if the value estimate is not sufficiently accurate.
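To make the procedure above concrete, here is a minimal Python sketch of MC Exploring Starts with the every-visit strategy. The environment interface (sample_start_pair, step), the action set, and the episode length are hypothetical placeholders; the sketch only illustrates the backward accumulation of returns and the episode-by-episode policy update.

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, actions, gamma=0.9, num_episodes=1000, episode_len=100):
    """A minimal sketch of MC Exploring Starts (every-visit strategy).

    env.sample_start_pair() and env.step(s, a) are assumed interfaces: the first
    returns a random starting (state, action) pair so that every pair can possibly
    be selected; the second returns (reward, next_state).
    """
    q = defaultdict(float)            # q(s, a) estimates
    returns_sum = defaultdict(float)  # Returns(s, a)
    visit_count = defaultdict(int)    # Num(s, a)
    policy = {}                       # greedy policy: state -> action

    for _ in range(num_episodes):
        # Episode generation under the current policy, starting from a random pair.
        s, a = env.sample_start_pair()
        episode = []
        for _ in range(episode_len):
            r, s_next = env.step(s, a)
            episode.append((s, a, r))
            s = s_next
            a = policy.get(s, random.choice(actions))

        # Backward pass: every visit contributes one return sample.
        g = 0.0
        for s_t, a_t, r_next in reversed(episode):
            g = gamma * g + r_next
            returns_sum[(s_t, a_t)] += g
            visit_count[(s_t, a_t)] += 1
            # Policy evaluation
            q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / visit_count[(s_t, a_t)]
            # Policy improvement (greedy)
            policy[s_t] = max(actions, key=lambda act: q[(s_t, act)])
    return policy, q
```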
which is free of such techniques, is introduced first to reveal the core idea of MC-based
reinforcement learning.
The exploring starts condition requires sufficiently many episodes starting from every
state-action pair. Only if every state-action pair is well explored can we accurately estimate its action value (according to the law of large numbers) and hence successfully
find optimal policies. Otherwise, if an action is not well explored, its action value may
be inaccurately estimated, and this action may not be selected by the policy even though
it is indeed the best action. Both MC Basic and MC Exploring Starts require this
condition. However, this condition is difficult to meet in many applications, especially
those involving physical interactions with environments. Can we remove the exploring
starts requirement? The answer is yes, as shown in the next section.
where Π denotes the set of all possible policies. We know that the solution of (5.4) is a greedy policy:

$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*, \\ 0, & a \neq a_k^*, \end{cases}$$
In (5.5), Π_ε denotes the set of all ε-greedy policies with a given value of ε. In this way, we force the policy to be ε-greedy. The solution of (5.5) is

$$\pi_{k+1}(a|s) = \begin{cases} 1 - \dfrac{|A(s)|-1}{|A(s)|}\,\varepsilon, & a = a_k^*, \\[4pt] \dfrac{1}{|A(s)|}\,\varepsilon, & a \neq a_k^*, \end{cases}$$

where a_k^* = arg max_a q_{π_k}(s, a). With the above change, we obtain another algorithm called MC ε-Greedy. The details of this algorithm are given in Algorithm 5.3. Here, the every-visit strategy is employed to better utilize the samples.
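As a quick illustration of this policy improvement step, the following Python sketch turns the action values of one state into the ε-greedy policy that solves the maximization over Π_ε; the action-value dictionary and the value of ε are made up for the example.

```python
def epsilon_greedy_from_q(q_values, epsilon):
    """Return the epsilon-greedy policy (action -> probability) for one state,
    given that state's action values."""
    actions = list(q_values)
    n = len(actions)
    best = max(actions, key=lambda a: q_values[a])
    policy = {a: epsilon / n for a in actions}      # exploratory part for every action
    policy[best] = 1 - (n - 1) / n * epsilon        # greedy action gets the remaining mass
    return policy

# Hypothetical action values for one state with five actions.
print(epsilon_greedy_from_q({"up": 0.1, "down": -0.5, "left": 0.0,
                             "right": 0.7, "stay": 0.2}, epsilon=0.1))
```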
Algorithm 5.3: MC ε-Greedy

Initialization: Initial policy π_0(a|s) and initial value q(s, a) for all (s, a). Returns(s, a) = 0 and Num(s, a) = 0 for all (s, a). ε ∈ (0, 1].
Goal: Search for an optimal policy.
For each episode, do
    Episode generation: Select a starting state-action pair (s_0, a_0) (the exploring starts condition is not required). Following the current policy, generate an episode of length T: s_0, a_0, r_1, . . . , s_{T−1}, a_{T−1}, r_T.
    Initialization for each episode: g ← 0
    For each step of the episode, t = T − 1, T − 2, . . . , 0, do
        g ← γg + r_{t+1}
        Returns(s_t, a_t) ← Returns(s_t, a_t) + g
        Num(s_t, a_t) ← Num(s_t, a_t) + 1
        Policy evaluation:
            q(s_t, a_t) ← Returns(s_t, a_t)/Num(s_t, a_t)
        Policy improvement:
            Let a^* = arg max_a q(s_t, a) and
            π(a|s_t) = 1 − ((|A(s_t)| − 1)/|A(s_t)|) ε if a = a^*, and π(a|s_t) = ε/|A(s_t)| otherwise
If greedy policies are replaced by ε-greedy policies in the policy improvement step, can we still guarantee to obtain optimal policies? The answer is both yes and no. By yes, we mean that, when given sufficient samples, the algorithm can converge to an ε-greedy policy that is optimal in the set Π_ε. By no, we mean that the policy is merely optimal in Π_ε but may not be optimal in Π. However, if ε is sufficiently small, the optimal policies in Π_ε are close to those in Π.
Figure 5.5: The evolution process of the MC ε-Greedy algorithm based on single episodes. (a) Initial policy. (b) After the first iteration. (c) After the second iteration.
We next show that the optimality of ε-greedy policies becomes worse when ε increases. First, a greedy optimal policy and the corresponding optimal state values are shown in Figure 5.6(a). The state values of some consistent ε-greedy policies are shown in Figures 5.6(b)-(d). Here, two ε-greedy policies are consistent if the actions with the greatest probabilities in the policies are the same.

Figure 5.6: The state values of some ε-greedy policies. These ε-greedy policies are consistent with each other in the sense that the actions with the greatest probabilities are the same. It can be seen that, when the value of ε increases, the state values of the ε-greedy policies decrease and hence their optimality becomes worse.

Figure 5.7: The optimal ε-greedy policies and their corresponding state values under different values of ε: (b) ε = 0.1; (c) ε = 0.2; (d) ε = 0.5. Here, these ε-greedy policies are optimal among all ε-greedy ones (with the same value of ε). It can be seen that, when the value of ε increases, the optimal ε-greedy policies are no longer consistent with the optimal one as in (a).
As the value of ε increases, the state values of the ε-greedy policies decrease, indicating that the optimality of these ε-greedy policies becomes worse. Notably, the value of the target state becomes the smallest when ε is as large as 0.5. This is because, when ε is large, the agent starting from the target area may enter the surrounding forbidden areas and hence receive negative rewards with a higher probability.

Second, Figure 5.7 shows the optimal ε-greedy policies (they are optimal in Π_ε). When ε = 0, the policy is greedy and optimal among all policies. When ε is as small as 0.1, the optimal ε-greedy policy is consistent with the optimal greedy one. However, when ε increases to, for example, 0.2, the obtained ε-greedy policies are not consistent with the optimal greedy one. Therefore, if we want to obtain ε-greedy policies that are consistent with the optimal greedy ones, the value of ε should be sufficiently small.

Why are the ε-greedy policies inconsistent with the optimal greedy one when ε is large? We can answer this question by considering the target state. In the greedy case, the optimal policy at the target state is to stay unchanged to gain positive rewards. However, when ε is large, there is a high chance of entering the forbidden areas and receiving negative rewards. Therefore, the optimal policy at the target state in this case is to escape instead of staying unchanged.

We next illustrate that the exploration ability of an ε-greedy policy is strong when ε is large.

First, consider an ε-greedy policy with ε = 1 (see Figure 5.5(a)). In this case, the exploration ability of the ε-greedy policy is strong since it has a 0.2 probability of taking any action at any state. Starting from (s_1, a_1), an episode generated by the ε-greedy policy is given in Figures 5.8(a)-(c). It can be seen that this single episode can visit all the state-action pairs many times when the episode is sufficiently long due to the strong exploration ability of the policy. Moreover, the numbers of times that all the state-action pairs are visited are almost even, as shown in Figure 5.8(d).

Second, consider an ε-greedy policy with ε = 0.5 (see Figure 5.6(d)). In this case, the ε-greedy policy has a weaker exploration ability than in the case of ε = 1. Starting from (s_1, a_1), an episode generated by the ε-greedy policy is given in Figures 5.8(e)-(g). Although every action can still be visited when the episode is sufficiently long, the distribution of the number of visits may be extremely uneven. For example, given an episode with one million steps, some actions are visited more than 250,000 times, while most actions are visited merely hundreds or even tens of times, as shown in Figure 5.8(h).

The above examples demonstrate that the exploration abilities of ε-greedy policies decrease when ε decreases. One useful technique is to initially set ε to be large to enhance
exploration and gradually reduce it to ensure the optimality of the final policy [21–23].
5.6 Summary
The algorithms in this chapter are the first model-free reinforcement learning algorithms
ever introduced in this book. We first introduced the idea of MC estimation by exam-
ining an important mean estimation problem. Then, three MC-based algorithms were
introduced.
MC Basic: This is the simplest MC-based reinforcement learning algorithm. This al-
gorithm is obtained by replacing the model-based policy evaluation step in the policy
iteration algorithm with a model-free MC-based estimation component. Given suffi-
cient samples, it is guaranteed that this algorithm can converge to optimal policies
and optimal state values.
MC Exploring Starts: This algorithm is a variant of MC Basic. It is obtained from the MC Basic algorithm by adopting the first-visit or every-visit strategy so that samples are used more efficiently.
MC ε-Greedy: This algorithm is a variant of MC Exploring Starts. Specifically, in the policy improvement step, it searches for the best ε-greedy policies instead of greedy
policies. In this way, the exploration ability of the policy is enhanced and hence the
condition of exploring starts can be removed.
5.7 Q&A
Q: What is Monte Carlo estimation?
A: Monte Carlo estimation refers to a broad class of techniques that use stochastic
samples to solve approximation problems.
Q: What is the mean estimation problem?
A: The mean estimation problem refers to calculating the expected value of a random
variable based on stochastic samples.
Q: How to solve the mean estimation problem?
A: There are two approaches: model-based and model-free. In particular, if the proba-
bility distribution of a random variable is known, the expected value can be calculated
based on its definition. If the probability distribution is unknown, we can use Monte
Carlo estimation to approximate the expected value. Such an approximation is accu-
rate when the number of samples is large.
Q: Why is the mean estimation problem important for reinforcement learning?
A: Both state and action values are defined as expected values of returns. Hence,
estimating state or action values is essentially a mean estimation problem.
Q: What is the core idea of model-free MC-based reinforcement learning?
A: The core idea is to convert the policy iteration algorithm to a model-free one.
In particular, while the policy iteration algorithm aims to calculate values based on
the system model, MC-based reinforcement learning replaces the model-based policy
evaluation step in the policy iteration algorithm with a model-free MC-based policy
evaluation step.
Q: What are initial-visit, first-visit, and every-visit strategies?
A: They are different strategies for utilizing the samples in an episode. An episode
may visit many state-action pairs. The initial-visit strategy uses the entire episode to
estimate the action value of the initial state-action pair. The every-visit and first-visit
strategies can better utilize the given samples. If the rest of the episode is used to
estimate the action value of a state-action pair every time it is visited, such a strategy
is called every-visit. If we only count the first time a state-action pair is visited in the
episode, such a strategy is called first-visit.
Q: What is exploring starts? Why is it important?
A: Exploring starts requires an infinite number of (or sufficiently many) episodes to be
generated when starting from every state-action pair. In theory, the exploring starts
condition is necessary to find optimal policies. That is, only if every action value is
well explored, can we accurately evaluate all the actions and then correctly select the
optimal ones.
Q: What is the idea used to avoid exploring starts?
A: The fundamental idea is to make policies soft. Soft policies are stochastic, enabling
an episode to visit many state-action pairs. In this way, we do not need a large number
of episodes starting from every state-action pair.
Q: Can an ε-greedy policy be optimal?
A: The answer is both yes and no. By yes, we mean that, if given sufficient samples, the MC ε-Greedy algorithm can converge to an optimal ε-greedy policy. By no, we mean that the converged policy is merely optimal among all ε-greedy policies (with the same value of ε).
Q: Is it possible to use one episode to visit all state-action pairs?
A: Yes, it is possible if the policy is soft (e.g., ε-greedy) and the episode is sufficiently long.
Q: What is the relationship between MC Basic, MC Exploring Starts, and MC ε-Greedy?
A: MC Basic is the simplest MC-based reinforcement learning algorithm. It is important because it reveals the fundamental idea of model-free MC-based reinforcement learning. MC Exploring Starts is a variant of MC Basic that adjusts the sample usage strategy. Furthermore, MC ε-Greedy is a variant of MC Exploring Starts that removes the exploring starts requirement. Therefore, while the basic idea is simple, complications appear when we want to achieve better performance. It is important to separate the core idea from the complications that may be distracting for beginners.
Chapter 6
Stochastic Approximation
[Figure: the position of this chapter in the book's roadmap, linking the fundamental tools of the preceding chapters (Basic Concepts, Bellman Equation, Bellman Optimality Equation) to Chapter 7 (Temporal-Difference Methods) and Chapter 8 (Value Function Approximation), i.e., the move from tabular to function representations.]
non-incremental, but the algorithms that we will study in the subsequent chapters are
incremental.
We use the present chapter to fill this knowledge gap by introducing the basics of
stochastic approximation. Although this chapter does not introduce any specific rein-
forcement learning algorithms, it lays the necessary foundations for studying subsequent chapters. We will see in Chapter 7 that the temporal-difference algorithms can be
viewed as special stochastic approximation algorithms. The well-known stochastic gradi-
ent descent algorithms widely used in machine learning are also introduced in the present
chapter.
The approximation in (6.1) is the basic idea of Monte Carlo estimation, as introduced in
Chapter 5. We know that x̄ → E[X] as n → ∞ according to the law of large numbers.
We next show that two methods can be used to calculate x̄ in (6.1). The first non-
incremental method collects all the samples first and then calculates the average. The
drawback of such a method is that, if the number of samples is large, we may have to
wait for a long time until all of the samples are collected. The second method can avoid
this drawback because it calculates the average in an incremental manner. Specifically,
suppose that

$$w_{k+1} \doteq \frac{1}{k}\sum_{i=1}^{k} x_i, \qquad k = 1, 2, \ldots$$

and hence,

$$w_k = \frac{1}{k-1}\sum_{i=1}^{k-1} x_i, \qquad k = 2, 3, \ldots$$

Then, w_{k+1} can be expressed in terms of w_k as

$$w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i = \frac{1}{k}\left(\sum_{i=1}^{k-1} x_i + x_k\right) = \frac{1}{k}\big((k-1)w_k + x_k\big) = w_k - \frac{1}{k}(w_k - x_k).$$
$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k). \qquad (6.2)$$
This algorithm can be used to calculate the mean x̄ in an incremental manner. It can be
verified that
$$\begin{aligned}
w_1 &= x_1,\\
w_2 &= w_1 - \frac{1}{1}(w_1 - x_1) = x_1,\\
w_3 &= w_2 - \frac{1}{2}(w_2 - x_2) = x_1 - \frac{1}{2}(x_1 - x_2) = \frac{1}{2}(x_1 + x_2),\\
w_4 &= w_3 - \frac{1}{3}(w_3 - x_3) = \frac{1}{3}(x_1 + x_2 + x_3),\\
&\;\;\vdots\\
w_{k+1} &= \frac{1}{k}\sum_{i=1}^{k} x_i. \qquad (6.3)
\end{aligned}$$
The advantage of (6.2) is that the average can be immediately calculated every time we
receive a sample. This average can be used to approximate x̄ and hence E[X]. Notably,
the approximation may not be accurate at the beginning due to insufficient samples.
However, it is better than nothing. As more samples are obtained, the estimation accuracy
can be gradually improved according to the law of large numbers. In addition, one can also define $w_{k+1} \doteq \frac{1}{1+k}\sum_{i=1}^{k+1} x_i$ and $w_k \doteq \frac{1}{k}\sum_{i=1}^{k} x_i$. Doing so would not make any significant difference. In this case, the corresponding iterative algorithm is $w_{k+1} = w_k - \frac{1}{1+k}(w_k - x_{k+1})$.
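As a minimal numerical check of the incremental update in (6.2), the following Python sketch compares the running iterate with the batch average; the sample distribution is made up for the example.

```python
import random

def incremental_mean(samples):
    """Running average via w_{k+1} = w_k - (1/k)(w_k - x_k)."""
    w = 0.0
    history = []
    for k, x_k in enumerate(samples, start=1):
        w = w - (1.0 / k) * (w - x_k)   # incremental update (6.2)
        history.append(w)
    return history

samples = [random.gauss(5.0, 2.0) for _ in range(1000)]       # hypothetical data, E[X] = 5
running = incremental_mean(samples)
batch = [sum(samples[:k]) / k for k in range(1, len(samples) + 1)]
print(max(abs(a - b) for a, b in zip(running, batch)))        # ~0 up to floating-point error
print(running[-1])                                             # close to 5 for large k
```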
Furthermore, consider an algorithm with a more general expression:

$$w_{k+1} = w_k - \alpha_k (w_k - x_k). \qquad (6.4)$$
This algorithm is important and frequently used in this chapter. It is the same as (6.2)
except that the coefficient 1/k is replaced by αk > 0. Since the expression of αk is not
given, we are not able to obtain the explicit expression of wk as in (6.3). However, we
will show in the next section that, if {αk } satisfies some mild conditions, wk → E[X] as
k → ∞. In Chapter 7, we will see that temporal-difference algorithms have similar (but
more complex) expressions.
g(w) = 0,
g̃(w, η) = g(w) + η,
where η ∈ R is the observation error, which may or may not be Gaussian. In summary,
it is a black-box system where only the input w and the noisy output g̃(w, η) are known
(see Figure 6.2). Our aim is to solve g(w) = 0 using w and g̃.
The Robbins-Monro (RM) algorithm for solving this problem is

$$w_{k+1} = w_k - a_k \tilde g(w_k, \eta_k), \qquad k = 1, 2, 3, \ldots$$

where w_k is the kth estimate of the root, g̃(w_k, η_k) is the kth noisy observation, and a_k is
a positive coefficient. As can be seen, the RM algorithm does not require any information
about the function. It only requires the input and output.
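As an illustration, here is a small Python sketch of the RM iteration applied to a made-up black-box function g(w) = tanh(w − 1) observed through additive noise; the function, noise level, step sizes, and initial guess are all hypothetical.

```python
import math
import random

def robbins_monro(g_tilde, w0, num_iters=500):
    """RM iteration w_{k+1} = w_k - a_k * g_tilde(w_k) with a_k = 1/k."""
    w = w0
    for k in range(1, num_iters + 1):
        a_k = 1.0 / k                 # satisfies sum a_k = inf and sum a_k^2 < inf
        w = w - a_k * g_tilde(w)      # only the noisy output is used, not g itself
    return w

# Hypothetical black box: g(w) = tanh(w - 1) plus zero-mean observation noise.
noisy_g = lambda w: math.tanh(w - 1.0) + random.gauss(0.0, 0.1)
print(robbins_monro(noisy_g, w0=3.0))  # approaches the true root w* = 1
```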
[Figure: the estimated root w_k and the observation noise plotted against the iteration index k.]
When wk > w∗ , we have g(wk ) > 0. Then, wk+1 = wk − ak g(wk ) < wk . If ak g(wk ) is
sufficiently small, we have w∗ < wk+1 < wk . As a result, wk+1 is closer to w∗ than wk .
When wk < w∗ , we have g(wk ) < 0. Then, wk+1 = wk − ak g(wk ) > wk . If |ak g(wk )| is
sufficiently small, we have w∗ > wk+1 > wk . As a result, wk+1 is closer to w∗ than wk .
[Figure: the function g(w) and the successive iterates w_1, w_2, w_3, w_4, . . . approaching its root.]
The above example is simple since the observation error is assumed to be zero. It
would be nontrivial to analyze the convergence in the presence of stochastic observation
errors. A rigorous convergence result is given below.
We postpone the proof of this theorem to Section 6.3.3. This theorem relies on the
notion of almost sure convergence, which is introduced in Appendix B.
The three conditions in Theorem 6.1 are explained as follows.
In the first condition, 0 < c1 ≤ ∇w g(w) indicates that g(w) is a monotonically increas-
ing function. This condition ensures that the root of g(w) = 0 exists and is unique. If
g(w) is monotonically decreasing, we can simply treat −g(w) as a new function that
is monotonically increasing.
As an application, we can formulate an optimization problem in which the objective function is J(w) as a root-finding problem: $g(w) \doteq \nabla_w J(w) = 0$. In this case, the
condition that g(w) is monotonically increasing indicates that J(w) is convex, which
is a commonly adopted assumption in optimization problems.
The inequality ∇w g(w) ≤ c2 indicates that the gradient of g(w) is bounded from
above. For example, g(w) = tanh(w − 1) satisfies this condition, but g(w) = w3 − 5
does not.
The second condition about {a_k} is interesting. We often see conditions like this in reinforcement learning algorithms. In particular, the condition $\sum_{k=1}^{\infty} a_k^2 < \infty$ means that $\lim_{n\to\infty}\sum_{k=1}^{n} a_k^2$ is bounded from above. It requires that a_k converges to zero as k → ∞. The condition $\sum_{k=1}^{\infty} a_k = \infty$ means that $\lim_{n\to\infty}\sum_{k=1}^{n} a_k$ is infinitely large. It requires that a_k should not converge to zero too fast. These conditions have interesting properties, which will be analyzed in detail shortly.

The third condition is mild. It does not require the observation error η_k to be Gaussian. An important special case is that {η_k} is an i.i.d. stochastic sequence satisfying E[η_k] = 0 and E[η_k^2] < ∞. In this case, the third condition is valid because η_k is independent of H_k and hence we have E[η_k | H_k] = E[η_k] = 0 and E[η_k^2 | H_k] = E[η_k^2].
We next examine the second condition about the coefficients {a_k} more closely. Why is the second condition important for the convergence of the RM algorithm? This question can naturally be answered when we present a rigorous proof of the above theorem later. Here, we would like to provide some insightful intuition.

First, $\sum_{k=1}^{\infty} a_k^2 < \infty$ indicates that a_k → 0 as k → ∞. Why is this condition important? Suppose that the observation g̃(w_k, η_k) is always bounded. Since w_{k+1} − w_k = −a_k g̃(w_k, η_k), if a_k → 0, then a_k g̃(w_k, η_k) → 0 and hence the increment w_{k+1} − w_k vanishes, so the estimate can eventually settle down.

Second, $\sum_{k=1}^{\infty} a_k = \infty$ indicates that a_k should not converge to zero too fast. Summing the increments w_{k+1} − w_k = −a_k g̃(w_k, η_k) over all k gives $w_\infty - w_1 = -\sum_{k=1}^{\infty} a_k \tilde g(w_k, \eta_k)$. If $\sum_{k=1}^{\infty} a_k < \infty$, then $\big|\sum_{k=1}^{\infty} a_k \tilde g(w_k, \eta_k)\big|$ is also bounded. Let b denote the finite upper bound such that

$$|w_1 - w_\infty| = \left|\sum_{k=1}^{\infty} a_k \tilde g(w_k, \eta_k)\right| \le b. \qquad (6.6)$$

If the initial guess w_1 is selected far away from w^* so that |w_1 − w^*| > b, then it is impossible to have w_∞ = w^* according to (6.6). This suggests that the RM algorithm cannot find the true solution w^* in this case. Therefore, the condition $\sum_{k=1}^{\infty} a_k = \infty$ is necessary to ensure convergence given an arbitrary initial guess.
What kinds of sequences satisfy $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$? One typical sequence is

$$\alpha_k = \frac{1}{k}.$$

On the one hand, it holds that

$$\lim_{n\to\infty}\left(\sum_{k=1}^{n}\frac{1}{k} - \ln n\right) = \kappa,$$

where κ ≈ 0.577 is called the Euler-Mascheroni constant (or Euler's constant) [28]. Since ln n → ∞ as n → ∞, we have

$$\sum_{k=1}^{\infty}\frac{1}{k} = \infty.$$

In fact, $H_n = \sum_{k=1}^{n}\frac{1}{k}$ is called the harmonic number in number theory [29]. On the other hand, it holds that $\sum_{k=1}^{\infty}\frac{1}{k^2} < \infty$.
$$w_{k+1} = w_k + \alpha_k (x_k - w_k)$$

is the mean estimation algorithm in (6.4). When α_k = 1/k, we can obtain the analytical expression of w_{k+1} as $w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i$. However, we would not be able to obtain an analytical expression when given general values of α_k. In this case, the convergence analysis is nontrivial. We can show that the algorithm in this case is a special RM
$$g(w) \doteq w - E[X].$$

The original problem is to obtain the value of E[X]. This problem is formulated as a root-finding problem to solve g(w) = 0. Given a value of w, the noisy observation that we can obtain is $\tilde g \doteq w - x$, where x is a sample of X. Note that g̃ can be written as

$$\tilde g(w, \eta) = w - x = w - x + E[X] - E[X] = (w - E[X]) + (E[X] - x) \doteq g(w) + \eta,$$

where $\eta \doteq E[X] - x$.
The RM algorithm for solving this problem is

$$w_{k+1} = w_k - \alpha_k \tilde g(w_k, \eta_k) = w_k - \alpha_k (w_k - x_k),$$

which is exactly the algorithm in (6.4). As a result, it is guaranteed by Theorem 6.1 that w_k converges to E[X] almost surely if $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and {x_k} is i.i.d. It is worth mentioning that the convergence property does not rely on any assumption regarding the distribution of X.
Theorem (Dvoretzky's convergence theorem). Consider the stochastic process

$$\Delta_{k+1} = (1 - \alpha_k)\Delta_k + \beta_k \eta_k,$$

where $\{\alpha_k\}_{k=1}^{\infty}$, $\{\beta_k\}_{k=1}^{\infty}$, and $\{\eta_k\}_{k=1}^{\infty}$ are stochastic sequences. Here, α_k ≥ 0 and β_k ≥ 0 for all k. Then, Δ_k converges to zero almost surely if the following conditions are satisfied:
(a) $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and $\sum_{k=1}^{\infty} \beta_k^2 < \infty$ uniformly almost surely;
(b) $E[\eta_k | H_k] = 0$ and $E[\eta_k^2 | H_k] \le C$ almost surely;

where $H_k = \{\Delta_k, \Delta_{k-1}, \ldots, \eta_{k-1}, \ldots, \alpha_{k-1}, \ldots, \beta_{k-1}, \ldots\}$.
Before presenting the proof of this theorem, we first clarify some issues.
$$E[h_{k+1} - h_k | H_k] = E[-\alpha_k(2-\alpha_k)\Delta_k^2 | H_k] + E[\beta_k^2 \eta_k^2 | H_k] + E[2(1-\alpha_k)\beta_k \eta_k \Delta_k | H_k]. \qquad (6.7)$$
First, since ∆k is included and hence determined by Hk , it can be taken out from
the expectation (see property (e) in Lemma B.1). Second, consider the simple case
where α_k and β_k are determined by H_k. This case is valid when, for example, {α_k} and {β_k} are functions of Δ_k or deterministic sequences. Then, they can also be taken out of the expectation. Therefore, (6.7) becomes

$$E[h_{k+1} - h_k | H_k] = -\alpha_k(2-\alpha_k)\Delta_k^2 + \beta_k^2 E[\eta_k^2 | H_k] + 2(1-\alpha_k)\beta_k \Delta_k E[\eta_k | H_k], \qquad (6.8)$$

and hence,

$$\sum_{k=1}^{\infty} E[h_{k+1} - h_k | H_k] \le \sum_{k=1}^{\infty} \beta_k^2 C < \infty.$$

The last inequality is due to the condition $\sum_{k=1}^{\infty} \beta_k^2 < \infty$. Then, based on the quasimartingale convergence theorem in Appendix C, we conclude that h_k converges almost surely.
While we now know that h_k is convergent and so is Δ_k, we next determine what value Δ_k converges to. It follows from (6.9) that

$$\sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2 = \sum_{k=1}^{\infty} \beta_k^2 E[\eta_k^2 | H_k] - \sum_{k=1}^{\infty} E[h_{k+1} - h_k | H_k].$$

The first term on the right-hand side is bounded as assumed. The second term is also bounded because h_k converges and hence h_{k+1} − h_k is summable. Thus, $\sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2$ on the left-hand side is also bounded. Since we consider the case of α_k ≤ 1, we have

$$\infty > \sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2 \ge \sum_{k=1}^{\infty} \alpha_k \Delta_k^2 \ge 0.$$

Therefore, $\sum_{k=1}^{\infty} \alpha_k \Delta_k^2$ is bounded. Since $\sum_{k=1}^{\infty} \alpha_k = \infty$, we must have Δ_k → 0 almost surely.
$$w_{k+1} - w^* = w_k - w^* + \alpha_k(x_k - w^* + w^* - w_k).$$

Let $\Delta_k \doteq w_k - w^*$. Then, we have

$$\Delta_{k+1} = \Delta_k + \alpha_k(x_k - w^* - \Delta_k) = (1-\alpha_k)\Delta_k + \alpha_k\underbrace{(x_k - w^*)}_{\eta_k}.$$

Since {x_k} is i.i.d., we have E[x_k | H_k] = E[x_k] = w^*. As a result, E[η_k | H_k] = E[x_k − w^* | H_k] = 0 and E[η_k^2 | H_k] = E[x_k^2 | H_k] − (w^*)^2 = E[x_k^2] − (w^*)^2, which are bounded if the variance of x_k is finite. Following Dvoretzky's theorem, we conclude that Δ_k converges to zero and hence w_k converges to w^* = E[X] almost surely.
$$w_{k+1} = w_k - a_k \tilde g(w_k, \eta_k) = w_k - a_k[g(w_k) + \eta_k].$$

Then, we have

$$w_{k+1} - w^* = w_k - w^* - a_k[g(w_k) + \eta_k] = w_k - w^* - a_k[g(w_k) - g(w^*) + \eta_k],$$

where the last equality uses the fact that g(w^*) = 0. Due to the mean value theorem [7, 8], we have $g(w_k) - g(w^*) = \nabla_w g(w_k')(w_k - w^*)$,
where $w_k' \in [w_k, w^*]$. Let $\Delta_k \doteq w_k - w^*$. The above equation becomes

$$\Delta_{k+1} = \big(1 - a_k \nabla_w g(w_k')\big)\Delta_k - a_k \eta_k.$$
Theorem 6.3. Consider a finite set S of real numbers. For the stochastic process

$$\Delta_{k+1}(s) = \big(1 - \alpha_k(s)\big)\Delta_k(s) + \beta_k(s)\eta_k(s), \qquad s \in S,$$

it holds that Δ_k(s) converges to zero almost surely for every s ∈ S if the following conditions are satisfied for all s ∈ S:

(a) $\sum_k \alpha_k(s) = \infty$, $\sum_k \alpha_k^2(s) < \infty$, $\sum_k \beta_k^2(s) < \infty$, and $E[\beta_k(s) | H_k] \le E[\alpha_k(s) | H_k]$ uniformly almost surely;
(b) $\|E[\eta_k(s) | H_k]\|_\infty \le \gamma \|\Delta_k\|_\infty$, where γ ∈ (0, 1);
(c) $\mathrm{var}[\eta_k(s) | H_k] \le C(1 + \|\Delta_k(s)\|_\infty)^2$, where C is a constant.

Here, $H_k = \{\Delta_k, \Delta_{k-1}, \ldots, \eta_{k-1}, \ldots, \alpha_{k-1}, \ldots, \beta_{k-1}, \ldots\}$ represents the historical information. The term $\|\cdot\|_\infty$ refers to the maximum norm.
We first clarify some notations in the theorem. The variable s can be viewed as an index. In the context of reinforcement learning, it indicates a state or a state-action pair. The maximum norm ‖·‖_∞ is defined over a set. It is similar to but different from the L_∞ norm of vectors. In particular, $\|E[\eta_k(s)|H_k]\|_\infty \doteq \max_{s\in S}|E[\eta_k(s)|H_k]|$ and $\|\Delta_k(s)\|_\infty \doteq \max_{s\in S}|\Delta_k(s)|$.
This theorem is more general than Dvoretzky’s theorem. First, it can handle the
case of multiple variables due to the maximum norm operations. This is important
for a reinforcement learning problem where there are multiple states. Second,
while Dvoretzky’s theorem requires E[ηk (s)|Hk ] = 0 and var[ηk (s)|Hk ] ≤ C, this
theorem only requires that the expectation and variance are bounded by the error
∆k .
It should be noted that the convergence of ∆(s) for all s ∈ S requires that the
conditions are valid for every s ∈ S. Therefore, when applying this theorem to
prove the convergence of reinforcement learning algorithms, we need to show that
the conditions are valid for every state (or state-action pair).
This gradient descent algorithm can find the optimal solution w∗ under some mild con-
ditions such as the convexity of f . Preliminaries about gradient descent algorithms can
be found in Appendix D.
The gradient descent algorithm requires the expected value E[∇w f (wk , X)]. One
way to obtain the expected value is based on the probability distribution of X. The
distribution is, however, often unknown in practice. Another way is to collect a large
number of i.i.d. samples {xi }ni=1 of X so that the expected value can be approximated as
$$E[\nabla_w f(w_k, X)] \approx \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$
One problem of the algorithm in (6.12) is that it requires all the samples in each iteration.
In practice, if the samples are collected one by one, then it is favorable to update w every
time a sample is collected. To that end, we can use the following algorithm:

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k), \qquad (6.13)$$

where x_k is the sample collected at time step k. This is the well-known stochastic gradient
descent algorithm. This algorithm is called “stochastic” because it relies on stochastic
samples {xk }.
Compared to the gradient descent algorithm in (6.11), SGD replaces the true gradient E[∇_w f(w, X)] with the stochastic gradient ∇_w f(w_k, x_k). Since ∇_w f(w_k, x_k) ≠ E[∇_w f(w, X)], can such a replacement still ensure w_k → w^* as k → ∞? The answer is yes. We next present an intuitive explanation and postpone the rigorous proof of the convergence to Section 6.4.5.

In particular, since

$$\nabla_w f(w_k, x_k) = E[\nabla_w f(w, X)] + \big(\nabla_w f(w_k, x_k) - E[\nabla_w f(w, X)]\big) \doteq E[\nabla_w f(w, X)] + \eta_k,$$

the SGD algorithm is the same as the regular gradient descent algorithm except that it has a perturbation term α_k η_k. Since {x_k} is i.i.d., we have $E_{x_k}[\nabla_w f(w_k, x_k)] = E_X[\nabla_w f(w, X)]$. As a result,

$$E[\eta_k] = E\big[\nabla_w f(w_k, x_k) - E[\nabla_w f(w, X)]\big] = E_{x_k}[\nabla_w f(w_k, x_k)] - E_X[\nabla_w f(w, X)] = 0.$$

Therefore, the perturbation term η_k has a zero mean, which intuitively suggests that it may not jeopardize the convergence property. A rigorous proof of the convergence of
$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k E[\nabla_w f(w_k, X)] = w_k - \alpha_k E[w_k - X].$$

This gradient descent algorithm is not applicable since E[w_k − X] or E[X] on the right-hand side is unknown (in fact, it is what we need to solve).

The SGD algorithm for solving (6.14) is

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k) = w_k - \alpha_k (w_k - x_k),$$

where x_k is a sample obtained at time step k. Notably, this SGD algorithm is the same as the iterative mean estimation algorithm in (6.4). Therefore, (6.4) is an SGD algorithm designed specifically for solving the mean estimation problem.
Analysis: The relative error between the stochastic and true gradients is

$$\delta_k \doteq \frac{|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]|}{|E[\nabla_w f(w_k, X)]|}.$$

For the sake of simplicity, we consider the case where w and ∇_w f(w, x) are both scalars. Since w^* is the optimal solution, it holds that E[∇_w f(w^*, X)] = 0. Then, the relative error can be rewritten as

$$\delta_k = \frac{|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]|}{|E[\nabla_w f(w_k, X)] - E[\nabla_w f(w^*, X)]|} = \frac{|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]|}{|E[\nabla_w^2 f(\tilde w_k, X)(w_k - w^*)]|}, \qquad (6.15)$$

where the last equality is due to the mean value theorem [7, 8] and $\tilde w_k \in [w_k, w^*]$. Suppose that f is strictly convex such that $\nabla_w^2 f \ge c > 0$ for all w, X. Then, the denominator in (6.15) satisfies $|E[\nabla_w^2 f(\tilde w_k, X)(w_k - w^*)]| = |E[\nabla_w^2 f(\tilde w_k, X)]|\,|w_k - w^*| \ge c\,|w_k - w^*|$, and hence

$$\delta_k \le \frac{|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]|}{c\,|w_k - w^*|}.$$
The above inequality suggests an interesting convergence pattern of SGD: the relative
error δk is inversely proportional to |wk − w∗ |. As a result, when |wk − w∗ | is large, δk
is small. In this case, the SGD algorithm behaves like the gradient descent algorithm
and hence wk quickly converges to w∗ . When wk is close to w∗ , the relative error δk
may be large, and the convergence exhibits more randomness.
Example: A good example for demonstrating the above analysis is the mean estimation problem. Consider the mean estimation problem in (6.14). When w and X are both scalars, we have $f(w, X) = |w - X|^2/2$ and hence

$$\nabla_w f(w, x_k) = w - x_k, \qquad E[\nabla_w f(w, X)] = w - E[X] = w - w^*.$$
The expression of the relative error clearly shows that δk is inversely proportional to
Figure 6.5: An example for demonstrating stochastic and mini-batch gradient descent algorithms. The distribution of X ∈ R² is uniform in the square area centered at the origin with a side length of 20. The mean is E[X] = 0. The mean estimation is based on 100 i.i.d. samples. The left panel shows the samples and the estimates in the plane; the right panel shows the distance to the mean versus the iteration step for SGD (m = 1) and MBGD (m = 5, 50).
|wk − w∗ |. As a result, when wk is far from w∗ , the relative error is small, and SGD
behaves like gradient descent. In addition, since δk is proportional to |E[X] − xk |, the
mean of δk is proportional to the variance of X.
The simulation results are shown in Figure 6.5. Here, X ∈ R2 represents a random
position in the plane. Its distribution is uniform in the square area centered at the
origin and E[X] = 0. The mean estimation is based on 100 i.i.d. samples. Although
the initial guess of the mean is far away from the true value, it can be seen that the
SGD estimate quickly approaches the neighborhood of the origin. When the estimate
is close to the origin, the convergence process exhibits certain randomness.
Suppose that the set {xi }ni=1 is large and we can only fetch a single number each time.
It must be noted that xk here is the number fetched at time step k instead of the kth
element in the set {xi }ni=1 .
The algorithm in (6.16) is very similar to SGD, but its problem formulation is subtly
different because it does not involve any random variables or expected values. Then,
many questions arise. For example, is this algorithm SGD? How should we use the finite
set of numbers {xi }ni=1 ? Should we sort these numbers in a certain order and then use
them one by one, or should we randomly sample a number from the set?
A quick answer to the above questions is that, although no random variables are
involved in the above formulation, we can convert the deterministic formulation to the
stochastic formulation by introducing a random variable. In particular, let X be a random
variable defined on the set {xi }ni=1 . Suppose that its probability distribution is uniform
such that p(X = xi ) = 1/n. Then, the deterministic optimization problem becomes a
stochastic one:

$$\min_w J(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i) = E[f(w, X)].$$
The last equality in the above equation is strict instead of approximate. Therefore, the
algorithm in (6.16) is SGD, and the estimate converges if xk is uniformly and indepen-
dently sampled from {xi }ni=1 . Note that xk may repeatedly take the same number in
{xi }ni=1 since it is sampled randomly.
In the BGD algorithm, all the samples are used in every iteration. When n is large, $(1/n)\sum_{i=1}^{n}\nabla_w f(w_k, x_i)$ is close to the true gradient E[∇_w f(w_k, X)]. In the MBGD
algorithm, I_k is a subset of {1, . . . , n} obtained at time k. The size of the set is |I_k| = m.
The samples in Ik are also assumed to be i.i.d. In the SGD algorithm, xk is randomly
sampled from {xi }ni=1 at time k.
MBGD can be viewed as an intermediate version between SGD and BGD. Compared
to SGD, MBGD has less randomness because it uses more samples instead of just one
as in SGD. Compared to BGD, MBGD does not require using all the samples in every
iteration, making it more flexible. If m = 1, then MBGD becomes SGD. However, if
m = n, MBGD may not become BGD. This is because MBGD uses n randomly fetched
samples, whereas BGD uses all n numbers. These n randomly fetched samples may
contain the same number multiple times and hence may not cover all n numbers in
{xi }ni=1 .
The convergence speed of MBGD is faster than that of SGD in general. This is
because SGD uses ∇_w f(w_k, x_k) to approximate the true gradient, whereas MBGD uses $(1/m)\sum_{j\in I_k}\nabla_w f(w_k, x_j)$, which is closer to the true gradient because the randomness is averaged out. The convergence of the MBGD algorithm can be proven similarly to the SGD case.
A good example for demonstrating the above analysis is the mean estimation problem. In particular, given some numbers $\{x_i\}_{i=1}^n$, our goal is to calculate the mean $\bar x = \sum_{i=1}^n x_i / n$. This problem can be equivalently stated as the following optimization problem:

$$\min_w J(w) = \frac{1}{2n}\sum_{i=1}^{n} \|w - x_i\|^2,$$

whose optimal solution is $w^* = \bar x$. The three algorithms for solving this problem are, respectively,

$$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n}(w_k - x_i) = w_k - \alpha_k (w_k - \bar x), \qquad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m}\sum_{j\in I_k}(w_k - x_j) = w_k - \alpha_k \big(w_k - \bar x_k^{(m)}\big), \qquad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k (w_k - x_k), \qquad \text{(SGD)}$$

where $\bar x_k^{(m)} = \sum_{j\in I_k} x_j / m$ and x_k is the sample randomly fetched at time step k. Furthermore, if α_k = 1/k, the above equations can be solved
as follows:

$$w_{k+1} = \frac{1}{k}\sum_{j=1}^{k} \bar x = \bar x, \qquad \text{(BGD)}$$
$$w_{k+1} = \frac{1}{k}\sum_{j=1}^{k} \bar x_j^{(m)}, \qquad \text{(MBGD)}$$
$$w_{k+1} = \frac{1}{k}\sum_{j=1}^{k} x_j. \qquad \text{(SGD)}$$

The derivation of the above equations is similar to that of (6.3) and is omitted here. It can be seen that the estimate given by BGD at each step is exactly the optimal solution $w^* = \bar x$. MBGD converges to the mean faster than SGD because $\bar x_k^{(m)}$ is already an average.
A simulation example is given in Figure 6.5 to demonstrate the convergence of MBGD.
Let αk = 1/k. It is shown that all MBGD algorithms with different mini-batch sizes can
converge to the mean. The case with m = 50 converges the fastest, while SGD with m = 1
is the slowest. This is consistent with the above analysis. Nevertheless, the convergence
rate of SGD is still fast, especially when wk is far from w∗ .
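A rough Python sketch of the comparison described above, using made-up two-dimensional samples; the step size α_k = 1/k follows the text, while the initial guess and batch sizes are illustrative choices.

```python
import random

def estimate_mean(samples, batch_size, num_iters=30):
    """Mean estimation by mini-batch gradient descent with alpha_k = 1/k.
    batch_size=1 corresponds to SGD; a large batch approximates BGD."""
    w = [20.0, 20.0]                                   # hypothetical initial guess
    for k in range(1, num_iters + 1):
        batch = [random.choice(samples) for _ in range(batch_size)]
        grad = [sum(w[d] - x[d] for x in batch) / batch_size for d in range(2)]
        w = [w[d] - (1.0 / k) * grad[d] for d in range(2)]
    return w

# 100 i.i.d. samples, uniform in a square centered at the origin, so E[X] = 0.
data = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(100)]
for m in (1, 5, 50):                                   # SGD, and MBGD with two batch sizes
    print(m, estimate_mean(data, batch_size=m))
```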
Theorem 6.4 (Convergence of SGD). For the SGD algorithm in (6.13), if the following conditions are satisfied, then w_k converges to the root of $\nabla_w E[f(w, X)] = 0$ almost surely.

(a) $0 < c_1 \le \nabla_w^2 f(w, X) \le c_2$;
(b) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
(c) $\{x_k\}_{k=1}^{\infty}$ are i.i.d.
Then, SGD aims to find the root of g(w) = 0. This is exactly the problem solved by the RM algorithm. The quantity that we can measure is $\tilde g = \nabla_w f(w, x)$, where x is a sample of X. Note that g̃ can be rewritten as

$$\tilde g(w, \eta) = \nabla_w f(w, x) = E[\nabla_w f(w, X)] + \underbrace{\big(\nabla_w f(w, x) - E[\nabla_w f(w, X)]\big)}_{\eta(w, x)}.$$
The RM algorithm for solving g(w) = 0 is then $w_{k+1} = w_k - a_k \tilde g(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k)$, which is the same as the SGD algorithm in (6.13). As a result, the SGD algorithm
is a special RM algorithm. We next show that the three conditions in Theorem 6.1
are satisfied. Then, the convergence of SGD naturally follows from Theorem 6.1.
Since ∇w g(w) = ∇w E[∇w f (w, X)] = E[∇2w f (w, X)], it follows from c1 ≤
∇2w f (w, X) ≤ c2 that c1 ≤ ∇w g(w) ≤ c2 . Thus, the first condition in Theo-
rem 6.1 is satisfied.
The second condition in Theorem 6.1 is the same as the second condition in this
theorem.
The third condition in Theorem 6.1 requires E[ηk |Hk ] = 0 and E[ηk2 |Hk ] < ∞.
Since {xk } is i.i.d., we have Exk [∇w f (w, xk )] = E[∇w f (w, X)] for all k. Therefore,
a function of w_k. Therefore,

$$E[\eta_k | H_k] = E\big[\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)] \,\big|\, H_k\big] = E_{x_k}[\nabla_w f(w_k, x_k)] - E[\nabla_w f(w_k, X)] = 0.$$
Similarly, it can be proven that E[ηk2 |Hk ] < ∞ if |∇w f (w, x)| < ∞ for all w given
any x.
Since the three conditions in Theorem 6.1 are satisfied, the convergence of the
SGD algorithm follows.
6.5 Summary
Instead of introducing new reinforcement learning algorithms, this chapter introduced the
preliminaries of stochastic approximation such as the RM and SGD algorithms. Com-
pared to many other root-finding algorithms, the RM algorithm does not require the
expression of the objective function or its derivative. It has been shown that the SGD al-
gorithm is a special RM algorithm. Moreover, an important problem frequently discussed
throughout this chapter is mean estimation. The mean estimation algorithm (6.4) is the
first stochastic iterative algorithm we have ever introduced in this book. We showed that
it is a special SGD algorithm. We will see in Chapter 7 that temporal-difference learn-
ing algorithms have similar expressions. Finally, the name “stochastic approximation”
was first used by Robbins and Monro in 1951 [25]. More information about stochastic
approximation can be found in [24].
6.6 Q&A
Q: What is stochastic approximation?
A: Stochastic approximation refers to a broad class of stochastic iterative algorithms
for solving root-finding or optimization problems.
Q: Why do we need to study stochastic approximation?
A: This is because the temporal-difference reinforcement learning algorithms that will
be introduced in Chapter 7 can be viewed as stochastic approximation algorithms.
With the knowledge introduced in this chapter, we can be better prepared, and it will
not be abrupt for us to see these algorithms for the first time.
Q: Why do we frequently discuss the mean estimation problem in this chapter?
A: This is because the state and action values are defined as the means of random
variables. The temporal-difference learning algorithms introduced in Chapter 7 are
similar to stochastic approximation algorithms for mean estimation.
Chapter 7
Temporal-Difference Methods
[Figure: the position of this chapter in the book's roadmap, linking the fundamental tools of the preceding chapters to Chapter 7 (Temporal-Difference Methods) and Chapter 8 (Value Function Approximation), i.e., the move from tabular to function representations.]
The TD learning algorithm for estimating the state values of a given policy π is

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\big], \qquad (7.1)$$
$$v_{t+1}(s) = v_t(s), \qquad \text{for all } s \neq s_t, \qquad (7.2)$$

where t = 0, 1, 2, . . . . Here, v_t(s_t) is the estimate of v_π(s_t) at time t; α_t(s_t) is the learning
rate for st at time t.
It should be noted that, at time t, only the value of the visited state st is updated. The
values of the unvisited states s ≠ s_t remain unchanged, as shown in (7.2). Equation (7.2)
is often omitted for simplicity, but it should be kept in mind because the algorithm would
be mathematically incomplete without this equation.
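A minimal Python sketch of the TD update in (7.1)-(7.2), assuming that experience tuples (s_t, r_{t+1}, s_{t+1}) generated by the policy under evaluation are supplied from outside; the state names, learning rate, and discount factor are placeholders.

```python
from collections import defaultdict

def td0_state_values(experience, alpha=0.1, gamma=0.9):
    """TD learning of state values for a fixed policy.

    experience: iterable of (s_t, r_{t+1}, s_{t+1}) tuples. Only the visited
    state is updated, as in (7.1); all other values stay unchanged, as in (7.2).
    """
    v = defaultdict(float)                      # v_t(s), initialized to zero
    for s, r, s_next in experience:
        td_target = r + gamma * v[s_next]
        td_error = v[s] - td_target
        v[s] = v[s] - alpha * td_error          # update only the visited state
    return dict(v)

# Hypothetical experience from a two-state toy problem.
print(td0_state_values([("s1", 0.0, "s2"), ("s2", 1.0, "s1"), ("s1", 0.0, "s2")]))
```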
Readers who see the TD learning algorithm for the first time may wonder why it
is designed like this. In fact, it can be viewed as a special stochastic approximation
algorithm for solving the Bellman equation. To see that, first recall that the definition of
the state value is
$$v_\pi(s) = E[R_{t+1} + \gamma G_{t+1} \mid S_t = s], \qquad s \in S. \qquad (7.3)$$

Since G_{t+1} is the return starting from the next state, (7.3) can be rewritten as

$$v_\pi(s) = E[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s], \qquad s \in S. \qquad (7.4)$$

That is because $E[G_{t+1} | S_t = s] = \sum_a \pi(a|s)\sum_{s'} p(s'|s,a)\,v_\pi(s') = E[v_\pi(S_{t+1}) | S_t = s]$.
Equation (7.4) is another expression of the Bellman equation. It is sometimes called the
Bellman expectation equation.
The TD algorithm can be derived by applying the Robbins-Monro algorithm (Chap-
ter 6) to solve the Bellman equation in (7.4). Interested readers can check the details in
Box 7.1.
We next show that the TD algorithm in (7.1) can be obtained by applying the
Robbins-Monro algorithm to solve (7.4).
For state s_t, we define a function as

$$g(v_\pi(s_t)) \doteq v_\pi(s_t) - E\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big].$$

Our goal is to solve the above equation to obtain v_π(s_t) using the Robbins-Monro algorithm. Since we can obtain r_{t+1} and s_{t+1}, which are the samples of R_{t+1} and S_{t+1}, the noisy observation of g(v_π(s_t)) that we can obtain is

$$\tilde g(v_\pi(s_t)) = v_\pi(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big) = \underbrace{v_\pi(s_t) - E\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big]}_{g(v_\pi(s_t))} + \underbrace{E\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big] - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)}_{\eta}.$$
Therefore, the Robbins-Monro algorithm (Section 6.2) for solving g(v_π(s_t)) = 0 is

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\big[v_t(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)\big], \qquad (7.5)$$

where v_t(s_t) is the estimate of v_π(s_t) at time t, and α_t(s_t) is the learning rate.

The algorithm in (7.5) has a similar expression to that of the TD algorithm in (7.1). The only difference is that the right-hand side of (7.5) contains v_π(s_{t+1}), whereas (7.1) contains v_t(s_{t+1}). That is because (7.5) is designed to merely estimate the state value of s_t by assuming that the state values of the other states are already known. If we would like to estimate the state values of all the states, then v_π(s_{t+1}) on the right-hand side should be replaced with v_t(s_{t+1}). Then, (7.5) is exactly the same
as (7.1). However, can such a replacement still ensure convergence? The answer is
yes, and it will be proven later in Theorem 7.1.
$$\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\overbrace{\big[v_t(s_t) - \underbrace{(r_{t+1} + \gamma v_t(s_{t+1}))}_{\text{TD target } \bar v_t}\big]}^{\text{TD error } \delta_t}, \qquad (7.6)$$

where

$$\bar v_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$$

is called the TD target and

$$\delta_t \doteq v_t(s_t) - \bar v_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$$

is called the TD error. It can be seen that the new estimate v_{t+1}(s_t) is a combination of the current estimate v_t(s_t) and the TD error δ_t.
Taking the absolute values of both sides of the above equation gives

$$|v_{t+1}(s_t) - \bar v_t| = \big(1 - \alpha_t(s_t)\big)\,|v_t(s_t) - \bar v_t|.$$

Since α_t(s_t) is a small positive number, we have 0 < 1 − α_t(s_t) < 1. It then follows that

$$|v_{t+1}(s_t) - \bar v_t| < |v_t(s_t) - \bar v_t|.$$

The above inequality is important because it indicates that the new value v_{t+1}(s_t) is closer to v̄_t than the old value v_t(s_t). Therefore, this algorithm mathematically drives v_t(s_t) toward v̄_t. This is why v̄_t is called the TD target.
What is the interpretation of the TD error?
First, this error is called temporal-difference because δt = vt (st ) − (rt+1 + γvt (st+1 ))
reflects the discrepancy between two time steps t and t + 1. Second, the TD error is
zero in the expectation sense when the state value estimate is accurate. To see that,
when v_t = v_π, the expected value of the TD error is

$$E[\delta_t | S_t = s_t] = E\big[v_\pi(S_t) - (R_{t+1} + \gamma v_\pi(S_{t+1})) \mid S_t = s_t\big] = v_\pi(s_t) - E\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big] = 0. \quad \text{(due to (7.3))}$$
Therefore, the TD error reflects not only the discrepancy between two time steps but
also, more importantly, the discrepancy between the estimate vt and the true state
value vπ .
On a more abstract level, the TD error can be interpreted as the innovation, which
indicates new information obtained from the experience sample (st , rt+1 , st+1 ). The
fundamental idea of TD learning is to correct our current estimate of the state val-
ue based on the newly obtained information. Innovation is fundamental in many
estimation problems such as Kalman filtering [33, 34].
Second, the TD algorithm in (7.1) can only estimate the state values of a given policy.
To find optimal policies, we still need to further calculate the action values and then
conduct policy improvement. This will be introduced in Section 7.2. Nevertheless, the
TD algorithm introduced in this section is very basic and important for understanding
the other algorithms in this chapter.
Third, while both TD learning and MC learning are model-free, what are their ad-
vantages and disadvantages? The answers are summarized in Table 7.1.
Table 7.1: TD learning versus MC learning.

TD learning — Low estimation variance: The estimation variance of TD is lower than that of MC because it involves fewer random variables. For instance, to estimate an action value q_π(s_t, a_t), Sarsa merely requires the samples of three random variables: R_{t+1}, S_{t+1}, A_{t+1}.

MC learning — High estimation variance: The estimation variance of MC is higher since many random variables are involved. For example, to estimate q_π(s_t, a_t), we need samples of R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯. Suppose that the length of each episode is L. Assume that each state has the same number of actions as |A|. Then, there are |A|^L possible episodes following a soft policy. If we merely use a few episodes to estimate, it is not surprising that the estimation variance is high.
positive constant in practice. In this case, the condition that $\sum_t \alpha_t^2(s) < \infty$ is no longer valid. When α is constant, it can still be shown that the algorithm converges in the sense of expectation [24, Section 1.5].
or

$$\Delta_t(s) \doteq v_t(s) - v_\pi(s),$$
where vπ (s) is the state value of s under policy π. Deducting vπ (s) from both sides
of (7.7) gives
whose expression is the same as that of (7.9) except that αt (s) = 0 and ηt (s) = 0.
Therefore, regardless of whether s = st , we obtain the following unified expression:
This is the process in Theorem 6.3. Our goal is to show that the three conditions in
Theorem 6.3 are satisfied and hence the process converges.
The first condition is valid as assumed in Theorem 7.1. We next show that the
second condition is valid. That is, kE[ηt (s)|Ht ]k∞ ≤ γk∆t (s)k∞ for all s ∈ S. Here,
Ht represents the historical information (see the definition in Theorem 6.3). Due to
the Markovian property, ηt (s) = rt+1 + γvt (st+1 ) − vπ (s) or ηt (s) = 0 does not depend
For s = st , we have
Since vπ (st ) = E[rt+1 + γvπ (st+1 )|st ], the above equation implies that
It follows that
$$|E[\eta_t(s)]| = \gamma\left|\sum_{s'\in S} p(s'|s_t)\big[v_t(s') - v_\pi(s')\big]\right| \le \gamma \sum_{s'\in S} p(s'|s_t)\,\max_{s'\in S}\big|v_t(s') - v_\pi(s')\big| = \gamma \max_{s'\in S}\big|v_t(s') - v_\pi(s')\big|.$$

Therefore, at time t, we know from (7.10) and (7.11) that $|E[\eta_t(s)]| \le \gamma\|\Delta_t(s)\|_\infty$ for all s ∈ S regardless of whether s = s_t. Thus,

$$\|E[\eta_t(s) | H_t]\|_\infty \le \gamma\|\Delta_t(s)\|_\infty,$$

which is the second condition in Theorem 7.1. Finally, regarding the third condition, we have var[η_t(s)|H_t] = var[r_{t+1} + γv_t(s_{t+1}) − v_π(s_t) | s_t] = var[r_{t+1} + γv_t(s_{t+1}) | s_t] for s = s_t and var[η_t(s)|H_t] = 0 for s ≠ s_t. Since r_{t+1} is bounded, the third condition
can be proven without difficulty.
The above proof is inspired by [32].
The Sarsa algorithm for estimating the action values of a given policy π is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\big],$$
$$q_{t+1}(s, a) = q_t(s, a), \qquad \text{for all } (s, a) \neq (s_t, a_t),$$

where t = 0, 1, 2, . . . and α_t(s_t, a_t) is the learning rate. Here, q_t(s_t, a_t) is the estimate of
qπ (st , at ). At time t, only the q-value of (st , at ) is updated, whereas the q-values of the
others remain the same.
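A minimal Python sketch of Sarsa with an ε-greedy policy derived from the current action-value estimates; the environment interface env.step(s, a) -> (reward, next_state), the action set, and the hyperparameters are hypothetical placeholders.

```python
import random
from collections import defaultdict

def sarsa(env, actions, s0, is_target, alpha=0.1, gamma=0.9,
          epsilon=0.1, num_episodes=500, max_steps=1000):
    """Sarsa: move q(s_t, a_t) toward r_{t+1} + gamma * q(s_{t+1}, a_{t+1})."""
    q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)                  # explore
        return max(actions, key=lambda a: q[(s, a)])       # exploit

    for _ in range(num_episodes):
        s, a = s0, eps_greedy(s0)
        for _ in range(max_steps):
            if is_target(s):
                break
            r, s_next = env.step(s, a)                     # experience sample
            a_next = eps_greedy(s_next)                    # on-policy next action
            td_target = r + gamma * q[(s_next, a_next)]
            q[(s, a)] -= alpha * (q[(s, a)] - td_target)   # Sarsa update
            s, a = s_next, a_next
    return q
```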
Some important properties of the Sarsa algorithm are discussed as follows.
Why is this algorithm called “Sarsa”? That is because each iteration of the algorithm
requires (st , at , rt+1 , st+1 , at+1 ). Sarsa is an abbreviation for state-action-reward-state-
action. The Sarsa algorithm was first proposed in [35] and its name was coined by
[3].
Why is Sarsa designed in this way? One may have noticed that Sarsa is similar to the
TD algorithm in (7.1). In fact, Sarsa can be easily obtained from the TD algorithm
by replacing state value estimation with action value estimation.
What does Sarsa do mathematically? Similar to the TD algorithm in (7.1), Sarsa
is a stochastic approximation algorithm for solving the Bellman equation of a given
policy:
Equation (7.13) is the Bellman equation expressed in terms of action values. A proof
is given in Box 7.3.
This equation establishes the relationships among the action values. Since
By the definition of the expected value, the above equation is equivalent to (7.13).
Hence, (7.13) is the Bellman equation.
The proof is similar to that of Theorem 7.1 and is omitted here. The condition that $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ should be valid for all (s, a). In particular, $\sum_t \alpha_t(s, a) = \infty$ requires that every state-action pair must be visited an infinite (or sufficiently many) number of times. At time t, if (s, a) = (s_t, a_t), then α_t(s, a) > 0; otherwise, α_t(s, a) = 0.
Initialization: α_t(s, a) = α > 0 for all (s, a) and all t. ε ∈ (0, 1). Initial q_0(s, a) for all (s, a). Initial ε-greedy policy π_0 derived from q_0.
Goal: Learn an optimal policy that can lead the agent to the target state from an initial state s_0.
For each episode, do
    Generate a_0 at s_0 following π_0(s_0)
    If s_t (t = 0, 1, 2, . . . ) is not the target state, do
        Collect an experience sample (r_{t+1}, s_{t+1}, a_{t+1}) given (s_t, a_t): generate r_{t+1}, s_{t+1} by interacting with the environment; generate a_{t+1} following π_t(s_{t+1}).
        Update q-value for (s_t, a_t):
            q_{t+1}(s_t, a_t) = q_t(s_t, a_t) − α_t(s_t, a_t)[q_t(s_t, a_t) − (r_{t+1} + γq_t(s_{t+1}, a_{t+1}))]
        Update policy for s_t:
            π_{t+1}(a|s_t) = 1 − (ε/|A(s_t)|)(|A(s_t)| − 1) if a = arg max_a q_{t+1}(s_t, a)
            π_{t+1}(a|s_t) = ε/|A(s_t)| otherwise
        s_t ← s_{t+1}, a_t ← a_{t+1}
Simulation setup: In this example, all the episodes start from the top-left state and ter-
minate at the target state. The reward settings are rtarget = 0, rforbidden = rboundary =
−10, and rother = −1. Moreover, αt(s, a) = 0.1 for all t and ε = 0.1. The initial
guesses of the action values are q0 (s, a) = 0 for all (s, a). The initial policy has a
uniform distribution: π0 (a|s) = 0.2 for all s, a.
Learned policy: The left figure in Figure 7.2 shows the final policy learned by Sarsa.
As can be seen, this policy can successfully lead to the target state from the starting
Figure 7.2: An example for demonstrating Sarsa. All the episodes start from the top-left state and
terminate when reaching the target state (the blue cell). The goal is to find an optimal path from the
starting state to the target state. The reward settings are rtarget = 0, rforbidden = rboundary = −10, and
rother = −1. The learning rate is α = 0.1 and the value of ε is 0.1. The left figure shows the final policy
obtained by the algorithm. The right figures show the total reward and length of every episode.
state. However, the policies of some other states may not be optimal. That is because
the other states are not well explored.
Total reward of each episode: The top-right subfigure in Figure 7.2 shows the to-
tal reward of each episode. Here, the total reward is the non-discounted sum of all
immediate rewards. As can be seen, the total reward of each episode increases grad-
ually. That is because the initial policy is not good and hence negative rewards are
frequently obtained. As the policy becomes better, the total reward increases.
Length of each episode: The bottom-right subfigure in Figure 7.2 shows that the
length of each episode drops gradually. That is because the initial policy is not good
and may take many detours before reaching the target. As the policy becomes better,
the length of the trajectory becomes shorter. Notably, the length of an episode may
increase abruptly (e.g., the 460th episode) and the corresponding total reward also
drops sharply. That is because the policy is ε-greedy, and there is a chance for it to take non-optimal actions. One way to resolve this problem is to use a decaying ε whose value converges to zero gradually.
Finally, Sarsa also has some variants such as Expected Sarsa. Interested readers may
check Box 7.4.
Given a policy π, its action values can be evaluated by Expected Sarsa, which is a variant of Sarsa. The Expected Sarsa algorithm is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma E[q_t(s_{t+1}, A)]\big)\big],$$
$$q_{t+1}(s, a) = q_t(s, a), \qquad \text{for all } (s, a) \neq (s_t, a_t),$$

where

$$E[q_t(s_{t+1}, A)] = \sum_a \pi_t(a|s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$
is the expected value of qt (st+1 , a) under policy πt . The expression of the Ex-
pected Sarsa algorithm is very similar to that of Sarsa. They are different only
in terms of their TD targets. In particular, the TD target in Expected Sarsa is
rt+1 + γE[qt (st+1 , A)], while that of Sarsa is rt+1 + γqt (st+1 , at+1 ). Since the algorithm
involves an expected value, it is called Expected Sarsa. Although calculating the
expected value may increase the computational complexity slightly, it is beneficial
in the sense that it reduces the estimation variances because it reduces the random
variables in Sarsa from {st , at , rt+1 , st+1 , at+1 } to {st , at , rt+1 , st+1 }.
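The only change relative to Sarsa is the TD target. A hedged Python comparison of the two targets, assuming a dictionary of action values and a table of policy probabilities, is:

```python
def sarsa_target(r, gamma, q, s_next, a_next):
    # Uses the sampled next action a_{t+1}.
    return r + gamma * q[(s_next, a_next)]

def expected_sarsa_target(r, gamma, q, s_next, policy, actions):
    # Averages over the policy instead of sampling a_{t+1}.
    return r + gamma * sum(policy[(s_next, a)] * q[(s_next, a)] for a in actions)
```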
Similar to the TD learning algorithm in (7.1), Expected Sarsa can be viewed as
a stochastic approximation algorithm for solving the following equation:
h i
qπ (s, a) = E Rt+1 + γE[qπ (St+1 , At+1 )|St+1 ] St = s, At = a , for all s, a. (7.15)
The above equation may look strange at first glance. In fact, it is another expression
of the Bellman equation. To see that, substituting
E[qπ(St+1, At+1) | St+1] = ΣA′ qπ(St+1, A′)π(A′|St+1) = vπ(St+1)
Sarsa ← Gt^(1) = Rt+1 + γqπ(St+1, At+1),
        Gt^(2) = Rt+1 + γRt+2 + γ²qπ(St+2, At+2),
        ⋮
n-step Sarsa ← Gt^(n) = Rt+1 + γRt+2 + · · · + γ^n qπ(St+n, At+n),
        ⋮
MC ← Gt^(∞) = Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + · · ·
When n = 1, we have
qπ(s, a) = E[Gt^(1) | s, a] = E[Rt+1 + γqπ(St+1, At+1) | s, a].
When n = ∞, we have
qπ(s, a) = E[Gt^(∞) | s, a] = E[Rt+1 + γRt+2 + γ²Rt+3 + · · · | s, a].
qt+1(st, at) = gt ≐ rt+1 + γrt+2 + γ²rt+3 + · · · ,
qπ(s, a) = E[Gt^(n) | s, a] = E[Rt+1 + γRt+2 + · · · + γ^n qπ(St+n, At+n) | s, a].
In summary, n-step Sarsa is a more general algorithm because it becomes the (one-
step) Sarsa algorithm when n = 1 and the MC learning algorithm when n = ∞ (by
setting αt = 1).
To implement the n-step Sarsa algorithm in (7.17), we need the experience samples
(st , at , rt+1 , st+1 , at+1 , . . . , rt+n , st+n , at+n ). Since (rt+n , st+n , at+n ) has not been collected
at time t, we have to wait until time t + n to update the q-value of (st , at ). To that end,
(7.17) can be rewritten as
where t = 0, 1, 2, . . . . Here, qt (st , at ) is the estimate of the optimal action value of (st , at )
and αt (st , at ) is the learning rate for (st , at ).
The expression of Q-learning is similar to that of Sarsa. They are different only
in terms of their TD targets: the TD target of Q-learning is rt+1 + γ maxa qt (st+1 , a),
whereas that of Sarsa is rt+1 + γqt (st+1 , at+1 ). Moreover, given (st , at ), Sarsa requires
(rt+1 , st+1 , at+1 ) in every iteration, whereas Q-learning merely requires (rt+1 , st+1 ).
Why is Q-learning designed as the expression in (7.18), and what does it do mathe-
matically? Q-learning is a stochastic approximation algorithm for solving the following
equation:
q(s, a) = E[Rt+1 + γ maxa q(St+1, a) | St = s, At = a].   (7.19)
This is the Bellman optimality equation expressed in terms of action values. The proof is
given in Box 7.5. The convergence analysis of Q-learning is similar to Theorem 7.1 and
omitted here. More information can be found in [32, 39].
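The following minimal sketch shows one tabular Q-learning step corresponding to the update discussed above; the function name and the sample values are illustrative only.

import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update on a tabular q (n_states x n_actions array).

    The TD target bootstraps with max_a q(s', a), which is why the algorithm
    solves the Bellman optimality equation (7.19).
    """
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q

# Usage with arbitrary sample values:
q = np.zeros((4, 3))
q_learning_update(q, s=1, a=2, r=-1.0, s_next=3, alpha=0.1, gamma=0.9)
print(q[1, 2])   # approximately -0.1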
By denoting v(s) ≐ maxa∈A(s) q(s, a), we can rewrite the above equation as
v(s) = maxa∈A(s) [Σr p(r|s, a)r + γ Σs′ p(s′|s, a)v(s′)]
     = maxπ Σa∈A(s) π(a|s)[Σr p(r|s, a)r + γ Σs′ p(s′|s, a)v(s′)],
which is clearly the Bellman optimality equation in terms of state values as introduced
in Chapter 3.
Sarsa is on-policy.
The reason is as follows. Sarsa has two steps in every iteration. The first step is to
evaluate a policy π by solving its Bellman equation. To do that, we need samples
generated by π. Therefore, π is the behavior policy. The second step is to obtain an
improved policy based on the estimated values of π. As a result, π is the target policy
that is constantly updated and eventually converges to an optimal policy. Therefore,
the behavior policy and the target policy are the same.
From another point of view, we can examine the samples required by the algorithm.
The samples required by Sarsa in every iteration include (st , at , rt+1 , st+1 , at+1 ). How
these samples are generated is illustrated below:
st --πb--> at --(model)--> rt+1, st+1 --πb--> at+1
As can be seen, the behavior policy πb is the one that generates at at st and at+1
at st+1 . The Sarsa algorithm aims to estimate the action value of (st , at ) of a policy
denoted as πT , which is the target policy because it is improved in every iteration
based on the estimated values. In fact, πT is the same as πb because the evaluation
of πT relies on the samples (rt+1 , st+1 , at+1 ), where at+1 is generated following πb . In
other words, the policy that Sarsa evaluates is the policy used to generate samples.
Q-learning is off-policy.
The fundamental reason is that Q-learning is an algorithm for solving the Bellman
optimality equation, whereas Sarsa is for solving the Bellman equation of a given
policy. While solving the Bellman equation can evaluate the associated policy, solving
the Bellman optimality equation can directly generate the optimal values and optimal
policies.
In particular, the samples required by Q-learning in every iteration are (st, at, rt+1, st+1).
How these samples are generated is illustrated below:
st --πb--> at --(model)--> rt+1, st+1
As can be seen, the behavior policy πb is the one that generates at at st . The Q-learning
algorithm aims to estimate the optimal action value of (st , at ). This estimation process
relies on the samples (rt+1 , st+1 ). The process of generating (rt+1 , st+1 ) does not
involve πb because it is governed by the system model (or by interacting with the
environment). Therefore, the estimation of the optimal action value of (st , at ) does
not involve πb and we can use any πb to generate at at st . Moreover, the target
policy πT here is the greedy policy obtained based on the estimated optimal values
(Algorithm 7.3). The behavior policy does not have to be the same as πT .
MC learning is on-policy. The reason is similar to that of Sarsa. The target policy to
be evaluated and improved is the same as the behavior policy that generates samples.
Algorithm 7.2: Q-learning (on-policy version)
Initialization: αt(s, a) = α > 0 for all (s, a) and all t. ε ∈ (0, 1). Initial q0(s, a) for all (s, a). Initial ε-greedy policy π0 derived from q0.
Goal: Learn an optimal path that can lead the agent to the target state from an initial state s0.
For each episode, do
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (at, rt+1, st+1) given st: generate at following πt(st); generate rt+1, st+1 by interacting with the environment.
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa qt(st+1, a))]
        Update policy for st:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1)   if a = arg maxa qt+1(st, a)
            πt+1(a|st) = ε/|A(st)|   otherwise
Algorithm 7.3: Q-learning (off-policy version)
Initialization: Initial guess q0(s, a) for all (s, a). Behavior policy πb(a|s) for all (s, a). αt(s, a) = α > 0 for all (s, a) and all t.
Goal: Learn an optimal target policy πT for all states from the experience samples generated by πb.
For each episode {s0, a0, r1, s1, a1, r2, . . . } generated by πb, do
    For each step t = 0, 1, 2, . . . of the episode, do
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa qt(st+1, a))]
        Update target policy for st:
            πT,t+1(a|st) = 1   if a = arg maxa qt+1(st, a)
            πT,t+1(a|st) = 0   otherwise
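To illustrate the off-policy procedure, here is a minimal sketch that processes a pre-collected episode generated by some behavior policy and extracts a greedy target policy; the episode data and the function name are made up for illustration.

import numpy as np

def q_learning_offline(episode, n_states, n_actions, alpha=0.1, gamma=0.9):
    """Off-policy Q-learning in the spirit of Algorithm 7.3 (a simplified sketch).

    `episode` is a list of transitions (s, a, r, s_next) collected beforehand by
    some behavior policy pi_b; the transitions below are hypothetical.
    Returns the estimated optimal q-values and the greedy target policy.
    """
    q = np.zeros((n_states, n_actions))
    for s, a, r, s_next in episode:
        q[s, a] -= alpha * (q[s, a] - (r + gamma * np.max(q[s_next])))
    pi_target = np.argmax(q, axis=1)     # greedy (not epsilon-greedy) target policy
    return q, pi_target

# Usage with a short made-up episode over 3 states and 2 actions:
episode = [(0, 1, -1.0, 1), (1, 1, -1.0, 2), (2, 0, -1.0, 1), (1, 1, 0.0, 2)]
q, pi = q_learning_offline(episode, n_states=3, n_actions=2)
print(pi)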
Figure 7.3: An example for demonstrating Q-learning. All the episodes start from the top-left state and
terminate after reaching the target state. The aim is to find an optimal path from the starting state to
the target state. The reward settings are rtarget = 0, rforbidden = rboundary = −10, and rother = −1. The
learning rate is α = 0.1 and the value of ε is 0.1. The left figure shows the final policy obtained by the
algorithm. The right figures show the total reward and length of every episode.
7.4.3 Implementation
Since Q-learning is off-policy, it can be implemented in either an on-policy or off-policy
fashion.
The on-policy version of Q-learning is shown in Algorithm 7.2. This implementation
is similar to the Sarsa one in Algorithm 7.1. Here, the behavior policy is the same as the
target policy, which is an ε-greedy policy.
The off-policy version is shown in Algorithm 7.3. The behavior policy πb can be any
policy as long as it can generate sufficient experience samples. It is usually favorable when
πb is exploratory. Here, the target policy πT is greedy rather than ε-greedy since it is
not used to generate samples and hence is not required to be exploratory. Moreover, the
off-policy version of Q-learning presented here is implemented offline: all the experience
samples are collected first and then processed. It can be modified to become online: the
value and policy can be updated immediately once a sample is received.
Ground truth: To verify the effectiveness of Q-learning, we first need to know the
ground truth of the optimal policies and optimal state values. Here, the ground truth
is obtained by the model-based policy iteration algorithm. The ground truth is given
in Figures 7.4(a) and (b).
Experience samples: The behavior policy has a uniform distribution: the probability
of taking any action at any state is 0.2 (Figure 7.4(c)). A single episode with 100,000
steps is generated (Figure 7.4(d)). Due to the good exploration ability of the behavior
policy, the episode visits every state-action pair many times.
Learned results: Based on the episode generated by the behavior policy, the final
target policy learned by Q-learning is shown in Figure 7.4(e). This policy is optimal
because the estimated state value error (root-mean-square error) converges to zero as
shown in Figure 7.4(f). In addition, one may notice that the learned optimal policy
is not exactly the same as that in Figure 7.4(a). In fact, there exist multiple optimal
policies that have the same optimal state values.
Different initial values: Since Q-learning bootstraps, the performance of the algorithm
depends on the initial guess for the action values. As shown in Figure 7.4(g), when the
initial guess is close to the true value, the estimate converges within approximately
10,000 steps. Otherwise, the convergence requires more steps (Figure 7.4(h)). Nev-
ertheless, these figures demonstrate that Q-learning can still converge rapidly even
though the initial value is not accurate.
Different behavior policies: When the behavior policy is not exploratory, the learning
performance drops significantly. For example, consider the behavior policies shown
in Figure 7.5. They are -greedy policies with = 0.5 or 0.1 (the uniform policy in
Figure 7.4(c) can be viewed as -greedy with = 1). It is shown that, when decreases
from 1 to 0.5 and then to 0.1, the learning speed drops significantly. That is because
the exploration ability of the policy is weak and hence the experience samples are
insufficient.
(g) State value error when q0(s, a) = 10. (h) State value error when q0(s, a) = 100.
Figure 7.4: Examples for demonstrating off-policy learning via Q-learning. The optimal policy and
optimal state values are shown in (a) and (b), respectively. The behavior policy and the generated
episode are shown in (c) and (d), respectively. The estimated policy and the estimation error evolution
are shown in (e) and (f), respectively. The cases with different initial values are shown in (g) and (h).
(a) ε = 0.5. (b) ε = 0.1. (c) ε = 0.1.
Figure 7.5: The performance of Q-learning drops when the behavior policy is not exploratory. The figures
in the left column show the behavior policies. The figures in the middle column show the generated
episodes following the corresponding behavior policies. The episode in each example has 100,000 steps.
The figures in the right column show the evolution of the root-mean-square error of the estimated state
values.
Table 7.2: A unified point of view of TD algorithms. Here, BE and BOE denote the Bellman equation
and Bellman optimality equation, respectively.
where q̄t is the TD target. Different TD algorithms have different q̄t . See Table 7.2 for a
summary. The MC learning algorithm can be viewed as a special case of (7.20): we can
set αt (st , at ) = 1 and then (7.20) becomes qt+1 (st , at ) = q̄t .
Algorithm (7.20) can be viewed as a stochastic approximation algorithm for solving
a unified equation: q(s, a) = E[q̄t |s, a]. This equation has different expressions with
different q̄t . These expressions are summarized in Table 7.2. As can be seen, all of the
algorithms aim to solve the Bellman equation except Q-learning, which aims to solve the
Bellman optimality equation.
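The unified expression can be illustrated with a short sketch: one generic update rule plus interchangeable TD targets. The helper function names below are illustrative, not from the text.

import numpy as np

# All TD algorithms in Table 7.2 share the update
# q_{t+1}(s,a) = q_t(s,a) - alpha * (q_t(s,a) - q_bar),
# and differ only in how the TD target q_bar is computed.

def td_target_sarsa(q, r, s_next, a_next, gamma):
    return r + gamma * q[s_next, a_next]

def td_target_expected_sarsa(q, r, s_next, policy_row, gamma):
    return r + gamma * np.dot(policy_row, q[s_next])

def td_target_q_learning(q, r, s_next, gamma):
    return r + gamma * np.max(q[s_next])

def unified_update(q, s, a, q_bar, alpha):
    q[s, a] -= alpha * (q[s, a] - q_bar)
    return q

# Example: the same sample updated with the Q-learning target.
q = np.zeros((2, 2))
q_bar = td_target_q_learning(q, r=-1.0, s_next=1, gamma=0.9)
unified_update(q, s=0, a=0, q_bar=q_bar, alpha=0.1)
print(q)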
7.6 Summary
This chapter introduced an important class of reinforcement learning algorithms called
TD learning. The specific algorithms that we introduced include Sarsa, n-step Sarsa, and
Q-learning. All these algorithms can be viewed as stochastic approximation algorithms
for solving Bellman or Bellman optimality equations.
The TD algorithms introduced in this chapter, except Q-learning, are used to eval-
uate a given policy. That is to estimate a given policy’s state/action values from some
experience samples. Together with policy improvement, they can be used to learn opti-
mal policies. Moreover, these algorithms are on-policy: the target policy is used as the
behavior policy to generate experience samples.
Q-learning is slightly special compared to the other TD algorithms in the sense that
it is off-policy. The target policy can be different from the behavior policy in Q-learning.
The fundamental reason why Q-learning is off-policy is that Q-learning aims to solve the
Bellman optimality equation rather than the Bellman equation of a given policy.
It is worth mentioning that there are some methods that can convert an on-policy
algorithm to be off-policy. Importance sampling is a widely used one [3, 40] and will be
introduced in Chapter 10. Finally, there are some variants and extensions of the TD
algorithms introduced in this chapter [41–45]. For example, the TD(λ) method provides
a more general and unified framework for TD learning. More information can be found
in [3, 20, 46].
7.7 Q&A
Q: What does the term “TD” in TD learning mean?
A: Every TD algorithm has a TD error, which represents the discrepancy between the
new sample and the current estimate. Since this discrepancy is calculated between
different time steps, it is called temporal-difference.
Q: What does the term “learning” in TD learning mean?
A: From a mathematical point of view, “learning” simply means “estimation”. That
is to estimate state/action values from some samples and then obtain policies based
on the estimated values.
Q: While Sarsa can estimate the action values of a given policy, how can it be used
to learn optimal policies?
A: To obtain an optimal policy, the value estimation process should interact with
the policy improvement process. That is, after a value is updated, the corresponding
policy should be updated. Then, the updated policy generates new samples that can
be used to estimate values again. This is the idea of generalized policy iteration.
Q: Why does Sarsa update policies to be -greedy?
A: That is because the policy is also used to generate samples for value estimation.
Hence, it should be exploratory to generate sufficient experience samples.
Q: While Theorems 7.1 and 7.2 require that the learning rate αt converges to zero
gradually, why is it often set to be a small constant in practice?
A: The fundamental reason is that the policy to be evaluated keeps changing (or called
nonstationary). In particular, a TD learning algorithm like Sarsa aims to estimate
the action values of a given policy. If the policy is fixed, using a decaying learning
rate is acceptable. However, in the optimal policy learning process, the policy that
Sarsa aims to evaluate keeps changing after every iteration. We need a constant
learning rate in this case; otherwise, a decaying learning rate may be too small to
effectively evaluate policies. Although a drawback of constant learning rates is that
the value estimate may fluctuate eventually, the fluctuation is negligible as long as
the constant learning rate is sufficiently small.
Q: Should we learn the optimal policies for all states or a subset of the states?
A: It depends on the task. One may notice that some tasks considered in this chapter
(e.g., Figure 7.2) do not require finding the optimal policies for all states. Instead,
they only need to find an optimal path from a given starting state to the target state.
Such tasks are not demanding in terms of data because the agent does not need to
visit every state-action pair sufficiently many times. It, however, must be noted that
the obtained path is not guaranteed to be optimal. That is because better paths may
be missed if not all state-action pairs are well explored. Nevertheless, given sufficient
data, we can still find a good or locally optimal path.
Q: Why is Q-learning off-policy while all the other TD algorithms in this chapter are
on-policy?
A: The fundamental reason is that Q-learning aims to solve the Bellman optimality
equation, whereas the other TD algorithms aim to solve the Bellman equation of a
given policy. Details can be found in Section 7.4.2.
Q: Why does the off-policy version of Q-learning update policies to be greedy instead
of -greedy?
A: That is because the target policy is not required to generate experience samples.
Hence, it is not required to be exploratory.
Chapter 8
Value Function Approximation
(Chapter map: this chapter marks the transition from the tabular representation used so far to the function representation of values.)
Figure 8.2: An illustration of the function approximation method. The x-axis and y-axis correspond to
s and v̂(s), respectively.
State s1 s2 ··· sn
Estimated value v̂(s1 ) v̂(s2 ) ··· v̂(sn )
We next show that the values in the above table can be approximated by a function.
In particular, {(si , v̂(si ))}ni=1 are shown as n points in Figure 8.2. These points can be
fitted or approximated by a curve. The simplest curve is a straight line, which can be
described as
" #
a
v̂(s, w) = as + b = [s, 1] = φT (s)w. (8.1)
|{z} b
φT (s) | {z }
w
Here, v̂(s, w) is a function for approximating vπ (s). It is determined jointly by the state s
and the parameter vector w ∈ R2 . v̂(s, w) is sometimes written as v̂w (s). Here, φ(s) ∈ R2
is called the feature vector of s.
The first notable difference between the tabular and function approximation methods
concerns how they retrieve and update a value.
How to retrieve a value: When the values are represented by a table, if we want to
retrieve a value, we can directly read the corresponding entry in the table. However,
when the values are represented by a function, it becomes slightly more complicated
to retrieve a value. In particular, we need to input the state index s into the function
and calculate the function value (Figure 8.3). For the example in (8.1), we first need
to calculate the feature vector φ(s) and then calculate φT (s)w. If the function is
an artificial neural network, a forward propagation from the input to the output is
needed.
Figure 8.3: An illustration of the process for retrieving the value of s when using the function approxi-
mation method.
The function approximation method is more efficient in terms of storage due to the
way in which the state values are retrieved. Specifically, while the tabular method
needs to store n values, we now only need to store a lower dimensional parameter
vector w. Thus, the storage efficiency can be significantly improved. Such a benefit
is, however, not free. It comes with a cost: the state values may not be accurately
represented by the function. For example, a straight line is not able to accurately fit
the points in Figure 8.2. That is why this method is called approximation. From a
fundamental point of view, some information will certainly be lost when we use a low-
dimensional vector to represent a high-dimensional dataset. Therefore, the function
approximation method enhances storage efficiency by sacrificing accuracy.
How to update a value: When the values are represented by a table, if we want
to update one value, we can directly rewrite the corresponding entry in the table.
However, when the values are represented by a function, the way to update a value is
completely different. Specifically, we must update w to change the values indirectly.
How to update w to find optimal state values will be addressed in detail later.
Thanks to the way in which the state values are updated, the function approximation
method has another merit: its generalization ability is stronger than that of the tabular
method. The reason is as follows. When using the tabular method, we can update
a value if the corresponding state is visited in an episode. The values of the states
that have not been visited cannot be updated. However, when using the function
approximation method, we need to update w to update the value of a state. The
update of w also affects the values of some other states even though these states have
not been visited. Therefore, the experience sample for one state can generalize to help
estimate the values of some other states.
The above analysis is illustrated in Figure 8.4, where there are three states {s1 , s2 , s3 }.
Suppose that we have an experience sample for s3 and would like to update v̂(s3 ).
When using the tabular method, we can only update v̂(s3 ) without changing v̂(s1 ) or
v̂(s2 ), as shown in Figure 8.4(a). When using the function approximation method,
updating w not only updates v̂(s3) but also changes v̂(s1) and v̂(s2), as shown in
Figure 8.4(b). Therefore, the experience sample of s3 can help update the
values of its neighboring states.
(a) Tabular method: when v̂(s3) is updated, the other values remain the same.
(b) Function approximation method: when we update v̂(s3) by changing w, the values of the neighboring states are also changed.
We can use more complex functions that have stronger approximation abilities than
straight lines. For example, consider a second-order polynomial:
v̂(s, w) = as² + bs + c = [s², s, 1][a, b, c]^T = φ^T(s)w,   (8.2)
where φ^T(s) = [s², s, 1] and w = [a, b, c]^T.
We can use even higher-order polynomial curves to fit the points. As the order of the
curve increases, the approximation accuracy can be improved, but the dimension of the
parameter vector also increases, requiring more storage and computational resources.
Note that v̂(s, w) in either (8.1) or (8.2) is linear in w (though it may be nonlinear
in s). This type of method is called linear function approximation, which is the simplest
function approximation method. To realize linear function approximation, we need to
select an appropriate feature vector φ(s). That is, we must decide, for example, whether
we should use a first-order straight line or a second-order curve to fit the points. The
selection of appropriate feature vectors is nontrivial. It requires prior knowledge of the
given task: the better we understand the task, the better the feature vectors we can select.
For instance, if we know that the points in Figure 8.2 are approximately located on a
straight line, we can use a straight line to fit the points. However, such prior knowledge
is usually unknown in practice. If we do not have any prior knowledge, a popular solution
is to use artificial neural networks as nonlinear function approximations.
Another important problem is how to find the optimal parameter vector. If we know
{vπ (si )}ni=1 , this is a least-squares problem. The optimal parameter can be obtained by
optimizing the following objective function:
J1 = Σ_{i=1}^n (v̂(si, w) − vπ(si))² = Σ_{i=1}^n (φ^T(si)w − vπ(si))² = ‖Φw − vπ‖²₂,
where
Φ = [φ(s1), . . . , φ(sn)]^T ∈ R^{n×2},   vπ = [vπ(s1), . . . , vπ(sn)]^T ∈ R^n.
More information about least-squares problems can be found in [47, Section 3.3] and
[48, Section 5.14].
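A minimal numerical sketch of this least-squares fit is given below; the five states and the "true" values vπ(si) are made-up numbers used only to show the computation.

import numpy as np

states = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v_pi = np.array([-4.1, -3.0, -2.2, -0.9, 0.1])          # hypothetical true values

Phi = np.column_stack([states, np.ones_like(states)])   # phi(s) = [s, 1]^T, as in (8.1)
w_star, *_ = np.linalg.lstsq(Phi, v_pi, rcond=None)     # solves min_w ||Phi w - v_pi||^2

v_hat = Phi @ w_star                                     # fitted values phi^T(s) w
print(w_star, np.linalg.norm(v_hat - v_pi))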
The curve-fitting example presented in this section illustrates the basic idea of value
function approximation. This idea will be formally introduced in the next section.
where the expectation is calculated with respect to the random variable S ∈ S. While S
is a random variable, what is its probability distribution? This question is important for
understanding this objective function. There are several ways to define the probability
distribution of S.
The first way is to use a uniform distribution. That is to treat all the states as equally
important by setting the probability of each state to 1/n. In this case, the objective
function in (8.3) becomes
J(w) = (1/n) Σ_{s∈S} (vπ(s) − v̂(s, w))²,   (8.4)
which is the average value of the approximation errors of all the states. However, this
way does not consider the real dynamics of the Markov process under the given policy.
Since some states may be rarely visited by a policy, it may be unreasonable to treat
all the states as equally important.
The second way, which is the focus of this chapter, is to use the stationary distribution.
The stationary distribution describes the long-term behavior of a Markov decision
process. More specifically, after the agent executes a given policy for a sufficiently
long period, the probability of the agent being located at any state can be described
by this stationary distribution. Interested readers may see the details in Box 8.1.
Let {dπ (s)}s∈S denote the stationary distribution of the Markov process under policy
π. That is, the probability for the agent visiting s after a long period of time is dπ (s).
By definition, Σ_{s∈S} dπ(s) = 1. Then, the objective function in (8.3) can be rewritten
as
J(w) = Σ_{s∈S} dπ(s)(vπ(s) − v̂(s, w))²,   (8.5)
which is a weighted average of the approximation errors. The states that have higher
probabilities of being visited are given greater weights.
It is notable that the value of dπ (s) is nontrivial to obtain because it requires knowing
the state transition probability matrix Pπ (see Box 8.1). Fortunately, we do not need to
calculate the specific value of dπ (s) to minimize this objective function as shown in the
next subsection. In addition, it was assumed that the number of states was finite when
we introduced (8.4) and (8.5). When the state space is continuous, we can replace the
summations with integrals.
The key tool for analyzing the stationary distribution is Pπ ∈ R^{n×n}, which is the probability transition matrix under the given policy π. If the states are indexed as s1, . . . , sn,
then [Pπ ]ij is defined as the probability for the agent moving from si to sj . The
definition of Pπ can be found in Section 2.6.
Interpretation of Pπ^k (k = 1, 2, 3, . . . ).
First of all, it is necessary to examine the interpretation of the entries in Pπ^k. The probability of the agent transitioning from si to sj using exactly k steps is denoted as
p_ij^(k) = Pr(S_{tk} = j | S_{t0} = i),
where t0 and tk are the initial and kth time steps, respectively. First, by the
definition of Pπ , we have
[Pπ]ij = p_ij^(1),
which means that [Pπ ]ij is the probability of transitioning from si to sj using a
single step. Second, consider Pπ2 . It can be verified that
[Pπ²]ij = [Pπ Pπ]ij = Σ_{q=1}^n [Pπ]iq [Pπ]qj.
Since [Pπ]iq[Pπ]qj is the joint probability of transitioning from si to sq and then from sq to sj, we know that [Pπ²]ij is the probability of transitioning from si to sj using exactly two steps. That is,
[Pπ²]ij = p_ij^(2).
More generally, it can be shown that
[Pπ^k]ij = p_ij^(k),
which means that [Pπ^k]ij is the probability of transitioning from si to sj using exactly k steps.
Definition of stationary distributions.
Let d0 ∈ Rn be a vector representing the probability distribution of the states at
the initial time step. For example, if s is always selected as the starting state,
then d0 (s) = 1 and the other entries of d0 are 0. Let dk ∈ Rn be the vector
representing the probability distribution obtained after exactly k steps starting
from d0 . Then, we have
dk(si) = Σ_{j=1}^n d0(sj)[Pπ^k]ji,   i = 1, 2, . . . .   (8.6)
This equation indicates that the probability of the agent visiting si at step k
equals the sum of the probabilities of the agent transitioning from {sj }nj=1 to si
using exactly k steps. The matrix-vector form of (8.6) is
dk^T = d0^T Pπ^k.   (8.7)
When we consider the long-term behavior of the Markov process, it holds under certain conditions that
lim_{k→∞} Pπ^k = 1n dπ^T,   (8.8)
where 1n = [1, . . . , 1]^T ∈ R^n and 1n dπ^T is a constant matrix with all its rows equal to dπ^T. The conditions under which (8.8) is valid will be discussed later. Substituting (8.8) into (8.7) yields
lim_{k→∞} dk^T = d0^T 1n dπ^T = dπ^T,
since d0^T 1n = 1.
Figure 8.5: Long-term behavior of an ε-greedy policy with ε = 0.5. The asterisks in the right
figure represent the theoretical values of the elements of dπ.
The eigenvalues of PπT can be calculated as {−0.0449, 0.3, 0.4449, 1}. The
unit-length (right) eigenvector of PπT corresponding to the eigenvalue 1 is
[0.0463, 0.1455, 0.1785, 0.9720]T . After scaling this vector so that the sum of
all its elements is equal to 1, we obtain the theoretical value of dπ as follows:
dπ = [0.0345, 0.1084, 0.1330, 0.7241]^T.
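The computation described above can be sketched numerically as follows; the 4 × 4 transition matrix used here is a made-up example, not the one from the grid world.

import numpy as np

P_pi = np.array([[0.1, 0.6, 0.2, 0.1],
                 [0.2, 0.1, 0.3, 0.4],
                 [0.1, 0.2, 0.1, 0.6],
                 [0.0, 0.1, 0.1, 0.8]])       # hypothetical transition matrix

eigvals, eigvecs = np.linalg.eig(P_pi.T)
k = np.argmin(np.abs(eigvals - 1.0))          # locate the eigenvalue equal to 1
d_pi = np.real(eigvecs[:, k])
d_pi = d_pi / d_pi.sum()                      # scale so the entries sum to one

print(d_pi)                                   # long-run visit frequencies
print(d_pi @ P_pi)                            # check: d_pi^T P_pi = d_pi^T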
wk+1 = wk − αk ∇w J(wk ),
where
where the coefficient 2 before αk can be merged into αk without loss of generality. The
algorithm in (8.11) requires calculating the expectation. In the spirit of stochastic gra-
dient descent, we can replace the true gradient with a stochastic gradient. Then, (8.11)
becomes
wt+1 = wt + αt(vπ(st) − v̂(st, wt))∇w v̂(st, wt),   (8.12)
Understanding the TD algorithm in (8.13) is important for studying the other algo-
rithms in this chapter. Notably, (8.13) can only learn the state values of a given policy.
It will be extended to algorithms that can learn action values in Sections 8.3.1 and 8.3.2.
v̂(s, w) = φT (s)w,
where φ(s) ∈ Rm is the feature vector of s. The lengths of φ(s) and w are equal to m,
which is usually much smaller than the number of states. In the linear case, the gradient
is
∇w v̂(s, w) = φ(s),
We next show that the tabular TD algorithm in (7.1) in Chapter 7 is a special case
of the TD-Linear algorithm in (8.14).
Consider the following special feature vector for any s ∈ S:
φ(s) = es ∈ Rn ,
where es is the vector with the entry corresponding to s equal to 1 and the other
where w(s) is the entry in w that corresponds to s. Substituting the above equation
into (8.14) yields
wt+1 = wt + αt(rt+1 + γwt(st+1) − wt(st))e_{st}.
The above equation merely updates the entry wt(st) due to the definition of e_{st}. Motivated by this, multiplying e_{st}^T on both sides of the equation yields
wt+1(st) = wt(st) + αt(rt+1 + γwt(st+1) − wt(st)),
which is exactly the tabular TD algorithm in (7.1).
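A minimal sketch of the TD-Linear update, and of how it reduces to the tabular TD algorithm when one-hot features are used, is given below; the episode samples are hypothetical.

import numpy as np

def td_linear_episode(episode, phi, w, alpha=0.005, gamma=0.9):
    """TD-Linear sketch: for each sample (s, r, s_next),
    w <- w + alpha * (r + gamma * phi(s')^T w - phi(s)^T w) * phi(s).

    `phi` maps a state to its feature vector.
    """
    for s, r, s_next in episode:
        td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
        w = w + alpha * td_error * phi(s)
    return w

# With one-hot features, phi(s) = e_s, this reduces to the tabular TD algorithm (7.1).
n_states = 4
one_hot = lambda s: np.eye(n_states)[s]
episode = [(0, -1.0, 1), (1, -1.0, 2), (2, 0.0, 3)]      # made-up samples
w = td_linear_episode(episode, one_hot, w=np.zeros(n_states))
print(w)      # each entry is the estimated state value of the corresponding state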
Figure 8.6: (a) The policy to be evaluated. (b) The true state values are represented as a table. (c) The
true state values are represented as a 3D surface.
The grid world example is shown in Figure 8.6. The given policy takes any action at a
state with a probability of 0.2. Our goal is to estimate the state values under this policy.
There are 25 state values in total. The true state values are shown in Figure 8.6(b). The
true state values are visualized as a three-dimensional surface in Figure 8.6(c).
We next show that we can use fewer than 25 parameters to approximate these state
values. The simulation setup is as follows. Five hundred episodes are generated by the
given policy. Each episode has 500 steps and starts from a randomly selected state-action
pair following a uniform distribution. In addition, in each simulation trial, the parameter
vector w is randomly initialized such that each element is drawn from a standard normal
distribution with a zero mean and a standard deviation of 1. We set rforbidden = rboundary =
−1, rtarget = 1, and γ = 0.9.
To implement the TD-Linear algorithm, we need to select the feature vector φ(s) first.
There are different ways to do that as shown below.
The first type of feature vector is based on polynomials. In the grid world example, a
state s corresponds to a 2D location. Let x and y denote the column and row indexes
of s, respectively. To avoid numerical issues, we normalize x and y so that their values
are within the interval of [−1, +1]. With a slight abuse of notation, the normalized
values are also represented by x and y. Then, the simplest feature vector is
" #
x
φ(s) = ∈ R2 .
y
w3
When w is given, v̂(s, w) corresponds to a plane that may not pass through the origin.
Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T , where the order of the elements
does not matter.
The estimation result when we use the feature vector in (8.15) is shown in Figure 8.7(a). It can be seen that the estimated state values form a 2D plane. Although
the estimation error converges as more episodes are used, the error cannot decrease
to zero due to the limited approximation ability of a 2D plane.
To enhance the approximation ability, we can increase the dimension of the feature
vector. To that end, consider
The estimation results when we use the feature vectors in (8.16) and (8.17) are shown
in Figures 8.7(b)-(c). As can be seen, the longer the feature vector is, the more
accurately the state values can be approximated. However, in all three cases, the
estimation error cannot converge to zero because these linear approximators still have
limited approximation abilities.
Figure 8.7: TD-Linear estimation results obtained with the polynomial features in (8.15), (8.16), and
(8.17).
In addition to polynomial feature vectors, many other types of features are available
such as Fourier basis and tile coding [3, Chapter 9]. First, the values of x and y of
(a) q = 1 and φ(s) ∈ R4 (b) q = 2 and φ(s) ∈ R9 (c) q = 3 and φ(s) ∈ R16
Figure 8.8: TD-Linear estimation results obtained with the Fourier features in (8.18).
each state are normalized to the interval of [0, 1]. The resulting feature vector is
φ(s) = [· · · , cos(π(c1x + c2y)), · · · ]^T ∈ R^{(q+1)²},   (8.18)
where π here denotes the mathematical constant 3.1415 . . . rather than a policy. Here,
c1 and c2 can be set as any integers in {0, 1, . . . , q}, where q is a user-specified integer.
As a result, there are (q + 1)2 possible values for the pair (c1 , c2 ) to take. Hence, the
dimension of φ(s) is (q + 1)2 . For example, in the case of q = 1, the feature vector is
φ(s) = [cos(π(0x + 0y)), cos(π(0x + 1y)), cos(π(1x + 0y)), cos(π(1x + 1y))]^T
     = [1, cos(πy), cos(πx), cos(π(x + y))]^T ∈ R⁴.
The estimation results obtained when we use the Fourier features with q = 1, 2, 3 are
shown in Figure 8.8. The dimensions of the feature vectors in the three cases are
4, 9, 16, respectively. As can be seen, the higher the dimension of the feature vector
is, the more accurately the state values can be approximated.
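The feature construction in (8.18) can be sketched as follows; the function name is illustrative.

import numpy as np
from itertools import product

def fourier_features(x, y, q):
    """Build the Fourier feature vector in (8.18) for a normalized position
    (x, y) in [0, 1]^2: one entry cos(pi*(c1*x + c2*y)) for each pair
    (c1, c2) in {0, ..., q}^2, giving (q + 1)^2 features in total.
    """
    return np.array([np.cos(np.pi * (c1 * x + c2 * y))
                     for c1, c2 in product(range(q + 1), repeat=2)])

# For q = 1 this reproduces [1, cos(pi*y), cos(pi*x), cos(pi*(x + y))]:
print(fourier_features(0.25, 0.5, q=1))
print(fourier_features(0.25, 0.5, q=3).shape)   # (16,) features when q = 3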
function, we introduced the stochastic algorithm in (8.12). Later, the true value function
in the algorithm, which was unknown, was replaced by an approximation, leading to the
TD algorithm in (8.13). Although this story is helpful for understanding the basic idea
of value function approximation, it is not mathematically rigorous. For example, the
algorithm in (8.13) actually does not minimize the objective function in (8.3).
We next present a theoretical analysis of the TD algorithm in (8.13) to reveal why
the algorithm works effectively and what mathematical problems it solves. Since gen-
eral nonlinear approximators are difficult to analyze, this part only considers the linear
case. Readers are advised to read selectively based on their interests since this part is
mathematically intensive.
Convergence analysis
To study the convergence property of (8.13), we first consider the following deterministic
algorithm:
wt+1 = wt + αt E[(rt+1 + γφ^T(st+1)wt − φ^T(st)wt)φ(st)],   (8.19)
where the expectation is calculated with respect to the random variables st , st+1 , rt+1 .
The distribution of st is assumed to be the stationary distribution dπ . The algorithm
in (8.19) is deterministic because the random variables st , st+1 , rt+1 all disappear after
calculating the expectation.
Why would we consider this deterministic algorithm? First, the convergence of this
deterministic algorithm is easier (though nontrivial) to analyze. Second and more im-
portantly, the convergence of this deterministic algorithm implies the convergence of the
stochastic TD algorithm in (8.13). That is because (8.13) can be viewed as a stochastic
gradient descent (SGD) implementation of (8.19). Therefore, we only need to study the
convergence property of the deterministic algorithm.
Although the expression of (8.19) may look complex at first glance, it can be greatly
simplified. To do that, define
Φ = [· · · , φ(s), · · · ]^T ∈ R^{n×m},   D = diag(· · · , dπ(s), · · · ) ∈ R^{n×n},   (8.20)
where Φ is the matrix containing all the feature vectors, and D is a diagonal matrix with
the stationary distribution in its diagonal entries. The two matrices will be frequently
used.
where
A ≐ Φ^T D(I − γPπ)Φ ∈ R^{m×m},
b ≐ Φ^T Drπ ∈ R^m.   (8.21)
Here, Pπ , rπ are the two terms in the Bellman equation vπ = rπ + γPπ vπ , and I is the
identity matrix with appropriate dimensions.
The proof is given in Box 8.3. With the expression in Lemma 8.1, the deterministic
algorithm in (8.19) can be rewritten as
wt+1 = wt + αt(b − Awt).   (8.22)
Is A invertible? The answer is yes. In fact, A is not only invertible but also positive
definite. That is, for any nonzero vector x with appropriate dimensions, xT Ax > 0.
The proof is given in Box 8.4.
What is the interpretation of w∗ = A−1 b? It is actually the optimal solution for min-
imizing the projected Bellman error. The details will be introduced in Section 8.2.5.
The tabular method is a special case. One interesting result is that, when the di-
mensionality of w equals n = |S| and φ(s) = [0, . . . , 1, . . . , 0]T , where the entry corre-
sponding to s is 1, we have
w∗ = A−1 b = vπ . (8.23)
This equation indicates that the parameter vector to be learned is actually the true
state value. This conclusion is consistent with the fact that the tabular TD algorithm
is a special case of the TD-Linear algorithm, as introduced in Box 8.2. The proof
of (8.23) is given below. It can be verified that Φ = I in this case and hence A =
ΦT D(I − γPπ )Φ = D(I − γPπ ) and b = ΦT Drπ = Drπ . Thus, w∗ = A−1 b =
(I − γPπ )−1 D−1 Drπ = (I − γPπ )−1 rπ = vπ .
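A small numerical sketch of w∗ = A⁻¹b is given below; the transition matrix, rewards, weighting distribution, and features are all made-up values chosen only to show how A and b are assembled from (8.21).

import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])            # hypothetical transition matrix
r_pi = np.array([-1.0, -1.0, 0.0])            # hypothetical immediate rewards
d_pi = np.array([0.2, 0.5, 0.3])              # hypothetical weights standing in for the stationary distribution
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])                  # feature matrix, phi(s) = [1, s]^T
D = np.diag(d_pi)

A = Phi.T @ D @ (np.eye(3) - gamma * P_pi) @ Phi
b = Phi.T @ D @ r_pi
w_star = np.linalg.solve(A, b)                # w* = A^{-1} b

v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # true values for comparison
print(w_star)
print(Phi @ w_star, v_pi)                     # approximation vs. true state values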
Proof 1: Define the convergence error as δt ≐ wt − w∗. We only need to show that δt
converges to zero. To do that, substituting wt = δt + w∗ into (8.22) (with a constant step size αt = α) gives δt+1 = (I − αA)δt, and hence
‖δt+1‖₂ ≤ ‖I − αA‖₂^{t+1} ‖δ0‖₂.
When α > 0 is sufficiently small, we have ‖I − αA‖₂ < 1 and hence δt → 0 as
t → ∞. The reason why ‖I − αA‖₂ < 1 holds for sufficiently small α is that A is
positive definite, so x^T(I − αA)x < x^T x for any nonzero x.
Proof 2: Consider g(w) ≐ b − Aw. Since w∗ is the root of g(w) = 0, the task is actually
a root-finding problem. The algorithm in (8.22) is actually a Robbins-Monro (RM)
algorithm. Although the original RM algorithm was designed for stochastic processes,
it can also be applied to deterministic cases. The convergence of RM algorithms can
shed light on the convergence of wt+1 = wt + αt(b − Awt). That is, wt converges to
w∗ when Σt αt = ∞ and Σt αt² < ∞.
as
Σ_{s∈S} dπ(s)E[rt+1 φ(st) | st = s] = Σ_{s∈S} dπ(s)φ(s)rπ(s) = Φ^T Drπ,   (8.25)
where rπ = [· · · , rπ(s), · · · ]^T ∈ R^n.
Second, consider the second term in (8.24). Since
E[φ(st)(γφ^T(st+1) − φ^T(st))wt | st = s]
  = −E[φ(st)φ^T(st)wt | st = s] + E[γφ(st)φ^T(st+1)wt | st = s]
  = −φ(s)φ^T(s)wt + γφ(s)E[φ^T(st+1) | st = s]wt
  = −φ(s)φ^T(s)wt + γφ(s)Σ_{s′∈S} p(s′|s)φ^T(s′)wt,
= b − Awt,   (8.27)
where b ≐ Φ^T Drπ and A ≐ Φ^T D(I − γPπ)Φ.
Box 8.4: Proving that A = ΦT D(I −γPπ )Φ is invertible and positive definite.
The matrix A is positive definite if x^T Ax > 0 for any nonzero vector x with ap-
propriate dimensions. If A is positive (or negative) definite, it is denoted as A ≻ 0
(or A ≺ 0). Here, ≻ and ≺ should be differentiated from > and <, which indicate
elementwise comparisons. Note that A may not be symmetric. Although positive
definite matrices often refer to symmetric matrices, nonsymmetric ones can also be
positive definite.
We next prove that A ≻ 0 and hence A is invertible. The idea for proving A ≻ 0
is to show that
M ≐ D(I − γPπ) ≻ 0.   (8.28)
Since all the entries of dπ are positive (see Box 8.1), we have (M + M T )1n > 0.
Second, the elementwise form of (8.29) is
Σ_{j=1}^n [M + M^T]ij > 0,   i = 1, . . . , n,
It can be verified according to (8.28) that the diagonal entries of M are positive and
the off-diagonal entries of M are nonpositive. Therefore, the above inequality can be
rewritten as
[M + M^T]ii > −Σ_{j≠i} [M + M^T]ij = Σ_{j≠i} |[M + M^T]ij|.
The above inequality indicates that the absolute value of the ith diagonal entry in
M + M^T is greater than the sum of the absolute values of the off-diagonal entries
in the same row. Thus, M + M^T is strictly diagonally dominant and the proof is
complete.
While we have shown that the TD-Linear algorithm converges to w∗ = A−1 b, we next
show that w∗ is the optimal solution that minimizes the projected Bellman error. To do
that, we review three objective functions.
which has been introduced in (8.3). By the definition of expectation, JE(w) can be
reexpressed in a matrix-vector form as
JE(w) = ‖v̂(w) − vπ‖²_D,
where vπ is the true state value vector and v̂(w) is the approximated one. Here, ‖ · ‖²_D
is a weighted norm: ‖x‖²_D = x^T Dx = ‖D^{1/2}x‖²₂, where D is given in (8.20).
This is the simplest objective function that we can imagine when talking about func-
tion approximation. However, it relies on the true state values, which are unknown. To obtain
an implementable algorithm, we must consider other objective functions such as the
Bellman error and projected Bellman error [50–54].
The second objective function is the Bellman error. In particular, since vπ satisfies
the Bellman equation vπ = rπ + γPπ vπ , it is expected that the estimated value v̂(w)
should also satisfy this equation to the greatest extent possible. Thus, the Bellman
error is
JBE(w) ≐ ‖v̂(w) − (rπ + γPπ v̂(w))‖²_D = ‖v̂(w) − Tπ(v̂(w))‖²_D.   (8.30)
Here, Tπ (·) is the Bellman operator. In particular, for any vector x ∈ Rn , the Bellman
operator is defined as
Tπ(x) ≐ rπ + γPπ x.
Minimizing the Bellman error is a standard least-squares problem. The details of the
solution are omitted here.
Third, it is notable that JBE (w) in (8.30) may not be minimized to zero due to the
limited approximation ability of the approximator. By contrast, an objective function
that can be minimized to zero is the projected Bellman error:
JPBE(w) ≐ ‖v̂(w) − M Tπ(v̂(w))‖²_D,
where M ∈ Rn×n is the orthogonal projection matrix that geometrically projects any
vector onto the space of all approximations.
In fact, the TD learning algorithm in (8.13) aims to minimize the projected Bellman
error JP BE rather than JE or JBE . The reason is as follows. For the sake of simplicity,
consider the linear case where v̂(w) = Φw. Here, Φ is defined in (8.20). The range space
of Φ is the set of all possible linear approximations. Then,
M = Φ(Φ^T DΦ)⁻¹Φ^T D
is the projection matrix that geometrically projects any vector onto the range space of Φ.
Since v̂(w) is in the range space of Φ, we can always find a value of w that can minimize
JP BE (w) to zero. It can be proven that the solution minimizing JP BE (w) is w∗ = A−1 b.
That is
We next show that w∗ = A−1 b is the optimal solution that minimizes JP BE (w). Since
JP BE (w) = 0 ⇔ v̂(w) − M Tπ (v̂(w)) = 0, we only need to study the root of
v̂(w) = M Tπ (v̂(w)).
In the linear case, substituting v̂(w) = Φw and the expression of M given above into
the above equation gives
where A, b are given in (8.21). Therefore, w∗ = A−1 b is the optimal solution that
minimizes JP BE (w).
‖Φw∗ − vπ‖_D ≤ (1/(1 − γ)) min_w ‖v̂(w) − vπ‖_D = (1/(1 − γ)) min_w √JE(w).   (8.32)
The proof of this inequality is given in Box 8.6. Inequality (8.32) indicates that the
discrepancy between Φw∗ and vπ is bounded from above by the minimum value of JE (w).
However, this bound is loose, especially when γ is close to one. It is thus mainly of
theoretical value.
Note that
‖Φw∗ − vπ‖_D = ‖Φw∗ − Mvπ + Mvπ − vπ‖_D
             ≤ ‖Φw∗ − Mvπ‖_D + ‖Mvπ − vπ‖_D
             = ‖M Tπ(Φw∗) − M Tπ(vπ)‖_D + ‖Mvπ − vπ‖_D,   (8.33)
where the last equality is due to Φw∗ = M Tπ (Φw∗ ) and vπ = Tπ (vπ ). Substituting
The proofs of ‖M‖_D = 1 and ‖Pπ x‖_D ≤ ‖x‖_D are postponed to the end of the box.
Rearranging the above inequality gives
‖Φw∗ − vπ‖_D ≤ (1/(1 − γ)) ‖Mvπ − vπ‖_D
             = (1/(1 − γ)) min_w ‖v̂(w) − vπ‖_D,
where the last equality is because ‖Mvπ − vπ‖_D is the error between vπ and its
orthogonal projection into the space of all possible approximations. Therefore, it is
the minimum value of the error between vπ and any v̂(w).
We next prove some useful facts, which have already been used in the above proof.
Properties of matrix weighted norms. By definition, ‖x‖_D = √(x^T Dx) = ‖D^{1/2}x‖₂.
The induced matrix norm is ‖A‖_D = max_{x≠0} ‖Ax‖_D/‖x‖_D = ‖D^{1/2}AD^{−1/2}‖₂. For
matrices A, B with appropriate dimensions, we have ‖ABx‖_D ≤ ‖A‖_D‖B‖_D‖x‖_D.
To see that, ‖ABx‖_D = ‖D^{1/2}ABx‖₂ = ‖D^{1/2}AD^{−1/2}D^{1/2}BD^{−1/2}D^{1/2}x‖₂ ≤
‖D^{1/2}AD^{−1/2}‖₂‖D^{1/2}BD^{−1/2}‖₂‖D^{1/2}x‖₂ = ‖A‖_D‖B‖_D‖x‖_D.
Proof of ‖M‖_D = 1. This is valid because ‖M‖_D = ‖Φ(Φ^T DΦ)⁻¹Φ^T D‖_D =
‖D^{1/2}Φ(Φ^T DΦ)⁻¹Φ^T DD^{−1/2}‖₂ = 1, where the last equality is valid due to the
fact that the matrix in the L2-norm is an orthogonal projection matrix and the
L2-norm of any orthogonal projection matrix is equal to one.
Proof of ‖Pπ x‖_D ≤ ‖x‖_D for any x ∈ R^n. First,
‖Pπ x‖²_D = x^T Pπ^T DPπ x = Σ_{i,j} xi [Pπ^T DPπ]ij xj = Σ_{i,j} xi (Σ_k [Pπ^T]ik [D]kk [Pπ]kj) xj.
= ‖x‖²_D.
Least-squares TD
We next introduce an algorithm called least-squares TD (LSTD) [57]. Like the TD-Linear
algorithm, LSTD aims to minimize the projected Bellman error. However, it has some
advantages over the TD-Linear algorithm.
Recall that the optimal parameter for minimizing the projected Bellman error is
w∗ = A⁻¹b, where A = Φ^T D(I − γPπ)Φ and b = Φ^T Drπ. In fact, it follows from (8.27)
that A and b can be expressed as expectations under the stationary distribution:
A = E[φ(st)(φ(st) − γφ(st+1))^T],   b = E[rt+1 φ(st)].
The above two equations show that A and b are expectations of st , st+1 , rt+1 . The idea of
LSTD is simple: if we can use random samples to directly obtain the estimates of A and
b, which are denoted as  and b̂, then the optimal parameter can be directly estimated
as w∗ ≈ Â−1 b̂.
In particular, suppose that (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) is a trajectory obtained
by following a given policy π. Let Ât and b̂t be the estimates of A and b at time t,
respectively. They are calculated as the averages of the samples:
Ât = Σ_{k=0}^{t−1} φ(sk)(φ(sk) − γφ(sk+1))^T,
b̂t = Σ_{k=0}^{t−1} rk+1 φ(sk).   (8.34)
The reader may wonder if a coefficient of 1/t is missing on the right-hand side of (8.34).
In fact, it is omitted for the sake of simplicity since the value of wt remains the same
when it is omitted. Since Ât may not be invertible especially when t is small, Ât is usually
biased by a small constant matrix σI, where I is the identity matrix and σ is a small
positive number.
The advantage of LSTD is that it uses experience samples more efficiently and con-
verges faster than the TD method. That is because this algorithm is specifically designed
based on the knowledge of the optimal solution’s expression. The better we understand
a problem, the better algorithms we can design.
The disadvantages of LSTD are as follows. First, it can only estimate state values.
By contrast, the TD algorithm can be extended to estimate action values as shown in the
next section. Moreover, while the TD algorithm allows nonlinear approximators, LSTD
does not. That is because this algorithm is specifically designed based on the expression
of w∗ . Second, the computational cost of LSTD is higher than that of TD since LSTD
updates an m × m matrix in each update step, whereas TD updates an m-dimensional
vector. More importantly, in every step, LSTD needs to compute the inverse of Ât , whose
computational complexity is O(m3 ). The common method for resolving this problem is
to directly update the inverse of Ât rather than updating Ât . In particular, Ât+1 can be
calculated recursively as follows:
Ât+1 = Σ_{k=0}^{t} φ(sk)(φ(sk) − γφ(sk+1))^T
     = Σ_{k=0}^{t−1} φ(sk)(φ(sk) − γφ(sk+1))^T + φ(st)(φ(st) − γφ(st+1))^T
     = Ât + φ(st)(φ(st) − γφ(st+1))^T.
The above expression decomposes Ât+1 into the sum of two matrices. Its inverse can be
calculated as [58]
Ât+1⁻¹ = (Ât + φ(st)(φ(st) − γφ(st+1))^T)⁻¹
       = Ât⁻¹ − (Ât⁻¹ φ(st)(φ(st) − γφ(st+1))^T Ât⁻¹) / (1 + (φ(st) − γφ(st+1))^T Ât⁻¹ φ(st)).
Therefore, we can directly store and update Ât⁻¹ to avoid the need to calculate the matrix
inverse. This recursive algorithm does not require a step size. However, it requires setting
the initial value of Â0⁻¹. The initial value of such a recursive algorithm can be selected as
Â0⁻¹ = σI, where σ is a positive number. A good tutorial on the recursive least-squares
approach can be found in [59].
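A minimal LSTD sketch following (8.34) is given below; the trajectory and the regularization constant σ are illustrative.

import numpy as np

def lstd(trajectory, phi, gamma=0.9, sigma=1e-3):
    """Accumulate A_hat and b_hat as in (8.34) from a trajectory
    [(s0, r1, s1), (s1, r2, s2), ...] and return w = A_hat^{-1} b_hat.

    `phi` maps a state to its feature vector; sigma*I keeps A_hat invertible
    when the trajectory is short.
    """
    m = len(phi(trajectory[0][0]))
    A_hat = sigma * np.eye(m)
    b_hat = np.zeros(m)
    for s, r, s_next in trajectory:
        A_hat += np.outer(phi(s), phi(s) - gamma * phi(s_next))
        b_hat += r * phi(s)
    return np.linalg.solve(A_hat, b_hat)

# Usage with one-hot features over 3 states and a short made-up trajectory:
phi = lambda s: np.eye(3)[s]
trajectory = [(0, -1.0, 1), (1, -1.0, 2), (2, 0.0, 0), (0, -1.0, 1)]
print(lstd(trajectory, phi))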
The analysis of (8.35) is similar to that of (8.13) and is omitted here. When linear
functions are used, we have
Figure 8.9: Sarsa with linear function approximation. Here, γ = 0.9, ε = 0.1, rboundary = rforbidden =
−10, rtarget = 1, and α = 0.001.
Algorithm 8.2: Sarsa with function approximation
Initialization: Initial parameter w0. Initial policy π0. αt = α > 0 for all t. ε ∈ (0, 1).
Goal: Learn an optimal policy that can lead the agent to the target state from an initial state s0.
For each episode, do
    Generate a0 at s0 following π0(s0)
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (rt+1, st+1, at+1) given (st, at): generate rt+1, st+1 by interacting with the environment; generate at+1 following πt(st+1).
        Update q-value:
            wt+1 = wt + αt[rt+1 + γq̂(st+1, at+1, wt) − q̂(st, at, wt)]∇w q̂(st, at, wt)
        Update policy:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1)   if a = arg maxa∈A(st) q̂(st, a, wt+1)
            πt+1(a|st) = ε/|A(st)|   otherwise
        st ← st+1, at ← at+1
The above update rule is similar to (8.35) except that q̂(st+1 , at+1 , wt ) in (8.35) is replaced
with maxa∈A(st+1 ) q̂(st+1 , a, wt ).
Similar to the tabular case, (8.36) can be implemented in either an on-policy or
off-policy fashion. An on-policy version is given in Algorithm 8.3. An example for
demonstrating the on-policy version is shown in Figure 8.10. In this example, the task is
to find a good policy that can lead the agent to the target state from the top-left state.
Algorithm 8.3: Q-learning with function approximation (on-policy version)
Initialization: Initial parameter w0. Initial policy π0. αt = α > 0 for all t. ε ∈ (0, 1).
Goal: Learn an optimal path that can lead the agent to the target state from an initial state s0.
For each episode, do
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (at, rt+1, st+1) given st: generate at following πt(st); generate rt+1, st+1 by interacting with the environment.
        Update q-value:
            wt+1 = wt + αt[rt+1 + γ maxa∈A(st+1) q̂(st+1, a, wt) − q̂(st, at, wt)]∇w q̂(st, at, wt)
        Update policy:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1)   if a = arg maxa∈A(st) q̂(st, a, wt+1)
            πt+1(a|st) = ε/|A(st)|   otherwise
As can be seen, Q-learning with linear function approximation can successfully learn an
optimal policy. Here, linear Fourier basis functions of order five are used. The off-policy
version will be demonstrated when we introduce deep Q-learning in Section 8.4.
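One parameter update of Q-learning with linear function approximation can be sketched as follows; the feature construction and sample values are hypothetical and much simpler than the Fourier features used in the experiment.

import numpy as np

def q_learning_fa_update(w, phi_sa, s, a, r, s_next, actions, alpha=0.001, gamma=0.9):
    """One parameter update in the spirit of Algorithm 8.3. `phi_sa(s, a)` returns
    the feature vector of a state-action pair; with q_hat(s, a, w) = phi_sa(s, a)^T w,
    the gradient is simply phi_sa(s, a). All names here are illustrative.
    """
    q_next = max(phi_sa(s_next, a2) @ w for a2 in actions)   # max_a q_hat(s', a, w)
    td_error = r + gamma * q_next - phi_sa(s, a) @ w
    return w + alpha * td_error * phi_sa(s, a)

# Usage with simple one-hot state-action features over 3 states and 2 actions:
n_states, n_actions = 3, 2
phi_sa = lambda s, a: np.eye(n_states * n_actions)[s * n_actions + a]
w = np.zeros(n_states * n_actions)
w = q_learning_fa_update(w, phi_sa, s=0, a=1, r=-1.0, s_next=1, actions=range(n_actions))
print(w)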
Figure 8.10: Q-learning with linear function approximation. Here, γ = 0.9, ε = 0.1, rboundary =
rforbidden = −10, rtarget = 1, and α = 0.001.
One may notice in Algorithm 8.2 and Algorithm 8.3 that, although the values are
represented as functions, the policy π(a|s) is still represented as a table. Thus, it still
assumes finite numbers of states and actions. In Chapter 9, we will see that the policies
can be represented as functions so that continuous state and action spaces can be handled.
earliest and most successful deep reinforcement learning algorithms. Notably, the neural
networks do not have to be deep. For simple tasks such as our grid world examples,
shallow networks with one or two hidden layers may be sufficient.
Deep Q-learning can be viewed as an extension of the algorithm in (8.36). However,
its mathematical formulation and implementation techniques are substantially different
and deserve special attention.
where (S, A, R, S 0 ) are random variables that denote a state, an action, the immediate
reward, and the next state, respectively. This objective function can be viewed as the
squared Bellman optimality error. That is because
q(s, a) = E[Rt+1 + γ maxa∈A(St+1) q(St+1, a) | St = s, At = a],   for all s, a,
is the Bellman optimality equation (the proof is given in Box 7.5). Therefore, R +
γ maxa∈A(S′) q̂(S′, a, w) − q̂(S, A, w) should equal zero in the expectation sense when
q̂(S, A, w) can accurately approximate the optimal action values.
To minimize the objective function in (8.37), we can use the gradient descent algorithm.
To that end, we need to calculate the gradient of J with respect to w. It is noted that
the parameter w appears not only in q̂(S, A, w) but also in y ≐ R + γ maxa∈A(S′) q̂(S′, a, w).
As a result, it is nontrivial to calculate the gradient. For the sake of simplicity, it is as-
sumed that the value of w in y is fixed (for a short period of time) so that the calculation
of the gradient becomes much easier. In particular, we introduce two networks: one is a
main network representing q̂(s, a, w) and the other is a target network q̂(s, a, wT ). The
objective function in this case becomes
" 2 #
J =E R + γ max0 q̂(S 0 , a, wT ) − q̂(S, A, w) ,
a∈A(S )
The first technique is to use two networks, a main network and a target network,
as mentioned when we calculate the gradient in (8.38). The implementation details
are explained below. Let w and wT denote the parameters of the main and target
networks, respectively. They are initially set to the same value.
In every iteration, we draw a mini-batch of samples {(s, a, r, s0 )} from the replay buffer
(the replay buffer will be explained soon). The inputs of the main network are s and
a. The output y = q̂(s, a, w) is the estimated q-value. The target value of the output is yT ≐ r + γ max_{a∈A(s')} q̂(s', a, wT). The main network is updated to minimize the TD error (also called the loss function) Σ(y − yT)² over the samples {(s, a, yT)}.
Updating w in the main network does not explicitly use the gradient in (8.38). Instead,
it relies on the existing software tools for training neural networks. As a result, we
need a mini-batch of samples to train a network instead of using a single sample to
update the main network based on (8.38). This is one notable difference between deep
and nondeep reinforcement learning algorithms.
The main network is updated in every iteration. By contrast, the target network is
set to be the same as the main network every certain number of iterations to satisfy
the assumption that wT is fixed when calculating the gradient in (8.38).
The second technique is experience replay [22, 60, 62]. That is, after we have collected
some experience samples, we do not use these samples in the order they were collected.
Instead, we store them in a dataset called the replay buffer. In particular, let (s, a, r, s') be an experience sample and B ≐ {(s, a, r, s')} be the replay buffer. Every time we update the main network, we draw a mini-batch of experience samples from the replay buffer. The drawing of samples, which is called experience replay, should follow a uniform distribution.
Why is experience replay necessary in deep Q-learning, and why must the replay
follow a uniform distribution? The answer lies in the objective function in (8.37).
In particular, to well define the objective function, we must specify the probability
distributions for S, A, R, S 0 . The distributions of R and S 0 are determined by the
system model once (S, A) is given. The simplest way to describe the distribution of
the state-action pair (S, A) is to assume it to be uniformly distributed.
However, the state-action samples may not be uniformly distributed in practice since they are generated as a sample sequence according to the behavior policy. It is necessary to break the correlation between the samples in the sequence to satisfy the assumption of uniform distribution. To do this, we can use the experience replay technique by uniformly drawing samples from the replay buffer. This is the mathematical reason why experience replay is necessary and why experience replay must follow a uniform distribution. A benefit of random sampling is that each experience sample may be used multiple times, which can increase the data efficiency. This is especially important when we have a limited amount of data.

The off-policy deep Q-learning procedure is summarized below.

Initialization: A main network and a target network with the same initial parameter.
Goal: Learn an optimal target network to approximate the optimal action values from the experience samples generated by a given behavior policy πb.
Store the experience samples generated by πb in a replay buffer B = {(s, a, r, s')}
For each iteration, do
    Uniformly draw a mini-batch of samples from B
    For each sample (s, a, r, s'), calculate the target value yT = r + γ max_{a∈A(s')} q̂(s', a, wT), where wT is the parameter of the target network
    Update the main network to minimize (yT − q̂(s, a, w))² using the mini-batch of samples
    Set wT = w every C iterations
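To make the structure of this procedure concrete, the following is a minimal sketch, not the implementation used for the book's examples. It assumes a linear approximator q̂(s, a, w) = φ(s, a)ᵀw with one-hot features, and it assumes the reader supplies a replay buffer of (s, a, r, s') tuples collected by some behavior policy; the names n_states, n_actions, batch_size, and target_update_period are illustrative placeholders.

```python
import numpy as np

def feature(s, a, n_states, n_actions):
    """One-hot feature vector phi(s, a)."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def q_hat(s, a, w, n_states, n_actions):
    """Linear approximator q_hat(s, a, w) = phi(s, a)^T w."""
    return feature(s, a, n_states, n_actions) @ w

def dqn_sketch(replay_buffer, n_states, n_actions,
               gamma=0.9, alpha=0.001, n_iterations=1000,
               batch_size=100, target_update_period=10):
    """replay_buffer: list of (s, a, r, s_next) tuples collected by a
    behavior policy. Returns the learned parameter of the main network."""
    w = np.zeros(n_states * n_actions)   # main network parameter
    w_target = w.copy()                  # target network parameter
    rng = np.random.default_rng(0)

    for it in range(n_iterations):
        # Uniformly draw a mini-batch from the replay buffer.
        batch = rng.choice(len(replay_buffer), size=batch_size)
        for idx in batch:
            s, a, r, s_next = replay_buffer[idx]
            # Target value computed with the (temporarily fixed) target network.
            y_T = r + gamma * max(q_hat(s_next, a2, w_target, n_states, n_actions)
                                  for a2 in range(n_actions))
            # One gradient step on (y_T - q_hat(s, a, w))^2.
            phi = feature(s, a, n_states, n_actions)
            w += alpha * (y_T - phi @ w) * phi
        # Periodically synchronize the target network with the main network.
        if (it + 1) % target_update_period == 0:
            w_target = w.copy()
    return w
```

This sketch uses plain gradient descent on the mini-batch; a neural-network implementation would follow the same structure but delegate the minimization step to a training library.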
(a) The behavior policy. (b) An episode with 1,000 steps. (c) The final learned policy. (d) The loss function converges to zero. (e) The value error converges to zero.
Figure 8.11: Optimal policy learning via deep Q-learning. Here, γ = 0.9, rboundary = rforbidden = −10, and rtarget = 1. The batch size is 100.
The q-network used in this example takes a state-action pair as its input and outputs the estimated q-value of that pair. The network can also be designed in other ways. For example, it can have two inputs and five outputs, where the two inputs are the normalized row and column of a state and the outputs are the five estimated action values for the input state [22].
As shown in Figure 8.11(d), the loss function, defined as the average squared TD
error of each mini-batch, converges to zero, meaning that the network can fit the training
samples well. As shown in Figure 8.11(e), the state value estimation error also converges
to zero, indicating that the estimates of the optimal action values become sufficiently
accurate. Then, the corresponding greedy policy is optimal.
This example demonstrates the high efficiency of deep Q-learning. In particular, a
short episode of 1,000 steps is sufficient for obtaining an optimal policy here. By contrast,
an episode with 100,000 steps is required by tabular Q-learning, as shown in Figure 7.4.
One reason for the high efficiency is that the function approximation method has a strong
generalization ability. Another reason is that the experience samples can be repeatedly
used.
We next deliberately challenge the deep Q-learning algorithm by considering a scenario
with fewer experience samples. Figure 8.12 shows an example of an episode with merely
100 steps. In this example, although the network can still be well-trained in the sense
that the loss function converges to zero, the state value estimation error cannot converge to zero. That means the network can properly fit the given experience samples, but the experience samples are too few to accurately estimate the optimal action values.

(a) The behavior policy. (b) An episode with 100 steps. (c) The final learned policy. (d) The loss function converges to zero. (e) The value error does not converge to zero.
Figure 8.12: Optimal policy learning via deep Q-learning. Here, γ = 0.9, rboundary = rforbidden = −10, and rtarget = 1. The batch size is 50.
8.5 Summary
This chapter continued introducing TD learning algorithms but switched from the tabular method to the function approximation method. The key to understanding
the function approximation method is to know that it is an optimization problem. The
simplest objective function is the squared error between the true state values and the
estimated values. There are also other objective functions such as the Bellman error
and the projected Bellman error. We have shown that the TD-Linear algorithm actually
minimizes the projected Bellman error. Several optimization algorithms such as Sarsa
and Q-learning with value approximation have been introduced.
One reason why the value function approximation method is important is that it allows
artificial neural networks to be integrated with reinforcement learning. For example,
deep Q-learning is one of the most successful deep reinforcement learning algorithms.
Although neural networks have been widely used as nonlinear function approximators,
this chapter provides a comprehensive introduction to the linear function case. Fully
understanding the linear case is important for better understanding the nonlinear case.
Interested readers may refer to [63] for a thorough analysis of TD learning algorithms
with function approximation. A more theoretical discussion on deep Q-learning can be
found in [61].
An important concept named stationary distribution is introduced in this chapter.
The stationary distribution plays an important role in defining an appropriate objective
function in the value function approximation method. It also plays a key role in Chapter 9
when we use functions to approximate policies. An excellent introduction to this topic
can be found in [49, Chapter IV]. The contents of this chapter heavily rely on matrix
analysis. Some results are used without explanation. Excellent references regarding
matrix analysis and linear algebra can be found in [4, 48].
8.6 Q&A
Q: What is the difference between the tabular and function approximation methods?
A: One important difference is how a value is updated and retrieved.
How to retrieve a value: When the values are represented by a table, if we would like
to retrieve a value, we can directly read the corresponding entry in the table. However,
when the values are represented by a function, we need to input the state index s into
the function and calculate the function value. If the function is an artificial neural
network, a forward propagation process from the input to the output is needed.
How to update a value: When the values are represented by a table, if we would like
to update one value, we can directly rewrite the corresponding entry in the table.
However, when the values are represented by a function, we must update the function
parameter to change the values indirectly.
Q: What are the advantages of the function approximation method over the tabular
method?
A: Due to the way state values are retrieved, the function approximation method is
more efficient in storage. In particular, while the tabular method needs to store |S|
values, the function approximation method only needs to store a parameter vector
whose dimension is usually much less than |S|.
Due to the way in which state values are updated, the function approximation method
has another merit: its generalization ability is stronger than that of the tabular
method. The reason is as follows. With the tabular method, updating one state
value would not change the other state values. However, with the function approx-
imation method, updating the function parameter affects the values of many states.
Therefore, the experience sample for one state can generalize to help estimate the
values of other states.
Q: Can we unify the tabular and the function approximation methods?
A: Yes. The tabular method can be viewed as a special case of the function approxi-
mation method. The related details can be found in Box 8.2.
Q: What is the stationary distribution and why is it important?
A: The stationary distribution describes the long-term behavior of a Markov decision
process. More specifically, after the agent executes a given policy for a sufficiently long
period, the probability of the agent visiting a state can be described by this stationary
distribution. More information can be found in Box 8.1.
The reason why this concept emerges in this chapter is that it is necessary for defining
a valid objective function. In particular, the objective function involves the probability
distribution of the states, which is usually selected as the stationary distribution. The
stationary distribution is important not only for the value approximation method but
also for the policy gradient method, which will be introduced in Chapter 9.
Q: What are the advantages and disadvantages of the linear function approximation
method?
A: Linear function approximation is the simplest case whose theoretical properties
can be thoroughly analyzed. However, the approximation ability of this method is
limited. It is also nontrivial to select appropriate feature vectors for complex tasks.
By contrast, artificial neural networks can be used to approximate values as black-box
universal nonlinear approximators, which are more friendly to use. Nevertheless, it
is still meaningful to study the linear case to better grasp the idea of the function
approximation method. Moreover, the linear case is powerful in the sense that the
tabular method can be viewed as a special linear case (Box 8.2).
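As an aside, the following tiny sketch (not from the book; the sizes are made up) illustrates the special-case relationship referenced above: with one-hot features, the linear approximator q̂(s, a, w) = φ(s, a)ᵀw reduces to reading one entry of w, exactly like a table lookup.

```python
import numpy as np

n_states, n_actions = 4, 3
w = np.arange(n_states * n_actions, dtype=float)  # plays the role of the q-table entries

def phi(s, a):
    """One-hot feature vector of the state-action pair (s, a)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q_hat(s, a, w):
    return phi(s, a) @ w               # equals w[s * n_actions + a]

q_table = w.reshape(n_states, n_actions)
assert q_hat(2, 1, w) == q_table[2, 1]  # function retrieval == table lookup
```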
Q: Why does deep Q-learning require experience replay?
A: The reason lies in the objective function in (8.37). In particular, to well define
the objective function, we must specify the probability distributions of S, A, R, S 0 .
The distributions of R and S 0 are determined by the system model once (S, A) is
given. The simplest way to describe the distribution of the state-action pair (S, A)
is to assume it to be uniformly distributed. However, the state-action samples may
not be uniformly distributed in practice since they are generated as a sequence by the
behavior policy. It is necessary to break the correlation between the samples in the
sequence to satisfy the assumption of uniform distribution. To do this, we can use
the experience replay technique by uniformly drawing samples from the replay buffer.
A benefit of experience replay is that each experience sample may be used multiple
times, which can increase the data efficiency.
Chapter 9
Policy Gradient Methods

(Figure: the location of this chapter in the map of the book.)
The idea of function approximation can be applied not only to represent state/action
values, as introduced in Chapter 8, but also to represent policies, as introduced in this
chapter. So far in this book, policies have been represented by tables: the action prob-
abilities of all states are stored in a table (e.g., Table 9.1). In this chapter, we show
that policies can be represented by parameterized functions denoted as π(a|s, θ), where
θ ∈ Rm is a parameter vector. It can also be written in other forms such as πθ (a|s),
πθ (a, s), or π(a, s, θ).
When policies are represented as functions, optimal policies can be obtained by optimizing certain scalar metrics. Such a method is called policy gradient. The policy gradient method is a big step forward in this book because it is policy-based. By contrast, all the previous chapters in this book discuss value-based methods. The advantages of the policy gradient method are numerous. For example, it is more efficient for handling large state/action spaces. It has stronger generalization abilities and hence is more efficient in terms of sample usage.

        a1         a2         a3         a4         a5
s1   π(a1|s1)   π(a2|s1)   π(a3|s1)   π(a4|s1)   π(a5|s1)
...     ...        ...        ...        ...        ...
s9   π(a1|s9)   π(a2|s9)   π(a3|s9)   π(a4|s9)   π(a5|s9)

Table 9.1: A tabular representation of a policy. There are nine states and five actions for each state.

Figure 9.2: Function representations of policies. The functions may have different structures: in (a), the function takes (s, a) as its input and outputs π(a|s, θ); in (b), the function takes s as its input and outputs π(a|s, θ) for every a ∈ A.

The function representation differs from the tabular representation in several aspects.
First, how to define optimal policies? When represented as a table, a policy is defined
as optimal if it can maximize every state value. When represented by a function, a
policy is defined as optimal if it can maximize certain scalar metrics.
Second, how to update a policy? When represented by a table, a policy can be updated
by directly changing the entries in the table. When represented by a parameterized
function, a policy can no longer be updated in this way. Instead, it can only be
updated by changing the parameter θ.
Third, how to retrieve the probability of an action? In the tabular case, the probability
of an action can be directly obtained by looking up the corresponding entry in the
table. In the case of function representation, we need to input (s, a) into the function
to calculate its probability (see Figure 9.2(a)). Depending on the structure of the
function, we can also input a state and then output the probabilities of all actions
(see Figure 9.2(b)).
The basic idea of the policy gradient method is summarized below. Suppose that J(θ)
is a scalar metric. Optimal policies can be obtained by optimizing this metric via the
gradient-based algorithm:
where ∇θ J is the gradient of J with respect to θ, t is the time step, and α is the
optimization rate.
With this basic idea, we will answer the following three questions in the remainder of this chapter: what metrics should be used to define optimal policies, how to calculate the gradients of the metrics, and how to use the gradients together with samples to optimize the metrics.

9.2 Metrics for defining optimal policies
The first metric is the average state value or simply called the average value. It is defined as

v̄π = Σ_{s∈S} d(s)vπ(s),

where d(s) is the weight of state s. It satisfies d(s) ≥ 0 for any s ∈ S and Σ_{s∈S} d(s) = 1. Therefore, we can interpret d(s) as a probability distribution of s. Then, the metric can be written as

v̄π = E[vπ(S)],

where S ∼ d.
How to select the distribution d? This is an important question. There are two cases.
The first and simplest case is that d is independent of the policy π. In this case, we
specifically denote d as d0 and v̄π as v̄π0 to indicate that the distribution is independent
of the policy. One case is to treat all the states equally important and select d0 (s) =
1/|S|. Another case is when we are only interested in a specific state s0 (e.g., the
agent always starts from s0 ). In this case, we can design
d0(s0) = 1,   d0(s ≠ s0) = 0.
The second case is that d is dependent on the policy π. In this case, it is common to
select d as dπ , which is the stationary distribution under π. One basic property of dπ
is that it satisfies
dTπ Pπ = dTπ ,
where Pπ is the state transition probability matrix. More information about the
stationary distribution can be found in Box 8.1.
The interpretation of selecting dπ is as follows. The stationary distribution reflects the
long-term behavior of a Markov decision process under a given policy. If one state is
frequently visited in the long term, it is more important and deserves a higher weight;
if a state is rarely visited, then its importance is low and deserves a lower weight.
As its name suggests, v̄π is a weighted average of the state values. Different values
of θ lead to different values of v̄π . Our ultimate goal is to find an optimal policy (or
equivalently an optimal θ) to maximize v̄π .
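For readers who want to see the stationary distribution numerically, the following short sketch (not from the book; the transition matrix is an arbitrary example) computes a vector satisfying dπᵀPπ = dπᵀ, as in the property stated above, by repeatedly applying the transition matrix.

```python
import numpy as np

# An illustrative state transition matrix P_pi (rows sum to one).
P_pi = np.array([[0.7, 0.3, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])

d = np.ones(P_pi.shape[0]) / P_pi.shape[0]   # start from a uniform guess
for _ in range(1000):                        # power iteration: d^T <- d^T P_pi
    d = d @ P_pi

print(d)          # approximately satisfies d @ P_pi == d
```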
We next introduce another two important equivalent expressions of v̄π .
Suppose that an agent collects rewards {Rt+1 }∞t=0 by following a given policy π(θ).
Readers may often see the following metric in the literature:
J(θ) = lim_{n→∞} E[ Σ_{t=0}^{n} γ^t R_{t+1} ] = E[ Σ_{t=0}^{∞} γ^t R_{t+1} ]. (9.1)
This metric may be nontrivial to interpret at first glance. In fact, it is equal to v̄π .
To see that, we have
E[ Σ_{t=0}^{∞} γ^t R_{t+1} ] = Σ_{s∈S} d(s) E[ Σ_{t=0}^{∞} γ^t R_{t+1} | S0 = s ] = Σ_{s∈S} d(s)vπ(s) = v̄π.
The first equality in the above equation is due to the law of total expectation. The
second equality is by the definition of state values.
The metric v̄π can also be rewritten as the inner product of two vectors. In particular,
let
vπ = [. . . , vπ (s), . . . ]T ∈ R|S| ,
d = [. . . , d(s), . . . ]T ∈ R|S| .
Then, we have
v̄π = dT vπ .
The second metric is the average one-step reward or simply called the average reward [2, 64, 65]. In particular, it is defined as

r̄π ≐ Σ_{s∈S} dπ(s)rπ(s),

where dπ is the stationary distribution and

rπ(s) ≐ Σ_{a∈A} π(a|s, θ)r(s, a) = E_{A∼π(s,θ)}[r(s, A)|s] (9.3)

is the expectation of the immediate rewards. Here, r(s, a) ≐ E[R|s, a] = Σ_r r p(r|s, a).
We next present another two important equivalent expressions of r̄π .
A metric that is often encountered in the literature is the average of the single-step rewards along an infinitely long trajectory generated by following π(θ):

J(θ) = lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} R_{t+1} ]. (9.4)

It may seem nontrivial to interpret this metric at first glance. In fact, it is equal to r̄π:

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} R_{t+1} ] = Σ_{s∈S} dπ(s)rπ(s) = r̄π. (9.5)
In addition, similar to v̄π, the metric r̄π can be rewritten as the inner product of two vectors. In particular, let

rπ = [. . . , rπ(s), . . . ]^T ∈ R^{|S|},
dπ = [. . . , dπ(s), . . . ]^T ∈ R^{|S|}.

Then, r̄π = dπ^T rπ.
Step 1: We first prove that the following equation is valid for any starting state s0 ∈ S:

r̄π = lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} R_{t+1} | S0 = s0 ]. (9.6)

To do that, we notice

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} R_{t+1} | S0 = s0 ] = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} E[R_{t+1} | S0 = s0] = lim_{t→∞} E[R_{t+1} | S0 = s0], (9.7)

where the last equality is due to the property of the Cesaro mean (also called the Cesaro summation). In particular, if {a_k}_{k=1}^{∞} is a convergent sequence such that lim_{k→∞} a_k exists, then {(1/n) Σ_{k=1}^{n} a_k}_{n=1}^{∞} is also a convergent sequence and lim_{n→∞} (1/n) Σ_{k=1}^{n} a_k = lim_{k→∞} a_k.
We next examine E[R_{t+1} | S0 = s0] in (9.7) more closely. By the law of total expectation, we have

E[R_{t+1} | S0 = s0] = Σ_{s∈S} E[R_{t+1} | S_t = s, S0 = s0] p^{(t)}(s|s0)
                     = Σ_{s∈S} E[R_{t+1} | S_t = s] p^{(t)}(s|s0)
                     = Σ_{s∈S} rπ(s) p^{(t)}(s|s0),

where p^{(t)}(s|s0) denotes the probability of transitioning from s0 to s using exactly t steps. The second equality in the above equation is due to the Markov memoryless property: the reward obtained at the next time step depends only on the current state rather than the previous ones.

Note that

lim_{t→∞} p^{(t)}(s|s0) = dπ(s)
by the definition of the stationary distribution. As a result, the starting state s0 does not matter. Then, we have

lim_{t→∞} E[R_{t+1} | S0 = s0] = lim_{t→∞} Σ_{s∈S} rπ(s)p^{(t)}(s|s0) = Σ_{s∈S} rπ(s)dπ(s) = r̄π,

which proves (9.6).

Step 2: Since (9.6) is valid for any starting state, it follows from the law of total expectation that

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} R_{t+1} ] = Σ_{s∈S} d(s) lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} R_{t+1} | S0 = s ] = Σ_{s∈S} d(s) r̄π = r̄π.
Metric    Expression 1               Expression 2    Expression 3
v̄π       Σ_{s∈S} d(s)vπ(s)          d^T vπ          lim_{n→∞} E[Σ_{t=0}^{n} γ^t R_{t+1}]
r̄π       Σ_{s∈S} dπ(s)rπ(s)         dπ^T rπ         lim_{n→∞} (1/n) E[Σ_{t=0}^{n−1} R_{t+1}]

Table 9.2: Summary of the different but equivalent expressions of v̄π and r̄π.

Up to now, we have introduced two types of metrics: v̄π and r̄π. Each metric has several different but equivalent expressions. They are summarized in Table 9.2. We sometimes use v̄π to specifically refer to the case where the state distribution is the stationary distribution dπ and use v̄π0 to refer to the case where d0 is independent of π. Some remarks about the metrics are given below.
The two metrics v̄π and r̄π are equivalent in the discounted case where γ < 1. In
particular, it can be shown that
r̄π = (1 − γ)v̄π .
The above equation indicates that these two metrics can be simultaneously maximized.
The proof of this equation is given later in Lemma 9.1.
9.3 Gradients of the metrics

Given the metrics introduced above, we next derive their gradients. The gradients in the different scenarios have similar expressions, which are summarized in the following theorem.

Theorem 9.1 (Policy gradient theorem). The gradient of J(θ) is

∇θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a) = E[∇θ ln π(A|S, θ)qπ(S, A)], (9.8)

where η is a distribution of the states, S ∼ η, and A ∼ π(S, θ).

It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases. The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of J(θ) and η are not given in Theorem 9.1 and can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, J(θ) could be v̄π0, v̄π, or r̄π. The equality in (9.8) may become a strict equality or an approximation. The distribution η also varies in different scenarios.
The derivation of the gradients is the most complicated part of the policy gradient
method. For many readers, it is sufficient to be familiar with the result in Theorem 9.1
without knowing the proof. The derivation details presented in the rest of this section
are mathematically intensive. Readers are suggested to study selectively based on
their interests.
To see why the logarithm appears in the expectation form in (9.8), note that

∇θ ln π(a|s, θ) = ∇θ π(a|s, θ) / π(a|s, θ).

It follows that

∇θ π(a|s, θ) = π(a|s, θ)∇θ ln π(a|s, θ),

and hence the gradient can be written as E[∇θ ln π(A|S, θ)qπ(S, A)], where S ∼ η and A ∼ π(S, θ).
It is notable that π(a|s, θ) must be positive for all (s, a) to ensure that ln π(a|s, θ) is
valid. This can be achieved by using softmax functions:
π(a|s, θ) = e^{h(s,a,θ)} / Σ_{a'∈A} e^{h(s,a',θ)},   a ∈ A,   (9.12)

where h(s, a, θ) is a function indicating the preference for selecting a at s. The policy in (9.12) satisfies π(a|s, θ) ∈ [0, 1] and Σ_{a∈A} π(a|s, θ) = 1 for any s ∈ S. This policy
can be realized by a neural network. The input of the network is s. The output layer
is a softmax layer so that the network outputs π(a|s, θ) for all a and the sum of the
outputs is equal to 1. See Figure 9.2(b) for an illustration.
Since π(a|s, θ) > 0 for all a, the policy is stochastic and hence exploratory. The policy
does not directly tell which action to take. Instead, the action should be generated
according to the probability distribution of the policy.
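The following is a minimal sketch, not the book's implementation, of the softmax policy in (9.12) with linear preferences h(s, a, θ) = φ(s, a)ᵀθ. The one-hot feature map and the names n_states and n_actions are illustrative assumptions; the gradient formula ∇θ ln π(a|s, θ) = φ(s, a) − Σ_b π(b|s, θ)φ(s, b) is specific to this softmax-linear parameterization.

```python
import numpy as np

def phi(s, a, n_states, n_actions):
    """One-hot feature of the state-action pair (a simple choice of phi)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def policy(s, theta, n_states, n_actions):
    """Return pi(.|s, theta) as a probability vector over actions (softmax of h)."""
    h = np.array([phi(s, a, n_states, n_actions) @ theta
                  for a in range(n_actions)])
    h -= h.max()                        # subtract the max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def grad_log_policy(s, a, theta, n_states, n_actions):
    """Gradient of ln pi(a|s, theta) for this softmax-linear parameterization."""
    pi = policy(s, theta, n_states, n_actions)
    expected_phi = sum(pi[b] * phi(s, b, n_states, n_actions)
                       for b in range(n_actions))
    return phi(s, a, n_states, n_actions) - expected_phi

# Sampling an action according to the stochastic policy:
# a = np.random.choice(n_actions, p=policy(s, theta, n_states, n_actions))
```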
First, the following lemma shows that the two metrics are equivalent in the discounted case.

Lemma 9.1 (Equivalence between v̄π(θ) and r̄π(θ)). In the discounted case where γ ∈ (0, 1), it holds that

r̄π(θ) = (1 − γ)v̄π(θ).

Proof. Note that v̄π(θ) = dπ^T vπ and r̄π(θ) = dπ^T rπ, where vπ and rπ satisfy the Bellman equation vπ = rπ + γPπvπ. Multiplying dπ^T on both sides of the Bellman equation yields

dπ^T vπ = dπ^T rπ + γ dπ^T Pπ vπ = dπ^T rπ + γ dπ^T vπ,

where the last equality uses dπ^T Pπ = dπ^T. That is, v̄π = r̄π + γv̄π, which implies r̄π = (1 − γ)v̄π.
Second, the following lemma gives the gradient of vπ(s) for any s.

Lemma 9.2 (Gradient of vπ(s)). In the discounted case, it holds for any s ∈ S that

∇θ vπ(s) = Σ_{s'∈S} Prπ(s'|s) Σ_{a∈A} ∇θ π(a|s', θ)qπ(s', a),   (9.14)

where

Prπ(s'|s) ≐ Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss'} = [(In − γPπ)^{−1}]_{ss'}

is the discounted total probability of transitioning from s to s' under policy π.
Proof of Lemma 9.2. First, note that for every s ∈ S,

∇θ vπ(s) = ∇θ [ Σ_{a∈A} π(a|s, θ)qπ(s, a) ] = Σ_{a∈A} [ ∇θ π(a|s, θ)qπ(s, a) + π(a|s, θ)∇θ qπ(s, a) ],

where qπ(s, a) = r(s, a) + γ Σ_{s'∈S} p(s'|s, a)vπ(s') by the definition of action values. Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

∇θ qπ(s, a) = 0 + γ Σ_{s'∈S} p(s'|s, a)∇θ vπ(s'),

and hence

∇θ vπ(s) = Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a) + γ Σ_{a∈A} π(a|s, θ) Σ_{s'∈S} p(s'|s, a)∇θ vπ(s').
It is notable that ∇θ vπ appears on both sides of the above equation. One way to
calculate it is to use the unrolling technique [64]. Here, we use another way based on
the matrix-vector form, which we believe is more straightforward to understand. In
particular, let
u(s) ≐ Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a).
Since

Σ_{a∈A} π(a|s, θ) Σ_{s'∈S} p(s'|s, a)∇θ vπ(s') = Σ_{s'∈S} p(s'|s)∇θ vπ(s') = Σ_{s'∈S} [Pπ]_{ss'} ∇θ vπ(s'),

we have ∇θ vπ(s) = u(s) + γ Σ_{s'∈S} [Pπ]_{ss'} ∇θ vπ(s') for every s ∈ S. Stacking these equations for all the states, with ∇θ vπ ∈ R^{mn} and u ∈ R^{mn} denoting the vectors that stack ∇θ vπ(s) and u(s) for all s, gives
∇θ vπ = u + γ(Pπ ⊗ Im )∇θ vπ .
Here, n = |S|, and m is the dimension of the parameter vector θ. The reason that the Kronecker product ⊗ emerges in the equation is that ∇θ vπ(s) is a vector. The above equation is a linear equation of ∇θ vπ, which can be solved as

∇θ vπ = (Imn − γPπ ⊗ Im)^{−1} u = ((In − γPπ)^{−1} ⊗ Im) u. (9.17)

The elementwise form of this solution is

∇θ vπ(s) = Σ_{s'∈S} [(In − γPπ)^{−1}]_{ss'} u(s'),   s ∈ S. (9.18)
The quantity [(In − γPπ)^{−1}]_{ss'} has a clear probabilistic interpretation. In particular, since (In − γPπ)^{−1} = I + γPπ + γ²Pπ² + · · · , we have

[(In − γPπ)^{−1}]_{ss'} = [I]_{ss'} + γ[Pπ]_{ss'} + γ²[Pπ²]_{ss'} + · · · = Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss'}.
Note that [Pπ^k]_{ss'} is the probability of transitioning from s to s' using exactly k steps (see Box 8.1). Therefore, [(In − γPπ)^{−1}]_{ss'} is the discounted total probability of transitioning from s to s' using any number of steps. By denoting [(In − γPπ)^{−1}]_{ss'} ≐ Prπ(s'|s), equation (9.18) becomes (9.14).
With the results in Lemma 9.2, we are ready to derive the gradient of v̄π0 .
Theorem 9.2 (Gradient of v̄π0 in the discounted case). In the discounted case where γ ∈ (0, 1), the gradient of v̄π0 = d0^T vπ is

∇θ v̄π0 = Σ_{s∈S} ρπ(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a) = E[∇θ ln π(A|S, θ)qπ(S, A)],

where S ∼ ρπ, A ∼ π(S, θ), and the state distribution is

ρπ(s) ≐ Σ_{s0∈S} d0(s0)Prπ(s|s0),   s ∈ S. (9.19)

Here, Prπ(s|s0) = Σ_{k=0}^{∞} γ^k [Pπ^k]_{s0 s} = [(I − γPπ)^{−1}]_{s0 s} is the discounted total probability of transitioning from s0 to s under policy π.
Proof. Since d0 is independent of π, we have ∇θ v̄π0 = Σ_{s∈S} d0(s)∇θ vπ(s). Substituting the expression of ∇θ vπ(s) given in Lemma 9.2 into the above equation yields

∇θ v̄π0 = Σ_{s∈S} d0(s)∇θ vπ(s)
        = Σ_{s∈S} d0(s) Σ_{s'∈S} Prπ(s'|s) Σ_{a∈A} ∇θ π(a|s', θ)qπ(s', a)
        = Σ_{s'∈S} ( Σ_{s∈S} d0(s)Prπ(s'|s) ) Σ_{a∈A} ∇θ π(a|s', θ)qπ(s', a)
        ≐ Σ_{s'∈S} ρπ(s') Σ_{a∈A} ∇θ π(a|s', θ)qπ(s', a)
        = Σ_{s∈S} ρπ(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a)     (change s' to s)
        = Σ_{s∈S} ρπ(s) Σ_{a∈A} π(a|s, θ)∇θ ln π(a|s, θ)qπ(s, a)
        = E[∇θ ln π(A|S, θ)qπ(S, A)],

where S ∼ ρπ and A ∼ π(S, θ). The proof is complete.
With Lemma 9.1 and Lemma 9.2, we can derive the gradients of r̄π and v̄π .
Theorem 9.3 (Gradients of r̄π and v̄π in the discounted case). In the discounted case
where γ ∈ (0, 1), the gradients of r̄π and v̄π are
∇θ r̄π = (1 − γ)∇θ v̄π ≈ Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a) = E[∇θ ln π(A|S, θ)qπ(S, A)],
where S ∼ dπ and A ∼ π(S, θ). Here, the approximation is more accurate when γ is
closer to 1.
Proof. Note that v̄π = Σ_{s∈S} dπ(s)vπ(s), where dπ depends on π(θ). Hence,

∇θ v̄π = Σ_{s∈S} ∇θ dπ(s) vπ(s) + Σ_{s∈S} dπ(s)∇θ vπ(s). (9.20)

This equation contains two terms. On the one hand, substituting the expression of ∇θ vπ given in (9.17) into the second term gives

Σ_{s∈S} dπ(s)∇θ vπ(s) = (dπ^T ⊗ Im)∇θ vπ = (dπ^T ⊗ Im)((In − γPπ)^{−1} ⊗ Im)u = (dπ^T(In − γPπ)^{−1} ⊗ Im)u. (9.21)

It is noted that

dπ^T(In − γPπ)^{−1} = (1/(1 − γ)) dπ^T,

which can be easily verified by multiplying (In − γPπ) on both sides of the equation. Therefore, (9.21) becomes

Σ_{s∈S} dπ(s)∇θ vπ(s) = ((1/(1 − γ)) dπ^T ⊗ Im) u = (1/(1 − γ)) Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a).

On the other hand, the first term of (9.20) involves ∇θ dπ. However, since the second term contains the factor 1/(1 − γ), the second term becomes dominant and the first term becomes negligible when γ → 1. Therefore,

∇θ v̄π ≈ (1/(1 − γ)) Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a).
The approximation in the above equation requires that the first term does not go to
infinity when γ → 1. More information can be found in [66, Section 4].
In the undiscounted case, it is necessary to redefine state and action values. Since the
undiscounted sum of the rewards, E[Rt+1 + Rt+2 + Rt+3 + . . . |St = s], may diverge, the
state and action values are defined in a special way [64]:

vπ(s) ≐ E[(R_{t+1} − r̄π) + (R_{t+2} − r̄π) + (R_{t+3} − r̄π) + · · · | S_t = s],
qπ(s, a) ≐ E[(R_{t+1} − r̄π) + (R_{t+2} − r̄π) + (R_{t+3} − r̄π) + · · · | S_t = s, A_t = a],

where r̄π is the average reward, which is determined when π is given. There are different names for vπ(s) in the literature such as the differential reward [65] or bias [2, Section 8.2.1]. It can be verified that the state value defined above satisfies the following Bellman-like equation:

vπ(s) = Σ_{a} π(a|s, θ)[ Σ_{r} p(r|s, a)(r − r̄π) + Σ_{s'} p(s'|s, a)vπ(s') ]. (9.22)

Since vπ(s) = Σ_{a∈A} π(a|s, θ)qπ(s, a), it holds that qπ(s, a) = Σ_{r} p(r|s, a)(r − r̄π) + Σ_{s'} p(s'|s, a)vπ(s'). The matrix-vector form of (9.22) is
vπ = rπ − r̄π 1n + Pπ vπ , (9.23)
where 1n = [1, . . . , 1]T ∈ Rn . Equation (9.23) is similar to the Bellman equation and it
has a specific name called the Poisson equation [65, 67].
How to solve vπ from the Poisson equation? The answer is given in the following
theorem.
Theorem 9.4 (Solutions of the Poisson equation). Let

vπ* ≐ (In − Pπ + 1n dπ^T)^{−1} rπ. (9.24)

Then, vπ* is a solution of the Poisson equation in (9.23). Moreover, any solution of the Poisson equation has the following form:

vπ = vπ* + c 1n,

where c ∈ R.
This theorem indicates that the solution of the Poisson equation may not be unique.
Proof. Step 1: Show that vπ* is a solution. For the sake of simplicity, denote

A ≐ In − Pπ + 1n dπ^T.

Then, vπ* = A^{−1} rπ. The fact that A is invertible will be proven in Step 3. Substituting vπ* = A^{−1} rπ into (9.25) gives

A^{−1} rπ = rπ − 1n dπ^T rπ + Pπ A^{−1} rπ.

This equation is valid as proven below. Reorganizing this equation gives (−A^{−1} + In − 1n dπ^T + Pπ A^{−1})rπ = 0, and consequently,

(−In + A − 1n dπ^T A + Pπ)A^{−1} rπ = 0.

The term in the brackets in the above equation is zero because −In + A − 1n dπ^T A + Pπ = −In + (In − Pπ + 1n dπ^T) − 1n dπ^T(In − Pπ + 1n dπ^T) + Pπ = 0. Therefore, vπ* in (9.24) is a solution.

Step 2: General expression of the solutions.
Substituting r̄π = dπ^T rπ into (9.23) gives

vπ = rπ − 1n dπ^T rπ + Pπ vπ (9.25)

and consequently
Step 3: Show that A = In − Pπ + 1n dπ^T is invertible (Lemma 9.3).

Proof. First of all, we state some preliminary facts without proof. Let ρ(M) be the spectral radius of a matrix M. Then, I − M is invertible if ρ(M) < 1. Moreover, ρ(M) < 1 if and only if lim_{k→∞} M^k = 0.

Based on the above facts, we next show that lim_{k→∞} (Pπ − 1n dπ^T)^k → 0, and then the invertibility of In − (Pπ − 1n dπ^T) immediately follows. To do that, we notice that

(Pπ − 1n dπ^T)^k = Pπ^k − 1n dπ^T,   k = 2, 3, . . . (9.27)

For example, when k = 2, (Pπ − 1n dπ^T)² = Pπ² − Pπ 1n dπ^T − 1n dπ^T Pπ + 1n dπ^T 1n dπ^T = Pπ² − 1n dπ^T, where the last equality is due to Pπ 1n = 1n, dπ^T Pπ = dπ^T, and dπ^T 1n = 1. The case of k ≥ 3 can be proven similarly.

Since dπ is the stationary distribution of the state, it holds that lim_{k→∞} Pπ^k = 1n dπ^T (see Box 8.1). Therefore, (9.27) implies that

lim_{k→∞} (Pπ − 1n dπ^T)^k = lim_{k→∞} Pπ^k − 1n dπ^T = 1n dπ^T − 1n dπ^T = 0.

As a result, ρ(Pπ − 1n dπ^T) < 1 and hence In − (Pπ − 1n dπ^T) = A is invertible.
The proof of Lemma 9.3 is inspired by [66]. However, the result (In − Pπ + 1n dπ^T)^{−1} = Σ_{k=0}^{∞}(Pπ^k − 1n dπ^T) given in [66] (the statement above equation (16) in [66]) is inaccurate because Σ_{k=0}^{∞}(Pπ^k − 1n dπ^T) is singular since Σ_{k=0}^{∞}(Pπ^k − 1n dπ^T)1n = 0. Lemma 9.3 corrects this inaccuracy.
Derivation of gradients
Although the value of vπ is not unique in the undiscounted case, as shown in Theorem 9.4, the value of r̄π is unique. In particular, it follows from the Poisson equation that

r̄π 1n = rπ + (Pπ − In)vπ = rπ + (Pπ − In)(vπ* + c 1n) = rπ + (Pπ − In)vπ*.

Notably, the undetermined value c is canceled and hence r̄π is unique. Therefore, we can
calculate the gradient of r̄π in the undiscounted case. In addition, since vπ is not unique,
v̄π is not unique either. We do not study the gradient of v̄π in the undiscounted case. For
interested readers, it is worth mentioning that we can add more constraints to uniquely
solve vπ from the Poisson equation. For example, by assuming that a recurrent state
exists, the state value of this recurrent state is zero [65, Section II], and hence, c can
be determined. There are also other ways to uniquely determine vπ . See, for example,
equations (8.6.5)-(8.6.7) in [2].
The gradient of r̄π in the undiscounted case is given below.

Theorem 9.5 (Gradient of r̄π in the undiscounted case). In the undiscounted case, the gradient of r̄π is

∇θ r̄π = Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a) = E[∇θ ln π(A|S, θ)qπ(S, A)], (9.28)

where S ∼ dπ and A ∼ π(S, θ).
Compared to the discounted case shown in Theorem 9.3, the gradient of r̄π in the
undiscounted case is more elegant in the sense that (9.28) is strictly valid and S obeys
the stationary distribution.
Proof. In the undiscounted case, it follows from the definition of the action values (Section 9.3.2) that

qπ(s, a) = r(s, a) − r̄π + Σ_{s'∈S} p(s'|s, a)vπ(s'),

where r(s, a) = Σ_r r p(r|s, a). Since r(s, a) is independent of θ, we have

∇θ qπ(s, a) = 0 − ∇θ r̄π + Σ_{s'∈S} p(s'|s, a)∇θ vπ(s').

Let

u(s) ≐ Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a).

Then, the gradient of vπ(s) = Σ_{a∈A} π(a|s, θ)qπ(s, a) is

∇θ vπ(s) = u(s) + Σ_{a∈A} π(a|s, θ)∇θ qπ(s, a) = u(s) − ∇θ r̄π + Σ_{a∈A} π(a|s, θ) Σ_{s'∈S} p(s'|s, a)∇θ vπ(s'). (9.30)
Since Σ_{a∈A} π(a|s, θ) Σ_{s'∈S} p(s'|s, a)∇θ vπ(s') = Σ_{s'∈S} p(s'|s)∇θ vπ(s'), equation (9.30) can be written in matrix-vector form as

∇θ vπ = u − 1n ⊗ ∇θ r̄π + (Pπ ⊗ Im)∇θ vπ,

where n = |S|, m is the dimension of θ, ⊗ is the Kronecker product, and ∇θ vπ, u ∈ R^{mn} stack ∇θ vπ(s) and u(s) for all the states. The above equation can be reorganized as

1n ⊗ ∇θ r̄π = u + (Pπ ⊗ Im − Imn)∇θ vπ.

Multiplying dπ^T ⊗ Im on both sides and noting that (dπ^T ⊗ Im)(Pπ ⊗ Im − Imn) = (dπ^T Pπ − dπ^T) ⊗ Im = 0 and (dπ^T ⊗ Im)(1n ⊗ ∇θ r̄π) = (dπ^T 1n)∇θ r̄π = ∇θ r̄π, we obtain

∇θ r̄π = (dπ^T ⊗ Im)u = Σ_{s∈S} dπ(s)u(s) = Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a).

The proof is complete.

9.4 Monte Carlo policy gradient (REINFORCE)

Given the policy gradient in Theorem 9.1, the metric J(θ) can be maximized by the gradient-ascent algorithm

θ_{t+1} = θ_t + α∇θ J(θ_t) = θ_t + αE[∇θ ln π(A|S, θ_t)qπ(S, A)], (9.31)
where α > 0 is a constant learning rate. Since the true gradient in (9.31) is unknown, we
can replace the true gradient with a stochastic gradient to obtain the following algorithm:

θ_{t+1} = θ_t + α∇θ ln π(a_t|s_t, θ_t)q_t(s_t, a_t), (9.32)

where s_t and a_t are samples of S and A at time t, and q_t(s_t, a_t) is an approximation of qπ(s_t, a_t). When q_t(s_t, a_t) is obtained by Monte Carlo estimation, the algorithm is known as REINFORCE or Monte Carlo policy gradient.

First, since ∇θ ln π(a_t|s_t, θ_t) = ∇θ π(a_t|s_t, θ_t)/π(a_t|s_t, θ_t), equation (9.32) can be rewritten as

θ_{t+1} = θ_t + αβ_t ∇θ π(a_t|s_t, θ_t), (9.33)

where

β_t ≐ q_t(s_t, a_t)/π(a_t|s_t, θ_t).

When β_t ≥ 0, the update tends to increase the probability of selecting a_t at s_t; when β_t < 0, it tends to decrease this probability.
The above observations can be proven as follows. When θt+1 − θt is sufficiently small,
it follows from the Taylor expansion that
π(a_t|s_t, θ_{t+1}) ≈ π(a_t|s_t, θ_t) + (∇θ π(a_t|s_t, θ_t))^T(θ_{t+1} − θ_t)
  = π(a_t|s_t, θ_t) + αβ_t (∇θ π(a_t|s_t, θ_t))^T(∇θ π(a_t|s_t, θ_t))   (substituting (9.33))
  = π(a_t|s_t, θ_t) + αβ_t ‖∇θ π(a_t|s_t, θ_t)‖²₂.
It is clear that π(at |st , θt+1 ) ≥ π(at |st , θt ) when βt ≥ 0 and π(at |st , θt+1 ) < π(at |st , θt )
when βt < 0.
Second, the algorithm can strike a balance between exploration and exploitation to a certain extent. That is because the coefficient

β_t = q_t(s_t, a_t)/π(a_t|s_t, θ_t)

is proportional to q_t(s_t, a_t) and inversely proportional to π(a_t|s_t, θ_t). On the one hand, the greater q_t(s_t, a_t) is, the greater β_t is, and hence the algorithm tends to exploit actions with greater estimated values. On the other hand, the smaller π(a_t|s_t, θ_t) is, the greater β_t is, and hence the algorithm tends to explore actions that currently have low probabilities.
Moreover, since (9.32) uses samples to approximate the true gradient in (9.31), it is
important to understand how the samples should be obtained.
How to sample S? S in the true gradient E[∇θ ln π(A|S, θt )qπ (S, A)] should obey the
distribution η which is either the stationary distribution dπ or the discounted total
probability distribution ρπ in (9.19). Either dπ or ρπ represents the long-term behavior
exhibited under π.
How to sample A? A in E[∇θ ln π(A|S, θt )qπ (S, A)] should obey the distribution of
π(A|S, θ). The ideal way to sample A is to select at following π(a|st , θt ). Therefore,
the policy gradient algorithm is on-policy.
Unfortunately, the ideal ways for sampling S and A are not strictly followed in practice
due to their low efficiency of sample usage. A more sample-efficient implementation of
(9.32) is given in Algorithm 9.1. In this implementation, an episode is first generated by
following π(θ). Then, θ is updated multiple times using every experience sample in the
episode.
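The following is a minimal sketch of the Monte Carlo policy gradient update in (9.32), not the book's implementation of Algorithm 9.1. It assumes the reader supplies a policy parameterization through two placeholder callables, policy(s, theta) returning the action probabilities and grad_log_policy(s, a, theta) returning ∇θ ln π(a|s, θ) (for example, an adaptation of the softmax sketch given earlier), as well as a placeholder environment env with reset() and step() methods.

```python
import numpy as np

def reinforce_episode(env, theta, policy, grad_log_policy,
                      gamma=0.9, alpha=0.001):
    """One REINFORCE episode: generate an episode by following pi(theta),
    estimate q_t by the Monte Carlo return, then update theta per (9.32)."""
    # Generate an episode {(s_t, a_t, r_{t+1})} by following pi(theta).
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = policy(s, theta)
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)     # assumed environment interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # Monte Carlo estimate of q_t(s_t, a_t): the discounted return from t.
    g = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g

    # Update theta once for every experience sample in the episode.
    for t in range(len(rewards)):
        theta = theta + alpha * returns[t] * grad_log_policy(states[t],
                                                             actions[t], theta)
    return theta
```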
9.5 Summary
This chapter introduced the policy gradient method, which is the foundation of many
modern reinforcement learning algorithms. Policy gradient methods are policy-based. This is a big step forward in this book because all the methods in the previous chapters are
value-based. The basic idea of the policy gradient method is simple. That is to select an
appropriate scalar metric and then optimize it via a gradient-ascent algorithm.
The most complicated part of the policy gradient method is the derivation of the
gradients of the metrics. That is because we have to distinguish various scenarios with
different metrics and discounted/undiscounted cases. Fortunately, the expressions of the
gradients in different scenarios are similar. Hence, we summarized the expressions in
Theorem 9.1, which is the most important theoretical result in this chapter. For many
readers, it is sufficient to be aware of this theorem. Its proof is nontrivial, and it is not
required for all readers to study.
The policy gradient algorithm in (9.32) must be properly understood since it is the
foundation of many advanced policy gradient algorithms. In the next chapter, this algo-
rithm will be extended to another important policy gradient method called actor-critic.
9.6 Q&A
Q: What is the basic idea of the policy gradient method?
A: The basic idea is simple. That is to define an appropriate scalar metric, derive
its gradient, and then use gradient-ascent methods to optimize the metric. The most
important theoretical result regarding this method is the policy gradient given in
Theorem 9.1.
Q: What is the most complicated part of the policy gradient method?
A: The basic idea of the policy gradient method is simple. However, the derivation
procedure of the gradients is quite complicated. That is because we have to distin-
guish numerous different scenarios. The mathematical derivation procedure in each
scenario is nontrivial. It is sufficient for many readers to be familiar with the result
in Theorem 9.1 without knowing the proof.
Q: What metrics should be used in the policy gradient method?
A: We introduced three common metrics in this chapter: v̄π , v̄π0 , and r̄π . Since they
all lead to similar policy gradients, they all can be adopted in the policy gradient
method. More importantly, the expressions in (9.1) and (9.4) are often encountered
in the literature.
Q: Why is a natural logarithm function contained in the policy gradient?
A: The logarithm comes from rewriting ∇θ π(a|s, θ) as π(a|s, θ)∇θ ln π(a|s, θ). This rewriting allows the gradient to be expressed as an expectation, E[∇θ ln π(A|S, θ)qπ(S, A)], which can then be approximated by stochastic samples.
Chapter 10
Actor-Critic Methods
(Figure: the location of this chapter in the map of the book.)
This chapter introduces actor-critic methods. From one point of view, “actor-critic”
refers to a structure that incorporates both policy-based and value-based methods. Here,
an “actor” refers to a policy update step. The reason that it is called an actor is that the
actions are taken by following the policy. Here, a “critic” refers to a value update step. It
is called a critic because it criticizes the actor by evaluating its corresponding values. From
another point of view, actor-critic methods are still policy gradient algorithms. They can
be obtained by extending the policy gradient algorithm introduced in Chapter 9. It is
important for the reader to well understand the contents of Chapters 8 and 9 before
studying this chapter.
10.1 The simplest actor-critic algorithm (QAC)

Recall that the policy gradient method updates the policy parameter by the gradient-ascent algorithm

θ_{t+1} = θ_t + α∇θ J(θ_t) = θ_t + αE[∇θ ln π(A|S, θ_t)qπ(S, A)],

where η is a distribution of the states (see Theorem 9.1 for more information). Since the true gradient is unknown, we can use a stochastic gradient to approximate it:

θ_{t+1} = θ_t + α∇θ ln π(a_t|s_t, θ_t)q_t(s_t, a_t).

When q_t(s_t, a_t) is estimated by temporal-difference learning with value function approximation, the resulting algorithm is the simplest actor-critic algorithm, which we call QAC.
10.2 Advantage actor-critic (A2C)

The policy gradient is invariant to an additional baseline. That is,

E[∇θ ln π(A|S, θ_t)qπ(S, A)] = E[∇θ ln π(A|S, θ_t)(qπ(S, A) − b(S))],

where the additional baseline b(S) is a scalar function of S. We next answer two questions
about the baseline.
To see why a good baseline is useful for reducing the estimation variance, let

X(S, A) ≐ ∇θ ln π(A|S, θ_t)[qπ(S, A) − b(S)]. (10.4)
Then, the true gradient is E[X(S, A)]. Since we need to use a stochastic sample x to
approximate E[X], it would be favorable if the variance var(X) is small. For example,
if var(X) is close to zero, then any sample x can accurately approximate E[X]. On
the contrary, if var(X) is large, the value of a sample may be far from E[X].
Although E[X] is invariant to the baseline, the variance var(X) is not. Our goal is to
design a good baseline to minimize var(X). In the algorithms of REINFORCE and
QAC, we set b = 0, which is not guaranteed to be a good baseline.
In fact, the optimal baseline that minimizes var(X) is

b*(s) = E_{A∼π}[ ‖∇θ ln π(A|s, θ_t)‖² qπ(s, A) ] / E_{A∼π}[ ‖∇θ ln π(A|s, θ_t)‖² ],   s ∈ S.
To derive the optimal baseline, note that the variance satisfies

var(X) = E[X^T X] − x̄^T x̄, (10.6)

where x̄ ≐ E[X]. When deriving the above equation, we use the trace property tr(AB) = tr(BA) for any square matrices A, B with appropriate dimensions. Since x̄ is invariant to the baseline, equation (10.6) suggests that we only need to minimize E[X^T X]. With X defined in (10.4), we have

E[X^T X] = E[ ‖∇θ ln π‖² (qπ(S, A) − b(S))² ],

where π(A|S, θ) is written as π for short. Since S ∼ η and A ∼ π, the above equation can be rewritten as

E[X^T X] = Σ_{s∈S} η(s) E_{A∼π}[ ‖∇θ ln π‖² (qπ(s, A) − b(s))² ].

Setting the derivative with respect to b(s) to zero for each s, the above equation can be easily solved to obtain the optimal baseline:

b*(s) = E_{A∼π}[ ‖∇θ ln π(A|s, θ_t)‖² qπ(s, A) ] / E_{A∼π}[ ‖∇θ ln π(A|s, θ_t)‖² ],   s ∈ S.
Although the optimal baseline can minimize the variance, its expression is complex. In practice, the suboptimal but simpler baseline

b(s) = E_{A∼π}[qπ(s, A)] = vπ(s),   s ∈ S,

is often used instead. With this baseline, the gradient-ascent algorithm becomes

θ_{t+1} = θ_t + αE[∇θ ln π(A|S, θ_t)(qπ(S, A) − vπ(S))] ≐ θ_t + αE[∇θ ln π(A|S, θ_t)δπ(S, A)]. (10.7)

Here,

δπ(S, A) ≐ qπ(S, A) − vπ(S)

is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that vπ(s) = Σ_{a∈A} π(a|s)qπ(s, a) is the mean of the action values. If δπ(s, a) > 0, it means that the corresponding action has a greater value than the mean value.
The stochastic version of (10.7) is

θ_{t+1} = θ_t + α∇θ ln π(a_t|s_t, θ_t)[q_t(s_t, a_t) − v_t(s_t)], (10.8)
where st , at are samples of S, A at time t. Here, qt (st , at ) and vt (st ) are approximations of
qπ(θt ) (st , at ) and vπ(θt ) (st ), respectively. The algorithm in (10.8) updates the policy based
on the relative value of qt with respect to vt rather than the absolute value of qt . This is
intuitively reasonable because, when we attempt to select an action at a state, we only
care about which action has the greatest value relative to the others.
If qt (st , at ) and vt (st ) are estimated by Monte Carlo learning, the algorithm in (10.8) is
called REINFORCE with a baseline. If qt (st , at ) and vt (st ) are estimated by TD learning,
the algorithm is usually called advantage actor-critic (A2C). The implementation of A2C
is summarized in Algorithm 10.2. It should be noted that the advantage function in this
implementation is approximated by the TD error:

q_t(s_t, a_t) − v_t(s_t) ≈ r_{t+1} + γv_t(s_{t+1}) − v_t(s_t) ≐ δ_t,

which is valid due to the definition of qπ(s_t, a_t): qπ(s_t, a_t) − vπ(s_t) = E[R_{t+1} + γvπ(S_{t+1}) − vπ(S_t) | S_t = s_t, A_t = a_t]. One merit of using the TD error is
that we only need to use a single neural network to represent vπ (s). Otherwise, if δt =
qt (st , at ) − vt (st ), we need to maintain two networks to represent vπ (s) and qπ (s, a),
respectively. When we use the TD error, the algorithm may also be called TD actor-
critic. In addition, it is notable that the policy π(θt ) is stochastic and hence exploratory.
Therefore, it can be directly used to generate experience samples without relying on
228
techniques such as ε-greedy. There are some variants of A2C such as asynchronous
advantage actor-critic (A3C). Interested readers may check [71, 72].
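As a rough illustration of the structure of A2C rather than a faithful reproduction of Algorithm 10.2, the following sketch performs one actor-critic step with the TD error as the advantage. It assumes a linear value function v(s, w) = φ_v(s)ᵀw and reuses the placeholder grad_log_policy from the earlier sketches; φ_v and the step sizes are assumptions.

```python
def a2c_step(s, a, r, s_next, theta, w, phi_v, grad_log_policy,
             gamma=0.9, alpha_theta=0.001, alpha_w=0.01):
    """One advantage actor-critic step with a linear critic v(s, w) = phi_v(s)^T w."""
    # Advantage approximated by the TD error delta_t.
    delta = r + gamma * (phi_v(s_next) @ w) - (phi_v(s) @ w)
    # Critic (value update): TD learning with linear function approximation.
    w = w + alpha_w * delta * phi_v(s)
    # Actor (policy update): gradient ascent weighted by the advantage.
    theta = theta + alpha_theta * delta * grad_log_policy(s, a, theta)
    return theta, w
```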
10.3 Off-policy actor-critic

The policy gradient methods introduced so far are on-policy. Recall that the true gradient has the form E[∇θ ln π(A|S, θ)qπ(S, A)] with A ∼ π(S, θ). To use samples to approximate this true gradient, we must generate the action samples
by following π(θ). Hence, π(θ) is the behavior policy. Since π(θ) is also the target policy
that we aim to improve, the policy gradient methods are on-policy.
In the case that we already have some samples generated by a given behavior policy,
the policy gradient methods can still be applied to utilize these samples. To do that,
we can employ a technique called importance sampling. It is worth mentioning that the
importance sampling technique is not restricted to the field of reinforcement learning.
It is a general technique for estimating expected values defined over one probability
distribution using some samples drawn from another distribution.
Consider a random variable X ∈ X. Suppose that our goal is to estimate E_{X∼p0}[X], where p0 is a given probability distribution. There are two scenarios.

First, if the samples {x_i}_{i=1}^{n} are generated by following p0, then the average value x̄ = (1/n) Σ_{i=1}^{n} x_i can be used to approximate E_{X∼p0}[X] because x̄ is an unbiased estimate of E_{X∼p0}[X] and the estimation variance converges to zero as n → ∞ (see the law of large numbers in Box 5.1 for more information).

Second, consider a new scenario where the samples {x_i}_{i=1}^{n} are not generated by p0. Instead, they are generated by another distribution p1. Can we still use these samples to approximate E_{X∼p0}[X]? The answer is yes. However, we can no longer use x̄ = (1/n) Σ_{i=1}^{n} x_i to approximate E_{X∼p0}[X] since x̄ ≈ E_{X∼p1}[X] rather than E_{X∼p0}[X].
In the second scenario, E_{X∼p0}[X] can be approximated based on the importance sampling technique. In particular, E_{X∼p0}[X] satisfies

E_{X∼p0}[X] = Σ_{x∈X} p0(x)x = Σ_{x∈X} p1(x) [p0(x)/p1(x)] x = E_{X∼p1}[f(X)], (10.9)

where f(x) ≐ [p0(x)/p1(x)] x.
Thus, estimating E_{X∼p0}[X] becomes the problem of estimating E_{X∼p1}[f(X)]. Let

f̄ ≐ (1/n) Σ_{i=1}^{n} f(x_i).

Since f̄ can effectively approximate E_{X∼p1}[f(X)], it then follows from (10.9) that

E_{X∼p0}[X] = E_{X∼p1}[f(X)] ≈ f̄ = (1/n) Σ_{i=1}^{n} f(x_i) = (1/n) Σ_{i=1}^{n} [p0(x_i)/p1(x_i)] x_i. (10.10)

Equation (10.10) suggests that E_{X∼p0}[X] can be approximated by a weighted average of x_i. Here, p0(x_i)/p1(x_i) is called the importance weight. When p1 = p0, the importance weight is 1 and f̄ becomes x̄. When p0(x_i) ≥ p1(x_i), x_i can be sampled more frequently by p0 but less frequently by p1. In this case, the importance weight, which is greater than one, emphasizes the importance of this sample.
Some readers may ask the following question: while p0 (x) is required in (10.10), why
do we not directly calculate E_{X∼p0}[X] using its definition E_{X∼p0}[X] = Σ_{x∈X} p0(x)x?
The answer is as follows. To use the definition, we need to know either the analytical
expression of p0 or the value of p0 (x) for every x ∈ X . However, it is difficult to obtain
the analytical expression of p0 when the distribution is represented by, for example, a
neural network. It is also difficult to obtain the value of p0 (x) for every x ∈ X when X
is large. By contrast, (10.10) merely requires the values of p0 (xi ) for some samples and
is much easier to implement in practice.
An illustrative example

Consider a random variable X ∈ {+1, −1}. Suppose that p0 is a uniform distribution, i.e., p0(X = +1) = p0(X = −1) = 0.5, and hence E_{X∼p0}[X] = 0. Suppose that p1(X = +1) = 0.8 and p1(X = −1) = 0.2, and hence E_{X∼p1}[X] = 0.6. Suppose that we have some samples {x_i} drawn over p1. Our goal is to estimate E_{X∼p0}[X] using these samples. As shown in Figure 10.2, there are more samples of +1 than −1 because p1(X = +1) = 0.8 > p1(X = −1) = 0.2. If we directly calculate the average value (1/n) Σ_{i=1}^{n} x_i of the samples, this value converges to E_{X∼p1}[X] = 0.6 (see the dotted line in Figure 10.2). By contrast, if we calculate the weighted average value as in (10.10), this value can successfully converge to E_{X∼p0}[X] = 0 (see the solid line in Figure 10.2).
(Plot: the sample values, their running average, and the importance sampling estimate versus the sample index.)
Figure 10.2: An example for demonstrating the importance sampling technique. Here, X ∈ {+1, −1} and
p0 (X = +1) = p0 (X = −1) = 0.5. The samples are generated according to p1 where p1 (X = +1) = 0.8
and p1 (X = −1) = 0.2. The average of the samples converges to EX∼p1 [X] = 0.6, but the weighted
average calculated by the importance sampling technique in (10.10) converges to EX∼p0 [X] = 0.
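The following short sketch (not from the book) reproduces the experiment described in Figure 10.2 numerically: it draws samples from p1 and compares the plain average with the importance sampling estimate in (10.10).

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = {+1: 0.5, -1: 0.5}                 # target distribution, E_p0[X] = 0
p1 = {+1: 0.8, -1: 0.2}                 # behavior distribution, E_p1[X] = 0.6

samples = rng.choice([+1, -1], size=200, p=[p1[+1], p1[-1]])

plain_average = samples.mean()                         # converges toward 0.6
weights = np.array([p0[x] / p1[x] for x in samples])   # importance weights
weighted_average = (weights * samples).mean()          # converges toward 0

print(plain_average, weighted_average)
```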
Finally, the distribution p1, which is used to generate samples, must satisfy that p1(x) ≠ 0 when p0(x) ≠ 0. If p1(x) = 0 while p0(x) ≠ 0, the estimation result may be problematic. For example, if

p1(X = +1) = 1,   p1(X = −1) = 0,

then the samples generated by p1 are all positive: {x_i} = {+1, +1, . . . , +1}. These samples cannot be used to correctly estimate E_{X∼p0}[X] = 0 because

(1/n) Σ_{i=1}^{n} [p0(x_i)/p1(x_i)] x_i = (1/n) Σ_{i=1}^{n} [p0(+1)/p1(+1)](+1) = (1/n) Σ_{i=1}^{n} (0.5/1)(+1) ≡ 0.5,

which is nonzero no matter how many samples are used.

We now apply the importance sampling technique to learn optimal policies off-policy. Suppose that β is a behavior policy that generates experience samples. Our aim is to use these samples to learn a target policy π(θ) that maximizes the metric

J(θ) = Σ_{s∈S} dβ(s)vπ(s) = E_{S∼dβ}[vπ(S)],
where dβ is the stationary distribution under policy β and vπ is the state value under
policy π. The gradient of this metric is given in the following theorem.
Theorem 10.1 (Off-policy policy gradient theorem). In the discounted case where γ ∈ (0, 1), the gradient of J(θ) is

∇θ J(θ) = E_{S∼ρ, A∼β}[ (π(A|S, θ)/β(A|S)) ∇θ ln π(A|S, θ)qπ(S, A) ], (10.11)

where π(A|S, θ)/β(A|S) is the importance weight and the state distribution is

ρ(s) ≐ Σ_{s'∈S} dβ(s')Prπ(s|s'),   s ∈ S,

where Prπ(s|s') is the discounted total probability of transitioning from s' to s under policy π.
The gradient in (10.11) is similar to that in the on-policy case in Theorem 9.1, but
there are two differences. The first difference is the importance weight. The second
difference is that A ∼ β instead of A ∼ π. Therefore, we can use the action samples
generated by following β to approximate the true gradient. The proof of the theorem is
given in Box 10.2.
Proof. Since dβ is independent of the target policy π(θ), the gradient of J(θ) = Σ_{s∈S} dβ(s)vπ(s) is

∇θ J(θ) = Σ_{s∈S} dβ(s)∇θ vπ(s). (10.12)

By Lemma 9.2, the gradient of vπ(s) is

∇θ vπ(s) = Σ_{s'∈S} Prπ(s'|s) Σ_{a∈A} ∇θ π(a|s', θ)qπ(s', a), (10.13)

where Prπ(s'|s) ≐ Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss'} = [(In − γPπ)^{−1}]_{ss'}. Substituting (10.13) into (10.12) yields
∇θ J(θ) = Σ_{s∈S} dβ(s)∇θ vπ(s)
        = Σ_{s∈S} dβ(s) Σ_{s'∈S} Prπ(s'|s) Σ_{a∈A} ∇θ π(a|s', θ)qπ(s', a)
        = Σ_{s'∈S} ( Σ_{s∈S} dβ(s)Prπ(s'|s) ) Σ_{a∈A} ∇θ π(a|s', θ)qπ(s', a)
        ≐ Σ_{s'∈S} ρ(s') Σ_{a∈A} ∇θ π(a|s', θ)qπ(s', a)
        = Σ_{s∈S} ρ(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ(s, a)     (change s' to s)
        = E_{S∼ρ}[ Σ_{a∈A} ∇θ π(a|S, θ)qπ(S, a) ].
By using the importance sampling technique, the above equation can be further rewritten as

E_{S∼ρ}[ Σ_{a∈A} ∇θ π(a|S, θ)qπ(S, a) ]
  = E_{S∼ρ}[ Σ_{a∈A} β(a|S) (π(a|S, θ)/β(a|S)) (∇θ π(a|S, θ)/π(a|S, θ)) qπ(S, a) ]
  = E_{S∼ρ}[ Σ_{a∈A} β(a|S) (π(a|S, θ)/β(a|S)) ∇θ ln π(a|S, θ)qπ(S, a) ]
  = E_{S∼ρ, A∼β}[ (π(A|S, θ)/β(A|S)) ∇θ ln π(A|S, θ)qπ(S, A) ].
The proof is complete. The above proof is similar to that of Theorem 9.1.
Based on the gradient in Theorem 10.1, the policy can be updated by the stochastic gradient-ascent algorithm

θ_{t+1} = θ_t + α_θ (π(a_t|s_t, θ_t)/β(a_t|s_t)) ∇θ ln π(a_t|s_t, θ_t)[q_t(s_t, a_t) − v_t(s_t)],

where α_θ > 0 and the action a_t is generated by the behavior policy β. Similar to the on-policy case, the advantage function q_t(s, a) − v_t(s) can be replaced by the TD error. That is,

q_t(s_t, a_t) − v_t(s_t) ≈ r_{t+1} + γv_t(s_{t+1}) − v_t(s_t) ≐ δ_t(s_t, a_t).

The algorithm then becomes

θ_{t+1} = θ_t + α_θ (π(a_t|s_t, θ_t)/β(a_t|s_t)) ∇θ ln π(a_t|s_t, θ_t)δ_t(s_t, a_t).
Initialization: A given behavior policy β(a|s). A target policy π(a|s, θ0) where θ0 is the initial parameter. A value function v(s, w0) where w0 is the initial parameter. αw, αθ > 0.
Goal: Learn an optimal policy to maximize J(θ).
At time step t in each episode, do
    Generate at following β(st) and then observe rt+1, st+1.
    Advantage (TD error):
        δt = rt+1 + γv(st+1, wt) − v(st, wt)
    Actor (policy update):
        θt+1 = θt + αθ (π(at|st, θt)/β(at|st)) δt ∇θ ln π(at|st, θt)
    Critic (value update):
        wt+1 = wt + αw (π(at|st, θt)/β(at|st)) δt ∇w v(st, wt)
10.4 Deterministic actor-critic

So far, the policies used in the policy gradient methods are all stochastic since π(a|s, θ) > 0 for every action. Policies can also be deterministic. In this section, we use

a = µ(s, θ)
to specifically denote a deterministic policy. Different from π which gives the probability
of an action, µ directly gives the action since it is a mapping from S to A. This deter-
ministic policy can be represented by, for example, a neural network with s as its input,
a as its output, and θ as its parameter. For the sake of simplicity, we often write µ(s, θ)
as µ(s) for short.
The gradients of the metrics in the deterministic case are summarized in the following theorem.

Theorem 10.2 (Deterministic policy gradient theorem). The gradient of J(θ) is

∇θ J(θ) = Σ_{s∈S} η(s)∇θ µ(s) ∇a qµ(s, a)|_{a=µ(s)} = E_{S∼η}[∇θ µ(S) ∇a qµ(S, a)|_{a=µ(S)}], (10.14)

where η is a distribution of the states.

Theorem 10.2 is a summary of the results presented in Theorem 10.3 and Theorem 10.4 since the gradients in the two theorems have similar expressions. The specific expressions of J(θ) and η can be found in Theorems 10.3 and 10.4.
Unlike the stochastic case, the gradient in the deterministic case shown in (10.14)
does not involve the action random variable A. As a result, when we use samples to
approximate the true gradient, it is not required to sample actions. Therefore, the de-
terministic policy gradient method is off-policy. In addition, some readers may wonder
why ∇a qµ (S, a) |a=µ(S) cannot be written as ∇a qµ (S, µ(S)), which seems more concise.
That is simply because, if we do that, it is unclear how qµ (S, µ(S)) is a function of a. A
concise yet less confusing expression may be ∇a qµ (S, a = µ(S)).
In the rest of this subsection, we present the derivation details of Theorem 10.2. In
particular, we derive the gradients of two common metrics: the first is the average value
and the second is the average reward. Since these two metrics have been discussed in
detail in Section 9.2, we sometimes use their properties without proof. For most readers,
it is sufficient to be familiar with Theorem 10.2 without knowing its derivation details.
Interested readers can selectively examine the details in the remainder of this section.
Lemma 10.1 (Gradient of vµ(s)). In the discounted case, it holds for any s ∈ S that

∇θ vµ(s) = Σ_{s'∈S} Prµ(s'|s)∇θ µ(s') ∇a qµ(s', a)|_{a=µ(s')},   (10.16)

where

Prµ(s'|s) ≐ Σ_{k=0}^{∞} γ^k [Pµ^k]_{ss'} = [(I − γPµ)^{−1}]_{ss'}

is the discounted total probability of transitioning from s to s' under policy µ. Here, [·]_{ss'} denotes the entry in the sth row and s'th column of a matrix.
By the definition of action values, for any given (s, a), we have
qµ(s, a) = r(s, a) + γ Σ_{s'∈S} p(s'|s, a)vµ(s'),

where r(s, a) = Σ_r r p(r|s, a). Since r(s, a) is independent of µ, we have

∇θ qµ(s, a) = 0 + γ Σ_{s'∈S} p(s'|s, a)∇θ vµ(s').
Furthermore, since vµ(s) = qµ(s, µ(s)), applying the chain rule gives

∇θ vµ(s) = ∇θ µ(s) ∇a qµ(s, a)|_{a=µ(s)} + ∇θ qµ(s, a)|_{a=µ(s)}
         = u(s) + γ Σ_{s'∈S} p(s'|s, µ(s))∇θ vµ(s')
         = u(s) + γ Σ_{s'∈S} [Pµ]_{ss'} ∇θ vµ(s'),

where u(s) ≐ ∇θ µ(s) ∇a qµ(s, a)|_{a=µ(s)}. Since the above equation is valid for all s ∈ S, we can combine these equations to obtain the matrix-vector form

∇θ vµ = u + γ(Pµ ⊗ Im)∇θ vµ,

where ∇θ vµ ∈ R^{mn} and u ∈ R^{mn} stack ∇θ vµ(s) and u(s) for all the states, n = |S|, and m is the dimension of θ. This linear equation can be solved as

∇θ vµ = ((I − γPµ)^{−1} ⊗ Im)u,

whose elementwise form is

∇θ vµ(s) = Σ_{s'∈S} [(I − γPµ)^{−1}]_{ss'} u(s'),   s ∈ S. (10.19)
The quantity [(I − γPµ)^{−1}]_{ss'} has a clear probabilistic interpretation. Since (I − γPµ)^{−1} = I + γPµ + γ²Pµ² + · · · , we have

[(I − γPµ)^{−1}]_{ss'} = [I]_{ss'} + γ[Pµ]_{ss'} + γ²[Pµ²]_{ss'} + · · · = Σ_{k=0}^{∞} γ^k [Pµ^k]_{ss'}.

Note that [Pµ^k]_{ss'} is the probability of transitioning from s to s' using exactly k steps (see Box 8.1 for more information). Therefore, [(I − γPµ)^{−1}]_{ss'} is the discounted total probability of transitioning from s to s' using any number of steps. By denoting [(I − γPµ)^{−1}]_{ss'} ≐ Prµ(s'|s), equation (10.19) leads to (10.16).
With the preparation in Lemma 10.1, we are ready to derive the gradient of J(θ).
Theorem 10.3 (Deterministic policy gradient theorem in the discounted case). In the discounted case where γ ∈ (0, 1), the gradient of J(θ) = Σ_{s∈S} d0(s)vµ(s) is

∇θ J(θ) = Σ_{s∈S} ρµ(s)∇θ µ(s) ∇a qµ(s, a)|_{a=µ(s)} = E_{S∼ρµ}[∇θ µ(S) ∇a qµ(S, a)|_{a=µ(S)}],

where ρµ(s) ≐ Σ_{s0∈S} d0(s0)Prµ(s|s0) and d0 is a state distribution independent of µ.
Proof. Since d0 is independent of µ, we have ∇θ J(θ) = Σ_{s∈S} d0(s)∇θ vµ(s). Substituting the expression of ∇θ vµ(s) given by Lemma 10.1 into the above equation yields

∇θ J(θ) = Σ_{s∈S} d0(s)∇θ vµ(s)
        = Σ_{s∈S} d0(s) Σ_{s'∈S} Prµ(s'|s)∇θ µ(s') ∇a qµ(s', a)|_{a=µ(s')}
        = Σ_{s'∈S} ( Σ_{s∈S} d0(s)Prµ(s'|s) ) ∇θ µ(s') ∇a qµ(s', a)|_{a=µ(s')}
        ≐ Σ_{s'∈S} ρµ(s')∇θ µ(s') ∇a qµ(s', a)|_{a=µ(s')}
        = Σ_{s∈S} ρµ(s)∇θ µ(s) ∇a qµ(s, a)|_{a=µ(s)}     (change s' to s)
        = E_{S∼ρµ}[∇θ µ(S) ∇a qµ(S, a)|_{a=µ(S)}].
The proof is complete. The above proof is consistent with the proof of Theorem 1
in [74]. Here, we consider the case in which the states and actions are finite. When
they are continuous, the proof is similar, but the summations should be replaced by
integrals [74].
We next consider the average reward metric in the undiscounted case:

J(θ) = r̄µ ≐ Σ_{s∈S} dµ(s)rµ(s), (10.20)

where dµ is the stationary distribution under policy µ and

rµ(s) = E[R|s, a = µ(s)] = Σ_r r p(r|s, a = µ(s))
is the expectation of the immediate rewards. More information about this metric can be
found in Section 9.2.
The gradient of J(θ) is given in the following theorem.
Theorem 10.4 (Deterministic policy gradient theorem in the undiscounted case). In the
undiscounted case, the gradient of J(θ) in (10.20) is
∇θ J(θ) = Σ_{s∈S} dµ(s)∇θ µ(s) ∇a qµ(s, a)|_{a=µ(s)} = E_{S∼dµ}[∇θ µ(S) ∇a qµ(S, a)|_{a=µ(S)}],

where dµ is the stationary distribution of the states under policy µ.
Proof. In the undiscounted case, it follows from the definition of the action values (Section 9.3.2) that

qµ(s, a) = r(s, a) − r̄µ + Σ_{s'∈S} p(s'|s, a)vµ(s').
Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

∇θ qµ(s, a) = 0 − ∇θ r̄µ + Σ_{s'∈S} p(s'|s, a)∇θ vµ(s').
Since vµ(s) = qµ(s, µ(s)), applying the chain rule as before gives

∇θ vµ(s) = u(s) − ∇θ r̄µ + Σ_{s'∈S} [Pµ]_{ss'} ∇θ vµ(s'),   s ∈ S,

where u(s) ≐ ∇θ µ(s) ∇a qµ(s, a)|_{a=µ(s)}. While the above equation is valid for all s ∈ S, we can combine these equations to obtain the matrix-vector form

∇θ vµ = u − 1n ⊗ ∇θ r̄µ + (Pµ ⊗ Im)∇θ vµ,

and hence

1n ⊗ ∇θ r̄µ = u + (Pµ ⊗ Im − Imn)∇θ vµ.

Multiplying dµ^T ⊗ Im on both sides and noting that (dµ^T ⊗ Im)(Pµ ⊗ Im − Imn) = (dµ^T Pµ − dµ^T) ⊗ Im = 0 and (dµ^T ⊗ Im)(1n ⊗ ∇θ r̄µ) = (dµ^T 1n)∇θ r̄µ = ∇θ r̄µ, we obtain

∇θ r̄µ = (dµ^T ⊗ Im)u
       = Σ_{s∈S} dµ(s)u(s)
       = Σ_{s∈S} dµ(s)∇θ µ(s) ∇a qµ(s, a)|_{a=µ(s)}
       = E_{S∼dµ}[∇θ µ(S) ∇a qµ(S, a)|_{a=µ(S)}].

The proof is complete.
How to select the behavior policy β? It can be any exploratory policy. It can also be
a stochastic policy obtained by adding noise to µ [75]. In this case, µ is also the behavior
policy and hence this way is an on-policy implementation.
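To illustrate the structure of a deterministic actor-critic update based on Theorem 10.2, the following is a minimal sketch under strong simplifying assumptions that are not from the book: a scalar continuous action, a linear deterministic policy µ(s, θ) = φ(s)ᵀθ, and a critic q̂(s, a, w) that is linear in the features ψ(s, a) = [φ(s); aφ(s)], so that ∇_a q̂ is available in closed form. The feature map φ and the step sizes are placeholders.

```python
import numpy as np

def dpg_step(s, a, r, s_next, theta, w, phi,
             gamma=0.9, alpha_theta=0.001, alpha_w=0.01):
    """One deterministic actor-critic step.
    theta has length m; w has length 2m and parameterizes
    q_hat(s, a, w) = psi(s, a)^T w with psi(s, a) = [phi(s); a*phi(s)]."""
    m = len(theta)
    psi = lambda s_, a_: np.concatenate([phi(s_), a_ * phi(s_)])

    mu_next = phi(s_next) @ theta                   # a' = mu(s', theta)
    # Critic (value update): TD learning on the linear critic.
    delta = r + gamma * (psi(s_next, mu_next) @ w) - (psi(s, a) @ w)
    w = w + alpha_w * delta * psi(s, a)

    # Actor (policy update):
    # theta <- theta + alpha * grad_theta mu(s) * grad_a q_hat(s, a, w)|_{a=mu(s)}.
    # For this critic, grad_a q_hat(s, a, w) = phi(s)^T w2 with w = [w1; w2].
    grad_a_q = phi(s) @ w[m:]
    theta = theta + alpha_theta * phi(s) * grad_a_q
    return theta, w
```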
10.5 Summary
In this chapter, we introduced actor-critic methods. The contents are summarized as
follows.
- Section 10.1 introduced the simplest actor-critic algorithm, called QAC. This algorithm is similar to REINFORCE, the policy gradient algorithm introduced in the last chapter. The only difference is that QAC estimates the q-values by TD learning, whereas REINFORCE relies on Monte Carlo estimation.
- Section 10.2 extended QAC to advantage actor-critic. It was shown that the policy gradient is invariant to any additional baseline, and that an optimal baseline can help reduce the estimation variance.
- Section 10.3 further extended the advantage actor-critic algorithm to the off-policy case. To do that, we introduced an important technique called importance sampling.
- Finally, while all the previously presented policy gradient algorithms rely on stochastic policies, Section 10.4 showed that the policy can also be deterministic. The corresponding gradient was derived, and the deterministic policy gradient algorithm was introduced.
Policy gradient and actor-critic methods are widely used in modern reinforcement learning. There exist a large number of advanced algorithms in the literature, such as SAC [76, 77], TRPO [78], PPO [79], and TD3 [80]. In addition, the single-agent case can be extended to the multi-agent case [82, 83, 85].
10.6 Q&A
Q: What is the relationship between actor-critic and policy gradient methods?
A: Actor-critic methods are policy gradient methods, and the two terms are sometimes used interchangeably. Every policy gradient algorithm must estimate action values. When the action values are estimated by temporal-difference learning with value function approximation, the resulting policy gradient algorithm is called actor-critic. The name “actor-critic” highlights the algorithmic structure that combines a policy-update component (the actor) with a value-update component (the critic). This structure is fundamental to many reinforcement learning algorithms.
Q: Why is it important to introduce additional baselines to actor-critic methods?
A: Since the policy gradient is invariant to any additional baseline, we can utilize the
baseline to reduce estimation variance. The resulting algorithm is called advantage
actor-critic.
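The following sketch illustrates this point numerically for an assumed one-state problem with three actions and known action values: subtracting a baseline from the action values leaves the sample mean of the policy gradient essentially unchanged but reduces its variance. The simple value baseline used here is an assumption for illustration and differs from the optimal baseline of Section 10.2.

```python
import numpy as np

# Assumed setup: one state, three actions, softmax policy, known action values.
rng = np.random.default_rng(0)
theta = np.array([0.1, -0.2, 0.3])
q = np.array([1.0, 2.0, 3.0])                 # assumed true action values

pi = np.exp(theta) / np.exp(theta).sum()      # softmax policy

def sample_gradients(baseline, n=200000):
    a = rng.choice(3, size=n, p=pi)
    score = np.eye(3)[a] - pi                 # grad_theta log pi(a) for a softmax policy
    g = score * (q[a] - baseline)[:, None]    # policy gradient samples
    return g.mean(axis=0), g.var(axis=0).sum()

mean_no_b, var_no_b = sample_gradients(baseline=0.0)
mean_b, var_b = sample_gradients(baseline=(pi * q).sum())   # b = E[q(A)] as a simple baseline
print(mean_no_b, mean_b)   # nearly identical: the baseline does not bias the gradient
print(var_no_b, var_b)     # the baseline yields a smaller total variance here
```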
Q: Can importance sampling be used in value-based algorithms in addition to policy-based ones?
A: Yes. Importance sampling is a general technique for estimating the expectation of a random variable under one distribution using samples drawn from another distribution. The reason why this technique is useful in reinforcement learning is that many problems in reinforcement learning boil down to estimating expectations. For example, in value-based methods, the action or state values are defined as expectations. In the policy gradient method, the true gradient is also an expectation. As a result, importance sampling can be applied in both value-based and policy-based algorithms. In fact, it has been applied in the value-based component of Algorithm 10.3.
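A minimal sketch of the underlying idea is given below: the expectation of X under a target distribution p is estimated from samples drawn from a different distribution q by weighting each sample with the ratio p(x)/q(x). The distributions and values are illustrative assumptions.

```python
import numpy as np

# Importance sampling: estimate E_p[X] using samples drawn from q.
rng = np.random.default_rng(0)
values = np.array([-1.0, 0.0, 2.0])
p = np.array([0.2, 0.3, 0.5])      # target distribution
q = np.array([0.5, 0.3, 0.2])      # behavior (sampling) distribution

idx = rng.choice(len(values), size=100000, p=q)   # samples follow q, not p
weights = p[idx] / q[idx]                         # importance weights p(x)/q(x)
estimate = np.mean(weights * values[idx])

print(estimate, values @ p)   # the weighted estimate approximates E_p[X] = 0.8
```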
Q: Why is the deterministic policy gradient method off-policy?
A: The true gradient in the deterministic case does not involve an expectation over the action random variable. As a result, when we use samples to approximate the true gradient, the actions do not need to be drawn from the target policy, and hence the experience can be generated by any behavior policy. Therefore, the deterministic policy gradient method is off-policy.
Appendix A

Preliminaries for Probability Theory
Random variable: The term “variable” indicates that a random variable can take
values from a set of numbers. The term “random” indicates that taking a value must
follow a probability distribution.
A random variable is usually denoted by a capital letter. Its value is usually denoted
by a lowercase letter. For example, X is a random variable, and x is a value that X
can take.
This book mainly considers the case where a random variable can only take a finite
number of values. A random variable can be a scalar or a vector.
Like ordinary variables, random variables support the usual mathematical operations such as summation, product, and absolute value. For example, if X, Y are two random variables, we can calculate X + Y, X + 1, and XY.
A stochastic sequence is a sequence of random variables.
One scenario we often encounter is collecting a sampling sequence {x_i}_{i=1}^n of a random variable X. For example, consider the task of tossing a die n times. Let x_i be a random variable representing the value obtained in the ith toss. Then, {x_1, x_2, . . . , x_n} is a stochastic sequence.
It may be confusing to beginners why xi is a random variable instead of a deterministic
value. In fact, if the sampling sequence is {1,6,3,5,...}, then this sequence is not a
stochastic sequence because all the elements are already determined. However, if we
use a variable xi to represent the values that can possibly be sampled, it is a random
variable since xi can take any value in {1, . . . , 6}. Although xi is a lowercase letter, it
still represents a random variable.
Probability: The notation p(X = x) or pX (x) describes the probability of the random
variable X taking the value x. When the context is clear, p(X = x) is often written
as p(x) for short.
Conditional probability: The notation p(X = x|A = a) describes the probability of the
random variable X taking the value x given that the random variable A has already
taken the value a. We often write p(X = x|A = a) as p(x|a) for short.
It holds that

p(x, a) = p(x|a)p(a)

and

p(x|a) = p(x, a) / p(a).

Since p(x) = Σ_a p(x, a), we have

p(x) = Σ_a p(x, a) = Σ_a p(x|a)p(a),

which is known as the law of total probability.
Independence: Two random variables X, Y are said to be independent if p(x, y) = p(x)p(y) for all x and y. An equivalent definition is p(x|y) = p(x). The above two definitions are equivalent because p(x, y) = p(x|y)p(y), which implies p(x|y) = p(x) when p(x, y) = p(x)p(y).
Conditional independence: Let X, A, B be three random variables. X is said to be conditionally independent of A given B if

p(x|a, b) = p(x|b).

In the context of reinforcement learning, consider three consecutive states: s_t, s_{t+1}, s_{t+2}. Since they are obtained consecutively, s_{t+2} depends on s_{t+1} and also on s_t. However, if s_{t+1} is already given, then s_{t+2} is conditionally independent of s_t. That is,

p(s_{t+2}|s_{t+1}, s_t) = p(s_{t+2}|s_{t+1}).
Law of total probability: The law of total probability was already mentioned when we
introduced the concept of conditional probability. Due to its importance, we list it
again below:
p(x) = Σ_y p(x, y)

and

p(x|a) = Σ_y p(x, y|a).
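As a quick numerical illustration, the following sketch verifies p(x) = Σ_a p(x|a)p(a) on an assumed small joint distribution.

```python
import numpy as np

# A check of the law of total probability on an assumed joint distribution p(x, a).
p_joint = np.array([[0.10, 0.20],    # rows: values of X, columns: values of A
                    [0.30, 0.40]])   # entries p(x, a) sum to one

p_a = p_joint.sum(axis=0)            # p(a) = sum_x p(x, a)
p_x = p_joint.sum(axis=1)            # p(x) = sum_a p(x, a)
p_x_given_a = p_joint / p_a          # p(x|a) = p(x, a) / p(a), columnwise

print(p_x, p_x_given_a @ p_a)        # both give p(x): [0.3, 0.7]
```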
Chain rule of conditional probability and joint probability. By the definition of con-
ditional probability, we have
p(a, b) = p(a|b)p(b),

and hence p(a, b, c)/p(c) = p(a, b|c) = p(a|b, c)p(b|c). The fact that p(a, b|c) = p(a|b, c)p(b|c) implies the following property:

p(x|a) = Σ_b p(x, b|a) = Σ_b p(x|b, a)p(b|a).
The second equation above follows directly from the definition of conditional probability, and the first equation is the law of total probability applied under the condition A = a.

Expectation: The expectation (or expected value) of a random variable X is defined as E[X] ≐ Σ_x x p(x). Expectation is linear; for example, for two random variables X, Y, it holds that E[X + Y] = E[X] + E[Y], which is proven below:
E[X + Y] = Σ_x Σ_y (x + y) p(X = x, Y = y)
         = Σ_x x Σ_y p(x, y) + Σ_y y Σ_x p(x, y)
         = Σ_x x p(x) + Σ_y y p(y)
         = E[X] + E[Y].
Moreover, if A is a deterministic matrix (or scalar) of compatible dimensions, then E[AX] = AE[X].
Similar to the law of total probability, we have the law of total expectation:
E[X] = Σ_a E[X|A = a] p(a).

Its proof is Σ_a E[X|A = a]p(a) = Σ_a Σ_x x p(x|a)p(a) = Σ_x x Σ_a p(x|a)p(a) = Σ_x x p(x) = E[X].

The law of total expectation also has a conditioned version:

E[X|A = a] = Σ_b E[X|A = a, B = b] p(b|a).
This equation is useful in the derivation of the Bellman equation. A hint of its proof
is the chain rule: p(x|a, b)p(b|a) = p(x, b|a).
Finally, it is worth noting that E[X|A = a] is different from E[X|A]. The former is
a value, whereas the latter is a random variable. In fact, E[X|A] is a function of the
random variable A. We need rigorous probability theory to define E[X|A].
Gradient of expectation: Let f(X, β) be a scalar function of a random variable X and a deterministic parameter vector β, where the distribution of X does not depend on β. Then,

∇β E[f(X, β)] = E[∇β f(X, β)].
Variance, covariance, covariance matrix: For a single random variable X, its variance is defined as var(X) = E[(X − X̄)^2], where X̄ = E[X]. For two random variables X, Y, their covariance is defined as cov(X, Y) = E[(X − X̄)(Y − Ȳ)]. For a random vector X = [X_1, . . . , X_n]^T, the covariance matrix of X is defined as var(X) ≐ Σ ≐ E[(X − X̄)(X − X̄)^T] ∈ R^{n×n}. The ijth entry of Σ is [Σ]_{ij} = E[[X − X̄]_i [X − X̄]_j] = E[(X_i − X̄_i)(X_j − X̄_j)] = cov(X_i, X_j). One trivial property is var(a) = 0 if a is deterministic. Moreover, it can be verified that var(AX + a) = var(AX) = A var(X) A^T = AΣA^T for a deterministic matrix A and vector a.
Some other useful facts include var(X) = E[X^2] − X̄^2 and var(X + Y) = var(X) + var(Y) + 2cov(X, Y).
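The following sketch numerically checks the identity var(AX + a) = A var(X) A^T on an assumed Gaussian random vector.

```python
import numpy as np

# Numerical check of var(AX + a) = A Sigma A^T for deterministic A and a.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=500000)
A = np.array([[1.0, 2.0], [0.0, 3.0]])
a = np.array([5.0, -1.0])

Y = X @ A.T + a                                  # samples of AX + a
Sigma = np.cov(X, rowvar=False)                  # empirical var(X)
print(np.cov(Y, rowvar=False))                   # approximately A Sigma A^T
print(A @ Sigma @ A.T)
```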
Appendix B
Measure-Theoretic Probability
Theory
We now briefly introduce measure-theoretic probability theory, which is also called rig-
orous probability theory. We only present basic notions and results. Comprehensive
introductions can be found in [96–98]. Moreover, measure-theoretic probability theory
requires some basic knowledge of measure theory, which is not covered here. Interested
readers may refer to [99].
The reader may wonder if it is necessary to understand measure-theoretic probability
theory before studying reinforcement learning. The answer is yes if the reader is interested
in rigorously analyzing the convergence of stochastic sequences. For example, we often
encounter the notion of almost sure convergence in Chapter 6 and Chapter 7. This notion
is taken from measure-theoretic probability theory. If the reader is not interested in the
convergence of stochastic sequences, it is okay to skip this part.
Probability triples

A probability triple (Ω, F, P) consists of three elements.

Ω: This is the sample space, which is the set of all possible outcomes ω of an experiment. A subset of Ω is called an event; an event containing a single outcome is called an elementary event.

F: This is a σ-algebra (also called a σ-field), which is a collection of subsets of Ω satisfying three conditions: (1) Ω ∈ F and ∅ ∈ F; (2) if A ∈ F, then its complement A^c ∈ F; (3) if A_1, A_2, · · · ∈ F, then their union ∪_i A_i ∈ F. The elements of F are the events to which probabilities are assigned.
Example: Consider the game of dice. An example of an elementary event is “the num-
ber you get is i”, where i ∈ {1, . . . , 6}. An example of a nonelementary event is “the
number you get is greater than 3”. We care about such an event in practice because,
for example, we can win the game if this event occurs. This event is mathematically
expressed as A = {ω ∈ Ω : ω > 3}. Since Ω = {1, 2, 3, 4, 5, 6} in this case, we have
A = {4, 5, 6}.
P: This is a probability measure, which is a mapping from F to [0, 1]. Any A ∈ F is
a set that contains some points in Ω. Then, P(A) is the measure of this set.
Example: If A = Ω, which contains all ω values, then P(A) = 1; if A = ∅, then
P(A) = 0. In the game of dice, consider the event “the number you get is greater
than 3”. In this case, A = {ω ∈ Ω : ω > 3}, and Ω = {1, 2, 3, 4, 5, 6}. Then, we have
A = {4, 5, 6} and hence P(A) = 1/2. That is, the probability of us rolling a number
greater than 3 is 1/2.
Example: When playing the dice game, we have Ω = {1, 2, 3, 4, 5, 6}. Then,
F = {Ω, ∅, {1, 2, 3}, {4, 5, 6}} is a σ-algebra. The above three conditions can be
easily verified. There are also other σ-algebras such as {Ω, ∅, {1, 2, 3, 4, 5}, {6}}.
Moreover, for any Ω with finite elements, the collection of all the subsets of Ω is
a σ-algebra.
Random variables
Based on the notion of probability triples, we can formally define random variables. They
are called variables, but they are actually functions that map from Ω to R. In particular,
a random variable assigns each outcome in Ω a numerical value, and hence it is a function:
X(ω) : Ω → R.
Not all mappings from Ω to R are random variables. The formal definition of a random
variable is as follows. A function X : Ω → R is a random variable if
A = {ω ∈ Ω|X(ω) ≤ x} ∈ F
for all x ∈ R. This definition indicates that X is a random variable only if X(ω) ≤ x is
an event in F. More information can be found in [96, Section 3.1].
A random variable X that takes values in a finite set X is called a simple random variable. It can be expressed as

X(ω) ≐ Σ_{x∈X} x 1_{A_x}(ω),

where

A_x ≐ {ω ∈ Ω | X(ω) = x} = X^{-1}(x)

and

1_{A_x}(ω) ≐ 1 if ω ∈ A_x, and 0 otherwise.    (B.1)

Here, 1_{A_x}(ω) is an indicator function 1_{A_x}(ω): Ω → {0, 1}. If ω is mapped to x, the indicator function equals one; otherwise, it equals zero. It is possible that multiple ω's in Ω map to the same value in X, but a single ω cannot be mapped to multiple values in X.
With the above preparation, the expectation of a simple random variable is defined
as
E[X] ≐ Σ_{x∈X} x P(A_x),    (B.2)

where A_x = {ω ∈ Ω | X(ω) = x}. The definition in (B.2) is similar to, but more formal than, the definition of expectation in the nonmeasure-theoretic case: E[X] = Σ_{x∈X} x p(x).
As a demonstrative example, we next calculate the expectation of the indicator function in (B.1). It is notable that the indicator function is also a random variable that
maps Ω to {0, 1} [96, Proposition 3.1.5]. As a result, we can calculate its expectation. In
particular, consider the indicator function 1_A, where A denotes any event. We have

E[1_A] = P(A).

Indeed,

E[1_A] = Σ_{z∈{0,1}} z P(1_A = z) = 0 · P(1_A = 0) + 1 · P(1_A = 1) = P(1_A = 1) = P(A).
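As a simple numerical illustration, the following sketch estimates E[1_A] by Monte Carlo for the die event A = {the number obtained is greater than 3} and compares it with P(A) = 1/2 (assuming a fair die).

```python
import numpy as np

# Monte Carlo estimate of E[1_A] for the assumed fair-die event A = {omega > 3}.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100000)     # outcomes omega in {1, ..., 6}
indicator = (rolls > 3).astype(float)       # 1_A(omega)

print(indicator.mean())                     # close to P(A) = 1/2
```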
While the expectation in (B.2) maps random variables to a specific value, we next intro-
duce a conditional expectation that maps random variables to another random variable.
Suppose that X, Y, Z are all random variables. Consider three cases. First, a condi-
tional expectation like E[X|Y = 2] or E[X|Y = 5] is a specific number. Second, E[X|Y = y],
where y is a variable, is a function of y. Third, E[X|Y ], where Y is a random variable,
is a function of Y and hence also a random variable. Since E[X|Y ] is also a random
variable, we can calculate, for example, its expectation.
We next examine the third case closely since it frequently emerges in the convergence
analyses of stochastic sequences. The rigorous definition is not covered here and can be
found in [96, Chapter 13]. We merely present some useful properties [101].
Lemma B.1 (Basic properties). Let X, Y, Z be random variables. The following proper-
ties hold.
(a) E[a|Y ] = a, where a is a given number.
(b) E[aX + bZ|Y ] = aE[X|Y ] + bE[Z|Y ].
(c) E[X|Y ] = E[X] if X, Y are independent.
(d) E[Xf (Y )|Y ] = f (Y )E[X|Y ].
(e) E[f (Y )|Y ] = f (Y ).
(f ) E[X|Y, f (Y )] = E[X|Y ].
(g) If X ≥ 0, then E[X|Y ] ≥ 0.
(h) If X ≥ Z, then E[X|Y ] ≥ E[Z|Y ].
Proof. We only prove some properties. The others can be proven similarly.
To prove E[a|Y ] = a as in (a), we can show that E[a|Y = y] = a is valid for any y
that Y can possibly take. This is clearly true, and the proof is complete.
To prove the property in (d), we can show that E[Xf(Y)|Y = y] = f(y)E[X|Y = y] for any y. This is valid because E[Xf(Y)|Y = y] = Σ_x x f(y) p(x|y) = f(y) Σ_x x p(x|y) = f(y)E[X|Y = y].
Since E[X|Y] is a random variable, we can calculate its expectation. The related properties, summarized in Lemma B.2, are useful for analyzing the convergence of stochastic sequences. In particular, property (b) of Lemma B.2 is the law of total expectation (also called the tower property): E[E[X|Y]] = E[X]. In the discrete case, it can be verified as E[E[X|Y]] = Σ_y E[X|Y = y]p(y) = Σ_y Σ_x x p(x|y)p(y) = Σ_x x p(x) = E[X].

The proof of the property in (c), E[E[X|Y]|Y] = E[X|Y], follows immediately from property (e) in Lemma B.1. That is because E[X|Y] is a function of Y. We denote this function as f(Y). It then follows that E[E[X|Y]|Y] = E[f(Y)|Y] = f(Y) = E[X|Y].
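The tower property can also be checked numerically. The sketch below verifies E[E[X|Y]] = E[X] on an assumed small joint distribution of (X, Y).

```python
import numpy as np

# Discrete check of the tower property E[E[X|Y]] = E[X] on an assumed p(x, y).
x_vals = np.array([0.0, 1.0, 4.0])
p_joint = np.array([[0.10, 0.15],           # p(x, y), rows indexed by x
                    [0.25, 0.05],
                    [0.30, 0.15]])

p_y = p_joint.sum(axis=0)                                     # p(y)
E_X_given_y = (x_vals[:, None] * p_joint).sum(axis=0) / p_y   # E[X|Y = y] for each y

print(E_X_given_y @ p_y)                    # E[E[X|Y]]
print(x_vals @ p_joint.sum(axis=1))         # E[X]; the two values coincide
```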
Convergence of stochastic sequences
One main reason why we care about measure-theoretic probability theory is that it can
rigorously describe the convergence properties of stochastic sequences.
Consider the stochastic sequence {X_k} ≐ {X_1, X_2, . . . , X_k, . . . }. Each element in this sequence is a random variable defined on a probability triple (Ω, F, P). When we say that {X_k} converges to a random variable X, we should be careful since there are different types of convergence, as shown below.
Sure convergence:
Definition: {X_k} converges surely (or everywhere, or pointwise) to X if

lim_{k→∞} X_k(ω) = X(ω) for all ω ∈ Ω.

It means that lim_{k→∞} X_k(ω) = X(ω) is valid for all points in Ω. This definition can be equivalently stated as

A = Ω, where A ≐ {ω ∈ Ω : lim_{k→∞} X_k(ω) = X(ω)}.

Almost sure convergence:
Definition: {X_k} converges almost surely (or with probability one) to X if

P(A) = 1, where A ≐ {ω ∈ Ω : lim_{k→∞} X_k(ω) = X(ω)}.    (B.3)

It means that lim_{k→∞} X_k(ω) = X(ω) is valid for almost all points in Ω. The points for which this limit is invalid form a set of measure zero. For the sake of simplicity, (B.3) is often written as

P( lim_{k→∞} X_k = X ) = 1.

Almost sure convergence can be denoted as X_k → X a.s.
Convergence in probability:
Definition: {X_k} converges in probability to X if, for any ε > 0,

lim_{k→∞} P(|X_k − X| > ε) = 0.

Convergence in the rth mean:
Definition: {X_k} converges in the rth mean to X if

lim_{k→∞} E[|X_k − X|^r] = 0,

where r ≥ 1. The most frequently used cases are r = 1 and r = 2; the case of r = 1 is often simply called convergence in mean. It is worth mentioning that convergence in mean is not equivalent to lim_{k→∞} E[X_k − X] = 0 or lim_{k→∞} E[X_k] = E[X]; the latter only indicates that E[X_k] converges, whereas the variance of X_k − X may not vanish.
Convergence in distribution:
Definition: The cumulative distribution function of Xk is defined as P(Xk ≤ a) where
a ∈ R. Then, {X_k} converges to X in distribution if the cumulative distribution function converges:

lim_{k→∞} P(X_k ≤ a) = P(X ≤ a)

for every a at which P(X ≤ a) is continuous. A compact expression is

lim_{k→∞} P(A_k) = P(A),

where

A_k ≐ {ω ∈ Ω : X_k(ω) ≤ a},  A ≐ {ω ∈ Ω : X(ω) ≤ a}.
The relationships between the above types of convergence are as follows: sure convergence implies almost sure convergence; almost sure convergence implies convergence in probability; convergence in the rth mean also implies convergence in probability; and convergence in probability implies convergence in distribution. Almost sure convergence and convergence in mean do not imply each other. More information can be found in [102].
Appendix C
Convergence of Sequences
We next introduce some results about the convergence of deterministic and stochastic se-
quences. These results are useful for analyzing the convergence of reinforcement learning
algorithms such as those in Chapters 6 and 7.
We first consider deterministic sequences and then stochastic sequences.
C.1 Convergence of deterministic sequences

Theorem C.1 (Monotone convergence). If a deterministic sequence {x_k} is nonincreasing, that is, x_{k+1} ≤ x_k for all k, and bounded from below, then the sequence is convergent. Similarly, if {x_k} is nondecreasing and bounded from above, then the sequence is convergent.

Many sequences encountered in practice are, however, not monotonic. We next consider sequences satisfying

x_{k+1} ≤ x_k + η_k,

where η_k ∈ R.
In the simple case of ηk = 0, we have xk+1 ≤ xk , and the sequence is monotonic. We now
focus on a more general case where ηk ≥ 0. In this case, the sequence is not monotonic
because xk+1 may be greater than xk . Nevertheless, we can still ensure the convergence
of the sequence under some mild conditions.
For any z ∈ R, define z^+ ≐ max(z, 0) ≥ 0 and z^− ≐ min(z, 0) ≤ 0, so that

z = z^+ + z^−

for all z ∈ R.

Theorem C.2. Suppose that {x_k} is bounded from below and that the positive increments are summable:

Σ_{k=1}^∞ (x_{k+1} − x_k)^+ < ∞.    (C.2)

Then, the sequence {x_k} is convergent.

Proof. Write x_k = x_1 + S_k, where S_k ≐ Σ_{i=1}^{k−1} (x_{i+1} − x_i). Note that S_k can be decomposed as

S_k = Σ_{i=1}^{k−1} (x_{i+1} − x_i) = S_k^+ + S_k^−,

where

S_k^+ ≐ Σ_{i=1}^{k−1} (x_{i+1} − x_i)^+ ≥ 0,  S_k^− ≐ Σ_{i=1}^{k−1} (x_{i+1} − x_i)^− ≤ 0.

Due to (C.2), {S_k^+} is nondecreasing and bounded from above, and hence convergent. Since x_k = x_1 + S_k^+ + S_k^− is bounded from below and S_k^+ is bounded, {S_k^−} is nonincreasing and bounded from below, and hence convergent. Therefore, x_k = x_1 + S_k^+ + S_k^− is convergent.
Theorem C.2 is more general than Theorem C.1 because it allows xk to increase as
long as the increase is damped as in (C.2). In the monotonic case, Theorem C.2 still
applies. In particular, if x_{k+1} ≤ x_k for all k, then Σ_{k=1}^∞ (x_{k+1} − x_k)^+ = 0. In this case, (C.2) is still satisfied and the convergence follows.
If xk+1 ≤ xk + ηk , the next result provides a condition for ηk to ensure the convergence
of {xk }. This result is an immediate corollary of Theorem C.2.
Theorem C.3. Suppose that {x_k} is bounded from below and

x_{k+1} ≤ x_k + η_k

for all k, where η_k ≥ 0 and Σ_{k=1}^∞ η_k < ∞. Then, the sequence {x_k} is convergent.

Proof. Since x_{k+1} − x_k ≤ η_k and η_k ≥ 0, we have (x_{k+1} − x_k)^+ ≤ η_k and hence Σ_{k=1}^∞ (x_{k+1} − x_k)^+ ≤ Σ_{k=1}^∞ η_k < ∞. As a result, (C.2) is satisfied and the convergence follows from Theorem C.2.
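The following sketch illustrates Theorem C.3 with an assumed example sequence that is not monotonic but has summable positive increments.

```python
# Assumed example of Theorem C.3: x_{k+1} <= x_k + eta_k with x_k bounded below
# (here by zero) and sum_k eta_k < infinity, so the sequence converges even
# though it is not monotonic (it increases at the first step).
x, xs = 1.0, []
for k in range(1, 2001):
    eta = 1.0 / k**2                 # nonnegative and summable
    x = 0.9 * x + eta                # then x_{k+1} - x_k = -0.1*x + eta <= eta
    xs.append(x)

print(xs[-5:])                       # the tail is (nearly) constant: the sequence converges
```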
C.2 Convergence of stochastic sequences

We next consider stochastic sequences. A stochastic sequence {X_k} with E[|X_k|] < ∞ is called a martingale if

E[X_{k+1}|X_1, . . . , X_k] = X_k    (C.3)

for all k. Taking the expectation on both sides of (C.3) gives

E[X_{k+1}] = E[X_k].

This result can be obtained by calculating the expectation on both sides of (C.3) based on property (b) in Lemma B.2.

A stochastic sequence {X_k} is called a submartingale if

E[X_{k+1}|X_1, . . . , X_k] ≥ X_k    (C.4)

for all k.

Taking the expectation on both sides of (C.4) yields E[X_{k+1}] ≥ E[X_k]. In particular, the left-hand side leads to E[E[X_{k+1}|X_1, . . . , X_k]] = E[X_{k+1}] due to property (b) in Lemma B.2. By induction, we have

E[X_k] ≥ E[X_{k−1}] ≥ · · · ≥ E[X_2] ≥ E[X_1]

for all k.

A stochastic sequence {X_k} is called a supermartingale if

E[X_{k+1}|X_1, . . . , X_k] ≤ X_k    (C.5)

for all k.

Taking the expectation on both sides of (C.5) gives E[X_{k+1}] ≤ E[X_k]. By induction, we have

E[X_k] ≤ E[X_{k−1}] ≤ · · · ≤ E[X_2] ≤ E[X_1].
The names “submartingale” and “supermartingale” are standard, but it may not be easy for beginners to distinguish them. Some tricks can be employed to do so. For example, since “supermartingale” has a letter “p” that points down, its expectation decreases; since “submartingale” has a letter “b” that points up, its expectation increases [104].
A supermartingale or submartingale is comparable to a deterministic monotonic se-
quence. While the convergence result for monotonic sequences has been given in Theo-
rem C.1, we provide a similar convergence result for martingales as follows.
The event A_k is defined as A_k ≐ {ω ∈ Ω : E[X_{k+1} − X_k|H_k] ≥ 0}, where H_k = {X_1, . . . , X_k}. Intuitively, A_k indicates that X_{k+1} is greater than X_k in expectation. Let 1_{A_k} be an indicator function:

1_{A_k} = 1 if E[X_{k+1} − X_k|H_k] ≥ 0, and 1_{A_k} = 0 if E[X_{k+1} − X_k|H_k] < 0.

Note that

1 = 1_A + 1_{A^c}

for any event A, where A^c denotes the complementary event of A. As a result, it holds for any random variable X that

X = 1_A X + 1_{A^c} X.

Theorem C.4 (Quasimartingale convergence). Suppose that {X_k} is a nonnegative stochastic sequence. If

Σ_{k=1}^∞ E[(X_{k+1} − X_k) 1_{A_k}] < ∞,

then Σ_{k=1}^∞ E[(X_{k+1} − X_k) 1_{A_k^c}] > −∞ and there is a finite random variable X such that X_k → X almost surely as k → ∞.
Theorem C.4 can be viewed as an analogy of Theorem C.2, which is for nonmonotonic deterministic sequences. The proof of this theorem can be found in [105, Proposition 9.5]. Note that X_k here is required to be nonnegative. As a result, the boundedness of Σ_{k=1}^∞ E[(X_{k+1} − X_k) 1_{A_k}] implies the boundedness of Σ_{k=1}^∞ E[(X_{k+1} − X_k) 1_{A_k^c}].
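The following sketch simulates an assumed nonnegative martingale X_{k+1} = X_k U_k with E[U_k] = 1; each sample path settles to a finite limit (here, zero), consistent with the almost sure convergence guaranteed by martingale convergence results such as Theorem C.4.

```python
import numpy as np

# Simulation of an assumed nonnegative martingale: X_{k+1} = X_k * U_k with
# U_k independent and E[U_k] = 1, so E[X_{k+1} | X_1, ..., X_k] = X_k.
rng = np.random.default_rng(0)
n_paths, n_steps = 5, 200
X = np.ones((n_paths, n_steps + 1))
for k in range(n_steps):
    U = rng.uniform(0.0, 2.0, size=n_paths)   # nonnegative with E[U] = 1
    X[:, k + 1] = X[:, k] * U

print(X[:, -1])   # each sample path is near its (finite) limit, here zero,
                  # even though E[X_k] = 1 for every k
```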
We finally summarize and compare the results for deterministic and stochastic sequences.

Deterministic sequences: a monotonic sequence that is bounded is convergent (Theorem C.1); a non-monotonic sequence is convergent if its positive increments are summable (Theorems C.2 and C.3).

Stochastic sequences: a supermartingale or submartingale is monotonic in expectation and is comparable to a monotonic deterministic sequence; a quasimartingale is non-monotonic in expectation, and its convergence is guaranteed by the summability condition in Theorem C.4.
Appendix D

Gradient Descent
We next present some preliminaries for the gradient descent method, which is one of the
most frequently used optimization methods. The gradient descent method is also the
foundation for the stochastic gradient descent method introduced in Chapter 6.
Convexity
Definitions:
- Convex set: Suppose that D is a subset of R^n. This set is convex if z ≐ cx + (1 − c)y ∈ D for any x, y ∈ D and any c ∈ [0, 1].

- Convex function: Suppose f: D → R, where D is convex. Then, the function f(x) is convex if

f(cx + (1 − c)y) ≤ cf(x) + (1 − c)f(y)

for any x, y ∈ D and any c ∈ [0, 1].
Convex conditions:

- First-order condition: Suppose that f: D → R is differentiable and D is convex. Then, f is convex if and only if

f(y) ≥ f(x) + ∇f(x)^T (y − x)    (D.1)

for all x, y ∈ D. When x is a scalar, ∇f(x) represents the slope of the tangent line of f(x) at x. The geometric interpretation of (D.1) is that the point (y, f(y)) is always located above the tangent line.
- Second-order condition: Consider a twice-differentiable function f: D → R, where D is convex. Then, f is convex if

∇^2 f(x) ⪰ 0

for all x ∈ D, that is, if the Hessian matrix is positive semidefinite everywhere.
Degree of convexity:
Given a convex function, it is often of interest how strong its convexity is. The Hessian
matrix is a useful tool for describing the degree of convexity. If ∇^2 f(x) is close to rank deficiency at a point, then the function is flat around that point and hence weakly convex. Otherwise, if the minimum singular value of ∇^2 f(x) is positive and large, the function is sharply curved around that point and hence strongly convex. The degree of convexity influences the step size selection in gradient descent algorithms.
The lower and upper bounds of ∇2 f (x) play an important role in characterizing the
function convexity.
Lemma D.1. If ∇f(x) is Lipschitz continuous with a constant L, that is, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y, then ∇^2 f(x) ⪯ L I_n for all x. Here, ‖ · ‖ denotes the Euclidean norm.
Gradient descent algorithm

Consider the optimization problem

min_x f(x),

where f is convex and differentiable. The gradient descent algorithm for solving this problem is

x_{k+1} = x_k − α_k ∇f(x_k),    (D.2)

where α_k > 0 is the step size. The behavior of the update −α_k ∇f(x_k) is determined by two factors: the magnitude of the gradient and the step size.
- Magnitude of ∇f (xk ):
When xk is close to the optimum x∗ where ∇f (x∗ ) = 0, the magnitude k∇f (xk )k
is small. In this case, the update process of xk is slow, which is reasonable because
we do not want to update x too aggressively and miss the optimum.
When xk is far from the optimum, the magnitude of ∇f (xk ) may be large, and
hence, the update process of xk is fast. This is also reasonable because we hope
that the estimate can approach the optimum as quickly as possible.
- Step size α_k:

If α_k is small, the magnitude of −α_k ∇f(x_k) is small, and hence the convergence process is slow. If α_k is large, the update of x_k is aggressive, which may speed up convergence but may also overshoot the optimum or even cause divergence.
How should α_k be selected? The selection of α_k should depend on the degree of convexity of f. If the function is sharply curved around the optimum (the degree of convexity is strong), then the step size α_k should be small to guarantee convergence. If the function is flat around the optimum (the degree of convexity is weak), then the step size can be larger so that x_k can approach the optimum more quickly. The above intuition will be verified in the following convergence analysis.
Convergence analysis
We next present a proof of the convergence of the gradient descent algorithm in (D.2).
That is to show xk converges to the optimum x∗ where ∇f (x∗ ) = 0. First of all, we make
some assumptions.
Assumption 1: f(x) is strongly convex in the sense that

∇^2 f(x) ⪰ ℓ I_n

for all x, where ℓ > 0.
Assumption 2: ∇f (x) is Lipschitz continuous with a constant L. This assumption
implies the following inequality according to Lemma D.1:
∇^2 f(x) ⪯ L I_n.
Proof. For any xk+1 and xk , it follows from [106, Section 9.1.2] that
f(x_{k+1}) = f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (1/2)(x_{k+1} − x_k)^T ∇^2 f(z_k)(x_{k+1} − x_k),    (D.3)
where z_k is a convex combination of x_k and x_{k+1}. Since it is assumed that ∇^2 f(z_k) ⪯ L I_n, we have ‖∇^2 f(z_k)‖ ≤ L. Then, (D.3) implies

f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (1/2) ‖∇^2 f(z_k)‖ ‖x_{k+1} − x_k‖^2
          ≤ f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (L/2) ‖x_{k+1} − x_k‖^2.
Substituting x_{k+1} = x_k − α_k ∇f(x_k) into the above inequality yields

f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)^T (−α_k ∇f(x_k)) + (L/2) ‖α_k ∇f(x_k)‖^2
          = f(x_k) − α_k ‖∇f(x_k)‖^2 + (α_k^2 L / 2) ‖∇f(x_k)‖^2
          = f(x_k) − α_k (1 − α_k L / 2) ‖∇f(x_k)‖^2,    (D.4)

where we denote η_k ≐ α_k (1 − α_k L / 2).
Suppose that the step size is selected such that

0 < α_k < 2/L,    (D.5)

so that η_k > 0. Then, (D.4) implies f(x_{k+1}) ≤ f(x_k), and hence {f(x_k)} is nonincreasing. Since f is bounded from below by f(x^∗), the sequence {f(x_k)} is convergent, and consequently η_k ‖∇f(x_k)‖^2 ≤ f(x_k) − f(x_{k+1}) → 0. Since η_k ‖∇f(x_k)‖^2 ≥ 0, it follows that lim_{k→∞} η_k ‖∇f(x_k)‖^2 = 0, and hence ∇f(x_k) → 0 provided that η_k is bounded away from zero. By the strong convexity in Assumption 1, x_k converges to the optimum x^∗ where ∇f(x^∗) = 0. The proof is complete. The above proof is inspired by [107].
The inequality in (D.5) provides valuable insights into how αk should be selected. If
the function is flat (L is small), the step size can be large; otherwise, if the function
is strongly convex (L is large), then the step size must be sufficiently small to ensure
convergence. There are also many other ways to prove the convergence such as the
contraction mapping theorem [108, Lemma 3]. A comprehensive introduction to convex
optimization can be found in [106].
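The role of L in the step size selection can be illustrated numerically. The sketch below runs gradient descent on an assumed strongly convex quadratic, for which L is the largest eigenvalue of Q; step sizes below 2/L converge, while a step size above 2/L diverges.

```python
import numpy as np

# Gradient descent on an assumed quadratic f(x) = 0.5 x^T Q x,
# for which the Lipschitz constant L of grad f is the largest eigenvalue of Q.
Q = np.array([[10.0, 0.0], [0.0, 1.0]])
L = np.linalg.eigvalsh(Q).max()

def run(alpha, steps=100):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - alpha * (Q @ x)          # x_{k+1} = x_k - alpha * grad f(x_k)
    return np.linalg.norm(x)

for alpha in [0.5 / L, 1.0 / L, 1.9 / L, 2.1 / L]:
    print(alpha, run(alpha))             # step sizes below 2/L converge toward x* = 0;
                                         # the last one exceeds 2/L and diverges
```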
Bibliography
[4] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge University Press, 2012.
[8] A. Besenyei, “A brief history of the mean value theorem,” 2012. Lecture notes.
[9] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transforma-
tions: Theory and application to reward shaping,” in International Conference on
Machine Learning, vol. 99, pp. 278–287, 1999.
[15] F.-M. Luo, T. Xu, H. Lai, X.-H. Chen, W. Zhang, and Y. Yu, “A survey on model-
based reinforcement learning,” arXiv:2206.09328, 2022.
[24] H.-F. Chen, Stochastic approximation and its applications, vol. 64. Springer Science
& Business Media, 2006.
[26] J. Venter, “An extension of the Robbins-Monro procedure,” The Annals of Mathe-
matical Statistics, vol. 38, no. 1, pp. 181–190, 1967.
[28] J. Lagarias, “Euler’s constant: Euler’s work and modern developments,” Bulletin
of the American Mathematical Society, vol. 50, no. 4, pp. 527–628, 2013.
[29] J. H. Conway and R. Guy, The book of numbers. Springer Science & Business
Media, 1998.
[33] T. Kailath, A. H. Sayed, and B. Hassibi, Linear estimation. Prentice Hall, 2000.
[36] H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering, “A theoretical and
empirical analysis of Expected Sarsa,” in IEEE Symposium on Adaptive Dynamic
Programming and Reinforcement Learning, pp. 177–184, 2009.
[37] M. Ganger, E. Duryea, and W. Hu, “Double Sarsa and double expected Sarsa with
shallow and deep learning,” Journal of Data Analysis and Information Processing,
vol. 4, no. 4, pp. 159–176, 2016.
[38] C. J. C. H. Watkins, Learning from delayed rewards. PhD thesis, King’s College,
1989.
[39] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[42] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double
Q-learning,” in AAAI Conference on Artificial Intelligence, vol. 30, 2016.
[43] C. Dann, G. Neumann, and J. Peters, “Policy evaluation with temporal differences:
A survey and comparison,” Journal of Machine Learning Research, vol. 15, pp. 809–
883, 2014.
[44] J. Clifton and E. Laber, “Q-learning: Theory and applications,” Annual Review of
Statistics and Its Application, vol. 7, pp. 279–301, 2020.
[47] G. Strang, Linear algebra and its applications (4th Edition). Belmont, CA: Thom-
son, Brooks/Cole, 2006.
[48] C. D. Meyer and I. Stewart, Matrix analysis and applied linear algebra. SIAM,
2023.
[51] R. Munos, “Error bounds for approximate policy iteration,” in International Con-
ference on Machine Learning, vol. 3, pp. 560–567, 2003.
[53] B. Scherrer, “Should one compute the temporal difference fix point or minimize the
Bellman residual? the unified oblique projection view,” in International Conference
on Machine Learning, 2010.
[56] S. S. Dragomir, “Some reverses of the Jensen inequality with applications,” Bulletin
of the Australian Mathematical Society, vol. 87, no. 2, pp. 177–194, 2013.
[57] S. J. Bradtke and A. G. Barto, “Linear least-squares algorithms for temporal dif-
ference learning,” Machine Learning, vol. 22, no. 1, pp. 33–57, 1996.
[58] K. S. Miller, “On the inverse of the sum of matrices,” Mathematics Magazine,
vol. 54, no. 2, pp. 67–72, 1981.
[59] S. A. U. Islam and D. S. Bernstein, “Recursive least squares for real-time imple-
mentation,” IEEE Control Systems Magazine, vol. 39, no. 3, pp. 82–85, 2019.
[61] J. Fan, Z. Wang, Y. Xie, and Z. Yang, “A theoretical analysis of deep Q-learning,”
in Learning for Dynamics and Control, pp. 486–489, 2020.
[62] L.-J. Lin, Reinforcement learning for robots using neural networks. 1992. Technical
report.
[63] J. N. Tsitsiklis and B. Van Roy, “An analysis of temporal-difference learning with
function approximation,” IEEE Transactions on Automatic Control, vol. 42, no. 5,
pp. 674–690, 1997.
[67] X.-R. Cao, “A basic formula for online policy gradient algorithms,” IEEE Trans-
actions on Automatic Control, vol. 50, no. 5, pp. 696–699, 2005.
[69] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradi-
ents,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
[76] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maxi-
mum entropy deep reinforcement learning with a stochastic actor,” in International
Conference on Machine Learning, pp. 1861–1870, 2018.
[78] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy
optimization,” in International Conference on Machine Learning, pp. 1889–1897,
2015.
[82] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-
agent actor-critic for mixed cooperative-competitive environments,” Advances in
Neural Information Processing Systems, vol. 30, 2017.
[83] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, “Mean field multi-
agent reinforcement learning,” in International Conference on Machine Learning,
pp. 5571–5580, 2018.
[85] Y. Yang and J. Wang, “An overview of multi-agent reinforcement learning from
game theoretical perspective,” arXiv:2011.00583, 2020.
[87] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-
based policy optimization,” Advances in Neural Information Processing Systems,
vol. 32, 2019.
[90] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive dynamic programming for control:
algorithms and stability. Springer Science & Business Media, 2012.
[92] F. L. Lewis and D. Liu, Reinforcement learning and approximate dynamic program-
ming for feedback control. John Wiley & Sons, 2013.
[93] Z.-P. Jiang, T. Bian, and W. Gao, “Learning-based control: A tutorial and some
recent results,” Foundations and Trends in Systems and Control, vol. 8, no. 3,
pp. 176–284, 2020.
[94] S. Meyn, Control systems and reinforcement learning. Cambridge University Press,
2022.
[95] S. E. Li, Reinforcement learning for sequential decision and optimal control.
Springer, 2023.
[96] J. S. Rosenthal, First look at rigorous probability theory (2nd Edition). World
Scientific Publishing Company, 2006.
[99] R. G. Bartle, The elements of integration and Lebesgue measure. John Wiley &
Sons, 2014.
[100] M. Taboga, Lectures on probability theory and mathematical statistics (2nd Edition).
CreateSpace Independent Publishing Platform, 2012.
[102] A. W. Van der Vaart, Asymptotic statistics. Cambridge University Press, 2000.
[108] A. Jung, “A fixed-point of view on gradient methods for big data,” Frontiers in
Applied Mathematics and Statistics, vol. 3, p. 18, 2017.
Symbols
= equality
≈ approximation
≐ equality by definition
≥, >, ≤, < elementwise comparison
∈ is an element of
‖ · ‖2 Euclidean norm of a vector or the corresponding induced matrix norm
‖ · ‖∞ maximum norm of a vector or the corresponding induced matrix norm
ln natural logarithm
R set of real numbers
R^n set of n-dimensional real vectors
R^{n×m} set of all n × m-dimensional real matrices
A ⪰ 0 (A ≻ 0) matrix A is positive semidefinite (definite)
A ⪯ 0 (A ≺ 0) matrix A is negative semidefinite (definite)
|x| absolute value of real scalar x
|S| number of elements in set S
∇x f (x) gradient of scalar function f (x) with respect to vector
x. It may be written as ∇f (x) for short.
[A]ij element in the ith row and jth column of matrix A
[x]i ith element of vector x
X∼p p is the probability distribution of random variable
X.
p(X = x), Pr(X = x) probability of X = x. They are often written as p(x)
or Pr(x) for short.
p(x|y) conditional probability
EX∼p [X] expectation or expected value of random variable X.
It is often written as E[X] for short when the distri-
bution of X is clear.