
Book draft

Mathematical Foundations
of
Reinforcement Learning

Shiyu Zhao

August, 2023
Contents

Contents
Preface
Overview of this Book

1 Basic Concepts
1.1 A grid world example
1.2 State and action
1.3 State transition
1.4 Policy
1.5 Reward
1.6 Trajectories, returns, and episodes
1.7 Markov decision processes
1.8 Summary
1.9 Q&A

2 State Values and Bellman Equation
2.1 Motivating example 1: Why are returns important?
2.2 Motivating example 2: How to calculate returns?
2.3 State values
2.4 Bellman equation
2.5 Examples for illustrating the Bellman equation
2.6 Matrix-vector form of the Bellman equation
2.7 Solving state values from the Bellman equation
2.7.1 Closed-form solution
2.7.2 Iterative solution
2.7.3 Illustrative examples
2.8 From state value to action value
2.8.1 Illustrative examples
2.8.2 The Bellman equation in terms of action values
2.9 Summary
2.10 Q&A

3 Optimal State Values and Bellman Optimality Equation
3.1 Motivating example: How to improve policies?
3.2 Optimal state values and optimal policies
3.3 Bellman optimality equation
3.3.1 Maximization of the right-hand side of the BOE
3.3.2 Matrix-vector form of the BOE
3.3.3 Contraction mapping theorem
3.3.4 Contraction property of the right-hand side of the BOE
3.4 Solving an optimal policy from the BOE
3.5 Factors that influence optimal policies
3.6 Summary
3.7 Q&A

4 Value Iteration and Policy Iteration
4.1 Value iteration
4.1.1 Elementwise form and implementation
4.1.2 Illustrative examples
4.2 Policy iteration
4.2.1 Algorithm analysis
4.2.2 Elementwise form and implementation
4.2.3 Illustrative examples
4.3 Truncated policy iteration
4.3.1 Comparing value iteration and policy iteration
4.3.2 Truncated policy iteration algorithm
4.4 Summary
4.5 Q&A

5 Monte Carlo Methods
5.1 Motivating example: Mean estimation
5.2 MC Basic: The simplest MC-based algorithm
5.2.1 Converting policy iteration to be model-free
5.2.2 The MC Basic algorithm
5.2.3 Illustrative examples
5.3 MC Exploring Starts
5.3.1 Utilizing samples more efficiently
5.3.2 Updating policies more efficiently
5.3.3 Algorithm description
5.4 MC ε-Greedy: Learning without exploring starts
5.4.1 ε-greedy policies
5.4.2 Algorithm description
5.4.3 Illustrative examples
5.5 Exploration and exploitation of ε-greedy policies
5.6 Summary
5.7 Q&A

6 Stochastic Approximation
6.1 Motivating example: Mean estimation
6.2 Robbins-Monro algorithm
6.2.1 Convergence properties
6.2.2 Application to mean estimation
6.3 Dvoretzky's convergence theorem
6.3.1 Proof of Dvoretzky's theorem
6.3.2 Application to mean estimation
6.3.3 Application to the Robbins-Monro theorem
6.3.4 An extension of Dvoretzky's theorem
6.4 Stochastic gradient descent
6.4.1 Application to mean estimation
6.4.2 Convergence pattern of SGD
6.4.3 A deterministic formulation of SGD
6.4.4 BGD, SGD, and mini-batch GD
6.4.5 Convergence of SGD
6.5 Summary
6.6 Q&A

7 Temporal-Difference Methods
7.1 TD learning of state values
7.1.1 Algorithm description
7.1.2 Property analysis
7.1.3 Convergence analysis
7.2 TD learning of action values: Sarsa
7.2.1 Algorithm description
7.2.2 Optimal policy learning via Sarsa
7.3 TD learning of action values: n-step Sarsa
7.4 TD learning of optimal action values: Q-learning
7.4.1 Algorithm description
7.4.2 Off-policy vs on-policy
7.4.3 Implementation
7.4.4 Illustrative examples
7.5 A unified viewpoint
7.6 Summary
7.7 Q&A

8 Value Function Approximation
8.1 Value representation: From table to function
8.2 TD learning of state values based on function approximation
8.2.1 Objective function
8.2.2 Optimization algorithms
8.2.3 Selection of function approximators
8.2.4 Illustrative examples
8.2.5 Theoretical analysis
8.3 TD learning of action values based on function approximation
8.3.1 Sarsa with function approximation
8.3.2 Q-learning with function approximation
8.4 Deep Q-learning
8.4.1 Algorithm description
8.4.2 Illustrative examples
8.5 Summary
8.6 Q&A

9 Policy Gradient Methods
9.1 Policy representation: From table to function
9.2 Metrics for defining optimal policies
9.3 Gradients of the metrics
9.3.1 Derivation of the gradients in the discounted case
9.3.2 Derivation of the gradients in the undiscounted case
9.4 Monte Carlo policy gradient (REINFORCE)
9.5 Summary
9.6 Q&A

10 Actor-Critic Methods
10.1 The simplest actor-critic algorithm (QAC)
10.2 Advantage actor-critic (A2C)
10.2.1 Baseline invariance
10.2.2 Algorithm description
10.3 Off-policy actor-critic
10.3.1 Importance sampling
10.3.2 The off-policy policy gradient theorem
10.3.3 Algorithm description
10.4 Deterministic actor-critic
10.4.1 The deterministic policy gradient theorem
10.4.2 Algorithm description
10.5 Summary
10.6 Q&A

A Preliminaries for Probability Theory
B Measure-Theoretic Probability Theory
C Convergence of Sequences
C.1 Convergence of deterministic sequences
C.2 Convergence of stochastic sequences
D Preliminaries for Gradient Descent

Bibliography
Symbols
Index

Preface

This book aims to provide a mathematical but friendly introduction to the fundamental
concepts, basic problems, and classic algorithms in reinforcement learning. Some essential
features of this book are highlighted as follows.

• The book introduces reinforcement learning from a mathematical point of view. Hope-
fully, readers will not only know the procedure of an algorithm but also understand
why the algorithm was designed in the first place and why it works effectively.
• The depth of the mathematics is carefully controlled to an adequate level. The math-
ematics is also presented in a carefully designed manner to ensure that the book is
friendly to read. Readers can read the materials presented in gray boxes selectively
according to their interests.
• Many illustrative examples are given to help readers better understand the topics. All
the examples in this book are based on a grid world task, which is easy to understand
and helpful for illustrating concepts and algorithms.
• When introducing an algorithm, the book aims to separate its core idea from compli-
cations that may be distracting. In this way, readers can better grasp the core idea
of an algorithm.
• The contents of the book are coherently organized. Each chapter is built based on the
preceding chapter and lays a necessary foundation for the subsequent one.

This book is designed for senior undergraduate students, graduate students, researchers, and practitioners interested in reinforcement learning. It does not require readers to
have any background in reinforcement learning because it starts by introducing the most
basic concepts. If the reader already has some background in reinforcement learning, I
believe the book can help them understand some topics more deeply or provide differ-
ent perspectives. This book, however, requires the reader to have some knowledge of
probability theory and linear algebra. Some basics of the required mathematics are also
included in the appendix of this book.
I have been teaching a graduate-level course on reinforcement learning since 2019. I
want to thank the students in my class for their feedback on my teaching. I put the draft
of this book online in August 2022. Up to now, I have received valuable feedback from
many readers. I want to express my gratitude to these readers. Moreover, I would like
to thank my research assistant, Jialing Lv, for her excellent support in editing the book
and my lecture videos; my teaching assistants, Jianan Li and Yize Mi, for their help in
my teaching; my Ph.D. student Canlun Zheng for his help in the design of a picture in
the book; and my family for their wonderful support. Finally, I would like to thank the
editors of this book, Mr. Sai Guo and Dr. Lanlan Chang from Tsinghua University Press
and Springer Nature Press, for their great support.
I sincerely hope this book can help readers smoothly enter the exciting field of rein-
forcement learning.

Shiyu Zhao

Overview of this Book

[Figure omitted: a diagram relating the ten chapters. Chapters 1-3 (basic concepts, the Bellman equation, and the Bellman optimality equation) provide fundamental tools; Chapters 4-7 move from model-based value and policy iteration to model-free Monte Carlo, stochastic approximation, and temporal-difference methods; Chapters 8-10 move from tabular to function representations with value function approximation, policy gradient, and actor-critic methods.]

Figure 1: The map of this book.

Before we start the journey, it is important to look at the “map” of the book shown
in Figure 1. This book contains ten chapters, which can be classified into two parts: the
first part is about basic tools, and the second part is about algorithms. The ten chapters
are highly correlated. In general, it is necessary to study the earlier chapters first before
the later ones.
Next, please follow me on a quick tour through the ten chapters. Two aspects of each
chapter will be covered. The first aspect is the contents introduced in each chapter, and
the second aspect is its relationships with the previous and subsequent chapters. A heads
up for you to read this overview is as follows. The purpose of this overview is to give you
an impression of the contents and structure of this book. It is all right if you encounter
many concepts you do not understand. Hopefully, you can make a proper study plan
that is suitable for you after reading this overview.

• Chapter 1 introduces the basic concepts such as states, actions, rewards, returns, and
policies, which are widely used in the subsequent chapters. These concepts are first
introduced based on a grid world example, where a robot aims to reach a prespecified
target. Then, the concepts are introduced in a more formal manner based on the
framework of Markov decision processes.
• Chapter 2 introduces two key elements. The first is a key concept, and the second is
a key tool. The key concept is the state value, which is defined as the expected return
that an agent can obtain when starting from a state if it follows a given policy. The
greater the state value is, the better the corresponding policy is. Thus, state values
can be used to evaluate whether a policy is good or not.
The key tool is the Bellman equation, which can be used to analyze state values. In
a nutshell, the Bellman equation describes the relationship between the values of all
states. By solving the Bellman equation, we can obtain the state values. Such a
process is called policy evaluation, which is a fundamental concept in reinforcement
learning. Finally, this chapter introduces the concept of action values.
• Chapter 3 also introduces two key elements. The first is a key concept, and the
second is a key tool. The key concept is the optimal policy. An optimal policy has the
greatest state values compared to other policies. The key tool is the Bellman optimality
equation. As its name suggests, the Bellman optimality equation is a special Bellman
equation.
Here is a fundamental question: what is the ultimate goal of reinforcement learn-
ing? The answer is to obtain optimal policies. The Bellman optimality equation is
important because it can be used to obtain optimal policies. We will see that the
Bellman optimality equation is elegant and can help us thoroughly understand many
fundamental problems.

The first three chapters constitute the first part of this book. This part lays the
necessary foundations for the subsequent chapters. Starting in Chapter 4, the book
introduces algorithms for learning optimal policies.

• Chapter 4 introduces three algorithms: value iteration, policy iteration, and truncated
policy iteration. The three algorithms have close relationships with each other. First,
the value iteration algorithm is exactly the algorithm introduced in Chapter 3 for
solving the Bellman optimality equation. Second, the policy iteration algorithm is
an extension of the value iteration algorithm. It is also the foundation for Monte
Carlo (MC) algorithms introduced in Chapter 5. Third, the truncated policy iteration
algorithm is a unified version that includes the value iteration and policy iteration
algorithms as special cases.


The three algorithms share the same structure. That is, every iteration has two steps.
One step is to update the value, and the other step is to update the policy. The idea
of the interaction between value and policy updates widely exists in reinforcement
learning algorithms. This idea is also known as generalized policy iteration. In ad-
dition, the algorithms introduced in this chapter are actually dynamic programming
algorithms, which require system models. By contrast, all the algorithms introduced
in the subsequent chapters do not require models. It is important to well understand
the contents of this chapter before proceeding to the subsequent ones.
• Starting in Chapter 5, we introduce model-free reinforcement learning algorithms that
do not require system models. While this is the first time we introduce model-free
algorithms in this book, we must fill a knowledge gap: how to find optimal policies
without models? The philosophy is simple. If we do not have a model, we must have
some data. If we do not have data, we must have a model. If we have neither, then we
can do nothing. The “data” in reinforcement learning refer to the experience samples
generated when the agent interacts with the environment.
This chapter introduces three algorithms based on MC estimation that can learn
optimal policies from experience samples. The first and simplest algorithm is MC
Basic, which can be readily obtained by extending the policy iteration algorithm
introduced in Chapter 4. Understanding the MC Basic algorithm is important for
grasping the fundamental idea of MC-based reinforcement learning. By extending
this algorithm, we further introduce two more complicated but more efficient MC-
based algorithms. The fundamental trade-off between exploration and exploitation is
also elaborated in this chapter.

Up to this point, the reader may have noticed that the contents of these chapters are
highly correlated. For example, if we want to study the MC algorithms (Chapter 5), we
must first understand the policy iteration algorithm (Chapter 4). To study the policy
iteration algorithm, we must first know the value iteration algorithm (Chapter 4). To
comprehend the value iteration algorithm, we first need to understand the Bellman opti-
mality equation (Chapter 3). To understand the Bellman optimality equation, we need
to study the Bellman equation (Chapter 2) first. Therefore, it is highly recommended to
study the chapters one by one. Otherwise, it may be difficult to understand the contents
in the later chapters.

• There is a knowledge gap when we move from Chapter 5 to Chapter 7: the algorithms
in Chapter 7 are incremental, but the algorithms in Chapter 5 are non-incremental.
Chapter 6 is designed to fill this knowledge gap by introducing the stochastic ap-
proximation theory. Stochastic approximation refers to a broad class of stochastic
iterative algorithms for solving root-finding or optimization problems. The classic
Robbins-Monro and stochastic gradient descent algorithms are special stochastic ap-
proximation algorithms. Although this chapter does not introduce any reinforcement
learning algorithms, it is important because it lays the necessary foundations for studying Chapter 7.
• Chapter 7 introduces the classic temporal-difference (TD) algorithms. With the prepa-
ration in Chapter 6, I believe the reader will not be surprised when seeing the TD
algorithms. From a mathematical point of view, TD algorithms can be viewed as
stochastic approximation algorithms for solving the Bellman or Bellman optimality
equations. Like Monte Carlo learning, TD learning is also model-free, but it has some
advantages due to its incremental form. For example, it can learn in an online manner:
it can update the value estimate every time an experience sample is received. This
chapter introduces quite a few TD algorithms such as Sarsa and Q-learning. The
important concepts of on-policy and off-policy are also introduced.
• Chapter 8 introduces the value function approximation method. In fact, this chap-
ter continues to introduce TD algorithms, but it uses a different way to represent
state/action values. In the preceding chapters, state/action values are represented by
tables. The tabular method is straightforward to understand, but it is inefficient for
handling large state or action spaces. To solve this problem, we can employ the value
function approximation method. The key to understanding this method is to under-
stand the three steps in its optimization formulation. The first step is to select an
objective function for defining optimal policies. The second step is to derive the gradi-
ent of the objective function. The third step is to apply a gradient-based algorithm to
solve the optimization problem. This method is important because it has become the
standard technique to represent values. It is also the location in which artificial neu-
ral networks are incorporated into reinforcement learning as function approximators.
The famous deep Q-learning algorithm is also introduced in this chapter.
• Chapter 9 introduces the policy gradient method, which is the foundation of many
modern reinforcement learning algorithms. The policy gradient method is policy-based.
It is a large step forward in this book because all the methods in the previous chapters
are value-based. The basic idea of the policy gradient method is simple: it selects
an appropriate scalar metric and then optimizes it via a gradient-ascent algorithm.
Chapter 9 has an intimate relationship with Chapter 8 because they both rely on the
idea of function approximation. The advantages of the policy gradient method are
numerous. For example, it is more efficient for handling large state/action spaces.
It has stronger generalization abilities and hence is more efficient regarding sample
usage.
• Chapter 10 introduces actor-critic methods. From one point of view, actor-critic refers
to a structure that incorporates both policy-based and value-based methods. From
another point of view, actor-critic methods are not new since they still fall into the
scope of the policy gradient method. Specifically, they can be obtained by extending
the policy gradient algorithm introduced in Chapter 9. It is necessary for the reader
to properly understand the contents in Chapters 8 and 9 before studying Chapter 10.

Chapter 1

Basic Concepts

[Figure omitted: the chapter map of the book (see Figure 1 in the Overview), repeated to indicate the current position.]

Figure 1.1: Where we are in this book.

This chapter introduces the basic concepts of reinforcement learning. These concepts
are important because they will be widely used in this book. We first introduce these
concepts using examples and then formalize them in the framework of Markov decision
processes.

1.1 A grid world example


Consider an example as shown in Figure 1.2, where a robot moves in a grid world. The
robot, called the agent, can move across adjacent cells in the grid. At each time step, it can
only occupy a single cell. The white cells are accessible for entry, and the orange cells
are forbidden. There is a target cell that the robot would like to reach. We will use such
grid world examples throughout this book since they are intuitive for illustrating new
concepts and algorithms.

[Figure omitted: a 3-by-3 grid with a start cell, two forbidden cells, and a target cell.]

Figure 1.2: The grid world example is used throughout the book.

The ultimate goal of the agent is to find a “good” policy that enables it to reach
the target cell when starting from any initial cell. How can the “goodness” of a policy
be defined? The idea is that the agent should reach the target without entering any
forbidden cells, taking unnecessary detours, or colliding with the boundary of the grid.
It would be trivial to plan a path to reach the target cell if the agent knew the map of
the grid world. The task becomes nontrivial if the agent does not know any information
about the environment in advance. Then, the agent must interact with the environment
to find a good policy by trial and error. To do that, the concepts presented in the rest of
the chapter are necessary.

1.2 State and action


The first concept to be introduced is the state, which describes the agent’s status with
respect to the environment. In the grid world example, the state corresponds to the
agent’s location. Since there are nine cells, there are nine states as well. They are
indexed as s1 , s2 , . . . , s9 , as shown in Figure 1.3(a). The set of all the states is called the
state space, denoted as S = {s1 , . . . , s9 }.
For each state, the agent can take five possible actions: moving upward, moving
rightward, moving downward, moving leftward, and remaining unchanged. These five
actions are denoted as a1 , a2 , . . . , a5 , respectively (see Figure 1.3(b)). The set of all actions
is called the action space, denoted as A = {a1 , . . . , a5 }. Different states can have different
action spaces. For instance, considering that taking a1 or a4 at state s1 would lead to a
collision with the boundary, we can set the action space for state s1 as A(s1 ) = {a2 , a3 , a5 }.
In this book, we consider the most general case: A(si ) = A = {a1 , . . . , a5 } for all i.


[Figure omitted: (a) the nine states s1, . . . , s9 arranged row by row on the 3-by-3 grid; (b) the five actions a1 (upward), a2 (rightward), a3 (downward), a4 (leftward), and a5 (unchanged) depicted around a cell.]

Figure 1.3: Illustrations of the state and action concepts. (a) There are nine states {s1, . . . , s9}. (b) Each state has five possible actions {a1, a2, a3, a4, a5}.

1.3 State transition


When taking an action, the agent may move from one state to another. Such a process is
called state transition. For example, if the agent is at state s1 and selects action a2 (that
is, moving rightward), then the agent moves to state s2 . Such a process can be expressed
as
$s_1 \xrightarrow{a_2} s_2$.

We next examine two important examples.


• What is the next state when the agent attempts to go beyond the boundary, for
example, taking action a1 at state s1? The answer is that the agent will be bounced
back because it is impossible for the agent to exit the state space. Hence, we have
$s_1 \xrightarrow{a_1} s_1$.
• What is the next state when the agent attempts to enter a forbidden cell, for example,
taking action a2 at state s5? Two different scenarios may be encountered. In the first
scenario, although s6 is forbidden, it is still accessible. In this case, the next state
is s6; hence, the state transition process is $s_5 \xrightarrow{a_2} s_6$. In the second scenario, s6 is
not accessible because, for example, it is surrounded by walls. In this case, the agent
is bounced back to s5 if it attempts to move rightward; hence, the state transition
process is $s_5 \xrightarrow{a_2} s_5$.
Which scenario should we consider? The answer depends on the physical environment. In this book, we consider the first scenario where the forbidden cells are accessible,
although stepping into them may get punished. This scenario is more general and in-
teresting. Moreover, since we are considering a simulation task, we can define the state
transition process however we prefer. In real-world applications, the state transition
process is determined by real-world dynamics.
The state transition process is defined for each state and its associated actions. This
process can be described by a table as shown in Table 1.1. In this table, each row
corresponds to a state, and each column corresponds to an action. Each cell indicates
the next state to transition to after the agent takes an action at the corresponding state.

a1 (upward) a2 (rightward) a3 (downward) a4 (leftward) a5 (unchanged)


s1 s1 s2 s4 s1 s1
s2 s2 s3 s5 s1 s2
s3 s3 s3 s6 s2 s3
s4 s1 s5 s7 s4 s4
s5 s2 s6 s8 s4 s5
s6 s3 s6 s9 s5 s6
s7 s4 s8 s7 s7 s7
s8 s5 s9 s8 s7 s8
s9 s6 s9 s9 s8 s9
Table 1.1: A tabular representation of the state transition process. Each cell indicates the next state to
transition to after the agent takes an action at a state.

Mathematically, the state transition process can be described by conditional probabilities. For example, for s1 and a2, the conditional probability distribution is

p(s1 |s1 , a2 ) = 0,
p(s2 |s1 , a2 ) = 1,
p(s3 |s1 , a2 ) = 0,
p(s4 |s1 , a2 ) = 0,
p(s5 |s1 , a2 ) = 0,

which indicates that, when taking a2 at s1 , the probability of the agent moving to s2
is one, and the probabilities of the agent moving to other states are zero. As a result,
taking action a2 at s1 will certainly cause the agent to transition to s2 . The preliminaries
of conditional probability are given in Appendix A. Readers are strongly advised to be
familiar with probability theory since it is necessary for studying reinforcement learning.
Although it is intuitive, the tabular representation is only able to describe determinis-
tic state transitions. In general, state transitions can be stochastic and must be described
by conditional probability distributions. For instance, when random wind gusts are ap-
plied across the grid, if taking action a2 at s1 , the agent may be blown to s5 instead of
s2 . We have p(s5 |s1 , a2 ) > 0 in this case. Nevertheless, we merely consider deterministic
state transitions in the grid world examples for simplicity in this book.
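
To make the deterministic transition rule of Table 1.1 concrete, here is a minimal Python sketch; the 0-based state and action indices, the function name, and the bounce-back logic at the boundary are illustrative choices consistent with the table, not code from the book.

```python
# A minimal sketch (not from the book) of the deterministic state
# transitions in Table 1.1. States s1..s9 are indexed 0..8 row by row on
# the 3x3 grid, and actions a1..a5 are indexed 0..4 (up, right, down,
# left, stay). A move that would leave the grid bounces the agent back.

N = 3  # side length of the grid (assumption: the 3x3 grid of Figure 1.2)

ACTIONS = {
    0: (-1, 0),  # a1: move upward
    1: (0, 1),   # a2: move rightward
    2: (1, 0),   # a3: move downward
    3: (0, -1),  # a4: move leftward
    4: (0, 0),   # a5: remain unchanged
}

def next_state(s: int, a: int) -> int:
    """Next state index after taking action a at state s."""
    row, col = divmod(s, N)
    d_row, d_col = ACTIONS[a]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return new_row * N + new_col
    return s  # the agent tried to exit the boundary and is bounced back

# Consistency checks against Table 1.1: a2 at s1 leads to s2, and a1 at s1
# bounces the agent back to s1.
assert next_state(0, 1) == 1
assert next_state(0, 0) == 0
```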

1.4 Policy
A policy tells the agent which actions to take at every state. Intuitively, policies can
be depicted as arrows (see Figure 1.4(a)). Following a policy, the agent can generate a
trajectory starting from an initial state (see Figure 1.4(b)).


[Figure omitted: (a) a deterministic policy shown as arrows on the grid; (b) trajectories obtained from the policy.]

Figure 1.4: A policy represented by arrows and some trajectories obtained by starting from different initial states.

Mathematically, policies can be described by conditional probabilities. Denote the policy in Figure 1.4 as π(a|s), which is a conditional probability distribution function
defined for every state. For example, the policy for s1 is

π(a1 |s1 ) = 0,
π(a2 |s1 ) = 1,
π(a3 |s1 ) = 0,
π(a4 |s1 ) = 0,
π(a5 |s1 ) = 0,

which indicates that the probability of taking action a2 at state s1 is one, and the prob-
abilities of taking other actions are zero.
The above policy is deterministic. Policies may be stochastic in general. For example,
the policy shown in Figure 1.5 is stochastic: at state s1 , the agent may take actions to
go either rightward or downward. The probabilities of taking these two actions are the
same (both are 0.5). In this case, the policy for s1 is

π(a1 |s1 ) = 0,
π(a2 |s1 ) = 0.5,
π(a3 |s1 ) = 0.5,
π(a4 |s1 ) = 0,
π(a5 |s1 ) = 0.

[Figure omitted: the grid world with the stochastic policy drawn as arrows; at s1 the rightward and downward arrows are each labeled p = 0.5.]

Figure 1.5: A stochastic policy. At state s1, the agent may move rightward or downward with equal probabilities of 0.5.

Policies represented by conditional probabilities can be stored as tables. For example, Table 1.2 represents the stochastic policy depicted in Figure 1.5. The entry in the ith
row and jth column is the probability of taking the jth action at the ith state. Such
a representation is called a tabular representation. We will introduce another way to
represent policies as parameterized functions in Chapter 9.

a1 (upward) a2 (rightward) a3 (downward) a4 (leftward ) a5 (unchanged)


s1 0 0.5 0.5 0 0
s2 0 0 1 0 0
s3 0 0 0 1 0
s4 0 1 0 0 0
s5 0 0 1 0 0
s6 0 0 1 0 0
s7 0 1 0 0 0
s8 0 1 0 0 0
s9 0 0 0 0 1
Table 1.2: A tabular representation of a policy. Each entry indicates the probability of taking an action
at a state.
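
As an illustration of the tabular representation, the following sketch stores the stochastic policy of Table 1.2 as a matrix and samples actions from it; the array layout, variable names, and random seed are assumptions made for the example, not code from the book.

```python
import numpy as np

# A sketch (not from the book) of the tabular policy in Table 1.2:
# entry [i, j] is pi(a_{j+1} | s_{i+1}), so each row sums to one.
policy = np.array([
    #  a1   a2   a3   a4   a5
    [0.0, 0.5, 0.5, 0.0, 0.0],  # s1: rightward or downward with equal probability
    [0.0, 0.0, 1.0, 0.0, 0.0],  # s2
    [0.0, 0.0, 0.0, 1.0, 0.0],  # s3
    [0.0, 1.0, 0.0, 0.0, 0.0],  # s4
    [0.0, 0.0, 1.0, 0.0, 0.0],  # s5
    [0.0, 0.0, 1.0, 0.0, 0.0],  # s6
    [0.0, 1.0, 0.0, 0.0, 0.0],  # s7
    [0.0, 1.0, 0.0, 0.0, 0.0],  # s8
    [0.0, 0.0, 0.0, 0.0, 1.0],  # s9
])
assert np.allclose(policy.sum(axis=1), 1.0)

rng = np.random.default_rng(seed=0)

def sample_action(state_index: int) -> int:
    """Draw an action index according to pi(. | s)."""
    return int(rng.choice(5, p=policy[state_index]))

print(sample_action(0))  # at s1: returns 1 (a2) or 2 (a3), each with probability 0.5
```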

1.5 Reward
Reward is one of the most distinctive concepts in reinforcement learning.


After executing an action at a state, the agent obtains a reward, denoted as r, as feedback from the environment. The reward is a function of the state s and action a.
Hence, it is also denoted as r(s, a). Its value can be a positive or negative real number
or zero. Different rewards have different impacts on the policy that the agent would
eventually learn. Generally speaking, with a positive reward, we encourage the agent to
take the corresponding action. With a negative reward, we discourage the agent from
taking that action.
In the grid world example, the rewards are designed as follows:

• If the agent attempts to exit the boundary, let rboundary = −1.
• If the agent attempts to enter a forbidden cell, let rforbidden = −1.
• If the agent reaches the target state, let rtarget = +1.
• Otherwise, the agent obtains a reward of rother = 0.

Special attention should be given to the target state s9 . The reward process does not
have to terminate after the agent reaches s9 . If the agent takes action a5 at s9 , the next
state is again s9 , and the reward is rtarget = +1. If the agent takes action a2 , the next
state is also s9 , but the reward is rboundary = −1.
A reward can be interpreted as a human-machine interface, with which we can guide
the agent to behave as we expect. For example, with the rewards designed above, we can
expect that the agent tends to avoid exiting the boundary or stepping into the forbidden
cells. Designing appropriate rewards is an important step in reinforcement learning. This
step is, however, nontrivial for complex tasks since it may require the user to understand
the given problem well. Nevertheless, it may still be much easier than solving the problem
with other approaches that require a professional background or a deep understanding of
the given problem.
The process of getting a reward after executing an action can be intuitively represented
as a table, as shown in Table 1.3. Each row of the table corresponds to a state, and each
column corresponds to an action. The value in each cell of the table indicates the reward
that can be obtained by taking an action at a state.
One question that beginners may ask is as follows: if given the table of rewards, can
we find good policies by simply selecting the actions with the greatest rewards? The
answer is no. That is because these rewards are immediate rewards that can be obtained
after taking an action. To determine a good policy, we must consider the total reward
obtained in the long run (see Section 1.6 for more information). An action with the
greatest immediate reward may not lead to the greatest total reward.
Although intuitive, the tabular representation is only able to describe deterministic
reward processes. A more general approach is to use conditional probabilities p(r|s, a) to
describe reward processes. For example, for state s1 , we have

p(r = −1|s1 , a1 ) = 1, p(r ≠ −1|s1 , a1 ) = 0.


a1 (upward) a2 (rightward) a3 (downward) a4 (leftward) a5 (unchanged)


s1 rboundary 0 0 rboundary 0
s2 rboundary 0 0 0 0
s3 rboundary rboundary rforbidden 0 0
s4 0 0 rforbidden rboundary 0
s5 0 rforbidden 0 0 0
s6 0 rboundary rtarget 0 rforbidden
s7 0 0 rboundary rboundary rforbidden
s8 0 rtarget rboundary rforbidden 0
s9 rforbidden rboundary rboundary 0 rtarget
Table 1.3: A tabular representation of the process of obtaining rewards. Here, the process is deterministic.
Each cell indicates how much reward can be obtained after the agent takes an action at a given state.

This indicates that, when taking a1 at s1 , the agent obtains r = −1 with certainty. In
this example, the reward process is deterministic. In general, it can be stochastic. For
example, if a student studies hard, he or she would receive a positive reward (e.g., higher
grades on exams), but the specific value of the reward may be uncertain.
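
To tie the reward design to the state transitions, here is a sketch of a combined step function for the deterministic grid world; it assumes the forbidden cells are s6 and s7 and the target is s9, consistent with Table 1.3, and the indices and names are illustrative rather than code from the book.

```python
# A sketch (not from the book) of the reward design in this section.
# Assumptions: 0-based indexing, forbidden cells s6 and s7 (indices 5, 6),
# target s9 (index 8), matching Table 1.3.

N = 3
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1), 4: (0, 0)}  # a1..a5
FORBIDDEN, TARGET = {5, 6}, 8
R_BOUNDARY = R_FORBIDDEN = -1.0
R_TARGET, R_OTHER = 1.0, 0.0

def step(s: int, a: int) -> tuple[int, float]:
    """Return (next state, immediate reward) for the deterministic grid world."""
    row, col = divmod(s, N)
    d_row, d_col = MOVES[a]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < N and 0 <= new_col < N):  # attempted to exit the boundary
        return s, R_BOUNDARY
    s_next = new_row * N + new_col
    if s_next in FORBIDDEN:
        return s_next, R_FORBIDDEN  # forbidden cells are enterable but punished
    if s_next == TARGET:
        return s_next, R_TARGET
    return s_next, R_OTHER

# Examples consistent with Table 1.3: a1 at s1 hits the boundary (-1),
# a2 at s8 reaches the target (+1), a5 at s9 stays at the target (+1).
assert step(0, 0) == (0, -1.0)
assert step(7, 1) == (8, 1.0)
assert step(8, 4) == (8, 1.0)
```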

1.6 Trajectories, returns, and episodes

[Figure omitted: two copies of the grid world showing (a) Policy 1 and its trajectory s1 → s2 → s5 → s8 → s9, and (b) Policy 2 and its trajectory s1 → s4 → s7 → s8 → s9, with the reward received at each step.]

Figure 1.6: Trajectories obtained by following two policies. The trajectories are indicated by red dashed lines.

A trajectory is a state-action-reward chain. For example, given the policy shown in Figure 1.6(a), the agent can move along the following trajectory:

$s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9.$

The return of this trajectory is defined as the sum of all the rewards collected along the
trajectory:

return = 0 + 0 + 0 + 1 = 1. (1.1)


Returns are also called total rewards or cumulative rewards.


Returns can be used to evaluate policies. For example, we can evaluate the two
policies in Figure 1.6 by comparing their returns. In particular, starting from s1 , the
return obtained by the left policy is 1 as calculated above. For the right policy, starting
from s1 , the following trajectory is generated:

$s_1 \xrightarrow[r=0]{a_3} s_4 \xrightarrow[r=-1]{a_3} s_7 \xrightarrow[r=0]{a_2} s_8 \xrightarrow[r=+1]{a_2} s_9.$

The corresponding return is

return = 0 − 1 + 0 + 1 = 0. (1.2)

The returns in (1.1) and (1.2) indicate that the left policy is better than the right one
since its return is greater. This mathematical conclusion is consistent with the intuition
that the right policy is worse since it passes through a forbidden cell.
A return consists of an immediate reward and future rewards. Here, the immediate
reward is the reward obtained after taking an action at the initial state; the future
rewards refer to the rewards obtained after leaving the initial state. It is possible that the
immediate reward is negative while the future reward is positive. Thus, which actions to
take should be determined by the return (i.e., the total reward) rather than the immediate
reward to avoid short-sighted decisions.
The return in (1.1) is defined for a finite-length trajectory. Return can also be defined
for infinitely long trajectories. For example, the trajectory in Figure 1.6 stops after
reaching s9 . Since the policy is well defined for s9 , the process does not have to stop after
the agent reaches s9 . We can design a policy so that the agent stays unchanged after
reaching s9 . Then, the policy would generate the following infinitely long trajectory:

$s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9 \xrightarrow[r=1]{a_5} s_9 \xrightarrow[r=1]{a_5} s_9 \cdots$

The direct sum of the rewards along this trajectory is

return = 0 + 0 + 0 + 1 + 1 + 1 + · · · = ∞,

which unfortunately diverges. Therefore, we must introduce the discounted return con-
cept for infinitely long trajectories. In particular, the discounted return is the sum of the
discounted rewards:

$\text{discounted return} = 0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 + \gamma^4 \cdot 1 + \gamma^5 \cdot 1 + \cdots, \qquad (1.3)$

where γ ∈ (0, 1) is called the discount rate. When γ ∈ (0, 1), the value of (1.3) can be
calculated as

$\text{discounted return} = \gamma^3 (1 + \gamma + \gamma^2 + \cdots) = \gamma^3 \frac{1}{1-\gamma}.$

The introduction of the discount rate is useful for the following reasons. First, it
removes the stop criterion and allows for infinitely long trajectories. Second, the dis-
count rate can be used to adjust the emphasis placed on near- or far-future rewards. In
particular, if γ is close to 0, then the agent places more emphasis on rewards obtained in
the near future. The resulting policy would be short-sighted. If γ is close to 1, then the
agent places more emphasis on the far future rewards. The resulting policy is far-sighted
and dares to take risks of obtaining negative rewards in the near future. These points
will be demonstrated in Section 3.5.
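
As a quick numerical illustration (not from the book), the following sketch truncates the infinite reward sequence 0, 0, 0, 1, 1, 1, . . . and checks the truncated sum against the closed form γ³/(1 − γ); the choice γ = 0.9 and the truncation horizon are arbitrary.

```python
# A small numerical check of the discounted return in (1.3). Truncating the
# infinite tail of +1 rewards after many steps approximates the infinite sum.

gamma = 0.9  # illustrative discount rate

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a (finite) reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0, 0, 0] + [1] * 1000          # truncated version of 0, 0, 0, 1, 1, 1, ...
approx = discounted_return(rewards, gamma)
closed_form = gamma ** 3 / (1 - gamma)

print(approx, closed_form)                # both are approximately 7.29
assert abs(approx - closed_form) < 1e-6
```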
One important notion that was not explicitly mentioned in the above discussion is the
episode. When interacting with the environment by following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a trial ). If the
environment or policy is stochastic, we obtain different episodes when starting from the
same state. However, if everything is deterministic, we always obtain the same episode
when starting from the same state.
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called
episodic tasks. However, some tasks may have no terminal states, meaning that the pro-
cess of interacting with the environment will never end. Such tasks are called continuing
tasks. In fact, we can treat episodic and continuing tasks in a unified mathematical
manner by converting episodic tasks to continuing ones. To do that, we need to properly define
the process after the agent reaches the terminal state. Specifically, after reaching the
terminal state in an episodic task, the agent can continue taking actions in the following
two ways.

• First, if we treat the terminal state as a special state, we can specifically design its
action space or state transition so that the agent stays in this state forever. Such
states are called absorbing states, meaning that the agent never leaves a state once
reached. For example, for the target state s9 , we can specify A(s9 ) = {a5 } or set
A(s9 ) = {a1 , . . . , a5 } with p(s9 |s9 , ai ) = 1 for all i = 1, . . . , 5.
• Second, if we treat the terminal state as a normal state, we can simply set its action
space to the same as the other states, and the agent may leave the state and come
back again. Since a positive reward of r = 1 can be obtained every time s9 is reached,
the agent will eventually learn to stay at s9 forever to collect more rewards. Notably,
when an episode is infinitely long and the reward received for staying at s9 is positive,
a discount rate must be used to calculate the discounted return to avoid divergence.

In this book, we consider the second scenario where the target state is treated as a normal
state whose action space is A(s9 ) = {a1 , . . . , a5 }.


1.7 Markov decision processes


The previous sections of this chapter illustrated some fundamental concepts in reinforce-
ment learning through examples. This section presents these concepts in a more formal
way under the framework of Markov decision processes (MDPs).
An MDP is a general framework for describing stochastic dynamical systems. The
key ingredients of an MDP are listed below.

• Sets:

- State space: the set of all states, denoted as S.
- Action space: a set of actions, denoted as A(s), associated with each state s ∈ S.
- Reward set: a set of rewards, denoted as R(s, a), associated with each state-action
pair (s, a).

• Model:

- State transition probability: At state s, when taking action a, the probability of
transitioning to state s′ is p(s′|s, a). It holds that $\sum_{s' \in S} p(s'|s, a) = 1$ for any (s, a).
- Reward probability: At state s, when taking action a, the probability of obtaining
reward r is p(r|s, a). It holds that $\sum_{r \in R(s,a)} p(r|s, a) = 1$ for any (s, a).

• Policy: At state s, the probability of choosing action a is π(a|s). It holds that
$\sum_{a \in A(s)} \pi(a|s) = 1$ for any s ∈ S.

• Markov property: The Markov property refers to the memoryless property of a stochastic
process. Mathematically, it means that

$$p(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1}|s_t, a_t),$$
$$p(r_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1}|s_t, a_t), \qquad (1.4)$$

where t represents the current time step and t + 1 represents the next time step.
Equation (1.4) indicates that the next state or reward depends merely on the current
state and action and is independent of the previous ones. The Markov property is
important for deriving the fundamental Bellman equation of MDPs, as shown in the
next chapter.

Here, p(s′|s, a) and p(r|s, a) for all (s, a) are called the model or dynamics. The
model can be either stationary or nonstationary (or in other words, time-invariant or
time-variant). A stationary model does not change over time; a nonstationary model
may vary over time. For instance, in the grid world example, if a forbidden area may pop
up or disappear sometimes, the model is nonstationary. In this book, we only consider
stationary models.


One may have heard about Markov processes (MPs). What is the difference
between an MDP and an MP? The answer is that, once the policy in an MDP is fixed,
the MDP degenerates into an MP. For example, the grid world example in Figure 1.7 can
be abstracted as a Markov process. In the literature on stochastic processes, a Markov
process is also called a Markov chain if it is a discrete-time process and the number of
states is finite or countable [1]. In this book, the terms “Markov process” and “Markov
chain” are used interchangeably when the context is clear. Moreover, this book mainly
considers finite MDPs where the numbers of states and actions are finite. This is the
simplest case that should be fully understood.
[Figure omitted: the nine states s1, . . . , s9 drawn as circles connected by arrows indicating the state transitions under the given policy; each arrow is labeled with its transition probability (1, or 0.5 for the two transitions out of s1).]

Figure 1.7: Abstraction of the grid world example as a Markov process. Here, the circles represent states and the links with arrows represent state transitions.
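
To illustrate the statement that fixing the policy turns an MDP into a Markov process, the following sketch (not from the book) builds the state-to-state transition matrix $P_\pi(s'|s) = \sum_a \pi(a|s)\, p(s'|s, a)$ for the grid world; the uniform policy, the array layout, and the names are assumptions made for the example.

```python
import numpy as np

# Sketch: once the policy pi is fixed, the MDP model p(s'|s, a) reduces to a
# Markov-chain transition matrix P_pi(s'|s) = sum_a pi(a|s) p(s'|s, a).
# The grid-world model p is rebuilt from the deterministic rule of Table 1.1.

N, num_states, num_actions = 3, 9, 5
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1), (0, 0)]   # a1..a5

p = np.zeros((num_states, num_actions, num_states))  # p[s, a, s'] = p(s'|s, a)
for s in range(num_states):
    row, col = divmod(s, N)
    for a, (d_row, d_col) in enumerate(MOVES):
        nr, nc = row + d_row, col + d_col
        s_next = nr * N + nc if 0 <= nr < N and 0 <= nc < N else s
        p[s, a, s_next] = 1.0

pi = np.full((num_states, num_actions), 1.0 / num_actions)  # a uniform policy, for illustration

P_pi = np.einsum("sa,sat->st", pi, p)      # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
assert np.allclose(P_pi.sum(axis=1), 1.0)  # each row of the chain's matrix sums to one
```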

Finally, reinforcement learning can be described as an agent-environment interaction process. The agent is a decision-maker that can sense its state, maintain policies, and
execute actions. Everything outside of the agent is regarded as the environment. In the
grid world examples, the agent and environment correspond to the robot and grid world,
respectively. After the agent decides to take an action, the actuator executes such a
decision. Then, the state of the agent would be changed and a reward can be obtained.
By using interpreters, the agent can interpret the new state and the reward. Thus, a
closed loop can be formed.

1.8 Summary
This chapter introduced the basic concepts that will be widely used in the remainder of
the book. We used intuitive grid world examples to demonstrate these concepts and then
formalized them in the framework of MDPs. For more information about MDPs, readers
can see [1, 2].

1.9 Q&A
• Q: Can we set all the rewards as negative or positive?
A: In this chapter, we mentioned that a positive reward would encourage the agent
to take an action and that a negative reward would discourage the agent from taking

determine encouragement or discouragement.
More specifically, we set rboundary = −1, rforbidden = −1, rtarget = +1, and rother = 0 in
this chapter. We can also add a common value to all these values without changing
the resulting optimal policy. For example, we can add −2 to all the rewards to obtain
rboundary = −3, rforbidden = −3, rtarget = −1, and rother = −2. Although the rewards
are all negative, the resulting optimal policy is unchanged. That is because optimal
policies are invariant to affine transformations of the rewards. Details will be given in
Section 3.5; a brief sketch of the constant-shift case is given after this Q&A section.
• Q: Is the reward a function of the next state?
A: We mentioned that the reward r depends only on s and a but not the next state s0 .
However, this may be counterintuitive since it is the next state that determines the
reward in many cases. For example, the reward is positive when the next state is the
target state. As a result, a question that naturally follows is whether a reward should
depend on the next state. A mathematical rephrasing of this question is whether we
should use p(r|s, a, s′) where s′ is the next state rather than p(r|s, a). The answer is
that r depends on s, a, and s′. However, since s′ also depends on s and a, we can
equivalently write r as a function of s and a: $p(r|s, a) = \sum_{s'} p(r|s, a, s')\, p(s'|s, a)$. In
this way, the Bellman equation can be easily established as shown in Chapter 2.
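
As a supplement to the first question above, the following short derivation (not from the book) sketches why adding the same constant c to every reward leaves the comparison between policies unchanged in the discounted setting: every discounted return is shifted by the same amount c/(1 − γ). The general affine case is treated in Section 3.5.

```latex
% Sketch: shifting every reward by a constant c shifts every discounted
% return by c/(1-gamma), leaving the ranking of policies unchanged.
\[
\sum_{k=0}^{\infty} \gamma^k (r_{t+k+1} + c)
  = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} + c \sum_{k=0}^{\infty} \gamma^k
  = \underbrace{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}}_{\text{original return}} + \frac{c}{1-\gamma}.
\]
```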

Chapter 2

State Values and Bellman Equation

[Figure omitted: the chapter map of the book (see Figure 1 in the Overview), repeated to indicate the current position.]

Figure 2.1: Where we are in this book.

This chapter introduces a core concept and an important tool. The core concept
is the state value, which is defined as the expected return that an agent can obtain when
starting from a state and following a given policy. The greater the state value is, the better the corresponding
policy is. State values can be used as a metric to evaluate whether a policy is good or
not. While state values are important, how can we analyze them? The answer is the
Bellman equation, which is an important tool for analyzing state values. In a nutshell,
the Bellman equation describes the relationships between the values of all states. By
solving the Bellman equation, we can obtain the state values. This process is called
policy evaluation, which is a fundamental concept in reinforcement learning. Finally, this
chapter introduces another important concept called the action value.

2.1 Motivating example 1: Why are returns important?
The previous chapter introduced the concept of returns. In fact, returns play a funda-
mental role in reinforcement learning since they can evaluate whether a policy is good or
not. This is demonstrated by the following examples.

[Figure omitted: three small grid worlds with the same forbidden and target cells but different policies at s1: the first moves downward (r = 0), the second moves rightward into the forbidden area (r = −1), and the third moves rightward or downward with probability 0.5 each.]

Figure 2.2: Examples for demonstrating the importance of returns. The three examples have different policies for s1.

Consider the three policies shown in Figure 2.2. It can be seen that the three policies
are different at s1 . Which is the best and which is the worst? Intuitively, the leftmost
policy is the best because the agent starting from s1 can avoid the forbidden area. The
middle policy is intuitively worse because the agent starting from s1 moves to the forbid-
den area. The rightmost policy is in between the others because it has a probability of
0.5 to go to the forbidden area.
While the above analysis is based on intuition, a question that immediately follows is
whether we can use mathematics to describe such intuition. The answer is yes and relies
on the return concept. In particular, suppose that the agent starts from s1 .

• Following the first policy, the trajectory is s1 → s3 → s4 → s4 · · · . The corresponding
discounted return is

$\text{return}_1 = 0 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = \gamma(1 + \gamma + \gamma^2 + \cdots) = \frac{\gamma}{1-\gamma},$

where γ ∈ (0, 1) is the discount rate.


• Following the second policy, the trajectory is s1 → s2 → s4 → s4 · · · . The discounted
return is

$\text{return}_2 = -1 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = -1 + \gamma(1 + \gamma + \gamma^2 + \cdots) = -1 + \frac{\gamma}{1-\gamma}.$

• Following the third policy, two trajectories can possibly be obtained. One is s1 →
s3 → s4 → s4 · · · , and the other is s1 → s2 → s4 → s4 · · · . The probability of either
of the two trajectories is 0.5. Then, the average return that can be obtained starting
from s1 is

$\text{return}_3 = 0.5\left(-1 + \frac{\gamma}{1-\gamma}\right) + 0.5\left(\frac{\gamma}{1-\gamma}\right) = -0.5 + \frac{\gamma}{1-\gamma}.$

By comparing the returns of the three policies, we notice that

return1 > return3 > return2 (2.1)

for any value of γ. Inequality (2.1) suggests that the first policy is the best because its
return is the greatest, and the second policy is the worst because its return is the smallest.
This mathematical conclusion is consistent with the aforementioned intuition: the first
policy is the best since it can avoid entering the forbidden area, and the second policy is
the worst because it leads to the forbidden area.
The above examples demonstrate that returns can be used to evaluate policies: a
policy is better if the return obtained by following that policy is greater. Finally, it is
notable that return3 does not strictly comply with the definition of returns because it is
more like an expected value. It will become clear later that return3 is actually a state
value.
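
To make the comparison concrete, here is a small Python sketch (not from the book) that evaluates the three returns numerically for one illustrative choice of γ; the truncation horizon and variable names are assumptions made for the example.

```python
# A quick numerical check of the comparison in (2.1), truncating the
# infinite reward sequences after many steps.

gamma = 0.9      # illustrative discount rate
horizon = 1000   # truncation length for the infinite tail of +1 rewards

def discounted(rewards):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

return1 = discounted([0] + [1] * horizon)    # s1 -> s3 -> s4 -> s4 -> ...
return2 = discounted([-1] + [1] * horizon)   # s1 -> s2 -> s4 -> s4 -> ...
return3 = 0.5 * return2 + 0.5 * return1      # the two trajectories are equally likely

print(return1, return3, return2)   # approximately 9.0, 8.5, 8.0
assert return1 > return3 > return2
```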

2.2 Motivating example 2: How to calculate returns?


While we have demonstrated the importance of returns, a question that immediately
follows is how to calculate the returns when following a given policy.
There are two ways to calculate returns.

• The first is simply by definition: a return equals the discounted sum of all the rewards
collected along a trajectory. Consider the example in Figure 2.3. Let vi denote the
return obtained by starting from si for i = 1, 2, 3, 4. Then, the returns obtained when


[Figure omitted: four states arranged in a loop, s1 → s2 → s3 → s4 → s1, where leaving state si yields reward ri.]

Figure 2.3: An example for demonstrating how to calculate returns. There are no target or forbidden cells in this example.

starting from the four states in Figure 2.3 can be calculated as

$$\begin{aligned}
v_1 &= r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots, \\
v_2 &= r_2 + \gamma r_3 + \gamma^2 r_4 + \cdots, \\
v_3 &= r_3 + \gamma r_4 + \gamma^2 r_1 + \cdots, \\
v_4 &= r_4 + \gamma r_1 + \gamma^2 r_2 + \cdots.
\end{aligned} \qquad (2.2)$$

 The second way, which is more important, is based on the idea of bootstrapping. By
observing the expressions of the returns in (2.2), we can rewrite them as

    v1 = r1 + γ(r2 + γr3 + . . . ) = r1 + γv2 ,
    v2 = r2 + γ(r3 + γr4 + . . . ) = r2 + γv3 ,
    v3 = r3 + γ(r4 + γr1 + . . . ) = r3 + γv4 ,
    v4 = r4 + γ(r1 + γr2 + . . . ) = r4 + γv1 .                           (2.3)

The above equations indicate an interesting phenomenon that the values of the returns
rely on each other. More specifically, v1 relies on v2 , v2 relies on v3 , v3 relies on v4 ,
and v4 relies on v1 . This reflects the idea of bootstrapping, which is to obtain the
values of some quantities from themselves.
At first glance, bootstrapping is an endless loop because the calculation of an unknown
value relies on another unknown value. In fact, bootstrapping is easier to understand
if we view it from a mathematical perspective. In particular, the equations in (2.3)
can be reformed into a linear matrix-vector equation:
          
    [v1]   [r1]   [γv2]   [r1]       [0 1 0 0] [v1]
    [v2] = [r2] + [γv3] = [r2] + γ   [0 0 1 0] [v2]
    [v3]   [r3]   [γv4]   [r3]       [0 0 0 1] [v3]
    [v4]   [r4]   [γv1]   [r4]       [1 0 0 0] [v4]
      v      r               r           P       v


which can be written compactly as

v = r + γP v.

Thus, the value of v can be calculated easily as v = (I − γP)^{-1} r, where I is the
identity matrix with appropriate dimensions. One may ask whether I − γP is always
invertible. The answer is yes, as explained in Section 2.7.1.

In fact, (2.3) is the Bellman equation for this simple example. Although it is simple,
(2.3) demonstrates the core idea of the Bellman equation: the return obtained by starting
from one state depends on those obtained when starting from other states. The idea of
bootstrapping and the Bellman equation for general scenarios will be formalized in the
following sections.
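A minimal numerical sketch of this idea is given below; the reward values r1, . . . , r4 are not
specified in the example, so the numbers used here are for illustration only:

    import numpy as np

    gamma = 0.9
    r = np.array([0.0, 1.0, 1.0, 1.0])      # illustrative values of r1, r2, r3, r4
    P = np.array([[0, 1, 0, 0],             # from s1 the agent moves to s2
                  [0, 0, 1, 0],             # from s2 to s3
                  [0, 0, 0, 1],             # from s3 to s4
                  [1, 0, 0, 0]])            # from s4 back to s1

    # Solve v = r + gamma * P v, i.e., (I - gamma P) v = r.
    v = np.linalg.solve(np.eye(4) - gamma * P, r)

    # The solution indeed satisfies the bootstrapped relations in (2.3).
    assert np.allclose(v, r + gamma * P @ v)
    print(v)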

2.3 State values


We mentioned that returns can be used to evaluate policies. However, they are inappli-
cable to stochastic systems because starting from one state may lead to different returns.
Motivated by this problem, we introduce the concept of state value in this section.
First, we need to introduce some necessary notations. Consider a sequence of time
steps t = 0, 1, 2, . . . . At time t, the agent is at state St , and the action taken following a
policy π is At . The next state is St+1 , and the immediate reward obtained is Rt+1 . This
process can be expressed concisely as

    St −−At−→ St+1 , Rt+1 .

Note that St , St+1 , At , Rt+1 are all random variables. Moreover, St , St+1 ∈ S, At ∈ A(St ),
and Rt+1 ∈ R(St , At ).
Starting from t, we can obtain a state-action-reward trajectory:

    St −−At−→ St+1 , Rt+1 −−At+1−→ St+2 , Rt+2 −−At+2−→ St+3 , Rt+3 , . . . .

By definition, the discounted return along the trajectory is

    Gt ≐ Rt+1 + γRt+2 + γ²Rt+3 + . . . ,

where γ ∈ (0, 1) is the discount rate. Note that Gt is a random variable since Rt+1 , Rt+2 , . . .
are all random variables.
Since Gt is a random variable, we can calculate its expected value (also called the
expectation or mean):
    vπ(s) ≐ E[Gt | St = s].


Here, vπ (s) is called the state-value function or simply the state value of s. Some impor-
tant remarks are given below.

 vπ (s) depends on s. This is because its definition is a conditional expectation with


the condition that the agent starts from St = s.
 vπ (s) depends on π. This is because the trajectories are generated by following the
policy π. For a different policy, the state value may be different.
 vπ (s) does not depend on t. If the agent moves in the state space, t represents the
current time step. The value of vπ (s) is determined once the policy is given.

The relationship between state values and returns is further clarified as follows. When
both the policy and the system model are deterministic, starting from a state always leads
to the same trajectory. In this case, the return obtained starting from a state is equal
to the value of that state. By contrast, when either the policy or the system model is
stochastic, starting from the same state may generate different trajectories. In this case,
the returns of different trajectories are different, and the state value is the mean of these
returns.
Although returns can be used to evaluate policies as shown in Section 2.1, it is more
formal to use state values to evaluate policies: policies that generate greater state values
are better. Therefore, state values constitute a core concept in reinforcement learning.
While state values are important, a question that immediately follows is how to calculate
them. This question is answered in the next section.
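The “mean of returns” interpretation can also be illustrated numerically. The sketch below estimates
vπ(s1) for the stochastic (rightmost) policy in Figure 2.2 by averaging truncated returns over many
sampled trajectories; the average should approach −0.5 + γ/(1 − γ) = 8.5 for γ = 0.9:

    import random

    gamma, horizon, episodes = 0.9, 200, 10000

    def sampled_return() -> float:
        # With probability 0.5 the first reward is -1 (moving right into the forbidden
        # area), otherwise 0 (moving down); afterwards the reward is 1 at every step.
        g = -1.0 if random.random() < 0.5 else 0.0
        discount = gamma
        for _ in range(horizon):
            g += discount * 1.0
            discount *= gamma
        return g

    estimate = sum(sampled_return() for _ in range(episodes)) / episodes
    print(estimate, -0.5 + gamma / (1 - gamma))   # both are close to 8.5 for gamma = 0.9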

2.4 Bellman equation


We now introduce the Bellman equation, a mathematical tool for analyzing state val-
ues. In a nutshell, the Bellman equation is a set of linear equations that describe the
relationships between the values of all the states.
We next derive the Bellman equation. First, note that Gt can be rewritten as

Gt = Rt+1 + γRt+2 + γ 2 Rt+3 + . . .


= Rt+1 + γ(Rt+2 + γRt+3 + . . . )
= Rt+1 + γGt+1 ,

where Gt+1 = Rt+2 + γRt+3 + . . . . This equation establishes the relationship between Gt
and Gt+1 . Then, the state value can be written as

vπ (s) = E[Gt |St = s]


= E[Rt+1 + γGt+1 |St = s]
= E[Rt+1 |St = s] + γE[Gt+1 |St = s]. (2.4)


The two terms in (2.4) are analyzed below.

 The first term, E[Rt+1 |St = s], is the expectation of the immediate rewards. By using
the law of total expectation (Appendix A), it can be calculated as
    E[Rt+1 | St = s] = Σ_{a∈A} π(a|s) E[Rt+1 | St = s, At = a]
                     = Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a) r.                (2.5)

Here, A and R are the sets of possible actions and rewards, respectively. It should be
noted that A may be different for different states. In this case, A should be written as
A(s). Similarly, R may also depend on (s, a). We drop the dependence on s or (s, a)
for the sake of simplicity in this book. Nevertheless, the conclusions are still valid in
the presence of dependence.
 The second term, E[Gt+1 |St = s], is the expectation of the future rewards. It can be
calculated as
    E[Gt+1 | St = s] = Σ_{s'∈S} E[Gt+1 | St = s, St+1 = s'] p(s'|s)
                     = Σ_{s'∈S} E[Gt+1 | St+1 = s'] p(s'|s)      (due to the Markov property)
                     = Σ_{s'∈S} vπ(s') p(s'|s)
                     = Σ_{s'∈S} vπ(s') Σ_{a∈A} p(s'|s, a) π(a|s).         (2.6)

The above derivation uses the fact that E[Gt+1 | St = s, St+1 = s'] = E[Gt+1 | St+1 = s'],
which is due to the Markov property that the future rewards depend merely on the
present state rather than the previous ones.

Substituting (2.5)-(2.6) into (2.4) yields

    vπ(s) = E[Rt+1 | St = s] + γ E[Gt+1 | St = s]
          = Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a) r  +  γ Σ_{a∈A} π(a|s) Σ_{s'∈S} p(s'|s, a) vπ(s')
          = Σ_{a∈A} π(a|s) [ Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) vπ(s') ],   for all s ∈ S,   (2.7)

where the first term in the second line is the mean of the immediate rewards and the second
term is the mean of the future rewards.

This equation is the Bellman equation, which characterizes the relationships of state
values. It is a fundamental tool for designing and analyzing reinforcement learning algo-
rithms.


The Bellman equation seems complex at first glance. In fact, it has a clear structure.
Some remarks are given below.

 vπ (s) and vπ (s0 ) are unknown state values to be calculated. It may be confusing to
beginners how to calculate the unknown vπ (s) given that it relies on another unknown
vπ (s0 ). It must be noted that the Bellman equation refers to a set of linear equations for
all states rather than a single equation. If we put these equations together, it becomes
clear how to calculate all the state values. Details will be given in Section 2.7.
 π(a|s) is a given policy. Since state values can be used to evaluate a policy, solving
the state values from the Bellman equation is a policy evaluation process, which is an
important process in many reinforcement learning algorithms, as we will see later in
the book.
 p(r|s, a) and p(s0 |s, a) represent the system model. We will first show how to calculate
the state values with this model in Section 2.7, and then show how to do that without
the model by using model-free algorithms later in this book.

In addition to the expression in (2.7), readers may also encounter other expressions
of the Bellman equation in the literature. We next introduce two equivalent expressions.
First, it follows from the law of total probability that

    p(s'|s, a) = Σ_{r∈R} p(s', r|s, a),
    p(r|s, a)  = Σ_{s'∈S} p(s', r|s, a).

Then, equation (2.7) can be rewritten as

    vπ(s) = Σ_{a∈A} π(a|s) Σ_{s'∈S} Σ_{r∈R} p(s', r|s, a) [r + γvπ(s')].

This is the expression used in [3].


Second, the reward r may depend solely on the next state s' in some problems. As a
result, we can write the reward as r(s') and hence p(r(s')|s, a) = p(s'|s, a). Substituting this
into (2.7) gives

    vπ(s) = Σ_{a∈A} π(a|s) Σ_{s'∈S} p(s'|s, a) [r(s') + γvπ(s')].

2.5 Examples for illustrating the Bellman equation


We next use two examples to demonstrate how to write out the Bellman equation and
calculate the state values step by step. Readers are advised to carefully go through the
examples to gain a better understanding of the Bellman equation.

Figure 2.4: An example for demonstrating the Bellman equation. The policy in this example is deterministic.

 Consider the first example shown in Figure 2.4, where the policy is deterministic. We
next write out the Bellman equation and then solve the state values from it.
First, consider state s1 . Under the policy, the probabilities of taking the actions
are π(a = a3|s1) = 1 and π(a ≠ a3|s1) = 0. The state transition probabilities
are p(s' = s3|s1, a3) = 1 and p(s' ≠ s3|s1, a3) = 0. The reward probabilities are
p(r = 0|s1, a3) = 1 and p(r ≠ 0|s1, a3) = 0. Substituting these values into (2.7) gives

vπ (s1 ) = 0 + γvπ (s3 ).

Interestingly, although the expression of the Bellman equation in (2.7) seems complex,
the expression for this specific state is very simple.
Similarly, it can be obtained that

vπ (s2 ) = 1 + γvπ (s4 ),


vπ (s3 ) = 1 + γvπ (s4 ),
vπ (s4 ) = 1 + γvπ (s4 ).

We can solve the state values from these equations. Since the equations are simple, we
can manually solve them. More complicated equations can be solved by the algorithms
presented in Section 2.7. Here, the state values can be solved as

    vπ(s4) = 1/(1 − γ),
    vπ(s3) = 1/(1 − γ),
    vπ(s2) = 1/(1 − γ),
    vπ(s1) = γ/(1 − γ).


Furthermore, if we set γ = 0.9, then

    vπ(s4) = 1/(1 − 0.9) = 10,
    vπ(s3) = 1/(1 − 0.9) = 10,
    vπ(s2) = 1/(1 − 0.9) = 10,
    vπ(s1) = 0.9/(1 − 0.9) = 9.

Figure 2.5: An example for demonstrating the Bellman equation. The policy in this example is stochastic.

 Consider the second example shown in Figure 2.5, where the policy is stochastic. We
next write out the Bellman equation and then solve the state values from it.
At state s1 , the probabilities of going right and down equal 0.5. Mathematically, we
have π(a = a2 |s1 ) = 0.5 and π(a = a3 |s1 ) = 0.5. The state transition probability
is deterministic since p(s' = s3|s1, a3) = 1 and p(s' = s2|s1, a2) = 1. The reward
probability is also deterministic since p(r = 0|s1 , a3 ) = 1 and p(r = −1|s1 , a2 ) = 1.
Substituting these values into (2.7) gives

vπ (s1 ) = 0.5[0 + γvπ (s3 )] + 0.5[−1 + γvπ (s2 )].

Similarly, it can be obtained that

vπ (s2 ) = 1 + γvπ (s4 ),


vπ (s3 ) = 1 + γvπ (s4 ),
vπ (s4 ) = 1 + γvπ (s4 ).

The state values can be solved from the above equations. Since the equations are


simple, we can solve the state values manually and obtain

    vπ(s4) = 1/(1 − γ),
    vπ(s3) = 1/(1 − γ),
    vπ(s2) = 1/(1 − γ),
    vπ(s1) = 0.5[0 + γvπ(s3)] + 0.5[−1 + γvπ(s2)]
           = −0.5 + γ/(1 − γ).

Furthermore, if we set γ = 0.9, then

vπ (s4 ) = 10,
vπ (s3 ) = 10,
vπ (s2 ) = 10,
vπ (s1 ) = −0.5 + 9 = 8.5.

If we compare the state values of the two policies in the above examples, it can be
seen that

vπ1 (si ) ≥ vπ2 (si ), i = 1, 2, 3, 4,

which indicates that the policy in Figure 2.4 is better because it has greater state values.
This mathematical conclusion is consistent with the intuition that the first policy is better
because it can avoid entering the forbidden area when the agent starts from s1 . As a
result, the above two examples demonstrate that state values can be used to evaluate
policies.
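The two small systems of equations can also be solved numerically. In the sketch below, the rπ
vectors and Pπ matrices are read off from the Bellman equations above, and the comparison
vπ1(si) ≥ vπ2(si) is verified for all four states:

    import numpy as np

    gamma = 0.9
    I = np.eye(4)

    # Policy of Figure 2.4 (deterministic): from its Bellman equations.
    r1 = np.array([0.0, 1.0, 1.0, 1.0])
    P1 = np.array([[0, 0, 1, 0],      # s1 -> s3
                   [0, 0, 0, 1],      # s2 -> s4
                   [0, 0, 0, 1],      # s3 -> s4
                   [0, 0, 0, 1]])     # s4 -> s4

    # Policy of Figure 2.5 (stochastic at s1): right or down with probability 0.5 each.
    r2 = np.array([0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0])
    P2 = np.array([[0, 0.5, 0.5, 0],
                   [0, 0,   0,   1],
                   [0, 0,   0,   1],
                   [0, 0,   0,   1]])

    v1 = np.linalg.solve(I - gamma * P1, r1)   # approximately [9, 10, 10, 10]
    v2 = np.linalg.solve(I - gamma * P2, r2)   # approximately [8.5, 10, 10, 10]
    assert np.all(v1 >= v2)
    print(v1, v2)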

2.6 Matrix-vector form of the Bellman equation


The Bellman equation in (2.7) is in an elementwise form. Since it is valid for every state,
we can combine all these equations and write them concisely in a matrix-vector form,
which will be frequently used to analyze the Bellman equation.
To derive the matrix-vector form, we first rewrite the Bellman equation in (2.7) as
    vπ(s) = rπ(s) + γ Σ_{s'∈S} pπ(s'|s) vπ(s'),                           (2.8)


where

    rπ(s) ≐ Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a) r,
    pπ(s'|s) ≐ Σ_{a∈A} π(a|s) p(s'|s, a).

Here, rπ(s) denotes the mean of the immediate rewards, and pπ(s'|s) is the probability
of transitioning from s to s' under policy π.
Suppose that the states are indexed as si with i = 1, . . . , n, where n = |S|. For state
si , (2.8) can be written as
    vπ(si) = rπ(si) + γ Σ_{sj∈S} pπ(sj|si) vπ(sj).                        (2.9)

Let vπ = [vπ (s1 ), . . . , vπ (sn )]T ∈ Rn , rπ = [rπ (s1 ), . . . , rπ (sn )]T ∈ Rn , and Pπ ∈ Rn×n with
[Pπ ]ij = pπ (sj |si ). Then, (2.9) can be written in the following matrix-vector form:

vπ = rπ + γPπ vπ , (2.10)

where vπ is the unknown to be solved, and rπ , Pπ are known.


The matrix Pπ has some interesting properties. First, it is a nonnegative matrix,
meaning that all its elements are equal to or greater than zero. This property is denoted
as Pπ ≥ 0, where 0 denotes a zero matrix with appropriate dimensions. In this book, ≥
or ≤ represents an elementwise comparison operation. Second, Pπ is a stochastic matrix,
meaning that the sum of the values in every row is equal to one. This property is denoted
as Pπ 1 = 1, where 1 = [1, . . . , 1]T has appropriate dimensions.
Consider the example shown in Figure 2.6. The matrix-vector form of the Bellman
equation is

    [vπ(s1)]   [rπ(s1)]     [pπ(s1|s1) pπ(s2|s1) pπ(s3|s1) pπ(s4|s1)] [vπ(s1)]
    [vπ(s2)] = [rπ(s2)] + γ [pπ(s1|s2) pπ(s2|s2) pπ(s3|s2) pπ(s4|s2)] [vπ(s2)]
    [vπ(s3)]   [rπ(s3)]     [pπ(s1|s3) pπ(s2|s3) pπ(s3|s3) pπ(s4|s3)] [vπ(s3)]
    [vπ(s4)]   [rπ(s4)]     [pπ(s1|s4) pπ(s2|s4) pπ(s3|s4) pπ(s4|s4)] [vπ(s4)]
       vπ         rπ                            Pπ                      vπ

Substituting the specific values into the above equation gives

    [vπ(s1)]   [0.5(0) + 0.5(−1)]      [0  0.5  0.5  0] [vπ(s1)]
    [vπ(s2)] = [        1        ] + γ [0   0    0   1] [vπ(s2)]
    [vπ(s3)]   [        1        ]     [0   0    0   1] [vπ(s3)]
    [vπ(s4)]   [        1        ]     [0   0    0   1] [vπ(s4)]


It can be seen that Pπ satisfies Pπ 1 = 1.

Figure 2.6: An example for demonstrating the matrix-vector form of the Bellman equation.

2.7 Solving state values from the Bellman equation


Calculating the state values of a given policy is a fundamental problem in reinforcement
learning. This problem is often referred to as policy evaluation. In this section, we present
two methods for calculating state values from the Bellman equation.

2.7.1 Closed-form solution


Since vπ = rπ + γPπ vπ is a simple linear equation, its closed-form solution can be easily
obtained as
vπ = (I − γPπ )−1 rπ .

Some properties of (I − γPπ )−1 are given below.

 I − γPπ is invertible. The proof is as follows. According to the Gershgorin circle
theorem [4], every eigenvalue of I − γPπ lies within at least one of the Gershgorin
circles. The ith Gershgorin circle has a center at [I − γPπ]ii = 1 − γpπ(si|si) and
a radius equal to Σ_{j≠i} |[I − γPπ]ij| = Σ_{j≠i} γpπ(sj|si). Since γ < 1, the radius
is less than the magnitude of the center: Σ_{j≠i} γpπ(sj|si) = γ(1 − pπ(si|si)) < 1 − γpπ(si|si).
Therefore, all Gershgorin circles do not encircle the origin, and hence, no eigenvalue
of I − γPπ is zero.
 (I − γPπ )−1 ≥ I, meaning that every element of (I − γPπ )−1 is nonnegative and,
more specifically, no less than that of the identity matrix. This is because Pπ has
nonnegative entries, and hence, (I − γPπ )−1 = I + γPπ + γ 2 Pπ2 + · · · ≥ I ≥ 0.
 For any vector r ≥ 0, it holds that (I − γPπ )−1 r ≥ r ≥ 0. This property follows from
the second property because [(I − γPπ )−1 − I]r ≥ 0. As a consequence, if r1 ≥ r2 , we
have (I − γPπ )−1 r1 ≥ (I − γPπ )−1 r2 .
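These properties are easy to verify numerically, e.g., for the Pπ of the example in Figure 2.6; in
the sketch below, the test vectors r1 and r2 are arbitrary nonnegative vectors chosen only so that
r1 ≥ r2:

    import numpy as np

    gamma = 0.9
    P_pi = np.array([[0, 0.5, 0.5, 0],
                     [0, 0,   0,   1],
                     [0, 0,   0,   1],
                     [0, 0,   0,   1]])
    M = np.linalg.inv(np.eye(4) - gamma * P_pi)

    assert np.all(M >= np.eye(4) - 1e-12)        # (I - gamma P_pi)^{-1} >= I elementwise
    r1 = np.array([1.0, 2.0, 3.0, 4.0])
    r2 = np.array([0.5, 1.0, 2.0, 3.0])
    assert np.all(M @ r1 >= M @ r2)              # monotonicity, since r1 >= r2
    print(M)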


2.7.2 Iterative solution


Although the closed-form solution is useful for theoretical analysis purposes, it is not
applicable in practice because it involves a matrix inversion operation, which still needs
to be calculated by other numerical algorithms. In fact, we can directly solve the Bellman
equation using the following iterative algorithm:

vk+1 = rπ + γPπ vk , k = 0, 1, 2, . . . (2.11)

This algorithm generates a sequence of values {v0 , v1 , v2 , . . . }, where v0 ∈ Rn is an initial


guess of vπ . It holds that

vk → vπ = (I − γPπ )−1 rπ , as k → ∞. (2.12)

Interested readers may see the proof in Box 2.1.
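In code, the iteration in (2.11) takes only a few lines. The following minimal sketch reuses the rπ
and Pπ of the example in Figure 2.6 (derived in Section 2.6) and checks that the iterates approach
the closed-form solution:

    import numpy as np

    gamma = 0.9
    r_pi = np.array([-0.5, 1.0, 1.0, 1.0])
    P_pi = np.array([[0, 0.5, 0.5, 0],
                     [0, 0,   0,   1],
                     [0, 0,   0,   1],
                     [0, 0,   0,   1]])

    v = np.zeros(4)                                  # arbitrary initial guess v0
    for _ in range(1000):
        v_new = r_pi + gamma * P_pi @ v              # v_{k+1} = r_pi + gamma P_pi v_k
        if np.max(np.abs(v_new - v)) < 1e-10:
            break
        v = v_new

    v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
    assert np.allclose(v, v_closed)
    print(v)                                         # approximately [8.5, 10, 10, 10]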

Box 2.1: Convergence proof of (2.12)

Define the error as δk = vk − vπ . We only need to show that δk → 0. Substituting


vk+1 = δk+1 + vπ and vk = δk + vπ into vk+1 = rπ + γPπ vk gives

δk+1 + vπ = rπ + γPπ (δk + vπ ),

which can be rewritten as

δk+1 = −vπ + rπ + γPπ δk + γPπ vπ ,


= γPπ δk − vπ + (rπ + γPπ vπ ),
= γPπ δk .

As a result,
δk+1 = γPπ δk = γ 2 Pπ2 δk−1 = · · · = γ k+1 Pπk+1 δ0 .

Since every entry of Pπ is nonnegative and no greater than one, we have that 0 ≤
Pπk ≤ 1 for any k. That is, every entry of Pπk is no greater than 1. On the other hand,
since γ < 1, we know that γ k → 0, and hence, δk+1 = γ k+1 Pπk+1 δ0 → 0 as k → ∞.

2.7.3 Illustrative examples


We next apply the algorithm in (2.11) to solve the state values of some examples.
The examples are shown in Figure 2.7. The orange cells represent forbidden areas.
The blue cell represents the target area. The reward settings are rboundary = rforbidden = −1


Figure 2.7: Examples of policies and their corresponding state values. (a) Two “good” policies and
their state values. The state values of the two policies are the same, but the two policies are
different at the top two states in the fourth column. (b) Two “bad” policies and their state values.
The state values are smaller than those of the “good” policies.

and rtarget = 1. Here, the discount rate is γ = 0.9.


Figure 2.7(a) shows two “good” policies and their corresponding state values obtained
by (2.11). The two policies have the same state values but differ at the top two states in
the fourth column. Therefore, we know that different policies may have the same state
values.
Figure 2.7(b) shows two “bad” policies and their corresponding state values. These
two policies are bad because the actions of many states are intuitively unreasonable.
Such intuition is supported by the obtained state values. As can be seen, the state values
of these two policies are negative and much smaller than those of the good policies in
Figure 2.7(a).

2.8 From state value to action value


While we have been discussing state values thus far in this chapter, we now turn to
the action value, which indicates the “value” of taking an action at a state. While the
concept of action value is important, the reason why it is introduced in the last section
of this chapter is that it heavily relies on the concept of state values. It is important to
understand state values well first before studying action values.
The action value of a state-action pair (s, a) is defined as

    qπ(s, a) ≐ E[Gt | St = s, At = a].

As can be seen, the action value is defined as the expected return that can be obtained
after taking an action at a state. It must be noted that qπ (s, a) depends on a state-action
pair (s, a) rather than an action alone. It may be more rigorous to call this value a
state-action value, but it is conventionally called an action value for simplicity.
What is the relationship between action values and state values?

 First, it follows from the properties of conditional expectation that

    E[Gt | St = s] = Σ_{a∈A} E[Gt | St = s, At = a] π(a|s),

where the left-hand side is vπ(s) and the conditional expectation inside the sum is qπ(s, a).
It then follows that

    vπ(s) = Σ_{a∈A} π(a|s) qπ(s, a).                                      (2.13)

As a result, a state value is the expectation of the action values associated with that
state.
 Second, since the state value is given by
    vπ(s) = Σ_{a∈A} π(a|s) [ Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) vπ(s') ],


comparing it with (2.13) leads to

    qπ(s, a) = Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) vπ(s').        (2.14)

It can be seen that the action value consists of two terms. The first term is the mean
of the immediate rewards, and the second term is the mean of the future rewards.

Both (2.13) and (2.14) describe the relationship between state values and action val-
ues. They are the two sides of the same coin: (2.13) shows how to obtain state values
from action values, whereas (2.14) shows how to obtain action values from state values.
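As a small numerical illustration (using the state values of the stochastic policy computed in
Section 2.5, with γ = 0.9), the sketch below obtains the action values at s1 via (2.14) and then
recovers vπ(s1) via (2.13):

    gamma = 0.9
    v = {"s1": 8.5, "s2": 10.0, "s3": 10.0, "s4": 10.0}   # state values from Section 2.5

    # (2.14): action values of the two actions the policy takes at s1.
    q_s1_a2 = -1.0 + gamma * v["s2"]      # go right: reward -1, next state s2
    q_s1_a3 = 0.0 + gamma * v["s3"]       # go down:  reward  0, next state s3

    # (2.13): the state value is the policy-weighted average of the action values.
    v_s1 = 0.5 * q_s1_a2 + 0.5 * q_s1_a3
    assert abs(v_s1 - v["s1"]) < 1e-9
    print(q_s1_a2, q_s1_a3, v_s1)         # 8.0, 9.0, 8.5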

2.8.1 Illustrative examples

Figure 2.8: An example for demonstrating the process of calculating action values.

We next present an example to illustrate the process of calculating action values and
discuss a common mistake that beginners may make.
Consider the stochastic policy shown in Figure 2.8. We next only examine the actions
of s1 . The other states can be examined similarly. The action value of (s1 , a2 ) is

qπ (s1 , a2 ) = −1 + γvπ (s2 ),

where s2 is the next state. Similarly, it can be obtained that

qπ (s1 , a3 ) = 0 + γvπ (s3 ).

A common mistake that beginners may make is about the values of the actions that
the given policy does not select. For example, the policy in Figure 2.8 can only select
a2 or a3 and cannot select a1 , a4 , a5 . One may argue that since the policy does not
select a1 , a4 , a5 , we do not need to calculate their action values, or we can simply set
qπ (s1 , a1 ) = qπ (s1 , a4 ) = qπ (s1 , a5 ) = 0. This is wrong.

 First, even if an action would not be selected by a policy, it still has an action value.
In this example, although policy π does not take a1 at s1 , we can still calculate its


action value by observing what we would obtain after taking this action. Specifically,
after taking a1 , the agent is bounced back to s1 (hence, the immediate reward is −1)
and then continues moving in the state space starting from s1 by following π (hence,
the future reward is γvπ (s1 )). As a result, the action value of (s1 , a1 ) is

qπ (s1 , a1 ) = −1 + γvπ (s1 ).

Similarly, for a4 and a5 , which cannot be possibly selected by the given policy either,
we have

qπ (s1 , a4 ) = −1 + γvπ (s1 ),


qπ (s1 , a5 ) = 0 + γvπ (s1 ).

 Second, why do we care about the actions that the given policy would not select?
Although some actions cannot be possibly selected by a given policy, this does not
mean that these actions are not good. It is possible that the given policy is not good,
so it cannot select the best action. The purpose of reinforcement learning is to find
optimal policies. To that end, we must keep exploring all actions to determine better
actions for each state.
Finally, after computing the action values, we can also calculate the state value ac-
cording to (2.13):

vπ (s1 ) = 0.5qπ (s1 , a2 ) + 0.5qπ (s1 , a3 ),


= 0.5[0 + γvπ (s3 )] + 0.5[−1 + γvπ (s2 )].
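The calculation for all five actions can be written compactly. The sketch below assumes γ = 0.9, so
that vπ(s1) = 8.5 and vπ(s2) = vπ(s3) = 10 as computed in Section 2.5; the (reward, next state)
pairs encode the grid dynamics described above:

    gamma = 0.9
    v = {"s1": 8.5, "s2": 10.0, "s3": 10.0}

    # (immediate reward, next state) of each action at s1 in Figure 2.8
    model_s1 = {
        "a1": (-1.0, "s1"),   # move up: bounced back by the boundary
        "a2": (-1.0, "s2"),   # move right: enter the forbidden cell
        "a3": ( 0.0, "s3"),   # move down
        "a4": (-1.0, "s1"),   # move left: bounced back by the boundary
        "a5": ( 0.0, "s1"),   # stay unchanged
    }

    q_s1 = {a: r + gamma * v[s_next] for a, (r, s_next) in model_s1.items()}
    print(q_s1)   # a1: 6.65, a2: 8.0, a3: 9.0, a4: 6.65, a5: 7.65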

2.8.2 The Bellman equation in terms of action values


The Bellman equation that we previously introduced was defined based on state values.
In fact, it can also be expressed in terms of action values.
In particular, substituting (2.13) into (2.14) yields
    qπ(s, a) = Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) Σ_{a'∈A(s')} π(a'|s') qπ(s', a'),

which is an equation of action values. The above equation is valid for every state-action
pair. If we put all these equations together, their matrix-vector form is

qπ = r̃ + γP Πqπ , (2.15)

where qπ is the action value vector indexed by the state-action pairs: its (s, a)th element
is [qπ](s,a) = qπ(s, a). r̃ is the immediate reward vector indexed by the state-action
pairs: [r̃](s,a) = Σ_{r∈R} p(r|s, a) r. The matrix P is the probability transition matrix, whose


row is indexed by the state-action pairs and whose column is indexed by the states:
[P](s,a),s' = p(s'|s, a). Moreover, Π is a block diagonal matrix in which each block is a
1 × |A| vector: [Π]s',(s',a') = π(a'|s'), and the other entries of Π are zero.
Compared to the Bellman equation defined in terms of state values, the equation
defined in terms of action values has some unique features. For example, r̃ and P are
independent of the policy and are merely determined by the system model. The policy
is embedded in Π. It can be verified that (2.15) is also a contraction mapping and has a
unique solution that can be iteratively solved. More details can be found in [5].

2.9 Summary
The most important concept introduced in this chapter is the state value. Mathematically,
a state value is the expected return that the agent can obtain by starting from a state.
The values of different states are related to each other. That is, the value of state s
relies on the values of some other states, which may further rely on the value of state s
itself. This phenomenon might be the most confusing part of this chapter for beginners.
It is related to an important concept called bootstrapping, which involves calculating
something from itself. Although bootstrapping may be intuitively confusing, it is clear if
we examine the matrix-vector form of the Bellman equation. In particular, the Bellman
equation is a set of linear equations that describe the relationships between the values of
all states.
Since state values can be used to evaluate whether a policy is good or not, the process
of solving the state values of a policy from the Bellman equation is called policy evalu-
ation. As we will see later in this book, policy evaluation is an important step in many
reinforcement learning algorithms.
Another important concept, action value, was introduced to describe the value of
taking one action at a state. As we will see later in this book, action values play a
more direct role than state values when we attempt to find optimal policies. Finally, the
Bellman equation is not restricted to the reinforcement learning field. Instead, it widely
exists in many fields such as control theory and operations research. In different fields,
the Bellman equation may have different expressions. In this book, the Bellman equation
is studied under discrete Markov decision processes. More information about this topic
can be found in [2].

2.10 Q&A
 Q: What is the relationship between state values and returns?
A: The value of a state is the mean of the returns that can be obtained if the agent
starts from that state.


 Q: Why do we care about state values?


A: State values can be used to evaluate policies. In fact, optimal policies are defined
based on state values. This point will become clearer in the next chapter.
 Q: Why do we care about the Bellman equation?
A: The Bellman equation describes the relationships among the values of all states.
It is the tool for analyzing state values.
 Q: Why is the process of solving the Bellman equation called policy evaluation?
A: Solving the Bellman equation yields state values. Since state values can be used to
evaluate a policy, solving the Bellman equation can be interpreted as evaluating the
corresponding policy.
 Q: Why do we need to study the matrix-vector form of the Bellman equation?
A: The Bellman equation refers to a set of linear equations established for all the
states. To solve state values, we must put all the linear equations together. The
matrix-vector form is a concise expression of these linear equations.
 Q: What is the relationship between state values and action values?
A: On the one hand, a state value is the mean of the action values for that state. On
the other hand, an action value relies on the values of the next states that the agent
may transition to after taking the action.
 Q: Why do we care about the values of the actions that a given policy cannot select?
A: Although a given policy cannot select some actions, this does not mean that these
actions are not good. On the contrary, it is possible that the given policy is not good
and misses the best action. To find better policies, we must keep exploring different
actions even though some of them may not be selected by the given policy.

Chapter 3

Optimal State Values and Bellman Optimality Equation

Figure 3.1: Where we are in this book.

The ultimate goal of reinforcement learning is to seek optimal policies. It is, therefore,
necessary to define what optimal policies are. In this chapter, we introduce a core concept
and an important tool. The core concept is the optimal state value, based on which we
can define optimal policies. The important tool is the Bellman optimality equation, from
which we can solve the optimal state values and policies.
The relationship between the previous, present, and subsequent chapters is as follows.
The previous chapter (Chapter 2) introduced the Bellman equation of any given policy.


The present chapter introduces the Bellman optimality equation, which is a special Bell-
man equation whose corresponding policy is optimal. The next chapter (Chapter 4) will
introduce an important algorithm called value iteration, which is exactly the algorithm
for solving the Bellman optimality equation as introduced in the present chapter.
Be prepared that this chapter is slightly mathematically intensive. However, it is
worth it because many fundamental questions can be clearly answered.

3.1 Motivating example: How to improve policies?

Figure 3.2: An example for demonstrating policy improvement.

Consider the policy shown in Figure 3.2. Here, the orange and blue cells represent the
forbidden and target areas, respectively. The policy here is not good because it selects a2
(rightward) at state s1 . How can we improve the given policy to obtain a better policy?
The answer lies in state values and action values.

 Intuition: It is intuitively clear that the policy can improve if it selects a3 (downward)
instead of a2 (rightward) at s1 . This is because moving downward enables the agent
to avoid entering the forbidden area.
 Mathematics: The above intuition can be realized based on the calculation of state
values and action values.
First, we calculate the state values of the given policy. In particular, the Bellman
equation of this policy is

vπ (s1 ) = −1 + γvπ (s2 ),


vπ (s2 ) = +1 + γvπ (s4 ),
vπ (s3 ) = +1 + γvπ (s4 ),
vπ (s4 ) = +1 + γvπ (s4 ).


Let γ = 0.9. It can be easily solved that

vπ (s4 ) = vπ (s3 ) = vπ (s2 ) = 10,


vπ (s1 ) = 8.

Second, we calculate the action values for state s1 :

qπ (s1 , a1 ) = −1 + γvπ (s1 ) = 6.2,


qπ (s1 , a2 ) = −1 + γvπ (s2 ) = 8,
qπ (s1 , a3 ) = 0 + γvπ (s3 ) = 9,
qπ (s1 , a4 ) = −1 + γvπ (s1 ) = 6.2,
qπ (s1 , a5 ) = 0 + γvπ (s1 ) = 7.2.

It is notable that action a3 has the greatest action value:

    qπ(s1, a3) ≥ qπ(s1, ai),   for all i ≠ 3.

Therefore, we can update the policy to select a3 at s1 .

This example illustrates that we can obtain a better policy if we update the poli-
cy to select the action with the greatest action value. This is the basic idea of many
reinforcement learning algorithms.
This example is very simple in the sense that the given policy is poor only at
state s1. If the policy is also poor at other states, will selecting the action with
the greatest action value still generate a better policy? Moreover, do optimal policies
always exist? What does an optimal policy look like? We will answer all of
these questions in this chapter.
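The calculation in this section is easy to reproduce numerically, as in the following minimal sketch
(the state indices and the (reward, next state) model of s1 are read off from the description above):

    import numpy as np

    gamma = 0.9
    # Bellman equation of the policy in Figure 3.2, in matrix-vector form.
    r_pi = np.array([-1.0, 1.0, 1.0, 1.0])
    P_pi = np.array([[0, 1, 0, 0],      # s1 -> s2 (rightward)
                     [0, 0, 0, 1],      # s2 -> s4
                     [0, 0, 0, 1],      # s3 -> s4
                     [0, 0, 0, 1]])     # s4 -> s4
    v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)   # approximately [8, 10, 10, 10]

    # Action values at s1: (immediate reward, index of the next state) for a1..a5.
    model_s1 = {"a1": (-1.0, 0), "a2": (-1.0, 1), "a3": (0.0, 2),
                "a4": (-1.0, 0), "a5": (0.0, 0)}
    q_s1 = {a: r + gamma * v[s_next] for a, (r, s_next) in model_s1.items()}
    print(q_s1)                       # a3 has the greatest value (9.0)
    print(max(q_s1, key=q_s1.get))    # 'a3', so the improved policy takes a3 at s1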

3.2 Optimal state values and optimal policies


While the ultimate goal of reinforcement learning is to obtain optimal policies, it is
necessary to first define what an optimal policy is. The definition is based on state
values. In particular, consider two given policies π1 and π2 . If the state value of π1 is
greater than or equal to that of π2 for any state:

vπ1 (s) ≥ vπ2 (s), for all s ∈ S,

then π1 is said to be better than π2 . Furthermore, if a policy is better than all the other
possible policies, then this policy is optimal. This is formally stated below.


Definition 3.1 (Optimal policy and optimal state value). A policy π ∗ is optimal if
vπ∗ (s) ≥ vπ (s) for all s ∈ S and for any other policy π. The state values of π ∗ are the
optimal state values.
The above definition indicates that an optimal policy has the greatest state value
for every state compared to all the other policies. This definition also leads to many
questions:
 Existence: Does the optimal policy exist?
 Uniqueness: Is the optimal policy unique?
 Stochasticity: Is the optimal policy stochastic or deterministic?
 Algorithm: How to obtain the optimal policy and the optimal state values?
These fundamental questions must be clearly answered to thoroughly understand
optimal policies. For example, regarding the existence of optimal policies, if optimal
policies do not exist, then we do not need to bother to design algorithms to find them.
We will answer all these questions in the remainder of this chapter.

3.3 Bellman optimality equation


The tool for analyzing optimal policies and optimal state values is the Bellman optimality
equation (BOE). By solving this equation, we can obtain optimal policies and optimal
state values. We next present the expression of the BOE and then analyze it in detail.
For every s ∈ S, the elementwise expression of the BOE is

    v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) v(s') )
         = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) q(s, a),                        (3.1)

where v(s), v(s') are unknown variables to be solved and

    q(s, a) ≐ Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) v(s').

Here, π(s) denotes a policy for state s, and Π(s) is the set of all possible policies for s.
The BOE is an elegant and powerful tool for analyzing optimal policies. However,
it may be nontrivial to understand this equation. For example, this equation has two
unknown variables v(s) and π(a|s). It may be confusing to beginners how to solve two
unknown variables from one equation. Moreover, the BOE is actually a special Bellman
equation. However, it is nontrivial to see that since its expression is quite different from
that of the Bellman equation. We also need to answer the following fundamental questions
about the BOE.


 Existence: Does this equation have a solution?


 Uniqueness: Is the solution unique?
 Algorithm: How to solve this equation?
 Optimality: How is the solution related to optimal policies?
Once we can answer these questions, we will clearly understand optimal state values and
optimal policies.

3.3.1 Maximization of the right-hand side of the BOE


We next clarify how to solve the maximization problem on the right-hand side of the
BOE in (3.1). At first glance, it may be confusing to beginners how to solve two unknown
variables v(s) and π(a|s) from one equation. In fact, these two unknown variables can
be solved one by one. This idea is illustrated by the following example.
Example 3.1. Consider two unknown variables x, y ∈ R that satisfy

    x = max_{y∈R} (2x − 1 − y²).

The first step is to solve y on the right-hand side of the equation. Regardless of the value
of x, we always have maxy (2x − 1 − y 2 ) = 2x − 1, where the maximum is achieved when
y = 0. The second step is to solve x. When y = 0, the equation becomes x = 2x − 1,
which leads to x = 1. Therefore, y = 0 and x = 1 are the solutions of the equation.
We now turn to the maximization problem on the right-hand side of the BOE. The
BOE in (3.1) can be written concisely as
    v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) q(s, a),   s ∈ S.

Inspired by Example 3.1, we can first solve the optimal π on the right-hand side. How to
do that? The following example demonstrates its basic idea.
Example 3.2. Given q1 , q2 , q3 ∈ R, we would like to find the optimal values of c1 , c2 , c3
to maximize
    Σ_{i=1}^{3} ci qi = c1 q1 + c2 q2 + c3 q3,

where c1 + c2 + c3 = 1 and c1, c2, c3 ≥ 0.
    Without loss of generality, suppose that q3 ≥ q1, q2. Then, the optimal solution is
c∗3 = 1 and c∗1 = c∗2 = 0. This is because

    q3 = (c1 + c2 + c3) q3 = c1 q3 + c2 q3 + c3 q3 ≥ c1 q1 + c2 q2 + c3 q3

for any c1, c2, c3.


Inspired by the above example, since Σ_{a} π(a|s) = 1, we have

    Σ_{a∈A} π(a|s) q(s, a) ≤ Σ_{a∈A} π(a|s) max_{a∈A} q(s, a) = max_{a∈A} q(s, a),

where equality is achieved when

    π(a|s) = 1 if a = a∗,   and   π(a|s) = 0 if a ≠ a∗.

Here, a∗ = arg max_a q(s, a). In summary, the optimal policy π(s) is the one that selects
the action that has the greatest value of q(s, a).
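This conclusion can be checked with a tiny numerical experiment: for any probability distribution
over actions, the weighted sum of action values never exceeds the greedy value. The q values below
are example numbers only:

    import random

    q = {"a1": 6.2, "a2": 8.0, "a3": 9.0, "a4": 6.2, "a5": 7.2}   # example action values
    greedy_value = max(q.values())

    for _ in range(1000):
        weights = [random.random() for _ in q]          # a random stochastic policy at s
        total = sum(weights)
        pi = [w / total for w in weights]
        value = sum(p * qa for p, qa in zip(pi, q.values()))
        assert value <= greedy_value + 1e-12
    print(greedy_value)    # achieved by the deterministic policy that always takes a3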

3.3.2 Matrix-vector form of the BOE


The BOE refers to a set of equations defined for all states. If we combine these equations,
we can obtain a concise matrix-vector form, which will be extensively used in this chapter.
The matrix-vector form of the BOE is

    v = max_{π∈Π} (rπ + γPπ v),                                           (3.2)

where v ∈ R^{|S|} and the maximization is performed in an elementwise manner. The structures
of rπ and Pπ are the same as those in the matrix-vector form of the normal Bellman equation:

    [rπ]s ≐ Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a) r,      [Pπ]s,s' ≐ p(s'|s) = Σ_{a∈A} π(a|s) p(s'|s, a).

Since the optimal value of π is determined by v, the right-hand side of (3.2) is a function
of v, denoted as
    f(v) ≐ max_{π∈Π} (rπ + γPπ v).

Then, the BOE can be expressed in a concise form as

v = f (v). (3.3)

In the remainder of this section, we show how to solve this nonlinear equation.

3.3.3 Contraction mapping theorem


Since the BOE can be expressed as a nonlinear equation v = f (v), we next introduce
the contraction mapping theorem [6] to analyze it. The contraction mapping theorem is
a powerful tool for analyzing general nonlinear equations. It is also known as the fixed-
point theorem. Readers who already know this theorem can skip this part. Otherwise,
the reader is advised to be familiar with this theorem since it is the key to analyzing the


BOE.
Consider a function f (x), where x ∈ Rd and f : Rd → Rd . A point x∗ is called a fixed
point if
f (x∗ ) = x∗ .

The interpretation of the above equation is that the map of x∗ is itself. This is the
reason why x∗ is called “fixed”. The function f is a contraction mapping (or contractive
function) if there exists γ ∈ (0, 1) such that

    ‖f(x1) − f(x2)‖ ≤ γ‖x1 − x2‖

for any x1, x2 ∈ R^d. In this book, ‖ · ‖ denotes a vector or matrix norm.

Example 3.3. We present three examples to demonstrate fixed points and contraction
mappings.

 x = f (x) = 0.5x, x ∈ R.
It is easy to verify that x = 0 is a fixed point since 0 = 0.5 · 0. Moreover, f (x) = 0.5x
is a contraction mapping because ‖0.5x1 − 0.5x2‖ = 0.5‖x1 − x2‖ ≤ γ‖x1 − x2‖ for
any γ ∈ [0.5, 1).

 x = f(x) = Ax, where x ∈ R^n, A ∈ R^{n×n}, and ‖A‖ ≤ γ < 1.
It is easy to verify that x = 0 is a fixed point since 0 = A0. To see the contraction
property, note that ‖Ax1 − Ax2‖ = ‖A(x1 − x2)‖ ≤ ‖A‖ ‖x1 − x2‖ ≤ γ‖x1 − x2‖. Therefore,
f (x) = Ax is a contraction mapping.

 x = f (x) = 0.5 sin x, x ∈ R.


It is easy to see that x = 0 is a fixed point since 0 = 0.5 sin 0. Moreover, it follows
from the mean value theorem [7, 8] that

    | (0.5 sin x1 − 0.5 sin x2) / (x1 − x2) | = |0.5 cos x3| ≤ 0.5,   for some x3 ∈ [x1, x2].

As a result, |0.5 sin x1 − 0.5 sin x2 | ≤ 0.5|x1 − x2 | and hence f (x) = 0.5 sin x is a
contraction mapping.

The relationship between a fixed point and the contraction property is characterized
by the following classic theorem.

Theorem 3.1 (Contraction mapping theorem). For any equation that has the form x =
f (x) where x and f (x) are real vectors, if f is a contraction mapping, then the following
properties hold.

 Existence: There exists a fixed point x∗ satisfying f (x∗ ) = x∗ .


 Uniqueness: The fixed point x∗ is unique.


 Algorithm: Consider the iterative process:

xk+1 = f (xk ),

where k = 0, 1, 2, . . . . Then, xk → x∗ as k → ∞ for any initial guess x0 . Moreover,


the convergence rate is exponentially fast.

The contraction mapping theorem not only can tell whether the solution of a nonlinear
equation exists but also suggests a numerical algorithm for solving the equation. The
proof of the theorem is given in Box 3.1.
The following example demonstrates how to calculate the fixed points of some equa-
tions using the iterative algorithm suggested by the contraction mapping theorem.

Example 3.4. Let us revisit the abovementioned examples: x = 0.5x, x = Ax, and
x = 0.5 sin x. While it has been shown that the right-hand sides of these three equations
are all contraction mappings, it follows from the contraction mapping theorem that they
each have a unique fixed point, which can be easily verified to be x∗ = 0. Moreover, the
fixed points of the three equations can be iteratively solved by the following algorithms:

xk+1 = 0.5xk ,
xk+1 = Axk ,
xk+1 = 0.5 sin xk ,

given any initial guess x0 .
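For instance, the third equation can be solved with a short loop; the iterates approach the fixed
point x∗ = 0 exponentially fast from any initial guess:

    import math

    x = 3.0                       # arbitrary initial guess x0
    for k in range(30):
        x = 0.5 * math.sin(x)     # x_{k+1} = f(x_k)
        print(k, x)
    # x approaches the unique fixed point 0; the error shrinks by a factor <= 0.5 per step.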

Box 3.1: Proof of the contraction mapping theorem

Part 1: We prove that the sequence {xk}∞_{k=1} with xk = f(xk−1) is convergent.
The proof relies on Cauchy sequences. A sequence x1, x2, · · · ∈ R is called Cauchy
if for any small ε > 0, there exists N such that ‖xm − xn‖ < ε for all m, n > N.
The intuitive interpretation is that there exists a finite integer N such that all the
elements after N are sufficiently close to each other. Cauchy sequences are important
because it is guaranteed that a Cauchy sequence converges to a limit. This convergence
property will be used to prove the contraction mapping theorem. Note that we must
have ‖xm − xn‖ < ε for all m, n > N. If we simply have xn+1 − xn → 0, it is
insufficient to claim that the sequence is a Cauchy sequence. For example, it holds
that xn+1 − xn → 0 for xn = √n, but apparently, xn = √n diverges.
    We next show that {xk = f(xk−1)}∞_{k=1} is a Cauchy sequence and hence converges.


First, since f is a contraction mapping, we have

    ‖xk+1 − xk‖ = ‖f(xk) − f(xk−1)‖ ≤ γ‖xk − xk−1‖.

Similarly, we have ‖xk − xk−1‖ ≤ γ‖xk−1 − xk−2‖, . . . , ‖x2 − x1‖ ≤ γ‖x1 − x0‖. Thus,
we have

    ‖xk+1 − xk‖ ≤ γ‖xk − xk−1‖ ≤ γ²‖xk−1 − xk−2‖ ≤ · · · ≤ γ^k ‖x1 − x0‖.

Since γ < 1, we know that ‖xk+1 − xk‖ converges to zero exponentially fast as k → ∞
given any x1, x0. Notably, the convergence of {‖xk+1 − xk‖} is not sufficient for
implying the convergence of {xk}. Therefore, we need to further consider ‖xm − xn‖
for any m > n. In particular,

    ‖xm − xn‖ = ‖xm − xm−1 + xm−1 − · · · − xn+1 + xn+1 − xn‖
              ≤ ‖xm − xm−1‖ + · · · + ‖xn+1 − xn‖
              ≤ γ^{m−1} ‖x1 − x0‖ + · · · + γ^n ‖x1 − x0‖
              = γ^n (γ^{m−1−n} + · · · + 1) ‖x1 − x0‖
              ≤ γ^n (1 + γ + · · · + γ^{m−1−n} + γ^{m−n} + γ^{m−n+1} + · · · ) ‖x1 − x0‖
              = (γ^n / (1 − γ)) ‖x1 − x0‖.                                (3.4)

As a result, for any ε, we can always find N such that ‖xm − xn‖ < ε for all m, n > N.
Therefore, this sequence is Cauchy and hence converges to a limit point denoted as
x∗ = lim_{k→∞} xk.
    Part 2: We show that the limit x∗ = lim_{k→∞} xk is a fixed point. To do that, since

    ‖f(xk) − xk‖ = ‖xk+1 − xk‖ ≤ γ^k ‖x1 − x0‖,

we know that ‖f(xk) − xk‖ converges to zero exponentially fast. Hence, we have
f(x∗) = x∗ at the limit.
    Part 3: We show that the fixed point is unique. Suppose that there is another
fixed point x′ satisfying f(x′) = x′. Then,

    ‖x′ − x∗‖ = ‖f(x′) − f(x∗)‖ ≤ γ‖x′ − x∗‖.


Since γ < 1, this inequality holds if and only if ‖x′ − x∗‖ = 0. Therefore, x′ = x∗.
    Part 4: We show that xk converges to x∗ exponentially fast. Recall that
‖xm − xn‖ ≤ (γ^n / (1 − γ)) ‖x1 − x0‖, as proven in (3.4). Since m can be arbitrarily large,
we have

    ‖x∗ − xn‖ = lim_{m→∞} ‖xm − xn‖ ≤ (γ^n / (1 − γ)) ‖x1 − x0‖.

Since γ < 1, the error converges to zero exponentially fast as n → ∞.

3.3.4 Contraction property of the right-hand side of the BOE


We next show that f (v) in the BOE in (3.3) is a contraction mapping. Thus, the con-
traction mapping theorem introduced in the previous subsection can be applied.

Theorem 3.2 (Contraction property of f (v)). The function f (v) on the right-hand side
of the BOE in (3.3) is a contraction mapping. In particular, for any v1 , v2 ∈ R|S| , it holds
that
    ‖f(v1) − f(v2)‖∞ ≤ γ‖v1 − v2‖∞,

where γ ∈ (0, 1) is the discount rate, and ‖ · ‖∞ is the maximum norm, which is the
maximum absolute value of the elements of a vector.

The proof of the theorem is given in Box 3.2. This theorem is important because we
can use the powerful contraction mapping theorem to analyze the BOE.

Box 3.2: Proof of Theorem 3.2


Consider any two vectors v1, v2 ∈ R^{|S|}, and suppose that π1∗ ≐ arg max_π (rπ + γPπ v1)
and π2∗ ≐ arg max_π (rπ + γPπ v2). Then,

    f(v1) = max_π (rπ + γPπ v1) = rπ1∗ + γPπ1∗ v1 ≥ rπ2∗ + γPπ2∗ v1,
    f(v2) = max_π (rπ + γPπ v2) = rπ2∗ + γPπ2∗ v2 ≥ rπ1∗ + γPπ1∗ v2,

where ≥ is an elementwise comparison. As a result,

f (v1 ) − f (v2 ) = rπ1∗ + γPπ1∗ v1 − (rπ2∗ + γPπ2∗ v2 )


≤ rπ1∗ + γPπ1∗ v1 − (rπ1∗ + γPπ1∗ v2 )
= γPπ1∗ (v1 − v2 ).


Similarly, it can be shown that f (v2 ) − f (v1 ) ≤ γPπ2∗ (v2 − v1 ). Therefore,

γPπ2∗ (v1 − v2 ) ≤ f (v1 ) − f (v2 ) ≤ γPπ1∗ (v1 − v2 ).

Define

    z ≐ max( |γPπ2∗ (v1 − v2)|, |γPπ1∗ (v1 − v2)| ) ∈ R^{|S|},

where max(·), | · |, and ≥ are all elementwise operators. By definition, z ≥ 0. On the
one hand, it is easy to see that

    −z ≤ γPπ2∗ (v1 − v2) ≤ f(v1) − f(v2) ≤ γPπ1∗ (v1 − v2) ≤ z,

which implies

    |f(v1) − f(v2)| ≤ z.

It then follows that

    ‖f(v1) − f(v2)‖∞ ≤ ‖z‖∞,                                              (3.5)

where ‖ · ‖∞ is the maximum norm.


On the other hand, suppose that zi is the ith entry of z, and pTi and qiT are the
ith row of Pπ1∗ and Pπ2∗ , respectively. Then,

zi = max{γ|pTi (v1 − v2 )|, γ|qiT (v1 − v2 )|}.

Since pi is a vector with all nonnegative elements and the sum of the elements is
equal to one, it follows that

|pTi (v1 − v2 )| ≤ pTi |v1 − v2 | ≤ kv1 − v2 k∞ .

Similarly, we have |qiT (v1 − v2 )| ≤ kv1 − v2 k∞ . Therefore, zi ≤ γkv1 − v2 k∞ and hence

kzk∞ = max |zi | ≤ γkv1 − v2 k∞ .


i

Substituting this inequality to (3.5) gives

kf (v1 ) − f (v2 )k∞ ≤ γkv1 − v2 k∞ ,

which concludes the proof of the contraction property of f (v).


3.4 Solving an optimal policy from the BOE


With the preparation in the last section, we are ready to solve the BOE to obtain the
optimal state value v ∗ and an optimal policy π ∗ .

 Solving v ∗ : If v ∗ is a solution of the BOE, then it satisfies

    v∗ = max_{π∈Π} (rπ + γPπ v∗).

Clearly, v ∗ is a fixed point because v ∗ = f (v ∗ ). Then, the contraction mapping


theorem suggests the following results.

Theorem 3.3 (Existence, uniqueness, and algorithm). For the BOE v = f (v) =
maxπ∈Π (rπ + γPπ v), there always exists a unique solution v ∗ , which can be solved
iteratively by

    vk+1 = f(vk) = max_{π∈Π} (rπ + γPπ vk),    k = 0, 1, 2, . . . .

The value of vk converges to v ∗ exponentially fast as k → ∞ given any initial guess


v0 .

The proof of this theorem directly follows from the contraction mapping theorem since
f (v) is a contraction mapping. This theorem is important because it answers some
fundamental questions.

- Existence of v ∗ : The solution of the BOE always exists.


- Uniqueness of v ∗ : The solution v ∗ is always unique.
- Algorithm for solving v ∗ : The value of v ∗ can be solved by the iterative algorithm
suggested by Theorem 3.3. This iterative algorithm has a specific name called
value iteration. Its implementation will be introduced in detail in Chapter 4. We
mainly focus on the fundamental properties of the BOE in the present chapter.

 Solving π ∗ : Once the value of v ∗ has been obtained, we can easily obtain π ∗ by solving

    π∗ = arg max_{π∈Π} (rπ + γPπ v∗).                                     (3.6)

The value of π ∗ will be given in Theorem 3.5. Substituting (3.6) into the BOE yields

v ∗ = rπ∗ + γPπ∗ v ∗ .

Therefore, v ∗ = vπ∗ is the state value of π ∗ , and the BOE is a special Bellman equation
whose corresponding policy is π ∗ .
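As a preview of how the iterative algorithm looks in code (its implementation is the topic of
Chapter 4), below is a minimal value-iteration sketch for a generic finite MDP. The array layout is
an assumption of the sketch, not a prescription of the book: r stores the expected immediate
rewards r(s, a), and P stores the transition probabilities p(s'|s, a):

    import numpy as np

    def value_iteration(r, P, gamma, tol=1e-10):
        """r: (n_states, n_actions) expected immediate rewards;
        P: (n_states, n_actions, n_states) transition probabilities p(s'|s,a).
        Returns the optimal state values and a deterministic greedy policy."""
        n_states, n_actions = r.shape
        v = np.zeros(n_states)                       # arbitrary initial guess v0
        while True:
            q = r + gamma * P @ v                    # q(s,a) for all state-action pairs
            v_new = q.max(axis=1)                    # v_{k+1}(s) = max_a q(s,a)
            if np.max(np.abs(v_new - v)) < tol:
                break
            v = v_new
        policy = q.argmax(axis=1)                    # greedy policy, as in (3.7)
        return v_new, policy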


At this point, although we can solve v ∗ and π ∗ , it is still unclear whether the solution
is optimal. The following theorem reveals the optimality of the solution.

Theorem 3.4 (Optimality of v ∗ and π ∗ ). The solution v ∗ is the optimal state value, and
π ∗ is an optimal policy. That is, for any policy π, it holds that

v ∗ = vπ∗ ≥ vπ ,

where vπ is the state value of π, and ≥ is an elementwise comparison.

Now, it is clear why we must study the BOE: its solution corresponds to optimal state
values and optimal policies. The proof of the above theorem is given in the following box.

Box 3.3: Proof of Theorem 3.4


For any policy π, it holds that

vπ = rπ + γPπ vπ .

Since

    v∗ = max_π (rπ + γPπ v∗) = rπ∗ + γPπ∗ v∗ ≥ rπ + γPπ v∗,

we have

v ∗ − vπ ≥ (rπ + γPπ v ∗ ) − (rπ + γPπ vπ ) = γPπ (v ∗ − vπ ).

Repeatedly applying the above inequality gives v ∗ − vπ ≥ γPπ (v ∗ − vπ ) ≥ γ 2 Pπ2 (v ∗ −


vπ ) ≥ · · · ≥ γ n Pπn (v ∗ − vπ ). It follows that

    v∗ − vπ ≥ lim_{n→∞} γ^n Pπ^n (v∗ − vπ) = 0,

where the last equality is true because γ < 1 and Pπn is a nonnegative matrix with
all its elements less than or equal to 1 (because Pπn 1 = 1). Therefore, v ∗ ≥ vπ for
any π.

We next examine π ∗ in (3.6) more closely. In particular, the following theorem shows
that there always exists a deterministic greedy policy that is optimal.

Theorem 3.5 (Greedy optimal policy). For any s ∈ S, the deterministic greedy policy
    π∗(a|s) = 1 if a = a∗(s),   and   π∗(a|s) = 0 if a ≠ a∗(s),           (3.7)

is an optimal policy for solving the BOE. Here,

    a∗(s) = arg max_a q∗(s, a),

where

    q∗(s, a) ≐ Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) v∗(s').

Box 3.4: Proof of Theorem 3.5

While the matrix-vector form of the optimal policy is π∗ = arg max_π (rπ + γPπ v∗), its
elementwise form is

    π∗(s) = arg max_{π∈Π} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) v∗(s') ),   s ∈ S,

where the term in parentheses is q∗(s, a). It is clear that Σ_{a∈A} π(a|s) q∗(s, a) is maximized
if π(s) selects the action with the greatest q∗(s, a).

The policy in (3.7) is called greedy because it seeks the actions with the greatest
q∗(s, a). Finally, we discuss two important properties of π∗.

 Uniqueness of optimal policies: Although the value of v ∗ is unique, the optimal policy
that corresponds to v ∗ may not be unique. This can be easily verified by counterex-
amples. For example, the two policies shown in Figure 3.3 are both optimal.
 Stochasticity of optimal policies: An optimal policy can be either stochastic or de-
terministic, as demonstrated in Figure 3.3. However, it is certain that there always
exists a deterministic optimal policy according to Theorem 3.5.

Figure 3.3: Examples for demonstrating that optimal policies may not be unique. The two policies are
different but are both optimal.


3.5 Factors that influence optimal policies


The BOE is a powerful tool for analyzing optimal policies. We next apply the BOE to
study what factors can influence optimal policies. This question can be easily answered
by observing the elementwise expression of the BOE:
    v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a) r + γ Σ_{s'∈S} p(s'|s, a) v(s') ),   s ∈ S.

The optimal state value and optimal policy are determined by the following param-
eters: 1) the immediate reward r, 2) the discount rate γ, and 3) the system model
p(s'|s, a) and p(r|s, a). While the system model is fixed, we next discuss how the optimal
policy varies when we change the values of r and γ. All the optimal policies presented
in this section can be obtained via the algorithm in Theorem 3.3. The implementation
details of the algorithm will be given in Chapter 4. The present chapter mainly focuses
on the fundamental properties of optimal policies.

A baseline example

Consider the example in Figure 3.4. The reward settings are rboundary = rforbidden = −1
and rtarget = 1. In addition, the agent receives a reward of rother = 0 for every movement
step. The discount rate is selected as γ = 0.9.
With the above parameters, the optimal policy and optimal state values are given in
Figure 3.4(a). It is interesting that the agent is not afraid of passing through forbidden
areas to reach the target area. More specifically, starting from the state at (row=4,
column=1), the agent has two options for reaching the target area. The first option is to
avoid all the forbidden areas and travel a long distance to the target area. The second
option is to pass through forbidden areas. Although the agent obtains negative rewards
when entering forbidden areas, the cumulative reward of the second trajectory is greater
than that of the first trajectory. Therefore, the optimal policy is far-sighted due to the
relatively large value of γ.

Impact of the discount rate

If we change the discount rate from γ = 0.9 to γ = 0.5 and keep other parameters
unchanged, the optimal policy becomes the one shown in Figure 3.4(b). It is interesting
that the agent does not dare to take risks anymore. Instead, it would travel a long
distance to reach the target while avoiding all the forbidden areas. This is because the
optimal policy becomes short-sighted due to the relatively small value of γ.
In the extreme case where γ = 0, the corresponding optimal policy is shown in
Figure 3.4(c). In this case, the agent is not able to reach the target area. This is


Figure 3.4: The optimal policies and optimal state values given different parameter values.
(a) Baseline example: rboundary = rforbidden = −1, rtarget = 1, γ = 0.9.
(b) The discount rate is changed to γ = 0.5. The other parameters are the same as those in (a).
(c) The discount rate is changed to γ = 0. The other parameters are the same as those in (a).
(d) rforbidden is changed from −1 to −10. The other parameters are the same as those in (a).

61
3.5. Factors that influence optimal policies S. Zhao, 2023

because the optimal policy for each state is extremely short-sighted and merely selects
the action with the greatest immediate reward instead of the greatest total reward.
In addition, the spatial distribution of the state values exhibits an interesting pattern:
the states close to the target have greater state values, whereas those far away have lower
values. This pattern can be observed from all the examples shown in Figure 3.4. It can
be explained by the discount rate: if the agent must travel along a longer trajectory from a state to reach the target, the value of that state is smaller because the rewards are more heavily discounted.
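
To see the effect of the discount rate concretely, the following sketch compares the discounted returns of two hypothetical reward sequences from a state such as (row=4, column=1): a short path that passes through two forbidden cells before reaching the target, and a longer detour that avoids them. The path lengths and reward sequences are illustrative assumptions consistent with the reward setting above, not values taken from Figure 3.4.

    def discounted_return(rewards, gamma):
        """Sum of gamma^t * r_t over a finite reward sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Hypothetical reward sequences (illustrative assumptions):
    short_risky_path = [-1, -1, 1] + [1] * 20   # two forbidden cells, then stay in the target
    long_safe_path = [0] * 8 + [1] * 21         # eight zero-reward detour steps, then the target

    for gamma in (0.9, 0.5):
        g_short = discounted_return(short_risky_path, gamma)
        g_long = discounted_return(long_safe_path, gamma)
        print(f"gamma = {gamma}: risky path {g_short:.2f}, safe detour {g_long:.2f}")

Under γ = 0.9 the risky short path yields the larger return, whereas under γ = 0.5 the safe detour wins, which is consistent with the change of behavior from Figure 3.4(a) to Figure 3.4(b).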

Impact of the reward values

If we want to strictly prohibit the agent from entering any forbidden area, we can increase
the punishment received for doing so. For instance, if rforbidden is changed from −1 to
−10, the resulting optimal policy can avoid all the forbidden areas (see Figure 3.4(d)).
However, changing the rewards does not always lead to different optimal policies.
One important fact is that optimal policies are invariant to affine transformations of the
rewards. In other words, if we scale all the rewards or add the same value to all the
rewards, the optimal policy remains the same.

Theorem 3.6 (Optimal policy invariance). Consider a Markov decision process with v∗ ∈ R|S| as the optimal state value satisfying v∗ = maxπ∈Π(rπ + γPπ v∗). If every reward r ∈ R is changed by an affine transformation to αr + β, where α, β ∈ R and α > 0, then the corresponding optimal state value v′ is also an affine transformation of v∗:

v′ = αv∗ + (β/(1 − γ)) 1,    (3.8)

where γ ∈ (0, 1) is the discount rate and 1 = [1, . . . , 1]T. Consequently, the optimal policy derived from v′ is invariant to the affine transformation of the reward values.

Box 3.5: Proof of Theorem 3.6

For any policy π, define rπ = [. . . , rπ(s), . . . ]T, where

rπ(s) = Σa∈A π(a|s) Σr∈R p(r|s, a)r,   s ∈ S.

If r → αr + β, then rπ(s) → αrπ(s) + β and hence rπ → αrπ + β1, where 1 = [1, . . . , 1]T. In this case, the BOE becomes

v′ = maxπ∈Π(αrπ + β1 + γPπ v′).    (3.9)


We next solve the new BOE in (3.9) by showing that v′ = αv∗ + c1 with c = β/(1 − γ) is a solution of (3.9). In particular, substituting v′ = αv∗ + c1 into (3.9) gives

αv∗ + c1 = maxπ∈Π(αrπ + β1 + γPπ(αv∗ + c1)) = maxπ∈Π(αrπ + β1 + αγPπ v∗ + cγ1),

where the last equality is due to the fact that Pπ1 = 1. The above equation can be reorganized as

αv∗ = maxπ∈Π(αrπ + αγPπ v∗) + β1 + cγ1 − c1,

which is equivalent to

β1 + cγ1 − c1 = 0.

Since c = β/(1 − γ), the above equation is valid and hence v′ = αv∗ + c1 is the solution of (3.9). Since (3.9) is the BOE, v′ is also the unique solution. Finally, since v′ is an affine transformation of v∗ with α > 0, the relative relationships between the action values remain the same. Hence, the greedy optimal policy derived from v′ is the same as that derived from v∗: arg maxπ∈Π(rπ + γPπ v′) is the same as arg maxπ∈Π(rπ + γPπ v∗).

Readers may refer to [9] for a further discussion on the conditions under which mod-
ifications to the reward values preserve the optimal policy.
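
As a sanity check of Theorem 3.6, the short sketch below solves the BOE by value iteration for a small randomly generated MDP, applies an affine transformation to the rewards, and verifies relation (3.8). The two-state MDP, the array-based model representation, and the particular values of α and β are arbitrary assumptions made here for illustration.

    import numpy as np

    np.random.seed(0)
    n_states, n_actions, gamma = 2, 2, 0.9
    P = np.random.rand(n_states, n_actions, n_states)
    P /= P.sum(axis=2, keepdims=True)            # P[s, a, s'] = p(s'|s, a)
    R = np.random.randn(n_states, n_actions)     # R[s, a] = expected immediate reward

    def solve_boe(R, P, gamma, iters=2000):
        """Iterate v <- max_a [R(s, a) + gamma * sum_s' P(s, a, s') v(s')]."""
        v = np.zeros(R.shape[0])
        for _ in range(iters):
            q = R + gamma * P @ v                # shape (n_states, n_actions)
            v = q.max(axis=1)
        return v, q.argmax(axis=1)               # optimal value and a greedy optimal policy

    alpha, beta = 2.0, -5.0                      # the affine transformation r -> alpha * r + beta
    v_star, pi_star = solve_boe(R, P, gamma)
    v_new, pi_new = solve_boe(alpha * R + beta, P, gamma)

    print(np.allclose(v_new, alpha * v_star + beta / (1 - gamma)))  # True, i.e., relation (3.8) holds
    print(np.array_equal(pi_star, pi_new))       # True (barring ties): the greedy policy is unchanged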

Avoiding meaningless detours

In the reward setting, the agent receives a reward of rother = 0 for every movement
step (unless it enters a forbidden area or the target area or attempts to go beyond the
boundary). Since a zero reward is not a punishment, would the optimal policy take
meaningless detours before reaching the target? Should we set rother to be negative to
encourage the agent to reach the target as quickly as possible?

Figure 3.5: Examples illustrating that optimal policies do not take meaningless detours due to the discount rate. (a) Optimal policy, with state values 9.0 and 10.0 in the top row and 10.0 and 10.0 in the bottom row. (b) Non-optimal policy, with state values 9.0 and 8.1 in the top row and 10.0 and 10.0 in the bottom row.

Consider the examples in Figure 3.5, where the bottom-right cell is the target area
to reach. The two policies here are the same except for state s2 . By the policy in
Figure 3.5(a), the agent moves downward at s2 and the resulting trajectory is s2 → s4 .
By the policy in Figure 3.5(b), the agent moves leftward and the resulting trajectory is
s2 → s1 → s3 → s4 .
It is notable that the second policy takes a detour before reaching the target area. If
we merely consider the immediate rewards, taking this detour does not matter because
no negative immediate rewards will be obtained. However, if we consider the discounted
return, then this detour matters. In particular, for the first policy, the discounted return
is
return = 1 + γ · 1 + γ^2 · 1 + · · · = 1/(1 − γ) = 10.

As a comparison, the discounted return for the second policy is

return = 0 + γ · 0 + γ^2 · 1 + γ^3 · 1 + · · · = γ^2/(1 − γ) = 8.1.

It is clear that the shorter the trajectory to the target is, the greater the discounted return is. Therefore, although
the immediate reward of every step does not encourage the agent to approach the target
as quickly as possible, the discount rate does encourage it to do so.
A misunderstanding that beginners may have is that adding a negative reward (e.g.,
−1) on top of the rewards obtained for every movement is necessary to encourage the
agent to reach the target as quickly as possible. This is a misunderstanding because
adding the same reward on top of all rewards is an affine transformation, which preserves
the optimal policy. Moreover, optimal policies do not take meaningless detours due to
the discount rate, even though a detour may not receive any immediate negative rewards.

3.6 Summary
The core concepts in this chapter include optimal policies and optimal state values. In
particular, a policy is optimal if its state values are greater than or equal to those of any
other policy. The state values of an optimal policy are the optimal state values. The BOE
is the core tool for analyzing optimal policies and optimal state values. This equation
is a nonlinear equation with a nice contraction property. We can apply the contraction
mapping theorem to analyze this equation. It was shown that the solutions of the BOE
correspond to the optimal state value and optimal policy. This is the reason why we need
to study the BOE.
The contents of this chapter are important for thoroughly understanding many funda-
mental ideas of reinforcement learning. For example, Theorem 3.3 suggests an iterative
algorithm for solving the BOE. This algorithm is exactly the value iteration algorithm
that will be introduced in Chapter 4. A further discussion about the BOE can be found
in [2].


3.7 Q&A
 Q: What is the definition of optimal policies?
A: A policy is optimal if its corresponding state values are greater than or equal to those of any other policy.
It should be noted that this specific definition of optimality is valid only for tabular
reinforcement learning algorithms. When the values or policies are approximated by
functions, different metrics must be used to define optimal policies. This will become
clearer in Chapters 8 and 9.
 Q: Why is the Bellman optimality equation important?
A: It is important because it characterizes both optimal policies and optimal state
values. Solving this equation yields an optimal policy and the corresponding optimal
state value.
 Q: Is the Bellman optimality equation a Bellman equation?
A: Yes. The Bellman optimality equation is a special Bellman equation whose corre-
sponding policy is optimal.
 Q: Is the solution of the Bellman optimality equation unique?
A: The Bellman optimality equation has two unknown variables. The first unknown
variable is a value, and the second is a policy. The value solution, which is the optimal
state value, is unique. The policy solution, which is an optimal policy, may not be
unique.
 Q: What is the key property of the Bellman optimality equation for analyzing its
solution?
A: The key property is that the right-hand side of the Bellman optimality equation is
a contraction mapping. As a result, we can apply the contraction mapping theorem
to analyze its solution.
 Q: Do optimal policies exist?
A: Yes. Optimal policies always exist according to the analysis of the BOE.
 Q: Are optimal policies unique?
A: Not necessarily. There may exist multiple or even infinitely many optimal policies that have the same optimal state values.
 Q: Are optimal policies stochastic or deterministic?
A: An optimal policy can be either deterministic or stochastic. A nice fact is that
there always exist deterministic greedy optimal policies.


 Q: How to obtain an optimal policy?


A: Solving the BOE using the iterative algorithm suggested by Theorem 3.3 yields an
optimal policy. The detailed implementation of this iterative algorithm will be given
in Chapter 4. Notably, all the reinforcement learning algorithms introduced in this
book aim to obtain optimal policies under different settings.
 Q: What is the general impact on the optimal policies if we reduce the value of the
discount rate?
A: The optimal policy becomes more short-sighted when we reduce the discount rate.
That is, the agent does not dare to take risks even though it may obtain greater
cumulative rewards afterward.
 Q: What happens if we set the discount rate to zero?
A: The resulting optimal policy would become extremely short-sighted. The agent
would take the action with the greatest immediate reward, even though that action
is not good in the long run.
 Q: If we increase all the rewards by the same amount, will the optimal state value
change? Will the optimal policy change?
A: Increasing all the rewards by the same amount is an affine transformation of the
rewards, which would not affect the optimal policies. However, the optimal state value
would increase, as shown in (3.8).
 Q: If we hope that the optimal policy can avoid meaningless detours before reaching
the target, should we add a negative reward to every step so that the agent reaches
the target as quickly as possible?
A: First, introducing an additional negative reward to every step is an affine transfor-
mation of the rewards, which does not change the optimal policy. Second, the discount
rate can automatically encourage the agent to reach the target as quickly as possible.
This is because meaningless detours would increase the trajectory length and reduce
the discounted return.

Chapter 4

Value Iteration and Policy Iteration

Figure 4.1: Where we are in this book.

With the preparation in the previous chapters, we are now ready to present the first
algorithms that can find optimal policies. This chapter introduces three algorithms that
are closely related to each other. The first is the value iteration algorithm, which is
exactly the algorithm suggested by the contraction mapping theorem for solving the
Bellman optimality equation as discussed in the last chapter. We focus more on the
implementation details of this algorithm in the present chapter. The second is the policy
iteration algorithm, whose idea is widely used in reinforcement learning algorithms. The
third is the truncated policy iteration algorithm, which is a unified algorithm that includes
the value iteration and policy iteration algorithms as special cases.


The algorithms introduced in this chapter are called dynamic programming algorithms
[10, 11], which require the system model. These algorithms are important foundations of
the model-free reinforcement learning algorithms introduced in the subsequent chapters.
For example, the Monte Carlo algorithms introduced in Chapter 5 can be immediately
obtained by extending the policy iteration algorithm introduced in this chapter.

4.1 Value iteration


This section introduces the value iteration algorithm. It is exactly the algorithm suggested
by the contraction mapping theorem for solving the Bellman optimality equation, as
introduced in the last chapter (Theorem 3.3). In particular, the algorithm is

vk+1 = maxπ∈Π(rπ + γPπ vk),   k = 0, 1, 2, . . .

It is guaranteed by Theorem 3.3 that vk and πk converge to the optimal state value and
an optimal policy as k → ∞, respectively.
This algorithm is iterative and has two steps in every iteration.

 The first step in every iteration is a policy update step. Mathematically, it aims to
find a policy that can solve the following optimization problem:

πk+1 = arg maxπ(rπ + γPπ vk),

where vk is obtained in the previous iteration.


 The second step is called a value update step. Mathematically, it calculates a new
value vk+1 by

vk+1 = rπk+1 + γPπk+1 vk , (4.1)

where vk+1 will be used in the next iteration.

The value iteration algorithm introduced above is in a matrix-vector form. To implement this algorithm, we need to further examine its elementwise form. While the
matrix-vector form is useful for understanding the core idea of the algorithm, the ele-
mentwise form is necessary for explaining the implementation details.

4.1.1 Elementwise form and implementation


Consider the time step k and a state s.


 First, the elementwise form of the policy update step πk+1 = arg maxπ (rπ + γPπ vk ) is
πk+1(s) = arg maxπ Σa π(a|s) ( Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vk(s′) ),   s ∈ S,

where the term in the parentheses is denoted as qk(s, a).

We showed in Section 3.3.1 that the optimal policy that can solve the above optimization problem is

πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 if a ≠ a∗k(s),    (4.2)

where a∗k(s) = arg maxa qk(s, a). If a∗k(s) = arg maxa qk(s, a) has multiple solutions, we can select any of them without affecting the convergence of the algorithm. Since the new policy πk+1 selects the action with the greatest qk(s, a), such a policy is called greedy.
 Second, the elementwise form of the value update step vk+1 = rπk+1 + γPπk+1 vk is
vk+1(s) = Σa πk+1(a|s) ( Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vk(s′) ),   s ∈ S,

where the term in the parentheses is again qk(s, a).

Substituting (4.2) into the above equation gives

vk+1(s) = maxa qk(s, a).

In summary, the above steps can be illustrated as

vk(s) → qk(s, a) → new greedy policy πk+1(s) → new value vk+1(s) = maxa qk(s, a).

The implementation details are summarized in Algorithm 4.1.
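
As a complement to the pseudocode in Algorithm 4.1 below, here is a minimal Python sketch of the elementwise procedure. It assumes that the model is stored as NumPy arrays P[s, a, s′] = p(s′|s, a) and R[s, a] = Σr p(r|s, a)r; these array names and the stopping threshold are implementation assumptions, not notation from this book.

    import numpy as np

    def value_iteration(P, R, gamma=0.9, tol=1e-6):
        """P: (n_states, n_actions, n_states) transition probabilities p(s'|s, a).
           R: (n_states, n_actions) expected immediate rewards.
           Returns an approximation of the optimal state value and a greedy optimal policy."""
        v = np.zeros(R.shape[0])                # initial guess v_0
        while True:
            q = R + gamma * P @ v               # q_k(s, a) for all (s, a)
            v_new = q.max(axis=1)               # value update: v_{k+1}(s) = max_a q_k(s, a)
            if np.max(np.abs(v_new - v)) < tol: # stop when ||v_{k+1} - v_k|| is small
                return v_new, q.argmax(axis=1)  # policy update: greedy with respect to q
            v = v_new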


One problem that may be confusing is whether vk in (4.1) is a state value. The
answer is no. Although vk eventually converges to the optimal state value, it is not
ensured to satisfy the Bellman equation of any policy. For example, it does not satisfy
vk = rπk+1 + γPπk+1 vk or vk = rπk + γPπk vk in general. It is merely an intermediate value
generated by the algorithm. In addition, since vk is not a state value, qk is not an action
value.

4.1.2 Illustrative examples


We next present an example to illustrate the step-by-step implementation of the value
iteration algorithm. This example is a two-by-two grid with one forbidden area (Fig-


Algorithm 4.1: Value iteration algorithm

Initialization: The probability models p(r|s, a) and p(s′|s, a) for all (s, a) are known. Initial guess v0.
Goal: Search for the optimal state value and an optimal policy for solving the Bellman optimality equation.

While vk has not converged in the sense that ‖vk − vk−1‖ is greater than a predefined small threshold, for the kth iteration, do
    For every state s ∈ S, do
        For every action a ∈ A(s), do
            q-value: qk(s, a) = Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vk(s′)
        Maximum action value: a∗k(s) = arg maxa qk(s, a)
        Policy update: πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 otherwise
        Value update: vk+1(s) = maxa qk(s, a)

q-table | a1 | a2 | a3 | a4 | a5
s1 | −1 + γv(s1) | −1 + γv(s2) | 0 + γv(s3) | −1 + γv(s1) | 0 + γv(s1)
s2 | −1 + γv(s2) | −1 + γv(s2) | 1 + γv(s4) | 0 + γv(s1) | −1 + γv(s2)
s3 | 0 + γv(s1) | 1 + γv(s4) | −1 + γv(s3) | −1 + γv(s3) | 0 + γv(s3)
s4 | −1 + γv(s2) | −1 + γv(s4) | −1 + γv(s4) | 0 + γv(s3) | 1 + γv(s4)
Table 4.1: The expression of q(s, a) for the example as shown in Figure 4.2.

ure 4.2). The target area is s4 . The reward settings are rboundary = rforbidden = −1 and
rtarget = 1. The discount rate is γ = 0.9.

Figure 4.2: An example for demonstrating the implementation of the value iteration algorithm. The 2×2 grid consists of states s1, s2 (top row) and s3, s4 (bottom row); the middle and right subfigures visualize the policies π1 and π2 obtained below.

The expression of the q-value for each state-action pair is shown in Table 4.1.

 k = 0:
Without loss of generality, select the initial values as v0 (s1 ) = v0 (s2 ) = v0 (s3 ) =
v0 (s4 ) = 0.
q-value calculation: Substituting v0 (si ) into Table 4.1 gives the q-values shown in
Table 4.2.


q-table | a1 | a2 | a3 | a4 | a5
s1 | −1 | −1 | 0 | −1 | 0
s2 | −1 | −1 | 1 | 0 | −1
s3 | 0 | 1 | −1 | −1 | 0
s4 | −1 | −1 | −1 | 0 | 1
Table 4.2: The value of q(s, a) at k = 0.

q-table | a1 | a2 | a3 | a4 | a5
s1 | −1 + γ0 | −1 + γ1 | 0 + γ1 | −1 + γ0 | 0 + γ0
s2 | −1 + γ1 | −1 + γ1 | 1 + γ1 | 0 + γ0 | −1 + γ1
s3 | 0 + γ0 | 1 + γ1 | −1 + γ1 | −1 + γ1 | 0 + γ1
s4 | −1 + γ1 | −1 + γ1 | −1 + γ1 | 0 + γ1 | 1 + γ1
Table 4.3: The value of q(s, a) at k = 1.

Policy update: π1 is obtained by selecting the actions with the greatest q-values for
every state:

π1 (a5 |s1 ) = 1, π1 (a3 |s2 ) = 1, π1 (a2 |s3 ) = 1, π1 (a5 |s4 ) = 1.

This policy is visualized in Figure 4.2 (the middle subfigure). It is clear that this policy
is not optimal because it selects to stay unchanged at s1 . Notably, the q-values for
(s1 , a5 ) and (s1 , a3 ) are actually the same, and we can randomly select either action.
Value update: v1 is obtained by updating the v-value to the greatest q-value for each
state:
v1 (s1 ) = 0, v1 (s2 ) = 1, v1 (s3 ) = 1, v1 (s4 ) = 1.

 k = 1:
q-value calculation: Substituting v1 (si ) into Table 4.1 yields the q-values shown in
Table 4.3.
Policy update: π2 is obtained by selecting the greatest q-values:

π2 (a3 |s1 ) = 1, π2 (a3 |s2 ) = 1, π2 (a2 |s3 ) = 1, π2 (a5 |s4 ) = 1.

This policy is visualized in Figure 4.2 (the right subfigure).


Value update: v2 is obtained by updating the v-value to the greatest q-value for each
state:

v2 (s1 ) = γ1, v2 (s2 ) = 1 + γ1, v2 (s3 ) = 1 + γ1, v2 (s4 ) = 1 + γ1.

 k = 2, 3, 4, . . .

It is notable that policy π2 , as illustrated in Figure 4.2(c), is already optimal. Therefore,


we only need to run two iterations to obtain an optimal policy in this simple example. For
more complex examples, we need to run more iterations until the value of vk converges
(e.g., until ‖vk+1 − vk‖ is smaller than a pre-specified threshold).
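
The two-by-two example above can be reproduced with a short script. The sketch below transcribes the deterministic model of Table 4.1 into a Python dictionary (a representation chosen here for brevity) and runs the first two iterations; the printed policies and values match π1, π2, v1, and v2 described above, except that the tie between a3 and a5 at s1 in the first iteration may be broken differently (as noted above, either choice is fine).

    gamma = 0.9
    states = ["s1", "s2", "s3", "s4"]
    actions = ["a1", "a2", "a3", "a4", "a5"]
    # Deterministic model transcribed from Table 4.1: (next state, reward) for each (state, action).
    model = {
        ("s1", "a1"): ("s1", -1), ("s1", "a2"): ("s2", -1), ("s1", "a3"): ("s3", 0),
        ("s1", "a4"): ("s1", -1), ("s1", "a5"): ("s1", 0),
        ("s2", "a1"): ("s2", -1), ("s2", "a2"): ("s2", -1), ("s2", "a3"): ("s4", 1),
        ("s2", "a4"): ("s1", 0),  ("s2", "a5"): ("s2", -1),
        ("s3", "a1"): ("s1", 0),  ("s3", "a2"): ("s4", 1),  ("s3", "a3"): ("s3", -1),
        ("s3", "a4"): ("s3", -1), ("s3", "a5"): ("s3", 0),
        ("s4", "a1"): ("s2", -1), ("s4", "a2"): ("s4", -1), ("s4", "a3"): ("s4", -1),
        ("s4", "a4"): ("s3", 0),  ("s4", "a5"): ("s4", 1),
    }

    v = {s: 0.0 for s in states}                      # v_0 = 0
    for k in range(2):                                # reproduce the iterations k = 0 and k = 1
        q = {(s, a): model[(s, a)][1] + gamma * v[model[(s, a)][0]]
             for s in states for a in actions}
        policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
        v = {s: max(q[(s, a)] for a in actions) for s in states}
        print(f"pi_{k + 1} = {policy}")
        print(f"v_{k + 1} = {v}")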

4.2 Policy iteration


This section presents another important algorithm: policy iteration. Unlike value itera-
tion, policy iteration is not for directly solving the Bellman optimality equation. However,
it has an intimate relationship with value iteration, as shown later. Moreover, the idea
of policy iteration is very important since it is widely utilized in reinforcement learning
algorithms.

4.2.1 Algorithm analysis


Policy iteration is an iterative algorithm. Each iteration has two steps.

 The first is a policy evaluation step. As its name suggests, this step evaluates a given
policy by calculating the corresponding state value. That is to solve the following
Bellman equation:

vπk = rπk + γPπk vπk , (4.3)

where πk is the policy obtained in the last iteration and vπk is the state value to be
calculated. The values of rπk and Pπk can be obtained from the system model.
 The second is a policy improvement step. As its name suggests, this step is used to
improve the policy. In particular, once vπk has been calculated in the first step, a new
policy πk+1 can be obtained as

πk+1 = arg maxπ(rπ + γPπ vπk).

Three questions naturally follow the above description of the algorithm.

 In the policy evaluation step, how to solve the state value vπk ?
 In the policy improvement step, why is the new policy πk+1 better than πk ?
 Why can this algorithm finally converge to an optimal policy?

We next answer these questions one by one.

In the policy evaluation step, how to calculate vπk ?

We introduced two methods in Chapter 2 for solving the Bellman equation in (4.3).
We next revisit the two methods briefly. The first method is a closed-form solution:


vπk = (I−γPπk )−1 rπk . This closed-form solution is useful for theoretical analysis purposes,
but it is inefficient to implement since it requires other numerical algorithms to compute
the matrix inverse. The second method is an iterative algorithm that can be easily
implemented:

vπk(j+1) = rπk + γPπk vπk(j),   j = 0, 1, 2, . . .    (4.4)

where vπk(j) denotes the jth estimate of vπk. Starting from any initial guess vπk(0), it is ensured that vπk(j) → vπk as j → ∞. Details can be found in Section 2.7.
Interestingly, policy iteration is an iterative algorithm with another iterative algorithm
(4.4) embedded in the policy evaluation step. In theory, this embedded iterative algorithm
requires an infinite number of steps (that is, j → ∞) to converge to the true state value
vπk . This is, however, impossible to realize. In practice, the iterative process terminates
when a certain criterion is satisfied. For example, the termination criterion can be that
‖vπk(j+1) − vπk(j)‖ is less than a prespecified threshold or that j exceeds a prespecified value.
If we do not run an infinite number of iterations, we can only obtain an imprecise value
of vπk , which will be used in the subsequent policy improvement step. Would this cause
problems? The answer is no. The reason will become clear when we introduce the
truncated policy iteration algorithm later in Section 4.3.

In the policy improvement step, why is πk+1 better than πk ?

The policy improvement step can improve the given policy, as shown below.

Lemma 4.1 (Policy improvement). If πk+1 = arg maxπ (rπ + γPπ vπk ), then vπk+1 ≥ vπk .

Here, vπk+1 ≥ vπk means that vπk+1 (s) ≥ vπk (s) for all s. The proof of this lemma is
given in Box 4.1.

Box 4.1: Proof of Lemma 4.1


Since vπk+1 and vπk are state values, they satisfy the Bellman equations:

vπk+1 = rπk+1 + γPπk+1 vπk+1 ,


vπk = rπk + γPπk vπk .

Since πk+1 = arg maxπ (rπ + γPπ vπk ), we know that

rπk+1 + γPπk+1 vπk ≥ rπk + γPπk vπk .


It then follows that

vπk − vπk+1 = (rπk + γPπk vπk) − (rπk+1 + γPπk+1 vπk+1)
            ≤ (rπk+1 + γPπk+1 vπk) − (rπk+1 + γPπk+1 vπk+1)
            = γPπk+1(vπk − vπk+1).

Applying this inequality repeatedly gives

vπk − vπk+1 ≤ γPπk+1(vπk − vπk+1) ≤ γ^2 Pπk+1^2 (vπk − vπk+1) ≤ · · · ≤ limn→∞ γ^n Pπk+1^n (vπk − vπk+1) = 0.

The limit is due to the facts that γ^n → 0 as n → ∞ and Pπk+1^n is a nonnegative stochastic matrix for any n. Here, a stochastic matrix refers to a nonnegative matrix whose row sums are all equal to one. Therefore, vπk − vπk+1 ≤ 0, which means vπk+1 ≥ vπk.

Why can the policy iteration algorithm eventually find an optimal policy?

The policy iteration algorithm generates two sequences. The first is a sequence of policies:
{π0 , π1 , . . . , πk , . . . }. The second is a sequence of state values: {vπ0 , vπ1 , . . . , vπk , . . . }.
Suppose that v ∗ is the optimal state value. Then, vπk ≤ v ∗ for all k. Since the policies
are continuously improved according to Lemma 4.1, we know that

vπ0 ≤ vπ1 ≤ vπ2 ≤ · · · ≤ vπk ≤ · · · ≤ v ∗ .

Since vπk is nondecreasing and always bounded from above by v ∗ , it follows from the
monotone convergence theorem [12] (Appendix C) that vπk converges to a constant value,
denoted as v∞ , when k → ∞. The following analysis shows that v∞ = v ∗ .

Theorem 4.1 (Convergence of policy iteration). The state value sequence {vπk}∞k=0 generated by the policy iteration algorithm converges to the optimal state value v∗. As a result, the policy sequence {πk}∞k=0 converges to an optimal policy.

The proof of this theorem is given in Box 4.2. The proof not only shows the conver-
gence of the policy iteration algorithm but also reveals the relationship between the policy
iteration and value iteration algorithms. Loosely speaking, if both algorithms start from
the same initial guess, policy iteration will converge faster than value iteration due to
the additional iterations embedded in the policy evaluation step. This point will become
clearer when we introduce the truncated policy iteration algorithm in Section 4.3.


Box 4.2: Proof of Theorem 4.1


The idea of the proof is to show that the policy iteration algorithm converges faster
than the value iteration algorithm.
In particular, to prove the convergence of {vπk}∞k=0, we introduce another sequence {vk}∞k=0 generated by

vk+1 = f(vk) = maxπ(rπ + γPπ vk).

This iterative algorithm is exactly the value iteration algorithm. We already know
that vk converges to v ∗ when given any initial value v0 .
For k = 0, we can always find a v0 such that vπ0 ≥ v0 for any π0 .
We next show that vk ≤ vπk ≤ v ∗ for all k by induction.
For k ≥ 0, suppose that vπk ≥ vk .
For k + 1, we have

vπk+1 − vk+1 = (rπk+1 + γPπk+1 vπk+1) − maxπ(rπ + γPπ vk)
             ≥ (rπk+1 + γPπk+1 vπk) − maxπ(rπ + γPπ vk)      (because vπk+1 ≥ vπk by Lemma 4.1 and Pπk+1 ≥ 0)
             = (rπk+1 + γPπk+1 vπk) − (rπk′ + γPπk′ vk)      (suppose πk′ = arg maxπ(rπ + γPπ vk))
             ≥ (rπk′ + γPπk′ vπk) − (rπk′ + γPπk′ vk)        (because πk+1 = arg maxπ(rπ + γPπ vπk))
             = γPπk′(vπk − vk).

Since vπk − vk ≥ 0 and Pπk′ is nonnegative, we have Pπk′(vπk − vk) ≥ 0 and hence
vπk+1 − vk+1 ≥ 0.
Therefore, we can show by induction that vk ≤ vπk ≤ v ∗ for any k ≥ 0. Since vk
converges to v ∗ , vπk also converges to v ∗ .

4.2.2 Elementwise form and implementation


To implement the policy iteration algorithm, we need to study its elementwise form.

 First, the policy evaluation step solves vπk from vπk = rπk + γPπk vπk by using the


Algorithm 4.2: Policy iteration algorithm

Initialization: The system model, p(r|s, a) and p(s′|s, a) for all (s, a), is known. Initial guess π0.
Goal: Search for the optimal state value and an optimal policy.

While vπk has not converged, for the kth iteration, do
    Policy evaluation:
        Initialization: an arbitrary initial guess vπk(0)
        While vπk(j) has not converged, for the jth iteration, do
            For every state s ∈ S, do
                vπk(j+1)(s) = Σa πk(a|s) [ Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vπk(j)(s′) ]
    Policy improvement:
        For every state s ∈ S, do
            For every action a ∈ A, do
                qπk(s, a) = Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vπk(s′)
            a∗k(s) = arg maxa qπk(s, a)
            πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 otherwise

iterative algorithm in (4.4). The elementwise form of this algorithm is

vπk(j+1)(s) = Σa πk(a|s) ( Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vπk(j)(s′) ),   s ∈ S,

where j = 0, 1, 2, . . . .
 Second, the policy improvement step solves πk+1 = arg maxπ (rπ + γPπ vπk ). The
elementwise form of this equation is
πk+1(s) = arg maxπ Σa π(a|s) ( Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vπk(s′) ),   s ∈ S,

where the term in the parentheses is qπk(s, a), the action value under policy πk. Let a∗k(s) = arg maxa qπk(s, a). Then, the greedy optimal policy is

πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 if a ≠ a∗k(s).

The implementation details are summarized in Algorithm 4.2.
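
For concreteness, here is a minimal Python sketch of Algorithm 4.2 under the same array-based model representation assumed for the value iteration sketch (P[s, a, s′] and R[s, a]); the deterministic initial policy and the evaluation tolerance are arbitrary choices.

    import numpy as np

    def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8):
        n_states, _ = R.shape
        idx = np.arange(n_states)
        policy = np.zeros(n_states, dtype=int)       # an arbitrary deterministic initial policy pi_0
        while True:
            # Policy evaluation: iterate v <- r_pi + gamma * P_pi v until it (nearly) converges.
            v = np.zeros(n_states)
            while True:
                v_new = R[idx, policy] + gamma * P[idx, policy] @ v
                if np.max(np.abs(v_new - v)) < eval_tol:
                    break
                v = v_new
            # Policy improvement: take the greedy policy with respect to q_pik(s, a).
            q = R + gamma * P @ v_new
            new_policy = q.argmax(axis=1)
            if np.array_equal(new_policy, policy):   # the policy has stabilized, so it is optimal
                return v_new, policy
            policy = new_policy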


4.2.3 Illustrative examples


A simple example

Consider a simple example shown in Figure 4.3. There are two states with three possible
actions: A = {aℓ, a0, ar}. The three actions represent moving leftward, staying un-
changed, and moving rightward. The reward settings are rboundary = −1 and rtarget = 1.
The discount rate is γ = 0.9.

Figure 4.3: An example for illustrating the implementation of the policy iteration algorithm. Both subfigures show the two states s1 and s2: (a) the initial policy, and (b) the improved policy obtained below.

We next present the implementation of the policy iteration algorithm in a step-by-step manner. When k = 0, we start with the initial policy shown in Figure 4.3(a). This policy
is not good because it does not move toward the target area. We next show how to apply
the policy iteration algorithm to obtain an optimal policy.

 First, in the policy evaluation step, we need to solve the Bellman equation:

vπ0 (s1 ) = −1 + γvπ0 (s1 ),


vπ0 (s2 ) = 0 + γvπ0 (s1 ).

Since the equation is simple, it can be manually solved that

vπ0 (s1 ) = −10, vπ0 (s2 ) = −9.

In practice, the equation can be solved by the iterative algorithm in (4.4). For example, select the initial state values as vπ0(0)(s1) = vπ0(0)(s2) = 0. It follows from (4.3) that
vπ0(1)(s1) = −1 + γvπ0(0)(s1) = −1,
vπ0(1)(s2) = 0 + γvπ0(0)(s1) = 0,

vπ0(2)(s1) = −1 + γvπ0(1)(s1) = −1.9,
vπ0(2)(s2) = 0 + γvπ0(1)(s1) = −0.9,

vπ0(3)(s1) = −1 + γvπ0(2)(s1) = −2.71,
vπ0(3)(s2) = 0 + γvπ0(2)(s1) = −1.71,
. . .

With more iterations, we can see the trend: vπ0(j)(s1) → vπ0(s1) = −10 and vπ0(j)(s2) → vπ0(s2) = −9 as j increases.


 Second, in the policy improvement step, the key is to calculate qπ0 (s, a) for each
state-action pair. The following q-table can be used to demonstrate such a process:

qπk(s, a) | aℓ | a0 | ar
s1 | −1 + γvπk(s1) | 0 + γvπk(s1) | 1 + γvπk(s2)
s2 | 0 + γvπk(s1) | 1 + γvπk(s2) | −1 + γvπk(s2)
Table 4.4: The expression of qπk(s, a) for the example in Figure 4.3.

Substituting vπ0 (s1 ) = −10, vπ0 (s2 ) = −9 obtained in the previous policy evaluation
step into Table 4.4 yields Table 4.5.

qπ0(s, a) | aℓ | a0 | ar
s1 | −10 | −9 | −7.1
s2 | −9 | −7.1 | −9.1
Table 4.5: The value of qπk(s, a) when k = 0.

By seeking the greatest value of qπ0 , the improved policy π1 can be obtained as

π1 (ar |s1 ) = 1, π1 (a0 |s2 ) = 1.

This policy is illustrated in Figure 4.3(b). It is clear that this policy is optimal.

The above process shows that a single iteration is sufficient for finding the optimal
policy in this simple example. More iterations are required for more complex examples.
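
The two-state example can also be checked with a few lines of code. The sketch below encodes the deterministic transitions and rewards read off from Table 4.4 as a dictionary (the names a_left, a_stay, and a_right stand for aℓ, a0, and ar) and performs one round of policy evaluation and policy improvement.

    gamma = 0.9
    actions = ["a_left", "a_stay", "a_right"]          # correspond to a_l, a_0, a_r
    # Deterministic model read off from Table 4.4: (next state, reward) for each (state, action).
    model = {
        ("s1", "a_left"): ("s1", -1), ("s1", "a_stay"): ("s1", 0), ("s1", "a_right"): ("s2", 1),
        ("s2", "a_left"): ("s1",  0), ("s2", "a_stay"): ("s2", 1), ("s2", "a_right"): ("s2", -1),
    }
    policy = {"s1": "a_left", "s2": "a_left"}          # the initial policy pi_0 of Figure 4.3(a)

    # Policy evaluation: iterate v <- r_pi + gamma * v(next state), as in (4.4).
    v = {"s1": 0.0, "s2": 0.0}
    for _ in range(1000):
        v = {s: model[(s, policy[s])][1] + gamma * v[model[(s, policy[s])][0]] for s in v}
    print(v)        # approximately {'s1': -10.0, 's2': -9.0}

    # Policy improvement: for every state, pick the action with the greatest q_pi0(s, a).
    q = {(s, a): model[(s, a)][1] + gamma * v[model[(s, a)][0]] for s in v for a in actions}
    policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in v}
    print(policy)   # {'s1': 'a_right', 's2': 'a_stay'}, i.e., the improved policy of Figure 4.3(b)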

A more complicated example

We next demonstrate the policy iteration algorithm using a more complicated example
shown in Figure 4.4. The reward settings are rboundary = −1, rforbidden = −10, and rtarget =
1. The discount rate is γ = 0.9. The policy iteration algorithm can converge to the
optimal policy (Figure 4.4(h)) when starting from a random initial policy (Figure 4.4(a)).
Two interesting phenomena are observed during the iteration process.

 First, if we observe how the policy evolves, an interesting pattern is that the states
that are close to the target area find the optimal policies earlier than those far away.
Only if the close states can find trajectories to the target first, can the farther states
find trajectories passing through the close states to reach the target.
 Second, the spatial distribution of the state values exhibits an interesting pattern: the
states that are located closer to the target have greater state values. The reason for
this pattern is that an agent starting from a farther state must travel for many steps
to obtain a positive reward. Such rewards would be severely discounted and hence
relatively small.


Figure 4.4: The evolution processes of the policies generated by the policy iteration algorithm. The subfigures show (a) π0 and vπ0, (b) π1 and vπ1, (c) π2 and vπ2, (d) π3 and vπ3, (e) π4 and vπ4, (f) π5 and vπ5, (g) π9 and vπ9, and (h) π10 and vπ10.


4.3 Truncated policy iteration


We next introduce a more general algorithm called truncated policy iteration. We will
see that the value iteration and policy iteration algorithms are two special cases of the
truncated policy iteration algorithm.

4.3.1 Comparing value iteration and policy iteration


First of all, we compare the value iteration and policy iteration algorithms by listing their
steps as follows.

 Policy iteration: Select an arbitrary initial policy π0 . In the kth iteration, do the
following two steps.

- Step 1: Policy evaluation (PE). Given πk , solve vπk from

vπk = rπk + γPπk vπk .

- Step 2: Policy improvement (PI). Given vπk , solve πk+1 from

πk+1 = arg maxπ(rπ + γPπ vπk).

 Value iteration: Select an arbitrary initial value v0 . In the kth iteration, do the
following two steps.

- Step 1: Policy update (PU). Given vk , solve πk+1 from

πk+1 = arg maxπ(rπ + γPπ vk).

- Step 2: Value update (VU). Given πk+1 , solve vk+1 from

vk+1 = rπk+1 + γPπk+1 vk .

The above steps of the two algorithms can be illustrated as

Policy iteration:  π0 −PE→ vπ0 −PI→ π1 −PE→ vπ1 −PI→ π2 −PE→ vπ2 −PI→ · · ·
Value iteration:   v0 −PU→ π1′ −VU→ v1 −PU→ π2′ −VU→ v2 −PU→ · · ·

It can be seen that the procedures of the two algorithms are very similar.
We examine their value steps more closely to see the difference between the two
algorithms. In particular, let both algorithms start from the same initial condition:
v0 = vπ0 . The procedures of the two algorithms are listed in Table 4.6. In the first
three steps, the two algorithms generate the same results since v0 = vπ0 . They become


Step | Policy iteration algorithm | Value iteration algorithm | Comments
1) Policy: | π0 | N/A |
2) Value: | vπ0 = rπ0 + γPπ0 vπ0 | v0 = vπ0 |
3) Policy: | π1 = arg maxπ(rπ + γPπ vπ0) | π1 = arg maxπ(rπ + γPπ v0) | The two policies are the same
4) Value: | vπ1 = rπ1 + γPπ1 vπ1 | v1 = rπ1 + γPπ1 v0 | vπ1 ≥ v1 since vπ1 ≥ vπ0
5) Policy: | π2 = arg maxπ(rπ + γPπ vπ1) | π2′ = arg maxπ(rπ + γPπ v1) |
... | ... | ... | ...

Table 4.6: A comparison between the implementation steps of policy iteration and value iteration.

different in the fourth step. During the fourth step, the value iteration algorithm executes
v1 = rπ1 + γPπ1 v0 , which is a one-step calculation, whereas the policy iteration algorithm
solves vπ1 = rπ1 + γPπ1 vπ1 , which requires an infinite number of iterations. If we explicitly
write out the iterative process for solving vπ1 = rπ1 + γPπ1 vπ1 in the fourth step, everything becomes clear. By letting vπ1(0) = v0, we have

vπ1(0) = v0
value iteration ← v1 ← vπ1(1) = rπ1 + γPπ1 vπ1(0)
                        vπ1(2) = rπ1 + γPπ1 vπ1(1)
                        ...
truncated policy iteration ← v̄1 ← vπ1(j) = rπ1 + γPπ1 vπ1(j−1)
                        ...
policy iteration ← vπ1 ← vπ1(∞) = rπ1 + γPπ1 vπ1(∞)

The following observations can be obtained from the above process.

 If the iteration is run only once, then vπ1(1) is actually v1, as calculated in the value iteration algorithm.
 If the iteration is run an infinite number of times, then vπ1(∞) is actually vπ1, as calculated in the policy iteration algorithm.
 If the iteration is run a finite number of times (denoted as jtruncate), then such an algorithm is called truncated policy iteration. It is called truncated because the remaining iterations from jtruncate to ∞ are truncated.

As a result, the value iteration and policy iteration algorithms can be viewed as two
extreme cases of the truncated policy iteration algorithm: value iteration terminates


Algorithm 4.3: Truncated policy iteration algorithm

Initialization: The probability models p(r|s, a) and p(s′|s, a) for all (s, a) are known. Initial guess π0.
Goal: Search for the optimal state value and an optimal policy.

While vk has not converged, for the kth iteration, do
    Policy evaluation:
        Initialization: select the initial guess as vk(0) = vk−1. The maximum number of iterations is set as jtruncate.
        While j < jtruncate, do
            For every state s ∈ S, do
                vk(j+1)(s) = Σa πk(a|s) [ Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vk(j)(s′) ]
        Set vk = vk(jtruncate)
    Policy improvement:
        For every state s ∈ S, do
            For every action a ∈ A(s), do
                qk(s, a) = Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vk(s′)
            a∗k(s) = arg maxa qk(s, a)
            πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 otherwise

at jtruncate = 1, and policy iteration terminates at jtruncate = ∞. It should be noted that, although the above comparison is illustrative, it is based on the condition that vπ1(0) = v0 = vπ0. The two algorithms cannot be directly compared without this condition.

4.3.2 Truncated policy iteration algorithm


In a nutshell, the truncated policy iteration algorithm is the same as the policy iteration
algorithm except that it merely runs a finite number of iterations in the policy evaluation
step. Its implementation details are summarized in Algorithm 4.3. It is notable that vk and vk(j) in the algorithm are not state values. Instead, they are approximations of the
true state values because only a finite number of iterations are executed in the policy
evaluation step.
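
In code, the only change relative to the policy iteration sketch given earlier is that the inner evaluation loop runs a fixed number of sweeps instead of iterating until convergence. The fragment below illustrates this truncated policy evaluation step under the same array-based model assumptions (P[s, a, s′], R[s, a]): setting j_truncate = 1 recovers the value update of value iteration, while a very large j_truncate approaches policy iteration.

    import numpy as np

    def truncated_policy_evaluation(P, R, policy, v_init, gamma=0.9, j_truncate=10):
        """Run only j_truncate sweeps of v <- r_pi + gamma * P_pi v, starting from v_init
        (typically the value estimate carried over from the previous outer iteration)."""
        idx = np.arange(R.shape[0])
        r_pi = R[idx, policy]                 # r_pi(s) for a deterministic policy
        P_pi = P[idx, policy]                 # P_pi[s, s'] = p(s'|s, policy(s))
        v = v_init.copy()
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        return v                              # an approximation of v_pi, not the exact state value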
If vk does not equal vπk , will the algorithm still be able to find optimal policies? The
answer is yes. Intuitively, truncated policy iteration is in between value iteration and
policy iteration. On the one hand, it converges faster than the value iteration algorithm
because it computes more than one iteration during the policy evaluation step. On
the other hand, it converges slower than the policy iteration algorithm because it only
computes a finite number of iterations. This intuition is illustrated in Figure 4.5. Such
intuition is also supported by the following analysis.

Proposition 4.1 (Value improvement). Consider the iterative algorithm in the policy


v*
v
k

vk
Policy iteration
Value iteration
Truncated policy iteration
Optimal state value

Figure 4.5: An illustration of the relationships between the value iteration, policy iteration, and truncated
policy iteration algorithms.

evaluation step:

vπk(j+1) = rπk + γPπk vπk(j),   j = 0, 1, 2, . . .

If the initial guess is selected as vπk(0) = vπk−1, it holds that

vπk(j+1) ≥ vπk(j)

for j = 0, 1, 2, . . . .

Box 4.3: Proof of Proposition 4.1


First, since vπk(j) = rπk + γPπk vπk(j−1) and vπk(j+1) = rπk + γPπk vπk(j), we have

vπk(j+1) − vπk(j) = γPπk(vπk(j) − vπk(j−1)) = · · · = γ^j Pπk^j (vπk(1) − vπk(0)).    (4.5)

Second, since vπk(0) = vπk−1, we have

vπk(1) = rπk + γPπk vπk(0) = rπk + γPπk vπk−1 ≥ rπk−1 + γPπk−1 vπk−1 = vπk−1 = vπk(0),

where the inequality is due to πk = arg maxπ(rπ + γPπ vπk−1). Substituting vπk(1) ≥ vπk(0) into (4.5) yields vπk(j+1) ≥ vπk(j).

Notably, Proposition 4.1 requires the assumption that vπk(0) = vπk−1. However, vπk−1 is
unavailable in practice, and only vk−1 is available. Nevertheless, Proposition 4.1 still sheds
light on the convergence of the truncated policy iteration algorithm. A more in-depth
discussion of this topic can be found in [2, Section 6.5].
Up to now, the advantages of truncated policy iteration are clear. Compared to the


policy iteration algorithm, the truncated one merely requires a finite number of iterations
in the policy evaluation step and hence is more computationally efficient. Compared to
value iteration, the truncated policy iteration algorithm can speed up its convergence
rate by running for a few more iterations in the policy evaluation step.

4.4 Summary
This chapter introduced three algorithms that can be used to find optimal policies.

 Value iteration: The value iteration algorithm is the same as the algorithm suggested
by the contraction mapping theorem for solving the Bellman optimality equation. It
can be decomposed into two steps: value update and policy update.
 Policy iteration: The policy iteration algorithm is slightly more complicated than the
value iteration algorithm. It also contains two steps: policy evaluation and policy
improvement.
 Truncated policy iteration: The value iteration and policy iteration algorithms can
be viewed as two extreme cases of the truncated policy iteration algorithm.

A common property of the three algorithms is that every iteration has two steps.
One step is to update the value, and the other step is to update the policy. The idea
of interaction between value and policy updates widely exists in reinforcement learning
algorithms. This idea is also called generalized policy iteration [3].
Finally, the algorithms introduced in this chapter require the system model. Starting
in Chapter 5, we will study model-free reinforcement learning algorithms. We will see that
the model-free algorithms can be obtained by extending the algorithms introduced in this chapter.

4.5 Q&A
 Q: Is the value iteration algorithm guaranteed to find optimal policies?
A: Yes. This is because value iteration is exactly the algorithm suggested by the
contraction mapping theorem for solving the Bellman optimality equation in the last
chapter. The convergence of this algorithm is guaranteed by the contraction mapping
theorem.
 Q: Are the intermediate values generated by the value iteration algorithm state values?
A: No. These values are not guaranteed to satisfy the Bellman equation of any policy.
 Q: What steps are included in the policy iteration algorithm?
A: Each iteration of the policy iteration algorithm contains two steps: policy evalu-
ation and policy improvement. In the policy evaluation step, the algorithm aims to
solve the Bellman equation to obtain the state value of the current policy. In the


policy improvement step, the algorithm aims to update the policy so that the newly
generated policy has greater state values.
 Q: Is another iterative algorithm embedded in the policy iteration algorithm?
A: Yes. In the policy evaluation step of the policy iteration algorithm, an iterative
algorithm is required to solve the Bellman equation of the current policy.
 Q: Are the intermediate values generated by the policy iteration algorithm state val-
ues?
A: Yes. This is because these values are the solutions of the Bellman equation of the
current policy.
 Q: Is the policy iteration algorithm guaranteed to find optimal policies?
A: Yes. We have presented a rigorous proof of its convergence in this chapter.
 Q: What is the relationship between the truncated policy iteration and policy iteration
algorithms?
A: As its name suggests, the truncated policy iteration algorithm can be obtained
from the policy iteration algorithm by simply executing a finite number of iterations
during the policy evaluation step.
 Q: What is the relationship between truncated policy iteration and value iteration?
A: Value iteration can be viewed as an extreme case of truncated policy iteration,
where a single iteration is run during the policy evaluation step.
 Q: Are the intermediate values generated by the truncated policy iteration algorithm
state values?
A: No. Only if we run an infinite number of iterations in the policy evaluation step,
can we obtain true state values. If we run a finite number of iterations, we can only
obtain approximations of the true state values.
 Q: How many iterations should we run in the policy evaluation step of the truncated
policy iteration algorithm?
A: The general guideline is to run a few iterations but not too many. The use of a few
iterations in the policy evaluation step can speed up the overall convergence rate, but
running too many iterations would not significantly speed up the convergence rate.
 Q: What is generalized policy iteration?
A: Generalized policy iteration is not a specific algorithm. Instead, it refers to the
general idea of the interaction between value and policy updates. This idea is root-
ed in the policy iteration algorithm. Most of the reinforcement learning algorithms
introduced in this book fall into the scope of generalized policy iteration.
 Q: What are model-based and model-free reinforcement learning?


A: Although the algorithms introduced in this chapter can find optimal policies, they
are usually called dynamic programming algorithms rather than reinforcement learn-
ing algorithms because they require the system model. Reinforcement learning al-
gorithms can be classified into two categories: model-based and model-free. Here,
“model-based” does not refer to the requirement of the system model. Instead, model-
based reinforcement learning uses data to estimate the system model and uses this
model during the learning process. By contrast, model-free reinforcement learning
does not involve model estimation during the learning process. All the reinforce-
ment learning algorithms introduced in this book are model-free algorithms. More
information about model-based reinforcement learning can be found in [13–16].

Chapter 5

Monte Carlo Methods

Figure 5.1: Where we are in this book.

In the previous chapter, we introduced algorithms that can find optimal policies based
on the system model. In this chapter, we start introducing model-free reinforcement
learning algorithms that do not presume system models.
While this is the first time we introduce model-free algorithms in this book, we must
fill a knowledge gap: how can we find optimal policies without models? The philosophy
is simple: If we do not have a model, we must have some data. If we do not have data,
we must have a model. If we have neither, then we are not able to find optimal policies.
The “data” in reinforcement learning usually refers to the agent’s interaction experiences
with the environment.


To demonstrate how to learn from data rather than a model, we start this chapter by
introducing the mean estimation problem, where the expected value of a random variable
is estimated from some samples. Understanding this problem is crucial for understanding
the fundamental idea of learning from data.
Then, we introduce three algorithms based on Monte Carlo (MC) methods. These
algorithms can learn optimal policies from experience samples. The first and simplest
algorithm is called MC Basic, which can be readily obtained by modifying the policy itera-
tion algorithm introduced in the last chapter. Understanding this algorithm is important
for grasping the fundamental idea of MC-based reinforcement learning. By extending
this algorithm, we further introduce another two algorithms that are more complicated
but more efficient.

5.1 Motivating example: Mean estimation


We next introduce the mean estimation problem to demonstrate how to learn from data
rather than a model. We will see that mean estimation can be achieved based on Monte
Carlo methods, which refer to a broad class of techniques that use stochastic samples
to solve estimation problems. The reader may wonder why we care about the mean
estimation problem. It is simply because state and action values are both defined as
the means of returns. Estimating a state or action value is actually a mean estimation
problem.
Consider a random variable X that can take values from a finite set of real numbers
denoted as X . Suppose that our task is to calculate the mean or expected value of X:
E[X]. Two approaches can be used to calculate E[X].

 The first approach is model-based. Here, the model refers to the probability distribu-
tion of X. If the model is known, then the mean can be directly calculated based on
the definition of the expected value:
E[X] = Σx∈X p(x)x.

In this book, we use the terms expected value, mean, and average interchangeably.
 The second approach is model-free. When the probability distribution (i.e., the model)
of X is unknown, suppose that we have some samples {x1 , x2 , . . . , xn } of X. Then,
the mean can be approximated as
E[X] ≈ x̄ = (1/n) Σ_{j=1}^{n} xj.

When n is small, this approximation may not be accurate. However, as n increases, the approximation becomes increasingly accurate. When n → ∞, we have x̄ → E[X].


This is guaranteed by the law of large numbers: the average of a large number of
samples is close to the expected value. The law of large numbers is introduced in
Box 5.1.

The following example illustrates the two approaches described above. Consider a
coin flipping game. Let random variable X denote which side is showing when the coin
lands. X has two possible values: X = 1 when the head is showing, and X = −1 when
the tail is showing. Suppose that the true probability distribution (i.e., the model) of X
is

p(X = 1) = 0.5, p(X = −1) = 0.5.

If the probability distribution is known in advance, we can directly calculate the mean as

E[X] = 0.5 · 1 + 0.5 · (−1) = 0.

If the probability distribution is unknown, then we can flip the coin many times and
record the sampling results {xi }ni=1 . By calculating the average of the samples, we can
obtain an estimate of the mean. As shown in Figure 5.2, the estimated mean becomes
increasingly accurate as the number of samples increases.

Figure 5.2: An example for demonstrating the law of large numbers. Here, the samples are drawn from {+1, −1} following a uniform distribution. The average of the samples gradually converges to zero, which is the true expected value, as the number of samples increases.

It is worth mentioning that the samples used for mean estimation must be independent
and identically distributed (i.i.d. or iid). Otherwise, if the sampling values correlate, it
may be impossible to correctly estimate the expected value. An extreme case is that
all the sampling values are the same as the first one, whatever the first one is. In this
case, the average of the samples is always equal to the first sample, no matter how many
samples we use.
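
As a quick illustration, the following sketch reproduces the coin-flipping experiment: it draws i.i.d. samples from {+1, −1} and prints the sample average at a few sample sizes. The particular sample sizes chosen for printing are arbitrary.

    import random

    random.seed(1)
    samples = [random.choice([1, -1]) for _ in range(200)]    # i.i.d. fair coin flips

    running_sum = 0
    for n, x in enumerate(samples, start=1):
        running_sum += x
        if n in (10, 50, 200):
            print(f"n = {n:3d}, sample average = {running_sum / n:+.3f}")
    # As n grows, the sample average approaches the true mean E[X] = 0,
    # as guaranteed by the law of large numbers (Box 5.1).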


Box 5.1: Law of large numbers

For a random variable X, suppose that {xi}_{i=1}^{n} are some i.i.d. samples. Let x̄ = (1/n) Σ_{i=1}^{n} xi be the average of the samples. Then,

E[x̄] = E[X],
var[x̄] = (1/n) var[X].

The above two equations indicate that x̄ is an unbiased estimate of E[X] and that its variance decreases to zero as n increases to infinity.
The proof is given below.
First, E[x̄] = E[Σ_{i=1}^{n} xi/n] = Σ_{i=1}^{n} E[xi]/n = E[X], where the last equality is due to the fact that the samples are identically distributed (that is, E[xi] = E[X]).
Second, var[x̄] = var[Σ_{i=1}^{n} xi/n] = Σ_{i=1}^{n} var[xi]/n^2 = (n · var[X])/n^2 = var[X]/n, where the second equality is due to the fact that the samples are independent, and the third equality is a result of the samples being identically distributed (that is, var[xi] = var[X]).

5.2 MC Basic: The simplest MC-based algorithm


This section introduces the first and the simplest MC-based reinforcement learning algo-
rithm. This algorithm is obtained by replacing the model-based policy evaluation step in
the policy iteration algorithm introduced in Section 4.2 with a model-free MC estimation
step.

5.2.1 Converting policy iteration to be model-free


There are two steps in every iteration of the policy iteration algorithm (see Section 4.2).
The first step is policy evaluation, which aims to compute vπk by solving vπk = rπk +
γPπk vπk . The second step is policy improvement, which aims to compute the greedy
policy πk+1 = arg maxπ (rπ + γPπ vπk ). The elementwise form of the policy improvement
step is

πk+1(s) = arg maxπ Σa π(a|s) [ Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vπk(s′) ]
         = arg maxπ Σa π(a|s) qπk(s, a),   s ∈ S.

It must be noted that the action values lie at the core of these two steps. Specifically,
in the first step, the state values are calculated for the purpose of calculating the action


values. In the second step, the new policy is generated based on the calculated action
values. Let us reconsider how we can calculate the action values. Two approaches are
available.

 The first is a model-based approach. This is the approach adopted by the policy
iteration algorithm. In particular, we can first calculate the state value vπk by solving
the Bellman equation. Then, we can calculate the action values by using
qπk(s, a) = Σr p(r|s, a)r + γ Σs′ p(s′|s, a)vπk(s′).    (5.1)

This approach requires the system model {p(r|s, a), p(s′|s, a)} to be known.
 The second is a model-free approach. Recall that the definition of an action value is

qπk(s, a) = E[Gt | St = s, At = a] = E[Rt+1 + γRt+2 + γ^2 Rt+3 + · · · | St = s, At = a],

which is the expected return obtained when starting from (s, a). Since qπk(s, a) is an expectation, it can be estimated by MC methods as demonstrated in Section 5.1. To do that, starting from (s, a), the agent can interact with the environment by following policy πk and then obtain a certain number of episodes. Suppose that there are n episodes and that the return of the ith episode is gπk(i)(s, a). Then, qπk(s, a) can be approximated as

qπk(s, a) = E[Gt | St = s, At = a] ≈ (1/n) Σ_{i=1}^{n} gπk(i)(s, a).    (5.2)

We already know that, if the number of episodes n is sufficiently large, the approxi-
mation will be sufficiently accurate according to the law of large numbers.

The fundamental idea of MC-based reinforcement learning is to use a model-free method for estimating action values, as shown in (5.2), to replace the model-based method in the policy iteration algorithm.
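
To make this idea concrete, here is a minimal sketch of the model-free evaluation of a single action value. It assumes a generic simulator function sample_step(s, a) that returns a successor state and a reward drawn from the unknown model, and a function policy(s) that returns the action taken in state s; this interface and the fixed episode length are illustrative assumptions rather than this book's code.

    def mc_action_value(s, a, policy, sample_step, gamma=0.9,
                        num_episodes=100, episode_length=50):
        """Estimate q_pi(s, a) by averaging the discounted returns of episodes
        that start from (s, a) and then follow the given policy, cf. (5.2)."""
        returns = []
        for _ in range(num_episodes):
            g, discount = 0.0, 1.0
            state, action = s, a
            for _ in range(episode_length):     # a long but finite episode approximates the return
                next_state, reward = sample_step(state, action)
                g += discount * reward
                discount *= gamma
                state = next_state
                action = policy(state)
            returns.append(g)
        return sum(returns) / len(returns)      # the sample mean of the returns

Replacing the model-based calculation of qπk(s, a) in policy iteration with such a sample average is exactly what the MC Basic algorithm in the next subsection does.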

5.2.2 The MC Basic algorithm


We are now ready to present the first MC-based reinforcement learning algorithm. Starting from an initial policy π0, the algorithm has two steps in the kth iteration (k = 0, 1, 2, . . . ).

 Step 1: Policy evaluation. This step is used to estimate qπk (s, a) for all (s, a). Specif-
ically, for every (s, a), we collect sufficiently many episodes and use the average of the
returns, denoted as qk (s, a), to approximate qπk (s, a).


Algorithm 5.1: MC Basic (a model-free variant of policy iteration)

Initialization: Initial guess π0.
Goal: Search for an optimal policy.

For the kth iteration (k = 0, 1, 2, . . . ), do
    For every state s ∈ S, do
        For every action a ∈ A(s), do
            Collect sufficiently many episodes starting from (s, a) by following πk
            Policy evaluation:
                qπk(s, a) ≈ qk(s, a) = the average return of all the episodes starting from (s, a)
        Policy improvement:
            a∗k(s) = arg maxa qk(s, a)
            πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 otherwise

 Step 2: Policy improvement. This step solves πk+1(s) = arg maxπ Σa π(a|s)qk(s, a) for all s ∈ S. The greedy optimal policy is πk+1(a∗k|s) = 1, where a∗k = arg maxa qk(s, a).
This is the simplest MC-based reinforcement learning algorithm, which is called MC Basic in this book. The pseudocode of the MC Basic algorithm is given in Algorithm 5.1.
As can be seen, it is very similar to the policy iteration algorithm. The only difference is
that it calculates action values directly from experience samples, whereas policy iteration
calculates state values first and then calculates the action values based on the system
model. It should be noted that the model-free algorithm directly estimates action values.
Otherwise, if it estimates state values instead, we still need to calculate action values
from these state values using the system model, as shown in (5.1).
Since policy iteration is convergent, MC Basic is also convergent when given suffi-
cient samples. That is, for every (s, a), suppose that there are sufficiently many episodes
starting from (s, a). Then, the average of the returns of these episodes can accurately ap-
proximate the action value of (s, a). In practice, we usually do not have sufficient episodes
for every (s, a). As a result, the approximation of the action values may not be accurate.
Nevertheless, the algorithm usually can still work. This is similar to the truncated policy
iteration algorithm, where the action values are not calculated accurately either.
Finally, MC Basic is too simple to be practical due to its low sample efficiency. The
reason why we introduce this algorithm is to let readers grasp the core idea of MC-
based reinforcement learning. It is important to understand this algorithm well before
studying more complex algorithms introduced later in this chapter. We will see that more
complex and sample-efficient algorithms can be readily obtained by extending the MC
Basic algorithm.
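To make the procedure concrete, the following is a minimal Python sketch of the MC Basic algorithm. It is not the implementation used for the examples in this book; in particular, the environment interface sample_episode(s, a, policy, length), which is assumed to return the reward sequence of one episode generated by starting from (s, a) and then following the given policy, is a hypothetical assumption.

```python
import numpy as np

def mc_basic(states, actions, sample_episode, gamma=0.9,
             num_iterations=10, episodes_per_pair=1, episode_length=30):
    # Deterministic initial policy: take the first action at every state.
    policy = {s: actions[0] for s in states}
    for _ in range(num_iterations):
        # Policy evaluation: estimate q_{pi_k}(s, a) by averaging episode returns.
        q = {}
        for s in states:
            for a in actions:
                returns = []
                for _ in range(episodes_per_pair):
                    rewards = sample_episode(s, a, policy, episode_length)
                    g = sum(gamma**t * r for t, r in enumerate(rewards))
                    returns.append(g)
                q[(s, a)] = np.mean(returns)
        # Policy improvement: greedy with respect to the estimated action values.
        for s in states:
            policy[s] = max(actions, key=lambda a: q[(s, a)])
    return policy, q
```

Compared with policy iteration, the only difference is that the model-based calculation in (5.1) is replaced by the sample average in (5.2).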
5.2.3 Illustrative examples

A simple example: A step-by-step implementation

[Figure 5.3: An example for illustrating the MC Basic algorithm (a 3-by-3 grid world with states s1, . . . , s9).]
We next use an example to demonstrate the implementation details of the MC Basic
algorithm. The reward settings are rboundary = rforbidden = −1 and rtarget = 1. The
discount rate is γ = 0.9. The initial policy π0 is shown in Figure 5.3. This initial policy
is not optimal for s1 or s3 .
While all the action values should be calculated, we merely present those of s1 due
to space limitations. At s1 , there are five possible actions. For each action, we need
to collect many episodes that are sufficiently long to effectively approximate the action
value. However, since this example is deterministic in terms of both the policy and model,
running multiple times would generate the same trajectory. As a result, the estimation
of each action value merely requires a single episode.
Following π0 , we can obtain the following episodes by respectively starting from
(s1 , a1 ), (s1 , a2 ), . . . , (s1 , a5 ).
- Starting from (s1, a1), the episode is $s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. The action value equals the discounted return of the episode:

  $$q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots = \frac{-1}{1-\gamma}.$$
- Starting from (s1, a2), the episode is $s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_3} \cdots$. The action value equals the discounted return of the episode:

  $$q_{\pi_0}(s_1, a_2) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3(1) + \gamma^4(1) + \cdots = \frac{\gamma^3}{1-\gamma}.$$
- Starting from (s1, a3), the episode is $s_1 \xrightarrow{a_3} s_4 \xrightarrow{a_2} s_5 \xrightarrow{a_3} \cdots$. The action value equals the discounted return of the episode:

  $$q_{\pi_0}(s_1, a_3) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3(1) + \gamma^4(1) + \cdots = \frac{\gamma^3}{1-\gamma}.$$
- Starting from (s1, a4), the episode is $s_1 \xrightarrow{a_4} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. The action value equals the discounted return of the episode:

  $$q_{\pi_0}(s_1, a_4) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots = \frac{-1}{1-\gamma}.$$
- Starting from (s1, a5), the episode is $s_1 \xrightarrow{a_5} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. The action value equals the discounted return of the episode:

  $$q_{\pi_0}(s_1, a_5) = 0 + \gamma(-1) + \gamma^2(-1) + \cdots = \frac{-\gamma}{1-\gamma}.$$
By comparing the five action values, we see that

$$q_{\pi_0}(s_1, a_2) = q_{\pi_0}(s_1, a_3) = \frac{\gamma^3}{1-\gamma} > 0$$

are the maximum values. As a result, the new policy can be obtained as

$$\pi_1(a_2|s_1) = 1 \quad \text{or} \quad \pi_1(a_3|s_1) = 1.$$
It is intuitive that the improved policy, which takes either a2 or a3 at s1, is optimal.
Therefore, we can successfully obtain an optimal policy by using merely one iteration
for this simple example. In this simple example, the initial policy is already optimal for
all the states except s1 and s3 . Therefore, the policy can become optimal after merely
a single iteration. When the policy is nonoptimal for other states, more iterations are
needed.
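As a quick numerical check (a standalone sketch with γ = 0.9, not part of any algorithm), the closed-form returns above can be verified by truncating the geometric series after sufficiently many terms:

```python
gamma, T = 0.9, 1000  # T is a truncation length long enough for the tail to be negligible

q_a1 = sum(gamma**t * (-1) for t in range(T))                   # -1/(1-gamma) = -10
q_a2 = sum(gamma**t * (0 if t < 3 else 1) for t in range(T))    # gamma^3/(1-gamma) = 7.29
q_a5 = sum(gamma**t * (0 if t == 0 else -1) for t in range(T))  # -gamma/(1-gamma) = -9

print(round(q_a1, 2), round(q_a2, 2), round(q_a5, 2))  # -10.0 7.29 -9.0
```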
A comprehensive example: Episode length and sparse rewards

We next discuss some interesting properties of the MC Basic algorithm by examining
a more comprehensive example. The example is a 5-by-5 grid world (Figure 5.4). The
reward settings are rboundary = −1, rforbidden = −10, and rtarget = 1. The discount rate is
γ = 0.9.
First, we demonstrate that the episode length greatly impacts the final optimal poli-
cies. In particular, Figure 5.4 shows the final results generated by the MC Basic algorithm
with different episode lengths. When the length of each episode is too short, neither the
policy nor the value estimate is optimal (see Figures 5.4(a)-(d)). In the extreme case
where the episode length is one, only the states that are adjacent to the target have
nonzero values, and all the other states have zero values since each episode is too short to reach the target or get positive rewards (see Figure 5.4(a)). As the episode length increases, the policy and value estimates gradually approach the optimal ones (see Figure 5.4(h)).

[Figure 5.4: The policies and state values obtained by the MC Basic algorithm when given different episode lengths (1, 2, 3, 4, . . . , 14, 15, 30, and 100). Only if the length of each episode is sufficiently long can the state values be accurately estimated.]
As the episode length increases, an interesting spatial pattern emerges. That is, the
states that are closer to the target possess nonzero values earlier than those that are
farther away. The reason for this phenomenon is as follows. Starting from a state, the
agent must travel at least a certain number of steps to reach the target state and then
receive positive rewards. If the length of an episode is less than this minimum required
number of steps, the return is certainly zero, and so is the estimated state value. In
this example, the episode length must be no less than 15, which is the minimum number
of steps required to reach the target when starting from the bottom-left state.
While the above analysis suggests that each episode must be sufficiently long, the
episodes are not necessarily infinitely long. As shown in Figure 5.4(g), when the length
is 30, the algorithm can find an optimal policy, although the value estimate is not yet
optimal.
The above analysis is related to an important reward design problem, sparse reward,
which refers to the scenario in which no positive rewards can be obtained unless the target
is reached. The sparse reward setting requires long episodes that can reach the target.
This requirement is challenging to satisfy when the state space is large. As a result,
the sparse reward problem degrades the learning efficiency. One simple technique for
solving this problem is to design nonsparse rewards. For instance, in the above grid world
example, we can redesign the reward setting so that the agent can obtain a small positive
reward when reaching the states near the target. In this way, an “attractive field” can
be formed around the target so that the agent can find the target more easily. More
information about sparse reward problems can be found in [17–19].
5.3 MC Exploring Starts

We next extend the MC Basic algorithm to obtain another MC-based reinforcement
learning algorithm that is slightly more complicated but more sample-efficient.

5.3.1 Utilizing samples more efficiently
An important aspect of MC-based reinforcement learning is how to use samples more
efficiently. Specifically, suppose that we have an episode of samples obtained by following
a policy π:
$$s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_4} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots \qquad (5.3)$$
where the subscripts refer to the state or action indexes rather than time steps. Every
time a state-action pair appears in an episode, it is called a visit of that state-action pair.
Different strategies can be employed to utilize the visits.
The first and simplest strategy is to use the initial visit. That is, an episode is only
used to estimate the action value of the initial state-action pair that the episode starts
from. For the example in (5.3), the initial-visit strategy merely estimates the action
value of (s1 , a2 ). The MC Basic algorithm utilizes the initial-visit strategy. However, this
strategy is not sample-efficient because the episode also visits many other state-action
pairs such as (s2 , a4 ), (s2 , a3 ), and (s5 , a1 ). These visits can also be used to estimate the
corresponding action values. In particular, we can decompose the episode in (5.3) into
multiple subepisodes:
$$\begin{aligned}
&s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_4} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots && \text{[original episode]} \\
&s_2 \xrightarrow{a_4} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots && \text{[subepisode starting from } (s_2, a_4)\text{]} \\
&s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots && \text{[subepisode starting from } (s_1, a_2)\text{]} \\
&s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots && \text{[subepisode starting from } (s_2, a_3)\text{]} \\
&s_5 \xrightarrow{a_1} \cdots && \text{[subepisode starting from } (s_5, a_1)\text{]}
\end{aligned}$$
The trajectory generated after the visit of a state-action pair can be viewed as a new
episode. These new episodes can be used to estimate more action values. In this way,
the samples in the episode can be utilized more efficiently.
Moreover, a state-action pair may be visited multiple times in an episode. For exam-
ple, (s1 , a2 ) is visited twice in the episode in (5.3). If we only count the first-time visit,
this is called a first-visit strategy. If we count every visit of a state-action pair, such a
strategy is called every-visit [20].
In terms of sample usage efficiency, the every-visit strategy is the best. If an episode
is sufficiently long such that it can visit all the state-action pairs many times, then this
single episode may be sufficient for estimating all the action values using the every-visit
strategy. However, the samples obtained by the every-visit strategy are correlated because
the trajectory starting from the second visit is merely a subset of the trajectory starting
from the first visit. Nevertheless, the correlation would not be strong if the two visits are
far away from each other in the trajectory.
5.3.2 Updating policies more efficiently
Another aspect of MC-based reinforcement learning is when to update the policy. Two
strategies are available.
- The first strategy is, in the policy evaluation step, to collect all the episodes starting
from the same state-action pair and then approximate the action value using the
average return of these episodes. This strategy is adopted in the MC Basic algorithm.
Algorithm 5.2: MC Exploring Starts (an efficient variant of MC Basic)

Initialization: Initial policy π0 (a|s) and initial value q(s, a) for all (s, a). Returns(s, a) = 0 and Num(s, a) = 0 for all (s, a).
Goal: Search for an optimal policy.

For each episode, do
    Episode generation: Select a starting state-action pair (s0 , a0 ) and ensure that all pairs can possibly be selected (this is the exploring-starts condition). Following the current policy, generate an episode of length T : s0 , a0 , r1 , . . . , sT −1 , aT −1 , rT .
    Initialization for each episode: g ← 0
    For each step of the episode, t = T − 1, T − 2, . . . , 0, do
        g ← γg + rt+1
        Returns(st , at ) ← Returns(st , at ) + g
        Num(st , at ) ← Num(st , at ) + 1
        Policy evaluation:
            q(st , at ) ← Returns(st , at )/Num(st , at )
        Policy improvement:
            π(a|st ) = 1 if a = arg maxa q(st , a) and π(a|st ) = 0 otherwise
The drawback of this strategy is that the agent must wait until all the episodes have
been collected before the estimate can be updated.
- The second strategy, which can overcome this drawback, is to use the return of a
single episode to approximate the corresponding action value. In this way, we can
immediately obtain a rough estimate when we receive an episode. Then, the policy
can be improved in an episode-by-episode fashion.
Since the return of a single episode cannot accurately approximate the corresponding
action value, one may wonder whether the second strategy is good. In fact, this strategy
falls into the scope of generalized policy iteration introduced in the last chapter. That is,
we can still update the policy even if the value estimate is not sufficiently accurate.
5.3.3 Algorithm description
We can use the techniques introduced in Sections 5.3.1 and 5.3.2 to enhance the
efficiency of the MC Basic algorithm. Then, a new algorithm called MC Exploring Starts
can be obtained.
The details of MC Exploring Starts are given in Algorithm 5.2. This algorithm uses
the every-visit strategy. Interestingly, when calculating the discounted return obtained
by starting from each state-action pair, the procedure starts from the ending states and
travels back to the starting state. Such techniques can make the algorithm more efficient,
but they also make the algorithm more complex. This is why the MC Basic algorithm,
which is free of such techniques, is introduced first to reveal the core idea of MC-based
reinforcement learning.
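To illustrate the backward pass mentioned above, here is a minimal Python sketch of the inner loop of Algorithm 5.2. It assumes that an episode is stored as a list of (s_t, a_t, r_{t+1}) tuples; the function and variable names are illustrative rather than those of any official implementation.

```python
from collections import defaultdict

def update_action_values(episode, returns_sum, visit_count, gamma=0.9):
    # `episode` is a list of (s_t, a_t, r_{t+1}) tuples. Traversing it backwards,
    # g always equals the discounted return obtained from (s_t, a_t) onwards,
    # so every visited state-action pair is updated in a single pass (every-visit).
    g = 0.0
    for s, a, r in reversed(episode):
        g = gamma * g + r
        returns_sum[(s, a)] += g
        visit_count[(s, a)] += 1
    return {sa: returns_sum[sa] / visit_count[sa] for sa in returns_sum}

# Example usage with accumulators shared across episodes (rewards are illustrative):
returns_sum, visit_count = defaultdict(float), defaultdict(int)
episode = [("s1", "a2", 0), ("s2", "a3", 0), ("s5", "a3", 1)]
q_estimate = update_action_values(episode, returns_sum, visit_count)
```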
The exploring starts condition requires sufficiently many episodes starting from every
state-action pair. Only if every state-action pair is well explored, can we accurately
estimate their action values (according to the law of large numbers) and hence successfully
find optimal policies. Otherwise, if an action is not well explored, its action value may
be inaccurately estimated, and this action may not be selected by the policy even though
it is indeed the best action. Both MC Basic and MC Exploring Starts require this
condition. However, this condition is difficult to meet in many applications, especially
those involving physical interactions with environments. Can we remove the exploring
starts requirement? The answer is yes, as shown in the next section.
5.4 MC ε-Greedy: Learning without exploring starts

We next extend the MC Exploring Starts algorithm by removing the exploring starts condition. This condition is actually meant to ensure that every state-action pair is visited sufficiently many times, which can also be achieved by using soft policies.

5.4.1 ε-greedy policies
A policy is soft if it has a positive probability of taking any action at any state. Consider
an extreme case in which we only have a single episode. With a soft policy, a single
episode that is sufficiently long can visit every state-action pair many times (see the
examples in Figure 5.8). Thus, we do not need to generate a large number of episodes
starting from different state-action pairs, and then the exploring starts requirement can
be removed.
One common type of soft policy is the ε-greedy policy. An ε-greedy policy is a stochastic policy that has a higher chance of choosing the greedy action and the same nonzero probability of taking any other action. Here, the greedy action refers to the action with the greatest action value. In particular, suppose that ε ∈ [0, 1]. The corresponding ε-greedy policy has the following form:

$$\pi(a|s) = \begin{cases} 1 - \dfrac{\epsilon}{|\mathcal{A}(s)|}\big(|\mathcal{A}(s)| - 1\big), & \text{for the greedy action}, \\[2mm] \dfrac{\epsilon}{|\mathcal{A}(s)|}, & \text{for the other } |\mathcal{A}(s)| - 1 \text{ actions}, \end{cases}$$

where |A(s)| denotes the number of actions associated with s.
When ε = 0, the ε-greedy policy becomes greedy. When ε = 1, the probability of taking any action equals $\frac{1}{|\mathcal{A}(s)|}$.

The probability of taking the greedy action is always greater than that of taking any other action because

$$1 - \frac{\epsilon}{|\mathcal{A}(s)|}\big(|\mathcal{A}(s)| - 1\big) = 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|} \geq \frac{\epsilon}{|\mathcal{A}(s)|}$$

for any ε ∈ [0, 1].
While an ε-greedy policy is stochastic, how can we select an action by following such a policy? We can first generate a random number x in [0, 1] according to a uniform distribution. If x ≥ ε, then we select the greedy action. If x < ε, then we randomly select an action in A(s) with probability $\frac{1}{|\mathcal{A}(s)|}$ (we may select the greedy action again). In this way, the total probability of selecting the greedy action is $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}$, and the probability of selecting any other action is $\frac{\epsilon}{|\mathcal{A}(s)|}$.
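The sampling procedure described above can be written as a short routine. The following is a minimal sketch in which q_values is assumed to be a dictionary mapping the actions available at the current state to their estimated action values:

```python
import random

def sample_epsilon_greedy_action(q_values, epsilon):
    # With probability 1 - epsilon, take the greedy action directly; with
    # probability epsilon, choose uniformly among all actions (the greedy
    # action may be selected again), matching the probabilities in the text.
    actions = list(q_values.keys())
    greedy_action = max(actions, key=lambda a: q_values[a])
    if random.random() >= epsilon:
        return greedy_action
    return random.choice(actions)
```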
5.4.2 Algorithm description

To integrate ε-greedy policies into MC learning, we only need to change the policy improvement step from greedy to ε-greedy.
In particular, the policy improvement step in MC Basic or MC Exploring Starts aims
to solve
$$\pi_{k+1}(s) = \arg\max_{\pi \in \Pi} \sum_a \pi(a|s)\, q_{\pi_k}(s,a), \qquad (5.4)$$

where Π denotes the set of all possible policies. We know that the solution of (5.4) is a greedy policy:

$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*, \\ 0, & a \neq a_k^*, \end{cases}$$

where $a_k^* = \arg\max_a q_{\pi_k}(s,a)$.
Now, the policy improvement step is changed to solve

$$\pi_{k+1}(s) = \arg\max_{\pi \in \Pi_\epsilon} \sum_a \pi(a|s)\, q_{\pi_k}(s,a), \qquad (5.5)$$

where $\Pi_\epsilon$ denotes the set of all ε-greedy policies with a given value of ε. In this way, we force the policy to be ε-greedy. The solution of (5.5) is

$$\pi_{k+1}(a|s) = \begin{cases} 1 - \dfrac{|\mathcal{A}(s)| - 1}{|\mathcal{A}(s)|}\,\epsilon, & a = a_k^*, \\[2mm] \dfrac{\epsilon}{|\mathcal{A}(s)|}, & a \neq a_k^*, \end{cases}$$

where $a_k^* = \arg\max_a q_{\pi_k}(s,a)$. With the above change, we obtain another algorithm called MC ε-Greedy. The details of this algorithm are given in Algorithm 5.3. Here, the every-visit strategy is employed to better utilize the samples.
Algorithm 5.3: MC ε-Greedy (a variant of MC Exploring Starts)

Initialization: Initial policy π0 (a|s) and initial value q(s, a) for all (s, a). Returns(s, a) = 0 and Num(s, a) = 0 for all (s, a). ε ∈ (0, 1].
Goal: Search for an optimal policy.

For each episode, do
    Episode generation: Select a starting state-action pair (s0 , a0 ) (the exploring starts condition is not required). Following the current policy, generate an episode of length T : s0 , a0 , r1 , . . . , sT −1 , aT −1 , rT .
    Initialization for each episode: g ← 0
    For each step of the episode, t = T − 1, T − 2, . . . , 0, do
        g ← γg + rt+1
        Returns(st , at ) ← Returns(st , at ) + g
        Num(st , at ) ← Num(st , at ) + 1
        Policy evaluation:
            q(st , at ) ← Returns(st , at )/Num(st , at )
        Policy improvement:
            Let a∗ = arg maxa q(st , a) and set π(a|st ) = 1 − ε(|A(st )| − 1)/|A(st )| if a = a∗ , and π(a|st ) = ε/|A(st )| otherwise
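As a complement to the pseudocode, the policy improvement step of Algorithm 5.3 can be sketched as follows for a single state. Again, q_values is an assumed dictionary mapping actions to their current estimates:

```python
def epsilon_greedy_improvement(q_values, epsilon):
    # Solve (5.5) for one state: the greedy action receives probability
    # 1 - epsilon*(|A(s)|-1)/|A(s)|; every other action receives epsilon/|A(s)|.
    actions = list(q_values.keys())
    n = len(actions)
    a_star = max(actions, key=lambda a: q_values[a])
    pi = {a: epsilon / n for a in actions}
    pi[a_star] = 1 - epsilon * (n - 1) / n
    return pi
```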
If greedy policies are replaced by ε-greedy policies in the policy improvement step, can we still guarantee obtaining optimal policies? The answer is both yes and no. By yes, we mean that, when given sufficient samples, the algorithm can converge to an ε-greedy policy that is optimal in the set $\Pi_\epsilon$. By no, we mean that the policy is merely optimal in $\Pi_\epsilon$ but may not be optimal in Π. However, if ε is sufficiently small, the optimal policies in $\Pi_\epsilon$ are close to those in Π.
5.4.3 Illustrative examples

Consider the grid world example shown in Figure 5.5. The aim is to find the optimal policy for every state. A single episode with one million steps is generated in every iteration of the MC ε-Greedy algorithm. Here, we deliberately consider the extreme case with merely one single episode. We set rboundary = rforbidden = −1, rtarget = 1, and γ = 0.9.

The initial policy is a uniform policy that has the same probability 0.2 of taking any action, as shown in Figure 5.5. The optimal ε-greedy policy with ε = 0.5 can be obtained after two iterations. Although each iteration merely uses a single episode, the policy gradually improves because all the state-action pairs can be visited and hence their values can be accurately estimated.
[Figure 5.5: The evolution process of the MC ε-Greedy algorithm based on single episodes: (a) the initial policy, (b) the policy after the first iteration, and (c) the policy after the second iteration.]
5.5 Exploration and exploitation of ε-greedy policies

Exploration and exploitation constitute a fundamental tradeoff in reinforcement learning. Here, exploration means that the policy should take as many actions as possible so that all the actions can be visited and evaluated well. Exploitation means that the improved policy should take the greedy action that has the greatest action value. However, since the action values obtained at the current moment may not be accurate due to insufficient exploration, we should keep exploring while conducting exploitation to avoid missing optimal actions.

ε-greedy policies provide one way to balance exploration and exploitation. On the one hand, an ε-greedy policy has a higher probability of taking the greedy action so that it can exploit the estimated values. On the other hand, the ε-greedy policy also has a chance to take other actions so that it can keep exploring. ε-greedy policies are used not only in MC-based reinforcement learning but also in other reinforcement learning algorithms such as temporal-difference learning, as introduced in Chapter 7.

Exploitation is related to optimality because optimal policies should be greedy. The fundamental idea of ε-greedy policies is to enhance exploration by sacrificing optimality/exploitation. If we would like to enhance exploitation and optimality, we need to reduce the value of ε. However, if we would like to enhance exploration, we need to increase the value of ε.

We next discuss this tradeoff based on some interesting examples. The reinforcement learning task here is a 5-by-5 grid world. The reward settings are rboundary = −1, rforbidden = −10, and rtarget = 1. The discount rate is γ = 0.9.
Optimality of ε-greedy policies

We next show that the optimality of ε-greedy policies becomes worse when ε increases.

- First, a greedy optimal policy and the corresponding optimal state values are shown in Figure 5.6(a). The state values of some consistent ε-greedy policies are shown in
Figures 5.6(b)-(d). Here, two ε-greedy policies are consistent if the actions with the greatest probabilities in the policies are the same.
As the value of ε increases, the state values of the ε-greedy policies decrease, indicating that the optimality of these ε-greedy policies becomes worse. Notably, the value of the target state becomes the smallest when ε is as large as 0.5. This is because, when ε is large, the agent starting from the target area may enter the surrounding forbidden areas and hence receive negative rewards with a higher probability.

- Second, Figure 5.7 shows the optimal ε-greedy policies (they are optimal in $\Pi_\epsilon$). When ε = 0, the policy is greedy and optimal among all policies. When ε is as small as 0.1, the optimal ε-greedy policy is consistent with the optimal greedy one. However, when ε increases to, for example, 0.2, the obtained ε-greedy policies are not consistent with the optimal greedy one. Therefore, if we want to obtain ε-greedy policies that are consistent with the optimal greedy ones, the value of ε should be sufficiently small.
Why are the ε-greedy policies inconsistent with the optimal greedy one when ε is large? We can answer this question by considering the target state. In the greedy case, the optimal policy at the target state is to stay unchanged to gain positive rewards. However, when ε is large, there is a high chance of entering the forbidden areas and receiving negative rewards. Therefore, the optimal policy at the target state in this case is to escape instead of staying unchanged.

[Figure 5.6: The state values of some consistent ε-greedy policies with ε = 0, 0.1, 0.2, and 0.5. These ε-greedy policies are consistent with each other in the sense that the actions with the greatest probabilities are the same. When the value of ε increases, the state values of the ε-greedy policies decrease and hence their optimality becomes worse.]

[Figure 5.7: The optimal ε-greedy policies and their corresponding state values for ε = 0, 0.1, 0.2, and 0.5. These ε-greedy policies are optimal among all ε-greedy ones with the same value of ε. When the value of ε increases, the optimal ε-greedy policies are no longer consistent with the optimal greedy one as in (a).]
Exploration abilities of ε-greedy policies

We next illustrate that the exploration ability of an ε-greedy policy is strong when ε is large.

First, consider an ε-greedy policy with ε = 1 (see Figure 5.5(a)). In this case, the exploration ability of the ε-greedy policy is strong since it has a 0.2 probability of taking any action at any state. Starting from (s1, a1), an episode generated by this ε-greedy policy is given in Figures 5.8(a)-(c). It can be seen that this single episode can visit all the state-action pairs many times when the episode is sufficiently long, due to the strong exploration ability of the policy. Moreover, the numbers of times that the state-action pairs are visited are almost even, as shown in Figure 5.8(d).

Second, consider an ε-greedy policy with ε = 0.5 (see Figure 5.6(d)). In this case, the ε-greedy policy has a weaker exploration ability than in the case of ε = 1. Starting from (s1, a1), an episode generated by this ε-greedy policy is given in Figures 5.8(e)-(g). Although every action can still be visited when the episode is sufficiently long, the distribution of the number of visits may be extremely uneven. For example, given an episode with one million steps, some actions are visited more than 250,000 times, while most actions are visited merely hundreds or even tens of times, as shown in Figure 5.8(h).
The above examples demonstrate that the exploration abilities of ε-greedy policies decrease as ε decreases. One useful technique is to initially set ε to be large to enhance exploration and gradually reduce it to ensure the optimality of the final policy [21–23].

[Figure 5.8: Exploration abilities of ε-greedy policies with different values of ε. Panels (a)-(d) correspond to ε = 1 and panels (e)-(h) to ε = 0.5; they show trajectories of 100, 1,000, and 10,000 steps starting from (s1, a1) and the number of times each state-action pair is visited within one million steps.]
5.6 Summary

The algorithms in this chapter are the first model-free reinforcement learning algorithms introduced in this book. We first introduced the idea of MC estimation by examining an important mean estimation problem. Then, three MC-based algorithms were introduced.

- MC Basic: This is the simplest MC-based reinforcement learning algorithm. This algorithm is obtained by replacing the model-based policy evaluation step in the policy iteration algorithm with a model-free MC-based estimation component. Given sufficient samples, it is guaranteed that this algorithm can converge to optimal policies and optimal state values.
- MC Exploring Starts: This algorithm is a variant of MC Basic. It can be obtained from the MC Basic algorithm by using the first-visit or every-visit strategy to use samples more efficiently.
- MC ε-Greedy: This algorithm is a variant of MC Exploring Starts. Specifically, in the policy improvement step, it searches for the best ε-greedy policies instead of greedy policies. In this way, the exploration ability of the policy is enhanced and hence the condition of exploring starts can be removed.

Finally, a tradeoff between exploration and exploitation was introduced by examining the properties of ε-greedy policies. As the value of ε increases, the exploration ability of ε-greedy policies increases, but the exploitation of greedy actions decreases. On the other hand, if the value of ε decreases, we can better exploit the greedy actions, but the exploration ability is compromised.
5.7 Q&A

- Q: What is Monte Carlo estimation?
  A: Monte Carlo estimation refers to a broad class of techniques that use stochastic samples to solve approximation problems.

- Q: What is the mean estimation problem?
  A: The mean estimation problem refers to calculating the expected value of a random variable based on stochastic samples.

- Q: How can the mean estimation problem be solved?
  A: There are two approaches: model-based and model-free. In particular, if the probability distribution of the random variable is known, the expected value can be calculated based on its definition. If the probability distribution is unknown, we can use Monte Carlo estimation to approximate the expected value. Such an approximation is accurate when the number of samples is large.
- Q: Why is the mean estimation problem important for reinforcement learning?
  A: Both state and action values are defined as expected values of returns. Hence, estimating state or action values is essentially a mean estimation problem.

- Q: What is the core idea of model-free MC-based reinforcement learning?
  A: The core idea is to convert the policy iteration algorithm to a model-free one. In particular, while the policy iteration algorithm aims to calculate values based on the system model, MC-based reinforcement learning replaces the model-based policy evaluation step in the policy iteration algorithm with a model-free MC-based policy evaluation step.

- Q: What are the initial-visit, first-visit, and every-visit strategies?
  A: They are different strategies for utilizing the samples in an episode. An episode may visit many state-action pairs. The initial-visit strategy uses the entire episode to estimate the action value of the initial state-action pair. The every-visit and first-visit strategies can better utilize the given samples. If the rest of the episode is used to estimate the action value of a state-action pair every time it is visited, such a strategy is called every-visit. If we only count the first time a state-action pair is visited in the episode, such a strategy is called first-visit.
- Q: What is exploring starts? Why is it important?
  A: Exploring starts requires an infinite number of (or sufficiently many) episodes to be generated starting from every state-action pair. In theory, the exploring starts condition is necessary for finding optimal policies: only if every action value is well explored can we accurately evaluate all the actions and then correctly select the optimal ones.

- Q: What is the idea used to avoid exploring starts?
  A: The fundamental idea is to make policies soft. Soft policies are stochastic, enabling a single episode to visit many state-action pairs. In this way, we do not need a large number of episodes starting from every state-action pair.

- Q: Can an ε-greedy policy be optimal?
  A: The answer is both yes and no. By yes, we mean that, if given sufficient samples, the MC ε-Greedy algorithm can converge to an optimal ε-greedy policy. By no, we mean that the converged policy is merely optimal among all ε-greedy policies (with the same value of ε).

- Q: Is it possible to use one episode to visit all state-action pairs?
  A: Yes, it is possible if the policy is soft (e.g., ε-greedy) and the episode is sufficiently long.

- Q: What is the relationship between MC Basic, MC Exploring Starts, and MC ε-Greedy?
  A: MC Basic is the simplest MC-based reinforcement learning algorithm. It is important because it reveals the fundamental idea of model-free MC-based reinforcement learning. MC Exploring Starts is a variant of MC Basic that adjusts the sample usage strategy. Furthermore, MC ε-Greedy is a variant of MC Exploring Starts that removes the exploring starts requirement. Therefore, while the basic idea is simple, complications appear when we want to achieve better performance. It is important to separate the core idea from the complications that may be distracting for beginners.
Chapter 6

Stochastic Approximation

[Figure 6.1: Where we are in this book.]
Chapter 5 introduced the first class of model-free reinforcement learning algorithms based on Monte Carlo estimation. In the next chapter (Chapter 7), we will introduce another class of model-free reinforcement learning algorithms: temporal-difference learning.
However, before proceeding to the next chapter, we need to press the pause button to
better prepare ourselves. This is because temporal-difference algorithms are very different
from the algorithms that we have studied so far. Many readers who see the temporal-
difference algorithms for the first time often wonder how these algorithms were designed
in the first place and why they can work effectively. In fact, there is a knowledge gap
between the previous and subsequent chapters: the algorithms we have studied so far are
non-incremental, but the algorithms that we will study in the subsequent chapters are
incremental.
We use the present chapter to fill this knowledge gap by introducing the basics of stochastic approximation. Although this chapter does not introduce any specific reinforcement learning algorithms, it lays the necessary foundations for studying the subsequent chapters. We will see in Chapter 7 that the temporal-difference algorithms can be
viewed as special stochastic approximation algorithms. The well-known stochastic gradi-
ent descent algorithms widely used in machine learning are also introduced in the present
chapter.
6.1 Motivating example: Mean estimation
We next demonstrate how to convert a non-incremental algorithm to an incremental one
by examining the mean estimation problem.
Consider a random variable X that takes values from a finite set X . Our goal is to
estimate E[X]. Suppose that we have a sequence of i.i.d. samples $\{x_i\}_{i=1}^{n}$. The expected value of X can be approximated by

$$\mathbb{E}[X] \approx \bar{x} \doteq \frac{1}{n}\sum_{i=1}^{n} x_i. \qquad (6.1)$$

The approximation in (6.1) is the basic idea of Monte Carlo estimation, as introduced in
Chapter 5. We know that x̄ → E[X] as n → ∞ according to the law of large numbers.
We next show that two methods can be used to calculate x̄ in (6.1). The first non-
incremental method collects all the samples first and then calculates the average. The
drawback of such a method is that, if the number of samples is large, we may have to
wait for a long time until all of the samples are collected. The second method can avoid
this drawback because it calculates the average in an incremental manner. Specifically,
suppose that

$$w_{k+1} \doteq \frac{1}{k}\sum_{i=1}^{k} x_i, \quad k = 1, 2, \ldots$$

and hence,

$$w_k = \frac{1}{k-1}\sum_{i=1}^{k-1} x_i, \quad k = 2, 3, \ldots$$

Then, $w_{k+1}$ can be expressed in terms of $w_k$ as

$$w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i = \frac{1}{k}\left(\sum_{i=1}^{k-1} x_i + x_k\right) = \frac{1}{k}\big((k-1)w_k + x_k\big) = w_k - \frac{1}{k}(w_k - x_k).$$
Therefore, we obtain the following incremental algorithm:
$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k). \qquad (6.2)$$

This algorithm can be used to calculate the mean x̄ in an incremental manner. It can be verified that

$$\begin{aligned}
w_1 &= x_1, \\
w_2 &= w_1 - \frac{1}{1}(w_1 - x_1) = x_1, \\
w_3 &= w_2 - \frac{1}{2}(w_2 - x_2) = x_1 - \frac{1}{2}(x_1 - x_2) = \frac{1}{2}(x_1 + x_2), \\
w_4 &= w_3 - \frac{1}{3}(w_3 - x_3) = \frac{1}{3}(x_1 + x_2 + x_3), \\
&\;\;\vdots \\
w_{k+1} &= \frac{1}{k}\sum_{i=1}^{k} x_i. \qquad (6.3)
\end{aligned}$$
The advantage of (6.2) is that the average can be calculated immediately every time a sample is received. This average can be used to approximate x̄ and hence E[X]. Notably, the approximation may not be accurate at the beginning due to insufficient samples. However, it is better than nothing. As more samples are obtained, the estimation accuracy can be gradually improved according to the law of large numbers. In addition, one can also define $w_{k+1} \doteq \frac{1}{1+k}\sum_{i=1}^{k+1} x_i$ and $w_k \doteq \frac{1}{k}\sum_{i=1}^{k} x_i$. Doing so would not make any significant difference. In this case, the corresponding iterative algorithm is $w_{k+1} = w_k - \frac{1}{1+k}(w_k - x_{k+1})$.
Furthermore, consider an algorithm with a more general expression:
$$w_{k+1} = w_k - \alpha_k(w_k - x_k). \qquad (6.4)$$

This algorithm is important and frequently used in this chapter. It is the same as (6.2) except that the coefficient 1/k is replaced by αk > 0. Since the expression of αk is not given, we are not able to obtain an explicit expression of wk as in (6.3). However, we will show in the next section that, if {αk} satisfies some mild conditions, wk → E[X] as k → ∞. In Chapter 7, we will see that temporal-difference algorithms have similar (but more complex) expressions.
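The equivalence between the batch average and the incremental update (6.2) can be checked with a short numerical experiment. This is a minimal sketch; the sample distribution and the number of samples are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)  # i.i.d. samples with E[X] = 5

w = 0.0  # the initial value is irrelevant: the first update overwrites it with x_1
for k, x in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x)   # incremental update (6.2) with alpha_k = 1/k

print(w, samples.mean())  # the two values coincide up to floating-point error
```

Replacing 1.0 / k with a general step size αk gives the algorithm in (6.4).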
6.2 Robbins-Monro algorithm
Stochastic approximation refers to a broad class of stochastic iterative algorithms for
solving root-finding or optimization problems [24]. Compared to many other root-finding
algorithms such as gradient-based ones, stochastic approximation is powerful in the sense
that it does not require the expression of the objective function or its derivative.
The Robbins-Monro (RM) algorithm is a pioneering work in the field of stochastic
approximation [24–27]. The famous stochastic gradient descent algorithm is a special
form of the RM algorithm, as shown in Section 6.4. We next introduce the details of the
RM algorithm.
Suppose that we would like to find the root of the equation

$$g(w) = 0,$$

where w ∈ R is the unknown variable and g : R → R is a function. Many problems can be formulated as root-finding problems. For example, if J(w) is an objective function to be optimized, this optimization problem can be converted to solving $g(w) \doteq \nabla_w J(w) = 0$. In addition, an equation such as g(w) = c, where c is a constant, can also be converted to the above equation by rewriting g(w) − c as a new function.
If the expression of g or its derivative is known, there are many numerical algorithms
that can be used. However, the problem we are facing is that the expression of the
function g is unknown. For example, the function may be represented by an artificial
neural network whose structure and parameters are unknown. Moreover, we can only
obtain a noisy observation of g(w):
$$\tilde{g}(w, \eta) = g(w) + \eta,$$
where η ∈ R is the observation error, which may or may not be Gaussian. In summary,
it is a black-box system where only the input w and the noisy output g̃(w, η) are known
(see Figure 6.2). Our aim is to solve g(w) = 0 using w and g̃.
[Figure 6.2: An illustration of the problem of solving g(w) = 0 from the input w and the noisy output g̃(w, η).]
The RM algorithm that can solve g(w) = 0 is

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \quad k = 1, 2, 3, \ldots \qquad (6.5)$$
where wk is the kth estimate of the root, g̃(wk , ηk ) is the kth noisy observation, and ak is
a positive coefficient. As can be seen, the RM algorithm does not require any information
about the function. It only requires the input and output.
[Figure 6.3: An illustrative example of the RM algorithm, showing the estimated root wk and the observation noise over 50 iterations.]
To illustrate the RM algorithm, consider an example in which $g(w) = w^3 - 5$. The true root is $5^{1/3} \approx 1.71$. Now, suppose that we can only observe the input w and the output g̃(w) = g(w) + η, where η is i.i.d. and obeys a standard normal distribution with a zero mean and a standard deviation of 1. The initial guess is w1 = 0, and the coefficient is ak = 1/k. The evolution process of wk is shown in Figure 6.3. Even though the observation is corrupted by the noise ηk, the estimate wk can still converge to the true root. Note that the initial guess w1 must be properly selected to ensure convergence for the specific function $g(w) = w^3 - 5$. In the following subsection, we present the conditions under which the RM algorithm converges for any initial guess.
6.2.1 Convergence properties
Why can the RM algorithm in (6.5) find the root of g(w) = 0? We next illustrate the
idea with an example and then provide a rigorous convergence analysis.
Consider the example shown in Figure 6.4. In this example, g(w) = tanh(w − 1). The
true root of g(w) = 0 is w∗ = 1. We apply the RM algorithm with w1 = 3 and ak = 1/k.
To better illustrate the reason for convergence, we simply set ηk ≡ 0, and consequently,
g̃(wk , ηk ) = g(wk ). The RM algorithm in this case is wk+1 = wk − ak g(wk ). The resulting
{wk } generated by the RM algorithm is shown in Figure 6.4. It can be seen that wk
converges to the true root w∗ = 1.
This simple example can illustrate why the RM algorithm converges.

- When wk > w∗, we have g(wk) > 0. Then, wk+1 = wk − ak g(wk) < wk. If ak g(wk) is sufficiently small, we have w∗ < wk+1 < wk. As a result, wk+1 is closer to w∗ than wk.
- When wk < w∗, we have g(wk) < 0. Then, wk+1 = wk − ak g(wk) > wk. If |ak g(wk)| is sufficiently small, we have w∗ > wk+1 > wk. As a result, wk+1 is closer to w∗ than wk.
In either case, wk+1 is closer to w∗ than wk. Therefore, it is intuitive that wk converges to w∗.
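The noise-free iteration discussed above can be reproduced with a few lines of Python (a minimal sketch of (6.5) with ηk ≡ 0; the number of iterations is an arbitrary choice):

```python
import math

w = 3.0                      # initial guess w_1
for k in range(1, 1001):
    a_k = 1.0 / k            # step size a_k = 1/k
    g_w = math.tanh(w - 1)   # exact observation, since eta_k = 0 here
    w = w - a_k * g_w        # RM update (6.5)

print(w)  # close to the true root w* = 1
```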
[Figure 6.4: An example for illustrating the convergence of the RM algorithm, where g(w) = tanh(w − 1) and the iterates w1, w2, w3, w4, . . . approach the true root w∗ = 1.]
The above example is simple since the observation error is assumed to be zero. It
would be nontrivial to analyze the convergence in the presence of stochastic observation
errors. A rigorous convergence result is given below.
Theorem 6.1 (Robbins-Monro theorem). In the Robbins-Monro algorithm in (6.5), if

(a) $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all w;
(b) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
(c) $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$;

where $\mathcal{H}_k = \{w_k, w_{k-1}, \ldots\}$, then wk almost surely converges to the root w∗ satisfying g(w∗) = 0.

We postpone the proof of this theorem to Section 6.3.3. This theorem relies on the
notion of almost sure convergence, which is introduced in Appendix B.
The three conditions in Theorem 6.1 are explained as follows.
- In the first condition, $0 < c_1 \leq \nabla_w g(w)$ indicates that g(w) is a monotonically increasing function. This condition ensures that the root of g(w) = 0 exists and is unique. If g(w) is monotonically decreasing, we can simply treat −g(w) as a new function that is monotonically increasing.
  As an application, we can formulate an optimization problem in which the objective function is J(w) as a root-finding problem: $g(w) \doteq \nabla_w J(w) = 0$. In this case, the condition that g(w) is monotonically increasing indicates that J(w) is convex, which is a commonly adopted assumption in optimization problems.
  The inequality $\nabla_w g(w) \leq c_2$ indicates that the gradient of g(w) is bounded from above. For example, g(w) = tanh(w − 1) satisfies this condition, but $g(w) = w^3 - 5$ does not.
- The second condition about {ak} is interesting. We often see conditions like this in reinforcement learning algorithms. In particular, the condition $\sum_{k=1}^{\infty} a_k^2 < \infty$ means that $\lim_{n\to\infty}\sum_{k=1}^{n} a_k^2$ is bounded from above. It requires that ak converge to zero as k → ∞. The condition $\sum_{k=1}^{\infty} a_k = \infty$ means that $\lim_{n\to\infty}\sum_{k=1}^{n} a_k$ is infinitely large. It requires that ak not converge to zero too fast. These conditions have interesting properties, which will be analyzed in detail shortly.
- The third condition is mild. It does not require the observation error ηk to be Gaussian. An important special case is that {ηk} is an i.i.d. stochastic sequence satisfying $\mathbb{E}[\eta_k] = 0$ and $\mathbb{E}[\eta_k^2] < \infty$. In this case, the third condition is valid because ηk is independent of $\mathcal{H}_k$ and hence $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = \mathbb{E}[\eta_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] = \mathbb{E}[\eta_k^2]$.

We next examine the second condition about the coefficients {ak} more closely.
- Why is the second condition important for the convergence of the RM algorithm? This question will naturally be answered when we present a rigorous proof of the above theorem later. Here, we would like to provide some insightful intuition.
  First, $\sum_{k=1}^{\infty} a_k^2 < \infty$ indicates that $a_k \to 0$ as $k \to \infty$. Why is this condition important? Suppose that the observation $\tilde{g}(w_k, \eta_k)$ is always bounded. Since

  $$w_{k+1} - w_k = -a_k \tilde{g}(w_k, \eta_k),$$

  if $a_k \to 0$, then $a_k \tilde{g}(w_k, \eta_k) \to 0$ and hence $w_{k+1} - w_k \to 0$, indicating that wk+1 and wk approach each other as k → ∞. Otherwise, if ak does not converge to zero, then wk may still fluctuate when k → ∞.
  Second, $\sum_{k=1}^{\infty} a_k = \infty$ indicates that ak should not converge to zero too fast. Why is this condition important? Summing both sides of the equations $w_2 - w_1 = -a_1\tilde{g}(w_1,\eta_1)$, $w_3 - w_2 = -a_2\tilde{g}(w_2,\eta_2)$, $w_4 - w_3 = -a_3\tilde{g}(w_3,\eta_3), \ldots$ gives

  $$w_1 - w_\infty = \sum_{k=1}^{\infty} a_k \tilde{g}(w_k, \eta_k).$$

  If $\sum_{k=1}^{\infty} a_k < \infty$, then $\left|\sum_{k=1}^{\infty} a_k \tilde{g}(w_k, \eta_k)\right|$ is also bounded. Let b denote the finite upper bound such that

  $$|w_1 - w_\infty| = \left|\sum_{k=1}^{\infty} a_k \tilde{g}(w_k, \eta_k)\right| \leq b. \qquad (6.6)$$

  If the initial guess w1 is selected far away from w∗ so that |w1 − w∗| > b, then it is impossible to have w∞ = w∗ according to (6.6). This suggests that the RM algorithm cannot find the true solution w∗ in this case. Therefore, the condition $\sum_{k=1}^{\infty} a_k = \infty$ is necessary to ensure convergence given an arbitrary initial guess.
- What kinds of sequences satisfy $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$?
  One typical sequence is

  $$a_k = \frac{1}{k}.$$

  On the one hand, it holds that

  $$\lim_{n\to\infty}\left(\sum_{k=1}^{n} \frac{1}{k} - \ln n\right) = \kappa,$$

  where κ ≈ 0.577 is called the Euler-Mascheroni constant (or Euler's constant) [28]. Since ln n → ∞ as n → ∞, we have

  $$\sum_{k=1}^{\infty} \frac{1}{k} = \infty.$$

  In fact, $H_n = \sum_{k=1}^{n} \frac{1}{k}$ is called the harmonic number in number theory [29]. On the other hand, it holds that

  $$\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty.$$

  Finding the value of $\sum_{k=1}^{\infty} \frac{1}{k^2}$ is known as the Basel problem [30].
  In summary, the sequence {ak = 1/k} satisfies the second condition in Theorem 6.1. Notably, a slight modification, such as ak = 1/(k + 1) or ak = ck/k where ck is bounded, also preserves this condition.
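These two facts can be checked numerically by computing partial sums (a standalone sketch; the truncation point n is an arbitrary choice):

```python
import math

n = 10**6
harmonic = sum(1.0 / k for k in range(1, n + 1))   # grows without bound
basel = sum(1.0 / k**2 for k in range(1, n + 1))   # converges to pi^2/6

print(harmonic - math.log(n))   # approaches the Euler-Mascheroni constant 0.577...
print(basel, math.pi**2 / 6)    # about 1.64493..., already close at this truncation
```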
In the RM algorithm, ak is often selected as a sufficiently small constant in many applications. Although the second condition is no longer satisfied in this case because $\sum_{k=1}^{\infty} a_k^2 = \infty$ rather than $\sum_{k=1}^{\infty} a_k^2 < \infty$, the algorithm can still converge in a certain sense [24, Section 1.5]. In addition, $g(w) = w^3 - 5$ in the example shown in Figure 6.3 does not satisfy the first condition, but the RM algorithm can still find the root if the initial guess is adequately (not arbitrarily) selected.
6.2.2 Application to mean estimation

We next apply the Robbins-Monro theorem to analyze the mean estimation problem, which has been discussed in Section 6.1. Recall that

$$w_{k+1} = w_k + \alpha_k(x_k - w_k)$$

is the mean estimation algorithm in (6.4). When αk = 1/k, we can obtain the analytical expression of wk+1 as $w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i$. However, we would not be able to obtain an analytical expression when given general values of αk. In this case, the convergence analysis is nontrivial. We can show that the algorithm in this case is a special RM
algorithm and hence its convergence naturally follows.
In particular, define a function as

$$g(w) \doteq w - \mathbb{E}[X].$$

The original problem is to obtain the value of E[X]. This problem is formulated as a root-finding problem to solve g(w) = 0. Given a value of w, the noisy observation that we can obtain is $\tilde{g} \doteq w - x$, where x is a sample of X. Note that g̃ can be written as

$$\tilde{g}(w, \eta) = w - x = w - x + \mathbb{E}[X] - \mathbb{E}[X] = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta,$$

where $\eta \doteq \mathbb{E}[X] - x$.

The RM algorithm for solving this problem is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k(w_k - x_k),$$

which is exactly the algorithm in (6.4). As a result, it is guaranteed by Theorem 6.1 that wk converges to E[X] almost surely if $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and {xk} is i.i.d. It is worth mentioning that the convergence property does not rely on any assumption regarding the distribution of X.
6.3 Dvoretzky's convergence theorem
Until now, the convergence of the RM algorithm has not yet been proven. To do that,
we next introduce Dvoretzky’s theorem [31, 32], which is a classic result in the field
of stochastic approximation. This theorem can be used to analyze the convergence
of the RM algorithm and many reinforcement learning algorithms.
This section is slightly mathematically intensive. Readers who are interested in
the convergence analyses of stochastic algorithms are recommended to study this
section. Otherwise, this section can be skipped.
Theorem 6.2 (Dvoretzky's theorem). Consider a stochastic process

$$\Delta_{k+1} = (1 - \alpha_k)\Delta_k + \beta_k \eta_k,$$

where $\{\alpha_k\}_{k=1}^{\infty}$, $\{\beta_k\}_{k=1}^{\infty}$, and $\{\eta_k\}_{k=1}^{\infty}$ are stochastic sequences. Here, $\alpha_k \geq 0$ and $\beta_k \geq 0$ for all k. Then, Δk converges to zero almost surely if the following conditions are satisfied:

(a) $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and $\sum_{k=1}^{\infty} \beta_k^2 < \infty$ uniformly almost surely;
(b) $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] \leq C$ almost surely;

where $\mathcal{H}_k = \{\Delta_k, \Delta_{k-1}, \ldots, \eta_{k-1}, \ldots, \alpha_{k-1}, \ldots, \beta_{k-1}, \ldots\}$.
Before presenting the proof of this theorem, we first clarify some issues.
- In the RM algorithm, the coefficient sequence {αk} is deterministic. However, Dvoretzky's theorem allows {αk} and {βk} to be random variables that depend on $\mathcal{H}_k$. Thus, it is more useful in cases where αk or βk is a function of Δk.
- In the first condition, it is stated as “uniformly almost surely”. This is because αk and βk may be random variables and hence the definition of their limits must be in the stochastic sense. In the second condition, it is also stated as “almost surely”. This is because $\mathcal{H}_k$ is a sequence of random variables rather than specific values. As a result, $\mathbb{E}[\eta_k \mid \mathcal{H}_k]$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k]$ are random variables. The definition of the conditional expectation in this case is in the “almost sure” sense (Appendix B).
- The statement of Theorem 6.2 is slightly different from that in [32] in the sense that Theorem 6.2 does not require $\sum_{k=1}^{\infty} \beta_k = \infty$ in the first condition. When $\sum_{k=1}^{\infty} \beta_k < \infty$, especially in the extreme case where βk = 0 for all k, the sequence can still converge.
6.3.1 Proof of Dvoretzky's theorem
The original proof of Dvoretzky’s theorem was given in 1956 [31]. There are also other
proofs. We next present a proof based on quasimartingales. With the convergence
results of quasimartingales, the proof of Dvoretzky’s theorem is straightforward. More
information about quasimartingales can be found in Appendix C.
Proof of Dvoretzky's theorem. Let $h_k \doteq \Delta_k^2$. Then,

$$\begin{aligned}
h_{k+1} - h_k &= \Delta_{k+1}^2 - \Delta_k^2 \\
&= (\Delta_{k+1} - \Delta_k)(\Delta_{k+1} + \Delta_k) \\
&= (-\alpha_k \Delta_k + \beta_k \eta_k)\big[(2 - \alpha_k)\Delta_k + \beta_k \eta_k\big] \\
&= -\alpha_k(2 - \alpha_k)\Delta_k^2 + \beta_k^2 \eta_k^2 + 2(1 - \alpha_k)\beta_k \eta_k \Delta_k.
\end{aligned}$$

Taking expectations on both sides of the above equation yields

$$\mathbb{E}[h_{k+1} - h_k \mid \mathcal{H}_k] = \mathbb{E}[-\alpha_k(2 - \alpha_k)\Delta_k^2 \mid \mathcal{H}_k] + \mathbb{E}[\beta_k^2 \eta_k^2 \mid \mathcal{H}_k] + \mathbb{E}[2(1 - \alpha_k)\beta_k \eta_k \Delta_k \mid \mathcal{H}_k]. \qquad (6.7)$$
First, since ∆k is included and hence determined by Hk , it can be taken out from
the expectation (see property (e) in Lemma B.1). Second, consider the simple case
where αk and βk are determined by Hk . This case is valid when, for example, {αk } and
{βk } are functions of ∆k or deterministic sequences. Then, they can also be taken
out of the expectation. Therefore, (6.7) becomes
$$\mathbb{E}[h_{k+1} - h_k \mid \mathcal{H}_k] = -\alpha_k(2 - \alpha_k)\Delta_k^2 + \beta_k^2\,\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] + 2(1 - \alpha_k)\beta_k \Delta_k\,\mathbb{E}[\eta_k \mid \mathcal{H}_k]. \qquad (6.8)$$
For the first term, note that $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$ implies $\alpha_k \to 0$ almost surely. As a result, there exists a finite n such that $\alpha_k \leq 1$ almost surely for all $k \geq n$. Without loss of generality, we can simply consider the case of $k \geq n$ and hence $\alpha_k \leq 1$ almost surely. Then, $-\alpha_k(2 - \alpha_k)\Delta_k^2 \leq 0$. For the second term, we have $\beta_k^2\,\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] \leq \beta_k^2 C$ as assumed. The third term equals zero because $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ as assumed. Therefore, (6.8) becomes

$$\mathbb{E}[h_{k+1} - h_k \mid \mathcal{H}_k] = -\alpha_k(2 - \alpha_k)\Delta_k^2 + \beta_k^2\,\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] \leq \beta_k^2 C, \qquad (6.9)$$
and hence,

$$\sum_{k=1}^{\infty} \mathbb{E}[h_{k+1} - h_k \mid \mathcal{H}_k] \leq \sum_{k=1}^{\infty} \beta_k^2 C < \infty.$$

The last inequality is due to the condition $\sum_{k=1}^{\infty} \beta_k^2 < \infty$. Then, based on the quasimartingale convergence theorem in Appendix C, we conclude that hk converges almost surely.
While we now know that hk is convergent and so is ∆k , we next determine what
value ∆k converges to. It follows from (6.9) that

X ∞
X ∞
X
αk (2 − αk )∆2k = βk2 E[ηk2 |Hk ] − E[hk+1 − hk |Hk ].
k=1 k=1 k=1

The first term on the right-hand side is bounded as assumed. The second term is also bounded because $h_k$ converges and hence $h_{k+1}-h_k$ is summable. Thus, $\sum_{k=1}^{\infty}\alpha_k(2-\alpha_k)\Delta_k^2$ on the left-hand side is also bounded. Since we consider the case of $\alpha_k \le 1$, we have
$$\infty > \sum_{k=1}^{\infty}\alpha_k(2-\alpha_k)\Delta_k^2 \ge \sum_{k=1}^{\infty}\alpha_k\Delta_k^2 \ge 0.$$

Therefore, $\sum_{k=1}^{\infty}\alpha_k\Delta_k^2$ is bounded. Since $\sum_{k=1}^{\infty}\alpha_k = \infty$, we must have $\Delta_k \to 0$ almost surely.


6.3.2 Application to mean estimation


While the mean estimation algorithm, wk+1 = wk + αk (xk − wk ), has been analyzed
using the RM theorem, we next show that its convergence can also be directly proven
by Dvoretzky’s theorem.
Proof. Let $w^* = \mathbb{E}[X]$. The mean estimation algorithm $w_{k+1} = w_k + \alpha_k(x_k - w_k)$ can be rewritten as
$$w_{k+1} - w^* = w_k - w^* + \alpha_k(x_k - w^* + w^* - w_k).$$
Let $\Delta_k \doteq w_k - w^*$. Then, we have
$$\Delta_{k+1} = \Delta_k + \alpha_k(x_k - w^* - \Delta_k) = (1-\alpha_k)\Delta_k + \alpha_k\underbrace{(x_k - w^*)}_{\eta_k}.$$

Since {xk } is i.i.d., we have E[xk |Hk ] = E[xk ] = w∗ . As a result, E[ηk |Hk ] = E[xk −
w∗ |Hk ] = 0 and E[ηk2 |Hk ] = E[x2k |Hk ] − (w∗ )2 = E[x2k ] − (w∗ )2 are bounded if the
variance of xk is finite. Following Dvoretzky’s theorem, we conclude that ∆k converges
to zero and hence wk converges to w∗ = E[X] almost surely.
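The convergence just proven is easy to observe numerically. Below is a minimal Python sketch, where the distribution of X, the random seed, and the variable names are merely illustrative assumptions; it runs the mean estimation algorithm with $\alpha_k = 1/k$, which satisfies the step-size conditions, and tracks the error $\Delta_k = w_k - \mathbb{E}[X]$.

```python
import numpy as np

rng = np.random.default_rng(0)

true_mean = 5.0
w = 0.0                                # initial guess w_1
errors = []

for k in range(1, 10001):
    x_k = rng.normal(true_mean, 2.0)   # i.i.d. sample of X with finite variance
    alpha_k = 1.0 / k                  # satisfies sum(alpha_k) = inf, sum(alpha_k^2) < inf
    w = w + alpha_k * (x_k - w)        # mean estimation update
    errors.append(abs(w - true_mean))  # |Delta_k| = |w_k - E[X]|

print(f"final estimate: {w:.4f}, final |Delta_k|: {errors[-1]:.4f}")
```

With this particular step size, $w_{k+1}$ is exactly the running average of the first $k$ samples, so the error decays at the usual statistical rate of averaging.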

6.3.3 Application to the Robbins-Monro theorem


We are now ready to prove the Robbins-Monro theorem using Dvoretzky’s theorem.
Proof of the Robbins-Monro theorem. The RM algorithm aims to find the root of
g(w) = 0. Suppose that the root is w∗ such that g(w∗ ) = 0. The RM algorithm is

wk+1 = wk − ak g̃(wk , ηk )
= wk − ak [g(wk ) + ηk ].

Then, we have

wk+1 − w∗ = wk − w∗ − ak [g(wk ) − g(w∗ ) + ηk ].

Due to the mean value theorem [7, 8], we have g(wk ) − g(w∗ ) = ∇w g(wk0 )(wk − w∗ ),


where $w_k' \in [w_k, w^*]$. Let $\Delta_k \doteq w_k - w^*$. The above equation becomes

$$
\begin{aligned}
\Delta_{k+1} &= \Delta_k - a_k\big[\nabla_w g(w_k')(w_k - w^*) + \eta_k\big] \\
&= \Delta_k - a_k\nabla_w g(w_k')\Delta_k + a_k(-\eta_k) \\
&= \big[1 - \underbrace{a_k\nabla_w g(w_k')}_{\alpha_k}\big]\Delta_k + a_k(-\eta_k).
\end{aligned}
$$

Note that $\nabla_w g(w)$ is bounded as $0 < c_1 \le \nabla_w g(w) \le c_2$ as assumed. Since $\sum_{k=1}^{\infty}a_k = \infty$ and $\sum_{k=1}^{\infty}a_k^2 < \infty$ as assumed, we know that $\sum_{k=1}^{\infty}\alpha_k = \infty$ and $\sum_{k=1}^{\infty}\alpha_k^2 < \infty$. Thus, all the conditions in Dvoretzky's theorem are satisfied and hence $\Delta_k$ converges to zero almost surely.
The proof of the RM theorem demonstrates the power of Dvoretzky’s theorem.
In particular, αk in the proof is a stochastic sequence depending on wk rather than a
deterministic sequence. In this case, Dvoretzky’s theorem is still applicable.
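To illustrate the setting analyzed above, consider the following minimal sketch, in which the function $g$, the noise level, and the step sizes are illustrative assumptions. It applies the RM update $w_{k+1} = w_k - a_k\,\tilde g(w_k, \eta_k)$ to a function with a bounded positive derivative, using only noisy measurements.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(w):
    # a function with constant positive derivative (c1 = c2 = 2) and root w* = 1
    return 2.0 * (w - 1.0)

w = 5.0                                     # initial guess
for k in range(1, 5001):
    a_k = 1.0 / k                           # sum(a_k) = inf, sum(a_k^2) < inf
    g_tilde = g(w) + rng.normal(0.0, 1.0)   # noisy observation g~ = g(w_k) + eta_k
    w = w - a_k * g_tilde                   # RM update

print(f"estimated root: {w:.4f} (true root: 1.0)")
```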

6.3.4 An extension of Dvoretzky’s theorem


We next extend Dvoretzky’s theorem to a more general theorem that can handle
multiple variables. This general theorem, proposed by [32], can be used to analyze
the convergence of stochastic iterative algorithms such as Q-learning.

Theorem 6.3. Consider a finite set S of real numbers. For the stochastic process

∆k+1 (s) = (1 − αk (s))∆k (s) + βk (s)ηk (s),

it holds that ∆k (s) converges to zero almost surely for every s ∈ S if the following
conditions are satisfied for s ∈ S:
(a) $\sum_k \alpha_k(s) = \infty$, $\sum_k \alpha_k^2(s) < \infty$, $\sum_k \beta_k^2(s) < \infty$, and $\mathbb{E}[\beta_k(s)|\mathcal{H}_k] \le \mathbb{E}[\alpha_k(s)|\mathcal{H}_k]$ uniformly almost surely;
(b) $\|\mathbb{E}[\eta_k(s)|\mathcal{H}_k]\|_\infty \le \gamma\|\Delta_k\|_\infty$, where $\gamma \in (0,1)$;
(c) $\mathrm{var}[\eta_k(s)|\mathcal{H}_k] \le C(1 + \|\Delta_k(s)\|_\infty)^2$, where $C$ is a constant.
Here, $\mathcal{H}_k = \{\Delta_k, \Delta_{k-1}, \dots, \eta_{k-1}, \dots, \alpha_{k-1}, \dots, \beta_{k-1}, \dots\}$ represents the historical information. The term $\|\cdot\|_\infty$ refers to the maximum norm.

Proof. As an extension, this theorem can be proven based on Dvoretzky’s theorem.


Details can be found in [32] and are omitted here.
Some remarks about this theorem are given below.


 We first clarify some notations in the theorem. The variable x can be viewed
as an index. In the context of reinforcement learning, it indicates a state or a
state-action pair. The maximum norm $\|\cdot\|_\infty$ is defined over a set. It is similar to but different from the $L_\infty$ norm of vectors. In particular, $\|\mathbb{E}[\eta_k(s)|\mathcal{H}_k]\|_\infty \doteq \max_{s\in\mathcal{S}}|\mathbb{E}[\eta_k(s)|\mathcal{H}_k]|$ and $\|\Delta_k(s)\|_\infty \doteq \max_{s\in\mathcal{S}}|\Delta_k(s)|$.
 This theorem is more general than Dvoretzky’s theorem. First, it can handle the
case of multiple variables due to the maximum norm operations. This is important
for a reinforcement learning problem where there are multiple states. Second,
while Dvoretzky’s theorem requires E[ηk (s)|Hk ] = 0 and var[ηk (s)|Hk ] ≤ C, this
theorem only requires that the expectation and variance are bounded by the error
∆k .
 It should be noted that the convergence of ∆(s) for all s ∈ S requires that the
conditions are valid for every s ∈ S. Therefore, when applying this theorem to
prove the convergence of reinforcement learning algorithms, we need to show that
the conditions are valid for every state (or state-action pair).

6.4 Stochastic gradient descent


This section introduces stochastic gradient descent (SGD) algorithms, which are widely
used in the field of machine learning. We will see that SGD is a special RM algorithm,
and the mean estimation algorithm is a special SGD algorithm.
Consider the following optimization problem:

$$\min_w J(w) = \mathbb{E}[f(w, X)], \tag{6.10}$$

where $w$ is the parameter to be optimized, and $X$ is a random variable. The expectation is calculated with respect to $X$. Here, $w$ and $X$ can be either scalars or vectors. The function $f(\cdot)$ is scalar-valued.
A straightforward method for solving (6.10) is gradient descent. In particular, the
gradient of E[f (w, X)] is ∇w E[f (w, X)] = E[∇w f (w, X)]. Then, the gradient descent
algorithm is

wk+1 = wk − αk ∇w J(wk ) = wk − αk E[∇w f (wk , X)]. (6.11)

This gradient descent algorithm can find the optimal solution w∗ under some mild con-
ditions such as the convexity of f . Preliminaries about gradient descent algorithms can
be found in Appendix D.
The gradient descent algorithm requires the expected value E[∇w f (wk , X)]. One
way to obtain the expected value is based on the probability distribution of X. The


distribution is, however, often unknown in practice. Another way is to collect a large
number of i.i.d. samples {xi }ni=1 of X so that the expected value can be approximated as
$$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n}\sum_{i=1}^{n}\nabla_w f(w_k, x_i).$$

Then, (6.11) becomes


$$w_{k+1} = w_k - \frac{\alpha_k}{n}\sum_{i=1}^{n}\nabla_w f(w_k, x_i). \tag{6.12}$$

One problem of the algorithm in (6.12) is that it requires all the samples in each iteration.
In practice, if the samples are collected one by one, then it is favorable to update w every
time a sample is collected. To that end, we can use the following algorithm:

wk+1 = wk − αk ∇w f (wk , xk ), (6.13)

where xk is the sample collected at time step k. This is the well-known stochastic gradient
descent algorithm. This algorithm is called “stochastic” because it relies on stochastic
samples {xk }.
Compared to the gradient descent algorithm in (6.11), SGD replaces the true gra-
dient E[∇w f (w, X)] with the stochastic gradient ∇w f (wk , xk ). Since ∇w f (wk , xk ) 6=
E[∇w f (w, X)], can such a replacement still ensure wk → w∗ as k → ∞? The answer
is yes. We next present an intuitive explanation and postpone the rigorous proof of the
convergence to Section 6.4.5.
In particular, since
 
$$\nabla_w f(w_k, x_k) = \mathbb{E}[\nabla_w f(w, X)] + \Big(\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w, X)]\Big) \doteq \mathbb{E}[\nabla_w f(w, X)] + \eta_k,$$

the SGD algorithm in (6.13) can be rewritten as

wk+1 = wk − αk E[∇w f (w, X)] − αk ηk .

Therefore, the SGD algorithm is the same as the regular gradient descent algorithm except
that it has a perturbation term αk ηk . Since {xk } is i.i.d., we have Exk [∇w f (wk , xk )] =
EX [∇w f (w, X)]. As a result,
$$\mathbb{E}[\eta_k] = \mathbb{E}\Big[\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w, X)]\Big] = \mathbb{E}_{x_k}[\nabla_w f(w_k, x_k)] - \mathbb{E}_X[\nabla_w f(w, X)] = 0.$$

Therefore, the perturbation term ηk has a zero mean, which intuitively suggests that it
may not jeopardize the convergence property. A rigorous proof of the convergence of


SGD is given in Section 6.4.5.

6.4.1 Application to mean estimation


We next apply SGD to analyze the mean estimation problem and show that the mean
estimation algorithm in (6.4) is a special SGD algorithm. To that end, we formulate the
mean estimation problem as an optimization problem:
 
$$\min_w J(w) = \mathbb{E}\left[\frac{1}{2}\|w - X\|^2\right] \doteq \mathbb{E}[f(w, X)], \tag{6.14}$$

where $f(w, X) = \|w - X\|^2/2$ and the gradient is $\nabla_w f(w, X) = w - X$. It can be verified that the optimal solution is $w^* = \mathbb{E}[X]$ by solving $\nabla_w J(w) = 0$. Therefore, this
optimization problem is equivalent to the mean estimation problem.

 The gradient descent algorithm for solving (6.14) is

wk+1 = wk − αk ∇w J(wk )
= wk − αk E[∇w f (wk , X)]
= wk − αk E[wk − X].

This gradient descent algorithm is not applicable since E[wk − X] or E[X] on the
right-hand side is unknown (in fact, it is what we need to solve).
 The SGD algorithm for solving (6.14) is

wk+1 = wk − αk ∇w f (wk , xk ) = wk − αk (wk − xk ),

where xk is a sample obtained at time step k. Notably, this SGD algorithm is the
same as the iterative mean estimation algorithm in (6.4). Therefore, (6.4) is an SGD
algorithm designed specifically for solving the mean estimation problem.
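This correspondence can be checked numerically. The sketch below is illustrative only, with an assumed sample distribution and seed: it runs the SGD iteration $w_{k+1} = w_k - \alpha_k(w_k - x_k)$ with $\alpha_k = 1/k$ and compares the result with the plain sample average.

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.uniform(-10, 10, size=2000)   # i.i.d. samples of X with E[X] = 0

w = 0.0
for k, x_k in enumerate(samples, start=1):
    alpha_k = 1.0 / k
    w = w - alpha_k * (w - x_k)             # SGD step for f(w, x) = |w - x|^2 / 2

print(f"SGD estimate   : {w:.6f}")
print(f"sample average : {samples.mean():.6f}")   # coincides with SGD when alpha_k = 1/k
```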

6.4.2 Convergence pattern of SGD


The idea of the SGD algorithm is to replace the true gradient with a stochastic gradient.
However, since the stochastic gradient is random, one may ask whether the convergence
speed of SGD is slow or random. Fortunately, SGD can converge efficiently in general.
An interesting convergence pattern is that it behaves similarly to the regular gradient
descent algorithm when the estimate wk is far from the optimal solution w∗ . Only when
wk is close to w∗ , does the convergence of SGD exhibit more randomness.
An analysis of this pattern and an illustrative example are given below.


 Analysis: The relative error between the stochastic and true gradients is

$$\delta_k \doteq \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)]|}.$$

For the sake of simplicity, we consider the case where w and ∇w f (w, x) are both
scalars. Since w∗ is the optimal solution, it holds that E[∇w f (w∗ , X)] = 0. Then, the
relative error can be rewritten as
$$\delta_k = \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)] - \mathbb{E}[\nabla_w f(w^*, X)]|} = \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w^2 f(\tilde w_k, X)(w_k - w^*)]|}, \tag{6.15}$$

where the last equality is due to the mean value theorem [7, 8] and w̃k ∈ [wk , w∗ ].
Suppose that f is strictly convex such that ∇2w f ≥ c > 0 for all w, X. Then, the
denominator in (6.15) becomes

$$\big|\mathbb{E}[\nabla_w^2 f(\tilde w_k, X)(w_k - w^*)]\big| = \big|\mathbb{E}[\nabla_w^2 f(\tilde w_k, X)]\big|\,|w_k - w^*| \ge c|w_k - w^*|.$$

Substituting the above inequality into (6.15) yields

$$\delta_k \le \frac{\Big|\overbrace{\nabla_w f(w_k, x_k)}^{\text{stochastic gradient}} - \overbrace{\mathbb{E}[\nabla_w f(w_k, X)]}^{\text{true gradient}}\Big|}{\underbrace{c|w_k - w^*|}_{\text{distance to the optimal solution}}}.$$

The above inequality suggests an interesting convergence pattern of SGD: the relative
error δk is inversely proportional to |wk − w∗ |. As a result, when |wk − w∗ | is large, δk
is small. In this case, the SGD algorithm behaves like the gradient descent algorithm
and hence wk quickly converges to w∗ . When wk is close to w∗ , the relative error δk
may be large, and the convergence exhibits more randomness.
 Example: A good example for demonstrating the above analysis is the mean estima-
tion problem. Consider the mean estimation problem in (6.14). When w and X are
both scalar, we have f (w, X) = |w − X|2 /2 and hence

∇w f (w, xk ) = w − xk ,
E[∇w f (w, xk )] = w − E[X] = w − w∗ .

Thus, the relative error is

$$\delta_k = \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)]|} = \frac{|(w_k - x_k) - (w_k - \mathbb{E}[X])|}{|w_k - w^*|} = \frac{|\mathbb{E}[X] - x_k|}{|w_k - w^*|}.$$

The expression of the relative error clearly shows that δk is inversely proportional to


[Figure 6.5 contains two panels: the left panel shows the samples of X in the x-y plane together with the mean and the estimates produced by SGD (m = 1) and MBGD (m = 5, 50); the right panel plots the distance to the mean against the iteration step for the same algorithms.]
Figure 6.5: An example for demonstrating stochastic and mini-batch gradient descent algorithms. The distribution of X ∈ R2 is uniform in the square area centered at the origin with a side length of 20. The mean is E[X] = 0. The mean estimation is based on 100 i.i.d. samples.

|wk − w∗ |. As a result, when wk is far from w∗ , the relative error is small, and SGD
behaves like gradient descent. In addition, since δk is proportional to |E[X] − xk |, the
mean of δk is proportional to the variance of X.
The simulation results are shown in Figure 6.5. Here, X ∈ R2 represents a random
position in the plane. Its distribution is uniform in the square area centered at the
origin and E[X] = 0. The mean estimation is based on 100 i.i.d. samples. Although
the initial guess of the mean is far away from the true value, it can be seen that the
SGD estimate quickly approaches the neighborhood of the origin. When the estimate
is close to the origin, the convergence process exhibits certain randomness.
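This pattern can also be seen directly in a few lines of code. The sketch below is illustrative only, with an assumed distribution, initial guess, and seed: it runs scalar SGD mean estimation and prints both the distance $|w_k - w^*|$ and the relative error $\delta_k$, which is small while the estimate is far from $w^*$ and grows once the estimate gets close.

```python
import numpy as np

rng = np.random.default_rng(3)
w_star = 0.0                        # E[X] for X ~ Uniform(-10, 10)

w = 50.0                            # start far from the optimal solution
for k in range(1, 31):
    x_k = rng.uniform(-10, 10)
    delta_k = abs(w_star - x_k) / abs(w - w_star)   # relative error for this example
    w = w - (1.0 / k) * (w - x_k)                   # SGD step
    if k % 5 == 0:
        print(f"k={k:2d}  |w_k - w*|={abs(w - w_star):8.4f}  delta_k={delta_k:8.4f}")
```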

6.4.3 A deterministic formulation of SGD


The formulation of SGD in (6.13) involves random variables. One may often encounter
a deterministic formulation of SGD without involving any random variables.
In particular, consider a set of real numbers {xi }ni=1 , where xi does not have to be a
sample of any random variable. The optimization problem to be solved is to minimize
the average:
$$\min_w J(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i),$$

where $f(w, x_i)$ is a parameterized function, and $w$ is the parameter to be optimized. The gradient descent algorithm for solving this problem is
$$w_{k+1} = w_k - \alpha_k\nabla_w J(w_k) = w_k - \alpha_k\frac{1}{n}\sum_{i=1}^{n}\nabla_w f(w_k, x_i).$$

Suppose that the set {xi }ni=1 is large and we can only fetch a single number each time.


In this case, it is favorable to update wk in an incremental manner:

wk+1 = wk − αk ∇w f (wk , xk ). (6.16)

It must be noted that xk here is the number fetched at time step k instead of the kth
element in the set {xi }ni=1 .
The algorithm in (6.16) is very similar to SGD, but its problem formulation is subtly
different because it does not involve any random variables or expected values. Then,
many questions arise. For example, is this algorithm SGD? How should we use the finite
set of numbers {xi }ni=1 ? Should we sort these numbers in a certain order and then use
them one by one, or should we randomly sample a number from the set?
A quick answer to the above questions is that, although no random variables are
involved in the above formulation, we can convert the deterministic formulation to the
stochastic formulation by introducing a random variable. In particular, let X be a random
variable defined on the set {xi }ni=1 . Suppose that its probability distribution is uniform
such that p(X = xi ) = 1/n. Then, the deterministic optimization problem becomes a
stochastic one:
$$\min_w J(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i) = \mathbb{E}[f(w, X)].$$
The last equality in the above equation is strict instead of approximate. Therefore, the
algorithm in (6.16) is SGD, and the estimate converges if xk is uniformly and indepen-
dently sampled from {xi }ni=1 . Note that xk may repeatedly take the same number in
{xi }ni=1 since it is sampled randomly.
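A minimal sketch of this conversion is given below; the data set and loss are illustrative assumptions. The finite set $\{x_i\}$ is treated as the support of a uniform random variable, and each SGD step draws one element uniformly with replacement.

```python
import numpy as np

rng = np.random.default_rng(4)
data = np.array([2.0, 4.0, 6.0, 8.0])   # the finite set {x_i}, not necessarily random samples

w = 0.0
for k in range(1, 5001):
    x_k = rng.choice(data)              # uniform sampling with replacement, p(X = x_i) = 1/n
    w = w - (1.0 / k) * (w - x_k)       # SGD step for f(w, x) = |w - x|^2 / 2

print(f"SGD estimate: {w:.4f}, average of the set: {data.mean():.4f}")
```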

6.4.4 BGD, SGD, and mini-batch GD


While SGD uses a single sample in every iteration, we next introduce mini-batch gradient
descent (MBGD), which uses a few more samples in every iteration. When all samples
are used in every iteration, the algorithm is called batch gradient descent (BGD).
In particular, suppose that we would like to find the optimal solution that can min-
imize J(w) = E[f (w, X)] given a set of random samples {xi }ni=1 of X. The BGD, SGD,
and MBGD algorithms for solving this problem are, respectively,
$$
\begin{aligned}
w_{k+1} &= w_k - \alpha_k\frac{1}{n}\sum_{i=1}^{n}\nabla_w f(w_k, x_i), &\text{(BGD)}\\
w_{k+1} &= w_k - \alpha_k\frac{1}{m}\sum_{j\in\mathcal{I}_k}\nabla_w f(w_k, x_j), &\text{(MBGD)}\\
w_{k+1} &= w_k - \alpha_k\nabla_w f(w_k, x_k). &\text{(SGD)}
\end{aligned}
$$

In the BGD algorithm, all the samples are used in every iteration. When n is large,
$(1/n)\sum_{i=1}^{n}\nabla_w f(w_k, x_i)$ is close to the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$. In the MBGD al-


gorithm, Ik is a subset of {1, . . . , n} obtained at time k. The size of the set is |Ik | = m.
The samples in Ik are also assumed to be i.i.d. In the SGD algorithm, xk is randomly
sampled from {xi }ni=1 at time k.
MBGD can be viewed as an intermediate version between SGD and BGD. Compared
to SGD, MBGD has less randomness because it uses more samples instead of just one
as in SGD. Compared to BGD, MBGD does not require using all the samples in every
iteration, making it more flexible. If m = 1, then MBGD becomes SGD. However, if
m = n, MBGD may not become BGD. This is because MBGD uses n randomly fetched
samples, whereas BGD uses all n numbers. These n randomly fetched samples may
contain the same number multiple times and hence may not cover all n numbers in
{xi }ni=1 .
The convergence speed of MBGD is faster than that of SGD in general. This is
because SGD uses $\nabla_w f(w_k, x_k)$ to approximate the true gradient, whereas MBGD uses $(1/m)\sum_{j\in\mathcal{I}_k}\nabla_w f(w_k, x_j)$, which is closer to the true gradient because the randomness is
averaged out. The convergence of the MBGD algorithm can be proven similarly to the
SGD case.
A good example for demonstrating the above analysis is the mean estimation problem. In particular, given some numbers $\{x_i\}_{i=1}^n$, our goal is to calculate the mean $\bar{x} = \sum_{i=1}^{n} x_i/n$. This problem can be equivalently stated as the following optimization problem:
$$\min_w J(w) = \frac{1}{2n}\sum_{i=1}^{n}\|w - x_i\|^2,$$
whose optimal solution is w∗ = x̄. The three algorithms for solving this problem are,
respectively,
$$
\begin{aligned}
w_{k+1} &= w_k - \alpha_k\frac{1}{n}\sum_{i=1}^{n}(w_k - x_i) = w_k - \alpha_k(w_k - \bar{x}), &\text{(BGD)}\\
w_{k+1} &= w_k - \alpha_k\frac{1}{m}\sum_{j\in\mathcal{I}_k}(w_k - x_j) = w_k - \alpha_k\big(w_k - \bar{x}_k^{(m)}\big), &\text{(MBGD)}\\
w_{k+1} &= w_k - \alpha_k(w_k - x_k), &\text{(SGD)}
\end{aligned}
$$
where $\bar{x}_k^{(m)} = \sum_{j\in\mathcal{I}_k} x_j/m$. Furthermore, if $\alpha_k = 1/k$, the above equations can be solved


as follows:
$$
\begin{aligned}
w_{k+1} &= \frac{1}{k}\sum_{j=1}^{k}\bar{x} = \bar{x}, &\text{(BGD)}\\
w_{k+1} &= \frac{1}{k}\sum_{j=1}^{k}\bar{x}_j^{(m)}, &\text{(MBGD)}\\
w_{k+1} &= \frac{1}{k}\sum_{j=1}^{k}x_j. &\text{(SGD)}
\end{aligned}
$$

The derivation of the above equations is similar to that of (6.3) and is omitted here. It
can be seen that the estimate given by BGD at each step is exactly the optimal solution $w^* = \bar{x}$. MBGD converges to the mean faster than SGD because $\bar{x}_k^{(m)}$ is already an average.
A simulation example is given in Figure 6.5 to demonstrate the convergence of MBGD.
Let αk = 1/k. It is shown that all MBGD algorithms with different mini-batch sizes can
converge to the mean. The case with m = 50 converges the fastest, while SGD with m = 1
is the slowest. This is consistent with the above analysis. Nevertheless, the convergence
rate of SGD is still fast, especially when wk is far from w∗ .
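The three updates can be compared in a few lines of code. The following sketch mirrors the setup of Figure 6.5 with $\alpha_k = 1/k$; the sample distribution, mini-batch size, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
samples = rng.uniform(-10, 10, size=(100, 2))   # 100 i.i.d. samples of X in R^2, E[X] = 0
x_bar = samples.mean(axis=0)

w_bgd = w_mbgd = w_sgd = np.array([20.0, -20.0])   # common initial guess
m = 5                                              # mini-batch size

for k in range(1, 31):
    alpha_k = 1.0 / k
    # BGD: use all samples (jumps to the sample mean at k = 1 since alpha_1 = 1)
    w_bgd = w_bgd - alpha_k * (w_bgd - x_bar)
    # MBGD: use a random mini-batch of size m
    batch = samples[rng.integers(0, len(samples), size=m)]
    w_mbgd = w_mbgd - alpha_k * (w_mbgd - batch.mean(axis=0))
    # SGD: use a single random sample
    x_k = samples[rng.integers(0, len(samples))]
    w_sgd = w_sgd - alpha_k * (w_sgd - x_k)

for name, w in [("BGD", w_bgd), ("MBGD", w_mbgd), ("SGD", w_sgd)]:
    print(f"{name:5s} distance to sample mean: {np.linalg.norm(w - x_bar):.4f}")
```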

6.4.5 Convergence of SGD


The rigorous proof of the convergence of SGD is given as follows.

Theorem 6.4 (Convergence of SGD). For the SGD algorithm in (6.13), if the following
conditions are satisfied, then wk converges to the root of ∇w E[f (w, X)] = 0 almost surely.
(a) $0 < c_1 \le \nabla_w^2 f(w, X) \le c_2$;
(b) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
(c) $\{x_k\}_{k=1}^{\infty}$ are i.i.d.

The three conditions in Theorem 6.4 are discussed below.

 Condition (a) is about the convexity of f . It requires the curvature of f to be bounded


from above and below. Here, w is a scalar, and so is ∇2w f (w, X). This condition can
be generalized to the vector case. When w is a vector, ∇2w f (w, X) is the well-known
Hessian matrix.
 Condition (b) is similar to that of the RM algorithm. In fact, the SGD algorithm is
a special RM algorithm (as shown in the proof in Box 6.1). In practice, ak is often
selected as a sufficiently small constant. Although condition (b) is not satisfied in this
case, the algorithm can still converge in a certain sense [24, Section 1.5].
 Condition (c) is a common requirement.


Box 6.1: Proof of Theorem 6.4


We next show that the SGD algorithm is a special RM algorithm. Then, the conver-
gence of SGD naturally follows from the RM theorem.
The problem to be solved by SGD is to minimize J(w) = E[f (w, X)]. This
problem can be converted to a root-finding problem. That is, finding the root of
∇w J(w) = E[∇w f (w, X)] = 0. Let

g(w) = ∇w J(w) = E[∇w f (w, X)].

Then, SGD aims to find the root of g(w) = 0. This is exactly the problem solved by
the RM algorithm. The quantity that we can measure is g̃ = ∇w f (w, x), where x is
a sample of X. Note that g̃ can be rewritten as

$$\tilde g(w, \eta) = \nabla_w f(w, x) = \mathbb{E}[\nabla_w f(w, X)] + \underbrace{\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]}_{\eta(w, x)}.$$

Then, the RM algorithm for solving g(w) = 0 is

wk+1 = wk − ak g̃(wk , ηk ) = wk − ak ∇w f (wk , xk ),

which is the same as the SGD algorithm in (6.13). As a result, the SGD algorithm
is a special RM algorithm. We next show that the three conditions in Theorem 6.1
are satisfied. Then, the convergence of SGD naturally follows from Theorem 6.1.
 Since ∇w g(w) = ∇w E[∇w f (w, X)] = E[∇2w f (w, X)], it follows from c1 ≤
∇2w f (w, X) ≤ c2 that c1 ≤ ∇w g(w) ≤ c2 . Thus, the first condition in Theo-
rem 6.1 is satisfied.
 The second condition in Theorem 6.1 is the same as the second condition in this
theorem.
 The third condition in Theorem 6.1 requires E[ηk |Hk ] = 0 and E[ηk2 |Hk ] < ∞.
Since {xk } is i.i.d., we have Exk [∇w f (w, xk )] = E[∇w f (w, X)] for all k. Therefore,

E[ηk |Hk ] = E[∇w f (wk , xk ) − E[∇w f (wk , X)]|Hk ].

Since $\mathcal{H}_k = \{w_k, w_{k-1}, \dots\}$ and $x_k$ is independent of $\mathcal{H}_k$, the first term on the right-hand side becomes $\mathbb{E}[\nabla_w f(w_k, x_k)|\mathcal{H}_k] = \mathbb{E}_{x_k}[\nabla_w f(w_k, x_k)]$. The second
term becomes E[E[∇w f (wk , X)]|Hk ] = E[∇w f (wk , X)] because E[∇w f (wk , X)] is


a function of wk . Therefore,

E[ηk |Hk ] = Exk [∇w f (wk , xk )] − E[∇w f (wk , X)] = 0.

Similarly, it can be proven that E[ηk2 |Hk ] < ∞ if |∇w f (w, x)| < ∞ for all w given
any x.
Since the three conditions in Theorem 6.1 are satisfied, the convergence of the
SGD algorithm follows.

6.5 Summary
Instead of introducing new reinforcement learning algorithms, this chapter introduced the
preliminaries of stochastic approximation such as the RM and SGD algorithms. Com-
pared to many other root-finding algorithms, the RM algorithm does not require the
expression of the objective function or its derivative. It has been shown that the SGD al-
gorithm is a special RM algorithm. Moreover, an important problem frequently discussed
throughout this chapter is mean estimation. The mean estimation algorithm (6.4) is the
first stochastic iterative algorithm we have ever introduced in this book. We showed that
it is a special SGD algorithm. We will see in Chapter 7 that temporal-difference learn-
ing algorithms have similar expressions. Finally, the name “stochastic approximation”
was first used by Robbins and Monro in 1951 [25]. More information about stochastic
approximation can be found in [24].

6.6 Q&A
 Q: What is stochastic approximation?
A: Stochastic approximation refers to a broad class of stochastic iterative algorithms
for solving root-finding or optimization problems.
 Q: Why do we need to study stochastic approximation?
A: This is because the temporal-difference reinforcement learning algorithms that will
be introduced in Chapter 7 can be viewed as stochastic approximation algorithms.
With the knowledge introduced in this chapter, we can be better prepared, and it will
not be abrupt for us to see these algorithms for the first time.
 Q: Why do we frequently discuss the mean estimation problem in this chapter?
A: This is because the state and action values are defined as the means of random
variables. The temporal-difference learning algorithms introduced in Chapter 7 are
similar to stochastic approximation algorithms for mean estimation.


 Q: What is the advantage of the RM algorithm over other root-finding algorithms?


A: Compared to many other root-finding algorithms, the RM algorithm is powerful
in the sense that it does not require the expression of the objective function or its
derivative. As a result, it is a black-box technique that only requires the input and
output of the objective function. The famous SGD algorithm is a special form of the
RM algorithm.
 Q: What is the basic idea of SGD?
A: SGD aims to solve optimization problems involving random variables. When the
probability distributions of the given random variables are not known, SGD can solve
the optimization problems merely by using samples. Mathematically, the SGD algo-
rithm can be obtained by replacing the true gradient expressed as an expectation in
the gradient descent algorithm with a stochastic gradient.
 Q: Can SGD converge quickly?
A: SGD has an interesting convergence pattern. That is, if the estimate is far from
the optimal solution, then the convergence process is fast. When the estimate is close
to the solution, the randomness of the stochastic gradient becomes influential, and
the convergence rate decreases.
 Q: What is MBGD? What are its advantages over SGD and BGD?
A: MBGD can be viewed as an intermediate version between SGD and BGD. Com-
pared to SGD, it has less randomness because it uses more samples instead of just one
as in SGD. Compared to BGD, it does not require the use of all the samples, making
it more flexible.

Chapter 7

Temporal-Difference Methods

[Figure 7.1 shows a roadmap of the book: Chapters 1-3 (Basic Concepts, Bellman Equation, Bellman Optimality Equation) provide the fundamental tools; Chapter 4 (Value Iteration and Policy Iteration) assumes a model, Chapter 5 (Monte Carlo Methods) removes the model, and Chapter 6 (Stochastic Approximation) leads to Chapter 7 (Temporal-Difference Methods); moving from tabular to function representations leads to Chapter 8 (Value Function Approximation), Chapter 9 (Policy Gradient Methods, policy-based), and Chapter 10 (Actor-Critic Methods, which combine value-based and policy-based methods).]
Figure 7.1: Where we are in this book.

This chapter introduces temporal-difference (TD) methods for reinforcement learning.


Similar to Monte Carlo (MC) learning, TD learning is also model-free, but it has some
advantages due to its incremental form. With the preparation in Chapter 6, readers will
not feel alarmed when seeing TD learning algorithms. In fact, TD learning algorithms can
be viewed as special stochastic algorithms for solving the Bellman or Bellman optimality
equations.
Since this chapter introduces quite a few TD algorithms, we first overview these
algorithms and clarify the relationships between them.
 Section 7.1 introduces the most basic TD algorithm, which can estimate the state


values of a given policy. It is important to understand this basic algorithm first


before studying the other TD algorithms.
 Section 7.2 introduces the Sarsa algorithm, which can estimate the action values of a
given policy. This algorithm can be combined with a policy improvement step to find
optimal policies. The Sarsa algorithm can be easily obtained from the TD algorithm
in Section 7.1 by replacing state value estimation with action value estimation.
 Section 7.3 introduces the n-step Sarsa algorithm, which is a generalization of the
Sarsa algorithm. It will be shown that Sarsa and MC learning are two special cases
of n-step Sarsa.
 Section 7.4 introduces the Q-learning algorithm, which is one of the most classic
reinforcement learning algorithms. While the other TD algorithms aim to solve the
Bellman equation of a given policy, Q-learning aims to directly solve the Bellman
optimality equation to obtain optimal policies.
 Section 7.5 compares the TD algorithms introduced in this chapter and provides a
unified point of view.

7.1 TD learning of state values


TD learning often refers to a broad class of reinforcement learning algorithms. For ex-
ample, all the algorithms introduced in this chapter fall into the scope of TD learning.
However, TD learning in this section specifically refers to a classic algorithm for estimat-
ing state values.

7.1.1 Algorithm description


Given a policy π, our goal is to estimate vπ (s) for all s ∈ S. Suppose that we have
some experience samples (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) generated following π. Here, t
denotes the time step. The following TD algorithm can estimate the state values using
these samples:
$$
\begin{aligned}
v_{t+1}(s_t) &= v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big], &(7.1)\\
v_{t+1}(s) &= v_t(s), \quad \text{for all } s \ne s_t, &(7.2)
\end{aligned}
$$

where t = 0, 1, 2, . . . . Here, vt (st ) is the estimate of vπ (st ) at time t; αt (st ) is the learning
rate for st at time t.
It should be noted that, at time t, only the value of the visited state st is updated. The
values of the unvisited states s 6= st remain unchanged as shown in (7.2). Equation (7.2)
is often omitted for simplicity, but it should be kept in mind because the algorithm would
be mathematically incomplete without this equation.
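Before analyzing the algorithm further, the following minimal Python sketch shows how (7.1)-(7.2) are used in practice; the two-state Markov process, rewards, discount rate, and learning rate are illustrative assumptions. Only the value of the visited state is updated at each step, and the estimates approach the solution of the Bellman equation.

```python
import numpy as np

rng = np.random.default_rng(6)
gamma, alpha = 0.9, 0.05

# A tiny Markov process induced by a fixed policy:
# from state 0 the agent moves to state 1; from state 1 it moves to 0 or 1 with equal probability.
P = np.array([[0.0, 1.0],
              [0.5, 0.5]])
R = np.array([0.0, 1.0])          # reward received when leaving each state

v = np.zeros(2)                   # TD estimates v_t(s)
s = 0
for t in range(50000):
    s_next = rng.choice(2, p=P[s])
    r = R[s]
    # TD update: only the visited state changes (Eq. (7.1)); all other states keep their values (Eq. (7.2))
    v[s] = v[s] - alpha * (v[s] - (r + gamma * v[s_next]))
    s = s_next

v_true = np.linalg.solve(np.eye(2) - gamma * P, R)   # closed-form solution of the Bellman equation
print("TD estimate :", np.round(v, 3))
print("true values :", np.round(v_true, 3))
```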


Readers who see the TD learning algorithm for the first time may wonder why it
is designed like this. In fact, it can be viewed as a special stochastic approximation
algorithm for solving the Bellman equation. To see that, first recall that the definition of
the state value is
 
$$v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma G_{t+1}\,|\,S_t = s\big], \quad s \in \mathcal{S}. \tag{7.3}$$
We can rewrite (7.3) as
$$v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1})\,|\,S_t = s\big], \quad s \in \mathcal{S}. \tag{7.4}$$
That is because $\mathbb{E}[G_{t+1}|S_t = s] = \sum_a \pi(a|s)\sum_{s'} p(s'|s,a)v_\pi(s') = \mathbb{E}[v_\pi(S_{t+1})|S_t = s]$.
Equation (7.4) is another expression of the Bellman equation. It is sometimes called the
Bellman expectation equation.
The TD algorithm can be derived by applying the Robbins-Monro algorithm (Chap-
ter 6) to solve the Bellman equation in (7.4). Interested readers can check the details in
Box 7.1.

Box 7.1: Derivation of the TD algorithm

We next show that the TD algorithm in (7.1) can be obtained by applying the
Robbins-Monro algorithm to solve (7.4).
For state st , we define a function as

$$g(v_\pi(s_t)) \doteq v_\pi(s_t) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1})\,|\,S_t = s_t\big].$$

Then, (7.4) is equivalent to


g(vπ (st )) = 0.

Our goal is to solve the above equation to obtain vπ (st ) using the Robbins-Monro
algorithm. Since we can obtain rt+1 and st+1 , which are the samples of Rt+1 and
St+1 , the noisy observation of g(vπ (st )) that we can obtain is
 
$$
\begin{aligned}
\tilde g(v_\pi(s_t)) &= v_\pi(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)\\
&= \underbrace{\Big(v_\pi(s_t) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1})\,|\,S_t = s_t\big]\Big)}_{g(v_\pi(s_t))} + \underbrace{\Big(\mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1})\,|\,S_t = s_t\big] - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)\Big)}_{\eta}.
\end{aligned}
$$


Therefore, the Robbins-Monro algorithm (Section 6.2) for solving g(vπ (st )) = 0 is

$$
\begin{aligned}
v_{t+1}(s_t) &= v_t(s_t) - \alpha_t(s_t)\tilde g(v_t(s_t))\\
&= v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)\Big], \qquad (7.5)
\end{aligned}
$$

where vt (st ) is the estimate of vπ (st ) at time t, and αt (st ) is the learning rate.
The algorithm in (7.5) has a similar expression to that of the TD algorithm in
(7.1). The only difference is that the right-hand side of (7.5) contains vπ (st+1 ),
whereas (7.1) contains vt (st+1 ). That is because (7.5) is designed to merely estimate the state value of st by assuming that the state values of the other states are already known. If we would like to estimate the state values of all the states, then vπ (st+1 ) on
the right-hand side should be replaced with vt (st+1 ). Then, (7.5) is exactly the same
as (7.1). However, can such a replacement still ensure convergence? The answer is
yes, and it will be proven later in Theorem 7.1.

7.1.2 Property analysis


Some important properties of the TD algorithm are discussed as follows.
First, we examine the expression of the TD algorithm more closely. In particular,
(7.1) can be described as

$$\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\Big[\overbrace{v_t(s_t) - \big(\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar v_t}\big)}^{\text{TD error } \delta_t}\Big], \tag{7.6}$$

where
$$\bar v_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$$
is called the TD target and
$$\delta_t \doteq v_t(s_t) - \bar v_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$$

is called the TD error. It can be seen that the new estimate vt+1 (st ) is a combination of
the current estimate vt (st ) and the TD error δt .

 Why is v̄t called the TD target?


This is because v̄t is the target value that the algorithm attempts to drive v(st ) to.
To see that, subtracting v̄t from both sides of (7.6) gives
   
$$v_{t+1}(s_t) - \bar v_t = \big(v_t(s_t) - \bar v_t\big) - \alpha_t(s_t)\big(v_t(s_t) - \bar v_t\big) = [1 - \alpha_t(s_t)]\big(v_t(s_t) - \bar v_t\big).$$


Taking the absolute values of both sides of the above equation gives

|vt+1 (st ) − v̄t | = |1 − αt (st )||vt (st ) − v̄t |.

Since αt (st ) is a small positive number, we have 0 < 1 − αt (st ) < 1. It then follows
that

|vt+1 (st ) − v̄t | < |vt (st ) − v̄t |.

The above inequality is important because it indicates that the new value vt+1 (st ) is
closer to v̄t than the old value vt (st ). Therefore, this algorithm mathematically drives
vt (st ) toward v̄t . This is why v̄t is called the TD target.
 What is the interpretation of the TD error?
First, this error is called temporal-difference because δt = vt (st ) − (rt+1 + γvt (st+1 ))
reflects the discrepancy between two time steps t and t + 1. Second, the TD error is
zero in the expectation sense when the state value estimate is accurate. To see that,
when vt = vπ , the expected value of the TD error is
 
$$
\begin{aligned}
\mathbb{E}[\delta_t|S_t = s_t] &= \mathbb{E}\big[v_\pi(S_t) - (R_{t+1} + \gamma v_\pi(S_{t+1}))\,|\,S_t = s_t\big]\\
&= v_\pi(s_t) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1})\,|\,S_t = s_t\big]\\
&= 0. \qquad (\text{due to (7.3)})
\end{aligned}
$$

Therefore, the TD error reflects not only the discrepancy between two time steps but
also, more importantly, the discrepancy between the estimate vt and the true state
value vπ .
On a more abstract level, the TD error can be interpreted as the innovation, which
indicates new information obtained from the experience sample (st , rt+1 , st+1 ). The
fundamental idea of TD learning is to correct our current estimate of the state val-
ue based on the newly obtained information. Innovation is fundamental in many
estimation problems such as Kalman filtering [33, 34].

Second, the TD algorithm in (7.1) can only estimate the state values of a given policy.
To find optimal policies, we still need to further calculate the action values and then
conduct policy improvement. This will be introduced in Section 7.2. Nevertheless, the
TD algorithm introduced in this section is very basic and important for understanding
the other algorithms in this chapter.
Third, while both TD learning and MC learning are model-free, what are their ad-
vantages and disadvantages? The answers are summarized in Table 7.1.


TD learning:
• Online: TD learning is online. It can update the state/action values immediately after receiving an experience sample.
• Continuing tasks: Since TD learning is online, it can handle both episodic and continuing tasks. Continuing tasks may not have terminal states.
• Bootstrapping: TD learning bootstraps because the update of a state/action value relies on the previous estimate of this value. As a result, TD learning requires an initial guess of the values.
• Low estimation variance: The estimation variance of TD is lower than that of MC because it involves fewer random variables. For instance, to estimate an action value qπ(st, at), Sarsa merely requires the samples of three random variables: Rt+1, St+1, At+1.

MC learning:
• Offline: MC learning is offline. It must wait until an episode has been completely collected. That is because it must calculate the discounted return of the episode.
• Episodic tasks: Since MC learning is offline, it can only handle episodic tasks where the episodes terminate after a finite number of steps.
• Non-bootstrapping: MC is not bootstrapping because it can directly estimate state/action values without initial guesses.
• High estimation variance: The estimation variance of MC is higher since many random variables are involved. For example, to estimate qπ(st, at), we need samples of Rt+1 + γRt+2 + γ²Rt+3 + ⋯. Suppose that the length of each episode is L. Assume that each state has the same number of actions as |A|. Then, there are |A|^L possible episodes following a soft policy. If we merely use a few episodes to estimate, it is not surprising that the estimation variance is high.

Table 7.1: A comparison between TD learning and MC learning.

7.1.3 Convergence analysis


The convergence analysis of the TD algorithm in (7.1) is given below.

Theorem 7.1 (Convergence of TD learning). Given a policy π, by the TD algorithm in (7.1), $v_t(s)$ converges almost surely to $v_\pi(s)$ as $t \to \infty$ for all $s \in \mathcal{S}$ if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.

Some remarks about $\alpha_t$ are given below. First, the condition of $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must be valid for all $s \in \mathcal{S}$. Note that, at time $t$, $\alpha_t(s) > 0$ if $s$ is being visited and $\alpha_t(s) = 0$ otherwise. The condition $\sum_t \alpha_t(s) = \infty$ requires the state $s$ to be visited an infinite (or sufficiently many) number of times. This requires either the condition of exploring starts or an exploratory policy so that every state-action pair can possibly be visited many times. Second, the learning rate $\alpha_t$ is often selected as a small


positive constant in practice. In this case, the condition that $\sum_t \alpha_t^2(s) < \infty$ is no longer valid. When $\alpha$ is constant, it can still be shown that the algorithm converges in the sense of expectation [24, Section 1.5].

Box 7.2: Proof of Theorem 7.1


We prove the convergence based on Theorem 6.3 in Chapter 6. To do that, we need
first to construct a stochastic process as that in Theorem 6.3. Consider an arbitrary
state s ∈ S. At time t, it follows from the TD algorithm in (7.1) that
 
vt+1 (s) = vt (s) − αt (s) vt (s) − (rt+1 + γvt (st+1 )) , if s = st , (7.7)

or

vt+1 (s) = vt (s), if s 6= st . (7.8)

The estimation error is defined as

.
∆t (s) = vt (s) − vπ (s),

where vπ (s) is the state value of s under policy π. Deducting vπ (s) from both sides
of (7.7) gives

$$
\begin{aligned}
\Delta_{t+1}(s) &= (1-\alpha_t(s))\Delta_t(s) + \alpha_t(s)\underbrace{\big(r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s)\big)}_{\eta_t(s)}\\
&= (1-\alpha_t(s))\Delta_t(s) + \alpha_t(s)\eta_t(s), \qquad s = s_t. \qquad (7.9)
\end{aligned}
$$

Deducting vπ (s) from both sides of (7.8) gives

∆t+1 (s) = ∆t (s) = (1 − αt (s))∆t (s) + αt (s)ηt (s), s 6= st ,

whose expression is the same as that of (7.9) except that αt (s) = 0 and ηt (s) = 0.
Therefore, regardless of whether s = st , we obtain the following unified expression:

∆t+1 (s) = (1 − αt (s))∆t (s) + αt (s)ηt (s).

This is the process in Theorem 6.3. Our goal is to show that the three conditions in
Theorem 6.3 are satisfied and hence the process converges.
The first condition is valid as assumed in Theorem 7.1. We next show that the second condition is valid. That is, $\|\mathbb{E}[\eta_t(s)|\mathcal{H}_t]\|_\infty \le \gamma\|\Delta_t(s)\|_\infty$ for all $s \in \mathcal{S}$. Here, $\mathcal{H}_t$ represents the historical information (see the definition in Theorem 6.3). Due to
the Markovian property, ηt (s) = rt+1 + γvt (st+1 ) − vπ (s) or ηt (s) = 0 does not depend


on the historical information once $s$ is given. As a result, we have $\mathbb{E}[\eta_t(s)|\mathcal{H}_t] = \mathbb{E}[\eta_t(s)]$. For $s \ne s_t$, we have $\eta_t(s) = 0$. Then, it is trivial to see that
$$|\mathbb{E}[\eta_t(s)]| = 0 \le \gamma\|\Delta_t(s)\|_\infty. \tag{7.10}$$

For s = st , we have

E[ηt (s)] = E[ηt (st )]


= E[rt+1 + γvt (st+1 ) − vπ (st )|st ]
= E[rt+1 + γvt (st+1 )|st ] − vπ (st ).

Since vπ (st ) = E[rt+1 + γvπ (st+1 )|st ], the above equation implies that

$$\mathbb{E}[\eta_t(s)] = \gamma\mathbb{E}[v_t(s_{t+1}) - v_\pi(s_{t+1})\,|\,s_t] = \gamma\sum_{s'\in\mathcal{S}} p(s'|s_t)\big[v_t(s') - v_\pi(s')\big].$$

It follows that

$$
\begin{aligned}
|\mathbb{E}[\eta_t(s)]| &= \gamma\Big|\sum_{s'\in\mathcal{S}} p(s'|s_t)\big[v_t(s') - v_\pi(s')\big]\Big|\\
&\le \gamma\sum_{s'\in\mathcal{S}} p(s'|s_t)\max_{s'\in\mathcal{S}}|v_t(s') - v_\pi(s')|\\
&= \gamma\max_{s'\in\mathcal{S}}|v_t(s') - v_\pi(s')|\\
&= \gamma\|v_t(s') - v_\pi(s')\|_\infty\\
&= \gamma\|\Delta_t(s)\|_\infty. \qquad (7.11)
\end{aligned}
$$

Therefore, at time $t$, we know from (7.10) and (7.11) that $|\mathbb{E}[\eta_t(s)]| \le \gamma\|\Delta_t(s)\|_\infty$ for all $s \in \mathcal{S}$ regardless of whether $s = s_t$. Thus,
$$\|\mathbb{E}[\eta_t(s)]\|_\infty \le \gamma\|\Delta_t(s)\|_\infty,$$

which is the second condition in Theorem 6.3. Finally, regarding the third condition, we have $\mathrm{var}[\eta_t(s)|\mathcal{H}_t] = \mathrm{var}[r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s_t)\,|\,s_t] = \mathrm{var}[r_{t+1} + \gamma v_t(s_{t+1})\,|\,s_t]$ for $s = s_t$ and $\mathrm{var}[\eta_t(s)|\mathcal{H}_t] = 0$ for $s \ne s_t$. Since $r_{t+1}$ is bounded, the third condition can be proven without difficulty.
The above proof is inspired by [32].


7.2 TD learning of action values: Sarsa


The TD algorithm introduced in Section 7.1 can only estimate state values. This section
introduces another TD algorithm called Sarsa that can directly estimate action values.
Estimating action values is important because it can be combined with a policy improve-
ment step to learn optimal policies.

7.2.1 Algorithm description


Given a policy π, our goal is to estimate the action values. Suppose that we have some
experience samples generated following π: (s0 , a0 , r1 , s1 , a1 , . . . , st , at , rt+1 , st+1 , at+1 , . . . ).
We can use the following Sarsa algorithm to estimate the action values:
$$
\begin{aligned}
q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big], &(7.12)\\
q_{t+1}(s, a) &= q_t(s, a), \quad \text{for all } (s, a) \ne (s_t, a_t),
\end{aligned}
$$

where t = 0, 1, 2, . . . and αt (st , at ) is the learning rate. Here, qt (st , at ) is the estimate of
qπ (st , at ). At time t, only the q-value of (st , at ) is updated, whereas the q-values of the
others remain the same.
Some important properties of the Sarsa algorithm are discussed as follows.

 Why is this algorithm called “Sarsa”? That is because each iteration of the algorithm
requires (st , at , rt+1 , st+1 , at+1 ). Sarsa is an abbreviation for state-action-reward-state-
action. The Sarsa algorithm was first proposed in [35] and its name was coined by
[3].
 Why is Sarsa designed in this way? One may have noticed that Sarsa is similar to the
TD algorithm in (7.1). In fact, Sarsa can be easily obtained from the TD algorithm
by replacing state value estimation with action value estimation.
 What does Sarsa do mathematically? Similar to the TD algorithm in (7.1), Sarsa
is a stochastic approximation algorithm for solving the Bellman equation of a given
policy:

qπ (s, a) = E [R + γqπ (S 0 , A0 )|s, a] , for all (s, a). (7.13)

Equation (7.13) is the Bellman equation expressed in terms of action values. A proof
is given in Box 7.3.

Box 7.3: Showing that (7.13) is the Bellman equation


As introduced in Section 2.8.2, the Bellman equation expressed in terms of action values is
$$
\begin{aligned}
q_\pi(s, a) &= \sum_r r\,p(r|s,a) + \gamma\sum_{s'}\sum_{a'} q_\pi(s', a')\,p(s'|s,a)\,\pi(a'|s')\\
&= \sum_r r\,p(r|s,a) + \gamma\sum_{s'} p(s'|s,a)\sum_{a'} q_\pi(s', a')\,\pi(a'|s'). \qquad (7.14)
\end{aligned}
$$

This equation establishes the relationships among the action values. Since

$$
\begin{aligned}
p(s', a'|s, a) &= p(s'|s,a)\,p(a'|s', s, a)\\
&= p(s'|s,a)\,p(a'|s') \qquad (\text{due to conditional independence})\\
&\doteq p(s'|s,a)\,\pi(a'|s'),
\end{aligned}
$$

(7.14) can be rewritten as
$$q_\pi(s, a) = \sum_r r\,p(r|s,a) + \gamma\sum_{s'}\sum_{a'} q_\pi(s', a')\,p(s', a'|s, a).$$

By the definition of the expected value, the above equation is equivalent to (7.13).
Hence, (7.13) is the Bellman equation.

 Is Sarsa convergent? Since Sarsa is the action-value version of the TD algorithm in


(7.1), the convergence result is similar to Theorem 7.1 and given below.

Theorem 7.2 (Convergence of Sarsa). Given a policy π, by the Sarsa algorithm in (7.12), $q_t(s, a)$ converges almost surely to the action value $q_\pi(s, a)$ as $t \to \infty$ for all $(s, a)$ if $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ for all $(s, a)$.

The proof is similar to that of Theorem 7.1 and is omitted here. The condition of $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ should be valid for all $(s, a)$. In particular, $\sum_t \alpha_t(s, a) = \infty$ requires that every state-action pair must be visited an infinite (or sufficiently many) number of times. At time $t$, if $(s, a) = (s_t, a_t)$, then $\alpha_t(s, a) > 0$; otherwise, $\alpha_t(s, a) = 0$.

7.2.2 Optimal policy learning via Sarsa


The Sarsa algorithm in (7.12) can only estimate the action values of a given policy. To
find optimal policies, we can combine it with a policy improvement step. The combination
is also often called Sarsa, and its implementation procedure is given in Algorithm 7.1.
As shown in Algorithm 7.1, each iteration has two steps. The first step is to update
the q-value of the visited state-action pair. The second step is to update the policy to an
-greedy one. The q-value update step only updates the single state-action pair visited


Algorithm 7.1: Optimal policy learning by Sarsa

Initialization: $\alpha_t(s, a) = \alpha > 0$ for all $(s, a)$ and all $t$. $\epsilon \in (0, 1)$. Initial $q_0(s, a)$ for all $(s, a)$. Initial $\epsilon$-greedy policy $\pi_0$ derived from $q_0$.
Goal: Learn an optimal policy that can lead the agent to the target state from an initial state $s_0$.
For each episode, do
    Generate $a_0$ at $s_0$ following $\pi_0(s_0)$
    If $s_t$ ($t = 0, 1, 2, \dots$) is not the target state, do
        Collect an experience sample $(r_{t+1}, s_{t+1}, a_{t+1})$ given $(s_t, a_t)$: generate $r_{t+1}, s_{t+1}$ by interacting with the environment; generate $a_{t+1}$ following $\pi_t(s_{t+1})$.
        Update q-value for $(s_t, a_t)$:
            $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}))\big]$
        Update policy for $s_t$:
            $\pi_{t+1}(a|s_t) = 1 - \frac{\epsilon}{|\mathcal{A}(s_t)|}(|\mathcal{A}(s_t)| - 1)$ if $a = \arg\max_a q_{t+1}(s_t, a)$
            $\pi_{t+1}(a|s_t) = \frac{\epsilon}{|\mathcal{A}(s_t)|}$ otherwise
        $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$

at time t. Afterward, the policy of st is immediately updated. Therefore, we do not evaluate a given policy sufficiently well before updating the policy. This is based on the idea of generalized policy iteration. Moreover, after the policy is updated, the policy is immediately used to generate the next experience sample. The policy here is ε-greedy so that it is exploratory.
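The following Python sketch is a simplified stand-in for Algorithm 7.1, not a full implementation: the one-dimensional corridor environment, the rewards, and the hyperparameters are illustrative assumptions, and the ε-greedy policy is represented implicitly through the current q-values rather than stored as a table. It shows the two steps per iteration: a Sarsa q-value update for the visited pair, followed by acting with the ε-greedy policy derived from the updated q-values.

```python
import numpy as np

rng = np.random.default_rng(7)
n_states, target = 5, 4
actions = [-1, 1]                       # move left / right
gamma, alpha, eps = 0.9, 0.1, 0.1

q = np.zeros((n_states, len(actions)))

def eps_greedy(s):
    # epsilon-greedy action selection with respect to the current q-values
    if rng.random() < eps:
        return int(rng.integers(len(actions)))
    return int(np.argmax(q[s]))

def step(s, a):
    s_next = min(max(s + actions[a], 0), n_states - 1)
    r = 0.0 if s_next == target else -1.0
    return r, s_next

for episode in range(500):
    s, a = 0, eps_greedy(0)
    while s != target:
        r, s_next = step(s, a)
        a_next = eps_greedy(s_next)
        # Sarsa update for the visited pair (s, a), using the sample (r, s_next, a_next)
        q[s, a] -= alpha * (q[s, a] - (r + gamma * q[s_next, a_next]))
        s, a = s_next, a_next

print("greedy action per state:", ["left" if g == 0 else "right" for g in np.argmax(q, axis=1)])
```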
A simulation example is shown in Figure 7.2 to demonstrate the Sarsa algorithm.
Unlike all the tasks we have seen in this book, the task here aims to find an optimal
path from a specific starting state to a target state. It does not aim to find the optimal
policies for all states. This task is often encountered in practice where the starting state
(e.g., home) and the target state (e.g., workplace) are fixed, and we only need to find an
optimal path connecting them. This task is relatively simple because we only need to
explore the states that are close to the path and do not need to explore all the states.
However, if we do not explore all the states, the final path may be locally optimal rather
than globally optimal.
The simulation setup and simulation results are discussed below.

 Simulation setup: In this example, all the episodes start from the top-left state and ter-
minate at the target state. The reward settings are rtarget = 0, rforbidden = rboundary =
−10, and rother = −1. Moreover, αt (s, a) = 0.1 for all t and ε = 0.1. The initial
guesses of the action values are q0 (s, a) = 0 for all (s, a). The initial policy has a
uniform distribution: π0 (a|s) = 0.2 for all s, a.
 Learned policy: The left figure in Figure 7.2 shows the final policy learned by Sarsa.
As can be seen, this policy can successfully lead to the target state from the starting


[Figure 7.2 contains a left panel showing the 5 × 5 grid world with the final policy learned by Sarsa, and two right panels plotting the total rewards and the episode length against the episode index.]
Figure 7.2: An example for demonstrating Sarsa. All the episodes start from the top-left state and
terminate when reaching the target state (the blue cell). The goal is to find an optimal path from the
starting state to the target state. The reward settings are rtarget = 0, rforbidden = rboundary = −10, and
rother = −1. The learning rate is α = 0.1 and the value of ε is 0.1. The left figure shows the final policy
obtained by the algorithm. The right figures show the total reward and length of every episode.

state. However, the policies of some other states may not be optimal. That is because
the other states are not well explored.
 Total reward of each episode: The top-right subfigure in Figure 7.2 shows the to-
tal reward of each episode. Here, the total reward is the non-discounted sum of all
immediate rewards. As can be seen, the total reward of each episode increases grad-
ually. That is because the initial policy is not good and hence negative rewards are
frequently obtained. As the policy becomes better, the total reward increases.
 Length of each episode: The bottom-right subfigure in Figure 7.2 shows that the
length of each episode drops gradually. That is because the initial policy is not good
and may take many detours before reaching the target. As the policy becomes better,
the length of the trajectory becomes shorter. Notably, the length of an episode may
increase abruptly (e.g., the 460th episode) and the corresponding total reward also
drops sharply. That is because the policy is ε-greedy, and there is a chance for it to take non-optimal actions. One way to resolve this problem is to use decaying ε whose value converges to zero gradually.

Finally, Sarsa also has some variants such as Expected Sarsa. Interested readers may
check Box 7.4.

Box 7.4: Expected Sarsa


Given a policy π, its action values can be evaluated by Expected Sarsa, which is a
variant of Sarsa. The Expected Sarsa algorithm is
$$
\begin{aligned}
q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma\mathbb{E}[q_t(s_{t+1}, A)]\big)\Big],\\
q_{t+1}(s, a) &= q_t(s, a), \quad \text{for all } (s, a) \ne (s_t, a_t),
\end{aligned}
$$

where
$$\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a|s_{t+1})\,q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$

is the expected value of qt (st+1 , a) under policy πt . The expression of the Ex-
pected Sarsa algorithm is very similar to that of Sarsa. They are different only
in terms of their TD targets. In particular, the TD target in Expected Sarsa is
rt+1 + γE[qt (st+1 , A)], while that of Sarsa is rt+1 + γqt (st+1 , at+1 ). Since the algorithm
involves an expected value, it is called Expected Sarsa. Although calculating the
expected value may increase the computational complexity slightly, it is beneficial
in the sense that it reduces the estimation variances because it reduces the random
variables in Sarsa from {st , at , rt+1 , st+1 , at+1 } to {st , at , rt+1 , st+1 }.
Similar to the TD learning algorithm in (7.1), Expected Sarsa can be viewed as
a stochastic approximation algorithm for solving the following equation:
$$q_\pi(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma\mathbb{E}[q_\pi(S_{t+1}, A_{t+1})|S_{t+1}]\,\Big|\,S_t = s, A_t = a\Big], \quad \text{for all } s, a. \tag{7.15}$$

The above equation may look strange at first glance. In fact, it is another expression
of the Bellman equation. To see that, substituting
$$\mathbb{E}[q_\pi(S_{t+1}, A_{t+1})|S_{t+1}] = \sum_{A'} q_\pi(S_{t+1}, A')\,\pi(A'|S_{t+1}) = v_\pi(S_{t+1})$$

into (7.15) gives


$$q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1})\,|\,S_t = s, A_t = a\big],$$

which is clearly the Bellman equation.


The implementation of Expected Sarsa is similar to that of Sarsa. More details
can be found in [3, 36, 37].
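As a small illustration of the difference between the two TD targets, consider the following sketch, in which the array shapes and names are illustrative assumptions:

```python
import numpy as np

def sarsa_target(r, q_next, a_next, gamma=0.9):
    # Sarsa TD target: uses the sampled next action a_{t+1}
    return r + gamma * q_next[a_next]

def expected_sarsa_target(r, q_next, pi_next, gamma=0.9):
    # Expected Sarsa TD target: uses the expectation of q_t(s_{t+1}, A) under pi_t
    return r + gamma * np.dot(pi_next, q_next)

q_next = np.array([1.0, 3.0, 2.0])     # q_t(s_{t+1}, a) for each action
pi_next = np.array([0.2, 0.6, 0.2])    # pi_t(a | s_{t+1})
print(sarsa_target(-1.0, q_next, a_next=1))
print(expected_sarsa_target(-1.0, q_next, pi_next))
```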


7.3 TD learning of action values: n-step Sarsa


This section introduces n-step Sarsa, an extension of Sarsa. We will see that Sarsa and
MC learning are two extreme cases of n-step Sarsa.
Recall that the definition of the action value is

qπ (s, a) = E[Gt |St = s, At = a], (7.16)

where Gt is the discounted return satisfying

Gt = Rt+1 + γRt+2 + γ 2 Rt+3 + . . . .

In fact, Gt can also be decomposed into different forms:

$$
\begin{aligned}
\text{Sarsa} \longleftarrow\; & G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}),\\
& G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}),\\
& \;\;\vdots\\
\text{$n$-step Sarsa} \longleftarrow\; & G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}),\\
& \;\;\vdots\\
\text{MC} \longleftarrow\; & G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots
\end{aligned}
$$

It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = G_t^{(n)} = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$. Substituting the different decompositions of $G_t$ into $q_\pi(s, a)$ in (7.16) results in different algorithms.

 When n = 1, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(1)}|s, a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1})|s, a].$$
The corresponding stochastic approximation algorithm for solving this equation is
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big],$$

which is the Sarsa algorithm in (7.12).


 When n = ∞, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(\infty)}|s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots\,|\,s, a].$$


The corresponding algorithm for solving this equation is

$$q_{t+1}(s_t, a_t) = g_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots,$$

where $g_t$ is a sample of $G_t$. In fact, this is the MC learning algorithm, which approximates the action value of $(s_t, a_t)$ using the discounted return of an episode starting from $(s_t, a_t)$.
 For a general value of n, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(n)}|s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n})|s, a].$$
The corresponding algorithm for solving the above equation is
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big]. \tag{7.17}$$

This algorithm is called n-step Sarsa.

In summary, n-step Sarsa is a more general algorithm because it becomes the (one-
step) Sarsa algorithm when n = 1 and the MC learning algorithm when n = ∞ (by
setting αt = 1).
To implement the n-step Sarsa algorithm in (7.17), we need the experience samples
(st , at , rt+1 , st+1 , at+1 , . . . , rt+n , st+n , at+n ). Since (rt+n , st+n , at+n ) has not been collected
at time t, we have to wait until time t + n to update the q-value of (st , at ). To that end,
(7.17) can be rewritten as

$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t)\Big[q_{t+n-1}(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\Big],$$

where qt+n (st , at ) is the estimate of qπ (st , at ) at time t + n.


Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, it is not
surprising that the performance of n-step Sarsa is between that of Sarsa and MC learning.
In particular, if n is selected as a large number, n-step Sarsa is close to MC learning:
the estimate has a relatively high variance but a small bias. If n is selected to be small,
n-step Sarsa is close to Sarsa: the estimate has a relatively large bias but a low variance.
Finally, the n-step Sarsa algorithm presented here is merely used for policy evaluation.
It must be combined with a policy improvement step to learn optimal policies. The
implementation is similar to that of Sarsa and is omitted here. Interested readers may
check [3, Chapter 7] for a detailed analysis of multi-step TD learning.
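A minimal sketch of the delayed update described above is given below; the buffer layout and the toy numbers are illustrative assumptions. The agent stores the last $n$ rewards together with $(s_{t+n}, a_{t+n})$ and only then updates the q-value of $(s_t, a_t)$.

```python
import numpy as np

def n_step_sarsa_update(q, buffer, n, alpha=0.1, gamma=0.9):
    """Apply the n-step Sarsa update for the oldest pair in the buffer.

    buffer holds n + 1 transitions as (s, a, r_next) tuples, i.e.
    (s_t, a_t, r_{t+1}), ..., (s_{t+n}, a_{t+n}, r_{t+n+1}); only the first
    n rewards and the final (s_{t+n}, a_{t+n}) are used.
    """
    s_t, a_t, _ = buffer[0]
    g = sum(gamma**i * buffer[i][2] for i in range(n))   # r_{t+1} + ... + gamma^{n-1} r_{t+n}
    s_tn, a_tn, _ = buffer[n]
    g += gamma**n * q[s_tn, a_tn]                        # + gamma^n * q(s_{t+n}, a_{t+n})
    q[s_t, a_t] -= alpha * (q[s_t, a_t] - g)
    return q

# toy usage with 3 states, 2 actions, and n = 2
q = np.zeros((3, 2))
buffer = [(0, 1, -1.0), (1, 0, -1.0), (2, 1, 0.0)]       # (s, a, r_next) for t, t+1, t+2
q = n_step_sarsa_update(q, buffer, n=2)
print(q)
```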


7.4 TD learning of optimal action values: Q-learning


In this section, we introduce the Q-learning algorithm, one of the most classic reinforce-
ment learning algorithms [38,39]. Recall that Sarsa can only estimate the action values of
a given policy, and it must be combined with a policy improvement step to find optimal
policies. By contrast, Q-learning can directly estimate optimal action values and find
optimal policies.

7.4.1 Algorithm description


The Q-learning algorithm is
 

$$
\begin{aligned}
q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma\max_{a\in\mathcal{A}(s_{t+1})} q_t(s_{t+1}, a)\big)\Big], &(7.18)\\
q_{t+1}(s, a) &= q_t(s, a), \quad \text{for all } (s, a) \ne (s_t, a_t),
\end{aligned}
$$

where t = 0, 1, 2, . . . . Here, qt (st , at ) is the estimate of the optimal action value of (st , at )
and αt (st , at ) is the learning rate for (st , at ).
The expression of Q-learning is similar to that of Sarsa. They are different only
in terms of their TD targets: the TD target of Q-learning is rt+1 + γ maxa qt (st+1 , a),
whereas that of Sarsa is rt+1 + γqt (st+1 , at+1 ). Moreover, given (st , at ), Sarsa requires
(rt+1 , st+1 , at+1 ) in every iteration, whereas Q-learning merely requires (rt+1 , st+1 ).
Why is Q-learning designed as the expression in (7.18), and what does it do mathe-
matically? Q-learning is a stochastic approximation algorithm for solving the following
equation:
$$q(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma\max_a q(S_{t+1}, a)\,\Big|\,S_t = s, A_t = a\Big]. \tag{7.19}$$

This is the Bellman optimality equation expressed in terms of action values. The proof is
given in Box 7.5. The convergence analysis of Q-learning is similar to Theorem 7.1 and
omitted here. More information can be found in [32, 39].

Box 7.5: Showing that (7.19) is the Bellman optimality equation

By the definition of expectation, (7.19) can be rewritten as


$$q(s, a) = \sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\max_{a\in\mathcal{A}(s')} q(s', a).$$


Taking the maximum of both sides of the equation gives
$$\max_{a\in\mathcal{A}(s)} q(s, a) = \max_{a\in\mathcal{A}(s)}\left[\sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\max_{a\in\mathcal{A}(s')} q(s', a)\right].$$
By denoting $v(s) \doteq \max_{a\in\mathcal{A}(s)} q(s, a)$, we can rewrite the above equation as
$$
\begin{aligned}
v(s) &= \max_{a\in\mathcal{A}(s)}\left[\sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\,v(s')\right]\\
&= \max_{\pi}\sum_{a\in\mathcal{A}(s)}\pi(a|s)\left[\sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\,v(s')\right],
\end{aligned}
$$

which is clearly the Bellman optimality equation in terms of state values as introduced
in Chapter 3.
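The sketch below runs the Q-learning update (7.18) on a small corridor environment; the environment, the uniformly random behavior policy, and the hyperparameters are illustrative assumptions rather than the book's implementation. Note that the TD target uses $\max_a q(s_{t+1}, a)$ rather than a sampled next action, and the samples are generated by a policy different from the greedy policy being learned, which anticipates the off-policy discussion in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(8)
n_states, target = 5, 4
actions = [-1, 1]                       # move left / right
gamma, alpha = 0.9, 0.1

q = np.zeros((n_states, len(actions)))

def step(s, a):
    s_next = min(max(s + actions[a], 0), n_states - 1)
    r = 0.0 if s_next == target else -1.0
    return r, s_next

for episode in range(500):
    s = 0
    while s != target:
        a = int(rng.integers(len(actions)))     # behavior policy: uniformly random (exploratory)
        r, s_next = step(s, a)
        # Q-learning update: the TD target uses max_a q(s_{t+1}, a), not a sampled next action
        q[s, a] -= alpha * (q[s, a] - (r + gamma * np.max(q[s_next])))
        s = s_next

greedy = np.argmax(q, axis=1)
print("greedy (target) policy:", ["left" if g == 0 else "right" for g in greedy])
```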

7.4.2 Off-policy vs on-policy


We next introduce two important concepts: on-policy learning and off-policy learning.
What makes Q-learning slightly special compared to the other TD algorithms is that
Q-learning is off-policy while the others are on-policy.
Two policies exist in any reinforcement learning task: a behavior policy and a target
policy. The behavior policy is the one used to generate experience samples. The target
policy is the one that is constantly updated to converge to an optimal policy. When the
behavior policy is the same as the target policy, such a learning process is called on-policy.
Otherwise, when they are different, the learning process is called off-policy.
The advantage of off-policy learning is that it can learn optimal policies based on
the experience samples generated by other policies, which may be, for example, a policy
executed by a human operator. As an important case, the behavior policy can be selected
to be exploratory. For example, if we would like to estimate the action values of all state-
action pairs, we must generate episodes visiting every state-action pair sufficiently many
times. Although Sarsa uses ε-greedy policies to maintain certain exploration abilities, the
value of ε is usually small and hence the exploration ability is limited. By contrast, if
we can use a policy with a strong exploration ability to generate episodes and then use
off-policy learning to learn optimal policies, the learning efficiency would be significantly
increased.
To determine if an algorithm is on-policy or off-policy, we can examine two aspects.
The first is the mathematical problem that the algorithm aims to solve. The second is
the experience samples required by the algorithm.

 Sarsa is on-policy.


The reason is as follows. Sarsa has two steps in every iteration. The first step is to
evaluate a policy π by solving its Bellman equation. To do that, we need samples
generated by π. Therefore, π is the behavior policy. The second step is to obtain an
improved policy based on the estimated values of π. As a result, π is the target policy
that is constantly updated and eventually converges to an optimal policy. Therefore,
the behavior policy and the target policy are the same.
From another point of view, we can examine the samples required by the algorithm.
The samples required by Sarsa in every iteration include (st , at , rt+1 , st+1 , at+1 ). How
these samples are generated is illustrated below:

st --(πb)--> at --(model)--> rt+1, st+1 --(πb)--> at+1

As can be seen, the behavior policy πb is the one that generates at at st and at+1
at st+1 . The Sarsa algorithm aims to estimate the action value of (st , at ) of a policy
denoted as πT , which is the target policy because it is improved in every iteration
based on the estimated values. In fact, πT is the same as πb because the evaluation
of πT relies on the samples (rt+1 , st+1 , at+1 ), where at+1 is generated following πb . In
other words, the policy that Sarsa evaluates is the policy used to generate samples.
 Q-learning is off-policy.
The fundamental reason is that Q-learning is an algorithm for solving the Bellman
optimality equation, whereas Sarsa is for solving the Bellman equation of a given
policy. While solving the Bellman equation can evaluate the associated policy, solving
the Bellman optimality equation can directly generate the optimal values and optimal
policies.
In particular, the samples required by Q-learning in every iteration are (st, at, rt+1, st+1).
How these samples are generated is illustrated below:

st --(πb)--> at --(model)--> rt+1, st+1

As can be seen, the behavior policy πb is the one that generates at at st . The Q-learning
algorithm aims to estimate the optimal action value of (st , at ). This estimation process
relies on the samples (rt+1 , st+1 ). The process of generating (rt+1 , st+1 ) does not
involve πb because it is governed by the system model (or by interacting with the
environment). Therefore, the estimation of the optimal action value of (st , at ) does
not involve πb and we can use any πb to generate at at st . Moreover, the target
policy πT here is the greedy policy obtained based on the estimated optimal values
(Algorithm 7.3). The behavior policy does not have to be the same as πT .
 MC learning is on-policy. The reason is similar to that of Sarsa. The target policy to
be evaluated and improved is the same as the behavior policy that generates samples.


Another concept that may be confused with on-policy/off-policy is online/offline
learning. Online learning refers to the case where the value and the policy are updated
once an experience sample is obtained. Offline learning refers to the case where the up-
date can only be done after all experience samples have been collected. For example, TD
learning is online, whereas MC learning is offline. An on-policy learning algorithm like
Sarsa must work online because the updated policy must be used to generate new expe-
rience samples. An off-policy learning algorithm like Q-learning can work either online
or offline. It can either update the value and policy once receiving an experience sample
or after collecting all experience samples.

Algorithm 7.2: Optimal policy learning via Q-learning (on-policy version)

Initialization: αt(s, a) = α > 0 for all (s, a) and all t. ε ∈ (0, 1). Initial q0(s, a) for all
(s, a). Initial ε-greedy policy π0 derived from q0.
Goal: Learn an optimal path that can lead the agent to the target state from an initial
state s0.
For each episode, do
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (at, rt+1, st+1) given st: generate at following
        πt(st); generate rt+1, st+1 by interacting with the environment.
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ max_a qt(st+1, a))]
        Update policy for st:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1)  if a = arg max_a qt+1(st, a)
            πt+1(a|st) = ε/|A(st)|  otherwise

Algorithm 7.3: Optimal policy learning via Q-learning (off-policy version)

Initialization: Initial guess q0(s, a) for all (s, a). Behavior policy πb(a|s) for all (s, a).
αt(s, a) = α > 0 for all (s, a) and all t.
Goal: Learn an optimal target policy πT for all states from the experience samples
generated by πb.
For each episode {s0, a0, r1, s1, a1, r2, . . . } generated by πb, do
    For each step t = 0, 1, 2, . . . of the episode, do
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ max_a qt(st+1, a))]
        Update target policy for st:
            πT,t+1(a|st) = 1  if a = arg max_a qt+1(st, a)
            πT,t+1(a|st) = 0  otherwise


[Figure 7.3 appears here: a grid world policy (left) and plots of the total rewards and episode lengths versus the episode index (right).]

Figure 7.3: An example for demonstrating Q-learning. All the episodes start from the top-left state and
terminate after reaching the target state. The aim is to find an optimal path from the starting state to
the target state. The reward settings are rtarget = 0, rforbidden = rboundary = −10, and rother = −1. The
learning rate is α = 0.1 and the value of  is 0.1. The left figure shows the final policy obtained by the
algorithm. The right figure shows the total reward and length of every episode.

7.4.3 Implementation
Since Q-learning is off-policy, it can be implemented in either an on-policy or off-policy
fashion.
The on-policy version of Q-learning is shown in Algorithm 7.2. This implementation
is similar to the Sarsa one in Algorithm 7.1. Here, the behavior policy is the same as the
target policy, which is an -greedy policy.
The off-policy version is shown in Algorithm 7.3. The behavior policy πb can be any
policy as long as it can generate sufficient experience samples. It is usually favorable for
πb to be exploratory. Here, the target policy πT is greedy rather than ε-greedy since it is
not used to generate samples and hence is not required to be exploratory. Moreover, the
off-policy version of Q-learning presented here is implemented offline: all the experience
samples are collected first and then processed. It can be modified to become online: the
value and policy can be updated immediately once a sample is received.
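As a concrete illustration of the off-policy, offline fashion of Algorithm 7.3, the following Python sketch processes a pre-collected episode and then extracts a greedy target policy. The episode format and problem sizes are assumptions for this example.

```python
import numpy as np

n_states, n_actions = 25, 5
gamma, alpha = 0.9, 0.1

def q_learning_offline(episode, q=None):
    """episode: a list of transitions (s_t, a_t, r_{t+1}, s_{t+1}) generated by a behavior policy pi_b."""
    if q is None:
        q = np.zeros((n_states, n_actions))      # initial guess q_0(s, a)
    for s, a, r, s_next in episode:
        td_target = r + gamma * np.max(q[s_next])
        q[s, a] -= alpha * (q[s, a] - td_target)
    pi_T = np.argmax(q, axis=1)                  # greedy target policy: one action index per state
    return q, pi_T
```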

7.4.4 Illustrative examples


We next present examples to demonstrate Q-learning.
The first example is shown in Figure 7.3. It demonstrates on-policy Q-learning. The
goal here is to find an optimal path from a starting state to the target state. The setup
is given in the caption of Figure 7.3. As can be seen, Q-learning can eventually find an
optimal path. During the learning process, the length of each episode decreases, whereas
the total reward of each episode increases.
The second set of examples is shown in Figure 7.4 and Figure 7.5. They demonstrate
off-policy Q-learning. The goal here is to find an optimal policy for all the states. The
reward setting is rboundary = rforbidden = −1, and rtarget = 1. The discount rate is γ = 0.9.


The learning rate is α = 0.1.

 Ground truth: To verify the effectiveness of Q-learning, we first need to know the
ground truth of the optimal policies and optimal state values. Here, the ground truth
is obtained by the model-based policy iteration algorithm. The ground truth is given
in Figures 7.4(a) and (b).
 Experience samples: The behavior policy has a uniform distribution: the probability
of taking any action at any state is 0.2 (Figure 7.4(c)). A single episode with 100,000
steps is generated (Figure 7.4(d)). Due to the good exploration ability of the behavior
policy, the episode visits every state-action pair many times.
 Learned results: Based on the episode generated by the behavior policy, the final
target policy learned by Q-learning is shown in Figure 7.4(e). This policy is optimal
because the estimated state value error (root-mean-square error) converges to zero as
shown in Figure 7.4(f). In addition, one may notice that the learned optimal policy
is not exactly the same as that in Figure 7.4(a). In fact, there exist multiple optimal
policies that have the same optimal state values.
 Different initial values: Since Q-learning bootstraps, the performance of the algorithm
depends on the initial guess for the action values. As shown in Figure 7.4(g), when the
initial guess is close to the true value, the estimate converges within approximately
10,000 steps. Otherwise, the convergence requires more steps (Figure 7.4(h)). Nev-
ertheless, these figures demonstrate that Q-learning can still converge rapidly even
though the initial value is not accurate.
 Different behavior policies: When the behavior policy is not exploratory, the learning
performance drops significantly. For example, consider the behavior policies shown
in Figure 7.5. They are ε-greedy policies with ε = 0.5 or 0.1 (the uniform policy in
Figure 7.4(c) can be viewed as ε-greedy with ε = 1). It is shown that, when ε decreases
from 1 to 0.5 and then to 0.1, the learning speed drops significantly. That is because
the exploration ability of the policy is weak and hence the experience samples are
insufficient.

7.5 A unified viewpoint


Up to now, we have introduced different TD algorithms such as Sarsa, n-step Sarsa, and
Q-learning. In this section, we introduce a unified framework to accommodate all these
algorithms and MC learning.
In particular, the TD algorithms (for action value estimation) can be expressed in a
unified expression:

qt+1 (st , at ) = qt (st , at ) − αt (st , at )[qt (st , at ) − q̄t ], (7.20)


[Figure 7.4 appears here. Panel titles: (a) optimal policy; (b) optimal state values; (c) behavior policy; (d) generated episode; (e) learned policy; (f) state value error when q0(s, a) = 0; (g) state value error when q0(s, a) = 10; (h) state value error when q0(s, a) = 100.]

Figure 7.4: Examples for demonstrating off-policy learning via Q-learning. The optimal policy and
optimal state values are shown in (a) and (b), respectively. The behavior policy and the generated
episode are shown in (c) and (d), respectively. The estimated policy and the estimation error evolution
are shown in (e) and (f), respectively. The cases with different initial values are shown in (g) and (h).


[Figure 7.5 appears here: one row of panels per behavior policy, each row showing the behavior policy (left), the generated episode (middle), and the state value error (right).]

Figure 7.5: The performance of Q-learning drops when the behavior policy is not exploratory. The figures
in the left column show the behavior policies. The figures in the middle column show the generated
episodes following the corresponding behavior policies. The episode in each example has 100,000 steps.
The figures in the right column show the evolution of the root-mean-square error of the estimated state
values.


Algorithm        Expression of the TD target q̄t in (7.20)
Sarsa            q̄t = rt+1 + γ qt(st+1, at+1)
n-step Sarsa     q̄t = rt+1 + γrt+2 + · · · + γ^n qt(st+n, at+n)
Q-learning       q̄t = rt+1 + γ max_a qt(st+1, a)
Monte Carlo      q̄t = rt+1 + γrt+2 + γ^2 rt+3 + · · ·

Algorithm        Equation to be solved
Sarsa            BE:  qπ(s, a) = E[Rt+1 + γ qπ(St+1, At+1) | St = s, At = a]
n-step Sarsa     BE:  qπ(s, a) = E[Rt+1 + γRt+2 + · · · + γ^n qπ(St+n, At+n) | St = s, At = a]
Q-learning       BOE: q(s, a) = E[Rt+1 + γ max_a q(St+1, a) | St = s, At = a]
Monte Carlo      BE:  qπ(s, a) = E[Rt+1 + γRt+2 + γ^2 Rt+3 + · · · | St = s, At = a]

Table 7.2: A unified point of view of TD algorithms. Here, BE and BOE denote the Bellman equation
and Bellman optimality equation, respectively.

where q̄t is the TD target. Different TD algorithms have different q̄t . See Table 7.2 for a
summary. The MC learning algorithm can be viewed as a special case of (7.20): we can
set αt (st , at ) = 1 and then (7.20) becomes qt+1 (st , at ) = q̄t .
Algorithm (7.20) can be viewed as a stochastic approximation algorithm for solving
a unified equation: q(s, a) = E[q̄t |s, a]. This equation has different expressions with
different q̄t . These expressions are summarized in Table 7.2. As can be seen, all of the
algorithms aim to solve the Bellman equation except Q-learning, which aims to solve the
Bellman optimality equation.
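The unified expression can be mirrored directly in code. In the following Python sketch, the common update (7.20) is one function and each algorithm only supplies its own TD target q̄t; the array-based representation and helper names are assumptions for this illustration.

```python
import numpy as np

gamma = 0.9

def unified_update(q, s, a, q_bar, alpha=0.1):
    """The unified update (7.20): move q(s, a) toward the TD target q_bar."""
    q[s, a] -= alpha * (q[s, a] - q_bar)

# Each algorithm differs only in how the TD target q_bar is computed:
def sarsa_target(q, r, s_next, a_next):
    return r + gamma * q[s_next, a_next]

def q_learning_target(q, r, s_next):
    return r + gamma * np.max(q[s_next])

def n_step_sarsa_target(q, rewards, s_n, a_n):
    """rewards = [r_{t+1}, ..., r_{t+n}]; discounted reward sum plus the bootstrapped tail."""
    n = len(rewards)
    return sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * q[s_n, a_n]
```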

7.6 Summary
This chapter introduced an important class of reinforcement learning algorithms called
TD learning. The specific algorithms that we introduced include Sarsa, n-step Sarsa, and
Q-learning. All these algorithms can be viewed as stochastic approximation algorithms
for solving Bellman or Bellman optimality equations.
The TD algorithms introduced in this chapter, except Q-learning, are used to eval-
uate a given policy. That is to estimate a given policy’s state/action values from some
experience samples. Together with policy improvement, they can be used to learn opti-
mal policies. Moreover, these algorithms are on-policy: the target policy is used as the
behavior policy to generate experience samples.
Q-learning is slightly special compared to the other TD algorithms in the sense that
it is off-policy. The target policy can be different from the behavior policy in Q-learning.
The fundamental reason why Q-learning is off-policy is that Q-learning aims to solve the
Bellman optimality equation rather than the Bellman equation of a given policy.


It is worth mentioning that there are some methods that can convert an on-policy
algorithm to be off-policy. Importance sampling is a widely used one [3, 40] and will be
introduced in Chapter 10. Finally, there are some variants and extensions of the TD
algorithms introduced in this chapter [41–45]. For example, the TD(λ) method provides
a more general and unified framework for TD learning. More information can be found
in [3, 20, 46].

7.7 Q&A
 Q: What does the term “TD” in TD learning mean?
A: Every TD algorithm has a TD error, which represents the discrepancy between the
new sample and the current estimate. Since this discrepancy is calculated between
different time steps, it is called temporal-difference.
 Q: What does the term “learning” in TD learning mean?
A: From a mathematical point of view, “learning” simply means “estimation”. That
is to estimate state/action values from some samples and then obtain policies based
on the estimated values.
 Q: While Sarsa can estimate the action values of a given policy, how can it be used
to learn optimal policies?
A: To obtain an optimal policy, the value estimation process should interact with
the policy improvement process. That is, after a value is updated, the corresponding
policy should be updated. Then, the updated policy generates new samples that can
be used to estimate values again. This is the idea of generalized policy iteration.
 Q: Why does Sarsa update policies to be ε-greedy?
A: That is because the policy is also used to generate samples for value estimation.
Hence, it should be exploratory to generate sufficient experience samples.
 Q: While Theorems 7.1 and 7.2 require that the learning rate αt converges to zero
gradually, why is it often set to be a small constant in practice?
A: The fundamental reason is that the policy to be evaluated keeps changing (or called
nonstationary). In particular, a TD learning algorithm like Sarsa aims to estimate
the action values of a given policy. If the policy is fixed, using a decaying learning
rate is acceptable. However, in the optimal policy learning process, the policy that
Sarsa aims to evaluate keeps changing after every iteration. We need a constant
learning rate in this case; otherwise, a decaying learning rate may be too small to
effectively evaluate policies. Although a drawback of constant learning rates is that
the value estimate may fluctuate eventually, the fluctuation is negligible as long as
the constant learning rate is sufficiently small.


 Q: Should we learn the optimal policies for all states or a subset of the states?
A: It depends on the task. One may notice that some tasks considered in this chapter
(e.g., Figure 7.2) do not require finding the optimal policies for all states. Instead,
they only need to find an optimal path from a given starting state to the target state.
Such tasks are not demanding in terms of data because the agent does not need to
visit every state-action pair sufficiently many times. It, however, must be noted that
the obtained path is not guaranteed to be optimal. That is because better paths may
be missed if not all state-action pairs are well explored. Nevertheless, given sufficient
data, we can still find a good or locally optimal path.
 Q: Why is Q-learning off-policy while all the other TD algorithms in this chapter are
on-policy?
A: The fundamental reason is that Q-learning aims to solve the Bellman optimality
equation, whereas the other TD algorithms aim to solve the Bellman equation of a
given policy. Details can be found in Section 7.4.2.
 Q: Why does the off-policy version of Q-learning update policies to be greedy instead
of ε-greedy?
A: That is because the target policy is not required to generate experience samples.
Hence, it is not required to be exploratory.

Chapter 8

Value Function Approximation

[Figure 8.1 appears here: a diagram of the book's chapters and the relationships among them.]
Figure 8.1: Where we are in this book.

In this chapter, we continue to study temporal-difference learning algorithms. How-
ever, a different method is used to represent state/action values. So far in this book,
state/action values have been represented by tables. The tabular method is straightfor-
ward to understand, but it is inefficient for handling large state or action spaces. To solve
this problem, this chapter introduces the function approximation method, which has be-
come the standard way to represent values. It is also where artificial neural networks are
incorporated into reinforcement learning as function approximators. The idea of function
approximation can also be extended from representing values to representing policies, as
introduced in Chapter 9.


[Figure 8.2 appears here: n points (si, v̂(si)) fitted by a straight line v̂(s) = as + b.]

Figure 8.2: An illustration of the function approximation method. The x-axis and y-axis correspond to
s and v̂(s), respectively.

8.1 Value representation: From table to function


We next use an example to demonstrate the difference between the tabular and function
approximation methods.
Suppose that there are n states {si }ni=1 , whose state values are {vπ (si )}ni=1 . Here, π
is a given policy. Let {v̂(si )}ni=1 denote the estimates of the true state values. If we use
the tabular method, the estimated values can be maintained in the following table. This
table can be stored in memory as an array or a vector. To retrieve or update any value,
we can directly read or rewrite the corresponding entry in the table.

State s1 s2 ··· sn
Estimated value v̂(s1 ) v̂(s2 ) ··· v̂(sn )

We next show that the values in the above table can be approximated by a function.
In particular, {(si , v̂(si ))}ni=1 are shown as n points in Figure 8.2. These points can be
fitted or approximated by a curve. The simplest curve is a straight line, which can be
described as
" #
a
v̂(s, w) = as + b = [s, 1] = φT (s)w. (8.1)
|{z} b
φT (s) | {z }
w

Here, v̂(s, w) is a function for approximating vπ (s). It is determined jointly by the state s
and the parameter vector w ∈ R2 . v̂(s, w) is sometimes written as v̂w (s). Here, φ(s) ∈ R2
is called the feature vector of s.
The first notable difference between the tabular and function approximation methods
concerns how they retrieve and update a value.

 How to retrieve a value: When the values are represented by a table, if we want to
retrieve a value, we can directly read the corresponding entry in the table. However,


when the values are represented by a function, it becomes slightly more complicated
to retrieve a value. In particular, we need to input the state index s into the function
and calculate the function value (Figure 8.3). For the example in (8.1), we first need
to calculate the feature vector φ(s) and then calculate φT (s)w. If the function is
an artificial neural network, a forward propagation from the input to the output is
needed.

[Figure 8.3 appears here: the state s and the parameter w are fed into the function, which outputs v̂(s, w).]

Figure 8.3: An illustration of the process for retrieving the value of s when using the function approxi-
mation method.

The function approximation method is more efficient in terms of storage due to the
way in which the state values are retrieved. Specifically, while the tabular method
needs to store n values, we now only need to store a lower dimensional parameter
vector w. Thus, the storage efficiency can be significantly improved. Such a benefit
is, however, not free. It comes with a cost: the state values may not be accurately
represented by the function. For example, a straight line is not able to accurately fit
the points in Figure 8.2. That is why this method is called approximation. From a
fundamental point of view, some information will certainly be lost when we use a low-
dimensional vector to represent a high-dimensional dataset. Therefore, the function
approximation method enhances storage efficiency by sacrificing accuracy.
 How to update a value: When the values are represented by a table, if we want
to update one value, we can directly rewrite the corresponding entry in the table.
However, when the values are represented by a function, the way to update a value is
completely different. Specifically, we must update w to change the values indirectly.
How to update w to find optimal state values will be addressed in detail later.
Thanks to the way in which the state values are updated, the function approximation
method has another merit: its generalization ability is stronger than that of the tabular
method. The reason is as follows. When using the tabular method, we can update
a value if the corresponding state is visited in an episode. The values of the states
that have not been visited cannot be updated. However, when using the function
approximation method, we need to update w to update the value of a state. The
update of w also affects the values of some other states even though these states have
not been visited. Therefore, the experience sample for one state can generalize to help
estimate the values of some other states.
The above analysis is illustrated in Figure 8.4, where there are three states {s1 , s2 , s3 }.


Suppose that we have an experience sample for s3 and would like to update v̂(s3 ).
When using the tabular method, we can only update v̂(s3 ) without changing v̂(s1 ) or
v̂(s2 ), as shown in Figure 8.4(a). When using the function approximation method,
updating w not only can update v̂(s3) but also would change v̂(s1) and v̂(s2), as
shown in Figure 8.4(b). Therefore, the experience sample of s3 can help update the
values of its neighboring states.

[Figure 8.4 appears here. (a) Tabular method: when one value is updated, the other values remain the same. (b) Function approximation method: when a value is updated by changing w, the values of the neighboring states are also changed.]

Figure 8.4: An illustration of how to update the value of a state.

We can use more complex functions that have stronger approximation abilities than
straight lines. For example, consider a second-order polynomial:
 
v̂(s, w) = as^2 + bs + c = [s^2, s, 1][a, b, c]^T = φ^T(s)w.    (8.2)

We can use even higher-order polynomial curves to fit the points. As the order of the
curve increases, the approximation accuracy can be improved, but the dimension of the
parameter vector also increases, requiring more storage and computational resources.
Note that v̂(s, w) in either (8.1) or (8.2) is linear in w (though it may be nonlinear
in s). This type of method is called linear function approximation, which is the simplest
function approximation method. To realize linear function approximation, we need to
select an appropriate feature vector φ(s). That is, we must decide, for example, whether
we should use a first-order straight line or a second-order curve to fit the points. The
selection of appropriate feature vectors is nontrivial. It requires prior knowledge of the
given task: the better we understand the task, the better the feature vectors we can select.
For instance, if we know that the points in Figure 8.2 are approximately located on a


straight line, we can use a straight line to fit the points. However, such prior knowledge
is usually unknown in practice. If we do not have any prior knowledge, a popular solution
is to use artificial neural networks as nonlinear function approximations.
Another important problem is how to find the optimal parameter vector. If we know
{vπ (si )}ni=1 , this is a least-squares problem. The optimal parameter can be obtained by
optimizing the following objective function:
J1 = Σ_{i=1}^n (v̂(si, w) − vπ(si))^2 = Σ_{i=1}^n (φ^T(si)w − vπ(si))^2 = ‖Φw − vπ‖^2,

where

Φ = [φ(s1), . . . , φ(sn)]^T ∈ R^{n×2},    vπ = [vπ(s1), . . . , vπ(sn)]^T ∈ R^n.

It can be verified that the optimal solution to this least-squares problem is

w* = (Φ^T Φ)^{−1} Φ^T vπ.

More information about least-squares problems can be found in [47, Section 3.3] and
[48, Section 5.14].
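As a small numerical illustration of this least-squares solution, the following Python sketch fits the straight-line approximator in (8.1) to a handful of known state values. The state values used here are arbitrary numbers chosen only for the example.

```python
import numpy as np

# Hypothetical true state values v_pi(s_i) for n = 5 states indexed by s = 1, ..., 5.
v_pi = np.array([-4.0, -3.6, -3.1, -2.7, -2.2])
s = np.arange(1, 6, dtype=float)

# Feature matrix Phi whose i-th row is phi^T(s_i) = [s_i, 1], as in (8.1).
Phi = np.column_stack([s, np.ones_like(s)])

# Closed-form solution w* = (Phi^T Phi)^{-1} Phi^T v_pi.
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ v_pi)

v_hat = Phi @ w_star        # values of the fitted line at s = 1, ..., 5
print(w_star, v_hat)
```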
The curve-fitting example presented in this section illustrates the basic idea of value
function approximation. This idea will be formally introduced in the next section.

8.2 TD learning of state values based on function approximation
In this section, we show how to integrate the function approximation method into TD
learning to estimate the state values of a given policy. This algorithm will be extended
to learn action values and optimal policies in Section 8.3.
This section contains quite a few subsections and many coherent contents. It is better
for us to review the contents first before diving into the details.

 The function approximation method is formulated as an optimization problem. The
objective function of this problem is introduced in Section 8.2.1. The TD learning
algorithm for optimizing this objective function is introduced in Section 8.2.2.


 To apply the TD learning algorithm, we need to select appropriate feature vectors.


Section 8.2.3 discusses this problem.
 Examples are given in Section 8.2.4 to demonstrate the TD algorithm and the impacts
of different feature vectors.
 A theoretical analysis of the TD algorithm is given in Section 8.2.5. This subsection
is mathematically intensive. Readers may read it selectively based on their interests.

8.2.1 Objective function


Let vπ (s) and v̂(s, w) be the true state value and approximated state value of s ∈ S,
respectively. The problem to be solved is to find an optimal w so that v̂(s, w) can best
approximate vπ (s) for every s. In particular, the objective function is

J(w) = E[(vπ (S) − v̂(S, w))2 ], (8.3)

where the expectation is calculated with respect to the random variable S ∈ S. While S
is a random variable, what is its probability distribution? This question is important for
understanding this objective function. There are several ways to define the probability
distribution of S.

 The first way is to use a uniform distribution. That is to treat all the states as equally
important by setting the probability of each state to 1/n. In this case, the objective
function in (8.3) becomes

J(w) = (1/n) Σ_{s∈S} (vπ(s) − v̂(s, w))^2,    (8.4)

which is the average value of the approximation errors of all the states. However, this
way does not consider the real dynamics of the Markov process under the given policy.
Since some states may be rarely visited by a policy, it may be unreasonable to treat
all the states as equally important.
 The second way, which is the focus of this chapter, is to use the stationary distribution.
The stationary distribution describes the long-term behavior of a Markov decision
process. More specifically, after the agent executes a given policy for a sufficiently
long period, the probability of the agent being located at any state can be described
by this stationary distribution. Interested readers may see the details in Box 8.1.
Let {dπ (s)}s∈S denote the stationary distribution of the Markov process under policy
π. That is, the probability for the agent visiting s after a long period of time is dπ (s).
By definition, Σ_{s∈S} dπ(s) = 1. Then, the objective function in (8.3) can be rewritten


as
J(w) = Σ_{s∈S} dπ(s)(vπ(s) − v̂(s, w))^2,    (8.5)

which is a weighted average of the approximation errors. The states that have higher
probabilities of being visited are given greater weights.

It is notable that the value of dπ (s) is nontrivial to obtain because it requires knowing
the state transition probability matrix Pπ (see Box 8.1). Fortunately, we do not need to
calculate the specific value of dπ (s) to minimize this objective function as shown in the
next subsection. In addition, it was assumed that the number of states was finite when
we introduced (8.4) and (8.5). When the state space is continuous, we can replace the
summations with integrals.

Box 8.1: Stationary distribution of a Markov decision process

The key tool for analyzing stationary distribution is Pπ ∈ Rn×n , which is the probabil-
ity transition matrix under the given policy π. If the states are indexed as s1 , . . . , sn ,
then [Pπ ]ij is defined as the probability for the agent moving from si to sj . The
definition of Pπ can be found in Section 2.6.
 Interpretation of Pπk (k = 1, 2, 3, . . . ).
First of all, it is necessary to examine the interpretation of the entries in Pπk . The
probability of the agent transitioning from si to sj using exactly k steps is denoted
as

p_{ij}^{(k)} = Pr(S_{t_k} = j | S_{t_0} = i),

where t0 and tk are the initial and kth time steps, respectively. First, by the
definition of Pπ , we have

[Pπ]_{ij} = p_{ij}^{(1)},

which means that [Pπ ]ij is the probability of transitioning from si to sj using a
single step. Second, consider Pπ2 . It can be verified that
[Pπ^2]_{ij} = [Pπ Pπ]_{ij} = Σ_{q=1}^n [Pπ]_{iq} [Pπ]_{qj}.

Since [Pπ ]iq [Pπ ]qj is the joint probability of transitioning from si to sq and then
from sq to sj , we know that [Pπ2 ]ij is the probability of transitioning from si to sj


using exactly two steps. That is

[Pπ^2]_{ij} = p_{ij}^{(2)}.

Similarly, we know that

[Pπ^k]_{ij} = p_{ij}^{(k)},

which means that [Pπk ]ij is the probability of transitioning from si to sj using
exactly k steps.
 Definition of stationary distributions.
Let d0 ∈ Rn be a vector representing the probability distribution of the states at
the initial time step. For example, if s is always selected as the starting state,
then d0 (s) = 1 and the other entries of d0 are 0. Let dk ∈ Rn be the vector
representing the probability distribution obtained after exactly k steps starting
from d0 . Then, we have
dk(si) = Σ_{j=1}^n d0(sj)[Pπ^k]_{ji},    i = 1, 2, . . . .    (8.6)

This equation indicates that the probability of the agent visiting si at step k
equals the sum of the probabilities of the agent transitioning from {sj }nj=1 to si
using exactly k steps. The matrix-vector form of (8.6) is

dTk = dT0 Pπk . (8.7)

When we consider the long-term behavior of the Markov process, it holds under
certain conditions that

lim_{k→∞} Pπ^k = 1_n dπ^T,    (8.8)

where 1n = [1, . . . , 1]T ∈ Rn and 1n dTπ is a constant matrix with all its rows
equal to dTπ . The conditions under which (8.8) is valid will be discussed later.
Substituting (8.8) into (8.7) yields

lim_{k→∞} dk^T = d0^T lim_{k→∞} Pπ^k = d0^T 1_n dπ^T = dπ^T,    (8.9)

where the last equality is valid because dT0 1n = 1.


Equation (8.9) means that the state distribution dk converges to a constant value
dπ , which is called the limiting distribution. The limiting distribution depends


on the system model and the policy π. Interestingly, it is independent of the
initial distribution d0. That is, regardless of which state the agent starts from,
the probability distribution of the agent after a sufficiently long period can always
be described by the limiting distribution.
The value of dπ can be calculated in the following way. Taking the limit of both
sides of dTk = dTk−1 Pπ gives limk→∞ dTk = limk→∞ dTk−1 Pπ and hence

dTπ = dTπ Pπ . (8.10)

As a result, dπ is the left eigenvector of Pπ associated with the eigenvalue 1. The
solution of (8.10) is called the stationary distribution. It holds that Σ_{s∈S} dπ(s) =
1 and dπ(s) > 0 for all s ∈ S. The reason why dπ(s) > 0 (not dπ(s) ≥ 0) will be
explained later.
 Conditions for the uniqueness of stationary distributions.
The solution dπ of (8.10) is usually called a stationary distribution, whereas the
distribution dπ in (8.9) is usually called the limiting distribution. Note that (8.9)
implies (8.10), but the converse may not be true. A general class of Markov
processes that have unique stationary (or limiting) distributions is irreducible (or
regular ) Markov processes. Some necessary definitions are given below. More
details can be found in [49, Chapter IV].

- State sj is said to be accessible from state si if there exists a finite integer k so
that [Pπ^k]_{ij} > 0, which means that the agent starting from si can always reach
sj after a finite number of transitions.
- If two states si and sj are mutually accessible, then the two states are said to
communicate.
- A Markov process is called irreducible if all of its states communicate with
each other. In other words, the agent starting from an arbitrary state can
always reach any other state within a finite number of steps. Mathematically,
it indicates that, for any si and sj , there exists k ≥ 1 such that [Pπk ]ij > 0 (the
value of k may vary for different i, j).
- A Markov process is called regular if there exists k ≥ 1 such that [Pπk ]ij > 0
for all i, j. Equivalently, there exists k ≥ 1 such that Pπk > 0, where > is
elementwise. As a result, every state is reachable from any other state within
at most k steps. A regular Markov process is also irreducible, but the converse
is not true. However, if a Markov process is irreducible and there exists i such
that [Pπ]_{ii} > 0, then it is also regular. Moreover, if Pπ^k > 0, then Pπ^{k′} > 0 for
any k′ ≥ k since Pπ ≥ 0. It then follows from (8.9) that dπ(s) > 0 for every s.


 Policies that may lead to unique stationary distributions.


Once the policy is given, a Markov decision process becomes a Markov process,
whose long-term behavior is jointly determined by the given policy and the system
model. Then, an important question is what kind of policies can lead to regular
Markov processes? In general, the answer is exploratory policies such as ε-greedy
policies. That is because an exploratory policy has a positive probability of taking
any action at any state. As a result, the states can communicate with each other
when the system model allows them to do so.
 An example is given in Figure 8.5 to illustrate stationary distributions. The policy
in this example is ε-greedy with ε = 0.5. The states are indexed as s1, s2, s3, s4,
which correspond to the top-left, top-right, bottom-left, and bottom-right cells in
the grid, respectively.
We compare two methods to calculate the stationary distributions. The first
method is to solve (8.10) to get the theoretical value of dπ . The second method is
to estimate dπ numerically: we start from an arbitrary initial state and generate a
sufficiently long episode by following the given policy. Then, dπ can be estimated
by the ratio between the number of times each state is visited in the episode and
the total length of the episode. The estimation result is more accurate when the
episode is longer. We next compare the theoretical and estimated results.

[Figure 8.5 appears here: left, the 2×2 grid world; right, the percentage of visits of each state s1, . . . , s4 versus the step index.]

Figure 8.5: Long-term behavior of an ε-greedy policy with ε = 0.5. The asterisks in the right
figure represent the theoretical values of the elements of dπ.

- Theoretical value of dπ: It can be verified that the Markov process induced
by the policy is both irreducible and regular. That is due to the following
reasons. First, since all the states communicate, the resulting Markov process
is irreducible. Second, since every state can transition to itself, the resulting


Markov process is regular. It can be seen from Figure 8.5 that

Pπ^T = [ 0.3  0.1  0.1  0
         0.1  0.3  0    0.1
         0.6  0    0.3  0.1
         0    0.6  0.6  0.8 ].

The eigenvalues of PπT can be calculated as {−0.0449, 0.3, 0.4449, 1}. The
unit-length (right) eigenvector of PπT corresponding to the eigenvalue 1 is
[0.0463, 0.1455, 0.1785, 0.9720]T . After scaling this vector so that the sum of
all its elements is equal to 1, we obtain the theoretical value of dπ as follows:
 
dπ = [0.0345, 0.1084, 0.1330, 0.7241]^T.

The ith element of dπ corresponds to the probability of the agent visiting si
in the long run. A numerical sketch of this eigenvector computation is given
after this box.
- Estimated value of dπ : We next verify the above theoretical value of dπ by
executing the policy for sufficiently many steps in the simulation. Specifically,
we select s1 as the starting state and run 1,000 steps by following the policy.
The proportion of the visits of each state during the process is shown in Fig-
ure 8.5. It can be seen that the proportions converge to the theoretical value
of dπ after hundreds of steps.
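The eigenvector computation described above can be reproduced with a few lines of NumPy. The sketch below uses the transition matrix of this example and recovers the theoretical dπ.

```python
import numpy as np

# Transpose of the state transition matrix P_pi from this example.
P_pi_T = np.array([
    [0.3, 0.1, 0.1, 0.0],
    [0.1, 0.3, 0.0, 0.1],
    [0.6, 0.0, 0.3, 0.1],
    [0.0, 0.6, 0.6, 0.8],
])

# d_pi^T = d_pi^T P_pi, so d_pi is the (right) eigenvector of P_pi^T with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi_T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_pi /= d_pi.sum()    # scale so that the entries sum to one

print(d_pi)           # approximately [0.0345, 0.1084, 0.1330, 0.7241]
```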

8.2.2 Optimization algorithms


To minimize the objective function J(w) in (8.3), we can use the gradient descent algo-
rithm:

wk+1 = wk − αk ∇w J(wk ),

where

∇w J(wk ) = ∇w E[(vπ (S) − v̂(S, wk ))2 ],


= E[∇w (vπ (S) − v̂(S, wk ))2 ]
= 2E[(vπ (S) − v̂(S, wk ))(−∇w v̂(S, wk ))]
= −2E[(vπ (S) − v̂(S, wk ))∇w v̂(S, wk )].


Therefore, the gradient descent algorithm is

wk+1 = wk + 2αk E[(vπ (S) − v̂(S, wk ))∇w v̂(S, wk )], (8.11)

where the coefficient 2 before αk can be merged into αk without loss of generality. The
algorithm in (8.11) requires calculating the expectation. In the spirit of stochastic gra-
dient descent, we can replace the true gradient with a stochastic gradient. Then, (8.11)
becomes

wt+1 = wt + αt (vπ(st) − v̂(st, wt)) ∇w v̂(st, wt),    (8.12)

where st is a sample of S at time t.


Notably, (8.12) is not implementable because it requires the true state value vπ , which
is unknown and must be estimated. We can replace vπ (st ) with an approximation to make
the algorithm implementable. The following two methods can be used to do so.

 Monte Carlo method: Suppose that we have an episode (s0, r1, s1, r2, . . . ). Let gt be
the discounted return starting from st. Then, gt can be used as an approximation of
vπ(st). The algorithm in (8.12) becomes

wt+1 = wt + αt (gt − v̂(st, wt)) ∇w v̂(st, wt).

This is the algorithm of Monte Carlo learning with function approximation.


 Temporal-difference method: In the spirit of TD learning, rt+1 + γv̂(st+1 , wt ) can be
used as an approximation of vπ (st ). The algorithm in (8.12) becomes

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt ). (8.13)

This is the algorithm of TD learning with function approximation. This algorithm is
summarized in Algorithm 8.1.

Understanding the TD algorithm in (8.13) is important for studying the other algo-
rithms in this chapter. Notably, (8.13) can only learn the state values of a given policy.
It will be extended to algorithms that can learn action values in Sections 8.3.1 and 8.3.2.

8.2.3 Selection of function approximators


To apply the TD algorithm in (8.13), we need to select appropriate v̂(s, w). There are two
ways to do that. The first is to use an artificial neural network as a nonlinear function
approximator. The input of the neural network is the state, the output is v̂(s, w), and
the network parameter is w. The second is to simply use a linear function:

v̂(s, w) = φT (s)w,


Algorithm 8.1: TD learning of state values with function approximation

Initialization: A function v̂(s, w) that is differentiable in w. Initial parameter w0.
Goal: Learn the true state values of a given policy π.
For each episode {(st, rt+1, st+1)}_t generated by π, do
    For each sample (st, rt+1, st+1), do
        In the general case, wt+1 = wt + αt[rt+1 + γv̂(st+1, wt) − v̂(st, wt)] ∇w v̂(st, wt)
        In the linear case, wt+1 = wt + αt[rt+1 + γφ^T(st+1)wt − φ^T(st)wt] φ(st)

where φ(s) ∈ Rm is the feature vector of s. The lengths of φ(s) and w are equal to m,
which is usually much smaller than the number of states. In the linear case, the gradient
is
∇w v̂(s, w) = φ(s),

Substituting this into (8.13) yields

wt+1 = wt + αt[rt+1 + γφ^T(st+1)wt − φ^T(st)wt] φ(st).    (8.14)

This is the algorithm of TD learning with linear function approximation. We call it
TD-Linear for short.
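A minimal Python sketch of one TD-Linear update according to (8.14) is given below. The particular feature vector φ(s) = [1, x, y]^T is an assumption borrowed from the examples in Section 8.2.4; any feature map could be used instead.

```python
import numpy as np

gamma, alpha = 0.9, 0.0005

def phi(s):
    """Assumed linear feature vector: phi(s) = [1, x, y] with (x, y) the normalized location of s."""
    x, y = s
    return np.array([1.0, x, y])

def td_linear_update(w, s, r, s_next):
    """One application of (8.14) for the sample (s_t, r_{t+1}, s_{t+1})."""
    td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
    return w + alpha * td_error * phi(s)

# Example usage: process one sample with an initial parameter vector.
w = np.zeros(3)
w = td_linear_update(w, s=(0.2, 0.4), r=-1.0, s_next=(0.2, 0.6))
```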
The linear case is much better understood in theory than the nonlinear case. However,
its approximation ability is limited. It is also nontrivial to select appropriate feature
vectors for complex tasks. By contrast, artificial neural networks can approximate values
as black-box universal nonlinear approximators, which are more friendly to use.
Nevertheless, it is still meaningful to study the linear case. A better understanding
of the linear case can help readers better grasp the idea of the function approximation
method. Moreover, the linear case is sufficient for solving the simple grid world tasks
considered in this book. More importantly, the linear case is still powerful in the sense
that the tabular method can be viewed as a special linear case. More information can be
found in Box 8.2.

Box 8.2: Tabular TD learning is a special case of TD-Linear

We next show that the tabular TD algorithm in (7.1) in Chapter 7 is a special case
of the TD-Linear algorithm in (8.14).
Consider the following special feature vector for any s ∈ S:

φ(s) = es ∈ Rn ,

where es is the vector with the entry corresponding to s equal to 1 and the other


entries equal to 0. In this case,

v̂(s, w) = eTs w = w(s),

where w(s) is the entry in w that corresponds to s. Substituting the above equation
into (8.14) yields

wt+1 = wt + αt[rt+1 + γwt(st+1) − wt(st)] e_{st}.

The above equation merely updates the entry wt (st ) due to the definition of est .
Motivated by this, multiplying eTst on both sides of the equation yields

wt+1(st) = wt(st) + αt[rt+1 + γwt(st+1) − wt(st)],

which is exactly the tabular TD algorithm in (7.1).


In summary, by selecting the feature vector as φ(s) = es , the TD-Linear algorithm
becomes the tabular TD algorithm.
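A quick numerical check of this equivalence is sketched below, with an assumed four-state example: applying the TD-Linear update with the one-hot feature e_s changes only the entry of w corresponding to the visited state, exactly as the tabular TD update does.

```python
import numpy as np

n_states, gamma, alpha = 4, 0.9, 0.1

def e(s):
    """One-hot feature vector phi(s) = e_s."""
    v = np.zeros(n_states)
    v[s] = 1.0
    return v

w = np.zeros(n_states)            # with one-hot features, w(s) is exactly the tabular value of s
s, r, s_next = 1, -1.0, 2         # an arbitrary sample (s_t, r_{t+1}, s_{t+1})

# TD-Linear update (8.14) with phi(s) = e_s ...
w_linear = w + alpha * (r + gamma * e(s_next) @ w - e(s) @ w) * e(s)

# ... coincides with the tabular TD update (7.1), which modifies only w(s_t).
w_tabular = w.copy()
w_tabular[s] += alpha * (r + gamma * w[s_next] - w[s])

print(np.allclose(w_linear, w_tabular))   # True
```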

8.2.4 Illustrative examples


We next present some examples for demonstrating how to use the TD-Linear algorithm
in (8.14) to estimate the state values of a given policy. In the meantime, we demonstrate
how to select feature vectors.

[Figure 8.6 appears here with panels (a), (b), and (c).]

Figure 8.6: (a) The policy to be evaluated. (b) The true state values are represented as a table. (c) The
true state values are represented as a 3D surface.

The grid world example is shown in Figure 8.6. The given policy takes any action at a
state with a probability of 0.2. Our goal is to estimate the state values under this policy.
There are 25 state values in total. The true state values are shown in Figure 8.6(b). The
true state values are visualized as a three-dimensional surface in Figure 8.6(c).


We next show that we can use fewer than 25 parameters to approximate these state
values. The simulation setup is as follows. Five hundred episodes are generated by the
given policy. Each episode has 500 steps and starts from a randomly selected state-action
pair following a uniform distribution. In addition, in each simulation trial, the parameter
vector w is randomly initialized such that each element is drawn from a standard normal
distribution with a zero mean and a standard deviation of 1. We set rforbidden = rboundary =
−1, rtarget = 1, and γ = 0.9.
To implement the TD-Linear algorithm, we need to select the feature vector φ(s) first.
There are different ways to do that as shown below.

 The first type of feature vector is based on polynomials. In the grid world example, a
state s corresponds to a 2D location. Let x and y denote the column and row indexes
of s, respectively. To avoid numerical issues, we normalize x and y so that their values
are within the interval of [−1, +1]. With a slight abuse of notation, the normalized
values are also represented by x and y. Then, the simplest feature vector is
" #
x
φ(s) = ∈ R2 .
y

In this case, we have

v̂(s, w) = φ^T(s)w = [x, y][w1, w2]^T = w1 x + w2 y.

When w is given, v̂(s, w) = w1 x + w2 y represents a 2D plane that passes through the
origin. Since the surface of the state values may not pass through the origin, we need
to introduce a bias to the 2D plane to better approximate the state values. To do
that, we consider the following 3D feature vector:

φ(s) = [1, x, y]^T ∈ R^3.    (8.15)

In this case, the approximated state value is

v̂(s, w) = φ^T(s)w = [1, x, y][w1, w2, w3]^T = w1 + w2 x + w3 y.

When w is given, v̂(s, w) corresponds to a plane that may not pass through the origin.
Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T , where the order of the elements
does not matter.
The estimation result when we use the feature vector in (8.15) is shown in Fig-


ure 8.7(a). It can be seen that the estimated state values form a 2D plane. Although
the estimation error converges as more episodes are used, the error cannot decrease
to zero due to the limited approximation ability of a 2D plane.
To enhance the approximation ability, we can increase the dimension of the feature
vector. To that end, consider

φ(s) = [1, x, y, x^2, y^2, xy]^T ∈ R^6.    (8.16)

In this case, v̂(s, w) = φ^T(s)w = w1 + w2 x + w3 y + w4 x^2 + w5 y^2 + w6 xy, which
corresponds to a quadratic 3D surface. We can further increase the dimension of the
feature vector:

φ(s) = [1, x, y, x^2, y^2, xy, x^3, y^3, x^2 y, xy^2]^T ∈ R^10.    (8.17)

The estimation results when we use the feature vectors in (8.16) and (8.17) are shown
in Figures 8.7(b)-(c). As can be seen, the longer the feature vector is, the more
accurately the state values can be approximated. However, in all three cases, the
estimation error cannot converge to zero because these linear approximators still have
limited approximation abilities.

[Figure 8.7 appears here: the state value error (RMSE) versus the episode index for (a) φ(s) ∈ R^3, (b) φ(s) ∈ R^6, and (c) φ(s) ∈ R^10.]

Figure 8.7: TD-Linear estimation results obtained with the polynomial features in (8.15), (8.16), and
(8.17).

 In addition to polynomial feature vectors, many other types of features are available
such as Fourier basis and tile coding [3, Chapter 9]. First, the values of x and y of


[Figure 8.8 appears here: the state value error (RMSE) versus the episode index for (a) q = 1, φ(s) ∈ R^4; (b) q = 2, φ(s) ∈ R^9; (c) q = 3, φ(s) ∈ R^16.]

Figure 8.8: TD-Linear estimation results obtained with the Fourier features in (8.18).

each state are normalized to the interval of [0, 1]. The resulting feature vector is

φ(s) = [ · · · , cos(π(c1 x + c2 y)), · · · ]^T ∈ R^{(q+1)^2},    (8.18)

where π denotes the mathematical constant 3.1415 . . . rather than a policy. Here, c1 and
c2 can each be set to any integer in {0, 1, . . . , q}, where q is a user-specified integer.
As a result, there are (q + 1)2 possible values for the pair (c1 , c2 ) to take. Hence, the
dimension of φ(s) is (q + 1)^2. For example, in the case of q = 1, the feature vector is

φ(s) = [cos(π(0x + 0y)), cos(π(0x + 1y)), cos(π(1x + 0y)), cos(π(1x + 1y))]^T
     = [1, cos(πy), cos(πx), cos(π(x + y))]^T ∈ R^4.

The estimation results obtained when we use the Fourier features with q = 1, 2, 3 are
shown in Figure 8.8. The dimensions of the feature vectors in the three cases are
4, 9, 16, respectively. As can be seen, the higher the dimension of the feature vector
is, the more accurately the state values can be approximated.
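The polynomial and Fourier feature vectors used above are easy to generate programmatically. The following Python sketch constructs both for a state whose normalized 2D location (x, y) is given; the helper names here are ours, not part of any library.

```python
import numpy as np
from itertools import product

def polynomial_features(x, y, order=1):
    """All monomials x^i * y^j with i + j <= order; order = 1, 2, 3 gives the
    3-, 6-, and 10-dimensional feature vectors of (8.15), (8.16), and (8.17)."""
    return np.array([x**i * y**j
                     for i in range(order + 1)
                     for j in range(order + 1 - i)])

def fourier_features(x, y, q=1):
    """Fourier features cos(pi * (c1*x + c2*y)) for c1, c2 in {0, ..., q}, as in (8.18)."""
    return np.array([np.cos(np.pi * (c1 * x + c2 * y))
                     for c1, c2 in product(range(q + 1), repeat=2)])

# Example: features of a state whose normalized location is (x, y) = (0.5, 0.25).
print(polynomial_features(0.5, 0.25, order=1))   # 3 features (the ordering of entries does not matter)
print(fourier_features(0.5, 0.25, q=2))          # (q + 1)^2 = 9 features
```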

8.2.5 Theoretical analysis


Thus far, we have finished describing the story of TD learning with function approxima-
tion. This story started from the objective function in (8.3). To optimize this objective


function, we introduced the stochastic algorithm in (8.12). Later, the true value function
in the algorithm, which was unknown, was replaced by an approximation, leading to the
TD algorithm in (8.13). Although this story is helpful for understanding the basic idea
of value function approximation, it is not mathematically rigorous. For example, the
algorithm in (8.13) actually does not minimize the objective function in (8.3).
We next present a theoretical analysis of the TD algorithm in (8.13) to reveal why
the algorithm works effectively and what mathematical problems it solves. Since gen-
eral nonlinear approximators are difficult to analyze, this part only considers the linear
case. Readers are advised to read selectively based on their interests since this part is
mathematically intensive.

Convergence analysis

To study the convergence property of (8.13), we first consider the following deterministic
algorithm:
wt+1 = wt + αt E[(rt+1 + γφ^T(st+1)wt − φ^T(st)wt) φ(st)],    (8.19)

where the expectation is calculated with respect to the random variables st , st+1 , rt+1 .
The distribution of st is assumed to be the stationary distribution dπ . The algorithm
in (8.19) is deterministic because the random variables st , st+1 , rt+1 all disappear after
calculating the expectation.
Why would we consider this deterministic algorithm? First, the convergence of this
deterministic algorithm is easier (though nontrivial) to analyze. Second and more im-
portantly, the convergence of this deterministic algorithm implies the convergence of the
stochastic TD algorithm in (8.13). That is because (8.13) can be viewed as a stochastic
gradient descent (SGD) implementation of (8.19). Therefore, we only need to study the
convergence property of the deterministic algorithm.
Although the expression of (8.19) may look complex at first glance, it can be greatly
simplified. To do that, define

.. ..
   
. .
 ∈ Rn×m ,  ∈ Rn×n ,
   
Φ= T D= (8.20)
φ (s) dπ (s)
..
   
...
.

where Φ is the matrix containing all the feature vectors, and D is a diagonal matrix with
the stationary distribution in its diagonal entries. The two matrices will be frequently
used.

Lemma 8.1. The expectation in (8.19) can be rewritten as


E[(rt+1 + γφ^T(st+1)wt − φ^T(st)wt) φ(st)] = b − Awt,



where

A ≐ Φ^T D(I − γPπ)Φ ∈ R^{m×m},    b ≐ Φ^T D rπ ∈ R^m.    (8.21)

Here, Pπ , rπ are the two terms in the Bellman equation vπ = rπ + γPπ vπ , and I is the
identity matrix with appropriate dimensions.

The proof is given in Box 8.3. With the expression in Lemma 8.1, the deterministic
algorithm in (8.19) can be rewritten as

wt+1 = wt + αt (b − Awt ), (8.22)

which is a simple deterministic process. Its convergence is analyzed below.


First, what is the converged value of wt ? Hypothetically, if wt converges to a constant
value w∗ as t → ∞, then (8.22) implies w∗ = w∗ + α∞ (b − Aw∗ ), which suggests that
b − Aw∗ = 0 and hence
w∗ = A−1 b.

Several remarks about this converged value are given below.

 Is A invertible? The answer is yes. In fact, A is not only invertible but also positive
definite. That is, for any nonzero vector x with appropriate dimensions, xT Ax > 0.
The proof is given in Box 8.4.
 What is the interpretation of w∗ = A−1 b? It is actually the optimal solution for min-
imizing the projected Bellman error. The details will be introduced in Section 8.2.5.
 The tabular method is a special case. One interesting result is that, when the di-
mensionality of w equals n = |S| and φ(s) = [0, . . . , 1, . . . , 0]T , where the entry corre-
sponding to s is 1, we have

w∗ = A−1 b = vπ . (8.23)

This equation indicates that the parameter vector to be learned is actually the true
state value vector. This conclusion is consistent with the fact that the tabular TD algorithm
is a special case of the TD-Linear algorithm, as introduced in Box 8.2. The proof
of (8.23) is given below. It can be verified that Φ = I in this case and hence A =
ΦT D(I − γPπ )Φ = D(I − γPπ ) and b = ΦT Drπ = Drπ . Thus, w∗ = A−1 b =
(I − γPπ )−1 D−1 Drπ = (I − γPπ )−1 rπ = vπ . A small numerical check of these facts is given after this list.
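To make these remarks concrete, the following is a small numerical sketch. The Markov reward process below is randomly generated purely for illustration, and one-hot features are used so that the tabular special case (8.23) can be checked directly; positive definiteness of A (proven later in Box 8.4) is also checked numerically.

```python
import numpy as np

# A small numerical check of the remarks above: A is positive definite, and with
# one-hot (tabular) features, w* = A^{-1} b equals the true state value vector v_pi.
# The Markov reward process is randomly generated for illustration only.
rng = np.random.default_rng(0)
n, gamma = 5, 0.9

P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic P_pi
r_pi = rng.random(n)
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)               # v_pi = r_pi + gamma*P_pi*v_pi

# Stationary distribution d_pi (left eigenvector of P_pi for eigenvalue 1) and D in (8.20).
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))]); d_pi /= d_pi.sum()
D = np.diag(d_pi)

Phi = np.eye(n)                                    # one-hot features: the tabular special case
A = Phi.T @ D @ (np.eye(n) - gamma * P_pi) @ Phi   # A and b as in (8.21)
b = Phi.T @ D @ r_pi

sym = (A + A.T) / 2
print(np.linalg.eigvalsh(sym).min() > 0)           # A is positive definite (see Box 8.4)
print(np.allclose(np.linalg.solve(A, b), v_pi))    # (8.23): w* = A^{-1} b = v_pi
```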

Second, we prove that wt in (8.22) converges to w∗ = A−1 b as t → ∞. Since (8.22) is


a simple deterministic process, it can be proven in many ways. We present two proofs as
follows.


 Proof 1: Define the convergence error as δt ≜ wt − w∗ . We only need to show that δt
converges to zero. To do that, substituting wt = δt + w∗ into (8.22) gives

δt+1 = δt − αt Aδt = (I − αt A)δt .

It then follows that


δt+1 = (I − αt A) · · · (I − α0 A)δ0 .

Consider the simple case where αt = α for all t. Then, we have

‖δt+1 ‖2 ≤ ‖I − αA‖2^{t+1} ‖δ0 ‖2 .

When α > 0 is sufficiently small, we have that ‖I − αA‖2 < 1 and hence δt → 0 as
t → ∞. The reason why ‖I − αA‖2 < 1 holds is that A is positive definite (so is
A + AT ), and hence ‖I − αA‖2^2 = λmax (I − α(A + AT ) + α^2 AT A) < 1 when α is
sufficiently small.
 Proof 2: Consider g(w) ≜ b − Aw. Since w∗ is the root of g(w) = 0, the task is actually
a root-finding problem. The algorithm in (8.22) is actually a Robbins-Monro (RM)
algorithm. Although the original RM algorithm was designed for stochastic processes,
it can also be applied to deterministic cases. The convergence of RM algorithms can
shed light on the convergence of wt+1 = wt + αt (b − Awt ). That is, wt converges to
w∗ when Σt αt = ∞ and Σt αt^2 < ∞.

Box 8.3: Proof of Lemma 8.1


By using the law of total expectation, we have
E[ rt+1 φ(st ) + φ(st )(γφT (st+1 ) − φT (st ))wt ]
    = Σ_{s∈S} dπ (s) E[ rt+1 φ(st ) + φ(st )(γφT (st+1 ) − φT (st ))wt | st = s ]
    = Σ_{s∈S} dπ (s) E[ rt+1 φ(st ) | st = s ] + Σ_{s∈S} dπ (s) E[ φ(st )(γφT (st+1 ) − φT (st ))wt | st = s ].
                                                                                        (8.24)

Here, st is assumed to obey the stationary distribution dπ .


First, consider the first term in (8.24). Note that
E[ rt+1 φ(st ) | st = s ] = φ(s) E[ rt+1 | st = s ] = φ(s)rπ (s),

where rπ (s) = Σ_{a} π(a|s) Σ_{r} r p(r|s, a). Then, the first term in (8.24) can be rewritten as

Σ_{s∈S} dπ (s) E[ rt+1 φ(st ) | st = s ] = Σ_{s∈S} dπ (s)φ(s)rπ (s) = ΦT Drπ ,        (8.25)

where rπ = [· · · , rπ (s), · · · ]T ∈ Rn .
Second, consider the second term in (8.24). Since
E[ φ(st )(γφT (st+1 ) − φT (st ))wt | st = s ]
    = −E[ φ(st )φT (st )wt | st = s ] + E[ γφ(st )φT (st+1 )wt | st = s ]
    = −φ(s)φT (s)wt + γφ(s) E[ φT (st+1 ) | st = s ] wt
    = −φ(s)φT (s)wt + γφ(s) Σ_{s′∈S} p(s′|s)φT (s′)wt ,

the second term in (8.24) becomes


Σ_{s∈S} dπ (s) E[ φ(st )(γφT (st+1 ) − φT (st ))wt | st = s ]
    = Σ_{s∈S} dπ (s) [ −φ(s)φT (s)wt + γφ(s) Σ_{s′∈S} p(s′|s)φT (s′)wt ]
    = Σ_{s∈S} dπ (s)φ(s) [ −φ(s) + γ Σ_{s′∈S} p(s′|s)φ(s′) ]T wt
    = ΦT D(−Φ + γPπ Φ)wt
    = −ΦT D(I − γPπ )Φwt .        (8.26)

Combining (8.25) and (8.26) gives


E[ (rt+1 + γφT (st+1 )wt − φT (st )wt ) φ(st ) ] = ΦT Drπ − ΦT D(I − γPπ )Φwt
                                                ≜ b − Awt ,        (8.27)

where b ≜ ΦT Drπ and A ≜ ΦT D(I − γPπ )Φ.

Box 8.4: Proving that A = ΦT D(I −γPπ )Φ is invertible and positive definite.

The matrix A is positive definite if xT Ax > 0 for any nonzero vector x with ap-
propriate dimensions. If A is positive (or negative) definite, it is denoted as A ≻ 0
(or A ≺ 0). Here, ≻ and ≺ should be differentiated from > and <, which indicate
elementwise comparisons. Note that A may not be symmetric. Although positive


definite matrices often refer to symmetric matrices, nonsymmetric ones can also be
positive definite.
We next prove that A ≻ 0 and hence A is invertible. The idea for proving A ≻ 0
is to show that

D(I − γPπ ) ≜ M ≻ 0.        (8.28)

It is clear that M ≻ 0 implies A = ΦT M Φ ≻ 0 since Φ is a tall matrix with full column
rank (suppose that the feature vectors are selected to be linearly independent). Note
that

M = (M + M T )/2 + (M − M T )/2.
Since M − M T is skew-symmetric and hence xT (M − M T )x = 0 for any x, we know
that M ≻ 0 if and only if M + M T ≻ 0. To show M + M T ≻ 0, we apply the fact
that strictly diagonally dominant matrices are positive definite [4].
First, it holds that

(M + M T )1n > 0, (8.29)

where 1n = [1, . . . , 1]T ∈ Rn . The proof of (8.29) is given below. Since Pπ 1n = 1n ,


we have M 1n = D(I − γPπ )1n = D(1n − γ1n ) = (1 − γ)dπ . Moreover, M T 1n =
(I − γPπT )D1n = (I − γPπT )dπ = (1 − γ)dπ , where the last equality is valid because
PπT dπ = dπ . In summary, we have

(M + M T )1n = 2(1 − γ)dπ .

Since all the entries of dπ are positive (see Box 8.1), we have (M + M T )1n > 0.
Second, the elementwise form of (8.29) is
Σ_{j=1}^{n} [M + M T ]ij > 0,    i = 1, . . . , n,

which can be further written as

[M + M T ]ii + Σ_{j≠i} [M + M T ]ij > 0.

It can be verified according to (8.28) that the diagonal entries of M are positive and
the off-diagonal entries of M are nonpositive. Therefore, the above inequality can be


rewritten as
[M + M T ]ii > Σ_{j≠i} | [M + M T ]ij | .

The above inequality indicates that the absolute value of the ith diagonal entry in
M + M T is greater than the sum of the absolute values of the off-diagonal entries
in the same row. Thus, M + M T is strictly diagonally dominant and the proof is
complete.

TD learning minimizes the projected Bellman error

While we have shown that the TD-Linear algorithm converges to w∗ = A−1 b, we next
show that w∗ is the optimal solution that minimizes the projected Bellman error. To do
that, we review three objective functions.

 The first objective function is

JE (w) = E[(vπ (S) − v̂(S, w))2 ],

which has been introduced in (8.3). By the definition of expectation, JE (w) can be
reexpressed in a matrix-vector form as

JE (w) = ‖v̂(w) − vπ ‖D^2 ,

where vπ is the true state value vector and v̂(w) is the approximated one. Here, ‖ · ‖D
is a weighted norm: ‖x‖D^2 = xT Dx = ‖D1/2 x‖2^2 , where D is given in (8.20).
This is the simplest objective function that we can imagine when talking about func-
tion approximation. However, it relies on the true state values, which are unknown. To obtain
an implementable algorithm, we must consider other objective functions such as the
Bellman error and projected Bellman error [50–54].
 The second objective function is the Bellman error. In particular, since vπ satisfies
the Bellman equation vπ = rπ + γPπ vπ , it is expected that the estimated value v̂(w)
should also satisfy this equation to the greatest extent possible. Thus, the Bellman
error is

JBE (w) ≜ ‖v̂(w) − (rπ + γPπ v̂(w))‖D^2 = ‖v̂(w) − Tπ (v̂(w))‖D^2 .        (8.30)

Here, Tπ (·) is the Bellman operator. In particular, for any vector x ∈ Rn , the Bellman
operator is defined as

Tπ (x) ≜ rπ + γPπ x.


Minimizing the Bellman error is a standard least-squares problem. The details of the
solution are omitted here.
 Third, it is notable that JBE (w) in (8.30) may not be minimized to zero due to the
limited approximation ability of the approximator. By contrast, an objective function
that can be minimized to zero is the projected Bellman error:

JP BE (w) = ‖v̂(w) − M Tπ (v̂(w))‖D^2 ,

where M ∈ Rn×n is the orthogonal projection matrix that geometrically projects any
vector onto the space of all approximations.

In fact, the TD learning algorithm in (8.13) aims to minimize the projected Bellman
error JP BE rather than JE or JBE . The reason is as follows. For the sake of simplicity,
consider the linear case where v̂(w) = Φw. Here, Φ is defined in (8.20). The range space
of Φ is the set of all possible linear approximations. Then,

M = Φ(ΦT DΦ)−1 ΦT D ∈ Rn×n

is the projection matrix that geometrically projects any vector onto the range space of Φ.
Since v̂(w) is in the range space of Φ, we can always find a value of w that can minimize
JP BE (w) to zero. It can be proven that the solution minimizing JP BE (w) is w∗ = A−1 b.
That is,

w∗ = A−1 b = arg min_w JP BE (w).

The proof is given in Box 8.5.
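Before turning to the proof, the following small numerical sketch may help build intuition. The Markov reward process and the feature matrix Φ (with fewer columns than states) are randomly generated for illustration only; the sketch checks that JP BE is numerically zero at w∗ = A−1 b, whereas JBE and JE are generally not.

```python
import numpy as np

# Hypothetical small example: n states, m < n linearly independent features.
rng = np.random.default_rng(0)
n, m, gamma = 6, 3, 0.9

P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic
r_pi = rng.random(n)
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)               # true state values

# Stationary distribution d_pi and the diagonal matrix D in (8.20).
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))]); d_pi /= d_pi.sum()
D = np.diag(d_pi)

Phi = rng.random((n, m))                                              # feature matrix (full column rank)
A = Phi.T @ D @ (np.eye(n) - gamma * P_pi) @ Phi                      # (8.21)
b = Phi.T @ D @ r_pi
w_star = np.linalg.solve(A, b)

M = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)                 # projection matrix
def norm_D2(x): return x @ D @ x
v_hat = Phi @ w_star
T_v_hat = r_pi + gamma * P_pi @ v_hat                                 # Bellman operator applied to v_hat
print("J_PBE(w*):", norm_D2(v_hat - M @ T_v_hat))                     # numerically zero
print("J_BE(w*): ", norm_D2(v_hat - T_v_hat))                         # generally positive
print("J_E(w*):  ", norm_D2(v_hat - v_pi))                            # generally positive
```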

Box 8.5: Showing that w∗ = A−1 b minimizes JP BE (w)

We next show that w∗ = A−1 b is the optimal solution that minimizes JP BE (w). Since
JP BE (w) = 0 ⇔ v̂(w) − M Tπ (v̂(w)) = 0, we only need to study the root of

v̂(w) = M Tπ (v̂(w)).

In the linear case, substituting v̂(w) = Φw and the expression M = Φ(ΦT DΦ)−1 ΦT D into
the above equation gives

Φw = Φ(ΦT DΦ)−1 ΦT D(rπ + γPπ Φw). (8.31)


Since Φ has full column rank, we have Φx = Φy ⇔ x = y for any x, y. Therefore,


(8.31) implies

w = (ΦT DΦ)−1 ΦT D(rπ + γPπ Φw)


⇐⇒ ΦT D(rπ + γPπ Φw) = (ΦT DΦ)w
⇐⇒ ΦT Drπ + γΦT DPπ Φw = (ΦT DΦ)w
⇐⇒ ΦT Drπ = ΦT D(I − γPπ )Φw
⇐⇒ w = (ΦT D(I − γPπ )Φ)−1 ΦT Drπ = A−1 b,

where A, b are given in (8.21). Therefore, w∗ = A−1 b is the optimal solution that
minimizes JP BE (w).

Since the TD algorithm aims to minimize JP BE rather than JE , it is natural to ask


how close the estimated value v̂(w) is to the true state value vπ . In the linear case, the
estimated value that minimizes the projected Bellman error is v̂(w) = Φw∗ . Its deviation
from the true state value vπ satisfies

‖Φw∗ − vπ ‖D ≤ (1/(1 − γ)) min_w ‖v̂(w) − vπ ‖D = (1/(1 − γ)) min_w √JE (w).        (8.32)

The proof of this inequality is given in Box 8.6. Inequality (8.32) indicates that the
discrepancy between Φw∗ and vπ is bounded from above in terms of the minimum value of JE (w).
However, this bound is loose, especially when γ is close to one. It is thus mainly of
theoretical value.
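As a quick sanity check of (8.32), the following sketch evaluates both sides of the inequality on a randomly generated example (the setup is our own illustrative choice); the minimization over w is carried out by the orthogonal projection of vπ in the D-weighted norm.

```python
import numpy as np

# Numerical check (sketch) of the error bound (8.32) on a hypothetical example.
rng = np.random.default_rng(1)
n, m, gamma = 6, 3, 0.9

P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))]); d_pi /= d_pi.sum()
D = np.diag(d_pi)

Phi = rng.random((n, m))
A = Phi.T @ D @ (np.eye(n) - gamma * P_pi) @ Phi
b = Phi.T @ D @ r_pi
w_star = np.linalg.solve(A, b)

def norm_D(x): return np.sqrt(x @ D @ x)

# min_w ||Phi w - v_pi||_D is attained by the D-weighted orthogonal projection of v_pi.
w_proj = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ v_pi)
lhs = norm_D(Phi @ w_star - v_pi)
rhs = norm_D(Phi @ w_proj - v_pi) / (1 - gamma)
print(lhs, "<=", rhs)      # the bound holds, usually with plenty of slack
```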

Box 8.6: Proof of the error bound in (8.32)

Note that

‖Φw∗ − vπ ‖D = ‖Φw∗ − M vπ + M vπ − vπ ‖D
             ≤ ‖Φw∗ − M vπ ‖D + ‖M vπ − vπ ‖D
             = ‖M Tπ (Φw∗ ) − M Tπ (vπ )‖D + ‖M vπ − vπ ‖D ,        (8.33)

where the last equality is due to Φw∗ = M Tπ (Φw∗ ) and vπ = Tπ (vπ ). Substituting

M Tπ (Φw∗ ) − M Tπ (vπ ) = M (rπ + γPπ Φw∗ ) − M (rπ + γPπ vπ ) = γM Pπ (Φw∗ − vπ )


into (8.33) yields

‖Φw∗ − vπ ‖D ≤ ‖γM Pπ (Φw∗ − vπ )‖D + ‖M vπ − vπ ‖D
             ≤ γ‖M ‖D ‖Pπ (Φw∗ − vπ )‖D + ‖M vπ − vπ ‖D
             = γ‖Pπ (Φw∗ − vπ )‖D + ‖M vπ − vπ ‖D      (because ‖M ‖D = 1)
             ≤ γ‖Φw∗ − vπ ‖D + ‖M vπ − vπ ‖D .         (because ‖Pπ x‖D ≤ ‖x‖D for all x)

The proofs of ‖M ‖D = 1 and ‖Pπ x‖D ≤ ‖x‖D are postponed to the end of the box.
Rearranging the above inequality gives

‖Φw∗ − vπ ‖D ≤ (1/(1 − γ)) ‖M vπ − vπ ‖D
             = (1/(1 − γ)) min_w ‖v̂(w) − vπ ‖D ,

where the last equality is because ‖M vπ − vπ ‖D is the error between vπ and its
orthogonal projection into the space of all possible approximations. Therefore, it is
the minimum value of the error between vπ and any v̂(w).
We next prove some useful facts, which have already been used in the above proof.

 Properties of matrix weighted norms. By definition, ‖x‖D^2 = xT Dx and hence ‖x‖D = ‖D1/2 x‖2 .
The induced matrix norm is ‖A‖D = max_{x≠0} ‖Ax‖D /‖x‖D = ‖D1/2 AD−1/2 ‖2 . For
matrices A, B with appropriate dimensions, we have ‖ABx‖D ≤ ‖A‖D ‖B‖D ‖x‖D .
To see that, ‖ABx‖D = ‖D1/2 ABx‖2 = ‖D1/2 AD−1/2 D1/2 BD−1/2 D1/2 x‖2 ≤
‖D1/2 AD−1/2 ‖2 ‖D1/2 BD−1/2 ‖2 ‖D1/2 x‖2 = ‖A‖D ‖B‖D ‖x‖D .
 Proof of ‖M ‖D = 1. This is valid because ‖M ‖D = ‖Φ(ΦT DΦ)−1 ΦT D‖D =
‖D1/2 Φ(ΦT DΦ)−1 ΦT DD−1/2 ‖2 = 1, where the last equality is valid due to the
fact that the matrix inside the L2 -norm is an orthogonal projection matrix and the
L2 -norm of any orthogonal projection matrix is equal to one.
 Proof of ‖Pπ x‖D ≤ ‖x‖D for any x ∈ Rn . First,

‖Pπ x‖D^2 = xT PπT DPπ x = Σ_{i,j} xi [PπT DPπ ]ij xj = Σ_{i,j} xi ( Σ_k [PπT ]ik [D]kk [Pπ ]kj ) xj .

Reorganizing the above equation gives

‖Pπ x‖D^2 = Σ_k [D]kk ( Σ_i [Pπ ]ki xi )^2
          ≤ Σ_k [D]kk ( Σ_i [Pπ ]ki xi^2 )     (due to Jensen’s inequality [55, 56])
          = Σ_i ( Σ_k [D]kk [Pπ ]ki ) xi^2
          = Σ_i [D]ii xi^2                      (due to dTπ Pπ = dTπ )
          = ‖x‖D^2 .

Least-squares TD

We next introduce an algorithm called least-squares TD (LSTD) [57]. Like the TD-Linear
algorithm, LSTD aims to minimize the projected Bellman error. However, it has some
advantages over the TD-Linear algorithm.
Recall that the optimal parameter for minimizing the projected Bellman error is
w∗ = A−1 b, where A = ΦT D(I − γPπ )Φ and b = ΦT Drπ . In fact, it follows from (8.27)

that A and b can also be written as


A = E[ φ(st )(φ(st ) − γφ(st+1 ))T ],
b = E[ rt+1 φ(st ) ].

The above two equations show that A and b are expectations over the samples (st , rt+1 , st+1 ). The idea of
LSTD is simple: if we can use random samples to directly obtain the estimates of A and
b, which are denoted as Â and b̂, then the optimal parameter can be directly estimated
as w∗ ≈ Â−1 b̂.
In particular, suppose that (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) is a trajectory obtained
by following a given policy π. Let Ât and b̂t be the estimates of A and b at time t,
respectively. They are calculated as the averages of the samples:

Ât = Σ_{k=0}^{t−1} φ(sk )(φ(sk ) − γφ(sk+1 ))T ,
b̂t = Σ_{k=0}^{t−1} rk+1 φ(sk ).        (8.34)

Then, the estimated parameter is

wt = (Ât )−1 b̂t .


The reader may wonder if a coefficient of 1/t is missing on the right-hand side of (8.34).
In fact, it is omitted for the sake of simplicity since the value of wt remains the same
when it is omitted. Since Ât may not be invertible, especially when t is small, Ât is usually
regularized by adding a small constant matrix σI, where I is the identity matrix and σ is a small
positive number.
The advantage of LSTD is that it uses experience samples more efficiently and con-
verges faster than the TD method. That is because this algorithm is specifically designed
based on the knowledge of the optimal solution’s expression. The better we understand
a problem, the better algorithms we can design.
The disadvantages of LSTD are as follows. First, it can only estimate state values.
By contrast, the TD algorithm can be extended to estimate action values as shown in the
next section. Moreover, while the TD algorithm allows nonlinear approximators, LSTD
does not. That is because this algorithm is specifically designed based on the expression
of w∗ . Second, the computational cost of LSTD is higher than that of TD since LSTD
updates an m × m matrix in each update step, whereas TD updates an m-dimensional
vector. More importantly, in every step, LSTD needs to compute the inverse of Ât , whose
computational complexity is O(m3 ). The common method for resolving this problem is
to directly update the inverse of Ât rather than updating Ât . In particular, Ât+1 can be
calculated recursively as follows:
Ât+1 = Σ_{k=0}^{t} φ(sk )(φ(sk ) − γφ(sk+1 ))T
     = Σ_{k=0}^{t−1} φ(sk )(φ(sk ) − γφ(sk+1 ))T + φ(st )(φ(st ) − γφ(st+1 ))T
     = Ât + φ(st )(φ(st ) − γφ(st+1 ))T .

The above expression decomposes Ât+1 into the sum of two matrices. Its inverse can be
calculated as [58]

(Ât+1 )−1 = ( Ât + φ(st )(φ(st ) − γφ(st+1 ))T )−1
          = (Ât )−1 − [ (Ât )−1 φ(st )(φ(st ) − γφ(st+1 ))T (Ât )−1 ] / [ 1 + (φ(st ) − γφ(st+1 ))T (Ât )−1 φ(st ) ].

Therefore, we can directly store and update (Ât )−1 to avoid the need to calculate the matrix
inverse. This recursive algorithm does not require a step size. However, it requires setting
the initial value of (Â0 )−1 . The initial value of such a recursive algorithm can be selected as
(Â0 )−1 = σI, where σ is a positive number. A good tutorial on the recursive least-squares
approach can be found in [59].
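The following is a minimal sketch of LSTD on a randomly generated Markov reward process (the chain, the rewards, and the feature vectors are our own illustrative choices). It accumulates Ât and b̂t as in (8.34) and, in parallel, maintains the inverse recursively.

```python
import numpy as np

# Minimal LSTD sketch on a hypothetical randomly generated Markov reward process.
rng = np.random.default_rng(0)
n, m, gamma, sigma, T = 6, 3, 0.9, 1e-3, 20000

P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)
Phi = rng.random((n, m))                       # phi(s) is the s-th row of Phi

A_hat = sigma * np.eye(m)                      # regularized initial estimate
b_hat = np.zeros(m)
A_inv = np.eye(m) / sigma                      # recursive inverse, initialized as (sigma I)^{-1}

s = 0
for _ in range(T):
    s_next = rng.choice(n, p=P_pi[s])
    r = r_pi[s]                                # use the expected reward for simplicity
    u = Phi[s]
    v = Phi[s] - gamma * Phi[s_next]
    A_hat += np.outer(u, v)                    # batch accumulation as in (8.34)
    b_hat += r * u
    A_inv -= (A_inv @ np.outer(u, v) @ A_inv) / (1.0 + v @ A_inv @ u)   # Sherman-Morrison update
    s = s_next

w_batch = A_inv @ b_hat                        # recursive solution
w_direct = np.linalg.solve(A_hat, b_hat)       # direct solution
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(np.max(np.abs(w_batch - w_direct)))      # the two solutions coincide (up to rounding)
print(Phi @ w_direct, v_pi)                    # approximate vs. true state values
```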


8.3 TD learning of action values based on function approximation
While Section 8.2 introduced the problem of state value estimation, the present section
introduces how to estimate action values. The tabular Sarsa and tabular Q-learning
algorithms are extended to the case of value function approximation. Readers will see
that the extension is straightforward.

8.3.1 Sarsa with function approximation


The Sarsa algorithm with function approximation can be readily obtained from (8.13)
by replacing the state values with action values. In particular, suppose that qπ (s, a) is
approximated by q̂(s, a, w). Replacing v̂(s, w) in (8.13) by q̂(s, a, w) gives
wt+1 = wt + αt [ rt+1 + γ q̂(st+1 , at+1 , wt ) − q̂(st , at , wt ) ] ∇w q̂(st , at , wt ).        (8.35)

The analysis of (8.35) is similar to that of (8.13) and is omitted here. When linear
functions are used, we have

q̂(s, a, w) = φT (s, a)w,

where φ(s, a) is a feature vector. In this case, ∇w q̂(s, a, w) = φ(s, a).


The value estimation step in (8.35) can be combined with a policy improvement step
to learn optimal policies. The procedure is summarized in Algorithm 8.2. It should be
noted that accurately estimating the action values of a given policy requires (8.35) to be
run sufficiently many times. However, (8.35) is executed only once before switching to the
policy improvement step. This is similar to the tabular Sarsa algorithm. Moreover, the
implementation in Algorithm 8.2 aims to solve the task of finding a good path to the target
state from a prespecified starting state. As a result, it cannot find the optimal policy
for every state. However, if sufficient experience data are available, the implementation
process can be easily adapted to find optimal policies for every state.
An illustrative example is shown in Figure 8.9. In this example, the task is to find a
good policy that can lead the agent to the target when starting from the top-left state.
Both the total reward and the length of each episode gradually converge to steady values.
In this example, the linear feature vector is selected as the Fourier function of order 5.
The expression of a Fourier feature vector is given in (8.18).


(Figure: the grid world and the plots of the total reward and the episode length of each episode versus the episode index.)

Figure 8.9: Sarsa with linear function approximation. Here, γ = 0.9,  = 0.1, rboundary = rforbidden =
−10, rtarget = 1, and α = 0.001.

Algorithm 8.2: Sarsa with function approximation

Initialization: Initial parameter w0 . Initial policy π0 . αt = α > 0 for all t. ε ∈ (0, 1).
Goal: Learn an optimal policy that can lead the agent to the target state from an initial
state s0 .
For each episode, do
Generate a0 at s0 following π0 (s0 )
If st (t = 0, 1, 2, . . . ) is not the target state, do
Collect the experience sample (rt+1 , st+1 , at+1 ) given (st , at ): generate rt+1 , st+1
by interacting with the environment; generate at+1 following πt (st+1 ).
Update q-value:
wt+1 = wt + αt [ rt+1 + γ q̂(st+1 , at+1 , wt ) − q̂(st , at , wt ) ] ∇w q̂(st , at , wt )
Update policy:
πt+1 (a|st ) = 1 − (ε/|A(st )|)(|A(st )| − 1) if a = arg max_{a∈A(st )} q̂(st , a, wt+1 )
πt+1 (a|st ) = ε/|A(st )| otherwise
st ← st+1 , at ← at+1
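To make the update rule concrete, the following is a minimal sketch of Sarsa with linear function approximation on a hypothetical 1-D corridor task (our own toy setup, not the grid world used in the figures). One-hot features over state-action pairs are used, so ∇w q̂(s, a, w) = φ(s, a).

```python
import numpy as np

# Minimal sketch of Sarsa with linear function approximation, i.e., update (8.35),
# on a hypothetical 1-D corridor: states 0..N-1, actions {0: left, 1: right},
# reward -1 per step, and the episode ends at the rightmost state.
N, gamma, alpha, eps = 10, 0.9, 0.05, 0.1
rng = np.random.default_rng(0)

def phi(s, a):
    x = np.zeros(2 * N)              # one-hot feature over (state, action) pairs
    x[2 * s + a] = 1.0
    return x

def q_hat(s, a, w):
    return phi(s, a) @ w             # linear approximator, so grad_w q_hat = phi(s, a)

def eps_greedy(s, w):
    if rng.random() < eps:
        return int(rng.integers(2))
    return int(np.argmax([q_hat(s, b, w) for b in range(2)]))

w = np.zeros(2 * N)
for _ in range(500):                 # episodes
    s = 0
    a = eps_greedy(s, w)
    while s != N - 1:
        s_next = min(max(s + (1 if a == 1 else -1), 0), N - 1)
        r = -1.0
        a_next = eps_greedy(s_next, w)
        target = r if s_next == N - 1 else r + gamma * q_hat(s_next, a_next, w)
        w += alpha * (target - q_hat(s, a, w)) * phi(s, a)      # update (8.35)
        s, a = s_next, a_next

# The learned greedy action should be "right" (1) in every nonterminal state.
print([int(np.argmax([q_hat(s, b, w) for b in range(2)])) for s in range(N - 1)])
```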

8.3.2 Q-learning with function approximation


Tabular Q-learning can also be extended to the case of function approximation. The
update rule is
wt+1 = wt + αt [ rt+1 + γ max_{a∈A(st+1 )} q̂(st+1 , a, wt ) − q̂(st , at , wt ) ] ∇w q̂(st , at , wt ).        (8.36)

The above update rule is similar to (8.35) except that q̂(st+1 , at+1 , wt ) in (8.35) is replaced
with maxa∈A(st+1 ) q̂(st+1 , a, wt ).
Similar to the tabular case, (8.36) can be implemented in either an on-policy or
off-policy fashion. An on-policy version is given in Algorithm 8.3. An example for
demonstrating the on-policy version is shown in Figure 8.10. In this example, the task is
to find a good policy that can lead the agent to the target state from the top-left state.


Algorithm 8.3: Q-learning with function approximation (on-policy version)

Initialization: Initial parameter w0 . Initial policy π0 . αt = α > 0 for all t. ε ∈ (0, 1).
Goal: Learn an optimal path that can lead the agent to the target state from an initial
state s0 .
For each episode, do
If st (t = 0, 1, 2, . . . ) is not the target state, do
Collect the experience sample (at , rt+1 , st+1 ) given st : generate at following
πt (st ); generate rt+1 , st+1 by interacting with the environment.
Update q-value:
wt+1 = wt + αt [ rt+1 + γ max_{a∈A(st+1 )} q̂(st+1 , a, wt ) − q̂(st , at , wt ) ] ∇w q̂(st , at , wt )
Update policy:
πt+1 (a|st ) = 1 − (ε/|A(st )|)(|A(st )| − 1) if a = arg max_{a∈A(st )} q̂(st , a, wt+1 )
πt+1 (a|st ) = ε/|A(st )| otherwise

As can be seen, Q-learning with linear function approximation can successfully learn an
optimal policy. Here, linear Fourier basis functions of order five are used. The off-policy
version will be demonstrated when we introduce deep Q-learning in Section 8.4.
(Figure: the grid world and the plots of the total reward and the episode length of each episode versus the episode index.)

Figure 8.10: Q-learning with linear function approximation. Here, γ = 0.9,  = 0.1, rboundary =
rforbidden = −10, rtarget = 1, and α = 0.001.

One may notice in Algorithm 8.2 and Algorithm 8.3 that, although the values are
represented as functions, the policy π(a|s) is still represented as a table. Thus, it still
assumes finite numbers of states and actions. In Chapter 9, we will see that the policies
can be represented as functions so that continuous state and action spaces can be handled.

8.4 Deep Q-learning


We can integrate deep neural networks into Q-learning to obtain an approach called
deep Q-learning or deep Q-network (DQN) [22, 60, 61]. Deep Q-learning is one of the


earliest and most successful deep reinforcement learning algorithms. Notably, the neural
networks do not have to be deep. For simple tasks such as our grid world examples,
shallow networks with one or two hidden layers may be sufficient.
Deep Q-learning can be viewed as an extension of the algorithm in (8.36). However,
its mathematical formulation and implementation techniques are substantially different
and deserve special attention.

8.4.1 Algorithm description


Mathematically, deep Q-learning aims to minimize the following objective function:
" 2 #
0
J =E R + γ max0 q̂(S , a, w) − q̂(S, A, w) , (8.37)
a∈A(S )

where (S, A, R, S 0 ) are random variables that denote a state, an action, the immediate
reward, and the next state, respectively. This objective function can be viewed as the
squared Bellman optimality error. That is because
 
q(s, a) = E[ Rt+1 + γ max_{a∈A(St+1 )} q(St+1 , a) | St = s, At = a ],    for all s, a

is the Bellman optimality equation (the proof is given in Box 7.5). Therefore, R +
γ maxa∈A(S 0 ) q̂(S 0 , a, w) − q̂(S, A, w) should equal zero in the expectation sense when
q̂(S, A, w) can accurately approximate the optimal action values.
To minimize the objective function in (8.37), we can use the gradient descent algorithm.
To that end, we need to calculate the gradient of J with respect to w. It is noted that
the parameter w appears not only in q̂(S, A, w) but also in y ≜ R + γ max_{a∈A(S′)} q̂(S′, a, w).
As a result, it is nontrivial to calculate the gradient. For the sake of simplicity, it is as-
sumed that the value of w in y is fixed (for a short period of time) so that the calculation
of the gradient becomes much easier. In particular, we introduce two networks: one is a
main network representing q̂(s, a, w) and the other is a target network q̂(s, a, wT ). The
objective function in this case becomes
" 2 #
J =E R + γ max0 q̂(S 0 , a, wT ) − q̂(S, A, w) ,
a∈A(S )

where wT is the target network’s parameter. When wT is fixed, the gradient of J is


  
∇w J = E[ ( R + γ max_{a∈A(S′)} q̂(S′, a, wT ) − q̂(S, A, w) ) ∇w q̂(S, A, w) ],        (8.38)

where some constant coefficients are omitted without loss of generality.


To use the gradient in (8.38) to minimize the objective function in (8.37), we need to


pay attention to the following techniques.

 The first technique is to use two networks, a main network and a target network,
as mentioned when we calculate the gradient in (8.38). The implementation details
are explained below. Let w and wT denote the parameters of the main and target
networks, respectively. They are initially set to the same value.
In every iteration, we draw a mini-batch of samples {(s, a, r, s0 )} from the replay buffer
(the replay buffer will be explained soon). The inputs of the main network are s and
a. The output y = q̂(s, a, w) is the estimated q-value. The target value of the output
is yT ≜ r + γ max_{a∈A(s′)} q̂(s′, a, wT ). The main network is updated to minimize the
TD error (also called the loss function) Σ(y − yT )^2 over the samples {(s, a, yT )}.

Updating w in the main network does not explicitly use the gradient in (8.38). Instead,
it relies on the existing software tools for training neural networks. As a result, we
need a mini-batch of samples to train a network instead of using a single sample to
update the main network based on (8.38). This is one notable difference between deep
and nondeep reinforcement learning algorithms.
The main network is updated in every iteration. By contrast, the target network is
set to be the same as the main network every certain number of iterations to satisfy
the assumption that wT is fixed when calculating the gradient in (8.38).
 The second technique is experience replay [22, 60, 62]. That is, after we have collected
some experience samples, we do not use these samples in the order they were collected.
Instead, we store them in a dataset called the replay buffer. In particular, let (s, a, r, s0 )
.
be an experience sample and B = {(s, a, r, s0 )} be the replay buffer. Every time we
update the main network, we can draw a mini-batch of experience samples from the
replay buffer. The draw of samples, or called experience replay, should follow a uniform
distribution.
Why is experience replay necessary in deep Q-learning, and why must the replay
follow a uniform distribution? The answer lies in the objective function in (8.37).
In particular, to well define the objective function, we must specify the probability
distributions for S, A, R, S 0 . The distributions of R and S 0 are determined by the
system model once (S, A) is given. The simplest way to describe the distribution of
the state-action pair (S, A) is to assume it to be uniformly distributed.
However, the state-action samples may not be uniformly distributed in practice since
they are generated as a sample sequence according to the behavior policy. It is
necessary to break the correlation between the samples in the sequence to satisfy the
assumption of uniform distribution. To do this, we can use the experience replay tech-
nique by uniformly drawing samples from the replay buffer. This is the mathematical
reason why experience replay is necessary and why experience replay must follow a
uniform distribution. A benefit of random sampling is that each experience sample


Algorithm 8.3: Deep Q-learning (off-policy version)

Initialization: A main network and a target network with the same initial parameter.
Goal: Learn an optimal target network to approximate the optimal action values from
the experience samples generated by a given behavior policy πb .
Store the experience samples generated by πb in a replay buffer B = {(s, a, r, s0 )}
For each iteration, do
Uniformly draw a mini-batch of samples from B
For each sample (s, a, r, s′), calculate the target value as yT = r +
γ max_{a∈A(s′)} q̂(s′, a, wT ), where wT is the parameter of the target network
Update the main network to minimize (yT − q̂(s, a, w))2 using the mini-batch
of samples
Set wT = w every C iterations

may be used multiple times, which can increase the data efficiency. This is especially
important when we have a limited amount of data.

The implementation procedure of deep Q-learning is summarized in Algorithm 8.3.


This implementation is off-policy. It can also be adapted to become on-policy if needed.
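As a concrete illustration, the following is a minimal PyTorch sketch of one possible implementation of Algorithm 8.3. The network sizes, the randomly filled replay buffer, and the state-in/all-action-values-out structure (a design of the kind mentioned in the next subsection) are our own assumptions rather than the exact setup used in the examples below.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the replay buffer is filled with random placeholder samples
# only so that the snippet runs on its own.
state_dim, num_actions, gamma, batch_size, C = 3, 5, 0.9, 100, 50

main_net = nn.Sequential(nn.Linear(state_dim, 100), nn.ReLU(), nn.Linear(100, num_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 100), nn.ReLU(), nn.Linear(100, num_actions))
target_net.load_state_dict(main_net.state_dict())        # the two networks start identical
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)

# Replay buffer B = {(s, a, r, s')} generated by some behavior policy (placeholder data here).
buffer = [(torch.rand(state_dim), random.randrange(num_actions),
           random.uniform(-10.0, 1.0), torch.rand(state_dim)) for _ in range(1000)]

for iteration in range(1000):
    batch = random.sample(buffer, batch_size)             # uniform experience replay
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])

    with torch.no_grad():                                 # w_T is fixed when computing targets
        y_target = r + gamma * target_net(s_next).max(dim=1).values
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # q_hat(s, a, w) from the main network
    loss = F.mse_loss(q, y_target)                        # mini-batch loss (y - y_T)^2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # update w of the main network

    if (iteration + 1) % C == 0:
        target_net.load_state_dict(main_net.state_dict()) # set w_T = w every C iterations
```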

8.4.2 Illustrative examples


An example is given in Figure 8.11 to demonstrate Algorithm 8.3. This example aims
to learn the optimal action values for every state-action pair. Once the optimal action
values are obtained, the optimal greedy policy can be obtained immediately.
A single episode is generated by the behavior policy shown in Figure 8.11(a). This
behavior policy is exploratory in the sense that it has the same probability of taking
any action at any state. The episode has only 1,000 steps as shown in Figure 8.11(b).
Although there are only 1,000 steps, almost all the state action pairs are visited in this
episode due to the strong exploration ability of the behavior policy. The replay buffer is
a set of 1,000 experience samples. The mini-batch size is 100, meaning that we uniformly
draw 100 samples from the replay buffer every time we acquire samples.
The main and target networks have the same structure: a neural network with one
hidden layer of 100 neurons (the numbers of layers and neurons can be tuned). The
neural network has three inputs and one output. The first two inputs are the normalized
row and column indexes of a state. The third input is the normalized action index. Here,
“normalization” means converting a value to the interval of [0,1]. The output of the
network is the estimated action value. The reason why we design the inputs as the row
and column of a state rather than a state index is that we know that a state corresponds
to a two-dimensional location in the grid. The more information about the state we use
when designing the network, the better the network can perform. Moreover, the neural


(a) The behavior policy. (b) An episode with 1,000 steps. (c) The final learned policy.

(d) The loss function converges to zero. (e) The value error converges to zero.

Figure 8.11: Optimal policy learning via deep Q-learning. Here, γ = 0.9, rboundary = rforbidden = −10,
and rtarget = 1. The batch size is 100.

network can also be designed in other ways. For example, it can have two inputs and
five outputs, where the two inputs are the normalized row and column of a state and the
outputs are the five estimated action values for the input state [22].
As shown in Figure 8.11(d), the loss function, defined as the average squared TD
error of each mini-batch, converges to zero, meaning that the network can fit the training
samples well. As shown in Figure 8.11(e), the state value estimation error also converges
to zero, indicating that the estimates of the optimal action values become sufficiently
accurate. Then, the corresponding greedy policy is optimal.
This example demonstrates the high efficiency of deep Q-learning. In particular, a
short episode of 1,000 steps is sufficient for obtaining an optimal policy here. By contrast,
an episode with 100,000 steps is required by tabular Q-learning, as shown in Figure 7.4.
One reason for the high efficiency is that the function approximation method has a strong
generalization ability. Another reason is that the experience samples can be repeatedly
used.
We next deliberately challenge the deep Q-learning algorithm by considering a scenario
with fewer experience samples. Figure 8.12 shows an example of an episode with merely
100 steps. In this example, although the network can still be well-trained in the sense


(a) The behavior policy. (b) An episode with 100 steps. (c) The final learned policy.

(d) The loss function converges to zero. (e) The value error does not converge to zero.

Figure 8.12: Optimal policy learning via deep Q-learning. Here, γ = 0.9, rboundary = rforbidden = −10,
and rtarget = 1. The batch size is 50.

that the loss function converges to zero, the state value estimation error cannot converge to
zero. That means the network can properly fit the given experience samples, but the
experience samples are too few to accurately estimate the optimal action values.

8.5 Summary
This chapter continued introducing TD learning algorithms. However, it switched from
the tabular method to the function approximation method. The key to understanding
the function approximation method is to know that it is an optimization problem. The
simplest objective function is the squared error between the true state values and the
estimated values. There are also other objective functions such as the Bellman error
and the projected Bellman error. We have shown that the TD-Linear algorithm actually
minimizes the projected Bellman error. Several optimization algorithms such as Sarsa
and Q-learning with value approximation have been introduced.
One reason why the value function approximation method is important is that it allows
artificial neural networks to be integrated with reinforcement learning. For example,
deep Q-learning is one of the most successful deep reinforcement learning algorithms.


Although neural networks have been widely used as nonlinear function approximators,
this chapter provides a comprehensive introduction to the linear function case. Fully
understanding the linear case is important for better understanding the nonlinear case.
Interested readers may refer to [63] for a thorough analysis of TD learning algorithms
with function approximation. A more theoretical discussion on deep Q-learning can be
found in [61].
An important concept named stationary distribution is introduced in this chapter.
The stationary distribution plays an important role in defining an appropriate objective
function in the value function approximation method. It also plays a key role in Chapter 9
when we use functions to approximate policies. An excellent introduction to this topic
can be found in [49, Chapter IV]. The contents of this chapter heavily rely on matrix
analysis. Some results are used without explanation. Excellent references regarding
matrix analysis and linear algebra can be found in [4, 48].

8.6 Q&A
 Q: What is the difference between the tabular and function approximation methods?
A: One important difference is how a value is updated and retrieved.
How to retrieve a value: When the values are represented by a table, if we would like
to retrieve a value, we can directly read the corresponding entry in the table. However,
when the values are represented by a function, we need to input the state index s into
the function and calculate the function value. If the function is an artificial neural
network, a forward propagation process from the input to the output is needed.
How to update a value: When the values are represented by a table, if we would like
to update one value, we can directly rewrite the corresponding entry in the table.
However, when the values are represented by a function, we must update the function
parameter to change the values indirectly.
 Q: What are the advantages of the function approximation method over the tabular
method?
A: Due to the way state values are retrieved, the function approximation method is
more efficient in storage. In particular, while the tabular method needs to store |S|
values, the function approximation method only needs to store a parameter vector
whose dimension is usually much less than |S|.
Due to the way in which state values are updated, the function approximation method
has another merit: its generalization ability is stronger than that of the tabular
method. The reason is as follows. With the tabular method, updating one state
value would not change the other state values. However, with the function approx-
imation method, updating the function parameter affects the values of many states.


Therefore, the experience sample for one state can generalize to help estimate the
values of other states.
 Q: Can we unify the tabular and the function approximation methods?
A: Yes. The tabular method can be viewed as a special case of the function approxi-
mation method. The related details can be found in Box 8.2.
 Q: What is the stationary distribution and why is it important?
A: The stationary distribution describes the long-term behavior of a Markov decision
process. More specifically, after the agent executes a given policy for a sufficiently long
period, the probability of the agent visiting a state can be described by this stationary
distribution. More information can be found in Box 8.1.
The reason why this concept emerges in this chapter is that it is necessary for defining
a valid objective function. In particular, the objective function involves the probability
distribution of the states, which is usually selected as the stationary distribution. The
stationary distribution is important not only for the value approximation method but
also for the policy gradient method, which will be introduced in Chapter 9.
 Q: What are the advantages and disadvantages of the linear function approximation
method?
A: Linear function approximation is the simplest case whose theoretical properties
can be thoroughly analyzed. However, the approximation ability of this method is
limited. It is also nontrivial to select appropriate feature vectors for complex tasks.
By contrast, artificial neural networks can be used to approximate values as black-box
universal nonlinear approximators, which are more friendly to use. Nevertheless, it
is still meaningful to study the linear case to better grasp the idea of the function
approximation method. Moreover, the linear case is powerful in the sense that the
tabular method can be viewed as a special linear case (Box 8.2).
 Q: Why does deep Q-learning require experience replay?
A: The reason lies in the objective function in (8.37). In particular, to well define
the objective function, we must specify the probability distributions of S, A, R, S 0 .
The distributions of R and S 0 are determined by the system model once (S, A) is
given. The simplest way to describe the distribution of the state-action pair (S, A)
is to assume it to be uniformly distributed. However, the state-action samples may
not be uniformly distributed in practice since they are generated as a sequence by the
behavior policy. It is necessary to break the correlation between the samples in the
sequence to satisfy the assumption of uniform distribution. To do this, we can use
the experience replay technique by uniformly drawing samples from the replay buffer.
A benefit of experience replay is that each experience sample may be used multiple
times, which can increase the data efficiency.


 Q: Can tabular Q-learning use experience replay?


A: Although tabular Q-learning does not require experience replay, it can also use
experience replay without encountering problems. That is because Q-learning has no
requirements about how the samples are obtained due to its off-policy attribute. One
benefit of using experience replay is that the samples can be used repeatedly and
hence more efficiently.
 Q: Why does deep Q-learning require two networks?
A: The fundamental reason is to simplify the calculation of the gradient of (8.37).
Specifically, the parameter w appears not only in q̂(S, A, w) but also in R+γ maxa∈A(S 0 ) q̂(S 0 , a, w).
As a result, it is nontrivial to calculate the gradient with respect to w. On the one
hand, if we fix w in R + γ maxa∈A(S 0 ) q̂(S 0 , a, w), the gradient can be easily calculated
as shown in (8.38). This gradient suggests that two networks should be maintained.
The main network’s parameter is updated in every iteration. The target network’s
parameter is fixed within a certain period. On the other hand, the target network’s
parameter cannot be fixed forever. It should be updated every certain number of
iterations.
 Q: When an artificial neural network is used as a nonlinear function approximator,
how should we update its parameter?
A: It must be noted that we should not directly update the parameter vector by
using, for example, (8.36). Instead, we should follow the network training procedure
to update the parameter. This procedure can be realized based on neural network
training toolkits, which are currently mature and widely available.

Chapter 9

Policy Gradient Methods


Figure 9.1: Where we are in this book.

The idea of function approximation can be applied not only to represent state/action
values, as introduced in Chapter 8, but also to represent policies, as introduced in this
chapter. So far in this book, policies have been represented by tables: the action prob-
abilities of all states are stored in a table (e.g., Table 9.1). In this chapter, we show
that policies can be represented by parameterized functions denoted as π(a|s, θ), where
θ ∈ Rm is a parameter vector. It can also be written in other forms such as πθ (a|s),
πθ (a, s), or π(a, s, θ).
When policies are represented as functions, optimal policies can be obtained by op-
timizing certain scalar metrics. Such a method is called policy gradient. The policy


       a1           a2           a3           a4           a5
s1     π(a1 |s1 )   π(a2 |s1 )   π(a3 |s1 )   π(a4 |s1 )   π(a5 |s1 )
...    ...          ...          ...          ...          ...
s9     π(a1 |s9 )   π(a2 |s9 )   π(a3 |s9 )   π(a4 |s9 )   π(a5 |s9 )

Table 9.1: A tabular representation of a policy. There are nine states and five actions for each state.

(Figure: (a) a function that takes s and a as inputs and outputs π(a|s, θ); (b) a function that takes s as input and outputs π(a1 |s, θ), . . . , π(a5 |s, θ). In both cases, θ is the parameter of the function.)

Figure 9.2: Function representations of policies. The functions may have different structures.

gradient method is a big step forward in this book because it is policy-based. By contrast,
all the previous chapters in this book discuss value-based methods. The advantages of the
policy gradient method are numerous. For example, it is more efficient for handling large
state/action spaces. It has stronger generalization abilities and hence is more efficient in
terms of sample usage.

9.1 Policy representation: From table to function


When the representation of a policy is switched from a table to a function, it is necessary
to clarify the difference between the two representation methods.

 First, how to define optimal policies? When represented as a table, a policy is defined
as optimal if it can maximize every state value. When represented by a function, a
policy is defined as optimal if it can maximize certain scalar metrics.
 Second, how to update a policy? When represented by a table, a policy can be updated
by directly changing the entries in the table. When represented by a parameterized
function, a policy can no longer be updated in this way. Instead, it can only be
updated by changing the parameter θ.
 Third, how to retrieve the probability of an action? In the tabular case, the probability
of an action can be directly obtained by looking up the corresponding entry in the
table. In the case of function representation, we need to input (s, a) into the function
to calculate its probability (see Figure 9.2(a)). Depending on the structure of the
function, we can also input a state and then output the probabilities of all actions
(see Figure 9.2(b)).


The basic idea of the policy gradient method is summarized below. Suppose that J(θ)
is a scalar metric. Optimal policies can be obtained by optimizing this metric via the
gradient-based algorithm:

θt+1 = θt + α∇θ J(θt ),

where ∇θ J is the gradient of J with respect to θ, t is the time step, and α is the
optimization rate.
With this basic idea, we will answer the following three questions in the remainder of
this chapter.

 What metrics should be used? (Section 9.2).


 How to calculate the gradients of the metrics? (Section 9.3)
 How to use experience samples to calculate the gradients? (Section 9.4)

9.2 Metrics for defining optimal policies


If a policy is represented by a function, there are two types of metrics for defining optimal
policies. One is based on state values and the other is based on immediate rewards.

Metric 1: Average state value

The first metric is the average state value or simply called the average value. It is defined
as
v̄π = Σ_{s∈S} d(s)vπ (s),

where d(s) is the weight of state s. It satisfies d(s) ≥ 0 for any s ∈ S and Σ_{s∈S} d(s) = 1.
Therefore, we can interpret d(s) as a probability distribution of s. Then, the metric can
be written as

v̄π = ES∼d [vπ (S)].

How to select the distribution d? This is an important question. There are two cases.

 The first and simplest case is that d is independent of the policy π. In this case, we
specifically denote d as d0 and v̄π as v̄π0 to indicate that the distribution is independent
of the policy. One case is to treat all the states equally important and select d0 (s) =
1/|S|. Another case is when we are only interested in a specific state s0 (e.g., the
agent always starts from s0 ). In this case, we can design

d0 (s0 ) = 1,    d0 (s ≠ s0 ) = 0.


 The second case is that d is dependent on the policy π. In this case, it is common to
select d as dπ , which is the stationary distribution under π. One basic property of dπ
is that it satisfies

dTπ Pπ = dTπ ,

where Pπ is the state transition probability matrix. More information about the
stationary distribution can be found in Box 8.1.
The interpretation of selecting dπ is as follows. The stationary distribution reflects the
long-term behavior of a Markov decision process under a given policy. If one state is
frequently visited in the long term, it is more important and deserves a higher weight;
if a state is rarely visited, then its importance is low and deserves a lower weight.

As its name suggests, v̄π is a weighted average of the state values. Different values
of θ lead to different values of v̄π . Our ultimate goal is to find an optimal policy (or
equivalently an optimal θ) to maximize v̄π .
We next introduce another two important equivalent expressions of v̄π .

 Suppose that an agent collects rewards {Rt+1 }∞t=0 by following a given policy π(θ).
Readers may often see the following metric in the literature:

" n
# " #
X X
J(θ) = lim E γ t Rt+1 = E γ t Rt+1 . (9.1)
n→∞
t=0 t=0

This metric may be nontrivial to interpret at first glance. In fact, it is equal to v̄π .
To see that, we have
E[ Σ_{t=0}^{∞} γ^t Rt+1 ] = Σ_{s∈S} d(s) E[ Σ_{t=0}^{∞} γ^t Rt+1 | S0 = s ]
                          = Σ_{s∈S} d(s)vπ (s)
                          = v̄π .

The first equality in the above equation is due to the law of total expectation. The
second equality is by the definition of state values.
 The metric v̄π can also be rewritten as the inner product of two vectors. In particular,
let

vπ = [. . . , vπ (s), . . . ]T ∈ R|S| ,
d = [. . . , d(s), . . . ]T ∈ R|S| .


Then, we have

v̄π = dT vπ .

This expression will be useful when we analyze its gradient.

Metric 2: Average reward

The second metric is the average one-step reward or simply called the average reward
[2, 64, 65]. In particular, it is defined as

r̄π ≜ Σ_{s∈S} dπ (s)rπ (s) = ES∼dπ [rπ (S)],        (9.2)

where dπ is the stationary distribution and

rπ (s) ≜ Σ_{a∈A} π(a|s, θ)r(s, a) = EA∼π(s,θ) [r(s, A)|s]        (9.3)

is the expectation of the immediate rewards. Here, r(s, a) ≜ E[R|s, a] = Σ_{r} r p(r|s, a).
We next present another two important equivalent expressions of r̄π .

 Suppose that the agent collects rewards {Rt+1 }∞


t=0 by following a given policy π(θ).
A common metric that readers may often see in the literature is
" n−1 #
1 X
J(θ) = lim E Rt+1 . (9.4)
n→∞ n
t=0

It may seem nontrivial to interpret this metric at first glance. In fact, it is equal to
r̄π :
" n−1 #
1 X X
lim E Rt+1 = dπ (s)rπ (s) = r̄π . (9.5)
n→∞ n
t=0 s∈S

The proof of (9.5) is given in Box 9.1.


 The average reward r̄π in (9.2) can also be written as the inner product of two vectors.
In particular, let

rπ = [. . . , rπ (s), . . . ]T ∈ R|S| ,
dπ = [. . . , dπ (s), . . . ]T ∈ R|S| ,


where rπ (s) is defined in (9.3). Then, it is clear that


r̄π = Σ_{s∈S} dπ (s)rπ (s) = dTπ rπ .

This expression will be useful when we derive its gradient.

Box 9.1: Proof of (9.5)

Step 1: We first prove that the following equation is valid for any starting state
s0 ∈ S:
" n−1 #
1 X
r̄π = lim E Rt+1 |S0 = s0 . (9.6)
n→∞ n
t=0

To do that, we notice
" n−1 # n−1
1 X 1X
lim E Rt+1 |S0 = s0 = lim E [Rt+1 |S0 = s0 ]
n→∞ n n→∞ n
t=0 t=0

= lim E [Rt+1 |S0 = s0 ] , (9.7)


t→∞

where the last equality is due to the property of the Cesaro mean (also called the
Cesaro summation). In particular, if {ak }_{k=1}^{∞} is a convergent sequence such that
lim_{k→∞} ak exists, then {(1/n) Σ_{k=1}^{n} ak }_{n=1}^{∞} is also a convergent sequence such that
lim_{n→∞} (1/n) Σ_{k=1}^{n} ak = lim_{k→∞} ak .

We next examine E [Rt+1 |S0 = s0 ] in (9.7) more closely. By the law of total
expectation, we have
E[ Rt+1 | S0 = s0 ] = Σ_{s∈S} E[ Rt+1 | St = s, S0 = s0 ] p(t) (s|s0 )
                    = Σ_{s∈S} E[ Rt+1 | St = s ] p(t) (s|s0 )
                    = Σ_{s∈S} rπ (s) p(t) (s|s0 ),

where p(t) (s|s0 ) denotes the probability of transitioning from s0 to s using exactly t
steps. The second equality in the above equation is due to the Markov memoryless
property: the reward obtained at the next time step depends only on the current
state rather than the previous ones.
Note that
lim_{t→∞} p(t) (s|s0 ) = dπ (s)


by the definition of the stationary distribution. As a result, the starting state s0 does
not matter. Then, we have
lim_{t→∞} E[ Rt+1 | S0 = s0 ] = lim_{t→∞} Σ_{s∈S} rπ (s) p(t) (s|s0 ) = Σ_{s∈S} rπ (s)dπ (s) = r̄π .

Substituting the above equation into (9.7) gives (9.6).


Step 2: Consider an arbitrary state distribution d. By the law of total expectation,
we have
" n−1 # " n−1 #
1 X 1X X
lim E Rt+1 = lim d(s)E Rt+1 |S0 = s
n→∞ n n→∞ n
t=0 s∈S t=0
" n−1 #
X 1 X
= d(s) lim E Rt+1 |S0 = s .
n→∞ n
s∈S t=0

Since (9.6) is valid for any starting state, substituting (9.6) into the above equation
yields
" n−1 #
1 X X
lim E Rt+1 = d(s)r̄π = r̄π .
n→∞ n
t=0 s∈S

The proof is complete.
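The result can also be checked by simulation. The following sketch generates a long trajectory on a randomly generated policy-induced chain (our own illustrative setup, with the reward of each step taken as rπ (s) for simplicity) and compares the running average reward with r̄π .

```python
import numpy as np

# A Monte Carlo sketch of (9.5): the long-run average reward along one trajectory
# approaches r_bar = sum_s d_pi(s) r_pi(s), regardless of the starting state.
rng = np.random.default_rng(1)
n = 4
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)

eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))]); d_pi /= d_pi.sum()
r_bar = d_pi @ r_pi

s, total, steps = 0, 0.0, 200_000
for _ in range(steps):
    total += r_pi[s]                     # expected reward collected at this step
    s = rng.choice(n, p=P_pi[s])         # transition under the policy-induced chain
print(total / steps, r_bar)              # nearly equal for a long trajectory
```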

Some remarks

Metric    Expression 1                  Expression 2          Expression 3
v̄π        Σ_{s∈S} d(s)vπ (s)            ES∼d [vπ (S)]         lim_{n→∞} E[ Σ_{t=0}^{n} γ^t Rt+1 ]
r̄π        Σ_{s∈S} dπ (s)rπ (s)          ES∼dπ [rπ (S)]        lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ]

Table 9.2: Summary of the different but equivalent expressions of v̄π and r̄π .

Up to now, we have introduced two types of metrics: v̄π and r̄π . Each metric has
several different but equivalent expressions. They are summarized in Table 9.2. We
sometimes use v̄π to specifically refer to the case where the state distribution is the
stationary distribution dπ and use v̄π0 to refer to the case where d0 is independent of π.
Some remarks about the metrics are given below.

 All these metrics are functions of π. Since π is parameterized by θ, these metrics


are functions of θ. In other words, different values of θ can generate different metric
values. Therefore, we can search for the optimal values of θ to maximize these metrics.
This is the basic idea of policy gradient methods.


 The two metrics v̄π and r̄π are equivalent in the discounted case where γ < 1. In
particular, it can be shown that

r̄π = (1 − γ)v̄π .

The above equation indicates that these two metrics can be simultaneously maximized.
The proof of this equation is given later in Lemma 9.1.
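The following is a small numerical check of this equivalence on a randomly generated policy-induced Markov chain (our own illustrative setup), where the weight d is taken as the stationary distribution dπ .

```python
import numpy as np

# A quick numerical check (sketch) of r_bar = (1 - gamma) * v_bar.
rng = np.random.default_rng(0)
n, gamma = 5, 0.9

P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # policy-induced chain
r_pi = rng.random(n)
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)               # v_pi = r_pi + gamma*P_pi*v_pi

eigvals, eigvecs = np.linalg.eig(P_pi.T)                             # stationary distribution d_pi
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))]); d_pi /= d_pi.sum()

v_bar = d_pi @ v_pi                                                  # average state value
r_bar = d_pi @ r_pi                                                  # average one-step reward
print(r_bar, (1 - gamma) * v_bar)                                    # the two values coincide
```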

9.3 Gradients of the metrics


Given the metrics introduced in the last section, we can use gradient-based methods to
maximize them. To do that, we need to first calculate the gradients of these metrics.
The most important theoretical result in this chapter is the following theorem.

Theorem 9.1 (Policy gradient theorem). The gradient of J(θ) is


∇θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ (s, a),        (9.8)

where η is a state distribution and ∇θ π is the gradient of π with respect to θ. Moreover,


(9.8) has a compact form expressed in terms of expectation:
∇θ J(θ) = ES∼η,A∼π(S,θ) [ ∇θ ln π(A|S, θ)qπ (S, A) ],        (9.9)

where ln is the natural logarithm.

Some important remarks about Theorem 9.1 are given below.

 It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2,
Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios
involving different metrics and discounted/undiscounted cases. The gradients in these
scenarios all have similar expressions and hence are summarized in Theorem 9.1. The
specific expressions of J(θ) and η are not given in Theorem 9.1 and can be found in
Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, J(θ) could be v̄π0 , v̄π ,
or r̄π . The equality in (9.8) may become a strict equality or an approximation. The
distribution η also varies in different scenarios.
The derivation of the gradients is the most complicated part of the policy gradient
method. For many readers, it is sufficient to be familiar with the result in Theorem 9.1
without knowing the proof. The derivation details presented in the rest of this section
are mathematically intensive. Readers are suggested to study selectively based on
their interests.


 The expression in (9.9) is more favorable than (9.8) because it is expressed as an


expectation. We will show in Section 9.4 that this true gradient can be approximated
by a stochastic gradient.
Why can (9.8) be expressed as (9.9)? The proof is given below. By the definition of
expectation, (9.8) can be rewritten as
∇θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇θ π(a|s, θ)qπ (s, a)
        = ES∼η [ Σ_{a∈A} ∇θ π(a|S, θ)qπ (S, a) ].        (9.10)

Furthermore, the gradient of ln π(a|s, θ) is

∇θ ln π(a|s, θ) = ∇θ π(a|s, θ) / π(a|s, θ).

It follows that

∇θ π(a|s, θ) = π(a|s, θ)∇θ ln π(a|s, θ). (9.11)

Substituting (9.11) into (9.10) gives


" #
X
∇θ J(θ) = E π(a|S, θ)∇θ ln π(a|S, θ)qπ (S, a)
a∈A
h i
= ES∼η,A∼π(S,θ) ∇θ ln π(A|S, θ)qπ (S, A) .

 It is notable that π(a|s, θ) must be positive for all (s, a) to ensure that ln π(a|s, θ) is
valid. This can be achieved by using softmax functions:

π(a|s, θ) = e^{h(s,a,θ)} / Σ_{a′∈A} e^{h(s,a′,θ)} ,    a ∈ A,        (9.12)

where h(s, a, θ) is a function indicating the preference for selecting a at s. The policy
in (9.12) satisfies π(a|s, θ) ∈ [0, 1] and Σ_{a∈A} π(a|s, θ) = 1 for any s ∈ S. This policy
can be realized by a neural network. The input of the network is s. The output layer
is a softmax layer so that the network outputs π(a|s, θ) for all a and the sum of the
outputs is equal to 1. See Figure 9.2(b) for an illustration.
Since π(a|s, θ) > 0 for all a, the policy is stochastic and hence exploratory. The policy
does not directly tell which action to take. Instead, the action should be generated
according to the probability distribution of the policy.
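The following sketch illustrates such a softmax policy with a linear preference h(s, a, θ) = θT φ(s, a), where the linear form and the feature vectors are our own illustrative choices (the preference function could instead be a neural network, as mentioned above). It also verifies the gradient ∇θ ln π(a|s, θ) numerically.

```python
import numpy as np

# Sketch of the softmax policy (9.12) with a linear preference h(s, a, theta) = theta^T phi(s, a),
# and a numerical check of grad_theta ln pi(a|s, theta) = phi(s, a) - sum_b pi(b|s) phi(s, b),
# which is the standard score function of a linear softmax parameterization.
rng = np.random.default_rng(0)
m, num_actions = 4, 3
theta = rng.normal(size=m)
features = {a: rng.normal(size=m) for a in range(num_actions)}   # phi(s, a) for a fixed state s

def pi(theta):
    h = np.array([features[a] @ theta for a in range(num_actions)])
    e = np.exp(h - h.max())                  # numerically stable softmax
    return e / e.sum()

a = 1
p = pi(theta)
score = features[a] - sum(p[b] * features[b] for b in range(num_actions))   # analytic gradient

eps = 1e-6                                   # central finite differences for comparison
num_grad = np.array([
    (np.log(pi(theta + eps * np.eye(m)[i])[a]) - np.log(pi(theta - eps * np.eye(m)[i])[a])) / (2 * eps)
    for i in range(m)
])
print(np.max(np.abs(score - num_grad)))      # close to zero
```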


9.3.1 Derivation of the gradients in the discounted case


We next derive the gradients of the metrics in the discounted case where γ ∈ (0, 1). The
state value and action value in the discounted case are defined as

vπ (s) = E[Rt+1 + γRt+2 + γ^2 Rt+3 + . . . |St = s],
qπ (s, a) = E[Rt+1 + γRt+2 + γ^2 Rt+3 + . . . |St = s, At = a].

It holds that vπ (s) = Σ_{a∈A} π(a|s, θ)qπ (s, a) and the state value satisfies the Bellman
equation.
First, we show that v̄π (θ) and r̄π (θ) are equivalent metrics.

Lemma 9.1 (Equivalence between v̄π (θ) and r̄π (θ)). In the discounted case where γ ∈
(0, 1), it holds that

r̄π = (1 − γ)v̄π . (9.13)

Proof. Note that v̄π (θ) = dTπ vπ and r̄π (θ) = dTπ rπ , where vπ and rπ satisfy the Bellman
equation vπ = rπ + γPπ vπ . Multiplying dTπ on both sides of the Bellman equation yields

v̄π = r̄π + γdTπ Pπ vπ = r̄π + γdTπ vπ = r̄π + γv̄π ,

which implies (9.13).

Second, the following lemma gives the gradient of vπ (s) for any s.

Lemma 9.2 (Gradient of vπ (s)). In the discounted case, it holds for any s ∈ S that
∇θ vπ (s) = Σ_{s′∈S} Prπ (s′|s) Σ_{a∈A} ∇θ π(a|s′, θ)qπ (s′, a),        (9.14)

where

0 . X k k
γ [Pπ ]ss0 = (In − γPπ )−1 ss0
 
Prπ (s |s) =
k=0

is the discounted total probability of transitioning from s to s0 under policy π. Here,


[·]ss0 denotes the entry in the sth row and s0 th column, and [Pπk ]ss0 is the probability of
transitioning from s to s0 using exactly k steps under π.

Box 9.2: Proof of Lemma 9.2

First, for any s ∈ S, it holds that

    ∇θ vπ(s) = ∇θ [ Σ_{a∈A} π(a|s, θ) qπ(s, a) ]
             = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ∇θ qπ(s, a) ],        (9.15)

where qπ(s, a) is the action value given by

    qπ(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) vπ(s′).

Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

    ∇θ qπ(s, a) = 0 + γ Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).

Substituting this result into (9.15) yields

    ∇θ vπ(s) = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) γ Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) ]
             = Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a) + γ Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).        (9.16)

It is notable that ∇θ vπ appears on both sides of the above equation. One way to
calculate it is to use the unrolling technique [64]. Here, we use another way based on
the matrix-vector form, which we believe is more straightforward to understand. In
particular, let

    u(s) ≜ Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Since

    Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) = Σ_{s′∈S} p(s′|s) ∇θ vπ(s′) = Σ_{s′∈S} [Pπ]_{ss′} ∇θ vπ(s′),

equation (9.16) can be written in matrix-vector form by stacking ∇θ vπ(s) ∈ R^m and
u(s) ∈ R^m over all states s ∈ S into the vectors ∇θ vπ ∈ R^{mn} and u ∈ R^{mn}:

    ∇θ vπ = u + γ (Pπ ⊗ Im) ∇θ vπ.

Here, n = |S|, and m is the dimension of the parameter vector θ. The reason that
the Kronecker product ⊗ emerges in the equation is that ∇θ vπ(s) is a vector. The
above equation is a linear equation of ∇θ vπ, which can be solved as

    ∇θ vπ = (I_{nm} − γ Pπ ⊗ Im)^{-1} u
          = (In ⊗ Im − γ Pπ ⊗ Im)^{-1} u
          = [(In − γPπ)^{-1} ⊗ Im] u.        (9.17)

For any state s, it follows from (9.17) that

    ∇θ vπ(s) = Σ_{s′∈S} [(In − γPπ)^{-1}]_{ss′} u(s′)
             = Σ_{s′∈S} [(In − γPπ)^{-1}]_{ss′} Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a).        (9.18)

The quantity [(In − γPπ)^{-1}]_{ss′} has a clear probabilistic interpretation. In particular,
since (In − γPπ)^{-1} = I + γPπ + γ²Pπ² + ⋯, we have

    [(In − γPπ)^{-1}]_{ss′} = [I]_{ss′} + γ[Pπ]_{ss′} + γ²[Pπ²]_{ss′} + ⋯ = Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss′}.

Note that [Pπ^k]_{ss′} is the probability of transitioning from s to s′ using exactly k
steps (see Box 8.1). Therefore, [(In − γPπ)^{-1}]_{ss′} is the discounted total probability of
transitioning from s to s′ using any number of steps. By denoting [(In − γPπ)^{-1}]_{ss′} ≜
Prπ(s′|s), equation (9.18) becomes (9.14).
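The interpretation of Prπ(s′|s) can be checked numerically. The following Python sketch uses a randomly generated Pπ (an illustrative assumption) to confirm that (In − γPπ)^{-1} coincides with a truncation of the series Σ_k γ^k Pπ^k.

# Numerical illustration (with hypothetical, randomly generated dynamics) that
# (I - gamma*P_pi)^{-1} equals the discounted sum of P_pi^k used in Lemma 9.2.
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 4, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)

inv = np.linalg.inv(np.eye(n) - gamma * P)

acc, Pk = np.zeros((n, n)), np.eye(n)
for k in range(500):                 # truncate the infinite sum at 500 terms
    acc += (gamma ** k) * Pk
    Pk = Pk @ P

print(np.max(np.abs(inv - acc)))     # on the order of 1e-13: the two agree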

With the results in Lemma 9.2, we are ready to derive the gradient of v̄π0 .

Theorem 9.2 (Gradient of v̄π0 in the discounted case). In the discounted case where
γ ∈ (0, 1), the gradient of v̄π0 = d0^T vπ is

    ∇θ v̄π0 = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ ρπ and A ∼ π(S, θ). Here, the state distribution ρπ is

    ρπ(s) = Σ_{s′∈S} d0(s′) Prπ(s|s′),    s ∈ S,        (9.19)

where Prπ(s|s′) = Σ_{k=0}^{∞} γ^k [Pπ^k]_{s′s} = [(I − γPπ)^{-1}]_{s′s} is the discounted total probability of
transitioning from s′ to s under policy π.

Box 9.3: Proof of Theorem 9.2

Since d0(s) is independent of π, we have

    ∇θ v̄π0 = ∇θ Σ_{s∈S} d0(s) vπ(s) = Σ_{s∈S} d0(s) ∇θ vπ(s).

Substituting the expression of ∇θ vπ(s) given in Lemma 9.2 into the above equation
yields

    ∇θ v̄π0 = Σ_{s∈S} d0(s) ∇θ vπ(s)
            = Σ_{s∈S} d0(s) Σ_{s′∈S} Prπ(s′|s) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
            = Σ_{s′∈S} ( Σ_{s∈S} d0(s) Prπ(s′|s) ) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
            = Σ_{s′∈S} ρπ(s′) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
            = Σ_{s∈S} ρπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)        (change s′ to s)
            = Σ_{s∈S} ρπ(s) Σ_{a∈A} π(a|s, θ) ∇θ ln π(a|s, θ) qπ(s, a)
            = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ ρπ and A ∼ π(S, θ). The proof is complete.

With Lemma 9.1 and Lemma 9.2, we can derive the gradients of r̄π and v̄π .

Theorem 9.3 (Gradients of r̄π and v̄π in the discounted case). In the discounted case
where γ ∈ (0, 1), the gradients of r̄π and v̄π are
    ∇θ r̄π = (1 − γ) ∇θ v̄π ≈ Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
          = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ dπ and A ∼ π(S, θ). Here, the approximation is more accurate when γ is
closer to 1.

Box 9.4: Proof of Theorem 9.3

It follows from the definition of v̄π that

    ∇θ v̄π = ∇θ Σ_{s∈S} dπ(s) vπ(s)
           = Σ_{s∈S} ∇θ dπ(s) vπ(s) + Σ_{s∈S} dπ(s) ∇θ vπ(s).        (9.20)

This equation contains two terms. On the one hand, substituting the expression of
∇θ vπ given in (9.17) into the second term gives

    Σ_{s∈S} dπ(s) ∇θ vπ(s) = (dπ^T ⊗ Im) ∇θ vπ
                           = (dπ^T ⊗ Im) [(In − γPπ)^{-1} ⊗ Im] u
                           = [dπ^T (In − γPπ)^{-1}] ⊗ Im u.        (9.21)

It is noted that

    dπ^T (In − γPπ)^{-1} = (1/(1 − γ)) dπ^T,

which can be easily verified by multiplying (In − γPπ) on both sides of the equation.
Therefore, (9.21) becomes

    Σ_{s∈S} dπ(s) ∇θ vπ(s) = (1/(1 − γ)) (dπ^T ⊗ Im) u
                           = (1/(1 − γ)) Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

On the other hand, the first term of (9.20) involves ∇θ dπ. However, since the second
term contains the factor 1/(1 − γ), the second term becomes dominant, and the first
term becomes negligible when γ → 1. Therefore,

    ∇θ v̄π ≈ (1/(1 − γ)) Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Furthermore, it follows from r̄π = (1 − γ)v̄π that

    ∇θ r̄π = (1 − γ) ∇θ v̄π ≈ Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
          = Σ_{s∈S} dπ(s) Σ_{a∈A} π(a|s, θ) ∇θ ln π(a|s, θ) qπ(s, a)
          = E[ ∇θ ln π(A|S, θ) qπ(S, A) ].

The approximation in the above equation requires that the first term does not go to
infinity when γ → 1. More information can be found in [66, Section 4].

9.3.2 Derivation of the gradients in the undiscounted case


We next show how to calculate the gradients of the metrics in the undiscounted case
where γ = 1. Readers may wonder why we suddenly start considering the undiscounted
case while we have only considered the discounted case so far in this book. The reasons
are as follows. First, for continuing tasks, it may be inappropriate to introduce the
discount rate and we need to consider the undiscounted case. Second, the definition of
the average reward r̄π is valid for both discounted and undiscounted cases. While the
gradient of r̄π in the discounted case is an approximation, we will see that its gradient in
the undiscounted case is more elegant.

State values and the Poisson equation

In the undiscounted case, it is necessary to redefine state and action values. Since the
undiscounted sum of the rewards, E[Rt+1 + Rt+2 + Rt+3 + . . . |St = s], may diverge, the
state and action values are defined in a special way [64]:

    vπ(s) ≜ E[(Rt+1 − r̄π) + (Rt+2 − r̄π) + (Rt+3 − r̄π) + ⋯ | St = s],
    qπ(s, a) ≜ E[(Rt+1 − r̄π) + (Rt+2 − r̄π) + (Rt+3 − r̄π) + ⋯ | St = s, At = a],

where r̄π is the average reward, which is determined when π is given. There are different
names for vπ (s) in the literature such as the differential reward [65] or bias [2, Sec-
tion 8.2.1]. It can be verified that the state value defined above satisfies the following
Bellman-like equation:
" #
X X X
vπ (s) = π(a|s, θ) p(r|s, a)(r − r̄π ) + p(s0 |s, a)vπ (s0 ) . (9.22)
a r s0

P P
Since vπ (s) = a∈A π(a|s, θ)qπ (s, a), it holds that qπ (s, a) = r p(r|s, a)(r − r̄π ) +
0 0
P
s0 p(s |s, a)vπ (s ). The matrix-vector form of (9.22) is

vπ = rπ − r̄π 1n + Pπ vπ , (9.23)

where 1n = [1, . . . , 1]T ∈ Rn . Equation (9.23) is similar to the Bellman equation and it
has a specific name called the Poisson equation [65, 67].
How to solve vπ from the Poisson equation? The answer is given in the following
theorem.

Theorem 9.4 (Solution of the Poisson equation). Let

vπ∗ = (In − Pπ + 1n dTπ )−1 rπ . (9.24)

Then, vπ∗ is a solution of the Poisson equation in (9.23). Moreover, any solution of the
Poisson equation has the following form:

vπ = vπ∗ + c1n ,

where c ∈ R.

This theorem indicates that the solution of the Poisson equation may not be unique.
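The theorem can be illustrated numerically. The following Python sketch builds a random chain (hypothetical Pπ and rπ), computes vπ∗ by (9.24), and verifies that both vπ∗ and vπ∗ + c1n satisfy the Poisson equation (9.23).

# Numerical sketch of Theorem 9.4 on a randomly generated chain (hypothetical
# numbers): v* = (I - P + 1 d^T)^{-1} r solves v = r - r_bar*1 + P v, and so
# does v* + c*1 for any constant c.
import numpy as np

rng = np.random.default_rng(3)
n = 4
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)

# stationary distribution d_pi (left Perron eigenvector of P)
w, V = np.linalg.eig(P.T)
d = np.real(V[:, np.argmin(np.abs(w - 1))]); d /= d.sum()
r_bar = d @ r

ones = np.ones(n)
v_star = np.linalg.solve(np.eye(n) - P + np.outer(ones, d), r)   # equation (9.24)

print(np.max(np.abs(v_star - (r - r_bar * ones + P @ v_star))))  # ~1e-15: a solution
v_shift = v_star + 7.3 * ones
print(np.max(np.abs(v_shift - (r - r_bar * ones + P @ v_shift))))  # also ~0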

Box 9.5: Proof of Theorem 9.4


We prove the theorem in three steps.
 Step 1: Show that vπ∗ in (9.24) is a solution of (9.25).
For the sake of simplicity, let

.
A = In − Pπ + 1n dTπ .

Then, vπ∗ = A−1 rπ . The fact that A is invertible will be proven in Step 3. Substi-
tuting vπ∗ = A−1 rπ into (9.25) gives

A−1 rπ = rπ − 1n dTπ rπ + Pπ A−1 rπ .

This equation is valid, as proven below. Rearranging this equation gives (−A−1 +
In − 1n dTπ + Pπ A−1 )rπ = 0, and consequently,

(−In + A − 1n dTπ A + Pπ )A−1 rπ = 0.

The term in the brackets in the above equation is zero because −In +A−1n dTπ A+
Pπ = −In + (In − Pπ + 1n dTπ ) − 1n dTπ (In − Pπ + 1n dTπ ) + Pπ = 0. Therefore, vπ∗ in
(9.24) is a solution.
 Step 2: General expression of the solutions.
Substituting r̄π = dTπ rπ into (9.23) gives

vπ = rπ − 1n dTπ rπ + Pπ vπ (9.25)

and consequently

(In − Pπ )vπ = (In − 1n dTπ )rπ . (9.26)

It is noted that In − Pπ is singular because (In − Pπ)1n = 0 for any π. Therefore,
the solution of (9.26) is not unique: if vπ∗ is a solution, then vπ∗ + x is also a solution
for any x ∈ Null(In − Pπ). When Pπ is irreducible, Null(In − Pπ) = span{1n}.
Then, any solution of the Poisson equation has the expression vπ∗ + c1n, where
c ∈ R.
 Step 3: Show that A = In − Pπ + 1n dTπ is invertible.
Since vπ∗ involves A−1 , it is necessary to show that A is invertible. The analysis is
summarized in the following lemma.

Lemma 9.3. The matrix In − Pπ + 1n dTπ is invertible and its inverse is



    [In − (Pπ − 1n dπ^T)]^{-1} = Σ_{k=1}^{∞} (Pπ^k − 1n dπ^T) + In.

Proof. First of all, we state some preliminary facts without proof. Let ρ(M )
be the spectral radius of a matrix M . Then, I − M is invertible if ρ(M ) < 1.
Moreover, ρ(M ) < 1 if and only if limk→∞ M k = 0.
Based on the above facts, we next show that limk→∞ (Pπ − 1n dTπ )k → 0, and then
the invertibility of In − (Pπ − 1n dTπ ) immediately follows. To do that, we notice
that

(Pπ − 1n dTπ )k = Pπk − 1n dTπ , k ≥ 1, (9.27)

which can be proven by induction. For instance, when k = 1, the equation is
valid. When k = 2, we have

    (Pπ − 1n dπ^T)² = (Pπ − 1n dπ^T)(Pπ − 1n dπ^T)
                    = Pπ² − Pπ 1n dπ^T − 1n dπ^T Pπ + 1n dπ^T 1n dπ^T
                    = Pπ² − 1n dπ^T,

where the last equality is due to Pπ 1n = 1n , dTπ Pπ = dTπ , and dTπ 1n = 1. The case
of k ≥ 3 can be proven similarly.
Since dπ is the stationary distribution of the state, it holds that lim_{k→∞} Pπ^k = 1n dπ^T
(see Box 8.1). Therefore, (9.27) implies that

    lim_{k→∞} (Pπ − 1n dπ^T)^k = lim_{k→∞} Pπ^k − 1n dπ^T = 0.

As a result, ρ(Pπ −1n dTπ ) < 1 and hence In −(Pπ −1n dTπ ) is invertible. Furthermore,

the inverse of this matrix is given by

    [In − (Pπ − 1n dπ^T)]^{-1} = Σ_{k=0}^{∞} (Pπ − 1n dπ^T)^k
                               = In + Σ_{k=1}^{∞} (Pπ − 1n dπ^T)^k
                               = In + Σ_{k=1}^{∞} (Pπ^k − 1n dπ^T)
                               = Σ_{k=0}^{∞} (Pπ^k − 1n dπ^T) + 1n dπ^T.

The proof is complete.

The proof of Lemma 9.3 is inspired by [66]. However, the result (In − Pπ + 1n dπ^T)^{-1} =
Σ_{k=0}^{∞} (Pπ^k − 1n dπ^T) given in [66] (the statement above equation (16) in [66]) is
inaccurate, because Σ_{k=0}^{∞} (Pπ^k − 1n dπ^T) is singular since Σ_{k=0}^{∞} (Pπ^k − 1n dπ^T) 1n = 0.
Lemma 9.3 corrects this inaccuracy.

Derivation of gradients

Although the value of vπ is not unique in the undiscounted case, as shown in Theorem 9.4,
the value of r̄π is unique. In particular, it follows from the Poisson equation that

    r̄π 1n = rπ + (Pπ − In) vπ
          = rπ + (Pπ − In)(vπ∗ + c1n)
          = rπ + (Pπ − In) vπ∗.

Notably, the undetermined value c is canceled and hence r̄π is unique. Therefore, we can
calculate the gradient of r̄π in the undiscounted case. In addition, since vπ is not unique,
v̄π is not unique either. We do not study the gradient of v̄π in the undiscounted case. For
interested readers, it is worth mentioning that we can add more constraints to uniquely
solve vπ from the Poisson equation. For example, by assuming that a recurrent state
exists, the state value of this recurrent state is zero [65, Section II], and hence, c can
be determined. There are also other ways to uniquely determine vπ . See, for example,
equations (8.6.5)-(8.6.7) in [2].
The gradient of r̄π in the undiscounted case is given below.

Theorem 9.5 (Gradient of r̄π in the undiscounted case). In the undiscounted case, the

gradient of the average reward r̄π is

    ∇θ r̄π = Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
          = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],        (9.28)

where S ∼ dπ and A ∼ π(S, θ).

Compared to the discounted case shown in Theorem 9.3, the gradient of r̄π in the
undiscounted case is more elegant in the sense that (9.28) is strictly valid and S obeys
the stationary distribution.

Box 9.6: Proof of Theorem 9.5


First of all, it follows from vπ(s) = Σ_{a∈A} π(a|s, θ) qπ(s, a) that

    ∇θ vπ(s) = ∇θ [ Σ_{a∈A} π(a|s, θ) qπ(s, a) ]
             = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ∇θ qπ(s, a) ],        (9.29)

where qπ(s, a) is the action value satisfying

    qπ(s, a) = Σ_r p(r|s, a)(r − r̄π) + Σ_{s′} p(s′|s, a) vπ(s′)
             = r(s, a) − r̄π + Σ_{s′} p(s′|s, a) vπ(s′).

Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

    ∇θ qπ(s, a) = 0 − ∇θ r̄π + Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).

Substituting this result into (9.29) yields

    ∇θ vπ(s) = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ( −∇θ r̄π + Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) ) ]
             = Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a) − ∇θ r̄π + Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).        (9.30)

Let

    u(s) ≜ Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Since Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) = Σ_{s′∈S} p(s′|s) ∇θ vπ(s′), equation
(9.30) can be written in matrix-vector form by stacking ∇θ vπ(s) ∈ R^m and u(s) ∈ R^m
over all states into the vectors ∇θ vπ, u ∈ R^{mn}:

    ∇θ vπ = u − 1n ⊗ ∇θ r̄π + (Pπ ⊗ Im) ∇θ vπ,

where n = |S|, m is the dimension of θ, and ⊗ is the Kronecker product. Hence,

    1n ⊗ ∇θ r̄π = u + (Pπ ⊗ Im) ∇θ vπ − ∇θ vπ.

Multiplying dπ^T ⊗ Im on both sides of the above equation gives

    (dπ^T 1n) ⊗ ∇θ r̄π = (dπ^T ⊗ Im) u + [(dπ^T Pπ) ⊗ Im] ∇θ vπ − (dπ^T ⊗ Im) ∇θ vπ
                       = (dπ^T ⊗ Im) u,

which implies

    ∇θ r̄π = (dπ^T ⊗ Im) u
          = Σ_{s∈S} dπ(s) u(s)
          = Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

9.4 Monte Carlo policy gradient (REINFORCE)


With the gradient presented in Theorem 9.1, we next show how to use the gradient-based
method to optimize the metrics to obtain optimal policies.
The gradient-ascent algorithm for maximizing J(θ) is

    θt+1 = θt + α ∇θ J(θt)
         = θt + α E[ ∇θ ln π(A|S, θt) qπ(S, A) ],        (9.31)

where α > 0 is a constant learning rate. Since the true gradient in (9.31) is unknown, we

can replace the true gradient with a stochastic gradient to obtain the following algorithm:

θt+1 = θt + α∇θ ln π(at |st , θt )qt (st , at ), (9.32)

where qt(st, at) is an approximation of qπ(st, at). If qt(st, at) is obtained by Monte Carlo
estimation, the algorithm is called REINFORCE [68] or Monte Carlo policy gradient,
which is one of the earliest and simplest policy gradient algorithms.
The algorithm in (9.32) is important since many other policy gradient algorithms can
be obtained by extending it. We next examine the interpretation of (9.32) more closely.
Since ∇θ ln π(at|st, θt) = ∇θ π(at|st, θt)/π(at|st, θt), we can rewrite (9.32) as

    θt+1 = θt + α (qt(st, at)/π(at|st, θt)) ∇θ π(at|st, θt).

Denoting the coefficient as βt ≜ qt(st, at)/π(at|st, θt), this update can be further written concisely as

θt+1 = θt + αβt ∇θ π(at |st , θt ). (9.33)

Two important interpretations can be seen from this equation.

 First, since (9.33) is a simple gradient-ascent algorithm, the following observations
can be obtained.

- If βt ≥ 0, the probability of choosing (st , at ) is enhanced. That is

π(at |st , θt+1 ) ≥ π(at |st , θt ).

The greater βt is, the stronger the enhancement is.


- If βt < 0, the probability of choosing (st , at ) decreases. That is

π(at |st , θt+1 ) < π(at |st , θt ).

The above observations can be proven as follows. When θt+1 − θt is sufficiently small,
it follows from the Taylor expansion that

    π(at|st, θt+1) ≈ π(at|st, θt) + (∇θ π(at|st, θt))^T (θt+1 − θt)
                   = π(at|st, θt) + α βt (∇θ π(at|st, θt))^T (∇θ π(at|st, θt))        (substituting (9.33))
                   = π(at|st, θt) + α βt ‖∇θ π(at|st, θt)‖².

It is clear that π(at |st , θt+1 ) ≥ π(at |st , θt ) when βt ≥ 0 and π(at |st , θt+1 ) < π(at |st , θt )
when βt < 0.
Algorithm 9.1: Policy Gradient by Monte Carlo (REINFORCE)

Initialization: Initial parameter θ; γ ∈ (0, 1); α > 0.
Goal: Learn an optimal policy for maximizing J(θ).
For each episode, do
    Generate an episode {s0, a0, r1, . . . , sT−1, aT−1, rT} following π(θ).
    For t = 0, 1, . . . , T − 1:
        Value update: qt(st, at) = Σ_{k=t+1}^{T} γ^{k−t−1} rk
        Policy update: θ ← θ + α ∇θ ln π(at|st, θ) qt(st, at)

 Second, the algorithm can strike a balance between exploration and exploitation to a
certain extent due to the expression of

    βt = qt(st, at) / π(at|st, θt).

On the one hand, βt is proportional to qt(st, at). As a result, if the action value of
(st, at) is large, then π(at|st, θt) is enhanced so that the probability of selecting at
increases. Therefore, the algorithm attempts to exploit actions with greater values.
On the other hand, βt is inversely proportional to π(at|st, θt) when qt(st, at) > 0. As
a result, if the probability of selecting at is small, then π(at|st, θt) is enhanced so that
the probability of selecting at increases. Therefore, the algorithm attempts to explore
actions with low probabilities.

Moreover, since (9.32) uses samples to approximate the true gradient in (9.31), it is
important to understand how the samples should be obtained.

 How to sample S? S in the true gradient E[∇θ ln π(A|S, θt )qπ (S, A)] should obey the
distribution η which is either the stationary distribution dπ or the discounted total
probability distribution ρπ in (9.19). Either dπ or ρπ represents the long-term behavior
exhibited under π.
 How to sample A? A in E[∇θ ln π(A|S, θt )qπ (S, A)] should obey the distribution of
π(A|S, θ). The ideal way to sample A is to select at following π(a|st , θt ). Therefore,
the policy gradient algorithm is on-policy.

Unfortunately, the ideal ways for sampling S and A are not strictly followed in practice
due to their low efficiency of sample usage. A more sample-efficient implementation of
(9.32) is given in Algorithm 9.1. In this implementation, an episode is first generated by
following π(θ). Then, θ is updated multiple times using every experience sample in the
episode.
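To make the procedure concrete, the following is a minimal Python sketch of Algorithm 9.1 with a tabular softmax policy on a small, purely hypothetical chain MDP (three states, two actions, reward 1 upon reaching the rightmost state). The environment, step sizes, and episode count are illustrative assumptions rather than examples from this book; only the value and policy updates follow Algorithm 9.1.

# A minimal sketch of REINFORCE (Algorithm 9.1) with a tabular softmax policy
# on a hypothetical chain MDP. grad_log_pi uses the closed form for a tabular
# softmax policy: d ln pi(a|s) / d theta[s, :] = e_a - pi(.|s).
import numpy as np

n_states, n_actions, gamma, alpha = 3, 2, 0.9, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    """Hypothetical dynamics: action 1 moves right, action 0 moves left;
    reaching the last state gives reward 1 and ends the episode."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

def pi(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta = np.zeros((n_states, n_actions))
for episode in range(500):
    # generate an episode following pi(theta)
    s, traj, done = 0, [], False
    while not done:
        a = rng.choice(n_actions, p=pi(theta, s))
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
    # for each step: Monte Carlo return q_t, then gradient-ascent policy update
    g = 0.0
    for s, a, r in reversed(traj):
        g = r + gamma * g                       # q_t(s_t, a_t): return from time t
        grad_log_pi = -pi(theta, s)
        grad_log_pi[a] += 1.0
        theta[s] += alpha * grad_log_pi * g     # policy update (9.32)

print(pi(theta, 0))   # the probability of the 'right' action should dominate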

9.5 Summary
This chapter introduced the policy gradient method, which is the foundation of many
modern reinforcement learning algorithms. Policy gradient methods are policy-based. This
is a big step forward in this book because all the methods in the previous chapters are
value-based. The basic idea of the policy gradient method is simple: select an
appropriate scalar metric and then optimize it via a gradient-ascent algorithm.
The most complicated part of the policy gradient method is the derivation of the
gradients of the metrics. That is because we have to distinguish various scenarios with
different metrics and discounted/undiscounted cases. Fortunately, the expressions of the
gradients in different scenarios are similar. Hence, we summarized the expressions in
Theorem 9.1, which is the most important theoretical result in this chapter. For many
readers, it is sufficient to be aware of this theorem. Its proof is nontrivial, and it is not
required for all readers to study.
The policy gradient algorithm in (9.32) must be properly understood since it is the
foundation of many advanced policy gradient algorithms. In the next chapter, this algo-
rithm will be extended to another important policy gradient method called actor-critic.

9.6 Q&A
 Q: What is the basic idea of the policy gradient method?
A: The basic idea is simple. That is to define an appropriate scalar metric, derive
its gradient, and then use gradient-ascent methods to optimize the metric. The most
important theoretical result regarding this method is the policy gradient given in
Theorem 9.1.
 Q: What is the most complicated part of the policy gradient method?
A: The basic idea of the policy gradient method is simple. However, the derivation
procedure of the gradients is quite complicated. That is because we have to distin-
guish numerous different scenarios. The mathematical derivation procedure in each
scenario is nontrivial. It is sufficient for many readers to be familiar with the result
in Theorem 9.1 without knowing the proof.
 Q: What metrics should be used in the policy gradient method?
A: We introduced three common metrics in this chapter: v̄π , v̄π0 , and r̄π . Since they
all lead to similar policy gradients, they all can be adopted in the policy gradient
method. More importantly, the expressions in (9.1) and (9.4) are often encountered
in the literature.
 Q: Why is a natural logarithm function contained in the policy gradient?

A: A natural logarithm function is introduced to express the gradient as an expected
value. In this way, we can approximate the true gradient with a stochastic one.
 Q: Why do we need to study undiscounted cases when deriving the policy gradient?
A: First, for continuing tasks, it may be inappropriate to introduce the discount rate
and we need to consider the undiscounted case. Second, the definition of the average
reward r̄π is valid for both discounted and undiscounted cases. While the gradient
of r̄π in the discounted case is an approximation, we will see that its gradient in the
undiscounted case is more elegant.
 Q: What does the policy gradient algorithm in (9.32) do mathematically?
A: To better understand this algorithm, readers are recommended to examine its
concise expression in (9.33), which clearly shows that it is a gradient-ascent algorithm
for updating the value of π(at |st , θt ). That is, when a sample (st , at ) is available, the
policy can be updated so that π(at |st , θt+1 ) ≥ π(at |st , θt ) or π(at |st , θt+1 ) < π(at |st , θt )
depending on the coefficients.

Chapter 10

Actor-Critic Methods

[Figure 10.1: Where we are in this book. The chapter map shows that Chapter 10 (Actor-Critic Methods) combines the policy-based methods of Chapter 9 (Policy Gradient Methods) with the value-based methods of Chapter 8 (Value Function Approximation), which in turn build on the fundamental tools and tabular methods of Chapters 1-7.]

This chapter introduces actor-critic methods. From one point of view, “actor-critic”
refers to a structure that incorporates both policy-based and value-based methods. Here,
an “actor” refers to a policy update step. The reason that it is called an actor is that the
actions are taken by following the policy. Here, a “critic” refers to a value update step. It
is called a critic because it criticizes the actor by evaluating its corresponding values. From
another point of view, actor-critic methods are still policy gradient algorithms. They can
be obtained by extending the policy gradient algorithm introduced in Chapter 9. It is
important for the reader to well understand the contents of Chapters 8 and 9 before
studying this chapter.

10.1 The simplest actor-critic algorithm (QAC)


This section introduces the simplest actor-critic algorithm. This algorithm can be easily
obtained by extending the policy gradient algorithm in (9.32).
Recall that the idea of the policy gradient method is to search for an optimal policy
by maximizing a scalar metric J(θ). The gradient-ascent algorithm for maximizing J(θ)
is

    θt+1 = θt + α ∇θ J(θt)
         = θt + α E_{S∼η, A∼π}[ ∇θ ln π(A|S, θt) qπ(S, A) ],        (10.1)

where η is a distribution of the states (see Theorem 9.1 for more information). Since the
true gradient is unknown, we can use a stochastic gradient to approximate it:

θt+1 = θt + α∇θ ln π(at |st , θt )qt (st , at ). (10.2)

This is the algorithm given in (9.32).


Equation (10.2) is important because it clearly shows how policy-based and value-
based methods can be combined. On the one hand, it is a policy-based algorithm since it
directly updates the policy parameter. On the other hand, this equation requires knowing
qt (st , at ), which is an estimate of the action value qπ (st , at ). As a result, another value-
based algorithm is required to generate qt (st , at ). So far, we have studied two ways to
estimate action values in this book. The first is based on Monte Carlo learning and the
second is temporal-difference (TD) learning.

 If qt(st, at) is estimated by Monte Carlo learning, the corresponding algorithm is called
REINFORCE or Monte Carlo policy gradient, which has already been introduced in
Chapter 9.
 If qt (st , at ) is estimated by TD learning, the corresponding algorithms are usually
called actor-critic. Therefore, actor-critic methods can be obtained by incorporating
TD-based value estimation into policy gradient methods.

The procedure of the simplest actor-critic algorithm is summarized in Algorithm 10.1.


The critic corresponds to the value update step via the Sarsa algorithm presented in
(8.35). The action values are represented by a parameterized function q(s, a, w). The
actor corresponds to the policy update step in (10.2). This actor-critic algorithm is
sometimes called Q actor-critic (QAC). Although it is simple, QAC reveals the core idea
of actor-critic methods. It can be extended to generate many advanced ones as shown in
the rest of this chapter.

Algorithm 10.1: The simplest actor-critic algorithm (QAC)

Initialization: A policy function π(a|s, θ0) where θ0 is the initial parameter. A value
function q(s, a, w0) where w0 is the initial parameter. αw, αθ > 0.
Goal: Learn an optimal policy to maximize J(θ).
At time step t in each episode, do
    Generate at following π(a|st, θt), observe rt+1, st+1, and then generate at+1 following
    π(a|st+1, θt).
    Actor (policy update):
        θt+1 = θt + αθ ∇θ ln π(at|st, θt) q(st, at, wt)
    Critic (value update):
        wt+1 = wt + αw [rt+1 + γ q(st+1, at+1, wt) − q(st, at, wt)] ∇w q(st, at, wt)

10.2 Advantage actor-critic (A2C)


We now introduce the algorithm of advantage actor-critic. The core idea of this algorithm
is to introduce a baseline to reduce estimation variance.

10.2.1 Baseline invariance


One interesting property of the policy gradient is that it is invariant to an additional
baseline. That is
    E_{S∼η, A∼π}[ ∇θ ln π(A|S, θt) qπ(S, A) ] = E_{S∼η, A∼π}[ ∇θ ln π(A|S, θt) (qπ(S, A) − b(S)) ],        (10.3)

where the additional baseline b(S) is a scalar function of S. We next answer two questions
about the baseline.

 First, why is (10.3) valid?


Equation (10.3) holds if and only if

    E_{S∼η, A∼π}[ ∇θ ln π(A|S, θt) b(S) ] = 0.

This equation is valid because


h i X X
ES∼η,A∼π ∇θ ln π(A|S, θt )b(S) = η(s) π(a|s, θt )∇θ ln π(a|s, θt )b(s)
s∈S a∈A
X X
= η(s) ∇θ π(a|s, θt )b(s)
s∈S a∈A
X X
= η(s)b(s) ∇θ π(a|s, θt )
s∈S a∈A
X X
= η(s)b(s)∇θ π(a|s, θt )
s∈S a∈A
X
= η(s)b(s)∇θ 1 = 0.
s∈S

 Second, why is the baseline useful?


The baseline is useful because it can reduce the approximation variance when we use
samples to approximate the true gradient. In particular, let

    X(S, A) ≜ ∇θ ln π(A|S, θt)[qπ(S, A) − b(S)].        (10.4)

Then, the true gradient is E[X(S, A)]. Since we need to use a stochastic sample x to
approximate E[X], it would be favorable if the variance var(X) is small. For example,
if var(X) is close to zero, then any sample x can accurately approximate E[X]. On
the contrary, if var(X) is large, the value of a sample may be far from E[X].
Although E[X] is invariant to the baseline, the variance var(X) is not. Our goal is to
design a good baseline to minimize var(X). In the algorithms of REINFORCE and
QAC, we set b = 0, which is not guaranteed to be a good baseline.
In fact, the optimal baseline that minimizes var(X) is

    b*(s) = E_{A∼π}[ ‖∇θ ln π(A|s, θt)‖² qπ(s, A) ] / E_{A∼π}[ ‖∇θ ln π(A|s, θt)‖² ],    s ∈ S.        (10.5)

The proof is given in Box 10.1.


Although the baseline in (10.5) is optimal, it is too complex to be useful in practice.
If the weight ‖∇θ ln π(A|s, θt)‖² is removed from (10.5), we can obtain a suboptimal
baseline that has a concise expression:

b† (s) = EA∼π [qπ (s, A)] = vπ (s), s ∈ S.

Interestingly, this suboptimal baseline is the state value.

Box 10.1: Showing that b∗ (s) in (10.5) is the optimal baseline


Let x̄ ≜ E[X], which is invariant for any b(s). If X is a vector, its variance is a
matrix. It is common to select the trace of var(X) as a scalar objective function for
optimization:

    tr[var(X)] = tr E[(X − x̄)(X − x̄)^T]
               = tr E[X X^T − x̄ X^T − X x̄^T + x̄ x̄^T]
               = E[X^T X − X^T x̄ − x̄^T X + x̄^T x̄]
               = E[X^T X] − x̄^T x̄.        (10.6)

When deriving the above equation, we use the trace property tr(AB) = tr(BA)
for any square matrices A, B with appropriate dimensions. Since x̄ is invariant,
equation (10.6) suggests that we only need to minimize E[X^T X]. With X defined in
(10.4), we have

    E[X^T X] = E[ (∇θ ln π)^T (∇θ ln π) (qπ(S, A) − b(S))² ]
             = E[ ‖∇θ ln π‖² (qπ(S, A) − b(S))² ],

where π(A|S, θ) is written as π for short. Since S ∼ η and A ∼ π, the above equation
can be rewritten as

    E[X^T X] = Σ_{s∈S} η(s) E_{A∼π}[ ‖∇θ ln π‖² (qπ(s, A) − b(s))² ].

To ensure ∇_b E[X^T X] = 0, b(s) for any s ∈ S should satisfy

    E_{A∼π}[ ‖∇θ ln π‖² (b(s) − qπ(s, A)) ] = 0,    s ∈ S.

The above equation can be easily solved to obtain the optimal baseline:

    b*(s) = E_{A∼π}[ ‖∇θ ln π‖² qπ(s, A) ] / E_{A∼π}[ ‖∇θ ln π‖² ],    s ∈ S.

More discussions on optimal baselines in policy gradient methods can be found in
[69, 70].
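The effect of the baseline can also be seen numerically. The following Python sketch considers a single state with three actions and made-up values of π(a|s, θ) and qπ(s, a); it confirms that E[X] is unchanged when the baseline b(s) = vπ(s) is used, while tr var(X) decreases.

# A small numerical check (with made-up numbers for one state) that the policy
# gradient is invariant to the baseline while the variance is not.
import numpy as np

pi_s = np.array([0.2, 0.5, 0.3])          # pi(a|s, theta) for three actions
q_s = np.array([1.0, 3.0, 10.0])          # q_pi(s, a), hypothetical values

# For a tabular softmax policy, d ln pi(a|s)/d theta[s, :] = e_a - pi(.|s).
grads = np.eye(3) - pi_s                  # row a is grad_theta ln pi(a|s)

def mean_and_trvar(b):
    X = grads * (q_s - b)[:, None]        # X(s, a) for each action a
    mean = pi_s @ X                       # E_{A~pi}[X]
    second = np.sum(pi_s[:, None] * X**2) # E[X^T X]
    return mean, second - mean @ mean     # tr var(X) = E[X^T X] - |E[X]|^2

for b in (0.0, pi_s @ q_s):               # no baseline vs. b(s) = v_pi(s)
    mean, trvar = mean_and_trvar(b)
    print(np.round(mean, 6), float(trvar))
# The printed means coincide; tr var(X) is smaller with the baseline.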

10.2.2 Algorithm description


When b(s) = vπ (s), the gradient-ascent algorithm in (10.1) becomes
    θt+1 = θt + α E[ ∇θ ln π(A|S, θt) (qπ(S, A) − vπ(S)) ]
         = θt + α E[ ∇θ ln π(A|S, θt) δπ(S, A) ].        (10.7)

Here,

    δπ(S, A) ≜ qπ(S, A) − vπ(S)

is called the advantage function, which reflects the advantage of one action over the
others. More specifically, note that vπ(s) = Σ_{a∈A} π(a|s) qπ(s, a) is the mean of the action
values. If δπ(s, a) > 0, it means that the corresponding action has a greater value than
the mean value.
The stochastic version of (10.7) is

    θt+1 = θt + α ∇θ ln π(at|st, θt) [qt(st, at) − vt(st)]
         = θt + α ∇θ ln π(at|st, θt) δt(st, at),        (10.8)

where st , at are samples of S, A at time t. Here, qt (st , at ) and vt (st ) are approximations of
qπ(θt ) (st , at ) and vπ(θt ) (st ), respectively. The algorithm in (10.8) updates the policy based
on the relative value of qt with respect to vt rather than the absolute value of qt . This is
intuitively reasonable because, when we attempt to select an action at a state, we only
care about which action has the greatest value relative to the others.
If qt (st , at ) and vt (st ) are estimated by Monte Carlo learning, the algorithm in (10.8) is
called REINFORCE with a baseline. If qt (st , at ) and vt (st ) are estimated by TD learning,
the algorithm is usually called advantage actor-critic (A2C). The implementation of A2C
is summarized in Algorithm 10.2. It should be noted that the advantage function in this
implementation is approximated by the TD error:

qt (st , at ) − vt (st ) ≈ rt+1 + γvt (st+1 ) − vt (st ).

This approximation is reasonable because


h i
qπ (st , at ) − vπ (st ) = E Rt+1 + γvπ (St+1 ) − vπ (St )|St = st , At = at ,

which is valid due to the definition of qπ (st , at ). One merit of using the TD error is
that we only need to use a single neural network to represent vπ (s). Otherwise, if δt =
qt (st , at ) − vt (st ), we need to maintain two networks to represent vπ (s) and qπ (s, a),
respectively. When we use the TD error, the algorithm may also be called TD actor-
critic. In addition, it is notable that the policy π(θt ) is stochastic and hence exploratory.
Therefore, it can be directly used to generate experience samples without relying on

Algorithm 10.2: Advantage actor-critic (A2C) or TD actor-critic

Initialization: A policy function π(a|s, θ0) where θ0 is the initial parameter. A value
function v(s, w0) where w0 is the initial parameter. αw, αθ > 0.
Goal: Learn an optimal policy to maximize J(θ).
At time step t in each episode, do
    Generate at following π(a|st, θt) and then observe rt+1, st+1.
    Advantage (TD error):
        δt = rt+1 + γ v(st+1, wt) − v(st, wt)
    Actor (policy update):
        θt+1 = θt + αθ δt ∇θ ln π(at|st, θt)
    Critic (value update):
        wt+1 = wt + αw δt ∇w v(st, wt)

techniques such as ε-greedy. There are some variants of A2C such as asynchronous
advantage actor-critic (A3C). Interested readers may check [71, 72].
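To make the structure concrete, the following is a minimal Python sketch of Algorithm 10.2 with a tabular softmax actor and a tabular critic v(s, w) = w[s], reusing the hypothetical chain MDP from the REINFORCE sketch in Chapter 9. All specific numbers (state/action sizes, step sizes, episode count) are illustrative assumptions.

# A minimal sketch of advantage actor-critic (Algorithm 10.2) with a tabular
# softmax actor and tabular critic on a hypothetical chain MDP.
import numpy as np

n_states, n_actions = 3, 2
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

def pi(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta = np.zeros((n_states, n_actions))   # actor parameter
w = np.zeros(n_states)                    # critic parameter, v(s, w) = w[s]

for episode in range(500):
    s, done = 0, False
    while not done:
        a = rng.choice(n_actions, p=pi(theta, s))
        s2, r, done = step(s, a)
        v_next = 0.0 if done else w[s2]
        delta = r + gamma * v_next - w[s]          # advantage approximated by TD error
        grad_log_pi = -pi(theta, s); grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log_pi   # actor update
        w[s] += alpha_w * delta                         # critic update (grad_w v = e_s)
        s = s2

print(pi(theta, 0), w)   # the 'right' action should dominate; w approximates v_pi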

10.3 Off-policy actor-critic


The policy gradient methods that we have studied so far, including REINFORCE, QAC,
and A2C, are all on-policy. The reason for this can be seen from the expression of the
true gradient:
    ∇θ J(θ) = E_{S∼η, A∼π}[ ∇θ ln π(A|S, θt) (qπ(S, A) − vπ(S)) ].
To use samples to approximate this true gradient, we must generate the action samples
by following π(θ). Hence, π(θ) is the behavior policy. Since π(θ) is also the target policy
that we aim to improve, the policy gradient methods are on-policy.
In the case that we already have some samples generated by a given behavior policy,
the policy gradient methods can still be applied to utilize these samples. To do that,
we can employ a technique called importance sampling. It is worth mentioning that the
importance sampling technique is not restricted to the field of reinforcement learning.
It is a general technique for estimating expected values defined over one probability
distribution using some samples drawn from another distribution.

10.3.1 Importance sampling


We next introduce the importance sampling technique. Consider a random variable
X ∈ X . Suppose that p0 (X) is a probability distribution. Our goal is to estimate
EX∼p0 [X]. Suppose that we have some i.i.d. samples {xi }ni=1 .

 First, if the samples {xi}_{i=1}^n are generated by following p0, then the average value
x̄ = (1/n) Σ_{i=1}^n xi can be used to approximate E_{X∼p0}[X], because x̄ is an unbiased estimate
of E_{X∼p0}[X] and the estimation variance converges to zero as n → ∞ (see the law of
large numbers in Box 5.1 for more information).
 Second, consider a new scenario where the samples {xi}_{i=1}^n are not generated by
p0. Instead, they are generated by another distribution p1. Can we still use these
samples to approximate E_{X∼p0}[X]? The answer is yes. However, we can no longer use
x̄ = (1/n) Σ_{i=1}^n xi to approximate E_{X∼p0}[X], since x̄ ≈ E_{X∼p1}[X] rather than E_{X∼p0}[X].
In the second scenario, EX∼p0 [X] can be approximated based on the importance sam-
pling technique. In particular, EX∼p0 [X] satisfies

    E_{X∼p0}[X] = Σ_{x∈X} p0(x) x = Σ_{x∈X} p1(x) [ (p0(x)/p1(x)) x ] = E_{X∼p1}[f(X)],        (10.9)

where f(x) ≜ (p0(x)/p1(x)) x.

Thus, estimating EX∼p0 [X] becomes the problem of estimating EX∼p1 [f (X)]. Let
    f̄ ≜ (1/n) Σ_{i=1}^n f(xi).

Since f̄ can effectively approximate E_{X∼p1}[f(X)], it then follows from (10.9) that

    E_{X∼p0}[X] = E_{X∼p1}[f(X)] ≈ f̄ = (1/n) Σ_{i=1}^n f(xi) = (1/n) Σ_{i=1}^n (p0(xi)/p1(xi)) xi.        (10.10)

Equation (10.10) suggests that E_{X∼p0}[X] can be approximated by a weighted average of
the samples xi. Here, p0(xi)/p1(xi) is called the importance weight. When p1 = p0, the importance weight
is 1 and f̄ becomes x̄. When p0(xi) ≥ p1(xi), xi can be sampled more frequently by p0
but less frequently by p1. In this case, the importance weight, which is greater than one,
emphasizes the importance of this sample.
Some readers may ask the following question: while p0(x) is required in (10.10), why
do we not directly calculate E_{X∼p0}[X] using its definition E_{X∼p0}[X] = Σ_{x∈X} p0(x) x?
The answer is as follows. To use the definition, we need to know either the analytical
expression of p0 or the value of p0 (x) for every x ∈ X . However, it is difficult to obtain
the analytical expression of p0 when the distribution is represented by, for example, a
neural network. It is also difficult to obtain the value of p0 (x) for every x ∈ X when X
is large. By contrast, (10.10) merely requires the values of p0 (xi ) for some samples and
is much easier to implement in practice.

An illustrative example

We next present an example to demonstrate the importance sampling technique. Consider
X ∈ X ≜ {+1, −1}. Suppose that p0 is a probability distribution satisfying

p0 (X = +1) = 0.5, p0 (X = −1) = 0.5.

The expectation of X over p0 is

EX∼p0 [X] = (+1) · 0.5 + (−1) · 0.5 = 0.

Suppose that p1 is another distribution satisfying

p1 (X = +1) = 0.8, p1 (X = −1) = 0.2.

The expectation of X over p1 is

EX∼p1 [X] = (+1) · 0.8 + (−1) · 0.2 = 0.6.

Suppose that we have some samples {xi } drawn over p1 . Our goal is to estimate
EX∼p0 [X] using these samples. As shown in Figure 10.2, there are more samples of +1
than −1. That is because p1(X = +1) = 0.8 > p1(X = −1) = 0.2. If we directly calculate
the average value (1/n) Σ_{i=1}^n xi of the samples, this value converges to E_{X∼p1}[X] = 0.6 (see
the dotted line in Figure 10.2). By contrast, if we calculate the weighted average value
as in (10.10), this value can successfully converge to EX∼p0 [X] = 0 (see the solid line in
Figure 10.2).

[Figure 10.2: An example for demonstrating the importance sampling technique. Here, X ∈ {+1, −1} and p0(X = +1) = p0(X = −1) = 0.5. The samples are generated according to p1, where p1(X = +1) = 0.8 and p1(X = −1) = 0.2. The average of the samples converges to E_{X∼p1}[X] = 0.6, but the weighted average calculated by the importance sampling technique in (10.10) converges to E_{X∼p0}[X] = 0.]

Finally, the distribution p1, which is used to generate samples, must satisfy that
p1(x) ≠ 0 when p0(x) ≠ 0. If p1(x) = 0 while p0(x) ≠ 0, the estimation result may be
problematic. For example, if

p1 (X = +1) = 1, p1 (X = −1) = 0,

then the samples generated by p1 are all positive: {xi } = {+1, +1, . . . , +1}. These
samples cannot be used to correctly estimate EX∼p0 [X] = 0 because
    (1/n) Σ_{i=1}^n (p0(xi)/p1(xi)) xi = (1/n) Σ_{i=1}^n (p0(+1)/p1(+1)) · 1 = (1/n) Σ_{i=1}^n (0.5/1) · 1 ≡ 0.5,

no matter how large n is.
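The example can be reproduced numerically with a few lines of Python; the sample size and random seed are arbitrary.

# Samples of X in {+1, -1} are drawn from p1, yet the weighted average in
# (10.10) recovers the expectation under p0.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.choice([1.0, -1.0], size=n, p=[0.8, 0.2])     # samples drawn from p1

p0 = {1.0: 0.5, -1.0: 0.5}
p1 = {1.0: 0.8, -1.0: 0.2}
weights = np.array([p0[v] / p1[v] for v in x])        # importance weights

print(x.mean())                # close to E_{X~p1}[X] = 0.6
print((weights * x).mean())    # close to E_{X~p0}[X] = 0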

10.3.2 The off-policy policy gradient theorem


With the importance sampling technique, we are ready to present the off-policy policy
gradient theorem. Suppose that β is a behavior policy. Our goal is to use the samples
generated by β to learn a target policy π that can maximize the following metric:
    J(θ) = Σ_{s∈S} dβ(s) vπ(s) = E_{S∼dβ}[vπ(S)],

where dβ is the stationary distribution under policy β and vπ is the state value under
policy π. The gradient of this metric is given in the following theorem.

Theorem 10.1 (Off-policy policy gradient theorem). In the discounted case where γ ∈
(0, 1), the gradient of J(θ) is
 
    ∇θ J(θ) = E_{S∼ρ, A∼β}[ (π(A|S, θ)/β(A|S)) ∇θ ln π(A|S, θ) qπ(S, A) ],        (10.11)

where π(A|S, θ)/β(A|S) is the importance weight and the state distribution ρ is

    ρ(s) ≜ Σ_{s′∈S} dβ(s′) Prπ(s|s′),    s ∈ S,

where Prπ(s|s′) = Σ_{k=0}^{∞} γ^k [Pπ^k]_{s′s} = [(I − γPπ)^{-1}]_{s′s} is the discounted total probability of
transitioning from s′ to s under policy π.

The gradient in (10.11) is similar to that in the on-policy case in Theorem 9.1, but
there are two differences. The first difference is the importance weight. The second
difference is that A ∼ β instead of A ∼ π. Therefore, we can use the action samples

generated by following β to approximate the true gradient. The proof of the theorem is
given in Box 10.2.

Box 10.2: Proof of Theorem 10.1

Since dβ is independent of θ, the gradient of J(θ) satisfies

    ∇θ J(θ) = ∇θ Σ_{s∈S} dβ(s) vπ(s) = Σ_{s∈S} dβ(s) ∇θ vπ(s).        (10.12)

According to Lemma 9.2, the expression of ∇θ vπ(s) is

    ∇θ vπ(s) = Σ_{s′∈S} Prπ(s′|s) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a),        (10.13)

where Prπ(s′|s) ≜ Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss′} = [(In − γPπ)^{-1}]_{ss′}. Substituting (10.13) into
(10.12) yields

    ∇θ J(θ) = Σ_{s∈S} dβ(s) ∇θ vπ(s)
            = Σ_{s∈S} dβ(s) Σ_{s′∈S} Prπ(s′|s) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
            = Σ_{s′∈S} ( Σ_{s∈S} dβ(s) Prπ(s′|s) ) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
            = Σ_{s′∈S} ρ(s′) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
            = Σ_{s∈S} ρ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)        (change s′ to s)
            = E_{S∼ρ}[ Σ_{a∈A} ∇θ π(a|S, θ) qπ(S, a) ].

By using the importance sampling technique, the above equation can be further
rewritten as

    E_{S∼ρ}[ Σ_{a∈A} ∇θ π(a|S, θ) qπ(S, a) ]
        = E_{S∼ρ}[ Σ_{a∈A} β(a|S) (π(a|S, θ)/β(a|S)) (∇θ π(a|S, θ)/π(a|S, θ)) qπ(S, a) ]
        = E_{S∼ρ}[ Σ_{a∈A} β(a|S) (π(a|S, θ)/β(a|S)) ∇θ ln π(a|S, θ) qπ(S, a) ]
        = E_{S∼ρ, A∼β}[ (π(A|S, θ)/β(A|S)) ∇θ ln π(A|S, θ) qπ(S, A) ].

The proof is complete. The above proof is similar to that of Theorem 9.1.

10.3.3 Algorithm description


Based on the off-policy policy gradient theorem, we are ready to present the off-policy
actor-critic algorithm. Since the off-policy case is very similar to the on-policy case, we
merely present some key steps.
First, the off-policy policy gradient is invariant to any additional baseline b(s). In
particular, we have
 
    ∇θ J(θ) = E_{S∼ρ, A∼β}[ (π(A|S, θ)/β(A|S)) ∇θ ln π(A|S, θ) (qπ(S, A) − b(S)) ],

because E[ (π(A|S, θ)/β(A|S)) ∇θ ln π(A|S, θ) b(S) ] = 0. To reduce the estimation variance, we can
select the baseline as b(S) = vπ(S) and obtain

    ∇θ J(θ) = E[ (π(A|S, θ)/β(A|S)) ∇θ ln π(A|S, θ) (qπ(S, A) − vπ(S)) ].

The corresponding stochastic gradient-ascent algorithm is

    θt+1 = θt + αθ (π(at|st, θt)/β(at|st)) ∇θ ln π(at|st, θt) (qt(st, at) − vt(st)),

where αθ > 0. Similar to the on-policy case, the advantage function qt(s, a) − vt(s) can
be replaced by the TD error. That is,

    qt(st, at) − vt(st) ≈ rt+1 + γ vt(st+1) − vt(st) ≜ δt(st, at).

Then, the algorithm becomes

    θt+1 = θt + αθ (π(at|st, θt)/β(at|st)) ∇θ ln π(at|st, θt) δt(st, at).

The implementation of the off-policy actor-critic algorithm is summarized in Algorithm 10.3.
As can be seen, the algorithm is the same as the advantage actor-critic algorithm
except that an additional importance weight is included in both the critic and the actor.
It must be noted that, in addition to the actor, the critic is also converted from on-policy
to off-policy by the importance sampling technique. In fact, importance sampling is a
general technique that can be applied to both policy-based and value-based algorithms.
Finally, Algorithm 10.3 can be extended in various ways to incorporate more techniques
such as eligibility traces [73].

Algorithm 10.3: Off-policy actor-critic based on importance sampling

Initialization: A given behavior policy β(a|s). A target policy π(a|s, θ0) where θ0 is the
initial parameter. A value function v(s, w0) where w0 is the initial parameter. αw, αθ > 0.
Goal: Learn an optimal policy to maximize J(θ).
At time step t in each episode, do
    Generate at following β(st) and then observe rt+1, st+1.
    Advantage (TD error):
        δt = rt+1 + γ v(st+1, wt) − v(st, wt)
    Actor (policy update):
        θt+1 = θt + αθ [π(at|st, θt)/β(at|st)] δt ∇θ ln π(at|st, θt)
    Critic (value update):
        wt+1 = wt + αw [π(at|st, θt)/β(at|st)] δt ∇w v(st, wt)
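The following Python sketch illustrates the structure of Algorithm 10.3 on the same hypothetical chain MDP and tabular parameterization used for the earlier sketches (illustrative assumptions, not the book's examples). The only differences from the A2C sketch are that actions are generated by a fixed uniform behavior policy β and that both the actor and critic updates are scaled by the importance weight π/β.

# A minimal sketch of off-policy actor-critic based on importance sampling.
import numpy as np

n_states, n_actions = 3, 2
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.1
rng = np.random.default_rng(0)
beta = np.full(n_actions, 1.0 / n_actions)      # behavior policy: uniform

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

def pi(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta, w = np.zeros((n_states, n_actions)), np.zeros(n_states)
for episode in range(500):
    s, done = 0, False
    while not done:
        a = rng.choice(n_actions, p=beta)                 # act with beta, not pi
        s2, r, done = step(s, a)
        rho = pi(theta, s)[a] / beta[a]                   # importance weight
        delta = r + gamma * (0.0 if done else w[s2]) - w[s]
        grad_log_pi = -pi(theta, s); grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * rho * delta * grad_log_pi   # actor (target pi)
        w[s] += alpha_w * rho * delta                         # critic
        s = s2

print(pi(theta, 0))    # the target policy should learn to prefer the 'right' action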

10.4 Deterministic actor-critic


Up to now, the policies used in the policy gradient methods are all stochastic since it is
required that π(a|s, θ) > 0 for every (s, a). This section shows that deterministic policies
can also be used in policy gradient methods. Here, “deterministic” indicates that, for
any state, a single action is given a probability of one and all the other actions are given
probabilities of zero. It is important to study the deterministic case since it is naturally
off-policy and can effectively handle continuous action spaces.
We have been using π(a|s, θ) to denote a general policy, which can be either stochastic
or deterministic. In this section, we use

a = µ(s, θ)

to specifically denote a deterministic policy. Different from π which gives the probability
of an action, µ directly gives the action since it is a mapping from S to A. This deter-
ministic policy can be represented by, for example, a neural network with s as its input,
a as its output, and θ as its parameter. For the sake of simplicity, we often write µ(s, θ)
as µ(s) for short.

10.4.1 The deterministic policy gradient theorem


The policy gradient theorem introduced in the last chapter is only valid for stochastic
policies. When we require the policy to be deterministic, a new policy gradient theorem
must be derived.

Theorem 10.2 (Deterministic policy gradient theorem). The gradient of J(θ) is


    ∇θ J(θ) = Σ_{s∈S} η(s) ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}
            = E_{S∼η}[ ∇θ µ(S) (∇a qµ(S, a))|_{a=µ(S)} ],        (10.14)

where η is a distribution of the states.

Theorem 10.2 is a summary of the results presented in Theorem 10.3 and Theorem 10.4
since the gradients in the two theorems have similar expressions. The specific expressions
of J(θ) and η can be found in Theorems 10.3 and 10.4.
Unlike the stochastic case, the gradient in the deterministic case shown in (10.14)
does not involve the action random variable A. As a result, when we use samples to
approximate the true gradient, it is not required to sample actions. Therefore, the de-
terministic policy gradient method is off-policy. In addition, some readers may wonder

why ∇a qµ (S, a) |a=µ(S) cannot be written as ∇a qµ (S, µ(S)), which seems more concise.
That is simply because, if we do that, it is unclear how qµ (S, µ(S)) is a function of a. A
concise yet less confusing expression may be ∇a qµ (S, a = µ(S)).
In the rest of this subsection, we present the derivation details of Theorem 10.2. In
particular, we derive the gradients of two common metrics: the first is the average value
and the second is the average reward. Since these two metrics have been discussed in
detail in Section 9.2, we sometimes use their properties without proof. For most readers,
it is sufficient to be familiar with Theorem 10.2 without knowing its derivation details.
Interested readers can selectively examine the details in the remainder of this section.

Metric 1: Average value

We first derive the gradient of the average value:


X
J(θ) = E[vµ (s)] = d0 (s)vµ (s), (10.15)
s∈S

where d0 is the probability distribution of the states. Here, d0 is selected to be independent
of µ for simplicity. There are two special yet important cases of selecting d0. The first
case is that d0(s0) = 1 and d0(s ≠ s0) = 0, where s0 is a specific state of interest. In
this case, the policy aims to maximize the discounted return that can be obtained when
starting from s0 . The second case is that d0 is the distribution of a given behavior policy
that is different from the target policy.
To calculate the gradient of J(θ), we need to first calculate the gradient of vµ (s) for
any s ∈ S. Consider the discounted case where γ ∈ (0, 1).

Lemma 10.1 (Gradient of vµ (s)). In the discounted case, it holds for any s ∈ S that
    ∇θ vµ(s) = Σ_{s′∈S} Prµ(s′|s) ∇θ µ(s′) (∇a qµ(s′, a))|_{a=µ(s′)},        (10.16)

where

    Prµ(s′|s) ≜ Σ_{k=0}^{∞} γ^k [Pµ^k]_{ss′} = [(I − γPµ)^{-1}]_{ss′}

is the discounted total probability of transitioning from s to s′ under policy µ. Here, [·]_{ss′}
denotes the entry in the sth row and s′th column of a matrix.

Box 10.3: Proof of Lemma 10.1


Since the policy is deterministic, we have

    vµ(s) = qµ(s, µ(s)).

Since both qµ and µ are functions of θ, we have

    ∇θ vµ(s) = ∇θ qµ(s, µ(s)) = (∇θ qµ(s, a))|_{a=µ(s)} + ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}.        (10.17)

By the definition of action values, for any given (s, a), we have

    qµ(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) vµ(s′),

where r(s, a) = Σ_r r p(r|s, a). Since r(s, a) is independent of µ, we have

    ∇θ qµ(s, a) = 0 + γ Σ_{s′∈S} p(s′|s, a) ∇θ vµ(s′).

Substituting the above equation into (10.17) yields

    ∇θ vµ(s) = γ Σ_{s′∈S} p(s′|s, µ(s)) ∇θ vµ(s′) + u(s),    s ∈ S,

where u(s) ≜ ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}. Since the above equation is valid for all
s ∈ S, we can stack these equations over all states to obtain a matrix-vector form,
which can be written concisely as

    ∇θ vµ = u + γ (Pµ ⊗ Im) ∇θ vµ,

where n = |S|, m is the dimensionality of θ, ∇θ vµ, u ∈ R^{mn}, Pµ is the state transition
matrix with [Pµ]_{ss′} = p(s′|s, µ(s)), and ⊗ is the Kronecker product. This is a linear
equation of ∇θ vµ, which can be solved as

    ∇θ vµ = (I_{mn} − γ Pµ ⊗ Im)^{-1} u
          = (In ⊗ Im − γ Pµ ⊗ Im)^{-1} u
          = [(In − γPµ)^{-1} ⊗ Im] u.        (10.18)

The elementwise form of (10.18) is

    ∇θ vµ(s) = Σ_{s′∈S} [(I − γPµ)^{-1}]_{ss′} u(s′)
             = Σ_{s′∈S} [(I − γPµ)^{-1}]_{ss′} ∇θ µ(s′) (∇a qµ(s′, a))|_{a=µ(s′)}.        (10.19)

The quantity [(I − γPµ)^{-1}]_{ss′} has a clear probabilistic interpretation. Since (I −
γPµ)^{-1} = I + γPµ + γ²Pµ² + ⋯, we have

    [(I − γPµ)^{-1}]_{ss′} = [I]_{ss′} + γ[Pµ]_{ss′} + γ²[Pµ²]_{ss′} + ⋯ = Σ_{k=0}^{∞} γ^k [Pµ^k]_{ss′}.

Note that [Pµ^k]_{ss′} is the probability of transitioning from s to s′ using exactly k steps
(see Box 8.1 for more information). Therefore, [(I − γPµ)^{-1}]_{ss′} is the discounted
total probability of transitioning from s to s′ using any number of steps. By denoting
[(I − γPµ)^{-1}]_{ss′} ≜ Prµ(s′|s), equation (10.19) leads to (10.16).

With the preparation in Lemma 10.1, we are ready to derive the gradient of J(θ).

Theorem 10.3 (Deterministic policy gradient theorem in the discounted case). In the

discounted case where γ ∈ (0, 1), the gradient of J(θ) in (10.15) is

    ∇θ J(θ) = Σ_{s∈S} ρµ(s) ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}
            = E_{S∼ρµ}[ ∇θ µ(S) (∇a qµ(S, a))|_{a=µ(S)} ],

where the state distribution ρµ is

    ρµ(s) = Σ_{s′∈S} d0(s′) Prµ(s|s′),    s ∈ S.

Here, Prµ(s|s′) = Σ_{k=0}^{∞} γ^k [Pµ^k]_{s′s} = [(I − γPµ)^{-1}]_{s′s} is the discounted total probability of
transitioning from s′ to s under policy µ.

Box 10.4: Proof of Theorem 10.3


Since d0 is independent of µ, we have

    ∇θ J(θ) = Σ_{s∈S} d0(s) ∇θ vµ(s).

Substituting the expression of ∇θ vµ(s) given by Lemma 10.1 into the above equation
yields

    ∇θ J(θ) = Σ_{s∈S} d0(s) ∇θ vµ(s)
            = Σ_{s∈S} d0(s) Σ_{s′∈S} Prµ(s′|s) ∇θ µ(s′) (∇a qµ(s′, a))|_{a=µ(s′)}
            = Σ_{s′∈S} ( Σ_{s∈S} d0(s) Prµ(s′|s) ) ∇θ µ(s′) (∇a qµ(s′, a))|_{a=µ(s′)}
            = Σ_{s′∈S} ρµ(s′) ∇θ µ(s′) (∇a qµ(s′, a))|_{a=µ(s′)}
            = Σ_{s∈S} ρµ(s) ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}        (change s′ to s)
            = E_{S∼ρµ}[ ∇θ µ(S) (∇a qµ(S, a))|_{a=µ(S)} ].

The proof is complete. The above proof is consistent with the proof of Theorem 1
in [74]. Here, we consider the case in which the states and actions are finite. When
they are continuous, the proof is similar, but the summations should be replaced by
integrals [74].

Metric 2: Average reward

We next derive the gradient of the average reward:


X
J(θ) = r̄µ = dµ (s)rµ (s)
s∈S

= ES∼dµ [rµ (S)], (10.20)

where
X
rµ (s) = E[R|s, a = µ(s)] = rp(r|s, a = µ(s))
r

is the expectation of the immediate rewards. More information about this metric can be
found in Section 9.2.
The gradient of J(θ) is given in the following theorem.

Theorem 10.4 (Deterministic policy gradient theorem in the undiscounted case). In the
undiscounted case, the gradient of J(θ) in (10.20) is
    ∇θ J(θ) = Σ_{s∈S} dµ(s) ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}
            = E_{S∼dµ}[ ∇θ µ(S) (∇a qµ(S, a))|_{a=µ(S)} ],

where dµ is the stationary distribution of the states under policy µ.

Box 10.5: Proof of Theorem 10.4


Since the policy is deterministic, we have

    vµ(s) = qµ(s, µ(s)).

Since both qµ and µ are functions of θ, we have

    ∇θ vµ(s) = ∇θ qµ(s, µ(s)) = (∇θ qµ(s, a))|_{a=µ(s)} + ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}.        (10.21)

In the undiscounted case, it follows from the definition of action value (Section 9.3.2)
that

    qµ(s, a) = E[Rt+1 − r̄µ + vµ(St+1) | s, a]
             = Σ_r p(r|s, a)(r − r̄µ) + Σ_{s′} p(s′|s, a) vµ(s′)
             = r(s, a) − r̄µ + Σ_{s′} p(s′|s, a) vµ(s′).

Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

    ∇θ qµ(s, a) = 0 − ∇θ r̄µ + Σ_{s′} p(s′|s, a) ∇θ vµ(s′).

Substituting the above equation into (10.21) gives

    ∇θ vµ(s) = −∇θ r̄µ + Σ_{s′} p(s′|s, µ(s)) ∇θ vµ(s′) + u(s),    s ∈ S,

where u(s) ≜ ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}. Since the above equation is valid for all
s ∈ S, we can stack these equations over all states to obtain a matrix-vector form,
which can be written concisely as

    ∇θ vµ = u − 1n ⊗ ∇θ r̄µ + (Pµ ⊗ Im) ∇θ vµ,

where n = |S|, m is the dimension of θ, Pµ is the state transition matrix with
[Pµ]_{ss′} = p(s′|s, µ(s)), and ⊗ is the Kronecker product. Hence,

    1n ⊗ ∇θ r̄µ = u + (Pµ ⊗ Im) ∇θ vµ − ∇θ vµ.        (10.22)

Since dµ is the stationary distribution, we have dµ^T Pµ = dµ^T. Multiplying dµ^T ⊗ Im on
both sides of (10.22) gives

    (dµ^T 1n) ⊗ ∇θ r̄µ = (dµ^T ⊗ Im) u + [(dµ^T Pµ) ⊗ Im] ∇θ vµ − (dµ^T ⊗ Im) ∇θ vµ
                       = (dµ^T ⊗ Im) u + (dµ^T ⊗ Im) ∇θ vµ − (dµ^T ⊗ Im) ∇θ vµ
                       = (dµ^T ⊗ Im) u.

Since dµ^T 1n = 1, the above equation becomes

    ∇θ r̄µ = (dµ^T ⊗ Im) u
          = Σ_{s∈S} dµ(s) u(s)
          = Σ_{s∈S} dµ(s) ∇θ µ(s) (∇a qµ(s, a))|_{a=µ(s)}
          = E_{S∼dµ}[ ∇θ µ(S) (∇a qµ(S, a))|_{a=µ(S)} ].

The proof is complete.

10.4.2 Algorithm description


Based on the gradient given in Theorem 10.2, we can apply the gradient-ascent algorithm
to maximize J(θ):
  
    θt+1 = θt + αθ E_{S∼dµ}[ ∇θ µ(S) (∇a qµ(S, a))|_{a=µ(S)} ].

The corresponding stochastic gradient-ascent algorithm is

    θt+1 = θt + αθ ∇θ µ(st) (∇a qµ(st, a))|_{a=µ(st)}.

The implementation is summarized in Algorithm 10.4. It should be noted that this
algorithm is off-policy since the behavior policy β may be different from µ. First, the
actor is off-policy. We already explained the reason when presenting Theorem 10.2.
Second, the critic is also off-policy. Special attention must be paid to why the critic is
off-policy but does not require the importance sampling technique. In particular, the
experience sample required by the critic is (st , at , rt+1 , st+1 , ãt+1 ), where ãt+1 = µ(st+1 ).
The generation of this experience sample involves two policies. The first is the policy
for generating at at st , and the second is the policy for generating ãt+1 at st+1 . The
first policy that generates at is the behavior policy since at is used to interact with the
environment. The second policy must be µ because it is the policy that the critic aims
to evaluate. Hence, µ is the target policy. It should be noted that ãt+1 is not used to
interact with the environment in the next time step. Hence, µ is not the behavior policy.
Therefore, the critic is off-policy.
How to select the function q(s, a, w)? The original research work [74] that proposed
the deterministic policy gradient method adopted linear functions: q(s, a, w) = φT (s, a)w
where φ(s, a) is the feature vector. It is currently popular to represent q(s, a, w) using
neural networks, as suggested in the deep deterministic policy gradient (DDPG) method
[75].

Algorithm 10.4: Deterministic policy gradient or deterministic actor-critic

Initialization: A given behavior policy β(a|s). A deterministic target policy µ(s, θ0)
where θ0 is the initial parameter. A value function q(s, a, w0) where w0 is the initial
parameter. αw, αθ > 0.
Goal: Learn an optimal policy to maximize J(θ).
At time step t in each episode, do
    Generate at following β and then observe rt+1, st+1.
    TD error:
        δt = rt+1 + γ q(st+1, µ(st+1, θt), wt) − q(st, at, wt)
    Actor (policy update):
        θt+1 = θt + αθ ∇θ µ(st, θt) (∇a q(st, a, wt))|_{a=µ(st)}
    Critic (value update):
        wt+1 = wt + αw δt ∇w q(st, at, wt)

How to select the behavior policy β? It can be any exploratory policy. It can also be a stochastic policy obtained by adding noise to µ [75]. In this case, µ is also the behavior policy, and hence this is an on-policy implementation.
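To make the updates in Algorithm 10.4 concrete, the following is a minimal sketch of one possible implementation with linear function approximation in the spirit of [74]. The one-dimensional toy environment, the feature vector φ(s, a), the linear form of µ(s, θ), and all coefficient values are illustrative assumptions introduced here; they are not prescribed by the algorithm itself.

```python
# A minimal sketch of Algorithm 10.4 (deterministic actor-critic) on a toy
# one-dimensional problem. The environment, features, and parameters are
# illustrative assumptions, not part of the book's algorithm.
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha_w, alpha_theta, noise_std = 0.9, 0.005, 0.0005, 0.3

def step(s, a):
    """Toy dynamics: staying near the origin with small actions is rewarded."""
    s_next = np.clip(0.9 * s + a + 0.01 * rng.standard_normal(), -3.0, 3.0)
    r = -(s_next ** 2) - 0.1 * a ** 2
    return r, s_next

def phi(s, a):
    """Features of the critic q(s, a, w) = phi(s, a)^T w."""
    return np.array([s * s, s * a, a * a, s, a, 1.0])

def grad_a_phi(s, a):
    """Gradient of phi(s, a) with respect to the action a."""
    return np.array([0.0, s, 2.0 * a, 0.0, 1.0, 0.0])

theta = np.zeros(2)     # deterministic target policy mu(s) = theta[0]*s + theta[1]
w = np.zeros(6)         # critic parameter

def mu(s):
    return theta[0] * s + theta[1]

s = 1.0
for t in range(50000):
    # Behavior policy beta: the target policy plus exploration noise (off-policy).
    a = mu(s) + noise_std * rng.standard_normal()
    r, s_next = step(s, a)
    a_next = mu(s_next)                              # action given by mu, not by beta
    # TD error: delta_t = r + gamma q(s', mu(s'), w) - q(s, a, w).
    delta = r + gamma * phi(s_next, a_next) @ w - phi(s, a) @ w
    # Critic (value update): w <- w + alpha_w * delta * grad_w q(s, a, w).
    w += alpha_w * delta * phi(s, a)
    # Actor (policy update): theta <- theta + alpha_theta * grad_theta mu(s) * grad_a q(s, a, w)|_{a=mu(s)}.
    grad_theta_mu = np.array([s, 1.0])
    grad_a_q = grad_a_phi(s, mu(s)) @ w
    theta += alpha_theta * grad_theta_mu * grad_a_q
    s = s_next

print("learned policy parameters:", theta)
```

Replacing the linear functions with neural networks, together with a replay buffer and target networks, leads to DDPG-style implementations [75].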

10.5 Summary
In this chapter, we introduced actor-critic methods. The contents are summarized as
follows.

 Section 10.1 introduced the simplest actor-critic algorithm called QAC. This algo-
rithm is similar to the policy gradient algorithm, REINFORCE, introduced in the
last chapter. The only difference is that the q-value estimation in QAC relies on TD
learning while REINFORCE relies on Monte Carlo estimation.
 Section 10.2 extended QAC to advantage actor-critic. It was shown that the policy
gradient is invariant to any additional baseline. It was then shown that an optimal
baseline could help reduce the estimation variance.
 Section 10.3 further extended the advantage actor-critic algorithm to the off-policy
case. To do that, we introduced an important technique called importance sampling.
 Finally, while all the previously presented policy gradient algorithms rely on stochastic
policies, we showed in Section 10.4 that the policy can be forced to be determinis-
tic. The corresponding gradient was derived, and the deterministic policy gradient
algorithm was introduced.

Policy gradient and actor-critic methods are widely used in modern reinforcement
learning. There exist a large number of advanced algorithms in the literature such as
SAC [76, 77], TRPO [78], PPO [79], and TD3 [80]. In addition, the single-agent case can


also be extended to the case of multi-agent reinforcement learning [81–85]. Experience


samples can also be used to fit system models to achieve model-based reinforcement learn-
ing [15, 86, 87]. Distributional reinforcement learning provides a fundamentally different
perspective from the conventional one [88, 89]. The relationships between reinforcement
learning and control theory have been discussed in [90–95]. This book is not able to cov-
er all these topics. Hopefully, the foundations laid by this book can help readers better
study them in the future.

10.6 Q&A
 Q: What is the relationship between actor-critic and policy gradient methods?
A: Actor-critic methods are actually policy gradient methods. Sometimes, we use
them interchangeably. It is required to estimate action values in any policy gradient
algorithm. When the action values are estimated using temporal-difference learning
with value function approximation, such a policy gradient algorithm is called actor-
critic. The name “actor-critic” highlights its algorithmic structure that combines the
components of policy update and value update. This structure is also the fundamental
structure used in all reinforcement learning algorithms.
 Q: Why is it important to introduce additional baselines to actor-critic methods?
A: Since the policy gradient is invariant to any additional baseline, we can utilize the
baseline to reduce estimation variance. The resulting algorithm is called advantage
actor-critic.
 Q: Can importance sampling be used in value-based algorithms other than policy-
based ones?
A: The answer is yes. That is because importance sampling is a general technique
for estimating the expectation of a random variable over one distribution using some
samples drawn from another distribution. This technique is useful in reinforcement learning because many problems in reinforcement learning amount to estimating expectations. For example, in value-based methods, the action or state
values are defined as expectations. In the policy gradient method, the true gradient is
also an expectation. As a result, importance sampling can be applied in both value-
based and policy-based algorithms. In fact, it has been applied in the value-based
component of Algorithm 10.3.
 Q: Why is the deterministic policy gradient method off-policy?
A: The true gradient in the deterministic case does not involve the action random
variable. As a result, when we use samples to approximate the true gradient, it is
not required to sample actions and hence any policy can be used. Therefore, the
deterministic policy gradient method is off-policy.

Appendix A

Preliminaries for Probability Theory

Reinforcement learning heavily relies on probability theory. We next summarize some


concepts and results frequently used in this book.

 Random variable: The term “variable” indicates that a random variable can take
values from a set of numbers. The term “random” indicates that taking a value must
follow a probability distribution.
A random variable is usually denoted by a capital letter. Its value is usually denoted
by a lowercase letter. For example, X is a random variable, and x is a value that X
can take.
This book mainly considers the case where a random variable can only take a finite
number of values. A random variable can be a scalar or a vector.
Like normal variables, random variables have normal mathematical operations such
as summation, product, and absolute value. For example, if X, Y are two random
variables, we can calculate X + Y , X + 1, and XY .
 A stochastic sequence is a sequence of random variables.
One scenario we often encounter is collecting a stochastic sampling sequence {xi }ni=1
of a random variable X. For example, consider the task of tossing a die n times.
Let xi be a random variable representing the value obtained for the ith toss. Then,
{x1 , x2 , . . . , xn } is a stochastic process.
It may be confusing to beginners why xi is a random variable instead of a deterministic
value. In fact, if the sampling sequence is {1,6,3,5,...}, then this sequence is not a
stochastic sequence because all the elements are already determined. However, if we
use a variable xi to represent the values that can possibly be sampled, it is a random
variable since xi can take any value in {1, . . . , 6}. Although xi is a lowercase letter, it
still represents a random variable.
 Probability: The notation p(X = x) or pX (x) describes the probability of the random
variable X taking the value x. When the context is clear, p(X = x) is often written
as p(x) for short.


 Joint probability: The notation p(X = x, Y = y) or p(x, y) describes the probability


of the random variable X taking the value x and Y taking the value y. One useful
identity is as follows:
    ∑y p(x, y) = p(x).

 Conditional probability: The notation p(X = x|A = a) describes the probability of the
random variable X taking the value x given that the random variable A has already
taken the value a. We often write p(X = x|A = a) as p(x|a) for short.
It holds that

    p(x, a) = p(x|a)p(a)

and

    p(x|a) = p(x, a)/p(a).

Since p(x) = ∑a p(x, a), we have

    p(x) = ∑a p(x, a) = ∑a p(x|a)p(a),

which is called the law of total probability.


 Independence: Two random variables are independent if the sampling value of one
random variable does not affect the other. Mathematically, X and Y are independent
if
p(x, y) = p(x)p(y).

Another equivalent definition is

p(x|y) = p(x).

The above two definitions are equivalent because p(x, y) = p(x|y)p(y), which implies
p(x|y) = p(x) when p(x, y) = p(x)p(y).
 Conditional independence: Let X, A, B be three random variables. X is said to be
conditionally independent of A given B if

p(X = x|A = a, B = b) = p(X = x|B = b).

In the context of reinforcement learning, consider three consecutive states: st , st+1 , st+2 .
Since they are obtained consecutively, st+2 is dependent on st+1 and also st . However,
if st+1 is already given, then st+2 is conditionally independent of st . That is

p(st+2 |st+1 , st ) = p(st+2 |st+1 ).

This is also the memoryless property of Markov processes.


 Law of total probability: The law of total probability was already mentioned when we
introduced the concept of conditional probability. Due to its importance, we list it
again below:
    p(x) = ∑y p(x, y)

and

    p(x|a) = ∑y p(x, y|a).

 Chain rule of conditional probability and joint probability. By the definition of con-
ditional probability, we have

p(a, b) = p(a|b)p(b).

This can be extended to

p(a, b, c) = p(a|b, c)p(b, c) = p(a|b, c)p(b|c)p(c),

and hence, p(a, b, c)/p(c) = p(a, b|c) = p(a|b, c)p(b|c). The fact that p(a, b|c) =
p(a|b, c)p(b|c) implies the following property:
    p(x|a) = ∑b p(x, b|a) = ∑b p(x|b, a)p(b|a).

 Expectation/expected value/mean: Suppose that X is a random variable and the prob-


ability of taking the value x is p(x). The expectation, expected value, or mean of X
is defined as
    E[X] = ∑x p(x)x.

The linearity property of expectation is

E[X + Y ] = E[X] + E[Y ],


E[aX] = aE[X].

The second equation above can be trivially proven by definition. The first equation


is proven below:

    E[X + Y ] = ∑x ∑y (x + y)p(X = x, Y = y)
              = ∑x x ∑y p(x, y) + ∑y y ∑x p(x, y)
              = ∑x xp(x) + ∑y yp(y)
              = E[X] + E[Y ].

Due to the linearity of expectation, we have the following useful fact:

    E[∑i ai Xi ] = ∑i ai E[Xi ].

Similarly, it can be proven that

E[AX] = AE[X],

where A ∈ Rn×n is a deterministic matrix and X ∈ Rn is a random vector.


 Conditional expectation: The definition of conditional expectation is
    E[X|A = a] = ∑x xp(x|a).

Similar to the law of total probability, we have the law of total expectation:

    E[X] = ∑a E[X|A = a]p(a).

The proof is as follows. By the definition of expectation, it holds that

    ∑a E[X|A = a]p(a) = ∑a [∑x p(x|a)x] p(a)
                      = ∑x ∑a p(x|a)p(a)x
                      = ∑x [∑a p(x|a)p(a)] x
                      = ∑x p(x)x
                      = E[X].

The law of total expectation is frequently used in reinforcement learning; a short numerical check is given at the end of this appendix.


Similarly, conditional expectation satisfies


    E[X|A = a] = ∑b E[X|A = a, B = b]p(b|a).

This equation is useful in the derivation of the Bellman equation. A hint of its proof
is the chain rule: p(x|a, b)p(b|a) = p(x, b|a).
Finally, it is worth noting that E[X|A = a] is different from E[X|A]. The former is
a value, whereas the latter is a random variable. In fact, E[X|A] is a function of the
random variable A. We need rigorous probability theory to define E[X|A].
 Gradient of expectation: Let f (X, β) be a scalar function of a random variable X and
a deterministic parameter vector β. Then,

∇β E[f (X, β)] = E[∇β f (X, β)].


Proof: Since E[f (X, β)] = ∑x f (x, β)p(x), we have ∇β E[f (X, β)] = ∇β ∑x f (x, β)p(x) = ∑x ∇β f (x, β)p(x) = E[∇β f (X, β)].

 Variance, covariance, covariance matrix : For a single random variable X, its variance
is defined as var(X) = E[(X − X̄)2 ], where X̄ = E[X]. For two random variables X, Y ,
their covariance is defined as cov(X, Y ) = E[(X − X̄)(Y − Ȳ )]. For a random vector
.
X = [X1 , . . . , Xn ]T , the covariance matrix of X is defined as var(X) = Σ = E[(X −
X̄)(X − X̄)T ] ∈ Rn×n . The ijth entry of Σ is [Σ]ij = E[[X − X̄]i [X − X̄]j ] = E[(Xi −
X̄i )(Xj − X̄j )] = cov(Xi , Xj ). One trivial property is var(a) = 0 if a is deterministic.
Moreover, it can be verified that var(AX + a) = var(AX) = Avar(X)AT = AΣAT .
Some useful facts are summarized below.

- Fact: E[(X − X̄)(Y − Ȳ )] = E(XY ) − X̄ Ȳ = E(XY ) − E(X)E(Y ).


Proof: E[(X − X̄)(Y − Ȳ )] = E[XY − X Ȳ − X̄Y + X̄ Ȳ ] = E[XY ] − E[X]Ȳ −
X̄E[Y ] + X̄ Ȳ = E[XY ] − E[X]E[Y ] − E[X]E[Y ] + E[X]E[Y ] = E[XY ] − E[X]E[Y ].
- Fact: E[XY ] = E[X]E[Y ] if X, Y are independent.
Proof: E[XY ] = ∑x ∑y p(x, y)xy = ∑x ∑y p(x)p(y)xy = (∑x p(x)x)(∑y p(y)y) = E[X]E[Y ].
- Fact: cov(X, Y ) = 0 if X, Y are independent.
Proof: When X, Y are independent, cov(X, Y ) = E[XY ]−E[X]E[Y ] = E[X]E[Y ]−
E[X]E[Y ] = 0.
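As a quick numerical sanity check of the law of total expectation and of the independence facts above, the following short script (a sketch added here; the particular distributions are arbitrary choices) verifies the identities on small discrete distributions.

```python
# Numerical check of E[X] = sum_a E[X|A=a] p(a) and E[XY] = E[X]E[Y] for
# independent X, Y, using small discrete distributions. Illustrative only.
import numpy as np

# A joint distribution p(x, a) over x in {0, 1, 2} and a in {0, 1}.
p_xa = np.array([[0.10, 0.20],
                 [0.15, 0.25],
                 [0.05, 0.25]])
xs = np.array([0.0, 1.0, 2.0])

p_x = p_xa.sum(axis=1)                  # marginal p(x)
p_a = p_xa.sum(axis=0)                  # marginal p(a)
E_X = np.sum(xs * p_x)

# Law of total expectation: E[X] = sum_a E[X|A=a] p(a).
E_X_given_a = (xs[:, None] * p_xa).sum(axis=0) / p_a   # E[X|A=a] for each a
print(E_X, np.sum(E_X_given_a * p_a))   # the two numbers coincide

# Independence: build p(x, y) = p(x) p(y) and check E[XY] = E[X]E[Y].
ys = np.array([-1.0, 1.0])
p_y = np.array([0.4, 0.6])
p_xy = np.outer(p_x, p_y)
E_XY = np.sum(np.outer(xs, ys) * p_xy)
print(E_XY, E_X * np.sum(ys * p_y))     # the two numbers coincide
```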

Appendix B

Measure-Theoretic Probability
Theory

We now briefly introduce measure-theoretic probability theory, which is also called rig-
orous probability theory. We only present basic notions and results. Comprehensive
introductions can be found in [96–98]. Moreover, measure-theoretic probability theory
requires some basic knowledge of measure theory, which is not covered here. Interested
readers may refer to [99].
The reader may wonder if it is necessary to understand measure-theoretic probability
theory before studying reinforcement learning. The answer is yes if the reader is interested
in rigorously analyzing the convergence of stochastic sequences. For example, we often
encounter the notion of almost sure convergence in Chapter 6 and Chapter 7. This notion
is taken from measure-theoretic probability theory. If the reader is not interested in the
convergence of stochastic sequences, it is okay to skip this part.

Probability triples

A probability triple is fundamental for establishing measure-theoretic probability theory.


It is also called a probability space or probability measure space. A probability triple
consists of three ingredients.
 Ω: This is a set called the sample space (or outcome space). Any element (or point)
in Ω, denoted as ω, is called an outcome. This set contains all the possible outcomes
of a random sampling process.
Example: When playing a game of dice, we have six possible outcomes {1, 2, 3, 4, 5, 6}.
Hence, Ω = {1, 2, 3, 4, 5, 6}.
 F: This is a set called the event space. In particular, it is a σ-algebra (or σ-field) of
Ω. The definition of a σ-algebra is given in Box B.1. An element in F, denoted as
A, is called an event. An elementary event refers to a single outcome in the sample
space. An event may be an elementary event or a combination of multiple elementary
events.


Example: Consider the game of dice. An example of an elementary event is “the num-
ber you get is i”, where i ∈ {1, . . . , 6}. An example of a nonelementary event is “the
number you get is greater than 3”. We care about such an event in practice because,
for example, we can win the game if this event occurs. This event is mathematically
expressed as A = {ω ∈ Ω : ω > 3}. Since Ω = {1, 2, 3, 4, 5, 6} in this case, we have
A = {4, 5, 6}.
 P: This is a probability measure, which is a mapping from F to [0, 1]. Any A ∈ F is
a set that contains some points in Ω. Then, P(A) is the measure of this set.
Example: If A = Ω, which contains all ω values, then P(A) = 1; if A = ∅, then
P(A) = 0. In the game of dice, consider the event “the number you get is greater
than 3”. In this case, A = {ω ∈ Ω : ω > 3}, and Ω = {1, 2, 3, 4, 5, 6}. Then, we have
A = {4, 5, 6} and hence P(A) = 1/2. That is, the probability of us rolling a number
greater than 3 is 1/2.

Box B.1: Definition of a σ-algebra

An algebra of Ω is a set of some subsets of Ω that satisfy certain conditions. A


σ-algebra is a specific and important type of algebra. In particular, denote F as a
σ-algebra. Then, it must satisfy the following conditions.
 F contains ∅ and Ω;
 F is closed under complements;
 F is closed under countable unions and intersections.
The σ-algebras of a given Ω are not unique. F may contain all the subsets of
Ω, and it may also merely contain some of them as long as it satisfies the above
three conditions (see the examples below). Moreover, the three conditions are not
independent. For example, if F contains Ω and is closed under complements, then it
naturally contains ∅. More information can be found in [96–98].

 Example: When playing the dice game, we have Ω = {1, 2, 3, 4, 5, 6}. Then,
F = {Ω, ∅, {1, 2, 3}, {4, 5, 6}} is a σ-algebra. The above three conditions can be
easily verified. There are also other σ-algebras such as {Ω, ∅, {1, 2, 3, 4, 5}, {6}}.
Moreover, for any Ω with finite elements, the collection of all the subsets of Ω is
a σ-algebra.

Random variables

Based on the notion of probability triples, we can formally define random variables. They
are called variables, but they are actually functions that map from Ω to R. In particular,


a random variable assigns each outcome in Ω a numerical value, and hence it is a function:
X(ω) : Ω → R.
Not all mappings from Ω to R are random variables. The formal definition of a random
variable is as follows. A function X : Ω → R is a random variable if

A = {ω ∈ Ω|X(ω) ≤ x} ∈ F

for all x ∈ R. This definition indicates that X is a random variable only if X(ω) ≤ x is
an event in F. More information can be found in [96, Section 3.1].

Expectation of random variables

The definition of the expectation of general random variables is sophisticated. Here, we


only consider the special yet important case of simple random variables. In particular,
a random variable is simple if X(ω) only takes a finite number of values. Let X be the
set of all the possible values that X can take. A simple random variable is a function:
X(ω) : Ω → X . It can be defined in a closed form as

    X(ω) = ∑x∈X x1Ax (ω),

where

    Ax = {ω ∈ Ω | X(ω) = x} = X −1 (x)

and

    1Ax (ω) = 1 if ω ∈ Ax , and 1Ax (ω) = 0 otherwise.        (B.1)

Here, 1Ax (ω) is an indicator function 1Ax (ω) : Ω → {0, 1}. If ω is mapped to x, the
indicator function equals one; otherwise, it equals zero. It is possible that multiple ω’s
in Ω map to the same value in X , but a single ω cannot be mapped to multiple values in
X.
With the above preparation, the expectation of a simple random variable is defined
as

    E[X] = ∑x∈X xP(Ax ),        (B.2)

where
Ax = {ω ∈ Ω|X(ω) = x}.

The definition in (B.2) is similar to but more formal than the definition of expectation
in the nonmeasure-theoretic case: E[X] = ∑x∈X xp(x).
As a demonstrative example, we next calculate the expectation of the indicator func-


tion in (B.1). It is notable that the indicator function is also a random variable that
maps Ω to {0, 1} [96, Proposition 3.1.5]. As a result, we can calculate its expectation. In
particular, consider the indicator function 1A where A denotes any event. We have

E[1A ] = P(A).

To prove that, we have

    E[1A ] = ∑z∈{0,1} zP(1A = z)
           = 0 · P(1A = 0) + 1 · P(1A = 1)
           = P(1A = 1)
           = P(A).

More properties of indicator functions can be found in [100, Chapter 24].

Conditional expectation as a random variable

While the expectation in (B.2) maps random variables to a specific value, we next intro-
duce a conditional expectation that maps random variables to another random variable.
Suppose that X, Y, Z are all random variables. Consider three cases. First, a condi-
tional expectation like E[X|Y = 2] or E[X|Y = 5] is a specific number. Second, E[X|Y = y],
where y is a variable, is a function of y. Third, E[X|Y ], where Y is a random variable,
is a function of Y and hence also a random variable. Since E[X|Y ] is also a random
variable, we can calculate, for example, its expectation.
We next examine the third case closely since it frequently emerges in the convergence
analyses of stochastic sequences. The rigorous definition is not covered here and can be
found in [96, Chapter 13]. We merely present some useful properties [101].

Lemma B.1 (Basic properties). Let X, Y, Z be random variables. The following proper-
ties hold.
(a) E[a|Y ] = a, where a is a given number.
(b) E[aX + bZ|Y ] = aE[X|Y ] + bE[Z|Y ].
(c) E[X|Y ] = E[X] if X, Y are independent.
(d) E[Xf (Y )|Y ] = f (Y )E[X|Y ].
(e) E[f (Y )|Y ] = f (Y ).
(f ) E[X|Y, f (Y )] = E[X|Y ].
(g) If X ≥ 0, then E[X|Y ] ≥ 0.
(h) If X ≥ Z, then E[X|Y ] ≥ E[Z|Y ].


Proof. We only prove some properties. The others can be proven similarly.
To prove E[a|Y ] = a as in (a), we can show that E[a|Y = y] = a is valid for any y
that Y can possibly take. This is clearly true, and the proof is complete.
To prove the property in (d), we can show that E[Xf (Y )|Y = y] = f (y)E[X|Y = y] for any y. This is valid because E[Xf (Y )|Y = y] = ∑x xf (y)p(x|y) = f (y) ∑x xp(x|y) = f (y)E[X|Y = y].

Since E[X|Y ] is a random variable, we can calculate its expectation. The related
properties are presented below. These properties are useful for analyzing the convergence
of stochastic sequences.

Lemma B.2. Let X, Y, Z be random variables. The following properties hold.

(a) E[E[X|Y ]] = E[X].
(b) E[E[X|Y, Z]] = E[X].
(c) E[E[X|Y ]|Y ] = E[X|Y ].
Proof. To prove the property in (a), we need to show that E[E[X|Y ]] = E[X]. To that end, considering that E[X|Y ] is a function of Y , we denote it as f (Y ) = E[X|Y ]. Then,

    E[E[X|Y ]] = E[f (Y )] = ∑y f (y)p(y)
               = ∑y E[X|Y = y]p(y)
               = ∑y (∑x xp(x|y)) p(y)
               = ∑x x ∑y p(x|y)p(y)
               = ∑x x ∑y p(x, y)
               = ∑x xp(x)
               = E[X].

The proof of the property in (b) is similar. In particular, we have

    E[E[X|Y, Z]] = ∑y,z E[X|y, z]p(y, z) = ∑y,z ∑x xp(x|y, z)p(y, z) = ∑x xp(x) = E[X].

The proof of the property in (c) follows immediately from property (e) in Lemma B.1.
That is because E[X|Y ] is a function of Y . We denote this function as f (Y ). It then
follows that E[E[X|Y ]|Y ] = E[f (Y )|Y ] = f (Y ) = E[X|Y ].
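As a concrete illustration (a worked example added here, not taken from the original text), let X be the value obtained by rolling a fair die and let Y = 1 if the value is even and Y = 0 otherwise. Then E[X|Y = 0] = (1 + 3 + 5)/3 = 3 and E[X|Y = 1] = (2 + 4 + 6)/3 = 4, so the conditional expectation E[X|Y ] is itself a random variable taking the values 3 and 4 with probability 1/2 each. Property (a) of Lemma B.2 can then be verified directly: E[E[X|Y ]] = 3 · (1/2) + 4 · (1/2) = 3.5 = E[X].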


Definitions of stochastic convergence

One main reason why we care about measure-theoretic probability theory is that it can
rigorously describe the convergence properties of stochastic sequences.
.
Consider the stochastic sequence {Xk } = {X1 , X2 , . . . , Xk , . . . }. Each element in this
sequence is a random variable defined on a triple (Ω, F, P). When we say {Xk } converges
to a random variable X, we should be careful since there are different types of convergence
as shown below.

 Sure convergence:
Definition: {Xk } converges surely (or everywhere or pointwise) to X if

    limk→∞ Xk (ω) = X(ω), for all ω ∈ Ω.

It means that limk→∞ Xk (ω) = X(ω) is valid for all points in Ω. This definition can be equivalently stated as

    A = Ω, where A = {ω ∈ Ω : limk→∞ Xk (ω) = X(ω)}.

 Almost sure convergence:


Definition: {Xk } converges almost surely (or almost everywhere, or with probability 1, or w.p.1 ) to X if

    P(A) = 1, where A = {ω ∈ Ω : limk→∞ Xk (ω) = X(ω)}.        (B.3)

It means that limk→∞ Xk (ω) = X(ω) is valid for almost all points in Ω. The points for which this limit is invalid form a set of zero measure. For the sake of simplicity, (B.3) is often written as

    P(limk→∞ Xk = X) = 1.

Almost sure convergence can be denoted as Xk → X a.s.
 Convergence in probability:
Definition: {Xk } converges in probability to X if, for any ε > 0,

    limk→∞ P(Ak ) = 0, where Ak = {ω ∈ Ω : |Xk (ω) − X(ω)| > ε}.        (B.4)

For simplicity, (B.4) can be written as

    limk→∞ P(|Xk − X| > ε) = 0.


The difference between convergence in probability and (almost) sure convergence is


as follows. Both sure convergence and almost sure convergence first evaluate the
convergence of every point in Ω and then check the measure of these points that
converge. By contrast, convergence in probability first checks the points that satisfy
|Xk − X| > ε and then evaluates if the measure will converge to zero as k → ∞.
 Convergence in mean:
Definition: {Xk } converges in the r-th mean (or in the Lr norm) to X if

    limk→∞ E[|Xk − X|r ] = 0.

The most frequently used cases are r = 1 and r = 2. It is worth mentioning that
convergence in mean is not equivalent to limk→∞ E[Xk − X] = 0 or limk→∞ E[Xk ] =
E[X], which indicates that E[Xk ] converges but the variance may not.
 Convergence in distribution:
Definition: The cumulative distribution function of Xk is defined as P(Xk ≤ a), where a ∈ R. Then, {Xk } converges to X in distribution if the cumulative distribution function converges:

    limk→∞ P(Xk ≤ a) = P(X ≤ a), for all a ∈ R.

A compact expression is

    limk→∞ P(Ak ) = P(A),

where

    Ak = {ω ∈ Ω : Xk (ω) ≤ a}, A = {ω ∈ Ω : X(ω) ≤ a}.

The relationships between the above types of convergence are given below:

almost sure convergence ⇒ convergence in probability ⇒ convergence in distribution


convergence in mean ⇒ convergence in probability ⇒ convergence in distribution

Almost sure convergence and convergence in mean do not imply each other. More infor-
mation can be found in [102].
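The converse implications do not hold in general. As a standard counterexample (added here for illustration; it is not part of the original text), let {Xk } be independent random variables with P(Xk = 1) = 1/k and P(Xk = 0) = 1 − 1/k. For any ε ∈ (0, 1), P(|Xk − 0| > ε) = 1/k → 0, so Xk converges to 0 in probability; moreover, E[|Xk − 0|] = 1/k → 0, so Xk also converges to 0 in mean (r = 1). However, since the events {Xk = 1} are independent and ∑_{k=1}^{∞} 1/k = ∞, the second Borel–Cantelli lemma (a classical result not covered in this book) implies that Xk = 1 occurs for infinitely many k with probability 1, and hence Xk does not converge to 0 almost surely.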

Appendix C

Convergence of Sequences

We next introduce some results about the convergence of deterministic and stochastic se-
quences. These results are useful for analyzing the convergence of reinforcement learning
algorithms such as those in Chapters 6 and 7.
We first consider deterministic sequences and then stochastic sequences.

C.1 Convergence of deterministic sequences


Convergence of monotonic sequences
.
Consider a sequence {xk } = {x1 , x2 , . . . , xk , . . . } where xk ∈ R. Suppose that this se-
quence is deterministic in the sense that xk is not a random variable.
One of the most well-known convergence results is that a nonincreasing sequence with
a lower bound converges. The following is a formal statement of this result.

Theorem C.1 (Convergence of monotonic sequences). If the sequence {xk } is nonin-


creasing and bounded from below:

 Nonincreasing: xk+1 ≤ xk for all k;


 Lower bound: xk ≥ α for all k;

then xk converges to a limit, which is the infimum of {xk }, as k → ∞.

Similarly, if {xk } is nondecreasing and bounded from above, then the sequence is
convergent.

Convergence of nonmonotonic sequences

We next analyze the convergence of nonmonotonic sequences.


Consider a nonnegative sequence {xk ≥ 0} satisfying

xk+1 ≤ xk + ηk .


In the simple case of ηk = 0, we have xk+1 ≤ xk , and the sequence is monotonic. We now
focus on a more general case where ηk ≥ 0. In this case, the sequence is not monotonic
because xk+1 may be greater than xk . Nevertheless, we can still ensure the convergence
of the sequence under some mild conditions.

To analyze the convergence of nonmonotonic sequences, we introduce the following


useful operator [103]. For any z ∈ R, define

    z+ = z if z ≥ 0, and z+ = 0 if z < 0;
    z− = z if z ≤ 0, and z− = 0 if z > 0.

It is obvious that z + ≥ 0 and z − ≤ 0 for any z. Moreover, it holds that

z = z+ + z−

for all z ∈ R.

To analyze the convergence of {xk }, we rewrite xk as

    xk = (xk − xk−1 ) + (xk−1 − xk−2 ) + · · · + (x2 − x1 ) + x1
       = ∑_{i=1}^{k−1} (xi+1 − xi ) + x1
       = Sk + x1 ,                                   (C.1)

where Sk = ∑_{i=1}^{k−1} (xi+1 − xi ). Note that Sk can be decomposed as

    Sk = ∑_{i=1}^{k−1} (xi+1 − xi ) = Sk+ + Sk− ,

where

    Sk+ = ∑_{i=1}^{k−1} (xi+1 − xi )+ ≥ 0,    Sk− = ∑_{i=1}^{k−1} (xi+1 − xi )− ≤ 0.

Some useful properties of Sk+ and Sk− are given below.

 {Sk+ ≥ 0} is a nondecreasing sequence since S+k+1 − S+k = (xk+1 − xk )+ ≥ 0 for all k.
 {Sk− ≤ 0} is a nonincreasing sequence since S−k+1 − S−k = (xk+1 − xk )− ≤ 0 for all k.
 If Sk+ is bounded from above, then Sk− is bounded from below. This is because
Sk− ≥ −Sk+ − x1 due to the fact that Sk− + Sk+ + x1 = xk ≥ 0.


With the above preparation, we can show the following result.

Theorem C.2 (Convergence of nonmonotonic sequences). For any nonnegative sequence


{xk ≥ 0}, if

    ∑_{k=1}^{∞} (xk+1 − xk )+ < ∞,                  (C.2)

then {xk } converges as k → ∞.

Proof. First, the condition ∑_{k=1}^{∞} (xk+1 − xk )+ < ∞ indicates that Sk+ = ∑_{i=1}^{k−1} (xi+1 − xi )+ is bounded from above for all k. Since {Sk+ } is nondecreasing, the convergence
of {Sk+ } immediately follows from Theorem C.1. Suppose that Sk+ converges to S∗+ .
Second, the boundedness of Sk+ implies that Sk− is bounded from below since
Sk− ≥ −Sk+ − x1 . Since {Sk− } is nonincreasing, the convergence of {Sk− } immediately
follows from Theorem C.1. Suppose that Sk− converges to S∗− .
Finally, since xk = Sk+ + Sk− + x1 , as shown in (C.1), the convergence of Sk+ and
Sk− implies that {xk } converges to S∗+ + S∗− + x1 .

Theorem C.2 is more general than Theorem C.1 because it allows xk to increase as
long as the increase is damped as in (C.2). In the monotonic case, Theorem C.2 still
applies. In particular, if xk+1 ≤ xk , then ∑_{k=1}^{∞} (xk+1 − xk )+ = 0. In this case, (C.2) is
still satisfied and the convergence follows.
If xk+1 ≤ xk + ηk , the next result provides a condition for ηk to ensure the convergence
of {xk }. This result is an immediate corollary of Theorem C.2.

Corollary C.1. For any nonnegative sequence {xk ≥ 0}, if

xk+1 ≤ xk + ηk

and {ηk ≥ 0} satisfies



    ∑_{k=1}^{∞} ηk < ∞,

then {xk ≥ 0} converges.

Proof. Since xk+1 ≤ xk + ηk , we have (xk+1 − xk )+ ≤ ηk for all k. Then, we have



    ∑_{k=1}^{∞} (xk+1 − xk )+ ≤ ∑_{k=1}^{∞} ηk < ∞.


As a result, (C.2) is satisfied and the convergence follows from Theorem C.2.
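The following small numerical sketch (added here for illustration; the particular update rule is an arbitrary choice) exhibits such a sequence: it is not monotonic, so Theorem C.1 does not apply directly, but every increase is bounded by the summable sequence ηk = 1/k2, so Corollary C.1 guarantees convergence.

```python
# Illustration of Corollary C.1 (an added sketch, not from the text): the
# sequence below decreases and increases alternately, but every increase is at
# most 1/k^2, which is summable, so the sequence converges.
x = 1.0
for k in range(1, 1_000_001):
    x += (-1.0) ** k / k ** 2       # x_{k+1} <= x_k + eta_k with eta_k = 1/k^2
    if k in (1, 10, 1000, 1_000_000):
        print(k, x)                  # the values settle near 1 - pi^2/12 ≈ 0.1775
```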

C.2 Convergence of stochastic sequences


We now consider stochastic sequences. While various definitions of stochastic sequences
have been given in Appendix B, how to determine the convergence of a given stochastic
sequence has not yet been discussed. We next present an important class of stochastic
sequences called martingales. If a sequence can be classified as a martingale (or one of
its variants), then the convergence of the sequence immediately follows.

Convergence of martingale sequences

 Definition: A stochastic sequence {Xk }∞k=1 is called a martingale if E[|Xk |] < ∞ and

    E[Xk+1 |X1 , . . . , Xk ] = Xk                  (C.3)

almost surely for all k.


Here, E[Xk+1 |X1 , . . . , Xk ] is a random variable rather than a deterministic value. The
term “almost surely” in the second condition is due to the definition of such expecta-
tions. In addition, E[Xk+1 |X1 , . . . , Xk ] is often written as E[Xk+1 |Hk ] for short where
Hk = {X1 , . . . , Xk } represents the “history” of the sequence. Hk has a specific name
called a filtration. More information can be found in [96, Chapter 14] and [104].
 Example: An example that can demonstrate martingales is random walk, which is a
stochastic process describing the position of a point that moves randomly. Specifically,
let Xk denote the position of the point at time step k. Starting from Xk , the expecta-
tion of the next position Xk+1 equals Xk if the mean of the one-step displacement is
zero. In this case, we have E[Xk+1 |X1 , . . . , Xk ] = Xk and hence {Xk } is a martingale.
A basic property of martingales is that

E[Xk+1 ] = E[Xk ]

for all k and hence

E[Xk ] = E[Xk−1 ] = · · · = E[X2 ] = E[X1 ].

This result can be obtained by calculating the expectation on both sides of (C.3)
based on property (b) in Lemma B.2.

While the expectation of a martingale is constant, we next extend martingales to


submartingales and supermartingales, whose expectations vary monotonically.


 Definition: A stochastic sequence {Xk } is called a submartingale if it satisfies E[|Xk |] <


∞ and

E[Xk+1 |X1 , . . . , Xk ] ≥ Xk (C.4)

for all k.
Taking the expectation on both sides of (C.4) yields E[Xk+1 ] ≥ E[Xk ]. In particular,
the left-hand side leads to E[E[Xk+1 |X1 , . . . , Xk ]] = E[Xk+1 ] due to property (b) in
Lemma B.2. By induction, we have

E[Xk ] ≥ E[Xk−1 ] ≥ · · · ≥ E[X2 ] ≥ E[X1 ].

Therefore, the expectation of a submartingale is nondecreasing.


It may be worth mentioning that, for two random variables X and Y , X ≤ Y means
X(ω) ≤ Y (ω) for all ω ∈ Ω. It does not mean the maximum of X is less than the
minimum of Y .
 Definition: A stochastic sequence {Xk } is called a supermartingale if it satisfies
E[|Xk |] < ∞ and

E[Xk+1 |X1 , . . . , Xk ] ≤ Xk (C.5)

for all k.
Taking expectation on both sides of (C.5) gives E[Xk+1 ] ≤ E[Xk ]. By induction, we
have
E[Xk ] ≤ E[Xk−1 ] ≤ · · · ≤ E[X2 ] ≤ E[X1 ].

Therefore, the expectation of a supermartingale is nonincreasing.

The names “submartingale” and “supermartingale” are standard, but it may not be easy
for beginners to distinguish them. Some tricks can be employed to do so. For example,
since “supermartingale” has a letter “p” that points down, its expectation decreases;
since submartingale has a letter “b” that points up, its expectation increases [104].
A supermartingale or submartingale is comparable to a deterministic monotonic se-
quence. While the convergence result for monotonic sequences has been given in Theo-
rem C.1, we provide a similar convergence result for martingales as follows.

Theorem C.3 (Martingale convergence theorem). If {Xk } is a submartingale (or super-


martingale), then there is a finite random variable X such that Xk → X almost surely.

The proof is omitted. A comprehensive introduction to martingales can be found in


[96, Chapter 14] and [104].


Convergence of quasimartingale sequences

We next introduce quasimartingales, which can be viewed as a generalization of martin-


gales since their expectations are not monotonic. They are comparable to nonmonotonic
deterministic sequences. The rigorous definition and convergence results of quasimartin-
gales are nontrivial. We merely list some useful results.

.
The event Ak is defined as Ak = {ω ∈ Ω : E[Xk+1 − Xk |Hk ] ≥ 0}, where Hk =
{X1 , . . . , Xk }. Intuitively, Ak indicates that Xk+1 is greater than Xk in expectation.
Let 1Ak be an indicator function:

    1Ak = 1 if E[Xk+1 − Xk |Hk ] ≥ 0, and 1Ak = 0 otherwise.

The indicator function has a property that

1 = 1A + 1Ac

for any event A where Ac denotes the complementary event of A. As a result, it holds
for any random variable that

X = 1A X + 1Ac X.

Although quasimartingales do not have monotonic expectations, their convergence is


still ensured under some mild conditions as shown below.

Theorem C.4 (Quasimartingale convergence theorem). For a nonnegative stochastic


sequence {Xk ≥ 0}, if

    ∑_{k=1}^{∞} E[(Xk+1 − Xk )1Ak ] < ∞,

then ∑_{k=1}^{∞} E[(Xk+1 − Xk )1Ack ] > −∞ and there is a finite random variable X such that Xk → X almost surely as k → ∞.

Theorem C.4 can be viewed as an analogy of Theorem C.2, which is for nonmono-
tonic deterministic sequences. The proof of this theorem can be found in [105, Proposi-
tion 9.5]. Note that Xk here is required to be nonnegative. As a result, the boundedness of ∑_{k=1}^{∞} E[(Xk+1 − Xk )1Ak ] implies the boundedness of ∑_{k=1}^{∞} E[(Xk+1 − Xk )1Ack ].

Summary and comparison

We finally summarize and compare the results for deterministic and stochastic sequences.

 Deterministic sequences:


- Monotonic sequences: As shown in Theorem C.1, if a sequence is monotonic and


bounded, then it converges.
- Nonmonotonic sequences: As shown in Theorem C.2, given a nonnegative se-
quence, even if it is nonmonotonic, it can still converge as long as its variation is
damped in the sense that ∑_{k=1}^{∞} (xk+1 − xk )+ < ∞.

 Stochastic sequences:

- Supermartingale/submartingale sequences: As shown in Theorem C.3, the expec-


tation of a supermartingale or submartingale is monotonic. If a sequence is a
supermartingale or submartingale, then the sequence converges almost surely.
- Quasimartingale sequences: As shown in Theorem C.4, even if a sequence’s expec-
tation is nonmonotonic, it can still converge as long as its variation is damped in
the sense that ∑_{k=1}^{∞} E[(Xk+1 − Xk )1E[Xk+1 −Xk |Hk ]>0 ] < ∞.

The above properties are summarized in Table C.1.

Variants of martingales Monotonicity of E[Xk ]

Martingale Constant: E[Xk+1 ] = E[Xk ]

Submartingale Increasing: E[Xk+1 ] ≥ E[Xk ]

Supermartingale Decreasing: E[Xk+1 ] ≤ E[Xk ]

Quasimartingale Non-monotonic

Table C.1: Summary of the monotonicity of different variants of martingales.

Appendix D

Preliminaries for Gradient Descent

We next present some preliminaries for the gradient descent method, which is one of the
most frequently used optimization methods. The gradient descent method is also the
foundation for the stochastic gradient descent method introduced in Chapter 6.

Convexity

 Definitions:
.
- Convex set: Suppose that D is a subset of Rn . This set is convex if z = cx + (1 −
c)y ∈ D for any x, y ∈ D and any c ∈ [0, 1].
- Convex function: Suppose f : D → R where D is convex. Then, the function f (x)
is convex if
f (cx + (1 − c)y) ≤ cf (x) + (1 − c)f (y)

for any x, y ∈ D and c ∈ [0, 1].

 Convex conditions:

- First-order condition: Consider a function f : D → R where D is convex. Then, f


is convex if [106, 3.1.3]

f (y) − f (x) ≥ ∇f (x)T (y − x), for all x, y ∈ D. (D.1)

When x is a scalar, ∇f (x) represents the slope of the tangent line of f (x) at x.
The geometric interpretation of (D.1) is that the point (y, f (y)) is always located
above the tangent line.
- Second-order condition: Consider a function f : D → R where D is convex. Then,
f is convex if

∇2 f (x) ⪰ 0, for all x ∈ D,

where ∇2 f (x) is the Hessian matrix.


 Degree of convexity:
Given a convex function, it is often of interest how strong its convexity is. The Hessian
matrix is a useful tool for describing the degree of convexity. If ∇2 f (x) is close to rank
deficiency at a point, then the function is flat around that point and hence weakly
convex. Otherwise, if the minimum singular value of ∇2 f (x) is positive and large,
the function is curly around that point and hence strongly convex. The degree of
convexity influences the step size selection in gradient descent algorithms.
The lower and upper bounds of ∇2 f (x) play an important role in characterizing the
function convexity.

- Lower bound of ∇2 f (x): A function is called strongly convex or strictly convex if


∇2 f (x) ⪰ ℓIn , where ℓ > 0 for all x.
- Upper bound of ∇2 f (x): If ∇2 f (x) is bounded from above so that ∇2 f (x) ⪯ LIn ,
then the change in the first-order derivative ∇f (x) cannot be arbitrarily fast;
equivalently, the function cannot be arbitrarily convex at a point.
The upper bound can be implied by a Lipschitz condition of ∇f (x), as shown
below.
Lemma D.1. Suppose that f is a convex function. If ∇f (x) is Lipschitz continuous with a constant L so that

    ‖∇f (x) − ∇f (y)‖ ≤ L‖x − y‖, for all x, y,

then ∇2 f (x) ⪯ LIn for all x. Here, ‖ · ‖ denotes the Euclidean norm.
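As a concrete worked example (added here for illustration), consider the quadratic function f (x) = (1/2)xT Ax + bT x with a symmetric positive semidefinite matrix A. Then ∇f (x) = Ax + b and ∇2 f (x) = A for all x. Hence, f is strongly convex with ℓ = λmin (A) whenever λmin (A) > 0, and ∇f is Lipschitz continuous with constant L = λmax (A), so that ∇2 f (x) ⪯ LIn , consistent with Lemma D.1.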

Gradient descent algorithms

Consider the following optimization problem:

    minx f (x)

where x ∈ D ⊆ Rn and f : D → R. The gradient descent algorithm is

xk+1 = xk − αk ∇f (xk ), k = 0, 1, 2, . . . (D.2)

where αk is a positive coefficient that may be fixed or time-varying. Here, αk is called


the step size or learning rate. Some remarks about (D.2) are given below.
 Direction of change: ∇f (xk ) is a vector that points in the direction along which f (xk )
increases the fastest. Hence, the term −αk ∇f (xk ) changes xk in the direction along
which f (xk ) decreases the fastest.
 Magnitude of change: The magnitude of the change −αk ∇f (xk ) is jointly determined
by the step size αk and the magnitude of ∇f (xk ).


- Magnitude of ∇f (xk ):
When xk is close to the optimum x∗ where ∇f (x∗ ) = 0, the magnitude k∇f (xk )k
is small. In this case, the update process of xk is slow, which is reasonable because
we do not want to update x too aggressively and miss the optimum.
When xk is far from the optimum, the magnitude of ∇f (xk ) may be large, and
hence, the update process of xk is fast. This is also reasonable because we hope
that the estimate can approach the optimum as quickly as possible.
- Step size αk :
If αk is small, the magnitude of −αk ∇f (xk ) is small, and hence the convergence
process is slow. If αk is too large, the update process of xk is aggressive, which
leads to either fast convergence or divergence.
How to select αk ? The selection of αk should depend on the degree of convexity
of f (xk ). If the function is curly around the optimum (the degree of convexity is
strong), then the step size αk should be small to guarantee convergence. If the
function is flat around the optimum (the degree of convexity is weak), then the
step size could be large so that xk can quickly approach the optimum. The above
intuition will be verified in the following convergence analysis.

Convergence analysis

We next present a proof of the convergence of the gradient descent algorithm in (D.2).
That is to show xk converges to the optimum x∗ where ∇f (x∗ ) = 0. First of all, we make
some assumptions.

 Assumption 1: f (x) is strongly convex such that

∇2 f (x) ⪰ ℓI,

where ℓ > 0.
 Assumption 2: ∇f (x) is Lipschitz continuous with a constant L. This assumption
implies the following inequality according to Lemma D.1:

∇2 f (x) ⪯ LIn .

The convergence proof is given below.

Proof. For any xk+1 and xk , it follows from [106, Section 9.1.2] that

    f (xk+1 ) = f (xk ) + ∇f (xk )T (xk+1 − xk ) + (1/2)(xk+1 − xk )T ∇2 f (zk )(xk+1 − xk ),        (D.3)


where zk is a convex combination of xk and xk+1 . Since it is assumed that ∇2 f (zk ) ⪯ LIn , we have ‖∇2 f (zk )‖ ≤ L. (D.3) implies

    f (xk+1 ) ≤ f (xk ) + ∇f (xk )T (xk+1 − xk ) + (1/2)‖∇2 f (zk )‖ ‖xk+1 − xk ‖2
             ≤ f (xk ) + ∇f (xk )T (xk+1 − xk ) + (L/2)‖xk+1 − xk ‖2 .

Substituting xk+1 = xk − αk ∇f (xk ) into the above inequality yields

    f (xk+1 ) ≤ f (xk ) + ∇f (xk )T (−αk ∇f (xk )) + (L/2)‖αk ∇f (xk )‖2
             = f (xk ) − αk ‖∇f (xk )‖2 + (αk2 L/2)‖∇f (xk )‖2
             = f (xk ) − αk (1 − αk L/2)‖∇f (xk )‖2 ,                    (D.4)

where the coefficient αk (1 − αk L/2) is denoted as ηk .

We next show that if we select

    0 < αk < 2/L,                  (D.5)

then the sequence {f (xk )}∞k=1 converges to f (x∗ ) where ∇f (x∗ ) = 0. First, (D.5) implies
that ηk > 0. Then, (D.4) implies that f (xk+1 ) ≤ f (xk ). Therefore, {f (xk )} is a nonin-
creasing sequence. Second, since f (xk ) is always bounded from below by f (x∗ ), we know
that {f (xk )} converges as k → ∞ according to the monotone convergence theorem in
Theorem C.1. Suppose that the limit of the sequence is f ∗ . Then, taking the limit on
both sides of (D.4) gives

    limk→∞ f (xk+1 ) ≤ limk→∞ f (xk ) − limk→∞ ηk ‖∇f (xk )‖2
    ⇔ f ∗ ≤ f ∗ − limk→∞ ηk ‖∇f (xk )‖2
    ⇔ 0 ≤ − limk→∞ ηk ‖∇f (xk )‖2 .

Since ηk ‖∇f (xk )‖2 ≥ 0, the above inequality implies that limk→∞ ηk ‖∇f (xk )‖2 = 0. As a result, xk converges to x∗ where ∇f (x∗ ) = 0. The proof is complete. The above proof is
inspired by [107].

The inequality in (D.5) provides valuable insights into how αk should be selected. If
the function is flat (L is small), the step size can be large; otherwise, if the function
is strongly convex (L is large), then the step size must be sufficiently small to ensure
convergence. There are also many other ways to prove the convergence such as the
contraction mapping theorem [108, Lemma 3]. A comprehensive introduction to convex
optimization can be found in [106].
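As a small numerical illustration of the step-size condition (D.5) (an added sketch; the matrix A and the step sizes are arbitrary choices), consider gradient descent on the quadratic f (x) = (1/2)xT Ax, for which ∇f (x) = Ax, x∗ = 0, and L = λmax (A).

```python
# Gradient descent (D.2) on f(x) = 0.5 x^T A x, illustrating the step-size
# bound (D.5): convergence for alpha < 2/L and divergence for alpha > 2/L.
import numpy as np

A = np.diag([1.0, 10.0])                     # lambda_max(A) = 10, so 2/L = 0.2
L = np.max(np.linalg.eigvalsh(A))

def distance_after_gd(alpha, steps=100):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - alpha * (A @ x)              # x_{k+1} = x_k - alpha * grad f(x_k)
    return np.linalg.norm(x)                 # distance to the minimizer x* = 0

print(distance_after_gd(0.5 / L))            # alpha < 2/L: the distance shrinks
print(distance_after_gd(1.9 / L))            # alpha still below 2/L: converges
print(distance_after_gd(2.1 / L))            # alpha > 2/L: the iterates blow up
```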

Bibliography

[1] M. Pinsky and S. Karlin, An introduction to stochastic modeling (3rd Edition).


Academic Press, 1998.

[2] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic program-


ming. John Wiley & Sons, 2014.

[3] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (2nd Edi-


tion). MIT Press, 2018.

[4] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge University Press, 2012.

[5] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming. Athena Scientific,


1996.

[6] H. K. Khalil, Nonlinear systems (3rd Edition). Patience Hall, 2002.

[7] G. Strang, Calculus. Wellesley-Cambridge Press, 1991.

[8] A. Besenyei, “A brief history of the mean value theorem,” 2012. Lecture notes.

[9] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transforma-
tions: Theory and application to reward shaping,” in International Conference on
Machine Learning, vol. 99, pp. 278–287, 1999.

[10] R. E. Bellman, Dynamic programming. Princeton University Press, 2010.

[11] R. E. Bellman and S. E. Dreyfus, Applied dynamic programming. Princeton Uni-


versity Press, 2015.

[12] J. Bibby, “Axiomatisations of the average and a further generalisation of monotonic


sequences,” Glasgow Mathematical Journal, vol. 15, no. 1, pp. 63–65, 1974.

[13] A. S. Polydoros and L. Nalpantidis, “Survey of model-based reinforcement learning:


Applications on robotics,” Journal of Intelligent & Robotic Systems, vol. 86, no. 2,
pp. 153–173, 2017.


[14] T. M. Moerland, J. Broekens, A. Plaat, and C. M. Jonker, “Model-based reinforce-


ment learning: A survey,” Foundations and Trends in Machine Learning, vol. 16,
no. 1, pp. 1–118, 2023.

[15] F.-M. Luo, T. Xu, H. Lai, X.-H. Chen, W. Zhang, and Y. Yu, “A survey on model-
based reinforcement learning,” arXiv:2206.09328, 2022.

[16] X. Wang, Z. Zhang, and W. Zhang, “Model-based multi-agent reinforcement learn-


ing: Recent progress and prospects,” arXiv:2203.10603, 2022.

[17] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V. Mnih,


N. Heess, and J. T. Springenberg, “Learning by playing solving sparse reward tasks
from scratch,” in International Conference on Machine Learning, pp. 4344–4353,
2018.

[18] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, and S. Levine, “How to


train your robot with deep reinforcement learning: Lessons we have learned,” The
International Journal of Robotics Research, vol. 40, no. 4-5, pp. 698–721, 2021.

[19] S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone, “Cur-


riculum learning for reinforcement learning domains: A framework and survey,”
The Journal of Machine Learning Research, vol. 21, no. 1, pp. 7382–7431, 2020.

[20] C. Szepesvári, Algorithms for reinforcement learning. Springer, 2010.

[21] A. Maroti, “RBED: Reward based epsilon decay,” arXiv:1910.13701, 2019.

[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,


A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie,
A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hass-
abis, “Human-level control through deep reinforcement learning,” Nature, vol. 518,
no. 7540, pp. 529–533, 2015.

[23] W. Dabney, G. Ostrovski, and A. Barreto, “Temporally-extended epsilon-greedy


exploration,” arXiv:2006.01782, 2020.

[24] H.-F. Chen, Stochastic approximation and its applications, vol. 64. Springer Science
& Business Media, 2006.

[25] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of


Mathematical Statistics, pp. 400–407, 1951.

[26] J. Venter, “An extension of the Robbins-Monro procedure,” The Annals of Mathe-
matical Statistics, vol. 38, no. 1, pp. 181–190, 1967.


[27] D. Ruppert, “Efficient estimations from a slowly convergent Robbins-Monro pro-


cess,” tech. rep., Cornell University Operations Research and Industrial Engineer-
ing, 1988.

[28] J. Lagarias, “Euler’s constant: Euler’s work and modern developments,” Bulletin
of the American Mathematical Society, vol. 50, no. 4, pp. 527–628, 2013.

[29] J. H. Conway and R. Guy, The book of numbers. Springer Science & Business
Media, 1998.

[30] S. Ghosh, “The Basel problem,” arXiv:2010.03953, 2020.

[31] A. Dvoretzky, “On stochastic approximation,” in The Third Berkeley Symposium


on Mathematical Statistics and Probability, 1956.

[32] T. Jaakkola, M. I. Jordan, and S. P. Singh, “On the convergence of stochastic


iterative dynamic programming algorithms,” Neural Computation, vol. 6, no. 6,
pp. 1185–1201, 1994.

[33] T. Kailath, A. H. Sayed, and B. Hassibi, Linear estimation. Prentice Hall, 2000.

[34] C. K. Chui and G. Chen, Kalman filtering. Springer, 2017.

[35] G. A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems.


Technical Report, Cambridge University, 1994.

[36] H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering, “A theoretical and
empirical analysis of Expected Sarsa,” in IEEE Symposium on Adaptive Dynamic
Programming and Reinforcement Learning, pp. 177–184, 2009.

[37] M. Ganger, E. Duryea, and W. Hu, “Double Sarsa and double expected Sarsa with
shallow and deep learning,” Journal of Data Analysis and Information Processing,
vol. 4, no. 4, pp. 159–176, 2016.

[38] C. J. C. H. Watkins, Learning from delayed rewards. PhD thesis, King’s College,
1989.

[39] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[40] T. C. Hesterberg, Advances in importance sampling. PhD Thesis, Stanford Univer-


sity, 1988.

[41] H. Hasselt, “Double Q-learning,” Advances in Neural Information Processing Sys-


tems, vol. 23, 2010.


[42] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double
Q-learning,” in AAAI Conference on Artificial Intelligence, vol. 30, 2016.

[43] C. Dann, G. Neumann, and J. Peters, “Policy evaluation with temporal differences:
A survey and comparison,” Journal of Machine Learning Research, vol. 15, pp. 809–
883, 2014.

[44] J. Clifton and E. Laber, “Q-learning: Theory and applications,” Annual Review of
Statistics and Its Application, vol. 7, pp. 279–301, 2020.

[45] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, “Q-learning algorithms: A


comprehensive classification and applications,” IEEE Access, vol. 7, pp. 133653–
133667, 2019.

[46] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine


Learning, vol. 3, no. 1, pp. 9–44, 1988.

[47] G. Strang, Linear algebra and its applications (4th Edition). Belmont, CA: Thom-
son, Brooks/Cole, 2006.

[48] C. D. Meyer and I. Stewart, Matrix analysis and applied linear algebra. SIAM,
2023.

[49] M. Pinsky and S. Karlin, An introduction to stochastic modeling. Academic Press,


2010.

[50] M. G. Lagoudakis and R. Parr, “Least-squares policy iteration,” The Journal of


Machine Learning Research, vol. 4, pp. 1107–1149, 2003.

[51] R. Munos, “Error bounds for approximate policy iteration,” in International Con-
ference on Machine Learning, vol. 3, pp. 560–567, 2003.

[52] A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, and J. P. How,


“A tutorial on linear function approximators for dynamic programming and rein-
forcement learning,” Foundations and Trends in Machine Learning, vol. 6, no. 4,
pp. 375–451, 2013.

[53] B. Scherrer, “Should one compute the temporal difference fix point or minimize the
Bellman residual? the unified oblique projection view,” in International Conference
on Machine Learning, 2010.

[54] D. P. Bertsekas, Dynamic programming and optimal control: Approximate dynamic


programming (Volume II). Athena Scientific, 2011.

[55] S. Abramovich, G. Jameson, and G. Sinnamon, “Refining Jensen’s inequality,”


Bulletin mathématique de la Société des Sciences Mathématiques de Roumanie,
pp. 3–14, 2004.


[56] S. S. Dragomir, “Some reverses of the Jensen inequality with applications,” Bulletin
of the Australian Mathematical Society, vol. 87, no. 2, pp. 177–194, 2013.

[57] S. J. Bradtke and A. G. Barto, “Linear least-squares algorithms for temporal dif-
ference learning,” Machine Learning, vol. 22, no. 1, pp. 33–57, 1996.

[58] K. S. Miller, “On the inverse of the sum of matrices,” Mathematics Magazine,
vol. 54, no. 2, pp. 67–72, 1981.

[59] S. A. U. Islam and D. S. Bernstein, “Recursive least squares for real-time imple-
mentation,” IEEE Control Systems Magazine, vol. 39, no. 3, pp. 82–85, 2019.

[60] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and


M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint
arXiv:1312.5602, 2013.

[61] J. Fan, Z. Wang, Y. Xie, and Z. Yang, “A theoretical analysis of deep Q-learning,”
in Learning for Dynamics and Control, pp. 486–489, 2020.

[62] L.-J. Lin, Reinforcement learning for robots using neural networks. Technical report, 1992.

[63] J. N. Tsitsiklis and B. Van Roy, “An analysis of temporal-difference learning with
function approximation,” IEEE Transactions on Automatic Control, vol. 42, no. 5,
pp. 674–690, 1997.

[64] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient meth-


ods for reinforcement learning with function approximation,” Advances in Neural
Information Processing Systems, vol. 12, 1999.

[65] P. Marbach and J. N. Tsitsiklis, “Simulation-based optimization of Markov reward


processes,” IEEE Transactions on Automatic Control, vol. 46, no. 2, pp. 191–209,
2001.

[66] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation,” Journal


of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.

[67] X.-R. Cao, “A basic formula for online policy gradient algorithms,” IEEE Trans-
actions on Automatic Control, vol. 50, no. 5, pp. 696–699, 2005.

[68] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist


reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.

[69] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradi-
ents,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.


[70] E. Greensmith, P. L. Bartlett, and J. Baxter, “Variance reduction techniques for


gradient estimates in reinforcement learning,” Journal of Machine Learning Re-
search, vol. 5, no. 9, 2004.

[71] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver,


and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in
International Conference on Machine Learning, pp. 1928–1937, 2016.

[72] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, “Reinforcement learning through asynchronous advantage actor-critic on a GPU,” arXiv:1611.06256, 2016.

[73] T. Degris, M. White, and R. S. Sutton, “Off-policy actor-critic,” arXiv:1205.4839,


2012.

[74] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “De-


terministic policy gradient algorithms,” in International Conference on Machine
Learning, pp. 387–395, 2014.

[75] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver,


and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv:1509.02971, 2015.

[76] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maxi-
mum entropy deep reinforcement learning with a stochastic actor,” in International
Conference on Machine Learning, pp. 1861–1870, 2018.

[77] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu,


A. Gupta, and P. Abbeel, “Soft actor-critic algorithms and applications,” arXiv:1812.05905, 2018.

[78] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy
optimization,” in International Conference on Machine Learning, pp. 1889–1897,
2015.

[79] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy


optimization algorithms,” arXiv:1707.06347, 2017.

[80] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in


actor-critic methods,” in International Conference on Machine Learning, pp. 1587–
1596, 2018.

[81] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfac-


tual multi-agent policy gradients,” in AAAI Conference on Artificial Intelligence,
vol. 32, 2018.


[82] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-
agent actor-critic for mixed cooperative-competitive environments,” Advances in
Neural Information Processing Systems, vol. 30, 2017.

[83] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, “Mean field multi-
agent reinforcement learning,” in International Conference on Machine Learning,
pp. 5571–5580, 2018.

[84] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung,


D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.

[85] Y. Yang and J. Wang, “An overview of multi-agent reinforcement learning from
game theoretical perspective,” arXiv:2011.00583, 2020.

[86] S. Levine and V. Koltun, “Guided policy search,” in International Conference on


Machine Learning, pp. 1–9, 2013.

[87] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-
based policy optimization,” Advances in Neural Information Processing Systems,
vol. 32, 2019.

[88] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on re-


inforcement learning,” in International Conference on Machine Learning, pp. 449–
458, 2017.

[89] M. G. Bellemare, W. Dabney, and M. Rowland, Distributional Reinforcement


Learning. MIT Press, 2023.

[90] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive dynamic programming for control:
algorithms and stability. Springer Science & Business Media, 2012.

[91] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and


feedback control: Using natural decision methods to design optimal adaptive con-
trollers,” IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.

[92] F. L. Lewis and D. Liu, Reinforcement learning and approximate dynamic program-
ming for feedback control. John Wiley & Sons, 2013.

[93] Z.-P. Jiang, T. Bian, and W. Gao, “Learning-based control: A tutorial and some
recent results,” Foundations and Trends in Systems and Control, vol. 8, no. 3,
pp. 176–284, 2020.

[94] S. Meyn, Control systems and reinforcement learning. Cambridge University Press,
2022.


[95] S. E. Li, Reinforcement learning for sequential decision and optimal control.
Springer, 2023.

[96] J. S. Rosenthal, First look at rigorous probability theory (2nd Edition). World
Scientific Publishing Company, 2006.

[97] D. Pollard, A user’s guide to measure theoretic probability. Cambridge University
Press, 2002.

[98] P. J. Spreij, “Measure theoretic probability,” UvA Course Notes, 2012.

[99] R. G. Bartle, The elements of integration and Lebesgue measure. John Wiley &
Sons, 2014.

[100] M. Taboga, Lectures on probability theory and mathematical statistics (2nd Edition).
CreateSpace Independent Publishing Platform, 2012.

[101] T. Kennedy, “Theory of probability,” 2007. Lecture notes.

[102] A. W. Van der Vaart, Asymptotic statistics. Cambridge University Press, 2000.

[103] L. Bottou, “Online learning and stochastic approximations,” Online Learning in
Neural Networks, vol. 17, no. 9, p. 142, 1998.

[104] D. Williams, Probability with martingales. Cambridge University Press, 1991.

[105] M. Métivier, Semimartingales: A course on stochastic processes. Walter de Gruyter,
1982.

[106] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press,
2004.

[107] S. Bubeck et al., “Convex optimization: Algorithms and complexity,” Foundations
and Trends in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.

[108] A. Jung, “A fixed-point of view on gradient methods for big data,” Frontiers in
Applied Mathematics and Statistics, vol. 3, p. 18, 2017.

Symbols

In this book, a matrix or a random variable is represented by a capital letter; a vector,
a scalar, or a sample is represented by a lowercase letter. The mathematical symbols
that are frequently used in this book are listed below.

=                        equality
≈                        approximation
≐                        equality by definition
≥, >, ≤, <               elementwise comparison
∈                        is an element of
‖·‖_2                    Euclidean norm of a vector or the corresponding induced matrix norm
‖·‖_∞                    maximum norm of a vector or the corresponding induced matrix norm
ln                       natural logarithm
R                        set of real numbers
R^n                      set of n-dimensional real vectors
R^{n×m}                  set of all n × m-dimensional real matrices
A ⪰ 0 (A ≻ 0)            matrix A is positive semidefinite (definite)
A ⪯ 0 (A ≺ 0)            matrix A is negative semidefinite (definite)
|x|                      absolute value of real scalar x
|S|                      number of elements in set S
∇_x f(x)                 gradient of scalar function f(x) with respect to vector x. It may be written as ∇f(x) for short.
[A]_{ij}                 element in the ith row and jth column of matrix A
[x]_i                    ith element of vector x
X ∼ p                    p is the probability distribution of random variable X.
p(X = x), Pr(X = x)      probability of X = x. They are often written as p(x) or Pr(x) for short.
p(x|y)                   conditional probability
E_{X∼p}[X]               expectation or expected value of random variable X. It is often written as E[X] for short when the distribution of X is clear.
var(X)                   variance of random variable X
arg max_x f(x)           maximizer of function f(x)
1_n                      vector of all ones. It is often written as 1 for short when its dimension is clear.
I_n                      n × n-dimensional identity matrix. It is often written as I for short when its dimensions are clear.
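
For concreteness, the two norms listed above are used in their standard sense; for a vector x ∈ R^n and a matrix A, they can be written in LaTeX form as

\|x\|_2 = \Big( \sum_{i=1}^{n} [x]_i^2 \Big)^{1/2}, \qquad
\|x\|_\infty = \max_{1 \le i \le n} \big| [x]_i \big|, \qquad
\|A\| = \max_{\|x\| = 1} \|Ax\|,

where the last expression is the matrix norm induced by either vector norm.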

Index

ε-greedy policy, 99
n-step Sarsa, 147
action, 14
action space, 14
action value, 41
    illustrative examples, 42
    relationship to state value, 41
    undiscounted case, 213
actor-critic, 224
    advantage actor-critic, 225
    deterministic actor-critic, 235
    off-policy actor-critic, 229
    QAC, 224
advantage actor-critic, 225
    advantage function, 228
    baseline invariance, 225
    optimal baseline, 226
    pseudocode, 229
agent, 24
Bellman equation, 31
    closed-form solution, 38
    elementwise expression, 32
    equivalent expressions, 33
    expression in action values, 43
    illustrative examples, 33
    iterative solution, 39
    matrix-vector expression, 37
    policy evaluation, 38
Bellman error, 182
Bellman expectation equation, 136
Bellman optimality equation, 49
    contraction property, 55
    elementwise expression, 49
    matrix-vector expression, 51
    optimal policy, 58
    optimal state value, 58
    solution and properties, 57
bootstrapping, 29
Cauchy sequence, 53
contraction mapping, 52
contraction mapping theorem, 53
deterministic actor-critic, 235
    policy gradient theorem, 236
    pseudocode, 243
deterministic policy gradient, 243
discount rate, 21
discounted return, 21
Dvoretzky’s convergence theorem, 118
environment, 24
episode, 22
episodic tasks, 22
expected Sarsa, 146
experience replay, 192
exploration and exploitation, 102
    policy gradient, 220
feature vector, 161
fixed point, 52
grid world example, 13
importance sampling, 229
    illustrative examples, 231
importance weight, 230
law of large numbers, 90
least-squares TD, 186
    recursive least squares, 187
Markov decision process, 23
    model and dynamics, 23
Markov process, 24
Markov property, 23
mean estimation, 88
    incremental manner, 111
metrics for policy gradient
    average reward, 203
    average value, 201
    equivalent expressions, 205
metrics for value function approximation
    Bellman error, 182
    projected Bellman error, 183
Monte Carlo methods, 88
    MC ε-Greedy, 100
    MC Basic, 91
    MC Exploring Starts, 96
    comparison with TD learning, 138
    on-policy, 151
off-policy, 150
off-policy actor-critic, 229
    importance sampling, 229
    policy gradient theorem, 232
    pseudocode, 234
on-policy, 150
online and offline, 139
optimal policy, 48
    greedy is optimal, 58
    impact of the discount rate, 60
    impact of the reward values, 62
optimal state value, 48
Poisson equation, 213
policy, 16
    function representation, 200
    deterministic policy, 17
    stochastic policy, 17
    tabular representation, 18
policy evaluation
    illustrative examples, 28
    solving the Bellman equation, 38
policy gradient theorem, 206
    deterministic case, 236
    off-policy case, 232
policy iteration algorithm, 72
    comparison with value iteration, 80
    convergence analysis, 74
    pseudocode, 76
projected Bellman error, 183
Q-learning (deep Q-learning), 191
    experience replay, 192
    illustrative examples, 193
    main network, 191
    pseudocode, 193
    replay buffer, 192
    target network, 191
Q-learning (function representation), 189
Q-learning (tabular representation), 149
    illustrative examples, 153
    pseudocode, 152
    off-policy, 150
QAC, 224
REINFORCE, 218
replay buffer, 192
return, 20
reward, 18
Robbins-Monro algorithm, 112
    application to mean estimation, 117
    convergence analysis, 115
Sarsa (function representation), 188
Sarsa (tabular representation), 142
    algorithm, 142
    convergence analysis, 143
    on-policy, 150
    optimal policy learning, 144
    variant: n-step Sarsa, 147
    variant: expected Sarsa, 146
state, 14
state space, 14
state transition, 15
state value, 30
    function representation, 161
    relationship to action value, 41
    undiscounted case, 213
stationary distribution, 166
    metrics for policy gradient, 201
    metrics for value function approximation, 165
stochastic gradient descent, 123
    application to mean estimation, 125
    comparison with batch gradient descent, 128
    convergence analysis, 130
    convergence pattern, 125
    deterministic formulation, 127
TD error, 137
TD target, 137
temporal-difference methods, 134
    n-step Sarsa, 147
    Q-learning, 149
    Sarsa, 142
    TD learning of state values, 135
    a unified viewpoint, 154
    expected Sarsa, 146
    value function approximation, 160
trajectory, 20
truncated policy iteration, 80
    comparison with value iteration and policy iteration, 84
    pseudocode, 82
value function approximation
    Q-learning with function approximation, 189
    Sarsa with function approximation, 188
    TD learning of state values, 164
    deep Q-learning, 191
    function approximators, 171
    illustrative examples, 173
    least-squares TD, 186
    linear function, 164
    theoretical analysis, 176
value iteration algorithm, 68
    comparison with policy iteration, 80
    pseudocode, 70
