
Dynamic Programming with Applications

Class Notes
René Caldentey
Stern School of Business, New York University
Spring 2011


PhD Programme
Dynamic Programming
René Caldentey

DYNAMIC PROGRAMMING (DP) VIA APPLICATIONS

PROFESSOR: René Caldentey

rcaldent@stern.nyu.edu
Rene.CALDENTEY@insead.edu

PMLS Off 0.23
Ext 4425
ASSISTANT: Virginie Frisch

virginie.frisch@insead.edu


Ext 9296


COURSE DESCRIPTION
Dynamic Programming (DP) provides a set of general methods for making sequential,
interrelated decisions under uncertainty. This course brings a new dimension to static
models studied in the optimization course, by investigating dynamic systems and their
optimization over time. The focus of the course is on modeling and deriving structural
properties for discrete time, stochastic problems. The techniques are illustrated through
concrete applications from Operations, Decision Sciences, Finance and Economics.

Prerequisites: An introductory course in Optimization and Probability.


Required and Recommended Textbooks
REQUIRED MATERIAL:
D. Bertsekas (2005). Dynamic Programming and Optimal Control. Athena Scientific,
Boston, MA.
Lecture Notes Dynamic Programming with Applications prepared by the instructor
to be distributed before the beginning of the class.

RECOMMENDED TEXTBOOKS:
M. Puterman (2005). Markov Decision Processes. Wiley, NJ.
S. Ross (1983). Introduction to Stochastic Dynamic Programming. Academic Press,
San Diego, CA.
W. Fleming and R. Rishel (1975). Deterministic and Stochastic Optimal Control.
Springer-Verlag, New York, NY.
P. Brémaud (1981). Point Processes and Queues: Martingale Dynamics. Springer-
Verlag, New York, NY.

Description of other readings and case material, which will be distributed in class
The course also includes some additional readings, mostly research papers that we will use
to complement the material and discussion covered in class. Some of these papers
describe important applications of dynamic programming in Operations Management and
other fields.

Schedule
Classroom: T.B.A in the HEC campus.
Time: All sessions will run from 10:00 to 13:00 (FT) or 15:00 to 18:00 (ST)
TOPICS
The following is the list of sessions and topics that we will cover in this class. These topics serve as
an introduction to Dynamic Programming. The coverage of the discipline is very selective: We
concentrate on a small number of powerful themes that have emerged as the central building
blocks in the theory of sequential optimization under uncertainty.

In preparation for class, students should read the REQUIRED READINGS indicated below under each
session (including the chapters in Bertsekas's textbook and the lecture notes provided). Due to
time limitations, we will not be able to review all the material covered in these readings during the
lectures. If you have specific questions about concepts that are not discussed in class, please
contact the instructor to schedule additional office hours.


Session 1 (March 7): Introduction to Dynamic Programming and Optimal Control
We will first introduce some general ideas of optimization in vector spaces, most
notably the ideas of extremals and admissible variations. These concepts will lead us to
the formulation of the classical Calculus of Variations and Euler's equation. We will proceed to
formulate a general deterministic optimal control problem and derive a set of necessary
conditions (Pontryagin Minimum Principle) that characterize an optimal solution. Finally, we
will discuss an alternative way of characterizing an optimal solution using the important idea
of the principle of optimality pioneered by Richard Bellman. This approach will lead us to
derive the so-called Hamilton-Jacobi-Bellman (HJB) sufficient optimality conditions.
We will complement the discussion by reviewing a paper on optimal price trajectories in a retail
environment by Smith and Achabal (1998).

REQUIRED READINGS:
Chapter 3 in Bertsekas.
Chapter 1 in the Lecture Notes.
S. Smith and D. Achabal (1998). Clearance Pricing and Inventory Policies for Retail
Chains. Management Science, 44(3), 285-300.


Session 2 (March 15): Discrete Dynamic Programming
In this session, we review the classical model of dynamic programming (DP) in discrete
time and finite time horizon. First, we discuss deterministic DP models and interpret them as
shortest path problems in an appropriate network. Different algorithms to find the shortest
path are discussed. We then extend the DP framework to include uncertainty (both in the
payoffs and the evolution of the system) and connect it to the theory of Markov decision
processes. We review some basic properties of the value function and numerical methods to
compute it.

REQUIRED READINGS:
Chapters 1 & 2 in Bertsekas.
Chapter 2, sections 2.1-2.4 in Lecture Notes.


Session 3 (March 22): Extensions to the Basic Dynamic Programming Model
In this session we discuss some fundamental properties and extensions of the classical DP
model discussed in the previous lecture. We discuss in detail a particular but important
special case, the so-called Linear-Quadratic problem. We also discuss the connection
between DP and supermodularity. Finally, we discuss some extensions of the DP model
regarding state-space augmentation and the value of information.

REQUIRED READINGS:
Chapter 2 in Bertsekas.
Chapter 2, sections 2.5-2.7 in Lecture Notes.


Session 4 (March 29): Applications of Dynamic Programming
This session is dedicated to reviewing three important applications of DP. The first application
that we discuss is the optimality of (S,s) policies in a multi-period inventory control
setting. We then review the single-leg multiclass revenue management problem. We
conclude by studying the application of DP to optimal stopping problems.

REQUIRED READINGS:
Chapter 4 in Bertsekas.
Chapter 3 in Lecture Notes.
H. Scarf (1959). The Optimality of (S,s) Policies in the Dynamic Inventory Problem.
In Mathematical Methods in the Social Sciences. Proceedings of the First Stanford
Symposium. (Available at http://cowles.econ.yale.edu/~hes/pub/ss-policies.pdf)
S. Brumelle and J. McGill (1993). Airline Seat Allocation with Multiple Nested Fare
Classes. Operations Research 41, 127-137.


Session 5 (April 7): Dynamic Programming with Imperfect State Information
In this section we extend the basic DP framework to the case in which the controller has
only imperfect (noisy) information about the state of the system at any given time. This is a
common situation in many practical applications (e.g., firms do not know the exact type of a
customer; a repairman does not know the status of a machine, etc.). We will discuss an
efficient formulation of this problem and find conditions under which a sufficient set of
statistics can be used to describe the available information. We will also revisit the LQ
problem and review the Kalman filtering theory.

REQUIRED READINGS:
Chapter 5 in Bertsekas.
Chapter 4 in Lecture Notes.


Session 6 (April 13): Infinite Horizon and Semi-Markov Decision Models
In this section we extend the models discussed in the previous sessions to the case in
which the planning horizon is infinite. We review alternative formulations of the problem
(e.g., discounted versus average objective criteria) and derive the associated Bellman
equation for these formulations. We also discuss the connection between DP and semi-
Markov decision theory.

REQUIRED READINGS:
Chapter 7 in Bertsekas.
Chapter 5 in Lecture Notes.


Session 7 (April 21): Optimal Point Process Control
In this section we consider the problem of how to optimally control the intensity of a Poisson
process. This problem (and some of its variations) has become an important building block
in many applications including dynamic pricing models. We will review the basic theory and
some concrete applications in revenue management.

REQUIRED READINGS:
Chapter 6 in Lecture Notes
G. Gallego and G. van Ryzin (1994). Optimal Dynamic Pricing of Inventories with
Stochastic Demand over Finite Horizons. Management Science 40(8), 999-1020.






The Grading Scheme

1. There are six individual assignments that will be assigned at the end of the first six
sessions. Students have a week to prepare their solutions which will be collected at the
beginning of the following session. Homework should be considered as a take-home exam
and must be done individually. In the same spirit, students are not supposed to consult
solutions from previous years. Presentation is part of the grading of these assignments.
Assignments must be submitted on time; late submissions will not be accepted.

2. A take-home exam to be distributed the last day of class. Students will have two weeks to
prepare and submit their solutions. The exam will be cumulative and will include the
implementation of a computational algorithm.



Final Score
60% Individual Homework
40% Final Exam


Prof. R. Caldentey
Preface
These lecture notes are based on the material that my colleague Gustavo Vulcano uses in the
Dynamic Programming Ph.D. course that he regularly teaches at the New York University Leonard
N. Stern School of Business.
Part of this material is based on the widely used Dynamic Programming and Optimal Control
textbook by Dimitri Bertsekas, including a set of lecture notes publicly available on the textbook's
website: http://www.athenasc.com/dpbook.html
However, I have added some additional material on Optimal Control for deterministic systems
(Chapter 1) and for point processes (Chapter 6). I have also tried to add more applications related
to Operations Management.
The booklet is organized in six chapters. We will cover each chapter in a 3-hour lecture except for
Chapter 2, where we will spend two 3-hour lectures. The details of each session are presented in the
syllabus. I hope that you will find the material useful!
Contents

1 Deterministic Optimal Control
  1.1 Introduction to Calculus of Variations
    1.1.1 Abstract Vector Space
    1.1.2 Classical Calculus of Variations
    1.1.3 Exercises
  1.2 Continuous-Time Optimal Control
  1.3 Pontryagin Minimum Principle
    1.3.1 Weak & Strong Extremals
    1.3.2 Necessary Conditions
  1.4 Deterministic Dynamic Programming
    1.4.1 Value Function
    1.4.2 DP's Partial Differential Equations
    1.4.3 Feedback Control
    1.4.4 The Linear-Quadratic Problem
  1.5 Extensions
    1.5.1 The Method of Characteristics for First-Order PDEs
    1.5.2 Optimal Control and Myopic Solution
  1.6 Exercises
  1.7 Exercises

2 Discrete Dynamic Programming
  2.1 Discrete-Time Formulation
    2.1.1 Markov Decision Processes
  2.2 Deterministic DP and the Shortest Path Problem
    2.2.1 Deterministic finite-state problem
    2.2.2 Backward and forward DP algorithms
    2.2.3 Generic shortest path problems
    2.2.4 Some shortest path applications
    2.2.5 Shortest path algorithms
    2.2.6 Alternative shortest path algorithms: Label correcting methods
    2.2.7 Exercises
  2.3 Stochastic Dynamic Programming
  2.4 The Dynamic Programming Algorithm
    2.4.1 Exercises
  2.5 Linear-Quadratic Regulator
    2.5.1 Preliminaries: Review of linear algebra and quadratic forms
    2.5.2 Problem setup
    2.5.3 Properties
    2.5.4 Derivation
    2.5.5 Asymptotic behavior of the Riccati equation
    2.5.6 Random system matrices
    2.5.7 On certainty equivalence
    2.5.8 Exercises
  2.6 Modular functions and monotone policies
    2.6.1 Lattices
    2.6.2 Supermodularity and increasing differences
    2.6.3 Parametric monotonicity
    2.6.4 Applications to DP
  2.7 Extensions
    2.7.1 The Value of Information
    2.7.2 State Augmentation
    2.7.3 Forecasts
    2.7.4 Multiplicative Cost Functional

3 Applications
  3.1 Inventory Control
    3.1.1 Problem setup
    3.1.2 Structure of the cost function
    3.1.3 Positive fixed cost and (s, S) policies
    3.1.4 Exercises
  3.2 Single-Leg Revenue Management
    3.2.1 System with observable disturbances
    3.2.2 Structure of the value function
    3.2.3 Structure of the optimal policy
    3.2.4 Computational complexity
    3.2.5 Airlines: Practical implementation
    3.2.6 Exercises
  3.3 Optimal Stopping and Scheduling Problems
    3.3.1 Optimal stopping problems
    3.3.2 General stopping problems and the one-step look ahead policy
    3.3.3 Scheduling problem
    3.3.4 Exercises

4 DP with Imperfect State Information
  4.1 Reduction to the perfect information case
  4.2 Linear-Quadratic Systems and Sufficient Statistics
    4.2.1 Linear-Quadratic systems
    4.2.2 Implementation aspects: Steady-state controller
    4.2.3 Sufficient statistics
    4.2.4 The conditional state distribution recursion
  4.3 Sufficient Statistics
    4.3.1 Conditional state distribution: Review of basics
    4.3.2 Finite-state systems
  4.4 Exercises

5 Infinite Horizon Problems
  5.1 Types of infinite horizon problems
    5.1.1 Preview of infinite horizon results
    5.1.2 Total cost problem formulation
  5.2 Stochastic shortest path problems
    5.2.1 Computational approaches
  5.3 Discounted problems
  5.4 Average cost-per-stage problems
    5.4.1 General setting
    5.4.2 Associated stochastic shortest path (SSP) problem
    5.4.3 Heuristic argument
    5.4.4 Bellman's equation
    5.4.5 Computational approaches
  5.5 Semi-Markov Decision Problems
    5.5.1 General setting
    5.5.2 Problem formulation
    5.5.3 Discounted cost problems
    5.5.4 Average cost problems
  5.6 Application: Multi-Armed Bandits
  5.7 Exercises

6 Point Process Control
  6.1 Basic Definitions
  6.2 Counting Processes
  6.3 Optimal Intensity Control
    6.3.1 Dynamic Programming for Intensity Control
  6.4 Applications to Revenue Management
    6.4.1 Model Description and HJB Equation
    6.4.2 Bounds and Heuristics

7 Papers and Additional Readings
Chapter 1

Deterministic Optimal Control

In this chapter, we discuss the basic Dynamic Programming framework in the context of deterministic, continuous-time, continuous-state-space control.

1.1 Introduction to Calculus of Variations

Given a function $f : X \to \mathbb{R}$, we are interested in characterizing a solution to
$$\min_{x \in X} f(x), \qquad (\star)$$
where $X$ is a finite-dimensional space, e.g., in classical calculus $X \subseteq \mathbb{R}^n$.

If $n = 1$ and $X = [a, b]$, then under some smoothness conditions we can characterize solutions to $(\star)$ through a set of necessary conditions.

Necessary conditions for a minimum at $x^*$:
- Interior point: $f'(x^*) = 0$, $f''(x^*) \geq 0$, and $a < x^* < b$.
- Left boundary: $f'(x^*) \geq 0$ and $x^* = a$.
- Right boundary: $f'(x^*) \leq 0$ and $x^* = b$.

Existence: If $f$ is continuous on $[a, b]$, then it has a minimum on $[a, b]$.

Uniqueness: If $f$ is strictly convex on $[a, b]$, then it has a unique minimum on $[a, b]$.
1.1.1 Abstract Vector Space

Consider a general optimization problem:
$$\min_{x \in D} J(x), \qquad (\star\star)$$
where $D$ is a subset of a vector space $\mathcal{V}$.

We consider curves $x = x(\alpha) : [a, b] \to D$ such that the composite map $\alpha \mapsto J(x(\alpha))$ is differentiable. Suppose that $x^* \in D$ and $J(x^*) \leq J(x)$ for all $x \in D$. In addition, let $\alpha^*$ be such that $x(\alpha^*) = x^*$; then (necessary conditions):

- Interior point: $\frac{d}{d\alpha} J(x(\alpha))\big|_{\alpha^*} = 0$, $\frac{d^2}{d\alpha^2} J(x(\alpha))\big|_{\alpha^*} \geq 0$, and $a < \alpha^* < b$.
- Left boundary: $\frac{d}{d\alpha} J(x(\alpha))\big|_{\alpha^*} \geq 0$ and $\alpha^* = a$.
- Right boundary: $\frac{d}{d\alpha} J(x(\alpha))\big|_{\alpha^*} \leq 0$ and $\alpha^* = b$.

How do we use these necessary conditions to identify good candidates for $x^*$?
Extremals and Gateaux Variations

Definition 1.1.1
Let $(\mathcal{V}, \|\cdot\|)$ be a normed linear space and let $D \subseteq \mathcal{V}$.

We say that a point $x^* \in D$ is an extremal point for a real-valued function $J$ on $D$ if
$$J(x^*) \leq J(x) \ \text{ for all } x \in D \qquad \text{or} \qquad J(x^*) \geq J(x) \ \text{ for all } x \in D.$$

A point $x_0 \in D$ is called a local extremal point for $J$ if for some $\varepsilon > 0$, $x_0$ is an extremal point on
$$D_\varepsilon(x_0) := \{x \in D : \|x - x_0\| < \varepsilon\}.$$

A point $x \in D$ is an internal (radial) point of $D$ in the direction $v \in \mathcal{V}$ if there exists $\varepsilon(v) > 0$ such that $x + \lambda v \in D$ for all $|\lambda| < \varepsilon(v)$ (respectively, $0 < \lambda < \varepsilon(v)$).

The directional derivative of order $n$ of $J$ at $x$ in the direction $v$ is denoted by
$$\delta^n J(x; v) = \frac{d^n}{d\lambda^n} J(x + \lambda v)\Big|_{\lambda = 0}.$$

$J$ is Gateaux-differentiable at $x$ if $x$ is an internal point in the direction $v$ and $\delta J(x; v)$ exists for all $v \in \mathcal{V}$.
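As a simple illustration (a generic example chosen here for concreteness, not one used later in the notes), take $\mathcal{V} = C[a, b]$ and $J(x) = \int_a^b x^2(t)\, dt$. For any direction $v \in \mathcal{V}$,
$$\delta J(x; v) = \frac{d}{d\lambda} \int_a^b \big(x(t) + \lambda v(t)\big)^2 dt \,\Big|_{\lambda = 0} = \int_a^b 2\, x(t)\, v(t)\, dt,$$
so the Gateaux variation exists in every direction and $J$ is Gateaux-differentiable at every $x \in \mathcal{V}$.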
Theorem 1.1.1 (Necessary Conditions) Let $(\mathcal{V}, \|\cdot\|)$ be a normed linear space. If $J$ has a (local) extremal at a point $x^*$ on $D$, then $\delta J(x^*; v) = 0$ for all $v \in \mathcal{V}$ such that (i) $x^*$ is an internal point in the direction $v$ and (ii) $\delta J(x^*; v)$ exists.

This result is useful if there are enough directions $v$ so that the condition $\delta J(x^*; v) = 0$ can determine $x^*$.
Problem 1.1.1
1. Find the extremal points for
$$J(y) = \int_a^b y^2(x)\, dx$$
on the domain $D = \{y \in C[a, b] : y(a) = \alpha \text{ and } y(b) = \beta\}$.

2. Find the extremal for
$$J(P) = \int_a^b P(t)\, D(P(t))\, dt$$
on the domain $D = \{P \in C[a, b] : \underline{P} \leq P(t) \leq \overline{P}\}$.
Extremal with Constraints

Suppose that in a normed linear space $(\mathcal{V}, \|\cdot\|)$ we want to characterize extremal points for a real-valued function $J$ on a domain $D \subseteq \mathcal{V}$. Suppose that the domain is given by the level set $D := \{x \in \mathcal{V} : G(x) = \gamma\}$, where $G$ is a real-valued function on $\mathcal{V}$ and $\gamma \in \mathbb{R}$.

Let $x^*$ be a (local) extremal point. We will assume that both $J$ and $G$ are defined in a neighborhood of $x^*$. We pick an arbitrary pair of directions $v, w$ and define the mapping
$$F_{v,w}(r, s) := \begin{pmatrix} \varphi(r, s) \\ \psi(r, s) \end{pmatrix} = \begin{pmatrix} J(x^* + rv + sw) \\ G(x^* + rv + sw) \end{pmatrix},$$
which is well defined in a neighborhood of the origin.

Suppose $F$ maps a neighborhood of $0$ in the $(r, s)$ plane onto a neighborhood of $(\varphi^*, \psi^*) := (J(x^*), G(x^*))$ in the $(\varphi, \psi)$ plane. Then $x^*$ cannot be an extremal point of $J$.

Figure 1.1.1: The mapping $F_{v,w}$ between the $(r, s)$ plane and the $(\varphi, \psi)$ plane.

This condition is assured if $F$ has an inverse which is continuous at $(\varphi^*, \psi^*)$.

Theorem 1.1.2 For $\bar{x} \in \mathbb{R}^n$ and a neighborhood $N(\bar{x})$, if a vector-valued function $F : N(\bar{x}) \to \mathbb{R}^n$ has continuous first partial derivatives in each component with nonvanishing Jacobian determinant at $\bar{x}$, then $F$ provides a continuously invertible mapping between a neighborhood of $\bar{x}$ and a region containing a full neighborhood of $F(\bar{x})$.

In our case, $\bar{x} = 0$ and the Jacobian matrix of $F$ is given by
$$\nabla F(0, 0) = \begin{pmatrix} \delta J(x^*; v) & \delta J(x^*; w) \\ \delta G(x^*; v) & \delta G(x^*; w) \end{pmatrix}.$$
Then, if $|\nabla F(0, 0)| \neq 0$, $x^*$ cannot be an extremal point for $J$ when constrained to the level set defined by $G(x^*)$.
Definition 1.1.2 In a normed linear space $(\mathcal{V}, \|\cdot\|)$, the Gateaux variations $\delta J(x; v)$ of a real-valued function $J$ are said to be weakly continuous at $x^* \in \mathcal{V}$ if for each $v \in \mathcal{V}$, $\delta J(x; v) \to \delta J(x^*; v)$ as $x \to x^*$.

Theorem 1.1.3 (Lagrange) In a normed linear space $(\mathcal{V}, \|\cdot\|)$, if real-valued functions $J$ and $G$ are defined in a neighborhood of $x^*$, a (local) extremal point for $J$ constrained by the level set defined by $G(x^*)$, and have there weakly continuous Gateaux variations, then either
a) $\delta G(x^*; w) = 0$ for all $w \in \mathcal{V}$, or
b) there exists a constant $\lambda \in \mathbb{R}$ such that $\delta J(x^*; v) = \lambda\, \delta G(x^*; v)$ for all $v \in \mathcal{V}$.
Problem 1.1.2 Find the extremal for
$$J(P) = \int_0^T P(t)\, D(P(t))\, dt$$
on the domain $D = \{P \in C[0, T] : \int_0^T D(P(t))\, dt = I\}$.
1.1.2 Classical Calculus of Variations

Historical Background

The theory of Calculus of Variations has been the classic approach to solving dynamic optimization problems, dating back to the late 17th century. It started with the Brachistochrone problem proposed by Johann Bernoulli in 1696: find the planar curve which would provide the fastest time of transit to a particle sliding down it under the action of gravity (see Figure 1.1.2). Five solutions were proposed by Jakob Bernoulli (Johann's brother), Newton, Euler, Leibniz, and L'Hospital. Another classical example of the method of calculus of variations is the Geodesic problem: find the shortest path in a given domain connecting two points of it (e.g., the shortest path on a sphere).

Figure 1.1.2: The Brachistochrone problem: find the curve which would provide the fastest time of transit to a particle sliding down it from Point A to Point B under the action of gravity.
More generally, calculus of variations problems involve finding (possibly multidimensional) curves $x(t)$ with certain optimality properties. In general, the calculus of variations approach requires the differentiability of the functions that enter the problem in order to get interior solutions.

The Simplest Problem in Calculus of Variations

$$J(x) = \int_a^b L(t, x(t), \dot{x}(t))\, dt,$$
where $\dot{x}(t) = \frac{d}{dt} x(t)$. The variational integrand $L$ is assumed to be smooth enough (e.g., at least $C^2$).
Example 1.1.1
- Geodesic: $L = \sqrt{1 + \dot{x}^2}$.
- Brachistochrone: $L = \sqrt{\dfrac{1 + \dot{x}^2}{x}}$.
- Minimal Surface of Revolution: $L = x\sqrt{1 + \dot{x}^2}$.
Admissible Solutions: A function $x(t)$ is called piecewise $C^n$ on $[a, b]$ if $x(t)$ is $C^{n-1}$ on $[a, b]$ and $x^{(n)}(t)$ is piecewise continuous on $[a, b]$, i.e., continuous except at a finite number of points. We denote by $H[a, b]$ the vector space of all real-valued piecewise $C^1$ functions on $[a, b]$ and by $H_e[a, b]$ the subspace of $H[a, b]$ such that $x(a) = x_a$ and $x(b) = x_b$ for all $x \in H_e[a, b]$.

Problem: $\min_{x \in H_e[a, b]} J(x)$.

Admissible Variations or Test Functions: Let $Y[a, b] \subseteq H[a, b]$ be the subspace of piecewise $C^1$ functions $y$ such that
$$y(a) = y(b) = 0.$$
We note that for $x \in H_e[a, b]$, $y \in Y[a, b]$, and $\varepsilon \in \mathbb{R}$, the function $x + \varepsilon y \in H_e[a, b]$.
Theorem 1.1.4 Let $J$ have a minimum on $H_e[a, b]$ at $x^*$. Then
$$L_{\dot{x}}(t, x^*(t), \dot{x}^*(t)) - \int_a^t L_x(\tau, x^*(\tau), \dot{x}^*(\tau))\, d\tau = \text{constant} \quad \text{for all } t \in [a, b]. \tag{1.1.1}$$
A function $x^*(t)$ satisfying (1.1.1) is called an extremal.

Corollary 1.1.1 (Euler's Equation) Every extremal $x^*$ satisfies the differential equation
$$L_x = \frac{d}{dt} L_{\dot{x}}.$$
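For instance, applying Euler's equation to the geodesic integrand $L = \sqrt{1 + \dot{x}^2}$ of Example 1.1.1 (a standard computation included here for concreteness),
$$L_x = 0, \qquad L_{\dot{x}} = \frac{\dot{x}}{\sqrt{1 + \dot{x}^2}}, \qquad \text{so Euler's equation reads} \qquad \frac{d}{dt}\left(\frac{\dot{x}}{\sqrt{1 + \dot{x}^2}}\right) = 0.$$
Hence $\dot{x}/\sqrt{1 + \dot{x}^2}$, and therefore $\dot{x}$ itself, is constant: the extremals are straight lines, as one expects for shortest paths in the plane.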
Problem 1.1.3 (Production-Inventory Control)
Consider a firm that operates according to a make-to-stock policy during a planning horizon $[0, T]$. The company faces an exogenous and deterministic demand with intensity $\lambda(t)$. Production is costly; if the firm chooses a production rate $\mu$ at time $t$, then the instantaneous production cost rate is $c(t, \mu)$. In addition, there are holding and backordering costs. We denote by $h(t, I)$ the holding/backordering cost rate if the inventory position at time $t$ is $I$. We suppose that the company starts with an initial inventory $I_0$ and tries to minimize total operating costs during the planning horizon of length $T > 0$, subject to the requirement that the final inventory position at time $T$ is $I_T$.
a) Formulate the optimization problem as a calculus of variations problem.
b) What is Euler's equation?
Sufficient Conditions: Weierstrass Method

Suppose that $x^*$ is an extremal for
$$J(x) = \int_a^b f(t, x(t), \dot{x}(t))\, dt := \int_a^b f[x(t)]\, dt$$
on $D = \{x \in C^1[a, b] : x(a) = x^*(a),\ x(b) = x^*(b)\}$. Let $\bar{x}(t) \in D$ be an arbitrary feasible solution.

For each $\tau \in (a, b]$ we define the function $\varphi(t; \tau)$ on $(a, \tau)$ such that $\varphi(t; \tau)$ is an extremal function for $f$ on $(a, \tau)$ whose graph joins $(a, x^*(a))$ to $(\tau, \bar{x}(\tau))$, and such that $\varphi(t; b) = x^*(t)$.

[Figure: the extremal $\varphi(t; \tau)$ joining $(a, x^*(a))$ to $(\tau, \bar{x}(\tau))$, together with the curves $\bar{x}(t)$ and $x^*(t)$ on $[a, b]$.]

We define
$$\sigma(\tau) := \int_a^\tau f[\varphi(t; \tau)]\, dt + \int_\tau^b f[\bar{x}(t)]\, dt,$$
which has the following properties:
$$\sigma(a) = \int_a^b f[\bar{x}(t)]\, dt = J(\bar{x}) \qquad \text{and} \qquad \sigma(b) = \int_a^b f[\varphi(t; b)]\, dt = J(x^*).$$
Therefore, we have that
$$J(\bar{x}) - J(x^*) = \sigma(a) - \sigma(b) = -\int_a^b \sigma'(\tau)\, d\tau,$$
so that a sufficient condition for the optimality of $x^*$ is $\sigma'(\tau) \leq 0$. That is,

Weierstrass formula:
$$-\sigma'(\tau) = E\big(\tau, \bar{x}(\tau), \dot{\varphi}(\tau; \tau), \dot{\bar{x}}(\tau)\big) = f[\bar{x}(\tau)] - f\big(\tau, \bar{x}(\tau), \dot{\varphi}(\tau; \tau)\big) - f_{\dot{x}}\big(\tau, \bar{x}(\tau), \dot{\varphi}(\tau; \tau)\big)\,\big(\dot{\bar{x}}(\tau) - \dot{\varphi}(\tau; \tau)\big) \geq 0.$$
1.1.3 Exercises

Exercise 1.1.1 (Convexity and Euler's Equation) Let $\mathcal{V}$ be a linear vector space and $D$ a subset of $\mathcal{V}$. A real-valued function $f$ defined on $D$ is said to be [strictly] convex on $D$ if
$$f(y + v) - f(y) \geq \delta f(y; v) \quad \text{for all } y,\ y + v \in D,$$
[with equality if and only if $v = 0$], where $\delta f(y; v)$ is the first Gateaux variation of $f$ at $y$ in the direction $v$.

a) Prove the following: If $f$ is [strictly] convex on $D$, then each $x^* \in D$ for which $\delta f(x^*; y) = 0$ for all $x^* + y \in D$ minimizes $f$ on $D$ [uniquely].

Let $f = f(x, y, z)$ be a real-valued function on $S := [a, b] \times \mathbb{R}^2$. Assume that $f$ and the partial derivatives $f_y$ and $f_z$ are defined and continuous on $S$. For all $y \in C^1[a, b]$ we define the integral function
$$F(y) = \int_a^b f(x, y(x), y'(x))\, dx := \int_a^b f[y(x)]\, dx,$$
where $f[y(x)] = f(x, y(x), y'(x))$.

b) Prove that the first Gateaux variation of $F$ is given by
$$\delta F(y; v) = \int_a^b \Big( f_y[y(x)]\, v(x) + f_z[y(x)]\, v'(x) \Big)\, dx.$$

c) Let $\mathcal{D}$ be a domain in $\mathbb{R}^2$. For two arbitrary real numbers $\alpha$ and $\beta$ define
$$\mathcal{D}_{\alpha, \beta}[a, b] = \Big\{ y \in C^1[a, b] : y(a) = \alpha,\ y(b) = \beta,\ \text{and } (y(x), y'(x)) \in \mathcal{D} \text{ for all } x \in [a, b] \Big\}.$$
Prove that if $f(x, y, z)$ is convex on $[a, b] \times \mathcal{D}$ then
1. $F(y)$ defined above is convex on $\mathcal{D}_{\alpha, \beta}[a, b]$, and
2. each $y \in \mathcal{D}_{\alpha, \beta}[a, b]$ for which
$$\frac{d}{dx} f_z[y(x)] = f_y[y(x)] \qquad (\star)$$
on $(a, b)$ satisfies $\delta F(y; v) = 0$ for all $y + v \in \mathcal{D}_{\alpha, \beta}[a, b]$.
Conclude that such a $y \in \mathcal{D}_{\alpha, \beta}[a, b]$ satisfying $(\star)$ minimizes $F$ on $\mathcal{D}_{\alpha, \beta}[a, b]$. That is, extremal solutions are minimizers.
Exercise 1.1.2 (du Bois-Reymond's Lemma) The proof of Euler's equation uses du Bois-Reymond's Lemma:
If $h \in C[a, b]$ and
$$\int_a^b h(x)\, v'(x)\, dx = 0$$
for all $v \in D_0 = \{v \in C^1[a, b] : v(a) = v(b) = 0\}$,
then $h$ is constant on $[a, b]$. Using this lemma, prove the following more general results.

a) If $g, h \in C[a, b]$ and
$$\int_a^b \big[g(x)\, v(x) + h(x)\, v'(x)\big]\, dx = 0$$
for all $v \in D_0 = \{v \in C^1[a, b] : v(a) = v(b) = 0\}$,
then $h \in C^1[a, b]$ and $h' = g$.

b) If $h \in C[a, b]$ and for some $m = 1, 2, \ldots$ we have
$$\int_a^b h(x)\, v^{(m)}(x)\, dx = 0$$
for all $v \in D_0^{(m)} = \{v \in C^m[a, b] : v^{(k)}(a) = v^{(k)}(b) = 0,\ k = 0, 1, 2, \ldots, m-1\}$,
then on $[a, b]$, $h$ is a polynomial of degree $m-1$.
Exercise 1.1.3 Suppose you have inherited a large sum $S$ and plan to spend it so as to maximize your discounted cumulative utility over the next $T$ units of time. Let $u(t)$ be the amount that you spend at time $t$ and let $\sqrt{u(t)}$ be the instantaneous utility rate that you receive at time $t$. Let $\beta$ be the discount factor that you use to discount future utility, i.e., the discounted value of spending $u$ at time $t$ is equal to $\exp(-\beta t)\sqrt{u}$. Let $r$ be the risk-free interest rate available in the market, i.e., one dollar today is equivalent to $\exp(r t)$ dollars $t$ units of time in the future.
a) Formulate the control problem that maximizes the discounted cumulative utility given all necessary constraints.
b) Find the optimal expenditure rate $\{u(t)\}$ for all $t \in [0, T]$.
Exercise 1.1.4 (Production-Inventory Problem) Consider a make-to-stock manufacturing facility producing a single type of product. Initial inventory at time $t = 0$ is $I_0$. The demand rate for the next selling season $[0, T]$ is known and equal to $\lambda(t)$, $t \in [0, T]$. We denote by $P(t)$ the production rate and by $I(t)$ the inventory position. Suppose that due to poor inventory management there is a fixed proportion $\theta$ of inventory that is lost per unit time. Thus, at time $t$ the inventory $I(t)$ increases at a rate $P(t)$ and decreases at a rate $\lambda(t) + \theta I(t)$.
Suppose the company has set target values for the inventory and production rate during $[0, T]$. Let $\bar{I}$ and $\bar{P}$ be these target values, respectively. Deviations from these values are costly, and the company uses the following cost function $C(I, P)$ to evaluate a production-inventory strategy $(P, I)$:
$$C(I, P) = \int_0^T \Big[ \alpha^2\, (\bar{I} - I(t))^2 + (\bar{P} - P(t))^2 \Big]\, dt.$$
The objective of the company is to find an optimal production-inventory strategy that minimizes the cost function subject to the additional condition that $I(T) = \bar{I}$.
a) Rewrite the cost function $C(I, P)$ as a function of the inventory position and its first derivative only.
b) Find the optimal production-inventory strategy.
1.2 Continuous-Time Optimal Control

The Optimal Control problem that we study in this section, and in particular the optimality conditions that we derive (HJB equation and Pontryagin Minimum Principle), will provide us with an alternative and powerful method to solve the variational problems discussed in the previous section. This new method is not only useful as a solution technique but also as an insightful methodology to understand how dynamic programming works.

Compared to the method of Calculus of Variations, Optimal Control theory is a more modern and flexible approach that requires less stringent differentiability conditions and can handle corner solutions. In fact, calculus of variations problems can be reformulated as optimal control problems, as we show later in this section.

The first, and most fundamental, step in the derivation of these new solution techniques is the notion of a System Equation.

System Equation (also called equation of motion or system dynamics):
$$\dot{x}(t) = f(t, x(t), u(t)), \qquad 0 \leq t \leq T, \qquad x(0): \text{ given},$$
i.e.,
$$\frac{dx_i(t)}{dt} = f_i(t, x(t), u(t)), \qquad i = 1, \ldots, n,$$
where:
- $x(t) \in \mathbb{R}^n$ is the state vector at time $t$,
- $\dot{x}(t)$ is the derivative of $x(t)$ with respect to $t$,
- $u(t) \in U \subseteq \mathbb{R}^m$ is the control vector at time $t$,
- $T$ is the terminal time.

Assumptions:
- An admissible control trajectory is a piecewise continuous function $u(t) \in U$, $t \in [0, T]$, that does not involve an infinite value of $u(t)$ (i.e., all jumps are of finite size).
- $U$ could be a bounded control set. For instance, $U$ could be a compact set such as $U = [0, 1]$, so that corner (boundary) solutions are admitted. When this feature is combined with jump discontinuities in the control path, an interesting phenomenon called a bang-bang solution may result, where the control alternates between corners.
- An admissible state trajectory $x(t)$ is continuous, but it could have a finite number of corners; i.e., it must be piecewise differentiable. A sharp point on the state trajectory occurs at a time when the control trajectory makes a jump.
- Like admissible control paths, admissible state paths must have a finite $x(t)$ value for every $t \in [0, T]$. See Figure 1.2.1 for an illustration of a control path and the associated state path.
- The control trajectory $\{u(t) \mid t \in [0, T]\}$ uniquely determines $\{x^u(t) \mid t \in [0, T]\}$. We will drop the superscript $u$ from now on, but this dependence should be clear. In a more rigorous treatment, the issue of existence and uniqueness of the solution should be addressed more carefully.

Figure 1.2.1: Control and state paths for a continuous-time optimal control problem under the regular assumptions.
Objective: Find an admissible policy (control trajectory) $\{u(t) \mid t \in [0, T]\}$ and corresponding state trajectory that optimizes a given functional $J$ of the state $x = (x_t : 0 \leq t \leq T)$.

The following are some common formulations for the functional $J$ and the associated optimal control problem.

Lagrange Problem:
$$\min_{u \in U}\ J(x) = \int_0^T g(t, x(t), u(t))\, dt$$
subject to $\dot{x}(t) = f(t, x(t), u(t))$, $x(0) = x_0$ (system dynamics),
$\Psi(x(T)) = 0$ (boundary conditions).

Mayer Problem:
$$\min_{u \in U}\ h(x(T))$$
subject to $\dot{x}(t) = f(t, x(t), u(t))$, $x(t_0) = x_0$ (system dynamics),
$\Psi_k(x(T)) = 0$, $k = 2, \ldots, K$ (boundary conditions).

Bolza Problem:
$$\min_{u \in U}\ h(x(T)) + \int_0^T g(t, x(t), u(t))\, dt$$
subject to $\dot{x}(t) = f(t, x(t), u(t))$, $x(t_0) = x_0$ (system dynamics),
$\Psi_k(x(T)) = 0$, $k = 2, \ldots, K$ (boundary conditions).

The functions $f$, $h$, $g$ and $\Psi$ are normally assumed to be continuously differentiable with respect to $x$; and $f$, $g$ are continuous with respect to $t$ and $u$.
Problem 1.2.1 Show that all three versions of the optimal control problem are equivalent.
Example 1.2.1 (Motion Control) A unit mass moves on a line under the influence of a force $u$. Here, $u$ = force = acceleration. (Recall from physics that force = mass $\times$ acceleration, with mass $= 1$ in this case.)

State: $x(t) = (x_1(t), x_2(t))$, where $x_1$ represents position and $x_2$ represents velocity.

Problem: From a given initial $(x_1(0), x_2(0))$, bring the mass near a given final position-velocity pair $(\bar{x}_1, \bar{x}_2)$ at time $T$, in the sense that it minimizes
$$|x_1(T) - \bar{x}_1|^2 + |x_2(T) - \bar{x}_2|^2,$$
such that $|u(t)| \leq 1$ for all $t \in [0, T]$.

System Equation:
$$\dot{x}_1(t) = x_2(t), \qquad \dot{x}_2(t) = u(t).$$

Costs:
$$h(x(T)) = (x_1(T) - \bar{x}_1)^2 + (x_2(T) - \bar{x}_2)^2, \qquad g(x(t), u(t)) = 0, \quad t \in [0, T].$$
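To make the system-equation mechanics concrete, the following sketch (not part of the original notes) integrates the double-integrator dynamics above with a simple forward-Euler scheme; the piecewise-constant control used here is an arbitrary admissible choice, meant only to illustrate that jumps in $u$ produce corners, but no jumps, in the state path.

```python
import numpy as np

def simulate_motion(u_func, x1_0=0.0, x2_0=0.0, T=1.0, n_steps=1000):
    """Forward-Euler integration of the system equation x1' = x2, x2' = u(t)."""
    dt = T / n_steps
    x1, x2 = x1_0, x2_0
    path = [(0.0, x1, x2)]
    for k in range(n_steps):
        u = u_func(k * dt)
        x1, x2 = x1 + dt * x2, x2 + dt * u   # one Euler step of the dynamics
        path.append(((k + 1) * dt, x1, x2))
    return np.array(path)

# Hypothetical admissible control with a jump at t = T/2: accelerate, then brake.
u_bang = lambda t: 1.0 if t < 0.5 else -1.0
trajectory = simulate_motion(u_bang, T=1.0)
print(trajectory[-1])   # terminal (t, position, velocity)
```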
Example 1.2.2 (Resource Allocation) A producer with production rate $x(t)$ at time $t$ may allocate a portion $u(t) \in [0, 1]$ of her production rate to reinvestment (i.e., to increase the production rate) and $[1 - u(t)]$ to producing a storable good. Assume a terminal cost $h(x(T)) = 0$.

System Equation:
$$\dot{x}(t) = \gamma\, u(t)\, x(t), \quad \text{where } \gamma > 0 \text{ is the reinvestment benefit and } u(t) \in [0, 1].$$

Problem: The producer wants to maximize the total amount of product stored,
$$\max_{u(t) \in [0, 1]} \int_0^T (1 - u(t))\, x(t)\, dt.$$
Assume $x(0)$ is given.
Example 1.2.3 (An application of Calculus of Variations) Find a curve from a given point to a given vertical line that has minimum length. (Intuitively, this should be a straight line.) Figure 1.2.2 illustrates the formulation as an infinite sum of infinitely small hypotenuses: over an interval of length $dt$ the arc length is $\sqrt{(dt)^2 + (\dot{x}(t)\, dt)^2} = \sqrt{1 + (\dot{x}(t))^2}\, dt$.

The problem in terms of calculus of variations is:
$$\min \int_0^T \sqrt{1 + (\dot{x}(t))^2}\, dt \qquad \text{s.t. } x(0) = \alpha.$$

Figure 1.2.2: Problem of finding a curve of minimum length from a given point to a given line, and its formulation as an optimal control problem.

The corresponding optimal control problem is:
$$\min_{u(t)} \int_0^T \sqrt{1 + (u(t))^2}\, dt \qquad \text{s.t. } \dot{x}(t) = u(t), \quad x(0) = \alpha.$$
1.3 Pontryagin Minimum Principle

1.3.1 Weak & Strong Extremals

Let $H[a, b]$ be a subset of piecewise right-continuous functions with left limits (càdlàg). We define on $H[a, b]$ two norms:
$$\text{for } x \in H[a, b], \qquad \|x\| = \sup_{t \in [a, b]} \{|x(t)|\} \qquad \text{and} \qquad \|x\|_1 = \|x\| + \|\dot{x}\|.$$

A set $\{x \in H[a, b] : \|x - x^*\|_1 < \varepsilon\}$ is called a weak neighborhood of $x^*$. A solution $x^*$ is called a weak solution if $J(x^*) \leq J(x)$ for all $x$ in a weak neighborhood containing $x^*$.

A set $\{x \in H[a, b] : \|x - x^*\| < \varepsilon\}$ is called a strong neighborhood of $x^*$. A solution $x^*$ is called a strong solution if $J(x^*) \leq J(x)$ for all $x$ in a strong neighborhood containing $x^*$.

Example 1.3.1
$$\min_x J(x) = \int_{-1}^{1} (x(t) - \mathrm{sign}(t))^2\, dt + \sum_{t \in [-1, 1]} (x(t) - x(t^-))^2,$$
where $x(t^-) = \lim_{\tau \uparrow t} x(\tau)$.
1.3.2 Necessary Conditions

Given a control $u \in U$ with corresponding trajectory $x(t)$, we consider the following family of variations.

For a fixed direction $v \in U$, $\tau \in [0, T]$, and $\varepsilon > 0$ small, we define the strong variation of $u(t)$ in the direction $v$ as the control
$$u_\varepsilon(t) = \begin{cases} v & \text{if } t \in (\tau - \varepsilon, \tau] \\ u(t) & \text{if } t \in [0, T] \setminus (\tau - \varepsilon, \tau]. \end{cases}$$

Figure 1.3.1: An example of strong and weak variations.

Lemma 1.3.1 For a real variable $\varepsilon$, let $x_\varepsilon(t)$ be the solution of $\dot{x}_\varepsilon(t) = f(t, x_\varepsilon(t), u(t))$ on $[0, T]$ with initial condition
$$x_\varepsilon(0) = x(0) + \varepsilon y + o(\varepsilon).$$
Then,
$$x_\varepsilon(t) = x(t) + \varepsilon \eta(t) + o(t, \varepsilon),$$
where $\eta(t)$ is the solution of
$$\dot{\eta}(t) = f_x(t, x(t), u(t))\, \eta(t), \quad t \in [0, T], \qquad \text{and} \qquad \eta(0) = y.$$

Lemma 1.3.2 If $x_\varepsilon$ are solutions to $\dot{x}_\varepsilon(t) = f(t, x_\varepsilon(t), u_\varepsilon(t))$ with the same initial condition $x_\varepsilon(0) = x_0$, then
$$x_\varepsilon(t) = x(t) + \varepsilon \eta(t) + o(t, \varepsilon),$$
where $\eta(t)$ solves
$$\eta(t) = \begin{cases} 0 & \text{if } 0 \leq t < \tau \\ f(\tau, x(\tau), v) - f(\tau, x(\tau), u(\tau)) + \displaystyle\int_\tau^t f_x(s, x(s), u(s))\, \eta(s)\, ds & \text{if } \tau \leq t \leq T. \end{cases}$$
Theorem 1.3.1 (Pontryagin Principle for Free Terminal Conditions)

- Mayer's formulation: Let $P(t)$ be the solution of
$$\dot{P}(t) = -P(t)\, f_x(t, x(t), u(t)), \qquad P(t_1) = h_x(x(T)).$$
A necessary condition for optimality of a control $u$ is that
$$P(t) \cdot \big[f(t, x(t), v) - f(t, x(t), u(t))\big] \geq 0$$
for each $v \in U$ and $t \in (0, T]$.

- Lagrange's formulation: We define the Hamiltonian $H$ as
$$H(t, x, u) := P(t)\, f(t, x, u) - L(t, x, u),$$
where $P(t)$ solves
$$\dot{P}(t) = -\frac{\partial}{\partial x} H(t, x, u)$$
with boundary condition $P(T) = 0$. A necessary condition for a control $u$ to be optimal is
$$H(t, x(t), v) - H(t, x(t), u(t)) \leq 0 \qquad \text{for all } v \in U,\ t \in [0, T].$$

Theorem 1.3.2 (Pontryagin Principle with Terminal Conditions)

- Mayer's formulation: Let $P(t)$ be the solution of
$$\dot{P}(t) = -P(t)\, f_x(t, x(t), u(t)), \qquad P(t_1) = \lambda^\top \Psi_x(T, x(T)).$$
A necessary condition for optimality of a control $u \in U$ is that there exists $\lambda$, a nonzero $k$-dimensional vector with $\lambda_1 \geq 0$, such that
$$P(t)^\top \big[f(t, x(t), v) - f(t, x(t), u(t))\big] \geq 0$$
and
$$P(T)^\top f(T, x(T), u(T)) = -\lambda^\top \Psi_t(T, x(T)).$$
Problem 1.3.1 Solve
$$\min_u \int_0^T (u(t) - 1)\, x(t)\, dt,$$
subject to $\dot{x}(t) = u(t)\, x(t)$, $x(0) = x_0 > 0$, and $0 \leq u(t) \leq 1$ for all $t \in [0, T]$.
1.4 Deterministic Dynamic Programming

1.4.1 Value Function

Consider the following optimal control problem in Mayer's form:
$$V(t_0, x_0) = \inf_{u \in \mathcal{U}} J(t_1, x(t_1)) \tag{1.4.1}$$
$$\text{subject to } \dot{x}(t) = f(t, x(t), u(t)), \quad x(t_0) = x_0 \quad \text{(state dynamics)} \tag{1.4.2}$$
$$(t_1, x(t_1)) \in M \quad \text{(boundary conditions)}. \tag{1.4.3}$$

The terminal set $M$ is a closed subset of $\mathbb{R}^{n+1}$. The admissible control set $\mathcal{U}$ is assumed to be the set of piecewise continuous functions on $[t_0, t_1]$. The performance function $J$ is assumed to be $C^1$.

The function $V(\cdot, \cdot)$ is called the value function and we shall use the convention $V(t_0, x_0) = \infty$ if the control problem above admits no feasible solution. We will denote by $\mathcal{U}(x_0, t_0)$ the set of feasible controls with initial condition $(x_0, t_0)$, that is, the set of controls $u$ such that the corresponding trajectory $x$ satisfies $(t_1, x(t_1)) \in M$.

REMARK 1.4.1 For notational convenience, in this section the time horizon is denoted by the interval $[t_0, t_1]$ instead of $[0, T]$.
Proposition 1.4.1 Let $u(t) \in \mathcal{U}(x_0, t_0)$ be a feasible control and $x(t)$ the corresponding trajectory. Then, for any $t_0 \leq \tau_1 \leq \tau_2 \leq t_1$, $V(\tau_1, x(\tau_1)) \leq V(\tau_2, x(\tau_2))$. That is, the value function is a nondecreasing function along any feasible trajectory.

Proof:

Corollary 1.4.1 The value function evaluated along any optimal trajectory is constant.

Proof: Let $u^*$ be an optimal control with corresponding trajectory $x^*$. Then $V(t_0, x_0) = J(t_1, x^*(t_1))$. In addition, for any $t \in [t_0, t_1]$, $u^*$ is a feasible control starting at $(t, x^*(t))$ and so $V(t, x^*(t)) \leq J(t_1, x^*(t_1))$. Finally, by Proposition 1.4.1, $V(t_0, x_0) \leq V(t, x^*(t))$, so we conclude $V(t, x^*(t)) = V(t_0, x_0)$ for all $t \in [t_0, t_1]$. $\square$

According to the previous results, a necessary condition for optimality is that the value function is constant along the optimal trajectory. The following result provides a sufficient condition.
Theorem 1.4.1 Let $W(s, y)$ be an extended real-valued function defined on $\mathbb{R}^{n+1}$ such that $W(s, y) = J(s, y)$ for all $(s, y) \in M$. Given an initial condition $(t_0, x_0)$, suppose that for any feasible trajectory $x(t)$, the function $W(t, x(t))$ is finite and nondecreasing on $[t_0, t_1]$. If $u^*$ is a feasible control with corresponding trajectory $x^*$ such that $W(t, x^*(t))$ is constant, then $u^*$ is optimal and $V(t_0, x_0) = W(t_0, x_0)$.

Proof: For any feasible trajectory $x$, $W(t_0, x_0) \leq W(t_1, x(t_1)) = J(t_1, x(t_1))$. On the other hand, for $x^*$, $W(t_0, x_0) = W(t_1, x^*(t_1)) = J(t_1, x^*(t_1))$. $\square$

Corollary 1.4.2 Let $u^*$ be an optimal control with corresponding feasible trajectory $x^*$. Then the restriction of $u^*$ to $[t, t_1]$ is optimal for the control problem with initial condition $(t, x^*(t))$.
In many applications, the control problem is given in its Lagrange form:
$$V(t_0, x_0) = \inf_{u \in \mathcal{U}(x_0, t_0)} \int_{t_0}^{t_1} L(t, x(t), u(t))\, dt \tag{1.4.4}$$
$$\text{subject to } \dot{x}(t) = f(t, x(t), u(t)), \quad x(t_0) = x_0. \tag{1.4.5}$$
In this case, the following result is the analogue of Proposition 1.4.1.

Theorem 1.4.2 (Bellman's Principle of Optimality) Consider an optimal control problem in Lagrange form. For any $u \in \mathcal{U}(s, y)$ and its corresponding trajectory $x$,
$$V(s, y) \leq \int_s^\tau L(t, x(t), u(t))\, dt + V(\tau, x(\tau)).$$

Proof: Given $u \in \mathcal{U}(s, y)$, let $\bar{u} \in \mathcal{U}(\tau, x(\tau))$ be arbitrary, with corresponding trajectory $\bar{x}$ on $[\tau, t_1]$. Define
$$\hat{u}(t) = \begin{cases} u(t) & s \leq t \leq \tau \\ \bar{u}(t) & \tau \leq t \leq t_1. \end{cases}$$
Thus, $\hat{u} \in \mathcal{U}(s, y)$, so that
$$V(s, y) \leq \int_s^{t_1} L(t, \hat{x}(t), \hat{u}(t))\, dt = \int_s^\tau L(t, x(t), u(t))\, dt + \int_\tau^{t_1} L(t, \bar{x}(t), \bar{u}(t))\, dt. \tag{1.4.6}$$
Since the inequality holds for any $\bar{u} \in \mathcal{U}(\tau, x(\tau))$, we conclude
$$V(s, y) \leq \int_s^\tau L(t, x(t), u(t))\, dt + V(\tau, x(\tau)). \qquad \square$$

Although the conditions given by Theorem 1.4.1 are sufficient, they do not provide a concrete way to construct an optimal solution. In the next section, we will provide a direct method to compute the value function.
1.4.2 DP's Partial Differential Equations

Define $Q_0$, the reachable set, as
$$Q_0 = \{(s, y) \in \mathbb{R}^{n+1} : \mathcal{U}(s, y) \neq \emptyset\}.$$
This set defines the collection of initial conditions for which the optimal control problem is feasible.

Theorem 1.4.3 Let $(s, y)$ be any interior point of $Q_0$ at which $V(s, y)$ is differentiable. Then $V(s, y)$ satisfies
$$V_s(s, y) + V_y(s, y) \cdot f(s, y, v) \geq 0 \qquad \text{for all } v \in U.$$
If there is an optimal $u^* \in \mathcal{U}(s, y)$, then the PDE
$$\min_{v \in U} \{V_s(s, y) + V_y(s, y) \cdot f(s, y, v)\} = 0$$
is satisfied and the minimum is achieved by the right limit $u^*(s)^+$ of the optimal control at $s$.

Proof: Pick any $v \in U$ and let $x_v(t)$ be the corresponding trajectory for $s \leq t \leq s + \delta$, $\delta > 0$ small. Given the initial condition $(s, y)$, we define the feasible control $u_\delta$ as follows:
$$u_\delta(t) = \begin{cases} v & s \leq t \leq s + \delta \\ \bar{u}(t) & s + \delta < t \leq t_1, \end{cases}$$
where $\bar{u} \in \mathcal{U}(s + \delta, x_v(s + \delta))$ is arbitrary. Note that for small $\delta$, $(s + \delta, x_v(s + \delta)) \in Q_0$ and so $u_\delta \in \mathcal{U}(s, y)$. We denote by $x_\delta(t)$ the corresponding trajectory. By Proposition 1.4.1, $V(t, x_\delta(t))$ is nondecreasing, hence
$$D^+ V(t, x_\delta(t)) := \lim_{h \downarrow 0} \frac{V(t + h, x_\delta(t + h)) - V(t, x_\delta(t))}{h} \geq 0$$
for any $t$ at which the limit exists, in particular $t = s$. Thus, from the chain rule we get
$$D^+ V(s, x_\delta(s)) = V_s(s, y) + V_y(s, y) \cdot D^+ x_\delta(s) = V_s(s, y) + V_y(s, y) \cdot f(s, y, v).$$
The equalities use the identity $x_\delta(s) = y$ and the system dynamics equation $D^+ x_\delta(t) = f(t, x_\delta(t), u_\delta(t)^+)$.

If $u^* \in \mathcal{U}(s, y)$ is an optimal control with trajectory $x^*$, then Corollary 1.4.1 implies $V(t, x^*(t)) = J(t_1, x^*(t_1))$ for all $t \in [s, t_1]$, so differentiating (from the right) this equality at $t = s$ we conclude
$$V_s(s, y) + V_y(s, y) \cdot f(s, y, u^*(s)^+) = 0. \qquad \square$$
Corollary 1.4.3 (Hamilton-Jacobi-Bellman equation (HJB)) For a control problem given in Lagrange form (1.4.4)-(1.4.5), the value function at a point $(s, y) \in \mathrm{int}(Q_0)$ satisfies
$$V_s(s, y) + V_y(s, y) \cdot f(s, y, v) + L(s, y, v) \geq 0 \qquad \text{for all } v \in U.$$
If there exists an optimal control $u^*$, then the PDE
$$\min_{v \in U} \{V_s(s, y) + V_y(s, y) \cdot f(s, y, v) + L(s, y, v)\} = 0$$
is satisfied and the minimum is achieved by the right limit $u^*(s)^+$ of the optimal control at $s$.
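The HJB equation can be read as the continuous-time limit of the familiar backward recursion of discrete dynamic programming. As a rough numerical illustration (a discretization sketch added here, not a method developed in these notes), one can approximate the value function of a Lagrange problem by discretizing time and state and applying $V(t, x) \approx \min_{v \in U} \{L(t, x, v)\,\Delta t + V(t + \Delta t,\, x + f(t, x, v)\,\Delta t)\}$. The instance below, $\min \int_0^1 (x^2 + u^2)\,dt$ with $\dot{x} = u$ and $|u| \le 1$, is a hypothetical test case.

```python
import numpy as np

# Grids for time, state, and control (all choices here are illustrative).
T, nt = 1.0, 200
xs = np.linspace(-2.0, 2.0, 401)        # state grid
us = np.linspace(-1.0, 1.0, 41)         # admissible controls, |u| <= 1
dt = T / nt

def running_cost(x, u):
    return x ** 2 + u ** 2              # L(t, x, u) = x^2 + u^2

V = np.zeros_like(xs)                   # terminal cost h(x(T)) = 0
for _ in range(nt):                     # backward induction over time steps
    best = np.full_like(xs, np.inf)
    for u in us:
        x_next = np.clip(xs + u * dt, xs[0], xs[-1])   # Euler step, clipped to the grid
        cont = np.interp(x_next, xs, V)                # interpolated cost-to-go
        best = np.minimum(best, running_cost(xs, u) * dt + cont)
    V = best

print("approximate V(0, x = 1):", V[np.searchsorted(xs, 1.0)])
```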
In many applications, instead of solving the HJB equation, a candidate for the value function is identified, say by inspection. It is important to be able to decide whether or not the proposed solution is in fact optimal.

Theorem 1.4.4 (Verification Theorem) Let $W(s, y)$ be a $C^1$ solution to the partial differential equation
$$\min_{v \in U} \{W_s(s, y) + W_y(s, y) \cdot f(s, y, v)\} = 0$$
with boundary condition $W(s, y) = J(s, y)$ for all $(s, y) \in M$. Let $(t_0, x_0) \in Q_0$, $u \in \mathcal{U}(t_0, x_0)$ and $x$ the corresponding trajectory. Then, $W(t, x(t))$ is nondecreasing in $t$. If $u^*$ is a control in $\mathcal{U}(t_0, x_0)$ defined on $[t_0, t_1^*]$ with corresponding trajectory $x^*$ such that for any $t \in [t_0, t_1^*]$
$$W_s(t, x^*(t)) + W_y(t, x^*(t)) \cdot f(t, x^*(t), u^*(t)) = 0,$$
then $u^*$ is an optimal control in $\mathcal{U}(t_0, x_0)$ and $V(t_0, x_0) = W(t_0, x_0)$.
Example 1.4.1
$$\min_{\|u\| \leq 1} J(t_0, x_0, u) = \tfrac{1}{2}\, (x(\tau))^2 \qquad \text{subject to } \dot{x}(t) = u(t), \quad x(t_0) = x_0,$$
where $\|u\| = \max_{0 \leq t \leq \tau} \{|u(t)|\}$. The HJB equation is $\min_{|u| \leq 1} \{V_t(t, x) + V_x(t, x)\, u\} = 0$ with boundary condition $V(\tau, x) = \tfrac{1}{2} x^2$. We can solve this problem by inspection. Since the only cost is associated with the terminal state $x(\tau)$, an optimal control will try to make $x(\tau)$ as close to zero as possible, i.e.,
$$u^*(t, x) = -\mathrm{sgn}(x) = \begin{cases} 1 & x < 0 \\ 0 & x = 0 \\ -1 & x > 0 \end{cases} \qquad \text{(bang-bang policy)}.$$
We should now verify that $u^*$ is in fact an optimal control. Let $J^*(t, x) = J(t, x, u^*)$. Then, it is not hard to show that
$$J^*(t, x) = \tfrac{1}{2}\, \big(\max\{0\,;\, |x| - (\tau - t)\}\big)^2,$$
which satisfies the boundary condition $J^*(\tau, x) = \tfrac{1}{2} x^2$. In addition,
$$J^*_t(t, x) = (|x| - (\tau - t))^+ \qquad \text{and} \qquad J^*_x(t, x) = \mathrm{sgn}(x)\, (|x| - (\tau - t))^+.$$
Therefore, for any $u$ such that $|u| \leq 1$ it follows that
$$J^*_t(t, x) + J^*_x(t, x)\, u = (1 + \mathrm{sgn}(x)\, u)\, (|x| - (\tau - t))^+ \geq 0,$$
with equality holding for $u = u^*(t, x)$. Thus, $J^*(t, x)$ is the value function and $u^*$ is optimal.
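A quick numerical sanity check of this example (a sketch added here, with the terminal time $\tau = 1$ chosen arbitrarily): simulate $\dot{x} = u^*(t, x) = -\mathrm{sgn}(x)$ from a few initial states and compare the realized terminal cost with the closed form $J^*(0, x_0) = \tfrac{1}{2}\big(\max\{0, |x_0| - \tau\}\big)^2$.

```python
import numpy as np

def simulated_cost(x0, tau=1.0, n_steps=10_000):
    """Simulate dx/dt = -sign(x) on [0, tau] and return the terminal cost 0.5 * x(tau)^2."""
    dt = tau / n_steps
    x = x0
    for _ in range(n_steps):
        x -= dt * np.sign(x)          # bang-bang feedback u* = -sgn(x)
    return 0.5 * x ** 2

for x0 in [-2.5, -0.4, 0.0, 0.7, 1.8]:
    closed_form = 0.5 * max(0.0, abs(x0) - 1.0) ** 2
    print(f"x0 = {x0:5.2f}: simulated {simulated_cost(x0):.6f}, closed form {closed_form:.6f}")
```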
1.4.3 Feedback Control

In the previous example, the notion of a feedback control policy was introduced. Specifically, a feedback control $u$ is a mapping from $\mathbb{R}^{n+1}$ to $U$ such that $u = u(t, x)$ and the system dynamics $\dot{x} = f(t, x, u(t, x))$ has a unique solution for each initial condition $(s, y) \in Q_0$. Given a feedback control $u$ and an initial condition $(s, y)$, we can define the trajectory $x(t; s, y)$ as the solution to
$$\dot{x} = f(t, x, u(t, x)), \qquad x(s) = y.$$
The corresponding control policy is $u(t) = u(t, x(t; s, y))$.

A feedback control $u^*$ is an optimal feedback control if for any $(s, y) \in Q_0$ the control $u(t) = u^*(t, x(t; s, y))$ solves the optimization problem (1.4.1)-(1.4.3) with initial condition $(s, y)$.

Theorem 1.4.5 If there is an optimal feedback control $u^*$, and $t_1(s, y)$ and $x(t_1; s, y)$ are the terminal time and terminal state for the trajectory
$$\dot{x} = f(t, x, u^*(t, x)), \qquad x(s) = y,$$
then the value function $V(s, y)$ is differentiable at each point at which $t_1(s, y)$ and $x(t_1; s, y)$ are differentiable with respect to $(s, y)$.

Proof: From the optimality of $u^*$ we have that
$$V(s, y) = J(t_1(s, y), x(t_1(s, y); s, y)).$$
The result follows from this identity and the fact that $J$ is $C^1$. $\square$
1.4.4 The Linear-Quadratic Problem

Consider the following optimal control problem:
$$\min \ x(T)^\top Q_T\, x(T) + \int_0^T \big[ x(t)^\top Q\, x(t) + u(t)^\top R\, u(t) \big]\, dt \tag{1.4.7}$$
$$\text{subject to } \dot{x}(t) = A\, x(t) + B\, u(t), \tag{1.4.8}$$
where the $n \times n$ matrices $Q_T$ and $Q$ are symmetric positive semidefinite and the $m \times m$ matrix $R$ is symmetric positive definite. The HJB equation for this problem is given by
$$\min_{u \in \mathbb{R}^m} \big\{ V_t(t, x) + V_x(t, x)^\top (Ax + Bu) + x^\top Q x + u^\top R u \big\} = 0$$
with boundary condition $V(T, x) = x^\top Q_T\, x$.

We guess a quadratic solution for the HJB equation. That is, we suppose that $V(t, x) = x^\top K(t)\, x$ for an $n \times n$ symmetric matrix $K(t)$. If this is the case, then
$$V_t(t, x) = x^\top \dot{K}(t)\, x \qquad \text{and} \qquad V_x(t, x) = 2 K(t)\, x.$$
Plugging these derivatives back into the HJB equation we get
$$\min_{u \in \mathbb{R}^m} \big\{ x^\top \dot{K}(t)\, x + 2 x^\top K(t) A x + 2 x^\top K(t) B u + x^\top Q x + u^\top R u \big\} = 0. \tag{1.4.9}$$
Thus, the optimal control satisfies
$$2 B^\top K(t)\, x + 2 R u = 0 \quad \Longrightarrow \quad u^* = -R^{-1} B^\top K(t)\, x.$$
Substituting the value of $u^*$ in equation (1.4.9) we obtain the condition
$$x^\top \big[ \dot{K}(t) + K(t) A + A^\top K(t) - K(t) B R^{-1} B^\top K(t) + Q \big]\, x = 0 \qquad \text{for all } (t, x).$$
Therefore, for this to hold, the matrix $K(t)$ must satisfy the continuous-time Riccati equation in matrix form:
$$\dot{K}(t) = -K(t) A - A^\top K(t) + K(t) B R^{-1} B^\top K(t) - Q, \qquad \text{with boundary condition } K(T) = Q_T. \tag{1.4.10}$$
Reversing the argument, it can be shown that if $K(t)$ solves (1.4.10) then $W(t, x) = x^\top K(t)\, x$ is a solution of the HJB equation, and so by the verification theorem we conclude that it is equal to the value function. In addition, the optimal feedback control is $u^*(t, x) = -R^{-1} B^\top K(t)\, x$.
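The Riccati equation (1.4.10) is a matrix-valued ODE that is typically integrated backward from $K(T) = Q_T$. The sketch below (an illustration with arbitrary problem data, not an example from the notes) does this with a simple fixed-step Euler scheme and recovers the feedback gain in $u^*(t, x) = -R^{-1} B^\top K(t)\, x$.

```python
import numpy as np

def riccati_backward(A, B, Q, R, QT, T, n_steps=2000):
    """Integrate dK/dt = -K A - A'K + K B R^{-1} B' K - Q backward from K(T) = QT."""
    dt = T / n_steps
    Rinv = np.linalg.inv(R)
    K = QT.copy()
    for _ in range(n_steps):
        dK = -K @ A - A.T @ K + K @ B @ Rinv @ B.T @ K - Q
        K = K - dt * dK                 # one Euler step backward in time
    return K                            # approximation of K(0)

# Illustrative data (a double integrator with scalar control; not from the notes).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
QT = np.eye(2)
R = np.array([[1.0]])

K0 = riccati_backward(A, B, Q, R, QT, T=5.0)
gain0 = -np.linalg.inv(R) @ B.T @ K0    # feedback gain at t = 0: u*(0, x) = gain0 @ x
print("K(0) =\n", K0)
print("u*(0, x) = gain0 @ x, with gain0 =", gain0)
```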
1.5 Extensions

1.5.1 The Method of Characteristics for First-Order PDEs

First-Order Homogeneous Case

Consider the following first-order homogeneous PDE:
$$u_t(t, x) + a(t, x)\, u_x(t, x) = 0, \qquad x \in \mathbb{R},\ t > 0,$$
with boundary condition $u(0, x) = \phi(x)$ for all $x \in \mathbb{R}$. We assume that $a$ and $\phi$ are smooth enough functions. A PDE problem in this form is referred to as a Cauchy problem.

We will investigate the solution to this problem using the method of characteristics. The characteristics of this PDE are curves in the $x$-$t$ plane defined by
$$\dot{x}(t) = a(t, x(t)), \qquad x(0) = x_0. \tag{1.5.1}$$
Let $x = x(t)$ be a solution with $x(0) = x_0$. Let $u$ be a solution to the PDE; we want to study the evolution of $u$ along $x(t)$:
$$\frac{d}{dt} u(t, x(t)) = u_t(t, x(t)) + u_x(t, x(t))\, \dot{x}(t) = u_t(t, x(t)) + u_x(t, x(t))\, a(t, x(t)) = 0.$$
So, $u(t, x)$ is constant along the characteristic curve $x(t)$, that is,
$$u(t, x(t)) = u(0, x(0)) = \phi(x_0), \qquad t > 0. \tag{1.5.2}$$
Thus, if we are able to solve the ODE (1.5.1), then we are able to find the solution to the original PDE.

Example 1.5.1 Consider the Cauchy problem
$$u_t + x\, u_x = 0, \qquad x \in \mathbb{R},\ t > 0,$$
$$u(0, x) = \phi(x), \qquad x \in \mathbb{R}.$$
The characteristic curves are defined by
$$\dot{x}(t) = x(t), \qquad x(0) = x_0,$$
so $x(t) = x_0 \exp(t)$. For a given $(t, x)$ the characteristic passing through this point has initial condition $x_0 = x \exp(-t)$. Since $u(t, x(t)) = \phi(x_0)$, we conclude that $u(t, x) = \phi(x \exp(-t))$.
First-Order Non-Homogeneous Case

Consider the following nonhomogeneous problem:
$$u_t(t, x) + a(t, x)\, u_x(t, x) = b(t, x), \qquad x \in \mathbb{R},\ t > 0,$$
$$u(0, x) = \phi(x), \qquad x \in \mathbb{R}.$$
Again, the characteristic curves are given by
$$\dot{x}(t) = a(t, x(t)), \qquad x(0) = x_0. \tag{1.5.3}$$
Thus, for a solution $u(t, x)$ of the PDE along a characteristic curve $x(t)$ we have that
$$\frac{d}{dt} u(t, x(t)) = u_t(t, x(t)) + u_x(t, x(t))\, \dot{x}(t) = u_t(t, x(t)) + u_x(t, x(t))\, a(t, x(t)) = b(t, x(t)).$$
Hence, the solution to the PDE is given by
$$u(t, x(t)) = \phi(x_0) + \int_0^t b(\tau, x(\tau))\, d\tau$$
along the characteristic $(t, x(t))$.

Example 1.5.2 Consider the Cauchy problem
$$u_t + u_x = x, \qquad x \in \mathbb{R},\ t > 0,$$
$$u(0, x) = \phi(x), \qquad x \in \mathbb{R}.$$
The characteristic curves are defined by
$$\dot{x}(t) = 1, \qquad x(0) = x_0,$$
so $x(t) = x_0 + t$. For a given $(t, x)$ the characteristic passing through this point has initial condition $x_0 = x - t$. In addition, along a characteristic $x(t) = x_0 + t$ starting at $x_0$, we have
$$u(t, x(t)) = \phi(x_0) + \int_0^t x(\tau)\, d\tau = \phi(x_0) + x_0\, t + \tfrac{1}{2} t^2.$$
Thus, the solution to the PDE is given by
$$u(t, x) = \phi(x - t) + \Big( x - \frac{t}{2} \Big)\, t.$$
Applications to Optimal Control
Given that the partial differential equation of dynamic programming is a first-order PDE, we can try to apply the method of characteristics to find the value function. In general, the HJB equation is not a standard first-order PDE because of the optimization that takes place, so we cannot simply solve a first-order PDE to obtain the value function of dynamic programming. Nevertheless, in some situations it is possible to obtain good results, as the following example shows.
Example 1.5.3 (Method of Characteristics) Consider the optimal control problem
\[ \min_{\|u\| \le 1}\ J(t_0, x_0, u) = \frac{1}{2}\, (x(\tau))^2 \qquad \text{subject to } \dot x(t) = u(t),\ x(t_0) = x_0, \]
where $\|u\| = \max_{0 \le t \le \tau} |u(t)|$.
A candidate value function $W(t,x)$ should satisfy the HJB equation
\[ \min_{|u| \le 1} \{ W_t(t,x) + W_x(t,x)\, u \} = 0, \]
with boundary condition $W(\tau, x) = \frac{1}{2} x^2$.
For a given $u \in \mathcal{U}$, let us solve the PDE
\[ W_t(t,x;u) + W_x(t,x;u)\, u = 0, \qquad W(\tau, x; u) = \frac{1}{2} x^2. \tag{1.5.4} \]
A characteristic curve $x(t)$ is found solving
\[ \dot x(t) = u, \qquad x(0) = x_0, \]
so $x(t) = x_0 + ut$. Since the solution to the PDE is constant along the characteristic curve, we have
\[ W(t, x(t); u) = W(\tau, x(\tau); u) = \frac{1}{2}(x(\tau))^2 = \frac{1}{2}(x_0 + u\tau)^2. \]
The characteristic passing through the point $(t,x)$ has initial condition $x_0 = x - ut$, so the general solution to the PDE (1.5.4) is
\[ W(t,x;u) = \frac{1}{2}\left( x + (\tau - t)\, u \right)^2. \]
Since our objective is to minimize the terminal cost, we can identify a policy by minimizing $W(t,x;u)$ over $u$ above. It is straightforward to see that the optimal control (in feedback form) satisfies
\[ u^*(t,x) = \begin{cases} -1 & \text{if } x > \tau - t, \\[4pt] -\dfrac{x}{\tau - t} & \text{if } |x| \le \tau - t, \\[4pt] +1 & \text{if } x < -(\tau - t). \end{cases} \]
The corresponding candidate value function $W^*(t,x) = W(t,x; u^*(t,x))$ satisfies
\[ W^*(t,x) = \frac{1}{2}\left( \max\{\, 0\, ;\, |x| - (\tau - t)\, \} \right)^2, \]
which we already know satisfies the HJB equation.
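A small numerical sketch of this construction follows: it evaluates the static value function $W(t,x;u)$, minimizes it over the admissible set $[-1,1]$ by a simple grid search, and compares the result with the closed-form value function. The horizon $\tau = 1$ is an assumption made for the example.

```python
import numpy as np

tau = 1.0  # assumed terminal time for the sketch

def W(t, x, u):
    """Static value function W(t, x; u) = 0.5 * (x + (tau - t) * u)^2."""
    return 0.5 * (x + (tau - t) * u) ** 2

def myopic_u(t, x):
    """Minimize W(t, x; u) over the admissible controls [-1, 1] (grid search for simplicity)."""
    grid = np.linspace(-1.0, 1.0, 2001)
    return grid[np.argmin(W(t, x, grid))]

def value_closed_form(t, x):
    """Value function 0.5 * max(0, |x| - (tau - t))^2 derived in the text."""
    return 0.5 * max(0.0, abs(x) - (tau - t)) ** 2

for (t, x) in [(0.0, 2.0), (0.5, 0.3), (0.9, -1.5)]:
    u = myopic_u(t, x)
    print(f"t={t}, x={x}: u*={u:+.3f}, W(t,x;u*)={W(t, x, u):.4f}, closed form={value_closed_form(t, x):.4f}")
```

In all three test points the minimized static value coincides with the closed-form expression, as the example asserts.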
1.5.2 Optimal Control and Myopic Solution
Consider the following deterministic control problem in Bolza form:
\[ \min_{u \in \mathcal{U}}\ J(x(T)) + \int_0^T L(x_t, u_t)\, dt \qquad \text{subject to } \dot x(t) = f(x_t, u_t),\ x(0) = x_0. \]
The functions $f$, $J$, and $L$ are assumed to be sufficiently smooth.
The solution to this problem can be found by solving the associated Hamilton-Jacobi-Bellman equation
\[ V_t(t,x) + \min_{u \in \mathcal{U}} \{ f(x,u)\, V_x(t,x) + L(x,u) \} = 0, \]
with boundary condition $V(T,x) = J(x)$. The value function $V(t,x)$ represents the optimal cost-to-go starting at time $t$ in state $x$.
Suppose we fix the control $u \in \mathcal{U}$ and solve the first-order PDE
\[ W_t(t,x;u) + f(x,u)\, W_x(t,x;u) + L(x,u) = 0, \qquad W(T,x;u) = J(x), \tag{1.5.5} \]
using the method of characteristics. That is, we solve the characteristic ODE $\dot x(t) = f(x,u)$ and let $x(t) = H(t; s, y, u)$ be the solution passing through the point $(s,y)$, i.e., $x(s) = H(s; s, y, u) = y$.
By construction, along a characteristic curve $(t, x(t))$ the function $W(t, x(t); u)$ satisfies $\frac{d}{dt} W(t, x(t); u) + L(x(t), u) = 0$. Therefore, after integration we have that
\[ W(s, x(s); u) = W(T, x(T); u) + \int_s^T L(x(t), u)\, dt = J(x(T)) + \int_s^T L(x(t), u)\, dt, \]
where the second equality follows from the boundary condition for $W$. We can rewrite this last identity for the particular characteristic curve passing through $(t,x)$ as follows:
\[ W(t,x;u) = J(H(T; t, x, u)) + \int_t^T L(H(s; t, x, u), u)\, ds. \]
Since the control $u$ has been fixed so far, we call $W(t,x;u)$ the static value function associated to control $u$. Now, if we view $W(t,x;u)$ as a function of $u$, we can minimize this static value function. We define
\[ u^*(t,x) = \arg\min_{u \in \mathcal{U}} W(t,x;u) \qquad \text{and} \qquad \mathcal{V}(t,x) = W(t,x; u^*(t,x)). \]
Proposition 1.5.1 Suppose that $u^*(t,x)$ is an interior solution and that $W(t,x;u)$ is sufficiently smooth so that $u^*(t,x)$ satisfies
\[ \left. \frac{dW(t,x;u)}{du} \right|_{u = u^*(t,x)} = 0. \tag{1.5.6} \]
Then the function $\mathcal{V}(t,x)$ satisfies the PDE
\[ \mathcal{V}_t(t,x) + f(x, u^*(t,x))\, \mathcal{V}_x(t,x) + L(x, u^*(t,x)) = 0 \tag{1.5.7} \]
with boundary condition $\mathcal{V}(T,x) = J(x)$.
Proof: Let us rewrite the left-hand side of (1.5.7) in terms of $W(t,x,u^*)$ to get
\[ \mathcal{V}_t + f(x,u^*)\, \mathcal{V}_x + L(x,u^*) = \underbrace{\frac{\partial W(t,x,u^*)}{\partial t} + f(x,u^*)\, \frac{\partial W(t,x,u^*)}{\partial x} + L(x,u^*)}_{(a)} + \underbrace{\left( \frac{\partial u^*}{\partial t} + f(x,u^*)\, \frac{\partial u^*}{\partial x} \right) \frac{\partial W(t,x,u^*)}{\partial u}}_{(b)}. \]
We note that by construction of the function $W$ in equation (1.5.5) the expression denoted by (a) is equal to zero. In addition, the optimality condition (1.5.6) implies that (b) is also equal to zero. Therefore, $\mathcal{V}(t,x)$ satisfies the PDE (1.5.7). The boundary condition follows again from the definition of the value function $W$. $\square$

Given this result, the question that naturally arises is whether $\mathcal{V}(t,x)$ is in fact the value function (that is, $\mathcal{V}(t,x) = V(t,x)$) and $u^*(t,x)$ is the corresponding optimal feedback control.
Unfortunately, this is not generally true. In fact, to prove that $V(t,x) = \mathcal{V}(t,x)$ we would need to show that
\[ u^*(t,x) = \arg\min_{u \in \mathcal{U}} \{ f(x,u)\, V_x(t,x) + L(x,u) \}. \]
Since we have assumed that $u^*(t,x)$ is an interior solution, the first-order optimality condition for the minimization problem above is given by
\[ f_u(x, u^*(t,x))\, V_x(t,x) + L_u(x, u^*(t,x)) = 0. \]
Using the optimality condition (1.5.6) we have that
\[ \mathcal{V}_x(t,x) = W_x(t,x; u^*) = J'(H)\, H_x + \int_t^T L_x(H, u^*)\, H_x\, ds, \]
where $H = H(T; t, x, u^*)$ and $H_x = H_x(T; t, x, u^*)$ is the partial derivative of $H(T; t, x, u^*)$ with respect to $x$ keeping $u^* = u^*(t,x)$ fixed. Thus, the first-order optimality condition that needs to be verified (with $V_x$ replaced by $\mathcal{V}_x$) is
\[ f_u(x, u^*(t,x)) \left[ J'(H)\, H_x + \int_t^T L_x(H, u^*)\, H_x\, ds \right] + L_u(x, u^*(t,x)) = 0. \tag{1.5.8} \]
On the other hand, the optimality condition (1.5.6) that $u^*(t,x)$ satisfies is
\[ J'(H)\, H_u + \int_t^T \left[ L_x(H, u^*)\, H_u + L_u(H, u^*) \right] ds = 0. \tag{1.5.9} \]
It should be clear that condition (1.5.9) does not necessarily imply condition (1.5.8), and so $\mathcal{V}(t,x)$ and $u^*(t,x)$ are not guaranteed to be the value function and the optimal feedback control, respectively. The following example shows the suboptimality of $u^*(t,x)$.
Example 1.5.4 Consider the traditional linear-quadratic control problem
\[ \min_u \left\{ x^2(T) + \int_0^T \left( x^2(t) + u^2(t) \right) dt \right\} \qquad \text{subject to } \dot x(t) = \alpha x(t) + \beta u(t),\ x(0) = x_0. \]
Exact solution to the HJB equation: This problem is traditionally tackled by solving an associated Riccati differential equation. We suppose that the optimal control satisfies
\[ u(t,x) = -\beta\, k(t)\, x, \]
where the function $k(t)$ satisfies the Riccati ODE
\[ \dot k(t) + 2\alpha k(t) = \beta^2 k^2(t) - 1, \qquad k(T) = 1. \]
We can get a particular solution assuming $k(t) = \bar k$ constant. In this case,
\[ \beta^2 \bar k^2 - 2\alpha \bar k - 1 = 0 \quad\Longrightarrow\quad \bar k_{\pm} = \frac{\alpha \pm \sqrt{\alpha^2 + \beta^2}}{\beta^2}. \]
Now, let us define $k(t) = z(t) + \bar k_+$; then the Riccati equation becomes
\[ \dot z(t) - 2\left( \beta^2 \bar k_+ - \alpha \right) z(t) = \beta^2 z^2(t) \quad\Longrightarrow\quad \frac{\dot z(t)}{z^2(t)} - \frac{2(\beta^2 \bar k_+ - \alpha)}{z(t)} = \beta^2. \]
If we set $w(t) = z^{-1}(t)$ then the last ODE is equivalent to
\[ \dot w(t) + 2\left( \beta^2 \bar k_+ - \alpha \right) w(t) = -\beta^2. \]
This is a simple linear differential equation that can be solved using the integrating factor $\exp\!\left( 2(\beta^2 \bar k_+ - \alpha)\, t \right)$, that is,
\[ \frac{d}{dt} \left[ \exp\!\left( 2(\beta^2 \bar k_+ - \alpha)\, t \right) w(t) \right] = -\exp\!\left( 2(\beta^2 \bar k_+ - \alpha)\, t \right) \beta^2. \]
The solution to this ODE is
\[ w(t) = \tilde k\, \exp\!\left( -2(\beta^2 \bar k_+ - \alpha)\, t \right) - \frac{\beta^2}{2(\beta^2 \bar k_+ - \alpha)}, \]
where $\tilde k$ is a constant of integration. Using the fact that $\beta^2 \bar k_+ - \alpha = \sqrt{\alpha^2 + \beta^2}$ and $k(t) = \bar k_+ + 1/w(t)$, we get
\[ k(t) = \frac{\alpha + \sqrt{\alpha^2 + \beta^2}}{\beta^2} + \frac{2\sqrt{\alpha^2 + \beta^2}}{2\tilde k \sqrt{\alpha^2 + \beta^2}\, \exp\!\left( -2\sqrt{\alpha^2 + \beta^2}\, t \right) - \beta^2}. \]
The value of $\tilde k$ is obtained from the boundary condition $k(T) = 1$.
Myopic Solution: If we solve the problem using the myopic approach described at the beginning of this section, we get that the characteristic curve satisfies
\[ \dot x(t) = \alpha x(t) + \beta u \quad\Longrightarrow\quad \ln(\alpha x(t) + \beta u) = \alpha (t + A), \]
with $A$ a constant of integration. The characteristic passing through the point $(t,x)$ satisfies $A = \ln(\alpha x + \beta u)/\alpha - t$ and is given by
\[ x(\tau) = \frac{(\alpha x + \beta u)\, \exp(\alpha(\tau - t)) - \beta u}{\alpha}. \]
The static value function $W(t,x;u)$ is given by
\[ W(t,x;u) = \left( \frac{(\alpha x + \beta u)\, \exp(\alpha(T - t)) - \beta u}{\alpha} \right)^2 + \int_t^T \left[ \left( \frac{(\alpha x + \beta u)\, \exp(\alpha(\tau - t)) - \beta u}{\alpha} \right)^2 + u^2 \right] d\tau. \]
If we compute the derivative of $W(t,x;u)$ with respect to $u$ and set it equal to zero we get, after some manipulations, that the optimal myopic solution is
\[ u^*(t,x) = -\left[ \frac{\alpha\beta \left( 3\exp(2\alpha(T-t)) - 4\exp(\alpha(T-t)) + 1 \right)}{\beta^2 \left( 3\exp(2\alpha(T-t)) - 8\exp(\alpha(T-t)) + 5 \right) + 2\alpha(T-t)(\alpha^2 + \beta^2)} \right] x. \]
Interestingly, this myopic feedback control is also linear in $x$, as is the optimal solution; however, the two feedback laws are clearly different and the myopic one is suboptimal.
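The suboptimality can be checked numerically without relying on the closed-form myopic expression: below is a minimal sketch that (i) solves the scalar Riccati ODE backward to obtain the optimal feedback, (ii) obtains the myopic feedback by directly minimizing the static value function $W(t,x;u)$ with a numerical optimizer, and (iii) simulates both policies and compares realized costs. The parameter values $\alpha = \beta = T = 1$ are assumptions for the illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp, quad
from scipy.optimize import minimize_scalar

alpha, beta, T = 1.0, 1.0, 1.0   # illustrative parameter values (assumptions)

# Optimal feedback: solve the Riccati ODE  k' = beta^2 k^2 - 2 alpha k - 1,  k(T) = 1, backward in time.
ric = solve_ivp(lambda t, k: beta**2 * k**2 - 2 * alpha * k - 1,
                [T, 0.0], [1.0], dense_output=True, rtol=1e-9)
u_opt = lambda t, x: -beta * ric.sol(t)[0] * x

# Myopic feedback: minimize the static value function W(t, x; u) numerically over u.
def W(t, x, u):
    xs = lambda s: ((alpha * x + beta * u) * np.exp(alpha * (s - t)) - beta * u) / alpha
    run, _ = quad(lambda s: xs(s)**2 + u**2, t, T)
    return xs(T)**2 + run

u_myopic = lambda t, x: minimize_scalar(lambda u: W(t, x, u), bounds=(-50, 50), method='bounded').x

# Simulate both feedback policies from x(0) = 1 and compare total realized costs.
def simulate(policy, x0=1.0, n=400):
    dt, x, cost = T / n, x0, 0.0
    for i in range(n):
        u = policy(i * dt, x)
        cost += (x**2 + u**2) * dt
        x += (alpha * x + beta * u) * dt
    return cost + x**2

print("optimal cost:", simulate(u_opt))
print("myopic  cost:", simulate(u_myopic))
```

The myopic cost comes out slightly larger than the optimal cost, consistent with the discussion above.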
The previous example shows that in general the use of a myopic policy produces suboptimal solutions. However, a question remains open: under what conditions is the myopic solution optimal? A general answer can be obtained by identifying restrictions on the problem data under which the optimality condition (1.5.8) is implied by condition (1.5.9).
In what follows we present one specific case for which the optimal solution is given by the myopic solution. Consider the control problem
\[ \min_{u \in \mathcal{U}}\ J(x(T)) + \int_0^T L(u(t))\, dt \qquad \text{subject to } \dot x(t) = f(x(t), u(t)) := g(x(t))\, h(u(t)),\ x(0) = x_0. \]
In this case, it can be shown that the characteristic curve passing through the point $(t,x)$ is given by
\[ x(\tau) = G^{-1}\!\left( h(u)(\tau - t) + G(x) \right), \qquad \text{where } G(x) := \int \frac{dx}{g(x)}. \]
In this case, the static value function is
\[ W(t,x;u) = J\!\left( G^{-1}\!\left( h(u)(T-t) + G(x) \right) \right) + L(u)(T - t), \]
and the myopic solution satisfies $\frac{d}{du} W(t,x;u) = 0$, or equivalently
\begin{align*}
0 &= J'\!\left( G^{-1}(h(u)(T-t) + G(x)) \right) h'(u)\,(T-t)\, (G^{-1})'\!\left( h(u)(T-t) + G(x) \right) + (T-t)\, L'(u), \\
0 &= J'\!\left( G^{-1}(h(u)(T-t) + G(x)) \right) f_u(x,u)\, G'(x)\, (G^{-1})'\!\left( h(u)(T-t) + G(x) \right) + L'(u), \\
0 &= f_u(x,u)\, W_x(t,x;u) + L'(u).
\end{align*}
The second equality uses the identities $G'(x) = 1/g(x)$ and $f_u(x,u) = g(x)\, h'(u)$. Therefore, the optimal myopic policy $u^*(t,x)$ satisfies
\[ 0 = f_u(x, u^*)\, \mathcal{V}_x(t,x) + L'(u^*), \]
i.e., the first-order optimality condition (1.5.8).
Example 1.5.5 Consider the control problem
\[ \min_u \left\{ x^2(T) + \int_0^T u^2(t)\, dt \right\} \qquad \text{subject to } \dot x(t) = x(t)\, u(t),\ x(0) = x_0. \]
In this case, the characteristic passing through $(t,x)$ is given by
\[ x(\tau) = x\, \exp(u(\tau - t)). \]
The static value function is
\[ W(t,x;u) = x^2 \exp(2u(T-t)) + u^2 (T-t). \]
Minimizing $W$ over $u$ implies
\[ x^2 \exp(2u^*(T-t)) + u^* = 0, \]
and the corresponding value function is
\[ V(t,x) = \mathcal{V}(t,x) = u^*(t,x) \left( u^*(t,x)\, (T-t) - 1 \right). \]
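The optimality condition above is transcendental in $u^*$, but it is easy to solve numerically. The sketch below (with an assumed horizon $T = 1$) finds the root with a standard bracketing solver and checks that the value formula agrees with the static value function evaluated at $u^*$.

```python
import numpy as np
from scipy.optimize import brentq

def myopic_control(t, x, T=1.0):
    """Solve the optimality condition x^2 exp(2u(T-t)) + u = 0 for u (the root is negative when x != 0)."""
    s = T - t
    f = lambda u: x**2 * np.exp(2 * u * s) + u
    return brentq(f, -1e6, 0.0) if x != 0 else 0.0

def value(t, x, T=1.0):
    """V(t, x) = u*(t, x) * (u*(t, x) * (T - t) - 1), as derived in the text."""
    u = myopic_control(t, x, T)
    return u * (u * (T - t) - 1)

t, x, T = 0.0, 2.0, 1.0
u = myopic_control(t, x, T)
print("u* =", u)
print("V(t,x) via formula :", value(t, x, T))
print("W(t,x;u*) directly :", x**2 * np.exp(2 * u * (T - t)) + u**2 * (T - t))
```

Both printed values coincide, as expected from $W(t,x;u^*) = -u^* + (u^*)^2(T-t) = u^*(u^*(T-t) - 1)$.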
Connecting the HJB Equation with Pontryagin's Principle
We consider the optimal control problem in Lagrange form. In this case, the HJB equation is given by
\[ \min_{u \in \mathcal{U}} \{ V_t(t,x) + V_x(t,x)\, f(t,x,u) + L(t,x,u) \} = 0, \]
with boundary condition $V(t_1, x(t_1)) = 0$.
Let us define the so-called Hamiltonian
\[ H(t,x,u,\lambda) := \lambda\, f(t,x,u) - L(t,x,u). \]
Thus, the HJB equation implies that the value function satisfies
\[ V_t(t,x) = \max_{u \in \mathcal{U}} H(t, x, u, -V_x(t,x)), \]
and so the optimal control can be found by maximizing the Hamiltonian. Specifically, let $x^*(t)$ be the optimal trajectory and let $P(t) = -V_x(t, x^*(t))$; then the optimal control satisfies the so-called Maximum Principle
\[ H(t, x^*(t), u^*(t), P(t)) \ \ge\ H(t, x^*(t), u, P(t)), \qquad \text{for all } u \in \mathcal{U}. \]
In order to complete the connection with Pontryagin's principle we need to derive the adjoint equations. Let $x^*(t)$ be the optimal trajectory and consider a small perturbation $x(t)$ such that
\[ x(t) = x^*(t) + \delta(t), \qquad \text{where } |\delta(t)| < \epsilon. \]
First, we note that the HJB equation, together with the optimality of $x^*$ and its corresponding control $u^*$, implies that
\[ H(t, x^*(t), u^*(t), -V_x(t, x^*(t))) - V_t(t, x^*(t)) \ \ge\ H(t, x(t), u^*(t), -V_x(t, x(t))) - V_t(t, x(t)). \]
Therefore, the derivative of $H(t, x(t), u^*(t), -V_x(t, x(t))) - V_t(t, x(t))$ with respect to $x$ must be equal to zero at $x^*(t)$. Using the definition of $H$, this condition implies that
\[ -V_{xx}(t, x^*(t))\, f(t, x^*(t), u^*(t)) - V_x(t, x^*(t))\, f_x(t, x^*(t), u^*(t)) - L_x(t, x^*(t), u^*(t)) - V_{xt}(t, x^*(t)) = 0. \]
In addition, using the dynamics of the system we get that
\[ \frac{d}{dt} V_x(t, x^*(t)) = V_{tx}(t, x^*(t)) + V_{xx}(t, x^*(t))\, f(t, x^*(t), u^*(t)), \]
and therefore
\[ \frac{d}{dt} V_x(t, x^*(t)) = -V_x(t, x^*(t))\, f_x(t, x^*(t), u^*(t)) - L_x(t, x^*(t), u^*(t)). \]
Finally, using the definition of $P(t)$ and $H$ we conclude that $P(t)$ satisfies the adjoint condition
\[ \dot P(t) = -\frac{\partial}{\partial x} H(t, x^*(t), u^*(t), P(t)). \]
The boundary condition for $P(t)$ is obtained from the boundary condition of the HJB equation, that is,
\[ P(t_1) = -V_x(t_1, x(t_1)) = 0. \qquad \text{(transversality condition)} \]
Economic Interpretation of the Maximum Principle
Let us again consider the control problem in Lagrange form. In this case the performance measure is
\[ V(t,x) = \min_u \int_t^{t_1} L(s, x(s), u(s))\, ds. \]
The function $L$ corresponds to the instantaneous cost rate. According to our definition of $P(t) = -V_x(t, x(t))$, we can interpret this quantity as the marginal profit associated with a small change in the state variable $x$. The economic interpretation of the Hamiltonian is as follows:
\begin{align*}
H\, dt &= P(t)\, f(t,x,u)\, dt - L(t,x,u)\, dt \\
       &= P(t)\, \dot x(t)\, dt - L(t,x,u)\, dt \\
       &= P(t)\, dx(t) - L(t,x,u)\, dt.
\end{align*}
The term $-L(t,x,u)\, dt$ corresponds to the instantaneous profit made at time $t$ in state $x$ if control $u$ is selected; we can look at this profit as a direct contribution. The term $P(t)\, dx(t)$ represents the instantaneous profit generated by changing the state from $x(t)$ to $x(t) + dx(t)$; we can look at this profit as an indirect contribution. Therefore, $H\, dt$ can be interpreted as the total contribution made from time $t$ to $t + dt$ given the state $x(t)$ and the control $u$.
With this interpretation, the Maximum Principle simply states that an optimal control should maximize the total contribution at every time $t$. In other words, the Maximum Principle decouples the dynamic optimization problem into a series of static optimization problems, one for every time $t$.
Note also that if we integrate the adjoint equation we get
\[ P(t) = \int_t^{t_1} \frac{\partial H}{\partial x}\, ds. \]
So $P(t)$ is the cumulative gain obtained over $[t, t_1]$ from a marginal change of the state. In this respect, the adjoint variables behave in much the same way as dual variables in LP.
1.6 Exercises
Exercise 1.6.1 In class, we solved the following deterministic optimal control problem
\[ \min_{\|u\| \le 1}\ J(t_0, x_0, u) = \frac{1}{2}\, (x(\tau))^2 \qquad \text{subject to } \dot x(t) = u(t),\ x(t_0) = x_0, \]
where $\|u\| = \max_{0 \le t \le \tau} |u(t)|$, using the method of characteristics. In particular, we solved the open-loop HJB PDE
\[ W_t(t,x;u) + W_x(t,x;u)\, u = 0, \qquad W(\tau, x; u) = \frac{1}{2} x^2, \]
for a fixed $u$ and then found the optimal closed-loop control by solving
\[ u^*(t,x) = \arg\min_{\|u\| \le 1} W(t,x;u) \]
and computing the value function as $V(t,x) = W(t,x; u^*(t,x))$.
a) Explain why this methodology does not work in general. Provide a counterexample.
b) What specific control problems can be solved using this open-loop approach?
c) Propose an algorithm that uses the open-loop solution to approximately solve a general deterministic optimal control problem.
Exercise 1.6.2 (Dynamic Pricing in Discrete Time)
Assume that we have $x_0$ items of a certain type that we want to sell over a period of $N$ days. On each day we may sell at most one item. On the $k$th day, knowing the current number $x_k$ of remaining unsold items, we can set the selling price $u_k$ of a unit item to a nonnegative number of our choice; then, the probability $q_k(u_k)$ of selling an item on the $k$th day depends on $u_k$ as follows:
\[ q_k(u_k) = \alpha\, \exp(-u_k), \]
where $0 < \alpha < 1$ is a given scalar. The objective is to find the optimal price-setting policy so as to maximize the total expected revenue over $N$ days. Let $V_k(x_k)$ be the optimal expected revenue from day $k$ to the end if we have $x_k$ unsold units.
a) Assuming that for all $k$ the value function $V_k(x_k)$ is monotonically nondecreasing as a function of $x_k$, prove that for $x_k > 0$ the optimal prices have the form
\[ \mu_k(x_k) = 1 + V_{k+1}(x_k) - V_{k+1}(x_k - 1) \]
and that
\[ V_k(x_k) = \alpha\, \exp(-\mu_k(x_k)) + V_{k+1}(x_k). \]
b) Prove simultaneously by induction that, for all $k$, the value function $V_k(x_k)$ is indeed monotonically nondecreasing as a function of $x_k$, that the optimal price $\mu_k(x_k)$ is monotonically nonincreasing as a function of $x_k$, and that $V_k(x_k)$ is given in closed form by
\[ V_k(x_k) = \begin{cases} (N-k)\, \alpha\, \exp(-1) & \text{if } x_k \ge N - k, \\[4pt] \displaystyle\sum_{i=k}^{N - x_k} \alpha\, \exp(-\mu_i(x_k)) + x_k\, \alpha\, \exp(-1) & \text{if } 0 < x_k < N - k, \\[4pt] 0 & \text{if } x_k = 0. \end{cases} \]
Exercise 1.6.3 Consider a deterministic optimal control problem in which $u$ is a scalar control and $x$ is also scalar. The dynamics are given by
\[ f(t,x,u) = a(x) + b(x)\, u, \]
where $a(x)$ and $b(x)$ are $C^2$ functions. If
\[ P(t)\, b(x(t)) = 0 \quad \text{on a time interval } t \in \mathcal{I}, \]
the Hamiltonian does not depend on $u$ and the problem is singular. Show that under these conditions
\[ P(t)\, q(x(t)) = 0, \qquad t \in \mathcal{I}, \]
where $q(x) = b_x(x)\, a(x) - a_x(x)\, b(x)$. Show further that if
\[ P(t)\left[ q_x(x(t))\, b(x(t)) - b_x(x(t))\, q(x(t)) \right] \ne 0, \]
then
\[ u(t) = -\frac{P(t)\left[ q_x(x(t))\, a(x(t)) - a_x(x(t))\, q(x(t)) \right]}{P(t)\left[ q_x(x(t))\, b(x(t)) - b_x(x(t))\, q(x(t)) \right]}. \]
Exercise 1.6.4 The objective of this exercise is to characterize a particular family of learning functions. These learning functions are useful modelling devices for situations where there is an agent who tries to increase his or her level of knowledge about a certain phenomenon (such as customers' preferences or product quality) by applying a certain control or effort. To fix ideas, in what follows knowledge will be represented by the variable $x$ while effort will be represented by the variable $e$. For simplicity, we will assume that knowledge takes values in the $[0,1]$ interval while effort is a nonnegative real variable. The family of learning functions that we are interested in are those that can be derived from a specific subfamily that we call Additive Learning Functions. The formal definition of an Additive Learning Function$^1$ is as follows.

Definition 1.6.1 Consider a function $L : \mathbb{R}_+ \times [0,1] \to [0,1]$. The function $L$ is called an Additive Learning Function if it satisfies the following properties:
- Additivity: $L(e_2 + e_1, x) = L(e_2, L(e_1, x))$ for all $e_1, e_2 \in \mathbb{R}_+$ and $x \in [0,1]$.
- Boundary Condition: $L(0, x) = x$ for all $x \in [0,1]$.
- Monotonicity: $L_e(e,x) = \frac{\partial}{\partial e} L(e,x) > 0$ for all $(e,x) \in \mathbb{R}_+ \times [0,1]$.
- Satiation: $\lim_{e \to \infty} L(e,x) = 1$ for all $x \in [0,1]$.

a) Prove the following. Suppose that $L(e,x)$ is a $C^1$ additive learning function. Then $L(e,x)$ satisfies
\[ L_e(e,x) - L_e(0,x)\, L_x(e,x) = 0, \]
where $L_e$ and $L_x$ are the partial derivatives of $L(e,x)$ with respect to $e$ and $x$, respectively.
b) Using the method of characteristics, solve the PDE of part a) as a function of
\[ H(x) = \int \frac{-1}{L_e(0,x)}\, dx, \]
and prove that the solution is of the form
\[ L(e,x) := H^{-1}\!\left( H(x) - e \right). \]
Consider the following optimal control problem:
\[ V(0,x) = \max_{p_t} \int_0^T \left[ p_t\, \lambda(p_t)\, x_t \right] dt \tag{1.6.1} \]
\[ \text{subject to } \quad \dot x_t = L_e(0, x_t)\, \lambda(p_t), \qquad x_0 = x \in [0,1] \text{ given}, \tag{1.6.2} \]
where $L_e(0,x)$ is the partial derivative of the learning function $L(e,x)$ with respect to $e$ evaluated at $(0,x)$. This problem corresponds to the case of a seller who tries to maximize cumulative revenue during the period $[0,T]$. The potential demand rate at time $t$ is $\lambda(p_t)$, where $p_t$ is the price set by the seller at time $t$. However, only a fraction $x_t \in [0,1]$ of the potential customers buy the product at time $t$. The dynamics of $x_t$ are given by (1.6.2).
c) Show that equation (1.6.2) can be rewritten as
\[ x_t = L(y_t, x), \qquad \text{where } y_t := \int_0^t \lambda(p_s)\, ds, \]
and use this fact to reformulate the control problem as follows:
\[ \max_{y_t} \int_0^T p(\dot y_t)\, \dot y_t\, L(y_t, x)\, dt \qquad \text{subject to } y_0 = 0, \tag{1.6.3} \]
where $p(\cdot)$ denotes the inverse demand function.
d) Deduce that the optimality conditions in this case are given by
\[ \dot y_t^{\,2}\, p'(\dot y_t)\, L(y_t, x) = \text{constant}. \tag{1.6.4} \]
e) Solve the optimality condition for the case
\[ \lambda(p) = \lambda_0 \exp(-\alpha p) \qquad \text{and} \qquad L(e,x) = 1 + (x-1)\exp(-\beta e), \qquad \alpha, \beta > 0. \]

$^1$ This name is probably not standard since I do not know the relevant literature well enough.
1.7 Exercises
Exercise 1.7.1 Solve the problem:
\[ \min\ (x(T))^2 + \int_0^T (u(t))^2\, dt \qquad \text{subject to } \dot x(t) = u(t),\ |u(t)| \le 1,\ t \in [0,T]. \]
Calculate the cost-to-go function $J^*(t,x)$ and verify that it satisfies the HJB equation.

A young investor has earned in the stock market a large amount of money $S$ and plans to spend it so as to maximize his enjoyment through the rest of his life without working. He estimates that he will live exactly $T$ more years and that his capital $x(t)$ should be reduced to zero at time $T$, i.e., $x(T) = 0$. Also, he models the evolution of his capital by the differential equation
\[ \frac{dx(t)}{dt} = \alpha\, x(t) - u(t), \]
where $x(0) = S$ is his initial capital, $\alpha > 0$ is a given interest rate, and $u(t) \ge 0$ is his rate of expenditure. The total enjoyment he will obtain is given by
\[ \int_0^T e^{-\beta t} \sqrt{u(t)}\, dt. \]
Here $\beta$ is some positive scalar, which serves to discount future enjoyment. Find the optimal expenditure rate $\{u(t) \mid t \in [0,T]\}$.
Exercise 1.7.2 Analyze the problem of finding a curve $\{x(t) \mid t \in [0,T]\}$ that maximizes the area under $x$,
\[ \int_0^T x(t)\, dt, \]
subject to the constraints
\[ x(0) = a, \qquad x(T) = b, \qquad \int_0^T \sqrt{1 + (\dot x(t))^2}\, dt = L, \]
where $a$, $b$ and $L$ are given positive scalars. The last constraint is known as the isoperimetric constraint: it requires that the length of the curve be $L$.
Hint: Introduce the system equations $\dot x_1 = u$, $\dot x_2 = \sqrt{1 + u^2}$, and view the problem as a fixed terminal state problem. Show that the optimal curve $x(t)$ satisfies the condition
\[ \frac{\dot x(t)}{\sqrt{1 + (\dot x(t))^2}} = \frac{C_1 - t}{C_2} \]
for given constants $C_1$, $C_2$. Under some assumptions on $a$, $b$, and $L$, the optimal curve is a circular arc.
Exercise 1.7.3 Let $a$, $b$ and $T$ be positive scalars, and let $A = (0,a)$ and $B = (T,b)$ be two points in a medium within which the velocity of propagation of light is proportional to the vertical coordinate. Thus the time it takes for light to propagate from $A$ to $B$ along a curve $\{x(t) \mid t \in [0,T]\}$ is
\[ \int_0^T \frac{\sqrt{1 + (\dot x(t))^2}}{C\, x(t)}\, dt, \]
where $C$ is a given positive constant. Find the curve of minimum travel time of light from $A$ to $B$, and show that it is an arc of a circle of the form
\[ (x(t))^2 + (t - d)^2 = D, \]
where $d$ and $D$ are some constants.
Hint: Introduce the system equation $\dot x = u$, and consider a fixed initial/terminal state problem with $x(0) = a$ and $x(T) = b$.
Exercise 1.7.4 Use the discrete-time Minimum Principle to solve the following problem:
A farmer annually producing $x_k$ units of a certain crop stores $(1 - u_k)x_k$ units of his production, where $0 \le u_k \le 1$, and invests the remaining $u_k x_k$ units, thus increasing the next year's production to a level $x_{k+1}$ given by
\[ x_{k+1} = x_k + w\, u_k\, x_k, \qquad k = 0, 1, \ldots, N-1. \]
The scalar $w$ is fixed at a known deterministic value. The problem is to find the optimal investment policy that maximizes the total product stored over $N$ years,
\[ x_N + \sum_{k=0}^{N-1} (1 - u_k)\, x_k. \]
Show the optimality of the following policy, which consists of constant functions:
1. If $w > 1$,
\[ \mu_0^*(x_0) = \cdots = \mu_{N-1}^*(x_{N-1}) = 1. \]
2. If $0 < w < 1/N$,
\[ \mu_0^*(x_0) = \cdots = \mu_{N-1}^*(x_{N-1}) = 0. \]
3. If $1/N \le w \le 1$,
\[ \mu_0^*(x_0) = \cdots = \mu_{N-\bar k - 1}^*(x_{N - \bar k - 1}) = 1, \qquad \mu_{N - \bar k}^*(x_{N - \bar k}) = \cdots = \mu_{N-1}^*(x_{N-1}) = 0, \]
where $\bar k$ is such that
\[ \frac{1}{\bar k + 1} < w \le \frac{1}{\bar k}. \]
Chapter 2
Discrete Dynamic Programming
Dynamic programming (DP) is a technique pioneered by Richard Bellman$^1$ in the 1950s to model and solve problems where decisions are made in stages$^2$ in order to optimize a particular functional (e.g., minimize a certain cost) that depends (possibly) on the entire evolution (trajectory) of the system over time as well as on the decisions that were made along the way. The distinctive feature of DP (and one that is useful to keep in mind) with respect to the method of Calculus of Variations discussed in the previous chapter is that instead of thinking of an optimal trajectory as a point in an appropriate space, DP constructs this optimal trajectory sequentially over time; in essence, DP is an algorithm.
A fundamental idea that emerges from DP is that in general decisions cannot be made myopically (that is, optimizing current performance) since a low cost now might mean a high cost in the future.
2.1 Discrete-Time Formulation
Let us introduce the basic DP model using one of the most classical examples in Operations Management, namely, the Inventory Control problem.
Example 2.1.1 (Inventory control) Consider the problem faced by a firm that must replenish periodically (e.g., every month) the level of inventory of a certain good. The inventory of this good is used to satisfy a (possibly stochastic) demand. The dynamics of this inventory system are depicted in Figure 2.1.1. There are two costs incurred per period: a per-unit purchasing cost $c$, and an inventory cost incurred at the end of a period that accounts for either holding costs (when there is a positive amount of inventory carried over to the next period) or backlog costs (associated with unsatisfied demand that must be met in the future), given by a function $r(\cdot)$. The manager of this firm must decide at the beginning of every period $k$ the amount of inventory to order ($u_k$) based on the initial level of inventory in period $k$ ($x_k$) and the available forecast of future demands $(w_k, w_{k+1}, \ldots, w_N)$; this forecast is captured by the underlying joint probability distribution of these future demands.

$^1$ For a brief historical account of the early developments of DP, including the origin of its name, see S. Dreyfus (2002). "Richard Bellman on the Birth of Dynamic Programming," Operations Research, Vol. 50, No. 1, Jan-Feb, 48-51.
$^2$ For the most part we will consider applications in which these different stages correspond to different moments in time.

Figure 2.1.1: System dynamics for the Inventory Control problem (stock at the beginning of period $k$, plus stock ordered at the beginning of period $k$, minus demand during period $k$).
Assumptions:
1. Lead time $= 0$ (i.e., instantaneous replenishment).
2. Independent demands $w_0, w_1, \ldots, w_{N-1}$.
3. Fully backlogged demand.
4. Zero terminal cost (i.e., free disposal, $g_N(x_N) = 0$).
The objective is to minimize the total expected cost over $N$ periods, i.e.,
\[ \min_{u_0, \ldots, u_{N-1} \ge 0}\ \mathbb{E}_{w_0, w_1, \ldots, w_{N-1}} \left[ \sum_{k=0}^{N-1} \left( c\, u_k + r(x_k + u_k - w_k) \right) \right]. \]
We will prove that for convex cost functions $r(\cdot)$, the optimal policy is of the order-up-to form.
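Anticipating the backward recursion developed later in this chapter, the order-up-to structure can already be seen numerically. The sketch below runs backward induction on a truncated inventory grid; all numbers (horizon, costs, demand distribution, grid) are assumptions made for the illustration.

```python
import numpy as np

# Illustrative data (assumptions for the sketch): horizon, costs and an i.i.d. demand distribution.
N = 4                      # number of periods
c = 1.0                    # per-unit purchasing cost
h, b = 1.0, 3.0            # holding and backlog cost rates
r = lambda y: h * max(y, 0) + b * max(-y, 0)   # convex end-of-period inventory cost
demand_vals = np.array([0, 1, 2, 3])           # demand support ...
demand_prob = np.array([0.1, 0.4, 0.3, 0.2])   # ... and probabilities

states = np.arange(-10, 11)     # truncated grid of inventory levels (backlog allowed)
max_order = 10

V = {x: 0.0 for x in states}    # terminal cost g_N = 0
for k in reversed(range(N)):
    V_new, policy = {}, {}
    for x in states:
        best_u, best_cost = 0, np.inf
        for u in range(max_order + 1):
            # Expected one-period cost plus cost-to-go, clipping the next state to the grid.
            nxt = np.clip(x + u - demand_vals, states[0], states[-1])
            cost = c * u + sum(p * (r(x + u - w) + V[s])
                               for w, p, s in zip(demand_vals, demand_prob, nxt))
            if cost < best_cost:
                best_u, best_cost = u, cost
        V_new[x], policy[x] = best_cost, best_u
    V = V_new
    # Order-up-to structure: x + u*(x) is constant for low inventory levels, and u*(x) = 0 above it.
    print(f"period {k}: x + u*(x) for x in -3..3 ->", [x + policy[x] for x in range(-3, 4)])
```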
The inventory problem highlights the following main features of our basic model:
1. An underlying discrete-time dynamic system.
2. A finite horizon.
3. A cost function that is additive over time.
System dynamics are described by a sequence of states driven by a system equation
\[ x_{k+1} = f_k(x_k, u_k, w_k), \qquad k = 0, 1, \ldots, N-1, \]
where:
- $k$ is a discrete time index;
- $f_k$ is the state transition function;
- $x_k$ is the current state of the system. It could summarize past information relevant for future optimization when the system is not Markovian;
- $u_k$ is the control, i.e., the decision variable to be selected at time $k$;
- $w_k$ is a random parameter (disturbance or noise) described by a probability distribution $P_k(\cdot \mid x_k, u_k)$;
- $N$ is the length of the horizon, i.e., the number of periods when control is applied.
The per-period cost function is given by $g_k(x_k, u_k, w_k)$. The total cost function is additive, with a total expected cost given by
\[ \mathbb{E}_{w_0, w_1, \ldots, w_{N-1}} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right], \]
where the expectation is taken over the joint distribution of the random variables $w_0, w_1, \ldots, w_{N-1}$ involved.
The sequence of events in a period $k$ is the following:
1. The system manager observes the current state $x_k$.
2. Decision $u_k$ is made.
3. Random noise $w_k$ is realized. It could potentially depend on $x_k$ and $u_k$ (for example, think of a case where $u_k$ is a price and $w_k$ is demand).
4. Cost $g_k(x_k, u_k, w_k)$ is incurred.
5. Transition $x_{k+1} = f_k(x_k, u_k, w_k)$ occurs.
If we think about tackling a possible solution to a discrete DP such as the inventory Example 2.1.1, two somewhat extreme strategies can be considered:
1. Open loop: select all orders $u_0, u_1, \ldots, u_{N-1}$ at time $k = 0$.
2. Closed loop: sequential decision making, placing an order $u_k$ at time $k$. Here, we gain information about the realization of demand on the fly.
Intuitively, in a deterministic DP setting in which the values of $(w_0, w_1, \ldots, w_{N-1})$ are known at time 0, open- and closed-loop strategies are equivalent because no uncertainty is revealed over time and hence there is no gain from waiting. However, in a stochastic environment, postponing decisions can have a significant impact on the overall performance of a particular strategy, so closed-loop optimization is generally needed to solve a stochastic DP problem to optimality. In closed-loop optimization, we want to find an optimal rule (i.e., a policy) for selecting action $u_k$ in period $k$ as a function of the state $x_k$. That is, we want to find a sequence of functions $\mu_k(x_k) = u_k$, $k = 0, 1, \ldots, N-1$.
The sequence $\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ is a policy or control law. For each policy $\pi$, we can associate a trajectory $x^\pi = (x_0^\pi, x_1^\pi, \ldots, x_N^\pi)$ that describes the evolution of the state of the system (e.g., units in inventory at the beginning of every period in Example 2.1.1) over time when the policy $\pi$ has been chosen. Note that in general $x^\pi$ is a stochastic process. The corresponding performance of policy $\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ is given by
\[ J_\pi = \mathbb{E}_{w_0, w_1, \ldots, w_{N-1}} \left[ g_N(x_N^\pi) + \sum_{k=0}^{N-1} g_k\!\left( x_k^\pi, \mu_k(x_k^\pi), w_k \right) \right]. \]
If the initial state is fixed, i.e., $x_0^\pi = x_0$ for every feasible policy $\pi$, then we denote the performance of policy $\pi$ by $J_\pi(x_0)$.
The objective of dynamic programming is to optimize $J_\pi$ over all policies $\pi$ that satisfy the constraints of the problem.
2.1.1 Markov Decision Processes
There are situations where the state $x_k$ is naturally discrete, and its evolution can be modeled by a Markov chain. In these cases, the state transition function is described by the matrix of transition probabilities between the states:
\[ p_{ij}(u,k) = \mathbb{P}\{ x_{k+1} = j \mid x_k = i,\ u_k = u \}. \]
Claim: Transition probabilities $\Longleftrightarrow$ System equation.
Proof: ($\Longleftarrow$) Given a transition probability representation,
\[ p_{ij}(u,k) = \mathbb{P}\{ x_{k+1} = j \mid x_k = i,\ u_k = u \}, \]
we can cast it in terms of the basic DP framework as
\[ x_{k+1} = w_k, \qquad \text{where } \mathbb{P}\{ w_k = j \mid x_k = i,\ u_k = u \} = p_{ij}(u,k). \]
($\Longrightarrow$) Given a discrete-state system equation $x_{k+1} = f_k(x_k, u_k, w_k)$ and a probability distribution for $w_k$, $P_k(w_k \mid x_k, u_k)$, we can get the following transition probability representation:
\[ p_{ij}(u,k) = P_k\{ W_k(i,u,j) \mid x_k = i,\ u_k = u \}, \]
where the event $W_k$ is defined as
\[ W_k(i,u,j) = \{ w \mid j = f_k(i,u,w) \}. \qquad \square \]
Example 2.1.2 (Scheduling)
Objective: Find the optimal sequence of operations A, B, C, D to produce a certain product.
Precedence constraints: A must precede B, and C must precede D.
State definition: set of operations already performed.
Costs: startup costs $S_A$ and $S_C$ incurred at time $k = 0$ (depending on which operation is performed first), and setup transition costs $C_{nm}$ for performing operation $n$ immediately after operation $m$.
This example is represented in Figure 2.1.2. The optimal solution is described by a path of minimum cost that starts at the initial state and ends at some state at the terminal time. The cost of a path is the sum of the labels on its arcs plus the terminal cost (the label at the leaf).

Figure 2.1.2: System dynamics for the Scheduling problem.

Example 2.1.3 (Chess Game)
Objective: Find the optimal two-game chess match strategy that maximizes the winning chances.
Description of the match:
- Each game can have two outcomes: Win (1 point for the winner, 0 for the loser) and Draw (1/2 point for each player).
- If the score is tied after two games, the match continues until one of the players wins a game (sudden death).
State: vector with the two scores attained so far. It could also be the net score (difference between the scores).
Each player has two playing styles, and can choose one of the two at will in each game:
- Timid play: draws with probability $p_d > 0$, and loses with probability $1 - p_d$.
- Bold play: wins with probability $p_w > 0$, and loses with probability $1 - p_w$.
Observations: If there is a tie after the second game, the player must play bold. So, from an analytical perspective, the problem is a two-period one. Also note that this is not a game-theory setting, since there is no best response here; the other player's strategy is captured by the corresponding probabilities.
Using the equivalence between the system equation and the transition probability function mentioned above, Figure 2.1.3 shows the transition probabilities for period $k = 0$, and Figure 2.1.4 shows the transition probabilities for the second stage of the match (i.e., $k = 1$) together with the costs of the terminal states (values such as $-1$, $-p_w$, and $0$). Note that these numbers are negative because maximizing the probability of winning $p$ is equivalent to minimizing $-p$ (recall that we are working with min problems so far). One interesting feature of this problem (to be verified later) is that even if $p_w < 1/2$, the player could still have more than a 50% chance of winning the match.

Figure 2.1.3: Transition probability graph for period $k = 0$ for the chess match.
Figure 2.1.4: Transition probability graph for period $k = 1$ for the chess match.
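A minimal sketch of the two-period DP described in this example is given below; it tracks the net score, resolves a tie after game 2 with one bold sudden-death game, and maximizes over timid/bold play in each game. The particular probability values at the end are assumptions used only to illustrate the claim that the match-winning probability can exceed 1/2 even when $p_w < 1/2$.

```python
def match_win_prob(p_w, p_d):
    """Maximum probability of winning the two-game match, choosing timid or bold play in each game.
    State: net score (our points minus the opponent's) after k games."""
    def terminal(net):
        if net > 0:
            return 1.0          # match already won
        if net < 0:
            return 0.0          # match already lost
        return p_w              # tied after two games: sudden death, must play bold
    def V(k, net):
        if k == 2:
            return terminal(net)
        timid = p_d * V(k + 1, net) + (1 - p_d) * V(k + 1, net - 1)
        bold = p_w * V(k + 1, net + 1) + (1 - p_w) * V(k + 1, net - 1)
        return max(timid, bold)
    return V(0, 0)

# With p_w = 0.45 and p_d = 0.9 the match-winning probability is about 0.54 > 1/2.
print(match_win_prob(p_w=0.45, p_d=0.9))
```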
2.2 Deterministic DP and the Shortest Path Problem
In this section, we focus on deterministic problems, i.e., problems where the value of each disturbance $w_k$ is known in advance at time 0. In deterministic problems, using feedback does not help in terms of cost reduction, and hence open-loop and closed-loop policies are equivalent.
Claim: In deterministic problems, minimizing cost over admissible policies $\{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ (i.e., sequences of functions) leads to the same optimal cost as minimizing over sequences of control vectors $\{u_0, u_1, \ldots, u_{N-1}\}$.
Proof: Given a policy $\{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ and an initial state $x_0$, the future states are perfectly predictable through the equation
\[ x_{k+1} = f_k(x_k, \mu_k(x_k)), \qquad k = 0, 1, \ldots, N-1, \]
and the corresponding controls are perfectly predictable through the equation
\[ u_k = \mu_k(x_k), \qquad k = 0, 1, \ldots, N-1. \]
Thus, the cost achieved by an admissible policy $\{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ for a deterministic problem is also achieved by the control sequence $\{u_0, u_1, \ldots, u_{N-1}\}$ defined above.
Hence, we may restrict attention to sequences of controls without loss of optimality. $\square$
2.2.1 Deterministic finite-state problem
This type of problem can be represented by a graph (see Figure 2.2.1), where:
- States $\longrightarrow$ Nodes.
- Each control applied at a state $x_k$ $\longrightarrow$ An arc out of node $x_k$. So, every outgoing arc from a node $x_k$ represents one possible control $u_k$. Among them, we have to choose the best one, $u_k^*$ (i.e., the one that minimizes the cost from node $x_k$ onwards).
- Control sequences (open loop) $\longrightarrow$ Paths from the initial state $s$ to the terminal states.
- Final stage $\longrightarrow$ Artificial terminal node $t$. Each state $x_N$ at stage $N$ is connected to the terminal node $t$ with an arc having cost $g_N(x_N)$.
- One-step costs $g_k(i, u_k)$ $\longrightarrow$ Cost $a_{ij}^k$ of an arc (the cost of the transition from state $i \in S_k$ to state $j \in S_{k+1}$ at time $k$, viewed as the length of the arc) if $u_k$ forces the transition $i \to j$. Define $a_{it}^N$ as the terminal cost of state $i \in S_N$, and set $a_{ij}^k = \infty$ if there is no control that drives the state from $i$ to $j$.
- Cost of a control sequence $\longrightarrow$ Cost of the corresponding path (viewed as the length of the path).
The deterministic finite-state problem is therefore equivalent to finding a shortest path from $s$ to $t$.
2.2.2 Backward and forward DP algorithms
The usual backward DP algorithm takes the form:
\[ J_N(i) = a_{it}^N, \qquad i \in S_N, \]
\[ J_k(i) = \min_{j \in S_{k+1}} \left\{ a_{ij}^k + J_{k+1}(j) \right\}, \qquad i \in S_k,\ k = 0, 1, \ldots, N-1. \]
The optimal cost is $J_0(s)$ and is equal to the length of the shortest path from $s$ to $t$.

Figure 2.2.1: Construction of a transition graph for a deterministic finite-state system: stages $0, 1, \ldots, N$, initial state $s$, controls $u_k$ generating transitions $x_{k+1} = f_k(x_k, u_k)$ with cost $g_k(x_k, u_k)$, and terminal arcs with cost $g_N(x_N)$ into an artificial terminal node $t$.
Observation: An optimal path $s \to t$ is also an optimal path $t \to s$ in a reverse shortest path problem where the direction of each arc is reversed and its length is left unchanged.
The previous observation leads to the forward DP algorithm:
\[ \tilde J_N(j) = a_{sj}^0, \qquad j \in S_1, \]
\[ \tilde J_k(j) = \min_{i \in S_{N-k}} \left\{ a_{ij}^{N-k} + \tilde J_{k+1}(i) \right\}, \qquad j \in S_{N-k+1},\ k = 1, \ldots, N-1. \]
The optimal cost is
\[ \tilde J_0(t) = \min_{i \in S_N} \left\{ a_{it}^N + \tilde J_1(i) \right\}. \]
Note that both algorithms yield the same result: $J_0(s) = \tilde J_0(t)$. Interpret $\tilde J_k(j)$ as the optimal cost-to-arrive at state $j$ from the initial state $s$.
The following observations apply to the forward DP algorithm:
- There is no forward DP algorithm for stochastic problems.
- Mathematically, for stochastic problems, we cannot restrict ourselves to open-loop sequences, so the shortest path viewpoint fails.
- Conceptually, in the presence of uncertainty, the concept of an optimal cost-to-arrive at a state $x_k$ does not make sense. The reason is that it may be impossible to guarantee (w.p. 1) that any given state can be reached.
- By contrast, even in stochastic problems, the concept of the optimal cost-to-go from any state $x_k$ (in expectation) makes clear sense.
Conclusion: A deterministic finite-state problem is equivalent to a special type of shortest path problem and can be solved by either the ordinary (backward) DP algorithm or by an alternative forward DP algorithm.
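The equality $J_0(s) = \tilde J_0(t)$ is easy to verify numerically. The following sketch builds a small random staged graph (the data are assumptions, generated on the fly) and runs both recursions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small staged graph (assumed data): N stages, n states per stage,
# a[k][i, j] = arc cost from state i at stage k to state j at stage k+1; aN[i] = terminal cost.
N, n = 4, 3
a = [rng.integers(1, 10, size=(n, n)).astype(float) for _ in range(N)]
aN = rng.integers(1, 10, size=n).astype(float)
s = 0  # initial state at stage 0

# Backward DP: J_k(i) = min_j { a_k(i, j) + J_{k+1}(j) },  with J_N equal to the terminal costs.
J = aN.copy()
for k in reversed(range(1, N)):
    J = (a[k] + J[None, :]).min(axis=1)
backward_cost = (a[0][s, :] + J).min()

# Forward DP (cost-to-arrive): D_1(j) = a_0(s, j),  D_{k+1}(j) = min_i { D_k(i) + a_k(i, j) }.
D = a[0][s, :].copy()
for k in range(1, N):
    D = (D[:, None] + a[k]).min(axis=0)
forward_cost = (D + aN).min()

print(backward_cost, forward_cost)  # both equal the shortest s -> t length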
48
CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey
2.2.3 Generic shortest path problems
Here, we are converting a shortest path problem to a deterministic nite-state problem. More
formally, given a graph, we want to compute the shortest path from each node i to the nal node t.
How to cast this into the DP framework?
Let {1, 2, . . . , N, t} be the set of nodes of a graph, where t is the destination node.
Let a
ij
be the cost of moving from node i to node j.
Objective: Find a shortest (minimum cost) path from each node i to node t.
Assumption: All cycles have nonnegative length. Then, an optimal path need not take more
than N moves (depth of a tree).
We formulate the problem as one where we require exactly N moves but allow degenerate
moves from a node i to itself with cost a
ii
= 0.
In terms of the DP framework, we propose a formulation with N stages labeled 0, 1, . . . , N1.
Denote:
J
k
(i) = Optimal cost of getting from i to t in N k moves
J
0
(i) = Cost of the optimal path from i to t in N moves.
DP algorithm:
J
k
(i) = min
j=1,2,...,N
{a
ij
+J
k+1
(j)}, k = 0, 1, . . . , N 2,
with J
N1
(i) = a
it
, i = 1, 2, . . . , N.
The optimal policy when at node i after k moves is to move to a node j

such that
j

= argmin
1jN
{a
ij
+J
k+1
(j)}
If the optimal path from the algorithm contains degenerate moves from a node to itself, it
means that the path in reality involves less than N moves.
Demonstration of the algorithm
Consider the problem exhibited in Figure 2.2.2, where the costs $a_{ij}$, $i \ne j$, are shown along the connecting line segments. The graph is represented as an undirected one, meaning that the arc costs are the same in both directions, i.e., $a_{ij} = a_{ji}$.

Figure 2.2.2: Shortest path problem data. There are $N = 4$ states, and a destination node $t = 5$. The arc costs are $a_{12} = 6$, $a_{13} = 5$, $a_{14} = 2$, $a_{23} = 0.5$, $a_{24} = 5$, $a_{34} = 1$, $a_{15} = 2$, $a_{25} = 7$, $a_{35} = 5$, $a_{45} = 3$, with $a_{ii} = 0$.

Running the algorithm: In this case, we have $N = 4$ states and stages $k = 0, 1, 2, 3$.
1. Starting from stage $N-1 = 3$, we compute $J_{N-1}(i) = a_{it}$ for $i = 1, 2, 3, 4$ and $t = 5$:
\[ J_3(1) = a_{15} = 2, \qquad J_3(2) = a_{25} = 7, \qquad J_3(3) = a_{35} = 5, \qquad J_3(4) = a_{45} = 3. \]
These numbers represent the cost of getting from $i$ to $t$ in $N - (N-1) = 1$ move.
2. Proceeding backwards to stage $N-2 = 2$, we have:
\begin{align*}
J_2(1) &= \min_{j=1,2,3,4} \{ a_{1j} + J_3(j) \} = \min\{ 0+2,\ 6+7,\ 5+5,\ 2+3 \} = 2, \\
J_2(2) &= \min_{j=1,2,3,4} \{ a_{2j} + J_3(j) \} = \min\{ 6+2,\ 0+7,\ 0.5+5,\ 5+3 \} = 5.5, \\
J_2(3) &= \min_{j=1,2,3,4} \{ a_{3j} + J_3(j) \} = \min\{ 5+2,\ 0.5+7,\ 0+5,\ 1+3 \} = 4, \\
J_2(4) &= \min_{j=1,2,3,4} \{ a_{4j} + J_3(j) \} = \min\{ 2+2,\ 5+7,\ 1+5,\ 0+3 \} = 3.
\end{align*}
3. Proceeding backwards to stage $N-3 = 1$, we have:
\begin{align*}
J_1(1) &= \min_{j=1,2,3,4} \{ a_{1j} + J_2(j) \} = \min\{ 0+2,\ 6+5.5,\ 5+4,\ 2+3 \} = 2, \\
J_1(2) &= \min_{j=1,2,3,4} \{ a_{2j} + J_2(j) \} = \min\{ 6+2,\ 0+5.5,\ 0.5+4,\ 5+3 \} = 4.5, \\
J_1(3) &= \min_{j=1,2,3,4} \{ a_{3j} + J_2(j) \} = \min\{ 5+2,\ 0.5+5.5,\ 0+4,\ 1+3 \} = 4, \\
J_1(4) &= \min_{j=1,2,3,4} \{ a_{4j} + J_2(j) \} = \min\{ 2+2,\ 5+5.5,\ 1+4,\ 0+3 \} = 3.
\end{align*}
4. Finally, proceeding backwards to stage 0, we have:
\begin{align*}
J_0(1) &= \min_{j=1,2,3,4} \{ a_{1j} + J_1(j) \} = \min\{ 0+2,\ 6+4.5,\ 5+4,\ 2+3 \} = 2, \\
J_0(2) &= \min_{j=1,2,3,4} \{ a_{2j} + J_1(j) \} = \min\{ 6+2,\ 0+4.5,\ 0.5+4,\ 5+3 \} = 4.5, \\
J_0(3) &= \min_{j=1,2,3,4} \{ a_{3j} + J_1(j) \} = \min\{ 5+2,\ 0.5+4.5,\ 0+4,\ 1+3 \} = 4, \\
J_0(4) &= \min_{j=1,2,3,4} \{ a_{4j} + J_1(j) \} = \min\{ 2+2,\ 5+4.5,\ 1+4,\ 0+3 \} = 3.
\end{align*}
Figure 2.2.3 shows the outcome of the shortest path (DP-type) algorithm applied to the graph in Figure 2.2.2.

Figure 2.2.3: Outcome of the shortest path algorithm. The arcs represent the optimal control to follow from a given state $i$ (node $i$ in the graph) at a particular stage $k$ (where stage $k$ means $4-k$ transitions left to reach $t = 5$). When there is more than one arc going out of a node, it represents the availability of more than one optimal control. The label next to each node shows the cost-to-go starting at the corresponding (stage, state) position: for states $1, 2, 3, 4$ the labels are $2, 4.5, 4, 3$ at stages 0 and 1, then $2, 5.5, 4, 3$ at stage 2, and $2, 7, 5, 3$ at stage 3.
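The hand computation above is easily reproduced in a few lines of code; the following sketch runs the $N$-move DP on the arc data of Figure 2.2.2 and prints the same labels.

```python
INF = float('inf')
# Symmetric arc costs of the example (node 5 is the destination t); a_{ii} = 0 allows degenerate moves.
a = {(1, 2): 6, (1, 3): 5, (1, 4): 2, (2, 3): 0.5, (2, 4): 5, (3, 4): 1,
     (1, 5): 2, (2, 5): 7, (3, 5): 5, (4, 5): 3}
cost = lambda i, j: 0 if i == j else a.get((i, j), a.get((j, i), INF))

N = 4
J = {i: cost(i, 5) for i in range(1, N + 1)}      # J_{N-1}(i) = a_{it}
print("stage 3:", J)
for k in range(N - 2, -1, -1):                    # k = 2, 1, 0
    J = {i: min(cost(i, j) + J[j] for j in range(1, N + 1)) for i in range(1, N + 1)}
    print(f"stage {k}:", J)
# Matches the labels in Figure 2.2.3, ending with J_0 = {1: 2, 2: 4.5, 3: 4, 4: 3}.
```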
2.2.4 Some shortest path applications
Hidden Markov models and the Viterbi algorithm
Consider a Markov chain for which we do not observe the outcome of the transitions, but rather a signal or proxy that relates to each transition. The setting of the problem is the following:
- Markov chain (discrete time, finite number of states) with transition probabilities $p_{ij}$.
- State transitions are hidden from view.
- For each transition, we get an independent observation.
- Denote by $\pi_i$ the probability that the initial state is $i$.
- Denote by $r(z; i, j)$ the probability that the observation takes the value $z$ when the state transition is from $i$ to $j$.$^3$
Trajectory estimation problem: Given the observation sequence $Z_N = \{z_1, z_2, \ldots, z_N\}$, what is the most likely (unobservable) transition sequence $\hat X_N = \{\hat x_0, \hat x_1, \ldots, \hat x_N\}$? More formally, we are looking for the transition sequence $\hat X_N = \{\hat x_0, \hat x_1, \ldots, \hat x_N\}$ that maximizes $p(X_N \mid Z_N)$ over all $X_N = \{x_0, x_1, \ldots, x_N\}$. We are using the notation $\hat X_N$ to emphasize the fact that this is an estimated sequence: we do not observe the true sequence, but just a proxy for it given by $Z_N$.

$^3$ The probabilities $p_{ij}$ and $r(z; i, j)$ are assumed to be independent of time for notational convenience, but the methodology can be extended to time-dependent probabilities.
Viterbi algorithm
We know from conditional probability that
\[ P(X_N \mid Z_N) = \frac{P(X_N, Z_N)}{P(Z_N)}, \]
for unconditional probabilities $P(X_N, Z_N)$ and $P(Z_N)$. Since $P(Z_N)$ is a positive constant once $Z_N$ is known, we can just maximize $P(X_N, Z_N)$, where
\begin{align*}
P(X_N, Z_N) &= P(x_0, x_1, \ldots, x_N, z_1, z_2, \ldots, z_N) \\
&= \pi_{x_0}\, P(x_1, \ldots, x_N, z_1, z_2, \ldots, z_N \mid x_0) \\
&= \pi_{x_0}\, P(x_1, z_1 \mid x_0)\, P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1) \\
&= \pi_{x_0}\, p_{x_0 x_1}\, r(z_1; x_0, x_1)\, P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1),
\end{align*}
and the last conditional probability factors in turn as $p_{x_1 x_2}\, r(z_2; x_1, x_2)\, P(x_3, \ldots, x_N, z_3, \ldots, z_N \mid x_0, x_1, z_1, x_2, z_2)$. Continuing in the same manner we obtain
\[ P(X_N, Z_N) = \pi_{x_0} \prod_{k=1}^{N} p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k). \]
Instead of working with this function, we maximize $\log P(X_N, Z_N)$, or equivalently solve
\[ \min_{x_0, x_1, \ldots, x_N} \left\{ -\log(\pi_{x_0}) - \sum_{k=1}^{N} \log\!\left( p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k) \right) \right\}. \]
The outcome of this minimization problem is the sequence $\hat X_N = \{\hat x_0, \hat x_1, \ldots, \hat x_N\}$.
Transformation into a shortest path problem in a trellis diagram
We build the trellis diagram shown in Figure 2.2.4 as follows:
- Arc $(s, x_0)$ $\longrightarrow$ cost $= -\log \pi_{x_0}$.
- Arc $(x_N, t)$ $\longrightarrow$ cost $= 0$.
- Arc $(x_{k-1}, x_k)$ $\longrightarrow$ cost $= -\log\!\left( p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k) \right)$.
The shortest path defines the estimated state sequence $\{\hat x_0, \hat x_1, \ldots, \hat x_N\}$.
In practice, the shortest path is most conveniently constructed sequentially by forward DP. Suppose that we have already computed the shortest distances $D_k(x_k)$ from $s$ to all states $x_k$ on the basis of the observation sequence $z_1, \ldots, z_k$, and suppose that we observe $z_{k+1}$. Then
\[ D_{k+1}(x_{k+1}) = \min_{\{x_k :\ p_{x_k x_{k+1}} > 0\}} \left\{ D_k(x_k) - \log\!\left( p_{x_k x_{k+1}}\, r(z_{k+1}; x_k, x_{k+1}) \right) \right\}, \qquad k = 0, 1, \ldots, N-1, \]
starting from $D_0(x_0) = -\log \pi_{x_0}$.
Observations:
- The final estimated sequence $\hat X_N$ corresponds to the shortest path from $s$ to the final state $\hat x_N$ that minimizes $D_N(x_N)$ over the final set of possible states $x_N$.
- Advantage: it can be computed in real time, as new observations arrive.

Figure 2.2.4: State estimation of a hidden Markov model viewed as a problem of finding a shortest path from $s$ to $t$. There are $N+1$ copies of the state space (recall that the number of states is finite), so $x_k$ stands for any state in copy $k$ of the state space. An arc connects $x_{k-1}$ with $x_k$ if $p_{x_{k-1} x_k} > 0$; the arc $(s, x_0)$ has cost $-\log \pi_{x_0}$, the arc $(x_{k-1}, x_k)$ has cost $-\log(p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k))$, and the arcs $(x_N, t)$ have cost 0.
Applications of the Viterbi algorithm:
- Speech recognition, where the goal is to transcribe a spoken word sequence in terms of elementary speech units called phonemes. Setting:
  - States of the hidden Markov model: phonemes.
  - Given a sequence of recorded phonemes $Z_N = \{z_1, \ldots, z_N\}$ (i.e., a noisy representation of words), try to find a phonemic sequence $\hat X_N = \{\hat x_1, \ldots, \hat x_N\}$ that maximizes the conditional probability $P(X_N \mid Z_N)$ over all possible $X_N = \{x_1, \ldots, x_N\}$.
  - The probabilities $p_{x_{k-1} x_k}$ and $r(z_k; x_{k-1}, x_k)$ can be obtained experimentally.
- Computerized recognition of handwriting.
2.2.5 Shortest path algorithms
Computational implications of the equivalence between shortest path problems and deterministic finite-state DP:
- We can use DP to solve general shortest path problems.
- Although there are other methods with superior worst-case performance, DP could be preferred because it is highly parallelizable.
- There are many non-DP shortest path algorithms that can be used to solve deterministic finite-state problems. They may be preferable to DP if they avoid calculating the optimal cost-to-go at every state. This is essential for problems with huge state spaces (e.g., combinatorial optimization problems).
Example 2.2.1 (An example with a very large number of nodes: TSP)
The Traveling Salesman Problem (TSP) consists of finding a tour (cycle) that passes exactly once through each city (node) of a graph and that minimizes the total cost. Consider for instance the problem described in Figure 2.2.5.

Figure 2.2.5: Basic graph for the TSP example with four cities A, B, C, D (arc costs $a_{AB} = 5$, $a_{AC} = 1$, $a_{AD} = 15$, $a_{BC} = 20$, $a_{BD} = 4$, $a_{CD} = 3$).
To convert a TSP problem over a map (graph) with $N$ nodes to a shortest path problem, build a new execution graph as follows:
- Pick a city and set it as the initial node $s$.
- Associate a node with every sequence of $n$ distinct cities, $n \le N$.
- Add an artificial terminal node $t$.
- A node representing a sequence of cities $c_1, c_2, \ldots, c_n$ is connected with a node representing a sequence $c_1, c_2, \ldots, c_n, c_{n+1}$ by an arc with weight $a_{c_n c_{n+1}}$ (the length of the arc in the original graph).
- Each sequence of $N$ cities is connected to the terminal node through an arc with the same cost as the arc connecting the last city of the sequence and city $s$ in the original graph.
Figure 2.2.6 shows the construction of the execution graph for the example described in Figure 2.2.5.

Figure 2.2.6: Structure of the shortest path execution graph for the TSP example (origin node $s = $ A; nodes AB, AC, AD; then ABC, ABD, ACB, ACD, ADB, ADC; then the complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB; and the artificial terminal node $t$).

2.2.6 Alternative shortest path algorithms: Label correcting methods
Working on a shortest path execution graph such as the one in Figure 2.2.6, the idea of these methods is to progressively discover shorter paths from the origin $s$ to every other node $i$.
Given: origin $s$, destination $t$, arc lengths $a_{ij} \ge 0$.
Notation:
- Label $d_i$: length of the shortest path found so far from $s$ to $i$ (initially $d_s = 0$ and $d_i = \infty$ for $i \ne s$).
- Variable UPPER: the label $d_t$ of the destination.
- Set OPEN: contains the nodes that are currently active in the sense that they are candidates for further examination (initially OPEN $:= \{s\}$). It is sometimes called the candidate list.
- Function ParentOf($j$): saves the predecessor of $j$ in the shortest path found so far from $s$ to $j$. At the end of the algorithm, proceeding backward from node $t$, it allows us to rebuild the shortest path from $s$.
Label Correcting Algorithm (LCA)
Step 1 (Node removal): Remove a node $i$ from OPEN and for each child $j$ of $i$, do Step 2.
Step 2 (Node insertion test): If $d_i + a_{ij} < \min\{d_j, \text{UPPER}\}$, set $d_j := d_i + a_{ij}$ and set ParentOf($j$) $:= i$. In addition, if $j \ne t$, set OPEN $:=$ OPEN $\cup \{j\}$; while if $j = t$, set UPPER $:= d_t$.
Step 3 (Termination test): If OPEN is empty, terminate; else go to Step 1.
As a clarification for Step 2, note that since OPEN is a set, if $j$ (with $j \ne t$) is already in OPEN, then OPEN remains the same. Also, when $j = t$, note that UPPER takes the new value $d_t = d_i + a_{it}$ that has just been updated. Figure 2.2.7 sketches the Label Correcting Algorithm.
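A minimal sketch of the LCA follows; it uses depth-first (LIFO) node removal, and the arc data reproduce the TSP execution graph of Figure 2.2.6, so it finds the optimal tour length 13 without examining every node.

```python
from math import inf

def label_correcting(arcs, s, t):
    """Label Correcting Algorithm sketch (nonnegative arc lengths).
    arcs: dict mapping node -> list of (child, arc length)."""
    d, parent, UPPER, OPEN = {s: 0.0}, {}, inf, [s]
    while OPEN:
        i = OPEN.pop()                      # depth-first (LIFO) node removal
        for j, a_ij in arcs.get(i, []):
            if d[i] + a_ij < min(d.get(j, inf), UPPER):   # node insertion test
                d[j] = d[i] + a_ij
                parent[j] = i
                if j != t:
                    if j not in OPEN:
                        OPEN.append(j)
                else:
                    UPPER = d[t]
    # Rebuild the shortest path by following ParentOf backwards from t.
    path, node = [t], t
    while node != s:
        node = parent[node]
        path.append(node)
    return UPPER, list(reversed(path))

# Execution graph of the TSP example (node names = city sequences), as in Figure 2.2.6.
arcs = {'A': [('AB', 5), ('AC', 1), ('AD', 15)],
        'AB': [('ABC', 20), ('ABD', 4)], 'AC': [('ACB', 20), ('ACD', 3)], 'AD': [('ADB', 4), ('ADC', 3)],
        'ABC': [('ABCD', 3)], 'ABD': [('ABDC', 3)], 'ACB': [('ACBD', 4)],
        'ACD': [('ACDB', 4)], 'ADB': [('ADBC', 20)], 'ADC': [('ADCB', 20)],
        'ABCD': [('t', 15)], 'ABDC': [('t', 1)], 'ACBD': [('t', 15)],
        'ACDB': [('t', 5)], 'ADBC': [('t', 1)], 'ADCB': [('t', 5)]}
print(label_correcting(arcs, 'A', 't'))      # (13.0, ['A', 'AC', 'ACD', 'ACDB', 't'])
```

The order in which nodes leave OPEN (and hence which optimal tour is reported first) depends on the removal rule; the notes' Table 2.1 corresponds to a different examination order but reaches the same value UPPER $= 13$.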
The execution of the algorithm over the TSP example above is represented in Figure 2.2.8 and Table 2.1. Interestingly, note that several nodes of the execution graph never enter the OPEN set. Indeed, this computational reduction with respect to DP is what makes this method appealing. The following proposition establishes the validity of the Label Correcting Algorithm.

Figure 2.2.7: Sketch of the Label Correcting Algorithm. A node $i$ removed from OPEN generates, for each child $j$, the tests "is $d_i + a_{ij} < d_j$?" (is the path $s \to i \to j$ better than the current path $s \to j$?) and "is $d_i + a_{ij} < \text{UPPER}$?" (does the path $s \to i \to j$ have a chance to be part of a shorter $s \to t$ path?). If both hold, set $d_j := d_i + a_{ij}$ and ParentOf($j$) $:= i$, and insert $j$ into OPEN if $j \ne t$, or set UPPER $:= d_t$ if $j = t$.
Iter.  Node exiting OPEN   Label updates / observations                                     OPEN after iter.   UPPER
 0     --                  --                                                               {1}                infinity
 1     1                   d_2 := 5, d_7 := 1, d_10 := 15; ParentOf(2,7,10) := 1            {2, 7, 10}         infinity
 2     2                   d_3 := 25, d_5 := 9; ParentOf(3,5) := 2                          {3, 5, 7, 10}      infinity
 3     3                   d_4 := 28; ParentOf(4) := 3                                      {4, 5, 7, 10}      infinity
 4     4                   Reached terminal node t; d_t := 43; ParentOf(t) := 4             {5, 7, 10}         43
 5     5                   d_6 := d_5 + 3 = 12; ParentOf(6) := 5                            {6, 7, 10}         43
 6     6                   Reached terminal node t; d_t := 13; ParentOf(t) := 6             {7, 10}            13
 7     7                   d_8 := d_7 + 3 = 4; ParentOf(8) := 7. Node ACB would get         {8, 10}            13
                           label d_7 + 20 = 21 > UPPER, so it does not enter OPEN.
 8     8                   d_9 := 8; ParentOf(9) := 8                                       {9, 10}            13
 9     9                   d_9 + a_9t = d_9 + 5 = 13 >= UPPER, so no update                 {10}               13
10     10                  ADB: d_10 + 4 = 19 > UPPER; ADC: d_10 + 3 = 18 > UPPER           empty              13

Table 2.1: The optimal tour ABDC (with cost 13) is found after examining nodes 1 through 10 in Figure 2.2.8, in that order. The table also shows the successive contents of the OPEN list, the value of UPPER at the end of each iteration, and the actions taken during each iteration.
Proposition 2.2.1 If there exists at least one path from the origin to the destination in the execution graph, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination. Otherwise, the algorithm terminates with UPPER $= \infty$.
Proof: We proceed in three steps.
1. The algorithm terminates. Each time a node $j$ enters OPEN, its label $d_j$ is decreased and becomes equal to the length of some path from $s$ to $j$. The number of distinct lengths of paths from $s$ to $j$ that are smaller than any given number is finite. Hence, there can only be a finite number of label reductions.
2. Suppose that there is no path $s \to t$. Then a node $i$ such that $(i,t)$ is an arc cannot enter the OPEN list, because if that happened, since the paths are built starting from $s$, it would mean that there is a path $s \to i$, which jointly with the arc $(i,t)$ would determine a path $s \to t$, which is a contradiction. Since this holds for all $i$ adjacent to $t$ in the basic graph, UPPER can never be reduced from its initial value $\infty$.
3. Suppose that there is a path $s \to t$. Then there is a shortest path $s \to t$. Let $(s, j_1, j_2, \ldots, j_k, t)$ be a shortest path, and let $d^*$ be the corresponding shortest distance. We will see that UPPER $= d^*$ upon termination.
Each subpath $(s, j_1, j_2, \ldots, j_m)$, $m = 1, \ldots, k$, must be a shortest path $s \to j_m$. Suppose, to arrive at a contradiction, that UPPER $> d^*$ at termination; then the same holds throughout the algorithm (because UPPER is decreasing during the execution). So UPPER is bigger than the length of every subpath $s \to j_m$ (due to the nonnegative arc length assumption).
In particular, node $j_k$ can never enter the OPEN list with $d_{j_k}$ equal to the shortest distance $s \to j_k$. To see this, suppose $j_k$ enters OPEN with that label. When at some point the algorithm picks $j_k$ from OPEN, it sets $d_t = d_{j_k} + a_{j_k t} = d^*$ and UPPER $= d^*$, a contradiction.
Similarly, node $j_{k-1}$ can never enter OPEN with $d_{j_{k-1}}$ equal to the shortest distance $s \to j_{k-1}$. Proceeding backward, $j_1$ never enters OPEN with $d_{j_1}$ equal to the shortest distance $s \to j_1$, i.e., $a_{s j_1}$. However, this happens at the first iteration of the algorithm, leading to a contradiction. Therefore, UPPER is equal to the shortest distance $s \to t$. $\square$
Specific Label Correcting Methods
Making the method efficient:
- Reduce the value of UPPER as quickly as possible (i.e., try to discover good $s \to t$ paths early in the course of the algorithm).
- Keep the number of reentries into OPEN low: try to remove from OPEN nodes with small labels first. Heuristic rationale: if $d_i$ is small, then $d_j$, when set to $d_i + a_{ij}$, will be accordingly small, so reentry of $j$ into the OPEN list is less likely.
- Reduce the overhead for selecting the node to be removed from OPEN.
- These objectives are often in conflict; they give rise to a large variety of distinct implementations.
Node selection methods:
- Breadth-first search, also known as the Bellman-Ford method: the set OPEN is treated as an ordered list and managed with a FIFO policy.
- Depth-first search: the set OPEN is treated as an ordered list and managed with a LIFO policy. It often requires relatively little memory, especially for sparse (i.e., tree-like) graphs, and it reduces UPPER quickly.
- Best-first search, also known as Dijkstra's method: remove from OPEN a node $j$ with minimum value of the label $d_j$. In this way, each node is inserted in OPEN at most once.
Advanced initialization:
In order to get a small starting value of UPPER, instead of starting from $d_i = \infty$ for all $i \ne s$, we can initialize the values of the labels $d_i$ and the set OPEN as follows:
- Start with $d_i :=$ length of some path from $s$ to $i$ (if there is no such path, set $d_i = \infty$), and then set OPEN $:= \{ i \ne t \mid d_i < \infty \}$.
- No node with shortest distance greater than or equal to the initial value of UPPER will enter OPEN.
A good practical idea:
- Run a heuristic to get a good starting path $P$ from $s$ to $t$.
- Use as UPPER the length of $P$, and as $d_i$ the path distances of all nodes $i$ along $P$.
2.2.7 Exercises
Exercise 2.2.1 A decision maker must continually choose between two activities over a time interval [0, T]. Choosing activity i at time t, where i = 1, 2, earns reward at a rate g_i(t), and every switch between the two activities costs c > 0. Thus, for example, the reward for starting with activity 1, switching to 2 at time t_1, and switching back to 1 at time t_2 > t_1, earns total reward
    ∫_0^{t_1} g_1(t) dt + ∫_{t_1}^{t_2} g_2(t) dt + ∫_{t_2}^{T} g_1(t) dt − 2c.
We want to find a set of switching times that maximize the total reward. Assume that the function g_1(t) − g_2(t) changes sign a finite number of times in the interval [0, T]. Formulate the problem as a finite horizon problem and write the corresponding DP algorithm.
Exercise 2.2.2 Assume that we have a vessel whose maximum weight capacity is z and whose cargo is to consist of different quantities of N different items. Let v_i denote the value of the ith type of item, w_i the weight of the ith type of item, and x_i the number of items of type i that are loaded in the vessel. The problem is to find the most valuable cargo subject to the capacity constraint. Formulate this problem in terms of DP.
Exercise 2.2.3 Find a shortest path from each node to node 6 for the graph below by using the DP algorithm:
[Figure: a graph with nodes 1, ..., 6; the arc lengths are shown next to the arcs.]
Exercise 2.2.4 Air transportation is available between n cities, in some cases directly and in others through intermediate stops and change of carrier. The airfare between cities i and j is denoted by a_ij. We assume that a_ij = a_ji, and for notational convenience, we write a_ij = ∞ if there is no direct flight between i and j. The problem is to find the cheapest airfare for going between two cities perhaps through intermediate stops. Let n = 6 and
    a_12 = 30, a_13 = 60, a_14 = 25, a_15 = a_16 = ∞,
    a_23 = a_24 = a_25 = ∞, a_26 = 50,
    a_34 = 35, a_35 = a_36 = ∞,
    a_45 = 15, a_46 = ∞,
    a_56 = 15.
Find the cheapest airfare from every city to every other city by using the DP algorithm.
Exercise 2.2.5 Label correcting with negative arc lengths. Consider the problem of finding a shortest path from node s to node t, and assume that all cycle lengths are nonnegative (instead of all arc lengths being nonnegative). Suppose that a scalar u_j is known for each node j, which is an underestimate of the shortest distance from j to t (u_j can be taken to be −∞ if no underestimate is known). Consider a modified version of the typical iteration of the label correcting algorithm discussed above, where Step 2 is replaced by the following:

Modified Step 2: If d_i + a_ij < min{d_j, UPPER − u_j}, set d_j = d_i + a_ij and set i to be the parent of j. In addition, if j ≠ t, place j in OPEN if it is not already in OPEN, while if j = t, set UPPER to the new value d_i + a_it of d_t.

1. Show that the algorithm terminates with a shortest path, assuming there is at least one path from s to t.
2. Why is the Label Correcting Algorithm given in class a special case of the one here?
Exercise 2.2.6 We have a set of N objects, denoted 1, 2, ..., N, which we want to group in clusters that consist of consecutive objects. For each cluster i, i+1, ..., j, there is an associated cost a_ij. We want to find a grouping of the objects in clusters such that the total cost is minimum. Formulate the problem as a shortest path problem, and write a DP algorithm for its solution. (Note: An example of this problem arises in typesetting programs, such as TeX/LaTeX, that break down a paragraph into lines in a way that optimizes the paragraph's appearance.)
2.3 Stochastic Dynamic Programming

We present here a general problem of decision making under stochastic uncertainty over a finite number of stages. The components of the formulation are listed below:

The discrete time dynamic system evolves according to the system equation
    x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N−1,
where
  the state x_k is an element of a space S_k,
  the control u_k verifies u_k ∈ U_k(x_k) ⊂ C_k, for a space C_k, and
  the random disturbance w_k is an element of a space D_k.

The random disturbance w_k is characterized by a probability distribution P_k(· | x_k, u_k) that may depend explicitly on x_k and u_k. For now, we assume independent disturbances w_0, w_1, ..., w_{N−1}. In this case, since the system evolution from state to state is independent of the past, we have a Markov decision model.

Admissible policies π = {μ_0, μ_1, ..., μ_{N−1}}, where μ_k maps states x_k into controls u_k = μ_k(x_k), and is such that μ_k(x_k) ∈ U_k(x_k) for all x_k ∈ S_k.
Given an initial state x_0 and an admissible policy π = {μ_0, μ_1, ..., μ_{N−1}}, the states x_k and disturbances w_k are random variables with distributions defined through the system equation
    x_{k+1} = f_k(x_k, μ_k(x_k), w_k),  k = 0, 1, ..., N−1.

The stage cost function is given by g_k(x_k, μ_k(x_k), w_k).

The expected cost of a policy π starting at x_0 is
    J_π(x_0) = E[ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) ],
where the expectation is taken over the random variables w_k and x_k.

An optimal policy π* is one that minimizes this cost; that is,
    J_{π*}(x_0) = min_{π ∈ Π} J_π(x_0),    (2.3.1)
where Π is the set of all admissible policies.

Observations:
It is useful to see J* as a function that assigns to each initial state x_0 the optimal cost J*(x_0), and call it the optimal cost function or optimal value function.
When produced by DP, π* is typically independent of x_0, because π* = {μ*_0(x_0), ..., μ*_{N−1}(x_{N−1})}, and μ*_0(x_0) must be defined for all x_0.
Even though we will be using the min operator in (2.3.1), it should be understood that the correct formal formulation should be J*(x_0) = inf_{π ∈ Π} J_π(x_0).
Example 2.3.1 (Control of a single server queue)
Consider a single server queueing system with the following features:
Waiting room for n − 1 customers; an arrival finding n people in the system (n − 1 waiting and one in the server) leaves.
Discrete random service time belonging to the set {1, 2, ..., N}.
Probability p_m of having m arrivals at the beginning of a period, with m = 0, 1, 2, ....
The system offers two types of service, that can be chosen at the beginning of each period:
  Fast, with cost per period c_f, and that finishes at the end of the current period w.p. q_f,
  Slow, with cost per period c_s, and that finishes at the end of the current period w.p. q_s.
Assume q_f > q_s and c_f > c_s.
A recent arrival cannot be immediately served.
The system incurs two costs in every period: the service cost (either c_f or c_s), and a waiting time cost r(i) if there are i customers waiting at the beginning of a period.
There is a terminal cost R(i) if there are i customers waiting in the final period (i.e., in period N).

Problem: Choose the type of service at the beginning of each period (fast or slow) in order to minimize the total expected cost over N periods.

The intuitive optimal strategy must be of the threshold type: when there are more than a certain number of customers in the system use fast; otherwise, use slow.

In terms of DP terminology, we have:
State x_k: number of customers in the system at the start of period k.
Control u_k: type of service provided; either u_k = u_f (fast) or u_k = u_s (slow).
Cost per period k: for 0 ≤ k ≤ N−1, g_k(i, u_k, w_k) = r(i) + c_f 1{u_k = u_f} + c_s 1{u_k ≠ u_f} (see footnote 4). For k = N, g_N(i) = R(i).
According to the claim above, since states are discrete, transition probabilities are enough to describe the system dynamics:

If the system is empty, then
    p_0j(u_f) = p_0j(u_s) = p_j, j = 0, 1, ..., n−1;  and  p_0n(u_f) = p_0n(u_s) = Σ_{m ≥ n} p_m.
In words, since customers cannot be served immediately, they accumulate and the system jumps to state j < n (if there were less than n arrivals), or to state n (if there are n or more).

When the system is not empty (i.e., x_k = i > 0), then
    p_ij(u_f) = 0, if j < i−1 (we cannot have more than one service completion per period),
    p_ij(u_f) = q_f p_0, if j = i−1 (current customer finishes in this period and nobody arrives),
    p_ij(u_f) = P{j−i+1 arrivals, service completed} + P{j−i arrivals, service not completed}
              = q_f p_{j−i+1} + (1 − q_f) p_{j−i}, if i−1 < j < n−1,
    p_{i(n−1)}(u_f) = P{at least n−i arrivals, service completed} + P{n−1−i arrivals, service not completed}
              = q_f Σ_{m ≥ n−i} p_m + (1 − q_f) p_{n−1−i},
    p_{in}(u_f) = P{at least n−i arrivals, service not completed}
              = (1 − q_f) Σ_{m ≥ n−i} p_m.

For control u_s the formulas are analogous, with u_s and q_s replacing u_f and q_f, respectively.
Footnote 4: Here, 1{A} is the indicator function, taking the value one if event A occurs, and zero otherwise.
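As an illustration, the transition probabilities above can be assembled numerically. The sketch below is my own (not part of the notes): it builds the matrix p_ij(u) for a given per-period completion probability q and an assumed arrival distribution, so calling it with q_f or q_s gives the fast or slow matrix. The last column is filled with the residual probability mass, which for i < n coincides with (1 − q)·P{at least n − i arrivals}.

```python
import numpy as np

def transition_matrix(p, q, n):
    """Transition matrix for the single-server queue under one service type.

    p : arrival probabilities p_m, m = 0, 1, ... (assumed to sum to 1)
    q : probability the ongoing service completes this period
    n : system capacity (states are 0, 1, ..., n)
    """
    p = np.asarray(p, dtype=float)
    pm = lambda m: p[m] if 0 <= m < len(p) else 0.0
    tail = lambda m: p[m:].sum() if m < len(p) else 0.0   # P{at least m arrivals}
    P = np.zeros((n + 1, n + 1))
    # Empty system: arrivals accumulate, no service this period.
    for j in range(n):
        P[0, j] = pm(j)
    P[0, n] = tail(n)
    # Nonempty system with i customers.
    for i in range(1, n + 1):
        P[i, i - 1] = q * pm(0)
        for j in range(i, n - 1):
            P[i, j] = q * pm(j - i + 1) + (1 - q) * pm(j - i)
        if i < n:
            P[i, n - 1] = q * tail(n - i) + (1 - q) * pm(n - 1 - i)
        P[i, n] = 1.0 - P[i, :n].sum()    # residual mass (system becomes/stays full)
    return P

# Example with capacity n = 3 and an assumed arrival distribution:
P_fast = transition_matrix([0.5, 0.3, 0.15, 0.05], q=0.8, n=3)
print(P_fast.sum(axis=1))                 # each row sums to 1
```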
2.4 The Dynamic Programming Algorithm

The DP technique rests on a very intuitive idea, the Principle of Optimality. The name is due to Bellman (New York 1920 - Los Angeles 1984).

Principle of optimality. Let π* = {μ*_0, μ*_1, ..., μ*_{N−1}} be an optimal policy for the basic problem, and assume that when using π* a given state x_i occurs at time i with some positive probability. Consider the tail subproblem whereby we are at x_i at time i and wish to minimize the cost-to-go from time i to time N,
    E[ g_N(x_N) + Σ_{k=i}^{N−1} g_k(x_k, μ_k(x_k), w_k) ].
Then, the truncated (tail) policy {μ*_i, μ*_{i+1}, ..., μ*_{N−1}} is optimal for this subproblem.

Figure 2.4.1 illustrates the intuition of the Principle of Optimality.

Figure 2.4.1: Principle of Optimality: if the policy {μ*_0, μ*_1, ..., μ*_{N−1}} is optimal from state x_0 at k = 0, then the tail policy {μ*_i, μ*_{i+1}, ..., μ*_{N−1}} must be optimal for the tail subproblem starting at state x_i at k = i.

The DP algorithm is based on this idea: it first solves all tail subproblems of the final stage, and then proceeds backwards, solving all tail subproblems of a given time length using the solution of the tail subproblems of shorter time length. Next, we introduce the DP algorithm with two examples.
Solution to Scheduling Example 2.1.2:
Consider the graph of costs for previous Example 2 given in Figure 2.4.2. Applying the DP algorithm from the terminal nodes (stage N = 3), and proceeding backwards, we get the representation in Figure 2.4.3. At each state-time pair, we record the optimal cost-to-go and the optimal decision. For example, node AC has a cost of 5, and the optimal decision is to proceed to ACB (because it has the lowest stage k = 3 cost, starting from k = 2 and state AC). In terms of our formal notation for the cost, g_3(ACB) = 1, for a terminal state x_3 = ACB.

Solution to Inventory Example 2.1.1:
Consider again the stochastic inventory problem described in Example 1. The application of the DP algorithm is very similar to the deterministic case, except for the fact that now costs are computed as expected values.
Figure 2.4.2: Graph of one-step switching costs for Example 2 (nodes: empty schedule, A, B, C, D; the numbers on the arcs are the one-step switching costs).
Tail subproblems of length 1: The optimal cost for the last period is
    J_{N−1}(x_{N−1}) = min_{u_{N−1} ≥ 0} E_{w_{N−1}}[ c u_{N−1} + r(x_{N−1} + u_{N−1} − w_{N−1}) ].
Note:
u_{N−1} = μ_{N−1}(x_{N−1}) depends on x_{N−1}.
J_{N−1}(x_{N−1}) may be computed numerically and stored as a column of a table.

Tail subproblems of length N − k: The optimal cost for period k is
    J_k(x_k) = min_{u_k ≥ 0} E_{w_k}[ c u_k + r(x_k + u_k − w_k) + J_{k+1}(x_k + u_k − w_k) ],
where x_{k+1} = x_k + u_k − w_k is the initial inventory of the next period.

The value J_0(x_0) is the optimal expected cost when the initial stock at time 0 is x_0.

If the number of attainable states x_k is discrete with finite support [0, S], the output of the DP algorithm could be stored in two tables (one for the optimal cost J_k, and one for the optimal control u_k), each table consisting of N columns labeled from k = 0 to k = N−1, and S + 1 rows labeled from 0 to S. The tables are filled by the DP algorithm from right to left.
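A minimal numerical sketch of how the two tables can be filled from right to left. The ordering cost c, the holding/shortage cost r(·), the demand distribution, the state cap S, and the zero terminal cost are all assumptions chosen only for illustration; inventory is clamped at zero (lost sales) so that the state stays in [0, S].

```python
import numpy as np

# Assumed data: states 0..S, purchase cost c, demand w in {0,1,2} with given probs.
S, N, c = 10, 4, 1.0
demand, probs = [0, 1, 2], [0.2, 0.5, 0.3]
r = lambda y: 2.0 * max(y, 0) + 4.0 * max(-y, 0)    # holding / shortage cost
g_terminal = lambda x: 0.0                          # terminal cost g_N

J = np.zeros((N + 1, S + 1))            # cost-to-go table J_k(x)
U = np.zeros((N, S + 1), dtype=int)     # optimal order quantity table
J[N] = [g_terminal(x) for x in range(S + 1)]

for k in range(N - 1, -1, -1):          # fill the tables from right to left
    for x in range(S + 1):
        best_cost, best_u = np.inf, 0
        for u in range(S - x + 1):      # keep x + u within [0, S]
            exp_cost = 0.0
            for w, p in zip(demand, probs):
                nxt = max(x + u - w, 0)                 # next period's inventory
                exp_cost += p * (c * u + r(x + u - w) + J[k + 1][nxt])
            if exp_cost < best_cost:
                best_cost, best_u = exp_cost, u
        J[k][x], U[k][x] = best_cost, best_u

print(U[0])     # optimal first-period order as a function of the initial stock
```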
DP algorithm: Start with
    J_N(x_N) = g_N(x_N),
and go backwards using the recursion
    J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ],  k = 0, 1, ..., N−1,    (2.4.1)
where the expectation is taken with respect to the probability distribution of w_k, which may depend on x_k and u_k.
Proposition 2.4.1 For every initial state x_0, the optimal cost J*(x_0) of the basic problem is equal to J_0(x_0), given by the last step of the DP algorithm. Furthermore, if u*_k = μ*_k(x_k) minimizes the RHS of (2.4.1) for each x_k and k, the policy π* = {μ*_0, ..., μ*_{N−1}} is optimal.
Figure 2.4.3: Transition graph for Example 2. Next to each node/state we show the cost of optimally completing the scheduling starting from that state. This is the optimal cost of the corresponding tail subproblem. The optimal cost for the original problem is equal to 10. The optimal schedule is CABD.
Observations:
Justification: proof by induction that J_k(x_k) is equal to J*_k(x_k), defined as the optimal cost of the tail subproblem that starts at time k at state x_k.
All the tail subproblems are solved in addition to the original problem. Observe the intensive computational requirements. The worst-case computational complexity is Σ_{k=0}^{N−1} |S_k| |U_k|, where |S_k| is the size of the state space in period k, and |U_k| is the size of the control space in period k. In particular, note that potentially we could need to search over the whole control space, although we just store the optimal one in each period-state pair.
Proof of Proposition 2.4.1
For this version of the proof, we need the following additional assumptions:
The disturbance w_k takes a finite or countable number of values.
The expected values of all stage costs are finite for every admissible policy.
The functions J_k(x_k) generated by the DP algorithm are finite for all states x_k and periods k.

Informal argument
Let π^k = {μ_k, μ_{k+1}, ..., μ_{N−1}} denote a tail policy from time k onward.
Border case: For k = N, define J*_N(x_N) = J_N(x_N) = g_N(x_N).
For J_{k+1}(x_{k+1}) generated by the DP algorithm and the optimal J*_{k+1}(x_{k+1}), assume that J_{k+1}(x_{k+1}) = J*_{k+1}(x_{k+1}). Then

J*_k(x_k) = min_{(μ_k, π^{k+1})} E_{w_k, w_{k+1}, ..., w_{N−1}}[ g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) + Σ_{i=k+1}^{N−1} g_i(x_i, μ_i(x_i), w_i) ]

= min_{μ_k} E_{w_k}[ g_k(x_k, μ_k(x_k), w_k) + min_{π^{k+1}} E_{w_{k+1}, ..., w_{N−1}}[ g_N(x_N) + Σ_{i=k+1}^{N−1} g_i(x_i, μ_i(x_i), w_i) ] ]
    (this is the informal step, since we are moving the min inside the E[·])

= min_{μ_k} E_{w_k}[ g_k(x_k, μ_k(x_k), w_k) + J*_{k+1}(f_k(x_k, μ_k(x_k), w_k)) ]    (by def. of J*_{k+1})

= min_{μ_k} E_{w_k}[ g_k(x_k, μ_k(x_k), w_k) + J_{k+1}(f_k(x_k, μ_k(x_k), w_k)) ]    (by induction hypothesis (IH))

= min_{u_k ∈ U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ]

= J_k(x_k),

where the second to last equality follows from converting a minimization problem over functions μ_k to a minimization problem over scalars u_k. In symbols, for any function F of x and u, we have
    min_{μ ∈ M} F(x, μ(x)) = min_{u ∈ U(x)} F(x, u),
where M is the set of all functions μ(x) such that μ(x) ∈ U(x) for all x.
A more formal argument
For any admissible policy π = {μ_0, ..., μ_{N−1}} and each k = 0, 1, ..., N−1, denote π^k = {μ_k, μ_{k+1}, ..., μ_{N−1}}.
For k = 0, 1, ..., N−1, let J*_k(x_k) be the optimal cost for the (N−k)-stage problem that starts at state x_k and time k, and ends at time N; that is,
    J*_k(x_k) = min_{π^k} E[ g_N(x_N) + Σ_{i=k}^{N−1} g_i(x_i, μ_i(x_i), w_i) ].
For k = N, we define J*_N(x_N) = g_N(x_N). We will show by backward induction that the functions J*_k are equal to the functions J_k generated by the DP algorithm, so that for k = 0 we get the desired result.

Start by defining, for any ε > 0, and for all k and x_k, an admissible control μ̄_k(x_k) ∈ U_k(x_k) for the DP recursion (2.4.1) such that
    E_{w_k}[ g_k(x_k, μ̄_k(x_k), w_k) + J_{k+1}(f_k(x_k, μ̄_k(x_k), w_k)) ] ≤ J_k(x_k) + ε.    (2.4.2)

Because of our former assumption, the function J_{k+1} generated by the DP algorithm is well defined and finite for all k and x_{k+1} ∈ S_{k+1}. Let J̄_k(x_k) be the expected cost when using the policy {μ̄_k, ..., μ̄_{N−1}}. We will show by induction that for all x_k and k, it must hold that
    J_k(x_k) ≤ J̄_k(x_k) ≤ J_k(x_k) + (N−k)ε,    (2.4.3)
    J*_k(x_k) ≤ J̄_k(x_k) ≤ J*_k(x_k) + (N−k)ε,    (2.4.4)
    J_k(x_k) = J*_k(x_k).    (2.4.5)
For k = N−1, we have
    J_{N−1}(x_{N−1}) = min_{u_{N−1} ∈ U_{N−1}(x_{N−1})} E_{w_{N−1}}[ g_{N−1}(x_{N−1}, u_{N−1}, w_{N−1}) + g_N(x_N) ],
    J*_{N−1}(x_{N−1}) = min_{μ_{N−1}} E_{w_{N−1}}[ g_{N−1}(x_{N−1}, μ_{N−1}(x_{N−1}), w_{N−1}) + g_N(x_N) ].
Both minimizations guarantee the LHS inequalities in (2.4.3) and (2.4.4) when comparing versus
    J̄_{N−1}(x_{N−1}) = E_{w_{N−1}}[ g_{N−1}(x_{N−1}, μ̄_{N−1}(x_{N−1}), w_{N−1}) + g_N(x_N) ],
with μ̄_{N−1}(x_{N−1}) ∈ U_{N−1}(x_{N−1}). The RHS inequalities there hold just by the construction in (2.4.2). By taking ε → 0 in equations (2.4.3) and (2.4.4), it is also seen that J_{N−1}(x_{N−1}) = J*_{N−1}(x_{N−1}).
Suppose that equations (2.4.3)-(2.4.5) hold for period k+1. For period k, we have:
    J̄_k(x_k) = E_{w_k}[ g_k(x_k, μ̄_k(x_k), w_k) + J̄_{k+1}(f_k(x_k, μ̄_k(x_k), w_k)) ]    (by definition of J̄_k(x_k))
    ≤ E_{w_k}[ g_k(x_k, μ̄_k(x_k), w_k) + J_{k+1}(f_k(x_k, μ̄_k(x_k), w_k)) ] + (N−k−1)ε    (by IH)
    ≤ J_k(x_k) + ε + (N−k−1)ε    (by equation (2.4.2))
    = J_k(x_k) + (N−k)ε.

We also have
    J̄_k(x_k) = E_{w_k}[ g_k(x_k, μ̄_k(x_k), w_k) + J̄_{k+1}(f_k(x_k, μ̄_k(x_k), w_k)) ]    (by definition of J̄_k(x_k))
    ≥ E_{w_k}[ g_k(x_k, μ̄_k(x_k), w_k) + J_{k+1}(f_k(x_k, μ̄_k(x_k), w_k)) ]    (by IH)
    ≥ min_{u_k ∈ U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ]    (by min over all admissible controls)
    = J_k(x_k).

Combining the preceding two relations, we see that equation (2.4.3) holds.
In addition, for every policy π = {μ_0, μ_1, ..., μ_{N−1}}, we have
    J̄_k(x_k) = E_{w_k}[ g_k(x_k, μ̄_k(x_k), w_k) + J̄_{k+1}(f_k(x_k, μ̄_k(x_k), w_k)) ]    (by definition of J̄_k(x_k))
    ≤ E_{w_k}[ g_k(x_k, μ̄_k(x_k), w_k) + J_{k+1}(f_k(x_k, μ̄_k(x_k), w_k)) ] + (N−k−1)ε    (by IH)
    ≤ min_{u_k ∈ U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ] + (N−k)ε    (by (2.4.2))
    ≤ E_{w_k}[ g_k(x_k, μ_k(x_k), w_k) + J_{k+1}(f_k(x_k, μ_k(x_k), w_k)) ] + (N−k)ε
        (since μ_k(x_k) is an admissible control for period k)
        (note that by IH, J_{k+1} is the optimal cost starting from period k+1)
    ≤ E_{w_k}[ g_k(x_k, μ_k(x_k), w_k) + J_{π^{k+1}}(f_k(x_k, μ_k(x_k), w_k)) ] + (N−k)ε
        (where π^{k+1} is an admissible policy starting from period k+1)
    = J_{π^k}(x_k) + (N−k)ε,    (for π^k = (μ_k, π^{k+1}))

Since π^k is any admissible policy, taking the minimum over π^k in the preceding relation, we obtain, for all x_k,
    J̄_k(x_k) ≤ J*_k(x_k) + (N−k)ε.
We also have, by the definition of the optimal cost J*_k, for all x_k,
    J*_k(x_k) ≤ J̄_k(x_k).
Combining the preceding two relations, we see that equation (2.4.4) holds for period k. Finally, equation (2.4.5) follows from equations (2.4.3) and (2.4.4) by taking ε → 0, and the induction is complete.
Example 2.4.1 (Linear-quadratic example)
A certain material is passed through a sequence of two ovens (see Figure 2.4.4). Let
    x_0: initial temperature of the material,
    x_k, k = 1, 2: temperature of the material at the exit of oven k,
    u_0: prevailing temperature in Oven 1,
    u_1: prevailing temperature in Oven 2.

Figure 2.4.4: System dynamics of Example 5: the material enters Oven 1 (temperature u_0) at temperature x_0, exits at temperature x_1, passes through Oven 2 (temperature u_1), and exits at final temperature x_2.

Consider a system equation
    x_{k+1} = (1 − a) x_k + a u_k,  k = 0, 1,
where a ∈ (0, 1) is a given scalar. Note that the system equation is linear in the control and the state. The objective is to get a final temperature x_2 close to a given target T, while expending relatively little energy. This is expressed by a total cost function of the form
    r (x_2 − T)² + u_0² + u_1²,
where r is a scalar. In this way, we are penalizing quadratically a deviation from the target T. Note that the cost is quadratic in both controls and states.

We can cast this problem into a DP framework by setting N = 2, and a terminal cost g_2(x_2) = r (x_2 − T)², so that the border condition for the algorithm is
    J_2(x_2) = r (x_2 − T)².
Proceeding backwards, we have
    J_1(x_1) = min_{u_1 ≥ 0} { u_1² + J_2(x_2) }
             = min_{u_1 ≥ 0} { u_1² + J_2((1 − a) x_1 + a u_1) }
             = min_{u_1 ≥ 0} { u_1² + r ((1 − a) x_1 + a u_1 − T)² }.    (2.4.6)
This is a quadratic function in u_1 that we can solve by setting to zero the derivative with respect to u_1. We will get an expression u_1 = μ*_1(x_1) depending linearly on x_1. By substituting back this expression for u_1 into (2.4.6), we obtain a closed form expression for J_1(x_1), which is quadratic in x_1.

Proceeding backwards further,
    J_0(x_0) = min_{u_0 ≥ 0} { u_0² + J_1((1 − a) x_0 + a u_0) }.
Since J_1(·) is quadratic in its argument, it is quadratic in u_0. We minimize with respect to u_0 by setting the corresponding derivative to zero, which will depend on x_0. The optimal temperature of the first oven will be a function μ*_0(x_0). The optimal cost is obtained by substituting this expression in the formula for J_0.
2.4.1 Exercises
Exercise 2.4.1 Consider the system
    x_{k+1} = x_k + u_k + w_k,  k = 0, 1, 2, 3,
with initial state x_0 = 5, and cost function
    Σ_{k=0}^{3} (x_k² + u_k²).
Apply the DP algorithm for the following three cases:
(a) The control constraint set is U_k(x_k) = {u : 0 ≤ x_k + u ≤ 5, u integer}, for all x_k and k, and the disturbance verifies w_k = 0 for all k.
(b) The control constraint and the disturbance w_k are as in part (a), but there is in addition a constraint x_4 = 5 on the final state.
Hint: For this problem you need to define a state space for x_4 that consists of just the value x_4 = 5, and to redefine U_3(x_3). Alternatively, you may use a terminal cost g_4(x_4) equal to a very large number for x_4 ≠ 5.
(c) The control constraint is as in part (a), and the disturbance w_k takes the values −1 and 1 with probability 1/2 each, for all x_k and k, except if x_k + u_k is equal to 0 or 5, in which case w_k = 0 with probability 1.
Note: In this exercise (and in the exercises below), when the output of the DP algorithm is requested, submit the tables describing state x_k, optimal cost J_k(x_k), and optimal control μ_k(x_k), for periods k = 0, 1, ..., N−1 (e.g., in this case, N = 4).
Exercise 2.4.2 Suppose that we have a machine that is either running or is broken down. If it runs throughout one week, it makes a gross profit of $100. If it fails during the week, gross profit is zero. If it is running at the start of the week and we perform preventive maintenance, the probability that it will fail during the week is 0.4. If we do not perform such maintenance, the probability of failure is 0.7. However, maintenance will cost $20.
When the machine is broken down at the start of the week, it may either be repaired at a cost of $40, in which case it will fail during the week with a probability of 0.4, or it may be replaced at a cost of $150 by a new machine that is guaranteed to run its first week of operation.
Find the optimal repair, replacement, and maintenance policy that maximizes total profit over four weeks, assuming a new machine at the start of the first week.
Exercise 2.4.3 In the framework of the basic problem, consider the case where the cost is of the form
    E_{w_k, k=0,1,...,N−1}[ α^N g_N(x_N) + Σ_{k=0}^{N−1} α^k g_k(x_k, u_k, w_k) ],
where α is a discount factor with 0 < α < 1. Show that an alternate form of the DP algorithm is given by
    V_N(x_N) = g_N(x_N),
    V_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k}{ g_k(x_k, u_k, w_k) + α V_{k+1}(f_k(x_k, u_k, w_k)) }.
Exercise 2.4.4 In the framework of the basic problem, consider the case where the system evolution terminates at time i when a given value w̄ of the disturbance at time i occurs, or when a termination decision u_i is made by the controller. If termination occurs at time i, the resulting cost is
    T + Σ_{k=0}^{i} g_k(x_k, u_k, w_k),
where T is a termination cost. If the process has not terminated up to the final time N, the resulting cost is g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k). Reformulate the problem into the framework of the basic problem.
Hint: Augment the state space with a special termination state.
Exercise 2.4.5 For the Stock Option problem discussed in class, where the time index runs backwards (i.e., period 0 is the terminal stage), prove the following statements:
1. The optimal cost function J_n(s) is increasing and continuous in s.
2. The optimal cost function J_n(s) is increasing in n.
3. If E[F] ≥ 0 and we do not exercise the option if expected profit is zero, then the option is never exercised before maturity.
Exercise 2.4.6 Consider a device consisting of N stages connected in series, where each stage consists of a particular component. The components are subject to failure, and to increase the reliability of the device duplicate components are provided. For j = 1, 2, ..., N, let (1 + m_j) be the number of components for the jth stage (one mandatory component, and m_j backup ones), let p_j(m_j) be the probability of successful operation when (1 + m_j) components are used, and let c_j denote the cost of a single backup component at the jth stage. Formulate in terms of DP the problem of finding the number of components at each stage that maximizes the reliability of the device, expressed by the product
    p_1(m_1) · p_2(m_2) ··· p_N(m_N),
subject to the cost constraint Σ_{j=1}^{N} c_j m_j ≤ A, where A > 0 is given.
Exercise 2.4.7 (Monotonicity Property of DP) An evident, yet very important property of the DP algorithm is that if the terminal cost g_N is changed to a uniformly larger cost ḡ_N (i.e., ḡ_N(x_N) ≥ g_N(x_N) for all x_N), then clearly the last stage cost-to-go will be uniformly increased (i.e., J̄_{N−1}(x_{N−1}) ≥ J_{N−1}(x_{N−1})).

More generally, given two functions J_{k+1} and J̄_{k+1}, with J_{k+1}(x_{k+1}) ≤ J̄_{k+1}(x_{k+1}) for all x_{k+1}, we have, for all x_k and u_k ∈ U_k(x_k),
    E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ] ≤ E_{w_k}[ g_k(x_k, u_k, w_k) + J̄_{k+1}(f_k(x_k, u_k, w_k)) ].

Suppose now that in the basic problem the system and cost are time invariant; that is, S_k ≡ S, C_k ≡ C, D_k ≡ D, f_k ≡ f, U_k ≡ U, P_k ≡ P, and g_k ≡ g. Show that if in the DP algorithm we have J_{N−1}(x) ≤ J_N(x) for all x ∈ S, then
    J_k(x) ≤ J_{k+1}(x), for all x ∈ S and k.
Similarly, if we have J_{N−1}(x) ≥ J_N(x) for all x ∈ S, then
    J_k(x) ≥ J_{k+1}(x), for all x ∈ S and k.
2.5 Linear-Quadratic Regulator
2.5.1 Preliminaries: Review of linear algebra and quadratic forms
We will be using some results of linear algebra. Here is a summary of them:
1. Given a matrix A, we let A′ be its transpose. It holds that (AB)′ = B′A′, and (Aⁿ)′ = (A′)ⁿ.
2. The rank of a matrix A ∈ R^{m×n} is equal to the maximum number of linearly independent row (column) vectors. The matrix is said to be of full rank if rank(A) = min{m, n}. A square matrix is of full rank if and only if it is nonsingular.
3. rank(A) = rank(A′).
4. Given a matrix A ∈ R^{n×n}, the determinant of the matrix λI − A, where I is the n×n identity matrix and λ is a scalar, is an nth degree polynomial. The n roots of this polynomial are called the eigenvalues of A. Thus, λ is an eigenvalue of A if and only if the matrix λI − A is singular (i.e., it does not have an inverse), or equivalently, if there exists a vector v ≠ 0 such that Av = λv. Such a vector v is called an eigenvector corresponding to λ.
   The eigenvalues and eigenvectors of A can be complex even if A is real.
   A matrix A is singular if and only if it has an eigenvalue that is equal to zero.
   If A is nonsingular, then the eigenvalues of A^{−1} are the reciprocals of the eigenvalues of A.
   The eigenvalues of A and A′ coincide.
5. A square symmetric n×n matrix A is said to be positive semidefinite if x′Ax ≥ 0 for all x ∈ Rⁿ, x ≠ 0. It is said to be positive definite if x′Ax > 0 for all x ∈ Rⁿ, x ≠ 0. We write A ≥ 0 and A > 0 to denote positive semidefiniteness and definiteness, respectively.
6. If A is an n×n positive semidefinite symmetric matrix and C is an m×n matrix, then the matrix CAC′ is positive semidefinite symmetric. If A is positive definite symmetric, and C has rank m (equivalently, m ≤ n and C has full rank), then CAC′ is positive definite symmetric.
7. An n×n positive definite matrix A can be written as CC′, where C is a square invertible matrix. If A is positive semidefinite symmetric and its rank is m, then it can be written as CC′, where C is an n×m matrix of full rank.
8. The expected value of a random vector x = (x_1, ..., x_n) is the vector
    E[x] = (E[x_1], ..., E[x_n]).
The covariance matrix of a random vector x with expected value x̄ = E[x] is defined to be the n×n positive semidefinite matrix whose (i, j) entry is E[(x_i − x̄_i)(x_j − x̄_j)]; that is, its diagonal entries are E[(x_1 − x̄_1)²], ..., E[(x_n − x̄_n)²] and its off-diagonal entries are the corresponding cross products.
9. Let f : Rⁿ → R be a quadratic form
    f(x) = (1/2) x′Qx + b′x,
where Q is a symmetric n×n matrix and b ∈ Rⁿ. Its gradient is given by
    ∇f(x) = Qx + b.
The function f is convex if and only if Q is positive semidefinite. If Q is positive definite, then f is convex and Q is invertible, so a vector x* minimizes f if and only if
    ∇f(x*) = Qx* + b = 0,
or equivalently, x* = −Q^{−1}b.
2.5.2 Problem setup
System equation: x_{k+1} = A_k x_k + B_k u_k + w_k  [linear in both state and control].
Quadratic cost:
    E_{w_0, ..., w_{N−1}}[ x_N′ Q_N x_N + Σ_{k=0}^{N−1} ( x_k′ Q_k x_k + u_k′ R_k u_k ) ],
where:
Q_k ≥ 0 are square, symmetric, positive semidefinite matrices with appropriate dimension,
R_k > 0 are square, symmetric, positive definite matrices with appropriate dimension,
the disturbances w_k are independent with E[w_k] = 0, and do not depend on x_k nor on u_k (the case E[w_k] ≠ 0 will be discussed later, in Section 2.5.7),
the controls u_k are unconstrained.
DP Algorithm:
    J_N(x_N) = x_N′ Q_N x_N,    (2.5.1)
    J_k(x_k) = min_{u_k} E_{w_k}[ x_k′ Q_k x_k + u_k′ R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) ].    (2.5.2)
Intuition: The purpose of this DP is to bring the state close to x_k = 0 as soon as possible. Any deviation from zero is penalized quadratically.
2.5.3 Properties
J_k(x_k) is quadratic in x_k.
The optimal policy {μ*_0, μ*_1, ..., μ*_{N−1}} is linear, i.e. μ*_k(x_k) = L_k x_k.
Several variants of the problem admit a similar treatment:
Variant 1: nonzero mean w_k.
Variant 2: shifted problem, i.e., set the target at a vector x̄_k rather than at zero:
    E[ (x_N − x̄_N)′ Q_N (x_N − x̄_N) + Σ_{k=0}^{N−1} ( (x_k − x̄_k)′ Q_k (x_k − x̄_k) + u_k′ R_k u_k ) ].
2.5.4 Derivation
By induction, we want to verify that
    μ*_k(x_k) = L_k x_k  and  J_k(x_k) = x_k′ K_k x_k + constant,
where the L_k are gain matrices (see footnote 5) given by
    L_k = −(B_k′ K_{k+1} B_k + R_k)^{−1} B_k′ K_{k+1} A_k,
and where the K_k are symmetric positive semidefinite matrices given by
    K_N = Q_N,
    K_k = A_k′ ( K_{k+1} − K_{k+1} B_k (B_k′ K_{k+1} B_k + R_k)^{−1} B_k′ K_{k+1} ) A_k + Q_k.    (2.5.3)
The above equation is called the discrete time Riccati equation. Just like DP, it starts at the terminal time N and proceeds backwards.

We will show that the optimal policy (but not the optimal cost) is the same as when w_k is replaced by E[w_k] = 0 (a property known as certainty equivalence).

The induction argument proceeds as follows. Write equation (2.5.2) for N−1:
    J_{N−1}(x_{N−1}) = min_{u_{N−1}} E_{w_{N−1}}{ x_{N−1}′ Q_{N−1} x_{N−1} + u_{N−1}′ R_{N−1} u_{N−1}
        + (A_{N−1} x_{N−1} + B_{N−1} u_{N−1} + w_{N−1})′ Q_N (A_{N−1} x_{N−1} + B_{N−1} u_{N−1} + w_{N−1}) }
    = x_{N−1}′ Q_{N−1} x_{N−1} + min_{u_{N−1}} { u_{N−1}′ R_{N−1} u_{N−1} + u_{N−1}′ B_{N−1}′ Q_N B_{N−1} u_{N−1}
        + 2 x_{N−1}′ A_{N−1}′ Q_N B_{N−1} u_{N−1} + x_{N−1}′ A_{N−1}′ Q_N A_{N−1} x_{N−1} + E[w_{N−1}′ Q_N w_{N−1}] },    (2.5.4)

Footnote 5: The idea is that L_k represents how much we gain in our path towards the target zero.
where in the expansion of the last term in the second line we are using the fact that, since E[w] = 0, then E[w_{N−1}′ Q_N (A_{N−1} x_{N−1} + B_{N−1} u_{N−1})] = 0.

By differentiating the equation w.r.t. u_{N−1}, and setting the derivative equal to 0, we get
    ( R_{N−1} + B_{N−1}′ Q_N B_{N−1} ) u_{N−1} = −B_{N−1}′ Q_N A_{N−1} x_{N−1},
where the matrix on the left is the sum of a positive definite and a positive semidefinite matrix, hence positive definite and invertible, leading to
    u*_{N−1} = −(R_{N−1} + B_{N−1}′ Q_N B_{N−1})^{−1} B_{N−1}′ Q_N A_{N−1} x_{N−1} = L_{N−1} x_{N−1}.
By substitution into (2.5.4), we get
    J_{N−1}(x_{N−1}) = x_{N−1}′ K_{N−1} x_{N−1} + E[w_{N−1}′ Q_N w_{N−1}],    (2.5.5)
where the last term is nonnegative and
    K_{N−1} = A_{N−1}′ ( Q_N − Q_N B_{N−1} (B_{N−1}′ Q_N B_{N−1} + R_{N−1})^{−1} B_{N−1}′ Q_N ) A_{N−1} + Q_{N−1}.
Facts:
The matrix K_{N−1} is symmetric, since K_{N−1} = K_{N−1}′.
Claim: K_{N−1} ≥ 0 (we need this result to prove that the matrix inverted in the formula for L_{N−2} exists).
Proof: From (2.5.5) we have
    x_{N−1}′ K_{N−1} x_{N−1} = J_{N−1}(x_{N−1}) − E[w_{N−1}′ Q_N w_{N−1}].    (2.5.6)
So,
    x′ K_{N−1} x = x′ Q_{N−1} x + min_u { u′ R_{N−1} u + (A_{N−1} x + B_{N−1} u)′ Q_N (A_{N−1} x + B_{N−1} u) },
with Q_{N−1} ≥ 0, R_{N−1} > 0, and Q_N ≥ 0. Note that the E[w_{N−1}′ Q_N w_{N−1}] in J_{N−1}(x_{N−1}) cancels out with the one in (2.5.6). Thus, the expression in brackets is nonnegative for every u. The minimization over u preserves nonnegativity, showing that K_{N−1} is also positive semidefinite.
In conclusion, since
    J_{N−1}(x_{N−1}) = x_{N−1}′ K_{N−1} x_{N−1} + constant
is a positive semidefinite quadratic function (plus an inconsequential constant term), we may proceed backward and obtain from the DP equation (2.5.2) the optimal control law for stage N−2. As earlier, we show that J_{N−2} is positive semidefinite, and by proceeding sequentially we obtain
    u*_k = μ*_k(x_k) = L_k x_k,
where
    L_k = −(B_k′ K_{k+1} B_k + R_k)^{−1} B_k′ K_{k+1} A_k,
and where the symmetric positive semidefinite matrices K_k are given recursively by
    K_N = Q_N,
    K_k = A_k′ ( K_{k+1} − K_{k+1} B_k (B_k′ K_{k+1} B_k + R_k)^{−1} B_k′ K_{k+1} ) A_k + Q_k.
Just like DP, this algorithm starts at the terminal time N and proceeds backwards. The optimal cost is given by
    J_0(x_0) = x_0′ K_0 x_0 + Σ_{k=0}^{N−1} E[ w_k′ K_{k+1} w_k ].
The control u*_k and the system equation lead to the linear feedback structure represented in Figure 2.5.1.
Figure 2.5.1: Linear feedback structure of the optimal controller for the linear-quadratic problem: at each stage the control μ_k(x_k) = L_k x_k is applied, the noise w_k is realized, and the state evolves as x_{k+1} = A_k x_k + B_k u_k + w_k.
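The backward recursion for K_k and L_k is easy to implement. Below is a minimal numerical sketch (not from the text) that assumes time-invariant matrices A, B, Q, R with illustrative values chosen by me; it returns the full sequence of gains, and for a long horizon the early gains approach the steady-state gain discussed in Section 2.5.5.

```python
import numpy as np

def lqr_gains(A, B, Q, R, QN, N):
    """Backward Riccati recursion (2.5.3): returns K_0..K_N and gains L_0..L_{N-1}."""
    K = [None] * (N + 1)
    L = [None] * N
    K[N] = QN
    for k in range(N - 1, -1, -1):
        M = B.T @ K[k + 1] @ B + R                       # positive definite
        L[k] = -np.linalg.solve(M, B.T @ K[k + 1] @ A)   # gain L_k
        K[k] = A.T @ (K[k + 1]
                      - K[k + 1] @ B @ np.linalg.solve(M, B.T @ K[k + 1])) @ A + Q
    return K, L

# Illustrative data (assumed): a 2-state, 1-control system.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = QN = np.eye(2)
R = np.array([[1.0]])
K, L = lqr_gains(A, B, Q, R, QN, N=20)
print(L[0])      # the gain at k = 0; it is close to the steady-state gain for N large
```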
2.5.5 Asymptotic behavior of the Riccati equation
The objective of this section is to prove the following result: if the matrices A_k, B_k, Q_k and R_k are constant and equal to A, B, Q, R respectively, then K_k → K as k → ∞ (i.e., when we have many periods ahead), where K satisfies the algebraic Riccati equation
    K = A′( K − K B (B′ K B + R)^{−1} B′ K )A + Q,
with K ≥ 0 its unique solution (within the class of positive semidefinite matrices). This property indicates that for the system
    x_{k+1} = A x_k + B u_k + w_k,  k = 0, 1, 2, ..., N−1,
and a large N, one can approximate the control μ*_k(x_k) = L_k x_k by the steady state control
    μ*(x) = L x,
where
    L = −(B′ K B + R)^{−1} B′ K A.
Before proving the above result, we need to introduce three notions: controllability, observability, and stability.

Definition 2.5.1 A pair of matrices (A, B), where A ∈ R^{n×n} and B ∈ R^{n×m}, is said to be controllable if the n × (n·m) matrix [B, AB, A²B, ..., A^{n−1}B] has full rank.

Definition 2.5.2 A pair (A, C), A ∈ R^{n×n}, C ∈ R^{m×n}, is said to be observable if the pair (A′, C′) is controllable.
The next two claims provide intuition for the previous definitions:

Claim: If the pair (A, B) is controllable, then for any initial state x_0 there exists a sequence of control vectors u_0, u_1, ..., u_{n−1} that forces the state x_n of the system x_{k+1} = A x_k + B u_k to be equal to zero at time n.

Proof: By successively applying the equation x_{k+1} = A x_k + B u_k, for k = n−1, n−2, ..., 0, we obtain
    x_n = Aⁿ x_0 + B u_{n−1} + A B u_{n−2} + ··· + A^{n−1} B u_0,
or equivalently
    x_n − Aⁿ x_0 = (B, AB, ..., A^{n−1}B)(u_{n−1}, u_{n−2}, ..., u_1, u_0)′.    (2.5.7)
Since (A, B) is controllable, (B, AB, ..., A^{n−1}B) has full rank and spans the whole space Rⁿ. Hence, we can find (u_{n−1}, u_{n−2}, ..., u_1, u_0) such that
    (B, AB, ..., A^{n−1}B)(u_{n−1}, u_{n−2}, ..., u_1, u_0)′ = v,
for any vector v ∈ Rⁿ. In particular, by setting v = −Aⁿ x_0, we obtain x_n = 0 in equation (2.5.7).

In words: the system equation x_{k+1} = A x_k + B u_k under controllable matrices (A, B) in the space Rⁿ warrants convergence to the zero vector in exactly n steps.
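The constructive nature of this claim can be checked numerically: build the controllability matrix, verify its rank, and solve (2.5.7) for controls that drive the state to zero in n steps. The sketch below is mine; the matrices and the initial state are illustrative assumptions.

```python
import numpy as np

def controllability_matrix(A, B):
    n = A.shape[0]
    blocks = [np.linalg.matrix_power(A, k) @ B for k in range(n)]
    return np.hstack(blocks)              # [B, AB, ..., A^{n-1} B]

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
C_mat = controllability_matrix(A, B)
print(np.linalg.matrix_rank(C_mat))       # rank 2 => (A, B) is controllable

# Solve (B, AB, ..., A^{n-1}B)(u_{n-1}, ..., u_0)' = -A^n x_0 for the controls.
n, m = A.shape[0], B.shape[1]
x0 = np.array([3.0, -2.0])
u_stacked, *_ = np.linalg.lstsq(C_mat, -np.linalg.matrix_power(A, n) @ x0, rcond=None)

# Re-order the blocks as u_0, ..., u_{n-1} (u_0 is the last block) and simulate.
u_seq = [u_stacked[(n - 1 - k) * m:(n - k) * m] for k in range(n)]
x = x0.copy()
for u in u_seq:
    x = A @ x + B @ u
print(x)                                   # numerically zero after n steps
```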
Claim: Suppose that (A, C) is observable (i.e., (A′, C′) is controllable). In the context of estimation problems, given measurements z_0, z_1, ..., z_{n−1} of the form
    z_k = C x_k,  with z_k ∈ R^m, C ∈ R^{m×n}, x_k ∈ Rⁿ,
it is possible to uniquely infer the initial state x_0 of the system x_{k+1} = A x_k.

Proof: In view of the relations
    z_0 = C x_0,
    x_1 = A x_0, so z_1 = C x_1 = C A x_0,
    x_2 = A x_1 = A² x_0, so z_2 = C x_2 = C A² x_0,
    ...
    z_{n−1} = C x_{n−1} = C A^{n−1} x_0,
or in matrix form,
    (z_0, z_1, ..., z_{n−1})′ = (C, CA, ..., CA^{n−1})′ x_0,    (2.5.8)
where (C, CA, ..., CA^{n−1}) has full rank n, there is a unique x_0 that satisfies (2.5.8).

To get the previous result we are using the following: if (A, C) is observable, then (A′, C′) is controllable. So the matrix
    (C′, A′C′, (A′)²C′, ..., (A′)^{n−1}C′) = (C′, (CA)′, (CA²)′, ..., (CA^{n−1})′)
has full rank, and therefore so does its transpose, the matrix whose rows are C, CA, CA², ..., CA^{n−1}, which completes the argument.

In words: the system equation x_{k+1} = A x_k under observable matrices (A, C) allows us to infer the initial state from a sequence of observations z_0, z_1, ..., z_{n−1} given by z_k = C x_k.
Stability: The concept of stability refers to the fact that, in the absence of random disturbance, the dynamics of the system driven by the control μ(x) = Lx bring the state
    x_{k+1} = A x_k + B u_k = (A + BL) x_k,  k = 0, 1, ...,
towards zero as k → ∞. For any x_0, since x_k = (A + BL)^k x_0, it follows that the closed-loop system is stable if and only if (A + BL)^k → 0, or equivalently, if and only if the eigenvalues of the matrix (A + BL) are strictly within the unit circle of the complex plane.

Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C), where Q = C′C. Then the Riccati equation (2.5.3) converges:
    lim_{k→∞} K_k = K,
where K is positive definite, and is the unique (within the class of positive semidefinite matrices) solution of the algebraic Riccati equation
    K = A′( K − K B (B′ K B + R)^{−1} B′ K )A + Q.
The following proposition formalizes this result. To simplify notation, we reverse the time indexing of the Riccati equation. Thus, P_k corresponds to K_{N−k} in (2.5.3).
Proposition 2.5.1 Let A be an n×n matrix, B an n×m matrix, Q an n×n positive semidefinite symmetric matrix, and R an m×m positive definite symmetric matrix. Consider the discrete-time Riccati equation
    P_{k+1} = A′( P_k − P_k B (B′ P_k B + R)^{−1} B′ P_k )A + Q,  k = 0, 1, ...,    (2.5.9)
where the initial matrix P_0 is an arbitrary positive semidefinite symmetric matrix. Assume that the pair (A, B) is controllable. Assume also that Q may be written as Q = C′C, where the pair (A, C) is observable. Then,
(a) There exists a positive definite symmetric matrix P such that for every positive semidefinite symmetric initial matrix P_0 we have lim_{k→∞} P_k = P. Furthermore, P is the unique solution of the algebraic matrix equation
    P = A′( P − P B (B′ P B + R)^{−1} B′ P )A + Q
within the class of positive semidefinite symmetric matrices.
(b) The corresponding closed-loop system is stable; that is, the eigenvalues of the matrix
    D = A + BL,
where
    L = −(B′ P B + R)^{−1} B′ P A,
are strictly within the unit circle of the complex plane.

Observations:
The implication of the observability assumption in the proposition is that, in the absence of control, if the state cost per stage x_k′ Q x_k → 0 as k → ∞, or equivalently C x_k → 0, then also x_k → 0.
We could replace the statement in Proposition 2.5.1, part (b), by (A + BL)^k → 0 as k → ∞. Since x_k = (A + BL)^k x_0, then x_k → 0 as k → ∞.
Graphical proof of Proposition 2.5.1 for the scalar case
We provide here a proof for a limited version of the statement in the proposition, where we assume a one-dimensional state and control. For A ≠ 0, B ≠ 0, Q > 0, and R > 0, the Riccati equation in (2.5.9) is given by
    P_{k+1} = A²( P_k − B² P_k² / (B² P_k + R) ) + Q,
which can be equivalently written as
    P_{k+1} = F(P_k),  where  F(P) = A² R P / (B² P + R) + Q.    (2.5.10)
Figure 2.5.2 illustrates this recursion.

Facts about Figure 2.5.2:
F is concave and monotonically increasing in the range (−R/B², ∞).
The equation P = F(P) has one solution P* > 0 and one solution P̃ < 0.
The Riccati iteration P_{k+1} = F(P_k) converges to P* > 0 starting anywhere in (P̃, ∞).
Technical note: Going back to the matrix case, if controllability of (A, B) and observability of (A, C) are replaced by the two weaker assumptions:
Stabilizability, i.e., there exists a feedback gain matrix G ∈ R^{m×n} such that the closed-loop system x_{k+1} = (A + BG) x_k is stable;
Detectability, i.e., A is such that if u_k → 0 and C x_k → 0, then it follows that x_k → 0, and that x_{k+1} = (A + BL) x_k is stable;
then the conclusions of the proposition hold, with the exception of positive definiteness of the limit matrix P, which can now only be guaranteed to be positive semidefinite.
Figure 2.5.2: Graphical illustration of the recursion P_{k+1} = F(P_k) in equation (2.5.10). Note that F(0) = Q, lim_{P→∞} F(P) = A²R/B² + Q (horizontal asymptote), and lim_{P→−R/B²} F(P) = −∞ (vertical asymptote).
2.5.6 Random system matrices
Setting:
Suppose that {A_0, B_0}, ..., {A_{N−1}, B_{N−1}} are not known but rather are independent random matrices that are also independent of w_0, ..., w_{N−1}.
Assume that their probability distributions are given and have finite variance.
To cast this problem into the basic DP framework, define the disturbances to be the triplets (A_k, B_k, w_k).
The DP algorithm is:
    J_N(x_N) = x_N′ Q_N x_N,
    J_k(x_k) = min_{u_k} E_{A_k, B_k, w_k}[ x_k′ Q_k x_k + u_k′ R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) ].
In this case, similar calculations to those for the deterministic matrices give
    μ*_k(x_k) = L_k x_k,
where the gain matrices are given by
    L_k = −(R_k + E[B_k′ K_{k+1} B_k])^{−1} E[B_k′ K_{k+1} A_k],
and where the matrices K_k are given by the generalized Riccati equation
    K_N = Q_N,
    K_k = E[A_k′ K_{k+1} A_k] − E[A_k′ K_{k+1} B_k]( R_k + E[B_k′ K_{k+1} B_k] )^{−1} E[B_k′ K_{k+1} A_k] + Q_k.    (2.5.11)
In the case of a stationary system and constant matrices Q_k and R_k, it is not necessarily true that the above equation converges to a steady-state solution. This is illustrated in Figure 2.5.3.
Figure 2.5.3: Graphical illustration of the asymptotic behavior of the generalized Riccati equation (2.5.11) in the case of a scalar stationary system (one-dimensional state and control).
In the case of a scalar stationary system (one-dimensional state and control), using P_k in place of K_{N−k}, this equation is written as
    P_{k+1} = F̃(P_k),
where the function F̃ is given by
    F̃(P) = E[A²] R P / (E[B²] P + R) + Q + T P² / (E[B²] P + R),
and where
    T = E[A²] E[B²] − (E[A])² (E[B])².
If T = 0, as in the case where A and B are not random, the Riccati equation becomes identical with the one of Figure 2.5.2 and converges to a steady state. Convergence also occurs when T has a small positive value. However, as illustrated in the figure, for T large enough, the graph of the function F̃ and the 45-degree line that passes through the origin do not intersect at a positive value of P, and the Riccati equation diverges to infinity.

Interpretation: T is a measure of the uncertainty in the system. If there is a lot of uncertainty, optimization over a long horizon is meaningless. This phenomenon has been called the uncertainty threshold principle.
2.5.7 On certainty equivalence
Consider the optimization problem
    min_u E_w[(a x + b u + w)²],
where a, b are scalars, x is known, and w is random. We have
    E_w[(a x + b u + w)²] = E[(a x + b u)² + w² + 2(a x + b u) w]
                          = (a x + b u)² + 2(a x + b u) E[w] + E[w²].
Taking the derivative with respect to u gives
    2(a x + b u) b + 2 b E[w] = 0,
and hence the minimizer is
    u* = −(a/b) x − (1/b) E[w].
Observe that u* depends on w only through the mean E[w]. In particular, the result of the optimization problem is the same as for the corresponding deterministic problem where w is replaced by E[w]. This property is called the certainty equivalence principle.
In particular,
when A_k and B_k are known, certainty equivalence holds (the optimal control is still linear in the state x_k);
when A_k and B_k are random, certainty equivalence does not hold.
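A quick numerical check of the computation above (my own sketch, with an assumed non-Gaussian noise distribution and arbitrary constants): minimize E[(ax + bu + w)²] by brute force over a grid of u and compare with the certainty-equivalent formula u* = −(a/b)x − (1/b)E[w].

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, x = 2.0, 0.5, 3.0
w = rng.exponential(scale=1.5, size=100_000) - 1.0   # non-Gaussian noise, E[w] = 0.5

u_grid = np.linspace(-20, 0, 2001)
costs = [np.mean((a * x + b * u + w) ** 2) for u in u_grid]
u_numeric = u_grid[int(np.argmin(costs))]

u_formula = -(a / b) * x - (1.0 / b) * np.mean(w)    # certainty-equivalent control
print(u_numeric, u_formula)                          # both close to -13
```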
2.5.8 Exercises
Exercise 2.5.1 Consider a linear-quadratic problem where A_k, B_k are known, for the case where at the beginning of period k we have a forecast y_k ∈ {1, 2, ..., n} consisting of an accurate prediction that w_k will be selected in accordance with a particular probability distribution P_{k|y_k}. The vectors w_k need not have zero mean under the distribution P_{k|y_k}. Show that the optimal control law is of the form
    μ_k(x_k, y_k) = −(B_k′ K_{k+1} B_k + R_k)^{−1} B_k′ K_{k+1} ( A_k x_k + E[w_k | y_k] ) + α_k,
where the matrices K_k are given by the discrete time Riccati equation, and the α_k are appropriate vectors.

Hint:
System equation: x_{k+1} = A_k x_k + B_k u_k + w_k, k = 0, 1, ..., N−1.
Cost = E_{w_0, ..., w_{N−1}}[ x_N′ Q_N x_N + Σ_{k=0}^{N−1} ( x_k′ Q_k x_k + u_k′ R_k u_k ) ].
Let
    y_k = forecast available at the beginning of period k,
    P_{k|y_k} = p.d.f. of w_k given y_k,
    p^k_{y_k} = a priori p.d.f. of y_k at stage k.
We have the following DP algorithm:
    J_N(x_N, y_N) = x_N′ Q_N x_N,
    J_k(x_k, y_k) = min_{u_k} E_{w_k}[ x_k′ Q_k x_k + u_k′ R_k u_k + Σ_{i=1}^{n} p^{k+1}_i J_{k+1}(x_{k+1}, i) | y_k ],
where the noise w_k is distributed according to P_{k|y_k}.
Prove the following result by induction. The control u*_k(x_k, y_k) should be derived on the way.
Proposition: Under the conditions of the problem,
    J_k(x_k, y_k) = x_k′ K_k x_k + x_k′ b_k(y_k) + c_k(y_k),  k = 0, 1, ..., N,
where b_k(y_k) is an n-dimensional vector, c_k(y_k) is a scalar, and K_k is generated by the discrete time Riccati equation.
Exercise 2.5.2 Consider a scalar linear system
    x_{k+1} = a_k x_k + b_k u_k + w_k,  k = 0, 1, ..., N−1,
where a_k, b_k ∈ R, and each w_k is a Gaussian random variable with zero mean and variance σ². We assume no control constraints and independent disturbances.
1. Show that the control law {μ*_0, μ*_1, ..., μ*_{N−1}} that minimizes the cost function
    E[ exp( x_N² + Σ_{k=0}^{N−1} ( x_k² + r u_k² ) ) ],  r > 0,
is linear in the state variable, assuming that the optimal cost is finite for every x_0.
2. Show by example that the Gaussian assumption is essential for the result to hold.

Hint 1: Note from integral tables that
    ∫_{−∞}^{+∞} e^{−(a x² + b x + c)} dx = √(π/a) e^{(b² − 4ac)/(4a)},  for a > 0.
Let w be a normal random variable with zero mean and variance σ² < 1/(2λ). Using this definite integral, prove that
    E[ e^{λ(a + w)²} ] = (1/√(1 − 2λσ²)) exp( λ a² / (1 − 2λσ²) ).
Then, prove that if the DP algorithm has a finite minimizing value at each step, then
    J_N(x_N) = e^{x_N²},
    J_k(x_k) = α_k e^{β_k x_k²},  for constants α_k, β_k > 0,  k = 0, 1, ..., N−1.

Hint 2: In particular for w_{N−1}, consider the discrete distribution
    P(w_{N−1} = ω) = 1/4 if |ω| = 1, and 1/2 if ω = 0.
Find a functional form for J_{N−1}(x_{N−1}), and check whether u*_{N−1} can be written as γ_{N−1} x_{N−1} for some constant γ_{N−1}.
2.6 Modular functions and monotone policies
Now we go back to the basic DP setting on problems with perfect state information. We will identify conditions on a parameter θ (e.g., θ could be related to the state of a system) under which the optimal action D*(θ) varies monotonically with it. We start with some technical definitions and relevant properties.
2.6.1 Lattices
Definition 2.6.1 Given two points x = (x_1, ..., x_n) and y = (y_1, ..., y_n) in Rⁿ, we define
    Meet of x and y: x ∧ y = (min{x_1, y_1}, ..., min{x_n, y_n}),
    Join of x and y: x ∨ y = (max{x_1, y_1}, ..., max{x_n, y_n}).
Then
    x ∧ y ≤ x ≤ x ∨ y.

Definition 2.6.2 A set X ⊂ Rⁿ is said to be a sublattice of Rⁿ if for all x, y ∈ X, x ∧ y ∈ X and x ∨ y ∈ X.

Examples:
I = { (x, y) ∈ R² : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 } is a sublattice.
H = { (x, y) ∈ R² : x + y = 1 } is not a sublattice, because for example (1, 0) and (0, 1) are in H, but neither (0, 0) nor (1, 1) is in H.

Definition 2.6.3 A point x̄ ∈ X is said to be a greatest element of a sublattice X if x̄ ≥ x for all x ∈ X. A point x ∈ X is said to be a least element of a sublattice X if x ≤ x̄ for all x̄ ∈ X.

Theorem 2.6.1 Suppose X ≠ ∅ and X is a compact (i.e., closed and bounded) sublattice of Rⁿ. Then X has a least and a greatest element.
2.6.2 Supermodularity and increasing differences
Let S ⊂ Rⁿ, Θ ⊂ R^l. Suppose that both S and Θ are sublattices.

Definition 2.6.4 A function f : S × Θ → R is said to be supermodular in (x, θ) if for all z = (x, θ) and z′ = (x′, θ′) in S × Θ:
    f(z) + f(z′) ≤ f(z ∧ z′) + f(z ∨ z′).
Similarly, f is submodular if
    f(z) + f(z′) ≥ f(z ∧ z′) + f(z ∨ z′).

Example: Let S = Θ = R₊, and let f : S × Θ → R be given by f(x, θ) = θx. We will show that f is supermodular in (x, θ).
Pick any (x, θ) and (x′, θ′) in S × Θ, and assume w.l.o.g. x ≥ x′. There are two cases to consider:
1. θ ≥ θ′: then (x, θ) ∧ (x′, θ′) = (x′, θ′) and (x, θ) ∨ (x′, θ′) = (x, θ). Then
    f(x, θ) + f(x′, θ′) = θx + θ′x′ = f((x, θ) ∧ (x′, θ′)) + f((x, θ) ∨ (x′, θ′)),
so the supermodularity inequality holds with equality.
2. θ < θ′: then (x, θ) ∧ (x′, θ′) = (x′, θ) and (x, θ) ∨ (x′, θ′) = (x, θ′). Then
    f((x, θ) ∧ (x′, θ′)) + f((x, θ) ∨ (x′, θ′)) = θx′ + θ′x,
and we would have
    f(x, θ) + f(x′, θ′) = θx + θ′x′ ≤ θx′ + θ′x
if and only if
    θx + θ′x′ − θx′ − θ′x ≤ 0  ⟺  (x − x′)(θ − θ′) ≤ 0,
which is indeed the case, since x ≥ x′ and θ < θ′.
Therefore, f(x, θ) = θx is supermodular in S × Θ.
Definition 2.6.5 For S, Θ ⊂ R, a function f : S × Θ → R is said to satisfy increasing differences in (x, θ) if for all pairs (x, θ) and (x′, θ′) in S × Θ with x ≥ x′ and θ ≥ θ′, we have
    f(x, θ) − f(x′, θ) ≥ f(x, θ′) − f(x′, θ′).
If the inequality becomes strict whenever x > x′ and θ > θ′, then f is said to satisfy strictly increasing differences.
In other words, f has increasing differences in (x, θ) if the difference
    f(x, θ) − f(x′, θ), for x ≥ x′,
is increasing in θ.
Theorem 2.6.2 Let S, Θ ⊂ R, and suppose f : S × Θ → R is supermodular in (x, θ). Then
1. f is supermodular in x, for each fixed θ (i.e., for any fixed θ ∈ Θ, and for any x, x′ ∈ S, we have f(x, θ) + f(x′, θ) ≤ f(x ∧ x′, θ) + f(x ∨ x′, θ)).
2. f satisfies increasing differences in (x, θ).

Proof: For part (1), fix θ ∈ Θ. Let z = (x, θ), z′ = (x′, θ). Since f is supermodular in (x, θ):
    f(x, θ) + f(x′, θ) ≤ f(x ∧ x′, θ) + f(x ∨ x′, θ),
or equivalently
    f(z) + f(z′) ≤ f(z ∧ z′) + f(z ∨ z′),
and the result holds.
For part (2), pick any z = (x, θ) and z′ = (x′, θ′) that satisfy x ≥ x′ and θ ≥ θ′. Let w = (x, θ′) and w′ = (x′, θ). Then w ∨ w′ = z and w ∧ w′ = z′. Since f is supermodular on S × Θ,
    f(x, θ′) + f(x′, θ) = f(w) + f(w′) ≤ f(w ∧ w′) + f(w ∨ w′) = f(x′, θ′) + f(x, θ).
Rearranging terms,
    f(x, θ) − f(x′, θ) ≥ f(x, θ′) − f(x′, θ′),
and so f also satisfies increasing differences, as claimed.

Remark: We will prove later on that the converse of part (2) in the theorem also holds.
Recall: A function f : S ⊂ Rⁿ → R is said to be of class C^k if the derivatives f^(1), f^(2), ..., f^(k) exist and are continuous (the continuity is automatic for all the derivatives except the last one, f^(k)). Moreover, if f is C^k, then the cross-partial derivatives satisfy
    ∂²f(z)/∂z_i ∂z_j = ∂²f(z)/∂z_j ∂z_i.
Theorem 2.6.3 Let Z be an open sublattice of Rⁿ. A C² function h : Z → R is supermodular on Z if and only if for all z ∈ Z we have
    ∂²h(z)/∂z_i ∂z_j ≥ 0,  i, j = 1, ..., n, i ≠ j.
Similarly, h is submodular if and only if for all z ∈ Z we have
    ∂²h(z)/∂z_i ∂z_j ≤ 0,  i, j = 1, ..., n, i ≠ j.
Proof: We prove here the result for supermodularity for the case n = 2.
(⇐) If
    ∂²h(z)/∂z_i ∂z_j ≥ 0,  i, j = 1, ..., n, i ≠ j,
then for x_1 > x_2 and y_1 > y_2,
    ∫_{y_2}^{y_1} ∫_{x_2}^{x_1} ∂²h(x, y)/∂x ∂y dx dy ≥ 0.
So,
    ∫_{y_2}^{y_1} ∂/∂y ( h(x_1, y) − h(x_2, y) ) dy ≥ 0,
and thus,
    h(x_1, y_1) − h(x_2, y_1) − ( h(x_1, y_2) − h(x_2, y_2) ) ≥ 0,
or equivalently,
    h(x_1, y_1) − h(x_2, y_1) ≥ h(x_1, y_2) − h(x_2, y_2),
which shows that h satisfies increasing differences and hence is supermodular.
(⇒) Suppose h is supermodular. Then it satisfies increasing differences and so, for x_1 > x_2 and y_1 > y,
    ( h(x_1, y_1) − h(x_1, y) ) / (y_1 − y) ≥ ( h(x_2, y_1) − h(x_2, y) ) / (y_1 − y).
Letting y_1 → y, we have
    ∂h(x_1, y)/∂y ≥ ∂h(x_2, y)/∂y, when x_1 ≥ x_2,
implying that
    ∂²h(x, y)/∂x ∂y ≥ 0.
Note that the limit above defines a one-sided derivative, but since h is differentiable, it coincides with the derivative.
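Theorem 2.6.3 also suggests an easy computational test: for a smooth function check the sign of the cross-partial, and for a function evaluated on a finite grid check increasing differences directly. The sketch below (mine, with illustrative functions) applies the discrete test to f(x, θ) = θx from the earlier example and to h(x, y) = −(x − y)², which is also supermodular, while (x − y)² is submodular and fails the test.

```python
import numpy as np

def has_increasing_differences(f, xs, thetas, tol=1e-12):
    """Discrete check: f(x',t') - f(x,t') >= f(x',t) - f(x,t) whenever x'>=x, t'>=t."""
    for i, x in enumerate(xs):
        for xp in xs[i:]:
            for j, t in enumerate(thetas):
                for tp in thetas[j:]:
                    if f(xp, tp) - f(x, tp) < f(xp, t) - f(x, t) - tol:
                        return False
    return True

grid = np.linspace(0, 2, 9)
print(has_increasing_differences(lambda x, t: t * x, grid, grid))            # True
print(has_increasing_differences(lambda x, y: -(x - y) ** 2, grid, grid))    # True
print(has_increasing_differences(lambda x, y: (x - y) ** 2, grid, grid))     # False
```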
2.6.3 Parametric monotonicity
Suppose S, Θ ⊂ R, f : S × Θ → R, and consider the optimization problem
    max_{x ∈ S} f(x, θ).
Here, by parametric monotonicity we mean that the higher the value of θ, the higher the maximizer x*(θ) (see footnote 6).

Let's give some intuition for strictly increasing differences implying parametric monotonicity. We argue by contradiction. Suppose that in this maximization problem a solution exists for all θ ∈ Θ (e.g., suppose that f(·, θ) is continuous on S for each fixed θ, and that S is compact). Pick any two values θ_1, θ_2 ∈ Θ with θ_1 > θ_2. Let x_1, x_2 be values that are optimal at θ_1 and θ_2, respectively. Thus,
    f(x_1, θ_1) − f(x_2, θ_1) ≥ 0 ≥ f(x_1, θ_2) − f(x_2, θ_2).    (2.6.1)
Suppose f satisfies strictly increasing differences, and that θ_1 > θ_2, but parametric monotonicity fails; that is, assume x_1 < x_2. Then the vectors (x_2, θ_1) and (x_1, θ_2) satisfy x_2 > x_1 and θ_1 > θ_2. By strictly increasing differences:
    f(x_2, θ_1) − f(x_1, θ_1) > f(x_2, θ_2) − f(x_1, θ_2),
contradicting (2.6.1). So, we must have x_1 ≥ x_2, where x_1, x_2 were arbitrary selections from the sets of optimal actions at θ_1 and θ_2, respectively.

In summary, if S, Θ ⊂ R, strictly increasing differences imply monotonicity of optimal actions in the parameter θ.

Footnote 6: Note that this concept is different from what is stated in the Envelope Theorem, which studies the marginal change in the value of the maximized function, and not of the optimizer of that function:
Envelope Theorem: Consider a maximization problem M(θ) = max_x f(x, θ). Let x*(θ) be the argmax value of x that solves the problem in terms of θ, i.e., M(θ) = f(x*(θ), θ). Assume that f is continuously differentiable in (x, θ), and that x* is continuously differentiable in θ. Then,
    dM(θ)/dθ = ∂f(y, θ)/∂θ evaluated at y = x*(θ).
Note: This result also holds for S ⊂ Rⁿ, n ≥ 2, but the proof is different and requires additional assumptions. The problem with extending the previous argument to higher dimensional settings is that we can no longer say that x_1 ≱ x_2 implies x_1 < x_2.
The following theorem relaxes the strict condition on the increasing differences needed to guarantee parametric monotonicity.
Theorem 2.6.4 Let S be a compact sublattice of Rⁿ, Θ be a sublattice of R^l, and f : S × Θ → R be a continuous function on S for each fixed θ ∈ Θ. Suppose that f satisfies increasing differences in (x, θ), and is supermodular in x for each fixed θ. Let the correspondence D* : Θ → S be defined by
    D*(θ) = arg max{ f(x, θ) : x ∈ S }.
Then,
1. For each θ ∈ Θ, D*(θ) is a nonempty compact sublattice of Rⁿ, and admits a greatest element, denoted x*(θ).
2. x*(θ_1) ≥ x*(θ_2) whenever θ_1 > θ_2.
3. If f satisfies strictly increasing differences in (x, θ), then x_1 ≥ x_2 for any x_1 ∈ D*(θ_1) and x_2 ∈ D*(θ_2), whenever θ_1 > θ_2.
Proof: For part (1): Since f is continuous on S for each fixed θ, and since S is compact, D*(θ) ≠ ∅ for each θ. Fix θ and take a sequence {x_p} in D*(θ) converging to x ∈ S. Then, for any y ∈ S, since x_p is optimal, we have
    f(x_p, θ) ≥ f(y, θ).
Taking the limit as p → ∞, and using the continuity of f(·, θ), we obtain
    f(x, θ) ≥ f(y, θ),
so x ∈ D*(θ). Therefore, D*(θ) is closed, and as a closed subset of a compact set S, it is also compact.

Now, we argue by contradiction: let x and x′ be distinct elements of D*(θ). If x ∨ x′ ∉ D*(θ), we must have
    f(x ∨ x′, θ) < f(x, θ) = f(x′, θ).
However, supermodularity in x means
    2 f(x, θ) = f(x, θ) + f(x′, θ) ≤ f(x′ ∧ x, θ) + f(x′ ∨ x, θ) < f(x′ ∧ x, θ) + f(x, θ),
which implies
    f(x′ ∧ x, θ) > f(x, θ) = f(x′, θ),
which in turn contradicts the presumed optimality of x and x′ at θ. A similar argument also establishes that x ∧ x′ ∈ D*(θ). Thus, D*(θ) is a sublattice of Rⁿ, and as a nonempty compact sublattice of Rⁿ, admits a greatest element x*(θ).
For part (2): Let θ_1 and θ_2 be given with θ_1 > θ_2. Let x_1 ∈ D*(θ_1), and x_2 ∈ D*(θ_2). Then, we have
    0 ≤ f(x_1, θ_1) − f(x_1 ∨ x_2, θ_1)    (by optimality of x_1 at θ_1)
      ≤ f(x_1 ∧ x_2, θ_1) − f(x_2, θ_1)    (by supermodularity in x)
      ≤ f(x_1 ∧ x_2, θ_2) − f(x_2, θ_2)    (by increasing differences in (x, θ))
      ≤ 0    (by optimality of x_2 at θ_2),
so equality holds at every point in this string. Now, suppose x_1 = x*(θ_1) and x_2 = x*(θ_2). Since equality holds at all points in the string, using the first equality we have
    f(x_1 ∨ x_2, θ_1) = f(x_1, θ_1),
and so x_1 ∨ x_2 is also an optimal action at θ_1. If x_1 ≱ x_2, then we would have x_1 ∨ x_2 > x_1, and this contradicts the definition of x_1 as the greatest element of D*(θ_1). Thus, we must have x_1 ≥ x_2.

For part (3): Suppose that x_1 ∈ D*(θ_1), x_2 ∈ D*(θ_2). Suppose that x_1 ≱ x_2. Then, x_2 > x_1 ∧ x_2. If f satisfies strictly increasing differences, then since θ_1 > θ_2, we have
    f(x_2, θ_1) − f(x_1 ∧ x_2, θ_1) > f(x_2, θ_2) − f(x_1 ∧ x_2, θ_2),
so the third inequality in the string above becomes strict, contradicting the equality.
Remark: For the case where S, R, from Theorem 2.6.2 it can be seen that if f is supermodular,
it automatically veries the hypotheses of Theorem 2.6.4, and therefore in principle supermodular-
ity in R
2
constitutes a sucient condition for parametric monotonicity. For a more general case
in R
n
, n > 2, a related result follows.
For this general case, the denition of increasing dierences is: For all z Z, for all distinct
i, j {1, . . . , n}, and for all z

i
, z

j
such that
z

i
z
i
, and z

j
z
j
;
it is the case that
f(z
ij
, z

i
, z

j
) f(z
ij
, z
i
, z

j
) f(z
ij
, z

i
, z
j
) f(z
ij
, z
i
, z
j
).
In words, f has increasing dierences on Z if it has increasing dierences in each pair (z
i
, z
j
) when
all other coordinates are held xed at some value.
Theorem 2.6.5 A function f : Z R
n
R is supermodular on Z if and only if f has increasing
dierences on Z.
Proof: The implication can be proved by a slight modication of part (2) in Theorem 2.6.2.
To prove , pick any z and z

in Z. We are required to show that


f(z) +f(z

) f(z z

) +f(z z

).
88
CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey
If z z

or z z

, the inequality trivially holds. So, suppose z and z

are not comparable under .


For notational convenience, arrange the coordinates of z and z

so that
z z

= (z

1
, . . . , z

k
, z
k+1
, . . . , z
n
),
and
z z

= (z
1
, . . . , z
k
, z

k+1
, . . . , z

n
).
Note that since z and z

are not comparable under , we must have 0 < k < n.


Now, for 0 i j n, dene
z
i,j
= (z

1
, . . . , z

i
, z
i+1
, . . . , z
j
, z

j+1
, . . . , z

n
).
Then, we have
z
0,k
= z z

, z
k,n
= z z

, z
0,n
= z, z
k,k
= z

. (2.6.2)
Since f has increasing dierences on Z, it is the case that for all 0 i < k j < n,
f(z
i+1,j+1
) f(z
i,j+1
) f(z
i+1,j
) f(z
i,j
).
Therefore, we have for k j < n,
f(z
k,j+1
) f(z
0,j+1
) =
k1

i=0
[f(z
i+1,j+1
) f(z
i,j+1
)]

k1

i=0
[f(z
i+1,j
) f(z
i,j
)]
= f(z
k,j
) f(z
0,j
).
Since this inequality holds for all j satisfying k j < n, it follows that the LHS is at its highest
value at j = n 1, while the RHS is at its lowest value when j = k. Therefore,
f(z
k,n
) f(z
0,n
) f(z
k,k
) f(z
0,k
).
From (2.6.2), this is precisely the statement that
f(z z

) f(z) f(z

) f(z z

).
Since z and z

were chosen arbitrarily, f is shown to be supermodular on Z.


Remark: From Theorem 2.6.5, it is sucient to prove supermodularity (or increasing dierences)
to prove parametric monotonicity.
2.6.4 Applications to DP
We include here a couple of examples that show how useful the concept of parametric monotonicity
could be to characterize monotonicity properties of the optimal policy.
89
Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING
Example 2.6.1 (A gambling model with changing win probabilities)
Consider a gambler who is allowed to bet any amount up to his present fortune at each play.
He will win or lose that amount according to a given probability p.
Before each gamble, the value of p changes (p F).
Control: On each play, the gambler must decide, after the win probability is announced, how much
to bet.
Consider a sequence of N gambles.
Objective: Maximize the expected value of a given utility function G of his nal fortune x,
where G(x) is continuously dierentiable and nondecreasing in x.
State: (x, p), where x is his current fortune, and p is the current win probability.
Assume indices run backward in time.
DP formulation:
Dene the value function V
k
(x, p) as the maximal expected nal utility for state (x, p) when there are k
games left.
The DP algorithm is:
V
0
(x, p) = G(x),
and for k = N, N 1, . . . , 1,
V
k
(x, p) = max
0ux
_
p
_
1
0
V
k1
(x +u, )dF() + (1 p)
_
1
0
V
k1
(x u, )dF()
_
.
Let u
k
(x, p) be the largest u that maximizes this equation. Let g
k
(u, p) be the expression to maximize
above, i.e.,
g
k
(u, p) = p
_
1
0
V
k1
(x +u, )dF() + (1 p)
_
1
0
V
k1
(x u, )dF().
Intuitively, for given k and x, the optimal amount u
k
(x, p) to bet should be increasing in p. So, we
would like to prove parametric monotonicity of u
k
(x, p) in p. To this end, it would be enough to prove
increasing dierences of g
k
(u, p) in (u, p), or equivalently, it would be enough to prove supermodularity
of g
k
(u, p) in (u, p). Or it would be enough to prove

2
up
g
k
(u, p) 0.
The derivation proceeds as follows:

p
g
k
(u, p) =
_
1
0
V
k1
(x +u, )dF()
_
1
0
V
k1
(x u, )dF().
90
CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey
Then, by the Leibniz rule
7

2
up
g
k
(u, p) =
_
1
0

u
V
k1
(x +u, )dF()
_
1
0

u
V
k1
(x u, )dF().
Then,

2
up
g
k
(u, p) 0
if for all ,

u
[V
k1
(x +u, ) V
k1
(x u, )] 0,
or equivalently, if for all ,
V
k1
(x +u, ) V
k1
(x u, )
increases in u, which follows if V
k1
(z, ) is increasing in z, which immediately holds because for z

> z,
V
k1
(z

, ) V
k1
(z, ),
since in the former we are maximizing over a bigger domain.
For V
0
(, ), it holds because G(z

) G(z).
Example 2.6.2 (An optimal allocation problem subject to penalty costs)
There are N stages to construct I components sequentially.
At each stage, we allocate u dollars for the construction of one component.
If we allocate $u, then the component constructed will be a success w.p. P(u) (continuous,
nondecreasing, with P(0) = 0).
After each component is constructed, we are informed as to whether or not it is successful.
If at the end of N stages we are j components short, we incur a penalty cost C(j) (increasing,
with C(j) = 0 for j 0).
Control: How much money to allocate in each stage to minimize the total expected cost (con-
struction + penalty).
State: Number of successful components still needed.
Indices run backward in time.
DP formulation:
Dene the value function V
k
(i) as the minimal expected remaining cost when state is i and k stages
remain.
The DP algorithm is:
V
0
(i) = C(i),
7
We would need to prove that V
k
(x, p) and

x
V
k
(x, p) are continuous in x. A sucient condition or that is
that

k
(x, p) is continuously dierentiable in x.
91
Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING
and for k = N, N 1, . . . , 1, and i > 0,
V
k
(i) = min
u0
{u +P(u)V
k1
(i 1) + (1 P(u))V
k1
(i)} . (2.6.3)
We set V
k
(i) = 0, i 0, and for all k.
It follows immediately from the denition of V
k
(i) and the monotonicity of C(i) that V
k
(i) increases
in i and decreases in k.
Let u
k
(i) be the minimizer of (2.6.3). Two intuitive results should follow:
1. The more we need, the more we should invest (i.e., u
k
(i) is increasing in i).
2. The more time we have, the less we need to invest at each stage (i.e., u
k
(i) is decreasing in k).
Lets determine conditions on C() that make the previous two intuitions valid.
Dene
g
k
(i, u) = u +P(u)V
k1
(i 1) + (1 P(u))V
k1
(i).
Minimizing g
k
(i, u) is equivalent to maximizing (g
k
(i, u)). Then, in order to prove u
k
(i) increasing
in i, it is enough to prove (g
k
(i, u)) supermodular in (i, u), or g
k
(i, u) submodular in (i, u). Note that
here we are treating i as a continuous quantity.
So, u
k
(i) increases in i if

2
iu
g
k
(i, u) 0.
We compute this cross-partial derivative. First, we calculate

u
g
k
(i, u) = 1 +P

(u)[V
k1
(i 1) V
k1
(i)],
and then

2
iu
g
k
(i, u) = P

(u)
. .
0

i
[V
k1
(i 1) V
k1
(i)] 0,
so that u
k
(i) increases in i if [V
k1
(i 1) V
k1
(i)] decreases in i. Similarly, u
k
(i) decreases in k if
[V
k1
(i 1) V
k1
(i)] increases in k. Therefore, submodularity gives a sucient condition on g
k
(i, u),
which ensures the desired monotonicity of the optimal policy. For this example, we show below that
if C(i) is convex in i, then [V
k1
(i 1) V
k1
(i)] decreases in i and increases in k, ensuring the desired
structure of the optimal policy.
Two results are easy to verify:
V
k
(i) is increasing in i, for a given k.
V
k
(i) is decreasing in k, for a given i.
Proposition 2.6.1 If C(i +2) C(i +1) C(i +1) C(i), i (i.e., C() convex), then u
k
(i) increases
in i and decreases in k.
92
CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey
Proof: Dene
A
i,k
: V
k+1
(i + 1) V
k+1
(i) V
k
(i + 1) V
k
(i), k 0
B
i,k
: V
k+1
(i) V
k
(i) V
k+2
(i) V
k+1
(i), k 0
C
i,k
: V
k
(i + 1) V
k
(i) V
k
(i + 2) V
k
(i + 1), k 0
We proceed by induction on n = k +i. For n = 0 (i.e., k = i = 0):
A
0,0
: V
1
(1)
. .
=0 from (2.6.3)
V
1
(0)
. .
0
V
0
(1)
. .
0
V
0
(0)
. .
0
,
B
0,0
: V
1
(0)
. .
0
V
0
(0)
. .
0
V
2
(0)
. .
0
V
1
(0)
. .
0
,
C
0,0
: V
0
(1)
. .
C(1)
V
0
(0)
. .
C(0)=0
V
0
(2)
. .
C(2)
V
0
(1)
. .
C(1)
,
where the last inequality holds because C() is convex.
IH: The 3 inequalities above are true for k +i < n.
Suppose now that k +i = n. We proceed by proving one inequality at a time.
1. For A
i,k
:
If i = 0 A
0,k
: V
k+1
(1) V
k+1
(0)
. .
0
V
k
(1) V
k
(0)
. .
0
, which holds because V
k
(i) is decreasing
in k.
If i > 0, then there is u such that
V
k+1
(i) = u +P( u)V
k
(i 1) + (1 P( u))V
k
(i).
Thus,
V
k+1
(i) V
k
(i) = u +P( u)[V
k
(i 1) V
k
(i)] (2.6.4)
Also, since u is the minimizer just for V
k+1
(i),
V
k+1
(i + 1) u +P( u)V
k
(i) + (1 P( u))V
k
(i + 1).
Then,
V
k+1
(i + 1) V
k
(i + 1) u +P( u)[V
k
(i) V
k
(i + 1)] (2.6.5)
Note that from C
i1,k
(which holds by IH because i 1 +k = n 1), we get
V
k
(i) V
k
(i + 1) V
k
(i 1) V
k
(i)
Then, using the RHS of (4.2.5) and (5.3.2), we have
V
k+1
(i + 1) V
k
(i + 1) V
k+1
(i) V
k
(i),
or equivalently,
V
k+1
(i + 1) V
k+1
(i) V
k
(i + 1) V
k
(i),
which is exactly A
i,k
.
93
Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING
2. For B
i,k
:
Note that for some u,
V
k+2
(i) = u +P( u)V
k+1
(i 1) + (1 P( u))V
k+1
(i),
or equivalently,
V
k+2
(i) V
k+1
(i) = u +P( u)[V
k+1
(i 1) V
k+1
(i)]. (2.6.6)
Also, since u is the minimizer for V
k+2
(i),
V
k+1
(i) u +P( u)V
k
(i 1) + (1 P( u))V
k
(i),
so that
V
k+1
(i) V
k
(i) u +P( u)[V
k
(i 1) V
k
(i)]. (2.6.7)
By IH, A
i1,k
, for i 1 +k = n 1, holds. So,
V
k+1
(i) V
k+1
(i 1) V
k
(i) V
k
(i 1),
or equivalently,
V
k
(i 1) V
k
(i) V
k+1
(i 1) V
k+1
(i).
Plugging it in (3.3.4), and using the RHS of (4.2.7), we obtain
V
k+1
(i) V
k
(i) V
k+2
(i) V
k+1
(i),
which is exactly B
i,k
.
3. For C
i,k
, we rst note that B
i+1,k1
(already proved since i + 1 +k 1 = n) states that
V
k
(i + 1) V
k1
(i + 1) V
k+1
(i + 1) V
k
(i + 1),
or equivalently,
2V
k
(i + 1) V
k+1
(i + 1) +V
k1
(i + 1). (2.6.8)
Hence, if we can show that,
V
k1
(i + 1) +V
k+1
(i + 1) V
k
(i) +V
k
(i + 2), (2.6.9)
then from (2.6.8) and (2.6.9) we would have
2V
k
(i + 1) V
k
(i) +V
k
(i + 2),
or equivalently,
V
k
(i + 1) V
k
(i) V
k
(i + 2) V
k
(i + 1),
which is exactly C
i,k
.
Now, for some u,
V
k
(i + 2) = u +P( u)V
k1
(i + 1) + (1 P( u))V
k1
(i + 2),
which implies
V
k
(i + 2) V
k1
(i + 1) = u +P( u)V
k1
(i + 1) + (1 P( u))V
k1
(i + 2) V
k1
(i + 1)
= u + (1 P( u))[V
k1
(i + 2) V
k1
(i + 1)]. (2.6.10)
94
CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey
Moreover, since u is the minimizer of V
k
(i + 2):
V
k+1
(i + 1) u +P( u)V
k
(i) + (1 P( u))V
k
(i + 1).
Subtracting V
k
(i) from both sides:
V
k+1
(i + 1) V
k
(i) u + (1 P( u))[V
k
(i + 1) V
k
(i)].
Then, equation (2.6.9) will follow if we can prove that
V
k
(i + 1) V
k
(i) V
k1
(i + 2) V
k1
(i + 1), (2.6.11)
because then
V
k+1
(i + 1) V
k
(i) u + (1 P( u))[V
k1
(i + 2) V
k1
(i + 1)]
= V
k
(i + 2) V
k1
(i + 1).
Now, from A
i,k1
(which holds by IH), it follows that
V
k
(i + 1) V
k
(i) V
k1
(i + 1) V
k1
(i). (2.6.12)
Also, from C
i,k1
(which holds by IH), it follows that
V
k1
(i + 1) V
k1
(i) V
k1
(i + 2) V
k1
(i + 1). (2.6.13)
Finally, (2.6.12) and (2.6.13) (2.6.11) (2.6.9), and we close this case.
In the end, the three inequalities hold, and the proof is completed.
2.7 Extensions
2.7.1 The Value of Information
The value of information is the reduction in cost between optimal closed-loop and open-loop policies.
To illustrate its computation, we revisit the two-game chess match example.
Example 2.7.1 (Two-game chess match)
Closed-Loop: Recall that the optimal policy when p
d
> p
w
is to play timid if and only if one is
ahead in the score. Figure 2.7.1 illustrates this. The optimal payo under the closed-loop policy
is the sum of the payos in the leaves of Figure 2.7.1. This is because the four payos correspond
to four mutually exclusive outcomes. The total payo is
P(win) = p
w
p
d
+p
2
w
((1 p
d
) + (1 p
w
))
= p
2
w
(2 p
w
) +p
w
(1 p
w
)p
d
. (2.7.1)
For example, if p
w
= 0.45 and p
d
= 0.9, we know that P(win) = 0.53.
95
Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING
0-0
1-0
0-1
1.5-
0.5
1-1
1-1
0-2
bold
timid
bold
p
w
1-p
w
p
d
1-p
d
1-p
w
p
w
p
w
p
d
p
w
2
(1-p
d
)
p
w
2
(1-p
w
)
0 (already lost)
Figure 2.7.1: Optimal closed-loop policy for the two-game chess match. The payos included next to the leaves
represent the total cumulative payo for following that particular branch from the root.
Open-Loop: There are four possible policies:
Note that the latter two policies lead to the same payo, and that this payo dominates the rst
policy (i.e., playing (timid-timid)) because
p
w
p
d
. .
p
2
d
pw
+p
2
w
(1 p
d
)
. .
0
p
2
d
p
w
.
Therefore, the maximum open-loop probability of winning the match is:
max{p
2
w
(3 2p
w
)
. .
Play (bold,bold)
, p
w
p
d
+p
2
w
(1 p
d
)
. .
Play (bold, timid) or (timid, bold)
} = p
2
w
+p
w
(1 p
w
) max{2p
w
, p
d
} (2.7.2)
So,
if p
d
> 2p
w
, then the optimal policy is to play either (timid,bold) or (bold, timid);
if p
d
2p
w
, then the optimal policy is to play (bold,bold).
Again, if p
w
= 0.45 and p
d
= 0.9, then P(win) = 0.425
For the aforementioned probabilities p
w
= 0.45 and p
d
= 0.9, the value of information is the
dierence between both optimal payos: 0.53 0.425 = 0.105.
More generally, by subtracting (2.7.1)-(2.7.2):
Value of Information = p
2
w
(2 p
w
) +p
w
(1 p
w
)p
d
p
2
w
p
w
(1 p
w
) max{2p
w
, p
d
}
= p
w
(1 p
w
) min{p
w
, p
d
p
w
}.
2.7.2 State Augmentation
In the basic DP formulation, the random noise is independent across all periods, and the control
depends just on the current state. In this regard, the system is of the Markovian type. In order to
deal with a more general situation, we enlarge the state denition so that the current state captures
information of the past. In some applications, this past information could be helpful for the future.
96
CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey
0-0
0.5-
0.5
0-1
1-1
0-2
timid
timid
p
d
1-p
d
p
d
1-p
d
1-p
d
p
d
p
d
2
p
w
0
0
0
0.5-
1.5
0.5-
1.5
timid
0-0
0-0
0-1
2-0
0-2
bold
bold
p
w
1-p
w
p
w
1-p
w
1-p
w
p
w
p
w
2
P
w
2
(1-p
w
)
0
1-1
1-1
bold
P
w
2
(1-p
w
)
Play (timid-timid)
Play (bold-bold)
P(win) = p
d
2
p
w
P(win) = p
w
2
+2 p
w
2
(1-p
w
) = p
w
2
(32 p
w
)
0-0
1-0
0-1
0-2
bold
timid
p
w
1-p
w
p
d
1-p
d
1-p
d
p
d
p
w
p
d
p
w
2
(1-p
d
)
0
1-1
timid
0
0-0
0-1
0-2
timid
bold
p
d
1-p
d
p
w
1-p
w
1-p
w
p
w
0
0
1-1
bold
p
w
2
(1-p
d
)
1.5-
0.5
0.5-
1.5
0.5-
0.5
1.5-
0.5
0.5-
1.5
p
d
p
w
Play (bold-timid)
Play (timid-bold)
P(win) = p
w
p
d
+ p
w
2
(1 p
d
) P(win) = p
w
p
d
+ p
w
2
(1 p
d
)
Figure 2.7.2: Open-loop policies for the two-game chess match. The payos included next to the leaves represent
the total cumulative payo for following that particular branch from the root.
Time Lags
Suppose that the next state x
k+1
depends on the last two states x
k
and x
k1
, and on the last two
controls u
k
and u
k1
. For instance,
x
k+1
= f
k
(x
k
, x
k1
, u
k
, u
k1
, w
k
), k = 1, . . . , N 1,
x
1
= f
0
(x
0
, u
0
, w
0
).
We redene the system equation as follows:
_
_
_
x
k+1
y
k+1
s
k+1
_
_
_
. .
e x
k+1
=
_
_
_
f
k
(x
k
, y
k
, u
k
, s
k
, w
k
)
x
k
u
k
_
_
_
. .
e
f
k
(e x
k
,u
k
,w
k
)
,
where x
k
= (x
k
, y
k
, s
k
) = (x
k
, x
k1
, u
k1
).
DP algorithm
97
Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING
When the DP algorithm for the reformulated problem is translated in terms of the variables of the
original problem, it takes the form:
J
N
(x
N
) = g
N
(x
N
),
J
N1
(x
N1
, x
N2
, u
N2
) = min
u
N1
U
N1
(x
N1
)
E
w
N1
{ g
N1
(x
N1
, u
N1
, w
N1
)
+J
N
(f
N1
(x
N1
, x
N2
, u
N1
, u
N2
, w
N1
)
. .
x
N
) } ,
J
k
(x
k
, x
k1
. . . , u
k1
) = min
u
k
U
k
(x
k
)
E
w
k
{ g
k
(x
k
, u
k
, w
k
) +J
k+1
(f
k
(x
k
, x
k1
, u
k
, u
k1
, w
k
), x
k
, u
k
) } ,
k = 1, . . . , N 2,
J
0
(x
0
) = min
u
0
U
0
(x
0
)
E
w
0
{g
0
(x
0
, u
0
, w
0
) +J
1
(f
0
(x
0
, u
0
, w
0
), x
0
, u
0
)} .
Correlated Disturbances
Assume that w
0
, w
1
, . . . , w
N1
can be represented as the output of a linear system driven by inde-
pendent r.v. For example, suppose that disturbances can be modeled as:
w
k
= C
k
y
k+1
, where y
k+1
= A
k
y
k
+
k
, k = 0, 1, . . . , N 1,
where C
k
, A
k
are matrices of appropriate dimension, and
k
are independent random vectors. By
viewing y
k
as an additional state variable; we obtain the new system equation:
_
x
k+1
y
k+1
_
=
_
f
k
(x
k
, u
k
, C
k
(A
k
y
k
+
k
))
A
k
y
k
+
k
_
,
for some initial y
0
. In period k, this correlated disturbance can be represented as the output of a
linear system driven by independent random vectors:

k
-
y
k+1
= A
k
y
k
+
k
y
k+1
-
C
k
w
k
-
Observation: In order to have perfect state information, the controller must be able to observe y
k
.
This occurs for instance when C
k
is the identity matrix, and therefore w
k
= y
k+1
. Since w
k
is
realized at the end of period k, its known value is carried over the next period k + 1 through the
state component y
k+1
.
DP algorithm
J
N
(x
N
, y
N
) = g
N
(x
N
)
J
k
(x
k
, y
k
) = min
u
k
U
k
(x
k
)
E

k
_

_
g
k
(x
k
, u
k
, C
k
(A
k
y
k
+
k
)) +J
k+1
(f
k
(x
k
, u
k
, C
k
(A
k
y
k
+
k
)
. .
x
k+1
, A
k
y
k
+
k
. .
y
k+1
)
_

_
98
CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey
2.7.3 Forecasts
Suppose that at time k, the controller has access to a forecast y
k
that results in a reassessment of
the probability distribution of w
k
.
In particular, suppose that at the beginning of period k, the controller receives an accurate prediction
that the next disturbance w
k
will be selected according to a particular prob. distribution from a
collection {Q
1
, Q
2
, . . . , Q
m
}. For example if a forecast is i, then w
k
is selected according to a
probability distribution Q
i
. The a priory probability that the forecast will be i is denoted by p
i
and is given.
System equation:
_
x
k+1
y
k+1
_
=
_
f
k
(x
k
, u
k
, w
k
)

k
_
,
where
k
is the r.v. taking value i w.p. p
i
. So, when
k
takes the value i, then w
k+1
will occur
according to distribution Q
i
. Note that there are two sources of randomness now: w
k
= (w
k
,
k
),
where w
k
stands for the outcome of the previous forecast in the current period, and
k
passes the
new forecast to the next period.
DP algorithm
J
N
(x
N
, y
N
) = g
N
(x
N
)
J
k
(x
k
, y
k
) = min
u
k
U
k
(x
k
)
E
w
k
_
g
k
(x
k
, u
k
, w
k
) +
m

i=1
p
i
J
k+1
(f
k
(x
k
, u
k
, w
k
), i)/y
k
_
So, in current period k, the forecast y
k
is known (given), and it determines the distribution for the
current noise w
k
. For the future, the forecast is i w.p. p
i
.
2.7.4 Multiplicative Cost Functional
The basic formulation of the DP problem assumes that the cost functional is additive over time.
That is, every period (depending on states, actions and uncertainty) the system generates a cost and
it is the sum of these single-period costs that we are interested to minimize. It should be relatively
clear by now why this additivity assumption is crucial for the DP method to work. However, what
we really need is an appropriate form of separability of the cost functional into its single-period
components and additivity is one convenient (and most natural form most practical applications)
form to ensure this separability, but is is not the only one. The following exercises clarify this point.
Exercise 2.7.1 In the framework of the basic problem, consider the case where the cost is of the
form
E
w
_
exp
_
g
N
(x
N
) +sum
N1
k=1
g
k
(x
k
, u
k
, w
k
)
__
.
a) Show that the optimal cost and optimal policy can be obtained from the DP-like algorithm
J
N
(x
N
) = exp(g
N
(x
N
)), J
k
(x
k
) = min
u
k
U
k
E[J
k+1
(f
k
(x
k
, u
k
, w
k
)) exp(g
k
(x
k
, u
k
, w
k
))] .
99
Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING
b) Dene the functions V
k
(x
k
) = ln(J
k
(x
k
)). Assume also that g
k
(x, u, w) = g
k
(x, u), that is, the
g
k
are independent of w
k
. Show that the above algorithm can be rewritten as follows:
V
N
(x
n
) = g
N
(x
N
),
V
k
(x
k
) = min
u
k
U
k
{g
k
(x
k
, u
k
) + ln(E[exp(V
k+1
(f
k
(x
k
, u
k
, w
k
)))])} .
Exercise 2.7.2 Consider the case where the cost has the following multiplicative form
E
w
_
g
N
(x
N
)
N
1

k=1
g
k
(x
k
, u
k
, w
k
)
_
.
Develop a DP-like algorithm for this problem assuming g
k
(x
k
, u
k
, w
k
) 0 for all x
k
, u
k
and w
k
.
100
Chapter 3
Applications
3.1 Inventory Control
In this section, we study the inventory control problem discussed in Example 2.1.1.
3.1.1 Problem setup
We assume the following:
Excess demand in each period is backlogged and is lled when additional inventory becomes
available, i.e.,
x
k+1
= x
k
+u
k
w
k
, k = 0, 1, . . . , N 1.
Demands w
k
take values within a bounded interval and are independent.
Cost of state x:
r(x) = p max{0, x} +hmax{0, x},
where p 0 is the per-unit backlog cost, and h 0 is the per-unit holding cost.
Per-unit purchasing cost c.
Total expected cost to be minimized:
E
_
N1

k=0
(cu
k
+p max{0, w
k
x
k
u
k
} +hmax{0, x
k
+u
k
w
k
}
_
,
where the costs are incurred based on the inventory (potentially, negative) available at the
end of each period k.
Suppose that p > c (otherwise, if c p, it would never be optimal to buy stock in the last
period N 1 and possibly in the earlier periods).
Most of the subsequent analysis generalizes to the case where r() is a convex function that
grows to innity with asymptotic slopes p and h as its argument tends to and , respec-
tively.
101
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Stock ordered at
beginning of Period k
Stock at beginning
of Period k
Demand during
Period k
Figure 3.1.1: System dynamics for the inventory control problem.
Figure 3.1.1 illustrates the problem setup and the dynamics of the system.
By applying the DP algorithm, we have
J
N
(x
N
) = 0,
J
k
(x
k
) = min
u
k
0
{cu
k
+H(x
k
+u
k
) + E
w
k
[J
k+1
(x
k
+u
k
w
k
)]} , (3.1.1)
where
H(y) = E[r(y w
k
)] = pE[(w
k
y)
+
] +hE[(y w
k
)
+
].
If the probability distribution of w
k
is time-varying, then H depends on k. To simplify notation in
what follows, we will assume that all demands are identically distributed.
By dening y
k
= x
k
+ u
k
..
0
(i.e., y
k
is the inventory level right after getting the new units, and
before demand for the period is realized), we could write
J
k
(x
k
) = min
y
k
x
k
{cy
k
+H(y
k
) + E
w
k
[J
k+1
(y
k
w
k
)]} cx
k
. (3.1.2)
3.1.2 Structure of the cost function
Note that H(y) is convex, since for a given w
k
, both terms in its denition are convex (Fig-
ure 3.1.2 illustrates this) the sum is convex taking expectation on w
k
preserves convexity.
Assume J
k+1
() is convex (to be proved later), then the function G
k
(y) minimized in (3.1.2) is
convex. Suppose for now that there is an unconstrained minimum S
k
(existence to be veried);
that is, for each k, the scalar S
k
minimizes the function
G
k
(y) = cy +H(y) + E
w
[J
k+1
(y w)].
In addition, if G
k
(y) has the shape shown in Figure 3.1.3, then the minimizer of G
k
(y), for
y
k
x
k
, is
y

k
=
_
S
k
if x
k
< S
k
x
k
if x
k
S
k
.
102
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
Max{0,w
k
-y}
y w
k
Max{0,y-w
k
}
y w
k
Figure 3.1.2: Graphical illustration of the two terms in the H function.
y
k
G
k
(y)
S
k
x
k1
x
k2
Figure 3.1.3: The function G
k
(y) has a bowl shape. The minimum for y
k
x
k
1
is S
k
; the minimum for y
k
x
k
2
is x
k
2
.
Using the reverse transformation u
k
= y
k
x
k
(recall that u
k
is the amount ordered), then
an optimal policy is determined by a sequence of scalars {S
0
, S
1
, . . . , S
N1
} and has the form

k
(x
k
) =
_
S
k
x
k
if x
k
< S
k
0 if x
k
S
k
.
(3.1.3)
This control is known as basestock policy, with basestock level S
k
.
To complete the proof of the optimality of the control policy (3.1.3), we need to prove the next
result:
Proposition 3.1.1 The following three facts hold:
1. The value function J
k+1
(y) (and hence, G
k
(y)) is convex in y, k = 0, 1, . . . , N 1.
2. lim
|y|
G
k
(y) = , k.
3. lim
|y|
J
k
(y) = , k.
103
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Proof: By induction. For k = N 1,
G
N1
(y) = cy +H(y) + E
w
[J
N
(y w)
. .
0
],
and since H() is convex, G
N1
(y) is convex. For y very negative,

y
H(y) = p, and so

y
G
N1
(y) = c p < 0. For y very positive,

y
H(y) = h, and so

y
G
N1
(y) = c + h > 0.
So, lim
|y|
G
N1
(y) = .
1
Hence, the optimal control for the last period turns out to be

N1
(x
N1
) =
_
S
N1
x
N1
if x
N1
< S
N1
0 if x
N1
S
N1
,
(3.1.4)
and from the DP algorithm in (3.1.1), we get
J
N1
(x
N1
) =
_
c(S
N1
x
N1
) +H(S
N1
) if x
N1
< S
N1
H(x
N1
) if x
N1
S
N1
.
Before continuing, we need the following auxiliary result:
Claim: J
N1
(x
N1
) is convex in x
N1
.
Proof: Note that we can write
J
N1
(x) =
_
cx +cS
N1
+H(S
N1
) if x < S
N1
H(x) if x S
N1
.
(3.1.5)
Figure 3.1.4 illustrates the convexity of the function G
N1
(y) = cy + H(y). Recall that we had
denoted S
N1
the unconstrained minimizer of G
N1
(y). The unconstrained minimizer H

of the
H*
Slope=c H(S
N-1
)
G
N-1
(y)=cy+H(y)
Figure 3.1.4: The function GN1(y) is convex with unconstrained minimizer SN1.
function H(y) occurs to the right of S
N1
. To verify this, compute

y
G
N1
(y) = c +

y
H(y).
1
Note that GN1(y) is shifted one index back in the argument to show convexity, since given the convexity of JN(),
it turns out to be convex. However, we still need to prove the convexity of JN1().
104
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
2
Evaluating the derivative at S
N1
, we get

y
G
N1
(S
N1
) = c +

y
H(S
N1
) = 0,
and therefore,

y
H(S
N1
) = c < 0; that is, H() is decreasing at S
N1
, and thus its minimum H

occurs to its right.


Figure 3.1.5 plots J
N1
(x
N1
). Note that according to (3.1.5), the function is linear to the left
c x
N-1
H*
H(S
N-1
)
cS
N-1
+H(S
N-1
)
c S
N-1
J(x
N-1
) is linear
in x
N-1
for x
N-1
< S
N-1
Slope= c
Figure 3.1.5: The function JN1(xN1) is convex with unconstrained minimizer H

.
of S
N1
, and tracks H(x
N1
) to the right of S
N1
. The minimum value of J
N1
() occurs at
x
N1
= H

, but we should be cautious on how to interpret this fact: This is the best possible
state that the controller can reach, however, the purpose of DP is to prescribe the best course
of action for any initial state x
N1
at period N 1, which is given by the optimal control (3.1.4)
above.
Continuing with the proof of Proposition 3.1.1, so far we have that given the convexity of J
N
(x),
we prove the convexity of G
N1
(x), and then the convexity of J
N1
(x). Furthermore, Figure 3.1.5
also shows that
lim
|y|
J
N1
(y) = .
The argument can be repeated to show that for all k = N 2, . . . , 0, if J
k+1
(x) is convex,
lim
|y|
J
k+1
(y) = , and lim
|y|
G
k
(y) = , then we have
J
k
(x
k
) =
_
c(S
k
x
k
) +H(S
k
) + E[J
k+1
(S
k
w
k
)] if x
k
< S
k
H(x
k
) + E[J
k+1
(x
k
w
k
)] if x
k
S
k
,
where S
k
minimizes G
k
(y) = cy+H(y)+E[J
k+1
(yw)]. Furthermore, J
k
(y) is convex, lim
|y|
J
k
(y) =
, G
k1
(y) is convex, and lim
|y|
G
k1
(y) = .
2
Note that the function H(y), on a sample path basis, is not dierentiable everywhere (see Figure 3.1.2). However,
the probability of the r.v. hitting the value y is zero if w has a continuous density, and so we can assert that H() is
dierentiable w.p.1.
105
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Technical note: To formally complete the proof above, when taking derivative of G
k
(y), that
will involve taking derivative of a expected value. Under relatively mild technical conditions, we
can safely interchange dierentiation and expectation. For example, it is safe to do that when the
density f
w
(w) of the r.v. does not depend on y. More formally, if R
w
is the support of the r.v. w,

x
E[g(x, w)] =

x
_
wRw
g(x, w)f
w
(w)dw.
Using Leibnizs rule, if the function f
w
(w) does not depend on x, the set R
w
does not depend on x
either, and the derivative

x
g(x, w) is well dened and bounded, we can interchange derivative and
integral:

x
_
wRw
g(x, w)f
w
(w)dw =
_
wRw
_

x
g(x, w)
_
f
w
(w)dw,
and so

x
E[g(x, w)] = E
_

x
g(x, w)
_
.
3.1.3 Positive xed cost and (s, S) policies
Suppose that there is a xed cost K > 0 associated with a positive inventory order, i.e., the cost of
ordering u 0 units is:
C(u) =
_
K +cu if u > 0
0 if u = 0.
The DP algorithm takes the form
J
N
(x
N
) = 0
J
k
(x
k
) = min
u
k
0
{C(u
k
) +H(x
k
+u
k
) + E
w
k
[J
k+1
(x
k
+u
k
w
k
)]} ,
where again
H(y) = pE[(w y)
+
] +hE[(y w)
+
].
Consider again
G
k
(y) = cy +H(y) + E[J
k+1
(y w)].
Then,
J
k
(x
k
) = min
_

_
G
k
(x
k
)
. .
Do not order
, min
u
k
>0
{K +G
k
(x
k
+u
k
)}
. .
Order u
k
_

_
cx
k
.
By changing variable y
k
= x
k
+u
k
like in the zero xed-cost case, we get
J
k
(x
k
) = min
_
G
k
(x
k
), min
y
k
>x
k
{K +G
k
(y
k
)}
_
cx
k
. (3.1.6)
When K > 0, G
k
is not necessarily convex
3
, opening the possibility of very complicated optimal
policies (see Figure 3.1.6). Under this kind of function G
k
(y), for the cost function (3.1.6), the
optimal policy would be:
3
Note that G
k
involves K through J
k+1
.
106
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
G
k
(s)
y
Increasing
Increasing
Figure 3.1.6: Potential form of the function G
k
(y) when the xed cost is nonzero.
1. If x
k
Zone I G
k
(x
k
) > G
k
(s), x
k
< s Order u

k
= Sx
k
, such that y

k
= S. Clearly,
G
k
(S) +K < G
k
(x
k
), x
k
< s. So, if x
k
Zone I, u

k
= S x
k
.
2. If x
k
Zone II
If s < x
k
< S and u > 0 (i.e., y
k
> x
k
) K + G
k
(y
k
) > G
k
(x
k
), and it is suboptimal
to order.
If S < x
k
< y

and u > 0 (i.e., y


k
> x
k
) K + G
k
(y
k
) > G
k
(x
k
), and it is also
suboptimal to order.
So, if x
k
Zone II, u

k
= 0.
3. If x
k
Zone III Order u

k
=

S x
k
, so that y

k
=

S, and G
k
(x
k
) > K + G(

S), for all


y

< x
k
< s.
4. If x
k
Zone IV Do not order (i.e., u

k
= 0), since otherwise K+G
k
(y
k
) > G
k
(x
k
), y
k
> x
k
.
In summary, the optimal policy would be to order u

k
= (S x) in zone I, u

k
= 0 in zones II and IV,
and u

k
= (

S x) in zone III.
We will show below that even though the functions G
k
may not be convex, they do have some
structure: they are K-convex.
Denition 3.1.1 A real function g(y) is K-convex if and only if it veries the property:
K +g(z +y) g(y) +z
_
g(y) g(y b)
b
_
,
for all z 0, b > 0, y R.
The denition is illustrated in Figure 3.1.7. Observation: Note that the situation described in
Figure 3.1.6 is impossible under K-convexity: Since y
0
is a local maximum in zone III, we must
have for b > 0 small enough,
G
k
(y
0
) G
k
(y
0
b) 0
G
k
(y
0
) G
k
(y
0
b)
b
0,
107
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Figure 3.1.7: Graph of a k-convex function.
and from the denition of K-convexity, we should have for

S = y
0
+z, and y = y
0
,
K +G
k
(

S) G
k
(y
0
) + z
..
0
G
k
(y
0
) G
k
(y
0
b)
b
. .
0
G
k
(y
0
),
which does not hold in our case.
Intuition: A K-convex function is a function that is almost convex, and for which K represents
the size of the almost. Scarf(1960) invented the notion of K-convex functions for the explicit
purpose of analyzing this inventory model.
For a function f to be K-convex, it must lie below the line segment connecting (x, f(x)) and
(y, K + f(y)), for all real numbers x and y such that x y. Figure 3.1.8 below shows that a
K-convex function, namely f
1
, need not be continuous. However, it can be shown that a K-convex
function cannot have a positive jump at a discontinuity, as illustrated by f
2
. Moreover, a negative
jump cannot be too large, as illustrated by f
3
.
Next, we compile some results on K-convex functions:
Lemma 3.1.1 Properties of K-convex functions:
(a) A real-valued convex function g is 0-convex and hence also K-convex for all K > 0.
(b) If g
1
(y) and g
2
(y) are K-convex and L-convex respectively, then g
1
(y) +g
2
(y) is (K +L)-
convex, for all , > 0.
(c) If g(y) is K-convex and w is a random variable, then E
w
[g(y w)] is also K-convex, provided
E
w
[|g(y w)|] < , for all y.
(d) If g is a continuous K-convex function and g(y) as |y| , then there exist scalars s
and S, with s S, such that
(i) g(S) g(y), y (i.e., S is a global minimum).
108
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
f
1
f
2
f
3
y
x
K
K
K
Figure 3.1.8: Function f1 is K-convex; f2 and f3 are not.
(ii) g(S) +K = g(s) < g(y), y < s.
(iii) g(y) is decreasing on (, s).
(iv) g(y) g(z) +K, y, z, with s y z.
Using part (d) of Lemma 3.1.1, we will show that the optimal policy is of the form

k
(x
k
) =
_
S
k
x
k
if x
k
< s
k
0 if x
k
s
k
,
where S
k
is the value of y that minimizes G
k
(y), and s
k
is the smallest value of y for which
G
k
(y) = K +G
k
(S
k
). This control policy is called the (s, S) multiperiod policy.
109
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Proof of the optimality of the (s, S) multiperiod policy
For stage N 1,
G
N1
(y) = cy +H(y) + E
w
[J
N
(y w)
. .
0
]
Therefore, G
N1
(y) is clearly convex It is K-convex. Then, we have
J
N1
(x) = min
_
G
N1
(x), min
y>x
{K +cy +G
N1
(y)}
_
cx,
where by dening S
N1
as the minimizer of G
N1
(y) and s
N1
= min{y : G
N1
(y) = K +
G
N1
(S
N1
)} (see Figure 3.1.9), we have the optimal control

N1
(x
N1
) =
_
S
N1
x
N1
if x
N1
< s
N1
0 if x
N1
s
N1
,
which leads to the optimal value function
J
N1
(x) =
_
K +G
N1
(S
N1
) cx for x < s
N1
G
N1
(x) cx for x s
N1
.
(3.1.7)
Observations:
s
N1
= S
N1
, because K > 0


y
G
N1
(s
N1
) 0
It turns out that the left derivative of J
N1
() at s
N1
is greater than the right derivative J
N1
()
is not convex (again, see Figure 3.1.9). Here, as we saw for the zero xed ordering cost, the minimum
G
N-1
(x)=c x + H(x)
H(x) = G
N-1
(x) c x
s
N-1
S
N-1
-cx
H*
c x J
x
N

) (
1
c x G
x
c x J
x
N N



0
1 1
) ( ) (
Slope=c
Figure 3.1.9: Structure of the cost-to-go function when xed cost is nonzero.
110
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
H

occurs to the right of S


N1
(recall that S
N1
is the unconstrained minimizer of G
N1
(x)). To
see this, note that

y
G
N1
(y) = c +

y
H(y)

y
G
N1
(S
N1
) = c +

y
H(S
N1
) = 0

y
H(S
N1
) = c < 0,
meaning that H is decreasing at S
N1
, and so its minimum H

occurs to the right of S


N1
.
Claim: J
N1
(x) is K-convex.
Proof: We must verify for all z 0, b > 0, and y, that
K +J
N1
(y +z) J
N1
(y) +z
_
J
N1
(y) J
N1
(y b)
b
_
(3.1.8)
There are three cases according to the relative position of y, y +z, and s
N1
.
Case 1: y s
N1
(i.e., y +z y s
N1
).
If y b s
N1
J
N1
(x) = G
N1
(x)
. .
convexK-convex
cx
..
linear
, so by part (b) of
Lemma 3.1.1, it is K-convex.
If y b < s
N1
in view of equation (3.1.7) we can write (3.1.8) as
K +J
N1
(y +z) = K +G
N1
(y +z) c(y +z)
G
N1
(y) cy
. .
J
N1
(y)
+z
_
_
_
_
_
_
_
_
_
_
_
J
N1
(y)
..
G
N1
(y) cy
J
N1
(yb)=
G
N1
(s
N1
)
..
K +G
N1
(S
N1
) c(yb)
..
G
N1
(s
N1
) +c(y b)
b
_
_
_
_
_
_
_
_
_
_
_
,
or equivalently,
K +G
N1
(y +z) G
N1
(y) +z
_
G
N1
(y) G
N1
(s
N1
)
b
_
(3.1.9)
There are three subcases:
(i) If y is such that G
N1
(y) G
N1
(s
N1
), y = s
N1
by the K-convexity of
G
N1
, and taking y s
N1
as the constant b > 0,
K +G
N1
(y +z) G
N1
(y) +z
_
_
_
_
G
N1
(y) G
N1
(
y(ys
N1
)
..
s
N1
)
y s
N1
_
_
_
_
.
Thus, K-convexity hold.
111
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
(ii) If y is such that G
N1
(y) < G
N1
(s
N1
) From part (d-i) in Lemma 3.1.1, for a
scalar y +z,
K +G
N1
(y +z) K +G
N1
(S
N1
)
= G
N1
(s
N1
) (by denition of s
N1
)
> G
N1
(y) (by hypothesis of this case).
G
N1
(y) +z
_
_
_
_
<0
..
G
N1
(y) G
N1
(s
N1
)
b
_
_
_
_
,
and equation (3.1.9) holds.
(iii) If y = s
N1
, then by K-convexity of G
N1
, note that (3.1.9) becomes
K +G
N1
(y +z) G
N1
(y) +z
_
_
_
_
0
..
G
N1
(y) G
N1
(s
N1
)
b
_
_
_
_
= G
N1
(y).
From Lemma 3.1.1, part (d-iv), taking y = s
N1
there,
K +G
N1
(s
N1
+z) G
N1
(s
N1
)
is veried, for all z 0.
Case 2: y y +z s
N1
.
By equation (3.1.7), the function J
N1
(y) is linear It is K-convex.
Case 3: y < s
N1
< y +z.
Here, we can write (3.1.8) as
K +J
N1
(y +z) = K +G
N1
(y +z) c(y +z)
J
N1
(y) +z
_
_
_
_
J
N1
(y) J
N1
(
<y
..
y b)
b
_
_
_
_
= K +G
N1
(s
N1
) cy +z
_
K +G
N1
(s
N1
) cy (K +G
N1
(s
N1
) c(y b))
b
_
= K +G
N1
(s
N1
) cy
czb
b
= K +G
N1
(s
N1
) c(y +z)
Thus, the previous sequence of relations holds if and only if
K +G
N1
(y +z) c(y +z) K +G
N1
(s
N1
) c(y +z),
or equivalently, if and only if G
N1
(y + z) G
N1
(s
N1
), which holds from Lemma 3.1.1,
part (d-i), since G
N1
() is K-convex.
112
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
This completes the proof of the claim.
We have thus proved that K-convexity and continuity of G
N1
, together with the fact that G
N1
(y)
as |y| , imply K-convexity of J
N1
. In addition, J
N1
(x) can be seen to be continuous in x.
Using the following facts:
From the denition of G
k
(y):
G
N2
(y) = cy +H(y) + E
w
[ J
N1
(y w)
. .
Kconvex from
Lemma 3.1.1-(c)
]
. .
K-convex from Lemma 3.1.1-(b)
G
N2
(y) is continuous (because of boundedness of w
N2
).
G
N2
(y) as |y| .
and repeating the preceding argument, we obtain that J
N2
is K-convex, and proceeding similarly,
we prove K-convexity and continuity of the functions G
k
for all k, as well as that G
k
(y) as
|y| . At the same time, by using Lemma 3.1.1-(d), we prove optimality of the multiperiod (s, S)
policy.
Finally, it is worth noting that it is not necessary that G
k
() be K-convex for an (s, S) policy to be
optimal; it is just a sucient condition.
3.1.4 Exercises
Exercise 3.1.1 Consider an inventory problem similar to the one discussed in class, with zero xed
cost. The only dierence is that at the beginning of each period k the decision maker, in addition
to knowing the current inventory level x
k
, receives an accurate forecast that the demand w
k
will be
selected in accordance with one out of two possible probability distributions P
l
, P
s
(large demand,
small demand). The a priori probability of a large demand forecast is known.
(a) Obtain the optimal ordering policy for the case of a single-period problem
(b) Extend the result to the N-period case
Exercise 3.1.2 Consider the inventory problem with nonzero xed cost, but with the dierence
that demand is deterministic and must be met at each time period (i.e., the shortage cost per unit
is ). Show that it is optimal to order a positive amount at period k if and only if the stock x
k
is insucient to meet the demand w
k
. Furthermore, when a positive amount is ordered, it should
bring up stock to a level that will satisfy demand for an integral number of periods.
Exercise 3.1.3 Consider a problem of expanding over N time periods the capacity of a production
facility. Let us denote by x
k
the production capacity at the beginning of period k, and by u
k
0
the addition to capacity during the kth period. Thus, capacity evolves according to
x
k+1
= x
k
+u
k
, k = 0, 1, . . . , N 1.
113
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
The demand at the kth period is denoted w
k
and has a known probability distribution that does not
depend on either x
k
or u
k
. Also, successive demands are assumed to be independent and bounded.
We denote:
C
k
(u
k
): Expansion cost associated with adding capacity u
k
.
P
k
(x
k
+u
k
w
k
): Penalty associated with capacity x
k
+u
k
and demand w
k
.
S(x
N
): Salvage value of nal capacity x
N
.
Thus, the cost function has the form
E
w
0
,...,w
N1
_
S(x
N
) +
N1

k=0
(C
k
(u
k
) +P
k
(x
k
+u
k
w
k
))
_
.
(a) Derive the DP algorithm for this problem.
(b) Assume that S is a concave function with lim
x
dS(x)/dx = 0, P
k
are convex functions, and
the expansion cost C
k
is of the form
C
k
(u) =
_
K +c
k
u if u > 0,
0 if u = 0,
where K 0, c
k
> 0 for all k. Show that the optimal policy is of the (s, S) type assuming
c
k
y + E[P
k
(y w
k
)] as |y| .
3.2 Single-Leg Revenue Management
Revenue Management (RM) is an OR subeld that deals with business related problems where
there are nite, perishable capacities that must be depleted by a due time. Applications span
from airlines, hospitality industry, and car rental, to more recent practices in retailing and media
advertising. The problem sketched below is the basic RM problem that consists of rationing the
capacity of a single resource through imposing limits on the quantities to be sold at dierent prices,
for a given set of prices.
Setting:
Initial capacity C; remaining capacity denoted by x.
There are n costumers classes labeled such that p
1
> p
2
> > p
n
.
Time indices run backwards in time.
Class n arrives rst, followed by classes n 1, n 2, , 1.
Demands are r.v: D
n
, D
n1
, . . . , D
1
.
At the beginning of stage j, demands D
j
, D
j1
, . . . , D
1
.
Within stage j the model assumes the following sequence of events:
114
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
1. The realization of the demand D
j
occurs, and we observe the value.
4
2. We decide on a quantity u to accept: u min{D
j
, x}. The optimal control then is a
function of both current demand and remaining capacity: u

(D
j
, x). This is done for
analytical convenience. In practice, the control decision has to be made before observ-
ing D
j
. We will see that the calculation of the optimal control does not use information
about D
j
, so this assumption vanishes ex-post.
3. Revenue p
j
u is collected, and we proceed to stage j 1 (since indices run backwards).
For the single-leg RM problem, the DP formulation becomes
V
j
(x) = E
D
j
_
max
0umin{D
j
,x}
{p
j
u +V
j1
(x u)}
_
, (3.2.1)
with boundary conditions: V
0
(x) = 0, x = 0, 1, . . . , C. Note that in this formulation we have
inverted the usual order between max{} and E[]. We prove below that this is w.l.o.g. for the kind
of setting that we are dealing with here.
3.2.1 System with observable disturbances
Departure from standard DP: We can base our control u
t
on perfect knowledge of the random noise
of the current period, w
t
. For this section, assume as in the basic DP setting that indices run
forward.
Claim: Assume a discrete nite horizon t = 1, 2, . . . , T. The formulation
V
t
(x) = max
u(x,wt)Ut(x,wt)
E
wt
[g
t
(x, u(x, w
t
), w
t
) +V
t+1
(f
t
(x, u(x, w
t
), w
t
))]
is equivalent to
V
t
(x) = E
w
j
_
max
u(x)Ut(x)
g
t
(x, u, w
t
) +V
t+1
(f
t
(x, u, w
t
))
_
, (3.2.2)
which is more convenient to handle.
Proof: State space augmentation argument:
1. Reindex disturbances by dening w
t
= w
t+1
, t = 1, . . . , T 1.
2. Augment state to include the new disturbance, and dene the system equation:
_
x
t+1
y
t+1
_
=
_
f
t
(x
t
, u
t
, w
t
)
w
t
_
.
3. Starting from (x
0
, y
0
) = (x, w
1
), the standard DP recursion is:
V
t
(x, y) = max
u(x)Ut(x)
E
e wt
[g
t
(x, u, y) +V
t+1
(f
t
(x, u, y), w
t
)]
4
Note that this is a departure from the basic DP formulation where typically we make a decision in period k before
the random noise w
k
is realized.
115
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
4. Dene G
t
(x) = E
e wt
[V
t
(x, w
t
)]
5. Note that:
V
t
(x, y) = max
u(x)Ut(x)
E
e wt
[g
t
(x, u, y) +V
t+1
(f
t
(x, u, y), w
t
)]
= max
u(x)Ut(x)
{g
t
(x, u, y) + E
e wt
[V
t+1
(f
t
(x, u, y), w
t
)]} (because g
t
() does not depend on w
t
)
= max
u(x)Ut(x)
{g
t
(x, u, y) +G
t+1
(f
t
(x, u, y))} (by denition of G
t+1
())
6. Replace y by w
t
and take expectation with respect to w
t
on both sides above to obtain:
E
wt
[V
t
(x, w
t
)]
. .
Gt(x)
= E
wt
_
max
u(x)Ut(x)
{g
t
(x, u, w
t
) +G
t+1
(f
t
(x, u, w
t
))}
_
Observe that the LHS is indeed G
t
(x) modulus a small issue with the name of the random
variable, which is justied by noting that G
t
(x)

= E
e wt
[V
t
(x, w
t
)] = E
w
[V
t
(x, w)]. Finally,
there is another minor name issue, because the nal DP is expressed in terms of the value
function G. It remains to replace G by V to recover formulation (3.2.2).
In words, what we are doing is anticipating and solving today the problem that we will face tomor-
row, given the disturbances of today. The implicit sequence of actions of this alternative formulation
is the following:
1. We observe current state x.
2. The value of the disturbance w
t
is realized.
3. We make the optimal decision u

(x, w
t
).
4. We collect the current period reward g
t
(x, u(x, w
t
), w
t
).
5. We move to the next state t + 1.
3.2.2 Structure of the value function
Now, we turn to our original RM problem. First, we dene the marginal value of capacity,
V
j
(x) = V
j
(x) V
j
(x 1), (3.2.3)
and proceed to characterize the structure of the value function.
Proposition 3.2.1 The marginal value of capacity V
j
(x) satises:
(i) For a xed j, V
j
(x + 1) V
j
(x), x = 0, . . . , C.
(ii) For a xed x, V
j+1
(x) V
j
(x), j = 1, . . . , n.
The proposition states two intuitive economic properties:
116
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
(i) For a given period, the marginal value of capacity is decreasing in the number of units left.
(ii) The marginal value of capacity x at stage j is smaller than its value at stage j +1 (recall that
indices are running backwards). Intuitively, this is because there are less periods remaining,
and hence less opportunities to sell the xth unit.
Before going over the proof of this proposition, we need the following auxiliary lemma:
Lemma 3.2.1 Suppose g : Z
+
R is concave. Let f : Z
+
R be dened by:
f(x) = max
a=1,...,m
{ap +g(x a)}
for any given p 0; and nonnegative integer m x. Then f(x) is concave in x as well.
Proof: We proceed in three steps:
1. Change of variable: Dene y = x a, so that we can write:
f(x) =

f(x) +px; where

f(x) = max
xmyx
{yp +g(y)}
With this change of variable, we have that a = x y and hence the inner part can be written
as: (x y)p + g(y), where the range for the argument is such that 0 x y m, or
x m y x. The new function is
f(x) = max
xmyx
{(x y)p +g(y)}.
Thus, f(x) =

f(x) +px, where

f(x) = max
xmyx
{yp +g(y)}
Note that since x m, then y 0.
2. Closed-form for

f(x): Let h(y) = yp+g(y), for y 0. Let y

be the unconstrained maximizer


of h(y), i.e. y

= argmax
y0
h(y). Because of the shape of h(y), this maximizer is always well
dened. Moreover, since g(y) is concave, h(y) is also concave, nondecreasing for y y

, and
nonincreasing for y > y

. Therefore, for given m and p:

f(x) =
_

_
xp +g(x) if x y

p +g(y

) if y

x y

+m
(x m)p +g(x m) if x y

+m
The rst part holds because h(y) is nondecreasing for 0 y y

. The second part holds


because

f(x) = y

p +g(y

) = h(y

), for y

in the range {x m, . . . , x}, or equivalently, for


y

x y

+m. Finally, since h(y) is nonincreasing for y > y

, the maximum is attained in


the border of the range, i.e., in x m.
3. Concavity of

f(x):
117
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Take x < y

. We have

f(x + 1)

f(x) = [(x + 1)p +g(x + 1)] [xp +g(x)]
p +g(x + 1) g(x) (because g(x) is concave)
=

f(x)

f(x 1).
So, for x < y

,

f(x) is concave.
Take y

x < y

+m. Here,

f(x + 1)

f(x) = 0, and so

f(x) is trivially concave.
Take x y

+m. We have

f(x + 1)

f(x) = [(x + 1 m)p +g(x + 1 m)] [(x m)p +g(x m)]
= p +g(x + 1 m) g(x m)
p +g(x m) g(x m1) (because g(x) is concave)
=

f(x)

f(x 1).
So, for x y

+m,

f(x) is concave.
Therefore

f(x) is concave for all x 0, and since f(x) =

f(x) + px, f(x) is concave in x 0 as
well.
Proof of Proposition 3.2.1
Part (i): V
j
(x + 1) V
j
(x), x.
By induction:
In terminal stage: V
0
(x) = 0, x, so it holds.
IH: Assume that V
j1
(x) is concave in x.
Consider V
j
(x). Note that
V
j
(x) = E
D
j
_
max
0umin{D
j
,x}
{p
j
u +V
j1
(x u)}
_
.
For any realization of D
j
, the function
H(x, D
j
) = max
0umin{D
j
,x}
{p
j
u +V
j1
(x u)}
has exactly the same structure as the function of the Lemma above, with m = min{D
j
, x}, and
therefore it is concave in x. Since E
D
j
[H(x, D
j
)] is a weighted average of concave functions,
it is also concave.
Going back to the original formulation for the single-leg RM problem in (3.2.1), we can express it
as follows:
V
j
(x) = E
D
j
_
max
0umin{D
j
,x}
{p
j
u +V
j1
(x u)}
_
= V
j1
(x) + E
D
j
_
max
0umin{D
j
,x}
_
u

z=1
(p
j
V
j1
(x + 1 z))
__
, (3.2.4)
118
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
where we are using (3.2.3) to write V
j1
(x u) as a sum of increments:
V
j1
(x u) = V
j1
(x)
u

z=1
V
j1
(x + 1 z)
= V
j1
(x) [V
j1
(x) + V
j1
(x 1) + +
+ V
j1
(x + 1 (u 1)) + V
j1
(x + 1 u)]
= V
j1
(x) [V
j1
(x) V
j1
(x 1) +V
j1
(x 1) V
j1
(x 2) + +
+V
j1
(x + 2 u) V
j1
(x + 1 u) +V
j1
(x + 1 u) V
j1
(x u)]
Note that all terms in the RHS except for the last one cancel out. The inner sum in (3.2.4) is
dened to be zero when u = 0.
Part (ii): V
j+1
(x) V
j
(x), j.
From (3.2.4) we can write:
V
j+1
(x) = V
j
(x) + E
D
j+1
_
max
0umin{D
j+1
,x}
_
u

z=1
(p
j+1
V
j
(x + 1 z))
__
.
Similarly, we can write:
V
j+1
(x 1) = V
j
(x 1) + E
D
j+1
_
max
0umin{D
j+1
,x1}
_
u

z=1
(p
j+1
V
j
(x z))
__
.
Subtracting both equalities, we get:
V
j+1
(x) = V
j
(x) + E
D
j+1
_
max
0umin{D
j+1
,x}
_
u

z=1
(p
j+1
V
j
(x + 1 z))
__
E
D
j+1
_
max
0umin{D
j+1
,x1}
_
u

z=1
(p
j+1
V
j
(x z))
__
V
j
(x) + E
D
j+1
_
max
0umin{D
j+1
,x}
_
u

z=1
(p
j+1
V
j
(x z))
__
E
D
j+1
_
max
0umin{D
j+1
,x1}
_
u

z=1
(p
j+1
V
j
(x z))
__
V
j
(x) + E
D
j+1
_
max
0umin{D
j+1
,x1}
_
u

z=1
(p
j+1
V
j
(x z))
__
E
D
j+1
_
max
0umin{D
j+1
,x1}
_
u

z=1
(p
j+1
V
j
(x z))
__
= V
j
(x),
where the rst inequality holds from part (i) in Proposition 3.2.1, and the second one holds because
the domain of u in the maximization problem of the rst expectation (in the second to last line) is
smaller, and hence it is a more constrained optimization problem.
119
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
3.2.3 Structure of the optimal policy
The good feature of formulation (3.2.4) is that it is very insightful about the structure of the
optimal policy. In particular, from part (i) of Proposition 3.2.1, since V
j
(x) is decreasing in x,
p
j
V
j1
(x + 1 z) is decreasing in z. So, it is optimal to keep adding terms to the sum (i.e.,
increase u) as long as
p
j
V
j1
(x + 1 u) 0,
or the upper bound min{D
j
, x} is reached, whichever comes rst. In words, we compare the
instantaneous revenue p
j
with the marginal value of capacity (i.e., the value of a unit if we keep it
for the next period). If the former dominates, then we accept the price p
j
for the unit.
The resulting optimal controls can be expressed in terms of optimal protection levels y

j
, for
classes j, j 1, . . . , 1 (i.e., class j and higher in the revenue order). Specically, we dene
y

j
= max{x : 0 x C, p
j+1
< V
j
(x)}, j = 1, 2, . . . , n 1, (3.2.5)
and we assume y

0
= 0 and y

n
= C. Figure 3.2.1 illustrates the determination of y

j
. For x y

j
,
p
j+1
V
j
(x), and therefore it is worth waiting for the demand to come rather than selling now.
x
C
p
j+1
V
j
(x)
y
j
*
Reject class j+1 Accept class j+1
Figure 3.2.1: Calculation of the optimal protection level y

j
.
The optimal control at stage j + 1 is then

j+1
(x, D
j+1
) = min{(x y

j
)
+
, D
j+1
}
The key observation here is that the computation of y

j
does not depend on D
j+1
, because the
knowledge of D
j+1
does not aect the future value of capacity. Therefore, going back to the
assumption we made at the beginning, assuming that we know demand D
j+1
to compute y

j
does
not really matter, because we do not make real use of that information.
Part (ii) in Proposition 3.2.1 implies the nested protection structure
y

1
y

2
y

n1
y

n
= C.
This is illustrated in Figure 3.2.2. The reason is that since the curve V
j1
(x) is below the
curve V
j
(x) pointwise, and since by denition, p
j
> p
j+1
, then y

j1
y

j
.
120
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
x
C
p
j+1
V
j
(x)
y
j
*
V
j-1
(x)
y
j-1
*
p
j
Figure 3.2.2: Nesting feature of the optimal protection levels.
3.2.4 Computational complexity
Using the optimal control, the single-leg RM problem (3.2.1) could be reformulated as
V
j
(x) = E
D
j
_
p
j
min{(x y

j1
)
+
, D
j
} +V
j1
(x min{(x y

j1
)
+
, D
j
})

(3.2.6)
This procedure is repeated starting from j = 1 and working backward to j = n.
For discrete-demand distributions, computing the expectation in (3.2.6) for each state x re-
quires evaluating at most O(C) terms since min{(x y

j1
)
+
, D
j
} C. Since there are C
states (capacity levels), the complexity at each stage is O(C
2
).
The critical values y

j
can then be identied from (3.2.5) in log(C) time by binary search
as V
j
(x) is nonincreasing. In fact, since we know y

j
y

j1
, the binary search can be further
constrained to values in the interval [y

j1
, C]. Therefore, computing y

j
does not add to the
complexity at stage j
These steps must be repeated for each of the n1 stages, giving a total complexity of O(nC
2
).
3.2.5 Airlines: Practical implementation
Airlines that use capacity control as their RM strategy (as opposed to dynamic pricing) post pro-
tection levels y

j
in their own reservation systems, and accept requests for product j + 1 until y

j
is
reached or stage j + 1 ends (whichever comes rst). Figure 3.2.3 is a snapshot from Expedia.com
showing this practice from American Airlines.
3.2.6 Exercises
Exercise 3.2.1 Single-leg Revenue Management problem: For a single leg RM problem
assume that:
There are n = 10 classes.
121
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Figure 3.2.3: Optimal protection levels at American Airlines.
Demand D
j
is calculated through discretizing a truncated normal with mean = 10 and
standard deviation = 2, on support [0, 20]. Specically, take:
P(D
j
= k) =
((k + 0.5 10)/2) ((k 0.5 10)/2)
((20.5 10)/2) ((0.5 10)/2)
, k = 0, . . . , 20
Note that this discretization and re-scaling veries:

20
k=0
P(D
j
= k) = 1.
Total capacity available is C = 100.
Prices are p
1
= 500, p
2
= 480, p
3
= 465, p
4
= 420, p
5
= 400, p
6
= 350, p
7
= 320, p
8
= 270, p
9
=
250, and p
10
= 200.
Write a MATLAB or C code to compute optimal protection levels y

1
, . . . , y

9
; and nd the total
expected revenue V
10
(100). Note that you can take advantage of the structure of the optimal policy
to simplify its computation. Submit your results, and a copy of the code.
Exercise 3.2.2 Heuristic for the single-leg RM problem: In the airline industry, the single-
leg RM problem is typically solved using a heuristic; the so-called EMSR-b (expected marginal seat
revenue - version b). There is no much reason for this other than the tradition of its usage, and
the fact that it provides consistently good results. Here is a description:
Consider stage j + 1 in which we want to determine protection level y
j
. Dene the aggregated
future demand for classes j, j 1, . . . , 1, by S
j
=

j
k=1
D
k
, and let the weighted-average revenue
122
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
from classes 1, . . . , j, denoted p
j
, be dened by
p
j
=

j
k=1
p
k
E[D
k
]

j
k=1
E[D
k
]
.
Then the EMSR-b protection level for class j and higher, y
j
, is chosen by
P(S
j
> y
j
) =
p
j+1
p
j
.
It is common when using EMSR-b to assume demand for each class j is independent and normally
distributed with mean
j
and variance
2
j
, in which case
y
j
= +z

,
where =

j
k=1

k
is the mean and
2
=

j
k=1

2
k
is the variance of the aggregated demand to
come at stage j + 1, and
z

=
1
(1 p
j+1
/ p
j
).
Apply this heuristic to compute protection levels y
1
, . . . , y
9
using the data of the previous exercise
and assuming that demand is normal (no truncation, no discretization), and compare the outcome
with the optimal protection levels computed before.
3.3 Optimal Stopping and Scheduling Problems
In this section, we focus on two other types of problems with perfect state information: optimal
stopping problems (mainly) and discuss few ideas on scheduling problems.
3.3.1 Optimal stopping problems
We assume the following:
At each state, there is a control available that stops the system.
At each stage, you observe the current state and decide either to stop or continue.
Each policy consists of a partition of the set of states x
k
into two regions: the stop region and
the continue region. Figure 3.3.1 illustrates this.
Domain of states remains the same throughout the process.
Application: Asset selling problem
Consider a person owning an asset for which she is oered an amount of money from period
to period, across N periods.
Oers are random and independent, denoted w
0
, w
1
, ...w
N1
, with w
i
[0, w].
123
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Continue Region
Stop
Region
Initial state
Terminal state
Figure 3.3.1: Each policy consists of a partition of the state space into the stop and the continue regions.
If the seller accepts an oer, she can invest the money at a xed rate r > 0.
Otherwise, she waits until next period to consider the next oer.
Assume that the last oer w
N1
must be accepted if all prior oers are rejected.
Objective: Find a policy for maximizing reward at the Nth period.
Lets solve this problem.
Control:

k
(x
k
) =
_
u
1
: Sell
u
2
: Wait
State: x
k
= R
+
{T}.
System equation:
x
k+1
=
_
T if x
k
= T, or x
k
= T and
k
= u
1
,
w
k
otherwise.
Reward function:
E
w
0
,...w
N1
_
g
N
(x
N
) +
N1

k=0
g
k
(x
k
,
k
, w
k
)
_
where
g
N
(x
N
) =
_
x
N
if x
N
= T,
0 if x
N
= T
(i.e., the seller must accept the oer by time N),
and for k = 0, 1, . . . , N 1,
g
k
(x
k
,
k
, w
k
) =
_
(1 +r)
Nk
x
k
if x
k
= T and
k
= u
1
,
0 otherwise.
Note that here, a critical issue is how to account for the reward, being careful with the double
counting. In this formulation, once the seller accepts the oer, she gets the compound interest
for the rest of the horizon all together, and from there onwards, she gets zero reward.
124
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
DP formulation
J
N
(x
N
) =
_
x
N
if x
N
= T,
0 if x
N
= T.
(3.3.1)
For k = 0, 1, . . . , N 1,
J
k
(x
k
) =
_

_
max{(1 +r)
Nk
x
k
. .
Sell
, E[J
k+1
(w
k
)]
. .
Wait
} if x
k
= T,
0 if x
k
= T.
(3.3.2)
Optimal policy: Accept oer only when
x
k
>
k

=
E[J
k+1
(w
k
)]
(1 +r)
Nk
Note that
k
represents the net present value of the expected reward. This comparison is
a fair one, because it is conducted between the instantaneous payo x
k
and the expected
reward discounted back to the present time k. Thus, the optimal policy is of the threshold
type, described by the scalar sequence {
k
: k = 0, . . . , N 1}. Figure 3.3.2 represents this
threshold structure.
0 1 2 N - 1 N k
ACCEPT
REJECT

N - 1

2
Figure 3.3.2: Optimal policy of accepting/rejecting oers in the asset selling problem.
Proposition 3.3.1 Assume that oers w
k
are i.i.d., with w F(). Then,
k

k+1
, k =
1, . . . , N 1, with
N
= 0.
Proof: For now, lets disregard the terminal condition, and dene
V
k
(x
k
)

=
J
k
(x
k
)
(1 +r)
Nk
, x
k
= T.
We can rewrite equations (3.3.1) and (3.3.2) as follows:
V
N
(x
N
) = x
N
,
V
k
(x
k
) = max{x
k
, (1 +r)
1
E
w
[V
k+1
(w)]}, k = 0, . . . , N 1. (3.3.3)
125
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Hence, dening
N
= 0 (since we have to accept no matter what in the last period), we get

k
=
E
w
[V
k+1
(w)]
1 +r
, k = 0, 1, ..., N 1.
Next, we compare the value function at periods N 1 and N: For k = N and k = N 1, we have
V
N
(x) = x
V
N1
(x) = max{x, (1 +r)
1
E
w
[V
N
(w)]
. .

N1
} V
N
(x)
Given that we have a stationary system, from the monotonicity of DP (see Homework #2), we know
that
V
1
(x) V
2
(x) V
N
(x), x.
Since
k
=
E
w
[V
k+1
(w)]
1 +r
and
k+1
=
Ew[V
k+2
(w)]
1+r
, we have
k

k+1
.
Compute limiting
Next, we explore the question: What if the selling horizon is very long? Note that equation (3.3.3)
can be written as V
k
(x
k
) = max{x
k
,
k
}, where

k
= (1 +r)
1
E
w
[V
k+1
(w)]
=
1
1 +r
_

k+1
0

k+1
dF(w) +
1
r + 1
_

k+1
wdF(w)
=

k+1
1 +r
F(
k+1
) +
1
1 +r
_

k+1
wdF(w) (3.3.4)
We will see that the sequence {
k
} converges as k (i.e., as the selling horizon becomes very
long).
Observations:
1. 0
F()
1 +r

1
1 +r
.
2. For k = 0, 1, . . . , N 1,
0
1
1 +r
_

k+1
wdF(w)
1
1 +r
_

0
wdF(w) =
E[w]
1 +r
.
3. From equation (3.3.4) and Proposition 3.3.1:

k


k+1
1 +r
+
E[w]
1 +r

k
1
1 +r
+
E[w]
1 +r

k
<
E[w]
r
Using
k

k+1
and knowing that the sequence is bounded from above, we know that when
k ,
k
, where satises
(1 +r) = F( ) +
_


wdF(w)
126
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
When N is big, then an approximate method is to use the constant policy: Accept the oer x
k
if and only if x
k
> . More formally, if we dene G() as
G()

=
1
r + 1
_
F() +
_

wdF(w)
_
,
then from the Contraction Mapping Theorem (due to Banach, 1922), G() is a contraction mapping,
and hence the iterative procedure
n+1
= G(
n
) nds the unique xed point in [0, E[w]/r], starting
from any arbitrary
0
[0, E[w]/r].
Recall: G is a contraction mapping if for all x, y R
n
, ||G(x) G(y)|| < K||x y||, for a constant
0 K < 1, K independent of x, y.
Application: Purchasing with a deadline
Assume that a certain quantity of raw material is needed at a certain time.
Price of raw materials uctuates
Decision: Purchase or not?
Objective: Minimum expected price of purchase
Assume that successive prices w
k
are i.i.d. and have c.d.f. F().
Purchase must be made within N time periods.
Controls:

k
(x
k
) =
_
u
1
: Purchase
u
2
: Wait
State: x
k
= R
+
{T}.
System equation:
x
k+1
=
_
T if x
k
= T, or x
k
= T and
k
= u
1
,
w
k
otherwise.
DP formulation:
J
N
(x
N
) =
_
x
N
if x
N
= T,
0 otherwise.
For k = 0, . . . , N 1,
J
k
(x
k
) =
_

_
min{ x
k
..
Purchase
, E[J
k+1
(w
k
)]
. .
Wait
} if x
k
= T,
0 if x
k
= T.
Optimal policy: Purchase if and only if
x
k
<
k

= E
w
[J
k+1
(w)],
127
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
where

= E
w
[J
k+1
(w)] = E
w
[min{w,
k+1
] =
_

k+1
0
wdF(w) +
_

k+1

k+1
dF(w).
With terminal condition:

N1
=
_

0
wdF(w) = E[w].
Analogously to the asset selling problem, it must hold that

1

2

N1
= E[w].
Intuitively, we are less stringent and willing to accept a higher price as time goes by.
The case of correlated prices
Suppose that prices evolve according to the system equation
x
k+1
= x
k
+
k
, where 0 < < 1,
and where
1
,
2
, . . . ,
N1
are i.i.d. with E[] =

> 0.
DP Algorithm:
J
N
(x
N
) = x
N
J
k
(x
k
) = min{x
k
, E[J
k+1
(x
k
+
k
)]}, k = 0, . . . , N 1.
In particular, for k = N 1, we have
J
N1
(x
N1
) = min{x
N1
, E[J
N
(x
N1
+
N1
)]}
= min{x
N1
, x
N1
+

}
Optimal policy at time N 1: Purchase only when x
N1
<
N1
, where
N1
comes from
x
N1
< x
N1
+

x
N1
<
N1

1
.
In addition, we can see that
J
N1
(x) = min{x, x +

} x = J
N
(x).
Using the stationarity of the system and the monotonicity property of DP, we have that for any x,
J
k
(x) J
k+1
(x), k = 0, . . . , N 1.
Moreover, J
N1
(x) is concave and increasing in x (see Figure 3.3.3). By a backward induction
argument, we can prove for k = 0, 1, . . . , N 2 that J
k
(x) is concave and increasing in x (see
Figure 3.3.4). These facts imply that the optimal policy for every period k is of the form: Purchase
if and only if x
k
<
k
, where the scalar
k
is the unique positive solution of the equation
x = E[J
k+1
(x +
k
)].
Notice that the relation J
k
(x) J
k+1
(x) for all x and k implies that

k

k+1
, k = 0, . . . , N 2,
and again (as one would expect) the threshold price to purchase increases as the deadline gets
closer. In other words, one is more willing to accept a higher price as one approaches the end of the
horizon. This is illustrated in Figure 3.3.5.
128
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
x
N-1
x
N-1
+
J
N-1
(x
N-1
)

N-1
J
N-1
(x
N-1
)

x
N-1
Figure 3.3.3: Structure of the value function JN1(x) when prices are correlated.
x
k
E[J
k+1
( x
k
+
k
)]
J
k
(x
k
)

k
J
k
(x
k
)
E[J
k+1
(
k
)]
x
k
Figure 3.3.4: Structure of the value function J
k
(x
k
) when prices are correlated.
3.3.2 General stopping problems and the one-step look ahead policy
Consider a stationary problem
At time k, we may stop at cost t(x
k
) or choose a control
k
(x
k
) U(x
k
) and continue.
The DP algorithm is given by:
J
N
(x
N
) = t(x
N
),
and for k = 0, 1, . . . , N 1,
J
k
(x
k
) = min
_
t(x
k
), min
u
k
U(x
k
)
E[g(x
k
, u
k
, w) +J
k+1
(f(x
k
, u
k
, w))]
_
,
and it is optimal to stop at time k for states x in the set
T
k
=
_
x : t(x) min
uU(x)
E[g(x, u, w) +J
k+1
(f(x, u, w))]
_
. (3.3.5)
129
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
x

k
J
k
(x)
x
J
k+1
(x)

k+1
Figure 3.3.5: Structure of the value functions J
k
(x) and J
k+1
(x) when prices are correlated.
Note that J
N1
(x) J
N
(x), x. This holds because
J
N1
(x) = min
_
t(x), min
u
N1
U(x)
E[g(x, u
N1
, w) +J
N
(f(x, u
N1
, w))]
_
t(x) = J
N
(x).
Using the monotonicity of the DP, we have J
k
(x) J
k+1
(x), k = 0, 1, . . . , N 1. Since
T
k+1
=
_
x : t(x) min
uU(x)
E[g(x, u, w) +J
k+2
(f(x, u, w))]
_
, (3.3.6)
and the RHS in (3.3.5) is less or equal than the RHS in (3.3.6), we have
T
0
T
1
T
k
T
k+1
T
N1
. (3.3.7)
Question: When are all stopping sets T
k
equal?
Answer: Suppose that the set T
N1
is absorbing in the sense that if a state belongs to T
N1
and termination is not selected, the next state will also be in T
N1
; that is,
f(x, u, w) T
N1
, x T
N1
, u U(x), and w. (3.3.8)
By denition of T
N1
we have
J
N1
(x) = t(x), for all x T
N1
.
We obtain for x T
N1
,
min
uU(x)
E[g(x, u, w) +J
N1
(f(x, u, w))] = min
uU(x)
E[g(x, u, w) +t(f(x, u, w))]
t(x) (because of (3.3.5) applied to k = N 1).
Since
J
N2
(x) = min
_
t(x), min
uU(x)
E[g(x, u, w) +J
N
1
(f(x, u, w))]
_
,
then x T
N2
, or equivalently T
N1
T
N2
. This, together with (3.3.7), implies T
N1
=
T
N2
. Proceeding similarly, we obtain T
k
= T
N1
, k.
130
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
Conclusion: If condition (3.3.8) holds (i.e., the one-step stopping set T
N1
is absorbing),
then the stopping sets T
k
are all equal to the set of states for which it is better to stop rather
than continue for one more stage and then stop. A policy of this type is known as a one-step-
look-ahead policy. Such a policy turns out to be optimal in several types of applications.
Example 3.3.1 (Asset selling with past oers retained)
Take the previous asset selling problem in Section 3.3.1, and suppose now that rejected oers can be
accepted at a later time. Then, if the asset is not sold at time k, the state evolves according to:
x
k+1
= max{x
k
, w
k
},
instead of just x
k+1
= w
k
. Note that this system equation retains the best oered got so far from
period 0 to k.
The DP algorithm becomes:
V
N
(x
N
) = x
N
,
and for k = 0, 1, . . . , N 1,
V
k
(x
k
) = max
_
x
k
, (1 +r)
1
E
w
k
[V
k+1
(max{x
k
, w
k
}]
_
.
The one-step stopping set is:
T
N1
=
_
x : x (1 +r)
1
E
w
[max{x, w}]
_
.
Dene as the x that satises the equation
x =
E
w
[max{x, w}]
1 +r
;
so that T
N1
= {x : x }. Thus,
=
1
1 +r
E
w
[max{ , w}]
=
1
1 +r
__

0
dF(w) +
_


wdF(w)
_
=
1
1 +r
_
F( ) +
_


wdF(w)
_
,
or equivalently,
(1 +r) = F( ) +
_


wdF(w).
Since past oers can be accepted at a later date, the eective oer available cannot decrease with time,
and it follows that the one-step stopping set
T
N1
= {x : x }
is absorbing in the sense of (3.3.8). In symbols, for x T
N1
, f(x, u, w) = max{x, w} x , and
so f(x, u, w) T
N1
. Therefore, the one-step-look-ahead stopping rule that accepts the rst oer that
equals or exceeds is optimal.
131
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
3.3.3 Scheduling problem
Consider a given a set of tasks to perform, with the ordering subject to optimal choice.
Costs depend on the order.
There might be uncertainty, and precedence and resource availability constraints.
Some problems can be solved eciently by an interchange argument.
Example: Quiz problem
Given a list of N questions, if question i is answered correctly (which occurs with probability
p
i
), we receive reward R
i
; if not the quiz terminates.
Let i and j be the kth and (k + 1)st questions in an optimally ordered list
L

= (i
0
, i
1
, . . . , i
k1
, i, j, i
k+2
, . . . , i
N1
)
We have
E[Reward(L)] = E[Reward(i
0
, ..., i
k1
)]+p
i
0
p
i
1
p
i
k1
(p
i
R
i
+p
i
p
j
R
j
)+E[Reward(i
k+2
, ..., i
N1
)].
Consider the list L, now with i and j interchanged, and let:
L


= (i
0
, . . . , i
k1
, j, i, i
k+2
, . . . , i
N1
).
Since L is optimal,
E[Reward(L)] E[Reward(L

)],
and then
p
i
R
i
+p
i
p
j
R
j
p
j
R
j
+p
i
p
j
R
i
,
or
p
i
R
i
1 p
i

p
j
R
j
1 p
j
.
Therefore, to maximize the total expected reward, questions should be ordered in decreasing order
of p
i
R
i
/(1 p
i
).
3.3.4 Exercises
Exercise 3.3.1 Consider the optimal stopping, asset selling problem discussed in class. Suppose
that the oers w
k
are i.i.d. random variables, Unif[500, 2000]. For N = 10, compute the thresholds

k
, k = 0, 1, ..., 9, for r = 0.05 and r = 0.1. Recall that
N
= 0. Also compute the expected value
J
0
(0) for both interest rates.
Exercise 3.3.2 Consider again the optimal stopping, asset selling problem discussed in class.
132
CHAPTER 3. APPLICATIONS Prof. R. Caldentey
(a) For the stationary, limiting policy dened by , where is the solution to the equation
(1 +r) = F() +
_

wdF(w)
Prove that G(), dened as
G() =
1
r + 1
_
F() +
_

wdF(w)
_
,
is a contraction mapping, and hence the iterative procedure
n+1
= G(
n
) nds the unique
xed point in [0, E[w]/r], starting from any arbitrary
0
[0, E[w]/r].
Recall: G is a contraction mapping if for all x and y, ||G(x) G(y)|| < ||x y||, for a constant
0 < 1, independent of x, y.
(b) Apply the iterative procedure to compute over the scenarios described in Exercise 1.
(c) Compute the expected value

J
0
(0) for Problem 1 when the controller applies control in every
stage. Compare the results and comment on them.
Exercise 3.3.3 (The job/secretary/partner selection problem) A collection of N 2 ob-
jects is observed randomly and sequentially one at a time. The observer may either select the
current object observed, in which case the selection process is terminated, or reject the object and
proceed to observe the next. The observer can rank each object relative to those already observed,
and the objective is to maximize the probability of selecting the best object according to some
criterion. It is assumed that no two objects can be judged to be equal. Let r

be the smallest
positive integer r such that
1
N 1
+
1
N 2
+ +
1
r
1
Show that an optimal policy requires that the rst r

objects be observed. If the r

th object has
rank 1 relative to the others already observed, it should be selected; otherwise, the observation
process should be continued until an object of rank 1 relative to those already observed is found.
Hint: Assume uniform distribution of the objects, i.e., if the rth object has rank 1 relative to the
previous (r 1) objects, then the probability that it is the best is r/N. Dene the state of the
system as
x
k
=
_

_
T if the selection has already terminated,
1 if the kth object observed has rank 1 among the rst k objects,
0 if the kth object observed has rank > 1 among the rst k objects.
For k r

, let J
k
(0) be the maximal probability of nding the best object assuming k objects have
been observed and the kth object is not best relative to the previous (k 1) objects. Show that
J
k
(0) =
k
N
_
1
N 1
+ +
1
k
_
.
Analogously, let J
k
(1) be the maximal probability of nding the best object assuming k objects
have been observed and the kth object is indeed the best relative to the previous (k 1) objects.
Show that
J
k
(1) =
k
N
.
Then, analyze the case k < r

.
133
Prof. R. Caldentey CHAPTER 3. APPLICATIONS
Exercise 3.3.4 A driver is looking for parking on the way to his destination. Each parking place
is free with probability p independently of whether other parking places are free or not. The driver
cannot observe whether a parking place is free until he reaches it. If he parks k places from his
destination, he incurs a cost k. If he reaches the destination without having parked, the cost is C.
(a) Let F
k
be the minimal expected cost if he is k parking places from his destination, where
F
0
= C. Show that
F
k
= p min{k, F
k1
} +qF
k1
, k = 1, 2, . . . ,
where q = 1 p.
(b) Show that an optimal policy is of the form: Never park if k k

, but take the rst free place


if k < k

, where k is the number of parking places from the destination, and


k

= min
_
i : i integer, q
i1
< (pC +q)
1
_
Exercise 3.3.5 (Hardys Theorem) Let {a
1
, . . . , a
n
} and {b
1
, . . . , b
n
} be monotonically nonde-
creasing sequences of numbers. Let us associate with each i = 1, . . . , n, a distinct index j
i
, and
consider the expression

n
i=1
a
i
b
j
i
. Use an interchange argument to show that this expression is
maximized when j
i
= i for all i, and is minimized when j
i
= n i + 1 for all i.
134
Chapter 4
DP with Imperfect State Information.
So far we have studied the problem that the controller has access to the exact value of the current
state, but this assumption is sometimes unrealistic. In this chapter, we will study the problems
with imperfect state information. In this setting, we suppose that the controller receives some noisy
observations about the value of the current state instead of the actual underlying states.
4.1 Reduction to the perfect information case
Basic problem with imperfect state information
Suppose that the controller has access to observations z
k
of the form
z
0
= h
0
(x
0
, v
0
), z
k
= h
k
(x
k
, u
k1
, v
k
), k = 1, 2, ..., N 1,
where
z
k
Z
k
(observation space)
v
k
V
k
(random observation disturbances)
The random observation disturbance v
k
is characterized by a probability distribution
P
v
k
(.|x
k
, ..., x
0
, u
k1
, ..., u
0
, w
k1
, ..., w
0
, v
k1
, ..., v
0
)
Initial state x
0
Control
k
U
k
C
k
Dene I
k
the information available to the controller at time k and call it the information
vector
I
k
= (z
0
, z
1
, ..., z
k
, u
0
, u
1
, ...u
k1
)
Consider a class of policies consisting of a sequence of functions = {
0
,
1
, ...,
N1
} where

k
(I
k
) U
k
for all I
k
, k = 0, 1, ..., N 1
135
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
Objective: Find an admissible policy = {
0
,
1
, ...,
N1
} that minimizes the cost function
J

= E
x
0
,w
k
,v
k
k=0,1,...,N1
_
g
N
(x
N
) +
N1

k=0
g
k
(x
k
,
k
(I
k
), w
k
)
_
subject to the system equation
x
k+1
= f
k
(x
k
,
k
(I
k
), w
k
), k = 0, 1, ..., N 1,
and the measurement equation
z
0
= h
0
(x
0
, v
0
)
z
k
= h
k
(x
k
,
k1
(I
k1
), v
k
), k = 1, 2, ..., N 1
Note the dierence from the perfect state information case. In perfect state information case, we
tried to nd a rule that would specify the control u
k
to be applied for each state x
k
at time k.
However, now we are looking for a rule that gives the control to be applied for every possible
information vector I
k
, for every sequence of observations received and controls applied up to time k.
Example: Multiaccess Communication
Consider a group of transmitting stations sharing a common channel
Stations are synchronized to transmit packets of data at integer times
Each packet requires one slot (time unit) for transmission
Let a
k
= Number of packet arrivals during slot k (with a given probability distribution)
x
k
= Number of packet waiting to be transmitted at the beginning of slot k (backlog)
Packet transmissions are scheduled using a strategy called slotted Aloha protocol :
Each packet in the system at the beginning of slot k is transmitted during that slot with
probability u
k
(common for all packets)
If two or more packets are transmitted simultaneously, they collide and have to rejoin
the backlog for retransmission at a later slot
Stations can observe the channel and determine whether in any one slot:
1. there was a collision
136
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
2. a success in the slot
3. nothing happened (i.e., idle slot)
Control: transmission probability u
k
Objective: keep backlog small, so we assume a cost per stage g
k
(x
k
), with g
k
(.) a monotonically
increasing function of x
k
State of system: size of the backlog x
k
(unobservable)
System equation:
x
k+1
= x
k
+a
k
t
k
where a
k
is the number of new arrivals, and t
k
is the number of packets successfully transmitted
during slot k. The distribution of t
k
is given by
t
k
=
_
1 (success) w.p. x
k
u
k
(1 u
k
)
x
k
1
(i.e., P(one Tx, x
k
1 do not Tx)),
0 (failure) w.p. 1 x
k
u
k
(1 u
k
)
x
k
1
Measurement equation:
z
k+1
= v
k+1
=
_

_
idle w.p. (1 u
k
)
x
k
success w.p. x
k
u
k
(1 u
k
)
x
k
1
collision w.p. 1 (1 u
k
)
x
k
x
k
u
k
(1 u
k
)
x
k
1
where z
k+1
is the observation obtained at the end of the kth slot
Reformulated as a perfect information problem
Candidate for state is the information vector I
k
I
k+1
= (I
k
, z
k+1
, u
k
), k = 0, 1, ..., N 2, I
0
= z
0
The state of the system is I
k
, the control is u
k
and z
k+1
can be viewed as a random disturbance.
Furthermore, we have
P(z
k+1
|I
k
, u
k
) = P(z
k+1
|I
k
, u
k
, z
0
, z
1
, ..., z
k
)
Note that the prior disturbances z
0
, z
1
, ..., z
k
are already included in the information vector I
k
. So,
in the LHS we now have the system in the framework of basic DP where the probability distribution
of z
k+1
depends explicitly only on the state I
k
and control u
k
of the new system and not on the
prior disturbances (although implicitly it does through I
k
).
The cost per stage:
g
k
(I
k
, u
k
) = E
x
k
,w
k
[g
k
(x
k
, u
k
, w
k
)|I
k
, u
k
]
Note that the new formulation is focused on past info and controls rather than on original
system disturbances w
k
.
DP algorithm:
J
N1
(I
N1
) = min
u
N1
U
N1
{E
x
N1
,w
N1
[g
N
(f
N1
(x
N1
, u
N1
, w
N1
))
+g
N1
(x
N1
, u
N1
, w
N1
)|I
N1
, u
N1
]}
137
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
and for k = 0, 1, ..., N 2,
J
k
(I
k
) = min
u
k
U
k
{E
x
k
,w
k
,z
k+1
[g
k
(x
k
, u
k
, w
k
) +J
k+1
(I
k
, z
k+1
, u
k
. .
I
k+1
)|I
k
, u
k
]}
We minimize the RHS for every possible value of the vector I
k+1
to obtain

k+1
(I
k+1
). The optimal
cost is given by J

0
= E
z
0
[J
0
(z
0
)].
Example 4.1.1 (Machine repair)
A machine can be in one of two unobservable states: P (good state) and

P (bad state)
State space: {P,

P}
Number of periods: N = 2
At the end of each period, the machine is inspected with two possible inspection outcomes: G
(probably good state), B (probably bad state)
Control space: actions after each inspection, which could be either
C : continue operation of the machine; or
S : stop, diagnose its state and if it is in bad state

P, repair.
Cost per stage: g(P, C) = 0; g(P, S) = 1; g(

P, C) = 2; g(

P, S) = 1
Total cost: g(x
0
, u
0
) +g(x
1
, u
1
) (assume zero terminal cost)
Let x
0
, x
1
be the state of the machine at the end of each period
Distribution of initial state: P(x
0
= P) =
2
3
, P(x
0
=

P) =
1
3
Assume that we start with a machine in good state, i.e., x
1
= P
System equation:
x
k+1
= w
k
, k = 0, 1
where the transition probabilities are given by
138
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
Note that we do not have perfect state information, since the inspection does not reveal the state
of the machine with certainty. Rather, the result of each inspection may be viewed as a noisy
measure of the system state.
Result of inspections: z
k
= v
k
, k = 0, 1; v
k
{B, G}
Information vector:
I
0
= z
0
, I
1
= (z
0
, z
1
, u
0
)
and we seek functions
0
(I
0
),
1
(I
1
) that minimize
E
x
0
,w
0
,w
1
v
0
,v
1
_

_g(x
0
,
0
( z
0
..
I
0
)) +g(x
1
,
1
(z
0
, z
1
,
0
(z
0
)
. .
I
1
))
_

_
DP algorithm. Terminal condition: J
2
(I
2
) = 0 for all I
2
For k = 0, 1 :
J
k
(I
k
) = min {
cost if u
k
=C
..
P(x
k
= P|I
k
, C) g(P, C)
. .
0
+P(x
k
=

P|I
k
, C) g(

P, C)
. .
2
+E
z
k+1
[J
k+1
(I
k
, z
k+1
, C)|I
k
, C],
cost if u
k
=S
..
P(x
k
= P|I
k
, S) g(P, S)
. .
1
+P(x
k
=

P|I
k
, S) g(

P, S)
. .
1
+E
z
k+1
[J
k+1
(I
k
, z
k+1
, S)|I
k
, S]}
Last stage (k = 1): compute J
1
(I
1
) for each possible I
1
= (z
0
, z
1
, u
0
). Rcalling that J
2
(I) = 0, I, we
have
cost of C = 2P(x
1
=

P|I
1
), cost of S = 1,
and therefore J
1
(I
1
) = min{2P(x
1
=

P|I
1
), 1}. Compute probability P(x
1
=

P|I
1
) for all possible
realizations of I
1
= (z
0
, z
1
, u
0
) by using the conditional probability formula:
P(X|A, B) =
P(X, A|B)
P(A|B)
.
There are 8 cases to consider. We describe here 3 of them.
(1) For I
1
= (G, G, S)
P(x
1
=

P|G, G, S) =
P(x
1
=

P, G, G|S)
P(G, G|S)
=
1
7
139
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
Numerator:
P(x
1
=

P, G, G|S) =
_
2
3

3
4

1
3

1
4
_
+
_
1
3

1
4

1
3

1
4
_
=
7
144
Denominator:
P(G, G|S) =
_
2
3

3
4

2
3

3
4
_
+
_
2
3

3
4

1
3

1
4
_
+
_
1
3

1
4

2
3

3
4
_
+
_
1
3

1
4

1
3

1
4
_
=
49
144
Hence,
J
1
(G, G, S) = 2P(x
1
=

P|G, G, S) =
2
7
< 1,

1
(G, G, S) = C
(2) For I
1
= (B, G, S)
P(x
1
=

P|B, G, S) =
1
7
Numerator:
P(x
1
=

P, B, G|S) =
1
4

1
3

_
1
4

2
3
+
3
4

1
3
_
=
5
144
140
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
Denominator:
P(B, G|S) =
2
3

1
4

_
2
3

3
4
+
1
3

1
4
_
+
1
3

3
4

_
2
3

3
4
+
1
3

1
4
_
Hence,
J
1
(B, G, S) =
2
7
,

1
(B, G, S) = C
(3) For I
1
= (G, B, S)
P(x
1
=

P|G, B, S) =
P(x
1
=

P, G, B|S)
P(G, B|S)
=
3
5
Numerator:
P(x
1
=

P, G, B|S) =
3
4

1
3

_
3
4

2
3
+
1
4

1
3
_
=
7
48
141
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
Denominator:
P(G, B|S) =
1
4

2
3

_
3
4

2
3
+
1
4

1
3
_
+
3
4

1
3

_
3
4

2
3
+
1
4

1
3
_
=
35
144
Hence,
J
1
(G, B, S) = 1,

1
(G, B, S) = S
Summary: For all other 5 cases of I
1
, we compute J
1
(I
1
) and

1
(I
1
). The optimal policy is to continue
(u
1
= C) if the result of last inspection was G and to stop (u
1
= S) if the result of the last inspection
was B.
First stage (k = 0): Compute J
0
(I
0
) for each of the two possible information vectors I
0
= (G), I
0
= (B).
We have
cost of C = 2P(x
0
=

P|I
0
, C) + E
z
1
{J
1
(I
0
, z
1
, C)|I
0
, C}
= 2P(x
0
=

P|I
0
, C) +P(z
1
= G|I
0
, C)J
1
(I
0
, G, C) +P(z
1
= B|I
0
, C)J
1
(I
0
, B, C)
cost of S = 1 + E
z
1
{J
1
(I
0
, z
1
, S)|I
0
, S}
= 1 +P(z
1
= G|I
0
, S)J
1
(I
0
, G, S) +P(z
1
= B|I
0
, S)J
1
(I
0
, B, S),
using the values of J
1
from previous stage. Thus, we have
J
0
(I
0
) = min{cost of C, cost of S}
The optimal cost is
J

= P(G)J
0
(G) +P(B)J
0
(B)
For illustration, we compute one of the values. For example, for I
0
= G
P(z
1
= G|G, C) =
P(z
1
= G,
z
0
..
G |
u
0
..
C )
P( G
..
z
0
|C)
=
P(z
1
= G, G|C)
P(G)
=
15
48
7
12
=
15
28
Note that the P(G|C) = P(G) follows since z
0
= G is independent of the control u
0
= C or u
0
= S
142
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
Numerator:
P(z
1
= G, G|C) =
2
3

3
4

_
2
3

3
4
+
1
3

1
4
_
+
1
3

1
4
1
1
4
=
15
48
Denominator:
P(G) =
2
3

3
4
+
1
3

1
4
=
7
12
Similarly, we can compute P(B) =
2
3

1
4
+
1
3

3
4
=
5
12
143
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
Note: The optimal policy for both stages is to continue (C) if the result of latest inspection is G and
to stop and repair (S) otherwise. The optimal cost can be proved to be J

=
176
144
Problem: The DP can be computationally prohibitive if the number of information vectors I
k
is large or
innite.
4.2 Linear-Quadratic Systems and Sucient Statistics
In this section, we consider again the problem studied in Section 2.5, but now under the assumption
that the controller does not observe the real state of the system x
k
, but just a noisy representation
of it, z
k
. Then, we investigate how we can reduce the quantity of information needed to solve
problems under imperfect state information.
4.2.1 Linear-Quadratic systems
Problem setup
System equation: x
k+1
= A
k
x
k
+B
k
u
k
+w
k
[Linear in both state and control.]
Quadratic cost:
E
x
0
,w
0
,...,w
N1
_
x

N
Q
N
x
N
+
N1

k=0
(x

k
Q
k
x
k
+u

k
R
k
u
k
)
_
,
where:
Q
k
are square, symmetric, positive semidenite matrices with appropriate dimension,
R
k
are square, symmetric, positive denite matrices with appropriate dimension,
Disturbances w
k
are independent with E[w
k
] = 0, nite variance, and independent of x
k
and u
k
.
Controls u
k
are unconstrained, i.e., u
k
R
n
.
Observations: Driven by a linear measurement equation:
z
k
..
R
s
= C
k
..
R
sn
x
k
+ v
k
..
R
s
, k = 0, 1, . . . , N 1,
where v
k
s are mutually independent, and also independent from w
k
and x
0
.
Key fact to show: Given an information vector I
k
= (z
0
, . . . , z
k
, u
0
, . . . , u
k1
), the optimal policy
{

0
, . . . ,

N1
} is of the form

k
(I
k
) = L
k
E[x
k
|I
k
],
where
L
k
is the same as for the perfect state info case, and solves the control problem.
E[x
k
|I
k
] solves the estimation problem.
This means that the control and estimation problems can be solved separately.
144
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
DP algorithm
The DP algorithm becomes:
At stage N 1,
J
N1
(I
N1
) = min
u
N1
E
x
N1
,w
N1
_
x

N1
Q
N1
x
N1
+u

N1
R
N1
u
N1
+ (A
N1
x
N1
+B
N1
u
N1
+w
N1
)

Q
N
(A
N1
x
N1
+B
N1
u
N1
+w
N1
)|I
N1
, u
N1
]
(4.2.1)
Recall that w
N1
is independent of x
N1
, and that both are random at stage N 1;
thats why we take expected value over both of them.
Since the w
k
are mutually independent and do not depend on x
k
and u
k
either, we have
E[w
N1
|I
N1
, u
N1
] = E[w
N1
|I
N1
] = E[w
N1
] = 0,
then the minimization just involves
min
u
N1
_

_
u

N1
(B

N1
Q
N
B
N1
. .
0
+R
N1
. .
>0
)u
N1
+ 2E[x
N1
|I
N1
]

N1
Q
N
B
N1
u
N1
_

_
Taking derivative of the argument with respect to u
N1
, we have the rst order condition:
2(B

N1
Q
N
B
N1
+R
N1
)u
N1
+ 2E[x
N1
|I
N1
]

N1
Q
N
B
N1
= 0.
This yields the optimal u

N1
:
u

N1
=

N1
(I
N1
) = L
N1
E[x
N1
|I
N1
],
where
L
N1
= (B

N1
Q
N
B
N1
+R
N1
)
1
B

N1
Q
N
A
N1
.
Note that this is very similar to the perfect state info counterpart, except that now x
N1
is replaced by E[x
N1
|I
N1
].
Substituting back in (4.2.1), we get:
J
N1
(I
N1
) = E
x
N1
_
x

N1
K
N1
x
N1
|I
N1

(quadratic in x
N1
)
+E
x
N1
_
(x
N1
E[x
N1
|I
N1
])

P
N1
(x
N1
E[x
N1
|I
N1
])|I
N1

(quadratic in estimation error x


N1
E[x
N1
|I
N1
])
+E
w
N1
_
w

N1
Q
N
w
N1

(constant term),
where the matrices K
N1
and P
N1
are given by
P
N1
= A

N1
Q
N
B
N1
(R
N1
+B

N1
Q
N
B
N1
)
1
B

N1
Q
N
A
N1
,
and
K
N1
= A

N1
Q
N
A
N1
P
N1
+Q
N1
.
145
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
Note the structure of J
N1
: In addition to the quadratic and constant terms (which are
identical to the perfect state info case for a given state x
N1
), it involves a quadratic
term in the estimation error
x
N1
E[x
N1
|I
N1
].
In words, the estimation error is penalized quadratically in the value function.
At stage N 2,
J
N2
(I
N2
) = min
u
N2
E
x
N2
,w
N2
,z
N1
_
x

N2
Q
N2
x
N2
+u

N2
R
N2
u
N2
+J
N1
(I
N1
)|I
N2
, u
N2

= E[x

N2
Q
N2
x
N2
|I
N2
] + min
u
N2
_
u

N2
R
N2
u
N2
+ E[x

N1
K
N1
x
N1
|I
N2
, u
N2
]
_
+E[(x
N1
E[x
N1
|I
N1
])

P
N1
(x
N1
E[x
N1
|I
N1
])|I
N2
, u
N2
] (4.2.2)
+E
w
N1
_
w

N1
Q
N
w
N1

Key point (to be proved): The term (4.2.2) turns out to be independent of u
N2
, and so
we can exclude it from the minimization with respect to u
N2
.
This says that the quality of estimation as expressed by the statistics of the error x
k
E[x
k
|I
k
]
cannot be inuenced by the choice of control, which is not very intuitive!
For the next result, we need the linearity of both system and measurement equations.
Lemma 4.2.1 (Quality of Estimation) For every stage k, there is a function M
k
() such that
M
k
(x
0
, w
0
, . . . , w
k1
, v
0
, . . . , v
k
) = x
k
E[x
k
|I
k
],
independently of the policy being used.
Proof: Fix a policy, and consider the following two systems:
1. There is a control u
k
being implemented, and the system evolves according to
x
k+1
= A
k
x
k
+B
k
u
k
+w
k
, z
k
= C
k
x
k
+v
k
.
2. There is no control being applied, and the system evolves according to
x
k+1
= A
k
x
k
+ w
k
, z
k
= C
k
x
k
+ v
k
. (4.2.3)
Consider the evolution of the two systems from identical initial conditions: x
0
= x
0
; and when
system disturbances and observation noise vectors are also identical:
w
k
= w
k
, v
k
= v
k
, k = 0, 1, . . . , N 1.
Consider the vectors:
Z
k
= (z
0
, . . . , z
k
)

,

Z
k
= ( z
0
, . . . , z
k
)

, W
k
= (w
0
, . . . , w
k
)

,
V
k
= (v
0
, . . . , v
k
)

, and U
k
= (u
0
, . . . , u
k
)

.
146
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
Applying the system equations above for stages 0, 1, . . . , k, their linearity implies the existence of
matrices F
k
, G
k
and H
k
such that:
x
k
= F
k
x
0
+G
k
U
k1
+H
k
W
k1
, (4.2.4)
x
k
= F
k
x
0
+H
k
W
k1
. (4.2.5)
Note that the vector U
k1
= (u
0
, . . . , u
k1
)

is part of the information vector I


k
, as veried below:
I
k
= (z
0
, . . . , z
k
, u
0
, . . . , u
k1
. .
U
k1
), k = 1, . . . , N 1,
I
0
= z
0
.
Then, U
k1
= E[U
k1
|I
k
], and conditioning with respect to I
k
in (5.3.2) and (4.2.5):
E[x
k
|I
k
] = F
k
E[x
0
|I
k
] +G
k
U
k1
+H
k
E[W
k1
|I
k
] (4.2.6)
E[ x
k
|I
k
] = F
k
E[x
0
|I
k
] +H
k
E[W
k1
|I
k
]. (4.2.7)
Then,
x
k
..
from (5.3.2)
E[x
k
|I
k
]
. .
from (4.2.6)
= x
k
..
from (4.2.5)
E[ x
k
|I
k
]
. .
from (4.2.7)
,
where the term G
k
U
k1
gets canceled. The intuition for this is that the linearity of the system
equation aects equally the true state x
k
and our estimation of it, E[x
k
|I
k
].
Applying now the measurement equations above for 0, 1, . . . , k, their linearity implies the existence
of a matrix R
k
such that:
Z
k


Z
k
= R
k
U
k1
Note that Z
k
involves the term B
k1
u
k1
from the system equation for x
k
, and recursively we
can build such a matrix R
k
. In addition, from (4.2.3) above and the sample path identity for the
disturbances,

Z
k
depends on the original w
k
, v
k
and x
0
:
Z
k


Z
k
= R
k
U
k1


Z
k
= Z
k
R
k
U
k1
= S
k
W
k1
+T
k
V
k
+D
k
x
0
,
where S
k
, T
k
, and D
k
are matrices of appropriate dimension. Thus, the information provided by
I
k
= (Z
k
, U
k1
) regarding x
k
is summarized in

Z
k
, and we have
E[ x
k
|I
k
] = E[ x
k
|

Z
k
],
so that
x
k
E[x
k
|I
k
] = x
k
E[ x
k
|I
k
]
= x
k
E[ x
k
|

Z
k
].
Therefore, the function M
k
to use is
M
k
(x
0
, w
0
, . . . , w
k1
, v
0
, . . . , v
k
) = x
k
E[ x
k
|

Z
k
],
which does not depend on the controls u
0
, . . . , u
k1
.
Going back to the DP equation J
N2
(I
N2
), and using the Quality of Estimation Lemma, we get

N1

= M
N1
(x
0
, w
0
, . . . , w
N2
, v
0
, . . . , v
N1
) = x
N1
E[x
N1
|I
N1
].
147
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
Since
N1
is independent of u
N2
, we have
E[

N1
P
N1

N1
|I
N2
, u
N2
] = E[

N1
P
N1

N1
|I
N2
].
So, going back to the DP equation for J
N2
(I
N2
), we can drop the term (4.2.2) to minimize
over u
N2
, and similarly to stage N 1, the minimization yields
u

N2
=

N2
(I
N2
) = L
N2
E[x
N2
|I
N2
].
Continuing similarly (using also the Quality of Estimation Lemma), we have

k
(I
k
) = L
k
E[x
k
|I
k
],
where L
k
is the same as for perfect state info:
L
k
= (R
k
+B

k
K
k+1
B
k
)
1
B

k
K
k+1
A
k
,
with K
k
generated from K
N
= Q
N
, using
K
k
= A

k
K
k+1
A
k
P
k
+Q
k
,
P
k
= A

k
K
k+1
B
k
(R
k
+B

k
K
k+1
B
k
)
1
B

k
K
k+1
A
k
.
The optimal controller is represented in Figure 4.2.1.
x
k+1
= A
k
x
k
+ B
k
u
k
+ w
k
z
k
= C
k
x
k
+ v
k
Estimator:
update
I
k
:=I
k-1
(u
k-1
,z
k
),
and compute
E[x
k
|I
k
]
Actuator:
compute
u
k
=L
k
E[x
k
|I
k
]
E[x
k
|I
k
] u
k
u
k
Delay
(from previous period)
v
k
z
k
z
k
u
k-1
w
k
Start reading
from here
x
k
Figure 4.2.1: Structure of the optimal controller for the L-Q problem.
Separation interpretation:
1. The optimal controller can be decomposed into:
An estimator, which uses the data to generate the conditional expectation E[x
k
|I
k
].
An actuator, which multiplies E[x
k
|I
k
] by the gain matrix L
k
and applies the control u
k
=
L
k
E[x
k
|I
k
].
148
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
2. Observation: Consider the problem of nding the estimate x of a random vector x given some
information (random vector) I, which minimizes the mean squared error
E
x
[||x x||
2
|I] = E[||x||
2
] 2E[x|I] x +|| x||
2
.
When we take derivative with respect to x and set it equal to zero:
2 x 2E[x|I] = 0 x = E[x|I],
which is exactly our estimator.
3. The estimator portion of the optimal controller is optimal for the problem of estimating the
state x
k
assuming the control is not subject to choice.
4. The actuator portion is optimal for the control problem assuming perfect state information.
4.2.2 Implementation aspects Steady-state controller
In the imperfect info case, we need to compute an estimator x
k
= E[x
k
|I
k
], which is indeed
the one that minimizes the mean squared error E
x
[||x x||
2
|I].
However, this is computationally hard in general.
Fortunately, if the disturbances w
k
and v
k
, and the initial state x
0
are Gaussian random
vectors, a convenient implementation of the estimator is possible by means of the Kalman
lter algorithm.
This algorithm produces x
k+1
at time k + 1 just depending on z
k+1
, u
k
and x
k
.
Kalman lter recursion: For all k = 0, 1, . . . , N 1, compute
x
k+1
= A
k
x
k
+B
k
u
k
+
k+1|k+1
C

k+1
N
1
k+1
(z
k+1
C
k+1
(A
k
x
k
+B
k
u
k
)),
and
x
0
= E[x
0
] +
0|0
C

0
N
1
0
(z
0
C
0
E[x
0
]),
where the matrices
k|k
are precomputable and are given recursively by

k+1|k+1
=
k+1|k

k+1|k
C

k+1
(C
k+1

k+1|k
C

k+1
+N
k+1
)
1
C
k+1

k+1|k
,

k+1|k
= A
k

k|k
A

k
+M
k
, k = 0, 1, . . . , N 1,
with

0|0
= S SC

0
(C
0
SC

0
+N
0
)
1
C
0
S.
In these equations, M
k
, N
k
, and S are the covariance matrices
1
of w
k
, v
k
and x
0
, respectively, and
we assume that w
k
and v
k
have zero mean; that is
E[w
k
] = E[v
k
] = 0,
M
k
= E[w
k
w

k
], N
k
= E[v
k
v

k
],
S = E[(x
0
E[x
0
])(x
0
E[x
0
])

].
Moreover, we are assuming that matrices N
k
are positive denite (and hence, invertible).
1
Recall that for a random vector X, its covariance matrix is given by E[(XE[X])(XE[X])

]. Its entry (i, j) is


given by ij = Cov(Xi, Xj) = E[(Xi E[Xi])(Xj E[Xj])]. The covariance matrix is always positive semi-denite.
149
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
Stationary case
Assume that the system and measurement equations are stationary (i.e., same distribution
across time; N
k
= N, M
k
= M).
Suppose that (A, B) is controllable and that matrix Q can be written as Q = F

F, where F
is a matrix such that (A, F) is observable.
2
By the theory of LQ-systems under perfect info, when N (i.e., the horizon length
becomes large), the optimal controller tends to the steady-state policy

k
(I
k
) = L x
k
,
where
L = (R +B

KB)
1
B

KA,
and where K is the unique 0 symmetric solution of the algebraic Riccati equation
K = A

(K KB(R +B

KB)
1
B

K)A+Q.
It can also be shown in the limit as N , that
x
k+1
= (A+BL) x
k
+

C

N
1
(z
k+1
C(A+BL) x
k
),
where

is given by

= C

(CC

+N)
1
C,
and is the unique 0 symmetric solution of the Riccati equation
= A( C

(CC

+N)
1
C)A

+M.
The assumptions required for this are:
1. (A, C) is observable.
2. The matrix M can be written as M = DD

, where D is a matrix such that the pair


(A, D) is controllable.
Non-Gaussian uncertainty
When the uncertainty of the system is non-Gaussian, computing E[x
k
|I
k
] may be very dicult from
a computational viewpoint. So, a suboptimal solution is typically used.
A common suboptimal controller is to replace E[x
k
|I
k
] by the estimate produced by the Kalman
lter (i.e., act as if x
0
, w
k
and v
k
are Gaussian).
A nice property of this approximation is that it can be proved to be optimal within the class of
controllers that are linear functions of I
k
.
2
Recall the denitions: A pair of matrices (A, B), where A R
nn
and B R
nm
, is said to be controllable if
the n (n, m) matrix: [B, AB, A
2
B, . . . , A
n1
B] has full rank. A pair (A, C), A R
nn
, C R
mn
is said to be
observable if the pair (A

, C

) is controllable.
150
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
4.2.3 Sucient statistics
Problem of DP algorithm under imperfect state info: Growing dimension of the reformulated
state space I
k
.
Objective: Find sucient statistics (ideally, of smaller dimension) for I
k
that summarize all
the essential contents of I
k
as far as control is concerned.
Recall the DP formulation for the imperfect state info case:
J
N1
(I
N1
) = min
u
N1
U
N1
E
x
N1
,w
N1
_
g
N
(f
N1
(x
N1
, u
N1
, w
N1
))
+g
N1
(x
N1
, u
N1
, w
N1
)|I
N1
, u
N1

, (4.2.8)
J
k
(I
k
) = min
u
k
U
k
E
x
k
,w
k
,z
k+1
[g
k
(x
k
, u
k
, w
k
) +J
k+1
(I
k
, z
k+1
, u
k
)|I
k
, u
k
] . (4.2.9)
Suppose that we can nd a function S
k
(I
k
) such that the RHS of (4.2.8) and (4.2.9) can be
written in terms of some function H
k
as
min
u
k
U
k
H
k
(S
k
(I
k
), u
k
),
such that
J
k
(I
k
) = min
u
k
U
k
H
k
(S
k
(I
k
), u
k
).
Such a function S
k
is called a sucient statistic.
An optimal policy obtained by the preceding minimization can be written as

k
(I
k
) =

k
(S
k
(I
k
)),
where u

k
is an appropriate function.
Example of a sucient statistic: S
k
(I
k
) = I
k
.
Another important sucient statistic is the conditional probability distribution of the state x
k
given the information vector I
k
, i.e.,
S
k
(I
k
) = P
x
k
|I
k
For this case, we need an extra assumption: The probability distribution of the observation
disturbance v
k+1
depends explicitly only on the immediate preceding x
k
, u
k
and w
k
, and not
on earlier ones.
It turns out that P
x
k
|I
k
is generated recursively by a dynamic system (estimator) of the form
P
x
k+1
|I
k+1
=
k
(P
x
k
|I
k
, u
k
, z
k+1
), (4.2.10)
for a suitable function
k
determined from the data of the problem. (We will verify this later)
151
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
Claim: Suppose for now that function
k
in equation (4.3.1) exists. We will argue now that
this is enough to solve the DP algorithm.
Proof: By induction. For k = N 1 (i.e., to solve (4.2.8)), given the Markovian nature of
the system, it is sucient to know the distribution P
x
N1
,I
N1
together with the distribution
P
w
N1
|x
N1
,u
N1
, so that
J
N1
(I
N1
) = min
u
N1
U
N1
H
N1
(P
x
N1
|I
N1
, u
N1
) =

J
N1
(P
x
N1
|I
N1
)
for appropriate functions H
N1
and

J
N1
.
IH: Assume
J
k+1
(I
k+1
) = min
u
k+1
U
k+1
H
k+1
(P
x
k+1
|I
k+1
, u
k+1
) =

J
k+1
(P
x
k+1
|I
k+1
), (4.2.11)
for appropriate functions H
k+1
and

J
k+1
.
We want to show that there exist functions H
k
and

J
k
such that
J
k
(I
k
) = min
u
k
U
k
H
k
(P
x
k
|I
k
, u
k
) =

J
k
(P
x
k
|I
k
).
Using equations (4.3.1) and (4.2.11), the DP in (4.2.9) can be written as
J
k
(I
k
) = min
u
k
U
k
E
x
k
,w
k
,z
k+1
_
g
k
(x
k
, u
k
, w
k
) +

J
k+1
(
k
(P
x
k
|I
k
, u
k
, z
k+1
))|I
k
, u
k

. (4.2.12)
To solve this problem, we also need the joint distribution P(x
k
, w
k
, z
k+1
|I
k
, u
k
), or equivalently,
given that from the primitives of the system,
z
k+1
= h
k+1
(x
k+1
, u
k
, v
k+1
), and x
k+1
= f
k
(x
k
, u
k
, w
k
),
we need
P(x
k
, w
k
, h
k+1
(f
k
(x
k
, u
k
, w
k
), u
k
, v
k+1
)|I
k
, u
k
).
This distribution can be expressed in terms of P
x
k
|I
k
, the given distributions
P(w
k
|x
k
, u
k
), P(v
k+1
|f
k
(x
k
, u
k
, w
k
), u
k
, w
k
),
and the system equation x
k+1
= f
k
(x
k
, u
k
, w
k
).
Therefore, the expression minimized over u
k
in (4.2.12) can be written as a function of P
x
k
|I
k
and u
k
, and the DP equation (4.2.12) can be written as
J
k
(I
k
) = min
u
k
U
k
H
k
(P
x
k
|I
k
, u
k
)
for a suitable function H
k
. Thus, P
x
k
|I
k
is a sucient statistic.
If the conditional distribution P
x
k
|I
k
is uniquely determined by another expression S
k
(I
k
), i.e.,
there exist a function G
k
such that
P
x
k
|I
k
= G
k
(S
k
(I
k
)),
then S
k
(I
k
) is also a sucient statistic.
For example, if we can show that P
x
k
|I
k
is a Gaussian distribution, then the mean and the
covariance matrix corresponding to P
x
k
|I
k
form a sucient statistic.
152
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
The representation of the optimal policy as a sequence of functions of P
x
k
|I
k
, i.e.,

k
(I
k
) =
k
(P
x
k
|I
k
), k = 0, 1, . . . , N 1,
is conceptually very useful. It provides a decomposition of the optimal controller in two parts:
1. An estimator, which uses at time k the measurement z
k
and the control u
k1
to generate
the probability distribution P
x
k
|I
k
.
2. An actuator, which generates a control input to the system as a function of the probability
distribution P
x
k
|I
k
.
This is illustrated in Figure 4.2.2.
System equation:
x
k+1
=f
k
(x
k
,u
k
,w
k
)
Measurement equation:
z
k
= h
k
(x
k,
u
k-1,
v
k
)
Estimator:

k-1
(P
xk-1|Ik-1
,u
k-1
,z
k
)
Actuator:
u
k
=
k
(P
xk|Ik
)
P
xk|Ik
u
k
u
k
Delay
(from previous period)
v
k
z
k
z
k
u
k-1
w
k
Start reading
from here
x
k
Figure 4.2.2: Conceptual separation of the optimal controller into an estimator and an actuator.
This separation is the basis for various suboptimal control schemes that split the controller a
priori into an estimator and an actuator.
The controller
k
(P
x
k
|I
k
) can be viewed as controlling the probabilistic state P
x
k
|I
k
, so as
to minimize the expected cost-to-go conditioned on the information I
k
available.
4.2.4 The conditional state distribution recursion
We still need to justify the recursion
P
x
k+1
|I
k+1
=
k
(P
x
k
|I
k
, u
k
, z
k+1
) (4.2.13)
For the case where the state, control, observation, and disturbance spaces are the real line, and all
r.v. involved posses p.d.f., the conditional density p(x
k+1
|I
k+1
) is generated from p(x
k
|I
k
), u
k
, and
z
k+1
by means of the equation:
p(x
k+1
|I
k+1
) = p(x
k+1
|I
k
, u
k
, z
k+1
)
=
p(x
k+1
, z
k+1
|I
k
, u
k
)
p(z
k+1
|I
k
, u
k
)
=
p(x
k+1
|I
k
, u
k
)p(z
k+1
|I
k
, u
k
, x
k+1
)
_

p(x
k+1
|I
k
, u
k
)p(z
k+1
|I
k
, u
k
, x
k+1
)dx
k+1
.
153
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
In this expression, all the probability densities appearing in the RHS may be expressed in terms of
p(x
k
|I
k
), u
k
, and z
k+1
.
In particular:
The density p(x
k+1
|I
k
, u
k
) may be expressed through p(x
k
|I
k
), u
k
, and the system equation
x
k+1
= f
k
(x
k
, u
k
, w
k
) using the given density p(w
k
|x
k
, u
k
) and the relation
p(w
k
|x
k
, u
k
) =
_

p(x
k
|I
k
)p(w
k
|x
k
, u
k
)dx
k
.
The density p(z
k+1
|I
k
, u
k
, x
k+1
) is expressed through the measurement equation z
k+1
= h
k+1
(x
k+1
, u
k
, v
k+1
)
using the densities
p(x
k
|I
k
), p(w
k
|x
k
, u
k
), p(v
k+1
|x
k
, u
k
, w
k
).
Now, we give an example for the nite space set case.
Example 4.2.1 (A search problem)
At each period, decide to search or not search a site that may contain a treasure.
If we search and treasure is present, we nd it w.p. and remove it from the site.
State x
k
(unobservable at the beginning of period k): Treasure is present or not.
Control u
k
: search or not search.
If the site is searched in period k, the observation z
k+1
takes two values: treasure found or not.
If site is not searched, the value of z
k+1
is irrelevant.
Denote p
k
: probability a treasure is present at the beginning of period k.
The probability evolves according to the recursion:
p
k+1
=
_

_
p
k
if site is not searched at time k
0 if the site is searched and a treasure is found (and removed)
p
k
(1)
p
k
(1)+1p
k
if the site is searched but no treasure is found
For the third case:
Numerator p
k
(1 ): It is the kth period probability that the treasure is present and the
search is unsuccessful.
Denominator p
k
(1 ) +1 p
k
: Probability of an unsuccessful search, when the treasure is
either there or not.
The recursion for p
k+1
is a special case of (4.3.4).
4.3 Sucient Statistics
In this section, we continue investigating the conditional state distribution as a sucient statistic
for problems with imperfect state information.
154
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
4.3.1 Conditional state distribution: Review of basics
Recall the important sucient statistic conditional probability distribution of the state x
k
given the information vector I
k
, i.e.,
S
k
(I
k
) = P
x
k
|I
k
For this case, we need an extra assumption: The probability distribution of the observation
disturbance v
k+1
depends explicitly only on the immediate preceding x
k
, u
k
and w
k
and not
on earlier ones, which gives the system a Markovian avor.
It turns out that P
x
k
|I
k
is generated recursively by a dynamic system (estimator) of the form
P
x
k+1
|I
k+1
=
k
(P
x
k
|I
k
, u
k
, z
k+1
), (4.3.1)
for a suitable function
k
determined from the data of the problem.
We have already proven that if function
k
in equation (4.3.1) exists, then we can solve the
DP algorithm.
The representation of the optimal policy as a sequence of functions of P
x
k
|I
k
, i.e.,

k
(I
k
) =
k
(P
x
k
|I
k
), k = 0, 1, . . . , N 1,
is conceptually very useful. It provides a decomposition of the optimal controller in two parts:
1. An estimator, which uses at time k the measurement z
k
and the control u
k1
to generate
the probability distribution P
x
k
|I
k
.
2. An actuator, which generates a control input to the system as a function of the probability
distribution P
x
k
|I
k
.
The DP algorithm can be written as:

J
N1
(P
x
N1
|I
N1
) = min
u
N1
U
N1
E
x
N1
,w
N1
_
g
N
(f
N1
(x
N1
, u
N1
, w
N1
))
+g
N1
(x
N1
, u
N1
, w
N1
)|I
N1
, u
N1

, (4.3.2)
and for k = 0, 1, . . . , N 2,

J
k
(P
x
k
|I
k
) = min
u
k
U
k
E
x
k
,w
k
,z
k+1
_
g
k
(x
k
, u
k
, w
k
) +

J
k+1
(
k
(P
x
k
|I
k
, u
k
, z
k+1
))|I
k
, u
k

, (4.3.3)
where P
x
k
|I
k
plays the role of the state, and
P
x
k+1
|I
k+1
=
k
(P
x
k
|I
k
, u
k
, z
k+1
) (4.3.4)
is the system equation. Here, the role of control is played by u
k
, and the role of the disturbance
is played by z
k+1
.
Example 4.3.1 (A search problem Revisited)
At each period, decide to search or not search a site that may contain a treasure.
155
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
If we search and the treasure is present, we nd it w.p. and remove it from the site.
State x
k
(unobservable at the beginning of period k): Treasure is present or not.
Control u
k
: search or not search.
Basic costs: Treasures worth is V , and search cost is C.
If the site is searched in period k, the observation z
k+1
takes one of two values: treasure found or
not.
If site is not searched, the value of z
k+1
is irrelevant.
Denote p
k
: probability that the treasure is present at the beginning of period k.
The probability evolves according to the recursion:
p
k+1
=
_

_
p
k
if site is not searched at time k
0 if the site is searched and a treasure is found (and removed)
p
k
(1)
p
k
(1)+1p
k
if the site is searched but no treasure is found
For the third case:
Numerator p
k
(1 ): It is the kth period probability that the treasure is present and the
search is unsuccessful.
Denominator p
k
(1 ) +1 p
k
: Probability of an unsuccessful search, when the treasure is
either there or not.
The recursion for p
k+1
is a special case of (4.3.4).
Assume that once we decide not to search in a period, we cannot search at future times.
The DP algorithm is

J
N
(p
N
) = 0,
and for k = 0, 1, . . . , N 1,

J
k
(p
k
) = max
_

_
0, C + p
k
V
. .
reward
for search & nd
+ (1 p
k
)
. .
prob. of
search & not nd

J
k+1
_
p
k
(1 )
p
k
(1 ) + 1 p
k
_
_

_
It can be shown by induction that the functions

J
k
(p
k
) satisfy

J
k
(p
k
) = 0, p
k

C
V
Furthermore, it is optimal to search at period k if and only if
p
k
V
. .
expected reward
from search
C.
..
cost of search

156
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
4.3.2 Finite-state systems
Suppose the system is a nite-state Markov chain with states 1, . . . , n.
Then, the conditional probability distribution P
x
k
|I
k
is an n-dimensional vector
(P(x
k
= 1|I
k
), P(x
k
= 2|I
k
), . . . , P(x
k
= n|I
k
)).
When a control u U is applied (U nite), the system moves from state i to state j w.p.
p
ij
(u). Note that the real system state transition is only driven by the control u applied at
each stage.
There is a nite number of possible observation outcomes z
1
, z
2
, . . . , z
q
. The probability of
occurrence of z

, given that the current state is x


k
= j and the preceding control was u
k1
,
is denoted by P(z
k
= z

|u
k1
, x
k
= j)

= r
j
(u
k1
, z

), = 1, . . . , q.
The information available to the controller at stage k is
I
k
(z
1
, . . . , z
k
, u
0
, . . . , u
k1
).
Following the observation z
k
, a control u
k
is applied, and a cost g(x
k
, u
k
) is incurred.
The terminal cost at stage N for being in state x is G(x).
Objective: Minimize the expected cumulative cost incurred over N stages.
We can reformulate the problem as one with imperfect state information. The objective is to control
the column vector of conditional probabilities
p
k
= (p
1
k
, . . . , p
n
k
)

,
where
p
i
k
= P(x
k
= i|I
k
), i = 1, 2, . . . , n.
We refer to p
k
as the belief state. It evolves according to
p
k+1
=
k
(p
k
, u
k
, z
k+1
),
where the function
k
is an estimator that given the sucient statistic p
k
provides the new sucient
statistic p
k+1
. The initial belief p
0
is given.
The conditional probabilities can be updated according to the Bayesian updating rule
p
j
k+1
= P(x
k+1
= j|I
k+1
)
= P(x
k+1
= j|z
0
, . . . , z
k+1
, u
0
, . . . , u
k
)
=
P(x
k+1
= j, z
k+1
|I
k
, u
k
)
P(z
k+1
|I
k
, u
k
)
(because P(A|B, C) = P(A, B|C)/P(B|C))
=

n
i=1
P(x
k
= i|I
k
)P(x
k+1
= j|x
k
= i, u
k
)P(z
k+1
|u
k
, x
k+1
= j)

n
s=1

n
i=1
P(x
k
= i|I
k
)P(x
k+1
= s|x
k
= i, u
k
)P(z
k+1
|u
k
, x
k+1
= s)
=

n
i=1
p
i
k
p
ij
(u
k
)r
j
(u
k
, z
k+1
)

n
s=1

n
i=1
p
i
k
p
is
(u
k
)r
s
(u
k
, z
k+1
)
.
157
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
In vector form, we have
p
j
k+1
=
r
j
(u
k
, z
k+1
)[P(u
k
)

p
k
]
j

n
s=1
r
s
(u
k
, z
k+1
)[P(u
k
)

p
k
]
s
, j = 1, . . . , n, (4.3.5)
where P(u
k
) is the n n transition probability matrix formed by p
ij
(u
k
), and [P(u
k
)

p
k
]
j
is the
jth component of vector [P(u
k
)

p
k
].
The corresponding DP algorithm (4.3.2)-(4.3.3) has the specic form

J
k
(p
k
) = min
u
k
U
_
p

k
g(u
k
) + E
z
k+1
_

J
k+1
((p
k
, u
k
, z
k+1
))|p
k
, u
k
_
, k = 0, . . . , N 1, (4.3.6)
where g(u
k
) is the column vector with components g(1, u
k
), . . . , g(n, u
k
), and p

k
g(u
k
) is the expected
stage cost.
The algorithm starts at stage N with

J
N
(p
N
) = p

N
G,
where G is the column vector with components the terminal costs G(i), i = 1, . . . , n, and proceeds
backwards.
It turns out that the cost-to-go functions

J
k
in the DP algorithm are piecewise linear and concave. A
consequence of this fact is that

J
k
can be characterized by a nite set of scalars. Still, however, for a
xed k, the number of these scalars can increase fast with N, and there may be no computationally
ecient way to solve the problem.
Example 4.3.2 (Machine repair revisted)
Consider again the machine repair problem, whose setting is included below:
A machine can be in one of two unobservable states (i.e., n = 2):

P (bad state) and P (good
state).
State space: {

P, P}, where for the indexing: State 1 is



P, and state 2 is P.
Number of periods: N = 2
At the end of each period, the machine is inspected with two possible inspection outcomes: G
(probably good state), B (probably bad state)
Control space: actions after each inspection, which could be either
C : continue operation of the machine; or
S : stop, diagnose its state and if it is in bad state

P, repair.
Cost per stage: g(

P, C) = 2; g(P, C) = 0; g(

P, S) = 1; g(P, S) = 1, or in vector form:
g(C) =
_
2
0
_
, g(S) =
_
1
1
_
.
158
CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION. Prof. R. Caldentey
Total cost: g(x
0
, u
0
) +g(x
1
, u
1
) (assume zero terminal cost)
Let x
0
, x
1
be the state of the machine at the end of each period
Distribution of initial state: P(x
0
=

P) =
1
3
, P(x
0
= P) =
2
3
Assume that we start with a machine in good state, i.e., x
1
= P
System equation:
x
k+1
= w
k
, k = 0, 1
where the transition probabilities are given by
In matrix form, following the aforementioned indexing of the states, transition probabilities can be
expressed as
P(C) =
_
1 0
1/3 2/3
_
; P(S) =
_
1/3 2/3
1/3 2/3
_
.
Note that we do not have perfect state information, since the inspection does not reveal the state
of the machine with certainty. Rather, the result of each inspection may be viewed as a noisy
assessment of the system state.
Result of inspections: z
k
= v
k
, k = 0, 1; v
k
{B, G}
159
Prof. R. Caldentey CHAPTER 4. DP WITH IMPERFECT STATE INFORMATION.
The inspection results can be described by the following denitions:
r
1
(S, G)

= P(z
k+1
= G|u
k
= S, x
k+1
=

P) =
1
4
= r
1
(C, G),
r
1
(S, B)

= P(z
k+1
= B|u
k
= S, x
k+1
=

P) =
3
4
= r
1
(C, B),
r
2
(S, G)

= P(z
k+1
= G|u
k
= S, x
k+1
= P) =
3
4
= r
2
(C, G),
r
2
(S, B)

= P(z
k+1
= B|u
k
= S, x
k+1
= P) =
1
4
= r
2
(C, B).
Note that in this case, the observation z
k+1
does not depend on the control u
k
, but just on the
state x
k+1
.
Dene the belief state p
0
as the 2-dimensional vector with components:
p
1
0

= P(x
0
=

P|I
0
), p
2
0

= P(x
0
= P|I
0
) = 1 p
1
0
.
Similarly, dene the belief state p
1
with coordinates
p
1
1

= P(x
1
=

P|I
1
), p
2
1

= P(x
1
= P|I
1
) = 1 p
1
1
,
where the evolution of the beliefs is driven by the estimator
p
1
=
0
(p
0
, u
0
, z
1
).
We will use equation (4.3.5) to compute p
1
given p
0
, but rst we calculate the matrix products P(u
0
)

p
0
,
for u
0
{S, C}:
P(S)

p
0
=
_
1/3 1/3
2/3 2/3
_ _
p
1
0
p
2
0
_
=
_
1
3
(p
1
0
+p
2
0
)
2
3
(p
1
0
+p
2
0
)
_
=
_
1
3
2
3
_
, (4.3.7)
and
P(C)

p
0
=
_
1 1/3
0 2/3
_ _
p
1
0
p
2
0
_
=
_
p
1
0
+
1
3
p
2
0
2
3
p
2
0
_
=
_
1
3
+
2
3
p
1
0
2
3

2
3
p
1
0
_
. (4.3.8)
Now, using equation (4.3.5) for state j = 1 (i.e., for state P̄), we get:

For u_0 = S, z_1 = G:

    p^1_1 = r_1(S, G)[P(S)′p_0]_1 / ( r_1(S, G)[P(S)′p_0]_1 + r_2(S, G)[P(S)′p_0]_2 )
          = (1/4)(1/3) / ( (1/4)(1/3) + (3/4)(2/3) ) = 1/7.

For u_0 = S, z_1 = B:

    p^1_1 = r_1(S, B)[P(S)′p_0]_1 / ( r_1(S, B)[P(S)′p_0]_1 + r_2(S, B)[P(S)′p_0]_2 )
          = (3/4)(1/3) / ( (3/4)(1/3) + (1/4)(2/3) ) = 3/5.

For u_0 = C, z_1 = G:

    p^1_1 = r_1(C, G)[P(C)′p_0]_1 / ( r_1(C, G)[P(C)′p_0]_1 + r_2(C, G)[P(C)′p_0]_2 )
          = (1/4)(1/3 + (2/3)p^1_0) / ( (1/4)(1/3 + (2/3)p^1_0) + (3/4)(2/3 − (2/3)p^1_0) )
          = (1 + 2p^1_0) / (7 − 4p^1_0).

For u_0 = C, z_1 = B:

    p^1_1 = r_1(C, B)[P(C)′p_0]_1 / ( r_1(C, B)[P(C)′p_0]_1 + r_2(C, B)[P(C)′p_0]_2 )
          = (3/4)(1/3 + (2/3)p^1_0) / ( (3/4)(1/3 + (2/3)p^1_0) + (1/4)(2/3 − (2/3)p^1_0) )
          = (3 + 6p^1_0) / (5 + 4p^1_0).
In summary, we get

    p^1_1 = [Φ_0(p_0, u_0, z_1)]_1 =  1/7                          if u_0 = S, z_1 = G,
                                      3/5                          if u_0 = S, z_1 = B,
                                      (1 + 2p^1_0)/(7 − 4p^1_0)    if u_0 = C, z_1 = G,
                                      (3 + 6p^1_0)/(5 + 4p^1_0)    if u_0 = C, z_1 = B,

where p^2_0 = 1 − p^1_0 and p^2_1 = 1 − p^1_1.
The DP algorithm (4.3.6) may be written as follows. For the terminal stage,

    J̄_2(p_2) = 0   (i.e., zero terminal cost),

and for stage k = 1,

    J̄_1(p_1) = min_{u_1 ∈ {S,C}} p_1′ g(u_1)
             = min{ (p^1_1, p^2_1)(1, 1)′ ,  (p^1_1, p^2_1)(2, 0)′ }   (first term: u_1 = S; second: u_1 = C)
             = min{ p^1_1 + p^2_1 ,  2p^1_1 }
             = min{ 1 ,  2p^1_1 }.

This minimization yields

    μ*_1(p_1) = C   if p^1_1 ≤ 1/2,
                S   if p^1_1 > 1/2.
For stage k = 0, we have

    J̄_0(p_0) = min_{u_0 ∈ {C,S}} { p_0′ g(u_0) + E_{z_1}[ J̄_1(Φ_0(p_0, u_0, z_1)) | p_0, u_0 ] }
             = min{ 2p^1_0 + P(z_1 = G | I_0, C) J̄_1(Φ_0(p_0, C, G)) + P(z_1 = B | I_0, C) J̄_1(Φ_0(p_0, C, B)),       [u_0 = C]
                    (p^1_0 + p^2_0) + P(z_1 = G | I_0, S) J̄_1(Φ_0(p_0, S, G)) + P(z_1 = B | I_0, S) J̄_1(Φ_0(p_0, S, B)) },   [u_0 = S; note p^1_0 + p^2_0 = 1]
The probabilities here may be expressed in terms of p_0 by using the expression in the denominator of (4.3.5); that is,

    P(z_{k+1} | I_k, u_k) = Σ_{s=1}^n Σ_{i=1}^n P(x_k = i | I_k) P(x_{k+1} = s | x_k = i, u_k) P(z_{k+1} | u_k, x_{k+1} = s)
                          = Σ_{s=1}^n Σ_{i=1}^n p^i_k p_{is}(u_k) r_s(u_k, z_{k+1})
                          = Σ_{s=1}^n r_s(u_k, z_{k+1}) [P(u_k)′ p_k]_s.
In our case:

    P(z_1 = G | I_0, u_0 = C) = r_1(C, G)[P(C)′p_0]_1 + r_2(C, G)[P(C)′p_0]_2
                              = (1/4)(1/3 + (2/3)p^1_0) + (3/4)(2/3 − (2/3)p^1_0)
                              = (7 − 4p^1_0)/12.

Similarly, we obtain:

    P(z_1 = B | I_0, C) = (5 + 4p^1_0)/12,    P(z_1 = G | I_0, S) = 7/12,    P(z_1 = B | I_0, S) = 5/12.
Using these values we have

    J̄_0(p_0) = min{ 2p^1_0 + ((7 − 4p^1_0)/12) J̄_1( (1 + 2p^1_0)/(7 − 4p^1_0), 1 − p^1_1 )
                            + ((5 + 4p^1_0)/12) J̄_1( (3 + 6p^1_0)/(5 + 4p^1_0), 1 − p^1_1 ),
                     1 + (7/12) J̄_1(1/7, 6/7) + (5/12) J̄_1(3/5, 2/5) },

where in the first (u_0 = C) alternative the first argument of J̄_1 is the corresponding value of p^1_1.
By substitution of J̄_1(p_1) and after some algebra we obtain

    J̄_0(p_0) = 19/12                 if 3/8 ≤ p^1_0 ≤ 1,
                (7 + 32p^1_0)/12      if 0 ≤ p^1_0 ≤ 3/8,

and an optimal control for the first stage

    μ*_0(p_0) = C   if p^1_0 ≤ 3/8,
                S   if p^1_0 > 3/8.
Also, we know that P(z_0 = G) = 7/12 and P(z_0 = B) = 5/12. In addition, we can establish the initial value of p^1_0 according to the value of I_0 (i.e., z_0):

    P(x_0 = P̄ | z_0 = G) = P(x_0 = P̄, z_0 = G) / P(z_0 = G) = ( (1/3)(1/4) ) / (7/12) = 1/7,

and

    P(x_0 = P̄ | z_0 = B) = P(x_0 = P̄, z_0 = B) / P(z_0 = B) = ( (1/3)(3/4) ) / (5/12) = 3/5,

so that the formula

    J* = E_{z_0}[ J̄_0( P(x_0 = · | z_0) ) ] = (7/12) J̄_0(1/7, 6/7) + (5/12) J̄_0(3/5, 2/5) = 176/144

yields the same optimal cost as the one obtained above by means of the general DP algorithm for problems with imperfect state information.
Observe also that the functions J̄_k are linear in this case; recall that we had said that in general they are piecewise linear.
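For concreteness, the following is a small numerical sketch (in Python; the variable names and code organization are ours, not part of the notes) that reproduces the belief update and the two-stage DP of this example, recovering the optimal cost 176/144.

    import numpy as np

    # Model data for the two-state machine repair example (state 1 = bad, state 2 = good).
    P = {'C': np.array([[1.0, 0.0], [1/3, 2/3]]),      # transition matrices P(u)
         'S': np.array([[1/3, 2/3], [1/3, 2/3]])}
    r = {'G': np.array([1/4, 3/4]),                    # r_s(u, z): observation probabilities given the next state
         'B': np.array([3/4, 1/4])}                    # (independent of the control here)
    g = {'C': np.array([2.0, 0.0]), 'S': np.array([1.0, 1.0])}   # stage cost vectors g(u)

    def belief_update(p, u, z):
        """One step of the estimator p_{k+1} = Phi(p_k, u_k, z_{k+1})."""
        unnorm = r[z] * (P[u].T @ p)                   # r_s(u,z) [P(u)' p]_s
        return unnorm / unnorm.sum()

    def J1(p):
        """Last-stage cost-to-go: min over u_1 of p' g(u_1)."""
        return min(p @ g['S'], p @ g['C'])

    def J0(p):
        """First-stage cost-to-go, taking the expectation over the observation z_1."""
        best = np.inf
        for u in ('C', 'S'):
            cost = p @ g[u]
            for z in ('G', 'B'):
                pz = (r[z] * (P[u].T @ p)).sum()       # P(z_1 = z | I_0, u_0 = u)
                cost += pz * J1(belief_update(p, u, z))
            best = min(best, cost)
        return best

    # Initial beliefs after observing z_0, and the resulting optimal expected cost.
    p_G, p_B = np.array([1/7, 6/7]), np.array([3/5, 2/5])
    print((7/12) * J0(p_G) + (5/12) * J0(p_B), 176/144)   # both ~1.2222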
4.4 Exercises
Exercise 4.4.1 Take the linear system and measurement equation for the LQ-system with imperfect state information. Consider the problem of finding a policy {μ_0(I_0), . . . , μ_{N−1}(I_{N−1})} that minimizes the quadratic cost

    E[ x_N′ Q x_N + Σ_{k=0}^{N−1} u_k′ R_k u_k ].

Assume, however, that the random vectors x_0, w_0, . . . , w_{N−1}, v_0, . . . , v_{N−1} are correlated and have a given joint probability distribution, and finite first and second moments. Show that the optimal policy is given by

    μ*_k(I_k) = L_k E[y_k | I_k],

where the gain matrices L_k are obtained from the algorithm

    L_k = −(B_k′ K_{k+1} B_k + R_k)^{−1} B_k′ K_{k+1} A_k,
    K_N = Q,
    K_k = A_k′ ( K_{k+1} − K_{k+1} B_k (B_k′ K_{k+1} B_k + R_k)^{−1} B_k′ K_{k+1} ) A_k,

and the vectors y_k are given by y_N = x_N, and

    y_k = x_k + A_k^{−1} w_k + A_k^{−1} A_{k+1}^{−1} w_{k+1} + · · · + A_k^{−1} · · · A_{N−1}^{−1} w_{N−1},   for k = 0, . . . , N − 1

(assuming the matrices A_0, A_1, . . . , A_{N−1} are invertible).
Hint: Show that the cost can be written as

    E[ y_0′ K_0 y_0 + Σ_{k=0}^{N−1} (u_k − L_k y_k)′ P_k (u_k − L_k y_k) ],

where P_k = B_k′ K_{k+1} B_k + R_k.
Exercise 4.4.2 Consider the scalar, imperfect state information system

    x_{k+1} = x_k + u_k + w_k,
    z_k = x_k + v_k,

where we assume that the initial state x_0, and the disturbances w_k and v_k are all independent. Let the cost be

    E[ x_N^2 + Σ_{k=0}^{N−1} (x_k^2 + u_k^2) ],

and let the given probability distributions be

    P(x_0 = 2) = 1/2,    P(w_k = 1) = 1/2,    P(v_k = 1/4) = 1/2,
    P(x_0 = −2) = 1/2,   P(w_k = −1) = 1/2,   P(v_k = −1/4) = 1/2.

(a) Show that this problem can be transformed into a perfect information problem, where first we infer the value of x_0, and then we sequentially compute the values x_1, . . . , x_N. Determine the optimal policy. Hint: For this problem, x_k can be determined from x_{k−1}, u_{k−1}, and z_k.
(b) Determine the policy that is identical to the optimal except that it uses a linear least squares estimator of x_k given I_k in place of E[x_k | I_k].
Exercise 4.4.3 A linear system with Gaussian disturbances and Gaussian initial state,

    x_{k+1} = A x_k + B u_k + w_k,

is to be controlled so as to minimize a quadratic cost similar to that discussed above. The difference is that the controller has the option of choosing at each time k one of two types of measurement equations for the next stage k + 1:

    First type:   z_{k+1} = C_1 x_{k+1} + v^1_{k+1},
    Second type:  z_{k+1} = C_2 x_{k+1} + v^2_{k+1}.

Here, C_1 and C_2 are given matrices of appropriate dimension, and {v^1_k} and {v^2_k} are zero-mean, independent, random sequences with given finite covariances that do not depend on x_0 and {w_k}. There is a cost g_1 (or g_2) each time a measurement of type 1 (or type 2) is taken. The problem is to find the optimal control and measurement selection policy that minimizes the expected value of the sum of the quadratic cost

    x_N′ Q x_N + Σ_{k=0}^{N−1} (x_k′ Q x_k + u_k′ R u_k)

and the total measurement cost. Assume for convenience that N = 2 and that the first measurement z_0 is of type 1. Show that the optimal measurement selection at time k = 0 and k = 1 does not depend on the value of the information vectors I_0 and I_1, and can be determined a priori. Describe the nature of the optimal policy.
Exercise 4.4.4 Consider a machine that can be in one of two states, good or bad. Suppose that
the machine produces an item at the end of each period. The item produced is either good or bad
depending on whether the machine is in good or bad state at the beginning of the corresponding
period, respectively. We suppose that once the machine is in a bad state it remains in that state
until it is replaced. If the machine is in a good state at the beginning of a certain period, then with
probability t it will be in the bad state at the end of the period. Once an item is produced, we may
inspect the item at a cost I, or not inspect. If an inspected item is found to be bad, the machine
is replaced with a machine in good state at a cost R. The cost for producing a bad item is C > 0.
Write a DP algorithm for obtaining an optimal inspection policy assuming a machine is initially in
good state and a horizon of N periods. Then, solve the problem for t = 0.2, I = 1, R = 3, C = 2,
and N = 8.
Hint: Define

    x_k = state at the beginning of the kth stage ∈ {Good, Bad},
    w_k = state at the end of the kth stage before an action is taken,
    u_k ∈ {Inspect, No inspect}.

Take as information vector the stage at which the last inspection was made.
Exercise 4.4.5 A person is offered 2 to 1 odds in a coin-tossing game where he wins whenever a tail occurs. However, he suspects that the coin is biased and has an a priori probability distribution F(p) for the probability p that a head occurs at each toss. The problem is to find an optimal policy of deciding whether to continue or stop participating in the game given the outcomes of the game so far. A maximum of N tosses is allowed. Indicate how such a policy can be found by means of DP. Specify the update rule for the belief about p.
Hint: Define the state as n_k, the number of heads observed in the first k flips.
Chapter 5
Infinite Horizon Problems
5.1 Types of infinite horizon problems

Setting similar to the basic finite horizon problem, but:
- The number of stages is infinite.
- The system is stationary.
Simpler version: Assume a finite number of states. (We will keep this assumption.)

Total cost problems: Minimize over all admissible policies π,

    J_π(x_0) = lim_{N→∞} E_{w_0, w_1, ...} [ Σ_{k=0}^{N−1} α^k g(x_k, μ_k(x_k), w_k) ].

The value function J_π(x_0) should be finite for at least some admissible policies π and some initial states x_0.

Variants of total cost problems:
(a) Stochastic shortest path problems (α = 1): It requires a cost-free terminal state t that is reached in finite time w.p.1.
(b) Discounted problems (α < 1) with bounded cost per stage, i.e., |g(x, u, w)| < M.
    Here, J_π(x_0) < ∞, since the stage costs are dominated by the decreasing geometric progression {α^k M}.
(c) Discounted and non-discounted problems with unbounded cost per stage.
    Here, α ≤ 1, but |g(x, u, w)| could be ∞. Technically more challenging!

Average cost problems (type (d)): Minimize over all admissible policies π,

    J_π(x_0) = lim_{N→∞} (1/N) E_{w_0, w_1, ...} [ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k), w_k) ].

The approach works even if J_π(x_0) is infinite for every policy π and initial state x_0.
5.1.1 Preview of infinite horizon results

Key issue: The relation between the infinite and finite horizon optimal cost-to-go functions.
Illustration: Let α = 1 and let J_N(x) denote the optimal cost of the N-stage problem, generated after N iterations of the DP algorithm, starting from J_0(x) ≡ 0, and proceeding with

    J_{k+1}(x) = min_{u ∈ U(x)} E_w [ g(x, u, w) + J_k(f(x, u, w)) ],   ∀x.        (5.1.1)

Typical results for total cost problems:
- A relation valuable from a computational viewpoint:

      J*(x) = lim_{N→∞} J_N(x),   ∀x.        (5.1.2)

  It holds for problems (a) and (b); there are some unusual exceptions for problems (c).
- The limiting form of the DP algorithm should hold for all states x,

      J*(x) = min_{u ∈ U(x)} E_w [ g(x, u, w) + J*(f(x, u, w)) ],   ∀x.        (Bellman's equation)

- If μ(x) minimizes the RHS in Bellman's equation for each x, the policy π = {μ, μ, . . .} is optimal. This is true for most infinite horizon problems of interest (and in particular, for problems (a) and (b)).
5.1.2 Total cost problem formulation

- We assume an underlying system equation

      x_{k+1} = w_k.

- At state i, the use of a control u specifies the transition probability p_ij(u) to the next state j.
- The control u is constrained to take values in a given finite constraint set U(i), where i is the current state.
- We will assume a kth stage cost g(x_k, u_k) for using control u_k at state x_k. If ĝ(i, u, j) is the cost of using u at state i and moving to state j, we use as cost-per-stage the expected cost g(i, u) given by

      g(i, u) = Σ_j p_ij(u) ĝ(i, u, j).

- The total expected cost associated with an initial state i and a policy π = {μ_0, μ_1, . . .} is

      J_π(i) = lim_{N→∞} E [ Σ_{k=0}^{N−1} α^k g(x_k, μ_k(x_k)) | x_0 = i ],

  where α is a discount factor, with 0 < α ≤ 1.
- The optimal cost from state i is J*(i) = min_π J_π(i).
- Stationary policy: An admissible policy (i.e., μ_k(x_k) ∈ U(x_k)) of the form

      π = {μ, μ, . . .},

  with corresponding cost function J_μ(i).
- The stationary policy μ is optimal if

      J_μ(i) = J*(i) = min_π J_π(i),   ∀i.
5.2 Stochastic shortest path problems

- Assume there is no discounting (i.e., α = 1).
- Set of normal states {1, 2, . . . , n}.
- There is a special, cost-free, absorbing, terminal state t. That is, p_tt = 1, and g(t, u) = 0 for all u ∈ U(t).
- Objective: Reach the terminal state with minimum expected cost.

Assumption 5.2.1 There exists an integer m such that for every policy and initial state, there is a positive probability that the termination state t will be reached in at most m stages. Then for all π, we have

    ρ_π = P{x_m ≠ t | x_0 ≠ t, π} < 1.

That is, P{x_m = t | x_0 ≠ t, π} > 0.

In terms of discrete-time Markov chains, Assumption 5.2.1 is claiming that t is accessible from any state i.
Remark: Assumption 5.2.1 is requiring that all policies are proper. A stationary policy μ is proper if, when using it, there is a positive probability that the destination will be reached after at most n stages. Otherwise, it is improper.
However, the results to be presented can be proved under the following weaker conditions:
1. There exists at least one proper policy.
2. For every improper policy μ, the corresponding cost J_μ(i) is ∞ for at least one state i.

Note that the assumption implies that

    P{x_m ≠ t | x_0 = i, π} ≤ P{x_m ≠ t | x_0 ≠ t, π} = ρ_π < 1,   i = 1, . . . , n.

Let

    ρ = max_π ρ_π.

Since the number of controls available at each state is finite, the number of distinct m-stage policies is also finite. So, there must be only a finite number of values of ρ_π, so that the max above is well defined (we do not need a sup). Then,

    P{x_m ≠ t | x_0 ≠ t, π} ≤ ρ < 1.
For any π and any initial state i,

    P{x_2m ≠ t | x_0 = i, π} = P{x_2m ≠ t | x_m ≠ t, x_0 = i, π} · P{x_m ≠ t | x_0 = i, π}
                             ≤ P{x_2m ≠ t | x_m ≠ t, π} · P{x_m ≠ t | x_0 ≠ t, π}
                             ≤ ρ^2,

and similarly,

    P{x_km ≠ t | x_0 = i, π} ≤ ρ^k,   i = 1, . . . , n.

So,

    | E[cost between times km and (k + 1)m − 1] | ≤ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)|,        (5.2.1)

where m bounds the number of stages in the cycle and the max bounds the instantaneous cost at each stage, and hence,

    |J_π(i)| ≤ Σ_{k=0}^∞ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)| = ( m/(1 − ρ) ) max_{i=1,...,n, u∈U(i)} |g(i, u)|.

The key idea for the main result (to be presented below) is that the tail of the cost series vanishes, i.e.,

    lim_{K→∞} Σ_{k=mK}^∞ E[ g(x_k, μ_k(x_k)) ] = 0.

The reason is that lim_{K→∞} P{x_mK ≠ t | x_0 = i, π} = 0.
Proposition 5.2.1 Under Assumption 5.2.1, the following hold for the stochastic shortest path problem:
(a) Given any initial conditions J_0(1), . . . , J_0(n), the sequence J_k(i) generated by the DP iteration

    J_{k+1}(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J_k(j) },   ∀i,        (5.2.2)

converges to the optimal cost J*(i).
(b) The optimal costs J*(1), . . . , J*(n) satisfy Bellman's equation,

    J*(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J*(j) },   i = 1, . . . , n,        (5.2.3)

and in fact they are the unique solution of this equation.
(c) For any stationary policy μ, the costs J_μ(1), . . . , J_μ(n) are the unique solution of the equation

    J_μ(i) = g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) J_μ(j),   i = 1, . . . , n.

Furthermore, given any initial conditions J_0(1), . . . , J_0(n), the sequence J_k(i) generated by the DP iteration

    J_{k+1}(i) = g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) J_k(j),   i = 1, . . . , n,

converges to the cost J_μ(i) for each i.
(d) A stationary policy μ is optimal if and only if for every state i, μ(i) attains the minimum in Bellman's equation (5.2.3).
Proof: Following the labeling of the proposition:
(a) For every possible integer K, initial state x_0, and policy π = {μ_0, μ_1, . . .}, we break down the cost J_π(x_0) as follows:

    J_π(x_0) = lim_{N→∞} E [ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k)) ]
             = E [ Σ_{k=0}^{mK−1} g(x_k, μ_k(x_k)) ] + lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(x_k, μ_k(x_k)) ].        (5.2.4)

Let M be an upper bound on the cost of an m-stage cycle, assuming t is not reached during the cycle, i.e.,

    M = m max_{i=1,...,n, u∈U(i)} |g(i, u)|.

Recall from (5.2.1) that

    | E[cost during the Kth cycle, between stages Km and (K + 1)m − 1] | ≤ M ρ^K,        (5.2.5)

so that

    | lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(x_k, μ_k(x_k)) ] | ≤ M Σ_{k=K}^∞ ρ^k = ρ^K M/(1 − ρ).        (5.2.6)

Also, denoting J_0(t) = 0, let us view J_0 as a terminal cost function. We will provide a bound for its expected value under the current policy π applied over mK stages. Starting from x_0 ≠ t, J_0(x_mK) is the terminal cost picked up at the state x_mK reached after mK steps. So,

    | E[J_0(x_mK)] | = | Σ_{i=1}^n P{x_mK = i | x_0 ≠ t, π} J_0(i) + P{x_mK = t | x_0 ≠ t, π} J_0(t) |
                     = | Σ_{i=1}^n P{x_mK = i | x_0 ≠ t, π} J_0(i) |
                     ≤ ( Σ_{i=1}^n P{x_mK = i | x_0 ≠ t, π} ) max_{i=1,...,n} |J_0(i)|
                     = P{x_mK ≠ t | x_0 ≠ t, π} max_{i=1,...,n} |J_0(i)|
                     ≤ ρ^K max_{i=1,...,n} |J_0(i)|,        (5.2.7)

where we used J_0(t) = 0. Now, combining (5.2.6) and (5.2.7),

    | E[J_0(x_mK)] + lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(x_k, μ_k(x_k)) ] | ≤ ρ^K max_{i=1,...,n} |J_0(i)| + ρ^K M/(1 − ρ),

and since, by (5.2.4),

    E[J_0(x_mK)] + E [ Σ_{k=0}^{mK−1} g(x_k, μ_k(x_k)) ] − J_π(x_0) = E[J_0(x_mK)] + lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(x_k, μ_k(x_k)) ] − J_π(x_0) + E [ Σ_{k=0}^{mK−1} g(x_k, μ_k(x_k)) ] − E [ Σ_{k=0}^{mK−1} g(x_k, μ_k(x_k)) ],

we get the bounds

    −ρ^K max_{i=1,...,n} |J_0(i)| + J_π(x_0) − ρ^K M/(1 − ρ)
        ≤ E[J_0(x_mK)] + E [ Σ_{k=0}^{mK−1} g(x_k, μ_k(x_k)) ]
        ≤ ρ^K max_{i=1,...,n} |J_0(i)| + J_π(x_0) + ρ^K M/(1 − ρ).        (5.2.8)

Note that
- The middle term above is the mK-stage cost of policy π, starting from state x_0, with terminal cost J_0(x_mK).
- The min of this mK-stage cost over π is equal to the value J_mK(x_0), which is generated by the DP recursion (5.2.2) after mK iterations.

Thus, taking the min over π in equation (5.2.8), we obtain for all x_0 and K,

    −ρ^K max_{i=1,...,n} |J_0(i)| + J*(x_0) − ρ^K M/(1 − ρ) ≤ J_mK(x_0) ≤ ρ^K max_{i=1,...,n} |J_0(i)| + J*(x_0) + ρ^K M/(1 − ρ).        (5.2.9)

Taking the limit as K → ∞, the terms in the LHS and RHS involving ρ^K go to 0, leading to

    lim_{K→∞} J_mK(x_0) = J*(x_0),   ∀x_0.

Since from (5.2.5)

    | J_{mK+q}(x_0) − J_mK(x_0) | ≤ ρ^K M,   q = 0, . . . , m − 1,

we see that for q = 0, . . . , m − 1,

    −ρ^K M + J_mK(x_0) ≤ J_{mK+q}(x_0) ≤ J_mK(x_0) + ρ^K M.

Taking the limit as K → ∞, we get lim_{K→∞} ( ±ρ^K M + J_mK(x_0) ) = J*(x_0). Thus, for any q = 0, . . . , m − 1,

    lim_{K→∞} J_{mK+q}(x_0) = J*(x_0),

and hence,

    lim_{k→∞} J_k(x_0) = J*(x_0).
(b) Existence: By taking the limit as k → ∞ in the DP iteration (5.2.2), and using the convergence result of part (a), J*(1), . . . , J*(n) satisfy Bellman's equation.
Uniqueness: If J(1), . . . , J(n) satisfy Bellman's equation, then the DP iteration (5.2.2) starting from J(1), . . . , J(n) just replicates J(1), . . . , J(n). Then, from the convergence result of part (a), J(i) = J*(i), i = 1, . . . , n.
(c) Given the stationary policy μ, redefine the control constraint sets to be Ũ(i) = {μ(i)} instead of U(i). From part (b), we then obtain that J_μ(1), . . . , J_μ(n) solve uniquely Bellman's equation for this redefined problem; i.e.,

    J_μ(i) = g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) J_μ(j),   i = 1, . . . , n,

and from part (a) it follows that the corresponding DP iteration converges to J_μ(i).
(d) We have that μ(i) attains the minimum in equation (5.2.3) if and only if

    J*(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J*(j) } = g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) J*(j),   i = 1, . . . , n.

This equation and part (c) imply that J_μ(i) = J*(i) for all i. Conversely, if J_μ(i) = J*(i) for all i, parts (b) and (c) imply the above equation.
This completes the proof of the four parts of the proposition.
Observation: Part (c) provides a way to compute J_μ(i), i = 1, . . . , n, for a given stationary policy μ, but the computation is substantial for large n (of order O(n^3)).
Example 5.2.1 (Minimizing Expected Time to Termination)
Let g(i, u) = 1 for all i = 1, . . . , n and u ∈ U(i).
Under our assumptions, the costs J*(i) uniquely solve Bellman's equation, which has the form

    J*(i) = min_{u ∈ U(i)} { 1 + Σ_{j=1}^n p_ij(u) J*(j) },   i = 1, . . . , n.

In the special case where there is only one control at each state, J*(i) is the mean first passage time from i to t. These times, denoted m_i, are the unique solution of the equations

    m_i = 1 + Σ_{j=1}^n p_ij m_j,   i = 1, . . . , n.

Recall that in a discrete-time Markov chain, if there is only one recurrent class and t is a state of that class (in our case, the only recurrent class is given by {t}), the mean first passage times from i to t are the unique solution to the previous system of linear equations.
Example 5.2.2 (Spider and a fly)
A spider and a fly move along a straight line.
- At the beginning of each period, the spider knows the position of the fly.
- The fly moves one unit to the left w.p. p, one unit to the right w.p. p, and stays where it is w.p. 1 − 2p.
- The spider moves one unit towards the fly if its distance from the fly is more than one unit.
- If the spider is one unit away from the fly, it will either move one unit towards the fly or stay where it is.
- If the spider and the fly land in the same position, the spider captures the fly.
- The spider's objective is to capture the fly in minimum expected time.
- The initial distance between the spider and the fly is n.
This is a stochastic shortest path problem with state i = distance between spider and fly, with i = 1, . . . , n, and t = 0 the termination state.
There is a control choice only at state 1. Otherwise, the spider simply moves towards the fly.
Assume that the controls (in state 1) are M = move, and M̄ = don't move.
The transition probabilities from state 1 when using control M are described in Figure 5.2.1.
[Figure 5.2.1: Transition probabilities for control M from state 1: p_11(M) = 2p (the fly moved, in either of the two possible directions) and p_10(M) = 1 − 2p (the fly did not move).]
Other probabilities are:

    p_12(M̄) = p,   p_11(M̄) = 1 − 2p,   p_10(M̄) = p,

and for i ≥ 2,

    p_ii = p,   p_i(i−1) = 1 − 2p,   p_i(i−2) = p.

All other transition probabilities are zero.
Bellman's equation:

    J*(i) = 1 + p J*(i) + (1 − 2p) J*(i − 1) + p J*(i − 2),   i ≥ 2,
    J*(1) = 1 + min{ 2p J*(1)  [control M],   p J*(2) + (1 − 2p) J*(1)  [control M̄] },

with J*(0) = 0.
In order to solve Bellman's equation, we proceed as follows. First, note that

    J*(2) = 1 + p J*(2) + (1 − 2p) J*(1).

Then, substitute J*(2) in the equation for J*(1), getting:

    J*(1) = 1 + min{ 2p J*(1),   p/(1 − p) + (1 − 2p) J*(1)/(1 − p) }.

Next, we work from here to find that when one unit away from the fly, it is optimal to use M̄ if and only if p ≥ 1/3. Moreover, it can be verified that

    J*(1) = 1/(1 − 2p)   if p ≤ 1/3,
            1/p          if p ≥ 1/3.

Given J*(1), we can compute J*(2), and then J*(i), for all i ≥ 3.
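A quick numerical check of this closed form can be carried out by value iteration. The Python sketch below is ours (function name and parameter values are illustrative assumptions, not part of the notes); the state is the distance i = 0, 1, . . . , n, with 0 the termination state.

    def spider_fly_vi(p, n=10, iters=5000):
        J = [0.0] * (n + 1)                                  # J[0] = 0: termination state
        for _ in range(iters):
            new = [0.0] * (n + 1)
            new[1] = 1 + min(2 * p * J[1],                   # move (M)
                             p * J[2] + (1 - 2 * p) * J[1])  # don't move (M-bar)
            for i in range(2, n + 1):
                new[i] = 1 + p * J[i] + (1 - 2 * p) * J[i - 1] + p * J[i - 2]
            J = new
        return J

    for p in (0.2, 0.4):
        closed_form = 1 / (1 - 2 * p) if p <= 1 / 3 else 1 / p
        print(p, round(spider_fly_vi(p)[1], 4), round(closed_form, 4))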


5.2.1 Computational approaches

There are three main computational approaches used in practice for calculating the optimal cost function J*: value iteration, policy iteration, and linear programming.

Value iteration

The DP iteration

    J_{k+1}(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J_k(j) },   i = 1, . . . , n,

is called value iteration.
From equation (5.2.9), we know that the error satisfies |J_mK(i) − J*(i)| ≤ D ρ^K, for some constant D.
The value iteration algorithm can sometimes be strengthened with the use of error bounds (i.e., they provide a useful guideline for stopping the value iteration algorithm while being assured that J_k approximates J* with sufficient accuracy). In particular, it can be shown that for all k and j, we have

    J_{k+1}(j) + (N*(j) − 1) c_k  ≤  J*(j)  ≤  J_{μ^k}(j)  ≤  J_{k+1}(j) + (N_{μ^k}(j) − 1) c̄_k,

where
- μ^k is such that μ^k(i) attains the minimum in the kth iteration for all i,
- N*(j) = average number of stages to reach t starting from j and using some optimal stationary policy,
- N_{μ^k}(j) = average number of stages to reach t starting from j and using the stationary policy μ^k,
- c_k = min_{i=1,...,n} { J_{k+1}(i) − J_k(i) },
- c̄_k = max_{i=1,...,n} { J_{k+1}(i) − J_k(i) }.

Unfortunately, the values N*(j) and N_{μ^k}(j) are easily computed or approximated only in some cases.
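As an illustration, a generic value-iteration routine for an SSP can be written in a few lines of Python. This is a sketch under our own conventions (not the notes' code): states 1, . . . , n are indexed 0, . . . , n−1, the terminal state t is implicit, and each row of P[u] may sum to less than one, the missing mass being the probability of absorption in t at zero cost.

    import numpy as np

    def ssp_value_iteration(P, g, iters=1000, tol=1e-10):
        """P[u]: n x n substochastic matrix, g[u]: length-n stage-cost vector, one pair per control u."""
        n = next(iter(g.values())).shape[0]
        J = np.zeros(n)
        for _ in range(iters):
            J_new = np.min([g[u] + P[u] @ J for u in g], axis=0)   # Bellman backup, cf. (5.2.2)
            if np.max(np.abs(J_new - J)) < tol:
                J = J_new
                break
            J = J_new
        # A greedy policy with respect to the final iterate.
        policy = {i: min(g, key=lambda u: g[u][i] + P[u][i] @ J) for i in range(n)}
        return J, policy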
Policy iteration

It generates a sequence μ^1, μ^2, . . . of stationary policies, starting with any stationary policy μ^0.
At a typical iteration, given a policy μ^k, we perform two steps:
(i) Policy evaluation step: Computes J_{μ^k}(i) as the solution of the linear system of equations

    J(i) = g(i, μ^k(i)) + Σ_{j=1}^n p_ij(μ^k(i)) J(j),   i = 1, . . . , n,        (5.2.10)

in the unknowns J(1), . . . , J(n).
(ii) Policy improvement step: Computes a new policy μ^{k+1} as

    μ^{k+1}(i) = arg min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J_{μ^k}(j) },   i = 1, . . . , n.        (5.2.11)

The algorithm stops when J_{μ^k}(i) = J_{μ^{k+1}}(i) for all i.
Proposition 5.2.2 Under Assumption 5.2.1, the policy iteration algorithm for the stochastic shortest path problem generates an improving sequence of policies (i.e., J_{μ^{k+1}}(i) ≤ J_{μ^k}(i), ∀i, k) and terminates with an optimal policy.
Proof: For any k, consider the sequence generated by the recursion

    J_{N+1}(i) = g(i, μ^{k+1}(i)) + Σ_{j=1}^n p_ij(μ^{k+1}(i)) J_N(j),   i = 1, . . . , n,        (5.2.12)

where N = 0, 1, . . ., starting from the solution to equation (5.2.10):

    J_0(i) = J_{μ^k}(i),   i = 1, . . . , n.

From equation (5.2.10), we have

    J_0(i) = g(i, μ^k(i)) + Σ_{j=1}^n p_ij(μ^k(i)) J_0(j)
           ≥ g(i, μ^{k+1}(i)) + Σ_{j=1}^n p_ij(μ^{k+1}(i)) J_0(j)   (from (5.2.11))
           = J_1(i),   ∀i   (from iteration (5.2.12)).

By using the above inequality we obtain

    J_1(i) = g(i, μ^{k+1}(i)) + Σ_{j=1}^n p_ij(μ^{k+1}(i)) J_0(j)
           ≥ g(i, μ^{k+1}(i)) + Σ_{j=1}^n p_ij(μ^{k+1}(i)) J_1(j)   (because J_0(i) ≥ J_1(i))
           = J_2(i),   ∀i   (from iteration (5.2.12)).        (5.2.13)

Continuing similarly we get

    J_0(i) ≥ J_1(i) ≥ · · · ≥ J_N(i) ≥ J_{N+1}(i) ≥ · · ·,   i = 1, . . . , n.        (5.2.14)

Since by Proposition 5.2.1(c), J_N(i) → J_{μ^{k+1}}(i), we obtain

    J_{μ^k}(i) = J_0(i) ≥ J_{μ^{k+1}}(i),   i = 1, . . . , n,   k = 0, 1, . . .        (5.2.15)

Thus, the sequence of generated policies is improving, and since the number of stationary policies is finite, we must after a finite number of iterations, say k + 1, obtain J_{μ^k}(i) = J_{μ^{k+1}}(i), for all i. Then, we will have equality holding throughout (5.2.14), which in particular means, from (5.2.12),

    J_0(i) = J_{μ^k}(i) = J_1(i) = g(i, μ^{k+1}(i)) + Σ_{j=1}^n p_ij(μ^{k+1}(i)) J_{μ^k}(j),

and in particular,

    J_{μ^k}(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J_{μ^k}(j) },   i = 1, . . . , n.

Thus, the costs J_{μ^k}(1), . . . , J_{μ^k}(n) solve Bellman's equation, and by Proposition 5.2.1(b) it follows that J_{μ^k}(i) = J*(i) and that μ^k is optimal.
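A compact policy-iteration sketch, using the same data structures as the value-iteration sketch above, is given below. It is illustrative only: the function name is ours, and the policy evaluation step solves the linear system (5.2.10) directly, which is well defined here because under Assumption 5.2.1 every stationary policy is proper.

    import numpy as np

    def ssp_policy_iteration(P, g):
        n = next(iter(g.values())).shape[0]
        controls = list(g)
        mu = {i: controls[0] for i in range(n)}          # arbitrary initial (proper) policy mu^0
        while True:
            # Policy evaluation: solve (I - P_mu) J = g_mu, cf. equation (5.2.10).
            P_mu = np.array([P[mu[i]][i] for i in range(n)])
            g_mu = np.array([g[mu[i]][i] for i in range(n)])
            J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
            # Policy improvement, cf. equation (5.2.11).
            new_mu = {i: min(controls, key=lambda u: g[u][i] + P[u][i] @ J) for i in range(n)}
            if new_mu == mu:
                return J, mu
            mu = new_mu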
Linear programming

Claim: J* is the "largest" J that satisfies the constraints

    J(i) ≤ g(i, u) + Σ_{j=1}^n p_ij(u) J(j),        (5.2.16)

for all i = 1, . . . , n, and u ∈ U(i).
Proof: Assume that J_0 satisfies the constraints (5.2.16), and let J_1 be generated from J_0 through value iteration; then

    J_0(i) ≤ min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J_0(j) } = J_1(i),   i = 1, . . . , n.

Then, because of the stationarity of the problem and the monotonicity property of DP, we will have J_k(i) ≤ J_{k+1}(i), for all k and i. From Proposition 5.2.1(a), the value iteration sequence converges to J*(i), so that J_0(i) ≤ J*(i), for all i.
Hence, J* = (J*(1), . . . , J*(n)) is the solution of the linear program

    max Σ_{i=1}^n J(i)   subject to the constraints (5.2.16).

Figure 5.2.2 illustrates the linear program associated with a two-state stochastic shortest path problem. The decision variables in this case are J(1) and J(2); each constraint J(i) ≤ g(i, u) + p_i1(u)J(1) + p_i2(u)J(2) defines a half-plane, and J* = (J*(1), J*(2)) is the vertex of the feasible region that maximizes J(1) + J(2).
[Figure 5.2.2: Illustration of the LP solution method for infinite horizon DP.]
Drawback: For large n, the dimension of this program is very large. Furthermore, the number of
constraints is equal to the number of state-control pairs.
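For small instances, the LP can be set up directly; the sketch below assumes SciPy is available and reuses our earlier P/g data structures (again an illustration, not the notes' code). Maximizing Σ_i J(i) subject to (5.2.16) is passed to linprog as minimizing −Σ_i J(i) with "≤" rows.

    import numpy as np
    from scipy.optimize import linprog

    def ssp_lp(P, g):
        n = next(iter(g.values())).shape[0]
        A_ub, b_ub = [], []
        for u in g:
            for i in range(n):
                row = -P[u][i].copy()
                row[i] += 1.0                      # J(i) - sum_j p_ij(u) J(j) <= g(i,u)
                A_ub.append(row)
                b_ub.append(g[u][i])
        res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=[(None, None)] * n)   # the J(i) are free in sign
        return res.x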
5.3 Discounted problems

- Go back to the total cost problem, but now assume a discount factor α < 1 (i.e., future costs matter less than current costs).
- The problem can be converted to a stochastic shortest path (SSP) problem, for which the analysis of the preceding section holds.
- The transformation mechanism relies on adjusting the probabilities using the discount factor α. The instantaneous costs g(i, u) are preserved. Figure 5.3.1 illustrates this transformation.
- Justification: Take a policy π, and apply it over both formulations. Note that:
  - Given that the terminal state has not been reached in the SSP, the state evolution in the two problems is governed by the same transition probabilities.
  - The expected cost of the kth stage of the associated SSP is g(x_k, μ_k(x_k)), multiplied by the probability that state t has not been reached, which is α^k. This is also the expected cost of the kth stage of the discounted problem.
- Note that value iteration produces identical iterates for the two problems:

      Discounted:         J_{k+1}(i) = min_{u ∈ U(i)} { g(i, u) + α Σ_{j=1}^n p_ij(u) J_k(j) },   i = 1, . . . , n.
      Corresponding SSP:  J_{k+1}(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n (α p_ij(u)) J_k(j) },   i = 1, . . . , n.

[Figure 5.3.1: Illustration of the transformation from α-discounted to stochastic shortest path: the transition probabilities p_ij(u) are scaled to α p_ij(u), and every state moves to an artificial termination state t with probability 1 − α.]
The results of the SSP analysis, summarized in Proposition 5.2.1, extend to this case. In particular:
(i) Value iteration converges to J* for all initial J_0:

    J_{k+1}(i) = min_{u ∈ U(i)} { g(i, u) + α Σ_{j=1}^n p_ij(u) J_k(j) },   i = 1, . . . , n.

(ii) J* is the unique solution of Bellman's equation:

    J*(i) = min_{u ∈ U(i)} { g(i, u) + α Σ_{j=1}^n p_ij(u) J*(j) },   i = 1, . . . , n.

(iii) Policy iteration converges finitely to an optimal policy.
(iv) Linear programming also works.
For completeness, we compile these results in the following proposition.
Proposition 5.3.1 The following hold for the discounted problem:
(a) Given any initial conditions J_0(1), . . . , J_0(n), the value iteration algorithm

    J_{k+1}(i) = min_{u ∈ U(i)} { g(i, u) + α Σ_{j=1}^n p_ij(u) J_k(j) },   ∀i,        (5.3.1)

converges to the optimal cost J*(i).
(b) The optimal costs J*(1), . . . , J*(n) satisfy Bellman's equation,

    J*(i) = min_{u ∈ U(i)} { g(i, u) + α Σ_{j=1}^n p_ij(u) J*(j) },   i = 1, . . . , n,

and in fact they are the unique solution of this equation.
(c) For any stationary policy μ, the costs J_μ(1), . . . , J_μ(n) are the unique solution of the equation

    J_μ(i) = g(i, μ(i)) + α Σ_{j=1}^n p_ij(μ(i)) J_μ(j),   i = 1, . . . , n.

Furthermore, given any initial conditions J_0(1), . . . , J_0(n), the sequence J_k(i) generated by the DP iteration

    J_{k+1}(i) = g(i, μ(i)) + α Σ_{j=1}^n p_ij(μ(i)) J_k(j),   i = 1, . . . , n,

converges to the cost J_μ(i) for each i.
(d) A stationary policy μ is optimal if and only if for every state i, μ(i) attains the minimum in Bellman's equation of part (b).
(e) The policy iteration algorithm given by

    μ^{k+1}(i) = arg min_{u ∈ U(i)} { g(i, u) + α Σ_{j=1}^n p_ij(u) J_{μ^k}(j) },   i = 1, . . . , n,

generates an improving sequence of policies and terminates with an optimal policy.
As in the case of stochastic shortest path problems (see equation (5.2.9)), we can show that |J_k(i) − J*(i)| ≤ D α^k, for some constant D.
The error bounds become

    J_{k+1}(j) + (α/(1 − α)) c_k  ≤  J*(j)  ≤  J_{μ^k}(j)  ≤  J_{k+1}(j) + (α/(1 − α)) c̄_k,

where μ^k(j) attains the minimum in the kth value iteration (5.3.1) for all j, and

    c_k = min_{i=1,...,n} [ J_{k+1}(i) − J_k(i) ],   c̄_k = max_{i=1,...,n} [ J_{k+1}(i) − J_k(i) ].
Example 5.3.1 (Asset selling problem)
- Assume the system evolves according to x_{k+1} = w_k.
- If the offer x_k of period k is accepted, it is invested at an interest rate r.
- By discounting the sale amount to period-0 dollars, we view (1 + r)^{−k} x_k as the reward for selling the asset in period k at a price x_k, where r > 0 is the interest rate.
- Idea: We discount the reward by the interest we did not make for the first k periods.
- The discount factor is therefore α = 1/(1 + r).
- J* is the unique solution of Bellman's equation

      J*(x) = max{ x,  E[J*(w)]/(1 + r) }.

- An optimal policy is to sell if and only if the current offer x_k is greater than or equal to ᾱ, where

      ᾱ = E[J*(w)]/(1 + r).
Example 5.3.2 (Manufacturer's production plan)
A manufacturer at each time period receives an order for her product with probability p and receives no order with probability 1 − p.
At any period she has a choice of processing all unfilled orders in a batch, or processing no order at all.
The cost per unfilled order at each time period is c > 0, and the setup cost to process the unfilled orders is K > 0. The manufacturer wants to find a processing policy that minimizes the total expected cost, assuming the discount factor is α < 1 and the maximum number of orders that can remain unfilled is n. When the maximum n of unfilled orders is reached, the orders must necessarily be processed.
Define the state as the number of unfilled orders at the beginning of each period. Bellman's equation for this problem is

    J*(i) = min{ K + α(1 − p)J*(0) + αpJ*(1)   [process the remaining orders],
                 ci + α(1 − p)J*(i) + αpJ*(i + 1)   [do nothing] }

for the states i = 0, 1, . . . , n − 1, and takes the form

    J*(n) = K + α(1 − p)J*(0) + αpJ*(1)

for state n.
Consider the value iteration method applied to this problem. We prove below, using the (finite horizon) DP algorithm, that the k-stage optimal cost functions J_k(i) are monotonically nondecreasing in i for all k, and therefore the optimal infinite horizon cost function J*(i) is also monotonically nondecreasing in i, since

    J*(i) = lim_{k→∞} J_k(i).
Given that J*(i) is monotonically nondecreasing in i, we have that if processing a batch of m orders is optimal, that is,

    K + α(1 − p)J*(0) + αpJ*(1) ≤ cm + α(1 − p)J*(m) + αpJ*(m + 1),

then processing a batch of m + 1 orders is also optimal. Therefore, a threshold policy (i.e., a policy that processes the orders if their number exceeds some threshold integer m*) is optimal.

Claim: The k-stage optimal cost functions J_k(i) are monotonically nondecreasing in i for all k.
Proof: We proceed by induction. Start from J_0(i) = 0, for all i, and suppose that J_k(i + 1) ≥ J_k(i) for all i. We will see that J_{k+1}(i + 1) ≥ J_{k+1}(i) for all i. Consider first the case i + 1 < n. Then, by the induction hypothesis, we have

    c(i + 1) + α(1 − p)J_k(i + 1) + αpJ_k(i + 2) ≥ ci + α(1 − p)J_k(i) + αpJ_k(i + 1).        (5.3.2)

Define for any scalar γ,

    F_k(γ) = min{ K + α(1 − p)J_k(0) + αpJ_k(1),  γ }.

Since F_k(γ) is monotonically increasing in γ, we have from equation (5.3.2),

    J_{k+1}(i + 1) = F_k( c(i + 1) + α(1 − p)J_k(i + 1) + αpJ_k(i + 2) )
                   ≥ F_k( ci + α(1 − p)J_k(i) + αpJ_k(i + 1) )
                   = J_{k+1}(i).

Finally, consider the case i + 1 = n. Then, we have

    J_{k+1}(n) = K + α(1 − p)J_k(0) + αpJ_k(1)
               ≥ F_k( ci + α(1 − p)J_k(i) + αpJ_k(i + 1) )
               = J_{k+1}(n − 1).

The induction is complete.
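The threshold structure is easy to observe numerically. The sketch below runs value iteration for this example and extracts the smallest order count at which processing is optimal; the function name and the parameter values are illustrative choices of ours, not data from the notes.

    def production_vi(n=10, K=5.0, c=1.0, p=0.5, alpha=0.9, iters=2000):
        J = [0.0] * (n + 1)
        for _ in range(iters):
            process = K + alpha * ((1 - p) * J[0] + p * J[1])
            new = [min(process, c * i + alpha * ((1 - p) * J[i] + p * J[i + 1]))
                   for i in range(n)]
            new.append(process)                  # at i = n processing is forced
            J = new
        process = K + alpha * ((1 - p) * J[0] + p * J[1])
        # Smallest i at which processing is (weakly) optimal: the threshold m*.
        threshold = next((i for i in range(n + 1)
                          if process <= c * i + alpha * ((1 - p) * J[i] + p * J[min(i + 1, n)])), n)
        return J, threshold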
5.4 Average cost-per-stage problems

5.4.1 General setting

- Stationary system with a finite number of states and controls.
- Minimize over admissible policies π = {μ_0, μ_1, . . .},

      J_π(i) = lim_{N→∞} (1/N) E [ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k)) | x_0 = i ].

- Assume 0 ≤ g(x_k, μ_k(x_k)) < ∞.

Fact: For most problems of interest, the average cost per stage of a policy and the optimal average cost per stage are independent of the initial state.
Intuition: Costs incurred in the early stages do not matter in the long run. More formally, suppose that all states communicate under a given stationary policy μ (we are assuming that there is a single recurrent class; recall that a state is recurrent if the probability of reentering it is one, that positive recurrent means that the expected time of returning to it is finite, and that in a finite state Markov chain all recurrent states are positive recurrent). Let

    K_ij(μ) = first passage time from i to j under μ,

i.e., K_ij(μ) is the first index k such that x_k = j starting from x_0 = i. Then,

    J_μ(i) = lim_{N→∞} (1/N) E [ Σ_{k=0}^{K_ij(μ)−1} g(x_k, μ(x_k)) | x_0 = i ]   (= 0)
           + lim_{N→∞} (1/N) E [ Σ_{k=K_ij(μ)}^{N−1} g(x_k, μ(x_k)) | x_0 = i ].

Therefore, J_μ(i) = J_μ(j), for all i, j with E[K_ij(μ)] < ∞ (or equivalently, with P(K_ij(μ) = ∞) = 0).
Because communication issues are so important, the methodology relies heavily on Markov chain theory.
5.4.2 Associated stochastic shortest path (SSP) problem

Assumption 5.4.1 State n is such that for some integer m > 0, and for all initial states and all policies, n is visited with positive probability at least once within the first m stages.

In other words, state n is recurrent in the Markov chain corresponding to each stationary policy.
Consider a sequence of generated states, and divide it into cycles that go through n, as shown in Figure 5.4.1.
[Figure 5.4.1: A state trajectory x_0 = i, . . . , x_k = n, . . . divided into cycles; each cycle can be viewed as a state trajectory of a corresponding SSP problem with termination state n.]
The SSP is obtained via the transformation described in Figure 5.4.2.
Let the cost at i of the SSP be g(i, u) − λ*. We will show that

    Average cost problem  ↔  Min cost cycle problem  ↔  SSP problem.

[Figure 5.4.2: LHS: Original average cost per stage problem. RHS: Associated SSP problem. The original transition probabilities are adjusted as follows: the probabilities from the states i ≠ t to state t are set equal to p_in(u), the probabilities of transition from all states to state n are set to zero, and all other probabilities are left unchanged.]
5.4.3 Heuristic argument

- Under all stationary policies in the original average cost problem, there will be an infinite number of cycles marked by successive visits to state n. We therefore want to find a stationary policy that minimizes the average cycle stage cost.
- Consider a minimum cycle cost problem: Find a stationary policy μ that minimizes

      expected cost per transition within a cycle = E[cost from n up to the first return to n] / E[time from n up to the first return to n] = C_nn(μ) / N_nn(μ).

- Intuitively, the optimal average cost λ* should be equal to the optimal average cycle cost, i.e.,

      λ* = C_nn(μ*) / N_nn(μ*),   or equivalently,   C_nn(μ*) − N_nn(μ*) λ* = 0.

  So, for any stationary policy μ,

      λ* ≤ C_nn(μ) / N_nn(μ),   or equivalently,   C_nn(μ) − N_nn(μ) λ* ≥ 0.

- Thus, to obtain an optimal policy μ, we must solve

      min_μ { C_nn(μ) − N_nn(μ) λ* }.

- Note that C_nn(μ) − N_nn(μ) λ* is the expected cost of starting from n in the associated SSP with stage cost g(i, u) − λ*, justified by

      E [ Σ_{k=0}^{K_nt(μ)−1} ( g(x_k, μ(x_k)) − λ* ) | x_0 = n ] = C_nn(μ) − N_nn(μ) λ*,

  since N_nn(μ) = E[K_nt(μ)].
- Let h*(i) be the optimal cost of the SSP (i.e., of the path from x_0 = i to t) when starting at states i = 1, . . . , n. Then by Proposition 5.2.1(b), h*(1), . . . , h*(n) solve uniquely the Bellman's equation

      h*(i) = min_{u ∈ U(i)} { g(i, u) − λ* + Σ_{j=1}^n p_ij(u) h*(j) }
            = min_{u ∈ U(i)} { g(i, u) − λ* + Σ_{j=1}^{n−1} p_ij(u) h*(j) + p_in(u) h*(n) },        (5.4.1)

  where the last term vanishes because in the associated SSP the transition probability into state n is zero by construction (those transitions are redirected to t).
- If μ* is a stationary policy that minimizes the cycle cost, then μ* must satisfy

      h*(n) = C_nn(μ*) − N_nn(μ*) λ* = 0.

  See Figure 5.4.3.
[Figure 5.4.3: h*(n) in the SSP is the expected cost of the path n → i_1 → i_2 → · · · → t (i.e., of the cycle from n to n in the original problem) based on the original costs g(i, u), minus N_nn(μ*) λ*.]
- We can then rewrite (5.4.1) as

      h*(n) = 0,
      λ* + h*(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) h*(j) },   i = 1, . . . , n.

- From the results on SSP, we know that this equation has a unique solution (as long as we impose the constraint h*(n) = 0). Moreover, minimization of the RHS should give an optimal stationary policy.
- Interpretation: h*(i) is a relative or differential cost:

      h*(i) = min_μ { E[cost to go from i to n for the first time] − E[cost if the stage cost were constant at λ* instead of at g(j, u), ∀j] }.

  In words, h*(i) is a measure of how far away from the average cost we are when starting from node i.
5.4.4 Bellman's equation

The following proposition provides the main results regarding Bellman's equation:

Proposition 5.4.1 Under Assumption 5.4.1, the following hold for the average cost per stage problem:
(a) The optimal average cost λ* is the same for all initial states and, together with some vector h* = {h*(1), . . . , h*(n)}, satisfies Bellman's equation

    λ* + h*(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) h*(j) },   i = 1, . . . , n.        (5.4.2)

Furthermore, if μ(i) attains the minimum in the above equation for all i, the stationary policy μ is optimal. In addition, out of all vectors h* satisfying this equation, there is a unique one for which h*(n) = 0.
(b) If a scalar λ and a vector h = {h(1), . . . , h(n)} satisfy Bellman's equation, then λ is the optimal average cost per stage for each initial state.
(c) Given a stationary policy μ with corresponding average cost per stage λ_μ, there is a unique vector h_μ = {h_μ(1), . . . , h_μ(n)} such that h_μ(n) = 0 and

    λ_μ + h_μ(i) = g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) h_μ(j),   i = 1, . . . , n.
Proof: We proceed item by item:
(a) Let λ* = min_μ C_nn(μ)/N_nn(μ). Then, for all μ,

    C_nn(μ) − N_nn(μ) λ* ≥ 0,

with

    C_nn(μ*) − N_nn(μ*) λ* = 0   (this is h*(n) in the associated SSP)   ⟹   h*(n) = 0.

Consider the associated SSP with stage cost g(i, u) − λ*. Then, by Proposition 5.2.1(b), and using the fact that p_in(u) = 0 in the associated SSP, the costs h*(1), . . . , h*(n) solve uniquely the corresponding Bellman's equation:

    h*(i) = min_{u ∈ U(i)} { g(i, u) − λ* + Σ_{j=1}^{n−1} p_ij(u) h*(j) },   i = 1, . . . , n.        (5.4.3)

Thus, we can rewrite (5.4.3) as

    λ* + h*(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) h*(j) },   i = 1, . . . , n.        (5.4.4)

We will show that this implies that λ* is the optimal average cost per stage for every initial state.
Let π = {μ_0, μ_1, . . .} be any admissible policy, let N be a positive integer, and for all k = 0, . . . , N − 1, define J_k(i) using the recursion

    J_0(i) = h*(i),   i = 1, . . . , n,
    J_{k+1}(i) = g(i, μ_{N−k−1}(i)) + Σ_{j=1}^n p_ij(μ_{N−k−1}(i)) J_k(j),   i = 1, . . . , n.        (5.4.5)

In words, J_N(i) is the N-stage cost of π when the starting state is i and the terminal cost is h*.
From (5.4.4), since μ_{N−1}(·) is just one admissible policy, we have

    λ* + h*(i)   [= λ* + J_0(i)]   ≤   g(i, μ_{N−1}(i)) + Σ_{j=1}^n p_ij(μ_{N−1}(i)) h*(j)   [= J_1(i), from (5.4.5) with k = 0],   i = 1, . . . , n.

Thus,

    λ* + J_0(i) ≤ J_1(i),   i = 1, . . . , n.

Then,

    J_2(i) = g(i, μ_{N−2}(i)) + Σ_{j=1}^n p_ij(μ_{N−2}(i)) J_1(j)
           ≥ g(i, μ_{N−2}(i)) + Σ_{j=1}^n p_ij(μ_{N−2}(i)) (λ* + J_0(j))   (since J_1(j) ≥ λ* + J_0(j))
           ≥ λ* + λ* + h*(i)   (by equation (5.4.4))
           = 2λ* + h*(i),   i = 1, . . . , n.

By repeating this argument,

    kλ* + h*(i) ≤ J_k(i),   k = 0, . . . , N,   i = 1, . . . , n.

In particular, for k = N,

    Nλ* + h*(i) ≤ J_N(i)   ⟹   λ* + h*(i)/N ≤ J_N(i)/N,   i = 1, . . . , n.        (5.4.6)

Equality holds in (5.4.6) if μ_k(i) attains the minimum in (5.4.4) for all i and k. Now, as N → ∞,

    λ* + h*(i)/N → λ*,   and   J_N(i)/N → J_π(i),

where J_π(i) is the average cost per stage of π, starting at i. Then, we get

    λ* ≤ J_π(i),   i = 1, . . . , n,

for all admissible π. If π = {μ, μ, . . .} where μ(i) attains the minimum in (5.4.4) for all i and k, we get

    J_μ(i) = λ* = min_π J_π(i),   i = 1, . . . , n.

Since λ* is therefore the optimal average cost, equation (5.4.4) is precisely Bellman's equation (5.4.2). Finally, h*(n) = 0 jointly with (5.4.4) are equivalent to (5.4.3) for the associated SSP. But the solution to (5.4.3) is unique (due to Proposition 5.2.1(b)), so there must be a unique solution for the equations h*(n) = 0 and (5.4.4).
(b) The proof follows from the proof of part (a), starting from equation (5.4.4).
(c) The proof follows from part (a), constraining the control set to Ũ(i) = {μ(i)}.
Remarks:
- Proposition 5.4.1 can be shown under weaker conditions. In particular, it can be shown assuming that all stationary policies have a single recurrent class, even if their corresponding recurrent classes do not have state n in common.
- It can also be shown assuming that for every pair of states i, j, there is a stationary policy under which there is a positive probability of reaching j starting from i.

Example: A manufacturer, at each time:
1. May process all unfilled orders at cost K > 0, or process no order at all. The cost per unfilled order at each time is c > 0.
2. Receives an order w.p. p, and no order w.p. 1 − p.
The maximum number of orders that can remain unfilled is n. When there are n pending orders, he has to process them.
Objective: Find a processing policy that minimizes the expected cost per stage.
State: Number of unfilled orders. We set state 0 as the special state for the SSP formulation.
Bellman's equation: For states i = 0, 1, . . . , n − 1,

    λ* + h*(i) = min{ K + (1 − p)h*(0) + p h*(1)   [process the unfilled orders],
                      ci + (1 − p)h*(i) + p h*(i + 1)   [do nothing] },

and for state n,

    λ* + h*(n) = K + (1 − p)h*(0) + p h*(1).

Optimal policy: Process i unfilled orders if

    K + (1 − p)h*(0) + p h*(1) ≤ ci + (1 − p)h*(i) + p h*(i + 1).

If we view h*(i) as the differential cost associated with an optimal policy (or by interpreting h*(i) as the optimal cost-to-go for the associated SSP), then h*(i) should be monotonically nondecreasing with i. This monotonicity implies that a threshold policy is optimal: Process the orders if their number exceeds some threshold integer m.
5.4.5 Computational approaches

Value iteration

Procedure: Generate optimal k-stage costs by the DP algorithm starting from any J_0:

    J_{k+1}(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J_k(j) },   ∀i.        (5.4.7)

Claim: lim_{k→∞} J_k(i)/k = λ*, ∀i.
Proof: Let h* be a solution vector of Bellman's equation:

    λ* + h*(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) h*(j) },   i = 1, . . . , n.        (5.4.8)

From here, define the recursion

    J*_0(i) = h*(i),
    J*_{k+1}(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J*_k(j) },   i = 1, . . . , n.

Like in the proof of Proposition 5.4.1(a), it can be shown that

    J*_k(i) = kλ* + h*(i),   i = 1, . . . , n.

On the other hand, it can be seen that

    |J_k(i) − J*_k(i)| ≤ max_{j=1,...,n} |J_0(j) − h*(j)|,   i = 1, . . . , n,

because J_k(i) and J*_k(i) are the optimal costs for two k-stage problems that differ only in the corresponding terminal cost functions, which are J_0 and h* respectively.
From the preceding two equations, we see that for all k,

    |J_k(i) − (kλ* + h*(i))| ≤ max_{j=1,...,n} |J_0(j) − h*(j)|.

Therefore,

    |J_k(i) − kλ*| ≤ max_{j=1,...,n} |J_0(j) − h*(j)| + max_{j=1,...,n} |h*(j)|,

which implies

    | J_k(i)/k − λ* | ≤ constant / k.

Taking the limit as k → ∞ on both sides above gives

    lim_{k→∞} J_k(i)/k = λ*.

The only condition required is that Bellman's equation (5.4.8) holds for some vector h*.
Remarks:
- Pros: Very simple to implement.
- Cons:
  - Since typically some of the components of J_k diverge to ∞ or −∞, direct calculation of lim_{k→∞} J_k(i)/k is numerically cumbersome.
  - The method does not provide a corresponding differential cost vector h*.

Fixing the difficulties:
- Subtract the same constant from all components of the vector J_k, i.e., J_k(i) := J_k(i) − C, i = 1, . . . , n, so that the iterates remain bounded.
- In particular, consider the algorithm

      h_k(i) = J_k(i) − J_k(s),

  for some fixed state s, and for all i = 1, . . . , n. By using equation (5.4.7), for i = 1, . . . , n,

      h_{k+1}(i) = J_{k+1}(i) − J_{k+1}(s)
                 = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) J_k(j) } − min_{u ∈ U(s)} { g(s, u) + Σ_{j=1}^n p_sj(u) J_k(j) }
                 = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) (h_k(j) + J_k(s)) } − min_{u ∈ U(s)} { g(s, u) + Σ_{j=1}^n p_sj(u) (h_k(j) + J_k(s)) }
                 = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) h_k(j) } − min_{u ∈ U(s)} { g(s, u) + Σ_{j=1}^n p_sj(u) h_k(j) },

  where the terms J_k(s) cancel because the transition probabilities sum to one.

The above algorithm is called relative value iteration.
- It is mathematically equivalent to the value iteration method (5.4.7) that generates J_k(i).
- The iterates generated by the two methods differ by a constant (namely J_k(s), since J_k(i) = h_k(i) + J_k(s), ∀i).
- Big advantage of the new method: Under Assumption 5.4.1 it can be shown that the iterates h_k(i) are bounded, while this is typically not true for the plain vanilla method.
- It can be seen that if the relative value iteration converges to some vector h, then we have h(s) = 0 and

      λ + h(i) = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) h(j) },

  where λ = min_{u ∈ U(s)} { g(s, u) + Σ_{j=1}^n p_sj(u) h(j) }. By Proposition 5.4.1(b), this implies that λ is indeed the optimal average cost per stage, and h is the associated differential cost vector.
- Disadvantage: Under Assumption 5.4.1, convergence is not guaranteed. However, convergence can be guaranteed for a simple variant:

      h_{k+1}(i) = (1 − τ) h_k(i) + min_{u ∈ U(i)} { g(i, u) + τ Σ_{j=1}^n p_ij(u) h_k(j) } − min_{u ∈ U(s)} { g(s, u) + τ Σ_{j=1}^n p_sj(u) h_k(j) },

  for i = 1, . . . , n, and a constant τ satisfying 0 < τ < 1.
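A minimal relative-value-iteration sketch, in the same Python conventions as the earlier sketches (names are ours; P[u] are now proper stochastic matrices and s is the reference state), is as follows. As noted above, convergence is not guaranteed in general, so the iteration cap is part of the hedge.

    import numpy as np

    def relative_value_iteration(P, g, s=0, iters=10000, tol=1e-10):
        n = next(iter(g.values())).shape[0]
        h, lam = np.zeros(n), 0.0
        for _ in range(iters):
            T = np.min([g[u] + P[u] @ h for u in g], axis=0)   # one Bellman backup, cf. (5.4.7)
            lam_new, h_new = T[s], T - T[s]                    # subtract the value at the reference state
            if np.max(np.abs(h_new - h)) < tol and abs(lam_new - lam) < tol:
                return lam_new, h_new                          # lam ~ optimal average cost, h ~ differential costs
            lam, h = lam_new, h_new
        return lam, h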
Policy iteration

- Start from an arbitrary stationary policy μ^0.
- At a typical iteration, we have a stationary policy μ^k. We perform two steps per iteration:
  - Policy evaluation: Compute λ^k and h^k(i) of μ^k, using the n + 1 equations h^k(n) = 0 and, for i = 1, . . . , n,

        λ^k + h^k(i) = g(i, μ^k(i)) + Σ_{j=1}^n p_ij(μ^k(i)) h^k(j).

    If λ^{k+1} = λ^k and h^{k+1}(i) = h^k(i), ∀i, stop. Otherwise, continue with the next step.
  - Policy improvement: Find a stationary policy μ^{k+1} where, for all i, μ^{k+1}(i) is such that

        μ^{k+1}(i) = arg min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) h^k(j) },

    and repeat.
The next proposition shows that each iteration of the algorithm makes some irreversible progress towards optimality.
Proposition 5.4.2 Under Assumption 5.4.1, in the policy iteration algorithm, for each k we either have λ^{k+1} < λ^k, or else we have

    λ^{k+1} = λ^k   and   h^{k+1}(i) = h^k(i),   ∀i.

Furthermore, the algorithm terminates, and the policies μ^k and μ^{k+1} obtained upon termination are optimal.

Proof: Denote μ^k =: μ, μ^{k+1} =: μ̄, λ^k =: λ, λ^{k+1} =: λ̄, h^k(i) =: h(i), h^{k+1}(i) =: h̄(i). Define, for N = 1, 2, . . .,

    h_N(i) = g(i, μ̄(i)) + Σ_{j=1}^n p_ij(μ̄(i)) h_{N−1}(j),   i = 1, . . . , n,

where h_0(i) = h(i). Thus, we have

    λ̄ = J_{μ̄}(i) = lim_{N→∞} (1/N) h_N(i),   i = 1, . . . , n.        (5.4.9)

By definition of μ̄ we have, for all i = 1, . . . , n,

    h_1(i) = g(i, μ̄(i)) + Σ_{j=1}^n p_ij(μ̄(i)) h_0(j)   (from the iteration above)
           ≤ g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) h_0(j)   (because μ̄(i) minimizes this RHS over u)
           = λ + h_0(i)   (because of Proposition 5.4.1(c)).

From the equation above, we also obtain

    h_2(i) = g(i, μ̄(i)) + Σ_{j=1}^n p_ij(μ̄(i)) h_1(j)
           ≤ g(i, μ̄(i)) + Σ_{j=1}^n p_ij(μ̄(i)) (λ + h_0(j))
           = λ + g(i, μ̄(i)) + Σ_{j=1}^n p_ij(μ̄(i)) h_0(j)
           ≤ λ + g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) h_0(j)   [= λ + λ + h_0(i)]
           = 2λ + h_0(i),

and by proceeding similarly, we see that for all i and N,

    h_N(i) ≤ Nλ + h_0(i).

Thus,

    h_N(i)/N ≤ λ + h_0(i)/N.

Taking the limit as N → ∞, the LHS converges to λ̄ (from equation (5.4.9)), and the 2nd term in the RHS goes to zero, implying that λ̄ ≤ λ.
If λ̄ = λ, the iteration that produces μ^{k+1} is a policy improvement step for the associated SSP with cost per stage g(i, μ^k) − λ. Moreover, h(i) and h̄(i) are the optimal costs starting from i and corresponding to μ and μ̄ respectively, in this associated SSP. Thus,

    h̄(i) ≤ h(i),   ∀i.

Since there are only a finite number of stationary policies, there are also a finite number of λ (each one being the average cost per stage of one of the stationary policies). For each λ there is only a finite number of possible vectors h (see Proposition 5.4.1(c), where we can vary the reference h_μ(n) = 0).
In view of the improvement properties already shown, no pair (λ, h) can be repeated without termination of the algorithm, implying that the algorithm must terminate with λ̄ = λ and h̄(i) = h(i), ∀i.

Claim: When the algorithm terminates, the policies μ and μ̄ are optimal.
Proof: Upon termination, we have for all i,

    λ + h(i) = λ̄ + h̄(i)
             = g(i, μ̄(i)) + Σ_{j=1}^n p_ij(μ̄(i)) h̄(j)   (by the policy evaluation step)
             = g(i, μ̄(i)) + Σ_{j=1}^n p_ij(μ̄(i)) h(j)   (because h̄(j) = h(j), ∀j)
             = min_{u ∈ U(i)} { g(i, u) + Σ_{j=1}^n p_ij(u) h(j) }   (by the policy improvement step).

Therefore, (λ, h) satisfy Bellman's equation, and by Proposition 5.4.1(b), λ must be equal to the optimal average cost per stage. Furthermore, μ̄(i) attains the minimum in the RHS of Bellman's equation (see the last two equalities above), and hence by Proposition 5.4.1(a), μ̄ is optimal. Since we also have, for all i (due to the self-consistency of the policy evaluation step),

    λ + h(i) = g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) h(j),

the same is true for μ.
5.5 Semi-Markov Decision Problems

5.5.1 General setting

- Stationary system with a finite number of states and controls.
- State transitions occur at discrete times.
- The control is applied at these discrete times and stays constant between transitions.
- The time between transitions is random and may depend on the current state and the choice of control.
- Cost accumulates in continuous time, or may be incurred at the time of a transition.

Example: Admission control in a system with restricted capacity (e.g., a communication link).
- Customer arrivals: Poisson process.
- Customers entering the system depart after an exponentially distributed time.
- Upon arrival we must decide whether to admit or block a customer.
- There is a cost for blocking a customer.
- For each customer that is in the system, there is a customer-dependent reward per unit of time.
- Objective: Minimize time-discounted or average cost.

Note that at transition times t_k, the future of the system statistically depends only on the current state. This is guaranteed by not allowing the control to change in between transitions. Otherwise, we should include the time elapsed since the last transition as part of the system state.
5.5.2 Problem formulation

- x(t) and u(t): state and control at time t. They stay constant between transitions.
- t_k: time of the kth transition (t_0 = 0).
- x_k = x(t_k): we have x(t) = x_k for t_k ≤ t < t_{k+1}.
- u_k = u(t_k): we have u(t) = u_k for t_k ≤ t < t_{k+1}.
- In place of transition probabilities, we have transition distributions. For any pair (state i, control u), specify the joint distribution of the transition interval and the next state:

      Q_ij(τ, u) = P{ t_{k+1} − t_k ≤ τ, x_{k+1} = j | x_k = i, u_k = u }.
Two important observations:
1. Transition distributions specify the ordinary transition probabilities via

       p_ij(u) = P{x_{k+1} = j | x_k = i, u_k = u} = lim_{τ→∞} Q_ij(τ, u).

   We assume that for all states i and controls u ∈ U(i), the average transition time,

       τ̄_i(u) = Σ_{j=1}^n ∫_0^∞ τ Q_ij(dτ, u),

   is nonzero and finite, 0 < τ̄_i(u) < ∞.
2. The conditional cumulative distribution function (c.d.f.) of τ given i, j, and u is (assuming p_ij(u) > 0)

       P{ t_{k+1} − t_k ≤ τ | x_{k+1} = j, x_k = i, u_k = u } = Q_ij(τ, u) / p_ij(u).        (5.5.1)

   Thus, Q_ij(τ, u) can be seen as a scaled c.d.f., i.e.,

       Q_ij(τ, u) = P{ t_{k+1} − t_k ≤ τ | x_{k+1} = j, x_k = i, u_k = u } · p_ij(u).
Important case: Exponential transition distributions

An important example of transition distributions is

    Q_ij(τ, u) = p_ij(u) (1 − e^{−ν_i(u)τ}),

where the p_ij(u) are transition probabilities and ν_i(u) > 0 is called the transition rate at state i.
Interpretation: If the system is in state i and control u is applied,
- the next state will be j w.p. p_ij(u);
- the time between the transition to state i and the transition to the next state j is Exp(ν_i(u)) (independently of j):

      P{transition time interval > τ | i, u} = e^{−ν_i(u)τ}.

The exponential distribution is memoryless. This implies that, for a given policy, the system is a continuous-time Markov chain (the future depends on the past only through the present). Without the memoryless property, the Markov property holds only at the times of transition.
Cost structures

- There is a cost g(i, u) per unit time, i.e.,

      g(i, u) dt = cost incurred during the small time period dt.

- There may be an extra instantaneous cost ĝ(i, u) at the time of a transition (let's ignore this for the moment).
- Total discounted cost of π = {μ_0, μ_1, . . .} starting from state i (with discount factor β > 0):

      lim_{N→∞} E [ Σ_{k=0}^{N−1} ∫_{t_k}^{t_{k+1}} e^{−βt} g(x_k, μ_k(x_k)) dt | x_0 = i ].

- Average cost per unit time of π = {μ_0, μ_1, . . .} starting from state i:

      lim_{N→∞} (1/E[t_N | x_0 = i, π]) E [ Σ_{k=0}^{N−1} ∫_{t_k}^{t_{k+1}} g(x_k, μ_k(x_k)) dt | x_0 = i ].

We will see that both problems have equivalent discrete time versions.

A note on notation

- The scaled c.d.f. Q_ij(τ, u) can be used to model discrete, continuous, and mixed distributions for the transition time τ.
- Generally, expected values of functions of τ can be written as integrals involving dQ_ij(τ, u). For example, from (5.5.1) (noting that there is no τ in the denominator there), the conditional expected value of τ given i, j, and u is written as

      E[τ | i, j, u] = ∫_0^∞ τ dQ_ij(τ, u) / p_ij(u).

- If Q_ij(τ, u) is discontinuous and staircase-like, expected values can be written as summations.
5.5.3 Discounted cost problems

For a policy π = {μ_0, μ_1, . . .}, write

    J_π(i) = E[cost of 1st transition] + E[ e^{−βτ_1} J_{π_1}(j) | i, μ_0(i) ],        (5.5.2)

where J_{π_1}(j) is the cost-to-go of the policy π_1 = {μ_1, μ_2, . . .}.
We calculate the two costs in the RHS. The expected cost of a single transition if u is applied at state i is

    G(i, u) = E_j [ E_τ [ transition cost | j ] ]
            = Σ_{j=1}^n p_ij(u) ∫_0^∞ ( ∫_0^τ e^{−βt} g(i, u) dt ) dQ_ij(τ, u)/p_ij(u)
            = g(i, u) Σ_{j=1}^n ∫_0^∞ ((1 − e^{−βτ})/β) dQ_ij(τ, u),        (5.5.3)

where the 2nd equality follows from computing E_τ[transition cost | j] by integrating over the distribution of the nonnegative r.v. τ, and the 3rd one because ∫_0^τ e^{−βt} dt = (1 − e^{−βτ})/β.
Thus, E[cost of 1st transition] is

    G(i, μ_0(i)) = g(i, μ_0(i)) Σ_{j=1}^n ∫_0^∞ ((1 − e^{−βτ})/β) dQ_ij(τ, μ_0(i)).
Regarding the 2nd term in (5.5.2),
$$E\big[e^{-\beta\tau} J_{\pi_1}(j) \mid i, \mu_0(i)\big]
= E_j\Big[E\big[e^{-\beta\tau} \mid j, i, \mu_0(i)\big]\, J_{\pi_1}(j)\Big]
= \sum_{j=1}^n p_{ij}(\mu_0(i)) \left( \int_0^\infty e^{-\beta\tau}\, \frac{dQ_{ij}(\tau, \mu_0(i))}{p_{ij}(\mu_0(i))} \right) J_{\pi_1}(j)
= \sum_{j=1}^n m_{ij}(\mu_0(i))\, J_{\pi_1}(j),$$
where m_ij(u) is given by
$$m_{ij}(u) = \int_0^\infty e^{-\beta\tau}\, dQ_{ij}(\tau, u).$$
Note that m_ij(u) satisfies
$$m_{ij}(u) < \int_0^\infty dQ_{ij}(\tau, u) = \lim_{\tau \to \infty} Q_{ij}(\tau, u) = p_{ij}(u).$$
So, m_ij(u) can be viewed as the effective discount factor (the analog of α p_ij(u) in the discrete-
time case).
So, going back to (5.5.2), J_π(i) can be written as
$$J_\pi(i) = G(i, \mu_0(i)) + \sum_{j=1}^n m_{ij}(\mu_0(i))\, J_{\pi_1}(j).$$
Equivalence to an SSP
Similar to the discrete-time case, introduce a stochastic shortest path problem with an artificial
termination state t.
Under control u, from state i the system moves to state j w.p. m_ij(u), and to the terminal
state t w.p. 1 − Σ_{j=1}^n m_ij(u).
Bellman's equation: For i = 1, ..., n,
$$J^*(i) = \min_{u \in U(i)} \left\{ G(i, u) + \sum_{j=1}^n m_{ij}(u)\, J^*(j) \right\}$$
Analogs of value iteration, policy iteration, and linear programming.
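For the value-iteration analog, once the one-transition costs G(i, u) and the effective discount factors m_ij(u) have been computed, the problem can be iterated exactly like a discrete-time discounted problem. The following Python sketch illustrates this under assumed arrays G and m; the state/control spaces, tolerance, and names are illustrative placeholders, not part of the notes.

```python
import numpy as np

def smdp_value_iteration(G, m, tol=1e-9, max_iter=10_000):
    """Value iteration for the discounted semi-Markov problem written as an SSP.

    G[i, u]    : expected discounted cost of one transition from state i under control u
    m[i, u, j] : effective discount factor m_ij(u) = int_0^inf e^{-beta*tau} dQ_ij(tau, u)
    Returns the (approximately) optimal cost vector J* and a greedy policy.
    """
    n_states = G.shape[0]
    J = np.zeros(n_states)
    for _ in range(max_iter):
        Q = G + m @ J                  # Q[i, u] = G(i, u) + sum_j m_ij(u) J(j)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    policy = (G + m @ J).argmin(axis=1)
    return J, policy
```

Because Σ_j m_ij(u) < 1 for every i and u, the update is a contraction, so the iteration converges geometrically, just as in the discrete-time discounted case.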
If, in addition to the cost per unit of time g, there is an extra (instantaneous) one-stage cost
\hat{g}(i, u), Bellman's equation becomes
$$J^*(i) = \min_{u \in U(i)} \left\{ \hat{g}(i, u) + G(i, u) + \sum_{j=1}^n m_{ij}(u)\, J^*(j) \right\}$$
Example 5.5.1 (Manufacturer's production plan)
A manufacturer receives orders with interarrival times uniformly distributed in [0, τ_max].
He may process all unfilled orders at cost K > 0, or process none. The cost per unit of time of
an unfilled order is c. The maximum number of unfilled orders is n.
Objective: Find a processing policy that minimizes the total expected cost, assuming a discount
factor β > 0.
The nonzero transition distributions are
$$Q_{i1}(\tau, \text{Fill}) = Q_{i,i+1}(\tau, \text{Not Fill}) = \min\left\{1, \frac{\tau}{\tau_{\max}}\right\}$$
The one-stage expected cost G (see equation (5.5.3)) is
$$G(i, \text{Fill}) = 0, \qquad G(i, \text{Not Fill}) = \gamma\, c\, i,$$
where
$$\gamma = \sum_{j=1}^n \int_0^\infty \frac{1 - e^{-\beta\tau}}{\beta}\, dQ_{ij}(\tau, u) = \int_0^{\tau_{\max}} \frac{1 - e^{-\beta\tau}}{\beta\, \tau_{\max}}\, d\tau.$$
There is an instantaneous cost
$$\hat{g}(i, \text{Fill}) = K, \qquad \hat{g}(i, \text{Not Fill}) = 0.$$
The effective discount factors m_ij(u) in Bellman's equation are
$$m_{i1}(\text{Fill}) = m_{i,i+1}(\text{Not Fill}) = \alpha,$$
where
$$\alpha = \int_0^\infty e^{-\beta\tau}\, dQ_{ij}(\tau, u) = \int_0^{\tau_{\max}} \frac{e^{-\beta\tau}}{\tau_{\max}}\, d\tau = \frac{1 - e^{-\beta\tau_{\max}}}{\beta\, \tau_{\max}}.$$
Bellman's equation has the form
$$J^*(i) = \min\big\{ K + \alpha J^*(1),\; \gamma c\, i + \alpha J^*(i + 1) \big\}, \qquad i = 1, 2, \ldots$$
As in the discrete-time case, it can be proved that J^*(i) is monotonically nondecreasing in i. Therefore,
there must exist an optimal threshold i^* such that the manufacturer should fill the orders if and
only if their number i exceeds i^*.
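To make the threshold concrete, the finite-state version of this Bellman equation can be iterated numerically. The sketch below uses made-up values of τ_max, β, c, K and n, and closes the model by forcing a fill when the maximum number n of unfilled orders is reached (an assumption introduced here only for illustration).

```python
import numpy as np

# Illustrative (made-up) data for the manufacturer's example
tau_max, beta, c, K, n = 2.0, 0.1, 1.0, 5.0, 50

alpha = (1 - np.exp(-beta * tau_max)) / (beta * tau_max)                     # effective discount factor
gamma = (tau_max - (1 - np.exp(-beta * tau_max)) / beta) / (beta * tau_max)  # discounted holding weight

J = np.zeros(n + 1)                        # J[i] = optimal cost with i unfilled orders, i = 1..n
for _ in range(100_000):
    J_new = J.copy()
    for i in range(1, n + 1):
        fill = K + alpha * J[1]
        not_fill = gamma * c * i + alpha * J[i + 1] if i < n else np.inf     # must fill at capacity
        J_new[i] = min(fill, not_fill)
    if np.max(np.abs(J_new - J)) < 1e-10:
        J = J_new
        break
    J = J_new

i_star = min(i for i in range(1, n + 1)
             if i == n or K + alpha * J[1] <= gamma * c * i + alpha * J[i + 1])
print("fill all unfilled orders once i reaches", i_star)
```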
5.5.4 Average cost problems
The cost function for the continuous-time average cost per unit time problem (assuming that there
is a special state that is recurrent under all policies) would be
$$\lim_{T \to \infty} \frac{1}{T}\, E\left[ \int_0^T g\big(x(t), u(t)\big)\, dt \right].$$
However, we will use instead the cost function
$$\lim_{N \to \infty} \frac{1}{E[t_N]}\, E\left[ \int_0^{t_N} g\big(x(t), u(t)\big)\, dt \right],$$
where t_N is the completion time of the Nth transition. This cost function is equivalent to the
previous one under the conditions of the subsequent analysis.
We now apply the SSP argument used for the discrete-time case. Divide the trajectory into cycles
marked by successive visits to state n. The cost at (i, u) is G(i, u) − λ^* τ̄_i(u), where λ^* is the optimal
expected cost per unit of time. Each cycle is viewed as a state trajectory of a corresponding
SSP problem with the termination state being essentially n.
Bellman's equation for the average cost problem is
$$h^*(i) = \min_{u \in U(i)} \left\{ G(i, u) - \lambda^* \bar{\tau}_i(u) + \sum_{j=1}^n p_{ij}(u)\, h^*(j) \right\}.$$
In the manufacturer's example, the expected transition times are
$$\bar{\tau}_i(\text{Fill}) = \bar{\tau}_i(\text{Not Fill}) = \frac{\tau_{\max}}{2}.$$
The expected transition cost is
$$G(i, \text{Fill}) = 0, \qquad G(i, \text{Not Fill}) = \frac{c\, i\, \tau_{\max}}{2},$$
and the instantaneous cost is
$$\hat{g}(i, \text{Fill}) = K, \qquad \hat{g}(i, \text{Not Fill}) = 0.$$
Bellman's equation is
$$h^*(i) = \min\left\{ K - \lambda^* \frac{\tau_{\max}}{2} + h^*(1),\;\; \frac{c\, i\, \tau_{\max}}{2} - \lambda^* \frac{\tau_{\max}}{2} + h^*(i + 1) \right\}.$$
Again, it can be shown that a threshold policy is optimal.
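Since a threshold policy is optimal, each candidate threshold can also be evaluated directly by a renewal-reward argument: a cycle runs from one fill to the next, lasting m transitions of mean length τ_max/2 and costing K plus the holding charges accumulated in states 1, ..., m−1. The sketch below, with made-up numbers, picks the best threshold this way; it is a back-of-the-envelope check, not part of the notes.

```python
import numpy as np

tau_bar, c, K, n = 1.0, 1.0, 5.0, 50     # tau_bar = tau_max / 2; all values made up

def average_cost(m):
    """Average cost per unit time of the policy 'fill when the number of unfilled orders reaches m'."""
    cycle_time = m * tau_bar                          # m transitions per cycle
    cycle_cost = K + c * tau_bar * m * (m - 1) / 2    # holding in states 1..m-1, then one fill
    return cycle_cost / cycle_time

costs = [average_cost(m) for m in range(1, n + 1)]
m_star = int(np.argmin(costs)) + 1
print("best threshold:", m_star, " average cost per unit time:", round(min(costs), 4))
```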
5.6 Application: Multi-Armed Bandits
5.7 Exercises
Exercise 5.7.1 A computer manufacturer can be in one of two states. In state 1 his product
sells well, while in state 2 his product sells poorly. While in state 1 he can advertise his product,
in which case the one-stage reward is 4 units, and the transition probabilities are p_11 = 0.8 and
p_12 = 0.2. If in state 1 he does not advertise, the reward is 6 units and the transition probabilities
are p_11 = p_12 = 0.5. While in state 2, he can do research to improve his product, in which case
the one-stage reward is −5 units, and the transition probabilities are p_21 = 0.7 and p_22 = 0.3. If in
state 2 he does not do the research, the reward is −3, and the transition probabilities are p_21 = 0.4
and p_22 = 0.6. Consider the infinite horizon, discounted version of this problem.
(a) Show that when the discount factor α is sufficiently small, the computer manufacturer should
follow the shortsighted policy of not advertising (not doing research) while in state 1 (state 2).
By contrast, when α is sufficiently close to 1, he should follow the farsighted policy of adver-
tising (doing research) while in state 1 (state 2).
(b) For α = 0.9, calculate the optimal policy using policy iteration.
(c) For α = 0.99, use a computer to solve the problem by value iteration.
Exercise 5.7.2 An energetic salesman works every day of the week. He can work in only one of
two towns A and B on each day. For each day he works in town A (or B) his expected reward is r_A
(or r_B, respectively). The cost of changing towns is c. Assume that c > r_A > r_B, and that there is
a discount factor α < 1.
(a) Show that for α sufficiently small, the optimal policy is to stay in the town he starts in, and
that for α sufficiently close to 1, the optimal policy is to move to town A (if not starting there)
and stay in A for all subsequent times.
(b) Solve the problem for c = 3, r_A = 2, r_B = 1 and α = 0.9 using policy iteration.
(c) Use a computer to solve the problem of part (b) by value iteration.
Exercise 5.7.3 A person has an umbrella that she takes from home to office and vice versa. There
is a probability p of rain each time she leaves home or the office, independently of earlier weather.
If the umbrella is in the place where she is and it rains, she takes the umbrella to go to the other
place (this involves no cost). If there is no umbrella and it rains, there is a cost W for getting wet.
If the umbrella is in the place where she is but it does not rain, she may take the umbrella to go to
the other place (this involves an inconvenience cost V) or she may leave the umbrella behind (this
involves no cost). Costs are discounted at a factor α < 1.
(a) Formulate this as an infinite horizon total cost discounted problem. Try to reduce the number
of states of the model. Two or three states should be enough for this problem!
(b) Characterize the optimal policy as best as you can.
Exercise 5.7.4 An unemployed worker receives a job offer at each time period, which she may ac-
cept or reject. The offered salary takes one of n possible values w_1, ..., w_n, with given probabilities,
independently of preceding offers. If she accepts the offer, she must keep the job for the rest of her
life at the same salary level. If she rejects the offer, she receives unemployment compensation c for
the current period and is eligible to accept future offers. Assume that income is discounted by a
factor α < 1.
Hint: Define the states s_i, i = 1, ..., n, corresponding to the worker being unemployed and being
offered a salary w_i, and \bar{s}_i, i = 1, ..., n, corresponding to the worker being employed at a salary
level w_i.
(a) Show that there is a threshold \bar{w} such that it is optimal to accept an offer if and only if its
salary is larger than \bar{w}, and characterize \bar{w}.
(b) Consider the variant of the problem where there is a given probability p_i that the worker will
be fired from her job at any one period if her salary is w_i. Show that the result of part (a)
holds in the case where p_i is the same for all i. Argue what would happen in the case where p_i
depends on i.
Exercise 5.7.5 An unemployed worker receives a job offer at each time period, which she may ac-
cept or reject. The offered salary takes one of n possible values w_1, ..., w_n, with given probabilities,
independently of preceding offers. If she accepts the offer, she must keep the job for the rest of her
life at the same salary level. If she rejects the offer, she receives unemployment compensation c for
the current period and is eligible to accept future offers.
Suppose that there is a probability p that the worker will be fired from her job at any one period,
and further assume that w_1 < w_2 < ... < w_n.
Show that when the worker maximizes her average income per period, there is a threshold value \bar{w}
such that it is optimal to accept an offer if and only if her salary is larger than \bar{w}, and characterize \bar{w}.
Hint: Define the states s_i, i = 1, ..., n, corresponding to the worker being unemployed and being
offered a salary w_i, and \bar{s}_i, i = 1, ..., n, corresponding to the worker being employed at a salary
level w_i.
Chapter 6
Point Process Control
The following chapter is based on Chapters I, II and VII in Brémaud's book Point Processes and
Queues (1981).
6.1 Basic Definitions
Consider some probability space (Ω, F, P). A real-valued mapping X : Ω → R is a random variable
if for every C ∈ B(R) the pre-image X^{-1}(C) ∈ F.
A filtration (or history) of a measurable space (Ω, F) is a collection (F_t)_{t≥0} of sub-σ-fields of F such
that for all 0 ≤ s ≤ t,
$$F_s \subseteq F_t.$$
We denote by
$$F_\infty = \bigvee_{t \ge 0} F_t := \sigma\Big( \bigcup_{t \ge 0} F_t \Big).$$
A family (X_t)_{t≥0} of real-valued random variables is called a stochastic process. The filtration
generated by X_t is
$$F^X_t := \sigma(X_s : s \in [0, t]).$$
For a fixed ω ∈ Ω, the function t ↦ X_t(ω) is called a path of the stochastic process.
We say that the stochastic process is adapted to the filtration F_t if F^X_t ⊆ F_t for all t ≥ 0. We
say that X_t is F_t-progressive if for all t ≥ 0 the mapping (s, ω) ↦ X_s(ω) from [0, t] × Ω to R is
B([0, t]) ⊗ F_t-measurable.
Let F_t be a filtration. We define the F_t-predictable σ-field P(F_t) as follows:
$$\mathcal{P}(F_t) := \sigma\big( (s, t] \times A : 0 \le s \le t \text{ and } A \in F_s \big).$$
A stochastic process X_t is F_t-predictable if X_t is P(F_t)-measurable.
Proposition 6.1.1 A real-valued stochastic process X_t that is adapted to F_t and left-continuous is
F_t-predictable.
Given a filtration F_t, a process X_t is called an F_t-martingale over [0, c] if the following three conditions
are satisfied:
1. X_t is adapted to F_t.
2. E[|X_t|] < ∞ for all t ∈ [0, c].
3. E[X_t | F_s] = X_s a.s., for all 0 ≤ s ≤ t ≤ c.
If the equality in (3) is replaced by ≥ (≤) then X_t is called a submartingale (supermartingale).
Exercise 6.1.1 Let X_t be a real-valued process with independent increments, that is, for all 0 ≤
s ≤ t, X_t − X_s is independent of F^X_s. Suppose that X_t is integrable and E[X_t] = 0. Show that X_t
is an F^X_t-martingale.
If in addition X_t^2 is integrable, then X_t^2 is an F^X_t-submartingale and X_t^2 − E[X_t^2] is an F^X_t-martingale.
6.2 Counting Processes
Definition 6.2.1 A sequence of random variables {T_n : n ≥ 0} is called a point process if for each
n ≥ 0, T_n is F-measurable and
$$T_0(\omega) = 0 \quad \text{and} \quad T_n(\omega) < T_{n+1}(\omega) \text{ whenever } T_n(\omega) < \infty.$$
We will only consider nonexplosive point processes, that is, processes for which lim_{n→∞} T_n = ∞, P-a.s.
Associated to a nonexplosive point process {T_n}, we define the corresponding counting process
{N_t : t ≥ 0} as follows:
$$N_t = n \quad \text{if } t \in [T_n, T_{n+1}).$$
Thus, N_t is a right-continuous step function starting at 0, with unit jumps at the event times T_1, T_2, ... (see figure).
[Figure: sample path of a counting process N_t, increasing by one at each of the event times T_1, ..., T_5.]
Since we are considering nonexplosive point processes, N_t < ∞ for all t ≥ 0, P-a.s. In addition, if E[N_t] is finite for all t then
the point process is said to be integrable.
Exercise 6.2.1 Simple Renewal Process:
Consider a sequence {X_n : n ≥ 1} of iid nonnegative random variables. We define the point process
recursively as follows: T_0 = 0 and T_n = T_{n-1} + X_n, n ≥ 1. A sufficient condition for the process T_n
to be nonexplosive is E[X] > 0.
One of the most famous renewal processes is the Poisson process. In this case X, the inter-arrival
interval, has an exponential distribution with time-homogeneous rate λ.
Exercise 6.2.2 Queueing Process:
A simple queueing process Q_t is a nonnegative integer-valued process of the form
$$Q_t = Q_0 + A_t - D_t,$$
where A_t (arrival process) and D_t (departure process) are two nonexplosive point processes without
common jumps. Note that by definition D_t ≤ Q_0 + A_t.
Exercise 6.2.3 Let N_t(1) and N_t(2) be two nonexplosive point processes without common jumps
and let Q_0 be a nonnegative integer-valued random variable. Define X_t = Q_0 + N_t(1) − N_t(2) and
m_t = min_{0≤s≤t} {X_s ∧ 0}. Show that Q_t = X_t − m_t is a simple queueing process with arrival process
A_t = N_t(1), departure process
$$D_t = \int_0^t \mathbf{1}(Q_{s^-} > 0)\, dN_s(2), \qquad \text{and} \qquad m_t = -\int_0^t \mathbf{1}(Q_{s^-} = 0)\, dN_s(2).$$
Poisson processes are commonly used in practice to represent point processes. One possible expla-
nation for this popularity is their inherent mathematical tractability. However, a well-known result
in renewal theory says that the sum of a large number of independent renewal processes converges,
as the number of summands goes to infinity, to a Poisson process. So, the Poisson process can
in fact be a good approximation in many applications. To make the Poisson process
even more realistic we would like to have more flexibility on the arrival rate λ. We can achieve this
generalization as follows.
Definition 6.2.2 (Doubly Stochastic or Conditional Poisson Process) Let N_t be a point process
adapted to a filtration F_t, and let λ_t be a nonnegative measurable process. Suppose that
λ_t is F_0-measurable for all t ≥ 0,
and that
$$\int_0^t \lambda_s\, ds < \infty \quad P\text{-a.s., for all } t \ge 0.$$
If for all 0 ≤ s ≤ t the increment N_t − N_s is independent of F_s given F_0 and
$$P[N_t - N_s = k \mid F_s \vee F_0] = \frac{1}{k!} \exp\left( -\int_s^t \lambda_u\, du \right) \left( \int_s^t \lambda_u\, du \right)^k,$$
then N_t is called a (P, F_t)-doubly stochastic Poisson process with stochastic intensity λ_t.
The process λ_t is referred to as the intensity of the process. A special case of a doubly stochastic
Poisson process occurs when λ_t = λ, an F_0-measurable random variable. Another example is the case where λ_t = f(t, Y_t) for a
measurable nonnegative function f and a process Y_t such that F^Y_∞ ⊆ F_0.
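To build intuition, a conditional Poisson process can be simulated by first drawing the intensity path and then thinning a homogeneous Poisson process whose rate dominates it. The sketch below assumes a made-up intensity λ_t = Y(1 + 0.5 sin t) driven by a random level Y drawn at time 0 (so the intensity is F_0-measurable); all names and parameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_conditional_poisson(T=10.0, lam_max=5.0):
    """Simulate event times of a doubly stochastic Poisson process on [0, T] by thinning.

    lam_max must dominate the intensity on [0, T]; here the intensity is
    lambda_t = Y * (1 + 0.5*sin(t)) with a random level Y drawn at time 0.
    """
    Y = rng.uniform(1.0, 3.0)                        # random level drawn at time 0
    intensity = lambda t: Y * (1.0 + 0.5 * np.sin(t))

    events, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam_max)          # candidate point of a rate-lam_max Poisson process
        if t > T:
            break
        if rng.uniform() <= intensity(t) / lam_max:  # accept w.p. lambda_t / lam_max (thinning)
            events.append(t)
    return np.array(events)

print(simulate_conditional_poisson())
```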
Exercise 6.2.4 Show that if a doubly stochastic Poisson process N_t is such that E[∫_0^t λ_s ds] < ∞
for all t ≥ 0, then
$$M_t = N_t - \int_0^t \lambda_s\, ds$$
is an F_t-martingale.
Based on this observation we have the following important result.
Proposition 6.2.1 If N_t is an integrable doubly stochastic Poisson process with F_t-intensity λ_t,
then for all nonnegative F_t-predictable processes C_t
$$E\left[ \int_0^\infty C_s\, dN_s \right] = E\left[ \int_0^\infty C_s\, \lambda_s\, ds \right],$$
where ∫_0^t C_s dN_s := Σ_{n≥1} C_{T_n} 1(T_n ≤ t). It turns out that the converse is also true. This was first
proved by Watanabe in a less general setting.
Proposition 6.2.2 (Watanabe (1964)) Let N_t be a point process adapted to the filtration F_t, and
let λ(t) be a locally integrable nonnegative function. Suppose that N_t − ∫_0^t λ(s) ds is an F_t-martingale.
Then N_t is a Poisson process with intensity λ(t), that is, for all 0 ≤ s ≤ t, N_t − N_s is a Poisson
random variable with parameter ∫_s^t λ(u) du, independent of F_s.
Motivated by this result we define the notion of stochastic intensity for an arbitrary point process
as follows.
Definition 6.2.3 (Stochastic Intensity)
Let N_t be a point process adapted to some filtration F_t, and let λ_t be a nonnegative F_t-progressive
process such that for all t ≥ 0
$$\int_0^t \lambda_s\, ds < \infty \quad P\text{-a.s.}$$
If for all nonnegative F_t-predictable processes C_t the equality
$$E\left[ \int_0^\infty C_s\, dN_s \right] = E\left[ \int_0^\infty C_s\, \lambda_s\, ds \right]$$
is verified, then we say that N_t admits the F_t-intensity λ_t.
Exercise 6.2.5 Let N_t be a point process with F_t-intensity λ_t. Show that if λ_t is G_t-progressive
for some filtration G_t such that F^N_t ⊆ G_t ⊆ F_t for all t ≥ 0, then λ_t is also the G_t-intensity of N_t.
Similarly to the Poisson process, we can connect point processes with stochastic intensities to
martingales.
Proposition 6.2.3 (Integration Theorem)
If N_t admits the F_t-intensity λ_t (where ∫_0^t λ_s ds < ∞ a.s.), then N_t is nonexplosive and
1. M_t = N_t − ∫_0^t λ_s ds is an F_t-local martingale.
2. If X_t is an F_t-predictable process such that E[∫_0^t |X_s| λ_s ds] < ∞, then ∫_0^t X_s dM_s is an F_t-martingale.
3. If X_t is an F_t-predictable process such that ∫_0^t |X_s| λ_s ds < ∞ a.s., then ∫_0^t X_s dM_s is an F_t-local martingale.
6.3 Optimal Intensity Control
In this section we study the problem of controlling a point process. In particular, we focus on the
case where the controller can affect the intensity of the point process. This type of control differs
from impulsive control, where the controller has the ability to add or erase some of the points in the
sequence.
We consider a point process N_t that we wish to control. The control u belongs to a set U of
admissible controls. We will assume that U consists of real-valued processes defined on (Ω, F),
adapted to F^N_t, and such that for each t ∈ [0, T], u_t ∈ U_t. In addition, for each
u ∈ U the point process N_t admits a (P_u, F_t)-intensity λ_t(u). Here, F_t is some filtration associated
with N_t.
The performance measure that we will consider is given by
$$J(u) = E_u\left[ \int_0^T C_s(u)\, ds + \phi_T(u) \right]. \qquad (6.3.1)$$
The expectation in J(u) above is taken with respect to P_u. The function C_t(u) is an F_t-progressive
process and φ_T(u) is an F_T-measurable random variable.
We will consider a problem with complete information, so that F_t ≡ F^N_t. In addition, we assume
local dynamics:
u_t = u(t, N_t) is F^N_t-predictable,
λ_t(u) = λ(t, N_t, u_t),
C_t(u) = C(t, N_t, u_t),
φ_T = φ(T, N_T).
Exercise 6.3.1 Consider the cost function
$$J(u) = E\left[ \sum_{0 < T_n \le T} k_{T_n}(u) \right],$$
where k_t(u) is a nonnegative F_t-measurable process. Show that this cost function can be written
in the form given by equation (6.3.1).
6.3.1 Dynamic Programming for Intensity Control
Theorem 6.3.1 (Hamilton-Jacobi Sufficient Conditions)
Suppose there exists for each n ∈ N_+ a differentiable bounded F_t-progressive mapping V(t, ω, n)
such that for all ω ∈ Ω and all n ∈ N_+
$$\partial_t V(t, \omega, n) + \inf_{v \in U_t} \big\{ \lambda(t, \omega, n, v)\, [V(t, \omega, n + 1) - V(t, \omega, n)] + C(t, \omega, n, v) \big\} = 0 \qquad (6.3.2)$$
$$V(T, \omega, n) = \phi(T, \omega, n), \qquad (6.3.3)$$
and suppose there exists for each n ∈ N_+ an F^N_t-predictable process u^*(t, ω, n) such that u^*(t, ω, n)
achieves the minimum in equation (6.3.2). Then u^* is an optimal control.
Exercise 6.3.2 Prove the theorem.
This theorem lacks practical applicability because the value function is, in general, path dependent.
The analysis can be greatly simplified if we assume that the problem is Markovian.
Corollary 6.3.1 (Markovian Control)
Suppose that λ(t, ω, n, v), C(t, ω, n, v), and φ(t, ω, n) do not depend on ω and that there is a
function V(t, n) such that
$$\partial_t V(t, n) + \inf_{v \in U_t} \big\{ \lambda(t, n, v)\, [V(t, n + 1) - V(t, n)] + C(t, n, v) \big\} = 0 \qquad (6.3.4)$$
$$V(T, n) = \phi(T, n). \qquad (6.3.5)$$
Suppose that the minimum is achieved by a measurable function u^*(t, n). Then u^* is an optimal
control.
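When the control set is finite (or discretized), the Markovian HJB system (6.3.4)-(6.3.5) is just a family of coupled ODEs indexed by n, and it can be integrated backward in time. The sketch below does this with a simple backward Euler step over a user-supplied grid of controls; the functions lam, cost, phi and all grid parameters are placeholders for illustration, not part of the notes.

```python
import numpy as np

def solve_markovian_hjb(lam, cost, phi, T, n_max, controls, n_steps=1000):
    """Backward-Euler integration of dV/dt(t,n) + min_v {lam(t,n,v)[V(t,n+1)-V(t,n)] + cost(t,n,v)} = 0.

    lam(t, n, v), cost(t, n, v): intensity and running cost (user-supplied placeholders)
    phi(T, n): terminal cost; controls: 1-D array of candidate controls
    Returns the time grid, V on that grid, and the minimizing control index at each (t, n).
    """
    dt = T / n_steps
    times = np.linspace(0.0, T, n_steps + 1)
    V = np.zeros((n_steps + 1, n_max + 2))          # extra column so V(t, n+1) is defined at n = n_max
    V[-1, :] = [phi(T, n) for n in range(n_max + 2)]
    policy = np.zeros((n_steps, n_max + 1), dtype=int)

    for k in range(n_steps - 1, -1, -1):            # step backward from t = T to t = 0
        t = times[k + 1]
        for n in range(n_max + 1):
            q = np.array([lam(t, n, v) * (V[k + 1, n + 1] - V[k + 1, n]) + cost(t, n, v)
                          for v in controls])
            policy[k, n] = q.argmin()
            V[k, n] = V[k + 1, n] + dt * q.min()    # V(t - dt, n) = V(t, n) + dt * min_v {...}
        V[k, n_max + 1] = V[k + 1, n_max + 1]       # crude boundary copy for the extra column
    return times, V, policy
```

This is only a minimal sketch; in practice one would handle the boundary in n through a problem-specific absorbing state and check the time step for accuracy.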
6.4 Applications to Revenue Management
In this section we present an application of point process optimal control based on the work by
Gallego and van Ryzin (1994).¹
6.4.1 Model Description and HJB Equation
Consider a seller that owns I units of a product that he wants to sell over a fixed time period T.
Demand for the product is characterized by a point process N_t. Given a price policy p, N_t admits
the intensity λ(p_t). The seller's problem is to select a price strategy {p_t : t ∈ [0, T]} (a predictable
process) that maximizes the expected revenue over the selling horizon. That is,
$$\max\; J_p(t, I) := E_p\left[ \int_0^T p_s\, dN_s \right]
\quad \text{subject to} \quad \int_0^T dN_s \le I \quad P_p\text{-a.s.}$$
In order to ensure that the problem is feasible we will assume that there exists a price p_∞ such
that λ(p_∞) = 0 a.s. In this case, we define the set of admissible pricing policies A as the set of
predictable policies p(t, I − N_t) such that p(t, 0) = p_∞. The seller's problem becomes to maximize
over p ∈ A the expected revenue J_p(t, I).
Using Corollary 6.3.1, we can write the optimality condition for this problem as follows:
$$\partial_t V(t, n) + \sup_{p} \big\{ p\, \lambda(p) - \lambda(p)\, [V(t, n) - V(t, n - 1)] \big\} = 0, \qquad V(T, n) = 0.$$
Let us make the change of time t → T − t, so that t measures the remaining selling
time. Also, instead of looking at the price p as the decision variable, we use the intensity λ as the control and
p(λ) as the inverse demand function. The revenue rate r(λ) := λ p(λ) is assumed to be regular, i.e.,
continuous, bounded, concave, with a bounded maximizer λ^* = min{λ : λ ∈ argmax r(λ)} and such
that lim_{λ→0} r(λ) = 0. Under these new definitions the HJB equation becomes
$$\partial_t V(t, n) = \sup_{\lambda} \big\{ r(\lambda) - \lambda\, [V(t, n) - V(t, n - 1)] \big\}, \qquad V(0, n) = 0.$$
¹ Optimal Dynamic Pricing of Inventories with Stochastic Demand over Finite Horizons, Mgmnt. Sci. 40, 999-1020.
Proposition 6.4.1 If λ(p) is a regular demand function then there exists a unique solution to the
HJB equation. Further, the optimal intensities satisfy λ^*(t, n) ≤ λ^* for all n and all 0 ≤ t ≤ T.
Closed-form solutions to the HJB equation are generally intractable; however, the optimality condition
can be exploited to get some qualitative results about the optimal solution.
Theorem 6.4.1 The optimal value function V^*(t, n) is strictly increasing and strictly concave in
both n and t. Furthermore, there exists an optimal intensity λ^*(n, t) that is strictly increasing in n
and strictly decreasing in t.
The following figure plots a sample path of price and inventory under an optimal policy.
[Figure 6.4.1: Path of an optimal price policy and its inventory level. Demand is a time-homogeneous Poisson
process with intensity λ(p) = exp(−0.1 p), the initial inventory is C_0 = 20, and the selling horizon is H = 100. The
dashed line corresponds to the minimum price p_min = 10.]
6.4.2 Bounds and Heuristics
The fact that the HJB is intractable in most cases creates the need for alternative solution methods.
One possibility, that we consider here, is the use of the certainty equivalent version of the problem.
That is, the deterministic control problem resulting from replacing all uncertainty by its expected
value. In this case, the deterministic version of the problem is given by
$$V^D(T, n) = \max_{\lambda} \int_0^T r(\lambda_s)\, ds
\quad \text{subject to} \quad \int_0^T \lambda_s\, ds \le n.$$
The solution to this time-homogeneous problem can be found easily. Let λ^0(T, n) = n/T, that is,
the run-out rate. Then, it is straightforward to show that the optimal deterministic rate is
λ^D = min{λ^*, λ^0(T, n)} and the optimal expected revenue is V^D(T, n) = T min{r(λ^*), r(λ^0(T, n))}.
At least two things make the deterministic solution interesting.
Theorem 6.4.2 If λ(p) is a regular demand function then for all n ≥ 0 and t ≥ 0,
$$V^*(t, n) \le V^D(t, n).$$
Thus, the deterministic value function provides an upper bound on the optimal expected revenue.
In addition, if we fix the price at p^D = p(λ^D) and denote by V^{FP}(t, n) the expected revenue collected
from this fixed price strategy, then we have the following important result.
Theorem 6.4.3
$$\frac{V^{FP}(t, n)}{V^*(t, n)} \ge 1 - \frac{1}{2\sqrt{\min\{n, \lambda^* t\}}}.$$
Therefore, the fixed price policy is asymptotically optimal as the number of products n or the selling
horizon t becomes large.
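As a quick numerical illustration of these two results, the sketch below evaluates the deterministic upper bound and the fixed-price guarantee of Theorem 6.4.3 for the exponential demand family λ(p) = e^{-ap} used in Figure 6.4.1; the specific numbers are made up for illustration.

```python
import numpy as np

a = 0.1                      # demand sensitivity in lambda(p) = exp(-a p)
T, n = 100.0, 20.0           # selling horizon and initial inventory (as in Figure 6.4.1)

# Revenue rate r(lambda) = -lambda*ln(lambda)/a is maximized at lambda* = exp(-1)
lam_star = np.exp(-1.0)
p = lambda lam: -np.log(lam) / a          # inverse demand
r = lambda lam: lam * p(lam)              # revenue rate

lam_0 = n / T                             # run-out rate
lam_D = min(lam_star, lam_0)              # deterministic (certainty-equivalent) rate
V_D = T * r(lam_D)                        # deterministic upper bound on V*(T, n)

guarantee = 1 - 1 / (2 * np.sqrt(min(n, lam_star * T)))   # Theorem 6.4.3 lower bound on V_FP / V*
print(f"p_D = {p(lam_D):.2f}, V_D = {V_D:.2f}, fixed-price guarantee >= {guarantee:.3f}")
```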
Chapter 7
Papers and Additional Readings

A Brief Survey of the History of the Calculus
of Variations and its Applications

James Ferguson
jcf@uvic.ca
University of Victoria

Abstract
In this paper, we trace the development of the theory of the calculus of variations.
From its roots in the work of Greek thinkers and continuing through to the
Renaissance, we see that advances in physics serve as a catalyst for developments
in the mathematical theory. From the 18th century onwards, the task of
establishing a rigourous framework of the calculus of variations is studied,
culminating in Hilbert's work on the Dirichlet problem and the development of
optimal control theory. Finally, we make a brief tour of some applications of the
theory to diverse problems.


Introduction

Consider the following three problems:

1) What plane curve connecting two given points has the shortest length?

2) Given two points A and B in a vertical plane, find the path AMB which the
movable particle M will traverse in shortest time, assuming that its acceleration is
due only to gravity.

3) Find the minimum surface of revolution passing through two given fixed points,
(x_A, y_A) and (x_B, y_B).

All three of these problems can be solved by the calculus of variations. A field
developed primarily in the eighteenth and nineteenth centuries, the calculus of variations
has been applied to a myriad of physical and mathematical problems since its inception.
In a sense, it is a generalization of calculus. Essentially, the goal is to find a path, curve,
or surface for which a given function has a stationary value. In our three introductory
problems, for instance, this stationary value corresponds to a minimum.

The variety and diversity of the theory's practical applications is quite astonishing. From
soap bubbles to the construction of an ideal column and from quantum field theory to
softer spacecraft landings, this venerable branch of mathematics has a rich history and
continues to spring upon us new surprises. Its development has also served as a catalyst
for theoretical advances in seemingly disparate fields of mathematics, such as analysis,
topology, and partial differential equations. In fact, at least two modern (i.e. since the
beginning of the twentieth century) areas of research can claim the calculus of variations
as a common ancestor; namely Morse theory and optimal control theory. Since the
theory was initially developed to tackle physical problems, it is not surprising that
variational methods are at the heart of modern approaches to problems in theoretical
physics. More surprising is that the calculus of variations has been applied to problems
in economics, literature, and interior design!

In the course of this paper, we will trace the historical development of the calculus of
variations. Along the way, we will explore a few of the more interesting historical
problems and applications, and we shall highlight some of the major contributors to the
theory. First, let us get an intuitive sense of the theory of the calculus of variations with
the following mathematical interlude, which might be found along similar lines in an
applied math or physics text (e.g. [2] and [5]).

Mathematical Background

In this section we derive the differential equation that y(x) must obey in order to minimize the integral
$$I = \int_{x_A}^{x_B} f(x, y, y')\, dx,$$
where x_A, x_B, y(x_A) = y_A, y(x_B) = y_B and f are all given, and f is assumed to be a twice-
differentiable function of all its arguments. Let us denote the function which minimizes I by y(x).
Now consider the one-parameter family of comparison functions (or test functions), \tilde{y}(x, α), which
satisfy the conditions:
a) \tilde{y}(x_A, α) = y_A, \tilde{y}(x_B, α) = y_B for all α;
b) \tilde{y}(x, 0) = y(x), the desired minimizing function;
c) \tilde{y}(x, α) and all its derivatives through second order are continuous functions of x and α.

For a given comparison function, the integral
$$I(\alpha) = \int_{x_A}^{x_B} f(x, \tilde{y}, \tilde{y}')\, dx$$
is clearly a function of α. Also, since setting α = 0 corresponds, by condition (b), to replacing \tilde{y}
by y(x) and \tilde{y}' by y'(x), we see that I(α) should be a minimum with respect to α at the value α = 0,
according to the designation that y(x) is the actual minimizing function. This is true for any \tilde{y}(x, α).

A necessary condition for a minimum is the vanishing of the first derivative. Thus we have
$$\left( \frac{dI}{d\alpha} \right)_{\alpha = 0} = 0$$
as a necessary condition for the integral to take on a minimum value at α = 0.
Differentiating with respect to α (remembering that x does not depend on α, while \tilde{y} and \tilde{y}' do), we get
$$\frac{dI}{d\alpha} = \int_{x_A}^{x_B} \left[ \frac{\partial f}{\partial \tilde{y}} \frac{\partial \tilde{y}}{\partial \alpha} + \frac{\partial f}{\partial \tilde{y}'} \frac{\partial \tilde{y}'}{\partial \alpha} \right] dx,$$
and by condition (c), we can write this as
$$\frac{dI}{d\alpha} = \int_{x_A}^{x_B} \left[ \frac{\partial f}{\partial \tilde{y}} \frac{\partial \tilde{y}}{\partial \alpha} + \frac{\partial f}{\partial \tilde{y}'} \frac{d}{dx}\!\left( \frac{\partial \tilde{y}}{\partial \alpha} \right) \right] dx.$$
Integrating the second term by parts gives us
$$\frac{dI}{d\alpha} = \left[ \frac{\partial f}{\partial \tilde{y}'} \frac{\partial \tilde{y}}{\partial \alpha} \right]_{x_A}^{x_B} + \int_{x_A}^{x_B} \left[ \frac{\partial f}{\partial \tilde{y}} - \frac{d}{dx} \frac{\partial f}{\partial \tilde{y}'} \right] \frac{\partial \tilde{y}}{\partial \alpha}\, dx.$$
Now by condition (a), \tilde{y}(x_A, α) = y_A and \tilde{y}(x_B, α) = y_B for all α. Therefore,
$$\left. \frac{\partial \tilde{y}}{\partial \alpha} \right|_{x = x_A} = \left. \frac{\partial \tilde{y}}{\partial \alpha} \right|_{x = x_B} = 0,$$
and in the end we get
$$\frac{dI}{d\alpha} = \int_{x_A}^{x_B} \left[ \frac{\partial f}{\partial \tilde{y}} - \frac{d}{dx} \frac{\partial f}{\partial \tilde{y}'} \right] \frac{\partial \tilde{y}}{\partial \alpha}\, dx.$$
We now require that I(α) have a minimum at α = 0, that is,
$$\left( \frac{dI}{d\alpha} \right)_{\alpha = 0} = \int_{x_A}^{x_B} \left[ \frac{\partial f}{\partial \tilde{y}} - \frac{d}{dx} \frac{\partial f}{\partial \tilde{y}'} \right]_{\alpha = 0} \left( \frac{\partial \tilde{y}}{\partial \alpha} \right)_{\alpha = 0} dx = 0.$$
If we set α = 0, this is the same as setting \tilde{y}(x, α) = y(x), \tilde{y}'(x, α) = y'(x), and \tilde{y}''(x, α) = y''(x).
(Note that the integrand depends on \tilde{y}'', and in taking the limit α = 0 we need to know that the
second derivative \tilde{y}''(x, α) is a continuous function of its two variables. This is guaranteed by
condition (c).)

Now if we set
$$\eta(x) = \left( \frac{\partial \tilde{y}}{\partial \alpha} \right)_{\alpha = 0},$$
we obtain
$$\int_{x_A}^{x_B} \left[ \frac{\partial f}{\partial y} - \frac{d}{dx} \frac{\partial f}{\partial y'} \right] \eta(x)\, dx = 0.$$
Now η(x) vanishes at x_A and x_B by condition (a), and it is continuous and differentiable by
condition (c). However, aside from these qualities, η(x) is completely arbitrary. Therefore, for the
integral above to vanish, we must have
$$\frac{\partial f}{\partial y} - \frac{d}{dx} \frac{\partial f}{\partial y'} = 0.$$
This is known as the Euler-Lagrange equation, which is used to develop the Lagrangian
formulation of classical mechanics. If we expand the total derivative with respect to x, we get
$$\frac{\partial f}{\partial y} - \frac{\partial^2 f}{\partial y' \partial x} - y' \frac{\partial^2 f}{\partial y' \partial y} - y'' \frac{\partial^2 f}{\partial y'^2} = 0.$$
This is a second-order differential equation, whose solution is a twice-differentiable minimizing
function provided a minimum exists. Note that our initial condition on y(x),
$$\left( \frac{dI}{d\alpha} \right)_{\alpha = 0} = 0,$$
is only a necessary condition for a minimum. The solution could also produce a maximum or an
inflection point. In other words, y(x) is an extremizing function. [5]
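As a quick sanity check of this machinery, the Euler-Lagrange expression can be formed symbolically. The sketch below (using Python's sympy, purely as an illustration and not part of the survey) builds ∂f/∂y − d/dx ∂f/∂y' for the arc-length integrand f = sqrt(1 + y'^2), recovering the condition y'' = 0, i.e., straight lines solve problem (1) of the Introduction.

```python
import sympy as sp

x, y_, yp_ = sp.symbols('x y yp')       # treat y and y' as independent symbols inside f
y = sp.Function('y')

f = sp.sqrt(1 + yp_**2)                 # arc-length integrand f(x, y, y') = sqrt(1 + y'^2)

# Euler-Lagrange expression: df/dy - d/dx (df/dy'), evaluated along y = y(x)
df_dy = sp.diff(f, y_)
df_dyp = sp.diff(f, yp_)
subs = {y_: y(x), yp_: sp.Derivative(y(x), x)}
euler_lagrange = sp.simplify(df_dy.subs(subs) - sp.diff(df_dyp.subs(subs), x))

print(euler_lagrange)   # -> -y''(x)/(y'(x)**2 + 1)**(3/2), so the condition reduces to y''(x) = 0
```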

Hero and the Principle of Least Time

Probably the first person to seriously consider minimization problems from a scientific
point of view was Hero of Alexandria, who lived sometime between 150 BC and 300 AD.
He studied the optics of reflection and pointed out, without proof, that reflected light
travels in a way that minimizes its travel time. This is a precursor to Fermat's principle
of least time.

Hero showed that when a ray of light is reflected by a mirror, the path taken from the
object to the observer's eye is shorter than any other possible path so reflected. It is
worthwhile to quote from Hero's Catoptrics:

Practically all who have written of dioptrics and of optics have been in doubt as to
why rays proceeding from our eyes are reflected by mirrors and why the reflections
are at equal angles. Now the proposition that our sight is directed in straight lines
proceeding from the organ of vision may be substantiated as follows. For whatever
moves with unchanging velocity moves in a straight line... for because of the
impelling force the object in motion strives to move over the shortest possible
distance, since it has not the time for slower motion, that is, for motion over a
longer trajectory. The impelling force does not permit such retardation. And so,
by reason of its speed, the object tends to move over the shortest path. But the
shortest of all lines having the same end points is a straight line... Now by the
same reasoning, that is, by a consideration of the speed of the incidence and the
reflection, we shall prove that these rays are reflected at equal angles in the case of
plane and spherical mirrors. For our proof must again make use of minimum lines.
[20]





Pappus and Isoperimetric Problems

Once upon a time, kings would reward exceptional civil servants and military personnel
by giving them all the land that they could encompass by a ploughed furrow in a
specified period of time. In this way, the problem of finding the plane curve of a given
length which encloses the greatest area, or the isoperimetric problem, was born [5].
Pappus of Alexandria (c.290 AD - c.350 A.D.) was not the first person to consider
isoperimetric problems. However, in his book Mathematical Collection, he collected and
systematized results from many previous mathematicians, drawing upon works from
Euclid (325 BC - 265 BC), Archimedes (287 BC - 212 BC), Zenodorus (200 BC - 140
BC), and Hypsicles (190 BC - 120 BC). This topic is often linked to the five so-called
Platonic solids (pyramid, cube, octahedron, dodecahedron, and icosahedron).



In Book 5 of the Mathematical Collection, Pappus compares figures with equal contours
(or surfaces) to see which has the greatest area (or volume). We can summarize the main
mathematical contents of Book 5 in the following way:

1. Among plane figures with the same perimeter, the circle has the greatest area.
2. Among solid figures with the same area, the sphere has the greatest volume.
3. There are five and only five regular solids.

Apart from these three primary results, Pappus also notes the following secondary points:

1. Given any two regular plane polygons with the same perimeter, the one with
the greater number of angle has the greater area, and consequently,
2. Given a regular plane polygon and a circle with the same perimeter, the circle
has the greater area.
3. A circle has the same area as a right-angled triangle whose base is equal to the
radius and whose height is equal to the circumference of the circle.
4. Of isoperimetric polygons with the same number of sides, a regular polygon is
greater than an irregular one.
5. Given any segments with the same circumference, the semicircle has the
greatest area.
6. There are only five regular solid bodies.
7. Given a sphere and any of the five regular solids with equal surface, the sphere
is greater.
8. Of solid bodies with the same surface, the one with more faces is the greatest.
9. Every sphere is equal to a cone whose base is the surface of the sphere and
whose height is its radius.

Pappus appears to have been a master of demonstrating what had already been shown. In
fact, item 9 from the list above was well known to the world as being proved by
Archimedes and was even engraved on his tombstone. In spite of this, his works were a
useful collection of facts related to problems about isoperimetry [6].


Fermat (1601 - 1665)

A more serious and more general minimization problem in optics was studied in the mid-
17th century by the French mathematician Pierre de Fermat (1601-1665). He believed
that nature operates by means and ways that are easiest and fastest but not always on
shortest paths. When it came to light rays, Fermat believed that light travelled more
slowly in a denser medium. (While this may seem intuitive to us, Descartes believed the
opposite - that light travelled faster in a denser medium.) He was able to show that the
time required for a light ray to traverse a neighbouring virtual path differs from the time
actually taken by a quantity of the second order [20].

We can state Fermats principle mathematically as:
0 #
!
Q
P
v
ds
5 ,
where P and Q are the starting- and end-points of the path, v the velocity at any point and
ds an element of the path. The equation indicates that the variation of the integral is zero,
i.e., the difference between this integral taken along the actual path and that taken along a
neighbouring path is an infinitesimal quantity of the second order in the distance between
the paths.

However, this disagreement with the great René Descartes (1596-1650) was the cause of
much personal and public agony for Fermat. One can sense the style and wit of Fermat,
in the following excerpt from a letter he wrote to Clerselier (a defender of Descartes) in
May 1662:

I believe that I have often said both to M. de la Chambre and to you that I do not
pretend, nor have I ever pretended to be in the inner confidence of Nature. She
has obscure and hidden ways which I have never undertaken to penetrate. I
would have only offered her a little geometrical aid on the subject of refraction,
should she have been in need of it. But since you assure me, Sir, that she can
manage her affairs without it, and that she is content to follow the way that has
been prescribed to her by M. Descartes, I willingly hand over to you my alleged
conquest of physics; and I am satisfied that you allow me to keep my geometrical
problem - pure and in abstracto, by means of which one can find the path of a
thing moving through two different media and seeking to complete its movement
as soon as it can [20].

Newton and Surfaces of Revolution in a Resisting Medium

The first real problem in the calculus of variations was studied by Sir Isaac Newton
(1643-1727) in his famous work on mechanics, Philosophiae naturalis principia
mathematica (1685), or the Principia for short [11]. Newton examined the motion of
bodies in a resisting medium. First, he considered a specific case, that of the motion of a
frustum of a cone moving through a resisting medium in the direction of its axis. This
problem can be solved using the ordinary (i.e. pre-existing) theory of maxima and
minima, which Newton showed. Next, Newton considered a more general problem.
Suppose a body moves with a given constant velocity through a fluid, and suppose that
the body covers a prescribed maximal cross section (orthogonal to the velocity vector) at
its rear end. Find the shape of the body which renders its resistance minimal.

This was the first problem in the field to be clearly formulated and also the first to be
correctly solved, thus marking the birth of the theory of the calculus of variations. The
geometrical technique used by Newton was later adopted by Jacob Bernoulli in his
solution of the Brachistochrone and was also later systematized by Euler. Aside from
giving birth to an entire field, a further point of interest about Newtons study of motion
in a resisting medium is that it is actually one of the most difficult problems ever tackled
by variational methods until the twentieth century. Firstly, the formulation of the
problem requires several assumptions to be made regarding the resisting medium and the
nature of the resistance experienced by the moving body. As it turns out, the restrictions
imposed by Newton are only valid for bodies moving at a velocity greater than the speed
of sound for the given medium. Secondly, the problem can possess solution curves
having a corner (i.e. a discontinuous slope) which, when expressed parametrically, may
not have a solution in the ordinary sense [11]. This foreshadows twentieth century
developments in optimal control theory.

We should make a few further remarks regarding Newtons solution to this problem, as
appeared in the Principia. Anyone who has ever been baffled by a mathematical text
before will find solace in the fact that Newtons solution appeared without a suggestion
or hint as to how to derive it. Furthermore, none of Newtons contemporaries, including
Leibniz (but with the possible exception of Huygens), could grasp the fundamental ideas
behind Newtons technique. The mathematical community was completely baffled.
Eventually, an astronomy professor at Oxford, named David Gregory, persuaded Newton
to write out an analysis of the problem for him in 1691. After studying Newtons
detailed exposition, Gregory communicated it to his students, and thereby the rest of the
world, through his Oxford lectures in the fall of 1691. Since that time, numerous studies
have been undertaken involving more general considerations (i.e. more realistic types of
resistance, non-symmetric surfaces, etc.). A good overview can be found in [4].

The Brachistochrone

The most famous problem in the history of the subject is undoubtedly the problem of the
Brachistochrone. In June of 1696, Johann Bernoulli (1667-1748), a member of the most
famous mathematical family in history, issued an open challenge to the mathematical
world with the following problem (problem (2) from the Introduction above):

Given two points A and B in a vertical plane, find the path AMB which the
movable particle M will traverse in shortest time, assuming that its acceleration is
due only to gravity.

The problem is in fact based on a similar problem considered by Galileo Galilei (1564-
1642) in 1638. Galileo did not solve the problem explicitly and did not use methods
based on the calculus. Due to the incomplete nature of Galileos work on the subject,
Johann was fully justified in bringing the matter to the attention of the world. After
stating the problem, Johann assured his readers that the solution to the problem was very
useful in mechanics and that it was not a straight line but rather a curve familiar to
geometers. He gave the world until the end of 1696 to solve problem, at which time he
promised to publish his own solution. At the end of the year, he published the challenge
a second time, adding an additional problem (one of a geometrical nature), and extending
his deadline until Easter of 1697.

At the time of the initial challenge to the world, Johann Bernoulli had also sent the
problem privately to one of the most gifted minds of the day, Gottfried Wilhelm Leibniz
(1646-1716), in a letter dated 9 June 1696. A short time later, he received a complete
solution in reply, dated 16 June 1696! In our modern society, which has become
obsessed doing everything as soon as possible, focusing so much on speed that we
often sacrifice quality, it is refreshing to see that technology is not a prerequisite for
timeliness. It also gives us an indication of Leibnizs genius. It was in correspondence
between Leibniz and Johann Bernoulli that the name Brachistochrone was born. Leibniz
had originally suggested the name Tachistoptotam (from the Greek tachistos, swiftest,
and piptein, to fall). However, Bernoulli overruled him and christened the problem under
the name Brachistochrone (from the Greek brachistos, shortest, and chronos, time) [11].

The other great mathematical mind of the day, Newton, was also able to solve the
problem posed by Johann Bernoulli. As legend has it, on the afternoon of 29 January
1697, Newton found a copy of Johann Bernoullis challenge waiting for him as he
returned home after a long day at work. At this time, Newton was Warden of the London
Mint. By four oclock that morning, after roughly twelve hours of continuous work,
Newton had succeeded in solving both of the problems found in Bernoullis challenge!
That same day, Newton communicated his solution anonymously to the Royal Society.
While it is quite a feat, comparable to that of Leibnizs rapid response to Bernoulli, one
should note that Bernoulli himself claimed that neither problem should take a man
capable of it more than half an hours careful thought. As Ball slyly notes [3], since it
actually took Newton twelve hours, it is a warning from the past of how administration
dulls the mind. Indeed, it is rather surprising that it took Newton so long, considering
the similarities that the Brachistochrone problem has with Newtons previously solved
problem of bodies in a resisting medium.

When Johann originally posed the problem, it is likely that his main motivation was to
fuel the fire of his bitter feud with elder brother, Jacob Bernoulli (1654-1705). Johann
had publicly described his brother Jacob as incompetent and was probably using the
Brachistochrone problem, which he has already solved, as a means of publicly
triumphing over his brother. Such an attitude towards ones contemporaries prompted one
scholar to remark that it must have been Johann Bernoulli who first said the words, It is
not enough for you to succeed; your colleagues must also fail. [5]

In the end, Jacob Bernoulli was able to solve the problem set to him by his brother,
joining Leibniz, Newton, and l'Hôpital as the only people to correctly solve the problem.
It is interesting to note that even though Newton sent in his result anonymously, Johann
Bernoulli was not fooled. He later wrote to a colleague that Newtons unmistakable style
was easy to spot and that he knew the lion from his touch. Far from being gracious,
however, Johann was quick to proclaim his superiority over others when summarizing the
results of his challenge:

I have with one blow solved two fundamental problems, one optical and the other
mechanical and have accomplished more than I have asked of others: I have
shown that the two problems, which arose from totally different fields of
mathematics, nevertheless possess the same nature.

Bernoulli refers to the fact that he was the first to publicly demonstrate that the least time
principle of Fermat and the least time nature of the Brachistochrone are two
manifestations of the same phenomenon.

Let us exhibit a solution for the Brachistochrone problem, not in the geometrical
language of the times, but rather in the more modern way that was developed later by
Euler and Lagrange (as we shall soon see).

Let us take A as the origin in our coordinate system, assume that the particle of
mass m has zero initial velocity, and assume that there is no friction. Let us also
take the y-axis to be directed vertically downward. The speed along the curve
AMB is dt ds v # and thus, the total time of descent is
.
1
0
2
! !
#
" ,
# #
B
A
x
x
B
A
dx
v
y
v
ds
I
Now we know by conservation of energy that the change in kinetic energy must
equal the change in potential energy. Therefore, we can write

mgy mv #
2
2
1
,

so that the functional to be minimized becomes

!
" ,
#
B
x
dx
y
y
g
I
0
2
.
1
2
1


Now we can use the Euler-Lagrange equation to obtain (neglecting the constant
factor of g 2 1 )

C
y
y
y y
y
#
" ,
3
" ,
"
2
2
2
1
) 1 (
or .
) 1 (
1
2
2
C
y y
#
" ,


Setting , 2 1
2
a C # we obtain

,
2
y
y a
y
3
# "
9

and integration yields

.
2
0
!
3
# 3 dy
y a
y
x x

After making the change of variables ), cos 1 ( 6 3 # a y the integral becomes

). sin (
2
sin 2
2
0
6 6 6
6
3 # # 3
!
a d a x x

Therefore, the solution to the brachistochrone problem, in parametric form, is

0
) sin ( x a x , 3 # 6 6 , ). cos 1 ( 6 3 # a y

These are the equations of a cycloid generated by the motion of a fixed point on
the circumference of a circle of radius a which rolls on the positive side of the
line y = 0, that is, on the underside of the x-axis. There exists one and only one
cycloid through the origin and the point (x
B
, y
B
); a suitable choice of a and x
0
will
give this cycloid [5].
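As a small numerical aside (not part of the survey), the constants of this cycloid can be pinned down for a made-up end point by a one-dimensional root find: with x_0 = 0, one solves (θ − sin θ)/(1 − cos θ) = x_B/y_B for θ_B and then sets a = y_B/(1 − cos θ_B). The sketch below uses scipy's brentq for the root.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical end point; the start point A is the origin, so x_0 = 0
x_B, y_B = 2.0, 1.0

# Solve (theta - sin theta) / (1 - cos theta) = x_B / y_B for theta in (0, 2*pi)
g = lambda th: (th - np.sin(th)) / (1 - np.cos(th)) - x_B / y_B
theta_B = brentq(g, 1e-6, 2 * np.pi - 1e-6)
a = y_B / (1 - np.cos(theta_B))
print(f"cycloid parameters: a = {a:.4f}, theta_B = {theta_B:.4f}")
```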


Euler, Maupertuis, and the Principle of Least Action

The brilliant and prolific Swiss mathematician Leonhard Euler (1707-1783) had close ties
to the Bernoulli family. Not only was his father, Paul Euler, friends with Johann but Paul
had also lived in Jakobs house while he studied theology at the University of Basel.
Paul Euler had high hopes that, following in his footsteps, his son would become a
Protestant minister. However, it was not long before Johann, who was Leonhards
mentor, noticed the young boys mathematical ability while he was a student (at the age
of fourteen) at the University of Basel. In Eulers own words:

I soon found an opportunity to be introduced to a famous professor Johann
Bernoulli. ... True, he was very busy and so refused flatly to give me private
lessons; but he gave me much more valuable advice to start reading more
difficult mathematical books on my own and to study them as diligently as I
could; if I came across some obstacle or difficulty, I was given permission to
visit him freely every Sunday afternoon and he kindly explained to me
everything I could not understand[18]

Given his close relationship with the Bernoullis, it is not surprising that Euler became
interested in the calculus of variations. As early as 1728, Leonhard Euler had already
written On finding the equation of geodesic curves. By the 1730s, he was concerning
himself with isoperimetric problems.

In 1744, Euler published his landmark book Methodus inveniendi lineas curvas maximi
minimive proprietate gaudentes, sive solutio problematis isoperimetrici latissimo sensu
accepti (A method for discovering curved lines that enjoy a maximum or minimum
property, or the solution of the isoperimetric problem taken in the widest sense). Some
mathematicians date this as the birth of the theory of the calculus of variations [14].

Euler took the methods used to solve specific problems and systematized them into a
powerful apparatus. With this method, he was then able to study a very general class of
problems. His opus considered a variety of geodesic problems, various modified and
more general brachistochrone problems (such as considering the effects of a resistance to
the falling body), problems involving isoperimetric constraints, and even questions of
invariance. Although few mathematicians before Euler would give a second thought to
such things, he examined whether his fundamental conditions would remain unchanged
under general coordinate transformations. (These questions were not completely
resolved until the twentieth century.)

Also in this publication, it was shown for the first time that in order for y(x), satisfying
$$I = \int_{x_A}^{x_B} f(x, y, y')\, dx, \qquad y(x_A) = y_A, \quad y(x_B) = y_B, \quad x_A < x_B,$$
to yield a minimum of I, a necessary condition is the so-called Euler-Lagrange
equation (which first appeared in Euler's work eight years previously)
$$\frac{\partial f}{\partial y} - \frac{d}{dx} \frac{\partial f}{\partial y'} = 0.$$

Another important element of Eulers exposition was his statement and discussion of a
very important principle in mechanics. However, it has also been attributed to another,
lesser, mathematician.

In two papers read to l'Académie des Sciences in 1744, and to the Prussian Academy in
1746, the French mathematician Maupertuis (1698-1759) proclaimed to the world le
principe de la moindre quantité d'action, or the principle of least action. In an almost
Pythagorean spirit, Maupertuis said that la Nature, dans la production de ses effets, agit
toujours par les moyens les plus simples. (In her actions, Nature always works by the
simplest methods.) He believed that physically, things unfold in Nature in such a way
that a certain quality, which he called the action, is always minimized. While
Maupertuiss intuition was good, he certainly lacked the logical motivation and clarity of
Euler. His definition of the action was vague. His rationale in developing this principle
was somewhat mystical. He sought to develop not only a mathematical foundation for
mechanics but a theological one as well. He went so far as to say, in his Essai de
cosmologie (1759), that the perfection of God would be incompatible with anything other
than utter simplicity and the minimum expenditure of action!

Notre principe, plus conforme aux idées que nous devons avoir des choses, laisse
le Monde dans le besoin continuel de la puissance du Créateur, et est une suite
nécessaire de l'emploi le plus sage de cette puissance... Ces loix si belles et si
simples sont peut-être les seules que le Créateur et l'Ordonnateur des choses a
établies dans la matière pour y opérer tous les phénomènes de ce Monde visible.

(Our principle, which conforms better to the ideas that we should have about
things, leaves the world in constant need of the strength of the Creator and
follows necessarily from the most wise use of this strength... These simple and
beautiful laws are perhaps the only ones that the Creator and Organizer of all
things has put in place to carry out the workings of the visible world.) [20]

Returning to Euler, and his magnificent work of 1744, we see strikingly similar ideas but
without the theological overtones. Near the beginning of the section on the principle of
least action, Euler writes:

Since all the effects of Nature follow a certain law of maxima or minima, there is
no doubt that, on the curved paths, which the bodies describe under the action of
certain forces, some maximum or minimum property ought to obtain. What this
property is, nevertheless, does not appear easy to define a priori by proceeding
from the principles of metaphysics; but since it may be possible to determine
these same curved paths by means of a direct method, that very thing which is a
maximum or minimum along these curves can be obtained with due attention
being exhibited. But above all the effect arising from the disturbing forces ought
especially to be regarded; since this [effect] consists of the motion produced in
the body, it is consonant with the truth that this same motion or rather the
aggregate of all motions, which are present in the body ought to be a minimum.
Although this conclusion does not seem sufficiently confirmed, nevertheless if I
show that it agrees with a truth known a priori so much weight will result that all
doubts which could originate on this subject will completely vanish. Even better
when its truth will have been shown, it will be very easy to undertake studies in
the profound laws of Nature and their final causes, and to corroborate this with
the firmest arguments [11].

As often happens in mathematics even today, there was a bitter dispute as to the priority
of the discovery of the principle of least action. In 1757, the mathematician Knig
produced a letter supposedly written by Leibniz in 1707 that contained a formulation of
the principle of least action. At the time, Maupertuis, who was a headstrong and virulent
man, was the president of the Prussian Academy and had a sharp reaction to this claim.
He accused his fellow-member of plagiarism and was convinced that the letter was a
forgery. Ironically, Euler sided with his French colleague in this affair, even though it is
possible (and perhaps most likely) that it was Euler himself who was the first to put his
finger on the principle.

An additional topic of interest stemming from Eulers opus of 1744 is that of minimal
surfaces. One of the most fascinating areas of geometry, minimal surfaces are obtained
from the calculus of variations as portions of surfaces of least area among all surfaces
bounded by a given space curve. Euler discovered the first non-trivial such surface, the
catenoid, which is generated by rotating a catenary (i.e. a cosh curve or the curve of a
hanging chain); for example, where r is the distance in 3-dimensional space
from the x-axis [5]. We will have more to say about minimal surfaces later.
, cosh x A r #

While it is true that a short time later, Eulers technique was superseded by that of
Lagrange (as we shall soon see), at the time it was completely new mathematics. His
systematic methods, in an elegant form, were remarkable for their clarity and insight. As
the twentieth century mathematician Carathéodory, who edited Euler's works, wrote in
the introduction,

[Eulers book] is one of the most beautiful mathematical works ever written. We
cannot emphasize enough the extent to which that Lehrbuch over and over again
served later generations as a prototype in the endeavour of presenting special
mathematical material in its [logical, intrinsic] connection [14].

Lagrange

In 1755, a 19-year-old from Turin sent a letter to Euler that contained details of a new
and beautiful idea. Eulers correspondent, Ludovico de la Grange Tournier, was no
ordinary teenager. Less than two months after he wrote that fateful letter to Euler, the
man we now know as Joseph-Louis Lagrange (1736-1813) was appointed professor of
mathematics at the Royal Artillery School in Turin. His rare gifts, his humility, and his
devotion to mathematics made him one of the giants of eighteenth century mathematics.
He contributed much groundbreaking work in fields as diverse analysis, number theory,
algebra, and celestial mechanics. However, it was with the calculus of variations that his
early reputation was made.

In his first letter to the legendary Swiss mathematician, Lagrange showed Euler how to
eliminate the tedious geometrical methods from his process. Essentially, Lagrange had
developed the idea of comparison functions (like the η(x) function used in the
mathematical background section above), which led almost directly to the Euler-
Lagrange equation. After considering Lagranges method, Euler became an instant
convert, dropped his old geometrical methods, and christened the entire field by the name
we now use, the calculus of variations, in honour of Lagranges variational method [11].

With the recipe reduced to a much simpler analytic method, even more general results
could be obtained. The following year, in 1756, Euler read two papers to the Berlin
Academy in which he made liberal use of Lagranges method. In his first paper, he was
quick to give the young man from Turin his due:

Even though the author of this [Euler] had meditated a long time and had
revealed to friends his desire yet the glory of first discovery was reserved to the
very penetrating geometer of Turin, Lagrange, who having used analysis alone,
has clearly attained the very same solution which the author had deduced by
geometrical considerations [11].

The two great mathematicians corresponded frequently over the next few years, with
Lagrange working hard to extend the theory. Toward the end of 1760, he was able to
publish a number of his results in Miscellanea Taurinensia, a scientific journal in Turin,
under the title Essai d'une nouvelle méthode pour déterminer les maxima et les minima
des formules intégrales indéfinies (Essay on a new method for determining maxima and
minima for formulas of indefinite integrals). Solutions to more general problems were
investigated for the first time, such as variable end-point brachistochrone problems,
finding the surface of least area among all those bounded by a given curve (a problem
that we associate today with Plateau), and finding the polygon whose area is greatest
among all those that have a fixed number of sides. An apt résumé of the advances of the
new theory comes from the pen of Lagrange himself:

Euler is the first who has given the general formula for finding the curve along
which a given integral expression has its greatest valuebut the formulas of this
author are less general than ours: 1. because he only permits the single variable y
in the integrand to vary; 2. because he assumes that the first and last points of the
curve are fixedBy the methods which have been explained one can seek the
maxima and minima for curved surfaces in a most general manner that has not
been done till now [11].

It was also in this early work of Lagrange that his famous rule of multipliers was first
discussed. However, the generality and power of the method was not clear to him at that
time and it was not until his path-breaking tour de force Méchanique analytique (1788),
that he clearly expressed the rule in its modern form.

When trying to extremize a function, often difficulties arise when the function is subject
to certain outside conditions or constraints. In principle, we could use each constraint to
eliminate one variable at a time, thereby reducing the problem progressively to a simpler
and simpler one. However, this can be both tedious and time consuming. Lagranges
method of multipliers is a powerful tool that allows for solutions to the problem without
having to solve the conditions or constraints explicitly. Let us now show the solution of
such a problem, arising from a simple quantum mechanical system.

Consider the problem of a particle of mass m in a box, which we can consider as
a parallelepiped with sides a, b, and c. The so-called ground state energy of the
particle is given by

    E = (h²/8m) (1/a² + 1/b² + 1/c²),


where h is Planck's constant. Now suppose we wish to find the shape of the box
that will minimize the energy E, subject to the constraint that the volume of the
box is constant, i.e.

    V(a, b, c) = abc = k.

Essentially, we need to minimize the function E(a, b, c) subject to the constraint
g(a, b, c) = abc - k = 0. For the variable a, this implies that

    ∂E/∂a + λ ∂g/∂a = -h²/(4ma³) + λbc = 0,

where λ is an arbitrary constant (called the Lagrange multiplier). Of course we
have similar equations for the other variables:

    -h²/(4mb³) + λac = 0,        -h²/(4mc³) + λab = 0.

After multiplying the first equation by a, the second by b, and the third by c, we
obtain

    h²/(4ma²) = h²/(4mb²) = h²/(4mc²) = λabc.

Hence, our solution is a = b = c, which is a cube. Notice how we did not even
need to determine the multiplier λ explicitly [2].
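As a quick numerical sanity check (not part of the original essay), one can hand this
constrained minimization to a standard solver and confirm that the optimal box is a cube; the
values of h, m and k below are arbitrary illustrative choices.

    # Minimal numerical check of the particle-in-a-box example (illustrative values for h, m, k).
    import numpy as np
    from scipy.optimize import minimize

    h, m, k = 1.0, 1.0, 8.0          # arbitrary units; only the shape of the optimum matters

    def energy(dims):
        a, b, c = dims
        return h**2 / (8 * m) * (1/a**2 + 1/b**2 + 1/c**2)

    volume_constraint = {"type": "eq", "fun": lambda d: d[0] * d[1] * d[2] - k}

    res = minimize(energy, x0=[1.0, 2.0, 4.0], constraints=[volume_constraint],
                   bounds=[(1e-3, None)] * 3)
    print(res.x)   # approximately [2., 2., 2.]: the least-energy box of volume k = 8 is a cube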

The Mchanique analytique was an ambitious undertaking, as it summarized all the work
done in the field of classical mechanics since Newton. In fact, as books on mechanics go,
it is mentioned in the same breath as Newtons Philosophiae naturalis principia
mathematica. Whereas Newton considered most problems from the geometrical point of
view, Lagrange did everything with differential equations. In the preface, he even states
that

one will not find figures in this work. The methods that I expound require
neither constructions, nor geometrical or mechanical arguments, but only
algebraic operations, subject to a regular and uniform course [11].

Classical mechanics had really come of age with Lagrange. Building on the great
insights of Euler, Lagrange was able to rescue mechanical problems from the tedium of
geometrical methods. His approach is still meaningful today and it forms one of the
cornerstones of the mathematical framework of modern theoretical physics. As it turns
out, there was still much work to be done in the calculus of variations. There were
unforeseen problems with the approach of Euler and Lagrange. However, let us pay our
debt to Lagrange by remembering the words of Carl Gustav Jacob Jacobi (1804-1851),
who was one of the main contributors to the theory of variational problems in the nineteenth
century:

By generalizing Eulers method he arrived at his remarkable formulas which in
one line contain the solution of all problems of analytical mechanics.

[In his Memoir of 1760-61] he created the whole calculus of variations with one
stroke. This is one of the most beautiful articles that has ever been written. The
ideas follow one another like lightning with the greatest rapidity [14].





Legendre

In 1786, Adrien-Marie Legendre (1752-1833) presented a memoir to the Paris Academy
entitled Sur la manière de distinguer les maxima des minima dans le calcul des variations
(On the method of distinguishing maxima from minima in the calculus of variations).
Legendre was a well-known mathematician from Paris who developed many analytical
tools for problems in mathematical physics and served as editor for Lagrange's
Méchanique analytique.

Legendre considered the problem of determining whether an extremal is a minimizing or
a maximizing arc. Let us recall that in extrema problems of one variable calculus, we
consider not only points where the first derivative vanishes, but we also study the second
derivative at these points. Similarly, Legendre examined the second variation of the
functional, motivated by the theorem of Taylor:

    I[ỹ + εη] = I[ỹ] + ε δI + (ε²/2!) δ²I + . . . ,

where the second variation is

    δ²I = ∫_{x_A}^{x_B} ( f_{yy} η² + 2 f_{yy'} η η' + f_{y'y'} (η')² ) dx.


Legendre was able to show the condition f_{y'y'} ≥ 0 along a minimizing curve and f_{y'y'} ≤ 0
along a maximizing curve, which is surprisingly similar to what we obtain in elementary
calculus in the second derivative test! In spite of the fact that he was on the right track,
Legendre's attempt to show that this condition is both necessary and sufficient was not
quite correct [11], [14]. The idea did not catch on, and by the time Lagrange levelled
several objections to the second variation approach in his Théorie des fonctions
analytiques (1797), it appeared that the death knell had sounded for Legendre's
innovative idea.

Jacobi

It was not until fifty years after Legendre's initial discovery of the second
variation condition that another mathematician took up the task of developing the theory
even further. In 1836, in a paper remarkable for its brevity and obscurity, Jacobi
demonstrated rigourously what we now call the Jacobi condition, namely that:

For a local minimum, it suffices to have both of the following satisfied:

1) f_{y'y'} > 0, and

2) x_B closer to x_A than to the conjugate point of x_A, which is the first value x > x_A
where a nonzero solution w(x) of

    d/dx( f_{y'y'} dw/dx ) - ( f_{yy} - d/dx f_{yy'} ) w = 0,    w(x_A) = 0,    x_A < x,

vanishes. [Here w(x) = ∂y/∂α evaluated at α = 0, and α = 0 corresponds to ỹ in the family of
extremals y = y(x, α).]

Another way to say this is that when the two conditions above are satisfied, then there
exists a minimizing ỹ among y ∈ C¹[x_A, x_B] satisfying the boundary conditions y(x_A) = y_A
and y(x_B) = y_B, and satisfying:

(a) |y - ỹ| ≤ ε,    (b) |y' - ỹ'| ≤ ε,    for small positive ε.
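As a small illustration (our own, not from the essay), the sketch below forms the Legendre
quantity f_{y'y'} and the Jacobi accessory equation for the arclength integrand
f = sqrt(1 + y'²), whose extremals are straight lines; it confirms condition 1) and the absence
of conjugate points.

    # Sketch (our own notation): Legendre condition and Jacobi equation for f = sqrt(1 + y'^2).
    import sympy as sp

    x = sp.symbols('x')
    p = sp.symbols('p')                      # stands for y'

    f = sp.sqrt(1 + p**2)
    f_pp = sp.diff(f, p, 2)                  # Legendre quantity f_{y'y'}
    print(sp.simplify(f_pp))                 # 1/(1 + p**2)**(3/2) > 0, so condition 1) holds

    # Along an extremal of arclength (a straight line) y' is constant and f_{yy} = f_{yy'} = 0,
    # so the Jacobi accessory equation reduces to w'' = 0.
    w = sp.Function('w')
    jacobi_eq = sp.Eq(sp.diff(w(x), x, 2), 0)
    sol = sp.dsolve(jacobi_eq, w(x), ics={w(0): 0, w(x).diff(x).subs(x, 0): 1})
    print(sol)                               # w(x) = x: vanishes only at x = 0, so no conjugate point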

This paper, entitled Zur Theorie der Variations-Rechnung und der Differential-
Gleichungen (On the calculus of variations and the theory of differential equations), was
so terse that rigourous proofs were not given but instead were hinted at [11], [14].
Perhaps, as one mathematical historian has suggested, Jacobi was in a rush to publish his
results first to ensure intellectual priority [11]. It is difficult to agree with such a theory
since progress in this field had stagnated for half a century! In any case, it was an
opportunity for numerous mathematicians to provide further elucidation and commentary
in the years that followed.




Hamilton-Jacobi Theory

While not directly connected with the development of the theory of the calculus of
variations, it is timely to draw attention to another aspect of Jacobis work. In the mid
1830s, an Irish mathematician named William Rowan Hamilton (1805-1865)
developed the foundations of what we now call Hamiltonian mechanics. Closely related
to the methods developed by Lagrange, Hamilton showed that under certain conditions,
problems in mechanics involving many variables and constraints can be reduced to an
examination of the partial derivatives of a single function, which we now appropriately
call the Hamiltonian. In the original papers of 1834 and 1835, some rigour was lacking
and Jacobi was quick to step in. Hamilton did not show under which conditions he could
be certain that his equations possessed solutions. In 1838, Jacobi was able to rectify this,
in addition to showing that one of the equations Hamilton studied was redundant. Due to
the tidying up and simplification performed by Jacobi, many modern books on classical
mechanics refer to this approach as Hamilton-Jacobi theory [11].

Nineteenth Century Applications to Other Fields: Edgeworth and Poe

By the nineteenth century, mathematical methods had advanced further than many had
dreamed possible. Previously unsolved problems in physics, astronomy, engineering, and
technology were being overcome at last. New theories were being developed at a speed
never seen before, with a startling predictive nature that few imagined possible. One only
needs to consider Newtonian mechanics, the developments in understanding
thermodynamic systems, or especially, the elegant systematization of the theories of
electricity and magnetism laid out in Maxwell's equations. How natural, then, that people
tried to apply the same powerful techniques to other disciplines. In some cases, a
measure of success was attained. In other cases, the results seem laughable.

In 1881, a book appeared with the title Mathematical Psychics: An Essay on the
Application of Mathematics to the Moral Sciences [19]. The author was Francis
Edgeworth (1845-1926), an English economist. A primary goal of the text was to
construct a model of human science in which ethics can be viewed as a science. Today,
the book is remembered chiefly for the merit of its ideas for economic theory. For us, the
most interesting part of the book is the section on utilitarian calculus. Inspired by the
utilitarian Jeremy Bentham (1748-1832), Edgeworth used the mathematical techniques of
the calculus of variations in an effort to extremize the happiness function, or a function
that was designed to measure the achievement of the ultimate good in society.

Defining fundamental units of pleasure within the context of human interpersonal
contracts, Edgeworth was able to obtain an equation involving the sum over all
individuals utility. Despite variations from point to point, Edgeworth hypothesized that
there would exist a locus at which the sum of the utilities of the individuals is a maximum.
Edgeworth called this the utilitarian point. Edgeworth was quick to realize that the
Benthamite slogan, the greatest happiness of the greatest number needed restating in a
more precise form. After some mathematical labour, he was able to show that the
ultimate good was to be conceived as the maximum value of the triple integral over the
variables pleasure, individuals, and time.

In retrospect, it is hardly surprising that this treatise had no impact on the development of
moral and ethical philosophy.

Caught up in the spirit of things, and inspired by the writings of the greatest
mathematicians on the calculus of variations, Edgar Allan Poe (1809-1849) published a
story in 1841 called A Descent into the Maelström [12]. In the story, the protagonist is able
to survive a violent storm by noting certain critical properties of solids moving in a
resisting medium:

...what I observed was, in fact, the natural consequence of the forms of floating
fragments...a cylinder, swimming in a vortex, offered more resistance to its
suction, and was drawn in with greater difficulty than any equally bulky body, of
any form whatever.

Poe was inspired, no doubt, by Newtons Principia. Fortunately for Poe, good science is
not needed in order to tell a good story. In the story, it is claimed that the sphere offered
the minimum resistance, although Newton showed long ago that this is not the case. In
addition, Newtons results were only good for bodies moving through a motionless fluid,
not a violent sea. In any case, it is still a good example of how science can motivate the
creative arts.


Riemann, Dirichlet, and Weierstrass

It is surprising to discover that the development of the theory of the calculus of variations
not only impacted physical problems and the theory of partial differential equations, but
also the fields of classical analysis and functional analysis. In the mid-1800s, many
mathematicians, such as Bernhard Riemann (1826-1866) and Gustav Lejeune Dirichlet
(1805-1859) searched for general solutions to boundary value and initial value problems
of partial differential equations arising in physical problems. Problems of this type are of
great importance in physics, as they are basic to the understanding of gravitation,
electrostatics, heat conduction, and fluid flow. One of the problems that attracted many
of the top mathematicians of the day was an existence proof of a solution u, in a general
domain B, satisfying:

    ∇²u = 0 in B;    u = f on ∂B,    u ∈ C²(B) ∩ C⁰(B̄),    B ⊂ R² or R³,

where ∇²u = u_xx + u_yy + u_zz. This is known as a Dirichlet problem. Riemann used
principles from the calculus of variations to develop a proof of this, which was a problem
he had first seen in lectures by Dirichlet. He named it Dirichlet's principle and stated it
as follows:

There exists a function u that satisfies the condition above and that minimizes the
functional

    D[u] = ∫_B |∇u|² dV,    B ⊂ R² or R³,

among all functions u ∈ C²(B) ∩ C⁰(B̄) which take on given values f on the
boundary ∂B of B.

Dirichlets principle had been used earlier by Gauss (1839) and Lord Kelvin (1847)
before Riemann used the principle in 1851 in order to obtain fundamental results in
potential theory using complex analytic functions [15], [16]. However, something was
not quite right with the theory. As one mathematician noted:

It was a strange situation. Dirichlets principle had helped to produce exciting
basic results but doubts about its validity began to appear, first in private remarks
of Weierstrass - which did not impress Riemann, who placed no decisive value
on the derivation of his existence theorems by Dirichlets principle - and then,
after both Dirichlet and Riemann had died, in Weierstrasss public address to the
Berlin Academy...

As it turns out, there was a fundamental conceptual error involved in the faulty method of
proof employed by Riemann. He failed to distinguish the differences between a greatest
lower bound and a minimum for the Dirichlet problem. Karl Weierstrass (1815-1897)
was the first to point out that in some cases, a minimizing function can come arbitrarily
close to the lower bound without ever reaching it.

The breakdown of Dirichlets principle (which had been the basis for many new results)
turned out to be very beneficial for the theory of analysis. In an effort to patch up the
theory, three new methods of existence proofs were developed, by Hermann Schwarz
(1843-1921), Henri Poincar (1854-1912), and Carl Neumann (1832-1925) [15].

Beginning in the 1870s, Weierstrass gave the theory of the calculus of variations a
complete overhaul. It took quite some time for these results to become widely known to
the rest of the mathematical community, principally through the dissertations of his
graduate students. Known for his rigourous approach to mathematics, Weierstrass was
the first to stress the importance of the domain of the functional that one is trying to
minimize. He also examined the family of admissible functions satisfying all of the
constraints. His most notable accomplishment was the fact that he gave the first ever
completely correct sufficiency theorem for a minimum. Two new concepts, the field of
extremals and the E-function, were developed in order to tackle the problem of
sufficiency and a new type of minimum (a so-called strong minimum) was defined [15],
[16].

Philosophical Interlude

To the applied mathematician or physicist, all of this work to define conditions of
sufficiency for the existence of an extremum might sound like splitting hairs. As Goethe
wrote in Maxims and Reflections,

Mathematicians are like a certain type of Frenchman: when you talk to them they
translate it into their own language, and then it soon turns into something
completely different.

For problems in mechanics, for example, the Euler-Lagrange equation works perfectly
well ninety-nine times out of a hundred - and when it doesnt, then it should be physically
obvious. This point of view was expressed by Gelfand and Fomin:

...the existence of an extremum is often clear from the physical or geometric
meaning of the problem, e.g., in the brachistochrone problem... If in such a case
there exists only one extremal satisfying the boundary conditions of the problem,
this extremal must perforce be the curve for which the extremum is achieved [17].

The rigourous mathematician would surely answer that in mathematics, conclusions
should be logically deducible from initial hypotheses. And when it comes to a physical
model, the mathematician would no doubt remind us that we should be mindful of the
assumptions and idealizations we make for the sake of simplicity, and the consequences
these assumptions entail.

In reality, what is truly surprising is not that mathematicians fought over the smallest
details of the calculus of variations for more than one hundred years, but that it took so
long for anyone to realize the elementary mistakes that Euler made when he first
examined these problems. A twentieth century mathematician, L.C. Young, remarked at
length on this oversight in his excellent book, Lectures on the Calculus of Variations and
Optimal Control Theory [21]. It is rewarding to see how he puts things into perspective:

In the Middle Ages, an important part was played by the jester: a little joke that
seemed so harmless could, as its real meaning began to sink in, topple kingdoms.
It is just such little jokes that play havoc today with a mathematical theory: we
call them paradoxes.

Perron's paradox runs as follows: Let N be the largest positive integer. Then
for N > 1 we have N² > N, contrary to the definition of N as largest. Therefore
N = 1.

The implications of this paradox are devastating. In seeking the solution to a
problem, we can no longer assume that this solution exists. Yet this assumption
has been made from time immemorial, right back in the beginnings of elementary
algebra, where problems are solved starting off with the phrase: Let x be the
desired quantity.

In the calculus of variations, the Euler equation and the transversality conditions
are among the so-called necessary conditions. They are derived by exactly the
same pattern of argument as in Perrons paradox; they assume the existence of a
solution. This basic assumption is made explicitly, and it is then used to
calculate the solutions whose existence was postulated. In the class of problems
in which the basic assumption is valid, there is nothing wrong with doing this.
But what precisely is this class of problems? How do we know that a particular
problem belongs to this class? The so-called necessary conditions do not answer
this. Therefore a solution derived by necessary conditions only is simply no
valid solution at all.

It is strange that so elementary a point of logic should have passed unnoticed for
so long! The first to criticize the Euler-Lagrange method was Weierstrass,
almost a century later. Even Riemann made the same unjustified assumption in
his famous Dirichlet principle...

The main trouble is that, as Perrons paradox shows, the fact that a solution has
actually been calculated in no way disposes of the logical objection to the
original assumption.

A reader may here interpose that, in practice, surely this is not serious and would
lead no half competent person to false results; was not Euler at times logically
incorrect by todays standards, but nonetheless correct in his actual conclusions?
Do not the necessary corrections amount to no more than a sprinkling of
definitions, which his insight perhaps took into account, without explicit
formulation?

Actually, this legend of infallibility applies neither to the greatest mathematicians
nor to competent or half competent persons, and the young candidate with an
error in his thesis does not disgrace his calling... Newton formulated a variational
problem of a solid of revolution of least resistance, in which the law of resistance
assumed is physically absurd and ensures that the problem has no solution (the
more jagged the profile, the less the assumed resistance), and this is close to
Perron's paradox. If this had been even approximately correct, after removing
absurdities, there would be no need today for costly wind tunnel experiments.
Lagrange made many mistakes. Cauchy made one tragic error of judgment in
rejecting Galoiss work. The list is long. Greatness is not measured negatively,
by absence of error, but by methods and concepts which guide further
generations [21].


Twentieth Century Developments

With the calculus of variations on a relatively firm foothold, aided by the rigourous work
of the school of Weierstrass, things were set for the theory to develop even further. In his
famous turn-of-the-century address to the International Congress of Mathematicians in
Paris, David Hilbert (1862-1943) made mention of the calculus of variations on several
occasions when discussing other problems. In addition, his twenty-third problem was a
call for the further elucidation of the theory:

So far, I have generally mentioned problems as definite and special as possible,
in the opinion that it is just such definite and special problems that attract us the
most and from which the most lasting influence is often exerted upon science.
Nevertheless, I should like to close with a general problem, namely with the
indication of a branch of mathematics repeatedly mentioned in this lecture
which, in spite of the considerable advancement lately given it by Weierstrass,
does not receive the general appreciation which, in my opinion, is its due - I
mean the calculus of variations.

In the next few years, Hilbert and his associates continued where Weierstrass left off,
developing many new results and setting the stage for the next leap forward.

Morse Theory

Marston Morse (1892-1977) turned his eye to the global picture and developed the
calculus of variations in the large, with applications to equilibrium problems in
mathematical physics. We now call the field Morse theory. In a paper published in 1925
entitled Relations between the critical points of a real function of n independent variables,
Morse proved some important new results that had a big effect on global analysis, which
is the study of ordinary and partial differential equations from a topological point of view.
Much of his work depended on the results obtained by Hilbert and company [15].

Optimal Control Theory

Another new field developed in the twentieth century from the roots of the calculus of
variations is optimal control theory. A generalization of the calculus of variations, this
theory is able to tackle problems of even greater generality and abstraction. New
mathematical tools were developed, chiefly by Pontryagin, Rockafellar, and Clarke, that,
among other things, enabled nonlinear and nonsmooth functionals to be optimized.
While this may sound like a mathematical abstraction, in reality there are many physical
problems that can only be solved in such a manner. Two examples which come from the
engineering world are the problem of landing a spacecraft as softly as possible with the
minimum expenditure of fuel and the construction of an ideal column [9].

Minimal Surfaces

The minimal surfaces discovered by Euler have also played a substantial role in twentieth
century mathematics, during which time two Fields Medals were awarded for work
related to the subject. In 1936, Jesse Douglas won a Fields Medal for his solution to
Plateaus problem and in 1974, Enrico Bombieri shared a Fields Medal for his work on
higher dimensional minimal surfaces. It is becoming apparent that minimal surfaces are
found throughout nature. Examples are soap films, grain boundaries in metals,
microscopic sea animals (called radiolarians), and the spreading and sorting of embryonic
tissues and cells. In addition, minimal surfaces have proved popular in design, through
the work of the German architect Frei Otto, as well as in art, exemplified in the works of
J.C.C. Nitsche [1].

Physics

We have already seen the rich interplay between the mathematical methods used in the
calculus of variations and developments in understanding the natural laws of our universe.
Recall the least time principles of Fermat, Maupertuis, Euler, Lagrange, and Hamilton
and their effects on the history of optics and mechanics. The success of these variational
methods in solving physical problems is not surprising [9]. As Yourgrau and
Mandelstam point out:

Arguments involving the principle of least action have excited the imagination of
physicists for diverse reasons. Above all, its comprehensiveness has appealed, in
various degrees, to prominent investigators, since a wide range of phenomena
can be encompassed by laws differing in detail yet structurally identical. It
seems inevitable that some theorists would elevate these laws to the status of a
single, universal canon, and regard the individual theorems as mere instances
thereof. It further constitutes an essential characteristic of action principles that
they describe the change of a system in such a manner as to include its states
during a definite time interval, instead of determining the changes which take
place in an infinitesimal element of time, as do most differential equations of
physics. On this account, variational conditions are often termed integral
principles as opposed to the usual differential principles. By enforcing
seemingly logical conclusions upon arguments of this type, it has been claimed
that the motion of the system during the whole of the time interval is
predetermined at the beginning, and thus teleological reflections have intruded
into the subject matter. To illustrate this attitude: if a particle moves from one
point to another, it must, so to speak, consider all the possible paths between
the two points and select that which satisfies the action condition [20].


In 1948, motivated by a suggestion by P.A.M. Dirac, the American physicist Richard
Feynman (1918-1988) developed a completely new approach to quantum mechanics,
based on variational methods. Although not mathematically well-defined, the Feynman
path integral was what he called a summation over histories of the path of a particle.
Despite the fact that the original paper was rejected by one journal for being nothing new,
Feynmans original approach was ideally suited to extending quantum theory to a more
general framework, incorporating relativistic effects [10].

It did not take long for the mathematicians to come along and tidy up everything. Mark
Kac showed that Feynman's integral can be thought of as a special case of the Wiener
integral, developed by Norbert Wiener in the 1920s. With a rigourous mathematical
underpinning, physicists were then able to apply the new variational techniques to a host
of all quantum and statistical phenomena. Today, these methods are employed in the
monumental task of developing the so-called Grand Unified Theory.

As the field evolved from our search to understand the inner workings of Nature, perhaps
it is fitting to end this survey of the history of the calculus of variations with a quote from
someone still actively involved in this search. When asked about the role of the calculus
of variations in modern physics, Maxim Pospelov, a theoretical physicist specializing in
supersymmetry, had this to say:

The most notable change that the 20th century brought to physics is
the transition from a deterministic classical mechanics where the variation of
action leads to the equations of motion and single trajectory when the boundary
conditions are fixed to quantum mechanics that allows multiple trajectories and
determines the probability for a certain trajectory. The functional integral
approach to quantum mechanics and quantum field theory is the modern
language that everybody uses. All, absolutely all, physical processes in quantum
field theory can be studied as a variation of the vacuum-vacuum transition
amplitude in the presence of external sources over these sources.

Variational methods are often used in particular calculations when,
for example, one needs to find a complicated wave function when the exact
solution to the Schrödinger equation is not possible. I know that the variational
approach to the helium atom yields a very precise determination of its energy
levels and ionization threshold [7].




Bibliography

[1] Almgren F.J. (1982) Minimal surface forms. Math. Intelligencer Vol. 4 No.4, pp. 164-172.

[2] Arfken G. and Weber H. (2001) Mathematical Methods for Physicists. San Diego:
Academic Press.

[3] Ball J.M. (1998) The calculus of variations and materials science. Quart. Appl. Math. Vol.
56, No. 4, pp. 719-740.

[4] Buttazzo G. and Kawohl B. (1993) On Newtons problem of minimal resistance. Math.
Intelligencer Vol. 15 No.4, pp. 7-12.

[5] Byron F. and Fuller R. (1969) Mathematics of Classical and Quantum Physics. Reading:
Addison-Wesley.

[6] Cuomo S. (2000) Pappus of Alexandria and the Mathematics of Late Antiquity. Cambridge:
Cambridge University Press.

[7] Ferguson J. (2003) Private e-mail correspondence with M. Pospelov.

[8] Ferguson J. (2003) Private discussions with J. Ye.

[9] Ferguson J. (2003) Private discussions with W. Israel.

[10] Feynman, R. (1948) Space-time approach to non-relativistic quantum mechanics. Rev. Mod.
Phys. Vol. 20, No. 2, pp.367-387.

[11] Goldstine H. (1980) A History of the Calculus of Variations from the 17th through the 19th
Century. New York: Springer-Verlag.

[12] Gould S. (1985) Newton, Euler, and Poe in the calculus of variations. Differential geometry,
calculus of variations, and their applications. Gould S. (Ed.) New York: Dekker.

[13] Kirmser P. and Hu K-K. (1993) The shape of an ideal column reconsidered. Math.
Intelligencer Vol. 15 No.3, pp. 62-68.

[14] Kreyszig E. (1994) On the calculus of variations and its major influences on the mathematics of
the first half of our century. I. Amer. Math. Monthly Vol. 101, No. 7, pp. 674-678.

[15] Kreyszig E. (1994) On the calculus of variations and its major influences on the mathematics of
the first half of our century. II. Amer. Math. Monthly Vol. 101, No. 9, pp. 902-908.

[16] Kreyszig E. (1997) Interaction between general topology and functional analysis. Handbook of the
History of General Topology, Vol. 1, Aull C.E. and Lowen R. (Eds.), pp. 357-389, Kluwer Acad. Publ.,
Dordrecht.

[17] McShane E.J. (1989) The calculus of variations from the beginning through optimal control
theory. SIAM J. Cont. Optim. Vol. 27, No. 5, pp. 916-989

[18] O'Connor J.J. and Robertson E.F. MacTutor History of Mathematics Archive. http://www-
gap.dcs.st-and.ac.uk/~history/. 29 Nov. 2003.

[19] Wall B. (1978/79) F. Y. Edgeworths mathematical ethics. Greatest happiness with the
calculus of variations. Math. Intelligencer Vol. 1, No.3, pp. 177-181.

[20] Yourgrau W. and Mandelstam S. (1968) Variational Principles in Dynamics and Quantum
Theory. London: Pitman & Sons.

[21] Young L.C. (1969) Lectures on the Calculus of Variations and Optimal Control Theory.
Philadelphia: W.B. Saunders Company.
AIRLINE SEAT ALLOCATION WITH MULTIPLE
NESTED FARE CLASSES
S. L. BRUMELLE
University of British Columbia, Vancouver, Canada
J. I. McGILL
Queen's University, Kingston, Ontario, Canada
(Received July 1990; revision received February 1991; accepted June 1992)
This paper addresses the problem of determining optimal booking policies for multiple fare classes that share the same
seating pool on one leg of an airline flight when seats are booked in a nested fashion and when lower fare classes book
before higher ones. We show that a fixed-limit booking policy that maximizes expected revenue can be characterized by
a simple set of conditions on the subdifferential of the expected revenue function. These conditions are appropriate for
either the discrete or continuous demand cases. These conditions are further simplified to a set of conditions that relate
the probability distributions of demand for the various fare classes to their respective fares. The latter conditions are
guaranteed to have a solution when the joint probability distribution of demand is continuous. Characterization of the
problem as a series of monotone optimal stopping problems proves optimality of the fixed-limit policy over all admissible
policies. A comparison is made of the optimal solutions with the approximate solutions obtained by P. Belobaba using
the expected marginal seat revenue (EMSR) method.
Subject classifications: Decision analysis, applications: stochastic. Integer: capacity allocation.
Transportation: airline seat inventory control, yield management.
Operations Research, Vol. 41, No. 1, January-February 1993. © 1993 Operations Research Society of America.

One of the obvious impacts of the deregulation of North American airlines has been increased price
competition and the resulting proliferation of discount fare booking classes. While this has had
the expected effect of greatly expanded demand for air travel, it has presented the airlines with a
significant tactical planning problem - that of determining booking policies that result in optimal
allocations of seats among the various fare classes. What is sought is the best tradeoff between
the revenue gained through greater demand for discount seats against revenues lost when full-fare
reservations requests must be turned away because of prior discount seat sales.

This problem is made more difficult by the tendency of discount fare reservations to arrive before
full-fare ones. This occurs because of the nature of the customers for the respective classes
(leisure travelers in the discount classes, business travelers in full fare) and because of early
booking restrictions placed on the discount classes. Thus, decisions about limits to place on the
number of discount fare bookings must often be made before any full-fare demand is observed.
Further complications are introduced by factors such as multiple-flight passenger itineraries,
interactions with other flights, cancellation and overbooking considerations, and the dynamic
nature of the booking process in the long lead-time before flight departure.

Prior work on this problem has tended to fall into one of two categories. First, attempts have been
made to encompass some or all of the above-mentioned complications with mathematical programming
and/or network models (Mayer 1976, Glover et al. 1982, Alstrup et al. 1986, Wollmer 1986, 1987,
Dror, Trudeau and Ladany 1988). Second, elements of the problem have been studied in isolation
under restrictive assumptions (Littlewood 1972, Bhatia and Parekh 1973, Richter 1982, Belobaba
1987, Brumelle et al. 1990, Curry 1990, Wollmer 1992). These studies have produced easy to apply
rules that provide some insight into the nature of good solutions. Such rules are suboptimal when
viewed in the context of the overall problem, but they can point the way to useful approximation
methods. The present paper falls into the second category.

This paper deals with the airline seat allocation problem when multiple fare classes are booked
into a common seating pool in the aircraft. The following assumptions are made:

1. Single flight leg: Bookings are made on the basis of a single departure and landing. No allowance is
made for the possibility that bookings may be part of larger trip itineraries.
2. Independent demands: The demands for the different fare classes are stochastically independent.
3. Low before high demands: The lowest fare reservations requests arrive first, followed by the
next lowest, etc.
4. No cancellations: Cancellations, no-shows and overbooking are not considered.
5. Limited information: The decision to close a fare class is based only on the number of current
bookings.
6. Nested classes: Any fare class can be booked into seats not taken by bookings in lower fare
classes.

The independent demands and low before high assumptions imply that at any time during the booking
process the observed demands in the fare class currently being booked and in lower fare classes
convey no information about future demands for higher fare classes. The limited information
assumption excludes the possibility of basing a decision to close a fare class on such factors as
the time remaining before the flight.

Assumptions 1-5 are restrictive when compared to the actual decision problem faced by airlines, but
analysis of this simplified version can both provide insights into the nature of optimal solutions
and serve as a basis for approximate solutions to more realistic versions.

The nesting of fare classes (assumption 6), which is a common practice in modern airline
reservation systems, suggests the following general approach to controlling bookings: set a fixed
upper limit for bookings in the lowest fare class; a second, higher limit for the total bookings in
the two lowest classes, and so on up to the highest fare class. Viewed in another way, such booking
limits establish protection levels for successive nests of higher fare classes.
The first useful result on the seat allocation problem (for two fare classes) was presented by
Littlewood. He proposed that an airline should continue to reduce the protection level for class-1
(full-fare) seats as long as the fare for class-2 (discount) seats satisfied

    f_2 ≥ f_1 Pr[X_1 > p_1],                                                        (1)

where f_i denotes the fare or average revenue from the ith fare class, Pr[.] denotes probability,
X_1 is full-fare demand, and p_1 is the full-fare protection level. The intuition here is clear -
accept the immediate return from selling an additional discount seat as long as the discount
revenue equals or exceeds the expected full-fare revenue from the seat.
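As a quick illustration of rule (1) (not part of the original paper), the sketch below computes
the smallest class-1 protection level satisfying f_2 ≥ f_1 Pr[X_1 > p_1] when full-fare demand is
modelled as normal; the fares and demand parameters are invented for illustration.

    # Littlewood's rule (1) with normally distributed full-fare demand (illustrative numbers).
    from scipy.stats import norm

    f1, f2 = 500.0, 200.0          # full fare and discount fare
    mu, sigma = 60.0, 15.0         # assumed mean and std. dev. of full-fare demand X1

    # Keep lowering the protection level while f2 >= f1 * Pr[X1 > p1];
    # equivalently, protect p1 = smallest level with Pr[X1 > p1] <= f2/f1.
    p1 = norm.ppf(1 - f2 / f1, loc=mu, scale=sigma)
    print(round(p1))               # roughly 64 seats protected for the full-fare class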
A continuous version of Littlewood's rule was derived in Bhatia and Parekh. Richter gave a marginal
analysis which proved that (1) gives an optimal allocation (assuming certain continuity conditions).

More recently, Belobaba (1987) proposed a generalization of (1) to more than two fare classes
called the Expected Marginal Seat Revenue (EMSR) method. In this approach, the protection level for
the highest fare class p_1 is obtained from

    f_2 = f_1 Pr[X_1 > p_1].                                                        (2)

This is just Littlewood's rule expressed as an equation, and it is appropriate as long as it is
reasonable to approximate the protection level with a continuous variable and to attribute a
probability density to the demand X_1. The total protection for the two highest fare classes p_2 is
obtained from

    p_2 = p_2^1 + p_2^2,                                                            (3)

where p_2^1 and p_2^2 are two individual protection levels determined from

    f_3 = f_1 Pr[X_1 > p_2^1]                                                       (4)

and

    f_3 = f_2 Pr[X_2 > p_2^2].                                                      (5)

The protection for the three highest fare classes is obtained by summing three individual
protection levels, and so on. This process is continued until nested protection levels p_k are
obtained for all classes except the lowest. The booking limit for any class k is then just
(C - p_{k-1}), where C is the total number of seats available.
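The EMSRa recipe (2)-(5) is easy to mechanize. The sketch below is our own, assuming independent
normal demands; the fares, demand parameters and capacity are invented for illustration.

    # EMSRa protection levels as in equations (2)-(5) for independent normal demands
    # (fares, demand parameters and capacity below are illustrative, not the paper's data).
    from scipy.stats import norm

    fares = [500.0, 350.0, 200.0, 120.0]            # f_1 > f_2 > f_3 > f_4
    means = [30.0, 45.0, 60.0, 80.0]
    sigmas = [10.0, 12.0, 15.0, 20.0]
    capacity = 160

    protection = []                                  # nested levels p_1, p_2, p_3
    for j in range(1, len(fares)):                   # protect classes 1..j against class j+1
        f_next = fares[j]
        # individual levels p_j^i solve f_{j+1} = f_i Pr[X_i > p_j^i]; the nested level is their sum
        p_j = sum(norm.ppf(1 - f_next / fares[i], loc=means[i], scale=sigmas[i])
                  for i in range(j))
        protection.append(p_j)

    booking_limits = [capacity - p for p in [0.0] + protection]   # limit for class k is C - p_{k-1}
    print([round(p) for p in protection], [round(b) for b in booking_limits])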
The EMSR method obtains optimal booking limits between each pair of fare classes regarded in
isolation, but it does not yield limits that are optimal when all classes are considered. While the
idea of comparing the expected marginal revenues from future bookings with current marginal
revenues is valid, the method outlined above does not in general lead to a correct assessment of
expected future revenues (except for the highest fare class). To avoid confusion, the EMSR
approximation described above will henceforth be referred to as the EMSRa method.

The nonoptimality of the EMSRa approach has been reported independently by McGill (1988), Curry
(1988), Wollmer (1988), and Robinson (1990). Curry (1990) derives the correct optimality conditions
when demands are assumed to follow a continuous probability distribution and generalizes to the
case that fare classes are nested on an origin-destination basis. Wollmer (1992) deals with the
discrete demand case and provides an algorithm for computing both the optimal protection levels and
the optimal expected revenue.
This paper makes the following contributions to the work on this problem:

1. The approach used (subdifferential optimization within a stochastic dynamic programming
framework) admits either discrete or continuous demand distributions and obtains optimality results
in a relatively straightforward manner.
2. The connection of the seat allocation problem to the theory of optimal stopping is demonstrated,
and a formal proof is given that fixed-limit booking policies are optimal within the class of all
policies that depend only on the observed number of current bookings.
3. We show that the optimality conditions reduce to a simple set of probability statements that
clearly characterize the difference between the EMSRa solutions and the optimal ones.
4. We show with a simple counterexample that the EMSRa method can either over- or underestimate the
optimal protection levels.
Specifically, we show that an optimal set of protection levels p_1*, p_2*, . . . must satisfy the
conditions

    δ_+ ER_k[p_k*] ≤ f_{k+1} ≤ δ_- ER_k[p_k*]    for each k = 1, 2, . . . ,         (6)

where ER_k[p_k] is the expected revenue from the k highest fare classes when p_k seats are
protected for those classes, and δ_+ and δ_- denote the right and left derivative with respect to
p_k, respectively. These conditions are just an expression of the usual first-order result - a
change in p_k away from p_k* in either direction will produce a smaller increase in expected
revenues than the immediate increase of f_{k+1}. The same conditions apply whether demands are
viewed as continuous random variables as in Curry (1990) or as discrete random variables as in
Wollmer (1992).

It is further shown that under certain continuity conditions these optimal protection levels can be
obtained by finding p_1*, p_2*, . . . that satisfy

    f_2 = f_1 Pr[X_1 > p_1*]
    f_3 = f_1 Pr[X_1 > p_1* ∩ X_1 + X_2 > p_2*]
    . . .                                                                           (7)
    f_{k+1} = f_1 Pr[X_1 > p_1* ∩ X_1 + X_2 > p_2* ∩ . . . ∩ X_1 + X_2 + . . . + X_k > p_k*].

These conditions have a simple and intuitive interpretation since, as noted in Robinson, the
probability term on the right-hand side of the general equation in (7) is simply the probability
that all remaining seats are sold. The first of these equations is identical to the first in the
EMSRa method, so the EMSRa method does derive the optimal protection level for the highest fare
class.
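Conditions (7) can be solved sequentially, one protection level at a time. The following Monte
Carlo sketch is our own construction under an assumed independent normal demand model with invented
fares: each p_k* is chosen so that the joint probability that all remaining seats are sold equals
f_{k+1}/f_1.

    # Monte Carlo sketch of conditions (7): given p_1*,...,p_{k-1}*, find p_k* with
    # f_{k+1} = f_1 * Pr[X_1 > p_1*, X_1+X_2 > p_2*, ..., X_1+...+X_k > p_k].
    # Demand distributions and fares are illustrative assumptions, not the paper's data.
    import numpy as np

    rng = np.random.default_rng(0)
    fares = [500.0, 350.0, 200.0, 120.0]                       # f_1 > f_2 > ...
    means, sigmas = [30.0, 45.0, 60.0], [10.0, 12.0, 15.0]     # demands X_1, X_2, X_3
    n = 200_000
    X = np.maximum(rng.normal(means, sigmas, size=(n, 3)), 0)  # truncate negative demand at 0
    S = np.cumsum(X, axis=1)                                    # partial sums X_1, X_1+X_2, ...

    protection = []
    survivors = np.ones(n, dtype=bool)                          # samples meeting earlier conditions
    for k in range(3):                                           # find p_1*, p_2*, p_3*
        target = fares[k + 1] / fares[0]                         # required probability in (7)
        # among samples already satisfying S_j > p_j* for j < k, choose p_k so that the
        # overall fraction with S_k > p_k equals the target probability
        vals = np.sort(S[survivors, k])
        m = len(vals)
        idx = max(m - int(round(target * n)), 0)                 # keep 'target * n' samples in total
        p_k = vals[idx]
        protection.append(p_k)
        survivors &= S[:, k] > p_k
        print(f"p_{k+1}* ~ {p_k:.1f}")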
The paper is organized as follows. The next section presents notation and assumptions. Section 2
gives the revenue function and its directional derivatives. In the following section, concavity
properties of the expected revenue function are established and results (6) and (7) are obtained.
We show that when demand is integer-valued there exist integer optimal solutions that satisfy (6),
and these solutions are optimal over the class of all policies that depend only on the history of
the demand process. The final section provides numerical comparisons of the EMSRa and optimal
solutions.
1. NOTATION AND ASSUMPTIONS
The demand for fare class k is X_k (k = 1, 2, . . .), where X_1 corresponds to the highest fare
class. We assume that these demands are stochastically independent. The vector of demands is
X = (X_1, X_2, . . .). Each booking of a fare class k seat generates average revenue of f_k, where
f_1 > f_2 > . . . .

Demands for the lowest fare class arrive first, and seats are booked for this class until a fixed
time limit is reached, bookings have reached some limit, or the demand is exhausted. Sales to this
fare class are then closed, and sales to the class with the next lowest fare are begun, and so on
for all fare classes. It is assumed that any time limits on bookings for fare classes are
prespecified. That is, the setting of such time limits is not part of the problem considered here.
It is possible, depending on the airplane capacity, fares, and demand distributions, that some fare
classes will not be opened at all.

A booking policy is a set of rules which specifies at any point during the booking process whether
a fare class that has not reached its time limit should be available for bookings. In general, such
policies may depend on the pattern of prior demands or be randomized in some manner. Any stopping
rule for fare class k which is measurable with respect to the σ-algebra generated by [X_k ∧ y] for
y = 0, 1, . . . is admissible. However, we first restrict attention to a simpler class of booking
policies, denoted by P, that can be described by a vector of fixed protection levels
p = (p_1, p_2, . . .), where p_k is the number of seats to be protected for fare classes 1-k. If at
some stage in the process described above there are s seats available to be booked and there is a
fare class k demand, then the seat will be booked if s is greater than the protection level p_{k-1}
for the k - 1 higher fare classes. (Restriction to this class of policies is implicit in previous
research in this area except for that of Brumelle et al.) The initial number of classes that are
open for any bookings is, of course, determined by setting s equal to the capacity of the aircraft
or compartment. We will show formally that the class P contains a policy that is optimal over the
class of all admissible policies.
2. THE REVENUE FUNCTION
The function R_k[s; p; x] is the revenue generated by the k highest fare classes when s seats are
available to satisfy all demand from these classes, when x = (x_1, x_2, . . .) is the demand
vector, and p = (p_1, p_2, . . .) is the vector of protection levels. We define the revenue
function recursively by
    R_1[s; p; x] = f_1 min(s, x_1),                                                 (8)

    R_{k+1}[s; p; x] =
        R_k[s; p; x]                                 for 0 ≤ s < p_k,
        (s - p_k) f_{k+1} + R_k[p_k; p; x]           for p_k ≤ s < p_k + x_{k+1},    (9)
        x_{k+1} f_{k+1} + R_k[s - x_{k+1}; p; x]     for p_k + x_{k+1} ≤ s,

for k = 1, 2, . . . .
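Equations (8)-(9) translate directly into a short recursion. The sketch below is our own
transcription; the class count, fares, protection levels and realized demands in the example are
invented.

    # Recursive revenue function R_k[s; p; x] from equations (8)-(9) (our own sketch).
    def revenue(k, s, p, x, fares):
        """Revenue from the k highest fare classes with s seats available.
        p[j] = protection level for classes 1..j (p[0] = 0), x[j] = demand of class j,
        fares[j] = fare of class j; lists are 1-indexed with a dummy entry at index 0."""
        if k == 1:
            return fares[1] * min(s, x[1])                      # equation (8)
        pk = p[k - 1]                                           # seats protected for classes 1..k-1
        if s < pk:                                              # class k never opens
            return revenue(k - 1, s, p, x, fares)
        if s < pk + x[k]:                                       # class k demand fills down to pk
            return (s - pk) * fares[k] + revenue(k - 1, pk, p, x, fares)
        return x[k] * fares[k] + revenue(k - 1, s - x[k], p, x, fares)   # all class k demand booked

    # Tiny example (numbers invented): 3 classes, 10 seats, protect 4 seats for class 1,
    # 7 for classes 1-2; realized demands x = (5, 4, 6).
    fares = [None, 500, 300, 150]
    p = [0, 4, 7]
    x = [None, 5, 4, 6]
    print(revenue(3, 10, p, x, fares))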
For convenience of notation, a dummy protection
level po will be introduced; its value will be identically
zero throughout. There is no limit to the number of
fare classes or to the corresponding lengths of the
protection and demand vectors; however, the revenue
from the k highest fares depends only on the pro-
tection levels (p_0, p_1, . . . , p_{k-1}) and the demands
(x_1, x_2, . . . , x_k). The symbols p and x will be used to
denote vectors of lengths which vary depending on
context, as in
The objective is to find a vector p that maximizes the
expected revenue ER_k[s; p; X] for all k. If s is viewed
as a real-valued variable, the function ER_k[s; p; X] is
continuous and piecewise linear on s > 0 and not
differentiable at the points s = p_k. Maximization of
this function can be accomplished either by treating
available seats s and protection limits p as integer-
valued and using arguments based on first differences,
or by treating these variables as continuous and using
standard tools of nonsmooth optimization. The sec-
ond approach will be used in this paper because it
permits greater economy of notation and terminology.
Note that the demands X can be discrete or continuous
in either case. In the case that demands are taken as
integer-valued, both approaches are equivalent for this
problem and yield the same set of integer optimal
solutions. The second approach may admit additional
noninteger optimal solutions, but these can easily
be avoided in practice. If the demands are approxi-
mated by continuous random variables, the second
approach may lead to noninteger optimal solutions.
This eventuality is discussed in subsection 3.3 under
implementation.
2.1. Marginal Value of an Extra Seat
This section develops the first-order properties of the
revenue function. The notation and terminology used
here and in what follows are consistent with
Rockafellar (1970) except that they have been modi-
fied in obvious ways to handle concave rather than
convex functions. Let δ_- and δ_+ denote the left and right derivatives with respect to the first
argument of the revenue or expected revenue functions. Thus, δ_- ER_k[s; (p_0, . . . , p_{k-1}); X]
is the left derivative of ER_k[.] with respect to s. (This slightly unconventional notation is
required because s, the number of seats remaining, will sometimes be replaced by p_k when the
argument is being viewed as a discretionary quantity.) For fixed p and x, the derivatives for the
revenue function are easy to compute from (8) and (9) to be

    δ_- R_1[s; p; x] = f_1 for 0 < s ≤ x_1,    0 for s > x_1,                        (10)

    δ_+ R_1[s; p; x] = f_1 for 0 ≤ s < x_1,    0 for x_1 ≤ s,                        (11)

and

    δ_+ R_{k+1}[s; p; x] =
        δ_+ R_k[s; p; x]              for 0 ≤ s < p_k,
        f_{k+1}                       for p_k ≤ s < p_k + x_{k+1},                   (12)
        δ_+ R_k[s - x_{k+1}; p; x]    for p_k + x_{k+1} ≤ s,

    δ_- R_{k+1}[s; p; x] =
        δ_- R_k[s; p; x]              for 0 < s ≤ p_k,
        f_{k+1}                       for p_k < s ≤ p_k + x_{k+1},                   (13)
        δ_- R_k[s - x_{k+1}; p; x]    for p_k + x_{k+1} < s.
Any continuous, piecewise-linear function f[s] is concave on s > 0 if and only if the right
derivative is less than or equal to the left derivative for any s. This condition can be extended
to the point s = 0 by defining δ_- f[0] = +∞. The subdifferential δf[s] is then defined for any
s ≥ 0 as the closed interval from δ_+ f[s] to δ_- f[s]. Given concavity, f[.] will be maximized at
any point s for which 0 ∈ δf[s].
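To see the subdifferential machinery in a concrete case (our own illustration, with an assumed
Poisson demand and invented fares): for integer demand, ER_1[s] = f_1 E[min(s, X_1)] is concave and
piecewise linear, its right and left differences are f_1 Pr[X_1 > s] and f_1 Pr[X_1 ≥ s], and an
optimal class-1 protection level is any integer s whose subdifferential contains f_2.

    # Subdifferential of ER_1 for discrete demand, and the protection level where f2 falls inside it.
    import numpy as np
    from scipy.stats import poisson

    f1, f2 = 500.0, 200.0
    demand = poisson(mu=8)                       # assumed class-1 demand distribution

    s_grid = np.arange(0, 30)
    delta_plus = f1 * demand.sf(s_grid)          # right derivative: f1 * Pr(X1 > s)
    delta_minus = f1 * demand.sf(s_grid - 1)     # left derivative:  f1 * Pr(X1 >= s)

    opt = [int(s) for s in s_grid if delta_plus[s] <= f2 <= delta_minus[s]]
    print(opt)                                   # integer protection levels with f2 in the subdifferential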
3. OPTIMAL PROTECTION LEVELS
This section establishes the optimality within the class P of protection levels determined by the
first-order conditions given in (6). We first consider a point in the booking process when s seats
remain unbooked, fare class k + 1 is being booked, and the decision of whether or not to stop
booking that class is to be made. That is, a decision on the value of the protection level p_k for
the remaining fare classes is to be made. The following lemma establishes a condition under which
concavity of the expected revenue function with respect to s is ensured, conditional on the value
of X_{k+1}. This leads to an argument by induction that concavity of the conditional expected
revenue function will be satisfied if (6) is satisfied for all the higher protection levels.
Finally, we show that condition (6) also guarantees optimality of p_k.
Lemma 1. If some policy p makes

    ER_k[s; (p_0, . . . , p_{k-1}); X]

concave on s ≥ 0 and if p_k* satisfies

    δ_+ ER_k[p_k*; p; X] ≤ f_{k+1} ≤ δ_- ER_k[p_k*; p; X],

then

    E( R_{k+1}[s; (p_0, . . . , p_{k-1}, p_k*); X] | X_{k+1} )

is concave on s ≥ 0 with probability 1.

Proof. It follows from the definition of the revenue function in (9) and the hypothesized concavity
of ER_k that E( R_{k+1}[s; p; X] | X_{k+1} ) is continuous on s > 0 and concave on the three
intervals 0 ≤ s < p_k*, p_k* ≤ s < p_k* + X_{k+1}, and p_k* + X_{k+1} ≤ s.

To complete the proof, it is enough to verify that

    δ_+ E( R_{k+1}[s; p; X] | X_{k+1} ) ≤ δ_- E( R_{k+1}[s; p; X] | X_{k+1} )         (15)

at the two points s = p_k* and s = p_k* + X_{k+1}. From (12) and (13), at s = p_k* the left
derivative is δ_- ER_k[p_k*; p; X] and the right derivative is f_{k+1} (when X_{k+1} > 0). By the
hypothesis of the lemma, inequality (15) must be satisfied. Again applying (12) and (13), at
s = p_k* + X_{k+1} the left derivative is f_{k+1} and the right derivative is δ_+ ER_k[p_k*; p; X].
By the hypothesis of the lemma, inequality (15) must be satisfied at s = p_k* + X_{k+1}.
Corollary 1. If, for some k ∈ {1, 2, . . .}, the conditions of the lemma hold, then

    ER_{k+1}[s; (p_0, . . . , p_{k-1}, p_k*); X]

is concave on s ≥ 0.

Proof. We have ER_{k+1}[s; p; X] = E( E( R_{k+1}[s; p; X] | X_{k+1} ) ). It follows from the
concavity of the conditional expectation on the right-hand side that ER_{k+1}[s; p; X] is concave.
(The expectation operator E and the differential operators δ_+ and δ_- can be interchanged because
R_{k+1} is bounded by f_1 s for all policies p and demand x.)
Theorem 1. Let p be any policy that satisfies

    f_{k+1} ∈ δER_k[p_k; (p_0, . . . , p_{k-1}); X]                                  (20)

for k = 1, 2, . . . . Then E(R_{k+1}[s; p; X] | X_{k+1}) is concave on s ≥ 0 for k = 1, 2, . . . .
Moreover, it is optimal to continue the sales of fare class k + 1 while more than p_k seats remain
unsold, and to protect p_k seats for the nest of the k highest fare classes.
Proof. From (10) and (11),

    δ_- R_1[s; p; X] = f_1 I{X_1 ≥ s}                                                (21)

and

    δ_+ R_1[s; p; X] = f_1 I{X_1 > s},                                               (22)

where I_A = 1 if condition A holds, and I_A = 0 otherwise. Hence
δ_+ R_1[s; p; X] ≤ δ_- R_1[s; p; X]. Thus, E( R_1[s; p; X] | X_1 ) is concave in s for any policy
p, and, given condition (20), the concavity assertion in the theorem follows from Lemma 1 by
induction.

To prove optimality of the protection level p_k it is necessary to examine the behavior of
ER_{k+1}[s; (p_0, . . . , p_{k-1}, y_k); X] as a function of y_k for any s. Denote the left
derivative, right derivative and subdifferential with respect to y_k by γ_-, γ_+, and γ,
respectively.

From (9),

    γ_+ R_{k+1}[s; (p_0, . . . , y_k); X] =
        0                                    for 0 ≤ s ≤ y_k,
        -f_{k+1} + δ_+ R_k[y_k; p; X]        for y_k < s < y_k + X_{k+1},            (23)
        0                                    for y_k + X_{k+1} ≤ s,

and

    γ_- R_{k+1}[s; (p_0, . . . , y_k); X] =
        0                                    for 0 < s < y_k,
        -f_{k+1} + δ_- R_k[y_k; p; X]        for y_k ≤ s < y_k + X_{k+1},            (24)
        0                                    for y_k + X_{k+1} ≤ s.

Recall that R_k[y_k; p; x] is independent of X_{k+1}. Taking the expectations of these derivatives
and reversing the order of differentiation and expectation yields the corresponding expressions
(25) and (26) for y_k < s < y_k + x_{k+1}. Conditions (20), (23), (24), (25) and (26) imply that
γ_+ ER_{k+1}[s; y; X] ≤ 0 ≤ γ_- ER_{k+1}[s; y; X] at y_k = p_k; that is, 0 ∈ γER_{k+1}[s; y; X].
Also, from (25), (26) and the concavity of ER_k[s; y; X] with respect to s, it follows that
ER_{k+1}[s; (p_0, . . . , y_k); X] is nondecreasing over y_k < p_k and nonincreasing over
y_k > p_k. Thus p_k maximizes ER_{k+1}[s; y; X], as required.
It has thus been established that condition (20) is
sufficient for optimality of a policy p. The next
theorem shows that there exist integer policies that are
optimal, given that demand is integer-valued.
In what follows, the abbreviation CLBI (for Concave
and Linear Between Integers) will denote that a reve-
nue or expected revenue function is concave and
piecewise linear with changes in slope only at integer
values of the domain. A CLBI function has the prop-
erty that the set of subdifferentials at integer points of
the domain covers all real numbers between any par-
ticular right derivative and any greater left derivative.
That is, a CLBI function f(x) satisfies the following covering property:

If c is a constant such that δ_+ f(s_2) < c < δ_- f(s_1), for some s_1 < s_2, then there is an
integer n ∈ [s_1, s_2] such that c ∈ δf(n).
Theorem 2. If the demand random variables X_1, X_2, . . . are integer-valued, there exists an
optimal integer policy p*.

Proof. (By induction): Taking expectations with respect to X_1 in (21) and (22) yields the
subdifferential

    δER_1[s; p; X] = [ f_1 Pr[X_1 > s], f_1 Pr[X_1 ≥ s] ].                           (27)

By inspection of (8) and (27) and the fact that demand is integer-valued, ER_1[s; p; X] is CLBI on
s ≥ 0. Furthermore, since demand is finite with probability 1, there is an s sufficiently large
that δ_+ ER_1[s; p; X] < f_2. (In practice a sufficiently large s might exceed the capacity of the
aircraft. However, in this case, there would be no need to find the next protection level.) Also,
by definition, δ_- ER_1[0; p; X] = +∞. Then the covering property of CLBI functions ensures the
existence of an integer p_1* that satisfies f_2 ∈ δER_1[p_1*; p; X]; that is, p_1* satisfies the
optimality condition (20) for k = 1.

Let d[x] denote the largest integer less than or equal to x, and u[x] the smallest integer greater
than or equal to x. Thus, d[x] = u[x - 1] when x is a noninteger, and d[x] = u[x - 1] + 1 when x is
an integer. Taking expectations with respect to X_{k+1} in (12) and (13) yields

    δ_+ ER_{k+1}[s; p; X] = f_{k+1} Pr[X_{k+1} > s - p_k]
                            + Σ_{i=0}^{d[s-p_k]} δ_+ ER_k[s - i; p; X] Pr[X_{k+1} = i],     (28)

and

    δ_- ER_{k+1}[s; p; X] = f_{k+1} Pr[X_{k+1} ≥ s - p_k]
                            + Σ_{i=0}^{u[s-p_k-1]} δ_- ER_k[s - i; p; X] Pr[X_{k+1} = i].   (29)

Now suppose that ER_k[s; p; x] is CLBI on s ≥ 0 for some k, and there are integer protection levels
p_1*, p_2*, . . . , p_k* satisfying (20). From (28) and (29), the integrality of p_k* and X_{k+1}
and the fact that ER_k[s; p; x] is CLBI ensure that the left and right derivatives of
ER_{k+1}[s; p*; X] are equal and constant at noninteger s and that equality can fail to hold only
at integer s. That is, ER_{k+1}[s; p*; X] is CLBI. That ER_{k+1}[s; p*; X] is concave follows from
Corollary 1.

By recursive application of (28) and (29), using the fact that total demand is finite with
probability 1, there exists an s sufficiently large that

    δ_+ ER_k[s; p*; X] < f_{k+1}                                                     (30)

for each k = 2, 3, . . . . Property (30) together with the covering property of the subdifferentials
of CLBI functions ensure that there is an integer s = p_k* satisfying
f_{k+1} ∈ δER_k[p_k*; p*; X]; that is, optimality condition (20). The existence of an optimal
integer policy p* = (p_1*, p_2*, . . .) follows by induction.
3.1. Monotone Optimal Stopping Problems and the Optimality of Fixed Protection Level Booking Policies
In this section, we establish that the fixed protection levels p defined by condition (20) are optimal over the set of all admissible policies, not just over the set of fixed policies. To this end, consider the problem of stopping bookings in fare class k + 1 when there are s seats remaining and X_{k+1} >= x_{k+1} has been observed, where x_{k+1} >= 0.
The problem of finding an optimal policy for choosing p_k belongs to the class of stochastic optimization problems known as optimal stopping problems. It has been shown by Derman and Sacks (1960) and Chow and Robbins (1961) that optimal stopping problems defined as monotone have particularly simple solutions.
To check the conditions for monotonicity, we need to consider the expected gain in revenue obtained by changing the protection level for the nest of the k highest fare classes from p_k + 1 to p_k, given that the additional seat being released will be sold to fare class
k + 1. Call this expected gain G_k; by (23), the gain can be rewritten in terms of the expected revenue functions ER_k. The booking problem for fare class k will be monotone if for fixed s and (p_1, ..., p_{k-1}) the following conditions are satisfied:
1. There is a p_k* such that the gain G_k is nonnegative for p_k < p_k* and nonpositive for p_k >= p_k*.
2. |R[s; (p_0, p_1, ..., p_k + 1); X] - R[s; (p_0, p_1, ..., p_k); X]| is bounded for all p_k.
Condition 2 is trivial because the total revenue is certainly bounded by s·f_1. Suppose that p* is an integer policy satisfying the conditions in Theorem 1. Then p_k* and G_k[s; (p_0*, p_1*, ..., p_{k-1}*, p_k)] satisfy condition 1 by Theorem 1.
If the model is monotone, the expected revenue will be maximized by protecting p_k* seats for the nest of the k highest fare classes; that is, a fixed-limit policy will be optimal for the protection level p_k.
The significance of this result in the context of airline seat allocation is that fixed protection levels defined by condition (20) will be optimal as long as no change in the probability distributions of demand is foreseen. In other words, no ad hoc adjustment of protection levels is justified unless a shift in the demand distributions is detected. In practice, one or more of the independent-demands, low-before-high, or limited-information assumptions may not be satisfied, and there is the possibility that revenues can be increased by protection level adjustments in a dynamic reservations environment. The point here is that such adjustments must be properly justified; for example,
the observation of a sudden rush of demand in one
fare class should not lead to a protection level adjust-
ment unless it is believed that the rush signals a
genuine shift in the underlying demand distribution.
For a preliminary investigation of the effects of stochastically dependent demands on the optimal policy, see Brumelle et al. (1990).
3.2. An Alternative Expression for the Optimal Protection Levels
This section presents the derivation of the expression for the optimal protection levels in terms of demands given in (7). This expression is relevant when demand distributions can be approximated by continuous distributions, and it provides the optimality conditions in a form analogous to the EMSRa approximation.
Lemma 2. If p satisfies
    f_1 Pr[X_1 > p_1 ∩ X_1 + X_2 > p_2 ∩ ... ∩ X_1 + ... + X_k > p_k] = f_{k+1}    (31)
for all k, then with probability 1, for k = 1, 2, ... and s >= p_k, the one-sided derivatives of R_{k+1}[s; p; X] admit a representation whose expectation is the probability expression given in Corollary 2 below.
Proof. Assume that p satisfies the hypothesis of the lemma. For s >= p_k, an expression for the right derivative is obtained from (12) by taking the expectation and interchanging E and δ⁺. Using (31) to substitute for f_{k+1}, the right-hand side of this expression can be rewritten as a joint probability of the demand events. For k = 1, using (10), expression (33) reduces to the required form, and thus the lemma holds for k = 1.
The proof is completed by induction. Using the induction hypothesis that the lemma holds for k, substitute for δ⁺R_k in the last term of (35); this yields
    f_1 Pr[X_1 > p_1 ∩ ... ∩ X_1 + ... + X_{k+1} > s],
which completes the proof.
Corollary 2. If p satisfies (31), then for s >= p_k
    δ⁺ER_{k+1}[s; p; X] = f_1 Pr[X_1 > p_1 ∩ X_1 + X_2 > p_2 ∩ ... ∩ X_1 + ... + X_{k+1} > s].    (37)
Theorem 3. If p satisfies (31), then p is optimal.
Proof. By Lemma 2, if p satisfies (31), then the optimality condition (20) holds for each k. By Theorem 1, p is thus optimal.
3.3. Application of the Optimality Conditions
Condition (20) provides a concise characterization of optimal policies in terms of the subdifferential (or first differences) of the expected revenue function. Given any estimates of future demand distributions (discrete or continuous), it is easy to determine the subdifferential of the expected revenue function for fare class 1 as a function of seats remaining and then numerically identify an integer p_1* that satisfies the optimality condition. The remaining subdifferentials and optimal protection levels can be determined in a like manner by successive applications of (20).
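As an illustration of this successive procedure (not part of the paper), the sketch below estimates the first differences of the expected revenue functions by plain Monte Carlo averaging over simulated integer demand vectors and scans for the first seat count at which the next fare is reached. The demand distributions, fares, and the assumption that protection levels are nondecreasing are all illustrative.

import numpy as np

def nested_revenue_k(s, k, fares, protect, x):
    # Revenue from the k highest fare classes with s seats remaining,
    # fixed protection levels protect[0..k-2] (protect[j] protects classes 1..j+1),
    # and realized integer demands x[0..k-1] (x[0] is class 1).
    rev = 0.0
    for j in range(k, 1, -1):                    # classes k, k-1, ..., 2 book first
        avail = max(s - protect[j - 2], 0)       # seats not protected for classes 1..j-1
        sold = min(x[j - 1], avail)
        rev += fares[j - 1] * sold
        s -= sold
    return rev + fares[0] * min(x[0], s)         # class 1 books last

def protection_levels(fares, demand_samples, s_max=300):
    # Scan condition (20): p_k is the first integer s at which the estimated
    # right difference of ER_k drops to f_{k+1} or below (valid by concavity).
    protect = []
    for k in range(1, len(fares)):
        start = protect[-1] if protect else 0    # assumes nondecreasing protection levels
        for s in range(start, s_max):
            diff = np.mean([nested_revenue_k(s + 1, k, fares, protect, x)
                            - nested_revenue_k(s, k, fares, protect, x)
                            for x in demand_samples])
            if diff <= fares[k]:                 # fare f_{k+1} reached
                protect.append(s)
                break
    return protect

rng = np.random.default_rng(0)
demand = np.column_stack([rng.poisson(m, 1000) for m in (40, 60, 80)])  # hypothetical demands
print(protection_levels([1.0, 0.9, 0.7], demand))

With exact discrete demand distributions one would instead evaluate the recursions (28) and (29) directly; the Monte Carlo differencing above is only a convenient stand-in.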
An alternative approach is provided by solving for the optimal protection levels given by (31) for k = 1, 2, .... A condition which guarantees the solvability of this system of equations is that the demands have a continuous joint distribution function. If an empirical distribution for integer demand is being used, then the above equations can likely be solved to within the statistical error of the demand distribution. This approach is consistent with previous airline practice where estimated continuous demand distributions (e.g., fitted normal distributions) have been used in methods like EMSRa.
Empirical studies have shown that the normal probability distribution gives a good continuous approximation to airline demand distributions (Shlifer 1975). If normality is assumed, solutions to (31) can be obtained with straightforward numerical methods. Robinson (1990) has generalized the conditions to the case that fares are not necessarily monotonic and has proposed an efficient Monte Carlo integration scheme for finding optimal protection levels.
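One straightforward numerical route, sketched below, is to estimate the joint probabilities in (31) by plain Monte Carlo sampling and to solve for each protection level by bisection. This is not Robinson's scheme, and the independent normal demand parameters are purely illustrative assumptions.

import numpy as np

def solve_condition_31(fares, means, sds, n=200_000, seed=1):
    # Solve f_{k+1} = f_1 * Pr[X_1 > p_1, X_1+X_2 > p_2, ..., X_1+...+X_k > p_k]
    # sequentially for p_1, p_2, ... under independent normal demands (illustrative).
    rng = np.random.default_rng(seed)
    X = rng.normal(means, sds, size=(n, len(means)))
    partial_sums = np.cumsum(X, axis=1)              # X_1, X_1+X_2, ...
    protect, event = [], np.ones(n, dtype=bool)
    for k in range(len(fares) - 1):
        target = fares[k + 1] / fares[0]
        lo, hi = 0.0, float(partial_sums[:, k].max())
        for _ in range(60):                          # bisection on the current level
            mid = 0.5 * (lo + hi)
            if np.mean(event & (partial_sums[:, k] > mid)) > target:
                lo = mid
            else:
                hi = mid
        protect.append(0.5 * (lo + hi))
        event &= partial_sums[:, k] > protect[-1]    # carried into the next condition
    return protect

print(solve_condition_31([1.0, 0.9, 0.7], means=[40, 60, 80], sds=[16, 24, 32]))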
There is a way in which the optimality conditions (31) can be used to monitor the past performance of seat allocation decisions given historical data on seat bookings for a series of flights. For simplicity, the discussion will assume three fare classes; the method generalizes easily to an arbitrary number of classes. With three fare classes, conditions (31) can be written
    Pr[X_1 > p_1] = f_2/f_1,    (39)
    Pr[X_1 > p_1 ∩ X_1 + X_2 > p_2] = f_3/f_1.    (40)
Given a series of past flights, the probability Pr[X_1 > p_1] can be estimated by the proportion of flights on which class-1 demand exceeded its protection level. Then (39) specifies that this proportion should be close to the ratio f_2/f_1. Similarly, (40) specifies that the proportion of flights on which both class-1 demand exceeded its protection level and the total of class-1 and class-2 demands exceeded their protection level should be close to the ratio f_3/f_1. If allocation decisions are being made optimally, these conditions should be satisfied approximately in a sufficiently long series of past flights. Severe departures from these ratios would be symptomatic of suboptimal allocation decisions. The appealing aspect of this approach is its simplicity: no modeling of the demand distributions and no numerical integrations are required.
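A minimal sketch of such a monitoring check is given below; the flight history, protection levels, and fares are all hypothetical, and the empirical proportions are simply set against the fare ratios required by (39) and (40).

import numpy as np

def monitor_three_classes(history, p1, p2, f1, f2, f3):
    # history: array of shape (n_flights, 2) with realized class-1 and class-2 demands.
    x1, x2 = history[:, 0], history[:, 1]
    prop1 = np.mean(x1 > p1)                         # estimate of Pr[X1 > p1]
    prop2 = np.mean((x1 > p1) & (x1 + x2 > p2))      # estimate of Pr[X1 > p1, X1+X2 > p2]
    print(f"Pr[X1 > p1] ~ {prop1:.3f}   target f2/f1 = {f2 / f1:.3f}")
    print(f"Pr[X1 > p1, X1+X2 > p2] ~ {prop2:.3f}   target f3/f1 = {f3 / f1:.3f}")

rng = np.random.default_rng(2)
history = np.column_stack([rng.poisson(40, 200), rng.poisson(60, 200)])  # hypothetical data
monitor_three_classes(history, p1=32, p2=75, f1=1.0, f2=0.9, f3=0.7)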
4. COMPARISON OF EMSRa AND OPTIMAL
SOLUTIONS
The EMSRa method determines the optimal protection level for the full-fare class but is not optimal for the remaining fare classes. However, the EMSRa equations are particularly simple to implement because they do not involve joint probability distributions. It is thus of interest to examine the performance of the EMSRa method relative to the optimal solutions given above. Note that neither the EMSRa nor exact optimality conditions give explicit formulas for the optimal protection levels in terms of the problem parameters, so analytical comparison of the revenues produced by the two methods is difficult unless unrealistic demand distributions are assumed. Numerical comparison of the two methods can, however, give some indication of relative performance.

Table I
Comparison of EMSRa Versus Optimal for Three Fare Classes

Example No.   f3    f2    p1    p2 (EMSRa)   p2 (Optimal)   % Error Revenue
     1        0.6   0.7   32
     2        0.6   0.8   27
     3        0.6   0.9   19
     4        0.7   0.8   27
     5        0.7   0.9   19
     6        0.8   0.9   19
This section gives the results of numerical comparisons of EMSRa versus optimal solutions in a three fare-class problem. Table I presents the results of six examples in which cabin capacity is fixed at 100 seats and the fares f_i are varied. Fares are expressed as proportions of full fare; thus, f_1 = 1 throughout. The % error revenue column gives the loss in revenues incurred from using the EMSRa method as a percentage of optimal revenues. In Table II, the fares are held constant at levels f_3 = 0.7 and f_2 = 0.9, and cabin capacity is varied.
Discrete approximations to the normal probability distribution were used for all demand distributions. The nominal mean demands for fare classes 1, 2 and 3 were 40, 60 and 80, and the nominal standard deviations were 16, 24 and 32, respectively. These figures are nominal because the discretization procedure introduced small deviations from the exact parameter values. These parameters correspond to a coefficient of variation of 0.4; i.e., the standard deviation is 40% of the mean. This is slightly higher than the 0.33 that Belobaba (1987) mentions as a common airline "k factor" for total demand.
(Note that the normal distribution has significant mass below zero when the coefficient of variation is much higher than 0.4. Use of a truncated normal or other positive distribution is indicated under these circumstances.)

Table II
Capacity Effects

Capacity   % Error Revenue
   82          0.54
  100          0.45
  120          0.35
  140          0.24
  160          0.14
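The paper does not describe the discretization procedure itself. The sketch below shows one common way to build a nonnegative integer approximation to a normal demand distribution, and it reproduces exactly the kind of small deviations from the nominal mean and standard deviation mentioned above; it is an assumed construction, not necessarily the one the authors used.

import numpy as np
from scipy.stats import norm

def discretize_normal(mean, sd, max_value):
    # Put Pr[i - 0.5 < X <= i + 0.5] on each integer i, lump everything below 0.5
    # at 0 and the upper tail at max_value, so the result sums to 1.
    grid = np.arange(max_value + 1)
    cdf = norm.cdf(grid + 0.5, loc=mean, scale=sd)
    pmf = np.diff(np.concatenate(([0.0], cdf)))
    pmf[-1] += 1.0 - cdf[-1]
    return pmf

pmf = discretize_normal(40, 16, 120)
support = np.arange(121)
m = float(np.sum(support * pmf))
s = float(np.sqrt(np.sum((support - m) ** 2 * pmf)))
print(m, s)   # close to, but not exactly, the nominal 40 and 16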
Remarks
In this set of examples the EMSRa method produces seat allocations that are significantly different from optimal allocations, but the associated loss in revenue is not great. Specifically:
a. In these examples, the EMSRa method consistently underestimates the number of seats that should be protected for the two upper fare classes. The discrepancy is 19% in the worst case (example 6). We will show with a counterexample that the EMSRa method is not guaranteed to underestimate in this way.
b. In the worst case the discrepancy between EMSRa and optimal solutions with respect to revenues is approximately 1/2%.
c. The error appears to increase as the discount fares approach the full fare; however, the sample is much too small here to justify any general conclusion of this nature.
d. The error decreases as the aircraft capacity increases. This effect is, of course, to be expected because allocation policies have less impact when the capacity is able to accommodate most of the demands.
On the basis of these examples, a decision of whether or not to use the EMSRa approach rests on whether or not a potential revenue loss on the order of 1/2% or less (with three fare classes) is justified by the simpler implementation of the method relative to the optimal method. Further work is needed to determine the relative performance of the EMSRa method with a larger number of fare classes or under circumstances in which dynamic adjustments of protection levels are justified.
Additional numerical analyses related to the seat allocation problem are provided in Wollmer (1992) and have been conducted by P. Belobaba and colleagues at the MIT Flight Transportation Laboratory.
4.1. EMSRa Underestimation of Protection Levels: A Counterexample
As mentioned, the EMSRa method consistently underestimated the protection level p_2 for the two upper fare classes in all the numerical trials. It is thus reasonable to conjecture that the approximation will always behave in this way. This is not true for all demand distributions, as shown by the following counterexample using exponentially distributed demands. It remains an open question whether or not the conjecture holds true for normally distributed demands.
For convenience, let the unit of demand be 100 seats, and introduce the relative fares r_2 = f_2/f_1 and r_3 = f_3/f_1. Now suppose that X_1 and X_2 follow identical, independent exponential distributions with mean 1.0 (100 seats). That is, Pr[X_i > x_i] = e^{-x_i} for i = 1, 2. It is not suggested that the exponential distribution has any particular merit for modeling airline demands, although it could serve as a surrogate for a severely right-skewed distribution if the need arose. Its use here is purely as a device for establishing a counterexample to a general conjecture.
Let p_k^a denote protection levels obtained with the EMSRa method. Then with the above distributional assumptions and (2)-(5), we have p_1^a = -ln(r_2), and p_2^a = -ln(r_3) - ln(r_3/r_2).
For the optimal solutions, (7) gives p_1 = -ln(r_2) = p_1^a, and
    r_3 = e^{-p_2}(1 + p_2 - p_1).    (41)
Suppose that r_2 = 1/2 and r_3 = 1/4. Then p_1 = 0.69 and p_2^a = 2.08 (69 and 208 seats, respectively). Given p_1, a simple line search using (41) produces the optimal p_2 = 2.37. Thus, for this example, the EMSRa method underestimates p_2 by 29 seats. This behavior is consistent with the conjecture.
Now suppose instead that r_2 = 4/10 and r_3 = 1/10. Then p_1^a = 0.92 and p_2^a = 3.69. In this case, however, p_2 = 3.61, and the EMSRa method overestimates p_2 by 8 seats. It is not difficult to show that for these demand distributions, the EMSRa method will overestimate p_2 whenever r_2/r_3 > 3.51, approximately.
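These figures are easy to reproduce. The sketch below (an illustration, not the authors' code) evaluates the EMSRa level p_2^a = -ln(r_3) - ln(r_3/r_2) and solves the exact condition (41), r_3 = e^{-p_2}(1 + p_2 - p_1) with p_1 = -ln(r_2), by bisection.

from math import exp, log

def emsra_p2(r2, r3):
    # EMSRa protection for the two upper classes under unit-mean exponential demands.
    return -log(r3) - log(r3 / r2)

def optimal_p2(r2, r3, tol=1e-10):
    # Bisection on r3 = exp(-p2) * (1 + p2 - p1), which is decreasing in p2 for p2 > p1.
    p1 = -log(r2)
    lo, hi = p1, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if exp(-mid) * (1.0 + mid - p1) > r3:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for r2, r3 in [(0.5, 0.25), (0.4, 0.1)]:
    print(r2, r3, round(emsra_p2(r2, r3), 2), round(optimal_p2(r2, r3), 2))
# first case: 2.08 (EMSRa) vs 2.37 (optimal); second case: 3.69 vs 3.61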
5. SUMMARY
This paper provides a rigorous formulation of the revenue function for the multiple fare class seat allocation problem for either discrete or continuous probability distributions of demand and demonstrates conditions under which the expected revenue function is concave. We show that a booking policy that maximizes expected revenue can be characterized by a simple set of conditions on the subdifferential of the expected revenue function. These conditions are further simplified to a set of conditions relating the probability distributions of demand for the various fare classes to their respective fares. These conditions are guaranteed to have a solution if the joint distribution of the demands is approximated by a continuous probability distribution. It is shown that the fixed protection limit policies given by these optimality conditions are optimal over the class of all policies that depend only on the history of the booking process. A numerical comparison is made of the optimal solutions with the approximate solutions yielded by the expected marginal seat revenue (EMSRa) method. A tentative conclusion on the basis of this restricted set of examples is that the EMSRa method produces seat allocations that are significantly different from optimal allocations, and the associated loss in revenue is of the order of 1/2%.
ACKNOWLEDGMENT
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, grant no. A4104. The authors wish to thank Professor H. I. Gassmann of the School of Business Administration, Dalhousie University, Halifax, Canada; Mr. Xiao Sun of the University of British Columbia; and two anonymous referees for helpful comments and suggestions.
REFERENCES
ALSTRUP, J., S. BOAS, O. B. G. MADSEN AND R. V. V. VIDAL. 1986. Booking Policy for Flights With Two Types of Passengers. Eur. J. Opnl. Res. 27, 274-288.
BELOBABA, P. P. 1987. Air Travel Demand and Airline Seat Inventory Management. Ph.D. Dissertation, MIT, Cambridge, Mass.
BELOBABA, P. P. 1989. Application of a Probabilistic Decision Model to Airline Seat Inventory Control. Opns. Res. 37, 183-197.
BHATIA, A. V., AND S. C. PAREKH. 1973. Optimal Allocation of Seats by Fare. Presentation by TWA Airlines to AGIFORS Reservations Study Group.
BRUMELLE, S. L., J. I. MCGILL, T. H. OUM, M. W. TRETHEWAY AND K. SAWAKI. 1990. Allocation of Airline Seats Between Stochastically Dependent Demands. Trans. Sci. 24, 183-192.
CHOW, Y. S., AND H. ROBBINS. 1961. A Martingale Systems Theorem and Applications. In Proceedings 4th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press.
CURRY, R. E. 1988. Optimum Seat Allocation With Fare Classes Nested on Segments and Legs. Technical Note 88-1, Aeronomics Incorporated, Fayetteville, Ga.
CURRY, R. E. 1990. Optimal Airline Seat Allocation With Fare Classes Nested by Origins and Destinations. Trans. Sci. 24, 193-203.
DERMAN, C., AND J. SACKS. 1960. Replacement of Periodically Inspected Equipment. Naval Res. Logist. Quart. 7, 597-607.
DROR, M., P. TRUDEAU AND S. P. LADANY. 1988. Network Models for Seat Allocation on Flights. Trans. Res. 22B, 239-250.
GLOVER, F., R. GLOVER, J. LORENZO AND C. MCMILLAN. 1982. The Passenger Mix Problem in the Scheduled Airlines. Interfaces 12, 73-79.
LITTLEWOOD, K. 1972. Forecasting and Control of Passengers. In Proceedings 12th AGIFORS Symposium, American Airlines, New York, 95-117.
MAYER, M. 1976. Seat Allocation, or a Simple Model of Seat Allocation via Sophisticated Ones. In Proceedings 16th AGIFORS Symposium, 103-135.
MCGILL, J. I. 1988. Airline Multiple Fare Class Seat Allocation. Presented at Fall ORSA/TIMS Joint National Conference, Denver, Colo.
RICHTER, H. 1982. The Differential Revenue Method to Determine Optimal Seat Allotments by Fare Type. In Proceedings 22nd AGIFORS Symposium, 339-362.
ROBINSON, L. W. 1990. Optimal and Approximate Control Policies for Airline Booking With Sequential Fare Classes. Working Paper 90-03, Johnson Graduate School of Management, Cornell University, Ithaca, N.Y.
ROCKAFELLAR, R. T. 1970. Convex Analysis. Princeton University Press, Princeton, N.J.
SHLIFER, R., AND Y. VARDI. 1975. An Airline Overbooking Policy. Trans. Sci. 9, 101-114.
WOLLMER, R. D. 1986. An Airline Reservation Model for Opening and Closing Fare Classes. Unpublished Company Report, Douglas Aircraft Company, Long Beach, Calif.
WOLLMER, R. D. 1987. A Seat Management Model for a Single Leg Route. Unpublished Company Report, Douglas Aircraft Company, Long Beach, Calif.
WOLLMER, R. D. 1988. A Seat Management Model for a Single Leg Route When Lower Fare Classes Book First. Presented at Fall ORSA/TIMS Joint National Conference, Denver, Colo.
WOLLMER, R. D. 1992. An Airline Seat Management Model for a Single Leg Route When Lower Fare Classes Book First. Opns. Res. 40, 26-37.