
Foundations of Reinforcement Learning with

Applications in Finance

Ashwin Rao, Tikhon Jelvis


1 Order-Book Trading Algorithms
In this chapter, we venture into the world of Algorithmic Trading and specifically, we cover
a couple of problems involving a trading Order Book that can be cast as Markov Decision
Processes, and hence tackled with Dynamic Programming or Reinforcement Learning. We
start the chapter by covering the basics of how trade orders are submitted and executed on
an Order Book, a structure that allows for efficient transactions between buyers and sellers
of a financial asset. Without loss of generality, we refer to the financial asset being traded
on the Order Book as a “stock” and the number of units of the asset as “shares.” Next we
will explain how a large trade can significantly shift the Order Book, a phenomenon known
as Price Impact. Finally, we will cover the two algorithmic trading problems that can be cast
as MDPs. The first problem is Optimal Execution of the sale of a large number of shares
of a stock so as to yield the maximum utility of sales proceeds over a finite horizon. This
involves breaking up the sale of the shares into appropriate pieces and selling those pieces
at the right times so as to achieve the goal of maximizing the utility of sales proceeds.
Hence, it is an MDP Control problem where the actions are the number of shares sold
at each time step. The second problem is Optimal Market-Making, i.e., the optimal bids
(willingness to buy a certain number of shares at a certain price) and asks (willingness to
sell a certain number of shares at a certain price) to be submitted on the Order Book. Again,
by optimal, we mean maximization of the utility of revenues generated by the market-
maker over a finite horizon (market-makers generate revenue through the spread, i.e., the
gap between the bid and ask prices they offer). This is also an MDP Control problem
where the actions are the bid and ask prices along with the bid and ask shares at each
time step.
For a deeper study on the topics of Order Book, Price Impact, Order Execution, Market-
Making (and related topics), we refer you to the comprehensive treatment in Olivier Gueant’s
book (Gueant 2016).

1.1 Basics of Order Book and Price Impact


Some of the financial literature refers to the Order Book as Limit Order Book (abbreviated
as LOB) but we will stick with the lighter language - Order Book, abbreviated as OB. The
Order Book is essentially a data structure that facilitates matching stock buyers with stock
sellers (i.e., an electronic marketplace). Figure 1.1 depicts a simplified view of an order
book. In this order book market, buyers and sellers express their intent to trade by submit-
ting bids (intent to buy) and asks (intent to sell). These expressions of intent to buy or sell
are known as Limit Orders (abbreviated as LO). The word “limit” in Limit Order refers to
the fact that one is interested in buying only below a certain price level (and likewise, one
is interested in selling only above a certain price level). Each LO comprises a price P
and a number of shares N. A bid, i.e., a Buy LO (P, N), states willingness to buy N shares at
a price less than or equal to P. Likewise, an ask, i.e., a Sell LO (P, N), states willingness to
sell N shares at a price greater than or equal to P.
Note that multiple traders might submit LOs with the same price. The order book ag-
gregates the number of shares at each unique price, and the OB data structure is typically

Figure 1.1: Trading Order Book (Image Credit: https://nms.kcl.ac.uk/rll/enrique-miranda/index.html)

presented for trading in the form of this aggregated view. Thus, the OB data structure can
be represented as two sorted lists of (Price, Size) pairs:

Buy LOs (Bids): $[(P_i^{(b)}, N_i^{(b)}) \mid 0 \le i < m]$, with $P_i^{(b)} > P_j^{(b)}$ for $i < j$

Sell LOs (Asks): $[(P_i^{(a)}, N_i^{(a)}) \mid 0 \le i < n]$, with $P_i^{(a)} < P_j^{(a)}$ for $i < j$
Note that the Buy LOs are arranged in descending order and the Sell LOs are arranged
in ascending order to signify the fact that the beginning of each list consists of the most
important (best-price) LOs.
Now let’s learn about some of the standard terminology:

• We refer to $P_0^{(b)}$ as The Best Bid Price (lightened to Best Bid) to signify that it is the highest offer to buy and hence, the best price for a seller to transact with.

• Likewise, we refer to $P_0^{(a)}$ as The Best Ask Price (lightened to Best Ask) to signify that it is the lowest offer to sell and hence, the best price for a buyer to transact with.

• $\frac{P_0^{(a)} + P_0^{(b)}}{2}$ is referred to as The Mid Price (lightened to Mid).

• $P_0^{(a)} - P_0^{(b)}$ is referred to as The Best Bid-Ask Spread (lightened to Spread).

• $P_{n-1}^{(a)} - P_{m-1}^{(b)}$ is referred to as The Market Depth (lightened to Depth).
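As a quick sketch of these definitions (with made-up numbers), all five quantities can be read off directly from the two sorted (Price, Size) lists:

```python
# Hypothetical order book: bids sorted price-descending, asks price-ascending.
bids = [(100.0, 30), (99.0, 25), (98.0, 40)]    # m = 3 bid price levels
asks = [(105.0, 20), (106.0, 35), (107.0, 50)]  # n = 3 ask price levels

best_bid = bids[0][0]               # highest offer to buy: 100.0
best_ask = asks[0][0]               # lowest offer to sell: 105.0
mid = (best_bid + best_ask) / 2     # 102.5
spread = best_ask - best_bid        # 5.0
depth = asks[-1][0] - bids[-1][0]   # 107.0 - 98.0 = 9.0
```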

Although an actual real-world trading order book has many other details, we believe this
simplified coverage is adequate for the purposes of core understanding of order book trad-
ing and to navigate the problems of optimal order execution and optimal market-making.
Apart from Limit Orders, traders can express their interest to buy/sell with another type
of order - a Market Order (abbreviated as MO). A Market Order (MO) states one’s intent
to buy/sell N shares at the best possible price(s) available on the OB at the time of MO sub-
mission. So, an LO is keen on price and not so keen on time (willing to wait to get the price
one wants) while an MO is keen on time (desire to trade right away) and not so keen on

4
price (will take whatever the best LO price is on the OB). So now let us understand the
actual transactions that occur between LOs and MOs (buy and sell interactions, and how
the OB changes as a result of these interactions). Firstly, we note that in normal trading
activity, a newly submitted sell LO's price is typically above the price of the best buy LO
on the OB. But if a new sell LO's price is less than or equal to the best buy
LO's price, we say that the market has crossed (to mean that the range of bid prices and the
range of ask prices have intersected), which results in an immediate transaction that eats
into the OB’s Buy LOs.
Precisely, a new Sell LO $(P, N)$ potentially transacts with (and hence, removes) the best
Buy LOs on the OB:

$$\text{Removal: } [(P_i^{(b)}, \min(N_i^{(b)}, \max(0, N - \sum_{j=0}^{i-1} N_j^{(b)}))) \mid (i : P_i^{(b)} \ge P)] \tag{1.1}$$

After this removal, it potentially adds the following LO to the asks side of the OB:

$$(P, \max(0, N - \sum_{i : P_i^{(b)} \ge P} N_i^{(b)})) \tag{1.2}$$
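To make Equations (1.1) and (1.2) concrete, here is a small self-contained sketch (hypothetical numbers, plain tuples rather than the OrderBook class developed later) of a crossing Sell LO transacting with the bids and posting its unfilled remainder as an ask:

```python
def cross_sell_lo(bids, price, shares):
    """bids: (price, size) pairs in descending price order.
    Returns (transacted-removal list, remaining bids, posted ask or None)."""
    removal, new_bids, rem = [], [], shares
    for p, n in bids:
        if p >= price and rem > 0:
            taken = min(n, rem)       # Equation (1.1): eat into eligible bids
            removal.append((p, taken))
            rem -= taken
            if n > taken:
                new_bids.append((p, n - taken))
        else:
            new_bids.append((p, n))
    # Equation (1.2): unfilled shares become a new Sell LO at price P
    posted_ask = (price, rem) if rem > 0 else None
    return removal, new_bids, posted_ask

bids = [(100, 30), (99, 25), (98, 40)]
removal, new_bids, posted = cross_sell_lo(bids, 99, 70)
# 30 shares transact at 100 and 25 at 99; the remaining 15 shares are
# posted as an ask at 99: removal = [(100, 30), (99, 25)],
# new_bids = [(98, 40)], posted = (99, 15)
```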

Likewise, a new Buy LO $(P, N)$ potentially transacts with (and hence, removes) the
best Sell LOs on the OB:

$$\text{Removal: } [(P_i^{(a)}, \min(N_i^{(a)}, \max(0, N - \sum_{j=0}^{i-1} N_j^{(a)}))) \mid (i : P_i^{(a)} \le P)] \tag{1.3}$$

After this removal, it potentially adds the following to the bids side of the OB:

$$(P, \max(0, N - \sum_{i : P_i^{(a)} \le P} N_i^{(a)})) \tag{1.4}$$

When a Market Order (MO) is submitted, things are simpler. A Sell Market Order of $N$
shares will remove the best Buy LOs on the OB:

$$\text{Removal: } [(P_i^{(b)}, \min(N_i^{(b)}, \max(0, N - \sum_{j=0}^{i-1} N_j^{(b)}))) \mid 0 \le i < m] \tag{1.5}$$

The sales proceeds for this MO are:

$$\sum_{i=0}^{m-1} P_i^{(b)} \cdot \min(N_i^{(b)}, \max(0, N - \sum_{j=0}^{i-1} N_j^{(b)})) \tag{1.6}$$

We note that if $N$ is large, the sales proceeds for this MO can be significantly lower than
the best possible sales proceeds ($= N \cdot P_0^{(b)}$, attained only if $N \le N_0^{(b)}$). Note also
that if $N$ is large, the new Best Bid Price (the new value of $P_0^{(b)}$) can be significantly lower than
the Best Bid Price before the MO was submitted (because the MO "eats into" a significant
volume of Buy LOs on the OB). This "eating into" the Buy LOs on the OB and the consequent
lowering of the Best Bid Price (and hence, Mid Price) is known as the Price Impact of an MO
(more specifically, as the Temporary Price Impact of an MO). We use the word "temporary"
because subsequent to this "eating into" the Buy LOs of the OB (and the consequent "hole,"
i.e., large Bid-Ask Spread), market participants will submit "replenishment LOs" (both
Buy LOs and Sell LOs) on the OB. These replenishment LOs would typically mitigate the
Bid-Ask Spread, and the eventual settlement of the Best Bid/Best Ask/Mid Prices consti-
tutes what we call Permanent Price Impact, which refers to the changes in OB Best Bid/Best
Ask/Mid prices relative to the corresponding prices before submission of the MO.
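As an illustration of this temporary impact, the following sketch (hypothetical numbers) walks a Sell MO down the bids per Equation (1.6) and compares the proceeds against the ideal $N \cdot P_0^{(b)}$:

```python
def mo_sale_proceeds(bids, n):
    """Sales proceeds of a Sell MO of n shares against (price, size) bids
    in descending price order, per Equation (1.6)."""
    proceeds, rem = 0.0, n
    for p, size in bids:
        take = min(size, rem)
        proceeds += p * take
        rem -= take
        if rem == 0:
            break
    return proceeds

bids = [(100.0, 30), (99.0, 25), (98.0, 40)]
ideal = 80 * 100.0                   # 8000.0 if all 80 shares hit the Best Bid
actual = mo_sale_proceeds(bids, 80)  # 30@100 + 25@99 + 25@98 = 7925.0
slippage = ideal - actual            # 75.0 lost to temporary price impact
```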
Likewise, a Buy Market Order of $N$ shares will remove the best Sell LOs on the OB:

$$\text{Removal: } [(P_i^{(a)}, \min(N_i^{(a)}, \max(0, N - \sum_{j=0}^{i-1} N_j^{(a)}))) \mid 0 \le i < n] \tag{1.7}$$

The purchase bill for this MO is:

$$\sum_{i=0}^{n-1} P_i^{(a)} \cdot \min(N_i^{(a)}, \max(0, N - \sum_{j=0}^{i-1} N_j^{(a)})) \tag{1.8}$$

If $N$ is large, the purchase bill for this MO can be significantly higher than the best
possible purchase bill ($= N \cdot P_0^{(a)}$, attained only if $N \le N_0^{(a)}$). All that we wrote
above in terms of Temporary and Permanent Price Impact naturally applies in the opposite
direction for a Buy MO.
We refer to all of the above-described OB movements, including both temporary and
permanent Price Impacts broadly as Order Book Dynamics. There is considerable literature
on modeling Order Book Dynamics and some of these models can get fairly complex in
order to capture various real-world nuances. Much of this literature is beyond the scope
of this book. In this chapter, we will cover a few simple models for how a sell MO will
move the OB’s Best Bid Price (rather than a model for how it will move the entire OB). The
model for how a buy MO will move the OB’s Best Ask Price is naturally identical.
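As a flavor of such a model (purely illustrative, not one of the calibrated models from the literature), the Best Bid Price might be assumed to drop linearly in the sold quantity (a hypothetical permanent-impact parameter alpha), plus exogenous market noise:

```python
import random

def next_best_bid(p_t, n_t, alpha=0.01, sigma=0.1, rng=random):
    """Hypothetical one-step Best Bid dynamics: permanent linear impact
    alpha * n_t, plus Gaussian market noise with standard deviation sigma."""
    return p_t - alpha * n_t + rng.gauss(0.0, sigma)

# With the noise switched off (sigma = 0), selling 50 shares from a
# Best Bid of 100.0 moves the Best Bid to 99.5
```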
Now let’s write some code that models how LOs and MOs interact with the OB. We
write a class OrderBook that represents the Buy and Sell Limit Orders on the Order Book,
which are each represented as a sorted sequence of the type DollarsAndShares, which is a
dataclass we created to represent any pair of a dollar amount (dollars: float) and num-
ber of shares (shares: int). Sometimes, we use DollarsAndShares to represent an LO (pair
of price and shares) as in the case of the sorted lists of Buy and Sell LOs. At other times, we
use DollarsAndShares to represent the pair of total dollars transacted and total shares trans-
acted when an MO is executed on the OB. The OrderBook maintains a price-descending se-
quence of PriceSizePairs for Buy LOs (descending_bids) and a price-ascending sequence
of PriceSizePairs for Sell LOs (ascending_asks). We write the basic methods to get the
OrderBook’s highest bid price (method bid_price), lowest ask price (method ask_price),
mid price (method mid_price), spread between the highest bid price and lowest ask price
(method bid_ask_spread), and market depth (method market_depth).

from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class DollarsAndShares:
    dollars: float
    shares: int

PriceSizePairs = Sequence[DollarsAndShares]

@dataclass(frozen=True)
class OrderBook:
    descending_bids: PriceSizePairs
    ascending_asks: PriceSizePairs

    def bid_price(self) -> float:
        return self.descending_bids[0].dollars

    def ask_price(self) -> float:
        return self.ascending_asks[0].dollars

    def mid_price(self) -> float:
        return (self.bid_price() + self.ask_price()) / 2

    def bid_ask_spread(self) -> float:
        return self.ask_price() - self.bid_price()

    def market_depth(self) -> float:
        return self.ascending_asks[-1].dollars - \
            self.descending_bids[-1].dollars

Next we want to write methods for LOs and MOs to interact with the OrderBook. Notice
that each of Equation (1.1) (new Sell LO potentially removing some of the beginning of
the Buy LOs on the OB), Equation (1.3) (new Buy LO potentially removing some of the
beginning of the Sell LOs on the OB), Equation (1.5) (Sell MO removing some of the
beginning of the Buy LOs on the OB) and Equation (1.7) (Buy MO removing some of the
beginning of the Sell LOs on the OB) all perform a common core function - they “eat into”
the most significant LOs (on the opposite side) on the OB. So we first write a @staticmethod
eat_book for this common function.
eat_book takes as input a ps_pairs: PriceSizePairs (representing one side of the OB)
and the number of shares: int to buy/sell. Notice eat_book’s return type: Tuple[DollarsAndShares,
PriceSizePairs]. The returned DollarsAndShares represents the pair of dollars transacted
and the number of shares transacted (with number of shares transacted being less than
or equal to the input shares). The returned PriceSizePairs represents the remainder
of ps_pairs after the transacted number of shares have eaten into the input ps_pairs.
eat_book first deletes (i.e. “eats up”) as much of the beginning of the ps_pairs: PriceSizePairs
data structure as it can (basically matching the input number of shares with an appropri-
ate number of shares at the beginning of the ps_pairs: PriceSizePairs input). Note that
the returned PriceSizePairs is a separate data structure, ensuring the immutability of the
input ps_pairs: PriceSizePairs.
@staticmethod
def eat_book(
    ps_pairs: PriceSizePairs,
    shares: int
) -> Tuple[DollarsAndShares, PriceSizePairs]:
    rem_shares: int = shares
    dollars: float = 0.
    for i, d_s in enumerate(ps_pairs):
        this_price: float = d_s.dollars
        this_shares: int = d_s.shares
        dollars += this_price * min(rem_shares, this_shares)
        if rem_shares < this_shares:
            return (
                DollarsAndShares(dollars=dollars, shares=shares),
                [DollarsAndShares(
                    dollars=this_price,
                    shares=this_shares - rem_shares
                )] + list(ps_pairs[i+1:])
            )
        else:
            rem_shares -= this_shares
    return (
        DollarsAndShares(dollars=dollars, shares=shares - rem_shares),
        []
    )

Now we are ready to write the method sell_limit_order which takes Sell LO Price and
Sell LO shares as input. As you can see in the code below, first it potentially removes
(if it “crosses”) an appropriate number of shares on the Buy LOs side of the OB (using
the @staticmethod eat_book), and then potentially adds an appropriate number of shares
at the Sell LO Price on the Sell LOs side of the OB. sell_limit_order returns a pair of
DollarsAndShares type and OrderBook type. The returned DollarsAndShares represents the
pair of dollars transacted and the number of shares transacted with the Buy LOs side of
the OB (with number of shares transacted being less than or equal to the input shares).
The returned OrderBook represents the new OB after potentially eating into the Buy LOs
side of the OB and then potentially adding some shares at the Sell LO Price on the Sell
LOs side of the OB. Note that the returned OrderBook is a newly-created data structure,
ensuring the immutability of self. We urge you to read the code below carefully as there
are many subtle details that are handled in the code.
from dataclasses import replace

def sell_limit_order(self, price: float, shares: int) -> \
        Tuple[DollarsAndShares, OrderBook]:
    index: Optional[int] = next((i for i, d_s
                                 in enumerate(self.descending_bids)
                                 if d_s.dollars < price), None)
    eligible_bids: PriceSizePairs = self.descending_bids \
        if index is None else self.descending_bids[:index]
    ineligible_bids: PriceSizePairs = [] if index is None else \
        self.descending_bids[index:]
    d_s, rem_bids = OrderBook.eat_book(eligible_bids, shares)
    new_bids: PriceSizePairs = list(rem_bids) + list(ineligible_bids)
    rem_shares: int = shares - d_s.shares
    if rem_shares > 0:
        new_asks: List[DollarsAndShares] = list(self.ascending_asks)
        index1: Optional[int] = next((i for i, d_s
                                      in enumerate(new_asks)
                                      if d_s.dollars >= price), None)
        if index1 is None:
            new_asks.append(DollarsAndShares(
                dollars=price,
                shares=rem_shares
            ))
        elif new_asks[index1].dollars != price:
            new_asks.insert(index1, DollarsAndShares(
                dollars=price,
                shares=rem_shares
            ))
        else:
            new_asks[index1] = DollarsAndShares(
                dollars=price,
                shares=new_asks[index1].shares + rem_shares
            )
        return d_s, OrderBook(
            ascending_asks=new_asks,
            descending_bids=new_bids
        )
    else:
        return d_s, replace(
            self,
            descending_bids=new_bids
        )

Next, we write the easier method sell_market_order which takes as input the number
of shares to be sold (as a market order). sell_market_order transacts with the appropri-
ate number of shares on the Buy LOs side of the OB (removing those many shares from
the Buy LOs side). It returns a pair of DollarsAndShares type and OrderBook type. The
returned DollarsAndShares represents the pair of dollars transacted and the number of
shares transacted (with number of shares transacted being less than or equal to the in-
put shares). The returned OrderBook represents the remainder of the OB after the trans-
acted number of shares have eaten into the Buy LOs side of the OB. Note that the returned
OrderBook is a newly-created data structure, ensuring the immutability of self.

def sell_market_order(
    self,
    shares: int
) -> Tuple[DollarsAndShares, OrderBook]:
    d_s, rem_bids = OrderBook.eat_book(
        self.descending_bids,
        shares
    )
    return (d_s, replace(self, descending_bids=rem_bids))

We won’t list the methods buy_limit_order and buy_market_order here as they are com-
pletely analogous (you can find the entire code for OrderBook in the file rl/chapter9/order_book.py).
Now let us test out this code by creating a sample OrderBook and submitting some LOs and
MOs to interact with the OrderBook.

from numpy.random import poisson

bids: PriceSizePairs = [DollarsAndShares(
    dollars=x,
    shares=poisson(100. - (100 - x) * 10)
) for x in range(100, 90, -1)]
asks: PriceSizePairs = [DollarsAndShares(
    dollars=x,
    shares=poisson(100. - (x - 105) * 10)
) for x in range(105, 115, 1)]
ob0: OrderBook = OrderBook(descending_bids=bids, ascending_asks=asks)

The above code creates an OrderBook in the price range [91, 114] with a bid-ask spread
of 5. Figure 1.2 depicts this OrderBook visually.
Let’s submit a Sell LO that says we’d like to sell 40 shares as long as the transacted price
is greater than or equal to 107. Our Sell LO should simply get added to the Sell LOs side
of the OB.

d_s1, ob1 = ob0.sell_limit_order(107, 40)

The new OrderBook ob1 has 40 more shares at the price level of 107, as depicted in Figure
1.3.
Now let’s submit a Sell MO that says we’d like to sell 120 shares at the “best price.” Our
Sell MO should transact with 120 shares at “best prices” of 100 and 99 as well (since the
OB does not have enough Buy LO shares at the price of 100).

d_s2, ob2 = ob1.sell_market_order(120)

The new OrderBook ob2 has 120 less shares on the Buy LOs side of the OB, as depicted
in Figure 1.4.
Now let’s submit a Buy LO that says we’d like to buy 80 shares as long as the transacted
price is less than or equal to 100. Our Buy LO should get added to the Buy LOs side of the
OB.

d_s3, ob3 = ob2.buy_limit_order(100, 80)

Figure 1.2: Starting Order Book

Figure 1.3: Order Book after Sell LO

Figure 1.4: Order Book after Sell MO

Figure 1.5: Order Book after Buy LO

Figure 1.6: Order Book after 2nd Sell LO

The new OrderBook ob3 has re-introduced a Buy LO at the price level of 100 (now with
80 shares), as depicted in Figure 1.5.
Now let’s submit a Sell LO that says we’d like to sell 60 shares as long as the transacted
price is greater than or equal to 104. Our Sell LO should get added to the Sell LOs side of
the OB.
d_s4, ob4 = ob3.sell_limit_order(104, 60)

The new OrderBook ob4 has introduced a Sell LO at a price of 104 with 60 shares, as
depicted in Figure 1.6.
Now let’s submit a Buy MO that says we’d like to buy 150 shares at the “best price.” Our
Buy MO should transact with 150 shares at “best prices” on the Sell LOs side of the OB.
d_s5, ob5 = ob4.buy_market_order(150)

The new OrderBook ob5 has 150 less shares on the Sell LOs side of the OB, wiping out all
the shares at the price level of 104 and almost wiping out all the shares at the price level
of 105, as depicted in Figure 1.7.
This has served as a good test of our code (transactions working as we’d like) and
we encourage you to write more code of this sort to interact with the OrderBook, and
to produce graphs of evolution of the OrderBook as this will help develop stronger intu-
ition and internalize the concepts we’ve learnt above. All of the above code is in the file
rl/chapter9/order_book.py.
Now we are ready to get started with the problem of Optimal Execution of a large-sized
Market Order.

Figure 1.7: Order Book after Buy MO

1.2 Optimal Execution of a Market Order

Imagine the following problem: You are a trader in a stock and your boss has instructed
that you exit from trading in this stock because this stock doesn't meet your company's
new investment objectives. You have to sell all of the N shares you own in this stock in
the next T hours, but you have been instructed to accomplish the sale by submitting only
Market Orders (not allowed to submit any Limit Orders because of the uncertainty in the
time of execution of the sale with a Limit Order). You can submit sell market orders (of
any size) at the start of each hour - so you have T opportunities to submit market orders
of any size. Your goal is to maximize the Expected Total Utility of sales proceeds for all N
shares over the T hours. Your task is to break up N into T appropriate chunks to maximize
the Expected Total Utility objective. If you attempt to sell the N shares too fast (i.e., too
many in the first few hours), as we’ve learnt above, each (MO) sale will eat a lot into the
Buy LOs on the OB (Temporary Price Impact) which would result in transacting at prices
below the best price (Best Bid Price). Moreover, you risk moving the Best Bid Price on the
OB significantly lower (Permanent Price Impact), which would affect the sales proceeds for
the next few sales you'd make. On the other hand, if you sell the N shares too slow (i.e., too
few in the first few hours), you might transact at good prices but then you risk running
out of time, which means you will have to dump a lot of shares with time running out
which in turn would mean transacting at prices below the best price. Moreover, selling
too slow exposes you to more uncertainty in market price movements over a longer time
period, and more uncertainty in sales proceeds means the Expected Utility objective gets
hurt. Thus, the precise timing and sizes in the breakup of shares is vital. You will need
to have an estimate of the Temporary and Permanent Price Impact of your Market Orders,
which can help you identify the appropriate number of shares to sell at the start of each
hour.
Unsurprisingly, we can model this problem as a Markov Decision Process control prob-
lem where the actions at each time step (each hour, in this case) are the number of shares
sold at the time step and the rewards are the Utility of sales proceeds at each time step. To
keep things simple and intuitive, we shall model Price Impact of Market Orders in terms of
their effect on the Best Bid Price (rather than in terms of their effect on the entire OB). In
other words, we won’t be modeling the entire OB Price Dynamics, just the Best Bid Price

13
Dynamics. We shall refer to the OB activity of an MO immediately “eating into the Buy
LOs” (and hence, potentially transacting at prices lower than the best price) as the Tempo-
rary Price Impact. As mentioned earlier, this is followed by subsequent replenishment of
both Buy and Sell LOs on the OB (stabilizing the OB) - we refer to any eventual (end of
the hour) lowering of the Best Bid Price (relative to the Best Bid Price before the MO was
submitted) as the Permanent Price Impact. Modeling the temporary and permanent Price
Impacts separately helps us in deciding on the optimal actions (optimal shares to be sold
at the start of each hour).
Now we develop some formalism to describe this problem precisely. As mentioned
earlier, we make a number of simplifying assumptions in modeling the OB Dynamics for
ease of articulation (without diluting the most important concepts). We index discrete
time by t = 0, 1, . . . , T . We denote Pt as the Best Bid Price on the OB at the start of time
step t (for all t = 0, 1, . . . , T ) and Nt as the number of shares sold at time step t for all
t = 0, 1, . . . , T − 1. We denote the number of shares remaining to be sold at the start of
time step $t$ as $R_t$ for all $t = 0, 1, \ldots, T$. Therefore,

$$R_t = N - \sum_{i=0}^{t-1} N_i \quad \text{for all } t = 0, 1, \ldots, T$$

Note that:

$$R_0 = N, \qquad R_{t+1} = R_t - N_t \quad \text{for all } t = 0, 1, \ldots, T-1$$

Also note that we need to sell everything by time $t = T$ and so:

$$N_{T-1} = R_{T-1} \Rightarrow R_T = 0$$
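These recursions can be sanity-checked with a trivial sketch (hypothetical sale schedule):

```python
def remaining_shares(n, sales):
    """R_0 = n; R_{t+1} = R_t - N_t for a given schedule of sales N_t."""
    r = [n]
    for n_t in sales:
        r.append(r[-1] - n_t)
    return r

# Selling N = 100 shares over T = 3 steps as 40, 30, 30:
# R = [100, 60, 30, 0], so R_T = 0 as required
```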
The model of Best Bid Price Dynamics from one time step to the next is given by:

$$P_{t+1} = f_t(P_t, N_t, \epsilon_t) \quad \text{for all } t = 0, 1, \ldots, T-1$$

where $f_t$ is an arbitrary function incorporating:

• The Permanent Price Impact of selling $N_t$ shares.

• The Price-Impact-independent market-movement of the Best Bid Price from time $t$ to time $t+1$.

• Noise $\epsilon_t$, a source of randomness in Best Bid Price movements.
The sales proceeds from the sale at time step $t$, for all $t = 0, 1, \ldots, T-1$, are defined as:

$$N_t \cdot Q_t = N_t \cdot (P_t - g_t(P_t, N_t))$$

where $g_t$ is a function modeling the Temporary Price Impact (i.e., the $N_t$ MO "eating into"
the Buy LOs on the OB). $Q_t$ should be interpreted as the average Buy LO price transacted
against by the $N_t$ MO at time $t$.

Lastly, we denote the Utility (of Sales Proceeds) function as $U(\cdot)$.
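For instance, under an assumed linear temporary-impact function $g_t(P_t, N_t) = \beta \cdot N_t$ (with $\beta$ a hypothetical parameter, not one prescribed by the text), the average transacted price and sales proceeds work out as:

```python
def sale_proceeds(p_t, n_t, beta=0.005):
    """Sales proceeds N_t * Q_t under an assumed linear temporary impact:
    Q_t = P_t - g_t(P_t, N_t) = P_t - beta * N_t."""
    q_t = p_t - beta * n_t   # average Buy LO price transacted against
    return n_t * q_t

# Selling 100 shares at Best Bid 50.0 with beta = 0.005:
# Q_t = 50.0 - 0.5 = 49.5, proceeds = 4950.0
```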
As mentioned previously, solving for the optimal number of shares to be sold at each
time step can be modeled as a discrete-time finite-horizon Markov Decision Process, which
we describe below in terms of the order of MDP activity at each time step t = 0, 1, . . . , T −1
(the MDP horizon is time T meaning all states at time T are terminal states). We follow
the notational style of finite-horizon MDPs that should now be familiar from previous
chapters.
Order of Events at time step t for all t = 0, 1, . . . , T − 1:

• Observe State $s_t := (P_t, R_t) \in \mathcal{S}_t$

• Perform Action $a_t := N_t \in \mathcal{A}_t$

• Receive Reward $r_{t+1} := U(N_t \cdot Q_t) = U(N_t \cdot (P_t - g_t(P_t, N_t)))$

• Experience Price Dynamics $P_{t+1} = f_t(P_t, N_t, \epsilon_t)$ and set $R_{t+1} = R_t - N_t$ so as to obtain the next state $s_{t+1} = (P_{t+1}, R_{t+1}) \in \mathcal{S}_{t+1}$.

Note that we have intentionally not specified if $P_t, R_t, N_t$ are integers or real numbers, or
if constrained to be non-negative, etc. Those precise specifications will be customized to the
nuances/constraints of the specific Optimal Order Execution problem we'd be solving. By
default, we shall assume that $P_t \in \mathbb{R}^+$ and $N_t, R_t \in \mathbb{Z}_{\ge 0}$ (as these represent realistic trading
situations), although we do consider special cases later in the chapter where $P_t, R_t \in \mathbb{R}$
(unconstrained real numbers for analytical tractability).
The goal is to find the Optimal Policy $\pi^* = (\pi_0^*, \pi_1^*, \ldots, \pi_{T-1}^*)$ (defined as $\pi_t^*((P_t, R_t)) = N_t^*$) that maximizes:

$$\mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t \cdot U(N_t \cdot Q_t)\right]$$

where $\gamma$ is the discount factor to account for the fact that future utility of sales proceeds
can be modeled to be less valuable than today's.
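For a given trajectory of sales proceeds, this objective can be sketched as follows (the utility function here is just an example choice):

```python
import math

def discounted_total_utility(proceeds, gamma, utility=math.log):
    """Sum of gamma^t * U(N_t * Q_t) over t = 0, ..., T-1, where proceeds
    holds the sales proceeds N_t * Q_t for each time step."""
    return sum(gamma ** t * utility(x) for t, x in enumerate(proceeds))

# With the identity utility and gamma = 0.5, proceeds [10.0, 20.0]
# give 10.0 + 0.5 * 20.0 = 20.0
```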
Now let us write some code to solve this MDP. We write a class OptimalOrderExecution
which models a fairly generic MDP for Optimal Order Execution as described above, and
solves the Control problem with Approximate Value Iteration using the backward induc-
tion algorithm that we implemented in Chapter ??. Let us start by taking a look at the
attributes (inputs) in OptimalOrderExecution:

• shares refers to the total number of shares N to be sold over T time steps.
• time_steps refers to the number of time steps T .
• avg_exec_price_diff refers to the time-sequenced functions gt that return the reduc-
tion in the average price obtained by the Market Order at time t due to eating into the
Buy LOs. gt takes as input the type PriceAndShares that represents a pair of price:
float and shares: int (in this case, the price is Pt and the shares is the MO size Nt
at time t). As explained earlier, the sales proceeds at time t are: $N_t \cdot (P_t - g_t(P_t, N_t))$.
• price_dynamics refers to the time-sequenced functions ft that represent the price dy-
namics: Pt+1 ∼ ft (Pt , Nt ). ft outputs a probability distribution of prices for Pt+1 .
• utility_func refers to the Utility of Sales Proceeds function, incorporating any risk-
aversion.
• discount_factor refers to the discount factor γ.
• func_approx refers to the ValueFunctionApprox type to be used to approximate the
Value Function for each time step (since we are doing backward induction).
• initial_price_distribution refers to the probability distribution of prices P0 at time
0, which is used to generate the samples of states at each of the time steps (needed
in the approximate backward induction algorithm).
from rl.approximate_dynamic_programming import ValueFunctionApprox

@dataclass(frozen=True)
class PriceAndShares:
    price: float
    shares: int

@dataclass(frozen=True)
class OptimalOrderExecution:
    shares: int
    time_steps: int
    avg_exec_price_diff: Sequence[Callable[[PriceAndShares], float]]
    price_dynamics: Sequence[Callable[[PriceAndShares], Distribution[float]]]
    utility_func: Callable[[float], float]
    discount_factor: float
    func_approx: ValueFunctionApprox[PriceAndShares]
    initial_price_distribution: Distribution[float]

The two key things we need to perform the backward induction are:

• A method get_mdp that given a time step t, produces the MarkovDecisionProcess ob-
ject representing the transitions from time t to time t+1. The class OptimalExecutionMDP
within get_mdp implements the abstract methods step and actions of the abstract
class MarkovDecisionProcess. The code should be fairly self-explanatory - just a cou-
ple of things to point out here. Firstly, the input p_r: NonTerminal[PriceAndShares]
to the step method represents the state (Pt , Rt ) at time t, and the variable p_s: PriceAndShares
represents the pair of (Pt , Nt ), which serves as input to avg_exec_price_diff and
price_dynamics (attributes of OptimalOrderExecution). Secondly, note that the actions
method returns an Iterator on a single int at time t = T −1 because of the constraint
NT −1 = RT −1 .
• A method get_states_distribution that returns the probability distribution of states
$(P_t, R_t)$ at time $t$ (of type SampledDistribution[NonTerminal[PriceAndShares]]). The
code here is similar to the get_states_distribution method of AssetAllocDiscrete
in Chapter ?? (essentially, walking forward from time 0 to time $t$ by sampling from
the state-transition probability distribution and also sampling from uniform choices
over all actions at each time step).

def get_mdp(self, t: int) -> MarkovDecisionProcess[PriceAndShares, int]:
    utility_f: Callable[[float], float] = self.utility_func
    price_diff: Sequence[Callable[[PriceAndShares], float]] = \
        self.avg_exec_price_diff
    dynamics: Sequence[Callable[[PriceAndShares], Distribution[float]]] = \
        self.price_dynamics
    steps: int = self.time_steps

    class OptimalExecutionMDP(MarkovDecisionProcess[PriceAndShares, int]):

        def step(
            self,
            p_r: NonTerminal[PriceAndShares],
            sell: int
        ) -> SampledDistribution[Tuple[State[PriceAndShares], float]]:

            def sr_sampler_func(
                p_r=p_r,
                sell=sell
            ) -> Tuple[State[PriceAndShares], float]:
                p_s: PriceAndShares = PriceAndShares(
                    price=p_r.state.price,
                    shares=sell
                )
                next_price: float = dynamics[t](p_s).sample()
                next_rem: int = p_r.state.shares - sell
                next_state: PriceAndShares = PriceAndShares(
                    price=next_price,
                    shares=next_rem
                )
                reward: float = utility_f(
                    sell * (p_r.state.price - price_diff[t](p_s))
                )
                return (NonTerminal(next_state), reward)

            return SampledDistribution(
                sampler=sr_sampler_func,
                expectation_samples=100
            )

        def actions(self, p_s: NonTerminal[PriceAndShares]) -> \
                Iterator[int]:
            if t == steps - 1:
                return iter([p_s.state.shares])
            else:
                return iter(range(p_s.state.shares + 1))

    return OptimalExecutionMDP()

def get_states_distribution(self, t: int) -> \
        SampledDistribution[NonTerminal[PriceAndShares]]:

    def states_sampler_func() -> NonTerminal[PriceAndShares]:
        price: float = self.initial_price_distribution.sample()
        rem: int = self.shares
        for i in range(t):
            sell: int = Choose(range(rem + 1)).sample()
            price = self.price_dynamics[i](PriceAndShares(
                price=price,
                shares=rem
            )).sample()
            rem -= sell
        return NonTerminal(PriceAndShares(
            price=price,
            shares=rem
        ))

    return SampledDistribution(states_sampler_func)

Finally, we produce the Optimal Value Function and Optimal Policy for each time step
with the following method backward_induction_vf_and_pi:
from rl.approximate_dynamic_programming import back_opt_vf_and_policy

def backward_induction_vf_and_pi(
    self
) -> Iterator[Tuple[ValueFunctionApprox[PriceAndShares],
                    DeterministicPolicy[PriceAndShares, int]]]:
    mdp_f0_mu_triples: Sequence[Tuple[
        MarkovDecisionProcess[PriceAndShares, int],
        ValueFunctionApprox[PriceAndShares],
        SampledDistribution[NonTerminal[PriceAndShares]]
    ]] = [(
        self.get_mdp(i),
        self.func_approx,
        self.get_states_distribution(i)
    ) for i in range(self.time_steps)]
    num_state_samples: int = 10000
    error_tolerance: float = 1e-6
    return back_opt_vf_and_policy(
        mdp_f0_mu_triples=mdp_f0_mu_triples,
        gamma=self.discount_factor,
        num_state_samples=num_state_samples,
        error_tolerance=error_tolerance
    )

The above code is in the file rl/chapter9/optimal_order_execution.py. We encourage
you to create a few different instances of OptimalOrderExecution by varying its inputs
(try different temporary and permanent price impact functions, different utility functions,
impose a few constraints, etc.). Note that the above code has been written with an educa-
tional motivation rather than an efficient-computation motivation, so the convergence of
the backward induction ADP algorithm is going to be slow. How do we know that the
above code is correct? We need to create a simple special case that yields a closed-form
solution against which we can compare the Optimal Value Function and Optimal Policy
produced by OptimalOrderExecution. This will be the subject of the following subsec-
tion.

1.2.1 Simple Linear Price Impact Model with no Risk-Aversion


Now we consider a special case of the above-described MDP - a simple linear Price Impact
model with no risk-aversion. Furthermore, for analytical tractability, we assume N, Nt , Pt
are all unconstrained continuous-valued (i.e., taking values ∈ R).
We assume simple linear price dynamics as follows:

$$P_{t+1} = f_t(P_t, N_t, \epsilon_t) = P_t - \alpha \cdot N_t + \epsilon_t$$
where α ∈ R and ϵt for all t = 0, 1, . . . , T − 1 are independent and identically distributed
(i.i.d.) with E[ϵt |Nt , Pt ] = 0. Therefore, the Permanent Price Impact (as an Expectation)
is α · Nt .
As for the Temporary Price Impact, we know that gt needs to be a non-decreasing func-
tion of Nt . We assume a simple linear form for gt as follows:

gt (Pt , Nt ) = β · Nt for all t = 0, 1, . . . , T − 1


for some constant β ∈ R≥0 . So, Qt = Pt − βNt . As mentioned above, we assume no
risk-aversion, i.e., the Utility function U (·) is assumed to be the identity function. Also, we
assume that the MDP discount factor γ = 1.
Note that all of these assumptions are far too simplistic and hence, an unrealistic model
of the real-world, but starting with this simple model helps build good intuition and en-
ables us to develop more realistic models by incrementally adding complexity/nuances
from this simple base model.
As ever, in order to solve the Control problem, we define the Optimal Value Function and
invoke the Bellman Optimality Equation. We shall use the standard notation for discrete-
time finite-horizon MDPs that we are now very familiar with.
Denote the Value Function for policy π at time t (for all t = 0, 1, . . . , T − 1) as:

$$V^{\pi}_t((P_t, R_t)) = \mathbb{E}_{\pi}\Big[\sum_{i=t}^{T-1} N_i \cdot (P_i - \beta \cdot N_i) \,\Big|\, (P_t, R_t)\Big]$$

Denote the Optimal Value Function at time t (for all t = 0, 1, . . . , T − 1) as:

$$V^*_t((P_t, R_t)) = \max_{\pi} V^{\pi}_t((P_t, R_t))$$

The Optimal Value Function satisfies the finite-horizon Bellman Optimality Equation
for all t = 0, 1, . . . , T − 2, as follows:

$$V^*_t((P_t, R_t)) = \max_{N_t}\{N_t \cdot (P_t - \beta \cdot N_t) + \mathbb{E}[V^*_{t+1}((P_{t+1}, R_{t+1}))]\}$$

and

$$V^*_{T-1}((P_{T-1}, R_{T-1})) = N_{T-1} \cdot (P_{T-1} - \beta \cdot N_{T-1}) = R_{T-1} \cdot (P_{T-1} - \beta \cdot R_{T-1})$$

From the above, we can infer:

$$V^*_{T-2}((P_{T-2}, R_{T-2})) = \max_{N_{T-2}}\{N_{T-2} \cdot (P_{T-2} - \beta \cdot N_{T-2}) + \mathbb{E}[R_{T-1} \cdot (P_{T-1} - \beta \cdot R_{T-1})]\}$$

$$= \max_{N_{T-2}}\{N_{T-2} \cdot (P_{T-2} - \beta \cdot N_{T-2}) + \mathbb{E}[(R_{T-2} - N_{T-2}) \cdot (P_{T-1} - \beta \cdot (R_{T-2} - N_{T-2}))]\}$$

$$= \max_{N_{T-2}}\{N_{T-2} \cdot (P_{T-2} - \beta \cdot N_{T-2}) + (R_{T-2} - N_{T-2}) \cdot (P_{T-2} - \alpha \cdot N_{T-2} - \beta \cdot (R_{T-2} - N_{T-2}))\}$$

This simplifies to:

$$V^*_{T-2}((P_{T-2}, R_{T-2})) = \max_{N_{T-2}}\{R_{T-2} \cdot P_{T-2} - \beta \cdot R_{T-2}^2 + (\alpha - 2\beta) \cdot (N_{T-2}^2 - N_{T-2} \cdot R_{T-2})\} \quad (1.9)$$

For the case α ≥ 2β, noting that $N_{T-2} \leq R_{T-2}$, we have the trivial solution:

$$N^*_{T-2} = 0 \text{ or } N^*_{T-2} = R_{T-2}$$

Substituting either of these two values for $N^*_{T-2}$ in the right-hand-side of Equation (1.9)
gives:

$$V^*_{T-2}((P_{T-2}, R_{T-2})) = R_{T-2} \cdot (P_{T-2} - \beta \cdot R_{T-2})$$

Continuing backwards in time in this manner (for the case α ≥ 2β) gives:

$$N^*_t = 0 \text{ or } N^*_t = R_t \text{ for all } t = 0, 1, \ldots, T-1$$

$$V^*_t((P_t, R_t)) = R_t \cdot (P_t - \beta \cdot R_t) \text{ for all } t = 0, 1, \ldots, T-1$$

So the solution for the case α ≥ 2β is to sell all N shares at any one of the time steps
t = 0, 1, . . . , T − 1 (and none in the other time steps), and the Optimal Expected Total Sale
Proceeds is N · (P0 − β · N).
For the case α < 2β, differentiating the term inside the max in Equation (1.9) with
respect to $N_{T-2}$, and setting it to 0, gives:

$$(\alpha - 2\beta) \cdot (2N^*_{T-2} - R_{T-2}) = 0 \Rightarrow N^*_{T-2} = \frac{R_{T-2}}{2}$$

Substituting this solution for $N^*_{T-2}$ in Equation (1.9) gives:

$$V^*_{T-2}((P_{T-2}, R_{T-2})) = R_{T-2} \cdot P_{T-2} - R_{T-2}^2 \cdot \Big(\frac{\alpha + 2\beta}{4}\Big)$$

Continuing backwards in time in this manner gives:

$$N^*_t = \frac{R_t}{T-t} \text{ for all } t = 0, 1, \ldots, T-1$$

$$V^*_t((P_t, R_t)) = R_t \cdot P_t - \frac{R_t^2}{2} \cdot \Big(\frac{2\beta + \alpha \cdot (T-t-1)}{T-t}\Big) \text{ for all } t = 0, 1, \ldots, T-1$$

Rolling forward in time, we see that $N^*_t = \frac{N}{T}$, i.e., splitting the N shares uniformly across
the T time steps. Hence, the Optimal Policy is a constant deterministic function (i.e., inde-
pendent of the State). Note that a uniform split makes intuitive sense because Price Impact
and Market Movement are both linear and additive, and don't interact. This optimization
is essentially equivalent to minimizing $\sum_{t=1}^{T} N_t^2$ with the constraint $\sum_{t=1}^{T} N_t = N$. The
Optimal Expected Total Sales Proceeds is equal to:

$$N \cdot P_0 - \frac{N^2}{2} \cdot \Big(\alpha + \frac{2\beta - \alpha}{T}\Big)$$

Implementation Shortfall is the technical term used to refer to the reduction in Total Sales
Proceeds relative to the maximum possible sales proceeds (= N · P0). So, in this simple
linear model, the Implementation Shortfall from Price Impact is $\frac{N^2}{2} \cdot (\alpha + \frac{2\beta - \alpha}{T})$. Note that
the Implementation Shortfall is non-zero even if one had infinite time available (T → ∞)
for the case of α > 0. If Price Impact were purely temporary (α = 0, i.e., Price fully snapped
back), then the Implementation Shortfall is zero if one had infinite time available.
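The closed-form solution above is easy to sanity-check numerically. Below is a minimal, self-contained sketch (our own illustration, not the book's rl-library code; all parameter values are arbitrary assumptions) that rolls the policy $N^*_t = R_t/(T-t)$ forward, confirms it yields a uniform split, and checks the expected proceeds against the Implementation Shortfall formula:

```python
# Illustrative check of the closed-form solution (alpha < 2*beta case).
# All parameter values used with these functions are arbitrary assumptions.

def uniform_split_schedule(total_shares: float, num_steps: int) -> list:
    """Roll N*_t = R_t / (T - t) forward from R_0 = N."""
    rem = total_shares
    schedule = []
    for t in range(num_steps):
        sale = rem / (num_steps - t)
        schedule.append(sale)
        rem -= sale
    return schedule

def expected_proceeds(schedule, p0, alpha, beta):
    """Expected sum of N_t * Q_t under P_{t+1} = P_t - alpha*N_t + eps_t
    (E[eps_t] = 0) and temporary impact Q_t = P_t - beta*N_t."""
    price, proceeds = p0, 0.0
    for n in schedule:
        proceeds += n * (price - beta * n)
        price -= alpha * n
    return proceeds

def implementation_shortfall(total_shares, num_steps, alpha, beta):
    """N^2/2 * (alpha + (2*beta - alpha)/T), as derived above."""
    return total_shares ** 2 / 2 * \
        (alpha + (2 * beta - alpha) / num_steps)
```

For instance, with N = 100, T = 5, α = 0.03, β = 0.05 and P0 = 100, every step sells 20 shares and the expected proceeds equal N · P0 minus the Implementation Shortfall.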
So now let's customize the class OptimalOrderExecution to this simple linear price im-
pact model, and compare the Optimal Value Function and Optimal Policy produced by
OptimalOrderExecution against the above-derived closed-form solutions. We write code
below to create an instance of OptimalOrderExecution with time steps T = 5, total number
of shares to be sold N = 100, linear permanent price impact with α = 0.03, linear tempo-
rary price impact with β = 0.05, utility function as the identity function (no risk-aversion),
and discount factor γ = 1. We set the standard deviation for the price dynamics probabil-
ity distribution to 0 to speed up the calculation. Since we know the closed-form solution
for the Optimal Value Function, we provide some assistance to OptimalOrderExecution by
setting up a linear function approximation with two features: Pt · Rt and Rt². The task
of OptimalOrderExecution is to infer the correct coefficients of these features for each time
step. If the coefficients match those of the closed-form solution, it provides a great degree
of confidence that our code is working correctly.

num_shares: int = 100
num_time_steps: int = 5
alpha: float = 0.03
beta: float = 0.05
init_price_mean: float = 100.0
init_price_stdev: float = 10.0

price_diff = [lambda p_s: beta * p_s.shares for _ in range(num_time_steps)]
dynamics = [lambda p_s: Gaussian(
    mu=p_s.price - alpha * p_s.shares,
    sigma=0.
) for _ in range(num_time_steps)]
ffs = [
    lambda p_s: p_s.state.price * p_s.state.shares,
    lambda p_s: float(p_s.state.shares * p_s.state.shares)
]
fa: FunctionApprox = LinearFunctionApprox.create(feature_functions=ffs)
init_price_distrib: Gaussian = Gaussian(
    mu=init_price_mean,
    sigma=init_price_stdev
)

ooe: OptimalOrderExecution = OptimalOrderExecution(
    shares=num_shares,
    time_steps=num_time_steps,
    avg_exec_price_diff=price_diff,
    price_dynamics=dynamics,
    utility_func=lambda x: x,
    discount_factor=1,
    func_approx=fa,
    initial_price_distribution=init_price_distrib
)

it_vf: Iterator[Tuple[ValueFunctionApprox[PriceAndShares],
                      DeterministicPolicy[PriceAndShares, int]]] = \
    ooe.backward_induction_vf_and_pi()

Next we evaluate this Optimal Value Function and Optimal Policy on a particular state
for all time steps, and compare that against the closed-form solution. The state we use for
evaluation is as follows:

state: PriceAndShares = PriceAndShares(
    price=init_price_mean,
    shares=num_shares
)

The code to evaluate the obtained Optimal Value Function and Optimal Policy on the
above state is as follows:

for t, (vf, pol) in enumerate(it_vf):
    print(f"Time {t:d}")
    print()
    opt_sale: int = pol.action_for(state)
    val: float = vf(NonTerminal(state))
    print(f"Optimal Sales = {opt_sale:d}, Opt Val = {val:.3f}")
    print()
    print("Optimal Weights below:")
    print(vf.weights.weights)
    print()

With 100,000 state samples for each time step and only 10 state transition samples (since
the standard deviation of ϵ is set to be very small), this prints the following:

Time 0

Optimal Sales = 20, Opt Val = 9779.976

Optimal Weights below:
[ 0.9999948 -0.02199718]

Time 1

Optimal Sales = 20, Opt Val = 9762.479

Optimal Weights below:
[ 0.9999935 -0.02374564]

Time 2

Optimal Sales = 20, Opt Val = 9733.324

Optimal Weights below:
[ 0.99999335 -0.02666098]

Time 3

Optimal Sales = 20, Opt Val = 9675.013

Optimal Weights below:
[ 0.99999316 -0.03249182]

Time 4

Optimal Sales = 20, Opt Val = 9500.000

Optimal Weights below:
[ 1.   -0.05]

Now let’s compare these results against the closed-form solution.

for t in range(num_time_steps):
    print(f"Time {t:d}")
    print()
    left: int = num_time_steps - t
    opt_sale_anal: float = num_shares / num_time_steps
    wt1: float = 1
    wt2: float = -(2 * beta + alpha * (left - 1)) / (2 * left)
    val_anal: float = wt1 * state.price * state.shares + \
        wt2 * state.shares * state.shares

    print(f"Optimal Sales = {opt_sale_anal:.3f}, Opt Val = {val_anal:.3f}")
    print(f"Weight1 = {wt1:.3f}")
    print(f"Weight2 = {wt2:.3f}")
    print()

This prints the following:

Time 0

Optimal Sales = 20.000, Opt Val = 9780.000
Weight1 = 1.000
Weight2 = -0.022

Time 1

Optimal Sales = 20.000, Opt Val = 9762.500
Weight1 = 1.000
Weight2 = -0.024

Time 2

Optimal Sales = 20.000, Opt Val = 9733.333
Weight1 = 1.000
Weight2 = -0.027

Time 3

Optimal Sales = 20.000, Opt Val = 9675.000
Weight1 = 1.000
Weight2 = -0.033

Time 4

Optimal Sales = 20.000, Opt Val = 9500.000
Weight1 = 1.000
Weight2 = -0.050

We need to point out here that the general case of optimal order execution, involving
modeling of the entire Order Book's dynamics, will have to deal with a large state space.
The resulting ADP algorithm would suffer from the curse of dimensionality, and so we
would need to employ Reinforcement Learning algorithms.

1.2.2 Paper by Bertsimas and Lo on Optimal Order Execution


A paper by Bertsimas and Lo on Optimal Order Execution (Bertsimas and Lo 1998) con-
sidered a special case of the simple Linear Impact model we sketched above. Specifically,
they assumed no risk-aversion (Utility function is identity function) and assumed that the
Permanent Price Impact parameter α is equal to the Temporary Price Impact Parameter β.
In the same paper, Bertsimas and Lo then extended this Linear Impact Model to include
dependence on a serially-correlated variable Xt as follows:

Pt+1 = Pt − (β · Nt + θ · Xt ) + ϵt

Xt+1 = ρ · Xt + ηt

Qt = Pt − (β · Nt + θ · Xt )
where ϵt and ηt are each independent and identically distributed random variables with
mean zero for all t = 0, 1, . . . , T − 1, ϵt and ηt are also independent of each other for all
t = 0, 1, . . . , T − 1. Xt can be thought of as a market factor affecting Pt linearly. Applying
the finite-horizon Bellman Optimality Equation on the Optimal Value Function (and the
same backward-recursive approach as before) yields:

$$N^*_t = \frac{R_t}{T-t} + h(t, \beta, \theta, \rho) \cdot X_t$$

$$V^*_t((P_t, R_t, X_t)) = R_t \cdot P_t - (\text{quadratic in } (R_t, X_t) + \text{constant})$$

Essentially, the serial-correlation predictability (ρ ≠ 0) alters the uniform-split strategy.
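The dynamics of this extended model are straightforward to simulate. The sketch below is our own illustration (not code from the Bertsimas-Lo paper; all parameter values are assumptions) and samples the total sale proceeds for a given sale schedule:

```python
# A minimal sketch of the extended Bertsimas-Lo dynamics with a
# serially-correlated market factor X_t. Function and parameter names
# are ours, chosen for illustration.
import random

def simulate_proceeds(sales, p0, beta, theta, rho, eps_sd, eta_sd, seed=0):
    """Sample total sale proceeds sum_t N_t * Q_t under:
    P_{t+1} = P_t - (beta*N_t + theta*X_t) + eps_t
    X_{t+1} = rho*X_t + eta_t
    Q_t     = P_t - (beta*N_t + theta*X_t)
    """
    rng = random.Random(seed)
    p, x, proceeds = p0, 0.0, 0.0
    for n in sales:
        impact = beta * n + theta * x
        proceeds += n * (p - impact)              # executed at Q_t
        p = p - impact + rng.gauss(0.0, eps_sd)   # permanent move plus noise
        x = rho * x + rng.gauss(0.0, eta_sd)      # AR(1) market factor
    return proceeds
```

Setting θ = 0 and both noise levels to 0 recovers the earlier linear model with α = β, which is a useful degenerate case for testing.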
In the same paper, Bertsimas and Lo presented a more realistic model called Linear-
Percentage Temporary (abbreviated as LPT) Price Impact model, whose salient features in-
clude:

• Geometric random walk: consistent with real data, and avoids non-positive prices.
• Fractional Price Impact $\frac{g_t(P_t, N_t)}{P_t}$ doesn't depend on Pt (this is validated by real data).
• Purely Temporary Price Impact, i.e., the price Pt snaps back after the Temporary
  Price Impact (no Permanent effect of Market Orders on future prices).

The specific model is:

Pt+1 = Pt · eZt
Xt+1 = ρ · Xt + ηt
Qt = Pt · (1 − β · Nt − θ · Xt )
where Zt are independent and identically distributed random variables with mean µZ
and variance σZ2 , ηt are independent and identically distributed random variables with
mean zero for all t = 0, 1, . . . , T − 1, Zt and ηt are independent of each other for all t =
0, 1, . . . , T − 1. Xt can be thought of as a market factor affecting Pt multiplicatively. With
the same derivation methodology as before, we get the solution:

$$N^*_t = c^{(1)}_t + c^{(2)}_t \cdot R_t + c^{(3)}_t \cdot X_t$$

$$V^*_t((P_t, R_t, X_t)) = e^{\mu_Z + \frac{\sigma_Z^2}{2}} \cdot P_t \cdot (c^{(4)}_t + c^{(5)}_t \cdot R_t + c^{(6)}_t \cdot X_t + c^{(7)}_t \cdot R_t^2 + c^{(8)}_t \cdot X_t^2 + c^{(9)}_t \cdot R_t \cdot X_t)$$

where $c^{(k)}_t$, 1 ≤ k ≤ 9, are constants (independent of Pt, Rt, Xt).
As an exercise, we recommend implementing the above (LPT) model by customizing
OptimalOrderExecution to compare the obtained Optimal Value Function and Optimal
Policy against the closed-form solution (you can find the exact expressions for the $c^{(k)}_t$
coefficients in the Bertsimas and Lo paper).
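As a starting point for that exercise, here is a sketch of one time step of the LPT dynamics (our own illustration; the function signature and parameter values are our assumptions, not the paper's):

```python
# One step of the LPT model: P_{t+1} = P_t * exp(Z_t),
# X_{t+1} = rho*X_t + eta_t, Q_t = P_t * (1 - beta*N_t - theta*X_t).
import math
import random

def lpt_step(p, x, n, beta, theta, rho, mu_z, sigma_z, eta_sd, rng):
    """Returns (proceeds, next_price, next_factor) for selling n shares."""
    q = p * (1.0 - beta * n - theta * x)             # fractional temporary impact
    next_p = p * math.exp(rng.gauss(mu_z, sigma_z))  # price snaps back, then moves
    next_x = rho * x + rng.gauss(0.0, eta_sd)        # AR(1) market factor
    return n * q, next_p, next_x
```

With both noise levels set to 0, the price snaps back exactly (purely temporary impact), which makes the step easy to verify by hand.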

1.2.3 Incorporating Risk-Aversion and Real-World Considerations


Bertsimas and Lo ignored risk-aversion for the purpose of analytical tractability. Although
there was value in obtaining closed-form solutions, ignoring risk-aversion makes their
model unrealistic. We have discussed in detail in Chapter ?? the fact that traders are
wary of the risk of uncertain revenues and would be willing to trade away some expected
revenues for lower variance of revenues. This calls for incorporating risk-aversion in the
maximization objective. Almgren and Chriss wrote an important paper (Almgren and
Chriss 2000) where they work in this Risk-Aversion framework. They consider our simple
linear price impact model and incorporate risk-aversion by maximizing $E[Y] - \lambda \cdot Var[Y]$,
where $Y = \sum_{t=0}^{T-1} N_t \cdot Q_t$ is the total (uncertain) sales proceeds and λ controls the degree of
risk-aversion. The incorporation of risk-aversion affects the time-trajectory of $N^*_t$. Clearly,
if λ = 0, we get the usual uniform-split strategy: $N^*_t = \frac{N}{T}$. The other extreme assumption
is to minimize Var[Y], which yields $N^*_0 = N$ (sell everything immediately, because the
only thing we want to avoid is uncertainty of sales proceeds). In their paper, Almgren
and Chriss go on to derive the Efficient Frontier for this problem (analogous to the Effi-
cient Frontier of Portfolio Theory we outline in Appendix ??). They also derive solutions for
specific utility functions.
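To see the λ trade-off concretely, here is a hedged Monte Carlo sketch (our own illustration, not Almgren-Chriss's closed-form analysis; all parameter values are assumptions) that scores a sale schedule by E[Y] − λ · Var[Y] under the simple linear impact model:

```python
# Monte Carlo estimate of E[Y] - lam*Var[Y] for a given sale schedule under
# P_{t+1} = P_t - alpha*N_t + eps_t and Q_t = P_t - beta*N_t.
import random

def mean_var_objective(schedule, p0, alpha, beta, eps_sd, lam,
                       num_paths=10000, seed=42):
    rng = random.Random(seed)
    ys = []
    for _ in range(num_paths):
        p, y = p0, 0.0
        for n in schedule:
            y += n * (p - beta * n)               # proceeds at Q_t
            p = p - alpha * n + rng.gauss(0.0, eps_sd)
        ys.append(y)
    mean = sum(ys) / num_paths
    var = sum((y - mean) ** 2 for y in ys) / num_paths
    return mean - lam * var
```

Selling everything immediately has zero variance, while a uniform split has higher expected proceeds but non-zero variance; which schedule scores better depends on λ, exactly as described above.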
To model a real-world trading situation, the first step is to start with the MDP we de-
scribed earlier with an appropriate model for the price dynamics ft (·) and the temporary
price impact gt (·) (incorporating potential time-heterogeneity, non-linear price dynamics
and non-linear impact). The OptimalOrderExecution class we wrote above allows us to
incorporate all of the above. We can also model various real-world “frictions” such as
discrete prices, discrete number of shares, constraints on prices and number of shares, as
well as trading fees. To make the model truer to reality and more sophisticated, we can
introduce various market factors in the State which would invariably lead to bloating of
the State Space. We would also need to capture Cross-Asset Market Impact. As a further
step, we could represent the entire Order Book (or a compact summary of the size/shape
of the Order book) as part of the state, which leads to further bloating of the state space.
All of this makes ADP infeasible and one would need to employ Reinforcement Learning
algorithms. More importantly, we’d need to write a realistic Order Book Dynamics sim-
ulator capturing all of the above real-world considerations that an RL algorithm would
learn from. There are a lot of practical and technical details involved in writing a real-
world simulator and we won’t be covering those details in this book. It suffices for here
to say that the simulator would essentially be a sampling model that has learnt the Order
Book Dynamics from market data (supervised learning of the Order Book Dynamics). Us-
ing such a simulator and with a deep learning-based function approximation of the Value
Function, we can solve a practical Optimal Order Execution problem with Reinforcement
Learning. We refer you to a couple of papers for further reading on this:

• Paper by Nevmyvaka, Feng, Kearns in 2006 (Nevmyvaka, Feng, and Kearns 2006)
• Paper by Vyetrenko and Xu in 2019 (Vyetrenko and Xu 2019)

Designing real-world simulators for Order Book Dynamics and using Reinforcement
Learning for Optimal Order Execution is an exciting area for future research as well as
engineering design. We hope this section has provided sufficient foundations for you to
dig into this topic further.

1.3 Optimal Market-Making


Now we move on to the second problem of this chapter involving trading on an Order
Book - the problem of Optimal Market-Making. A market-maker is a company/individual
which/who regularly quotes bid and ask prices in a financial asset (which, without loss of
generality, we will refer to as a “stock”). The market-maker typically holds some inventory
in the stock, always looking to buy at one’s quoted bid price and sell at one’s quoted ask
price, thus looking to make money off the spread between one’s quoted ask price and
one’s quoted bid price. The business of a market-maker is similar to that of a car dealer
who maintains an inventory of cars and who will offer purchase and sales prices, looking
to make a profit off the price spread and ensuring that the inventory of cars doesn’t get
too big. In this section, we consider the business of a market-maker who quotes one’s

bid prices by submitting Buy LOs on an OB and quotes one’s ask prices by submitting
Sell LOs on the OB. Market-makers are known as liquidity providers in the market because
they make shares of the stock available for trading on the OB (both on the buy side and
sell side). In general, anyone who submits LOs can be thought of as a market liquidity
provider. Likewise, anyone who submits MOs can be thought of as a market liquidity taker
(because an MO takes shares out of the volume that was made available for trading on the
OB).
There is typically a fairly complex interplay between liquidity providers (including market-
makers) and liquidity takers. Modeling OB dynamics is about modeling this complex in-
terplay, predicting arrivals of MOs and LOs, in response to market events and in response
to observed activity on the OB. In this section, we view the OB from the perspective of a
single market-maker who aims to make money with Buy/Sell LOs of appropriate bid-ask
spread and with appropriate volume of shares (specified in their submitted LOs). The
market-maker is likely to be successful if she can do a good job of forecasting OB Dynam-
ics and dynamically adjusting her Buy/Sell LOs on the OB. The goal of the market-maker
is to maximize one’s Utility of Gains at the end of a suitable horizon of time.
The core intuition in the decision of how to set the price and shares in the market-maker’s
Buy and Sell LOs is as follows: If the market-maker’s bid-ask spread is too narrow, they
will have more frequent transactions but smaller gains per transaction (more likelihood of
their LOs being transacted against by an MO or an opposite-side LO). On the other hand,
if the market-maker’s bid-ask spread is too wide, they will have less frequent transactions
but larger gains per transaction (less likelihood of their LOs being transacted against by
an MO or an opposite-side LO). Also of great importance is the fact that a market-maker
needs to carefully manage potentially large inventory buildup (either on the long side
or the short side) so as to avoid scenarios of consequent unfavorable forced liquidation
upon reaching the horizon time. Inventory buildup can occur if the market participants
consistently transact against mostly one side of the market-maker’s submitted LOs. With
this high-level intuition, let us make these concepts of market-making precise. We start
by developing some notation to help articulate the problem of Optimal Market-Making
clearly. We will re-use some of the notation and terminology we had developed for the
problem of Optimal Order Execution. As ever, for ease of exposition, we will simplify the
setting for the Optimal Market-Making problem.
Assume there are a finite number of time steps indexed by t = 0, 1, . . . , T . Assume
the market-maker always shows a bid price and ask price (at each time t) along with the
associated bid shares and ask shares on the OB. Also assume, for ease of exposition, that
the market-maker can add or remove bid/ask shares from the OB costlessly. We use the
following notation:

• Denote Wt ∈ R as the market-maker’s trading account value at time t.


• Denote It ∈ Z as the market-maker’s inventory of shares at time t (assume I0 =
0). Note that the inventory can be positive or negative (negative means the market-
maker is short a certain number of shares).
• Denote St ∈ ℝ+ as the OB Mid Price at time t (assume a stochastic process for St).
• Denote $P^{(b)}_t \in \mathbb{R}^+$ as the market-maker's Bid Price at time t.
• Denote $N^{(b)}_t \in \mathbb{Z}^+$ as the market-maker's Bid Shares at time t.
• Denote $P^{(a)}_t \in \mathbb{R}^+$ as the market-maker's Ask Price at time t.
• Denote $N^{(a)}_t \in \mathbb{Z}^+$ as the market-maker's Ask Shares at time t.
• We refer to $\delta^{(b)}_t = S_t - P^{(b)}_t$ as the market-maker's Bid Spread (relative to OB Mid).
• We refer to $\delta^{(a)}_t = P^{(a)}_t - S_t$ as the market-maker's Ask Spread (relative to OB Mid).
• We refer to $\delta^{(b)}_t + \delta^{(a)}_t = P^{(a)}_t - P^{(b)}_t$ as the market-maker's Bid-Ask Spread.
• Random variable $X^{(b)}_t \in \mathbb{Z}_{\geq 0}$ refers to the total number of market-maker's Bid Shares
  that have been transacted against (by MOs or by Sell LOs) up to time t ($X^{(b)}_t$ is often
  referred to as the cumulative "hits" up to time t, as in "the market-maker's buy offer
  has been hit").
• Random variable $X^{(a)}_t \in \mathbb{Z}_{\geq 0}$ refers to the total number of market-maker's Ask
  Shares that have been transacted against (by MOs or by Buy LOs) up to time t ($X^{(a)}_t$
  is often referred to as the cumulative "lifts" up to time t, as in "the market-maker's
  sell offer has been lifted").

With this notation in place, we can write the trading account balance equation for all
t = 0, 1, . . . , T − 1 as follows:
$$W_{t+1} = W_t + P^{(a)}_t \cdot (X^{(a)}_{t+1} - X^{(a)}_t) - P^{(b)}_t \cdot (X^{(b)}_{t+1} - X^{(b)}_t) \quad (1.10)$$

Note that since the inventory I0 at time 0 is equal to 0, the inventory It at time t is given
by the equation:

$$I_t = X^{(b)}_t - X^{(a)}_t$$
The market-maker’s goal is to maximize (for an appropriately shaped concave utility
function U (·)) the sum of the trading account value at time T and the value of the inventory
of shares held at time T , i.e., we maximize:

E[U (WT + IT · ST )]
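Equation (1.10) and the inventory identity above amount to simple bookkeeping, sketched below (our own illustration; the function and variable names are ours):

```python
# One time step of the market-maker's trading-account and inventory update:
# W_{t+1} = W_t + P^(a)_t * (ask shares lifted) - P^(b)_t * (bid shares hit)
# and I_t accumulates (hits - lifts), starting from I_0 = 0.
def update_account(w: float, i: int, bid_price: float, ask_price: float,
                   hits: int, lifts: int):
    """Buy `hits` shares at the bid, sell `lifts` shares at the ask."""
    w_next = w + ask_price * lifts - bid_price * hits
    i_next = i + hits - lifts
    return w_next, i_next
```

Note how being hit more than lifted builds up a long inventory (positive I), while the account value W falls by the cash paid out at the bid.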
As we alluded to earlier, this problem can be cast as a discrete-time finite-horizon Markov
Decision Process (with discount factor γ = 1). Following the usual notation for discrete-
time finite-horizon MDPs, the order of activity for the MDP at each time step t = 0, 1, . . . , T −
1 is as follows:

• Observe State $(S_t, W_t, I_t) \in \mathcal{S}_t$.
• Perform Action $(P^{(b)}_t, N^{(b)}_t, P^{(a)}_t, N^{(a)}_t) \in \mathcal{A}_t$.
• Random number of bid shares hit at time step t (this is equal to $X^{(b)}_{t+1} - X^{(b)}_t$).
• Random number of ask shares lifted at time step t (this is equal to $X^{(a)}_{t+1} - X^{(a)}_t$).
• Update of Wt to Wt+1.
• Update of It to It+1.
• Stochastic evolution of St to St+1.
• Receive Reward Rt+1, where

$$R_{t+1} := \begin{cases} 0 & \text{for } 1 \leq t + 1 \leq T - 1 \\ U(W_T + I_T \cdot S_T) & \text{for } t + 1 = T \end{cases}$$
The goal is to find an Optimal Policy $\pi^* = (\pi^*_0, \pi^*_1, \ldots, \pi^*_{T-1})$, where

$$\pi^*_t((S_t, W_t, I_t)) = (P^{(b)}_t, N^{(b)}_t, P^{(a)}_t, N^{(a)}_t)$$

that maximizes:

$$\mathbb{E}\Big[\sum_{t=1}^{T} R_t\Big] = \mathbb{E}[R_T] = \mathbb{E}[U(W_T + I_T \cdot S_T)]$$

1.3.1 Avellaneda-Stoikov Continuous-Time Formulation
A landmark paper by Avellaneda and Stoikov (Avellaneda and Stoikov 2008) formulated
this optimal market-making problem in its continuous-time version. Their formulation
is conducive to analytical tractability, and they came up with a simple, clean and intuitive
solution. In this subsection, we go over their formulation, and in the next subsection, we
show the derivation of their solution. We adapt our discrete-time notation above to their
continuous-time setting.
$[X^{(b)}_t | 0 \leq t < T]$ and $[X^{(a)}_t | 0 \leq t < T]$ are assumed to be continuous-time Poisson
processes, with the hit rate per unit of time and the lift rate per unit of time denoted as $\lambda^{(b)}_t$
and $\lambda^{(a)}_t$ respectively. Hence, we can write the following:

$$dX^{(b)}_t \sim Poisson(\lambda^{(b)}_t \cdot dt)$$
$$dX^{(a)}_t \sim Poisson(\lambda^{(a)}_t \cdot dt)$$
$$\lambda^{(b)}_t = f^{(b)}(\delta^{(b)}_t)$$
$$\lambda^{(a)}_t = f^{(a)}(\delta^{(a)}_t)$$

for decreasing functions $f^{(b)}(\cdot)$ and $f^{(a)}(\cdot)$.

$$dW_t = P^{(a)}_t \cdot dX^{(a)}_t - P^{(b)}_t \cdot dX^{(b)}_t$$
$$I_t = X^{(b)}_t - X^{(a)}_t \text{ (note: } I_0 = 0)$$

Since the infinitesimal Poisson random variables $dX^{(b)}_t$ (shares hit in the time interval from t
to t + dt) and $dX^{(a)}_t$ (shares lifted in the time interval from t to t + dt) are Bernoulli random
variables (shares hit/lifted within a time interval of duration dt will be 0 or 1), $N^{(b)}_t$ and $N^{(a)}_t$
(number of shares in the submitted LOs for the infinitesimal time interval from t to t + dt)
can be assumed to be 1.

This simplifies the Action at time t to be just the pair:

$$(\delta^{(b)}_t, \delta^{(a)}_t)$$

OB Mid Price Dynamics is assumed to be scaled brownian motion:

$$dS_t = \sigma \cdot dz_t$$

for some $\sigma \in \mathbb{R}^+$.

The Utility function is assumed to be $U(x) = -e^{-\gamma x}$, where γ > 0 is the risk-aversion pa-
rameter (this Utility function is essentially the CARA Utility function devoid of associated
constants).
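Before turning to the solution, it may help to see these Poisson hit/lift dynamics in simulation. The sketch below is our own illustration; the exponential intensity f(δ) = c · e^(−k·δ) is an assumed functional form (used in the Avellaneda-Stoikov paper), and the parameter values are arbitrary.

```python
# Thinned-Poisson simulation of hits and lifts over [0, T): in each small
# interval dt, a hit occurs with probability lam_b * dt and a lift with
# probability lam_a * dt, where lam = c * exp(-k * delta).
import math
import random

def simulate_hits_lifts(delta_b, delta_a, c, k, T, dt, seed=7):
    rng = random.Random(seed)
    lam_b = c * math.exp(-k * delta_b)   # hit rate per unit time
    lam_a = c * math.exp(-k * delta_a)   # lift rate per unit time
    hits = lifts = 0
    for _ in range(int(T / dt)):
        if rng.random() < lam_b * dt:
            hits += 1
        if rng.random() < lam_a * dt:
            lifts += 1
    return hits, lifts
```

A tight bid spread and a wide ask spread produce far more hits than lifts, which is exactly the kind of one-sided flow that builds up inventory.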

1.3.2 Solving the Avellaneda-Stoikov Formulation


The following solution is as presented in the Avellaneda-Stoikov paper. We can express
the Avellaneda-Stoikov continuous-time formulation as a Hamilton-Jacobi-Bellman (HJB)
formulation (note: for reference, the general HJB formulation is covered in Appendix ??).
We denote the Optimal Value function as V ∗ (t, St , Wt , It ). Note that unlike Section ??
in Chapter ?? where we denoted the Optimal Value Function as a time-indexed sequence

Vt∗ (·), here we make t an explicit functional argument of V ∗ and each of St , Wt , It also
as separate functional arguments of V ∗ (instead of the typical approach of making the
state, as a tuple, a single functional argument). This is because in the continuous-time
setting, we are interested in the time-differential of the Optimal Value Function and we
also want to represent the dependency of the Optimal Value Function on each of St , Wt , It
as explicit separate dependencies. Appendix ?? provides the derivation of the general HJB
formulation (Equation (??) in Appendix ??) - this general HJB Equation specializes here
to the following:

$$\max_{\delta^{(b)}_t, \delta^{(a)}_t} \mathbb{E}[dV^*(t, S_t, W_t, I_t)] = 0 \text{ for } t < T$$

$$V^*(T, S_T, W_T, I_T) = -e^{-\gamma \cdot (W_T + I_T \cdot S_T)}$$


An infinitesimal change dV* to V*(t, St, Wt, It) is comprised of 3 components:

• Due to pure movement in time (dependence of V* on t).
• Due to randomness in OB Mid-Price (dependence of V* on St).
• Due to randomness in hitting/lifting the market-maker's Bid/Ask (dependence of
  V* on $\lambda^{(b)}_t$ and $\lambda^{(a)}_t$). Note that the probability of being hit in the interval from t to
  t + dt is $\lambda^{(b)}_t \cdot dt$ and the probability of being lifted in the interval from t to t + dt is
  $\lambda^{(a)}_t \cdot dt$, upon which the trading account value Wt changes appropriately and the
  inventory It increments/decrements by 1.

With this, we can expand dV*(t, St, Wt, It) and rewrite the HJB Equation as:

$$\max_{\delta^{(b)}_t, \delta^{(a)}_t} \Big\{ \frac{\partial V^*}{\partial t} \cdot dt + \mathbb{E}\Big[\sigma \cdot \frac{\partial V^*}{\partial S_t} \cdot dz_t + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2} \cdot (dz_t)^2\Big]$$
$$+ \lambda^{(b)}_t \cdot dt \cdot V^*(t, S_t, W_t - S_t + \delta^{(b)}_t, I_t + 1)$$
$$+ \lambda^{(a)}_t \cdot dt \cdot V^*(t, S_t, W_t + S_t + \delta^{(a)}_t, I_t - 1)$$
$$+ (1 - \lambda^{(b)}_t \cdot dt - \lambda^{(a)}_t \cdot dt) \cdot V^*(t, S_t, W_t, I_t)$$
$$- V^*(t, S_t, W_t, I_t)\Big\} = 0$$

Next, we want to convert the HJB Equation to a Partial Differential Equation (PDE). We
can simplify the above HJB Equation with a few observations:

• $\mathbb{E}[dz_t] = 0$.
• $\mathbb{E}[(dz_t)^2] = dt$.
• Organize the terms involving $\lambda^{(b)}_t$ and $\lambda^{(a)}_t$ better with some algebra.
• Divide throughout by dt.

$$\max_{\delta^{(b)}_t, \delta^{(a)}_t} \Big\{ \frac{\partial V^*}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2}$$
$$+ \lambda^{(b)}_t \cdot (V^*(t, S_t, W_t - S_t + \delta^{(b)}_t, I_t + 1) - V^*(t, S_t, W_t, I_t))$$
$$+ \lambda^{(a)}_t \cdot (V^*(t, S_t, W_t + S_t + \delta^{(a)}_t, I_t - 1) - V^*(t, S_t, W_t, I_t))\Big\} = 0$$

Next, note that $\lambda^{(b)}_t = f^{(b)}(\delta^{(b)}_t)$ and $\lambda^{(a)}_t = f^{(a)}(\delta^{(a)}_t)$, and apply the max only to the
relevant terms:

$$\frac{\partial V^*}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2}$$
$$+ \max_{\delta^{(b)}_t}\{f^{(b)}(\delta^{(b)}_t) \cdot (V^*(t, S_t, W_t - S_t + \delta^{(b)}_t, I_t + 1) - V^*(t, S_t, W_t, I_t))\}$$
$$+ \max_{\delta^{(a)}_t}\{f^{(a)}(\delta^{(a)}_t) \cdot (V^*(t, S_t, W_t + S_t + \delta^{(a)}_t, I_t - 1) - V^*(t, S_t, W_t, I_t))\} = 0$$

This combines with the boundary condition:

$$V^*(T, S_T, W_T, I_T) = -e^{-\gamma \cdot (W_T + I_T \cdot S_T)}$$

Next, we make an educated guess for the functional form of V*(t, St, Wt, It):

$$V^*(t, S_t, W_t, I_t) = -e^{-\gamma \cdot (W_t + \theta(t, S_t, I_t))} \quad (1.11)$$

to reduce the problem to a Partial Differential Equation (PDE) in terms of θ(t, St, It).
Substituting this guessed functional form into the above PDE for V*(t, St, Wt, It) gives:

$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot \Big(\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot \big(\frac{\partial \theta}{\partial S_t}\big)^2\Big)$$
$$+ \max_{\delta^{(b)}_t}\Big\{\frac{f^{(b)}(\delta^{(b)}_t)}{\gamma} \cdot (1 - e^{-\gamma \cdot (\delta^{(b)}_t - S_t + \theta(t, S_t, I_t + 1) - \theta(t, S_t, I_t))})\Big\}$$
$$+ \max_{\delta^{(a)}_t}\Big\{\frac{f^{(a)}(\delta^{(a)}_t)}{\gamma} \cdot (1 - e^{-\gamma \cdot (\delta^{(a)}_t + S_t + \theta(t, S_t, I_t - 1) - \theta(t, S_t, I_t))})\Big\} = 0$$

The boundary condition is:

$$\theta(T, S_T, I_T) = I_T \cdot S_T$$
It turns out that θ(t, St, It + 1) − θ(t, St, It) and θ(t, St, It) − θ(t, St, It − 1) are equal to
financially meaningful quantities known as Indifference Bid and Ask Prices.

Indifference Bid Price $Q^{(b)}(t, S_t, I_t)$ is defined as follows:

$$V^*(t, S_t, W_t - Q^{(b)}(t, S_t, I_t), I_t + 1) = V^*(t, S_t, W_t, I_t) \quad (1.12)$$

$Q^{(b)}(t, S_t, I_t)$ is the price to buy a single share with a guarantee of immediate purchase that
results in the Optimum Expected Utility staying unchanged.

Likewise, Indifference Ask Price $Q^{(a)}(t, S_t, I_t)$ is defined as follows:

$$V^*(t, S_t, W_t + Q^{(a)}(t, S_t, I_t), I_t - 1) = V^*(t, S_t, W_t, I_t) \quad (1.13)$$

$Q^{(a)}(t, S_t, I_t)$ is the price to sell a single share with a guarantee of immediate sale that results
in the Optimum Expected Utility staying unchanged.

For convenience, we abbreviate $Q^{(b)}(t, S_t, I_t)$ as $Q^{(b)}_t$ and $Q^{(a)}(t, S_t, I_t)$ as $Q^{(a)}_t$. Next, we
express $V^*(t, S_t, W_t - Q^{(b)}_t, I_t + 1) = V^*(t, S_t, W_t, I_t)$ in terms of θ:

$$-e^{-\gamma \cdot (W_t - Q^{(b)}_t + \theta(t, S_t, I_t + 1))} = -e^{-\gamma \cdot (W_t + \theta(t, S_t, I_t))}$$

$$\Rightarrow Q^{(b)}_t = \theta(t, S_t, I_t + 1) - \theta(t, S_t, I_t) \quad (1.14)$$

Likewise for $Q^{(a)}_t$, we get:

$$Q^{(a)}_t = \theta(t, S_t, I_t) - \theta(t, S_t, I_t - 1) \quad (1.15)$$
Using Equations (1.14) and (1.15), we substitute Qt^(b) and Qt^(a) into the PDE for θ:

∂θ/∂t + (σ²/2)·(∂²θ/∂St² − γ·(∂θ/∂St)²) + max_{δt^(b)} g(δt^(b)) + max_{δt^(a)} h(δt^(a)) = 0

where g(δt^(b)) = (f^(b)(δt^(b))/γ) · (1 − e^{−γ·(δt^(b) − St + Qt^(b))})

and h(δt^(a)) = (f^(a)(δt^(a))/γ) · (1 − e^{−γ·(δt^(a) + St − Qt^(a))})
To maximize g(δt^(b)), differentiate g with respect to δt^(b) and set to 0:

e^{−γ·(δt^(b)∗ − St + Qt^(b))} · (γ·f^(b)(δt^(b)∗) − (∂f^(b)/∂δt^(b))(δt^(b)∗)) + (∂f^(b)/∂δt^(b))(δt^(b)∗) = 0

⇒ δt^(b)∗ = St − Pt^(b)∗ = St − Qt^(b) + (1/γ)·log(1 − γ · f^(b)(δt^(b)∗) / (∂f^(b)/∂δt^(b))(δt^(b)∗))     (1.16)

To maximize h(δt^(a)), differentiate h with respect to δt^(a) and set to 0:

e^{−γ·(δt^(a)∗ + St − Qt^(a))} · (γ·f^(a)(δt^(a)∗) − (∂f^(a)/∂δt^(a))(δt^(a)∗)) + (∂f^(a)/∂δt^(a))(δt^(a)∗) = 0

⇒ δt^(a)∗ = Pt^(a)∗ − St = Qt^(a) − St + (1/γ)·log(1 − γ · f^(a)(δt^(a)∗) / (∂f^(a)/∂δt^(a))(δt^(a)∗))     (1.17)

Equations (1.16) and (1.17) are implicit equations for δt^(b)∗ and δt^(a)∗ respectively.
Now let us write the PDE in terms of the Optimal Bid and Ask Spreads:

∂θ/∂t + (σ²/2)·(∂²θ/∂St² − γ·(∂θ/∂St)²)

+ (f^(b)(δt^(b)∗)/γ) · (1 − e^{−γ·(δt^(b)∗ − St + θ(t,St,It+1) − θ(t,St,It))})     (1.18)

+ (f^(a)(δt^(a)∗)/γ) · (1 − e^{−γ·(δt^(a)∗ + St + θ(t,St,It−1) − θ(t,St,It))}) = 0

with boundary condition: θ(T, ST, IT) = IT · ST

How do we go about solving this? Here are the steps:


• First we solve PDE (1.18) for θ in terms of δt^(b)∗ and δt^(a)∗. In general, this would be a
  numerical PDE solution.
• Using Equations (1.14) and (1.15), and the above-obtained θ in terms of δt^(b)∗ and δt^(a)∗,
  we get Qt^(b) and Qt^(a) in terms of δt^(b)∗ and δt^(a)∗.
• Then we substitute the above-obtained Qt^(b) and Qt^(a) (in terms of δt^(b)∗ and δt^(a)∗) into
  Equations (1.16) and (1.17).
• Finally, we solve the implicit equations for δt^(b)∗ and δt^(a)∗ (in general, numerically).
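As a small illustration of the final step, here is a sketch of solving an implicit equation of the form of Equation (1.16) for δt^(b)∗ by fixed-point iteration. The fill-rate function, the parameter values and the given indifference bid price Q_b are all hypothetical stand-ins (a power-law fill-rate is used here since the exponential fill-rate introduced later makes the equation closed-form):

```python
import math

# Hypothetical parameters (not from the text): risk-aversion, OB mid price,
# and an assumed indifference bid price Q_b.
gamma, S, Q_b = 0.5, 100.0, 99.8
c, k = 2.0, 10.0

def f(delta):
    # Hypothetical power-law fill rate for a bid posted delta below the mid
    return c * (1.0 + delta) ** (-k)

def f_prime(delta):
    # Its derivative (negative: deeper bids fill less often)
    return -k * c * (1.0 + delta) ** (-k - 1)

# Fixed-point iteration on the form of Equation (1.16):
# delta = S - Q_b + (1/gamma) * log(1 - gamma * f(delta) / f'(delta))
delta = 0.1
for _ in range(100):
    delta = S - Q_b + math.log(1.0 - gamma * f(delta) / f_prime(delta)) / gamma

# Residual of the implicit equation; ~0 if the iteration converged
residual = delta - (S - Q_b + math.log(1.0 - gamma * f(delta) / f_prime(delta)) / gamma)
print(delta, residual)
```

Since f/f′ = −(1+δ)/k here, the update map is a contraction for these parameter values and the iteration converges quickly; a bracketing root-finder such as `scipy.optimize.brentq` would be the more robust choice in general.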

This completes the (numerical) solution to the Avellaneda-Stoikov continuous-time
formulation for the Optimal Market-Making problem. Having been through all the heavy
equations above, let's now spend some time on building intuition.

Define the Indifference Mid Price Qt^(m) = (Qt^(b) + Qt^(a))/2. To develop intuition for Indifference
Prices, consider a simple case where the market-maker doesn't supply any bids or asks
after time t. This means the trading account value WT at time T must be the same as the
trading account value at time t, and the inventory IT at time T must be the same as the
inventory It at time t. This implies:

V∗(t, St, Wt, It) = E[−e^{−γ·(Wt + It·ST)}]

The process dSt = σ·dzt implies that ST ∼ N(St, σ²·(T − t)), and hence:

V∗(t, St, Wt, It) = −e^{−γ·(Wt + It·St − γ·It²·σ²·(T−t)/2)}

Hence,

V∗(t, St, Wt − Qt^(b), It + 1) = −e^{−γ·(Wt − Qt^(b) + (It+1)·St − γ·(It+1)²·σ²·(T−t)/2)}

But from Equation (1.12), we know that:

V∗(t, St, Wt, It) = V∗(t, St, Wt − Qt^(b), It + 1)

Therefore,

−e^{−γ·(Wt + It·St − γ·It²·σ²·(T−t)/2)} = −e^{−γ·(Wt − Qt^(b) + (It+1)·St − γ·(It+1)²·σ²·(T−t)/2)}

This implies:

Qt^(b) = St − (2It + 1) · γ·σ²·(T−t)/2

Likewise, we can derive:

Qt^(a) = St − (2It − 1) · γ·σ²·(T−t)/2

The formulas for the Indifference Mid Price and the Indifference Bid-Ask Price Spread are
as follows:

Qt^(m) = St − It·γ·σ²·(T−t)

Qt^(a) − Qt^(b) = γ·σ²·(T−t)

These results for the simple case of no-market-making-after-time-t serve as approximations
for our problem of optimal market-making. Think of Qt^(m) as a pseudo mid price for
the market-maker, an adjustment to the OB mid price St that takes into account the magnitude
and sign of It. If the market-maker is long inventory (It > 0), then Qt^(m) < St, which
makes intuitive sense since the market-maker is interested in reducing her risk of inventory
buildup and so, would be more inclined to sell than buy, leading her to show bid
and ask prices whose average is lower than the OB mid price St. Likewise, if the market-maker
is short inventory (It < 0), then Qt^(m) > St, indicating an inclination to buy rather than
sell.
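The no-market-making-after-time-t formulas above are easy to play with numerically. The sketch below (with illustrative parameter values, not from the text) computes the indifference prices for a few inventory levels, confirming that the indifference mid sits below St when long and above St when short, with a constant spread γ·σ²·(T−t):

```python
gamma, sigma = 0.1, 2.0   # illustrative risk-aversion and volatility
S_t, tau = 100.0, 0.25    # OB mid price and remaining horizon T - t

def indifference_prices(I):
    # Indifference bid/ask/mid for inventory I in the no-trading special case
    risk = gamma * sigma ** 2 * tau
    q_bid = S_t - (2 * I + 1) * risk / 2
    q_ask = S_t - (2 * I - 1) * risk / 2
    return q_bid, q_ask, (q_bid + q_ask) / 2

for I in (-2, 0, 2):
    q_bid, q_ask, q_mid = indifference_prices(I)
    print(I, q_bid, q_mid, q_ask)
```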
Armed with this intuition, we come back to optimal market-making, observing from
Equations (1.16) and (1.17):

Pt^(b)∗ < Qt^(b) < Qt^(m) < Qt^(a) < Pt^(a)∗

Visualize this ascending sequence of prices [Pt^(b)∗, Qt^(b), Qt^(m), Qt^(a), Pt^(a)∗] as jointly sliding
up/down (relative to OB mid price St) as a function of the inventory It's magnitude and
sign, and perceive Pt^(b)∗, Pt^(a)∗ in terms of their spreads to the pseudo mid price Qt^(m):

Qt^(m) − Pt^(b)∗ = (Qt^(a) − Qt^(b))/2 + (1/γ)·log(1 − γ · f^(b)(δt^(b)∗) / (∂f^(b)/∂δt^(b))(δt^(b)∗))

Pt^(a)∗ − Qt^(m) = (Qt^(a) − Qt^(b))/2 + (1/γ)·log(1 − γ · f^(a)(δt^(a)∗) / (∂f^(a)/∂δt^(a))(δt^(a)∗))

1.3.3 Analytical Approximation to the Solution of the Avellaneda-Stoikov Formulation
The PDE (1.18) we derived above for θ and the associated implicit Equations (1.16) and
(1.17) for δt^(b)∗, δt^(a)∗ are messy. So we make some assumptions, simplify, and derive
analytical approximations (as presented in the Avellaneda-Stoikov paper). We start by
assuming a fairly standard functional form for f^(b) and f^(a):

f^(b)(δ) = f^(a)(δ) = c · e^{−k·δ}


This reduces Equations (1.16) and (1.17) to:

δt^(b)∗ = St − Qt^(b) + (1/γ)·log(1 + γ/k)     (1.19)

δt^(a)∗ = Qt^(a) − St + (1/γ)·log(1 + γ/k)     (1.20)

which means Pt^(b)∗ and Pt^(a)∗ are equidistant from Qt^(m). Substituting these simplified
δt^(b)∗, δt^(a)∗ in Equation (1.18) reduces the PDE to:

∂θ/∂t + (σ²/2)·(∂²θ/∂St² − γ·(∂θ/∂St)²) + (c/(k+γ))·(e^{−k·δt^(b)∗} + e^{−k·δt^(a)∗}) = 0     (1.21)

with boundary condition θ(T, ST, IT) = IT · ST
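The reduction of the implicit Equations (1.16) and (1.17) to the explicit (1.19) and (1.20) follows because, with f(δ) = c·e^{−k·δ}, the ratio f/f′ is the constant −1/k, so the log term no longer depends on δ. A quick numeric check of this (with illustrative parameter values):

```python
import math

gamma, c, k = 0.3, 1.5, 2.0  # illustrative parameters

def f(delta):
    return c * math.exp(-k * delta)

def f_prime(delta):
    return -k * c * math.exp(-k * delta)

# The log term of Equation (1.16), which (1.19) claims is constant in delta
closed_form = math.log(1.0 + gamma / k) / gamma
for delta in (0.01, 0.5, 3.0):
    implicit_term = math.log(1.0 - gamma * f(delta) / f_prime(delta)) / gamma
    assert abs(implicit_term - closed_form) < 1e-12
print(closed_form)
```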
Note that this PDE (1.21) involves δt^(b)∗ and δt^(a)∗. However, Equations (1.19), (1.20),
(1.14) and (1.15) enable expressing δt^(b)∗ and δt^(a)∗ in terms of θ(t, St, It−1), θ(t, St, It) and
θ(t, St, It+1). This gives us a PDE just in terms of θ. Solving that PDE for θ would give
us not only V∗(t, St, Wt, It) but also δt^(b)∗ and δt^(a)∗ (using Equations (1.19), (1.20), (1.14)
and (1.15)). To solve the PDE, we need to make a couple of approximations.

First we make a linear approximation for e^{−k·δt^(b)∗} and e^{−k·δt^(a)∗} in PDE (1.21) as follows:

∂θ/∂t + (σ²/2)·(∂²θ/∂St² − γ·(∂θ/∂St)²) + (c/(k+γ))·(1 − k·δt^(b)∗ + 1 − k·δt^(a)∗) = 0     (1.22)

Combining the Equations (1.19), (1.20), (1.14) and (1.15) gives us:

δt^(b)∗ + δt^(a)∗ = (2/γ)·log(1 + γ/k) + 2θ(t, St, It) − θ(t, St, It+1) − θ(t, St, It−1)

With this expression for δt^(b)∗ + δt^(a)∗, PDE (1.22) takes the form:

∂θ/∂t + (σ²/2)·(∂²θ/∂St² − γ·(∂θ/∂St)²) + (c/(k+γ))·(2 − (2k/γ)·log(1 + γ/k)     (1.23)
− k·(2θ(t, St, It) − θ(t, St, It+1) − θ(t, St, It−1))) = 0
To solve PDE (1.23), we consider the following asymptotic expansion of θ in It:

θ(t, St, It) = Σ_{n=0}^{∞} (It^n / n!) · θ^(n)(t, St)

So we need to determine the functions θ^(n)(t, St) for all n = 0, 1, 2, . . .
For tractability, we approximate this expansion to the first 3 terms:

θ(t, St, It) ≈ θ^(0)(t, St) + It · θ^(1)(t, St) + (It²/2) · θ^(2)(t, St)
We note that the Optimal Value Function V∗ can depend on St only through the current
Value of the Inventory (i.e., through It·St), i.e., it cannot depend on St in any other way. This
means V∗(t, St, Wt, 0) = −e^{−γ·(Wt + θ^(0)(t,St))} is independent of St. This means θ^(0)(t, St) is
independent of St. So, we can write it as simply θ^(0)(t), meaning ∂θ^(0)/∂St and ∂²θ^(0)/∂St²
are equal to 0. Therefore, we can write the approximate expansion for θ(t, St, It) as:

θ(t, St, It) = θ^(0)(t) + It · θ^(1)(t, St) + (It²/2) · θ^(2)(t, St)     (1.24)
Substituting this approximation Equation (1.24) for θ(t, St, It) in PDE (1.23), we get:

∂θ^(0)/∂t + It·∂θ^(1)/∂t + (It²/2)·∂θ^(2)/∂t + (σ²/2)·(It·∂²θ^(1)/∂St² + (It²/2)·∂²θ^(2)/∂St²)

− (γσ²/2)·(It·∂θ^(1)/∂St + (It²/2)·∂θ^(2)/∂St)² + (c/(k+γ))·(2 − (2k/γ)·log(1 + γ/k) + k·θ^(2)) = 0     (1.25)

with boundary condition:

θ^(0)(T) + IT · θ^(1)(T, ST) + (IT²/2) · θ^(2)(T, ST) = IT · ST

Now we separately collect terms involving specific powers of It, each yielding a separate
PDE:
• Terms devoid of It (i.e., with It^0)
• Terms involving It (i.e., with It^1)
• Terms involving It²

We start by collecting terms involving It:

∂θ^(1)/∂t + (σ²/2)·∂²θ^(1)/∂St² = 0 with boundary condition θ^(1)(T, ST) = ST

The solution to this PDE is:

θ^(1)(t, St) = St     (1.26)


Next, we collect terms involving It²:

∂θ^(2)/∂t + (σ²/2)·∂²θ^(2)/∂St² − γ·σ²·(∂θ^(1)/∂St)² = 0 with boundary condition θ^(2)(T, ST) = 0

Noting that θ^(1)(t, St) = St, we solve this PDE as:

θ^(2)(t, St) = −γ·σ²·(T − t)     (1.27)


Finally, we collect the terms devoid of It:

∂θ^(0)/∂t + (c/(k+γ))·(2 − (2k/γ)·log(1 + γ/k) + k·θ^(2)) = 0 with boundary condition θ^(0)(T) = 0

Noting that θ^(2)(t, St) = −γ·σ²·(T − t), we solve as:

θ^(0)(t) = (c/(k+γ))·((2 − (2k/γ)·log(1 + γ/k))·(T − t) − (k·γ·σ²/2)·(T − t)²)     (1.28)
This completes the PDE solution for θ(t, St, It) and hence, for V∗(t, St, Wt, It). Lastly, we
derive formulas for Qt^(b), Qt^(a), Qt^(m), δt^(b)∗, δt^(a)∗.
Using Equations (1.14) and (1.15), we get:

Qt^(b) = θ^(1)(t, St) + ((2It + 1)/2) · θ^(2)(t, St) = St − (2It + 1) · γ·σ²·(T−t)/2     (1.29)

Qt^(a) = θ^(1)(t, St) + ((2It − 1)/2) · θ^(2)(t, St) = St − (2It − 1) · γ·σ²·(T−t)/2     (1.30)

Using Equations (1.19) and (1.20), we get:

δt^(b)∗ = (2It + 1)·γ·σ²·(T−t)/2 + (1/γ)·log(1 + γ/k)     (1.31)

δt^(a)∗ = (1 − 2It)·γ·σ²·(T−t)/2 + (1/γ)·log(1 + γ/k)     (1.32)
Optimal Bid-Ask Spread: δt^(b)∗ + δt^(a)∗ = γ·σ²·(T−t) + (2/γ)·log(1 + γ/k)     (1.33)

Optimal Pseudo-Mid: Qt^(m) = (Qt^(b) + Qt^(a))/2 = (Pt^(b)∗ + Pt^(a)∗)/2 = St − It·γ·σ²·(T−t)     (1.34)
Now let's get back to developing intuition. Think of Qt^(m) as an inventory-risk-adjusted
mid-price (an adjustment to St). If the market-maker is long inventory (It > 0), Qt^(m) < St,
indicating an inclination to sell rather than buy, and if the market-maker is short inventory,
Qt^(m) > St, indicating an inclination to buy rather than sell. Think of the interval
[Pt^(b)∗, Pt^(a)∗] as being around the pseudo mid-price Qt^(m) (rather than thinking of it as
being around the OB mid-price St). The interval [Pt^(b)∗, Pt^(a)∗] moves up/down in tandem
with Qt^(m) moving up/down (as a function of inventory It). Note from Equation (1.33) that
the Optimal Bid-Ask Spread Pt^(a)∗ − Pt^(b)∗ is independent of inventory It.
A useful view is:

Pt^(b)∗ < Qt^(b) < Qt^(m) < Qt^(a) < Pt^(a)∗

with the spreads as follows:

Outer Spreads: Pt^(a)∗ − Qt^(a) = Qt^(b) − Pt^(b)∗ = (1/γ)·log(1 + γ/k)

Inner Spreads: Qt^(a) − Qt^(m) = Qt^(m) − Qt^(b) = γ·σ²·(T−t)/2
This completes the analytical approximation to the solution of the Avellaneda-Stoikov
continuous-time formulation of the Optimal Market-Making problem.
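To make the closed-form result concrete, here is a small sketch (with illustrative parameter values for γ, σ and k, which are assumptions here) that computes the optimal quotes from Equations (1.31)-(1.34) and checks the two qualitative properties just discussed: the bid-ask spread is independent of inventory, and the quote mid-point shifts down as inventory grows:

```python
import math

gamma, sigma, k = 0.1, 2.0, 1.5  # illustrative parameters

def optimal_quotes(S_t, I_t, tau):
    """Optimal bid/ask prices for inventory I_t with tau = T - t remaining."""
    risk = gamma * sigma ** 2 * tau
    log_term = math.log(1.0 + gamma / k) / gamma
    delta_b = (2 * I_t + 1) * risk / 2 + log_term   # Equation (1.31)
    delta_a = (1 - 2 * I_t) * risk / 2 + log_term   # Equation (1.32)
    return S_t - delta_b, S_t + delta_a

bid0, ask0 = optimal_quotes(100.0, 0, 0.5)
bid5, ask5 = optimal_quotes(100.0, 5, 0.5)

# Spread independent of inventory (Equation (1.33))
assert abs((ask0 - bid0) - (ask5 - bid5)) < 1e-12
# Pseudo-mid shifts down when long (Equation (1.34))
assert (bid5 + ask5) / 2 < (bid0 + ask0) / 2
print(bid0, ask0, bid5, ask5)
```

Note that for large enough |It| the model can quote an ask below (or a bid above) the OB mid price; the closed-form solution ignores the real-world constraint that quotes stay on the proper side of the book.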

1.3.4 Real-World Market-Making


Note that while the Avellaneda-Stoikov continuous-time formulation and solution is ele-
gant and intuitive, it is far from a real-world model. Real-world OB dynamics are time-
heterogeneous, non-linear and far more complex. Furthermore, there are all kinds of real-
world frictions we need to capture, such as discrete time, discrete prices/number of shares
in a bid/ask submitted by the market-maker, various constraints on prices and number of
shares in the bid/ask, and fees to be paid by the market-maker. Moreover, we need to cap-
ture various market factors in the State and in the OB Dynamics. This invariably leads to
the Curse of Dimensionality and the Curse of Modeling. This takes us down a path with
which we've now become all too familiar: Reinforcement Learning algorithms. This means we
need a simulator that captures all of the above factors, features and frictions. Such a simu-
lator is basically a Market-Data-learnt Sampling Model of OB Dynamics. We won’t be cover-
ing the details of how to build such a simulator as that is outside the scope of this book (a
topic under the umbrella of supervised learning of market patterns and behaviors). Using
this simulator and neural-networks-based function approximation of the Value Function
(and/or of the Policy function), we can leverage the power of RL algorithms (to be cov-
ered in the following chapters) to solve the problem of optimal market-making in practice.
There are a number of papers written on how to build practical and useful market simula-
tors and using Reinforcement Learning for Optimal Market-Making. We refer you to two
such papers here:

• A paper from University of Liverpool (Spooner et al. 2018)
• A paper from J.P. Morgan Research (Ganesh et al. 2019)

This topic of development of models for OB Dynamics and RL algorithms for practical
market-making is an exciting area for future research as well as engineering design. We
hope this section has provided sufficient foundations for you to dig into this topic further.

1.4 Key Takeaways from this Chapter


• Foundations of Order Book, Limit Orders, Market Orders, Price Impact of large Mar-
ket Orders, and complexity of Order Book Dynamics.
• Casting Order Book trading problems such as Optimal Order Execution and Optimal
Market-Making as Markov Decision Processes; developing intuition by deriving
closed-form solutions under highly simplified assumptions (e.g., the Bertsimas-Lo and
Avellaneda-Stoikov formulations); developing a deeper understanding by implementing
a backward-induction ADP algorithm; and then developing RL algorithms (and an
associated market simulator) to solve these problems in a real-world setting,
overcoming the Curse of Dimensionality and the Curse of Modeling.

Bibliography

Almgren, Robert, and Neil Chriss. 2000. “Optimal Execution of Portfolio Transactions.”
Journal of Risk, 5–39.
Avellaneda, Marco, and Sasha Stoikov. 2008. “High-Frequency Trading in a Limit Or-
der Book.” Quantitative Finance 8 (3): 217–24. http://www.informaworld.com/10.1080/
14697680701381228.
Bertsimas, Dimitris, and Andrew W. Lo. 1998. “Optimal Control of Execution Costs.”
Journal of Financial Markets 1 (1): 1–50.
Ganesh, Sumitra, Nelson Vadori, Mengda Xu, Hua Zheng, Prashant P. Reddy, and Manuela
Veloso. 2019. “Reinforcement Learning for Market Making in a Multi-Agent Dealer
Market.” CoRR abs/1911.05892. http://dblp.uni-trier.de/db/journals/corr/corr1911.
html#abs-1911-05892.
Gueant, Olivier. 2016. The Financial Mathematics of Market Liquidity: From Optimal Execution
to Market Making. Chapman; Hall/CRC Financial Mathematics Series.
Nevmyvaka, Yuriy, Yi Feng, and Michael J. Kearns. 2006. “Reinforcement Learning for
Optimized Trade Execution.” In ICML, edited by William W. Cohen and Andrew W.
Moore, 148:673–80. ACM International Conference Proceeding Series. ACM. http:
//dblp.uni-trier.de/db/conf/icml/icml2006.html#NevmyvakaFK06.
Spooner, Thomas, John Fearnley, Rahul Savani, and Andreas Koukorinis. 2018. “Market
Making via Reinforcement Learning.” CoRR abs/1804.04216. http://dblp.uni-trier.
de/db/journals/corr/corr1804.html#abs-1804-04216.
Vyetrenko, Svitlana, and Shaojie Xu. 2019. “Risk-Sensitive Compact Decision Trees for Au-
tonomous Execution in Presence of Simulated Market Response.” CoRR abs/1906.02312.
http://dblp.uni-trier.de/db/journals/corr/corr1906.html#abs-1906-02312.
