Reinforcement Learning in Finance
Reinforcement Learning in Finance
Applications in Finance
3
Figure 1.1: Trading Order Book
(Image Credit: https://nms.kcl.ac.uk/rll/
enrique-miranda/index.html)
presented for trading in the form of this aggregated view. Thus, the OB data structure can
be represented as two sorted lists of (Price, Size) pairs:
(b)
• We refer to P0 as The Best Bid Price (lightened to Best Bid ) to signify that it is the
highest offer to buy and hence, the best price for a seller to transact with.
(a)
• Likewise, we refer to P0 as The Ask Price (lightened to Best Ask) to signify that it is
the lowest offer to sell and hence, the best price for a buyer to transact with.
(a) (b)
P0 +P0
• 2 is refered to as the The Mid Price (lightened to Mid).
(a) (b)
• P0 − P0 is refered to as The Best Bid-Ask Spread (lightened to Spread).
(a) (b)
• Pn−1 − Pm−1 is refered to as The Market Depth (lightened to Depth).
Although an actual real-world trading order book has many other details, we believe this
simplified coverage is adequate for the purposes of core understanding of order book trad-
ing and to navigate the problems of optimal order execution and optimal market-making.
Apart from Limit Orders, traders can express their interest to buy/sell with another type
of order - a Market Order (abbreviated as MO). A Market Order (MO) states one’s intent
to buy/sell N shares at the best possible price(s) available on the OB at the time of MO sub-
mission. So, an LO is keen on price and not so keen on time (willing to wait to get the price
one wants) while an MO is keen on time (desire to trade right away) and not so keen on
4
price (will take whatever the best LO price is on the OB). So now let us understand the
actual transactions that occur between LOs and MOs (buy and sell interactions, and how
the OB changes as a result of these interactions). Firstly, we note that in normal trading
activity, a newly submitted sell LO’s price is typically above the price of the best buy LO
on the OB. But if a new sell LO’s price is less than or equal to the price of the best buy
LO’s price, we say that the market has crossed (to mean that the range of bid prices and the
range of ask prices have intersected), which results in an immediate transaction that eats
into the OB’s Buy LOs.
Precisely, a new Sell LO (P, N ) potentially transacts with (and hence, removes) the best
Buy LOs on the OB.
(b) (b)
∑
i−1
(b) (b)
Removal: [(Pi , min(Ni , max(0, N − Nj ))) | (i : Pi ≥ P )] (1.1)
j=0
After this removal, it potentially adds the following LO to the asks side of the OB:
∑ (b)
(P, max(0, N − Ni )) (1.2)
(b)
i:Pi ≥P
Likewise, a new Buy MO (P, N ) potentially transacts with (and hence, removes) the
best Sell LOs on the OB
(a) (a)
∑
i−1
(a) (a)
Removal: [(Pi , min(Ni , max(0, N − Nj ))) | (i : Pi ≤ P )] (1.3)
j=0
After this removal, it potentially adds the following to the bids side of the OB:
∑ (a)
(P, max(0, N − Ni )) (1.4)
(a)
i:Pi ≤P
When a Market Order (MO) is submitted, things are simpler. A Sell Market Order of N
shares will remove the best Buy LOs on the OB.
(b) (b)
∑
i−1
(b)
Removal: [(Pi , min(Ni , max(0, N − Nj ))) | 0 ≤ i < m] (1.5)
j=0
∑
m−1
(b) (b)
∑
i−1
(b)
Pi · (min(Ni , max(0, N − Nj ))) (1.6)
i=0 j=0
We note that if N is large, the sales proceeds for this MO can be significantly lower than
(b) (b)
the best possible sales proceeds (= N · P0 ), which happens only if N ≤ N0 . Note also
(b)
that if N is large, the new Best Bid Price (new value of P0 ) can be significantly lower than
the Best Bid Price before the MO was submitted (because the MO “eats into” a significant
volume of Buy LOs on the OB). This “eating into” the Buy LOs on the OB and consequent
lowering of the Best Bid Price (and hence, Mid Price) is known as Price Impact of an MO
(more specifically, as the Temporary Price Impact of an MO). We use the word “temporary”
because subsequent to this “eating into” the Buy LOs of the OB (and consequent, “hole,”
ie., large Bid-Ask Spread), market participants will submit “replenishment LOs” (both
5
Buy LOs and Sell LOs) on the OB. These replenishments LOs would typically mitigate the
Bid-Ask Spread and the eventual settlement of the Best Bid/Best Ask/Mid Prices consti-
tutes what we call Permanent Price Impact - which refers to the changes in OB Best Bid/Best
Ask/Mid prices relative to the corresponding prices before submission of the MO.
Likewise, a Buy Market Order of N shares will remove the best Sell LOs on the OB
(a) (a)
∑
i−1
(a)
Removal: [(Pi , min(Ni , max(0, N − Nj ))) | 0 ≤ i < n] (1.7)
j=0
∑
n−1
(a) (a)
∑
i−1
(a)
Pi · (min(Ni , max(0, N − Nj ))) (1.8)
i=0 j=0
If N is large, the purchase bill for this MO can be significantly higher than the best
(a) (a)
possible purchase bill (= N · P0 ), which happens only if N ≤ N0 . All that we wrote
above in terms of Temporary and Permanent Price Impact naturally apply in the opposite
direction for a Buy MO.
We refer to all of the above-described OB movements, including both temporary and
permanent Price Impacts broadly as Order Book Dynamics. There is considerable literature
on modeling Order Book Dynamics and some of these models can get fairly complex in
order to capture various real-world nuances. Much of this literature is beyond the scope
of this book. In this chapter, we will cover a few simple models for how a sell MO will
move the OB’s Best Bid Price (rather than a model for how it will move the entire OB). The
model for how a buy MO will move the OB’s Best Ask Price is naturally identical.
Now let’s write some code that models how LOs and MOs interact with the OB. We
write a class OrderBook that represents the Buy and Sell Limit Orders on the Order Book,
which are each represented as a sorted sequence of the type DollarsAndShares, which is a
dataclass we created to represent any pair of a dollar amount (dollar: float) and num-
ber of shares (shares: int). Sometimes, we use DollarsAndShares to represent an LO (pair
of price and shares) as in the case of the sorted lists of Buy and Sell LOs. At other times, we
use DollarsAndShares to represent the pair of total dollars transacted and total shares trans-
acted when an MO is executed on the OB. The OrderBook maintains a price-descending se-
quence of PriceSizePairs for Buy LOs (descending_bids) and a price-ascending sequence
of PriceSizePairs for Sell LOs (ascending_asks). We write the basic methods to get the
OrderBook’s highest bid price (method bid_price), lowest ask price (method ask_price),
mid price (method mid_price), spread between the highest bid price and lowest ask price
(method bid_ask_spread), and market depth (method market_depth).
@dataclass(frozen=True)
class DollarsAndShares:
dollars: float
shares: int
PriceSizePairs = Sequence[DollarsAndShares]
@dataclass(frozen=True)
class OrderBook:
descending_bids: PriceSizePairs
ascending_asks: PriceSizePairs
def bid_price(self) -> float:
return self.descending_bids[0].dollars
6
def ask_price(self) -> float:
return self.ascending_asks[0].dollars
def mid_price(self) -> float:
return (self.bid_price() + self.ask_price()) / 2
def bid_ask_spread(self) -> float:
return self.ask_price() - self.bid_price()
def market_depth(self) -> float:
return self.ascending_asks[-1].dollars - \
self.descending_bids[-1].dollars
Next we want to write methods for LOs and MOs to interact with the OrderBook. Notice
that each of Equation (1.1) (new Sell LO potentially removing some of the beginning of
the Buy LOs on the OB), Equation (1.3) (new Buy LO potentially removing some of the
beginning of the Sell LOs on the OB), Equation (1.5) (Sell MO removing some of the
beginning of the Buy LOs on the OB) and Equation (1.7) (Buy MO removing some of the
beginning of the Sell LOs on the OB) all perform a common core function - they “eat into”
the most significant LOs (on the opposite side) on the OB. So we first write a @staticmethod
eat_book for this common function.
eat_book takes as input a ps_pairs: PriceSizePairs (representing one side of the OB)
and the number of shares: int to buy/sell. Notice eat_book’s return type: Tuple[DollarsAndShares,
PriceSizePairs]. The returned DollarsAndShares represents the pair of dollars transacted
and the number of shares transacted (with number of shares transacted being less than
or equal to the input shares). The returned PriceSizePairs represents the remainder
of ps_pairs after the transacted number of shares have eaten into the input ps_pairs.
eat_book first deletes (i.e. “eats up”) as much of the beginning of the ps_pairs: PriceSizePairs
data structure as it can (basically matching the input number of shares with an appropri-
ate number of shares at the beginning of the ps_pairs: PriceSizePairs input). Note that
the returned PriceSizePairs is a separate data structure, ensuring the immutability of the
input ps_pairs: PriceSizePairs.
@staticmethod
def eat_book(
ps_pairs: PriceSizePairs,
shares: int
) -> Tuple[DollarsAndShares, PriceSizePairs]:
rem_shares: int = shares
dollars: float = 0.
for i, d_s in enumerate(ps_pairs):
this_price: float = d_s.dollars
this_shares: int = d_s.shares
dollars += this_price * min(rem_shares, this_shares)
if rem_shares < this_shares:
return (
DollarsAndShares(dollars=dollars, shares=shares),
[DollarsAndShares(
dollars=this_price,
shares=this_shares - rem_shares
)] + list(ps_pairs[i+1:])
)
else:
rem_shares -= this_shares
return (
DollarsAndShares(dollars=dollars, shares=shares - rem_shares),
[]
)
Now we are ready to write the method sell_limit_order which takes Sell LO Price and
7
Sell LO shares as input. As you can see in the code below, first it potentially removes
(if it “crosses”) an appropriate number of shares on the Buy LOs side of the OB (using
the @staticmethod eat_book), and then potentially adds an appropriate number of shares
at the Sell LO Price on the Sell LOs side of the OB. sell_limit_order returns a pair of
DollarsAndShares type and OrderBook type. The returned DollarsAndShares represents the
pair of dollars transacted and the number of shares transacted with the Buy LOs side of
the OB (with number of shares transacted being less than or equal to the input shares).
The returned OrderBook represents the new OB after potentially eating into the Buy LOs
side of the OB and then potentially adding some shares at the Sell LO Price on the Sell
LOs side of the OB. Note that the returned OrderBook is a newly-created data structure,
ensuring the immutability of self. We urge you to read the code below carefully as there
are many subtle details that are handled in the code.
from dataclasses import replace
def sell_limit_order(self, price: float, shares: int) -> \
Tuple[DollarsAndShares, OrderBook]:
index: Optional[int] = next((i for i, d_s
in enumerate(self.descending_bids)
if d_s.dollars < price), None)
eligible_bids: PriceSizePairs = self.descending_bids \
if index is None else self.descending_bids[:index]
ineligible_bids: PriceSizePairs = [] if index is None else \
self.descending_bids[index:]
d_s, rem_bids = OrderBook.eat_book(eligible_bids, shares)
new_bids: PriceSizePairs = list(rem_bids) + list(ineligible_bids)
rem_shares: int = shares - d_s.shares
if rem_shares > 0:
new_asks: List[DollarsAndShares] = list(self.ascending_asks)
index1: Optional[int] = next((i for i, d_s
in enumerate(new_asks)
if d_s.dollars >= price), None)
if index1 is None:
new_asks.append(DollarsAndShares(
dollars=price,
shares=rem_shares
))
elif new_asks[index1].dollars != price:
new_asks.insert(index1, DollarsAndShares(
dollars=price,
shares=rem_shares
))
else:
new_asks[index1] = DollarsAndShares(
dollars=price,
shares=new_asks[index1].shares + rem_shares
)
return d_s, OrderBook(
ascending_asks=new_asks,
descending_bids=new_bids
)
else:
return d_s, replace(
self,
descending_bids=new_bids
)
Next, we write the easier method sell_market_order which takes as input the number
of shares to be sold (as a market order). sell_market_order transacts with the appropri-
ate number of shares on the Buy LOs side of the OB (removing those many shares from
8
the Buy LOs side). It returns a pair of DollarsAndShares type and OrderBook type. The
returned DollarsAndShares represents the pair of dollars transacted and the number of
shares transacted (with number of shares transacted being less than or equal to the in-
put shares). The returned OrderBook represents the remainder of the OB after the trans-
acted number of shares have eaten into the Buy LOs side of the OB. Note that the returned
OrderBook is a newly-created data structure, ensuring the immutability of self.
def sell_market_order(
self,
shares: int
) -> Tuple[DollarsAndShares, OrderBook]:
d_s, rem_bids = OrderBook.eat_book(
self.descending_bids,
shares
)
return (d_s, replace(self, descending_bids=rem_bids))
We won’t list the methods buy_limit_order and buy_market_order here as they are com-
pletely analogous (you can find the entire code for OrderBook in the file rl/chapter9/order_book.py).
Now let us test out this code by creating a sample OrderBook and submitting some LOs and
MOs to interact with the OrderBook.
The above code creates an OrderBook in the price range [91, 114] with a bid-ask spread
of 5. Figure 1.2 depicts this OrderBook visually.
Let’s submit a Sell LO that says we’d like to sell 40 shares as long as the transacted price
is greater than or equal to 107. Our Sell LO should simply get added to the Sell LOs side
of the OB.
The new OrderBook ob1 has 40 more shares at the price level of 107, as depicted in Figure
1.3.
Now let’s submit a Sell MO that says we’d like to sell 120 shares at the “best price.” Our
Sell MO should transact with 120 shares at “best prices” of 100 and 99 as well (since the
OB does not have enough Buy LO shares at the price of 100).
The new OrderBook ob2 has 120 less shares on the Buy LOs side of the OB, as depicted
in Figure 1.4.
Now let’s submit a Buy LO that says we’d like to buy 80 shares as long as the transacted
price is less than or equal to 100. Our Buy LO should get added to the Buy LOs side of the
OB.
9
Figure 1.2: Starting Order Book
10
Figure 1.4: Order Book after Sell MO
11
Figure 1.6: Order Book after 2nd Sell LO
The new OrderBook ob3 has re-introduced a Buy LO at the price level of 100 (now with
80 shares), as depicted in Figure 1.5.
Now let’s submit a Sell LO that says we’d like to sell 60 shares as long as the transacted
price is greater than or equal to 104. Our Sell LO should get added to the Sell LOs side of
the OB.
d_s4, ob4 = ob3.sell_limit_order(104, 60)
The new OrderBook ob4 has introduced a Sell LO at a price of 104 with 60 shares, as
depicted in Figure 1.6.
Now let’s submit a Buy MO that says we’d like to buy 150 shares at the “best price.” Our
Buy MO should transact with 150 shares at “best prices” on the Sell LOs side of the OB.
d_s5, ob5 = ob4.buy_market_order(150)
The new OrderBook ob5 has 150 less shares on the Sell LOs side of the OB, wiping out all
the shares at the price level of 104 and almost wiping out all the shares at the price level
of 105, as depicted in Figure 1.7.
This has served as a good test of our code (transactions working as we’d like) and
we encourage you to write more code of this sort to interact with the OrderBook, and
to produce graphs of evolution of the OrderBook as this will help develop stronger intu-
ition and internalize the concepts we’ve learnt above. All of the above code is in the file
rl/chapter9/order_book.py.
Now we are ready to get started with the problem of Optimal Execution of a large-sized
Market Order.
12
Figure 1.7: Order Book after Buy MO
new investment objectives. You have to sell all of the N shares you own in this stock in
the next T hours, but you have been instructed to accomplish the sale by submitting only
Market Orders (not allowed to submit any Limit Orders because of the uncertainty in the
time of execution of the sale with a Limit Order). You can submit sell market orders (of
any size) at the start of each hour - so you have T opportunities to submit market orders
of any size. Your goal is to maximize the Expected Total Utility of sales proceeds for all N
shares over the T hours. Your task is to break up N into T appropriate chunks to maximize
the Expected Total Utility objective. If you attempt to sell the N shares too fast (i.e., too
many in the first few hours), as we’ve learnt above, each (MO) sale will eat a lot into the
Buy LOs on the OB (Temporary Price Impact) which would result in transacting at prices
below the best price (Best Bid Price). Moreover, you risk moving the Best Bid Price on the
OB significantly lower (Permanent Price Impact) that would affect the sales proceeds for
the next few sales you’d make. On the other hand, if you sell the N shares too slow (i.e., too
few in the first few hours), you might transact at good prices but then you risk running
out of time, which means you will have to dump a lot of shares with time running out
which in turn would mean transacting at prices below the best price. Moreover, selling
too slow exposes you to more uncertainty in market price movements over a longer time
period, and more uncertainty in sales proceeds means the Expected Utility objective gets
hurt. Thus, the precise timing and sizes in the breakup of shares is vital. You will need
to have an estimate of the Temporary and Permanent Price Impact of your Market Orders,
which can help you identify the appropriate number of shares to sell at the start of each
hour.
Unsurprisingly, we can model this problem as a Market Decision Process control prob-
lem where the actions at each time step (each hour, in this case) are the number of shares
sold at the time step and the rewards are the Utility of sales proceeds at each time step. To
keep things simple and intuitive, we shall model Price Impact of Market Orders in terms of
their effect on the Best Bid Price (rather than in terms of their effect on the entire OB). In
other words, we won’t be modeling the entire OB Price Dynamics, just the Best Bid Price
13
Dynamics. We shall refer to the OB activity of an MO immediately “eating into the Buy
LOs” (and hence, potentially transacting at prices lower than the best price) as the Tempo-
rary Price Impact. As mentioned earlier, this is followed by subsequent replenishment of
both Buy and Sell LOs on the OB (stabilizing the OB) - we refer to any eventual (end of
the hour) lowering of the Best Bid Price (relative to the Best Bid Price before the MO was
submitted) as the Permanent Price Impact. Modeling the temporary and permanent Price
Impacts separately helps us in deciding on the optimal actions (optimal shares to be sold
at the start of each hour).
Now we develop some formalism to describe this problem precisely. As mentioned
earlier, we make a number of simplifying assumptions in modeling the OB Dynamics for
ease of articulation (without diluting the most important concepts). We index discrete
time by t = 0, 1, . . . , T . We denote Pt as the Best Bid Price on the OB at the start of time
step t (for all t = 0, 1, . . . , T ) and Nt as the number of shares sold at time step t for all
t = 0, 1, . . . , T − 1. We denote the number of shares remaining to be sold at the start of
time step t as Rt for all t = 0, 1, . . . , T . Therefore,
∑
t−1
Rt = N − Ni for all t = 0, 1, . . . , T
i=0
Note that:
R0 = N
Rt+1 = Rt − Nt for all t = 0, 1, . . . , T − 1
Also note that we need to sell everything by time t = T and so:
NT −1 = RT −1 ⇒ RT = 0
The model of Best Bid Price Dynamics from one time step to the next is given by:
Pt+1 = ft (Pt , Nt , ϵt ) for all t = 0, 1, . . . , T − 1
where ft is an arbitrary function incorporating:
• The Permanent Price Impact of selling Nt shares.
• The Price-Impact-independent market-movement of the Best Bid Price from time t
to time t + 1.
• Noise ϵt , a source of randomness in Best Bid Price movements.
The sales proceeds from the sale at time step t, for all t = 0, 1, . . . , T − 1, is defined as:
Nt · Qt = Nt · (Pt − gt (Pt , Nt ))
where gt is a function modeling the Temporary Price Impact (i.e., the Nt MO “eating into”
the Buy LOs on the OB). Qt should be interpreted as the average Buy LO price transacted
against by the Nt MO at time t.
Lastly, we denote the Utility (of Sales Proceeds) function as U (·).
As mentioned previously, solving for the optimal number of shares to be sold at each
time step can be modeled as a discrete-time finite-horizon Markov Decision Process, which
we describe below in terms of the order of MDP activity at each time step t = 0, 1, . . . , T −1
(the MDP horizon is time T meaning all states at time T are terminal states). We follow
the notational style of finite-horizon MDPs that should now be familiar from previous
chapters.
Order of Events at time step t for all t = 0, 1, . . . , T − 1:
14
• Observe State st := (Pt , Rt ) ∈ St
• Perform Action at := Nt ∈ At
• Receive Reward rt+1 := U (Nt · Qt ) = U (Nt · (Pt − gt (Pt , Nt )))
• Experience Price Dynamics Pt+1 = ft (Pt , Nt , ϵt ) and set Rt+1 = Rt − Nt so as to
obtain the next state st+1 = (Pt+1 , Rt+1 ) ∈ St+1 .
Note that we have intentionally not specified if Pt , Rt , Nt are integers or real numbers, or
if constrained to be non-negative etc. Those precise specifications will be customized to the
nuances/constraints of the specific Optimal Order Execution problem we’d be solving. Be
default, we shall assume that Pt ∈ R+ and Nt , Rt ∈ Z≥0 (as these represent realistic trading
situations), although we do consider special cases later in the chapter where Pt , Rt ∈ R
(unconstrained real numbers for analytical tractability).
The goal is to find the Optimal Policy π ∗ = (π0∗ , π1∗ , . . . , πT∗ −1 ) (defined as πt∗ ((Pt , Rt )) =
∗
Nt that maximizes:
∑
T −1
E[ γ t · U (Nt · Qt )]
t=0
where γ is the discount factor to account for the fact that future utility of sales proceeds
can be modeled to be less valuable than today’s.
Now let us write some code to solve this MDP. We write a class OptimalOrderExecution
which models a fairly generic MDP for Optimal Order Execution as described above, and
solves the Control problem with Approximate Value Iteration using the backward induc-
tion algorithm that we implemented in Chapter ??. Let us start by taking a look at the
attributes (inputs) in OptimalOrderExecution:
• shares refers to the total number of shares N to be sold over T time steps.
• time_steps refers to the number of time steps T .
• avg_exec_price_diff refers to the time-sequenced functions gt that return the reduc-
tion in the average price obtained by the Market Order at time t due to eating into the
Buy LOs. gt takes as input the type PriceAndShares that represents a pair of price:
float and shares: int (in this case, the price is Pt and the shares is the MO size Nt
at time t). As explained earlier, the sales proceeds at time t is: Nt · (Pt − gt (Pt , Nt )).
• price_dynamics refers to the time-sequenced functions ft that represent the price dy-
namics: Pt+1 ∼ ft (Pt , Nt ). ft outputs a probability distribution of prices for Pt+1 .
• utility_func refers to the Utility of Sales Proceeds function, incorporating any risk-
aversion.
• discount_factor refers to the discount factor γ.
• func_approx refers to the ValueFunctionApprox type to be used to approximate the
Value Function for each time step (since we are doing backward induction).
• initial_price_distribution refers to the probability distribution of prices P0 at time
0, which is used to generate the samples of states at each of the time steps (needed
in the approximate backward induction algorithm).
from rl.approximate_dynamic_programming import ValueFunctionApprox
@dataclass(frozen=True)
class PriceAndShares:
price: float
shares: int
@dataclass(frozen=True)
class OptimalOrderExecution:
shares: int
15
time_steps: int
avg_exec_price_diff: Sequence[Callable[[PriceAndShares], float]]
price_dynamics: Sequence[Callable[[PriceAndShares], Distribution[float]]]
utility_func: Callable[[float], float]
discount_factor: float
func_approx: ValueFunctionApprox[PriceAndShares]
initial_price_distribution: Distribution[float]
The two key things we need to perform the backward induction are:
• A method get_mdp that given a time step t, produces the MarkovDecisionProcess ob-
ject representing the transitions from time t to time t+1. The class OptimalExecutionMDP
within get_mdp implements the abstract methods step and actions of the abstract
class MarkovDecisionProcess. The code should be fairly self-explanatory - just a cou-
ple of things to point out here. Firstly, the input p_r: NonTerminal[PriceAndShares]
to the step method represents the state (Pt , Rt ) at time t, and the variable p_s: PriceAndShares
represents the pair of (Pt , Nt ), which serves as input to avg_exec_price_diff and
price_dynamics (attributes of OptimalOrderExecution). Secondly, note that the actions
method returns an Iterator on a single int at time t = T −1 because of the constraint
NT −1 = RT −1 .
• A method get_states_distribution that returns the probability distribution of states
(Pt , Rt ) at time t (of type SampledDistribution[NonTerminal[PriceAndShares]]). The
code here is similar to the get_states_distribiution method of AssetAllocDiscrete
in Chapter ?? (essentially, walking forward from time 0 to time t by sampling from
the state-transition probability distribution and also sampling from uniform choices
over all actions at each time step).
16
)
return (NonTerminal(next_state), reward)
return SampledDistribution(
sampler=sr_sampler_func,
expectation_samples=100
)
def actions(self, p_s: NonTerminal[PriceAndShares]) -> \
Iterator[int]:
if t == steps - 1:
return iter([p_s.state.shares])
else:
return iter(range(p_s.state.shares + 1))
return OptimalExecutionMDP()
def get_states_distribution(self, t: int) -> \
SampledDistribution[NonTerminal[PriceAndShares]]:
def states_sampler_func() -> NonTerminal[PriceAndShares]:
price: float = self.initial_price_distribution.sample()
rem: int = self.shares
for i in range(t):
sell: int = Choose(range(rem + 1)).sample()
price = self.price_dynamics[i](PriceAndShares(
price=price,
shares=rem
)).sample()
rem -= sell
return NonTerminal(PriceAndShares(
price=price,
shares=rem
))
return SampledDistribution(states_sampler_func)
Finally, we produce the Optimal Value Function and Optimal Policy for each time step
with the following method backward_induction_vf_and_pi:
from rl.approximate_dynamic_programming import back_opt_vf_and_policy
def backward_induction_vf_and_pi(
self
) -> Iterator[Tuple[ValueFunctionApprox[PriceAndShares],
DeterministicPolicy[PriceAndShares, int]]]:
mdp_f0_mu_triples: Sequence[Tuple[
MarkovDecisionProcess[PriceAndShares, int],
ValueFunctionApprox[PriceAndShares],
SampledDistribution[NonTerminal[PriceAndShares]]
]] = [(
self.get_mdp(i),
self.func_approx,
self.get_states_distribution(i)
) for i in range(self.time_steps)]
num_state_samples: int = 10000
error_tolerance: float = 1e-6
return back_opt_vf_and_policy(
mdp_f0_mu_triples=mdp_f0_mu_triples,
gamma=self.discount_factor,
num_state_samples=num_state_samples,
error_tolerance=error_tolerance
)
17
different temporary and permanent price impact functions, different utility functions, im-
pose a few constraints etc.). Note that the above code has been written with an educa-
tional motivation rather than an efficient-computation motivation, so the convergence of
the backward induction ADP algorithm is going to be slow. How do we know that the
above code is correct? Well, we need to create a simple special case that yields a closed-
form solution that we can compare the Optimal Value Function and Optimal Policy pro-
duced by OptimalOrderExecution against. This will be the subject of the following subsec-
tion.
Pt+1 = ft (Pt , Nt , ϵ) = Pt − α · Nt + ϵt
where α ∈ R and ϵt for all t = 0, 1, . . . , T − 1 are independent and identically distributed
(i.i.d.) with E[ϵt |Nt , Pt ] = 0. Therefore, the Permanent Price Impact (as an Expectation)
is α · Nt .
As for the Temporary Price Impact, we know that gt needs to be a non-decreasing func-
tion of Nt . We assume a simple linear form for gt as follows:
∑
T −1
Vtπ ((Pt , Rt )) = Eπ [ Ni · (Pi − β · Ni )|(Pt , Rt )]
i=t
The Optimal Value Function satisfies the finite-horizon Bellman Optimality Equation
for all t = 0, 1, . . . , T − 2, as follows:
∗
Vt∗ ((Pt , Rt )) = max{Nt · (Pt − β · Nt ) + E[Vt+1 ((Pt+1 , Rt+1 ))]}
Nt
and
18
VT∗−1 ((PT −1 , RT −1 )) = NT −1 · (PT −1 − β · NT −1 ) = RT −1 · (PT −1 − β · RT −1 )
VT∗−2 ((PT −2 , RT −2 )) = max {RT −2 ·PT −2 −β ·RT2 −2 +(α−2β)(NT2 −2 −NT −2 ·RT −2 )} (1.9)
NT −2
For the case α ≥ 2β, noting that NT −2 ≤ RT −2 , we have the trivial solution:
NT∗ −2 = 0 or NT∗ −2 = RT −2
Substituting either of these two values for NT∗ −2 in the right-hand-side of Equation (1.9)
gives:
RT −2
(α − 2β) · (2NT∗ −2 − RT −2 ) = 0 ⇒ NT∗ −2 =
2
Substituting this solution for NT∗ −2 in Equation (1.9) gives:
α + 2β
VT∗−2 ((PT −2 , RT −2 )) = RT −2 · PT −2 − RT2 −2 · ( )
4
Continuing backwards in time in this manner gives:
Rt
Nt∗ = for all t = 0, 1, . . . , T − 1
T −t
Rt2 2β + α · (T − t − 1)
Vt∗ ((Pt , Rt )) = Rt · Pt − ·( ) for all t = 0, 1, . . . , T − 1
2 T −t
19
Rolling forward in time, we see that Nt∗ = N
T , i.e., splitting the N shares uniformly across
the T time steps. Hence, the Optimal Policy is a constant deterministic function (i.e., inde-
pendent of the State). Note that a uniform split makes intuitive sense because Price Impact
and Market Movement are both linear and ∑ additive, and don’t interact.∑This optimization
is essentially equivalent to minimizing Tt=1 Nt2 with the constraint: Tt=1 Nt = N . The
Optimal Expected Total Sales Proceeds is equal to:
N2 2β − α
N · P0 − · (α + )
2 T
Implementation Shortfall is the technical term used to refer to the reduction in Total Sales
Proceeds relative to the maximum possible sales proceeds (= N · P0 ). So, in this simple
2
linear model, the Implementation Shortfall from Price Impact is N2 · (α + 2β−α T ). Note that
the Implementation Shortfall is non-zero even if one had infinite time available (T → ∞)
for the case of α > 0. If Price Impact were purely temporary (α = 0, i.e., Price fully snapped
back), then the Implementation Shortfall is zero if one had infinite time available.
So now let’s customize the class OptimalOrderExecution to this simple linear price im-
pact model, and compare the Optimal Value Function and Optimal Policy produced by
OptimalOrderExecution against the above-derived closed-form solutions. We write code
below to create an instance of OptimalOrderExecution with time steps T = 5, total number
of shares to be sold N = 100, linear temporary price impact with α = 0.03, linear perma-
nent price impact with β = 0.03, utility function as the identity function (no risk-aversion),
and discount factor γ = 1. We set the standard deviation for the price dynamics probabil-
ity distribution to 0 to speed up the calculation. Since we know the closed-form solution
for the Optimal Value Function, we provide some assistance to OptimalOrderExecution by
setting up a linear function approximation with two features: Pt · Rt and Rt2 . The task
of OptimalOrderExecution is to infer the correct coefficients of these features for each time
step. If the coefficients match that of the closed-form solution, it provides a great degree
of confidence that our code is working correctly.
20
)
Next we evaluate this Optimal Value Function and Optimal Policy on a particular state
for all time steps, and compare that against the closed-form solution. The state we use for
evaluation is as follows:
The code to evaluate the obtained Optimal Value Function and Optimal Policy on the
above state is as follows:
With 100,000 state samples for each time step and only 10 state transition samples (since
the standard deviation of ϵ is set to be very small), this prints the following:
Time 0
Time 1
21
Optimal Sales = 20, Opt Val = 9762.479
Time 2
Time 3
Time 4
for t in range(num_time_steps):
print(f”Time {t:d}”)
print()
left: int = num_time_steps - t
opt_sale_anal: float = num_shares / num_time_steps
wt1: float = 1
wt2: float = -(2 * beta + alpha * (left - 1)) / (2 * left)
val_anal: float = wt1 * state.price * state.shares + \
wt2 * state.shares * state.shares
Time 0
22
Weight2 = -0.022
Time 1
Time 2
Time 3
Time 4
We need to point out here that the general case of optimal order execution involving
modeling of the entire Order Book’s dynamics will have to deal with a large state space.
This means the ADP algorithm will suffer from the curse of dimensionality, which means
we will need to employ RL algorithms.
Pt+1 = Pt − (β · Nt + θ · Xt ) + ϵt
Xt+1 = ρ · Xt + ηt
Qt = Pt − (β · Nt + θ · Xt )
where ϵt and ηt are each independent and identically distributed random variables with
mean zero for all t = 0, 1, . . . , T − 1, ϵt and ηt are also independent of each other for all
t = 0, 1, . . . , T − 1. Xt can be thought of as a market factor affecting Pt linearly. Applying
the finite-horizon Bellman Optimality Equation on the Optimal Value Function (and the
23
same backward-recursive approach as before) yields:
Rt
Nt∗ = + h(t, β, θ, ρ) · Xt
T −t
Vt∗ ((Pt , Rt , Xt )) = Rt · Pt − (quadratic in (Rt , Xt ) + constant)
Essentially, the serial-correlation predictability (ρ ̸= 0) alters the uniform-split strategy.
In the same paper, Bertsimas and Lo presented a more realistic model called Linear-
Percentage Temporary (abbreviated as LPT) Price Impact model, whose salient features in-
clude:
• Geometric random walk: consistent with real data, and avoids non-positive prices.
• Fractional Price Impact gt (PPtt,Nt ) doesn’t depend on Pt (this is validated by real data).
• Purely Temporary Price Impact, i.e., the price Pt snaps back after the Temporary
Price Impact (no Permanent effect of Market Orders on future prices).
Pt+1 = Pt · eZt
Xt+1 = ρ · Xt + ηt
Qt = Pt · (1 − β · Nt − θ · Xt )
where Zt are independent and identically distributed random variables with mean µZ
and variance σZ2 , ηt are independent and identically distributed random variables with
mean zero for all t = 0, 1, . . . , T − 1, Zt and ηt are independent of each other for all t =
0, 1, . . . , T − 1. Xt can be thought of as a market factor affecting Pt multiplicatively. With
the same derivation methodology as before, we get the solution:
Nt∗ = ct + ct Rt + ct Xt
(1) (2) (3)
2
σZ
Vt∗ ((Pt , Rt , Xt )) = eµZ +
(4) (5) (6) (7) (8) (9)
2 · Pt · (ct + ct Rt + ct Xt + ct Rt2 + ct Xt2 + ct Rt Xt )
(k)
where ct , 1 ≤ k ≤ 9, are constants (independent of Pt , Rt , Xt ).
As an exercise, we recommend implementing the above (LPT) model by customizing
OptimalOrderExecution to compare the obtained Optimal Value Function and Optimal Pol-
(k)
icy against the closed-form solution (you can find the exact expressions for the ct coeffi-
cients in the Bertsimas and Lo paper).
24
risk-aversion. The incorporation of risk-aversion affects the time-trajectory of Nt∗ . Clearly,
if λ = 0, we get the usual uniform-split strategy: Nt∗ = NT . The other extreme assumption
is to minimize V ar[Y ] which yields: N0∗ = N (sell everything immediately because the
only thing we want to avoid is uncertainty of sales proceeds). In their paper, Almgren
and Chriss go on to derive the Efficient Frontier for this problem (analogous to the Effi-
cient Frontier Portfolio Theory we outline in Appendix ??). They also derive solutions for
specific utility functions.
To model a real-world trading situation, the first step is to start with the MDP we de-
scribed earlier with an appropriate model for the price dynamics ft (·) and the temporary
price impact gt (·) (incorporating potential time-heterogeneity, non-linear price dynamics
and non-linear impact). The OptimalOrderExecution class we wrote above allows us to
incorporate all of the above. We can also model various real-world “frictions” such as
discrete prices, discrete number of shares, constraints on prices and number of shares, as
well as trading fees. To make the model truer to reality and more sophisticated, we can
introduce various market factors in the State which would invariably lead to bloating of
the State Space. We would also need to capture Cross-Asset Market Impact. As a further
step, we could represent the entire Order Book (or a compact summary of the size/shape
of the Order book) as part of the state, which leads to further bloating of the state space.
All of this makes ADP infeasible and one would need to employ Reinforcement Learning
algorithms. More importantly, we’d need to write a realistic Order Book Dynamics sim-
ulator capturing all of the above real-world considerations that an RL algorithm would
learn from. There are a lot of practical and technical details involved in writing a real-
world simulator and we won’t be covering those details in this book. It suffices for here
to say that the simulator would essentially be a sampling model that has learnt the Order
Book Dynamics from market data (supervised learning of the Order Book Dynamics). Us-
ing such a simulator and with a deep learning-based function approximation of the Value
Function, we can solve a practical Optimal Order Execution problem with Reinforcement
Learning. We refer you to a couple of papers for further reading on this:
• Paper by Nevmyvaka, Feng, Kearns in 2006 (Nevmyvaka, Feng, and Kearns 2006)
• Paper by Vyetrenko and Xu in 2019 (Vyetrenko and Xu 2019)
Designing real-world simulators for Order Book Dynamics and using Reinforcement
Learning for Optimal Order Execution is an exciting area for future research as well as
engineering design. We hope this section has provided sufficient foundations for you to
dig into this topic further.
25
bid prices by submitting Buy LOs on an OB and quotes one’s ask prices by submitting
Sell LOs on the OB. Market-makers are known as liquidity providers in the market because
they make shares of the stock available for trading on the OB (both on the buy side and
sell side). In general, anyone who submits LOs can be thought of as a market liquidity
provider. Likewise, anyone who submits MOs can be thought of as a market liquidity taker
(because an MO takes shares out of the volume that was made available for trading on the
OB).
There is typically a fairly complex interplay between liquidity providers (including market-
makers) and liquidity takers. Modeling OB dynamics is about modeling this complex in-
terplay, predicting arrivals of MOs and LOs, in response to market events and in response
to observed activity on the OB. In this section, we view the OB from the perspective of a
single market-maker who aims to make money with Buy/Sell LOs of appropriate bid-ask
spread and with appropriate volume of shares (specified in their submitted LOs). The
market-maker is likely to be successful if she can do a good job of forecasting OB Dynam-
ics and dynamically adjusting her Buy/Sell LOs on the OB. The goal of the market-maker
is to maximize one’s Utility of Gains at the end of a suitable horizon of time.
The core intuition in the decision of how to set the price and shares in the market-maker’s
Buy and Sell LOs is as follows: If the market-maker’s bid-ask spread is too narrow, they
will have more frequent transactions but smaller gains per transaction (more likelihood of
their LOs being transacted against by an MO or an opposite-side LO). On the other hand,
if the market-maker’s bid-ask spread is too wide, they will have less frequent transactions
but larger gains per transaction (less likelihood of their LOs being transacted against by
an MO or an opposite-side LO). Also of great importance is the fact that a market-maker
needs to carefully manage potentially large inventory buildup (either on the long side
or the short side) so as to avoid scenarios of consequent unfavorable forced liquidation
upon reaching the horizon time. Inventory buildup can occur if the market participants
consistently transact against mostly one side of the market-maker’s submitted LOs. With
this high-level intuition, let us make these concepts of market-making precise. We start
by developing some notation to help articulate the problem of Optimal Market-Making
clearly. We will re-use some of the notation and terminology we had developed for the
problem of Optimal Order Execution. As ever, for ease of exposition, we will simplify the
setting for the Optimal Market-Making problem.
Assume there are a finite number of time steps indexed by t = 0, 1, . . . , T . Assume
the market-maker always shows a bid price and ask price (at each time t) along with the
associated bid shares and ask shares on the OB. Also assume, for ease of exposition, that
the market-maker can add or remove bid/ask shares from the OB costlessly. We use the
following notation:
26
(a) (a)
• We refer to δt = Pt − St as the market-maker’s Ask Spread (relative to OB Mid).
(b) (a) (a) (b)
• We refer to δt + δt = Pt − Pt as the market-maker’s Bid-Ask Spread.
(b)
• Random variable Xt ∈ Z≥0 refers to the total number of market-maker’s Bid Shares
(b)
that have been transacted against (by MOs or by Sell LOs) up to time t (Xt is often
refered to as the cumulative “hits” up to time t, as in “the market-maker’s buy offer
has been hit”).
(a)
• Random variable Xt ∈ Z≥0 refers to the total number of market-maker’s Ask
(a)
Shares that have been transacted against (by MOs or by Buy LOs) up to time t (Xt
is often refered to as the cumulative “lifts” up to time t, as in “the market-maker’s
sell offer has been lifted”).
With this notation in place, we can write the trading account balance equation for all
t = 0, 1, . . . , T − 1 as follows:
(a) (a) (a) (b) (b) (b)
Wt+1 = Wt + Pt · (Xt+1 − Xt ) − Pt · (Xt+1 − Xt ) (1.10)
Note that since the inventory I0 at time 0 is equal to 0, the inventory It at time t is given
by the equation:
(b) (a)
I t = Xt − Xt
The market-maker’s goal is to maximize (for an appropriately shaped concave utility
function U (·)) the sum of the trading account value at time T and the value of the inventory
of shares held at time T , i.e., we maximize:
E[U (WT + IT · ST )]
As we alluded to earlier, this problem can be cast as a discrete-time finite-horizon Markov
Decision Process (with discount factor γ = 1). Following the usual notation for discrete-
time finite-horizon MDPs, the order of activity for the MDP at each time step t = 0, 1, . . . , T −
1 is as follows:
that maximizes:
∑
T
E[ Rt ] = E[RT ] = E[U (WT + IT · ST )]
t=1
27
1.3.1 Avellaneda-Stoikov Continuous-Time Formulation
A landmark paper by Avellaneda and Stoikov (Avellaneda and Stoikov 2008) formulated
this optimal market-making problem in it’s continuous-time version. Their formulation
is conducive to analytical tractability and they came up with a simple, clean and intuitive
solution. In this subsection, we go over their formulation and in the next subsection, we
show the derivation of their solution. We adapt our discrete-time notation above to their
continuous-time setting.
(b) (a)
[(Xt |0 ≤ t < T ] and [Xt |0 ≤ t < T ] are assumed to be continuous-time Poisson
(b)
processes with the hit rate per unit of time and the lift rate per unit of time denoted as λt and
(a)
λt respectively. Hence, we can write the following:
(b) (b)
dXt ∼ P oisson(λt · dt)
(a) (a)
dXt ∼ P oisson(λt · dt)
(b) (b)
λt = f (b) (δt )
(a) (a)
λt = f (a) (δt )
for decreasing functions f (b) (·) and f (a) (·).
(a) (a) (b) (b)
dWt = Pt · dXt − Pt · dXt
(b) (a)
I t = Xt − Xt (note: I0 = 0)
(b)
Since infinitesimal Poisson random variables dXt (shares hit in time interval from t
(a)
to t + dt) and dXt (shares lifted in time interval from t to t + dt) are Bernoulli random
(b) (a)
variables (shares hit/lifted within time interval of duration dt will be 0 or 1), Nt and Nt
(number of shares in the submitted LOs for the infinitesimal time interval from t to t + dt)
can be assumed to be 1.
This simplifies the Action at time t to be just the pair:
(b) (a)
(δt , δt )
dSt = σ · dzt
for some σ ∈ R+ .
The Utility function is assumed to be: U (x) = −e−γx where γ > 0 is the risk-aversion pa-
rameter (this Utility function is essentially the CARA Utility function devoid of associated
constants).
28
Vt∗ (·), here we make t an explicit functional argument of V ∗ and each of St , Wt , It also
as separate functional arguments of V ∗ (instead of the typical approach of making the
state, as a tuple, a single functional argument). This is because in the continuous-time
setting, we are interested in the time-differential of the Optimal Value Function and we
also want to represent the dependency of the Optimal Value Function on each of St , Wt , It
as explicit separate dependencies. Appendix ?? provides the derivation of the general HJB
formulation (Equation (??) in Appendix ??) - this general HJB Equation specializes here
to the following:
∂V ∗ ∂V ∗ σ2 ∂ 2V ∗
max { · dt + E[σ · · dzt + · · (dzt )2 ]
(b)
δt ,δt
(a) ∂t ∂St 2 ∂St2
+ λt · dt · V ∗ (t, St , Wt − St + δt , It + 1)
(b) (b)
· dt · V ∗ (t, St , Wt + St + δt , It − 1)
(a) (a)
+ λt
· dt) · V ∗ (t, St , Wt , It )
(b) (a)
+ (1 − λt · dt − λt
− V ∗ (t, St , Wt , It )} = 0
Next, we want to convert the HJB Equation to a Partial Differential Equation (PDE). We
can simplify the above HJB equation with a few observations:
• E[dzt ] = 0.
• E[(dzt )2 ] = dt.
(b) (a)
• Organize the terms involving λt and λt better with some algebra.
• Divide throughout by dt.
∂V ∗ σ 2 ∂ 2 V ∗
max { + ·
(b)
δt ,δt
(a) ∂t 2 ∂St2
+ λt · (V ∗ (t, St , Wt − St + δt , It + 1) − V ∗ (t, St , Wt , It ))
(b) (b)
29
(b) (b) (a) (a)
Next, note that λt = f (b) (δt ) and λt = f (a) (δt ), and apply the max only on the
relevant terms:
∂V ∗ σ 2 ∂ 2 V ∗
+ ·
∂t 2 ∂St2
+ max{f (b) (δt ) · (V ∗ (t, St , Wt − St + δt , It + 1) − V ∗ (t, St , Wt , It ))}
(b) (b)
(b)
δt
∂θ σ 2 ∂ 2 θ ∂θ 2
+ ·( 2 −γ·( ) )
∂t 2 ∂St ∂St
(b)
f (b) (δt ) (b)
+ max{ · (1 − e−γ·(δt −St +θ(t,St ,It +1)−θ(t,St ,It )) )}
(b)
δt γ
(a)
f (a) (δt ) (a)
+ max{ · (1 − e−γ·(δt +St +θ(t,St ,It −1)−θ(t,St ,It )) )} = 0
(a)
δt γ
(b)
−e−γ·(Wt −Qt +θ(t,St ,It +1))
= −e−γ·(Wt +θ(t,St ,It ))
30
(b)
⇒ Qt = θ(t, St , It + 1) − θ(t, St , It ) (1.14)
(a)
Likewise for Qt , we get:
(a)
Qt = θ(t, St , It ) − θ(t, St , It − 1) (1.15)
(b) (a)
Using Equations (1.14) and (1.15), bring Qt and Qt in the PDE for θ:
∂θ σ 2 ∂ 2 θ ∂θ 2 (b) (b)
+ ·( 2 −γ·( ) ) + max g(δt ) + max h(δt ) = 0
∂t 2 ∂St ∂St (b)
δt
(a)
δt
(b)
f (b) (δt ) (b) (b)
· (1 − e−γ·(δt −St +Qt ) )
(b)
where g(δt ) =
γ
(a)
f (a) (δt ) (a) (a)
· (1 − e−γ·(δt +St −Qt ) )
(a)
and h(δt ) =
γ
(b) (b)
To maximize g(δt ), differentiate g with respect to δt and set to 0:
(b) ∗ (b)
(b) ∗ ∂f (b) (b) ∗ ∂f (b) (b) ∗
e−γ·(δt −St +Qt )
· (γ · f (b) (δt )− (δ
(b) t
)) + (b)
(δt )=0
∂δt ∂δt
(b) ∗
(b) ∗ (b) ∗ (b) 1 f (b) (δ )
⇒ δt = St − Pt = St − Qt + · log (1 − γ · (b) t(b) ∗ ) (1.16)
γ ∂f
(δ ) (b)
∂δt t
(a) (a)
To maximize h(δt ), differentiate h with respect to δt and set to 0:
(a) ∗ (a)
(a) ∗ ∂f (a) (a) ∗ ∂f (a) (a) ∗
e−γ·(δt +St −Qt )
· (γ · f (a) (δt )− (δ
(a) t
)) + (a)
(δt )=0
∂δt ∂δt
(a) ∗
(a) ∗ (a) ∗ (a) 1 f (a) (δ )
⇒ δt = Pt − St = Qt − St + · log (1 − γ · (a) t(a) ∗ ) (1.17)
γ ∂f
(δ ) (a)
∂δt t
(b) ∗ (a) ∗
Equations (1.16) and (1.17) are implicit equations for δt and δt respectively.
Now let us write the PDE in terms of the Optimal Bid and Ask Spreads:
∂θ σ 2 ∂ 2 θ ∂θ 2
+ ·( 2 −γ·( ) )
∂t 2 ∂St ∂St
(b) ∗
f (b) (δt ) (b) ∗
+ · (1 − e−γ·(δt −St +θ(t,St ,It +1)−θ(t,St ,It ))
)
γ (1.18)
(a) ∗
f (a) (δt ) (a) ∗
+ · (1 − e−γ·(δt +St +θ(t,St ,It −1)−θ(t,St ,It ))
)=0
γ
with boundary condition: θ(T, ST , IT ) = IT · ST
31
(b) ∗
• Using Equations (1.14) and (1.15), and using the above-obtained θ in terms of δt
(a) ∗ (b) (a) (b) ∗ (a) ∗
and δt , we get Qt and Qt in terms of δt and δt .
(b) (a) (b) ∗ (a) ∗
• Then we substitute the above-obtained and (in terms of δt and δt ) in
Qt Qt
Equations (1.16) and (1.17).
(b) ∗ (a) ∗
• Finally, we solve the implicit equations for δt and δt (in general, numerically).
The process dSt = σ · dzt implies that ST ∼ N (St , σ 2 · (T − t)), and hence:
γ·It2 ·σ 2 ·(T −t)
V ∗ (t, St , Wt , It ) = −e−γ·(Wt +It ·St − 2
)
Hence,
(b) γ·(It +1)2 ·σ 2 ·(T −t)
V ∗ (t, St , Wt − Qt , It + 1) = −e−γ·(Wt −Qt +(It +1)·St −
(b) )
2
V ∗ (t, St , Wt , It ) = V ∗ (t, St , Wt − Qt , It + 1)
(b)
Therefore,
γ·It2 ·σ 2 ·(T −t) (b) γ·(It +1)2 ·σ 2 ·(T −t)
−e−γ·(Wt +It ·St − 2
)
= −e−γ·(Wt −Qt +(It +1)·St − 2
)
This implies:
(b) γ · σ 2 · (T − t)
Qt = St − (2It + 1) ·
2
Likewise, we can derive:
(a) γ · σ 2 · (T − t)
Qt = St − (2It − 1) ·
2
The formulas for the Indifference Mid Price and the Indifference Bid-Ask Price Spread are
as follows:
(m)
Qt = St − It · γ · σ 2 · (T − t)
(a) (b)
Qt − Qt = γ · σ 2 · (T − t)
32
the market-maker, an adjustment to the OB mid price St that takes into account the magni-
(m)
tude and sign of It . If the market-maker is long inventory (It > 0), then Qt < St , which
makes intuitive sense since the market-maker is interested in reducing her risk of inven-
tory buildup and so, would be be more inclined to sell than buy, leading her to show bid
and ask prices whose average is lower than the OB mid price St . Likewise, if the market-
(m)
maker is short inventory (It < 0), then Qt > St indicating inclination to buy rather than
sell.
Armed with this intuition, we come back to optimal market-making, observing from
Equations (1.16) and (1.17):
(b) ∗ (b) (m) (a) (a) ∗
Pt < Qt < Qt < Qt < Pt
(b) ∗ (b) (m) (a) (a) ∗
Visualize this ascending sequence of prices [Pt , Qt , Qt , Qt , Pt ] as jointly sliding
up/down (relative to OB mid price St ) as a function of the inventory It ’s magnitude and
(b) ∗ (a) ∗ (m)
sign, and perceive Pt , Pt in terms of their spreads to the pseudo mid price Qt :
(b) (a) (b) ∗
(b) (m) ∗ Q + Qt 1 f (b) (δ )
Qt − Pt = t + · log (1 − γ · (b) t(b) ∗ )
2 γ ∂f
(δ )
∂δt
(b) t
(b) ∗ (b) 1 γ
δt = St − Q t + · log (1 + ) (1.19)
γ k
(a) ∗ (a) 1 γ
δt = Qt − St + · log (1 + ) (1.20)
γ k
(b) ∗ (a) ∗ (m)
which means Pt and Pt are equidistant from Qt . Substituting these simplified
(b) ∗ (a) ∗
δt , δ t in Equation (1.18) reduces the PDE to:
∂θ σ 2 ∂ 2 θ ∂θ 2 c (b) ∗ (a) ∗
+ ·( 2 −γ·( ) )+ · (e−k·δt + e−k·δt ) = 0
∂t 2 ∂St ∂St k+γ (1.21)
with boundary condition θ(T, ST , IT ) = IT · ST
(b) ∗ (a) ∗
Note that this PDE (1.21) involves δt and δt . However, Equations (1.19), (1.20),
(b) ∗ (a) ∗
(1.14) and (1.15) enable expressing δt and δt in terms of θ(t, St , It − 1), θ(t, St , It ) and
33
θ(t, St , It + 1). This gives us a PDE just in terms of θ. Solving that PDE for θ would give
(b) ∗ (a) ∗
us not only V ∗ (t, St , Wt , It ) but also δt and δt (using Equations (1.19), (1.20), (1.14)
and (1.15)). To solve the PDE, we need to make a couple of approximations.
(b) ∗ (a) ∗
First we make a linear approximation for e−k·δt and e−k·δt in PDE (1.21) as follows:
∂θ σ 2 ∂ 2 θ ∂θ 2 c (b) ∗ (a) ∗
+ ·( 2 −γ·( ) )+ · (1 − k · δt + 1 − k · δt ) = 0 (1.22)
∂t 2 ∂St ∂St k+γ
Combining the Equations (1.19), (1.20), (1.14) and (1.15) gives us:
(b) ∗ (a) ∗ 2 γ
δt + δt = · log (1 + ) + 2θ(t, St , It ) − θ(t, St , It + 1) − θ(t, St , It − 1)
γ k
(b) ∗ (a) ∗
With this expression for δt + δt , PDE (1.22) takes the form:
∂θ σ 2 ∂ 2 θ ∂θ 2 c 2k γ
+ ·( 2 − γ · ( ) )+ · (2 − · log (1 + )
∂t 2 ∂St ∂St k+γ γ k (1.23)
− k · (2θ(t, St , It ) − θ(t, St , It + 1) − θ(t, St , It − 1))) = 0
To solve PDE (1.23), we consider the following asymptotic expansion of θ in It :
∞
∑ In
θ(t, St , It ) = t
· θ(n) (t, St )
n!
n=0
(0) ∂ 2 θ (0)
independent of St . So, we can write it as simply θ(0) (t), meaning ∂θ ∂St and ∂St2 are equal
to 0. Therefore, we can write the approximate expansion for θ(t, St , It ) as:
It2 (2)
· θ (t, St )
θ(t, St , It ) = θ(0) (t) + It · θ(1) (t, St ) + (1.24)
2
Substituting this approximation Equation (1.24) for θ(t, St , It ) in PDE (1.23), we get:
34
• Terms devoid of It (i.e., It0 )
• Terms involving It (i.e., It1 )
• Terms involving It2
∂θ(1) σ 2 ∂ 2 θ(1)
+ · = 0 with boundary condition θ(1) (T, ST ) = ST
∂t 2 ∂St2
The solution to this PDE is:
∂θ(0) c 2k γ
+ · (2 − · log (1 + ) + k · θ(2) ) = 0 with boundary θ(0) (T ) = 0
∂t k+γ γ k
c 2k γ kγσ 2
θ(0) (t) = · ((2 − · log (1 + )) · (T − t) − · (T − t)2 ) (1.28)
k+γ γ k 2
This completes the PDE solution for θ(t, St , It ) and hence, for V ∗ (t, St , Wt , It ). Lastly, we
(b) (a) (m) (b) ∗ (a) ∗
derive formulas for Qt , Qt , Qt , δt , δt .
Using Equations (1.14) and (1.15), we get:
(b) γ · σ 2 · (T − t)
Qt = θ(1) (t, St ) + (2It + 1) · θ(2) (t, St ) = St − (2It + 1) · (1.29)
2
(a) γ · σ 2 · (T − t)
Qt = θ(1) (t, St ) + (2It − 1) · θ(2) (t, St ) = St − (2It − 1) · (1.30)
2
Using equations (1.19) and (1.20), we get:
(b) ∗ (2It + 1) · γ · σ 2 · (T − t) 1 γ
δt = + · log (1 + ) (1.31)
2 γ k
(a) ∗ (1 − 2It ) · γ · σ 2 · (T − t) 1 γ
δt = + · log (1 + ) (1.32)
2 γ k
35
(b) ∗ (a) ∗ 2 γ
Optimal Bid-Ask Spread δt + δt = γ · σ 2 · (T − t) + · log (1 + ) (1.33)
γ k
(a) (m)γ · σ 2 · (T − t)
(m) (b)
Inner Spreads Qt − Qt = Qt − Qt =
2
This completes the analytical approximation to the solution of the Avellaneda-Stoikov
continuous-time formulation of the Optimal Market-Making problem.
36
• A paper from University of Liverpool (Spooner et al. 2018)
• A paper from J.P.Morgan Research (Ganesh et al. 2019)
This topic of development of models for OB Dynamics and RL algorithms for practical
market-making is an exciting area for future research as well as engineering design. We
hope this section has provided sufficient foundations for you to dig into this topic further.
37
Bibliography
39
Almgren, Robert, and Neil Chriss. 2000. “Optimal Execution of Portfolio Transactions.”
Journal of Risk, 5–39.
Avellaneda, Marco, and Sasha Stoikov. 2008. “High-Frequency Trading in a Limit Or-
der Book.” Quantitative Finance 8 (3): 217–24. http://www.informaworld.com/10.1080/
14697680701381228.
Bertsimas, Dimitris, and Andrew W. Lo. 1998. “Optimal Control of Execution Costs.”
Journal of Financial Markets 1 (1): 1–50.
Ganesh, Sumitra, Nelson Vadori, Mengda Xu, Hua Zheng, Prashant P. Reddy, and Manuela
Veloso. 2019. “Reinforcement Learning for Market Making in a Multi-Agent Dealer
Market.” CoRR abs/1911.05892. http://dblp.uni-trier.de/db/journals/corr/corr1911.
html#abs-1911-05892.
Gueant, Olivier. 2016. The Financial Mathematics of Market Liquidity: From Optimal Execution
to Market Making. Chapman; Hall/CRC Financial Mathematics Series.
Nevmyvaka, Yuriy, Yi Feng, and Michael J. Kearns. 2006. “Reinforcement Learning for
Optimized Trade Execution.” In ICML, edited by William W. Cohen and Andrew W.
Moore, 148:673–80. ACM International Conference Proceeding Series. ACM. http:
//dblp.uni-trier.de/db/conf/icml/icml2006.html#NevmyvakaFK06.
Spooner, Thomas, John Fearnley, Rahul Savani, and Andreas Koukorinis. 2018. “Market
Making via Reinforcement Learning.” CoRR abs/1804.04216. http://dblp.uni-trier.
de/db/journals/corr/corr1804.html#abs-1804-04216.
Vyetrenko, Svitlana, and Shaojie Xu. 2019. “Risk-Sensitive Compact Decision Trees for Au-
tonomous Execution in Presence of Simulated Market Response.” CoRR abs/1906.02312.
http://dblp.uni-trier.de/db/journals/corr/corr1906.html#abs-1906-02312.
41