Article
Applications of Markov Decision Process Model and Deep
Learning in Quantitative Portfolio Management during the
COVID-19 Pandemic
Han Yue, Jiapeng Liu * and Qin Zhang
College of Economics and Management, China Jiliang University, Hangzhou 310018, China
* Correspondence: jpliu@cjlu.edu.cn
Abstract: Whether for institutional investors or individual investors, there is an urgent need to
explore autonomous models that can adapt to the non-stationary, low-signal-to-noise markets. This
research aims to explore the two unique challenges in quantitative portfolio management: (1) the
difficulty of representation and (2) the complexity of environments. We propose a Markov decision process model-based deep reinforcement learning model, called SwanTrader, that incorporates deep learning methods to perform strategy optimization. To improve portfolio-management decisions from two different perspectives, i.e., temporal-pattern analysis and robustness-information capture based on market observations, we design a deep learning network in our model that incorporates a stacked sparse denoising autoencoder (SSDAE) and a long–short-term-memory-based autoencoder (LSTM-AE). The findings in times of COVID-19 show that the suggested model outperforms four standard machine learning models and two state-of-the-art reinforcement learning models in terms of Sharpe ratio, Calmar ratio, and beta and alpha values.
Furthermore, we analyzed which deep learning models and reward functions were most effective in optimizing the agent's management decisions. The results of our suggested model for investors can assist in reducing the risk of investment loss as well as help them to make sound decisions.

Keywords: Markov decision process model; quantitative portfolio management; deep reinforcement learning; deep learning; omega ratio

1. Introduction
that the influence model improved the ability of RL agents to generate profits. Liu et al. [10]
utilized imitation learning techniques to balance the exploration and exploitation of the
RL agent, and comparison results confirmed its ability to generalize to various financial
markets.
However, the above studies ignore the augmentation of the original input features
because the original price data and technical indicators are not suitable to be directly put
into the state space. First, the high instability, low signal-to-noise ratio, and external shock
characteristics of the real financial market mean that the input of the original financial
features in the state space may bring serious problems to the estimation of the value
function [11]. Second, putting the OHCLV (open, high, close, low, volume) data and a
tremendous number of technical indicators into state space would result in the curse of
dimensionality [12]. Many existing studies have simplified the state space as OHCLV data
and a few technical indicators [13–16]. A better solution is to extract the latent feature
through the feature augmentation model, which can reduce the calculation cost and time
required for training and eliminate redundant information between related features [17].
To the best of our knowledge, there are few studies devoted to augmenting the
quality of features before implementing the trading algorithms. Yashaswi [18] proposed
a latent feature state space (LFSS) module for filtering and feature extraction of financial
data. LFSS includes the use of the Kalman filter and a variety of machine-learning-based
feature-extraction technologies to preprocess the feature space, including ZoomSVD [19],
cRBM [20], and autoencoder (AE) [21]. Among them, the AE had the best effect, outperforming the RL agent without feature preprocessing as well as five traditional trading strategies. Soleymani
and Paquet [22] employed a restricted stacked autoencoder module in order to obtain
non-correlated and highly informative features and the results demonstrated the efficiency
of the proposed module. Li [23] proposed a feature preprocessing module consisting
of principal component analysis (PCA) and discrete wavelet transform (DWT), and the
numerical results demonstrate that feature preprocessing is essential for the RL trading
algorithm.
Moreover, the high instability of the financial market will also make it difficult to
define an appropriate reward function. The change rate of portfolio value is commonly
defined as reward function. However, it represents less risk-related information than risk-
adjusted technical indicators such as the Sharpe ratio, Sortino ratio, and Calmar ratio. Thus, many studies [10,24–26] used the Sharpe ratio [27] as the reward function, and their back-test performance improved significantly in terms of maximum drawdown (MDD). Wu et al. [28]
introduced the Sortino ratio as the reward, which only factors in the negative deviation of a
trading strategy’s returns from the mean, and Almahdi & Yang [29] introduced the Calmar
ratio based on MDD, all achieving better performance. However, the technical indicators mentioned above only use the first two statistical moments of the return distribution, namely mean and variance, and therefore ignore the leptokurtic (high-peak, fat-tail) and skewed characteristics of real return series [30]. In addition, MDD measures the maximum loss over a long period of time, which does not reflect the risk in the short period [31]. It is also particularly sensitive to extreme events, which can cause significant losses yet occur so infrequently that they are hardly probable. To address the defects of the above risk-adjusted indicators, we introduce the omega ratio to construct a novel reward function that better balances risk and profit. Compared with the three indicators mentioned above, the omega ratio is considered a better performance indicator because it depends on the distribution of all returns and therefore contains all information about risks and returns [32]. From the perspective of probability and statistics, it also has a natural and enlightening financial interpretation.
In this paper, to address the aforementioned challenges and issues, we propose a
Markov decision process model-based deep reinforcement learning model for QPM, called
SwanTrader. The proposed model consists of two main components: a multi-level state
space augmentation (MSA) and a RL agent trained by an on-policy actor–critic algorithm.
The MSA comprises a stacked sparse denoising autoencoder (SSDAE) network and a long–short-term-memory-based autoencoder (LSTM-AE) network.
2. Related Works
2.1. Reinforcement Learning Algorithms
In the realm of quantitative management, the application of DRL has proliferated
in recent years. There are three forms of algorithm usage: the value-based model, the
policy-based model, and the actor–critic model [37].
The value-based approach, which is the most prevalent way and aids in solving
discrete action space problems with techniques such as deep Q-learning (DQN) and its
enhancements, trains an agent on a single stock or asset [38–40]. Due to the continuity of price data, it would be impractical to apply this model to a large quantitative management task. The policy-based approach has been implemented in [6,41,42]. Policy-based models, as opposed to value-based models (which may find the optimal policy for some problems), are capable of handling continuous action spaces, but they suffer from high variance and slow learning speed [42].
The actor–critic model attempts to combine the benefits of the value-based and policy-
based approaches. The objective is to simultaneously update the actor network representing
policy and the critic network representing value function. The critic estimates the value
function, and the actor uses the strategy gradient to update the critic-guided strategy
probability distribution. The actor network learns to take better actions over time, while
the critic network becomes more adept at evaluating these actions. Due to its outstanding
performance, the actor–critic model has garnered considerable attention. Reference [43]
compared the representative algorithms of the three mentioned models: PG, DQN, and A2C. The results show that the actor–critic model is better than the value-based and policy-based models, showing more stability and stronger profitability. Reference [28] showed that the actor–critic model is more stable than the value-based model in the ever-evolving stock market. Reference [44] employed four actor–critic algorithms, namely proximal policy optimization (PPO), the deep deterministic policy gradient (DDPG), the advantage actor–critic (A2C), and the twin delayed DDPG (TD3), and tested them on the Indian stock market. On comparing the results, A2C achieved the best performance. Reference [45] employed DDPG, PPO, and A2C in portfolio-management tasks, and the result indicates that PPO is slightly better than A2C in terms of the Sharpe ratio. Reference [8] conducted
a comparison of PPO, A2C, and DDPG. Experimental results indicate that A2C outperforms
other algorithms. In summary, this paper employed A2C and implemented it using stable-
baselines3 [46].
$$\pi^{*} = \arg\max_{\pi} \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\left[\gamma^{t} r(s_t, a_t)\right] \qquad (1)$$

where $\rho_{\pi}$ indicates the distribution of state-action pairs that the RL agent will encounter under the control of policy $\pi$.
For each policy π, one can define its corresponding value function:
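A standard statement of this value function, using the same notation and discounting as Equation (1), is:

$$V^{\pi}(s) = \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]$$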
The management environment of this paper follows Yang et al. [8]. The state space of
our trading environment consists of four components $[b_t, h_t, p_t, X_t]$. Here are the definitions for each component:
• $b_t \in \mathbb{R}_{+}$: available balance at current time-step $t$;
• $h_t \in \mathbb{Z}_{+}^{n}$: shares owned of each stock at current time-step $t$;
• $p_t \in \mathbb{R}^{n}$: close price of each stock at current time-step $t$;
• $X_t$: augmented features of each stock at current time-step $t$, output by the MSA.
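As an illustration of how these four components could be flattened into a single observation vector for a gym-style environment, the sketch below assumes a simple concatenation layout; build_observation and its argument names are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def build_observation(balance, holdings, close_prices, augmented_features):
    """Flatten the state components [b_t, h_t, p_t, X_t] into one vector.

    `augmented_features` stands for the MSA output X_t; the exact layout used
    in the paper's environment is an assumption here.
    """
    return np.concatenate([
        np.array([balance], dtype=np.float32),          # b_t: available balance
        np.asarray(holdings, dtype=np.float32),         # h_t: shares held per stock
        np.asarray(close_prices, dtype=np.float32),     # p_t: close price per stock
        np.asarray(augmented_features, dtype=np.float32).ravel(),  # X_t
    ])
```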
Figure 1. Overview of the data set.
4.2. Multi-Level Augmented Portfolio-Management Model

Our model consists of two components. In the first section, multi-level augmentation was performed on raw financial data to be fed into the state space, and in the second section, the advantage actor–critic algorithm (A2C) was executed. SwanTrader processes each round in four steps: (1) input of price data and technical indicators; (2) utilizing the MSA to enhance the feature quality and feed the augmented features into the state space; (3) outputting the action, that is, the volume of buy and sell of each stock in the portfolio; and (4) introducing the omega ratio to calculate the reward and update the actor network's trading rules based on the reward. The structure of the SwanTrader is shown in Figure 2.

As depicted in Figure 3, our MSA consists of two steps: extracting robustness information using the SSDAE network and utilizing the LSTM-AE to collect temporal patterns and historical information based on the previous observations. The SSDAE and LSTM-AE networks were trained offline through the training set prior to the trading process, and only encoding layers were used to output informative features in the online trading process.

Table 1. Summary of financial technical indicators.

| Type | Name | Number |
| Moving Averages | Simple Moving Average (SMA), Exponential Moving Average (EMA), Weighted Moving Averages (WMA), Bollinger Bands (BBANDS) | 4 |
| Volatility | Average True Range (ATR), True Range (TRANGE), Ulcer Index (UI) | 3 |
| Trend | Moving Average Convergence Divergence (MACD), Volatility Ratio (VR), Schaff Trend Cycle (STC), Detrended Price Oscillator (DPO), Triple Exponential Average (TRIX), Know Sure Thing (KST) | 6 |
| Momentum | Relative Strength Index (RSI), Awesome Oscillator (AO), True Strength Index (TSI), Average Directional Index (ADX), Aroon Oscillator (AROON), Money Flow Index (MFI), Momentum (MOM), Rate of Change (ROC), Williams %R (WILLR), Stochastic (STOCH), Elder Ray Index (ERI) | 11 |
| Volume | On-Balance Volume (OBV), Force Index (FI), Accumulation/Distribution (AD), Ease of Movement (EM), Chaikin Money Flow (CMF), Volume Price Trend (VPT), Negative Volume Index (NVI) | 7 |
| Total | - | 31 |
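For illustration, a few of the indicators in Table 1 can be computed directly with pandas; the hand-rolled formulas and window lengths below are common defaults and are not values reported in the paper.

```python
import pandas as pd

def basic_indicators(close: pd.Series) -> pd.DataFrame:
    """Compute three of the Table 1 indicators (SMA, EMA, RSI) from close prices."""
    sma = close.rolling(window=20).mean()                    # Simple Moving Average
    ema = close.ewm(span=20, adjust=False).mean()            # Exponential Moving Average
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()     # average gain
    loss = (-delta.clip(upper=0)).rolling(window=14).mean()  # average loss
    rsi = 100 - 100 / (1 + gain / loss)                      # Relative Strength Index
    return pd.DataFrame({"SMA": sma, "EMA": ema, "RSI": rsi})
```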
Figure 2. Overview of proposed model.
Figure 3. Structure of the MSA network.

4.3. Research Models

4.3.1. Stacked Sparse Denoising Autoencoder (SSDAE)

We utilized a stacked sparse denoising autoencoder (SSDAE) network to extract latent information to address the issues of high volatility, low signal-to-noise ratio, and external shock of financial data, as well as the dimensional disaster caused by the high-dimension state space, which included OHCLV (open, high, close, low, volume) data and a vast number of technical indicators. Our inspiration is derived from past works: Reference [25] employed stacked denoising autoencoders (SDAEs), and reference [22] introduced a constrained stacked autoencoder for dimension reduction. In contrast, we inserted sparse terms, tested a range of network designs, and evaluated a structure for feature augmentation with superior performance under our model. The structure of the SSDAE network is depicted in Figure 3. $x_i$ indicates the input data of the $i$th node, and $h_k^{(i)}$ represents the input data of the $k$th node of the $i$th hidden layer. The arrow in the network diagram reflects the weight of the connection between two neighboring layer nodes. The autoencoder (AE) [52] is a type of unsupervised machine learning that seeks to duplicate the input information using the hidden layer's learned representation by setting the output values to match the input values. The AE can be viewed as a neural network with three layers. Given the raw training data set $D = \{x_i\}$, where $i = 1, 2, \ldots, m$, $m$ represents the number of training samples. Following the weighted mapping algorithm, the feature vector of the hidden layer is $h = \{h_1, h_2, \cdots, h_n\}$:

$$h = f_{\theta}(x) = s(Wx + b) \qquad (4)$$

where $\theta = \{W', b', W, b\}$ is the set of weight matrices and bias vectors, and $s(t) = (1 + \exp(-t))^{-1}$ is the sigmoid function. Afterward, the hidden layer is inversely mapped, and the reconstructed output vector $z = \{z_1, z_2, \cdots, z_n\}$ is obtained by means of Equation (5); this process is referred to as decoding.

$$z = g_{\theta'}(h) = s(W'h + b') \qquad (5)$$
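A minimal PyTorch sketch of one SSDAE building block, combining the sigmoid encoding and decoding of Equations (4) and (5) with the input corruption and the reconstruction, KL-sparsity, and Frobenius-norm terms introduced below in Equations (7) and (9). Layer sizes, the noise level, and the hyper-parameters rho, beta, and lam are illustrative assumptions, not the values used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingSparseAE(nn.Module):
    """One layer-wise building block of a stacked sparse denoising autoencoder."""

    def __init__(self, n_in, n_hidden, noise_std=0.1):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_in)
        self.noise_std = noise_std

    def forward(self, x):
        x_tilde = x + self.noise_std * torch.randn_like(x)  # corrupt the input
        h = torch.sigmoid(self.encoder(x_tilde))            # encoding, Eq. (4)
        z = torch.sigmoid(self.decoder(h))                  # decoding, Eq. (5)
        return z, h

def sdae_loss(x, z, h, W_enc, W_dec, rho=0.05, beta=1e-3, lam=1e-4):
    """Reconstruction + KL sparsity + Frobenius-norm regularization, as in Eq. (9)."""
    recon = 0.5 * F.mse_loss(z, x, reduction="mean")
    rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)           # average hidden activation
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    reg = 0.5 * lam * (W_enc.pow(2).sum() + W_dec.pow(2).sum())
    return recon + beta * kl + reg
```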
The AE depends solely on minimizing reconstruction error for network training, which
may result in the network simply learning a replica of the original input. The denoising
autoencoder (DAE) [53] is based on the AE and employs the noisy input to improve
resilience. With the DAE, the expected potential representation has certain stability and
resistance in the case of input corruption, so that trading tasks can be conducted more
effectively. The initial input $x$ is corrupted into $\tilde{x}$ via a stochastic mapping, and the corrupted input $\tilde{x}$ is then mapped to a hidden representation; the cost function of the DAE is as follows:

$$J_{DAE}(W, b) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left\| z_{W,b}\big(\tilde{x}^{(i)}\big) - x^{(i)} \right\|_{2}^{2} + \frac{\lambda}{2}\left(\| W \|_{F}^{2} + \| W' \|_{F}^{2}\right) \qquad (7)$$

Adding a sparsity penalty, expressed as the KL divergence between the target sparsity level $\rho$ and the average hidden-unit activation $\tilde{\rho}_j$, gives the cost function of the sparse denoising autoencoder:

$$J_{SDAE}(W, b) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left\| z_{W,b}\big(\tilde{x}^{(i)}\big) - x^{(i)} \right\|_{2}^{2} + \beta \sum_{j=1}^{s} KL\big(\rho \,\big\|\, \tilde{\rho}_j\big) + \frac{\lambda}{2}\left(\| W \|_{F}^{2} + \| W' \|_{F}^{2}\right) \qquad (9)$$

where $W_l$ is the weight of the $l$th SSDAE layer. Since the pretrained weights would be used as regularization in our network, the sparsity term was deleted.

Figure: structure of the LSTM-AE — an LSTM encoder maps the rolling data of each asset to a latent code, a RepeatVector repeats it over the time-steps, and an LSTM decoder with a TimeDistributed layer reconstructs the data of each asset.
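A sketch of that encoder–decoder layout in PyTorch; the latent dimension and layer sizes are placeholders rather than the topology used in the paper.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """LSTM encoder -> repeat latent vector over the window -> LSTM decoder ->
    time-distributed linear output, mirroring the figure above."""

    def __init__(self, n_features, latent_dim=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, latent_dim, batch_first=True)
        self.decoder = nn.LSTM(latent_dim, latent_dim, batch_first=True)
        self.output_layer = nn.Linear(latent_dim, n_features)  # applied per time-step

    def forward(self, x):                        # x: (batch, window, n_features)
        _, (h_n, _) = self.encoder(x)            # final hidden state as latent code
        latent = h_n[-1]                         # (batch, latent_dim)
        repeated = latent.unsqueeze(1).repeat(1, x.size(1), 1)   # RepeatVector
        decoded, _ = self.decoder(repeated)
        return self.output_layer(decoded), latent  # reconstruction and latent features
```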
An LSTM unit is depicted in Figure 6. The LSTM cell structure controls the update and utilization of historical information mainly through three gates, thereby overcoming the vanishing gradient problem: input gate $i_t$ regulates the reading of fresh data, forget gate $f_t$ controls the erasure of data, and output gate $O_t$ controls the transmission of data. The following equations illustrate the LSTM operation:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \qquad (11)$$

Figure 6. LSTM unit structure.

4.3.3. Optimization Algorithms—Advantage Actor–Critic (A2C)

In this study, we employed one of the actor–critic algorithms, the advantage actor–critic (A2C), which prior research has shown to outperform other DRL algorithms in quantitative management tasks [44,58,59]. We implemented it using stable-baselines3 [46]. The A2C algorithm [60] presented by OpenAI is a variant of the asynchronous advantage actor–critic (A3C) algorithm [61]. A2C reduces the variance of the policy gradient by utilizing an advantage function. We updated our policy based on the objective function:

$$\nabla_{\theta} J(\theta) = \mathbb{E}\left[\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t)\right] \qquad (17)$$

where $\pi_{\theta}(a_t \mid s_t)$ denotes the policy network parameterized by $\theta$, and $A(s_t, a_t)$ indicates the advantage function defined as follows:

$$A(s_t, a_t) = q_{\pi}(s_t, a_t) - V_{\pi}(s_t) = r(s_t, a_t, s_{t+1}) + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t) \qquad (18)$$

where $q_{\pi}(s, a)$ represents the expected reward at state $s_t$ when taking action $a_t$, and $V_{\pi}(s_t)$ is the value function.

The OpenAI gym [62] serves as the foundation for our portfolio trading environment. A2C is employed by using stable-baselines3 [46]. The size of the input sliding window is set to 20, the transaction cost rate λ is set to 0.2%, the time steps for each update are set to 5, the maximum value of gradient clipping is set to 0.5, and the learning rate is set to 0.0001. To prevent insufficient learning, such as local minimum and overfitting, gradient clipping is used, and our A2C actor network employs the Adam optimizer, which has been demonstrated in trials to enhance the training stability and convergence speed of the DRL model [63].
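A sketch of this A2C configuration using the stable-baselines3 API; env stands for the gym-style trading environment (where the 20-step sliding window and 0.2% transaction cost would be configured), and any setting not listed above keeps the library default.

```python
from stable_baselines3 import A2C

def make_agent(env):
    """Configure A2C as described above; `env` is any gym-compatible trading environment."""
    return A2C(
        "MlpPolicy",
        env,
        n_steps=5,           # time steps per update
        learning_rate=1e-4,  # learning rate
        max_grad_norm=0.5,   # gradient-clipping threshold
        verbose=1,
    )

# Usage (illustrative): agent = make_agent(trading_env); agent.learn(total_timesteps=100_000)
```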
where $k_t$ represents the vector of the stocks' trading share numbers at each step.
In the real world, we commonly use the change rate of the portfolio value to judge profit and loss. However, this type of reward function makes it challenging to provide consistent feedback. We suggest modifying the reward as follows by using the excess trade return:

$$Adj\_Tr_{t}^{s} = \ln\frac{(R_t - d_t) - (R_{t-s} - d_t)}{R_{t-s} - d_t} \qquad (21)$$

where $Adj\_Tr_{t}^{s}$ represents the realized logarithmic rate of excess trade return in a period of time $t$, the length of the period is $s$, and $d_t$ represents the rate of return of the baseline. The Dow Jones industrial average (DJIA) index serves as the reference point in this paper.
Moreover, this paper developed a novel reward function based on the omega ratio.
The probability weighted ratio of return to loss under a specific predicted return level is
known as the omega ratio. The omega ratio is seen to be a stronger performance indicator
than the Sharpe ratio, Sortino ratio, and Calmar ratio since it depends on the distribution of
all returns and thus contains all information regarding risks and returns [32]. The formula
is described as follows:
$$Or_t = \frac{\int_{d_t}^{\infty}\left(1 - F(x)\right) dx}{\int_{-\infty}^{d_t} F(x)\, dx} \qquad (22)$$

where $F(x)$ indicates the cumulative distribution function of $Adj\_Tr_{t}^{1}$ in an $s$-size period ($Adj\_Tr_{t}^{1}$ is the daily return).
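With an empirical distribution over a window of returns, the two integrals in Equation (22) reduce to sums of gains above and losses below the threshold, so the reward can be sketched as follows (a minimal implementation, not the paper's code):

```python
import numpy as np

def omega_ratio(returns, threshold):
    """Empirical omega ratio (Eq. 22): probability-weighted gains above the
    threshold divided by probability-weighted losses below it.
    `returns` is a window of daily (excess) returns; `threshold` is the benchmark rate d_t."""
    returns = np.asarray(returns, dtype=float)
    gains = np.clip(returns - threshold, 0.0, None).sum()   # corresponds to the integral of 1 - F(x)
    losses = np.clip(threshold - returns, 0.0, None).sum()  # corresponds to the integral of F(x)
    return gains / losses if losses > 0 else np.inf
```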
64-128-64-48-36]. As for the LSTM network parameters, such as the number of hidden
layers and the number of LSTM units for each layer, the network topology is achieved as
[128-32-4-32-128], and 100 LSTM units for each layer are selected as the network parameters
in this paper. The mean square error (MSE) between the true target value and the estimated
target value is employed as the loss function. The ReLU is set as the activation function.
The Adam optimizer is applied to update the weight and bias values and mitigate the gradient explosion problem. Training is run for up to 120 epochs. For the above two AE networks, to prevent inadequate learning, gradient clipping is implemented through grad_clip_norm in PyTorch with the value set to 10, and the L2 regularization term of the weights (REG_LAMBDA) is set to 0.01.
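A sketch of this training setup in PyTorch (MSE loss, Adam with L2 weight regularization, gradient clipping at norm 10, up to 120 epochs); the learning rate and the data loader are assumptions, as they are not reported above.

```python
import torch
import torch.nn as nn

def train_autoencoder(model, loader, epochs=120, lr=1e-3,
                      reg_lambda=0.01, grad_clip=10.0):
    """Train an autoencoder with the settings described in the text."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=reg_lambda)   # L2 regularization (REG_LAMBDA)
    for epoch in range(epochs):
        for batch in loader:                                # batch: input feature tensor
            optimizer.zero_grad()
            output = model(batch)
            reconstruction = output[0] if isinstance(output, tuple) else output
            loss = criterion(reconstruction, batch)         # MSE reconstruction loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)  # clip at norm 10
            optimizer.step()
    return model
```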
5.2. Metrics
Six metrics are used in our experiments, which can be divided into three types: (1)
profit metric, including accumulative rate of return (ARR) and alpha; (2) risk metric,
including maximum drawdown (MDD) and beta; and (3) risk-profit metric, including
Sharpe ratio (SR) and Calmar ratio (CMR). The ARR is a common metric used to evaluate
strategic profits, and the greater the cumulative return, the greater the strategy’s profitability.
The SR reflects the additional amount of return an investor receives for each unit of risk. The
MDD is a metric that assesses the potential loss by measuring the maximum change from the highest point to the lowest. The CMR [64] is used to measure the risk by using the concept of MDD,
and the higher the Calmar ratio, the better it performed on a risk-adjusted basis. Alpha,
commonly regarded as the active return, compares the performance of an investment to a
market index or benchmark that is considered to represent the market’s overall movement.
The calculation process of the alpha value is shown in:
$$Alpha = R_p - \left[ R_f + \beta_p \left( R_m - R_f \right) \right] \qquad (23)$$

where $R_p$ is the yield of the model, $\beta_p$ is the beta value of the model, and $R_m$ is the yield of the benchmark strategy. Beta is widely employed as a risk-reward statistic that enables investors to assess how much risk they are willing to assume in exchange for a certain return. The formula is illustrated below:
$$Beta = \frac{Cov\left(R_p, R_m\right)}{\sigma_m^{2}} \qquad (24)$$

where $Cov$ is the covariance, and $\sigma_m^{2}$ is the variance of the benchmark strategy.
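The six metrics can be computed from a daily portfolio-value series and benchmark returns roughly as follows; annualization with 252 trading days and per-period alpha are conventions assumed here rather than values taken from the paper.

```python
import numpy as np

def performance_metrics(values, bench_returns, risk_free=0.0, periods=252):
    """Compute ARR, MDD, SR, CMR, alpha, and beta from a daily value series.
    `bench_returns` must have the same length as the daily returns of `values`."""
    values = np.asarray(values, dtype=float)
    bench_returns = np.asarray(bench_returns, dtype=float)
    returns = values[1:] / values[:-1] - 1.0
    arr = values[-1] / values[0] - 1.0                         # accumulative rate of return
    running_max = np.maximum.accumulate(values)
    mdd = ((values - running_max) / running_max).min()         # maximum drawdown (negative)
    sharpe = np.sqrt(periods) * (returns.mean() - risk_free) / returns.std()
    calmar = (periods * returns.mean()) / abs(mdd)             # annualized return over |MDD|
    beta = np.cov(returns, bench_returns)[0, 1] / np.var(bench_returns)          # Eq. (24)
    alpha = returns.mean() - (risk_free + beta * (bench_returns.mean() - risk_free))  # Eq. (23), per period
    return {"ARR": arr, "MDD": mdd, "SR": sharpe, "CMR": calmar,
            "Alpha": alpha, "Beta": beta}
```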
5.3. Baselines
The comparison models described below are trained and traded under the same
trading environment, trading rules, and parameters.
• EW (Equal weight baseline): a simplistic baseline that allocates equal weight to all
portfolio assets;
• Anticor (Anti-Correlation) [65]: a heuristic technique for online portfolio selection that
uses the consistency of positive lagged cross-correlation and negative autocorrelation
to change portfolio weights according to the mean regression principle;
• CRP (Constant rebalanced portfolio) [66]: an investing strategy that maintains the
same wealth distribution among a collection of assets on a daily basis, that is, the
fraction of total wealth represented by a particular asset remains constant at the start
of each day;
• CORN (CORrelation-driven nonparametric learning) [67]: a model for correlation-
driven nonparametric learning that combines correlation sample selection with loga-
rithmic optimal utility function;
• ONS (Online Newton step algorithm) [68]: an online portfolio selection model based on the Newton method, which requires relatively weak assumptions;
Figure 7. Accumulative return curves of different models.
Figure 9. Comparison with different versions of network.
Figure 10. Comparison with different reward function.
Table 4. Effectiveness of Omega Ratio.

| Model | ARR (%) | MDD (%) | SR | CMR | Alpha | Beta |
| ST-Sharpe | 80.7 | −26.4 | 1.17 | 1.14 | 0.14 | 0.84 |
| ST-Sortino | 57.8 | −19.5 | 1.21 | 1.15 | 0.12 | 0.56 |
| SwanTrader | 87.5 | −15.8 | 1.52 | 2.04 | 0.19 | 0.65 |

7. Conclusions

This paper built a Markov decision process model-based deep reinforcement learning model for quantitative portfolio management in times of COVID-19. In detail, by using OHCLV data and financial technical indicators for each asset, this paper employed a stacked sparse denoising autoencoder (SSDAE) model and a long–short-term-memory-based autoencoder (LSTM-AE) model to analyze and capture the temporal patterns and robustness information based on market observations; A2C was used to optimize and output sequence decisions. Additionally, using two simplified structures of our suggested model and three types of reward function, namely Sharpe ratio, Sortino ratio, and omega ratio, we explored the role of these settings in our suggested model. This paper also compares our model with four standard machine learning models (Anticor, CRP, CORN, ONS) and two state-of-the-art reinforcement learning models (ES [69] and PCA and DWT [23]). According to the back-test results during the COVID-19 pandemic, we can conclude that:
(1) The DRL-based portfolio-management model outperforms other standard machine learning-based models in terms of Sharpe ratio, Sortino ratio, and MDD, which means that the Markov decision process model is more suitable than supervised learning because it allows the tasks of "prediction" and "portfolio construction" to be combined in one integration step.
(2) By introducing deep learning into the Markov decision process model and adjusting network structural parameters, the suggested model has a positive effect on balancing risk and profit. This is the same as the conclusion of Li [23] and Ren [72].
(3) Through the ablation study, it can be seen that the SSDAE model has a significant effect on risk control, especially on the volatility and drawdown of the model; the LSTM-AE model has a significant effect in capturing market trends, but it will also increase losses while increasing profits. By integrating the two models, we can obtain a better balance between risk and return.
(4) We also found that the choice of reward function affects the risk preference of the model. By comparing the trading returns, Sharpe ratio, Sortino ratio, and omega ratio, we found that a more accurate assessment of the risk penalty leads the model to output more prudent actions.
To conclude, this paper extends the Markov process model literature by serving
as an attempt toward developing a quantitative portfolio-management model using a
deep-learning-based reinforcement learning method. The result indicates that inputting
augmented state space features improves the performance of the proposed portfolio-
management model. The advantages of the suggested model are its scalability and ap-
plicability. Furthermore, the model’s consistent performance on a long time-span dataset
indicates that it is generalizable.
Nonetheless, there are certain restrictions in this study. To examine the effects of the
proposed model on portfolio-management performance, features from external financial
environments such as social news and other types of macro data should be exploited, and
the interpretability of the feature augmentation approach requires more discussion.
Our study indicates the viability of application of the deep learning models in portfolio
management. More deep learning models are also promising for further improving the
strategy performance of quantitative portfolio management.
The results of our suggested model for investors can assist in reducing the risk of
investment loss as well as help them to make sound decisions. In future research, in
consideration of correlations between financial assets, it is possible to extend the proposed
model to exploit cross-asset dependency information and use more comprehensive risk
measurement tools, such as value-at-risk [73] and conditional-value-at-risk [74].
References
1. Wolf, P.; Hubschneider, C.; Weber, M.; Bauer, A.; Härtl, J.; Dürr, F.; Zöllner, J.M. Learning How to Drive in a Real World Simulation
with Deep Q-Networks. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June
2017; IEEE: Piscataway, NJ, USA, 2017; pp. 244–250.
2. Ye, D.; Liu, Z.; Sun, M.; Shi, B.; Zhao, P.; Wu, H.; Yu, H.; Yang, S.; Wu, X.; Guo, Q. Mastering Complex Control in Moba Games
with Deep Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12
February 2020; Volume 34, pp. 6672–6679.
3. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al.
Mastering the Game of Go without Human Knowledge. Nature 2017, 550, 354–359. [CrossRef] [PubMed]
4. Yu, Z.; Machado, P.; Zahid, A.; Abdulghani, A.M.; Dashtipour, K.; Heidari, H.; Imran, M.A.; Abbasi, Q.H. Energy and Performance
Trade-off Optimization in Heterogeneous Computing via Reinforcement Learning. Electronics 2020, 9, 1812. [CrossRef]
5. Wang, R.; Wei, H.; An, B.; Feng, Z.; Yao, J. Commission Fee Is Not Enough: A Hierarchical Reinforced Framework for Portfolio
Management. arXiv 2020, arXiv:2012.12620.
6. Jiang, Z.; Liang, J. Cryptocurrency Portfolio Management with Deep Reinforcement Learning. In Proceedings of the 2017
Intelligent Systems Conference (IntelliSys), London, UK, 7–8 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 905–913.
7. Liang, Q.; Zhu, M.; Zheng, X.; Wang, Y. An Adaptive News-Driven Method for CVaR-Sensitive Online Portfolio Selection
in Non-Stationary Financial Markets. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence,
International Joint Conferences on Artificial Intelligence Organization, Montreal, QC, Canada, 19–26 August 2021; pp. 2708–2715.
8. Yang, H.; Liu, X.-Y.; Zhong, S.; Walid, A. Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy. In
Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2020; pp. 1–8.
9. Chen, Y.-F.; Huang, S.-H. Sentiment-Influenced Trading System Based on Multimodal Deep Reinforcement Learning. Appl. Soft
Comput. 2021, 112, 107788. [CrossRef]
10. Liu, Y.; Liu, Q.; Zhao, H.; Pan, Z.; Liu, C. Adaptive Quantitative Trading: An Imitative Deep Reinforcement Learning Approach.
AAAI 2020, 34, 2128–2135. [CrossRef]
11. Lu, D.W. Agent Inspired Trading Using Recurrent Reinforcement Learning and LSTM Neural Networks. arXiv 2017,
arXiv:1707.07338.
12. Verleysen, M.; François, D. The Curse of Dimensionality in Data Mining and Time Series Prediction. In Proceedings of the
International Work-Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2005; pp. 758–770.
13. Betancourt, C.; Chen, W.-H. Deep Reinforcement Learning for Portfolio Management of Markets with a Dynamic Number of
Assets. Expert Syst. Appl. 2021, 164, 114002. [CrossRef]
14. Huang, Z.; Tanaka, F. MSPM: A Modularized and Scalable Multi-Agent Reinforcement Learning-Based System for Financial
Portfolio Management. PLoS ONE 2022, 17, e0263689. [CrossRef]
15. Park, H.; Sim, M.K.; Choi, D.G. An Intelligent Financial Portfolio Trading Strategy Using Deep Q-Learning. Expert Syst. Appl.
2020, 158, 113573. [CrossRef]
16. Théate, T.; Ernst, D. An Application of Deep Reinforcement Learning to Algorithmic Trading. Expert Syst. Appl. 2021, 173, 114632.
[CrossRef]
17. Meng, Q.; Catchpoole, D.; Skillicorn, D.; Kennedy, P.J. Relational Autoencoder for Feature Extraction. In Proceedings of the 2017
International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 364–371.
18. Yashaswi, K. Deep Reinforcement Learning for Portfolio Optimization Using Latent Feature State Space (LFSS) Module. 2021.
Available online: https://arxiv.org/abs/2102.06233 (accessed on 7 August 2022).
19. Jang, J.-G.; Choi, D.; Jung, J.; Kang, U. Zoom-Svd: Fast and Memory Efficient Method for Extracting Key Patterns in an Arbitrary
Time Range. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino Italy,
22–26 October 2018; pp. 1083–1092.
20. Taylor, G.W.; Hinton, G.E. Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style. In Proceedings of the
26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 1025–1032.
21. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
[CrossRef] [PubMed]
22. Soleymani, F.; Paquet, E. Financial Portfolio Optimization with Online Deep Reinforcement Learning and Restricted Stacked
Autoencoder—DeepBreath. Expert Syst. Appl. 2020, 156, 113456. [CrossRef]
23. Li, L. An Automated Portfolio Trading System with Feature Preprocessing and Recurrent Reinforcement Learning. arXiv 2021,
arXiv:2110.05299.
24. Lee, J.; Koh, H.; Choe, H.J. Learning to Trade in Financial Time Series Using High-Frequency through Wavelet Transformation
and Deep Reinforcement Learning. Appl. Intell. 2021, 51, 6202–6223. [CrossRef]
25. Li, Y.; Zheng, W.; Zheng, Z. Deep Robust Reinforcement Learning for Practical Algorithmic Trading. IEEE Access 2019, 7,
108014–108022. [CrossRef]
26. Wu, M.-E.; Syu, J.-H.; Lin, J.C.-W.; Ho, J.-M. Portfolio Management System in Equity Market Neutral Using Reinforcement
Learning. Appl. Intell. 2021, 51, 8119–8131. [CrossRef]
27. Sharpe, W.F. Mutual Fund Performance. J. Bus. 1966, 39, 119–138. [CrossRef]
28. Wu, X.; Chen, H.; Wang, J.; Troiano, L.; Loia, V.; Fujita, H. Adaptive Stock Trading Strategies with Deep Reinforcement Learning
Methods. Inf. Sci. 2020, 538, 142–158. [CrossRef]
29. Almahdi, S.; Yang, S.Y. An Adaptive Portfolio Trading System: A Risk-Return Portfolio Optimization Using Recurrent Reinforce-
ment Learning with Expected Maximum Drawdown. Expert Syst. Appl. 2017, 87, 267–279. [CrossRef]
30. Grinold, R.C.; Kahn, R.N. Active Portfolio Management: Quantitative Theory and Applications; Probus: Chicago, IL, USA, 1995.
31. Magdon-Ismail, M.; Atiya, A.F. Maximum Drawdown. Risk Mag. 2004, 17, 99–102.
32. Benhamou, E.; Guez, B.; Paris, N. Omega and Sharpe Ratio. arXiv 2019, arXiv:1911.10254. [CrossRef]
33. Bin, L. Goods Tariff vs Digital Services Tax: Transatlantic Financial Market Reactions. Econ. Manag. Financ. Mark. 2022, 17, 9–30.
34. Vătămănescu, E.-M.; Bratianu, C.; Dabija, D.-C.; Popa, S. Capitalizing Online Knowledge Networks: From Individual Knowledge
Acquisition towards Organizational Achievements. J. Knowl. Manag. 2022. [CrossRef]
35. Priem, R. An Exploratory Study on the Impact of the COVID-19 Confinement on the Financial Behavior of Individual Investors.
Econ. Manag. Financ. Mark. 2021, 16, 9–40.
36. Barbu, C.M.; Florea, D.L.; Dabija, D.-C.; Barbu, M.C.R. Customer Experience in Fintech. J. Theor. Appl. Electron. Commer. Res. 2021,
16, 1415–1433. [CrossRef]
37. Fischer, T.G. Reinforcement Learning in Financial Markets—A Survey; FAU Discussion Papers in Economics. 2018. Available
online: https://www.econstor.eu/handle/10419/183139 (accessed on 7 August 2022).
38. Chen, L.; Gao, Q. Application of Deep Reinforcement Learning on Automated Stock Trading. In Proceedings of the 2019 IEEE
10th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 18–20 October 2019; pp.
29–33.
39. Dang, Q.-V. Reinforcement Learning in Stock Trading. In Proceedings of the International Conference on Computer Science,
Applied Mathematics and Applications, Hanoi, Vietnam, 19–20 December 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp.
311–322.
40. Jeong, G.; Kim, H.Y. Improving Financial Trading Decisions Using Deep Q-Learning: Predicting the Number of Shares, Action
Strategies, and Transfer Learning. Expert Syst. Appl. 2019, 117, 125–138. [CrossRef]
41. Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; Dai, Q. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading.
IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 653–664. [CrossRef]
42. Moody, J.; Saffell, M. Learning to Trade via Direct Reinforcement. IEEE Trans. Neural Netw. 2001, 12, 875–889. [CrossRef]
43. Zhang, Z.; Zohren, S.; Roberts, S. Deep Reinforcement Learning for Trading. arXiv 2019, arXiv:1911.10107. [CrossRef]
44. Vishal, M.; Satija, Y.; Babu, B.S. Trading Agent for the Indian Stock Market Scenario Using Actor-Critic Based Reinforcement
Learning. In Proceedings of the 2021 IEEE International Conference on Computation System and Information Technology for
Sustainable Solutions (CSITSS), Bangalore, India, 16–18 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–5. Available
online: https://ieeexplore.ieee.org/abstract/document/9683467 (accessed on 7 August 2022).
45. Pretorius, R.; van Zyl, T. Deep Reinforcement Learning and Convex Mean-Variance Optimisation for Portfolio Management 2022.
Available online: https://arxiv.org/abs/2203.11318 (accessed on 5 August 2022).
46. Raffin, A.; Hill, A.; Ernestus, M.; Gleave, A.; Kanervisto, A.; Dormann, N. Stable Baselines3. 2019. Available online: https://www.ai4europe.eu/sites/default/files/2021-06/README_5.pdf (accessed on 7 August 2022).
47. Bakhti, Y.; Fezza, S.A.; Hamidouche, W.; Déforges, O. DDSA: A Defense against Adversarial Attacks Using Deep Denoising
Sparse Autoencoder. IEEE Access 2019, 7, 160397–160407. [CrossRef]
48. Bao, W.; Yue, J.; Rao, Y. A Deep Learning Framework for Financial Time Series Using Stacked Autoencoders and Long-Short Term
Memory. PLoS ONE 2017, 12, e0180944. [CrossRef] [PubMed]
49. Jung, G.; Choi, S.-Y. Forecasting Foreign Exchange Volatility Using Deep Learning Autoencoder-LSTM Techniques. Complexity
2021, 2021, 6647534. [CrossRef]
50. Soleymani, F.; Paquet, E. Deep Graph Convolutional Reinforcement Learning for Financial Portfolio Management–DeepPocket.
Expert Syst. Appl. 2021, 182, 115127. [CrossRef]
51. Qiu, Y.; Liu, R.; Lee, R.S.T. The Design and Implementation of Quantum Finance-Based Hybrid Deep Reinforcement Learning
Portfolio Investment System. J. Phys. Conf. Ser. 2021, 1828, 012011. [CrossRef]
52. Hinton, G.E.; Osindero, S.; Teh, Y.-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554.
[CrossRef]
53. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.-A. Extracting and Composing Robust Features with Denoising Autoencoders.
In Proceedings of the 25th International Conference on Machine Learning, New York, NY, USA, 5–9 July 2008; pp. 1096–1103.
54. Graves, A. Long Short-Term Memory. Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg,
Germany, 2012; pp. 37–45.
55. Nelson, D.M.; Pereira, A.C.; De Oliveira, R.A. Stock Market’s Price Movement Prediction with LSTM Neural Networks. In
Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017;
IEEE: Piscataway, NJ, USA, 2017; pp. 1419–1426.
56. Yao, S.; Luo, L.; Peng, H. High-Frequency Stock Trend Forecast Using LSTM Model. In Proceedings of the 2018 13th International
Conference on Computer Science & Education (ICCSE), Colombo, Sri Lanka, 8–11 August 2018; IEEE: Piscataway, NJ, USA, 2018;
pp. 1–4.
57. Zhao, Z.; Rao, R.; Tu, S.; Shi, J. Time-Weighted LSTM Model with Redefined Labeling for Stock Trend Prediction. In Proceedings
of the 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), Boston, MA, USA, 6–8 November
2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1210–1217.
58. Liu, X.-Y.; Yang, H.; Gao, J.; Wang, C.D. FinRL: Deep Reinforcement Learning Framework to Automate Trading in Quantitative
Finance. In Proceedings of the Second ACM International Conference on AI in Finance, New York, NY, USA, 3 November 2021;
pp. 1–9.
59. Yang, H.; Liu, X.-Y.; Wu, Q. A Practical Machine Learning Approach for Dynamic Stock Recommendation. In Proceedings of
the 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE
International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), New York, NY, USA, 1–3 August 2018;
IEEE: Piscataway, NJ, USA, 2018; pp. 1693–1697.
60. Zhang, Y.; Clavera, I.; Tsai, B.; Abbeel, P. Asynchronous Methods for Model-Based Reinforcement Learning. arXiv 2019,
arXiv:1910.12453.
61. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep
Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA,
20–22 June 2016; pp. 1928–1937.
62. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai Gym. arXiv 2016,
arXiv:1606.01540.
63. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
64. Young, T.W. Calmar Ratio: A Smoother Tool. Futures 1991, 20, 40.
65. Borodin, A.; El-Yaniv, R.; Gogan, V. Can We Learn to Beat the Best Stock. JAIR 2004, 21, 579–594. [CrossRef]
66. Cover, T.M. Universal Portfolios. In The Kelly Capital Growth Investment Criterion; World Scientific Handbook in Financial
Economics Series; World Scientific: Singapore, 2011; Volume 3, pp. 181–209. ISBN 978-981-4293-49-5.
67. Li, B.; Hoi, S.C.H.; Gopalkrishnan, V. CORN: Correlation-Driven Nonparametric Learning Approach for Portfolio Selection. ACM
Trans. Intell. Syst. Technol. 2011, 2, 1–29. [CrossRef]
68. Agarwal, A.; Hazan, E.; Kale, S.; Schapire, R.E. Algorithms for Portfolio Management Based on the Newton Method. In
Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 25–29 June 2006; pp. 9–16.
69. Yang, H.; Liu, X.-Y.; Zhong, S.; Walid, A. Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy.
SSRN J. 2020. [CrossRef]
70. Yao, W.; Ren, X.; Su, J. An Inception Network with Bottleneck Attention Module for Deep Reinforcement Learning Framework
in Financial Portfolio Management. In Proceedings of the 2022 7th International Conference on Big Data Analytics (ICBDA),
Guangzhou, China, 4 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 310–316.
71. Ye, Y.; Pei, H.; Wang, B.; Chen, P.-Y.; Zhu, Y.; Xiao, J.; Li, B. Reinforcement-Learning Based Portfolio Management with Augmented
Asset Movement Prediction States. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12
February 2020; Volume 34, pp. 1112–1119.
72. Ren, X.; Jiang, Z.; Su, J. The Use of Features to Enhance the Capability of Deep Reinforcement Learning for Investment Portfolio
Management. In Proceedings of the 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA), Xiamen, China, 5
March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 44–50.
73. Jorion, P. Value at Risk. 2000. Available online: http://bear.warrington.ufl.edu/aitsahlia/Financial_Risk_Management.pdf
(accessed on 7 August 2022).
74. Rockafellar, R.T.; Uryasev, S. Conditional Value-at-Risk for General Loss Distributions. J. Bank. Financ. 2002, 26, 1443–1471.
[CrossRef]