
CN114049223A - A Market Making Approach Based on Teacher-Student Model and Reinforcement Learning - Google Patents

A Market Making Approach Based on Teacher-Student Model and Reinforcement Learning

Info

Publication number
CN114049223A
Authority
CN
China
Prior art keywords
price
agent
teacher
market
time point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111417906.XA
Other languages
Chinese (zh)
Inventor
潘炎
戴梓煜
印鉴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111417906.XA priority Critical patent/CN114049223A/en
Publication of CN114049223A publication Critical patent/CN114049223A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06Asset management; Financial planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention provides a market-making method based on a teacher-student model and reinforcement learning. The method can make full use of imperfect market information to seize suitable market-making opportunities and fully considers the risk brought by position accumulation, so as to earn the bid-ask spread and promote market liquidity. The method first preprocesses the price and volume information of the market, then computes the corresponding technical indicators from perfect market information and feeds them to a reinforcement learning algorithm to train a teacher agent. The obtained teacher agent is then used to guide the training process of a student agent: the corresponding technical indicators are computed from imperfect market information and fed to the reinforcement learning algorithm, and a student agent is trained under the guidance of the teacher agent. The resulting student agent can decide whether to make a market based on the historical information of the market. The agent is then backtested and the results are output.

Description

Market making method based on teacher-student model and reinforcement learning
Technical Field
The invention relates to the field of information science, and in particular to a market-making method based on a teacher-student model and reinforcement learning.
Background
At present, with economic development, people have more and more surplus wealth on hand, and more and more of them enter the financial field to invest in order to preserve and increase the value of that wealth. In finance, traditional investment relies on methods such as fundamental market research and the individual subjective experience of the investor, which is unfriendly to investors who lack this kind of knowledge; the various strategies of quantitative investment address exactly this problem and have therefore attracted wide attention.
A quantitative investment strategy carries out investment trades by means of mathematical models and computer programs, with the aim of obtaining stable returns. Traditional methods usually compute effective technical indicators from the historical information of the current market and execute the corresponding trading actions according to those indicators; common examples are stock multi-factor strategies, futures CTA strategies and arbitrage strategies. With the development of big data and machine learning, combining traditional methods with machine learning makes it possible to extract market regularities more efficiently and thus invest more stably; neural networks such as RNNs and LSTMs are commonly used to analyze the market and make decisions.
As reinforcement learning has achieved good results in fields such as games and automation, it has also been introduced into quantitative trading: reinforcement learning learns the trading environment information in time-series form to obtain an experienced trading agent, and it weighs profit and risk together, making quantitative investment more effective. Agents are typically trained with reinforcement learning algorithms such as Policy Gradient (PG), DQN (Deep Q-Network), AC (Actor-Critic) and their variants.
Market making is a common quantitative trading method: a buy order and a sell order are placed in the market at the same time, and if both orders are filled the position does not change and the bid-ask spread is earned. Market making therefore not only looks for suitable times to trade, but also lets both sides of the market trade, giving the market better liquidity.
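As a concrete numerical illustration of the spread profit described above (the mid price, tick size and commission below are hypothetical values, not taken from the patent):

# Hypothetical one-round market-making trade: quote one tick around the mid price.
mid_price = 4000.0      # assumed mid price of the futures contract
price_diff = 0.2        # assumed minimum price increment (tick size)
transaction_fee = 0.05  # assumed commission per filled order

buy_price = mid_price - price_diff    # our bid
sell_price = mid_price + price_diff   # our ask

# If both orders are filled, the net position is zero and the profit is the
# quoted spread minus the commission paid on the two fills.
profit = (sell_price - buy_price) - 2 * transaction_fee
print(round(profit, 2))  # 0.3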
The prior art includes a patent for an online market-making method based on the epsilon-insensitive logarithmic loss, which constructs several candidate sub-strategies from minute-level OHLC futures data and proposes that the weighted return of the candidate sub-strategies has a linear relationship with the return of a theoretical optimal strategy; that method takes the epsilon-insensitive logarithmic loss as the loss function, uses the Follow the Regularized Leader online learning algorithm to dynamically update the weights of the candidate sub-strategies, and finally computes the target position of the main strategy from the weights and adjusts the position of the main strategy to equal the target position. That invention uses the weighted return of a theoretical optimal strategy as the ground truth of the linear relationship, which makes the optimization target of online learning clearer, and its epsilon-insensitive logarithmic loss fits real market conditions better; after backtesting on real data, the proposed strategy obtains good returns and stability and is highly practical. However, that patent does not involve any technique for using imperfect market information to seize suitable market-making opportunities while fully considering the risk brought by position accumulation, so as to earn the bid-ask spread and promote market liquidity.
Disclosure of Invention
The invention provides a market-making method based on a teacher-student model and reinforcement learning, which can make full use of imperfect market information to seize suitable market-making opportunities and fully considers the risk brought by position accumulation, so as to earn the bid-ask spread and promote market liquidity.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a market-making method based on a teacher-student model and reinforcement learning comprises the following steps:
s1: collecting historical market price data of the target futures variety;
s2: carrying out data cleaning work;
s3: calculating various data required by the teacher agent for training the teacher agent according to the data;
s4: training a teacher agent by using a reinforcement learning algorithm;
s5: calculating various data required by a student agent for training the student agent according to the data;
s6: training student agents by using a teacher agent to guide a reinforcement learning algorithm;
s7: perform a backtest.
Further, in step S1, historical market price data of the target futures variety at a specific time level is collected; the data table specifically contains, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t and the trading volume volume_t.
Further, the specific process of step S2 is:
s21: removing redundant empty rows in the data table to make the rows continuous;
s22: carrying out deduplication on the specific field to ensure that no repeated data rows exist at the same time point t;
s23: data lines where missing values and outliers exist are removed.
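A minimal sketch of the cleaning steps S21-S23 above using pandas; the column names and the outlier rule (prices must be positive and the high must not be below the low) are assumptions made for illustration rather than details given in the patent:

import pandas as pd

def clean_market_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a raw OHLCV table with one row per time point t (steps S21-S23)."""
    # S21: drop fully empty rows so the remaining rows are contiguous.
    df = df.dropna(how="all")
    # S22: de-duplicate on the time field so each time point t appears only once.
    df = df.drop_duplicates(subset="datetime", keep="first")
    # S23: drop rows with missing values or abnormal prices
    # (assumed rule: all prices positive and high >= low).
    df = df.dropna(subset=["open", "high", "low", "close", "volume"])
    valid = (df[["open", "high", "low", "close"]] > 0).all(axis=1) & (df["high"] >= df["low"])
    return df[valid].sort_values("datetime").reset_index(drop=True)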
Further, the specific process of step S3 is:
s31: define a lookback window w for each time point t, representing the length of historical data considered at time point t, and define a future-trend window w' for each time point t, representing the horizon over which the future price trend at time point t is evaluated;
s32: calculate the corresponding financial technical indicators of the teacher part from the collected data. The indicators used are: the w-period exponential moving average EMA_t^w, the w-period commodity channel index CCI_t^w, the w-period absolute price oscillator APO_t^w, the indicator buy_succ_t of whether a buy order placed at time t can be filled within the next w' periods, and the indicator sell_succ_t of whether a sell order placed at time t can be filled within the next w' periods;
the w-period exponential moving average EMA_t^w is calculated as
EMA_t^w = (2/(w+1))·close_t + (1 − 2/(w+1))·EMA_{t−1}^w;
the w-period commodity channel index CCI_t^w is calculated as
CCI_t^w = (TP − MA)/(0.015·MD),
wherein TP is the average of the highest price, the lowest price and the closing price at time t, MA is the average of the closing price over the window w, and MD is the average absolute deviation of the closing price from MA over the window w;
the w-period absolute price oscillator APO_t^w is the difference between the short-term moving average and the long-term moving average;
s33: normalize the calculated financial technical indicators with a linear (min-max) function:
x' = (x − x_min)/(x_max − x_min);
s34: for each time point t in the data table, calculate the average price over the past n periods, mean_t^n, i.e. a simple moving average;
s35: taking mean_t^n as the mid price, calculate a buy price and a sell price for each time point t: the buy quote buy_price_t is set below the mid price and the sell quote sell_price_t above it, offset by the defined minimum price increment price_diff;
s36: determine whether each time point t in the data table is a liquidation time point, and for every non-liquidation time point record the earliest future liquidation time exit_time_t;
s37: for each time point t in the data table, determine whether the buy order and the sell order can be filled before liquidation, recorded as buy_succ_t and sell_succ_t, where 1 means filled and 0 means not filled;
s38: taking into account that market making stops and positions are flattened before the close of each trading day, calculate the profit profit_t produced by the market-making action at each time point t from buy_succ_t and sell_succ_t (whether the buy order and the sell order can be filled before the future liquidation), the quoted prices and the transaction commission transaction_fee (an illustrative computation of steps s32-s38 is sketched below).
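The following is a compact, non-authoritative sketch of the data preparation in steps s32-s38 using pandas. It assumes the standard textbook definitions of the EMA, CCI and APO, a one-tick quote offset around the moving-average mid price, and a fill test against future highs and lows; the short-window length used for the APO, the tick size and the commission are likewise illustrative assumptions, since the patent gives the exact formulas only as figures.

import pandas as pd

def teacher_features(df, w=30, w_future=10, n=30, price_diff=0.2, transaction_fee=0.05):
    """Indicators and labels of steps s32-s38, one row per time point t."""
    out = pd.DataFrame(index=df.index)

    # s32: technical indicators over the lookback window w.
    out["ema"] = df["close"].ewm(span=w, adjust=False).mean()              # w-period EMA
    tp = (df["high"] + df["low"] + df["close"]) / 3                        # typical price TP
    ma = df["close"].rolling(w).mean()                                     # MA over the window
    md = (df["close"] - ma).abs().rolling(w).mean()                        # mean absolute deviation MD
    out["cci"] = (tp - ma) / (0.015 * md)                                  # commodity channel index
    out["apo"] = df["close"].ewm(span=w // 3, adjust=False).mean() - out["ema"]  # short EMA minus long EMA

    # s33: linear (min-max) normalisation of each indicator.
    for col in ["ema", "cci", "apo"]:
        out[col] = (out[col] - out[col].min()) / (out[col].max() - out[col].min())

    # s34-s35: simple moving average as mid price; quotes one tick around it (assumed offset).
    mid = df["close"].rolling(n).mean()
    out["buy_price"] = mid - price_diff
    out["sell_price"] = mid + price_diff

    # s37: will the quotes be filled within the next w_future bars? (uses future data, teacher only)
    future_low = df["low"].rolling(w_future).min().shift(-w_future)
    future_high = df["high"].rolling(w_future).max().shift(-w_future)
    out["buy_succ"] = (future_low <= out["buy_price"]).astype(int)
    out["sell_succ"] = (future_high >= out["sell_price"]).astype(int)

    # s38: per-step market-making profit, assumed to be the spread when both sides fill,
    # minus a commission on every order that does fill.
    both = out["buy_succ"] * out["sell_succ"]
    fees = (out["buy_succ"] + out["sell_succ"]) * transaction_fee
    out["profit"] = both * (out["sell_price"] - out["buy_price"]) - fees
    return out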
Further, the specific process of step S4 is:
s41: create a virtual market-making environment, load the collected historical market price data of the target futures variety at the chosen time level together with the data calculated in step S3, use the financial technical indicators as the state input of the reinforcement learning algorithm, and implement the corresponding methods for initializing, reading and stepping the environment state;
s42: set the action space of the teacher agent in reinforcement learning; the action of the agent at each time point t is either 0 or 1, where 0 means no trade is made and 1 means a market is made at time point t, i.e. a buy order at buy_price_t and a sell order at sell_price_t are simultaneously submitted to the market environment;
s43: set the reward given by the environment in reinforcement learning, using the profit profit_t produced by the decision at each time point t as the reward;
s44: the reinforcement learning algorithm used to train the teacher agent is the PPO (Proximal Policy Optimization) algorithm, which trains the agent policy from the state, action and reward; the policy representing the agent in the PPO algorithm is a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation function, where
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}).
Further, the specific process of using the PPO algorithm in step S44 is as follows: first the policy parameters φ of the agent are initialized and copied to the policy parameters φ_old used to interact with the environment; the policy with parameters φ_old interacts with the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which the advantage function Â_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, estimated with a value network V and the reward r_t obtained by executing the action in the current state:
Â_t = r_t − V_φ(s_t).
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state pushes the reward towards its maximum, so the policy parameters φ of the teacher agent can be adjusted by gradient descent to minimize the objective function, whose formula is:
L(φ) = L_policy(φ) + λ·L_value(φ)
L_policy(φ) = −E_t[(p_φ(a_t|s_t) / p_φold(a_t|s_t))·Â_t − β·KL[p_φ(·|s_t), p_φold(·|s_t)]]
L_value(φ) = E_t[||V_φ(s_t) − V_t||²]
The policy parameters φ_old used for interaction with the environment are updated with the policy parameters φ after a certain period of time. Because the algorithm keeps one set of policy parameters interacting with the environment while the policy parameters of the teacher agent are being learned, namely φ_old and φ, a Kullback-Leibler divergence, i.e. the term KL[p_φ(·|s_t), p_φold(·|s_t)] in the formula, is introduced into the objective function to measure the difference between the two distributions; this prevents the interaction policy parameters and the teacher agent from differing too much and makes the parameter update smoother (a minimal sketch of this objective is given below).
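A minimal PyTorch-style sketch of steps s44-s45: a multilayer perceptron with tanh activations produces the action distribution and a state value, and the loss combines the importance-ratio-times-advantage surrogate, the KL penalty and the value loss. The hidden size, the coefficients lam and beta, the use of the per-step reward as the value target, and the advantage form r_t − V_φ(s_t) are illustrative assumptions consistent with the description above, not an authoritative implementation of the patent.

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """MLP with tanh activations: action logits (0 = do nothing, 1 = make a market) and a state value."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, 2)   # two discrete actions
        self.v = nn.Linear(hidden, 1)    # value head V(s)

    def forward(self, s):
        h = self.body(s)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)

def ppo_teacher_loss(net, old_net, states, actions, rewards, lam=0.5, beta=0.01):
    """L(phi) = L_policy + lam * L_value, with a KL penalty keeping phi close to phi_old."""
    dist, value = net(states)
    with torch.no_grad():
        old_dist, _ = old_net(states)
    advantage = rewards - value.detach()                      # reward minus predicted value, per the description
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    kl = torch.distributions.kl_divergence(dist, old_dist)    # KL between new and interaction policies
    policy_loss = -(ratio * advantage - beta * kl).mean()     # minimising this maximises ratio * advantage
    value_loss = ((value - rewards) ** 2).mean()              # ||V_phi(s_t) - V_t||^2 with reward as target
    return policy_loss + lam * value_loss

In training, the interaction parameters phi_old would be refreshed from phi after a fixed number of updates, as described above.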
Further, the specific process of step S5 is:
s51: define a lookback window w for each time point t, representing the length of historical data considered at time point t;
s52: calculate the corresponding financial technical indicators of the student part from the collected data. The indicators used are: the w-period exponential moving average EMA_t^w, the w-period commodity channel index CCI_t^w and the w-period absolute price oscillator APO_t^w;
the w-period exponential moving average EMA_t^w is calculated as
EMA_t^w = (2/(w+1))·close_t + (1 − 2/(w+1))·EMA_{t−1}^w;
the w-period commodity channel index CCI_t^w is calculated as
CCI_t^w = (TP − MA)/(0.015·MD),
wherein TP is the average of the highest price, the lowest price and the closing price at time t, MA is the average of the closing price over the window w, and MD is the average absolute deviation of the closing price from MA over the window w;
the w-period absolute price oscillator APO_t^w is the difference between the short-term moving average and the long-term moving average;
s53: normalize the calculated financial technical indicators with a linear (min-max) function:
x' = (x − x_min)/(x_max − x_min);
s54: for each time point t in the data table, calculate the average price over the past n periods, mean_t^n, i.e. a simple moving average;
s55: taking mean_t^n as the mid price, calculate a buy price and a sell price for each time point t: the buy quote buy_price_t is set below the mid price and the sell quote sell_price_t above it, offset by the defined minimum price increment price_diff;
s56: determine whether each time point t in the data table is a liquidation time point, and for every non-liquidation time point record the earliest future liquidation time exit_time_t;
s57: for each time point t in the data table, determine whether the buy order and the sell order can be filled before liquidation, recorded as buy_succ_t and sell_succ_t, where 1 means filled and 0 means not filled;
s58: taking into account that market making stops and positions are flattened before the close of each trading day, calculate the profit profit_t produced by the market-making action at each time point t from buy_succ_t and sell_succ_t (whether the buy order and the sell order can be filled before the future liquidation), the quoted prices and the transaction commission transaction_fee.
Further, the specific process of step S6 is:
s61: create a virtual market-making environment, load the collected historical market price data of the target futures variety at the chosen time level together with the data calculated in step S5, use the financial technical indicators as the state input of the reinforcement learning algorithm, and implement the corresponding methods for initializing, reading and stepping the environment state;
s62: set the action space of the student agent in reinforcement learning; the action of the agent at each time point t is either 0 or 1, where 0 means no trade is made and 1 means a market is made at time point t, i.e. a buy order at buy_price_t and a sell order at sell_price_t are simultaneously submitted to the market environment;
s63: set the reward given by the environment in reinforcement learning, using the profit profit_t produced by the decision at each time point t as the reward;
s64: the reinforcement learning algorithm used to train the student agent is the PPO (Proximal Policy Optimization) algorithm, which trains the agent policy from the state, action and reward; the policy representing the agent in the PPO algorithm is a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation function, where
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}).
Further, the specific process of using the PPO algorithm in step S64 is as follows: first the policy φ of the teacher agent obtained after reinforcement learning training is loaded so that it can guide the subsequent reinforcement learning of the student agent; then the policy parameters θ of the student agent are initialized and copied to the policy parameters θ_old used to interact with the environment; the policy with parameters θ_old interacts with the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which the advantage function Â_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, estimated with a value network V and the reward r_t obtained by executing the action in the current state:
Â_t = r_t − V_θ(s_t).
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state pushes the reward towards its maximum, so the policy parameters θ of the student agent can be adjusted by gradient descent to minimize the objective function. The objective function of the student agent introduces a negative log-likelihood loss between its decision and the decision of the teacher agent: the larger the difference between the student agent's decision probability distribution and the teacher agent's decision probability distribution, the larger the value of this negative log-likelihood loss, so it can also be optimized by gradient descent. In summary, the formula of the objective function is:
L(θ) = L_policy(θ) + λ·L_value(θ) + μ·L_diff(θ)
L_policy(θ) = −E_t[(p_θ(a_t|s_t) / p_θold(a_t|s_t))·Â_t − β·KL[p_θ(·|s_t), p_θold(·|s_t)]]
L_value(θ) = E_t[||V_θ(s_t) − V_t||²]
L_diff(θ) = E_t[−log p_θ(a_t^T | s_t)], where a_t^T is the action chosen by the teacher policy φ in state s_t.
The policy parameters θ_old used for interaction with the environment are updated with the policy parameters θ after a certain period of time. Because the algorithm keeps one set of policy parameters interacting with the environment while the policy parameters of the student agent are being learned, namely θ_old and θ, a Kullback-Leibler divergence, i.e. the term KL[p_θ(·|s_t), p_θold(·|s_t)] in the formula, is introduced into the objective function to measure the difference between the two distributions; this prevents the interaction policy parameters and the student agent from differing too much and makes the parameter update smoother (a sketch of this combined objective is given below).
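A sketch of the student objective in step S64, reusing a policy/value network with the same interface as the PolicyValueNet sketched for the teacher above: the PPO loss is extended with a distillation term L_diff, taken here as the negative log-likelihood of the student policy on the action chosen by the frozen teacher; the coefficients lam, mu and beta are illustrative assumptions.

import torch

def student_loss(student_net, old_student_net, teacher_net,
                 states, actions, rewards, lam=0.5, mu=0.5, beta=0.01):
    """L(theta) = L_policy + lam * L_value + mu * L_diff, as in step S6."""
    dist, value = student_net(states)
    with torch.no_grad():
        old_dist, _ = old_student_net(states)
        teacher_dist, _ = teacher_net(states)             # frozen teacher policy p_phi
        teacher_actions = teacher_dist.probs.argmax(-1)   # the teacher's decision in each state

    advantage = rewards - value.detach()
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    kl = torch.distributions.kl_divergence(dist, old_dist)
    policy_loss = -(ratio * advantage - beta * kl).mean()
    value_loss = ((value - rewards) ** 2).mean()
    # L_diff: negative log-likelihood of the teacher's decision under the student policy.
    # The further the student's action distribution is from the teacher's, the larger this term.
    diff_loss = -dist.log_prob(teacher_actions).mean()
    return policy_loss + lam * value_loss + mu * diff_loss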
Further, the specific process of step S7 is:
s71: load the policy of the student agent obtained after reinforcement learning training;
s72: import the market data used for testing and push them to the policy;
s73: read the data in order at the corresponding time level, match the currently outstanding orders for fills, execute the trading instructions given by the policy to generate new orders, and store these orders for subsequent matching;
s74: output the backtest result according to the final trading results (a bare-bones sketch of this loop is given below).
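A bare-bones sketch of the backtest loop in steps s71-s74: the test data are replayed bar by bar, outstanding quotes are first matched against each bar's traded range, and the trained student policy is then asked whether to quote a new pair of orders. The fill rule, the construction of the state and the profit bookkeeping are simplified assumptions, not the patent's matching engine.

def backtest(policy, bars, price_diff=0.2, transaction_fee=0.05):
    """Replay the test bars in time order and report the realised spread profit (s71-s74)."""
    open_orders = []   # outstanding (side, price) quotes waiting to be matched
    pnl = 0.0
    for bar in bars:   # bar: dict with keys "state", "high", "low", "mid"
        # s73: match existing orders against this bar's traded range.
        still_open = []
        for side, price in open_orders:
            filled = (side == "buy" and bar["low"] <= price) or \
                     (side == "sell" and bar["high"] >= price)
            if filled:
                pnl += (-price if side == "buy" else price) - transaction_fee
            else:
                still_open.append((side, price))
        open_orders = still_open
        # s73: ask the trained student policy (a callable returning 0 or 1) whether to make a market.
        if policy(bar["state"]) == 1:
            open_orders.append(("buy", bar["mid"] - price_diff))
            open_orders.append(("sell", bar["mid"] + price_diff))
    # s74: output the backtest result (here simply the realised cash PnL of filled orders).
    return pnl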
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
The method can make full use of imperfect market information to seize suitable market-making opportunities and fully considers the risk brought by position accumulation, so as to earn the bid-ask spread and promote market liquidity. The method preprocesses the price and volume information of the market, computes the corresponding technical indicators from perfect market information and feeds them to a reinforcement learning algorithm to train a teacher agent, and then uses the obtained teacher agent to guide the training process of a student agent: the corresponding technical indicators are computed from imperfect market information and fed to the reinforcement learning algorithm, and a student agent is trained under the guidance of the teacher agent. The resulting student agent can decide whether to make a market according to the historical information of the market. The agent is then backtested and the results are output.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a market-making method based on a teacher-student model and reinforcement learning includes the following steps:
s1: collect historical market price data of the target futures variety at a specific time level; the data table specifically contains, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t and the trading volume volume_t;
s2: perform data cleaning:
s21: removing redundant empty rows in the data table to make the rows continuous;
s22: carrying out deduplication on the specific field to ensure that no repeated data rows exist at the same time point t;
s23: removing data rows with missing values and abnormal values;
s3: calculating various data required by the teacher agent for training the teacher agent according to the data;
s4: training a teacher agent by using a reinforcement learning algorithm;
s5: calculating various data required by a student agent for training the student agent according to the data;
s6: training student agents by using a teacher agent to guide a reinforcement learning algorithm;
s7: perform a backtest.
Example 2
As shown in fig. 1, a market-making method based on a teacher-student model and reinforcement learning includes the following steps:
s1: collecting historical market price data of the target futures variety;
s2: carrying out data cleaning work;
s3: calculating various data required by the teacher agent for training the teacher agent according to the data;
s4: training a teacher agent by using a reinforcement learning algorithm;
s5: calculating various data required by a student agent for training the student agent according to the data;
s6: training student agents by using a teacher agent to guide a reinforcement learning algorithm;
s7: perform a backtest.
In step S1, historical market price data of the target futures variety at a specific time level is collected; the data table specifically contains, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t and the trading volume volume_t.
The specific process of step S2 is:
s21: removing redundant empty rows in the data table to make the rows continuous;
s22: carrying out deduplication on the specific field to ensure that no repeated data rows exist at the same time point t;
s23: data lines where missing values and outliers exist are removed.
The specific process of step S3 is:
s31: define a lookback window w for each time point t, representing the length of historical data considered at time point t, and define a future-trend window w' for each time point t, representing the horizon over which the future price trend at time point t is evaluated;
s32: calculate the corresponding financial technical indicators of the teacher part from the collected data. The indicators used are: the w-period exponential moving average EMA_t^w, the w-period commodity channel index CCI_t^w, the w-period absolute price oscillator APO_t^w, the indicator buy_succ_t of whether a buy order placed at time t can be filled within the next w' periods, and the indicator sell_succ_t of whether a sell order placed at time t can be filled within the next w' periods;
the w-period exponential moving average EMA_t^w is calculated as
EMA_t^w = (2/(w+1))·close_t + (1 − 2/(w+1))·EMA_{t−1}^w;
the w-period commodity channel index CCI_t^w is calculated as
CCI_t^w = (TP − MA)/(0.015·MD),
wherein TP is the average of the highest price, the lowest price and the closing price at time t, MA is the average of the closing price over the window w, and MD is the average absolute deviation of the closing price from MA over the window w;
the w-period absolute price oscillator APO_t^w is the difference between the short-term moving average and the long-term moving average;
s33: normalize the calculated financial technical indicators with a linear (min-max) function:
x' = (x − x_min)/(x_max − x_min);
s34: for each time point t in the data table, calculate the average price over the past n periods, mean_t^n, i.e. a simple moving average;
s35: taking mean_t^n as the mid price, calculate a buy price and a sell price for each time point t: the buy quote buy_price_t is set below the mid price and the sell quote sell_price_t above it, offset by the defined minimum price increment price_diff;
s36: determine whether each time point t in the data table is a liquidation time point, and for every non-liquidation time point record the earliest future liquidation time exit_time_t;
s37: for each time point t in the data table, determine whether the buy order and the sell order can be filled before liquidation, recorded as buy_succ_t and sell_succ_t, where 1 means filled and 0 means not filled;
s38: taking into account that market making stops and positions are flattened before the close of each trading day, calculate the profit profit_t produced by the market-making action at each time point t from buy_succ_t and sell_succ_t (whether the buy order and the sell order can be filled before the future liquidation), the quoted prices and the transaction commission transaction_fee.
Example 3
As shown in fig. 1, the present invention provides a market-making method based on a teacher-student model and reinforcement learning, comprising the following steps:
s1: select the target futures variety and collect the corresponding time-level historical market price data. The data used in this embodiment are the minute-level price data of the CSI 300 (Shanghai-Shenzhen 300) stock index futures from April 2010 to April 2021, including the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t and the trading volume volume_t;
S2: carrying out data cleaning work, comprising the following steps:
s21: removing redundant empty rows in the data table to make the rows continuous;
s22: carrying out deduplication on the specific field to ensure that no repeated data rows exist at the same time point t;
s23: data lines where missing values and outliers exist are removed.
S3: calculating various data required for training the teacher agent for the data, comprising the steps of:
s31: define the lookback window of each time point t as 30 minutes, representing the length of historical data considered at time point t, and define the future-trend window of each time point t as 10 minutes, representing the horizon over which the future price trend at time point t is evaluated;
s32: calculate the corresponding financial technical indicators of the teacher part from the collected data. The indicators used are: the 30-minute exponential moving average EMA_t^30, the 30-minute commodity channel index CCI_t^30, the 30-minute absolute price oscillator APO_t^30, the indicator buy_succ_t of whether a buy order placed at time t can be filled within the next 10 minutes, and the indicator sell_succ_t of whether a sell order placed at time t can be filled within the next 10 minutes;
the 30-minute exponential moving average EMA_t^30 is calculated as
EMA_t^30 = (2/31)·close_t + (29/31)·EMA_{t−1}^30;
the 30-minute commodity channel index CCI_t^30 is calculated as
CCI_t^30 = (TP − MA)/(0.015·MD),
wherein TP is the average of the highest price, the lowest price and the closing price at time t, MA is the average of the closing price over 30 minutes, and MD is the average absolute deviation of the closing price from MA over 30 minutes;
the 30-minute absolute price oscillator APO_t^30 is the difference between the short-term moving average and the long-term moving average;
s33: normalize the calculated financial technical indicators with a linear (min-max) function:
x' = (x − x_min)/(x_max − x_min);
s34: for each time point t in the data table, calculate the average price over the past 30 minutes, mean_t, i.e. a simple moving average;
s35: taking mean_t as the mid price, calculate a buy price and a sell price for each time point t: the buy quote buy_price_t is set below the mid price and the sell quote sell_price_t above it, offset by the defined minimum price increment price_diff;
s36: determine whether each time point t in the data table is a liquidation time point, and for every non-liquidation time point record the earliest future liquidation time exit_time_t;
s37: for each time point t in the data table, determine whether the buy order and the sell order can be filled before liquidation, recorded as buy_succ_t and sell_succ_t, where 1 means filled and 0 means not filled;
s38: taking into account that market making stops and positions are flattened before the close of each trading day, calculate the profit profit_t produced by the market-making action at each time point t from buy_succ_t and sell_succ_t (whether the buy order and the sell order can be filled before the future liquidation), the quoted prices and the transaction commission transaction_fee (a usage illustration with these concrete windows follows below).
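As a usage illustration, the teacher_features sketch given after step S3 of the description can be applied with this embodiment's 30-minute lookback and 10-minute future window; the synthetic price series and the tick size of 0.2 below are assumptions for demonstration only, not the patent's data:

import numpy as np
import pandas as pd

# Synthetic one-day minute series standing in for CSI 300 stock index futures bars.
idx = pd.date_range("2021-04-01 09:30", periods=240, freq="1min")
price = 5000 + np.cumsum(np.random.default_rng(0).normal(0, 1, len(idx)))
bars = pd.DataFrame({"open": price, "high": price + 1.0, "low": price - 1.0,
                     "close": price, "volume": 100.0}, index=idx)

# Embodiment parameters: w = 30 minutes, w' = 10 minutes, assumed tick size 0.2.
features = teacher_features(bars, w=30, w_future=10, n=30, price_diff=0.2)
print(features[["ema", "cci", "apo", "buy_succ", "sell_succ", "profit"]].dropna().head())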
S4: training a teacher agent with a reinforcement learning algorithm, comprising the steps of:
s41: create a virtual market-making environment, load the collected minute-level historical market price data of the CSI 300 stock index futures together with the data calculated in step S3, use the financial technical indicators as the state input of the reinforcement learning algorithm, and implement the corresponding methods for initializing, reading and stepping the environment state;
s42: set the action space of the teacher agent in reinforcement learning; the action of the agent at each time point t is either 0 or 1, where 0 means no trade is made and 1 means a market is made at time point t, i.e. a buy order at buy_price_t and a sell order at sell_price_t are simultaneously submitted to the market environment;
s43: set the reward given by the environment in reinforcement learning, using the profit profit_t produced by the decision at each time point t as the reward;
s44: the reinforcement learning algorithm used to train the teacher agent is the PPO (Proximal Policy Optimization) algorithm, which trains the agent policy from the state, action and reward; the policy representing the agent in the PPO algorithm is a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation function, where
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x});
s45: the specific process of using the PPO algorithm is as follows: first the policy parameters φ of the agent are initialized and copied to the policy parameters φ_old used to interact with the environment; the policy with parameters φ_old interacts with the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which the advantage function Â_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, estimated with a value network V and the reward r_t obtained by executing the action in the current state:
Â_t = r_t − V_φ(s_t).
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state pushes the reward towards its maximum, so the policy parameters φ of the teacher agent can be adjusted by gradient descent to minimize the objective function, whose formula is:
L(φ) = L_policy(φ) + λ·L_value(φ)
L_policy(φ) = −E_t[(p_φ(a_t|s_t) / p_φold(a_t|s_t))·Â_t − β·KL[p_φ(·|s_t), p_φold(·|s_t)]]
L_value(φ) = E_t[||V_φ(s_t) − V_t||²]
The policy parameters φ_old used for interaction with the environment are updated with the policy parameters φ after a certain period of time. Because the algorithm keeps one set of policy parameters interacting with the environment while the policy parameters of the teacher agent are being learned, namely φ_old and φ, a Kullback-Leibler divergence, i.e. the term KL[p_φ(·|s_t), p_φold(·|s_t)] in the formula, is introduced into the objective function to measure the difference between the two distributions; this prevents the interaction policy parameters and the teacher agent from differing too much and makes the parameter update smoother.
S5: calculating various data required for training student agents on the data, comprising the following steps:
s51: define the lookback window of each time point t as 30 minutes, representing the length of historical data considered at time point t;
s52: calculate the corresponding financial technical indicators of the student part from the collected data. The indicators used are: the 30-minute exponential moving average EMA_t^30, the 30-minute commodity channel index CCI_t^30 and the 30-minute absolute price oscillator APO_t^30;
the 30-minute exponential moving average EMA_t^30 is calculated as
EMA_t^30 = (2/31)·close_t + (29/31)·EMA_{t−1}^30;
the 30-minute commodity channel index CCI_t^30 is calculated as
CCI_t^30 = (TP − MA)/(0.015·MD),
wherein TP is the average of the highest price, the lowest price and the closing price at time t, MA is the average of the closing price over 30 minutes, and MD is the average absolute deviation of the closing price from MA over 30 minutes;
the 30-minute absolute price oscillator APO_t^30 is the difference between the short-term moving average and the long-term moving average;
s53: normalize the calculated financial technical indicators with a linear (min-max) function:
x' = (x − x_min)/(x_max − x_min);
s54: for each time point t in the data table, calculate the average price over the past 30 minutes, mean_t, i.e. a simple moving average;
s55: taking mean_t as the mid price, calculate a buy price and a sell price for each time point t: the buy quote buy_price_t is set below the mid price and the sell quote sell_price_t above it, offset by the defined minimum price increment price_diff;
s56: determine whether each time point t in the data table is a liquidation time point, and for every non-liquidation time point record the earliest future liquidation time exit_time_t;
s57: for each time point t in the data table, determine whether the buy order and the sell order can be filled before liquidation, recorded as buy_succ_t and sell_succ_t, where 1 means filled and 0 means not filled;
s58: taking into account that market making stops and positions are flattened before the close of each trading day, calculate the profit profit_t produced by the market-making action at each time point t from buy_succ_t and sell_succ_t (whether the buy order and the sell order can be filled before the future liquidation), the quoted prices and the transaction commission transaction_fee.
S6: the method for training the student agent by using the teacher agent to guide the reinforcement learning algorithm comprises the following steps:
s61: creating a virtual market environment, loading collected historical market price data of 300 futures minute-level of Shanghai depth and the required data calculated in the step S5, taking financial technical indexes as state input states of the reinforcement learning algorithm, and setting corresponding method bodies for initializing, acquiring and pushing the environment state;
s62: setting action space action of student agent in reinforcement learning, wherein the action of the agent at each time point t is divided into 0 and 1, 0 represents that no transaction is made, 1 represents that marketing is made at the time point t, and price buy _ price is added to market environmenttAnd sel _ pricetSending out purchase orders and sales orders;
s63: setting reward and punishment value reward given by environment in reinforcement learning, and using profit obtained by decision of each time point ttAs reward;
s64: the reinforcement learning algorithm for training the student agent adopts a PPO (Rapid Policy optimization) algorithm, the training of agent strategies is carried out according to the state, the action and the reward, the strategy used for representing the agent training in the PPO algorithm is selected as a neural network formed by a multilayer perceptron and an activation function, namely, one multilayer perceptron is connected with a tanh activation function, and the formula of the tanh activation function is as follows:
Figure BDA0003375802070000161
s65: the specific process of using the PPO algorithm is as follows: firstly, loading the strategy phi of the teacher agent after the reinforcement learning training so as to guide the reinforcement learning process of the student agent in the follow-up process, then initializing the strategy parameter theta of the student agent, and assigning the parameter to the strategy parameter theta interacting with the environmentoldLet the parameter be thetaoldThe strategy of (a) interacts in the virtual market-making environment to obtain the environment state s of the current time point ttAnd obtaining the action a selected by the strategytFrom which an advantage function is calculated
Figure BDA0003375802070000162
The advantage function represents that the difference between the rewarded obtained by the action selected by the agent policy in the current environment state and the expected rewarded is estimated by using a V network and the rewarded obtained by executing the action in the current state, and the formula is as follows:
Figure BDA0003375802070000163
if the sign of the advantage function is positive, it is indicated that the action selected by the agent policy in the current environment state is beneficial to the development of reward towards the direction of maximization, so that the policy parameter θ of the corresponding student agent can be adjusted through gradient descent to minimize the objective function, the objective function of the student agent introduces negative log likelihood loss of the decision and the decision of the teacher agent, the larger the difference between the probability distribution of the decision of the student agent and the decision probability distribution of the teacher agent is, the larger the value of the negative log likelihood loss is, so we can also optimize the decision through gradient descent, and in conclusion, the formula of the objective function is as follows:
L(θ)=Lpolicy(θ)+λLvalue(θ)+μLdiff(θ)
Figure BDA0003375802070000164
Lvalue(θ)=Et[||Vθ(st)-Vt||2]
Figure BDA0003375802070000171
updating a policy parameter theta for interaction with an environment with the policy parameter theta for a period of timeoldBecause a strategy parameter is interacted with the environment in the algorithm, the strategy parameters of the teacher intelligent agent are learned, and the strategy parameters are respectively thetaoldAnd θ, thus introducing a Kullback-Leibler divergence in the objective function, i.e. in the formula
Figure BDA0003375802070000172
The method is used for calculating the difference between the two distributions, avoids overlarge difference between interactive strategy parameters and teacher agents, and is beneficial to smoother parameter updating.
S7: Perform a backtest, comprising the following steps:
s71: load the policy of the student agent obtained after reinforcement learning training;
s72: import the minute-level market data of the CSI 300 stock index futures used for testing and push them to the policy;
s73: read the data in minute order, match the currently outstanding orders for fills, execute the corresponding trading instructions given by the policy to generate new orders, store these orders for subsequent matching, and flatten positions before the close of each trading day;
s74: output the backtest result according to the final trading results.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A market-making method based on a teacher-student model and reinforcement learning, characterized by comprising the following steps:
s1: collecting historical market price data of the target futures variety;
s2: carrying out data cleaning work;
s3: calculating various data required by the teacher agent for training the teacher agent according to the data;
s4: training a teacher agent by using a reinforcement learning algorithm;
s5: calculating various data required by a student agent for training the student agent according to the data;
s6: training student agents by using a teacher agent to guide a reinforcement learning algorithm;
s7: perform a backtest.
2. The market-making method based on a teacher-student model and reinforcement learning according to claim 1, wherein in step S1, historical market price data of the target futures variety at a specific time level is collected, and the data table specifically includes, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t and the trading volume volume_t.
3. The market-making method based on a teacher-student model and reinforcement learning according to claim 2, wherein the specific process of step S2 is:
s21: removing redundant empty rows in the data table to make the rows continuous;
s22: carrying out deduplication on the specific field to ensure that no repeated data rows exist at the same time point t;
s23: data lines where missing values and outliers exist are removed.
4. The market-making method based on a teacher-student model and reinforcement learning according to claim 3, wherein the specific process of step S3 is:
s31: define a lookback window w for each time point t, representing the length of historical data considered at time point t, and define a future-trend window w' for each time point t, representing the horizon over which the future price trend at time point t is evaluated;
s32: calculate the corresponding financial technical indicators of the teacher part from the collected data. The indicators used are: the w-period exponential moving average EMA_t^w, the w-period commodity channel index CCI_t^w, the w-period absolute price oscillator APO_t^w, the indicator buy_succ_t of whether a buy order placed at time t can be filled within the next w' periods, and the indicator sell_succ_t of whether a sell order placed at time t can be filled within the next w' periods;
the w-period exponential moving average EMA_t^w is calculated as
EMA_t^w = (2/(w+1))·close_t + (1 − 2/(w+1))·EMA_{t−1}^w;
the w-period commodity channel index CCI_t^w is calculated as
CCI_t^w = (TP − MA)/(0.015·MD),
wherein TP is the average of the highest price, the lowest price and the closing price at time t, MA is the average of the closing price over the window w, and MD is the average absolute deviation of the closing price from MA over the window w;
the w-period absolute price oscillator APO_t^w is the difference between the short-term moving average and the long-term moving average;
s33: normalize the calculated financial technical indicators with a linear (min-max) function:
x' = (x − x_min)/(x_max − x_min);
s34: for each time point t in the data table, calculate the average price over the past n periods, mean_t^n, i.e. a simple moving average;
s35: taking mean_t^n as the mid price, calculate a buy price and a sell price for each time point t: the buy quote buy_price_t is set below the mid price and the sell quote sell_price_t above it, offset by the defined minimum price increment price_diff;
s36: determine whether each time point t in the data table is a liquidation time point, and for every non-liquidation time point record the earliest future liquidation time exit_time_t;
s37: for each time point t in the data table, determine whether the buy order and the sell order can be filled before liquidation, recorded as buy_succ_t and sell_succ_t, where 1 means filled and 0 means not filled;
s38: taking into account that market making stops and positions are flattened before the close of each trading day, calculate the profit profit_t produced by the market-making action at each time point t from buy_succ_t and sell_succ_t (whether the buy order and the sell order can be filled before the future liquidation), the quoted prices and the transaction commission transaction_fee.
5. The teacher-student model and reinforcement learning based marketing method according to claim 4, wherein the specific process of step S4 is:
s41: creating a virtual market environment, loading collected historical market price data of a specific time level of the target futures and the required data calculated in the step S3, taking financial technical indexes as state input states of the reinforcement learning algorithm, and setting corresponding method bodies for initializing, acquiring and pushing the environment state;
S42: set the action space of the teacher agent in reinforcement learning; the agent's action at each time point t is either 0 or 1, where 0 means no trade is made and 1 means a market is made at time point t, in which case a buy order at price buy_price_t and a sell order at price sell_price_t are submitted to the market environment;
S43: set the reward value given by the environment in reinforcement learning, using the profit profit_t obtained from the decision at each time point t as the reward;
S44: train the teacher agent with the PPO (Proximal Policy Optimization) reinforcement learning algorithm, learning the agent's policy from the states, actions, and rewards; the policy trained by the agent in the PPO algorithm is represented by a neural network composed of a multilayer perceptron followed by a tanh activation function, where

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
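A minimal sketch of the virtual market-making environment of steps S41-S43 and of the multilayer-perceptron policy with tanh activations of step S44 is shown below, assuming a gym-style interface and PyTorch. The class and attribute names are illustrative, and the value head is included because the PPO update in claim 6 relies on a V network.

```python
import numpy as np
import torch
import torch.nn as nn

class MarketMakingEnv:
    """Gym-style virtual market environment (sketch of S41-S43).

    `features` holds the normalised indicator vectors (the state) and
    `profits` the pre-computed per-step profit; names are illustrative.
    """
    def __init__(self, features: np.ndarray, profits: np.ndarray):
        self.features, self.profits = features, profits
        self.t = 0

    def reset(self) -> np.ndarray:
        self.t = 0
        return self.features[self.t]

    def step(self, action: int):
        # action 0: do nothing; action 1: quote buy_price_t / sell_price_t (reward = profit_t)
        reward = float(self.profits[self.t]) if action == 1 else 0.0
        self.t += 1
        done = self.t >= len(self.features) - 1
        return self.features[self.t], reward, done, {}

class MLPPolicy(nn.Module):
    """Multilayer perceptron with tanh activations, policy head and value (V) head."""
    def __init__(self, state_dim: int, hidden: int = 64, n_actions: int = 2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_actions)   # logits over {0: no trade, 1: make market}
        self.v = nn.Linear(hidden, 1)            # state-value estimate

    def forward(self, s: torch.Tensor):
        h = self.body(s)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)
```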
6. The market-making method based on the teacher-student model and reinforcement learning according to claim 5, wherein the specific process of using the PPO algorithm in step S44 is as follows: first, the policy parameter φ of the agent is initialized and assigned to the policy parameter φ_old that interacts with the environment; the policy with parameter φ_old interacts with the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function Â_t is computed. The advantage function represents the difference between the reward obtained by the action selected by the agent's policy in the current environment state and the expected reward; it is estimated with a V network together with the reward obtained by executing the action in the current state, according to the following formula:
Figure FDA0003375802060000033
If the sign of the advantage function is positive, the action selected by the agent's policy in the current environment state drives the reward toward its maximum, so the policy parameter φ of the teacher agent can be adjusted by gradient descent to minimize the objective function, which is defined as follows:
L(φ) = L_policy(φ) + λ·L_value(φ)
Figure FDA0003375802060000041
L_value(φ) = E_t[ ||V_φ(s_t) − V_t||² ]
The policy parameter φ_old used for interaction with the environment is updated to the learned parameter φ after a certain period. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the teacher agent, namely φ_old and φ respectively, a Kullback-Leibler divergence, i.e. KL[p_φ(·|s_t), p_φold(·|s_t)] in the formula, is introduced into the objective function to measure the difference between the two distributions; this prevents the interacting policy parameters from drifting too far from those of the teacher agent and makes the parameter updates smoother.
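One PPO update step for the teacher agent, following the description in claim 6, could be sketched as below. The policy modules are assumed to return an (action distribution, value) pair as in the earlier MLPPolicy sketch; the one-step advantage estimate, the KL-penalised surrogate form of L_policy, and the coefficients λ and β are assumptions, since the claim's formula images are not reproduced here.

```python
import torch

def ppo_teacher_update(policy, policy_old, optimizer, batch,
                       gamma=0.99, lam_value=0.5, beta_kl=0.01):
    """One PPO update step for the teacher agent (sketch of claim 6).

    `batch` holds tensors s, s_next, a, r collected by the old policy.
    """
    s, s_next, a, r = batch['s'], batch['s_next'], batch['a'], batch['r']

    dist, v = policy(s)
    with torch.no_grad():
        dist_old, _ = policy_old(s)
        _, v_next = policy(s_next)
        adv = r + gamma * v_next - v.detach()     # advantage estimate from the V network (assumed form)
        returns = r + gamma * v_next              # regression target V_t for the value head

    # importance ratio p_phi(a_t|s_t) / p_phi_old(a_t|s_t)
    ratio = torch.exp(dist.log_prob(a) - dist_old.log_prob(a))

    # KL penalty keeps the learned parameters phi close to the interacting parameters phi_old
    kl = torch.distributions.kl_divergence(dist, dist_old).mean()
    l_policy = -(ratio * adv).mean() + beta_kl * kl
    l_value = ((v - returns) ** 2).mean()         # L_value(phi) = E_t[ ||V_phi(s_t) - V_t||^2 ]

    loss = l_policy + lam_value * l_value         # L(phi) = L_policy + lambda * L_value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```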
7. The market-making method based on the teacher-student model and reinforcement learning according to claim 6, wherein the specific process of step S5 is:
S51: define a look-back interval w for each time point t, representing the length of historical data considered at time t;
S52: compute the financial technical indicators of the student part from the collected data. The indicators used are: the w-duration exponential moving average EMA_t^w, the w-duration Commodity Channel Index CCI_t^w, and the w-duration absolute price oscillator APO_t^w.
The w-duration exponential moving average EMA_t^w is computed as:

EMA_t^w = (2 / (w + 1)) · price_t + (1 − 2 / (w + 1)) · EMA_(t−1)^w
The w-duration Commodity Channel Index CCI_t^w is computed as:

CCI_t^w = (TP_t − MA_t) / (0.015 × MD_t)

where TP is the average of the highest price, the lowest price, and the closing price at time t, MA is the average of the closing prices over the duration w, and MD is the average of the absolute differences between the closing prices over the duration w and the MA value;
The w-duration absolute price oscillator APO_t^w is obtained by subtracting the short-term moving average from the long-term moving average;
S53: normalize the computed financial technical indicators with a linear (min-max) function:

x_norm = (x − x_min) / (x_max − x_min)
S54: for each time point t in the data table, compute the average of the prices over the past n durations, SMA_t^n, i.e. a simple moving average;
S55: taking SMA_t^n as the mid price, compute a bid price buy_price_t and an ask price sell_price_t for each time point t in the data table using the following formulas:
Figure FDA0003375802060000052
Figure FDA0003375802060000053
where price_diff is the defined minimum price increment, buy_price_t is the bid price quoted at time point t, and sell_price_t is the ask price quoted at time point t;
S56: determine whether each time point t in the data table is a liquidation (position-closing) time point, and for every non-liquidation time point record the earliest future liquidation time exit_time_t;
S57: for each time point t in the data table, determine whether the buy order and the sell order can be filled before liquidation, and record the results as buy_succ_t and sell_succ_t, where 1 denotes a fill and 0 denotes no fill;
S58: considering that market making stops and positions are flattened before the close of each trading day, compute the profit profit_t produced by the market-making action at each time point t using the following formula:
Figure FDA0003375802060000054
where buy_succ_t and sell_succ_t indicate whether the buy order and the sell order, respectively, can be filled before the future liquidation, and transaction_fee denotes the commission charged on the transaction.
8. The market-making method based on the teacher-student model and reinforcement learning according to claim 7, wherein the specific process of step S6 is:
S61: create a virtual market-making environment, load the collected historical price data of the target futures at the specified time level together with the required data computed in step S5, use the financial technical indicators as the state input of the reinforcement learning algorithm, and implement the corresponding methods for initializing, reading, and advancing the environment state;
S62: set the action space of the student agent in reinforcement learning; the agent's action at each time point t is either 0 or 1, where 0 means no trade is made and 1 means a market is made at time point t, in which case a buy order at price buy_price_t and a sell order at price sell_price_t are submitted to the market environment;
S63: set the reward value given by the environment in reinforcement learning, using the profit profit_t obtained from the decision at each time point t as the reward;
S64: train the student agent with the PPO (Proximal Policy Optimization) reinforcement learning algorithm, learning the agent's policy from the states, actions, and rewards; the policy trained by the agent in the PPO algorithm is represented by a neural network composed of a multilayer perceptron followed by a tanh activation function, where

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
9. The market-making method based on the teacher-student model and reinforcement learning according to claim 8, wherein the specific process of using the PPO algorithm in step S64 is as follows: first, the policy φ of the teacher agent obtained after the reinforcement learning training is loaded so that it can guide the subsequent reinforcement learning of the student agent; then the policy parameter θ of the student agent is initialized and assigned to the policy parameter θ_old that interacts with the environment; the policy with parameter θ_old interacts with the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function Â_t is computed. The advantage function represents the difference between the reward obtained by the action selected by the agent's policy in the current environment state and the expected reward; it is estimated with a V network together with the reward obtained by executing the action in the current state, according to the following formula:
Figure FDA0003375802060000063
If the sign of the advantage function is positive, the action selected by the agent's policy in the current environment state drives the reward toward its maximum, so the policy parameter θ of the student agent can be adjusted by gradient descent to minimize the objective function. The objective function of the student agent introduces a negative log-likelihood loss between the student's decision and the teacher agent's decision: the larger the difference between the student agent's decision probability distribution and the teacher agent's decision probability distribution, the larger the value of this negative log-likelihood loss, so it too can be reduced by gradient descent. In summary, the objective function is defined as follows:
L(θ) = L_policy(θ) + λ·L_value(θ) + μ·L_diff(θ)
Figure FDA0003375802060000064
L_value(θ) = E_t[ ||V_θ(s_t) − V_t||² ]
Figure FDA0003375802060000065
The policy parameter θ_old used for interaction with the environment is updated to the learned parameter θ after a certain period. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the student agent, namely θ_old and θ respectively, a Kullback-Leibler divergence, i.e. KL[p_θ(·|s_t), p_θold(·|s_t)] in the formula, is introduced into the objective function to measure the difference between the two distributions; this prevents the interacting policy parameters from drifting too far from the learned parameters and makes the parameter updates smoother.
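A sketch of the student update described in claim 9 is given below, mirroring the teacher update but adding the teacher-imitation term L_diff. The cross-entropy form used for the negative log-likelihood term, the one-step advantage, and the coefficients λ, μ, β are assumptions; the teacher policy is loaded once and kept fixed.

```python
import torch

def ppo_student_update(student, student_old, teacher, optimizer, batch,
                       gamma=0.99, lam_value=0.5, mu_diff=0.1, beta_kl=0.01):
    """One PPO update step for the student agent guided by a frozen teacher (sketch of claim 9).

    All networks return an (action distribution, value) pair as in the earlier sketches.
    """
    s, s_next, a, r = batch['s'], batch['s_next'], batch['a'], batch['r']

    dist, v = student(s)
    with torch.no_grad():
        dist_old, _ = student_old(s)
        dist_teacher, _ = teacher(s)              # loaded teacher policy phi, kept fixed
        _, v_next = student(s_next)
        adv = r + gamma * v_next - v.detach()
        returns = r + gamma * v_next

    ratio = torch.exp(dist.log_prob(a) - dist_old.log_prob(a))
    kl = torch.distributions.kl_divergence(dist, dist_old).mean()
    l_policy = -(ratio * adv).mean() + beta_kl * kl
    l_value = ((v - returns) ** 2).mean()

    # L_diff: negative log-likelihood of the student's distribution under the teacher's
    # decision distribution (cross-entropy form, assumed); grows as the two distributions diverge.
    l_diff = -(dist_teacher.probs * torch.log(dist.probs + 1e-8)).sum(dim=-1).mean()

    loss = l_policy + lam_value * l_value + mu_diff * l_diff
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```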
10. The market-making method based on the teacher-student model and reinforcement learning according to claim 9, wherein the specific process of step S7 is:
S71: load the policy of the student agent obtained after the reinforcement learning training;
S72: import the market data used for testing and push it to the policy;
S73: read the data in sequence at the corresponding time level, match the currently outstanding orders for execution, execute the corresponding trading instructions according to the policy to generate new orders, and store these orders for subsequent matching;
S74: output the backtest result based on the final trading results.
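A minimal backtest loop in the spirit of steps S71-S74 could look as follows. The column names, the per-step state vectors, and in particular the simple bar-crossing fill rule (a resting bid fills when the bar's low reaches it, a resting ask when the bar's high reaches it) are illustrative assumptions rather than the claim's matching logic.

```python
import torch

def backtest(student_policy, test_df, features, transaction_fee=1.0):
    """Replay test data through the trained student policy (sketch of S71-S74).

    test_df needs 'high', 'low', 'close', 'buy_price' and 'sell_price' columns;
    features holds the per-step state vectors.
    """
    open_bids, open_asks = [], []          # resting orders awaiting a match
    cash, position = 0.0, 0

    for t in range(len(test_df)):
        row = test_df.iloc[t]

        # S73: match currently outstanding orders against the current bar
        filled_bids = [p for p in open_bids if row['low'] <= p]
        filled_asks = [p for p in open_asks if row['high'] >= p]
        for p in filled_bids:
            cash -= p + transaction_fee    # bought one unit at the bid
            position += 1
        for p in filled_asks:
            cash += p - transaction_fee    # sold one unit at the ask
            position -= 1
        open_bids = [p for p in open_bids if p not in filled_bids]
        open_asks = [p for p in open_asks if p not in filled_asks]

        # S72-S73: push the state to the policy and act on its decision
        state = torch.as_tensor(features[t], dtype=torch.float32)
        with torch.no_grad():
            dist, _ = student_policy(state)
        if int(dist.probs.argmax()) == 1:  # action 1: quote both sides
            open_bids.append(row['buy_price'])
            open_asks.append(row['sell_price'])

    # S74: report the backtest result, marking remaining inventory at the last close
    return cash + position * float(test_df['close'].iloc[-1])
```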
CN202111417906.XA 2021-11-25 2021-11-25 A Market Making Approach Based on Teacher-Student Model and Reinforcement Learning Pending CN114049223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111417906.XA CN114049223A (en) 2021-11-25 2021-11-25 A Market Making Approach Based on Teacher-Student Model and Reinforcement Learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111417906.XA CN114049223A (en) 2021-11-25 2021-11-25 A Market Making Approach Based on Teacher-Student Model and Reinforcement Learning

Publications (1)

Publication Number Publication Date
CN114049223A true CN114049223A (en) 2022-02-15

Family

ID=80211133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111417906.XA Pending CN114049223A (en) 2021-11-25 2021-11-25 A Market Making Approach Based on Teacher-Student Model and Reinforcement Learning

Country Status (1)

Country Link
CN (1) CN114049223A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119025213A (en) * 2024-10-29 2024-11-26 Wenzhou University of Technology International economic and financial analysis teaching demonstration methods, systems and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination