Disclosure of Invention
The invention provides a market making method based on a teacher-student model and reinforcement learning, which makes full use of imperfect market information to identify suitable market-making moments and fully accounts for the risk brought by position accumulation, thereby earning the bid-ask spread and improving market liquidity.
To achieve the above technical effects, the technical solution of the invention is as follows:
A market making method based on a teacher-student model and reinforcement learning comprises the following steps:
S1: collecting historical market price data of the target futures variety;
S2: performing data cleaning;
S3: calculating, from the cleaned data, the data required for training the teacher agent;
S4: training the teacher agent with a reinforcement learning algorithm;
S5: calculating, from the cleaned data, the data required for training the student agent;
S6: training the student agent with a reinforcement learning algorithm guided by the teacher agent;
S7: performing a backtest.
Further, in step S1, historical market price data of the target futures at a specific time level are collected; the data table specifically includes, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t, and the trading volume volume_t.
Further, the specific process of step S2 is:
S21: removing redundant empty rows in the data table so that the rows are continuous;
S22: deduplicating on the specified field so that no duplicate data rows exist for the same time point t;
S23: removing data rows that contain missing values or outliers.
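A minimal sketch of the cleaning of steps S21 to S23 is given below, assuming the raw data are held in a pandas DataFrame with a time column named "t"; the column names and the simple outlier test are illustrative assumptions, not features fixed by the invention.

```python
import pandas as pd

def clean_market_data(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning routine for steps S21-S23 (column names are assumptions)."""
    # S21: drop fully empty rows so the remaining rows are contiguous
    df = df.dropna(how="all")
    # S22: deduplicate on the time-point field t, keeping the first occurrence
    df = df.drop_duplicates(subset="t", keep="first")
    # S23: remove rows with missing values or obvious outliers (non-positive prices here)
    df = df.dropna(subset=["open", "high", "low", "close", "volume"])
    df = df[(df[["open", "high", "low", "close"]] > 0).all(axis=1)]
    return df.reset_index(drop=True)
```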
Further, the specific process of step S3 is:
S31: defining, for each time point t, a look-back interval w representing the length of the historical data considered at t, and a look-ahead interval w' representing the window that reflects the future price trend after t;
S32: calculating, from the collected data, the financial technical indicators used for the teacher part, the indicators used comprising: the w-duration exponential moving average (EMA) indicator, the w-duration Commodity Channel Index (CCI) indicator, the w-duration absolute price oscillator (APO) indicator, an indicator of whether a buy order placed at time point t can be filled within the future duration w', and an indicator of whether a sell order placed at time point t can be filled within the future duration w'.
The w-duration EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1)
The w-duration CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over the duration w, and MD is the average of the absolute differences between the closing price over the duration w and the MA value;
The w-duration APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S33: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S34: calculating, for each time point t in the data table, the average of the prices over the past n time units, i.e. a simple moving average, denoted avg_price_t;
S35: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S36: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S37: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S38: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
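A compact pandas sketch of steps S32 to S38 follows. The EMA, CCI and APO computations follow the definitions above; the choice of the short-term APO window, the quoting of the bid and ask one tick away from the mid price, the simplified fill test against future highs and lows, and the profit expression are illustrative assumptions rather than formulas fixed by the invention.

```python
import numpy as np
import pandas as pd

def teacher_features(df: pd.DataFrame, w: int, w_prime: int, n: int,
                     price_diff: float, transaction_fee: float) -> pd.DataFrame:
    """Illustrative computation of steps S32-S38 (column names and quoting rule are assumptions)."""
    out = df.copy()
    close = out["close"]

    # S32: w-duration EMA, CCI and APO (APO: long-term MA minus short-term MA, per the text;
    # the short-term window w // 2 is an assumption; MD is an approximation of the CCI deviation)
    out["ema_w"] = close.ewm(span=w, adjust=False).mean()
    tp = (out["high"] + out["low"] + close) / 3.0
    ma = close.rolling(w).mean()
    md = (close - ma).abs().rolling(w).mean()
    out["cci_w"] = (tp - ma) / (0.015 * md)
    out["apo_w"] = ma - close.rolling(max(2, w // 2)).mean()

    # S33: linear (min-max) normalization of each indicator
    for col in ["ema_w", "cci_w", "apo_w"]:
        lo, hi = out[col].min(), out[col].max()
        out[col + "_norm"] = (out[col] - lo) / (hi - lo)

    # S34-S35: simple moving average of the past n bars as mid price; quotes one tick away (assumption)
    out["avg_price"] = close.rolling(n).mean()
    out["buy_price"] = out["avg_price"] - price_diff
    out["sell_price"] = out["avg_price"] + price_diff

    # S36-S37: can the bid/ask be filled within the next w_prime bars? (simplified fill test)
    fut_low = out["low"].shift(-1).rolling(w_prime).min().shift(-(w_prime - 1))
    fut_high = out["high"].shift(-1).rolling(w_prime).max().shift(-(w_prime - 1))
    out["buy_succ"] = (fut_low <= out["buy_price"]).astype(int)
    out["sell_succ"] = (fut_high >= out["sell_price"]).astype(int)

    # S38: profit of making the market at t; both legs filled earn the spread minus fees (assumption)
    spread = out["sell_price"] - out["buy_price"]
    out["profit"] = np.where((out["buy_succ"] == 1) & (out["sell_succ"] == 1),
                             spread - 2 * transaction_fee,
                             -transaction_fee)
    return out
```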
Further, the specific process of step S4 is:
S41: creating a virtual market-making environment, loading the collected historical market price data of the target futures at the specific time level together with the required data calculated in step S3, taking the financial technical indicators as the state input of the reinforcement learning algorithm, and providing corresponding methods for initializing, reading and advancing the environment state;
S42: setting the action space of the teacher agent in reinforcement learning, where the action at each time point t takes the value 0 or 1: 0 means that no trade is made, and 1 means that market making is performed at time point t, i.e. a buy order at price buy_price_t and a sell order at price sell_price_t are simultaneously submitted to the market environment;
S43: setting the reward value given by the environment in reinforcement learning, taking the profit profit_t obtained by the decision at each time point t as the reward;
S44: the reinforcement learning algorithm used to train the teacher agent is the PPO (Proximal Policy Optimization) algorithm, which trains the agent policy from the state, the action and the reward; the policy representing the agent in the PPO algorithm is chosen as a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation, where:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Further, the specific process of using the PPO algorithm in step S44 is as follows: first, the policy parameter φ of the agent is initialized and copied to the policy parameter φ_old that interacts with the environment; the policy with parameter φ_old interacts in the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function A_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, and is estimated with a V network together with the reward obtained by executing the action in the current state:
A_t = r_t + γV(s_(t+1)) - V(s_t)
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state drives the reward towards its maximum, so the policy parameter φ of the teacher agent can be adjusted by gradient descent to minimize the objective function, whose formula is:
L(φ) = L_policy(φ) + λL_value(φ)
L_value(φ) = E_t[||V_φ(s_t) - V_t||^2]
After a certain period, the policy parameter φ_old used for interacting with the environment is updated with the policy parameter φ. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the teacher agent, namely φ_old and φ, a Kullback-Leibler divergence term KL(π_φ_old || π_φ) is introduced into the objective function to measure the difference between the two distributions; it prevents the interaction policy and the teacher agent policy from diverging too much and makes the parameter updates smoother.
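A minimal PyTorch sketch of the policy network and of one PPO update step for the teacher agent is given below. It assumes a KL-penalty form of the PPO objective, a one-step advantage estimate A_t = r_t + γV(s_(t+1)) - V(s_t), and illustrative weights lam (λ) and beta; the names MLPTanhPolicy and ppo_teacher_update, the hidden size and the hyperparameter values are illustrative choices, not features prescribed by the invention.

```python
import torch
import torch.nn as nn

class MLPTanhPolicy(nn.Module):
    """Multilayer perceptron with tanh activations; a policy head and a V-network head."""
    def __init__(self, state_dim: int, hidden_dim: int = 64, n_actions: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden_dim, n_actions)  # action 0: no trade, 1: make market
        self.value_head = nn.Linear(hidden_dim, 1)           # V network for the advantage estimate

    def forward(self, state: torch.Tensor):
        h = self.body(state)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        value = self.value_head(h).squeeze(-1)
        return dist, value


def ppo_teacher_update(policy, policy_old, optimizer, s, a, r, s_next,
                       gamma: float = 0.99, lam: float = 0.5, beta: float = 0.01):
    """One gradient step minimizing L(phi) = L_policy + lam * L_value + beta * KL(pi_old || pi)."""
    with torch.no_grad():
        dist_old, v_s = policy_old(s)
        _, v_s_next = policy_old(s_next)
        v_target = r + gamma * v_s_next                        # bootstrapped value target V_t
        advantage = v_target - v_s                             # A_t = r_t + gamma*V(s_{t+1}) - V(s_t)
        logp_old = dist_old.log_prob(a)

    dist, value = policy(s)
    ratio = torch.exp(dist.log_prob(a) - logp_old)             # pi_phi(a|s) / pi_phi_old(a|s)
    l_policy = -(ratio * advantage).mean()                     # maximize advantage-weighted ratio
    l_value = ((value - v_target) ** 2).mean()                 # L_value = E[||V_phi(s_t) - V_t||^2]
    kl = torch.distributions.kl_divergence(dist_old, dist).mean()  # KL(pi_phi_old || pi_phi)

    loss = l_policy + lam * l_value + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After a chosen number of such update steps, the interaction parameters φ_old would be refreshed with φ, e.g. policy_old.load_state_dict(policy.state_dict()).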
Further, the specific process of step S5 is:
S51: defining, for each time point t, a look-back interval w representing the length of the historical data considered at t;
S52: calculating, from the collected data, the financial technical indicators used for the student part, the indicators used comprising: the w-duration exponential moving average (EMA) indicator, the w-duration Commodity Channel Index (CCI) indicator, and the w-duration absolute price oscillator (APO) indicator.
The w-duration EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1)
The w-duration CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over the duration w, and MD is the average of the absolute differences between the closing price over the duration w and the MA value;
The w-duration APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S53: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S54: calculating, for each time point t in the data table, the average of the prices over the past n time units, i.e. a simple moving average, denoted avg_price_t;
S55: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S56: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S57: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S58: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
Further, the specific process of step S6 is:
S61: creating a virtual market-making environment, loading the collected historical market price data of the target futures at the specific time level together with the required data calculated in step S5, taking the financial technical indicators as the state input of the reinforcement learning algorithm, and providing corresponding methods for initializing, reading and advancing the environment state;
S62: setting the action space of the student agent in reinforcement learning, where the action at each time point t takes the value 0 or 1: 0 means that no trade is made, and 1 means that market making is performed at time point t, i.e. a buy order at price buy_price_t and a sell order at price sell_price_t are simultaneously submitted to the market environment;
S63: setting the reward value given by the environment in reinforcement learning, taking the profit profit_t obtained by the decision at each time point t as the reward;
S64: the reinforcement learning algorithm used to train the student agent is the PPO (Proximal Policy Optimization) algorithm, which trains the agent policy from the state, the action and the reward; the policy representing the agent in the PPO algorithm is chosen as a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation, where:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Further, the specific process of using the PPO algorithm in step S64 is as follows: first, the policy φ of the teacher agent obtained by the reinforcement learning training is loaded so as to guide the subsequent reinforcement learning of the student agent; then the policy parameter θ of the student agent is initialized and copied to the policy parameter θ_old that interacts with the environment; the policy with parameter θ_old interacts in the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function A_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, and is estimated with a V network together with the reward obtained by executing the action in the current state:
A_t = r_t + γV(s_(t+1)) - V(s_t)
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state drives the reward towards its maximum, so the policy parameter θ of the student agent can be adjusted by gradient descent to minimize the objective function. The objective function of the student agent additionally introduces a negative log-likelihood loss between the student's decision and the teacher agent's decision: the larger the difference between the student's decision probability distribution and the teacher's decision probability distribution, the larger the negative log-likelihood loss, so this term can likewise be optimized by gradient descent. In summary, the objective function is:
L(θ) = L_policy(θ) + λL_value(θ) + μL_diff(θ)
L_value(θ) = E_t[||V_θ(s_t) - V_t||^2]
After a certain period, the policy parameter θ_old used for interacting with the environment is updated with the policy parameter θ. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the student agent, namely θ_old and θ, a Kullback-Leibler divergence term KL(π_θ_old || π_θ) is introduced into the objective function to measure the difference between the two distributions; it prevents the interaction policy and the student agent policy from diverging too much and makes the parameter updates smoother.
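A minimal PyTorch sketch of one teacher-guided PPO update for the student agent is given below. It reuses the MLPTanhPolicy network and the one-step advantage estimate from the teacher sketch, and it realises L_diff(θ) as the negative log-likelihood of the teacher's chosen action under the student policy; the teacher receives its own richer state s_teacher while the student receives only the historical-information state. The function name, the greedy choice of the teacher action, and the weights mu (μ) and beta are illustrative assumptions.

```python
import torch

def ppo_student_update(student, student_old, teacher, optimizer,
                       s_student, s_teacher, a, r, s_student_next,
                       gamma: float = 0.99, lam: float = 0.5,
                       mu: float = 0.1, beta: float = 0.01):
    """One gradient step on L(theta) = L_policy + lam*L_value + mu*L_diff + beta*KL(pi_theta_old || pi_theta)."""
    with torch.no_grad():
        dist_old, v_s = student_old(s_student)
        _, v_s_next = student_old(s_student_next)
        v_target = r + gamma * v_s_next                      # bootstrapped value target V_t
        advantage = v_target - v_s                           # A_t = r_t + gamma*V(s_{t+1}) - V(s_t)
        logp_old = dist_old.log_prob(a)
        teacher_dist, _ = teacher(s_teacher)                 # the teacher sees its richer (perfect-information) state
        teacher_action = teacher_dist.probs.argmax(dim=-1)   # the teacher's decision at each time point

    dist, value = student(s_student)
    ratio = torch.exp(dist.log_prob(a) - logp_old)           # pi_theta(a|s) / pi_theta_old(a|s)
    l_policy = -(ratio * advantage).mean()
    l_value = ((value - v_target) ** 2).mean()               # L_value = E[||V_theta(s_t) - V_t||^2]
    l_diff = -dist.log_prob(teacher_action).mean()           # negative log-likelihood of the teacher decision
    kl = torch.distributions.kl_divergence(dist_old, dist).mean()

    loss = l_policy + lam * l_value + mu * l_diff + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```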
Further, the specific process of step S7 is:
S71: loading the policy of the student agent obtained after the reinforcement learning training;
S72: importing market data for testing and pushing them to the policy;
S73: reading the data in order at the corresponding time level, matching currently open orders for fills, executing the corresponding trading instruction output by the policy to generate new orders, and storing those orders for subsequent matching;
S74: outputting the backtest result according to the final trading results.
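A simplified bar-by-bar backtest loop corresponding to steps S71 to S74 is sketched below. It matches resting quotes against each bar's high and low, lets the policy decide greedily whether to quote, and reports cash profit and fill counts; end-of-day flattening and margin accounting are omitted, and the state columns, quoting rule and function name are assumptions.

```python
import pandas as pd
import torch

def backtest(policy, df: pd.DataFrame, state_cols, price_diff: float,
             transaction_fee: float) -> dict:
    """Simplified backtest of the trained student policy (steps S71-S74)."""
    open_orders, cash, fills = [], 0.0, 0
    for _, row in df.iterrows():
        # S73: match resting orders against the current bar's low/high
        still_open = []
        for side, price in open_orders:
            filled = (side == "buy" and row["low"] <= price) or \
                     (side == "sell" and row["high"] >= price)
            if filled:
                cash += (price if side == "sell" else -price) - transaction_fee
                fills += 1
            else:
                still_open.append((side, price))
        open_orders = still_open

        # S72-S73: push the current state to the policy; quote both sides if it chooses action 1
        state = torch.tensor(row[state_cols].to_numpy(dtype="float32"))
        with torch.no_grad():
            dist, _ = policy(state)
        if int(dist.probs.argmax()) == 1:
            mid = row["avg_price"]
            open_orders.append(("buy", mid - price_diff))
            open_orders.append(("sell", mid + price_diff))

    # S74: output the backtest result (residual exposure from unmatched legs is ignored here)
    return {"cash_pnl": cash, "fills": fills, "unfilled_orders": len(open_orders)}
```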
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the method can fully utilize imperfect market information to grasp proper market making time and fully consider risks brought by position accumulation, thereby obtaining price difference profit and promoting market liquidity. The method comprises the steps of preprocessing price information of a market, calculating corresponding technical indexes by using perfect market information and inputting the technical indexes as a reinforcement learning algorithm to train a teacher intelligent agent, guiding a training process of a student intelligent agent by using the obtained teacher intelligent agent, calculating corresponding technical indexes by using imperfect market information and inputting the technical indexes as the reinforcement learning algorithm, training the student intelligent agent under the guidance of the teacher intelligent agent, and finally making a decision whether to make a market according to historical information of the market by using the obtained student intelligent agent. And carrying out a retest test on the intelligent agent and outputting a result.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a market making method based on a teacher-student model and reinforcement learning includes the following steps:
S1: collecting historical market price data of the target futures, specifically historical market price data at a specific time level; the data table includes, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t, and the trading volume volume_t;
S2: performing data cleaning:
S21: removing redundant empty rows in the data table so that the rows are continuous;
S22: deduplicating on the specified field so that no duplicate data rows exist for the same time point t;
S23: removing data rows that contain missing values or outliers;
S3: calculating, from the cleaned data, the data required for training the teacher agent;
S4: training the teacher agent with a reinforcement learning algorithm;
S5: calculating, from the cleaned data, the data required for training the student agent;
S6: training the student agent with a reinforcement learning algorithm guided by the teacher agent;
S7: performing a backtest.
Example 2
As shown in fig. 1, a market making method based on a teacher-student model and reinforcement learning includes the following steps:
S1: collecting historical market price data of the target futures variety;
S2: performing data cleaning;
S3: calculating, from the cleaned data, the data required for training the teacher agent;
S4: training the teacher agent with a reinforcement learning algorithm;
S5: calculating, from the cleaned data, the data required for training the student agent;
S6: training the student agent with a reinforcement learning algorithm guided by the teacher agent;
S7: performing a backtest.
In step S1, historical market price data of the target futures at a specific time level are collected; the data table specifically includes, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t, and the trading volume volume_t.
The specific process of step S2 is:
S21: removing redundant empty rows in the data table so that the rows are continuous;
S22: deduplicating on the specified field so that no duplicate data rows exist for the same time point t;
S23: removing data rows that contain missing values or outliers.
The specific process of step S3 is:
S31: defining, for each time point t, a look-back interval w representing the length of the historical data considered at t, and a look-ahead interval w' representing the window that reflects the future price trend after t;
S32: calculating, from the collected data, the financial technical indicators used for the teacher part, the indicators used comprising: the w-duration exponential moving average (EMA) indicator, the w-duration Commodity Channel Index (CCI) indicator, the w-duration absolute price oscillator (APO) indicator, an indicator of whether a buy order placed at time point t can be filled within the future duration w', and an indicator of whether a sell order placed at time point t can be filled within the future duration w'.
The w-duration EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1)
The w-duration CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over the duration w, and MD is the average of the absolute differences between the closing price over the duration w and the MA value;
The w-duration APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S33: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S34: calculating, for each time point t in the data table, the average of the prices over the past n time units, i.e. a simple moving average, denoted avg_price_t;
S35: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S36: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S37: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S38: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
Example 3
As shown in fig. 1, the present invention provides a market making method based on a teacher-student model and reinforcement learning, comprising the following steps:
S1: selecting the target futures and collecting the corresponding time-level historical market price data; in this embodiment the selected data are minute-level price data of CSI 300 (Shanghai-Shenzhen 300) stock index futures from April 2010 to April 2021, including the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t, and the trading volume volume_t;
S2: carrying out data cleaning work, comprising the following steps:
s21: removing redundant empty rows in the data table to make the rows continuous;
s22: carrying out deduplication on the specific field to ensure that no repeated data rows exist at the same time point t;
s23: data lines where missing values and outliers exist are removed.
S3: calculating various data required for training the teacher agent for the data, comprising the steps of:
S31: defining the look-back interval of each time point t as 30 minutes, representing the length of the historical data considered at t, and defining the look-ahead interval of each time point t as 10 minutes, representing the window that reflects the future price trend after t;
S32: calculating, from the collected data, the financial technical indicators used for the teacher part, the indicators used comprising: the 30-minute exponential moving average (EMA) indicator, the 30-minute Commodity Channel Index (CCI) indicator, the 30-minute absolute price oscillator (APO) indicator, an indicator of whether a buy order placed at time point t can be filled within the next 10 minutes, and an indicator of whether a sell order placed at time point t can be filled within the next 10 minutes.
The 30-minute EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1), with w = 30
The 30-minute CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over 30 minutes, and MD is the average of the absolute differences between the closing price over 30 minutes and the MA value;
The 30-minute APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S33: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S34: calculating, for each time point t in the data table, the average of the prices over the past 30 minutes, i.e. a simple moving average, denoted avg_price_t;
S35: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S36: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S37: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S38: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
S4: training a teacher agent with a reinforcement learning algorithm, comprising the steps of:
S41: creating a virtual market-making environment, loading the collected minute-level historical market price data of CSI 300 stock index futures together with the required data calculated in step S3, taking the financial technical indicators as the state input of the reinforcement learning algorithm, and providing corresponding methods for initializing, reading and advancing the environment state;
S42: setting the action space of the teacher agent in reinforcement learning, where the action at each time point t takes the value 0 or 1: 0 means that no trade is made, and 1 means that market making is performed at time point t, i.e. a buy order at price buy_price_t and a sell order at price sell_price_t are simultaneously submitted to the market environment;
S43: setting the reward value given by the environment in reinforcement learning, taking the profit profit_t obtained by the decision at each time point t as the reward;
S44: the reinforcement learning algorithm used to train the teacher agent is the PPO (Proximal Policy Optimization) algorithm; the training of the agent policy is carried out from the state, the action and the reward, and the policy representing the agent in the PPO algorithm is chosen as a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation, where:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
S45: the specific process of using the PPO algorithm is as follows: first, the policy parameter φ of the agent is initialized and copied to the policy parameter φ_old that interacts with the environment; the policy with parameter φ_old interacts in the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function A_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, and is estimated with a V network together with the reward obtained by executing the action in the current state:
A_t = r_t + γV(s_(t+1)) - V(s_t)
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state drives the reward towards its maximum, so the policy parameter φ of the teacher agent can be adjusted by gradient descent to minimize the objective function, whose formula is:
L(φ) = L_policy(φ) + λL_value(φ)
L_value(φ) = E_t[||V_φ(s_t) - V_t||^2]
After a certain period, the policy parameter φ_old used for interacting with the environment is updated with the policy parameter φ. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the teacher agent, namely φ_old and φ, a Kullback-Leibler divergence term KL(π_φ_old || π_φ) is introduced into the objective function to measure the difference between the two distributions; it prevents the interaction policy and the teacher agent policy from diverging too much and makes the parameter updates smoother.
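To show how these pieces fit together, a training-loop sketch is given below; it assumes a virtual market-making environment object exposing reset() and step(action) methods that mirror the initialize / read / advance methods of step S41, and it reuses the MLPTanhPolicy class and ppo_teacher_update function from the sketch above. The rollout length, number of epochs and refresh schedule are illustrative.

```python
import torch

def train_teacher(env, state_dim: int, epochs: int = 50, steps_per_epoch: int = 4096,
                  updates_per_epoch: int = 10, lr: float = 3e-4):
    """Illustrative teacher training loop; `env` is an assumed virtual market-making environment
    exposing reset() -> state and step(action) -> (next_state, reward, done)."""
    policy = MLPTanhPolicy(state_dim)          # parameters phi (being learned)
    policy_old = MLPTanhPolicy(state_dim)      # parameters phi_old (interacting with the environment)
    policy_old.load_state_dict(policy.state_dict())
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

    for _ in range(epochs):
        # collect a rollout with the interaction policy phi_old
        buf = {"s": [], "a": [], "r": [], "s_next": []}
        s = env.reset()
        for _ in range(steps_per_epoch):
            with torch.no_grad():
                dist, _ = policy_old(torch.as_tensor(s, dtype=torch.float32))
            a = int(dist.sample())
            s_next, r, done = env.step(a)
            buf["s"].append(s); buf["a"].append(a); buf["r"].append(r); buf["s_next"].append(s_next)
            s = env.reset() if done else s_next

        batch = {k: torch.as_tensor(v, dtype=torch.float32) for k, v in buf.items()}
        for _ in range(updates_per_epoch):
            ppo_teacher_update(policy, policy_old, optimizer,
                               batch["s"], batch["a"].long(), batch["r"], batch["s_next"])
        # after the period, refresh the interaction parameters phi_old with phi
        policy_old.load_state_dict(policy.state_dict())
    return policy
```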
S5: calculating various data required for training student agents on the data, comprising the following steps:
S51: defining the look-back interval of each time point t as 30 minutes, representing the length of the historical data considered at t;
S52: calculating, from the collected data, the financial technical indicators used for the student part, the indicators used comprising: the 30-minute exponential moving average (EMA) indicator, the 30-minute Commodity Channel Index (CCI) indicator, and the 30-minute absolute price oscillator (APO) indicator.
The 30-minute EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1), with w = 30
The 30-minute CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over 30 minutes, and MD is the average of the absolute differences between the closing price over 30 minutes and the MA value;
The 30-minute APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S53: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S54: calculating, for each time point t in the data table, the average of the prices over the past 30 minutes, i.e. a simple moving average, denoted avg_price_t;
S55: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S56: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S57: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S58: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
S6: the method for training the student agent by using the teacher agent to guide the reinforcement learning algorithm comprises the following steps:
S61: creating a virtual market-making environment, loading the collected minute-level historical market price data of CSI 300 stock index futures together with the required data calculated in step S5, taking the financial technical indicators as the state input of the reinforcement learning algorithm, and providing corresponding methods for initializing, reading and advancing the environment state;
S62: setting the action space of the student agent in reinforcement learning, where the action at each time point t takes the value 0 or 1: 0 means that no trade is made, and 1 means that market making is performed at time point t, i.e. a buy order at price buy_price_t and a sell order at price sell_price_t are simultaneously submitted to the market environment;
S63: setting the reward value given by the environment in reinforcement learning, taking the profit profit_t obtained by the decision at each time point t as the reward;
S64: the reinforcement learning algorithm used to train the student agent is the PPO (Proximal Policy Optimization) algorithm; the training of the agent policy is carried out from the state, the action and the reward, and the policy representing the agent in the PPO algorithm is chosen as a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation, where:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
S65: the specific process of using the PPO algorithm is as follows: first, the policy φ of the teacher agent obtained by the reinforcement learning training is loaded so as to guide the subsequent reinforcement learning of the student agent; then the policy parameter θ of the student agent is initialized and copied to the policy parameter θ_old that interacts with the environment; the policy with parameter θ_old interacts in the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function A_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, and is estimated with a V network together with the reward obtained by executing the action in the current state:
A_t = r_t + γV(s_(t+1)) - V(s_t)
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state drives the reward towards its maximum, so the policy parameter θ of the student agent can be adjusted by gradient descent to minimize the objective function. The objective function of the student agent additionally introduces a negative log-likelihood loss between the student's decision and the teacher agent's decision: the larger the difference between the student's decision probability distribution and the teacher's decision probability distribution, the larger the negative log-likelihood loss, so this term can likewise be optimized by gradient descent. In summary, the objective function is:
L(θ) = L_policy(θ) + λL_value(θ) + μL_diff(θ)
L_value(θ) = E_t[||V_θ(s_t) - V_t||^2]
After a certain period, the policy parameter θ_old used for interacting with the environment is updated with the policy parameter θ. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the student agent, namely θ_old and θ, a Kullback-Leibler divergence term KL(π_θ_old || π_θ) is introduced into the objective function to measure the difference between the two distributions; it prevents the interaction policy and the student agent policy from diverging too much and makes the parameter updates smoother.
S7: performing a retest test comprising the steps of:
S71: loading the policy of the student agent obtained after the reinforcement learning training;
S72: importing minute-level market data of CSI 300 stock index futures for testing and pushing them to the policy;
S73: reading the data in order at the corresponding time level, matching currently open orders for fills, executing the corresponding trading instruction output by the policy to generate new orders, storing those orders for subsequent matching, and flattening positions before the close of each trading day;
S74: outputting the backtest result according to the final trading results.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.