Disclosure of Invention
The invention provides a market making method based on a teacher-student model and reinforcement learning, which makes full use of imperfect market information to identify suitable market-making moments and fully accounts for the risk brought by position accumulation, thereby earning the bid-ask spread and improving market liquidity.
To achieve the above technical effects, the technical solution of the invention is as follows:
A market making method based on a teacher-student model and reinforcement learning comprises the following steps:
S1: collecting historical market price data of the target futures variety;
S2: performing data cleaning;
S3: calculating, from the cleaned data, the data required for training the teacher agent;
S4: training the teacher agent with a reinforcement learning algorithm;
S5: calculating, from the cleaned data, the data required for training the student agent;
S6: training the student agent with a reinforcement learning algorithm guided by the teacher agent;
S7: performing a backtest.
Further, in step S1, historical market price data of the target futures at a specific time level are collected; the data table specifically includes, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t, and the trading volume volume_t.
Further, the specific process of step S2 is:
S21: removing redundant empty rows in the data table so that the rows are continuous;
S22: deduplicating on the specified field so that no duplicate data rows exist for the same time point t;
S23: removing data rows that contain missing values or outliers.
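A minimal sketch of the cleaning of steps S21 to S23 is given below, assuming the raw data are held in a pandas DataFrame with a time column named "t"; the column names and the simple outlier test are illustrative assumptions, not features fixed by the invention.

```python
import pandas as pd

def clean_market_data(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning routine for steps S21-S23 (column names are assumptions)."""
    # S21: drop fully empty rows so the remaining rows are contiguous
    df = df.dropna(how="all")
    # S22: deduplicate on the time-point field t, keeping the first occurrence
    df = df.drop_duplicates(subset="t", keep="first")
    # S23: remove rows with missing values or obvious outliers (non-positive prices here)
    df = df.dropna(subset=["open", "high", "low", "close", "volume"])
    df = df[(df[["open", "high", "low", "close"]] > 0).all(axis=1)]
    return df.reset_index(drop=True)
```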
Further, the specific process of step S3 is:
S31: defining, for each time point t, a look-back interval w representing the length of the historical data considered at t, and a look-ahead interval w' representing the window that reflects the future price trend after t;
S32: calculating, from the collected data, the financial technical indicators used for the teacher part, the indicators used comprising: the w-duration exponential moving average (EMA) indicator, the w-duration Commodity Channel Index (CCI) indicator, the w-duration absolute price oscillator (APO) indicator, an indicator of whether a buy order placed at time point t can be filled within the future duration w', and an indicator of whether a sell order placed at time point t can be filled within the future duration w'.
The w-duration EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1)
The w-duration CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over the duration w, and MD is the average of the absolute differences between the closing price over the duration w and the MA value;
The w-duration APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S33: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S34: calculating, for each time point t in the data table, the average of the prices over the past n time units, i.e. a simple moving average, denoted avg_price_t;
S35: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S36: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S37: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S38: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
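A compact pandas sketch of steps S32 to S38 follows. The EMA, CCI and APO computations follow the definitions above; the choice of the short-term APO window, the quoting of the bid and ask one tick away from the mid price, the simplified fill test against future highs and lows, and the profit expression are illustrative assumptions rather than formulas fixed by the invention.

```python
import numpy as np
import pandas as pd

def teacher_features(df: pd.DataFrame, w: int, w_prime: int, n: int,
                     price_diff: float, transaction_fee: float) -> pd.DataFrame:
    """Illustrative computation of steps S32-S38 (column names and quoting rule are assumptions)."""
    out = df.copy()
    close = out["close"]

    # S32: w-duration EMA, CCI and APO (APO: long-term MA minus short-term MA, per the text;
    # the short-term window w // 2 is an assumption; MD is an approximation of the CCI deviation)
    out["ema_w"] = close.ewm(span=w, adjust=False).mean()
    tp = (out["high"] + out["low"] + close) / 3.0
    ma = close.rolling(w).mean()
    md = (close - ma).abs().rolling(w).mean()
    out["cci_w"] = (tp - ma) / (0.015 * md)
    out["apo_w"] = ma - close.rolling(max(2, w // 2)).mean()

    # S33: linear (min-max) normalization of each indicator
    for col in ["ema_w", "cci_w", "apo_w"]:
        lo, hi = out[col].min(), out[col].max()
        out[col + "_norm"] = (out[col] - lo) / (hi - lo)

    # S34-S35: simple moving average of the past n bars as mid price; quotes one tick away (assumption)
    out["avg_price"] = close.rolling(n).mean()
    out["buy_price"] = out["avg_price"] - price_diff
    out["sell_price"] = out["avg_price"] + price_diff

    # S36-S37: can the bid/ask be filled within the next w_prime bars? (simplified fill test)
    fut_low = out["low"].shift(-1).rolling(w_prime).min().shift(-(w_prime - 1))
    fut_high = out["high"].shift(-1).rolling(w_prime).max().shift(-(w_prime - 1))
    out["buy_succ"] = (fut_low <= out["buy_price"]).astype(int)
    out["sell_succ"] = (fut_high >= out["sell_price"]).astype(int)

    # S38: profit of making the market at t; both legs filled earn the spread minus fees (assumption)
    spread = out["sell_price"] - out["buy_price"]
    out["profit"] = np.where((out["buy_succ"] == 1) & (out["sell_succ"] == 1),
                             spread - 2 * transaction_fee,
                             -transaction_fee)
    return out
```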
Further, the specific process of step S4 is:
S41: creating a virtual market-making environment, loading the collected historical market price data of the target futures at the specific time level together with the required data calculated in step S3, taking the financial technical indicators as the state input of the reinforcement learning algorithm, and providing corresponding methods for initializing, reading and advancing the environment state;
S42: setting the action space of the teacher agent in reinforcement learning, where the action at each time point t takes the value 0 or 1: 0 means that no trade is made, and 1 means that market making is performed at time point t, i.e. a buy order at price buy_price_t and a sell order at price sell_price_t are simultaneously submitted to the market environment;
S43: setting the reward value given by the environment in reinforcement learning, taking the profit profit_t obtained by the decision at each time point t as the reward;
S44: the reinforcement learning algorithm used to train the teacher agent is the PPO (Proximal Policy Optimization) algorithm, which trains the agent policy from the state, the action and the reward; the policy representing the agent in the PPO algorithm is chosen as a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation, where:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Further, the specific process of using the PPO algorithm in step S44 is as follows: first, the policy parameter φ of the agent is initialized and copied to the policy parameter φ_old that interacts with the environment; the policy with parameter φ_old interacts in the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function A_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, and is estimated with a V network together with the reward obtained by executing the action in the current state:
A_t = r_t + γV(s_(t+1)) - V(s_t)
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state drives the reward towards its maximum, so the policy parameter φ of the teacher agent can be adjusted by gradient descent to minimize the objective function, whose formula is:
L(φ) = L_policy(φ) + λL_value(φ)
L_value(φ) = E_t[||V_φ(s_t) - V_t||^2]
After a certain period, the policy parameter φ_old used for interacting with the environment is updated with the policy parameter φ. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the teacher agent, namely φ_old and φ, a Kullback-Leibler divergence term KL(π_φ_old || π_φ) is introduced into the objective function to measure the difference between the two distributions; it prevents the interaction policy and the teacher agent policy from diverging too much and makes the parameter updates smoother.
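A minimal PyTorch sketch of the policy network and of one PPO update step for the teacher agent is given below. It assumes a KL-penalty form of the PPO objective, a one-step advantage estimate A_t = r_t + γV(s_(t+1)) - V(s_t), and illustrative weights lam (λ) and beta; the names MLPTanhPolicy and ppo_teacher_update, the hidden size and the hyperparameter values are illustrative choices, not features prescribed by the invention.

```python
import torch
import torch.nn as nn

class MLPTanhPolicy(nn.Module):
    """Multilayer perceptron with tanh activations; a policy head and a V-network head."""
    def __init__(self, state_dim: int, hidden_dim: int = 64, n_actions: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden_dim, n_actions)  # action 0: no trade, 1: make market
        self.value_head = nn.Linear(hidden_dim, 1)           # V network for the advantage estimate

    def forward(self, state: torch.Tensor):
        h = self.body(state)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        value = self.value_head(h).squeeze(-1)
        return dist, value


def ppo_teacher_update(policy, policy_old, optimizer, s, a, r, s_next,
                       gamma: float = 0.99, lam: float = 0.5, beta: float = 0.01):
    """One gradient step minimizing L(phi) = L_policy + lam * L_value + beta * KL(pi_old || pi)."""
    with torch.no_grad():
        dist_old, v_s = policy_old(s)
        _, v_s_next = policy_old(s_next)
        v_target = r + gamma * v_s_next                        # bootstrapped value target V_t
        advantage = v_target - v_s                             # A_t = r_t + gamma*V(s_{t+1}) - V(s_t)
        logp_old = dist_old.log_prob(a)

    dist, value = policy(s)
    ratio = torch.exp(dist.log_prob(a) - logp_old)             # pi_phi(a|s) / pi_phi_old(a|s)
    l_policy = -(ratio * advantage).mean()                     # maximize advantage-weighted ratio
    l_value = ((value - v_target) ** 2).mean()                 # L_value = E[||V_phi(s_t) - V_t||^2]
    kl = torch.distributions.kl_divergence(dist_old, dist).mean()  # KL(pi_phi_old || pi_phi)

    loss = l_policy + lam * l_value + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After a chosen number of such update steps, the interaction parameters φ_old would be refreshed with φ, e.g. policy_old.load_state_dict(policy.state_dict()).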
Further, the specific process of step S5 is:
S51: defining, for each time point t, a look-back interval w representing the length of the historical data considered at t;
S52: calculating, from the collected data, the financial technical indicators used for the student part, the indicators used comprising: the w-duration exponential moving average (EMA) indicator, the w-duration Commodity Channel Index (CCI) indicator, and the w-duration absolute price oscillator (APO) indicator.
The w-duration EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1)
The w-duration CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over the duration w, and MD is the average of the absolute differences between the closing price over the duration w and the MA value;
The w-duration APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S53: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S54: calculating, for each time point t in the data table, the average of the prices over the past n time units, i.e. a simple moving average, denoted avg_price_t;
S55: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S56: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S57: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S58: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
Further, the specific process of step S6 is:
S61: creating a virtual market-making environment, loading the collected historical market price data of the target futures at the specific time level together with the required data calculated in step S5, taking the financial technical indicators as the state input of the reinforcement learning algorithm, and providing corresponding methods for initializing, reading and advancing the environment state;
S62: setting the action space of the student agent in reinforcement learning, where the action at each time point t takes the value 0 or 1: 0 means that no trade is made, and 1 means that market making is performed at time point t, i.e. a buy order at price buy_price_t and a sell order at price sell_price_t are simultaneously submitted to the market environment;
S63: setting the reward value given by the environment in reinforcement learning, taking the profit profit_t obtained by the decision at each time point t as the reward;
S64: the reinforcement learning algorithm used to train the student agent is the PPO (Proximal Policy Optimization) algorithm, which trains the agent policy from the state, the action and the reward; the policy representing the agent in the PPO algorithm is chosen as a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation, where:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Further, the specific process of using the PPO algorithm in step S64 is as follows: first, the policy φ of the teacher agent obtained by the reinforcement learning training is loaded so as to guide the subsequent reinforcement learning of the student agent; then the policy parameter θ of the student agent is initialized and copied to the policy parameter θ_old that interacts with the environment; the policy with parameter θ_old interacts in the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function A_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, and is estimated with a V network together with the reward obtained by executing the action in the current state:
A_t = r_t + γV(s_(t+1)) - V(s_t)
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state drives the reward towards its maximum, so the policy parameter θ of the student agent can be adjusted by gradient descent to minimize the objective function. The objective function of the student agent additionally introduces a negative log-likelihood loss between the student's decision and the teacher agent's decision: the larger the difference between the student's decision probability distribution and the teacher's decision probability distribution, the larger the negative log-likelihood loss, so this term can likewise be optimized by gradient descent. In summary, the objective function is:
L(θ) = L_policy(θ) + λL_value(θ) + μL_diff(θ)
L_value(θ) = E_t[||V_θ(s_t) - V_t||^2]
After a certain period, the policy parameter θ_old used for interacting with the environment is updated with the policy parameter θ. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the student agent, namely θ_old and θ, a Kullback-Leibler divergence term KL(π_θ_old || π_θ) is introduced into the objective function to measure the difference between the two distributions; it prevents the interaction policy and the student agent policy from diverging too much and makes the parameter updates smoother.
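A minimal PyTorch sketch of one teacher-guided PPO update for the student agent is given below. It reuses the MLPTanhPolicy network and the one-step advantage estimate from the teacher sketch, and it realises L_diff(θ) as the negative log-likelihood of the teacher's chosen action under the student policy; the teacher receives its own richer state s_teacher while the student receives only the historical-information state. The function name, the greedy choice of the teacher action, and the weights mu (μ) and beta are illustrative assumptions.

```python
import torch

def ppo_student_update(student, student_old, teacher, optimizer,
                       s_student, s_teacher, a, r, s_student_next,
                       gamma: float = 0.99, lam: float = 0.5,
                       mu: float = 0.1, beta: float = 0.01):
    """One gradient step on L(theta) = L_policy + lam*L_value + mu*L_diff + beta*KL(pi_theta_old || pi_theta)."""
    with torch.no_grad():
        dist_old, v_s = student_old(s_student)
        _, v_s_next = student_old(s_student_next)
        v_target = r + gamma * v_s_next                      # bootstrapped value target V_t
        advantage = v_target - v_s                           # A_t = r_t + gamma*V(s_{t+1}) - V(s_t)
        logp_old = dist_old.log_prob(a)
        teacher_dist, _ = teacher(s_teacher)                 # the teacher sees its richer (perfect-information) state
        teacher_action = teacher_dist.probs.argmax(dim=-1)   # the teacher's decision at each time point

    dist, value = student(s_student)
    ratio = torch.exp(dist.log_prob(a) - logp_old)           # pi_theta(a|s) / pi_theta_old(a|s)
    l_policy = -(ratio * advantage).mean()
    l_value = ((value - v_target) ** 2).mean()               # L_value = E[||V_theta(s_t) - V_t||^2]
    l_diff = -dist.log_prob(teacher_action).mean()           # negative log-likelihood of the teacher decision
    kl = torch.distributions.kl_divergence(dist_old, dist).mean()

    loss = l_policy + lam * l_value + mu * l_diff + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```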
Further, the specific process of step S7 is:
S71: loading the policy of the student agent obtained after the reinforcement learning training;
S72: importing market data for testing and pushing them to the policy;
S73: reading the data in order at the corresponding time level, matching currently open orders for fills, executing the corresponding trading instruction output by the policy to generate new orders, and storing those orders for subsequent matching;
S74: outputting the backtest result according to the final trading results.
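A simplified bar-by-bar backtest loop corresponding to steps S71 to S74 is sketched below. It matches resting quotes against each bar's high and low, lets the policy decide greedily whether to quote, and reports cash profit and fill counts; end-of-day flattening and margin accounting are omitted, and the state columns, quoting rule and function name are assumptions.

```python
import pandas as pd
import torch

def backtest(policy, df: pd.DataFrame, state_cols, price_diff: float,
             transaction_fee: float) -> dict:
    """Simplified backtest of the trained student policy (steps S71-S74)."""
    open_orders, cash, fills = [], 0.0, 0
    for _, row in df.iterrows():
        # S73: match resting orders against the current bar's low/high
        still_open = []
        for side, price in open_orders:
            filled = (side == "buy" and row["low"] <= price) or \
                     (side == "sell" and row["high"] >= price)
            if filled:
                cash += (price if side == "sell" else -price) - transaction_fee
                fills += 1
            else:
                still_open.append((side, price))
        open_orders = still_open

        # S72-S73: push the current state to the policy; quote both sides if it chooses action 1
        state = torch.tensor(row[state_cols].to_numpy(dtype="float32"))
        with torch.no_grad():
            dist, _ = policy(state)
        if int(dist.probs.argmax()) == 1:
            mid = row["avg_price"]
            open_orders.append(("buy", mid - price_diff))
            open_orders.append(("sell", mid + price_diff))

    # S74: output the backtest result (residual exposure from unmatched legs is ignored here)
    return {"cash_pnl": cash, "fills": fills, "unfilled_orders": len(open_orders)}
```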
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the method can fully utilize imperfect market information to grasp proper market making time and fully consider risks brought by position accumulation, thereby obtaining price difference profit and promoting market liquidity. The method comprises the steps of preprocessing price information of a market, calculating corresponding technical indexes by using perfect market information and inputting the technical indexes as a reinforcement learning algorithm to train a teacher intelligent agent, guiding a training process of a student intelligent agent by using the obtained teacher intelligent agent, calculating corresponding technical indexes by using imperfect market information and inputting the technical indexes as the reinforcement learning algorithm, training the student intelligent agent under the guidance of the teacher intelligent agent, and finally making a decision whether to make a market according to historical information of the market by using the obtained student intelligent agent. And carrying out a retest test on the intelligent agent and outputting a result.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a market making method based on a teacher-student model and reinforcement learning includes the following steps:
S1: collecting historical market price data of the target futures, specifically historical market price data at a specific time level; the data table includes, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t, and the trading volume volume_t;
S2: performing data cleaning:
S21: removing redundant empty rows in the data table so that the rows are continuous;
S22: deduplicating on the specified field so that no duplicate data rows exist for the same time point t;
S23: removing data rows that contain missing values or outliers;
S3: calculating, from the cleaned data, the data required for training the teacher agent;
S4: training the teacher agent with a reinforcement learning algorithm;
S5: calculating, from the cleaned data, the data required for training the student agent;
S6: training the student agent with a reinforcement learning algorithm guided by the teacher agent;
S7: performing a backtest.
Example 2
As shown in fig. 1, a market making method based on a teacher-student model and reinforcement learning includes the following steps:
S1: collecting historical market price data of the target futures variety;
S2: performing data cleaning;
S3: calculating, from the cleaned data, the data required for training the teacher agent;
S4: training the teacher agent with a reinforcement learning algorithm;
S5: calculating, from the cleaned data, the data required for training the student agent;
S6: training the student agent with a reinforcement learning algorithm guided by the teacher agent;
S7: performing a backtest.
In step S1, historical market price data of the target futures at a specific time level are collected; the data table specifically includes, for each time point t at that time level, the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t, and the trading volume volume_t.
The specific process of step S2 is:
S21: removing redundant empty rows in the data table so that the rows are continuous;
S22: deduplicating on the specified field so that no duplicate data rows exist for the same time point t;
S23: removing data rows that contain missing values or outliers.
The specific process of step S3 is:
S31: defining, for each time point t, a look-back interval w representing the length of the historical data considered at t, and a look-ahead interval w' representing the window that reflects the future price trend after t;
S32: calculating, from the collected data, the financial technical indicators used for the teacher part, the indicators used comprising: the w-duration exponential moving average (EMA) indicator, the w-duration Commodity Channel Index (CCI) indicator, the w-duration absolute price oscillator (APO) indicator, an indicator of whether a buy order placed at time point t can be filled within the future duration w', and an indicator of whether a sell order placed at time point t can be filled within the future duration w'.
The w-duration EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1)
The w-duration CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over the duration w, and MD is the average of the absolute differences between the closing price over the duration w and the MA value;
The w-duration APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S33: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S34: calculating, for each time point t in the data table, the average of the prices over the past n time units, i.e. a simple moving average, denoted avg_price_t;
S35: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S36: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S37: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S38: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
Example 3
As shown in fig. 1, the present invention provides a market making method based on a teacher-student model and reinforcement learning, comprising the following steps:
S1: selecting the target futures and collecting the corresponding time-level historical market price data; in this embodiment the selected data are minute-level price data of CSI 300 (Shanghai-Shenzhen 300) stock index futures from April 2010 to April 2021, including the opening price open_t, the highest price high_t, the lowest price low_t, the closing price close_t, and the trading volume volume_t;
S2: carrying out data cleaning work, comprising the following steps:
s21: removing redundant empty rows in the data table to make the rows continuous;
s22: carrying out deduplication on the specific field to ensure that no repeated data rows exist at the same time point t;
s23: data lines where missing values and outliers exist are removed.
S3: calculating various data required for training the teacher agent for the data, comprising the steps of:
S31: defining the look-back interval of each time point t as 30 minutes, representing the length of the historical data considered at t, and defining the look-ahead interval of each time point t as 10 minutes, representing the window that reflects the future price trend after t;
S32: calculating, from the collected data, the financial technical indicators used for the teacher part, the indicators used comprising: the 30-minute exponential moving average (EMA) indicator, the 30-minute Commodity Channel Index (CCI) indicator, the 30-minute absolute price oscillator (APO) indicator, an indicator of whether a buy order placed at time point t can be filled within the next 10 minutes, and an indicator of whether a sell order placed at time point t can be filled within the next 10 minutes.
The 30-minute EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1), with w = 30
The 30-minute CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over 30 minutes, and MD is the average of the absolute differences between the closing price over 30 minutes and the MA value;
The 30-minute APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S33: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S34: calculating, for each time point t in the data table, the average of the prices over the past 30 minutes, i.e. a simple moving average, denoted avg_price_t;
S35: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S36: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S37: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S38: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
S4: training a teacher agent with a reinforcement learning algorithm, comprising the steps of:
S41: creating a virtual market-making environment, loading the collected minute-level historical market price data of CSI 300 stock index futures together with the required data calculated in step S3, taking the financial technical indicators as the state input of the reinforcement learning algorithm, and providing corresponding methods for initializing, reading and advancing the environment state;
S42: setting the action space of the teacher agent in reinforcement learning, where the action at each time point t takes the value 0 or 1: 0 means that no trade is made, and 1 means that market making is performed at time point t, i.e. a buy order at price buy_price_t and a sell order at price sell_price_t are simultaneously submitted to the market environment;
S43: setting the reward value given by the environment in reinforcement learning, taking the profit profit_t obtained by the decision at each time point t as the reward;
S44: the reinforcement learning algorithm used to train the teacher agent is the PPO (Proximal Policy Optimization) algorithm; the training of the agent policy is carried out from the state, the action and the reward, and the policy representing the agent in the PPO algorithm is chosen as a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation, where:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
S45: the specific process of using the PPO algorithm is as follows: first, the policy parameter φ of the agent is initialized and copied to the policy parameter φ_old that interacts with the environment; the policy with parameter φ_old interacts in the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function A_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, and is estimated with a V network together with the reward obtained by executing the action in the current state:
A_t = r_t + γV(s_(t+1)) - V(s_t)
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state drives the reward towards its maximum, so the policy parameter φ of the teacher agent can be adjusted by gradient descent to minimize the objective function, whose formula is:
L(φ) = L_policy(φ) + λL_value(φ)
L_value(φ) = E_t[||V_φ(s_t) - V_t||^2]
After a certain period, the policy parameter φ_old used for interacting with the environment is updated with the policy parameter φ. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the teacher agent, namely φ_old and φ, a Kullback-Leibler divergence term KL(π_φ_old || π_φ) is introduced into the objective function to measure the difference between the two distributions; it prevents the interaction policy and the teacher agent policy from diverging too much and makes the parameter updates smoother.
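To show how these pieces fit together, a training-loop sketch is given below; it assumes a virtual market-making environment object exposing reset() and step(action) methods that mirror the initialize / read / advance methods of step S41, and it reuses the MLPTanhPolicy class and ppo_teacher_update function from the sketch above. The rollout length, number of epochs and refresh schedule are illustrative.

```python
import torch

def train_teacher(env, state_dim: int, epochs: int = 50, steps_per_epoch: int = 4096,
                  updates_per_epoch: int = 10, lr: float = 3e-4):
    """Illustrative teacher training loop; `env` is an assumed virtual market-making environment
    exposing reset() -> state and step(action) -> (next_state, reward, done)."""
    policy = MLPTanhPolicy(state_dim)          # parameters phi (being learned)
    policy_old = MLPTanhPolicy(state_dim)      # parameters phi_old (interacting with the environment)
    policy_old.load_state_dict(policy.state_dict())
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

    for _ in range(epochs):
        # collect a rollout with the interaction policy phi_old
        buf = {"s": [], "a": [], "r": [], "s_next": []}
        s = env.reset()
        for _ in range(steps_per_epoch):
            with torch.no_grad():
                dist, _ = policy_old(torch.as_tensor(s, dtype=torch.float32))
            a = int(dist.sample())
            s_next, r, done = env.step(a)
            buf["s"].append(s); buf["a"].append(a); buf["r"].append(r); buf["s_next"].append(s_next)
            s = env.reset() if done else s_next

        batch = {k: torch.as_tensor(v, dtype=torch.float32) for k, v in buf.items()}
        for _ in range(updates_per_epoch):
            ppo_teacher_update(policy, policy_old, optimizer,
                               batch["s"], batch["a"].long(), batch["r"], batch["s_next"])
        # after the period, refresh the interaction parameters phi_old with phi
        policy_old.load_state_dict(policy.state_dict())
    return policy
```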
S5: calculating various data required for training student agents on the data, comprising the following steps:
S51: defining the look-back interval of each time point t as 30 minutes, representing the length of the historical data considered at t;
S52: calculating, from the collected data, the financial technical indicators used for the student part, the indicators used comprising: the 30-minute exponential moving average (EMA) indicator, the 30-minute Commodity Channel Index (CCI) indicator, and the 30-minute absolute price oscillator (APO) indicator.
The 30-minute EMA indicator is calculated as:
EMA_t = (2/(w+1)) * close_t + (1 - 2/(w+1)) * EMA_(t-1), with w = 30
The 30-minute CCI indicator is calculated as:
CCI_t = (TP_t - MA_t) / (0.015 * MD_t)
wherein TP is the average of the highest price, the lowest price and the closing price at time point t, MA is the average of the closing price over 30 minutes, and MD is the average of the absolute differences between the closing price over 30 minutes and the MA value;
The 30-minute APO indicator is obtained by subtracting the short-term moving average from the long-term moving average;
S53: normalizing the calculated financial technical indicators with a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min)
wherein x_min and x_max are the minimum and maximum values of the indicator;
S54: calculating, for each time point t in the data table, the average of the prices over the past 30 minutes, i.e. a simple moving average, denoted avg_price_t;
S55: taking avg_price_t as the mid price, calculating for each time point t in the data table a bid quote buy_price_t and an ask quote sell_price_t from the mid price and the defined minimum price increment price_diff (tick size), where buy_price_t is the price of the buy order quoted at time point t and sell_price_t is the price of the sell order quoted at time point t;
S56: judging whether each time point t in the data table is a position-clearing time point, and recording, for every time point that is not, the earliest future position-clearing time exit_time_t;
S57: calculating, for each time point t in the data table, whether the buy order and the sell order can be filled before position clearing, recorded as buy_succ_t and sell_succ_t, where 1 denotes filled and 0 denotes not filled;
S58: taking into account that market making stops and positions are flattened before the close of each trading day, calculating the profit profit_t brought by the market-making action at each time point t from buy_succ_t, sell_succ_t, the quoted prices and the transaction fee,
wherein buy_succ_t and sell_succ_t respectively indicate whether the buy order and the sell order can be filled before the future position clearing, and transaction_fee denotes the commission fee of the transaction.
S6: the method for training the student agent by using the teacher agent to guide the reinforcement learning algorithm comprises the following steps:
S61: creating a virtual market-making environment, loading the collected minute-level historical market price data of CSI 300 stock index futures together with the required data calculated in step S5, taking the financial technical indicators as the state input of the reinforcement learning algorithm, and providing corresponding methods for initializing, reading and advancing the environment state;
S62: setting the action space of the student agent in reinforcement learning, where the action at each time point t takes the value 0 or 1: 0 means that no trade is made, and 1 means that market making is performed at time point t, i.e. a buy order at price buy_price_t and a sell order at price sell_price_t are simultaneously submitted to the market environment;
S63: setting the reward value given by the environment in reinforcement learning, taking the profit profit_t obtained by the decision at each time point t as the reward;
S64: the reinforcement learning algorithm used to train the student agent is the PPO (Proximal Policy Optimization) algorithm; the training of the agent policy is carried out from the state, the action and the reward, and the policy representing the agent in the PPO algorithm is chosen as a neural network composed of a multilayer perceptron and an activation function, i.e. a multilayer perceptron followed by a tanh activation, where:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
S65: the specific process of using the PPO algorithm is as follows: first, the policy φ of the teacher agent obtained by the reinforcement learning training is loaded so as to guide the subsequent reinforcement learning of the student agent; then the policy parameter θ of the student agent is initialized and copied to the policy parameter θ_old that interacts with the environment; the policy with parameter θ_old interacts in the virtual market-making environment to obtain the environment state s_t at the current time point t and the action a_t selected by the policy, from which an advantage function A_t is calculated. The advantage function represents the difference between the reward obtained by the action selected by the agent policy in the current environment state and the expected reward, and is estimated with a V network together with the reward obtained by executing the action in the current state:
A_t = r_t + γV(s_(t+1)) - V(s_t)
If the sign of the advantage function is positive, the action selected by the agent policy in the current environment state drives the reward towards its maximum, so the policy parameter θ of the student agent can be adjusted by gradient descent to minimize the objective function. The objective function of the student agent additionally introduces a negative log-likelihood loss between the student's decision and the teacher agent's decision: the larger the difference between the student's decision probability distribution and the teacher's decision probability distribution, the larger the negative log-likelihood loss, so this term can likewise be optimized by gradient descent. In summary, the objective function is:
L(θ) = L_policy(θ) + λL_value(θ) + μL_diff(θ)
L_value(θ) = E_t[||V_θ(s_t) - V_t||^2]
After a certain period, the policy parameter θ_old used for interacting with the environment is updated with the policy parameter θ. Because the algorithm maintains one set of policy parameters that interacts with the environment and another set that is learned for the student agent, namely θ_old and θ, a Kullback-Leibler divergence term KL(π_θ_old || π_θ) is introduced into the objective function to measure the difference between the two distributions; it prevents the interaction policy and the student agent policy from diverging too much and makes the parameter updates smoother.
S7: performing a retest test comprising the steps of:
S71: loading the policy of the student agent obtained after the reinforcement learning training;
S72: importing minute-level market data of CSI 300 stock index futures for testing and pushing them to the policy;
S73: reading the data in order at the corresponding time level, matching currently open orders for fills, executing the corresponding trading instruction output by the policy to generate new orders, storing those orders for subsequent matching, and flattening positions before the close of each trading day;
S74: outputting the backtest result according to the final trading results.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.