
CN113780574A - Intelligent agent reinforcement learning decision-making method and device, electronic equipment and storage medium thereof - Google Patents


Info

Publication number
CN113780574A
CN113780574A
Authority
CN
China
Prior art keywords
agent
reinforcement learning
action
intelligent agent
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110994653.6A
Other languages
Chinese (zh)
Inventor
梁斌
杨君
冷舒
芦维宁
陈章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110994653.6A
Publication of CN113780574A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract



The present application belongs to the technical field of intelligent decision-making and, in particular, relates to an agent reinforcement learning decision-making method and device, an electronic device, and a storage medium thereof. The method first establishes constraint equations from the multiple constraints of the decision problem, designs several simple examples that can be solved explicitly, solves them, and attaches an appropriate reward function, yielding a series of sparse-reward expert knowledge data. The expert knowledge data are placed in the replay buffer of a DQN, producing an improved replay buffer, the Ex-Replay buffer, that holds both expert knowledge data and environment learning data. After the agent interacts with the environment, the probability distribution over actions is fed to an action filtering module to obtain legal actions, and a confidence function decides whether to adopt the filtered action. An adaptive term added to the loss function adjusts how often the action filtering module is used. The data collection process of this method is more efficient and convenient, and the confidence function can select the best strategy for the agent for the corresponding task.


Description

Intelligent agent reinforcement learning decision-making method and device, electronic equipment and storage medium thereof
Technical Field
The application belongs to the technical field of intelligent decision making, and particularly relates to an intelligent agent reinforcement learning decision making method and device, electronic equipment and a storage medium thereof.
Background
With the gradual development of intelligent technologies, the demand for autonomous agent behavior decision-making keeps rising, and many fields, such as robotics, autonomous driving and intelligent transportation, now apply reinforcement-learning-based intelligent decision-making, while the decision problems themselves grow ever larger and more complex. On the one hand, these factors lay a solid foundation for agents to make complex behavioral decisions; on the other hand, they greatly increase the difficulty of learning a correct strategy. At present, decisions made by reinforcement learning generally need to adapt flexibly to various working conditions and guarantee the success rate of task execution; once a decision is wrong, great losses follow, and the lives of the people involved may even be endangered. Ensuring the diversity and robustness of the decisions made by the agent therefore has important social significance and can bring great economic and social benefits.
To guarantee the safety of intelligent decision-making and promote the development of intelligent technologies, more and more intelligent systems use reinforcement learning to help the agent learn various strategies, so that the agent can first learn in a simple scene under the guidance of reinforcement learning, laying the groundwork for intelligent decision-making in more complex scenes later. However, reinforcement learning is a method of continual trial and error, and real scenes are highly complex and offer incomplete information, so intelligent decision-making based entirely on reinforcement learning faces great challenges in practical applications. For model-based reinforcement learning, the main problem lies in the model itself: owing to the complexity of intelligent decision-making scenes, it is difficult to build a model fully isomorphic to the scene. For model-free reinforcement learning, when the decision problem is very complex, a large amount of data is needed to ensure the agent learns the corresponding strategy. The need for so much data has two main causes: 1) massive data fits the problem-solving model better; 2) the data contain many invalid, erroneous samples, which impair the agent's learning efficiency and results. In an actual intelligent decision-making task, however, the time and sampling costs of acquiring a large amount of data are high.
Disclosure of Invention
The present application is directed to solving at least the problems of the prior art. It is based on the inventors' knowledge and understanding of the following facts and problems: designing a reward function requires much effort when solving a decision problem with reinforcement learning. Complex decision problems have many constraints, and although the reward after the task has been executed is easy to judge, the rewards of the intermediate process are difficult to design in detail. If the design is improper, the agent learns wrong knowledge, the task fails, and in serious cases irreparable consequences arise. For such sparse-reward problems, a large number of invalid searches occur in the early learning stage; even after many wrong strategies are tried, no feedback from the reward function is obtained, and learning efficiency is extremely low. An action filter built on expert knowledge, which filters the generated actions and converts illegal actions into legal ones, is therefore an effective means of solving this problem. At present, reinforcement learning with sparse rewards remains a difficult problem.
In view of the above, the present disclosure provides an agent reinforcement learning decision method, an agent reinforcement learning decision device, an electronic device, and a storage medium thereof, so as to implement a reinforcement learning decision method based on confidence action filtering: expert knowledge data fill the replay buffer of the reinforcement learning algorithm, expert knowledge filters the learned actions to obtain an effective action, and a confidence degree determines whether the expert action is adopted, solving the key technical problem that sparse-reward reinforcement learning cannot learn efficiently.
According to a first aspect of the present disclosure, an agent reinforcement learning decision method is provided, including:
step 1, constructing an expert knowledge data set for an intelligent agent reinforcement learning decision;
step 2, constructing a playback buffer module of the reinforcement learning network;
step 3, training the reinforcement learning network to obtain an intelligent agent preliminary strategy and action;
step 4, setting an action filtering module, filtering the preliminary strategy of the intelligent agent, and determining the final action of the intelligent agent by using a confidence function.
The disclosure provides a reinforcement learning decision method for the sparse-reward setting. Compared with existing methods, it offers good universality and easy operability for solving complex-scene problems. The method has two main aspects. First, when constructing the expert training data set, the scene is simplified into a linear programming or dynamic programming problem, and a reward function is added to its solution to form the expert training data set. Second, when selecting the agent's final action, a simple confidence method ensures that the agent selects the optimal action satisfying the constraints.
Optionally, the constructing an expert knowledge data set for an agent reinforcement learning decision includes:
(1) for a scene in which a target equation and a constraint equation are linearly expressed, acquiring action information and state information of the intelligent agent by adopting a linear programming method, and for a scene needing iterative solution, acquiring the action information and the state information of the intelligent agent by adopting a dynamic programming method;
(2) according to the linear programming or dynamic programming method, obtaining, in N different simple scenes, the current state information s of the agent, the action information a, the state information s' at the next moment, and the flag d indicating whether the task has terminated, and forming this information into the knowledge data set {(s_i, a_i, s'_i, d_i)}, i = 1, …, N,
where i indexes the information of the agent's decision in the i-th scene;
(3) for each item of data in the knowledge data set, setting a reward function r_i (the explicit form of r_i is given in the original only as an image);
(4) merging the reward function r_i with the set above to construct the expert knowledge data set {(s_i, a_i, r_i, s'_i, d_i)}, i = 1, …, N, for the agent's reinforcement learning decision.
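As a concrete illustration of steps (1)-(4), the following toy sketch solves a one-dimensional chain task by dynamic programming (value iteration) and attaches a sparse 0/1 reward to each resulting transition. The task, the reward scheme, and all names are hypothetical illustrations, not taken from the patent.

```python
# Hypothetical sketch: build expert knowledge tuples (s, a, r, s', d)
# by solving a toy 1-D chain task with dynamic programming, then
# attaching a sparse reward (only the terminal step pays out).

GOAL = 4  # terminal state of the toy task

def value_iteration(n_states=5, goal=GOAL):
    """Optimal move per state on a 1-D chain: step left (-1) or right (+1)."""
    V = [0.0] * n_states
    for _ in range(n_states):  # enough sweeps to converge on a short chain
        for s in range(n_states):
            if s == goal:
                continue
            left, right = max(s - 1, 0), min(s + 1, n_states - 1)
            V[s] = max(-1 + V[left], -1 + V[right])  # unit step cost
    policy = {}
    for s in range(n_states):
        if s == goal:
            continue
        left, right = max(s - 1, 0), min(s + 1, n_states - 1)
        policy[s] = -1 if V[left] > V[right] else 1
    return policy

def build_expert_dataset():
    """Roll out the planned policy once, emitting (s, a, r, s', d) tuples."""
    policy = value_iteration()
    data, s = [], 0
    while s != GOAL:
        a = policy[s]
        s_next = s + a
        d = s_next == GOAL          # termination flag
        r = 1.0 if d else 0.0       # sparse reward: goal only
        data.append((s, a, r, s_next, d))
        s = s_next
    return data

expert_data = build_expert_dataset()
```

Under these assumptions the rollout from state 0 yields four transitions, only the last of which carries a reward, matching the sparse-reward character of the expert data described above.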
Optionally, the replay buffer module R of the reinforcement learning network comprises two parts. The first part is a fixed list R_E used to store the expert knowledge data; it accounts for 30% of the length of the replay buffer module and is called the expert data replay buffer. The second part is a first-in first-out queue R_env used to store the samples the agent collects from the environment, called the environment data replay buffer; this queue is empty at initialization and, once full, is updated first-in first-out.
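A minimal sketch of such a two-part replay buffer, assuming a deque-backed FIFO queue and the 30%/70% split described in the text; the class name and layout are illustrative, not the patent's code.

```python
import random
from collections import deque

class ExReplayBuffer:
    """Hypothetical Ex-Replay buffer: a fixed expert list holding 30% of the
    capacity plus a FIFO queue for environment samples."""

    def __init__(self, capacity, expert_data, expert_frac=0.3):
        n_expert = int(capacity * expert_frac)
        self.expert = list(expert_data)[:n_expert]    # fixed expert list
        self.env = deque(maxlen=capacity - n_expert)  # FIFO environment queue

    def add(self, transition):
        # deque(maxlen=...) discards the oldest entry once full,
        # giving first-in first-out replacement for free.
        self.env.append(transition)

    def sample(self, batch_size, expert_frac=0.3):
        # 30% of each training batch from expert data, 70% from env data
        n_exp = min(int(batch_size * expert_frac), len(self.expert))
        batch = random.sample(self.expert, n_exp)
        batch += random.sample(list(self.env), min(batch_size - n_exp, len(self.env)))
        return batch
```

With capacity 10, for example, the expert list holds 3 transitions and the queue at most 7, so a full batch of 10 mixes 3 expert and 7 environment samples.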
Optionally, the training of the reinforcement learning network to obtain the agent's preliminary strategy and action includes the following steps:
(1) in the complex environment problem, the agent collects samples from the environment multiple times, obtains at least one batch of environment data of size B, and stores the environment data in the first-in first-out queue R_env;
(2) training samples are randomly drawn from the fixed list R_E and the first-in first-out queue R_env to form training data of batch size B, with 30% of the training data coming from R_E and 70% from R_env; the training data are input to the reinforcement learning algorithm, which outputs the agent's preliminary strategy π and preliminary action a_p, where π is a probability distribution over the agent's preliminary actions and a_p = argmax_a π(a).
optionally, the setting an action filtering module filters the preliminary policy of the agent, and determines the final action of the agent by using a confidence function, including:
(1) setting a motion filtering module FafThe action filtering module expresses the limiting conditions of the complex environment problem as an inequality equation set and the intelligent agent preliminary strategy
Figure BDA00032334558600000318
The output is used as the input of the action filtering module to obtain the filtering action of the intelligent agent
Figure BDA00032334558600000319
(2) Setting a confidence function fc
Preliminary actions from the agent
Figure BDA00032334558600000320
With filtering actions by agent
Figure BDA00032334558600000321
Selecting a final action a of the agent, and training a subsequent network by using the final action;
(3) setting a loss function L, training the reinforcement learning network by using an Adam algorithm with the aim of minimizing the loss function until the loss function is converged to obtain the trained reinforcement learning network;
wherein beta is a hyperparameter, FafThe motion filtering module, | ·| non-woven phosphor2Is a two-norm, i is the serial number of the training data, B is the B pieces of data of the batch, and L 'is the loss function of the initial reinforcement learning algorithm and is defined as L' ((r + gamma. max (Q (s ', a'))) -Q (s, a))2Wherein s, a, r represent the current state information, action information and reward function of the agent, s ', a' represent the state information and action information of the agent at the next time, γ ═ 0.9, Q (s, a) is a variable related to the state information and action information, and is generated by the neural network, (·)2Is a square operator;
(4) and taking the state information of each step in the task as the input of the trained reinforcement learning network to obtain the final action of the intelligent agent.
According to the second aspect of the present disclosure, an intelligent agent reinforcement learning decision device is further provided, including:
the expert knowledge data set construction module is used for constructing an expert knowledge data set for an intelligent agent reinforcement learning decision;
the playback buffer construction module is used for constructing the playback buffer module of the reinforcement learning network;
the intelligent agent preliminary action generation module is used for training the reinforcement learning network to obtain an intelligent agent preliminary strategy and action;
and the intelligent agent final action generating module is used for setting an action filtering module, filtering the intelligent agent preliminary strategy and determining the intelligent agent final action by utilizing the confidence function.
According to a third aspect of the present disclosure, an electronic device is presented, comprising:
a memory for storing processor-executable instructions;
a processor configured to perform any of the agent reinforcement learning decision methods of claims 1-5.
According to a fourth aspect of the present disclosure, a computer-readable storage medium is proposed, having stored thereon a computer program for causing a computer to execute:
constructing an expert knowledge data set for an agent reinforcement learning decision;
constructing a playback buffer module of the reinforcement learning network;
training the reinforcement learning network to obtain an intelligent agent preliminary strategy and action;
and setting an action filtering module, filtering the initial strategy of the intelligent agent, and determining the final action of the intelligent agent by using a confidence function.
When solving sparse-reward reinforcement learning problems in complex scenes, a demonstration data set, that is, an expert knowledge data set, is generally constructed. The usual idea is to collect trajectory information consisting of the complete series of state and action information the agent needs to complete the task. The expert knowledge data set here does not require many complete trajectories: after simplifying the complex scene, it mainly adopts the results of linear programming and dynamic programming as expert knowledge data, so data collection is more efficient and convenient. In addition, the optimality of the planning results is theoretically guaranteed, and the design of the reward function becomes very simple. The method also introduces a confidence function that compares the agent's preliminary action, generated by the reinforcement learning neural network, with the agent's filtered action, generated by the action filtering module, to determine the agent's final action. Other algorithms treat the filtered action as optimal, but after sufficient exploration reinforcement learning may learn preliminary actions that are better than the filtered ones. The confidence function can therefore select the best strategy for the agent for the corresponding task.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be derived from those drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow diagram illustrating a method for intelligent agent reinforcement learning decision-making according to one embodiment of the present disclosure.
Fig. 2 is a schematic diagram illustrating an overall scheme of a reinforcement learning decision method according to an embodiment of the present disclosure.
Fig. 3 is a block diagram illustrating the structure of an agent reinforcement learning decision device according to an embodiment of the present disclosure.
Fig. 4 is a diagram illustrating a defensive scene according to one embodiment of the present disclosure.
Fig. 5 is a top view of the defensive scene of fig. 4.
FIG. 6 is a diagram illustrating training results, according to one embodiment of the present disclosure.
FIG. 7 is a graph illustrating test results according to one embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of an intelligent agent reinforcement learning decision method according to an embodiment of the present disclosure, which may be applied to a user device, such as a mobile phone, a tablet computer, and the like.
As shown in fig. 1 and fig. 2, the intelligent agent reinforcement learning decision method may include the following steps:
in step 1, an expert knowledge data set for an agent reinforcement learning decision is constructed.
In a specific embodiment, constructing the expert knowledge data set for the agent's reinforcement learning decision, built from the action information and state information in the simple scenes together with the manually set reward function, includes:
(1) for a scene in which a target equation and a constraint equation are linearly expressed, acquiring action information and state information of the intelligent agent by adopting a linear programming method, and for a scene needing iterative solution, acquiring the action information and the state information of the intelligent agent by adopting a dynamic programming method;
(2) according to the linear programming or dynamic programming method, obtaining, in N different simple scenes, the current state information s of the agent, the action information a, the state information s' at the next moment, and the flag d indicating whether the task has terminated, and combining this information into the knowledge data set {(s_i, a_i, s'_i, d_i)}, i = 1, …, N,
where i indexes the information of the agent's decision in the i-th scene;
(3) for each item of data in the knowledge data set, setting a reward function r_i (the explicit form of r_i is given in the original only as an image);
(4) merging the reward function r_i with the set above to construct the expert knowledge data set {(s_i, a_i, r_i, s'_i, d_i)}, i = 1, …, N, for the agent's reinforcement learning decision.
In step 2, a playback buffer module of the reinforcement learning network is constructed, wherein one part of the module stores the expert knowledge data set, and the other part stores data of samples collected by the intelligent agent from the environment during network training.
In one embodiment, the replay buffer module R of the reinforcement learning network includes two parts. The first part is a fixed list R_E used to store the expert knowledge data; it accounts for 30% of the length of the replay buffer module and is called the expert data replay buffer. The second part is a first-in first-out queue R_env used to store the samples the agent collects from the environment, called the environment data replay buffer; this queue is empty at initialization and, once full, is updated first-in first-out.
And step 3, training the reinforcement learning network to obtain the initial strategy and action of the intelligent agent.
In one embodiment, the training of the reinforcement learning network to obtain the initial strategy and action of the agent includes the following steps:
(1) in the complex environment problem, the agent collects samples from the environment multiple times, obtains at least one batch of environment data of size B, and stores the environment data in the first-in first-out queue R_env;
(2) training samples are randomly drawn from the fixed list R_E and the first-in first-out queue R_env to form training data of batch size B, with 30% of the training data coming from R_E and 70% from R_env; the training data are input to the reinforcement learning algorithm, which outputs the agent's preliminary strategy π and preliminary action a_p, where π is a probability distribution over the agent's preliminary actions and a_p = argmax_a π(a); here x_0 = argmax f(x) means that the function f(x) attains its maximum value at x = x_0.
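The argmax relation can be shown with a minimal, hypothetical example: the preliminary action is simply the index at which the (here hand-written) action distribution peaks.

```python
# Illustrative only: pick the preliminary action as the argmax of a
# hypothetical probability distribution over three discrete actions.
policy_probs = [0.1, 0.6, 0.3]
a_preliminary = max(range(len(policy_probs)), key=policy_probs.__getitem__)
```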
In step 4, an action filtering module is set; the agent's preliminary strategy is filtered according to the constraints of the agent's decision task to obtain the agent's filtered action, and the agent's final action is determined with a confidence function.
In one embodiment, setting the action filtering module, filtering the agent's preliminary strategy, and determining the agent's final action with a confidence function includes:
(1) setting an action filtering module F_af: the action filtering module expresses the constraints of the complex environment problem as a system of inequalities, and the output of the agent's preliminary strategy π serves as the input of the action filtering module, yielding the agent's filtered action a_f;
(2) setting a confidence function f_c of the form p(a_p ∈ A) ≥ 1 − α, where p(·) denotes probability, α is the manually set significance level, and A is the set of agent actions that do not violate the system constraints and perform better than the filtered action a_f; from the agent's preliminary action a_p and the agent's filtered action a_f, the final action a of the agent is selected, and the final action is used to train the subsequent network;
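A minimal sketch of these two steps, assuming a one-dimensional continuous action and a box constraint as the inequality system: the filter projects the preliminary action into the feasible interval, and a simplified confidence rule defaults to the filtered action, keeping the learned action only when its estimated value is clearly better. The interval, the value function q, and the margin alpha are illustrative stand-ins, not the patent's exact confidence test.

```python
def action_filter(a, lo=-1.0, hi=1.0):
    """Hypothetical F_af: project the action onto the feasible interval [lo, hi]."""
    return min(max(a, lo), hi)

def choose_final_action(a_pre, q, alpha=0.05, lo=-1.0, hi=1.0):
    """Simplified stand-in for the confidence function f_c: default to the
    filtered action, keeping the preliminary one only when its estimated
    value beats the filtered action's by more than a margin alpha."""
    a_filt = action_filter(a_pre, lo, hi)
    if q(a_pre) > q(a_filt) + alpha:
        return a_pre
    return a_filt
```

This matches the narrative above: early in training the filtered action usually wins, but once exploration has taught the network a clearly better preliminary action, the confidence rule lets it through.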
(3) setting a loss function L = L' + (β/B)·Σ_{i=1}^{B} ||a_i − F_af(a_i)||₂², and training the reinforcement learning network with the Adam algorithm, aiming to minimize the loss function, until the loss function converges, obtaining the trained reinforcement learning network;
where β is a hyperparameter that can change adaptively during training of the learning algorithm, F_af is the action filtering module, ||·||₂ is the two-norm, i is the index of the training data, B is the number of items in the batch, and L' is the loss function of the original reinforcement learning algorithm, defined as L' = ((r + γ·max_{a'} Q(s', a')) − Q(s, a))², where s, a and r denote the agent's current state information, action information and reward, s' and a' denote the state and action information at the next moment, γ = 0.9, Q(s, a) is a variable depending on the state and action information, generated by the neural network, and (·)² is the square operator;
(4) the state information of each step in the task is used as the input of the trained reinforcement learning network to obtain the agent's final action.
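The loss in step (3) can be sketched numerically as follows, with a tabular Q in place of the neural network and a hand-set β in place of the adaptively changing weight. The discrete states and actions, the batch contents, and the exact placement of the β term are illustrative assumptions.

```python
GAMMA = 0.9  # discount factor, as specified above

def base_loss(Q, s, a, r, s_next):
    """L' = ((r + gamma * max_a' Q(s', a')) - Q(s, a))^2 for one transition."""
    target = r + GAMMA * max(Q[s_next].values())
    return (target - Q[s][a]) ** 2

def total_loss(Q, batch, beta, filter_fn):
    """Mean over the batch of L' plus beta times the squared gap between the
    preliminary action and its filtered version (treating the discrete
    action labels numerically, purely for illustration)."""
    total = 0.0
    for s, a, r, s_next in batch:
        total += base_loss(Q, s, a, r, s_next) + beta * (a - filter_fn(a)) ** 2
    return total / len(batch)
```

When the filter leaves every action unchanged the β term vanishes and the loss reduces to the plain DQN objective; the larger the gap between learned and filtered actions, the more the adaptive term pushes the network back toward legal behavior.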
The disclosure provides a reinforcement learning decision method for the sparse-reward setting. Compared with existing methods, it offers good universality and easy operability for solving complex-scene problems. The method has two main aspects. First, when constructing the expert training data set, the scene is simplified into a linear programming or dynamic programming problem, and a reward function is added to its solution to form the expert training data set. Second, when selecting the agent's final action, a simple confidence method ensures that the agent selects the optimal action satisfying the constraints.
Corresponding to the intelligent agent reinforcement learning decision method, the intelligent agent reinforcement learning decision device is further provided by the disclosure.
Fig. 3 is a schematic diagram of an intelligent agent reinforcement learning decision device according to an embodiment of the present disclosure, and as shown in fig. 3, the intelligent agent reinforcement learning decision device includes:
the expert knowledge data set construction module is used for constructing an expert knowledge data set for an intelligent agent reinforcement learning decision;
the playback buffer construction module is used for constructing the playback buffer module of the reinforcement learning network;
the intelligent agent preliminary action generation module is used for training the reinforcement learning network to obtain an intelligent agent preliminary strategy and action;
and the intelligent agent final action generating module is used for setting an action filtering module, filtering the intelligent agent preliminary strategy according to the constraint in the decision task required by the intelligent agent to obtain the intelligent agent filtering action, and determining the intelligent agent final action by utilizing the confidence function.
An embodiment of the present disclosure also provides an electronic device, including:
a memory for storing processor-executable instructions;
a processor configured to perform:
constructing an expert knowledge data set for an agent reinforcement learning decision;
constructing a playback buffer module of the reinforcement learning network;
training the reinforcement learning network to obtain an intelligent agent preliminary strategy and action;
and setting an action filtering module, filtering the initial strategy of the intelligent agent, and determining the final action of the intelligent agent by using a confidence function.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a computer program for causing a computer to execute:
constructing an expert knowledge data set for an agent reinforcement learning decision;
constructing a playback buffer module of the reinforcement learning network;
training the reinforcement learning network to obtain an intelligent agent preliminary strategy and action;
and setting an action filtering module, filtering the initial strategy of the intelligent agent, and determining the final action of the intelligent agent by using a confidence function.
When solving sparse-reward reinforcement learning problems in complex scenes, a demonstration data set — that is, an expert knowledge data set — is typically constructed. The main idea is to collect trajectory information formed by the complete sequence of state information and action information required for the agent to complete the task. The expert knowledge data set here does not require many complete trajectories: after the complex scene is simplified, the results of linear programming and dynamic programming are used as expert knowledge data, which makes data collection more efficient and convenient. Moreover, the optimality of the planning result is theoretically guaranteed, and the design of the reward function becomes very simple. The method also introduces a confidence function, which compares the agent preliminary action generated by the reinforcement learning neural network with the agent filtered action generated by the action filtering module and determines the agent final action. Other algorithms treat the agent filtered action as optimal, but after sufficient exploration, reinforcement learning may learn an agent preliminary action that outperforms the agent filtered action. The confidence function therefore allows the agent to select the best strategy for the corresponding task.
It should be noted that, in the embodiments of the present disclosure, the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory may be used to store the computer program and/or modules, and the processor realizes the various functions of the agent reinforcement learning decision device by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to use of the device. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. If the modules/units of the agent reinforcement learning decision device are realized in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
Based on this understanding, all or part of the flow of the above method embodiments of the present disclosure may be implemented by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the device embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, in the drawings of the device embodiments provided by the present disclosure, a connection between modules indicates a communication connection between them, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
The present disclosure is described in detail below with reference to the drawings and examples.
A game of defending against aircraft attacks is used as the object to verify the expert-knowledge-driven decision method for the sparse reward problem in deep reinforcement learning.
The expert-knowledge-driven decision flow for the sparse reward problem in deep reinforcement learning is shown in fig. 1 and mainly includes:
(1) Constructing the expert knowledge data set: the complex-scene agent defense problem, in which the aircraft may fly arbitrarily, is simplified into a linear programming problem in which the aircraft fly in straight lines, or a dynamic programming problem in which they fly in piecewise segments. From the planning algorithm, a knowledge data set

D = {(s_i, a_i, s'_i, d_i)}, i = 1, …, N

can be determined, and each piece of data in the knowledge data set is assigned a reward function r_i according to human experience, so that the knowledge data set becomes the expert knowledge data set

D_E = {(s_i, a_i, r_i, s'_i, d_i)}, i = 1, …, N.
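As a concrete illustration of step (1), the sketch below builds a tiny expert data set for a simplified scene in which one target flies straight toward a weapon, so the optimal firing step can be solved exactly and a hand-designed reward attached. The function names (`optimal_fire_step`, `build_expert_dataset`) and the reward values are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch: a straight-flying target is simplified so the optimal
# firing step can be solved exactly, and the resulting trajectory is annotated
# with a hand-designed reward to form D_E = {(s, a, r, s', d)}.

def optimal_fire_step(target_pos, target_speed, weapon_range):
    """Earliest step at which the straight-flying target enters weapon range."""
    step = 0
    pos = target_pos
    while pos > weapon_range:        # target closes in linearly each step
        pos -= target_speed
        step += 1
    return step

def build_expert_dataset(scenarios, weapon_range=3.0):
    dataset = []
    for pos, speed in scenarios:
        fire_t = optimal_fire_step(pos, speed, weapon_range)
        s = pos
        for t in range(fire_t + 1):
            a = 1 if t == fire_t else 0          # 1 = fire, 0 = hold
            s_next = s - speed
            done = (a == 1)
            r = 10.0 if a == 1 else -0.1         # hand-designed reward r_i
            dataset.append((s, a, r, s_next, done))
            s = s_next
    return dataset

# Two simplified scenes (initial position, speed) yield 10 expert transitions.
expert = build_expert_dataset([(10.0, 2.0), (7.0, 1.0)])
```

Because the planning solution is exact for the simplified scene, each trajectory is optimal by construction, which is what makes the reward design straightforward.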
(2) Constructing the improved playback buffer module: all data of the expert knowledge data set are placed in the expert data playback buffer, and the sample data collected by the agent from the environment are placed in the environment data playback buffer;
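The two-part buffer described in step (2) can be sketched as follows: a fixed expert list occupying 30% of capacity plus a FIFO queue for environment samples, with mixed sampling. The class name `SplitReplayBuffer` and the tuple layout are illustrative assumptions.

```python
import random
from collections import deque

# Hypothetical sketch of the improved playback buffer module: a fixed expert
# list (30% of capacity) plus a first-in first-out environment queue; sampling
# draws 30% expert data and 70% environment data.

class SplitReplayBuffer:
    def __init__(self, capacity, expert_data, expert_frac=0.3):
        self.expert = list(expert_data)[: int(capacity * expert_frac)]  # fixed list
        self.env = deque(maxlen=capacity - len(self.expert))            # FIFO queue

    def push(self, transition):
        self.env.append(transition)   # only environment samples are appended

    def sample(self, batch_size, expert_frac=0.3):
        n_exp = int(batch_size * expert_frac)
        batch = random.sample(self.expert, min(n_exp, len(self.expert)))
        n_env = min(batch_size - len(batch), len(self.env))
        batch += random.sample(list(self.env), n_env)
        return batch

# Usage: 300 expert transitions, 700 environment transitions, batch of 128.
buf = SplitReplayBuffer(1000, [("e", i) for i in range(300)])
for i in range(700):
    buf.push(("s", i))
batch = buf.sample(128)   # 38 expert + 90 environment transitions
```

Keeping the expert list fixed while the environment queue rotates mirrors the design in which expert knowledge is never evicted, while stale environment samples are.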
(3) Acquiring the agent preliminary action: 128 samples are randomly drawn from the playback buffer module as training data, of which 30% are expert knowledge data and 70% are environment data. The samples are taken as the input of a DDQN network to obtain the agent preliminary strategy and the agent preliminary action;
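The Double-Q update behind the DDQN network in step (3) can be illustrated with a tabular stand-in: the online estimate selects the next action and the target estimate evaluates it. The patent uses a neural network; the dictionary Q-tables and function names below are illustrative assumptions only.

```python
# Hypothetical tabular sketch of the Double DQN update: the online table
# selects the greedy next action, the target table evaluates it.

def ddqn_update(q_online, q_target, batch, alpha=0.1, gamma=0.99):
    for s, a, r, s2, done in batch:
        if done:
            target = r
        else:
            a_star = max(q_online[s2], key=q_online[s2].get)  # online net selects
            target = r + gamma * q_target[s2][a_star]         # target net evaluates
        q_online[s][a] += alpha * (target - q_online[s][a])

def preliminary_action(q_online, s):
    """Greedy agent preliminary action under the current online estimate."""
    return max(q_online[s], key=q_online[s].get)

# Toy batch mixing a non-terminal and a terminal transition.
q_online = {"s0": {0: 0.0, 1: 0.0}, "s1": {0: 1.0, 1: 0.0}}
q_target = {"s1": {0: 0.5, 1: 0.0}}
batch = [("s0", 1, 2.0, "s1", False), ("s0", 0, 0.0, "s1", True)]
ddqn_update(q_online, q_target, batch)
```

Decoupling action selection from action evaluation is what distinguishes DDQN from vanilla DQN and reduces the overestimation of Q-values.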
(4) Acquiring the agent final action: an action filtering module is constructed according to the constraint conditions of the defense system, such as weapon ammunition count and weapon firing range. The agent preliminary strategy is taken as the input of the action filtering module to obtain the agent filtered action. During learning, a set of agent actions is continuously built in which, without violating the system constraints, the agent preliminary action outperforms the agent filtered action, and the agent final action is determined using the confidence function.
Acquiring the defended target state comprises: the coordinates of the defended target;
acquiring the defensive weapon state comprises: the coordinates of the three weapons, the remaining ammunition of each weapon, and the weapon firing range, wherein the relationship between the defended target and the weapons is shown in fig. 4 and fig. 5.
Acquiring the aircraft state comprises: the current positions and headings of all aircraft, each of which flies in a straight line or along a piecewise-fixed curve.
Weapon launch moments for multiple aircraft flying straight from different initial positions are calculated by linear programming or dynamic programming, and the results, together with the designed reward function, are placed in the Ex-replay-buffer as expert knowledge data. Actions generated by the agent's interaction with the environment are filtered by the filtering function produced by the expert system; after the weapon actions interact with the environment, the resulting data are stored in the environment data playback buffer. During training, batch training data are obtained by randomly sampling the two buffers in proportion, and the training data are input into the reinforcement learning neural network to obtain the weapons' preliminary actions.
An action filtering module is constructed according to constraints such as weapon range; the preliminary action is input into the action filtering module to obtain the filtered action, and the final action is determined using the confidence function.
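The filter-then-arbitrate step above can be sketched as follows. The feasibility test (ammunition left, target inside firing range) follows the constraints named in the text, but the patent does not specify the form of the confidence function, so the ratio-based confidence and the 0.9 threshold below are assumptions for illustration.

```python
# Hypothetical sketch: mask infeasible actions (constraint inequalities), then
# arbitrate between the preliminary and filtered action with a confidence test.
# The confidence formula and threshold are illustrative assumptions.

def action_filter(q_values, ammo, dist, weapon_range):
    """Return the best action that satisfies the system constraints."""
    feasible = {a: q for a, q in q_values.items()
                if a == 0 or (ammo > 0 and dist <= weapon_range)}  # 0 = hold fire
    return max(feasible, key=feasible.get)

def final_action(q_values, a_pre, a_filt, threshold=0.9):
    """Keep the preliminary action only when its confidence clearly exceeds
    the filtered action's; otherwise fall back to the filtered action."""
    if a_pre == a_filt:
        return a_filt
    confidence = q_values[a_pre] / (q_values[a_pre] + q_values[a_filt])
    return a_pre if confidence > threshold else a_filt

# Usage: out of ammunition -> only "hold fire" (action 0) is feasible.
a_filt = action_filter({0: 0.2, 1: 0.8}, ammo=0, dist=2.0, weapon_range=5.0)
```

This matches the idea in the text: the filtered action is the safe default, and the preliminary action is adopted only once learning makes it clearly preferable.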
When 12 aircraft attack in arbitrary ways and each defensive weapon has 4 rounds, the training results are shown in fig. 6: the system converges within only 30 cycles, and the legal-action rate remains 1 throughout. The test data are shown in fig. 7, with a test accuracy as high as 97.3%.

Claims (8)

1. An agent reinforcement learning decision method, comprising:
constructing an expert knowledge data set for an agent reinforcement learning decision;
constructing a playback buffer module of the reinforcement learning network;
training the reinforcement learning network to obtain an intelligent agent preliminary strategy and action;
and setting an action filtering module, filtering the initial strategy of the intelligent agent, and determining the final action of the intelligent agent by using a confidence function.
2. The agent reinforcement learning decision method according to claim 1, wherein the constructing of the expert knowledge data set for the agent reinforcement learning decision comprises:
(1) for a scene in which the objective equation and the constraint equations can be expressed linearly, acquiring the action information and state information of the agent by a linear programming method; for a scene requiring iterative solution, acquiring the action information and state information of the agent by a dynamic programming method;
(2) obtaining, by the linear programming or dynamic programming method, the current state information s, the action information a, the state information s' of the agent at the next moment, and the flag d indicating whether the task has terminated, in N different simplified scenes, and forming from this information the knowledge data set
D = {(s_i, a_i, s'_i, d_i)}, i = 1, …, N,
wherein i denotes the decision-related information of the agent in the ith scene;
(3) setting a reward function r_i for each item of data in the knowledge data set D;
(4) merging the reward functions r_i with the set D to construct the expert knowledge data set for the agent reinforcement learning decision,
D_E = {(s_i, a_i, r_i, s'_i, d_i)}, i = 1, …, N.
3. The agent reinforcement learning decision method according to claim 1, wherein the playback buffer module B of the reinforcement learning network comprises two parts: the first part is a fixed list B_E used for storing expert knowledge data, accounting for 30% of the length of the playback buffer module and called the expert data playback buffer; the second part is a first-in first-out queue B_env used for storing samples collected by the agent from the environment, called the environment data playback buffer.
4. The agent reinforcement learning decision method according to claim 1, wherein the training of the reinforcement learning network to obtain the agent preliminary strategy and action comprises:
(1) in the complex environment problem, the agent collects samples from the environment multiple times, obtains at least one batch of environment data of size B, and stores the environment data in the first-in first-out queue B_env;
(2) training samples are randomly drawn from the fixed list B_E and the first-in first-out queue B_env to form training data of batch size B, with 30% of the training data coming from B_E and 70% from B_env; the training data are input into the reinforcement learning algorithm, which outputs the agent preliminary strategy π̂ and the agent preliminary action â, wherein π̂ is a probability distribution over the agent preliminary actions and â ~ π̂.
5. The agent reinforcement learning decision method according to claim 1, wherein setting the action filtering module, filtering the agent preliminary strategy, and determining the agent final action using the confidence function comprises:
(1) setting an action filtering module F_af, which expresses the limiting conditions of the complex environment problem as a system of inequalities; the agent preliminary strategy π̂ is taken as the input of the action filtering module to obtain the agent filtered action ã = F_af(π̂);
(2) setting a confidence function f_c, selecting the agent final action a from the agent preliminary action â and the agent filtered action ã, and training the subsequent network with the final action;
(3) setting a loss function L and training the reinforcement learning network with the Adam algorithm, with the aim of minimizing the loss function, until the loss function converges, obtaining the trained reinforcement learning network;
(4) taking the state information of each step in the task as the input of the trained reinforcement learning network to obtain the agent final action.
6. An agent reinforcement learning decision device, comprising:
the expert knowledge data set construction module is used for constructing an expert knowledge data set for an intelligent agent reinforcement learning decision;
the playback buffer construction module, used for constructing the playback buffer module of the reinforcement learning network;
the intelligent agent preliminary action generation module is used for training the reinforcement learning network to obtain an intelligent agent preliminary strategy and action;
and the agent final action generation module, used for setting an action filtering module, filtering the agent preliminary strategy, and determining the agent final action using the confidence function.
7. An electronic device, comprising:
a memory for storing processor-executable instructions;
a processor configured to perform any of the agent reinforcement learning decision methods of claims 1-5.
8. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform any of the agent reinforcement learning decision methods of claims 1-5.
CN202110994653.6A 2021-08-27 2021-08-27 Intelligent agent reinforcement learning decision-making method and device, electronic equipment and storage medium thereof Pending CN113780574A (en)

Publications (1)

Publication Number Publication Date
CN113780574A true CN113780574A (en) 2021-12-10




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination