CN112989016A - Method and system for detecting quality of experience of simulated user in dialogue strategy learning - Google Patents
- Publication number
- CN112989016A (application CN202110532470.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
Abstract
The invention provides a method and a system for detecting the experience quality of a simulated user in dialogue strategy learning. The method comprises the following steps: S1, a world model generates simulation experience; S2, a KL-divergence-based quality detector performs quality detection on the simulation experience; S3, simulation experience that passes quality detection is stored for training the dialogue strategy model. By introducing a KL-divergence-based quality detector, the quality of the simulation experience can be evaluated more easily and effectively, the robustness and effectiveness of the dialogue strategy are ensured, computational efficiency is greatly improved, and the quality of the simulation experience is effectively controlled.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a method and a system for detecting the experience quality of a simulated user in dialogue strategy learning.
Background
Task-completion dialogue strategy learning aims at building a task-oriented dialogue system that helps users accomplish specific single-domain or multi-domain tasks through several rounds of natural language interaction. It has been widely used in chatbots and personal voice assistants, such as Apple's Siri and Microsoft's Cortana.
In recent years, reinforcement learning has become the mainstream method for dialogue strategy learning. With reinforcement learning, the dialogue system can gradually adjust and optimize its strategy through natural language interaction with users so as to improve performance. However, plain reinforcement learning requires a large number of human-machine interactions before a usable dialogue strategy is obtained, which not only increases the training cost but also degrades the user experience in the early training phase.
In order to solve the above problems and accelerate dialogue strategy learning, researchers proposed the Deep Dyna-Q (DDQ) framework based on the Dyna-Q framework. DDQ introduces a world model that is trained on real user experience so that it behaves more like a real user, and uses it to generate simulated user experience, hereinafter referred to as simulation experience, in a dynamic environment. During dialogue strategy learning, the dialogue agent is trained both with real experience collected from actual interactions and with simulation experience collected from interactions with the world model. With the world model, only a small amount of real user interaction is needed, so the learning efficiency of the dialogue strategy can be remarkably improved. However, DDQ still faces difficulties in further optimizing dialogue strategy learning from limited dialogue interactions: the simulation experience generated by the world model does not necessarily improve performance, and low-quality simulation experience can even have a severe negative effect on performance. To solve this problem, some recent studies attempt to filter out low-quality experience using a generative adversarial network (GAN) to control the quality of the simulation experience. However, GAN training is highly unstable, which with high probability prevents the dialogue strategy learning from converging, and it is very sensitive to the selection of hyper-parameters, severely restricting dialogue learning performance. Therefore, how to effectively screen out low-quality experience during dialogue strategy learning remains an open and important problem.
Disclosure of Invention
The invention aims to solve the above problems and provides a method and a system for detecting the experience quality of a simulated user in dialogue strategy learning.
To achieve this purpose, the invention adopts the following technical scheme:
a method for detecting the quality of a simulated user experience in dialogue strategy learning, comprising the steps of:
S1, generating simulation experience by a world model;
S2, performing quality detection on the simulation experience through a KL-divergence-based quality detector;
S3, storing simulation experience that passes quality detection for training the dialogue strategy model.
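The three steps above can be sketched as a generate-detect-store loop. The following toy sketch is illustrative only: `StubWorldModel` and the qualification predicate are stand-ins, not the patent's actual components.

```python
# Illustrative sketch of steps S1-S3: generate -> detect -> store.
class StubWorldModel:
    """Toy stand-in: cycles through a fixed list of simulated user actions."""
    def __init__(self, actions):
        self.actions = actions
        self.i = 0

    def generate(self):
        action = self.actions[self.i % len(self.actions)]
        self.i += 1
        return action


def run_planning(world_model, is_qualified, buffer, steps):
    for _ in range(steps):
        exp = world_model.generate()   # S1: world model generates simulation experience
        if is_qualified(exp):          # S2: quality detection (KL-based in this scheme)
            buffer.append(exp)         # S3: keep only experience that passes detection
    return buffer
```

With a toy predicate that only accepts "request" actions, `run_planning(StubWorldModel(["request", "inform"]), lambda e: e == "request", [], 4)` keeps two of the four generated experiences.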
In the above-described method for detecting the quality of the simulated user experience in dialogue strategy learning, in step S2 the KL-divergence-based quality detector performs quality detection by comparing the simulation experience with the real experience.
In the above method for detecting the quality of the simulated user experience in dialogue strategy learning, in step S3, simulation experience that passes quality detection is stored to a buffer for dialogue strategy model training.
In the above method for detecting the quality of the simulated user experience in dialogue strategy learning, in step S2, the dictionary world-dict is updated according to the simulation experience generated by the world model, the dictionary real-dict is updated according to the real experience generated by the real user, and the similarity between world-dict and real-dict is measured through KL divergence to perform quality detection of the simulation experience.
In the above method, the primary key of the dictionary world-dict is a user action generated by the world model, and the corresponding value is the frequency of that user action;
the primary key of the dictionary real-dict is a user action generated by the real user, and the corresponding value is the frequency of that user action.
In the above-described method, in step S2, a predefined variable KL_pre is used to track the KL divergence between the dictionary real-dict and the dictionary world-dict for the similarity measurement.
In the above method, in step S2, the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict are stored in a pre-established dictionary same-dict; the current KL divergence is calculated based on same-dict, and if the current KL divergence is less than or equal to KL_pre, the current experience is judged to be qualified.
In the above method, in step S2, the current experience is judged to be qualified when the length of the dictionary same-dict is smaller than a constant C.
A system for detecting the quality of a simulated user experience in dialogue strategy learning comprises a quality detector connected with a world model, a real user experience library and a dialogue strategy model, wherein the quality detector comprises a KL divergence detector used to detect the quality of the simulation experience generated by the world model against the real experience generated by real users.
In the above system, the quality detector comprises a dictionary real-dict for storing real experience, a dictionary world-dict for storing simulation experience, and a dictionary same-dict for storing the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict.
The invention has the advantages that: KL divergence is introduced to check the distribution of experience, so no extra work is needed to design and train a complex quality detector. The quality of the simulation experience is therefore evaluated more easily, computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are ensured, and the quality of the simulation experience can be effectively controlled.
Drawings
FIG. 1 is an architecture diagram of the dialogue learning method of the present invention;
FIG. 2 is a flow chart of KL divergence calculation in the dialogue learning method of the present invention;
fig. 3 is a graph of learning curves for various agents under different values of the parameter K, wherein
- (a) learning curves of the agents when K = 20;
- (b) learning curves of the agents when K = 30.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
As shown in fig. 1, the present scheme proposes a method for detecting the experience quality of a simulated user in dialogue strategy learning. Its basic procedure is consistent with the prior art: for example, the dialogue strategy model and the world model are initialized using human conversation data, and dialogue strategy learning starts from there. Dialogue strategy learning for the dialogue strategy model mainly comprises two parts: direct reinforcement learning and indirect reinforcement learning (also called planning). In direct reinforcement learning, a Deep Q-Network (DQN) is adopted to improve the dialogue strategy from real experience: the dialogue strategy model interacts with the user and, at each step, selects the action a to execute according to the observed dialogue state s so as to maximize the value function Q. The dialogue strategy model then receives a reward r and the real user's action a_u^r, and updates the current state to s'; the real experience (s, a, r, a_u^r, t) is stored to the real user experience library, where t indicates whether the session has terminated.
The value function Q(s, a; θ_Q) is approximated by a deep neural network (DNN), and θ_Q is updated by continual iteration to reduce the mean-square loss.
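The mean-square loss referred to above is the standard DQN temporal-difference objective. A dependency-free toy version for a single transition (s, a, r, s', t) is sketched below; the plain functions standing in for the Q-network and its target network, and the two-action set, are illustrative assumptions, not the patent's implementation.

```python
# Squared TD error for one transition, in the spirit of the DQN update above.
def dqn_loss(q, q_target, s, a, r, s_next, terminal, gamma=0.9, actions=(0, 1)):
    """(target - Q(s, a))^2 with a one-step bootstrapped target."""
    if terminal:
        target = r  # no bootstrap once the session has terminated (t is true)
    else:
        target = r + gamma * max(q_target(s_next, b) for b in actions)
    return (target - q(s, a)) ** 2
```

For a terminal transition with Q(s, a) = 0 and r = 1, the loss is simply (1 − 0)² = 1; in practice this error is minimized by gradient descent on θ_Q.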
During indirect reinforcement learning, the dialogue strategy model improves its dialogue strategy by interacting with the world model so as to reduce training cost. The frequency of planning is controlled by a parameter K, meaning that K planning steps are performed for each step of direct reinforcement learning. When the world model accurately captures the characteristics of the real environment, K tends to be set large. At each planning step, the world model responds to action a according to the current state s with a user action a_u^w, generating the simulation experience (s, a, r, a_u^w, t').
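The interleaving of one direct step with K planning steps can be sketched as follows. The `step()` interface and the counting environment are hypothetical stand-ins for the real user and the world model.

```python
# Sketch of one training step: one real interaction plus K planned interactions.
class CountingEnv:
    """Toy environment that counts how often it is stepped."""
    def __init__(self):
        self.calls = 0

    def step(self):
        self.calls += 1
        return "experience"


def train_step(buffer, real_env, world_model, K):
    buffer.append(real_env.step())                       # direct RL: one real interaction
    buffer.extend(world_model.step() for _ in range(K))  # planning: K simulated interactions
    return buffer
```

Running `train_step([], CountingEnv(), CountingEnv(), 3)` collects one real experience and three simulated ones, matching the K-steps-per-real-step schedule described above.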
In particular, on top of the prior art, this scheme adopts a quality detector based on KL divergence (Kullback-Leibler divergence) to perform quality detection on the simulation experience generated by the world model, and stores the simulation experience that passes detection in a buffer for training the dialogue strategy model, thereby ensuring the quality of the simulation experience and avoiding the impact of low-quality simulation experience on learning performance.
Specifically, as shown in fig. 2, the KL-divergence-based quality detector performs quality detection by comparing the simulation experience with the real experience. The specific method is as follows:
The dictionary world-dict is updated according to the simulation experience generated by the world model, and the dictionary real-dict is updated according to the real experience generated by the real user. The primary keys of world-dict and real-dict are the user actions a_u^w and a_u^r generated by the world model and the real user respectively, and the corresponding values are the frequencies of those user actions. That is, world-dict and real-dict record the frequency of every action generated by the world model and the real user.
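This bookkeeping is plain frequency counting. In the sketch below, `collections.Counter` plays the role of the action-to-frequency dictionaries (named `real_dict` and `world_dict` here to mirror the text); the action strings are toy data.

```python
# Frequency dictionaries for real-user and world-model actions.
from collections import Counter

real_dict = Counter()   # primary key: real-user action, value: its frequency
world_dict = Counter()  # primary key: world-model action, value: its frequency

for action in ["inform", "inform", "request"]:   # toy real-user actions
    real_dict[action] += 1
for action in ["inform", "request", "request"]:  # toy world-model actions
    world_dict[action] += 1
```

After these updates, `real_dict["inform"]` is 2 and `world_dict["request"]` is 2, exactly the per-action frequencies the detector later compares.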
The frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict are stored in a pre-established dictionary same-dict, and the similarity between world-dict and real-dict is measured with KL divergence to perform quality detection of the simulation experience;
the similarity measure is defined by defining a variable KL in advancepreThe variable KLpreIs set to a larger value for tracking the KL divergence between the lexicon real-fact and the lexicon world-fact. Calculating the current KL divergence based on the thesame-fact, if the current KL divergence is less than or equal to KLpreThen it means that the current experience is detected as a qualified experience since the current experience makes the world model more similar to the real user, and the qualified experience is pushed into the buffer MpFor training a dialogue strategy model.
To show the effectiveness and superiority of this scheme, the method is compared with other algorithms in an experimental group. In table 1, D3Q(10) is an agent based on a GAN quality detector; DDQ(M, K, N) is an agent without a quality detector; GPDDQ(M, K, N) is an agent that uses the GP world model without a quality detector; UN-GPDDQ(5000, 20, 4) is an agent that uses the GP world model without a quality detector while taking the uncertainty of the GP model into account; KL-GPDDQ(M, K, N) is an agent that adds the KL quality detector of this method on top of UN-GPDDQ. M denotes the buffer size, K the number of planning steps, and N the batch size:
table 1: experimental results for different agents trained for {100, 200, 300} iterations with buffer size 5000, K = 20, N = 4.
In the table, Su = Success, Tu = Turns, Re = Reward.
From table 1 it can be seen that the DDQ method still performs the worst of all five agents. From the results of the GPDDQ, UN-GPDDQ and KL-GPDDQ agents, it is evident that the KL divergence check of this scheme is very helpful for improving performance: the success rate and the reward improve markedly, and compared with DDQ, the method improves the success rate with fewer user interactions.
In addition, as can be seen from fig. 3, the learning speed of the proposed method is much higher than that of DDQ and D3Q. Notably, the curve of D3Q fluctuates heavily and is very unstable; in particular, when K = 30, D3Q cannot even converge to an optimal value. So even though D3Q can cull low-quality experience, it remains hard to use in practice because GAN training is too unstable.
From the above experiments, this scheme has significant advantages over methods based on the prior-art DDQ framework, and also over the GAN quality detectors used in the prior art. By introducing KL divergence to check the distribution of experience, no additional quality detector needs to be trained; the quality of simulation experience can therefore be evaluated more easily in practice, and computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are ensured.
Example two
This embodiment is similar to embodiment one, except that it takes into account that, in the initial stage, only limited actions exist in the dictionary world-dict, so the length of the dictionary same-dict is small. In order to warm up the world model, preferably, when the length of same-dict is smaller than a constant C, the simulation experience is regarded as qualified. The constant C is determined by one skilled in the art on a case-by-case basis and is not limited here.
Only when the length of same-dict reaches a certain value, namely greater than or equal to the constant C, is the KL divergence between real-dict and world-dict tracked through the predefined variable KL_pre for the similarity measurement.
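The warm-up rule of this embodiment can be folded directly into the check: while fewer than C actions overlap between the two dictionaries, accept unconditionally; afterwards fall back to the KL_pre comparison. The particular values of C and KL_pre below are illustrative, not prescribed by the patent.

```python
# Qualification check with the warm-up rule of embodiment two.
import math

def kl_over_intersection(real_dict, world_dict):
    shared = set(real_dict) & set(world_dict)
    r_total = sum(real_dict[k] for k in shared)
    w_total = sum(world_dict[k] for k in shared)
    return sum((real_dict[k] / r_total)
               * math.log((real_dict[k] / r_total) / (world_dict[k] / w_total))
               for k in shared)

def is_qualified_with_warmup(real_dict, world_dict, kl_pre, C):
    shared = set(real_dict) & set(world_dict)
    if len(shared) < C:   # same-dict still too short: warm-up phase, accept
        return True
    return kl_over_intersection(real_dict, world_dict) <= kl_pre
```

During warm-up every experience passes regardless of the divergence; once the overlap reaches C actions, the KL_pre threshold takes over.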
EXAMPLE III
This embodiment provides a system for detecting the experience quality of a simulated user in dialogue strategy learning, used to implement the method of embodiment one or embodiment two. The system comprises a quality detector connected with a world model, a real user experience library and a dialogue strategy model, wherein the quality detector comprises a KL divergence detector used to detect the quality of the simulation experience generated by the world model against the real experience generated by real users.
Specifically, the quality detector comprises a dictionary real-dict for storing real experience, a dictionary world-dict for storing simulation experience, and a dictionary same-dict for storing the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Although the terms simulation experience, real experience, quality detector, human conversation data, world model, buffer, dialogue strategy model, real user experience library, etc. are frequently used herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the invention more conveniently; they are not to be construed as imposing any additional limitation contrary to the spirit of the present invention.
Claims (10)
1. A method for detecting the quality of a simulated user experience in dialogue strategy learning, comprising the steps of:
S1, generating simulation experience by a world model;
S2, performing quality detection on the simulation experience through a KL-divergence-based quality detector;
S3, storing simulation experience that passes quality detection for training the dialogue strategy model.
2. The method for detecting the quality of the simulated user experience in dialogue strategy learning according to claim 1, characterized in that in step S2, the KL-divergence-based quality detector performs the quality detection by comparing the simulation experience with the real experience.
3. The method for detecting the quality of the simulated user experience in dialogue strategy learning according to claim 2, wherein in step S3, the simulation experience that passes quality detection is stored to a buffer for dialogue strategy model training.
4. The method for detecting the quality of the simulated user experience in dialogue strategy learning according to claim 2, wherein in step S2, the dictionary world-dict is updated according to the simulation experience generated by the world model, the dictionary real-dict is updated according to the real experience generated by the real user, and the similarity between world-dict and real-dict is measured by KL divergence for the quality detection of the simulation experience.
5. The method for detecting the experience quality of a simulated user in dialogue strategy learning according to claim 4, wherein the primary key of the dictionary world-dict is a user action generated by the world model, and the corresponding value is the frequency of that user action;
the primary key of the dictionary real-dict is a user action generated by the real user, and the corresponding value is the frequency of that user action.
6. The method for detecting the experience quality of a simulated user in dialogue strategy learning according to claim 5, characterized in that in step S2, a predefined variable KL_pre is used to track the KL divergence between the dictionary real-dict and the dictionary world-dict for the similarity measurement.
7. The method according to claim 6, wherein in step S2, the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict are stored in the dictionary same-dict, and the current KL divergence is calculated based on same-dict; if the current KL divergence is less than or equal to KL_pre, the current experience is judged to be qualified.
8. The method for detecting the experience quality of a simulated user in dialogue strategy learning according to claim 6 or 7, characterized in that in step S2, the current experience is judged to be qualified when the length of the dictionary same-dict is smaller than a constant C.
9. A system for detecting the quality of a simulated user experience in dialogue strategy learning, characterized by comprising a quality detector connected with a world model, a real user experience library and a dialogue strategy model, wherein the quality detector comprises a KL divergence detector used to detect the quality of the simulation experience generated by the world model against the real experience generated by real users.
10. The system for detecting the quality of the simulated user experience in dialogue strategy learning according to claim 9, wherein the quality detector comprises a dictionary real-dict for storing the real experience, a dictionary world-dict for storing the simulation experience, and a dictionary same-dict for storing the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110532470.2A CN112989016B (en) | 2021-05-17 | 2021-05-17 | Method and system for detecting quality of experience of simulated user in dialogue strategy learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112989016A true CN112989016A (en) | 2021-06-18 |
| CN112989016B CN112989016B (en) | 2021-08-10 |
Family
ID=76336599
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110532470.2A Active CN112989016B (en) | 2021-05-17 | 2021-05-17 | Method and system for detecting quality of experience of simulated user in dialogue strategy learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112989016B (en) |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120053930A1 (en) * | 2002-12-18 | 2012-03-01 | At&T Intellectual Property Ii, L.P. | System and method of providing a spoken dialog interface to a website |
| CN107342078A (en) * | 2017-06-23 | 2017-11-10 | 上海交通大学 | The cold starting system and method for dialog strategy optimization |
| CN108804611A (en) * | 2018-05-30 | 2018-11-13 | 浙江大学 | A kind of dialogue reply generation method and system based on self comment Sequence Learning |
| US20200034422A1 (en) * | 2016-06-24 | 2020-01-30 | Elemental Cognition Llc | Architecture and processes for computer learning and understanding |
| CN111538668A (en) * | 2020-04-28 | 2020-08-14 | 济南浪潮高新科技投资发展有限公司 | Mobile terminal application testing method, device, equipment and medium based on reinforcement learning |
| CN111801730A (en) * | 2017-12-29 | 2020-10-20 | 得麦股份有限公司 | Systems and methods for artificial intelligence-driven autonomous companions |
| CN112131372A (en) * | 2020-11-25 | 2020-12-25 | 中国科学院自动化研究所 | Knowledge-driven dialogue strategy network optimization method, system and device |
| CN112256856A (en) * | 2020-11-16 | 2021-01-22 | 北京京东尚科信息技术有限公司 | Robot dialogue method, device, electronic device and storage medium |
Non-Patent Citations (2)
| Title |
|---|
| Yuexin Wu et al., "Switch-based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning", arXiv |
| Zhao Yinjiang et al., "An improved DDPG dialogue policy optimization algorithm", Computer Engineering and Design |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112989016B (en) | 2021-08-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Weisz et al. | Sample efficient deep reinforcement learning for dialogue systems with large action spaces | |
| US11911702B2 (en) | AI parameter configuration method and apparatus for racing AI model, AI parameter configuration device, and storage medium | |
| US12204854B2 (en) | System for multi-perspective discourse within a dialog | |
| US20190095794A1 (en) | Methods and apparatus for training a neural network | |
| CN112101530A (en) | Neural network training method, device, equipment and storage medium | |
| CN117912459A (en) | Train and/or use an encoder model to determine actions in response to natural language input | |
| CN110546656A (en) | Feedforward generation type neural network | |
| CN117216232B (en) | A large language model hyperparameter optimization method and system | |
| CN112989017B (en) | Method for generating high-quality simulation experience for dialogue strategy learning | |
| CN113419424B (en) | Modeling reinforcement learning robot control method and system for reducing overestimation | |
| US8682677B2 (en) | System and method for automatically generating a dialog manager | |
| WO2021139233A1 (en) | Method and apparatus for generating data extension mixed strategy, and computer device | |
| CN116343759A (en) | Black-box intelligent speech recognition system adversarial sample generation method and related device | |
| Baioletti et al. | Smart multi-objective evolutionary GAN | |
| CN113392956B (en) | GP-based deep Dyna-Q method for dialogue strategy learning | |
| CN119310841A (en) | A triple optimized SAC reinforcement learning method for continuous robot control | |
| CN116788524B (en) | A TD3 soft reinforcement learning spacecraft attitude control method and computer-readable medium | |
| CN112989016B (en) | Method and system for detecting quality of experience of simulated user in dialogue strategy learning | |
| CN118503391B (en) | Dialogue method and system based on neural network with adaptive connection | |
| Chinaei et al. | An inverse reinforcement learning algorithm for partially observable domains with application on healthcare dialogue management | |
| CN111241749B (en) | Permanent magnet synchronous motor chaos prediction method based on reserve pool calculation | |
| Chien et al. | Model-based soft actor-critic | |
| WO2020134011A1 (en) | Method and apparatus for determining display information combination, storage medium, and electronic device | |
| Šćepanović | Testing reward function choice influence on training performance of Double DQN | |
| CN115759226A (en) | Training method, device, equipment and storage medium of visual network model |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |
