CN112989016A - Method and system for detecting quality of experience of simulated user in dialogue strategy learning - Google Patents
- Publication number
- CN112989016A (application CN202110532470.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
Abstract
The invention provides a method and a system for detecting the experience quality of a simulated user in dialogue strategy learning. The method comprises the following steps: S1, a world model generates simulation experience; S2, a KL-divergence-based quality detector performs quality detection on the simulation experience; S3, simulation experience that passes quality detection is stored for training the dialogue strategy model. By introducing a KL-divergence-based quality detector, the quality of the simulation experience can be evaluated more easily and effectively, the robustness and effectiveness of the dialogue strategy are ensured, computational efficiency is greatly improved, and the quality of the simulation experience is effectively controlled.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a method and a system for detecting the experience quality of a simulated user in dialogue strategy learning.
Background
Task-completion dialogue strategy learning aims at building a task-oriented dialogue system that helps users accomplish specific single-domain or multi-domain tasks through several rounds of natural language interaction. It has been widely used in chatbots and personal voice assistants, such as Apple's Siri and Microsoft's Cortana.
In recent years, reinforcement learning has become the mainstream method for dialogue strategy learning. With reinforcement learning, the dialogue system can gradually adjust and optimize its strategy through natural language interaction with users so as to improve performance. However, plain reinforcement learning requires a large number of human-machine interactions before a usable dialogue strategy is obtained, which not only increases the training cost but also degrades the user experience in the early training phase.
In order to solve the above problems and accelerate dialogue strategy learning, researchers proposed the Deep Dyna-Q (DDQ) framework based on the Dyna-Q framework. DDQ introduces a world model that is trained on real user experience so that it behaves more like a real user, and uses it to generate simulated user experience, hereinafter referred to as simulation experience, in a dynamic environment. During dialogue strategy learning, the dialogue agent is trained both with real experience collected from actual interactions and with simulation experience collected from interactions with the world model. With the world model, only a small amount of real user interaction is needed, so the learning efficiency of the dialogue strategy can be remarkably improved. However, DDQ still faces difficulties in further optimizing dialogue strategy learning from limited dialogue interactions: the simulation experience generated by the world model does not necessarily improve performance, and low-quality simulation experience can even have a severe negative effect on performance. To solve this problem, some recent studies attempt to filter out low-quality experience using a generative adversarial network (GAN) to control the quality of the simulation experience. However, GAN training is highly unstable, which with high probability prevents the dialogue strategy learning from converging, and it is very sensitive to the selection of hyper-parameters, severely restricting dialogue learning performance. Therefore, how to effectively screen out low-quality experience during dialogue strategy learning remains an open and important problem.
Disclosure of Invention
The invention aims to solve the above problems and provides a method and a system for detecting the experience quality of a simulated user in dialogue strategy learning.
To achieve this purpose, the invention adopts the following technical scheme:
a method for detecting the quality of a simulated user experience in dialogue strategy learning, comprising the steps of:
S1, generating simulation experience by a world model;
S2, performing quality detection on the simulation experience through a KL-divergence-based quality detector;
S3, storing simulation experience that passes quality detection for training the dialogue strategy model.
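The three steps above can be sketched as a generate-detect-store loop. The following toy sketch is illustrative only: `StubWorldModel` and the qualification predicate are stand-ins, not the patent's actual components.

```python
# Illustrative sketch of steps S1-S3: generate -> detect -> store.
class StubWorldModel:
    """Toy stand-in: cycles through a fixed list of simulated user actions."""
    def __init__(self, actions):
        self.actions = actions
        self.i = 0

    def generate(self):
        action = self.actions[self.i % len(self.actions)]
        self.i += 1
        return action


def run_planning(world_model, is_qualified, buffer, steps):
    for _ in range(steps):
        exp = world_model.generate()   # S1: world model generates simulation experience
        if is_qualified(exp):          # S2: quality detection (KL-based in this scheme)
            buffer.append(exp)         # S3: keep only experience that passes detection
    return buffer
```

With a toy predicate that only accepts "request" actions, `run_planning(StubWorldModel(["request", "inform"]), lambda e: e == "request", [], 4)` keeps two of the four generated experiences.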
In the above-described method for detecting the quality of the simulated user experience in dialogue strategy learning, in step S2 the KL-divergence-based quality detector performs quality detection by comparing the simulation experience with the real experience.
In the above method for detecting the quality of the simulated user experience in dialogue strategy learning, in step S3, simulation experience that passes quality detection is stored to a buffer for dialogue strategy model training.
In the above method for detecting the quality of the simulated user experience in dialogue strategy learning, in step S2, the dictionary world-dict is updated according to the simulation experience generated by the world model, the dictionary real-dict is updated according to the real experience generated by the real user, and the similarity between world-dict and real-dict is measured through KL divergence to perform quality detection of the simulation experience.
In the above method, the primary key of the dictionary world-dict is a user action generated by the world model, and the corresponding value is the frequency of that user action;
the primary key of the dictionary real-dict is a user action generated by the real user, and the corresponding value is the frequency of that user action.
In the above-described method, in step S2, a predefined variable KL_pre is used to track the KL divergence between the dictionary real-dict and the dictionary world-dict for the similarity measurement.
In the above method, in step S2, the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict are stored in a pre-established dictionary same-dict; the current KL divergence is calculated based on same-dict, and if the current KL divergence is less than or equal to KL_pre, the current experience is judged to be qualified.
In the above method, in step S2, the current experience is judged to be qualified when the length of the dictionary same-dict is smaller than a constant C.
A system for detecting the quality of a simulated user experience in dialogue strategy learning comprises a quality detector connected with a world model, a real user experience library and a dialogue strategy model, wherein the quality detector comprises a KL divergence detector used to detect the quality of the simulation experience generated by the world model against the real experience generated by real users.
In the above system, the quality detector comprises a dictionary real-dict for storing real experience, a dictionary world-dict for storing simulation experience, and a dictionary same-dict for storing the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict.
The invention has the advantages that: KL divergence is introduced to check the distribution of experience, so no extra work is needed to design and train a complex quality detector. The quality of the simulation experience is therefore evaluated more easily, computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are ensured, and the quality of the simulation experience can be effectively controlled.
Drawings
FIG. 1 is an architecture diagram of the dialogue learning method of the present invention;
FIG. 2 is a flow chart of KL divergence calculation in the dialogue learning method of the present invention;
fig. 3 is a graph of learning curves for various agents under different values of the parameter K, wherein
- (a) learning curves of the agents when K = 20;
- (b) learning curves of the agents when K = 30.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
As shown in fig. 1, the present scheme proposes a method for detecting the experience quality of a simulated user in dialogue strategy learning. Its basic procedure is consistent with the prior art: for example, the dialogue strategy model and the world model are initialized using human conversation data, and dialogue strategy learning starts from there. Dialogue strategy learning for the dialogue strategy model mainly comprises two parts: direct reinforcement learning and indirect reinforcement learning (also called planning). In direct reinforcement learning, a Deep Q-Network (DQN) is adopted to improve the dialogue strategy from real experience: the dialogue strategy model interacts with the user and, at each step, selects the action a to execute according to the observed dialogue state s so as to maximize the value function Q. The dialogue strategy model then receives a reward r and the real user's action a_u^r, and updates the current state to s'; the real experience (s, a, r, a_u^r, t) is stored to the real user experience library, where t indicates whether the session has terminated.
The value function Q(s, a; θ_Q) is approximated by a deep neural network (DNN), and θ_Q is updated by continual iteration to reduce the mean-square loss.
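The mean-square loss referred to above is the standard DQN temporal-difference objective. A dependency-free toy version for a single transition (s, a, r, s', t) is sketched below; the plain functions standing in for the Q-network and its target network, and the two-action set, are illustrative assumptions, not the patent's implementation.

```python
# Squared TD error for one transition, in the spirit of the DQN update above.
def dqn_loss(q, q_target, s, a, r, s_next, terminal, gamma=0.9, actions=(0, 1)):
    """(target - Q(s, a))^2 with a one-step bootstrapped target."""
    if terminal:
        target = r  # no bootstrap once the session has terminated (t is true)
    else:
        target = r + gamma * max(q_target(s_next, b) for b in actions)
    return (target - q(s, a)) ** 2
```

For a terminal transition with Q(s, a) = 0 and r = 1, the loss is simply (1 − 0)² = 1; in practice this error is minimized by gradient descent on θ_Q.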
During indirect reinforcement learning, the dialogue strategy model improves its dialogue strategy by interacting with the world model so as to reduce training cost. The frequency of planning is controlled by a parameter K, meaning that K planning steps are performed for each step of direct reinforcement learning. When the world model accurately captures the characteristics of the real environment, K tends to be set large. At each planning step, the world model responds to action a according to the current state s with a user action a_u^w, generating the simulation experience (s, a, r, a_u^w, t').
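The interleaving of one direct step with K planning steps can be sketched as follows. The `step()` interface and the counting environment are hypothetical stand-ins for the real user and the world model.

```python
# Sketch of one training step: one real interaction plus K planned interactions.
class CountingEnv:
    """Toy environment that counts how often it is stepped."""
    def __init__(self):
        self.calls = 0

    def step(self):
        self.calls += 1
        return "experience"


def train_step(buffer, real_env, world_model, K):
    buffer.append(real_env.step())                       # direct RL: one real interaction
    buffer.extend(world_model.step() for _ in range(K))  # planning: K simulated interactions
    return buffer
```

Running `train_step([], CountingEnv(), CountingEnv(), 3)` collects one real experience and three simulated ones, matching the K-steps-per-real-step schedule described above.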
In particular, on top of the prior art, this scheme adopts a quality detector based on KL divergence (Kullback-Leibler divergence) to perform quality detection on the simulation experience generated by the world model, and stores the simulation experience that passes detection in a buffer for training the dialogue strategy model, thereby ensuring the quality of the simulation experience and avoiding the impact of low-quality simulation experience on learning performance.
Specifically, as shown in fig. 2, the KL-divergence-based quality detector performs quality detection by comparing the simulation experience with the real experience. The specific method is as follows:
The dictionary world-dict is updated according to the simulation experience generated by the world model, and the dictionary real-dict is updated according to the real experience generated by the real user. The primary keys of world-dict and real-dict are the user actions a_u^w and a_u^r generated by the world model and the real user respectively, and the corresponding values are the frequencies of those user actions. That is, world-dict and real-dict record the frequency of every action generated by the world model and the real user.
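This bookkeeping is plain frequency counting. In the sketch below, `collections.Counter` plays the role of the action-to-frequency dictionaries (named `real_dict` and `world_dict` here to mirror the text); the action strings are toy data.

```python
# Frequency dictionaries for real-user and world-model actions.
from collections import Counter

real_dict = Counter()   # primary key: real-user action, value: its frequency
world_dict = Counter()  # primary key: world-model action, value: its frequency

for action in ["inform", "inform", "request"]:   # toy real-user actions
    real_dict[action] += 1
for action in ["inform", "request", "request"]:  # toy world-model actions
    world_dict[action] += 1
```

After these updates, `real_dict["inform"]` is 2 and `world_dict["request"]` is 2, exactly the per-action frequencies the detector later compares.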
The frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict are stored in a pre-established dictionary same-dict, and the similarity between world-dict and real-dict is measured with KL divergence to perform quality detection of the simulation experience;
the similarity measure is defined by defining a variable KL in advancepreThe variable KLpreIs set to a larger value for tracking the KL divergence between the lexicon real-fact and the lexicon world-fact. Calculating the current KL divergence based on the thesame-fact, if the current KL divergence is less than or equal to KLpreThen it means that the current experience is detected as a qualified experience since the current experience makes the world model more similar to the real user, and the qualified experience is pushed into the buffer MpFor training a dialogue strategy model.
To show the effectiveness and superiority of this scheme, the method is compared with other algorithms in an experimental group. In table 1, D3Q(10) is an agent based on a GAN quality detector; DDQ(M, K, N) is an agent without a quality detector; GPDDQ(M, K, N) is an agent that uses the GP world model without a quality detector; UN-GPDDQ(5000, 20, 4) is an agent that uses the GP world model without a quality detector while taking the uncertainty of the GP model into account; KL-GPDDQ(M, K, N) is an agent that adds the KL quality detector of this method on top of UN-GPDDQ. M denotes the buffer size, K the number of planning steps, and N the batch size:
table 1: experimental results for different agents trained for {100, 200, 300} iterations with buffer size 5000, K = 20, N = 4.
In the table, Su = Success, Tu = Turns, Re = Reward.
From table 1 it can be seen that the DDQ method still performs the worst of all five agents. From the results of the GPDDQ, UN-GPDDQ and KL-GPDDQ agents, it is evident that the KL divergence check of this scheme is very helpful for improving performance: the success rate and the reward improve markedly, and compared with DDQ, the method improves the success rate with fewer user interactions.
In addition, as can be seen from fig. 3, the learning speed of the proposed method is much higher than that of DDQ and D3Q. Notably, the curve of D3Q fluctuates heavily and is very unstable; in particular, when K = 30, D3Q cannot even converge to an optimal value. So even though D3Q can cull low-quality experience, it remains hard to use in practice because GAN training is too unstable.
From the above experiments, this scheme has significant advantages over methods based on the prior-art DDQ framework, and also over the GAN quality detectors used in the prior art. By introducing KL divergence to check the distribution of experience, no additional quality detector needs to be trained; the quality of simulation experience can therefore be evaluated more easily in practice, and computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are ensured.
Example two
This embodiment is similar to embodiment one, except that it takes into account that, in the initial stage, only limited actions exist in the dictionary world-dict, so the length of the dictionary same-dict is small. In order to warm up the world model, preferably, when the length of same-dict is smaller than a constant C, the simulation experience is regarded as qualified. The constant C is determined by one skilled in the art on a case-by-case basis and is not limited here.
Only when the length of same-dict reaches a certain value, namely greater than or equal to the constant C, is the KL divergence between real-dict and world-dict tracked through the predefined variable KL_pre for the similarity measurement.
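The warm-up rule of this embodiment can be folded directly into the check: while fewer than C actions overlap between the two dictionaries, accept unconditionally; afterwards fall back to the KL_pre comparison. The particular values of C and KL_pre below are illustrative, not prescribed by the patent.

```python
# Qualification check with the warm-up rule of embodiment two.
import math

def kl_over_intersection(real_dict, world_dict):
    shared = set(real_dict) & set(world_dict)
    r_total = sum(real_dict[k] for k in shared)
    w_total = sum(world_dict[k] for k in shared)
    return sum((real_dict[k] / r_total)
               * math.log((real_dict[k] / r_total) / (world_dict[k] / w_total))
               for k in shared)

def is_qualified_with_warmup(real_dict, world_dict, kl_pre, C):
    shared = set(real_dict) & set(world_dict)
    if len(shared) < C:   # same-dict still too short: warm-up phase, accept
        return True
    return kl_over_intersection(real_dict, world_dict) <= kl_pre
```

During warm-up every experience passes regardless of the divergence; once the overlap reaches C actions, the KL_pre threshold takes over.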
EXAMPLE III
This embodiment provides a system for detecting the experience quality of a simulated user in dialogue strategy learning, used to implement the method of embodiment one or embodiment two. The system comprises a quality detector connected with a world model, a real user experience library and a dialogue strategy model, wherein the quality detector comprises a KL divergence detector used to detect the quality of the simulation experience generated by the world model against the real experience generated by real users.
Specifically, the quality detector comprises a dictionary real-dict for storing real experience, a dictionary world-dict for storing simulation experience, and a dictionary same-dict for storing the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Although the terms simulation experience, real experience, quality detector, human conversation data, world model, buffer, dialogue strategy model, real user experience library, etc. are frequently used herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the invention more conveniently; they are not to be construed as imposing any additional limitation contrary to the spirit of the present invention.
Claims (10)
1. A method for detecting the quality of a simulated user experience in dialogue strategy learning, comprising the steps of:
S1, generating simulation experience by a world model;
S2, performing quality detection on the simulation experience through a KL-divergence-based quality detector;
S3, storing simulation experience that passes quality detection for training the dialogue strategy model.
2. The method for detecting the quality of the simulated user experience in dialogue strategy learning according to claim 1, characterized in that in step S2, the KL-divergence-based quality detector performs the quality detection by comparing the simulation experience with the real experience.
3. The method for detecting the quality of the simulated user experience in dialogue strategy learning according to claim 2, wherein in step S3, the simulation experience that passes quality detection is stored to a buffer for dialogue strategy model training.
4. The method for detecting the quality of the simulated user experience in dialogue strategy learning according to claim 2, wherein in step S2, the dictionary world-dict is updated according to the simulation experience generated by the world model, the dictionary real-dict is updated according to the real experience generated by the real user, and the similarity between world-dict and real-dict is measured by KL divergence for the quality detection of the simulation experience.
5. The method for detecting the experience quality of a simulated user in dialogue strategy learning according to claim 4, wherein the primary key of the dictionary world-dict is a user action generated by the world model, and the corresponding value is the frequency of that user action;
the primary key of the dictionary real-dict is a user action generated by the real user, and the corresponding value is the frequency of that user action.
6. The method for detecting the experience quality of a simulated user in dialogue strategy learning according to claim 5, characterized in that in step S2, a predefined variable KL_pre is used to track the KL divergence between the dictionary real-dict and the dictionary world-dict for the similarity measurement.
7. The method according to claim 6, wherein in step S2, the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict are stored in the dictionary same-dict, and the current KL divergence is calculated based on same-dict; if the current KL divergence is less than or equal to KL_pre, the current experience is judged to be qualified.
8. The method for detecting the experience quality of a simulated user in dialogue strategy learning according to claim 6 or 7, characterized in that in step S2, the current experience is judged to be qualified when the length of the dictionary same-dict is smaller than a constant C.
9. A system for detecting the quality of a simulated user experience in dialogue strategy learning, characterized by comprising a quality detector connected with a world model, a real user experience library and a dialogue strategy model, wherein the quality detector comprises a KL divergence detector used to detect the quality of the simulation experience generated by the world model against the real experience generated by real users.
10. The system for detecting the quality of the simulated user experience in dialogue strategy learning according to claim 9, wherein the quality detector comprises a dictionary real-dict for storing the real experience, a dictionary world-dict for storing the simulation experience, and a dictionary same-dict for storing the frequency values, in both dictionaries, of the primary keys in the intersection of real-dict and world-dict.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110532470.2A CN112989016B (en) | 2021-05-17 | 2021-05-17 | Method and system for detecting quality of experience of simulated user in dialogue strategy learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112989016A true CN112989016A (en) | 2021-06-18 |
| CN112989016B CN112989016B (en) | 2021-08-10 |
Family
ID=76336599
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110532470.2A Active CN112989016B (en) | 2021-05-17 | 2021-05-17 | Method and system for detecting quality of experience of simulated user in dialogue strategy learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112989016B (en) |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120053930A1 (en) * | 2002-12-18 | 2012-03-01 | At&T Intellectual Property Ii, L.P. | System and method of providing a spoken dialog interface to a website |
| CN107342078A (en) * | 2017-06-23 | 2017-11-10 | 上海交通大学 | The cold starting system and method for dialog strategy optimization |
| CN108804611A (en) * | 2018-05-30 | 2018-11-13 | 浙江大学 | A kind of dialogue reply generation method and system based on self comment Sequence Learning |
| US20200034422A1 (en) * | 2016-06-24 | 2020-01-30 | Elemental Cognition Llc | Architecture and processes for computer learning and understanding |
| CN111538668A (en) * | 2020-04-28 | 2020-08-14 | 济南浪潮高新科技投资发展有限公司 | Mobile terminal application testing method, device, equipment and medium based on reinforcement learning |
| CN111801730A (en) * | 2017-12-29 | 2020-10-20 | 得麦股份有限公司 | Systems and methods for artificial intelligence-driven autonomous companions |
| CN112131372A (en) * | 2020-11-25 | 2020-12-25 | 中国科学院自动化研究所 | Knowledge-driven dialogue strategy network optimization method, system and device |
| CN112256856A (en) * | 2020-11-16 | 2021-01-22 | 北京京东尚科信息技术有限公司 | Robot dialogue method, device, electronic device and storage medium |
Non-Patent Citations (2)
| Title |
|---|
| Yuexin Wu et al., "Switch-based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning", arXiv |
| Zhao Yinjiang et al., "An improved DDPG dialogue policy optimization algorithm", Computer Engineering and Design |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112989016B (en) | 2021-08-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Weisz et al. | Sample efficient deep reinforcement learning for dialogue systems with large action spaces | |
| US11911702B2 (en) | AI parameter configuration method and apparatus for racing AI model, AI parameter configuration device, and storage medium | |
| US12204854B2 (en) | System for multi-perspective discourse within a dialog | |
| US20190095794A1 (en) | Methods and apparatus for training a neural network | |
| CN112101530A (en) | Neural network training method, device, equipment and storage medium | |
| CN117912459A (en) | Train and/or use an encoder model to determine actions in response to natural language input | |
| CN110546656A (en) | Feedforward generation type neural network | |
| CN117216232B (en) | A large language model hyperparameter optimization method and system | |
| CN112989017B (en) | Method for generating high-quality simulation experience for dialogue strategy learning | |
| CN113419424B (en) | Modeling reinforcement learning robot control method and system for reducing overestimation | |
| US8682677B2 (en) | System and method for automatically generating a dialog manager | |
| WO2021139233A1 (en) | Method and apparatus for generating data extension mixed strategy, and computer device | |
| CN116343759A (en) | Black-box intelligent speech recognition system adversarial sample generation method and related device | |
| Baioletti et al. | Smart multi-objective evolutionary GAN | |
| CN113392956B (en) | GP-based deep Dyna-Q method for dialogue strategy learning | |
| CN119310841A (en) | A triple optimized SAC reinforcement learning method for continuous robot control | |
| CN116788524B (en) | A TD3 soft reinforcement learning spacecraft attitude control method and computer-readable medium | |
| CN112989016B (en) | Method and system for detecting quality of experience of simulated user in dialogue strategy learning | |
| CN118503391B (en) | Dialogue method and system based on neural network with adaptive connection | |
| Chinaei et al. | An inverse reinforcement learning algorithm for partially observable domains with application on healthcare dialogue management | |
| CN111241749B (en) | Permanent magnet synchronous motor chaos prediction method based on reserve pool calculation | |
| Chien et al. | Model-based soft actor-critic | |
| WO2020134011A1 (en) | Method and apparatus for determining display information combination, storage medium, and electronic device | |
| Šćepanović | Testing reward function choice influence on training performance of Double DQN | |
| CN115759226A (en) | Training method, device, equipment and storage medium of visual network model |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |
