Detailed Description
Illustrative embodiments of the application include, but are not limited to, a voice interaction model evaluation method, an electronic device, and a readable storage medium.
The following describes a specific implementation procedure of the technical solution provided in the embodiment of the present application with reference to the accompanying drawings.
It can be understood that the electronic device in the embodiments of the present application may be a server or a terminal. The terminal may be a user terminal, a mobile terminal, a user equipment (UE), a terminal device, a mobile station (MS), a mobile terminal (MT), or the like. The terminal device may be a mobile phone, a smart TV, a wearable device, a tablet (Pad), a computer with a wireless transceiver function, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, etc.
The following is an explanation of terms involved in the present application.
(1) Voice interaction model (SRPAs)
SRPAs refers to a system that can simulate the voice and behavioral characteristics of a particular character to provide the user with a personalized, natural, and emotionally expressive voice interaction experience.
In some embodiments of the application, SRPAs may be applied to a question-and-answer scenario. For example, in the SRPAs-based question-and-answer scenario shown in FIG. 1, a user may ask a question through a question-and-answer box 101 of an application program using SRPAs on a computer 10, either by entering text in the question-and-answer box 101 or by entering voice through a voice control 102. SRPAs can then generate a character reply voice corresponding to the user question data based on the character pre-selected by the user.
As mentioned above, most data sets used for training SRPAs, although related to the characters, lack voice data whose voice features, prosody, and emotional expression are consistent with the characters, and a systematic evaluation of SRPAs is also lacking. As a result, it is difficult for SRPAs to fully achieve a voice interaction effect consistent with the characters in practical applications, which affects the user's voice interaction experience.
In order to solve the above problems, an embodiment of the application provides a voice interaction model evaluation method, which includes: obtaining a reference voice and a character description text of a first character; inputting the character description text into a large language model to generate target interactive dialogue data, where the target interactive dialogue data includes at least one piece of question information and character reply information corresponding to each piece of question information, and the character reply information is of a text type; converting the character reply information corresponding to each piece of question information into character reply voice based on the reference voice to obtain a character voice interactive dialogue data set corresponding to the first character; and evaluating the voice interaction model based on the character voice interactive dialogue data set.
Based on the method provided by the application, character voice interactive dialogue data sets for a large number of different characters are generated from the voice features of each character's reference voice and the corresponding target interactive dialogue data, and the character voice interactive dialogue data set corresponding to each character can include a large number of character reply voices for that character. SRPAs that integrate such character voice interactive dialogue data sets can more accurately simulate the voice characteristics of different characters in practical applications and provide a more natural, personalized, and emotionally rich voice interaction experience. In this manner, the current lack of character-related voice data for SRPAs is alleviated, and the performance of SRPAs and the user's voice interaction experience are improved through systematic evaluation.
In some embodiments of the present application, the language of the character description text and of the text in the target interactive dialogue data may include Chinese or English, etc., which is not particularly limited by the present application.
In some embodiments of the present application, SRPAs that include character voice interactive dialogue data sets for different characters can simulate the voice characteristics, prosody, and emotional expression of those characters, and can therefore serve as intelligent assistants or chat robots and be widely applied in various scenarios to provide a personalized and emotionally expressive voice interaction experience. Illustratively: (1) SRPAs can be applied to in-game character interaction scenarios, simulating non-player characters and providing interactive dialogue to enhance the immersive feel of the game. (2) SRPAs may be applied to virtual personal assistant scenarios, simulating a personal assistant to provide personalized voice interaction services for users, such as calendar management and weather forecasts. (3) SRPAs can be applied to online education and tutoring scenarios, simulating a virtual teacher or tutor to provide educational services such as language learning and course tutoring. (4) SRPAs can be applied to automated customer service scenarios, simulating a customer care representative in a call center or online customer service to provide automated customer consultation and problem solutions. (5) SRPAs can be applied to authoring-assistance scenarios, simulating character interaction dialogues in stories as a creative partner to assist writers and content creators and provide creative inspiration. (6) SRPAs can be applied to vocational training simulation scenarios, simulating interviewers or specific characters to help users practice interview skills. (7) SRPAs can be applied to emotional interaction support scenarios, providing emotional support and listening services and simulating emotional support partners (such as psychological counselors or friend-like roles) for emotional communication. (8) SRPAs can be applied to smart home control scenarios, simulating a smart home assistant to control smart devices in a home (such as lighting, temperature, and the like). (9) SRPAs can be applied to voice navigation guidance scenarios, simulating a navigation assistant to broadcast real-time navigation guidance and traffic information.
It will be appreciated that the above scenarios are merely exemplary, and that SRPAs integrating different character voice interaction dialog data sets may also be applied to other scenarios in other embodiments of the present application, and the present application is not limited in this regard.
FIG. 2 illustrates a flow diagram of a voice interaction model evaluation method according to some embodiments of the application. The execution subject of each flow shown in FIG. 2 may be an electronic device. The specific flow may include:
S101, acquiring a reference voice and a character description text of a first character, where the character description text is used to describe characteristics of the first character.
In some embodiments of the present application, the electronic device may obtain the reference voice and the character description text of the first character in advance. The first character may be any of various characters from TV dramas, movies, cartoons, and games.
In some embodiments of the present application, obtaining the reference voice and the character description text of the first character may specifically include: obtaining first multimedia audio; preprocessing the first multimedia audio to obtain at least one initial character voice of the first character; determining, based on the voice similarity between each pair of initial character voices, an average voice similarity between each initial character voice and the other initial character voices; and determining the reference voice based on the average voice similarity corresponding to each initial character voice, where the average voice similarity of the reference voice is greater than a voice similarity threshold.
Specifically, the electronic device may obtain multimedia audio from different TV dramas, movies, cartoons, games, and the like, which may include audio associated with the first character. First, the multimedia audio may be anonymized to remove or replace any identifiable personal information in the audio, such as names, addresses, or other sensitive information the sound may contain. The multimedia audio is then sliced into short audio segments, and these segments are converted into mono WAV format (a lossless audio format) with a sample rate of 16 kHz, etc., to reduce the amount of data to be processed. It can be understood that, in the present application, shorter character audio segments may be obtained by sequentially performing sound source separation, speaker separation, and voice activity detection on the multimedia audio. Sound source separation refers to separating different types of audio (such as background sound, ambient noise, and character voices) from the multimedia audio, where the character voices may include the voice corresponding to the first character. Speaker separation refers to separating the voice corresponding to the first character from the character voices using speaker separation techniques. Voice activity detection refers to removing the silent and non-speech parts of the voice corresponding to the first character and extracting the first character's clean voice, so as to obtain at least one initial character voice of the first character. It will be appreciated that each initial character voice of the first character may be an audio clip with a duration of 3 s, 5 s, 10 s, etc., which is not particularly limited by the present application.
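As an illustration of the slicing-and-conversion step only, the following is a minimal sketch assuming the librosa and soundfile libraries are available; the file name, the 5-second segment length, and the omission of the separation and voice-activity-detection stages (represented here by simple fixed-length slicing) are assumptions for the example, not part of the method itself.

```python
# A minimal preprocessing sketch: load multimedia audio as 16 kHz mono and
# slice it into short clips, then write each clip as a lossless WAV file.
import librosa
import soundfile as sf

def slice_to_segments(path: str, segment_s: float = 5.0):
    """Load audio as 16 kHz mono and slice it into fixed-length clips."""
    y, sr = librosa.load(path, sr=16000, mono=True)  # resample + downmix
    hop = int(segment_s * sr)
    segments = [y[i:i + hop] for i in range(0, len(y), hop)]
    return [s for s in segments if len(s) == hop], sr  # drop the partial tail

segments, sr = slice_to_segments("episode01.wav", segment_s=5.0)  # hypothetical file
for idx, seg in enumerate(segments):
    sf.write(f"clip_{idx:04d}.wav", seg, sr)  # mono WAV, 16 kHz sample rate
```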
Further, the electronic device may extract a speaker embedding from each initial character voice of the first character using automatic speech recognition (ASR) technology, and then calculate an average voice similarity for each initial character voice from the at least one initial character voice of the first character. For example, assuming the initial character voices of the first character are V1, V2, and V3, the voice similarity S12 between V1 and V2, the voice similarity S13 between V1 and V3, and the voice similarity S23 between V2 and V3 are calculated. Then, based on the voice similarities S12, S13, and S23, the average voice similarity of each initial character voice is calculated: the average voice similarity S1 of V1 is calculated from S12 and S13, the average voice similarity S2 of V2 is calculated from S12 and S23, and the average voice similarity S3 of V3 is calculated from S13 and S23. Finally, based on the average voice similarities S1, S2, and S3, one of V1, V2, and V3 is determined as the reference voice. It should be understood that the average voice similarity of the reference voice is greater than the voice similarity threshold and may be the highest among the average voice similarities of V1, V2, and V3; the voice similarity threshold is not particularly limited in the present application.
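The selection rule just described can be sketched as follows, assuming the speaker embeddings are numpy vectors and using cosine similarity; the 0.7 threshold and the synthetic stand-in embeddings are assumptions for illustration only.

```python
# A minimal sketch of reference-voice selection from speaker embeddings
# (V1, V2, V3 in the example above), using pairwise cosine similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_reference(embeddings: list[np.ndarray], threshold: float = 0.7):
    n = len(embeddings)
    avg = []
    for i in range(n):
        # Average similarity between voice i and every other initial voice.
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        avg.append(sum(sims) / len(sims))
    best = int(np.argmax(avg))
    # The reference voice must exceed the voice similarity threshold.
    return best if avg[best] > threshold else None

rng = np.random.default_rng(0)
base = rng.normal(size=256)                               # shared speaker identity
clips = [base + s * rng.normal(size=256) for s in (0.1, 0.1, 0.5)]
print(pick_reference(clips))  # index of the clip closest to the others on average
```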
In some embodiments of the application, the electronic device may extract structured metadata from different data sources to construct a character profile corresponding to each character, where the character profile may include a character portrait (e.g., personality and preferences), background information (e.g., social identity and relationships), and a character line set (e.g., dialogues and monologues). Using a large language model (LLM), initial interactive dialogue data can be generated based on the character profile, the initial interactive dialogue data being of a text type.
It will be appreciated that the above character profile is the character description text in the present application. The initial interactive dialogue data can include at least one piece of question information and character reply information corresponding to each piece of question information. For example, for each character (e.g., 98 characters in total), Q (e.g., 800) single-round interactive dialogues and Q multi-round interactive dialogues may be generated, resulting in the initial interactive dialogue data corresponding to each character.
Specifically, the character portrait is used to describe basic features and core information of the character. The character portrait may include the first character's appearance characteristics, catchphrases, emotional response patterns, and so on. The appearance characteristics may be the age, sex, height, body type, skin color, hairstyle, etc. of the first character. A catchphrase may be a specific word or phrase frequently used by the first character. The emotional response patterns may be the first character's typical emotional reactions in different situations, such as anger, happiness, sadness, and surprise. The background information is used to describe the character's detailed background and living environment. The background information may be the story setting, social identity, and relationship network corresponding to the first character. The story setting may be the story background in which the first character is situated, such as time, place, and events. The social identity may be the first character's family membership, family relationships, family economic status, etc., and the relationship network may be the first character's network of relationships in society, such as friends and colleagues. The character line set is used to help the LLM understand the first character's language style and emotional expression characteristics. The character lines may be at least one of the monologues, dialogues, and narrative content corresponding to the first character. A monologue is content the first character speaks alone and can reflect the first character's inner thoughts and emotions; a dialogue is a conversation between the first character and other characters and can reflect the first character's language style in interaction; and the narrative content is the first character's language style when recounting an event or story and can reflect the character's manner of expression.
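A minimal sketch of one way to structure the character profile described above; the field names and the example values (drawn from the anonymized character "A" used later in this description) are illustrative assumptions, not a fixed schema.

```python
# One possible data structure for the character profile: portrait fields,
# background information, and the character line set, gathered in one record.
from dataclasses import dataclass, field

@dataclass
class CharacterProfile:
    name: str
    appearance: str                        # age, sex, height, hairstyle, ...
    catchphrases: list[str]                # frequently used words or phrases
    emotional_responses: dict[str, str]    # situation -> typical reaction
    story_setting: str                     # time, place, events of the story
    social_identity: str                   # family, status, role in society
    relationship_network: dict[str, str] = field(default_factory=dict)
    lines: list[str] = field(default_factory=list)  # monologue/dialogue/narration

profile = CharacterProfile(
    name="A",
    appearance="the image of a teenage hero",
    catchphrases=["My fate is determined by me, not by heaven."],
    emotional_responses={"provoked": "impulsive but kind-hearted and loyal"},
    story_setting="son of B and C; a conflict with D's family is later resolved",
    social_identity="son of B, a young leader",
    relationship_network={"father": "B", "mother": "C", "master": "F"},
)
```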
Based on the above, the language style, emotion expression characteristics and other multidimensional characteristics of the first character can be determined through the reference voice and character description text of the first character.
S102, inputting the character description text into a large language model to generate target interactive dialogue data, where the target interactive dialogue data includes at least one piece of question information and character reply information corresponding to each piece of question information, and the character reply information is of a text type.
In some embodiments of the present application, the LLM may generate initial interactive dialogue data corresponding to each character based on the character description text of that character, where each piece of initial interactive dialogue data includes at least one piece of initial question information and initial character reply information corresponding to each piece of initial question information. It should be appreciated that the initial interactive dialogue data may be of a text type.
In some embodiments of the present application, after obtaining the reference voice and the character description text of the first character based on S101, the electronic device may input the character description text corresponding to the first character into the LLM, so that the LLM generates the initial interactive dialogue data corresponding to the first character, where the initial interactive dialogue data corresponding to the first character may include at least one piece of initial question information and the corresponding initial character reply information.
In some embodiments of the present application, the context relevance of each character may be determined by calculating the semantic similarity between its corresponding initial interaction dialogue data and the corresponding metadata (i.e., character description text).
In some embodiments of the application, inputting the character description text into the large language model to generate the target interactive dialogue data includes: inputting the character description text into the large language model to generate initial interactive dialogue data, where the initial interactive dialogue data includes M groups of initial interactive dialogues, M is a positive integer, and each group of initial interactive dialogues includes at least one piece of question information and character reply information corresponding to each piece of question information; screening the M groups of initial interactive dialogues based on the semantic similarity between each group of initial interactive dialogues and the character description text to obtain N first interactive dialogues, where N is a positive integer, N is less than or equal to M, and the semantic similarity of each of the N first interactive dialogues is greater than a first semantic similarity threshold; obtaining K second interactive dialogues based on the semantic similarity between the first interactive dialogues, where K is a positive integer, K is less than or equal to N, and the semantic similarity between the K second interactive dialogues is less than a second semantic similarity threshold; and determining the K second interactive dialogues as the target interactive dialogue data.
In some embodiments of the present application, when the character description text is in Chinese, the first semantic similarity threshold may be, for example, 0.45; when the character description text is in English, the first semantic similarity threshold may be, for example, 0.4.
In some embodiments of the present application, when the character description text is in Chinese, the second semantic similarity threshold may be, for example, 0.85; when the character description text is in English, the second semantic similarity threshold may be, for example, 0.9.
In other embodiments of the present application, the first semantic similarity threshold and the second semantic similarity threshold may also take other values for character description texts in languages such as Chinese or English, which is not particularly limited by the present application.
Illustratively, first assume that the M groups of initial interactive dialogues are T1, T2, T3, T4, T5, T6, T7, and T8, i.e., M = 8. The semantic similarities between each initial interactive dialogue and the character description text of the first character are calculated as S1 (0.8), S2 (0.85), S3 (0.75), S4 (0.75), S5 (0.75), S6 (0.65), S7 (0.75), and S8 (0.65). Then, assuming the first semantic similarity threshold is 0.7, the N first interactive dialogues whose semantic similarity is greater than the first semantic similarity threshold are selected from the M groups of initial interactive dialogues as T1, T2, T3, T4, T5, and T7, i.e., N = 6. The semantic similarities between the N first interactive dialogues are then calculated as S12 (0.9), S13 (0.3), S14 (0.4), S15 (0.7), S17 (0.7), S23 (0.4), S24 (0.8), S25 (0.7), S27 (0.7), S34 (0.9), S35 (0.4), S37 (0.4), S45 (0.7), S47 (0.7), and S57 (0.8). The K second interactive dialogues are determined from the N first interactive dialogues such that the semantic similarity between any two of the K second interactive dialogues is less than the second semantic similarity threshold of 0.8; for example, traversing the first interactive dialogues in order and keeping each dialogue only if its similarity to every dialogue already kept is below 0.8 yields T1, T3, and T5, i.e., K = 3. In this way, T1, T3, and T5 can be used as the target interactive dialogue data.
Based on the above, interactive dialogue data highly relevant to the character description text is selected by calculating the semantic similarity between each initial interactive dialogue and the character description text. The selected target interactive dialogue data is closely related to the characteristics of the first character, which ensures consistency between the target interactive dialogue data and the first character.
In addition, the semantic similarity between the preliminarily selected first interactive dialogues is calculated, and second interactive dialogues with low mutual semantic similarity are retained. This step ensures that the finally selected target interactive dialogue data has a certain semantic diversity and avoids repetition among the interactive dialogues, thereby ensuring the semantic diversity of the interactive dialogue data.
In some embodiments of the present application, the semantic similarity may be determined based on cosine similarity, Euclidean distance, and the like; the manner of determining the semantic similarity is not particularly limited in the present application.
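Putting the two screening stages together, the following is a minimal sketch assuming the dialogues and the character description text are already embedded as numpy vectors; the greedy diversity pass and the 0.7/0.8 thresholds follow the worked example above and are illustrative assumptions, not fixed values.

```python
# Two-stage screening: (1) keep dialogues relevant to the character description
# text; (2) greedily keep dialogues whose mutual similarity stays low, which
# enforces semantic diversity among the selected target dialogues.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_target_dialogues(dialogue_embs: list[np.ndarray],
                            profile_emb: np.ndarray,
                            t1: float = 0.7, t2: float = 0.8) -> list[int]:
    # Stage 1: relevance to the character description text (> t1).
    first = [i for i, e in enumerate(dialogue_embs)
             if cosine(e, profile_emb) > t1]
    # Stage 2: pairwise similarity among kept dialogues must stay below t2.
    kept: list[int] = []
    for i in first:
        if all(cosine(dialogue_embs[i], dialogue_embs[j]) < t2 for j in kept):
            kept.append(i)
    return kept  # indices of the K target interactive dialogues
```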
Taking "A" as an example of the first character, the character description text and the target interactive dialogue data corresponding to the first character are illustrated below.
First, the character description text corresponding to the first character "A" may include a character portrait, background information, and a character line set. Specifically:
Character portrait: Appearance characteristics: A is the image of a teenage hero. Catchphrase: "My fate is determined by me, not by heaven." Emotional response pattern: fearless and strong-willed, sometimes getting into serious trouble out of impulsiveness, but kind-hearted and very loyal to friends and family.
Background information: Story setting: A is the son of B and C; his accidentally injuring E, the son of D, triggers a series of stories, and in the end he is saved by F and granted a lotus avatar so that he can continue to protect everyone. Social identity: son of B, a young leader. Relationship network: his father is B, his mother is C, and his master is F; there is a conflict with D's family, but it is finally resolved.
Character line set: Monologue: "My fate is determined by me, not by heaven; even if the road ahead is full of thorns, I will carve out a path of my own." Dialogue: "E, I, A, will not let you go." Narrative content: "Although I have made mistakes, I have learned to take responsibility and protect more people."
Then, the character description text of the first character "A" is input into the LLM, and the target interactive dialogue data corresponding to "A" generated by the LLM may include: T1: Question information: "A, what is your most powerful ability?" A's reply information: "My most powerful abilities are Q1 and Q2! Q1 can release powerful energy to protect everyone, and Q2 can help me complete many tasks." T2: Question information: "A, why do you say your fate is determined by you and not by heaven?" A's reply information: "Because I believe everyone can change their fate through their own efforts. I do not believe fate is predetermined; I believe we can create our own future." T3 (multi-round): Question information: "A, what do you like to do most?" A's reply information: "What I like most is helping others and protecting everyone's safety. Every time I see people happy because of my help, I feel a special sense of achievement. Of course, I also like learning new powers with my master F." Follow-up question information: "A, when you study with F, is there anything particularly interesting to share?" A's reply information: "Studying with my master is always full of fun. My master teaches me a lot of knowledge and helps me understand nature."
It will be appreciated that the character description text corresponding to A and the T1, T2, and T3 included in the target interactive dialogue data shown above are merely exemplary illustrations. In other embodiments of the present application, the character description text corresponding to A may include more text information, and more target interactive dialogue data may be generated.
S103, converting each piece of text-type character reply information into character reply voice based on the reference voice, to obtain a character voice interactive dialogue data set corresponding to the first character.
In some embodiments of the present application, after obtaining the reference voice and the target interactive dialogue data corresponding to the first character based on S101 and S102, the electronic device may input the reference voice into a text-to-speech (TTS) model for feature extraction to obtain the voice features corresponding to the first character. The text-type character reply information corresponding to each piece of question information is then converted into character reply voice based on these voice features. Finally, the character voice interactive dialogue data set corresponding to the first character is generated from each piece of question information and each character reply voice.
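A minimal sketch of assembling such a data set; the `tts` callable stands in for a voice-cloning TTS model conditioned on the reference-voice feature, and its interface, like the dummy stand-ins in the usage lines, is an assumption for illustration rather than a specific model API.

```python
# Pair each piece of question information with the reply text and the reply
# voice synthesized from it, producing the character voice interactive
# dialogue data set for one character.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceDialogue:
    question: str        # question text (or a path to question audio)
    reply_text: str      # character reply information (text type)
    reply_audio: bytes   # character reply voice synthesized by TTS

def build_voice_dataset(dialogues: list[tuple[str, str]],
                        voice_feature: bytes,
                        tts: Callable[[str, bytes], bytes]) -> list[VoiceDialogue]:
    return [VoiceDialogue(q, r, tts(r, voice_feature)) for q, r in dialogues]

# Usage sketch with dummy stand-ins:
dataset = build_voice_dataset(
    [("A, what do you like to do most?", "What I like most is helping others.")],
    voice_feature=b"<speaker feature from the reference voice>",
    tts=lambda text, feat: b"<synthesized wav bytes>",
)
```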
In some embodiments of the present application, the question information may be a question voice or a question text, which is not particularly limited by the present application.
In some embodiments of the present application, in the character voice interactive dialogue data set of each character, when the question information is a question voice, the question voice has a clear questioning intent, while the character reply voice mainly carries the stated content. Thus, the duration of the character reply voice may generally be greater than the duration of the corresponding question voice. For example, the character reply voice may last 1 to 20 seconds, and the question voice may last 3 to 6 seconds; the present application is not limited thereto.
By way of example, the character voice interactive dialogue data set corresponding to the first character may include: V1: Question information: "A, what is your most powerful ability?" A's reply voice: "My most powerful abilities are Q1 and Q2!" V2: Question information: "A, why do you say your fate is determined by you and not by heaven?" A's reply voice: "Because I believe everyone can change their fate through their own efforts. I do not believe fate is predetermined; I believe we can create our own future." V3 (multi-round): Question information: "A, what do you like to do most?" A's reply voice: "What I like most is helping others and protecting everyone's safety. Every time I see people happy because of my help, I feel a special sense of achievement. Of course, I also like learning new powers with my master F." Follow-up question information: "A, when you study with F, is there anything particularly interesting to share?" A's reply voice: "Studying with my master is always full of fun. My master teaches me a lot of knowledge and helps me understand nature."
It will be appreciated that the character reply voices in the character voice interactive dialogue data set corresponding to the first character in the above example are merely exemplary illustrations. In other embodiments of the present application, the character voice interactive dialogue data set corresponding to the first character may further include other character reply voices corresponding to the first character.
In this way, a high-quality, personalized, and emotionally rich character voice interactive dialogue data set can be provided for the voice interaction model (SRPAs), significantly improving the role-playing voice quality of SRPAs in practical applications so that users can obtain a voice interaction experience that is more natural, vivid, and consistent with the character setting.
S104, evaluating the voice interaction model based on the character voice interactive dialogue data set.
In some embodiments of the application, to systematically evaluate the quality with which a voice interaction model (SRPAs) generates character reply voice corresponding to user question data based on the character voice interactive dialogue data set, the application provides evaluation criteria covering three key evaluation dimensions: basic interaction capability, speech expressiveness, and role-playing quality.
Table 1 shows a table of evaluation criteria according to some embodiments of the application.
Referring to Table 1, basic interaction capability may be determined based on instruction adherence (IA), speech fluency (SF), and conversational coherence (CC). Speech expressiveness may be determined based on speech naturalness (SN), prosodic consistency (PC), and emotion appropriateness (EA). Role-playing quality may be determined based on personality consistency (PeC) and knowledge consistency (KC).
TABLE 1
Basic interaction capability: instruction adherence (IA); speech fluency (SF); conversational coherence (CC)
Speech expressiveness: speech naturalness (SN); prosodic consistency (PC); emotion appropriateness (EA)
Role-playing quality: personality consistency (PeC); knowledge consistency (KC)
With continued reference to Table 1, instruction adherence (IA) is used to evaluate whether the character reply voice faithfully follows the instructions, i.e., whether the character reply voice is accurately generated based on the user question data and conforms to the character setting.
For example, for the question information "A, what is your most powerful ability?", the A reply voice generated by SRPAs based on the user question data is: "My most powerful abilities are Q1 and Q2! Q1 can release powerful energy to protect everyone, and Q2 can help me fly, bind bad people, and help me complete many tasks." The A reply voice is accurately generated based on the user question data, conforms to A's character setting, and does not step out of character to explain or comment. Conforming to the character setting means that the content of the A reply voice matches A's personality and background and is stated from A's perspective. Not stepping out of character means that A does not explain from a narrator's perspective but answers the user's question directly, maintaining character consistency. Thus, it can be determined that the instruction adherence of the A reply voice generated by SRPAs is relatively high.
With continued reference to Table 1, speech fluency (SF) is used to evaluate whether the character reply voice is fluent and free of abnormal pauses, i.e., whether the character reply voice sounds smooth and properly paced.
For example, for the question information "A, what is your most powerful ability?", the A reply voice generated by SRPAs based on the user question data is: "My most powerful abilities are Q1 and Q2!" The pronunciation of each word in the A reply voice is clear and accurate, without pronunciation errors; the speaking rate is moderate and the rhythm natural, neither too fast nor too slow; and there are no noticeable pauses or stuttering. Thus, it can be determined that the speech fluency of the A reply voice generated by SRPAs is relatively high.
With continued reference to Table 1, conversational coherence (CC) is used to evaluate whether the character reply voice is coherent and free of contradictions, i.e., whether the character reply voice is closely related to the context and logically coherent.
For example, for the question information "A, why do you say your fate is determined by you and not by heaven?", the A reply voice generated by SRPAs based on the user question data is: "Because I believe everyone can change their fate through their own efforts. I do not believe fate is predetermined; I believe we can create our own future." The content of the A reply voice is closely related to the user question data, is logically coherent, and naturally continues the dialogue. Thus, it can be determined that the conversational coherence (CC) of the A reply voice generated by SRPAs is relatively high.
Based on the instruction adherence (IA), speech fluency (SF), and conversational coherence (CC) evaluations described above, the basic interaction capability of SRPAs can be effectively evaluated.
With continued reference to Table 1, speech naturalness (SN) is used to evaluate the synthesis naturalness of the character reply voice, i.e., whether the character reply voice is natural and close to real human expression.
For example, for the question information "A, what do you like to do most?", the A reply voice generated by SRPAs based on the user question data is: "What I like most is helping others and protecting everyone's safety. Every time I see people happy because of my help, I feel a special sense of achievement. Of course, I also like learning new powers with my master F." The reply language of the A reply voice is natural and close to real human expression, without any mechanical or stiff feel. Thus, it can be determined that the speech naturalness (SN) of the A reply voice generated by SRPAs is relatively high.
With continued reference to Table 1, prosodic consistency (PC) is used to evaluate whether the character reply voice exhibits the intonation style the character should have, i.e., whether the prosody (pitch, duration, intensity) of the character reply voice is consistent with the reference voice.
For example, for the question information "A, why do you say your fate is determined by you and not by heaven?", the A reply voice generated by SRPAs based on the user question data is: "Because I believe everyone can change their fate through their own efforts. I do not believe fate is predetermined; I believe we can create our own future.", and the reference voice is "My fate is determined by me, not by heaven." The A reply voice is prosodically consistent with the reference voice, maintaining the character's intonation style and emotional expression. Thus, it can be determined that the prosodic consistency (PC) of the A reply voice generated by SRPAs is relatively high.
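As a sketch of how the prosodic features named above (pitch, duration, intensity) could be extracted for such a comparison, the following assumes the librosa library; the pitch range, the choice of features, and any scoring rule built on them are assumptions, since the application does not prescribe a specific extraction method.

```python
# Extract simple prosodic descriptors of a clip so that a reply voice can be
# compared feature-by-feature against the reference voice.
import librosa
import numpy as np

def prosody_features(path: str) -> dict[str, float]:
    y, sr = librosa.load(path, sr=16000, mono=True)
    f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=400, sr=sr)  # pitch track
    rms = librosa.feature.rms(y=y)[0]                          # intensity
    return {
        "mean_pitch_hz": float(np.nanmean(f0)),  # average fundamental frequency
        "duration_s": len(y) / sr,               # total duration
        "mean_intensity": float(rms.mean()),     # average RMS energy
    }

# ref = prosody_features("reference.wav")   # hypothetical file names
# rep = prosody_features("reply.wav")       # then compare ref vs. rep per feature
```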
With continued reference to Table 1, emotion appropriateness (EA) is used to evaluate whether the emotion of the character reply voice is consistent with the context and the character, i.e., whether the emotional expression of the character reply voice conforms to the character setting and the context.
For example, for the question information "A, what do you like to do most?", the A reply voice generated by SRPAs based on the user question data is: "What I like most is helping others and protecting everyone's safety. Every time I see people happy because of my help, I feel a special sense of achievement. Of course, I also like learning new powers with my master F." The emotional expression of the A reply voice is positive and upbeat, conforms to the character setting and the context, and avoids inappropriate emotions. Thus, it can be determined that the emotion appropriateness of the A reply voice generated by SRPAs is relatively high.
Based on the speech naturalness (SN), prosodic consistency (PC), and emotion appropriateness (EA) evaluations described above, the speech expressiveness of SRPAs can be effectively evaluated.
With continued reference to Table 1, personality consistency (PeC) is used to evaluate whether the personality shown in the character reply voice is consistent with the character's personality and vivid, i.e., whether the character reply voice reflects the character's personality traits, such as optimism and bravery.
For example, for the question information "A, what are you most afraid of?", the A reply voice generated by SRPAs based on the user question data is: "What I am most afraid of is seeing my friends get hurt. I will do my best to protect them and keep them away from danger." It can thus be determined that the A reply voice generated by SRPAs reflects A's brave and responsible personality traits and is consistent with the character setting.
As another example, for the question information "A, when are you happiest?", the A reply voice generated by SRPAs based on the user question data is: "I am happiest when helping others solve problems. Seeing them happy because of my help gives me a special sense of achievement." The A reply voice reflects A's optimistic and helpful personality traits and conforms to the character setting. Therefore, it can be determined that the personality consistency of the A reply voices generated by SRPAs with A is relatively high.
With continued reference to Table 1, knowledge consistency (KC) is used to evaluate whether the character reply voice embodies the character's knowledge and experience, i.e., whether the character reply voice is fully grounded in the character's background information and fabricates no facts inconsistent with the character setting.
For example, for the question information "A, how did you learn to use Q1?", the A reply voice generated by SRPAs based on the user question data is: "I learned it from my master F. He taught me many skills and made me more powerful." The A reply voice matches A's background information (e.g., his master is F) and fabricates no facts inconsistent with the character setting. Thus, it can be determined that the knowledge consistency of the A reply voice generated by SRPAs is relatively high.
As another example, for the question information "A, what happened between you and E?", the A reply voice generated by SRPAs based on the user question data is: "E and I had conflicts, but he finally understood my standpoint. We have now reconciled." The A reply voice is grounded in A's background information (e.g., the conflict between A and E and its resolution) and fabricates no facts inconsistent with the character setting. Thus, it can be determined that the knowledge consistency of the A reply voice generated by SRPAs is relatively high.
Based on the personality consistency (PeC) and knowledge consistency (KC) evaluations described above, the role-playing quality of SRPAs can be effectively evaluated.
Based on the above multiple evaluation dimensions, a comprehensive and systematic quality evaluation of the character reply voice generated by the voice interaction model (SRPAs) can be realized, and the performance of SRPAs can be quantified so that targeted optimization can be performed. This further ensures that the character reply voice generated by SRPAs is accurate, fluent, and natural and conforms to the character setting and background information, providing the user with a more realistic, natural, and emotionally rich voice interaction experience.
In some embodiments of the application, SRPAs may be cascaded SRPAs or end-to-end SRPAs.
For example, the electronic device may generate the character reply voice corresponding to the user question data based on cascaded SRPAs. It can be appreciated that cascaded SRPAs is a staged, step-by-step voice interaction model that decomposes the generation of character reply voice into multiple stages, each responsible for a specific task, and finally produces the character reply voice.
Specifically, cascaded SRPAs may include an automatic speech recognition (ASR) stage, an LLM reasoning stage, and a speech synthesis (TTS) stage. First, in the ASR stage, the question information (e.g., question voice) input by the user is received and converted into question text; for example, the question voice "A, what do you like to do most?" is converted into the question text "A, what do you like to do most?". Then, in the LLM reasoning stage, the LLM processes the recognized question text and generates a character reply text. For example, the generated character reply text may be: "What I like most is helping others and protecting everyone's safety. Every time I see people happy because of my help, I feel a special sense of achievement. Of course, I also like learning new powers with my master F." Finally, in the TTS stage, the generated character reply text is converted into character reply voice.
It will be appreciated that the data set corresponding to cascaded SRPAs is the character voice interactive dialogue data set constructed as described above; therefore, a character reply voice conforming to the character's voice characteristics can be generated based on cascaded SRPAs.
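The staged control flow of cascaded SRPAs can be sketched as follows; the `asr`, `llm`, and `tts` callables are assumed interfaces standing in for real models, and the dummy stand-ins in the usage lines are for illustration only.

```python
# Three-stage cascaded pipeline: speech recognition, LLM reasoning in
# character, then speech synthesis of the generated reply text.
from typing import Callable

def cascaded_reply(question_audio: bytes,
                   asr: Callable[[bytes], str],
                   llm: Callable[[str], str],
                   tts: Callable[[str], bytes]) -> bytes:
    question_text = asr(question_audio)   # stage 1: ASR, question voice -> text
    reply_text = llm(question_text)       # stage 2: LLM generates the reply text
    return tts(reply_text)                # stage 3: TTS, reply text -> voice

# Usage sketch with dummy stand-ins:
reply_audio = cascaded_reply(
    b"<question wav>",
    asr=lambda a: "A, what do you like to do most?",
    llm=lambda t: "What I like most is helping others and protecting everyone.",
    tts=lambda t: b"<reply wav>",
)
```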
For another example, the electronic device may generate the character reply voice corresponding to the user question data based on end-to-end SRPAs. It will be appreciated that end-to-end SRPAs is a model that goes directly from voice input to voice output and can generate the character reply voice directly from the user question information (e.g., question voice). That is, end-to-end SRPAs can directly generate the character reply voice according to the user question information and the character description text.
Specifically, end-to-end SRPAs receives the question voice input by the user, extracts voice features, and generates the corresponding character reply voice. This process does not require converting the question voice into question text; the character reply voice is generated directly. For example, end-to-end SRPAs can generate A's reply voice directly from the question voice "A, what do you like to do most?": "What I like most is helping others and protecting everyone's safety. Every time I see people happy because of my help, I feel a special sense of achievement. Of course, I also like learning new powers with my master F."
It will be appreciated that training end-to-end SRPAs on the character voice interactive dialogue data set constructed above enables end-to-end SRPAs to generate character reply voices conforming to the character's voice characteristics based on the user question data.
In some embodiments of the application, evaluating the voice interaction model (SRPAs) based on the character voice interactive dialogue data set includes: determining at least one group of test interactive dialogues in the character voice interactive dialogue data set, where each group of test interactive dialogues includes test question information and a test character reply voice corresponding to the test question information; inputting the test question information into the voice interaction model to obtain a model reply voice output by SRPAs; and evaluating the voice interaction model based on the model reply voice and the test character reply voice corresponding to the test question information.
It will be appreciated that the model reply voice may refer to the character reply voice generated by SRPAs in the embodiment of the present application.
It may be appreciated that a test interactive dialogue may be any group of interactive dialogues in the character voice interactive dialogue data set. Each group of interactive dialogues may be a single-round character voice interactive dialogue, i.e., one piece of question information and the character reply voice corresponding to it. Each group of interactive dialogues may also be a multi-round character voice interactive dialogue, i.e., multiple pieces of question information and the character reply voice corresponding to each piece, where each round of the character voice interactive dialogue may have contextual continuity with the preceding and following rounds.
In some embodiments of the application, evaluating the voice interaction model based on the model reply voice and the test character reply voice corresponding to the test question information includes: inputting the model reply voice and the test character reply voice corresponding to the test question information into an evaluation model for evaluation processing to obtain an evaluation score corresponding to the model reply voice; and evaluating the voice interaction model based on the evaluation score.
It will be appreciated that the evaluation score may range from 1 to 10, and in other embodiments of the present application, the evaluation score may also range from other ranges, without limitation.
Specifically, the evaluation model may evaluate SRPAs based on the multiple evaluation dimensions illustrated in table 1 above, resulting in corresponding evaluation scores. The evaluation model may be a Large Language Model (LLM) or the like, and is not particularly limited herein.
First, the evaluation model may determine at least one group of test interactive dialogues from the character voice interactive dialogue data set. For example, one group of test interactive dialogues is: test question information: "A, what do you like to do most?"; A's reply voice (i.e., the test character reply voice): "What I like most is helping others and protecting everyone's safety. Every time I see people happy because of my help, I feel a special sense of achievement. Of course, I also like learning new powers with my master F."
Then, the test question information "A, what do you like to do most?" is input into the voice interaction model to obtain the model reply voice output by SRPAs: "What I like most is helping others and protecting everyone. Seeing people happy because of my help gives me a special sense of achievement. I also like learning new powers with my master F."
Further, the test character reply voice and the model reply voice of the example are input into the evaluation model together for evaluation processing to obtain the evaluation score corresponding to the model reply voice. The evaluation model may generate scoring reasons based on the evaluation dimensions exemplified in Table 1, i.e., score the test character reply voice and the model reply voice separately based on the evaluation dimensions in Table 1. For example, the evaluation score of the model reply voice obtained by the evaluation model is 8 points, and the evaluation score of the test character reply voice is 9 points.
Finally, the evaluation model may evaluate the voice interaction model based on the evaluation scores, i.e., determine the ratio of the evaluation score of the model reply voice to the evaluation score of the test character reply voice as 8/9 ≈ 0.8889; the closer this ratio is to 1, the higher the role-playing voice quality of SRPAs. Further, the calculated ratio may be converted back into the score range of 1 to 10 points so as to be consistent with the original scoring criteria. For example, if the ratio 0.8889 corresponds to 8 points under the scoring criteria, the final evaluation score is 8 points. Based on this, it can be determined that the role-playing voice quality of SRPAs is relatively high.
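The ratio computation just described can be sketched as follows; the truncation used to map the ratio back into the 1-to-10 range is an assumption chosen to be consistent with the example above (0.8889 mapping to 8 points), since the application does not fix a specific conversion rule.

```python
# Compare the model reply voice's score against the test character reply
# voice's score, then map the ratio back into the original 1-10 scale.
def relative_score(model_score: float, reference_score: float) -> int:
    ratio = model_score / reference_score    # closer to 1 -> better role-playing
    return max(1, min(10, int(ratio * 10)))  # truncate into the 1-10 range

print(relative_score(8, 9))  # ratio 8/9 ~= 0.8889 -> final evaluation score 8
```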
It will be appreciated that, on the evaluation dimensions shown in Table 1, cascaded SRPAs is generally superior to end-to-end SRPAs. For example, cascaded SRPAs generally outperforms end-to-end SRPAs in role-playing quality and basic interaction capability.
Based on the above, through the multiple evaluation dimensions, a comprehensive and systematic quality evaluation of the character reply voice generated by the voice interaction model (SRPAs) can be realized, and the interaction performance of SRPAs can be quantified so that targeted optimization can be performed. For example, if SRPAs scores low on emotion appropriateness, the emotional expression capability of SRPAs can be improved by adjusting model parameters. This further ensures that the character reply voice generated by SRPAs is accurate, fluent, and natural and conforms to the character setting and background information, thereby providing the user with a more realistic, natural, and emotionally rich voice interaction experience.
In some embodiments of the application, evaluating the voice interaction model based on the character voice interactive dialogue data set further includes: dividing the interactive dialogues in the character voice interactive dialogue data set into training interactive dialogues and test interactive dialogues according to a preset ratio, where the test interactive dialogues include question information and test character reply voices corresponding to the question information; performing model training on the voice interaction model based on the training interactive dialogues to obtain a trained voice interaction model; and evaluating the trained voice interaction model based on the test interactive dialogues.
For example, after the character voice interactive dialogue data set corresponding to "A" and the like is determined, the character voice interactive dialogue data set corresponding to each character may be divided into training interactive dialogue voice data and test interactive dialogue voice data according to a preset ratio (e.g., 80% training data and 20% test data). On this basis, SRPAs can be trained using the training interactive dialogues. During training, SRPAs can learn how to generate character reply voices conforming to A's character characteristics from the input question information, yielding the trained SRPAs. Finally, the trained SRPAs is evaluated based on the test interactive dialogues. For details, refer to the above process of evaluating SRPAs based on the evaluation model, which is not repeated here.
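A minimal sketch of the preset-ratio split described above (80% training, 20% test); shuffling with a fixed seed is an added assumption for reproducibility, not a requirement of the method.

```python
# Split one character's voice interactive dialogue data set into training and
# test interactive dialogues according to a preset ratio.
import random

def split_dataset(dialogues: list, train_ratio: float = 0.8, seed: int = 0):
    items = dialogues[:]                # copy so the original order is kept
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]     # (training dialogues, test dialogues)

train, test = split_dataset(list(range(1000)))
print(len(train), len(test))  # 800 200
```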
In some embodiments of the present application, training with the character voice interactive dialogue data set can effectively improve the performance of SRPAs. For example, for character reply voice in Chinese, the role-playing quality of SRPAs after training may be effectively improved from 0.5117 to 0.8028, or from 0.5296 to 0.8468, compared with before training. For another example, for character reply voice in English, the role-playing quality of SRPAs after training may be effectively improved from 0.4340 to 0.7098, or from 0.4786 to 0.8028, compared with before training. Dimensions of the trained SRPAs such as instruction adherence, emotion appropriateness, and persona consistency are also effectively enhanced.
FIG. 3 illustrates a process diagram of a voice interaction model evaluation method according to some embodiments of the application.
Referring to FIG. 3, first, the electronic device may acquire multimedia audio from TV dramas, movies, games, animations, and the like, and select a plurality of characters (e.g., character A, character B, and character C) and the corresponding character audio and scripts (e.g., audio A1 and script A2, audio B1 and script B2, audio C1 and script C2). A script refers to text content extracted from media such as movies, TV, animation, or games. From these, character description information of the different characters can be obtained, including character profile information, character background information, character line sets, and the like. The character profile information may include characteristics such as the character's name and personality and the corresponding characteristic values; the character background information may include background information such as the character's environment (e.g., story setting and social identity) and relationship network; and the character line set may include the monologues, dialogues, and narrative modes corresponding to the character. Then, using the LLM, target interactive dialogue data corresponding to each character can be generated based on the character description text. It should be understood that the target interactive dialogue data may include at least one piece of question information and character reply information corresponding to each piece of question information. Next, a character voice interactive dialogue data set is generated using a speech synthesis (TTS, i.e., text-to-speech) model in combination with the reference voice embodying the character's voice features and the target interactive dialogue data; it is understood that the character voice interactive dialogue data set may include at least one piece of question information and the character reply voice corresponding to each piece of question information. After the character voice interactive dialogue data set is acquired, it may be input to SRPAs. On this basis, SRPAs can output a corresponding character reply voice based on the question information (e.g., question text or question voice). Further, the character reply voice may be input to an evaluation model, and the evaluation model may obtain a reference character reply voice selected from the character voice interactive dialogue data set and evaluate the character reply voice on the eight evaluation dimensions in Table 1 to obtain an evaluation score. The basic interaction capability, speech expressiveness, and role-playing quality of SRPAs can then be evaluated based on the evaluation score.
Based on this, a comprehensive and systematic quality evaluation of SRPAs and the character reply voices it generates can be performed, and the performance of SRPAs can be quantified with evaluation scores, enabling targeted optimization. This further ensures that the character reply voice generated by SRPAs is accurate, fluent, and natural and conforms to the character setting and background information, providing the user with a more realistic, natural, and emotionally rich voice interaction experience.
The application provides an electronic device, including one or more processors and one or more memories storing one or more programs which, when executed by the one or more processors, cause the electronic device to perform the voice interaction model evaluation method provided in the embodiments of the application.
FIG. 4 shows a schematic structural diagram of an electronic device 500 according to an embodiment of the present application. Referring to FIG. 4, the electronic device 500 shown in FIG. 4 may be, for example, a server or a terminal, which is not limited herein.
As shown in FIG. 4, the electronic device 500 may include a processor 510, a memory 520, a communication interface 530, and a bus 540.
Processor 510 may include one or more processing units, such as processor 510 including a central processor, a modem processor, a baseband processor, and the like. In some embodiments, the different processing units may be separate devices or may be integrated in one or more processors.
In some embodiments, processor 510 may be configured to execute one or more programs to implement the voice interaction model evaluation methods provided in embodiments of the present application.
In some embodiments, the processor 510 may also be configured to execute preset units/modules with corresponding functions to implement the voice interaction model evaluation method provided by the present application.
Memory 520 may include one or more memories for storing data or program code. For example, in some embodiments, the memory 520 may be used to store the reference voices of characters, character description texts, target interactive dialogue data, character voice interactive dialogue data sets, and the like.
In some embodiments, memory 520 may include a hard drive, a solid state drive, a flash memory. In some embodiments, memory 520 may include removable or non-removable or fixed media. In some embodiments, memory 520 may be internal or external to electronic device 500.
A communication interface 530 is used to enable communication between the electronic device 500 and other electronic devices. The communication interface 530 may include a wired or wireless communication interface (e.g., a Wi-Fi air interface) to enable the electronic device 500 to communicate with other electronic devices over a wired or wireless network.
Bus 540 is used to connect processor 510, memory 520, communication interface 530, and other possible modules or circuit structures.
It should be understood that the configuration of the electronic device 500 shown in fig. 4 is merely an example, and in other embodiments, the electronic device 500 may include more or less structures, and is not limited in this regard.
The application provides a chip, which comprises a processor, wherein the processor is coupled with a memory and is used for executing a computer program or instructions stored in the memory, so that the chip realizes the voice interaction model evaluation method provided by the embodiment of the application.
The application provides a readable storage medium on which a program or instructions are stored; when the program or instructions are executed, an electronic device is caused to perform the voice interaction model evaluation method provided in the embodiments of the application.
The present application provides a computer program product comprising computer programs/instructions which, when executed on an electronic device, cause the electronic device to implement the speech interaction model assessment method provided in the embodiments of the present application.
Embodiments of the disclosed mechanisms may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as program modules or module code executing on a programmable system including at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program modules or module code may be applied to the input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this disclosure, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The module code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The module code may also be implemented in assembly or machine language, if desired. The mechanisms described in the present application are not limited in scope to any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other readable storage media. Thus, a machine-readable storage medium may include any mechanism for storing or transmitting information in a form readable by a machine, including but not limited to floppy diskettes, optical disks, magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet in an electrical, optical, acoustical, or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable storage medium includes any type of machine-readable storage medium suitable for storing or transmitting electronic instructions or information in a machine-readable form.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one example implementation or technique disclosed in accordance with embodiments of the application. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
The disclosure of the embodiments of the present application also relates to an operating device for executing the operations herein. The apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose machine selectively activated or reconfigured by a program stored in the machine. Such a program may be stored in a readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application-specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, each of which may be coupled to a system bus. Furthermore, the machines referred to in the specification may comprise a single processor or may be architectures employing multiple processors for increased computing power.
Additionally, the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, the present disclosure of embodiments is intended to be illustrative, but not limiting, of the scope of the concepts discussed herein.