Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, the disclosure may be embodied in many forms other than those described herein, and those skilled in the art to whom this disclosure pertains may make similar generalizations without departing from the spirit of the disclosure; therefore, the disclosure is not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of this specification. As used in this specification, in one or more embodiments, and in the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of this specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of this specification. The term "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
First, terms related to one or more embodiments of the present specification will be explained.
Multi-modal interaction: an interaction mode in which a user can communicate with a digital person through voice, text, expressions, actions, gestures, and the like, and the digital person can reply to the user through the same modes.
Duplex interaction: an interaction mode that supports real-time, two-way communication, in which the user and the digital person can interrupt or reply to each other at any time.
Non-exclusive dialogue: a dialogue form in which both parties can communicate bidirectionally, and the user and the digital person can interrupt or reply to each other at any time.
VAD (Voice Activity Detection): also known as voice endpoint detection or voice boundary detection.
TTS (Text To Speech): technology that converts text into sound.
Digital person: a virtual character with a digitized appearance that can be used in virtual reality applications to interact with a real person. In communication with a digital person, the traditional interaction mode is an exclusive question-and-answer mode with voice as the carrier.
Currently, the following problems may occur during the interaction of the avatar with the user:
1) In terms of smoothness of communication: in an exclusive communication form, the user cannot actively interrupt the digital person's speech, and the digital person cannot immediately accept and reply while the user is speaking, so the communication between the user and the digital person is not intelligent.
2) In terms of diversity of perception capability: in a communication mode with voice as the only carrier, the digital person cannot perceive changes in the user's face, such as expressions and dialogue state, and cannot perceive the user's body movements, such as gestures and body postures. Lacking this information, the digital person cannot give immediate feedback on the user's state during communication, which makes the conversation very rigid.
3) In terms of dialogue response time: due to factors such as ASR and system latency, a typical dialogue delay is about 1.2-1.5 s, whereas the delay in human-to-human conversation is about 600-800 ms; an excessively long dialogue delay causes a serious sense of stalling and a poor user experience.
In addition, existing intelligent dialogue control systems for virtual character interaction at best support duplex capability only for voice; they lack video understanding capability and visual duplex-state decision capability, and cannot perceive multi-modal information such as the user's expressions, actions, and environment. Some dialogue systems support only basic question-answering capability and have no duplex capability (active/passive interruption and acceptance) at all.
Based on this, the multi-modal interaction method provided in the embodiments of the present disclosure is applied to a virtual character interaction control system. By providing a multi-modal control module, a multi-modal duplex state module, and a basic dialogue module, an interaction process between a virtual character and a real user can be implemented. On the basis of completing basic dialogue tasks, multi-modal data can be further recognized, so that the virtual character can actively accept and interrupt the user's speech, the interaction delay of the system is shortened, and multi-modal information such as the user's expressions, actions, and gestures is perceived. The method is therefore applicable to a variety of application scenarios, including complex scenarios such as identity verification, fault assessment, and article verification, with good application effect.
It should be noted that the multi-modal control module controls the input and output of the video stream and the voice stream in the interactive system. At the input end, the module segments and understands the input voice stream and video stream and controls whether the multi-modal duplex system is triggered, which reduces the transmission cost of the system while speeding up its processing. At the output end, it is responsible for rendering the results of the system into the digital person's output video stream. The multi-modal duplex state management module manages the state of the current dialogue and decides the duplex state. The current duplex state includes 1) duplex active/passive interruption, 2) duplex active acceptance, 3) invoking the basic dialogue system or business logic, and 4) no feedback, as in the sketch below. The basic dialogue module provides basic business logic and dialogue question-answering capability.
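To make the four duplex states concrete, the following is a minimal Python sketch of the state set and a skeleton decision function. The enum values mirror the four states listed above; the predicates on `interaction_state` are hypothetical placeholders for the real decision logic, which the text describes only at this level of detail.

```python
from enum import Enum, auto

class DuplexState(Enum):
    """The four duplex states maintained by the multi-modal duplex state management module."""
    ACTIVE_OR_PASSIVE_INTERRUPT = auto()  # duplex active/passive interruption
    ACTIVE_ACCEPT = auto()                # duplex active acceptance
    CALL_BASIC_DIALOGUE = auto()          # invoke the basic dialogue system or business logic
    NO_FEEDBACK = auto()                  # give no feedback for this decision unit

def decide_duplex_state(interaction_state) -> DuplexState:
    """Decide the current duplex state from the fused interaction state.

    `interaction_state` is assumed to carry three slots: recognized text,
    user gesture action, and user emotion (see the fusion step later on).
    The predicates below are hypothetical placeholders for the real decision logic.
    """
    if interaction_state.has_interrupt_intention():    # e.g. negative emotion, "stop" gesture
        return DuplexState.ACTIVE_OR_PASSIVE_INTERRUPT
    if interaction_state.needs_acknowledgement():       # e.g. user waves or pauses mid-sentence
        return DuplexState.ACTIVE_ACCEPT
    if interaction_state.is_complete_question():        # a full question that business logic can answer
        return DuplexState.CALL_BASIC_DIALOGUE
    return DuplexState.NO_FEEDBACK                       # e.g. the VAD pause has not reached the threshold
```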
Furthermore, in the following embodiments, a detailed description will be given of a multi-mode interaction method provided in the embodiments of the present disclosure, and specific processing manners of each module.
Based on this, in the present specification, a multi-modal interaction method is provided, and the present specification relates to a multi-modal interaction apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a system configuration of a multi-modal interaction method applied to a virtual character interaction control system according to an embodiment of the present disclosure.
Fig. 1 shows an avatar interaction control system 100, and the avatar interaction control system 100 includes a multi-modal control module 102 and a multi-modal duplex state management module 104.
In practical applications, the multimodal control module 102 in the avatar interaction control system 100 serves both as the input point for the video stream and the voice stream and as the output point for the avatar interaction video stream, where the multimodal input portion includes a video stream input and a voice stream input. Meanwhile, the multi-modal control module 102 performs emotion detection and gesture detection on the video stream and voice detection on the voice stream, and inputs the detection result of the video stream and/or the detection result of the voice stream into the duplex state decision in the multi-modal duplex state management module 104 to determine the interaction policy of the virtual character, where the interaction policy can mainly be divided into action acceptance and text-plus-action acceptance. Further, the multimodal duplex state management module 104 may render the avatar according to the determined avatar interaction policy, so as to obtain the video stream of the rendered avatar, which is output through the multimodal control module 102.
According to the multi-modal interaction method provided by the embodiments of this specification, the user in the video stream is visually understood so that the user's emotions, actions, and the like are perceived, and an acceptance or interruption mode is provided for the virtual character. As a result, the interaction between the virtual character and the user becomes a non-exclusive conversation, and the virtual character can also provide multi-modal interaction through emotion, action, and/or voice.
Referring to fig. 2, fig. 2 shows a flowchart of a multi-mode interaction method according to an embodiment of the present disclosure, which specifically includes the following steps.
It should be noted that the multi-modal interaction method provided in the embodiments of the present disclosure is applied to a virtual character interaction control system, and through this system, interaction between the virtual character and the user can be supported with small interaction delay, smooth communication, and human-like behavior.
Step 202, receiving multi-modal data, wherein the multi-modal data comprises voice data and video data.
In practical applications, the avatar interaction control system may receive multimodal data, namely the voice data and video data corresponding to a user, where the voice data may be understood as the voice of the user communicating with the avatar. For example, the voice data may be the user asking the avatar, "May I ask what order I placed?" The video data may be understood as the expressions, actions, and mouth shape shown by the user while expressing the voice data to the avatar, as well as video of the environment in which the user is located. Following the above example, while the user expresses the voice data, the expression shown in the video data may be confusion, the action may be spreading the hands, and the mouth shape is the one corresponding to the expressed voice data.
It should be noted that, in order to realize the interaction between the virtual character and the user, the virtual character needs to respond immediately to the user's voice data and video data, so as to reduce the delay generated in the interaction process. Meanwhile, functions such as mutual interaction, interruption, and acceptance by both parties are supported.
And step 204, identifying the multi-mode data to obtain user intention data and/or user gesture data, wherein the user gesture data comprises user emotion data and user action data.
Here, the user intention data may be understood as the intention of the voice data expressed by the user. For example, following the previous example, the intention of the voice data "May I ask what order I placed?" is to ask the avatar whether it can help query the user's previous order.
User gesture data may be understood as gesture data expressed by a user in video data, including user emotion data and user action data. Such as "puzzled" emotion expressed by the user's face, and "spreading" action exhibited by the user's hand, as in the above example.
In practical application, the virtual character interaction control system can identify the multi-modal data, recognizing the voice data and the video data in the multi-modal data separately. Further, the user intention data is obtained by recognizing the voice data, and the user gesture data is obtained by recognizing the video data; the user gesture data may include user emotion data as well as user action data. It should be noted that, in different application scenarios, only the user intention data, or only the user gesture data, or both may be identified from the user's multi-modal data; hence the expression "and/or" is used in this embodiment, and no limitation is intended.
Further, the virtual character interaction control system can recognize the voice data and the video data separately, and thereby determine information such as the user's intention, emotion, action, and gesture. Specifically, this includes: performing text conversion on the voice data in the multi-modal data and recognizing the converted text data to obtain user intention data; performing emotion recognition on the video data and/or the voice data in the multi-modal data to obtain user emotion data; performing gesture recognition on the video data in the multi-modal data to obtain user action data; and determining the user gesture data based on the user emotion data and the user action data.
In practical application, the virtual character interaction control system can perform text conversion on the voice data in the multi-modal data and then recognize the converted text data to obtain the user intention data. The specific speech-to-text technology includes, but is not limited to, ASR, and this embodiment does not limit the specific conversion method. It should be noted that, in order to ensure that the interactive system can give instant feedback even while the user is speaking, the system can segment the voice stream according to a VAD time of 200 ms, divide it into small voice units, and input each voice unit into the ASR module to convert it into text, which facilitates the subsequent recognition of user intention data, as sketched below.
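The following is a minimal Python sketch of slicing the voice stream into ~200 ms-bounded voice units that are handed to ASR as soon as they close. The frame length, the `is_speech` predicate, and the `asr.transcribe` call are assumptions standing in for the real streaming VAD/ASR interfaces, which the text does not specify.

```python
from dataclasses import dataclass

VAD_SILENCE_MS = 200          # a pause of ~200 ms ends a voice unit (roughly one breath interval)
FRAME_MS = 20                 # assumed audio frame length

@dataclass
class VoiceUnit:
    frames: list              # raw audio frames belonging to one small decision unit

def split_voice_stream(frames, is_speech):
    """Split a voice stream into small voice units at ~200 ms pauses.

    `frames` is an iterable of fixed-length audio frames and `is_speech`
    is a hypothetical VAD predicate (frame -> bool); both stand in for the
    real streaming interfaces.
    """
    unit, silence_ms = [], 0
    for frame in frames:
        if is_speech(frame):
            unit.append(frame)
            silence_ms = 0
        else:
            silence_ms += FRAME_MS
            if unit and silence_ms >= VAD_SILENCE_MS:
                yield VoiceUnit(unit)   # hand the unit to ASR immediately
                unit, silence_ms = [], 0
    if unit:
        yield VoiceUnit(unit)

# Each unit is transcribed as soon as it closes, so intention recognition can
# start before the user has finished the whole sentence:
#   for unit in split_voice_stream(audio_frames, vad.is_speech):
#       partial_text = asr.transcribe(unit.frames)   # hypothetical ASR call
```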
Further, after determining the user intention data, the virtual character interaction control system can perform user emotion recognition and gesture recognition based on the video data. It should be noted that the user's emotion may be recognized not only from the video data but also from the voice data, or from both together. For example, emotion recognition may be based on changes in the user's facial expression (eye movements, lip twitches) or head-shaking movements in the video data, or on the volume and breathing in the voice data. In addition, the virtual character interaction control system can recognize actions displayed by the user from the video data, for example the user's gestures; when the user makes a hand-spreading gesture, the corresponding user action data is obtained. Finally, the virtual character interaction control system can comprehensively determine the user's gesture data from the user emotion data and the user action data.
It should be noted that, the virtual character interaction control system can recognize and sense the tiny changes occurring in the voice data and the video data of the user, so as to accurately capture the intention and the dynamics of the user, and facilitate the subsequent decision of the strategy and the mode of the virtual character to realize the multi-modal interaction with the user.
Furthermore, in order to obtain the user's emotion data as quickly as possible, the virtual character interaction control system can adopt a two-stage recognition mode, namely, first performing coarse emotion recall detection, and then classifying the emotion to obtain the target emotion. Specifically, the step of performing emotion recognition on the video data in the multimodal data to obtain user emotion data includes:
Performing emotion detection on the video data in the multi-modal data, and, in the case that the video data is detected to contain a target emotion, classifying the target emotion in the video data to obtain user emotion data. The target emotion can be understood as a user emotion preset by the system, such as anger, displeasure, neutrality, happiness, or surprise.
Specifically, the avatar interaction control system may first perform emotion detection on the video data in the multimodal data. When the video stream is detected to contain a target emotion preset by the system, the target emotion can be classified to obtain the user emotion data. In practical applications, in order to achieve both good recognition speed and good recognition accuracy, the system adopts a two-stage recognition mode: the user's expression in the video stream is first coarsely detected, and only when a target emotion of the user is detected in the video stream is emotion classification performed to determine the final user emotion data.
It should be noted that the virtual character interaction control system may be configured with a coarse target-emotion recall module and an emotion classification module. The coarse target-emotion recall module can perform coarse-grained detection on the video stream, and the emotion classification module can perform emotion classification on the video stream to determine whether the user emotion data is angry, displeased, neutral, happy, or surprised. The coarse target-emotion recall module may employ a ResNet model and the emotion classification module may employ a temporal Transformer model, but the embodiment is not limited to these two model types.
When the virtual character interaction control system does not detect a specified emotion from the user, the video stream is not forwarded, which reduces the transmission cost of the system while speeding up its recognition efficiency. A minimal sketch of this two-stage pipeline follows.
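The sketch below shows the coarse-recall-then-classify structure and the gate that stops the video stream from being forwarded when no target emotion is found. The model handles and their methods are hypothetical; the text names the model families (a ResNet-style recall model and a temporal Transformer classifier) but not their interfaces.

```python
TARGET_EMOTIONS = ["angry", "displeased", "neutral", "happy", "surprised"]

def recognize_emotion(video_frames, coarse_recall_model, emotion_classifier):
    """Two-stage emotion recognition: coarse recall first, then fine classification.

    `coarse_recall_model` (e.g. a ResNet-style frame classifier) and
    `emotion_classifier` (e.g. a temporal Transformer over a frame window)
    are hypothetical model handles.
    """
    # Stage 1: cheap coarse-grained gate on each frame.
    candidate_frames = [f for f in video_frames if coarse_recall_model.contains_target_emotion(f)]
    if not candidate_frames:
        return None            # no target emotion: do not forward the video stream downstream

    # Stage 2: finer classification over the detected window only.
    label = emotion_classifier.classify(candidate_frames)   # returns one of TARGET_EMOTIONS
    return label if label in TARGET_EMOTIONS else None
```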
Similarly, when the virtual character interaction control system performs gesture recognition on the video data of the user, a two-stage recognition mode can be adopted. Specifically, the step of performing gesture recognition on the video data in the multimodal data to obtain user action data includes:
Performing gesture detection on the video data in the multi-modal data, and, in the case that the video data is detected to contain a target gesture, classifying the target gesture in the video data to obtain user action data.
The target gesture may be understood as a gesture type preset by the system, such as a gesture with clear meaning (e.g. OK, a number, or a left/right swipe), an unsafe gesture (e.g. a raised middle finger, a specific little-finger gesture, etc.), or a custom special gesture.
Specifically, the avatar interaction control system may perform gesture detection on the video data in the multimodal data. When the video stream is detected to contain a target gesture preset by the system, the target gesture can be classified to obtain the user action data. In practical application, the gesture recognition process can likewise employ a coarse target-gesture recall module and a gesture classification module, i.e., coarse-grained recognition of the user's gesture in the video stream is performed first, and the detected target gesture is then classified, so as to determine whether the user action data is a gesture with clear meaning (such as OK, a number, or a left/right swipe), an unsafe gesture (such as a raised middle finger or little finger), or a custom special gesture, as in the sketch below.
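A corresponding sketch for the gesture branch, mirroring the emotion pipeline; the concrete label sets below are illustrative stand-ins for the three gesture categories named in the text, and the model handles are again hypothetical.

```python
# Gesture categories assumed from the examples in the text; the exact label
# sets and the classifier interface are illustrative, not taken from the source.
CLEAR_MEANING = {"ok", "number", "swipe_left", "swipe_right"}
UNSAFE        = {"middle_finger", "little_finger"}
CUSTOM        = {"custom_special"}

def recognize_gesture(video_frames, coarse_recall_model, gesture_classifier):
    """Two-stage gesture recognition mirroring the emotion pipeline."""
    candidate_frames = [f for f in video_frames if coarse_recall_model.contains_target_gesture(f)]
    if not candidate_frames:
        return None, None                  # nothing to forward downstream

    label = gesture_classifier.classify(candidate_frames)
    if label in CLEAR_MEANING:
        return label, "clear_meaning"
    if label in UNSAFE:
        return label, "unsafe"
    if label in CUSTOM:
        return label, "custom"
    return None, None
```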
According to the multi-mode interaction method provided by the embodiment of the specification, the emotion and the action of the user are identified by adopting a two-stage identification mode, so that the identification process can be completed rapidly, the transmission cost of the system can be reduced, and the identification efficiency of the system can be improved.
After determining the user intention data and/or the user gesture data from the multimodal data, the virtual character interaction control system may first invoke pre-stored basic dialogue data to support the basic interaction process. Specifically, after the step of identifying the multi-modal data to obtain user intention data and/or user gesture data, the method further includes:
Invoking pre-stored basic dialogue data based on the user intention data and/or the user gesture data, wherein the basic dialogue data comprises basic voice data and/or basic action data; and rendering an output video stream of the virtual character based on the basic dialogue data, and driving the virtual character to display the output video stream.
Basic dialogue data may be understood as voice and/or action data pre-stored in the system that can drive the virtual character to perform basic interactions. For example, the basic dialogue data includes basic communication voice data stored in a database, including but not limited to "hello", "thank you", "is there anything else", and the like, and basic communication action data including, but not limited to, a "heart" gesture, a "waving" motion, a "nodding" motion, etc.
In practical application, the virtual character interaction control system can also search basic dialogue data which is matched with the user intention data and/or the user gesture data from the basic dialogue data which is stored in the system in advance according to the user intention data and/or the user gesture data, and call the basic dialogue data. Because the basic dialogue data comprises basic voice data and/or basic action data, the virtual character interaction control system can render an output video stream corresponding to the virtual character according to the basic voice data and/or the basic action data so as to drive the virtual character to display the output video stream.
It should be noted that the basic dialogue data may also include basic business data handled by the virtual character as preset by the system, for example a preset service provided for the user, which is not limited in this embodiment.
In summary, the virtual character interaction control system can recognize the multimodal data to determine the user's intention, expressed emotion, action, gesture, and other multimodal information, so that the virtual character can make human-like interactive responses according to the user's emotion data and gesture data.
In addition, in order that the virtual character can interact with the user in a human-like manner and support interaction states such as duplex active acceptance and duplex active/passive interruption, the virtual character interaction control system in the embodiments of this specification can also provide a multi-modal duplex state decision module, so as to determine the virtual character interaction policy and realize multi-modal duplex acceptance/interruption.
Based on this, the avatar interaction control system may design three interaction modules, and referring to fig. 3, fig. 3 shows a system architecture diagram of the avatar interaction control system provided in the embodiment of the present disclosure.
Fig. 3 includes three modules, i.e., a multi-modal control module, a multi-modal duplex state management module, and a basic dialogue module, which can also be regarded as subsystems, i.e., a multi-modal control system, a multi-modal duplex state management system, and a basic dialogue system. The multi-modal control system controls the input and output of the video stream and the voice stream in the interactive system. At the input end, the module segments and understands the input voice stream and video stream, the core being the processing functions for the voice stream, the streaming video expressions, and the streaming video actions. At the output end, it is responsible for rendering the results of the system into the digital person's output video stream. The multi-modal duplex state management system is responsible for managing the state of the current session and deciding the current duplex policy. Current duplex policies include duplex active/passive interruption, duplex active acceptance, invoking the basic dialogue system or business logic, and no feedback. The basic dialogue system comprises basic business logic and dialogue question-answering capability, i.e., given a user question as input, the system outputs the answer to that question; it generally includes three sub-modules: 1) an NLU (natural language understanding) module, which recognizes and understands text information and converts it into a structured semantic representation or intent label that a computer can understand; 2) a DM (dialogue management) module, which maintains and updates the current dialogue state and decides the next system action; 3) an NLG (natural language generation) module, which converts the system's output state into understandable natural-language text. A minimal sketch of this pipeline follows.
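The following Python sketch strings the three sub-modules together in the order described above. The NLU, DM, and NLG components are passed in as hypothetical callables, since the text describes their roles but not their implementations.

```python
class BasicDialogueSystem:
    """Minimal sketch of the basic dialogue system: NLU -> DM -> NLG.

    The three components are hypothetical callables standing in for the
    modules described in the text; their interfaces are assumptions.
    """

    def __init__(self, nlu, dm, nlg):
        self.nlu = nlu      # text -> structured semantic representation / intent label
        self.dm = dm        # (dialogue state, intent) -> (next system action, updated state)
        self.nlg = nlg      # system action -> natural-language reply text
        self.state = {}     # current dialogue state maintained across turns

    def answer(self, user_text: str) -> str:
        intent = self.nlu(user_text)                       # natural language understanding
        action, self.state = self.dm(self.state, intent)   # dialogue management
        return self.nlg(action)                            # natural language generation
```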
The following embodiments describe in detail the specific implementation of the multi-modal duplex state management module, i.e., how the virtual character interaction control system provides the virtual character with the capability of mutual acceptance and mutual interruption.
And 206, determining a virtual character interaction strategy based on the user intention data and/or the user gesture data, wherein the virtual character interaction strategy comprises a text interaction strategy and/or an action interaction strategy.
The virtual character interaction policy may be understood as the text decision, the action decision, or the combination of both with which the virtual character responds to the user, i.e., a text interaction policy and/or an action interaction policy. The text interaction policy can be understood as the interaction text with which the virtual character responds to the user's voice data, together with whether that text should interrupt within a sentence of the user's spoken text or accept at the end of the sentence. The action interaction policy can be understood as the interaction gesture with which the virtual character responds to the user's gesture data, together with whether that gesture should interrupt within a sentence of the user's spoken text or accept at the end of the sentence.
In practical application, the virtual character interaction control system can determine, from the user intention data, the virtual character's text acceptance content and whether it interrupts within the user's sentence or accepts at the end of the user's sentence, i.e., the text interaction policy. The system can likewise determine, from the user gesture data, the virtual character's gesture acceptance content and whether the gesture interrupts within the user's sentence or accepts at the end of the sentence, i.e., the action interaction policy. It should be noted that, for given user intention data and/or gesture data, the virtual character does not necessarily have both a text interaction policy and an action interaction policy; the two are likewise in an and/or relationship.
In addition, the virtual character can not only accept or interrupt the user's interaction but also support giving no feedback at all: when the user's VAD pause has not reached 800 ms and there is no need to invoke the basic dialogue system or business logic to answer, the system gives no feedback, as in the sketch below.
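A one-function sketch of the no-feedback rule just described; the 800 ms threshold comes from the text, while the function name and the `needs_business_answer` flag are illustrative.

```python
END_OF_TURN_VAD_MS = 800   # the user is treated as having finished speaking after ~800 ms of silence

def should_give_feedback(silence_ms: int, needs_business_answer: bool) -> bool:
    """No-feedback rule: stay silent while the user's pause is short and no answer is needed.

    Feedback is given once the VAD pause reaches the end-of-turn threshold, or earlier
    if the basic dialogue system / business logic needs to be invoked to answer.
    """
    return silence_ms >= END_OF_TURN_VAD_MS or needs_business_answer
```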
Specifically, the step of determining the virtual character interaction policy based on the user intention data and/or the user gesture data includes:
Performing fusion processing on video data in the multi-mode data based on the user intention data and/or the user gesture data, and determining target intention text and/or target gesture actions of the user; and determining a virtual character interaction strategy based on the target intention text and/or the target gesture action.
In practical application, after determining the user intention data and/or the user gesture data, the virtual character interaction control system can also perform fusion alignment processing on the text, the video stream and the voice stream, and comprehensively judge the target intention text and/or the target gesture action of the user. Further, a particular virtual character interaction policy may be subsequently determined based on the target intent text and/or the target gestural actions.
Taking emotion recognition as an example, the emotion classification module may have recognized an expression such as a smile from the user's face, but the user may in fact be giving a wry or helpless smile. To resolve such ambiguity, the virtual character interaction control system can make a multi-modal judgment that also takes the user's voice and the text currently being spoken into account, achieving a better result. In a specific implementation, the system can employ a multi-modal classification model to make a finer emotion judgment; the module finally outputs the current interaction state, which can comprise three state slots, namely text, user gesture action, and user emotion, for the multi-modal duplex state management module to use in the duplex state decision, as in the sketch below.
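A small sketch of the fusion and alignment step producing the three state slots. The `InteractionState` fields follow the three slots named above; the `multimodal_classifier.refine` call is a hypothetical interface for the finer, jointly-judged emotion.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InteractionState:
    """The three state slots output by the fusion & alignment step."""
    text: str                         # streaming ASR text recognized so far
    gesture: Optional[str] = None     # e.g. "ok", "middle_finger", or None
    emotion: Optional[str] = None     # e.g. "happy", "displeased", or None

def fuse_modalities(asr_text, emotion_label, gesture_label, multimodal_classifier=None):
    """Align the per-modality results into one interaction state.

    `multimodal_classifier` is a hypothetical model that re-judges the emotion
    jointly from the text and the video-derived emotion (for instance, to tell
    a genuine smile from a wry one); if absent, the video-derived label is used as-is.
    """
    emotion = emotion_label
    if multimodal_classifier is not None:
        emotion = multimodal_classifier.refine(text=asr_text, video_emotion=emotion_label)
    return InteractionState(text=asr_text, gesture=gesture_label, emotion=emotion)
```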
According to the multi-modal interaction method provided by the embodiments of this specification, by making a further comprehensive judgment on the user intention data and/or user gesture data, the user's interaction purpose is determined precisely, which avoids the subsequent virtual character producing invalid communication because the user's interaction purpose was misjudged, something that would make the virtual character appear less intelligent.
After the virtual character interaction control system accurately acquires the target intention text and/or the target gesture action of the user, the text interaction strategy and/or the action interaction strategy of the virtual character can be accurately determined respectively. Specifically, the determining the virtual character interaction strategy based on the target intention text and/or the target gesture action comprises the following steps:
Determining a text interaction policy for the avatar based on the target intent text, and/or
And determining an action interaction strategy of the virtual character based on the target gesture action.
In practical application, the virtual character interaction control system determines the text interaction policy between the virtual character and the user according to the target intention text. For example, if the user's target intention text is "query insurance order status", the avatar's text interaction policy may be to accept at the end of the user's sentence, i.e., the avatar may say "Sure, let me check that for you right away." If the user's target intention text is "Why are you so slow, you still haven't found it", the avatar's text interaction policy may interrupt and accept in the middle of the intention text, i.e., as soon as the user has said "Why are you so slow", the avatar can immediately reply "Please don't worry." In this way, instant communication between the virtual character and the user is achieved, producing the effect of a natural conversation.
Further, the virtual character interaction control system can also determine the action interaction policy between the virtual character and the user according to the target gesture action. For example, if the user's target gesture is an "OK" gesture, the virtual character's action interaction policy may also be to show an "OK" gesture. If the user's target gesture is a "middle finger" gesture, the virtual character may not respond with any action and may reply only with text content, such as "Is there something you are not satisfied with?", or reply only with a single helpless action such as shaking its head.
It should be noted that different text interaction policies and/or action interaction policies may be determined for different target intention texts and/or target gesture actions. For example, if only the target intention text is available, it may be determined that the virtual character adopts only a text interaction policy, only an action interaction policy, or a combination of both. If only the target gesture action is available, it may likewise be determined that the virtual character adopts only a text interaction policy, only an action interaction policy, or a combination of both. If both the target intention text and the target gesture action are available, the same three choices apply. The embodiments of the present disclosure do not exhaustively list all cases; the avatar interaction control system in this embodiment supports determining different virtual character interaction policies according to different interaction states.
Step 208, obtaining a three-dimensional rendering model of the virtual character.
In practical application, the virtual character interaction control system can acquire a three-dimensional rendering model of the virtual character, which facilitates the subsequent generation of the virtual character's interaction video stream from the three-dimensional rendering model to complete the multi-modal interaction with the user. It should be noted that the virtual character may be a cartoon or computer-drawn image, or may be a realistic human figure, which is not particularly limited in this embodiment.
And 210, generating the avatar of the avatar containing the action interaction strategy by utilizing the three-dimensional rendering model based on the avatar interaction strategy so as to drive the avatar to perform multi-modal interaction.
In practical application, the avatar interaction control system may generate the avatar of the avatar including the action interaction policy of the avatar according to the determined avatar interaction policy and using the three-dimensional rendering model. For example, the head action, the facial expression, the gesture action and the like corresponding to the virtual character, and further, the rendered virtual character image is driven to realize multi-mode interaction with the user.
Further, the virtual character interaction control system can specifically determine the text receiving position and/or the action receiving position corresponding to the virtual character according to the text interaction strategy and the action interaction strategy so as to realize the duplex active receiving process. Specifically, the step of generating, based on the virtual character interaction policy, the avatar of the virtual character including the action interaction policy by using the three-dimensional rendering model to drive the virtual character to perform multi-modal interaction includes:
The method comprises: determining a text accepting position of the virtual character's text interaction based on the text interaction policy, where the text accepting position is the accepting position corresponding to the voice data; determining an action accepting position of the virtual character's action interaction based on the action interaction policy, where the action accepting position is the accepting position corresponding to the video data; and generating, based on the text accepting position and/or the action accepting position, the avatar of the virtual character containing the action interaction policy by using the three-dimensional rendering model, so as to drive the virtual character to perform multi-modal interaction.
The text accepting position can be understood as the position, relative to the voice text expressed by the user, at which the virtual character's interaction text is delivered, and can be divided into in-sentence acceptance and end-of-sentence acceptance. The action accepting position can be understood as the position, relative to the voice text expressed by the user, at which the virtual character's interaction action is delivered, and can likewise be divided into in-sentence action acceptance and end-of-sentence action acceptance.
In practical application, the virtual character interaction control system can generate the virtual character image containing the action interaction strategy by utilizing the three-dimensional rendering model according to the text accepting position and/or the action accepting position after determining the text accepting position of the virtual character text interaction and the action accepting position of the virtual character action interaction so as to determine the multi-mode interaction process of the virtual character.
It should be noted that, when the virtual character interaction control system determines that the user's speech or action needs to be accepted, the current acceptance policy is triggered. There are two acceptance modes: action-only acceptance, and action-plus-text acceptance. Action-only acceptance means that the digital person does not make a verbal acceptance reply and only responds to the user with an action; for example, if the user suddenly waves to greet the digital person during the conversation, the digital person only needs to reply with a waving action and does not need to affect the other current dialogue states. Action-plus-text acceptance means that the digital person needs to respond to the user with an action and also make a verbal acceptance reply; this kind of acceptance has some effect on the current conversation flow, but also makes the experience feel more intelligent. For example, if the user is detected to show a displeased emotion during the conversation, the virtual character needs to interrupt the current dialogue state, actively ask what the user is not satisfied with, and at the same time give a comforting action, as in the sketch below.
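An illustrative mapping from an acceptance trigger to one of the two acceptance modes, based on the two examples above (a mid-conversation wave, and a detected displeased emotion). The trigger names, reply strings, and action labels are assumptions for the sketch; in the real system this decision is made by the multi-modal duplex state management module.

```python
from enum import Enum, auto

class AcceptMode(Enum):
    ACTION_ONLY = auto()       # respond with an action, keep the current speech untouched
    ACTION_AND_TEXT = auto()   # respond with an action plus a verbal acceptance reply

def choose_accept_mode(trigger: str) -> tuple:
    """Illustrative mapping from an acceptance trigger to an acceptance mode."""
    if trigger == "wave":
        # A greeting mid-conversation: mirror the wave, do not disturb the dialogue.
        return AcceptMode.ACTION_ONLY, {"action": "wave"}
    if trigger == "displeased":
        # Displeasure: interrupt the current speech, ask what is wrong, add a comforting action.
        return AcceptMode.ACTION_AND_TEXT, {
            "action": "comfort",
            "text": "Is there something you are not satisfied with?",
            "interrupt_current_speech": True,
        }
    return AcceptMode.ACTION_ONLY, {"action": "nod"}   # default light acknowledgement
```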
Additionally, the avatar interaction control system may also provide a duplex active/passive break process. Specifically, the step of generating, based on the virtual character interaction policy, the avatar of the virtual character including the action interaction policy by using the three-dimensional rendering model to drive the virtual character to perform multi-modal interaction includes:
Pausing the current multi-modal interaction of the virtual character in the case that the user intention data and/or the user gesture data contains interrupt intention data; determining interrupt acceptance interaction data corresponding to the virtual character based on the interrupt intention data; and generating, based on the interrupt acceptance interaction data, the avatar of the virtual character containing the action interaction policy by using the three-dimensional rendering model, so as to drive the virtual character to continue the multi-modal interaction.
Interrupt intention data may be understood as data indicating that the user explicitly refuses to continue communicating with the avatar. For example, the user makes a "stop talking" gesture, or explicitly says something like "let's pause the conversation".
Interrupt acceptance interaction data can be understood as the acceptance text or acceptance action data with which the virtual character responds once it determines that the user has an interruption intention.
In practical application, under the virtual character interaction policy, if it is determined from the user intention data and/or the user gesture data that the user has an interruption intention, the virtual character's current interaction text or interaction action can be paused, and the corresponding interrupt acceptance interaction data is determined according to the interruption intention. The three-dimensional rendering model is then used to generate the avatar of the virtual character containing the action interaction policy, so as to drive the virtual character to continue the multi-modal interaction according to the interrupt acceptance interaction data. For example, when the digital person finds that the user has an interruption intention, it may actively break off the current speech. The interruption intention may be displayed explicitly by the user, such as a negative expression or negative emotion while the digital person is speaking, or it may be implicit, such as the user suddenly disappearing from view or no longer being in a state to communicate. Under this policy, the digital person breaks off its current speaking state and waits for the user to speak, or actively asks the other party for the reason for the interruption, as in the sketch below.
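A brief sketch of the interrupt branch: stop the current speech, then either actively ask for the reason or wait. The `tts_player` and `avatar` handles and the intent labels are hypothetical, drawn from the examples above (explicit pause request, negative emotion, the user leaving the frame).

```python
def handle_interrupt(intent: str, tts_player, avatar):
    """Illustrative interrupt handling for the duplex active/passive break state."""
    tts_player.stop()                      # break off the current speaking state immediately
    if intent in ("pause_request", "negative_emotion"):
        avatar.say("Is there something you are not satisfied with?")   # actively ask why
        avatar.play_action("comfort")
    elif intent == "user_left":
        avatar.wait_for_user()             # stay silent and wait for the user to return or speak
```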
Finally, the virtual character interaction control system can also provide an output rendering function, in which the audio data stream and the video data stream determined for the virtual character's interaction are fused and pushed out. Specifically, the step of generating, based on the virtual character interaction policy, the avatar of the virtual character containing the action interaction policy by using the three-dimensional rendering model to drive the virtual character to perform multi-modal interaction includes:
Determining an audio data stream for the virtual character's text interaction based on the text interaction policy; determining a video data stream for the virtual character's action interaction based on the action interaction policy; fusing the audio data stream and the video data stream and rendering the virtual character's multi-modal interaction data stream; and generating, based on the multi-modal interaction data stream, the avatar of the virtual character containing the action interaction policy by using the three-dimensional rendering model, so as to drive the virtual character to perform multi-modal interaction.
In practical applications, the output rendering of the virtual character interaction control system synthesizes and pushes out a video stream comprising three parts. 1) The streaming TTS part, which synthesizes the system's text output into an audio stream. 2) The driving part, which comprises two sub-modules, a face driving module and an action driving module: the face driving module drives the digital person to output an accurate mouth shape according to the voice stream, and the action driving module drives the digital person to output accurate actions according to the action labels output by the system. 3) The rendering and synthesis part, which renders and synthesizes the outputs of the driving part, the TTS part, and other modules into the digital person's video stream, as in the sketch below.
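A minimal sketch of the three-part output rendering flow described above. The module handles (`tts`, `face_driver`, `action_driver`, `compositor`) are hypothetical, since the text names the parts but not their APIs.

```python
def render_output(reply_text, action_label, tts, face_driver, action_driver, compositor):
    """Illustrative output-rendering pipeline: streaming TTS -> driving -> composition."""
    audio_stream = tts.stream(reply_text)                # 1) text -> streaming audio
    mouth_frames = face_driver.drive(audio_stream)       # 2a) audio -> accurate mouth shapes
    body_frames = action_driver.drive(action_label)      # 2b) action label -> body motion
    # 3) render and synthesize everything into the digital person's video stream
    return compositor.compose(audio_stream, mouth_frames, body_frames)
```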
In summary, by adding a video stream and a corresponding visual understanding module, the multi-modal interaction method provided by the embodiments of this specification can perceive not only the user's facial expressions but also the user's actions. In addition, new visual processing modules can be added in a similar way, so that the virtual character perceives more multi-modal information, such as environment information. In the embodiments of this specification, the system can perceive in real time five facial expressions of the user, namely anger, displeasure, neutrality, happiness, and surprise, and can perceive in real time three types of actions: gestures with clear meaning (such as OK, numbers, or left/right swipes), unsafe gestures (such as a raised middle finger or little finger), and custom special actions.
In addition, the method changes the conversation from an exclusive one-question-one-answer form to a non-exclusive form that can be accepted or interrupted at any time by adding the multi-modal control module and the multi-modal duplex state management module. 1) The multi-modal control module divides the dialogue into smaller decision units, so that a complete user question is no longer the trigger condition for a reply, and the dialogue can be accepted or interrupted at any time, even while the user is speaking. The voice stream is sliced with a VAD time of 200 ms, which matches the roughly 200 ms breathing interval in human speech. The video stream adopts a detection-triggered strategy: when a specified action, expression, or target object is detected, the duplex state decision is made. 2) The multi-modal duplex state management module is the core of the solution: it maintains the current duplex dialogue state and decides the current reply strategy, where the duplex strategy comprises four states, namely duplex active acceptance, duplex active/passive interruption, invoking the basic dialogue system or business logic, and no feedback. By deciding among these four states, the system achieves the ability to accept, interrupt, and answer basic questions at any time. 3) The present solution breaks the dialogue into smaller units and uses these units as the granularity for the digital person's decisions and replies, so the dialogue is no longer an exclusive question-and-answer exchange. Even before the user has finished expressing themselves, the system processes the user's input and computes the reply; when the user has finished, the system does not need to compute from scratch and directly plays the prepared response, greatly shortening the interaction delay. In terms of perceived experience, the dialogue delay of the system can be reduced from about 1.5 seconds to about 800 ms.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating a processing procedure of a multi-modal interaction method according to an embodiment of the present disclosure.
The embodiment of fig. 4 can be divided into a multi-modal control system-input, a multi-modal duplex state management system, a basic dialog system, and a multi-modal control system-output, which can be understood as the application of the multi-modal interaction method to the 4 subsystems of the avatar interaction control system.
In practice, the user's video stream and voice stream enter through the multimodal control system (input). For the video stream, the coarse target-emotion detection recall module and the coarse target-gesture detection recall module are applied first, followed by emotion classification and gesture classification, and finally the emotion recognition result and the gesture recognition result are input into the multi-modal data & alignment module. For the voice stream, it is first sliced, then converted to text through ASR, and finally input into the multi-modal data & alignment module. Further, the multi-modal data & alignment module combines the speech recognition result with the emotion and gesture recognition results from the video, determines the target user intention and target action data, and inputs them into the multi-modal duplex state decision module in the multi-modal duplex state management system.
Further, the multi-modal duplex state decision system in fig. 4 makes duplex policy decisions to determine one of two acceptance modes: action-plus-text acceptance, or action-only acceptance. In action-plus-text acceptance, the process is split into two branches according to whether the acceptance occurs within the sentence or at the end of the sentence: for end-of-sentence acceptance, the acceptance text decision and the acceptance action decision are determined according to intention recognition, and for in-sentence acceptance, the acceptance text decision and the acceptance action decision can likewise be determined. In action-only acceptance, the specific acceptance action is determined. Finally, the virtual character's acceptance policy is input to the multimodal control system (output) to determine the streaming video stream and the streaming audio stream.
It should be noted that the multi-modal duplex state decision system also includes multi-modal interruption intention judgment, which can be combined with the business to realize a specific interrupt-acceptance function.
Further, the multimodal control system (output) can determine face driving data and action driving data according to the avatar's streaming video stream and streaming audio stream, so as to complete the rendering and streaming-media merging process of the avatar and output the digital person's video stream.
In addition, in the multimodal control system (output), the basic dialogue system can also provide basic dialogue data for the avatar's interaction according to the avatar's streaming video stream and streaming audio stream, matching basic business logic and actions to jointly complete the generation of the digital person's video stream.
In summary, the multi-modal interaction method provided by the embodiments of this specification achieves multi-modal perception, multi-modal duplexing, and a short interaction delay. For multi-modal perception, the system provided in the embodiments of this specification can perceive both the user's voice and video information. Compared with a traditional dialogue system based only on the voice stream, this solution can not only process the user's voice information but also recognize and detect the user's emotions and actions, which greatly improves the intelligence of the digital person's perception. For multi-modal duplexing, the embodiments of this specification provide an interaction system that can accept instantly and be interrupted at any time. Compared with a traditional single-turn, one-question-one-answer dialogue system, this system can give the user feedback and replies in real time while the user is speaking, such as a brief verbal acknowledgement. In addition, when the user is not in a state to respond or shows a clear intention to break off the conversation, the current dialogue flow can be interrupted at any time. This duplex interaction system improves the fluency of interaction and therefore gives the user a better interaction experience. As for the short interaction delay, even before the user has finished expressing themselves, the system processes the user's input in a streaming manner and computes the reply; when the user has finished, the system does not need to compute from scratch and directly plays the acceptance response, which greatly shortens the interaction delay. In terms of perceived experience, the dialogue delay of the system can be reduced from about 1.5 seconds to about 800 ms.
Corresponding to the above method embodiments, the present disclosure further provides a multi-modal interaction apparatus embodiment, and fig. 5 shows a schematic structural diagram of a multi-modal interaction apparatus according to one embodiment of the present disclosure. As shown in fig. 5, the apparatus is applied to a virtual character interaction control system and includes:
A data receiving module 502 configured to receive multimodal data, wherein the multimodal data comprises voice data and video data; a data identifying module 504 configured to identify the multimodal data to obtain user intention data and/or user gesture data, wherein the user gesture data comprises user emotion data and user action data; a policy determining module 506 configured to determine a virtual character interaction policy based on the user intention data and/or the user gesture data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy; a rendering model obtaining module 508 configured to obtain a three-dimensional rendering model of the virtual character; and an interaction driving module 510 configured to generate, based on the virtual character interaction policy and by using the three-dimensional rendering model, an avatar of the virtual character comprising the action interaction policy, so as to drive the virtual character to perform multi-modal interaction.
Optionally, the data recognition module 504 is further configured to perform text conversion on the voice data in the multimodal data, recognize the converted text data to obtain user intention data, and/or perform emotion recognition on the video data and/or the voice data in the multimodal data to obtain user emotion data, perform gesture recognition on the video data in the multimodal data to obtain user action data, and determine user gesture data based on the user emotion data and the user action data.
Optionally, the data recognition module 504 is further configured to perform emotion detection on the video data in the multimodal data, and classify the target emotion in the video data to obtain user emotion data if it is detected that the target emotion is included in the video data.
Optionally, the data recognition module 504 is further configured to perform gesture detection on the video data in the multi-mode data, and classify the target gesture in the video data to obtain user action data when detecting that the target gesture is included in the video data.
Optionally, the policy determination module 506 is further configured to perform fusion processing on video data in the multimodal data based on the user intention data and/or the user gesture data, determine a target intention text and/or a target gesture action of the user, and determine a virtual character interaction policy based on the target intention text and/or the target gesture action.
Optionally, the policy determination module 506 is further configured to determine a text interaction policy for the virtual character based on the target intent text and/or to determine an action interaction policy for the virtual character based on the target gestural action.
Optionally, the interaction driving module 510 is further configured to determine a text accepting position of the virtual character text interaction based on the text interaction policy, where the text accepting position is an accepting position corresponding to the voice data, determine an action accepting position of the virtual character action interaction based on the action interaction policy, where the action accepting position is an accepting position corresponding to the video data, and generate an avatar of the virtual character including the action interaction policy by using the three-dimensional rendering model based on the text accepting position and/or the action accepting position to drive the virtual character to perform multi-modal interaction.
Optionally, the interaction driving module 510 is further configured to pause the current multi-modal interaction of the avatar under the condition that the user has the interrupt intention data in the user intention data and/or the user gesture data of the avatar interaction policy, determine interrupt acceptance interaction data corresponding to the avatar based on the interrupt intention data, and generate the avatar of the avatar containing the action interaction policy by using the three-dimensional rendering model based on the interrupt acceptance interaction data, so as to drive the avatar to continue the multi-modal interaction.
Optionally, the device further comprises a video stream output module, wherein the video stream output module is configured to call pre-stored basic dialogue data based on the user intention data and/or the user gesture data, the basic dialogue data comprises basic voice data and/or basic action data, and render an output video stream of the virtual character based on the basic dialogue data and drive the virtual character to display the output video stream.
Optionally, the interaction driving module 510 is further configured to determine an audio data stream of the virtual character's text interaction based on the text interaction policy, to determine a video data stream of the virtual character's action interaction based on the action interaction policy, to perform fusion processing on the audio data stream and the video data stream so as to render a multi-modal interaction data stream of the virtual character, and to generate, based on the multi-modal interaction data stream, an avatar of the virtual character containing the action interaction policy by using the three-dimensional rendering model, so as to drive the virtual character to perform multi-modal interaction.
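The fusion of the audio data stream and the video data stream can be illustrated, under assumed frame-rate and sample-rate parameters, by pairing each video frame with the audio samples covering the same time slice; the figures below are placeholders, not values prescribed by the embodiment.

```python
def fuse_streams(audio_samples: list, video_frames: list,
                 sample_rate: int = 16000, fps: int = 25) -> list:
    # Number of audio samples that correspond to one video frame.
    samples_per_frame = sample_rate // fps
    fused = []
    for i, frame in enumerate(video_frames):
        start = i * samples_per_frame
        chunk = audio_samples[start:start + samples_per_frame]
        # Each fused item pairs one video frame with its audio slice,
        # forming one element of the multimodal interaction data stream.
        fused.append({"video": frame, "audio": chunk})
    return fused
```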
According to the multi-modal interaction device provided by the embodiments of the present specification, the voice data and the video data of the user are received and subjected to intention recognition and gesture recognition, so as to determine the communication intention of the user and/or the corresponding user gesture; a specific interaction policy between the virtual character and the user is then determined according to that communication intention and/or gesture, and the virtual character is driven to complete the interaction process with the user according to the determined interaction policy.
The foregoing is a schematic description of the multi-modal interaction device of this embodiment. It should be noted that the technical solution of the multi-modal interaction device and the technical solution of the multi-modal interaction method belong to the same concept; for details of the technical solution of the multi-modal interaction device that are not described in detail here, reference may be made to the description of the technical solution of the multi-modal interaction method.
Fig. 6 illustrates a block diagram of a computing device 600 provided in accordance with one embodiment of the present description. The components of computing device 600 include, but are not limited to, a memory 610 and a processor 620. The processor 620 is coupled to the memory 610 via a bus 630, and a database 650 is used to store data.
Computing device 600 also includes an access device 640 that enables computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 640 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 6 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.
The processor 620 is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the multimodal interaction method described above.
The foregoing is a schematic description of a computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the multi-modal interaction method described above belong to the same concept; for details of the technical solution of the computing device that are not described in detail, reference may be made to the description of the technical solution of the multi-modal interaction method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the multi-modal interaction method described above.
The above is an exemplary description of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the multi-modal interaction method described above belong to the same concept; for details of the technical solution of the storage medium that are not described in detail, reference may be made to the description of the technical solution of the multi-modal interaction method.
An embodiment of the present disclosure further provides a computer program, where the computer program, when executed in a computer, causes the computer to perform the steps of the multi-modal interaction method described above.
The above is an exemplary description of a computer program of this embodiment. It should be noted that the technical solution of the computer program and the technical solution of the multi-modal interaction method described above belong to the same concept; for details of the technical solution of the computer program that are not described in detail, reference may be made to the description of the technical solution of the multi-modal interaction method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate in accordance with the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions, but those skilled in the art should understand that the embodiments are not limited by the order of the actions described, since some steps may be performed in other orders or simultaneously according to the embodiments of the present disclosure. Furthermore, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the embodiments described in the specification.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.