
CN116301388B - Man-machine interaction scene system for intelligent multi-mode combined application - Google Patents

Man-machine interaction scene system for intelligent multi-mode combined application

Info

Publication number
CN116301388B
CN116301388B (application number CN202310524033.5A)
Authority
CN
China
Prior art keywords
decision
user
mode
probability
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310524033.5A
Other languages
Chinese (zh)
Other versions
CN116301388A (en)
Inventor
张卫平
米小武
吴茜
王丹
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Numerical Technology Co ltd
Original Assignee
Global Digital Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Digital Group Co Ltd filed Critical Global Digital Group Co Ltd
Priority to CN202310524033.5A priority Critical patent/CN116301388B/en
Publication of CN116301388A publication Critical patent/CN116301388A/en
Application granted granted Critical
Publication of CN116301388B publication Critical patent/CN116301388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • G06V40/197Matching; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Ophthalmology & Optometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a man-machine interaction scene system for intelligent multi-mode combined application, which comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module. The data acquisition module acquires multi-modal data of a user; the feature extraction module extracts features of each modality from the multi-modal data and predicts, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module analyzes the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module completes man-machine interaction according to the final decision instruction. By collecting and analyzing voice, gesture and eye movement information, the invention can make more accurate decisions and achieve better understanding, overcoming the limitations of a single modality.

Description

Man-machine interaction scene system for intelligent multi-mode combined application
Technical Field
The invention relates to the field of multi-mode human-computer interaction, in particular to a human-computer interaction scene system for intelligent multi-mode combined application.
Background
In recent years, with the rapid development of computer vision, natural language processing, acoustic signal processing and related fields, multi-modal human-computer interaction technology has increasingly become a research hotspot and application focus; a human-computer interaction system with combined multi-modal application effectively combines multiple input modalities such as voice, image and gesture, achieves more flexible, intelligent and natural human-computer interaction, and can greatly improve interaction efficiency.
Referring to related published technical solutions: the prior art CN114020153A discloses a multi-modal human-computer interaction method and device, the method comprising: acquiring interactive text information from a user; predicting a transition phrase according to the interactive text information; acquiring corresponding multi-modal content according to the transition phrase, taking the multi-modal content as first reply content, and pushing the first reply content to a virtual-person client; and generating corresponding multi-modal content according to the reply text information of the interactive text information, taking it as second reply content, and pushing the second reply content to the virtual-person client; in that invention, transition phrases are inserted before the formal reply content and the reply text information is processed in segments, turning a single-round reply into a multi-round reply, improving the response speed of the virtual person and achieving a smooth human-computer interaction experience. Another typical prior art, publication number CN111554279A, discloses a Kinect-based multi-modal human-computer interaction system comprising the following steps: constructing a data acquisition system capable of receiving multi-modal data acquired by a Kinect; training monophone acoustic and language models to obtain a speech recognition module; creating a lip-movement dataset for machine-learning training using the acquired colour-image data; training a lip-reading recognition model on the lip-movement dataset with a convolutional-neural-network training method based on a residual neural network; and integrating the data acquisition system, the speech recognition model and the lip-reading recognition model into a multi-modal human-computer interaction system, which enhances the robustness of speech recognition. In the first solution both modalities are textual content, so interactivity with the user is not high enough; in the second solution, multi-modal recognition at the decision layer is completed only through a single confidence comparison, so adaptability and accuracy are low.
Disclosure of Invention
The invention aims to provide a man-machine interaction scene system for intelligent multi-mode combined application aiming at the defects existing at present.
The invention adopts the following technical scheme:
a man-machine interaction scene system for intelligent multi-mode combined application comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting features of each modality from the multi-modal data and predicting, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module, wherein the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the decision fusion module comprises all decisions acquired under each mode, and all decisions are sharedBars, the set of individual decisions is +.>The decision fusion module generates the following probability matrix according to decisions and decision probabilities acquired under each mode:
wherein,,middle->Decision making in the voice modality of the corresponding user>Probability of->Representing decision making in the speech modality of the user>Probability of->Middle->Corresponding to the gesture mode of the user +.>Probability of->Representing decision making in gesture mode of user>Probability of->Middle->Decision making in the corresponding user eye movement mode>Probability of->Decision making in the eye movement modality of the representative user>Is>I in (2) satisfies->
Further, the decision fusion module generates a weight matrix according to the probability matrix, and the weight matrix is obtained as follows:
computing a decision made by a user on each modalityIs a mean probability of (2):
wherein,,decision making for the user on modalities +.>Average probability of +.>For the user at->Decision making ∈>Probability of->Representing the speech modality->Representing gesture modality->Represents an eye movement modality; and (3) calculating:
wherein,,representing the user at->Decision making ∈>The larger the distance between the probabilities of (2) and the average probability, the smaller the correlation between the decision made in the mode and the average decision;
according toGenerating a weight matrix, and giving weight to the decision in each mode, wherein the weight matrix has the following formula:
wherein,,middle->Decision making in the voice modality of the corresponding user>Weight of->Representing decision making in the speech modality of the user>Weight of->Middle->Gesture modality of corresponding user->Decision making->Weight of->Representing decision making in gesture mode of user>Weight of->Middle->Eye movement modality of the corresponding user->Decision making->Weight of->Decision making in the eye movement modality of the representative user>Is for all +.>I in (2) satisfies->
Further, for each weight in the weight matrixThe method comprises the following steps:
further, the decision fusion module multiplies the probability matrix and the weight matrix to generate a final decision matrix, and extracts a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction.
The beneficial effects obtained by the invention are as follows:
the invention collects the voice, gesture and eye movement information of the user through the data acquisition module; extracting characteristics of voice, gesture and eye movement information of a user through a characteristic extraction module, and predicting decision and decision probability under each mode according to each characteristic; the decision fusion module is used for constructing a probability matrix for decisions and decision probabilities in all modes, and providing weights for the probability matrix according to the average probability of the same decision made in all modes, and the weighted probability matrix is used as a final decision judgment matrix, so that the final decision of judgment comprehensively considers multi-mode information, and the accuracy is higher.
Drawings
The invention will be further understood from the following description taken in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a schematic diagram of the overall module of the present invention.
FIG. 2 is a schematic diagram of a decision making and probability obtaining process for each mode according to the present invention.
Fig. 3 is a schematic diagram of the interaction of the present invention in a banking scenario.
The meaning of the reference numerals in the figures: 1-camera, 2-interactive interface, 3-microphone.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments thereof; it should be understood that the detailed description and specific examples are intended only to illustrate the invention and are not intended to limit it; other systems, methods and/or features of the present embodiments will become apparent to those skilled in the art upon examination of the following detailed description; it is intended that all such additional systems, methods, features and advantages be included within this description, fall within the scope of the invention and be protected by the accompanying claims; additional features of the disclosed embodiments are described in, and will be apparent from, the following detailed description.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that terms indicating an orientation or positional relationship, such as "upper", "lower", "left" and "right", are based on the orientation or positional relationship shown in the drawings; they are used only for convenience of describing the invention and simplifying the description, and do not indicate or imply that the device or component referred to must have a specific orientation or be constructed and operated in a specific orientation; such terms are illustrative only and are not to be construed as limiting the patent, and their specific meanings can be understood by those skilled in the art according to the specific circumstances.
Embodiment one: as shown in fig. 1 and fig. 2, the present embodiment provides a human-computer interaction scene system for intelligent multi-mode combined application, which includes a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting features of each modality from the multi-modal data and predicting, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module, wherein the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the decision fusion module comprises all decisions acquired under each mode, and all decisions are sharedBars, the set of individual decisions is +.>The decision fusion module generates the following probability matrix according to decisions and decision probabilities acquired under each mode:
wherein,,middle->Decision making in the voice modality of the corresponding user>Probability of->Representing decision making in the speech modality of the user>Probability of->Middle->Corresponding to the gesture mode of the user +.>Probability of->Representing decision making in gesture mode of user>Probability of->Middle->Decision making in the corresponding user eye movement mode>Probability of->Decision making in the eye movement modality of the representative user>Is>I in (2) satisfies->
Further, the decision fusion module generates a weight matrix according to the probability matrix, and the weight matrix is obtained as follows:
computing a decision made by a user on each modalityIs a mean probability of (2):
wherein,,decision making for the user on modalities +.>Average probability of +.>For the user at->Decision making ∈>Probability of->Representing the speech modality->Representing gesture modality->Represents an eye movement modality; and (3) calculating:
wherein,,representing the user at->Decision making ∈>The larger the distance between the probabilities of (2) and the average probability, the smaller the correlation between the decision made in the mode and the average decision;
according toGenerating a weight matrix, and giving weight to the decision in each mode, wherein the weight matrix has the following formula:
wherein,,middle->Decision making in the voice modality of the corresponding user>Weight of->Representing decision making in the speech modality of the user>Weight of->Middle->Gesture modality of corresponding user->Decision making->Weight of->Representing a user's handDecision making ∈under the potential modality>Weight of->Middle->Eye movement modality of the corresponding user->Decision making->Weight of->Decision making in the eye movement modality of the representative user>Is for all +.>I in (2) satisfies->
Further, for each weight in the weight matrixThe method comprises the following steps:
further, the decision fusion module multiplies the probability matrix and the weight matrix to generate a final decision matrix, and extracts a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction.
The embodiment collects voice, gesture and eye movement information of a user through a data acquisition module; extracting characteristics of voice, gesture and eye movement information of a user through a characteristic extraction module, and predicting decision and decision probability under each mode according to each characteristic; acquiring a final decision instruction of a user through a decision fusion module; the combined use of a plurality of interaction modes can more conveniently and efficiently complete the interaction task, so that a user does not need to rely on a single interaction mode, a more natural and visual interaction mode is provided for the user, the interaction experience of the user is smoother and more comfortable, the defects of the single interaction mode, such as low accuracy of voice recognition, poor reliability of gesture recognition and the like, are overcome, and the reliability and adaptability of interaction are improved; the decision fusion module is used for constructing a probability matrix for decisions and decision probabilities in all modes, and providing weights for the probability matrix according to the average probability of the same decision made in all modes, and the weighted probability matrix is used as a final decision judgment matrix, so that the final decision of judgment comprehensively considers multi-mode information, and the accuracy is higher.
Embodiment two: this embodiment should be understood to include at least all of the features of any one of the foregoing embodiments, and be further modified based thereon;
the embodiment provides a man-machine interaction scene system for intelligent multi-mode combined application, which comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting features of each modality from the multi-modal data and predicting, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the system further comprises man-machine interaction equipment, wherein the man-machine interaction equipment is provided with a microphone and an infrared camera, the voice acquisition module acquires voice information of a user through the microphone, and the gesture acquisition module and the eye movement acquisition module acquire gesture action information and eye movement information of the user through the infrared camera;
the man-machine interaction equipment is also provided with an interaction interface and an interaction audio output device, and the interaction module completes interaction between the man and the machine through the interaction interface and the interaction audio output device;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module,
the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the voice feature extraction module extracts voice features and obtains a decision of a voice mode and a decision probability of the voice mode as follows:
s101: performing voice preprocessing on the acquired voice information of the user, wherein the voice preprocessing operation comprises noise removal and voice signal enhancement;
s102: extracting voice characteristics from the preprocessed voice information, wherein in the embodiment, the MFCC technology is used for extracting the voice characteristics;
s103: inputting the voice characteristics into a pre-trained voice recognition model, and outputting a voice mode decision and a voice mode decision probability;
for the speech recognition model in step S103, it includes:
input layer: for receiving input speech features;
an intermediate layer: the system comprises a plurality of circulating neural network units, a plurality of control units and a plurality of control units, wherein the circulating neural network units are used for modeling input voice characteristics;
output layer: mapping the output of the middle layer to a tag sequence, wherein the tag sequence is probability distribution of the identification result; the activation function of the output layer is a common classification function, such as a softmax function;
a decoder: the method is used for decoding the tag sequence to obtain a recognition result, and a decoding algorithm adopts a known common decoding algorithm such as a greedy algorithm or a beam search algorithm; thereby obtaining the decision and the probability of the decision of the voice mode;
the gesture feature extraction module extracts gesture features and acquires a gesture mode decision and a gesture mode decision probability mode as follows:
s201: performing gesture preprocessing on the obtained gesture action information of the user, wherein the gesture preprocessing operation comprises denoising, binarizing and graying processing on the gesture action information;
s202: separating the hand parts in the preprocessed gesture motion information by an image segmentation technology;
s203: the extraction and selection of hand features of the hand part are completed through a CNN hand feature extraction model, wherein the CNN hand feature extraction model is a hand feature extraction model established by a technician in advance by using CNN based on experimental data;
s204: inputting the hand characteristics selected in the step S203 into a pre-trained gesture decision tree classifier, and outputting a decision of a gesture mode and a gesture mode decision probability;
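A minimal sketch of steps S203-S204 follows, assuming scikit-learn's DecisionTreeClassifier for the gesture decision tree classifier; the CNN hand feature extraction model is represented by a placeholder returning a synthetic feature vector, and the training data, feature dimension and tree depth are illustrative only. The eye movement decision tree classifier of steps S302-S303 can follow the same pattern on CNN eye movement features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def extract_hand_features(hand_image):
    """Placeholder for the pre-trained CNN hand feature extraction model of
    step S203; here it only returns a synthetic 64-dimensional vector."""
    return rng.random(64)

# Train the gesture decision tree classifier on labelled CNN feature vectors
# (synthetic data here; in the system these come from experimental data).
X_train = rng.random((200, 64))
y_train = rng.integers(0, 3, size=200)            # three illustrative gesture decisions
gesture_clf = DecisionTreeClassifier(max_depth=8).fit(X_train, y_train)

# Step S204: decision of the gesture modality and its decision probability.
features = extract_hand_features(None).reshape(1, -1)
probs = gesture_clf.predict_proba(features)[0]
gesture_decision = gesture_clf.classes_[probs.argmax()]
gesture_probability = probs.max()
print(gesture_decision, gesture_probability)
```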
the eye movement feature extraction module extracts eye movement features and acquires the decision of an eye movement mode and the decision probability of the eye movement mode as follows:
s301: performing eye movement pretreatment on the obtained eye movement information of the user, wherein the eye movement pretreatment operation comprises noise removal and eye movement error correction;
s302: the extraction and selection of the eye movement characteristics are completed through a CNN eye movement characteristic extraction model, wherein the CNN eye movement characteristic extraction model is an eye movement characteristic extraction model established by a technician in advance by using CNN based on experimental data;
s303: inputting the eye movement characteristics selected in the step S302 into a pre-trained eye movement decision tree classifier, and outputting the decision of the eye movement mode and the decision probability of the eye movement mode;
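The eye movement preprocessing of step S301 could, for instance, be realised as below; the median filter for noise removal and the constant-offset model for eye movement error correction are assumptions for illustration, as the patent does not specify the exact methods.

```python
import numpy as np
from scipy.signal import medfilt

def preprocess_gaze(gaze_xy, calibration_offset=(0.0, 0.0), kernel=5):
    """Step S301 sketch: median-filter the gaze coordinates to remove noise and
    subtract a constant calibration offset as a simple eye movement error
    correction. Both choices are assumptions for illustration."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)               # (n_samples, 2) gaze points
    filtered = np.column_stack([medfilt(gaze_xy[:, 0], kernel),
                                medfilt(gaze_xy[:, 1], kernel)])
    return filtered - np.asarray(calibration_offset)

# Illustrative gaze track with a small constant offset.
gaze = np.cumsum(np.random.randn(100, 2) * 0.01, axis=0) + np.array([0.02, -0.01])
clean = preprocess_gaze(gaze, calibration_offset=(0.02, -0.01))
```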
the decision fusion module obtains a final decision instruction by weighting the decisions and the decision probabilities of the modes obtained in the feature extraction module, and the specific implementation mode is as follows:
Suppose there are $n$ decisions in total in the system and the set of decisions is $D=\{d_1,d_2,\dots,d_n\}$; the following probability matrix is generated from the decisions and decision probabilities acquired in each modality:

$$P=\begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ p_{31} & p_{32} & \cdots & p_{3n} \end{bmatrix}$$

where $p_{1i}$ is the probability of the user making decision $d_i$ in the voice modality, $p_{2i}$ is the probability of the user making decision $d_i$ in the gesture modality, and $p_{3i}$ is the probability of the user making decision $d_i$ in the eye movement modality, for all $i$ satisfying $1\le i\le n$.
Compute the average probability of each decision $d_i$ made by the user over the modalities:

$$\bar{p}_i=\frac{1}{3}\sum_{k=1}^{3}p_{ki}$$

where $\bar{p}_i$ is the average probability of the user making decision $d_i$ over the modalities, $p_{ki}$ is the probability of the user making decision $d_i$ in the $k$-th modality, $k=1$ denotes the voice modality, $k=2$ the gesture modality and $k=3$ the eye movement modality; then compute:

$$s_{ki}=\left|p_{ki}-\bar{p}_i\right|$$

where $s_{ki}$ is the distance between the probability of the user making decision $d_i$ in the $k$-th modality and the average probability; the larger the distance, the smaller the correlation between the decision made in that modality and the average decision.

A weight matrix is generated from $s_{ki}$, assigning a weight to the decision in each modality:

$$W=\begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ w_{31} & w_{32} & \cdots & w_{3n} \end{bmatrix}$$

where $w_{1i}$ is the weight of the user's decision $d_i$ in the voice modality, $w_{2i}$ is the weight in the gesture modality and $w_{3i}$ is the weight in the eye movement modality, for all $i$ satisfying $1\le i\le n$.

Each weight $w_{ki}$ in the weight matrix is computed from the corresponding distance $s_{ki}$, a larger distance (weaker correlation with the average decision) yielding a smaller weight.
multiplying the probability matrix by the weight matrix to generate a final decision matrix, and extracting a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction;
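As a brief worked example with invented numbers: suppose decision $d_1$ has probabilities $p_{11}=0.7$ in the voice modality, $p_{21}=0.6$ in the gesture modality and $p_{31}=0.65$ in the eye movement modality. Then $\bar{p}_1=(0.7+0.6+0.65)/3=0.65$ and the distances are $s_{11}=0.05$, $s_{21}=0.05$ and $s_{31}=0$; the eye movement entry is closest to the average and therefore receives the largest of the three weights when the weight matrix is formed, so the eye movement modality contributes most to the fused value for $d_1$.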
the interaction module receives a final decision instruction of the decision fusion module and completes interaction between human and machine according to the final decision instruction;
When the final decision instructions received by the interaction module number two or more and contradict each other, the interactive audio output device in the interaction module interacts with the user to obtain more accurate modal information; for example, when the two final decision instructions received by the interaction module are respectively forward and backward, the interaction module asks the user through the interactive audio output device: "Please confirm whether the next instruction is forward or backward", and generates a new final decision instruction for interaction according to the user's subsequent modal information;
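The conflict-handling behaviour just described can be sketched as follows; the contradiction table and the `ask_user` prompt are illustrative stand-ins for the interactive audio output device.

```python
def resolve_conflicts(final_instructions, ask_user):
    """Sketch of the conflict handling described above: if the fused result
    contains contradictory instructions, query the user through the audio
    output and use the follow-up answer instead."""
    # Illustrative table of mutually contradictory instructions.
    contradictory = [frozenset({"forward", "backward"}),
                     frozenset({"open", "close"})]
    received = set(final_instructions)
    if len(received) >= 2 and any(pair <= received for pair in contradictory):
        answer = ask_user("Please confirm whether the next instruction is "
                          + " or ".join(sorted(received)))
        return [answer]
    return list(final_instructions)

# Usage: with ask_user=input, the system asks the user to disambiguate.
# resolve_conflicts(["forward", "backward"], ask_user=input)
```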
The system can be applied to various interaction scenarios, such as the home, a hospital or a bank; for example, the system can serve as an intelligent bank service counter, acquiring the user's voice information, gesture action information and eye movement information through the camera and microphone, obtaining the user's decision instruction through the feature extraction module and the decision fusion module, and making the corresponding interaction through the interaction module according to the user's decision instruction, as shown in fig. 3; in fig. 3, the user issues the voice message "I need to handle XXX service", the system obtains the user's decision instruction and displays "Handling XXX service requires the XX/XXX procedure; would you like to handle it now?", completing the man-machine interaction.
In the embodiment, the decision and decision probability generated by voice information is identified and output through establishing a voice identification model, the decision and decision probability generated by gesture information is output through establishing a CNN hand feature extraction model and a gesture decision tree classifier, and the decision and decision probability generated by eye movement information is output through establishing a CNN eye movement feature extraction model and an eye movement decision tree classifier, so that the basis of the fusion of all modes is obtained; the final decision matrix is obtained by weighting the probability matrix, so that a final decision instruction is obtained, the user interaction information fused by the system is wider, and the decision accuracy of the decision is higher.
The foregoing disclosure is only a preferred embodiment of the present invention and is not intended to limit the scope of the invention, so that all equivalent technical changes made by applying the description of the present invention and the accompanying drawings are included in the scope of the present invention, and in addition, elements in the present invention can be updated as the technology develops.

Claims (1)

1. A man-machine interaction scene system for intelligent multi-mode combined application, comprising a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;

the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting features of each modality from the multi-modal data and predicting, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;

the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module; the voice acquisition module is used for acquiring voice information of the user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;

the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module; the voice feature extraction module is used for extracting voice features and predicting, from the voice features, the user's decision in the voice modality and the decision probability of the voice modality; the gesture feature extraction module is used for extracting gesture features and predicting, from the gesture features, the user's decision in the gesture modality and the decision probability of the gesture modality; the eye movement feature extraction module is used for extracting eye movement features and predicting, from the eye movement features, the user's decision in the eye movement modality and the decision probability of the eye movement modality;

the decision fusion module contains all decisions acquired under each modality; suppose there are $n$ decisions in total and the set of decisions is $D=\{d_1,d_2,\dots,d_n\}$; the decision fusion module generates the following probability matrix from the decisions and decision probabilities acquired under each modality:

$$P=\begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ p_{31} & p_{32} & \cdots & p_{3n} \end{bmatrix}$$

where $p_{1i}$ is the probability of the user making decision $d_i$ in the voice modality, $p_{2i}$ is the probability of the user making decision $d_i$ in the gesture modality, and $p_{3i}$ is the probability of the user making decision $d_i$ in the eye movement modality, for all $i$ satisfying $1\le i\le n$;

the decision fusion module generates a weight matrix from the probability matrix, the weight matrix being obtained as follows:

compute the average probability of each decision $d_i$ made by the user over the modalities:

$$\bar{p}_i=\frac{1}{3}\sum_{k=1}^{3}p_{ki}$$

where $\bar{p}_i$ is the average probability of the user making decision $d_i$ over the modalities, $p_{ki}$ is the probability of the user making decision $d_i$ in the $k$-th modality, $k=1$ denotes the voice modality, $k=2$ the gesture modality and $k=3$ the eye movement modality; then compute:

$$s_{ki}=\left|p_{ki}-\bar{p}_i\right|$$

where $s_{ki}$ is the distance between the probability of the user making decision $d_i$ in the $k$-th modality and the average probability, a larger distance indicating a smaller correlation between the decision made in that modality and the average decision;

a weight matrix is generated from $s_{ki}$, assigning a weight to the decision in each modality:

$$W=\begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ w_{31} & w_{32} & \cdots & w_{3n} \end{bmatrix}$$

where $w_{1i}$ is the weight of the user's decision $d_i$ in the voice modality, $w_{2i}$ is the weight in the gesture modality and $w_{3i}$ is the weight in the eye movement modality, for all $i$ satisfying $1\le i\le n$;

each weight $w_{ki}$ in the weight matrix is computed from the corresponding distance $s_{ki}$, a larger distance yielding a smaller weight;

the decision fusion module multiplies the probability matrix by the weight matrix to generate a final decision matrix, and extracts the decision corresponding to a value in the final decision matrix that is greater than a threshold as the final decision instruction;

the interaction module receives the final decision instruction of the decision fusion module and completes the interaction between man and machine according to the final decision instruction;

when the final decision instructions received by the interaction module number two or more and contradict each other, the interactive audio output device in the interaction module interacts with the user to obtain more accurate modal information.
CN202310524033.5A 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application Active CN116301388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310524033.5A CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310524033.5A CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Publications (2)

Publication Number Publication Date
CN116301388A (en) 2023-06-23
CN116301388B (en) 2023-08-01

Family

ID=86789013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310524033.5A Active CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Country Status (1)

Country Link
CN (1) CN116301388B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843381A (en) * 2016-03-18 2016-08-10 北京光年无限科技有限公司 Data processing method for realizing multi-modal interaction and multi-modal interaction system
CN111460494A (en) * 2020-03-24 2020-07-28 广州大学 Privacy protection method and system for multimodal deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018195099A1 (en) * 2017-04-19 2018-10-25 Magic Leap, Inc. Multimodal task execution and text editing for a wearable system
CN108983636B (en) * 2018-06-20 2020-07-17 浙江大学 Human-machine intelligent symbiosis platform system
CN111722713A (en) * 2020-06-12 2020-09-29 天津大学 Multimodal fusion gesture keyboard input method, device, system and storage medium
CN114154549A (en) * 2021-08-30 2022-03-08 华北电力大学 A fault diagnosis method for gas turbine actuator based on multi-feature fusion


Also Published As

Publication number Publication date
CN116301388A (en) 2023-06-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 518063 No. 01-03, floor 17, block B, building 10, Shenzhen Bay science and technology ecological park, No. 10, Gaoxin South ninth Road, Yuehai street, Nanshan District, Shenzhen, Guangdong

Patentee after: Global Numerical Technology Co.,Ltd.

Country or region after: China

Address before: No. 01-03, 17th Floor, Building B, Shenzhen Bay Science and Technology Ecological Park, No. 10 Gaoxin South 9th Road, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Global Digital Group Co.,Ltd.

Country or region before: China
