
CN116301388B - Man-machine interaction scene system for intelligent multi-mode combined application - Google Patents

Man-machine interaction scene system for intelligent multi-mode combined application

Info

Publication number
CN116301388B
CN116301388B (application number CN202310524033.5A)
Authority
CN
China
Prior art keywords
decision
user
mode
probability
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310524033.5A
Other languages
Chinese (zh)
Other versions
CN116301388A (en)
Inventor
张卫平
米小武
吴茜
王丹
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Numerical Technology Co ltd
Original Assignee
Global Digital Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Digital Group Co Ltd filed Critical Global Digital Group Co Ltd
Priority to CN202310524033.5A priority Critical patent/CN116301388B/en
Publication of CN116301388A publication Critical patent/CN116301388A/en
Application granted granted Critical
Publication of CN116301388B publication Critical patent/CN116301388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • G06V40/197Matching; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Ophthalmology & Optometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a man-machine interaction scene system for intelligent multi-mode combined application, which comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module. The data acquisition module acquires multi-modal data of a user; the feature extraction module extracts features of each modality from the multi-modal data and predicts, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module analyzes the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module completes man-machine interaction according to the final decision instruction. By collecting and analyzing voice, gesture and eye movement information, the invention can make more accurate decisions and achieve better understanding, overcoming the limitations of a single modality.

Description

Man-machine interaction scene system for intelligent multi-mode combined application
Technical Field
The invention relates to the field of multi-mode human-computer interaction, in particular to a human-computer interaction scene system for intelligent multi-mode combined application.
Background
In recent years, with the rapid development of computer vision, natural language processing, acoustic signal processing and related fields, multi-modal human-computer interaction technology has increasingly become a research hotspot and application focus; a human-computer interaction system with combined multi-modal application effectively combines multiple input modalities such as voice, image and gesture, achieves more flexible, intelligent and natural human-computer interaction, and can greatly improve interaction efficiency.
Referring to related published technical solutions: the prior art CN114020153A discloses a multi-modal human-computer interaction method and device, the method comprising: acquiring interactive text information from a user; predicting a transition phrase according to the interactive text information; acquiring corresponding multi-modal content according to the transition phrase, taking the multi-modal content as first reply content, and pushing the first reply content to a virtual-person client; and generating corresponding multi-modal content according to the reply text information of the interactive text information, taking it as second reply content, and pushing the second reply content to the virtual-person client; in that invention, transition phrases are inserted before the formal reply content and the reply text information is processed in segments, turning a single-round reply into a multi-round reply, improving the response speed of the virtual person and achieving a smooth human-computer interaction experience. Another typical prior art, publication number CN111554279A, discloses a Kinect-based multi-modal human-computer interaction system comprising the following steps: constructing a data acquisition system capable of receiving multi-modal data acquired by a Kinect; training monophone acoustic and language models to obtain a speech recognition module; creating a lip-movement dataset for machine-learning training using the acquired colour-image data; training a lip-reading recognition model on the lip-movement dataset with a convolutional-neural-network training method based on a residual neural network; and integrating the data acquisition system, the speech recognition model and the lip-reading recognition model into a multi-modal human-computer interaction system, which enhances the robustness of speech recognition. In the first solution both modalities are textual content, so interactivity with the user is not high enough; in the second solution, multi-modal recognition at the decision layer is completed only through a single confidence comparison, so adaptability and accuracy are low.
Disclosure of Invention
The invention aims to provide a man-machine interaction scene system for intelligent multi-mode combined application aiming at the defects existing at present.
The invention adopts the following technical scheme:
a man-machine interaction scene system for intelligent multi-mode combined application comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting features of each modality from the multi-modal data and predicting, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module, wherein the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the decision fusion module comprises all decisions acquired under each mode, and all decisions are sharedBars, the set of individual decisions is +.>The decision fusion module generates the following probability matrix according to decisions and decision probabilities acquired under each mode:
wherein,,middle->Decision making in the voice modality of the corresponding user>Probability of->Representing decision making in the speech modality of the user>Probability of->Middle->Corresponding to the gesture mode of the user +.>Probability of->Representing decision making in gesture mode of user>Probability of->Middle->Decision making in the corresponding user eye movement mode>Probability of->Decision making in the eye movement modality of the representative user>Is>I in (2) satisfies->
Further, the decision fusion module generates a weight matrix according to the probability matrix, and the weight matrix is obtained as follows:
computing a decision made by a user on each modalityIs a mean probability of (2):
wherein,,decision making for the user on modalities +.>Average probability of +.>For the user at->Decision making ∈>Probability of->Representing the speech modality->Representing gesture modality->Represents an eye movement modality; and (3) calculating:
wherein,,representing the user at->Decision making ∈>The larger the distance between the probabilities of (2) and the average probability, the smaller the correlation between the decision made in the mode and the average decision;
according toGenerating a weight matrix, and giving weight to the decision in each mode, wherein the weight matrix has the following formula:
wherein,,middle->Decision making in the voice modality of the corresponding user>Weight of->Representing decision making in the speech modality of the user>Weight of->Middle->Gesture modality of corresponding user->Decision making->Weight of->Representing decision making in gesture mode of user>Weight of->Middle->Eye movement modality of the corresponding user->Decision making->Weight of->Decision making in the eye movement modality of the representative user>Is for all +.>I in (2) satisfies->
Further, for each weight in the weight matrixThe method comprises the following steps:
further, the decision fusion module multiplies the probability matrix and the weight matrix to generate a final decision matrix, and extracts a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction.
The beneficial effects obtained by the invention are as follows:
the invention collects the voice, gesture and eye movement information of the user through the data acquisition module; extracting characteristics of voice, gesture and eye movement information of a user through a characteristic extraction module, and predicting decision and decision probability under each mode according to each characteristic; the decision fusion module is used for constructing a probability matrix for decisions and decision probabilities in all modes, and providing weights for the probability matrix according to the average probability of the same decision made in all modes, and the weighted probability matrix is used as a final decision judgment matrix, so that the final decision of judgment comprehensively considers multi-mode information, and the accuracy is higher.
Drawings
The invention will be further understood from the following description taken in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a schematic diagram of the overall module of the present invention.
FIG. 2 is a schematic diagram of a decision making and probability obtaining process for each mode according to the present invention.
Fig. 3 is a schematic diagram of the interaction of the present invention in a banking scenario.
The meaning of the reference numerals in the figures: 1-camera, 2-interactive interface, 3-microphone.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments thereof; it should be understood that the detailed description and specific examples are intended only to illustrate the invention and are not intended to limit it; other systems, methods and/or features of the present embodiments will become apparent to those skilled in the art upon examination of the following detailed description; it is intended that all such additional systems, methods, features and advantages be included within this description, fall within the scope of the invention and be protected by the accompanying claims; additional features of the disclosed embodiments are described in, and will be apparent from, the following detailed description.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that terms indicating an orientation or positional relationship, such as "upper", "lower", "left" and "right", are based on the orientation or positional relationship shown in the drawings; they are used only for convenience of describing the invention and simplifying the description, and do not indicate or imply that the device or component referred to must have a specific orientation or be constructed and operated in a specific orientation; such terms are illustrative only and are not to be construed as limiting the patent, and their specific meanings can be understood by those skilled in the art according to the specific circumstances.
Embodiment one: as shown in fig. 1 and fig. 2, the present embodiment provides a human-computer interaction scene system for intelligent multi-mode combined application, which includes a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting features of each modality from the multi-modal data and predicting, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module, wherein the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the decision fusion module comprises all decisions acquired under each mode, and all decisions are sharedBars, the set of individual decisions is +.>The decision fusion module generates the following probability matrix according to decisions and decision probabilities acquired under each mode:
wherein,,middle->Decision making in the voice modality of the corresponding user>Probability of->Representing decision making in the speech modality of the user>Probability of->Middle->Corresponding to the gesture mode of the user +.>Probability of->Representing decision making in gesture mode of user>Probability of->Middle->Decision making in the corresponding user eye movement mode>Probability of->Decision making in the eye movement modality of the representative user>Is>I in (2) satisfies->
Further, the decision fusion module generates a weight matrix according to the probability matrix, and the weight matrix is obtained as follows:
computing a decision made by a user on each modalityIs a mean probability of (2):
wherein,,decision making for the user on modalities +.>Average probability of +.>For the user at->Decision making ∈>Probability of->Representing the speech modality->Representing gesture modality->Represents an eye movement modality; and (3) calculating:
wherein,,representing the user at->Decision making ∈>The larger the distance between the probabilities of (2) and the average probability, the smaller the correlation between the decision made in the mode and the average decision;
according toGenerating a weight matrix, and giving weight to the decision in each mode, wherein the weight matrix has the following formula:
wherein,,middle->Decision making in the voice modality of the corresponding user>Weight of->Representing decision making in the speech modality of the user>Weight of->Middle->Gesture modality of corresponding user->Decision making->Weight of->Representing a user's handDecision making ∈under the potential modality>Weight of->Middle->Eye movement modality of the corresponding user->Decision making->Weight of->Decision making in the eye movement modality of the representative user>Is for all +.>I in (2) satisfies->
Further, for each weight in the weight matrixThe method comprises the following steps:
further, the decision fusion module multiplies the probability matrix and the weight matrix to generate a final decision matrix, and extracts a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction.
The embodiment collects voice, gesture and eye movement information of a user through a data acquisition module; extracting characteristics of voice, gesture and eye movement information of a user through a characteristic extraction module, and predicting decision and decision probability under each mode according to each characteristic; acquiring a final decision instruction of a user through a decision fusion module; the combined use of a plurality of interaction modes can more conveniently and efficiently complete the interaction task, so that a user does not need to rely on a single interaction mode, a more natural and visual interaction mode is provided for the user, the interaction experience of the user is smoother and more comfortable, the defects of the single interaction mode, such as low accuracy of voice recognition, poor reliability of gesture recognition and the like, are overcome, and the reliability and adaptability of interaction are improved; the decision fusion module is used for constructing a probability matrix for decisions and decision probabilities in all modes, and providing weights for the probability matrix according to the average probability of the same decision made in all modes, and the weighted probability matrix is used as a final decision judgment matrix, so that the final decision of judgment comprehensively considers multi-mode information, and the accuracy is higher.
Embodiment two: this embodiment should be understood to include at least all of the features of any one of the foregoing embodiments, and be further modified based thereon;
the embodiment provides a man-machine interaction scene system for intelligent multi-mode combined application, which comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting features of each modality from the multi-modal data and predicting, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the system further comprises man-machine interaction equipment, wherein the man-machine interaction equipment is provided with a microphone and an infrared camera, the voice acquisition module acquires voice information of a user through the microphone, and the gesture acquisition module and the eye movement acquisition module acquire gesture action information and eye movement information of the user through the infrared camera;
the man-machine interaction equipment is also provided with an interaction interface and an interaction audio output device, and the interaction module completes interaction between the man and the machine through the interaction interface and the interaction audio output device;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module,
the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the voice feature extraction module extracts voice features and obtains a decision of a voice mode and a decision probability of the voice mode as follows:
s101: performing voice preprocessing on the acquired voice information of the user, wherein the voice preprocessing operation comprises noise removal and voice signal enhancement;
s102: extracting voice characteristics from the preprocessed voice information, wherein in the embodiment, the MFCC technology is used for extracting the voice characteristics;
s103: inputting the voice characteristics into a pre-trained voice recognition model, and outputting a voice mode decision and a voice mode decision probability;
for the speech recognition model in step S103, it includes:
input layer: for receiving input speech features;
an intermediate layer: the system comprises a plurality of circulating neural network units, a plurality of control units and a plurality of control units, wherein the circulating neural network units are used for modeling input voice characteristics;
output layer: mapping the output of the middle layer to a tag sequence, wherein the tag sequence is probability distribution of the identification result; the activation function of the output layer is a common classification function, such as a softmax function;
a decoder: the method is used for decoding the tag sequence to obtain a recognition result, and a decoding algorithm adopts a known common decoding algorithm such as a greedy algorithm or a beam search algorithm; thereby obtaining the decision and the probability of the decision of the voice mode;
the gesture feature extraction module extracts gesture features and acquires a gesture mode decision and a gesture mode decision probability mode as follows:
s201: performing gesture preprocessing on the obtained gesture action information of the user, wherein the gesture preprocessing operation comprises denoising, binarizing and graying processing on the gesture action information;
s202: separating the hand parts in the preprocessed gesture motion information by an image segmentation technology;
s203: the extraction and selection of hand features of the hand part are completed through a CNN hand feature extraction model, wherein the CNN hand feature extraction model is a hand feature extraction model established by a technician in advance by using CNN based on experimental data;
s204: inputting the hand characteristics selected in the step S203 into a pre-trained gesture decision tree classifier, and outputting a decision of a gesture mode and a gesture mode decision probability;
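A minimal sketch of steps S203-S204 follows, assuming scikit-learn's DecisionTreeClassifier for the gesture decision tree classifier; the CNN hand feature extraction model is represented by a placeholder returning a synthetic feature vector, and the training data, feature dimension and tree depth are illustrative only. The eye movement decision tree classifier of steps S302-S303 can follow the same pattern on CNN eye movement features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def extract_hand_features(hand_image):
    """Placeholder for the pre-trained CNN hand feature extraction model of
    step S203; here it only returns a synthetic 64-dimensional vector."""
    return rng.random(64)

# Train the gesture decision tree classifier on labelled CNN feature vectors
# (synthetic data here; in the system these come from experimental data).
X_train = rng.random((200, 64))
y_train = rng.integers(0, 3, size=200)            # three illustrative gesture decisions
gesture_clf = DecisionTreeClassifier(max_depth=8).fit(X_train, y_train)

# Step S204: decision of the gesture modality and its decision probability.
features = extract_hand_features(None).reshape(1, -1)
probs = gesture_clf.predict_proba(features)[0]
gesture_decision = gesture_clf.classes_[probs.argmax()]
gesture_probability = probs.max()
print(gesture_decision, gesture_probability)
```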
the eye movement feature extraction module extracts eye movement features and acquires the decision of an eye movement mode and the decision probability of the eye movement mode as follows:
s301: performing eye movement pretreatment on the obtained eye movement information of the user, wherein the eye movement pretreatment operation comprises noise removal and eye movement error correction;
s302: the extraction and selection of the eye movement characteristics are completed through a CNN eye movement characteristic extraction model, wherein the CNN eye movement characteristic extraction model is an eye movement characteristic extraction model established by a technician in advance by using CNN based on experimental data;
s303: inputting the eye movement characteristics selected in the step S302 into a pre-trained eye movement decision tree classifier, and outputting the decision of the eye movement mode and the decision probability of the eye movement mode;
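The eye movement preprocessing of step S301 could, for instance, be realised as below; the median filter for noise removal and the constant-offset model for eye movement error correction are assumptions for illustration, as the patent does not specify the exact methods.

```python
import numpy as np
from scipy.signal import medfilt

def preprocess_gaze(gaze_xy, calibration_offset=(0.0, 0.0), kernel=5):
    """Step S301 sketch: median-filter the gaze coordinates to remove noise and
    subtract a constant calibration offset as a simple eye movement error
    correction. Both choices are assumptions for illustration."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)               # (n_samples, 2) gaze points
    filtered = np.column_stack([medfilt(gaze_xy[:, 0], kernel),
                                medfilt(gaze_xy[:, 1], kernel)])
    return filtered - np.asarray(calibration_offset)

# Illustrative gaze track with a small constant offset.
gaze = np.cumsum(np.random.randn(100, 2) * 0.01, axis=0) + np.array([0.02, -0.01])
clean = preprocess_gaze(gaze, calibration_offset=(0.02, -0.01))
```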
the decision fusion module obtains a final decision instruction by weighting the decisions and the decision probabilities of the modes obtained in the feature extraction module, and the specific implementation mode is as follows:
Suppose there are $n$ decisions in total in the system and the set of decisions is $D=\{d_1,d_2,\dots,d_n\}$; the following probability matrix is generated from the decisions and decision probabilities acquired in each modality:

$$P=\begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ p_{31} & p_{32} & \cdots & p_{3n} \end{bmatrix}$$

where $p_{1i}$ is the probability of the user making decision $d_i$ in the voice modality, $p_{2i}$ is the probability of the user making decision $d_i$ in the gesture modality, and $p_{3i}$ is the probability of the user making decision $d_i$ in the eye movement modality, for all $i$ satisfying $1\le i\le n$.
Compute the average probability of each decision $d_i$ made by the user over the modalities:

$$\bar{p}_i=\frac{1}{3}\sum_{k=1}^{3}p_{ki}$$

where $\bar{p}_i$ is the average probability of the user making decision $d_i$ over the modalities, $p_{ki}$ is the probability of the user making decision $d_i$ in the $k$-th modality, $k=1$ denotes the voice modality, $k=2$ the gesture modality and $k=3$ the eye movement modality; then compute:

$$s_{ki}=\left|p_{ki}-\bar{p}_i\right|$$

where $s_{ki}$ is the distance between the probability of the user making decision $d_i$ in the $k$-th modality and the average probability; the larger the distance, the smaller the correlation between the decision made in that modality and the average decision.

A weight matrix is generated from $s_{ki}$, assigning a weight to the decision in each modality:

$$W=\begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ w_{31} & w_{32} & \cdots & w_{3n} \end{bmatrix}$$

where $w_{1i}$ is the weight of the user's decision $d_i$ in the voice modality, $w_{2i}$ is the weight in the gesture modality and $w_{3i}$ is the weight in the eye movement modality, for all $i$ satisfying $1\le i\le n$.

Each weight $w_{ki}$ in the weight matrix is computed from the corresponding distance $s_{ki}$, a larger distance (weaker correlation with the average decision) yielding a smaller weight.
multiplying the probability matrix by the weight matrix to generate a final decision matrix, and extracting a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction;
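As a brief worked example with invented numbers: suppose decision $d_1$ has probabilities $p_{11}=0.7$ in the voice modality, $p_{21}=0.6$ in the gesture modality and $p_{31}=0.65$ in the eye movement modality. Then $\bar{p}_1=(0.7+0.6+0.65)/3=0.65$ and the distances are $s_{11}=0.05$, $s_{21}=0.05$ and $s_{31}=0$; the eye movement entry is closest to the average and therefore receives the largest of the three weights when the weight matrix is formed, so the eye movement modality contributes most to the fused value for $d_1$.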
the interaction module receives a final decision instruction of the decision fusion module and completes interaction between human and machine according to the final decision instruction;
When the final decision instructions received by the interaction module number two or more and contradict each other, the interactive audio output device in the interaction module interacts with the user to obtain more accurate modal information; for example, when the two final decision instructions received by the interaction module are respectively forward and backward, the interaction module asks the user through the interactive audio output device: "Please confirm whether the next instruction is forward or backward", and generates a new final decision instruction for interaction according to the user's subsequent modal information;
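The conflict-handling behaviour just described can be sketched as follows; the contradiction table and the `ask_user` prompt are illustrative stand-ins for the interactive audio output device.

```python
def resolve_conflicts(final_instructions, ask_user):
    """Sketch of the conflict handling described above: if the fused result
    contains contradictory instructions, query the user through the audio
    output and use the follow-up answer instead."""
    # Illustrative table of mutually contradictory instructions.
    contradictory = [frozenset({"forward", "backward"}),
                     frozenset({"open", "close"})]
    received = set(final_instructions)
    if len(received) >= 2 and any(pair <= received for pair in contradictory):
        answer = ask_user("Please confirm whether the next instruction is "
                          + " or ".join(sorted(received)))
        return [answer]
    return list(final_instructions)

# Usage: with ask_user=input, the system asks the user to disambiguate.
# resolve_conflicts(["forward", "backward"], ask_user=input)
```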
The system can be applied to various interaction scenarios, such as the home, a hospital or a bank; for example, the system can serve as an intelligent bank service counter, acquiring the user's voice information, gesture action information and eye movement information through the camera and microphone, obtaining the user's decision instruction through the feature extraction module and the decision fusion module, and making the corresponding interaction through the interaction module according to the user's decision instruction, as shown in fig. 3; in fig. 3, the user issues the voice message "I need to handle XXX service", the system obtains the user's decision instruction and displays "Handling XXX service requires the XX/XXX procedure; would you like to handle it now?", completing the man-machine interaction.
In the embodiment, the decision and decision probability generated by voice information is identified and output through establishing a voice identification model, the decision and decision probability generated by gesture information is output through establishing a CNN hand feature extraction model and a gesture decision tree classifier, and the decision and decision probability generated by eye movement information is output through establishing a CNN eye movement feature extraction model and an eye movement decision tree classifier, so that the basis of the fusion of all modes is obtained; the final decision matrix is obtained by weighting the probability matrix, so that a final decision instruction is obtained, the user interaction information fused by the system is wider, and the decision accuracy of the decision is higher.
The foregoing disclosure is only a preferred embodiment of the present invention and is not intended to limit the scope of the invention, so that all equivalent technical changes made by applying the description of the present invention and the accompanying drawings are included in the scope of the present invention, and in addition, elements in the present invention can be updated as the technology develops.

Claims (1)

1. A man-machine interaction scene system for intelligent multi-mode combined application, comprising a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;

the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting features of each modality from the multi-modal data and predicting, from those features, the various decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;

the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module; the voice acquisition module is used for acquiring voice information of the user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;

the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module; the voice feature extraction module is used for extracting voice features and predicting, from the voice features, the user's decision in the voice modality and the decision probability of the voice modality; the gesture feature extraction module is used for extracting gesture features and predicting, from the gesture features, the user's decision in the gesture modality and the decision probability of the gesture modality; the eye movement feature extraction module is used for extracting eye movement features and predicting, from the eye movement features, the user's decision in the eye movement modality and the decision probability of the eye movement modality;

the decision fusion module contains all decisions acquired under each modality; suppose there are $n$ decisions in total and the set of decisions is $D=\{d_1,d_2,\dots,d_n\}$; the decision fusion module generates the following probability matrix from the decisions and decision probabilities acquired under each modality:

$$P=\begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ p_{31} & p_{32} & \cdots & p_{3n} \end{bmatrix}$$

where $p_{1i}$ is the probability of the user making decision $d_i$ in the voice modality, $p_{2i}$ is the probability of the user making decision $d_i$ in the gesture modality, and $p_{3i}$ is the probability of the user making decision $d_i$ in the eye movement modality, for all $i$ satisfying $1\le i\le n$;

the decision fusion module generates a weight matrix from the probability matrix, the weight matrix being obtained as follows:

compute the average probability of each decision $d_i$ made by the user over the modalities:

$$\bar{p}_i=\frac{1}{3}\sum_{k=1}^{3}p_{ki}$$

where $\bar{p}_i$ is the average probability of the user making decision $d_i$ over the modalities, $p_{ki}$ is the probability of the user making decision $d_i$ in the $k$-th modality, $k=1$ denotes the voice modality, $k=2$ the gesture modality and $k=3$ the eye movement modality; then compute:

$$s_{ki}=\left|p_{ki}-\bar{p}_i\right|$$

where $s_{ki}$ is the distance between the probability of the user making decision $d_i$ in the $k$-th modality and the average probability, a larger distance indicating a smaller correlation between the decision made in that modality and the average decision;

a weight matrix is generated from $s_{ki}$, assigning a weight to the decision in each modality:

$$W=\begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ w_{31} & w_{32} & \cdots & w_{3n} \end{bmatrix}$$

where $w_{1i}$ is the weight of the user's decision $d_i$ in the voice modality, $w_{2i}$ is the weight in the gesture modality and $w_{3i}$ is the weight in the eye movement modality, for all $i$ satisfying $1\le i\le n$;

each weight $w_{ki}$ in the weight matrix is computed from the corresponding distance $s_{ki}$, a larger distance yielding a smaller weight;

the decision fusion module multiplies the probability matrix by the weight matrix to generate a final decision matrix, and extracts the decision corresponding to a value in the final decision matrix that is greater than a threshold as the final decision instruction;

the interaction module receives the final decision instruction of the decision fusion module and completes the interaction between man and machine according to the final decision instruction;

when the final decision instructions received by the interaction module number two or more and contradict each other, the interactive audio output device in the interaction module interacts with the user to obtain more accurate modal information.
CN202310524033.5A 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application Active CN116301388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310524033.5A CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310524033.5A CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Publications (2)

Publication Number Publication Date
CN116301388A (en) 2023-06-23
CN116301388B (en) 2023-08-01

Family

ID=86789013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310524033.5A Active CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Country Status (1)

Country Link
CN (1) CN116301388B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843381A (en) * 2016-03-18 2016-08-10 北京光年无限科技有限公司 Data processing method for realizing multi-modal interaction and multi-modal interaction system
CN111460494A (en) * 2020-03-24 2020-07-28 广州大学 Privacy protection method and system for multimodal deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018195099A1 (en) * 2017-04-19 2018-10-25 Magic Leap, Inc. Multimodal task execution and text editing for a wearable system
CN108983636B (en) * 2018-06-20 2020-07-17 浙江大学 Human-machine intelligent symbiosis platform system
CN111722713A (en) * 2020-06-12 2020-09-29 天津大学 Multimodal fusion gesture keyboard input method, device, system and storage medium
CN114154549A (en) * 2021-08-30 2022-03-08 华北电力大学 A fault diagnosis method for gas turbine actuator based on multi-feature fusion


Also Published As

Publication number Publication date
CN116301388A (en) 2023-06-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 518063 No. 01-03, floor 17, block B, building 10, Shenzhen Bay science and technology ecological park, No. 10, Gaoxin South ninth Road, Yuehai street, Nanshan District, Shenzhen, Guangdong

Patentee after: Global Numerical Technology Co.,Ltd.

Country or region after: China

Address before: No. 01-03, 17th Floor, Building B, Shenzhen Bay Science and Technology Ecological Park, No. 10 Gaoxin South 9th Road, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Global Digital Group Co.,Ltd.

Country or region before: China
