Intelligent assistant interaction method, device and system based on multimodal data fusion
Technical Field
The invention relates to the technical fields of artificial intelligence, human-machine interaction and the Internet of Things, and in particular to an intelligent assistant interaction method, device and system based on multimodal data fusion.
Background
With the development of artificial intelligence, the prior art can achieve intelligent behavior within a single modality in specific scenarios. For example:
1. Smart home control based on speech recognition operates devices such as lighting and temperature through voice commands, but lacks the ability to understand complex tasks;
2. Security monitoring systems based on visual analysis can detect specific scenes or events, but fall short in multimodal information fusion and interactivity;
3. Current task execution systems typically rely on preset logic, which makes it difficult to adjust tasks dynamically according to real-time conditions and user intent.
However, these techniques have the following problems in practical applications:
1. Lack of multimodal fusion capability: current systems cannot deeply fuse multimodal information such as speech, images and text, which leads to inaccurate understanding.
2. Single-track task processing: the system lacks task decomposition and collaboration capabilities, making complex multi-task scenarios difficult to handle.
3. Unnatural interaction: the user must interact with the system in a prescribed way, and the system cannot adapt to the user's behavior or the state of the environment.
The intelligent assistant interaction method, device and system based on multimodal data fusion are therefore designed to overcome these problems.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an intelligent assistant interaction method, device and system based on multimodal data fusion, which comprehensively process natural language, visual, speech and environmental data for task collaboration and human-machine interaction across multiple scenarios.
The invention is realized by the following technical scheme. The intelligent assistant interaction method based on multimodal data fusion comprises the following steps:
1) Instruction input: the user issues an instruction, and the system collects data through multimodal sensing;
2) Semantic fusion: the system fuses the collected multimodal data to generate a task semantic representation;
3) Task decomposition and dispatch: the system decomposes the task into subtasks according to the task type and distributes them to the relevant devices or services;
4) Interactive feedback: the system reports the execution result to the user through voice broadcast or a display screen;
5) Learning and optimization: high-quality conclusions from the interactive feedback are saved and learned from, so that instructions covered by this learning are executed more accurately.
Preferably, the instruction issued by the user in step 1) comprises a voice instruction, a visual instruction, a text instruction and an environment-aware instruction, wherein the voice and visual instructions are recognized, the text instruction is parsed, and the environment-aware instruction is fused after being collected from the environment.
Preferably, task generation and execution in step 3) proceeds as follows:
A. The task is sent simultaneously to the execution device and to interactive feedback;
B. The execution device analyzes and decomposes the task upon receipt;
C. The task devices begin scheduling and executing the decomposed subtasks, and send a signal to interactive feedback once execution completes;
D. Upon receiving the task-completion information, interactive feedback checks it against the preset task information: if no preset exists, the latest processing result is fed back directly; if a preset exists, a comparison is made. If they match, the matching result is fed back; if they do not match, a final result is set manually, and once entered it becomes the new preset.
The specific learning and optimization method in step 5) is that the result from step 4) is saved: if the result is consistent with the preset, saving is skipped; if it is inconsistent, or if there is no preset, the final result is saved directly, and this operation and learning are repeated.
An intelligent assistant interaction device based on multimodal data fusion, the device comprising:
(1) A multimodal sensing module, comprising a voice sensing unit that collects the user's voice input and transcribes it into text, a visual sensing unit that collects environment images through a camera and detects user gestures, expressions or objects, a text processing unit that parses text instructions entered by the user, and an environment sensing unit that collects environmental data.
(2) A multimodal fusion module that, based on a deep-learning fusion model, uniformly encodes the voice, visual and text data and generates a multimodal semantic representation by combining contextual information.
(3) A task processing and execution module, comprising a task analysis unit that generates a task plan from the semantic representation, a task decomposition unit that decomposes complex tasks into subtasks, a task scheduling unit invoked after decomposition to dynamically adjust the task execution order, and a device interface unit that finally controls the intelligent devices through an Internet of Things (IoT) protocol.
(4) An interactive feedback module that reports the task execution state and result through voice broadcast, on-screen display or a mobile terminal.
(5) A learning and optimization module, comprising:
a dynamic knowledge base that stores user preferences, historical interaction records and environmental states;
a reinforcement learning unit that optimizes the task allocation strategy based on feedback.
The assistant interaction system based on multimodal data fusion is used on intelligent devices; using the above method or device, the system turns devices on and off by sensing voice instructions, illumination instructions, text instructions and visual instructions.
Preferably, the intelligent devices are applied to smart home, smart office and medical assistance scenarios.
The beneficial effects of the invention are as follows:
Compared with existing intelligent assistant technology, the intelligent assistant interaction system based on multimodal data fusion provided by the invention has these beneficial effects:
1. By combining multiple modalities, the system supports voice, visual and text input modes, improving the user interaction experience.
2. Dynamic task decomposition and scheduling optimize resource utilization and improve task execution efficiency.
3. The system is applicable to a variety of scenarios such as smart home, office automation and medical assistance.
4. Through reinforcement learning and the knowledge base, the system continuously optimizes its performance according to user behavior and environmental changes.
Drawings
FIG. 1 is a system framework diagram of the present invention.
Detailed Description
The present invention will be further described below with reference to the drawings and embodiments, so that those skilled in the art may more clearly understand its objects, technical solutions and advantages.
In the description of the present invention, it should be understood that terms such as "upper," "lower," "left," "right," "inner," "outer," "transverse," and "vertical" indicate orientations or positional relationships based on those shown in the drawings; they are used merely for convenience of description and do not indicate or imply that the apparatus or element referred to must have a specific orientation, and thus should not be construed as limiting the invention.
The invention will now be described in detail with reference to the accompanying drawings. As shown in FIG. 1, the intelligent assistant interaction method based on multimodal data fusion comprises the following steps:
1) Instruction input: the user issues an instruction, and the system collects data through multimodal sensing;
2) Semantic fusion: the system fuses the collected multimodal data to generate a task semantic representation;
3) Task decomposition and dispatch: the system decomposes the task into subtasks according to the task type and distributes them to the relevant devices or services;
4) Interactive feedback: the system reports the execution result to the user through voice broadcast or a display screen;
5) Learning and optimization: high-quality conclusions from the interactive feedback are saved and learned from, so that instructions covered by this learning are executed more accurately.
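The five steps above can be sketched as a minimal pipeline. All names here (`MultimodalInput`, `fuse`, `decompose`, and so on) are illustrative assumptions rather than part of the invention; a real implementation would use speech recognition, vision models and IoT dispatch in place of the toy string handling below.

```python
from dataclasses import dataclass, field

@dataclass
class MultimodalInput:
    """Step 1: raw data collected by multimodal sensing (all fields optional)."""
    speech: str = ""
    vision: str = ""
    text: str = ""
    environment: dict = field(default_factory=dict)

def fuse(inp: MultimodalInput) -> str:
    """Step 2: merge the modalities into a single task semantic representation."""
    parts = [p for p in (inp.speech, inp.vision, inp.text) if p]
    return " + ".join(parts)

def decompose(task: str) -> list[str]:
    """Step 3: split the fused task into subtasks."""
    return [t.strip() for t in task.split("+") if t.strip()]

def execute_and_feedback(subtasks: list[str]) -> list[str]:
    """Steps 3-4: dispatch each subtask and collect a feedback message."""
    return [f"done: {t}" for t in subtasks]

inp = MultimodalInput(speech="turn on the light", text="play music")
results = execute_and_feedback(decompose(fuse(inp)))
```

Step 5 (learning and optimization) would then persist `results` that the user confirms, which is sketched separately under the learning module below in this description.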
The instruction issued by the user in step 1) comprises a voice instruction, a visual instruction, a text instruction and an environment-aware instruction, wherein the voice and visual instructions are recognized, the text instruction is parsed, and the environment-aware instruction is fused after being collected from the environment.
Task generation and execution in step 3) proceeds as follows:
A. The task is sent simultaneously to the execution device and to interactive feedback;
B. The execution device analyzes and decomposes the task upon receipt;
C. The task devices begin scheduling and executing the decomposed subtasks, and send a signal to interactive feedback once execution completes;
D. Upon receiving the task-completion information, interactive feedback checks it against the preset task information: if no preset exists, the latest processing result is fed back directly; if a preset exists, a comparison is made. If they match, the matching result is fed back; if they do not match, a final result is set manually, and once entered it becomes the new preset.
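Step D amounts to a small check-and-update routine. The sketch below is one hypothetical reading of that logic; the dictionary of presets and the `manual_result` argument are assumptions introduced for illustration only.

```python
def feedback_check(task_id, result, presets, manual_result=None):
    """Compare a completed task's result against its preset (step D).

    presets is a mutable dict mapping task_id to the expected result;
    manual_result stands in for the manually set final result."""
    if task_id not in presets:
        return result                 # no preset: feed back the latest result
    if result == presets[task_id]:
        return result                 # matches the preset: feed it back
    # mismatch: the manually set final result becomes the new preset
    final = manual_result if manual_result is not None else result
    presets[task_id] = final
    return final

presets = {"lights": "on"}
no_preset = feedback_check("music", "playing", presets)
corrected = feedback_check("lights", "off", presets, manual_result="dimmed")
```

After the mismatch on `"lights"`, the manually entered value `"dimmed"` replaces the old preset, so the next comparison uses it.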
The specific learning and optimization method in step 5) is that the result from step 4) is saved: if the result is consistent with the preset, saving is skipped; if it is inconsistent, or if there is no preset, the final result is saved directly, and this operation and learning are repeated.
An intelligent assistant interaction device based on multimodal data fusion, the device comprising:
(1) A multimodal sensing module, comprising a voice sensing unit that collects the user's voice input and transcribes it into text, a visual sensing unit that collects environment images through a camera and detects user gestures, expressions or objects, a text processing unit that parses text instructions entered by the user, and an environment sensing unit that collects environmental data.
(2) A multimodal fusion module that, based on a deep-learning fusion model, uniformly encodes the voice, visual and text data and generates a multimodal semantic representation by combining contextual information.
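As a toy illustration of unified encoding, the sketch below hashes tokens from each modality into a shared fixed-size vector and sums them. This stands in for the deep-learning fusion model, whose architecture the description does not specify; the token lists and dimension are hypothetical.

```python
def encode(tokens, dim=8):
    """Hash a bag of tokens into a fixed-size feature vector (toy encoder)."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[hash(tok) % dim] += 1.0
    return vec

def fuse_modalities(speech, vision, text, context=(), dim=8):
    """Sum per-modality encodings into one multimodal semantic vector,
    folding in the context tokens as a fourth input."""
    fused = [0.0] * dim
    for tokens in (speech, vision, text, list(context)):
        fused = [a + b for a, b in zip(fused, encode(tokens, dim))]
    return fused

vec = fuse_modalities(["turn", "on"], ["lamp"], ["light"], context=["evening"])
```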
(3) A task processing and execution module, comprising a task analysis unit that generates a task plan from the semantic representation, a task decomposition unit that decomposes complex tasks into subtasks, a task scheduling unit invoked after decomposition to dynamically adjust the task execution order, and a device interface unit that finally controls the intelligent devices through an Internet of Things (IoT) protocol.
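The task scheduling unit can be sketched as a priority queue over subtasks. The priorities below are assumptions for illustration, since the description states only that the execution order is adjusted dynamically; dynamism here would mean pushing new subtasks onto the heap mid-run.

```python
import heapq

def schedule(subtasks):
    """Run subtasks in priority order (lower number first).

    Each subtask is a (priority, name) pair; pushing further pairs onto
    the heap while it drains is what makes the ordering dynamic."""
    heap = list(subtasks)
    heapq.heapify(heap)
    order = []
    while heap:
        _priority, name = heapq.heappop(heap)
        order.append(name)
    return order

order = schedule([(2, "play music"), (1, "turn on light"), (3, "cool room")])
```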
(4) An interactive feedback module that reports the task execution state and result through voice broadcast, on-screen display or a mobile terminal.
(5) A learning and optimization module, comprising:
a dynamic knowledge base that stores user preferences, historical interaction records and environmental states;
a reinforcement learning unit that optimizes the task allocation strategy based on feedback.
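The reinforcement learning unit is not specified in detail; one minimal reading is a running value estimate per device, nudged toward each feedback reward, with allocation going to the highest-valued device. The learning rate, device names and reward values below are illustrative assumptions.

```python
def update_value(values, device, reward, lr=0.1):
    """Move the device's value estimate one step toward the observed reward."""
    old = values.get(device, 0.0)
    values[device] = old + lr * (reward - old)

def allocate(values):
    """Pick the device with the highest current value estimate."""
    return max(values, key=values.get)

values = {}
update_value(values, "speaker_a", 1.0)   # positive user feedback
update_value(values, "speaker_b", 0.2)   # lukewarm feedback
chosen = allocate(values)
```

Repeated feedback cycles would shift the estimates, so allocation adapts to user behavior over time, which is the continuous optimization the beneficial effects claim.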
The assistant interaction system based on multimodal data fusion is used on intelligent devices; using the above method or device, the system turns devices on and off by sensing voice instructions, illumination instructions, text instructions and visual instructions.
The intelligent devices are applied to smart home, smart office and medical assistance scenarios.
Example 1: Smart home scenario
User need: the user issues a voice command to turn on the light, play music and lower the room temperature.
System operation: the system senses the user's voice command and detects the ambient illumination in the living room. The task is decomposed into three subtasks: adjusting the light, starting the stereo to play music, and starting the air conditioner to cool the room. The system controls the light, stereo and air conditioning devices respectively via IoT protocols. After the task completes, the system gives voice feedback that the light is on, music is playing, and the air conditioner has been set to 22 °C.
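The dispatch in this example can be sketched as routing each subtask to a device handler and collecting the feedback messages. The handlers below are stand-ins for real IoT protocol calls (for example over MQTT), and all device names and commands are hypothetical.

```python
def dispatch(subtasks, devices):
    """Send each (device, command) subtask to its handler; collect feedback."""
    feedback = []
    for device, command in subtasks:
        handler = devices.get(device)
        feedback.append(handler(command) if handler else f"{device}: unavailable")
    return feedback

# Toy handlers standing in for light, stereo and air-conditioner interfaces.
devices = {
    "light": lambda cmd: f"light {cmd}",
    "stereo": lambda cmd: f"playing {cmd}",
    "ac": lambda cmd: f"temperature set to {cmd}",
}
msgs = dispatch([("light", "on"), ("stereo", "music"), ("ac", "22 C")], devices)
```

The collected messages would then be handed to the interactive feedback module for voice broadcast.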
Example 2: Smart office scenario
User need: the user requests projection of the conference schedule through a text instruction and asks for a notification to be sent.
System operation: the text processing unit parses the user's instruction, and the vision module detects the projector state.
The task processing module generates three subtasks: start the projector, display the calendar, and send the notification by email. The system completes the schedule projection and email delivery through APIs, then gives voice feedback that the conference schedule has been projected and the notification sent.
Example 3: Medical assistance scenario
User need: a medical staff member requests a patient's real-time health data through a gesture instruction.
System operation: the vision module recognizes the "health data" gesture and queries the patient health database. The system generates a real-time health report from the query results and presents it on a display screen. The voice module feeds back that the health data has been updated and displayed.
Those skilled in the art may make modifications and variations to the above-described embodiments without departing from the spirit and scope of the invention. It is therefore intended that all equivalent modifications and changes made by those skilled in the art without departing from the spirit and technical principles of the present invention shall be covered by the appended claims.