
CN119724179A - Intelligent assistant interaction method, device and system based on multimodal data fusion

Info

Publication number: CN119724179A
Application number: CN202411900514.2A
Authority: CN (China)
Prior art keywords: task, voice, execution, instruction, learning
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 周小文, 李强, 谈海生, 杜皓华
Current assignee: Deqing Alpha Innovation Research Institute
Original assignee: Deqing Alpha Innovation Research Institute
Application filed by: Deqing Alpha Innovation Research Institute
Filing date: 2024-12-23
Publication date: 2025-03-28

Abstract

The present invention provides an intelligent assistant interaction method, device, and system based on multimodal data fusion. The method includes the following steps: 1) multimodal perception: the user issues an instruction, and the system collects data through multimodal sensing; 2) multimodal fusion: the system semantically fuses the collected multimodal data to generate a task semantic representation; 3) task generation and execution: the system decomposes the task into subtasks according to the task type and assigns them to related devices or services; 4) interactive feedback: the system reports the execution results to the user through voice broadcast or a display screen; 5) learning and optimization: high-quality conclusions from the interactive feedback are saved and learned, making subsequent instruction execution more accurate. By combining multiple modalities, the system supports voice, visual, and text input, improving the user interaction experience.

Description

Intelligent assistant interaction method, device and system based on multimodal data fusion
Technical Field
The invention relates to the technical fields of artificial intelligence, human-computer interaction, and the Internet of Things, and in particular to an intelligent assistant interaction method, device, and system based on multimodal data fusion.
Background
With the development of artificial intelligence, the prior art can achieve intelligent behavior through a single modality in specific scenarios. For example:
1. Smart home control based on speech recognition operates devices such as lighting and temperature through voice commands, but lacks the ability to understand complex tasks;
2. Security monitoring systems based on visual analysis can detect specific scenes or events, but fall short in multimodal information fusion and interactivity;
3. Current task execution systems typically rely on preset logic, which makes it difficult to adjust tasks dynamically based on real-time circumstances and user intent.
However, these techniques have the following problems in practical applications:
1. Lack of multimodal fusion capability: current systems cannot deeply fuse multimodal information such as voice, images, and text, leading to inaccurate understanding.
2. Single-track task processing: the system lacks task decomposition and collaboration capabilities, making complex multi-task scenarios difficult to handle.
3. Unnatural interaction: the user must interact with the system in a fixed way, and the system cannot adapt to the user's behavior or the environmental state.
Therefore, an intelligent assistant interaction method, device, and system based on multimodal data fusion are designed to overcome these problems.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an intelligent assistant interaction method, device, and system based on multimodal data fusion, which realize comprehensive processing of natural language, visual, voice, and environmental data for task collaboration and human-computer interaction across multiple scenarios.
The invention is realized by the following technical scheme. The intelligent assistant interaction method based on multimodal data fusion comprises the following steps:
1) Multimodal perception: the user issues an instruction, and the system collects data through multimodal sensing;
2) Multimodal fusion: the system semantically fuses the collected multimodal data to generate a task semantic representation;
3) Task generation and execution: the system decomposes the task into subtasks according to the task type and assigns them to related devices or services;
4) Interactive feedback: the system reports the execution result to the user through voice broadcast or a display screen;
5) Learning and optimization: high-quality conclusions from the interactive feedback are saved and learned, so that subsequent instruction execution becomes more accurate.
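The five steps can be pictured as a simple pipeline. The following is a minimal, illustrative Python sketch; every name in it (perceive, fuse, TaskSemantics, and so on) is a hypothetical stand-in, not a component defined by the patent:

```python
from dataclasses import dataclass, field

@dataclass
class TaskSemantics:
    intent: str
    slots: dict = field(default_factory=dict)

def perceive(user_input: dict) -> dict:
    """Step 1: collect voice, vision, text, and environment data."""
    return {k: user_input.get(k) for k in ("voice", "vision", "text", "environment")}

def fuse(modalities: dict) -> TaskSemantics:
    """Step 2: fuse modalities into one task semantic representation (stubbed)."""
    intent = modalities.get("text") or modalities.get("voice") or "unknown"
    return TaskSemantics(intent=str(intent), slots={"env": modalities.get("environment")})

def execute(task: TaskSemantics) -> list:
    """Step 3: decompose into subtasks and dispatch to devices or services (stubbed)."""
    subtasks = [s.strip() for s in task.intent.split(" and ")]  # naive decomposition
    return ["done: " + s for s in subtasks]

def feed_back(results: list) -> None:
    """Step 4: report results via voice broadcast or display (stdout here)."""
    print("; ".join(results))

def learn(task: TaskSemantics, results: list, knowledge_base: list) -> None:
    """Step 5: save high-quality conclusions so later executions are more accurate."""
    knowledge_base.append((task.intent, results))

kb = []
task = fuse(perceive({"voice": "turn on the light and play music"}))
results = execute(task)
feed_back(results)
learn(task, results, kb)
```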
Preferably, the instructions issued by the user in step 1) include voice instructions, visual instructions, text instructions, and environment-aware instructions, wherein the voice and visual instructions are recognized, the text instructions are parsed, and the environment-aware data are collected by environmental sensing and then fused.
Preferably, the task generation and execution in step 3) is specifically divided into:
A. the task is sent simultaneously to the execution device and to the interactive feedback module;
B. after receiving the task, the execution device parses and decomposes it;
C. after decomposition, the task device begins task scheduling and execution, and signals the interactive feedback module once execution completes;
D. after receiving the task-completion information, the interactive feedback module checks it against the preset task information: if no preset exists, the latest processing result is fed back directly; if a preset exists, a comparison is started, and a match is fed back as the result; if the result does not match, a final result is set manually, entered, and then stored as the new preset.
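The check in step D can be sketched as follows; this is an illustrative Python fragment, and the dict-based preset store and the name feedback_check are assumptions, not elements of the patent:

```python
presets = {}  # task id -> preset result (assumed storage layout)

def feedback_check(task_id, result, manual_result=None):
    """Step D sketch: compare a completion result against the preset."""
    preset = presets.get(task_id)
    if preset is None:
        return result                 # no preset: feed back the latest result directly
    if result == preset:
        return result                 # preset matches: feed back the match
    final = manual_result if manual_result is not None else result
    presets[task_id] = final          # mismatch: manually set final result becomes the new preset
    return final
```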
The specific learning and optimization method in step 5) is as follows: the result from step 4) is saved; if the result is consistent with the preset, saving is skipped; if it is inconsistent, or there is no preset, the final result is saved directly, and the learning is run repeatedly.
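A corresponding sketch of this save rule, again with assumed storage (a presets dict and a history list) and an invented function name:

```python
def save_and_learn(task_id, result, presets, history):
    """Step 5 sketch: skip saving on a preset match, otherwise save and relearn."""
    if presets.get(task_id) == result:
        return False                  # consistent with the preset: skip saving
    history.append((task_id, result)) # inconsistent, or no preset: save directly
    presets[task_id] = result         # the saved final result serves as the new preset
    return True                       # caller reruns learning on the saved history
```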
An intelligent assistant interaction device based on multimodal data fusion, the device comprising:
(1) A multimodal sensing module, comprising a voice sensing unit that collects user voice input and transcribes it into text, a visual sensing unit that captures environment images through a camera and detects user gestures, expressions, or objects, a text processing unit that parses text instructions entered by the user, and an environment sensing unit that collects environmental data.
(2) A multimodal fusion module: a fusion model based on deep learning uniformly encodes the voice, visual, and text data and generates a multimodal semantic representation by combining context information (a sketch of this unified encoding follows the list below).
(3) A task processing and execution module, comprising a task analysis unit that generates a task plan from the semantic representation, a task decomposition unit that then breaks complex tasks into subtasks, a task scheduling unit invoked after decomposition to dynamically adjust the task execution order, and a device interface unit that finally controls smart devices through Internet of Things (IoT) protocols.
(4) An interactive feedback module that reports the task execution state and results through voice broadcast, on-screen display, or a mobile terminal.
(5) A learning and optimization module:
a dynamic knowledge base that stores user preferences, historical interaction records, and environmental states;
and a reinforcement learning unit that optimizes the task allocation strategy based on feedback.
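As referenced in module (2), unified encoding can be pictured as mapping every modality into one shared vector space. The sketch below is only illustrative: the hash-based embedding and mean-pooling are toy stand-ins for the deep-learning fusion model the patent assumes.

```python
import hashlib

DIM = 8  # toy embedding size

def embed(serialized: str) -> list:
    """Map any modality, serialized as text, to a fixed-size vector."""
    digest = hashlib.sha256(serialized.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]

def fuse_modalities(voice: str, vision: str, text: str, context: str) -> list:
    """Encode each modality, then mean-pool into one multimodal representation."""
    vectors = [embed(v) for v in (voice, vision, text, context)]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

representation = fuse_modalities(
    "turn on the light", "user points at the lamp", "", "evening, low ambient light"
)
```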
The assistant interaction system based on multimodal data fusion runs on smart devices; using the above method or device, the system opens and closes equipment by sensing voice instructions, lighting conditions, text instructions, and visual instructions.
Preferably, the smart device is used in smart home, smart office, and medical assistance scenarios.
The beneficial effects of the invention are as follows:
Compared with existing intelligent assistant technology, the intelligent assistant interaction system based on multimodal data fusion provided by the invention has these advantages:
1. By combining multiple modalities, the system supports voice, visual, and text input, improving the user interaction experience.
2. Dynamic task decomposition and scheduling optimize resource utilization and improve task execution efficiency.
3. The system is applicable to a variety of scenarios, such as smart home, office automation, and medical assistance.
4. Through reinforcement learning and the knowledge base, the system can continuously optimize its performance based on user behavior and environmental changes.
Drawings
FIG. 1 is a system framework diagram of the present invention.
Detailed Description
The present invention will be further described below with reference to the drawings and examples, so that those skilled in the art can more clearly understand its objects, technical solutions, and advantages.
In the description of the present invention, it should be understood that terms such as "upper," "lower," "left," "right," "inner," "outer," "transverse," and "vertical" indicate orientations or positional relationships based on those shown in the drawings; they are used merely for convenience of description and do not indicate or imply that the apparatus or element referred to must have a specific orientation, and thus should not be construed as limiting the invention.
As shown in FIG. 1, the intelligent assistant interaction method based on multimodal data fusion comprises the following steps:
1) Multimodal perception: the user issues an instruction, and the system collects data through multimodal sensing;
2) Multimodal fusion: the system semantically fuses the collected multimodal data to generate a task semantic representation;
3) Task generation and execution: the system decomposes the task into subtasks according to the task type and assigns them to related devices or services;
4) Interactive feedback: the system reports the execution result to the user through voice broadcast or a display screen;
5) Learning and optimization: high-quality conclusions from the interactive feedback are saved and learned, so that subsequent instruction execution becomes more accurate.
The instructions issued by the user in step 1) include voice instructions, visual instructions, text instructions, and environment-aware instructions, wherein the voice and visual instructions are recognized, the text instructions are parsed, and the environment-aware data are collected by environmental sensing and then fused.
The task generation and execution in step 3) is specifically divided into:
A. the task is sent simultaneously to the execution device and to the interactive feedback module;
B. after receiving the task, the execution device parses and decomposes it;
C. after decomposition, the task device begins task scheduling and execution, and signals the interactive feedback module once execution completes;
D. after receiving the task-completion information, the interactive feedback module checks it against the preset task information: if no preset exists, the latest processing result is fed back directly; if a preset exists, a comparison is started, and a match is fed back as the result; if the result does not match, a final result is set manually, entered, and then stored as the new preset.
The specific learning and optimization method in step 5) is as follows: the result from step 4) is saved; if the result is consistent with the preset, saving is skipped; if it is inconsistent, or there is no preset, the final result is saved directly, and the learning is run repeatedly.
An intelligent assistant interaction device based on multimodal data fusion, the device comprising:
(1) A multimodal sensing module, comprising a voice sensing unit that collects user voice input and transcribes it into text, a visual sensing unit that captures environment images through a camera and detects user gestures, expressions, or objects, a text processing unit that parses text instructions entered by the user, and an environment sensing unit that collects environmental data.
(2) A multimodal fusion module: a fusion model based on deep learning uniformly encodes the voice, visual, and text data and generates a multimodal semantic representation by combining context information.
(3) A task processing and execution module, comprising a task analysis unit that generates a task plan from the semantic representation, a task decomposition unit that then breaks complex tasks into subtasks, a task scheduling unit invoked after decomposition to dynamically adjust the task execution order, and a device interface unit that finally controls smart devices through Internet of Things (IoT) protocols.
(4) An interactive feedback module that reports the task execution state and results through voice broadcast, on-screen display, or a mobile terminal.
(5) A learning and optimization module:
a dynamic knowledge base that stores user preferences, historical interaction records, and environmental states;
and a reinforcement learning unit that optimizes the task allocation strategy based on feedback.
The assistant interaction system based on multimodal data fusion runs on smart devices; using the above method or device, the system opens and closes equipment by sensing voice instructions, lighting conditions, text instructions, and visual instructions.
The smart device is used in smart home, smart office, and medical assistance scenarios.
Example 1: Smart home scenario
User requirement: the user issues a voice command to turn on the lights, play music, and lower the room temperature.
System operation: the system senses the user's voice command and detects the ambient illumination in the living room. The task is decomposed into three subtasks: adjusting the lights, starting the stereo to play music, and turning on the air conditioner to cool the room. The system controls the lights, stereo, and air conditioner, respectively, via IoT protocols. After the task completes, the system gives voice feedback: "The lights are on, music is playing, and the air conditioner is set to 22 °C."
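The dispatch in this example might look like the following sketch. The helper send_iot_command and the device names are hypothetical stubs; a real deployment would use an actual IoT protocol stack (for example MQTT) rather than this stand-in:

```python
def send_iot_command(device, command, value=None):
    """Stub for an IoT protocol round trip (hypothetical helper)."""
    return {"device": device, "command": command, "value": value, "status": "ok"}

subtasks = [
    ("light", "turn_on", None),
    ("stereo", "play_music", None),
    ("air_conditioner", "set_temperature", 22),
]
results = [send_iot_command(*subtask) for subtask in subtasks]
if all(r["status"] == "ok" for r in results):
    print("The lights are on, music is playing, and the air conditioner is set to 22 °C.")
```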
Example 2: Smart office scenario
User requirement: the user requests, via a text command, that the conference schedule be projected and a notification be sent.
System operation: the text processing unit parses the user instruction, and the vision module detects the projector state.
The task processing module generates three subtasks: turn on the projector, display the calendar, and send a notification by email. The system completes the schedule projection and email delivery through APIs, then gives voice feedback: "Conference schedule projected, notification sent."
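A sketch of this flow with placeholder API wrappers; projector_on, fetch_schedule, and send_mail are invented names, since the actual endpoints are not specified in the patent:

```python
def projector_on():
    return True  # placeholder: turn on the projector via its API

def fetch_schedule():
    return ["09:00 stand-up", "14:00 design review"]  # placeholder calendar query

def send_mail(recipients, body):
    return True  # placeholder mail API call

if projector_on():
    schedule = fetch_schedule()
    if send_mail(["team@example.com"], "\n".join(schedule)):
        print("Conference schedule projected, notification sent.")
```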
Example 3: Medical assistance scenario
User requirement: medical staff request a patient's real-time health data through a gesture command.
System operation: the vision module recognizes the "health data" gesture and queries the patient health database. The system generates a real-time health report from the query results and presents it on a display screen. The voice module feeds back: "Health data has been updated and displayed."
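The gesture-triggered query can be sketched the same way; the gesture label, the query_health_db helper, and the report format are all assumptions for illustration:

```python
def query_health_db(patient_id):
    return {"heart_rate": 72, "spo2": 98}  # placeholder for the real database query

def on_gesture(label, patient_id):
    """A recognized gesture label triggers the health-data query and display."""
    if label == "health_data":
        report = query_health_db(patient_id)
        print("Health data has been updated and displayed:", report)

on_gesture("health_data", "patient-001")
```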
Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. It is therefore intended that all equivalent modifications and changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the appended claims.

Claims (7)

1. An intelligent assistant interaction method based on multimodal data fusion, characterized by comprising the following steps:
1) Multimodal perception: the user issues an instruction, and the system collects data through multimodal sensing;
2) Multimodal fusion: the system semantically fuses the collected multimodal data to generate a task semantic representation;
3) Task generation and execution: the system decomposes the task into subtasks according to the task type and assigns them to related devices or services;
4) Interactive feedback: the system reports the execution result to the user through voice broadcast or a display screen;
5) Learning and optimization: high-quality conclusions from the interactive feedback are saved and learned, so that subsequent instruction execution becomes more accurate.
2. The intelligent assistant interaction method based on multimodal data fusion of claim 1, wherein the instructions issued by the user in step 1) include voice instructions, visual instructions, text instructions, and environment-aware instructions, wherein the voice and visual instructions are recognized, the text instructions are parsed, and the environment-aware data are collected by environmental sensing and then fused.
3. The intelligent assistant interaction method based on multimodal data fusion of claim 1, wherein the task generation and execution in step 3) is specifically divided into:
A. the task is sent simultaneously to the execution device and to the interactive feedback module;
B. after receiving the task, the execution device parses and decomposes it;
C. after decomposition, the task device begins task scheduling and execution, and signals the interactive feedback module once execution completes;
D. after receiving the task-completion information, the interactive feedback module checks it against the preset task information: if no preset exists, the latest processing result is fed back directly; if a preset exists, a comparison is started, and a match is fed back as the result; if the result does not match, a final result is set manually, entered, and then stored as the new preset.
4. The intelligent assistant interaction method based on multimodal data fusion of claim 1, wherein the specific learning and optimization method in step 5) is to save the result from step 4); if the result is consistent with the preset, saving is skipped; if it is inconsistent, or there is no preset, the final result is saved directly and the learning is run repeatedly.
5. An intelligent assistant interaction device based on multimodal data fusion, characterized by comprising:
(1) a multimodal sensing module, comprising a voice sensing unit that collects user voice input and transcribes it into text, a visual sensing unit that captures environment images through a camera and detects user gestures, expressions, or objects, a text processing unit that parses text instructions entered by the user, and an environment sensing unit that collects environmental data;
(2) a multimodal fusion module that uniformly encodes voice, visual, and text data with a deep-learning-based fusion model and generates a multimodal semantic representation by combining context information;
(3) a task processing and execution module, in which a task analysis unit generates a task plan from the semantic representation, a task decomposition unit breaks complex tasks into subtasks, a task scheduling unit invoked after decomposition dynamically adjusts the task execution order, and a device interface unit finally controls smart devices through Internet of Things (IoT) protocols;
(4) an interactive feedback module that reports the task execution state and results through voice broadcast, on-screen display, or a mobile terminal;
(5) a learning and optimization module:
a dynamic knowledge base that stores user preferences, historical interaction records, and environmental states;
and a reinforcement learning unit that optimizes the task allocation strategy based on feedback.
6. An assistant interaction system based on multimodal data fusion, used on smart devices, characterized in that the system uses the method of any one of claims 1-4 or the device of claim 5 to open and close equipment by sensing voice instructions, lighting conditions, text instructions, and visual instructions.
7. The assistant interaction system based on multimodal data fusion of claim 6, wherein the smart device is used in smart home, smart office, and medical assistance scenarios.
CN202411900514.2A 2024-12-23 2024-12-23 Intelligent assistant interaction method, device and system based on multimodal data fusion Pending CN119724179A (en)

Priority Applications (1)

Application Number: CN202411900514.2A · Priority Date: 2024-12-23 · Filing Date: 2024-12-23 · Title: Intelligent assistant interaction method, device and system based on multimodal data fusion

Publications (1)

Publication Number: CN119724179A · Publication Date: 2025-03-28

Family

ID=95093289

Family Applications (1)

CN202411900514.2A (Pending) · Priority Date: 2024-12-23 · Filing Date: 2024-12-23 · Title: Intelligent assistant interaction method, device and system based on multimodal data fusion

Country Status (1)

CN: CN119724179A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN120832905A * · Priority Date: 2025-09-19 · Publication Date: 2025-10-24 · Assignee: 北京灵芒智能技术有限公司 · Title: Robot adaptive human-computer interaction and task instruction understanding system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination