
CN119724179A - Intelligent assistant interaction method, device and system based on multimodal data fusion

Info

Publication number: CN119724179A
Application number: CN202411900514.2A
Authority: CN (China)
Prior art keywords: task, voice, execution, instruction, learning
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 周小文, 李强, 谈海生, 杜皓华
Current assignee: Deqing Alpha Innovation Research Institute
Original assignee: Deqing Alpha Innovation Research Institute
Application filed by: Deqing Alpha Innovation Research Institute
Filing date: 2024-12-23
Publication date: 2025-03-28

Abstract

The present invention provides an intelligent assistant interaction method, device, and system based on multimodal data fusion. The method includes the following steps: 1) multimodal perception: the user issues an instruction, and the system collects data through multimodal sensing; 2) multimodal fusion: the system semantically fuses the collected multimodal data to generate a task semantic representation; 3) task generation and execution: the system decomposes the task into subtasks according to the task type and assigns them to related devices or services; 4) interactive feedback: the system reports the execution results to the user through voice broadcast or a display screen; 5) learning and optimization: high-quality conclusions from the interactive feedback are saved and learned, making subsequent instruction execution more accurate. By combining multiple modalities, the system supports voice, visual, and text input, improving the user interaction experience.

Description

Intelligent assistant interaction method, device and system based on multimodal data fusion
Technical Field
The invention relates to the technical fields of artificial intelligence, human-computer interaction, and the Internet of Things, and in particular to an intelligent assistant interaction method, device, and system based on multimodal data fusion.
Background
With the development of artificial intelligence, the prior art can achieve intelligent behavior through a single modality in specific scenarios. For example:
1. Smart home control based on speech recognition operates devices such as lighting and temperature through voice commands, but lacks the ability to understand complex tasks;
2. Security monitoring systems based on visual analysis can detect specific scenes or events, but fall short in multimodal information fusion and interactivity;
3. Current task execution systems typically rely on preset logic, which makes it difficult to adjust tasks dynamically based on real-time circumstances and user intent.
However, these techniques have the following problems in practical applications:
1. Lack of multimodal fusion capability: current systems cannot deeply fuse multimodal information such as voice, images, and text, leading to inaccurate understanding.
2. Single-track task processing: the system lacks task decomposition and collaboration capabilities, making complex multi-task scenarios difficult to handle.
3. Unnatural interaction: the user must interact with the system in a fixed way, and the system cannot adapt to the user's behavior or the environmental state.
Therefore, an intelligent assistant interaction method, device, and system based on multimodal data fusion are designed to overcome these problems.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an intelligent assistant interaction method, device, and system based on multimodal data fusion, which realize comprehensive processing of natural language, visual, voice, and environmental data for task collaboration and human-computer interaction across multiple scenarios.
The invention is realized by the following technical scheme. The intelligent assistant interaction method based on multimodal data fusion comprises the following steps:
1) Multimodal perception: the user issues an instruction, and the system collects data through multimodal sensing;
2) Multimodal fusion: the system semantically fuses the collected multimodal data to generate a task semantic representation;
3) Task generation and execution: the system decomposes the task into subtasks according to the task type and assigns them to related devices or services;
4) Interactive feedback: the system reports the execution result to the user through voice broadcast or a display screen;
5) Learning and optimization: high-quality conclusions from the interactive feedback are saved and learned, so that subsequent instruction execution becomes more accurate.
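The five steps can be pictured as a simple pipeline. The following is a minimal, illustrative Python sketch; every name in it (perceive, fuse, TaskSemantics, and so on) is a hypothetical stand-in, not a component defined by the patent:

```python
from dataclasses import dataclass, field

@dataclass
class TaskSemantics:
    intent: str
    slots: dict = field(default_factory=dict)

def perceive(user_input: dict) -> dict:
    """Step 1: collect voice, vision, text, and environment data."""
    return {k: user_input.get(k) for k in ("voice", "vision", "text", "environment")}

def fuse(modalities: dict) -> TaskSemantics:
    """Step 2: fuse modalities into one task semantic representation (stubbed)."""
    intent = modalities.get("text") or modalities.get("voice") or "unknown"
    return TaskSemantics(intent=str(intent), slots={"env": modalities.get("environment")})

def execute(task: TaskSemantics) -> list:
    """Step 3: decompose into subtasks and dispatch to devices or services (stubbed)."""
    subtasks = [s.strip() for s in task.intent.split(" and ")]  # naive decomposition
    return ["done: " + s for s in subtasks]

def feed_back(results: list) -> None:
    """Step 4: report results via voice broadcast or display (stdout here)."""
    print("; ".join(results))

def learn(task: TaskSemantics, results: list, knowledge_base: list) -> None:
    """Step 5: save high-quality conclusions so later executions are more accurate."""
    knowledge_base.append((task.intent, results))

kb = []
task = fuse(perceive({"voice": "turn on the light and play music"}))
results = execute(task)
feed_back(results)
learn(task, results, kb)
```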
Preferably, the instructions issued by the user in step 1) include voice instructions, visual instructions, text instructions, and environment-aware instructions, wherein the voice and visual instructions are recognized, the text instructions are parsed, and the environment-aware data are collected by environmental sensing and then fused.
Preferably, the task generation and execution in step 3) is specifically divided into:
A. the task is sent simultaneously to the execution device and to the interactive feedback module;
B. after receiving the task, the execution device parses and decomposes it;
C. after decomposition, the task device begins task scheduling and execution, and signals the interactive feedback module once execution completes;
D. after receiving the task-completion information, the interactive feedback module checks it against the preset task information: if no preset exists, the latest processing result is fed back directly; if a preset exists, a comparison is started, and a match is fed back as the result; if the result does not match, a final result is set manually, entered, and then stored as the new preset.
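The check in step D can be sketched as follows; this is an illustrative Python fragment, and the dict-based preset store and the name feedback_check are assumptions, not elements of the patent:

```python
presets = {}  # task id -> preset result (assumed storage layout)

def feedback_check(task_id, result, manual_result=None):
    """Step D sketch: compare a completion result against the preset."""
    preset = presets.get(task_id)
    if preset is None:
        return result                 # no preset: feed back the latest result directly
    if result == preset:
        return result                 # preset matches: feed back the match
    final = manual_result if manual_result is not None else result
    presets[task_id] = final          # mismatch: manually set final result becomes the new preset
    return final
```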
The specific learning and optimization method in step 5) is as follows: the result from step 4) is saved; if the result is consistent with the preset, saving is skipped; if it is inconsistent, or there is no preset, the final result is saved directly, and the learning is run repeatedly.
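A corresponding sketch of this save rule, again with assumed storage (a presets dict and a history list) and an invented function name:

```python
def save_and_learn(task_id, result, presets, history):
    """Step 5 sketch: skip saving on a preset match, otherwise save and relearn."""
    if presets.get(task_id) == result:
        return False                  # consistent with the preset: skip saving
    history.append((task_id, result)) # inconsistent, or no preset: save directly
    presets[task_id] = result         # the saved final result serves as the new preset
    return True                       # caller reruns learning on the saved history
```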
An intelligent assistant interaction device based on multimodal data fusion, the device comprising:
(1) A multimodal sensing module, comprising a voice sensing unit that collects user voice input and transcribes it into text, a visual sensing unit that captures environment images through a camera and detects user gestures, expressions, or objects, a text processing unit that parses text instructions entered by the user, and an environment sensing unit that collects environmental data.
(2) A multimodal fusion module: a fusion model based on deep learning uniformly encodes the voice, visual, and text data and generates a multimodal semantic representation by combining context information (a sketch of this unified encoding follows the list below).
(3) A task processing and execution module, comprising a task analysis unit that generates a task plan from the semantic representation, a task decomposition unit that then breaks complex tasks into subtasks, a task scheduling unit invoked after decomposition to dynamically adjust the task execution order, and a device interface unit that finally controls smart devices through Internet of Things (IoT) protocols.
(4) An interactive feedback module that reports the task execution state and results through voice broadcast, on-screen display, or a mobile terminal.
(5) A learning and optimization module:
a dynamic knowledge base that stores user preferences, historical interaction records, and environmental states;
and a reinforcement learning unit that optimizes the task allocation strategy based on feedback.
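As referenced in module (2), unified encoding can be pictured as mapping every modality into one shared vector space. The sketch below is only illustrative: the hash-based embedding and mean-pooling are toy stand-ins for the deep-learning fusion model the patent assumes.

```python
import hashlib

DIM = 8  # toy embedding size

def embed(serialized: str) -> list:
    """Map any modality, serialized as text, to a fixed-size vector."""
    digest = hashlib.sha256(serialized.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]

def fuse_modalities(voice: str, vision: str, text: str, context: str) -> list:
    """Encode each modality, then mean-pool into one multimodal representation."""
    vectors = [embed(v) for v in (voice, vision, text, context)]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

representation = fuse_modalities(
    "turn on the light", "user points at the lamp", "", "evening, low ambient light"
)
```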
The assistant interaction system based on multimodal data fusion runs on smart devices; using the above method or device, the system opens and closes equipment by sensing voice instructions, lighting conditions, text instructions, and visual instructions.
Preferably, the smart device is used in smart home, smart office, and medical assistance scenarios.
The beneficial effects of the invention are as follows:
Compared with existing intelligent assistant technology, the intelligent assistant interaction system based on multimodal data fusion provided by the invention has these advantages:
1. By combining multiple modalities, the system supports voice, visual, and text input, improving the user interaction experience.
2. Dynamic task decomposition and scheduling optimize resource utilization and improve task execution efficiency.
3. The system is applicable to a variety of scenarios, such as smart home, office automation, and medical assistance.
4. Through reinforcement learning and the knowledge base, the system can continuously optimize its performance based on user behavior and environmental changes.
Drawings
FIG. 1 is a system framework diagram of the present invention.
Detailed Description
The present invention will be further described below with reference to the drawings and examples, so that those skilled in the art can more clearly understand its objects, technical solutions, and advantages.
In the description of the present invention, it should be understood that terms such as "upper," "lower," "left," "right," "inner," "outer," "transverse," and "vertical" indicate orientations or positional relationships based on those shown in the drawings; they are used merely for convenience of description and do not indicate or imply that the apparatus or element referred to must have a specific orientation, and thus should not be construed as limiting the invention.
As shown in FIG. 1, the intelligent assistant interaction method based on multimodal data fusion comprises the following steps:
1) Multimodal perception: the user issues an instruction, and the system collects data through multimodal sensing;
2) Multimodal fusion: the system semantically fuses the collected multimodal data to generate a task semantic representation;
3) Task generation and execution: the system decomposes the task into subtasks according to the task type and assigns them to related devices or services;
4) Interactive feedback: the system reports the execution result to the user through voice broadcast or a display screen;
5) Learning and optimization: high-quality conclusions from the interactive feedback are saved and learned, so that subsequent instruction execution becomes more accurate.
The instructions issued by the user in step 1) include voice instructions, visual instructions, text instructions, and environment-aware instructions, wherein the voice and visual instructions are recognized, the text instructions are parsed, and the environment-aware data are collected by environmental sensing and then fused.
The task generation and execution in step 3) is specifically divided into:
A. the task is sent simultaneously to the execution device and to the interactive feedback module;
B. after receiving the task, the execution device parses and decomposes it;
C. after decomposition, the task device begins task scheduling and execution, and signals the interactive feedback module once execution completes;
D. after receiving the task-completion information, the interactive feedback module checks it against the preset task information: if no preset exists, the latest processing result is fed back directly; if a preset exists, a comparison is started, and a match is fed back as the result; if the result does not match, a final result is set manually, entered, and then stored as the new preset.
The specific learning and optimization method in step 5) is as follows: the result from step 4) is saved; if the result is consistent with the preset, saving is skipped; if it is inconsistent, or there is no preset, the final result is saved directly, and the learning is run repeatedly.
An intelligent assistant interaction device based on multimodal data fusion, the device comprising:
(1) A multimodal sensing module, comprising a voice sensing unit that collects user voice input and transcribes it into text, a visual sensing unit that captures environment images through a camera and detects user gestures, expressions, or objects, a text processing unit that parses text instructions entered by the user, and an environment sensing unit that collects environmental data.
(2) A multimodal fusion module: a fusion model based on deep learning uniformly encodes the voice, visual, and text data and generates a multimodal semantic representation by combining context information.
(3) A task processing and execution module, comprising a task analysis unit that generates a task plan from the semantic representation, a task decomposition unit that then breaks complex tasks into subtasks, a task scheduling unit invoked after decomposition to dynamically adjust the task execution order, and a device interface unit that finally controls smart devices through Internet of Things (IoT) protocols.
(4) An interactive feedback module that reports the task execution state and results through voice broadcast, on-screen display, or a mobile terminal.
(5) A learning and optimization module:
a dynamic knowledge base that stores user preferences, historical interaction records, and environmental states;
and a reinforcement learning unit that optimizes the task allocation strategy based on feedback.
The assistant interaction system based on multimodal data fusion runs on smart devices; using the above method or device, the system opens and closes equipment by sensing voice instructions, lighting conditions, text instructions, and visual instructions.
The smart device is used in smart home, smart office, and medical assistance scenarios.
Example 1: Smart home scenario
User requirement: the user issues a voice command to turn on the lights, play music, and lower the room temperature.
System operation: the system senses the user's voice command and detects the ambient illumination in the living room. The task is decomposed into three subtasks: adjusting the lights, starting the stereo to play music, and turning on the air conditioner to cool the room. The system controls the lights, stereo, and air conditioner, respectively, via IoT protocols. After the task completes, the system gives voice feedback: "The lights are on, music is playing, and the air conditioner is set to 22 °C."
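The dispatch in this example might look like the following sketch. The helper send_iot_command and the device names are hypothetical stubs; a real deployment would use an actual IoT protocol stack (for example MQTT) rather than this stand-in:

```python
def send_iot_command(device, command, value=None):
    """Stub for an IoT protocol round trip (hypothetical helper)."""
    return {"device": device, "command": command, "value": value, "status": "ok"}

subtasks = [
    ("light", "turn_on", None),
    ("stereo", "play_music", None),
    ("air_conditioner", "set_temperature", 22),
]
results = [send_iot_command(*subtask) for subtask in subtasks]
if all(r["status"] == "ok" for r in results):
    print("The lights are on, music is playing, and the air conditioner is set to 22 °C.")
```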
Example 2: Smart office scenario
User requirement: the user requests, via a text command, that the conference schedule be projected and a notification be sent.
System operation: the text processing unit parses the user instruction, and the vision module detects the projector state.
The task processing module generates three subtasks: turn on the projector, display the calendar, and send a notification by email. The system completes the schedule projection and email delivery through APIs, then gives voice feedback: "Conference schedule projected, notification sent."
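A sketch of this flow with placeholder API wrappers; projector_on, fetch_schedule, and send_mail are invented names, since the actual endpoints are not specified in the patent:

```python
def projector_on():
    return True  # placeholder: turn on the projector via its API

def fetch_schedule():
    return ["09:00 stand-up", "14:00 design review"]  # placeholder calendar query

def send_mail(recipients, body):
    return True  # placeholder mail API call

if projector_on():
    schedule = fetch_schedule()
    if send_mail(["team@example.com"], "\n".join(schedule)):
        print("Conference schedule projected, notification sent.")
```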
Example 3: Medical assistance scenario
User requirement: medical staff request a patient's real-time health data through a gesture command.
System operation: the vision module recognizes the "health data" gesture and queries the patient health database. The system generates a real-time health report from the query results and presents it on a display screen. The voice module feeds back: "Health data has been updated and displayed."
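The gesture-triggered query can be sketched the same way; the gesture label, the query_health_db helper, and the report format are all assumptions for illustration:

```python
def query_health_db(patient_id):
    return {"heart_rate": 72, "spo2": 98}  # placeholder for the real database query

def on_gesture(label, patient_id):
    """A recognized gesture label triggers the health-data query and display."""
    if label == "health_data":
        report = query_health_db(patient_id)
        print("Health data has been updated and displayed:", report)

on_gesture("health_data", "patient-001")
```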
Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. It is therefore intended that all equivalent modifications and changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the appended claims.

Claims (7)

1. An intelligent assistant interaction method based on multimodal data fusion, characterized by comprising the following steps:
1) Multimodal perception: the user issues an instruction, and the system collects data through multimodal sensing;
2) Multimodal fusion: the system semantically fuses the collected multimodal data to generate a task semantic representation;
3) Task generation and execution: the system decomposes the task into subtasks according to the task type and assigns them to related devices or services;
4) Interactive feedback: the system reports the execution result to the user through voice broadcast or a display screen;
5) Learning and optimization: high-quality conclusions from the interactive feedback are saved and learned, so that subsequent instruction execution becomes more accurate.
2. The intelligent assistant interaction method based on multimodal data fusion of claim 1, wherein the instructions issued by the user in step 1) include voice instructions, visual instructions, text instructions, and environment-aware instructions, wherein the voice and visual instructions are recognized, the text instructions are parsed, and the environment-aware data are collected by environmental sensing and then fused.
3. The intelligent assistant interaction method based on multimodal data fusion of claim 1, wherein the task generation and execution in step 3) is specifically divided into:
A. the task is sent simultaneously to the execution device and to the interactive feedback module;
B. after receiving the task, the execution device parses and decomposes it;
C. after decomposition, the task device begins task scheduling and execution, and signals the interactive feedback module once execution completes;
D. after receiving the task-completion information, the interactive feedback module checks it against the preset task information: if no preset exists, the latest processing result is fed back directly; if a preset exists, a comparison is started, and a match is fed back as the result; if the result does not match, a final result is set manually, entered, and then stored as the new preset.
4. The intelligent assistant interaction method based on multimodal data fusion of claim 1, wherein the specific learning and optimization method in step 5) is to save the result from step 4); if the result is consistent with the preset, saving is skipped; if it is inconsistent, or there is no preset, the final result is saved directly and the learning is run repeatedly.
5. An intelligent assistant interaction device based on multimodal data fusion, characterized by comprising:
(1) a multimodal sensing module, comprising a voice sensing unit that collects user voice input and transcribes it into text, a visual sensing unit that captures environment images through a camera and detects user gestures, expressions, or objects, a text processing unit that parses text instructions entered by the user, and an environment sensing unit that collects environmental data;
(2) a multimodal fusion module that uniformly encodes voice, visual, and text data with a deep-learning-based fusion model and generates a multimodal semantic representation by combining context information;
(3) a task processing and execution module, in which a task analysis unit generates a task plan from the semantic representation, a task decomposition unit breaks complex tasks into subtasks, a task scheduling unit invoked after decomposition dynamically adjusts the task execution order, and a device interface unit finally controls smart devices through Internet of Things (IoT) protocols;
(4) an interactive feedback module that reports the task execution state and results through voice broadcast, on-screen display, or a mobile terminal;
(5) a learning and optimization module:
a dynamic knowledge base that stores user preferences, historical interaction records, and environmental states;
and a reinforcement learning unit that optimizes the task allocation strategy based on feedback.
6. An assistant interaction system based on multimodal data fusion, used on smart devices, characterized in that the system uses the method of any one of claims 1-4 or the device of claim 5 to open and close equipment by sensing voice instructions, lighting conditions, text instructions, and visual instructions.
7. The assistant interaction system based on multimodal data fusion of claim 6, wherein the smart device is used in smart home, smart office, and medical assistance scenarios.
CN202411900514.2A 2024-12-23 2024-12-23 Intelligent assistant interaction method, device and system based on multimodal data fusion Pending CN119724179A (en)

Priority Applications (1)

Application Number: CN202411900514.2A · Priority Date: 2024-12-23 · Filing Date: 2024-12-23 · Title: Intelligent assistant interaction method, device and system based on multimodal data fusion

Publications (1)

Publication Number: CN119724179A · Publication Date: 2025-03-28

Family

ID=95093289

Family Applications (1)

CN202411900514.2A (Pending) · Priority Date: 2024-12-23 · Filing Date: 2024-12-23 · Title: Intelligent assistant interaction method, device and system based on multimodal data fusion

Country Status (1)

CN: CN119724179A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN120832905A * · Priority Date: 2025-09-19 · Publication Date: 2025-10-24 · Assignee: 北京灵芒智能技术有限公司 · Title: Robot adaptive human-computer interaction and task instruction understanding system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination