CN118760359A - Multi-modal multi-task intelligent interaction method, device, electronic device and storage medium
- Publication number: CN118760359A (application CN202411206673.2A)
- Authority: CN (China)
- Prior art keywords: data, interaction, tasks, multimodal, information
- Legal status: Granted (the legal status is an assumption, not a legal conclusion)
Classifications
- G06F3/011 — Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/012 — Head tracking input arrangements
- G06F3/013 — Eye tracking input arrangements
- G06F3/017 — Gesture based interaction, e.g. based on a set of recognized hand gestures
- G06F18/213 — Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/253 — Fusion techniques of extracted features
- G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N5/022 — Knowledge engineering; Knowledge acquisition
- G06N5/027 — Frames
- G06N5/041 — Abduction
Abstract
The present invention relates to a multimodal multi-task intelligent interaction method, apparatus, electronic device, and storage medium. The method comprises: acquiring multimodal data, and performing preprocessing and feature extraction on it to construct a multimodal knowledge base; receiving input multimodal information, and identifying the current user and usage scenario from it to obtain a plurality of tasks corresponding to that user and scenario; feeding the plurality of tasks into a Transformer multi-task model that combines them into a corresponding multi-head model and fuses the features of the different tasks, establishing a multimodal understanding framework that outputs a recognition result for each task; and selecting a corresponding interaction strategy based on the recognition results, generating the corresponding interaction content according to that strategy, and displaying the content visually. By introducing more interaction modalities, the method aligns more closely with users' habits and needs and completes interaction tasks more quickly and accurately.
Description
Technical Field

The present invention relates to the field of human-computer interaction technology, and in particular to a multimodal multi-task intelligent interaction method, apparatus, electronic device, and storage medium.
Background Art

As technology develops, users' expectations of intelligent interaction keep rising, and traditional single-modal intelligent interaction can no longer meet them. The main problems are as follows:

(1) Limited interaction capability and accuracy. Constrained by its technology, single-modal intelligent interaction may struggle to understand the user's intention and context, so its interaction content is often inaccurate. For example, human-machine communication is easily affected by the environment, noise, and similar factors, making the user's intention hard to recognize.

(2) Interaction content that is mostly basic, mechanical dialogue. Most intelligent interaction systems can only serve a fairly limited task domain, so they can only conduct simple question-and-answer dialogues about specific problems and tasks.

(3) A low degree of intelligence. For example, most intelligent interaction systems accept only text and voice input, which restricts how users can interact with them; as a result, the system struggles to learn users' personalized needs and preferences and cannot provide personalized services to different users.

Therefore, because traditional intelligent interaction methods are limited to text or voice, human-computer interaction is inefficient across different interaction scenarios, the interaction is easily disturbed by the environment, accuracy is low, and the interaction needs of multimodal scenarios are hard to satisfy.
Summary of the Invention

In view of the above technical problems, it is necessary to provide a multimodal multi-task intelligent interaction method, apparatus, electronic device, and storage medium that improve human-computer interaction efficiency and interaction accuracy across a variety of interaction scenarios.

The present invention provides a multimodal multi-task intelligent interaction method, the method comprising:

acquiring multimodal data, and performing preprocessing and feature extraction on the multimodal data to construct a multimodal knowledge base;

receiving input multimodal information, and identifying the current user and usage scenario based on the multimodal information to obtain a plurality of tasks corresponding to the current user and usage scenario;

feeding the plurality of tasks into a Transformer multi-task model so that the tasks form a corresponding multi-head model, and fusing the features of the different tasks to establish a multimodal understanding framework that outputs recognition results corresponding to the tasks;

selecting a corresponding interaction strategy based on the recognition results, generating corresponding interaction content according to the interaction strategy, and visually displaying the interaction content;

wherein the multimodal data includes text, audio, video, and image data; the preprocessing includes data cleaning and clustering-algorithm processing; the multimodal information includes text, audio, video, and image information input actively or passively by the user; the plurality of tasks include an interaction-intention recognition task, an emotion recognition task, and an entity recognition task; the interaction strategies include task-oriented interaction, knowledge-based interaction, and open-domain chit-chat interaction; and the interaction content includes text, pictures, and speech.
In one embodiment, acquiring the multimodal data and performing preprocessing and feature extraction on it to construct the multimodal knowledge base includes:

performing corpus cleaning on the text data, deleting useless and duplicate data through clustering annotation, and removing garbled characters and redundant symbols to obtain preprocessed text data; and

removing illegal images from the image data through clustering annotation, labeling the image data, and adjusting the images to a uniform size and resolution to obtain preprocessed image data.

In one embodiment, acquiring the multimodal data and performing preprocessing and feature extraction on it to construct the multimodal knowledge base further includes:

segmenting the video data into a plurality of video clips, and extracting a plurality of video frames from the clips for feature extraction; and

denoising the audio data, and segmenting it into multiple audio segments according to a set rule for feature extraction.

In one embodiment, acquiring the multimodal data and performing preprocessing and feature extraction on it to construct the multimodal knowledge base further includes:

performing feature extraction on the preprocessed multimodal data with a convolutional neural network to extract attribute features and feature vectors, which are used for similarity analysis and data classification of the multimodal data;

storing the multimodal data, according to the similarity-analysis and classification results of the attribute features and feature vectors, in the multimodal knowledge base comprising a relational database and a graph database, and building indexes between data of different modalities.
In one embodiment, receiving the input multimodal information and identifying the current user and usage scenario based on it to obtain the corresponding plurality of tasks includes:

building a user-profile knowledge base and a scenario-profile knowledge base as candidate sets from the multimodal information input by different users in different scenarios, and generating a corresponding input representation from the current user's multimodal input through a profile-matching unit;

computing, via a set retrieval-matching strategy, the matching score for the current user's multimodal input, and returning the highest-scoring user-profile matching vector and scenario-profile matching vector from the candidate sets.
In one embodiment, feeding the plurality of tasks into the Transformer multi-task model to form the corresponding multi-head model, fusing the features of the different tasks, and establishing the multimodal understanding framework to output the recognition results includes:

introducing prompt-based learning at the decoding layer of the Transformer multi-task model to guide it to generate label results for the different tasks;

invoking the multimodal understanding framework to recognize the different tasks and generate their respective recognition results, each containing the label result for that task; the label results are used to determine the interaction strategy for each task.
In one embodiment, selecting the corresponding interaction strategy based on the recognition results, generating the corresponding interaction content, and visually displaying it includes:

obtaining the current user's intention based on the recognition results, and feeding back corresponding first interaction content in response to that intention;

determining through monitoring whether the first interaction content meets the current user's needs, and, when it does not, feeding back second interaction content to pin down the user's intention.
The present invention also provides a multimodal multi-task intelligent interaction apparatus, the apparatus comprising:

a knowledge base construction module for acquiring multimodal data and performing preprocessing and feature extraction on it to construct a multimodal knowledge base;

an information recognition module for receiving input multimodal information and identifying the current user and usage scenario based on it to obtain the corresponding plurality of tasks;

a task recognition module for feeding the plurality of tasks into the Transformer multi-task model to form the corresponding multi-head model, fusing the features of the different tasks, and establishing the multimodal understanding framework that outputs the recognition results;

a content interaction module for selecting a corresponding interaction strategy based on the recognition results, generating corresponding interaction content according to that strategy, and visually displaying it;

wherein the multimodal data includes text, audio, video, and image data; the preprocessing includes data cleaning and clustering-algorithm processing; the multimodal information includes text, audio, video, and image information input actively or passively by the user; the plurality of tasks include an interaction-intention recognition task, an emotion recognition task, and an entity recognition task; the interaction strategies include task-oriented interaction, knowledge-based interaction, and open-domain chit-chat interaction; and the interaction content includes text, pictures, and speech.

The present invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements any of the multimodal multi-task intelligent interaction methods described above.

The present invention also provides a computer storage medium storing a computer program which, when executed by a processor, implements any of the multimodal multi-task intelligent interaction methods described above.

The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements any of the multimodal multi-task intelligent interaction methods described above.
With the above multimodal multi-task intelligent interaction method, apparatus, electronic device, and storage medium, multimodal data is acquired, preprocessed, and feature-extracted to construct a multimodal knowledge base. Input multimodal information is then received, and the current user and usage scenario are identified from it to obtain the corresponding plurality of tasks. The tasks are fed into a Transformer multi-task model that combines them into a multi-head model and fuses the features of the different tasks, establishing a multimodal understanding framework that outputs a recognition result per task. Finally, an interaction strategy is selected based on the recognition results, and the corresponding interaction content is generated and displayed visually. Multimodal intelligent interaction can introduce additional interaction modalities such as images, gestures, touch, facial expressions, and gaze, aligning more closely with users' habits and needs while letting them complete interaction tasks more quickly and accurately, which improves interaction efficiency and user satisfaction. It also improves the reliability and robustness of the interaction: in noisy environments, or where voice input is inconvenient, users can interact in other ways, avoiding the limitations of existing voice-only interaction and adapting to different interaction scenarios and environments.
Brief Description of the Drawings

To explain the technical solutions of the present invention or the prior art more clearly, the drawings needed for the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is the first schematic flow chart of the multimodal multi-task intelligent interaction method provided by the present invention;

FIG. 2 is a schematic diagram of the multimodal knowledge base construction process of the method in a specific embodiment;

FIG. 3 is a schematic diagram of the multimodal information recognition process of the method in a specific embodiment;

FIG. 4 is a schematic diagram of the multi-task learning process of the method in a specific embodiment;

FIG. 5 is the second schematic flow chart of the method;

FIG. 6 is the third schematic flow chart of the method;

FIG. 7 is the fourth schematic flow chart of the method;

FIG. 8 is the fifth schematic flow chart of the method;

FIG. 9 is the sixth schematic flow chart of the method;

FIG. 10 is the seventh schematic flow chart of the method;

FIG. 11 is a schematic structural diagram of the multimodal multi-task intelligent interaction apparatus provided by the present invention;

FIG. 12 is a diagram of the internal structure of the electronic device provided by the present invention.
Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The multimodal multi-task intelligent interaction method, apparatus, electronic device, and storage medium of the present invention are described below with reference to FIGS. 1-12.
As shown in FIG. 1, in one embodiment, a multimodal multi-task intelligent interaction method includes the following steps:

Step S110: acquire multimodal data, and perform preprocessing and feature extraction on it to construct a multimodal knowledge base.

Specifically, the server acquires multimodal data such as text, audio, video, and image data, and performs preprocessing (data cleaning and clustering-algorithm processing) and feature extraction on it to complete the construction of the multimodal knowledge base.

Step S120: receive input multimodal information, and identify the current user and usage scenario based on it to obtain the corresponding plurality of tasks.

Specifically, the server receives multimodal information such as text, audio, video, and image information input actively or passively by the user, identifies the current user and usage scenario from it, and obtains the corresponding plurality of tasks as input to the Transformer multi-task model.

Step S130: feed the plurality of tasks into the Transformer multi-task model to form a corresponding multi-head model, fuse the features of the different tasks, and establish a multimodal understanding framework that outputs the recognition results for the tasks.

Specifically, the server takes multiple tasks such as the interaction-intention recognition task, the emotion recognition task, and the entity recognition task as input to the Transformer multi-task model, combines them into a single multi-head model, fuses the features of the different tasks, and establishes a multimodal understanding framework that outputs the recognition result for each task.

Step S140: select a corresponding interaction strategy based on the recognition results, generate corresponding interaction content according to that strategy, and display the content visually.

Specifically, the server selects an interaction strategy based on the recognition results obtained in step S130, generates the corresponding interaction content according to the selected strategy, and displays it visually to the current user.

The interaction strategies include task-oriented interaction, knowledge-based interaction, and open-domain chit-chat interaction; the interaction content includes text, pictures, and speech.

With the above multimodal multi-task intelligent interaction method, a multimodal knowledge base is constructed by acquiring, preprocessing, and feature-extracting multimodal data. Input multimodal information is then received, and the current user and usage scenario are identified from it to obtain the corresponding plurality of tasks. The tasks are fed into a Transformer multi-task model that combines them into a multi-head model and fuses the features of the different tasks, establishing a multimodal understanding framework that outputs a recognition result per task. Finally, an interaction strategy is selected based on the recognition results, and the corresponding interaction content is generated and displayed visually. Multimodal intelligent interaction introduces additional interaction modalities such as images, gestures, touch, facial expressions, and gaze, aligning more closely with users' habits and needs while letting them complete interaction tasks more quickly and accurately, which improves interaction efficiency and user satisfaction. It also improves the reliability and robustness of the interaction: in noisy environments, or where voice input is inconvenient, users can interact in other ways, avoiding the limitations of existing voice-only interaction and adapting to different interaction scenarios and environments.
Referring to FIG. 2, in a specific embodiment, the multimodal multi-task intelligent interaction method provided by the present invention constructs a rich, heterogeneous knowledge base; through multimodal learning, data of all modalities are fused in a vector space, enabling cross-modal information fusion analysis and associated retrieval, automatic classification of knowledge, and active discovery of new knowledge through data association relationships. Knowledge base construction starts with data collection: a multimodal knowledge base must contain images, videos, text, and other data types, so data is collected from a variety of sources such as web images, videos, and text material. The collected multimodal data is then preprocessed; different data types require different preprocessing, and a clustering algorithm automatically groups related material and maintains it automatically, ensuring the accuracy, consistency, and comparability of the data and providing a better basis for subsequent analysis and application. Concretely, the collected data of each modality is cleaned, deduplicated, and annotated to ensure its accuracy and reliability.

In this embodiment, text cleaning includes corpus cleaning (clustering annotation to delete irrelevant and duplicate data and to remove garbled characters and redundant symbols), word segmentation (e.g., with jieba, an open-source Chinese word segmentation tool, or LTP, a Chinese NLP toolkit), and part-of-speech tagging. Image cleaning mainly involves inspecting images, removing illegal images (watermarked, noisy, blurred, etc.) via clustering annotation, attaching text labels, mapping images to textual descriptions, and resizing all images to a uniform size and resolution for better comparison and analysis. Video cleaning mainly involves video segmentation, cutting the video into a series of short clips for subsequent processing and analysis, and frame extraction, pulling a certain number of frames from the video for feature extraction and analysis. Audio data mainly requires denoising, to reduce the interference of noise with subsequent processing, and sound segmentation, cutting the audio according to set rules for further processing and analysis.
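As a rough illustration of the per-modality cleaning just described, the sketch below assumes jieba for word segmentation, Pillow for image resizing, OpenCV for frame extraction, and librosa for audio loading; the function names, sampling rate, and thresholds are illustrative choices, not part of the patent.

```python
# Minimal per-modality preprocessing sketch. Library choices (jieba, Pillow,
# OpenCV, librosa) and all thresholds are assumptions for illustration only.
import jieba
import cv2
import librosa
from PIL import Image

def preprocess_text(text: str) -> list:
    # Strip non-printable/garbled characters, then segment into words.
    cleaned = "".join(ch for ch in text if ch.isprintable())
    return list(jieba.cut(cleaned))

def preprocess_image(path: str, size=(224, 224)) -> Image.Image:
    # Normalize every image to a uniform size and resolution.
    return Image.open(path).convert("RGB").resize(size)

def preprocess_video(path: str, every_n: int = 30) -> list:
    # Extract one frame every `every_n` frames for later feature extraction.
    frames, cap, i = [], cv2.VideoCapture(path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def preprocess_audio(path: str, segment_s: float = 5.0) -> list:
    # Load audio and cut it into fixed-length segments per the set rule.
    y, sr = librosa.load(path, sr=16000)
    step = int(segment_s * sr)
    return [y[i:i + step] for i in range(0, len(y), step)]
```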
After preprocessing, different features must be extracted for different data types. For images, convolutional neural networks or similar algorithms can extract attribute features such as color, texture, and shape, as well as feature vectors, for similarity analysis and classification. For text, lexical and grammatical attribute features and semantic feature vectors can be extracted for similarity analysis and classification. For video, deep-learning-based algorithms can extract feature vectors from the video frames, and for audio, deep-learning-based algorithms can likewise extract acoustic feature vectors, again for similarity analysis and classification. The processed data is then stored in the knowledge base, which is organized as both a relational database and a graph database: the relational database holds the FAQ question-answer library, the table question-answer library, and so on, while the graph database holds the multimodal knowledge graph constructed for the business domain.
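A minimal feature-extraction sketch under these assumptions: BERT (via Hugging Face Transformers) for text semantics, a ResNet-18 trunk as the CNN image extractor, and MFCCs for audio. The concrete checkpoints and dimensions are illustrative stand-ins; the patent only names the techniques, not these models.

```python
# Feature-extraction sketch: BERT embeddings for text, a CNN (ResNet trunk)
# for images, and MFCCs for audio. Model choices are assumptions.
import torch
import librosa
import numpy as np
from transformers import AutoModel, AutoTokenizer
from torchvision import models, transforms

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")
# Drop the classification layer to keep ResNet-18 as a feature trunk.
cnn = torch.nn.Sequential(*list(models.resnet18(weights="DEFAULT").children())[:-1])

def text_vector(text: str) -> torch.Tensor:
    # Mean-pool the last hidden layer into one semantic feature vector.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return bert(**inputs).last_hidden_state.mean(dim=1).squeeze(0)

def image_vector(img) -> torch.Tensor:
    # Run a PIL image through the CNN trunk to get a 512-d feature vector.
    t = transforms.ToTensor()(img).unsqueeze(0)
    with torch.no_grad():
        return cnn(t).flatten()

def audio_vector(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Average MFCC frames into a fixed-size acoustic feature vector.
    return librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13).mean(axis=1)
```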
In this embodiment, indexes are built over the data in the knowledge base. For knowledge in different modalities, multimodal associations are established; cross-modal association training can be based on the CLIP model (a multimodal large model), connecting and associating different data types, building multimodal relationships, and using CLIP for cross-modal retrieval so that queries can be run conveniently. During subsequent data operation, a user interface and convenient tools are provided: the existing knowledge base is refined using user interaction records, manual service records, and the like; users can add, delete, modify, and query in real time; convenient knowledge entry, editing, and review functions are available; rich media such as text, images, hyperlinks, video, emoticons, interactive pages, and API calls are supported; knowledge points can be grouped and managed by business; and the data in the knowledge base is continuously updated, effectively guaranteeing its practicality and reliability.
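A hedged sketch of CLIP-based cross-modal retrieval as described above: text and images are embedded into one space so either modality can index the other. The `openai/clip-vit-base-patch32` checkpoint is an assumed example; the patent does not specify which CLIP variant is used.

```python
# Cross-modal retrieval sketch with CLIP. The checkpoint is an assumption.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def index_images(images) -> torch.Tensor:
    # Embed knowledge-base images once; store these unit vectors as the index.
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def query_by_text(query: str, image_index: torch.Tensor) -> int:
    # Embed a text query and return the best-matching image's index position.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    return int((image_index @ q.T).squeeze(1).argmax())
```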
As shown in FIG. 3, input of multimodal information has two parts: passive input and active input. Passive input data covers several modalities input by the user, such as text, speech, images, and video; active input covers the user's face and the user-profile and scenario-profile information obtained after scene recognition. For different user groups and different scenarios, an industry user-profile knowledge base and a scenario-profile knowledge base are built as candidate sets. A profile-matching module then uses the current input's user attributes (gender, age, face, etc.) and the scenario's environmental attributes as the query to generate a personalized input representation. Through a retrieval-matching strategy, matching scores are computed and the highest-scoring implicit personalized input representations that best match the user are returned, namely the user-profile matching vector (u) and the scenario matching vector (g). These are finally passed to the subsequent interaction-strategy recognition module to capture the personalized interaction style.
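The retrieval-matching strategy is not specified in the patent; the sketch below uses cosine similarity as a plausible stand-in to show how the highest-scoring profile vectors u and g could be returned from the candidate sets.

```python
# Profile-matching sketch: score a query representation against candidate
# user/scene profile vectors; cosine similarity stands in for the patent's
# unspecified retrieval-matching strategy.
import numpy as np

def best_match(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    # Cosine similarity between the query and every candidate profile row.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates[int((c @ q).argmax())]

def match_profiles(user_query, scene_query, user_kb, scene_kb):
    # u: best user-profile vector; g: best scenario-profile vector.
    u = best_match(user_query, user_kb)
    g = best_match(scene_query, scene_kb)
    return u, g
```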
As shown in FIG. 4, 'entity' denotes the entity task, 'intent' the intention task, and 'sentiment' the sentiment library; 'Decoder' and 'Encoder' denote the decoder and encoder, and 'concatenate' is the concatenation function used in deep learning. The multi-task learning process specifically includes the following steps:

1) Collect multimodal data: collect the user's speech, video, gestures, and other multimodal data via cameras, microphones, sensors, and similar devices. Data collected this way is mainly passive input; in addition, by matching against the user-profile and scenario-profile databases, user-profile and scenario-profile data can be output as actively recognized input.

2) Preprocess the multimodal input collected in 1): e.g., denoise the audio, convert speech to text, segment the text, and extract frames from the video.

3) Multimodal feature extraction: extract features from the preprocessed multimodal data for subsequent modeling. For speech data, extract MFCC acoustic features (spectrum, Mel-frequency cepstral coefficients); for text data, BERT (a self-encoding language model) can extract semantic and other features; for image data, an MAE (masked autoencoder) can extract convolutional features.

4) Multi-task learning: learn multiple tasks (interaction-intention recognition, sentiment judgment, entity recognition, etc.) as a whole so that they can be trained simultaneously with shared features. Based on the Transformer multi-task model, the tasks are combined into one multi-head model, and an attention mask is designed so that part of the parameters are shared, achieving inductive transfer and improving the model's generalization. At the decoding layer, prompt learning is introduced to guide the model to generate the label result for each task. Multi-task learning fuses the features of the different tasks so that cross-modal data expressing similar semantics is mapped into the same space, yielding a more comprehensive feature representation of the user's input, improving recognition accuracy, and realizing a multimodal understanding framework with both representation and generation capabilities (see the sketch after this list).

5) Output recognition results: output the label results for user-intention recognition, emotion recognition, and entity recognition; these label results drive the subsequent interaction-behavior decisions.
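A minimal sketch of the multi-head arrangement in step 4): one shared Transformer encoder feeds three task heads (intent, sentiment, entity). The dimensions, label counts, and the use of simple linear heads instead of a prompt-guided decoder are assumptions made for brevity.

```python
# Multi-head multi-task sketch: a shared Transformer encoder with three task
# heads. Dimensions and the linear heads are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, d_model=256, n_intents=20, n_sentiments=3, n_entities=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # shared
        self.intent_head = nn.Linear(d_model, n_intents)        # sequence-level
        self.sentiment_head = nn.Linear(d_model, n_sentiments)  # sequence-level
        self.entity_head = nn.Linear(d_model, n_entities)       # token-level

    def forward(self, fused_features, attn_mask=None):
        # fused_features: (batch, seq, d_model) fused multimodal inputs;
        # attn_mask controls which positions share information.
        h = self.encoder(fused_features, mask=attn_mask)
        pooled = h.mean(dim=1)  # shared representation for sequence tasks
        return {
            "intent": self.intent_head(pooled),
            "sentiment": self.sentiment_head(pooled),
            "entity": self.entity_head(h),  # per-token entity labels
        }
```

Training such a model would sum the three task losses, so the shared encoder learns an inductively transferred representation across tasks.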
Afterwards, a comprehensive interaction decision is made from the output label results for user-intention recognition, emotion recognition, and entity recognition. The specific steps are as follows (a loop sketch follows the list):

1) Respond to the user's intention: provide the corresponding answer, suggestion, or action according to the user's intention.

2) Check the feedback and whether the user's needs are met; if they are, trigger the user exit condition, otherwise go to 3).

3) Ask for further information: if the user's intention was not fully understood, proactively ask the user for further information to pin it down.

4) Optional: provide relevant information feedback; based on the user's intention, proactively offer useful information such as related material, links, or services.

5) Judge the end/exit condition. Based on the user's feedback, the system determines whether the user's needs are met or an exit condition is triggered; if not, steps 1) to 4) are repeated until the interaction ends.
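A minimal sketch of this decision loop, with all helper functions as hypothetical placeholders (the patent does not name them); the turn limit is an added assumption to guarantee termination.

```python
# Interaction-loop sketch following steps 1)-5): respond to the recognized
# intention, check feedback, ask for clarification, and repeat until an
# exit condition is met. All helpers are hypothetical placeholders.
def interaction_loop(recognize, respond, is_satisfied, ask_more, max_turns=5):
    for _ in range(max_turns):
        intent = recognize()          # label results from the multi-task model
        feedback = respond(intent)    # step 1): answer / suggestion / action
        if is_satisfied(feedback):    # step 2): exit condition met
            return feedback
        ask_more()                    # step 3): request further information
    return None                       # step 5): give up after max_turns rounds
```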
In this embodiment, personalized feedback output provides different interaction modalities according to the user's scenario and mood: in an office, text and cards are more suitable for interaction, while in a leisure setting a highly perceptible combination of voice and video is recommended. For the interaction content, personalized expressions are generated from the current context, user profile, and mood; for example, if the user appears low or angry, the speaking rate and style of the content are adjusted to better meet the user's needs. The recognizer sends prompts or recommendations according to the usage scenario, time, and so on, and intelligently decides when to trigger them; for instance, it factors in the weather when asked about travel options and issues safety reminders, and it guides the user through specific operations such as entering required information or clicking a particular button or link. In addition, strong interactivity and feedback mechanisms, such as incentives and feedback, can be provided to enhance the user experience and engagement. Throughout the interaction, the user's intentions and operating preferences are continuously recorded so that future intentions and needs can be recognized more accurately and more intelligent services can be provided.
As shown in FIG. 5, in one embodiment, acquiring the multimodal data and performing preprocessing and feature extraction on it to construct the multimodal knowledge base specifically includes the following steps:

Step S111: perform corpus cleaning on the text data, delete useless and duplicate data through clustering annotation, and remove garbled characters and redundant symbols to obtain preprocessed text data.

Step S112: remove illegal images from the image data through clustering annotation, label the image data, and adjust the images to a uniform size and resolution to obtain preprocessed image data.

As shown in FIG. 6, in one embodiment, constructing the multimodal knowledge base further includes the following steps:

Step S113: segment the video data into a plurality of video clips, and extract a plurality of video frames from the clips for feature extraction.

Step S114: denoise the audio data, and segment it into multiple audio segments according to a set rule for feature extraction.

As shown in FIG. 7, in one embodiment, constructing the multimodal knowledge base further includes the following steps:

Step S115: perform feature extraction on the preprocessed multimodal data with a convolutional neural network to extract attribute features and feature vectors, which are used for similarity analysis and data classification.

Step S116: store the multimodal data, according to the similarity-analysis and classification results of the attribute features and feature vectors, in the multimodal knowledge base comprising a relational database and a graph database, and build indexes between data of different modalities.
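A hedged sketch of the storage side of step S116: feature vectors go into a relational table, and a cross-modal index table records edges between related items. SQLite and this schema are illustrative stand-ins; the patent names relational and graph databases without specifying either.

```python
# Storage sketch: persist feature vectors and a cross-modal index. SQLite
# and the schema are assumptions standing in for the patent's databases.
import sqlite3
import numpy as np

conn = sqlite3.connect("knowledge_base.db")
conn.execute("CREATE TABLE IF NOT EXISTS items "
             "(id INTEGER PRIMARY KEY, modality TEXT, vector BLOB)")
conn.execute("CREATE TABLE IF NOT EXISTS cross_modal_index "
             "(src_id INTEGER, dst_id INTEGER, relation TEXT)")

def store_item(modality: str, vector: np.ndarray) -> int:
    # Persist one feature vector and return its row id.
    cur = conn.execute("INSERT INTO items (modality, vector) VALUES (?, ?)",
                       (modality, vector.astype(np.float32).tobytes()))
    conn.commit()
    return cur.lastrowid

def link(src_id: int, dst_id: int, relation: str = "describes") -> None:
    # Record an index edge between items of different modalities.
    conn.execute("INSERT INTO cross_modal_index VALUES (?, ?, ?)",
                 (src_id, dst_id, relation))
    conn.commit()
```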
As shown in FIG. 8, in one embodiment, receiving the input multimodal information and identifying the current user and usage scenario to obtain the corresponding plurality of tasks specifically includes the following steps:

Step S122: build a user-profile knowledge base and a scenario-profile knowledge base as candidate sets from the multimodal information input by different users in different scenarios, and generate a corresponding input representation from the current user's multimodal input through the profile-matching unit.

Step S124: compute, via the set retrieval-matching strategy, the matching score for the current user's multimodal input, and return the highest-scoring user-profile matching vector and scenario-profile matching vector from the candidate sets.

As shown in FIG. 9, in one embodiment, feeding the plurality of tasks into the Transformer multi-task model to form the corresponding multi-head model, fusing the features of the different tasks, and establishing the multimodal understanding framework to output the recognition results specifically includes the following steps:

Step S132: introduce prompt-based learning at the decoding layer of the Transformer multi-task model to guide it to generate the label results for the different tasks.

Step S134: invoke the multimodal understanding framework to recognize the different tasks and generate their respective recognition results, each containing the label result for that task; the label results are used to determine the interaction strategy for each task.

As shown in FIG. 10, in one embodiment, selecting the corresponding interaction strategy based on the recognition results, generating the corresponding interaction content, and visually displaying it specifically includes the following steps:

Step S142: obtain the current user's intention based on the recognition results, and feed back corresponding first interaction content in response to that intention.

Step S144: determine through monitoring whether the first interaction content meets the current user's needs, and, when it does not, feed back second interaction content to pin down the user's intention.
The multimodal multi-task intelligent interaction apparatus provided by the present invention is described below; the apparatus described below and the method described above may be referred to correspondingly.

As shown in FIG. 11, in one embodiment, a multimodal multi-task intelligent interaction apparatus includes a knowledge base construction module 1110, an information recognition module 1120, a task recognition module 1130, and a content interaction module 1140.

The knowledge base construction module 1110 acquires multimodal data and performs preprocessing and feature extraction on it to construct a multimodal knowledge base.

The information recognition module 1120 receives input multimodal information and identifies the current user and usage scenario based on it to obtain the corresponding plurality of tasks.

The task recognition module 1130 feeds the plurality of tasks into the Transformer multi-task model to form the corresponding multi-head model, fuses the features of the different tasks, and establishes the multimodal understanding framework that outputs the recognition results.

The content interaction module 1140 selects a corresponding interaction strategy based on the recognition results, generates corresponding interaction content according to that strategy, and displays it visually.

The multimodal data includes text, audio, video, and image data; the preprocessing includes data cleaning and clustering-algorithm processing; the multimodal information includes text, audio, video, and image information input actively or passively by the user; the plurality of tasks include an interaction-intention recognition task, an emotion recognition task, and an entity recognition task; the interaction strategies include task-oriented interaction, knowledge-based interaction, and open-domain chit-chat interaction; and the interaction content includes text, pictures, and speech.
在本实施例中,本发明提供的多模态多任务智能交互装置,知识库构建模块1110具体用于:In this embodiment, in the multi-modal multi-task intelligent interaction device provided by the present invention, the knowledge base construction module 1110 is specifically used for:
对文本数据进行语料清洗,通过聚类标注删除文本数据中的无用数据和重复数据,并去除文本数据中的乱码和多余符号,得到预处理后的文本数据。The text data is cleaned, useless and duplicate data in the text data is deleted through clustering annotation, and garbled characters and redundant symbols in the text data are removed to obtain the preprocessed text data.
通过聚类标注去除图像数据中的非法图片,并对图像数据进行标签标注,将图像数据调整为统一大小和分辨率,得到预处理后的图像数据。Illegal images in the image data are removed through clustering annotation, and the image data is labeled and annotated, and the image data is adjusted to a uniform size and resolution to obtain preprocessed image data.
In this embodiment, the knowledge base construction module 1110 is further specifically configured to:
segment the video data into multiple clips and extract multiple frames from those clips, so that features can be extracted from the frames; and
denoise the audio data and split it into multiple segments according to set rules, so that features can be extracted from those segments.
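The patent leaves the segmentation rules open; the sketch below assumes OpenCV for frame sampling and fixed-length windows as a stand-in splitting rule for audio.

```python
import cv2


def sample_frames(video_path, every_n=30):
    """Read a video and keep one frame out of every `every_n`."""
    cap, frames, i = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames


def split_audio(samples, sample_rate, seconds=5.0):
    """Cut a 1-D waveform into fixed-length segments (illustrative rule only)."""
    step = int(sample_rate * seconds)
    return [samples[i:i + step] for i in range(0, len(samples), step)]
```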
In this embodiment, the knowledge base construction module 1110 is further specifically configured to:
extract features from the preprocessed multimodal data through a convolutional neural network, obtaining the attribute features and feature vectors of the data, which are used for similarity analysis and classification of the multimodal data; and
store the multimodal data, according to the similarity-analysis and classification results of those attribute features and feature vectors, in a multimodal knowledge base comprising a relational database and a graph database, and build indexes between data of different modalities.
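The patent does not name a specific network, so the following sketch assumes a torchvision ResNet-18 backbone purely for illustration; the L2-normalized vectors it produces support the similarity analysis described above.

```python
import torch
from torchvision import models, transforms

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # keep the 512-d penultimate features
backbone.eval()

prep = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def feature_vector(pil_image):
    """Return an L2-normalized feature vector for similarity analysis."""
    v = backbone(prep(pil_image).unsqueeze(0)).squeeze(0)
    return v / v.norm()


def similarity(a, b):
    return float(a @ b)             # cosine, since both vectors are normalized
```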
In this embodiment, the information identification module 1120 is specifically configured to:
build a user-portrait knowledge base and a scenario-portrait knowledge base as candidate sets from the multimodal information entered by different users in different scenarios, and generate, through a portrait matching unit, the input representation corresponding to the multimodal information entered by the current user; and
compute the matching score of the current user's multimodal input according to a set retrieval-matching strategy, and feed back the highest-scoring user-portrait matching vector and scenario-portrait matching vector from the candidate sets.
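The scoring rule itself is not disclosed; in the sketch below, cosine similarity over NumPy vectors stands in for the retrieval-matching strategy, and one call per candidate set returns the fed-back portrait vector.

```python
import numpy as np


def best_match(query_vec, candidates):
    """Return (id, vector, score) for the highest-scoring portrait.

    `candidates` maps portrait IDs to L2-normalized vectors.
    """
    q = query_vec / np.linalg.norm(query_vec)
    scores = {pid: float(q @ v) for pid, v in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best], scores[best]


# One call per knowledge base yields the two fed-back matching vectors:
# user_id, user_vec, _ = best_match(input_repr, user_portraits)
# scene_id, scene_vec, _ = best_match(input_repr, scene_portraits)
```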
In this embodiment, the task identification module 1130 is specifically configured to:
introduce prompt-based dialog learning at the decoding layer of the Transformer multi-task model, to guide the model to generate the label results corresponding to the different tasks; and
invoke the multimodal understanding framework to recognize the different tasks and generate their respective recognition results; each recognition result carries the label result of the corresponding task, and the label results are used to determine the interaction strategy corresponding to each task.
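One common reading of "multiple tasks forming a multi-head model" is a shared encoder with one classification head per task; the PyTorch sketch below follows that reading, with all layer sizes and label counts chosen arbitrarily rather than taken from the patent.

```python
import torch
from torch import nn


class MultiTaskTransformer(nn.Module):
    """Shared Transformer encoder with one output head per task."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, label_counts=None):
        super().__init__()
        label_counts = label_counts or {"intent": 12, "emotion": 6, "entity": 20}
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.heads = nn.ModuleDict({            # the "multi-head model"
            task: nn.Linear(d_model, n) for task, n in label_counts.items()
        })

    def forward(self, fused_features):
        """`fused_features`: (batch, seq, d_model) after cross-modal fusion."""
        shared = self.encoder(fused_features).mean(dim=1)   # pool the sequence
        return {task: head(shared) for task, head in self.heads.items()}
```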
In this embodiment, the content interaction module 1140 is specifically configured to:
obtain the user intent of the current user based on the recognition results, and feed back the corresponding first interaction content in response to that intent; and
determine through monitoring whether the first interaction content satisfies the current user's needs and, when it does not, feed back second interaction content in order to establish the current user's actual intent.
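A compact dispatch sketch follows; how satisfaction is monitored is not specified, so a caller-supplied predicate stands in for it, and the strategy names simply mirror the three strategies listed above.

```python
STRATEGIES = {
    "task": "task-oriented interaction",
    "knowledge": "knowledge-based interaction",
    "chitchat": "open-domain chit-chat interaction",
}


def respond(recognition, generate, satisfied):
    """Select a strategy from the intent label; retry once if unsatisfied."""
    strategy = STRATEGIES.get(recognition["intent"], STRATEGIES["chitchat"])
    first = generate(strategy, recognition)        # first interaction content
    if satisfied(first):
        return first
    # Second interaction content, used to pin down the user's real intent.
    return generate(strategy, recognition, clarify=True)
```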
FIG. 12 illustrates the physical structure of an electronic device, which may be a smart terminal; its internal structure may be as shown in FIG. 12. The electronic device includes a processor, a memory, and a network interface connected via a system bus. The processor provides computing and control capabilities. The memory comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides the environment in which they run. The network interface communicates with external terminals over a network. When executed by the processor, the computer program implements a multimodal multi-task intelligent interaction method that includes:
acquiring multimodal data and performing preprocessing and feature extraction on it to construct a multimodal knowledge base;
receiving input multimodal information and identifying the current user and usage scenario from it, so as to obtain the multiple tasks corresponding to them;
feeding the multiple tasks into a Transformer multi-task model so that they form a corresponding multi-head model, fusing the features of the different tasks, and establishing a multimodal understanding framework that outputs recognition results for the tasks; and
selecting an interaction strategy based on the recognition results, generating the corresponding interaction content according to that strategy, and presenting the content visually;
where the multimodal data include text, audio, video, and image data; preprocessing includes data cleaning and clustering-algorithm processing; the multimodal information includes text, audio, video, and image information entered actively or passively by the user; the multiple tasks include interaction intent recognition, emotion recognition, and entity recognition; the interaction strategies include task-oriented interaction, knowledge-based interaction, and open-domain chit-chat; and the interaction content includes text, images, and speech.
Those skilled in the art will understand that the structure shown in FIG. 12 is merely a block diagram of the parts relevant to the present solution and does not limit the electronic devices to which the solution may be applied; a specific electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In another aspect, the present invention further provides a computer storage medium storing a computer program that, when executed by a processor, implements a multimodal multi-task intelligent interaction method that includes:
acquiring multimodal data and performing preprocessing and feature extraction on it to construct a multimodal knowledge base;
receiving input multimodal information and identifying the current user and usage scenario from it, so as to obtain the multiple tasks corresponding to them;
feeding the multiple tasks into a Transformer multi-task model so that they form a corresponding multi-head model, fusing the features of the different tasks, and establishing a multimodal understanding framework that outputs recognition results for the tasks; and
selecting an interaction strategy based on the recognition results, generating the corresponding interaction content according to that strategy, and presenting the content visually;
where the multimodal data include text, audio, video, and image data; preprocessing includes data cleaning and clustering-algorithm processing; the multimodal information includes text, audio, video, and image information entered actively or passively by the user; the multiple tasks include interaction intent recognition, emotion recognition, and entity recognition; the interaction strategies include task-oriented interaction, knowledge-based interaction, and open-domain chit-chat; and the interaction content includes text, images, and speech.
In yet another aspect, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the storage medium and, upon executing them, implements a multimodal multi-task intelligent interaction method that includes:
acquiring multimodal data and performing preprocessing and feature extraction on it to construct a multimodal knowledge base;
receiving input multimodal information and identifying the current user and usage scenario from it, so as to obtain the multiple tasks corresponding to them;
feeding the multiple tasks into a Transformer multi-task model so that they form a corresponding multi-head model, fusing the features of the different tasks, and establishing a multimodal understanding framework that outputs recognition results for the tasks; and
selecting an interaction strategy based on the recognition results, generating the corresponding interaction content according to that strategy, and presenting the content visually;
where the multimodal data include text, audio, video, and image data; preprocessing includes data cleaning and clustering-algorithm processing; the multimodal information includes text, audio, video, and image information entered actively or passively by the user; the multiple tasks include interaction intent recognition, emotion recognition, and entity recognition; the interaction strategies include task-oriented interaction, knowledge-based interaction, and open-domain chit-chat; and the interaction content includes text, images, and speech.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments may be carried out by a computer program instructing the relevant hardware; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of those embodiments. Any reference to memory, storage, databases, or other media in the embodiments provided herein may encompass non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache.
By way of illustration rather than limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily; for brevity, not every possible combination has been described, but any combination of these features that involves no contradiction should be regarded as falling within the scope of this specification.
The embodiments above express only a few implementations of the present invention, and although their descriptions are specific and detailed, they should not be construed as limiting the scope of the patent. Those of ordinary skill in the art may make variations and improvements without departing from the concept of the invention, all of which fall within its protection scope. The protection scope of this patent is therefore defined by the appended claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411206673.2A CN118760359B (en) | 2024-08-30 | 2024-08-30 | Multi-mode multi-task intelligent interaction method and device, electronic equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118760359A true CN118760359A (en) | 2024-10-11 |
| CN118760359B CN118760359B (en) | 2024-12-20 |
Family
ID=92951587
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411206673.2A Active CN118760359B (en) | 2024-08-30 | 2024-08-30 | Multi-mode multi-task intelligent interaction method and device, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118760359B (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020135194A1 (en) * | 2018-12-26 | 2020-07-02 | 深圳Tcl新技术有限公司 | Emotion engine technology-based voice interaction method, smart terminal, and storage medium |
| CN117010907A (en) * | 2023-08-03 | 2023-11-07 | 济南明泉数字商务有限公司 | Multi-mode customer service method and system based on voice and image recognition |
| CN117193524A (en) * | 2023-08-24 | 2023-12-08 | 南京熊猫电子制造有限公司 | Man-machine interaction system and method based on multi-mode feature fusion |
| CN117591636A (en) * | 2023-10-20 | 2024-02-23 | 重庆大学 | Multimodal human-computer interaction system and method based on life support |
| CN118467845A (en) * | 2024-07-09 | 2024-08-09 | 卡奥斯工业智能研究院(青岛)有限公司 | Method for constructing intelligent interactive service system, website intelligent interactive method and device |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119622039A (en) * | 2024-11-27 | 2025-03-14 | 中国数字文化集团有限公司 | A deployment method and application terminal based on big data model in digital culture field |
| CN119475251A (en) * | 2025-01-14 | 2025-02-18 | 深圳市玥芯通科技有限公司 | Intelligent scene recognition method and device for intelligent interactive control unit |
| CN119475251B (en) * | 2025-01-14 | 2025-10-21 | 深圳市玥芯通科技有限公司 | Intelligent scene recognition method and device for intelligent interactive control unit |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118760359B (en) | 2024-12-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |