CN114757209B

CN114757209B - Man-machine interaction instruction analysis method and device based on multi-mode semantic role recognition

Info

Publication number: CN114757209B
Application number: CN202210659318.5A
Authority: CN
Inventors: 张梅山; 卢攀忠; 林智超; 孙越恒
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2022-06-13
Filing date: 2022-06-13
Publication date: 2022-11-11
Anticipated expiration: 2042-06-13
Also published as: CN114757209A

Abstract

The invention provides a human-computer interaction instruction parsing method and device based on multimodal semantic role recognition, and relates to the technical field of semantic analysis in natural language processing. Including: constructing a complete set of instruction semantic role annotation paradigm according to the characteristics of human-computer interaction instructions; according to the instruction semantic role annotation paradigm, combined with image acquisition, the single-modal form of the semantic role annotation model is extended to visual text multimodality The multi-modal form of visual text of the semantic role annotation model is trained and learned, and the multi-modal semantic role recognition is completed to perform semantic analysis of human-computer interaction instructions. The invention innovatively attempts to use the paradigm of multi-modal semantic role labeling to perform semantic analysis on human-computer interaction instructions, so as to convert the original machine-incomprehensible instructions into machine-understandable semantic structured output, which is more convenient and safe. , Quickly execute the user's intent.

Description

Human-computer interaction instruction analysis method and device based on multi-modal semantic role recognition

技术领域technical field

本发明涉及自然语言处理中的语义分析技术领域，特别是指一种基于多模态语义角色识别的人机交互指令解析方法及装置。The invention relates to the technical field of semantic analysis in natural language processing, in particular to a method and device for man-machine interaction instruction analysis based on multi-modal semantic role recognition.

背景技术Background technique

语义角色标注是一种浅层语义分析技术，用于抽取出句子中所蕴含的谓词-论元结构。其中，谓词是一条语句中能够引发一个语义事件的核心词，论元则是参与该语义事件的角色，包括施事者、受事者等。总的来说，语义角色标注技术的核心在于能够让机器理解一句话中“谁都谁做了什么，在什么时间和地点”。目前，已经有着许多应用尝试将语义角色标注作为其技术链路中的关键一环，如知识问答、对话机器人、机器翻译等。Semantic role labeling is a shallow semantic analysis technique used to extract the predicate-argument structure implied in a sentence. Among them, the predicate is the core word in a sentence that can trigger a semantic event, and the argument is the role that participates in the semantic event, including the agent and the recipient. In general, the core of semantic role labeling technology is to enable machines to understand "who did what, when and where" in a sentence. At present, there are already many applications that try to use semantic role labeling as a key link in their technical links, such as knowledge question answering, dialogue robots, machine translation, etc.

随着技术的发展，人机交互技术已经逐渐成为用户控制无人设备（如机器人、无人机）的一种重要方式。通过语音下达指令，使得无人设备理解操控者的意图，从而执行相应的命令，可以解放操控者的双手，更加方便、安全、快捷地操控无人设备。然而，现有的指令解析技术发展有限，不能做到有针对性的从指令中解析出机器可理解的语义结构。本发明计划利用语义角色标注技术本身的优势，做到对控制指令的意图语义高精度的解析，使得无人设备更好的服务于用户，执行更高抽象难度的作业。With the development of technology, human-computer interaction technology has gradually become an important way for users to control unmanned devices (such as robots and drones). Commands are issued through voice, so that the unmanned device understands the operator's intentions and executes the corresponding command, which can free the hands of the operator and control the unmanned device more conveniently, safely and quickly. However, the development of existing instruction parsing technologies is limited, and it is impossible to parse out machine-understandable semantic structures from instructions in a targeted manner. The present invention plans to use the advantages of the semantic role labeling technology itself to achieve high-precision analysis of the intent and semantics of control instructions, so that unmanned devices can better serve users and perform more abstract and difficult operations.

目前语义角色标注整体流程主要分为两种，一种是基于流水线的方式，利用序列标注方法识别出句子中的谓词，接着再识别出句子中的语义角色（论元），这会导致错误传播的问题严重。另一种则是构建语义图的方法来同时抽取出谓词和其对应的语义角色，首先通过枚举句子所有可能的谓词和论元候选片段作为图中节点，接着利用谓词片段和语义角色片段之间的语义角色关系作为图中的边，最终通过精确的解码构成的语义图得到结构化输出。目前的无人设备大都具有视觉和语言两种感知，然而现有的语义角色标注方法大都面向单一文本设定下，忽略了图像信息和文本信息之间重要的互补关系。At present, the overall process of semantic role labeling is mainly divided into two types. One is based on the pipeline method, using the sequence labeling method to identify the predicates in the sentence, and then identify the semantic role (argument) in the sentence, which will lead to error propagation. The problem is serious. The other is to construct a semantic graph to simultaneously extract predicates and their corresponding semantic roles. First, enumerate all possible predicate and argument candidate fragments of a sentence as nodes in the graph, and then use the relationship between predicate fragments and semantic role fragments to The semantic role relationship among them is used as the edges in the graph, and finally the semantic graph formed by precise decoding can get a structured output. Most of the current unmanned devices have both vision and language perception. However, most of the existing semantic role labeling methods are oriented to a single text setting, ignoring the important complementary relationship between image information and text information.

目前语义角色标注数据集的标注范式大都面向通用领域，在特殊领域如无人设备指令操控指令下仍有较大的空白。At present, the annotation paradigms of semantic role annotation datasets are mostly oriented to the general field, and there are still large gaps in special fields such as unmanned equipment command and control instructions.

发明内容Contents of the invention

针对现有技术中无人设备指令操控指令下仍有较大的空白的问题，本发明提出了一种基于多模态语义角色识别的人机交互指令解析方法及装置。Aiming at the problem that there is still a large gap under the command and control command of unmanned equipment in the prior art, the present invention proposes a method and device for man-machine interaction command parsing based on multi-modal semantic role recognition.

为解决上述技术问题，本发明提供如下技术方案：In order to solve the above technical problems, the present invention provides the following technical solutions:

一方面，提供了一种基于多模态语义角色识别的人机交互指令解析方法，该方法应用于电子设备，包括以下步骤：On the one hand, a method for parsing human-computer interaction instructions based on multimodal semantic role recognition is provided, and the method is applied to electronic devices, including the following steps:

S1：根据人机交互指令的特性，构建指令语义角色标注范式；S1: According to the characteristics of human-computer interaction instructions, construct an instruction semantic role annotation paradigm;

S2：根据所述指令语义角色标注范式，结合图像采集，将语义角色标注模型的单模态形式扩展为视觉文本多模态形式；S2: According to the semantic role labeling paradigm of the instruction, combined with image collection, the single-modal form of the semantic role labeling model is extended to a multi-modal form of visual text;

S3：对语义角色标注模型的视觉文本多模态形式进行训练学习，完成多模态语义角色识别对人机交互指令进行语义解析。S3: Train and learn the visual text multimodal form of the semantic role labeling model, and complete the multimodal semantic role recognition and semantic analysis of human-computer interaction instructions.

可选地，步骤S1中，根据人机交互指令的特性，构建指令语义角色标注范式，包括：Optionally, in step S1, according to the characteristics of the human-computer interaction instruction, construct an instruction semantic role labeling paradigm, including:

S11：采用VerbAtlas语义角色标注数据的标注方式作为标注基准；S11: Use the labeling method of VerbAtlas semantic role labeling data as the labeling benchmark;

S12：对预存的中文语义角色标注范式扩展和修改，使扩展和修改后的中文语义角色标注范式适用于人机交互指令的语义解析，获得指令语义角色标注范式。S12: Extend and modify the pre-stored Chinese semantic role labeling paradigm, so that the extended and modified Chinese semantic role labeling paradigm is applicable to the semantic analysis of human-computer interaction instructions, and obtain the instruction semantic role labeling paradigm.

可选地，步骤S2中，根据所述指令语义角色标注范式，结合图像采集，将语义角色标注模型的单模态形式扩展为视觉文本双模态形式，包括：Optionally, in step S2, according to the instruction semantic role labeling paradigm, combined with image collection, the single-modal form of the semantic role labeling model is extended to a visual-text dual-modal form, including:

S21：根据所述指令语义角色标注范式，通过无人系统采集图像，采用Faster-RCNN获得序列目标区域，将所述序列目标区域组成图像区域序列，对所述图像序列特征进行提取；S21: According to the instruction semantic role annotation paradigm, collect images through an unmanned system, use Faster-RCNN to obtain a sequence target area, form the sequence target area into an image area sequence, and extract features of the image sequence;

S22：通过提取的图像序列特征，对语义文本端的语义角色进行辅助识别，将语义角色标注模型的单模态形式扩展为视觉文本双模态形式。S22: Use the extracted image sequence features to assist in identifying the semantic role on the semantic text side, and extend the single-modal form of the semantic role labeling model to a dual-modal form of visual text.

可选地，步骤S3中，对语义角色标注模型的视觉文本多模态形式进行训练学习，完成多模态语义角色识别对人机交互指令进行语义解析，包括：Optionally, in step S3, train and learn the visual text multimodal form of the semantic role labeling model, complete multimodal semantic role recognition and perform semantic analysis on human-computer interaction instructions, including:

S31：根据语义角色标注模型的视觉文本多模态形式构建预训练模型；S31: Construct a pre-training model according to the visual text multimodal form of the semantic role labeling model;

S32：所述预训练模型的输入的指令

；利用BERT预训练模型对所述指令I进行编码，获得指令I中每个词对应的词向量序列

； S32: Instructions for inputting the pre-trained model

; Utilize the BERT pre-training model to encode the instruction I, and obtain the word vector sequence corresponding to each word in the instruction I

;

S33：枚举出指令I中所有的跨度

，其中

，获得每个跨度的特征向量；其中，所述跨度的大小均为预设值； S33: enumerate all spans in instruction I

,in

, to obtain the feature vector of each span; wherein, the size of the span is a preset value;

S34：根据所述每个跨度的特征向量，生成语义图中谓词节点和语义角色节点对应的候选向量；S34: According to the feature vector of each span, generate candidate vectors corresponding to the predicate node and the semantic role node in the semantic graph;

S35：引入损失函数对模型的训练损失进行完善，完成多模态语义角色识别对人机交互指令进行语义解析。S35: Introduce a loss function to improve the training loss of the model, and complete multi-modal semantic role recognition and semantic analysis of human-computer interaction instructions.

可选地，S34中，采用两个不同的MLP层分别得到谓词候选向量

以及语义角色候选向量

，其中：

；

。 Optionally, in S34, two different MLP layers are used to obtain predicate candidate vectors respectively

and semantic role candidate vectors

,in:

;

.

可选地，S35中，引入损失函数对模型的训练损失进行完善，包括：Optionally, in S35, a loss function is introduced to improve the training loss of the model, including:

构建语义角色标注损失函数，判断模型预测的谓词及论元结构的完整性；Construct a semantic role labeling loss function to judge the integrity of the predicate and argument structure predicted by the model;

其中，包括一个MLP 层得分层以及一个Biaffine得分层；所述MLP 层得分层用于判断当前谓词节点的语义框架，所述Biaffine得分层用于对句子中每个谓词

、语义角色

以及两者关系

的三元组

进行打分；交叉熵来计算每个三元组的损失，所述语义角色标注损失函数如下述公式（1）所示： Wherein, comprise a MLP layer score layer and a Biaffine score layer; The MLP layer score layer is used to judge the semantic framework of the current predicate node, and the Biaffine score layer is used to each predicate in the sentence

, semantic role

and the relationship between the two

triplet of

Scoring; cross-entropy to calculate the loss of each triplet, the semantic role labeling loss function is shown in the following formula (1):

构建模态匹配函数，用于图像和文本跨模态特征对的模态匹配，该函数的标签定义为如果该语义角色对应的片段中包含该目标区域对应的物体，则输出标签为1，否则标签为0；通过多任务学习的范式，定义如下述公式（2）的模态匹配函数的损失函数：Construct a modal matching function for modal matching of image and text cross-modal feature pairs. The label of this function is defined as if the segment corresponding to the semantic role contains the object corresponding to the target area, then the output label is 1, otherwise The label is 0; through the paradigm of multi-task learning, define the loss function of the mode matching function as shown in the following formula (2):

一方面，提供了一种基于多模态语义角色识别的人机交互指令解析装置，该装置应用于电子设备，该装置包括：On the one hand, a multi-modal semantic role recognition-based human-computer interaction instruction parsing device is provided, the device is applied to electronic equipment, and the device includes:

指令语义角色标注范式构建模块，用于根据人机交互指令的特性，构建指令语义角色标注范式；Instruction Semantic Role Labeling Paradigm Building Module, which is used to construct instruction semantic role labeling paradigm according to the characteristics of human-computer interaction instructions;

多模态构建模块，用于根据所述指令语义角色标注范式，结合图像采集，将语义角色标注模型的单模态形式扩展为视觉文本多模态形式；A multimodal building block, used to expand the single-modal form of the semantic role labeling model to a multimodal form of visual text according to the instruction semantic role labeling paradigm and in combination with image acquisition;

模型训练模块，用于对语义角色标注模型的视觉文本多模态形式进行训练学习，完成多模态语义角色识别对人机交互指令进行语义解析。The model training module is used to train and learn the visual text multimodal form of the semantic role labeling model, and complete the multimodal semantic role recognition and semantic analysis of human-computer interaction instructions.

可选地，9、根据权利要求8所述的装置，其特征在于，所述指令语义角色标注范式构建模块，用于根采用VerbAtlas语义角色标注数据的标注方式作为标注基准；Optionally, 9. The device according to claim 8, wherein the instruction semantic role labeling paradigm building module is used to use the labeling method of VerbAtlas semantic role labeling data as the labeling reference;

对预存的中文语义角色标注范式扩展和修改，使扩展和修改后的中文语义角色标注范式适用于人机交互指令的语义解析，获得指令语义角色标注范式。The pre-stored Chinese semantic role labeling paradigm is extended and modified, so that the extended and modified Chinese semantic role labeling paradigm is suitable for the semantic analysis of human-computer interaction instructions, and the instruction semantic role labeling paradigm is obtained.

可选地，多模态构建模块，用于根据所述指令语义角色标注范式，通过无人系统采集图像，采用Faster-RCNN获得序列目标区域，将所述序列目标区域组成图像区域序列，对所述图像序列特征进行提取；Optionally, the multi-modal construction module is used to collect images through an unmanned system according to the instruction semantic role labeling paradigm, use Faster-RCNN to obtain sequence target areas, and form the sequence target areas into an image area sequence, for all Extract the features of the above image sequence;

通过提取的图像序列特征，对语义文本端的语义角色进行辅助识别，将语义角色标注模型的单模态形式扩展为视觉文本双模态形式。Through the extracted image sequence features, the semantic role on the semantic text side is assisted to identify, and the single-modal form of the semantic role labeling model is extended to a visual-text dual-modal form.

一方面，提供了一种电子设备，所述电子设备包括处理器和存储器，所述存储器中存储有至少一条指令，所述至少一条指令由所述处理器加载并执行以实现上述一种基于多模态语义角色识别的人机交互指令解析方法。In one aspect, an electronic device is provided, the electronic device includes a processor and a memory, the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above multi-based Human-computer interaction instruction parsing method for modal semantic role recognition.

一方面，提供了一种计算机可读存储介质，所述存储介质中存储有至少一条指令，所述至少一条指令由处理器加载并执行以实现上述一种基于多模态语义角色识别的人机交互指令解析方法。On the one hand, a computer-readable storage medium is provided, and at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to realize the above-mentioned human-computer interaction based on multimodal semantic role recognition. Interactive command parsing method.

本发明实施例的上述技术方案至少具有如下有益效果：The above-mentioned technical solutions of the embodiments of the present invention have at least the following beneficial effects:

上述方案中，本发明创新性的提出一种集成话语篇章的语义依存图表示方案，将句子语义依存图扩展到整个篇章，充分考虑了对话场景下话语语义信息不完整的特性。本发明首次针对对话文本提出融合话语内部和话语之间的一体化语义依存图联合分析模型，采用端到端的建模方式将句子语义和篇章语义连接在一起。另外本发明所采用的基于知识蒸馏的教师-学生网络也能够满足对话系统实际应用中对于效率和延迟的高要求。Among the above solutions, the present invention innovatively proposes a semantic dependency graph representation scheme of integrated discourse chapters, which extends the sentence semantic dependency graph to the entire discourse, fully considering the incompleteness of discourse semantic information in dialogue scenarios. For the first time, the present invention proposes an integrated semantic dependency graph joint analysis model that integrates within discourses and between discourses, and uses an end-to-end modeling method to connect sentence semantics and discourse semantics together. In addition, the teacher-student network based on knowledge distillation used in the present invention can also meet the high requirements for efficiency and delay in the practical application of dialogue systems.

附图说明Description of drawings

为了更清楚地说明本发明实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings that need to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention. For those skilled in the art, other drawings can also be obtained based on these drawings without creative effort.

图1是本发明实施例提供的一种基于多模态语义角色识别的人机交互指令解析方法的流程图；Fig. 1 is a flowchart of a human-computer interaction instruction parsing method based on multimodal semantic role recognition provided by an embodiment of the present invention;

图2是本发明实施例提供的一种基于多模态语义角色识别的人机交互指令解析方法的流程图；Fig. 2 is a flow chart of a human-computer interaction instruction analysis method based on multimodal semantic role recognition provided by an embodiment of the present invention;

图3是本发明实施例提供的一种基于多模态语义角色识别的人机交互指令解析方法的多模态语义角色标注模型图；Fig. 3 is a multimodal semantic role labeling model diagram of a human-computer interaction instruction analysis method based on multimodal semantic role recognition provided by an embodiment of the present invention;

图4是本发明实施例提供的一种基于多模态语义角色识别的人机交互指令解析方法的多模态语义角色结构化输出图；Fig. 4 is a multimodal semantic role structured output diagram of a human-computer interaction instruction analysis method based on multimodal semantic role recognition provided by an embodiment of the present invention;

图5是本发明实施例提供的一种基于多模态语义角色识别的人机交互指令解析方法的多模态语义角色标注实现人机交互示例图；Fig. 5 is an example diagram of human-computer interaction realized by multi-modal semantic role labeling in a human-computer interaction instruction analysis method based on multi-modal semantic role recognition provided by an embodiment of the present invention;

图6是本发明实施例提供的一种基于多模态语义角色识别的人机交互指令解析装置框图；Fig. 6 is a block diagram of a human-computer interaction instruction parsing device based on multimodal semantic role recognition provided by an embodiment of the present invention;

图7是本发明实施例提供的一种电子设备的结构示意图。Fig. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明要解决的技术问题、技术方案和优点更加清楚，下面将结合附图及具体实施例进行详细描述。In order to make the technical problems, technical solutions and advantages to be solved by the present invention clearer, the following will describe in detail with reference to the drawings and specific embodiments.

本发明实施例提供了一种基于多模态语义角色识别的人机交互指令解析方法，该方法可以由电子设备实现，该电子设备可以是终端或服务器。如图1所示的基于多模态语义角色识别的人机交互指令解析方法流程图，该方法的处理流程可以包括如下的步骤：An embodiment of the present invention provides a method for parsing human-computer interaction instructions based on multimodal semantic role recognition. The method can be implemented by an electronic device, and the electronic device can be a terminal or a server. As shown in Figure 1, the flow chart of the human-computer interaction instruction parsing method based on multi-modal semantic role recognition, the processing flow of the method may include the following steps:

S101：根据人机交互指令的特性，构建指令语义角色标注范式；S101: According to the characteristics of human-computer interaction instructions, construct an instruction semantic role labeling paradigm;

S102：根据所述指令语义角色标注范式，结合图像采集，将语义角色标注模型的单模态形式扩展为视觉文本多模态形式；S102: According to the semantic role labeling paradigm of the instruction, combined with image collection, the single-modal form of the semantic role labeling model is extended to a multi-modal form of visual text;

S103：对语义角色标注模型的视觉文本多模态形式进行训练学习，完成多模态语义角色识别对人机交互指令进行语义解析。S103: Carry out training and learning on the visual text multimodal form of the semantic role labeling model, and complete multimodal semantic role recognition and semantic analysis of human-computer interaction instructions.

可选地，步骤S101中，根据人机交互指令的特性，构建指令语义角色标注范式，包括：Optionally, in step S101, according to the characteristics of the human-computer interaction instruction, construct an instruction semantic role annotation paradigm, including:

S111：采用VerbAtlas语义角色标注数据的标注方式作为标注基准；S111: Use the labeling method of the VerbAtlas semantic role labeling data as the labeling benchmark;

S112：对预存的中文语义角色标注范式扩展和修改，使扩展和修改后的中文语义角色标注范式适用于人机交互指令的语义解析，获得指令语义角色标注范式。S112: Extend and modify the pre-stored Chinese semantic role labeling paradigm, so that the extended and modified Chinese semantic role labeling paradigm is applicable to the semantic analysis of human-computer interaction instructions, and obtain the instruction semantic role labeling paradigm.

可选地，步骤S102中，根据所述指令语义角色标注范式，结合图像采集，将语义角色标注模型的单模态形式扩展为视觉文本双模态形式，包括：Optionally, in step S102, according to the instruction semantic role labeling paradigm, combined with image collection, the single-modal form of the semantic role labeling model is extended to a visual-text dual-modal form, including:

S121：根据所述指令语义角色标注范式，通过无人系统采集图像，采用Faster-RCNN获得序列目标区域，将所述序列目标区域组成图像区域序列，对所述图像序列特征进行提取；S121: According to the instruction semantic role labeling paradigm, collect images through an unmanned system, use Faster-RCNN to obtain sequence target areas, form the sequence target areas into an image area sequence, and extract features of the image sequence;

S122：通过提取的图像序列特征，对语义文本端的语义角色进行辅助识别，将语义角色标注模型的单模态形式扩展为视觉文本双模态形式。S122: Use the extracted image sequence features to assist in identifying the semantic role on the semantic text side, and extend the single-modal form of the semantic role labeling model to a dual-modal form of visual text.

可选地，步骤S103中，对语义角色标注模型的视觉文本多模态形式进行训练学习，完成多模态语义角色识别对人机交互指令进行语义解析，包括：Optionally, in step S103, training and learning the visual text multimodal form of the semantic role labeling model, completing multimodal semantic role recognition and semantic analysis of human-computer interaction instructions, including:

S131：根据语义角色标注模型的视觉文本多模态形式构建预训练模型；S131: Construct a pre-training model according to the visual text multimodal form of the semantic role labeling model;

S132：所述预训练模型的输入的指令

； S132: Instructions for inputting the pre-trained model

;

S133：枚举出指令I中所有的跨度

，其中

，获得每个跨度的特征向量；其中，所述跨度的大小均为预设值； S133: enumerate all spans in instruction I

,in

S134：根据所述每个跨度的特征向量，生成语义图中谓词节点和语义角色节点对应的候选向量；S134: According to the feature vector of each span, generate candidate vectors corresponding to the predicate node and the semantic role node in the semantic graph;

S135：引入损失函数对模型的训练损失进行完善，完成多模态语义角色识别对人机交互指令进行语义解析。S135: Introduce a loss function to improve the training loss of the model, and complete multi-modal semantic role recognition and semantic analysis of human-computer interaction instructions.

可选地，S134中，采用两个不同的MLP层分别得到谓词候选向量

以及语义角色候选向量

，其中：

；

。 Optionally, in S134, two different MLP layers are used to obtain predicate candidate vectors respectively

and semantic role candidate vectors

,in:

;

.

可选地，S135中，引入损失函数对模型的训练损失进行完善，包括：Optionally, in S135, a loss function is introduced to improve the training loss of the model, including:

、语义角色

以及两者关系

的三元组

, semantic role

and the relationship between the two

triplet of

本发明实施例中，创新性的尝试引入图像信息到现有的单模态语义角色标注模型中，从而利用图像信息辅助语义角色标注模型对输入语句进行语义分析。尝试用多模态语义角色标注的范式来对人机交互指令进行语义解析，从而将原本机器无法理解的指令转换成机器可理解的语义结构化输出，做到更加方便、安全、快捷的执行用户的意图。In the embodiment of the present invention, an innovative attempt is made to introduce image information into the existing single-modal semantic role labeling model, so that image information is used to assist the semantic role labeling model to perform semantic analysis on input sentences. Try to use the paradigm of multi-modal semantic role labeling to semantically analyze human-computer interaction instructions, so as to convert instructions that cannot be understood by the machine into semantically structured output that the machine can understand, so as to achieve more convenient, safe and fast execution of the user intention of.

本发明实施例提供了一种基于多模态语义角色识别的人机交互指令解析方法，该方法可以由电子设备实现，该电子设备可以是终端或服务器。如图2所示的基于多模态语义角色识别的人机交互指令解析方法流程图，该方法的处理流程可以包括如下的步骤：An embodiment of the present invention provides a method for parsing human-computer interaction instructions based on multimodal semantic role recognition. The method can be implemented by an electronic device, and the electronic device can be a terminal or a server. As shown in Figure 2, the flow chart of the human-computer interaction instruction analysis method based on multi-modal semantic role recognition, the processing flow of the method may include the following steps:

S201：采用VerbAtlas语义角色标注数据的标注方式作为标注基准。S201: Using the labeling method of the VerbAtlas semantic role labeling data as a labeling benchmark.

一种可行的实施方式中，本发明首先针对人机交互指令，基于其本身的特点构建一套完善的指令语义角色标注范式。以往的语义角色标注范式大都面向通用领域（如新闻）等，其语义角色的设定在于更好的通用性。但是在人机交互领域，每种类型的指令其语义角色都有着其特殊性，这是通用领域的语义角色不能涵盖的。In a feasible implementation manner, the present invention first aims at human-computer interaction instructions, and builds a complete set of instruction semantic role labeling paradigms based on its own characteristics. Most of the previous semantic role annotation paradigms are oriented to general fields (such as news), etc., and the setting of their semantic roles lies in better generality. But in the field of human-computer interaction, each type of instruction has its own particularity in its semantic role, which cannot be covered by the semantic role in the general field.

S202：对预存的中文语义角色标注范式扩展和修改，使扩展和修改后的中文语义角色标注范式适用于人机交互指令的语义解析，获得指令语义角色标注范式。S202: Extend and modify the pre-stored Chinese semantic role labeling paradigm, so that the extended and modified Chinese semantic role labeling paradigm is applicable to the semantic analysis of human-computer interaction instructions, and obtain the instruction semantic role labeling paradigm.

一种可行的实施方式中，本发明拟扩展和修改现有的中文语义角色标注范式，使其适用于人机交互指令的语义解析。In a feasible implementation, the present invention intends to expand and modify the existing Chinese semantic role labeling paradigm, making it suitable for semantic analysis of human-computer interaction instructions.

初步计划采用VerbAtlas语义角色标注数据标注方式作为本发明的标注基准，其主要基于以下两种考虑：（1）该标注基准在谓词识别上加入语义框架的概念，使得每个谓词的具体语义更加精确，从而缓解了谓词由于语境不同导致的歧义问题。（2）该标注基准面向多语言场景设计，可以方便本发明设计面向中文指令的标注范式。表1展示了本发明初步设定的语义框架即语义角色。涵盖了简单的如前进、移动等的位移指令，以及高难度的如拿取、打开等操作指令；语义角色则包含了参与该语义事件的操控设备、操控手段以及指令执行的时间、地点等。The preliminary plan is to use the VerbAtlas semantic role labeling data labeling method as the labeling benchmark of the present invention, which is mainly based on the following two considerations: (1) The labeling benchmark adds the concept of semantic framework to the predicate recognition, making the specific semantics of each predicate more accurate , thereby alleviating the ambiguity problem caused by different contexts of predicates. (2) The labeling benchmark is designed for multilingual scenes, which can facilitate the design of the Chinese instruction-oriented labeling paradigm in the present invention. Table 1 shows the semantic framework initially set by the present invention, that is, semantic roles. It covers simple displacement instructions such as moving forward and moving, as well as difficult operation instructions such as taking and opening; the semantic role includes the control equipment involved in the semantic event, the control means, and the time and place when the command is executed.

S203：根据所述指令语义角色标注范式，通过无人系统采集图像，采用Faster-RCNN获得序列目标区域，将所述序列目标区域组成图像区域序列，对所述图像序列特征进行提取；S203: According to the instruction semantic role labeling paradigm, collect images through an unmanned system, use Faster-RCNN to obtain sequence target areas, form the sequence target areas into an image area sequence, and extract features of the image sequence;

S204：通过提取的图像序列特征，对语义文本端的语义角色进行辅助识别，将语义角色标注模型的单模态形式扩展为视觉文本双模态形式。S204: Use the extracted image sequence features to assist in identifying the semantic role on the semantic text side, and extend the single-modal form of the semantic role labeling model to a dual-modal form of visual text.

一种可行的实施方式中，在模型架构上，本发明采用如图3所示的双塔模型来解决多模态语义角色任务之间图文特征的融合。整体架构主要分为三个部分，图像端的图像序列特征提取，语言端的语义图特征提取，以及最后用于特征融合的训练函数。In a feasible implementation, in terms of model architecture, the present invention adopts the twin-tower model as shown in FIG. 3 to solve the fusion of graphic and text features between multi-modal semantic role tasks. The overall architecture is mainly divided into three parts, the image sequence feature extraction on the image side, the semantic map feature extraction on the language side, and finally the training function for feature fusion.

一种可行的实施方式中，图像序列特征：对于无人系统观察到的图像

，本发明采用现有的Faster-RCNN获得一序列目标区域，将其组成图像区域序列

，并获得区域序列对应的特征序列

。对于特征序列中的区域特征

，本发明利用一层MLP层，对做进一步的特征抽象，得到最终的图像特征

: In a feasible implementation, image sequence features: for images observed by unmanned systems

, the present invention uses the existing Faster-RCNN to obtain a sequence of target areas, which are composed of image area sequences

, and obtain the feature sequence corresponding to the region sequence

. For the region features in the feature sequence

, the present invention uses a layer of MLP layer to further abstract the features to obtain the final image features

:

S205：根据语义角色标注模型的视觉文本多模态形式构建预训练模型；S205: Construct a pre-training model according to the visual-text multimodal form of the semantic role labeling model;

S206：所述预训练模型的输入的指令

； S206: Instructions for inputting the pre-trained model

;

S207：枚举出指令I中所有的跨度

，其中

，获得每个跨度的特征向量；其中，所述跨度的大小均为预设值； S207: enumerate all spans in instruction I

,in

S208：根据所述每个跨度的特征向量，生成语义图中谓词节点和语义角色节点对应的候选向量。S208: Generate candidate vectors corresponding to predicate nodes and semantic role nodes in the semantic graph according to the feature vectors of each span.

一种可行的实施方式中，文本序列特征：本发明采用目前端到端语义角色标注经典的语义图神经网络构建思路来获得句子中隐含的谓词和其对应的论元。对于输入的指令

，利用BERT预训练模型对其进行编码，获得指令中每个词对应的词向量序列

。接着枚举出指令中所有的跨度

，其中

，由句子中多个词组成的。每个跨度的最大长度和最小长度都是预先设定好的。对于每个跨度

，将其特征向量表示为： In a feasible implementation, text sequence features: the present invention adopts the current classic semantic graph neural network construction idea of end-to-end semantic role labeling to obtain hidden predicates and their corresponding arguments in sentences. For the input command

, use the BERT pre-training model to encode it, and obtain the word vector sequence corresponding to each word in the instruction

. Then enumerate all spans in the instruction

,in

, consisting of multiple words in the sentence. The maximum and minimum lengths of each span are preset. for each span

, expressing its eigenvectors as:

其中

，

表示每个跨度起始单词和结尾单词所对应的隐藏层表示，

表示每个跨度对应的长度特征，

则是利用Self-Attention机制，计算对于跨度内每个词的注意力，并根据注意力加权平均得到的向量。 in

,

denote the hidden layer representations corresponding to each span start word and end word,

Represents the length feature corresponding to each span,

It uses the Self-Attention mechanism to calculate the attention for each word in the span, and calculates the vector obtained by weighting the attention.

对于每个跨度对应的表示

，需要生成语义图中谓词节点和语义角色节点对应的候选向量，因此本发明采用两个不同的MLP层分别得到谓词候选向量以及语义角色候选向量

以及

：For each span the corresponding representation

, it is necessary to generate candidate vectors corresponding to predicate nodes and semantic role nodes in the semantic graph, so the present invention uses two different MLP layers to obtain predicate candidate vectors and semantic role candidate vectors respectively

as well as

:

S209：引入损失函数对模型的训练损失进行完善，完成多模态语义角色识别对人机交互指令进行语义解析。S209: Introduce a loss function to improve the training loss of the model, and complete multi-modal semantic role recognition and semantic analysis of human-computer interaction instructions.

一种可行的实施方式中，采用两个不同的MLP层分别得到谓词候选向量

以及语义角色候选向量

，其中：

；

。 In a feasible implementation, two different MLP layers are used to obtain predicate candidate vectors respectively

and semantic role candidate vectors

,in:

;

.

其中，MLP^P 是用于获取谓词表示的多层前馈神经网络，MLP^R是用于获取语义角色表示的多层前馈神经网络。Among them, MLP ^P is a multi-layer feed-forward neural network for obtaining predicate representations, and MLP ^R is a multi-layer feed-forward neural network for obtaining semantic role representations.

一种可行的实施方式中，引入损失函数对模型的训练损失进行完善，包括：In a feasible implementation, a loss function is introduced to improve the training loss of the model, including:

、语义角色

以及两者关系

的三元组

, semantic role

and the relationship between the two

triplet of

一种可行的实施方式中，在训练损失上，本发明定义了两种损失函数用于训练模型。第一种判断模型预测的谓词、论元结构的完整性的语义角色标注损失函数，其包括一个 MLP层得分层来判断当前谓词节点的语义框架，以及一个Biaffine得分层来对句子中每个谓词、语义角色以及两者关系的三元组

进行打分，具体定义如下： In a feasible implementation manner, in terms of training loss, the present invention defines two loss functions for training the model. The first is the semantic role labeling loss function that judges the integrity of the predicate and argument structure predicted by the model, which includes an MLP layer score layer to judge the semantic framework of the current predicate node, and a Biaffine score layer to evaluate each A triplet of predicates, semantic roles, and their relations

To score, the specific definition is as follows:

其中，

表示用于获取语义框架类别得分的多层前馈神经网络；

是 Biaffine权重矩阵，

是线性权重矩阵，

是偏置项。获得每个关系对应的评分后，本发明采用交叉熵来计算每个三元组的损失：in,

represents a multi-layer feed-forward neural network for obtaining semantic frame category scores;

is the Biaffine weight matrix,

is the linear weight matrix,

is a bias item. After obtaining the score corresponding to each relationship, the present invention uses cross-entropy to calculate the loss of each triplet:

其中

和

表示对应的的语义框架以及语义角色集合。 in

and

Represents the corresponding semantic framework and semantic role set.

一种可行的实施方式中，第二种则是用于图像和文本跨模态特征对的模态匹配函数，该函数的标签本发明定义为如果该语义角色对应的片段中包含该目标区域对应的物体，则输出标签为1，否则标签为0。本发明同样利用一个Biaffine层计算对该图像区域特征、语义角色以及两者关系的三元组

进行打分， In a feasible implementation, the second type is a modal matching function for image and text cross-modal feature pairs, and the label of this function is defined in the present invention as if the segment corresponding to the semantic role contains the corresponding object, the output label is 1, otherwise the label is 0. The present invention also utilizes a Biaffine layer to calculate triplets of the image region features, semantic roles and the relationship between the two

to score,

同理，其对应的损失函数为：Similarly, the corresponding loss function is:

最终的损失函数，本发明采用多任务学习的范式进行定义：The final loss function is defined in the present invention using the paradigm of multi-task learning:

其中

用于调节两种损失函数在模型训练中所发挥的权重。 in

It is used to adjust the weight of the two loss functions in model training.

本发明实施例中，多模态语义角色标注的目标是给定一条输入指令，得出该指令的语义结构化输出，使得机器能够理解并执行。多模态语义角色识别的结构化输出结果如图4所示。In the embodiment of the present invention, the goal of multi-modal semantic role labeling is to give an input instruction and obtain a semantically structured output of the instruction, so that the machine can understand and execute it. The structured output results of multimodal semantic role recognition are shown in Figure 4.

本发明实施例中，图5展示了本发明的多模态语义角色标注模型在人机交互指令上的解析实例。对于用户下达的指令，本发明的多模态语义角色解析系统识别出其中的谓词，对应的语义框架，以及属于该语义框架的语义角色，将其组织为机器可识别的结构化输出。In the embodiment of the present invention, FIG. 5 shows an example of parsing human-computer interaction instructions by the multimodal semantic role labeling model of the present invention. For the instruction issued by the user, the multi-modal semantic role analysis system of the present invention identifies the predicates, the corresponding semantic framework, and the semantic roles belonging to the semantic framework, and organizes them into a machine-recognizable structured output.

本发明实施例中，针对于目前现有的语义角色标注模型大都基于单模态设定，创新性的尝试引入图像信息到现有的单模态语义角色标注模型中，从而利用图像信息辅助语义角色标注模型对输入语句进行语义分析。尝试用多模态语义角色标注的范式来对人机交互指令进行语义解析，从而将原本机器无法理解的指令转换成机器可理解的语义结构化输出，做到更加方便、安全、快捷的执行用户的意图。In the embodiment of the present invention, since most existing semantic role labeling models are based on single-modal settings, an innovative attempt is made to introduce image information into the existing single-modal semantic role labeling models, thereby using image information to assist semantic The role labeling model performs semantic analysis on the input sentence. Try to use the paradigm of multi-modal semantic role labeling to semantically analyze human-computer interaction instructions, so as to convert instructions that cannot be understood by the machine into semantically structured output that the machine can understand, so as to achieve more convenient, safe and fast execution of the user intention of.

图6是根据一示例性实施例示出的一种基于多模态语义角色识别的人机交互指令解析装置框图。参照图6，该装置300包括：Fig. 6 is a block diagram of a human-computer interaction instruction parsing device based on multimodal semantic role recognition according to an exemplary embodiment. Referring to Figure 6, the device 300 includes:

范式构建模块310，用于根据人机交互指令的特性，构建一套完善的指令语义角色标注范式；Paradigm construction module 310, configured to construct a complete set of instruction semantic role labeling paradigms according to the characteristics of human-computer interaction instructions;

多模态构建模块320，用于根据所述指令语义角色标注范式，结合图像采集，将语义角色标注模型的单模态形式扩展为视觉文本多模态形式；The multimodal construction module 320 is used to expand the single-modal form of the semantic role labeling model into a multimodal form of visual text according to the semantic role labeling paradigm of the instruction and in combination with image collection;

模型训练模块330，用于对语义角色标注模型的视觉文本多模态形式进行训练学习，完成多模态语义角色识别对人机交互指令进行语义解析。The model training module 330 is used to train and learn the visual text multi-modal form of the semantic role labeling model, complete multi-modal semantic role recognition and perform semantic analysis on human-computer interaction instructions.

可选地，范式构建模块310，用于根采用VerbAtlas语义角色标注数据的标注方式作为标注基准；Optionally, the paradigm building module 310 is used to use the labeling method of VerbAtlas semantic role labeling data as the labeling benchmark;

对预存的中文语义角色标注范式扩展和修改，使扩展和修改后的中文语义角色标注范式适用于人机交互指令的语义解析，获得一套完善的指令语义角色标注范式。The pre-stored Chinese semantic role labeling paradigm is extended and modified, so that the extended and modified Chinese semantic role labeling paradigm is suitable for the semantic analysis of human-computer interaction instructions, and a complete set of instruction semantic role labeling paradigms is obtained.

可选地，多模态构建模块320，用于根据所述指令语义角色标注范式，通过无人系统采集图像，采用Faster-RCNN获得序列目标区域，将所述序列目标区域组成图像区域序列，对所述图像序列特征进行提取；Optionally, the multimodal construction module 320 is configured to collect images through an unmanned system according to the instruction semantic role labeling paradigm, use Faster-RCNN to obtain sequence target areas, and form the sequence target areas into an image area sequence, for The image sequence features are extracted;

可选地，模型训练模块330，用于根据语义角色标注模型的视觉文本多模态形式构建预训练模型；Optionally, the model training module 330 is configured to construct a pre-training model according to the visual-text multimodal form of the semantic role labeling model;

所述预训练模型的输入的指令

； Instructions for the input of the pre-trained model

;

枚举出指令I中所有的跨度

，其中

，获得每个跨度的特征向量；其中，所述跨度的大小均为预设值； Enumerate all spans in instruction I

,in

根据所述每个跨度的特征向量，生成语义图中谓词节点和语义角色节点对应的候选向量；According to the feature vector of each span, generate the candidate vector corresponding to the predicate node and the semantic role node in the semantic graph;

引入损失函数对模型的训练损失进行完善，完成多模态语义角色识别对人机交互指令进行语义解析。Introduce a loss function to improve the training loss of the model, and complete multi-modal semantic role recognition and semantic analysis of human-computer interaction instructions.

可选地，模型训练模块330，用于采用两个不同的MLP层分别得到谓词候选向量

以及语义角色候选向量

，其中：

；

。 Optionally, the model training module 330 is used to adopt two different MLP layers to obtain predicate candidate vectors respectively

and semantic role candidate vectors

,in:

;

.

可选地，模型训练模块330，用于构建语义角色标注损失函数，判断模型预测的谓词及论元结构的完整性；Optionally, the model training module 330 is used to construct a semantic role labeling loss function to judge the integrity of the predicate and argument structure predicted by the model;

、语义角色

以及两者关系

的三元组

, semantic role

and the relationship between the two

triplet of

可选地，模型训练模块330，用于构建模态匹配函数，用于图像和文本跨模态特征对的模态匹配，该函数的标签定义为如果该语义角色对应的片段中包含该目标区域对应的物体，则输出标签为1，否则标签为0；通过多任务学习的范式，定义如下述公式（2）的模态匹配函数的损失函数：Optionally, the model training module 330 is used to construct a modality matching function for modality matching of image and text cross-modal feature pairs, and the label of the function is defined as if the segment corresponding to the semantic role contains the target region For the corresponding object, the output label is 1, otherwise the label is 0; through the paradigm of multi-task learning, the loss function of the modal matching function defined in the following formula (2):

图7是本发明实施例提供的一种电子设备400的结构示意图，该电子设备400可因配置或性能不同而产生比较大的差异，可以包括一个或一个以上处理器（centralprocessing units，CPU）401和一个或一个以上的存储器402，其中，所述存储器402中存储有至少一条指令，所述至少一条指令由所述处理器401加载并执行以实现下述基于多模态语义角色识别的人机交互指令解析方法的步骤：FIG. 7 is a schematic structural diagram of an electronic device 400 provided by an embodiment of the present invention. The electronic device 400 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (CPU) 401 and one or more memory 402, wherein at least one instruction is stored in the memory 402, and the at least one instruction is loaded and executed by the processor 401 to realize the following human-machine based on multimodal semantic role recognition The steps of the interactive instruction parsing method:

S1：根据人机交互指令的特性，构建一套完善的指令语义角色标注范式；S1: According to the characteristics of human-computer interaction instructions, construct a complete set of instruction semantic role labeling paradigms;

在示例性实施例中，还提供了一种计算机可读存储介质，例如包括指令的存储器，上述指令可由终端中的处理器执行以完成上述基于多模态语义角色识别的人机交互指令解析方法。例如，所述计算机可读存储介质可以是ROM、随机存取存储器（RAM）、CD-ROM、磁带、软盘和光数据存储设备等。In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory including instructions, the above-mentioned instructions can be executed by a processor in the terminal to complete the above-mentioned human-computer interaction instruction parsing method based on multi-modal semantic role recognition . For example, the computer readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.

本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成，也可以通过程序来指令相关的硬件完成，所述的程序可以存储于一种计算机可读存储介质中，上述提到的存储介质可以是只读存储器，磁盘或光盘等。Those of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments can be completed by hardware, and can also be completed by instructing related hardware through a program. The program can be stored in a computer-readable storage medium. The above-mentioned The storage medium mentioned may be a read-only memory, a magnetic disk or an optical disk, and the like.

以上所述仅为本发明的较佳实施例，并不用以限制本发明，凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included in the protection of the present invention. within range.

Claims

1. A human-computer interaction instruction analysis method based on multimodal semantic role recognition, characterized in that, comprising the following steps:

S1: According to the characteristics of human-computer interaction instructions, construct an instruction semantic role annotation paradigm;

S2: According to the semantic role labeling paradigm of the instruction, combined with image collection, the single-modal form of the semantic role labeling model is extended to a multi-modal form of visual text;

S3: Train and learn the visual text multimodal form of the semantic role labeling model, and complete the semantic analysis of human-computer interaction instructions for multimodal semantic role recognition;

In the step S3, the multimodal form of visual text of the semantic role labeling model is trained and learned, and the multimodal semantic role recognition is completed to perform semantic analysis on human-computer interaction instructions, including:

S31: Construct a pre-training model according to the visual text multimodal form of the semantic role labeling model;

S32: Instructions for inputting the pre-trained model

;

S33: enumerate all spans in instruction I

,in

S34: According to the feature vector of each span, generate candidate vectors corresponding to the predicate node and the semantic role node in the semantic graph;

S35: Introduce a loss function to improve the training loss of the model, and complete multi-modal semantic role recognition and semantic analysis of human-computer interaction instructions.

2. The method according to claim 1, characterized in that, in the step S1, according to the characteristics of the human-computer interaction instruction, constructing an instruction semantic role annotation paradigm, including:

S11: Use the labeling method of VerbAtlas semantic role labeling data as the labeling benchmark;

S12: Extend and modify the pre-stored Chinese semantic role labeling paradigm, so that the extended and modified Chinese semantic role labeling paradigm is applicable to the semantic analysis of human-computer interaction instructions, and obtain the instruction semantic role labeling paradigm.

3. The method according to claim 2, characterized in that, in the step S2, according to the instruction semantic role labeling paradigm, combined with image acquisition, the single-modal form of the semantic role labeling model is extended to a visual-text dual-mode state forms, including:

S21: According to the instruction semantic role annotation paradigm, collect images through an unmanned system, use Faster-RCNN to obtain a sequence target area, form the sequence target area into an image area sequence, and extract features of the image sequence;

S22: Use the extracted image sequence features to assist in identifying the semantic role on the semantic text side, and extend the single-modal form of the semantic role labeling model to a dual-modal form of visual text.

4. The method according to claim 1, characterized in that, in said S34, two different layer perceptron MLP layers are used to obtain the predicate candidate vector respectively

and semantic role candidate vectors

,in:

.

5. The method according to claim 4, characterized in that, in said S35, introducing a loss function to improve the training loss of the model, including:

Construct a semantic role labeling loss function to judge the integrity of the predicate and argument structure predicted by the model;

Wherein, comprise a MLP layer score layer and a Biaffine score layer; The MLP layer score layer is used to judge the semantic framework of the current predicate node, and the Biaffine score layer is used to each predicate in the sentence

semantic role

and the relationship between the two

triplet of

.

6. The method according to claim 4, characterized in that, in said S35, introducing a loss function to improve the training loss of the model, including:

Construct a modal matching function for modal matching of image and text cross-modal feature pairs. The label of this function is defined as if the segment corresponding to the semantic role contains the object corresponding to the target area, then the output label is 1, otherwise the label is 0; through the paradigm of multi-task learning, define the loss function of the mode matching function as shown in the following formula (2):

.

7. A human-computer interaction instruction parsing device based on multimodal semantic role recognition, characterized in that the device is suitable for any one of the above-mentioned methods in claims 1-6, and the device includes:

Instruction Semantic Role Labeling Paradigm Building Module, which is used to construct instruction semantic role labeling paradigm according to the characteristics of human-computer interaction instructions;

A multimodal building block, used to expand the single-modal form of the semantic role labeling model to a multimodal form of visual text according to the instruction semantic role labeling paradigm and in combination with image acquisition;

The model training module is used to train and learn the visual text multimodal form of the semantic role labeling model, and complete the multimodal semantic role recognition and semantic analysis of human-computer interaction instructions.

8. The device according to claim 7, characterized in that, the instruction semantic role labeling paradigm building block is used to root the labeling method using VerbAtlas semantic role labeling data as the labeling reference;

The pre-stored Chinese semantic role labeling paradigm is extended and modified, so that the extended and modified Chinese semantic role labeling paradigm is suitable for the semantic analysis of human-computer interaction instructions, and the instruction semantic role labeling paradigm is obtained.

9. The device according to claim 7, wherein the multimodal construction module is configured to collect images through an unmanned system according to the instruction semantic role labeling paradigm, and use Faster-RCNN to obtain the sequence target area, Composing the sequence target area into a sequence of image regions, and extracting the features of the sequence of images;

Through the extracted image sequence features, the semantic role on the semantic text side is assisted to identify, and the single-modal form of the semantic role labeling model is extended to a visual-text dual-modal form.