
CN114757209B - Man-machine interaction instruction analysis method and device based on multi-mode semantic role recognition - Google Patents

Info

Publication number: CN114757209B
Application number: CN202210659318.5A
Authority: CN (China)
Prior art keywords: semantic role, semantic, instruction, labeling, paradigm
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN114757209A
Inventors: 张梅山, 卢攀忠, 林智超, 孙越恒
Current assignee: Tianjin University
Original assignee: Tianjin University
Application filed by Tianjin University
Priority to CN202210659318.5A
Publication of CN114757209A
Application granted
Publication of CN114757209B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06F 40/35: Discourse or dialogue representation
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention provides a human-computer interaction instruction parsing method and device based on multi-modal semantic role recognition, relating to the field of semantic analysis in natural language processing. The method includes: constructing a complete instruction semantic role annotation paradigm according to the characteristics of human-computer interaction instructions; according to this paradigm, and in combination with image acquisition, extending the semantic role labeling model from its single-modal form to a visual-text multi-modal form; and training the visual-text multi-modal form of the model, so that the completed multi-modal semantic role recognition semantically parses human-computer interaction instructions. The invention innovatively applies the paradigm of multi-modal semantic role labeling to the semantic parsing of human-computer interaction instructions, converting instructions that a machine could not previously understand into machine-readable structured semantic output, so that the user's intent is executed more conveniently, safely, and quickly.

Description

Human-computer interaction instruction analysis method and device based on multi-modal semantic role recognition

Technical Field

The invention relates to the field of semantic analysis in natural language processing, and in particular to a method and device for parsing human-computer interaction instructions based on multi-modal semantic role recognition.

Background

Semantic role labeling is a shallow semantic analysis technique used to extract the predicate-argument structures implied in a sentence. The predicate is the core word of a sentence that triggers a semantic event, and the arguments are the roles participating in that event, such as the agent and the patient. In essence, semantic role labeling enables a machine to understand "who did what to whom, and when and where" in a sentence. Many applications already use semantic role labeling as a key link in their technical pipelines, such as knowledge-based question answering, dialogue robots, and machine translation.

With the development of technology, human-computer interaction has gradually become an important way for users to control unmanned devices such as robots and drones. Issuing commands by voice lets an unmanned device understand the operator's intent and execute the corresponding command, freeing the operator's hands and making control more convenient, safe, and fast. However, existing instruction parsing technology is limited and cannot extract machine-understandable semantic structures from instructions in a targeted manner. The present invention exploits the strengths of semantic role labeling itself to parse the intended semantics of control instructions with high precision, so that unmanned devices can better serve users and perform more abstract and difficult tasks.

At present, the overall pipeline of semantic role labeling falls into two types. The first is pipeline-based: a sequence labeling method first identifies the predicates in a sentence and then identifies the semantic roles (arguments), which leads to serious error propagation. The second constructs a semantic graph to extract predicates and their corresponding semantic roles simultaneously: all possible predicate and argument candidate spans of a sentence are enumerated as graph nodes, the semantic role relations between predicate spans and role spans serve as graph edges, and exact decoding of the resulting semantic graph yields the structured output. Most current unmanned devices have both visual and linguistic perception, yet most existing semantic role labeling methods target a text-only setting and ignore the important complementary relationship between image and text information.

At present, the annotation paradigms of semantic role labeling datasets are mostly oriented to the general domain, and a large gap remains in special domains such as unmanned-device control instructions.

Summary of the Invention

Aiming at the problem in the prior art that a large gap remains in parsing unmanned-device control instructions, the present invention proposes a method and device for parsing human-computer interaction instructions based on multi-modal semantic role recognition.

To solve the above technical problem, the present invention provides the following technical solutions:

In one aspect, a method for parsing human-computer interaction instructions based on multi-modal semantic role recognition is provided. The method is applied to an electronic device and includes the following steps:

S1: construct an instruction semantic role annotation paradigm according to the characteristics of human-computer interaction instructions;

S2: according to the instruction semantic role annotation paradigm, and in combination with image acquisition, extend the semantic role labeling model from its single-modal form to a visual-text multi-modal form;

S3: train the visual-text multi-modal form of the semantic role labeling model, so that the completed multi-modal semantic role recognition semantically parses human-computer interaction instructions.

Optionally, in step S1, constructing the instruction semantic role annotation paradigm according to the characteristics of human-computer interaction instructions includes:

S11: adopt the labeling scheme of the VerbAtlas semantic role labeling data as the labeling benchmark;

S12: extend and modify a pre-stored Chinese semantic role annotation paradigm so that it is suitable for the semantic parsing of human-computer interaction instructions, obtaining the instruction semantic role annotation paradigm.

Optionally, in step S2, extending the semantic role labeling model from its single-modal form to a visual-text dual-modal form according to the instruction semantic role annotation paradigm, in combination with image acquisition, includes:

S21: according to the instruction semantic role annotation paradigm, collect images through the unmanned system, use Faster R-CNN to obtain target regions, assemble these regions into an image region sequence, and extract the features of the image region sequence;

S22: use the extracted image region features to assist in identifying the semantic roles on the text side, extending the semantic role labeling model from its single-modal form to a visual-text dual-modal form.
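The data flow of S21-S22 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the region feature dimension (2048, typical of Faster R-CNN RoI features), the hidden size, and the projection are all assumptions, and random vectors stand in for real detector and encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical detector output for one captured image: k target regions,
# each with a 2048-d RoI feature (sizes are assumptions, not from the patent).
num_regions, roi_dim, hidden_dim = 5, 2048, 256
region_feats = rng.normal(size=(num_regions, roi_dim))

# Project each region feature into the text model's hidden space so the
# image region sequence can be used alongside the word vectors.
W_proj = rng.normal(scale=0.02, size=(roi_dim, hidden_dim))
region_seq = region_feats @ W_proj           # (num_regions, hidden_dim)

# Word vectors for an n-word instruction (stand-ins for encoder outputs).
n = 8
word_seq = rng.normal(size=(n, hidden_dim))

# The dual-modal input: text tokens followed by image regions.
dual_modal_input = np.concatenate([word_seq, region_seq], axis=0)
print(dual_modal_input.shape)                # (n + num_regions, hidden_dim)
```

In this sketch the region sequence is simply concatenated after the token sequence; how exactly the image features assist role identification is left open here, as the patent only specifies that they do.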

Optionally, in step S3, training the visual-text multi-modal form of the semantic role labeling model so that the completed multi-modal semantic role recognition semantically parses human-computer interaction instructions includes:

S31: construct a pre-trained model according to the visual-text multi-modal form of the semantic role labeling model;

S32: the input to the pre-trained model is an instruction I = (w1, w2, ..., wn); the BERT pre-trained model encodes the instruction I to obtain the word vector sequence X = (x1, x2, ..., xn), one vector per word of I;

S33: enumerate all spans s(i, j) of the instruction I, where 1 ≤ i ≤ j ≤ n, and obtain a feature vector for each span; the maximum span size is a preset value;
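The span enumeration of S33 can be sketched in a few lines. The patent does not fix how a span's feature vector is built, so this sketch uses one common choice, concatenating the boundary word vectors, purely as an assumption; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n, hidden_dim, max_width = 6, 4, 3    # illustrative sizes, not from the patent
X = rng.normal(size=(n, hidden_dim))  # word vectors x_1..x_n from the encoder

# Enumerate all spans s(i, j) with i <= j and width at most max_width
# (0-based indices here; the text uses 1 <= i <= j <= n).
spans = [(i, j) for i in range(n) for j in range(i, min(i + max_width, n))]

# One common span representation: concatenate boundary vectors [x_i ; x_j].
span_feats = np.stack([np.concatenate([X[i], X[j]]) for i, j in spans])

print(len(spans))          # number of candidate spans
print(span_feats.shape)    # (num_spans, 2 * hidden_dim)
```

Capping the span width keeps the number of candidates linear in sentence length, which is why the patent makes the span size a preset value.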

S34: from each span's feature vector, generate the candidate vectors corresponding to the predicate nodes and the semantic role nodes of the semantic graph;

S35: introduce loss functions to refine the model's training loss, so that the completed multi-modal semantic role recognition semantically parses human-computer interaction instructions.

Optionally, in S34, two different MLP layers are applied to each span feature vector h to obtain the predicate candidate vector and the semantic role candidate vector, respectively:

g_p = MLP_pred(h); g_a = MLP_role(h).
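The two MLP heads of S34 can be sketched as follows; the hidden and output sizes, the single-hidden-layer shape, and the ReLU activation are illustrative assumptions, since the patent only states that two different MLP layers produce the two candidate vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """A single-hidden-layer MLP with ReLU, applied row-wise."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

num_spans, span_dim, hid, out = 15, 8, 16, 8   # illustrative sizes
span_feats = rng.normal(size=(num_spans, span_dim))

# Two *separate* parameter sets: one head for predicate candidates,
# one for semantic role candidates, as in S34.
params_pred = [rng.normal(scale=0.1, size=s) for s in
               [(span_dim, hid), (hid,), (hid, out), (out,)]]
params_role = [rng.normal(scale=0.1, size=s) for s in
               [(span_dim, hid), (hid,), (hid, out), (out,)]]

g_pred = mlp(span_feats, *params_pred)   # predicate candidate vectors
g_role = mlp(span_feats, *params_role)   # semantic role candidate vectors
print(g_pred.shape, g_role.shape)
```

Using separate heads over a shared span representation lets the same span score differently as a predicate candidate and as a role candidate.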

Optionally, in S35, introducing loss functions to refine the model's training loss includes:

constructing a semantic role labeling loss function to judge the completeness of the predicate-argument structures predicted by the model;

this includes an MLP scoring layer and a Biaffine scoring layer; the MLP scoring layer judges the semantic frame of the current predicate node, and the Biaffine scoring layer scores each triplet (p, a, r) in the sentence, consisting of a predicate p, a semantic role a, and the relation r between the two; cross-entropy is used to compute the loss of each triplet, and the semantic role labeling loss function is given by formula (1):

[formula (1) appears as an image in the original document]
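A sketch of the Biaffine scoring and the per-triplet cross-entropy described above. The bilinear-plus-linear form s(p, a, r) = g_p' U_r g_a + w_r . [g_p ; g_a] + b_r is the standard biaffine parameterization, used here as an assumption since formula (1) is not reproduced in the text; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

num_p, num_a, d, num_rel = 3, 4, 8, 5        # illustrative sizes
g_p = rng.normal(size=(num_p, d))            # predicate candidate vectors
g_a = rng.normal(size=(num_a, d))            # role candidate vectors

# Biaffine parameters: one bilinear matrix, weight vector, and bias per relation.
U = rng.normal(scale=0.1, size=(num_rel, d, d))
w = rng.normal(scale=0.1, size=(num_rel, 2 * d))
b = rng.normal(scale=0.1, size=(num_rel,))

# Bilinear term for every (p, a, r) triplet.
scores = np.einsum("pd,rde,ae->par", g_p, U, g_a)
# Linear term over the concatenated pair [g_p ; g_a], plus bias.
pair = np.concatenate([np.repeat(g_p, num_a, 0),
                       np.tile(g_a, (num_p, 1))], axis=1)
scores += (pair @ w.T).reshape(num_p, num_a, num_rel) + b

# Cross-entropy for one hypothetical gold triplet (p=0, a=1, r=2):
# negative log-softmax of the gold relation's score.
logits = scores[0, 1]
loss = -(logits[2] - np.log(np.exp(logits).sum()))
print(scores.shape, float(loss) > 0)
```

Summing such per-triplet losses over all scored pairs gives a training objective of the kind formula (1) describes.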

Optionally, in S35, introducing loss functions to refine the model's training loss includes:

constructing a modal matching function for the modal matching of image-text cross-modal feature pairs; its label is defined so that if the span corresponding to the semantic role mentions the object in the target region, the output label is 1, and otherwise the label is 0; following the multi-task learning paradigm, the loss function of the modal matching function is defined as formula (2):

[formula (2) appears as an image in the original document]
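The modal matching objective can be sketched as a binary classification over (semantic role span, image region) pairs, with label 1 when the span mentions the region's object. Since formula (2) is not reproduced in the text, the bilinear pair scorer and the binary cross-entropy below are assumptions; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

num_roles, num_regions, d = 3, 4, 8              # illustrative sizes
role_vecs = rng.normal(size=(num_roles, d))      # text-side role features
region_vecs = rng.normal(size=(num_regions, d))  # image-side region features

# Pairwise match scores via a simple bilinear form (one possible choice).
W = rng.normal(scale=0.1, size=(d, d))
logits = role_vecs @ W @ region_vecs.T           # (num_roles, num_regions)

# Gold labels: 1 if the role span contains the region's object, else 0
# (random stand-ins here).
labels = rng.integers(0, 2, size=(num_roles, num_regions)).astype(float)

# Binary cross-entropy over all cross-modal pairs, in the spirit of formula (2).
p = sigmoid(logits)
match_loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
print(logits.shape, float(match_loss) > 0)
```

In a multi-task setup this matching loss would be added to the semantic role labeling loss of formula (1), so that the cross-modal alignment is learned jointly with role identification.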

In one aspect, a device for parsing human-computer interaction instructions based on multi-modal semantic role recognition is provided. The device is applied to an electronic device and includes:

an instruction semantic role annotation paradigm construction module, configured to construct the instruction semantic role annotation paradigm according to the characteristics of human-computer interaction instructions;

a multi-modal construction module, configured to extend the semantic role labeling model from its single-modal form to a visual-text multi-modal form according to the instruction semantic role annotation paradigm, in combination with image acquisition;

a model training module, configured to train the visual-text multi-modal form of the semantic role labeling model, so that the completed multi-modal semantic role recognition semantically parses human-computer interaction instructions.

Optionally, the instruction semantic role annotation paradigm construction module is configured to adopt the labeling scheme of the VerbAtlas semantic role labeling data as the labeling benchmark;

and to extend and modify a pre-stored Chinese semantic role annotation paradigm so that it is suitable for the semantic parsing of human-computer interaction instructions, obtaining the instruction semantic role annotation paradigm.

Optionally, the multi-modal construction module is configured to: according to the instruction semantic role annotation paradigm, collect images through the unmanned system, use Faster R-CNN to obtain target regions, assemble these regions into an image region sequence, and extract the features of the image region sequence;

and to use the extracted image region features to assist in identifying the semantic roles on the text side, extending the semantic role labeling model from its single-modal form to a visual-text dual-modal form.

In one aspect, an electronic device is provided, including a processor and a memory; at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the above method for parsing human-computer interaction instructions based on multi-modal semantic role recognition.

In one aspect, a computer-readable storage medium is provided, storing at least one instruction that is loaded and executed by a processor to implement the above method for parsing human-computer interaction instructions based on multi-modal semantic role recognition.

The above technical solutions of the embodiments of the present invention have at least the following beneficial effects:

In the above solutions, the present invention innovatively introduces image information into the existing single-modal semantic role labeling model, using the image information to assist the semantic parsing of input instructions, and applies the paradigm of multi-modal semantic role labeling to human-computer interaction instructions. Instructions that a machine could not previously understand are thereby converted into machine-readable structured semantic output, so that the user's intent is executed more conveniently, safely, and quickly.

Description of the Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a method for parsing human-computer interaction instructions based on multi-modal semantic role recognition provided by an embodiment of the present invention;

Fig. 2 is a flow chart of a method for parsing human-computer interaction instructions based on multi-modal semantic role recognition provided by an embodiment of the present invention;

Fig. 3 is a diagram of the multi-modal semantic role labeling model of the method provided by an embodiment of the present invention;

Fig. 4 is a structured output diagram of multi-modal semantic roles of the method provided by an embodiment of the present invention;

Fig. 5 is an example diagram of human-computer interaction realized by multi-modal semantic role labeling of the method provided by an embodiment of the present invention;

Fig. 6 is a block diagram of a device for parsing human-computer interaction instructions based on multi-modal semantic role recognition provided by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

In order to make the technical problems to be solved, the technical solutions, and the advantages of the present invention clearer, a detailed description is given below with reference to the drawings and specific embodiments.

An embodiment of the present invention provides a method for parsing human-computer interaction instructions based on multi-modal semantic role recognition. The method can be implemented by an electronic device, which can be a terminal or a server. As shown in the flow chart of Fig. 1, the processing flow of the method can include the following steps:

S101: construct an instruction semantic role annotation paradigm according to the characteristics of human-computer interaction instructions;

S102: according to the instruction semantic role annotation paradigm, and in combination with image acquisition, extend the semantic role labeling model from its single-modal form to a visual-text multi-modal form;

S103: train the visual-text multi-modal form of the semantic role labeling model, so that the completed multi-modal semantic role recognition semantically parses human-computer interaction instructions.

Optionally, in step S101, constructing the instruction semantic role annotation paradigm according to the characteristics of human-computer interaction instructions includes:

S111: adopt the labeling scheme of the VerbAtlas semantic role labeling data as the labeling benchmark;

S112: extend and modify a pre-stored Chinese semantic role annotation paradigm so that it is suitable for the semantic parsing of human-computer interaction instructions, obtaining the instruction semantic role annotation paradigm.

Optionally, in step S102, extending the semantic role labeling model from its single-modal form to a visual-text dual-modal form according to the instruction semantic role annotation paradigm, in combination with image acquisition, includes:

S121: according to the instruction semantic role annotation paradigm, collect images through the unmanned system, use Faster R-CNN to obtain target regions, assemble these regions into an image region sequence, and extract the features of the image region sequence;

S122: use the extracted image region features to assist in identifying the semantic roles on the text side, extending the semantic role labeling model from its single-modal form to a visual-text dual-modal form.

Optionally, in step S103, training the visual-text multi-modal form of the semantic role labeling model so that the completed multi-modal semantic role recognition semantically parses human-computer interaction instructions includes:

S131: construct a pre-trained model according to the visual-text multi-modal form of the semantic role labeling model;

S132: the input to the pre-trained model is an instruction I = (w1, w2, ..., wn); the BERT pre-trained model encodes the instruction I to obtain the word vector sequence X = (x1, x2, ..., xn), one vector per word of I;

S133: enumerate all spans s(i, j) of the instruction I, where 1 ≤ i ≤ j ≤ n, and obtain a feature vector for each span; the maximum span size is a preset value;

S134: from each span's feature vector, generate the candidate vectors corresponding to the predicate nodes and the semantic role nodes of the semantic graph;

S135: introduce loss functions to refine the model's training loss, so that the completed multi-modal semantic role recognition semantically parses human-computer interaction instructions.

Optionally, in S134, two different MLP layers are applied to each span feature vector h to obtain the predicate candidate vector and the semantic role candidate vector, respectively:

g_p = MLP_pred(h); g_a = MLP_role(h).

Optionally, in S135, introducing loss functions to refine the model's training loss includes:

constructing a semantic role labeling loss function to judge the completeness of the predicate-argument structures predicted by the model;

this includes an MLP scoring layer and a Biaffine scoring layer; the MLP scoring layer judges the semantic frame of the current predicate node, and the Biaffine scoring layer scores each triplet (p, a, r) in the sentence, consisting of a predicate p, a semantic role a, and the relation r between the two; cross-entropy is used to compute the loss of each triplet, and the semantic role labeling loss function is given by formula (1):

[formula (1) appears as an image in the original document]

Optionally, in S135, introducing loss functions to refine the model's training loss includes:

constructing a modal matching function for the modal matching of image-text cross-modal feature pairs; its label is defined so that if the span corresponding to the semantic role mentions the object in the target region, the output label is 1, and otherwise the label is 0; following the multi-task learning paradigm, the loss function of the modal matching function is defined as formula (2):

[formula (2) appears as an image in the original document]

In the embodiment of the present invention, image information is innovatively introduced into the existing single-modal semantic role labeling model, so that the image information assists the semantic role labeling model in the semantic analysis of input sentences. The paradigm of multi-modal semantic role labeling is applied to the semantic parsing of human-computer interaction instructions, converting instructions that a machine could not previously understand into machine-readable structured semantic output, so that the user's intent is executed more conveniently, safely, and quickly.

An embodiment of the present invention provides a method for parsing human-computer interaction instructions based on multi-modal semantic role recognition. The method can be implemented by an electronic device, which can be a terminal or a server. As shown in the flow chart of Fig. 2, the processing flow of the method can include the following steps:

S201:采用VerbAtlas语义角色标注数据的标注方式作为标注基准。S201: Using the labeling method of the VerbAtlas semantic role labeling data as a labeling benchmark.

In a feasible implementation, the present invention first constructs, for human-computer interaction instructions, a complete instruction semantic role labeling paradigm based on the characteristics of such instructions. Most previous semantic role labeling paradigms are oriented toward general domains (such as news), where the semantic roles are designed for broad generality. In the field of human-computer interaction, however, the semantic roles of each type of instruction have their own particularities, which general-domain semantic roles cannot cover.

S202: Extend and modify the pre-stored Chinese semantic role labeling paradigm so that it is applicable to the semantic parsing of human-computer interaction instructions, obtaining the instruction semantic role labeling paradigm.

In a feasible implementation, the present invention extends and modifies the existing Chinese semantic role labeling paradigm to make it suitable for the semantic parsing of human-computer interaction instructions.

The preliminary plan adopts the VerbAtlas semantic role labeling scheme as the labeling benchmark of the present invention, mainly for the following two reasons: (1) this benchmark introduces the concept of a semantic frame into predicate recognition, making the specific semantics of each predicate more precise and thereby alleviating the ambiguity of predicates across different contexts; (2) the benchmark is designed for multilingual scenarios, which facilitates the design of a Chinese-instruction-oriented labeling paradigm. Table 1 shows the semantic frames and semantic roles initially defined by the present invention. They cover simple displacement instructions such as moving forward and moving, as well as more difficult manipulation instructions such as taking and opening; the semantic roles include the controlled device and means of control participating in the semantic event, as well as the time and place of instruction execution.

[Table 1: semantic frames and semantic roles defined by the present invention — rendered as images in the original document]
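Since the images of Table 1 are not reproduced here, a minimal sketch of what such a frame/role inventory might look like is given below. The frame and role names are illustrative placeholders, not the actual entries of the patent's Table 1.

```python
# Hypothetical instruction semantic-frame inventory in the spirit of Table 1.
# The actual frame and role names of the patent's table are not recoverable
# from this text; the entries below are illustrative placeholders only.
FRAME_INVENTORY = {
    "MOVE": ["Agent", "Destination", "Path", "Time"],       # displacement instructions
    "TAKE": ["Agent", "Theme", "Source", "Instrument"],     # manipulation instructions
    "OPEN": ["Agent", "Patient", "Instrument", "Location"],
}

def roles_for(frame: str) -> list:
    """Return the semantic roles licensed by a given semantic frame."""
    return FRAME_INVENTORY[frame]

print(roles_for("TAKE"))  # the role set an instruction parser would try to fill
```

Under such a schema, parsing an instruction amounts to choosing a frame for each predicate and then filling that frame's role slots with spans of the instruction.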

S203: According to the instruction semantic role labeling paradigm, collect images through the unmanned system, obtain a sequence of target regions using Faster-RCNN, assemble the target regions into an image region sequence, and extract the features of the image sequence;

S204: Use the extracted image sequence features to assist the recognition of the semantic roles on the text side, extending the single-modal form of the semantic role labeling model into a visual-text dual-modal form.

In a feasible implementation, in terms of model architecture, the present invention adopts the twin-tower model shown in Figure 3 to fuse image and text features in the multimodal semantic role task. The overall architecture consists of three parts: image sequence feature extraction on the image side, semantic graph feature extraction on the language side, and the training functions used for feature fusion.

In a feasible implementation, image sequence features: for an image $V$ observed by the unmanned system, the present invention uses the existing Faster-RCNN to obtain a sequence of target regions, assembles them into an image region sequence $O = \{o_1, o_2, \ldots, o_m\}$, and obtains the feature sequence $F = \{f_1, f_2, \ldots, f_m\}$ corresponding to the region sequence. For each region feature $f_i$ in the feature sequence, the present invention applies one MLP layer for further feature abstraction to obtain the final image feature $v_i$:

$$v_i = \mathrm{MLP}(f_i)$$
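The image branch described above can be sketched as follows. The Faster-RCNN detector is mocked with random region features (an assumption for illustration); in the patent's pipeline these would be the detector's region-proposal features, and the dimensions are placeholders.

```python
import numpy as np

# Sketch of the image-branch abstraction: region features f_i from a detector
# are passed through one MLP layer to obtain the final image features v_i.
rng = np.random.default_rng(0)

def extract_region_features(num_regions=5, det_dim=2048):
    # Placeholder for Faster-RCNN output: one feature vector per target region.
    return rng.standard_normal((num_regions, det_dim))

def mlp_layer(F, W, b):
    # v_i = ReLU(W^T f_i + b): one further layer of feature abstraction
    return np.maximum(F @ W + b, 0.0)

det_dim, hid = 2048, 256
W = rng.standard_normal((det_dim, hid)) * 0.01   # untrained placeholder weights
b = np.zeros(hid)
V = mlp_layer(extract_region_features(), W, b)   # image feature sequence, (5, 256)
```

Each row of `V` then plays the role of one $v_i$ when matched against semantic-role spans on the text side.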

S205: Construct a pre-training model according to the visual-text multimodal form of the semantic role labeling model;

S206: The input of the pre-training model is the instruction $I = \{w_1, w_2, \ldots, w_n\}$; the instruction I is encoded with the BERT pre-training model to obtain the word vector sequence $X = \{x_1, x_2, \ldots, x_n\}$ corresponding to each word in the instruction I;

S207: Enumerate all spans $s = (w_i, \ldots, w_j)$ in the instruction I, where $1 \le i \le j \le n$, and obtain the feature vector of each span, where the span sizes are bounded by preset values;

S208: Generate the candidate vectors corresponding to the predicate nodes and semantic role nodes in the semantic graph according to the feature vector of each span.

In a feasible implementation, text sequence features: the present invention adopts the classic semantic graph neural network construction idea of end-to-end semantic role labeling to obtain the predicates implicit in a sentence and their corresponding arguments. The input instruction $I = \{w_1, w_2, \ldots, w_n\}$ is encoded with the BERT pre-training model to obtain the word vector sequence $X = \{x_1, x_2, \ldots, x_n\}$ corresponding to each word in the instruction. All spans $s = (w_i, \ldots, w_j)$ in the instruction are then enumerated, where $1 \le i \le j \le n$ and each span consists of one or more consecutive words of the sentence. The maximum and minimum lengths of the spans are preset. For each span $s$, its feature vector is expressed as:

$$g_s = [\,h_i;\; h_j;\; \phi(s);\; \hat{h}_s\,]$$

where $h_i$ and $h_j$ denote the hidden-layer representations of the start word and end word of the span, $\phi(s)$ denotes the length feature of the span, and $\hat{h}_s$ is the vector obtained by using the self-attention mechanism to compute the attention over each word within the span and taking the attention-weighted average.
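The span enumeration and span representation described above can be sketched as follows. The BERT encodings, the length-feature table, and the attention scoring are mocked with random values and a toy scoring rule (assumptions for illustration); only the structure of $g_s$ follows the text.

```python
import numpy as np

# Sketch of span enumeration and the span representation g_s: the start/end
# hidden states, a length feature, and an attention-weighted average over the
# span's word vectors are concatenated.
rng = np.random.default_rng(1)

def enumerate_spans(n, min_len=1, max_len=3):
    # All spans (i, j) with min_len <= j - i + 1 <= max_len, within the sentence.
    return [(i, j) for i in range(n) for j in range(i + min_len - 1, min(i + max_len, n))]

def span_feature(H, i, j, len_emb):
    scores = H[i:j + 1] @ np.ones(H.shape[1])        # toy attention scores
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
    attn = alpha @ H[i:j + 1]                        # attention-weighted average
    return np.concatenate([H[i], H[j], len_emb[j - i], attn])

n, d = 6, 8
H = rng.standard_normal((n, d))                      # word vectors (BERT output, mocked)
len_emb = rng.standard_normal((3, 4))                # length-feature table, lengths 1..3
spans = enumerate_spans(n)
G = np.stack([span_feature(H, i, j, len_emb) for i, j in spans])  # (num_spans, 28)
```

For a 6-word instruction with spans of length 1 to 3, this enumerates 15 candidate spans, each represented by a 28-dimensional vector (8 + 8 + 4 + 8).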

For the representation $g_s$ of each span, candidate vectors corresponding to the predicate nodes and semantic role nodes in the semantic graph need to be generated; the present invention therefore uses two different MLP layers to obtain the predicate candidate vector $h^p$ and the semantic role candidate vector $h^r$ respectively:

$$h^p = \mathrm{MLP}_P(g_s), \qquad h^r = \mathrm{MLP}_R(g_s)$$
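The two separate projection heads can be sketched as follows. The same span representation is projected once through MLP_P into a predicate candidate vector and once through MLP_R into a role candidate vector; weights and dimensions here are untrained random placeholders (assumptions), not the patent's parameters.

```python
import numpy as np

# Sketch of the two MLP heads: h^p = MLP_P(g_s) and h^r = MLP_R(g_s),
# computed from the same span representation g_s with separate parameters.
rng = np.random.default_rng(2)

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(x @ W1 + b1, 0.0)                 # hidden layer with ReLU
    return h @ W2 + b2                               # output projection

d_in, d_hid, d_out = 28, 16, 8
shapes = [(d_in, d_hid), (d_hid,), (d_hid, d_out), (d_out,)]
params_P = [rng.standard_normal(s) * 0.1 for s in shapes]  # MLP_P parameters
params_R = [rng.standard_normal(s) * 0.1 for s in shapes]  # MLP_R parameters

g_s = rng.standard_normal(d_in)                      # one span representation
h_p = mlp(g_s, *params_P)                            # predicate candidate vector
h_r = mlp(g_s, *params_R)                            # semantic-role candidate vector
```

Keeping the two heads separate lets the same span act as a predicate in one triple and as an argument in another without sharing a representation.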

S209: Introduce loss functions to refine the training loss of the model, completing the semantic parsing of human-computer interaction instructions by multimodal semantic role recognition.

In a feasible implementation, two different MLP layers are used to obtain the predicate candidate vector $h^p$ and the semantic role candidate vector $h^r$ respectively, where: $h^p = \mathrm{MLP}_P(g_s)$; $h^r = \mathrm{MLP}_R(g_s)$.
Here, $\mathrm{MLP}_P$ is a multi-layer feed-forward neural network used to obtain predicate representations, and $\mathrm{MLP}_R$ is a multi-layer feed-forward neural network used to obtain semantic role representations.

In a feasible implementation, introducing loss functions to refine the training loss of the model includes:

constructing a semantic role labeling loss function to judge the completeness of the predicate-argument structure predicted by the model.

This includes an MLP scoring layer and a Biaffine scoring layer. The MLP scoring layer is used to judge the semantic frame of the current predicate node; the Biaffine scoring layer is used to score, for each predicate $p$ in the sentence, each semantic role candidate $r$, and the relation $l$ between the two, the triple $(p, r, l)$. Cross-entropy is used to compute the loss of each triple; the semantic role labeling loss function is shown in formula (1) below:

$$\mathcal{L}_{SRL} = -\sum_{p} \log P(f_p \mid p) \; - \sum_{(p, r)} \log P(l_{p,r} \mid p, r) \quad (1)$$

In a feasible implementation, for the training loss, the present invention defines two loss functions for training the model. The first is the semantic role labeling loss function, which judges the completeness of the predicate-argument structure predicted by the model; it includes an MLP scoring layer to judge the semantic frame of the current predicate node, and a Biaffine scoring layer to score the triple $(p, r, l)$ of each predicate, semantic role, and the relation between the two in the sentence, specifically defined as follows:

$$s_f(p) = \mathrm{MLP}_F(h^p)$$

$$\Phi(p, r) = (h^p)^{\top}\,\mathrm{W}_1\, h^r + \mathrm{W}_2\,[\,h^p; h^r\,] + b$$

where $\mathrm{MLP}_F$ denotes the multi-layer feed-forward neural network used to obtain the semantic frame category score, $\mathrm{W}_1$ is the Biaffine weight matrix, $\mathrm{W}_2$ is the linear weight matrix, and $b$ is the bias term. After obtaining the score for each relation, the present invention uses cross-entropy to compute the loss of each triple:

$$\mathcal{L}_{SRL} = -\sum_{p} \log P(f_p \mid p) \; - \sum_{(p, r)} \log P(l_{p,r} \mid p, r)$$

where $f_p \in F$ and $l_{p,r} \in R$, with $F$ and $R$ denoting the corresponding semantic frame set and semantic role set.
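The biaffine scoring and cross-entropy step described above can be sketched as follows. The label set size, dimensions, and weights are illustrative placeholders (assumptions), but the scoring form — bilinear term plus linear term plus bias, followed by softmax cross-entropy — matches the description.

```python
import numpy as np

# Sketch of biaffine triple scoring: s_l(p, r) = h_p^T U_l h_r + W_l [h_p; h_r] + b_l
# over a label set, followed by softmax cross-entropy against the gold label.
rng = np.random.default_rng(3)
d, n_labels = 8, 5

U = rng.standard_normal((n_labels, d, d)) * 0.1      # biaffine weight tensor (W_1)
W = rng.standard_normal((n_labels, 2 * d)) * 0.1     # linear weight matrix (W_2)
b = np.zeros(n_labels)                               # bias term

def biaffine_scores(h_p, h_r):
    bilinear = np.einsum("i,kij,j->k", h_p, U, h_r)  # one bilinear score per label
    linear = W @ np.concatenate([h_p, h_r])
    return bilinear + linear + b

def cross_entropy(scores, gold):
    # Numerically stable -log softmax(scores)[gold]
    logz = np.log(np.exp(scores - scores.max()).sum()) + scores.max()
    return logz - scores[gold]

h_p, h_r = rng.standard_normal(d), rng.standard_normal(d)
loss = cross_entropy(biaffine_scores(h_p, h_r), gold=2)
```

Summing this per-triple loss over all predicate-role pairs (plus the frame-classification term) gives the SRL loss of formula (1).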

In a feasible implementation, introducing loss functions to refine the training loss of the model includes:

constructing a modal matching function for matching image-text cross-modal feature pairs, where the label of this function is defined as 1 if the text segment corresponding to the semantic role contains the object of the target image region, and 0 otherwise; following the multi-task learning paradigm, the loss function of the modal matching function is defined as formula (2) below:

$$\mathcal{L}_{match} = -\sum_{(o, r)} \log P\big(y_{o,r} \mid o, r\big) \quad (2)$$

In a feasible implementation, the second is the modal matching function for image-text cross-modal feature pairs. The label of this function is defined by the present invention as 1 if the text segment corresponding to the semantic role contains the object of the target image region, and 0 otherwise. The present invention likewise uses a Biaffine layer to score the triple $(o, r, l)$ of the image region feature, the semantic role, and the relation between the two:

$$\Phi_{m}(o, r) = v_o^{\top}\,\mathrm{W}'_1\, h^r + \mathrm{W}'_2\,[\,v_o; h^r\,] + b'$$

Similarly, its corresponding loss function is:

$$\mathcal{L}_{match} = -\sum_{(o, r)} \log P\big(y_{o,r} \mid o, r\big)$$

The final loss function is defined by the present invention using the multi-task learning paradigm:

$$\mathcal{L} = \mathcal{L}_{SRL} + \lambda\,\mathcal{L}_{match}$$

where $\lambda$ is used to adjust the weights of the two loss functions during model training.
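The multi-task combination described above reduces to a weighted sum of the two losses; a minimal sketch is given below, with the weight name `lam` standing in for the $\lambda$ hyperparameter (its value is an assumption, not specified in the text).

```python
# Sketch of the multi-task training objective: the final loss mixes the
# semantic-role-labeling loss and the modal-matching loss, with a weight
# lambda (lam) balancing the two tasks.
def total_loss(loss_srl: float, loss_match: float, lam: float = 0.5) -> float:
    return loss_srl + lam * loss_match

# Setting lam = 0 recovers a purely text-side SRL model; increasing lam
# pushes the model to ground semantic roles in the detected image regions.
```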

In this embodiment of the present invention, the goal of multimodal semantic role labeling is, given an input instruction, to produce a structured semantic output of that instruction so that the machine can understand and execute it. The structured output of multimodal semantic role recognition is shown in Figure 4.

In this embodiment, Figure 5 shows an example of parsing a human-computer interaction instruction with the multimodal semantic role labeling model of the present invention. For an instruction issued by the user, the multimodal semantic role parsing system of the present invention identifies the predicates, the corresponding semantic frames, and the semantic roles belonging to those frames, and organizes them into a machine-recognizable structured output.

In this embodiment of the present invention, since most existing semantic role labeling models are based on a single-modal setting, image information is innovatively introduced into the existing single-modal semantic role labeling model, so that image information assists the model in the semantic analysis of the input sentence. The paradigm of multimodal semantic role labeling is used to semantically parse human-computer interaction instructions, converting instructions that a machine could not otherwise understand into a machine-readable structured semantic output, so that the user's intent can be executed more conveniently, safely, and quickly.

Fig. 6 is a block diagram of a human-computer interaction instruction parsing device based on multimodal semantic role recognition according to an exemplary embodiment. Referring to Figure 6, the device 300 includes:

a paradigm construction module 310, configured to construct a complete instruction semantic role labeling paradigm according to the characteristics of human-computer interaction instructions;

a multimodal construction module 320, configured to extend the single-modal form of the semantic role labeling model into a visual-text multimodal form according to the instruction semantic role labeling paradigm, in combination with image collection;

a model training module 330, configured to train the visual-text multimodal form of the semantic role labeling model, completing the semantic parsing of human-computer interaction instructions by multimodal semantic role recognition.

Optionally, the paradigm construction module 310 is configured to adopt the labeling scheme of the VerbAtlas semantic role labeling data as the labeling benchmark;

and to extend and modify the pre-stored Chinese semantic role labeling paradigm so that it is applicable to the semantic parsing of human-computer interaction instructions, obtaining a complete instruction semantic role labeling paradigm.

Optionally, the multimodal construction module 320 is configured to collect images through the unmanned system according to the instruction semantic role labeling paradigm, obtain a sequence of target regions using Faster-RCNN, assemble the target regions into an image region sequence, and extract the features of the image sequence;

and to use the extracted image sequence features to assist the recognition of the semantic roles on the text side, extending the single-modal form of the semantic role labeling model into a visual-text dual-modal form.

Optionally, the model training module 330 is configured to construct a pre-training model according to the visual-text multimodal form of the semantic role labeling model;

the input of the pre-training model is the instruction $I = \{w_1, w_2, \ldots, w_n\}$; the instruction I is encoded with the BERT pre-training model to obtain the word vector sequence $X = \{x_1, x_2, \ldots, x_n\}$ corresponding to each word in the instruction I;

all spans $s = (w_i, \ldots, w_j)$ in the instruction I are enumerated, where $1 \le i \le j \le n$, and the feature vector of each span is obtained, where the span sizes are bounded by preset values;

the candidate vectors corresponding to the predicate nodes and semantic role nodes in the semantic graph are generated according to the feature vector of each span;

loss functions are introduced to refine the training loss of the model, completing the semantic parsing of human-computer interaction instructions by multimodal semantic role recognition.

Optionally, the model training module 330 is configured to use two different MLP layers to obtain the predicate candidate vector $h^p$ and the semantic role candidate vector $h^r$ respectively, where: $h^p = \mathrm{MLP}_P(g_s)$; $h^r = \mathrm{MLP}_R(g_s)$.

Optionally, the model training module 330 is configured to construct a semantic role labeling loss function to judge the completeness of the predicate-argument structure predicted by the model;

this includes an MLP scoring layer and a Biaffine scoring layer; the MLP scoring layer is used to judge the semantic frame of the current predicate node, and the Biaffine scoring layer is used to score, for each predicate $p$ in the sentence, each semantic role $r$, and the relation $l$ between the two, the triple $(p, r, l)$; cross-entropy is used to compute the loss of each triple, and the semantic role labeling loss function is shown in formula (1) below:

$$\mathcal{L}_{SRL} = -\sum_{p} \log P(f_p \mid p) \; - \sum_{(p, r)} \log P(l_{p,r} \mid p, r) \quad (1)$$

Optionally, the model training module 330 is configured to construct a modal matching function for matching image-text cross-modal feature pairs, where the label of this function is defined as 1 if the text segment corresponding to the semantic role contains the object of the target image region, and 0 otherwise; following the multi-task learning paradigm, the loss function of the modal matching function is defined as formula (2) below:

$$\mathcal{L}_{match} = -\sum_{(o, r)} \log P\big(y_{o,r} \mid o, r\big) \quad (2)$$

In this embodiment of the present invention, since most existing semantic role labeling models are based on a single-modal setting, image information is innovatively introduced into the existing single-modal semantic role labeling model, so that image information assists the model in the semantic analysis of the input sentence. The paradigm of multimodal semantic role labeling is used to semantically parse human-computer interaction instructions, converting instructions that a machine could not otherwise understand into a machine-readable structured semantic output, so that the user's intent can be executed more conveniently, safely, and quickly.

FIG. 7 is a schematic structural diagram of an electronic device 400 provided by an embodiment of the present invention. The electronic device 400 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402, and the at least one instruction is loaded and executed by the processor 401 to implement the steps of the following human-computer interaction instruction parsing method based on multimodal semantic role recognition:

S1: Construct a complete instruction semantic role labeling paradigm according to the characteristics of human-computer interaction instructions;

S2: According to the instruction semantic role labeling paradigm, and in combination with image collection, extend the single-modal form of the semantic role labeling model into a visual-text multimodal form;

S3: Train the visual-text multimodal form of the semantic role labeling model, completing the semantic parsing of human-computer interaction instructions by multimodal semantic role recognition.

In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions, where the instructions can be executed by a processor in a terminal to complete the above human-computer interaction instruction parsing method based on multimodal semantic role recognition. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

1.一种基于多模态语义角色识别的人机交互指令解析方法,其特征在于,包括以下步骤:1. A human-computer interaction instruction analysis method based on multimodal semantic role recognition, characterized in that, comprising the following steps: S1:根据人机交互指令的特性,构建指令语义角色标注范式;S1: According to the characteristics of human-computer interaction instructions, construct an instruction semantic role annotation paradigm; S2:根据所述指令语义角色标注范式,结合图像采集,将语义角色标注模型的单模态形式扩展为视觉文本多模态形式;S2: According to the semantic role labeling paradigm of the instruction, combined with image collection, the single-modal form of the semantic role labeling model is extended to a multi-modal form of visual text; S3:对语义角色标注模型的视觉文本多模态形式进行训练学习,完成多模态语义角色识别对人机交互指令的语义解析;S3: Train and learn the visual text multimodal form of the semantic role labeling model, and complete the semantic analysis of human-computer interaction instructions for multimodal semantic role recognition; 所述步骤S3中,对语义角色标注模型的视觉文本多模态形式进行训练学习,完成多模态语义角色识别对人机交互指令进行语义解析,包括:In the step S3, the multimodal form of visual text of the semantic role labeling model is trained and learned, and the multimodal semantic role recognition is completed to perform semantic analysis on human-computer interaction instructions, including: S31:根据语义角色标注模型的视觉文本多模态形式构建预训练模型;S31: Construct a pre-training model according to the visual text multimodal form of the semantic role labeling model; S32:所述预训练模型的输入的指令
Figure 202160DEST_PATH_IMAGE001
;利用BERT预训练模型对所述指令I进行编码,获得指令I中每个词对应的词向量序列
Figure 181617DEST_PATH_IMAGE002
S32: Instructions for inputting the pre-trained model
Figure 202160DEST_PATH_IMAGE001
; Utilize the BERT pre-training model to encode the instruction I, and obtain the word vector sequence corresponding to each word in the instruction I
Figure 181617DEST_PATH_IMAGE002
;
S33:枚举出指令I中所有的跨度
Figure 243245DEST_PATH_IMAGE003
,其中
Figure 435192DEST_PATH_IMAGE004
,获得每个跨度的特征向量;其中,所述跨度的大小均为预设值;
S33: enumerate all spans in instruction I
Figure 243245DEST_PATH_IMAGE003
,in
Figure 435192DEST_PATH_IMAGE004
, to obtain the feature vector of each span; wherein, the size of the span is a preset value;
S34:根据所述每个跨度的特征向量,生成语义图中谓词节点和语义角色节点对应的候选向量;S34: According to the feature vector of each span, generate candidate vectors corresponding to the predicate node and the semantic role node in the semantic graph; S35:引入损失函数对模型的训练损失进行完善,完成多模态语义角色识别对人机交互指令进行语义解析。S35: Introduce a loss function to improve the training loss of the model, and complete multi-modal semantic role recognition and semantic analysis of human-computer interaction instructions.
2.根据权利要求1所述的方法,其特征在于,所述步骤S1中,根据人机交互指令的特性,构建指令语义角色标注范式,包括:2. The method according to claim 1, characterized in that, in the step S1, according to the characteristics of the human-computer interaction instruction, constructing an instruction semantic role annotation paradigm, including: S11:采用VerbAtlas语义角色标注数据的标注方式作为标注基准;S11: Use the labeling method of VerbAtlas semantic role labeling data as the labeling benchmark; S12:对预存的中文语义角色标注范式扩展和修改,使扩展和修改后的中文语义角色标注范式适用于人机交互指令的语义解析,获得指令语义角色标注范式。S12: Extend and modify the pre-stored Chinese semantic role labeling paradigm, so that the extended and modified Chinese semantic role labeling paradigm is applicable to the semantic analysis of human-computer interaction instructions, and obtain the instruction semantic role labeling paradigm. 3.根据权利要求2所述的方法,其特征在于,所述步骤S2中,根据所述指令语义角色标注范式,结合图像采集,将语义角色标注模型的单模态形式扩展为视觉文本双模态形式,包括:3. The method according to claim 2, characterized in that, in the step S2, according to the instruction semantic role labeling paradigm, combined with image acquisition, the single-modal form of the semantic role labeling model is extended to a visual-text dual-mode state forms, including: S21:根据所述指令语义角色标注范式,通过无人系统采集图像,采用Faster-RCNN获得序列目标区域,将所述序列目标区域组成图像区域序列,对所述图像序列特征进行提取;S21: According to the instruction semantic role annotation paradigm, collect images through an unmanned system, use Faster-RCNN to obtain a sequence target area, form the sequence target area into an image area sequence, and extract features of the image sequence; S22:通过提取的图像序列特征,对语义文本端的语义角色进行辅助识别,将语义角色标注模型的单模态形式扩展为视觉文本双模态形式。S22: Use the extracted image sequence features to assist in identifying the semantic role on the semantic text side, and extend the single-modal form of the semantic role labeling model to a dual-modal form of visual text. 4.根据权利要求1所述的方法,其特征在于,所述S34中,采用两个不同的层感知机MLP层分别得到谓词候选向量
Figure 983985DEST_PATH_IMAGE005
以及语义角色候选向量
Figure 603186DEST_PATH_IMAGE006
,其中:
Figure 604640DEST_PATH_IMAGE007
4. The method according to claim 1, characterized in that, in said S34, two different layer perceptron MLP layers are used to obtain the predicate candidate vector respectively
Figure 983985DEST_PATH_IMAGE005
and semantic role candidate vectors
Figure 603186DEST_PATH_IMAGE006
,in:
Figure 604640DEST_PATH_IMAGE007
.
5. The method according to claim 4, wherein in step S35, introducing a loss function to refine the training loss of the model comprises:

constructing a semantic role labeling loss function to judge the completeness of the predicate-argument structures predicted by the model;

wherein the loss comprises an MLP scoring layer and a Biaffine scoring layer; the MLP scoring layer is used to judge the semantic frame of the current predicate node, and the Biaffine scoring layer is used to score, for each predicate in the sentence, the triple consisting of the predicate, a semantic role, and the relation between the two; cross-entropy is used to compute the loss of each triple, and the semantic role labeling loss function is given by formula (1): [formula image not reproduced in this extraction].
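A biaffine scorer over (predicate, role, relation) triples with a cross-entropy loss, as described in claim 5, can be sketched like this. The tensor shapes and the symbol `U` are illustrative assumptions; formula (1) itself is in the un-reproduced formula image.

```python
import numpy as np

rng = np.random.default_rng(1)
proj, n_labels = 4, 3                 # toy sizes (assumed)

g_pred = rng.standard_normal(proj)    # one predicate candidate vector
g_role = rng.standard_normal(proj)    # one semantic-role candidate vector

# biaffine tensor: one (proj x proj) bilinear form per relation label
U = rng.standard_normal((n_labels, proj, proj))
scores = np.einsum('i,kij,j->k', g_pred, U, g_role)  # one score per relation label

def cross_entropy(scores, gold):
    # negative log-softmax probability of the gold relation label
    z = scores - scores.max()                     # stabilised softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[gold]

loss = cross_entropy(scores, gold=1)
print(scores.shape, loss >= 0.0)  # (3,) True
```

Each bilinear form lets every (predicate, role) vector pair produce a score per relation label in one contraction, which is why biaffine layers are a common choice for structured SRL scoring.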
6. The method according to claim 4, wherein in step S35, introducing a loss function to refine the training loss of the model comprises:

constructing a modality matching function for the modality matching of image-text cross-modal feature pairs; the label of this function is defined such that if the text span corresponding to the semantic role contains the object corresponding to the target region, the output label is 1, and otherwise the label is 0; following the multi-task learning paradigm, the loss function of the modality matching function is defined as formula (2): [formula image not reproduced in this extraction].
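The 0/1 matching label of claim 6, paired with a binary cross-entropy loss, can be sketched as below. This is a minimal sketch under assumed names (`match_label`, `bce_loss`) and a word-overlap notion of "contains the object"; formula (2) itself is in the un-reproduced formula image.

```python
import math

def match_label(role_span_tokens, region_object):
    # 1 if the role's text span mentions the detected object, else 0
    return 1 if region_object in role_span_tokens else 0

def bce_loss(p, y):
    # binary cross-entropy for one (image-region, text-span) pair,
    # where p is the model's predicted match probability
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

span = ["pick", "up", "the", "red", "cup"]
print(match_label(span, "cup"), match_label(span, "box"))  # 1 0
loss = bce_loss(0.9, match_label(span, "cup"))
```

Training this auxiliary objective jointly with the SRL loss is what the claim's "multi-task learning paradigm" refers to: the matching signal pushes image-region and text-span features into a shared space.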
7. A human-computer interaction instruction parsing apparatus based on multimodal semantic role recognition, wherein the apparatus is adapted to the method of any one of claims 1-6, and the apparatus comprises:

an instruction semantic role labeling paradigm construction module, configured to construct the instruction semantic role labeling paradigm according to the characteristics of human-computer interaction instructions;

a multimodal construction module, configured to extend the single-modal form of the semantic role labeling model to a visual-text multimodal form according to the instruction semantic role labeling paradigm in combination with image acquisition;

a model training module, configured to train the visual-text multimodal form of the semantic role labeling model, completing the semantic parsing of human-computer interaction instructions through multimodal semantic role recognition.

8. The apparatus according to claim 7, wherein the instruction semantic role labeling paradigm construction module is configured to adopt the labeling scheme of the VerbAtlas semantic role labeling data as the labeling baseline; and to extend and modify a pre-stored Chinese semantic role labeling paradigm so that the extended and modified paradigm is applicable to the semantic parsing of human-computer interaction instructions, thereby obtaining the instruction semantic role labeling paradigm.

9. The apparatus according to claim 7, wherein the multimodal construction module is configured to acquire images through an unmanned system according to the instruction semantic role labeling paradigm, obtain target regions with Faster-RCNN, compose the target regions into an image region sequence, and extract features of the image region sequence; and to use the extracted image sequence features to assist the recognition of semantic roles on the text side, extending the single-modal form of the semantic role labeling model to a visual-text dual-modal form.
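The dual-modal fusion described in claims 3 and 9 — an image region sequence assisting text-side role recognition — can be sketched with a simple cross-modal attention. The shapes and the attention/concatenation fusion are illustrative assumptions, standing in for whatever detector features (e.g. Faster-RCNN region vectors) and fusion the patented model actually uses.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n_regions, n_tokens, dim = 3, 5, 4
regions = rng.standard_normal((n_regions, dim))  # region features from a detector (assumed shape)
tokens = rng.standard_normal((n_tokens, dim))    # instruction token features (assumed shape)

# each instruction token attends over the image-region sequence
attn = softmax(tokens @ regions.T, axis=-1)      # (n_tokens, n_regions), rows sum to 1
visual_ctx = attn @ regions                      # visual context vector per token
fused = np.concatenate([tokens, visual_ctx], axis=-1)  # dual-modal token features

print(fused.shape)  # (5, 8)
```

The fused token features then feed the same MLP/Biaffine scoring stack as the text-only model, which is what "extending the single-modal form to a visual-text dual-modal form" amounts to architecturally.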
CN202210659318.5A 2022-06-13 2022-06-13 Man-machine interaction instruction analysis method and device based on multi-mode semantic role recognition Active CN114757209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210659318.5A CN114757209B (en) 2022-06-13 2022-06-13 Man-machine interaction instruction analysis method and device based on multi-mode semantic role recognition

Publications (2)

Publication Number Publication Date
CN114757209A CN114757209A (en) 2022-07-15
CN114757209B true CN114757209B (en) 2022-11-11

Family

ID=82336249

Country Status (1)

Country Link
CN (1) CN114757209B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571046A (en) * 2021-06-28 2021-10-29 深圳瑞鑫泰通信有限公司 Artificial intelligent speech recognition analysis method, system, device and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9189742B2 (en) * 2013-11-20 2015-11-17 Justin London Adaptive virtual intelligent agent
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN111191620B (en) * 2020-01-03 2022-03-22 西安电子科技大学 A Construction Method of Human-Object Interaction Detection Dataset
CN111274372A (en) * 2020-01-15 2020-06-12 上海浦东发展银行股份有限公司 Method, electronic device, and computer-readable storage medium for human-computer interaction
CN112201228A (en) * 2020-09-28 2021-01-08 苏州贝果智能科技有限公司 Multimode semantic recognition service access method based on artificial intelligence
CN113590776B (en) * 2021-06-23 2023-12-12 北京百度网讯科技有限公司 Knowledge graph-based text processing method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant