CN116541507A

CN116541507A - A visual question answering method and system based on dynamic semantic graph neural network

Info

Publication number: CN116541507A
Application number: CN202310820674.5A
Authority: CN
Inventors: 吴锦梦; 管晓阳; 洪汉玉; 时愈; 马雷; 张耀宗
Original assignee: Wuhan Institute of Technology
Current assignee: Wuhan Institute of Technology
Priority date: 2023-07-06
Filing date: 2023-07-06
Publication date: 2023-08-04

Abstract

The invention discloses a visual question answering method and system based on a dynamic semantic graph neural network. The method includes: constructing a Glove word embedding model, a Bi-GRU model, a bilinear attention model, a graph attention network model, and a multi-layer A dynamic semantic graph neural network model of a perceptron model; based on a plurality of training samples, the dynamic semantic graph neural network model used for visual question answering prediction is trained to obtain a target visual question answering model; wherein each training sample includes: training The image and the training question text corresponding to the training image; the target image and the test question text corresponding to the target image are input into the target visual question answering model to obtain the target visual question answering result. The present invention can effectively improve the performance of the visual question answering model, and can improve the accuracy of the visual question answering.

Description

A visual question answering method and system based on dynamic semantic graph neural network

技术领域technical field

本发明涉及计算机视觉和自然语言处理技术领域，尤其涉及一种基于动态语义图神经网络的视觉问答方法及系统。The invention relates to the technical fields of computer vision and natural language processing, in particular to a visual question answering method and system based on a dynamic semantic graph neural network.

背景技术Background technique

视觉问答(Visual Question Answering，VQA)是一种涉及计算机视觉和自然语言处理的综合性学习任务，即给定一幅图像和一个有关图像场景的问题，模型会预测正确的答案。视觉问答系统首先需要分析问题文本的语义信息，然后与从图像中提取的视觉信息进行多模态融合，最后经过分类器得到正确答案。由于现有的方法在面对复杂问题通常简单地使用注意力机制进行跨模态的融合，这些方法旨在改善跨模态表示，没有去关注如何构建图像目标间的交互关系，没有进行空间推理。Visual Question Answering (VQA) is a comprehensive learning task involving computer vision and natural language processing, that is, given an image and a question about the image scene, the model predicts the correct answer. The visual question answering system first needs to analyze the semantic information of the question text, then perform multimodal fusion with the visual information extracted from the image, and finally get the correct answer through a classifier. Since existing methods usually simply use the attention mechanism for cross-modal fusion in the face of complex problems, these methods aim to improve cross-modal representations, without paying attention to how to construct the interactive relationship between image targets, and without spatial reasoning .

传统的视觉问答方法通常使用深度卷积神经网络和循环神经网络分别计算图像和问题表示，然后在视觉特征和语言特征之间使用注意力进行跨模态融合来预测最终答案。虽然这些方法可以解决视觉问答任务，但它们缺乏对明确检测到的对象以及它们之间的交互式语义和空间关系进行推理。因此，有论文提出将图像生成一个概率场景图，并对其执行一系列推理操作，生成的场景图包含预测对象、属性和关系的概率分布。然而在构建场景图时，缺少探究细粒度图像区域的局部语义特征，提取全局图像特征会生成高层次的图像特征，而忽略了图像中具有判别性的局部区域，无法捕捉细粒度的区分区域。同时传统单一的固态场景图能够提供来自图像中的全局语义信息，然而场景图的语义表达无法针对不同问题中的关键信息进行自适应地变化，特别是，当场景图实例过多，每个实例间推理关系对不同的问题往往不能提供有效信息，也会对后期的问题引导推理提出更严格的要求。Traditional visual question answering methods usually use deep convolutional neural networks and recurrent neural networks to compute image and question representations respectively, and then use attention to perform cross-modal fusion between visual and linguistic features to predict the final answer. While these approaches can solve visual question answering tasks, they lack reasoning about explicitly detected objects and the interactive semantic and spatial relationships between them. Therefore, some papers propose to generate a probabilistic scene graph from the image and perform a series of inference operations on it. The generated scene graph contains the probability distribution of predicted objects, attributes and relationships. However, when constructing a scene graph, there is a lack of exploring local semantic features of fine-grained image regions. Extracting global image features will generate high-level image features, while ignoring discriminative local regions in the image, and failing to capture fine-grained distinguishing regions. At the same time, the traditional single solid-state scene graph can provide global semantic information from the image, but the semantic expression of the scene graph cannot adaptively change the key information in different problems, especially when there are too many instances of the scene graph, each instance Inter-reasoning relationships often cannot provide effective information for different questions, and will also put forward stricter requirements for later question-guided reasoning.

因此，亟需提供一种技术方案解决上述技术问题。Therefore, it is urgent to provide a technical solution to solve the above technical problems.

发明内容Contents of the invention

为解决上述技术问题，本发明提供了一种基于动态语义图神经网络的视觉问答方法及系统。In order to solve the above technical problems, the present invention provides a visual question answering method and system based on a dynamic semantic graph neural network.

本发明的一种基于动态语义图神经网络的视觉问答方法的技术方案如下：A technical scheme of a visual question answering method based on a dynamic semantic graph neural network of the present invention is as follows:

构建包含Glove词嵌入模型、Bi-GRU模型、双线性注意力模型、图注意网络模型和多层感知机模型的动态语义图神经网络模型；Construct a dynamic semantic graph neural network model including Glove word embedding model, Bi-GRU model, bilinear attention model, graph attention network model and multi-layer perceptron model;

基于多个训练样本，对用于视觉问答预测的所述动态语义图神经网络模型进行训练，得到目标视觉问答模型；其中，每个训练样本包括：训练图像和该训练图像对应的训练问句文本；Based on a plurality of training samples, the dynamic semantic graph neural network model used for visual question answering prediction is trained to obtain a target visual question answering model; wherein each training sample includes: a training image and a training question text corresponding to the training image ;

将目标图像和所述目标图像对应的待测问句文本输入至所述目标视觉问答模型中，得到目标视觉问答结果。Inputting the target image and the text of the question sentence to be tested corresponding to the target image into the target visual question answering model to obtain a target visual question answering result.

本发明的一种基于动态语义图神经网络的视觉问答方法的有益效果如下：The beneficial effect of a kind of visual question answering method based on dynamic semantic graph neural network of the present invention is as follows:

本发明的方法能够有效提升视觉问答模型的性能，并能够提升视觉问答的准确率。The method of the present invention can effectively improve the performance of the visual question answering model, and can improve the accuracy of the visual question answering.

在上述方案的基础上，本发明的一种基于动态语义图神经网络的视觉问答方法还可以做如下改进。On the basis of the above solution, a visual question answering method based on a dynamic semantic graph neural network of the present invention can also be improved as follows.

进一步，还包括：Further, it also includes:

获取所述多个训练样本，并对每个训练样本对应的视觉问答结果的真实值进行标注，得到每个训练样本的真实视觉问答结果。The plurality of training samples are obtained, and the real value of the visual question answering result corresponding to each training sample is marked to obtain the real visual question answering result of each training sample.

进一步，基于多个训练样本，对用于视觉问答预测的所述动态语义图神经网络模型进行训练，得到目标视觉问答模型的步骤，包括：Further, based on a plurality of training samples, the dynamic semantic graph neural network model used for visual question answering prediction is trained to obtain the step of the target visual question answering model, including:

将任一训练样本的训练问句文本输入至所述Glove词嵌入模型，得到该训练样本的多个词嵌入向量并输入至所述Bi-GRU模型进行捕捉结构信息，得到该训练样本的多个文本词特征向量和句子级特征向量，并生成问题图；Input the training question text of any training sample into the Glove word embedding model, obtain a plurality of word embedding vectors of the training sample and input it into the Bi-GRU model to capture structural information, and obtain multiple word embedding vectors of the training sample Text word feature vectors and sentence-level feature vectors, and generate question graphs;

将所述任一训练样本的训练图像对应的真实场景图中的节点和边输入至所述Glove词嵌入模型进行词嵌入处理，得到该训练样本的包含初始节点特征向量和初始边特征向量的初始场景图；Input the nodes and edges in the real scene graph corresponding to the training image of any training sample to the Glove word embedding model to carry out word embedding processing, and obtain the initial node feature vector and the initial edge feature vector of the training sample scene graph;

基于双线性注意力模型，对所述任一训练样本的每个文本词特征向量与初始场景图进行处理，得到该训练样本的第二场景图；Based on the bilinear attention model, process each text word feature vector and initial scene graph of any training sample to obtain the second scene graph of the training sample;

将所述任一训练样本的第二场景图、多个文本词特征向量和句子级特征向量依次输入所述图注意网络模型和所述多层感知机模型，得到该训练样本的训练视觉问答结果；The second scene graph, a plurality of text word feature vectors and sentence-level feature vectors of any training sample are input into the graph attention network model and the multi-layer perceptron model in sequence to obtain the training visual question answering result of the training sample ;

基于所述任一训练样本的训练视觉问答结果和真实视觉问答结果，得到该训练样本的损失值，直至得到每个训练样本的损失值；Based on the training visual question answering result and the real visual question answering result of any training sample, the loss value of the training sample is obtained until the loss value of each training sample is obtained;

基于所有的损失值，对所述动态语义图神经网络模型进行优化，得到优化后的动态语义图神经网络模型，并将所述优化后的动态语义图神经网络模型作为所述动态语义图神经网络模型并返回执行所述将任一训练样本的训练问句文本输入至所述Glove词嵌入模型的步骤，直至满足预设迭代训练条件时，将所述优化后的动态语义图神经网络模型确定为所述目标视觉问答模型。Based on all the loss values, the dynamic semantic graph neural network model is optimized to obtain the optimized dynamic semantic graph neural network model, and the optimized dynamic semantic graph neural network model is used as the dynamic semantic graph neural network Model and return to execute the step of inputting the training question text of any training sample to the Glove word embedding model, until the preset iterative training condition is met, the optimized dynamic semantic graph neural network model is determined as The target visual question answering model.

采用上述进一步技术方案的有益效果是：进一步解决图像的场景图与该图像相关的不同问题自适应的问题，充分利用问题变化形成动态场景图，增强了问题中的单词级的语义信息与图像上的局部感兴趣区域的深层次交互，以此提升视觉问答的准确率。The beneficial effect of adopting the above-mentioned further technical solution is: to further solve the problem of adapting the scene graph of the image to different problems related to the image, make full use of the problem change to form a dynamic scene graph, and enhance the word-level semantic information in the problem and the image. The deep interaction of local regions of interest can improve the accuracy of visual question answering.

进一步，将任一训练样本的训练问句文本输入至所述Glove词嵌入模型，得到该训练样本的多个词嵌入向量并输入至所述Bi-GRU模型进行捕捉结构信息，得到该训练样本的多个文本词特征向量和句子级特征向量，并生成问题图的步骤，包括：Further, input the training question text of any training sample into the Glove word embedding model, obtain a plurality of word embedding vectors of the training sample and input them into the Bi-GRU model to capture structural information, and obtain the training sample's Multiple text word feature vectors and sentence-level feature vectors, and the steps of generating a question graph include:

对所述任一训练样本的训练问句文本进行分隔词处理，得到该训练样本的多个单词数据；其中，/>表示该训练样本的单词数据的个数，/>表示该训练样本中第/>个训练问句文本，/>表示该训练样本中的第/>个训练问句文本中的第/>个单词数据；Separating word processing is performed on the training question text of any of the training samples to obtain a plurality of word data of the training sample ; where /> Indicates the number of word data of the training sample, /> Indicates that the first /> in the training sample training question text, /> Indicates the first /> in the training sample The /> in the training question text word data;

将所述任一训练样本的每个单词数据分别输入至所述Glove词嵌入模型，得到每个单词数据的词嵌入向量/>；其中，/>表示该训练样本中的第/>个训练问句文本中的第/>个单词数据对应的词嵌入向量，/>，/>为词嵌入向量的维度；Each word data of any of the training samples respectively input to the Glove word embedding model to obtain the word embedding vector of each word data /> ; where /> Indicates the first /> in the training sample The /> in the training question text The word embedding vector corresponding to word data, /> , /> is the dimension of the word embedding vector;

将所述任一训练样本的多个词嵌入向量输入至所述Bi-GRU模型进行捕捉结构信息，得到该训练样本对应的句子级特征向量/>和词特征向量/>，并生成所述问题图；其中，/> ，/>表示该训练样本中的第/>个训练问句文本中的第/>个文本词特征向量，/>，/>为文本词特征向量的维度，/>表示单词数量。Embedding vectors of multiple words of any of the training samples Input to the Bi-GRU model to capture structural information, and obtain the sentence-level feature vector corresponding to the training sample /> and word feature vectors /> , and generate the problem graph ; where /> , /> Indicates the first /> in the training sample The /> in the training question text text word feature vector, /> , /> is the dimension of the text word feature vector, /> Indicates the number of words.

进一步，将所述任一训练样本的训练图像对应的真实场景图中的节点和边输入至所述Glove词嵌入模型进行词嵌入处理，得到该训练样本的包含初始节点特征向量和初始边特征向量的初始场景图的步骤包括：Further, the nodes and edges in the real scene graph corresponding to the training image of any training sample are input to the Glove word embedding model for word embedding processing, and the training sample contains initial node feature vectors and initial edge feature vectors The initial scene graph steps include:

根据所述任一训练样本的训练图像对应的包含标注信息的真实场景图中的节点和边输入至所述Glove词嵌入模型进行词嵌入处理，得到该训练样本的包含初始节点特征向量和初始边特征向量/>的初始场景图/>；其中，/>，，/>，/>表示候选框的数量，/>为真实场景图中标签词嵌入向量的维度，/>，/>。According to the training image of any training sample, the nodes and edges in the real scene graph corresponding to the label information are input to the Glove word embedding model for word embedding processing, and the training sample contains the initial node feature vector. and initial edge eigenvectors /> The initial scene graph for /> ; where /> , , /> , /> Indicates the number of candidate boxes, /> is the dimension of the label word embedding vector in the real scene graph, /> , /> .

进一步，基于双线性注意力模型，对所述任一训练样本的每个文本词特征向量与初始场景图进行处理，得到该训练样本的第二场景图的步骤，包括：Further, based on the bilinear attention model, each text word feature vector and the initial scene graph of any training sample are processed to obtain the second scene graph of the training sample, including:

将所述任一训练样本的初始场景图分别与该训练样本的每个词特征向量输入所述双线性注意力模型进行处理，得到该训练样本的目标节点特征向量/>；所述双线性注意力模型为：/>；The initial scene graph of any training sample respectively with each word feature vector of the training sample Input the bilinear attention model for processing to obtain the target node feature vector of the training sample /> ; The bilinear attention model is: /> ;

其中，为可学习的权重矩阵，/>代表第/>个训练问句文本中的第个单词数据对于/>产生的注意力分数，/>，/>表示第/>个第二场景图中第/>个目标对象的目标节点特征向量；in, is a learnable weight matrix, /> On behalf of No. /> The first training question text in word data for /> The resulting attention score, /> , /> Indicates the first /> No. /> in the second scene diagram The target node feature vector of a target object;

基于第一预设公式，获取所述任一训练样本的初始场景图的初始边特征向量的每个初始关系特征向量/>的边注意力权值/>，并利用第二预设公式和该训练样本的边注意力权值/>，对相应的初始关系特征向量/>分别进行转化，得到该训练样本的目标关系特征向量/>；Obtaining an initial scene graph of any training sample based on a first preset formula The initial edge eigenvector of Each initial relation eigenvector of /> The edge attention weights /> , and use the second preset formula and the side attention weights of the training sample/> , for the corresponding initial relation eigenvectors /> Transform respectively to obtain the target relationship feature vector of the training sample /> ;

其中，，/>，/>；所述第一预设公式为：/>，为可学习的权重矩阵，/>代表第/>个问题中的第/>个单词与初始关系特征向量/>产生的注意力分数，对/>进行归一化得到；所述第二预设公式为：/>；其中，/>表示第/>个第二场景图中目标/>到目标/>的关系特征向量；in, , /> , /> ; The first preset formula is: /> , is a learnable weight matrix, /> On behalf of No. /> No. /> of the question word and initial relation feature vector/> The resulting attention score, pair /> normalized to get ; The second preset formula is: /> ; where /> Indicates the first /> object in the second scene graph /> to target /> The relationship feature vector of ;

根据所述任一训练样本的目标节点特征向量和目标关系特征向量/>，得到该训练样本的多个第二场景图/>；其中，/>。According to the target node feature vector of any training sample and target relation eigenvector/> , to obtain multiple second scene graphs of the training sample /> ; where /> .

进一步，将所述任一训练样本的第二场景图、多个文本词特征向量和句子级特征向量依次输入所述图注意网络模型和所述多层感知机模型，得到该训练样本对应的训练视觉问答结果的步骤，包括：Further, the second scene graph, multiple text word feature vectors and sentence-level feature vectors of any training sample are sequentially input into the graph attention network model and the multi-layer perceptron model to obtain the training data corresponding to the training sample. Steps for visual question answering results, including:

将所述任一训练样本的多个文本词特征向量和第二场景图输入所述图注意网络模型进行迭代处理，得到多个目标词特征向量，并通过所述多层感知机模型计算每个目标词特征向量与邻接节点的权重分数，并基于通道注意力机制对所有的邻接节点进行过滤，得到并将多个节点特征向量进行聚合，得到视觉问答目标特征/>，并将该训练样本对应的句子级特征向量/>和视觉问答目标特征/>输入至softmax进行处理，得到该训练样本对应的训练视觉问答结果/>；其中，/>，。A plurality of text word feature vectors and the second scene graph of any training sample are input into the graph attention network model for iterative processing to obtain a plurality of target word feature vectors, and each of them is calculated by the multi-layer perceptron model The target word feature vector and the weight score of adjacent nodes, and filter all adjacent nodes based on the channel attention mechanism, and obtain and combine multiple node feature vectors Aggregate to get visual Q&A target features/> , and the sentence-level feature vector corresponding to the training sample /> and visual question answering target features/> Input to softmax for processing, and obtain the training visual question answering result corresponding to the training sample /> ; where /> , .

本发明的一种基于动态语义图神经网络的视觉问答系统的技术方案如下：A technical scheme of a visual question answering system based on a dynamic semantic graph neural network of the present invention is as follows:

包括：构建模块、训练模块和运行模块；Including: construction module, training module and operation module;

所述构建模块用于：构建包含Glove词嵌入模型、Bi-GRU模型、双线性注意力模型、图注意网络模型和多层感知机模型的动态语义图神经网络模型；The building block is used for: constructing the dynamic semantic graph neural network model comprising Glove word embedding model, Bi-GRU model, bilinear attention model, graph attention network model and multi-layer perceptron model;

所述训练模块用于：基于多个训练样本，对用于视觉问答预测的所述动态语义图神经网络模型进行训练，得到目标视觉问答模型；其中，每个训练样本包括：训练图像和该训练图像对应的训练问句文本；The training module is used to: based on a plurality of training samples, train the dynamic semantic graph neural network model used for visual question answering prediction to obtain a target visual question answering model; wherein each training sample includes: training images and the training The training question text corresponding to the image;

所述运行模块用于：将目标图像和所述目标图像对应的待测问句文本输入至所述目标视觉问答模型中，得到目标视觉问答结果。The running module is configured to: input the target image and the text of the question sentence corresponding to the target image into the target visual question answering model to obtain the target visual question answering result.

本发明的一种基于动态语义图神经网络的视觉问答系统的有益效果如下：The beneficial effect of a kind of visual question answering system based on dynamic semantic graph neural network of the present invention is as follows:

本发明的系统能够有效提升视觉问答模型的性能，并能够提升视觉问答的准确率。The system of the present invention can effectively improve the performance of the visual question answering model, and can improve the accuracy of the visual question answering.

在上述方案的基础上，本发明的一种基于动态语义图神经网络的视觉问答系统还可以做如下改进。On the basis of the above solution, a visual question answering system based on a dynamic semantic graph neural network of the present invention can also be improved as follows.

进一步，还包括：处理模块；Further, it also includes: a processing module;

所述处理模块用于：获取所述多个训练样本，并对每个训练样本对应的视觉问答结果的真实值进行标注，得到每个训练样本的真实视觉问答结果。The processing module is configured to: obtain the plurality of training samples, and mark the real value of the visual question answering result corresponding to each training sample, so as to obtain the real visual question answering result of each training sample.

进一步，所述训练模块包括：第一训练模块、第二训练模块、第三训练模块、第四训练模块、损失计算模块和迭代训练模块；Further, the training module includes: a first training module, a second training module, a third training module, a fourth training module, a loss calculation module and an iterative training module;

所述第一训练模块用于：将任一训练样本的训练问句文本输入至所述Glove词嵌入模型，得到该训练样本的多个词嵌入向量并输入至所述Bi-GRU模型进行捕捉结构信息，得到该训练样本的多个文本词特征向量和句子级特征向量，并生成问题图；The first training module is used to: input the training question text of any training sample into the Glove word embedding model, obtain a plurality of word embedding vectors of the training sample and input them into the Bi-GRU model to capture the structure Information, get multiple text word feature vectors and sentence-level feature vectors of the training sample, and generate a question graph;

所述第二训练模块用于：将所述任一训练样本的训练图像对应的真实场景图中的节点和边输入至所述Glove词嵌入模型进行词嵌入处理，得到该训练样本的包含初始节点特征向量和初始边特征向量的初始场景图；The second training module is used to: input the nodes and edges in the real scene diagram corresponding to the training image of any training sample to the Glove word embedding model to carry out word embedding processing, and obtain the initial node of the training sample Initial scene graph of eigenvectors and initial edge eigenvectors;

所述第三训练模块用于：基于双线性注意力模型，对所述任一训练样本的每个文本词特征向量与初始场景图进行处理，得到该训练样本的第二场景图；The third training module is used to: process each text word feature vector and initial scene graph of any training sample based on a bilinear attention model to obtain a second scene graph of the training sample;

所述第四训练模块用于：将所述任一训练样本的第二场景图、多个文本词特征向量和句子级特征向量依次输入所述图注意网络模型和所述多层感知机模型，得到该训练样本的训练视觉问答结果；The fourth training module is used to: input the second scene graph, a plurality of text word feature vectors and sentence-level feature vectors of any training sample into the graph attention network model and the multi-layer perceptron model in sequence, Obtain the training visual question answering result of the training sample;

所述损失计算模块用于：基于所述任一训练样本的训练视觉问答结果和真实视觉问答结果，得到该训练样本的损失值，直至得到每个训练样本的损失值；The loss calculation module is used to: obtain the loss value of the training sample based on the training visual question answering result and the real visual question answering result of any training sample, until the loss value of each training sample is obtained;

所述迭代训练模块用于：基于所有的损失值，对所述动态语义图神经网络模型进行优化，得到优化后的动态语义图神经网络模型，并将所述优化后的动态语义图神经网络模型作为所述动态语义图神经网络模型并返回执行所述将任一训练样本的训练问句文本输入至所述Glove词嵌入模型的步骤，直至满足预设迭代训练条件时，将所述优化后的动态语义图神经网络模型确定为所述目标视觉问答模型。The iterative training module is used to: optimize the dynamic semantic graph neural network model based on all loss values, obtain the optimized dynamic semantic graph neural network model, and transfer the optimized dynamic semantic graph neural network model to As the dynamic semantic graph neural network model and return to the step of inputting the training question text of any training sample to the Glove word embedding model, until the preset iterative training condition is met, the optimized The dynamic semantic graph neural network model is determined as the target visual question answering model.

附图说明Description of drawings

图1示出了本发明提供的一种基于动态语义图神经网络的视觉问答方法的实施例的流程示意图；Fig. 1 shows a schematic flow chart of an embodiment of a visual question answering method based on a dynamic semantic graph neural network provided by the present invention;

图2示出了本发明提供的一种基于动态语义图神经网络的视觉问答方法的实施例中步骤120的流程示意图；Fig. 2 shows a schematic flow chart of step 120 in an embodiment of a visual question answering method based on a dynamic semantic graph neural network provided by the present invention;

图3示出了本发明提供的一种基于动态语义图神经网络的视觉问答系统的实施例的结构示意图。FIG. 3 shows a schematic structural diagram of an embodiment of a visual question answering system based on a dynamic semantic graph neural network provided by the present invention.

具体实施方式Detailed ways

图1示出了本发明提供的一种基于动态语义图神经网络的视觉问答方法的实施例的流程示意图。如图1所示，该方法包括如下步骤：Fig. 1 shows a schematic flowchart of an embodiment of a visual question answering method based on a dynamic semantic graph neural network provided by the present invention. As shown in Figure 1, the method includes the following steps:

步骤110：构建包含Glove词嵌入模型、Bi-GRU模型、双线性注意力模型、图注意网络模型和多层感知机模型的动态语义图神经网络模型。Step 110: Construct a dynamic semantic graph neural network model including Glove word embedding model, Bi-GRU model, bilinear attention model, graph attention network model and multi-layer perceptron model.

其中，本实施例中的动态语义图神经网络模型由多个模型构成，包括但不限于：Glove词嵌入模型、Bi-GRU（双向动态门控循环单元）模型、双线性注意力模型、图注意网络模型和多层感知机模型。每个模型的功能作用在后文中详细描述，在此不过多赘述。Among them, the dynamic semantic graph neural network model in this embodiment is composed of multiple models, including but not limited to: Glove word embedding model, Bi-GRU (bidirectional dynamic gated recurrent unit) model, bilinear attention model, graph Pay attention to network models and multi-layer perceptron models. The function of each model is described in detail later, so I won't go into details here.

步骤120：基于多个训练样本，对用于视觉问答预测的所述动态语义图神经网络模型进行训练，得到目标视觉问答模型。Step 120: Based on a plurality of training samples, train the dynamic semantic graph neural network model used for visual question answering prediction to obtain a target visual question answering model.

其中，①每个训练样本包括：训练图像对应的初始场景图和该训练图像对应的训练问句文本。②初始场景图：用于视觉问答训练的图像场景图，用图结构的方式表示训练图像中的不同目标之间的关系。③训练问句文本为：与训练图像相关联的问句文本，任一训练图像对应的训练问句文本的数量为至少一个。③目标视觉问答模型为：经过训练样本训练后所得到的视觉问答模型。Among them, ① each training sample includes: the initial scene graph corresponding to the training image and the training question text corresponding to the training image. ②Initial scene graph: The image scene graph used for visual question answering training, which uses a graph structure to represent the relationship between different targets in the training image. ③ The training question text is: the question text associated with the training image, and the number of the training question text corresponding to any training image is at least one. ③ The target visual question answering model is: the visual question answering model obtained after training with training samples.

步骤130：将目标图像和所述目标图像对应的待测问句文本输入至所述目标视觉问答模型中，得到目标视觉问答结果。Step 130: Input the target image and the text of the question to be tested corresponding to the target image into the target visual question answering model to obtain the target visual question answering result.

其中，①目标图像对应一个初始场景图，该初始场景图为：用于视觉问答测试的图像场景图。②待测问句文本为：与目标图像对应的问句文本，其数量为至少一个。③目标视觉问答结果为：待测问句文本所对应的最终答案，该结果为预测值。Among them, ① The target image corresponds to an initial scene graph, and the initial scene graph is: an image scene graph for visual question answering test. ② The question text to be tested is: at least one question text corresponding to the target image. ③ The target visual question answering result is: the final answer corresponding to the text of the question sentence to be tested, which is the predicted value.

例如，当某个目标图像包含的内容所涉及的待测问句文本为“Where is thegreen tent in the picture?”时，目标视觉问答结果为：“bottom”（对应的真实值为：bottom）；当该目标图像包含的内容所涉及的待测问句文本为“Is the person in thepicture wearing a hat?”时，目标视觉问答结果为：“yes”（对应的真实值为：yes）。又如，当某个目标图像包含的内容所涉及的待测问句文本为“What device does the personhold?”时，目标视觉问答结果为：“camera”（对应的真实值为：camera）；当该目标图像包含的内容所涉及的待测问句文本为“What is the color of the car?”时，目标视觉问答结果为：“white”（对应的真实值为：white）。For example, when the content of a target image contains the text of the question sentence to be tested as "Where is the green tent in the picture?", the target visual question answer result is: "bottom" (the corresponding real value is: bottom); When the content of the target image contains the text of the question sentence to be tested is "Is the person in the picture wearing a hat?", the target visual question answer result is: "yes" (the corresponding true value is: yes). For another example, when the content of a target image contains the text of the question sentence to be tested as "What device does the personhold?", the result of the target visual question answer is: "camera" (the corresponding real value is: camera); when When the content of the target image contains the text of the question to be tested as "What is the color of the car?", the target visual question answer result is: "white" (the corresponding real value is: white).

较优地，还包括：Preferably, it also includes:

其中，真实视觉问答结果为：用户标注的每个训练样本的视觉问答结果的真实值。Among them, the real visual question answering result is: the real value of the visual question answering result of each training sample marked by the user.

较优地，如图2所示，步骤120包括：Preferably, as shown in Figure 2, step 120 includes:

步骤121：将任一训练样本的训练问句文本输入至所述Glove词嵌入模型，得到该训练样本的多个词嵌入向量并输入至所述Bi-GRU模型进行捕捉结构信息，得到该训练样本的多个文本词特征向量和句子级特征向量，并生成问题图。Step 121: Input the training question text of any training sample into the Glove word embedding model, obtain multiple word embedding vectors of the training sample and input them into the Bi-GRU model to capture structural information, and obtain the training sample Multiple text word feature vectors and sentence-level feature vectors of , and generate a question graph.

其中，①Glove词嵌入模型用于将问句文本中的每个单词生成对应的词嵌入向量。②Bi-GRU模型用于对词嵌入向量进行结构信息捕捉，从而获得问句文本对应的文本词特征向量和句子级特征向量生成问题图。Among them, ①Glove word embedding model is used to generate corresponding word embedding vectors for each word in the question text. ②The Bi-GRU model is used to capture the structural information of the word embedding vector, so as to obtain the text word feature vector and sentence-level feature vector corresponding to the question text to generate a question graph.

具体地，步骤121包括：Specifically, step 121 includes:

步骤1211：对所述任一训练样本的训练问句文本进行分隔词处理，得到该训练样本的多个单词数据。Step 1211: Perform word separation processing on the training question text of any training sample to obtain multiple word data of the training sample .

其中，表示该训练样本的单词数据的个数，/>表示该训练样本中第/>个训练问句文本，/>表示该训练样本中的第/>个训练问句文本中的第/>个单词数据。in, Indicates the number of word data of the training sample, /> Indicates that the first /> in the training sample training question text, /> Indicates the first /> in the training sample The /> in the training question text word data.

具体地，对任一训练样本所对应的每个训练问句文本进行分隔词处理操作，将每个训练问句文本中的所有单词转换为小写形式，得到该训练样本对应的多个单词数据，/>表示该训练样本中的第/>个训练问句文本中的单词数据的总数量。Specifically, perform word separation processing on each training question text corresponding to any training sample, convert all words in each training question text into lowercase, and obtain multiple word data corresponding to the training sample , /> Indicates the first /> in the training sample The total number of word data in the training question text.

需要说明的是，在本实施例中，训练问句文本的长度最大为14（即为14），也可根据需求对/>进行设定，在此不设限制。It should be noted that, in this embodiment, the maximum length of the training question text is 14 (ie is 14), and can also be used for /> Make settings, there are no restrictions here.

步骤1212：将所述任一训练样本的每个单词数据分别输入至所述Glove词嵌入模型，得到每个单词数据的词嵌入向量/>。Step 1212: each word data of any training sample respectively input to the Glove word embedding model to obtain the word embedding vector of each word data /> .

其中，表示该训练样本中的第/>个训练问句文本中的第/>个单词数据对应的词嵌入向量，/>，/>为词嵌入向量的维度。in, Indicates the first /> in the training sample The /> in the training question text The word embedding vector corresponding to word data, /> , /> is the dimension of the word embedding vector.

步骤1213：将所述任一训练样本的多个词嵌入向量输入至所述Bi-GRU模型进行捕捉结构信息，得到该训练样本对应的句子级特征向量/>和词特征向量/>，并生成所述问题图/>。Step 1213: embedding a plurality of words of any training sample into vectors Input to the Bi-GRU model to capture structural information, and obtain the sentence-level feature vector corresponding to the training sample /> and word feature vectors /> , and generate the problem graph /> .

其中，，表示该训练样本中的第/>个训练问句文本中的第/>个文本词特征向量，/>，为文本词特征向量的维度，/>表示单词数量。in, , Indicates the first /> in the training sample The /> in the training question text text word feature vector, /> , is the dimension of the text word feature vector, /> Indicates the number of words.

步骤122：将所述任一训练样本的训练图像对应的真实场景图中的节点和边输入至所述Glove词嵌入模型进行词嵌入处理，得到该训练样本的包含初始节点特征向量和初始边特征向量的初始场景图。Step 122: Input the nodes and edges in the real scene graph corresponding to the training image of any of the training samples to the Glove word embedding model for word embedding processing, and obtain the initial node feature vector and initial edge features of the training sample Vector initial scene graph.

具体地，根据任一训练样本的训练图像对应的包含标注信息的真实场景图中的节点和边输入至所述Glove词嵌入模型进行词嵌入处理，得到该训练样本的包含初始节点特征向量和初始边特征向量/>的初始场景图/>。Specifically, the nodes and edges in the real scene graph containing label information corresponding to the training image of any training sample are input to the Glove word embedding model for word embedding processing, and the initial node feature vector containing the training sample is obtained. and initial edge eigenvectors /> The initial scene graph for /> .

其中，，/>，/>表示候选框的数量，/>为真实场景图中标签词嵌入向量的维度，/>，/>。in, , /> , /> Indicates the number of candidate boxes, /> is the dimension of the label word embedding vector in the real scene graph, /> , /> .

步骤123：基于双线性注意力模型，对所述任一训练样本的每个文本词特征向量与初始场景图进行处理，得到该训练样本的第二场景图。Step 123: Based on the bilinear attention model, process each text word feature vector and the initial scene graph of any training sample to obtain a second scene graph of the training sample.

具体地，将所述任一训练样本的初始场景图分别与该训练样本的每个词特征向量/>输入所述双线性注意力模型进行处理，得到该训练样本的目标节点特征向量；Specifically, the initial scene graph of any training sample Respectively with each word feature vector of the training sample /> Input the bilinear attention model for processing to obtain the target node feature vector of the training sample ;

所述双线性注意力模型为：；The bilinear attention model is: ;

其中，，/>，/>；所述第一预设公式为：/>，为可学习的权重矩阵，/>代表第/>个问题中的第/>个单词与初始关系特征向量/>产生的注意力分数，对/>进行归一化得到；in, , /> , /> ; The first preset formula is: /> , is a learnable weight matrix, /> On behalf of No. /> No. /> of the question word and initial relation feature vector/> The resulting attention score, pair /> normalized to get ;

所述第二预设公式为：；其中，/>表示第/>个第二场景图中目标/>到目标/>的关系特征向量；The second preset formula is: ; where /> Indicates the first /> object in the second scene graph /> to target /> The relationship feature vector of ;

需要说明的是：①引入双线性注意力模型，将对于同一张图像的第个问题/>中的词向量/>与初始场景图输入双线性注意力模型，融合问题中关键词后，得到更新后的节点特征/>与边特征向量/>，突出场景图的关键节点的位置，从而根据问题产生对应的第二场景图/>。It should be noted that: ①Introducing a bilinear attention model, the first image of the same image will be question/> Word vectors in /> Enter the bilinear attention model with the initial scene graph, and after fusing the keywords in the question, get the updated node features/> with edge eigenvectors /> , to highlight the position of the key nodes of the scene graph, so as to generate the corresponding second scene graph according to the problem /> .

首先计算问题与节点的注意力权重，会产生基于问题的不同图像区域的注意力分数，通过注意力分数可以仅关注与给定不同问题最相关的部分场景实例：；其中，计算相关性得分/>后，对它们进行归一化，得到注意权值/>，适应问题的节点特征/>可由输入特征/>和注意力权值的加权平均值计算得到：Firstly, the attention weight of the question and the node is calculated, which will generate the attention score of different image regions based on the question. Through the attention score, we can only focus on some scene instances most relevant to the given different questions: ; where the relevance score is computed /> After that, normalize them to get attention weights /> , adapting the node characteristics of the problem /> can be entered by the feature /> and attention weights The weighted average of is calculated as:

； ;

②除了需要对节点求注意力权重之外，同样要对关系边施加注意力机制，因为关系对于问题的解答一样重要。为了捕捉问题与场景图的深层交互关系，找到与问题相关的关系边，使用如下公式计算出边注意力权值/>：②In addition to the need to seek attention weights for nodes, attention mechanisms must also be applied to relationship edges, because relationships are equally important for answering questions. In order to capture the deep interactive relationship between the problem and the scene graph, find and problem For related relationship edges, use the following formula to calculate edge attention weights /> :

； ;

其中，为可学习的权重矩阵。/>代表第/>个问题中的第/>个单词与初始边特征向量/>产生的注意力分数，之后将/>进行归一化，得到注意权值/>，初始场景图边特征向量/>在融合第/>个问题中的单词/>转化为/>：in, is a learnable weight matrix. /> On behalf of No. /> No. /> of the question words and initial edge feature vectors/> The resulting attention score will then be /> Normalize to get attention weights /> , initial scene graph edge feature vector /> In the fusion section /> word in question /> converts to /> :

③由以上方法可以得到根据第个问题建立的第二场景图为，/>，/>。第二场景图的加入了与不同问题句子级的交互，使原本复杂的场景图变得更加稀疏，更加关注问题的内容，可以更有效地预测答案。③ From the above method can be obtained according to the The second scene graph established by this problem is , /> , /> . The second scene graph adds sentence-level interaction with different questions, making the original complex scene graph more sparse, paying more attention to the content of the question, and predicting the answer more effectively.

步骤124：将所述任一训练样本的第二场景图、多个文本词特征向量和句子级特征向量依次输入所述图注意网络模型和所述多层感知机模型，得到该训练样本的训练视觉问答结果。Step 124: Input the second scene graph, multiple text word feature vectors, and sentence-level feature vectors of any training sample into the graph attention network model and the multi-layer perceptron model in sequence to obtain the training data of the training sample. Visual question answering results.

具体地，将所述任一训练样本的多个文本词特征向量和第二场景图输入所述图注意网络模型进行迭代处理，得到多个目标词特征向量，并通过所述多层感知机模型计算每个目标词特征向量与邻接节点的权重分数，并基于通道注意力机制对所有的邻接节点进行过滤，得到并将多个节点特征向量进行聚合，得到视觉问答目标特征/>，并将该训练样本对应的句子级特征向量/>和视觉问答目标特征/>输入至softmax进行处理，得到该训练样本对应的训练视觉问答结果/>。Specifically, a plurality of text word feature vectors and a second scene graph of any training sample are input into the graph attention network model for iterative processing to obtain a plurality of target word feature vectors, and pass the multi-layer perceptron model Calculate the weight score of each target word feature vector and adjacent nodes, and filter all adjacent nodes based on the channel attention mechanism, and obtain multiple node feature vectors Aggregate to get visual Q&A target features/> , and the sentence-level feature vector corresponding to the training sample /> and visual question answering target features/> Input to softmax for processing, and obtain the training visual question answering result corresponding to the training sample /> .

需要说明的是：It should be noted:

①采用图注意力网络（GAT）。与图注意力网络不同的是，本实施例中的场景图节点的更新是基于由问句文本的单词数据生成的词特征向量/>进行迭代操作，指令在节点更新过程中起到两个作用，首先在生成注意力分布的时候引入了指令向量，控制相邻节点有多少信息可以流入。其次在节点更新时，利用通道注意力对不重要的信息进行了过滤。① Adopt graph attention network (GAT). Unlike the graph attention network, the update of the scene graph nodes in this embodiment is based on the word data of the question text Generated word feature vectors /> For iterative operation, instructions play two roles in the node update process. First, instruction vectors are introduced when generating attention distribution to control how much information can flow into adjacent nodes. Second, channel attention is used to filter unimportant information when nodes are updated.

采用在图注意网络模型逐次融合个指令向量进行节点的更新。在第/>次迭代过程中，我们将第/>个问题的第/>个指令向量/>连接到第/>次迭代输出的每个节点及边缘特征/>。Sequential Fusion Using In-Graph Attention Network Model Instruction vectors are used to update the nodes. at /> During the iteration, we will pass the /> No. /> of the question instruction vectors /> Connect to page /> Each node output by iteration and edge features/> .

其中，，/>分别表示融合了第/>个问题的第/>个指令向量/>的节点特征和边缘特征，并作为第/>次迭代过程的输入。/>表示向量的拼接。in, , /> Respectively represent the fusion of the first /> No. /> of the question instruction vectors /> node features and edge features, and as the first /> input to the iterative process. /> Represents a concatenation of vectors.

②计算节点与相邻接点的权重分数，即节点的特征对节点/>的重要性，我们首先对/>个节点中的其中两个节点/>进行融合，本文中融合方式采用了多层感知机（MLP）模型。使用softmax对所有邻居节点/>进行归一化，注意力函数表达式为：②Calculate the weight scores of nodes and adjacent nodes, that is, nodes feature pair node /> The importance of two of nodes /> For fusion, the fusion method in this paper adopts the multi-layer perceptron (MLP) model. Use softmax for all neighbor nodes /> For normalization, the attention function expression is:

；其中，/>表示在第/>次迭代后节点/>的特征对节点/>注意力分数，/>表示节点/>的所有邻居节点的集合。 ; where /> Indicated at the /> node after iterations /> feature pair node /> attention score, /> represents a node /> The set of all neighbor nodes of .

③在根据邻居节点更新中心节点之前，我们引入了通道注意力机制对邻居节点在通道上进行过滤。图注意力网络（GAT）计算邻居节点变换后特征的加权平均值作为节点的新向量表示：/>；其中，/>为可训练的权重矩阵，/>为非线性激活函数。在经过/>次迭代神经网络之后，模型的最终可以得到所有节点的最终状态/>。③ Before updating the central node according to the neighbor nodes, we introduce the channel attention mechanism to filter the neighbor nodes on the channel. The graph attention network (GAT) calculates the weighted average of the transformed features of neighbor nodes as a node A new vector representation of : /> ; where /> is a trainable weight matrix, /> is a non-linear activation function. passing by /> After the iterative neural network, the model can finally get the final state of all nodes /> .

④得到消息传递迭代后，首先将消息传递后的向量所有图节点的最终状态聚合为最终状态/>，然后用问题向量预测答案/>：④ get message delivery After iteration, first pass the message vector to the final state of all graph nodes Aggregate to final state /> , and then use the question vector to predict the answer /> :

其中，是预测的答案概率，模型最终从中选出概率最大的标签，作为最终预测的答案（训练视觉问答结果）。in, is the predicted answer probability, and the model finally selects the label with the highest probability as the final predicted answer (training visual question answering result).

步骤125：基于所述任一训练样本的训练视觉问答结果和真实视觉问答结果，得到该训练样本的损失值，直至得到每个训练样本的损失值。Step 125: Obtain the loss value of the training sample based on the training visual question answering result and the real visual question answering result of any training sample, until the loss value of each training sample is obtained.

其中，①训练视觉问答结果为：训练样本对应的视觉问答的预测值。②真实视觉问答结果为：训练样本对应的视觉问答的真实值，即训练样本的标注信息。Among them, ① the result of training visual question answering is: the predicted value of the visual question answering corresponding to the training sample. ②The result of the real visual question answering is: the real value of the visual question answering corresponding to the training sample, that is, the annotation information of the training sample.

具体地，基于预设损失函数、任一训练样本的训练视觉问答结果和真实视觉问答结果，得到该训练样本的损失值，重复上述过程，直至得到每个训练样本的损失值。Specifically, based on the preset loss function, the training visual question answering result and the real visual question answering result of any training sample, the loss value of the training sample is obtained, and the above process is repeated until the loss value of each training sample is obtained.

步骤126：基于所有的损失值，对所述动态语义图神经网络模型进行优化，得到优化后的动态语义图神经网络模型，并将所述优化后的动态语义图神经网络模型作为所述动态语义图神经网络模型并返回执行步骤121，直至满足预设迭代训练条件时，将所述优化后的动态语义图神经网络模型确定为所述目标视觉问答模型。Step 126: Based on all loss values, optimize the dynamic semantic graph neural network model to obtain an optimized dynamic semantic graph neural network model, and use the optimized dynamic semantic graph neural network model as the dynamic semantic graph neural network model The graph neural network model returns to step 121 until the preset iterative training condition is met, and the optimized dynamic semantic graph neural network model is determined as the target visual question answering model.

具体地，基于所有的损失值，对动态语义图神经网络模型进行优化，得到优化后的动态语义图神经网络模型，判断优化后的动态语义图神经网络模型是否满足预设迭代训练条件，若是，则将优化后的动态语义图神经网络模型确定为目标视觉问答模型；若否，则将优化后的动态语义图神经网络模型作为动态语义图神经网络模型并返回执行步骤121，直至满足预设迭代训练条件时，将优化后的动态语义图神经网络模型确定为目标视觉问答模型。Specifically, based on all loss values, the dynamic semantic graph neural network model is optimized to obtain the optimized dynamic semantic graph neural network model, and it is judged whether the optimized dynamic semantic graph neural network model meets the preset iterative training conditions, and if so, Then the optimized dynamic semantic graph neural network model is determined as the target visual question answering model; if not, the optimized dynamic semantic graph neural network model is used as the dynamic semantic graph neural network model and returns to step 121 until the preset iteration is satisfied When training conditions, the optimized dynamic semantic graph neural network model is determined as the target visual question answering model.

需要说明的是，本实施例是在公开的GQA数据集上测试提出的算法。GQA数据集中的图像来自Visual Genome数据集，从Visual Genome中提供的场景图注释开始，应用了一个规范化过程。它包含113K张图片，大约1.2万个问题，分为大约80%、10%、10%用于训练、验证和测试。总体词汇量由3097个单词组成，包括1702个对象类、310个关系和610个对象属性。GQA的训练集和验证集中每幅图像配有场景图的信息标注，描述了场景里的目标对象的标签和属性，还有目标实体间成对的关系信息。It should be noted that this embodiment tests the proposed algorithm on the public GQA dataset. The images in the GQA dataset come from the Visual Genome dataset, and a normalization process is applied starting from the scene graph annotations provided in Visual Genome. It contains 113K images, about 12K questions, divided into about 80%, 10%, 10% for training, validation and testing. The overall vocabulary consists of 3097 words, including 1702 object classes, 310 relations and 610 object attributes. Each image in the training set and verification set of GQA is equipped with information annotations of the scene graph, which describe the labels and attributes of the target objects in the scene, as well as the paired relationship information between target entities.

本实施例的方法与以下主流的基于GQA数据集的视觉问答算法模型相比，包括Human、LCGN和GraphVQA。The method of this embodiment is compared with the following mainstream visual question answering algorithm models based on GQA datasets, including Human, LCGN and GraphVQA.

如下表1所示，与其他最先进的方法相比，本实施例的方法显示了更好的性能。以下结果表明，本实施例的方法通过问题引导的动态场景图可以自适应建模视觉对象间的语义依赖关系，较好的理解与问题相关的场景，有效提升视觉问答的准确率。As shown in Table 1 below, the method of this example shows better performance compared to other state-of-the-art methods. The following results show that the method of this embodiment can adaptively model the semantic dependencies between visual objects through the question-guided dynamic scene graph, better understand the scene related to the question, and effectively improve the accuracy of visual question answering.

表1：Table 1:

本实施例的技术方案通过协同注意力机制使得图像的场景图自适应跟该图像相关的不同的问题，充分利用问题变化形成动态场景图，增强了问题中单词级的语义信息与图像上的局部感兴趣区域的深层次交互，能够有效提升视觉问答模型的性能，并提升视觉问答的准确率。The technical solution of this embodiment enables the scene graph of the image to adapt to different problems related to the image through the collaborative attention mechanism, and makes full use of the problem change to form a dynamic scene graph, which enhances the word-level semantic information in the problem and the local image on the image. The deep interaction of the region of interest can effectively improve the performance of the visual question answering model and improve the accuracy of the visual question answering.

图3示出了本发明提供的一种基于动态语义图神经网络的视觉问答系统的实施例的结构示意图。如图3所示，该系统200包括：构建模块210、训练模块220和运行模块230。FIG. 3 shows a schematic structural diagram of an embodiment of a visual question answering system based on a dynamic semantic graph neural network provided by the present invention. As shown in FIG. 3 , the system 200 includes: a construction module 210 , a training module 220 and an operation module 230 .

所述构建模块210用于：构建包含Glove词嵌入模型、Bi-GRU模型、双线性注意力模型、图注意网络模型和多层感知机模型的动态语义图神经网络模型；The building block 210 is used for: constructing the dynamic semantic graph neural network model comprising Glove word embedding model, Bi-GRU model, bilinear attention model, graph attention network model and multi-layer perceptron model;

所述训练模块220用于：基于多个训练样本，对用于视觉问答预测的所述动态语义图神经网络模型进行训练，得到目标视觉问答模型；其中，每个训练样本包括：训练图像和该训练图像对应的训练问句文本；The training module 220 is used for: based on a plurality of training samples, the dynamic semantic graph neural network model used for visual question answering prediction is trained to obtain a target visual question answering model; wherein, each training sample includes: a training image and the The training question text corresponding to the training image;

所述运行模块230用于：将目标图像和所述目标图像对应的待测问句文本输入至所述目标视觉问答模型中，得到目标视觉问答结果。The running module 230 is configured to: input the target image and the text of the question sentence corresponding to the target image into the target visual question answering model to obtain the target visual question answering result.

较优地，还包括：处理模块；Preferably, it also includes: a processing module;

较优地，所述训练模块220包括：第一训练模块、第二训练模块、第三训练模块、第四训练模块、损失计算模块和迭代训练模块；Preferably, the training module 220 includes: a first training module, a second training module, a third training module, a fourth training module, a loss calculation module and an iterative training module;

所述第一训练模块用于：将任一训练样本的训练问句文本输入至所述Glove词嵌入模型，得到该训练样本的多个词嵌入向量并输入至所述Bi-GRU模型进行结构信息捕捉，得到该训练样本的多个文本词特征向量和句子级特征向量，并生成问题图；The first training module is used to: input the training question text of any training sample into the Glove word embedding model, obtain a plurality of word embedding vectors of the training sample and input them into the Bi-GRU model for structural information Capture, obtain multiple text word feature vectors and sentence-level feature vectors of the training sample, and generate a question graph;

上述关于本实施例的一种基于动态语义图神经网络的视觉问答系统200中的各参数和各个模块实现相应功能的步骤，可参考上文中关于一种基于动态语义图神经网络的视觉问答方法的实施例中的各参数和步骤，在此不做赘述。For the above-mentioned steps of realizing the corresponding functions of each parameter and each module in a visual question answering system 200 based on a dynamic semantic graph neural network in this embodiment, you can refer to the above about a visual question answering method based on a dynamic semantic graph neural network The parameters and steps in the embodiments will not be repeated here.

本发明实施例提供的一种存储介质，包括：存储介质中存储有指令，当计算机读取所述指令时，使所述计算机执行如一种基于动态语义图神经网络的视觉问答方法的步骤，具体可参考上文中一种基于动态语义图神经网络的视觉问答方法的实施例中的各参数和步骤，在此不做赘述。A storage medium provided by an embodiment of the present invention includes: an instruction is stored in the storage medium, and when the computer reads the instruction, the computer is made to perform steps such as a visual question answering method based on a dynamic semantic graph neural network, specifically Reference can be made to the parameters and steps in the above embodiment of a visual question answering method based on a dynamic semantic graph neural network, and details are not repeated here.

计算机存储介质例如：优盘、移动硬盘等。Computer storage media such as: USB flash drive, mobile hard disk, etc.

所属技术领域的技术人员知道，本发明可以实现为方法、系统和存储介质。Those skilled in the art know that the present invention can be implemented as a method, system and storage medium.

因此，本发明可以具体实现为以下形式，即：可以是完全的硬件、也可以是完全的软件（包括固件、驻留软件、微代码等），还可以是硬件和软件结合的形式，本文一般称为“电路”、“模块”或“系统”。此外，在一些实施例中，本发明还可以实现为在一个或多个计算机可读介质中的计算机程序产品的形式，该计算机可读介质中包含计算机可读的程序代码。可以采用一个或多个计算机可读的介质的任意组合。计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质。计算机可读存储介质例如可以是但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件，或者任意以上的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机存取存储器(RAM)，只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本文件中，计算机可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。尽管上面已经示出和描述了本发明的实施例，可以理解的是，上述实施例是示例性的，不能理解为对本发明的限制，本领域的普通技术人员在本发明的范围内可以对上述实施例进行变化、修改、替换和变型。Therefore, the present invention can be embodied in the following forms, namely: it can be complete hardware, it can also be complete software (including firmware, resident software, microcode, etc.), and it can also be a combination of hardware and software. Called a "circuit", "module" or "system". Furthermore, in some embodiments, the present invention can also be implemented in the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied therein. Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (non-exhaustive list) of computer-readable storage media include: electrical connections with one or more leads, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), Erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention, those skilled in the art can make the above-mentioned The embodiments are subject to changes, modifications, substitutions and variations.

Claims

1. A visual question-answering method based on a dynamic semantic graph neural network is characterized by comprising the following steps:

constructing a dynamic semantic graph neural network model comprising a Glove word embedding model, a Bi-GRU model, a bilinear attention model, a graph meaning network model and a multi-layer perceptron model;

training the dynamic semantic graph neural network model for visual question-answering prediction based on a plurality of training samples to obtain a target visual question-answering model; wherein each training sample comprises: training images and training question texts corresponding to the training images;

and inputting the target image and the text of the question to be tested corresponding to the target image into the target visual question-answering model to obtain a target visual question-answering result.

2. The visual question-answering method based on dynamic semantic graph neural network according to claim 1, further comprising:

and acquiring the plurality of training samples, and marking the true value of the visual question-answer result corresponding to each training sample to obtain the true visual question-answer result of each training sample.

3. The visual question-answering method based on a dynamic semantic graph neural network according to claim 2, wherein the step of training the dynamic semantic graph neural network model for visual question-answering prediction based on a plurality of training samples to obtain a target visual question-answering model comprises:

Inputting training question text of any training sample into the Glove word embedding model to obtain a plurality of word embedding vectors of the training sample, inputting the word embedding vectors into the Bi-GRU model to capture structural information, obtaining a plurality of text word feature vectors and sentence-level feature vectors of the training sample, and generating a question graph;

inputting nodes and edges in the real scene graph corresponding to the training image of any training sample into the Glove word embedding model to perform word embedding processing to obtain an initial scene graph of the training sample, wherein the initial scene graph comprises initial node feature vectors and initial edge feature vectors;

processing each text word feature vector of any training sample and the initial scene graph based on the bilinear attention model to obtain a second scene graph of the training sample;

sequentially inputting a second scene graph, a plurality of text word feature vectors and sentence level feature vectors of any training sample into the graph annotation network model and the multi-layer perceptron model to obtain a training visual question-answering result of the training sample;

obtaining a loss value of the training sample based on the training visual question-answering result and the real visual question-answering result of any training sample until the loss value of each training sample is obtained;

And optimizing the dynamic semantic graph neural network model based on all the loss values to obtain an optimized dynamic semantic graph neural network model, taking the optimized dynamic semantic graph neural network model as the dynamic semantic graph neural network model, and returning to execute the step of inputting training question text of any training sample into the Glove word embedding model until a preset iteration training condition is met, and determining the optimized dynamic semantic graph neural network model as the target vision question-answering model.

4. The visual question-answering method based on dynamic semantic graph neural network according to claim 3, wherein the step of inputting training question text of any training sample to the Glove word embedding model to obtain a plurality of word embedding vectors of the training sample and inputting the word embedding vectors to the Bi-GRU model to capture structural information to obtain a plurality of text word feature vectors and sentence-level feature vectors of the training sample, and generating a question graph comprises the steps of:

performing separator word processing on the training question text of any training sample to obtain a plurality of word data of the training sampleThe method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>Number of word data representing the training sample, +. >Indicating +.>Training question text->Representing the +.>The +.f. in the text of the training question>Individual word data;

each word data of any training sample is processedRespectively inputting the words into the Glove word embedding model to obtain each wordWord embedding vector of data->The method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>Representing the +.>The first training question textWord embedding vector corresponding to the individual word data, +.>，/>Embedding a dimension of a vector for a word;

embedding multiple words of the training samples into vectorsInputting the sentence-level feature vector into the Bi-GRU model to capture structural information, and obtaining the sentence-level feature vector corresponding to the training sample>Sum word feature vector->And generates the problem graphThe method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)> ，/>Representing the +.>The +.f. in the text of the training question>Individual text word feature vectors,/>，/>For the dimension of the text word feature vector, +.>Representing the number of words.

5. The visual question-answering method based on a dynamic semantic graph neural network according to claim 4, wherein the step of inputting nodes and edges in the real scene graph corresponding to the training image of any training sample to the Glove word embedding model to perform word embedding processing to obtain an initial scene graph of the training sample, wherein the initial scene graph comprises initial node feature vectors and initial edge feature vectors comprises:

Inputting nodes and edges in the real scene graph corresponding to the training image of any training sample and containing the labeling information into the Glove word embedding model to perform word embedding processing to obtain an initial node feature vector containing the training sampleAnd an initial edge feature vector +.>Is->The method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>，，/>，/>Representing the number of candidate boxes, +.>Embedding vector dimension for tag words in real scene graph, < ->，/>。

6. The visual question-answering method based on a dynamic semantic graph neural network according to claim 5, wherein the step of processing each text word feature vector of any training sample with an initial scene graph based on a bilinear attention model to obtain a second scene graph of the training sample comprises:

an initial scene graph of any training sampleEach word feature vector associated with the training sample>Inputting the bilinear attention model for processing to obtain the target node characteristic vector of the training sample>The method comprises the steps of carrying out a first treatment on the surface of the The bilinear attention model is: />；

Wherein,,is a weight matrix which can be learned, +.>Represents->The +.f. in the text of the training question>Individual word data +.>The resulting attention score,/- >，/>Indicate->Second scene graph +.>Target node feature vectors of the individual target objects;

based on a first preset formula, acquiring an initial scene graph of any training sampleIs>Is +.>Is +.>And side note force weight of the training sample using a second predetermined formula ++>For the corresponding initial relation feature vector +.>Respectively converting to obtain target relation feature vector +.>；

Wherein,,，/>，/>the method comprises the steps of carrying out a first treatment on the surface of the The first preset formula is:

，is a weight matrix which can be learned, +.>Represents->First of all of the questions>Individual words and initial relation feature vector +.>The resulting attention score, for->Normalizing to obtainThe method comprises the steps of carrying out a first treatment on the surface of the The second preset formula is: />The method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>Indicate->Target +.>To the target->Is a relation feature vector of (1);

target node feature vector according to any training sampleAnd target relation feature vector->Obtaining a plurality of second scene graphs of the training sample +.>The method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>。

7. The visual question-answering method based on the dynamic semantic graph neural network according to claim 6, wherein the step of sequentially inputting the second scene graph, the plurality of text word feature vectors and the sentence-level feature vectors of any training sample into the graph-meaning network model and the multi-layer perceptron model to obtain the training visual question-answering result corresponding to the training sample comprises the following steps:

Inputting the text word feature vectors and the second scene graph of any training sample into the graph annotation network model for iterative processing to obtain a plurality of target word feature vectors, calculating the weight scores of each target word feature vector and adjacent nodes through the multi-layer perceptron model, filtering all adjacent nodes based on a channel attention mechanism to obtain a plurality of node feature vectorsPolymerizing to obtain visual question-answering target feature +.>And the sentence-level feature vector corresponding to the training sample is +.>And visual question-answering target feature->Inputting into softmax for processing to obtain training visual question-answer result corresponding to the training sample>The method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>，。

8. A visual question-answering system based on a dynamic semantic graph neural network, comprising: the system comprises a construction module, a training module and an operation module;

the construction module is used for: constructing a dynamic semantic graph neural network model comprising a Glove word embedding model, a Bi-GRU model, a bilinear attention model, a graph meaning network model and a multi-layer perceptron model;

the training module is used for: training the dynamic semantic graph neural network model for visual question-answering prediction based on a plurality of training samples to obtain a target visual question-answering model; wherein each training sample comprises: training images and training question texts corresponding to the training images;

The operation module is used for: and inputting the target image and the text of the question to be tested corresponding to the target image into the target visual question-answering model to obtain a target visual question-answering result.

9. The visual question-answering system based on dynamic semantic graph neural network according to claim 8, further comprising: a processing module;

the processing module is used for: and acquiring the plurality of training samples, and marking the true value of the visual question-answer result corresponding to each training sample to obtain the true visual question-answer result of each training sample.

10. The visual question-answering system based on dynamic semantic graph neural network according to claim 9, wherein the training module comprises: the system comprises a first training module, a second training module, a third training module, a fourth training module, a loss calculation module and an iteration training module;

the first training module is used for: inputting training question text of any training sample into the Glove word embedding model to obtain a plurality of word embedding vectors of the training sample, inputting the word embedding vectors into the Bi-GRU model to capture structural information, obtaining a plurality of text word feature vectors and sentence-level feature vectors of the training sample, and generating a question graph;

The second training module is used for: inputting nodes and edges in the real scene graph corresponding to the training image of any training sample into the Glove word embedding model to perform word embedding processing to obtain an initial scene graph of the training sample, wherein the initial scene graph comprises initial node feature vectors and initial edge feature vectors;

the third training module is used for: processing each text word feature vector of any training sample and the initial scene graph based on the bilinear attention model to obtain a second scene graph of the training sample;

the fourth training module is used for: sequentially inputting a second scene graph, a plurality of text word feature vectors and sentence level feature vectors of any training sample into the graph annotation network model and the multi-layer perceptron model to obtain a training visual question-answering result of the training sample;

the loss calculation module is used for: obtaining a loss value of the training sample based on the training visual question-answering result and the real visual question-answering result of any training sample until the loss value of each training sample is obtained;

the iterative training module is used for: and optimizing the dynamic semantic graph neural network model based on all the loss values to obtain an optimized dynamic semantic graph neural network model, taking the optimized dynamic semantic graph neural network model as the dynamic semantic graph neural network model, and returning to execute the step of inputting training question text of any training sample into the Glove word embedding model until a preset iteration training condition is met, and determining the optimized dynamic semantic graph neural network model as the target vision question-answering model.