CN113792207A

CN113792207A - Cross-modal retrieval method based on multi-level feature representation alignment

Info

Publication number: CN113792207A
Application number: CN202111149240.4A
Authority: CN
Inventors: 张卫锋; 周俊峰; 王小江
Original assignee: Jiaxing University
Current assignee: Jiaxing University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2021-12-14
Anticipated expiration: 2041-09-29
Also published as: CN113792207B

Abstract

The invention discloses a cross-modal retrieval method based on multi-level feature representation alignment, and relates to the technical field of cross-modal retrieval. The present invention calculates the global similarity, local similarity and relationship similarity between two different modal data of image and text respectively in the stage of cross-modal fine-grained and accurate alignment, and fuses to obtain the comprehensive similarity of image-text, in the In the neural network training phase, the corresponding loss function is designed, the cross-modal structure constraint information is mined, the parameter learning of the retrieval model is constrained and supervised from multiple perspectives, and finally the retrieval results of the test query samples are obtained according to the comprehensive similarity of the image-text, so as to pass Introducing the fine-grained correlation between two different modal data, image and text, effectively improves the accuracy of cross-modal retrieval, and has broad market demand and application prospects in image retrieval, pattern recognition and other fields.

Description

A cross-modal retrieval method based on alignment of multi-level feature representations

技术领域technical field

本发明涉及跨模态检索技术领域，特别涉及一种基于多层次特征表示对齐的跨模态检索方法。The invention relates to the technical field of cross-modality retrieval, in particular to a cross-modality retrieval method based on alignment of multi-level feature representations.

背景技术Background technique

随着移动互联网、社交网络等新一代互联网技术的快速发展，文本、图像、视频等多模态数据呈现爆炸式增长。跨模态检索技术旨在通过挖掘和利用不同模态数据之间的关联信息，实现不同模态数据之间的跨越检索，其核心是实现跨模态数据之间的相似度度量。近年来，跨模态检索技术已成为国内外研究热点，受到学术界和工业界的广泛关注，是跨模态智能的重要研究领域之一，也是信息检索领域未来发展的重要方向。With the rapid development of new-generation Internet technologies such as mobile Internet and social networks, multi-modal data such as text, images, and videos have exploded. The cross-modal retrieval technology aims to realize the cross-retrieval between different modal data by mining and utilizing the correlation information between the different modal data, and its core is to realize the similarity measurement between the cross-modal data. In recent years, cross-modal retrieval technology has become a research hotspot at home and abroad, and has received extensive attention from academia and industry. It is one of the important research fields of cross-modal intelligence and an important direction for the future development of information retrieval.

跨模态检索同时涉及多种模态的数据，这些数据之间存在“异构鸿沟”，即它们在高层语义上相互关联，但在底层特征上呈现异构性，因此需要检索算法能够深入挖掘不同模态数据之间的关联信息，实现一种模态数据到另一种模态数据的对齐。Cross-modal retrieval involves data from multiple modalities at the same time. There is a "heterogeneous gap" between these data, that is, they are related to each other in high-level semantics, but they are heterogeneous in underlying features, so the retrieval algorithm needs to be able to dig deeper. The association information between different modal data, to realize the alignment of one modal data to another.

目前，子空间学习方法是跨模态检索的主流方法，该类方法又可细分为基于传统统计相关性分析的检索模型和基于深度学习的检索模型。其中，基于传统统计相关性分析的跨模态检索方法通过线性映射矩阵将不同模态数据映射到子空间，最大化不同模态数据之间的相关性。基于深度学习的跨模态检索方法利用深度神经网络的特征抽取能力抽取不同模态数据的有效表示，同时利用神经网络的复杂非线性映射能力挖掘跨模态数据之间复杂关联特性。At present, the subspace learning method is the mainstream method of cross-modal retrieval, which can be subdivided into retrieval models based on traditional statistical correlation analysis and retrieval models based on deep learning. Among them, the cross-modal retrieval method based on traditional statistical correlation analysis maps different modal data to subspaces through a linear mapping matrix to maximize the correlation between different modal data. The deep learning-based cross-modal retrieval method uses the feature extraction ability of deep neural network to extract effective representations of different modal data, and uses the complex nonlinear mapping ability of neural network to mine complex correlation characteristics between cross-modal data.

在实现本发明的过程中，申请人发现现有技术存在以下技术问题：In the process of realizing the present invention, the applicant found that the prior art has the following technical problems:

现有技术提供的跨模态检索方法注重图像和文本的全局特征和局部特征的表示学习、关联分析和对齐，但缺乏视觉目标之间关系的推理和关系信息的对齐，且无法全面有效利用训练数据蕴含的结构约束信息监督模型进行训练，导致跨模态检索方法对图像和文本的跨模态检索精确度较低。The cross-modal retrieval methods provided by the prior art focus on the representation learning, association analysis and alignment of global and local features of images and texts, but lack reasoning about the relationship between visual objects and alignment of relationship information, and cannot fully and effectively utilize training. The structural constraint information contained in the data supervises the training of the model, resulting in low cross-modal retrieval accuracy of images and texts for cross-modal retrieval methods.

发明内容SUMMARY OF THE INVENTION

为了解决现有技术存在的上述问题，本发明提供了一种基于多层次特征表示对齐的跨模态检索方法，通过跨模态多层次表示关联，准确衡量图像和文本之间的相似度，有效提供检索准确率，从而解决现有跨模态检索方法表示不够精细、跨模态关联不够充分的技术问题，同时，利用跨模态结构约束信息监督检索模型的训练。本发明的技术方案如下：In order to solve the above problems existing in the prior art, the present invention provides a cross-modal retrieval method based on multi-level feature representation alignment. Provide retrieval accuracy to solve the technical problems of insufficient representation of existing cross-modal retrieval methods and insufficient cross-modal correlation, and at the same time, use cross-modal structural constraint information to supervise the training of retrieval models. The technical scheme of the present invention is as follows:

根据本发明实施例的一个方面，提供一种基于多层次特征表示对齐的跨模态检索方法，其特征在于，所述方法包括：According to an aspect of the embodiments of the present invention, a cross-modal retrieval method based on multi-level feature representation alignment is provided, wherein the method includes:

获取训练数据集，对于所述训练数据集中的每组数据对，所述数据对包括图像数据、文本数据，以及所述图像数据与所述文本数据共同对应的语义标签；Obtaining a training data set, for each group of data pairs in the training data set, the data pairs include image data, text data, and a semantic label corresponding to the image data and the text data in common;

对于所述训练数据集中的每组数据对，分别提取所述数据对中图像数据对应的图像全局特征、图像局部特征和图像关系特征，以及所述数据对中文本数据对应的文本全局特征、文本局部特征和文本关系特征；For each group of data pairs in the training data set, extract the image global features, image local features and image relationship features corresponding to the image data in the data pair, and text global features, textual features corresponding to the text data in the data pair, respectively. Local features and text relationship features;

对于所述训练数据集中任一图像数据与任一文本数据组成的目标数据对，根据所述目标数据对对应的图像全局特征和文本全局特征、所述目标数据对对应的图像局部特征和文本局部特征、所述目标数据对对应的图像关系特征和文本关系特征计算得到所述目标数据对对应的图像-文本综合相似度；For a target data pair composed of any image data and any text data in the training data set, according to the image global feature and text global feature corresponding to the target data pair, the image local feature and text local feature corresponding to the target data pair The image-text comprehensive similarity corresponding to the target data pair is obtained by calculating the corresponding image relationship feature and the text relationship feature of the feature, the target data pair;

基于各组目标数据对对应的图像-文本综合相似度，设计模态间结构约束损失函数和模态内结构约束损失函数，并采用所述模态间结构约束损失函数和所述模态内结构约束损失函数对模型进行训练。Based on the image-text comprehensive similarity corresponding to each set of target data pairs, an inter-modal structural constraint loss function and an intra-modal structural constraint loss function are designed, and the inter-modal structural constraint loss function and the intra-modal structural constraint loss function are used. The constrained loss function trains the model.

在一个优选的实施例中，所述对于所述训练数据集中的每组数据对，分别提取所述数据对中图像数据对应的图像全局特征、图像局部特征和图像关系特征，以及所述数据对中文本数据对应的文本全局特征、文本局部特征和文本关系特征的步骤，包括：In a preferred embodiment, for each group of data pairs in the training data set, image global features, image local features and image relationship features corresponding to the image data in the data pairs are extracted respectively, and the data pairs The steps of text global features, text local features and text relationship features corresponding to Chinese text data include:

对于所述训练数据集中的每组数据对，采用卷积神经网络CNN提取所述数据对所对应图像数据的图像全局特征

，然后采用视觉目标检测器检测所述图像数据包括的视觉目标并提取每个视觉目标的图像局部特征

，其中，M为所述图像数据包括的视觉目标数量，

为视觉目标

的特征向量，再通过图像视觉关系编码网络提取各个视觉目标之间的图像关系特征

，其中，

为视觉目标

和视觉目标

之间的图像关系特征； For each group of data pairs in the training data set, the convolutional neural network CNN is used to extract the image global features of the image data corresponding to the data pairs

, and then use a visual target detector to detect the visual targets included in the image data and extract the image local features of each visual target

, where M is the number of visual objects included in the image data,

for visual target

feature vector, and then extract the image relationship features between each visual object through the image visual relationship coding network

,in,

for visual target

and visual target

The image relationship features between;

对于所述训练数据集中的每组数据对，采用词嵌入模型将所述数据对所对应文本数据中的每个词转换为词向量

，其中，N为所述文本数据包括的词数量，然后将各个词向量依次输入至递归神经网络，获得所述文本数据对应的文本全局特征

，再将各个词向量输入至前馈神经网络获得各个词对应的文本局部特征

，同时将各个词向量输入至文本关系编码网络提取各个词之间的文本关系特征

，其中，

为词

和词

之间的文本关系特征。 For each set of data pairs in the training data set, a word embedding model is used to convert each word in the text data corresponding to the data pair into a word vector

, where N is the number of words included in the text data, and then input each word vector into the recurrent neural network in turn to obtain the text global features corresponding to the text data

, and then input each word vector into the feedforward neural network to obtain the local text features corresponding to each word

, and input each word vector into the text relation coding network to extract the text relation features between each word

,in,

for word

and word

textual relationship between.

在一个优选的实施例中，所述对于所述训练数据集中任一图像数据与任一文本数据组成的目标数据对，根据所述目标数据对对应的图像全局特征和文本全局特征、所述目标数据对对应的图像局部特征和文本局部特征、所述目标数据对对应的图像关系特征和文本关系特征计算得到所述目标数据对对应的图像-文本综合相似度的步骤，包括：In a preferred embodiment, for the target data pair composed of any image data and any text data in the training data set, according to the image global feature and text global feature corresponding to the target data pair, the target The step of calculating the image-text comprehensive similarity corresponding to the target data pair by calculating the corresponding image local feature and text local feature, the target data pair corresponding image relationship feature and text relationship feature, including:

对于所述训练数据集中任一图像数据与任一文本数据组成的目标数据对，基于所述目标数据对中图像数据对应的图像全局特征

和文本数据对应的文本全局特征

的余弦距离，计算得到所述目标数据对对应的图像-文本全局相似度

；其中，图像-文本全局相似度

的计算公式如公式（1）： For a target data pair composed of any image data and any text data in the training data set, based on the image global feature corresponding to the image data in the target data pair

Text global features corresponding to text data

The cosine distance is calculated to obtain the image-text global similarity corresponding to the target data pair

; where, image-text global similarity

The calculation formula is as formula (1):

) 公式（1）

) Formula 1)

采用文本引导注意力机制计算所述目标数据对中图像数据包括的每个视觉目标的权重，将各个视觉目标的图像局部特征

进行对应权重加权后，经前馈神经网络映射获得新的图像局部表示

，然后采用视觉引导注意力机制计算所述目标数据对中文本数据包括的每个词的权重，将各个词的文本局部特征

进行对应权重加权后，经前馈神经网络映射得到新的文本局部表示

，根据各个图像局部表示

和各个文本局部表示

计算所有视觉目标和词的余弦相似度，并以其均值计算得到所述目标数据对对应的图像-文本局部相似度

；其中，图像-文本局部相似度

的计算公式如公式（2），M为视觉目标数量，N为词数量： The text-guided attention mechanism is used to calculate the weight of each visual target included in the image data in the target data pair, and the image local features of each visual target are calculated.

After the corresponding weights are weighted, a new local representation of the image is obtained through the feedforward neural network mapping

, and then the visual guidance attention mechanism is used to calculate the weight of each word included in the text data in the target data pair, and the text local features of each word are calculated.

After the corresponding weights are weighted, the new local representation of the text is obtained through the feedforward neural network mapping.

, according to the local representation of each image

and individual text local representations

Calculate the cosine similarity of all visual targets and words, and calculate the corresponding image-text local similarity of the target data pair with its mean

; where, the image-text local similarity

The calculation formula is as formula (2), M is the number of visual objects, and N is the number of words:

公式（2）

Formula (2)

根据所述目标数据对中各个图像关系特征和各个文本关系特征的余弦相似度均值，计算得到所述目标数据对对应的图像-文本关系相似度

；其中，图像-文本关系相似度

的计算公式如公式（3），P表示图像数据和文本数据的关系个数： According to the average cosine similarity of each image relationship feature and each text relationship feature in the target data pair, the image-text relationship similarity corresponding to the target data pair is calculated and obtained

; where, the image-text relationship similarity

The calculation formula of , such as formula (3), P represents the number of relationships between image data and text data:

公式（3）

Formula (3)

根据所述目标数据对对应的图像-文本全局相似度

、图像-文本局部相似度

、图像-文本关系相似度

计算得到所述目标数据对对应的图像-文本综合相似度

；其中，图像-文本综合相似度

的计算公式如公式（4）： According to the target data pair corresponding image-text global similarity

, image-text local similarity

, image-text relationship similarity

Calculate the image-text comprehensive similarity corresponding to the target data pair

; Among them, the image-text comprehensive similarity

The calculation formula is as formula (4):

公式（4）。

Formula (4).

在一个优选的实施例中，所述模态间结构约束损失函数的计算公式如公式（5），其中，B为样本数，

为模型超参数，

为匹配的目标数据对，

和

为非匹配的目标数据对： In a preferred embodiment, the calculation formula of the structural constraint loss function between modes is as formula (5), where B is the number of samples,

are model hyperparameters,

is the matching target data pair,

and

For non-matching target data pairs:

公式（5）

Formula (5)

所述模态内结构约束损失函数的计算公式如公式（6），其中，

为图像三元组，相比于

，

与

具有更多共同语义标签，

为文本三元组，相比于

，

与

具有更多共同语义标签：The calculation formula of the structural constraint loss function within the mode is as formula (6), wherein,

is an image triple, compared to

,

and

have more common semantic labels,

is a text triple, compared to

,

and

Has more common semantic labels:

公式（6）。

Formula (6).

在一个优选的实施例中，所述采用所述模态间结构约束损失函数和所述模态内结构约束损失函数对神经网络模型进行训练的步骤，包括：In a preferred embodiment, the step of using the inter-modal structural constraint loss function and the intra-modal structural constraint loss function to train the neural network model includes:

从所述训练数据集中随机采样获得匹配的目标数据对、非匹配的目标数据对、图像三元组、文本三元组，分别根据所述模态间结构约束损失函数计算模态间结构约束损失函数值，根据所述模态内结构约束损失函数计算模态内结构约束损失函数值，并按公式（7）进行融合，利用反向传播算法优化网络参数：The matching target data pairs, non-matching target data pairs, image triples, and text triples are randomly sampled from the training data set, and the inter-modal structural constraint loss is calculated according to the inter-modal structural constraint loss function respectively. function value, according to the structural constraint loss function in the mode, calculate the loss function value of the structure constraint in the mode, and fuse it according to formula (7), and use the back propagation algorithm to optimize the network parameters:

公式（7）

Formula (7)

其中

是超参数。 in

are hyperparameters.

在一个优选的实施例中，所述通过图像视觉关系编码网络提取各个视觉目标之间的图像关系特征

的步骤，包括： In a preferred embodiment, the image relationship feature between each visual object is extracted through an image visual relationship coding network

steps, including:

经图像视觉目标检测器获得图像中视觉目标

和视觉目标

的特征

，

，以及两个目标联合区域的特征

，采用公式（8）对上述各个特征进行融合，计算得到各个关系特征： The visual target in the image is obtained by the image visual target detector

and visual target

Characteristics

,

, and the features of the two target joint regions

, using formula (8) to fuse the above features, and calculate each relationship feature:

公式（8）

Formula (8)

其中[]为向量拼接操作，

为神经元激活函数，

为模型参数。 Where [] is the vector concatenation operation,

is the neuron activation function,

are model parameters.

在一个优选的实施例中，所述将各个词向量输入至文本关系编码网络提取各个词之间的文本关系特征

的步骤，包括： In a preferred embodiment, the inputting each word vector into a text relational coding network extracts textual relational features between each word

steps, including:

在文本关系编码网络中，采用公式（9）计算词

和词

之间的文本关系特征

： In the text relational coding network, formula (9) is used to calculate the word

and word

textual relational features between

:

公式（9）

Formula (9)

其中，

表示神经元激活函数，

为模型参数。 in,

represents the neuron activation function,

are model parameters.

在一个优选的实施例中，所述采用文本引导注意力机制计算所述目标数据对中图像数据包括的每个视觉目标的权重，将各个视觉目标的图像局部特征

的步骤，包括： In a preferred embodiment, the weight of each visual target included in the image data in the target data pair is calculated by using a text-guided attention mechanism, and the image local features of each visual target are combined

steps, including:

采用文本引导注意力机制，通过公式（10）计算图像中每个视觉目标的权重：Using the text-guided attention mechanism, the weight of each visual object in the image is calculated by formula (10):

公式（10）

Formula (10)

其中，

、

为模型参数； in,

,

are model parameters;

通过公式（11）对每个视觉目标进行加权，并经过前馈神经网络映射获得新的图像局部表示

： Each visual object is weighted by formula (11), and a new image local representation is obtained through feedforward neural network mapping

:

公式（11）

Formula (11)

其中，

为模型参数。 in,

are model parameters.

在一个优选的实施例中，所述采用视觉引导注意力机制计算所述目标数据对中文本数据包括的每个词的权重，将各个词的文本局部特征

的步骤，包括： In a preferred embodiment, the visual guidance attention mechanism is used to calculate the weight of each word included in the text data in the target data pair, and the local text features of each word

steps, including:

采用视觉引导注意力机制，通过公式（12）计算文本中每个词的权重：Using a vision-guided attention mechanism, the weight of each word in the text is calculated by formula (12):

公式（12）

Formula (12)

其中，

、

为模型参数； in,

,

are model parameters;

通过公式（13）对各个词的文本局部特征

进行对应权重加权，并经过前馈神经网络映射获得新的文本局部表示

： By formula (13), the text local features of each word are

The corresponding weights are weighted, and a new local representation of the text is obtained through feedforward neural network mapping

:

公式（13）

Formula (13)

其中，

为模型参数。 in,

are model parameters.

在一个优选的实施例中，所述训练数据集通过Wikipedia、MS COCO、Pascal Voc获取。In a preferred embodiment, the training data set is obtained through Wikipedia, MS COCO, and Pascal Voc.

与现有技术相比，本发明提供的一种基于多层次特征表示对齐的跨模态检索方法具有以下优点：Compared with the prior art, the cross-modal retrieval method based on the alignment of multi-level feature representations provided by the present invention has the following advantages:

本发明提供的一种基于多层次特征表示对齐的跨模态检索方法，通过在跨模态细粒度精确对齐阶段，分别计算图像和文本两种不同模态数据之间的全局相似度、局部相似度和关系相似度，并融合得到图像-文本综合相似度，在网络训练阶段，设计相应损失函数，挖掘跨模态结构约束信息，从多个角度约束和监督检索模型的参数学习，最后根据图像-文本综合相似度获取测试查询样例的检索结果，从而通过引入图像和文本两种不同模态数据之间的细粒度关联关系，有效提高跨模态检索的准确率，在图文检索、模式识别等领域具有广泛的市场需求和应用前景。The present invention provides a cross-modal retrieval method based on multi-level feature representation alignment. In the cross-modal fine-grained and precise alignment stage, the global similarity and local similarity between two different modal data of image and text are calculated respectively. In the network training stage, the corresponding loss function is designed, the cross-modal structural constraint information is mined, the parameter learning of the retrieval model is constrained and supervised from multiple perspectives, and finally, according to the image - Text comprehensive similarity to obtain the retrieval results of test query samples, so as to effectively improve the accuracy of cross-modal retrieval by introducing fine-grained correlation between image and text two different modal data, in image and text retrieval, mode Identification and other fields have a wide range of market demand and application prospects.

附图说明Description of drawings

此处的附图被并入说明书中并构成本说明书的一部分，示出了符合本发明的实施例，并于说明书一起用于解释本发明的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description serve to explain the principles of the invention.

图1是本发明一个实施例提供的一种实施环境的示意图。FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present invention.

图2是根据一示例性实施例示出的一种基于多层次特征表示对齐的跨模态检索方法的方法流程图。Fig. 2 is a method flowchart of a cross-modal retrieval method based on alignment of multi-level feature representations according to an exemplary embodiment.

图3是本发明实施例示出的一种模态间结构约束损失示意图。FIG. 3 is a schematic diagram of a structural constraint loss between modes according to an embodiment of the present invention.

图4是本发明实施例示出的一种模态内结构约束损失示意图。FIG. 4 is a schematic diagram of an intra-modal structural constraint loss according to an embodiment of the present invention.

图5是本发明实施例进行文本检索图像的一种结果示意图。FIG. 5 is a schematic diagram of a result of text retrieval of images according to an embodiment of the present invention.

图6是根据一示例性实施例示出的一种用于实现基于多层次特征表示对齐的跨模态检索方法的装置框图。Fig. 6 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on alignment of multi-level feature representations, according to an exemplary embodiment.

图7是根据一示例性实施例示出的一种用于实现基于多层次特征表示对齐的跨模态检索方法的装置框图。Fig. 7 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on alignment of multi-level feature representations according to an exemplary embodiment.

具体实施方式Detailed ways

为了使本发明的目的、技术方案和优点更加清楚，以下结合具体实施例（但不限于所举实施例)与附图详细描述本发明，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例，都属于本发明保护的范围。In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be described in detail below with reference to specific embodiments (but not limited to the illustrated embodiments) and the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention. , not all examples. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

本发明实施例可以适用于多种场景，其涉及的实施环境可以包括单个服务器的输入输出场景，或者终端与服务器的互动场景。当实施环境为单个服务器的输入输出场景时，图像数据和文本数据的获取和存储主体均为服务器；当实施环境为终端与服务器的互动场景，此时，实施例所涉及的实施环境的示意图可以如图1所示。在图1所示实施环境的示意图中，该实施环境包括终端101和服务器102。The embodiments of the present invention may be applicable to various scenarios, and the implementation environment involved may include an input and output scenario of a single server, or an interaction scenario between a terminal and a server. When the implementation environment is an input and output scene of a single server, the acquisition and storage of image data and text data are both servers; when the implementation environment is an interaction scene between a terminal and a server, at this time, the schematic diagram of the implementation environment involved in the embodiment can be As shown in Figure 1. In the schematic diagram of the implementation environment shown in FIG. 1 , the implementation environment includes a terminal 101 and a server 102 .

终端101是运行有至少一个客户端的电子设备，客户端是某个应用程序的客户端，又称APP（Application，应用程序）。终端101可以是智能手机、平板电脑等。The terminal 101 is an electronic device running at least one client, and the client is a client of an application program, also known as APP (Application, application program). The terminal 101 may be a smartphone, a tablet computer, or the like.

终端101和服务器102之间通过无线或有线网络连接。终端101用于向服务器102发送数据，或，终端用于接收服务器102发送的数据。在一种可能的实施方式中，终端101可以向服务器102发送图像数据或文本数据中的至少一种。The terminal 101 and the server 102 are connected through a wireless or wired network. The terminal 101 is used for sending data to the server 102 , or the terminal is used for receiving data sent by the server 102 . In a possible implementation, the terminal 101 may send at least one of image data or text data to the server 102 .

服务器102用于接收终端101发送的数据，或，服务器102用于向终端101发送数据。其中，服务器102可以对终端101发送的数据进行分析处理，从而从数据库中匹配出相似度最高的图像数据和文本数据并发送至终端101。The server 102 is configured to receive data sent by the terminal 101 , or the server 102 is configured to send data to the terminal 101 . The server 102 can analyze and process the data sent by the terminal 101 , so as to match the image data and text data with the highest similarity from the database and send them to the terminal 101 .

图2是根据一示例性实施例示出的一种基于多层次特征表示对齐的跨模态检索方法的方法流程图，如图2所示，该一种基于多层次特征表示对齐的跨模态检索方法，其特征在于，所述方法包括：Fig. 2 is a method flowchart of a cross-modal retrieval method based on alignment of multi-level feature representations according to an exemplary embodiment. As shown in Fig. 2, the cross-modal retrieval method based on alignment of multi-level feature representations A method, characterized in that the method comprises:

步骤100：获取训练数据集，对于所述训练数据集中的每组数据对，所述数据对包括图像数据、文本数据，以及所述图像数据与所述文本数据共同对应的语义标签。Step 100: Acquire a training data set. For each data pair in the training data set, the data pair includes image data, text data, and a semantic label corresponding to the image data and the text data.

需要说明的是，文本数据可以为任意语种对应的文本内容，比如英文、中文、日文、德文等；图像数据可以为任意色彩对应的图像内容，比如彩色图像、灰度图像等。It should be noted that the text data can be text content corresponding to any language, such as English, Chinese, Japanese, German, etc.; the image data can be image content corresponding to any color, such as color images, grayscale images, etc.

步骤200：对于所述训练数据集中的每组数据对，分别提取所述数据对中图像数据对应的图像全局特征、图像局部特征和图像关系特征，以及所述数据对中文本数据对应的文本全局特征、文本局部特征和文本关系特征。Step 200: For each group of data pairs in the training data set, extract the image global feature, image local feature and image relationship feature corresponding to the image data in the data pair, and the text global feature corresponding to the text data in the data pair. features, text local features, and text relation features.

在一个优选的实施例中，步骤200具体包括：In a preferred embodiment, step 200 specifically includes:

步骤210：对于所述训练数据集中的每组数据对，采用卷积神经网络CNN提取所述数据对所对应图像数据的图像全局特征

，其中，M为所述图像数据包括的视觉目标数量，

为视觉目标

，其中，

为视觉目标

和视觉目标

之间的图像关系特征。 Step 210: For each group of data pairs in the training data set, use a convolutional neural network CNN to extract the global image features of the image data corresponding to the data pairs

, where M is the number of visual objects included in the image data,

for visual target

,in,

for visual target

and visual target

The image relationship features between them.

步骤220：对于所述训练数据集中的每组数据对，采用词嵌入模型将所述数据对所对应文本数据中的每个词转换为词向量

，其中，

为词

和词

之间的文本关系特征。 Step 220: For each group of data pairs in the training data set, use a word embedding model to convert each word in the text data corresponding to the data pair into a word vector

,in,

for word

and word

textual relationship between.

通过上述步骤200的实施，可实现跨模态多层次精细化表示。Through the implementation of the above-mentioned step 200, a cross-modal multi-level refined representation can be realized.

步骤300：对于所述训练数据集中任一图像数据与任一文本数据组成的目标数据对，根据所述目标数据对对应的图像全局特征和文本全局特征、所述目标数据对对应的图像局部特征和文本局部特征、所述目标数据对对应的图像关系特征和文本关系特征计算得到所述目标数据对对应的图像-文本综合相似度。Step 300: For the target data pair composed of any image data and any text data in the training data set, according to the image global feature and text global feature corresponding to the target data pair, and the image local feature corresponding to the target data pair. The image-text comprehensive similarity corresponding to the target data pair is obtained by calculating with the local text feature, the image relationship feature and the text relationship feature corresponding to the target data pair.

在一个优选的实施例中，步骤300具体包括：In a preferred embodiment, step 300 specifically includes:

步骤310：对于所述训练数据集中任一图像数据与任一文本数据组成的目标数据对，基于所述目标数据对中图像数据对应的图像全局特征

和文本数据对应的文本全局特征

。 Step 310: For a target data pair composed of any image data and any text data in the training data set, based on the image global feature corresponding to the image data in the target data pair

Text global features corresponding to text data

.

其中，图像-文本全局相似度

的计算公式如公式（1）： Among them, the image-text global similarity

The calculation formula is as formula (1):

) 公式（1）

) Formula 1)

步骤320：采用文本引导注意力机制计算所述目标数据对中图像数据包括的每个视觉目标的权重，将各个视觉目标的图像局部特征

，根据各个图像局部表示

和各个文本局部表示

。 Step 320: Calculate the weight of each visual target included in the image data in the target data pair by using the text-guided attention mechanism, and combine the image local features of each visual target.

, according to the local representation of each image

and individual text local representations

.

其中，图像-文本局部相似度

的计算公式如公式（2），M为视觉目标数量，N为词数量： Among them, the image-text local similarity

The calculation formula of is as formula (2), M is the number of visual objects, and N is the number of words:

公式（2）

Formula (2)

步骤330：根据所述目标数据对中各个图像关系特征和各个文本关系特征的余弦相似度均值，计算得到所述目标数据对对应的图像-文本关系相似度

。其中，图像-文本关系相似度

的计算公式如公式（3），P表示图像数据和文本数据的关系个数： Step 330: According to the average cosine similarity of each image relationship feature and each text relationship feature in the target data pair, calculate the image-text relationship similarity corresponding to the target data pair.

. Among them, the image-text relationship similarity

公式（3）

Formula (3)

步骤340：根据所述目标数据对对应的图像-文本全局相似度

、图像-文本局部相似度

、图像-文本关系相似度计算得到所述目标数据对对应的图像-文本综合相似度

。 Step 340: According to the target data, the corresponding image-text global similarity

, image-text local similarity

, the image-text relationship similarity is calculated to obtain the corresponding image-text comprehensive similarity of the target data pair

.

其中，图像-文本综合相似度

的计算公式如公式（4）： Among them, the image-text comprehensive similarity

The calculation formula is as formula (4):

公式（4）

Formula (4)

通过上述步骤300的实施，可实现跨模态细粒度精确对齐。Through the implementation of the above step 300, fine-grained and precise alignment across modalities can be achieved.

步骤400：基于各组目标数据对对应的图像-文本综合相似度，设计模态间结构约束损失函数和模态内结构约束损失函数，并采用所述模态间结构约束损失函数和所述模态内结构约束损失函数对神经网络模型进行训练。Step 400: Design the inter-modal structural constraint loss function and the intra-modal structural constraint loss function based on the image-text comprehensive similarity corresponding to each set of target data pairs, and use the inter-modal structural constraint loss function and the modality constraint loss function. The neural network model is trained using the in-state structure-constrained loss function.

为模型超参数，

为匹配的目标数据对，

和

are model hyperparameters,

is the matching target data pair,

and

For non-matching target data pairs:

公式（5）

Formula (5)

为图像三元组，相比于

，

与

具有更多共同语义标签，

为文本三元组，相比于

，

与

is an image triple, compared to

,

and

have more common semantic labels,

is a text triple, compared to

,

and

Has more common semantic labels:

公式（6）

Formula (6)

其中，图3是本发明实施例示出的一种模态间结构约束损失示意图。3 is a schematic diagram of a structural constraint loss between modes according to an embodiment of the present invention.

公式（7）

Formula (7)

其中

是超参数。 in

are hyperparameters.

其中，图4是本发明实施例示出的一种模态内结构约束损失示意图。Among them, FIG. 4 is a schematic diagram of an intra-modal structural constraint loss shown in an embodiment of the present invention.

通过上述步骤400的实施，可实现利用跨模态结构约束信息监督检索模型的训练，从而使网络训练朝着拉升匹配的目标数据对之间相似度，降低非匹配的目标数据对之间相似度的方向进行，同时使训练后的网络能够学习到更具判别力的图像和文本表示。Through the implementation of the above step 400, it is possible to use the cross-modal structural constraint information to supervise the training of the retrieval model, so that the network training can increase the similarity between matched target data pairs and reduce the similarity between non-matched target data pairs. degree direction, while enabling the trained network to learn more discriminative image and text representations.

steps, including:

经图像视觉目标检测器获得图像中视觉目标

和视觉目标

的特征

，

，以及两个目标联合区域的特征

and visual target

Characteristics

,

, and the features of the two target joint regions

公式（8）

Formula (8)

其中[]为向量拼接操作，

为神经元激活函数，

为模型参数。 Where [] is the vector concatenation operation,

is the neuron activation function,

are model parameters.

steps, including:

在文本关系编码网络中，采用公式（9）计算词

和词

之间的文本关系特征

and word

textual relational features between

:

公式（9）

Formula (9)

其中，

表示神经元激活函数，

为模型参数。 in,

represents the neuron activation function,

are model parameters.

steps, including:

公式（10）

Formula (10)

其中，

、

为模型参数； in,

,

are model parameters;

:

公式（11）

Formula (11)

其中，

为模型参数。 in,

are model parameters.

steps, including:

公式（12）

Formula (12)

其中，

、

为模型参数； in,

,

are model parameters;

通过公式（13）对各个词的文本局部特征

： By formula (13), the text local features of each word are

:

公式（13）

Formula (13)

其中，

为模型参数。 in,

are model parameters.

需要说明的是，当采用上述步骤100-400实现神经网络模型的训练后，不同模态的数据经过神经网络模型计算就能准确输出二者之间的相似度。使用测试数据集中的任意一种模态类型作为查询模态，以另一种模态类型作为目标模态，将查询模态的每个数据作为查询样例，检索目标模态中的数据，按照公式（4）所示图像-文本综合相似度计算公式，计算查询样例和查询目标的相似性。在一种可能的实施方式中，神经网络模型可以将相似性最高的目标模态数据作为匹配数据进行输出，或，神经网络模型将各个神经网络模型相似性按照从大到小排序，得到预设数量的目标模态数据的相关结果列表，从而实现不同模态数据间的跨模态检索作业。It should be noted that after the above steps 100-400 are used to implement the training of the neural network model, the similarity between the data of different modes can be accurately outputted through the calculation of the neural network model. Use any modality type in the test data set as the query modality, use another modality type as the target modality, take each data of the query modality as a query sample, retrieve the data in the target modality, and follow The image-text comprehensive similarity calculation formula shown in formula (4) calculates the similarity between the query sample and the query target. In a possible implementation, the neural network model may output the target modal data with the highest similarity as matching data, or the neural network model may sort the similarities of each neural network model in descending order to obtain a preset A list of relevant results for a large number of target modal data, so as to realize cross-modal retrieval operations between different modal data.

本实施例采用了MS COCO跨模态数据集进行实验，该数据集由文献（T. Lin, etal. Microsoft COCO: Common objects in context, ECCV 2014, pp.740-755.）首次提出，已成为跨模态检索领域最常用的实验数据集之一。该数据集中的每张图片均带有5个文本标注，其中82783张图片及其文本标注作为训练样本集，在剩余样本中随机挑选5000张图片及其文本标注作为测试样本集。为了更好地说明本发明实施例提供的基于多层次特征表示对齐的跨模态检索方法的有益效果，将本发明提供的基于多层次特征表示对齐的跨模态检索方法与以下3种现有跨模态检索方法进行实验测试比对：This example uses the MS COCO cross-modal dataset for experiments, which was first proposed by the literature (T. Lin, etal. Microsoft COCO: Common objects in context, ECCV 2014, pp.740-755.), and has become a One of the most commonly used experimental datasets in the field of cross-modal retrieval. Each image in this dataset has 5 text annotations, of which 82,783 images and their text annotations are used as the training sample set, and 5,000 images and their text annotations are randomly selected from the remaining samples as the test sample set. In order to better illustrate the beneficial effects of the cross-modal retrieval method based on alignment of multi-level feature representations provided by the embodiments of the present invention, the cross-modal retrieval method based on alignment of multi-level feature representations provided by the present invention is compared with the following three existing Experimental test comparison of cross-modal retrieval methods:

现有方法一：文献（I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun,Order-embeddings ofimages and language, ICLR, 2016.）中记载的Order-embedding方法。Existing method 1: The Order-embedding method described in the literature (I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, Order-embeddings of images and language, ICLR, 2016.).

现有方法二：文献（F. Faghri, D. Fleet, R. Kiros, and S. Fidler, VSE++:Improved visualsemantic embeddings with hard negatives, BMVC, 2018.）中记载的VSE++方法。Existing method 2: The VSE++ method described in the literature (F. Faghri, D. Fleet, R. Kiros, and S. Fidler, VSE++: Improved visualsemantic embeddings with hard negatives, BMVC, 2018.).

现有方法三：文献（J. Yu, W. Zhang, Y. Lu, Z. Qin, et al. Reasoning onthe relation: Enhancing visual representation for visual question answeringand cross-modal retrieval, IEEE Transactions on Multimedia, 22(12):3196-3209,2020.）中记载的c-VRANet方法。Existing method three: literature (J. Yu, W. Zhang, Y. Lu, Z. Qin, et al. Reasoning on the relation: Enhancing visual representation for visual question answering and cross-modal retrieval, IEEE Transactions on Multimedia, 22(12) ): 3196-3209, 2020.) The c-VRANet method described in.

实验采用跨模态检索领域常用的R@n指标来评测跨模态检索的准确率，该指标表示检索返回的n个样例中正确样例的百分比，该指标越高表示检索的结果越好，本实验中n分别取1，5，10。In the experiment, the R@n index commonly used in the field of cross-modal retrieval is used to evaluate the accuracy of cross-modal retrieval. This index represents the percentage of correct samples among the n samples returned by the retrieval. , in this experiment, n is taken as 1, 5, and 10, respectively.

表一Table I

通过表一示出的数据可知，与现有跨模态检索方法相比，本发明提供的一种基于多层次特征表示对齐的跨模态检索方法在图像数据检索文本数据，以及文本数据检索图像数据两大任务上的检索准确率均有明显提升，从而充分证明了本发明提出的图像文本全局-局部-关系多层次特征表示精细化对齐的有效性。为了便于理解，还示出采用本发明实施例进行文本检索图像的结果示意图，如图5所示，其中，第一列为检索用文本，第二列为数据集给定的匹配图像，第三列到第七列为相似度前五的对应检索结果。It can be seen from the data shown in Table 1 that, compared with the existing cross-modal retrieval method, the cross-modal retrieval method based on the alignment of multi-level feature representation provided by the present invention retrieves text data from image data, and retrieves image data from text data. The retrieval accuracy rates of the two major tasks of the data are significantly improved, which fully proves the effectiveness of the refined alignment of the global-local-relational multi-level feature representation of the image text proposed by the present invention. For ease of understanding, a schematic diagram of the results of text retrieval images using an embodiment of the present invention is also shown, as shown in FIG. 5 , in which the first column is the text for retrieval, the second column is the matching image given by the dataset, and the third column is the text for retrieval. Columns to the seventh column are the corresponding search results of the top five similarities.

下面的实验结果表明，与现有方法相比，本发明基于多层次特征表示对齐的跨模态检索方法，可以取得更高的检索准确率。The following experimental results show that, compared with the existing methods, the cross-modal retrieval method based on the alignment of multi-level feature representations of the present invention can achieve higher retrieval accuracy.

综上所述，本发明提供的一种基于多层次特征表示对齐的跨模态检索方法，通过在跨模态细粒度精确对齐阶段，分别计算图像和文本两种不同模态数据之间的全局相似度、局部相似度和关系相似度，并融合得到图像-文本综合相似度，在网络训练阶段，设计相应损失函数，挖掘跨模态结构约束信息，从多个角度约束和监督检索模型的参数学习，最后根据图像-文本综合相似度获取测试查询样例的检索结果，从而通过引入图像和文本两种不同模态数据之间的细粒度关联关系，有效提高跨模态检索的准确率，在图文检索、模式识别等领域具有广泛的市场需求和应用前景。To sum up, the present invention provides a cross-modal retrieval method based on multi-level feature representation alignment. In the cross-modal fine-grained and precise alignment stage, the global image and text data between two different modal data are calculated respectively. Similarity, local similarity and relational similarity are combined to obtain the comprehensive image-text similarity. In the network training stage, the corresponding loss function is designed to mine the cross-modal structural constraint information, and the parameters of the retrieval model are constrained and supervised from multiple perspectives. Finally, the retrieval results of the test query samples are obtained according to the comprehensive image-text similarity, so as to effectively improve the accuracy of cross-modal retrieval by introducing the fine-grained correlation between the two different modal data of image and text. Graphic retrieval, pattern recognition and other fields have a wide range of market demand and application prospects.

图6是根据一示例性实施例示出的一种用于实现基于多层次特征表示对齐的跨模态检索方法的装置框图。例如，装置600可以是移动电话，计算机，数字广播终端，消息收发设备，游戏控制台，平板设备，医疗设备，健身设备，个人数字助理等。Fig. 6 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on alignment of multi-level feature representations, according to an exemplary embodiment. For example, apparatus 600 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, and the like.

参照图6，装置600可以包括以下一个或多个组件：处理组件602，存储器604，电源组件606，多媒体组件608，音频组件610，输入/输出（I/ O）的接口612，传感器组件614，以及通信组件616。6, the apparatus 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and communication component 616 .

处理组件602通常控制装置600的整体操作，诸如与显示，电话呼叫，数据通信，相机操作和记录操作相关联的操作。处理组件602可以包括一个或多个处理器620来执行指令，以完成上述的方法的全部或部分步骤。此外，处理组件602可以包括一个或多个模块，便于处理组件602和其他组件之间的交互。例如，处理组件602可以包括多媒体模块，以方便多媒体组件608和处理组件602之间的交互。The processing component 602 generally controls the overall operation of the device 600, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or some of the steps of the methods described above. Additionally, processing component 602 may include one or more modules that facilitate interaction between processing component 602 and other components. For example, processing component 602 may include a multimedia module to facilitate interaction between multimedia component 608 and processing component 602.

存储器604被配置为存储各种类型的数据以支持在装置600的操作。这些数据的示例包括用于在装置600上操作的任何应用程序或方法的指令，联系人数据，电话簿数据，消息，图片，视频等。存储器604可以由任何类型的易失性或非易失性存储设备或者它们的组合实现，如静态随机存取存储器（SRAM），电可擦除可编程只读存储器（EEPROM），可擦除可编程只读存储器（EPROM），可编程只读存储器（PROM），只读存储器（ROM），磁存储器，快闪存储器，磁盘或光盘。Memory 604 is configured to store various types of data to support operations at device 600 . Examples of such data include instructions for any application or method operating on device 600, contact data, phonebook data, messages, pictures, videos, and the like. Memory 604 may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk.

电源组件606为装置600的各种组件提供电力。电源组件606可以包括电源管理系统，一个或多个电源，及其他与为装置600生成、管理和分配电力相关联的组件。Power supply assembly 606 provides power to the various components of device 600 . Power components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 600 .

多媒体组件608包括在装置600和目标用户之间的提供一个输出接口的屏幕。在一些实施例中，屏幕可以包括液晶显示器（LCD）和触摸面板（TP）。如果屏幕包括触摸面板，屏幕可以被实现为触摸屏，以接收来自目标用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。触摸传感器可以不仅感测触摸或滑动动作的边界，而且还检测与触摸或滑动操作相关的持续时间和压力。在一些实施例中，多媒体组件608包括一个前置摄像头和/或后置摄像头。当装置600处于操作模式，如拍摄模式或视频模式时，前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统或具有焦距和光学变焦能力。Multimedia component 608 includes screens that provide an output interface between device 600 and the intended user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a target user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. A touch sensor can sense not only the boundaries of a touch or swipe action, but also the duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 608 includes a front-facing camera and/or a rear-facing camera. When the apparatus 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.

音频组件610被配置为输出和/或输入音频信号。例如，音频组件610包括一个麦克风（MIC），当装置600处于操作模式，如呼叫模式、记录模式和语音识别模式时，麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器604或经由通信组件616发送。在一些实施例中，音频组件610还包括一个扬声器，用于输出音频信号。Audio component 610 is configured to output and/or input audio signals. For example, audio component 610 includes a microphone (MIC) that is configured to receive external audio signals when device 600 is in operating modes, such as call mode, recording mode, and voice recognition mode. The received audio signal may be further stored in memory 604 or transmitted via communication component 616 . In some embodiments, audio component 610 also includes a speaker for outputting audio signals.

I/ O接口612为处理组件602和外围接口模块之间提供接口，上述外围接口模块可以是键盘，点击轮，按钮等。这些按钮可包括但不限于：主页按钮、音量按钮、启动按钮和锁定按钮。I/O interface 612 provides an interface between processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, and the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.

传感器组件614包括一个或多个传感器，用于为装置600提供各个方面的状态评估。例如，传感器组件614可以检测到装置600的打开/关闭状态，组件的相对定位，例如组件为装置600的显示器和小键盘，传感器组件614还可以检测装置600或装置600一个组件的位置改变，目标用户与装置600接触的存在或不存在，装置600方位或加速/减速和装置600的温度变化。传感器组件614可以包括接近传感器，被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件614还可以包括光传感器，如CMOS或CCD图像传感器，用于在成像应用中使用。在一些实施例中，该传感器组件614还可以包括加速度传感器，陀螺仪传感器，磁传感器，压力传感器或温度传感器。Sensor assembly 614 includes one or more sensors for providing status assessment of various aspects of device 600 . For example, the sensor assembly 614 can detect the open/closed state of the device 600, the relative positioning of the components, such as the display and keypad of the device 600, the sensor assembly 614 can also detect the position change of the device 600 or a component of the device 600, the target The presence or absence of user contact with the device 600 , the orientation or acceleration/deceleration of the device 600 and the temperature change of the device 600 . Sensor assembly 614 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. Sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

通信组件616被配置为便于装置600和其他设备之间有线或无线方式的通信。装置600可以接入基于通信标准的无线网络，如WiFi，2G或3G，或它们的组合。在一个示例性实施例中，通信组件616经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中，通信组件616还包括近场通信（NFC）模块，以促进短程通信。例如，在NFC模块可基于射频识别（RFID）技术，红外数据协会（IrDA）技术，超宽带（UWB）技术，蓝牙（BT）技术和其他技术来实现。Communication component 616 is configured to facilitate wired or wireless communication between apparatus 600 and other devices. Device 600 may access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 616 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

在示例性实施例中，装置600可以被一个或多个应用专用集成电路（ASIC）、数字信号处理器（DSP）、数字信号处理设备（DSPD）、可编程逻辑器件（PLD）、现场可编程门阵列（FPGA）、控制器、微控制器、微处理器或其他电子元件实现，用于执行上述方法。In an exemplary embodiment, apparatus 600 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable A gate array (FPGA), controller, microcontroller, microprocessor or other electronic component implementation for performing the above method.

在示例性实施例中，还提供了一种包括指令的非临时性计算机可读存储介质，例如包括指令的存储器604，上述指令可由装置600的处理器620执行以完成上述方法。例如，非临时性计算机可读存储介质可以是ROM、随机存取存储器（RAM）、CD-ROM、磁带、软盘和光数据存储设备等。In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as a memory 604 including instructions, executable by the processor 620 of the apparatus 600 to perform the method described above. For example, the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.

一种非临时性计算机可读存储介质，当存储介质中的指令由装置600的处理器执行时，使得装置600能够执行一种基于多层次特征表示对齐的跨模态检索方法，该方法包括：A non-transitory computer-readable storage medium, when the instructions in the storage medium are executed by the processor of the apparatus 600, the apparatus 600 can execute a cross-modal retrieval method based on alignment of multi-level feature representations, the method comprising:

基于各组目标数据对对应的图像-文本综合相似度，设计模态间结构约束损失函数和模态内结构约束损失函数，并采用所述模态间结构约束损失函数和所述模态内结构约束损失函数对神经网络模型进行训练。Based on the image-text comprehensive similarity corresponding to each set of target data pairs, an inter-modal structural constraint loss function and an intra-modal structural constraint loss function are designed, and the inter-modal structural constraint loss function and the intra-modal structural constraint loss function are used. The constrained loss function trains the neural network model.

图7是根据一示例性实施例示出的一种用于实现基于多层次特征表示对齐的跨模态检索方法的装置框图。例如，装置700可以被提供为一服务器。参照图7，装置700包括处理组件722，其进一步包括一个或多个处理器，以及由存储器732所代表的存储器资源，用于存储可由处理部件722执行的指令，例如应用程序。存储器732中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外，处理组件722被配置为执行指令，以执行上述启动页面生成方法。Fig. 7 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on alignment of multi-level feature representations according to an exemplary embodiment. For example, the apparatus 700 may be provided as a server. 7, apparatus 700 includes a processing component 722, which further includes one or more processors, and a memory resource, represented by memory 732, for storing instructions executable by processing component 722, such as application programs. An application program stored in memory 732 may include one or more modules, each corresponding to a set of instructions. Additionally, the processing component 722 is configured to execute instructions to perform the above-described startup page generation method.

装置700还可以包括一个电源组件726被配置为执行装置700的电源管理，一个有线或无线网络接口750被配置为将装置700连接到网络，和一个输入输出（I/O）接口758。装置700可以操作基于存储在存储器732的操作系统，例如Windows ServerTM，Mac OS XTM，UnixTM, LinuxTM，FreeBSDTM或类似。Device 700 may also include a power supply assembly 726 configured to perform power management of device 700 , a wired or wireless network interface 750 configured to connect device 700 to a network, and an input output (I/O) interface 758 . Device 700 may operate based on an operating system stored in memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

虽然，前文已经用一般性说明、具体实施方式及试验，对本发明做了详尽的描述，但在本发明基础上，可以对之进行修改或改进，这对本领域技术人员而言是显而易见的。因此，在不偏离本发明精神的基础上所做的这些修改或改进，均属于本发明要求保护的范围。Although the present invention has been described in detail above with general description, specific embodiments and tests, it is obvious to those skilled in the art that modifications or improvements can be made on the basis of the present invention. Therefore, these modifications or improvements made without departing from the spirit of the present invention fall within the scope of the claimed protection of the present invention.

本领域技术人员在考虑说明书及实践这里的发明的后，将容易想到本发明的其它实施方案。本发明旨在涵盖本发明的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本发明的一般性原理并包括本发明未公开的本技术领域中的公知常识或惯用技术手段。应当理解的是，本发明并不局限于上面已经描述并在附图中示出的精确结构，并且可以在不脱离其范围进行各种修改和改变。Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention herein. The present invention is intended to cover any variations, uses or adaptations of the present invention which follow the general principles of the present invention and include common knowledge or conventional techniques in the technical field not disclosed by the present invention . It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope.

Claims

1. A cross-modal retrieval method based on multi-level feature representation alignment is characterized by comprising the following steps:

acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic labels corresponding to the image data and the text data together;

for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to image data in the data pairs, and text global features, text local features and text relation features corresponding to text data in the data pairs;

for a target data pair consisting of any image data and any text data in the training data set, calculating to obtain image-text comprehensive similarity corresponding to the target data pair according to image global features and text global features corresponding to the target data pair, image local features and text local features corresponding to the target data pair, and image relation features and text relation features corresponding to the target data pair;

designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a neural network model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.

2. The method according to claim 1, wherein the step of extracting, for each group of data pairs in the training data set, an image global feature, an image local feature and an image relation feature corresponding to image data in the data pair, and a text global feature, a text local feature and a text relation feature corresponding to text data in the data pair, respectively, comprises:

for each group of data pairs in the training data set, extracting the image global characteristics of the image data corresponding to the data pairs by adopting a Convolutional Neural Network (CNN)

Then, a visual target detector is used to detect the visual targets included in the image data and extract the image local features of each visual target

WhereinMfor the number of visual objects comprised by the image data,

as a visual target

Extracting image relation characteristics among all visual targets through an image visual relation coding network

Wherein

as a visual target

And a visual target

Image relationship features between;

for each group of data pairs in the training data set, converting each word in the text data corresponding to the data pair into a word vector using a word embedding model

WhereinNthe word quantity included in the text data is input into a recurrent neural network in sequence, and the global text feature corresponding to the text data is obtained

Then, each word vector is input to a feedforward neural network to obtain the local text characteristics corresponding to each word

Simultaneously, each word vector is input into a text relation coding network to extract text relation characteristics among words

Wherein

is a word

Hehe word

A textual relationship feature between.

3. The method according to claim 2, wherein the step of calculating, for a target data pair composed of any image data and any text data in the training data set, an image-text comprehensive similarity corresponding to the target data pair according to an image global feature and a text global feature corresponding to the target data pair, an image local feature and a text local feature corresponding to the target data pair, and an image relationship feature and a text relationship feature corresponding to the target data pair includes:

for a target data pair consisting of any image data and any text data in the training data set, based on image global features corresponding to the image data in the target data pair

Global features of text corresponding to text data

The cosine distance of the target data is calculated to obtain the image-text global similarity corresponding to the target data

(ii) a Wherein image-text global similarity

Is as in formula (1):

) Formula (1)

Calculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and carrying out local image feature on each visual target

After weighting corresponding weight, obtaining new image local representation through feedforward neural network mapping

Then, a visual guidance attention mechanism is adopted to calculate the weight of each word included in the text data in the target data pair, and the text local characteristics of each word are calculated

After weighting corresponding weight, obtaining new text local representation through feedforward neural network mapping

From the respective image partial representation

Calculating cosine similarity of all visual targets and words according to local representation of each text, and calculating the local similarity of the target data to the corresponding image-text according to the mean value of the cosine similarity

(ii) a Wherein the image-text local similarity

The formula (2) is shown in the specification, wherein M is the number of visual objects, and N is the number of words:

formula (2)

Calculating to obtain image-text relation similarity corresponding to the target data pair according to the cosine similarity mean value of each image relation feature and each text relation feature in the target data pair

(ii) a Wherein, the similarity of image-text relationship

Is as in formula (3),Pnumber of relationships representing image data and text data:

formula (3)

According to the global similarity of the target data to the corresponding image-text

Image-text local similarity

Similarity of image-text relationship

Is calculated toTo the corresponding image-text comprehensive similarity of the target data pair

(ii) a Wherein, the image-text comprehensive similarity

The calculation formula of (2) is as formula (4):

equation (4).

4. The method according to claim 3, wherein the inter-modal structural constraint loss function is calculated as in equation (5),Bis the number of samples to be tested,

in order to be a hyper-parameter of the model,

for the pair of target data that is matched,

and

for non-matching target data pairs:

formula (5)

The calculation formula of the intra-modal structure constraint loss function is shown as formula (6), wherein,

for image triplets, in contrast to

，

And

there are more common semantic labels that are present,

for text triplets, in contrast to

，

And

with more common semantic labels:

equation (6).

5. The method of claim 4, wherein the step of training a neural network model using the inter-modal and intra-modal structural constraint loss functions comprises:

randomly sampling from the training data set to obtain a matched target data pair, a non-matched target data pair, an image triple and a text triple, respectively calculating an inter-modal structure constraint loss function value according to the inter-modal structure constraint loss function, calculating an intra-modal structure constraint loss function value according to the intra-modal structure constraint loss function, fusing according to a formula (7), and optimizing network parameters by using a back propagation algorithm:

formula (7)

Wherein

Is a hyper-parameter.

6. The method according to claim 2, wherein the extracting of the image relation features between the visual targets through the image visual relation coding network

The method comprises the following steps:

obtaining visual objects in an image via an image visual object detector

And a visual target

Is characterized by

，

And the characteristics of the two target union regions

And fusing the characteristics by adopting a formula (8), and calculating to obtain the relationship characteristics:

formula (8)

Wherein]In order to perform the vector splicing operation,

in order to function the activation of the neuron,

are model parameters.

7. The method of claim 2, wherein inputting the word vectors into the text-relation coding network extracts text-relation features between words

The method comprises the following steps:

in a text-relational coding network, words are calculated using equation (9)

Hehe word

Characteristic of textual relationship between

：

Formula (9)

Wherein,

representing the neuron activation function as a model parameter.

8. The method of claim 3, wherein the step of removing the substrate comprises removing the substrate from the substrateCalculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and calculating the image local characteristics of each visual target

After weighting the corresponding weight, obtaining a new image local representation through feedforward neural network mapping, comprising the following steps:

using the text-guided attention mechanism, the weight of each visual object in the image is calculated by equation (10):

formula (10)

Wherein,

、

is a model parameter;

each visual target is weighted by formula (11) and a new image local representation is obtained through feed-forward neural network mapping

：

Formula (11)

Wherein,

are model parameters.

9. The method of claim 3, wherein said directing attention visually is performedThe force mechanism calculates the weight of each word included in the text data in the target data pair and the local text characteristics of each word