CN111639240B - Cross-modal Hash retrieval method and system based on attention awareness mechanism - Google Patents
- Publication number
- CN111639240B (application CN202010408302.8A)
- Authority
- CN
- China
- Prior art keywords
- modal
- cross
- hash
- data
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9035—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a cross-modal hash retrieval method and system based on an attention-aware mechanism, comprising: performing feature extraction and attention feature extraction on the training set of a cross-modal dataset to obtain cross-modal features weighted by attention features; inputting the cross-modal features of the cross-modal data pairs into a hash learning model, and optimizing the hash learning model to minimize a loss function defined over the output cross-modal hash codes; and, according to the hash code of the query data obtained from the optimized hash learning model, screening, among the hash codes of data whose modality differs from that of the query data, the modal data that satisfy the retrieval request. The attention mechanism is applied to the cross-modal hash retrieval task, and a novel attention method, called the attention-aware mechanism, is proposed, which suppresses the noise and redundancy in the raw data while enhancing the regions of interest, improving the quality of the generated hash codes.
Description
Technical Field
The invention relates to the technical field of cross-modal hash retrieval, and in particular to a cross-modal hash retrieval method and system based on an attention-aware mechanism.
Background
The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.
With the explosive growth of multimedia data on the web, it has become necessary to retrieve texts or videos related to a given image, or to retrieve images or videos from text, that is, to use data in one modality to retrieve similar samples in another modality. At the same time, storing data efficiently and querying it quickly has become a challenge. In recent years, researchers have therefore proposed hash learning to address this problem: hash learning methods represent the original high-dimensional samples with simple, compact binary hash codes, which greatly compresses the data and facilitates storage and mutual retrieval.
Cross-modal retrieval aims to retrieve data in a different modality that matches the given data, for example using a text query to find the set of images in a database that match the textual description. Existing approaches can be divided into deep and non-deep models according to whether they incorporate deep learning. A traditional deep cross-modal hashing model usually consists of three steps: first, deep networks extract features from the different modalities; then, a fully connected network learns a hash function from those features under the supervision of a cross-entropy loss and a sample similarity matrix; finally, the hash function converts each sample into a hash code that is stored in the database.
Many cross-modal hash retrieval methods have been proposed. However, the inventors found that the prior art has at least the following problems. For retrieval tasks, real data often contain noise and redundancy. During feature extraction, the most useful visual information should be extracted while background information is ignored, because the background interferes with retrieval; yet in real data the informative categories cover only a small region and most of the area is background. Most current cross-modal retrieval methods ignore this issue and learn features directly from the raw data, so they may be misled by invalid or redundant information and produce low-quality hash codes. In addition, to improve retrieval accuracy, many high-performing deep cross-modal hashing models introduce networks with many more parameters, such as GANs (generative adversarial networks), which substantially increases training and retrieval time.
Summary of the Invention
To solve the above problems, the present invention proposes a cross-modal hash retrieval method and system based on an attention-aware mechanism. The attention mechanism is applied to the cross-modal hash retrieval task, and a novel attention method, the attention-aware mechanism, is proposed. Feature learning and hash-code learning are carried out simultaneously on a cross-modal dataset containing multiple modalities, and the attention-weighted feature representations are fed back into the hash learning model to guide hash-code generation, suppressing noise and redundancy in the raw data while enhancing the regions of interest and improving the quality of the generated hash codes.
To achieve the above object, the present invention adopts the following technical solutions:
In a first aspect, the present invention provides a cross-modal hash retrieval method based on an attention-aware mechanism, comprising:
performing feature extraction and attention feature extraction on the training set of a cross-modal dataset to obtain cross-modal features weighted by attention features;
inputting the cross-modal features of the cross-modal data pairs in the training set into a hash learning model, and optimizing the hash learning model to minimize a loss function defined over the output cross-modal hash codes; and
according to the hash code of the query data obtained from the optimized hash learning model, screening, among the hash codes of data in the cross-modal dataset whose modality differs from that of the query data, the modal data that satisfy the retrieval request.
In a second aspect, the present invention provides a cross-modal hash retrieval system based on an attention-aware mechanism, comprising:
a feature extraction module configured to perform feature extraction and attention feature extraction on the training set of a cross-modal dataset to obtain cross-modal features weighted by attention features;
a hash learning module configured to input the cross-modal features of the cross-modal data pairs in the training set into a hash learning model, and to optimize the hash learning model to minimize a loss function defined over the output cross-modal hash codes; and
a retrieval module configured to screen, according to the hash code of the query data obtained from the optimized hash learning model, among the hash codes of data in the cross-modal dataset whose modality differs from that of the query data, the modal data that satisfy the retrieval request.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when run by the processor, carry out the method of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium for storing computer instructions, wherein the computer instructions, when executed by a processor, carry out the method of the first aspect.
Compared with the prior art, the beneficial effects of the present invention are as follows:
In the present invention, the cross-modal dataset contains data of multiple modalities, and feature learning and hash-code learning are performed on these modalities simultaneously, improving the efficiency of hash-code generation.
The present invention proposes a novel attention method, the attention-aware mechanism, and applies it to the cross-modal hash retrieval task. Weighting the two different modalities not only highlights the key parts of the cross-modal data, such as the region of an image where an object is present or a particular word in a text input, but also suppresses the influence of redundant or invalid parts on retrieval, such as the image background or distractor words in the text. This effectively improves the quality of the generated hash codes and suits cross-modal retrieval tasks in a wide range of multi-modal data scenarios.
Brief Description of the Drawings
The accompanying drawings, which form a part of the present invention, provide a further understanding of the invention; the exemplary embodiments and their descriptions explain the invention and do not unduly limit it.
Figures 1(a)-(b) show image-modality data;
Figure 1(c) shows the ten most frequent words in the text annotations of the public dataset MIRFlickr-25K;
Figure 1(d) shows the text annotation of Figure 1(a);
Figure 2 is a flowchart of the cross-modal hash retrieval method based on an attention-aware mechanism provided in Embodiment 1 of the present invention;
Figure 3 is a flowchart of the image attention feature extraction provided in Embodiment 1 of the present invention;
Figure 4 is a flowchart of the text attention feature extraction provided in Embodiment 1 of the present invention;
Figure 5 is a structural diagram of the cross-modal hash retrieval system based on an attention-aware mechanism provided in Embodiment 1 of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide a further explanation of the invention. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It should be noted that the terminology used herein is for the purpose of describing specific embodiments only and is not intended to limit the exemplary embodiments of the present invention. As used herein, unless the context clearly indicates otherwise, singular forms are intended to include plural forms as well. It should further be understood that the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such process, method, product, or device.
The embodiments of the present invention and the features of the embodiments may be combined with each other without conflict.
Embodiment 1
A variety of cross-modal hash retrieval methods have been proposed, but real data contain noise and redundancy, and current retrieval methods that learn features directly from the raw data are misled by invalid or redundant information, producing low-quality hash codes. Taking the two modalities of images and text as an example, as shown in Figures 1(a)-1(b): for the image of Figure 1(a), the region containing the bee and the flowers should be highlighted while the background behind them is ignored, because the background interferes with retrieval. Similarly, for the image of Figure 1(b), whose labels (i.e., supervision information) are "animal", "flower", and "plant life", the most useful visual information may be the butterfly hovering over the flower. However, these informative categories cover only a small part of the whole image, and most of the image is background.
Figure 1(c) lists the ten most frequent words in the text annotations of the public dataset MIRFlickr-25K. Half of them ("explore", "canon", "bw", "nikon", and "2007") are invalid words with no direct relation to the image content. Figure 1(d) shows the text annotation of Figure 1(a), in which only the word "bees" is relevant to the retrieval task.
It follows that, unless the noise and redundancy in the raw data are suppressed, low-quality hash codes are easily generated, degrading the retrieval results.
In recent years the attention mechanism has been widely applied in computer vision and related fields, for example natural language processing, object detection, image recognition, and speech recognition, but it has rarely been used for cross-modal retrieval. Applied to image recognition, the traditional attention mechanism automatically finds the parts of an image that need attention: it learns to generate a mask of the same size as an image representation (the raw image, a feature map, and so on), and for regions of interest the corresponding positions of the mask have higher activation values. According to where it acts, an attention model can usually be classified as a spatial attention model or a channel attention model. A spatial attention model generates an attention value for each position in the feature map; mapped back to the original image, different positions thus influence the task to different degrees. A channel attention mechanism generates an attention value for each channel of the feature map and is more abstract.
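As a concrete illustration of the spatial-attention idea, the following minimal PyTorch sketch generates a mask and reweights a feature map; the class name, the 1x1-convolution scoring, and the tensor sizes are assumptions for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Generate a spatial mask for a feature map and reweight it."""
    def __init__(self, in_channels):
        super().__init__()
        # a 1x1 convolution collapses the channels into one attention map
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat):                    # feat: (N, C, H, W)
        mask = torch.sigmoid(self.score(feat))  # (N, 1, H, W), values in (0, 1)
        return feat * mask                      # regions of interest get higher weight

feat = torch.randn(2, 256, 13, 13)              # a Conv5-like feature map
weighted = SpatialAttention(256)(feat)          # same shape, spatially reweighted
```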
This embodiment incorporates the spatial attention mechanism, applies attention to the cross-modal hash retrieval task, and, building on the traditional attention mechanism, proposes a new attention method, called the attention-aware mechanism, for weighting the two different modalities.
That is, the cross-modal hash retrieval method based on the attention-aware mechanism of this embodiment suppresses the noise and redundancy in the raw data while enhancing the regions of interest, and then extracts an attention matrix, which markedly improves the quality of the generated hash codes; it can be used for cross-modal information retrieval in a wide range of multi-modal data scenarios. As shown in Figure 2, the method comprises the following steps:
S1: performing feature extraction and attention feature extraction on the training set of the cross-modal dataset to obtain cross-modal features weighted by attention features;
S2: inputting the cross-modal features of the cross-modal data pairs in the training set into the hash learning model, and optimizing the hash learning model to minimize a loss function defined over the output cross-modal hash codes;
S3: according to the hash code of the query data obtained from the optimized hash learning model, screening, among the hash codes of data in the cross-modal dataset whose modality differs from that of the query data, the modal data that satisfy the retrieval request.
In step S1, the cross-modal dataset contains data of multiple modalities. This embodiment takes image-modality and text-modality data as an example; it will be appreciated that the modality types can be extended to others, such as video and speech.
The cross-modal dataset is divided into a training set and a test set, and two parallel convolutional neural networks perform feature extraction and attention feature extraction simultaneously on the image-text pairs of the training set. Specifically: an initial attention matrix is obtained, the convolutional neural network is trained to minimize a loss function, and the improved attention matrix is output; the attention matrix is then multiplied element-wise with the feature matrix output by the convolutional neural network to obtain the cross-modal features weighted by attention features.
Image feature extraction and image attention feature extraction are performed on the images of the training set as follows:
S1-1: the image feature extraction process uses the convolutional neural network CNN_F as the backbone and outputs the image feature matrix at the fifth convolutional layer, Conv5.
S1-2: the image attention feature extraction process comprises: (1) introducing an attention layer between the fifth convolutional layer and the fully connected layers by modifying the residual network ResNet-50, as shown in Figure 3: the fully connected layer is replaced with a new convolutional layer Conv6 and a max-pooling layer. Conv6 is introduced to ensure that the final attention map has the same size as the image feature matrix output by the Conv5 layer in the image feature extraction process. The modified ResNet-50 extracts the initial attention matrix O, and the network is pre-trained with a cross-entropy loss function.
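A minimal PyTorch sketch of this modification follows; the 1x1 kernel for Conv6, the use of adaptive max pooling, and the 13x13 target size are assumptions made for illustration, since the text only states that Conv6 and max pooling replace the fully connected layer and that the output must match the Conv5 feature-map size.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class AttentionBackbone(nn.Module):
    """ResNet-50 with its fully connected head replaced by Conv6 + max pooling,
    so the output O carries one score map per category."""
    def __init__(self, num_classes):
        super().__init__()
        base = resnet50(weights=None)
        # keep everything up to the last residual stage (drop avgpool and fc)
        self.features = nn.Sequential(*list(base.children())[:-2])
        # Conv6: map the 2048 backbone channels to num_classes score maps
        self.conv6 = nn.Conv2d(2048, num_classes, kernel_size=1)
        # pool so the attention map matches the Conv5 feature-map size
        self.pool = nn.AdaptiveMaxPool2d(13)

    def forward(self, x):                  # x: (N, 3, H, W)
        o = self.conv6(self.features(x))   # (N, num_classes, h, w)
        return self.pool(o)                # initial attention matrix O

model = AttentionBackbone(num_classes=24)
O = model(torch.randn(2, 3, 448, 448))     # (2, 24, 13, 13)
```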
(2) refining the initial attention matrix:
O′_ir = sigmoid(max_k(O_ijk)),
where O′_ir is the attention weight of the r-th region of image I_i, and O_ijk is the value, at the same position, of the k-th of the N_c categories in the output O of the pre-trained network.
The final attention matrix is obtained from O′_i using a computable threshold μ_i, which is computed as follows:
The attention values of the different regions of the image are sorted in ascending order, and it is assumed that about p% (0 < p < 100) of the regions of an image are redundant while the remaining part (about 1 − p%) consists of regions of interest; μ_i is then set to the activation value at the position of the sorted O′_i corresponding to the p% quantile, where Nr = n × n denotes the number of regions.
(3) expanding the final attention matrix along the channel dimension to obtain a new weight matrix, which is then multiplied element-wise with the image feature matrix output by the Conv5 layer, yielding the image features weighted by image attention features.
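The following sketch combines steps (2) and (3); the exact quantile index and the binarization of the thresholded mask are assumptions of this illustration, since the corresponding formulas are not reproduced in the text above.

```python
import torch

def image_attention_mask(O, p=50.0):
    """O: (N, Nc, n, n) category score maps from the pre-trained attention network.
    Returns a (N, 1, n, n) mask: 1 for regions of interest, 0 for redundant regions."""
    O_prime = torch.sigmoid(O.max(dim=1, keepdim=True).values)  # max over the Nc categories
    flat = O_prime.flatten(1)                                   # (N, Nr) with Nr = n*n
    k = max(int(flat.size(1) * p / 100.0), 1)                   # p% quantile position
    mu = flat.sort(dim=1).values[:, k - 1].view(-1, 1, 1, 1)    # per-image threshold mu_i
    return (O_prime >= mu).float()

O = torch.randn(2, 24, 13, 13)                 # 24 categories, 13x13 regions
mask = image_attention_mask(O)                 # (2, 1, 13, 13)
conv5_feat = torch.randn(2, 256, 13, 13)       # Conv5 image feature matrix
weighted = conv5_feat * mask                   # broadcasting expands the mask over channels
```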
Text feature extraction and text attention feature extraction are performed on the texts of the training set as follows:
S1-3: the text feature extraction process uses two fully connected layers to obtain the text features.
S1-4: the text attention feature extraction process comprises: (1) introducing an attention layer before the first fully connected layer Fc1. A neural network without hidden layers, i.e., a two-layer non-linear classification network, learns the mapping W between each annotation of the input text representation and its corresponding class, as shown in Figure 4. W serves as the initial attention matrix, and a least-squares error loss guides the training of this classification network.
(2) refining the initial attention matrix:
W_ij is normalized with the SoftMax function, and the contribution of text y_i to the different classes is assumed to follow a distribution F_i(·):
F_i(l_j) = W′_ij,
where l_j is the label information corresponding to the j-th sample.
The information entropy E_i corresponding to each annotation is then computed, and the attention value is set to
W″_i = −E_i.
The final attention matrix is then obtained by thresholding W″.
Here v is a computable threshold, computed as follows:
The entries of the attention matrix W″_i are sorted in ascending order, and v is set to the value at the corresponding position of the sorted sequence, where Nt denotes the number of distinct labels in the text annotation set.
(3) multiplying the original text features by the text attention map to obtain the text features weighted by text attention features; the original text features use a BoW (bag-of-words) representation, although other forms such as Word2Vec are also possible.
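A corresponding sketch for the text branch, under the assumptions that E_i is the Shannon entropy of the normalized contribution distribution and that v cuts at a fixed quantile q; both details are assumptions of this illustration rather than formulas from the patent.

```python
import torch
import torch.nn.functional as F

def text_attention_weights(W, q=50.0, eps=1e-12):
    """W: (Nt, Nc) mapping from each of the Nt annotation words to the Nc classes.
    Returns a (Nt,) attention weight per annotation word."""
    P = F.softmax(W, dim=1)                   # F_i(l_j) = W'_ij
    E = -(P * (P + eps).log()).sum(dim=1)     # Shannon entropy of each word's distribution
    W2 = -E                                   # W''_i = -E_i: low entropy => informative word
    k = max(int(W2.numel() * q / 100.0), 1)
    v = W2.sort().values[k - 1]               # threshold v at the q% position
    return (W2 >= v).float()

W = torch.randn(1386, 24)                     # e.g. 1386 annotation words, 24 classes
att = text_attention_weights(W)               # (1386,)
bow = torch.rand(8, 1386)                     # a batch of BoW text vectors
weighted_text = bow * att                     # broadcast over the batch
```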
In step S2, the image features and text features are input into the hash learning network model, the sign function is applied to obtain the binary hash codes, and a global objective function is constructed with the goal of minimizing the loss function. In this objective,
n is the number of samples in the sample set; B_x is the binary hash code of the image modality and B_y the binary hash code of the text modality, and B = B_x = B_y = sign(γ(F + G)); W_x and W_y are the initial attention matrices of the image-modality and text-modality data; F_* = f_x(x_i, θ_x), where θ_x denotes the image-network parameters and F is the output of the image network; G_* = f_y(y_i, θ_y), where θ_y denotes the text-network parameters and G is the output of the text network; γ and η are hyper-parameters; and the similarity matrix S is defined such that, for two different samples i and j, S_ij is set to 1 if the labels of the two samples share at least one class, and to 0 otherwise.
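For illustration, the similarity matrix S can be computed directly from a multi-hot label matrix; this short sketch is an aid to the definition above, with variable names chosen here rather than taken from the patent.

```python
import torch

def similarity_matrix(labels):
    """labels: (n, Nc) multi-hot label matrix.
    S_ij = 1 iff samples i and j share at least one class."""
    L = labels.float()
    return ((L @ L.t()) > 0).float()

L = torch.randint(0, 2, (5, 24))   # 5 samples, 24 classes
S = similarity_matrix(L)           # (5, 5), symmetric
```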
In this embodiment, the first term of the global objective function is a negative log-likelihood loss and the second term a quantization loss. Since the similarity relation between samples is derived from the label information L, this embodiment proposes a third loss, a semantic-preservation loss function, to make fuller use of the sample supervision information.
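The objective formula itself is not reproduced in this text. A DCMH-style form consistent with the three terms described above would be as follows; the definition of Θ_ij and the exact shape of the semantic-preservation term L_sem are assumptions of this sketch.

```latex
\min_{B,\,\theta_x,\,\theta_y} J =
  -\sum_{i,j=1}^{n}\Bigl(S_{ij}\,\Theta_{ij} - \log\bigl(1 + e^{\Theta_{ij}}\bigr)\Bigr)
  + \gamma\Bigl(\lVert B - F\rVert_F^{2} + \lVert B - G\rVert_F^{2}\Bigr)
  + \eta\,\mathcal{L}_{\mathrm{sem}}(F, G, L),
\quad
\Theta_{ij} = \tfrac{1}{2}\,F_{*i}^{\top} G_{*j},
\quad
B = \operatorname{sign}\bigl(\gamma\,(F + G)\bigr).
```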
In step S2, the hash learning model is optimized with the goal of minimizing the loss function; the variables to be optimized are B, F, G, W_x, and W_y. This embodiment minimizes the loss function by iterative optimization, that is, only one variable is optimized at a time while the others remain fixed. The optimization strategy is as follows:
S2-1: fix the variables B, G, W_x, W_y and update the variable F:
For each sample point x_i, F_* is optimized by stochastic gradient descent; the chain rule is used to compute the gradient, and the parameters θ_x of the image network are updated by back-propagation.
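The gradient expression is likewise not reproduced here; under the DCMH-style objective sketched above it would take roughly the following form (an assumed reconstruction, not the patent's verbatim formula):

```latex
\frac{\partial J}{\partial F_{*i}} =
  \frac{1}{2}\sum_{j=1}^{n}\bigl(\sigma(\Theta_{ij}) - S_{ij}\bigr)\,G_{*j}
  + 2\gamma\,\bigl(F_{*i} - B_{*i}\bigr),
\qquad
\frac{\partial J}{\partial \theta_x}
  = \frac{\partial J}{\partial F_{*i}}\,
    \frac{\partial F_{*i}}{\partial \theta_x}.
```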
S2-2: fix the variables B, F, G, W_y and update the variable W_x:
W_x is updated by stochastic gradient descent.
S2-3: fix the variables B, F, W_x, W_y and update the variable G:
Similarly to the update of F, for each sample point y_j the gradient with respect to the variable G is computed first; the chain rule is then used to propagate it, and the parameters θ_y are updated.
S2-4: fix the variables B, F, G, W_x and update the variable W_y, again by stochastic gradient descent.
S2-5: fix the variables F, G, W_x, W_y and update the variable B, i.e.:
B = sign(V),
where V = γ(F + G).
In step S3, after the hash learning model has been optimized, the corresponding hash codes of all samples in the cross-modal dataset are computed with the optimized hash learning model.
When a retrieval task is performed, the query data are input into the model to obtain the corresponding hash code; among the hash codes of data in the cross-modal dataset whose modality differs from that of the query data, the N hash codes with the smallest Hamming distance are retrieved, thereby screening out the cross-modal data that satisfy the retrieval request.
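A minimal sketch of this Hamming-ranking step for ±1 hash codes; the database size, code length, and N are illustrative values.

```python
import torch

def hamming_topn(query_code, db_codes, n=10):
    """query_code: (K,) with entries in {-1, +1}; db_codes: (M, K) likewise.
    Returns the indices of the n database codes nearest in Hamming distance."""
    # for +/-1 codes, Hamming distance = (K - <q, b>) / 2
    dist = (db_codes.size(1) - db_codes @ query_code) / 2
    return dist.topk(n, largest=False).indices

db = torch.sign(torch.randn(1000, 64))   # e.g. 1000 text-modality codes of 64 bits
q = torch.sign(torch.randn(64))          # hash code of an image query
neighbors = hamming_topn(q, db, n=10)    # indices of the 10 nearest text samples
```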
Embodiment 2
As shown in Figure 5, this embodiment provides a cross-modal hash retrieval system based on an attention-aware mechanism, comprising:
a feature extraction module configured to perform feature extraction and attention feature extraction on the training set of a cross-modal dataset to obtain cross-modal features weighted by attention features;
a hash learning module configured to input the cross-modal features of the cross-modal data pairs in the training set into a hash learning model, and to optimize the hash learning model to minimize a loss function defined over the output cross-modal hash codes; and
a retrieval module configured to screen, according to the hash code of the query data obtained from the optimized hash learning model, among the hash codes of data in the cross-modal dataset whose modality differs from that of the query data, the modal data that satisfy the retrieval request.
It should be noted here that the above modules correspond to steps S1 to S3 of Embodiment 1, and the examples and application scenarios implemented by the modules are the same as those of the corresponding steps, but are not limited to the disclosure of Embodiment 1. It should also be noted that, as part of the system, the above modules may be executed in a computer system such as a set of computer-executable instructions.
In this embodiment, the feature extraction module receives images and texts, and feature learning and hash-code learning are performed on the image data and text data simultaneously. The image feature extraction network includes an image attention feature extraction module, and the text feature extraction network includes a text attention feature extraction module. Finally, the attention-weighted features are input into the hash learning module to guide the generation of hash codes and improve their quality, making the system suitable for cross-modal retrieval tasks in a wide range of multi-modal data scenarios.
Further embodiments also provide:
An electronic device comprising a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when run by the processor, carry out the method described in Embodiment 1. For brevity, details are not repeated here.
It should be understood that, in this embodiment, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory may include read-only memory and random-access memory and provides instructions and data to the processor; a portion of the memory may also include non-volatile random-access memory. For example, the memory may also store information on the device type.
A computer-readable storage medium for storing computer instructions, wherein the computer instructions, when executed by a processor, carry out the method described in Embodiment 1.
The method of Embodiment 1 may be embodied directly as being carried out by a hardware processor, or by a combination of hardware and software modules in a processor. The software modules may reside in random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media mature in the art. The storage medium is located in the memory, and the processor reads the information in the memory and carries out the steps of the above method in combination with its hardware. To avoid repetition, details are not described here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in this embodiment can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
The above are only preferred embodiments of the present invention and are not intended to limit it; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Although the specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the protection scope of the present invention. Those skilled in the art should understand that, on the basis of the technical solutions of the present invention, various modifications or variations that can be made without creative effort still fall within the protection scope of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010408302.8A CN111639240B (en) | 2020-05-14 | 2020-05-14 | Cross-modal Hash retrieval method and system based on attention awareness mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010408302.8A CN111639240B (en) | 2020-05-14 | 2020-05-14 | Cross-modal Hash retrieval method and system based on attention awareness mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111639240A CN111639240A (en) | 2020-09-08 |
CN111639240B true CN111639240B (en) | 2021-04-09 |
Family
ID=72331952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010408302.8A Active CN111639240B (en) | 2020-05-14 | 2020-05-14 | Cross-modal Hash retrieval method and system based on attention awareness mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111639240B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112199375B (en) * | 2020-09-30 | 2024-03-01 | 三维通信股份有限公司 | Cross-modal data processing method and device, storage medium and electronic device |
CN112364198B (en) * | 2020-11-17 | 2023-06-30 | 深圳大学 | A cross-modal hash retrieval method, terminal device and storage medium |
CN112329439B (en) * | 2020-11-18 | 2021-11-19 | 北京工商大学 | Food safety event detection method and system based on graph convolution neural network model |
CN112287159B (en) * | 2020-12-18 | 2021-04-09 | 北京世纪好未来教育科技有限公司 | Retrieval method, electronic device and computer readable medium |
CN112598067A (en) * | 2020-12-25 | 2021-04-02 | 中国联合网络通信集团有限公司 | Emotion classification method and device for event, electronic equipment and storage medium |
CN112817914A (en) * | 2021-01-21 | 2021-05-18 | 深圳大学 | Attention-based deep cross-modal Hash retrieval method and device and related equipment |
CN112734625B (en) * | 2021-01-29 | 2022-06-07 | 成都视海芯图微电子有限公司 | Hardware acceleration system and method based on 3D scene design |
CN112862727B (en) * | 2021-03-16 | 2023-06-23 | 上海壁仞智能科技有限公司 | Cross-modal image conversion method and device |
CN113095415B (en) * | 2021-04-15 | 2022-06-14 | 齐鲁工业大学 | A cross-modal hashing method and system based on multimodal attention mechanism |
CN113032614A (en) * | 2021-04-28 | 2021-06-25 | 泰康保险集团股份有限公司 | Cross-modal information retrieval method and device |
CN113220919B (en) * | 2021-05-17 | 2022-04-22 | 河海大学 | A cross-modal retrieval method and model for dam defect image text |
CN113343014A (en) * | 2021-05-25 | 2021-09-03 | 武汉理工大学 | Cross-modal image audio retrieval method based on deep heterogeneous correlation learning |
CN113239237B (en) * | 2021-07-13 | 2021-11-30 | 北京邮电大学 | Cross-media big data searching method and device |
CN114090801B (en) * | 2021-10-19 | 2024-07-19 | 山东师范大学 | Deep countering attention cross-modal hash retrieval method and system |
CN114817606B (en) * | 2022-03-07 | 2025-01-28 | 齐鲁工业大学(山东省科学院) | Image-text retrieval method and system based on cross-attention hashing network |
CN116776157B (en) * | 2023-08-17 | 2023-12-12 | 鹏城实验室 | Model learning method supporting modal increase and device thereof |
CN117194740B (en) * | 2023-11-08 | 2024-01-30 | 武汉大学 | Geographic information retrieval intent update method and system based on guided iterative feedback |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107885764A (en) * | 2017-09-21 | 2018-04-06 | 银江股份有限公司 | Based on the quick Hash vehicle retrieval method of multitask deep learning |
CN108170755A (en) * | 2017-12-22 | 2018-06-15 | 西安电子科技大学 | Cross-module state Hash search method based on triple depth network |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104346440B (en) * | 2014-10-10 | 2017-06-23 | 浙江大学 | A kind of across media hash indexing methods based on neutral net |
CN107562812B (en) * | 2017-08-11 | 2021-01-15 | 北京大学 | Cross-modal similarity learning method based on specific modal semantic space modeling |
US11062179B2 (en) * | 2017-11-02 | 2021-07-13 | Royal Bank Of Canada | Method and device for generative adversarial network training |
US10248664B1 (en) * | 2018-07-02 | 2019-04-02 | Inception Institute Of Artificial Intelligence | Zero-shot sketch-based image retrieval techniques using neural networks for sketch-image recognition and retrieval |
US11556581B2 (en) * | 2018-09-04 | 2023-01-17 | Inception Institute of Artificial Intelligence, Ltd. | Sketch-based image retrieval techniques using generative domain migration hashing |
CN109992686A (en) * | 2019-02-24 | 2019-07-09 | 复旦大学 | Image-text retrieval system and method based on multi-angle self-attention mechanism |
CN109960732B (en) * | 2019-03-29 | 2023-04-18 | 广东石油化工学院 | Deep discrete hash cross-modal retrieval method and system based on robust supervision |
CN110222140B (en) * | 2019-04-22 | 2021-07-13 | 中国科学院信息工程研究所 | A cross-modal retrieval method based on adversarial learning and asymmetric hashing |
CN110472642B (en) * | 2019-08-19 | 2022-02-01 | 齐鲁工业大学 | Fine-grained image description method and system based on multi-level attention |
CN111125457A (en) * | 2019-12-13 | 2020-05-08 | 山东浪潮人工智能研究院有限公司 | A deep cross-modal hash retrieval method and device |
- 2020-05-14: application CN202010408302.8A (CN) filed; granted as patent CN111639240B, status active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107885764A (en) * | 2017-09-21 | 2018-04-06 | 银江股份有限公司 | Based on the quick Hash vehicle retrieval method of multitask deep learning |
CN108170755A (en) * | 2017-12-22 | 2018-06-15 | 西安电子科技大学 | Cross-module state Hash search method based on triple depth network |
Also Published As
Publication number | Publication date |
---|---|
CN111639240A (en) | 2020-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111639240B (en) | Cross-modal Hash retrieval method and system based on attention awareness mechanism | |
CN118277538B (en) | Legal intelligent question-answering method based on retrieval enhancement language model | |
CN111027595B (en) | Two-stage semantic word vector generation method | |
CN111191002B (en) | Neural code searching method and device based on hierarchical embedding | |
CN112560432A (en) | Text emotion analysis method based on graph attention network | |
WO2023160472A1 (en) | Model training method and related device | |
CN115203442B (en) | Cross-modal deep hash retrieval method, system and medium based on joint attention | |
CN110297931A (en) | A kind of image search method | |
CN112232087A (en) | An Aspect-Specific Sentiment Analysis Approach for Transformer-Based Multi-granularity Attention Models | |
CN107665248A (en) | File classification method and device based on deep learning mixed model | |
CN118312833A (en) | Hierarchical multi-label classification method and system for travel resources | |
CN114821050A (en) | Named image segmentation method based on transformer | |
CN118411572A (en) | Small sample image classification method and system based on multi-mode multi-level feature aggregation | |
CN115098646A (en) | A multi-level relationship analysis and mining method for graphic data | |
CN114780767A (en) | A large-scale image retrieval method and system based on deep convolutional neural network | |
CN113157919A (en) | Sentence text aspect level emotion classification method and system | |
CN115794871A (en) | Table question-answer processing method based on Tapas model and graph attention network | |
CN115422369A (en) | Knowledge graph completion method and device based on improved TextRank | |
CN115205640A (en) | A multi-level image-text fusion method and system for rumor detection | |
CN114881172A (en) | Software vulnerability automatic classification method based on weighted word vector and neural network | |
CN120011558A (en) | Text classification method based on pre-training language model fusion deep convolutional network | |
CN112784017B (en) | Archive cross-modal data feature fusion method based on main affinity expression | |
CN111930972B (en) | Method and system for cross-modal retrieval of multimedia data using tag level information | |
CN118484529A (en) | A contract risk detection method and device based on large language model | |
CN114925211A (en) | Fact verification method for tabular data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |