CN111914110A

CN111914110A - Example retrieval method based on deep activation salient region

Info

Publication number: CN111914110A
Application number: CN202010745156.8A
Authority: CN
Inventors: 赵万磊; 肖惠楚; 王菡子
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2020-11-10

Abstract

An instance retrieval method based on depth-activated salient regions, involving visual instance retrieval. 1) Model design: The model includes a forward propagation module, a pattern localization module and a feature extraction module; 2) For a given image database, each image in the database is used as the input of the model, and the output instance localization results and corresponding instances are extracted. 3) For each query picture, use it as the input of the forward propagation part of the deep pattern mining module, extract regional instance features, and use the instance features with all instance-level features extracted by the model on the database picture. The features are compared for similarity. The area with the highest similarity in each database image is the result of instance retrieval on the image, and the similarity corresponding to this area is the similarity of the image. All images in the database are based on similarity from high to low. Arrange in order to obtain the results of instance retrieval of the entire database. It can be applied to intelligent retrieval and video editing of video media.

Description

A Instance Retrieval Method Based on Deeply Activated Salient Regions

技术领域technical field

本发明涉及视觉实例检索，尤其是涉及可应用到互联网公司，视频媒体的智能化检索、视频编辑等领域的一种基于深度激活显著区域的实例检索方法。The invention relates to visual instance retrieval, in particular to an instance retrieval method based on depth activation salient regions, which can be applied to Internet companies, intelligent retrieval of video media, video editing and other fields.

背景技术Background technique

视觉实例检索(以下简称“实例检索”)通常被看作是图像检索的一个子任务。被检索的目标为一个视觉实例，通常通过框标注的形式在查询图片中给出。例如，检索一个特定的人或者一辆特定的车。实例检索需要在图像数据库中找出包含该查询实例的图片，并且给出图片中实例的位置。实例检索可以广泛应用到智能化商品检索、视频编辑等领域中，通过检索和定位特定实例来支持与该实例相关的任务，是当前图像数据处理领域通用且基本的技术。Visual instance retrieval (hereinafter referred to as "instance retrieval") is generally regarded as a subtask of image retrieval. The retrieved target is a visual instance, usually given in the query image in the form of box annotations. For example, to retrieve a specific person or a specific car. Instance retrieval requires finding the image containing the query instance in the image database and giving the location of the instance in the image. Instance retrieval can be widely used in intelligent commodity retrieval, video editing and other fields. It supports tasks related to a specific instance by retrieving and locating it. It is a common and basic technology in the field of current image data processing.

目前，业界的主流技术都无法在进行实例检索的同时对实例进行定位，它们继承于传统图像检索技术，即对图像提取一个全局特征从而进行检索。一部分方法在全局图像的基础上直接进行全局特征的提取；还有一部分方法首先在图像上提取大量局部特征，再对局部特征进行聚合产生全局特征，来降低后续检索时直接比对局部特征所带来的大量计算消耗。采用全局特征进行检索时，整张图片里多个实例的特征混合到一起，导致无法对单一实例进行定位。同时由于混入了其他实例或者背景的特征，全局特征对于逐个实例的区分力和表达力严重下降。业界中为数不多的技术，采用了基于深度学习的目标检测或实例分割的框架，来解决视觉实例定位的问题。这些技术都是采用带有监督的训练方式来训练目标检测或者实例分割框架，在通过检测或分割取得物体定位后，在定位区域里取得实例级特征来进行检索。实例级的检索使得对检索实例的定位成为可能，但有监督的训练又不可避免地让对物体的定位只对训练数据集中的类别敏感，而对训练类别中没有的实例物体则无法进行定位和检索。At present, the mainstream technologies in the industry cannot locate the instance while retrieving the instance. They inherit from the traditional image retrieval technology, that is, extract a global feature from the image for retrieval. Some methods directly extract global features on the basis of the global image; some methods first extract a large number of local features from the image, and then aggregate the local features to generate global features to reduce the direct comparison of local features in subsequent retrieval. a lot of computational cost to come. When global features are used for retrieval, the features of multiple instances in the entire image are mixed together, which makes it impossible to locate a single instance. At the same time, due to the mixing of features from other instances or backgrounds, the discriminative and expressive power of global features for each instance is seriously reduced. There are few technologies in the industry that use the framework of target detection or instance segmentation based on deep learning to solve the problem of visual instance localization. These techniques use supervised training methods to train target detection or instance segmentation frameworks. After obtaining object localization through detection or segmentation, instance-level features are obtained in the localization area for retrieval. Instance-level retrieval makes it possible to locate the retrieved instances, but supervised training inevitably makes the localization of objects only sensitive to the categories in the training dataset, and cannot locate and locate the instance objects that are not in the training category. retrieve.

在实际应用场景中，查询的实例类别不能事先假定；另外，收集所有可能出现的视觉实例类别来用以训练是不现实的。因此，如何在仅拥有几千个物体类别的数据集合上训练，而可以找到训练类里没有出现的类别(未知类)实例，是本发明所关注解决的关键技术难题。In practical application scenarios, the instance classes of the query cannot be assumed in advance; in addition, it is unrealistic to collect all possible visual instance classes for training. Therefore, how to train on a data set with only a few thousand object categories and find instances of categories (unknown categories) that do not appear in the training category is a key technical problem to be solved by the present invention.

目前业界还没有既能进行实例定位又能对实例类别鲁棒的实例检索方法。At present, there is no instance retrieval method in the industry that can both perform instance localization and be robust to instance categories.

发明内容SUMMARY OF THE INVENTION

本发明的目的在于针对使用深度激活的显著区域来解决未知类别实例的定位问题，提升实例检索任务中对于实例物体种类的鲁棒性，提供可应用到智能化的商品检索、视频编辑等领域的一种基于深度激活显著区域的实例检索方法。The purpose of the present invention is to solve the problem of localization of unknown category instances by using deeply activated salient regions, to improve the robustness of instance object types in instance retrieval tasks, and to provide solutions that can be applied to intelligent commodity retrieval, video editing and other fields. An instance retrieval method based on deeply activated salient regions.

本发明包括以下步骤：The present invention includes the following steps:

1)模型设计：所述模型包括三部分，前向传播模块对深度模式进行挖掘，为模型的第一部分；模式定位模块主要负责激活显著区域并进行区域形状估计来定位实例，为模型的第二部分；特征提取模块通过区域池化提取实例级的特征，为模型的第三部分。将图片作为输入，输出图片中检测到的实例的定位信息及对应的特征，参与后续的检索部分；1) Model design: The model consists of three parts. The forward propagation module mines the depth pattern, which is the first part of the model; the pattern localization module is mainly responsible for activating salient regions and performing region shape estimation to locate instances, which is the second part of the model. Part; the feature extraction module extracts instance-level features through region pooling, which is the third part of the model. Take the picture as input, output the location information and corresponding features of the instance detected in the picture, and participate in the subsequent retrieval part;

2)数据预处理：对于给定的图像数据库，数据库中的每张图片都作为模型的输入，提取输出的实例定位结果和对应的实例级特征，保存以备用；2) Data preprocessing: For a given image database, each image in the database is used as the input of the model, and the output instance location results and corresponding instance-level features are extracted and saved for future use;

3)实例检索：对于每个查询图片，将其作为深度模式挖掘模块的前向传播部分的输入，用给定的查询实例的区域进行区域性实例特征的提取，然后将该实例特征与模型在数据库图片上提取出的所有实例级特征进行相似度对比，每张数据库图片中相似度最高的区域即为该图上实例检索的结果，该区域对应的相似度即为这张图片的相似度，数据库所有图片按照相似度从高到低依次排列，得到整个数据库的实例检索的结果。3) Instance retrieval: For each query image, take it as the input of the forward propagation part of the deep pattern mining module, use the given query instance region to extract regional instance features, and then combine the instance features with the model in the model. All instance-level features extracted from the database image are compared for similarity. The area with the highest similarity in each database image is the result of instance retrieval on the image, and the similarity corresponding to this area is the similarity of this image. All images in the database are arranged in descending order of similarity, and the result of instance retrieval of the entire database is obtained.

在步骤1)中，所述模型设计采用深度学习技术中流行的残差网络(ResNet-50)，该网络全连接层之前的全卷积结构作为卷积神经网络骨架，不进行额外训练，直接使用预训练的残差网络权重；模型设计的具体步骤包括：In step 1), the model design adopts the popular residual network (ResNet-50) in deep learning technology, and the fully convolutional structure before the fully connected layer of the network is used as the skeleton of the convolutional neural network, without additional training, directly Use pretrained residual network weights; specific steps in model design include:

首先，网络接受一幅输入图像，通过上述网络结构，得到最后一个卷积层的输出张量，记为X∈R^W*H*C，其中W和H分别为输出张量的宽和高，C为输出张量的通道数；通过在通道维度上对求平均得到平均响应图

来挖掘输入图片上激活的位置。平均响应图

的值反应了输入图片各处对深度激活模式的响应得分；

上的峰点处即为响应最高处；这些局部极值点所对映的输入图像的区域是最有可能存在实例的区域；First, the network accepts an input image, and through the above network structure, the output tensor of the last convolutional layer is obtained, denoted as X∈R ^W*H*C , where W and H are the width and height of the output tensor, respectively, C is the number of channels of the output tensor; the average response map is obtained by averaging over the channel dimension

to mine the activation locations on the input image. Average Response Plot

The value of reflects the response score of the input image to the deep activation pattern;

The peak point on the upper part is the highest response point; the region of the input image corresponding to these local extreme points is the region where the instance is most likely to exist;

在模式定位模块中，在每一个峰点处提取出上对应的C维向量并对其进行反向传播来激活显著区域，得到在输入图像上与该峰点对映的区域，然后对该区域进行形状估计完成定位。此处反向传播可以近似表达为一个逐层求解条件概率的过程。下面给出对卷积结构中某一个卷积层进行反向传播的具体操作，其他层的传播依该方式进行。设有卷积层的输入张量、输出张量、卷积核，以及反向传播到该层时已知的对上的结果P(B)，其反向传播的过程表示为：In the pattern localization module, the corresponding C-dimensional vector is extracted at each peak point and backpropagated to activate the salient region, and the region corresponding to the peak point on the input image is obtained, and then the region is Perform shape estimation to complete positioning. Here, backpropagation can be approximately expressed as a process of solving conditional probability layer by layer. The specific operation of backpropagating a certain convolutional layer in the convolutional structure is given below, and the propagation of other layers is carried out in this way. With the input tensor, output tensor, convolution kernel of the convolutional layer, and the known result P(B) when backpropagating to the layer, the process of backpropagation is expressed as:

其中传递概率为：where the transit probability is:

其中，为正则化项，使对A上任意一点的传递概率总和为1。F_x-i,y-j为卷积核上与A_x,y及B_i,j相关的值。对于平均响应图

上的某一个点，经迭代上述过程传播到输入层时得到的激活概率图记为。反映了输入图像上与该峰点有关的深度激活显著区域。将激活概率图的值正则化至[0,1]，对值大于0.1的所有像素点r(x,y)，通过计算图像二阶矩进行区域的形状估计。图像二阶矩的计算如下：Among them, is the regularization term, so that the sum of the transit probabilities to any point on A is 1. F _xi,yj is the value related to A _x,y and B _i,j on the convolution kernel. For the average response graph

At a certain point on , the activation probability map obtained when iteratively propagates the above process to the input layer is denoted as . It reflects the depth activation salient region related to this peak on the input image. The value of the activation probability map is normalized to [0, 1], and the shape of the region is estimated by calculating the second-order moment of the image for all pixel points r(x, y) whose value is greater than 0.1. The second moment of the image is calculated as follows:

通过上述二阶矩计算得到深度激活显著区域的近似椭圆的参数，椭圆的外接矩形即为最终定位结果。The parameters of the approximate ellipse of the deep activation salient region are obtained through the above second-order moment calculation, and the circumscribing rectangle of the ellipse is the final positioning result.

上述定位过程同样被应用到了平均响应图

上除峰点外的其他位置。在所有对深度激活模式响应大于

的均值的位置上，通过反向传播得到的定位结果和每个位置所对应的深度激活模式响应得分一起，进行一个非极大值抑制过程来筛选得到最终的定位结果。在每个定位结果区域上提取特征，即可得到实例级特征用以进行后续检索。The above positioning process is also applied to the average response map

other than the peak point. active mode responses at all depths greater than

At the position of the mean value of , the positioning result obtained by backpropagation and the response score of the deep activation mode corresponding to each position are combined with a non-maximum suppression process to filter to obtain the final positioning result. By extracting features on each location result area, instance-level features can be obtained for subsequent retrieval.

与现有技术相比，本发明具有以下突出的优点：Compared with the prior art, the present invention has the following outstanding advantages:

本发明提出了一种简单有效的实例检索模型，在无需训练过程的情况下，利用深度激活显著区域来提高对实例类别的鲁棒性，解决以往的方法在兼顾实例定位和实例类别鲁棒性上存在的问题。可应用到互联网公司，视频媒体的智能化检索、视频编辑等领域，是一种基于深度激活显著区域的实例检索方法。在本发明提出的模型中，一方面，特征提取部分采用实例定位和提取实例级特征的方式，避免了全局特征会混合噪音信息导致特征区分力不够，以及无法对单一实例进行定位的问题，提高了检索的实用性；另一方面，定位过程采用深度激活显著区域的方式，避开了业界方法中受到的训练过程和训练类别的约束，采用深度激活模式的响应来定位可能的前景物体，从而囊括了对未知类别实例的定位，提高实例检索中对于检索实例类别的鲁棒性的同时，避免了耗时的训练过程，提升了模型的轻便性和可移植性。The present invention proposes a simple and effective instance retrieval model, which utilizes deep activation of salient regions to improve the robustness to instance categories without the need for a training process, and solves the problem that the previous methods have both instance localization and instance category robustness. problems on the . It can be applied to Internet companies, intelligent retrieval of video media, video editing and other fields. It is an instance retrieval method based on deep activation of salient regions. In the model proposed by the present invention, on the one hand, the feature extraction part adopts the method of instance localization and instance-level feature extraction, which avoids the problem that global features will mix noise information, resulting in insufficient feature discrimination and inability to locate a single instance. On the other hand, the localization process adopts the method of deep activation of salient regions, avoiding the constraints of the training process and training categories in the industry method, and uses the response of the deep activation mode to locate possible foreground objects, thereby It includes the localization of instances of unknown categories, improves the robustness of instance retrieval to retrieved instance categories, avoids the time-consuming training process, and improves the portability and portability of the model.

附图说明Description of drawings

图1为本发明实施例的模型结构示意图。FIG. 1 is a schematic diagram of a model structure according to an embodiment of the present invention.

具体实施方式Detailed ways

以下实施例将结合附图对本发明作进一步的说明。The following embodiments will further illustrate the present invention in conjunction with the accompanying drawings.

本发明实施例包括以下步骤：The embodiment of the present invention includes the following steps:

1)模型设计：本发明的模型如图1。在图1中包含三个主要的模块，前向传播模块1、模式定位模块2和特征提取模块3，依次形成序贯结构。前向传播模块对深度模式进行挖掘，为模型的第一部分；模式定位模块主要负责激活显著区域并进行区域形状估计来定位实例，为模型的第二部分；特征提取模块通过区域池化提取实例级的特征，为模型的第三部分。模型将数据库图片作为输入，得到所有实例定位及对应的实例级特征作为数据库特征。之后对查询实例提取区域性特征，在数据库特征中进行查询，检索相似的实例。1) Model design: The model of the present invention is shown in Figure 1. In Figure 1, there are three main modules, the forward propagation module 1, the pattern localization module 2 and the feature extraction module 3, which form a sequential structure in turn. The forward propagation module mines deep patterns, which is the first part of the model; the pattern localization module is mainly responsible for activating salient regions and performing region shape estimation to locate instances, which is the second part of the model; the feature extraction module extracts instance-level through region pooling features for the third part of the model. The model takes database images as input, and obtains all instance locations and corresponding instance-level features as database features. After that, regional features are extracted from the query instances, and the database features are queried to retrieve similar instances.

具体地，采用深度学习技术中流行的残差网络(ResNet-50)。该网络全连接层之前的全卷积结构作为本发明的卷积神经网络骨架。不进行额外训练，直接使用预训练的残差网络权重。Specifically, the popular residual network (ResNet-50) in deep learning technology is adopted. The fully convolutional structure before the fully connected layer of the network serves as the skeleton of the convolutional neural network of the present invention. Without additional training, the pre-trained residual network weights are directly used.

首先，网络接受一幅输入图像，通过上述网络结构，得到最后一个卷积层的输出张量，记为X∈R^W*H*C，其中W和H分别为输出张量的宽和高，C为输出张量的通道数。通过在通道维度上对求平均得到平均响应图

来挖掘输入图片上激活的位置。该图的值反应了输入图片各处对深度激活模式的响应得分。该图的峰点处即为响应最高处。这些局部极值点所对映的输入图像的区域是最有可能存在实例的区域。First, the network accepts an input image, and through the above network structure, the output tensor of the last convolutional layer is obtained, denoted as X∈R ^W*H*C , where W and H are the width and height of the output tensor, respectively, C is the number of channels of the output tensor. Average response plot by averaging over the channel dimension

to mine the activation locations on the input image. The values in this map reflect the response scores to deep activation patterns across the input image. The peak of the graph is the highest response. The regions of the input image to which these local extreme points correspond are the regions where instances are most likely to exist.

在模式定位模块中，在每一个峰点处提取出上对应的C维向量并对其进行反向传播来激活显著区域，得到在输入图像上与该峰点对映的区域，然后对该区域进行形状估计完成定位。此处反向传播可以近似表达为一个逐层求解条件概率的过程。In the pattern localization module, the corresponding C-dimensional vector is extracted at each peak point and backpropagated to activate the salient region, and the region corresponding to the peak point on the input image is obtained, and then the region is Perform shape estimation to complete positioning. Here, backpropagation can be approximately expressed as a process of solving conditional probability layer by layer.

下面给出对卷积结构中某一个卷积层进行反向传播的具体操作，其他层的传播依该方式进行。设有卷积层的输入张量、输出张量、卷积核，以及反向传播到该层时已知的对上的结果P(B)，其反向传播的过程可以表示为：The specific operation of backpropagating a certain convolutional layer in the convolutional structure is given below, and the propagation of other layers is carried out in this way. With the input tensor, output tensor, convolution kernel of the convolutional layer, and the known result P(B) when backpropagating to the layer, the process of backpropagation can be expressed as:

其中传递概率为：where the transit probability is:

在本发明中，上述定位过程同样被应用到平均响应图

上除峰点外的其他位置。在所有对深度激活模式响应大于

的均值的位置上，通过反向传播得到的定位结果和每个位置所对应的深度激活模式响应得分一起，进行一个非极大值抑制过程来筛选得到最终的定位结果。在每个定位结果区域上提取特征，即可得到实例级特征用以进行后续检索。In the present invention, the above positioning process is also applied to the average response map

other than the peak point. active mode responses at all depths greater than

2)数据预处理：对于给定的图像数据库，数据库中的每张图片都作为本发明模型的输入，提取输出的实例定位结果和对应的实例级特征，保存以备用。2) Data preprocessing: For a given image database, each image in the database is used as the input of the model of the present invention, and the output instance location results and corresponding instance-level features are extracted and saved for future use.

3)实例检索：对于每个查询图片，将其作为前向传播模块的输入，用给定的查询实例的区域进行区域性实例特征的提取。然后将该实例特征与本发明模型在数据库图片上提取出的所有实例级特征进行相似度对比，每张数据库图片中对应的相似度最高的区域即为该图上实例检索的结果，该区域对应的相似度即为该图的相似度。数据库所有图片按照相似度从高到低依次排列，得到整个数据库上的实例检索的结果。3) Instance retrieval: For each query image, it is used as the input of the forward propagation module, and the region of the given query instance is used to extract regional instance features. Then compare the similarity between the instance feature and all instance-level features extracted from the database image by the model of the present invention. The area with the highest similarity in each database image is the result of instance retrieval on the image. This area corresponds to The similarity is the similarity of the graph. All images in the database are arranged in descending order of similarity to obtain the results of instance retrieval on the entire database.

在本发明提出的模型中，一方面，特征提取部分采用实例定位和提取实例级特征的方式，避免了全局特征会混合噪音信息导致特征区分力不够，以及无法对单一实例进行定位的问题，提高了检索的实用性；另一方面，定位过程采用深度激活显著区域的方式，避开了业界方法中受到的训练过程和训练类别的约束，采用深度激活模式的响应来定位可能的前景物体，从而囊括了对未知类别实例的定位，提高实例检索中对于检索实例类别的鲁棒性的同时，避免了耗时的训练过程，提升了模型的轻便性和可移植性。In the model proposed by the present invention, on the one hand, the feature extraction part adopts the method of instance localization and instance-level feature extraction, which avoids the problem that global features will mix noise information, resulting in insufficient feature discrimination and inability to locate a single instance. On the other hand, the localization process adopts the method of deep activation of salient regions, avoiding the constraints of the training process and training categories in the industry method, and uses the response of the deep activation mode to locate possible foreground objects, thereby It includes the localization of instances of unknown categories, improves the robustness of instance retrieval to retrieved instance categories, avoids the time-consuming training process, and improves the portability and portability of the model.

本发明提出一种简单有效的实例检索方法，在无需训练过程的情况下，利用深度激活显著区域来提高对实例类别的鲁棒性，解决以往的方法在兼顾实例定位和实例类别鲁棒性上存在的问题。The present invention proposes a simple and effective instance retrieval method, which utilizes deep activation of salient regions to improve the robustness to instance categories without the need for a training process, and solves the problem that the previous method takes both instance localization and instance category robustness into consideration. existing problems.

在实例检索中，本发明技术与现有技术R-MAC、CroW、CAM、BLCF、BLCF-SalGAN、RegionalAttention、DeepVision、FCIS+XD和PCL*+SPN在Instance-335和INSTRE数据集上的检索评价指标mAP的比较对比如表1所示。In instance retrieval, the retrieval evaluation of the technology of the present invention and the prior art R-MAC, CroW, CAM, BLCF, BLCF-SalGAN, RegionalAttention, DeepVision, FCIS+XD and PCL*+SPN on Instance-335 and INSTRE datasets The comparison and comparison of the indicator mAP are shown in Table 1.

表1Table 1

在所有的对比方法中，只有FCIS+XD和PCL*+SPN方法能对所有检索到的实例进行定位，且它们均采用有监督的训练方式。在Instance-335数据集上，评价指标采用FCIS+XD的设置，包含了前10、前20、前50、前100和所有结果的检索结果评价对比。由表1可以看出，本发明的方法在两个数据集下最终的mAP均高于所有可以定位的方法，且在两个数据集之间都表现出稳定的性能。INSTRE数据集中包含许多训练类别里没有的类别实例，本发明的方法在该数据集上对比其他能定位的方法所表现出的稳定性证明了对实例类别的鲁棒性。虽然BLCF-SalGAN在INSTRE数据集上性能更好，但其需要额外标注生成的显著图信息以及其不能定位实例的特性导致其在真实场景中并不实用。因此，本发明的模型获得了比所有能定位的方法更好的结果，并保证了实际场景的实用性。Among all the contrasting methods, only the FCIS+XD and PCL*+SPN methods can localize all the retrieved instances, and they all use supervised training. On the Instance-335 dataset, the evaluation index adopts the setting of FCIS+XD, including the top 10, top 20, top 50, top 100 and all the results of the retrieval results evaluation comparison. It can be seen from Table 1 that the final mAP of the method of the present invention is higher than that of all localizable methods under the two datasets, and shows stable performance between the two datasets. The INSTRE data set contains many class instances that are not in the training class, and the stability of the method of the present invention compared with other localization methods on this data set proves the robustness to instance classes. Although BLCF-SalGAN performs better on the INSTRE dataset, it requires additional annotation of the generated saliency map information and its inability to locate instances makes it impractical in real scenes. Therefore, the model of the present invention obtains better results than all localization methods and guarantees the practicality of the actual scene.

R-MAC对应的方法为GiorgosTolias等人提出的方法(Tolias G,Sicre R,JégouH.Particular object retrieval with integral max-pooling of CNN activations[J].arXivpreprint arXiv:1511.05879,2015.)；CroW对应的方法为YannisKalantidis等人提出的方法(Kalantidis Y,Mellina C,Osindero S.Cross-dimensional weightingfor aggregated deep convolutional features[C]//European conference oncomputer vision.Springer,Cham,2016:685-701.)；CAM对应的方法为Albert Jimenez等人提出的方法(Jimenez A,Alvarez J M,Giro-i-Nieto X.Class-weightedconvolutional features for visual instance search[J].arXiv preprint arXiv:1707.02581,2017.)；BLCF及BLCF-SalGAN对应的方法为Eva Mohedano等人提出的词袋模型编码卷积特征方法以及用显著性图加权的词袋模型编码卷积特征的方法(Mohedano E,McGuinness K,Giró-i-Nieto X,et al.Saliency weighted convolutional featuresfor instance search[C]//2018international conference on content-basedmultimedia indexing(CBMI).IEEE,2018:1-6.)；RegionalAttention对应的方法为Jaeyoon Kim等人提出的方法(Kim J,Yoon S E.Regional Attention Based DeepFeature for Image Retrieval[C]//BMVC.2018:209.)；DeepVision对应的方法为AmaiaSalvador等人提出的方法(Salvador A,Giró-i-Nieto X,Marqués F,et al.Faster r-cnnfeatures for instance search[C]//Proceedings of the IEEE conference oncomputer vision and pattern recognition workshops.2016:9-16.)；FCIS+XD对应的方法为Zhan Yu等人提出的采用实例分割框架提取实例级特征用于实例检索的方法(Zhan Y,Zhao W L.Instance Search via Instance Level Segmentation and FeatureRepresentation[J].arXiv preprint arXiv:1806.03576,2018.)；PCL*+SPN对应的方法为Lin Jie等人提出的采用弱监督目标检测框架提取特征用于实例检索的方法(Lin J,ZhanY,Zhao W L.Instance search based on weakly supervised feature learning[J].Neurocomputing,2019.)。The method corresponding to R-MAC is the method proposed by GiorgosTolias et al. (Tolias G, Sicre R, JégouH.Particular object retrieval with integral max-pooling of CNN activations[J].arXivpreprint arXiv:1511.05879,2015.); the method corresponding to CroW For the method proposed by Yannis Kalantidis et al. (Kalantidis Y, Mellina C, Osindero S. Cross-dimensional weighting for aggregated deep convolutional features[C]//European conference on computer vision. Springer, Cham, 2016:685-701.); corresponding to CAM The method is proposed by Albert Jimenez et al. (Jimenez A, Alvarez J M, Giro-i-Nieto X. Class-weighted convolutional features for visual instance search [J]. arXiv preprint arXiv:1707.02581, 2017.); BLCF and BLCF-SalGAN The corresponding methods are the bag-of-words model encoding convolutional feature method proposed by Eva Mohedano et al. .Saliency weighted convolutional features for instance search[C]//2018international conference on content-basedmultimedia indexing(CBMI).IEEE, 2018:1-6.); The corresponding method of Regional Attention is the method proposed by Jaeyoon Kim et al. (Kim J, Yoon S E.Regional Attention Based DeepFeature for Image Retrieval[C]//BMVC.2018:209.); the method corresponding to DeepVision is the method proposed by AmaiaSalvador et al. (Salva dor A, Giró-i-Nieto X, Marqués F, et al. Faster r-cnnfeatures for instance search[C]//Proceedings of the IEEE conference oncomputer vision and pattern recognition workshops.2016:9-16.); FCIS+ The method corresponding to XD is the method proposed by Zhan Yu et al. to extract instance-level features for instance retrieval using an instance segmentation framework (Zhan Y, Zhao W L. Instance Search via Instance Level Segmentation and FeatureRepresentation [J].arXiv preprint arXiv:1806.03576 , 2018.); the method corresponding to PCL*+SPN is the method proposed by Lin Jie et al. to extract features using a weakly supervised target detection framework for instance retrieval (Lin J, ZhanY, Zhao W L.Instance search based on weakly supervised feature learning[J].Neurocomputing, 2019.).

本发明公开一种基于深度激活图的视觉实例检索方法。本技术将输入图片经过预训练网络计算，获得深度激活图。该图的局部峰值可以对映到预训练数据中的已知类别或者预训练数据之外的未知类别。对这些峰值点进行近似于条件概率的逐层反传，直到输入图像层，即可以获得这些峰值点在输入图像上所对映的潜在视觉物体区域，从而实现了对视觉实例的定位。根据定位，本发明技术进一步在该区域提取特征，用于视觉实例检索中的特征表示。本发明不同已有的技术在于，本技术只依赖于一个预训练的深度卷积神经网络；另外，它可以有效检测到图像中的已知视觉物体类别和未知视觉物体类别，从而对图像中所有潜在物体生成实例级别的特征表示用于检索。The invention discloses a visual instance retrieval method based on depth activation map. In this technology, the input image is calculated by the pre-training network to obtain the deep activation map. Local peaks in the graph can map to known classes in the pre-training data or unknown classes outside the pre-training data. These peak points are back-propagated layer by layer, which approximates the conditional probability, until the input image layer, and the potential visual object regions corresponding to these peak points on the input image can be obtained, thereby realizing the localization of visual instances. Based on localization, the present technology further extracts features in the region for feature representation in visual instance retrieval. The present invention is different from the existing technologies in that the present technology only relies on a pre-trained deep convolutional neural network; in addition, it can effectively detect the known visual object category and the unknown visual object category in the image, so that all the visual object categories in the image can be effectively detected. Latent objects generate instance-level feature representations for retrieval.

Claims

1. an instance retrieval method based on depth activation significant region, is characterized in that comprising the following steps:

1) Model design: The model consists of three parts. The forward propagation module mines the depth pattern, which is the first part of the model; the pattern localization module is mainly responsible for activating salient regions and performing region shape estimation to locate instances, which is the second part of the model. Part; the feature extraction module extracts instance-level features through regional pooling, which is the third part of the model; takes the picture as input, outputs the location information and corresponding features of the instance detected in the picture, and participates in the subsequent retrieval part;

2) Data preprocessing: For a given image database, each image in the database is used as the input of the model, and the output instance location results and corresponding instance-level features are extracted and saved for future use;

3) Instance retrieval: For each query image, take it as the input of the forward propagation part of the deep pattern mining module, use the given query instance region to extract regional instance features, and then combine the instance features with the model in the model. All instance-level features extracted from the database image are compared for similarity. The area with the highest similarity in each database image is the result of instance retrieval on the image, and the similarity corresponding to this area is the similarity of this image. All images in the database are arranged in descending order of similarity, and the result of instance retrieval of the entire database is obtained.

2. a kind of instance retrieval method based on deep activation significant area as claimed in claim 1 is characterized in that in step 1), described model design adopts the popular residual network (ResNet-50) in deep learning technology, this The fully convolutional structure before the fully connected layer of the network is used as the skeleton of the convolutional neural network, and the pre-trained residual network weights are directly used without additional training; the specific steps of model design include:

First, the network accepts an input image, and through the above network structure, the output tensor of the last convolutional layer is obtained, denoted as X∈R ^W*H*C , where W and H are the width and height of the output tensor, respectively, C is the number of channels of the output tensor; the average response map is obtained by averaging over the channel dimension

to mine the locations of activations on the input image; the average response map

In the pattern localization module, the corresponding C-dimensional vector is extracted at each peak point and backpropagated to activate the salient region, and the region corresponding to the peak point on the input image is obtained, and then the region is Perform shape estimation to complete positioning; here, backpropagation can be approximately expressed as a process of solving conditional probability layer by layer; the specific operation of backpropagating a convolutional layer in the convolutional structure is given below, and the propagation of other layers depends on This method is carried out; there are input tensors, output tensors, convolution kernels of the convolutional layer, and the known result P(B) when backpropagating to this layer, and the process of backpropagation is expressed as :

where the transit probability is:

Among them, is the regularization term, so that the sum of the transfer probabilities to any point on A is 1; F _xi,yj is the value related to A _x,y and B _i,j on the convolution kernel; For the average response graph

At a certain point on , the activation probability map obtained when it propagates to the input layer through the iterative process above is recorded as; it reflects the deep activation significant area related to the peak point on the input image; the value of the activation probability map is normalized to [0 ,1], for all pixel points r(x,y) whose value is greater than 0.1, the shape of the region is estimated by calculating the second-order moment of the image; the calculation of the second-order moment of the image is as follows:

The parameters of the approximate ellipse of the deep activation salient region are obtained through the above second-order moment calculation, and the circumscribed rectangle of the ellipse is the final positioning result;

The above positioning process is also applied to the average response map

on all positions except the peak point; at all responses to deep activation modes greater than

At the position of the mean value of , the positioning results obtained through backpropagation and the corresponding deep activation mode response scores of each position are combined with a non-maximum suppression process to filter to obtain the final positioning results; in each positioning result area By extracting features from above, instance-level features can be obtained for subsequent retrieval.