CN106909924A - A fast remote sensing image retrieval method based on deep saliency - Google Patents
A fast remote sensing image retrieval method based on deep saliency
- Publication number
- CN106909924A (application number CN201710087670.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- network
- task
- saliency
- parameters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/05—Underwater scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
Abstract
Description
Technical Field
The present invention takes remote sensing imagery as its research object and applies deep learning, a recent advance in the field of artificial intelligence, to develop a fast retrieval method for remote sensing images. First, a fully convolutional neural network is used to build a multi-task salient object detection model that computes the deep saliency features of remote sensing images. The deep network structure is then extended with a hash layer that learns binary hash codes. Finally, the saliency features and hash codes are used together to achieve accurate and fast retrieval of remote sensing images. The invention belongs to the field of computer vision and specifically involves deep learning, salient object detection, image retrieval, and related technologies.
Background Art
As the foundational data underlying the three major spatial information technologies, namely geographic information systems (GIS), the global positioning system (GPS), and remote sensing (RS), remote sensing image data is widely used in environmental monitoring, resource surveying, land use, urban planning, natural disaster analysis, the military, and other fields. In recent years, with the development of high-resolution remote sensing satellites, imaging radar, and unmanned aerial vehicle (UAV) technology, remote sensing image data has become increasingly massive, complex, and high-resolution. Efficient and accurate retrieval of remote sensing images therefore has important research significance and application value for promoting accurate extraction and sharing of remote sensing image information.
Image retrieval technology has evolved from early text-based image retrieval (TBIR) to content-based image retrieval (CBIR), which works by extracting image features. Retrieval based on salient objects can quickly select a few salient regions of a complex scene for priority processing, effectively reducing data-processing complexity and improving retrieval efficiency. Compared with ordinary images, remote sensing images contain complex and variable information, and their targets are small and poorly distinguished from the background, so traditional saliency detection methods struggle to describe and analyze the salient features of remote sensing images accurately. In recent years, deep learning has emerged as a major advance in artificial intelligence. Deep neural networks, exemplified by the fully convolutional neural network (FCNN), exhibit excellent robustness in learning the deep saliency features of images, thanks to convolution kernels that resemble the local receptive fields of the human eye and a hierarchical cascade structure that resembles biological neural systems. Weight sharing also greatly reduces the number of network parameters and lowers the risk of overfitting the training data, making such networks easier to train than other kinds of deep networks and improving the representational accuracy of saliency features.
Considering the rapidly growing volume of remote sensing imagery and the limited semantic description capability of images, the present invention uses the public large-scale Aerial Image Dataset (AID), the Wuhan University remote sensing image dataset (WHU-RS), and Google Earth remote sensing images as data sources, and proposes a fast remote sensing image retrieval method based on deep saliency. First, a multi-task salient object detection model based on a fully convolutional neural network (FCNN) is constructed; semantic information of remote sensing images at different levels is learned on the pre-training dataset as deep saliency features and converted into one-dimensional column vectors. The network is then fine-tuned: a hash layer is introduced and training samples are added, and the high-dimensional saliency features learned by the model are mapped to a low-dimensional space in the form of binary hash codes. The saliency feature vectors and hash codes are stored separately to build a feature database. At query time, the trained model extracts the saliency feature vector and hash code of the remote sensing image to be retrieved; these are compared against the feature database, and similarity is measured by the Hamming distance between hash codes and the Euclidean distance between saliency feature vectors, achieving fast retrieval of remote sensing images.
Summary of the Invention
Unlike existing remote sensing image retrieval methods, the present invention uses deep learning to propose a fast remote sensing image retrieval method based on deep saliency. First, a fully convolutional neural network (FCNN) is used to build a multi-task deep salient object detection model, extending the image-level classification of an ordinary convolutional neural network (CNN) to pixel-level classification. The network is pre-trained on the large-scale Aerial Image Dataset (AID); the saliency detection task and the semantic segmentation task share convolutional layers and jointly learn three levels of semantic information from remote sensing images, effectively removing feature redundancy and accurately extracting deep saliency features. Second, a hash layer is added to the model and the Wuhan University remote sensing image dataset (WHU-RS) is expanded for fine-tuning the network. Exploiting the deep network's capacity for incremental learning through stochastic gradient descent (SGD), binary hash codes are learned point by point, reducing the dimensionality of the high-dimensional saliency features, which both saves storage space and improves retrieval efficiency. Moreover, compared with traditional hashing methods that require training samples to be input in pairs, the method adopted here scales more easily to large datasets. The saliency features learned during pre-training and fine-tuning are converted into one-dimensional column vectors and, together with the binary hash codes, form the feature database. Finally, the image retrieval stage adopts a coarse-to-fine strategy, jointly using the binary hash codes and saliency features to measure Hamming distance and Euclidean distance, achieving fast and accurate retrieval of remote sensing images. The main procedure, shown in Figure 1, comprises three steps: construction of the object detection model based on deep saliency; network pre-training followed by fine-tuning with an added hash layer; and multi-level deep retrieval.
(1) Construction of the object detection model based on deep saliency
To extract the salient regions of an image effectively, the present invention constructs a multi-task salient object detection model based on a fully convolutional neural network. The model performs two tasks simultaneously: saliency detection and semantic segmentation. Saliency detection learns deep features of remote sensing images and computes deep saliency; semantic segmentation extracts the semantic information of objects within the image, eliminating background confusion in the saliency map and filling in missing parts of salient objects.
(2) Network pre-training and fine-tuning with an added hash layer
The present invention selects the large-scale Aerial Image Dataset (AID) as the standard dataset for pre-training the network. To make the saliency features learned by the salient object detection model more robust for retrieving Chinese remote sensing imagery, 6050 Chinese remote sensing images of varying illumination, viewing angle, resolution, and size were downloaded from Google Earth, expanding the Wuhan University remote sensing image dataset (WHU-RS) to 7000 images used for fine-tuning the network.
(3) Multi-level deep retrieval
The present invention proposes a coarse-to-fine retrieval scheme. Coarse retrieval uses the binary hash codes learned by the hash layer and measures similarity by Hamming distance. Fine retrieval maps the two-dimensional feature maps produced by the 13th and 15th convolutional layers into one-dimensional column vectors, used as saliency feature vectors, and measures similarity by Euclidean distance. Retrieval results are evaluated with a ranking-based criterion by computing precision.
1. A fast remote sensing image retrieval method based on deep saliency, characterized by comprising the following steps:
Step 1: Construction of the object detection model based on deep saliency
An RGB image is input and passed through a series of convolution operations in 15 convolutional layers; the saliency detection task and the superpixel-level semantic segmentation task then share these convolutional layers. The first 13 convolutional layers are initialized from the VGGNet convolutional neural network with 3×3 kernels, and each convolutional layer is followed by a rectified linear unit (ReLU) activation function. Max pooling is performed after convolutional layers 2, 4, 5, and 13. Convolutional layers 14 and 15 have kernel sizes of 7×7 and 1×1 respectively, and each is followed by a Dropout layer.
A deconvolution layer is constructed for upsampling; its parameters are initialized by bilinear interpolation and updated iteratively as the upsampling function is learned during training. In the salient object detection task, a sigmoid threshold function normalizes the output image to [0, 1] and saliency features are learned. In the semantic segmentation task, the deconvolution layer upsamples the feature map of the last convolutional layer, and the upsampled result is cropped so that the output image has the same size as the input image.
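The bilinear-interpolation initialization of the deconvolution layer described above can be sketched as follows. This is an illustrative NumPy sketch, not part of the patent text; the function name `bilinear_kernel` is a hypothetical name for the standard FCN-style initializer.

```python
import numpy as np

def bilinear_kernel(size: int) -> np.ndarray:
    """Return a size x size bilinear upsampling kernel, the standard
    initialization for the weights of an FCN deconvolution layer."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

# A 4x4 kernel performs 2x upsampling; each row/column carries the
# triangular interpolation weights (0.25, 0.75, 0.75, 0.25).
k = bilinear_kernel(4)
```

During training these weights are then refined by backpropagation rather than kept fixed, matching the "updated iteratively" behavior described above.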
Step 2: Network pre-training and fine-tuning with an added hash layer
Step 2.1: Pre-training the multi-task salient object detection model
FCNN pre-training is carried out jointly on the saliency detection task and the segmentation task. Let χ denote a set of N1 training images of width W and height Q, with X_i the i-th image and Y_ijk the pixel-level ground-truth segmentation label at pixel (j, k) of the i-th image, where i = 1…N1, j = 1…W, k = 1…Q. Let Z denote a set of N2 training images, with Z_n the n-th image, n = 1…N2, each having a corresponding ground-truth binary map M_n marking the salient object. Let θ_s be the shared convolutional-layer parameters, θ_h the segmentation-task parameters, and θ_f the saliency-task parameters. Formula (1) is the cross-entropy cost function J_1(χ; θ_s, θ_h) of the segmentation task and formula (2) is the squared-Euclidean-distance cost function J_2(Z; θ_s, θ_f) of the saliency detection task; the FCNN is trained by minimizing both cost functions:

J_1(χ; θ_s, θ_h) = −Σ_{i=1…N1} Σ_{j=1…W} Σ_{k=1…Q} Σ_{c=1…C} 1{Y_ijk = c} log h_cjk(X_i; θ_s, θ_h)    (1)

J_2(Z; θ_s, θ_f) = Σ_{n=1…N2} ‖f(Z_n; θ_s, θ_f) − M_n‖_F²    (2)
In formula (1), 1{·} is the indicator function, h_cjk is element (j, k) of the class-c confidence segmentation map, c = 1…C, h(X_i; θ_s, θ_h) is the semantic segmentation function, which returns confidence segmentation maps for the C target classes, and C is the number of image categories in the pre-training dataset. In formula (2), f(Z_n; θ_s, θ_f) is the saliency-map output function and ‖·‖_F denotes the Frobenius norm.
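The two cost functions can be sketched numerically as follows. This is an illustrative NumPy sketch with toy shapes, not part of the patent text; `cross_entropy_cost` and `saliency_cost` are hypothetical names.

```python
import numpy as np

def cross_entropy_cost(conf_maps, labels):
    """Formula (1): pixel-wise cross-entropy for the segmentation task.
    conf_maps: (C, W, Q) class-confidence maps (softmax outputs);
    labels:    (W, Q) integer ground-truth class per pixel."""
    C, W, Q = conf_maps.shape
    j, k = np.meshgrid(range(W), range(Q), indexing="ij")
    # The indicator 1{Y_jk = c} selects the confidence of the true class.
    return -np.sum(np.log(conf_maps[labels, j, k]))

def saliency_cost(pred, truth):
    """Formula (2): squared Frobenius-norm distance between the predicted
    saliency map and the binary ground-truth map."""
    return np.sum((pred - truth) ** 2)

conf = np.full((3, 2, 2), 1 / 3)         # uniform confidence over 3 classes
labels = np.zeros((2, 2), dtype=int)     # every pixel belongs to class 0
seg_loss = cross_entropy_cost(conf, labels)                   # 4 * ln(3)
sal_loss = saliency_cost(np.ones((2, 2)), np.zeros((2, 2)))   # 4.0
```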
Next, the stochastic gradient descent (SGD) method is used to minimize the above cost functions on the basis of regularization over all training samples. Because the pre-training dataset does not carry both segmentation and saliency annotations simultaneously, the segmentation task and the saliency detection task are trained alternately. All original images are normalized to the same size before training. The learning rate is 0.001±0.01, the momentum parameter is typically in [0.9, 1.0], and the weight-decay factor is typically 0.0005±0.0002. The SGD learning process runs for more than 80000 iterations in total. The detailed pre-training procedure is as follows:
1) The shared convolutional parameters θ_s⁽⁰⁾ are initialized from VGGNet;
2) The segmentation-task parameters θ_h⁽⁰⁾ and the saliency-task parameters θ_f⁽⁰⁾ are randomly initialized from a normal distribution;
3) With θ_s⁽⁰⁾ and θ_h⁽⁰⁾, the segmentation network is trained by SGD, updating these two parameters to θ_s⁽¹⁾ and θ_h⁽¹⁾;
4) With θ_s⁽¹⁾ and θ_f⁽⁰⁾, the saliency network is trained by SGD, updating the relevant parameters to θ_s⁽²⁾ and θ_f⁽¹⁾;
5) With θ_s⁽²⁾ and θ_h⁽¹⁾, the segmentation network is trained by SGD, obtaining θ_s⁽³⁾ and θ_h⁽²⁾;
6) With θ_s⁽³⁾ and θ_f⁽¹⁾, the saliency network is trained by SGD, updating the relevant parameters to θ_s⁽⁴⁾ and θ_f⁽²⁾;
7) Steps 3-6 are repeated three times to obtain the final pre-trained parameters θ_s, θ_h, θ_f;
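The alternating schedule above can be sketched as a plain training loop. This is a toy illustration, not the patent's actual training code: the quadratic loss, the two-dimensional parameter vectors, and the name `sgd_step` are all stand-ins for the real segmentation and saliency objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_s = rng.normal(size=2)   # shared convolutional parameters (step 1)
theta_h = rng.normal(size=2)   # segmentation-task parameters (step 2)
theta_f = rng.normal(size=2)   # saliency-task parameters (step 2)
lr = 0.1

def sgd_step(shared, head, target):
    """One SGD step on a toy quadratic loss ||shared + head - target||^2;
    both the shared parameters and the task head receive the gradient."""
    grad = 2.0 * (shared + head - target)
    return shared - lr * grad, head - lr * grad

for _ in range(3):  # step 7: repeat the alternating rounds
    # steps 3 / 5: segmentation task updates theta_s and theta_h
    theta_s, theta_h = sgd_step(theta_s, theta_h, np.ones(2))
    # steps 4 / 6: saliency task updates theta_s and theta_f
    theta_s, theta_f = sgd_step(theta_s, theta_f, np.zeros(2))
```

The point of the sketch is the control flow: the shared parameters are touched by every step, while each task head is only updated by its own task.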
Step 2.2: Adding the hash layer and fine-tuning the network for the target domain
Between the penultimate layer of the pre-trained network and the final task layer, a fully connected layer containing s neurons, the hash layer H, is inserted; it maps the high-dimensional features to a low-dimensional space and generates binary hash codes for storage. The weights of the hash layer H are initialized with hash values constructed by random projection; the neuron activation function is the sigmoid, so output values lie between 0 and 1; and the number of neurons equals the code length of the target binary code.
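The hash layer can be sketched as a random-projection-initialized fully connected layer with sigmoid activations. This is an illustrative NumPy sketch, not part of the patent text; the dimensions and variable names (`W_H`, `feature`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
d, s = 4096, 48                # feature dimension; code length s in [40, 100]
W_H = rng.normal(size=(d, s)) / np.sqrt(d)   # random-projection initialization

feature = rng.normal(size=d)                 # penultimate-layer activation
hash_activation = sigmoid(feature @ W_H)     # s values, each in (0, 1)
```

During fine-tuning these weights would be trained by backpropagation along with the rest of the network; the sigmoid keeps every activation in (0, 1) so that it can later be thresholded into a binary bit.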
The fine-tuning process adjusts the network weights by the backpropagation algorithm; fine-tuning adjusts the network weights after the tenth convolutional layer. The dataset used for fine-tuning is 10%-50% smaller than the pre-training dataset; compared with the pre-training settings, the number of iterations and the learning rate are reduced by 1%-10% during fine-tuning, while the momentum parameter and weight-decay factor remain unchanged.
The detailed fine-tuning procedure is as follows:
1) The shared convolutional parameters θ_s⁽⁰⁾, segmentation-task parameters θ_h⁽⁰⁾, and saliency-task parameters θ_f⁽⁰⁾ are taken from the pre-training process;
2) With θ_s⁽⁰⁾ and θ_h⁽⁰⁾, the segmentation network is trained by SGD, updating these two parameters to θ_s⁽¹⁾ and θ_h⁽¹⁾;
3) With θ_s⁽¹⁾ and θ_f⁽⁰⁾, the saliency network is trained by SGD, updating the relevant parameters to θ_s⁽²⁾ and θ_f⁽¹⁾;
4) With θ_s⁽²⁾ and θ_h⁽¹⁾, the segmentation network is trained by SGD, obtaining θ_s⁽³⁾ and θ_h⁽²⁾;
5) With θ_s⁽³⁾ and θ_f⁽¹⁾, the saliency network is trained by SGD, updating the relevant parameters to θ_s⁽⁴⁾ and θ_f⁽²⁾;
6) Steps 2-5 are repeated three times to obtain the final parameters θ_s, θ_h, θ_f;
Step 3: Multi-level deep retrieval
Step 3.1: Coarse retrieval
Step 3.1.1: Generating the binary hash code
An image I_q to be queried is input into the fine-tuned network, and the output of the hash layer is extracted as the image signature, denoted Out(H). The binary code is obtained by binarizing the activation values against a threshold (0.5 is the natural choice for sigmoid outputs); for each bit r = 1…s, the binary code is output according to formula (3):

H_r = 1 if Out_r(H) ≥ 0.5, and H_r = 0 otherwise    (3)
where s is the number of hash-layer neurons, with an initial value chosen in the range [40, 100]. Γ = {I_1, I_2, …, I_n} denotes the retrieval dataset of n images, and the corresponding binary codes are Γ_H = {H_1, H_2, …, H_n}, where for i = 1…n, H_i ∈ {0, 1}^s is the s-bit binary code generated by the s neurons, each bit taking the value 0 or 1;
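The thresholding in formula (3) amounts to a single comparison per bit. A minimal sketch, assuming sigmoid activations in (0, 1); the name `binarize` is illustrative.

```python
import numpy as np

def binarize(out_h, threshold=0.5):
    """Formula (3): H_r = 1 if Out_r(H) >= threshold, else 0."""
    return (np.asarray(out_h) >= threshold).astype(int)

# Five sigmoid activations from the hash layer -> a 5-bit code.
code = binarize([0.91, 0.12, 0.50, 0.33, 0.77])   # [1, 0, 1, 0, 1]
```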
Step 3.1.2: Measuring similarity by Hamming distance
The Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. For a query image I_q with binary code H_q, if the Hamming distance between H_q and some H_i ∈ Γ_H is below a set threshold, the image is placed in a candidate pool P = {I_c1, I_c2, …, I_cm} of m candidate images; a Hamming distance of less than 5 is taken to mean that two images are similar;
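Coarse retrieval then reduces to a Hamming-distance filter over the stored codes. A minimal pure-Python sketch, not the patent's implementation; `hamming` and `candidate_pool` are illustrative names.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

def candidate_pool(query_code, database_codes, threshold=5):
    """Coarse retrieval: keep every database image whose code lies within
    the Hamming-distance threshold of the query code."""
    return [i for i, code in enumerate(database_codes)
            if hamming(query_code, code) < threshold]

q = [1, 0, 1, 1, 0, 0, 1, 0]
db = [[1, 0, 1, 1, 0, 0, 1, 1],   # distance 1 -> kept as candidate
      [0, 1, 0, 0, 1, 1, 0, 1],   # distance 8 -> rejected
      [1, 0, 1, 1, 0, 0, 1, 0]]   # distance 0 -> kept as candidate
pool = candidate_pool(q, db)      # [0, 2]
```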
Step 3.2: Fine retrieval
Step 3.2.1: Saliency feature extraction
The two-dimensional feature maps produced for the query image I_q by the 13th and 15th convolutional layers of the network are each mapped to a one-dimensional vector and stored. In the subsequent retrieval process, the retrieval results obtained with the different feature vectors are compared in order to decide which layer's feature map is ultimately used to extract the saliency features of remote sensing images;
Step 3.2.2: Measuring similarity by Euclidean distance
For a query image I_q and a candidate pool P, the extracted saliency feature vectors are used to select the top-k images from P. Let V_q and V_ci denote the feature vectors of the query image I_q and of candidate I_ci respectively. The Euclidean distance s_i between I_q and the feature vector of the i-th image in the candidate pool is defined as their similarity level, as shown in formula (4):

s_i = ‖V_q − V_ci‖_2    (4)
The smaller the Euclidean distance, the greater the similarity between the two images. The candidate images I_ci are sorted in ascending order of distance to the query image, and the top-k images are returned as the retrieval result;
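The fine-ranking step can be sketched as follows. This is an illustrative NumPy sketch; `fine_rank` is a hypothetical name and the two-dimensional toy vectors stand in for the real high-dimensional saliency features.

```python
import numpy as np

def fine_rank(v_q, candidates, k):
    """Formula (4): rank candidate feature vectors by Euclidean distance to
    the query vector and return the indices of the top-k matches."""
    dists = [np.linalg.norm(v_q - np.asarray(v)) for v in candidates]
    return sorted(range(len(candidates)), key=lambda i: dists[i])[:k]

v_q = np.array([0.0, 0.0])
cand = [np.array([3.0, 4.0]),   # distance 5.0
        np.array([1.0, 0.0]),   # distance 1.0
        np.array([0.0, 2.0])]   # distance 2.0
top2 = fine_rank(v_q, cand, 2)  # [1, 2]
```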
Step 3.3: Evaluation of retrieval results
Retrieval results are evaluated with a ranking-based criterion. For a query image q and the top-k retrieved images, the precision is computed according to formula (5):

Precision@k = (Σ_{i=1…k} Rel(i)) / k    (5)
where Precision@k denotes the precision over the first k returned results for the chosen threshold k, and Rel(i) ∈ {0, 1} denotes the relevance of the query image q to the image ranked i-th: 1 means that q and the i-th image belong to the same class (they are relevant), and 0 means they are not.
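Formula (5) is a one-line computation. A minimal sketch with a hypothetical relevance list; `precision_at_k` is an illustrative name.

```python
def precision_at_k(rel, k):
    """Formula (5): fraction of the top-k returned images that share the
    query image's class (rel[i] is 1 if the rank-(i+1) image is relevant)."""
    return sum(rel[:k]) / k

rel = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # relevance of ranks 1..10
p5 = precision_at_k(rel, 5)             # 3 relevant in top 5 -> 0.6
```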
Compared with the prior art, the present invention has the following clear advantages and beneficial effects:
首先,相比传统人工提取遥感影像特征的方法,本发明利用全卷积神经网络构建深度显著性目标检测模型,选择国内外遥感影像数据库训练网络,综合分析图像的三层语义信息,自动学习遥感影像显著性特征。同时,创新性地语义分割加入全卷积神经网络对遥感影像深度显著性的学习,有效完善学习到的显著性特征。实验证实,采用该模型在场景较为复杂的多目标检测数据集上,如微软COCO数据集等均可提取到边缘较清晰的显著性目标。深层神经网络的学习能力可进一步迁移至对遥感影像的显著性特征学习。其次,本发明在全卷积神经网络架构中引入哈希层,在学习遥感影像深度显著性特征的同时生成二进制哈希码,既可节省存储空间,又可提高后续检索效率。最后,在进行图像检索时采用由粗到细的检索策略,综合利用二进制哈希码和显著性特征进行相似性度量。实验证实,在AlexNet神经网络中加入哈希层,并采用由粗到细的多层次检索策略,在250万张不同类别的普通图像检索中,统计返回排名前K幅相似图像的准确率,即topK查准率,当K取1000时,topK查准率平均可达88%,检索时间约为1s。因此,将该方法迁移至遥感影像的检索,对于实现遥感影像准确、高效检索切实可行并具有重要应用价值。First of all, compared with the traditional method of manually extracting remote sensing image features, the present invention uses a fully convolutional neural network to build a deep saliency target detection model, selects domestic and foreign remote sensing image databases for training networks, comprehensively analyzes the three-layer semantic information of images, and automatically learns remote sensing Distinctive features of the image. At the same time, the innovative semantic segmentation is added to the fully convolutional neural network to learn the depth saliency of remote sensing images, effectively improving the learned saliency features. Experiments have confirmed that this model can extract salient objects with clearer edges on multi-object detection data sets with more complex scenes, such as the Microsoft COCO data set. The learning ability of deep neural network can be further transferred to the salient feature learning of remote sensing images. Secondly, the present invention introduces a hash layer into the fully convolutional neural network architecture to generate binary hash codes while learning the depth salient features of remote sensing images, which can save storage space and improve subsequent retrieval efficiency. Finally, a coarse-to-fine retrieval strategy is adopted in image retrieval, and binary hash codes and salient features are used to measure the similarity. 
Experiments confirm that adding a hash layer to the AlexNet network and adopting the coarse-to-fine multi-level retrieval strategy, on a retrieval set of 2.5 million ordinary images of different categories, yields a top-K precision (the precision over the top K returned similar images) averaging 88% for K = 1000, with a retrieval time of about 1 s. Transferring this method to remote sensing image retrieval is therefore feasible and of significant application value for accurate and efficient retrieval of remote sensing images.
Description of drawings
Fig. 1 is a flowchart of the fast remote sensing image retrieval method based on deep saliency;
Fig. 2 is an architecture diagram of the object detection model based on deep saliency;
Fig. 3 is an architecture diagram of the neural network with the added hash layer;
Fig. 4 is a diagram of the multi-level retrieval process.
Detailed description
In accordance with the above description, a specific implementation process follows; the protection scope of this patent is not limited to this implementation process.
Step 1: Construction of the object detection model based on deep saliency
A salient region is, subjectively, the region on which human visual attention concentrates, and is closely tied to the Human Visual System (HVS); objectively, it is the sub-region of an image in which some feature is most pronounced. The key to the saliency detection problem therefore lies in feature learning and extraction. Given the strength of deep learning in this respect, the present invention applies a fully convolutional neural network to the saliency detection problem and proposes a multi-task salient object detection model based on it. The model performs two tasks simultaneously: a saliency detection task and a semantic segmentation task. The saliency detection task learns deep features of remote sensing images and computes deep saliency; the semantic segmentation task extracts semantic information about objects inside the image, eliminating background clutter in the saliency map and filling in missing parts of salient objects.
The fully convolutional network architecture proposed by the present invention is implemented on the mainstream open-source deep learning framework Caffe; the model structure is shown in Fig. 2. An input RGB image passes through a series of convolutions in 15 convolutional layers (Conv); the saliency detection task and the superpixel-level semantic segmentation task share these convolutional layers. The first 13 convolutional layers are initialized from the convolutional neural network VGGNet with 3×3 kernels, and each is followed by a Rectified Linear Unit (ReLU) activation to speed up convergence. Max pooling (MaxPooling) is applied after the 2nd, 4th, 5th, and 13th convolutional layers to reduce the feature dimension, cutting computation while preserving feature invariance. The 14th and 15th convolutional layers use 7×7 and 1×1 kernels, respectively, and each is followed by a Dropout layer to counter the overfitting to which complex network structures are prone, i.e., the model memorizing noise and detail in the training data and consequently showing a high error rate and poor generalization at test time. A deconvolution layer is built for upsampling; its parameters are initialized by bilinear interpolation and updated iteratively as the upsampling function is learned during training. In the salient object detection task, the output map is normalized to [0,1] by a sigmoid threshold function to learn saliency features. In the semantic segmentation task, the deconvolution layer upsamples the feature map of the last convolutional layer, and the result is cropped (Crop) so that the output image is the same size as the input, yielding a prediction for every pixel while preserving the spatial information of the original input image.
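The bilinear-interpolation initialization of the deconvolution layer described above can be sketched as follows. This is a common FCN recipe shown purely as an illustration, not the patent's actual code; the upsampling factor of 2 is an assumed example:

```python
import numpy as np

def bilinear_kernel(factor):
    """Build a 2-D bilinear interpolation kernel for initializing a
    deconvolution (transposed convolution) layer that upsamples by
    the given factor."""
    size = 2 * factor - factor % 2          # kernel size, e.g. 4 for factor 2
    if size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = np.ogrid[:size, :size]
    # Outer product of two 1-D triangular (bilinear) filters
    kernel = (1 - abs(og[0] - center) / factor) * \
             (1 - abs(og[1] - center) / factor)
    return kernel

k = bilinear_kernel(2)   # 4x4 kernel for 2x upsampling
```

For a factor-f deconvolution layer, copies of this kernel would be placed on the layer's weight tensor, one per channel, before training refines them.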
Step 2: Pre-training the neural network and fine-tuning with an added hash layer
The present invention pre-trains the neural network on the public large-scale Aerial Image Dataset (AID) so that it better learns semantic features of remote sensing images at different levels. A hash layer is then introduced and the network is further fine-tuned on the expanded Wuhan University remote sensing dataset (WHU-RS); this not only maps the high-dimensional features learned by the network to a low-dimensional space and shortens retrieval time, but also makes the learned features more robust.
Step 2.1: Pre-training the multi-task salient object detection model
Step 2.1.1: Building the pre-training dataset
In the pre-training stage, the public large-scale Aerial Image Dataset (AID) is chosen as the standard pre-training dataset. AID contains 10,000 aerial images in 30 categories, all selected from Google Earth and annotated by remote sensing professionals. The images of each category come from different countries and regions and were captured at different times by different remote sensors; each image is 600×600 pixels, with resolutions ranging from 0.5 m/pixel to 8 m/pixel. Compared with other datasets, AID has a smaller intra-class gap and a larger inter-class gap, and is currently the largest aerial image dataset.
Step 2.1.2: Pre-training the salient object detection model
FCNN pre-training proceeds through the saliency detection task and the segmentation task together. Let χ denote a set of N1 training images of width W and height Q, with Xi the i-th image and Y_ijk the pixel-level ground-truth segmentation label of pixel (j,k) in the i-th image, where i = 1…N1, j = 1…W, k = 1…Q. Let Z denote a set of N2 training images, with Zn the n-th image, n = 1…N2, each having a corresponding ground-truth binary map Mn marking the salient object. Let θs be the shared convolutional layer parameters, θh the segmentation task parameters, and θf the saliency task parameters. Formulas (1) and (2) give the cross-entropy cost function J1(χ; θs, θh) of the segmentation task and the squared Euclidean distance cost function J2(Z; θs, θf) of the saliency detection task; the FCNN is trained by minimizing these two cost functions:

J1(χ; θs, θh) = − Σ_{i=1..N1} Σ_{j=1..W} Σ_{k=1..Q} Σ_{c=1..C} 1{Y_ijk = c} · ln h_cjk(Xi; θs, θh)    (1)

J2(Z; θs, θf) = Σ_{n=1..N2} ‖f(Zn; θs, θf) − Mn‖F²    (2)

In formula (1), 1{·} is the indicator function, h_cjk is element (j,k) of the confidence segmentation map for class c, c = 1…C, and h(Xi; θs, θh) is the semantic segmentation function, which returns confidence segmentation maps for all C object classes; C is the number of image categories in the pre-training dataset, taken as 30 in the present invention. In formula (2), f(Zn; θs, θf) is the saliency map output function and ‖·‖F denotes the F-norm.
Next, the above cost functions are minimized with stochastic gradient descent (SGD), with regularization over all training samples. Because the pre-training dataset does not carry segmentation and saliency annotations simultaneously, the segmentation task and the saliency detection task are trained alternately. Since training requires all original images to be normalized to one size, the present invention resizes the original images to 500×500 pixels for pre-training. The learning rate is an essential parameter of SGD that determines how fast the weights are updated: set too large, it makes the cost function oscillate past the optimum; set too small, it makes convergence excessively slow, so a smaller learning rate, such as 0.001±0.01, is generally preferred to keep the system stable. The momentum parameter and weight decay factor improve training adaptability; the momentum parameter usually lies in [0.9, 1.0] and the weight decay factor is usually 0.0005±0.0002. Based on experimental observation, the present invention sets the learning rate to 10⁻¹⁰, the momentum parameter to 0.99, and the weight decay factor to the Caffe framework default of 0.0005. The SGD learning process is accelerated on an NVIDIA GTX 1080 GPU for a total of 80,000 iterations. The detailed pre-training process is as follows:
1) The shared fully convolutional parameters θs are initialized from VGGNet;
2) The segmentation task parameters θh and the saliency task parameters θf are randomly initialized from a normal distribution;
3) With the current θs and θh, the segmentation network is trained with SGD and both parameters are updated;
4) With the current θs and θf, the saliency network is trained with SGD and the relevant parameters are updated;
5) With the current θs and θh, the segmentation network is trained again with SGD, obtaining updated θs and θh;
6) With the current θs and θf, the saliency network is trained again with SGD, updating the relevant parameters;
7) Steps 3)–6) are repeated three times to obtain the final pre-trained parameters θs, θh, θf.
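The alternating schedule of steps 3)–6) can be illustrated with a toy numpy sketch. The one-parameter "tasks", synthetic targets, and learning rate below are invented stand-ins for the real segmentation and saliency networks: a shared parameter theta_s is updated by SGD on each task's loss in turn, while each task keeps its own head parameter.

```python
import numpy as np

# Toy setup: each task predicts theta_task * (theta_s * x), so both task
# losses depend on the shared parameter theta_s (illustration only).
x = np.linspace(-1.0, 1.0, 50)
y_seg = 2.0 * x          # synthetic "segmentation" target
y_sal = 2.0 * x          # synthetic "saliency" target
lr = 0.05

theta_s, theta_h, theta_f = 0.5, 1.0, 1.0

def sgd_phase(theta_s, theta_task, target, steps=200):
    """One training phase: jointly update the shared parameter and the
    current task's head parameter by gradient descent on that task's
    squared-error loss."""
    for _ in range(steps):
        err = theta_task * theta_s * x - target
        g_s = np.mean(err * theta_task * x)   # d(loss)/d(theta_s)
        g_t = np.mean(err * theta_s * x)      # d(loss)/d(theta_task)
        theta_s -= lr * g_s
        theta_task -= lr * g_t
    return theta_s, theta_task

for _ in range(3):                            # "repeat steps 3)-6) three times"
    theta_s, theta_h = sgd_phase(theta_s, theta_h, y_seg)   # segmentation phase
    theta_s, theta_f = sgd_phase(theta_s, theta_f, y_sal)   # saliency phase
```

After the alternation both task predictions fit their targets, even though each phase perturbs the shared parameter the other task relies on; this is the behavior the alternating schedule exploits.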
Step 2.2: Adding the hash layer and fine-tuning the network for the target domain
Step 2.2.1: Constructing a Chinese remote sensing image dataset for fine-tuning
The expanded Wuhan University remote sensing dataset (WHU-RS) is used for fine-tuning the neural network. The original WHU-RS dataset contains 19 scene categories and 950 remote sensing images of varying resolution, each 600×600 pixels, all taken from Google Earth. Taking the topography and landforms of China into account, the original dataset is restructured and expanded to 7,000 remote sensing images as a sample library, with each category containing more than 200 images. The added samples differ in illumination, viewing angle, resolution, and size, which helps the neural network learn more robust saliency features.
Step 2.2.2: Fine-tuning the network with the added hash layer
The feature vectors produced by a deep neural network are high-dimensional, which is very time-consuming in large-scale image retrieval. Since similar images have similar binary hash codes, the present invention inserts a fully connected layer of s neurons, the hash layer H, between the penultimate layer of the pre-trained network and the final task layer; it maps the high-dimensional features to a low-dimensional space and generates binary hash codes for storage. The network structure is shown in Fig. 3. The weights of the hash layer H are initialized with hash values constructed by random projection; the neuron activation function is a sigmoid, which keeps output values between 0 and 1, with the threshold set empirically to 0.5, and the number of neurons equals the code length of the target binary code. The hash layer not only provides an abstraction of the features of the preceding layer but also serves as a bridge between mid-level and high-level image semantic features.
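A minimal numpy sketch of such a hash layer follows: a fully connected layer of s neurons with random-projection weight initialization and a sigmoid squashing outputs into (0, 1). The input dimension of 4096 is an assumption for illustration; the patent does not state the penultimate layer's width.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class HashLayer:
    """Sketch of the hash layer H: s neurons, weights drawn as a random
    projection, sigmoid activation so each output lies in (0, 1)."""
    def __init__(self, in_dim, s, seed=0):
        rng = np.random.default_rng(seed)
        # Random-projection initialization of the s x in_dim weight matrix
        self.W = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(s, in_dim))
        self.b = np.zeros(s)

    def forward(self, feature):
        return sigmoid(self.W @ feature + self.b)

layer = HashLayer(in_dim=4096, s=48)   # s = 48 as chosen in the patent
out = layer.forward(np.random.default_rng(1).normal(size=4096))
```

During fine-tuning these weights would be updated by back-propagation along with the rest of the network; only the initialization is random.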
The fine-tuning process adjusts the network weights by the back-propagation algorithm. Fine-tuning can be applied to the whole network or to part of it. Since the features learned by the lower layers of the network are more general, and to avoid overfitting, the present invention uses the expanded WHU-RS dataset to adjust mainly the higher layers, i.e., the network weights after the tenth convolutional layer. Typically, the dataset used for fine-tuning is 10%–50% smaller than the pre-training dataset; here, the fine-tuning dataset of 7,000 images is clearly smaller than the 10,000-image pre-training dataset. Compared with the pre-training settings, the network parameters for fine-tuning should be reduced appropriately: the number of iterations and the learning rate can be lowered to 1%–10% of their pre-training values. In the present invention, fine-tuning reduces the number of iterations to 8,000 and the learning rate to 1% of its pre-training value, i.e., 10⁻¹², while the momentum parameter and weight decay factor remain unchanged at 0.99 and 0.0005, respectively.
The detailed fine-tuning process is as follows:
1) The shared fully convolutional parameters θs, the segmentation task parameters θh, and the saliency task parameters θf are taken from the pre-training process;
2) With the current θs and θh, the segmentation network is trained with SGD and both parameters are updated;
3) With the current θs and θf, the saliency network is trained with SGD and the relevant parameters are updated;
4) With the current θs and θh, the segmentation network is trained again with SGD, obtaining updated θs and θh;
5) With the current θs and θf, the saliency network is trained again with SGD, updating the relevant parameters;
6) Steps 2)–5) are repeated three times to obtain the final parameters θs, θh, θf.
Step 3: Multi-level deep retrieval
The shallow layers of a deep convolutional neural network learn low-level visual features, while the deep layers capture image semantics. The present invention therefore adopts a coarse-to-fine retrieval strategy to achieve fast, accurate image retrieval. The feature extraction and retrieval process is shown in Fig. 4.
Step 3.1: Coarse retrieval
First, a set of candidates with similar high-level semantic features, i.e., similar binary activation values in the hash layer, is retrieved; a ranking of similar images is then generated according to a similarity measure.
Step 3.1.1: Generating binary hash codes
An image to be queried, Iq, is input into the fine-tuned neural network, and the output of the hash layer is extracted as the image signature, denoted Out(H). The binary code is obtained by binarizing the activation values against the threshold. For each bit r = 1…s, the binary code is output according to formula (3):

Hr = 1 if Out_r(H) ≥ 0.5, and Hr = 0 otherwise    (3)

Here s is the number of neurons in the hash layer; too many neurons lead to overfitting, so an initial value in the range [40, 100] is recommended, with the exact value adjusted to the actual training data; in the present invention s is set to 48. Γ = {I1, I2, …, In} denotes a retrieval dataset of n images. The binary code of each image is correspondingly written ΓH = {H1, H2, …, Hn}, where for i = 1…n, Hi ∈ {0,1}^s is the s-bit binary code whose bits, generated by the s neurons, each take the value 0 or 1.
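Formula (3) amounts to thresholding the hash-layer activations at 0.5, which can be sketched directly; the activation values below are hypothetical, for an s = 8 bit code:

```python
import numpy as np

def binarize(out_h, threshold=0.5):
    """Apply formula (3): bit r is 1 if the r-th hash-layer activation
    Out_r(H) is at least the threshold (0.5), else 0."""
    return (np.asarray(out_h) >= threshold).astype(np.uint8)

# Hypothetical hash-layer activations (illustration only)
activations = [0.91, 0.12, 0.55, 0.49, 0.73, 0.05, 0.50, 0.38]
code = binarize(activations).tolist()   # -> [1, 0, 1, 0, 1, 0, 1, 0]
```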
Step 3.1.2: Measuring similarity by Hamming distance
The Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. For a query image Iq with binary code Hq, if the Hamming distance between Hq and some Hi ∈ ΓH is below a set threshold, a candidate pool P = {Ic1, Ic2, …, Icm} of m candidate images is formed; in general, two images can be considered similar when their Hamming distance is less than 5.
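The coarse stage can be sketched as follows: count differing bits between the query code and each database code, and keep those within the threshold as the candidate pool. The 8-bit codes and threshold value here mirror the text's "less than 5" rule but are otherwise illustrative assumptions.

```python
import numpy as np

def hamming(a, b):
    """Number of differing bits between two equal-length binary codes."""
    return int(np.count_nonzero(np.asarray(a) != np.asarray(b)))

def coarse_candidates(h_q, codes, max_dist=5):
    """Return indices of database images whose binary code lies within
    max_dist Hamming distance of the query code (the candidate pool P)."""
    return [i for i, h in enumerate(codes) if hamming(h_q, h) < max_dist]

h_q = np.array([1, 0, 1, 1, 0, 0, 1, 0])
database = [
    np.array([1, 0, 1, 1, 0, 0, 1, 0]),   # distance 0 -> candidate
    np.array([1, 1, 1, 1, 0, 0, 1, 0]),   # distance 1 -> candidate
    np.array([0, 1, 0, 0, 1, 1, 0, 1]),   # distance 8 -> rejected
]
pool = coarse_candidates(h_q, database)    # -> [0, 1]
```

In practice the codes would be packed into machine words and compared with XOR plus a popcount, but the filtering logic is the same.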
Step 3.2: Fine retrieval
Step 3.2.1: Saliency feature extraction
Different convolutional layers of a deep convolutional network learn semantic features of an image at different levels, and the features learned by the middle and higher convolutional layers are better suited to image retrieval tasks. Therefore, the two-dimensional remote sensing feature maps of the query image Iq produced by the 13th and 15th convolutional layers of the network are each mapped into a one-dimensional vector and stored. In the subsequent retrieval process, the results obtained with the different feature vectors are compared to decide which layer's feature maps are finally used to extract the saliency features of remote sensing images.
Step 3.2.2: Measuring similarity by Euclidean distance
For a query image Iq and a candidate pool P, the extracted saliency feature vectors are used to pick the top-k images from P. Let Vq and Vci denote the feature vectors of the query image Iq and of candidate Ici, respectively. The Euclidean distance si between the feature vectors of Iq and the i-th image in P is defined as their similarity level, as in formula (4):

si = ‖Vq − Vci‖2    (4)

The smaller the Euclidean distance, the greater the similarity between the two images. The candidates Ici are sorted in ascending order of distance to the query image, and the top-k images are returned as the retrieval result.
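The fine-ranking step can be sketched with numpy; the 3-dimensional feature vectors below are placeholders for the flattened layer-13/15 feature maps:

```python
import numpy as np

def fine_rank(v_q, candidate_vecs, k=2):
    """Rank candidate-pool images by the Euclidean distance of formula (4)
    between their saliency feature vectors and the query's, ascending,
    and return the indices of the top-k most similar images."""
    d = [float(np.linalg.norm(v_q - v)) for v in candidate_vecs]
    order = np.argsort(d)                 # smaller distance = more similar
    return order[:k].tolist()

v_q = np.array([1.0, 0.0, 2.0])
candidates = [
    np.array([1.0, 0.1, 2.0]),    # distance 0.1
    np.array([4.0, 4.0, 4.0]),    # distance ~5.4
    np.array([1.0, 0.0, 1.0]),    # distance 1.0
]
top = fine_rank(v_q, candidates, k=2)     # -> [0, 2]
```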
Step 3.3: Evaluation of retrieval results
The present invention evaluates the retrieval results with a ranking-based criterion. For a query image q and the top-k retrieved result images, the precision is computed according to the following formula:

Precision@k = ( Σ_{i=1..k} Rel(i) ) / k

where the threshold k is set according to actual needs and Precision@k is the average precision over the top k returned results; Rel(i) ∈ {0, 1} denotes the relevance of the query image q and the image ranked i-th: 1 means the query image q and the i-th ranked image share the same class, i.e., they are relevant, and 0 means they are not.
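The evaluation formula reduces to a few lines of Python; the relevance judgments below are hypothetical:

```python
def precision_at_k(rel, k):
    """Precision@k: the fraction of the top-k returned images that share
    the query's class, with rel[i] in {0, 1} as defined in the text."""
    assert all(r in (0, 1) for r in rel[:k])
    return sum(rel[:k]) / k

# Hypothetical relevance judgments for the top 5 returned images
rel = [1, 1, 0, 1, 0]
p = precision_at_k(rel, 5)    # -> 0.6
```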
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710087670.5A CN106909924B (en) | 2017-02-18 | 2017-02-18 | Remote sensing image rapid retrieval method based on depth significance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106909924A true CN106909924A (en) | 2017-06-30 |
CN106909924B CN106909924B (en) | 2020-08-28 |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140122400A1 (en) * | 2012-10-25 | 2014-05-01 | Brain Corporation | Apparatus and methods for activity-based plasticity in a spiking neuron network |
CN105243154A (en) * | 2015-10-27 | 2016-01-13 | 武汉大学 | Remote sensing image retrieval method and system based on salient point features and sparse autoencoding |
CN105550709A (en) * | 2015-12-14 | 2016-05-04 | 武汉大学 | Remote sensing image power transmission line corridor forest region extraction method |
US20160232430A1 (en) * | 2014-05-29 | 2016-08-11 | International Business Machines Corporation | Scene understanding using a neurosynaptic system |
CN106227851A (en) * | 2016-07-29 | 2016-12-14 | 汤平 | End-to-end image retrieval method via deep hierarchical search based on deep convolutional neural networks |
CN106250812A (en) * | 2016-07-15 | 2016-12-21 | 汤平 | Vehicle model recognition method based on Fast R-CNN deep neural network |
CN106295139A (en) * | 2016-07-29 | 2017-01-04 | 汤平 | Tongue self-diagnosis health cloud service system based on deep convolutional neural networks |
CN106296692A (en) * | 2016-08-11 | 2017-01-04 | 深圳市未来媒体技术研究院 | Image saliency detection method based on adversarial networks |
CN106354735A (en) * | 2015-07-22 | 2017-01-25 | 杭州海康威视数字技术股份有限公司 | Image target searching method and device |
CN106408001A (en) * | 2016-08-26 | 2017-02-15 | 西安电子科技大学 | Rapid region-of-interest detection method based on deep kernelized hashing |
- 2017-02-18: application CN201710087670.5A filed in China; granted as CN106909924B, status Active
Non-Patent Citations (5)
Title |
---|
XIA Rongkai et al.: "Supervised Hashing for Image Retrieval via Image Representation Learning", Proceedings of the AAAI Conference on Artificial Intelligence * |
LI Yin et al.: "The secrets of salient object segmentation", 2014 IEEE Conference on Computer Vision and Pattern Recognition * |
LIU Ye et al.: "FP-CNNH: A fast image hashing algorithm based on deep convolutional neural networks", Computer Science * |
KE Shengcai et al.: "Image retrieval method based on convolutional neural networks and supervised kernel hashing", Acta Electronica Sinica * |
GONG Zhenting et al.: "Image retrieval method based on convolutional neural networks and hash coding", CAAI Transactions on Intelligent Systems * |
Cited By (101)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291945A (en) * | 2017-07-12 | 2017-10-24 | 上海交通大学 | High-precision clothing image retrieval method and system based on a visual attention model |
CN107463932A (en) * | 2017-07-13 | 2017-12-12 | 央视国际网络无锡有限公司 | Method for extracting picture features by using a binary bottleneck neural network |
CN107463932B (en) * | 2017-07-13 | 2020-07-10 | 央视国际网络无锡有限公司 | Method for extracting picture features by using binary bottleneck neural network |
CN110945535B (en) * | 2017-07-26 | 2024-01-26 | 国际商业机器公司 | Method for realizing artificial neural network ANN |
CN110945535A (en) * | 2017-07-26 | 2020-03-31 | 国际商业机器公司 | System and method for constructing synaptic weights for artificial neural networks from signed simulated conductance pairs with varying significance |
CN107392925A (en) * | 2017-08-01 | 2017-11-24 | 西安电子科技大学 | Remote sensing image terrain classification method based on super-pixel coding and convolutional neural networks |
CN107392925B (en) * | 2017-08-01 | 2020-07-07 | 西安电子科技大学 | Remote sensing image ground object classification method based on super-pixel coding and convolutional neural network |
CN107480261A (en) * | 2017-08-16 | 2017-12-15 | 上海荷福人工智能科技(集团)有限公司 | Fine-grained face image fast retrieval method based on deep learning |
CN107480261B (en) * | 2017-08-16 | 2020-06-16 | 上海荷福人工智能科技(集团)有限公司 | Fine-grained face image fast retrieval method based on deep learning |
CN109410211A (en) * | 2017-08-18 | 2019-03-01 | 北京猎户星空科技有限公司 | Method and device for segmenting a target object in an image |
CN109657522A (en) * | 2017-10-10 | 2019-04-19 | 北京京东尚科信息技术有限公司 | Method and apparatus for detecting drivable regions |
CN107729992B (en) * | 2017-10-27 | 2020-12-29 | 深圳市未来媒体技术研究院 | Deep learning method based on back propagation |
CN107729992A (en) * | 2017-10-27 | 2018-02-23 | 深圳市未来媒体技术研究院 | Deep learning method based on back propagation |
EP3477555A1 (en) * | 2017-10-31 | 2019-05-01 | General Electric Company | Multi-task feature selection neural networks |
CN109726812A (en) * | 2017-10-31 | 2019-05-07 | 通用电气公司 | Feature Ranking Neural Networks and Methods, Methods for Generating Simplified Feature Set Models |
CN108090117A (en) * | 2017-11-06 | 2018-05-29 | 北京三快在线科技有限公司 | A kind of image search method and device, electronic equipment |
US11281714B2 (en) | 2017-11-06 | 2022-03-22 | Beijing Sankuai Online Technology Co., Ltd | Image retrieval |
WO2019136591A1 (en) * | 2018-01-09 | 2019-07-18 | 深圳大学 | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network |
CN108446312B (en) * | 2018-02-06 | 2020-04-21 | 西安电子科技大学 | Optical remote sensing image retrieval method based on deep convolutional semantic network |
CN108446312A (en) * | 2018-02-06 | 2018-08-24 | 西安电子科技大学 | Remote sensing image retrieval method based on deep convolutional semantic network |
CN108257139A (en) * | 2018-02-26 | 2018-07-06 | 中国科学院大学 | RGB-D three-dimensional object detection method based on deep learning |
CN108257139B (en) * | 2018-02-26 | 2020-09-08 | 中国科学院大学 | RGB-D three-dimensional object detection method based on deep learning |
CN108427738A (en) * | 2018-03-01 | 2018-08-21 | 中山大学 | A kind of fast image retrieval method based on deep learning |
CN108287926A (en) * | 2018-03-02 | 2018-07-17 | 宿州学院 | Multi-source heterogeneous big data acquisition, processing and analysis framework for agro-ecology |
US11618438B2 (en) * | 2018-03-26 | 2023-04-04 | International Business Machines Corporation | Three-dimensional object localization for obstacle avoidance using one-shot convolutional neural network |
CN110414301A (en) * | 2018-04-28 | 2019-11-05 | 中山大学 | A method for estimating crowd density in train cars based on dual cameras |
CN108647655A (en) * | 2018-05-16 | 2018-10-12 | 北京工业大学 | Low-altitude aerial image power line foreign object detection method based on light convolutional neural network |
CN108647655B (en) * | 2018-05-16 | 2022-07-12 | 北京工业大学 | Low-altitude aerial image power line foreign object detection method based on light convolutional neural network |
CN109033505A (en) * | 2018-06-06 | 2018-12-18 | 东北大学 | Ultra-fast cooling temperature control method based on deep learning |
WO2019237646A1 (en) * | 2018-06-14 | 2019-12-19 | 清华大学深圳研究生院 | Image retrieval method based on deep learning and semantic segmentation |
CN109063569B (en) * | 2018-07-04 | 2021-08-24 | 北京航空航天大学 | A Semantic-level Change Detection Method Based on Remote Sensing Images |
CN109063569A (en) * | 2018-07-04 | 2018-12-21 | 北京航空航天大学 | Semantic-level change detection method based on remote sensing images |
CN109191426A (en) * | 2018-07-24 | 2019-01-11 | 江南大学 | Planar image saliency detection method |
CN109101907B (en) * | 2018-07-28 | 2020-10-30 | 华中科技大学 | A Vehicle Image Semantic Segmentation System Based on Bilateral Segmentation Network |
CN109101907A (en) * | 2018-07-28 | 2018-12-28 | 华中科技大学 | Vehicle-mounted image semantic segmentation system based on bilateral segmentation network |
US11010629B2 (en) | 2018-08-24 | 2021-05-18 | Petrochina Company Limited | Method for automatically extracting image features of electrical imaging well logging, computer equipment and non-transitory computer readable medium |
CN109389128A (en) * | 2018-08-24 | 2019-02-26 | 中国石油天然气股份有限公司 | Automatic extraction method and device for electric imaging logging image characteristics |
CN110866425A (en) * | 2018-08-28 | 2020-03-06 | 天津理工大学 | Pedestrian identification method based on light field camera and depth migration learning |
CN109035315A (en) * | 2018-08-28 | 2018-12-18 | 武汉大学 | Remote sensing image registration method and system fusing SIFT features and CNN features |
CN109389051A (en) * | 2018-09-20 | 2019-02-26 | 华南农业大学 | Building remote sensing image recognition method based on convolutional neural networks |
CN109284741A (en) * | 2018-10-30 | 2019-01-29 | 武汉大学 | A large-scale remote sensing image retrieval method and system based on deep hash network |
CN109522821A (en) * | 2018-10-30 | 2019-03-26 | 武汉大学 | Large-scale cross-source remote sensing image retrieval method based on cross-modal deep hashing network |
WO2020098296A1 (en) * | 2018-11-15 | 2020-05-22 | 中国银联股份有限公司 | Image retrieval method and device |
CN109522435A (en) * | 2018-11-15 | 2019-03-26 | 中国银联股份有限公司 | A kind of image search method and device |
CN109522435B (en) * | 2018-11-15 | 2022-05-20 | 中国银联股份有限公司 | Image retrieval method and device |
CN109639964A (en) * | 2018-11-26 | 2019-04-16 | 北京达佳互联信息技术有限公司 | Image processing method, processing unit and computer readable storage medium |
CN111260021A (en) * | 2018-11-30 | 2020-06-09 | 百度(美国)有限责任公司 | Predictive deep learning scaling |
CN111260021B (en) * | 2018-11-30 | 2024-04-05 | 百度(美国)有限责任公司 | Prediction deep learning scaling |
CN109753576A (en) * | 2018-12-25 | 2019-05-14 | 上海七印信息科技有限公司 | A kind of method for retrieving similar images |
CN111368109B (en) * | 2018-12-26 | 2023-04-28 | 北京眼神智能科技有限公司 | Remote sensing image retrieval method, remote sensing image retrieval device, computer readable storage medium and computer readable storage device |
CN111368109A (en) * | 2018-12-26 | 2020-07-03 | 北京眼神智能科技有限公司 | Remote sensing image retrieval method and device, computer readable storage medium and equipment |
CN109766467A (en) * | 2018-12-28 | 2019-05-17 | 珠海大横琴科技发展有限公司 | Remote sensing image retrieval method and system based on image segmentation and improved VLAD |
CN109766938A (en) * | 2018-12-28 | 2019-05-17 | 武汉大学 | Multi-class object detection method in remote sensing image based on scene label constrained deep network |
CN109670057B (en) * | 2019-01-03 | 2021-06-29 | 电子科技大学 | A progressive end-to-end deep feature quantization system and method |
CN109670057A (en) * | 2019-01-03 | 2019-04-23 | 电子科技大学 | Progressive end-to-end deep feature quantization system and method |
CN109902192A (en) * | 2019-01-15 | 2019-06-18 | 华南师范大学 | Remote sensing image retrieval method, system, equipment and medium based on unsupervised depth regression |
CN109919059B (en) * | 2019-02-26 | 2021-01-26 | 四川大学 | Salient object detection method based on deep network layering and multi-task training |
CN109886221A (en) * | 2019-02-26 | 2019-06-14 | 浙江水利水电学院 | Sand dredger recognition method based on saliency detection |
CN109919059A (en) * | 2019-02-26 | 2019-06-21 | 四川大学 | A salient object detection method based on deep network hierarchy and multi-task training |
CN109919108A (en) * | 2019-03-11 | 2019-06-21 | 西安电子科技大学 | A fast target detection method for remote sensing images based on deep hash-aided network |
CN109919108B (en) * | 2019-03-11 | 2022-12-06 | 西安电子科技大学 | Fast Object Detection Method for Remote Sensing Image Based on Deep Hash Assisted Network |
CN110020658A (en) * | 2019-03-28 | 2019-07-16 | 大连理工大学 | Salient object detection method based on multi-task deep learning |
CN110263799A (en) * | 2019-06-26 | 2019-09-20 | 山东浪潮人工智能研究院有限公司 | Image classification method and device based on deep saliency similarity graph learning |
CN110334765B (en) * | 2019-07-05 | 2023-03-24 | 西安电子科技大学 | Remote sensing image classification method based on attention mechanism multi-scale deep learning |
CN110334765A (en) * | 2019-07-05 | 2019-10-15 | 西安电子科技大学 | Remote sensing image classification method based on multi-scale deep learning of attention mechanism |
CN110399847A (en) * | 2019-07-30 | 2019-11-01 | 北京字节跳动网络技术有限公司 | Key frame extraction method and device, and electronic equipment |
CN110399847B (en) * | 2019-07-30 | 2021-11-09 | 北京字节跳动网络技术有限公司 | Key frame extraction method and device and electronic equipment |
CN110414513A (en) * | 2019-07-31 | 2019-11-05 | 电子科技大学 | Visual Saliency Detection Method Based on Semantic Enhanced Convolutional Neural Network |
CN110633633A (en) * | 2019-08-08 | 2019-12-31 | 北京工业大学 | A Road Extraction Method Based on Adaptive Threshold from Remote Sensing Image |
CN110580503A (en) * | 2019-08-22 | 2019-12-17 | 江苏和正特种装备有限公司 | AI-based double-spectrum target automatic identification method |
CN110765886A (en) * | 2019-09-29 | 2020-02-07 | 深圳大学 | Road target detection method and device based on convolutional neural network |
CN110765886B (en) * | 2019-09-29 | 2022-05-03 | 深圳大学 | A method and device for road target detection based on convolutional neural network |
CN110852295A (en) * | 2019-10-15 | 2020-02-28 | 深圳龙岗智能视听研究院 | Video behavior identification method based on multitask supervised learning |
CN110852295B (en) * | 2019-10-15 | 2023-08-25 | 深圳龙岗智能视听研究院 | Video behavior recognition method based on multitasking supervised learning |
CN112712090A (en) * | 2019-10-24 | 2021-04-27 | 北京易真学思教育科技有限公司 | Image processing method, device, equipment and storage medium |
CN110853053A (en) * | 2019-10-25 | 2020-02-28 | 天津大学 | Salient object detection method taking multiple candidate objects as semantic knowledge |
CN111160127A (en) * | 2019-12-11 | 2020-05-15 | 中国资源卫星应用中心 | A remote sensing image processing and detection method based on a deep convolutional neural network model |
CN111695572A (en) * | 2019-12-27 | 2020-09-22 | 珠海大横琴科技发展有限公司 | Ship retrieval method and device based on convolutional layer feature extraction |
CN111640087A (en) * | 2020-04-14 | 2020-09-08 | 中国测绘科学研究院 | Image change detection method based on SAR (synthetic aperture radar) deep full convolution neural network |
CN112052736A (en) * | 2020-08-06 | 2020-12-08 | 浙江理工大学 | Cloud computing platform-based field tea tender shoot detection method |
CN112102245B (en) * | 2020-08-17 | 2024-08-20 | 清华大学 | Deep learning-based grape embryo slice image processing method and device |
CN112102245A (en) * | 2020-08-17 | 2020-12-18 | 清华大学 | Grape fetus slice image processing method and device based on deep learning |
CN112541912B (en) * | 2020-12-23 | 2024-03-12 | 中国矿业大学 | Rapid detection method and device for salient targets in mine sudden disaster scene |
CN112541912A (en) * | 2020-12-23 | 2021-03-23 | 中国矿业大学 | Method and device for rapidly detecting saliency target in mine sudden disaster scene |
CN112579816A (en) * | 2020-12-29 | 2021-03-30 | 二十一世纪空间技术应用股份有限公司 | Remote sensing image retrieval method and device, electronic equipment and storage medium |
CN112667832A (en) * | 2020-12-31 | 2021-04-16 | 哈尔滨工业大学 | Vision-based mutual positioning method in unknown indoor environment |
CN112667832B (en) * | 2020-12-31 | 2022-05-13 | 哈尔滨工业大学 | A Vision-Based Mutual Localization Method in Unknown Indoor Environment |
CN114764822A (en) * | 2021-01-12 | 2022-07-19 | 中国移动通信有限公司研究院 | Image processing method and device and electronic equipment |
CN114764822B (en) * | 2021-01-12 | 2025-05-09 | 中国移动通信有限公司研究院 | Image processing method, device and electronic equipment |
CN112801192B (en) * | 2021-01-26 | 2024-03-19 | 北京工业大学 | Extended LargeVis image feature dimension reduction method based on deep neural network |
CN112801192A (en) * | 2021-01-26 | 2021-05-14 | 北京工业大学 | Extended LargeVis image feature dimension reduction method based on deep neural network |
CN112926667B (en) * | 2021-03-05 | 2022-08-30 | 中南民族大学 | Method and device for detecting saliency target of depth fusion edge and high-level feature |
CN112926667A (en) * | 2021-03-05 | 2021-06-08 | 中南民族大学 | Method and device for detecting saliency target of depth fusion edge and high-level feature |
CN113205481A (en) * | 2021-03-19 | 2021-08-03 | 浙江科技学院 | Salient object detection method based on stepped progressive neural network |
CN113326926A (en) * | 2021-06-30 | 2021-08-31 | 上海理工大学 | Fully-connected Hash neural network for remote sensing image retrieval |
CN113326926B (en) * | 2021-06-30 | 2023-05-09 | 上海理工大学 | A fully connected hashing neural network for remote sensing image retrieval |
CN115292530A (en) * | 2022-09-30 | 2022-11-04 | 北京数慧时空信息技术有限公司 | Remote sensing image overall management system |
CN116089646A (en) * | 2023-01-04 | 2023-05-09 | 武汉理工大学 | Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism |
CN116089646B (en) * | 2023-01-04 | 2025-07-18 | 武汉理工大学 | A hash retrieval method for drone images based on saliency capture mechanism |
CN116894100A (en) * | 2023-07-24 | 2023-10-17 | 北京和德宇航技术有限公司 | Remote sensing image display control method, device and storage medium |
CN116894100B (en) * | 2023-07-24 | 2024-04-09 | 北京和德宇航技术有限公司 | Remote sensing image display control method, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106909924B (en) | 2020-08-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106909924B (en) | 2020-08-28 | Remote sensing image rapid retrieval method based on deep saliency | |
CN111931684B (en) | A weak and small target detection method based on discriminative features of video satellite data | |
Jia et al. | A semisupervised Siamese network for hyperspectral image classification | |
CN109948425B (en) | A pedestrian search method and device based on structure-aware self-attention and online instance aggregation and matching | |
Xia et al. | AID: A benchmark data set for performance evaluation of aerial scene classification | |
CN109598241B (en) | Recognition method for ships at sea in satellite imagery based on Faster R-CNN | |
Ali et al. | Image retrieval by addition of spatial information based on histograms of triangular regions | |
Xu et al. | High-resolution remote sensing image change detection combined with pixel-level and object-level | |
Wulamu et al. | Multiscale road extraction in remote sensing images | |
CN114694038A (en) | High-resolution remote sensing image classification method and system based on deep learning | |
Liu et al. | Survey of road extraction methods in remote sensing images based on deep learning | |
CN109993072A (en) | Low-resolution pedestrian re-identification system and method based on super-resolution image generation | |
CN106844739B (en) | A Retrieval Method of Remote Sensing Image Change Information Based on Neural Network Co-training | |
CN106886785A (en) | Aerial image fast matching algorithm based on multi-feature hash learning | |
Kollapudi et al. | A New Method for Scene Classification from the Remote Sensing Images. | |
CN116824485A (en) | A deep learning-based small target detection method for disguised persons in open scenes | |
CN116363526A (en) | MROCNet model construction and multi-source remote sensing image change detection method and system | |
CN114998688A (en) | A large field of view target detection method based on improved YOLOv4 algorithm | |
CN109583371A (en) | Landmark information extraction and matching method based on deep learning | |
Verma et al. | Intelligence embedded image caption generator using LSTM based RNN model | |
CN115497006B (en) | Urban remote sensing image change depth monitoring method and system based on dynamic mixing strategy | |
Codella et al. | Towards large scale land-cover recognition of satellite images | |
CN108960005A (en) | Method and system for establishing and displaying visual object labels in an intelligent vision Internet of Things | |
CN116863327B (en) | Cross-domain small sample classification method based on cooperative antagonism of double-domain classifier | |
CN107832793A (en) | Hyperspectral image classification method and system | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |