CN118628997A

CN118628997A - A cross-modal place recognition method based on efficient self-attention

Info

Publication number: CN118628997A
Application number: CN202410817613.8A
Authority: CN
Inventors: 窦立云; 王进; 梁瑞; 芦欣
Original assignee: NANTONG INSTITUTE OF TECHNOLOGY
Current assignee: NANTONG INSTITUTE OF TECHNOLOGY
Priority date: 2024-06-24
Filing date: 2024-06-24
Publication date: 2024-09-10

Abstract

The invention discloses a cross-modal place recognition method based on efficient self-attention, comprising: inputting an RGB image and an infrared image into a dual-stream network; performing a quantization operation on the input image, and extracting image salient features by a color contrast method; inputting the image into a ResNet50 network, performing high-dimensional feature mapping on the features output by the third stage of the ResNet50 network by a high-dimensional feature mapping module, and outputting final overall features; performing image block processing on the features output by the third stage of the ResNet50 network, and processing local shared features by an efficient self-attention module EMSA, and outputting local features; fusing image salient features and local features to obtain final local features; performing overall and local feature collaborative constraints on the final overall features and the final local features; if a specified number of training rounds is reached, then terminating the training to obtain a cross-modal place recognition model based on an efficient self-attention mechanism; otherwise, continuing to complete the training; and improving the accuracy and efficiency of the entire network place recognition.

Description

A cross-modal place recognition method based on efficient self-attention

技术领域Technical Field

本发明涉及计算机视觉技术领域，尤其涉及一种基于高效自注意力的跨模态地点识别方法。The present invention relates to the field of computer vision technology, and in particular to a cross-modal place recognition method based on efficient self-attention.

背景技术Background Art

视觉地点识别(Visual Place Recognition,VPR)是一个高度先进的技术，主要用于辅助机器人或视觉导航系统准确判定其是否处于先前访问过的位置；VPR通过摄像机捕获的图像进行智能识别，从而实现精准的地点定位；在视觉同步定位与地图构建(SLAM)系统中，VPR扮演着关键角色，它不仅可以用于重新定位，还能有效地进行地图复用和回环修正。众多技术的创新和发展在视觉地点识别(VPR)领域取得了显著的进步。然而，VPR任务仍面临着一系列挑战，尤其是图像间类内差异的显著性，这主要是由外观和视角的变化引起的。虽然许多研究提出了解决这些挑战的方法，但大多数研究并未充分考虑光照变化的影响。Visual Place Recognition (VPR) is a highly advanced technology that is mainly used to assist robots or visual navigation systems to accurately determine whether they are in a previously visited location. VPR uses images captured by cameras for intelligent recognition to achieve accurate location positioning. In visual simultaneous localization and mapping (SLAM) systems, VPR plays a key role, not only for relocalization, but also for effective map reuse and loop correction. Numerous technological innovations and developments have made significant progress in the field of visual place recognition (VPR). However, the VPR task still faces a series of challenges, especially the significance of intra-class differences between images, which is mainly caused by changes in appearance and perspective. Although many studies have proposed solutions to these challenges, most studies have not fully considered the impact of illumination changes.

在面对光照条件变化的挑战时，机器人通过采用一种灵活的硬件方案来适应不同的环境；在高照度环境下，机器人利用RGB摄像头进行图像采集；而在低照度条件下，则转而使用红外摄像头。可见光与红外跨模态视觉地点识别(VI-VPR)的主要目标是从不同模态的摄像头捕获地点的图像，并据此确定其位置。这种方法的独特之处在于，它的查询集和图库集包含来自不同模态的图像，从而有效克服了由光照条件引起的限制。然而，与传统的单模态地点位置识别相比，VI-VPR面临着特有的挑战，尤其是在处理RGB图像与红外图像之间的跨模态差异时。红外光图像与可见光图像之间存在显著的差异，这导致现有的地点识别方法在处理多模态识别问题时效果欠佳；同时还存在特征匹配速度较慢的主要缺点，这限制了整体性能的提升。When faced with the challenge of changing lighting conditions, the robot adopts a flexible hardware solution to adapt to different environments; in high-light environments, the robot uses RGB cameras for image acquisition; in low-light conditions, it switches to infrared cameras. The main goal of cross-modal visual place recognition with visible and infrared light (VI-VPR) is to capture images of places from cameras of different modalities and determine their locations based on them. The uniqueness of this approach is that its query set and gallery set contain images from different modalities, which effectively overcomes the limitations caused by lighting conditions. However, compared with traditional single-modal place location recognition, VI-VPR faces unique challenges, especially when dealing with cross-modal differences between RGB images and infrared images. There are significant differences between infrared and visible light images, which makes existing place recognition methods less effective when dealing with multimodal recognition problems; there is also a major disadvantage of slow feature matching, which limits the overall performance improvement.

发明内容Summary of the invention

发明目的：为了克服现有技术中存在的不足，本发明提供一种基于高效自注意力的跨模态地点识别方法，提出了一种高效的自注意力方法，并将像素颜色对比度作为一个关键因素融入网络中，在不损失精度的同时，提升地点识别任务中特征匹配速度。Purpose of the invention: In order to overcome the shortcomings of the prior art, the present invention provides a cross-modal place recognition method based on efficient self-attention, proposes an efficient self-attention method, and integrates pixel color contrast as a key factor into the network, thereby improving the feature matching speed in the place recognition task without losing accuracy.

技术方案：为实现上述目的，一种基于高效自注意力的跨模态地点识别方法，包括以下步骤：Technical solution: To achieve the above purpose, a cross-modal place recognition method based on efficient self-attention includes the following steps:

步骤一、将RGB图像和红外图像输入到双流网络；Step 1: Input the RGB image and infrared image into the two-stream network;

步骤二、对输入的图像进行量化操作，计算RGB图像和红外图像的颜色距离以及每种模态量化前图像与量化后图像的颜色距离，并采用颜色对比度方法提取图像显著性特征；Step 2: quantize the input image, calculate the color distance between the RGB image and the infrared image, and the color distance between the image before and after quantization of each modality, and use the color contrast method to extract the image's salient features;

步骤三、将图像输入到ResNet50网络，从每种模态的图像中提取独有的特征；Step 3: Input the image into the ResNet50 network to extract unique features from the image of each modality;

步骤四、ResNet50网络第三阶段输出的特征经过高维度特征映射模块进行高维度特征映射，输出最终整体特征；Step 4: The features output by the third stage of the ResNet50 network are mapped to high-dimensional features by the high-dimensional feature mapping module, and the final overall features are output;

步骤五、对ResNet50网络第三阶段输出的特征进行图像块处理得到局部共享特征，并通过高效自注意力模块EMSA处理局部共享特征，输出局部特征；Step 5: Perform image block processing on the features output by the third stage of the ResNet50 network to obtain local shared features, and process the local shared features through the efficient self-attention module EMSA to output local features;

步骤六、将步骤二获得的图像显著性特征和步骤五获得的局部特征进行融合得到最终局部特征；Step 6: Fusing the image salient features obtained in step 2 and the local features obtained in step 5 to obtain the final local features;

步骤七、将最终整体特征与最终局部特征进行整体特征和局部特征协同约束；Step 7: Coordinate constraints between the final global features and the final local features;

步骤八、若是达到指定的训练轮数，则结束训练得到基于高效自注意力机制的跨模态地点识别模型；否则返回步骤一继续完成训练。Step 8: If the specified number of training rounds is reached, the training is terminated to obtain a cross-modal place recognition model based on an efficient self-attention mechanism; otherwise, return to step 1 to continue training.

进一步的，设RGB地点图像共有n张，红外地点图像共有m张，RGB模态的样本表示为红外模态样本表示为其中，表示第i幅RGB地点图像，表示第j幅红外地点图像，和分别表示和对应的地标；Furthermore, assuming that there are n RGB location images and m infrared location images, the samples of RGB mode are represented as The infrared modal sample is represented as in, represents the i-th RGB location image, represents the jth infrared location image, and Respectively and corresponding landmarks;

所述步骤二中，将图像定义为：In the step 2, the image is defined as:

I(x,y)＝{I_L(x,y),I_a(x,y),I_b(x,y)}I(x,y)＝{I _L (x,y),I _a (x,y),I _b (x,y)}

归一化后0≤I_L(x,y),I_a(x,y),I_b(x,y)≤1；After normalization, _0≤IL (x,y), _Ia (x,y), _Ib (x,y)≤1;

对输入的RGB图像和红外图像进行量化操作，得到量化后的两幅彩色图像f_i和g_i，通过颜色距离d来评估量化后两幅彩色图像f_i和g_i之间的差异，颜色距离d表示为：The input RGB image and infrared image are quantized to obtain two quantized color images _fi and _gi . The difference between the two quantized color images _fi and _gi is evaluated by the color distance d. The color distance d is expressed as:

式中，L代表图片的亮度，a代表绿色到红色的分量，b代表从蓝色到黄色的分量。进一步的，设颜色量化后的调色板颜色数量为K，在颜色空间中寻找K个颜色：In the formula, L represents the brightness of the image, a represents the component from green to red, and b represents the component from blue to yellow. Further, let the number of colors in the color palette after color quantization be K, and find K colors in the color space:

(C_1L,C_1a,C_1b),(C_2L,C_2a,C_2b),…,(C_KL,C_Ka,C_Kb)(C _1L ,C _1a ,C _1b ),(C _2L ,C _2a ,C _2b ),…,(C _KL ,C _Ka ,C _Kb )

式中，0≤C_KL,C_Ka,C_Kb≤1；Where, 0≤C _KL ,C _Ka ,C _Kb ≤1;

计算量化前RGB图像和量化后的图像g_i的颜色距离，以及计算量化前红外图像和量化后的图像f_i的颜色距离；以彩色图像为例，设量化前的图像和量化后的图像g_i的颜色距离d满足公式:Calculate the color distance between the RGB image before quantization and the image after quantization g _i , and calculate the color distance between the infrared image before quantization and the image after quantization _fi ; Taking the color image as an example, assume that the image before quantization The color distance d of the quantized image _gi satisfies the formula:

式中，1≤x≤W,1≤y≤H,1≤j≤K，W，H为图像的宽高；量化后使得最小；量化后的结果使用颜色对比度的方法处理，得到各个像素颜色在图像中的显著性特征。In the formula, 1≤x≤W, 1≤y≤H, 1≤j≤K, W, H are the width and height of the image; after quantization, The quantified result is processed using the color contrast method to obtain the significant features of each pixel color in the image.

进一步的，选择使用Lab颜色空间来计算颜色对比度，通过对比一个特定像素的颜色与图像中其他像素的颜色，量化该像素的显著度；得到各个像素颜色在图像中的显著性值，设图像中某元素为则显著性值：Furthermore, we choose to use the Lab color space to calculate the color contrast. By comparing the color of a specific pixel with the colors of other pixels in the image, we can quantify the saliency of the pixel. We get the saliency value of each pixel color in the image. Let an element in the image be Then the significance value is:

由于图像经过颜色量化处理，每个像素的颜色都通过图像的颜色直方图特征来确定，从中获取相应的量化颜色值；提出优化方法仅需计算与每一个像素对应的颜色相关的显著性值，得到优化后显著性值：Since the image has been color quantized, the color of each pixel is determined by the color histogram features of the image, from which the corresponding quantized color value is obtained; the proposed optimization method only needs to calculate the saliency value associated with the color corresponding to each pixel to obtain the optimized saliency value:

式中，c_l为像素M量化后的颜色值，f_j为颜色c_l在图像中出现的频率，α_j为颜色c_l的定量值，K为量化后的颜色数量；Where c _l is the quantized color value of pixel M, f _j is the frequency of color c _l in the image, α _j is the quantitative value of color c _l , and K is the number of quantized colors;

依据图像中每个像素以及每个像素对应的颜色相关的显著性值，提取图像中的图像显著性特征。The image salient features in the image are extracted based on the saliency value associated with each pixel in the image and the color corresponding to each pixel.

进一步的，所述步骤五中，在ResNet50网络中的第三阶段输出的特征为和以特征块的形式提取局部共享特征和局部共享特征和经过高效注意力模块EMSA得到局部特征 Furthermore, in step 5, the features output in the third stage of the ResNet50 network are and Extract local shared features in the form of feature blocks and Local shared features and The local features are obtained through the efficient attention module EMSA

设F为其中一个整体特征候选描述符：Let F be one of the overall feature candidate descriptors:

式中，C，H，W分别代表通道、高和宽三个维度，从F中提取一组步幅为S_p的大小为d_x×d_y的补丁级特征{P_i，x_i,y_i}，补丁级特征总数为：Where C, H, and W represent the three dimensions of channel, height, and width, respectively. A set of patch-level features {P _i , x _i , y _i } with a stride of _Sp and a size of d _x ×d _y is extracted from F. The total number of patch-level features is:

式中，P_i表示补丁级特征集合，(x_i,y_i)表示补丁级特征的中心在特征映射上的坐标；Where _Pi represents the patch-level feature set, ( _xi , _yi ) represents the coordinates of the center of the patch-level feature on the feature map;

高效注意力模块EMSA将P_i进行卷积操作得到Q，将P_i记作x，二维输入标记被重塑成沿空间维度的三维标记并输入到深度卷积操作，减小高度H和宽度W维度；s为缩减因子，s根据特征图大小或阶段数来自适应设置，核大小、步幅和填充分别为s+1、s和s/2；经过空间缩减的新标记映射被重塑为二维标记，得其中n'＝h/s×w/s；被送入两组投影操作以获取键K和值V。The efficient attention module EMSA performs a convolution operation on _Pi to obtain Q, and _Pi is recorded as x. The two-dimensional input is marked Reshaped into a three-dimensional marker along the spatial dimension And input to the depth convolution operation to reduce the height H and width W dimensions; s is the reduction factor, s is adaptively set according to the feature map size or the number of stages, the kernel size, stride and padding are s+1, s and s/2 respectively; the new label mapping after spatial reduction is reshaped into a two-dimensional mark, Where n' = h/s × w/s; is fed into two sets of projection operations to obtain the key K and value V.

进一步的，采用EMSA方程来计算在查询Q、键K和值V上的注意力函数；EMSA方程如下：Furthermore, the EMSA equation is used to calculate the attention function on the query Q, key K and value V; the EMSA equation is as follows:

式中，Conv(·)代表一个标准的1×1卷积操作，作用是模仿不同注意力头部之间的交互作用；Where Conv(·) represents a standard 1×1 convolution operation, which is used to simulate the interaction between different attention heads;

高效注意力模块EMSA的设置，使得每个注意力头的函数能够同时依赖于所有键Key和查询Query；在应用Softmax函数后的点积矩阵中引入了实例归一化InstanceNormalization，标记为IN(·)；将每个头的输出值被连接起来并线性投影，以形成最终的输出。The setting of the efficient attention module EMSA enables the function of each attention head to depend on all keys and queries at the same time; instance normalization is introduced in the dot product matrix after applying the Softmax function, marked as IN(·); the output values of each head are connected and linearly projected to form the final output.

进一步的，所述EMSA方程计算成本的值S_t是：Furthermore, the value of the EMSA equation calculation cost _St is:

进一步的，所述步骤六中，将步骤二中提取的显著性图和步骤五中得到的局部特征进行特征融合，获得融合颜色特征的图像显著性图M：Furthermore, in step 6, the saliency map extracted in step 2 is And the local features obtained in step 5 Perform feature fusion to obtain the image saliency map M of the fused color features:

上述公式在于衡量某一像素I_k的两个显著值S(I_k)和m(I_k)之间的差异。The above formula is to measure the difference between two saliency values S(I _k ) and m(I _k ) of a certain pixel I _k .

有益效果：本发明的一种基于高效自注意力的跨模态地点识别方法，将像素颜色对比度作为一个关键因素融入到网络中，在不损失精度的同时，成功地解决了地点识别任务中特征匹配速度缓慢的问题；提出基于高效自注意力模块的跨模态模型EMSA，通过这种方法的应用，能够更快、更准确地完成复杂环境下的地点识别任务，显著推动了视觉地点识别技术的进步；并选择了图像失真度更低的K-Means聚类量化方法；提高了整个网络地点识别的准确性和效率。Beneficial effects: The present invention adopts a cross-modal place recognition method based on efficient self-attention, which integrates pixel color contrast into the network as a key factor, and successfully solves the problem of slow feature matching speed in place recognition tasks without losing accuracy; proposes a cross-modal model EMSA based on an efficient self-attention module. Through the application of this method, place recognition tasks in complex environments can be completed faster and more accurately, which significantly promotes the progress of visual place recognition technology; and selects a K-Means clustering quantization method with lower image distortion; improves the accuracy and efficiency of the entire network place recognition.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

附图1为基于高效自注意力机制的跨模态地点识别模型训练流程图；Figure 1 is a flowchart of the cross-modal place recognition model training based on the efficient self-attention mechanism;

附图2为高效注意力EMSA模型示意图；Figure 2 is a schematic diagram of the high-efficiency attention EMSA model;

附图3为RGB图像显著性图S(I)提取过程示意图；Figure 3 is a schematic diagram of the extraction process of the RGB image saliency map S(I);

附图4为RGB图像显著图M(I)提取过程示意图。FIG4 is a schematic diagram of the extraction process of the RGB image saliency map M(I).

具体实施方式DETAILED DESCRIPTION

下面结合附图对本发明作更进一步的说明。The present invention will be further described below in conjunction with the accompanying drawings.

如附图1所示，一种基于高效自注意力的跨模态地点识别方法，包括以下步骤：As shown in FIG1 , a cross-modal place recognition method based on efficient self-attention includes the following steps:

步骤三、将图像输入到ResNet50网络，从每种模态的图像中精确提取独有的特征；Step 3: Input the image into the ResNet50 network to accurately extract unique features from the image of each modality;

步骤八、若是达到指定的训练轮数，则结束训练得到基于高效自注意力机制的跨模态地点识别模型，依据得到的模型以实现地点的准确识别；否则返回步骤一继续完成训练。Step 8: If the specified number of training rounds is reached, the training is terminated to obtain a cross-modal place recognition model based on an efficient self-attention mechanism, and the location is accurately recognized based on the obtained model; otherwise, return to step 1 to continue training.

所述步骤六和步骤七之间还包括一个步骤，这一步骤为通过高维度模态内特征聚合模块将ResNet50网络第四阶段的特征和最终整体特征进行融合，对整体特征进行约束。There is also a step between step six and step seven, which is to fuse the features of the fourth stage of the ResNet50 network and the final overall features through a high-dimensional intra-modal feature aggregation module to constrain the overall features.

如附图3所示，设RGB地点图像共有n张，红外地点图像共有m张，RGB模态的样本表示为红外模态样本表示为其中，表示第i幅RGB地点图像，表示第j幅红外地点图像，和分别表示和对应的地标；RGB图像由红、绿、蓝三种基色构成，并含有三个通道，红外图像则以单通道的形式存在；由于红外地点图像的通道数为1，因此在数据预处理阶段，以填充相同值的策略，将其扩充成三通道，从而与RGB地点图像保持通道数的一致，方便后续模型训练；在步骤一中，输入一个RGB模态样本和一个红外模态样本 As shown in Figure 3, suppose there are n RGB location images and m infrared location images, and the samples of RGB mode are represented as The infrared modal sample is represented as in, represents the i-th RGB location image, represents the jth infrared location image, and Respectively and The corresponding landmark; RGB images are composed of three primary colors: red, green, and blue, and contain three channels, while infrared images exist in the form of a single channel; Since the number of channels of infrared location images is 1, in the data preprocessing stage, the strategy of filling the same value is used to expand it into three channels, so as to keep the number of channels consistent with the RGB location image, which is convenient for subsequent model training; In step 1, input an RGB modal sample and an infrared modality sample

所述步骤二中，考虑在Lab颜色空间下的图像颜色特征的提取，将图像定义为：In the step 2, the extraction of image color features in the Lab color space is considered, and the image is defined as:

归一化后0≤I_L(x,y),I_a(x,y),I_b(x,y)≤1；将RGB图像定义为由图像的亮度L，绿色到红色的分量a，蓝色带黄色的分量b组成；同样将红外图像定义为由图像的亮度L，绿色到红色的分量a，蓝色带黄色的分量b组成。After normalization, _0≤IL (x,y), _Ia (x,y), _Ib (x,y)≤1; the RGB image is defined as consisting of the image brightness L, the green to red component a, and the blue with yellow component b; similarly, the infrared image is defined as consisting of the image brightness L, the green to red component a, and the blue with yellow component b.

如附图3所示，对输入的RGB图像和红外图像进行量化操作，对输入图像实施的量化操作采用了K-Means方法，得到量化后的两幅彩色图像f_i和g_i，f_i原本是红外图像，但经过图像预处理已转换为三通道格式，从而也被视为彩色图像；K-Means是一种广受欢迎的颜色量化技术，通过将颜色空间分割成多个簇来实现，每个簇代表一种特定的颜色；使用K-Means方法进行颜色量化处理结果与原始图像在视觉上几乎没有明显差异，并且不会产生色彩失真的问题。通过颜色距离d来评估量化后两幅彩色图像f_i和g_i之间的差异，颜色距离d表示为：As shown in Figure 3, the input RGB image and infrared image are quantized. The quantization operation implemented on the input image uses the K-Means method to obtain two quantized color images _fi and g _i . _fi was originally an infrared image, but it has been converted into a three-channel format after image preprocessing, so it is also regarded as a color image. K-Means is a popular color quantization technology that is implemented by dividing the color space into multiple clusters, each cluster representing a specific color. The color quantization result using the K-Means method has almost no obvious visual difference from the original image, and there is no color distortion problem. The difference between the two quantized color images _fi and g _i is evaluated by the color distance d, and the color distance d is expressed as:

式中，L代表图片的亮度，a代表绿色到红色的分量，b代表从蓝色到黄色的分量。Where L represents the brightness of the image, a represents the component from green to red, and b represents the component from blue to yellow.

设颜色量化后的调色板颜色数量为K，在颜色空间中寻找K个颜色，并且记录每一个颜色的亮度L，绿色到红色的分量a，蓝色带黄色的分量b：Assume that the number of colors in the color palette after color quantization is K, find K colors in the color space, and record the brightness L of each color, the component a from green to red, and the component b from blue to yellow:

(C_1L,C_1a,C_1b),(C_2L,C_2a,C_2b),...,(C_KL,C_Ka,C_Kb)(C _1L ,C _1a ,C _1b ),(C _2L ,C _2a ,C _2b ),...,(C _KL ,C _Ka ,C _Kb )

式中，0≤C_KL,C_Ka,C_Kb≤1；Where, 0≤C _KL ,C _Ka ,C _Kb ≤1;

如附图3所示，计算量化前RGB图像和量化后的图像g_i的颜色距离，以及计算量化前红外图像和量化后的图像f_i的颜色距离；以彩色图像为例，设量化前的图像和量化后的图像g_i的颜色距离d满足公式:As shown in FIG3 , the color distance between the RGB image before quantization and the image after quantization g _i is calculated, and the color distance between the infrared image before quantization and the image after quantization _fi is calculated; taking the color image as an example, assuming that the image before quantization The color distance d of the quantized image _gi satisfies the formula:

如附图3所示，选择使用Lab颜色空间来计算颜色对比度，通过对比一个特定像素的颜色与图像中其他像素的颜色，量化该像素的显著度；Lab颜色空间特别适用于此类计算，因为它能够更准确地描绘颜色亮度和色彩差异；Lab颜色空间方法不仅提供了关于颜色对比度的量化分析，还能有效地揭示图像中的显著特征，为图像处理和分析提供重要信息；进而得到各个像素颜色在图像中的显著性值，设图像中某元素为则显著性值：As shown in Figure 3, the Lab color space is selected to calculate the color contrast. By comparing the color of a specific pixel with the colors of other pixels in the image, the prominence of the pixel is quantified. The Lab color space is particularly suitable for such calculations because it can more accurately depict color brightness and color differences. The Lab color space method not only provides quantitative analysis of color contrast, but also effectively reveals significant features in the image, providing important information for image processing and analysis. Then, the prominence value of each pixel color in the image is obtained. Suppose an element in the image is Then the significance value is:

在处理一幅图像时，若对每个像素单独计算其显著性值，会显著降低算法的执行效率。考虑到由于图像经过颜色量化处理，每个像素的颜色都通过图像的颜色直方图特征来确定，从中获取相应的量化颜色值；提出优化方法对上述处理流程进行了优化，仅需计算与每一个像素对应的颜色相关的显著性值，从而大幅减少重复计算，得到优化后显著性值：When processing an image, if the saliency value of each pixel is calculated separately, the execution efficiency of the algorithm will be significantly reduced. Considering that the image has been color quantized, the color of each pixel is determined by the color histogram features of the image, and the corresponding quantized color value is obtained from it; the optimization method is proposed to optimize the above processing flow, and only the saliency value related to the color corresponding to each pixel needs to be calculated, thereby greatly reducing repeated calculations and obtaining the optimized saliency value:

依据图像中每个像素以及每个像素对应的颜色相关的显著性值，提取图像中的图像显著性特征；以RGB图像为例，基于颜色对比度的显著性图S(I)提取过程如附图3所示。According to the saliency value related to each pixel in the image and the color corresponding to each pixel, the image saliency features in the image are extracted; taking the RGB image as an example, the process of extracting the saliency map S(I) based on color contrast is shown in Figure 3.

如附图2所示，所述步骤五中，在ResNet50网络中的第三阶段输出的特征为和以特征块的形式提取局部共享特征和局部共享特征和经过高效注意力模块EMSA得到局部特征 As shown in FIG. 2 , in step 5, the features output in the third stage of the ResNet50 network are and Extract local shared features in the form of feature blocks and Local shared features and The local features are obtained through the efficient attention module EMSA

式中，C，H，W分别代表通道、高和宽三个维度，从F中提取一组步幅为S_p的大小为d_x×d_y的补丁级特征{P_i,x_i,y_i}，补丁级特征总数为：Where C, H, and W represent the three dimensions of channel, height, and width, respectively. A set of patch-level features {P _i , _xi , _yi } with a size of d _x ×d _y and a stride of _Sp is extracted from F. The total number of patch-level features is:

如附图2所示，初始高效注意力模块EMSA操作与多头自注意力MSA一样经过卷积操作得到Q，即高效注意力模块EMSA将P_i进行卷积操作得到Q；为了压缩内存，将P_i记作x，二维输入标记被重塑成沿空间维度的三维标记并输入到深度卷积操作，以减小高度H和宽度W维度；s为缩减因子，s根据特征图大小或阶段数来自适应设置，核大小、步幅和填充分别为s+1、s和s/2；经过空间缩减的新标记映射被重塑为二维标记，得其中n'＝h/s×w/s；然后被送入两组投影操作以获取键K和值V。As shown in Figure 2, the initial efficient attention module EMSA operation is the same as the multi-head self-attention MSA, and the convolution operation is performed to obtain Q. That is, the efficient attention module EMSA performs a convolution operation on _Pi to obtain Q. In order to compress the memory, _Pi is recorded as x, and the two-dimensional input mark Reshaped into a three-dimensional marker along the spatial dimension And input to the depth convolution operation to reduce the height H and width W dimensions; s is the reduction factor, s is adaptively set according to the feature map size or the number of stages, the kernel size, stride and padding are s+1, s and s/2 respectively; the new label mapping after spatial reduction is reshaped into a two-dimensional mark, where n' = h/s x w/s; then is fed into two sets of projection operations to obtain the key K and value V.

采用EMSA方程来计算在查询Q、键K和值V上的注意力函数；EMSA方程如下：The EMSA equation is used to calculate the attention function on the query Q, key K and value V; the EMSA equation is as follows:

式中，Conv(·)代表一个标准的1×1卷积操作，主要作用是模仿不同注意力头部之间的交互作用；Where Conv(·) represents a standard 1×1 convolution operation, which mainly simulates the interaction between different attention heads;

上述高效注意力模块EMSA的设置，这种设计使得每个注意力头的函数能够同时依赖于所有键Key和查询Query；这种方法可能会在一定程度上削弱多头自注意力MSA机制同时处理来自不同位置的多样化表示子集的能力，为了弥补这一不足，并恢复头部间的多样性；因此，在应用Softmax函数后的点积矩阵中引入了实例归一化InstanceNormalization，标记为IN(·)，这不仅提高了模型对不同特征子集的敏感度，还增强了整体的注意力分布，有效地平衡了注意力机制的集中与分散；最后将每个头的输出值被连接起来并线性投影，以形成最终的输出。The setting of the above efficient attention module EMSA is designed so that the function of each attention head can depend on all keys and queries at the same time. This method may weaken the ability of the multi-head self-attention MSA mechanism to simultaneously process diverse representation subsets from different positions to some extent. In order to make up for this deficiency and restore the diversity among heads, instance normalization is introduced in the dot product matrix after applying the Softmax function, marked as IN(·), which not only improves the sensitivity of the model to different feature subsets, but also enhances the overall attention distribution, effectively balancing the concentration and dispersion of the attention mechanism. Finally, the output values of each head are connected and linearly projected to form the final output.

所述EMSA方程计算成本的值S_t是：The EMSA equation calculates the cost value of _St is:

假设s＞1，EMSA的计算成本远低于原始MSA，特别是在较低阶段，s通常更高。Assuming s > 1, the computational cost of EMSA is much lower than that of the original MSA, especially at lower stages where s is usually higher.

如附图4所示，所述步骤六中，将步骤二中提取的显著性图和步骤五中得到的局部特征进行特征融合，获得融合颜色特征的图像显著性图M；计算图像像素显著值S(I_k)和m(I_k)之间的残差值，残差值用于评估两个显著性图之间的一致性；通过这种方法，能够获得一个融合了颜色特征的更加准确的图像显著性图M，某一像素I_k的两个图像显著值S(I_k)和m(I_k)融合得到的图像显著性图如下式所示：As shown in FIG. 4 , in step 6, the saliency map extracted in step 2 is And the local features obtained in step 5 Perform feature fusion to obtain an image saliency map M of fused color features; calculate the residual value between the image pixel saliency value S(I _k ) and m(I _k ), and the residual value is used to evaluate the consistency between the two saliency maps; through this method, a more accurate image saliency map M fused with color features can be obtained. The image saliency map obtained by fusion of the two image saliency values S(I _k ) and m(I _k ) of a certain pixel I _k is shown in the following formula:

上述公式在于衡量某一像素I_k的两个显著值S(I_k)和m(I_k)之间的差异；当某一像素的这两个显著值相近时，表明无论是从基于深度学习的高层特征，还是从基于颜色信息的图像底层特征的角度来看，某一像素都显示出显著性；经过融合处理的显著值会接近于1，反映出高度的显著性；特征融合图如附图4所示。The above formula is to measure the difference between the two salient values S(I _k ) and m(I _k ) of a certain pixel I _k ; when the two salient values of a certain pixel are similar, it indicates that the certain pixel shows significance from the perspective of both high-level features based on deep learning and low-level features of the image based on color information; the salient value after fusion processing will be close to 1, reflecting a high degree of significance; the feature fusion diagram is shown in Figure 4.

基于高效自注意力机制的跨模态地点识别模型的测试流程：The testing process of the cross-modal place recognition model based on the efficient self-attention mechanism:

S1、输入查询数据集和图库数据集，进入S2；S1, input query dataset and gallery dataset, and enter S2;

S2：利用训练过后得到的模型，对S1输入的查询数据集和图库数据集的所有地点图像进行整体特征提取和局部特征提取，进入S3；S2: Using the trained model, perform overall feature extraction and local feature extraction on all location images of the query dataset and gallery dataset input in S1, and enter S3;

S3：将查询数据集整体特征和图库数据集整体特征进行相似度匹配，进入S4；S3: perform similarity matching between the overall features of the query dataset and the overall features of the gallery dataset, and proceed to S4;

S4：通过整体特征的匹配得到的候选排名，再通过局部特征进行重排序，进入S5；S4: The candidate ranking is obtained by matching the overall features, and then re-ranked by local features, and then enters S5;

S5：根据相似度的高低，得出查询数据集内的每张地点图像与图库数据集的匹配结果，进入S6；S5: According to the similarity, the matching result of each location image in the query data set and the library data set is obtained, and then S6 is entered;

S6：计算模型的大小以及特征匹配的推理时间，并结束。S6: Calculate the model size and the inference time of feature matching, and end.

测试流程中的S1步骤内的查询数据集表示需要查询的地点图像，而图库数据集表示待查询集匹配的地点图像的集合；在测试流程的S5步骤中，查询数据集中的每张图像都会与图库集中的若干图像进行匹配。本发明使用召回率Recall@N曲线来对地点识别算法的性能指标进行评价，并使用M和ms为单位对本发明所提出的EMSA的模型大小和推理时间进行评价。The query data set in step S1 of the test flow represents the location images to be queried, and the gallery data set represents the set of location images to be matched by the query set; in step S5 of the test flow, each image in the query data set will be matched with several images in the gallery set. The present invention uses the Recall@N curve to evaluate the performance index of the location recognition algorithm, and uses M and ms as units to evaluate the model size and inference time of the EMSA proposed by the present invention.

准确率通用的评价标准有Top-1，Top-5，Top-10等；本发明在广泛应用于跨模态地点识别任务的公开数据集KAIST上进行了模型的训练和评估；与现有的SCHAL-Net技术相比，在红外图像查询RGB图像阈值为10米的模式下，本发明的准确率在Top-10方面提高了1.8％；而在RGB图像查询红外图像阈值为10米的模式下，本发明的准确率在Top-1和Top-10方面分别提高了1.5％和1.7％。在模型大小上本发明所提出的模型比SCHAL-Net技术减少了1M，推理时间提高了7ms；整体相较于现有的SCHAL-Net技术准确率更高，并在保证准确率的同时提高地点识别的识别效率。The common evaluation criteria for accuracy are Top-1, Top-5, Top-10, etc.; the present invention trains and evaluates the model on the public dataset KAIST, which is widely used in cross-modal place recognition tasks; compared with the existing SCHAL-Net technology, in the mode where the infrared image queries the RGB image with a threshold of 10 meters, the accuracy of the present invention is improved by 1.8% in Top-10; and in the mode where the RGB image queries the infrared image with a threshold of 10 meters, the accuracy of the present invention is improved by 1.5% and 1.7% in Top-1 and Top-10 respectively. In terms of model size, the model proposed by the present invention is 1M smaller than the SCHAL-Net technology, and the inference time is improved by 7ms; overall, the accuracy is higher than the existing SCHAL-Net technology, and the recognition efficiency of place recognition is improved while ensuring the accuracy.

实施例Example

为评估所提出网络模型的性能，本发明在广泛应用于跨模态地点识别任务的知名公开数据集KAIST上进行了模型的训练和评估；本发明的实验评估了RGB图像检索红外图像、红外图像检索RGB图像这两种检索模式；为了公平起见，选择了各种基线方法进行比较，以验证本发明EMSA模型的有效性。本发明提出的跨模态地点识别算法与其他几种算法的对比分析，包括SCHAL-Net算法、DOLG、Patch-NetVLAD、MixVPR等在KAIST公开数据集上的性能表现果；为了全面评估这些算法，本发明采用了Recall@N作为评价指标，其中N的取值分别为1、5和10；如下表1所示：To evaluate the performance of the proposed network model, the present invention trained and evaluated the model on the well-known public dataset KAIST, which is widely used in cross-modal place recognition tasks; the present invention's experiments evaluated the two retrieval modes of RGB image retrieval of infrared images and infrared image retrieval of RGB images; for the sake of fairness, various baseline methods were selected for comparison to verify the effectiveness of the EMSA model of the present invention. The cross-modal place recognition algorithm proposed in the present invention is compared with several other algorithms, including the performance results of SCHAL-Net algorithm, DOLG, Patch-NetVLAD, MixVPR, etc. on the KAIST public dataset; in order to comprehensively evaluate these algorithms, the present invention uses Recall@N as the evaluation index, where the values of N are 1, 5 and 10 respectively; as shown in Table 1 below:

表1本发明在阈值为10米在KAIST数据集中与现有算法对比实验结果Table 1 Experimental results of the present invention compared with the existing algorithms in the KAIST dataset when the threshold is 10 meters

通过对比实验，本发明提出的地点识别算法在KAIST数据集上表现尤为突出，无论是在RGB图像查询红外图像，还是红外图像查询RGB图像的场景中，都显著优于原有算法模型；具体而言，在RGB图像查询红外图像的情况下，可以观察到Top-5召回率的最大提升，达到了1.9％；而在红外图像查询RGB图像的场景中，Top-1召回率的提升最为显著，达到了1.5％；充分证明了本发明改进后算法模型的有效性和优越性。Through comparative experiments, the place recognition algorithm proposed in the present invention performs particularly well on the KAIST dataset, and is significantly better than the original algorithm model in the scenarios of RGB images querying infrared images or infrared images querying RGB images. Specifically, in the case of RGB images querying infrared images, the maximum improvement in Top-5 recall rate can be observed, reaching 1.9%; and in the scenario of infrared images querying RGB images, the improvement in Top-1 recall rate is the most significant, reaching 1.5%. This fully proves the effectiveness and superiority of the improved algorithm model of the present invention.

本发明提出的算法进行了细致的对比分析，特别关注运行效率和模型大小，并与其他现有方法进行了比较；为了确保比较的公平性，所有实验均在相同的硬件配置下进行；如下表2所示The algorithm proposed in this paper is subjected to a detailed comparative analysis, with particular attention paid to the running efficiency and model size, and compared with other existing methods; to ensure the fairness of the comparison, all experiments are carried out under the same hardware configuration; as shown in Table 2 below

表2本发明与不同算法运行效率与模型大小对比Table 2 Comparison of operating efficiency and model size between the present invention and different algorithms

根据表3所示数据，可以明显看出本发明提出的算法在模型大小和推理时间上相比于原有模型均有显著的改进；这一进步主要得益于提出的注意力算法所带来的优化效果。相较于其他当前流行的地点识别算法，本发明改进后算法模型在模型体积和推理速度方面也表现出明显的优越性。具体而言，与DOLG相比EMSA模型在内存上减少了约2M，在推理时间上节省了约5ms。当EMSA算法与在识别效果上表现最佳的MixVPR算法进行比较时，EMSA模型大小约为其一半，而推理时间却快了16毫秒；结果不仅突显了本发明所提出的算法的高效性，更是凸显了其在节约资源和提高运行速度方面的显著优势，这对于实际应用场景中的算法部署具有重要的意义。总体而言，本发明所提出的算法不仅在性能上表现出色，同时在运行效率和资源消耗方面也展现了优异的平衡。According to the data shown in Table 3, it can be clearly seen that the algorithm proposed in the present invention has significant improvements in model size and reasoning time compared with the original model; this progress is mainly due to the optimization effect brought by the proposed attention algorithm. Compared with other currently popular place recognition algorithms, the improved algorithm model of the present invention also shows obvious superiority in model volume and reasoning speed. Specifically, compared with DOLG, the EMSA model reduces about 2M in memory and saves about 5ms in reasoning time. When the EMSA algorithm is compared with the MixVPR algorithm, which performs best in recognition effect, the EMSA model size is about half of that, while the reasoning time is 16 milliseconds faster; the results not only highlight the efficiency of the algorithm proposed in the present invention, but also highlight its significant advantages in saving resources and improving running speed, which is of great significance for the deployment of algorithms in practical application scenarios. In general, the algorithm proposed in the present invention not only performs well in performance, but also shows an excellent balance in terms of operating efficiency and resource consumption.

基于本发明提出一种实施场景，城市安全监控系统通过地点识别技术自动分析街道上的活动，以便及时响应潜在的安全威胁；包括：Based on the present invention, an implementation scenario is proposed, in which a city security monitoring system automatically analyzes activities on the street through location recognition technology so as to respond to potential security threats in a timely manner; including:

在关键区域安装高分辨率监控彩色和红外摄像头，确保能够清晰捕捉到该区域内的活动；配置摄像头与中央监控系统的连接，以便视频数据能够被实时传输和分析。在监控系统中集成地点识别技术；可能包括配置图像识别软件，用于分析摄像头捕获的图像；系统需要先进行“学习”或“训练”，以识别特定场景中的常规活动和潜在的异常行为。系统对实时视频进行分析，识别和跟踪在监控区域内移动的个体；使用地点识别技术分析个体的行为模式，比如行走路线、停留时间和与其他个体的互动。然后，系统根据预设的安全参数来识别任何异常或可疑行为，如未授权的入侵、异常聚集或潜在的犯罪活动；系统可以通过学习历史数据来提高对异常行为的识别准确性。最后，一旦识别出异常行为，系统将自动向安全人员发送警报，附上相关的视频片段和具体位置信息。Install high-resolution surveillance color and infrared cameras in key areas to ensure that activities in the area can be clearly captured; configure the connection between the cameras and the central monitoring system so that video data can be transmitted and analyzed in real time. Integrate location recognition technology into the monitoring system; may include configuring image recognition software to analyze images captured by the camera; the system needs to be "learned" or "trained" first to recognize routine activities and potential abnormal behaviors in specific scenes. The system analyzes real-time video to identify and track individuals moving in the monitoring area; uses location recognition technology to analyze individual behavior patterns, such as walking routes, dwell time, and interactions with other individuals. The system then identifies any abnormal or suspicious behavior, such as unauthorized intrusions, abnormal gatherings, or potential criminal activities, based on preset security parameters; the system can improve the accuracy of identifying abnormal behavior by learning from historical data. Finally, once abnormal behavior is identified, the system will automatically send an alert to security personnel, with relevant video clips and specific location information.

上述仅为本发明的较佳的优选实施例的描述，本技术领域的普通技术人员根据上述揭示内容，在不脱离上述基本原理内容的情况下做出若干修改和优化，这些改进和优化应当视为本发明的所理解的保护范围。The above is only a description of the preferred embodiments of the present invention. Ordinary technicians in this technical field can make several modifications and optimizations based on the above disclosure without departing from the above basic principles. These improvements and optimizations should be regarded as the understood protection scope of the present invention.

Claims

1.A cross-modal location identification method based on high-efficiency self-attention is characterized in that: the method comprises the following steps:

Step one, inputting an RGB image and an infrared image into a double-flow network;

Performing quantization operation on an input image, calculating color distances of an RGB image and an infrared image and color distances of an image before quantization and an image after quantization of each mode, and extracting image significance characteristics by adopting a color contrast method;

Inputting the images into ResNet networks, and extracting unique features from the images of each mode;

Fourthly, performing high-dimensional feature mapping on the features output by the ResNet network in the third stage through a high-dimensional feature mapping module, and outputting final overall features;

fifthly, performing image block processing on the features output by the ResNet network in the third stage to obtain local sharing features, and processing the local sharing features through an efficient self-attention module EMSA to output the local features;

step six, fusing the image salient features obtained in the step two with the local features obtained in the step five to obtain final local features;

step seven, carrying out overall feature and local feature cooperative constraint on the final overall feature and the final local feature;

step eight, if the number of training rounds reaches the designated number, finishing training to obtain a cross-modal place recognition model based on the high-efficiency self-attention mechanism; otherwise, returning to the first step to finish training continuously.

2. The method for identifying the cross-modal location based on the high-efficiency self-attention as claimed in claim 1, wherein the method comprises the following steps: let RGB site images total n, infrared site images total m, RGB mode sample expressed asThe infrared mode sample is expressed asWherein, Represents the ith RGB place image,Represents the j-th infrared location image,AndRespectively representAndA corresponding landmark;

in the second step, the image is defined as:

I(x,y)＝{I_L(x,y),I_a(x,y),I_b(x,y)}

after normalization, I _L(x,y),I_a(x,y),I_b (x, y) is more than or equal to 0 and less than or equal to 1;

The input RGB image and the infrared image are quantized to obtain quantized two color images f _i and g _i, and the difference between the quantized two color images f _i and g _i is evaluated through a color distance d, wherein the color distance d is expressed as follows:

where L represents the brightness of the picture, a represents the green to red component, and b represents the blue to yellow component.

3. A method of cross-modal location identification based on efficient self-attention as recited in claim 2 wherein: let the number of palette colors after color quantization be K, find K colors in color space:

(C_1L,C_1a,C_1b),(C_2L,C_2a,C_2b),...,(C_KL,C_Ka,C_Kb)

wherein, C _KL,C_Ka,C_Kb is more than or equal to 0 and less than or equal to 1;

Calculating the color distance of the pre-quantization RGB image and the quantized image g _i, and calculating the color distance of the pre-quantization infrared image and the quantized image f _i; taking a color image as an example, an image before quantization is set And the color distance d of the quantized image g _i satisfies the formula:

Wherein x is not less than 1 and not more than W, y is not less than 1 and not more than H, j is not less than 1 and not more than K, W and H are the width and the height of the image; after quantization, make Minimum; the quantized result is processed by using a color contrast method to obtain the significance characteristics of each pixel color in the image.

4. A method of cross-modal location identification based on efficient self-attention as recited in claim 3 wherein: selecting Lab color space to calculate color contrast, and quantifying the saliency of a specific pixel by comparing the color of the pixel with the colors of other pixels in the image; obtaining the significance value of each pixel color in the image, and setting a certain element in the image asSignificance value:

because the image is subjected to color quantization processing, the color of each pixel is determined through the color histogram feature of the image, and a corresponding quantized color value is obtained; the optimization method is provided, wherein the saliency value related to the color corresponding to each pixel is only calculated, and the saliency value after optimization is obtained:

Wherein c _l is the quantized color value of pixel M, f _j is the frequency of occurrence of color c _l in the image, α _j is the quantized value of color c _l, and K is the quantized color number;

And extracting the image saliency characteristics in the image according to each pixel in the image and the corresponding color-related saliency value of each pixel.

5. The method for identifying the cross-modal location based on the high-efficiency self-attention as claimed in claim 1, wherein the method comprises the following steps: in the fifth step, the third phase output in ResNet network is characterized byAndExtracting local shared features in the form of feature blocksAndLocal sharing featuresAndLocal feature acquisition via high-efficiency attention module EMSA

Let F be one of the global feature candidate descriptors:

Wherein, C, H, W represent three dimensions of channel, height and width respectively, a group of patch level features { P _i,x_i,y_i } with steps S _p and d _x×d_y are extracted from F, and the total number of patch level features is:

Where P _i represents the patch level feature set, (x _i,y_i) represents the coordinates of the center of the patch level feature on the feature map;

The high-efficiency attention module EMSA carries out convolution operation on P _i to obtain Q, and marks P _i as x and two-dimensional input marks Three-dimensional markers reshaped into a dimension along spaceThe data are input into a depth convolution operation, so that the height H dimension and the width W dimension are reduced; s is a reduction factor, s is adaptively set according to the size or the stage number of the feature map, and the kernel size, the stride and the filling are s+1, s and s/2 respectively; new marker mapping with spatial reductionRemodelling into two-dimensional marks to obtainWherein n' =h/s×w/s; Is fed into two sets of projection operations to obtain the key K and the value V.

6. The method for identifying the cross-modal location based on the high-efficiency self-attention as recited in claim 5, wherein: employing the EMSA equation to calculate the attention function on query Q, key K, and value V; the EMSA equation is as follows:

where Conv (-) represents a standard 1X 1 convolution operation that mimics the interaction between different attention headers;

The efficient attention module EMSA is arranged, so that the function of each attention head can depend on all Key keys and Query at the same time; an example normalization Instance Normalization is introduced into the dot product matrix after the application of the Softmax function, and is marked as IN (; the output values of each head are connected and linearly projected to form the final output.

7. The method for identifying the cross-modal location based on the high-efficiency self-attention as recited in claim 6, wherein: the value S _t of the EMSA equation computation cost is:

8. The method for identifying the cross-modal location based on the high-efficiency self-attention as claimed in claim 1, wherein the method comprises the following steps: in the sixth step, the saliency map extracted in the second step is displayed And the local features obtained in step fiveFeature fusion is carried out, and an image saliency map M with fused color features is obtained:

The above formula is to measure the difference between two significant values S (I _k) and m (I _k) for a certain pixel I _k.