CN116469020A - Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance - Google Patents
Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance
- Publication number
- CN116469020A (application number CN202310402925.8A)
- Authority
- CN
- China
- Prior art keywords
- gaussian
- target
- image
- scale
- wasserstein distance
- Prior art date
- 2023-04-17
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/17—Terrestrial scenes taken from planes or by drones
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance, relating to the technical field of aerial image processing. Combining low-level/high-level feature fusion with a scale-insensitive similarity metric, the method comprises the following steps: S1: build a UAV image target dataset and preprocess the image data; S2: slice the input image and concatenate the slices; S3: fuse multi-scale pooled information to enrich the receptive field of the features; S4: introduce the NWD metric based on the Gaussian Wasserstein distance; S5: for UAV images in the test set that contain small targets, predict targets with the trained, improved feature extraction network. By the above method, the invention improves small-target detection accuracy, adapts deep detection algorithms designed for conventional-scale targets, and achieves effective detection of limited-pixel targets with high precision and recall.
Description
Technical Field

The invention relates to the technical field of aerial image processing, and in particular to a UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance.
Background

A limited-pixel target in a UAV image is a target that occupies only a few pixels of the image. Under long-range imaging conditions, especially when a medium- or high-altitude UAV observes the ground at a long-range oblique angle, ground targets occupy few pixels in the image. Using computers to analyze and process UAV image data effectively, recognizing targets of different categories and marking their locations, is one of the fundamental problems of computer vision; it is widely applied in the military, agriculture and forestry, maritime affairs, disaster prevention and relief, urban planning, and other fields, which places higher demands on target detection in UAV images.

Detecting small targets against complex backgrounds is an important research direction in image analysis and processing. Compared with natural-scene images, UAV images are characterized by high background complexity, small target size, and weak features because of the long imaging distance. The imaging environment is also complex: weather, platform speed, altitude, and stability vary widely, so UAV images suffer from low resolution, low color saturation, and environmental noise distortion, all of which increase the difficulty of target detection.

Existing target detection algorithms fall into two categories: those based on traditional image processing and those based on deep learning. Methods based on traditional image processing are mostly applied to infrared dim-and-small target detection; by introducing a visual attention mechanism, they exploit the differences among target, background, and noise to selectively locate regions of interest. However, hand-crafted features lack representational power and are easily disturbed by complex backgrounds, so these methods cannot be applied directly to UAV image target detection. Detection algorithms based on deep neural networks perform well on conventional datasets but achieve low accuracy on small targets, because convolutional neural networks are generally built from stacked convolution and pooling layers: as the network deepens, the spatial resolution of the feature maps decreases and the information carried by a small target shrinks further, making it difficult to detect.

It is therefore necessary to provide a UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance to solve the above problems.
Summary of the Invention

The purpose of the present invention is to provide a UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance that improves small-target detection accuracy, adapts deep detection algorithms designed for conventional-scale targets, and achieves effective detection of limited-pixel targets with high precision and recall.

To achieve the above purpose, the present invention provides a UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance, comprising the following steps:

S1: Build a UAV image target dataset and preprocess the image data;

S2: Slice the input image and concatenate the slices;

S3: Fuse multi-scale pooled information to enrich the receptive field of the features;

S4: Introduce the NWD metric based on the Gaussian Wasserstein distance;

S5: For UAV images in the test set that contain small targets, predict targets with the trained, improved feature extraction network.
Preferably, in step S1, the original images are cut, with overlap, into a uniform size of 800×800 pixels; the targets are determined according to their frequency of occurrence and size in the images; images are selected according to the proportion of the image occupied by targets; samples of X categories are taken as the training set, and samples of the remaining categories as the test set.

Preferably, in step S2, the slicing operation uses a Focus structure for downsampling: a high-resolution image is split into several low-resolution images, preserving the feature information of small targets.

Preferably, in step S3, an SPP module is introduced before the last convolutional layer of the backbone network to fuse feature information at different scales.
Preferably, in step S4, the NWD metric is designed by modeling a bounding box as a two-dimensional Gaussian distribution. For a horizontal bounding box, its inscribed ellipse is given by:

$$\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} = 1$$

where (μ_x, μ_y) are the coordinates of the ellipse center and σ_x, σ_y are the semi-axis lengths along the x and y axes, with μ_x = c_x, μ_y = c_y, σ_x = w/2, σ_y = h/2.

Preferably, in step S4, the probability density function of the two-dimensional Gaussian distribution is:

$$f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{\exp\!\big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\big)}{2\pi\,\lvert\boldsymbol{\Sigma}\rvert^{1/2}}$$

where x, μ, and Σ denote the coordinates, the mean vector, and the covariance matrix of the Gaussian distribution, respectively.
Preferably, in step S4, when (x−μ)ᵀΣ⁻¹(x−μ) = 1, the horizontal bounding box R = (c_x, c_y, w, h) is modeled as a two-dimensional Gaussian distribution N(μ, Σ), where:

$$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix}$$

The similarity between two bounding boxes is thereby converted into a distance between two Gaussian distributions. For two two-dimensional Gaussian distributions μ₁ = N(m₁, Σ₁) and μ₂ = N(m₂, Σ₂), the second-order Wasserstein distance between μ₁ and μ₂ simplifies to:

$$W_2^2(\mu_1,\mu_2) = \lVert m_1 - m_2 \rVert_2^2 + \big\lVert \Sigma_1^{1/2} - \Sigma_2^{1/2} \big\rVert_F^2$$

where ‖·‖_F denotes the Frobenius norm.

For Gaussian distributions N_a and N_b modeled from bounding boxes A = (cx_a, cy_a, w_a, h_a) and B = (cx_b, cy_b, w_b, h_b), this further simplifies to:

$$W_2^2(N_a,N_b) = \left\lVert \left[ cx_a,\ cy_a,\ \tfrac{w_a}{2},\ \tfrac{h_a}{2} \right]^{\mathsf T} - \left[ cx_b,\ cy_b,\ \tfrac{w_b}{2},\ \tfrac{h_b}{2} \right]^{\mathsf T} \right\rVert_2^2$$

Its exponential normalization is used as the similarity measure between the two bounding boxes:

$$\mathrm{NWD}(N_a,N_b) = \exp\!\left( -\frac{\sqrt{W_2^2(N_a,N_b)}}{C} \right)$$
where C is the average absolute size of the targets in the dataset. Under the IoU metric, as target size decreases, the drop in the metric caused by a given position offset becomes larger.
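As a quick worked check with illustrative numbers (not taken from the patent), consider two concentric boxes A = (50, 50, 8, 8) and B = (50, 50, 4, 4), with C = 12.8 (the average target size of the AI-TOD data used below):

$$W_2^2 = 0 + 0 + \left(\tfrac{8-4}{2}\right)^2 + \left(\tfrac{8-4}{2}\right)^2 = 8, \qquad \mathrm{NWD} = \exp\!\left(-\tfrac{\sqrt{8}}{12.8}\right) \approx 0.80,$$

so two boxes that differ only in scale still receive a smoothly graded similarity, rather than the sharp drop the IoU would report (here the IoU is 0.25).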
Preferably, in step S4, the loss function is a weighted combination of the object confidence loss, the classification loss, and the bounding box regression loss. The object confidence loss and the classification loss use binary cross-entropy, and the bounding box regression loss is the normalized weighted sum of the CIoU loss and the NWD loss between the predicted and ground-truth bounding boxes. The loss function is:

$$\mathrm{Loss} = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3\big[\alpha L_{CIoU} + (1-\alpha) L_{NWD}\big]$$

$$L_{NWD} = 1 - \mathrm{NWD}(N_p, N_g)$$

where NWD(N_p, N_g) denotes the exponentially normalized Wasserstein distance between the predicted box and the ground-truth box.

Preferably, in step S5, AP50, AP75, and mAP are used as evaluation metrics to assess algorithm performance, the improved feature extraction network is tested on the test dataset, and the impact of introducing the NWD metric on model performance is analyzed.
Therefore, the present invention, by adopting the above UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance, has the following beneficial effects:

(1) Through a multi-scale feature extraction module, the invention uses a bidirectional feature pyramid network (BiFPN) in the Neck to fuse low-level and high-level features bidirectionally, enriching the feature representation of limited-pixel targets.

(2) By fusing the spatiotemporal information of multiple image frames, the invention improves detection recall.

(3) By extracting and combining multiple visual image features, the invention makes the detection results reliable.

(4) In the non-maximum suppression stage and the bounding box regression loss, the invention uses the scale-insensitive normalized Gaussian Wasserstein distance metric to evaluate the similarity between predicted and ground-truth boxes, improving small-target detection accuracy.

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings

Figure 1 is a flowchart of the UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance of the present invention;

Figure 2 is a schematic diagram of the Focus structure used in the present invention;

Figure 3 is a structural diagram of the SPP module used in the present invention;

Figure 4 is a schematic diagram of the position-offset curves under the Gaussian Wasserstein distance-based NWD metric used in the present invention.
Detailed Description

The technical solution of the present invention is further described below through the accompanying drawings and embodiments.

Unless otherwise defined, technical or scientific terms used herein shall have the meanings commonly understood by those of ordinary skill in the art to which the invention belongs. "First", "second", and similar words used herein do not denote any order, quantity, or importance, but merely distinguish different components. "Comprising", "including", and similar words mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. "Connected", "coupled", and similar words are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like indicate only relative positional relationships, which may change accordingly when the absolute position of the described object changes.
The present invention adopts the above UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance, combining low-level/high-level feature fusion with a scale-insensitive metric, and comprising the following steps: S1: build a UAV image target dataset and preprocess the image data; S2: slice the input image and concatenate the slices; S3: fuse multi-scale pooled information to enrich the receptive field of the features; S4: introduce the NWD metric based on the Gaussian Wasserstein distance; S5: for UAV images in the test set that contain small targets, predict targets with the trained, improved feature extraction network.
In step S1, the original images are cut, with overlap, into a uniform size of 800×800 pixels; targets are determined according to their frequency of occurrence and size in the images; images are selected according to the proportion of the image occupied by targets; samples of X categories are taken as the training set, and samples of the remaining categories as the test set.

The dataset is built on AI-TOD, which is collected from several large-scale public aerial remote sensing image datasets such as DIOR, DOTA, xView, and VisDrone, and constitutes a limited-pixel small-target dataset for UAV images. According to the frequency and size of the targets appearing in the images, the main target categories are aircraft, ships, vehicles, and people.
The original images are cut, with overlap, into a uniform size of 800×800 pixels, and images containing targets larger than 64 pixels are excluded according to the proportion of the image occupied by targets. The dataset contains 28,036 images and 700,621 target instances in total; the average target size is 12.8 pixels with a variance of 5.9 pixels, far smaller than in other remote sensing datasets. Samples of X categories are taken as the training set, and samples of the remaining categories as the test set.
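As an illustrative sketch of the overlapped cropping (the patent specifies the 800×800 tile size but not the overlap, so the 200-pixel overlap below is an assumed value):

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 800, overlap: int = 200):
    """Cut an (H, W, C) image into overlapping tile x tile crops.

    The stride is tile - overlap; the final row/column of tiles is shifted
    back so that every crop stays inside the image bounds.
    """
    h, w = img.shape[:2]
    stride = tile - overlap
    ys = list(range(0, max(h - tile, 0) + 1, stride))
    xs = list(range(0, max(w - tile, 0) + 1, stride))
    if ys[-1] + tile < h:
        ys.append(h - tile)   # extra row flush with the bottom edge
    if xs[-1] + tile < w:
        xs.append(w - tile)   # extra column flush with the right edge
    return [(y, x, img[y:y + tile, x:x + tile]) for y in ys for x in xs]
```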
In step S2, the slicing operation uses a Focus structure for downsampling, splitting a high-resolution image into several low-resolution images while preserving the feature information of small targets. Focus is a special downsampling method; the specific operation is shown in Figure 2: pixel values are taken at an interval of one pixel and merged into a low-resolution image, and the number of channels becomes four times the original. By splitting the high-resolution image into multiple low-resolution images, the width and height information is concentrated into the channel dimension, which reduces computation while avoiding the information loss of ordinary downsampling, retains more feature information of small targets, and helps speed up network training and inference.
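A minimal PyTorch sketch of this slicing idea follows (the 1×1 fusion convolution and the channel counts are illustrative assumptions, not values specified by the patent):

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice an image into four pixel-interleaved sub-images, stack them on
    the channel axis (4x channels, half resolution), then fuse with a conv."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels * 4, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with even H and W
        sliced = torch.cat(
            [x[..., ::2, ::2],     # even rows, even columns
             x[..., 1::2, ::2],    # odd rows, even columns
             x[..., ::2, 1::2],    # even rows, odd columns
             x[..., 1::2, 1::2]],  # odd rows, odd columns
            dim=1)                 # (B, 4C, H/2, W/2)
        return self.conv(sliced)
```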
In step S3, an SPP module is introduced before the last convolutional layer of the backbone network to fuse feature information at different scales. The SPP module structure is shown in Figure 3: the input features first pass through a 1×1 convolutional layer, then through max-pooling windows at three different scales (5×5, 7×7, and 13×13); the pooled features at the three scales are concatenated with the input features and passed through another 1×1 convolutional layer, finally yielding a fixed-size feature vector. By fusing multi-scale pooled information, the SPP layer enriches the receptive field of the features and enhances the representational power of the feature map.
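A sketch of such an SPP block in PyTorch, assuming the common halving of hidden channels (an implementation choice, not stated in the patent); stride-1 pooling with "same" padding keeps the spatial size so the branches can be concatenated:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pool branches at several window
    sizes, concatenated with the input features and fused by 1x1 convs."""
    def __init__(self, in_channels: int, out_channels: int,
                 pool_sizes=(5, 7, 13)):
        super().__init__()
        hidden = in_channels // 2
        self.reduce = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes)
        self.fuse = nn.Conv2d(hidden * (len(pool_sizes) + 1),
                              out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```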
In step S4: the IoU measures the degree of overlap between a predicted box and a ground-truth box and is widely used in anchor-based detection frameworks; for example, the non-maximum suppression (NMS) stage uses the IoU metric to filter out predicted boxes with a high overlap rate, and IoU-based metrics usually replace the L2 loss as the bounding box regression loss in the loss function. However, IoU-based evaluation metrics are very sensitive to position offsets of small targets: a tiny positional deviation causes the IoU to drop sharply, which degrades the performance of anchor-based detectors. The normalized Gaussian Wasserstein distance is therefore used to compute the similarity between two bounding boxes. The NWD metric is designed by modeling a bounding box as a two-dimensional Gaussian distribution; for a horizontal bounding box, its inscribed ellipse is given by:

$$\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} = 1$$

where (μ_x, μ_y) are the coordinates of the ellipse center and σ_x, σ_y are the semi-axis lengths along the x and y axes, with μ_x = c_x, μ_y = c_y, σ_x = w/2, σ_y = h/2.

The probability density function of the two-dimensional Gaussian distribution is:

$$f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{\exp\!\big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\big)}{2\pi\,\lvert\boldsymbol{\Sigma}\rvert^{1/2}}$$

where x, μ, and Σ denote the coordinates, the mean vector, and the covariance matrix of the Gaussian distribution, respectively.

When (x−μ)ᵀΣ⁻¹(x−μ) = 1, the horizontal bounding box R = (c_x, c_y, w, h) can be modeled as a two-dimensional Gaussian distribution N(μ, Σ), where:

$$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix}$$

The similarity between two bounding boxes is thereby converted into a distance between two Gaussian distributions. For two two-dimensional Gaussian distributions μ₁ = N(m₁, Σ₁) and μ₂ = N(m₂, Σ₂), the second-order Wasserstein distance between μ₁ and μ₂ simplifies to:

$$W_2^2(\mu_1,\mu_2) = \lVert m_1 - m_2 \rVert_2^2 + \big\lVert \Sigma_1^{1/2} - \Sigma_2^{1/2} \big\rVert_F^2$$

where ‖·‖_F denotes the Frobenius norm.

For Gaussian distributions N_a and N_b modeled from bounding boxes A = (cx_a, cy_a, w_a, h_a) and B = (cx_b, cy_b, w_b, h_b), this further simplifies to:

$$W_2^2(N_a,N_b) = \left\lVert \left[ cx_a,\ cy_a,\ \tfrac{w_a}{2},\ \tfrac{h_a}{2} \right]^{\mathsf T} - \left[ cx_b,\ cy_b,\ \tfrac{w_b}{2},\ \tfrac{h_b}{2} \right]^{\mathsf T} \right\rVert_2^2$$

Its exponential normalization is used as the similarity measure between the two bounding boxes:

$$\mathrm{NWD}(N_a,N_b) = \exp\!\left( -\frac{\sqrt{W_2^2(N_a,N_b)}}{C} \right)$$
where C is the average absolute size of the targets in the dataset; under the IoU metric, as target size decreases, the drop caused by a given position offset grows. As shown in Figure 4, the four curves corresponding to NWD coincide exactly, i.e., the metric is insensitive to bounding box scale; the NWD curves are also smoother and less sensitive to offset; and even when bounding box A contains bounding box B, or the two boxes have no intersection at all, the NWD metric still reflects their similarity, giving stronger robustness.
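A minimal sketch of the NWD computation for axis-aligned boxes, written directly from the simplified form above (the function name and interface are illustrative):

```python
import math

def nwd(box_a, box_b, c: float = 12.8) -> float:
    """Normalized Gaussian Wasserstein distance between two horizontal boxes
    given as (cx, cy, w, h). c is the dataset's average absolute target size
    (12.8 px for the AI-TOD data described above)."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the Gaussians N_a and N_b
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

# A 2-pixel center shift between two 4x4 boxes still scores high:
print(nwd((10, 10, 4, 4), (12, 10, 4, 4)))  # ~0.855
```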
In step S4, the loss function is a weighted combination of the object confidence loss, the classification loss, and the bounding box regression loss. The object confidence loss and the classification loss use binary cross-entropy, and the bounding box regression loss is the normalized weighted sum of the CIoU loss and the NWD loss between the predicted and ground-truth bounding boxes. The loss function is:

$$\mathrm{Loss} = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3\big[\alpha L_{CIoU} + (1-\alpha) L_{NWD}\big]$$

$$L_{NWD} = 1 - \mathrm{NWD}(N_p, N_g)$$
where NWD(N_p, N_g) denotes the exponentially normalized Wasserstein distance between the predicted box and the ground-truth box.
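The combination can be sketched as follows; the λ weights are assumed placeholders (the patent does not state their values), while α = 0.65 mirrors the NWD weight of 0.35 reported below:

```python
def detection_loss(l_cls, l_obj, l_ciou, nwd_pg,
                   lam=(0.5, 1.0, 0.05), alpha=0.65):
    """Total loss = l1*Lcls + l2*Lobj + l3*[a*LCIoU + (1-a)*LNWD].

    nwd_pg is NWD(N_p, N_g) in [0, 1], so the NWD loss is 1 - nwd_pg.
    The lam tuple holds (lambda1, lambda2, lambda3); all weights here are
    illustrative, not values taken from the patent.
    """
    l_nwd = 1.0 - nwd_pg
    l1, l2, l3 = lam
    return l1 * l_cls + l2 * l_obj + l3 * (alpha * l_ciou + (1 - alpha) * l_nwd)
```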
In step S5, AP50, AP75, and mAP are used as evaluation metrics to assess algorithm performance, the improved feature extraction network is tested on the test dataset, and the impact of introducing the NWD metric on model performance is analyzed.

AP50, AP75, and mAP are used as the model evaluation metrics. The average precision (AP) of each category is the area under the P-R curve, and mAP is the mean of the per-category average precisions. Following the COCO convention, mAP denotes the average of the ten AP values computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05, while AP50 and AP75 denote the per-category average precision computed at IoU thresholds of 0.5 and 0.75, respectively.
The algorithm is implemented in the deep learning framework PyTorch. Hardware configuration: CPU: 24-core Intel Xeon, 1.9 GHz, 64 GB RAM; GPU: GeForce RTX 3080 Ti. The official YOLOv5 pretrained model is used for parameter initialization and fine-tuned on the remote sensing image target detection dataset. The initial learning rate is set to 0.01, a warmup strategy is applied before training, and the learning rate is then decayed dynamically with cosine annealing. Each model is trained for 1000 epochs; to prevent overfitting, training stops early if the validation metrics do not improve within 100 epochs. The batch size is set to 128 for training and 1 for testing.
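A sketch of the described schedule (linear warmup, then cosine annealing); the warmup length is an assumed hyperparameter, as the patent does not specify it:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(optimizer, warmup_epochs: int, total_epochs: int) -> LambdaLR:
    """Linear warmup for warmup_epochs, then cosine decay to zero."""
    def factor(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda=factor)

# Usage, matching the settings above:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# scheduler = warmup_cosine(optimizer, warmup_epochs=3, total_epochs=1000)
```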
Multi-scale training is adopted, and the K-means algorithm automatically clusters the ground-truth bounding box labels of the dataset to generate new optimal anchor box sizes that fit targets of different scales across datasets.
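A plain k-means sketch over box sizes follows (Euclidean distance is used here for brevity; YOLO-style pipelines often cluster with a 1 − IoU distance instead, and the patent does not specify which variant is used):

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100,
                   seed: int = 0) -> np.ndarray:
    """Cluster ground-truth box sizes wh of shape (N, 2) into k anchors."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every box to its nearest center
        dist = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        new_centers = np.array([
            wh[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by area
```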
The improved feature extraction network is tested on the test dataset: a BiFPN structure connects the multi-layer feature maps bidirectionally and fuses them with dynamic weights according to feature importance, improving the representational power of the network, and an additional detection head is added for small targets. The experimental results are shown in Table 1: on the AI-TOD dataset, the mAP improves by 0.3%, APm by 2.2%, and AP75 by 1.0%.

Table 1. Performance comparison of the improved network structure
The impact of the NWD metric on model performance is analyzed as follows. In the NMS stage, replacing IoU with the scale-insensitive NWD metric effectively avoids the growth in redundant detection boxes, and hence the excessive false positive rate, caused by the IoU between a predicted box and the highest-scoring box falling below the threshold. For the bounding box regression loss, introducing the NWD loss alleviates the sensitivity of the CIoU loss to position deviations of small targets, so the network can learn and optimize better for small targets. The experimental results are shown in Table 2.

Table 2. Impact of introducing the NWD metric on detection performance

With the NWD metric introduced in the NMS stage, the mAP is 16.2%, a 1.2% improvement over the IoU metric used by YOLOv5. The NWD loss complements the CIoU loss in a normalized weighted manner; with the NWD loss weight set to 0.35, the mAP improves by 1.6% over the CIoU loss alone. The experimental results show that introducing the NWD metric in both the NMS stage and the bounding box regression loss improves small-target detection performance.
Performance is compared with other classic and state-of-the-art target detection methods on the AI-TOD dataset, using the officially provided COCOAPI-AITOD interface to ensure that the comparison is objective and credible. The results are shown in Table 3.

Table 3(1). Performance comparison of different algorithms on the AI-TOD dataset

Table 3(2). Performance comparison of different algorithms on the AI-TOD dataset
The multi-category mean average precision (mAP) of the method reaches 17.8%, with AP50 and AP75 of 41.4% and 12.4%, respectively. Compared with the YOLOv5 baseline, the mAP improves by 3.0%, AP50 by 4.6%, and AP75 by 3.3%, and all three metrics exceed those of classic anchor-based and anchor-free target detection algorithms. Compared with classic multi-stage detectors such as Faster R-CNN and Cascade R-CNN, single-stage methods such as YOLOv3, SSD, and RetinaNet have lower mAP values and poorer detection performance on small targets. The anchor-free detector CenterNet avoids the poor robustness of discrete-size anchor boxes to multi-scale targets, and the anchor-free detector based on multiple center points further improves very-small-target detection through its multi-center-point and offset-target design, achieving the highest APvt of 6.1%. The proposed method accounts for target instances of all scales across the whole dataset and shows an outstanding overall performance advantage: compared with the advanced DetectoRS algorithm, APt improves by 4.9% and APvt by 3.4%, a significant gain in detecting very small targets. The comparative results show that the proposed method outperforms several current methods on small-target detection in remote sensing images, demonstrating its effectiveness.
Therefore, the present invention adopts the above UAV image target detection method based on multi-scale features and the Gaussian Wasserstein distance, combining low-level/high-level feature fusion with a scale-insensitive metric to improve the accuracy of limited-pixel small-target detection in UAV images.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may still be modified or equivalently replaced, and such modifications or equivalent replacements do not cause the modified technical solution to depart from the spirit and scope of the technical solution of the present invention.
Claims (9)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202310402925.8A | 2023-04-17 | 2023-04-17 | Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN116469020A (en) | 2023-07-21 |
Family
ID=87183772

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202310402925.8A | Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance (CN116469020A, pending) | 2023-04-17 | 2023-04-17 |

| Country | Link |
| --- | --- |
| CN | CN116469020A (en) |
- 2023-04-17: Application filed (CN202310402925.8A); published as CN116469020A (en); status: Pending
Cited By (12)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| WO2025034719A1 (en) * | 2023-08-07 | 2025-02-13 | Epirus, Inc. | Systems and methods for object detection of unmanned aerial vehicles |
| CN116775622A (en) * | 2023-08-24 | 2023-09-19 | 中建五局第三建设有限公司 | Method, device, equipment and storage medium for generating structural data |
| CN116775622B (en) * | 2023-08-24 | 2023-11-07 | 中建五局第三建设有限公司 | Method, device, equipment and storage medium for generating structural data |
| CN117333512A (en) * | 2023-10-17 | 2024-01-02 | 大连理工大学 | Aerial small target tracking method based on detection frame tracking |
| CN117893879A (en) * | 2024-01-19 | 2024-04-16 | 安徽大学 | A method for training a model for aerial image recognition of small targets at sea in foggy scenes |
| CN118247646A (en) * | 2024-03-27 | 2024-06-25 | 武汉理工大学三亚科教创新园 | A method for target detection in ocean sonar images based on soft threshold denoising |
| CN118247646B (en) * | 2024-03-27 | 2025-02-25 | 武汉理工大学三亚科教创新园 | A method for target detection in ocean sonar images based on soft threshold denoising |
| CN118691991A (en) * | 2024-06-06 | 2024-09-24 | 滁州学院 | An intelligent detection method for agricultural waste mulch film based on UAV remote sensing images |
| CN118312640A (en) * | 2024-06-07 | 2024-07-09 | 盛视科技股份有限公司 | Feature matching-based luggage case retrieval method |
| CN118312640B (en) * | 2024-06-07 | 2024-08-27 | 盛视科技股份有限公司 | Feature matching-based luggage case retrieval method |
| CN119360007A (en) * | 2024-12-25 | 2025-01-24 | 厦门理工学院 | Micro target detection method and system based on self-adaptive Gaussian learning label distribution |
| CN119360007B (en) * | 2024-12-25 | 2025-03-18 | 厦门理工学院 | Micro target detection method and system based on self-adaptive Gaussian learning label distribution |
Similar Documents

| Publication | Title |
| --- | --- |
| CN116469020A (en) | Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance |
| CN110363182B (en) | Lane detection method based on deep learning |
| Wang et al. | FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection |
| Li et al. | A deep learning method for change detection in synthetic aperture radar images |
| CN112418117B (en) | Small target detection method based on unmanned aerial vehicle image |
| CN109117876B (en) | Dense small target detection model construction method, dense small target detection model and dense small target detection method |
| CN112132119B (en) | Passenger flow statistical method and device, electronic equipment and storage medium |
| CN107341488B (en) | An integrated method for target detection and recognition in SAR images |
| CN112836713A (en) | Identification and tracking method of mesoscale convective system based on image anchorless frame detection |
| CN106408030B (en) | SAR image classification method based on middle layer semantic attribute and convolutional neural networks |
| Xu et al. | Feature-based constraint deep CNN method for mapping rainfall-induced landslides in remote regions with mountainous terrain: An application to Brazil |
| CN114627052A (en) | Infrared image air leakage and liquid leakage detection method and system based on deep learning |
| CN108898065A (en) | Deep-network ship target detection method with fast candidate-region screening and scale adaptation |
| CN114565824B (en) | Single-stage rotating ship detection method based on full convolution network |
| CN113128564B (en) | Typical target detection method and system based on deep learning under complex background |
| CN112270285B (en) | SAR image change detection method based on sparse representation and capsule network |
| CN113221731A (en) | Multi-scale remote sensing image target detection method and system |
| CN114332473B (en) | Object detection method, device, computer apparatus, storage medium, and program product |
| Chai et al. | Enhanced cascade R-CNN for multiscale object detection in dense scenes from SAR images |
| CN112950780A (en) | Intelligent network map generation method and system based on remote sensing image |
| CN111666854A (en) | High-resolution SAR image vehicle target detection method fusing statistical significance |
| Zheng et al. | Stripe segmentation of oceanic internal waves in SAR images based on SegNet |
| CN117455868A (en) | SAR image change detection method based on significant fusion difference map and deep learning |
| CN116758421A (en) | Remote sensing image directed target detection method based on weakly supervised learning |
| CN118351435A (en) | A method and device for detecting targets in UAV remote sensing images based on lightweight model LTE-Det |
Legal Events
| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |