
CN116402859A - A Moving Target Detection Method Based on Aerial Image Sequence - Google Patents

A Moving Target Detection Method Based on Aerial Image Sequence Download PDF

Info

Publication number
CN116402859A
CN116402859A
Authority
CN
China
Prior art keywords
frame
image
detection
motion
moving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310424573.6A
Other languages
Chinese (zh)
Inventor
刘晶红
王波
朱圣杰
王宣
徐芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun Institute of Optics Fine Mechanics and Physics of CAS
Original Assignee
Changchun Institute of Optics Fine Mechanics and Physics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun Institute of Optics Fine Mechanics and Physics of CAS filed Critical Changchun Institute of Optics Fine Mechanics and Physics of CAS
Priority to CN202310424573.6A priority Critical patent/CN116402859A/en
Publication of CN116402859A publication Critical patent/CN116402859A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract


Figure 202310424573

The present invention relates to the technical field of image processing and provides a moving-target detection method based on an aerial image sequence, comprising the following steps. S1: input an aerial remote-sensing image sequence. S2: prepare an image data set for training by first cropping the images, then clustering the annotated detection boxes of the moving targets into 9 fixed anchor boxes, substituting these into the algorithm, and training to obtain the model parameters. S3: input the k-th frame into the target detection network, input the k-th and (k-n)-th frames into the motion detection network, and fuse the feature maps obtained from the two networks to produce the final network output. By feeding two adjacent frames into the network, the method obtains moving-target information directly, eliminating the background-compensation step, shortening the detection pipeline, achieving high moving-target detection accuracy, and enabling real-time detection.


Description

A Moving Target Detection Method Based on an Aerial Image Sequence

Technical Field

The invention relates to the technical field of image processing, and in particular to a moving-target detection method based on aerial image sequences.

Background Art

Aerial remote-sensing image sequences are generally acquired from moving-base platforms such as drones or balloons, whose flight altitude usually ranges from hundreds of meters to tens of kilometers. Moving-target detection means detecting the changing regions in a sequence of images and extracting the moving targets from the background; the extracted information provides candidate regions for subsequent tasks such as target recognition, tracking, and behavior analysis. Locating moving targets in remote-sensing image sequences accurately and quickly is of great significance in both civilian and military fields: in civilian applications it supports intelligent control, human-computer interaction, and visual navigation; in military applications it supports target surveillance, long-range early warning, and precision guidance. At present, moving targets in aerial remote-sensing sequences are often identified by manual interpretation, which leads to low data utilization and poor timeliness of intelligence, and is easily affected by the operator's physical condition, state of mind, and subjective judgment.

To avoid the influence of manual interpretation on moving-target detection results and reduce labor costs, efficient and accurate automatic moving-target detection for aerial remote-sensing images is particularly important. Compared with moving-target detection from a fixed platform, detection in aerial image sequences presents the following main difficulties:

1. Moving-platform imaging: UAV mobile monitoring, airborne imaging detection, and similar applications all image from a moving platform, so the background no longer remains constant in space and time; the moving background produces apparent (false) motion. Reasonably distinguishing background motion from target motion is the key problem to be solved in moving-target detection under platform motion.

2. Complex background: aerial remote-sensing images contain many objects with the same or similar appearance as the moving targets, and aerial photography is easily affected by weather such as clouds and fog, so the influence of complex meteorological conditions must be considered.

3. Dynamic background: besides the moving targets of interest, there are many naturally occurring distracting moving backgrounds, such as trees swaying in the wind, rippling water surfaces, and drifting clouds; these must be excluded according to the application's requirements.

4. Moving shadows: shadows are usually detected as moving targets by conventional moving-target detection techniques.

5. Real-time detection: in most cases, moving-target detection must process image data promptly, yet the more robust a method is and the better its detection results, the harder it usually is to run in real time. On the one hand, more capable detection techniques are needed that can cope with complex conditions; on the other hand, the algorithm itself and its hardware implementation must be improved and accelerated.

Meanwhile, most existing traditional methods for moving-target detection in aerial image sequences divide the task into two steps: first, the background motion produced by the moving platform is compensated to obtain a static background or an updatable background image, and then moving-target detection techniques designed for a static background are applied. Such methods depend heavily on a faithful motion model and a robust estimation of its parameters, so it is difficult to model the same pixel accurately in the time domain while the imaging platform is moving; moreover, the compensation process is extremely time-consuming and the detection results require post-processing, so real-time requirements cannot be met.

Existing deep-learning approaches to moving-target detection usually apply a general-purpose object detector directly, and thus struggle to reach high detection accuracy and fast detection speed, making them ill-suited to practical engineering projects. The network structures proposed so far are all two-stage: the moving-target detection task is split into an object detection task and a motion detection task that are performed separately, with the motion branch typically fed by separately computed frame-difference or optical-flow images; the object-detection results are then fused with the motion-detection results. No end-to-end network has yet been proposed.

In summary, an urgent open problem is how to design a deep-learning moving-target detection model that fuses multi-frame motion features with single-frame target features at the feature level of the network and outputs moving-target information directly end to end, thereby improving detection efficiency, shortening the detection pipeline, and increasing detection reliability.

Summary of the Invention

To solve the above problems, the present invention provides a moving-target detection method based on aerial image sequences, which fuses the extracted multi-frame motion features with single-frame target features at the feature level of the network and directly outputs moving-target information end to end. It requires neither prior knowledge of platform motion for motion compensation nor post-processing steps, improving both detection efficiency and reliability.

To achieve the above object, the present invention proposes the following technical solution: a moving-target detection method based on an aerial image sequence, comprising the following steps:

S1: input an aerial remote-sensing image sequence;

S2: prepare the image data set for training;

S3: input the k-th frame into the target detection network, input the k-th and (k-n)-th frames into the motion detection network, and fuse the feature maps obtained from the target detection network and the motion detection network to obtain the final network output.

Preferably, step S3 comprises the following sub-steps:

S31: extract motion information from the input k-th and (k-n)-th frames by feeding both frames into the two-dimensional-convolution-based motion feature enhancement module (MFEM), obtaining an enhanced motion feature map f64*3*152*152;

S32: input the enhanced motion feature map into the improved three-dimensional convolution module MIE-Net, extracting feature maps S76*76, S38*38, S19*19 that contain motion information at different convolution depths;

S33: input the S19*19 features obtained in S32 into the non-local module to further integrate the motion information, obtaining the integrated S'19*19;

S34: extract target-information features from the k-th frame, obtaining feature maps F76*76, F38*38, F19*19 at different convolution depths;

S35: concatenate, along the channel dimension and by matching size, the S76*76 and S38*38 obtained in S32, the S'19*19 obtained in S33, and the F76*76, F38*38, and F19*19 obtained in S34;

S36: input the three feature maps obtained in S35 into two convolution modules for decoding, normalize the sizes into the YOLO format, and obtain the final output of the network.

Preferably, the MFEM module in S31 computes the difference between the features of the two frames and uses that difference as the third dimension of the input to the three-dimensional-convolution motion-information extraction module.

Preferably, the improved three-dimensional convolution module MIE-Net in S32 decomposes one three-dimensional convolution with a (3*3*3) kernel into two one-dimensional convolutions with (3*1*1) and (1*1*1) kernels and one two-dimensional convolution with a (1*2*2) kernel.

Preferably, in S31, when the aerial image data set is small and the weight of the single-frame target-detection branch is too large, a random tenth of the positive moving-target samples in the training set are additionally fed in as the (k-n)-th frame as well; that is, when the input k-th and (k-n)-th frames are the same image, the pair is treated as a negative sample so that the motion-information extraction branch obtains a larger weight.

Preferably, when the aerial data set contains few images, an affine transformation is introduced for augmentation:

\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}

where (x, y) are the original pixel coordinates and (u, v) are the coordinates after the affine transformation.

Preferably, in step S3, when the moving distance of a target between two adjacent frames is large, n in the (k-n)-th frame is set to n < 5; when the moving distance between adjacent frames is small, n is set to n ≥ 5.

Preferably, the image data set for training in S2 is obtained as follows:

S21: crop the images to obtain the image data set for training;

S22: according to the annotated detection boxes of the moving targets in the data set, cluster the training boxes into 9 fixed anchor boxes, substitute them into the algorithm, and train to obtain the model parameters.

Preferably, the fixed anchor boxes in S22 are obtained by clustering the moving targets in the training data set with the k-means method; the 9 resulting typical target sizes are substituted into the model as fixed reference boxes.

Preferably, in S21 the images are cropped to a uniform input size of 608×608 pixels.

The beneficial effects of the present invention are:

1. The present invention fuses the extracted multi-frame motion features with single-frame target features at the feature level of the network and can directly output the category, position, and other information of moving targets end to end, without prior knowledge of platform motion for motion compensation and without post-processing steps. By directly using the information in two raw frames, the motion information extracted by the motion-information extraction network enhances the robustness of the network, yields high moving-target detection accuracy, shortens the detection pipeline, increases detection speed, and enables real-time detection.

2. The present invention allows an appropriate inter-frame interval to be set according to how fast the targets actually move, so that the network can handle both fast-moving and slow-moving targets. Based on the observation that the difference between two frames carries motion information, a temporal feature enhancement module is designed to strengthen the motion information between frames; and, exploiting the fact that three-dimensional convolution can extract motion information, the three-dimensional convolution is improved to reduce its computational cost while further extracting motion information, giving the network a better motion-information extraction effect and increasing moving-target detection accuracy.

Description of the Drawings

Fig. 1 is a flow chart of moving-target detection provided by an embodiment of the present invention.

Fig. 2 is a diagram of an application scenario provided by an embodiment of the present invention.

Fig. 3 is an overall framework diagram of the model provided by an embodiment of the present invention.

Fig. 4 is a diagram of the motion feature enhancement module provided by an embodiment of the present invention.

Fig. 5 is a diagram of the motion-information extraction network provided by an embodiment of the present invention.

Detailed Description

To make the object, technical solution, and advantages of the present invention clearer, the present invention is further described in detail below with reference to Figs. 1-5 and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

A moving-target detection method based on an aerial image sequence comprises the following steps:

S1: input an aerial remote-sensing image sequence. As shown in Fig. 2, the aerial remote-sensing images in this embodiment have complex backgrounds. The vehicles boxed at a, b, c, and d are the moving targets to be extracted; the stationary vehicles beside them share the same target (appearance) information but differ in motion information, so combining motion information with target information allows the moving-target information to be extracted. Moving targets occupy few pixels in a remote-sensing image, the image contains stationary distractors with the same target characteristics, and the background is complex. Moreover, the camera itself moves relative to the background during imaging, so the resulting aerial remote-sensing image sequence is the composition of two motions: that of the moving targets relative to the static background, and that of the camera relative to the static background.

S2: prepare the image data set for training. It is obtained as follows:

S21: crop the images to obtain the image data set for training, cutting them to a uniform input size of 608×608 pixels.

S22: according to the annotated detection boxes of the moving targets in the data set, cluster the training boxes into 9 fixed anchor boxes, substitute them into the algorithm, and train to obtain the model parameters. The fixed anchor boxes are obtained by clustering the moving targets in the training data set with the k-means method, giving 9 typical target sizes that are substituted into the model as fixed reference boxes. The proposed method is then trained on the resulting data set to obtain a convolutional neural network model capable of detecting moving targets.
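The k-means anchor clustering of step S22 can be sketched as follows. This is an illustrative NumPy implementation, not the patent's code; the synthetic box sizes are invented purely for demonstration.

```python
import numpy as np

def kmeans_anchors(boxes, k=9, iters=50, seed=0):
    """Cluster (width, height) pairs of annotated boxes into k anchor sizes."""
    rng = np.random.default_rng(seed)
    # initialize centers with k randomly chosen boxes
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each box to the nearest center (Euclidean in w-h space)
        d = np.linalg.norm(boxes[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = boxes[labels == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sort anchors by area

# synthetic boxes clustered around three size modes (illustrative only)
rng = np.random.default_rng(1)
boxes = np.vstack([rng.normal(m, 2.0, size=(100, 2)) for m in (10, 30, 60)])
anchors = kmeans_anchors(boxes, k=9)
print(anchors.shape)  # (9, 2)
```

In practice the (width, height) pairs would come from the annotated detection boxes of the 608×608 training crops; the 9 sorted cluster centers then serve as the fixed reference anchor boxes.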

S3: input the k-th frame into the target detection network, input the k-th and (k-n)-th frames into the motion detection network, and fuse the feature maps obtained from the two networks to obtain the final network output. The moving-target detection network extracts the coordinate position, length and width, and category name of each moving target. To strengthen its moving-target detection ability, two-dimensional convolution is used to enhance the motion-information features and three-dimensional convolution is used to extract the motion information between the two frames. S3 specifically comprises the following sub-steps:

S31: extract motion information from the input k-th and (k-n)-th frames by feeding both frames into the two-dimensional-convolution-based motion feature enhancement module (MFEM), obtaining an enhanced motion feature map f64*3*152*152. The MFEM, shown in Fig. 4, is an improvement inspired by the motion information carried by the two-frame difference method: it computes the difference between the features of the two frames and uses it as the third dimension of the input to the three-dimensional-convolution motion-information extraction module, further strengthening the motion information while suppressing the background and stationary targets.
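A minimal NumPy sketch of the frame-difference idea behind the MFEM. The module's real structure in Fig. 4 also involves 2-D convolutions, which are omitted here; all names and the exact stacking order are assumptions.

```python
import numpy as np

def mfem_stack(feat_k, feat_kn):
    """Hypothetical sketch of the MFEM idea: compute the per-pixel difference of
    the two frames' feature maps and stack it with them as a third slice, so
    static background cancels in the difference while moving targets remain."""
    diff = feat_k - feat_kn
    return np.stack([feat_k, feat_kn, diff], axis=1)  # (C, 3, H, W)

feat_k = np.ones((64, 152, 152))
feat_kn = np.ones((64, 152, 152))
f = mfem_stack(feat_k, feat_kn)
print(f.shape)  # (64, 3, 152, 152), matching the f64*3*152*152 notation in the text
```

With two identical inputs the difference slice is all zeros, which is exactly the property the negative-sampling augmentation below exploits.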

Because the aerial data sets in S2 are scarce, the moving targets are small, and annotation is difficult, several data-augmentation schemes must be designed to improve the robustness of the algorithm.

When the aerial image data set is small and the weight of the single-frame target-detection branch is too large, a random tenth of the positive moving-target samples in the training set are additionally fed in as the (k-n)-th frame as well; that is, when the input k-th and (k-n)-th frames are the same image, the pair is treated as a negative sample so that the motion-information extraction branch obtains a larger weight.
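The same-frame negative-sampling rule can be sketched as follows. This is a hypothetical illustration: the function name, the way the 10% probability is applied, and the label format are all assumptions, not the patent's implementation.

```python
import random

def make_training_pair(frames, labels, k, n, p_same=0.1, rng=random.Random(0)):
    """With probability p_same, feed the same frame twice (no motion present)
    and drop all moving-target labels, making the pair a negative sample so
    the network cannot rely on appearance alone."""
    if rng.random() < p_same:
        return frames[k], frames[k], []          # identical pair -> no positives
    return frames[k], frames[k - n], labels[k]   # normal pair keeps its labels

frames = [f"frame{i}" for i in range(10)]
labels = [[("car", 1, 2, 3, 4)] for _ in range(10)]
a, b, y = make_training_pair(frames, labels, k=6, n=1, p_same=0.0)
print(a, b, len(y))  # frame6 frame5 1
```

Because the MFEM difference slice is identically zero for a same-frame pair, the only way the network can predict "no moving target" on these samples is by weighting the motion branch, which is the stated goal of the augmentation.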

When the aerial data set contains few images, an affine transformation is introduced for augmentation:

\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}

where (x, y) are the original pixel coordinates and (u, v) are the coordinates after the affine transformation.
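Assuming the standard 2-D affine parameterization (a 2×2 matrix A for rotation/scale/shear plus a translation t; the patent's exact parameter layout is not shown), the augmentation maps pixel coordinates as in this NumPy sketch:

```python
import numpy as np

def affine_warp_coords(xy, A, t):
    """Apply the affine map (u, v)^T = A (x, y)^T + t to an array of points."""
    return xy @ A.T + t

theta = np.deg2rad(90.0)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # pure 90-degree rotation
t = np.array([5.0, 0.0])                         # translation
uv = affine_warp_coords(np.array([[1.0, 0.0]]), A, t)
print(np.round(uv, 6))  # [[5. 1.]]
```

In practice the same transform would be applied both to the image pixels and to the annotated box coordinates so that labels stay aligned with the warped image.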

When the moving distance of a target between two adjacent frames is large, n in the (k-n)-th frame is set to n < 5, e.g. n = 1; when the moving distance between adjacent frames is small, n is set to n ≥ 5, e.g. n = 5.
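The interval rule above can be expressed as a trivial helper; the concrete values n = 1 and n = 5 are the examples given in the text, and the boolean switch is an illustrative simplification of whatever motion estimate would drive the choice in practice.

```python
def pick_frame_interval(fast_motion: bool) -> int:
    """Fast targets -> small n (< 5) so displacement stays within the receptive
    field; slow targets -> larger n (>= 5) so displacement is measurable."""
    return 1 if fast_motion else 5

print(pick_frame_interval(True), pick_frame_interval(False))  # 1 5
```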

S32: input the enhanced motion feature map into the improved three-dimensional convolution module MIE-Net (motion-information extraction network), extracting feature maps S76*76, S38*38, S19*19 that contain motion information at different convolution depths. Specifically, the improved module decomposes one three-dimensional convolution with a (3*3*3) kernel into two one-dimensional convolutions with (3*1*1) and (1*1*1) kernels and one two-dimensional convolution with a (1*2*2) kernel.
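The computational saving of this factorization can be checked with simple weight counting, assuming the channel count is preserved across the factorized stages and ignoring biases (both assumptions, since the patent does not give the channel widths):

```python
def conv3d_params(c_in, c_out, kernel):
    """Weight count of a 3-D convolution with kernel (depth, height, width)."""
    kd, kh, kw = kernel
    return c_in * c_out * kd * kh * kw  # biases ignored

c = 64
full = conv3d_params(c, c, (3, 3, 3))
factored = (conv3d_params(c, c, (3, 1, 1))    # 1-D temporal conv
            + conv3d_params(c, c, (1, 1, 1))  # 1-D pointwise conv
            + conv3d_params(c, c, (1, 2, 2))) # 2-D spatial conv
print(full, factored, factored / full)
```

Per channel pair the factorized stack uses 3 + 1 + 4 = 8 weights instead of 27, roughly a 3.4× reduction, at the cost of a less expressive kernel.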

S33: input the S19*19 features obtained in S32 into the non-local module to further integrate the motion information, obtaining the integrated S'19*19;

S34: using the YOLOv5s algorithm as the baseline of the target-detection branch, extract target-information features from the k-th frame, obtaining feature maps F76*76, F38*38, F19*19 at different convolution depths;

S35: concatenate, along the channel dimension and by matching size, the S76*76 and S38*38 obtained in S32, the S'19*19 obtained in S33, and the F76*76, F38*38, and F19*19 obtained in S34;
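Step S35's fusion is a plain channel-wise concatenation at each matching spatial size. The sketch below shows one of the three scales with assumed channel counts (128 and 64), since the text does not specify them; the layout is assumed channel-first.

```python
import numpy as np

# Target-branch map F and motion-branch map S at the same 76x76 spatial size
# are joined along the channel axis; the spatial grid is untouched.
F76 = np.zeros((1, 128, 76, 76))  # hypothetical channel count
S76 = np.zeros((1, 64, 76, 76))   # hypothetical channel count
fused = np.concatenate([F76, S76], axis=1)
print(fused.shape)  # (1, 192, 76, 76)
```

The same operation is repeated at the 38×38 and 19×19 scales before the two decoding convolution modules of S36.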

S36: input the three feature maps obtained in S35 into two convolution modules for decoding, normalize the sizes into the YOLO format, and obtain the final output of the network.

Although embodiments of the present invention have been shown and described above, it should be understood that they are exemplary and are not to be construed as limiting the present invention. A person of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

The specific embodiments of the present invention described above do not limit its scope of protection. Any other corresponding changes and variations made according to the technical concept of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. The moving target detection method based on the aerial image sequence is characterized by comprising the following steps of:
S1: inputting an aerial remote sensing image sequence;
S2: preparing an image data set to be trained;
S3: inputting the k-th frame image into a target detection network, inputting the k-th frame and the (k-n)-th frame images into a motion detection network, and fusing the feature maps obtained from the target detection network and the motion detection network to obtain the final network output.
2. The method for detecting a moving object based on an aerial image sequence according to claim 1, wherein the step S3 comprises the following sub-steps:
S31: extracting motion information from the input k-th frame and (k-n)-th frame images: the two frame images are input into the motion feature enhancement module MFEM based on two-dimensional convolution, obtaining an enhanced motion feature map f64*3*152*152;
S32: inputting the obtained enhanced motion feature map into the improved three-dimensional convolution module MIE-Net, and extracting feature maps S76*76, S38*38, S19*19 containing motion information at different convolution depths;
S33: inputting the S19*19 feature obtained in S32 into a non-local module to further integrate the motion information, obtaining the integrated S'19*19;
S34: extracting target information features from the k-th frame image to obtain feature maps F76*76, F38*38, F19*19 at different convolution depths;
S35: concatenating, along the channel dimension and by matching size, the feature maps S76*76 and S38*38 obtained in S32, S'19*19 obtained in S33, and F76*76, F38*38 and F19*19 obtained in S34;
S36: inputting the three feature maps obtained in S35 into two convolution modules for decoding, normalizing the sizes into the YOLO format, and obtaining the final output of the network.
3. The method for detecting a moving object based on an aerial image sequence according to claim 2, wherein the MFEM module in S31 computes the difference between the features of the two frame images and uses the difference as the third-dimension input of the three-dimensional convolution motion information extraction module.
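The frame-differencing idea in claim 3 can be sketched as follows (illustrative only; feature maps are plain 2-D lists here, while the actual method operates on convolutional feature tensors):

```python
def frame_feature_difference(feat_k, feat_kn):
    """Element-wise difference between the feature maps of frame k and
    frame k-n; in the MFEM this difference serves as the third-dimension
    input of the 3-D convolution motion information extraction module."""
    return [[a - b for a, b in zip(row_k, row_kn)]
            for row_k, row_kn in zip(feat_k, feat_kn)]
```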
4. The method for detecting a moving object based on an aerial image sequence according to claim 2, wherein the improved three-dimensional convolution module MIE-Net in S32 is specifically: a three-dimensional convolution with a (3*3*3) kernel is decomposed into two one-dimensional convolutions with (3*1*1) and (1*1*1) kernels and a two-dimensional convolution with a (1*2*2) kernel.
5. The method for detecting a moving object based on an aerial image sequence according to claim 4, wherein, when the number of images in the aerial image data set in S31 is small and the single-frame target detection branch carries too much weight, a random tenth of the positive samples of moving targets in the training image data set is added and input as the (k-n)-th frame; that is, when the input k-th frame and (k-n)-th frame are the same frame image, the pair is taken as a negative sample, giving the motion information extraction branch a greater weight.
6. The moving object detection method based on an aerial image sequence according to any one of claims 2 to 5, wherein, when the number of images in the aerial data set is small, an affine transformation algorithm is introduced:
u = a1*x + b1*y + c1, v = a2*x + b2*y + c2, where a1, b1, c1, a2, b2 and c2 are the affine transformation parameters;
where (x, y) are the original pixel point coordinates and (u, v) are the coordinates after the affine transformation.
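The affine transformation of claim 6 maps each pixel (x, y) to (u, v) with six parameters. A minimal sketch (parameter names are illustrative, not from the patent):

```python
def affine_transform(x, y, a1, b1, c1, a2, b2, c2):
    """Map an original pixel (x, y) to (u, v): a1, b1, a2, b2 encode
    rotation/scale/shear and c1, c2 encode translation."""
    u = a1 * x + b1 * y + c1
    v = a2 * x + b2 * y + c2
    return u, v

# Identity rotation/scale plus a (5, 7) translation:
print(affine_transform(1, 2, 1, 0, 5, 0, 1, 7))  # (6, 9)
```

In practice such warps are used to synthesize additional training views of a small data set, e.g. via OpenCV's `cv2.warpAffine`.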
7. The method for detecting a moving object based on an aerial image sequence according to claim 6, wherein, in step S3, when the motion distance of the moving object between two adjacent frames is large, n in the (k-n)-th frame image is set to n < 5; when the motion distance of the moving object between two adjacent frames is small, n is set to n ≥ 5.
8. The method for detecting a moving object based on an aerial image sequence according to claim 7, wherein the method for acquiring the image dataset to be trained in S2 is as follows:
S21: cropping the images to obtain the image data set to be trained;
S22: clustering the detection boxes annotated on moving targets in the image data set to be trained into 9 anchor boxes, substituting the anchor boxes into the algorithm, and training to obtain the model parameters.
9. The moving object detection method based on the aerial image sequence according to claim 8, wherein the fixed anchor boxes in S22 are obtained by clustering the moving objects in the image data set to be trained with the k-means method, yielding 9 typical target sizes that are substituted into the model as fixed reference boxes.
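The anchor clustering of claim 9 can be sketched with a plain k-means over annotated (width, height) boxes. This is illustrative only: initialisation from the first k boxes and a Euclidean distance are simplifications, whereas YOLO tooling typically uses random initialisation and an IoU-based distance:

```python
def kmeans_anchors(boxes, k=9, iters=50):
    """Cluster (width, height) pairs into k anchor sizes via k-means."""
    centers = [tuple(map(float, b)) for b in boxes[:k]]  # simple init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            i = min(range(k), key=lambda j: (w - centers[j][0]) ** 2
                                            + (h - centers[j][1]) ** 2)
            clusters[i].append((w, h))
        # Recompute each center as the mean of its cluster.
        centers = [(sum(w for w, _ in c) / len(c),
                    sum(h for _, h in c) / len(c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```

With k = 9 the sorted centers are assigned three per detection scale (small anchors to the 76*76 head, large anchors to the 19*19 head) in typical YOLO practice.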
10. The moving object detection method based on the aerial image sequence according to claim 9, wherein in S21 the images are cropped to a uniform input size of 608 x 608 pixels.
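Cropping large aerial frames into uniform 608 x 608 tiles (claim 10) requires choosing crop origins so the whole image is covered. A one-dimensional sketch of that bookkeeping (an assumption about the tiling scheme, not stated in the patent):

```python
def crop_positions(size, tile=608):
    """Top-left offsets of tile-sized crops covering `size` pixels; the
    final crop is shifted back so it stays inside the image (overlapping
    its neighbour rather than padding)."""
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile + 1, tile))
    if starts[-1] != size - tile:
        starts.append(size - tile)  # partial last tile: shift back
    return starts

print(crop_positions(1000))  # [0, 392]
```

Applying the function to width and height independently gives the full grid of 608 x 608 crops.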
CN202310424573.6A 2023-04-19 2023-04-19 A Moving Target Detection Method Based on Aerial Image Sequence Pending CN116402859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310424573.6A CN116402859A (en) 2023-04-19 2023-04-19 A Moving Target Detection Method Based on Aerial Image Sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310424573.6A CN116402859A (en) 2023-04-19 2023-04-19 A Moving Target Detection Method Based on Aerial Image Sequence

Publications (1)

Publication Number Publication Date
CN116402859A true CN116402859A (en) 2023-07-07

Family

ID=87014037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310424573.6A Pending CN116402859A (en) 2023-04-19 2023-04-19 A Moving Target Detection Method Based on Aerial Image Sequence

Country Status (1)

Country Link
CN (1) CN116402859A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422867A (en) * 2023-11-14 2024-01-19 中国科学院长春光学精密机械与物理研究所 Moving object detection method based on multi-frame images
CN118037775A (en) * 2024-02-02 2024-05-14 北京无线电测量研究所 Video SAR moving target shadow detection method and device based on moving mode

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154118A (en) * 2017-12-25 2018-06-12 北京航空航天大学 A kind of target detection system and method based on adaptive combined filter with multistage detection
CN109272530A (en) * 2018-08-08 2019-01-25 北京航空航天大学 Target tracking method and device for space-based monitoring scene
CN111123257A (en) * 2019-12-30 2020-05-08 西安电子科技大学 Radar moving target multi-frame joint detection method based on graph space-time network
CN115240089A (en) * 2022-07-07 2022-10-25 中国科学院长春光学精密机械与物理研究所 A vehicle detection method for aerial remote sensing images
CN115631426A (en) * 2022-10-12 2023-01-20 中国科学院长春光学精密机械与物理研究所 Target detection method based on key point positioning in remote sensing image


Similar Documents

Publication Publication Date Title
CN110728200B (en) Real-time pedestrian detection method and system based on deep learning
Chen et al. Vehicle detection in high-resolution aerial images via sparse representation and superpixels
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
Alidoost et al. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image
Chen et al. Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature
CN109934131A (en) A small target detection method based on UAV
CN109063549B (en) A high-resolution aerial video moving target detection method based on deep neural network
CN116402859A (en) A Moving Target Detection Method Based on Aerial Image Sequence
CN111899278B (en) Unmanned aerial vehicle image rapid target tracking method based on mobile terminal
CN106503170B (en) An Image Library Construction Method Based on Occlusion Dimension
CN110555420A (en) fusion model network and method based on pedestrian regional feature extraction and re-identification
CN112329559A (en) Method for detecting homestead target based on deep convolutional neural network
CN115240089A (en) A vehicle detection method for aerial remote sensing images
CN109657540B (en) Dead tree location method and system
CN113569911A (en) Vehicle identification method and device, electronic equipment and storage medium
Liu et al. Tilt correction toward building detection of remote sensing images
Li et al. Development and challenges of object detection: A survey
Li et al. An improved framework for airport detection under the complex and wide background
Tan et al. Automobile Component Recognition Based on Deep Learning Network with Coarse‐Fine‐Grained Feature Fusion
Liu et al. VL-MFL: UAV Visual Localization Based on Multi-Source Image Feature Learning
CN112802049B (en) A method and system for constructing a household object detection data set
Wu et al. Multimodal collaboration networks for geospatial vehicle detection in dense, occluded, and large-scale events
Omar et al. Aerial dataset integration for vehicle detection based on YOLOv4
KR20200005853A (en) Method and System for People Count based on Deep Learning
Chen et al. Stingray detection of aerial images with region-based convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination