CN117876774A - High-speed object detection method based on multi-branch feature fusion and re-parameterization
- Publication number: CN117876774A
- Application number: CN202410048866.3A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/764 - Image or video recognition using classification, e.g. of video objects
- G06N3/0464 - Convolutional networks [CNN, ConvNet]
- G06V10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/806 - Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82 - Image or video recognition using neural networks
- G06V2201/07 - Target detection
Abstract
Description
Technical Field
The invention belongs to the field of image processing and pattern recognition in computer vision, and relates to a high-speed target detection method based on multi-branch feature fusion and re-parameterization.
Background Art
With the rapid development of deep learning in recent years, deep-learning-based detection algorithms are applied ever more widely in daily life. In autonomous driving, recognizing pedestrians, fast-moving vehicles, and traffic signals requires a high-speed target detection algorithm that responds quickly, so that reaction time is left for subsequent operations. In sports, high-speed deformable target detection helps referees locate athletes, balls, and other moving objects more accurately and thus officiate more fairly. However, high-speed deformable targets differ greatly from their appearance at rest and suffer from deformation and smearing; from the standpoint of practical application, the model must also keep a fast detection speed.
In the framework of convolutional neural networks, low-level features are rich in detail information such as edges, positions, and textures. These features have a clear advantage for capturing targets at rest or at normal speed. Under high-speed motion, however, factors such as changing ambient light and device limitations may cause the target to deform or smear, and the low-level features then become highly susceptible to this noise. Deformation, for example, can drastically alter texture and edge information, so the great majority of detection models designed for normal-speed targets cannot cope effectively with detection under high-speed motion. High-level features, having passed through multiple nonlinear transformations and pooling operations, are more robust and resist interference from image noise, brightness changes, and deformation smear. They also carry a large amount of semantic and global information, including object shape and scene context. Taking car detection as an example, even when the car in the image is stretched or flattened, its overall outline still keeps the characteristics unique to a car, and the detection model can rely on these global features to identify the deformed car more accurately. Fusing low-level and high-level features therefore yields a more comprehensive and richer feature representation and further improves the ability to handle deformation.
In addition, ordinary single-scale local feature extraction works within a window of one fixed size, that is, features are extracted at a single scale only. In real scenes, because of distance, viewing angle, and other factors, the same target may appear at different sizes in the image, and single-scale features show clear limitations when facing targets of different sizes. Multi-scale feature extraction instead extracts target features at several scales (such as different resolutions or spatial extents): at a small scale the model may focus on a small part or a specific texture of an object, while at a large scale it may capture more global or contextual information. Multi-scale local feature extraction therefore strengthens the understanding of the forms and structures that high-speed targets exhibit at different sizes, improving the robustness of high-speed target detection.
Summary of the Invention
Aiming at the deformation and smearing of targets under high-speed motion, the present invention proposes the IM-ELAN structure on the basis of YOLOv7, which further strengthens the network model's ability to extract deformation features of high-speed targets without slowing down inference.
In a first aspect, the present invention provides a high-speed target detection method based on multi-branch feature fusion and re-parameterization, comprising the following steps:
Step 1. Acquire a high-speed deformable target dataset;
Step 2. Improve the E-ELAN structure in the YOLOv7 network model to raise the model's detection performance on high-speed targets;
The E-ELAN structure consists mainly of four branches: an IM module is added to the first branch, the remaining three are convolutional extraction branches, and the features extracted by the four branches are fused;
The IM module comprises an IMA submodule and an MSCR submodule. The IMA submodule directly models pairwise relationships between input elements through a multi-head self-attention mechanism; the MSCR submodule helps the model extract multi-scale local information without affecting inference speed;
Step 3. Train the improved YOLOv7 network on the large-scale COCO object detection dataset to obtain a pre-trained model;
Step 4. Starting from the pre-trained model, train the final model on the high-speed deformable target dataset;
Step 5. Feed images to be examined, which may contain high-speed deformable targets, into the final model to detect high-speed targets.
In a second aspect, the present invention provides a computer device comprising a memory, a processor, and computer-executable instructions stored in the memory and runnable on the processor, the processor implementing the above high-speed target detection method when executing the instructions.
In a third aspect, the present invention provides a computer storage medium storing computer-executable instructions which, when executed, implement the above high-speed target detection method.
The beneficial effects of the invention are as follows. Existing detection models cannot adequately detect targets that deform and smear under high-speed movement. The proposed structure modifies the backbone of the YOLOv7 network: multi-branch feature fusion yields a more comprehensive and richer feature representation; a pooling layer reduces the parameter count and computation of multi-head self-attention; and re-parameterization helps the model extract multi-scale local features without affecting speed.
Brief Description of the Drawings
To present the network structure and training process of the invention more clearly, the drawings required by the embodiments are briefly introduced below.
Figure 1 is the flowchart of high-speed deformable target detection in the present application.
Figure 2 shows the IM-ELAN structure designed in an embodiment of the present application.
Figure 3 shows the IMA-MSCR module designed in an embodiment of the present application.
Figure 4 shows the IMA module designed in an embodiment of the present application.
Figure 5 shows the MSCR module designed in an embodiment of the present application.
Figure 6 shows sample images from the high-speed deformable target dataset used in the embodiments.
Figure 7 shows visualization results of an embodiment on the PASCAL VOC dataset and the self-built high-speed target dataset.
Detailed Description of Embodiments
To describe the present invention more concretely, its technical solution is explained in detail below with reference to the drawings and specific embodiments.
An embodiment of the present application provides a high-speed target detection method based on multi-branch feature fusion and re-parameterization, comprising the following steps:
Step 1. Collect videos of high-speed moving targets in real life and manually gather deformation images of the targets under different conditions, yielding a high-speed deformable target dataset. Sample images are shown in Figure 6, and the dataset distribution is shown in Table 1.
Table 1. High-speed target dataset
Step 2. Improve the E-ELAN structure in the YOLOv7 network, obtaining the IM-ELAN structure.
The IM-ELAN structure is shown in Figure 2. First, the input image X passes through three convolutions to give the feature map C2, which is then split along the channel dimension into two feature maps F1 and F2. F1 forms the first branch and passes through the IM module to give the feature map Fim; F2 itself is the second branch; two convolutions on F2 give F3, the third branch; and two further convolutions on F3 give F4, the fourth branch. The four branches are concatenated along the channel dimension, and one convolution produces the feature map F5. F5 is then processed by two branches, shuffle and pooling, giving feature maps F5-1 and F5-2, which are finally added element-wise to give the feature map C3.
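For illustration only, the data flow just described can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the channel widths, the SiLU activation, the stride settings, and the exact shuffle and pooling operators are assumptions, and the IM module (detailed later) is stubbed with an identity placeholder.

```python
import torch
import torch.nn as nn


def conv_bn_act(c_in, c_out, k=3):
    """Convolution + BN + activation; the SiLU choice is an assumption."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )


def channel_shuffle(x, groups=2):
    """Deterministic channel shuffle, assumed to be the 'shuffle' branch."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)


class IMELANSketch(nn.Module):
    def __init__(self, c_in=64, c_mid=128, im_module=None):
        super().__init__()
        half = c_mid // 2
        self.stem = nn.Sequential(*[conv_bn_act(c_in if i == 0 else c_mid, c_mid)
                                    for i in range(3)])            # three convs -> C2
        self.im = im_module if im_module is not None else nn.Identity()  # IM module stub
        self.b3 = nn.Sequential(conv_bn_act(half, half), conv_bn_act(half, half))
        self.b4 = nn.Sequential(conv_bn_act(half, half), conv_bn_act(half, half))
        self.fuse = conv_bn_act(4 * half, c_mid, k=1)              # conv after concat -> F5
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)           # 'pooling' branch

    def forward(self, x):
        c2 = self.stem(x)
        f1, f2 = c2.chunk(2, dim=1)         # split C2 by channels into F1, F2
        f_im = self.im(f1)                  # branch 1: IM module -> Fim
        f3 = self.b3(f2)                    # branch 3: two convs on F2 -> F3
        f4 = self.b4(f3)                    # branch 4: two convs on F3 -> F4
        f5 = self.fuse(torch.cat([f_im, f2, f3, f4], dim=1))   # branch 2 is F2 itself
        return channel_shuffle(f5) + self.pool(f5)             # F5-1 + F5-2 -> C3
```

With these assumed sizes, `IMELANSketch()(torch.randn(1, 64, 80, 80))` returns a (1, 128, 80, 80) map.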
This embodiment not only obtains a more comprehensive and richer feature representation through multi-branch feature fusion, but also reduces the parameter count and computation of multi-head self-attention through a pooling layer; finally, re-parameterization helps the model extract multi-scale local features without affecting speed, strengthening its understanding of the forms a target takes at different sizes and thereby improving the detection of high-speed deformable targets.
In a preferred embodiment, IM-ELAN is trained with stochastic gradient descent (SGD), and warmup is applied for the first three epochs. The training hyperparameter settings are listed in Table 2.
Table 2. Training hyperparameter settings
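As an illustration of this schedule, a minimal PyTorch sketch follows: SGD with a linear warmup stepped once per epoch over the first three epochs. The base learning rate, momentum, and weight decay below are assumptions; the actual values are those of Table 2.

```python
import torch


def make_optimizer_and_warmup(model, base_lr=0.01, warmup_epochs=3):
    """SGD plus a three-epoch linear warmup; hyperparameter values are assumed."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.937, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=0.1, total_iters=warmup_epochs)  # call sched.step() per epoch
    return opt, sched
```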
Further, the IM module consists of two parts, an IMA submodule and an MSCR submodule, as shown in Figure 3.
The IMA submodule directly models pairwise relationships between input elements through a multi-head self-attention mechanism, which makes it particularly suited to long-range dependencies. This is especially useful when the object to be recognized is spread across different parts of the image, for example a train whose middle section is occluded by vegetation. At the same time, to keep the model real-time, a pooling module is introduced to reduce the parameter count of the self-attention module. The MSCR submodule helps the network model extract multi-scale local information without affecting inference speed.
In a preferred embodiment, the IMA submodule follows the same idea as the Vision Transformer; its structure is shown in Figure 4.
First, each pixel of the feature map F1 is treated as an element of a token; flattening and merging the width and height dimensions builds the Self-Attention input, a vector sequence E, as in Equation (1).
E = [token_1, token_2, ..., token_wh] = flatten(w, h, c)    (1)
After positional encoding is added, three weight-shared linear layers produce the matrices Q, K, and V, namely the query, key, and value matrices.
Existing self-attention computes directly on the full Q, K, and V, which carries a large computational cost. This embodiment seeks to strengthen the network model's spatial correlation without lowering computational efficiency, so the K and V matrices are modified: average pooling with kernel size 3 and stride 2 shrinks the width and height of K and V to one third of the original, giving K′ and V′.
With this improvement, the model's parameter count and computation drop and Self-Attention is computed faster: before the improvement, the Q·K^T product has complexity O((w×h)^2 × c); after it, the Q·(K′)^T product has complexity O(((w×h)^2 × c)/3). The subsequent computation involving V′ is likewise three times cheaper, greatly reducing the overall complexity of Self-Attention.
A multi-head attention mechanism is then applied; the results of the four heads are concatenated and fused by a linear layer, giving the feature map Fima.
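A minimal PyTorch sketch of the IMA submodule follows, assuming a fixed input size. Pooling is applied to the feature map before the K and V projections (an equivalent reading of pooling the K and V matrices themselves), the positional encoding is a learned parameter added only to the full token sequence, and the head width and scaling factor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IMASketch(nn.Module):
    def __init__(self, c, h, w, heads=4):
        super().__init__()
        self.heads = heads
        self.pos = nn.Parameter(torch.zeros(1, h * w, c))  # learned positional encoding
        self.q, self.k, self.v = nn.Linear(c, c), nn.Linear(c, c), nn.Linear(c, c)
        self.proj = nn.Linear(c, c)                        # fuses the concatenated heads

    def forward(self, x):                                  # x: (B, C, H, W), fixed H, W
        b, c, h, w = x.shape
        e = x.flatten(2).transpose(1, 2) + self.pos        # Eq. (1): tokens, (B, HW, C)
        q = self.q(e)
        kv = F.avg_pool2d(x, kernel_size=3, stride=2)      # shrink H and W for K, V
        kv = kv.flatten(2).transpose(1, 2)                 # (B, H'W', C)
        k, v = self.k(kv), self.v(kv)

        def split(t):                                      # (B, N, C) -> (B, heads, N, d)
            return t.view(b, -1, self.heads, c // self.heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) / (c // self.heads) ** 0.5
        out = attn.softmax(dim=-1) @ v                     # (B, heads, HW, d)
        out = out.transpose(1, 2).reshape(b, h * w, c)
        out = self.proj(out)
        return out.transpose(1, 2).view(b, c, h, w)        # feature map Fima
```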
The feature map Fima produced by the IMA submodule next passes through the MSCR submodule, which splits into two flows, a training flow and an inference flow, as shown in Figure 5.
In training, Fima is first fed through a fully connected layer and a BN layer. To save parameters, a grouped fully connected layer is used: the feature map of each channel is flattened, and the fully connected operation is applied within that channel. In one example, a 1×1 grouped convolution replaces the grouped fully connected layer, but it is in essence a grouped fully connected layer.
Fima is passed through the 1×1 grouped convolution and the BN layer, and the resulting feature vector is reshaped back to the shape of the original input, giving Ffc, as in Equation (2):

Ffc = reshape(BN(GroupConv(flatten(Fima))))    (2)
Meanwhile, Fima is fed into separate multi-scale convolution branches, whose convolution kernels are 1, 3, and 5 respectively. Multi-scale convolution helps the network model learn local features of different sizes. Because the convolution-branch feature maps must later be fused with Ffc, a different padding is applied for each kernel size so that output shapes match, followed by a BN layer; denoting the result Fconv, this is Equation (3):

Fconv = BN(Conv1(Fima)) + BN(Conv3(Fima)) + BN(Conv5(Fima))    (3)
Thanks to the multi-scale convolution module, the present application attends better to the deformation characteristics of high-speed targets at different scales, which in turn aids detection of targets under high-speed motion.
The obtained Ffc and Fconv are then added to give Fim, as in Equation (4):

Fim = Ffc + Fconv    (4)
This is the training flow of the MSCR submodule.
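A minimal training-time sketch of Equations (2)-(4) follows. The grouped per-channel fully connected layer is realized as a 1×1 grouped convolution on the flattened input; the depthwise grouping of the convolution branches (groups=c) is an assumption, made so that the inference-time merge of Equation (7) stays well defined per channel, and the per-branch BN placement is likewise assumed.

```python
import torch
import torch.nn as nn


class MSCRSketch(nn.Module):
    def __init__(self, c, h, w):
        super().__init__()
        self.c, self.h, self.w = c, h, w
        # Per-channel FC over the h*w positions, as a 1x1 grouped convolution
        # (weight has c groups of shape (h*w, h*w); large, but faithful to Eq. 2).
        self.fc = nn.Conv2d(c * h * w, c * h * w, 1, groups=c, bias=False)
        self.fc_bn = nn.BatchNorm2d(c * h * w)
        # Multi-scale branches, padded so every output keeps the input shape.
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
                          nn.BatchNorm2d(c))
            for k in (1, 3, 5))

    def forward(self, f_ima):                                    # f_ima: (B, C, H, W)
        b = f_ima.shape[0]
        flat = f_ima.reshape(b, self.c * self.h * self.w, 1, 1)  # flatten per channel
        f_fc = self.fc_bn(self.fc(flat)).reshape(
            b, self.c, self.h, self.w)                           # Eq. (2): Ffc
        f_conv = sum(branch(f_ima) for branch in self.convs)     # Eq. (3): Fconv
        return f_fc + f_conv                                     # Eq. (4): Fim
```

For example, `MSCRSketch(c=64, h=20, w=20)(torch.randn(2, 64, 20, 20))` preserves the input shape.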
In inference, the MSCR submodule fuses the Fconv branch into the Ffc branch through re-parameterization. The underlying theory is as follows: because matrix multiplication (MMUL) is additive, weight matrices are additive, as in Equation (5); and because convolution is in essence a sparsified matrix multiplication, there is an equivalent weight conversion between convolution and fully connected matrix multiplication, as in Equation (6):
MMUL(Fima^(in), W1) + MMUL(Fima^(in), W2) = MMUL(Fima^(in), W1 + W2)    (5)
MMUL(Fima^(in), W) = CONV(Fima^(in))    (6)
From these two properties, matrix additivity and convolution-to-FC weight convertibility, the convolution weights are folded into the fully connected layer during forward inference, as in Equation (7):

Fim = reshape(BN(FC_CONV(flatten(Fima))))    (7)
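The conversion of Equation (6) can be demonstrated concretely: a k×k convolution with shape-preserving padding on an h×w map is a linear map on the flattened pixels, so its equivalent (h·w)×(h·w) matrix can be extracted by pushing the h·w identity basis images through the convolution, then added onto the fully connected weight as in Equation (5). The sketch below verifies the equivalence for a single channel; BN is assumed to have already been folded into the convolution weight.

```python
import torch
import torch.nn.functional as F


def conv_to_fc_weight(conv_weight, h, w):
    """Equivalent (h*w, h*w) FC weight for a single-channel k x k convolution
    with shape-preserving padding (Eq. 6). Row j of the intermediate result is
    the flattened response to basis image j; transposing gives y = W @ x_flat."""
    k = conv_weight.shape[-1]
    eye = torch.eye(h * w).reshape(h * w, 1, h, w)     # the h*w basis images
    out = F.conv2d(eye, conv_weight, padding=k // 2)   # (h*w, 1, h, w)
    return out.reshape(h * w, h * w).t()


# Check: the convolution and the converted FC weight give identical outputs,
# so the conv branch can be absorbed into the FC branch by weight addition (Eq. 5).
h = w = 4
x = torch.randn(1, 1, h, w)
wk = torch.randn(1, 1, 3, 3)                           # one 3x3 kernel, one channel
W = conv_to_fc_weight(wk, h, w)
y_conv = F.conv2d(x, wk, padding=1).flatten()
y_fc = W @ x.flatten()
assert torch.allclose(y_conv, y_fc, atol=1e-5)
```

In the full module, the merged layer FC_CONV of Equation (7) would be obtained by adding such converted matrices for the 1×1, 3×3, and 5×5 branches onto the grouped FC weight, channel by channel.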
A traditional MLP focuses on global feature extraction, mapping input dimensions to output dimensions without regard to position, and thus lacks local feature information. Unlike a traditional MLP, the present application injects the prior knowledge of multi-scale convolution, which effectively helps the network model extract local features of the target. At the same time, to preserve speed, the convolution weights are folded into the fully connected layer through re-parameterization during forward inference.
In summary, during training the multi-scale local feature extraction module runs in parallel with the multilayer perceptron module; during inference the convolution weights are merged into the fully connected layer. The MSCR submodule thus lets the IMA-MSCR module exploit spatial correlation fully while equipping the model with multi-scale local features.
To demonstrate the effectiveness of the IM-ELAN design, ablation experiments compare the original E-ELAN and the improved IM-ELAN structures on the high-speed target dataset; the results are shown in Table 3.
Table 3. Ablation results
As the table shows, the improved IM-ELAN module raises accuracy by 2.1% over E-ELAN, further indicating that the IM module effectively improves the network model's detection of high-speed targets.
Step 3. Train the improved YOLOv7 network on the large-scale COCO object detection dataset, obtaining a pre-trained model.
Step 4. Starting from the pre-trained model, train the final model on the high-speed deformable target dataset.
The regression loss uses CIoU Loss, while the classification and objectness-confidence losses use BCE Loss. The losses are then fused, with a different weight balancing each term so that no single loss dominates, namely:
Total Loss = λ1·Lcls + λ2·Lobj + λ3·Lloc
where Lcls is the classification loss, Lobj the objectness-confidence loss, and Lloc the regression loss; λ1, λ2, and λ3 are their respective weights, with defaults 0.125, 0.1, and 0.05.
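The weighted fusion is directly expressed in code; a minimal sketch with the stated default weights follows, where the individual loss terms are assumed to be computed elsewhere by the YOLOv7 loss pipeline.

```python
def total_loss(l_cls, l_obj, l_loc, lam1=0.125, lam2=0.1, lam3=0.05):
    """Weighted fusion of classification (BCE), objectness (BCE), and
    regression (CIoU) losses, using the default weights given above."""
    return lam1 * l_cls + lam2 * l_obj + lam3 * l_loc
```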
Step 5. Feed the images to be examined, which may contain high-speed deformable targets, into the improved model to detect the targets (a minimal sketch of this loop follows the list). Specifically:
(1) the high-speed target detection system reads target images in real time;
(2) the target image is fed into the network model for forward inference;
(3) the network judges whether a target is present in the image; if so, go to step (4), otherwise go to step (5);
(4) the detection system marks the target and prompts that a target is present in the image;
(5) if unread images remain, return to step (1); otherwise end the detection.
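The sketch below illustrates steps (1)-(5). The `image_source` iterable, the YOLO-style output layout (rows with objectness at index 4), and the confidence threshold are assumptions, and `annotate` is a hypothetical stand-in for the system's real overlay and prompt logic.

```python
import torch


def annotate(image, boxes):
    # Hypothetical placeholder: the real system draws the boxes and raises a prompt.
    print(f"target present: {boxes.shape[0]} detection(s)")


def run_detection(image_source, model, conf_thres=0.25):
    """Steps (1)-(5): read images, run forward inference, mark any targets."""
    model.eval()
    with torch.no_grad():
        for image in image_source:                    # (1) read the next image
            preds = model(image.unsqueeze(0))[0]      # (2) forward inference
            hits = preds[preds[:, 4] > conf_thres]    # (3) any target in the image?
            if hits.shape[0] > 0:
                annotate(image, hits)                 # (4) mark and prompt
        # (5) the loop ends when no unread images remain
```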
Figure 7 compares visualization results on the PASCAL VOC dataset and the self-built high-speed target dataset. IM-ELAN detects a train occluded by trees and a ball undergoing high-speed deformation, showing that the IM-ELAN structure helps the model extract multi-scale local information and handle long-range dependencies better, and ultimately detect high-speed moving targets more reliably.
An embodiment of the present application further provides a computer device comprising a memory, a processor, and computer-executable instructions stored in the memory and runnable on the processor, the processor implementing the above high-speed target detection method when executing the instructions.
Another embodiment of the present application provides a computer storage medium storing computer-executable instructions which, when executed, implement the above high-speed target detection method.
In summary, building on the YOLOv7 detection model, the present application uses feature fusion and multi-scale local feature extraction to combine the strengths of low-level and high-level features into a more comprehensive feature representation, and uses re-parameterization to extract local features of different scales without affecting speed, strengthening the understanding of the forms and structures that high-speed targets exhibit at different sizes and thereby improving detection accuracy under high-speed motion.
Claims (10)
Priority Applications (1)
- CN202410048866.3A, filed 2024-01-12: High-speed object detection method based on multi-branch feature fusion and re-parameterization
Publications (1)
- CN117876774A, published 2024-04-12

Family (ID=90586554)
- CN202410048866.3A, filed 2024-01-12, status Pending
Cited By (1)
- CN118968561A (priority 2024-08-26, published 2024-11-15): Orchard pedestrian detection method, system, equipment and medium based on structural re-parameterization
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination