CN116266387A - YOLOV4 image recognition algorithm and system based on reparameterized residual structure and coordinate attention mechanism

Info

Publication number: CN116266387A
Application number: CN202111426910.2A
Authority: CN (China)
Prior art keywords: yolov4, attention mechanism, model, residual structure, image set
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王瑜 (Wang Yu), 毕玉 (Bi Yu), 闫善武 (Yan Shanwu)
Current Assignee: Beijing Technology and Business University
Original Assignee: Beijing Technology and Business University
Application filed by Beijing Technology and Business University; priority to CN202111426910.2A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a YOLOV4 image recognition algorithm and system based on a reparameterized residual structure and a coordinate attention mechanism. The algorithm comprises the following steps: acquiring an image set to be recognized; processing the input training image set with the Mosaic data augmentation method, and computing the initial anchor boxes of the training image set with the K-means++ clustering algorithm; training the YOLOV4 model based on the reparameterized residual structure and coordinate attention mechanism on the training image set and validation image set to generate a feature extraction model; and inputting the image set to be recognized into the model to obtain the image recognition results. The algorithm uses the proposed YOLOV4 model based on the reparameterized residual structure and coordinate attention mechanism to perform target recognition on images; it can classify and localize targets against complex backgrounds, and the model is robust, so it can effectively improve the accuracy and speed of target recognition.

Description

YOLOV4 Image Recognition Algorithm and System Based on a Reparameterized Residual Structure and a Coordinate Attention Mechanism

Technical Field

The invention relates to the fields of image processing and pattern recognition, and in particular to a YOLOV4 image recognition algorithm and system based on a reparameterized residual structure and a coordinate attention mechanism.

Background Art

Target recognition plays an important role in the field of image processing: it localizes and classifies targets at the same time. Traditional target recognition algorithms include the V-J (Viola-Jones) detection algorithm, the Histogram of Oriented Gradients (HOG) detection algorithm, and the Deformable Parts Model (DPM) algorithm. The V-J detection algorithm is mainly used for face detection; its core idea is to search for Haar features by sliding a window over the input image. The HOG detection algorithm extracts gradients to build the corresponding feature tables, constructing a histogram for each grid cell of the image. DPM was the most successful traditional detection model before the rise of deep learning. However, with complex backgrounds and many targets to detect, traditional algorithms have no advantage in either speed or accuracy. In recent years, research has therefore focused on deep-learning-based target detection algorithms, which offer faster recognition and more stable results than traditional learning algorithms. Deep learning algorithms for target recognition fall into one-stage and two-stage algorithms. Typical one-stage algorithms include the YOLO series and SSD; typical two-stage algorithms include R-CNN, Fast R-CNN, and Faster R-CNN. Two-stage algorithms are slow but accurate, while one-stage algorithms are fast but less accurate; in the past two years, however, the YOLO series, a typical one-stage family, has achieved a very good balance between the accuracy and speed of target recognition.

The following briefly introduces this series of algorithms:

In 2016, Joseph Redmon et al. proposed YOLOV1, the first-generation model of the YOLO series. It divides the input image into n×n grid cells, and each cell predicts x candidate boxes and the object category. The model is very fast, processing 45 images per second, but its detection accuracy is poor. In 2017, the second-generation model YOLOV2 added an average pooling layer and BN layers to the backbone network, making the model converge faster, and introduced an anchor-box mechanism: instead of predicting coordinate values directly, it obtains a relatively accurate target location from the coordinate offsets and confidence scores. In 2018, YOLOV3 was obtained by improving YOLOV2; it selects anchor boxes at three different scales to detect targets of different sizes accurately, and adopts multi-label classification in the classification layer, making an independent yes/no decision for each class to achieve higher precision. In 2020, the fourth-generation model YOLOV4 was released; it uses CSPDarknet53 as the backbone and adds the SPP module and the FPN+PAN feature fusion structure. These improvements give the model not only a speed advantage but also a significant accuracy advantage over other models. Nevertheless, the model still has room for improvement toward better predictions.

Summary of the Invention

The present invention aims to solve, at least to a certain extent, one of the technical problems in the related art.

To this end, the first objective of the present invention is to propose a YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism; the algorithm is suitable for target recognition against complex backgrounds and substantially speeds up the inference stage.

Another objective of the present invention is to propose a YOLOV4 image recognition system based on a reparameterized residual structure and a coordinate attention mechanism.

To achieve the above two objectives, in a first aspect the present invention proposes a YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism, comprising the following steps: inputting the image set to be recognized; performing data augmentation on the input training image set, and computing the initial anchor boxes of the training image set with the K-means++ clustering algorithm; obtaining the YOLOV4 model with the reparameterized residual structure and coordinate attention mechanism, which is based on YOLOV4, adds spatial information along the X and Y directions in the feature extraction stage to improve model accuracy, and adds a reparameterized residual structure in the complex-feature extraction stage to speed up model inference; performing model training with the YOLOV4 model of the reparameterized residual structure and coordinate attention mechanism, the training image set, and the validation image set, so as to generate the YOLOV4 recognition model with the reparameterized residual structure and coordinate attention mechanism; and obtaining the image recognition results by passing the image set to be recognized through this recognition model.

With the YOLOV4 image recognition algorithm based on the reparameterized residual structure and coordinate attention mechanism of the embodiments of the present invention, a YOLOV4 image recognition model based on the reparameterized residual structure and coordinate attention mechanism can be obtained through deep residual network theory and deep-learning-based model training. The model is not limited by the complexity of the background of the images to be recognized, is more robust, and infers faster, thereby effectively improving the accuracy and speed of target recognition.

In addition, the YOLOV4 image recognition algorithm based on the reparameterized residual structure and coordinate attention mechanism according to the above embodiments of the present invention may also have the following additional technical features:

First, in one embodiment of the present invention, the YOLOV4 model comprises an input stage, a backbone network, a bottleneck network, and an output stage. The input stage performs data augmentation on the training image set and computes the initial anchor boxes of the training image set with the K-means++ clustering algorithm. The backbone network comprises the Darknet53 network together with the Mish and Leaky ReLU activation functions. The bottleneck network comprises the SPP module and the FPN+PAN feature fusion structure. The output stage comprises the CIOU_Loss loss function and the CIOU_nms prediction-box filtering method.

Second, in one embodiment of the present invention, the Rer block used during training adds (Add operation) the output of the first bottleneck module 1 (Bottleneck1) to a first 1×1 convolutional layer, passes through a Mish activation function, then adds the output of the first bottleneck module 2 (Bottleneck2) to a second 1×1 convolutional layer, and finally passes through a Mish activation function. The Add operation superimposes feature-map information while keeping the dimensions unchanged, increasing the information that describes image features. Bottleneck1 adds the outputs of a first 3×3 convolutional layer and a first 1×1 convolutional layer. Bottleneck2 adds a first 3×3 convolutional layer, a first 1×1 convolutional layer, and an identity branch. Before inference, the model reparameterizes the residual structure, converting Bottleneck1 and Bottleneck2 each into a single 3×3 convolutional layer, so that Rer finally becomes two 3×3 convolutional layers in series; the resulting single-path structure greatly speeds up inference. The reparameterized fusion fuses each convolutional layer with its BN layer; a convolutional layer can be expressed as:

Conv(x) = W(x) + b

where x denotes the input vector, Conv the convolution operation, W the weight vector, and b the bias.

The BN layer can be expressed as:

$$BN(x) = \gamma \cdot \frac{x - mean}{\sqrt{var}} + \beta$$

where x denotes the input vector, BN the batch normalization operation, mean the mean of the input vector, var the variance of the input vector, and β, γ learnable parameters.

Substituting the convolution result into the BN layer gives the fused result, which can be expressed as:

$$BN(Conv(x)) = \gamma \cdot \frac{W(x) + b - mean}{\sqrt{var}} + \beta$$

where x denotes the input vector, Conv the convolution operation, BN the batch normalization operation, β and γ learnable parameters, W the weight vector, b the bias, mean the mean of the input vector, and var the variance of the input vector. The 1×1 convolutional layer in the residual structure is expanded into a 3×3 convolutional layer by placing its value at the center of the 3×3 kernel and zero-filling the remaining positions. This fusion operation is a reparameterization process that greatly speeds up model inference.
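
As an illustration only, the following is a minimal PyTorch sketch of this Conv-BN fusion. The function name fuse_conv_bn is an assumption, the code assumes a standard groups=1 convolution, and the ε term inside the square root follows the usual BatchNorm implementation even though the formula above omits it.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d into the preceding Conv2d (inference-time
    reparameterization): BN(Conv(x)) = gamma*(W(x)+b-mean)/sqrt(var+eps) + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride,
                      conv.padding, bias=True)  # assumes groups=1, dilation=1
    std = torch.sqrt(bn.running_var + bn.eps)            # sqrt(var + eps)
    scale = bn.weight / std                              # gamma / sqrt(var + eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused
```

In eval mode, fused(x) then matches bn(conv(x)) up to floating-point error, which is what makes the reparameterized single-path model usable at inference time.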

Third, in one embodiment of the present invention, the coordinate attention mechanism comprises average pooling along the two spatial directions X and Y, which can be expressed as:

$$z_d^h(i) = \frac{1}{W}\sum_{0 \le j < W} x_d(i, j), \qquad z_d^w(j) = \frac{1}{H}\sum_{0 \le i < H} x_d(i, j)$$

where x is the given input, d indexes the channels; average pooling kernels of size (H, 1) and (1, W) encode each channel along the horizontal and vertical directions respectively; i indexes feature points along the height, j indexes feature points along the width, and z is the output after average pooling along the X and Y directions. The coordinate attention mechanism lets the model better recognize targets against complex backgrounds: it aggregates the positional and spatial information along the two spatial directions of the input feature map, yielding a more robust model.
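
For concreteness, a minimal PyTorch sketch of such a coordinate attention block is given below; it follows the structure spelled out in the detailed embodiments (X/Y average pooling, Concat, a 1×1 Conv-BN-Leaky ReLU encoder, two direction-wise Conv branches, and a final Add with the unprocessed map). The class name CoordAttention and the mid_channels width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Sketch of the patent's coordinate attention: pool along H and W
    separately, encode jointly, then Add both encodings back onto the input."""
    def __init__(self, channels: int, mid_channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, mid_channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.act1 = nn.LeakyReLU(0.1)
        self.conv_h = nn.Conv2d(mid_channels, channels, kernel_size=1)
        self.act2 = nn.LeakyReLU(0.1)
        self.conv_w = nn.Conv2d(mid_channels, channels, kernel_size=1)
        self.act3 = nn.LeakyReLU(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # (1, W) pooling -> N,C,H,1
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (H, 1) pooling -> N,C,W,1
        y = self.act1(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = self.act2(self.conv_h(y_h))                      # N,C,H,1
        a_w = self.act3(self.conv_w(y_w)).permute(0, 1, 3, 2)  # N,C,1,W
        return x + a_h + a_w                                   # Add with unprocessed map (broadcast)
```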

Fourth, in one embodiment of the present invention, the input stage sets the initial anchor boxes of the training image set with the K-means++ clustering algorithm. Given the input image set X = {x1, x2, ..., xn} and the number of clusters k, the algorithm randomly selects one sample point from the image set as the initial cluster center c1; for each sample point xi in the image set, it computes the shortest distance D(x) to the chosen centers. Points with larger D(x) are more likely to be selected as the next cluster center; repeating these steps yields k cluster centers.
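
As a hedged illustration of this selection rule, a minimal NumPy sketch of K-means++ center initialization over anchor-box sizes follows. The function name kmeanspp_centers is an assumption, and the use of the squared distance D(x)² as the sampling weight follows the standard K-means++ rule; the text above only states that points with larger D(x) are more likely to be chosen.

```python
import numpy as np

def kmeanspp_centers(boxes: np.ndarray, k: int, rng=None) -> np.ndarray:
    """K-means++ initialization: boxes is an (n, 2) array of (width, height)
    samples; returns k initial cluster centers (anchor-box sizes)."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [boxes[rng.integers(len(boxes))]]  # pick c1 uniformly at random
    for _ in range(k - 1):
        # D(x): distance from each sample to its nearest chosen center
        d2 = np.min([np.sum((boxes - c) ** 2, axis=1) for c in centers], axis=0)
        # samples with larger D(x) are more likely to become the next center
        probs = d2 / d2.sum()
        centers.append(boxes[rng.choice(len(boxes), p=probs)])
    return np.array(centers)
```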

Fifth, in one embodiment of the present invention, the Leaky ReLU and Mish activation functions are selected as the activation functions of the base network, and the DropBlock regularization method is used to mitigate the overfitting that appears during model training; specifically, contiguous blocks of feature points are randomly dropped from the feature map.
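
A minimal PyTorch sketch of the DropBlock idea (zeroing contiguous square blocks of feature-map activations during training) is shown below. The parameters drop_prob and block_size and the expectation-preserving rescaling are illustrative assumptions; the text only specifies that blocks of feature points are randomly dropped.

```python
import torch
import torch.nn.functional as F

def dropblock(x: torch.Tensor, drop_prob: float = 0.1, block_size: int = 5) -> torch.Tensor:
    """Randomly zero out block_size x block_size regions of x (N,C,H,W);
    apply only during training. block_size is assumed odd so padding
    preserves the spatial size."""
    if drop_prob == 0.0:
        return x
    gamma = drop_prob / (block_size ** 2)          # per-pixel seed probability
    seeds = (torch.rand_like(x) < gamma).float()   # block centers
    # grow each seed into a block via stride-1 max pooling
    mask = 1.0 - F.max_pool2d(seeds, block_size, stride=1, padding=block_size // 2)
    # rescale so the expected activation magnitude is preserved
    return x * mask * mask.numel() / mask.sum().clamp(min=1.0)
```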

Sixth, in one embodiment of the present invention, the bottleneck network of the model uses the SPP structure, which simultaneously applies to the input feature map a first 1×1 max pooling layer, a first 5×5 max pooling layer, and a first 9×9 max pooling layer, and concatenates (Concat) the feature maps pooled with the three different max-pooling kernels with the original feature map, so as to obtain feature-map receptive fields over different ranges. The feature fusion part uses the FPN+PAN structure: the FPN upsamples the feature map twice to obtain feature maps with richer semantic information, and the PAN downsamples the feature map twice to obtain feature maps with richer positional information; the FPN+PAN feature fusion structure thus yields feature maps that are rich in both positional and semantic information.
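
For illustration, a minimal PyTorch sketch of such an SPP block follows: stride-1 max pooling with 1×1, 5×5, and 9×9 kernels, padded so the spatial size is preserved, followed by a Concat with the original map. The class name SPP is an assumption, and the CBL module that follows the Concat in the full model is omitted here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: max-pool the input with several kernel sizes
    (stride 1, size-preserving padding) and concatenate with the input."""
    def __init__(self, kernel_sizes=(1, 5, 9)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # channel count grows 4x (input + three pooled maps); spatial size unchanged
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```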

Seventh, in one embodiment of the present invention, the output stage includes the CIOU_Loss loss function, which can be expressed as:

$$CIOU\_Loss = 1 - IOU + \frac{\rho^2(b_p, b_{gt})}{c^2} + \alpha v, \qquad \alpha = \frac{v}{(1 - IOU) + v}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_p}{h_p}\right)^2$$

where IOU denotes the intersection-over-union of the ground-truth box and the predicted box, a standard measure of target-detection accuracy; b_p denotes the center point of the predicted box and b_gt the center point of the ground-truth box; ρ²(b_p, b_gt) denotes the squared Euclidean distance between the center points of the predicted and ground-truth boxes; c denotes the diagonal length of the smallest rectangle enclosing both boxes; v captures the influence of the two boxes' aspect ratios on the loss; and w_gt, h_gt, w_p, h_p denote the width and height of the ground-truth and predicted boxes respectively. Finally, the CIOU-improved NMS is selected to filter the many candidate boxes; since CIOU accounts for the effect of the candidate boxes' aspect ratios on the result, the filtering results are more accurate.
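
A hedged PyTorch sketch of CIOU_Loss as defined by these formulas is given below. The (cx, cy, w, h) box format is an assumption, and the trade-off weight α = v / ((1 - IOU) + v) is taken from the standard CIOU definition.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIOU_Loss = 1 - IOU + rho^2(b_p, b_gt)/c^2 + alpha*v for boxes given as
    (cx, cy, w, h) tensors of shape (..., 4)."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = target.unbind(-1)
    # intersection / union
    iw = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(min=0)
    ih = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(min=0)
    inter = iw * ih
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union
    # squared center distance over squared diagonal of the smallest enclosing box
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ch = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (torch.atan(gw / (gh + eps)) - torch.atan(pw / (ph + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```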

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic flowchart of a YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the model structure of the YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism according to an embodiment of the present invention, together with an explanation of the specific structure of each module in the model;

Fig. 3 is a schematic diagram of the reparameterization of the residual network structure in a YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism provided by an embodiment of the present invention.

Detailed Description of the Embodiments

Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention; they should not be construed as limiting it.

The following describes, with reference to the accompanying drawings, the YOLOV4 image recognition algorithm based on the reparameterized residual structure and coordinate attention mechanism proposed according to embodiments of the present invention.

Fig. 1 is a flowchart of the YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism according to one embodiment of the present invention.

As shown in Fig. 1, the YOLOV4 image recognition algorithm based on the reparameterized residual structure and coordinate attention mechanism includes the following steps:

In step S101, the image set to be recognized is input.

An image set with relatively complex backgrounds and many targets to be detected is selected; this image set is used to evaluate the target recognition performance of the model.

In step S102, data augmentation is performed on the training image set, and the K-means++ clustering algorithm is used to compute the initial anchor boxes of the training image set.

It can be understood that the training image set is processed by cropping, scaling, saturation transformation, and similar operations, and the Mosaic data augmentation operation is applied. Its principle is as follows: one batch is taken from the training set, four images are randomly drawn from it, cropped, and stitched into one new image; this operation is repeated batch-size times, and finally the batch-size Mosaic-processed images are fed into the model. This operation effectively speeds up network training. The K-means++ clustering algorithm is used to obtain the initial anchor boxes of the training image set: given the input image set X = {x1, x2, ..., xn} and the number of clusters k, one sample point is randomly selected from the image set as the initial cluster center c1; for each sample point xi in the image set, the shortest distance D(x) to the chosen centers is computed; points with larger D(x) are more likely to be selected as the next cluster center, and repeating these steps yields k cluster centers.
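
For illustration, a minimal NumPy sketch of the Mosaic stitching step follows. The helper name mosaic, the random split point, and the assumption that each source image is at least as large as its target region are illustrative; bounding-box handling is omitted for brevity.

```python
import numpy as np

def mosaic(batch: list, out_size: int = 608, rng=None) -> np.ndarray:
    """Stitch four images randomly drawn from `batch` (list of HxWx3 uint8
    arrays, len(batch) >= 4) into one out_size x out_size mosaic image."""
    rng = np.random.default_rng() if rng is None else rng
    cx = rng.integers(out_size // 4, 3 * out_size // 4)  # random split point
    cy = rng.integers(out_size // 4, 3 * out_size // 4)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for (y0, y1, x0, x1), idx in zip(regions, rng.choice(len(batch), 4, replace=False)):
        img = batch[idx]
        h, w = y1 - y0, x1 - x0
        # crop a random h x w patch from the source (assumes img >= region size)
        ys = rng.integers(0, img.shape[0] - h + 1)
        xs = rng.integers(0, img.shape[1] - w + 1)
        canvas[y0:y1, x0:x1] = img[ys:ys + h, xs:xs + w]
    return canvas
```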

In step S103, the YOLOV4 model with the reparameterized residual structure and coordinate attention mechanism shown in Fig. 2 is obtained. Based on YOLOV4, the residual network structure in the model is improved into a reparameterized residual network structure, and a coordinate attention mechanism is added to the model to aggregate features along the two spatial directions X and Y.

It can be understood that, as shown in Fig. 2, the input image size is 608×608×3. The model comprises a first CBM module, a first coordinate attention mechanism, a first Rer1 module, a first Rer2 module, a first Rer8 module, a second Rer8 module, a first CBL*4 module, a first CBL*3 module, a first SPP module, a second CBL*3 module, and a first CBL module. The resulting feature map is up-convolved and concatenated (Concat) with the feature map produced by the second Rer8 module after the latter passes through a second CBL module, followed by a first CBL*5 module; a third CBL module is up-convolved and concatenated with the feature map produced by the first Rer8 module after the latter passes through a fourth CBL module, followed by a second CBL*5 module, a fifth CBL module, and a first Conv module, yielding the first prediction feature map of size 76×76×255. The feature map processed by the second CBL*5 module passes through a sixth CBL module and is concatenated with the feature map produced by the first CBL*5 module, followed by a third CBL*5 module, a seventh CBL module, and a second Conv module, yielding the second prediction feature map of size 38×38×255. The feature map processed by the third CBL*5 module passes through an eighth CBL module and is concatenated with the feature map produced by the second CBL*3 module, followed by a fourth CBL*5 module, an eighth CBL module, and a third Conv module, yielding the third prediction feature map of size 19×19×255.

The CBM module comprises a first convolutional layer, a first BN layer, and a first Mish activation function. The CBL module comprises a first convolutional layer, a first BN layer, and a first Leaky ReLU activation function. The SPP module simultaneously applies to the input feature map a first Max pool module with a 1×1 max-pooling kernel, a first Max pool module with a 5×5 max-pooling kernel, and a first Max pool module with a 9×9 max-pooling kernel, concatenates (Concat) the results with the unprocessed feature map, and is followed by a first CBL module. The coordinate attention mechanism passes the input feature map simultaneously through a first X AVG pool module and a first Y AVG pool module, then applies a Concat operation, a first Conv module, a first BN layer, and a first Leaky ReLU activation function; the output of the previous step is fed in parallel through a second Conv module with a second Leaky ReLU activation function and through a third Conv module with a third Leaky ReLU activation function, and the feature maps processed by the second and third Leaky ReLU activation functions are added (Add) to the unprocessed feature map. A RerX module consists of X of the modules shown in Fig. 2 connected in series. The Rer module used during training adds the output of a first Bottleneck1 to a first 1×1 convolutional layer, passes through a Mish activation function, then adds the output of a first Bottleneck2 to a second 1×1 convolutional layer, and finally passes through a Mish activation function. Bottleneck1 adds a first 3×3 convolutional layer and a first 1×1 convolutional layer. Bottleneck2 adds a first 3×3 convolutional layer, a first 1×1 convolutional layer, and an Identity branch. Before inference, the model reparameterizes the residual structure, converting Bottleneck1 and Bottleneck2 each into a 3×3 convolutional layer and finally converting the Rer module into two 3×3 convolutional layers in series; the resulting single-path structure greatly speeds up inference. The specific fusion, shown in Fig. 3, fuses each convolutional layer with its BN layer; a convolutional layer can be expressed as:

Conv(x) = W(x) + b

where x denotes the input vector, Conv the convolution operation, W the weight vector, and b the bias.

The BN layer can be expressed as:

$$BN(x) = \gamma \cdot \frac{x - mean}{\sqrt{var}} + \beta$$

where x denotes the input vector, BN the batch normalization operation, mean the mean of the input vector, var the variance of the input vector, and β, γ learnable parameters.

Substituting the convolution result into the BN layer gives the fused result, which can be expressed as:

$$BN(Conv(x)) = \gamma \cdot \frac{W(x) + b - mean}{\sqrt{var}} + \beta$$

where x denotes the input vector, Conv the convolution operation, BN the batch normalization operation, β and γ learnable parameters, W the weight vector, b the bias, mean the mean of the input vector, and var the variance of the input vector. As shown at A1 in Fig. 3, Bottleneck1 has two branches, comprising a first 1×1 convolutional layer and a first 3×3 convolutional layer. The first 1×1 convolutional layer is expanded into a 3×3 convolutional layer by placing its value at the center of the 3×3 kernel and zero-filling the remaining positions; the filled result is shown at A2 in Fig. 3. The two 3×3 convolutional layers are then merged into one 3×3 convolutional layer, as shown at A3 in Fig. 3. Bottleneck2 is shown at A4 in Fig. 3; since the Identity branch does not change the input values, it is converted into a 3×3 convolutional layer, and the 1×1 convolutional layer in the residual structure is likewise expanded into a 3×3 convolutional layer with its value at the center and zeros elsewhere. After this process, as shown at A5 in Fig. 3, each of the three branches of the residual structure has become a 3×3 convolutional layer; superimposing the weights and biases of the three branches yields a new 3×3 convolutional layer, as shown at A6 in Fig. 3. The Rer module consists of two Bottleneck1-like structures in series; applying the same fusion process as for Bottleneck1 yields a first 3×3 convolutional layer followed by a first Mish activation function, and a second 3×3 convolutional layer followed by a second Mish activation function. This fusion operation is a reparameterization process that greatly speeds up model inference.
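
As a hedged sketch of this branch-merging step, the snippet below expands a 1×1 kernel into a 3×3 kernel by zero-padding around its center and sums the per-branch weights and biases into one equivalent 3×3 convolution. The function names are illustrative, the branch weights are assumed to have already been BN-fused as above, and the identity branch is assumed to be supplied as a precomputed 3×3 kernel.

```python
import torch
import torch.nn.functional as F

def pad_1x1_to_3x3(w1x1: torch.Tensor) -> torch.Tensor:
    """Place an (out, in, 1, 1) kernel at the center of an (out, in, 3, 3)
    kernel, zero-filling the remaining positions."""
    return F.pad(w1x1, [1, 1, 1, 1])  # pad the last two dims by 1 on each side

def merge_branches(w3x3, b3x3, w1x1, b1x1, identity_w=None, identity_b=None):
    """Sum the (already BN-fused) weights/biases of the 3x3, 1x1 and optional
    identity branches into one equivalent 3x3 convolution."""
    w = w3x3 + pad_1x1_to_3x3(w1x1)
    b = b3x3 + b1x1
    if identity_w is not None:  # identity expressed as a 3x3 kernel
        w = w + identity_w
        b = b + identity_b
    return w, b
```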

Fig. 2 shows the structure of the YOLOV4 model based on the reparameterized residual structure and coordinate attention mechanism; the coordinate attention mechanism is added after the first CBL. The coordinate attention mechanism comprises average pooling along the two spatial directions X and Y, which can be expressed as:

$$z_d^h(i) = \frac{1}{W}\sum_{0 \le j < W} x_d(i, j), \qquad z_d^w(j) = \frac{1}{H}\sum_{0 \le i < H} x_d(i, j)$$

where x is the given input, d indexes the channels; average pooling kernels of size (H, 1) and (1, W) encode each channel along the horizontal and vertical directions respectively; i indexes feature points along the height, j indexes feature points along the width, and z is the output after average pooling along the X and Y directions. The coordinate attention mechanism lets the model better recognize targets against complex backgrounds, aggregating the positional and spatial information along the two spatial directions of the input feature map to obtain a more robust model. The bottleneck network in Fig. 2 is the FPN+PAN feature fusion structure; the SPP module enlarges the receptive field of the network, features of different scales are fused together, and three prediction maps of different scales are finally output. As shown in Fig. 2, the output stage of the network performs feature-map prediction through one CBL and one Conv. The output stage includes the CIOU_Loss loss function, which can be expressed as:

$$CIOU\_Loss = 1 - IOU + \frac{\rho^2(b_p, b_{gt})}{c^2} + \alpha v, \qquad \alpha = \frac{v}{(1 - IOU) + v}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_p}{h_p}\right)^2$$

where IOU denotes the intersection-over-union of the ground-truth box and the predicted box, a standard measure of target-detection accuracy; b_p denotes the center point of the predicted box and b_gt the center point of the ground-truth box; ρ²(b_p, b_gt) denotes the squared Euclidean distance between the center points of the predicted and ground-truth boxes; c denotes the diagonal length of the smallest rectangle enclosing both boxes; v captures the influence of the two boxes' aspect ratios on the loss; and w_gt, h_gt, w_p, h_p denote the width and height of the ground-truth and predicted boxes respectively. Finally, the CIOU-improved NMS is selected to filter the many candidate boxes; since CIOU accounts for the effect of the candidate boxes' aspect ratios on the result, the filtering results are more accurate. The specific procedure is as follows:

In the first step, the candidate box with the highest confidence among the many candidate boxes is taken as the sample, and the CIOU between each of the other candidate boxes and the sample is computed, which can be expressed as:

$$CIOU = IOU - \frac{\rho^2(b_p, b_{gt})}{c^2} - \alpha v$$

where IOU denotes the intersection-over-union of the two boxes, a standard measure of target-detection accuracy; b_p and b_gt denote the center points of the candidate box and the sample box; ρ²(b_p, b_gt) denotes the squared Euclidean distance between the center points; c denotes the diagonal length of the smallest rectangle enclosing both boxes; and v captures the influence of the two boxes' aspect ratios.

In the second step, when the computed CIOU value is greater than the set threshold, the candidate box is removed.

The above steps are repeated to filter the large number of duplicate predicted boxes and obtain accurate prediction results. CIOU takes into account the overlap area between boxes, the distance between their center points, and the boxes' aspect ratios, so more accurate prediction results can be obtained.
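
A simplified sketch of this CIOU-based NMS loop follows; representing detections as (cx, cy, w, h) rows with a separate score array, and passing the pairwise CIOU metric in as ciou_fn, are illustrative assumptions.

```python
import numpy as np

def ciou_nms(boxes: np.ndarray, scores: np.ndarray, ciou_fn, thresh: float = 0.5):
    """Greedy NMS using a CIOU metric: boxes is (n, 4) as (cx, cy, w, h),
    scores is (n,); ciou_fn(a, b) returns the CIOU between box a and each
    row of b. Returns the indices of the kept boxes."""
    order = scores.argsort()[::-1]  # highest-confidence box first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        # remove candidates whose CIOU with the kept box exceeds the threshold
        overlaps = ciou_fn(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps <= thresh]
    return keep
```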

In step S104, model training is performed with the YOLOV4 model based on the reparameterized residual structure and coordinate attention mechanism, the training image set, and the validation image set, so as to generate the recognition model.

It can be understood that the LabelImg tool is first used to annotate the image set, and the initial anchor boxes of the training image set are computed with the K-means++ clustering algorithm. The Adam optimizer is used during training; it jointly considers the first-moment and second-moment estimates of the gradient. The specific steps of the Adam optimizer are:

In the first step, set the learning rate lr and the smoothing constants β1 and β2 (used to smooth m and v respectively), and initialize the learnable parameters to θ0, with m0 = 0, v0 = 0, and t = 0;

In the second step, while training has not stopped, update the step count as t = t + 1;

In the third step, compute the gradient g_t;

In the fourth step, the accumulated moments can be expressed as:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

In the fifth step, the bias-corrected m can be expressed as:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

In the sixth step, the bias-corrected v can be expressed as:

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

In the seventh step, the parameter update can be expressed as:

$$\theta_t = \theta_{t-1} - lr \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where ε is a small constant that prevents the denominator from being zero.
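
These steps correspond to the standard Adam update; a minimal NumPy sketch is given below for concreteness. The variable names follow the text, and the default values of lr, β1, β2, and ε are the usual illustrative choices.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: accumulate first/second moments, bias-correct, update."""
    m = beta1 * m + (1 - beta1) * grad        # m_t
    v = beta2 * v + (1 - beta2) * grad ** 2   # v_t
    m_hat = m / (1 - beta1 ** t)              # bias-corrected m
    v_hat = v / (1 - beta2 ** t)              # bias-corrected v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```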

To avoid overfitting, the DropBlock regularization method and class-label smoothing are adopted. The DropBlock regularization method randomly selects contiguous blocks of feature points in the feature map and discards them. Class-label smoothing adjusts the upper bound of the model's prediction target to a value smaller than 1.0, which to some extent reduces the model's memorization of the prediction results so that the model does not become overconfident.
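
A minimal NumPy sketch of class-label smoothing as described (capping the target below 1.0) follows; the smoothing factor value is an illustrative assumption.

```python
import numpy as np

def smooth_labels(onehot: np.ndarray, factor: float = 0.1) -> np.ndarray:
    """Soften one-hot targets: the true class becomes 1 - factor (plus its
    share of factor/n_classes) and the remaining mass is spread over the
    other classes."""
    n_classes = onehot.shape[-1]
    return onehot * (1.0 - factor) + factor / n_classes
```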

In step S105, the image set to be recognized is input into the YOLOV4 recognition model with the reparameterized residual structure and coordinate attention mechanism to obtain the image recognition results.

It can be understood that a weight file is generated after training; the generated weight file is loaded, and testing is performed with the YOLOV4 model based on the reparameterized residual structure and coordinate attention mechanism. The trained model can recognize targets against complex backgrounds quickly and accurately.

It should be noted that the foregoing explanation of the embodiment of the YOLOV4 image recognition algorithm based on the reparameterized residual structure and coordinate attention mechanism also applies to the YOLOV4 image recognition system based on the reparameterized residual structure and coordinate attention mechanism of this embodiment, and is not repeated here.

The YOLOV4 image recognition algorithm and system with the reparameterized residual structure and coordinate attention mechanism according to embodiments of the present invention can perform end-to-end recognition tasks, recognizing images fully automatically without being limited by the complexity of the background of the images to be recognized; they have strong applicability, good model performance, and robustness, making target recognition not only fast but also highly accurate.

All steps included in the algorithm of the embodiments of the present invention can be completed by hardware instructed by a program; the program can be stored in a computer-readable storage medium, and when executed, the program performs one of, or a combination of, the steps of the algorithm embodiments.

In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Moreover, the described specific features, structures, materials, or characteristics may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples, and the features of the different embodiments or examples, described in this specification, provided they do not contradict each other.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.

Any process or algorithm description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing the steps of a custom logical function or process; and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered a sequenced list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be anything that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection with one or more wires (an electronic system), a portable computer diskette (a magnetic system), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic system, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or algorithms may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system.

Those of ordinary skill in the art will understand that all or some of the steps carried out by the algorithms of the above embodiments can be completed by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium, and when executed, performs one of, or a combination of, the steps of the algorithm embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. When the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (13)

1.一种基于重参数化残差结构和坐标注意力机制的YOLOV4图像识别算法,其特征在于,包括:1. A YOLOV4 image recognition algorithm based on reparameterized residual structure and coordinate attention mechanism, characterized in that, comprising: 输入待识别图像集;Input the image set to be recognized; 对训练图像集进行数据增强,并利用K-means++聚类算法计算得到图像集初始瞄框;Perform data enhancement on the training image set, and use the K-means++ clustering algorithm to calculate the initial aiming frame of the image set; 获取重参数化残差结构和坐标注意力机制的YOLOV4模型;Obtain the YOLOV4 model of reparameterized residual structure and coordinate attention mechanism; 根据重参数化残差结构和坐标注意力机制的YOLOV4模型和训练图像集、验证图像集进行模型训练,以生成识别模型;According to the YOLOV4 model of the reparameterized residual structure and coordinate attention mechanism, the training image set and the verification image set are used for model training to generate a recognition model; 将待识别图像集输入重参数化残差结构和坐标注意力机制的YOLOV4识别模型,得到图像识别结果。Input the image set to be recognized into the YOLOV4 recognition model with re-parameterized residual structure and coordinate attention mechanism to obtain the image recognition result. 2.根据权利要求1所述的基于重参数化残差结构和坐标注意力机制的YOLOV4图像识别算法,其特征在于,所述YOLOV4模型包括输入端、主干网络、瓶颈网络、输出端四个部分。输入端将训练图像集进行数据增强,并利用K-means++聚类算法对该训练图像集进行初始瞄框的设定。主干网络选用Darknet53网络,其可以提取图像集的特征。瓶颈网络包括特征金字塔网络(Feature Pyramid Networks,FPN)和金字塔自注意力网络(Pyramid AttentionNetwork,PAN),这两种结构可以提取图像集的复杂特征。输出端包括卷积模块,最终预测得到目标的位置和类别。2. the YOLOV4 image recognition algorithm based on reparameterization residual structure and coordinate attention mechanism according to claim 1, is characterized in that, described YOLOV4 model comprises input end, backbone network, bottleneck network, output end four parts . At the input end, data enhancement is performed on the training image set, and the K-means++ clustering algorithm is used to set the initial aiming frame of the training image set. The backbone network uses the Darknet53 network, which can extract the features of the image set. Bottleneck networks include Feature Pyramid Networks (FPN) and Pyramid Attention Networks (PAN), which can extract complex features of image sets. The output end includes a convolution module, and finally predicts the location and category of the target. 3.根据权利要求1所述的基于重参数化残差结构和坐标注意力机制的YOLOV4图像识别算法,其特征在于,所述网络中重参数化的残差结构(Reparametric residual structure,Rer)在模型训练时采用具有分支的残差结构,将具有分支的残差结构进行重参数化后得到单路卷积模块,在推理过程中使用上述单路卷积模块。3. the YOLOV4 image recognition algorithm based on reparametric residual structure and coordinate attention mechanism according to claim 1, is characterized in that, in the network, the reparametric residual structure (Reparametric residual structure, Rer) is in The residual structure with branches is used in model training, and the single-pass convolution module is obtained after reparameterization of the residual structure with branches. The above-mentioned single-pass convolution module is used in the inference process. 
所述训练时运用的Rer包括第一个瓶颈模块1(Bottleneck1)与第一个1×1的卷积层进行Add操作,经过Mish激活函数,并通过第一个瓶颈模块2(Bottleneck2)与第二个1×1的卷积层进行Add操作,最后通过Mish激活函数。其中Add操作的具体原理是,在特征图维度不变的条件下,进行特征图信息的叠加,使描述图像特征的信息增多。Bottleneck1包括第一个3×3的卷积层和第一个1×1的卷积层进行Add操作。Bottleneck2包括第一个3×3的卷积层,第一个1×1的卷积层和原本输入(Identity)进行Add操作。所述模型在推理前对残差结构进行重参数化,将Bottleneck1和Bottleneck2分别转换为3×3的卷积层,最后将Rer转换为两个串联的3×3的卷积层,转换后的单路结构可以大幅加快推理速度。The Rer used in the training includes the first bottleneck module 1 (Bottleneck1) and the first 1×1 convolutional layer to perform the Add operation, through the Mish activation function, and through the first bottleneck module 2 (Bottleneck2) and the second Two 1×1 convolutional layers perform the Add operation, and finally pass the Mish activation function. The specific principle of the Add operation is to superimpose the feature map information under the condition that the dimension of the feature map remains unchanged, so as to increase the information describing the image features. Bottleneck1 includes the first 3×3 convolutional layer and the first 1×1 convolutional layer for the Add operation. Bottleneck2 includes the first 3×3 convolutional layer, the first 1×1 convolutional layer and the original input (Identity) for the Add operation. The model reparameterizes the residual structure before inference, converts Bottleneck1 and Bottleneck2 into 3×3 convolutional layers respectively, and finally converts Rer into two concatenated 3×3 convolutional layers, and the converted The one-way structure can greatly speed up inference. 4.根据权利要求1所述的基于重参数化残差结构和坐标注意力机制的YOLOV4图像识别算法,其特征在于,所述加入的坐标注意力机制可以对X、Y两个空间方向的特征图进行聚合,其包括X、Y两个空间方向上的平均池化,可表示为:4. the YOLOV4 image recognition algorithm based on reparameterization residual structure and coordinate attention mechanism according to claim 1, is characterized in that, the coordinate attention mechanism of described addition can be to the feature of X, Y two spatial directions The graph is aggregated, which includes the average pooling in the two spatial directions of X and Y, which can be expressed as:
$$z_d^h(i) = \frac{1}{W}\sum_{0 \le j < W} x_d(i, j), \qquad z_d^w(j) = \frac{1}{H}\sum_{0 \le i < H} x_d(i, j)$$
where x is the specified input, d indexes the channels, and average pooling kernels of sizes (H, 1) and (1, W) encode each channel along the horizontal and vertical directions respectively; i indexes the feature points along the height, j indexes the feature points along the width, and z denotes the output after average pooling along the X and Y directions. The coordinate attention mechanism can thus extract precise position information along one spatial direction and long-range dependencies along the other spatial direction.
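To make the directional pooling of claim 4 concrete, here is a minimal PyTorch sketch; the module name CoordPool, the (N, C, H, W) tensor layout, and the test shapes are illustrative assumptions rather than part of the claims. It realizes the two pooling directions by averaging over one spatial axis at a time.

```python
import torch
import torch.nn as nn

class CoordPool(nn.Module):
    """Directional average pooling used by the coordinate attention mechanism.

    A sketch: for an (N, C, H, W) feature map it produces one pooled value
    per height position and one per width position, matching z^h and z^w.
    """
    def forward(self, x: torch.Tensor):
        # z^h(i): average over the width axis -> shape (N, C, H, 1)
        z_h = x.mean(dim=3, keepdim=True)
        # z^w(j): average over the height axis -> shape (N, C, 1, W)
        z_w = x.mean(dim=2, keepdim=True)
        return z_h, z_w

# Usage: batch of 2 feature maps with 64 channels on a 16x16 grid
pool = CoordPool()
z_h, z_w = pool(torch.randn(2, 64, 16, 16))
print(z_h.shape, z_w.shape)  # torch.Size([2, 64, 16, 1]) torch.Size([2, 64, 1, 16])
```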
5. The YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism according to claim 2, characterized in that the input stage of the YOLOV4 model includes Mosaic data augmentation: any four images first receive basic processing such as cropping, scaling, and transparency transformation, and the four processed images are then stitched into one new image; this operation not only speeds up model inference but also augments the training image set. The Dropblock regularization method is used to alleviate the overfitting that arises during model training; concretely, contiguous blocks of feature points are randomly dropped from the feature map. 6. The YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism according to claim 2, characterized in that the backbone network of the YOLOV4 model is Darknet53, and the activation functions of the base network are the Mish activation function and the Leaky ReLU activation function. 7. The YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism according to claim 2, characterized in that the bottleneck network of the YOLOV4 model adopts a Spatial Pyramid Pooling layer (SPP), in which the input feature map is passed simultaneously through a first 1×1 max-pooling layer, a first 5×5 max-pooling layer, and a first 9×9 max-pooling layer, and the feature maps produced by the three different pooling kernels are combined with the original feature map by a Concat operation. The principle of the Concat operation is that the number of channels describing the image features increases while the information within each channel does not, so that feature-map receptive fields over different ranges are obtained. The model also adopts the FPN+PAN structure: the FPN upsamples the feature map twice to obtain a feature map with richer semantic information, and the PAN downsamples the feature map twice to obtain a feature map with richer position information; finally, the two feature maps, rich in semantic and position information respectively, are superimposed to obtain an output that fully expresses the feature-map information. 8. The YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism according to claim 2, characterized in that the output stage of the YOLOV4 model includes the complete IOU loss function (Complete-IOU_Loss, CIOU_Loss), which can be expressed as:
$$\mathrm{CIOU\_Loss} = 1 - \mathrm{IOU} + \frac{\rho^2(b_p,\, b_{gt})}{c^2} + \alpha v$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_p}{h_p}\right)^2, \qquad \alpha = \frac{v}{(1 - \mathrm{IOU}) + v}$$
where IOU denotes the intersection-over-union of the ground-truth box and the predicted box, a standard measure of object-detection accuracy; b_p denotes the center point of the predicted box and b_gt the center point of the ground-truth box; ρ²(b_p, b_gt) denotes the squared Euclidean distance between the center points of the predicted and ground-truth boxes; c denotes the diagonal length of the smallest enclosing rectangle of the ground-truth and predicted boxes; v denotes the influence of the aspect ratios of the two boxes on the loss function; w_gt and h_gt denote the width and height of the ground-truth box, and w_p and h_p denote the width and height of the predicted box. Finally, the NMS improved with the complete IOU (Complete-IOU, CIOU) is used to filter the many candidate boxes.
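As a numerical illustration of the loss in claim 8, here is a minimal sketch for axis-aligned boxes given in (x1, y1, x2, y2) format; the function name, the box format, and the small epsilon terms are assumptions made for the example, not part of the claims.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """CIOU loss for (N, 4) boxes in (x1, y1, x2, y2) format; a sketch only."""
    # Intersection area and IOU
    ix1 = torch.max(pred[:, 0], gt[:, 0]); iy1 = torch.max(pred[:, 1], gt[:, 1])
    ix2 = torch.min(pred[:, 2], gt[:, 2]); iy2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    iou = inter / (wp * hp + wg * hg - inter + 1e-9)

    # rho^2: squared distance between the two box centers
    rho2 = ((pred[:, 0] + pred[:, 2] - gt[:, 0] - gt[:, 2]) ** 2 +
            (pred[:, 1] + pred[:, 3] - gt[:, 1] - gt[:, 3]) ** 2) / 4

    # c^2: squared diagonal of the smallest enclosing rectangle
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    c2 = cw ** 2 + ch ** 2 + 1e-9

    # v and alpha: aspect-ratio consistency term and its trade-off weight
    # (training code usually detaches alpha from the gradient graph)
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + 1e-9)) -
                              torch.atan(wp / (hp + 1e-9))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

# Usage: one predicted box against one ground-truth box
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 8.0, 48.0, 62.0]])
print(ciou_loss(pred, gt))
```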
9. The YOLOV4 image recognition algorithm based on a reparameterized residual structure and a coordinate attention mechanism according to claim 3, characterized in that the concrete process of converting the training-time residual structure into the inference-time single-path structure is as follows: the Identity branch is equivalent to a 1×1 convolution layer, and a 1×1 convolution layer is equivalent to a 3×3 convolution layer obtained by placing its value at the center of the kernel and filling the remaining positions with zeros. 10. A YOLOV4 target recognition system based on a reparameterized residual structure and a coordinate attention mechanism, characterized by comprising: an input module, used to input the image set to be recognized, with data in VOC format; a data augmentation and clustering module, used to perform data augmentation on the input training image set and to compute the initial anchor boxes of the training image set with the K-means++ clustering algorithm; an acquisition module, used to obtain the YOLOV4 target recognition model based on the reparameterized residual structure and the coordinate attention mechanism; a training module, used to perform model training with the YOLOV4 target recognition model based on the reparameterized residual structure and the coordinate attention mechanism on the training image set and the validation image set, so as to generate a recognition model; a recognition module, used to obtain the image recognition results from the image set to be recognized through the YOLOV4 target recognition model based on the reparameterized residual structure and the coordinate attention mechanism. 11. The YOLOV4 target recognition system based on a reparameterized residual structure and a coordinate attention mechanism according to claim 10, characterized in that the reparameterizable residual structure in the network uses a branched residual structure during model training; the branched residual structure is reparameterized to obtain a single-path convolution module, and this single-path convolution module is used during inference. 12. The YOLOV4 target recognition system based on a reparameterized residual structure and a coordinate attention mechanism according to claim 10, characterized in that the added coordinate attention mechanism aggregates feature maps along the two spatial directions X and Y, using average pooling in each of the two directions, which can be expressed as:
$$z_d^h(i) = \frac{1}{W}\sum_{0 \le j < W} x_d(i, j), \qquad z_d^w(j) = \frac{1}{H}\sum_{0 \le i < H} x_d(i, j)$$
where x is the specified input, d indexes the channels, and average pooling kernels of sizes (H, 1) and (1, W) encode each channel along the horizontal and vertical directions respectively; i indexes the feature points along the height, j indexes the feature points along the width, and z denotes the output after average pooling along the X and Y directions. The coordinate attention mechanism can extract precise position information along one spatial direction and long-range dependencies along the other spatial direction.
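Claims 9 and 11, like claim 3, describe folding the branched training-time structure into single 3×3 convolutions. The following is a minimal sketch of the two kernel equivalences stated in claim 9; it deliberately ignores the BatchNorm and bias fusion a full reparameterization would also perform, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def conv1x1_to_3x3(k1: torch.Tensor) -> torch.Tensor:
    """Embed a (C_out, C_in, 1, 1) kernel at the center of a 3x3 kernel,
    filling the remaining positions with zeros, as stated in claim 9."""
    return F.pad(k1, [1, 1, 1, 1])

def identity_to_3x3(channels: int) -> torch.Tensor:
    """Express the Identity branch as an equivalent 3x3 kernel
    (each channel copies itself; requires equal in/out channels)."""
    k = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        k[c, c, 1, 1] = 1.0
    return k

# Equivalence check: fused kernel = 3x3 branch + padded 1x1 branch + identity
c = 8
k3 = torch.randn(c, c, 3, 3)
k1 = torch.randn(c, c, 1, 1)
fused = k3 + conv1x1_to_3x3(k1) + identity_to_3x3(c)

x = torch.randn(1, c, 16, 16)
branched = F.conv2d(x, k3, padding=1) + F.conv2d(x, k1) + x
single = F.conv2d(x, fused, padding=1)
print(torch.allclose(branched, single, atol=1e-5))  # True
```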
13. The YOLOV4 target recognition system based on a reparameterized residual structure and a coordinate attention mechanism according to claim 10, characterized in that the output stage includes the CIOU_Loss loss function, which can be expressed as:
$$\mathrm{CIOU\_Loss} = 1 - \mathrm{IOU} + \frac{\rho^2(b_p,\, b_{gt})}{c^2} + \alpha v$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_p}{h_p}\right)^2, \qquad \alpha = \frac{v}{(1 - \mathrm{IOU}) + v}$$
where IOU denotes the intersection-over-union of the ground-truth box and the predicted box, a standard measure of object-detection accuracy; b_p denotes the center point of the predicted box and b_gt the center point of the ground-truth box; ρ²(b_p, b_gt) denotes the squared Euclidean distance between the center points of the predicted and ground-truth boxes; c denotes the diagonal length of the smallest enclosing rectangle of the ground-truth and predicted boxes; v can represent the influence of the aspect ratios of the two boxes on the loss function; w_gt and h_gt denote the width and height of the ground-truth box, and w_p and h_p denote the width and height of the predicted box. Finally, the NMS improved with CIOU is used to filter the many candidate boxes.
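Claims 1 and 10 both compute initial anchor boxes with the K-means++ clustering algorithm. Below is a minimal sketch of that step using scikit-learn's K-means++ initialization on raw (width, height) pairs; the claims do not specify a distance metric, and YOLO pipelines often substitute 1 − IoU for the Euclidean distance used here, so treat this as an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_anchors(box_wh: np.ndarray, n_anchors: int = 9) -> np.ndarray:
    """Cluster (width, height) pairs of training boxes into initial anchors.

    A sketch of the K-means++ anchor step of claims 1 and 10; it uses the
    Euclidean distance on raw (w, h) pairs, which is an assumption here.
    """
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10,
                random_state=0)
    km.fit(box_wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort by area, ascending

# Usage: 500 hypothetical box sizes drawn at random
rng = np.random.default_rng(0)
wh = rng.uniform(10, 300, size=(500, 2))
print(initial_anchors(wh))
```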
CN202111426910.2A 2021-11-28 2021-11-28 YOLOV4 image recognition algorithm and system based on reparameterized residual structure and coordinate attention mechanism Pending CN116266387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111426910.2A CN116266387A (en) 2021-11-28 2021-11-28 YOLOV4 image recognition algorithm and system based on reparameterized residual structure and coordinate attention mechanism


Publications (1)

Publication Number Publication Date
CN116266387A true CN116266387A (en) 2023-06-20

Family

ID=86742896


Country Status (1)

Country Link
CN (1) CN116266387A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152625A (en) * 2023-08-07 2023-12-01 西安电子科技大学 A method, system, equipment and medium for remote sensing small target recognition based on CoordConv and YOLOv5
CN116824307A (en) * 2023-08-29 2023-09-29 深圳市万物云科技有限公司 Image labeling method and device based on SAM model and related medium
CN116824307B (en) * 2023-08-29 2024-01-02 深圳市万物云科技有限公司 Image labeling method and device based on SAM model and related medium
CN117237369A (en) * 2023-11-16 2023-12-15 苏州视智冶科技有限公司 Blast furnace iron notch opening depth measurement method based on computer vision
CN117237369B (en) * 2023-11-16 2024-02-27 苏州视智冶科技有限公司 Blast furnace iron notch opening depth measurement method based on computer vision
CN117557912A (en) * 2023-12-18 2024-02-13 大连海事大学 Iceberg scene identification method based on improved YoloV7 model


Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20230620