CN109685145B - Small object detection method based on deep learning and image processing
- Publication number: CN109685145B (application CN201811605116.2A)
- Authority: CN (China)
- Prior art keywords: max, size, feature map, box, picture
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/254: Pattern recognition - Analysing - Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/21: Pattern recognition - Analysing - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06N3/045: Computing arrangements based on biological models - Neural networks - Architecture, e.g. interconnection topology - Combinations of networks
Description
Technical Field
The present invention relates to the field of image processing, and more particularly to a small object detection method based on deep learning and image processing.
Background Art
At present, the commonly used algorithm for object detection is SSD (Single Shot MultiBox Detection). SSD is an end-to-end detection framework based on deep learning whose architecture consists of two parts: the front end is a convolutional neural network (VGG16) used to extract features from the target, and the back end is a multi-scale feature detection network that extracts features at different scales from the feature layers produced by the front-end network. The Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers are then convolved to obtain coordinate positions and confidence scores, and the final result is obtained by non-maximum suppression (NMS).
However, because SSD uses multi-scale detection, which reduces computation and yields a high FPS, and because detection is performed on feature maps of different scales, the convolutional receptive fields of those feature maps differ. In the high-level convolutional layers in particular, the receptive field is very large and the extracted features are highly abstract, so SSD is insensitive to small objects and fine details.
Summary of the Invention
To overcome the insensitivity of the prior-art SSD detection algorithm to small objects, the present invention provides a small object detection method based on deep learning and image processing.
To achieve the above object of the invention, the technical solution adopted is:
A small object detection method based on deep learning and image processing, comprising the following steps:
Step S1: Obtain a data set of original pictures annotated with object category information and the coordinates of the upper-left (x_min, y_min) and lower-right (x_max, y_max) corners of the target box. Arbitrarily select a labeled picture from the training set and resize it to 300x300 as the input;
Step S2: Split the picture along the horizontal line from (0, 150) to (300, 150) and the vertical line from (150, 0) to (150, 300) into four 150x150 parts P1, P2, P3 and P4; in addition, take the image whose four vertices are (75, 75), (225, 75), (75, 225) and (225, 225) as the fifth part P5;
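The split in step S2 amounts to plain array slicing. The sketch below is a minimal NumPy illustration; the function name, the (row, column) indexing convention and the assignment of P1-P4 to particular quadrants are assumptions for demonstration, not part of the patent.

```python
import numpy as np

def split_five(img: np.ndarray):
    """Split a 300x300 image into four 150x150 quadrants and the central crop P5."""
    assert img.shape[0] == 300 and img.shape[1] == 300
    p1 = img[0:150, 0:150]       # upper-left quadrant (assumed ordering)
    p2 = img[0:150, 150:300]     # upper-right quadrant
    p3 = img[150:300, 0:150]     # lower-left quadrant
    p4 = img[150:300, 150:300]   # lower-right quadrant
    p5 = img[75:225, 75:225]     # central part with vertices (75,75) ... (225,225)
    return p1, p2, p3, p4, p5
```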
Step S3: From the upper-left and lower-right coordinates (x_min, y_min), (x_max, y_max) of the target box carried by each input picture, determine whether the object in the picture has been divided, and modify the coordinates according to how the object was divided;
Step S4: Interpolate the pictures by cubic interpolation so that the five divided 150x150 parts P1, P2, P3, P4 and P5 become the same 300x300 size as the original picture, and name them F1, F2, F3, F4 and F5; at the same time, multiply the modified coordinates obtained in step S3 by 2 and update them;
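Step S4's cubic upscaling and the doubling of the coordinates can be sketched as follows, assuming OpenCV is used for the interpolation; the patent does not name a particular library, so the call and the tuple layout of the box are illustrative assumptions.

```python
import cv2

def upscale_part(part, box):
    """Resize a 150x150 part to 300x300 with cubic interpolation and scale its box by 2."""
    f = cv2.resize(part, (300, 300), interpolation=cv2.INTER_CUBIC)
    xmin, ymin, xmax, ymax = box
    return f, (2 * xmin, 2 * ymin, 2 * xmax, 2 * ymax)
```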
Step S5: For each of the five pictures F1, F2, F3, F4 and F5, extract features with the VGG16 network, then convolve with a 3x3x1024 convolution kernel to obtain the Conv6 feature map of size 19x19x1024, and continue convolving with a 1x1x1024 convolution kernel to obtain the Conv7 feature map of size 19x19x1024;
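A possible realization of the step-S5 backbone is sketched below in PyTorch. The truncation of VGG16 after conv5_3 and the padding of the two extra convolutions are assumptions; the patent only states that VGG16 extracts the features and that 3x3x1024 and 1x1x1024 convolutions produce Conv6 and Conv7.

```python
import torch.nn as nn
from torchvision.models import vgg16

backbone = nn.Sequential(*list(vgg16().features.children())[:30])  # up to conv5_3 + ReLU (assumed)
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=1)  # 3x3x1024 convolution -> Conv6
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)            # 1x1x1024 convolution -> Conv7
```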
Step S6: Stack 1x1, 3x3 and 3x3 convolution kernels together to form three branches, add BN (Batch Normalization) at the end of each branch for batch normalization, connect and fuse the branches while introducing a residual network structure, and name this structure the IRBNet convolution structure;
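One way to read the step-S6 description is the PyTorch block below: a 1x1 branch, a 3x3 branch, and a branch of two stacked 3x3 convolutions, each ending in BatchNorm, concatenated and summed with a projected shortcut. The channel split, the internal ReLU and the 1x1 shortcut projection are assumptions; the patent only fixes the kernel sizes, the per-branch BN and the residual fusion.

```python
import torch
import torch.nn as nn

class IRBNetBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        b = out_ch // 3  # assumed even split of the output channels over the three branches
        self.branch1 = nn.Sequential(                     # 1x1 branch
            nn.Conv2d(in_ch, b, 1, stride=stride), nn.BatchNorm2d(b))
        self.branch2 = nn.Sequential(                     # single 3x3 branch
            nn.Conv2d(in_ch, b, 3, stride=stride, padding=1), nn.BatchNorm2d(b))
        self.branch3 = nn.Sequential(                     # two stacked 3x3 convs (5x5 receptive field)
            nn.Conv2d(in_ch, b, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(b, b, 3, stride=stride, padding=1), nn.BatchNorm2d(b))
        self.shortcut = nn.Conv2d(in_ch, 3 * b, 1, stride=stride)  # residual projection
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.relu(out + self.shortcut(x))          # fuse branches with the residual path
```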
Step S7: Pass the Conv7 feature map of size 19x19x1024 obtained in step S5 through the IRBNet convolution structure to extract features, obtaining the feature map Conv8 of size 10x10x512; Conv8 is convolved by IRBNet to obtain the feature map Conv9 of size 5x5x256; Conv9 is convolved by IRBNet to obtain the feature map Conv10 of size 3x3x256; Conv10 is convolved by IRBNet to obtain the feature map Conv11 of size 1x1x256;
Step S8: Deconvolve the higher-level feature map with a 3x3 convolution kernel and a stride of 4 so that it is enlarged twofold to the same size as the adjacent lower layer, then add the pixels at corresponding positions one by one; the resulting new feature map has the same size as the lower-level feature map, and this structure is named HDPANet;
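The upsample-and-add fusion of step S8 can be sketched as below. The patent specifies a 3x3 deconvolution kernel; the stride used here and the explicit interpolation to the lower layer's exact spatial size are implementation assumptions made only so that the element-wise addition is well defined.

```python
import torch.nn as nn
import torch.nn.functional as F

class HDPAFuse(nn.Module):
    """Deconvolve a high-level map, match it to the lower map's size, add pixel-wise."""
    def __init__(self, high_ch, low_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(high_ch, low_ch, kernel_size=3, stride=2)

    def forward(self, high, low):
        up = self.deconv(high)                                       # roughly double the spatial size
        up = F.interpolate(up, size=low.shape[-2:], mode='nearest')  # match the lower layer exactly (assumed)
        return low + up                                              # pixel-wise addition
```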
Step S9: Feature map Conv8 is passed through step S8 to obtain another feature map of size 19x19x1024, which is added to Conv7 to obtain feature map Conv7D; feature map Conv9 is passed through step S8 to obtain another feature map of size 10x10x512, which is added to Conv8 to obtain feature map Conv8D; feature map Conv10 is passed through step S8 to obtain another feature map of size 5x5x256, which is added to Conv9 to obtain feature map Conv9D; feature map Conv11 is passed through step S8 to obtain another feature map of size 3x3x256, which is added to Conv10 to obtain feature map Conv10D;
Step S10: Convolve the Conv4_3, Conv10D and Conv11 feature layers with a 3x3 convolution kernel to obtain feature maps with 4x(class+4) channels, and convolve the Conv7D, Conv8D and Conv9D feature layers with a 3x3 convolution kernel to obtain feature maps with 6x(class+4) channels;
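The step-S10 prediction heads follow the usual SSD convention of one 3x3 convolution per detection layer; a hedged sketch is given below, where the channel counts come from the text and the class count of 21 is only an illustrative placeholder.

```python
import torch.nn as nn

def make_head(in_ch, num_boxes, num_classes):
    """3x3 convolution producing num_boxes * (num_classes + 4) output channels."""
    return nn.Conv2d(in_ch, num_boxes * (num_classes + 4), kernel_size=3, padding=1)

head_conv7d = make_head(1024, 6, 21)   # 6x(class+4) channels on Conv7D
head_conv11 = make_head(256, 4, 21)    # 4x(class+4) channels on Conv11
```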
Step S11: F1, F2, F3, F4 and F5 each obtain their corresponding loss function loss through steps S1 to S10; during backpropagation, the sum total_loss of the five losses is optimized by the stochastic gradient descent algorithm, and the number of training iterations (epochs) is set; the network parameters obtained when total_loss is stable are the optimal solution;
Step S12: Select pictures without label information from the data set, perform steps S1 and S2 to split each picture, feed the split pictures into the network trained in steps S1 to S10, filter by non-maximum suppression, and finally obtain, for the five pictures F1, F2, F3, F4 and F5, the predicted category label and the predicted coordinates (x_pred_min, y_pred_min), (x_pred_max, y_pred_max);
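The non-maximum-suppression filtering of step S12 could be realized with torchvision's NMS operator as sketched below; the score and IoU thresholds are illustrative values, not taken from the patent.

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes: torch.Tensor, scores: torch.Tensor,
                      score_thr: float = 0.5, iou_thr: float = 0.45):
    """boxes: Nx4 tensor of (xmin, ymin, xmax, ymax); scores: N confidences for one class."""
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thr)      # indices kept after non-maximum suppression
    return boxes[idx], scores[idx]
```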
Step S13: Fuse the pictures according to the predicted category labels and predicted coordinates of the five pictures F1, F2, F3, F4 and F5; the final result is the final detection result.
Preferably, the specific steps of modifying the coordinates in step S3 are as follows (an illustrative code sketch follows the list):
1) If x_min < 150, x_max > 150 and y_min, y_max < 150, or x_min < 150, x_max > 150 and y_min, y_max > 150, the object in the image is divided along the vertical direction into left and right parts; let the new coordinates be (x_min, y_min), (150, y_max) and (150, y_min), (x_max, y_max), with the category information unchanged;
2) If x_min, x_max < 150, y_min < 150, y_max > 150, or x_min, x_max > 150, y_min < 150, y_max > 150, the object in the image is divided along the horizontal direction into upper and lower parts; let the new coordinates be (x_min, y_min), (x_max, 150) and (x_min, 150), (x_max, y_max), with the category information unchanged;
3) If x_min < 150, y_min < 150, x_max > 150, y_max > 150, the object in the image is cut both horizontally and vertically into four parts; let the new coordinates be (x_min, y_min), (150, 150); (150, y_min), (x_max, 150); (x_min, 150), (150, y_max); and (150, 150), (x_max, y_max), with the category information unchanged.
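A compact sketch of these three rules, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples on the 300x300 input with the cut lines at 150; the function name is an assumption for illustration.

```python
def split_box(xmin, ymin, xmax, ymax, cut=150):
    """Return the list of sub-boxes a ground-truth box is divided into by the cut lines."""
    crosses_x = xmin < cut < xmax
    crosses_y = ymin < cut < ymax
    if crosses_x and crosses_y:          # rule 3): cut into four parts
        return [(xmin, ymin, cut, cut), (cut, ymin, xmax, cut),
                (xmin, cut, cut, ymax), (cut, cut, xmax, ymax)]
    if crosses_x:                        # rule 1): split into left and right parts
        return [(xmin, ymin, cut, ymax), (cut, ymin, xmax, ymax)]
    if crosses_y:                        # rule 2): split into upper and lower parts
        return [(xmin, ymin, xmax, cut), (xmin, cut, xmax, ymax)]
    return [(xmin, ymin, xmax, ymax)]    # the object is not divided
```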
Preferably, the specific steps of obtaining the loss function loss and total_loss in step S11 are as follows:
The loss is divided into two parts, the confidence loss and the location loss:
L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))
where L(x, c, l, g) denotes the Loss, L_conf denotes the confidence loss (a softmax loss), L_loc denotes the location loss, and N is the number of prior boxes matched to the ground truth in the confidence loss; the parameter α adjusts the ratio between the confidence loss and the location loss; x_ij^p = 1 indicates that the i-th predicted box is matched to the j-th ground-truth (GT) box of category p; c denotes the confidence, l the predicted box and g the ground-truth box;
L_conf(x, c) = - Σ_{i∈Pos} x_ij^p log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)
where ĉ_i^p is the probability produced by the softmax, Pos denotes the positive samples, Neg the negative samples, and N is the number of prior boxes matched to the ground truth in the confidence loss; the positive term is counted when x_ij^p = 1; ĉ_i^p is the probability that the i-th predicted box belongs to category p, and p denotes the p-th category;
L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1(l_i^m - ĝ_j^m)
ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w,  ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h,  ĝ_j^w = log(g_j^w / d_i^w),  ĝ_j^h = log(g_j^h / d_i^h)
where cx denotes the x coordinate of the box center, cy the y coordinate of the center, w the width, h the height, i the i-th predicted box, j the j-th ground-truth box, and d_i the offset (default) box; x_ij^k indicates whether the i-th predicted box matches the j-th ground-truth box with respect to category k (1 for a match, 0 otherwise); l_i^m denotes the predicted box and ĝ_j^m the offset-encoded ground-truth box; m takes one of the values (cx, cy, w, h); ĝ_j^cx and ĝ_j^cy are the x and y coordinates of the center of the offset-encoded j-th ground-truth box, ĝ_j^w its width and ĝ_j^h its height; l_i^cx and l_i^cy are the x and y center offsets of the i-th predicted box, l_i^w its width offset and l_i^h its height offset; g_j^cx and g_j^cy are the x and y coordinates of the center of the j-th ground-truth box, g_j^w its width and g_j^h its height;
The five loss functions obtained for F1, F2, F3, F4 and F5 are denoted L_1(x,c,l,g), L_2(x,c,l,g), L_3(x,c,l,g), L_4(x,c,l,g) and L_5(x,c,l,g), and the total loss function is written as:
Total_loss = L_1(x,c,l,g) + L_2(x,c,l,g) + L_3(x,c,l,g) + L_4(x,c,l,g) + L_5(x,c,l,g).
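A minimal training-step sketch for step S11 is given below, assuming a PyTorch model and a criterion that returns L(x, c, l, g) for one sub-image; the model, criterion and optimizer objects are placeholders, not part of the patent.

```python
def train_step(model, criterion, optimizer, parts, targets):
    """parts: the five inputs F1-F5; targets: their adjusted boxes and labels."""
    optimizer.zero_grad()
    total_loss = sum(criterion(model(f), t) for f, t in zip(parts, targets))
    total_loss.backward()     # backpropagate the summed loss Total_loss
    optimizer.step()          # stochastic gradient descent update
    return total_loss.item()
```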
Preferably, the specific steps of fusing the pictures in step S13 are as follows (a simplified coordinate-mapping sketch follows the list):
(1) If the predicted coordinates of each of the pictures F1, F2, F3 and F4 satisfy x_pred_max, y_pred_max < 300 and x_pred_min, y_pred_min > 0, combine F1, F2, F3 and F4 into one picture according to their original positions, then reduce the size of the fused picture by a factor of 4 to restore the 300x300 size of the original picture while reducing the predicted coordinates by a factor of 4; the result is the final detection result;
(2) Detect the categories label1 and label2 of objects on the boundary between the left and right parts. If label1 equals label2, they are the same class; compare the sizes of the coordinate information of the two objects and, taking the larger bounding box as the reference, extend it toward the smaller one by a length of (x_max - x_min), then pad the four sides of the picture, combine F1, F2, F3 and F4 into one picture according to their original positions, reduce the size of the fused whole picture by a factor of 4 to restore the 300x300 size of the original picture while reducing the modified coordinates by a factor of 4; the result is the final detection result;
(3) Detect the categories label1 and label2 of objects on the boundary between the upper and lower parts. If label1 equals label2, they are the same class; compare the sizes of the coordinate information of the two objects and, taking the larger bounding box as the reference, extend it toward the smaller one by a length of (y_max - y_min), then pad the picture; reduce the size of the fused whole picture by a factor of 4 to restore the 300x300 size of the original picture while reducing the modified coordinates by a factor of 4; the result is the final detection result;
(4) If the predicted coordinates of the pictures F1, F2, F3 and F4 satisfy (x_pred_min, y_pred_min) = (300, 300) or (x_pred_max, y_pred_max) = (300, 300), the object was divided across all four parts (upper-left, lower-left, upper-right and lower-right) at the same time; in that case, use the detection result of the middle picture F5 as the detection result for the middle object, reduce the size of the fused whole picture by a factor of 4 to restore the 300x300 size of the original picture while reducing the obtained coordinates by a factor of 4; the result is the final detection result.
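A simplified sketch of the coordinate bookkeeping behind case (1) is given below: a prediction made on an upscaled quadrant is mapped back onto the original 300x300 image by halving (undoing the 2x upscaling of step S4) and shifting by the quadrant offset. The offsets assigned to F1-F4 and the function name are assumptions for illustration.

```python
QUADRANT_OFFSETS = {            # assumed top-left corner of each quadrant in the original image
    'F1': (0, 0), 'F2': (150, 0), 'F3': (0, 150), 'F4': (150, 150),
}

def to_original(part_name, box):
    """Map a box predicted on a 300x300 upscaled quadrant back to original 300x300 coordinates."""
    ox, oy = QUADRANT_OFFSETS[part_name]
    xmin, ymin, xmax, ymax = box
    return (xmin / 2 + ox, ymin / 2 + oy, xmax / 2 + ox, ymax / 2 + oy)
```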
Preferably, α = 1.
Compared with the prior art, the beneficial effects of the present invention are:
By replacing the 5x5 convolution of the original Inception module with two 3x3 convolution kernels, the present invention retains more detail; to speed up training and keep the outputs consistent, BN (Batch Normalization) is added at the end of each branch for batch normalization, and a residual network structure is introduced to increase accuracy. In addition, the present invention uses deconvolution to enhance the context information between adjacent high-level and low-level layers: the result of deconvolving the upper layer is added pixel by pixel, in alignment, to the lower convolutional layer, and the resulting new feature map is used as the feature map for detection, which improves the recognition of small objects. The present invention improves the accuracy of the traditional SSD on small objects without affecting its high FPS.
Brief Description of the Drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of the split points used by the present invention to divide an image.
Fig. 3 is a flow chart of the image segmentation network of the present invention.
Fig. 4 is a structural diagram of the residual network of the present invention.
Fig. 5 is a structural diagram of IRBNet.
Fig. 6 is a flow chart of the high-level deconvolution and pixel-wise addition.
Fig. 7 is a flow chart of obtaining the predicted category label and predicted coordinates.
Detailed Description of the Embodiments
The accompanying drawings are for illustration only and shall not be construed as limiting this patent;
The present invention is further described below with reference to the accompanying drawings and embodiments.
Embodiment 1
As shown in Figures 1 to 7, a small object detection method based on deep learning and image processing comprises the following steps:
Step S1: Obtain a data set of original pictures annotated with object category information and the coordinates of the upper-left (x_min, y_min) and lower-right (x_max, y_max) corners of the target box. Arbitrarily select a labeled picture from the training set and resize it to 300x300 as the input;
Step S2: As shown in Fig. 2, split the picture along the horizontal line from (0, 150) to (300, 150) and the vertical line from (150, 0) to (150, 300) into four 150x150 parts P1, P2, P3 and P4; in addition, take the image whose four vertices are (75, 75), (225, 75), (75, 225) and (225, 225) as the fifth part P5;
Step S3: From the upper-left and lower-right coordinates (x_min, y_min), (x_max, y_max) of the target box carried by each input picture, determine whether the object in the picture has been divided, and modify the coordinates according to how the object was divided;
Step S4: As shown in Fig. 3, interpolate the pictures by cubic interpolation so that the five divided 150x150 parts P1, P2, P3, P4 and P5 become the same 300x300 size as the original picture, and name them F1, F2, F3, F4 and F5; at the same time, multiply the modified coordinates obtained in step S3 by 2 and update them;
Step S5: For each of the five pictures F1, F2, F3, F4 and F5, extract features with the VGG16 network, then convolve with a 3x3x1024 convolution kernel to obtain the Conv6 feature map of size 19x19x1024, and continue convolving with a 1x1x1024 convolution kernel to obtain the Conv7 feature map of size 19x19x1024;
Step S6: As shown in Figs. 4 and 5, stack 1x1, 3x3 and 3x3 convolution kernels together to form three branches, add BN (Batch Normalization) at the end of each branch for batch normalization, connect and fuse the branches while introducing a residual network structure, and name this structure the IRBNet convolution structure;
Step S7: Pass the Conv7 feature map of size 19x19x1024 obtained in step S5 through the IRBNet convolution structure to extract features, obtaining the feature map Conv8 of size 10x10x512; Conv8 is convolved by IRBNet to obtain the feature map Conv9 of size 5x5x256; Conv9 is convolved by IRBNet to obtain the feature map Conv10 of size 3x3x256; Conv10 is convolved by IRBNet to obtain the feature map Conv11 of size 1x1x256;
Step S8: As shown in Fig. 6, deconvolve the higher-level feature map with a 3x3 convolution kernel and a stride of 4 so that it is enlarged twofold to the same size as the adjacent lower layer, then add the pixels at corresponding positions one by one; the resulting new feature map has the same size as the lower-level feature map, and this structure is named HDPANet;
Step S9: Feature map Conv8 is passed through step S8 to obtain another feature map of size 19x19x1024, which is added to Conv7 to obtain feature map Conv7D; feature map Conv9 is passed through step S8 to obtain another feature map of size 10x10x512, which is added to Conv8 to obtain feature map Conv8D; feature map Conv10 is passed through step S8 to obtain another feature map of size 5x5x256, which is added to Conv9 to obtain feature map Conv9D; feature map Conv11 is passed through step S8 to obtain another feature map of size 3x3x256, which is added to Conv10 to obtain feature map Conv10D;
Step S10: Convolve the Conv4_3, Conv10D and Conv11 feature layers with a 3x3 convolution kernel to obtain feature maps with 4x(class+4) channels, and convolve the Conv7D, Conv8D and Conv9D feature layers with a 3x3 convolution kernel to obtain feature maps with 6x(class+4) channels;
Step S11: F1, F2, F3, F4 and F5 each obtain their corresponding loss function loss through steps S1 to S10; during backpropagation, the sum total_loss of the five losses is optimized by the stochastic gradient descent algorithm, and the number of training iterations (epochs) is set; the network parameters obtained when total_loss is stable are the optimal solution;
Step S12: As shown in Fig. 7, select pictures without label information from the data set, perform steps S1 and S2 to split each picture, feed the split pictures into the network trained in steps S1 to S10, filter by non-maximum suppression, and finally obtain, for the five pictures F1, F2, F3, F4 and F5, the predicted category label and the predicted coordinates (x_pred_min, y_pred_min), (x_pred_max, y_pred_max);
Step S13: Fuse the pictures according to the predicted category labels and predicted coordinates of the five pictures F1, F2, F3, F4 and F5; the final result is the final detection result.
As a preferred embodiment, the specific steps of modifying the coordinates in step S3 are as follows:
1) If x_min < 150, x_max > 150 and y_min, y_max < 150, or x_min < 150, x_max > 150 and y_min, y_max > 150, the object in the image is divided along the vertical direction into left and right parts; let the new coordinates be (x_min, y_min), (150, y_max) and (150, y_min), (x_max, y_max), with the category information unchanged;
2) If x_min, x_max < 150, y_min < 150, y_max > 150, or x_min, x_max > 150, y_min < 150, y_max > 150, the object in the image is divided along the horizontal direction into upper and lower parts; let the new coordinates be (x_min, y_min), (x_max, 150) and (x_min, 150), (x_max, y_max), with the category information unchanged;
3) If x_min < 150, y_min < 150, x_max > 150, y_max > 150, the object in the image is cut both horizontally and vertically into four parts; let the new coordinates be (x_min, y_min), (150, 150); (150, y_min), (x_max, 150); (x_min, 150), (150, y_max); and (150, 150), (x_max, y_max), with the category information unchanged.
As a preferred embodiment, the specific steps of obtaining the loss function loss and total_loss in step S11 are as follows:
The loss is divided into two parts, the confidence loss and the location loss:
L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))
where L(x, c, l, g) denotes the Loss, L_conf denotes the confidence loss (a softmax loss), L_loc denotes the location loss, and N is the number of prior boxes matched to the ground truth in the confidence loss; the parameter α adjusts the ratio between the confidence loss and the location loss; x_ij^p = 1 indicates that the i-th predicted box is matched to the j-th ground-truth (GT) box of category p; c denotes the confidence, l the predicted box and g the ground-truth box;
L_conf(x, c) = - Σ_{i∈Pos} x_ij^p log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)
where ĉ_i^p is the probability produced by the softmax, Pos denotes the positive samples, Neg the negative samples, and N is the number of prior boxes matched to the ground truth in the confidence loss; the positive term is counted when x_ij^p = 1; ĉ_i^p is the probability that the i-th predicted box belongs to category p, and p denotes the p-th category;
L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1(l_i^m - ĝ_j^m)
ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w,  ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h,  ĝ_j^w = log(g_j^w / d_i^w),  ĝ_j^h = log(g_j^h / d_i^h)
where cx denotes the x coordinate of the box center, cy the y coordinate of the center, w the width, h the height, i the i-th predicted box, j the j-th ground-truth box, and d_i the offset (default) box; x_ij^k indicates whether the i-th predicted box matches the j-th ground-truth box with respect to category k (1 for a match, 0 otherwise); l_i^m denotes the predicted box and ĝ_j^m the offset-encoded ground-truth box; m takes one of the values (cx, cy, w, h); ĝ_j^cx and ĝ_j^cy are the x and y coordinates of the center of the offset-encoded j-th ground-truth box, ĝ_j^w its width and ĝ_j^h its height; l_i^cx and l_i^cy are the x and y center offsets of the i-th predicted box, l_i^w its width offset and l_i^h its height offset; g_j^cx and g_j^cy are the x and y coordinates of the center of the j-th ground-truth box, g_j^w its width and g_j^h its height;
The five loss functions obtained for F1, F2, F3, F4 and F5 are denoted L_1(x,c,l,g), L_2(x,c,l,g), L_3(x,c,l,g), L_4(x,c,l,g) and L_5(x,c,l,g), and the total loss function is written as:
Total_loss = L_1(x,c,l,g) + L_2(x,c,l,g) + L_3(x,c,l,g) + L_4(x,c,l,g) + L_5(x,c,l,g).
As a preferred embodiment, the specific steps of fusing the pictures in step S13 are as follows:
(1) If the predicted coordinates of each of the pictures F1, F2, F3 and F4 satisfy x_pred_max, y_pred_max < 300 and x_pred_min, y_pred_min > 0, combine F1, F2, F3 and F4 into one picture according to their original positions, then reduce the size of the fused picture by a factor of 4 to restore the 300x300 size of the original picture while reducing the predicted coordinates by a factor of 4; the result is the final detection result;
(2) Detect the categories label1 and label2 of objects on the boundary between the left and right parts. If label1 equals label2, they are the same class; compare the sizes of the coordinate information of the two objects and, taking the larger bounding box as the reference, extend it toward the smaller one by a length of (x_max - x_min), then pad the four sides of the picture, combine F1, F2, F3 and F4 into one picture according to their original positions, reduce the size of the fused whole picture by a factor of 4 to restore the 300x300 size of the original picture while reducing the modified coordinates by a factor of 4; the result is the final detection result;
(3) Detect the categories label1 and label2 of objects on the boundary between the upper and lower parts. If label1 equals label2, they are the same class; compare the sizes of the coordinate information of the two objects and, taking the larger bounding box as the reference, extend it toward the smaller one by a length of (y_max - y_min), then pad the picture; reduce the size of the fused whole picture by a factor of 4 to restore the 300x300 size of the original picture while reducing the modified coordinates by a factor of 4; the result is the final detection result;
(4) If the predicted coordinates of the pictures F1, F2, F3 and F4 satisfy (x_pred_min, y_pred_min) = (300, 300) or (x_pred_max, y_pred_max) = (300, 300), the object was divided across all four parts (upper-left, lower-left, upper-right and lower-right) at the same time; in that case, use the detection result of the middle picture F5 as the detection result for the middle object, reduce the size of the fused whole picture by a factor of 4 to restore the 300x300 size of the original picture while reducing the obtained coordinates by a factor of 4; the result is the final detection result.
As a preferred embodiment, α = 1.
Obviously, the above embodiments of the present invention are merely examples given to clearly illustrate the present invention and are not intended to limit the embodiments of the present invention. For those of ordinary skill in the art, changes or modifications in other forms can be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement and improvement made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.
Claims (5)
Priority Application (1)
- CN201811605116.2A, priority date 2018-12-26, filing date 2018-12-26: Small object detection method based on deep learning and image processing (CN109685145B)
Publications (2)
- CN109685145A, published 2019-04-26
- CN109685145B, granted 2022-09-06
Family ID: 66189765
Family Applications (1)
- CN201811605116.2A (priority date 2018-12-26, filing date 2018-12-26): CN109685145B, Active
Country Status (1)
- CN: CN109685145B (en)
Families Citing this family (8)
- CN110068818A (published 2019-07-30): Working method of traffic intersection vehicle and pedestrian detection carried out by radar and image capture device
- CN110276445A (published 2019-09-24): Domestic traffic sign classification method based on Inception convolution module
- CN110660074B (published 2021-04-16): Method for establishing steel scrap grade division neural network model
- CN113393411A (published 2021-09-14): Package counting method and device, server and computer readable storage medium
- CN111488938B (published 2022-05-13): Image matching method based on two-step switchable normalized depth neural network
- CN111597340A (published 2020-08-28): Text classification method and device and readable storage medium
- CN111860623A (published 2020-10-30): Method and system for counting trees based on improved SSD neural network
- CN113762166A (published 2021-12-07): Small target detection improvement method and system based on wearable equipment
Citations (2)
- CN107341517A (published 2017-11-10): Multi-scale small object detection method based on inter-level feature fusion with deep learning
- CN108564065A (published 2018-09-21): Cable tunnel open fire recognition method based on SSD
Family Cites Families (1)
- CN106157307B (published 2018-09-11): Monocular image depth estimation method based on multi-scale CNN and continuous CRF
Non-Patent Citations (1)
- Yingying Wang et al., "Robust person head detection based on multi-scale representation fusion of deep convolution neural network", Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics, 2017-12-08 (full text).
Also Published As
- CN109685145A, published 2019-04-26
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant