
CN108564097B - Multi-scale target detection method based on deep convolutional neural network - Google Patents


Info

Publication number
CN108564097B
CN108564097B
Authority
CN
China
Prior art keywords
network
layer
model
classification
output
Prior art date
Legal status
Active
Application number
CN201711267789.7A
Other languages
Chinese (zh)
Other versions
CN108564097A (en)
Inventor
徐雪妙
肖永杰
胡枭玮
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201711267789.7A priority Critical patent/CN108564097B/en
Publication of CN108564097A publication Critical patent/CN108564097A/en
Application granted granted Critical
Publication of CN108564097B publication Critical patent/CN108564097B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale target detection method based on a deep convolutional neural network, comprising the following steps: 1) data acquisition; 2) data processing; 3) model construction; 4) loss function definition; 5) model training; 6) model validation. The method combines the ability of a deep convolutional neural network to extract high-level semantic information from images, the ability of a region proposal network to generate candidate regions, the information-completion ability of a content-aware region-of-interest pooling layer, and the accurate classification ability of a multi-task classification network to accomplish multi-scale target detection more accurately and efficiently.

Description

A multi-scale target detection method based on a deep convolutional neural network

Technical Field

The invention relates to the technical field of computer image processing, and in particular to a multi-scale target detection method based on a deep convolutional neural network.

Background

Object detection and recognition is one of the important topics in computer vision. As science and technology have developed, object detection has been put to ever wider use, being applied in scenarios such as battlefield surveillance, security inspection, traffic control, and video monitoring.

In recent years, with the rapid development of deep learning, deep convolutional neural networks have brought further breakthroughs in object detection and recognition. A deep convolutional neural network can extract high-level semantic features from an image, and those features can then be used to detect targets. The deeper the network, the more representative the features it expresses; the problem, however, is that small-scale objects are represented very coarsely, and some of their features may even be lost. Moreover, neural networks are very sensitive to object scale: the features extracted from objects of different sizes differ greatly, which leads to low detection accuracy for small-scale objects and greatly reduces the robustness and effectiveness of target detection.

Summary of the Invention

The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a multi-scale target detection method based on a deep convolutional neural network. The method detects targets of both large and small scales well, overcoming the limitation of previous methods, which could not reliably detect same-class targets whose scales differ greatly.

To achieve the above purpose, the technical solution provided by the present invention is a multi-scale target detection method based on a deep convolutional neural network, comprising the following steps:

1) Data acquisition

Training a deep convolutional neural network requires a large amount of training data, so large-scale natural or video image data must be used. If the obtained image data carry no labels, they are annotated manually and then divided into a training dataset and a validation dataset;

2) Data processing

The images and label data of the dataset are converted by preprocessing into the format required for training the deep convolutional neural network;

3) Model construction

According to the training target and the input/output form of the model, a deep convolutional neural network suited to the multi-scale target detection problem is constructed;

4) Loss function definition

According to the training target and the architecture of the model, the required loss function is defined;

5) Model training

The parameters of each network layer are initialized; training samples are input iteratively; the loss value of the network is computed from the loss function; the gradients of the parameters of each layer are then computed by backpropagation; and the parameters of each layer are updated by stochastic gradient descent;

6) Model validation

The trained model is validated on the validation dataset to test its generalization performance.

Step 2) comprises the following steps (a preprocessing sketch follows the list):

2.1) Scale the images of the dataset to m×n pixels; the label data are scaled to the corresponding size in the same proportion;

2.2) From the scaled image, randomly crop a region containing labels to obtain a rectangular image of a×b pixels, with a<=m and b<=n;

2.3) Randomly flip the cropped image horizontally with probability 0.5;

2.4) Convert the randomly flipped image from [0, 255] to the range [-1, 1].
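The following is a minimal sketch of steps 2.1)-2.4) in Python (NumPy/OpenCV assumed); the function and parameter names, the choice of anchoring the crop on the first label box, and the handling of edge cases are illustrative assumptions, not the patent's reference code.

```python
import random
import numpy as np
import cv2

def preprocess(image, boxes, m=768, n=1344, a=768, b=768):
    """image: HxWx3 uint8 array; boxes: list of [x, y, w, h] label boxes."""
    # 2.1) scale the image to m x n and rescale the label boxes proportionally
    h0, w0 = image.shape[:2]
    sy, sx = m / h0, n / w0
    image = cv2.resize(image, (n, m))  # cv2.resize takes (width, height)
    boxes = [[x * sx, y * sy, w * sx, h * sy] for x, y, w, h in boxes]

    # 2.2) randomly crop an a x b region that still contains a label box
    # (anchored on the first box; degenerate geometry is ignored in this sketch)
    bx, by, bw, bh = boxes[0]
    lo_x, hi_x = max(0, int(bx + bw) - b), min(int(bx), n - b)
    lo_y, hi_y = max(0, int(by + bh) - a), min(int(by), m - a)
    x0 = random.randint(lo_x, max(lo_x, hi_x))
    y0 = random.randint(lo_y, max(lo_y, hi_y))
    image = image[y0:y0 + a, x0:x0 + b]
    boxes = [[x - x0, y - y0, w, h] for x, y, w, h in boxes]

    # 2.3) horizontal flip with probability 0.5
    if random.random() < 0.5:
        image = image[:, ::-1]
        boxes = [[b - x - w, y, w, h] for x, y, w, h in boxes]

    # 2.4) map pixel values from [0, 255] to [-1, 1]
    image = image.astype(np.float32) / 127.5 - 1.0
    return image, boxes
```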

Step 3) comprises the following steps:

3.1) Construct the feature extraction network model

The feature extraction network acts as an encoder: it extracts high-level semantic information from the input image and stores it in a low-dimensional code. Its input is the image processed in step 2). Because small objects lose part of their information in the deeper codes, the network outputs feature codes at both a low and a lower dimension in order to preserve more information. To realize the transformation from the input to this series of outputs, the feature extraction network contains multiple cascaded downsampling layers, each composed of a convolutional layer, a batch normalization layer, a non-linear activation function layer, and a pooling layer in series. The convolutional layers use stride 1 and 3×3 kernels to extract the corresponding feature maps; the batch normalization layer stabilizes and accelerates training by normalizing the mean and standard deviation of the input samples within a batch; the non-linear activation function layer prevents the model from degenerating into a simple linear model and improves its descriptive power; and the pooling layer shrinks the feature map, which enlarges the receptive field of the convolution kernels;
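As a hedged illustration, one such cascaded downsampling unit could be written as follows in PyTorch (the channel counts are assumptions; the patent does not fix them here):

```python
import torch.nn as nn

def down_block(in_ch, out_ch):
    """One downsampling layer: conv -> batch norm -> non-linearity -> pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),  # 3x3 conv, stride 1
        nn.BatchNorm2d(out_ch),                 # stabilizes and accelerates training
        nn.ReLU(inplace=True),                  # prevents collapse to a linear model
        nn.MaxPool2d(kernel_size=2, stride=2),  # halves the map, enlarging the receptive field
    )
```

Stacking several such blocks maps, for example, a 3×768×768 input down to low-dimensional codes such as 512×48×48, matching the embodiment below.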

3.2) Construct the region proposal network model

The region proposal network is responsible for finding all objects in the input image and their positions. It takes a feature map as input, maps every point of that feature map back to the original image to obtain the point's coordinates, places a set of preset candidate boxes of different sizes and aspect ratios around each point, and computes the probability score that each box contains an object. The input of the region proposal network is the output of the feature extraction network of step 3.1); its outputs are the coordinates of a series of candidate boxes and the probability score that each candidate box is an object;

To realize the series of transformations from input to output, the region proposal network model comprises three functional structures connected in series, built from convolutional layers, batch normalization layers, and non-linear activation function layers. The first structure performs 3×3 feature fusion, fusing surrounding information, and its output feeds the second and third structures; the second structure outputs the coordinate information of the rectangular boxes, and the third outputs the probability score that the corresponding rectangular box is an object;
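A minimal PyTorch sketch of this three-structure head follows; k = 9 anchors per location is an assumption chosen so that the output channels match the 36-/18-channel matrices of the embodiment below (9 anchors × 4 coordinates and 9 anchors × 2 object/background scores):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_ch=512, k=9):
        super().__init__()
        # first functional structure: 3x3 fusion of surrounding information
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )
        # second structure: coordinate information of the rectangular boxes
        self.bbox = nn.Conv2d(in_ch, 4 * k, kernel_size=1)
        # third structure: probability score that each box is an object
        self.score = nn.Conv2d(in_ch, 2 * k, kernel_size=1)

    def forward(self, feat):
        fused = self.fuse(feat)
        return self.bbox(fused), self.score(fused)
```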

3.3) Construct the content-aware region-of-interest pooling layer

The content-aware region-of-interest pooling layer maps a target region of the original image onto the low-dimensional code obtained in step 3.1) and then pools it to a fixed size. Its content awareness manifests itself in the following two aspects:

3.3.1) Information completion

Information completion fills in the information that small targets lose during low-dimensional encoding, making their detection more accurate. For the feature map obtained by mapping a target region of the original image onto the low-dimensional code of step 3.1): if one of its length and width is greater than z (the value of z depends on the network requirements) and the other is smaller than z, the map is enlarged by deconvolution to a square with side length max(length, width) before pooling; if both its length and width are smaller than z, both are enlarged to twice their original size by deconvolution before pooling; if both are greater than z, the subsequent pooling is performed directly;
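The three-way rule can be sketched as follows (PyTorch assumed; bilinear upsampling stands in for the patent's learned deconvolution, and the values of z and the output size are placeholders):

```python
import torch.nn.functional as F

def complete_and_pool(feat, z=7, out_size=7):
    """feat: 1xCxHxW feature map cropped for one region of interest."""
    h, w = feat.shape[-2:]
    if h > z and w > z:
        pass  # both sides large enough: pool directly
    elif h < z and w < z:
        # both sides small: enlarge length and width to twice the original
        feat = F.interpolate(feat, scale_factor=2, mode='bilinear', align_corners=False)
    else:
        # one side above z, the other below: enlarge to a square of side max(h, w)
        s = max(h, w)
        feat = F.interpolate(feat, size=(s, s), mode='bilinear', align_corners=False)
    return F.adaptive_max_pool2d(feat, out_size)  # pool to the fixed output size
```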

3.3.2) Size division

The target regions output by step 3.2) are divided by size according to the mean area of all label boxes in the prepared training dataset: if the area of a rectangular box output by step 3.2) is smaller than this mean, it is marked as a small-target output; if greater than or equal to the mean, it is marked as a large-target output;
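As a toy illustration of this routing rule (the names are assumptions, not the patent's code), assuming `mean_area` has been computed beforehand over all label boxes of the training set:

```python
def route_by_size(box, mean_area):
    """Mark a proposal for the small- or large-target branch by its area."""
    x, y, w, h = box
    return 'small' if w * h < mean_area else 'large'
```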

3.4) Construct the multi-task classification network

The multi-task classification network recognizes large-scale and small-scale targets separately, preventing classification errors caused by the differing low-dimensional codes of large and small targets. The two size classes of rectangular boxes obtained in step 3.3) are fed into two separate classification networks. Each classification network outputs class scores for the classification task and refined box positions for the regression task. To accomplish both tasks, the network contains fully connected layers, non-linear activation function layers, and dropout layers: the fully connected layers map the learned "distributed feature representation" onto the sample label space; the non-linear activation function layers prevent the model from degenerating into a simple linear model and improve its descriptive power; and the dropout layers deactivate neurons with probability 0.5, making training converge faster and preventing overfitting;

Finally, the outputs of the large- and small-target classification networks are fused as the final output (a sketch of one branch follows);
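A minimal sketch of one such branch in PyTorch; the 512×7×7 input and the four classes follow the embodiment below, while the hidden width of 4096 is an assumption:

```python
import torch.nn as nn

class ClsBranch(nn.Module):
    def __init__(self, in_dim=512 * 7 * 7, n_classes=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, 4096),   # maps features onto the sample label space
            nn.ReLU(inplace=True),     # keeps the model non-linear
            nn.Dropout(p=0.5),         # deactivates neurons with probability 0.5
        )
        self.cls = nn.Linear(4096, n_classes)  # class scores (classification task)
        self.reg = nn.Linear(4096, 4)          # refined box x, y, w, h (regression task)

    def forward(self, x):
        x = self.trunk(x.flatten(1))
        return self.cls(x), self.reg(x)
```

Two such branches are instantiated, one per size class, and their detections are merged into the final output.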

Step 4) comprises the following steps:

4.1) Define the loss function of the region proposal network

The region proposal network obtains, from the low-dimensional code, the coordinates of the regions of interest in the input image and the score of whether each region is foreground, i.e., a regression task and a classification task. The loss function is defined so that the output boxes come as close as possible to the positions of the ground-truth reference boxes; hence the loss function of the regression task can be defined as the smoothed L1 (Manhattan distance) loss, SmoothL1Loss, given by:

$$L_{reg}(v, t) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(v_i - t_i)$$

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where L_reg is the regression loss, v and t denote the position of the predicted box and of its corresponding ground-truth reference box respectively, x and y denote the coordinates of the top-left corner, and w and h denote the width and height of the rectangular box;

The loss function of the classification task is defined as the softmax loss, SoftmaxLoss, given by:

$$x'_i \leftarrow x'_i - \max(x'_1, \ldots, x'_n)$$

$$p_i = \frac{e^{x'_i}}{\sum_{j=1}^{n} e^{x'_j}}$$

$$L_{cls} = -\log p_i$$

where x' is the output of the network, n is the total number of classes, p is the probability of each class, and L_cls is the classification loss;
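Both losses transcribe directly into code; the following hedged PyTorch sketch assumes `v` and `t` are (N, 4) tensors of predicted and reference boxes and `logits` is an (N, n) tensor of class scores:

```python
import torch
import torch.nn.functional as F

def smooth_l1_loss(v, t):
    """SmoothL1Loss over the four box coordinates, averaged over boxes."""
    d = (v - t).abs()
    per_coord = torch.where(d < 1, 0.5 * d ** 2, d - 0.5)
    return per_coord.sum(dim=1).mean()

def softmax_loss(logits, labels):
    """SoftmaxLoss; cross_entropy applies the max-subtraction trick internally."""
    return F.cross_entropy(logits, labels)
```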

4.2) Define the loss function of the classification network

The classification network outputs class scores for the classification task and refined box positions for the regression task. The loss function is defined so that the output classes agree with the label data as closely as possible and the output box positions come as close as possible to the ground-truth reference boxes. As in step 4.1), the loss function of the regression task can be defined as SmoothL1Loss and that of the classification task as SoftmaxLoss;

4.3) Define the total loss function

The two region proposal network loss functions and the two classification network loss functions defined in steps 4.1) and 4.2) can be combined by weighting, enabling the network to accomplish the task of multi-scale target detection in images;

Step 5) comprises the following steps:

5.1) Initialize the parameters of each layer of the model

The parameters of each layer are initialized with the methods used in conventional deep convolutional neural networks: the convolutional layer parameters of the feature extraction network take as initial values the convolutional layer parameters of a VGG16 network model pretrained on ImageNet; the convolutional layers of the region proposal network and the fully connected layers of the classification network are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02; and the parameters of all batch normalization layers are initialized from a Gaussian distribution with mean 1 and standard deviation 0.02;
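A sketch of this initialization scheme, assuming PyTorch, a torchvision VGG16 as the pretrained source, and `model.feature_extractor` as a stand-in attribute name whose layers line up with VGG16's convolutional stack:

```python
import torch.nn as nn
from torchvision.models import vgg16

def init_model(model):
    # random initialization first: N(0, 0.02) for conv/fc weights,
    # N(1, 0.02) for batch-norm scale parameters
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.normal_(m.weight, mean=1.0, std=0.02)
            nn.init.zeros_(m.bias)
    # then overwrite the feature extractor with ImageNet-pretrained VGG16 convolutions
    backbone = vgg16(pretrained=True).features
    model.feature_extractor.load_state_dict(backbone.state_dict(), strict=False)
```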

5.2) Train the network model

An original image processed by step 2) is input at random; the feature extraction network of step 3.1) produces the corresponding low-dimensional coded features; the region proposal network of step 3.2) generates a batch of candidate box regions, whose loss value is computed by step 4.1); these regions then pass through the content-aware region-of-interest pooling layer of step 3.3) to obtain another, fixed-size low-dimensional coded feature, after which the classification network of step 3.4) produces the target classes and refined box positions, whose loss value is computed by step 4.2). Finally, the two loss values are processed by step 4.3) to obtain the final loss value. Backpropagating this value yields the gradients of the parameters of every layer of the network model of step 3), and optimizing the layer parameters with these gradients by the stochastic gradient descent algorithm completes one round of training of the network model;

5.3) Repeat step 5.2) until the network's multi-scale target detection ability reaches the expected goal.
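One training round condenses to the usual forward/backward/update cycle; in this sketch `model`, `train_loader`, `num_epochs`, and `compute_total_loss` (an assumed wrapper that evaluates the four component losses and combines them as in step 4.3)) are assumed names standing in for the components sketched above:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
for epoch in range(num_epochs):
    for images, targets in train_loader:
        rpn_out, cls_out = model(images)                      # steps 3.1)-3.4): forward pass
        loss = compute_total_loss(rpn_out, cls_out, targets)  # steps 4.1)-4.3): final loss value
        optimizer.zero_grad()
        loss.backward()                                       # backpropagate the gradients
        optimizer.step()                                      # stochastic gradient descent update
```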

Step 6) is carried out as follows:

Some original images are taken at random from the validation dataset, processed by step 2), and input into the network model trained in step 5); the network model detects the positions of the targets in the images and predicts their classes, and its outputs are compared with the corresponding label data to judge the multi-scale target detection ability of the trained network model.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. A new network layer is proposed: the content-aware region-of-interest pooling layer (CAROIPooling, Content-Aware ROIPooling layer). It maps a region of the original image onto the low-dimensional coded region and pools it to a fixed size, and in particular performs information completion on the low-dimensional coded feature maps of small-scale objects, yielding more accurate and more complete low-dimensional coded feature maps. The layer is equally applicable in other target detection networks.

2. A multi-branch target detection network is proposed, in which different branches handle the large-scale and small-scale detection tasks respectively, so that large-scale and small-scale objects are distinguished and detected more accurately, breaking through the limitations of existing methods.

Brief Description of the Drawings

Fig. 1 is the flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of the feature extraction network.

Fig. 3 is a schematic diagram of the region proposal network.

Fig. 4 is a schematic diagram of the classification network.

Detailed Description

The present invention is further described below in conjunction with a specific embodiment.

As shown in Fig. 1, the details of the multi-scale target detection method based on a deep convolutional neural network provided by this embodiment are as follows:

Step 1: obtain a highway video dataset, extract its video frames, annotate them manually, and divide them into a training dataset and a validation dataset.

Step 2: convert the images and label data of the dataset by preprocessing into the format required for training the deep convolutional neural network, comprising the following steps:

Step 2.1: scale the images of the dataset to 768×1344 pixels; the label data are scaled to the corresponding size in the same proportion.

Step 2.2: from the scaled image, randomly crop a region containing labels to obtain a square image of 768×768 pixels.

Step 2.3: randomly flip the cropped image horizontally with probability 0.5.

Step 2.4: convert the randomly flipped image from [0, 255] to the range [-1, 1].

Step 3: construct the network model, including the feature extraction network, the region proposal network, and the multi-task classification network, comprising the following steps:

Step 3.1: construct the feature extraction network. Its input is a 3×768×768 image and its outputs are a series of low-dimensional coded feature maps (512×48×48 and 512×24×24). The network comprises multiple cascaded downsampling layers, each composed of a convolutional layer, a batch normalization layer, a non-linear activation function layer, and a pooling layer in series. A concrete example of the feature extraction network model is shown in Fig. 2.

Step 3.2: construct the region proposal network. Its inputs are the 512×48×48 / 512×24×24 feature maps and its outputs are matrices of 36×48×48 / 36×24×24 and 18×48×48 / 18×24×24. The network comprises three structures (convolutional layer, batch normalization layer, non-linear activation function layer) connected in series. A concrete example of the region proposal network model is shown in Fig. 3.

Step 3.3: construct the multi-task classification network. This example uses two classification networks, each taking a vector of length 512×7×7 as input and outputting a vector A of length 4 and a vector B of length 4, where the four values of vector A are the class scores for background, car, bus, and train, and the four values of vector B give the position of a box (the coordinates x and y of its top-left corner and its width w and height h). The network contains fully connected layers, non-linear activation function layers, and dropout layers. A concrete example of the multi-task classification network model of this embodiment is shown in Fig. 4.

Step 4: define the loss functions of the region proposal network and of the classification network, comprising the following steps:

Step 4.1: define the loss functions of the region proposal network. A loss function is defined so that the output boxes come as close as possible to the positions of the ground-truth reference boxes, using SmoothL1Loss; and a loss function is defined so that the foreground scores of the output boxes come as close as possible to the label data, using SoftmaxLoss.

Step 4.2: define the loss functions of the classification network. A loss function is defined so that the class scores of the output boxes come as close as possible to the label data, with four classes; and a loss function is defined so that the output boxes come as close as possible to the positions of the ground-truth reference boxes.

Step 4.3: define the total loss function as the weighted sum of the above four losses, expressed as:

$$Loss = \underbrace{(w_1 \times L_{cls} + w_2 \times L_{reg})}_{\text{region proposal network loss}} + \underbrace{(w_3 \times L_{cls} + w_4 \times L_{reg})}_{\text{classification network loss}}$$

where Loss is the total loss value and w_1, w_2, w_3, w_4 are the weights (in this example w_1 = w_2 = w_3 = w_4 = 1), L_cls is the classification loss value, and L_reg is the regression loss value.
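This weighted sum transcribes directly into a hedged Python helper (the four component losses are assumed to have been computed already):

```python
def total_loss(rpn_cls, rpn_reg, cls_cls, cls_reg, w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    """Weighted combination of the RPN and classification-network losses."""
    return w1 * rpn_cls + w2 * rpn_reg + w3 * cls_cls + w4 * cls_reg
```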

Step 5: train the network model, comprising the following steps:

Step 5.1: initialize the parameters of each layer of the model. The convolutional layer parameters of the feature extraction network take as initial values the convolutional layer parameters of a VGG16 network model pretrained on the large ImageNet database; the convolutional layers of the region proposal network and the fully connected layers of the classification network are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02; and the parameters of all batch normalization layers are initialized from a Gaussian distribution with mean 1 and standard deviation 0.02.

Step 5.2: train the network model. An original image processed by step 2 is input at random into the network model of step 3, which outputs the class information and the coordinate information of the regression boxes; the final loss value is then computed by step 4. Backpropagating this value yields the gradients of the parameters of every layer of the network model of step 3, and optimizing the layer parameters with these gradients by the stochastic gradient descent algorithm completes one round of training of the network model.

Step 5.3: iterate the training continuously, i.e., repeat step 5.2 until the network's multi-scale target detection ability reaches the expected goal.

Step 6: validate the trained model on the validation dataset to test its generalization performance.

Specifically, some original images are taken at random from the validation dataset, processed by step 2, and input into the network model trained in step 5; the network model detects the positions of the targets in the images and predicts their classes, and its outputs are compared with the corresponding label data to judge the multi-scale target detection ability of the trained network model.

The embodiment described above is only a preferred embodiment of the present invention and does not limit its scope of implementation; any change made according to the shape and principle of the present invention shall therefore fall within the protection scope of the present invention.

Claims (3)

1. A multi-scale target detection method based on a deep convolutional neural network, characterized by comprising the following steps:

1) Data acquisition: training a deep convolutional neural network requires a large amount of training data, so large-scale natural or video image data must be used; if the obtained image data carry no labels, they are annotated manually and then divided into a training dataset and a validation dataset;

2) Data processing: the images and label data of the dataset are converted by preprocessing into the format required for training the deep convolutional neural network;

3) Model construction: according to the training target and the input/output form of the model, a deep convolutional neural network suited to the multi-scale target detection problem is constructed, comprising the following steps:

3.1) Construct the feature extraction network model: the feature extraction network acts as an encoder, extracting high-level semantic information from the input image and storing it in a low-dimensional code; its input is the image processed in step 2); because small objects lose part of their information in the deeper codes, feature codes of a low and a lower dimension are output in order to preserve more information; to realize the transformation from the input to this series of outputs, the feature extraction network contains multiple cascaded downsampling layers, each composed of a convolutional layer, a batch normalization layer, a non-linear activation function layer, and a pooling layer in series, wherein the convolutional layers use stride 1 and 3×3 kernels to extract the corresponding feature maps, the batch normalization layer stabilizes and accelerates training by normalizing the mean and standard deviation of the input samples within a batch, the non-linear activation function layer prevents the model from degenerating into a simple linear model and improves its descriptive power, and the pooling layer shrinks the feature map, enlarging the receptive field of the convolution kernels;

3.2) Construct the region proposal network model: the region proposal network is responsible for finding all objects in the input image and their positions; it takes a feature map as input, maps every point of the feature map back to the original image to obtain the point's coordinates, places a set of preset candidate boxes of different sizes and aspect ratios around each point, and computes the probability score that each box contains an object; the input of the region proposal network is the output of the feature extraction network of step 3.1), and its outputs are the coordinates of a series of candidate boxes and the probability score that each candidate box is an object; to realize the series of transformations from input to output, the region proposal network model comprises three functional structures connected in series, built from convolutional layers, batch normalization layers, and non-linear activation function layers; the first structure performs 3×3 feature fusion, fusing surrounding information, and its output feeds the second and third structures; the second structure outputs the coordinate information of the rectangular boxes, and the third outputs the probability score that the corresponding rectangular box is an object;

3.3) Construct the content-aware region-of-interest pooling layer: this layer maps a target region of the original image onto the low-dimensional code obtained in step 3.1) and then pools it to a fixed size; its content awareness manifests itself in the following two aspects:

3.3.1) Information completion: information completion fills in the information that small targets lose during low-dimensional encoding, making their detection more accurate; for the feature map obtained by mapping a target region of the original image onto the low-dimensional code of step 3.1): if one of its length and width is greater than z (the value of z depending on the network requirements) and the other is smaller than z, the map is enlarged by deconvolution to a square with side length max(length, width) before pooling; if both its length and width are smaller than z, both are enlarged to twice their original size by deconvolution before pooling; if both are greater than z, the subsequent pooling is performed directly;

3.3.2) Size division: the target regions output by step 3.2) are divided by size according to the mean area of all label boxes in the prepared training dataset; if the area of a rectangular box output by step 3.2) is smaller than this mean, it is marked as a small-target output, and if greater than or equal to the mean, as a large-target output;

3.4) Construct the multi-task classification network: the multi-task classification network recognizes large-scale and small-scale targets separately, preventing classification errors caused by the differing low-dimensional codes of large and small targets; the two size classes of rectangular boxes obtained in step 3.3) are fed into two separate classification networks; each classification network outputs class scores for the classification task and refined box positions for the regression task; to accomplish both tasks, the network contains fully connected layers, non-linear activation function layers, and dropout layers, wherein the fully connected layers map the learned "distributed feature representation" onto the sample label space, the non-linear activation function layers prevent the model from degenerating into a simple linear model and improve its descriptive power, and the dropout layers deactivate neurons with probability 0.5, making training converge faster and preventing overfitting; finally, the outputs of the large- and small-target classification networks are fused as the final output;

4) Loss function definition: according to the training target and the architecture of the model, the required loss function is defined, comprising the following steps:

4.1) Define the loss function of the region proposal network: the region proposal network obtains, from the low-dimensional code, the coordinates of the regions of interest in the input image and the score of whether each region is foreground, i.e., a regression task and a classification task; the loss function is defined so that the output boxes approach the positions of the ground-truth reference boxes; hence the loss function of the regression task can be defined as the smoothed L1 (Manhattan distance) loss, SmoothL1Loss, given by:

$$L_{reg}(v, t) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(v_i - t_i)$$

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where L_reg is the regression loss, v and t denote the position of the predicted box and of its corresponding ground-truth reference box respectively, x and y denote the coordinates of the top-left corner, and w and h denote the width and height of the rectangular box; the loss function of the classification task is defined as the softmax loss, SoftmaxLoss, given by:

$$x'_i \leftarrow x'_i - \max(x'_1, \ldots, x'_n)$$

$$p_i = \frac{e^{x'_i}}{\sum_{j=1}^{n} e^{x'_j}}$$

$$L_{cls} = -\log p_i$$

where x' is the output of the network, n is the total number of classes, p is the probability of each class, and L_cls is the classification loss;

4.2) Define the loss function of the classification network: the classification network outputs class scores for the classification task and refined box positions for the regression task; the loss function is defined so that the output classes agree with the label data and the output box positions agree with the ground-truth reference boxes; as in step 4.1), the loss function of the regression task can be defined as SmoothL1Loss and that of the classification task as SoftmaxLoss;

4.3) Define the total loss function: the two region proposal network loss functions and the two classification network loss functions defined in steps 4.1) and 4.2) can be combined by weighting, enabling the network to accomplish the task of multi-scale target detection in images;

5) Model training: the parameters of each network layer are initialized; training samples are input iteratively; the loss value of the network is computed from the loss function; the gradients of the parameters of each layer are computed by backpropagation; and the parameters of each layer are updated by stochastic gradient descent, comprising the following steps:

5.1) Initialize the parameters of each layer of the model: the parameters of each layer are initialized with the methods used in conventional deep convolutional neural networks; the convolutional layer parameters of the feature extraction network take as initial values the convolutional layer parameters of a VGG16 network model pretrained on ImageNet; the convolutional layers of the region proposal network and the fully connected layers of the classification network are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02; and the parameters of all batch normalization layers are initialized from a Gaussian distribution with mean 1 and standard deviation 0.02;

5.2) Train the network model: an original image processed by step 2) is input at random; the feature extraction network of step 3.1) produces the corresponding low-dimensional coded features; the region proposal network of step 3.2) generates a batch of candidate box regions, whose loss value is computed by step 4.1); these regions then pass through the content-aware region-of-interest pooling layer of step 3.3) to obtain another, fixed-size low-dimensional coded feature, after which the classification network of step 3.4) produces the target classes and refined box positions, whose loss value is computed by step 4.2); finally, the two loss values are processed by step 4.3) to obtain the final loss value; backpropagating this value yields the gradients of the parameters of every layer of the network model of step 3), and optimizing the layer parameters with these gradients by the stochastic gradient descent algorithm completes one round of training of the network model;

5.3) Repeat step 5.2) until the network's multi-scale target detection ability reaches the expected goal;

6) Model validation: the trained model is validated on the validation dataset to test its generalization performance.

2. The multi-scale target detection method based on a deep convolutional neural network according to claim 1, characterized in that step 2) comprises the following steps:

2.1) Scale the images of the dataset to m×n pixels; the label data are scaled to the corresponding size in the same proportion;

2.2) From the scaled image, randomly crop a region containing labels to obtain a rectangular image of a×b pixels, with a<=m and b<=n;

2.3) Randomly flip the cropped image horizontally with probability 0.5;

2.4) Convert the randomly flipped image from [0, 255] to the range [-1, 1].

3. The multi-scale target detection method based on a deep convolutional neural network according to claim 1, characterized in that step 6) is carried out as follows: some original images are taken at random from the validation dataset, processed by step 2), and input into the network model trained in step 5); the network model detects the positions of the targets in the images and predicts their classes, and its outputs are compared with the corresponding label data to judge the multi-scale target detection ability of the trained network model.
CN201711267789.7A 2017-12-05 2017-12-05 Multi-scale target detection method based on deep convolutional neural network Active CN108564097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711267789.7A CN108564097B (en) 2017-12-05 2017-12-05 Multi-scale target detection method based on deep convolutional neural network


Publications (2)

Publication Number Publication Date
CN108564097A CN108564097A (en) 2018-09-21
CN108564097B true CN108564097B (en) 2020-09-22

Family

ID=63529242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711267789.7A Active CN108564097B (en) 2017-12-05 2017-12-05 Multi-scale target detection method based on deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN108564097B (en)

Families Citing this family (98)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109361617B (en) * 2018-09-26 2022-09-27 中国科学院计算机网络信息中心 A convolutional neural network traffic classification method and system based on network packet load
CN109446911B (en) * 2018-09-28 2021-08-06 北京陌上花科技有限公司 Image detection method and system
CN109492636B (en) * 2018-09-30 2021-08-03 浙江工业大学 Object detection method based on adaptive receptive field deep learning
CN109376619B (en) * 2018-09-30 2021-10-15 中国人民解放军陆军军医大学 Cell detection method
CN109525859B (en) * 2018-10-10 2021-01-15 腾讯科技(深圳)有限公司 Model training method, image sending method, image processing method and related device equipment
CN109558791B (en) * 2018-10-11 2020-12-01 浙江大学宁波理工学院 Bamboo shoot searching device and method based on image recognition
CN109344806B (en) * 2018-10-31 2019-08-23 第四范式(北京)技术有限公司 The method and system detected using multitask target detection model performance objective
CN109634820A (en) * 2018-11-01 2019-04-16 华中科技大学 A kind of fault early warning method, relevant device and the system of the collaboration of cloud mobile terminal
CN109583321A (en) * 2018-11-09 2019-04-05 同济大学 The detection method of wisp in a kind of structured road based on deep learning
CN109523015B (en) * 2018-11-09 2021-10-22 上海海事大学 A kind of image processing method in neural network
CN109583483B (en) * 2018-11-13 2020-12-11 中国科学院计算技术研究所 A target detection method and system based on convolutional neural network
CN111260536B (en) * 2018-12-03 2022-03-08 中国科学院沈阳自动化研究所 Digital image multi-scale convolution processor with variable parameters and implementation method thereof
CN111310775B (en) * 2018-12-11 2023-08-25 Tcl科技集团股份有限公司 Data training method, device, terminal equipment and computer readable storage medium
CN109753995B (en) * 2018-12-14 2021-01-01 中国科学院深圳先进技术研究院 Optimization method of 3D point cloud target classification and semantic segmentation network based on PointNet +
CN109753959B (en) * 2018-12-21 2022-05-13 西北工业大学 Pavement traffic sign detection method based on adaptive multi-scale feature fusion
CN109766790B (en) * 2018-12-24 2022-08-23 重庆邮电大学 Pedestrian detection method based on self-adaptive characteristic channel
CN109685066B (en) * 2018-12-24 2021-03-09 中国矿业大学(北京) Mine target detection and identification method based on deep convolutional neural network
CN110889425A (en) * 2018-12-29 2020-03-17 研祥智能科技股份有限公司 Target detection method based on deep learning
CN109726690B (en) * 2018-12-30 2023-04-18 陕西师范大学 Multi-region description method for learner behavior image based on DenseCap network
CN109741318B (en) * 2018-12-30 2022-03-29 北京工业大学 Real-time detection method of single-stage multi-scale specific target based on effective receptive field
CN109753927B (en) 2019-01-02 2025-03-07 腾讯科技(深圳)有限公司 A face detection method and device
CN109784476B (en) * 2019-01-12 2022-08-16 福州大学 Method for improving DSOD network
CN109829421B (en) * 2019-01-29 2020-09-08 西安邮电大学 Method and device for vehicle detection and computer readable storage medium
CN111523351A (en) * 2019-02-02 2020-08-11 北京地平线机器人技术研发有限公司 Neural network training method and device and electronic equipment
CN109977997B (en) * 2019-02-13 2021-02-02 中国科学院自动化研究所 Image target detection and segmentation method based on convolutional neural network rapid robustness
CN109919214B (en) * 2019-02-27 2023-07-21 南京地平线机器人技术有限公司 Training method and training device for neural network model
CN109949229A (en) * 2019-03-01 2019-06-28 北京航空航天大学 A multi-platform and multi-view target collaborative detection method
CN111695380B (en) * 2019-03-13 2023-09-26 杭州海康威视数字技术股份有限公司 Target detection method and device
CN110120047B (en) * 2019-04-04 2023-08-08 平安科技(深圳)有限公司 Image segmentation model training method, image segmentation method, device, equipment and medium
CN109977918B (en) * 2019-04-09 2023-05-02 华南理工大学 An Optimization Method for Object Detection and Localization Based on Unsupervised Domain Adaptation
CN110072119B (en) * 2019-04-11 2020-04-10 西安交通大学 Content-aware video self-adaptive transmission method based on deep learning network
CN110084165B (en) * 2019-04-19 2020-02-07 山东大学 Intelligent identification and early warning method for abnormal events in open scene of power field based on edge calculation
CN110070530B (en) * 2019-04-19 2020-04-10 山东大学 Transmission line icing detection method based on deep neural network
CN110135480A (en) * 2019-04-30 2019-08-16 南开大学 A network data learning method based on unsupervised object detection to eliminate bias
CN110215232A (en) * 2019-04-30 2019-09-10 南方医科大学南方医院 Ultrasonic patch analysis method in coronary artery based on algorithm of target detection
CN110929746A (en) * 2019-05-24 2020-03-27 南京大学 A deep neural network-based method for location, extraction and classification of electronic file titles
CN110288082B (en) * 2019-06-05 2022-04-05 北京字节跳动网络技术有限公司 Convolutional neural network model training method and device and computer readable storage medium
CN110298387A (en) * 2019-06-10 2019-10-01 天津大学 Incorporate the deep neural network object detection method of Pixel-level attention mechanism
CN110298266B (en) * 2019-06-10 2023-06-06 天津大学 Object detection method based on deep neural network based on multi-scale receptive field feature fusion
CN110348437B (en) * 2019-06-27 2022-03-25 电子科技大学 A Target Detection Method Based on Weakly Supervised Learning and Occlusion Awareness
CN110288586A (en) * 2019-06-28 2019-09-27 昆明能讯科技有限责任公司 A kind of multiple dimensioned transmission line of electricity defect inspection method based on visible images data
CN110472483B (en) * 2019-07-02 2022-11-15 五邑大学 SAR image-oriented small sample semantic feature enhancement method and device
CN110399884B (en) * 2019-07-10 2021-08-20 浙江理工大学 A feature fusion adaptive anchor frame model vehicle detection method
CN110349148A (en) * 2019-07-11 2019-10-18 电子科技大学 A Weakly Supervised Learning-Based Image Object Detection Method
CN111027581A (en) * 2019-08-23 2020-04-17 中国地质大学(武汉) A 3D target detection method and system based on learnable coding
CN110706205B (en) * 2019-09-07 2021-05-14 创新奇智(重庆)科技有限公司 Method for detecting cloth hole-breaking defect by using computer vision technology
CN110659724B (en) * 2019-09-12 2023-04-28 复旦大学 Construction Method of Deep Convolutional Neural Network for Target Detection Based on Target Scale
CN112712097B (en) * 2019-10-25 2024-01-05 杭州海康威视数字技术股份有限公司 Image recognition method and device based on open platform and user side
CN110909623B (en) * 2019-10-31 2022-10-04 南京邮电大学 Three-dimensional target detection method and three-dimensional target detector
CN110991247B (en) * 2019-10-31 2023-08-11 厦门思泰克智能科技股份有限公司 Electronic component identification method based on deep learning and NCA fusion
CN111008656B (en) * 2019-11-29 2022-12-13 中国电子科技集团公司第二十研究所 Target detection method based on prediction frame error multi-stage loop processing
CN111222546B (en) * 2019-12-27 2023-04-07 中国科学院计算技术研究所 Multi-scale fusion food image classification model training and image classification method
CN111178446B (en) * 2019-12-31 2023-08-04 歌尔股份有限公司 Optimization method and device of target classification model based on neural network
CN111242897A (en) * 2019-12-31 2020-06-05 北京深睿博联科技有限责任公司 Chest X-ray image analysis method and device
CN111241964A (en) * 2020-01-06 2020-06-05 北京三快在线科技有限公司 Training method and device of target detection model, electronic equipment and storage medium
CN111242037B (en) * 2020-01-15 2023-03-21 华南理工大学 Lane line detection method based on structural information
CN111275171B (en) * 2020-01-19 2023-07-04 合肥工业大学 Small target detection method based on parameter-shared multi-scale super-resolution reconstruction
CN111274981B (en) * 2020-02-03 2021-10-08 中国人民解放军国防科技大学 Target detection network construction method and device and target detection method
CN111444939B (en) * 2020-02-19 2022-06-28 山东大学 Small-scale equipment component detection method based on weakly supervised collaborative learning in open power-field scenes
CN111340123A (en) * 2020-02-29 2020-06-26 韶鼎人工智能科技有限公司 Image score label prediction method based on deep convolutional neural network
CN111445026B (en) * 2020-03-16 2023-08-22 东南大学 Acceleration method for deep neural network multi-path inference in edge intelligence applications
CN111461190B (en) * 2020-03-24 2023-03-28 华南理工大学 Deep convolutional neural network-based non-equilibrium ship classification method
CN111257341B (en) * 2020-03-30 2023-06-16 河海大学常州校区 Crack detection method for underwater structures based on multi-scale features and stacked fully convolutional networks
CN111489332B (en) * 2020-03-31 2023-03-17 成都数之联科技股份有限公司 Multi-scale IOF random cropping data augmentation method for target detection
CN111611846A (en) * 2020-03-31 2020-09-01 北京迈格威科技有限公司 Pedestrian re-identification method, device, electronic device and storage medium
CN111553397B (en) * 2020-04-21 2022-04-29 东南大学 Cross-domain target detection method based on regional full convolution network and self-adaption
CN112016542A (en) * 2020-05-08 2020-12-01 珠海欧比特宇航科技股份有限公司 Urban waterlogging intelligent detection method and system
CN111597945B (en) * 2020-05-11 2023-08-18 济南博观智能科技有限公司 Target detection method, device, equipment and medium
CN111931900B (en) * 2020-05-29 2023-09-19 西安电子科技大学 GIS discharge waveform detection method based on residual network and multi-scale feature fusion
CN111626373B (en) * 2020-06-01 2023-07-25 中国科学院自动化研究所 Multi-scale widening residual network, small target recognition and detection network and its optimization method
CN111783784A (en) * 2020-06-30 2020-10-16 创新奇智(合肥)科技有限公司 Method and device for detecting building cavity, electronic equipment and storage medium
CN111860264B (en) * 2020-07-10 2024-01-05 武汉理工大学 Multi-task instance-level road scene understanding algorithm based on gradient equalization strategy
CN111986126B (en) * 2020-07-17 2022-05-24 浙江工业大学 Multi-target detection method based on improved VGG16 network
CN112288686B (en) * 2020-07-29 2023-12-19 深圳市智影医疗科技有限公司 Model training method and device, electronic equipment and storage medium
CN112183579B (en) * 2020-09-01 2023-05-30 国网宁夏电力有限公司检修公司 Method, medium and system for detecting micro target
CN112149521B (en) * 2020-09-03 2024-05-07 浙江工业大学 Palm print ROI extraction and enhancement method based on a multi-task convolutional neural network
CN112116079A (en) * 2020-09-22 2020-12-22 视觉感知(北京)科技有限公司 Solution for data transmission between neural networks
CN112132816B (en) * 2020-09-27 2022-12-30 北京理工大学 Target detection method based on multitask and region-of-interest segmentation guidance
CN112200089B (en) * 2020-10-12 2021-09-14 西南交通大学 Dense vehicle detection method based on vehicle counting perception attention
CN112347967B (en) * 2020-11-18 2023-04-07 北京理工大学 A Pedestrian Detection Method Fused with Motion Information in Complex Scenes
CN114547785B (en) * 2020-11-25 2024-11-22 英业达科技有限公司 Manufacturing parameter adjustment and control system and method for manufacturing equipment
CN112348036B (en) * 2020-11-26 2025-01-14 北京工业大学 Adaptive object detection method based on lightweight residual learning and deconvolution cascade
CN112560627A (en) * 2020-12-09 2021-03-26 江苏集萃未来城市应用技术研究所有限公司 Real-time detection method for abnormal behaviors of construction site personnel based on neural network
CN112508016B (en) * 2020-12-15 2024-04-16 深圳万兴软件有限公司 Image processing method, device, computer equipment and storage medium
CN112712133A (en) * 2021-01-15 2021-04-27 北京华捷艾米科技有限公司 Deep learning network model training method, related device and storage medium
CN112836816B (en) * 2021-02-04 2024-02-09 南京大学 Training method suitable for crosstalk of photoelectric storage and calculation integrated processing unit
CN113269182A (en) * 2021-04-21 2021-08-17 山东师范大学 Target fruit detection method and system based on small-region sensitivity of a Transformer variant
CN113326735B (en) * 2021-04-29 2023-11-28 南京大学 YOLOv 5-based multi-mode small target detection method
CN113239775B (en) * 2021-05-09 2023-05-02 西北工业大学 Method for detecting and extracting tracks in azimuth history diagrams based on a hierarchical-attention deep convolutional neural network
CN112990444B (en) * 2021-05-13 2021-09-24 电子科技大学 Hybrid neural network training method, system, equipment and storage medium
CN113076962B (en) * 2021-05-14 2022-10-21 电子科技大学 Multi-scale target detection method based on micro neural network search technology
CN113762278B (en) * 2021-09-13 2023-11-17 中冶路桥建设有限公司 Asphalt pavement damage identification method based on target detection
CN114048536A (en) * 2021-11-18 2022-02-15 重庆邮电大学 A road structure prediction and target detection method based on multi-task neural network
CN113902980B (en) * 2021-11-24 2024-02-20 河南大学 Remote sensing target detection method based on content perception
CN114462487A (en) * 2021-12-28 2022-05-10 浙江大华技术股份有限公司 Target detection network training and detection method, device, terminal and storage medium
CN114549958B (en) * 2022-02-24 2023-08-04 四川大学 Night and camouflage target detection method based on context information perception mechanism
CN114687012A (en) * 2022-02-25 2022-07-01 武汉智目智能技术合伙企业(有限合伙) Efficient foreign fiber removing device and method for high-impurity-content raw cotton
CN115049952B (en) * 2022-04-24 2023-04-07 南京农业大学 Juvenile fish limb identification method based on multi-scale cascade perception deep learning network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150342560A1 (en) * 2013-01-25 2015-12-03 Ultrasafe Ultrasound Llc Novel Algorithms for Feature Detection and Hiding from Ultrasound Images
US10002313B2 (en) * 2015-12-15 2018-06-19 Sighthound, Inc. Deeply learned convolutional neural networks (CNNS) for object localization and classification

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320963A (en) * 2015-10-21 2016-02-10 哈尔滨工业大学 Large-scale semi-supervised feature selection method for high-resolution remote sensing images
CN106250812A (en) * 2016-07-15 2016-12-21 汤平 Vehicle model recognition method based on the Fast R-CNN deep neural network
CN106529402A (en) * 2016-09-27 2017-03-22 中国科学院自动化研究所 Multi-task learning convolutional neural network-based face attribute analysis method
CN106845430A (en) * 2017-02-06 2017-06-13 东华大学 Pedestrian detection and tracking based on accelerated region convolutional neural networks
CN107103590A (en) * 2017-03-22 2017-08-29 华南理工大学 Image reflection removal method based on deep convolutional generative adversarial networks
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 Multi-scale small object detection method based on deep learning with inter-level feature fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection; Zhaowei Cai et al.; ECCV 2016; 2016-12-31; pp. 354-370 *

Also Published As

Publication number Publication date
CN108564097A (en) 2018-09-21

Similar Documents

Publication Publication Date Title
CN108564097B (en) Multi-scale target detection method based on deep convolutional neural network
CN109977918B (en) An Optimization Method for Object Detection and Localization Based on Unsupervised Domain Adaptation
CN110298266B (en) Deep neural network object detection method based on multi-scale receptive field feature fusion
Wang et al. An improved lightweight traffic sign recognition algorithm based on YOLOv4-tiny
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN109934200B (en) RGB color remote sensing image cloud detection method and system based on improved M-Net
CN110738207A (en) Character detection method fusing character region edge information in text images
CN110276269A (en) An Attention Mechanism Based Target Detection Method for Remote Sensing Images
CN114187450A (en) A deep learning-based semantic segmentation method for remote sensing images
CN111860171A (en) A method and system for detecting irregularly shaped targets in large-scale remote sensing images
CN117078942B (en) Context-aware refereed image segmentation method, system, device and storage medium
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN117409190B (en) Real-time infrared image target detection method, device, equipment and storage medium
CN115546569A (en) An attention mechanism-based data classification optimization method and related equipment
CN116912708A (en) Remote sensing image building extraction method based on deep learning
CN118691815A (en) A high-quality automatic instance segmentation method for remote sensing images based on fine-tuning of the SAM large model
CN110852327A (en) Image processing method, device, electronic device and storage medium
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
Zheng et al. Feature enhancement for multi-scale object detection
CN116363526A (en) MROCNet model construction and multi-source remote sensing image change detection method and system
CN116012626B (en) Material matching method, device, equipment and storage medium for building elevation image
Pang et al. PTRSegNet: A Patch-to-Region Bottom–Up Pyramid Framework for the Semantic Segmentation of Large-Format Remote Sensing Images
CN112668662B (en) Target detection method in wild mountain forest environment based on improved YOLOv3 network
CN118521791A (en) Remote sensing image semantic segmentation method based on convolutional neural network and complete attention network
CN113011506A (en) Texture image classification method based on depth re-fractal spectrum network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant