Deep learning intelligent identification method for deformable living body small target
Technical Field
The invention relates to a deep learning intelligent identification method for deformable living small targets, belonging to the technical field of robot vision and intelligent recognition.
Background
Robot vision and intelligent recognition technology is one of the main means by which a robot acquires external information, and it is currently widely used in detection, target tracking, manipulation and other areas of the robotics field. However, with the evolution of the technology and the need to improve system performance, visual intelligence requires the robot to detect and identify not only small-scale targets in different scenes but also deformable living targets. Two main classes of solutions currently exist for the difficulty of detecting deformable living targets. The first class builds a training set containing sufficiently varied target shapes, mainly by augmenting the existing data; it achieves robust detection of deformable targets at the cost of a large amount of training and complex model parameters. The second class uses features and algorithms with transform invariance; this class contains many classical algorithms, such as SIFT (scale-invariant feature transform) and the sliding-window object detection paradigm.
However, the above-mentioned methods suffer from two disadvantages. First, the geometric transformation is assumed fixed and known, and this a priori knowledge is used to design the augmented data and the invariant features and algorithms; for living targets, however, the shape transformation takes many forms and the augmented target morphologies are limited, so this approach cannot deal with unknown geometric transformations for morphologies that were not augmented. Second, for overly complex transformations, even when the transformations are known, artificially designing invariant features and algorithms is difficult and often infeasible.
Disclosure of Invention
The invention aims to provide a deep learning intelligent identification method of a deformable living small target for improving the detection effect of the deformable target.
The aim of the invention is realized as follows: the deep learning intelligent recognition method for deformable living small targets specifically comprises the following steps:
Step 1. Replace the basic convolution unit with the deformable convolution module: add a two-dimensional or even higher-dimensional offset to the spatial sampling points of the standard convolution to change the shape of its sampling points.
Step 2. Replace the ROI pooling layer with the deformable ROI pooling module: add a two-dimensional or even higher-dimensional offset to the position of each square bin of ordinary ROI (Region of Interest) pooling, so as to improve the deformable capability of the convolutional neural network, obtain the deformable convolution network, and improve its ability to detect and identify deformable targets.
Step 3. For the detection and identification of small targets, improve the Faster R-CNN model with a structure based on deconvolution and multi-layer feature fusion, so that the small-target preselection frames obtain a richer amount of information.
Step 4. In the Faster R-CNN network, the RPN network generates preselected frames, which the algorithm then classifies and regresses; improve the anchor-point mechanism by adding a group of small-scale preselection frames to the anchors, so that the RPN can generate more small-target preselection frames and the detection and identification of small targets is improved.
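Steps 1 and 2 can be illustrated with a minimal NumPy sketch of a single deformable-convolution output value: each of the nine samples of a 3x3 kernel is displaced by its own learned 2-D offset and read off the feature map by bilinear interpolation. The function names and the toy setup are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feat (H, W) at a fractional location (y, x) with bilinear
    interpolation; out-of-range coordinates are clamped to the border."""
    H, W = feat.shape
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] +
            (1 - wy) * wx * feat[y0, x1] +
            wy * (1 - wx) * feat[y1, x0] +
            wy * wx * feat[y1, x1])

def deformable_conv_point(feat, kernel, p0, offsets):
    """One output value of a 3x3 deformable convolution at location p0:
    each of the 9 regular grid samples is displaced by its own (dy, dx)
    offset before being weighted by the kernel."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for i, (dy, dx) in enumerate(grid):
        oy, ox = offsets[i]          # learned 2-D offset for this sample
        out += kernel[dy + 1, dx + 1] * bilinear_sample(
            feat, p0[0] + dy + oy, p0[1] + dx + ox)
    return out
```

With all offsets zero the result equals an ordinary 3x3 convolution; nonzero offsets deform the sampling grid, which is the mechanism Step 1 relies on.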
The invention also includes such structural features:
1. the deformable convolution network comprises a deformable convolution module, a deformable ROI pooling module and a deformable position-sensitive ROI pooling module; the convolution and the feature map in a convolutional neural network are both three-dimensional, the deformable convolution operates in a two-dimensional spatial domain, and the deformable convolution operation is the same between different channel dimensions.
2. Step 1 is a two-dimensional operation description of the deformable convolution, and specifically comprises adding a two-dimensional or even higher-dimensional offset to the spatial sampling points of the standard convolution so that the shape of the convolution's sampling points can change; the offset is obtained by performing a convolution operation on the same input feature map, and the convolution kernel of this operation keeps the same resolution and dilation values as the preceding convolution layer; the output offset field has the same spatial resolution as the input feature map, and the number of channels of the offset field is twice that of the input feature map, corresponding to the two-dimensional offset of each sampling position of the convolution. In training, the convolution kernel that generates the output feature map and the convolution kernel that generates the offset field are learned at the same time. To learn the offset field, the gradient is obtained by inverting the bilinear operation given by the following two formulas:

x(p) = Σ_q G(q, p) · x(q)    (1)

G(q, p) = g(q_x, p_x) · g(q_y, p_y)    (2)

where p denotes the position of an arbitrary sampling point (in the deformable convolution formula, p = p_0 + p_n + Δp_n), q traverses all integral spatial positions of the input feature map x, G(·,·) is the bilinear interpolation kernel, and g(a, b) = max(0, 1 − |a − b|).

In the deformable convolution formula, the gradient with respect to the offset Δp_n is calculated as:

∂y(p_0)/∂Δp_n = Σ_{p_n∈R} w(p_n) · Σ_q [∂G(q, p_0 + p_n + Δp_n)/∂Δp_n] · x(q)    (3)

where the term ∂G(q, p)/∂Δp_n can be derived from formula (2). Note that Δp_n is a two-dimensional quantity; for simplicity we write ∂Δp_n in place of its two components ∂Δp_n^x and ∂Δp_n^y.
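The bilinear kernel g(a, b) = max(0, 1 − |a − b|) and the interpolation sum over integer positions q can be checked numerically with a small pure-Python sketch (function names are illustrative):

```python
def g(a, b):
    # 1-D bilinear kernel: nonzero only for the two integers nearest b
    return max(0.0, 1.0 - abs(a - b))

def G(q, p):
    # separable 2-D kernel: G(q, p) = g(q_x, p_x) * g(q_y, p_y)
    return g(q[0], p[0]) * g(q[1], p[1])

def x_at(feat, p):
    # x(p) = sum over integer positions q of G(q, p) * x(q); only the
    # four integer neighbours of a fractional p get a nonzero weight
    return sum(G((qy, qx), p) * feat[qy][qx]
               for qy in range(len(feat)) for qx in range(len(feat[0])))
```

For a fractional sampling point such as p = (0.5, 0.5) on a 2x2 map this reproduces ordinary bilinear interpolation, and at an integer point it returns the stored value exactly.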
3. Step 2, in which the deformable ROI pooling module replaces the ROI pooling layer, operates in the two-dimensional spatial domain, and the deformable ROI pooling operation is the same across different channel dimensions. It specifically comprises adding a two-dimensional or even higher-dimensional offset to the position of each square bin of ordinary ROI pooling, thereby improving the deformable capability of the convolutional neural network and its ability to detect and identify deformable targets. First, a pooled feature map is obtained using the ROI pooling operation; then, a fully connected layer is applied to the feature map to obtain normalized offsets; finally, each normalized offset is scaled by element-wise multiplication with the width and height of the region of interest. Normalization of the offset is essential to make the learned offset invariant to the size of the region of interest, and the parameters of the subsequent fully connected layer are obtained through the back-propagation algorithm. In the deformable ROI pooling module, the gradient with respect to the offset Δp_ij can be calculated as:

∂y(i, j)/∂Δp_ij = (1/n_ij) Σ_{p∈bin(i,j)} Σ_q [∂G(q, p_0 + p + Δp_ij)/∂Δp_ij] · x(q)

where bin(i, j) denotes the set of sampling positions in the (i, j)-th pooling bin and n_ij is the number of positions in that bin.
4. The deformable convolution network can be used to improve the Faster R-CNN network. The improvement is divided into two stages. In the first stage, a full convolution network generates a feature map for the input picture; the modified VGG16 network used to extract features removes the maximum pooling layer, two 4096-unit fully connected layers and one 1000-unit fully connected layer that follow the last convolution unit, and the deformable convolution is applied to the last convolution unit, i.e., the three convolution layers conv5_1, conv5_2 and conv5_3. In the second stage, a lightweight task-specific network generates results from the input feature map: the classification-regression part of the Faster R-CNN network mainly uses an RPN network to generate preselected frames; the preselected frames and the feature map are then input into the Fast R-CNN network, where an ROI pooling layer first pools each frame to obtain features, two 1024-dimensional fully connected layers are added, and finally two parallel branches perform target regression and classification respectively to obtain the final result.
5. Step 3, in which the Faster R-CNN model is improved with a deconvolution-based structure, specifically comprises inserting an inverse pooling layer into the convolutional neural network. To apply the inverse pooling layer, the position of the maximum activation value is first recorded during the pooling operation; then, during inverse pooling, each activation value is returned to the position it occupied during pooling, and the remaining positions are set to zero; finally, the deconvolved output feature map is clipped so that its resolution is consistent with the resolution of the inverse pooling output feature map.
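The inverse-pooling step (record the argmax during pooling, then return each activation to that position and zero the rest) can be sketched with a toy 2x2 NumPy version; the helper names are illustrative:

```python
import numpy as np

def max_pool_2x2(feat):
    """2x2 max pooling that also records the argmax position of every
    window, as required by the later inverse-pooling step."""
    H, W = feat.shape
    pooled = np.zeros((H // 2, W // 2))
    argmax = np.zeros((H // 2, W // 2, 2), dtype=int)
    for i in range(H // 2):
        for j in range(W // 2):
            win = feat[2*i:2*i+2, 2*j:2*j+2]
            k = np.unravel_index(np.argmax(win), win.shape)
            pooled[i, j] = win[k]
            argmax[i, j] = (2*i + k[0], 2*j + k[1])
    return pooled, argmax

def unpool_2x2(pooled, argmax, out_shape):
    """Inverse pooling: each value returns to its recorded max position;
    all other positions are set to zero."""
    out = np.zeros(out_shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            y, x = argmax[i, j]
            out[y, x] = pooled[i, j]
    return out
```

The unpooled map is sparse; in the described structure the subsequent deconvolution densifies it back into a full-resolution feature map.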
6. Step 3, in which the Faster R-CNN model is improved with the multi-layer feature-fusion structure, specifically comprises: first, to address insufficient feature information, the features are fused before ROI pooling is performed on the multiple regions of interest, so that only one feature fusion and one normalization are needed, saving the time of repeated calculation; second, for the case of a small region of interest, the last layer of features is deconvolved and the third layer of features is max-pooled, and finally the three feature maps are fused.
Compared with the prior art, the invention has the beneficial effects that: the invention designs a deep learning intelligent identification method of a deformable living body small target, which reasonably combines a deformable convolution module and a deformable ROI pooling module with fast R-CNN according to the characteristics of the deformable living body small target, wherein the deformable convolution module is used for replacing a basic convolution unit, the deformable ROI pooling module is used for replacing an ROI pooling layer, and the sampling of a detection model can be changed along with the change of the shape of the detection target by introducing the deformable convolution and the deformable ROI pooling module, so that the detection effect of the deformable target is improved. The fast R-CNN model is improved by using inverse convolution and multi-layer feature fusion, the information quantity obtained by a small target preselection frame is richer by using the inverse convolution and the multi-layer feature fusion, and the improvement of an anchor point mechanism is that RPN can generate more small target preselection frames. Meanwhile, the method based on inverse convolution and multi-layer feature fusion has strong semantic information of high-layer features and combines the advantage of high resolution of low-layer features.
Drawings
FIG. 1 is a schematic diagram of a 3 × 3 deformable convolution;
FIG. 2 is a schematic 3X 3 deformable ROI pooling;
FIG. 3 is a schematic diagram of a modification of the deformable convolution, deformable ROI pooling to Faster R-CNN;
FIG. 4 is a schematic diagram of the deconvolution and inverse pooling operations;
FIG. 5 is a schematic of multi-layer feature fusion;
FIG. 6 is a schematic diagram of improved multi-layer feature fusion;
fig. 7 is a schematic diagram of the structure of an RPN network;
FIG. 8 is a result of deformable convolution, deformable ROI pooling real-time online identification of video frames;
FIG. 9 is a visualization of the original Faster R-CNN (left) and improved Faster R-CNN (right) sea creature target detection.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention designs a deep learning intelligent recognition method for deformable living small targets, which reasonably combines a deformable convolution module and a deformable ROI pooling module with Faster R-CNN according to the characteristics of deformable living small targets: the deformable convolution module replaces the basic convolution unit, and the deformable ROI pooling module replaces the ROI pooling layer. Meanwhile, deconvolution and multi-layer feature fusion are used to improve the Faster R-CNN model, so that the small-target preselection frames obtain a richer amount of information, and the improvement of the anchor-point mechanism enables the RPN to generate more small-target preselection frames.
The method builds on the inventors' advanced achievements in artificial intelligence research and identifies deformable living small targets accurately. Introducing the deformable convolution and deformable ROI pooling modules allows the sampling of the detection model to change with the shape of the detection target, thereby improving the detection of deformable targets. The structure based on deconvolution and multi-layer feature fusion combines the strong semantic information of high-layer features with the high resolution of low-layer features.
The invention is realized as follows:
a. The deep learning intelligent identification method for deformable living small targets mainly comprises the following steps. First, a two-dimensional or even higher-dimensional offset is added to the spatial sampling points of the standard convolution so that the shape of the convolution's sampling points can change; second, a two-dimensional or even higher-dimensional offset is added to the position of each square bin of ordinary ROI (region of interest) pooling, improving the deformable capability of the convolutional neural network and thereby its ability to detect and identify deformable targets. For the detection and identification of small targets, a structure based on deconvolution and multi-layer feature fusion is first used to improve the Faster R-CNN model, so that the small-target preselection frames obtain a richer amount of information; second, an improvement to the anchor-point mechanism allows the RPN to generate more small-target preselection frames, thereby improving the detection and identification of small targets.
b. The deformable convolution network comprises a deformable convolution module, a deformable ROI pooling module and a deformable position-sensitive ROI pooling module. The convolution and the feature maps in a convolutional neural network are both three-dimensional; the deformable convolution operates in the two-dimensional spatial domain, and the deformable convolution operation is the same across different channel dimensions. Without loss of generality, and to simplify the problem, we describe the two-dimensional operation of the model below; the extension to three dimensions is exactly the same.
Adding two-dimensional or even high-dimensional offset to the spatial sampling points of the standard convolution to enable the shape of the sampling points of the standard convolution to change; the offset is obtained by performing a convolution operation on the same input feature map, the convolution kernel of the convolution operation maintaining the same resolution and dilation values as the previous convolution layer. The output offset field has the same spatial resolution as the input signature, and the number of channels in the offset field is twice the number of channels in the input signature, which corresponds to the two-dimensional offset of convolving each sample location. In training, the convolution kernel that generates the output feature map and the convolution kernel that generates the offset field are learned simultaneously. To learn the offset domain, the gradient is obtained by inverse operation of bilinear operations in equations (1) and (2).
x(p) = Σ_q G(q, p) · x(q)    (1)

G(q, p) = g(q_x, p_x) · g(q_y, p_y)    (2)

In the formulas, p denotes the position of an arbitrary sampling point (in the deformable convolution formula, p = p_0 + p_n + Δp_n), q traverses all integral spatial positions of the input feature map x, G(·,·) denotes the bilinear interpolation kernel, and g(a, b) = max(0, 1 − |a − b|).

In the deformable convolution formula, the gradient with respect to the offset Δp_n is calculated as:

∂y(p_0)/∂Δp_n = Σ_{p_n∈R} w(p_n) · Σ_q [∂G(q, p_0 + p_n + Δp_n)/∂Δp_n] · x(q)    (3)

where the term ∂G(q, p)/∂Δp_n can be derived from formula (2). Note that Δp_n is a two-dimensional quantity; for simplicity we write ∂Δp_n in place of its two components ∂Δp_n^x and ∂Δp_n^y.
c. Likewise, the deformable ROI pooling operation also operates in the two-dimensional spatial domain and is the same across different channel dimensions. Without loss of generality, and to simplify the problem, we describe the two-dimensional operation of the model below; the extension to three dimensions is exactly the same.
A two-dimensional or even high-dimensional offset is added to the position of each square block for the common ROI (region of interest) pooling, so that the deformable capability of the convolutional neural network is improved, and the detection and identification capability of the convolutional neural network on a deformable target is improved. First, a pooled feature map is obtained using an ROI pooling operation. The signature is then followed by a fully connected layer to get the normalized offset. Finally, this normalized offset is multiplied by the elements of the width and height of the region of interest. Normalization of the offset is essential for the offset learning to be invariant to the region of interest size, and the parameters of the subsequent fully-connected layer will be obtained by the back-propagation algorithm.
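The offset-scaling step described above can be sketched in a few lines: the fully connected layer outputs one normalized 2-D offset per pooling bin, and each offset is scaled by a factor gamma and element-wise by the RoI width and height, which is what makes the learned offsets invariant to the RoI size. The function name is illustrative, and gamma = 0.1 follows the empirical setting quoted later in the text:

```python
def scale_offsets(normalized_offsets, roi_w, roi_h, gamma=0.1):
    """Turn normalized offsets (dx, dy) from the fully connected layer
    into pixel offsets by element-wise multiplication with the RoI's
    width and height, scaled by gamma."""
    return [(gamma * dx * roi_w, gamma * dy * roi_h)
            for dx, dy in normalized_offsets]
```

For example, a normalized offset of (1.0, 1.0) on a 200x100 region of interest becomes a pixel offset of roughly (20, 10), i.e., one tenth of the RoI extent in each direction.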
In the deformable ROI pooling module, the gradient with respect to the offset Δp_ij can be calculated as follows:

∂y(i, j)/∂Δp_ij = (1/n_ij) Σ_{p∈bin(i,j)} Σ_q [∂G(q, p_0 + p + Δp_ij)/∂Δp_ij] · x(q)

where bin(i, j) denotes the set of sampling positions in the (i, j)-th pooling bin and n_ij is the number of positions in that bin.
d. for Faster R-CNN, the network is intended to be divided into two phases. In the first stage, a full convolution network generates a feature map for an input picture. In the second phase, a lightweight task-based network generates results based on the input feature map. We mainly refined these two parts with deformable convolution and deformable ROI pooling.
In the first stage of the improvement of the deformable network to the Faster R-CNN network, a full convolution network generates a feature map for the input picture. A modified version of the VGG16 network, used to extract features, removes one maximum pooling layer, two 4096-unit fully connected layers and one 1000-unit fully connected layer that follow the last convolution unit. The deformable convolution is applied to the last convolution unit, namely the three convolution layers conv5_1, conv5_2 and conv5_3.
In the second stage, a lightweight task-specific network generates results from the input feature map. The classification-regression part of the Faster R-CNN network mainly uses an RPN network to generate preselected frames; the preselected frames and the feature map are then input into the Fast R-CNN network, where an ROI pooling layer first pools each frame to obtain features, two 1024-dimensional fully connected layers are added, and finally two parallel branches perform target regression and classification respectively to obtain the final result.
e. Aiming at the detection and identification of small targets, a structure based on deconvolution and multi-layer feature fusion is designed. The Faster R-CNN model is first improved by inserting an inverse pooling layer into the convolutional neural network. To apply the inverse pooling layer, the position of the maximum activation value is first recorded during the pooling operation. Then, during inverse pooling, each activation value is returned to the position it occupied during pooling, and all remaining positions are set to zero. Finally, the deconvolved output feature map needs to be clipped so that its resolution is consistent with the resolution of the inverse pooling output feature map.
In the aspect of multi-layer feature fusion, first, to address insufficient feature information, the features are fused before ROI pooling is performed on the multiple regions of interest, so that only one feature fusion and one normalization are needed, saving the time of repeated calculation. Second, for the case of a small region of interest, the last layer of features is deconvolved and the third layer of features is max-pooled, and finally the three feature maps are fused. This improves the resolution of the finally used feature map.
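The resolution-matching idea above can be sketched with a toy NumPy version: the third-layer map is pooled down, the last-layer map is brought up, and the maps are L2-normalized per spatial location before concatenation. Nearest-neighbour upsampling stands in for the learned deconvolution here, and the shapes and function names are illustrative assumptions:

```python
import numpy as np

def max_pool_2x2(feat):
    # (H, W, C) -> (H/2, W/2, C) max pooling over 2x2 windows
    H, W, C = feat.shape
    return feat.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def upsample_2x(feat):
    # Nearest-neighbour upsampling stands in for the learned deconvolution
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def fuse_features(conv3, conv4, conv5):
    """Bring the three feature maps to conv4's resolution and fuse them:
    conv3 is max-pooled down, conv5 is upsampled, then each map is
    L2-normalized per spatial location and all are concatenated along
    the channel axis."""
    maps = [max_pool_2x2(conv3), conv4, upsample_2x(conv5)]
    normed = [m / (np.linalg.norm(m, axis=2, keepdims=True) + 1e-12)
              for m in maps]
    return np.concatenate(normed, axis=2)
```

In the actual model an extra convolution would then reduce the concatenated channel count back to that of conv5; that reduction step is omitted from this sketch.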
f. The anchor point mechanism in the RPN network is modified, and a group of small-scale pre-selection frames are added in the anchor point, so that the pre-selection frames of the small targets contained in the pre-selection frames extracted last by the RPN network are more, and the detection and the identification of the small targets are facilitated.
In the Faster R-CNN network, the RPN network is used to generate preselected boxes, which are then classified and regressed by an algorithm. Therefore, if the RPN is able to generate a more appropriate pre-selection frame, the detection recognition result will also be improved.
The present invention will be described in detail with reference to the drawings:
The first implementation mode: FIG. 1 is a schematic diagram of a deformable convolution, in which applying offsets to the conventional sampling grid turns the sampling points into irregular, offset points. Since an offset is typically fractional, sampling on the input feature map is performed by bilinear interpolation. The offset is obtained by performing a convolution operation on the same input feature map; the convolution kernel of this operation keeps the same resolution and dilation values as the preceding convolution layer, the output offset field has the same spatial resolution as the input feature map, and the number of channels of the offset field is twice that of the input feature map, corresponding to the two-dimensional offset (an offset in the x-axis direction and an offset in the y-axis direction) of each sampling position of the convolution.
The second embodiment: FIG. 2 is a schematic diagram of deformable ROI pooling. First, a pooled feature map is obtained using an ROI pooling operation. Then, a fully connected layer is applied to the feature map to obtain a normalized offset Δp̂_ij. Finally, this normalized offset is scaled by element-wise multiplication with the width and height of the region of interest:

Δp_ij = γ · Δp̂_ij ∘ (w, h)

which yields the offset Δp_ij used in the pooling formula. Empirically, the scale factor γ is usually set to 0.1. Normalization of the offset is essential to make the learned offset invariant to the size of the region of interest. The parameters of the subsequent fully connected layer are obtained by the back-propagation algorithm.
The third embodiment: FIG. 3 is a schematic diagram of the modification of Faster R-CNN by deformable convolution and deformable ROI pooling. The feature extraction part of the Faster R-CNN network uses a modified VGG16 network as the base network: in order to extract features, the modified VGG16 network removes the maximum pooling layer, two 4096-unit fully connected layers and one 1000-unit fully connected layer that follow the last convolution unit. Experiments have shown that better results are obtained when the deformable convolution is applied to the last convolution unit. Therefore, the deformable convolution is applied to the last convolution unit, i.e., the three convolution layers conv5_1, conv5_2 and conv5_3.
The classification regression part of the Faster R-CNN network mainly uses an RPN network to generate a preselected frame, then the preselected frame and a feature map are input into the Fast R-CNN network, firstly, an ROI pooling layer performs ROI pooling on a frame to obtain features, two 1024-dimensional full-connected layers are added, and finally, two parallel branches are connected, and target regression and classification are respectively performed to obtain a final result. In the Fast R-CNN part, we replace the ROI pooling layer with a deformable ROI pooling layer.
The fourth embodiment: FIG. 4 is a schematic diagram of the deconvolution and inverse pooling operations. First, during the pooling operation, the position of the maximum activation value is recorded. Then, during inverse pooling, each activation value is returned to the position it occupied during pooling, and all remaining positions are set to zero. The deconvolution operation then densifies the sparse output of the inverse pooling by applying convolution-like operations over multiple layers, generating a dense feature map. However, in contrast to a convolution operation, which combines multiple inputs into one output, deconvolution maps one input to multiple outputs. Finally, the deconvolved output feature map needs to be clipped so that its resolution is consistent with the resolution of the inverse pooling output feature map.
The fifth embodiment: FIG. 5 is a schematic of multi-layer feature fusion. The combination of global and local features, such as multi-scale features, is used to enhance the Faster R-CNN network's acquisition of global texture and local information and thus improve the robustness of target detection. To enhance the detection capability of the network, a shallow feature map such as conv3 or conv4 is considered before ROI pooling is performed, so that the network can detect features containing more low-level information within the region of interest, as shown in the figure.
Embodiment six: the high-level information is deconvolved to the same resolution as the low-level information, and the multi-level features of the same resolution are then fused. FIG. 6 is a schematic diagram of the improved multi-layer feature fusion. First, the output feature maps of the three layers conv3, conv4 and conv5 are taken. Then, ROI pooling is performed on the regions of conv3, conv4 and conv5 corresponding to the region of interest; the pooled features are L2-normalized and concatenated in one layer, and the number of channels of the concatenated features is reduced to be consistent with the output features of conv5. Finally, a target classification layer and a target regression layer are connected. Since the three feature maps need to be combined, the features of the different layers are first normalized, for example with L2 normalization, and then combined.
Embodiment seven: FIG. 7 is a schematic diagram of the structure of the RPN network. The original RPN network generates nine preselected boxes at each sliding window, which are the combinations of the scales [128², 256², 512²] and the aspect ratios [1:1, 1:2, 2:1]. This choice of scales and aspect ratios gives the best test results on the PASCAL VOC data set. A set of small-scale boxes of scale 64² is added for small target objects, so that the preselected box scales become [64², 128², 256², 512²]. Thus, twelve preselected boxes are generated at each sliding window, the set of preselected boxes tilts toward small-target detection, and the detection efficiency for small targets is ultimately improved.
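The enlarged anchor set can be sketched in a few lines: four scales combined with three aspect ratios give twelve preselected boxes per sliding-window position, versus the original nine. The helper name is illustrative, not the patented RPN code:

```python
import math

def make_anchors(scales=(64, 128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Generate one (width, height) anchor per scale/ratio combination;
    r is the height/width aspect ratio and each anchor keeps area equal
    to scale squared, as in the standard RPN construction."""
    anchors = []
    for s in scales:
        area = float(s * s)
        for r in ratios:
            w = math.sqrt(area / r)
            anchors.append((w, w * r))   # (width, height)
    return anchors
```

With the added 64² scale the small anchors cover small-target preselection frames that the original [128², 256², 512²] set would miss.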
The eighth embodiment: FIG. 8 shows the results of real-time online identification of video frames with deformable convolution and deformable ROI pooling. We performed online identification experiments using the improved Faster R-CNN model; the detection rate of the improved algorithm in the experiments was 12 frames per second.
Table 1 is the test results of our online identification. The predicted value is a target value of each kind of result obtained by algorithm prediction, and the true value is a value obtained by manually labeling the real-time detection video. As can be seen from Table 1, the predicted value of the algorithm is close to the true value, which shows that the improved algorithm has better detection robustness to the deformation problem of marine organisms encountered in real-time detection. FIG. 8 is a test result of identifying certain frames in a video online. As can be seen from fig. 8, the detection result is stable, which indicates that the improved algorithm has a better detection performance for the disturbance deformable target in the unstable imaging environment.
TABLE 1 Online identification test results
The ninth embodiment: Table 2 shows the detection results of the original Faster R-CNN algorithm and the improved Faster R-CNN algorithm on targets of different scales in the marine biological data.
TABLE 2
As can be seen from Table 2, the improved Faster R-CNN improves the detection results for targets of different scales, and the improvement in small-target detection is obvious. The small-target detection results of the original and the improved Faster R-CNN algorithm are mAP values (IoU threshold 0.5) of 35.45 and 42.95 respectively, an improvement of 21.16%; compared with the original algorithm, the improvement of the improved Faster R-CNN on small-target detection is obvious. Under the stricter evaluation index of an IoU threshold of 0.7, the small-target detection results of the original and improved algorithms are mAP values of 22.40 and 29.78 respectively, an improvement of 32.94%, which further illustrates the improvement of the algorithm on small-target detection.
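The quoted relative improvements can be checked directly from the Table 2 figures:

```python
def relative_gain(before, after):
    # Percentage improvement of `after` over `before`
    return 100.0 * (after - before) / before

# mAP on small targets: 35.45 -> 42.95 at IoU 0.5, 22.40 -> 29.78 at IoU 0.7
gain_iou05 = relative_gain(35.45, 42.95)   # about 21.16 %
gain_iou07 = relative_gain(22.40, 29.78)   # about 32.9 %
```

Both values match the percentages stated in the text to within rounding.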
In summary, the invention introduces the deformable network and improves the model with its deformable convolution module and deformable ROI pooling module: two-dimensional or even higher-dimensional offsets are added to the spatial sampling points of the standard convolution and of ordinary ROI (region of interest) pooling, so that the convolution's sampling points can change shape, the deformable capability of the improved model is enhanced, and the detection and identification of deformable targets is improved. Fusion of feature maps from different layers is also considered: the bottom-layer feature maps are pooled to reduce resolution, the high-layer features are deconvolved to increase resolution, and the low-, middle- and high-layer feature maps are then fused. Meanwhile, a group of small-scale preselection frames is added, increasing the number of generated small-target preselection frames, so that the improved model also improves the detection and identification of small targets.