Monocular three-dimensional target detection method based on convolution attention and feature decoupling
Technical Field
The invention relates to a monocular three-dimensional target detection method.
Background
Monocular three-dimensional object detection is a challenging task in the field of autonomous driving. Early monocular methods used only a single image as input [1], and impressive progress was made by exploiting the geometric constraints between two and three dimensions. Although monocular three-dimensional object detection has the advantage of low cost, estimating depth from a single image is difficult, so the performance of monocular three-dimensional object detection is still far from satisfactory.
In the field of monocular three-dimensional object detection, image-only methods initially rely on a single image to predict objects. However, the lack of depth information in images poses a challenge. Some approaches rely on geometric consistency to address this limitation. Document [2] combines the geometric relationships of two-dimensional and three-dimensional projections to construct a three-dimensional target region proposal network. Document [3] further improves three-dimensional detection performance by exploring pairwise spatial relationships. Document [4] introduces the bottom surface of the three-dimensional bounding box of the object as a ground plane to mitigate interference from object-independent properties. Document [5] regards monocular target depth estimation as a progressive refinement problem and proposes a joint semantic and geometric cost volume to model depth errors. To address the limitations of monocular three-dimensional object detection caused by the lack of depth cues, researchers have proposed additional methods that utilize depth information during training [6]. Document [7] focuses on fusing images and estimated depth by using a specially designed convolutional network. Document [8] utilizes a graphical model to efficiently extract context information from neighboring point clouds. Document [9] reconstructs two-dimensional image coordinates by projecting three-dimensional coordinates, and learns target geometric information in a self-supervised manner.
Inspired by visual self-attention models (Vision Transformers, ViTs) with powerful self-attention mechanisms and global feature extraction capabilities [10], some works have successfully applied visual self-attention models to monocular object detection in autonomous driving scenarios, further improving detection accuracy. Document [11] proposes the first monocular three-dimensional object detection network based on a self-attention model, which effectively integrates visual and depth features and improves the accuracy of monocular three-dimensional object detection. Furthermore, document [12] proposes detecting objects with an encoder-decoder paradigm and using the Hungarian matching algorithm for output prediction. Document [13] adds a depth encoder and a depth-guided decoder for adaptive scene-level depth understanding on top of the visual encoder and decoder of [12], which significantly improves the accuracy of monocular three-dimensional object detection.
While monocular three-dimensional object detection methods based on visual self-attention models achieve a certain effect, there is still room to refine them further. One possible solution is to introduce convolutional local features, which offer a degree of shift, scale, and distortion invariance, into visual self-attention. In addition, existing visual self-attention models all feed visual features and depth features into the same decoder for decoding, and, owing to the nature of monocular tasks, inaccurate depth information can interfere with the model's learning of other information.
Reference is made to:
[1]Ku J,Pon A D,Waslander S L.Monocular 3d object detection leveraging accurate proposals and shape reconstruction[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.2019:11867-11876.
[2]Brazil G,Liu X.M3d-rpn:Monocular 3d region proposal network for object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:9287-9296.
[3]Chen Y,Tai L,Sun K,et al.Monopair:Monocular 3d object detection using pairwise spatial relationships[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:12093-12102.
[4]Qin Z,Li X.Monoground:Detecting monocular 3d objects from the ground[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:3793-3802.
[5]Lian Q,Li P,Chen X.Monojsg:Joint semantic and geometric cost volume for monocular 3d object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:1070-1079.
[6]Wang Y,Chao W L,Garg D,et al.Pseudo-lidar from visual depth estimation:Bridging the gap in 3d object detection for autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:8445-8453.
[7]Ding M,Huo Y,Yi H,et al.Learning depth-guided convolutions for monocular 3d object detection[C]//Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition workshops.2020:1000-1001.
[8]Wang L,Du L,Ye X,et al.Depth-conditioned dynamic message propagation for monocular 3d object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:454-463.
[9]Chen H,Huang Y,Tian W,et al.Monorun:Monocular 3d object detection by reconstruction and uncertainty propagation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:10379-10388.
[10]Han K,Wang Y,Chen H,et al.A survey on vision transformer[J].IEEE transactions on pattern analysis and machine intelligence,2022,45(1):87-110.
[11]Huang K C,Wu T H,Su H T,et al.Monodtr:Monocular 3d object detection with depth-aware transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:4012-4021.
[12]Carion N,Massa F,Synnaeve G,et al.End-to-end object detection with transformers[C]//European conference on computer vision.Cham:Springer International Publishing,2020:213-229.
[13]Zhang R,Qiu H,Wang T,et al.MonoDETR:depth-guided transformer for monocular 3D object detection[J].arXiv preprint arXiv:2203.13310,2022.
[14]Huang K C,Wu T H,Su H T,et al.Monodtr:Monocular 3d object detection with depth-aware transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:4012-4021.
Disclosure of the Invention
Aiming at the problems in the prior art, this patent discloses a monocular three-dimensional target detection method based on multi-scale asymmetric convolution attention and feature decoupling. The method introduces asymmetric convolution into a self-attention model with decoupled depth and visual features. A multi-scale detail convolution encoder aggregates local features of the input image and encodes them with depthwise convolutions using several convolution kernels of different shapes, so as to capture more detail in the image. The decoupled architecture allows visual and depth features to be learned independently without interfering with each other. Furthermore, the combination of convolution and self-attention gives the method both local and global modeling capability. Experimental results on the KITTI dataset show that the method performs well compared with other advanced methods. The technical scheme is as follows:
a monocular three-dimensional target detection method based on convolution attention and feature decoupling comprises the following steps:
Step one: given an input image with resolution H×W, a feature map F is output through a backbone network; taking F as the original feature, a visual feature F_V = Conv(F) is generated using a series of convolution layers; after that, a depth feature f_D is implicitly learned with a depth estimator under the auxiliary supervision of the depth map, as follows:
for the original feature F, two convolution layers are employed to predict the probability of discrete depth intervals D; the probability represents the confidence that the depth value of each pixel belongs to a certain depth interval; the depth ground truth is discretized from continuous space into discrete intervals d_i using linear-increment discretization, where i is the depth interval index and d_i denotes the i-th discrete interval counted from shallow to deep;
the intermediate feature map X = Conv(F) represents the initial depth-aware feature; the feature center of each depth interval, i.e., the depth prototype, is calculated by aggregating the depth-aware features of the pixels belonging to that interval; group convolution is used to generate the predicted depth intervals, reducing the number of intervals from N to N' = N/r with a set ratio r; the depth prototype F_d is generated by weighting the features X' of all pixels into depth class m according to their probabilities, where X'_i denotes the feature of the i-th depth-interval pixel in X' = Conv(X) and L is the set of pixels in the feature map X'; based on the prototype F_d, a new depth feature, denoted f_D, is reconstructed;
Step two: the visual features and the depth features are respectively input into a visual multi-scale detail convolution encoder and a depth multi-scale detail convolution encoder for processing; the multi-scale detail convolution encoder integrates a multi-scale asymmetric convolution attention module to aggregate local features of the input image and encode the aggregated local features, as follows:
the input visual feature and depth feature each pass through a multi-scale asymmetric convolution attention module to obtain the attention map Attention and the output f_out of the multi-scale asymmetric convolution attention module;
in the multi-scale asymmetric convolution attention module, the input features are first asymmetrically convolved to aggregate local information; multi-branch asymmetric convolutions then capture multi-scale context information, and a 1×1 convolution models correlations along the channel dimension;
the output f_out of the multi-scale asymmetric convolution attention module passes through a feed-forward network and a batch normalization layer to generate the final output f_c of the multi-scale detail convolution encoder;
Step three: the dual-encoder-dual-decoder structure of the decoupled-feature-guided self-attention model encodes and decodes the spatial and appearance information of the input image respectively; the decoupled-feature-guided self-attention model has a depth branch and a visual branch, the two branches adopt the same structure, and their parameters are trained separately; the two branches of the decoupled-feature-guided self-attention model take the output of the depth multi-scale detail convolution encoder and the output of the visual multi-scale detail convolution encoder as their respective inputs;
Step four: a cross-attention-guided fusion module is used to fuse the output F_D of the depth branch and the output F_V of the visual branch; in the cross-attention-guided fusion module, F_D serves as the input query Q_FD = Linear(F_D), and F_V provides K_FV, V_FV = Linear(F_V); these are fed into a cross-attention layer, and the output of the cross-attention layer passes through a layer normalization layer to obtain the final output F_fused of the fusion module;
Step five: a single-stage detector with predefined two-dimensional-three-dimensional anchors is adopted to regress the bounding boxes; each predefined anchor is composed of the parameters of a two-dimensional bounding box [x_2d, y_2d, w_2d, h_2d] and a three-dimensional bounding box [x_p, y_p, z, w_3d, h_3d, l_3d, θ], where [x_2d, y_2d] and [x_p, y_p] denote the center of the two-dimensional box and the center of the three-dimensional object projected onto the image plane, [w_2d, h_2d] and [w_3d, h_3d, l_3d] denote the physical dimensions of the two-dimensional and three-dimensional bounding boxes respectively, z denotes the depth of the three-dimensional object center, and θ denotes the observation angle; during training, all ground-truth values are projected into two-dimensional space to calculate the intersection with all two-dimensional anchors; anchors with intersection greater than 0.5 are selected, and the corresponding three-dimensional boxes are optimized;
Step six: for each anchor, a two-dimensional bounding box [t_x, t_y, t_w, t_h]_2d and a three-dimensional bounding box [t_x, t_y, t_w, t_h, t_l, t_z, t_θ]_3d are predicted to parameterize the residual values of the two-dimensional and three-dimensional bounding boxes, classification scores are predicted, and the restored bounding box is output according to the anchors and the network predictions.
Further, in step one, the depth ground truth is discretized from continuous space into discrete intervals d_i using linear-increment discretization, where N is the number of depth intervals, [d_min, d_max] is the depth range, i is the depth interval index, and d_i denotes the i-th interval counted from shallow to deep; pixels with depth values outside the range are marked as invalid and not used for optimization during training.
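A linear-increasing discretization (LID) of the kind used by depth-bin-based detectors such as [14], written with the symbols N, d_min, d_max, and i defined above, can serve as a concrete reference; the exact form below is an assumption rather than the patent's verbatim formula:

```latex
d_{i} = d_{\min} + \frac{d_{\max} - d_{\min}}{N\,(N+1)}\; i\,(i+1), \qquad i = 0, 1, \dots, N-1
```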
Further, in step one, the number of depth intervals N is set to 96, the depth range [d_min, d_max] is set to [1, 80], and r = 2.
Further, in the second step, the asymmetric convolution is expressed as:
wherein BN denotes a batch normalization operation, and γ_i and β_i are learnable parameters in the BN operation, i = 1, 2, 3.
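One concrete form of such an asymmetric convolution, assuming an ACNet-style combination of a square kernel with its two one-dimensional counterparts, each followed by its own batch normalization (matching the three pairs of parameters γ_i, β_i above), is:

```latex
\mathrm{Asy\_Conv}(f) =
\mathrm{BN}_{\gamma_{1},\beta_{1}}\big(\mathrm{Conv}_{k\times k}(f)\big) +
\mathrm{BN}_{\gamma_{2},\beta_{2}}\big(\mathrm{Conv}_{1\times k}(f)\big) +
\mathrm{BN}_{\gamma_{3},\beta_{3}}\big(\mathrm{Conv}_{k\times 1}(f)\big)
```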
Further, the multi-scale asymmetric convolution attention module in the second step is:
f_out = Attention × f
wherein f represents the input feature f_D or f_V, Asy_Conv represents the asymmetric convolution, Scale_i, i ∈ {0, 1, 2, 3}, represents the i-th branch, and Attention and f_out represent the attention map and the output of the multi-scale asymmetric convolution attention module, respectively.
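Assuming an MSCA-style design in which Scale_0 is the identity branch and a 1×1 convolution aggregates the summed branch outputs, the attention map and the module output can be written as follows (an assumed form consistent with the symbols above, not the patent's verbatim equation):

```latex
\mathrm{Attention} = \mathrm{Conv}_{1\times 1}\!\Big(\sum_{i=0}^{3}\mathrm{Scale}_{i}\big(\mathrm{Asy\_Conv}(f)\big)\Big), \qquad
f_{out} = \mathrm{Attention} \times f
```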
Further, in the third step, the depth branch is processed as follows:
1) the output of the depth multi-scale detail convolution encoder is taken as the input feature of the depth branch of the decoupled-feature-guided self-attention model;
2) in the encoder of the depth branch, this input feature is encoded by a self-attention layer and a layer normalization layer:
wherein Linear represents a linear transformation, softmax is the activation function, C represents the dimension of the input features, LN represents the layer normalization operation, and A represents the attention score; the result is the output of the encoder in the depth branch;
3) in the decoder of the depth branch, the encoder output serves as the input query, and the position encoding P provides K_P, V_P = Linear(P), which are fed into a cross-attention layer; the output of the cross-attention layer is decoded by a self-attention layer and a layer normalization layer to obtain the output F_D of the depth branch; the cross-attention layer is implemented as follows:
wherein A_D represents the attention score of the cross-attention layer;
the visual branch is processed in the same way to obtain its output F_V.
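A scaled dot-product formulation consistent with the symbols Linear, softmax, C, LN, A, and A_D is sketched below; the scaling by √C, the placement of LN, and the stand-in names f_c^D (output of the depth multi-scale detail convolution encoder) and F̂_D (encoder output of the depth branch) are assumptions rather than the patent's verbatim equations:

```latex
% Encoder of the depth branch (self-attention followed by layer normalization)
Q = K = V = \mathrm{Linear}\big(f^{D}_{c}\big), \qquad
A = \mathrm{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{C}}\right), \qquad
\hat{F}_{D} = \mathrm{LN}(AV)

% Decoder of the depth branch (cross-attention with the position encoding P)
Q_{D} = \mathrm{Linear}\big(\hat{F}_{D}\big), \qquad
K_{P}, V_{P} = \mathrm{Linear}(P), \qquad
A_{D} = \mathrm{softmax}\!\left(\tfrac{Q_{D}K_{P}^{\top}}{\sqrt{C}}\right)
```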
Further, in step six, the output bounding box is restored as follows:
wherein the restored quantities denote the recovery parameters of the three-dimensional object; the two-dimensional box center [x_2d, y_2d] and the three-dimensional projection center [x_p, y_p] share the same anchor center.
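A typical anchor-based recovery consistent with the parameterization of steps five and six is sketched below, with primes marking the recovered values; the exact offset and log-size transforms are assumptions rather than the patent's verbatim formulas:

```latex
x'_{2d} = x_{2d} + t^{2d}_{x} w_{2d}, \quad
y'_{2d} = y_{2d} + t^{2d}_{y} h_{2d}, \quad
w'_{2d} = w_{2d}\, e^{t^{2d}_{w}}, \quad
h'_{2d} = h_{2d}\, e^{t^{2d}_{h}}

x'_{p} = x_{2d} + t^{3d}_{x} w_{2d}, \quad
y'_{p} = y_{2d} + t^{3d}_{y} h_{2d}, \quad
z' = z + t^{3d}_{z}

w'_{3d} = w_{3d}\, e^{t^{3d}_{w}}, \quad
h'_{3d} = h_{3d}\, e^{t^{3d}_{h}}, \quad
l'_{3d} = l_{3d}\, e^{t^{3d}_{l}}, \quad
\theta' = \theta + t^{3d}_{\theta}
```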
Drawings
FIG. 1 is a diagram of the overall structure of the method
FIG. 2 visualizes three-dimensional bounding boxes for the car class on the KITTI dataset
Detailed Description
The method of the invention belongs to supervised learning: the model must first be trained with supervision, and after the optimal model is obtained through training, it is used to perform detection on new data. The overall structure of the invention is shown in FIG. 1. To make the technical scheme of the invention clearer, the invention is further described in detail below. The invention is realized by the following steps.
Step one: given an input image of resolution H×W, the feature map F is output via the backbone network DLA-102. With F as the original feature, a series of convolution layers generates the visual feature F_V = Conv(F). A depth estimator then implicitly learns the depth feature f_D under the auxiliary supervision of the depth map [14], as follows:
The specific generation process of the depth feature f_D is as follows: for the original feature F, two convolution layers are employed to predict the probability of discrete depth intervals D, where N is the number of depth intervals; the probability represents the confidence that the depth value of each pixel belongs to a certain depth interval. The depth truth values are discretized from continuous space into discrete intervals d_i using linear-increment discretization, where i is the depth interval index and d_i denotes the i-th interval counted from shallow to deep. The number of depth intervals N is set to 96, and the depth range [d_min, d_max] is set to [1, 80]. Pixels whose depth values fall outside this range are marked as invalid and not used for optimization during training. The intermediate feature map X = Conv(F) represents the initial depth-aware feature. To further enhance the depth representation, a central representation of the corresponding depth interval is introduced to enhance the feature of each pixel. The feature center (i.e., depth prototype) of each depth interval is calculated by aggregating the depth-aware features of the pixels belonging to the specified interval. In practice, group convolution is first applied to generate the predicted depth intervals, reducing the number of intervals from N to N' = N/r with ratio r (r = 2 in this example) to share similar depth cues and reduce the computational cost. The depth prototype F_d is generated by weighting the features X' of all pixels into depth class m according to their probabilities, where X'_i denotes the feature of the i-th depth-interval pixel in X' = Conv(X), L is the set of pixels in the feature map X', and the weight is the normalized probability of the m-th depth prototype. In this way, F_d can represent the global context information of each depth interval.
Furthermore, a new depth feature f_D is reconstructed based on the depth prototype representation, which allows each pixel to understand the representation of the depth intervals from a global view.
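The following PyTorch-style sketch illustrates one way the prototype aggregation and depth-feature reconstruction described above could be implemented; the tensor shapes, the normalization choices, and the function name are illustrative assumptions rather than the patent's exact implementation:

```python
import torch
import torch.nn.functional as F_nn

def depth_prototypes_and_reconstruction(depth_prob, x_prime):
    """Aggregate per-interval depth prototypes and reconstruct the depth feature f_D.

    depth_prob: (B, N', H, W) predicted probabilities over N' depth intervals
    x_prime:    (B, C,  H, W) depth-aware features X' = Conv(X)
    Returns f_D of shape (B, C, H, W).
    """
    B, Np, H, W = depth_prob.shape
    C = x_prime.shape[1]
    prob = depth_prob.flatten(2)                           # (B, N', H*W)
    prob = prob / (prob.sum(dim=2, keepdim=True) + 1e-6)   # normalize over the pixel set L
    feats = x_prime.flatten(2)                             # (B, C, H*W)
    # Depth prototype F_d: probability-weighted sum of pixel features per interval
    prototypes = torch.einsum('bnl,bcl->bnc', prob, feats)       # (B, N', C)
    # Reconstruct f_D: each pixel re-reads the prototypes with its own interval probabilities
    pixel_prob = F_nn.softmax(depth_prob.flatten(2), dim=1)      # (B, N', H*W)
    f_D = torch.einsum('bnl,bnc->bcl', pixel_prob, prototypes)   # (B, C, H*W)
    return f_D.reshape(B, C, H, W)
```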
Step two: the visual features and the depth features are respectively input into a visual multi-scale detail convolution encoder and a depth multi-scale detail convolution encoder for processing. The multi-scale detail convolution encoder integrates a multi-scale asymmetric convolution attention module to aggregate local features of the input image and encode them, as follows:
the input features pass through a multi-scale asymmetric convolution attention module:
f_out = Attention × f
wherein f represents the input feature f_D or f_V, Asy_Conv represents the asymmetric convolution, Scale_i, i ∈ {0, 1, 2, 3}, represents the i-th branch, and Attention and f_out represent the attention map and the output of the multi-scale asymmetric convolution attention module, respectively.
In the multi-scale asymmetric convolution attention module, the input features first aggregate local information through a 5×5 asymmetric convolution; multi-branch asymmetric convolutions then capture multi-scale context information, with the convolution kernel size of the branches set to 7, 11, and 21, respectively; finally, a 1×1 convolution models correlations along the channel dimension. In the asymmetric convolution, BN represents the batch normalization operation, and γ and β are learnable parameters of the BN operation.
The output f_out of the multi-scale asymmetric convolution attention module passes through a feed-forward network and a batch normalization layer to generate the final output of the multi-scale detail convolution encoder, f_c = BN(FFN(f_out)), where FFN denotes the feed-forward network.
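As a concrete illustration of step two, the following PyTorch-style sketch implements a multi-scale asymmetric convolution attention module and the surrounding multi-scale detail convolution encoder; the depthwise 1×k/k×1 branch decomposition, the channel widths, and the FFN design are assumptions in the spirit of the description (5×5 local aggregation, branch kernel sizes 7, 11, and 21, a 1×1 convolution, then FFN and BN), not a verbatim reproduction of the patent's implementation:

```python
import torch
import torch.nn as nn

class AsymmetricBranch(nn.Module):
    """One multi-scale branch: a depthwise 1×k convolution followed by a depthwise k×1 convolution."""
    def __init__(self, channels, k):
        super().__init__()
        self.conv_1xk = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.conv_kx1 = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)

    def forward(self, x):
        return self.conv_kx1(self.conv_1xk(x))

class MultiScaleAsymConvAttention(nn.Module):
    """Multi-scale asymmetric convolution attention: attention map multiplied with the input."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)  # 5×5 local aggregation
        self.branches = nn.ModuleList([AsymmetricBranch(channels, k) for k in (7, 11, 21)])
        self.channel_mix = nn.Conv2d(channels, channels, 1)  # 1×1 channel-correlation modeling

    def forward(self, f):
        base = self.local(f)                          # local aggregation (identity-like Scale_0 branch)
        multi = base + sum(branch(base) for branch in self.branches)
        attention = self.channel_mix(multi)           # Attention map
        return attention * f                          # f_out = Attention × f

class MultiScaleDetailConvEncoder(nn.Module):
    """f_c = BN(FFN(f_out)), with a simple convolutional feed-forward network."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        self.attn = MultiScaleAsymConvAttention(channels)
        self.ffn = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1),
            nn.GELU(),
            nn.Conv2d(channels * expansion, channels, 1),
        )
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, f):
        return self.bn(self.ffn(self.attn(f)))
```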
Step three: the dual-encoder-dual-decoder structure of the decoupled-feature-guided self-attention model is used to encode and decode the spatial and appearance information of the input image respectively. The decoupled-feature-guided self-attention model has a depth branch and a visual branch; the two branches adopt the same structure, and their parameters are trained separately. The two branches take the output of the depth multi-scale detail convolution encoder and the output of the visual multi-scale detail convolution encoder as their respective inputs. Taking the depth branch as an example, the method is as follows:
1) the output of the depth multi-scale detail convolution encoder is taken as the input feature of the depth branch of the decoupled-feature-guided self-attention model;
2) in the encoder of the depth branch, this input feature is encoded by a self-attention layer and a layer normalization layer:
where Linear represents a linear transformation, softmax is the activation function, C represents the dimension of the input features, and LN represents the layer normalization operation; A represents the attention score, and the result is the output of the encoder in the depth branch;
3) in the decoder of the depth branch, the encoder output serves as the input query, and the position encoding P provides K_P, V_P = Linear(P), which are fed into a cross-attention layer. The output of the cross-attention layer then completes the decoding process through a self-attention layer and a layer normalization layer. The cross-attention layer is implemented as follows:
wherein A_D represents the attention score of the cross-attention layer; the visual branch is processed in the same way to obtain its output F_V;
Step four: a cross-attention-guided fusion module is used to fuse the output F_D of the depth branch and the output F_V of the visual branch. In the cross-attention-guided fusion module, F_D serves as the input query Q_FD = Linear(F_D), and F_V provides K_FV, V_FV = Linear(F_V); these are fed into a cross-attention layer. The output of the cross-attention layer then passes through a layer normalization layer to obtain the final output F_fused of the fusion module.
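The following PyTorch-style sketch shows one way the decoupled depth and visual branches of step three and the cross-attention-guided fusion of step four could be wired together; the use of nn.MultiheadAttention, the head count, and the handling of the position encoding P are illustrative assumptions rather than the patent's exact design:

```python
import torch
import torch.nn as nn

class DecoupledBranch(nn.Module):
    """One branch (depth or visual): encoder (self-attention + LN) and decoder (cross-attention with P, then self-attention + LN)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.enc_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.enc_norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_norm = nn.LayerNorm(dim)

    def forward(self, feat, pos):
        # feat: (B, L, C) flattened output of the corresponding multi-scale detail convolution encoder
        # pos:  (B, L, C) position encoding P
        enc, _ = self.enc_attn(feat, feat, feat)
        enc = self.enc_norm(enc)                    # encoder output of the branch
        dec, _ = self.cross_attn(enc, pos, pos)     # query from the encoder output, K/V from P
        out, _ = self.dec_attn(dec, dec, dec)
        return self.dec_norm(out)                   # branch output F_D or F_V

class CrossAttentionFusion(nn.Module):
    """Fuse the branch outputs: query from F_D, K/V from F_V, followed by layer normalization."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_depth, f_visual):
        fused, _ = self.cross_attn(f_depth, f_visual, f_visual)
        return self.norm(fused)                     # F_fused

# Example wiring (shapes and dimensions assumed): two independently trained branches, then fusion.
# depth_branch, visual_branch = DecoupledBranch(256), DecoupledBranch(256)
# fusion = CrossAttentionFusion(256)
# F_D = depth_branch(depth_feat, pos); F_V = visual_branch(visual_feat, pos)
# F_fused = fusion(F_D, F_V)
```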
Step five: a single-stage detector with predefined two-dimensional-three-dimensional anchors is adopted to regress the bounding boxes. Each predefined anchor is composed of the parameters of a two-dimensional bounding box [x_2d, y_2d, w_2d, h_2d] and a three-dimensional bounding box [x_p, y_p, z, w_3d, h_3d, l_3d, θ]. [x_2d, y_2d] and [x_p, y_p] denote the center of the two-dimensional box and the center of the three-dimensional object projected onto the image plane. [w_2d, h_2d] and [w_3d, h_3d, l_3d] denote the physical dimensions of the two-dimensional and three-dimensional bounding boxes, respectively. z denotes the depth of the three-dimensional object center, and θ denotes the observation angle. During training, all ground-truth values are projected into two-dimensional space to calculate the intersection with all two-dimensional anchors; anchors with intersection greater than 0.5 are selected, and the corresponding three-dimensional boxes are optimized;
Step six: [t_x, t_y, t_w, t_h]_2d and [t_x, t_y, t_w, t_h, t_l, t_z, t_θ]_3d are predicted for each anchor to parameterize the residual values of the two-dimensional and three-dimensional bounding boxes, and the classification scores are predicted. Based on the anchors and the network predictions, the output bounding box is recovered as follows:
where the restored quantities denote the recovery parameters of the three-dimensional object. The two-dimensional box center [x_2d, y_2d] and the three-dimensional projection center [x_p, y_p] share the same anchor center.
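A minimal sketch of the anchor-based recovery in steps five and six, assuming a common offset-plus-log-size parameterization (the exact transforms are assumptions, and the dictionary keys are purely illustrative):

```python
import math

def decode_boxes(anchor, deltas_2d, deltas_3d):
    """Recover the 2D and 3D boxes from one anchor and its predicted residuals.

    anchor:    dict with x2d, y2d, w2d, h2d, z, w3d, h3d, l3d, theta
    deltas_2d: (tx, ty, tw, th) residuals for the 2D box
    deltas_3d: (tx, ty, tw, th, tl, tz, ttheta) residuals for the 3D box
    """
    tx2, ty2, tw2, th2 = deltas_2d
    tx3, ty3, tw3, th3, tl3, tz3, tth3 = deltas_3d
    box2d = {
        "x": anchor["x2d"] + tx2 * anchor["w2d"],
        "y": anchor["y2d"] + ty2 * anchor["h2d"],
        "w": anchor["w2d"] * math.exp(tw2),
        "h": anchor["h2d"] * math.exp(th2),
    }
    box3d = {
        # The projected 3D center shares the anchor center with the 2D box
        "xp": anchor["x2d"] + tx3 * anchor["w2d"],
        "yp": anchor["y2d"] + ty3 * anchor["h2d"],
        "z": anchor["z"] + tz3,
        "w": anchor["w3d"] * math.exp(tw3),
        "h": anchor["h3d"] * math.exp(th3),
        "l": anchor["l3d"] * math.exp(tl3),
        "theta": anchor["theta"] + tth3,
    }
    return box2d, box3d
```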
The present invention has been tested on the autonomous driving dataset KITTI, which was jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago for research in the field of autonomous driving. The authors collected real traffic scenes for up to 6 hours, and the dataset consists of corrected and synchronized images, lidar scans, high-precision GPS information, IMU acceleration information, and other multi-modal data. The KITTI dataset contains 7481 images for training and 7518 images for testing. The ground truth of the test set is not publicly released, so the experimental results on the test set are obtained by submitting the method's predictions to the official KITTI website. Following other literature, the invention divides the training samples into a training set (3712 images) and a validation set (3769 images).
Model training, validation, and testing are carried out on this dataset; cars, pedestrians, and cyclists in the scene are detected from a single input image, and three-dimensional bounding boxes are output. The results show that on the validation set, when detecting the car class at IoU = 0.7, the three-dimensional average precision under the easy, moderate, and hard settings is 29.70%, 20.64%, and 17.05%, respectively. On the test set, when detecting the car class at IoU = 0.7, the three-dimensional average precision under the easy, moderate, and hard settings is 24.27%, 17.06%, and 14.76%, respectively; when detecting the pedestrian class at IoU = 0.5, it is 13.30%, 8.25%, and 7.38%, respectively; when detecting the cyclist class at IoU = 0.5, it is 10.67%, 6.47%, and 5.62%, respectively. These results are more accurate than those of other detection models, indicating that the model can learn and accurately detect targets of different categories. The detection results for car-class targets on the KITTI dataset are visualized in FIG. 2. The results show that the model can accurately detect target objects and that the detection results are close to the ground truth, demonstrating the excellent performance of the model.