Monocular three-dimensional target detection method based on convolution attention and feature decoupling
Technical Field
The invention relates to a monocular three-dimensional target detection method.
Background
Monocular three-dimensional object detection is a challenging task in the field of autonomous driving. Early monocular methods used only a single image as input [1], and impressive progress was made by exploiting the geometric constraints between two and three dimensions. Although monocular three-dimensional object detection has the advantage of low cost, estimating depth from a single image is difficult, so the performance of monocular three-dimensional object detection is still far from satisfactory.
In the field of monocular three-dimensional object detection, image-only methods initially rely on a single image to predict objects. However, the lack of depth information in images poses a challenge. Some approaches rely on geometric consistency to address this limitation. Document [2] combines the geometric relationships of two-dimensional and three-dimensional projections to construct a three-dimensional target region proposal network. Document [3] further improves three-dimensional detection performance by exploring pairwise spatial relationships. Document [4] introduces the bottom surface of the three-dimensional bounding box of the object as a ground plane to mitigate interference from object-independent properties. Document [5] regards monocular target depth estimation as a progressive refinement problem and proposes a joint semantic and geometric cost volume to model depth errors. To address the limitations of monocular three-dimensional object detection caused by the lack of depth cues, researchers have proposed additional methods that utilize depth information during training [6]. Document [7] focuses on fusing images and estimated depth by using a specially designed convolutional network. Document [8] utilizes a graphical model to efficiently extract context information from neighboring point clouds. Document [9] reconstructs two-dimensional image coordinates by projecting three-dimensional coordinates, and learns target geometric information in a self-supervised manner.
Inspired by visual self-attention models (Vision Transformers, ViTs) with powerful self-attention mechanisms and global feature extraction capabilities [10], some works have successfully applied visual self-attention models to monocular object detection in autonomous driving scenarios, further improving detection accuracy. Document [11] proposes the first monocular three-dimensional object detection network based on a self-attention model, which effectively integrates visual and depth features and improves the accuracy of monocular three-dimensional object detection. Furthermore, document [12] proposes detecting objects with an encoder-decoder paradigm and using the Hungarian matching algorithm for output prediction. Document [13] adds a depth encoder and a depth-guided decoder for adaptive scene-level depth understanding on top of the visual encoder and decoder of [12], which significantly improves the accuracy of monocular three-dimensional object detection.
While monocular three-dimensional object detection methods based on visual self-attention models achieve a certain effect, there is still room to refine them further. One possible solution is to introduce convolutional local features, which offer a degree of shift, scale, and distortion invariance, into visual self-attention. In addition, existing visual self-attention models all feed visual features and depth features into the same decoder for decoding, and, owing to the nature of monocular tasks, inaccurate depth information can interfere with the model's learning of other information.
Reference is made to:
[1]Ku J,Pon A D,Waslander S L.Monocular 3d object detection leveraging accurate proposals and shape reconstruction[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.2019:11867-11876.
[2]Brazil G,Liu X.M3d-rpn:Monocular 3d region proposal network for object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:9287-9296.
[3]Chen Y,Tai L,Sun K,et al.Monopair:Monocular 3d object detection using pairwise spatial relationships[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:12093-12102.
[4]Qin Z,Li X.Monoground:Detecting monocular 3d objects from the ground[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:3793-3802.
[5]Lian Q,Li P,Chen X.Monojsg:Joint semantic and geometric cost volume for monocular 3d object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:1070-1079.
[6]Wang Y,Chao W L,Garg D,et al.Pseudo-lidar from visual depth estimation:Bridging the gap in 3d object detection for autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:8445-8453.
[7]Ding M,Huo Y,Yi H,et al.Learning depth-guided convolutions for monocular 3d object detection[C]//Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition workshops.2020:1000-1001.
[8]Wang L,Du L,Ye X,et al.Depth-conditioned dynamic message propagation for monocular 3d object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:454-463.
[9]Chen H,Huang Y,Tian W,et al.Monorun:Monocular 3d object detection by reconstruction and uncertainty propagation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:10379-10388.
[10]Han K,Wang Y,Chen H,et al.A survey on vision transformer[J].IEEE transactions on pattern analysis and machine intelligence,2022,45(1):87-110.
[11]Huang K C,Wu T H,Su H T,et al.Monodtr:Monocular 3d object detection with depth-aware transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:4012-4021.
[12]Carion N,Massa F,Synnaeve G,et al.End-to-end object detection with transformers[C]//European conference on computer vision.Cham:Springer International Publishing,2020:213-229.
[13]Zhang R,Qiu H,Wang T,et al.MonoDETR:depth-guided transformer for monocular 3D object detection[J].arXiv preprint arXiv:2203.13310,2022.
[14]Huang K C,Wu T H,Su H T,et al.Monodtr:Monocular 3d object detection with depth-aware transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:4012-4021.
Disclosure of the Invention
Aiming at the problems in the prior art, this patent discloses a monocular three-dimensional target detection method based on multi-scale asymmetric convolution attention and feature decoupling. The method introduces asymmetric convolution into a self-attention model with decoupled depth and visual features. A multi-scale detail convolution encoder aggregates local features of the input image and encodes them with depthwise convolutions using several convolution kernels of different shapes, so as to capture more detail in the image. The decoupled architecture allows visual and depth features to be learned independently without interfering with each other. Furthermore, the combination of convolution and self-attention gives the method both local and global modeling capability. Experimental results on the KITTI dataset show that the method performs well compared with other advanced methods. The technical scheme is as follows:
a monocular three-dimensional target detection method based on convolution attention and feature decoupling comprises the following steps:
Step one: given an input image with resolution H×W, a feature map F is output through a backbone network; taking F as the original feature, a visual feature F_V = Conv(F) is generated using a series of convolution layers; after that, a depth feature f_D is implicitly learned with a depth estimator under the auxiliary supervision of the depth map, as follows:
for the original feature F, two convolution layers are employed to predict the probability of discrete depth intervals D; the probability represents the confidence that the depth value of each pixel belongs to a certain depth interval; the depth ground truth is discretized from continuous space into discrete intervals d_i using linear-increment discretization, where i is the depth interval index and d_i denotes the i-th discrete interval counted from shallow to deep;
the intermediate feature map X = Conv(F) represents the initial depth-aware feature; the feature center of each depth interval, i.e., the depth prototype, is calculated by aggregating the depth-aware features of the pixels belonging to that interval; group convolution is used to generate the predicted depth intervals, reducing the number of intervals from N to N' = N/r with a set ratio r; the depth prototype F_d is generated by weighting the features X' of all pixels into depth class m according to their probabilities, where X'_i denotes the feature of the i-th depth-interval pixel in X' = Conv(X) and L is the set of pixels in the feature map X'; based on the prototype F_d, a new depth feature, denoted f_D, is reconstructed;
Step two: the visual features and the depth features are respectively input into a visual multi-scale detail convolution encoder and a depth multi-scale detail convolution encoder for processing; the multi-scale detail convolution encoder integrates a multi-scale asymmetric convolution attention module to aggregate local features of the input image and encode the aggregated local features, as follows:
the input visual feature and depth feature each pass through a multi-scale asymmetric convolution attention module to obtain the attention map Attention and the output f_out of the multi-scale asymmetric convolution attention module;
in the multi-scale asymmetric convolution attention module, the input features are first asymmetrically convolved to aggregate local information; multi-branch asymmetric convolutions then capture multi-scale context information, and a 1×1 convolution models correlations along the channel dimension;
the output f_out of the multi-scale asymmetric convolution attention module passes through a feed-forward network and a batch normalization layer to generate the final output f_c of the multi-scale detail convolution encoder;
Step three: the dual-encoder-dual-decoder structure of the decoupled-feature-guided self-attention model encodes and decodes the spatial and appearance information of the input image respectively; the decoupled-feature-guided self-attention model has a depth branch and a visual branch, the two branches adopt the same structure, and their parameters are trained separately; the two branches of the decoupled-feature-guided self-attention model take the output of the depth multi-scale detail convolution encoder and the output of the visual multi-scale detail convolution encoder as their respective inputs;
Step four: a cross-attention-guided fusion module is used to fuse the output F_D of the depth branch and the output F_V of the visual branch; in the cross-attention-guided fusion module, F_D serves as the input query Q_FD = Linear(F_D), and F_V provides K_FV, V_FV = Linear(F_V); these are fed into a cross-attention layer, and the output of the cross-attention layer passes through a layer normalization layer to obtain the final output F_fused of the fusion module;
Step five: a single-stage detector with predefined two-dimensional-three-dimensional anchors is adopted to regress the bounding boxes; each predefined anchor is composed of the parameters of a two-dimensional bounding box [x_2d, y_2d, w_2d, h_2d] and a three-dimensional bounding box [x_p, y_p, z, w_3d, h_3d, l_3d, θ], where [x_2d, y_2d] and [x_p, y_p] denote the center of the two-dimensional box and the center of the three-dimensional object projected onto the image plane, [w_2d, h_2d] and [w_3d, h_3d, l_3d] denote the physical dimensions of the two-dimensional and three-dimensional bounding boxes respectively, z denotes the depth of the three-dimensional object center, and θ denotes the observation angle; during training, all ground-truth values are projected into two-dimensional space to calculate the intersection with all two-dimensional anchors; anchors with intersection greater than 0.5 are selected, and the corresponding three-dimensional boxes are optimized;
Step six: for each anchor, a two-dimensional bounding box [t_x, t_y, t_w, t_h]_2d and a three-dimensional bounding box [t_x, t_y, t_w, t_h, t_l, t_z, t_θ]_3d are predicted to parameterize the residual values of the two-dimensional and three-dimensional bounding boxes, classification scores are predicted, and the restored bounding box is output according to the anchors and the network predictions.
Further, in step one, the depth ground truth is discretized from continuous space into discrete intervals d_i using linear-increment discretization, where N is the number of depth intervals, [d_min, d_max] is the depth range, i is the depth interval index, and d_i denotes the i-th interval counted from shallow to deep; pixels with depth values outside the range are marked as invalid and not used for optimization during training.
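A linear-increasing discretization (LID) of the kind used by depth-bin-based detectors such as [14], written with the symbols N, d_min, d_max, and i defined above, can serve as a concrete reference; the exact form below is an assumption rather than the patent's verbatim formula:

```latex
d_{i} = d_{\min} + \frac{d_{\max} - d_{\min}}{N\,(N+1)}\; i\,(i+1), \qquad i = 0, 1, \dots, N-1
```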
Further, in step one, the number of depth intervals N is set to 96, the depth range [d_min, d_max] is set to [1, 80], and r = 2.
Further, in the second step, the asymmetric convolution is expressed as:
wherein BN denotes a batch normalization operation, and γ_i and β_i are learnable parameters in the BN operation, i = 1, 2, 3.
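One concrete form of such an asymmetric convolution, assuming an ACNet-style combination of a square kernel with its two one-dimensional counterparts, each followed by its own batch normalization (matching the three pairs of parameters γ_i, β_i above), is:

```latex
\mathrm{Asy\_Conv}(f) =
\mathrm{BN}_{\gamma_{1},\beta_{1}}\big(\mathrm{Conv}_{k\times k}(f)\big) +
\mathrm{BN}_{\gamma_{2},\beta_{2}}\big(\mathrm{Conv}_{1\times k}(f)\big) +
\mathrm{BN}_{\gamma_{3},\beta_{3}}\big(\mathrm{Conv}_{k\times 1}(f)\big)
```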
Further, the multi-scale asymmetric convolution attention module in the second step is:
f_out = Attention × f
wherein f represents the input feature f_D or f_V, Asy_Conv represents the asymmetric convolution, Scale_i, i ∈ {0, 1, 2, 3}, represents the i-th branch, and Attention and f_out represent the attention map and the output of the multi-scale asymmetric convolution attention module, respectively.
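Assuming an MSCA-style design in which Scale_0 is the identity branch and a 1×1 convolution aggregates the summed branch outputs, the attention map and the module output can be written as follows (an assumed form consistent with the symbols above, not the patent's verbatim equation):

```latex
\mathrm{Attention} = \mathrm{Conv}_{1\times 1}\!\Big(\sum_{i=0}^{3}\mathrm{Scale}_{i}\big(\mathrm{Asy\_Conv}(f)\big)\Big), \qquad
f_{out} = \mathrm{Attention} \times f
```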
Further, in the third step, the depth branch is processed as follows:
1) the output of the depth multi-scale detail convolution encoder is taken as the input feature of the depth branch of the decoupled-feature-guided self-attention model;
2) in the encoder of the depth branch, this input feature is encoded by a self-attention layer and a layer normalization layer:
wherein Linear represents a linear transformation, softmax is the activation function, C represents the dimension of the input features, LN represents the layer normalization operation, and A represents the attention score; the result is the output of the encoder in the depth branch;
3) in the decoder of the depth branch, the encoder output serves as the input query, and the position encoding P provides K_P, V_P = Linear(P), which are fed into a cross-attention layer; the output of the cross-attention layer is decoded by a self-attention layer and a layer normalization layer to obtain the output F_D of the depth branch; the cross-attention layer is implemented as follows:
wherein A_D represents the attention score of the cross-attention layer;
the visual branch is processed in the same way to obtain its output F_V.
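A scaled dot-product formulation consistent with the symbols Linear, softmax, C, LN, A, and A_D is sketched below; the scaling by √C, the placement of LN, and the stand-in names f_c^D (output of the depth multi-scale detail convolution encoder) and F̂_D (encoder output of the depth branch) are assumptions rather than the patent's verbatim equations:

```latex
% Encoder of the depth branch (self-attention followed by layer normalization)
Q = K = V = \mathrm{Linear}\big(f^{D}_{c}\big), \qquad
A = \mathrm{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{C}}\right), \qquad
\hat{F}_{D} = \mathrm{LN}(AV)

% Decoder of the depth branch (cross-attention with the position encoding P)
Q_{D} = \mathrm{Linear}\big(\hat{F}_{D}\big), \qquad
K_{P}, V_{P} = \mathrm{Linear}(P), \qquad
A_{D} = \mathrm{softmax}\!\left(\tfrac{Q_{D}K_{P}^{\top}}{\sqrt{C}}\right)
```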
Further, in step six, the output bounding box is restored as follows:
wherein the restored quantities denote the recovery parameters of the three-dimensional object; the two-dimensional box center [x_2d, y_2d] and the three-dimensional projection center [x_p, y_p] share the same anchor center.
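A typical anchor-based recovery consistent with the parameterization of steps five and six is sketched below, with primes marking the recovered values; the exact offset and log-size transforms are assumptions rather than the patent's verbatim formulas:

```latex
x'_{2d} = x_{2d} + t^{2d}_{x} w_{2d}, \quad
y'_{2d} = y_{2d} + t^{2d}_{y} h_{2d}, \quad
w'_{2d} = w_{2d}\, e^{t^{2d}_{w}}, \quad
h'_{2d} = h_{2d}\, e^{t^{2d}_{h}}

x'_{p} = x_{2d} + t^{3d}_{x} w_{2d}, \quad
y'_{p} = y_{2d} + t^{3d}_{y} h_{2d}, \quad
z' = z + t^{3d}_{z}

w'_{3d} = w_{3d}\, e^{t^{3d}_{w}}, \quad
h'_{3d} = h_{3d}\, e^{t^{3d}_{h}}, \quad
l'_{3d} = l_{3d}\, e^{t^{3d}_{l}}, \quad
\theta' = \theta + t^{3d}_{\theta}
```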
Drawings
FIG. 1 is a diagram of the overall structure of the method
FIG. 2 visualizes three-dimensional bounding boxes for the car class on the KITTI dataset
Detailed Description
The method of the invention belongs to supervised learning: the model must first be trained with supervision, and after the optimal model is obtained through training, it is used to perform detection on new data. The overall structure of the invention is shown in FIG. 1. To make the technical scheme of the invention clearer, the invention is further described in detail below. The invention is realized by the following steps.
Step one: given an input image of resolution H×W, the feature map F is output via the backbone network DLA-102. With F as the original feature, a series of convolution layers generates the visual feature F_V = Conv(F). A depth estimator then implicitly learns the depth feature f_D under the auxiliary supervision of the depth map [14], as follows:
The specific generation process of the depth feature f_D is as follows: for the original feature F, two convolution layers are employed to predict the probability of discrete depth intervals D, where N is the number of depth intervals; the probability represents the confidence that the depth value of each pixel belongs to a certain depth interval. The depth truth values are discretized from continuous space into discrete intervals d_i using linear-increment discretization, where i is the depth interval index and d_i denotes the i-th interval counted from shallow to deep. The number of depth intervals N is set to 96, and the depth range [d_min, d_max] is set to [1, 80]. Pixels whose depth values fall outside this range are marked as invalid and not used for optimization during training. The intermediate feature map X = Conv(F) represents the initial depth-aware feature. To further enhance the depth representation, a central representation of the corresponding depth interval is introduced to enhance the feature of each pixel. The feature center (i.e., depth prototype) of each depth interval is calculated by aggregating the depth-aware features of the pixels belonging to the specified interval. In practice, group convolution is first applied to generate the predicted depth intervals, reducing the number of intervals from N to N' = N/r with ratio r (r = 2 in this example) to share similar depth cues and reduce the computational cost. The depth prototype F_d is generated by weighting the features X' of all pixels into depth class m according to their probabilities, where X'_i denotes the feature of the i-th depth-interval pixel in X' = Conv(X), L is the set of pixels in the feature map X', and the weight is the normalized probability of the m-th depth prototype. In this way, F_d can represent the global context information of each depth interval.
Furthermore, a new depth feature f_D is reconstructed based on the depth prototype representation, which allows each pixel to understand the representation of the depth intervals from a global view.
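The following PyTorch-style sketch illustrates one way the prototype aggregation and depth-feature reconstruction described above could be implemented; the tensor shapes, the normalization choices, and the function name are illustrative assumptions rather than the patent's exact implementation:

```python
import torch
import torch.nn.functional as F_nn

def depth_prototypes_and_reconstruction(depth_prob, x_prime):
    """Aggregate per-interval depth prototypes and reconstruct the depth feature f_D.

    depth_prob: (B, N', H, W) predicted probabilities over N' depth intervals
    x_prime:    (B, C,  H, W) depth-aware features X' = Conv(X)
    Returns f_D of shape (B, C, H, W).
    """
    B, Np, H, W = depth_prob.shape
    C = x_prime.shape[1]
    prob = depth_prob.flatten(2)                           # (B, N', H*W)
    prob = prob / (prob.sum(dim=2, keepdim=True) + 1e-6)   # normalize over the pixel set L
    feats = x_prime.flatten(2)                             # (B, C, H*W)
    # Depth prototype F_d: probability-weighted sum of pixel features per interval
    prototypes = torch.einsum('bnl,bcl->bnc', prob, feats)       # (B, N', C)
    # Reconstruct f_D: each pixel re-reads the prototypes with its own interval probabilities
    pixel_prob = F_nn.softmax(depth_prob.flatten(2), dim=1)      # (B, N', H*W)
    f_D = torch.einsum('bnl,bnc->bcl', pixel_prob, prototypes)   # (B, C, H*W)
    return f_D.reshape(B, C, H, W)
```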
Step two: the visual features and the depth features are respectively input into a visual multi-scale detail convolution encoder and a depth multi-scale detail convolution encoder for processing. The multi-scale detail convolution encoder integrates a multi-scale asymmetric convolution attention module to aggregate local features of the input image and encode them, as follows:
the input features pass through a multi-scale asymmetric convolution attention module:
f_out = Attention × f
wherein f represents the input feature f_D or f_V, Asy_Conv represents the asymmetric convolution, Scale_i, i ∈ {0, 1, 2, 3}, represents the i-th branch, and Attention and f_out represent the attention map and the output of the multi-scale asymmetric convolution attention module, respectively.
In the multi-scale asymmetric convolution attention module, the input features first aggregate local information through a 5×5 asymmetric convolution; multi-branch asymmetric convolutions then capture multi-scale context information, with the convolution kernel size of the branches set to 7, 11, and 21, respectively; finally, a 1×1 convolution models correlations along the channel dimension. In the asymmetric convolution, BN represents the batch normalization operation, and γ and β are learnable parameters of the BN operation.
The output f_out of the multi-scale asymmetric convolution attention module passes through a feed-forward network and a batch normalization layer to generate the final output of the multi-scale detail convolution encoder, f_c = BN(FFN(f_out)), where FFN denotes the feed-forward network.
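As a concrete illustration of step two, the following PyTorch-style sketch implements a multi-scale asymmetric convolution attention module and the surrounding multi-scale detail convolution encoder; the depthwise 1×k/k×1 branch decomposition, the channel widths, and the FFN design are assumptions in the spirit of the description (5×5 local aggregation, branch kernel sizes 7, 11, and 21, a 1×1 convolution, then FFN and BN), not a verbatim reproduction of the patent's implementation:

```python
import torch
import torch.nn as nn

class AsymmetricBranch(nn.Module):
    """One multi-scale branch: a depthwise 1×k convolution followed by a depthwise k×1 convolution."""
    def __init__(self, channels, k):
        super().__init__()
        self.conv_1xk = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.conv_kx1 = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)

    def forward(self, x):
        return self.conv_kx1(self.conv_1xk(x))

class MultiScaleAsymConvAttention(nn.Module):
    """Multi-scale asymmetric convolution attention: attention map multiplied with the input."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)  # 5×5 local aggregation
        self.branches = nn.ModuleList([AsymmetricBranch(channels, k) for k in (7, 11, 21)])
        self.channel_mix = nn.Conv2d(channels, channels, 1)  # 1×1 channel-correlation modeling

    def forward(self, f):
        base = self.local(f)                          # local aggregation (identity-like Scale_0 branch)
        multi = base + sum(branch(base) for branch in self.branches)
        attention = self.channel_mix(multi)           # Attention map
        return attention * f                          # f_out = Attention × f

class MultiScaleDetailConvEncoder(nn.Module):
    """f_c = BN(FFN(f_out)), with a simple convolutional feed-forward network."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        self.attn = MultiScaleAsymConvAttention(channels)
        self.ffn = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1),
            nn.GELU(),
            nn.Conv2d(channels * expansion, channels, 1),
        )
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, f):
        return self.bn(self.ffn(self.attn(f)))
```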
Step three: the dual-encoder-dual-decoder structure of the decoupled-feature-guided self-attention model is used to encode and decode the spatial and appearance information of the input image respectively. The decoupled-feature-guided self-attention model has a depth branch and a visual branch; the two branches adopt the same structure, and their parameters are trained separately. The two branches take the output of the depth multi-scale detail convolution encoder and the output of the visual multi-scale detail convolution encoder as their respective inputs. Taking the depth branch as an example, the method is as follows:
1) the output of the depth multi-scale detail convolution encoder is taken as the input feature of the depth branch of the decoupled-feature-guided self-attention model;
2) in the encoder of the depth branch, this input feature is encoded by a self-attention layer and a layer normalization layer:
where Linear represents a linear transformation, softmax is the activation function, C represents the dimension of the input features, and LN represents the layer normalization operation; A represents the attention score, and the result is the output of the encoder in the depth branch;
3) in the decoder of the depth branch, the encoder output serves as the input query, and the position encoding P provides K_P, V_P = Linear(P), which are fed into a cross-attention layer. The output of the cross-attention layer then completes the decoding process through a self-attention layer and a layer normalization layer. The cross-attention layer is implemented as follows:
wherein A_D represents the attention score of the cross-attention layer; the visual branch is processed in the same way to obtain its output F_V;
Step four: a cross-attention-guided fusion module is used to fuse the output F_D of the depth branch and the output F_V of the visual branch. In the cross-attention-guided fusion module, F_D serves as the input query Q_FD = Linear(F_D), and F_V provides K_FV, V_FV = Linear(F_V); these are fed into a cross-attention layer. The output of the cross-attention layer then passes through a layer normalization layer to obtain the final output F_fused of the fusion module.
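The following PyTorch-style sketch shows one way the decoupled depth and visual branches of step three and the cross-attention-guided fusion of step four could be wired together; the use of nn.MultiheadAttention, the head count, and the handling of the position encoding P are illustrative assumptions rather than the patent's exact design:

```python
import torch
import torch.nn as nn

class DecoupledBranch(nn.Module):
    """One branch (depth or visual): encoder (self-attention + LN) and decoder (cross-attention with P, then self-attention + LN)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.enc_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.enc_norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_norm = nn.LayerNorm(dim)

    def forward(self, feat, pos):
        # feat: (B, L, C) flattened output of the corresponding multi-scale detail convolution encoder
        # pos:  (B, L, C) position encoding P
        enc, _ = self.enc_attn(feat, feat, feat)
        enc = self.enc_norm(enc)                    # encoder output of the branch
        dec, _ = self.cross_attn(enc, pos, pos)     # query from the encoder output, K/V from P
        out, _ = self.dec_attn(dec, dec, dec)
        return self.dec_norm(out)                   # branch output F_D or F_V

class CrossAttentionFusion(nn.Module):
    """Fuse the branch outputs: query from F_D, K/V from F_V, followed by layer normalization."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_depth, f_visual):
        fused, _ = self.cross_attn(f_depth, f_visual, f_visual)
        return self.norm(fused)                     # F_fused

# Example wiring (shapes and dimensions assumed): two independently trained branches, then fusion.
# depth_branch, visual_branch = DecoupledBranch(256), DecoupledBranch(256)
# fusion = CrossAttentionFusion(256)
# F_D = depth_branch(depth_feat, pos); F_V = visual_branch(visual_feat, pos)
# F_fused = fusion(F_D, F_V)
```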
Step five: a single-stage detector with predefined two-dimensional-three-dimensional anchors is adopted to regress the bounding boxes. Each predefined anchor is composed of the parameters of a two-dimensional bounding box [x_2d, y_2d, w_2d, h_2d] and a three-dimensional bounding box [x_p, y_p, z, w_3d, h_3d, l_3d, θ]. [x_2d, y_2d] and [x_p, y_p] denote the center of the two-dimensional box and the center of the three-dimensional object projected onto the image plane. [w_2d, h_2d] and [w_3d, h_3d, l_3d] denote the physical dimensions of the two-dimensional and three-dimensional bounding boxes, respectively. z denotes the depth of the three-dimensional object center, and θ denotes the observation angle. During training, all ground-truth values are projected into two-dimensional space to calculate the intersection with all two-dimensional anchors; anchors with intersection greater than 0.5 are selected, and the corresponding three-dimensional boxes are optimized;
Step six: [t_x, t_y, t_w, t_h]_2d and [t_x, t_y, t_w, t_h, t_l, t_z, t_θ]_3d are predicted for each anchor to parameterize the residual values of the two-dimensional and three-dimensional bounding boxes, and the classification scores are predicted. Based on the anchors and the network predictions, the output bounding box is recovered as follows:
where the restored quantities denote the recovery parameters of the three-dimensional object. The two-dimensional box center [x_2d, y_2d] and the three-dimensional projection center [x_p, y_p] share the same anchor center.
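A minimal sketch of the anchor-based recovery in steps five and six, assuming a common offset-plus-log-size parameterization (the exact transforms are assumptions, and the dictionary keys are purely illustrative):

```python
import math

def decode_boxes(anchor, deltas_2d, deltas_3d):
    """Recover the 2D and 3D boxes from one anchor and its predicted residuals.

    anchor:    dict with x2d, y2d, w2d, h2d, z, w3d, h3d, l3d, theta
    deltas_2d: (tx, ty, tw, th) residuals for the 2D box
    deltas_3d: (tx, ty, tw, th, tl, tz, ttheta) residuals for the 3D box
    """
    tx2, ty2, tw2, th2 = deltas_2d
    tx3, ty3, tw3, th3, tl3, tz3, tth3 = deltas_3d
    box2d = {
        "x": anchor["x2d"] + tx2 * anchor["w2d"],
        "y": anchor["y2d"] + ty2 * anchor["h2d"],
        "w": anchor["w2d"] * math.exp(tw2),
        "h": anchor["h2d"] * math.exp(th2),
    }
    box3d = {
        # The projected 3D center shares the anchor center with the 2D box
        "xp": anchor["x2d"] + tx3 * anchor["w2d"],
        "yp": anchor["y2d"] + ty3 * anchor["h2d"],
        "z": anchor["z"] + tz3,
        "w": anchor["w3d"] * math.exp(tw3),
        "h": anchor["h3d"] * math.exp(th3),
        "l": anchor["l3d"] * math.exp(tl3),
        "theta": anchor["theta"] + tth3,
    }
    return box2d, box3d
```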
The present invention has been tested on the autonomous driving dataset KITTI, which was jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago for research in the field of autonomous driving. The authors collected real traffic scenes for up to 6 hours, and the dataset consists of corrected and synchronized images, lidar scans, high-precision GPS information, IMU acceleration information, and other multi-modal data. The KITTI dataset contains 7481 images for training and 7518 images for testing. The ground truth of the test set is not publicly released, so the experimental results on the test set are obtained by submitting the method's predictions to the official KITTI website. Following other literature, the invention divides the training samples into a training set (3712 images) and a validation set (3769 images).
Model training, validation, and testing are carried out on this dataset; cars, pedestrians, and cyclists in the scene are detected from a single input image, and three-dimensional bounding boxes are output. The results show that on the validation set, when detecting the car class at IoU = 0.7, the three-dimensional average precision under the easy, moderate, and hard settings is 29.70%, 20.64%, and 17.05%, respectively. On the test set, when detecting the car class at IoU = 0.7, the three-dimensional average precision under the easy, moderate, and hard settings is 24.27%, 17.06%, and 14.76%, respectively; when detecting the pedestrian class at IoU = 0.5, it is 13.30%, 8.25%, and 7.38%, respectively; when detecting the cyclist class at IoU = 0.5, it is 10.67%, 6.47%, and 5.62%, respectively. These results are more accurate than those of other detection models, indicating that the model can learn and accurately detect targets of different categories. The detection results for car-class targets on the KITTI dataset are visualized in FIG. 2. The results show that the model can accurately detect target objects and that the detection results are close to the ground truth, demonstrating the excellent performance of the model.