CN112950697B - Monocular unsupervised depth estimation method based on CBAM - Google Patents
- Publication number: CN112950697B
- Application number: CN202110142746.6A
- Authority: CN (China)
- Prior art keywords: CBAM, depth estimation, Resblock, loss, image
- Prior art date: 2021-02-02
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/593 — Image analysis: depth or shape recovery from multiple images, from stereo images
- G06F18/214 — Pattern recognition: design or setup of recognition systems; generating training patterns, bootstrap methods, e.g. bagging or boosting
- G06F18/25 — Pattern recognition: fusion techniques
- G06N3/045 — Neural networks: combinations of networks
- G06N3/08 — Neural networks: learning methods
- G06T2207/10028 — Image acquisition modality: range image, depth image, 3D point clouds
- G06T2207/20081 — Special algorithmic details: training, learning
- G06T2207/20084 — Special algorithmic details: artificial neural networks [ANN]
Abstract
The invention discloses a monocular unsupervised depth estimation method based on CBAM. Depth estimation is one of the key technologies that enable a robot to perceive its surrounding environment. Supervised depth estimation methods process range measurements obtained from sensors such as lidar and use them as ground truth for training, but this process consumes substantial manpower and computing resources, which severely limits the application of such methods across scenes. In the present invention, a convolutional block attention module is introduced into an unsupervised depth estimation framework; the network is trained with photometric reconstruction, disparity smoothing, and left-right disparity consistency losses on stereo image pairs, and performs scaled depth estimation on monocular images. With the proposed method, the depth details of objects in the surrounding environment are preserved, the overall depth estimation accuracy is improved, and generalization capability across scenes is retained.
Description
Technical Field
The invention belongs to the field of autonomous navigation and environment perception for intelligent agents, and particularly relates to a monocular unsupervised depth estimation method based on CBAM.
Background
An intelligent agent needs a complete environment perception capability to achieve safe and reliable autonomous navigation, which includes estimating the depth of its surroundings. Depth estimation based on 3D lidar yields accurate results, but the sensor is expensive and produces only sparse depth. Depth estimation based on RGB-D cameras is simple to operate, but the sensing range is limited and such cameras are difficult to use outdoors. Depth estimation based on stereo cameras works both indoors and outdoors, but it consumes substantial computational resources and its range is limited by the short baseline. Depth estimation based on a monocular camera can yield dense depth maps, but conventional monocular methods cannot recover true depth because the absolute scale is missing.
With the development of artificial intelligence, intelligent agents have gradually adopted deep convolutional neural networks for environment perception tasks. Researchers first used supervised learning to recover the absolute scale of monocular cameras and thereby complete monocular dense depth estimation. However, supervised learning requires a large number of training samples with ground truth, which greatly restricts its generalization ability. At present, unsupervised monocular depth estimation is favored by researchers for its simple and effective training scheme and steadily improving accuracy, and advanced network design ideas such as attention mechanisms, multipath connections, and architecture search have been applied to such models. Studying a monocular unsupervised depth estimation method with an attention mechanism, so that an intelligent agent can densely and accurately perceive the depth of its surroundings, therefore has important scientific and practical value.
Disclosure of Invention
In order to solve the above problems, the invention discloses a monocular unsupervised depth estimation method based on CBAM (Convolutional Block Attention Module), which introduces an attention mechanism into the depth estimation task, preserves the depth details of objects, and improves the overall accuracy of depth estimation, thereby providing a foundation for autonomous navigation and environment perception of an intelligent agent.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
a CBAM-based monocular unsupervised depth estimation method, comprising the steps of:
step 1), combining CBAM and Resblock to form a Resblock-CBAM;
step 2), designing a depth estimation network with an attention mechanism based on the Resblock-CBAM;
and step 3), training the depth estimation network with photometric reconstruction, disparity smoothing, and left-right disparity consistency losses on stereo image pairs, and completing depth estimation of monocular images.
Further, combining CBAM and Resblock into the Resblock-CBAM in step 1) comprises the following specific steps:
a) The channel attention sub-module and the spatial attention sub-module in the CBAM are connected in sequence; the CBAM is then combined with a Resblock in parallel to form the Conventional Resblock-CBAM, whose output equation is shown in formula (1):
$F_C = F_r + F_r''$ (1)
where $F_r$ is the output feature of the Resblock, $F_r''$ is the output feature of the CBAM spatial attention sub-module, and $F_C$ is the output feature of the Conventional Resblock-CBAM;
b) The channel attention sub-module and the spatial attention sub-module in the CBAM are connected in sequence; the CBAM is then connected with a Resblock in series to form the Modified Resblock-CBAM, whose output equation is shown in formula (2):
$F_M = F_r''$ (2)
where $F_M$ is the output feature of the Modified Resblock-CBAM;
c) The overall process of the channel attention sub-module and the spatial attention sub-module in the CBAM is shown in formula (3):
$F_r' = M_c(F_r) \otimes F_r, \quad F_r'' = M_s(F_r') \otimes F_r'$ (3)
where $F_r'$ is the output feature of the CBAM channel attention sub-module, $M_c$ is the one-dimensional channel attention map, $M_s$ is the two-dimensional spatial attention map, and $\otimes$ denotes element-wise multiplication;
The specific process of the channel attention sub-module is shown in formula (4):
$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(\omega_1(\omega_0(F^c_{avg})) + \omega_1(\omega_0(F^c_{max}))\big)$ (4)
where $\sigma$ denotes the sigmoid function, MLP is a multi-layer perceptron, $\omega_0$ and $\omega_1$ are the weights of the multi-layer perceptron, and $F^c_{avg}$ and $F^c_{max}$ are the channel descriptors produced by average pooling and max pooling;
The specific process of the spatial attention sub-module is shown in formula (5):
$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$ (5)
where $f^{7\times 7}$ denotes a convolution with a 7×7 filter, and $F^s_{avg}$ and $F^s_{max}$ are the spatial descriptors produced by average pooling and max pooling; an illustrative code sketch of this construction follows.
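For illustration, the following is a minimal sketch of the Resblock-CBAM construction. The embodiment described below is implemented in TensorFlow; PyTorch is used here purely for compactness, and the reduction ratio of 16 in the channel MLP, the externally supplied residual block, and the `mode` naming are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention sub-module, formula (4): a shared MLP (weights w0, w1)
    applied to the max- and average-pooled channel descriptors."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).view(b, c))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(b, c))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention sub-module, formula (5): 7x7 convolution over the
    concatenated channel-wise average- and max-pooled maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Formula (3): channel attention followed by spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = self.ca(x) * x     # F_r'  = M_c(F_r)  (x) F_r
        return self.sa(x) * x  # F_r'' = M_s(F_r') (x) F_r'

class ResblockCBAM(nn.Module):
    """Formulas (1)-(2) under one reading: 'conventional' adds the CBAM branch
    to the Resblock output (F_C = F_r + F_r''); 'modified' keeps only the CBAM
    output (F_M = F_r'')."""
    def __init__(self, resblock, channels, mode="conventional"):
        super().__init__()
        self.resblock, self.cbam, self.mode = resblock, CBAM(channels), mode

    def forward(self, x):
        f_r = self.resblock(x)
        f_r2 = self.cbam(f_r)
        return f_r + f_r2 if self.mode == "conventional" else f_r2

# Example: block = ResblockCBAM(nn.Conv2d(64, 64, 3, padding=1), channels=64)
```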
Step 2), designing a depth estimation network with an attention mechanism based on the Resblock-CBAM, comprises the following specific steps:
a) Four Resblock-CBAM modules are used in sequence in the encoder of the depth estimation network: the first three are Conventional Resblock-CBAM and the fourth is Modified Resblock-CBAM;
b) Five skip connections are used between the encoder and the decoder of the depth estimation network (see the sketch after this step): the first connects the first convolution layer of the encoder to the second up-convolution layer of the decoder, the second connects the first pooling layer of the encoder to the third up-convolution layer of the decoder, the third connects the first Conventional Resblock-CBAM to the fourth up-convolution layer of the decoder, the fourth connects the second Conventional Resblock-CBAM to the fifth up-convolution layer of the decoder, and the fifth connects the third Conventional Resblock-CBAM to the sixth up-convolution layer of the decoder; the Modified Resblock-CBAM is connected directly to the decoder without a skip connection.
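The sketch below (same illustrative PyTorch setting) shows one way to realize the skip wiring of b); plain stride-2 convolutions stand in for the four Resblock-CBAM stages, and the channel widths, kernel sizes, ELU activations, and single-scale two-channel sigmoid disparity head are assumptions — the patent's network additionally predicts disparities at four scales.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k=3, s=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, s, k // 2), nn.ELU())

def upconv(in_ch, out_ch):
    # "up-convolution": x2 nearest-neighbour upsample followed by a 3x3 conv
    return nn.Sequential(nn.Upsample(scale_factor=2), conv(in_ch, out_ch))

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder (stand-in convs mark the Resblock-CBAM stages).
        self.conv1 = conv(3, 64, k=7, s=2)   # H/2  -> skip 1
        self.pool1 = nn.MaxPool2d(3, 2, 1)   # H/4  -> skip 2
        self.block1 = conv(64, 128, s=2)     # H/8  -> skip 3 (Conventional Resblock-CBAM)
        self.block2 = conv(128, 256, s=2)    # H/16 -> skip 4 (Conventional Resblock-CBAM)
        self.block3 = conv(256, 512, s=2)    # H/32 -> skip 5 (Conventional Resblock-CBAM)
        self.block4 = conv(512, 512, s=2)    # H/64, feeds decoder directly (Modified)
        # Decoder: up-convolutions 6..1; upconvs 6..2 concatenate skips 5..1.
        self.up6, self.i6 = upconv(512, 512), conv(512 + 512, 512)
        self.up5, self.i5 = upconv(512, 256), conv(256 + 256, 256)
        self.up4, self.i4 = upconv(256, 128), conv(128 + 128, 128)
        self.up3, self.i3 = upconv(128, 64), conv(64 + 64, 64)
        self.up2, self.i2 = upconv(64, 32), conv(32 + 64, 32)
        self.up1, self.i1 = upconv(32, 16), conv(16, 16)
        self.disp = nn.Sequential(nn.Conv2d(16, 2, 3, 1, 1), nn.Sigmoid())

    def forward(self, x):                    # H and W must be divisible by 64
        s1 = self.conv1(x); s2 = self.pool1(s1)
        s3 = self.block1(s2); s4 = self.block2(s3); s5 = self.block3(s4)
        e = self.block4(s5)
        d = self.i6(torch.cat([self.up6(e), s5], 1))  # skip 5 -> 6th upconv
        d = self.i5(torch.cat([self.up5(d), s4], 1))  # skip 4 -> 5th upconv
        d = self.i4(torch.cat([self.up4(d), s3], 1))  # skip 3 -> 4th upconv
        d = self.i3(torch.cat([self.up3(d), s2], 1))  # skip 2 -> 3rd upconv
        d = self.i2(torch.cat([self.up2(d), s1], 1))  # skip 1 -> 2nd upconv
        return self.disp(self.i1(self.up1(d)))        # left/right disparity pair

# Example: DepthNet()(torch.randn(1, 3, 256, 512)) -> disparities of shape (1, 2, 256, 512)
```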
Step 3), training the depth estimation network with photometric reconstruction, disparity smoothing, and left-right disparity consistency losses on stereo image pairs, and estimating depth from a monocular image at test time, comprises the following specific steps:
a) The total training loss of the depth estimation network consists of the photometric reconstruction loss, the disparity smoothing loss, and the left-right disparity consistency loss, as shown in formula (6):
$L = \sum_s L_s, \quad L_s = \alpha_{ap}(L^l_{ap} + L^r_{ap}) + \alpha_{ds}(L^l_{ds} + L^r_{ds}) + \alpha_{lr}(L^l_{lr} + L^r_{lr})$ (6)
where $L$ is the total training loss of the depth estimation network, $L_s$ is the training loss at scale $s$, $\alpha_{ap}$, $\alpha_{ds}$, and $\alpha_{lr}$ are the weight coefficients of the photometric reconstruction loss, the disparity smoothing loss, and the left-right disparity consistency loss, respectively, $L^l_{ap}$ and $L^r_{ap}$ are the photometric reconstruction losses of the left and right images, $L^l_{ds}$ and $L^r_{ds}$ are the disparity smoothing losses of the left and right images, and $L^l_{lr}$ and $L^r_{lr}$ are the disparity consistency losses of the left and right images;
b) The photometric reconstruction loss measures the difference between an input source image and its corresponding reconstructed image, as shown in formula (7):
$L^l_{ap} = \frac{1}{N}\sum_{i,j}\Big[\alpha_1\frac{1-\mathrm{SSIM}(I^l_{ij}, \tilde{I}^l_{ij})}{2} + (1-\alpha_1)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\Big]$ (7)
where $L^l_{ap}$ is the photometric reconstruction loss of the left image, $N$ is the number of pixels in a single image, $I^l$ is the left image, $\tilde{I}^l$ is the reconstructed left image, SSIM is the structural similarity function, and $\alpha_1$ is the scale parameter of the SSIM term; the right-image photometric reconstruction loss $L^r_{ap}$ takes the same form;
c) The disparity smoothing loss encourages the disparity map to be locally smooth while allowing discontinuities at image gradients, as shown in formula (8):
$L^l_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d^l_{ij}\big|\,e^{-\|\partial_x I^l_{ij}\|} + \big|\partial_y d^l_{ij}\big|\,e^{-\|\partial_y I^l_{ij}\|}\Big)$ (8)
where $L^l_{ds}$ is the disparity smoothing loss of the left image and $d^l$ is the disparity map of the left image; the right-image disparity smoothing loss $L^r_{ds}$ takes the same form;
d) The left-right disparity consistency loss enforces agreement between the left and right disparity predictions and improves the network's depth estimation accuracy; the left-image disparity consistency loss is shown in formula (9):
$L^l_{lr} = \frac{1}{N}\sum_{i,j}\Big[\alpha_2\frac{1-\mathrm{SSIM}(d^l_{ij}, \tilde{d}^l_{ij})}{2} + (1-\alpha_2)\,\big|d^l_{ij}-\tilde{d}^l_{ij}\big|\Big]$ (9)
where $L^l_{lr}$ is the disparity consistency loss of the left image, $\alpha_2$ is the scale parameter of the SSIM term, and $\tilde{d}^l$ is the right disparity map mapped to the left view; the right-image disparity consistency loss $L^r_{lr}$ takes the same form. An illustrative sketch of these losses follows.
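The following is a minimal sketch of the three training losses in formulas (7)-(9), continuing the illustrative PyTorch setting. The 3x3 average-pooling SSIM, the normalization and sign conventions of the hypothetical `warp` helper, and the exact SSIM+L1 form of the consistency loss in formula (9) are reconstructions under stated assumptions, not verbatim from the patent.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified single-scale SSIM over 3x3 average-pooled local statistics."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

def warp(img, disp):
    """Bilinearly sample `img` shifted by horizontal disparity `disp` (given as a
    fraction of image width); the sign convention assumes left-view reconstruction."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=img.device),
                            torch.linspace(-1, 1, w, device=img.device), indexing="ij")
    grid = torch.stack([xs.expand(b, -1, -1) - 2 * disp.squeeze(1),
                        ys.expand(b, -1, -1)], dim=3)
    return F.grid_sample(img, grid, align_corners=True)

def photometric_loss(img, recon, alpha1=0.85):      # formula (7)
    s = ((1 - ssim(img, recon)) / 2).clamp(0, 1)
    return (alpha1 * s + (1 - alpha1) * (img - recon).abs()).mean()

def smoothness_loss(disp, img):                     # formula (8), edge-aware
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def lr_consistency_loss(disp_l, disp_r_to_l, alpha2=0.15):   # formula (9)
    s = ((1 - ssim(disp_l, disp_r_to_l)) / 2).clamp(0, 1)
    return (alpha2 * s + (1 - alpha2) * (disp_l - disp_r_to_l).abs()).mean()
```

Per formula (6), the per-scale losses would then be combined with weights $\alpha_{ap}$, $\alpha_{ds}$, and $\alpha_{lr}$ and summed over the four output scales; `disp_r_to_l` would be obtained by warping the right disparity map into the left view with `warp`.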
The beneficial effects of the invention are as follows:
according to the monocular unsupervised depth estimation method based on the CBAM, the convolution block attention module CBAM is introduced into an unsupervised depth estimation framework, so that dense depth estimation of monocular images is realized. In the process of introducing CBAM to a depth estimation network, combining the CBAM and Resblock into Resblock-CBAM, and extracting features of input from two dimensions of space and channel; meanwhile, the multi-scale information is fused by adopting the jump connection. By using the method provided by the invention, the attention mechanism is integrated into the depth estimation network and unsupervised training such as luminosity reconstruction, parallax smoothing, left-right parallax consistency and the like based on the image pair is performed, so that the depth details of objects in the environment can be kept and the overall depth estimation precision can be improved.
Drawings
FIG. 1 is an unsupervised depth estimation framework diagram;
FIG. 2 is a schematic diagram of a combination of a residual block and a convolution block attention module;
FIG. 3 is a schematic diagram of a sub-module in the convolution block attention module;
FIG. 4 is a diagram of a depth estimation network architecture;
FIG. 5 is a depth estimation visual quality assessment graph;
FIG. 6 is a diagram of a depth estimation experiment platform;
FIG. 7 shows depth estimation results in a real urban scene;
FIG. 8 is a table of depth estimation accuracy comparisons.
Detailed Description
The present invention is further illustrated in the following drawings and detailed description, which are to be regarded as illustrative in nature and not as restrictive.
In the monocular unsupervised depth estimation method based on CBAM, as shown in FIG. 1, CBAM and Resblock are first combined to form the Resblock-CBAM; a depth estimation network with an attention mechanism is then designed based on the Resblock-CBAM; finally, the depth estimation network is trained with photometric reconstruction, disparity smoothing, and left-right disparity consistency losses on stereo image pairs, completing depth estimation of monocular images. Steps 1) to 3) proceed exactly as described in the Disclosure of the Invention above.
In this embodiment, the unsupervised depth estimation framework is implemented in TensorFlow, and the network is trained on an NVIDIA GeForce RTX 2080 Ti graphics card with 11 GB of memory; training takes about 22 hours to converge. The weight parameters in the training loss function are set as follows: $\alpha_{ap}=1$, $\alpha_{lr}=1$, $\alpha_1=0.85$, $\alpha_2=0.15$. Since the unsupervised training with photometric reconstruction, disparity smoothing, and left-right disparity consistency losses on image pairs is performed at four scales, $\alpha_{ds}$ is set to 1, 0.5, 0.25, and 0.125 at disp1, disp2, disp3, and disp4, respectively. To better measure the accuracy of depth estimation, five evaluation indices are defined as follows:
Abs Rel: $\frac{1}{T}\sum_{k}\frac{|d_k - d_k^*|}{d_k^*}$; Sq Rel: $\frac{1}{T}\sum_{k}\frac{(d_k - d_k^*)^2}{d_k^*}$; RMSE: $\sqrt{\frac{1}{T}\sum_{k}(d_k - d_k^*)^2}$;
RMSE log: $\sqrt{\frac{1}{T}\sum_{k}(\log d_k - \log d_k^*)^2}$; Threshold: % of $d_k$ such that $\max\big(\frac{d_k}{d_k^*}, \frac{d_k^*}{d_k}\big) = \delta < \mathrm{thr}$, where $T$ is the number of evaluated pixels in the test image, and $d_k$ and $d_k^*$ are the predicted depth and true depth of the $k$-th pixel, respectively.
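A small NumPy sketch of these five indices, under the assumption that `pred` and `gt` are flattened arrays of matched predicted and ground-truth depths over the valid pixels of one test image:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard Eigen-split depth evaluation indices."""
    ratio = np.maximum(gt / pred, pred / gt)               # threshold accuracies
    a1, a2, a3 = [(ratio < 1.25 ** p).mean() for p in (1, 2, 3)]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```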
An Eigen split test set containing 697 images from 29 scenes is selected to test the network trained on the KITTI dataset, and accuracy comparison and visual quality evaluation are carried out against other existing methods. FIG. 8 compares the accuracy of the proposed method with other depth estimation methods on the Eigen split test set, where a1, a2, and a3 denote $\delta < 1.25$, $\delta < 1.25^2$, and $\delta < 1.25^3$, respectively. As shown in FIG. 8, the proposed method performs best among the compared unsupervised depth estimation methods; compared with several mainstream supervised depth estimation methods, it is inferior only to ACA, a depth estimation method based on an attention aggregation network, and superior to the other supervised methods trained with ground truth. FIG. 5 shows the visual quality evaluation of the proposed method against two mainstream unsupervised depth estimation methods; it can be seen that, with the convolutional block attention module, the proposed method better preserves the depth details of objects in the environment, and its overall visual quality exceeds that of the other two methods.
To better demonstrate the cross-scene generalization capability of unsupervised depth estimation over supervised depth estimation, the network trained on the KITTI dataset is applied in a depth estimation experiment on urban road scenes in part of Nanjing. FIG. 6 shows the experimental platform for depth estimation in a real environment, and the depth estimation results of the proposed method are shown in FIG. 7; the trained network achieves satisfactory visual quality of depth estimation in unknown scenes and preserves the depth details of many nearby objects.
The technical means disclosed by the scheme of the invention are not limited to those disclosed in the above embodiment, and also include improvements and modifications based on the above technical features, which are likewise considered to fall within the protection scope of the invention.
Claims (5)
1. A monocular unsupervised depth estimation method based on CBAM, characterized in that the method comprises the following steps:
step 1), combining CBAM and Resblock to form a Resblock-CBAM, with the following specific steps:
a) The channel attention sub-module and the spatial attention sub-module in the CBAM are connected in sequence; the CBAM is then combined with a Resblock in parallel to form the Conventional Resblock-CBAM, whose output equation is shown in formula (1):
$F_C = F_r + F_r''$ (1)
where $F_r$ is the output feature of the Resblock, $F_r''$ is the output feature of the CBAM spatial attention sub-module, and $F_C$ is the output feature of the Conventional Resblock-CBAM;
b) The channel attention sub-module and the spatial attention sub-module in the CBAM are connected in sequence; the CBAM is then connected with a Resblock in series to form the Modified Resblock-CBAM, whose output equation is shown in formula (2):
$F_M = F_r''$ (2)
where $F_M$ is the output feature of the Modified Resblock-CBAM;
c) The overall process of the channel attention sub-module and the spatial attention sub-module in the CBAM is shown in formula (3):
$F_r' = M_c(F_r) \otimes F_r, \quad F_r'' = M_s(F_r') \otimes F_r'$ (3)
where $F_r'$ is the output feature of the CBAM channel attention sub-module, $M_c$ is the one-dimensional channel attention map, $M_s$ is the two-dimensional spatial attention map, and $\otimes$ denotes element-wise multiplication;
step 2), designing a depth estimation network with an attention mechanism based on the Resblock-CBAM;
and step 3), training the depth estimation network with photometric reconstruction, disparity smoothing, and left-right disparity consistency losses on stereo image pairs, and completing depth estimation of monocular images.
2. The CBAM-based monocular unsupervised depth estimation method according to claim 1, characterized in that the specific process of the channel attention sub-module is shown in formula (4):
$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(\omega_1(\omega_0(F^c_{avg})) + \omega_1(\omega_0(F^c_{max}))\big)$ (4)
where $\sigma$ denotes the sigmoid function, MLP is a multi-layer perceptron, $\omega_0$ and $\omega_1$ are the weights of the multi-layer perceptron, and $F^c_{avg}$ and $F^c_{max}$ are the channel descriptors produced by average pooling and max pooling.
3. The CBAM-based monocular unsupervised depth estimation method according to claim 1, characterized in that the specific process of the spatial attention sub-module is shown in formula (5):
$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$ (5)
where $f^{7\times 7}$ denotes a convolution with a 7×7 filter, and $F^s_{avg}$ and $F^s_{max}$ are the spatial descriptors produced by average pooling and max pooling.
4. The CBAM-based monocular unsupervised depth estimation method according to claim 1, characterized in that designing the depth estimation network with an attention mechanism based on the Resblock-CBAM in step 2) comprises the following specific steps:
a) Four Resblock-CBAM modules are used in sequence in the encoder of the depth estimation network: the first three are Conventional Resblock-CBAM and the fourth is Modified Resblock-CBAM;
b) Five skip connections are used between the encoder and the decoder of the depth estimation network: the first connects the first convolution layer of the encoder to the second up-convolution layer of the decoder, the second connects the first pooling layer of the encoder to the third up-convolution layer of the decoder, the third connects the first Conventional Resblock-CBAM to the fourth up-convolution layer of the decoder, the fourth connects the second Conventional Resblock-CBAM to the fifth up-convolution layer of the decoder, and the fifth connects the third Conventional Resblock-CBAM to the sixth up-convolution layer of the decoder; the Modified Resblock-CBAM is connected directly to the decoder without a skip connection.
5. The CBAM-based monocular unsupervised depth estimation method according to claim 1, characterized in that step 3) trains the depth estimation network with photometric reconstruction, disparity smoothing, and left-right disparity consistency losses on stereo image pairs and performs monocular image depth estimation at test time, comprising the following specific steps:
a) The total training loss of the depth estimation network consists of the photometric reconstruction loss, the disparity smoothing loss, and the left-right disparity consistency loss, as shown in formula (6):
$L = \sum_s L_s, \quad L_s = \alpha_{ap}(L^l_{ap} + L^r_{ap}) + \alpha_{ds}(L^l_{ds} + L^r_{ds}) + \alpha_{lr}(L^l_{lr} + L^r_{lr})$ (6)
where $L$ is the total training loss of the depth estimation network, $L_s$ is the training loss at scale $s$, $\alpha_{ap}$, $\alpha_{ds}$, and $\alpha_{lr}$ are the weight coefficients of the photometric reconstruction loss, the disparity smoothing loss, and the left-right disparity consistency loss, respectively, $L^l_{ap}$ and $L^r_{ap}$ are the photometric reconstruction losses of the left and right images, $L^l_{ds}$ and $L^r_{ds}$ are the disparity smoothing losses of the left and right images, and $L^l_{lr}$ and $L^r_{lr}$ are the disparity consistency losses of the left and right images;
b) The photometric reconstruction loss measures the difference between an input source image and its corresponding reconstructed image, as shown in formula (7):
$L^l_{ap} = \frac{1}{N}\sum_{i,j}\Big[\alpha_1\frac{1-\mathrm{SSIM}(I^l_{ij}, \tilde{I}^l_{ij})}{2} + (1-\alpha_1)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\Big]$ (7)
where $L^l_{ap}$ is the photometric reconstruction loss of the left image, $N$ is the number of pixels in a single image, $I^l$ is the left image, $\tilde{I}^l$ is the reconstructed left image, SSIM is the structural similarity function, and $\alpha_1$ is the scale parameter of the SSIM term; the right-image photometric reconstruction loss $L^r_{ap}$ takes the same form;
c) The disparity smoothing loss encourages the disparity map to be locally smooth while allowing discontinuities at image gradients, as shown in formula (8):
$L^l_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d^l_{ij}\big|\,e^{-\|\partial_x I^l_{ij}\|} + \big|\partial_y d^l_{ij}\big|\,e^{-\|\partial_y I^l_{ij}\|}\Big)$ (8)
where $L^l_{ds}$ is the disparity smoothing loss of the left image and $d^l$ is the disparity map of the left image; the right-image disparity smoothing loss $L^r_{ds}$ takes the same form;
d) The left-right disparity consistency loss enforces agreement between the left and right disparity predictions and improves the network's depth estimation accuracy; the left-image disparity consistency loss is shown in formula (9):
$L^l_{lr} = \frac{1}{N}\sum_{i,j}\Big[\alpha_2\frac{1-\mathrm{SSIM}(d^l_{ij}, \tilde{d}^l_{ij})}{2} + (1-\alpha_2)\,\big|d^l_{ij}-\tilde{d}^l_{ij}\big|\Big]$ (9)
where $L^l_{lr}$ is the disparity consistency loss of the left image, $\alpha_2$ is the scale parameter of the SSIM term, and $\tilde{d}^l$ is the right disparity map mapped to the left view; the right-image disparity consistency loss $L^r_{lr}$ takes the same form.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110142746.6A | 2021-02-02 | 2021-02-02 | Monocular unsupervised depth estimation method based on CBAM |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN112950697A | 2021-06-11 |
| CN112950697B | 2024-04-16 |
Family

ID=76241549

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110142746.6A | Monocular unsupervised depth estimation method based on CBAM | 2021-02-02 | 2021-02-02 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN112950697B |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111739082A | 2020-06-15 | 2020-10-02 | | An unsupervised depth estimation method for stereo vision based on convolutional neural networks |
| US11138751B2 | 2019-07-06 | 2021-10-05 | Toyota Research Institute, Inc. | Systems and methods for semi-supervised training using reprojected distance loss |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |