CN119888093B - Binocular depth estimation-based three-dimensional road scene generation method - Google Patents
- Publication number: CN119888093B (application CN202510361155.6A)
- Authority
- CN
- China
- Legal status: Active
Classifications
- G06T17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- G06T5/60 — Image enhancement or restoration using machine learning, e.g. neural networks
- G06T5/80 — Geometric correction
- G06T7/12 — Edge-based segmentation
- G06T7/13 — Edge detection
- G06T7/75 — Determining position or orientation of objects or cameras using feature-based methods involving models
- G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters (camera calibration)
- G06V10/454 — Integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/54 — Extraction of image or video features relating to texture
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition using neural networks
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30244 — Camera pose
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a three-dimensional road scene generation method based on binocular depth estimation, built on left- and right-eye depth camera sensors, a binocular depth estimation neural network model, an edge-constrained road segmentation module, and a point cloud data optimization module. By combining binocular depth estimation with an edge constraint algorithm, the method improves three-dimensional reconstruction accuracy in complex road scenes; by exploiting the efficient point cloud processing and rendering capability of Open3D, it achieves real-time performance and visualization; and by replacing lidar with a low-cost binocular camera, it markedly reduces hardware cost and makes large-scale application feasible.
Description
Technical Field
The invention relates to the fields of automatic driving system display technology and road scene reconstruction, and in particular to a three-dimensional road scene generation method based on binocular depth estimation.
Background
In the fields of autonomous driving and intelligent transportation, three-dimensional reconstruction of the road environment is a crucial research direction. Many technical solutions are currently applied in this field, but all have certain limitations. For example, patent CN110796728B acquires three-dimensional point cloud data with a lidar and reconstructs the external dimensions, structure, position, and pose of the target by a greedy projection algorithm. Such conventional three-dimensional reconstruction methods rely on high-precision lidar which, despite its high accuracy, is expensive and difficult to deploy economically in large-scale application scenarios.
In contrast, cameras are becoming an economical and efficient sensor and are increasingly replacing lidar in some scenarios. For example, patent CN116091695A applies hierarchical reinforcement learning to three-dimensional reconstruction; although it achieves high accuracy, it mainly targets single objects rather than complex road scenes, so its applicability is limited. Patent CN116091703A adopts real-time three-dimensional reconstruction based on multi-view stereo matching, building a model from images shot one by one with a monocular camera; although this scheme improves reconstruction accuracy to a certain extent, its capture efficiency is low and its accuracy in dynamic environments is unsatisfactory.
By comparison, a binocular camera estimates depth from the parallax between left and right views and is a sensor scheme with relatively low cost and modest hardware requirements. However, current binocular depth estimation algorithms still face challenges such as insufficient accuracy and high computational complexity in complex road scenes. A new technical scheme is therefore needed that both reduces hardware cost and improves reconstruction accuracy and efficiency in complex scenes, overcoming the limitations of the prior art.
Disclosure of Invention
The invention provides a three-dimensional road scene generation method based on binocular depth estimation, which acquires left and right view images of the road environment through a binocular camera, generates depth information for each pixel with a depth estimation algorithm, highlights key features of the road scene through an edge constraint algorithm, and finally realizes efficient three-dimensional road scene generation with Open3D. The invention can markedly improve the reconstruction accuracy and efficiency of complex road scenes, benefits the application of automatic driving and intelligent traffic systems, and in particular realizes real-time perception on low-cost hardware.
The invention discloses a three-dimensional road scene generation method based on binocular depth estimation, which is based on a left-right-eye depth camera sensor, a binocular depth estimation neural network model, an edge constraint road segmentation module and a point cloud data optimization module, and comprises the following steps:
step 1, dynamically capturing road scene pictures through left and right eye depth camera sensors;
Step 2, calculating left and right eye depth images through a binocular depth estimation neural network model;
And 3, generating point cloud data according to the predicted depth map and the camera internal parameter matrix.
Further, in step 1, the method further comprises the steps of:
step 1.1, configuring left and right eye depth cameras according to an internal reference matrix of a sensor to enable the cameras to work synchronously, and setting capturing frame rate and resolution to meet real-time processing requirements;
Step 1.2, collecting left and right images of a road scene through left and right vision sensors arranged on a vehicle, and preprocessing, wherein the method comprises the following steps:
step 1.2.1, performing geometric correction on the image through the calibrated internal reference matrix and distortion coefficients:
x' = x / z ,  y' = y / z ;
x'' = x' · (1 + k1·r^2 + k2·r^4 + k3·r^6) ,  y'' = y' · (1 + k1·r^2 + k2·r^4 + k3·r^6) ;
wherein (x, y, z) are the pixel (homogeneous) coordinates, (x', y') are the standard (normalized) coordinates, (x'', y'') are the corrected coordinates, k1, k2, k3 are the distortion coefficients, and r^2 = x'^2 + y'^2;
And 1.2.2, extracting the road region of interest from the corrected image, using an edge detection algorithm combined with a preset region mask to preserve the road surface and important edge features.
Further, in step 2, the method further comprises the following steps:
Step 2.1, inputting the preprocessed left and right eye images into a neural network to serve as basic data of a model;
step 2.2, extracting and fusing multi-layer features of left and right eye images through an encoder;
Step 2.3, calculating an initial parallax map through a parallax estimation module;
d(x, y) = x_l − x_r ;
wherein d(x, y) is the disparity value at pixel coordinates (x, y), representing the displacement difference of the left and right eye images at that pixel, (x, y) are the pixel coordinates, and x_l and x_r are respectively the abscissae of corresponding points in the left and right images;
Step 2.4, improving the saliency of key regions of the scene through a spatial attention mechanism;
step 2.5, enhancing boundary information in the depth map through a depth edge constraint module;
D_edge(x, y) = D(x, y) + λ · E(x, y) ;
wherein D_edge(x, y) represents the depth value after optimization by the depth edge constraint module, D(x, y) is the original depth value, E(x, y) is the edge detection value, and λ is a weight parameter;
step 2.6, carrying out weighted fusion on the depth maps output by the parallax estimation module, the spatial attention module and the depth edge constraint module:
D_final = w1 · D_disp + w2 · D_att + w3 · D_edge ;
wherein D_final is the final depth map obtained after weighting the three module outputs, D_disp is the disparity-estimated depth map, D_att is the spatial-attention-enhanced depth map, D_edge is the depth-edge-enhanced depth map, and w1, w2, w3 are fusion weights with w1 + w2 + w3 = 1.
Further, in step 2.2, the method further comprises the steps of:
step 2.2.1, multistage feature extraction, comprising:
extracting low-level features, namely extracting edge and texture features of the left and right eye images through an initial convolution layer:
F_low(i, j) = Σ_{m=0}^{k_w−1} Σ_{n=0}^{k_h−1} I(i+m, j+n) · K_low(m, n) ;
wherein I is the input image pixel value, k_w is the width of the convolution kernel, k_h is the height of the convolution kernel, K_low is the low-level convolution kernel, and F_low is the low-level feature output;
Extracting middle-level features, namely extracting geometric and texture information in the scene by using a convolution module with a pooling layer, and reducing redundant data:
F_mid(i, j) = max_{(m, n) ∈ W(i, j)} F_low(m, n) ;
wherein W(i, j) is the pooling window, m ranges over the pixel coordinates of the pooling window in the horizontal direction, n over those in the vertical direction, (m, n) indexes pixel coordinates within the pooling window, and F_mid is the middle-level feature output;
High-level feature extraction, namely extracting scene semantic features by using stacked convolution layers:
F_high^c(i, j) = σ( Σ_{c'} Σ_{(m, n)} F^{c'}(i+m, j+n) · K^{c, c'}(m, n) + b_c ) ;
wherein:
F_high^c is the high-level feature output on channel c;
F^{c'} is the feature value of the feature map output by the upper layer on channel c', and (m, n) is a feature location within the local window;
K^{c, c'} is a convolution kernel of size k×k whose input channel is c' and output channel is c; b_c is the bias term for output channel c; σ is the activation function, used to introduce nonlinearity;
the receptive field is gradually expanded and semantic features are captured by stacking multiple convolution layers, iterating the formula until F_high is finally obtained;
Combining the low, medium and high level features by a skip connection to form hierarchical features:
F_fused = α · F_low + β · F_mid + γ · F_high ;
wherein α, β, γ respectively represent the weights of the low, medium and high level features in the fusion process.
Further, in step 2.4, the method further comprises the steps of:
Step 2.4.1, generating multi-scale features;
Inputting the feature map F_fused, features of different scales are extracted through different convolution kernels:
F_k(x, y) = Conv_{k×k}(F_fused)(x, y) ;
wherein Conv_{k×k} indicates a convolution operation with kernel size k×k, and F_k(x, y) is the k-th scale feature map generated at pixel coordinates (x, y) after the convolution with kernel size k×k;
Step 2.4.2, superimposing the feature maps of different scales according to weights:
F_ms(x, y) = Σ_k w_k · F_k(x, y) ,  Σ_k w_k = 1 ;
wherein k is the scale index, the constraint Σ_k w_k = 1 ensures normalization of the weights, and F_ms(x, y) is the final feature value at pixel coordinates (x, y) after all scale features are fused;
and 2.4.3, superimposing the fused feature map on the input feature map to obtain the final enhanced features:
F_enh(x, y) = F_fused(x, y) + F_ms(x, y) ;
wherein F_enh(x, y) is the enhanced feature map value at pixel coordinates (x, y), integrating the original hierarchical features F_fused and the multi-scale fusion features F_ms.
Further, in step 3, the method further comprises the steps of:
Step 3.1, combining the depth map D_final with the camera internal reference matrix to generate initial point cloud data:
X = (u − c_x) · Z / f_x ,  Y = (v − c_y) · Z / f_y ,  Z = D_final(u, v) ;
wherein (u, v) are pixel coordinates and (c_x, c_y) is the optical centre position of the camera; f_x, f_y are the focal lengths in the horizontal and vertical directions of the camera respectively; X and Y respectively represent the positions of the three-dimensional point in the horizontal and vertical directions of the camera coordinate system, and Z is the depth along the camera's optical axis;
step 3.2, performing downsampling and denoising processing on the initial point cloud to improve data quality, wherein the method comprises the following steps:
Step 3.2.1, dividing the initial point cloud into cubic grids of size s by a voxel grid method, and representing the points of each grid by the centroid of the points within it:
c = (1/N) · Σ_{i=1}^{N} p_i ;
where N is the number of points in the grid, c is the centroid coordinate of the grid, and p_i are the coordinates of the points within the grid;
Step 3.2.2, calculating the average distance d_i between each point and its K nearest neighbours, and eliminating outliers which do not satisfy the following condition:
|d_i − μ| ≤ α · σ ;
wherein μ is the mean of the average distances, σ is the standard deviation, α is a user-set threshold, and d_i is the average distance between each point and its K nearest neighbours;
Step 3.3, inputting the optimized point cloud data into the Open3D library, generating and rendering the three-dimensional road scene, adding color information to the point cloud data, and mapping depth values to colors:
color(p) = (d_p − d_min) / (d_max − d_min) ;
wherein d_p is the depth value of point p, and d_max and d_min are respectively the maximum and minimum depth values.
The beneficial effects achieved by the invention are as follows:
according to the method, the accuracy of three-dimensional reconstruction in the complex road scene is improved by combining binocular depth estimation with an edge constraint algorithm;
According to the method, real-time performance and visualization are realized by utilizing the high-efficiency point cloud processing and rendering capability of Open 3D;
the invention adopts the binocular camera with low cost to replace the laser radar, thereby obviously reducing the hardware cost and providing feasibility for large-scale application.
Drawings
FIG. 1 is a flow framework of a three-dimensional road scene generation method based on binocular depth estimation;
FIG. 2 is a binocular depth estimation network model of a three-dimensional road scene generation method based on binocular depth estimation;
fig. 3 is a point cloud data processing flow of a three-dimensional road scene generation method based on binocular depth estimation.
Detailed Description
The invention will be further described with reference to specific embodiments, and advantages and features of the invention will become apparent from the description. These examples are merely exemplary and do not limit the scope of the invention in any way. It will be understood by those skilled in the art that various changes and substitutions of details and forms of the technical solution of the present invention may be made without departing from the spirit and scope of the present invention, but these changes and substitutions fall within the scope of the present invention.
As shown in fig. 1, this embodiment is a flow framework of a three-dimensional road scene generation method based on binocular depth estimation. The frame comprises the following modules:
The left-right eye depth camera sensor is used for capturing road scene images in real time and carrying an internal reference matrix of the sensor;
The binocular depth estimation neural network model is used for generating a depth map of a road scene;
the edge constraint road segmentation module optimizes the characteristics of road edges and scene boundaries;
and the point cloud data optimization module is used for carrying out down-sampling, noise reduction and rendering optimization processing based on the generated point cloud data.
The method specifically comprises the following steps:
And step 1, capturing and preprocessing left and right eye images.
The road scene picture is dynamically captured by a left eye depth camera sensor and a right eye depth camera sensor, and the method comprises the following steps:
step 1.1, initializing left and right vision sensors;
The left and right eye depth cameras are configured according to the internal reference matrix of the sensor to synchronously work, and capture frame rate and resolution are set to meet the real-time processing requirement.
Step 1.2, capturing a road scene image and preprocessing;
The left and right images of the road scene are collected by the left and right vision sensors mounted on the vehicle and preprocessed, specifically comprising the following steps:
step 1.2.1, correcting image distortion;
the geometric correction is carried out on the image through the camera calibration parameters (internal reference matrix and distortion coefficients), with the specific formulas:
x' = x / z ,  y' = y / z ;
x'' = x' · (1 + k1·r^2 + k2·r^4 + k3·r^6) ,  y'' = y' · (1 + k1·r^2 + k2·r^4 + k3·r^6) ;
wherein (x, y, z) are the pixel (homogeneous) coordinates, (x', y') are the standard (normalized) coordinates, (x'', y'') are the corrected coordinates, k1, k2, k3 are the distortion coefficients, and r^2 = x'^2 + y'^2.
Step 1.2.2, extracting a region of interest (ROI);
The road region of interest is extracted from the corrected image, and an edge detection algorithm (e.g. Canny) combined with a preset region mask is used to preserve the road surface and important edge features.
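As a concrete sketch of the correction in step 1.2.1, the radial distortion model can be applied directly in NumPy. The cubic radial polynomial and the coefficient names k1–k3 follow the standard Brown model assumed above; the sample coordinates and coefficient values are made up for illustration:

```python
import numpy as np

def correct_points(pts_xyz, k1, k2, k3):
    """Apply the radial distortion model of step 1.2.1.

    pts_xyz: (N, 3) array of homogeneous pixel coordinates (x, y, z).
    Returns the corrected coordinates (x'', y'')."""
    x_n = pts_xyz[:, 0] / pts_xyz[:, 2]      # x' = x / z
    y_n = pts_xyz[:, 1] / pts_xyz[:, 2]      # y' = y / z
    r2 = x_n**2 + y_n**2                     # r^2 = x'^2 + y'^2
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    return np.stack([x_n * radial, y_n * radial], axis=1)

pts = np.array([[100.0, 50.0, 100.0],        # sample homogeneous pixels
                [0.0, 0.0, 1.0]])            # a point on the optical axis
corrected = correct_points(pts, k1=1e-2, k2=0.0, k3=0.0)
```

A point on the optical axis (x' = y' = 0) is unchanged by the radial term, which is a quick sanity check of the formula.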
Step 2, inputting and calculating a binocular depth estimation neural network model;
as shown in fig. 2, this embodiment illustrates the architecture of a binocular depth estimation network model, specifically including the following modules:
the parallax estimation module is used for generating an initial parallax image;
The spatial attention module is used for highlighting key areas in the scene and enhancing the accuracy of depth estimation;
and the depth edge constraint module is used for enhancing the definition of the boundary through semantic edge enhancement.
The method comprises the following specific steps:
Step 2.1, inputting left and right eye images;
and inputting the preprocessed left and right eye images into a neural network to serve as basic data of the model.
Step 2.2, feature coding;
Three types of features of the left and right eye images, namely low-level features (such as edge information), medium-level features (such as texture information) and high-level features (such as scene semantic information) are extracted through an encoder.
The method comprises the following specific steps:
2.2.1, extracting multi-stage characteristics;
(1) And extracting low-level characteristics, namely extracting edge and texture characteristics of the left and right eye images through an initial convolution layer.
The sliding operation is performed over the image using several 3×3 convolution kernels, with the calculation formula:
F_low(i, j) = Σ_{m=0}^{2} Σ_{n=0}^{2} I(i+m, j+n) · K_low(m, n) ;
wherein I is the input image pixel value, K_low is the low-level convolution kernel, and F_low is the low-level feature output.
(2) And extracting middle-level characteristics, namely extracting geometric and texture information in a scene by using a convolution module with a pooling layer, and reducing redundant data.
Pooling operation formula:
F_mid(i, j) = max_{(m, n) ∈ W(i, j)} F_low(m, n) ;
wherein W(i, j) is the pooling window, (m, n) ranges over the window, and F_mid is the middle-level feature output.
(3) High-level feature extraction, namely extracting scene semantic features such as roads, barriers and the like by using stacked convolution layers.
The high-level features are expressed as:
F_high^c(i, j) = σ( Σ_{c'} Σ_{(m, n)} F^{c'}(i+m, j+n) · K^{c, c'}(m, n) + b_c ) ;
wherein: F_high^c is the high-level feature output on channel c;
F^{c'} is the feature map output by the upper layer, with channel index c';
K^{c, c'} is a convolution kernel of size k×k whose input channel is c' and output channel is c; b_c is the bias term of the convolution; σ is the activation function, introducing nonlinearity.
The receptive field is gradually expanded and semantic features are captured by stacking multiple convolution layers, iterating the formula until F_high is finally obtained.
2.2.2, Feature fusion;
The low, medium, and high level features are combined by skip connections to form hierarchical features:
F_fused = α · F_low + β · F_mid + γ · F_high ;
wherein α, β, γ respectively represent the weights of the low, medium and high level features in the fusion process.
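The multi-stage extraction and skip-connection fusion of step 2.2 can be sketched with plain NumPy. The Laplacian-like kernel, the ReLU activation, the cropping used in place of resampling, and the weights α, β, γ are all illustrative stand-ins for the learned components:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' sliding-window operation matching
    F_low(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(feat, size=2):
    """F_mid(i, j) = max over the pooling window W(i, j)."""
    H2, W2 = feat.shape[0] // size, feat.shape[1] // size
    return feat[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)        # linear ramp image
K = np.array([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])  # Laplacian-like
F_low = conv2d_valid(img, K)                          # 4x4 low-level features
F_mid = maxpool2d(F_low)                              # 2x2 mid-level features
F_high = np.maximum(conv2d_valid(F_low, K), 0.0)      # 2x2, ReLU on stacked conv
alpha, beta, gamma = 0.5, 0.3, 0.2
# Skip-connection fusion needs maps of one size; cropping F_low here stands
# in for the up/down-sampling a real network would use.
F_fused = alpha * F_low[:2, :2] + beta * F_mid + gamma * F_high
```

Because the Laplacian of a linear ramp vanishes, every feature map in this toy run is identically zero, which makes the arithmetic easy to verify by hand.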
Step 2.3, parallax estimation;
after feature extraction and fusion is completed, this step aims at generating an initial disparity map by a disparity estimation module as a basis for depth estimation.
The disparity map reflects the displacement difference between the left and right images and is directly related to depth information, providing input data for subsequent depth edge optimization. The computing formula of the disparity estimation module is:
d(x, y) = x_l − x_r ;
wherein (x, y) are the pixel coordinates, and x_l and x_r are respectively the abscissae of the corresponding points in the left and right images.
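The patent computes disparity with a learned module; purely to illustrate what the output quantity d(x, y) = x_l − x_r means, here is a classical one-row block-matching search under a sum-of-absolute-differences cost (all names and sizes are illustrative):

```python
import numpy as np

def block_match_disparity(left_row, right_row, x_l, win=2, max_disp=5):
    """Classical 1-D block matching: find the shift d such that the patch
    around column x_l in the left row matches column x_r = x_l - d in the
    right row, i.e. d = x_l - x_r. A stand-in for the learned module."""
    patch = left_row[x_l - win : x_l + win + 1]
    best_d, best_cost = 0, np.inf
    for disp in range(0, max_disp + 1):       # candidate disparities
        x_r = x_l - disp                      # corresponding right column
        if x_r - win < 0:
            break
        cost = np.abs(right_row[x_r - win : x_r + win + 1] - patch).sum()
        if cost < best_cost:
            best_cost, best_d = cost, disp
    return best_d

# Synthetic rows: the right view is the left view shifted 3 px,
# so the true disparity for interior pixels is 3.
left = np.sin(np.arange(30, dtype=float))
right = np.roll(left, -3)
d = block_match_disparity(left, right, x_l=15)
```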
Step 2.4, enhancing the spatial attention;
In order to further optimize the accuracy of depth estimation and compensate for perspective effects that disparity estimation alone cannot resolve, the method introduces a spatial attention mechanism to improve the saliency of key regions of the scene (such as lane lines and road edges), specifically comprising the following steps:
Step 2.4.1, generating multi-scale features;
the feature map F_fused obtained by the feature extraction of step 2.2.2 is passed through different convolution kernels to extract features of different scales, generating multi-scale feature maps:
F_k(x, y) = Conv_{k×k}(F_fused)(x, y) ;
wherein Conv_{k×k} indicates a convolution operation with kernel size k×k.
Step 2.4.2, trans-scale fusion;
the feature maps of different scales obtained in step 2.4.1 are superimposed according to weights, with the formula:
F_ms(x, y) = Σ_k w_k · F_k(x, y) ,  Σ_k w_k = 1 ;
wherein k is the scale index and the constraint Σ_k w_k = 1 ensures weight normalization.
Step 2.4.3, enhancing output;
superimposing the fused feature map onto the feature map F_fused to obtain the final enhanced features:
F_enh(x, y) = F_fused(x, y) + F_ms(x, y) ;
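Steps 2.4.1–2.4.3 can be sketched as follows; the mean filter standing in for the learned Conv_{k×k}, the scale set {1, 3, 5}, and the weights are assumptions for illustration only:

```python
import numpy as np

def smooth(feat, k):
    """Stand-in for Conv_{kxk}: a k x k mean filter with edge padding.
    The real model uses learned kernels; averaging is assumed here purely
    to produce features at several receptive-field sizes."""
    pad = k // 2
    padded = np.pad(feat, pad, mode="edge")
    out = np.zeros_like(feat, dtype=float)
    for di in range(k):
        for dj in range(k):
            out += padded[di:di + feat.shape[0], dj:dj + feat.shape[1]]
    return out / (k * k)

F_fused = np.random.default_rng(0).random((8, 8))
scales = [1, 3, 5]
w = np.array([0.5, 0.3, 0.2])                 # fusion weights, sum to 1
F_k = [smooth(F_fused, k) for k in scales]    # step 2.4.1: multi-scale maps
F_ms = sum(wk * fk for wk, fk in zip(w, F_k)) # step 2.4.2: cross-scale fusion
F_enh = F_fused + F_ms                        # step 2.4.3: residual enhancement
```

Note that the k = 1 "convolution" is the identity, so the smallest scale simply passes F_fused through unchanged.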
Step 2.5, depth edge constraint;
With the spatial attention enhancement of step 2.4 completed, a depth edge constraint module is introduced in this step to further improve the boundary definition of the depth map and reduce blurred regions. The module optimizes the boundary features of key regions by fusing edge detection information with the depth map data, providing more accurate input for the final three-dimensional reconstruction. The depth edge constraint module is optimized by the following formula:
D_edge(x, y) = D(x, y) + λ · E(x, y) ;
wherein D(x, y) is the original depth value, E(x, y) is the edge detection value, and λ is the weight parameter.
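A minimal sketch of the constraint D_edge = D + λ·E, with a gradient-magnitude edge map assumed as the (unspecified) edge detector and an illustrative λ:

```python
import numpy as np

def gradient_edges(depth):
    """Simple gradient-magnitude edge map E(x, y) on interior pixels;
    the patent leaves the edge detector open, so central differences
    are assumed here."""
    gx = depth[1:-1, 2:] - depth[1:-1, :-2]
    gy = depth[2:, 1:-1] - depth[:-2, 1:-1]
    E = np.zeros_like(depth)
    E[1:-1, 1:-1] = np.hypot(gx, gy)
    return E

# A step-shaped depth map: a near plane (2 m) beside a far plane (10 m).
D = np.full((6, 6), 2.0)
D[:, 3:] = 10.0
lam = 0.1                                 # weight parameter lambda (illustrative)
D_edge = D + lam * gradient_edges(D)      # step 2.5
```

Away from the depth discontinuity E is zero and D_edge equals D; along the discontinuity the depth values are sharpened by the edge term.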
Step 2.6, fusing depth maps;
And carrying out weighted fusion on the depth maps output by the parallax estimation module, the spatial attention module and the depth edge constraint module:
D_final = w1 · D_disp + w2 · D_att + w3 · D_edge ;
wherein D_disp is the disparity-estimated depth map, D_att is the spatial-attention-enhanced depth map, D_edge is the depth-edge-enhanced depth map, and w1, w2, w3 are fusion weights with w1 + w2 + w3 = 1.
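The fusion of step 2.6 is a convex combination, so the fused depth at each pixel stays within the range spanned by the three branches. A sketch with synthetic branch outputs and illustrative weights:

```python
import numpy as np

# Synthetic branch outputs standing in for the three module depth maps.
rng = np.random.default_rng(1)
D_disp = rng.uniform(1.0, 20.0, (4, 4))        # disparity-estimated depth
D_att = D_disp + rng.normal(0, 0.1, (4, 4))    # attention-refined branch
D_edge = D_disp + rng.normal(0, 0.1, (4, 4))   # edge-constrained branch
w1, w2, w3 = 0.4, 0.3, 0.3                     # fusion weights, w1+w2+w3 = 1
D_final = w1 * D_disp + w2 * D_att + w3 * D_edge
```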
Step 3, generating and optimizing point cloud data;
as shown in fig. 3, this embodiment illustrates the point cloud data processing flow of the method.
This step utilizes the optimized depth map D_final output in step 2 and the camera internal reference matrix, and comprises the following steps:
step 3.1, generating initial point cloud;
the depth map D_final is combined with the camera internal reference matrix to generate initial point cloud data, with the formula:
X = (u − c_x) · Z / f_x ,  Y = (v − c_y) · Z / f_y ,  Z = D_final(u, v) ;
wherein (u, v) are pixel coordinates, (c_x, c_y) is the optical centre position, and f_x, f_y are the focal lengths; X and Y respectively represent the positions of the three-dimensional point in the horizontal and vertical directions of the camera coordinate system, and Z is the depth along the camera's optical axis.
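Step 3.1 is a standard pinhole back-projection; a sketch with made-up intrinsics:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-frame 3-D points (step 3.1):
    X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth[v, u]."""
    v, u = np.indices(depth.shape)        # v: row (vertical), u: column
    Z = depth
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

# Illustrative intrinsics for a 4x4 depth map of a fronto-parallel plane at 5 m.
depth = np.full((4, 4), 5.0)
pts = depth_to_points(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
```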
Step 3.2, optimizing the point cloud;
and carrying out downsampling and denoising treatment on the initial point cloud so as to improve the data quality.
The specific implementation steps are as follows:
Step 3.2.1, voxel grid downsampling;
The initial point cloud is divided into cubic grids of size s by a voxel grid method, and the points of each grid are represented by the centroid of the points within it, with the formula:
c = (1/N) · Σ_{i=1}^{N} p_i ;
where N is the number of points in the grid, c is the centroid coordinate of the grid, and p_i are the coordinates of the points within the grid.
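A minimal voxel-grid downsampling sketch following step 3.2.1 (the voxel size and sample points are illustrative):

```python
import numpy as np

def voxel_downsample(points, s):
    """Step 3.2.1: bucket points into cubes of edge length s and replace
    each occupied voxel by the centroid of its points."""
    keys = np.floor(points / s).astype(np.int64)      # voxel index per point
    buckets = {}
    for key, p in zip(map(tuple, keys), points):
        acc = buckets.setdefault(key, [np.zeros(3), 0])
        acc[0] = acc[0] + p                           # running coordinate sum
        acc[1] += 1                                   # running point count
    return np.array([total / n for total, n in buckets.values()])

pts = np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3],     # same voxel when s = 0.5
                [0.9, 0.9, 0.9]])                     # a different voxel
down = voxel_downsample(pts, s=0.5)
```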
Step 3.2.2, denoising through statistical filtering;
after the downsampling by the voxel grid method, the average distance d_i between each point and its K nearest neighbours is calculated, and outliers which do not satisfy the following condition are eliminated:
|d_i − μ| ≤ α · σ ;
where μ is the mean of the average distances, σ is the standard deviation, and α is the user-set threshold.
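Statistical outlier removal per step 3.2.2 can be sketched as follows; the neighbour count k and threshold α are illustrative:

```python
import numpy as np

def remove_outliers(points, k=3, alpha=1.0):
    """Step 3.2.2: statistical filtering. Keep point i when its mean
    distance d_i to its k nearest neighbours satisfies |d_i - mu| <= alpha*sigma."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)              # pairwise distances
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]       # drop the self-distance 0
    d = nearest.mean(axis=1)                          # d_i per point
    mu, sigma = d.mean(), d.std()
    keep = np.abs(d - mu) <= alpha * sigma
    return points[keep]

# A tight cluster of 20 points plus one far-away outlier.
cluster = np.random.default_rng(2).normal(0.0, 0.05, (20, 3))
outlier = np.array([[10.0, 10.0, 10.0]])
filtered = remove_outliers(np.vstack([cluster, outlier]), k=3, alpha=1.0)
```

The brute-force pairwise distance matrix is O(N^2) and only suits a toy cloud; a KD-tree (as used inside Open3D's own outlier removal) would be the practical choice.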
Step 3.3, constructing a three-dimensional scene;
the optimized point cloud data are input into the Open3D library to generate and render the three-dimensional road scene. Color information is added to the point cloud data by mapping depth values to colors, with the specific formula:
color(p) = (d_p − d_min) / (d_max − d_min) ;
wherein d_p is the depth value of the optimized point p, and d_max, d_min are respectively the maximum and minimum depth values.
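The depth-to-color mapping of step 3.3 normalizes each depth into [0, 1]; the red-to-blue ramp below is an illustrative choice, and the commented Open3D calls show how the result would typically be rendered (assuming Open3D's legacy PointCloud API):

```python
import numpy as np

def depth_colors(points):
    """Step 3.3: map each point's depth (Z) to a normalized scalar in [0, 1]
    via (d_p - d_min) / (d_max - d_min), then to an RGB ramp
    (near = red, far = blue)."""
    z = points[:, 2]
    t = (z - z.min()) / (z.max() - z.min())   # normalized depth
    return np.stack([1.0 - t, np.zeros_like(t), t], axis=1)

pts = np.array([[0.0, 0.0, 2.0],
                [0.0, 0.0, 6.0],
                [0.0, 0.0, 10.0]])
colors = depth_colors(pts)

# With Open3D installed, rendering would look roughly like:
#   import open3d as o3d
#   pcd = o3d.geometry.PointCloud()
#   pcd.points = o3d.utility.Vector3dVector(pts)
#   pcd.colors = o3d.utility.Vector3dVector(colors)
#   o3d.visualization.draw_geometries([pcd])
```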
The above are only specific steps of the invention and do not limit its scope of protection in any way; all technical solutions formed by equivalent transformation or equivalent substitution fall within the scope of protection of the invention, and the parts known to those skilled in the art are not described in detail here.
Claims (5)
1. A three-dimensional road scene generation method based on binocular depth estimation, based on left and right binocular depth camera sensors, a binocular depth estimation neural network model, an edge-constrained road segmentation module and a point cloud data optimization module, characterized by comprising the following steps:
step 1, dynamically capturing road scene pictures through left and right eye depth camera sensors;
Step 2, calculating left and right eye depth images through a binocular depth estimation neural network model;
Step 3, generating point cloud data according to the predicted depth map and the camera internal reference matrix;
in step 2, the method further comprises the following steps:
Step 2.1, inputting the preprocessed left and right eye images into a neural network to serve as basic data of a model;
step 2.2, extracting and fusing multi-layer features of left and right eye images through an encoder;
Step 2.3, calculating an initial parallax map through a parallax estimation module;
d(u, v) = x_L − x_R ;
wherein d(u, v) is the disparity value at pixel coordinates (u, v), representing the displacement difference between the left and right images at that pixel, and x_L and x_R are respectively the abscissas of the corresponding points in the left and right images;
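Given per-pixel correspondence maps, the claimed disparity formula is a simple subtraction; a minimal sketch with hypothetical toy coordinates:

```python
import numpy as np

# x-coordinates of matched points: same scene point seen in the left and right images
x_left  = np.array([[10.0, 20.0], [30.0, 40.0]])
x_right = np.array([[ 6.0, 15.0], [24.0, 33.0]])

disparity = x_left - x_right    # d(u, v) = x_L - x_R, per pixel

# Larger disparity corresponds to a closer object; with baseline B and focal
# length f, depth would follow Z = f * B / d (standard stereo geometry,
# not part of the claimed formula above).
```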
Step 2.4, improving the saliency of key regions of the scene through a spatial attention mechanism;
step 2.5, enhancing boundary information in the depth map through a depth edge constraint module;
D'(u, v) = D(u, v) + λ·E(u, v) ;
wherein D'(u, v) represents the depth value optimized by the depth edge constraint module, D(u, v) is the original depth value, E(u, v) is the edge detection value, and λ is a weight parameter;
Step 2.6, carrying out weighted fusion of the depth maps output by the disparity estimation module, the spatial attention module and the depth edge constraint module:
D_final = w_1·D_disp + w_2·D_att + w_3·D_edge ;
wherein D_final is the final depth map obtained after the weighting of the three modules, D_disp is the disparity-estimated depth map, D_att is the spatial-attention-enhanced depth map, D_edge is the depth-edge-enhanced depth map, and w_1, w_2, w_3 are fusion weights.
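The three-way weighted fusion of step 2.6 can be sketched directly in NumPy (illustrative sketch; the function name and the example weights are hypothetical):

```python
import numpy as np

def fuse_depth_maps(d_disp, d_att, d_edge, w=(0.5, 0.3, 0.2)):
    """Pixel-wise weighted fusion of the three module outputs; weights sum to 1."""
    w1, w2, w3 = w
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9          # fusion weights are normalised
    return w1 * d_disp + w2 * d_att + w3 * d_edge

fused = fuse_depth_maps(np.full((2, 2), 1.0),      # disparity-estimated depth
                        np.full((2, 2), 2.0),      # attention-enhanced depth
                        np.full((2, 2), 3.0))      # edge-enhanced depth
```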
2. The binocular depth estimation-based three-dimensional road scene generating method of claim 1, further comprising the steps of, in step 1:
step 1.1, configuring left and right eye depth cameras according to an internal reference matrix of a sensor to enable the cameras to work synchronously, and setting capturing frame rate and resolution to meet real-time processing requirements;
Step 1.2, collecting left and right images of a road scene through left and right vision sensors arranged on a vehicle, and preprocessing, wherein the method comprises the following steps:
Step 1.2.1, performing geometric correction on the image using the camera-calibrated internal reference matrix and distortion coefficients:
x' = x / z , y' = y / z ;
x'' = x'·(1 + k_1·r² + k_2·r⁴ + k_3·r⁶) , y'' = y'·(1 + k_1·r² + k_2·r⁴ + k_3·r⁶) ;
wherein (x, y, z) are the pixel (homogeneous) coordinates, (x', y') are the standard (normalized) coordinates, (x'', y'') are the corrected coordinates, k_1, k_2, k_3 are the distortion coefficients, and r² = x'² + y'²;
And 1.2.2, extracting a road region of interest from the corrected image, and using an edge detection algorithm to combine the set region mask, so as to keep the road surface and important edge characteristics.
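The radial correction of step 1.2.1 can be sketched in NumPy as follows (illustrative only; the function name is hypothetical, and in practice a calibrated library routine such as OpenCV's undistortion would be used):

```python
import numpy as np

def undistort_normalized(x, y, z, k1, k2, k3):
    """Normalise homogeneous coords, then apply the radial polynomial above."""
    xp, yp = x / z, y / z                              # standard (normalised) coordinates
    r2 = xp ** 2 + yp ** 2                             # r^2 = x'^2 + y'^2
    scale = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3  # 1 + k1 r^2 + k2 r^4 + k3 r^6
    return xp * scale, yp * scale

# with all coefficients zero the correction reduces to plain normalisation
xc, yc = undistort_normalized(2.0, 4.0, 2.0, 0.0, 0.0, 0.0)   # -> (1.0, 2.0)
```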
3. The three-dimensional road scene generation method based on binocular depth estimation according to claim 1, characterized in that step 2.2 further comprises the following steps:
step 2.2.1, multistage feature extraction, comprising:
extracting low-level features, namely extracting edge and texture features of the left and right images through an initial convolution layer:
F_low(u, v) = Σ_{i=1}^{w} Σ_{j=1}^{h} I(u+i, v+j) · K_low(i, j) ;
wherein I is the input image pixel value, w is the width of the convolution kernel, h is the height of the convolution kernel, K_low is the low-level convolution kernel, and F_low is the low-level feature output;
Extracting mid-level features, namely extracting geometric and texture information in the scene using a convolution module with a pooling layer, and reducing redundant data:
F_mid(u, v) = max_{i ∈ R_u, j ∈ R_v} F_low(i, j) ;
wherein R_u is the pixel coordinate range of the pooling window in the horizontal direction, R_v is the pixel coordinate range of the pooling window in the vertical direction, (i, j) indexes the pixel coordinates within the pooling window, and F_mid is the mid-level feature output;
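The pooling step can be sketched with a reshape trick in NumPy (illustrative sketch; the function name and toy feature map are hypothetical, and a 2×2 non-overlapping window is assumed):

```python
import numpy as np

def max_pool_2x2(feat):
    """Non-overlapping 2x2 max pooling over a feature map with even side lengths."""
    h, w = feat.shape
    # reshape to (h/2, 2, w/2, 2) blocks, then take the max within each block
    return feat.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

f_low = np.array([[1.0, 2.0, 5.0, 0.0],
                  [3.0, 4.0, 1.0, 2.0],
                  [0.0, 1.0, 9.0, 8.0],
                  [2.0, 2.0, 7.0, 6.0]])
f_mid = max_pool_2x2(f_low)   # -> [[4., 5.], [2., 9.]]
```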
High-level feature extraction, namely extracting scene semantic features using stacked convolution layers:
F_high^c(u, v) = σ( Σ_{c'} Σ_{(i,j)} F_prev^{c'}(u+i, v+j) · K^{c',c}(i, j) + b_c ) ;
wherein:
F_high^c is the high-level feature output on output channel c;
F_prev^{c'}(u+i, v+j) is the feature value of the previous layer's output feature map on channel c', and (i, j) is the feature location within the local window;
K^{c',c} is a convolution kernel of size k×k with input channel c' and output channel c; b_c is the bias term for output channel c; σ is the activation function, used to introduce nonlinearity;
the receptive field is gradually expanded and semantic features are captured by stacking multiple convolution layers, iterating the formula until F_high is finally obtained;
Combining the low-, mid- and high-level features by a skip connection to form hierarchical features:
F_fusion = w_low·F_low + w_mid·F_mid + w_high·F_high ;
wherein w_low, w_mid, w_high respectively represent the weights of the low-, mid- and high-level features in the fusion process.
4. A three-dimensional road scene generation method based on binocular depth estimation according to claim 3, characterized in that in step 2.4, it further comprises the steps of:
Step 2.4.1, generating multi-scale features;
Inputting the feature map F and extracting features of different scales through different convolution kernels:
F_k(u, v) = Conv_{k_s×k_s}(F)(u, v) ;
wherein Conv_{k_s×k_s} denotes a convolution operation with kernel size k_s×k_s, and F_k(u, v) denotes the k-th scale feature map generated at pixel coordinates (u, v) after the convolution with kernel size k_s×k_s;
Step 2.4.2, superimposing the feature maps of different scales according to weights:
F_fused(u, v) = Σ_{k=1}^{K} w_k · F_k(u, v) , with Σ_{k=1}^{K} w_k = 1 ;
wherein K is the number of scales, the constraint Σ w_k = 1 ensures normalization of the weights, and F_fused(u, v) denotes the final feature value at pixel coordinates (u, v) after all scale features are fused;
Step 2.4.3, superimposing the fused feature map on the input feature map to obtain the final enhanced features:
F_enh(u, v) = F(u, v) + F_fused(u, v) ;
wherein F_enh(u, v) denotes the enhanced feature-map value at pixel coordinates (u, v), integrating the original hierarchical features F(u, v) and the multi-scale fused features F_fused(u, v).
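Steps 2.4.2 and 2.4.3 together amount to a normalised weighted sum followed by a residual addition; a minimal NumPy sketch (the function name, weights and constant stand-in maps for the per-scale convolution outputs are all hypothetical):

```python
import numpy as np

def enhance(feature, scale_maps, weights):
    """Fuse multi-scale maps with normalised weights, then add back to the input."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                     # enforce sum(w_k) = 1
    fused = sum(wk * fk for wk, fk in zip(w, scale_maps))
    return feature + fused                              # residual-style enhancement

F = np.ones((2, 2))                                      # input feature map
scales = [np.full((2, 2), 2.0), np.full((2, 2), 4.0)]    # stand-ins for conv outputs
F_enh = enhance(F, scales, weights=[1.0, 1.0])           # fused = 3.0, enhanced = 4.0
```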
5. The three-dimensional road scene generating method based on binocular depth estimation according to claim 1, further comprising the steps of, in step 3:
Step 3.1, combining the depth map D(u, v) with the camera internal reference matrix to generate initial point cloud data:
X = (u − c_x)·Z / f_x , Y = (v − c_y)·Z / f_y , Z = D(u, v) ;
wherein (u, v) are the pixel coordinates, (c_x, c_y) is the optical center position of the camera, f_x and f_y are respectively the focal lengths in the horizontal and vertical directions of the camera, X and Y respectively represent the positions of the three-dimensional point in the horizontal and vertical directions of the camera coordinate system, and Z is the depth along the camera optical axis;
step 3.2, performing downsampling and denoising processing on the initial point cloud to improve data quality, wherein the method comprises the following steps:
Step 3.2.1, dividing the initial point cloud into cubic grids of size s by the voxel grid method, and representing the points of each grid by the centroid of the points within it:
c = (1/N) · Σ_{i=1}^{N} p_i ;
where N is the number of points in the grid, c is the centroid coordinate of the grid, and p_i are the coordinates of the points within the grid;
Step 3.2.2, calculating for each point the average distance d_i to its k nearest neighbors, and eliminating as outliers the points that do not satisfy:
|d_i − μ| ≤ α·σ ;
wherein μ is the mean of the average distances, σ is their standard deviation, α is a user-set threshold, and d_i is the average distance from each point to its k nearest neighbors;
Step 3.3, inputting the optimized point cloud data into the Open3D library, generating and rendering the three-dimensional road scene, adding color information to the point cloud data, and mapping depth values to colors:
color_i = (d_i − d_min) / (d_max − d_min) ;
wherein d_i is the depth value of optimized point i, and d_max and d_min are respectively the maximum and minimum depth values.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510361155.6A CN119888093B (en) | 2025-03-26 | 2025-03-26 | Binocular depth estimation-based three-dimensional road scene generation method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510361155.6A CN119888093B (en) | 2025-03-26 | 2025-03-26 | Binocular depth estimation-based three-dimensional road scene generation method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119888093A (en) | 2025-04-25 |
| CN119888093B (en) | 2025-06-24 |
Family
ID=95433064
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510361155.6A Active CN119888093B (en) | 2025-03-26 | 2025-03-26 | Binocular depth estimation-based three-dimensional road scene generation method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119888093B (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108428255A (en) * | 2018-02-10 | 2018-08-21 | 台州智必安科技有限责任公司 | A kind of real-time three-dimensional method for reconstructing based on unmanned plane |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102568026B (en) * | 2011-12-12 | 2014-01-29 | 浙江大学 | Three-dimensional enhancing realizing method for multi-viewpoint free stereo display |
| US11094137B2 (en) * | 2012-02-24 | 2021-08-17 | Matterport, Inc. | Employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications |
| CN114255279A (en) * | 2020-09-19 | 2022-03-29 | 重庆一极科技有限公司 | Binocular vision three-dimensional reconstruction method based on high-precision positioning and deep learning |
| CN112435325B (en) * | 2020-09-29 | 2022-06-07 | 北京航空航天大学 | VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method |
| CN118657888A (en) * | 2024-08-06 | 2024-09-17 | 北京航空航天大学 | A sparse view 3D reconstruction method based on depth prior information |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108428255A (en) * | 2018-02-10 | 2018-08-21 | 台州智必安科技有限责任公司 | A kind of real-time three-dimensional method for reconstructing based on unmanned plane |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119888093A (en) | 2025-04-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111563415B (en) | Binocular vision-based three-dimensional target detection system and method | |
| CN112634341B (en) | Construction method of multi-vision task collaborative depth estimation model | |
| CN114170290B (en) | Image processing method and related equipment | |
| CN110378838B (en) | Variable-view-angle image generation method and device, storage medium and electronic equipment | |
| CN109903372B (en) | Depth map super-resolution completion method and high-quality three-dimensional reconstruction method and system | |
| CN115330935A (en) | A 3D reconstruction method and system based on deep learning | |
| CN119888738B (en) | Multi-view semantic recognition method based on depth map assistance | |
| CN111539983A (en) | Moving object segmentation method and system based on depth image | |
| EP3293700A1 (en) | 3d reconstruction for vehicle | |
| CN115249269A (en) | Object detection method, computer program product, storage medium, and electronic device | |
| CN115511759A (en) | Point cloud image depth completion method based on cascade feature interaction | |
| CN115564888A (en) | Visible light multi-view image three-dimensional reconstruction method based on deep learning | |
| CN118781000B (en) | A monocular dense SLAM map construction method based on image enhancement and NeRF | |
| CN115272450A (en) | Target positioning method based on panoramic segmentation | |
| CN115908731A (en) | Double-unmanned aerial vehicle three-dimensional reconstruction method based on cloud edge cooperation | |
| CN116152442B (en) | Three-dimensional point cloud model generation method and device | |
| CN117670969A (en) | Depth estimation method, device, terminal equipment and storage medium | |
| CN110766732A (en) | Robust single-camera depth map estimation method | |
| CN119888093B (en) | Binocular depth estimation-based three-dimensional road scene generation method | |
| CN120388230A (en) | A BEV elevation estimation method and system based on binocular data | |
| CN117994777A (en) | Three-dimensional target detection method based on road side camera | |
| CN117058183A (en) | Image processing method and device based on double cameras, electronic equipment and storage medium | |
| CN114708219B (en) | Stereo matching method, device and storage medium based on slope cost aggregation | |
| Zhang | Extending deep learning based Multi-view stereo algorithms for aerial datasets | |
| CN119169552A (en) | A target detection method, device, vehicle controller and readable storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |