
CN119888093B - Binocular depth estimation-based three-dimensional road scene generation method - Google Patents


Info

Publication number
CN119888093B
Authority
CN
China
Prior art keywords
depth
module
point cloud
binocular
camera
Prior art date
Legal status
Active
Application number
CN202510361155.6A
Other languages
Chinese (zh)
Other versions
CN119888093A (en)
Inventor
陈慧勤
钱俊龙
高明裕
肖蓬勃
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202510361155.6A priority Critical patent/CN119888093B/en
Publication of CN119888093A publication Critical patent/CN119888093A/en
Application granted granted Critical
Publication of CN119888093B publication Critical patent/CN119888093B/en

Classifications

    • G06T17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06T5/60 Image enhancement or restoration using machine learning, e.g. neural networks
    • G06T5/80 Geometric correction
    • G06T7/12 Edge-based segmentation
    • G06T7/13 Edge detection
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/54 Extraction of image or video features relating to texture
    • G06V10/806 Fusion of extracted features
    • G06V10/82 Recognition using neural networks
    • G06T2207/10028 Range image; depth image; 3D point clouds
    • G06T2207/20081 Training; learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30244 Camera pose
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a three-dimensional road scene generation method based on binocular depth estimation, built on left and right depth camera sensors, a binocular depth estimation neural network model, an edge constraint road segmentation module and a point cloud data optimization module, with three-dimensional reconstruction performed through Open3D point cloud processing and rendering. Combining binocular depth estimation with an edge constraint algorithm improves three-dimensional reconstruction accuracy in complex road scenes; the efficient point cloud processing and rendering capability of Open3D provides real-time performance and visualization; and a low-cost binocular camera replaces the lidar, significantly reducing hardware cost and making large-scale application feasible.

Description

Binocular depth estimation-based three-dimensional road scene generation method
Technical Field
The invention relates to the field of automatic driving system display technology and road scene reconstruction, in particular to a three-dimensional road scene generation method based on binocular depth estimation.
Background
In the fields of autonomous driving and intelligent transportation, three-dimensional reconstruction of the road environment is a crucial research direction. Many technical solutions are currently applied in this field, but all have certain limitations. For example, patent CN110796728B acquires three-dimensional point cloud data with a lidar and reconstructs the external dimensions, structure, position and pose of the target by a greedy projection algorithm. Such conventional three-dimensional reconstruction methods rely on high-precision lidar which, despite its high accuracy, is expensive, making large-scale deployment uneconomical.
In contrast, cameras are an economical and efficient sensor that is increasingly replacing lidar in some settings. For example, patent CN116091695A uses hierarchical reinforcement learning for three-dimensional reconstruction; although it achieves high accuracy, it mainly targets single objects rather than complex road scenes, so its applicability is limited. Patent CN116091703A adopts real-time three-dimensional reconstruction based on multi-view stereo matching, building a three-dimensional model from images captured one by one with a monocular camera; although this scheme improves reconstruction accuracy to some extent, capture efficiency is low and accuracy in dynamic environments is unsatisfactory.
The binocular camera, by comparison, estimates depth from the disparity between left and right views and is a sensor scheme with relatively low cost and modest hardware requirements. However, current binocular depth estimation algorithms still face challenges such as insufficient accuracy and high computational complexity in complex road scenes. A new technical scheme is therefore needed that both reduces hardware cost and improves reconstruction accuracy and efficiency in complex scenes, overcoming the limitations of the prior art.
Disclosure of Invention
The invention provides a three-dimensional road scene generation method based on binocular depth estimation: left and right views of the road environment are acquired by a binocular camera; per-pixel depth information is generated by a depth estimation algorithm; key features of the road scene are highlighted by an edge constraint algorithm; and an efficient three-dimensional road scene is finally generated with Open3D. The invention significantly improves reconstruction accuracy and efficiency for complex road scenes, benefits autonomous driving and intelligent transportation applications, and in particular achieves real-time perception on low-cost hardware.
The three-dimensional road scene generation method based on binocular depth estimation is built on left and right depth camera sensors, a binocular depth estimation neural network model, an edge constraint road segmentation module and a point cloud data optimization module, and comprises the following steps:
Step 1, dynamically capturing road scene images through the left and right depth camera sensors;
Step 2, computing left and right depth maps through the binocular depth estimation neural network model;
Step 3, generating point cloud data from the predicted depth map and the camera internal parameter matrix.
Further, in step 1, the method further comprises the steps of:
step 1.1, configuring left and right eye depth cameras according to an internal reference matrix of a sensor to enable the cameras to work synchronously, and setting capturing frame rate and resolution to meet real-time processing requirements;
Step 1.2, collecting left and right images of a road scene through left and right vision sensors arranged on a vehicle, and preprocessing, wherein the method comprises the following steps:
Step 1.2.1, calibrating the internal reference matrix and distortion coefficients of each camera, and geometrically correcting the images:
x' = x / z, y' = y / z;
x_corrected = x' · (1 + k_1·r^2 + k_2·r^4 + k_3·r^6), y_corrected = y' · (1 + k_1·r^2 + k_2·r^4 + k_3·r^6);
wherein (x, y, z) are the pixel coordinates, (x', y') the normalized coordinates, (x_corrected, y_corrected) the corrected coordinates, k_1, k_2, k_3 the distortion coefficients, and r^2 = x'^2 + y'^2;
Step 1.2.2, extracting the road region of interest from the corrected images, using an edge detection algorithm combined with a preset region mask to preserve the road surface and important edge features.
Further, in step 2, the method further comprises the following steps:
Step 2.1, inputting the preprocessed left and right eye images into a neural network to serve as basic data of a model;
step 2.2, extracting and fusing multi-layer features of left and right eye images through an encoder;
Step 2.3, calculating an initial disparity map through the disparity estimation module:
d(x, y) = x_l - x_r;
wherein d(x, y) is the disparity value at pixel coordinates (x, y), representing the displacement difference between the left and right images at that pixel; x_l and x_r are the abscissae of corresponding points in the left and right images, respectively;
Step 2.4, improving the saliency of key scene regions through a spatial attention mechanism;
Step 2.5, enhancing boundary information in the depth map through the depth edge constraint module:
D'(x, y) = D(x, y) + λ·E(x, y);
wherein D'(x, y) is the depth value optimized by the depth edge constraint module, D(x, y) the original depth value, E(x, y) the edge detection value, and λ a weight parameter;
Step 2.6, weighted fusion of the depth maps output by the disparity estimation module, the spatial attention module and the depth edge constraint module:
D_final(x, y) = w_1·D_disp(x, y) + w_2·D_att(x, y) + w_3·D_edge(x, y);
wherein D_final is the final depth map obtained after weighting the three modules, D_disp the disparity-estimated depth map, D_att the spatial-attention-enhanced depth map, D_edge the depth-edge-enhanced depth map, and w_1, w_2, w_3 the fusion weights.
Further, in step 2.2, the method further comprises the steps of:
Step 2.2.1, multistage feature extraction, comprising:
low-level feature extraction, namely extracting edge and texture features of the left and right images through an initial convolution layer:
F_low(x, y) = Σ_i Σ_j I(x + i, y + j) · K_low(i, j);
wherein I is the input image pixel value, i and j index the width and height of the convolution kernel, K_low is the low-level convolution kernel, and F_low the low-level feature output;
mid-level feature extraction, namely extracting geometric and texture information in the scene with a convolution module followed by a pooling layer, reducing redundant data:
F_mid(x, y) = max_{(i, j) ∈ W(x, y)} F_low(i, j);
wherein W(x, y) is the pooling window, whose horizontal and vertical extents give its pixel coordinate ranges in each direction, (i, j) indexes the pixel coordinates within the window, and F_mid is the mid-level feature output;
high-level feature extraction, namely extracting scene semantic features with stacked convolution layers:
F_high^c(x, y) = σ( Σ_{c'} Σ_{(i, j)} F_prev^{c'}(x + i, y + j) · W^{c, c'}(i, j) + b_c );
wherein:
F_high^c is the high-level feature output on channel c;
F_prev^{c'} is the value of the previous layer's output feature map on channel c', at feature location (i, j) within the local window;
W^{c, c'} is a convolution kernel of size k×k, with input channels C and output channels C'; b_c is the bias term for output channel c; σ is the activation function, introducing nonlinearity;
the receptive field is gradually expanded and semantic features captured by stacking multiple convolution layers, iterating the formula until the final F_high is obtained;
Combining the low-, mid- and high-level features through skip connections to form hierarchical features:
F_fused = α·F_low + β·F_mid + γ·F_high;
wherein α, β, γ are the weights of the low-, mid- and high-level features in the fusion.
Further, in step 2.4, the method further comprises the steps of:
Step 2.4.1, generating multi-scale features;
The input feature map F_fused is passed through convolution kernels of different sizes to extract features at different scales:
F_k(x, y) = Conv_{k×k}(F_fused)(x, y);
wherein Conv_{k×k} denotes a convolution operation with a kernel of size k×k, and F_k(x, y) is the k-th scale feature map generated at pixel coordinates (x, y);
Step 2.4.2, overlapping the feature graphs with different scales according to weights:
;
wherein, the In order to be a number of dimensions,The normalization of the weights is ensured and,Expressed in pixel coordinatesThe final characteristic values after all scale characteristics are fused;
Step 2.4.3, superposing the fused feature map on the input feature map to obtain the final enhanced features:
F_enh(x, y) = F_fused(x, y) + F_ms(x, y);
wherein F_enh(x, y) is the enhanced feature map value at pixel coordinates (x, y), integrating the original hierarchical features F_fused and the multi-scale fusion features F_ms.
Further, in step 3, the method further comprises the steps of:
Step 3.1, combining the depth map D_final with the camera internal reference matrix to generate initial point cloud data:
X = (u - c_x)·Z / f_x, Y = (v - c_y)·Z / f_y, Z = D_final(u, v);
wherein (u, v) are the pixel coordinates and (c_x, c_y) the optical centre position of the camera; f_x and f_y are the focal lengths in the horizontal and vertical directions; X and Y are the positions of the three-dimensional point in the horizontal and vertical directions of the camera coordinate system, and Z its depth along the optical axis;
step 3.2, performing downsampling and denoising processing on the initial point cloud to improve data quality, wherein the method comprises the following steps:
Step 3.2.1, dividing the initial point cloud into cubic grid cells of size s by the voxel grid method, representing the points of each cell by their centroid:
c = (1/N) · Σ_{i=1}^{N} p_i;
wherein N is the number of points in the cell, c the centroid coordinates of the cell, and p_i the coordinates of the points within the cell;
Step 3.2.2, calculating the average distance d_i from each point to its k nearest neighbors, and eliminating outliers that do not satisfy the following condition:
|d_i - μ| ≤ α·σ;
wherein μ is the mean of these average distances, σ their standard deviation, α a user-set threshold, and d_i the average distance from point i to its k nearest neighbors;
Step 3.3, inputting the optimized point cloud data into the Open3D library, generating and rendering the three-dimensional road scene; color information is added to the point cloud by mapping depth values to colors:
color = (d - d_min) / (d_max - d_min);
wherein d is the depth value of a point, and d_max and d_min are the maximum and minimum depth values.
The beneficial effects achieved by the invention are as follows:
the invention improves the accuracy of three-dimensional reconstruction in complex road scenes by combining binocular depth estimation with an edge constraint algorithm;
the invention achieves real-time performance and visualization by exploiting the efficient point cloud processing and rendering capability of Open3D;
the invention replaces the lidar with a low-cost binocular camera, significantly reducing hardware cost and making large-scale application feasible.
Drawings
FIG. 1 is a flow framework of a three-dimensional road scene generation method based on binocular depth estimation;
FIG. 2 is a binocular depth estimation network model of a three-dimensional road scene generation method based on binocular depth estimation;
fig. 3 is a point cloud data processing flow of a three-dimensional road scene generation method based on binocular depth estimation.
Detailed Description
The invention will be further described with reference to specific embodiments, from which its advantages and features will become more apparent. These examples are merely illustrative and in no way limit the scope of the invention. Those skilled in the art will understand that details and forms of the technical solution may be changed or substituted without departing from the spirit and scope of the invention, and all such changes and substitutions fall within the protection scope of the invention.
As shown in fig. 1, this embodiment is a flow framework of a three-dimensional road scene generation method based on binocular depth estimation. The frame comprises the following modules:
The left-right eye depth camera sensor is used for capturing road scene images in real time and carrying an internal reference matrix of the sensor;
The binocular depth estimation neural network model is used for generating a depth map of a road scene;
the edge constraint road segmentation module optimizes the characteristics of road edges and scene boundaries;
and the point cloud data optimization module is used for carrying out down-sampling, noise reduction and rendering optimization processing based on the generated point cloud data.
The method specifically comprises the following steps:
And step 1, capturing and preprocessing left and right eye images.
The road scene picture is dynamically captured by a left eye depth camera sensor and a right eye depth camera sensor, and the method comprises the following steps:
step 1.1, initializing left and right vision sensors;
The left and right eye depth cameras are configured according to the internal reference matrix of the sensor to synchronously work, and capture frame rate and resolution are set to meet the real-time processing requirement.
Step 1.2, capturing a road scene image and preprocessing;
The method comprises the following steps of collecting left and right images of a road scene through a left and right vision sensor arranged on a vehicle, and preprocessing, wherein the method specifically comprises the following steps:
Step 1.2.1, correcting image distortion;
the images are geometrically corrected using the camera calibration parameters (internal reference matrix and distortion coefficients), with the specific formulas:
x' = x / z, y' = y / z;
x_corrected = x' · (1 + k_1·r^2 + k_2·r^4 + k_3·r^6), y_corrected = y' · (1 + k_1·r^2 + k_2·r^4 + k_3·r^6);
wherein (x, y, z) are the pixel coordinates, (x', y') the normalized coordinates, k_1, k_2, k_3 the distortion coefficients, and r^2 = x'^2 + y'^2.
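As a quick illustration, the radial correction of step 1.2.1 can be sketched in a few lines of Python. The helper name and all coefficient values below are hypothetical, and tangential distortion terms are omitted:

```python
def correct_point(x, y, z, k1, k2, k3):
    """Apply the radial polynomial of step 1.2.1 to one pixel.

    (x, y, z) -> normalized (x', y') -> corrected coordinates.
    Coefficient values passed in are illustrative, not calibrated ones.
    """
    xp, yp = x / z, y / z               # x' = x/z, y' = y/z
    r2 = xp * xp + yp * yp              # r^2 = x'^2 + y'^2
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    return xp * radial, yp * radial

xc, yc = correct_point(0.2, 0.1, 1.0, k1=0.1, k2=0.0, k3=0.0)
```

With all coefficients zero the point is returned unchanged, which is a convenient sanity check on a calibration pipeline.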
Step 1.2.2, extracting a region of interest (ROI);
The region of interest of the road is extracted from the corrected image, and the road surface and important edge features are preserved by using an edge detection algorithm (such as Canny) in combination with a set region mask.
Step 2, inputting and calculating a binocular depth estimation neural network model;
as shown in fig. 2, this embodiment illustrates the architecture of a binocular depth estimation network model, specifically including the following modules:
the parallax estimation module is used for generating an initial parallax image;
The spatial attention module is used for highlighting key areas in the scene and enhancing the accuracy of depth estimation;
and the depth edge constraint module is used for enhancing the definition of the boundary through semantic edge enhancement.
The method comprises the following specific steps:
Step 2.1, inputting left and right eye images;
and inputting the preprocessed left and right eye images into a neural network to serve as basic data of the model.
Step 2.2, feature coding;
Three types of features of the left and right eye images, namely low-level features (such as edge information), medium-level features (such as texture information) and high-level features (such as scene semantic information) are extracted through an encoder.
The method comprises the following specific steps:
2.2.1, extracting multi-stage characteristics;
(1) Low-level feature extraction, namely extracting edge and texture features of the left and right images through an initial convolution layer.
A sliding convolution is performed over the image with multiple 3×3 kernels, computed as:
F_low(x, y) = Σ_i Σ_j I(x + i, y + j) · K_low(i, j);
wherein I is the input image pixel value, K_low the low-level convolution kernel, and F_low the low-level feature output.
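A naive NumPy sketch of this sliding 3×3 convolution may make the formula concrete; the Sobel-x kernel here is only an illustrative choice of low-level edge filter, not one prescribed by the method:

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` (valid mode): out(x, y) = sum_ij I(x+i, y+j) K(i, j)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
edge_kernel = np.array([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])          # Sobel-x, illustrative only
feat = conv2d_valid(img, edge_kernel)
```

Because the toy image increases by 1 per column, the horizontal edge response is constant across all valid positions.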
(2) And extracting middle-level characteristics, namely extracting geometric and texture information in a scene by using a convolution module with a pooling layer, and reducing redundant data.
Pooling operation formula:
F_mid(x, y) = max_{(i, j) ∈ W(x, y)} F_low(i, j);
wherein (i, j) ranges over the pooling window W(x, y), and F_mid is the mid-level feature output.
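The max-pooling step can be sketched with a NumPy reshape trick (window size 2 and the sample values are illustrative):

```python
import numpy as np

def max_pool2d(feat: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling: each size x size window collapses to its maximum."""
    h, w = feat.shape
    h2, w2 = h // size, w // size
    # reshape to (rows, win, cols, win) and take the max over each window
    return feat[:h2 * size, :w2 * size].reshape(h2, size, w2, size).max(axis=(1, 3))

f = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 1., 0.],
              [2., 9., 0., 3.]])
pooled = max_pool2d(f, 2)
```

Each 2×2 block of the input is reduced to one value, which is how the pooling layer discards redundant data while keeping the strongest responses.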
(3) High-level feature extraction, namely extracting scene semantic features such as roads, barriers and the like by using stacked convolution layers.
The high-level features are expressed as:
F_high^c(x, y) = σ( Σ_{c'} Σ_{(i, j)} F_prev^{c'}(x + i, y + j) · W^{c, c'}(i, j) + b_c );
wherein: F_high^c is the high-level feature output on channel c;
F_prev^{c'} is the previous layer's output feature map on channel c';
W^{c, c'} is a convolution kernel of size k×k with input channels C and output channels C'; b_c is the convolution bias term; σ is the activation function, introducing nonlinearity.
The receptive field is gradually expanded and semantic features captured by stacking multiple convolution layers, iterating the formula until the final F_high is obtained.
2.2.2, Feature fusion;
The low-, mid- and high-level features are combined through skip connections to form hierarchical features:
F_fused = α·F_low + β·F_mid + γ·F_high;
wherein α, β, γ are the weights of the low-, mid- and high-level features in the fusion.
Step 2.3, parallax estimation;
after feature extraction and fusion is completed, this step aims at generating an initial disparity map by a disparity estimation module as a basis for depth estimation.
The disparity map reflects the displacement difference between the left and right images and is directly related to depth information; it provides the input for the subsequent depth edge optimization. The disparity estimation module computes:
d(x, y) = x_l - x_r;
wherein (x, y) are the pixel coordinates, and x_l and x_r are the abscissae of corresponding points in the left and right images, respectively.
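Disparity is tied to depth through the standard stereo relation Z = f·B/d, which the text implies but does not spell out; the focal length and baseline values below are assumptions for illustration:

```python
def disparity_to_depth(d: float, focal_px: float, baseline_m: float) -> float:
    """Standard rectified-stereo relation Z = f * B / d.

    d:          disparity in pixels (d = x_l - x_r)
    focal_px:   focal length in pixels (assumed value)
    baseline_m: distance between the two cameras in metres (assumed value)
    """
    if d <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / d

z = disparity_to_depth(d=32.0, focal_px=640.0, baseline_m=0.12)  # depth in metres
```

Note the inverse relation: halving the disparity doubles the estimated depth, which is why distant points are the hardest to estimate accurately.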
Step 2.4, enhancing the spatial attention;
To further improve depth estimation accuracy and compensate for perspective effects that disparity estimation alone cannot resolve, the method introduces a spatial attention mechanism to increase the saliency of key scene regions (such as lane lines and road edges), specifically comprising the following steps:
Step 2.4.1, generating multi-scale features;
The feature map F_fused obtained from the feature fusion of step 2.2.2 is passed through convolution kernels of different sizes to generate a multi-scale feature map:
F_k(x, y) = Conv_{k×k}(F_fused)(x, y);
wherein Conv_{k×k} denotes a convolution operation with a kernel of size k×k.
Step 2.4.2, trans-scale fusion;
The feature maps of different scales obtained in step 2.4.1 are superposed by weight, with the formula:
F_ms(x, y) = Σ_k w_k · F_k(x, y), with Σ_k w_k = 1;
wherein k indexes the scales and the constraint ensures weight normalization.
Step 2.4.3, enhancing output;
The fused feature map is superposed on the feature map F_fused to obtain the final enhanced features:
F_enh(x, y) = F_fused(x, y) + F_ms(x, y);
Step 2.5, depth edge constraint;
With the spatial attention enhancement of step 2.4 complete, this step introduces a depth edge constraint module to further improve the boundary definition of the depth map and reduce blurred regions. The module optimizes the boundary characteristics of key regions by fusing edge detection information with the depth map data, providing more accurate input for the final three-dimensional reconstruction. The depth edge constraint module optimizes via:
D'(x, y) = D(x, y) + λ·E(x, y);
wherein D(x, y) is the original depth value, E(x, y) the edge detection value, and λ a weight parameter.
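The constraint is a one-liner in NumPy; the additive form follows the formula above, while λ = 0.3 and the toy edge map are assumptions:

```python
import numpy as np

def edge_constrained_depth(depth: np.ndarray, edges: np.ndarray,
                           lam: float = 0.3) -> np.ndarray:
    """D'(x, y) = D(x, y) + lam * E(x, y): adjust depth along detected
    edges so boundaries stay sharp. lam=0.3 is illustrative only."""
    return depth + lam * edges

D = np.full((2, 2), 5.0)            # original depth values
E = np.array([[0., 1.],             # edge-detector response:
              [1., 0.]])            # 0 = flat region, 1 = strong edge
D_opt = edge_constrained_depth(D, E)
```

Only pixels with a non-zero edge response are modified, which is the intended behaviour: flat road surface keeps its original depth.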
Step 2.6, fusing depth maps;
The depth maps output by the disparity estimation module, the spatial attention module and the depth edge constraint module are weighted and fused:
D_final(x, y) = w_1·D_disp(x, y) + w_2·D_att(x, y) + w_3·D_edge(x, y);
wherein D_disp is the disparity-estimated depth map, D_att the spatial-attention-enhanced depth map, D_edge the depth-edge-enhanced depth map, and w_1, w_2, w_3 the fusion weights.
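The three-way fusion is a convex combination; the weight values (0.5, 0.3, 0.2) below are assumed for illustration, since the patent does not fix them:

```python
import numpy as np

def fuse_depth_maps(d_disp: np.ndarray, d_att: np.ndarray, d_edge: np.ndarray,
                    w=(0.5, 0.3, 0.2)) -> np.ndarray:
    """D_final = w1*D_disp + w2*D_att + w3*D_edge, with the weights
    summing to 1 so the fused depth stays in the input range."""
    assert abs(sum(w) - 1.0) < 1e-9
    return w[0] * d_disp + w[1] * d_att + w[2] * d_edge

d_final = fuse_depth_maps(np.full((2, 2), 4.0),   # disparity-estimated depth
                          np.full((2, 2), 5.0),   # attention-enhanced depth
                          np.full((2, 2), 6.0))   # edge-enhanced depth
```

Keeping the weights normalized means the fused map is always bounded by the minimum and maximum of the three inputs.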
Step 3, generating and optimizing point cloud data;
as shown in fig. 3, this embodiment illustrates the point cloud data processing flow of the method.
This step uses the optimized depth map D_final output in step 2 together with the camera internal reference matrix, and comprises the following steps:
step 3.1, generating initial point cloud;
The depth map D_final is combined with the camera internal reference matrix to generate the initial point cloud data, with the formula:
X = (u - c_x)·Z / f_x, Y = (v - c_y)·Z / f_y, Z = D_final(u, v);
wherein (u, v) are the pixel coordinates and (c_x, c_y) the position of the optical centre; f_x and f_y are the focal lengths; X and Y are the positions of the three-dimensional point in the horizontal and vertical directions of the camera coordinate system, and Z its depth along the optical axis.
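A minimal NumPy sketch of this pinhole back-projection; all intrinsic values below are illustrative, not calibrated ones:

```python
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth(u, v).

    Returns an (H*W, 3) array of camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grids
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return np.stack([X, Y, depth], axis=-1).reshape(-1, 3)

depth = np.full((2, 2), 4.0)                          # toy constant-depth map
pts = depth_to_points(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

With a constant-depth input, the resulting points lie on a plane at Z, symmetric around the optical axis — an easy way to sanity-check the intrinsics.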
Step 3.2, optimizing the point cloud;
and carrying out downsampling and denoising treatment on the initial point cloud so as to improve the data quality.
The specific implementation steps are as follows:
Step 3.2.1, voxel grid downsampling;
The initial point cloud is divided into cubic voxels of size s by the voxel grid method, and the points in each voxel are represented by their centroid, computed as:
c = (1/N) · Σ_{i=1}^{N} p_i;
where N is the number of points in the voxel, c is the centroid coordinate of the voxel, and p_i are the coordinates of the points within the voxel.
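A compact sketch of voxel-grid downsampling under the centroid formula above (the voxel-indexing scheme via `np.floor` is a standard choice, not a detail specified by the patent):

```python
import numpy as np

def voxel_downsample(points, s):
    """Replace all points falling in each cubic voxel of size s by their
    centroid c = (1/N) * sum(p_i)."""
    keys = np.floor(points / s).astype(np.int64)   # voxel index per point
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)               # accumulate per voxel
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]
```

In practice the same operation is available as `voxel_down_sample` in Open3D; the hand-rolled version here only illustrates the centroid arithmetic.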
Step 3.2.2, denoising through statistical filtering;
After downsampling by the voxel grid method, the average distance d_i from each point to its k nearest neighbors is computed, and outliers that do not satisfy the following condition are removed:
|d_i − μ| ≤ α · σ;
where μ is the mean of these distances, σ is their standard deviation, and α is a user-set threshold.
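The statistical filter can be sketched with a brute-force k-NN search, adequate for small clouds (the values of k and α are assumptions for illustration; the patent leaves them user-configurable):

```python
import numpy as np

def remove_outliers(points, k=3, alpha=1.0):
    """Keep points whose mean k-NN distance d_i satisfies
    |d_i - mu| <= alpha * sigma, with mu, sigma taken over all d_i."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)     # all pairwise distances
    np.fill_diagonal(dist, np.inf)           # exclude self-distance
    knn = np.sort(dist, axis=1)[:, :k]       # k nearest neighbours
    d = knn.mean(axis=1)
    mu, sigma = d.mean(), d.std()
    return points[np.abs(d - mu) <= alpha * sigma]
```

A tight cluster plus one far-away point shows the behavior: the isolated point's mean neighbor distance deviates far beyond α·σ and is dropped.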
Step 3.3, constructing a three-dimensional scene;
The optimized point cloud data are input into the Open3D library to generate and render the three-dimensional road scene. Color information is added to the point cloud by mapping depth values to colors according to:
color_i = (Z_i − Z_min) / (Z_max − Z_min);
where Z_i is the depth value of optimized point i, and Z_max and Z_min are the maximum and minimum depth values, respectively.
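The depth-to-color normalization is a one-liner; the guard for a constant-depth cloud is an added assumption (the patent's formula would divide by zero in that case):

```python
import numpy as np

def depth_to_color(z):
    """Normalize depth to [0, 1] for coloring:
    c_i = (Z_i - Z_min) / (Z_max - Z_min)."""
    z_min, z_max = z.min(), z.max()
    if z_max == z_min:                 # degenerate cloud: avoid 0/0
        return np.zeros_like(z, dtype=np.float64)
    return (z - z_min) / (z_max - z_min)
```

The normalized scalar could then be passed through a colormap and assigned to the cloud via Open3D's `PointCloud.colors` attribute before rendering.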
The above describes only specific steps of the present invention and does not limit its protection scope in any way; all technical solutions formed by equivalent transformation or equivalent substitution fall within the protection scope of the present invention. Techniques known to those skilled in the art are not described in detail herein.

Claims (5)

1. A three-dimensional road scene generation method based on binocular depth estimation, based on left and right binocular depth camera sensors, a binocular depth estimation neural network model, an edge constraint road segmentation module, and a point cloud data optimization module, characterized by comprising the following steps:
step 1, dynamically capturing road scene pictures through left and right eye depth camera sensors;
Step 2, calculating left and right eye depth images through a binocular depth estimation neural network model;
Step 3, generating point cloud data according to the predicted depth map and the camera internal reference matrix;
in step 2, the method further comprises the following steps:
Step 2.1, inputting the preprocessed left and right eye images into a neural network to serve as basic data of a model;
step 2.2, extracting and fusing multi-layer features of left and right eye images through an encoder;
Step 2.3, calculating an initial parallax map through a parallax estimation module;
d(x, y) = x_L − x_R;
where d(x, y) is the parallax value at pixel coordinates (x, y), representing the displacement between the left and right eye images at that pixel, and x_L and x_R are the abscissas of the corresponding points in the left and right images, respectively;
Step 2.4, improving the saliency of key scene regions through a spatial attention mechanism;
step 2.5, enhancing boundary information in the depth map through a depth edge constraint module;
D_edge(x, y) = D(x, y) + λ · E(x, y);
where D_edge(x, y) is the depth value after optimization by the depth edge constraint module, D(x, y) is the original depth value, E(x, y) is the edge detection value, and λ is a weight parameter;
step 2.6, carrying out weighted fusion on the depth map output by the parallax estimation module, the spatial attention module and the depth edge constraint module;
the weighted fusion is given by:
D_final(x, y) = w1 · D_disp(x, y) + w2 · D_att(x, y) + w3 · D_edge(x, y);
where D_final is the final depth map obtained after the weighting of the three modules, D_disp is the parallax-estimated depth map, D_att is the spatial-attention-enhanced depth map, D_edge is the depth-edge-enhanced depth map, and w1, w2, w3 are fusion weights.
2. The binocular depth estimation-based three-dimensional road scene generating method of claim 1, further comprising the steps of, in step 1:
step 1.1, configuring left and right eye depth cameras according to an internal reference matrix of a sensor to enable the cameras to work synchronously, and setting capturing frame rate and resolution to meet real-time processing requirements;
Step 1.2, collecting left and right images of a road scene through left and right vision sensors arranged on a vehicle, and preprocessing, wherein the method comprises the following steps:
Step 1.2.1, calibrating the camera intrinsic matrix and distortion coefficients, and performing geometric correction on the image:
x' = x / z,  y' = y / z;
x'' = x' · (1 + k1·r² + k2·r⁴ + k3·r⁶),  y'' = y' · (1 + k1·r² + k2·r⁴ + k3·r⁶);
where (x, y, z) are pixel coordinates, (x', y') are the standard (normalized) coordinates, (x'', y'') are the corrected coordinates, k1, k2, k3 are the distortion coefficients, and r² = x'² + y'²;
Step 1.2.2, extracting the road region of interest from the corrected image using an edge detection algorithm combined with a preset region mask, retaining the road surface and important edge features.
3. The three-dimensional road scene generation method based on binocular depth estimation according to claim 1, characterized in that step 2.2 further comprises the following steps:
step 2.2.1, multistage feature extraction, comprising:
extracting low-level characteristics, namely extracting edge and texture characteristics of left and right eye images through an initial convolution layer:
F_low(i, j) = Σ_{m=1}^{w} Σ_{n=1}^{h} K_low(m, n) · I(i + m, j + n);
where I(i + m, j + n) are the input image pixel values, w is the convolution kernel width, h is the convolution kernel height, K_low is the low-level convolution kernel, and F_low is the low-level feature output;
Extracting middle-level characteristics, namely extracting geometric and texture information in a scene by using a convolution module with a pooling layer, and reducing redundant data:
F_mid(i, j) = pool_{(m, n) ∈ R_i × R_j} F_low(m, n);
where R_i is the pixel coordinate range of the pooling window in the horizontal direction, R_j is the range in the vertical direction, (m, n) indexes the pixel coordinates within the pooling window, and F_mid is the mid-level feature output;
High-level feature extraction, namely extracting scene semantic features by using stacked convolution layers:
F_high^(l)(i, j, c) = σ( Σ_{c'} Σ_{m,n} W^(l)(m, n, c', c) · F^(l−1)(i + m, j + n, c') + b_c );
wherein:
F_high^(l) is the high-level feature output;
F^(l−1) is the feature map output by the previous layer, with feature values at channel c' and feature locations (i + m, j + n) within the local window;
W^(l) is a convolution kernel of size k × k with input channel c' and output channel c; b_c is the bias term for output channel c; σ is the activation function, introducing nonlinearity;
By stacking multiple convolution layers, the receptive field is gradually expanded and semantic features are captured; the formula is iterated until the final high-level features F_high are obtained;
Combining the low, medium and high level features by a jump connection to form a hierarchical feature:
F_fusion = α · F_low + β · F_mid + γ · F_high;
where α, β, γ respectively represent the weights of the low-, mid-, and high-level features in the fusion process.
4. A three-dimensional road scene generation method based on binocular depth estimation according to claim 3, characterized in that in step 2.4, it further comprises the steps of:
Step 2.4.1, generating multi-scale features;
The feature map F_fusion is input, and features of different scales are extracted through convolution kernels of different sizes:
F_k(x, y) = Conv_{s_k × s_k}(F_fusion)(x, y);
where Conv_{s_k × s_k} denotes a convolution operation with kernel size s_k × s_k, and F_k(x, y) is the k-th scale feature map generated at pixel coordinates (x, y) after that convolution;
Step 2.4.2, overlapping the feature graphs with different scales according to weights:
F_fused(x, y) = Σ_{k=1}^{K} w_k · F_k(x, y);
where K is the number of scales, the weights satisfy Σ_{k=1}^{K} w_k = 1 to ensure normalization, and F_fused(x, y) is the final feature value at pixel coordinates (x, y) after fusing all scale features;
Step 2.4.3, superposing the fused feature map on the input feature map to obtain the final enhanced features:
F_enhanced(x, y) = F_fusion(x, y) + F_fused(x, y);
where F_enhanced(x, y) is the enhanced feature map value at pixel coordinates (x, y), integrating the original hierarchical features F_fusion(x, y) and the multi-scale fused features F_fused(x, y).
5. The three-dimensional road scene generating method based on binocular depth estimation according to claim 1, further comprising the steps of, in step 3:
Step 3.1, combining the depth map D_final with the camera intrinsic matrix to generate initial point cloud data:
X = (u − c_x) · Z / f_x,  Y = (v − c_y) · Z / f_y,  Z = D_final(u, v);
where (u, v) are pixel coordinates and (c_x, c_y) is the camera optical center position; f_x and f_y are the focal lengths in the horizontal and vertical directions, respectively; X and Y represent the position of the three-dimensional point along the horizontal and vertical axes of the camera coordinate system, and Z is the depth along the camera's optical axis;
step 3.2, performing downsampling and denoising processing on the initial point cloud to improve data quality, wherein the method comprises the following steps:
Step 3.2.1, dividing the initial point cloud into cubic voxels of size s by the voxel grid method, and representing the points in each voxel by their centroid:
c = (1/N) · Σ_{i=1}^{N} p_i;
where N is the number of points in the voxel, c is the centroid coordinate of the voxel, and p_i are the coordinates of the points within the voxel;
Step 3.2.2, computing the average distance d_i from each point to its k nearest neighbors, and removing outliers that do not satisfy the following condition:
|d_i − μ| ≤ α · σ;
where μ is the mean distance, σ is the standard deviation, α is a user-set threshold, and d_i is the average distance from each point to its k nearest neighbors;
Step 3.3, inputting the optimized point cloud data into the Open3D library, generating and rendering the three-dimensional road scene, and adding color information to the point cloud by mapping depth values to colors:
color_i = (Z_i − Z_min) / (Z_max − Z_min);
where Z_i is the depth value of optimized point i, and Z_max and Z_min are the maximum and minimum depth values, respectively.
CN202510361155.6A 2025-03-26 2025-03-26 Binocular depth estimation-based three-dimensional road scene generation method Active CN119888093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510361155.6A CN119888093B (en) 2025-03-26 2025-03-26 Binocular depth estimation-based three-dimensional road scene generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510361155.6A CN119888093B (en) 2025-03-26 2025-03-26 Binocular depth estimation-based three-dimensional road scene generation method

Publications (2)

Publication Number Publication Date
CN119888093A CN119888093A (en) 2025-04-25
CN119888093B true CN119888093B (en) 2025-06-24

Family

ID=95433064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510361155.6A Active CN119888093B (en) 2025-03-26 2025-03-26 Binocular depth estimation-based three-dimensional road scene generation method

Country Status (1)

Country Link
CN (1) CN119888093B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428255A (en) * 2018-02-10 2018-08-21 台州智必安科技有限责任公司 A kind of real-time three-dimensional method for reconstructing based on unmanned plane

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN102568026B (en) * 2011-12-12 2014-01-29 浙江大学 Three-dimensional enhancing realizing method for multi-viewpoint free stereo display
US11094137B2 (en) * 2012-02-24 2021-08-17 Matterport, Inc. Employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications
CN114255279A (en) * 2020-09-19 2022-03-29 重庆一极科技有限公司 Binocular vision three-dimensional reconstruction method based on high-precision positioning and deep learning
CN112435325B (en) * 2020-09-29 2022-06-07 北京航空航天大学 VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN118657888A (en) * 2024-08-06 2024-09-17 北京航空航天大学 A sparse view 3D reconstruction method based on depth prior information

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN108428255A (en) * 2018-02-10 2018-08-21 台州智必安科技有限责任公司 A kind of real-time three-dimensional method for reconstructing based on unmanned plane

Also Published As

Publication number Publication date
CN119888093A (en) 2025-04-25

Similar Documents

Publication Publication Date Title
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN112634341B (en) Construction method of multi-vision task collaborative depth estimation model
CN114170290B (en) Image processing method and related equipment
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
CN109903372B (en) Depth map super-resolution completion method and high-quality three-dimensional reconstruction method and system
CN115330935A (en) A 3D reconstruction method and system based on deep learning
CN119888738B (en) Multi-view semantic recognition method based on depth map assistance
CN111539983A (en) Moving object segmentation method and system based on depth image
EP3293700A1 (en) 3d reconstruction for vehicle
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN115564888A (en) Visible light multi-view image three-dimensional reconstruction method based on deep learning
CN118781000B (en) A monocular dense SLAM map construction method based on image enhancement and NeRF
CN115272450A (en) Target positioning method based on panoramic segmentation
CN115908731A (en) Double-unmanned aerial vehicle three-dimensional reconstruction method based on cloud edge cooperation
CN116152442B (en) Three-dimensional point cloud model generation method and device
CN117670969A (en) Depth estimation method, device, terminal equipment and storage medium
CN110766732A (en) Robust single-camera depth map estimation method
CN119888093B (en) Binocular depth estimation-based three-dimensional road scene generation method
CN120388230A (en) A BEV elevation estimation method and system based on binocular data
CN117994777A (en) Three-dimensional target detection method based on road side camera
CN117058183A (en) Image processing method and device based on double cameras, electronic equipment and storage medium
CN114708219B (en) Stereo matching method, device and storage medium based on slope cost aggregation
Zhang Extending deep learning based Multi-view stereo algorithms for aerial datasets
CN119169552A (en) A target detection method, device, vehicle controller and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant