
CN118736009B - A visual odometry method and system based on image depth prediction and monocular geometry - Google Patents


Info

Publication number
CN118736009B
CN118736009B (granted from application CN202410853788.4A)
Authority
CN
China
Prior art keywords
image
depth
monocular
point
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410853788.4A
Other languages
Chinese (zh)
Other versions
CN118736009A (en)
Inventor
陈建军
李壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology
Priority claimed from CN202410853788.4A
Publication of CN118736009A
Application granted
Publication of CN118736009B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract


The present invention discloses a visual odometry method and system based on image depth prediction and monocular geometry. The method comprises: inputting two consecutive image frames, detecting and describing local image features with the scale-invariant feature transform (SIFT) algorithm, and then matching corresponding feature point pairs between the two frames with the FLANN algorithm; solving the essential matrix from the epipolar geometry constraint to obtain the relative pose transformation of the camera; constructing and training a monocular depth prediction model to predict dense depth information for each input frame; if the number of valid depth-information pairs formed by the two consecutive frames exceeds a given threshold, estimating the scale factor by triangulation to obtain the corrected relative pose transformation; otherwise, solving the absolute pose transformation by combining the perspective-n-point projection algorithm, the RANSAC algorithm, and local nonlinear optimization. The invention effectively integrates the advantages of deep learning and traditional geometric methods, adapts to dynamic environments, and improves the robustness and accuracy of monocular visual odometry.

Description

Visual odometry method and system based on image depth prediction and monocular geometry
Technical Field
The invention belongs to the field of computer vision and robot navigation, and in particular relates to a visual odometry method and system based on image depth prediction and monocular geometry.
Background
Visual odometry estimates the motion trajectory of a robot from an input image sequence and is one of the core modules of a simultaneous localization and mapping (SLAM) system. Traditional geometric visual odometry based on feature-point matching achieves high accuracy in controlled environments, but fails easily in complex scenes with poor texture, dynamic objects, or illumination changes. In recent years, end-to-end deep-learning visual odometry has made some progress, but it often lacks geometric constraints, and its accuracy and robustness remain unsatisfactory. Existing hybrid visual odometry methods integrate classical geometric models with deep-learning frameworks, but still need improvement when handling complex scenes such as highly dynamic or low-texture environments.
Disclosure of Invention
The invention aims to provide a visual odometry method and system based on image depth prediction and monocular geometry that overcome the defects of the prior art and improve the robustness and accuracy of monocular visual odometry in dynamic environments.
The technical scheme is as follows: the visual odometry method based on image depth prediction and monocular geometry comprises the following steps:
(1) Input two consecutive image frames, detect and describe local image features with the scale-invariant feature transform algorithm, and then match corresponding feature point pairs (F_i, F_j) between the two frames using the FLANN algorithm;
(2) For the feature matches obtained in step (1), solve the essential matrix E using the epipolar geometry constraint to obtain the relative pose transformation [R, t] of the camera, where R is the rotation matrix and t the translation vector;
(3) Construct and train a monocular depth prediction model to predict dense depth information for each input frame;
(4) If the number of valid depth-information pairs formed by the two consecutive frames is greater than a given threshold, estimate the scale factor s by triangulation to obtain the corrected relative pose transformation [R, st];
(5) Otherwise, solve the absolute pose transformation [R, t] by combining the perspective-n-point projection algorithm, the RANSAC algorithm, and local nonlinear optimization.
Further, step (1) is implemented as follows:
Candidate keypoints are detected as extrema of the difference-of-Gaussians response, comparing pixel values across scale spaces:
D(x, y, σ) = (G(x, y, kσ) - G(x, y, σ)) * I(x, y)
where x, y are the spatial coordinates of the image, I(x, y) is the pixel value of the input image at (x, y), k is the multiplicative factor between adjacent scales, and G(x, y, σ) is the Gaussian kernel at scale σ, used to generate the scale spaces of the input image I(x, y);
A three-dimensional quadratic function is fitted around each candidate keypoint by spline interpolation; the sub-pixel extremum offset of the fitted function is
x̂ = -(∂²D/∂x²)⁻¹ (∂D/∂x)
and the interpolated response at the extremum is
D(x̂) = D + (1/2)(∂D/∂x)ᵀ x̂
Low-contrast points, i.e. those whose |D(x̂)| falls below a threshold, are discarded. Next, the principal direction of each keypoint is obtained by accumulating a histogram of local gradient orientations, giving the feature rotation invariance. Finally, a keypoint descriptor is generated: gradient orientation histograms are computed in the neighborhood centered on each keypoint, forming a 128-dimensional vector as the feature representation of the keypoint. After the feature vectors of the two images are obtained, corresponding feature point pairs (F_i, F_j) between the image frames are matched using the FLANN algorithm.
Further, step (2) is implemented via the epipolar constraint
(K⁻¹F_j)ᵀ E (K⁻¹F_i) = 0,  E = [t]_× R
where E is the essential matrix and K is the known camera intrinsic matrix; the relative pose transformation [R, t] of the camera is obtained by decomposing E.
Further, step (3) is implemented as follows:
Given the predicted depth D_t and the camera pose matrix T_{t+1→t}, an image I_{t+1→t} is reconstructed:
I_{t+1→t} = w(I_{t+1}, K T_{t+1→t} D_t[x] K⁻¹ x)
where w(·) is a differentiable image-warping function, x denotes the pixel coordinates of a point in the image, and D_t[x] is its predicted depth;
Using the generated I_{t+1→t} and the reference image I_t, the following objective function is constructed:
pe(I_{t+1→t}, I_t) = α SSIM(I_{t+1→t}, I_t) + (1 - α) ||I_{t+1→t} - I_t||₁
where SSIM is the structural similarity index and α is a weight balancing the SSIM loss and the L1 loss;
For the reference view I_t, images I_{t+1→t} and I_{t-1→t} are reconstructed from its neighboring views I_{t-1} and I_{t+1}, and only the photometric error of the reference/synthesized pixel pair with the minimum error is used:
L_p = min(pe(I_t, I_{t-1→t}), pe(I_t, I_{t+1→t}))
An edge-aware depth smoothing term L_s is also introduced:
L_s = |∂_x D_t| e^{-|∂_x I_t|} + |∂_y D_t| e^{-|∂_y I_t|}
where |∂_x D_t| and |∂_y D_t| are the gradient magnitudes of the depth map D_t in the horizontal and vertical directions, encouraging the depth values of adjacent pixels to be close, and e^{-|∂_x I_t|} and e^{-|∂_y I_t|} are weight terms based on the image gradient of the current view: in edge regions with large image gradient the weight is small, allowing discontinuities in the depth map, while in flat regions with small image gradient the weight is large, keeping the depth map smooth. The final loss function is:
L=Lp+λLs
The monocular depth prediction model is implemented and trained using PyTorch.
Further, step (4) is implemented as follows:
The corresponding feature point pairs (F_i, F_j) of two given consecutive image frames I_i and I_j are normalized to obtain x_i and x_j, and the corresponding three-dimensional point cloud X_j of the second view is reconstructed by triangulating (x_i, x_j) under the relative pose [R, t]; the triangulated depth is read off as
D̃_j = e₃ᵀ X_j
where e₃ is the vector [0, 0, 1]ᵀ;
Projecting X_j back to the image plane yields the triangulation-based depth information D̃_j. The depth information D_j of image I_j is then estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_j to obtain the processed depth data D̂_j. Matching pairs between D̃_j and D̂_j are collected; if the number of depth-information pairs is greater than the given threshold, the depth ratio r_i between corresponding predicted and triangulated depth values is computed, and the RANSAC algorithm is introduced for robust fitting on the noisy data:
|r_i - s| ≤ ε₁
where ε₁ is the residual threshold and s is the scale factor.
Further, step (5) is implemented as follows:
The depth information D_i of the current view I_i is estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_i to obtain the processed depth data D̂_i. The camera intrinsics K are then used to compute, by back-projection, the three-dimensional point cloud coordinates X_i corresponding to view I_i. From the correspondences between all N pairs of three-dimensional space points and two-dimensional pixel points, an initial pose solution R₀, t₀ is computed by RANSAC sampling of minimal inlier sets:
{R₀, t₀} = argmin_{R,t} Σ_i ρ(||F_j - π(R, t, X_i)||²)
where ρ is a robust kernel function, F_j is the corresponding feature point in the reference view I_j, and π is the projection function. Starting from the initial solution R₀, t₀, the pose solution {R*, t*} is refined by local nonlinear optimization:
{R*, t*} = argmin_{R,t} Σ_{i∈I} ||F_j - π(R, t, X_i)||²
All N reprojection errors are computed with the optimized pose, and the inlier set is determined:
I = { i : ||F_j - π(R*, t*, X_i)|| < ε₂ }
where ε₂ is the inlier threshold. The above process is repeated until the maximum number of iterations is reached or the inlier count no longer increases, yielding the optimal solution [R, t].
The visual odometry system based on image depth prediction and monocular geometry according to the invention comprises:
a feature detection and matching module, for detecting keypoints in the input images and matching corresponding feature points between consecutive image frames;
a 2D pose solver, for estimating the 2D pose transformation of the camera from the feature point correspondences;
a monocular depth prediction network, for predicting a dense depth map for each input frame;
a scale factor correction module, for correcting the scale factor of the 2D pose transformation using the depth information output by the depth prediction network;
and a 3D pose solver, for directly solving the 3D pose of the camera from the correspondences between the 3D coordinates and 2D pixel coordinates of the feature point pairs when the scale factor correction module cannot operate.
The invention combines the advantages of traditional geometric methods with recent deep-learning techniques, addressing key challenges of the field such as scale estimation and adaptation to dynamic environments, and shows excellent performance in practice. First, by combining classical feature detection with geometric constraints, the method resists the influence of complex scenes such as texture-poor regions, dynamic objects, and illumination changes, while a 2D pose-solving strategy based on optical-flow magnitude further improves adaptability to dynamic scenes. Second, the proposed scale factor correction method exploits the dense depth output of the depth prediction network and, via the RANSAC algorithm, stably estimates the inherent scale of monocular visual odometry, avoiding long-term trajectory drift. Finally, the monocular depth prediction network is trained in a self-supervised manner and does not rely on additional ground-truth annotations; even when trained only on a conventional dataset, it generalizes well and runs stably in extreme scenes.
Drawings
FIG. 1 is a schematic view of the overall framework of the visual odometry system of the invention;
FIG. 2 is a schematic diagram of the calculation of the fly-out mask of the present invention;
FIG. 3 compares experimental results when different sequences of the KITTI dataset are used as validation data; panels (a) to (k) correspond to sequences 00 to 10, respectively.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, the present invention proposes a visual odometry method based on image depth prediction and monocular geometry, comprising the following steps:
step1, running a feature detection and matching module to match corresponding feature points between continuous image frames, wherein the specific steps are as follows:
Two consecutive image frames are input, local features of the images are detected and described using the scale-invariant feature transform algorithm, and the FLANN algorithm is then used to match corresponding feature point pairs (F_i, F_j) between the two images.
The purpose of scale-space extremum detection is to find keypoints across different scale spaces. Using difference-of-Gaussians detection, pixel values are compared across scale spaces with the following formula, and extrema are taken as candidate keypoints:
D(x,y,σ)=(G(x,y,kσ)-G(x,y,σ))*I(x,y)
A three-dimensional quadratic function is fitted around each candidate keypoint by spline interpolation to locate the true extremum (keypoint) of the fitted function; the sub-pixel extremum offset is
x̂ = -(∂²D/∂x²)⁻¹ (∂D/∂x)
and the interpolated response at the extremum is
D(x̂) = D + (1/2)(∂D/∂x)ᵀ x̂
Points with low contrast, i.e. those whose |D(x̂)| falls below a threshold, are removed. When determining the principal direction of each keypoint, the direction is obtained from a histogram of local gradient orientations so that the feature is rotation-invariant. Finally, a keypoint descriptor is generated: gradient orientation histograms are computed in the neighborhood centered on each keypoint, forming a 128-dimensional vector as the feature representation of the keypoint. After the feature vectors of the two images are obtained, corresponding feature point pairs (F_i, F_j) between the image frames are matched using the FLANN algorithm.
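The scale-space extremum search above can be sketched with plain NumPy. This is a minimal illustration of the DoG response D(x, y, σ) and the 3×3×3 extremum test with low-contrast rejection only — not the full SIFT pipeline (no sub-pixel refinement, orientation assignment, or descriptors) — and the synthetic blob image, scale settings, and threshold are illustrative assumptions:

```python
import numpy as np

def gaussian_blur(img, sigma):
    # Separable Gaussian filter built from first principles (no image library needed).
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kern = np.exp(-x**2 / (2.0 * sigma**2))
    kern /= kern.sum()
    pad = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, pad, kern, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kern, mode="valid")

def dog_candidates(img, sigma=1.6, k=2**0.5, n_scales=5, thresh=0.03):
    # D(x, y, sigma) = (G(x, y, k*sigma) - G(x, y, sigma)) * I(x, y), stacked over
    # scales; candidates are 3x3x3 extrema whose |response| clears the
    # low-contrast rejection threshold.
    blurred = [gaussian_blur(img, sigma * k**i) for i in range(n_scales)]
    dogs = np.stack([blurred[i + 1] - blurred[i] for i in range(n_scales - 1)])
    cands = []
    for s in range(1, dogs.shape[0] - 1):
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                v = dogs[s, y, x]
                patch = dogs[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                if abs(v) >= thresh and (v == patch.max() or v == patch.min()):
                    cands.append((x, y, s))
    return cands

# Demo on a synthetic Gaussian blob centred at (20, 20): the detector should
# fire at (or next to) the blob centre at some interior scale.
yy, xx = np.mgrid[0:40, 0:40]
blob = np.exp(-((xx - 20.0)**2 + (yy - 20.0)**2) / (2.0 * 4.0**2))
kps = dog_candidates(blob)
```

In a real pipeline these candidates would then go through the sub-pixel fit, contrast and edge tests, orientation assignment, and descriptor extraction described above.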
Step 2: the 2D pose solver solves the essential matrix E from the epipolar geometry constraint, yielding the relative pose transformation [R, t] of the camera, which satisfies:
(K⁻¹F_j)ᵀ E (K⁻¹F_i) = 0,  E = [t]_× R
where R is the rotation matrix, t is the translation vector, E is the essential matrix, K is the known camera intrinsic matrix, and (F_i, F_j) are the matched feature points obtained in step 1. The relative pose transformation [R, t] of the camera is then found by decomposing E. The solver monitors the optical-flow magnitude between consecutive frames in real time and decomposes the essential matrix into a pose solution only when the magnitude is large enough, which avoids interference from small-scale dynamic points.
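The decomposition of E referred to above is conventionally done via the SVD factorization E = U diag(1, 1, 0) Vᵀ, which yields two candidate rotations and a translation direction known only up to sign; the correct candidate is then normally picked by a positive-depth (cheirality) check. A NumPy sketch with a synthetic pose (not the patent's data) is:

```python
import numpy as np

def skew(v):
    # Cross-product matrix [v]x, so that skew(v) @ w == np.cross(v, w).
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def decompose_essential(E):
    # Standard SVD factorization: four (R, t) candidates; t only up to sign/scale.
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:      # force proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                   # left null vector of E = translation direction
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Synthetic ground truth: rotation about the y-axis plus a unit translation.
theta = 0.3
R_true = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
t_true = np.array([1.0, 0.5, 0.2])
t_true /= np.linalg.norm(t_true)
E = skew(t_true) @ R_true
candidates = decompose_essential(E)
```

The true pose is recovered as one of the four candidates; disambiguation by triangulated point depth is omitted here for brevity.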
Step 3: a monocular depth prediction network is constructed to predict a dense depth map for each input frame. First the monocular depth model is trained: given the predicted depth D_t and the camera pose matrix T_{t+1→t}, an image I_{t+1→t} is reconstructed:
I_{t+1→t} = w(I_{t+1}, K T_{t+1→t} D_t[x] K⁻¹ x)
where w(·) is a differentiable image-warping function, x denotes the pixel coordinates of a point in the image, and D_t[x] is its predicted depth.
Using the generated I_{t+1→t} and the reference image I_t, the following objective function can be constructed:
pe(I_{t+1→t}, I_t) = α SSIM(I_{t+1→t}, I_t) + (1 - α) ||I_{t+1→t} - I_t||₁
where SSIM is the structural similarity index, a measure of the similarity of two images, and α = 0.85 is the weight balancing the SSIM loss and the L1 loss.
Specifically, as shown in fig. 2, for the reference view I_t, images I_{t+1→t} and I_{t-1→t} are reconstructed from its neighboring views I_{t+1} and I_{t-1}. Rather than averaging the photometric errors between the reference pixels and the synthesized pixels over multiple views, the invention only uses the photometric error of the reference/synthesized pixel pair with the minimum error:
Lp=min(pe(It,It-1→t),pe(It,It+1→t))
Finally, an edge-aware depth smoothing term L_s is introduced:
L_s = |∂_x D_t| e^{-|∂_x I_t|} + |∂_y D_t| e^{-|∂_y I_t|}
where |∂_x D_t| and |∂_y D_t| are the gradient magnitudes of the depth map D_t in the horizontal and vertical directions, encouraging the depth values of adjacent pixels to be close, and e^{-|∂_x I_t|} and e^{-|∂_y I_t|} are weight terms based on the image gradient of the current view. Where the image gradient is large (e.g. edge regions) the weight is small, allowing discontinuities in the depth map, while in flat regions with small image gradient the weight is large, forcing the depth map to remain smooth. This smoothing term is introduced because the predicted depth map should be allowed to break at object edges while remaining smooth inside objects, matching the structure of real scenes. This penalty term enhances the edge preservation and smoothness of the depth prediction, producing more reasonable depth estimates.
The final loss function is L = L_p + λL_s. The depth prediction model is implemented and trained using PyTorch. During training, the Adam optimizer is used for 15 epochs in total, with the learning rate set to 0.0001 and the weight λ of the edge-aware depth smoothing regularizer set to 0.001. Once the monocular depth model is trained, a monocular image fed into the model yields dense depth information for that image.
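The behavior of the two loss components can be illustrated with a NumPy sketch. The patent's full model is trained in PyTorch with an SSIM term; here, as a stated simplification, the photometric part is reduced to its L1 component, and the arrays are toy examples:

```python
import numpy as np

def min_photometric_l1(ref, warped_prev, warped_next):
    # L_p analogue: per-pixel minimum of the photometric errors against the two
    # synthesized views (L1 only; the full pe(...) also mixes in SSIM).
    e_prev = np.abs(ref - warped_prev)
    e_next = np.abs(ref - warped_next)
    return np.minimum(e_prev, e_next).mean()

def edge_aware_smoothness(depth, image):
    # L_s = |dx D| * exp(-|dx I|) + |dy D| * exp(-|dy I|), averaged over pixels:
    # depth gradients are penalised weakly across image edges, strongly in flat areas.
    dx_d = np.abs(np.diff(depth, axis=1))
    dy_d = np.abs(np.diff(depth, axis=0))
    dx_i = np.abs(np.diff(image, axis=1))
    dy_i = np.abs(np.diff(image, axis=0))
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()

# Toy example: a depth step that coincides with an image edge is cheap,
# while the same depth step inside a flat image region is expensive.
depth_step = np.zeros((8, 8)); depth_step[:, 4:] = 1.0
img_edge = np.zeros((8, 8)); img_edge[:, 4:] = 5.0
img_flat = np.zeros((8, 8))
loss_aligned = edge_aware_smoothness(depth_step, img_edge)
loss_flat = edge_aware_smoothness(depth_step, img_flat)
```

This matches the intent described above: depth discontinuities are tolerated exactly where the image itself has edges.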
Step 4: based on the result of step 2, the scale factor correction module is run; using the dense depth information of the images, the scale factor s is estimated by triangulation, yielding the corrected relative pose transformation [R, st]. The specific steps are as follows:
The corresponding feature point pairs (F_i, F_j) of the two given consecutive image frames I_i and I_j are normalized to obtain x_i and x_j, and the corresponding three-dimensional point cloud X_j of the second view is reconstructed by triangulating (x_i, x_j) under the relative pose [R, t]; the triangulated depth is read off as
D̃_j = e₃ᵀ X_j
where e₃ is the vector [0, 0, 1]ᵀ. Projecting X_j back to the image plane yields the triangulation-based depth information D̃_j. The depth information D_j of image I_j is then estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_j to obtain the processed depth data D̂_j. Matching pairs between D̃_j and D̂_j are collected; if the number of depth-information pairs is greater than the given threshold, the depth ratio r_i between corresponding predicted and triangulated depth values is computed, and the RANSAC algorithm is introduced for robust fitting on the noisy data:
|r_i - s| ≤ ε₁
where ε₁ is a suitable residual threshold. In this way the scale factor estimate s of the visual odometry is obtained, and the scale-corrected relative pose transformation [R, st] follows from the result of step 2.
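The robust fit |r_i - s| ≤ ε₁ can be sketched as a one-point RANSAC over depth ratios followed by a refit on the consensus set. The ratio direction (predicted over triangulated), the thresholds, and the synthetic data below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def estimate_scale(tri_depth, pred_depth, eps=0.05, iters=200, seed=0):
    # One-point RANSAC on the ratios r_i = pred/tri: pick the hypothesis s that
    # maximises |{i : |r_i - s| <= eps}|, then refit s as the inlier mean.
    rng = np.random.default_rng(seed)
    ratios = pred_depth / tri_depth
    best = np.zeros(len(ratios), dtype=bool)
    for _ in range(iters):
        s = ratios[rng.integers(len(ratios))]   # minimal sample: a single ratio
        inliers = np.abs(ratios - s) <= eps
        if inliers.sum() > best.sum():
            best = inliers
    return ratios[best].mean(), best

# Synthetic check: true scale 2.0, mild multiplicative noise, 20% gross outliers.
rng = np.random.default_rng(1)
tri = rng.uniform(1.0, 10.0, 100)
pred = 2.0 * tri * (1.0 + rng.normal(0.0, 0.005, 100))
pred[:20] *= 3.0                                # corrupted correspondences
s_hat, inliers = estimate_scale(tri, pred)
```

The gross outliers form their own (smaller) consensus set, so the clean majority wins and the recovered scale stays close to the true value despite 20% contamination.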
Step 5: the 3D pose solver is run, and the absolute pose transformation [R, t] is solved accurately by combining the perspective-n-point projection algorithm, the RANSAC algorithm, and local nonlinear optimization.
First, the depth information D_i of the current view I_i is estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_i to obtain the processed depth data D̂_i. The camera intrinsics K are then used to compute, by back-projection, the three-dimensional point cloud coordinates X_i corresponding to view I_i. Next, from the correspondences between all N pairs of three-dimensional space points and two-dimensional pixel points, RANSAC sampling of minimal inlier sets is used to compute the initial pose solution R₀, t₀:
{R₀, t₀} = argmin_{R,t} Σ_i ρ(||F_j - π(R, t, X_i)||²)
where ρ is a robust kernel function, F_j is the two-dimensional pixel coordinate in the reference view I_j, X_i is the three-dimensional space point of the current view I_i, and π is the projection function. Starting from the initial solution R₀, t₀, the pose solution {R*, t*} is refined by local nonlinear optimization:
{R*, t*} = argmin_{R,t} Σ_{i∈I} ||F_j - π(R, t, X_i)||²
All N reprojection errors are computed with the optimized pose, and the inlier set is determined:
I = { i : ||F_j - π(R*, t*, X_i)|| < ε₂ }
where ε₂ is the inlier threshold. The above process is repeated until the maximum number of iterations is reached or the inlier count no longer increases, yielding the optimal solution [R, t].
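The reprojection-error gate I = {i : ||F_j - π(R*, t*, X_i)|| < ε₂} can be sketched as follows. The pinhole projection π, the intrinsics, and the synthetic correspondences are illustrative assumptions; the surrounding PnP/RANSAC and nonlinear-refinement machinery is omitted:

```python
import numpy as np

def project(R, t, X, K):
    # pi(R, t, X): rigid transform into the camera frame, then pinhole projection.
    Xc = X @ R.T + t                 # per-row: R @ X_i + t
    uv = Xc @ K.T                    # per-row: K @ Xc_i
    return uv[:, :2] / uv[:, 2:3]    # perspective division

def inlier_set(R, t, X, F, K, eps=2.0):
    # Indices whose reprojection error is below the inlier threshold eps_2.
    err = np.linalg.norm(F - project(R, t, X, K), axis=1)
    return np.where(err < eps)[0]

# Synthetic check: 12 points, two of them corrupted by a large pixel offset.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-2, 2, 12),
                     rng.uniform(-2, 2, 12),
                     rng.uniform(4, 10, 12)])
R, t = np.eye(3), np.array([0.1, -0.05, 0.2])
F = project(R, t, X, K)
F[[2, 7]] += 25.0                    # corrupted correspondences
good = inlier_set(R, t, X, F, K)
```

In the full method this gate is evaluated after each refinement pass, and iteration stops once the inlier count stops growing.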
The invention also proposes a visual odometry system based on image depth prediction and monocular geometry, comprising:
a feature detection and matching module, for detecting keypoints in the input images and matching corresponding feature points between consecutive image frames;
a 2D pose solver, for estimating the 2D pose transformation of the camera from the feature point correspondences;
a monocular depth prediction network, for predicting a dense depth map for each input frame;
a scale factor correction module, for correcting the scale factor of the 2D pose transformation using the depth information output by the depth prediction network;
and a 3D pose solver, for directly solving the 3D pose of the camera from the correspondences between the 3D coordinates and 2D pixel coordinates of the feature point pairs when the scale factor correction module cannot operate.
In the present embodiment, five evaluation metrics are used: average translational error (t_err), average rotational error (r_err), absolute trajectory error (ATE), translational relative pose error (RPE (m)), and rotational relative pose error (RPE (°)). The KITTI Odometry dataset contains stereo camera and lidar data for 22 real urban scenes and provides accurate ground-truth poses. As shown in fig. 3, the 11 sequences 00 to 10 of the KITTI dataset are used as evaluation data in panels (a) to (k) of fig. 3, respectively. The proposed visual odometry system is compared with various existing methods, including geometry-based methods (VISO2 and ORB-SLAM2) and end-to-end deep-learning methods such as SC-SfM-Learner, Depth-VO-Feat, and DF-VO. The quantitative results are shown in Table 1, with the best results marked in bold and underlined; the invention is optimal on four of the five metrics.
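Of the metrics above, ATE is typically computed after aligning the estimated trajectory to the ground truth with a least-squares similarity transform, since a monocular trajectory is recovered only up to scale. A NumPy sketch under that assumption (Umeyama alignment; the trajectories below are synthetic, not KITTI data):

```python
import numpy as np

def umeyama(src, dst):
    # Least-squares similarity (s, R, t) with s * R @ src_i + t ~= dst_i (Umeyama, 1991).
    mu_s, mu_d = src.mean(0), dst.mean(0)
    A, B = src - mu_s, dst - mu_d
    C = B.T @ A / len(src)                     # cross-covariance
    U, D, Vt = np.linalg.svd(C)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / A.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate_rmse(est, gt):
    # RMSE of position residuals after similarity alignment of est onto gt.
    s, R, t = umeyama(est, gt)
    aligned = s * est @ R.T + t
    return np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())

# Synthetic trajectory: an estimate differing from ground truth only by a
# similarity transform should align to (numerically) zero ATE.
rng = np.random.default_rng(2)
gt = rng.normal(0.0, 1.0, (30, 3))
theta = 0.4
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
est = ((gt - np.array([0.3, -0.2, 0.5])) / 2.0) @ Rz   # inverse of s=2, Rz, t0
```

t_err and r_err on KITTI are instead averaged over fixed-length sub-trajectories, which this sketch does not cover.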
Table 1: error comparison of the monocular visual odometry system and existing methods on the KITTI dataset
It can also be seen from fig. 3 that in long-distance trajectory estimation without loop closure (sequences 01, 02, 08), ORB-SLAM2 accumulates serious drift. Although these sequences present multiple challenges such as moving dynamic obstacles and severe illumination changes, the proposed method still shows excellent trajectory estimation capability in these highly dynamic, complex scenes, matching the ground-truth trajectory most closely.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the present invention and not limiting. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art will understand that the technical solutions described therein may be modified, or some or all of their technical features equivalently replaced, without departing from the scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A visual odometry method based on image depth prediction and monocular geometry, characterized in that it comprises the following steps:

Step (1): two consecutive image frames are input; local image features are detected and described with the scale-invariant feature transform (SIFT) algorithm, and corresponding feature point pairs (F_i, F_j) are then matched between the two frames with the FLANN algorithm;

Step (2): from the feature matches obtained in step (1), the essential matrix E is solved under the epipolar geometry constraint, giving the relative pose transformation of the camera;

Step (3): a monocular depth prediction model is constructed and trained to predict dense depth information for every input frame;

Step (4): if the number of valid depth-information pairs formed by the two consecutive frames exceeds a given threshold, the scale factor s is estimated by triangulation, giving the corrected relative pose transformation [R, st], where R is the rotation matrix and t the translation vector; otherwise step (5) is executed;

Step (5): the absolute pose transformation is solved by combining the perspective-N-point (PnP) projection algorithm, the RANSAC algorithm and local non-linear optimization;

said step (2) is implemented through the epipolar constraint:

F_j^T K^{-T} E K^{-1} F_i = 0

where E is the essential matrix, K is the camera intrinsic matrix, a known parameter, and the relative pose transformation [R, t] of the camera is obtained by decomposing E; (F_i, F_j) is a matched feature point pair;

said step (3) is implemented as follows: given the predicted depth D_t and the camera pose matrix T_{t+1→t}, the image I_{t+1→t} is reconstructed:

I_{t+1→t} = w(I_{t+1}, K T_{t+1→t} K^{-1} x D_t[x])

where w(·) is a differentiable image-warping function, K is the camera intrinsic matrix, and D_t[x] denotes the depth value at pixel x;

from the generated I_{t+1→t} and the reference image I_t, the following objective function is constructed:

pe(I_{t+1→t}, I_t) = α SSIM(I_{t+1→t}, I_t) + (1 − α) ‖I_{t+1→t} − I_t‖_1

where SSIM is the structural similarity index and α is a weight balancing the SSIM loss and the L1 loss;

for a reference view I_t, the images I_{t+1→t} and I_{t−1→t} are reconstructed from its neighbouring views I_{t−1} and I_{t+1}; only the minimum photometric error between each reference pixel and its synthesized pixels is kept:

L_p = min(pe(I_t, I_{t−1→t}), pe(I_t, I_{t+1→t}))

and an edge-aware depth smoothness term L_s is introduced:

L_s = |∂_x D_t| e^{−|∂_x I_t|} + |∂_y D_t| e^{−|∂_y I_t|}

where |∂_x D_t| and |∂_y D_t| are the gradient magnitudes of the depth map D_t in the horizontal and vertical directions, pushing the depth values of neighbouring pixels to be close, and e^{−|∂_x I_t|} and e^{−|∂_y I_t|} are weight terms based on the gradient of the current view image: in edge regions where the image gradient is large the weight is small, allowing discontinuities in the depth map, while in flat regions where the image gradient is small the weight is large, keeping the depth map smooth; the final loss function is:

L = L_p + λ L_s

and the monocular depth prediction model is implemented and trained with PyTorch.

2. The visual odometry method based on image depth prediction and monocular geometry according to claim 1, characterized in that step (1) is implemented as follows:

candidate key points are found as extrema by difference-of-Gaussians detection, comparing pixel values across scale spaces with:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y)

where x, y are the spatial coordinates of the image, I(x, y) is the pixel value of the input image at (x, y), k is the multiplicative factor between adjacent scales, and G(x, y, σ) is the Gaussian kernel at scale σ, used to generate the different scale spaces of the input image I(x, y);

a three-dimensional quadratic function is fitted to each key point by spline interpolation to locate the true extremum of the fitted function; low-contrast points are then removed with:

D(x̂) = D + (1/2) (∂D/∂x)^T x̂

where x̂ is the extremum point of the interpolation function, and points with |D(x̂)| smaller than a given threshold are discarded; when determining the dominant orientation of each key point, a gradient histogram is accumulated and its peak is taken as the dominant orientation, making the feature rotation-invariant; finally a key-point descriptor is generated: centred on each key point, the gradient-orientation histograms of its neighbourhood are computed and concatenated into a 128-dimensional vector serving as the feature-vector representation of that key point; after the feature vectors of the two images are obtained, the FLANN algorithm matches the corresponding feature point pairs (F_i, F_j) between the image frames.

3. The visual odometry method based on image depth prediction and monocular geometry according to claim 1, characterized in that step (4) is implemented as follows:

the corresponding feature point pairs (F_i, F_j) of two given consecutive image frames I_i and I_j are normalized to x_i and x_j, and the three-dimensional point cloud X_j corresponding to the view is reconstructed from (x_i, x_j) and the relative pose [R, t] by triangulation, where e_3 is the vector [0, 0, 1]^T, so that e_3^T X_j extracts the depth of X_j;

projecting X_j back onto the image plane yields the triangulation-based depth information D^tri_j; the monocular depth prediction model then estimates the depth information D_j of image I_j, and the fly-out mask M_d is applied to D_j to obtain the processed depth data D′_j; depth-information pairs are sought between D^tri_j and D′_j, and if their number exceeds a given threshold, the depth ratio r_i between the two depth sources is computed at each point and the RANSAC algorithm is introduced to fit the noisy data robustly:

|r_i − s| ≤ ε_1

where ε_1 is the residual threshold and s is the scale factor.

4. The visual odometry method based on image depth prediction and monocular geometry according to claim 1, characterized in that step (5) is implemented as follows:

the monocular depth prediction model estimates the depth information D_i of the current view I_i, and the fly-out mask M_d is applied to D_i to obtain the processed depth data D′_i; the camera intrinsics and D′_i are then used to compute, by back-projection, the three-dimensional point-cloud coordinates X_i corresponding to the view I_i; from all N correspondences between three-dimensional space points and two-dimensional pixel points, RANSAC randomly samples a minimal point set and computes an initial pose solution R_0, t_0; starting from R_0, t_0, a local non-linear optimization refines the pose solution {R*, t*}; the reprojection errors of all N point pairs are computed with the optimized pose solution to determine the inlier set:

‖p_i − π(K (R* X_i + t*))‖ ≤ ε_2

where ε_2 is the inlier threshold and π(·) denotes perspective division; the above process is repeated until the maximum number of iterations is reached or the number of inliers no longer increases, yielding the final pose transformation.

5. A visual odometry system based on image depth prediction and monocular geometry employing the method according to any one of claims 1 to 4, characterized in that it comprises:

a feature detection and matching module for detecting key points of the input images and matching corresponding feature points between consecutive image frames;

a 2D pose solver for estimating the 2D pose transformation of the camera from the correspondences of the feature point pairs;

a monocular depth prediction module for predicting a dense depth map for every input image frame;

a scale-factor correction module for correcting the scale factor of the 2D pose transformation using the depth information output by the depth prediction network;

a 3D pose solver for directly solving the 3D pose of the camera from the correspondence between the 3D coordinates and 2D pixel coordinates of the feature point pairs when the scale-factor correction module cannot run.
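The epipolar constraint of claim 1, step (2) can be illustrated with a short numeric self-check: a synthetic 3-D point is projected into two views related by a known pose, the essential matrix is formed as E = [t]_x R, and the residual F_j^T K^{-T} E K^{-1} F_i is verified to vanish. The intrinsics, pose, and point below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def skew(v):
    """Cross-product (skew-symmetric) matrix [v]_x so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0., -v[2], v[1]],
                     [v[2], 0., -v[0]],
                     [-v[1], v[0], 0.]])

# Assumed synthetic setup: intrinsics K, pure-translation pose (R, t), one world point X.
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R = np.eye(3)
t = np.array([0.1, 0., 0.])
X = np.array([0.3, -0.2, 2.0])          # point in the first camera's frame

Fi_h = K @ X
Fi = Fi_h / Fi_h[2]                     # homogeneous pixel in view i
Xj = R @ X + t                          # same point in the second camera's frame
Fj_h = K @ Xj
Fj = Fj_h / Fj_h[2]                     # homogeneous pixel in view j

E = skew(t) @ R                         # essential matrix built from the known pose
# Epipolar residual F_j^T K^{-T} E K^{-1} F_i; ~0 for a true correspondence.
residual = Fj @ np.linalg.inv(K).T @ E @ np.linalg.inv(K) @ Fi
```

In the method itself the direction is reversed: E is estimated from many matched pairs (F_i, F_j) via this constraint and then decomposed into [R, t].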
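The edge-aware smoothness term L_s of claim 1, step (3) can be sketched in a few lines of numpy: depth gradients are penalized, but each penalty is down-weighted by exp(−|image gradient|) so depth discontinuities are tolerated at image edges. The mean-normalization of the depth and the mean reduction are common-practice assumptions; the claim only fixes the functional form.

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Edge-aware depth smoothness: |dD| * exp(-|dI|), averaged over the image.

    `depth` and `image` are 2-D float arrays of equal shape (grayscale image
    assumed for simplicity; a color image would average gradients over channels).
    """
    d = depth / (depth.mean() + 1e-7)            # mean-normalize depth (assumed, as in common practice)
    dx_d = np.abs(d[:, 1:] - d[:, :-1])          # horizontal depth gradient magnitude
    dy_d = np.abs(d[1:, :] - d[:-1, :])          # vertical depth gradient magnitude
    dx_i = np.abs(image[:, 1:] - image[:, :-1])  # horizontal image gradient magnitude
    dy_i = np.abs(image[1:, :] - image[:-1, :])  # vertical image gradient magnitude
    # Weights shrink near strong image edges, allowing depth discontinuities there;
    # in flat regions the weights stay near 1, pushing the depth map to be smooth.
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

A constant depth map scores zero, and a depth step coinciding with an image edge is penalized less than the same step in a flat image region, which is exactly the behaviour the claim describes.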
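The robust scale-factor fit of claim 3 (step (4)) can be sketched as a one-parameter RANSAC over per-point depth ratios r_i: hypotheses for s are drawn from the data, inliers are those satisfying |r_i − s| ≤ ε_1, and the winning hypothesis is refined as the mean of its inliers. The threshold, iteration count, and one-point sampling are illustrative assumptions, not values from the patent.

```python
import numpy as np

def ransac_scale(ratios, eps=0.01, iters=100, seed=0):
    """Robustly estimate the scale factor s from noisy depth ratios r_i.

    Each iteration samples one ratio as a hypothesis for s and counts
    inliers with |r_i - s| <= eps; the largest inlier set wins, and s is
    refined as the mean of those inliers.  Returns (s, inlier_mask).
    """
    rng = np.random.default_rng(seed)
    ratios = np.asarray(ratios, dtype=float)
    best = np.zeros(len(ratios), dtype=bool)     # best inlier mask so far
    for _ in range(iters):
        s = ratios[rng.integers(len(ratios))]    # 1-point hypothesis drawn from the data
        inliers = np.abs(ratios - s) <= eps      # residual test of the claim
        if inliers.sum() > best.sum():
            best = inliers
    return float(ratios[best].mean()), best
```

With ratios clustered around a true scale plus a few gross outliers (e.g. from bad matches or bad triangulations), the outliers are rejected and the refined s lands on the cluster; the corrected pose is then [R, st].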
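The inlier test of claim 4 (step (5)) can be sketched directly: each back-projected 3-D point X_i is transformed by the candidate pose (R, t), projected through the intrinsics K, and compared with its observed pixel p_i; points with reprojection error below the threshold ε_2 are inliers. Array shapes and the default threshold are assumptions for illustration.

```python
import numpy as np

def reprojection_inliers(X, p, K, R, t, eps=1.0):
    """Reprojection-error inlier test for a PnP/RANSAC pose hypothesis.

    X: (N, 3) 3-D points, p: (N, 2) observed pixels, K: (3, 3) intrinsics,
    R: (3, 3) rotation, t: (3,) translation, eps: inlier threshold in pixels.
    Returns (inlier_mask, per-point errors).
    """
    Xc = X @ R.T + t                     # camera-frame coordinates R X_i + t
    proj = Xc @ K.T                      # homogeneous pixel coordinates K (R X_i + t)
    uv = proj[:, :2] / proj[:, 2:3]      # perspective division pi(.)
    err = np.linalg.norm(uv - p, axis=1)  # reprojection error per point
    return err <= eps, err
```

In the full method this test is run after the local non-linear refinement of {R*, t*}, and sampling plus refinement repeat until the iteration cap is hit or the inlier count stops growing.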
CN202410853788.4A 2024-06-28 2024-06-28 A visual odometer method and system based on image depth prediction and monocular geometry Active CN118736009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410853788.4A CN118736009B (en) 2024-06-28 2024-06-28 A visual odometer method and system based on image depth prediction and monocular geometry


Publications (2)

Publication Number Publication Date
CN118736009A CN118736009A (en) 2024-10-01
CN118736009B true CN118736009B (en) 2025-07-11

Family

ID=92844944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410853788.4A Active CN118736009B (en) 2024-06-28 2024-06-28 A visual odometer method and system based on image depth prediction and monocular geometry

Country Status (1)

Country Link
CN (1) CN118736009B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119810198B (en) * 2024-12-31 2025-09-26 浙江大学 Monocular vision odometer method and system based on bird's eye view angle representation and differentiable weighted Procrustes solver
CN120259619B (en) * 2025-06-06 2025-08-08 西北工业大学 Monocular vision odometer positioning method based on end-to-end deep learning

Citations (3)

Publication number Priority date Publication date Assignee Title
CN111899280A (en) * 2020-07-13 2020-11-06 哈尔滨工程大学 Monocular vision odometer method adopting deep learning and mixed pose estimation
CN111967337A (en) * 2020-07-24 2020-11-20 电子科技大学 Pipeline line change detection method based on deep learning and unmanned aerial vehicle images
CN112906766A (en) * 2021-02-02 2021-06-04 电子科技大学 Monocular vision odometer method integrating deep learning and geometric reasoning

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN112381890B (en) * 2020-11-27 2022-08-02 上海工程技术大学 RGB-D vision SLAM method based on dotted line characteristics
CN114494150A (en) * 2021-12-30 2022-05-13 杭州电子科技大学 Design method of monocular vision odometer based on semi-direct method
CN116958419A (en) * 2023-07-08 2023-10-27 复旦大学 A binocular stereo vision three-dimensional reconstruction system and method based on wavefront coding
CN116972874A (en) * 2023-07-31 2023-10-31 电子科技大学 An unsupervised monocular visual odometry based on global perception of optical flow




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant