Visual odometer method and system based on image depth prediction and monocular geometry
Technical Field
The invention belongs to the field of computer vision and robot navigation, and particularly relates to a visual odometer method and system based on image depth prediction and monocular geometry.
Background
A visual odometer estimates the motion trajectory of a robot from an input image sequence and is one of the core modules of a simultaneous localization and mapping (SLAM) system. Traditional geometric visual odometer methods based on feature-point matching achieve high precision in controlled environments, but fail easily in complex scenes with poor texture, dynamic objects, and illumination changes. In recent years, end-to-end visual odometer methods based on deep learning have made some progress, but they often lack geometric constraints, and their precision and robustness remain unsatisfactory. Existing hybrid visual odometer methods integrate classical geometric models with deep learning frameworks, but still need improvement when handling complex scenes such as highly dynamic or low-texture environments.
Disclosure of Invention
The invention aims to provide a visual odometer method and a visual odometer system based on image depth prediction and monocular geometry that overcome the defects of the prior art and improve the robustness and accuracy of a monocular visual odometer in dynamic environments.
The technical scheme is that the visual odometer method based on image depth prediction and monocular geometry comprises the following steps:
(1) Inputting two consecutive image frames, detecting and describing local image features with a scale-invariant feature transform (SIFT) algorithm, and then matching corresponding feature point pairs (F_i, F_j) between the two images with a FLANN algorithm;
(2) For the feature matching relation obtained in step (1), solving the essential matrix E using the epipolar geometric constraint to obtain the relative pose transformation [R, t] of the camera, where R is a rotation matrix and t is a translation vector;
(3) Constructing and training a monocular depth prediction model, and predicting dense depth information for each frame of input image;
(4) If the number of valid depth-information pairs formed by the two consecutive image frames is larger than a given threshold, estimating a scale factor s by triangulation to obtain the corrected relative pose transformation [R, st];
(5) Solving the absolute pose transformation [R, t] by combining a perspective-n-point projection algorithm, the RANSAC algorithm, and local nonlinear optimization.
Further, the implementation process of the step (1) is as follows:
Candidate key points are detected with the difference-of-Gaussian kernel: pixel values are compared across different scale spaces, and extreme points are taken as candidate key points:
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y)
where x, y are the spatial coordinates of the image, I(x, y) is the pixel value of the input image at point (x, y), k is the multiplicative relationship between adjacent scales, and G(x, y, σ) is the Gaussian kernel at scale σ, used to generate the different scale spaces of the input image I(x, y);
A three-dimensional quadratic function is then fitted at each candidate key point by spline interpolation to locate the true extreme point of the fitted function, and points with low contrast are removed using the following criterion:
D(x̂) = D + (1/2)(∂D/∂x)^T x̂
where x̂ is the extreme point of the interpolation function; candidate points whose |D(x̂)| falls below a given contrast threshold are discarded. Next, the principal direction of each key point is determined by statistics of the gradient histogram, giving the feature rotation invariance. Finally, a key-point descriptor is generated: with each key point as the center, gradient direction histograms are computed in its neighborhood to form a 128-dimensional vector as the feature-vector representation of that key point. After the feature vectors of the two images are obtained, the corresponding feature point pairs (F_i, F_j) between the image frames are matched by the FLANN algorithm.
Further, the step (2) is implemented by the following formula:
F_j^T K^{-T} E K^{-1} F_i = 0
where E is the essential matrix and K is the known camera intrinsic matrix; the relative pose transformation [R, t] of the camera is obtained by decomposing E.
Further, the implementation process of the step (3) is as follows:
Given the predicted depth D_t and the camera pose matrix T_{t+1→t}, an image I_{t+1→t} is reconstructed:
I_{t+1→t} = w(I_{t+1}, K T_{t+1→t} D_t[x] K^{-1} x)
where w(·) is a differentiable image-warping function and x denotes the pixel coordinates of a point in the image;
Using the generated I_{t+1→t} and the reference image I_t, the following objective function is constructed:
pe(I_{t+1→t}, I_t) = (α/2)(1 − SSIM(I_{t+1→t}, I_t)) + (1 − α)||I_{t+1→t} − I_t||_1
where SSIM is the structural similarity index (a similarity measure, so the loss uses 1 − SSIM) and α is the weight balancing the SSIM term and the L1 loss;
For the reference view I_t, images I_{t+1→t} and I_{t−1→t} are reconstructed from its neighboring views I_{t−1} and I_{t+1}, and only the photometric error of the reference/synthesized pixel pair with the smaller error is used:
L_p = min(pe(I_t, I_{t−1→t}), pe(I_t, I_{t+1→t}))
and an edge-aware depth smoothing term L_s is introduced:
L_s = |∂_x D_t| e^{−|∂_x I_t|} + |∂_y D_t| e^{−|∂_y I_t|}
where |∂_x D_t| and |∂_y D_t| are the gradient magnitudes of the depth map D_t in the horizontal and vertical directions, which encourage the depth values of adjacent pixels to be close, and e^{−|∂_x I_t|} and e^{−|∂_y I_t|} are weight terms based on the image gradient of the current view: in edge regions with large image gradient the weight is small, allowing discontinuities in the depth map, while in flat regions with small image gradient the weight is large, keeping the depth map smooth. The final loss function is:
L = L_p + λL_s
The monocular depth prediction model is implemented and trained with PyTorch.
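The training objective of step (3) can be sketched numerically. The following NumPy sketch uses a simplified single-window SSIM (real training code would use a local-window SSIM on tensors) and assumes the α(1 − SSIM)/2 form of the photometric term; it shows only how L = L_p + λL_s combines the pieces:

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified global SSIM over one window (an assumption for brevity;
    self-supervised depth training typically uses a local 3x3 window)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def photometric_error(warped, ref, alpha=0.85):
    # pe = (alpha/2) * (1 - SSIM) + (1 - alpha) * L1
    return 0.5 * alpha * (1.0 - ssim_global(warped, ref)) \
        + (1.0 - alpha) * np.abs(warped - ref).mean()

def smoothness(depth, image):
    # Edge-aware term: depth gradients down-weighted at image edges.
    dx_d = np.abs(np.diff(depth, axis=1))
    dy_d = np.abs(np.diff(depth, axis=0))
    dx_i = np.abs(np.diff(image, axis=1))
    dy_i = np.abs(np.diff(image, axis=0))
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()

def total_loss(warped_prev, warped_next, ref, depth, lam=0.001):
    # L_p is the per-pair minimum over the two reconstructions.
    L_p = min(photometric_error(warped_prev, ref),
              photometric_error(warped_next, ref))
    return L_p + lam * smoothness(depth, ref)
```

In the actual model these operations would be written with differentiable PyTorch ops so gradients flow back to the depth network.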
Further, the implementation process of the step (4) is as follows:
The corresponding feature point pairs (F_i, F_j) of two given consecutive image frames I_i and I_j are normalized to obtain x_i and x_j, and the corresponding three-dimensional point cloud X_j of the second view is reconstructed by triangulation; projecting X_j back to the image plane yields the triangulation-based depth information
D̂_j = e_3^T X_j
where e_3 is the vector [0, 0, 1]^T. Depth information D_j of image I_j is then estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_j to obtain processed depth data D̃_j. The points at which both D̂_j and D̃_j are valid are collected; if the number of such depth-information pairs is greater than a given threshold, depth ratios r_i = D̃_j(i)/D̂_j(i) are computed, and the RANSAC algorithm is introduced to robustly fit the noisy data, the inliers of a hypothesis s being the ratios satisfying
|r_i − s| ≤ ε_1
where ε_1 is the residual threshold and s is the scale factor.
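The robust scale fit of step (4) reduces to a one-dimensional RANSAC over the depth ratios r_i. The sketch below is illustrative: the sampling scheme, threshold, and the choice of the inlier mean as the final estimate are assumptions, not the patent's exact parameters:

```python
import numpy as np

def estimate_scale(ratios, eps1=0.1, iters=100, seed=0):
    """1-D RANSAC: each hypothesis s is a sampled ratio; the inliers of s
    are the r_i with |r_i - s| <= eps1. Returns the mean of the largest
    inlier set as the scale factor."""
    rng = np.random.default_rng(seed)
    ratios = np.asarray(ratios, dtype=float)
    best_inliers = None
    for _ in range(iters):
        s = ratios[rng.integers(len(ratios))]   # hypothesis from one sample
        inliers = ratios[np.abs(ratios - s) <= eps1]
        if best_inliers is None or len(inliers) > len(best_inliers):
            best_inliers = inliers
    return float(best_inliers.mean())
```

The recovered s then rescales the unit-norm translation from step (2), giving [R, st].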
Further, the implementation process of the step (5) is as follows:
Depth information D_i of the current view I_i is estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_i to obtain processed depth data D̃_i. The camera intrinsic matrix K is then used to compute, by back projection, the three-dimensional point-cloud coordinates X_i corresponding to view I_i, i.e. X_i = D̃_i(x) K^{-1} x. From the correspondences between all N pairs of three-dimensional space points and two-dimensional pixel points, an initial pose solution R_0, t_0 is computed with RANSAC by randomly sampling minimal inlier sets:
{R_0, t_0} = argmin_{R,t} Σ_i ρ(||F_j − π(R, t, X_i)||²)
where ρ is a robust kernel function, F_j is the feature point in the corresponding reference view I_j, and π is the projection function. With the initial solution R_0, t_0 as the starting point, the pose solution {R*, t*} is refined by local nonlinear optimization:
{R*, t*} = argmin_{R,t} Σ_i ||F_j − π(R, t, X_i)||²
All N reprojection errors are then computed with the optimized pose solution, and the inlier set is determined:
I = {i | ||F_j − π(R*, t*, X_i)|| < ε_2}
where ε_2 is the inlier threshold. The above process is repeated until the maximum number of iterations is reached or the number of inliers no longer increases, yielding the optimal solution [R, t].
The invention relates to a visual odometer system based on image depth prediction and monocular geometry, which comprises:
The feature detection and matching module is used for detecting key points of the input image and matching corresponding feature points between the continuous image frames;
A 2D pose solver for estimating the 2D pose transformation of the camera based on the correspondences of the feature point pairs;
A monocular depth prediction network for predicting a dense depth map for each frame of the input image;
The scale factor correction module is used for correcting the scale factor of the 2D pose transformation using the depth information output by the depth prediction network;
and the 3D pose solver is used for directly solving the 3D pose of the camera, based on the correspondences between the 3D coordinates and the 2D pixel coordinates of the feature point pairs, when the scale factor correction module cannot operate.
The invention combines the advantages of traditional geometric methods with recent deep learning techniques, addressing challenges of the field such as scale estimation and dynamic-environment adaptation while showing excellent performance in practical applications. First, by combining classical feature detection with geometric constraints, the method resists complex scenes such as texture deficiency, dynamic objects, and illumination changes, and the 2D pose-solving strategy based on optical-flow magnitude further improves adaptation to dynamic scenes. Second, the proposed scale factor correction method uses the dense depth information output by the depth prediction network to stably estimate, via the RANSAC algorithm, the inherently uncertain scale of the monocular visual odometer, avoiding long-term trajectory drift. In addition, the monocular depth prediction network of the system is trained in a self-supervised manner and does not depend on extra ground-truth annotations; even when trained only on a conventional dataset, it shows excellent generalization and runs stably in extreme scenes.
Drawings
FIG. 1 is a schematic view of the overall framework of the visual odometer system of the invention;
FIG. 2 is a schematic diagram of the calculation of the fly-out mask of the present invention;
fig. 3 compares experimental results when different sequences of the KITTI dataset are used as verification data, wherein graphs (a) to (k) correspond to sequences 00 to 10 of the KITTI dataset, respectively.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, the present invention proposes a visual odometer method based on image depth prediction and monocular geometry, comprising the steps of:
Step 1, running the feature detection and matching module to match corresponding feature points between consecutive image frames. The specific steps are as follows:
Two consecutive image frames are input, local features of the images are detected and described using the scale-invariant feature transform algorithm, and the FLANN algorithm is then used to match corresponding feature point pairs (F_i, F_j) between the two images.
The purpose of scale-space extremum detection is to find key points across different scale spaces. Candidate key points are detected with the difference-of-Gaussian kernel: pixel values are compared across different scale spaces, and extreme points are taken as candidates:
D(x,y,σ)=(G(x,y,kσ)-G(x,y,σ))*I(x,y)
and (3) performing three-dimensional quadratic function fitting on the key points by adopting a spline interpolation method, further finding out the true extreme points (key points) of the fitting function, and then removing points with low contrast, wherein the following formula is used:
Wherein, the Is the extreme point of the interpolation function, discardThe method comprises the steps of determining the principal direction of each key point, obtaining the principal direction through statistics of gradient histograms when determining the principal direction of each key point to enable the feature to have rotation invariance, finally generating a key point descriptor, calculating gradient direction histograms in the neighborhood of each key point by taking each key point as a center to form a 128-dimensional vector serving as a feature vector representation of the key point, and matching corresponding feature point pairs between image frames by using a FLANN algorithm after obtaining the feature vectors of two images (F i,Fj).
Step 2, the 2D pose solver solves the essential matrix E based on the epipolar geometry constraint, thereby obtaining the relative pose transformation [R, t] of the camera, which satisfies:
F_j^T K^{-T} E K^{-1} F_i = 0,  E = [t]_× R
where R is the rotation matrix, t is the translation vector, E is the essential matrix, K is the known camera intrinsic matrix, and F_i, F_j are the feature matching points obtained in step 1. The relative pose transformation [R, t] of the camera is thus found by decomposing E. The solver monitors the optical-flow magnitude between consecutive frames in real time, and decomposes the essential matrix to obtain a pose solution only when the magnitude is large enough, which avoids interference from small-scale dynamic points.
Step 3, constructing a monocular depth prediction network and predicting a dense depth map for each frame of the input image. The monocular depth model is first trained: given the predicted depth D_t and the camera pose matrix T_{t+1→t}, an image I_{t+1→t} is reconstructed:
I_{t+1→t} = w(I_{t+1}, K T_{t+1→t} D_t[x] K^{-1} x)
where w(·) is a differentiable image-warping function and x denotes the pixel coordinates of a point in the image.
Using the generated I_{t+1→t} and the reference image I_t, the following objective function can be constructed:
pe(I_{t+1→t}, I_t) = (α/2)(1 − SSIM(I_{t+1→t}, I_t)) + (1 − α)||I_{t+1→t} − I_t||_1
where SSIM is the structural similarity index, a measure of the similarity of two images (so the loss uses 1 − SSIM), and α = 0.85 is the weight balancing the SSIM term and the L1 loss.
Specifically, as shown in fig. 2, for the reference view I_t, images I_{t+1→t} and I_{t−1→t} are reconstructed from its neighboring views I_{t−1} and I_{t+1}. The present invention does not average the photometric errors between the reference pixels and the synthesized pixels over multiple views, but only computes the photometric error of the reference/synthesized pixel pair whose error is minimal:
Lp=min(pe(It,It-1→t),pe(It,It+1→t))
Finally, an edge-aware depth smoothing term L_s is introduced:
L_s = |∂_x D_t| e^{−|∂_x I_t|} + |∂_y D_t| e^{−|∂_y I_t|}
where |∂_x D_t| and |∂_y D_t| are the gradient magnitudes of the depth map D_t in the horizontal and vertical directions, which encourage the depth values of adjacent pixels to be close, and e^{−|∂_x I_t|} and e^{−|∂_y I_t|} are weight terms based on the image gradient of the current view. When the image gradient is large (e.g., in edge regions), the weight is small, allowing discontinuities in the depth map; in flat regions where the image gradient is small, the weight is large, forcing the depth map to remain smooth. This depth smoothing term is introduced because the predicted depth map should be discontinuous at object edges while remaining smooth inside objects, which matches the structure of real scenes. This penalty term enhances the edge preservation and smoothness of the depth prediction, yielding a more reasonable depth estimation result.
The final loss function is L = L_p + λL_s. The depth prediction model is implemented and trained with PyTorch. Training uses the Adam optimization algorithm for 15 epochs in total, with the learning rate set to 0.0001 and the weight coefficient λ of the edge-aware depth smoothing regularization term set to 0.001. After the monocular depth model is trained, a monocular image input to the model yields the dense depth information of the current image.
Step 4, based on the result of step 2, running the scale factor correction module: the scale factor s is estimated by triangulation using the dense depth information of the images, and the corrected relative pose transformation [R, st] is obtained. The specific steps are as follows:
The corresponding feature point pairs (F_i, F_j) of two given consecutive image frames I_i and I_j are normalized to obtain x_i and x_j, and the corresponding three-dimensional point cloud X_j of the second view can be reconstructed by triangulation; projecting X_j back to the image plane yields the triangulation-based depth information
D̂_j = e_3^T X_j
where e_3 is the vector [0, 0, 1]^T. Depth information D_j of image I_j is then estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_j to obtain processed depth data D̃_j. The points at which both D̂_j and D̃_j are valid are collected; if the number of such depth-information pairs is greater than a given threshold, depth ratios r_i = D̃_j(i)/D̂_j(i) are computed, and the RANSAC algorithm is introduced to robustly fit the noisy data, the inliers of a hypothesis s being the ratios satisfying
|r_i − s| ≤ ε_1
where ε_1 is an appropriate residual threshold. In this way, the scale factor estimate s of the visual odometer is obtained, and on the basis of step 2 the scale-corrected relative pose transformation [R, st] follows.
Step 5, running the 3D pose solver: the absolute pose transformation [R, t] is solved accurately by combining the perspective-n-point projection algorithm, the RANSAC algorithm, and local nonlinear optimization.
First, depth information D_i of the current view I_i is estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_i to obtain processed depth data D̃_i. The camera intrinsic matrix K is then used to compute, by back projection, the three-dimensional point-cloud coordinates X_i corresponding to view I_i, i.e. X_i = D̃_i(x) K^{-1} x. Next, from the correspondences between all N pairs of three-dimensional space points and two-dimensional pixel points, an initial pose solution R_0, t_0 is computed with RANSAC by randomly sampling minimal inlier sets:
{R_0, t_0} = argmin_{R,t} Σ_i ρ(||F_j − π(R, t, X_i)||²)
where ρ is the robust kernel function, F_j is the two-dimensional pixel coordinate in the reference view I_j, X_i is the three-dimensional space point of the current view I_i, and π is the projection function. With the initial solution R_0, t_0 as the starting point, the pose solution {R*, t*} is refined by local nonlinear optimization:
{R*, t*} = argmin_{R,t} Σ_i ||F_j − π(R, t, X_i)||²
All N reprojection errors are then computed with the optimized pose solution, and the inlier set is determined:
I = {i | ||F_j − π(R*, t*, X_i)|| < ε_2}
where ε_2 is the inlier threshold. The above procedure is repeated until the maximum number of iterations is reached or the number of inliers no longer increases, thereby obtaining the optimal solution [R, t].
The invention also proposes a visual odometer system based on image depth prediction and monocular geometry, comprising:
The feature detection and matching module is used for detecting key points of the input image and matching corresponding feature points between the continuous image frames;
A 2D pose solver for estimating the 2D pose transformation of the camera based on the correspondences of the feature point pairs;
A monocular depth prediction network for predicting a dense depth map for each frame of the input image;
The scale factor correction module is used for correcting the scale factor of the 2D pose transformation using the depth information output by the depth prediction network;
and the 3D pose solver is used for directly solving the 3D pose of the camera, based on the correspondences between the 3D coordinates and the 2D pixel coordinates of the feature point pairs, when the scale factor correction module cannot operate.
In the present embodiment, five evaluation indexes are used: average translational error (t_err), average rotational error (r_err), absolute trajectory error (ATE), translational relative pose error (RPE(m)), and rotational relative pose error (RPE(°)). The KITTI Odometry dataset contains stereo-camera and lidar data for 22 real urban scenes and provides accurate ground-truth poses. As shown in fig. 3, graphs (a) to (k) use the 11 sequences 00 to 10 of the KITTI dataset, respectively, as evaluation data. The visual odometer system provided by the invention is compared with various existing methods, including geometry-based methods (VISO2 and ORB-SLAM2) and end-to-end deep learning methods such as SC-SfM-Learner, Depth-VO-Feat, and DF-VO. The quantitative results are shown in Table 1, where the best results are marked in bold with underlines; it can be seen that the invention is optimal on four of the five indexes.
Table 1 error contrast of monocular vision odometer system and methods in KITTI dataset
It can also be seen from fig. 3 that in long-distance trajectory estimation without route loops (sequences 01, 02, 08), serious drift accumulation occurs in ORB-SLAM2. Although faced with multiple challenges such as moving dynamic obstacles and severe illumination changes, the proposed method still shows excellent trajectory estimation in these highly dynamic, complex scenes and matches the ground-truth trajectory most closely.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solution described in the above embodiments may be modified, or some or all of its technical features may be equivalently replaced, and such modifications or substitutions do not cause the corresponding technical solution to depart from the scope of the technical solutions of the embodiments of the present invention.