R-LVIO: Resilient LiDAR-Visual-Inertial Odometry for UAVs in GNSS-denied Environment
Figure 1. <p>The schematic diagram of coordinate transformation. <math display="inline"><semantics> <msub> <mi>T</mi> <mrow> <mi>C</mi> <mi>B</mi> </mrow> </msub> </semantics></math> and <math display="inline"><semantics> <msub> <mi>T</mi> <mrow> <mi>L</mi> <mi>B</mi> </mrow> </msub> </semantics></math> represent the extrinsic transformations from the camera and LiDAR frames to the IMU frame. <math display="inline"><semantics> <msub> <mi>T</mi> <mrow> <mi>B</mi> <mi>W</mi> </mrow> </msub> </semantics></math> represents the transformation from the body frame to the world frame.</p>
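The frame composition described in this caption can be illustrated with a minimal sketch. The 4×4 homogeneous matrices, their numeric values, and the helper names below are illustrative stand-ins, not the platform's actual extrinsic calibration:

```python
# Compose homogeneous transforms: a point expressed in the camera frame is
# mapped to the world frame via T_BW * T_CB (camera -> body -> world).
def mat_mul(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply(T, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    v = [p[0], p[1], p[2], 1.0]
    return [sum(T[i][k] * v[k] for k in range(4)) for i in range(3)]

# Illustrative extrinsics: camera mounted 0.1 m ahead of the IMU (identity
# rotation), and the body hovering 2 m above the world origin.
T_CB = [[1, 0, 0, 0.1], [0, 1, 0, 0], [0, 0, 1, 0],   [0, 0, 0, 1]]
T_BW = [[1, 0, 0, 0],   [0, 1, 0, 0], [0, 0, 1, 2.0], [0, 0, 0, 1]]

T_CW = mat_mul(T_BW, T_CB)              # camera -> world
p_world = apply(T_CW, (0.0, 0.0, 0.0))  # the camera origin in the world frame
```

Chaining the two extrinsics this way is all that is meant by "camera to IMU" followed by "body frame to world frame" in the caption.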
Figure 2. <p>Pipeline of the proposed system. The system is divided into the IMU module, the visual-inertial module, and the LiDAR-inertial module. Modules with red borders are highlighted in this paper. In detail, the LiDAR-inertial module provides depth measurements for visual features by aggregating recent multi-frame sweeps. Moreover, the motion estimation of the visual-inertial module is constrained by the back-propagated pose from the LiDAR-inertial module at the previous moment. The visual-inertial module provides the initial guess for the LiDAR-inertial module’s point cloud matching. The camera pose and LiDAR pose are fed into the IMU module, where they form measurement residuals with the IMU preintegration; these residuals are then minimized in a factor graph optimization to estimate the final state.</p>
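The depth-association step mentioned in this caption can be sketched in a simplified form. This is not the paper's implementation: the pinhole intrinsics (FX, FY, CX, CY), the nearest-projection rule, and the pixel gate max_px are hypothetical, and the real module additionally aggregates multi-frame sweeps and checks reprojection error over a sliding window:

```python
import math

# Hypothetical pinhole intrinsics; not the platform's calibration.
FX, FY, CX, CY = 400.0, 400.0, 320.0, 240.0

def project(p):
    """Project a 3D point in the camera frame onto the image plane."""
    x, y, z = p
    return (FX * x / z + CX, FY * y / z + CY, z)

def associate_depth(feature_uv, lidar_points, max_px=3.0):
    """Assign the depth of the nearest projected LiDAR point to a visual
    feature; reject the association if no projection falls close enough."""
    best = None
    for p in lidar_points:
        if p[2] <= 0:          # point behind the camera: skip
            continue
        u, v, z = project(p)
        d = math.hypot(u - feature_uv[0], v - feature_uv[1])
        if best is None or d < best[0]:
            best = (d, z)
    if best is None or best[0] > max_px:
        return None            # treated as an outlier: no reliable depth
    return best[1]

cloud = [(0.0, 0.0, 5.0), (0.5, 0.0, 5.0), (0.0, 0.0, -1.0)]
depth = associate_depth((321.0, 240.0), cloud)  # feature near the principal point
```

A feature far from every projected LiDAR point gets no depth at all, which is the behavior the outlier-rejection strategy in Section 4.3.1 relies on.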
Figure 3. <p>Factor graph of the system. The IMU module is constrained by the LiDAR and visual modules, and ultimately outputs a refined system state.</p>
Figure 4. <p>The illustration of edge point aggregation. Different colored edge dots indicate different ranges. Vertical observations are projected onto the segmented horizontal plane for clustering.</p>
Figure 5. <p>The model of the aligned point. (<b>a</b>) shows the point-to-plane model employed in structured scenes; (<b>b</b>) shows the point-to-surface model with uncertainty for aligning irregular ground points.</p>
Figure 6. <p>The platform used for research and for collecting the private dataset. The UAV in (<b>a</b>) is equipped with a GPS mobile station, LiDAR, an on-board computer, and a pinhole camera. The world frame is defined as the first IMU frame. Satellite photographs (<b>b</b>–<b>e</b>) show four scenes. The orange curves represent the ground truth of these sequences as determined by the GNSS/IMU positioning system.</p>
Figure 7. <p>Quantitative structures of the private dataset. (<b>a</b>–<b>c</b>) show the grove, beach, and desert sequences, respectively. Cubes are salient edge points. QS stands for quantitative structure.</p>
Figure 8. <p>Point cloud maps of the NTU dataset. (<b>a</b>–<b>c</b>) show the rtp, sbs, and tnp sequences, respectively.</p>
Figure 9. <p>3D bird’s-eye-view map of the campus sequence. The quantitative structure of three locations is highlighted in (<b>a</b>–<b>c</b>). Subfigures (<b>d</b>,<b>e</b>) show point cloud maps from a bird’s-eye perspective, presenting a consistent map without point cloud divergence.</p>
Figure 10. <p>Localization accuracy experiments performed on the NTU-VIRAL dataset. The trajectory errors are compared with the ground truths provided by the public dataset. Subfigures (<b>a</b>–<b>i</b>) show the trajectory results for the rtp, sbs, and tnp sequences, respectively. The “reference” in each subfigure is the ground truth of the UAV trajectory. The heat-map color of the estimated trajectories indicates the error level.</p>
Figure 11. <p>Trajectory comparisons on the private dataset. Subfigures (<b>a</b>–<b>c</b>) show the comparisons on the grove, beach, and desert sequences. As the degree of non-structuring increases, the robustness of LVI-SAM and FAST-LIVO decreases. The proposed system still maintains high localization accuracy.</p>
Abstract
1. Introduction
- Improve the accuracy of short-term IMU predictions by increasing the frequency of corrections from the LiDAR and visual modules. LiDAR pose frequency is boosted by sweep segmentation to synchronize the LiDAR input time with the camera sampling time.
- Devise an outlier rejection strategy for depth association between the camera image and LiDAR points, selecting accurate depth points by evaluating the reprojection error of visual feature points in a sliding window.
- Design a structure monitor that distinguishes structured from unstructured scenes by analyzing vertical landmarks. The degree of environmental structure is quantified to switch the operating mode of the LiDAR module.
- Propose a novel point-to-surface model to register irregular surfaces in unstructured scenes, achieving horizontal 3-DoF state estimation. The vertical 3-DoF state is predicted from relative IMU measurements.
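As a rough illustration of the structure-monitor contribution above, the sketch below counts clusters of vertical edge points and uses the count to pick a registration mode. The sector clustering, scoring rule, and threshold are simplifying assumptions for illustration, not the paper's exact formulation:

```python
import math

def cluster_by_bearing(edge_points, sector_deg=30.0):
    """Group edge points into horizontal sectors by bearing angle, a crude
    stand-in for projecting vertical observations onto the ground plane."""
    sectors = {}
    for x, y, _z in edge_points:
        idx = int(math.degrees(math.atan2(y, x)) // sector_deg)
        sectors.setdefault(idx, []).append((x, y))
    return sectors

def structuring_score(edge_points, min_cluster=3):
    """Count sectors holding enough vertical edge points to form a landmark."""
    sectors = cluster_by_bearing(edge_points)
    return sum(1 for pts in sectors.values() if len(pts) >= min_cluster)

def registration_mode(edge_points, threshold=2):
    """Structured scene -> point-to-plane; unstructured -> point-to-surface."""
    if structuring_score(edge_points) >= threshold:
        return "point-to-plane"
    return "point-to-surface"

# Two dense vertical landmarks (e.g. tree trunks) versus a nearly bare scene.
landmarks = [(1.0, 0.0, 0.0), (1.0, 0.1, 0.0), (1.0, 0.2, 0.0),
             (0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (0.0, 3.0, 0.0)]
mode = registration_mode(landmarks)
```

A grove produces many such clusters and stays in the structured mode, while a desert produces almost none and falls back to the point-to-surface mode.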
2. Related Works
2.1. Sensor Fusion Localization System
2.2. Point Cloud Registration
3. Problem Statement
4. Proposed Method
4.1. System Overview
4.2. IMU Module with High-Frequency Correction
4.2.1. Time Synchronization Based on Sweep Segmentation
4.2.2. IMU Kinetic Model
4.3. Visual Module with Position Consistency Constraint
4.3.1. Depth Association by Outliers Rejection
4.3.2. Motion Estimation Assisted by LiDAR Odometry Back-propagation
4.4. Adaptive LiDAR Module with Hybrid Registration Mode
4.4.1. Structure Monitor Based on Vertical Landmarks
4.4.2. Hybrid Point Cloud Alignment
Algorithm 1: Hybrid Point Cloud Registration
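The algorithm body is not reproduced on this page; the hedged sketch below only illustrates the two residual models the hybrid registration switches between (point-to-plane for structured scenes, point-to-Gaussian surface for irregular ground). The uncertainty weighting `1 / (1 + sigma_n)` and all names are illustrative assumptions, not the paper's derivation:

```python
def point_to_plane_residual(p, q, n):
    """Signed distance from point p to the plane through q with unit normal n."""
    return sum((pi - qi) * ni for pi, qi, ni in zip(p, q, n))

def point_to_gaussian_surface_residual(p, mean, normal, sigma_n):
    """Point-to-surface distance downweighted by local surface uncertainty.
    sigma_n models the spread of nearby ground points along the normal, so
    irregular patches contribute smaller residuals (illustrative weighting)."""
    d = point_to_plane_residual(p, mean, normal)
    return d / (1.0 + sigma_n)

def hybrid_residual(p, local_stats, structured):
    """Dispatch between the two models based on the structure monitor."""
    mean, normal, sigma_n = local_stats
    if structured:
        return point_to_plane_residual(p, mean, normal)
    return point_to_gaussian_surface_residual(p, mean, normal, sigma_n)

# A point 1 m above a flat patch at the origin with normal +z.
stats = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0)
r_plane = hybrid_residual((0.0, 0.0, 1.0), stats, True)
r_surf = hybrid_residual((0.0, 0.0, 1.0), stats, False)
```

The same aligned point thus contributes a full residual in structured mode but a weaker one over uncertain, irregular ground, which is the qualitative behavior Figure 5 depicts.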
5. Experimental Results
5.1. Structure Monitor Evaluation
5.2. System Localization Accuracy Evaluation
5.3. System Robustness Evaluation
5.4. Runtime Evaluation
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Notations | Descriptions |
---|---|
 | Input times of the camera image and LiDAR sweep |
 | Set of all states up to the current moment |
 | State at a given time |
 | Rotation matrix, position vector, and linear velocity at a given time |
 | Observations of the camera and LiDAR at a given time |
 | Set of IMU measurements between two consecutive moments |
 | Residuals of the prior, IMU preintegration, and visual and LiDAR features |
 | Measurement values of the accelerometer and gyroscope at a given time |
 | Acceleration and angular velocity of the platform motion |
 | Gravity vector |
 | Biases and noise of the accelerometer and gyroscope |
 | Preintegrated measurements for orientation, translation, and velocity |
 | Preintegration errors |
 | Predicted states from the IMU |
 | IMU measurement errors with respect to VIO and LIO |
 | Uncertainty matrices of the LiDAR and camera poses |
 | Error of the marginalized prior |
 | Reprojection residual error of a visual feature |
 | Constraint from the back-propagated LiDAR pose |
 | Set of camera poses |
 | Smoothness of a LiDAR point |
 | Range of a LiDAR point |
 | Edge and plane points of the n-th sweep |
 | All feature points of the n-th sweep |
 | Clustering results of the n-th sweep |
 | Weighted distance vector of the i-th sector |
 | Environmental structuring of the n-th sweep |
 | Initial LiDAR pose from the prediction state |
 | Weighted point-to-feature distance |
 | Feature registration error |
 | Point-to-Gaussian-surface error |
 | Covariance matrix, eigenvector matrix, and diagonal matrix |
 | LiDAR pose at a given time |
 | Rotation matrix of a given angle |
 | Jacobian and Hessian matrices obtained by differentiating the projected point with respect to the pose |
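The preintegration entries in the table above can be illustrated by a minimal 1-D sketch that summarizes accelerometer samples between two keyframes into relative position and velocity increments. This assumes a constant bias and omits noise, gravity, and rotation; it is purely illustrative, not the on-manifold formulation the system uses:

```python
def preintegrate_1d(accels, dt, bias=0.0):
    """Integrate bias-corrected 1-D accelerometer samples between two
    keyframes into position and velocity increments (Euler integration)."""
    dv, dp = 0.0, 0.0
    for a in accels:
        a = a - bias
        dp += dv * dt + 0.5 * a * dt * dt  # position increment over this step
        dv += a * dt                       # velocity increment over this step
    return dp, dv

# Two samples of 1 m/s^2 over 0.5 s each.
dp, dv = preintegrate_1d([1.0, 1.0], dt=0.5)
```

The point of preintegration is that these increments depend only on the IMU samples and the bias, so they can be reused when the keyframe states are re-optimized in the factor graph.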
Data | rtp1 | rtp2 | rtp3 | sbs1 | sbs2 | sbs3 | tnp1 | tnp2 | tnp3 | Camp | Grove | Beach | Desert |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Result | 5.21 | 5.53 | 5.39 | 6.03 | 5.86 | 5.15 | 6.84 | 6.67 | 5.51 | 6.30 | 4.13 | 2.26 | 0.04 |
Data | LIO-SAM | LVI-SAM | FAST-LIO2 | FAST-LIVO | mVIL-Fusion | Ours |
---|---|---|---|---|---|---|
rtp1 | 0.674 | 0.954 | 0.265 | |||
rtp2 | 0.177 | X | 0.195 | 0.861 | 0.673 | 0.201 |
rtp3 | 0.385 | 0.204 | 0.195 | 0.283 | 0.426 | 0.141 |
sbs1 | 0.214 | 0.215 | 0.223 | 0.351 | 0.213 | 0.114 |
sbs2 | 0.208 | 0.208 | 0.213 | 0.232 | 0.225 | 0.200 |
sbs3 | 0.179 | X | 0.210 | 0.210 | 0.193 | 0.175 |
tnp1 | 0.193 | 0.134 | 0.146 | 0.202 | 0.269 | 0.214 |
tnp2 | 0.192 | 0.180 | 0.169 | 0.124 | 0.229 | 0.168 |
tnp3 | 0.176 | 0.479 | 0.181 | 0.165 | 0.223 | 0.109 |
campus | 0.466 | 1.641 | 0.457 | 0.395 | 0.312 | |
grove | 5.237 | 6.318 | 5.047 | 5.215 | - | 4.880 |
beach | 0.453 | 6.449 | 1.171 | 2.159 | - | 0.241 |
desert | X | X | X | X | - | 10.87 |
Module | Component | Median | Mean | Std |
---|---|---|---|---|
Visual module | | 17.11 | 17.26 | 5.16 |
 | Feature management | 17.22 | 18.31 | 4.39 |
 | Pose estimation | 13.19 | 12.53 | 5.33 |
LiDAR module | Preprocessing | 9.49 | 11.61 | 3.92 |
 | Feature extraction | 1.15 | 1.29 | 1.72 |
 | Structure monitor | 0.15 | 1.56 | 3.67 |
 | Hybrid point registration | 22.55 | 24.71 | 17.43 |
IMU module | Preintegration | 0.143 | 0.151 | 0.360 |
 | Factor graph optimization | 0.280 | 0.274 | 0.311 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, B.; Shao, X.; Wang, Y.; Sun, G.; Yao, W. R-LVIO: Resilient LiDAR-Visual-Inertial Odometry for UAVs in GNSS-denied Environment. Drones 2024, 8, 487. https://doi.org/10.3390/drones8090487