Article

Sonar-Based Simultaneous Localization and Mapping Using the Semi-Direct Method

1 Faculty of Computer Science and Technology, Ocean University of China, Qingdao 266000, China
2 Faculty of Creative and Cultural Industries, School of Creative Technologies, University of Portsmouth, Portsmouth PO1 2DJ, UK
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(12), 2234; https://doi.org/10.3390/jmse12122234
Submission received: 29 September 2024 / Revised: 7 November 2024 / Accepted: 3 December 2024 / Published: 5 December 2024
Figure 1. Underwater sonar-based SLAM: (a) The poses obtained using both methods, projecting the point cloud from the previous frame to the position in the next frame: the top is the result of optimizing the position using the semi-direct method, and the bottom is the result obtained using only ICP. (b) Enlarged view of the dense point cloud area in (a); it can be seen that the ICP algorithm is relatively accurate but still has minor errors. (c) Minor errors, when accumulated over a large amount of data, will eventually lead to an unacceptable level of drift. (d) ROV equipment equipped with sonar.
Figure 2. System framework diagram. The entire SLAM system is divided into three main components: first, the processing of raw sonar data, where different shapes of windows are employed based on the level of reverberation noise to extract features from the sonar data and obtain Cartesian-coordinate sonar images; second, the front end, which is primarily responsible for obtaining initial pose estimates using scan matching and semi-direct methods; and third, the back end, which is mainly in charge of global optimization and final state estimation.
Figure 3. Forward-looking sonar model: let P be a point in space, R the range from this point to the sonar origin, $\theta$ the horizontal angle, and $\varphi$ the pitch angle. In sonar imaging, the information along the pitch angle is compressed into the imaging plane.
Figure 4. (a) One-dimensional CFAR detection is prone to false alarms: red represents the test cells, blue represents the reference cells, and orange represents the guard cells. (b) Two-dimensional SO-CFAR window: the window includes the test cells (red), reference cells (blue), and guard cells (orange). Areas with no reflections are shown in white, and obstacles are shown in gray.
Figure 5. The principle of the direct method in sonar imaging. A forward-looking multibeam sonar emits sound waves from different poses, and the echoes generated by obstacles form the sonar image. Assuming we know the pose transformation between the sonars, the photometric error should be minimal when calculating the position of a pixel block from the previous sonar image in the next sonar image based on the sonar projection model. However, in reality, we do not know the pose transformation between the sonars; therefore, we need to iterate multiple times to find the pose transformation that minimizes the photometric error, and this process is the direct method.
Figure 6. Image processing diagram. The forward-looking multibeam sonar grabs sonar images, which are in polar coordinates. The raw sonar images can be transformed into Cartesian coordinates through interpolation and transformation. Pixel values along a certain direction often exhibit high-frequency changes, making it easy to fall into local minima; image gradients also change at a very high frequency.
Figure 7. The changes in pixel values along the straight line AB in the sonar image of Figure 6 before and after bilateral filtering: (a) before bilateral filtering; (b) after bilateral filtering.
Figure 8. Effects of feature extraction with different CFAR detection methods: (a) raw sonar image; (b) features extracted via one-dimensional SO-CFAR detection along the range dimension; (c) features extracted via GO-CFAR detection along the range dimension; (d) features extracted via two-dimensional SO-CFAR detection.
Figure 9. Quantitative assessment: (a) pixel coordinates in the imaging of point clouds transformed using poses $T_1$ and $T_2$; (b) enlarged sonar image (part) of (a), with the red box marking the enlarged area; (c) position of point clouds in the sonar image; (d) sonar image (part corresponding to the point clouds); (e) statistical chart of point cloud pixel values.
Figure 10. (a) The trajectory and map of the SLAM result that uses only the ICP algorithm in the front end. (b) The trajectory and map of the SLAM result that utilizes the semi-direct method based on the ICP algorithm.

Abstract

The SLAM problem is a common challenge faced by ROVs working underwater, with the key issue being accurate pose estimation. In this work, we make full use of the positional information of point clouds and the surrounding pixel data. To obtain better feature extraction results in specific directions, we propose an accelerated two-dimensional SO-CFAR algorithm whose time cost is only slightly higher than that of one-dimensional SO-CFAR. We develop a sonar semi-direct method adapted from the direct method used in visual SLAM; starting from the initialization provided by the ICP algorithm, we apply this method to further refine the pose estimate. To overcome the deficiencies of sonar images, we preprocess the images and formulate a sonar imaging model in imitation of camera imaging models, further optimizing the pose by minimizing photometric error and fully exploiting pixel information. The improved front end and the accelerated two-dimensional SO-CFAR are assessed through quantitative experiments, and the performance of SLAM in large real-world environments is assessed through qualitative experiments.

1. Introduction

Observation of underwater structures in areas such as docks, bridges, and waterways requires the use of Remotely Operated Vehicles (ROVs) or Autonomous Underwater Vehicles (AUVs) equipped with multiple sensors [1]. These robots gather local environmental information through sensors, but local information alone may not always suffice for the task at hand. For instance, underwater engineering inspections in large-scale environments often necessitate establishing a correlation between the robot and known coordinates.
Electromagnetic waves have poor penetration capability in water. Consequently, GPS is ineffective underwater, meaning that ROVs and AUVs, without prior maps, need to construct maps of their surrounding environments while operating to determine their position in large-scale environments [2]. This process is referred to as the Simultaneous Localization and Mapping (SLAM) problem.
In current aerial environments, cameras and LIDAR sensors are frequently utilized for environmental perception and object recognition. However, underwater environments are often turbid and poorly lit, significantly impacting the performance of SLAM algorithms utilizing the aforementioned sensors [3,4].
Electromagnetic waves have limited propagation in water, whereas sound waves experience much weaker attenuation. Therefore, for underwater detection, sonar is usually chosen for environmental perception.
Due to the high noise level, limited resolution, and inherent blurriness of sonar images [5], extracting features from sonar images has always been a challenging problem. Compared to the field of visual SLAM, research on sonar-based SLAM is relatively scarce. Moreover, the differences between sonar and camera imaging principles also affect the effectiveness of various image processing methods transitioning from the visual domain to the sonar domain. For example, the effectiveness of feature-based methods commonly used in visual SLAM relies on the stability of feature extractors and feature-matching techniques. However, feature matching in sonar images is often unreliable and time-consuming. The point cloud is a collection of extracted feature points. Algorithms like NDT [6], UspIC [7], and IDC [8] that avoid the feature matching process do not impose stringent requirements on the initial value, but they neglect pixel information. In addition, the lack of stability in feature extraction between consecutive frames can also lead to errors in point cloud registration. As shown in Figure 1a, frame-to-frame registration usually comes with minor errors that these algorithms cannot correct. Over time, without loop closure correction, the errors can become unacceptable, as illustrated in Figure 1c, where the ROV’s trajectory crosses obstacles.
To tackle the aforementioned problems related to multibeam sonar, we propose a method that takes into account both position and pixel information. Firstly, we select different types of two-dimensional constant false alarm detection based on the varying conditions of the sonar images, ensuring the stability of feature extraction between consecutive frames. Secondly, after obtaining the initial values, we minimize photometric errors to optimize the pose. Our main contributions are as follows:
  • We propose a two-dimensional constant false alarm detection method that uses a matrix prefix sum to accelerate the calculations. Compared with the one-dimensional constant false alarm algorithm, the added time overhead is very low, and it detects obstacles in the horizontal and vertical directions more effectively.
  • We formulate a sonar imaging model analogous to the camera imaging model and further optimize the initial pose by minimizing photometric errors.
  • We conduct a quantitative assessment of the photometric optimization effect using data collected by the robot shown in Figure 1d, and we conduct qualitative experiments on an existing real-world dataset [9].

2. Related Work

2.1. Sonar-Based SLAM

Sonar emits sound waves and receives echoes generated by the reflection of these waves off objects. Signal processing and analysis are then performed to extract information about the objects. Sonar is unaffected by illumination and less influenced by water quality. Hover et al. [10] demonstrated the ability to rely entirely on DIDSON sonar images for navigation and planning in ship hull detection tasks, even under poor water quality conditions. Wang et al. [11] employed single-beam sonar for mapping seawalls and obtained maps closely resembling the ground truth, even at long observation distances. These studies illustrate that sonar can maintain good performance in underwater environments, which are typically more complex and challenging than air environments.
Sonar-based SLAM can be primarily categorized into two types: filter-based SLAM and optimization-based SLAM. Filter-based SLAM includes techniques such as the extended Kalman filter and particle filter. In their work, Cheng et al. [12] integrated data from Doppler Velocity Log (DVL), Inertial Measurement Unit (IMU), and sonar to estimate the pose of Autonomous Underwater Vehicles (AUVs) and generate occupancy grid maps.
Optimization-based SLAM is currently the focal point of SLAM research. In graph-based SLAM, graph theory techniques can be employed for marginalization, elimination, and computational acceleration. Factor graph optimization represents the SLAM problem as a dynamic Bayesian network. A factor graph represents a global function of many variables as a product of local functions with smaller subsets of variables. Li et al. [13] proposed a landmark SLAM algorithm and corresponding feature-matching algorithm. They developed a system to identify and select sonar frames with maximum information content to enhance system efficiency and reliability. Fallon et al. [14] proposed a sonar-based SLAM method that initially uses scanning sonar to generate mosaic images of regions of interest, serving as prior feature maps. During the subsequent exploration process using an AUV equipped with forward-looking multibeam sonar, the method employs target reacquisition. Upon re-observing a sufficient number of features, the AUV localizes itself relative to the prior map. Ref. [9] provides an open-source SLAM method that uses DVL and IMU for dead reckoning to obtain initial poses and employs only the ICP [15] algorithm for frame-to-frame point cloud matching. We will compare our results with this work in the experimental section.
Recently, many researchers have conducted studies on sonar image analysis. For instance, reference [16] utilizes sonar images from multiple sonar perspectives to recover 3D structures and localize the sonar. Machado et al. developed a novel sonar image descriptor and added a matching check after sonar image matching to handle outlier cases in their work [17]. Zhang et al. [18] proposed a SLAM framework using side-scan sonar, which automatically detects and robustly matches keypoints between pairs of side-scan images. The detected correspondences are then used as constraints to optimize the AUV’s pose trajectory. However, the resolution of side-scan sonar is higher than that of forward-looking multibeam sonar, and in our preliminary experiments it was challenging to extract FAST corners and perform SIFT matching between forward-looking multibeam sonar images. These efforts have achieved significant success, but they have also revealed the unreliability of feature matching in sonar images. It is for this reason that we ultimately opted for pose SLAM, in which the features extracted from sonar images are not themselves optimized. Our pose SLAM utilizes the open-source library GTSAM to optimize the poses, and since GTSAM employs non-linear optimization techniques, this pose SLAM should also be categorized as optimization-based SLAM.

2.2. Optical Flow and Direct Methods

There are numerous strategies that have been devised to avoid the feature matching procedure. Hurtos et al. [19] proposed a Fourier-based registration technique, which exhibited high robustness in featureless environments. McConnell et al. [34] utilized deep learning to generate synthetic overhead imagery and register it with prior maps, highlighting that the challenge of landmark SLAM with acoustic sensors is the identification of objects as landmarks and their subsequent repeated association as they are re-observed.
The brightness constancy assumption in a sequence of continuous images posits that if a scene point remains unchanged across different images, then the pixel intensity of this point should also be the same in the different images. This assumption requires that lighting conditions remain constant. Sonar images can also be regarded as optical images where lighting conditions are invariant. Optical flow methods [20] and direct methods are both based on the brightness constancy assumption, and there are also many related studies in the field of sonar. In the literature [21], during the process of underwater terrain reconstruction, the authors used the optical flow method to track the extracted AKAZE [22] feature points. Reference [23] presents an approach to the segmentation, two-dimensional motion estimation, and subsequent tracking of multiple objects in sequences of sector scan sonar images. Optical flow is calculated for moving objects with significant sizes in the images to obtain size and directional motion estimates. The authors of [24] proposed a fish tracking method based on the application of modified local optical flow. These methods, which are all aimed at recovering the optical flow between two images, require computing the motion of a vast number of pixel points to determine the motion information of the entire scene, making it a computationally intensive task [25]. Direct SLAM directly utilizes the pixel intensity information of the entire image to estimate the camera’s motion, without the need to extract feature points or calculate descriptors. Semi-direct SLAM requires sparse feature points for preliminary pose estimation and optimizes the pixel blocks around these feature points. Semi-direct visual SLAM [26,27] typically exhibits faster speed. Direct method-based visual SLAM [28] demonstrates faster speeds and can be implemented on a regular CPU for semi-dense SLAM. In [29], a direct method acoustic odometry was designed.

3. Method

3.1. Overview

As shown in Figure 2, after receiving sonar data, the images in polar coordinates are first processed using a two-dimensional SO-CFAR algorithm to extract features. Reverberation noise is produced by the interaction of sound waves with other objects in the water body, such as the sea surface and the seafloor. When the reverberation noise is high, our focus is on avoiding false detections; hence the use of a cross-shaped window. We use the visualization software that comes with the sonar equipment to check whether the reverberation noise in the sonar image is severe. When the reverberation noise is low, our focus is on extracting the features as completely as possible; hence the use of a rectangular window. DBSCAN [30] is then applied to remove outliers, and the coordinates of the feature points are transformed into Cartesian coordinates. These transformed feature points are matched across consecutive sonar keyframes, and rough pose values are computed between frames using the ICP algorithm. The sonar images in polar coordinates are transformed into Cartesian images, which require noise reduction to avoid falling into local minima too easily. Based on the pose estimate from ICP, a more accurate pose estimate is obtained using the semi-direct method. Factor graph optimization is utilized in the back end for further refinement.

3.2. Sonar-Based SLAM Problem Formulation

Multi-beam imaging sonar is a device that surveys underwater environments using acoustic means, offering comprehensive imaging results across long distances. The sonar emits hundreds of beams of sound waves, which reflect when encountering obstacles. By analyzing the reflection time, strength, and direction of the received reflected sound waves, a sonar raw data image can be constructed by stitching these reflections together.
The raw data image from the sonar differs from the fan-shaped image in Cartesian coordinates. The horizontal axis represents the angle of the reflected sound waves on the horizontal plane, while the vertical axis represents the distance at which the sound waves reflect. The pixel values of the image indicate the intensity of the reflected sound waves.
After a series of coordinate transformations and interpolation operations, the image in polar coordinates can be converted into an image in Cartesian coordinates, as shown in Figure 3. For any point P in space, with the sonar as the reference frame, the coordinates are as follows:
$$P = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = R \begin{bmatrix} \sin\theta\cos\varphi \\ \cos\theta\cos\varphi \\ \sin\varphi \end{bmatrix} \tag{1}$$
However, in sonar images, all content within the vertical field of view of the sonar is compressed onto the two-dimensional image. Although some studies have reconstructed three-dimensional coordinates [16,31], for computational efficiency, we have decided to use a simplified sonar model:
$$P = \begin{bmatrix} X \\ Y \end{bmatrix} = R \begin{bmatrix} \sin\theta \\ \cos\theta \end{bmatrix} \tag{2}$$
We assume that the ROV or AUV moves along a horizontal plane at a constant depth, so the Z-coordinate is a fixed value that can be calculated using a pressure sensor. The SLAM problem can then be represented as a dynamic Bayesian network, and its motion equation can be expressed by Equation (3).
$$x_k = f(x_{k-1}, u_k) + w_k \tag{3}$$
$x_k$ represents the pose of the vehicle; $k$ and $k-1$ denote the current frame and the previous frame, respectively, and $u_k$ represents the motion data. $f(x_{k-1}, u_k)$ represents the state transition model, which describes how the pose at moment $k-1$ is used, through the motion data, to calculate the pose at moment $k$. $w_k$ represents noise, which can be measured through experimental statistical analysis. Therefore, the objective of the back-end optimization in pose SLAM can be viewed as adjusting the pose values to maximize the posterior probability of the Bayesian network, as shown in Equation (4).
$$x^{*} = \arg\max_{X}\; P_{prior}(x_0)\prod_{k=1}^{N_f} P_{probability}(x_k \mid x_{k-1}, u_k) \tag{4}$$
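Under the common assumption of Gaussian noise, maximizing the posterior in Equation (4) is equivalent to solving a nonlinear least-squares problem over the poses, which is what factor graph back ends such as GTSAM do. The following is a minimal illustrative sketch of that idea for 2D poses $(x, y, \theta)$ with relative-pose (odometry) factors only; the use of SciPy, the toy measurements, and the prior weighting are our own choices for the example, not the authors' GTSAM-based implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def relative_pose(xi, xj):
    """Pose of xj expressed in the frame of xi, for 2D poses (x, y, theta)."""
    dx, dy = xj[0] - xi[0], xj[1] - xi[1]
    c, s = np.cos(xi[2]), np.sin(xi[2])
    return np.array([c * dx + s * dy, -s * dx + c * dy, wrap(xj[2] - xi[2])])

def residuals(flat_poses, odometry, info_sqrt):
    """Stack whitened residuals r_k = sqrt(Info) * (h(x_{k-1}, x_k) - z_k)."""
    poses = flat_poses.reshape(-1, 3)
    res = [info_sqrt @ (relative_pose(poses[k - 1], poses[k]) - z)
           for k, z in enumerate(odometry, start=1)]
    # Prior on the first pose, anchoring the trajectory (the P_prior term in Equation (4)).
    res.append(10.0 * poses[0])
    return np.concatenate(res)

# Toy example: three relative-pose measurements (e.g., from ICP or dead reckoning).
odometry = [np.array([1.0, 0.0, 0.1]),
            np.array([1.0, 0.1, 0.0]),
            np.array([0.9, 0.0, -0.1])]
info_sqrt = np.diag([1.0, 1.0, 2.0])          # square root of the information matrix
x0 = np.zeros((len(odometry) + 1) * 3)        # initial guess: all poses at the origin
sol = least_squares(residuals, x0, args=(odometry, info_sqrt))
print(sol.x.reshape(-1, 3))                   # optimized trajectory, one pose per row
```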

3.3. Smallest of Constant False Alarm Rate

The quality of sonar images is primarily affected by environmental noise and reverberation. Noise includes self-noise and waterborne noise. Self-noise consists of the noise generated by the sonar itself and the noise from the AUV carrying the sonar. Waterborne noise includes noise from other navigation activities within the water area, as well as noise from aquatic organisms and other sources. Noises can easily lead to false identifications of sound wave reflections. To process sonar images, it is necessary to distinguish between false identifications and actual targets.
Acosta et al. were inspired by the constant false alarm rate (CFAR) algorithm commonly used in radar signal processing [32], and they considered the interference power in radar to be analogous to the acoustic reverberation power in sonar [33]. They applied the CA-CFAR algorithm to the side-scan sonar images and achieved good results. In [34], one-dimensional CA-CFAR detection was conducted along the range dimension for each sonar beam. In [31,35], one-dimensional SO-CFAR detection was performed along the range dimension for each sonar beam.
These approaches have shown a degree of effectiveness. While these approaches utilize CFAR algorithms in a two-dimensional plane, they neglect the correlation of data in the bearing dimension. Consequently, obstacles parallel to the sonar direction are occasionally misclassified as noise. As shown in Figure 4a, suppose there are two types of obstacles in the horizontal and vertical directions, forming a cross shape. The obstacles are represented by gray squares (some of the gray squares in the horizontal direction are obscured by the detection window). A one-dimensional CFAR detection window moves along the bearing dimension, processing one row of data at a time before moving to the next row. This situation will be demonstrated in the experiment section.
CA-CFAR is suitable for situations with homogeneous clutter, but the constant false alarm rate increases due to clutter edges, leading to a decrease in detection performance in multi-target environments. SO-CFAR maintains good target detection performance, even in multi-target environments. Therefore, we choose to use SO-CFAR in this work, as shown in Figure 4b.
In radar target detection, two-dimensional CFAR detection simultaneously considers the correlation of data in both the range and Doppler dimensions. Similarly, in the sonar domain, two-dimensional CFAR detection takes into account the correlation of data in both the range and bearing dimensions. The window of SO-CFAR includes the test cells (represented in red), reference cells (represented in blue), and guard cells (represented in orange). $G$ is the number of guard cells in one direction, the total number of reference cells is $4N$ (each direction, including top, bottom, left, and right, contains $N$ reference cells), and the desired false alarm rate is $P_{fa}$. The guard cells are employed to eliminate excess energy surrounding the test cells. The noise power at this location is estimated based on the intensity in the reference cells from all four directions, as shown in Equation (5). The threshold factor $\alpha$ is calculated according to Equation (6).
$$\beta^{2} = \min \frac{1}{N}\left\{\sum_{i=1}^{N} x_{left,i},\ \sum_{i=1}^{N} x_{right,i},\ \sum_{i=1}^{N} x_{top,i},\ \sum_{i=1}^{N} x_{bottom,i}\right\} \tag{5}$$
$$\alpha = N\left(P_{fa}^{-\frac{1}{N}} - 1\right) \tag{6}$$
The threshold is determined by Equation (7), and when the pixel value of the test cell exceeds the threshold, it is considered to contain a target.
$$\tau = \alpha\,\beta^{2} \tag{7}$$
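As a purely illustrative numerical example: with $N = 20$ reference cells per arm and a desired false alarm rate $P_{fa} = 10^{-2}$, Equation (6) gives $\alpha = 20\,(0.01^{-1/20} - 1) \approx 20 \times 0.259 \approx 5.18$; if the smallest of the four averaged reference sums in Equation (5) is $\beta^{2} = 12$ (in pixel-intensity units), the threshold becomes $\tau = \alpha\beta^{2} \approx 62$, and the test cell is declared a target only if its value exceeds this.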
The two-dimensional CFAR iterates over each cell in the entire image and performs accumulation within the corresponding window for each cell. When the number of reference cells is large, this process requires a long processing time. In the work proposed by Acosta et al. [33], ACA-CFAR was introduced to accelerate the processing of side-scan sonar images in Cartesian coordinates compared to CA-CFAR. The rectangular side-scan sonar images differ from the images generated by forward-looking multibeam sonar in Cartesian coordinates. The presence of sector edges in the latter makes the analysis of noise power using rectangular reference cell blocks significantly more complex. Therefore, following the approach in [34], we directly process the raw data in the polar coordinate domain. Zhao investigated radar signal processing workflows and designed a parallel signal processing solution based on GPU architecture, which achieved promising results [36]. Our method can run relatively fast on a regular CPU. To accelerate computation, we first use a two-dimensional matrix prefix sum to obtain the sum of pixel values from the top-left corner to any point in the sonar image, denoted as the sum matrix, where $sum[row][col]$ represents the sum of pixel values from $(0, 0)$ to $(row-1, col-1)$. Then, according to Equations (8)–(11), we calculate the noise power for the four parts: top, bottom, left, and right.
$$\beta_{top}^{2} = \mathrm{sum}[row-G][col+G+1] - \mathrm{sum}[row-G][col-G] - \mathrm{sum}[row-G-N][col+G+1] + \mathrm{sum}[row-G-N][col-G] \tag{8}$$
$$\beta_{bottom}^{2} = \mathrm{sum}[row+G+N+1][col+G+1] - \mathrm{sum}[row+G+1][col+G+1] - \mathrm{sum}[row+G+N+1][col-G] + \mathrm{sum}[row+G+1][col-G] \tag{9}$$
$$\beta_{left}^{2} = \mathrm{sum}[row+G+1][col-G] - \mathrm{sum}[row+G+1][col-G-N] - \mathrm{sum}[row-G][col-G] + \mathrm{sum}[row-G][col-G-N] \tag{10}$$
$$\beta_{right}^{2} = \mathrm{sum}[row+G+1][col+G+N+1] - \mathrm{sum}[row+G+1][col+G+1] - \mathrm{sum}[row-G][col+G+N+1] + \mathrm{sum}[row-G][col+G+1] \tag{11}$$
This applies to an image with dimensions of m by n, where the guard region has a length of G in one direction and the reference region has a length of N in one direction. Without the prefix-sum acceleration, the detection requires $4mn(2G+1)N$ addition operations, but with the acceleration method described above it only requires $5mn$ addition operations. Since $G \ge 1$ and $N \ge 1$, it always holds that $4mn(2G+1)N > 5mn$, and the larger N and G are, the more significant the acceleration effect becomes. The feature extraction and acceleration effects will be discussed in the experimental section. After using SO-CFAR to extract features, many isolated points remain, which may originate from environmental noise and clutter. Therefore, we use the DBSCAN algorithm to remove the influence of noise. DBSCAN is a density-based spatial clustering method; its advantages include the ability to discover clusters of arbitrary shapes in spatial data, and it does not require prior knowledge of the number of clusters.
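To make the acceleration concrete, the following is a minimal NumPy sketch of two-dimensional SO-CFAR detection using a prefix-sum (integral) matrix, in the spirit of Equations (8)–(11). The window sizes, false alarm rate, per-arm normalization, and the boundary handling (cells too close to the image edge are skipped) are illustrative simplifications of our own, not the authors' implementation.

```python
import numpy as np

def so_cfar_2d(img, G=2, N=8, pfa=1e-3):
    """2D SO-CFAR on a polar sonar image using a prefix-sum (integral) matrix.

    G: guard cells per side, N: reference cells per side, pfa: false alarm rate.
    Returns a boolean detection mask of the same shape as img.
    """
    img = img.astype(np.float64)
    rows, cols = img.shape
    # s[r][c] = sum of img[0:r, 0:c]; one extra row/column of zeros for padding.
    s = np.zeros((rows + 1, cols + 1))
    s[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)

    def block(r1, r2, c1, c2):
        """Sum of img[r1:r2, c1:c2] in O(1) using the prefix-sum matrix."""
        return s[r2, c2] - s[r1, c2] - s[r2, c1] + s[r1, c1]

    alpha = N * (pfa ** (-1.0 / N) - 1.0)          # threshold factor, Equation (6)
    det = np.zeros_like(img, dtype=bool)
    m = G + N                                      # margin where the window does not fit
    for r in range(m, rows - m):
        for c in range(m, cols - m):
            top    = block(r - G - N,     r - G,         c - G,         c + G + 1)
            bottom = block(r + G + 1,     r + G + N + 1, c - G,         c + G + 1)
            left   = block(r - G,         r + G + 1,     c - G - N,     c - G)
            right  = block(r - G,         r + G + 1,     c + G + 1,     c + G + N + 1)
            # Smallest-of: take the smallest arm, averaged over its (2G+1)*N cells.
            beta2 = min(top, bottom, left, right) / ((2 * G + 1) * N)
            det[r, c] = img[r, c] > alpha * beta2
    return det

# Illustrative usage on random data with a bright synthetic target.
rng = np.random.default_rng(0)
img = rng.exponential(1.0, size=(200, 256))
img[100:104, 120:160] += 20.0
mask = so_cfar_2d(img)
print(mask.sum(), "cells flagged as targets")
```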

3.4. Semi-Direct Method Applicable to the Field of Sonar

Unlike the direct method, which calculates based on dense pixels, the semi-direct method first extracts sparse feature points. After obtaining a rough pose using algorithms such as NDT and ICP, it further optimizes the photometric error to achieve a more accurate pose.
The error function for semi-direct methods is photometric error, as previously explained. To determine the pixel coordinates of a point in a Cartesian coordinate sonar image, we first need to define a projection model. Although the principles of sonar and camera imaging differ, the objective remains the same: to project a point in space onto a pixel on the imaging plane. Hence, we can still establish mapping between pixel coordinates and spatial coordinates.
The original sonar image (sonar image in polar coordinates) has H pixels in the vertical direction. Given a sonar range distance D, measured in meters, dividing D by H yields r. If the horizontal field of view is θ , then its horizontal width W in Cartesian coordinates can be calculated using Equation (13), measured in pixels.
$$r = \frac{D}{H} \tag{12}$$
$$W = 2H\sin\!\left(\frac{\theta}{2}\right) \tag{13}$$
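As a concrete illustration using the sonar settings from Section 4.1 (a 130° horizontal field of view and a 30 m range), and assuming for the sake of the example a polar image with $H = 1000$ range bins: each pixel then covers $r = D/H = 30/1000 = 0.03$ m, and the Cartesian fan image is $W = 2 \times 1000 \times \sin(65^{\circ}) \approx 1813$ pixels wide.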
As shown in Figure 5, we assume the imaging coordinate system has its origin at the top-left corner. The coordinates of a point $P_{uv}$ in the imaging coordinate system are represented as $P_{uv} = [u, v]$, where $u$ represents the horizontal coordinate and $v$ represents the vertical coordinate. The coordinates of a point in the sonar coordinate system are represented as $q = [x, y]$. Their relationship can be expressed using a matrix, as shown in Equation (14).
$$P_{uv} = \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \frac{1}{r} & 0 & \frac{H}{2} \\ 0 & \frac{1}{r} & W \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{14}$$
The matrix composed of intermediate variables for the transformation between pixel coordinates and sonar coordinates in Equation (14) remains fixed and does not change during the operation of the sonar. In Equation (14), we use coordinates in the sonar coordinate system. However, the sonar moves with the AUV, so the coordinates of a point in the sonar coordinate system are obtained by transforming the coordinates of that point in the world coordinate system, $P_w$, into the sonar coordinate system through the current sonar pose. The pose of the sonar is described by its rotation matrix and translation vector, denoted by the corresponding Lie group element $T$. Their relationship is illustrated by Equation (15). $C$ represents the transformation matrix between the sonar coordinate system and the pixel coordinate system in Equation (14), which can be likened to the intrinsic matrix in camera imaging.
$$P_{uv} = C\,T\,P_{w} \tag{15}$$
Consider a point $P$ in space and sonar images captured at different times, and let the world coordinates of point $P$ be denoted as $[X, Y]$. The pixel coordinates of point $P$ in the sonar images at times $t_1$ and $t_2$ are denoted as $P_{uv1}$ and $P_{uv2}$, respectively. The pose at time $t_1$ is transformed into the pose at time $t_2$ via rotation $R$ and translation $t$, represented by the Lie group element $T$.
$$P_{uv1} = C P \tag{16}$$
$$P_{uv2} = C T P \tag{17}$$
If there is a discrepancy between $T$ and the ground truth, the pixels near $P_{uv2}$ and $P_{uv1}$ will also differ. Following the assumption of photometric invariance, we optimize the pose transformation to minimize the photometric error, as defined in Equation (18). $I_1$ and $I_2$ represent the pixel values of a point in image 1 and image 2, respectively.
$$e = I_1(P_{uv1}) - I_2(P_{uv2}) \tag{18}$$
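As a minimal illustration of Equation (18), the sketch below evaluates the total photometric error of a candidate pose by projecting feature points from the previous frame into the current Cartesian image and comparing the sampled intensities. The projection convention used here (sonar at the bottom centre of the image) and the bilinear sampling are assumptions made for this example, not necessarily the exact model of Equation (14).

```python
import numpy as np

def project(points_xy, r, W, H):
    """Project sonar-frame points (x lateral, y forward, in metres) to pixel
    coordinates, assuming the sonar sits at the bottom centre of a W x H
    Cartesian image with resolution r metres per pixel (illustrative convention)."""
    u = points_xy[:, 0] / r + W / 2.0
    v = H - points_xy[:, 1] / r
    return np.stack([u, v], axis=1)

def bilinear(img, uv):
    """Sample img at sub-pixel locations uv = (u, v) with bilinear interpolation."""
    u, v = uv[:, 0], uv[:, 1]
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = u - u0, v - v0
    H, W = img.shape
    u0 = np.clip(u0, 0, W - 2); v0 = np.clip(v0, 0, H - 2)
    return ((1 - du) * (1 - dv) * img[v0, u0] + du * (1 - dv) * img[v0, u0 + 1]
            + (1 - du) * dv * img[v0 + 1, u0] + du * dv * img[v0 + 1, u0 + 1])

def photometric_error(img1, img2, points_xy, T, r):
    """Sum of squared photometric errors e = I1(p) - I2(T(p)) over feature points.
    T is a 3x3 homogeneous SE(2) transform from frame 1 to frame 2."""
    H, W = img1.shape
    p1 = project(points_xy, r, W, H)
    hom = np.hstack([points_xy, np.ones((len(points_xy), 1))])
    p2 = project((hom @ T.T)[:, :2], r, W, H)
    e = bilinear(img1, p1) - bilinear(img2, p2)
    return float(np.sum(e ** 2))

# Toy check: identical images and an identity transform give zero error.
img = np.random.default_rng(1).random((400, 600))
pts = np.array([[1.0, 5.0], [-2.0, 8.0], [0.5, 3.0]])
print(photometric_error(img, img, pts, np.eye(3), r=0.03))
```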
The objective function for optimization is the Euclidean norm of the error. To minimize the objective function, we need to find the derivative relationship between the error $e$ and the pose transformation. However, because the Lie group is not closed under addition, $T + \Delta T$ is not an element of SE(2), and directly differentiating the Lie group using the definition of the derivative does not yield correct results. Therefore, it is necessary to differentiate with respect to the Lie algebra corresponding to the Lie group. The Lie algebra consists of the translational vector $\rho$ and the rotational angle $\theta$, denoted together as $\xi$, as shown in Equation (19). From this point on, the coordinates of a point in space are expressed in the Cartesian coordinate system, so $\theta$ no longer represents the horizontal angle of a point in the XY plane with respect to the positive y-axis; instead, we use $\theta$ to represent the rotation in the robot’s pose. Since the movement of the ROV is limited to a 2D plane, $\theta$ is one-dimensional.
$$\xi = \begin{bmatrix} \rho \\ \theta \end{bmatrix} \tag{19}$$
The Jacobian matrix of the error with respect to the Lie algebra is denoted as $J$ and is composed of three parts multiplied together. Here, $I_2$ is as described earlier, $P_{uv}$ represents the pixel coordinates, and $q$ represents the coordinates of the point in the sonar coordinate system at time $t_2$. The Jacobian matrix is shown in Equation (20). After obtaining the Jacobian matrix, the error is iteratively minimized using the Levenberg–Marquardt method.
$$J = \frac{\partial I_2}{\partial P_{uv}}\,\frac{\partial P_{uv}}{\partial q}\,\frac{\partial q}{\partial \xi} \tag{20}$$
The Jacobian matrix consists of three parts. The first part is the image gradient. The second part, which relates $P_{uv}$ and $q$, follows directly from Equation (14); therefore, the second part is presented in Equation (21).
$$\frac{\partial P_{uv}}{\partial q} = \begin{bmatrix} \frac{1}{r} & 0 \\ 0 & \frac{1}{r} \end{bmatrix} \tag{21}$$
In the Lie group SE(3), $\xi^{\wedge}$ denotes the mapping of a six-dimensional vector into a $4 \times 4$ matrix, which contains a skew-symmetric sub-matrix. In the Lie group SE(2), $\rho$ and $\theta$ are two-dimensional and one-dimensional vectors, respectively, so the SE(3) equations are not directly applicable. The Lie algebra $\xi$ corresponds to the Lie group element $T$, and the space point $P$ is transformed into $q = [x, y]$ after the transformation $T$. Giving $T$ a left perturbation $\Delta T = \exp(\delta\xi^{\wedge})$, where the perturbation corresponds to the Lie algebra $\delta\xi = [\delta\rho, \delta\theta]$, we can express this as shown in Equation (22).
$$\frac{\partial q}{\partial \xi} = \lim_{\delta\xi \to 0}\frac{\exp(\delta\xi^{\wedge})\exp(\xi^{\wedge})p - \exp(\xi^{\wedge})p}{\delta\xi} = \lim_{\delta\xi \to 0}\frac{\delta\xi^{\wedge}\exp(\xi^{\wedge})p}{\delta\xi} = \lim_{\delta\xi \to 0}\frac{\begin{bmatrix} \delta\theta^{\wedge} & \delta\rho \\ 0 & 0 \end{bmatrix}\begin{bmatrix} q \\ 1 \end{bmatrix}}{\delta\xi} = \lim_{\delta\xi \to 0}\frac{\begin{bmatrix} \delta\theta^{\wedge} q + \delta\rho \\ 0 \end{bmatrix}}{\delta\xi} \tag{22}$$
The cross product of the three-dimensional vectors a and b yields the result shown in Equation (23).
$$a \times b = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix} b \tag{23}$$
The cross product symbol and the ^ symbol are equivalent. In the cross product operation, when the positions of the vectors are interchanged, the result needs to be negated. Therefore, the same rule applies when the ^ symbol is interchanged; the result also needs to be negated. Thus, Equation (24) can be expressed as follows:
$$\theta^{\wedge} q = -q^{\wedge}\theta \tag{24}$$
The vector $(\rho_x, \rho_y, \theta)$ represents rotation and translation in a two-dimensional plane, but the two-dimensional plane is also part of three-dimensional space. This means that the translation vector $(\rho_x, \rho_y)^{\top}$ can be considered as $(\rho_x, \rho_y, 0)$ in three-dimensional space, and the rotation $\theta$ can be considered as the rotation vector $(0, 0, \theta)$ in three-dimensional space (where $\theta$ is the rotation around the Z-axis). The vector $q = (x, y)$ is likewise considered as $q = (x, y, 0)$. As shown in Equation (23), Equation (24) can be written as follows:
$$\begin{bmatrix} 0 & -\theta & 0 \\ \theta & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & -y \\ 0 & 0 & x \\ y & -x & 0 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ \theta \end{bmatrix} \tag{25}$$
We continue to evaluate Equation (22), but we must pay attention to the changes in dimensions. The rows of the Jacobian matrix correspond to the dependent variables, and its columns correspond to the independent variables. The Jacobian matrix we need has dimensions $2 \times 3$, but our dependent variables have changed from $(x, y)$ to $(x, y, 0)$, and our independent variables have changed from $(\rho_x, \rho_y, \theta)$ to $(\rho_x, \rho_y, 0, 0, 0, \theta)$. Therefore, the calculated Jacobian matrix needs to have the irrelevant third row and the third, fourth, and fifth columns removed to obtain the correct Jacobian matrix. We use $q^{+}$, $\xi^{+}$, $\rho^{+}$, and $\theta^{+}$ to represent the vectors after dimension augmentation. The calculation process for Equation (22) after increasing the dimensions is as follows:
$$\frac{\partial q^{+}}{\partial \xi^{+}} = \lim_{\delta\xi^{+} \to 0}\frac{\begin{bmatrix} \delta(\theta^{+})^{\wedge} q^{+} + \delta\rho^{+} \\ 0 \end{bmatrix}}{\delta\xi^{+}} = \lim_{\delta\xi^{+} \to 0}\frac{\begin{bmatrix} \delta\rho^{+} - (q^{+})^{\wedge}\delta\theta^{+} \\ 0 \end{bmatrix}}{\delta\xi^{+}} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & -y \\ 0 & 1 & 0 & 0 & 0 & x \\ 0 & 0 & 1 & y & -x & 0 \end{bmatrix} \tag{26}$$
After removing the rows and columns introduced for dimensional supplementation, the correctly dimensioned Jacobian matrix is shown as follows:
$$\frac{\partial q}{\partial \xi} = \begin{bmatrix} 1 & 0 & -y \\ 0 & 1 & x \end{bmatrix} \tag{27}$$
Multiplying Equations (21) and (27), we obtain $\frac{\partial P_{uv}}{\partial \xi}$, as illustrated in Equation (28).
$$\frac{\partial P_{uv}}{\partial \xi} = \begin{bmatrix} \frac{1}{r} & 0 & -\frac{y}{r} \\ 0 & \frac{1}{r} & \frac{x}{r} \end{bmatrix} \tag{28}$$
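To show how the per-point Jacobian of Equation (28) is combined with the image gradient of Equation (20), the sketch below accumulates the normal equations for a single Gauss–Newton step of the photometric optimization. The authors use Levenberg–Marquardt; the plain Gauss–Newton update, the finite-difference image gradient, and the nearest-pixel sampling here are simplifications made for illustration.

```python
import numpy as np

def image_gradient(img, u, v):
    """Central-difference gradient [dI/du, dI/dv] at the nearest integer pixel."""
    ui, vi = int(round(u)), int(round(v))
    gu = 0.5 * (img[vi, ui + 1] - img[vi, ui - 1])
    gv = 0.5 * (img[vi + 1, ui] - img[vi - 1, ui])
    return np.array([gu, gv])

def gauss_newton_step(img1, img2, points_q, pix1, pix2, r):
    """One Gauss-Newton update d_xi = (sum J^T J)^-1 (sum -J^T e) for the
    photometric error, with J built from Equations (20) and (28).
    points_q: feature points (x, y) in the sonar frame at time t2;
    pix1/pix2: their pixel coordinates in the two images."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for (x, y), (u1, v1), (u2, v2) in zip(points_q, pix1, pix2):
        e = img1[int(v1), int(u1)] - img2[int(v2), int(u2)]   # Equation (18)
        dI = image_gradient(img2, u2, v2)                      # 1x2 image gradient
        dPuv_dxi = np.array([[1 / r, 0.0, -y / r],             # Equation (28)
                             [0.0, 1 / r,  x / r]])
        J = -(dI @ dPuv_dxi)                                   # 1x3 row, de/dxi
        A += np.outer(J, J)
        b += -J * e
    return np.linalg.solve(A, b)   # increment [d_rho_x, d_rho_y, d_theta]
```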
Similarly, based on the assumption of photometric invariance, the Lucas–Kanade optical flow method [20] has the advantage of eliminating the need for feature matching, significantly reducing the time required. However, it relies solely on image gradients for optimization, overlooking the intrinsic structure of the sonar and failing to consider the rotation of the sonar.
Direct methods share similar advantages to optical flow methods. As indicated in Equation (28), this model accounts for the rotation of the sonar itself, thus overcoming the limitations of the optical flow method. However, direct methods are still affected by the non-convexity of images, making them prone to falling into local minima. To achieve more accurate results, a good initial estimate is required. Therefore, we initially use the ICP algorithm to calculate an initial estimate, then we further optimize the pose using direct methods.

3.5. Image Processing

Unlike feature-based methods, direct methods rely solely on optimization to determine the pose. As in the first part of Equation (20), $\frac{\partial I}{\partial P_{uv}}$ represents the pixel gradient, which guides the direction of the optimization. To obtain correct results, it is necessary to ensure that the majority of the pixel gradients can guide the optimization in the correct direction.
Correct optimization requires the photometric error to decrease continuously as the estimate moves along the image gradient. However, because the image is strongly non-convex, it is easy to fall into local minima when moving along the image gradient. As the ROV moves, the sonar continuously captures images, as shown in Figure 6. The collected raw sonar images are in polar coordinates, with the horizontal axis representing the bearing dimension and the vertical axis representing the range dimension. Since images in polar coordinates are not very intuitive, they are mapped to Cartesian coordinates (as shown in Figure 6), which requires interpolation and coordinate transformation; in other words, new points are added to the raw data, introducing content that is not part of the original measurements. Under the combined influence of reverberation and environmental noise, the images in Cartesian coordinates are filled with interference. We enlarged a section of a sonar image corresponding to a region with no reflections, as shown in Figure 6. The pixel values along a specific horizontal line, marked with a red box, change at an extremely high frequency, which makes it very easy for the objective function to fall into local minima; the corresponding image gradients are also displayed in Figure 6. To properly guide the reduction of the error, the local minima along the pixel-value curve must not occur too densely. Therefore, we applied a bilateral filter to the image.
As shown in Figure 7a, the pixel values along a certain line change at an extremely high frequency.
Sonar images are characterized by high noise and blurriness. While mean filtering can reduce noise, it also makes the image blurrier, which is clearly not conducive to feature extraction. To preserve edge information, we have chosen bilateral filtering [37]. Bilateral filtering takes both spatial and color information into account, allowing it to reduce noise while preserving edges. When computing the new value for a pixel, bilateral filtering considers not only the distance but also the color differences.
We set the bilateral filter parameters to a distance threshold of 10, a color variance of 50, and a spatial variance of 15. As shown in Figure 7b, following bilateral filtering, the frequency of pixel value changes along the line direction is noticeably reduced, and the distance between adjacent local minima is increased. This means that during the optimization process guided by image gradients, it is less likely to fall into local minima.
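For reference, an equivalent filtering step with OpenCV might look like the snippet below; mapping the parameters stated above onto OpenCV's d, sigmaColor, and sigmaSpace arguments is our interpretation, and the synthetic input image is only a stand-in for a real Cartesian sonar frame.

```python
import cv2
import numpy as np

# Synthetic stand-in for an 8-bit, single-channel Cartesian sonar image.
cartesian_img = (np.random.default_rng(0).random((400, 600)) * 255).astype(np.uint8)

# Bilateral filtering with the parameters stated above (our mapping of the settings).
filtered = cv2.bilateralFilter(cartesian_img, d=10, sigmaColor=50, sigmaSpace=15)
```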

4. Experiments and Results

4.1. Hardware Overview

To conduct experiments in real-world settings, we utilized a custom ROV. This vehicle incorporates an integrated control module, communication module, and variable point module, and it is operated using an industrial-grade control joystick to manage the robot’s basic movements.
For environmental perception, we utilize the Oculus M750d wide-aperture multibeam imaging sonar. This device is designed for high-resolution underwater sonar applications and was sourced from Oculus Marine, a leading manufacturer in marine technology located in Toronto, ON, Canada. This sonar has a vertical aperture of 20 degrees and a horizontal field of view of 130 degrees. We operate the sonar within a 30 m range, and the experimental site is located at the Qingdao International Yacht Expo Center in Qingdao, Shandong Province.
During our data collection, only sonar images were captured, without DVL and IMU data. Therefore, our incomplete dataset can only be used to assess the effectiveness of the semi-direct method optimization we designed. The performance of the complete SLAM system is tested on a public dataset.

4.2. Feature Extraction

As previously mentioned, SO-CFAR detection performed in a single dimension may lead to misclassification of obstacles oriented in certain directions, as illustrated in Figure 8b. In contrast, two-dimensional CFAR detection achieves better results. We refer to the single-dimensional method as 1D-CFAR [11]. The performance of methods like GO-CFAR is also not satisfactory.
The top image in Figure 8a was collected by sonar near the sea floor, which led to a significant amount of reverberation noise. The bottom image in Figure 8a was collected at a moderate depth, resulting in a cleaner sonar image. For images with a large amount of reverberation noise, using a rectangular two-dimensional window in SO-CFAR can easily cause false detections; therefore, one-dimensional windows along the two dimensions (the cross-shaped window described in Section 3.1) are used, and their results are combined. For cleaner sonar images, scanning with a rectangular two-dimensional window in SO-CFAR allows for a more thorough extraction of features.
As illustrated in Figure 8d, two-dimensional CFAR detection provides better detection performance for features in certain directions compared to one-dimensional SO-CFAR. Unlike one-dimensional CFAR detection performed along a single dimension, this paper utilizes matrix prefix sums to accelerate two-dimensional SO-CFAR detection. We tested the time taken to extract features across 45 keyframes and observed a relatively modest increase in time expenditure. Although the detection window transitioned from one dimension to two dimensions, resulting in a quadratic increase in the number of cells within the window, the detection time increased by at most 2.2 times, rather than by a quadratic factor. This indicates a significantly restrained growth in processing time compared to the expansion in window dimensions. The mean time costs over the 45 frames are 0.052661 s for the one-dimensional detector and 0.091919 s for the two-dimensional detector; thus, even though the number of reference cells is squared compared to the original quantity, the average increase in time cost is less than 75%.

4.3. SLAM Performance

Before evaluating the performance of this SLAM method in large scenes, we first assess the performance of the front end improved by the semi-direct method. Since it is difficult to obtain the ground truth of the underwater structures at the yacht center, we have designed an experimental method that quantifies the optimization effect using the ratio of pixel sums.
The pose calculated by ICP is denoted as $T_2$, and the pose further optimized by minimizing photometric errors is denoted as $T_1$. We transform the point cloud of the previous frame using $T_1$ and $T_2$ and calculate the pixel coordinates of the point cloud in the next frame.
In a Cartesian coordinate system, the grayscale value of a pixel in the sonar image represents the intensity of the echo. In a sonar image, areas with higher pixel values are more likely to represent obstacles in the scene, and the extracted point cloud features correspond to these areas. To validate the performance improvements brought by this method, we first calculate the sum $Sum_1$ of the pixel values at the pixel coordinates obtained by projecting, according to the imaging model, the point cloud transformed with pose $T_1$. Subsequently, we calculate the corresponding sum $Sum_2$ for the point cloud transformed with pose $T_2$. It is easy to see that the more accurate the pose, the more precisely the point cloud aligns with the image, and the higher the pixel values at the projected pixel coordinates, as shown in Figure 9c. Consequently, the sum of pixel values over all points will be higher. We use the ratio of these pixel sums as a metric to evaluate the performance improvement of the algorithm, as shown in Equation (29).
$$\mathrm{Ratio}_{sum} = \frac{\mathrm{Sum}_1}{\mathrm{Sum}_2} \tag{29}$$
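A minimal sketch of this metric is given below: the previous frame's point cloud is projected into the current image under each of the two poses and the summed pixel intensities are compared. The projection helper follows the same illustrative bottom-centre convention as the sketch in Section 3.4 and is an assumption of the example, not the authors' exact imaging model.

```python
import numpy as np

def pixel_sum(img, points_xy, T, r):
    """Sum of image intensities at the pixels where the transformed points project.
    T is a 3x3 homogeneous SE(2) pose; a bottom-centre image convention is assumed."""
    Himg, Wimg = img.shape
    hom = np.hstack([points_xy, np.ones((len(points_xy), 1))])
    xy = (hom @ T.T)[:, :2]
    u = np.clip((xy[:, 0] / r + Wimg / 2).astype(int), 0, Wimg - 1)
    v = np.clip((Himg - xy[:, 1] / r).astype(int), 0, Himg - 1)
    return float(img[v, u].sum())

def ratio_sum(img, points_xy, T1, T2, r):
    """Equation (29): ratio of pixel sums under the refined pose T1 and the ICP pose T2."""
    return pixel_sum(img, points_xy, T1, r) / pixel_sum(img, points_xy, T2, r)
```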
Experiments demonstrate that using the semi-direct method in the front end can effectively refine the pose obtained by the ICP algorithm, as shown in Table 1. Table 1 shows that, whether one-dimensional or two-dimensional SO-CFAR is used for feature extraction, the mean sum of pixel values is higher when using the semi-direct method than when using the ICP-only algorithm. However, the two approaches are not mutually exclusive but complementary: we still rely on ICP to provide a good initial estimate for the further optimization.
To assess the performance of our method in real-world settings, it is necessary to compare the maps constructed by the SLAM system with ground truth values. SLAM experiments conducted in real-world environments differ from those conducted in controlled pool settings, where accurate dimensions are known or easily measurable. This allows for the creation of CAD models of objects, facilitating the comparison between the generated point clouds and the CAD models. Such comparisons allow for the calculation of metrics such as mean absolute error and root mean square error, which are essential for evaluating the accuracy of the SLAM process. In real-world underwater environments, it is difficult to obtain accurate reference data, especially in dynamic settings such as a yacht center, where conditions frequently change (for example, the movement of the yachts). Therefore, in this experiment, only qualitative assessments were conducted.
In real-world environments, to test the performance of SLAM, we compared our method against the method proposed in reference [9] and utilized the dataset provided in that work. This dataset includes data collected by the Rowe SeaPilot DVL, Vectornav 100 MEMS IMU, and Oculus M750d imaging sonar. In this dataset, sonar images are collected at a frequency of 5 Hz, with a total duration of approximately 15 min. The DVL message provides the velocity in three directions, and by multiplying the average velocity by the elapsed time we can estimate the translation of the ROV; the IMU message contains the rotation information of the ROV. Combining these two pieces of information allows us to estimate the pose of the robot in the world coordinate system, and this estimate serves as the initial value for the ICP algorithm. For the generation of keyframes, we adopted the same criteria as the comparative work: sonar keyframes are generated every 3 m of translation or 30° of rotation. Due to the absence of loop closure corrections along the robot’s trajectory, each imprecise frame-to-frame match introduces new errors, and the accumulation of these errors leads to trajectory drift. As shown in Figure 10a, the growing drift led to erroneous situations where the trajectory crossed through previously generated maps. Optimizing the trajectory by minimizing photometric error after the initial pose matching, as depicted in Figure 10b, significantly reduced this problem. In a frame-to-frame matching process with a maximum of 600 feature points involved in the further optimization, the time cost only increases by 0.01 s.
In conclusion, quantitative assessments reveal that the poses obtained using the semi-direct method are more precise, and qualitative tests confirm that, in the absence of loop closure, the poses optimized using the semi-direct method exhibit less drift.

5. Conclusions

In this work, we propose a new method to mitigate the drift in underwater SLAM solutions based on sonar, which includes accelerated feature extraction and pose optimization. Our experiments demonstrate that our method provides significant added value in terms of pose estimation accuracy. Although there are limitations due to the ROV needing to move in a horizontal plane, we believe this still covers many AUV/ROV applications, such as hull inspections, port security, and operations near offshore structures.
Our future work will initially involve creating more complex and more extensive ground-truth-based datasets. Secondly, while this paper used bilateral filtering to de-noise images in Cartesian coordinates, the advancement of deep learning has introduced methods that often surpass traditional algorithms. Works such as those by Rixon Fuchs et al. [38], who introduced deep learning-based enhancement of sonar images, and Lu et al. [39], who presented methods that reduce speckle noise while preserving the geometric structure of targets, align with our needs.

Author Contributions

Conceptualization, X.H. and J.D.; methodology, X.H. and S.Z.; software, X.H. and J.S.; validation, S.Z., J.D., and H.Y.; formal analysis, X.H. and J.S.; investigation, X.H. and J.S.; resources, S.Z.; data curation, H.Y.; writing—original draft preparation, X.H.; writing—review and editing, X.H. and J.S.; visualization, X.H. and J.S.; supervision, S.Z., H.Y., and J.D.; project administration, J.D.; funding acquisition, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shandong Province of China (ZR2023MF012), the Sanya Science and Technology Special Fund (2022KJCX92), the Hainan Provincial Joint Project of Sanya Yazhou Bay Science and Technology City (2021JJLH0061), and the Natural Science Foundation of China (41927805, 41906177).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors will supply the relevant data in response to reasonable requests.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hafiza, W.; Shukor, M.M.; Jasman, F.; Mutalip, Z.A.; Abdullah, M.S.; Idrus, S.M. Advancement of Underwater Surveying and Scanning Techniques: A Review. J. Adv. Res. Appl. Sci. Eng. Technol. 2024, 41, 256–281. [Google Scholar] [CrossRef]
  2. Sun, K.; Cui, W.; Chen, C. Review of underwater sensing technologies and applications. Sensors 2021, 21, 7849. [Google Scholar] [CrossRef] [PubMed]
  3. Paull, L.; Saeedi, S.; Seto, M.; Li, H. AUV navigation and localization: A review. IEEE J. Ocean. Eng. 2013, 39, 131–149. [Google Scholar] [CrossRef]
  4. Amer, K.O.; Elbouz, M.; Alfalou, A.; Brosseau, C.; Hajjami, J. Enhancing underwater optical imaging by using a low-pass polarization filter. Opt. Express 2019, 27, 621–643. [Google Scholar] [CrossRef] [PubMed]
  5. Filisetti, A.; Marouchos, A.; Martini, A.; Martin, T.; Collings, S. Developments and applications of underwater LiDAR systems in support of marine science. In Proceedings of the OCEANS 2018 MTS/IEEE Charleston, Charleston, SC, USA, 22–25 October 2018; pp. 1–10. [Google Scholar]
  6. Biber, P.; Straßer, W. The normal distributions transform: A new approach to laser scan matching. In Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No. 03CH37453), Las Vegas, NV, USA, 27–31 October 2003; Volume 3, pp. 2743–2748. [Google Scholar]
  7. Burguera, A.; González, Y.; Oliver, G. The UspIC: Performing scan matching localization using an imaging sonar. Sensors 2012, 12, 7855–7885. [Google Scholar] [CrossRef]
  8. Bengtsson, O.; Baerveldt, A.J. Robot localization based on scan-matching—Estimating the covariance matrix for the IDC algorithm. Robot. Auton. Syst. 2003, 44, 29–40. [Google Scholar] [CrossRef]
  9. Wang, J.; Chen, F.; Huang, Y.; McConnell, J.; Shan, T.; Englot, B. Virtual maps for autonomous exploration of cluttered underwater environments. IEEE J. Ocean. Eng. 2022, 47, 916–935. [Google Scholar] [CrossRef]
  10. Hover, F.S.; Eustice, R.M.; Kim, A.; Englot, B.; Johannsson, H.; Kaess, M.; Leonard, J.J. Advanced perception, navigation and planning for autonomous in-water ship hull inspection. Int. J. Robot. Res. 2012, 31, 1445–1464. [Google Scholar] [CrossRef]
Figure 6. Image processing diagram. The forward-looking multi-beam sonar captures images in polar coordinates; the raw images are converted to Cartesian coordinates through interpolation. Pixel values along a given direction often change at high frequency, which makes the optimization prone to local minima; the image gradients likewise vary at high frequency.
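The polar-to-Cartesian conversion described for Figure 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the raw scan is assumed to be stored as a (range bins x beams) intensity array, the interpolation is delegated to OpenCV's remap, and the field of view and output size are placeholder values.

import numpy as np
import cv2

def polar_to_cartesian(polar_img, fov_deg=130.0, out_size=512):
    """Resample a (num_ranges x num_beams) polar sonar scan onto a Cartesian grid.

    Rows of polar_img are range bins, columns are bearings spanning fov_deg.
    Returns an out_size x out_size image with the sonar origin at the bottom centre.
    """
    num_ranges, num_beams = polar_img.shape
    half_fov = np.deg2rad(fov_deg) / 2.0

    # Cartesian pixel grid, expressed in range-bin units.
    xs = np.linspace(-num_ranges, num_ranges, out_size)
    ys = np.linspace(num_ranges, 0, out_size)
    X, Y = np.meshgrid(xs, ys)

    r = np.sqrt(X**2 + Y**2)          # radial distance in range-bin units
    theta = np.arctan2(X, Y)          # bearing, zero along the sonar axis

    # Map (r, theta) back to fractional indices of the polar image.
    map_x = ((theta + half_fov) / (2 * half_fov)) * (num_beams - 1)
    map_y = r

    cart = cv2.remap(polar_img.astype(np.float32),
                     map_x.astype(np.float32), map_y.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR, borderValue=0)
    # Zero out pixels that fall outside the sonar fan.
    cart[(r > num_ranges - 1) | (np.abs(theta) > half_fov)] = 0
    return cart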
Figure 7. The changes in pixel values along the straight line AB in the sonar image of Figure 6 before and after bilateral filtering: (a) before bilateral filtering; (b) after bilateral filtering.
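The comparison in Figure 7 can be reproduced in spirit with a short sketch: apply an edge-preserving bilateral filter to the Cartesian sonar image and sample an intensity profile along a segment A-B before and after filtering. The filter parameters below (d, sigmaColor, sigmaSpace) are illustrative values only, not those used in the paper, and the endpoints are assumed to lie inside the image.

import numpy as np
import cv2

def profile_before_after(cart_img, p_a, p_b, num_samples=200):
    """Sample pixel values along the segment A-B before and after bilateral filtering."""
    img = cart_img.astype(np.float32)
    # Edge-preserving smoothing; parameter values are illustrative only.
    smoothed = cv2.bilateralFilter(img, d=9, sigmaColor=30, sigmaSpace=9)

    ts = np.linspace(0.0, 1.0, num_samples)
    xs = (p_a[0] + ts * (p_b[0] - p_a[0])).astype(int)
    ys = (p_a[1] + ts * (p_b[1] - p_a[1])).astype(int)

    return img[ys, xs], smoothed[ys, xs]   # raw profile, filtered profile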
Figure 8. Effects of feature extraction with different CFAR detection methods: (a) Raw sonar image. (b) Features extracted via one-dimensional SO-CFAR detection along the range dimension. (c) Features extracted via GO-CFAR detection along the range dimension. (d) Features extracted via two-dimensional SO-CFAR detection.
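To make the comparison in Figure 8 concrete, the following is a minimal one-dimensional SO-CFAR detector along the range dimension of a single beam; the two-dimensional variant extends the same smallest-of rule to a square window around the test cell. Cell counts and the threshold scale are placeholder values, not those used in the paper.

import numpy as np

def so_cfar_1d(beam, num_ref=16, num_guard=4, scale=2.5):
    """Smallest-of CFAR along one sonar beam (a 1-D vector of range-bin intensities).

    For each test cell, the noise level is the smaller of the means of the leading
    and trailing reference windows; a detection is declared when the test cell
    exceeds scale * noise_level.
    """
    detections = np.zeros(len(beam), dtype=bool)
    n = len(beam)
    for i in range(num_ref + num_guard, n - num_ref - num_guard):
        lead = beam[i - num_guard - num_ref : i - num_guard]
        trail = beam[i + num_guard + 1 : i + num_guard + 1 + num_ref]
        noise = min(lead.mean(), trail.mean())   # "smallest-of" rule
        detections[i] = beam[i] > scale * noise
    return detections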
Figure 9. Quantitative assessment: (a) Pixel coordinates in the imaging of point clouds transformed using poses T1 and T2. (b) Enlarged sonar image (part) of (a). The red box represents the enlarged area. (c) Position of point clouds in the sonar image. (d) Sonar image (part corresponding to the point clouds). (e) Statistical chart of point cloud pixel values.
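The assessment behind Figure 9 and Table 1 can be read as follows: project the point cloud of the previous frame into the current sonar image using the estimated relative pose, and sum the image intensities at the projected pixels; a more accurate pose places the points on brighter echo pixels and therefore yields a larger sum. Below is a hedged sketch of that score, assuming points expressed in the previous sonar frame, a 4x4 homogeneous relative pose T, a polar-coordinate image whose rows index range bins and columns index bearings, and illustrative range/bearing resolutions; these names are not taken from the paper.

import numpy as np

def project_to_pixel(p, range_res, bearing_res, img_shape, fov_deg=130.0):
    """Map a 3-D point in the sonar frame to (row, col) under the FLS projection.
    Elevation is discarded, as in Figure 3: only range and bearing survive imaging."""
    r = np.linalg.norm(p)
    theta = np.arctan2(p[1], p[0])                 # bearing about the forward axis
    row = int(r / range_res)
    col = int((theta + np.deg2rad(fov_deg) / 2) / bearing_res)
    if 0 <= row < img_shape[0] and 0 <= col < img_shape[1]:
        return row, col
    return None

def projection_score(points, T, image, range_res, bearing_res):
    """Sum of pixel values at the projected point locations (higher = better pose)."""
    score = 0.0
    for p in points:                               # points: Nx3 array, previous frame
        q = (T @ np.append(p, 1.0))[:3]            # transform into the current frame
        pix = project_to_pixel(q, range_res, bearing_res, image.shape)
        if pix is not None:
            score += float(image[pix])
    return score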
Figure 10. (a) The trajectory and map produced by the SLAM system when only the ICP algorithm is used in the front end. (b) The trajectory and map produced when the semi-direct method is applied on top of the ICP algorithm.
Table 1. The average sum of pixel values of the projected points calculated using different method combinations for pose estimation.

Method                       Mean
Semi-direct + 1D SO-CFAR     13,290.7
ICP + 1D SO-CFAR             11,985.5
Semi-direct + 2D SO-CFAR     13,855.1
ICP + 2D SO-CFAR             12,557.1