DFD-SLAM: Visual SLAM with Deep Features in Dynamic Environment
Figure 1. System architecture.
Figure 2. The complete process of precise elimination. (a) Segmentation results from YOLOv8. (b) Optical flow tracking of the extracted feature points. (c) Results of the epipolar-constraint check. (d,e) The system divides each detected potential dynamic object into sub-regions and identifies the dynamic regions within them; in (e), red boxes mark dynamic regions and green areas mark static regions. (f) The segmentation results finally retained after dilation.
Figure 3. Optical flow vectors that do not meet the requirements are filtered out.
Figure 4. Matching before and after frame rotation in a rotating scene. The first row uses HFNet descriptors. The second row shows the frame identified as rotating, with red points indicating the optimized rotation center. The third row uses re-extracted ORB descriptors instead.
Figure 5. Matching performance of DFD-SLAM and ORB-SLAM3 under varying lighting and scene conditions. The first row shows the matching performance of ORB-SLAM3 with its own strategies; the second row shows DFD-SLAM using HFNet for feature point extraction and descriptor matching. In most cases, the deep-feature-based extraction retains an advantage.
Figure 6. Comparison of loop closure detection in monocular mode. The final trajectory maps are shown in (h,i). The annotated numbers indicate where loop closure detection occurred in each system. Scenes (a–g) correspond to the loop closure detections; the second row shows the frames in which each system correctly detected a loop closure relative to the first row.
Figure 7. Trajectories of leading dynamic SLAM systems and our method in highly dynamic environments. The rows show, from top to bottom, the W/static, W/xyz, W/rpy, and W/half sequences. Blue lines are the estimated trajectories, black lines the ground truth, and red lines the difference between the two; more prominent and numerous red lines indicate a higher absolute trajectory error and thus lower tracking accuracy.
Figure 8. Dynamic point culling flowchart for the W/rpy sequence. Each row represents a complete culling process and each column a culling step.
Abstract
1. Introduction
- A real-time dynamic SLAM system based on the ORB-SLAM3 framework is proposed, supporting multiple sensor input modalities. Using HFNet for both local and global feature extraction significantly enhances the tracking and loop closure performance of ORB-SLAM3, while YOLOv8 provides semantic information for the precise removal of dynamic feature points. The system maximizes the benefits of deep feature extraction in dynamic SLAM settings.
- A frame rotation estimation method is introduced, in which geometric consistency detection estimates a candidate rotation center from the optical flow vectors and decides, per frame, whether to use the descriptors generated by HFNet or to re-extract ORB descriptors. The system thus combines the advantages of deep features and traditional handcrafted methods.
- An improved feature point removal strategy is proposed that integrates geometric consistency detection to filter the semantic information from YOLOv8, ensuring precise removal of dynamic feature points while avoiding over-removal (a minimal sketch of this check follows this list).
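The paper's implementation is not reproduced in this excerpt, so the following is only a minimal Python/OpenCV sketch of the geometric-consistency idea outlined above: feature points are tracked by sparse optical flow, a fundamental matrix is estimated with RANSAC, and a YOLOv8 detection is treated as dynamic when most of the tracked points inside it violate the epipolar constraint. The function names, thresholds, and overall structure are illustrative assumptions, not the authors' code.

```python
import cv2
import numpy as np

def epipolar_error(F, pts_prev, pts_curr):
    """Distance of each current point to the epipolar line induced by its previous match."""
    pts_prev_h = cv2.convertPointsToHomogeneous(pts_prev).reshape(-1, 3)
    lines = (F @ pts_prev_h.T).T                      # lines a*x + b*y + c = 0 in the current image
    a, b, c = lines[:, 0], lines[:, 1], lines[:, 2]
    x, y = pts_curr[:, 0], pts_curr[:, 1]
    return np.abs(a * x + b * y + c) / np.sqrt(a ** 2 + b ** 2 + 1e-12)

def box_is_dynamic(prev_gray, curr_gray, box, err_thresh=1.5, dyn_ratio=0.5):
    """Epipolar-constraint check for one detection box (x1, y1, x2, y2)."""
    # Track Shi-Tomasi corners from the previous to the current frame with LK optical flow.
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=7)
    if pts_prev is None:
        return False
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    ok = status.ravel() == 1
    p0 = pts_prev.reshape(-1, 2)[ok]
    p1 = pts_curr.reshape(-1, 2)[ok]
    if len(p0) < 8:
        return False

    # Fundamental matrix from all tracked points; RANSAC suppresses the dynamic outliers.
    F, _ = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        return False

    # Keep only the tracked points that land inside the detection box in the current frame.
    x1, y1, x2, y2 = box
    in_box = (p1[:, 0] >= x1) & (p1[:, 0] <= x2) & (p1[:, 1] >= y1) & (p1[:, 1] <= y2)
    if in_box.sum() < 8:
        return False

    err = epipolar_error(F, p0[in_box], p1[in_box])
    # A box is labelled dynamic if most of its points violate the epipolar constraint.
    return float(np.mean(err > err_thresh)) > dyn_ratio
```

In a full pipeline, boxes returned by YOLOv8 for potentially dynamic classes (e.g., people) would be passed through a check of this kind before their feature points are culled.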
2. Related Works
2.1. Visual SLAM
2.2. SLAM with Deep Features
2.3. Dynamic SLAM
3. Materials and Methods
3.1. System Architecture
3.2. Dynamic Point Culling Algorithm
Algorithm 1: Dynamic Point Culling Algorithm
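The algorithm listing itself does not appear in this excerpt. As a hedged illustration of the sub-region step described in the Figure 2 caption, i.e., dividing a detected potential dynamic object into sub-regions, labelling each as dynamic or static from the epipolar-constraint results, and dilating the retained dynamic mask, a short Python sketch follows; the grid size, thresholds, and all function names are assumptions for illustration only.

```python
import numpy as np
import cv2

def classify_subregions(box, dyn_pts, stat_pts, grid=(4, 4), dyn_frac=0.4):
    """Split a detection box into a grid of sub-regions and mark each one dynamic or static.

    box      : (x1, y1, x2, y2) from the object detector
    dyn_pts  : Nx2 points inside the box that violated the epipolar constraint
    stat_pts : Mx2 points inside the box that satisfied it
    Returns a boolean grid where True marks a dynamic sub-region.
    """
    x1, y1, x2, y2 = box
    rows, cols = grid
    cell_w = (x2 - x1) / cols
    cell_h = (y2 - y1) / rows
    dynamic = np.zeros((rows, cols), dtype=bool)

    for r in range(rows):
        for c in range(cols):
            cx1, cy1 = x1 + c * cell_w, y1 + r * cell_h
            cx2, cy2 = cx1 + cell_w, cy1 + cell_h

            def count_in(pts):
                if len(pts) == 0:
                    return 0
                inside = (pts[:, 0] >= cx1) & (pts[:, 0] < cx2) & \
                         (pts[:, 1] >= cy1) & (pts[:, 1] < cy2)
                return int(inside.sum())

            n_dyn, n_stat = count_in(dyn_pts), count_in(stat_pts)
            total = n_dyn + n_stat
            # A cell is dynamic if enough of its tracked points violate the epipolar geometry.
            dynamic[r, c] = total > 0 and (n_dyn / total) >= dyn_frac
    return dynamic

def dilate_dynamic_mask(mask, kernel_size=15):
    """Dilate the pixel-level dynamic mask so feature points near object borders are also removed."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    return cv2.dilate(mask.astype(np.uint8), kernel) > 0
```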
3.3. Frame Rotation Estimation and Feature Point Matching
Algorithm 2: Rotation Estimation Algorithm
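The published listing for Algorithm 2 is likewise not reproduced here. A minimal sketch of the rotation-center idea from the contributions and the Figure 4 caption is given below: for an approximately in-plane rotation, each optical flow vector is perpendicular to the line joining its chord midpoint to the rotation center, which yields a linear least-squares problem for the center; a small residual suggests a rotating frame, for which ORB descriptors could be re-extracted instead of reusing the HFNet descriptors. All thresholds and names are illustrative assumptions.

```python
import numpy as np

def estimate_rotation_center(pts_prev, pts_curr):
    """Least-squares rotation center from optical flow vectors.

    For a pure in-plane rotation about c, each displacement d_i = p'_i - p_i is
    perpendicular to (m_i - c), where m_i is the midpoint of the match, so
    d_i . c = d_i . m_i, which is linear in c.
    Returns (center, mean point-to-bisector residual in pixels).
    """
    d = pts_curr - pts_prev                 # flow vectors (N, 2)
    m = 0.5 * (pts_curr + pts_prev)         # chord midpoints (N, 2)
    A = d
    b = np.sum(d * m, axis=1)
    center, *_ = np.linalg.lstsq(A, b, rcond=None)
    residual = np.abs(A @ center - b) / (np.linalg.norm(d, axis=1) + 1e-12)
    return center, float(np.mean(residual))

def is_rotating_frame(pts_prev, pts_curr, flow_thresh=2.0, residual_thresh=5.0):
    """Heuristic gate: large flow that is well explained by a single rotation center."""
    flow_mag = np.linalg.norm(pts_curr - pts_prev, axis=1)
    if np.median(flow_mag) < flow_thresh:
        return False                        # too little motion to call it a rotation
    _, res = estimate_rotation_center(pts_prev, pts_curr)
    return res < residual_thresh            # small residual: flow consistent with one rotation center
```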
3.4. Loop Closure
4. Results
4.1. Experiment Introduction
4.2. Test on TUM-VI Dataset
4.3. Test on TUM-RGBD Dataset
4.4. Computation Cost
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A Survey of State-of-the-Art on Visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
- Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Gálvez-López, D.; Tardos, J.D. Bags of binary words for fast place recognition in image sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
- Li, D.; Shi, X.; Long, Q.; Liu, S.; Yang, W.; Wang, F.; Wei, Q.; Qiao, F. DXSLAM: A robust and efficient visual SLAM system with deep features. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 4958–4965. [Google Scholar] [CrossRef]
- Pu, H.; Luo, J.; Wang, G.; Huang, T.; Liu, H. Visual SLAM integration with semantic segmentation and deep learning: A review. IEEE Sens. J. 2023, 23, 22119–22138. [Google Scholar] [CrossRef]
- Liu, L.; Aitken, J.M. HFNet-SLAM: An Accurate and Real-Time Monocular SLAM System with Deep Features. Sensors 2023, 23, 2113. [Google Scholar] [CrossRef]
- Soares, J.C.V.; Gattass, M.; Meggiolaro, M.A. Crowd-SLAM: Visual SLAM Towards Crowded Environments using Object Detection. J. Intell. Robot. Syst. 2021, 102, 50. [Google Scholar] [CrossRef]
- Zhang, Q.; Li, C. Semantic SLAM for mobile robots in dynamic environments based on visual camera sensors. Meas. Sci. Technol. 2023, 34, 085202. [Google Scholar] [CrossRef]
- Jeong, E.; Kim, J.; Tan, S.; Lee, J.; Ha, S. Deep learning inference parallelization on heterogeneous processors with TensorRT. IEEE Embed. Syst. Lett. 2021, 14, 15–18. [Google Scholar] [CrossRef]
- Sarlin, P.E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12716–12725. [Google Scholar] [CrossRef]
- YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 January 2023).
- Shi, J.; Tomasi, C. Good features to track. In Proceedings of the 1994 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; pp. 593–600. [Google Scholar] [CrossRef]
- Davison, A.J.; Reid, I.D.; Molton, N.D.; Stasse, O. MonoSLAM: Real-Time Single Camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1052–1067. [Google Scholar] [CrossRef]
- Klein, G.; Murray, D. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234. [Google Scholar] [CrossRef]
- Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G.R. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar] [CrossRef]
- Tang, J.; Ericson, L.; Folkesson, J.; Jensfelt, P. GCNv2: Efficient correspondence prediction for real-time SLAM. IEEE Robot. Autom. Lett. 2019, 4, 3505–3512. [Google Scholar] [CrossRef]
- Bruno, H.M.S.; Colombini, E.L. LIFT-SLAM: A deep-learning feature-based monocular visual SLAM method. Neurocomputing 2021, 455, 97–110. [Google Scholar] [CrossRef]
- Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar] [CrossRef]
- Lajoie, P.Y.; Ramtoula, B.; Chang, Y.; Carlone, L.; Beltrame, G. DOOR-SLAM: Distributed, Online, and Outlier Resilient SLAM for Robotic Teams. IEEE Robot. Autom. Lett. 2020, 5, 1656–1663. [Google Scholar] [CrossRef]
- Yang, Y.; Tang, D.; Wang, D.; Song, W.; Wang, J.; Fu, M. Multi-camera visual SLAM for off-road navigation. Robot. Auton. Syst. 2020, 128, 103505. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
- Sun, Y.; Liu, M.; Meng, M.Q.H. Motion removal for reliable RGB-D SLAM in dynamic environments. Robot. Auton. Syst. 2018, 108, 115–128. [Google Scholar] [CrossRef]
- Dai, W.; Zhang, Y.; Li, P.; Fang, Z.; Scherer, S. RGB-D SLAM in Dynamic Environments Using Point Correlations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 373–389. [Google Scholar] [CrossRef]
- Bescos, B.; Facil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
- Liu, Y.; Miura, J. RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods. IEEE Access 2021, 9, 23772–23785. [Google Scholar] [CrossRef]
- Yu, C.; Liu, Z.; Liu, X.J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A semantic visual SLAM towards dynamic environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar] [CrossRef]
- Cheng, S.; Sun, C.; Zhang, S.; Zhang, D. SG-SLAM: A Real-Time RGB-D Visual SLAM Toward Dynamic Scenes with Semantic and Geometric Information. IEEE Trans. Instrum. Meas. 2022, 72, 1–12. [Google Scholar] [CrossRef]
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
- Schubert, D.; Goll, T.; Demmel, N.; Usenko, V.; Stückler, J.; Cremers, D. The TUM VI benchmark for evaluating visual-inertial odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1680–1687. [Google Scholar] [CrossRef]
- Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar] [CrossRef]
- Du, Z.J.; Huang, S.S.; Mu, T.J.; Zhao, Q.; Martin, R.R.; Xu, K. Accurate dynamic SLAM using CRF-based long-term consistency. IEEE Trans. Vis. Comput. Graph. 2020, 28, 1745–1757. [Google Scholar] [CrossRef] [PubMed]
- Zhang, H.; Peng, J.; Yang, Q. PR-SLAM: Parallel Real-Time Dynamic SLAM Method Based on Semantic Segmentation. IEEE Access 2024, 12, 36498–36514. [Google Scholar] [CrossRef]
TUM-VI (ATE)

| Sequence | ORB-SLAM3 | VINS-Mono | HFNet-SLAM | DFD-SLAM |
|---|---|---|---|---|
| corridor1 | 0.04 | 0.63 | 0.023 | 0.018 |
| corridor2 | 0.02 | 0.95 | 0.048 | 0.015 |
| corridor3 | 0.31 | 1.56 | 0.036 | 0.112 |
| corridor4 | 0.17 | 0.25 | 0.227 | 0.183 |
| corridor5 | 0.03 | 0.77 | 0.051 | 0.027 |
| average | 0.11 | 0.83 | 0.077 | 0.071 |
| Magistrale1 | 0.56 | 2.19 | 0.130 | 0.144 |
| Magistrale2 | 0.52 | 3.11 | 0.471 | 0.319 |
| Magistrale3 | 4.89 | 0.40 | 2.903 | 2.478 |
| Magistrale4 | 0.13 | 5.12 | 0.184 | 0.113 |
| Magistrale5 | 1.03 | 0.85 | 0.874 | 0.956 |
| Magistrale6 | 1.30 | 2.29 | 0.604 | 0.547 |
| average | 1.41 | 2.33 | 0.861 | 0.760 |
| Room1 | 0.01 | 0.07 | 0.008 | 0.008 |
| Room2 | 0.02 | 0.07 | 0.012 | 0.009 |
| Room3 | 0.04 | 0.11 | 0.013 | 0.013 |
| Room4 | 0.01 | 0.04 | 0.016 | 0.011 |
| Room5 | 0.02 | 0.20 | 0.012 | 0.008 |
| Room6 | 0.01 | 0.08 | 0.006 | 0.012 |
| average | 0.02 | 0.10 | 0.011 | 0.010 |
| Slides1 | 0.97 | 0.68 | 0.414 | 0.402 |
| Slides2 | 1.06 | 0.84 | 0.803 | 0.776 |
| Slides3 | 0.69 | 0.69 | 0.611 | 0.549 |
| average | 0.91 | 0.74 | 0.609 | 0.576 |
TUM-VI (ATE)

| Sequence | DFD-SLAM (H) | DFD-SLAM (B) | DFD-SLAM (HB) |
|---|---|---|---|
| corridor1 | 0.024 | 0.031 | 0.018 |
| corridor2 | 0.052 | 0.027 | 0.015 |
| corridor3 | 0.068 | 0.247 | 0.112 |
| corridor4 | 0.196 | 0.201 | 0.183 |
| corridor5 | 0.039 | 0.058 | 0.027 |
| average | 0.076 | 0.113 | 0.071 |
TUM-VI (ATE)

| Sequence | DFD-SLAM (Without Loop) | DFD-SLAM (BOW, With Loop) | DFD-SLAM (BOW) Boost | DFD-SLAM (HFNet, With Loop) | DFD-SLAM (HFNet) Boost | ORB-SLAM3 (Without Loop) | ORB-SLAM3 (With Loop) | ORB-SLAM3 Boost |
|---|---|---|---|---|---|---|---|---|
| Magistrale1 | 5.724 | 4.427 | 0.227 | 0.417 | 0.927 | 12.943 | 9.897 | 0.235 |
| Magistrale2 | 1.153 | 0.879 | 0.238 | 0.716 | 0.379 | 1.120 | 0.757 | 0.324 |
| Magistrale4 | 1.013 | 0.523 | 0.484 | 0.761 | 0.249 | 3.376 | 0.893 | 0.735 |
| Magistrale5 | 2.119 | 2.012 | 0.050 | 1.472 | 0.305 | 2.547 | 2.539 | 0.003 |
| Magistrale6 | 3.587 | 3.226 | 0.101 | 2.124 | 0.408 | 4.568 | 4.035 | 0.117 |
| average | 2.719 | 2.213 | 0.220 | 1.098 | 0.454 | 4.911 | 3.624 | 0.282 |
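For reference, the Boost values are consistent with the relative reduction in ATE obtained by enabling loop closure:

$$\text{Boost} = \frac{\text{ATE}_{\text{without loop}} - \text{ATE}_{\text{with loop}}}{\text{ATE}_{\text{without loop}}}$$

For example, for Magistrale1 with BOW-based loop closure, (5.724 − 4.427)/5.724 ≈ 0.227. The same definition appears to hold for the Boost columns of the ablation table further below, where the reference configuration is DFD (H).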
TUM-RGBD (ATE)

| Sequence | ORB-SLAM3 | DynaSLAM | DS-SLAM | Crowd-SLAM | LC-CRF | CDS-SLAM | PR-SLAM | Ours |
|---|---|---|---|---|---|---|---|---|
| W/half | 0.424 | 0.029 | 0.030 | 0.026 | 0.028 | 0.019 | 0.025 | 0.026 |
| W/rpy | 0.726 | 0.035 | 0.044 | 0.044 | 0.035 | 0.053 | 0.034 | 0.029 |
| W/static | 0.022 | 0.006 | 0.008 | 0.007 | 0.011 | 0.005 | 0.006 | 0.005 |
| W/xyz | 0.825 | 0.016 | 0.024 | 0.020 | 0.016 | 0.013 | 0.017 | 0.007 |
| S/half | 0.019 | 0.018 | - | 0.020 | - | 0.013 | 0.015 | 0.011 |
| S/xyz | 0.012 | 0.012 | - | 0.018 | 0.009 | 0.011 | 0.007 | 0.007 |
TUM-RGBD (RPE)

| Sequence | ORB-SLAM3 | DynaSLAM | DS-SLAM | Crowd-SLAM | LC-CRF | CDS-SLAM | PR-SLAM | Ours |
|---|---|---|---|---|---|---|---|---|
| W/half | 0.023 | 0.028 | 0.030 | 0.037 | 0.035 | 0.018 | 0.013 | 0.024 |
| W/rpy | 0.138 | 0.044 | 0.150 | 0.065 | 0.050 | 0.035 | 0.017 | 0.026 |
| W/static | 0.011 | 0.008 | 0.010 | 0.010 | 0.014 | 0.006 | 0.006 | 0.005 |
| W/xyz | 0.042 | 0.021 | 0.033 | 0.025 | 0.021 | 0.017 | 0.012 | 0.009 |
| S/half | 0.014 | 0.023 | - | 0.022 | - | 0.012 | 0.011 | 0.010 |
| S/xyz | 0.016 | 0.014 | - | 0.020 | 0.012 | 0.012 | 0.010 | 0.011 |
TUM-RGBD (ATE)

| Sequence | DFD (H) ATE | DFD (HS) ATE | DFD (HS) Boost | DFD (HSD) ATE | DFD (HSD) Boost | DFD (BSD) ATE | DFD (BSD) Boost |
|---|---|---|---|---|---|---|---|
| W/half | 0.162 | 0.024 | 0.852 | 0.026 | 0.840 | 0.028 | 0.827 |
| W/rpy | 0.117 | 0.032 | 0.726 | 0.029 | 0.752 | 0.035 | 0.701 |
| W/static | 0.021 | 0.006 | 0.714 | 0.005 | 0.762 | 0.008 | 0.619 |
| W/xyz | 0.077 | 0.009 | 0.883 | 0.007 | 0.909 | 0.011 | 0.857 |
| average | 0.094 | 0.018 | 0.794 | 0.017 | 0.816 | 0.021 | 0.751 |
| Systems | Tracking Cost (ms) | Hardware |
|---|---|---|
| ORB-SLAM3 | 18.92 | Intel i7-12700H |
| CDS-SLAM | 37.96 | AMD Ryzen 7 5800H, Nvidia RTX 3070 |
| DynaSLAM | 195.00 | Nvidia Tesla M40 GPU |
| PR-SLAM | 50–60 | AMD Ryzen 5 3600, Nvidia RTX 3070 |
| Ours | 47.83 | Intel i7-12700H, Nvidia GTX 1070 Ti |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Qian, W.; Peng, J.; Zhang, H. DFD-SLAM: Visual SLAM with Deep Features in Dynamic Environment. Appl. Sci. 2024, 14, 4949. https://doi.org/10.3390/app14114949