NeXtFusion: Attention-Based Camera-Radar Fusion Network for Improved Three-Dimensional Object Detection and Tracking
Figure 1. An overview of the proposed NeXtFusion multi-modal sensor fusion neural network.
Figure 2. Comparison between velocities: the true velocity (v^A) of object A is equal to the radial velocity (v_r), whereas, for object B, the radar-reported radial velocity (v_r) indicated by the red arrow is not equal to the true velocity (v^B).
Figure 3. NextDet architecture utilized as the baseline for the proposed NeXtFusion architecture.
Figure 4. A visual depiction of the top-down and bottom-up implementations of the FPN and PAN networks.
Figure 5. A visual depiction of the spatial pyramid pooling operation.
Figure 6. Three sets of examples (a–c), where (a) represents a perfect overlap between the predicted bounding box (A) and the ground-truth box (B), (b) represents a partial overlap resulting in an IoU loss of 0.5, and (c) represents the disjoint problem of IoU, in which A and B do not overlap.
Figure 7. Two sets of examples (a,b), where (a) represents a perfect overlap between the predicted bounding box (A) and the ground-truth bounding box (B), and (b) represents a non-overlapping case of A and B, which solves the IoU disjoint problem by introducing a third, smallest enclosing bounding box (C) that encompasses both A and B.
Figure 8. Right-handed coordinate system.
Figure 9. Architecture of the proposed NeXtFusion 3D object detection network.
Figure 10. An example of an occluded image obtained from the nuScenes dataset, captured by a front camera positioned at the vehicle's top-front location.
Figure 11. An example of camera-radar fusion-based 3D object detection involving occlusion at night from the nuScenes dataset using the proposed NeXtFusion network. Different bounding-box colors indicate different detected objects.
Figure 12. An example of camera-radar fusion-based 3D object detection at an intersection using the proposed NeXtFusion network. Blue bounding boxes denote people, yellow bounding boxes denote cars, and red bounding boxes denote buses.
Figure 13. An illustration of radar point-cloud data of the environment from the nuScenes dataset generated by NeXtFusion. Dots represent objects, with darker shades indicating closer proximity to the sensor.
Figure 14. An example of high-confidence radar returns from two vehicles, captured and combined using five radar sweeps. Light blue lines indicate radar detections, showing their length and orientation. The red cross indicates the vehicle's ego-centric position. Blue bounding boxes denote people, yellow bounding boxes denote cars, and red bounding boxes denote buses.
Figure 15. A static representation of how the proposed network performs multi-modal 3D object detection and tracking as part of the validation tests. nuScenes provides data for a scene captured from camera and radar sensors mounted on the AV as follows: (a) the front-left camera identifies and starts tracking the object (in this example, only a single pedestrian is tracked for analysis purposes); (b) the front-center camera continues to track the identified object while it is in its field of view; (c) the front-right camera continues to track the identified object until it leaves its field of view.
Abstract
1. Introduction
2. Literature Review on Relevant Studies
2.1. Camera-Based Object Detection
- Backbone: This module acts as the foundation, extracting salient features from the image and producing a compressed representation through a robust image classifier. Think of it as a skilled photographer capturing the essence of a scene.
- Neck: This module acts as the bridge, connecting to the backbone and merging features extracted from its different levels. Think of it as a data sculptor, gathering and harmonizing different perspectives to create a richer understanding.
- Head: This module is the decision maker, responsible for drawing bounding boxes around objects and classifying their types. Think of it as a detective, analyzing the combined information to identify what each object is and where it lies. (A minimal skeleton of this three-stage pipeline is sketched after this list.)
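To make the backbone-neck-head split concrete, here is a minimal, hypothetical PyTorch skeleton of the three-stage layout. The module names, channel sizes, and class count are illustrative assumptions, not the NextDet or NeXtFusion implementation:

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Hypothetical sketch of the backbone-neck-head detector layout."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Backbone: extracts progressively coarser feature maps from the image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Neck: refines and merges backbone features (stand-in for FPN/PAN).
        self.neck = nn.Sequential(nn.Conv2d(64, 64, 1), nn.ReLU())
        # Head: predicts class scores and box offsets per spatial location.
        self.cls_head = nn.Conv2d(64, num_classes, 1)
        self.box_head = nn.Conv2d(64, 4, 1)

    def forward(self, x: torch.Tensor):
        feats = self.neck(self.backbone(x))
        return self.cls_head(feats), self.box_head(feats)

scores, boxes = TinyDetector()(torch.randn(1, 3, 256, 256))
print(scores.shape, boxes.shape)  # [1, 10, 64, 64] and [1, 4, 64, 64]
```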
2.2. Radar Point Cloud and Camera-Based Sensor Fusion for Object Detection
3. The Proposed NeXtFusion 3D Object Detection Network
3.1. Backbone
3.2. Neck
3.3. Head
3.4. Bounding-Box Regression
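Figures 6 and 7 illustrate the IoU disjoint problem and the GIoU remedy used during bounding-box regression: when predicted box A and ground-truth box B do not overlap, plain IoU is stuck at zero and provides no gradient, whereas GIoU subtracts a penalty based on the smallest enclosing box C. A minimal sketch of the published GIoU formulation (Rezatofighi et al.), assuming axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def giou(box_a, box_b):
    """Generalized IoU for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection of A and B.
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C (Figure 7) penalizes disjoint boxes.
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    area_c = cw * ch
    return iou - (area_c - union) / area_c

# Disjoint boxes: IoU is 0, but GIoU is negative and still informative.
print(giou((0, 0, 2, 2), (4, 4, 6, 6)))  # approx -0.778
```

The corresponding regression loss is then 1 − GIoU, which remains differentiable even for the non-overlapping case of Figure 7b.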
3.5. Extracting Radar Features
3.6. Associating Radar Data to the Image Plane
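At its core, associating radar returns with camera pixels is a rigid transform into the camera frame followed by a pinhole projection. The following is a general, hedged NumPy illustration of that step, not the paper's exact procedure; K, R, and t are assumed placeholder calibration values (K loosely shaped like a nuScenes front camera):

```python
import numpy as np

K = np.array([[1266.0,    0.0, 816.0],
              [   0.0, 1266.0, 491.0],
              [   0.0,    0.0,   1.0]])   # camera intrinsic matrix (assumed)
R = np.eye(3)                             # radar-to-camera rotation (assumed identity)
t = np.array([0.0, 0.0, 0.0])             # radar-to-camera translation (assumed zero)

def radar_to_pixels(points_radar: np.ndarray) -> np.ndarray:
    """Project Nx3 radar points (meters) into pixel coordinates (u, v)."""
    points_cam = points_radar @ R.T + t            # rigid transform into camera frame
    points_cam = points_cam[points_cam[:, 2] > 0]  # drop points behind the camera
    uvw = points_cam @ K.T                         # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]                # perspective divide by depth

print(radar_to_pixels(np.array([[2.0, 0.5, 20.0], [-1.0, 0.2, 35.0]])))
```

Once projected, each radar point can be matched to the detection whose 2D region it falls inside, letting the network attach radar depth and velocity features to camera-detected objects.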
4. Experiments and Results
4.1. Dataset
4.2. Evaluation Metrics
- True positive (TP) represents correctly identified objects where the predicted bounding box overlaps with the ground-truth box.
- False positive (FP) denotes erroneous detections where the predicted box overlaps insufficiently with the ground truth, falling below the matching threshold and indicating an incorrect identification (see the matching sketch after this metric list).
- False negative (FN) occurs when the model fails to detect a valid object present in the scene, resulting in a missed target.
- Mean average translation error (mATE): This metric represents the average Euclidean distance (in meters) between the predicted and ground-truth 3D translation (x, y, z) of an object across all objects and scenes in the dataset (these error terms are sketched in code after this list).
- Mean average scale error (mASE): This metric represents the average absolute difference between the predicted and ground-truth scale (measured as the product of length, width, and height) of an object across all objects and scenes.
- Mean average orientation error (mAOE): This metric represents the average minimum yaw angle difference (in radians) between the predicted and ground-truth orientation of an object across all objects and scenes. It is typically calculated on a per-class basis due to variations in object types.
- Mean average velocity error (mAVE): This metric represents the average absolute difference between the predicted and ground-truth velocity (magnitude in m/s) of an object across all objects and scenes. It is usually calculated on a per-class basis.
- Mean average attribute error (mAAE): This metric represents the average absolute difference between the predicted and ground-truth attributes (depending on the specific attribute considered) of an object across all objects and scenes. It is typically calculated on a per-class basis and depends on the specific set of attributes considered for the object category.
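As an illustration of how TP, FP, and FN counts arise from box matching, here is a minimal greedy matcher using an IoU threshold. This is a generic sketch with hypothetical helper names, not the paper's evaluation code; note that nuScenes itself matches predictions to ground truth by 2D center distance on the ground plane rather than by IoU:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_detections(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedily count TP/FP/FN at a fixed IoU threshold."""
    matched, tp, fp = set(), 0, 0
    for pb in pred_boxes:
        best_iou, best_j = 0.0, None
        for j, gb in enumerate(gt_boxes):
            if j not in matched and iou(pb, gb) > best_iou:
                best_iou, best_j = iou(pb, gb), j
        if best_iou >= iou_thresh:
            tp, matched = tp + 1, matched | {best_j}
        else:
            fp += 1  # no sufficiently overlapping, unclaimed ground truth
    fn = len(gt_boxes) - len(matched)  # unmatched ground truths are misses
    return tp, fp, fn

print(match_detections([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 10, 10)]))
# -> (1, 1, 0): one true positive, one false positive, no misses
```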
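The translation, scale, orientation, and velocity error terms can likewise be sketched per matched pair. Field names here are illustrative assumptions; note also that the official nuScenes scale error is defined as 1 − 3D IoU of aligned boxes, whereas the size-product difference below follows the simpler description given above:

```python
import numpy as np

def tp_errors(pred, gt):
    """Per-match error terms for a matched prediction/ground-truth pair."""
    ate = np.linalg.norm(np.subtract(pred["xyz"], gt["xyz"]))       # meters
    ase = abs(np.prod(pred["lwh"]) - np.prod(gt["lwh"]))            # size-product difference
    d = pred["yaw"] - gt["yaw"]
    aoe = abs((d + np.pi) % (2 * np.pi) - np.pi)                    # minimum yaw difference, radians
    ave = abs(np.linalg.norm(pred["v"]) - np.linalg.norm(gt["v"]))  # m/s
    return ate, ase, aoe, ave

# mATE/mASE/mAOE/mAVE are then the per-class means of these terms over the dataset.
print(tp_errors({"xyz": [1.0, 2.0, 0.0], "lwh": [4.0, 2.0, 1.5], "yaw": 0.1, "v": [5.0, 0.0]},
                {"xyz": [1.5, 2.0, 0.0], "lwh": [4.0, 2.0, 1.6], "yaw": 0.0, "v": [4.0, 0.0]}))
# -> (0.5, 0.8, 0.1, 1.0)
```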
4.3. Training Infrastructure
- NVIDIA Tesla V100 GPU: Delivers high-performance graphics processing for efficient training.
- Intel Xeon Gold 6248 20-core CPU: Provides robust central processing power for computations.
- 1.92 TB solid-state drive: Ensures fast data storage and retrieval.
- 768 GB of RAM: Supports large model training and data handling.
- PyTorch 1.12.1: A deep-learning framework for model development and training.
- Python 3.7.9: The general-purpose programming language used for the research.
- CUDA 11.3: Enables efficient utilization of the NVIDIA GPUs for computations.
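A quick sanity check (a hypothetical snippet, not from the paper) can confirm that a machine matches this stack before training:

```python
import sys
import torch

print(sys.version.split()[0])  # expect 3.7.9
print(torch.__version__)       # expect 1.12.1
print(torch.version.cuda)      # expect 11.3
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # expect an NVIDIA Tesla V100
```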
4.4. Experiment Results
| Object Detection Network | Training Dataset | Optimizer | Learning Rate | Batch Size | Weight Decay | Momentum | # of Epochs |
|---|---|---|---|---|---|---|---|
| OFT | Y | Adam | 0.0001 | 64 | 0.0005 | 0.85 | 400 |
| MonoDIS | Y | Adam | 0.0001 | 64 | 0.0005 | 0.85 | 400 |
| InfoFocus | Y | Adam | 0.0001 | 64 | 0.0005 | 0.85 | 400 |
| NeXtFusion | Y | Adam | 0.0001 | 64 | 0.0005 | 0.85 | 400 |
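Read as a PyTorch configuration, the shared hyperparameters above might look like the following sketch. Mapping the table's "momentum" column to Adam's first beta is our assumption, since torch.optim.Adam has no momentum argument:

```python
import torch

model = torch.nn.Linear(16, 4)  # stand-in for the actual detection network
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.0001,            # learning rate from the table
    weight_decay=0.0005,  # weight decay from the table
    betas=(0.85, 0.999),  # assumption: the table's momentum of 0.85 as beta1
)
# Training would then iterate for 400 epochs with a batch size of 64.
```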
| Object Detection Network | Validation Dataset | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|
| OFT | Y | 0.122 | 0.819 | 0.359 | 0.848 | 1.727 | 0.479 |
| MonoDIS | Y | 0.301 | 0.737 | 0.266 | 0.544 | 1.531 | 0.140 |
| InfoFocus | Y | 0.396 | 0.361 | 0.263 | 1.130 | 0.617 | 0.394 |
| NeXtFusion | Y | 0.473 | 0.449 | 0.261 | 0.534 | 0.538 | 0.139 |
4.5. Ablation Studies
4.6. Visualization of Samples
- One LIDAR (light detection and ranging) sensor:
- LIDAR_TOP
- Five RADAR (radio detection and ranging) sensors:
- RADAR_FRONT
- RADAR_FRONT_LEFT
- RADAR_FRONT_RIGHT
- RADAR_BACK_LEFT
- RADAR_BACK_RIGHT
- Six camera sensors:
- CAM_FRONT
- CAM_FRONT_LEFT
- CAM_FRONT_RIGHT
- CAM_BACK
- CAM_BACK_LEFT
- CAM_BACK_RIGHT
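For reference, this sensor suite is exactly what the public nuscenes-devkit exposes per annotated sample; a short sketch (the dataroot path is an assumed placeholder):

```python
from nuscenes.nuscenes import NuScenes  # pip install nuscenes-devkit

nusc = NuScenes(version="v1.0-mini", dataroot="/data/nuscenes", verbose=False)
sample = nusc.sample[0]  # one annotated keyframe
# Each sample maps sensor channel names to sample_data tokens:
for channel in ("CAM_FRONT", "RADAR_FRONT", "LIDAR_TOP"):
    sd = nusc.get("sample_data", sample["data"][channel])
    print(channel, sd["filename"])
```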
- Continuously monitor the movement of objects in its environment.
- Classify and differentiate between object types and understand their potential intentions, even under unfavorable conditions.
- Make informed decisions by planning safe maneuvers based on perceived information about the environment.
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Robson, K. Full Self-Driving Cars Might Not Be with Us until 2035, Experts Predict. Available online: https://www.verdict.co.uk/fully-self-driving-cars-unlikely-before-2035-experts-predict/ (accessed on 21 December 2023).
- Tang, Y.; He, H.; Wang, Y.; Mao, Z.; Wang, H. Multi-Modality 3D Object Detection in Autonomous Driving: A Review. Neurocomputing 2023, 553, 126587. [Google Scholar] [CrossRef]
- Qian, R.; Lai, X.; Li, X. 3D Object Detection for Autonomous Driving: A Survey. Pattern Recognit. 2022, 130, 108796. [Google Scholar] [CrossRef]
- Le, H.-S.; Le, T.D.; Huynh, K.-T. A Review on 3D Object Detection for Self-Driving Cars. In Proceedings of the 2022 RIVF International Conference on Computing and Communication Technologies (RIVF), Ho Chi Minh City, Vietnam, 20–22 December 2022; pp. 398–403. [Google Scholar]
- Alessandretti, G.; Broggi, A.; Cerri, P. Vehicle and Guard Rail Detection Using Radar and Vision Data Fusion. IEEE Trans. Intell. Transp. Syst. 2007, 8, 95–105. [Google Scholar] [CrossRef]
- Zhou, Y.; Liu, L.; Zhao, H.; López-Benítez, M.; Yu, L.; Yue, Y. Towards Deep Radar Perception for Autonomous Driving: Datasets, Methods, and Challenges. Sensors 2022, 22, 4208. [Google Scholar] [CrossRef] [PubMed]
- Wang, Y.; Chao, W.-L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR From Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8437–8445. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
- Kalgaonkar, P.; El-Sharkawy, M. NextDet: Efficient Sparse-to-Dense Object Detection with Attentive Feature Aggregation. Future Internet 2022, 14, 355. [Google Scholar] [CrossRef]
- Kalgaonkar, P. AI on the Edge with CondenseNeXt: An Efficient Deep Neural Network for Devices with Constrained Computational Resources. Master’s Thesis, Purdue University Graduate School, Indianapolis, IN, USA, 2021. [Google Scholar]
- Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. [Google Scholar]
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards Balanced Learning for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Kim, J.; Sung, J.-Y.; Park, S. Comparison of Faster-RCNN, YOLO, and SSD for Real-Time Vehicle Type Recognition. In Proceedings of the 2020 IEEE International Conference on Consumer Electronics—Asia (ICCE-Asia), Seoul, Republic of Korea, 1–3 November 2020; pp. 1–4. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In Proceedings of the Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 346–361. [Google Scholar]
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Kalgaonkar, P.; El-Sharkawy, M. CondenseNeXt: An Ultra-Efficient Deep Neural Network for Embedded Systems. In Proceedings of the 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 27–30 January 2021; pp. 0524–0528. [Google Scholar]
- Modas, A.; Sanchez-Matilla, R.; Frossard, P.; Cavallaro, A. Towards Robust Sensing for Autonomous Vehicles: An Adversarial Perspective. IEEE Signal Process. Mag. 2020, 37, 14–23. [Google Scholar] [CrossRef]
- Hoermann, S.; Henzler, P.; Bach, M.; Dietmayer, K. Object Detection on Dynamic Occupancy Grid Maps Using Deep Learning and Automatic Label Generation. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 826–833. [Google Scholar]
- El Natour, G.; Bresson, G.; Trichet, R. Multi-Sensors System and Deep Learning Models for Object Tracking. Sensors 2023, 23, 7804. [Google Scholar] [CrossRef]
- Srivastav, A.; Mandal, S. Radars for Autonomous Driving: A Review of Deep Learning Methods and Challenges. IEEE Access 2023, 11, 97147–97168. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
- Zhang, Y.; Jin, R.; Zhou, Z.-H. Understanding Bag-of-Words Model: A Statistical Framework. Int. J. Mach. Learn. Cyber. 2010, 1, 43–52. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
- Liggins, M.E.; Hall, D.L.; Llinas, J. (Eds.) Handbook of Multisensor Data Fusion: Theory and Practice, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2017; ISBN 978-1-315-21948-6. [Google Scholar]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11618–11628. [Google Scholar]
- Nabati, R.; Qi, H. CenterFusion: Center-Based Radar and Camera Fusion for 3D Object Detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 1526–1535. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279. [Google Scholar] [CrossRef]
- Stewart, C.A.; Welch, V.; Plale, B.; Fox, G.; Pierce, M.; Sterling, T. Indiana University Pervasive Technology Institute. 2017. Available online: https://scholarworks.iu.edu/dspace/items/ddb55636-7550-471d-be5f-d9df6ee82310 (accessed on 26 March 2024).
- Roddick, T.; Kendall, A.; Cipolla, R. Orthographic Feature Transform for Monocular 3D Object Detection. arXiv 2018, arXiv:1811.08188. [Google Scholar]
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D Object Detection for Autonomous Driving. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2147–2156. [Google Scholar]
- Wang, J.; Lan, S.; Gao, M.; Davis, L.S. InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling. In Proceedings of the Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 405–420. [Google Scholar]
| Network | C | R | NM | FAM | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|---|---|---|
| MonoDIS | ✓ | - | - | - | 0.301 | 0.738 | 0.269 | 0.544 | 1.532 | 0.141 |
| NeXtDet | ✓ | - | - | - | 0.351 | 0.635 | 0.271 | 0.550 | 1.346 | 0.142 |
| NeXtFusion | ✓ | ✓ | ✓ | - | 0.427 | 0.499 | 0.268 | 0.541 | 0.846 | 0.140 |
| NeXtFusion | ✓ | ✓ | ✓ | ✓ | 0.474 | 0.448 | 0.262 | 0.534 | 0.536 | 0.138 |
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).