3D-GIoU: 3D Generalized Intersection over Union for Object Detection in Point Cloud
Figure 1. Three-Dimensional Generalized Intersection over Union (3D-GIoU) architecture. The network takes a point cloud as input. After the point cloud is discretized into 3D voxel grids, the Point-Voxel Feature Encoder learns voxel-wise features. These features are then processed by the Sparse Convolution Middle Layers and sent to the Region Proposal Network, which predicts the classification score and the bounding box regression map. Finally, the detection results and the ground truth bounding boxes are used to compute the 3D GIoU loss, which optimizes the bounding box regression.

Figure 2. Voxelization of the point cloud. The original point cloud is first cropped to a fixed size of L × W × H m³, and the cropped point cloud is then transformed into 3D voxel grids.

Figure 3. The architecture of the Backbone Network. The meanings of the lines and two-dimensional (2D) shapes with different colors are given in the legend. Green 3D boxes represent feature maps of different sizes.

Figure 4. Three different ways two rectangles can overlap with exactly the same IoU value (IoU = 0.50) but different GIoU values (from left to right, GIoU = 0.50, 0.45, and 0.09, respectively). The better the orientations are aligned, the higher the GIoU value.

Figure 5. Different ways of overlap between bounding boxes in the 2D and 3D cases, respectively. In (a) and (b), cyan and pink represent the predicted bounding box B_p and the ground truth B_g, respectively, and yellow represents their intersection. The green bounding box represents the smallest enclosing box B_c.

Figure 6. The AP of different methods on the KITTI validation set at different difficulty levels (car detection).

Figure 7. 3D-GIoU vs. SECOND: detection performance under the 3D and BEV evaluations on the KITTI validation set across three difficulty levels (Easy, Moderate, and Hard). In (a) and (b), the solid line represents 3D-GIoU, and the dotted line represents SECOND.

Figure 8. Several 3D detection results on the KITTI validation set. In each RGB image, all 2D and 3D bounding boxes represent detection results; the digit and word beside each 2D box give the instance score and class. In the point clouds, teal 3D boxes indicate detection results, and red 3D boxes represent ground truths.
Abstract
1. Introduction
2. Related Work
2.1. Monocular Image-Based Detection
2.2. Point Cloud-Based Detection
2.3. Multimodal Fusion-Based Detection
3. Method
3.1. Data Preprocessing
3.2. Point-Voxel Feature Encoder
3.3. Sparse Convolution Middle Layers
3.4. Region Proposal Network
4. Loss Function
4.1. Classification Loss
4.2. 3D Bounding Box Regression Loss
4.3. 3D GIoU Loss
- (1) When the predicted and ground truth bounding boxes do not overlap, the gradient of the IoU loss is zero, which makes optimization impossible;
- (2) Two shapes can overlap in different ways and still yield the same IoU value; that is, IoU does not reflect how the overlap between two objects occurs (see Figure 4).
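Both limitations can be seen numerically. The sketch below (illustrative Python, not the paper's code; `iou_giou` and the box coordinates are names chosen for this example) computes IoU and GIoU for axis-aligned 2D boxes: two overlap configurations share the same IoU but get different GIoU values, and for disjoint boxes IoU is stuck at zero while GIoU still decreases with distance.

```python
def iou_giou(a, b):
    """IoU and GIoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    # Intersection rectangle (empty if the boxes do not overlap).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    iou = inter / union
    # Smallest enclosing box C penalizes empty space around the union.
    c = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return iou, iou - (c - union) / c

a = (0.0, 0.0, 2.0, 2.0)
print(iou_giou(a, (1.5, 0.0, 3.5, 2.0)))  # side-by-side: IoU ≈ 0.143, GIoU ≈ 0.143
print(iou_giou(a, (1.0, 1.0, 3.0, 3.0)))  # diagonal:     IoU ≈ 0.143, GIoU ≈ -0.079
# Disjoint boxes: IoU = 0 regardless of distance (zero gradient),
# while GIoU keeps decreasing as the boxes move apart.
print(iou_giou((0, 0, 1, 1), (2, 0, 3, 1)))  # IoU = 0, GIoU ≈ -0.333
print(iou_giou((0, 0, 1, 1), (4, 0, 5, 1)))  # IoU = 0, GIoU = -0.6
```

The same construction carries over directly to 3D by replacing areas with volumes.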
Algorithm 1: 3D Generalized Intersection over Union Loss
Input: the predicted and ground truth bounding boxes
Output: the 3D GIoU loss
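For the simplified case of axis-aligned 3D boxes, the loss of Algorithm 1 can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the paper's boxes also carry a yaw angle, which this sketch ignores, and `giou_3d_loss` is a name chosen for this example.

```python
import numpy as np

def giou_3d_loss(bp, bg):
    """3D GIoU loss for axis-aligned boxes given as (cx, cy, cz, l, w, h)."""
    bp, bg = np.asarray(bp, float), np.asarray(bg, float)
    p_min, p_max = bp[:3] - bp[3:] / 2, bp[:3] + bp[3:] / 2
    g_min, g_max = bg[:3] - bg[3:] / 2, bg[:3] + bg[3:] / 2
    # Intersection volume (zero if the boxes do not overlap).
    edges = np.clip(np.minimum(p_max, g_max) - np.maximum(p_min, g_min), 0.0, None)
    inter = np.prod(edges)
    union = np.prod(bp[3:]) + np.prod(bg[3:]) - inter
    # Volume of the smallest enclosing box B_c.
    vol_c = np.prod(np.maximum(p_max, g_max) - np.minimum(p_min, g_min))
    giou = inter / union - (vol_c - union) / vol_c
    return 1.0 - giou  # GIoU in (-1, 1], so the loss lies in [0, 2)

# Identical boxes give zero loss; disjoint boxes give a loss above 1.
print(giou_3d_loss((0, 0, 0, 2, 2, 2), (0, 0, 0, 2, 2, 2)))  # 0.0
print(giou_3d_loss((0, 0, 0, 1, 1, 1), (3, 0, 0, 1, 1, 1)))  # 1.5
```

Because the enclosing-box term stays differentiable even when the intersection is empty, this loss supplies a useful gradient exactly where the plain IoU loss does not.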
5. Experiments
5.1. Network Details
5.1.1. Car Detection
5.1.2. Cyclist and Pedestrian Detection
5.2. Training
5.3. Comparisons on the KITTI Validation Set
5.4. Analysis of the Detection Results
5.4.1. Car Detection
5.4.2. Cyclist and Pedestrian Detection
5.5. Ablation Studies
- (1) Comparing Baseline 1 with SECOND [18] shows that the proposed 3D GIoU loss improves detection performance; in particular, the AP at the Hard level increased by 6.4%.
- (2) Comparing Baseline 2 with SECOND [18] shows that the proposed Backbone Network improved detection performance at the Hard level by 7.28%.
- (3) Comparing 3D-GIoU with Baseline 1, Baseline 2, and SECOND [18] shows that when the 3D GIoU loss and the Backbone Network are used together, 3D object detection performance improves substantially.
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Chao, M.; Yulan, G.; Yinjie, L.; Wei, A. Binary Volumetric Convolutional Neural Networks for 3-D Object Recognition. IEEE Trans. Instrum. Meas. 2019, 68, 38–48. [Google Scholar]
- Chao, M.; Yulan, G.; Jungang, Y.; Wei, A. Learning Multi-view Representation with LSTM for 3D Shape Recognition and Retrieval. IEEE Trans. Multimed. 2019, 21, 1169–1182. [Google Scholar]
- Ankit, K.; Ozan, I.; Peter, O.; Mohit, I.; James, B.; Ishaan, G.; Victor, Z.; Romain, P.; Richard, S. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. arXiv 2015, arXiv:1506.07285. [Google Scholar]
- Alexis, C.; Holger, S.; Yann Le, C.; Loïc, B. Deep Convolutional Networks for Natural Language Processing. arXiv 2018, arXiv:1805.09843. [Google Scholar]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-View 3D Object Detection Network for Autonomous Driving. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 1, pp. 1907–1915. [Google Scholar]
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3d Object Detection for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2147–2156. [Google Scholar]
- Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–21 June 2019; pp. 1019–1028. [Google Scholar]
- Guan, P.; Ulrich, N. 3D Point Cloud Object Detection with Multi-View Convolutional Neural Network. In Proceedings of the IEEE Conference on International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2040–2049. [Google Scholar]
- Zeng, Y.; Hu, Y.; Liu, S.; Ye, J.; Han, Y.; Li, X.; Sun, N. RT3D: Real-Time 3-D Vehicle Detection in LiDAR Point Cloud for Autonomous Driving. IEEE Robot. Autom. Lett. 2018, 3, 3434–3440. [Google Scholar] [CrossRef]
- François, P.; Francis, C.; Roland, S. A Review of Point Cloud Registration Algorithms for Mobile Robotics; Foundations and Trends® in Robotics, Mike Casey: Boston, MA, USA, 2015; Volume 4, pp. 1–104. [Google Scholar]
- Boyoon, J.; Sukhatme, G.S. Detecting Moving Objects Using a Single Camera on a Mobile Robot in an Outdoor Environment. In Proceedings of the 8th Conference on Intelligent Autonomous Systems, Amsterdam, The Netherlands, 10–13 March 2004; pp. 980–987. [Google Scholar]
- Lavanya, S.; Nirvikar, L.; Dileep, K.Y. A Study of Challenging Issues on Video Surveillance System for Object Detection. J. Basic Appl. Eng. Res. 2017, 4, 313–318. [Google Scholar]
- Khan, M.; Jamil, A.; Zhihan, L.; Paolo, B.; Po, Y.; Sung, W. Efficient Deep CNN-Based Fire Detection and Localization in Video Surveillance Applications. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 1419–1434. [Google Scholar]
- Cheng-bin, J.; Shengzhe, L.; Trung, D.D.; Hakil, K. Real-Time Human Action Recognition Using CNN Over Temporal Images for Static Video Surveillance Cameras. In Advances in Multimedia Information Processing—PCM 2015; Springer: Cham, Switzerland, 2015; pp. 330–339. [Google Scholar]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [Google Scholar]
- Kitti 3D Object Detection Benchmark Leader Board. Available online: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d (accessed on 28 April 2018).
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
- Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
- Simon, M.; Milz, S.; Amende, K.; Gross, H.M. Complex-YOLO: An Euler-Region-Proposal for Real-Time 3D Object Detection on Point Clouds. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 197–209. [Google Scholar]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
- Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-time 3D Object Detection from Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7652–7660. [Google Scholar]
- Li, B.; Zhang, T.; Xia, T. Vehicle detection from 3D lidar using fully convolutional network. arXiv 2016, arXiv:1608.07916. [Google Scholar]
- Engelcke, M.; Rao, D.; Wang, D.Z.; Tong, C.H.; Posner, I. Vote3deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1355–1361. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 24–27 June 2014; pp. 580–587. [Google Scholar]
- Kiwoo, S.; Youngwook Paul, K.; Masayoshi, T. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement. arXiv 2018, arXiv:1811.03818. [Google Scholar]
- Liu, W.; Ji, R.; Li, S. Towards 3D Object Detection with Bimodal Deep Boltzmann Machines over RGBD Imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3013–3021. [Google Scholar]
- Zhuo, D.; Londin, J.L. Amodal Detection of 3D Objects: Inferring 3D Bounding Boxes from 2D Ones in RGB-Depth Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5762–5770. [Google Scholar]
- Qianhui, L.; Huifang, M.; Yue, W.; Li, T.; Rong, X. 3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection. arXiv 2017, arXiv:1711.00238. [Google Scholar]
- Song, S.; Xiao, J. Deep Sliding Shapes for Amodal 3D Object Detection in Rgb-d Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 808–816. [Google Scholar]
- Ling, M.; Yang, B.; Wang, S.; Raquel, U. Deep Continuous Fusion for Multi-Sensor 3D Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656. [Google Scholar]
- Huitl, R.; Schroth, G.; Hilsenbeck, S.; Schweiger, F.; Steinbach, E. TUMindoor: An Extensive Image and Point Cloud Dataset for Visual Indoor Localization and Mapping. In Proceedings of the IEEE International Conference on Image Processing, Orlando, FL, USA, 30 September–3 October 2012. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
- Li, M.; Hu, Y.; Zhao, N.; Qian, Q. One-Stage Multi-Sensor Data Fusion Convolutional Neural Network for 3D Object Detection. Sensors 2019, 19, 1434. [Google Scholar] [CrossRef] [PubMed]
- Xu, J.; Ma, Y.; He, S.; Zhu, J.; Xiao, Y.; Zhang, J. PVFE: Point-Voxel Feature Encoders for 3D Object Detection. In Proceedings of the IEEE International Conference on Signal, Information and Data Processing, Chongqing, China, 11–13 December 2019. accepted. [Google Scholar]
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
- Everingham, M.L.; Van Gool, C.; Williams, K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 12697–12705. [Google Scholar]
| Method | Modality | Car Easy | Car Mod. | Car Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard |
|---|---|---|---|---|---|---|---|---|---|---|
| MV3D | Img. & Lidar | 71.09 | 62.35 | 55.12 | N/A | N/A | N/A | N/A | N/A | N/A |
| AVOD | Img. & Lidar | 81.94 | 71.88 | 66.38 | 64.00 | 52.18 | 46.61 | 50.80 | 42.81 | 40.88 |
| F-PointNet | Img. & Lidar | 81.20 | 70.39 | 62.19 | 71.96 | 56.77 | 50.39 | 51.21 | 44.89 | 51.21 |
| VoxelNet | Lidar | 77.47 | 65.11 | 57.73 | 61.22 | 48.36 | 44.37 | 39.48 | 33.69 | 31.50 |
| PointPillars | Lidar | 86.96 | 76.35 | 70.19 | 77.75 | 58.55 | 54.85 | 67.07 | 58.74 | 55.97 |
| PVFE | Lidar | 87.32 | 77.12 | 68.87 | 81.58 | 62.41 | 56.33 | 58.48 | 51.74 | 45.09 |
| SECOND | Lidar | 85.99 | 75.51 | 68.25 | 80.47 | 57.02 | 55.79 | 56.99 | 50.22 | 43.59 |
| 3D-GIoU | Lidar | 87.83 | 77.91 | 75.55 | 83.32 | 64.69 | 63.51 | 67.23 | 59.58 | 52.69 |
| Method | Modality | Car Easy | Car Mod. | Car Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard |
|---|---|---|---|---|---|---|---|---|---|---|
| MV3D | Img. & Lidar | 86.02 | 76.90 | 68.48 | N/A | N/A | N/A | N/A | N/A | N/A |
| AVOD | Img. & Lidar | 88.53 | 83.79 | 77.90 | 68.09 | 57.48 | 50.77 | 58.75 | 51.05 | 47.54 |
| F-PointNet | Img. & Lidar | 88.07 | 84.00 | 75.33 | 75.38 | 61.96 | 54.68 | 58.09 | 50.22 | 47.02 |
| PIXOR | Lidar | 89.38 | 83.70 | 77.97 | N/A | N/A | N/A | N/A | N/A | N/A |
| VoxelNet | Lidar | 89.35 | 79.26 | 77.39 | 66.07 | 54.76 | 50.55 | 46.13 | 40.74 | 38.11 |
| PointPillars | Lidar | 90.12 | 86.67 | 84.53 | 80.89 | 61.54 | 58.63 | 73.08 | 68.20 | 63.20 |
| PVFE | Lidar | 89.98 | 87.03 | 79.31 | 84.30 | 64.72 | 58.42 | 61.93 | 54.88 | 51.93 |
| SECOND | Lidar | 89.23 | 86.25 | 78.95 | 82.88 | 63.46 | 57.63 | 60.81 | 53.67 | 51.10 |
| 3D-GIoU | Lidar | 90.16 | 87.92 | 86.55 | 85.35 | 66.91 | 65.06 | 70.16 | 62.57 | 55.52 |
| Eval. | Method | Car Easy | Car Mod. | Car Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard |
|---|---|---|---|---|---|---|---|---|---|---|
| 3D | SECOND | 85.99 | 75.51 | 68.25 | 80.47 | 57.02 | 55.79 | 56.99 | 50.22 | 43.59 |
| 3D | Baseline 1 | 87.20 | 76.80 | 74.65 | 82.84 | 62.34 | 56.66 | 58.16 | 51.42 | 44.74 |
| 3D | Baseline 2 | 87.62 | 77.37 | 75.53 | 83.89 | 64.27 | 62.75 | 59.37 | 52.42 | 49.78 |
| 3D | 3D-GIoU | 87.83 | 77.91 | 75.55 | 83.32 | 64.69 | 63.51 | 67.23 | 59.58 | 52.69 |
| BEV | SECOND | 89.23 | 86.25 | 78.95 | 82.88 | 63.46 | 57.63 | 60.81 | 53.67 | 51.10 |
| BEV | Baseline 1 | 89.99 | 86.82 | 86.03 | 84.83 | 64.56 | 58.55 | 62.34 | 59.35 | 52.70 |
| BEV | Baseline 2 | 89.80 | 87.13 | 86.31 | 85.42 | 65.78 | 64.45 | 66.40 | 59.40 | 52.56 |
| BEV | 3D-GIoU | 90.16 | 87.92 | 86.55 | 85.35 | 66.91 | 65.06 | 70.16 | 62.57 | 55.52 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xu, J.; Ma, Y.; He, S.; Zhu, J. 3D-GIoU: 3D Generalized Intersection over Union for Object Detection in Point Cloud. Sensors 2019, 19, 4093. https://doi.org/10.3390/s19194093