Abstract
In this paper, we propose a novel unsupervised pre-training method for point cloud deep learning models based on multimodal contrastive learning. Point clouds, sets of three-dimensional coordinate points acquired from 3D scanners, LiDAR sensors, depth cameras, and similar devices, play an important role in representing 3D scenes, and understanding them is crucial for applications such as autonomous driving and navigation. Supervised deep learning models for point cloud understanding require a ground-truth label for each point cloud during training. Generating these labels is expensive, however, making it difficult to build the large datasets that are essential for good model performance. Our proposed unsupervised pre-training method, in contrast, requires no labels and provides an initialization that alleviates the need for such large datasets. The method is a multimodal approach that uses two modalities of a point cloud: the point cloud itself and an image rendered from it. By rendering images directly from the point clouds, shape information from various viewpoints can be obtained without additional data such as meshes. We pre-trained a model with the proposed method and compared its performance on the ModelNet40 and ScanObjectNN datasets. Linear classification on the point cloud features extracted by the pre-trained model achieved accuracies of 91.5% and 83.9%, and after fine-tuning on each dataset, classification accuracies of 93.3% and 86.9%, respectively.
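The cross-modal contrastive objective described in the abstract can be sketched as a symmetric InfoNCE-style loss, where the point cloud embedding and the embedding of its rendered image form a positive pair and all other pairs in the batch act as negatives. The following is a minimal illustrative sketch, not the authors' exact implementation; the function names and the temperature value are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_infonce(z_pc, z_img, temperature=0.07):
    """Symmetric InfoNCE loss between point-cloud and image embeddings.

    z_pc, z_img: (batch, dim) feature vectors; row i of each modality
    comes from the same object, so the (i, i) pairs are positives and
    all other pairs in the batch serve as negatives.
    """
    z_pc = l2_normalize(z_pc)
    z_img = l2_normalize(z_img)
    logits = z_pc @ z_img.T / temperature  # (batch, batch) cosine similarities
    labels = np.arange(len(z_pc))

    def ce(lg):
        # Cross-entropy with the matching index as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the point-cloud-to-image and image-to-point-cloud directions.
    return 0.5 * (ce(logits) + ce(logits.T))
```

When the two encoders map matching point clouds and renderings to nearby points on the sphere and non-matching pairs far apart, this loss approaches zero; mismatched pairings drive it up, which is what pushes the encoders toward viewpoint-aware shape features.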
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the BK21 FOUR (Fostering Outstanding Universities for Research) program funded by the Ministry of Education (MOE, Korea) and the National Research Foundation of Korea (NRF).
Ethics declarations
Conflict of interest
The authors have no competing interests relevant to the content of this article to declare.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lee, W., Kim, H. Multimodal contrastive learning using point clouds and their rendered images. Multimed Tools Appl 83, 78577–78592 (2024). https://doi.org/10.1007/s11042-024-18653-7