Multimodal contrastive learning using point clouds and their rendered images

Published in Multimedia Tools and Applications

Abstract

In this paper, we propose a novel unsupervised pre-training method for point cloud deep learning models based on multimodal contrastive learning. Point clouds, sets of three-dimensional coordinate points acquired from 3D scanners, lidar, depth cameras, and similar sensors, play an important role in representing 3D scenes, and understanding them is crucial for applications such as autonomous driving and navigation. Supervised deep learning models for point cloud understanding require a ground-truth label for every point cloud used in training. Generating these labels is expensive, however, which makes it difficult to build the large datasets that are essential for good model performance. Our unsupervised pre-training method, in contrast, requires no labels and provides an initialization that alleviates the need for such large datasets. The method is a multimodal approach that exploits two modalities of a point cloud: the point cloud itself and an image rendered from it. Because the images are rendered directly from the point clouds, shape information from various viewpoints can be obtained without additional data such as meshes. We pre-trained a model with the proposed method and compared its performance on the ModelNet40 and ScanObjectNN datasets. Linear classification on the point cloud features extracted by the pre-trained model achieved 91.5% and 83.9% accuracy, respectively, and after fine-tuning on each dataset, classification accuracy reached 93.3% and 86.9%.
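The cross-modal contrastive objective the abstract describes can be made concrete with a symmetric InfoNCE-style loss between per-object point-cloud embeddings and rendered-image embeddings. The sketch below is a minimal NumPy illustration, not the authors' implementation: the function names, the temperature value, and the toy embeddings are our own assumptions, and the real method would train a deep encoder per modality to produce the feature batches.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_info_nce(point_feats, image_feats, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Row k of `point_feats` and row k of `image_feats` are assumed to come
    from the same object (a positive pair); every other pairing within the
    batch acts as a negative.
    """
    z_p = l2_normalize(point_feats)
    z_i = l2_normalize(image_feats)
    logits = z_p @ z_i.T / temperature            # (B, B) scaled cosine similarities
    targets = np.arange(logits.shape[0])          # positives lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # average the points->images and images->points directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy check: correctly paired embeddings should score a much lower loss
# than deliberately mismatched ones.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32))
aligned = cross_modal_info_nce(feats, feats)
mismatched = cross_modal_info_nce(feats, np.roll(feats, 1, axis=0))
print(aligned < mismatched)  # True
```

In the actual training loop, a point-cloud encoder and an image encoder would produce `point_feats` and `image_feats` for each mini-batch of objects and their rendered views, and this loss would be minimized by gradient descent; NumPy stands in here only to make the objective explicit.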



Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported by the BK21 FOUR (Fostering Outstanding Universities for Research) program funded by the Ministry of Education (MOE, Korea) and the National Research Foundation of Korea (NRF).

Author information


Corresponding author

Correspondence to Hyungki Kim.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lee, W., Kim, H. Multimodal contrastive learning using point clouds and their rendered images. Multimed Tools Appl 83, 78577–78592 (2024). https://doi.org/10.1007/s11042-024-18653-7
