Part2Object: Hierarchical Unsupervised 3D Instance Segmentation

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15076))

Abstract

Unsupervised 3D instance segmentation aims to segment objects from a 3D point cloud without any annotations. Existing methods cluster either too loosely or too tightly, leading to under-segmentation or over-segmentation. To address this issue, we propose Part2Object, a hierarchical clustering method with object guidance. Part2Object employs multi-layer clustering from points to object parts and then to objects, allowing objects to manifest at any layer. Additionally, it extracts 3D objectness priors from temporally consecutive 2D RGB frames and uses them to guide the clustering process. Moreover, we propose Hi-Mask3D to support hierarchical 3D object part and instance segmentation. By training Hi-Mask3D on the objects and object parts extracted by Part2Object, we achieve consistent and superior performance compared to state-of-the-art models in various settings, including unsupervised instance segmentation, data-efficient fine-tuning, and cross-dataset generalization. Code is released at https://github.com/ChengShiest/Part2Object.
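
The multi-layer idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the centroid-distance merging rule, the per-layer `thresholds`, and the size-based `is_object` test are all assumptions made for illustration. In particular, `is_object` is a hypothetical stand-in for the objectness prior that the paper derives from 2D RGB frames. The sketch only shows how a cluster can be accepted as an object at whichever layer it first looks object-like, while smaller parts keep merging.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Mean position of a non-empty list of coordinate tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def layered_clustering(points, thresholds, is_object):
    """Toy multi-layer clustering with an objectness check after each layer.

    thresholds: one merge distance per layer, from tight to loose (assumed).
    is_object:  hypothetical predicate standing in for an objectness prior.
    Returns (objects, leftover); objects holds (layer, cluster) pairs.
    """
    active = [[p] for p in points]  # layer 0: every point is its own cluster
    objects = []
    for layer, t in enumerate(thresholds, start=1):
        while True:
            # Find the closest pair of active clusters within this layer's threshold.
            best = None
            for i, j in combinations(range(len(active)), 2):
                d = dist(centroid(active[i]), centroid(active[j]))
                if d <= t and (best is None or d < best[0]):
                    best = (d, i, j)
            if best is None:
                break
            _, i, j = best
            active[i] = active[i] + active[j]
            del active[j]  # j > i, so index i is unaffected
        # Freeze clusters that look like objects; the rest remain parts.
        parts = []
        for cluster in active:
            if is_object(cluster):
                objects.append((layer, cluster))
            else:
                parts.append(cluster)
        active = parts
    return objects, active


# Illustrative data: one compact blob, plus two small parts of a wider object.
blob = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
wide = [(10.0, 0.0), (10.1, 0.0), (10.8, 0.0), (10.9, 0.0)]

objects, leftover = layered_clustering(
    blob + wide,
    thresholds=[0.2, 1.0],
    is_object=lambda cluster: len(cluster) >= 3,  # placeholder objectness rule
)
summary = [(layer, len(cluster)) for layer, cluster in objects]
print(summary, leftover)
```

In this toy run the compact blob is accepted at the first layer, while the wider object first appears as two parts and is only accepted after they merge at the second layer.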

C. Shi and Y. Zhang: equal contribution.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62206174) and MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (ShanghaiTech University).

Author information

Corresponding author

Correspondence to Sibei Yang.

Ethics declarations

Conflict of Interest

Given that our 2D knowledge is derived from the self-supervised model DINO, we acknowledge that biases and controversies inherent in its training data may be introduced into our model.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shi, C., Zhang, Y., Yang, B., Tang, J., Ma, Y., Yang, S. (2025). Part2Object: Hierarchical Unsupervised 3D Instance Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_1

  • DOI: https://doi.org/10.1007/978-3-031-72649-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72648-4

  • Online ISBN: 978-3-031-72649-1

  • eBook Packages: Computer Science (R0)
