Abstract
Unsupervised 3D instance segmentation aims to segment objects from a 3D point cloud without any annotations. Existing methods face the challenge of either too loose or too tight clustering, leading to under-segmentation or over-segmentation. To address this issue, we propose Part2Object, hierarchical clustering with object guidance. Part2Object employs multi-layer clustering from points to object parts and objects, allowing objects to manifest at any layer. Additionally, it extracts and utilizes 3D objectness priors from temporally consecutive 2D RGB frames to guide the clustering process. Moreover, we propose Hi-Mask3D to support hierarchical 3D object part and instance segmentation. By training Hi-Mask3D on the objects and object parts extracted from Part2Object, we achieve consistent and superior performance compared to state-of-the-art models in various settings, including unsupervised instance segmentation, data-efficient fine-tuning, and cross-dataset generalization. Code is release at https://github.com/ChengShiest/Part2Object.
C. Shi and Y. Zhang–Equal contribution.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Adams, R., Bischof, L.: Seeded region growing. IEEE Trans. Pattern Anal. Mach. Intell. 16(6), 641–647 (1994)
Amir, S., Gandelsman, Y., Bagon, S., Dekel, T.: Deep VIT features as dense visual descriptors. arXiv preprint arXiv:2112.05814 (2021)
An, D., et al.: ETPNav: evolving topological planning for vision-language navigation in continuous environments. arXiv preprint arXiv:2304.03047 (2023)
Armeni, I., et al.: 3D semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543 (2016)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)
Chen, R., et al.: Towards label-free scene understanding by vision foundation models. In: Thirty-Seventh Conference on Neural Information Processing Systems (2023)
Chen, S., Fang, J., Zhang, Q., Liu, W., Wang, X.: Hierarchical aggregation for 3d instance segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15467–15476 (2021)
Chen, Z., Yin, K., Fisher, M., Chaudhuri, S., Zhang, H.: BAE-NET: branched autoencoder for shape co-segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8490–8499 (2019)
Chibane, J., Engelmann, F., Anh Tran, T., Pons-Moll, G.: Box2Mask: weakly supervised 3D semantic instance segmentation using bounding boxes. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13691, pp. 681–699. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19821-2_39
Choy, C., Gwak, J., Savarese, S.: 4D spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3075–3084 (2019)
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the Computer Vision and Pattern Recognition (CVPR). IEEE (2017)
Dai, Q., Yang, S.: Curriculum point prompting for weakly-supervised referring segmentation (2024)
Deng, R., Shen, C., Liu, S., Wang, H., Liu, X.: Learning to predict crisp boundaries. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 570–586. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_35
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD 1996, pp. 226–231. AAAI Press (1996)
Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vision 59, 167–181 (2004)
Geng, H., et al.: GAPartNet: cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7081–7091 (2023)
Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143 (2021)
Han, L., Zheng, T., Xu, L., Fang, L.: OccuSeg: occupancy-aware 3D instance segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2937–2946 (2020). https://doi.org/10.1109/CVPR42600.2020.00301
Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4421–4430 (2019)
Hou, J., Graham, B., Nießner, M., Xie, S.: Exploring data-efficient 3D scene understanding with contrastive scene contexts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15587–15597 (2021)
Hu, Q., et al.: RandLA-Net: efficient semantic segmentation of large-scale point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11108–11117 (2020)
Huang, S., et al.: Diffusion-based generation, optimization, and planning in 3D scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16750–16761 (2023)
Huang, Z., Wu, X., Chen, X., Zhao, H., Zhu, L., Lasenby, J.: OpenIns3D: snap and lookup for 3D open-vocabulary instance segmentation. arXiv preprint arXiv:2309.00616 (2023)
Hui, L., Tang, L., Shen, Y., Xie, J., Yang, J.: Learning superpoint graph cut for 3D instance segmentation. In: NeurIPS (2022)
Kirillov, A., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Kolodiazhnyi, M., Rukhovich, D., Vorontsova, A., Konushin, A.: Top-down beats bottom-up in 3D instance segmentation (2023). https://doi.org/10.48550/ARXIV.2302.02871. https://arxiv.org/abs/2302.02871
Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=RriDjddCLN
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Liu, J., Yu, M., Ni, B., Chen, Y.: Self-prediction for joint instance and semantic segmentation of point clouds. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020, Part XXII. LNCS, vol. 12367, pp. 187–204. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6_12
Liu, Y., et al.: Segment any point cloud sequences by distilling vision foundation models. arXiv preprint arXiv:2306.09347 (2023)
McInnes, L., Healy, J.: Accelerated hierarchical density based clustering. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 33–42. IEEE (2017)
Nunes, L., et al.: Unsupervised class-agnostic instance segmentation of 3D lidar data for autonomous vehicles. IEEE Robot. Autom. Lett. 7(4), 8713–8720 (2022)
Papon, J., Abramov, A., Schoeler, M., Worgotter, F.: Voxel cloud connectivity segmentation-supervoxels for point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2027–2034 (2013)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 (2019)
Peng, S., et al.: OpenScene: 3D scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 815–824 (2023)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 30 (2017)
Rethage, D., Wald, J., Sturm, J., Navab, N., Tombari, F.: Fully-convolutional point networks for large-scale point clouds. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 625–640. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_37
Rozenberszki, D., Litany, O., Dai, A.: Language-grounded indoor 3D semantic segmentation in the wild. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13693, pp. 125–141. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_8
Rozenberszki, D., Litany, O., Dai, A.: Language-grounded indoor 3D semantic segmentation in the wild. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13693, pp. 125–141. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_8
Rozenberszki, D., Litany, O., Dai, A.: UnScene3D: unsupervised 3D instance segmentation for indoor scenes. arXiv preprint arXiv:2303.14541 (2023)
Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3D for 3D semantic instance segmentation. In: International Conference on Robotics and Automation (ICRA) (2023)
Shi, C., Yang, S.: EdaDet: open-vocabulary object detection using early dense alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15724–15734 (2023)
Shi, C., Yang, S.: LoGoPrompt: synthetic text images can be good visual prompts for vision-language models. arXiv preprint arXiv:2309.01155 (2023)
Shi, C., Yang, S.: The devil is in the object boundary: towards annotation-free instance segmentation using foundation models. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=4JbrdrHxYy
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Song, Z., Yang, B.: OGC: unsupervised 3D object segmentation from rigid dynamics of point clouds. Adv. Neural. Inf. Process. Syst. 35, 30798–30812 (2022)
Stewart, R., Andriluka, M., Ng, A.Y.: End-to-end people detection in crowded scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2325–2333 (2016)
Straub, J., et al.: The replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)
Sun, J., Qing, C., Tan, J., Xu, X.: Superpoint transformer for 3D scene instance segmentation (2022)
Suo, S., et al.: MixSim: a hierarchical framework for mixed reality traffic simulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9622–9631 (2023)
Suomela, L., Kalliola, J., Dag, A., Edelman, H., Kämäräinen, J.K.: Benchmarking visual localization for autonomous navigation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2945–2955 (2023)
Tang, J., Zheng, G., Shi, C., Yang, S.: Contrastive grouping with transformer for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23570–23580 (2023)
Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Vu, T., Kim, K., Luu, T.M., Nguyen, T., Kim, J., Yoo, C.D.: SoftGroup++: scalable 3D instance segmentation with octree pyramid grouping. arXiv preprint arXiv:2209.08263 (2022)
Vu, T., Kim, K., Luu, T.M., Nguyen, X.T., Yoo, C.D.: SoftGroup for 3D instance segmentation on 3D point clouds. In: CVPR (2022)
Wang, R., Zhang, Y., Mao, J., Zhang, R., Cheng, C.Y., Wu, J.: IKEA-manual: seeing shape assembly step by step. Adv. Neural. Inf. Process. Syst. 35, 28428–28440 (2022)
Wang, W., Yu, R., Huang, Q., Neumann, U.: SGPN: similarity group proposal network for 3D point cloud instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2569–2578 (2018)
Wang, X., et al.: FreeSOLO: learning to segment objects without annotations. arXiv preprint arXiv:2202.12181 (2022)
Wang, X., Girdhar, R., Yu, S.X., Misra, I.: Cut and learn for unsupervised object detection and instance segmentation. arXiv preprint arXiv:2301.11320 (2023)
Wang, Y., Shen, X., Hu, S.X., Yuan, Y., Crowley, J.L., Vaufreydaz, D.: Self-supervised transformers for unsupervised object discovery using normalized cut. In: Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, June 2022
Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. (TOG) 38(5), 1–12 (2019)
Wu, W., Qi, Z., Fuxin, L.: PointConv: deep convolutional networks on 3D point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9621–9630 (2019)
Yang, Y., Wu, X., He, T., Zhao, H., Liu, X.: SAM3D: segment anything in 3D scenes. arXiv preprint arXiv:2306.03908 (2023)
Zhang, B., Wonka, P.: Point cloud instance segmentation using probabilistic embeddings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8883–8892 (2021)
Zhang, Z., Ding, J., Jiang, L., Dai, D., Xia, G.S.: FreePoint: unsupervised point cloud instance segmentation. arXiv preprint arXiv:2305.06973 (2023)
Zhang, Z., Yang, B., Wang, B., Li, B.: GrowSP: unsupervised semantic segmentation of 3d point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17619–17629 (2023)
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 62206174) and MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (ShanghaiTech University).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Ethics declarations
Conflict of Interest
Given that our 2D knowledge is derived from the self-supervised models DINO, we acknowledge that biases and controversies inherent in the training data for these models may be introduced into our model.
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shi, C., Zhang, Y., Yang, B., Tang, J., Ma, Y., Yang, S. (2025). Part2Object: Hierarchical Unsupervised 3D Instance Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_1
Download citation
DOI: https://doi.org/10.1007/978-3-031-72649-1_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72648-4
Online ISBN: 978-3-031-72649-1
eBook Packages: Computer ScienceComputer Science (R0)