Abstract
Instance-level perception tasks such as object detection, instance segmentation, and 3D detection require large numbers of training samples to reach satisfactory performance. Because the fine-grained labels these tasks demand are expensive to obtain, data augmentation is a natural way to tackle the problem; however, instance-level augmentation has received little attention in prior research. In this paper, we present an effective, efficient, and unified crop-paste mechanism that augments the training set using existing instance-level annotations. Our design is derived from visual coherence and builds on three inherent principles that hold widely in real-world data: (i) background coherence in the local neighborhood, (ii) appearance coherence for instance placement, and (iii) instance coherence within the same category. These principles are unified across tasks, covering object detection, instance segmentation, and 3D detection. Extensive experiments demonstrate that our approach boosts the performance of diverse frameworks on multiple datasets and tasks without modifying the network structure, yielding remarkable improvements: 5.1 mAP for object detection and 3.2 mAP for instance segmentation on the COCO dataset, and 6.9 mAP for 3D detection on the ScanNetV2 dataset. Our method can be easily integrated into different frameworks without affecting training or inference efficiency.
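To make the crop-paste idea concrete, below is a minimal 2D sketch in NumPy. It is our own illustrative simplification, not the paper's released implementation: it only shows the local-neighborhood placement suggested by the background-coherence principle, and the helper names `paste_instance` and `sample_offset` are hypothetical. A full system would additionally restore the background behind the moved instance (e.g., via inpainting) and score candidate locations for appearance coherence.

```python
import numpy as np

def sample_offset(max_shift, rng):
    # A small translation keeps the pasted instance inside the local
    # neighborhood of its original position (background coherence).
    return rng.integers(-max_shift, max_shift + 1, size=2)

def paste_instance(image, mask, max_shift=30, rng=None):
    """Crop an instance via its binary mask and re-paste it at a nearby
    jittered location; returns the augmented image and the new mask.

    image: (H, W, 3) uint8 array; mask: (H, W) bool array.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = mask.shape
    ys, xs = np.nonzero(mask)           # pixel coordinates of the instance
    dy, dx = sample_offset(max_shift, rng)
    new_ys = np.clip(ys + dy, 0, H - 1)
    new_xs = np.clip(xs + dx, 0, W - 1)

    out_img = image.copy()
    out_mask = np.zeros_like(mask)
    out_img[new_ys, new_xs] = image[ys, xs]   # paste the cropped pixels
    out_mask[new_ys, new_xs] = True           # mask moves with the pixels
    return out_img, out_mask
```

For a COCO-style sample, one would decode the instance polygon into a binary mask, call `paste_instance`, and recompute the bounding box from the returned mask so that box, mask, and image stay consistent.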
Data Availability
The data that support the findings of this study are available from VOC [22], VOCSDS [33], COCO [53], SUN RGB-D [73], and ScanNetV2 [15], which were used under their respective licenses for the current study. The data are available from the corresponding websites.
References
Arnheim, R. (1969). Visual thinking.
Bearman, A., Russakovsky, O., & Ferrari, V., et al. (2016). What’s the point: Semantic segmentation with point supervision. In ECCV.
Bertalmio, M., Bertozzi, A.L., & Sapiro, G. (2001). Navier–Stokes, fluid dynamics, and image and video inpainting. In CVPR.
Bleau, A., & Leon, L.J. (2000). Watershed-based segmentation and region merging. Computer Vision and Image Understanding.
Bolya, D., Zhou, C., & Xiao, F., et al. (2019). YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9157–9166).
Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In CVPR.
Carion, N., Massa, F., & Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16 (pp. 213–229). Springer.
Chen, L.C., Hermans, A., & Papandreou, G., et al. (2018). MaskLab: Instance segmentation by refining object detection with semantic and direction features. In CVPR.
Chen, X., Ma, H., & Wan, J., et al. (2017). Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1907–1915).
Chen, K., Pang, J., & Wang, J., et al. (2018). MMDetection. https://github.com/open-mmlab/mmdetection.
Chen, T., Saxena, S., & Li, L., et al. (2021). Pix2seq: A language modeling framework for object detection. In International conference on learning representations.
Cheng, B., Misra, I., & Schwing, A.G., et al. (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1290–1299).
Cordts, M., Omran, M., & Ramos, S., et al. (2016). The Cityscapes dataset for semantic urban scene understanding. In CVPR.
Dai, A., Chang, A.X., & Savva, M., et al. (2017). ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828–5839).
Dai, J., He, K., & Li, Y., et al. (2016). Instance-sensitive fully convolutional networks. In ECCV.
Dai, J., He, K., & Sun, J. (2015). BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV.
Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In CVPR.
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
Dvornik, N., Mairal, J., & Schmid, C. (2018). Modeling visual context is key to augmenting object detection datasets. In ECCV.
Dwibedi, D., Misra, I., & Hebert, M. (2017). Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV.
Everingham, M., Van Gool, L., & Williams, C.K., et al. (2010). The Pascal visual object classes (VOC) challenge. IJCV.
Fang, H.S., Sun, J., & Wang, R., et al. (2019). InstaBoost: Boosting instance segmentation via probability map guided copy-pasting. In Proceedings of the IEEE international conference on computer vision (pp. 682–691).
Fang, H.S., Xie, S., & Tai, Y.W., et al. (2017). RMPE: Regional multi-person pose estimation. In ICCV.
Fang, Y., Yang, S., & Wang, X., et al. (2021). Instances as queries. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6910–6919).
Field, D.J., Hayes, A., & Hess, R.F. (1993). Contour integration by the human visual system: evidence for a local “association field”. Vision Research.
Ge, Z., Liu, S., & Wang, F., et al. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430.
Ghiasi, G., Cui, Y., & Srinivas, A., et al. (2021). Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR.
Girshick, R. (2015). Fast R-CNN. In ICCV.
Girshick, R., Donahue, J., & Darrell, T., et al. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
Gong, K., Liang, X., & Zhang, D., et al. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Hariharan, B., Arbelaez, P., & Bourdev, L., et al. (2011). Semantic contours from inverse detectors. In ICCV.
He, K., Gkioxari, G., & Dollár, P., et al. (2017). Mask R-CNN. In ICCV.
He, K., Rhemann, C., & Rother, C., et al. (2011). A global sampling method for alpha matting. In CVPR.
Khoreva, A., Benenson, R., Ilg, E., et al. (2018). Lucid data dreaming for video object segmentation. International Journal of Computer Vision, 1, 1–23.
Kingma, D.P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kirillov, A., He, K., & Girshick, R., et al. (2019). Panoptic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9404–9413).
Kirillov, A., Wu, Y., & He, K., et al. (2020). PointRend: Image segmentation as rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9799–9808).
Ku, J., Mozifian, M., & Lee, J., et al. (2018). Joint 3D proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1–8). IEEE.
Lang, A.H., Vora, S., & Caesar, H., et al. (2019). PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 12697–12705).
Law, H., & Deng, J. (2018). CornerNet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV) (pp. 734–750).
Leng, Z., Cheng, S., & Caine, B., et al. (2022). PseudoAugment: Learning to use unlabeled data for data augmentation in point clouds. In European conference on computer vision (pp. 555–572). Springer.
Li, Q., Arnab, A., & Torr, P.H. (2018). Weakly- and semi-supervised panoptic segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 102–118).
Li, Y., Chen, X., & Zhu, Z., et al. (2019). Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7026–7035).
Li, Y., Qi, H., & Dai, J., et al. (2017). Fully convolutional instance-aware semantic segmentation. In CVPR.
Li, J., Wang, C., & Zhu, H., et al. (2019). CrowdPose: Efficient crowded scenes pose estimation and a new benchmark. In CVPR.
Li, Y.L., Xu, L., & Liu, X., et al. (2020). PastaNet: Toward human activity knowledge engine. In CVPR.
Li, Y.L., Zhou, S., & Huang, X., et al. (2019). Transferable interactiveness knowledge for human-object interaction detection. In CVPR.
Liang, M., Yang, B., & Wang, S., et al. (2018). Deep continuous fusion for multi-sensor 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 641–656).
Lin, D., Dai, J., & Jia, J., et al. (2016). ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR.
Lin, T.Y., Dollár, P., & Girshick, R., et al. (2017). Feature pyramid networks for object detection. In CVPR.
Lin, T.Y., Maire, M., & Belongie, S., et al. (2014). Microsoft COCO: Common objects in context. In ECCV.
Lin, L., Xu, Y., & Liang, X., et al. (2014). Complex background subtraction by pursuing dynamic spatio-temporal models. TIP.
Liu, W., Anguelov, D., & Erhan, D., et al. (2016). SSD: Single shot multibox detector. In ECCV.
Lu, X., Wang, W., & Ma, C., et al. (2019). See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In CVPR.
Lucas, C. (2019). Point cloud hole filling. https://github.com/Geodan/fill-holes-pointcloud.
Meng, D., Chen, X., & Fan, Z., et al. (2021). Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3651–3660).
MS COCO. (2018). MS COCO detection leaderboard. http://cocodataset.org/#detection-leaderboard.
Nako, R., Wu, R., & Eimer, M. (2014). Rapid guidance of visual search by object categories. Journal of Experimental Psychology: Human Perception and Performance, 40(1), 50.
Nako, R., Wu, R., Smith, T. J., et al. (2014). Item and category-based attentional control during search for real-world objects: Can you find the pants among the pans? Journal of Experimental Psychology: Human Perception and Performance, 40(4), 1283.
Pang, J., Chen, K., & Shi, J., et al. (2019). Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 821–830).
Pinheiro, P.O., Collobert, R., & Dollár, P. (2015). Learning to segment object candidates. In NIPS.
Qi, C.R., Litany, O., He, K., et al. (2019). Deep Hough voting for 3D object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision (pp. 9277–9286).
Qi, S., Wang, W., & Jia, B., et al. (2018). Learning human-object interactions by graph parsing neural networks. In CVPR.
Redmon, J., Divvala, S., & Girshick, R., et al. (2016). You only look once: Unified, real-time object detection. In CVPR.
Remez, T., Huang, J., & Brown, M. (2018). Learning to segment via cut-and-paste. In Proceedings of the European conference on computer vision (ECCV) (pp. 37–52).
Ren, S., He, K., & Girshick, R., et al. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
Richtsfeld, A., Mörwald, T., & Prankl, J., et al. (2012). Segmentation of unknown objects in indoor environments. In IROS.
Shao, D., Xiong, Y., & Zhao, Y., et al. (2018). Find and focus: Retrieve and localize video events with natural language queries. In ECCV.
Shi, S., Guo, C., & Jiang, L., et al. (2020). PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10529–10538).
Shi, S., Wang, X., & Li, H. (2019). PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–779).
Song, S., & Xiao, J. (2016). Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 808–816).
Song, S., Lichtenberg, S.P., & Xiao, J. (2015). SUN RGB-D: An RGB-D scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 567–576).
Sun, J., Fang, H.S., & Zhu, X., et al. (2022). Correlation field for boosting 3D object detection in structured scenes. In Proceedings of the AAAI conference on artificial intelligence.
Sun, J., Li, Y., & Chai, L., et al. (2022). Human trajectory prediction with momentary observation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6467–6476).
Sun, P., Zhang, R., & Jiang, Y., et al. (2021). Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14454–14463).
Tian, Z., Shen, C., & Chen, H., et al. (2019). FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9627–9636).
Tian, Z., Zhang, B., Chen, H., et al. (2022). Instance and panoptic segmentation using conditional convolutions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 669–680.
Tsai, Y.H., Shen, X., & Lin, Z., et al. (2017). Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3789–3797).
Wade, N., & Swanston, M. (2013). Visual perception: An introduction.
Wang, X., Kong, T., & Shen, C., et al. (2020). SOLO: Segmenting objects by locations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16 (pp. 649–665). Springer.
Wang, W., Lai, Q., & Fu, H., et al. (2019). Salient object detection in the deep learning era: An in-depth survey. arXiv preprint arXiv:1904.09146.
Wang, F., Wang, H., & Wei, C., et al. (2022). CP2: Copy-paste contrastive pretraining for semantic segmentation. arXiv preprint arXiv:2203.11709.
Wang, W., Shen, J., Yang, R., et al. (2017). Saliency-aware video object segmentation. TPAMI, 40(1), 20–33.
Wang, X., Zhang, R., Kong, T., et al. (2020). SOLOv2: Dynamic and fast instance segmentation. Advances in Neural Information Processing Systems, 33, 17721–17732.
Wu, R., McGee, B., & Echiverri, C., et al. (2018). Prior knowledge of category size impacts visual search. Psychophysiology, 55(8), e13075.
Xiao, A., Huang, J., & Guan, D., et al. (2022). PolarMix: A general data augmentation technique for LiDAR point clouds. arXiv preprint arXiv:2208.00223.
Xie, S., Gu, J., & Guo, D., et al. (2020). PointContrast: Unsupervised pre-training for 3D point cloud understanding. In European Conference on Computer Vision (pp. 574–591). Springer.
Xie, Q., Lai, Y.K., & Wu, J., et al. (2020). MLCVNet: Multi-level context VoteNet for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10447–10456).
Xiong, Y., Liao, R., & Zhao, H., et al. (2019). UPSNet: A unified panoptic segmentation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8818–8826).
Xu, Z., Meng, A., & Shi, Z., et al. (2021). Continuous copy-paste for one-stage multi-object tracking and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15323–15332).
Xu, W., Wang, H., & Qi, F., et al. (2019). Explicit shape encoding for real-time instance segmentation. arXiv preprint arXiv:1908.04067.
Yang, B., Liang, M., & Urtasun, R. (2018). HDNet: Exploiting HD maps for 3D object detection. In Conference on Robot Learning (pp. 146–155).
Yang, B., Luo, W., & Urtasun, R. (2018). PIXOR: Real-time 3D object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 7652–7660).
Yang, J., Zhang, Y., Liang, Y., et al. (2021). TumorCP: A simple but effective object-level data augmentation for tumor segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 579–588). Springer.
Yan, Y., Mao, Y., & Li, B. (2018). SECOND: Sparsely embedded convolutional detection. Sensors, 18(10), 3337.
Yin, T., Zhou, X., & Krähenbühl, P. (2021). Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11784–11793).
Zhang, Z., Sun, B., & Yang, H., et al. (2020). H3DNet: 3D object detection using hybrid geometric primitives. In European Conference on Computer Vision (pp. 311–329). Springer.
Zhou, Y., & Tuzel, O. (2018). VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4490–4499).
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850.
Zhou, X., Girdhar, R., Joulin, A., et al. (2022). Detecting twenty-thousand classes using image-level supervision. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX (pp. 350–368). Springer.
Zhu, S.C., & Mumford, D. (2007). A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision.
Zhu, X., Su, W., & Lu, L., et al. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.
Acknowledgements
This work is supported in part by the National Key R&D Program of China, No. 2017YFA0700800, and in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0102.
Additional information
Communicated by Vittorio Ferrari.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, J., Fang, HS., Li, Y. et al. InstaBoost++: Visual Coherence Principles for Unified 2D/3D Instance Level Data Augmentation. Int J Comput Vis 131, 2665–2681 (2023). https://doi.org/10.1007/s11263-023-01807-9