Abstract
Instance-level perception tasks such as object detection, instance segmentation, and 3D detection require large numbers of training samples to reach satisfactory performance. Because the fine-grained labels these tasks demand are expensive to obtain, data augmentation is a natural way to tackle the problem; however, instance-level augmentation has received little attention in prior research. In this paper, we present an effective, efficient, and unified crop-paste mechanism that augments the training set using existing instance-level annotations. Our design is derived from visual coherence and builds on three inherent principles that hold widely in real-world data: (i) background coherence in the local neighborhood, (ii) appearance coherence for instance placement, and (iii) instance coherence within the same category. These principles are unified across tasks, covering object detection, instance segmentation, and 3D detection. Extensive experiments demonstrate that our approach boosts the performance of diverse frameworks on multiple datasets and tasks without modifying the network structure, yielding remarkable improvements: 5.1 mAP for object detection and 3.2 mAP for instance segmentation on the COCO dataset, and 6.9 mAP for 3D detection on the ScanNetV2 dataset. Our method can be easily integrated into different frameworks without affecting training or inference efficiency.
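To make the crop-paste idea concrete, below is a minimal 2D sketch in NumPy. It is our own illustrative simplification, not the paper's released implementation: it only shows the local-neighborhood placement suggested by the background-coherence principle, and the helper names `paste_instance` and `sample_offset` are hypothetical. A full system would additionally restore the background behind the moved instance (e.g., via inpainting) and score candidate locations for appearance coherence.

```python
import numpy as np

def sample_offset(max_shift, rng):
    # A small translation keeps the pasted instance inside the local
    # neighborhood of its original position (background coherence).
    return rng.integers(-max_shift, max_shift + 1, size=2)

def paste_instance(image, mask, max_shift=30, rng=None):
    """Crop an instance via its binary mask and re-paste it at a nearby
    jittered location; returns the augmented image and the new mask.

    image: (H, W, 3) uint8 array; mask: (H, W) bool array.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = mask.shape
    ys, xs = np.nonzero(mask)           # pixel coordinates of the instance
    dy, dx = sample_offset(max_shift, rng)
    new_ys = np.clip(ys + dy, 0, H - 1)
    new_xs = np.clip(xs + dx, 0, W - 1)

    out_img = image.copy()
    out_mask = np.zeros_like(mask)
    out_img[new_ys, new_xs] = image[ys, xs]   # paste the cropped pixels
    out_mask[new_ys, new_xs] = True           # mask moves with the pixels
    return out_img, out_mask
```

For a COCO-style sample, one would decode the instance polygon into a binary mask, call `paste_instance`, and recompute the bounding box from the returned mask so that box, mask, and image stay consistent.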
Data Availability
The data that support the findings of this study are available from VOC [22], VOCSDS [33], COCO [53], SUN RGB-D [73], and ScanNetV2 [15], which were used under their respective licenses for the current study. The data are available from the corresponding websites.
References
Arnheim, R. (1969). Visual thinking.
Bearman, A., Russakovsky, O., & Ferrari, V., et al. (2016). What’s the point: Semantic segmentation with point supervision. In ECCV.
Bertalmio, M., Bertozzi, A.L., & Sapiro, G. (2001). Navier–Stokes, fluid dynamics, and image and video inpainting. In CVPR.
Bleau, A., & Leon, L.J. (2000). Watershed-based segmentation and region merging. Computer Vision and Image Understanding.
Bolya, D., Zhou, C., & Xiao, F., et al. (2019). YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9157–9166).
Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In CVPR.
Carion, N., Massa, F., & Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16 (pp. 213–229). Springer.
Chen, L.C., Hermans, A., & Papandreou, G., et al. (2018). MaskLab: Instance segmentation by refining object detection with semantic and direction features. In CVPR.
Chen, X., Ma, H., & Wan, J., et al. (2017). Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1907–1915).
Chen, K., Pang, J., & Wang, J., et al. (2018). MMDetection. https://github.com/open-mmlab/mmdetection.
Chen, T., Saxena, S., & Li, L., et al. (2021). Pix2seq: A language modeling framework for object detection. In International conference on learning representations.
Cheng, B., Misra, I., & Schwing, A.G., et al. (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1290–1299).
Cordts, M., Omran, M., & Ramos, S., et al. (2016). The Cityscapes dataset for semantic urban scene understanding. In CVPR.
Dai, A., Chang, A.X., & Savva, M., et al. (2017). ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828–5839).
Dai, J., He, K., & Li, Y., et al. (2016). Instance-sensitive fully convolutional networks. In ECCV.
Dai, J., He, K., & Sun, J. (2015). BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV.
Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In CVPR.
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
Dvornik, N., Mairal, J., & Schmid, C. (2018). Modeling visual context is key to augmenting object detection datasets. In ECCV.
Dwibedi, D., Misra, I., & Hebert, M. (2017). Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV.
Everingham, M., Van Gool, L., & Williams, C.K., et al. (2010). The Pascal visual object classes (VOC) challenge. IJCV.
Fang, H.S., Sun, J., & Wang, R., et al. (2019). InstaBoost: Boosting instance segmentation via probability map guided copy-pasting. In Proceedings of the IEEE international conference on computer vision (pp. 682–691).
Fang, H.S., Xie, S., & Tai, Y.W., et al. (2017). RMPE: Regional multi-person pose estimation. In ICCV.
Fang, Y., Yang, S., & Wang, X., et al. (2021). Instances as queries. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6910–6919).
Field, D.J., Hayes, A., & Hess, R.F. (1993). Contour integration by the human visual system: evidence for a local “association field”. Vision Research.
Ge, Z., Liu, S., & Wang, F., et al. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430.
Ghiasi, G., Cui, Y., & Srinivas, A., et al. (2021). Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR.
Girshick, R. (2015). Fast R-CNN. In ICCV.
Girshick, R., Donahue, J., & Darrell, T., et al. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
Gong, K., Liang, X., & Zhang, D., et al. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Hariharan, B., Arbelaez, P., & Bourdev, L., et al. (2011). Semantic contours from inverse detectors. In ICCV.
He, K., Gkioxari, G., & Dollár, P., et al. (2017). Mask R-CNN. In ICCV.
He, K., Rhemann, C., & Rother, C., et al. (2011). A global sampling method for alpha matting. In CVPR.
Khoreva, A., Benenson, R., Ilg, E., et al. (2018). Lucid data dreaming for video object segmentation. International Journal of Computer Vision, 1, 1–23.
Kingma, D.P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kirillov, A., He, K., & Girshick, R., et al. (2019). Panoptic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9404–9413).
Kirillov, A., Wu, Y., & He, K., et al. (2020). PointRend: Image segmentation as rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9799–9808).
Ku, J., Mozifian, M., & Lee, J., et al. (2018). Joint 3D proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1–8). IEEE.
Lang, A.H., Vora, S., & Caesar, H., et al. (2019). PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 12697–12705).
Law, H., & Deng, J. (2018). CornerNet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV) (pp. 734–750).
Leng, Z., Cheng, S., & Caine, B., et al. (2022). PseudoAugment: Learning to use unlabeled data for data augmentation in point clouds. In European conference on computer vision (pp. 555–572). Springer.
Li, Q., Arnab, A., & Torr, P.H. (2018). Weakly- and semi-supervised panoptic segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 102–118).
Li, Y., Chen, X., & Zhu, Z., et al. (2019). Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7026–7035).
Li, Y., Qi, H., & Dai, J., et al. (2017). Fully convolutional instance-aware semantic segmentation. In CVPR.
Li, J., Wang, C., & Zhu, H., et al. (2019). CrowdPose: Efficient crowded scenes pose estimation and a new benchmark. In CVPR.
Li, Y.L., Xu, L., & Liu, X., et al. (2020). PastaNet: Toward human activity knowledge engine. In CVPR.
Li, Y.L., Zhou, S., & Huang, X., et al. (2019). Transferable interactiveness knowledge for human-object interaction detection. In CVPR.
Liang, M., Yang, B., & Wang, S., et al. (2018). Deep continuous fusion for multi-sensor 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 641–656).
Lin, D., Dai, J., & Jia, J., et al. (2016). ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR.
Lin, T.Y., Dollár, P., & Girshick, R., et al. (2017). Feature pyramid networks for object detection. In CVPR.
Lin, T.Y., Maire, M., & Belongie, S., et al. (2014). Microsoft COCO: Common objects in context. In ECCV.
Lin, L., Xu, Y., & Liang, X., et al. (2014). Complex background subtraction by pursuing dynamic spatio-temporal models. TIP.
Liu, W., Anguelov, D., & Erhan, D., et al. (2016). SSD: Single shot multibox detector. In ECCV.
Lu, X., Wang, W., & Ma, C., et al. (2019). See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In CVPR.
Lucas, C. (2019). Point cloud hole filling. https://github.com/Geodan/fill-holes-pointcloud.
Meng, D., Chen, X., & Fan, Z., et al. (2021). Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3651–3660).
MS COCO. (2018). MS COCO detection leaderboard. http://cocodataset.org/#detection-leaderboard.
Nako, R., Wu, R., & Eimer, M. (2014). Rapid guidance of visual search by object categories. Journal of Experimental Psychology: Human Perception and Performance, 40(1), 50.
Nako, R., Wu, R., Smith, T. J., et al. (2014). Item and category-based attentional control during search for real-world objects: Can you find the pants among the pans? Journal of Experimental Psychology: Human Perception and Performance, 40(4), 1283.
Pang, J., Chen, K., & Shi, J., et al. (2019). Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 821–830).
Pinheiro, P.O., Collobert, R., & Dollár, P. (2015). Learning to segment object candidates. In NIPS.
Qi, C.R., Litany, O., He, K., et al. (2019). Deep Hough voting for 3D object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision (pp. 9277–9286).
Qi, S., Wang, W., & Jia, B., et al. (2018). Learning human-object interactions by graph parsing neural networks. In CVPR.
Redmon, J., Divvala, S., & Girshick, R., et al. (2016). You only look once: Unified, real-time object detection. In CVPR.
Remez, T., Huang, J., & Brown, M. (2018). Learning to segment via cut-and-paste. In Proceedings of the European conference on computer vision (ECCV) (pp. 37–52).
Ren, S., He, K., & Girshick, R., et al. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
Richtsfeld, A., Mörwald, T., & Prankl, J., et al. (2012). Segmentation of unknown objects in indoor environments. In IROS.
Shao, D., Xiong, Y., & Zhao, Y., et al. (2018). Find and focus: Retrieve and localize video events with natural language queries. In ECCV.
Shi, S., Guo, C., & Jiang, L., et al. (2020). PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10529–10538).
Shi, S., Wang, X., & Li, H. (2019). PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–779).
Song, S., & Xiao, J. (2016). Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 808–816).
Song, S., Lichtenberg, S.P., & Xiao, J. (2015). SUN RGB-D: An RGB-D scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 567–576).
Sun, J., Fang, H.S., & Zhu, X., et al. (2022). Correlation field for boosting 3D object detection in structured scenes. In Proceedings of the AAAI conference on artificial intelligence.
Sun, J., Li, Y., & Chai, L., et al. (2022). Human trajectory prediction with momentary observation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6467–6476).
Sun, P., Zhang, R., & Jiang, Y., et al. (2021). Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14454–14463).
Tian, Z., Shen, C., & Chen, H., et al. (2019). FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9627–9636).
Tian, Z., Zhang, B., Chen, H., et al. (2022). Instance and panoptic segmentation using conditional convolutions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 669–680.
Tsai, Y.H., Shen, X., & Lin, Z., et al. (2017). Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3789–3797).
Wade, N., & Swanston, M. (2013). Visual perception: An introduction.
Wang, X., Kong, T., & Shen, C., et al. (2020). SOLO: Segmenting objects by locations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16 (pp. 649–665). Springer.
Wang, W., Lai, Q., & Fu, H., et al. (2019). Salient object detection in the deep learning era: An in-depth survey. arXiv preprint arXiv:1904.09146.
Wang, F., Wang, H., & Wei, C., et al. (2022). CP2: Copy-paste contrastive pretraining for semantic segmentation. arXiv preprint arXiv:2203.11709.
Wang, W., Shen, J., Yang, R., et al. (2017). Saliency-aware video object segmentation. TPAMI, 40(1), 20–33.
Wang, X., Zhang, R., Kong, T., et al. (2020). SOLOv2: Dynamic and fast instance segmentation. Advances in Neural Information Processing Systems, 33, 17721–17732.
Wu, R., McGee, B., & Echiverri, C., et al. (2018). Prior knowledge of category size impacts visual search. Psychophysiology, 55(8), e13075.
Xiao, A., Huang, J., & Guan, D., et al. (2022). PolarMix: A general data augmentation technique for LiDAR point clouds. arXiv preprint arXiv:2208.00223.
Xie, S., Gu, J., & Guo, D., et al. (2020). PointContrast: Unsupervised pre-training for 3D point cloud understanding. In European Conference on Computer Vision (pp. 574–591). Springer.
Xie, Q., Lai, Y.K., & Wu, J., et al. (2020). MLCVNet: Multi-level context VoteNet for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10447–10456).
Xiong, Y., Liao, R., & Zhao, H., et al. (2019). UPSNet: A unified panoptic segmentation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8818–8826).
Xu, Z., Meng, A., & Shi, Z., et al. (2021). Continuous copy-paste for one-stage multi-object tracking and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15323–15332).
Xu, W., Wang, H., & Qi, F., et al. (2019). Explicit shape encoding for real-time instance segmentation. arXiv preprint arXiv:1908.04067.
Yang, B., Liang, M., & Urtasun, R. (2018). HDNet: Exploiting HD maps for 3D object detection. In Conference on Robot Learning (pp. 146–155).
Yang, B., Luo, W., & Urtasun, R. (2018). PIXOR: Real-time 3D object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 7652–7660).
Yang, J., Zhang, Y., Liang, Y., et al. (2021). TumorCP: A simple but effective object-level data augmentation for tumor segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 579–588). Springer.
Yan, Y., Mao, Y., & Li, B. (2018). SECOND: Sparsely embedded convolutional detection. Sensors, 18(10), 3337.
Yin, T., Zhou, X., & Krähenbühl, P. (2021). Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11784–11793).
Zhang, Z., Sun, B., & Yang, H., et al. (2020). H3DNet: 3D object detection using hybrid geometric primitives. In European Conference on Computer Vision (pp. 311–329). Springer.
Zhou, Y., & Tuzel, O. (2018). VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4490–4499).
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850.
Zhou, X., Girdhar, R., Joulin, A., et al. (2022). Detecting twenty-thousand classes using image-level supervision. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX (pp. 350–368). Springer.
Zhu, S.C., & Mumford, D. (2007). A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision.
Zhu, X., Su, W., & Lu, L., et al. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.
Acknowledgements
This work is supported in part by the National Key R&D Program of China, No. 2017YFA0700800, and in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0102.
Additional information
Communicated by Vittorio Ferrari.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, J., Fang, HS., Li, Y. et al. InstaBoost++: Visual Coherence Principles for Unified 2D/3D Instance Level Data Augmentation. Int J Comput Vis 131, 2665–2681 (2023). https://doi.org/10.1007/s11263-023-01807-9