InstaBoost++: Visual Coherence Principles for Unified 2D/3D Instance Level Data Augmentation

  • Published in: International Journal of Computer Vision

Abstract

Instance-level perception tasks such as object detection, instance segmentation, and 3D detection require large amounts of training data to reach satisfactory performance. The meticulous labels these tasks demand are expensive to obtain, which makes data augmentation a natural remedy; yet instance-level augmentation has received comparatively little attention in prior research. In this paper, we present an effective, efficient, and unified crop-paste mechanism that augments the training set using existing instance-level annotations. Our design is grounded in visual coherence and exploits three principles that hold widely in real-world data: (i) background coherence in the local neighborhood, (ii) appearance coherence for instance placement, and (iii) instance coherence within the same category. The methodology applies uniformly to object detection, instance segmentation, and 3D detection. Extensive experiments demonstrate that the proposed approach boosts the performance of diverse frameworks on multiple datasets and tasks without modifying the network structure, yielding remarkable improvements: 5.1 mAP for object detection and 3.2 mAP for instance segmentation on the COCO dataset, and 6.9 mAP for 3D detection on ScanNetV2. Our method integrates easily into different frameworks without affecting training or inference efficiency.
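To make the crop-paste mechanism concrete, here is a minimal sketch (our illustration, not the authors' released code) of the background-coherence principle in its simplest form: an annotated instance is cropped out and re-pasted within a small local neighborhood, so it stays over visually similar background. It assumes HxWx3 uint8 numpy images and boolean instance masks; the function name, the uniform jitter, and the median fill are hypothetical stand-ins for the paper's appearance-coherence probability map and proper matting/inpainting.

    import numpy as np

    def crop_paste_instance(image, mask, max_jitter=0.15, rng=None):
        """Crop one annotated instance and re-paste it nearby.

        Illustrative sketch: a real pipeline would sample the offset
        from an appearance-coherence heatmap and inpaint the hole
        properly instead of using a crude median fill.
        """
        rng = np.random.default_rng() if rng is None else rng
        h, w = mask.shape

        # Instance extent from its mask (np.nonzero and boolean
        # indexing below share the same row-major pixel order).
        ys, xs = np.nonzero(mask)
        box_h = ys.max() - ys.min() + 1
        box_w = xs.max() - xs.min() + 1

        # Small offset proportional to instance size: the local-
        # neighborhood (background-coherence) prior.
        dy = int(rng.uniform(-max_jitter, max_jitter) * box_h)
        dx = int(rng.uniform(-max_jitter, max_jitter) * box_w)

        # Crop the instance pixels, then crudely fill the hole.
        instance = image[mask].copy()
        out = image.copy()
        out[mask] = np.median(image, axis=(0, 1)).astype(image.dtype)

        # Translate the mask, dropping pixels shifted out of bounds,
        # and paste the surviving instance pixels back.
        ys2, xs2 = ys + dy, xs + dx
        keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
        new_mask = np.zeros_like(mask)
        new_mask[ys2[keep], xs2[keep]] = True
        out[ys2[keep], xs2[keep]] = instance[keep]
        return out, new_mask

In the full approach described by the abstract, candidate paste locations would additionally be weighted by how coherent the instance appears against the new background before sampling, and the same crop-paste logic carries over to 3D detection by translating instance point clouds within a scene rather than pixels.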



Data Availability

The data that support the findings of this study are available from VOC [22], VOCSDS [33], COCO [53], SUN RGB-D [73], and ScanNetV2 [15], which were used under their respective licenses; the datasets can be obtained from their corresponding websites.

References

  • Arnheim, R. (1969). Visual thinking.

  • Bearman, A., Russakovsky, O., & Ferrari, V., et al. (2016). What’s the point: Semantic segmentation with point supervision. In ECCV.

  • Bertalmio, M., Bertozzi, A.L., & Sapiro, G. (2001). Navier–Stokes, fluid dynamics, and image and video inpainting. In CVPR.

  • Bleau, A., & Leon, L.J. (2000). Watershed-based segmentation and region merging. Computer Vision and Image Understanding.

  • Bolya, D., Zhou, C., & Xiao, F., et al. (2019). Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9157–9166).

  • Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In CVPR.

  • Carion, N., Massa, F., & Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I (pp. 213–229). Springer.

  • Chen, L.C., Hermans, A., & Papandreou, G., et al. (2018). Masklab: Instance segmentation by refining object detection with semantic and direction features. In CVPR.

  • Chen, X., Ma, H., & Wan, J., et al. (2017). Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1907–1915).

  • Chen, K., Pang, J., & Wang, J., et al. (2018). mmdetection. https://github.com/open-mmlab/mmdetection.

  • Chen, T., Saxena, S., & Li, L., et al. (2021). Pix2seq: A language modeling framework for object detection. In International conference on learning representations.

  • Cheng, B., Misra, I., & Schwing, A.G., et al. (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1290–1299).

  • Cordts, M., Omran, M., & Ramos, S., et al. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR.

  • Dai, A., Chang, A.X., & Savva, M., et al. (2017). Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828–5839).

  • Dai, J., He, K., & Li, Y., et al. (2016). Instance-sensitive fully convolutional networks. In ECCV.

  • Dai, J., He, K., & Sun, J. (2015). Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV.

  • Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In CVPR.

  • Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real NVP. arXiv preprint arXiv:1605.08803.

  • Dvornik, N., Mairal, J., & Schmid, C. (2018). Modeling visual context is key to augmenting object detection datasets. In ECCV.

  • Dwibedi, D., Misra, I., & Hebert, M. (2017). Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV.

  • Everingham, M., Van Gool, L., & Williams, C.K., et al. (2010). The PASCAL visual object classes (VOC) challenge. IJCV.

  • Fang, H.S., Sun, J., & Wang, R., et al. (2019). Instaboost: Boosting instance segmentation via probability map guided copy-pasting. In Proceedings of the IEEE international conference on computer vision (pp. 682–691).

  • Fang, H.S., Xie, S., & Tai, Y.W., et al. (2017). Rmpe: Regional multi-person pose estimation. In ICCV.

  • Fang, Y., Yang, S., & Wang, X., et al. (2021). Instances as queries. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6910–6919).

  • Field, D.J., Hayes, A., & Hess, R.F. (1993). Contour integration by the human visual system: evidence for a local “association field”. Vision Research.

  • Ge, Z., Liu, S., & Wang, F., et al. (2021). Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430.

  • Ghiasi, G., Cui, Y., & Srinivas, A., et al. (2021). Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR.

  • Girshick, R. (2015). Fast R-CNN. In ICCV.

  • Girshick, R., Donahue, J., & Darrell, T., et al. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.

  • Gong, K., Liang, X., & Zhang, D., et al. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.

  • Hariharan, B., Arbelaez, P., & Bourdev, L., et al. (2011). Semantic contours from inverse detectors. In ICCV.

  • He, K., Gkioxari, G., & Dollár, P., et al. (2017). Mask R-CNN. In ICCV.

  • He, K., Rhemann, C., & Rother, C., et al. (2011). A global sampling method for alpha matting. In CVPR.

  • Khoreva, A., Benenson, R., Ilg, E., et al. (2018). Lucid data dreaming for video object segmentation. International Journal of Computer Vision, 1, 1–23.

  • Kingma, D.P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

  • Kirillov, A., He, K., & Girshick, R., et al. (2019). Panoptic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9404–9413).

  • Kirillov, A., Wu, Y., & He, K., et al. (2020). Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9799–9808).

  • Ku, J., Mozifian, M., & Lee, J., et al. (2018). Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1–8). IEEE.

  • Lang, A.H., Vora, S., & Caesar, H., et al. (2019). Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 12697–12705).

  • Law, H., & Deng, J. (2018). Cornernet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV) (pp. 734–750).

  • Leng, Z., Cheng, S., & Caine, B., et al. (2022). Pseudoaugment: Learning to use unlabeled data for data augmentation in point clouds. In European conference on computer vision (pp. 555–572). Springer.

  • Li, Q., Arnab, A., & Torr, P.H. (2018). Weakly-and semi-supervised panoptic segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 102–118).

  • Li, Y., Chen, X., & Zhu, Z., et al. (2019). Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7026–7035).

  • Li, Y., Qi, H., & Dai, J., et al. (2017). Fully convolutional instance-aware semantic segmentation. In CVPR.

  • Li, J., Wang, C., & Zhu, H., et al. (2019). Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In CVPR.

  • Li, Y.L., Xu, L., & Liu, X., et al. (2020). Pastanet: Toward human activity knowledge engine. In CVPR.

  • Li, Y.L., Zhou, S., & Huang, X., et al. (2019). Transferable interactiveness knowledge for human-object interaction detection. In CVPR.

  • Liang, M., Yang, B., & Wang, S., et al. (2018). Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 641–656).

  • Lin, D., Dai, J., & Jia, J., et al. (2016). Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR.

  • Lin, T.Y., Dollar, P., & Girshick, R., et al. (2017). Feature pyramid networks for object detection. In CVPR.

  • Lin, T.Y., Maire, M., & Belongie, S., et al. (2014). Microsoft COCO: Common objects in context. In ECCV.

  • Lin, L., Xu, Y., & Liang, X., et al. (2014). Complex background subtraction by pursuing dynamic spatio-temporal models. TIP.

  • Liu, W., Anguelov, D., & Erhan, D., et al. (2016). Ssd: Single shot multibox detector. In ECCV.

  • Lu, X., Wang, W., & Ma, C., et al. (2019). See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In CVPR.

  • Lucas, C. (2019). Point cloud hole filling. https://github.com/Geodan/fill-holes-pointcloud.

  • Meng, D., Chen, X., & Fan, Z., et al. (2021). Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3651–3660).

  • MS COCO. (2018). MS COCO detection leaderboard. http://cocodataset.org/#detection-leaderboard.

  • Nako, R., Wu, R., & Eimer, M. (2014). Rapid guidance of visual search by object categories. Journal of Experimental Psychology: Human Perception and Performance, 40(1), 50.

  • Nako, R., Wu, R., Smith, T. J., et al. (2014). Item and category-based attentional control during search for real-world objects: Can you find the pants among the pans? Journal of Experimental Psychology: Human Perception and Performance, 40(4), 1283.

  • Pang, J., Chen, K., & Shi, J., et al. (2019). Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 821–830).

  • Pinheiro, P.O., Collobert, R., & Dollár, P. (2015). Learning to segment object candidates. In NIPS.

  • Qi, C.R., Litany, O., He, K., et al. (2019). Deep Hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision (pp. 9277–9286).

  • Qi, S., Wang, W., & Jia, B., et al. (2018). Learning human-object interactions by graph parsing neural networks. In CVPR.

  • Redmon, J., Divvala, S., & Girshick, R., et al. (2016). You only look once: Unified, real-time object detection. In CVPR.

  • Remez, T., Huang, J., & Brown, M. (2018). Learning to segment via cut-and-paste. In Proceedings of the European conference on computer vision (ECCV) (pp. 37–52).

  • Ren, S., He, K., & Girshick, R., et al. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.

  • Richtsfeld, A., Mörwald, T., & Prankl, J., et al. (2012). Segmentation of unknown objects in indoor environments. In IROS.

  • Shao, D., Xiong, Y., & Zhao, Y., et al. (2018). Find and focus: Retrieve and localize video events with natural language queries. In ECCV.

  • Shi, S., Guo, C., & Jiang, L., et al. (2020). PV-RCNN: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10529–10538).

  • Shi, S., Wang, X., & Li, H. (2019). PointRCNN: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–779).

  • Song, S., & Xiao, J. (2016). Deep sliding shapes for amodal 3d object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 808–816).

  • Song, S., Lichtenberg, S.P., & Xiao, J. (2015). SUN RGB-D: An RGB-D scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 567–576).

  • Sun, J., Fang, H.S., & Zhu, X., et al. (2022). Correlation field for boosting 3d object detection in structured scenes. In Proceedings of the AAAI conference on artificial intelligence.

  • Sun, J., Li, Y., & Chai, L., et al. (2022). Human trajectory prediction with momentary observation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6467–6476).

  • Sun, P., Zhang, R., & Jiang, Y., et al. (2021). Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14454–14463).

  • Tian, Z., Shen, C., & Chen, H., et al. (2019). Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9627–9636).

  • Tian, Z., Zhang, B., Chen, H., et al. (2022). Instance and panoptic segmentation using conditional convolutions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 669–680.

  • Tsai, Y.H., Shen, X., & Lin, Z., et al. (2017). Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3789–3797).

  • Wade, N., & Swanston, M. (2013). Visual perception: An introduction.

  • Wang, X., Kong, T., & Shen, C., et al. (2020). Solo: Segmenting objects by locations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII (pp. 649–665). Springer.

  • Wang, W., Lai, Q., & Fu, H., et al. (2019). Salient object detection in the deep learning era: An in-depth survey. arXiv preprint arXiv:1904.09146.

  • Wang, F., Wang, H., & Wei, C., et al. (2022). Cp2: Copy-paste contrastive pretraining for semantic segmentation. arXiv preprint arXiv:2203.11709.

  • Wang, W., Shen, J., Yang, R., et al. (2017). Saliency-aware video object segmentation. TPAMI, 40(1), 20–33.

  • Wang, X., Zhang, R., Kong, T., et al. (2020). Solov2: Dynamic and fast instance segmentation. Advances in Neural Information Processing Systems, 33, 17721–17732.

  • Wu, R., McGee, B., & Echiverri, C., et al. (2018). Prior knowledge of category size impacts visual search. Psychophysiology, 55(8), e13075.

  • Xiao, A., Huang, J., & Guan, D., et al. (2022). Polarmix: A general data augmentation technique for lidar point clouds. arXiv preprint arXiv:2208.00223.

  • Xie, S., Gu, J., & Guo, D., et al. (2020). Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In European Conference on Computer Vision (pp. 574–591). Springer.

  • Xie, Q., Lai, Y.K., & Wu, J., et al. (2020). Mlcvnet: Multi-level context votenet for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10447–10456).

  • Xiong, Y., Liao, R., & Zhao, H., et al. (2019). Upsnet: A unified panoptic segmentation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8818–8826).

  • Xu, Z., Meng, A., & Shi, Z., et al. (2021). Continuous copy-paste for one-stage multi-object tracking and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15323–15332).

  • Xu, W., Wang, H., & Qi, F., et al. (2019). Explicit shape encoding for real-time instance segmentation. arXiv preprint arXiv:1908.04067.

  • Yang, B., Liang, M., & Urtasun, R. (2018). Hdnet: Exploiting hd maps for 3d object detection. In Conference on Robot Learning (pp. 146–155).

  • Yang, B., Luo, W., & Urtasun, R. (2018). Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 7652–7660).

  • Yang, J., Zhang, Y., Liang, Y., et al. (2021). Tumorcp: A simple but effective object-level data augmentation for tumor segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 579–588). Springer.

  • Yan, Y., Mao, Y., & Li, B. (2018). Second: Sparsely embedded convolutional detection. Sensors, 18(10), 3337.

  • Yin, T., Zhou, X., & Krahenbuhl, P. (2021). Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11784–11793).

  • Zhang, Z., Sun, B., & Yang, H., et al. (2020). H3dnet: 3d object detection using hybrid geometric primitives. In European Conference on Computer Vision (pp. 311–329). Springer.

  • Zhou, Y., & Tuzel, O. (2018). Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4490–4499).

  • Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850.

  • Zhou, X., Girdhar, R., Joulin, A., et al. (2022). Detecting twenty-thousand classes using image-level supervision. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX (pp. 350–368). Springer.

  • Zhu, S.C., & Mumford, D. (2007). A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision.

  • Zhu, X., Su, W., & Lu, L., et al. (2020). Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.

Download references

Acknowledgements

This work is supported in part by the National Key R&D Program of China (No. 2017YFA0700800) and in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0102.

Author information

Corresponding author

Correspondence to Cewu Lu.

Additional information

Communicated by Vittorio Ferrari.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, J., Fang, H.S., Li, Y., et al. InstaBoost++: Visual Coherence Principles for Unified 2D/3D Instance Level Data Augmentation. Int J Comput Vis 131, 2665–2681 (2023). https://doi.org/10.1007/s11263-023-01807-9

