CI-Net: a joint depth estimation and semantic segmentation network using contextual information

Applied Intelligence

Abstract

Monocular depth estimation and semantic segmentation are two fundamental goals of scene understanding. Because the two tasks benefit from interaction, many works have studied joint-task learning. However, most existing methods fail to fully leverage the semantic labels: they ignore the context structures the labels provide and use them only to supervise the segmentation prediction, which limits the performance of both tasks. In this paper, we propose a network injected with contextual information (CI-Net) to solve this problem. Specifically, we introduce a self-attention block in the encoder to generate an attention map. Supervised by the ideal attention map created from the semantic labels, the network is embedded with contextual information, so that it can understand the scene better and use correlated features to make accurate predictions. In addition, a feature-sharing module (FSM) is constructed to deeply fuse the task-specific features, and a consistency loss is devised to ensure that the features guide each other. We extensively evaluate the proposed CI-Net on the NYU-Depth-v2, SUN-RGBD, and Cityscapes datasets. The experimental results validate that CI-Net effectively improves the accuracy of both semantic segmentation and depth estimation.
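
The abstract does not say how the ideal attention map is built from the semantic labels. One common construction, assumed here purely for illustration, marks a pair of pixels as related when both carry the same class label; a predicted attention map can then be compared against it with binary cross-entropy. The function names `ideal_attention_map` and `attention_bce`, the `ignore_label` parameter, and the choice of loss are all hypothetical and are not taken from the paper.

```python
import math

# Illustrative sketch only: this construction is an assumption,
# not the paper's confirmed formulation.


def ideal_attention_map(labels, ignore_label=None):
    """Build a binary pixel-to-pixel map from a 2-D grid of class labels.

    Entry [i][j] is 1 when flattened pixels i and j share the same class,
    and 0 otherwise. Pixels equal to `ignore_label` are related to nothing.
    """
    flat = [c for row in labels for c in row]  # row-major flattening
    n = len(flat)
    return [
        [
            1 if flat[i] == flat[j] and flat[i] != ignore_label else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


def attention_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted map and the ideal map.

    `pred` holds probabilities in [0, 1]; `target` holds 0/1 entries.
    """
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


# A 2x2 label grid with classes 0, 0, 1, 2 (flattened row by row).
ideal = ideal_attention_map([[0, 0], [1, 2]])
# An uninformative prediction of 0.5 everywhere, for comparison.
uniform = [[0.5] * 4 for _ in range(4)]
loss = attention_bce(uniform, ideal)
```

In practice the label grid would be downsampled to the resolution of the attention map before this step, since the map has one row and one column per pixel and grows quadratically with the number of pixels.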

Author information

Corresponding authors

Correspondence to Wu Wei or Zhun Fan.

Ethics declarations

Conflict of Interests

Not available.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gao, T., Wei, W., Cai, Z. et al. CI-Net: a joint depth estimation and semantic segmentation network using contextual information. Appl Intell 52, 18167–18186 (2022). https://doi.org/10.1007/s10489-022-03401-x

  • DOI: https://doi.org/10.1007/s10489-022-03401-x
