Abstract
We present a new self-supervised monocular depth estimation method with multi-scale texture detail enhancement. Based on the observation that image texture details and semantic information are essential to depth estimation, we propose to provide both to the network so that it learns sharper and more structurally complete depth. First, we generate filtered images and detail images by multi-scale decomposition and use a deep neural network to automatically learn their weights, constructing a texture-detail-enhanced image. Then, we incorporate semantic features by feeding deep features from the VGG-19 network into a self-attention network, which guides the depth decoder to focus on the integrity of objects in the scene. Finally, we propose a scale-invariant smoothness loss to improve the structural integrity of the predicted depth. We evaluate our method on the KITTI 2015 and Make3D datasets and apply the predicted depth to novel view synthesis. The experimental results show that our method compares favorably with existing methods.
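To make the pipeline concrete, the sketch below illustrates in PyTorch the two technical ingredients the abstract names: the multi-scale decomposition into filtered and detail images with learned recombination weights, and a scale-invariant, edge-aware smoothness term. This is a minimal illustration under our own assumptions, not the paper's implementation; the pyramid depth, the hypothetical weight_net module, and the exact form of the smoothness loss (here the common mean-normalized variant) are placeholders.

import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=3):
    """Decompose an image (B, C, H, W) into low-pass "filtered" images and
    band-pass "detail" images (a Laplacian pyramid in the sense of Burt and
    Adelson)."""
    filtered, details = [], []
    current = img
    for _ in range(levels):
        low = F.avg_pool2d(current, kernel_size=2)           # low-pass + downsample
        up = F.interpolate(low, size=current.shape[-2:],
                           mode='bilinear', align_corners=False)
        details.append(current - up)                         # band-pass detail layer
        filtered.append(up)                                  # filtered (blurred) layer
        current = low
    return filtered, details

def detail_enhanced_image(img, weight_net, levels=3):
    """Recombine the detail layers with weights predicted by a network.
    weight_net is a hypothetical module mapping the image to `levels` logits,
    e.g. a small CNN with global average pooling."""
    _, details = laplacian_pyramid(img, levels)
    w = torch.softmax(weight_net(img), dim=-1)               # (B, levels) learned weights
    enhanced = img.clone()
    for i, d in enumerate(details):
        enhanced = enhanced + w[..., i, None, None, None] * d  # boost detail layer i
    return enhanced.clamp(0.0, 1.0)

def scale_invariant_smooth_loss(disp, img):
    """Edge-aware smoothness on mean-normalized disparity; dividing by the
    mean removes sensitivity to the global disparity scale."""
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    ix = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
    iy = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
    # Down-weight disparity gradients at image edges, where depth may change.
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()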
Acknowledgements
This work is partially supported by the Key Technological Innovation Projects of Hubei Province (2018AAA062), NSFC (No. 61972298), the Science and Technology Cooperation Project of the Xinjiang Production and Construction Corps (No. 2019BC008), and the Wuhan University-Huawei GeoInformatics Innovation Lab.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, Y., Luo, F., Li, W. et al. Self-supervised monocular depth estimation based on image texture detail enhancement. Vis Comput 37, 2567–2580 (2021). https://doi.org/10.1007/s00371-021-02206-2