Abstract
With the development of deep learning, the network architectures and accuracy of monocular depth estimation algorithms have improved greatly. However, these complex network structures make real-time processing difficult to achieve on embedded platforms. Consequently, this study proposes a lightweight encoder-decoder structure based on the U-Net model. Depthwise separable convolutions are introduced into both the encoder and the decoder to optimize the network structure, reduce computational complexity, and increase running speed, making the algorithm more suitable for embedded platforms. While producing depth images of comparable accuracy, the number of network parameters is reduced by up to eight times, and the running speed more than doubles. The results show the proposed method to be effective and of reference value for monocular depth estimation algorithms running on embedded platforms.
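The roughly eightfold parameter reduction reported above follows directly from the arithmetic of depthwise separable convolution, which factors a standard convolution into a per-channel (depthwise) filter followed by a 1×1 (pointwise) mixing convolution. The sketch below illustrates the count for one layer; the kernel size and channel counts are chosen for illustration and are not taken from the paper's architecture:

```python
# Parameter-count comparison: standard vs. depthwise separable convolution.
# Channel counts and kernel size below are illustrative assumptions only.

def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    # A standard conv learns c_out filters, each of shape k x k x c_in.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    # Depthwise step: one k x k filter per input channel.
    # Pointwise step: a 1 x 1 conv that mixes channels (c_in x c_out weights).
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 128, 128, 3
std = standard_conv_params(c_in, c_out, k)        # 147456
sep = depthwise_separable_params(c_in, c_out, k)  # 17536
print(f"reduction factor: {std / sep:.1f}x")      # about 8.4x
```

For a 3×3 kernel with equal input and output channel counts, the ratio approaches k² + a small pointwise term, which is consistent with the "up to eight times" reduction claimed in the abstract.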
Data Availability
Enquiries about data availability should be directed to the authors.
Acknowledgements
The authors thank Godard and his team for sharing their results.
Funding
This work was supported by the National Natural Science Foundation of China (NSFC Grant No. 61903124).
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wei, F., Yin, X., Shen, J. et al. OptiDepthNet: A Real-Time Unsupervised Monocular Depth Estimation Network. Wireless Pers Commun 128, 2831–2846 (2023). https://doi.org/10.1007/s11277-022-10074-9