Abstract
Video dehazing is a critical research area in computer vision that aims to enhance the quality of hazy frames, which benefits many downstream tasks, e.g., semantic segmentation. Recent works devise CNN-based structures or attention mechanisms to fuse temporal information, while others utilize offsets between frames to align them explicitly. Another significant line of video dehazing research focuses on constructing paired datasets by synthesizing fog on clear videos or generating real haze in indoor scenes. Despite the significant contributions of these dehazing networks and datasets to the advancement of video dehazing, current methods still suffer from spatial–temporal inconsistency and poor generalization ability. We address these issues by proposing a triplane smoothing module that explicitly exploits the spatial–temporal smoothness prior of the input video and generates temporally coherent dehazing results. We further devise a query-based decoder to extract haze-relevant information while implicitly aggregating temporal clues. To increase the generalization ability of our dehazing model, we leverage CLIP guidance, which provides a rich, high-level understanding of haze. We conduct extensive experiments to verify the effectiveness of our model in generating spatial–temporally consistent dehazing results and producing pleasing results on real-world data.
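To make the triplane idea concrete, the following is a minimal, hypothetical PyTorch sketch of one way such a smoothing module could be read: the video feature volume is average-projected onto three orthogonal planes (HW, TH, and TW, in the spirit of HexPlane/K-planes-style factorizations), each plane is smoothed by a lightweight convolution, and the smoothed planes are broadcast back and fused as a residual prior. The class name, layer sizes, and fusion rule are all illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch (not the authors' released code): factor a video
# feature volume of shape (B, C, T, H, W) into three planes, smooth each
# plane, and recombine them into a temporally coherent residual prior.
import torch
import torch.nn as nn

class TriplaneSmoothing(nn.Module):
    """Projects a video feature volume onto three orthogonal planes,
    smooths each plane with a small conv, and fuses them back.
    All hyperparameters here are illustrative assumptions."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # One lightweight smoothing convolution per plane.
        self.smooth_hw = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.smooth_th = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.smooth_tw = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video feature volume.
        # Average-project the volume onto each orthogonal plane.
        p_hw = self.smooth_hw(x.mean(dim=2))  # (B, C, H, W): spatial plane
        p_th = self.smooth_th(x.mean(dim=4))  # (B, C, T, H): space-time plane
        p_tw = self.smooth_tw(x.mean(dim=3))  # (B, C, T, W): space-time plane
        # Broadcast each smoothed plane back to (B, C, T, H, W) and fuse.
        prior = (p_hw.unsqueeze(2)            # add T axis
                 + p_th.unsqueeze(4)          # add W axis
                 + p_tw.unsqueeze(3)) / 3.0   # add H axis
        return x + prior                      # residual smoothing of the volume
```

For instance, `TriplaneSmoothing(64)(torch.randn(1, 64, 5, 32, 32))` smooths a 5-frame, 64-channel feature clip while preserving its shape; because the two space-time planes share parameters across all frames, the added prior is temporally coherent by construction.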
Acknowledgements
This work was supported by the Guangzhou-HKUST(GZ) Joint Funding Program (Nos. 2023A03J0671 and 2024A03J0618), the National Natural Science Foundation of China (Grant No. 61902275), and the Guangzhou Industrial Information and Intelligent Key Laboratory Project (No. 2024A03J0628).
Additional information
Communicated by Yasuyuki Matsushita.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ren, J., Chen, H., Ye, T. et al. Triplane-Smoothed Video Dehazing with CLIP-Enhanced Generalization. Int J Comput Vis 133, 475–488 (2025). https://doi.org/10.1007/s11263-024-02161-0