Abstract
Rapid advances in generative AI, and in text-to-image methods in particular, have transformed the way we interact with and perceive computer-generated imagery. In parallel, much progress has been made in 3D face reconstruction based on 3D Morphable Models (3DMMs). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces by leveraging a large pre-trained text-to-image prior (2D) that is controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D Stable Diffusion model, which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. This model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this avatar can be edited and morphed toward text-defined celebrities without any fine-tuning at test time. We analyze the method quantitatively and qualitatively and show that it outperforms state-of-the-art monocular head avatar methods. Project page: https://svp.is.tue.mpg.de/
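To make the generation idea concrete, the following is a minimal, hypothetical Python sketch using the Hugging Face diffusers library, not the authors' implementation: a Stable Diffusion model is conditioned frame by frame on renderings of a tracked 3DMM sequence through a ControlNet-style branch, and the same initial noise is reused across frames as a crude stand-in for the paper's temporal denoising procedure. The checkpoint names, the `<person>` token, and the `load_3dmm_renderings` helper are illustrative assumptions.

```python
# Minimal sketch (NOT the authors' code) of 3DMM-conditioned, frame-by-frame
# portrait generation with Stable Diffusion + a ControlNet conditioning branch.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Assumption: in the paper, the conditioning branch would be trained on 3DMM
# renderings; a public edge-conditioned checkpoint serves as a stand-in here.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # stand-in for the person-specific fine-tuned model
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Assumption: one PIL image per video frame, rendered from the tracked 3DMM.
tracked_3dmm_renderings = load_3dmm_renderings()  # hypothetical helper

frames = []
for cond_image in tracked_3dmm_renderings:
    # Reusing the same seed gives every frame the same initial noise: a crude
    # proxy for the temporal denoising that keeps the output video smooth.
    generator = torch.Generator("cuda").manual_seed(0)
    frame = pipe(
        prompt="a portrait photo of <person>",  # person-specific token (assumed)
        image=cond_image,
        num_inference_steps=20,
        generator=generator,
    ).images[0]
    frames.append(frame)
```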
Acknowledgements
The authors thank Tsvetelina Alexiadis, Claudia Gallatz, and Asuka Bertler for data collection; Tsvetelina Alexiadis, Tomasz Niewiadomski, and Taylor McConnell for the perceptual study; Yue Gao, Nikita Drobyshev, Jalees Nehvi, and Wojciech Zielonka for help with the baselines; Michael J. Black and Yao Feng for discussions; and Benjamin Pellkofer for IT support. This work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 860768 (CLIPE project).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ostrek, M., Thies, J. (2025). Stable Video Portraits. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15143. Springer, Cham. https://doi.org/10.1007/978-3-031-73013-9_11
DOI: https://doi.org/10.1007/978-3-031-73013-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73012-2
Online ISBN: 978-3-031-73013-9
eBook Packages: Computer Science, Computer Science (R0)