
Stable Video Portraits

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15143)


Abstract

Rapid advances in generative AI, and in text-to-image methods in particular, have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction using 3D Morphable Models (3DMMs). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces, leveraging a large pre-trained text-to-image prior (2D) controlled via a 3DMM (3D). Specifically, we introduce person-specific fine-tuning of a general 2D Stable Diffusion model, which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. The resulting model generates temporally smooth imagery of a person under 3DMM-based control, i.e., a person-specific avatar. The facial appearance of this avatar can be edited and morphed toward text-defined celebrities without any fine-tuning at test time. We analyze the method quantitatively and qualitatively and show that it outperforms state-of-the-art monocular head avatar methods. Project page: https://svp.is.tue.mpg.de/
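To make the conditioning setup described above concrete, the sketch below is a minimal, hypothetical illustration using the Hugging Face diffusers ControlNet pipeline; it is not the authors' implementation. The model checkpoints, the <person> prompt token, and the tracked_3dmm_renderings variable are placeholder assumptions, and sharing one initial latent across frames is only a crude stand-in for the paper's temporal denoising procedure.

# Hypothetical sketch (not the authors' code): drive a Stable Diffusion
# ControlNet pipeline with per-frame 3DMM renderings as the control signal.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder checkpoints; a faithful setup would use a person-specific
# fine-tuned UNet and a ControlNet trained on 3DMM renderings.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-normal", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Shared initial noise across all frames: a crude proxy for the paper's
# temporal denoising, which couples frames during the diffusion process.
generator = torch.Generator("cuda").manual_seed(0)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device="cuda", dtype=torch.float16,
)

frames = []
for cond_image in tracked_3dmm_renderings:  # assumed: list of PIL renders
    frame = pipe(
        prompt="a photo of <person>",  # token from person-specific fine-tuning
        image=cond_image,              # 3DMM rendering as spatial control
        latents=latents.clone(),
        num_inference_steps=20,
    ).images[0]
    frames.append(frame)

In the actual method, consecutive frames are additionally coupled during denoising, which is what yields temporally smooth output beyond what a shared seed alone provides.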


References

  1. Midjourney (2023). https://www.midjourney.com. Accessed 01 Nov 2023

  2. Bar-Tal, O., et al.: Lumiere: a space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945 (2024)

  3. Blattmann, A., et al.: Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  4. Brooks, T., et al.: Video generation models as world simulators (2024). https://openai.com/research/video-generation-models-as-world-simulators

  5. Cao, C., et al.: Authentic volumetric avatars from a phone scan. ACM Trans. Graph. (TOG) 41(4), 1–19 (2022)

    Article  Google Scholar 

  6. Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2video: video editing using image diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23206–23217 (2023)

    Google Scholar 

  7. Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5933–5942 (2019)

    Google Scholar 

  8. Deng, Y., Yang, J., Xu, S., Chen, D., Jia, Y., Tong, X.: Accurate 3D face reconstruction with weakly-supervised learning: from single image to image set. In: IEEE Computer Vision and Pattern Recognition Workshops (2019)

    Google Scholar 

  9. Drobyshev, N., Chelishev, J., Khakhulin, T., Ivakhnenko, A., Lempitsky, V., Zakharov, E.: Megaportraits: one-shot megapixel neural head avatars. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 2663–2671 (2022)

    Google Scholar 

  10. Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3D face model from in-the-wild images. ACM Trans. Graph. (ToG), Proc. SIGGRAPH 40(4), 88:1–88:13 (2021)

    Google Scholar 

  11. Filntisis, P.P., Retsinas, G., Paraperas-Papantoniou, F., Katsamanis, A., Roussos, A., Maragos, P.: Spectre: visual speech-informed perceptual 3D facial expression reconstruction from videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5744–5754 (2023)

    Google Scholar 

  12. Gafni, G., Thies, J., Zollhofer, M., Nießner, M.: Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8649–8658 (2021)

    Google Scholar 

  13. Gal, R., et al.: An image is worth one word: personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

  14. Gao, Y., Zhou, Y., Wang, J., Li, X., Ming, X., Lu, Y.: High-fidelity and freely controllable talking head video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5609–5619 (2023)

    Google Scholar 

  15. Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373 (2023)

  16. Goodfellow, I., et al.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)

    Article  MathSciNet  Google Scholar 

  17. Grassal, P.W., Prinzler, M., Leistner, T., Rother, C., Nießner, M., Thies, J.: Neural head avatars from monocular RGB videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18653–18664 (2022)

    Google Scholar 

  18. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advance in Neural Information Processing System, vol. 33, pp. 6840–6851 (2020)

    Google Scholar 

  19. Hu, E.J., et al.: Lora: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

  20. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)

    Google Scholar 

  21. Karras, J., Holynski, A., Wang, T.C., Kemelmacher-Shlizerman, I.: Dreampose: fashion image-to-video synthesis via stable diffusion. arXiv preprint arXiv:2304.06025 (2023)

  22. Karras, T., et al.: Alias-free generative adversarial networks. In: Advance in Neural Information Processing System, vol. 34, pp. 852–863 (2021)

    Google Scholar 

  23. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)

    Google Scholar 

  24. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119 (2020)

    Google Scholar 

  25. Kim, H., et al.: Deep video portraits. ACM Trans. Graph. (TOG) 37(4), 1–14 (2018)

    Article  Google Scholar 

  26. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)

  27. Kirschstein, T., Qian, S., Giebenhain, S., Walter, T., Nießner, M.: Nersemble: multi-view radiance field reconstruction of human heads. arXiv preprint arXiv:2305.03027 (2023)

  28. Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36(6), 194–201 (2017)

    Article  Google Scholar 

  29. Lombardi, S., Simon, T., Schwartz, G., Zollhoefer, M., Sheikh, Y., Saragih, J.: Mixture of volumetric primitives for efficient neural rendering. ACM Trans. Graph. (ToG) 40(4), 1–13 (2021)

    Article  Google Scholar 

  30. Lugaresi, C., et al.: Mediapipe: a framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)

  31. Ma, S., et al.: Pixel codec avatars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 64–73 (2021)

    Google Scholar 

  32. Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D.: Null-text inversion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6038–6047 (2023)

    Google Scholar 

  33. Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3D face model for pose and illumination invariant face recognition. In: 2009 sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301. IEEE (2009)

    Google Scholar 

  34. Qi, C., et al.: Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535 (2023)

  35. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)

  36. Ramesh, A., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR (2021)

    Google Scholar 

  37. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)

    Google Scholar 

  38. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22500–22510 (2023)

    Google Scholar 

  39. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding (2022)

    Google Scholar 

  40. Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: Advance in Neural Information Processing System, vol. 35, pp. 25278–25294 (2022)

    Google Scholar 

  41. Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. In: Conference on Neural Information Processing Systems (NeurIPS) (2019)

    Google Scholar 

  42. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  43. Tewari, A., et al.: State of the art on neural rendering. In: Computer Graphics Forum, vol. 39, pp. 701–727. Wiley Online Library (2020)

    Google Scholar 

  44. Tewari, A., et al.: Advances in neural rendering. In: Computer Graphics Forum, vol. 41, pp. 703–735. Wiley Online Library (2022)

    Google Scholar 

  45. Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph. (TOG) 38(4), 1–12 (2019)

    Article  Google Scholar 

  46. Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: real-time face capture and reenactment of rgb videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395 (2016)

    Google Scholar 

  47. Wang, L., et al.: Styleavatar: real-time photo-realistic portrait avatar from a single video. arXiv preprint arXiv:2305.00942 (2023)

  48. Wu, J.Z., et al.: Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623–7633 (2023)

    Google Scholar 

  49. Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: Rerender a video: zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954 (2023)

  50. Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., Sang, N.: BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Visi. 129, 3051–3068 (2021)

    Article  Google Scholar 

  51. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)

    Google Scholar 

  52. Zheng, Y., Abrevaya, V.F., Bühler, M.C., Chen, X., Black, M.J., Hilliges, O.: Im avatar: implicit morphable head avatars from videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13545–13555 (2022)

    Google Scholar 

  53. Zheng, Y., Yifan, W., Wetzstein, G., Black, M.J., Hilliges, O.: Pointavatar: deformable point-based head avatars from videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21057–21067 (2023)

    Google Scholar 

  54. Zielonka, W., Bolkart, T., Thies, J.: Instant volumetric head avatars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4574–4584 (2023)

    Google Scholar 

  55. Zollhöfer, M., et al.: State of the art on monocular 3D face reconstruction, tracking, and applications. In: Computer Graphics Forum, vol. 37, pp. 523–550. Wiley Online Library (2018)

    Google Scholar 

Download references

Acknowledgements

The authors thank Tsvetelina Alexiadis, Claudia Gallatz, and Asuka Bertler for data collection; Tsvetelina Alexiadis, Tomasz Niewiadomski, and Taylor McConnell for the perceptual study; Yue Gao, Nikita Drobyshev, Jalees Nehvi, and Wojciech Zielonka for help with the baselines; Michael J. Black and Yao Feng for discussions; and Benjamin Pellkofer for IT support. This work received funding from the European Union's Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No. 860768 (CLIPE project).


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2238 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ostrek, M., Thies, J. (2025). Stable Video Portraits. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15143. Springer, Cham. https://doi.org/10.1007/978-3-031-73013-9_11


  • DOI: https://doi.org/10.1007/978-3-031-73013-9_11


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73012-2

  • Online ISBN: 978-3-031-73013-9

  • eBook Packages: Computer Science; Computer Science (R0)
