Abstract
Recent work has demonstrated that reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting the reward weights is challenging and may lead to over-optimization of certain metrics. To address this, we propose Parrot, which frames the problem as multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto-optimal solutions. Through batch-wise Pareto-optimal selection, Parrot automatically identifies the best trade-off among the different rewards. We apply this multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, which significantly improves image quality and also allows the trade-off among rewards to be controlled via a reward-related prompt at inference time. Furthermore, we introduce original-prompt-centered guidance at inference time, ensuring fidelity to the user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
S. H. Lee and F. Deng—This work was done during an internship at Google.
J. Yu—This work was done while working at Google.
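To make the batch-wise Pareto-optimal selection described in the abstract concrete, below is a minimal illustrative sketch in Python, not Parrot's actual implementation: given per-sample scores from several reward models, it keeps only the non-dominated samples in a batch. The function name and the example scores are hypothetical.

import numpy as np

def pareto_optimal_mask(rewards: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated samples in a batch.

    rewards: (batch_size, num_rewards) array, one row of reward
    scores per generated image (e.g. aesthetics, human preference,
    text-image alignment, image sentiment).
    A sample is dominated if some other sample scores >= on every
    reward and strictly > on at least one.
    """
    n = rewards.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(rewards[j] >= rewards[i]) \
                    and np.any(rewards[j] > rewards[i]):
                mask[i] = False  # sample i is dominated by sample j
                break
    return mask

# Example: a batch of 4 images scored by 2 rewards.
scores = np.array([[0.9, 0.2],
                   [0.5, 0.5],
                   [0.4, 0.4],   # dominated by [0.5, 0.5]
                   [0.1, 0.8]])
print(pareto_optimal_mask(scores))  # [ True  True False  True]

In a Parrot-style RL loop, only the images flagged by such a mask would contribute to the policy update for that batch, which is how the method sidesteps hand-tuned reward weights.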
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lee, S.H. et al. (2025). Parrot: Pareto-Optimal Multi-reward Reinforcement Learning Framework for Text-to-Image Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_26
DOI: https://doi.org/10.1007/978-3-031-72920-1_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72919-5
Online ISBN: 978-3-031-72920-1
eBook Packages: Computer Science, Computer Science (R0)