Parrot: Pareto-Optimal Multi-reward Reinforcement Learning Framework for Text-to-Image Generation

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recent work has demonstrated that using reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights is challenging and may cause over-optimization of certain metrics. To address this, we propose Parrot, which frames the problem as multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate the Pareto-optimal set. Using batch-wise Pareto-optimal selection, Parrot automatically identifies the optimal trade-off among the different rewards. We use this novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, which significantly improves image quality and also allows the trade-off among rewards to be controlled with a reward-related prompt at inference time. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to the user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
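
To make the batch-wise Pareto-optimal selection concrete, the following Python sketch marks the non-dominated samples in a batch given per-sample reward scores. It is a minimal illustration under the assumption that rewards are stacked into a (batch_size, num_rewards) array with higher-is-better scores; the function name pareto_front_mask and the example values are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def pareto_front_mask(rewards: np.ndarray) -> np.ndarray:
    """Mark the non-dominated (Pareto-optimal) samples in a batch.

    rewards: (batch_size, num_rewards) array, where higher is better
    for every reward. Returns a boolean mask of shape (batch_size,).
    """
    batch_size = rewards.shape[0]
    mask = np.ones(batch_size, dtype=bool)
    for i in range(batch_size):
        # Sample j dominates sample i if it is at least as good on every
        # reward and strictly better on at least one of them.
        dominates_i = np.all(rewards >= rewards[i], axis=1) & \
                      np.any(rewards > rewards[i], axis=1)
        if dominates_i.any():
            mask[i] = False
    return mask

# Hypothetical batch of four generated images scored on aesthetics,
# human preference, text-image alignment, and image sentiment.
scores = np.array([
    [0.8, 0.6, 0.7, 0.5],
    [0.9, 0.5, 0.6, 0.4],
    [0.7, 0.7, 0.8, 0.6],
    [0.6, 0.4, 0.5, 0.3],  # dominated by the third sample
])
print(pareto_front_mask(scores))  # [ True  True  True False]
```

In a multi-reward RL fine-tuning loop, such a mask could restrict the policy update to the non-dominated samples in each batch, which is the role the paper assigns to batch-wise Pareto-optimal selection.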

S. H. Lee and F. Deng—This work was done during an internship at Google.

J. Yu—This work was done while working at Google.

Author information

Corresponding author: Seung Hyun Lee.

Electronic supplementary material

Supplementary material 1 (PDF, 8661 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lee, S.H. et al. (2025). Parrot: Pareto-Optimal Multi-reward Reinforcement Learning Framework for Text-to-Image Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_26


  • DOI: https://doi.org/10.1007/978-3-031-72920-1_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72919-5

  • Online ISBN: 978-3-031-72920-1

  • eBook Packages: Computer Science, Computer Science (R0)
