GarmentAligner: Text-to-Garment Generation via Retrieval-Augmented Multi-level Corrections

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15083)


Abstract

General text-to-image models have brought revolutionary innovation to the fields of art, design, and media. However, when applied to garment generation, even state-of-the-art text-to-image models suffer from fine-grained semantic misalignment, particularly concerning the quantity, position, and interrelations of garment components. To address this, we propose GarmentAligner, a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections. To achieve semantic alignment at the component level, we introduce an automatic component extraction pipeline that obtains spatial and quantitative information about garment components from the corresponding images and captions. To exploit component relationships within garment images, we then construct a retrieval subset for each garment via retrieval augmentation based on component-level similarity ranking, and apply contrastive learning so that positive and negative samples sharpen the model's perception of components. To further align components across semantic, spatial, and quantitative granularities, we introduce multi-level correction losses that leverage the detailed component information. Experiments demonstrate that GarmentAligner achieves superior fidelity and fine-grained semantic alignment compared with existing competitors.
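
The paper's implementation is not reproduced on this page. As a rough illustration of two of the training signals the abstract describes (retrieval subsets ranked by component-level similarity paired with contrastive learning, and a quantitative correction term), the following PyTorch sketch operates on toy component records of the form {component name: count}. Every name here (component_similarity, build_retrieval_subset, contrastive_loss, quantity_correction), the similarity heuristic, and the exact loss forms are illustrative assumptions rather than the authors' method.

```python
import torch
import torch.nn.functional as F


def component_similarity(a: dict, b: dict) -> float:
    """Toy component-level similarity between two garments, each given as
    {component name: count}: fraction of shared components whose counts
    agree, normalized by the larger component set."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    agree = sum(a[c] == b[c] for c in shared)
    return agree / max(len(a), len(b))


def build_retrieval_subset(anchor: dict, corpus: list, k: int = 4):
    """Rank the corpus by component-level similarity to the anchor and
    return (top-k indices as positives, bottom-k indices as negatives)."""
    scores = torch.tensor([component_similarity(anchor, c) for c in corpus])
    order = torch.argsort(scores, descending=True)
    return order[:k].tolist(), order[-k:].tolist()


def contrastive_loss(anchor_emb, pos_embs, neg_embs, tau: float = 0.07):
    """Multi-positive InfoNCE over L2-normalized embeddings: pull the
    anchor toward its retrieved positives, push it from its negatives."""
    a = F.normalize(anchor_emb, dim=-1)
    pos_logits = F.normalize(pos_embs, dim=-1) @ a / tau
    neg_logits = F.normalize(neg_embs, dim=-1) @ a / tau
    all_logits = torch.cat([pos_logits, neg_logits])
    return torch.logsumexp(all_logits, 0) - torch.logsumexp(pos_logits, 0)


def quantity_correction(pred_counts: dict, caption_counts: dict) -> torch.Tensor:
    """Toy stand-in for the quantitative level of the correction losses:
    L1 gap between component counts detected in a generated image and
    the counts parsed from its caption."""
    keys = set(pred_counts) | set(caption_counts)
    gaps = [abs(pred_counts.get(k, 0) - caption_counts.get(k, 0)) for k in keys]
    return torch.tensor(float(sum(gaps)))


if __name__ == "__main__":
    anchor = {"collar": 1, "sleeve": 2, "pocket": 2}
    corpus = [
        {"collar": 1, "sleeve": 2, "pocket": 2},  # near-duplicate layout
        {"collar": 1, "sleeve": 2},               # partial match
        {"hood": 1, "zipper": 1},                 # unrelated components
        {"hood": 1},                              # unrelated components
    ]
    pos_idx, neg_idx = build_retrieval_subset(anchor, corpus, k=2)
    embs = torch.randn(len(corpus), 64)           # stand-in image embeddings
    loss = contrastive_loss(torch.randn(64), embs[pos_idx], embs[neg_idx])
    print(pos_idx, neg_idx, loss.item())
    print(quantity_correction({"sleeve": 1}, anchor).item())
```

In the actual method, the component records would come from the automatic extraction pipeline (detection and segmentation over garment images plus caption parsing), and the embeddings from the generative model's feature space; the sketch only fixes the rough shape of the training signal.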



Acknowledgements

This work was supported in part by the National Science and Technology Major Project (2020AAA0109704), the National Science and Technology Ministry Youth Talent Funding No. 2022WRQB002, the National Natural Science Foundation of China under Grants No. 62372482 and No. 61936002, the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), the Shenzhen Science and Technology Program (Grant No. GJHZ20220913142600001), the Mobility Grant Award under Grant No. M-0461, and the Nansha Key R&D Program under Grant No. 2022ZD014.

Author information

Corresponding author

Correspondence to Xiaodan Liang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 10,874 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, S. et al. (2025). GarmentAligner: Text-to-Garment Generation via Retrieval-Augmented Multi-level Corrections. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72698-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72697-2

  • Online ISBN: 978-3-031-72698-9

  • eBook Packages: Computer Science, Computer Science (R0)
