Abstract
General text-to-image models have brought revolutionary innovation to art, design, and media. When applied to garment generation, however, even state-of-the-art text-to-image models suffer from fine-grained semantic misalignment, particularly concerning the quantity, position, and interrelations of garment components. To address this, we propose GarmentAligner, a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections. To achieve semantic alignment at the component level, we introduce an automatic component extraction pipeline that obtains spatial and quantitative information about garment components from the corresponding images and captions. To exploit component relationships within garment images, we then construct retrieval subsets for each garment via retrieval augmentation based on component-level similarity ranking, and conduct contrastive learning to enhance the model's perception of components from positive and negative samples. To further improve component alignment across semantic, spatial, and quantitative granularities, we propose multi-level correction losses that leverage detailed component information. Experiments demonstrate that GarmentAligner achieves superior fidelity and fine-grained semantic alignment compared with existing competitors.
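To make the retrieval-augmented contrastive step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it shows one plausible reading of building retrieval subsets by component-level similarity ranking and applying a multi-positive InfoNCE-style contrastive loss. All names here (component_embs, build_retrieval_subset, contrastive_loss, k_pos, k_neg, tau) are hypothetical, and component_embs is assumed to be an [N, C, D] tensor of per-component embeddings for N garments with C aligned components each; the paper's actual loss formulation may differ.

    import torch
    import torch.nn.functional as F

    def component_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Mean cosine similarity over aligned components: [*, C, D] vs [N, C, D] -> [N].
        return F.cosine_similarity(a, b, dim=-1).mean(dim=-1)

    def build_retrieval_subset(component_embs, anchor_idx, k_pos=4, k_neg=4):
        # Rank every garment against the anchor by component-level similarity,
        # then take the top-k as positives and the bottom-k as negatives.
        anchor = component_embs[anchor_idx].unsqueeze(0)      # [1, C, D]
        sims = component_similarity(anchor, component_embs)   # [N]
        order = sims.argsort(descending=True)
        order = order[order != anchor_idx]                    # drop the anchor itself
        return order[:k_pos], order[-k_neg:]

    def contrastive_loss(anchor, positives, negatives, tau=0.07):
        # Multi-positive InfoNCE: -log(sum(exp(pos)) / sum(exp(all))),
        # pulling the anchor toward positives and away from negatives.
        a = F.normalize(anchor.reshape(-1), dim=0)
        pos = F.normalize(positives.reshape(positives.shape[0], -1), dim=1)
        neg = F.normalize(negatives.reshape(negatives.shape[0], -1), dim=1)
        pos_logits = (pos @ a) / tau
        neg_logits = (neg @ a) / tau
        all_logits = torch.cat([pos_logits, neg_logits])
        return torch.logsumexp(all_logits, dim=0) - torch.logsumexp(pos_logits, dim=0)

    # Toy usage: 16 garments, 5 components each, 128-dim embeddings.
    component_embs = torch.randn(16, 5, 128)
    pos_idx, neg_idx = build_retrieval_subset(component_embs, anchor_idx=0)
    loss = contrastive_loss(component_embs[0],
                            component_embs[pos_idx],
                            component_embs[neg_idx])

In the full method, the retrieved positives and negatives would additionally feed the multi-level (semantic, spatial, and quantitative) correction losses; the sketch above covers only the retrieval-and-contrast skeleton.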
Acknowledgements
This work was supported in part by the National Science and Technology Major Project (2020AAA0109704), the National Science and Technology Ministry Youth Talent Funding No. 2022WRQB002, the National Natural Science Foundation of China under Grants No. 62372482 and No. 61936002, the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), the Shenzhen Science and Technology Program (Grant No. GJHZ20220913142600001), the Mobility Grant Award under Grant No. M-0461, and the Nansha Key R&D Program under Grant No. 2022ZD014.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, S. et al. (2025). GarmentAligner: Text-to-Garment Generation via Retrieval-Augmented Multi-level Corrections. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_9
DOI: https://doi.org/10.1007/978-3-031-72698-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72697-2
Online ISBN: 978-3-031-72698-9
eBook Packages: Computer Science, Computer Science (R0)