Abstract
General text-to-image models have brought revolutionary innovation to art, design, and media. When applied to garment generation, however, even state-of-the-art text-to-image models suffer from fine-grained semantic misalignment, particularly concerning the quantity, position, and interrelations of garment components. To address this, we propose GarmentAligner, a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections. To achieve semantic alignment at the component level, we introduce an automatic component extraction pipeline that obtains spatial and quantitative information about garment components from the corresponding images and captions. To exploit component relationships within garment images, we then construct retrieval subsets for each garment via retrieval augmentation based on component-level similarity ranking, and conduct contrastive learning to enhance the model's perception of components from positive and negative samples. To further improve component alignment across semantic, spatial, and quantitative granularities, we propose multi-level correction losses that leverage detailed component information. Experiments demonstrate that GarmentAligner achieves superior fidelity and fine-grained semantic alignment compared with existing competitors.
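To make the retrieval-augmented contrastive step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it shows one plausible reading of building retrieval subsets by component-level similarity ranking and applying a multi-positive InfoNCE-style contrastive loss. All names here (component_embs, build_retrieval_subset, contrastive_loss, k_pos, k_neg, tau) are hypothetical, and component_embs is assumed to be an [N, C, D] tensor of per-component embeddings for N garments with C aligned components each; the paper's actual loss formulation may differ.

    import torch
    import torch.nn.functional as F

    def component_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Mean cosine similarity over aligned components: [*, C, D] vs [N, C, D] -> [N].
        return F.cosine_similarity(a, b, dim=-1).mean(dim=-1)

    def build_retrieval_subset(component_embs, anchor_idx, k_pos=4, k_neg=4):
        # Rank every garment against the anchor by component-level similarity,
        # then take the top-k as positives and the bottom-k as negatives.
        anchor = component_embs[anchor_idx].unsqueeze(0)      # [1, C, D]
        sims = component_similarity(anchor, component_embs)   # [N]
        order = sims.argsort(descending=True)
        order = order[order != anchor_idx]                    # drop the anchor itself
        return order[:k_pos], order[-k_neg:]

    def contrastive_loss(anchor, positives, negatives, tau=0.07):
        # Multi-positive InfoNCE: -log(sum(exp(pos)) / sum(exp(all))),
        # pulling the anchor toward positives and away from negatives.
        a = F.normalize(anchor.reshape(-1), dim=0)
        pos = F.normalize(positives.reshape(positives.shape[0], -1), dim=1)
        neg = F.normalize(negatives.reshape(negatives.shape[0], -1), dim=1)
        pos_logits = (pos @ a) / tau
        neg_logits = (neg @ a) / tau
        all_logits = torch.cat([pos_logits, neg_logits])
        return torch.logsumexp(all_logits, dim=0) - torch.logsumexp(pos_logits, dim=0)

    # Toy usage: 16 garments, 5 components each, 128-dim embeddings.
    component_embs = torch.randn(16, 5, 128)
    pos_idx, neg_idx = build_retrieval_subset(component_embs, anchor_idx=0)
    loss = contrastive_loss(component_embs[0],
                            component_embs[pos_idx],
                            component_embs[neg_idx])

In the full method, the retrieved positives and negatives would additionally feed the multi-level (semantic, spatial, and quantitative) correction losses; the sketch above covers only the retrieval-and-contrast skeleton.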
Acknowledgements
This work was supported in part by the National Science and Technology Major Project (2020AAA0109704), the National Science and Technology Ministry Youth Talent Funding No. 2022WRQB002, the National Natural Science Foundation of China under Grants No. 62372482 and No. 61936002, the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), the Shenzhen Science and Technology Program (Grant No. GJHZ20220913142600001), the Mobility Grant Award under Grant No. M-0461, and the Nansha Key R&D Program under Grant No. 2022ZD014.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, S. et al. (2025). GarmentAligner: Text-to-Garment Generation via Retrieval-Augmented Multi-level Corrections. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_9
DOI: https://doi.org/10.1007/978-3-031-72698-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72697-2
Online ISBN: 978-3-031-72698-9
eBook Packages: Computer Science, Computer Science (R0)