Abstract
Co-creation with AI is a growing trend, and AI generation of images from textual descriptions has shown advanced and attractive capabilities. However, commonly trained machine-learning models and existing AI-based systems may fail to produce satisfying results for personal use or for novice users of painting and AI co-creation, possibly because they understand personal textual expressions poorly or because trained text-to-image models offer little customization. We therefore support the creation of flexible and diverse visual content from textual descriptions by developing neural-network models. In our modeling, a Transformer captures word-visual co-occurrence to generate visual tokens, and images are synthesized by decoding these tokens. To improve visual and textual representations and their relevance while increasing diversity, we apply contrastive learning to texts, to images, or to text-image pairs. In experiments on a bird dataset, we show that rendering quality requires neural networks of sufficient scale and a training process that ends with fine-grained training at relatively low learning rates. We further show that contrastive learning can improve visual and textual representations and their relevance.
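To make the contrastive component mentioned above concrete, the following is a minimal, hypothetical sketch of a symmetric InfoNCE-style loss on text-image pairs, assuming batches of matched text and image embeddings produced by separate encoders; the function names, dimensions, and temperature value are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def text_image_contrastive_loss(text_emb: torch.Tensor,
                                image_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # Normalize so dot products become cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares text i with image j;
    # matched pairs lie on the diagonal.
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pull matched pairs together,
    # push mismatched pairs apart in both directions.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2

# Usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    text_emb = torch.randn(8, 256)    # hypothetical text-encoder outputs
    image_emb = torch.randn(8, 256)   # hypothetical image-encoder outputs
    print(text_image_contrastive_loss(text_emb, image_emb).item())

In a token-based pipeline such as the one described, a loss of this kind would typically be combined with the autoregressive cross-entropy objective of the Transformer that predicts visual tokens; the exact weighting and placement are design choices of the paper and are not shown here.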
Acknowledgement
This work was supported by the Japan Science and Technology Agency (JST CREST: JPMJCR19F2, Research Representative: Prof. Yoichi Ochiai, University of Tsukuba, Japan) and by the University of Tsukuba (Basic Research Support Program Type A).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lu, JL., Ochiai, Y. (2022). Customizable Text-to-Image Modeling by Contrastive Learning on Adjustable Word-Visual Pairs. In: Degen, H., Ntoa, S. (eds) Artificial Intelligence in HCI. HCII 2022. Lecture Notes in Computer Science, vol 13336. Springer, Cham. https://doi.org/10.1007/978-3-031-05643-7_30
DOI: https://doi.org/10.1007/978-3-031-05643-7_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-05642-0
Online ISBN: 978-3-031-05643-7
eBook Packages: Computer Science (R0)