A Prompt Log Analysis of Text-to-Image Generation Systems

Yutong Xie*, Zhaoying Pan*, Jinge Ma*
University of Michigan, Ann Arbor, Michigan, USA
yutxie@umich.edu, panzy@umich.edu, jingema@umich.edu

Luo Jie
Niantic, Inc., San Francisco, California, USA
rogerluo@nianticlabs.com

Qiaozhu Mei
University of Michigan, Ann Arbor, Michigan, USA
qmei@umich.edu

* These authors contributed equally to this research.

arXiv:2303.04587v2 [cs.HC] 16 Mar 2023

ABSTRACT
Recent developments in large language models (LLM) and generative AI have unleashed the astonishing capabilities of text-to-image generation systems to synthesize high-quality images that are faithful to a given reference text, known as a “prompt”. These systems have immediately received lots of attention from researchers, creators, and common users. Despite plenty of efforts to improve the generative models, there is limited work on understanding the information needs of the users of these systems at scale. We conduct the first comprehensive analysis of large-scale prompt logs collected from multiple text-to-image generation systems. Our work is analogous to analyzing the query logs of Web search engines, a line of work that has made critical contributions to the glory of the Web search industry and research. Compared with Web search queries, text-to-image prompts are significantly longer, often organized into special structures that consist of the subject, form, and intent of the generation tasks, and present unique categories of information needs. Users make more edits within creation sessions, which present remarkable exploratory patterns. There is also a considerable gap between the user-input prompts and the captions of the images included in the open training data of the generative models. Our findings provide concrete implications on how to improve text-to-image generation systems for creation purposes.

KEYWORDS
Text-to-Image Generation, AI-Generated Content (AIGC), AI for Creativity, Prompt Analysis, Query Log Analysis.

ACM Reference Format:
Yutong Xie, Zhaoying Pan, Jinge Ma, Luo Jie, and Qiaozhu Mei. 2023. A Prompt Log Analysis of Text-to-Image Generation Systems. In Proceedings of the ACM Web Conference 2023 (WWW ’23), April 30-May 4, 2023, Austin, TX, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3543507.3587430

1 INTRODUCTION
Recent developments in large language models (LLM) (e.g., GPT-3 [4], PaLM [6], LLaMA [38], and GPT-4 [24]) and generative AI (especially the diffusion models [13, 37]) have enabled the astonishing image synthesis capabilities of text-to-image generation systems, such as DALL·E [29, 30], Midjourney [20], latent diffusion models (LDMs) [32], Imagen [33], and Stable Diffusion [32]. As these systems are able to produce images of high quality that are faithful to a given reference text (known as a “prompt”), they have immediately become a new source of creativity [25] and attracted a great number of creators, researchers, and common users. As a major prototype of generative AI, many believe that these systems are introducing fundamental changes to the creative work of humans [9].

Despite plenty of efforts on improving the performance of the underlying generative models, there is limited work on analyzing the information needs of the real users of these text-to-image systems, even though it is crucial to understand the objectives and workflows of the creators and to identify the gaps in how well the current systems facilitate the creators’ needs.

In this paper, we take the initiative to investigate the information needs of text-to-image generation by conducting a comprehensive analysis of millions of user-input prompts in multiple popular systems, including Midjourney, Stable Diffusion, and LDMs. Our analysis is analogous to query log analysis of search engines, a line of work that has inspired many developments of modern information retrieval (IR) research and industry [3, 12, 14, 36, 44]. In this analogy, a text-to-image generation system is compared to a search engine, the pretrained large language model is compared to the search index, a user-input prompt can be compared to a search query that describes the user’s information need, while a text-to-image generation model can be compared to the search or ranking algorithm that generates (rather than retrieves) one or multiple pieces of content (images) to fulfill the user’s need (Table 1).

Through a large-scale analysis of the prompt logs, we aim to answer the following questions: (1) How do users describe their information needs in the prompts? (2) How do the information needs in text-to-image generation compare with those in Web search? (3) How are users’ information needs satisfied? (4) How are users’ information needs covered by the image captions in open datasets?

The results of our analysis suggest that (1) text-to-image prompts are usually structured with terms that describe the subject, the form, and the intent of the image to be created (Sec. 4); (2) text-to-image prompts are sufficiently different from Web search queries. Besides
significantly lengthier prompts and sessions, there is especially a prevalence of exploratory prompts (Sec. 4.2); (3) image generation quality (measured by user rating) is correlated with the length of the prompt as well as the usage of terms (Sec. 4.3); and (4) there is a considerable gap between the user-input prompts and the image captions in open datasets (Sec. 4.4). More details of our analysis are listed in the Appendix, and the code and the complete results are accessible via our GitHub repository¹. Based on these findings, we conclude several challenges and actionable opportunities of text-to-image generation systems (Sec. 5). We anticipate our study would help the text-to-image generation community to better understand and facilitate creativity on the Web.

¹ GitHub repository: https://github.com/zhaoyingpan/prompt_log_analysis.

Table 1: The analogy between text-to-image generation and Web or vertical search.

| Text-to-image generation        | Web and vertical search |
| Images                          | Webpages/Documents      |
| Generation                      | Retrieval               |
| Text-to-image generation system | Search engine           |
| Prompt                          | Query                   |
| Pretrained language model       | Search index            |
| Image generation models         | Ranking algorithms      |
| Prompt log analysis             | Query log analysis      |
| ...                             | ...                     |

2 RELATED WORK

2.1 Text-to-Image Generation
Text-to-image generation is a multi-modal task that aims to translate text descriptions (known as “prompts”) into faithful images of high quality. Recent text-to-image generation models can be categorized into two main streams: (1) models based on variational autoencoders (VAEs) [16] or generative adversarial networks (GANs) [11], and (2) models built upon denoising diffusion probabilistic models (DDPMs, or diffusion models) [13].

The earliest application of deep neural networks in text-to-image generation can be dated back to 2015, when Mansimov et al. [18] proposed to generate images from texts using a recurrent VAE with the attention mechanism. In the next few years, Reed et al. [31] and Cho et al. [5] started to use GANs as generative models from texts to images. These models made it possible to generate images from texts; however, most generated images were blurry and consisted of simple structures. Later in 2021, OpenAI released DALL·E, combining the powerful GPT-3 language model [4] as the text encoder and a VAE as the image generator [30]. DALL·E is able to generate more complex and realistic images, establishing a new standard of text-to-image generation.

Since late 2021, with the advances of DDPMs (diffusion models), several compelling text-to-image generation systems have been developed and released to the public, demonstrating astounding capabilities in faithful image synthesis and bringing text-to-image generation into a new era. They include Disco Diffusion², GLIDE [21], Midjourney [20], DALL·E 2 [29], latent diffusion models (LDMs) [32], Imagen [33], and Stable Diffusion [32]. These systems immediately became a trend in the art creation community, attracting both artists and common users to create with such systems [25].

² Disco Diffusion: https://github.com/alembics/disco-diffusion, retrieved on 3/14/2023.

2.2 Text-to-Image Prompt Analysis
Despite plenty of efforts on improving the performance of the underlying generative models, there is limited work on analyzing the user-input prompts and understanding the information needs of the real users of text-to-image systems.

Liu and Chilton [17] explored what prompt keywords and model hyperparameters can lead to better generation performance from the human-computer interaction (HCI) perspective. In particular, 51 keywords related to the subject and 51 related to the style were tested through synthetic experiments. Oppenlaender [26] further conducted an autoethnographic study on the modifiers in the prompts. As a result, six types of prompt modifiers were identified, including subject terms, style modifiers, image prompts, quality boosters, repetition, and magic terms. In addition, Pavlichenko and Ustalov [27] presented a human-in-the-loop approach and extracted some of the most effective combinations of prompt keywords.

These studies have provided valuable insights into certain aspects of text-to-image prompts. However, they are mostly based on small numbers of independent prompts and/or computational experiments. These lab experiments usually do not consider prompts in real usage sessions and can hardly reflect the “whole picture” of the information needs of the real users. Our study provides the first large-scale quantitative analysis based on the user inputs collected from real systems. Besides, we also compare the characteristics of prompts with those of Web search queries as well as image captions in open text-image datasets, revealing considerable differences and practical implications.

2.3 Query Log Analysis
Query log analysis of Web search engines is a classical line of work that has inspired many developments in modern information retrieval (IR) research and industry. Such an analysis usually includes examinations into terms, queries, sessions, and users [3, 14, 36]. Aside from the general research on Web search engines, query log analysis has also been conducted on vertical search engines like medical search engines (e.g., PubMed and electronic health records (EHR) search engines), where the analysis results are further compared with Web search patterns [12, 44].

In this paper, we make an analogy between query log analysis and prompt analysis. In this analogy, a user-input prompt can be compared to a search query that describes the user’s information need, while a text-to-image generation system can be compared to a search engine that generates (rather than retrieves) one or more pieces of content (in our case, the image(s)) to fulfill the user’s need.

3 PROMPT LOG DATASETS
We consider three large and open prompt log datasets: the Midjourney Discord dataset [39], DiffusionDB [40], and Simulacra Aesthetic Captions (SAC) [28]. These datasets involve three popular text-to-image generation systems – Midjourney [20], Stable Diffusion [32], and latent diffusion models (LDMs) [32].
Table 2: Statistics of datasets. The values (except for the raw number of records) are calculated after data processing.

| Dataset              | Midjourney | DiffusionDB | SAC    |
| Raw #Records         | 250K       | 14M         | 238K   |
| #Prompts             | 145,074    | 2,208,019   | 34,190 |
| #Unique prompts      | 122,905    | 1,817,721   | 34,190 |
| #Unique terms        | 97,052     | 182,386     | 22,898 |
| #Users               | 1,665      | 10,380      | N/A    |
| Median #prompts/user | 12         | 62          | N/A    |
| Max #prompts/user    | 2,493      | 19,556      | N/A    |

The Midjourney Discord dataset. The Midjourney dataset [39] is obtained by crawling message records from the Midjourney Discord community over a period of four weeks (June 20 – July 17, 2022). This dataset contains approximately 250K records, with user-input prompts, URLs of generated images, usernames, user IDs, message timestamps, and other Discord message metadata.

DiffusionDB. DiffusionDB [40] is a large-scale dataset with 14M images generated with Stable Diffusion [32]. For each image, this dataset also provides the corresponding prompt, user ID, timestamp, and other meta information.

Simulacra Aesthetic Captions (SAC). The SAC dataset [28] contains 238K images generated from over 40K user-submitted prompts with LDMs [32]. SAC annotates images with aesthetic ratings in the range of [1, 10] collected from surveys. The prompts in SAC are also relatively clean. However, SAC does not include information about user IDs or timestamps.

Table 2 lists the basic statistics of the datasets. In the raw data, one input prompt can correspond to multiple generated images and create multiple data entries for the same input. We remove these duplicates while preserving repeated inputs from users. More details about the data and data processing are described in Appendix A.
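The deduplication can be sketched as follows. This is a minimal pandas illustration, not the repository’s actual code; the column names (`user_id`, `timestamp`, `prompt`) are hypothetical stand-ins for each dataset’s real schema:

```python
import pandas as pd

# Toy stand-in for a raw log: one row per generated image, so a single
# submission (one user, one timestamp, one prompt) can span several rows.
raw = pd.DataFrame({
    "user_id":   [1, 1, 1, 1],
    "timestamp": ["2022-08-01 10:00"] * 3 + ["2022-08-01 10:05"],
    "prompt":    ["a cat, oil painting"] * 4,
})

# Collapse the rows produced by the same submission, while keeping
# repeated submissions of the same prompt at different times.
prompts = raw.drop_duplicates(subset=["user_id", "timestamp", "prompt"])
print(len(raw), "raw records ->", len(prompts), "prompt submissions")  # 4 -> 2
```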
8 (11:17, circa) 0.074 (takato, yamamoto) 0.236 (finnian, macmanus) 1.000
9 (jia, ruan) 0.071 (“, ”) 0.216 (bartlett, bo) 1.000
4 PROMPT LOG ANALYSIS 10 (cushart, krenz) 0.070 (mead, syd) 0.130 (hasui, kawase) 0.500
11 (shinkawa, yoji) 0.062 (akihiko, yoshida) 0.123 (daniela, uhlig) 0.332
We analyze the prompts in the datasets and aim to answer the four 12 (albert, bierstadt) 0.060 (elvgren, gil) 0.114 (edlin, tyler) 0.318
13 (katsuhiro, otomo) 0.057 (new, york) 0.114 (jurgens, mandy) 0.286
questions mentioned in Section 1. 14 ([, ]) 0.053 (gi, jung) 0.106 (bacon, francis) 0.286
15 (annie, leibovitz) 0.052 (dore, gustave) 0.103 (araki, hirohiko) 0.258
16 (adams, ansel) 0.045 (star, wars) 0.092 (radke, scott) 0.257
17 (mignola, mike) 0.043 (fiction, science) 0.087 (ca’, n’t) 0.252
4.1 How do Users Describe Information Needs? 18 (1800s, tintype) 0.036 (league, legends) 0.082 (card, tarot) 0.201
19 (dore, gustave) 0.036 (rule, thirds) 0.074 (claude, monet) 0.190
We first investigate how users describe their information needs 20 (adams, tintype) 0.029 (ngai, victo) 0.061 (gogh, van) 0.180
by exploring the structures of prompts. We start with analyzing
the usage of terms (tokens or words) in prompts. We conduct a 4.1.1 Words in prompts describe subjects, forms, and in-
first-order analysis that focuses on term frequency, followed by a tents. In Art, a piece of work is typically described with three
second-order analysis that focuses on co-occurring term pairs. The basic components: subject, form, and content. In general, the subject
significance of a term pair is measured with the 𝜒 2 metric [1, 36]: defines “what” (the topic or focus); the form confines “how” (the
[𝐸 (𝑎𝑏) − 𝑂 (𝑎𝑏) ] 2 [𝐸 (𝑎𝑏)
¯ − 𝑂 (𝑎𝑏)¯ ]2 development, composition, or substantiation); and the content artic-
𝜒 2 (𝑎, 𝑏) = + + ulates “why” (the intention or meaning) [22]. We are able to relate
𝐸 (𝑎𝑏) 𝐸 (𝑎𝑏)
¯
¯ − 𝑂 (𝑎𝑏)
[𝐸 (𝑎𝑏) ¯ ] 2 [𝐸 (𝑎¯𝑏)
¯ − 𝑂 (𝑎¯𝑏)
¯ ]2 terms in a text-to-image prompt to these three basic components.
+ , (1) Note that the subject, form, and content of a work of art is often
¯
𝐸 (𝑎𝑏) ¯
𝐸 (𝑎¯𝑏)
intertwined with each other. For example, a term describing the
where 𝑎, 𝑏 are two terms, 𝑂 (𝑎𝑏) is the number of prompts they
subject might also be related to the form or content and vice versa.
co-occur in, 𝐸 (𝑎𝑏) is the expected co-occurrences under the inde-
pendence assumption, and 𝑎, ¯ 𝑏¯ stand for the absence of 𝑎, 𝑏. Subject. A prompt often contains terms describing its topic or
In Table 3, we list the most frequent terms, measured by the num- focus, referred to as the subject, which can be a person, an object,
ber of text-to-image prompts they appear in. The most significant or a theme [26, 27]. Among the 50 most frequent terms of all three
4.1.1 Words in prompts describe subjects, forms, and intents. In Art, a piece of work is typically described with three basic components: subject, form, and content. In general, the subject defines “what” (the topic or focus); the form confines “how” (the development, composition, or substantiation); and the content articulates “why” (the intention or meaning) [22]. We are able to relate terms in a text-to-image prompt to these three basic components. Note that the subject, form, and content of a work of art are often intertwined with each other. For example, a term describing the subject might also be related to the form or content, and vice versa.

Subject. A prompt often contains terms describing its topic or focus, referred to as the subject, which can be a person, an object, or a theme [26, 27]. Among the 50 most frequent terms of all three datasets (parts of them listed in Table 3), we discover 9 terms related to the subject: “portrait”, “lighting”, “light”, “face”, “background”, “character”, “man”, “head”, and “space”. More examples can be found in Table 4, such as (“donald”, “trump”), (“emma”, “watson”), (“biden”, “joe”), (“elon”, “musk”), (“mona”, “lisa”), (“new”, “york”), (“los”, “angeles”), (“star”, “wars”), (“league”, “legends”), and (“tarot”, “card”).

Form. The form confines the way in which an artwork is organized, referring to the use of the principles of organization to arrange the elements of art. These elements may include line, texture, color, shape, and value, while the principles of organization consider harmony, variety, balance, proportion, dominance, movement, economy, etc. [22]. Comparably, the form of a prompt is usually described as constraints to image generation [26, 27]. Among the top 50 terms of all datasets (parts of them listed in Table 3), we find 25 terms that are form-related: “detailed”/“detail”, “art”, “painting”, “style”, “render”, “illustration”, “cinematic”, “k” (e.g., “4K” or “8K”), “16:9”/“9:16”, “oil” (e.g., “oil painting”), “realistic”, “concept” (e.g., “concept art”), “digital”, “intricate”, “black”, “dark”, “unreal”, “white”, “sharp”, “fantasy”, “photo”, “smooth”, and “canvas”.

In addition to these terms, we also notice names of art community websites (e.g., ArtStation³, Artgerm⁴, and CGSociety⁵), rendering engines (e.g., Unreal Engine⁶ and OctaneRender⁷), and artists (e.g., wlop, Norman Rockwell, Fenghua Zhong, Victo Ngai, Shingo Matsunuma, Claude Monet, and Van Gogh) that appear frequently in the prompts (Tables 3-4). These terms are often used to constrain the style of images, so they can be interpreted as form-related.

³ ArtStation: https://www.artstation.com/, retrieved on 3/14/2023.
⁴ Artgerm: https://artgerm.com/, retrieved on 3/14/2023.
⁵ CGSociety: https://cgsociety.org/, retrieved on 3/14/2023.
⁶ Unreal Engine: https://www.unrealengine.com/, retrieved on 3/14/2023.
⁷ OctaneRender: https://home.otoy.com/render/octane-render/, retrieved on 3/14/2023.

Intent. The content (as defined in the Art literature) of a prompt tells the intention or purpose of the user and is often described as the emotional or intellectual message that the user wants to express. Among the three components of art, the content is the most abstract and can be difficult to identify [22]. To avoid ambiguity (“content” has specific meanings in the Web and the AI literature), we name this component of a prompt the “intent” instead. In the top 50 terms of all datasets (parts of them listed in Table 3), we find only three terms that might be related to the intent: “beautiful”, “trending”, and “featured”. If we go down the list, we are able to identify more: “epic”, “moody”, “fantasy”, “dramatic”, “masterpiece”, etc.

Other terms. Aside from the terms that describe the subject, form, and intent, other types of frequently used terms include punctuation (e.g., “,” and “.”), model-specific syntactic characters (e.g., “<”, “>”, “--ar”, and “::” that specify model parameters in the Midjourney dataset), and stop words (e.g., “of”, “the”, “in”, “a”, “and”, and “by”).

Overall, we find that many of the prompts consist of one or more blocks of terms in at least one of these three categories. The frequent appearance of form-related terms is particularly interesting, as it adds constraints to the creation process. Future developments of text-to-image generation should consider how to optimize for the users’ intents under these constraints.

[Figure 1: Term frequencies in a log-log scale. The distributions deviate from Zipf’s law with exponential tails.]

4.1.2 Prompts indicate potential applications. In the second-order analysis, we also discover interesting combinations of terms that might indicate potential applications of text-to-image generation in various areas:

• Image processing: (“film”, “grain”), (“blur”, “blurry”), (“iso”, “nikon”), (“hdr”, “professional”), (“flambient”, “professional”), (“post”, “processing”), (“color”, “scheme”), etc.
• Image rendering: (“ray”, “trace/tracing”), (“fluid”, “redshift”), (“unreal”, “engine”), (“3d”, “shading”), (“3d”, “rendering”), (“global”, “illumination”), (“octane”, “render”), etc.
• Graphic design: (“movie”, “poster”), (“graphic”, “design”), (“key”, “visual”), (“cel”, “shaded”), (“comic”, “book”), (“anime”, “visual”), (“ghibli”, “studio”), (“disney”, “pixar”), etc.
• Industrial design: (“circuit”, “boards”), (“sports”, “car”), etc.
• Fashion design: (“fashion”, “model”), (“curly”, “hair”), etc.

These pairs are often related to the forms and/or intents of the creation, indicating considerable opportunities to develop customized applications for different forms and intents of creative activities.

4.2 How do Text-to-Image Prompts Compare with Web Search Queries?
A text-to-image prompt is analogous to a query submitted to a Web search engine (image generation model) that retrieves (generates) documents (images) that satisfy the information need (Table 1). It is intriguing to compare text-to-image prompts with Web search queries to further understand their similarities and differences.

4.2.1 Term frequencies do not follow the power law. While a power law distribution (or a Zipf’s distribution when the rank of terms is the independent variable) of term frequency is commonly observed in large-scale corpora and Web search queries [41], we find that the distribution of terms in text-to-image prompts deviates from this pattern. Figure 1 shows that the frequencies of top-ranked terms present a milder decay than Zipf’s law, and the tail terms present a clear exponential tail [8]. This is likely due to the specialized nature of creative activities, where the use of terms is more restricted than in open Web search. This indicates the opportunity and feasibility of curating specialized vocabularies for creation, something similar to the Unified Medical Language System (UMLS) in the biomedical and health domain [2].
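A simple way to examine this pattern, sketched below under the assumption of whitespace tokenization, is to fit a line to the rank-frequency curve in log-log space; Zipf’s law predicts a straight line with a slope near -1, whereas systematic curvature around the fit reveals the milder head decay and exponential tail described above:

```python
import numpy as np
from collections import Counter

def rank_frequency_slope(prompts):
    """Fit log(frequency) ~ log(rank) over a corpus of prompts;
    Zipf's law corresponds to a straight line with slope close to -1."""
    counts = Counter(term for p in prompts for term in p.split())
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Toy corpus; on real logs one would also inspect residuals around the fit,
# not just the slope, to see the deviation from a pure power law.
prompts = ["a detailed portrait", "a detailed oil painting", "portrait of a cat"]
print(rank_frequency_slope(prompts))
```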
[Figure 2: Prompt frequencies of DiffusionDB plotted in a log-log scale. The distribution follows Zipf’s law.]

4.2.2 Prompt frequencies follow the power law. We also examine the distribution of prompt frequencies. From Figure 2, we find that the prompt frequency distribution of the larger dataset, DiffusionDB, does follow Zipf’s law (except for the very top-ranked prompts), similar to the queries of Web and vertical search engines [36, 41, 44]. The most frequently used prompts are listed in Table 5. Interestingly, many of the top-ranked prompts are (1) lengthy and (2) only used by a few users. This indicates that although the prompt frequency distributions are similar to those of Web search, the mechanism underneath may be different (shorter Web queries are more frequent and shared by more users [36]).

Table 5: Most frequent prompts in DiffusionDB. #Users indicates the number of users who have used this prompt.

| Rank | Prompt | Freq. | #Users |
| 1 | painful pleasures by lynda benglis, octane render, colorful, 4k, 8k | 1010 | 1 |
| 2 | cinematic bust portrait of psychedelic robot from left, head and chest ... | 240 | 2 |
| 3 | divine chaos engine by karol bak, jean delville, william blake, gustav ... | 240 | 7 |
| 4 | divine chaos engine by karol bak and vincent van gogh | 228 | 1 |
| 5 | soft greek sculpture of intertwined bodies painted by james jean ... | 202 | 2 |
| 6 | detailed realistic beautiful young medieval queen face portrait ... | 202 | 1 |
| 7 | animation magic background game design with miss pokemon ... | 181 | 2 |
| 8 | cat | 174 | 69 |
| 9 | wrc rally car stylize, art gta 5 cover, official fanart behance hd ... | 166 | 4 |
| 10 | futurism movement hyperrealism 4k detail flat kinetic | 157 | 1 |
| 11 | a big pile of soft greek sculpture of intertwined bodies painted by ... | 156 | 1 |
| 12 | test | 152 | 86 |
| 13 | dream | 149 | 133 |
| 14 | realistic detailed face portrait of a beautiful futuristic viking warrior ... | 149 | 2 |
| 15 | spritesheet game asset vector art, smooth style beeple, by thomas ... | 141 | 3 |
| 16* | | 137 | 50 |
| 17 | abstract 3d female portrait age five by james jean and jason chan, ... | 134 | 1 |
| 18 | symmetry!! egyptian prince of technology, solid cube of light, ... | 130 | 1 |
| 19 | retrofuturistic portrait of a woman in astronaut helmet, smooth ... | 127 | 1 |
| 20 | astronaut holding a flag in an underwater desert. a submarine is ... | 127 | 1 |
* Row 16 is an empty prompt.

4.2.3 Text-to-image generation prompts tend to be longer. We report the key statistics of prompt length (i.e., the number of terms in a prompt) in Table 6. The average length of prompts for text-to-image generation (27.16 for Midjourney and 30.34 for DiffusionDB) and the median length (20 for Midjourney and 26 for DiffusionDB) are significantly longer than the lengths of Web search queries, where the mean is around 2.35 and the median is about 2 terms [14, 36]. Interestingly, similar observations are reported in vertical search engines such as electronic health records (EHR) search engines, where the queries are also significantly longer than Web search queries (the average length is 5.0) [44], likely due to the highly specialized and complex nature of the tasks.

Table 6: Statistics of prompt lengths.

| Dataset       | Midjourney | DiffusionDB | SAC   |
| Avg. #terms   | 27.16      | 30.34       | 17.53 |
| Std. #terms   | 24.11      | 21.25       | 11.27 |
| Median #terms | 20         | 26          | 15    |
| Max #terms    | 426        | 540         | 62    |

Bundled queries. When queries are more complex and harder to compose, an effective practice used in medical search engines is to allow users to bundle a long query, save it for reuse, and share it with others. In the context of EHR search, bundled queries are significantly longer (with 58.9 terms on average, compared to 1.7 terms in user typed-in queries) [44, 46]. Bundled queries tend to have higher quality and, once shared, are more likely to be adopted by other users [46]. Table 5 seems to suggest the same opportunity, as certain well-composed lengthy queries are revisited many times by their users. These prompts could be saved as “bundles” and potentially shared with other users. To illustrate the potential, we count the prompts used by multiple users and plot the distribution in Figure 3. We find that a total of 16,950 unique prompts (0.94% of all unique prompts) have been used across users; 782 have been used by five or more users, and 182 have been shared by 10 or more users. The result suggests that text-to-image generation users have already started to share bundled prompts spontaneously, even though this functionality has not been provided by the system. Compared to vertical search engines that provide bundle-sharing features, the proportion of bundled prompts is still relatively small (compared with 19.3% for an EHR search engine [44]), indicating a huge opportunity for bundling and sharing prompts.

[Figure 3: Prompts shared across users in DiffusionDB. The orange line plots the average prompt length in the blue bins.]
4.2.4 Text-to-image generation sessions contain more prompts. A session is defined as a sequence of queries made by the same user within a short time frame in Web search [44], which often corresponds to an atomic mission for a user to achieve a single information need [15, 36]. Analyzing sessions is critical in query log analysis because a session provides insights about how a user modifies the queries to fulfill the information need [15, 36].

Following the common practice in Web search, we chunk prompts into sessions with a 30-minute timeout [14, 44], meaning any two consecutive prompts that are submitted by the same user within 30 minutes will be considered as in the same session.

The statistics of sessions are listed in Table 8. Similar to prompts, text-to-image generation sessions also tend to be significantly longer than Web search sessions (by the number of prompts in a session). A text-to-image generation session contains 10.25 or 13.71 (Midjourney or DiffusionDB) prompts on average and a median of 4 or 5 (Midjourney or DiffusionDB) prompts, while in Web search, the average session length is around 2.02 and the median is 1 [36]. This is again likely due to the complexity of the creation task, so the users need to update the prompts multiple times. Indeed, a user tends to change (add, delete, or replace) a median of 3 terms (measured by term-level edit distance) between two consecutive prompts in the same session on Midjourney (5 on DiffusionDB), astonishingly more than how people update Web search queries. Do these updates indicate different types of information needs?

Table 8: Statistics of prompt sessions. Sessions are identified with a 30-minute timeout. Edit distances regarding terms are calculated with consecutive prompts in the same session.

| Dataset                 | Midjourney | DiffusionDB |
| #Sessions               | 14,232     | 161,001     |
| Avg. #sessions/user     | 8.52       | 15.51       |
| Median #sessions/user   | 2          | 9           |
| Avg. #prompts/session   | 10.19      | 13.71       |
| Median #prompts/session | 4          | 5           |
| Avg. edit distance      | 8.53       | 9.42        |
| Median edit distance    | 3          | 5           |
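Both steps, session segmentation with a 30-minute timeout and the term-level edit distance between consecutive prompts, can be sketched as follows. This is a simplified illustration assuming whitespace tokenization; the repository’s actual implementation may differ:

```python
from datetime import datetime, timedelta

def split_sessions(events, timeout=timedelta(minutes=30)):
    """Group one user's (timestamp, prompt) pairs, assumed sorted by time,
    into sessions: a gap longer than the timeout starts a new session."""
    sessions, current, last_time = [], [], None
    for t, prompt in events:
        if last_time is not None and t - last_time > timeout:
            sessions.append(current)
            current = []
        current.append(prompt)
        last_time = t
    if current:
        sessions.append(current)
    return sessions

def term_edit_distance(p1, p2):
    """Levenshtein distance over terms (additions, deletions, replacements)."""
    a, b = p1.split(), p2.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # delete a term
                          d[i][j - 1] + 1,                           # add a term
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # replace a term
    return d[len(a)][len(b)]

events = [(datetime(2022, 8, 1, 10, 0), "a red cat"),
          (datetime(2022, 8, 1, 10, 5), "a blue cat"),
          (datetime(2022, 8, 1, 12, 0), "a castle at dusk")]
print(len(split_sessions(events)))                    # 2 sessions
print(term_edit_distance("a red cat", "a blue cat"))  # 1 term changed
```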
Table 7: Prompts shared by the largest numbers of users in DiffusionDB. Only prompts longer than five terms are reported below row 10.

| Rank | Prompt | #Users |
| 1 | dream | 133 |
| 2 | stable diffusion | 91 |
| 3 | help | 89 |
| 4 | test | 86 |
| 5 | cat | 69 |
| 6 | nothing | 66 |
| 7 | god | 58 |
| 8 | the backrooms | 53 |
| 9* | | 50 |
| 10 | among us | 44 |
| 19 | a man standing on top of a bridge over a city, cyberpunk art ... | 32 |
| 20 | mar - a - lago fbi raid lego set | 32 |
| 34 | an armchair in the shape of an avocado | 23 |
| 35 | a giant luxury cruiseliner spaceship, shaped like a yacht, ... | 23 |
| 42 | a portrait photo of a kangaroo wearing an orange hoodie and ... | 19 |
| 45 | anakin skywalker vacuuming the beach to remove sand | 19 |
| 48 | emma watson as an avocado chair | 18 |
| 64 | milkyway in a glass bottle, 4k, unreal engine, octane render | 16 |
* Row 9 is an empty prompt.

4.2.5 A new categorization of information needs. Web search queries are typically distinguished into three categories: (1) navigational queries, (2) informational queries, and (3) transactional queries [3]. Should text-to-image prompts be categorized in the same way? Or do prompts express new categories of information needs?

Navigational prompts. The most frequent queries in Web search are often navigational, where users simply use a query to lead them to a particular, known Website (e.g., “Facebook” or “YouTube”). In text-to-image generation, as the generation model often returns different images given the same text prompt due to randomization, the information need of “navigating” to a known image is rare. Indeed, the prompts used by the largest numbers of users (Table 7) are generally not tied to a particular image. Even though the shorter prompts at the top look somewhat similar to “Facebook” or “YouTube”, they are rather ambiguous and read more like tests of the system.

Informational prompts. Most other text-to-image prompts can be compared to informational queries in Web search, which aim to acquire certain information that is expected to be present on one or more Web pages [3]. The difference is that informational prompts aim to synthesize (rather than retrieve) an image, which is expected to exist in the latent representation space of images. Most prompts fall into this category, similar to the case in Web search [3].

Transactional prompts. Transactional queries are those intended to perform certain Web-related activities [3], such as completing a transaction (e.g., to book a flight or to make a purchase). One could superficially categorize all prompts as transactional, as they are all intended to conduct the activity of “generating images”. Zooming into this superficial categorization, we could identify prompts that refer to specific and recurring tasks, such as “3D rendering”, “post-processing”, “global illumination”, and “movie poster” (see more examples in Section 4.1.2). These tasks may be considered transactional in the context of text-to-image generation.

Exploratory prompts. Beyond the above categories corresponding to the three basic types of Web search queries, we discover a new type of information need in prompts, namely the exploratory prompts for text-to-image generation. Compared to an informational prompt that aims to generate a specific (hypothetical) image, an exploratory prompt often describes a vague or uncertain information need (or image generation requirement) that intentionally leads to multiple possible answers. The user intends to explore different possibilities, leveraging either the randomness of the model or the flexibility of terms used in a prompt session.

Indeed, rather than clearly specifying the requirements and constraints and gradually refining them in a session, in exploratory prompts or sessions, the users tend to play with alternative terms of the same category (e.g., different colors or animals, or sibling terms) to explore how the generation results could differ or could cover a broader search space. Based on the session analysis, we count the most frequent term replacements in Table 9. In this table, we find 33 replacements that show exploratory patterns, such as (“man”, “woman”), (“asian”, “white”), (“dog”, “cat”), (“red”, “blue”), and (“16:9”, “9:16”).

On the contrary, in non-exploratory sessions, replacing a term with its synonyms or hyponyms, or with more specific concepts, is more common, which refines the search space (rather than exploring the generation space). In the table, we find a few such replacements: (“steampunk”, “cyberpunk”), (“deco”, “nouveau”), (“crown”, “throne”). There are also examples that replace terms with the correct spelling or replace punctuation to refine: (“aphrodesiac”, “aphrodisiac”), (“with”, “,”), (“,”, “and”), and (“,”, “.”).
Table 9: Most frequent term replacements. This table only considers consecutive prompts from the same session where exactly one term is replaced. Green highlights replacements that might indicate exploratory patterns, while red highlights non-exploratory replacements.

| Rank | Midjourney replacement         | Freq. | DiffusionDB replacement    | Freq. |
| 1    | (’deco’, ’nouveau’)            | 16    | (’man’, ’woman’)           | 216   |
| 2    | (’16:9’, ’9:16’)               | 15    | (’woman’, ’man’)           | 187   |
| 3    | (’9:16’, ’16:9’)               | 14    | (’2’, ’3’)                 | 161   |
| 4    | (’2’, ’1’)                     | 8     | (’1’, ’2’)                 | 147   |
| 5    | (’16:9’, ’4:6’)                | 8     | (’7’, ’8’)                 | 140   |
| 6    | (’1’, ’2’)                     | 7     | (’8’, ’9’)                 | 139   |
| 7    | (’3:4’, ’4:3’)                 | 7     | (’6’, ’7’)                 | 135   |
| 8    | (’1000’, ’10000’)              | 7     | (’3’, ’4’)                 | 132   |
| 9    | (’artwork’, ’parrot’)          | 7     | (’girl’, ’woman’)          | 128   |
| 10   | (’16:9’, ’1:2’)                | 6     | (’red’, ’blue’)            | 116   |
| 11   | (’2:3’, ’3:2’)                 | 6     | (’5’, ’6’)                 | 115   |
| 12   | (’asian’, ’white’)             | 6     | (’4’, ’5’)                 | 112   |
| 13   | (’1’, ’0.5’)                   | 5     | (’female’, ’male’)         | 107   |
| 14   | (’320’, ’384’)                 | 5     | (’male’, ’female’)         | 97    |
| 15   | (’0.5’, ’1’)                   | 4     | (’blue’, ’red’)            | 93    |
| 16   | (’crown’, ’throne’)            | 4     | (’0’, ’1’)                 | 89    |
| 17   | (’blue’, ’green’)              | 4     | (’cat’, ’dog’)             | 89    |
| 18   | (’9:16’, ’4:5’)                | 4     | (’woman’, ’girl’)          | 82    |
| 19   | (’2:3’, ’1:2’)                 | 4     | (’dog’, ’cat’)             | 79    |
| 20   | (’--w’, ’--h’)                 | 4     | (’white’, ’black’)         | 72    |
| 21   | (’nouveau’, ’deco’)            | 4     | (’with’, ’,’)              | 71    |
| 22   | (’red’, ’blue’)                | 4     | (’steampunk’, ’cyberpunk’) | 71    |
| 23   | (’guy’, ’girl’)                | 4     | (’red’, ’green’)           | 70    |
| 24   | (’snake’, ’apple’)             | 4     | (’cyberpunk’, ’steampunk’) | 70    |
| 25   | (’japanese’, ’korean’)         | 4     | (’,’, ’and’)               | 69    |
| 26   | (’16:8’, ’8:11’)               | 4     | (’painting’, ’portrait’)   | 68    |
| 27   | (’insect’, ’ladybug’)          | 4     | (’,’, ’.’)                 | 68    |
| 28   | (’--hd’, ’--vibe’)             | 3     | (’portrait’, ’painting’)   | 68    |
| 29   | (’aphrodesiac’, ’aphrodisiac’) | 3     | (’girl’, ’boy’)            | 64    |
| 30   | (’0.5’, ’2’)                   | 3     | (’green’, ’blue’)          | 63    |
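Replacements like those in Table 9 can be extracted with a deliberately simple criterion, sketched below: two consecutive prompts in a session that have the same number of terms and differ in exactly one position. The paper’s exact matching rule may differ:

```python
from collections import Counter

def one_term_replacement(prev, curr):
    """Return the (old, new) pair if curr differs from prev in exactly one
    term at the same position; otherwise None."""
    a, b = prev.split(), curr.split()
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None

# Count replacements over consecutive prompt pairs within each session.
sessions = [["a red cat, oil painting", "a blue cat, oil painting"],
            ["a dog in space", "a cat in space", "a cat in space, 4k"]]
replacements = Counter()
for session in sessions:
    for prev, curr in zip(session, session[1:]):
        pair = one_term_replacement(prev, curr)
        if pair:
            replacements[pair] += 1
print(replacements.most_common(5))  # [(('red', 'blue'), 1), (('dog', 'cat'), 1)]
```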
Another indication of exploratory behavior is the repeated use of prompts. For example, among the top prompts in Table 5 (except those for testing purposes), each of them is repeatedly used by the same user more than 100 times. This might be because the user is exploring different generation results with the same prompt, leveraging the randomness of the generative model.

4.3 How are the Information Needs Satisfied?
Prompts are typically crafted to meet certain information needs by generating satisfactory images. In this subsection, we examine how prompts can fulfill this goal. With the rating annotations in the SAC dataset (the average rating is 5.53, and the median is 6), we calculate the correlation between ratings and other variables such as prompt lengths and term frequencies.

4.3.1 Longer prompts tend to be higher rated. We plot how the ratings of generated images correlate with prompt lengths in Figure 4, where we find a positive correlation with a Pearson coefficient of 0.197. This means longer prompts tend to produce images of higher quality. This provides another perspective to understand the large lengths of prompts and prompt sessions, and another motivation to bundle and share long prompts.

[Figure 4: Prompt length is positively correlated with ratings. The Pearson correlation coefficient is 0.197.]
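The correlation itself is straightforward to compute; a toy sketch follows (the rating values below are made up, not SAC data):

```python
import numpy as np

# Hypothetical (prompt, mean rating) pairs standing in for SAC annotations.
data = [("a cat", 4.5),
        ("a detailed portrait of a cat, oil painting", 5.8),
        ("epic matte painting of a castle at dusk, trending on artstation", 6.4)]

lengths = np.array([len(prompt.split()) for prompt, _ in data])
ratings = np.array([rating for _, rating in data])

# Pearson correlation between prompt length and rating;
# the paper reports r = 0.197 on the full SAC dataset.
r = np.corrcoef(lengths, ratings)[0, 1]
print(f"Pearson r = {r:.3f}")
```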
4.3.2 The choice of words matters. We also investigate how the choice of words influences the performance of image generation. We collect all the prompts that contain a particular term and calculate the average rating. Terms with the highest and lowest average ratings are listed in Table 12 in the appendix. We find that most high-rating terms are artist names, which provide clear constraints on the styles of images. In contrast, terms with low ratings are much vaguer and more abstract, and might indicate exploratory behavior. More effort is needed to handle exploratory prompts and to encourage the users to refine their needs.

4.4 How are Users’ Information Needs Covered by Image Captions?
Current text-to-image generation models are generally trained with large-scale image-text datasets, where the paired texts usually come from image captions. To figure out how well these training sets match the actual users’ information needs, we compare the prompts with image captions in the open domain. In particular, we consider LAION-400M [35] as one of the main sources of text-to-image training data since both LDMs and the Stable Diffusion model employ this dataset. Texts in LAION-400M are extracted from the captions of the images collected from the Common Crawl, so they are supposed to convey the subject, form, and intent of the images. We randomly sample 1M texts from LAION-400M and compare them with user-input prompts. We obtain the following finding.

Term usages are different between user-input prompts and image captions in open datasets. We construct a vocabulary based on LAION-400M and calculate the vocabulary coverage of the three prompt datasets (i.e., what proportion of the user-input terms is covered by the LAION vocabulary). The coverage is 25.94% for Midjourney, 43.17% for DiffusionDB, and 80.56% for SAC. The coverage is relatively high on SAC as this dataset is relatively clean. In comparison, the Midjourney and DiffusionDB datasets directly
collect prompts from Discord channels of Midjourney and Stable Diffusion, and over half of the terms are not covered in the LAION dataset. We also analyzed their embeddings and found that user-input prompts and image captions from the LAION dataset cover very different regions in the latent space (Figure 8 in the appendix).
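The vocabulary coverage measure can be sketched as follows, assuming whitespace tokenization on both sides (the paper’s tokenizer and vocabulary construction may differ):

```python
def vocab_coverage(prompt_terms, caption_vocab):
    """Fraction of unique prompt terms that also appear in the caption vocabulary."""
    prompt_vocab = set(prompt_terms)
    return len(prompt_vocab & caption_vocab) / len(prompt_vocab)

# Toy example; in the paper the caption vocabulary is built from a 1M-text
# sample of LAION-400M and compared against each prompt dataset's terms.
captions = ["a photo of a cat", "red car on the road"]
prompts = ["a cat in the style of octane render --ar 16:9"]

caption_vocab = {t for c in captions for t in c.split()}
prompt_terms = [t for p in prompts for t in p.split()]
print(f"coverage = {vocab_coverage(prompt_terms, caption_vocab):.2%}")
```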
5 IMPLICATIONS
Our analysis presents unique characteristics of user-input prompts, which helps us better understand the limitations and opportunities of text-to-image generation systems and AI-facilitated creativity on the Web. Below we discuss a few concrete and actionable possibilities for improving the generation systems and enhancing creativity.

Building art creativity glossaries. As we discussed in Sec. 4.1.1, a text-to-image prompt can be decomposed into three aspects: subject (“what”), form (“how”), and intent (“why”, or content as in classical Art literature). If we can identify and analyze these specific elements in prompts, we may be able to better decipher users’ information needs.

However, to the best of our knowledge, there is no existing tool that is able to extract the subject, form, and intent from text prompts. Besides, although users have spontaneously collected terms that describe the form and subject⁸, there is no high-quality and comprehensive glossary in the literature that contains terms about these three basic components of art, or something like the Unified Medical Language System (UMLS) for the biomedical and health domains [2]. Constructing such tools or glossaries is difficult and will rely heavily on domain knowledge, because: (1) these three components of art are often intertwined and inseparable in a piece of work [22], meaning a term may fall into any of the three categories; for example, in Process Art, the form and content seem to be the same thing [22]; and (2) terminologies about art are constantly updated because new artists, styles, and art-related sites keep emerging. We call for the joint effort of the art and the Web communities to build such vocabularies and tools.

⁸ Prompt book for data lovers II: https://docs.google.com/presentation/d/1V8d6TIlKqB1j5xPFH7cCmgKOV_fMs4Cb4dwgjD5GIsg, retrieved on 3/14/2023.

Bundling and sharing prompts. Sec. 4.2.3 analyzes the lengths of text-to-image prompts, where we find an inadequate use of bundled prompts compared with other vertical search engines (e.g., EHR search engines). Since the prompts are generally much longer than Web search queries, and the information needs are also more complex, it is highly likely that bundled prompts can help the users craft their prompts more effectively and efficiently. Though there are already prompt search websites like Lexica⁹, PromptHero¹⁰, and PromptBase¹¹ that provide millions of user-crafted prompts, such bundled search features are rarely integrated into current text-to-image generation systems. As mentioned earlier, adding features to support bundling and sharing high-quality prompts could bring immediate benefits to text-to-image generation systems.

⁹ Lexica: https://lexica.art/, retrieved on 3/14/2023.
¹⁰ PromptHero: https://prompthero.com/, retrieved on 3/14/2023.
¹¹ PromptBase: https://promptbase.com/, retrieved on 3/14/2023.

Personalized generation. The analysis in Sec. 4.2.4 suggests that the session lengths in text-to-image generation are also significantly larger than the session lengths in Web search, indicating a great opportunity for personalized generation. Currently, the session-based generation features are mostly built upon image initialization of diffusion models, i.e., using the output from the previous generation as the starting point of diffusion sampling. Compared with other session-based AI systems like ChatGPT [23], these session-based features still seem preliminary and take little consideration of personalized generation. Meanwhile, the explicit descriptions of forms and intents in prompts also indicate opportunities to customize the generation models for these constraints (and the potential applications as listed in Section 4.1.2).

Handling exploratory prompts and sessions. In Sec. 4.2.5 we identify a new type of prompt in addition to the three typical categories of query in Web search (i.e., navigational, informational, and transactional queries), namely the exploratory prompts. To encourage the exploratory generation of images, reliable and informative exploration measures will be much needed. In other machine innovation areas, like AI for molecular generation, efforts have been made on discussing the measurement of coverage and exploration of spaces [42, 43], but for text-to-image generation, such discussions are still rare. How to encourage the models to explore a larger space, generate novel and diverse images, and recommend exploratory prompts to users are all promising yet challenging directions.

Improving generation models with prompt logs. Finally, the gap between the image captions in open datasets and the user-input prompts (Sec. 4.4) indicates that it is desirable to improve model training directly using the prompt logs. Following the common practice in Web search engines, one may leverage both explicit and implicit feedback from the prompt logs (such as the ratings or certain behavioral patterns or modifications in the prompts) as additional signals to update the generation models.

Although we focus our analysis on text-to-image generation, the analogy to Web search and some of the above implications also apply to other domains of AI-generated content (AIGC), such as AI chatbots (e.g., ChatGPT).

6 CONCLUSION
We take an initial step to investigate the information needs of text-to-image generation through a comprehensive and large-scale analysis of user-input prompts (analogous to Web search queries) in multiple popular systems. The results suggest that (1) text-to-image prompts are typically structured with terms that describe the subject, form, and intent; (2) text-to-image prompts are sufficiently different from Web search queries; our findings include the significantly lengthier prompts and sessions, the lack of navigational prompts, the new perspective on transactional prompts, and the prevalence of exploratory prompts; (3) image generation quality is correlated with the length of the prompt as well as the usage of terms; and (4) there is a considerable gap between the user-input prompts and the image captions used to train the models. Based on these findings, we present actionable insights to improve text-to-image generation systems. We anticipate our study could help the text-to-image generation community to better understand and facilitate creativity on the Web.
REFERENCES
[1] Alan Agresti. 2012. Categorical data analysis. Vol. 792. John Wiley & Sons.
[2] Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 32, suppl_1 (2004), D267–D270.
[3] Andrei Broder. 2002. A taxonomy of web search. In ACM SIGIR Forum, Vol. 36. ACM New York, NY, USA, 3–10.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[5] Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi. 2020. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers. In EMNLP.
[6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
[7] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems 30 (2017).
[8] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. 2009. Power-law distributions in empirical data. SIAM review 51, 4 (2009), 661–703.
[9] Thomas H. Davenport and Nitin Mittal. 2022. How generative AI is changing creative work. https://hbr.org/2022/11/how-generative-ai-is-changing-creative-work Retrieved on 3/15/2023.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (Eds.), Vol. 27. Curran Associates, Inc.
[12] Jorge R Herskovic, Len Y Tanaka, William Hersh, and Elmer V Bernstam. 2007. A day in the life of PubMed: analysis of a typical day’s query log. Journal of the American Medical Informatics Association 14, 2 (2007), 212–220.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[14] Bernard J Jansen, Amanda Spink, Judy Bateman, and Tefko Saracevic. 1998. Real life information retrieval: A study of user queries on the web. In ACM SIGIR Forum, Vol. 32. ACM New York, NY, USA, 5–17.
[15] Rosie Jones and Kristina Lisa Klinkner. 2008. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In Proceedings of the 17th ACM conference on Information and knowledge management. 699–708.
[16] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[17] Vivian Liu and Lydia B Chilton. 2022. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–23.
[18] Elman Mansimov, Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. 2016. Generating Images from Captions with Attention. In ICLR.
[19] Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
[20] Midjourney.com. 2022. Midjourney. https://midjourney.com/ Retrieved on 3/15/2023.
[21] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning. PMLR, 16784–16804.
[22] Otto G Ocvirk, Robert E Stinson, Philip R Wigg, Robert O Bone, and David L Cayton. 1968. Art fundamentals: Theory and practice. WC Brown Company.
[23] OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt Retrieved on 3/15/2023.
[24] OpenAI. 2023. GPT-4 Technical Report. https://cdn.openai.com/papers/gpt-4.pdf Retrieved on 3/15/2023.
[25] Jonas Oppenlaender. 2022. The Creativity of Text-to-Image Generation. In Proceedings of the 25th International Academic Mindtrek Conference. 192–202.
[26] Jonas Oppenlaender. 2022. A Taxonomy of Prompt Modifiers for Text-to-Image Generation. arXiv preprint arXiv:2204.13988 (2022).
[27] Nikita Pavlichenko and Dmitry Ustalov. 2022. Best Prompts for Text-to-Image Models and How to Find Them.
[28] John David Pressman, Katherine Crowson, and Simulacra Captions Contributors. 2022. Simulacra Aesthetic Captions. https://github.com/JD-P/simulacra-aesthetic-captions Retrieved on 3/15/2023.
[29] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
[30] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
[31] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International conference on machine learning. PMLR, 1060–1069.
[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
[33] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems.
[34] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. 2023. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv preprint arXiv:2301.09515 (2023).
[35] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
[36] Craig Silverstein, Hannes Marais, Monika Henzinger, and Michael Moricz. 1999. Analysis of a very large web search engine query log. In ACM SIGIR Forum, Vol. 33. ACM New York, NY, USA, 6–12.
[37] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
[38] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[39] Iulia Turc and Gaurav Nemade. 2022. Midjourney User Prompts & Generated Images (250k). https://doi.org/10.34740/KAGGLE/DS/2349267 Retrieved on 3/15/2023.
[40] Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. 2022. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. arXiv preprint arXiv:2210.14896 (2022).
[41] Yinglian Xie and David O’Hallaron. 2002. Locality in search engine queries and its implications for caching. In Proceedings. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 3. IEEE, 1238–1247.
[42] Yutong Xie, Ziqiao Xu, Jiaqi Ma, and Qiaozhu Mei. 2022. How Much of the Chemical Space Has Been Explored? Selecting the Right Exploration Measure for Drug Discovery. In ICML 2022 2nd AI for Science Workshop.
[43] Yutong Xie, Ziqiao Xu, Jiaqi Ma, and Qiaozhu Mei. 2023. How Much Space Has Been Explored? Measuring the Chemical Space Covered by Databases and Machine-Generated Molecules. In The Eleventh International Conference on Learning Representations.
[44] Lei Yang, Qiaozhu Mei, Kai Zheng, and David A Hanauer. 2011. Query log analysis of an electronic health record search engine. In AMIA Annual Symposium Proceedings, Vol. 2011. American Medical Informatics Association, 915.
[45] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. 2022. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. Transactions on Machine Learning Research (2022).
[46] Kai Zheng, Qiaozhu Mei, and David A Hanauer. 2011. Collaborative search in electronic health records. Journal of the American Medical Informatics Association 18, 3 (2011), 282–291.
A DATA AND DATA PROCESSING

A.1 Datasets
For the datasets, we list the important features (prompt, timestamp, user ID, and rating), feature descriptions, and corresponding examples in Table 10.

Table 10: Feature descriptions and examples of the Midjourney, DiffusionDB, and SAC datasets.

Prompt (Type: String). The prompt used to generate images.
    Midjourney: hands, by Karel Thole and Mike Mignola --ar 2:3
    DiffusionDB: Ibai Berto Romero as Willy Wonka, highly detailed, oil on canvas
    SAC: concept art by David Production.

Timestamp (Type: String). The timestamp when the image was generated from the prompt.
    Midjourney: 2022-06-23T23:58:16.024000+00:00
    DiffusionDB: 2022-08-07 22:57:00+00:00
    SAC: N/A

User ID (Type: String). The unique ID for the user account who submitted the prompt.
    Midjourney: 977252506335858758
    DiffusionDB: fcdb3e09f977412c342b6624a19d1295ee1334c153c90af16d1cca8d9f27b04a
    SAC: N/A

Ratings (Type: Integer). The rating of the image generated from the prompt¹².
    Midjourney: N/A
    DiffusionDB: N/A
    SAC: 6, 5, 7, 5
A.2 Data Processing
Midjourney. We extracted prompts, timestamps, and user IDs from the records in the Midjourney dataset. Prompts in Midjourney may contain model-specific syntactic parameters, such as "--ar" for aspect ratios, "--h" for heights, "--w" for widths, and "::" for assigning weights to certain terms in the prompts. We lowercase the prompts and tokenize them with the spaCy tokenizer¹³. We treat parameters such as "--h" as single terms. Specifically, we split weighted terms from their weights, and treat "::" and "::-" (negative weight) as two distinct terms. During tokenization, we also remove redundant whitespace. Midjourney allows users to upload reference images as part of their prompts in the form of Discord links; these links are also processed as special terms.
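To make the procedure concrete, the Python sketch below reimplements this tokenization scheme. It is our illustration rather than the exact pipeline used for the paper: the regular expressions, the helper name tokenize_prompt, and the use of spaCy's blank English pipeline are all assumptions.

```python
import re
import spacy

# A blank pipeline gives us spaCy's tokenizer without tagging/parsing overhead.
nlp = spacy.blank("en")

# Midjourney syntax kept as single terms (patterns are our approximation).
PARAM_RE = re.compile(r"^--\w+$")        # e.g. --ar, --h, --w
LINK_RE = re.compile(r"^https?://\S+$")  # uploaded reference images (Discord links)

def tokenize_prompt(prompt: str) -> list[str]:
    """Lowercase and tokenize a prompt, preserving Midjourney parameters,
    '::' / '::-' weight separators, and links as single special terms."""
    # Detach weight syntax so 'cat::2' -> 'cat :: 2' and 'dog::-1' -> 'dog ::- 1'.
    prompt = re.sub(r"::(-?)", r" ::\1 ", prompt)
    terms = []
    for chunk in prompt.split():  # split() also collapses redundant whitespace
        if PARAM_RE.match(chunk) or LINK_RE.match(chunk) or chunk in ("::", "::-"):
            terms.append(chunk)  # special syntax stays intact, not lowercased
        else:
            # Ordinary text goes through spaCy (punctuation splitting, etc.).
            terms.extend(t.text.lower() for t in nlp(chunk))
    return terms

# e.g. tokenize_prompt("Hands, by Karel Thole --ar 2:3 vivid::2")
```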
DiffusionDB. We utilize the metadata of DiffusionDB-Large (14M) for prompt analysis. We first remove duplicate data entries with the same prompt, timestamp, and user ID, so that different images generated by the same user from the same prompt submission are recorded as a single submission. As a result, we obtained 2,208,019 non-duplicate prompt submissions from users. Note that repeated submissions of the same prompt (e.g., at different times or by different users) are preserved. We tokenize the prompts and remove redundant whitespace in the same way as for the Midjourney data.
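The deduplication step can be sketched with pandas as follows; the file path and the column names (prompt, timestamp, user_id) are illustrative assumptions about the metadata schema, not the dataset's actual field names.

```python
import pandas as pd

# Illustrative schema: one row per generated image in the metadata table.
meta = pd.read_parquet("diffusiondb_large_metadata.parquet")  # path is assumed

# Images sharing (prompt, timestamp, user_id) came from one submission,
# so collapse them into a single record; repeated submissions of the same
# prompt at other times or by other users are kept.
submissions = meta.drop_duplicates(subset=["prompt", "timestamp", "user_id"])
print(f"{len(submissions):,} unique prompt submissions")
```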
SAC. SAC provides aesthetic ratings of generated images. Note that one prompt can correspond to multiple images, and each image can also have multiple ratings. Since SAC carries no user ID or timestamp annotations, we remove duplicates by simply extracting the unique prompts and collecting all of the ratings associated with each of them. More details can be found in the supplementary materials.
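This aggregation amounts to a group-by over prompts; the toy schema below is a stand-in for illustration, not SAC's actual field names.

```python
import pandas as pd

# Toy stand-in for SAC: one row per (prompt, image, rating) triple.
sac = pd.DataFrame({
    "prompt": ["concept art by David Production."] * 4,
    "rating": [6, 5, 7, 5],
})

# Unique prompts, each with every rating (across all of its images) attached.
ratings_by_prompt = sac.groupby("prompt")["rating"].agg(list)
mean_rating = ratings_by_prompt.map(lambda rs: sum(rs) / len(rs))
```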
B ADDITIONAL ANALYSIS RESULTS

B.1 Prompt-Level Analysis
Prompt length distributions. The distributions of prompt lengths are displayed in Figure 5, where the modes are around 10.

[Figure 5 omitted: kernel density estimates (KDE) and averages of prompt length for Midjourney, DiffusionDB, and SAC; y-axis: density, x-axis: prompt length.]
Figure 5: Prompt length distributions. The x-axis (prompt length) is plotted in the log scale.
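A density estimate of this kind can be computed directly on log-scaled lengths, for instance with SciPy's Gaussian KDE. The snippet below is a sketch on toy data; the actual figure fits one curve per dataset.

```python
import numpy as np
from scipy.stats import gaussian_kde

# lengths: terms per prompt, e.g. [len(tokenize_prompt(p)) for p in prompts].
lengths = np.array([7, 12, 9, 34, 10, 8, 150, 11])  # toy data for illustration

# Estimate the density in log10 space, matching the log-scaled x-axis.
log_lengths = np.log10(lengths)
kde = gaussian_kde(log_lengths)

grid = np.linspace(0, 3, 200)          # 10^0 .. 10^3 terms
density = kde(grid)
mode = 10 ** grid[np.argmax(density)]  # roughly 10 for the real datasets
```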
Prompts revised by users. Table 11 lists the most revisited prompts in DiffusionDB.

Time series analysis. We analyze how prompt submissions are distributed over the 24 hours of a day for the Midjourney and DiffusionDB datasets. The results are shown in Figure 6. The patterns in the two datasets are similar: the peak hours are around 01:00–03:00 (for both Midjourney and DiffusionDB), 15:00–17:00 (Midjourney), and 20:00 (DiffusionDB), while during the daytime the users are relatively inactive.

Ratings. The overall rating distribution of SAC is displayed in Figure 7.
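The hourly distribution behind Figure 6 can be reproduced by bucketing submission timestamps by hour; the sketch below uses toy data in place of the deduplicated submissions table from Appendix A.2.

```python
import pandas as pd

# Toy stand-in for the deduplicated 'submissions' table from A.2.
submissions = pd.DataFrame({
    "timestamp": ["2022-08-07 01:15:00+00:00", "2022-08-07 02:40:00+00:00",
                  "2022-08-07 14:05:00+00:00"],
})
submissions["timestamp"] = pd.to_datetime(submissions["timestamp"], utc=True)

# Proportion of prompts submitted in each UTC hour of the day (cf. Figure 6).
hourly = (
    submissions["timestamp"].dt.hour
    .value_counts(normalize=True)
    .sort_index()
)
peak_hour = hourly.idxmax()
```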
B.2 Comparing Prompts with Training Data
To compare user-input prompts with the texts that are used to train text-to-image generation models, we also include the LAION dataset [35]. LAION is a public dataset of CLIP-filtered image-text pairs and has often been used in training large text-to-image models [32–34, 45]. In this analysis, we use the LAION-400M dataset¹⁴, which contains only English texts.

¹² Note that one prompt may correspond to multiple images, and one image may have multiple ratings. Here we list all the ratings correlated to the example prompt.
¹³ spaCy: https://spacy.io/, retrieved on 3/15/2023.
¹⁴ LAION-400M dataset: https://laion.ai/blog/laion-400-open-dataset/.
Table 11: Most revisited prompts in DiffusionDB. Only revisits across sessions are considered.

     Prompt                                                                      #Revisits
 1   test                                                                           24
 2   cat                                                                            19
 3   fat chuck is mad                                                               15
 4                                                                                  15
 5   dog                                                                            15
 6   symmetry!! egyptian prince of technology, solid cube of light, ...             13
 7   full character of a samurai, character design, painting by gaston ...          13
 8   studio portrait of lawful good colorful female holy mecha paladin ...          11
 9   full portrait and/or landscape. contemporary art print. high taste. ...        11
10   woman wearing oculus and digital glitch head edward hopper and ...             11
11   dream                                                                          10
12   hyperrealistic portrait of a character in a scenic environment by ...          10
13   full portrait &/or landscape painting for a wall. contemporary art ...         10
14   zombie girl kawaii, trippy landscape, pop surrealism                           10
15   creepy ventriloquist dummy in the style of roger ballen, 4k, bw, ...            9
16   cinematic bust portrait of psychedelic robot from left, head and ...            9
17   red ball                                                                        9
18   amazing landscape photo of mountains with lake in sunset by ...                 9
19   female geisha girl, beautiful face, rule of thirds, intricate outfit, ...       9
20   full portrait and/or landscape painting for a wall. contemporary ...            9

Table 12: Terms with the highest and the lowest average ratings. Only terms with frequencies larger than 100 are considered. "Avg." and "Std." are means and standard deviations of ratings respectively.

Terms with the highest avg. ratings:
     Term          Avg.   Std.   Freq.
 1   shinjuku      8.55   0.90     168
 2   gyuri         8.22   1.65     219
 3   lohuller      8.22   1.66     215
 4   afremov       7.95   1.73     288
 5   leonid        7.95   1.73     288
 6   retrofuture   7.95   1.97     307
 7   merantz       7.93   1.77     463
 8   josan         7.91   1.73   1,647
 9   fantasyland   7.90   1.52     114
10   gensokyo      7.89   1.34     281

Terms with the lowest avg. ratings:
     Term          Avg.   Std.   Freq.
 1   equations     2.36   2.18     240
 2   mathematical  2.37   2.18     230
 3   geismar       2.67   2.13     136
 4   haviv         2.68   2.14     136
 5   chermayeff    2.73   2.14     136
 6   learning      3.10   2.64     112
 7   pegasus       3.10   2.00     129
 8   teacher       3.11   2.59     110
 9   someone       3.14   2.45     574
10   funny         3.17   2.52     208
[Figure 6 omitted: proportion of prompts per hour of the day for Midjourney and DiffusionDB; y-axis: "Proportion of prompts" (0.035–0.050), x-axis: "Time (hour)" from 00:00 to 23:00.]
Figure 6: The distribution of prompts within 24 hours.

[Figure 7 omitted: density of ratings from 1 to 10; y-axis: "Density" (0.00–0.15), x-axis: "Rating".]
Figure 7: Rating distribution. Each rating corresponds to a user-input prompt and an image generated from that prompt. The average rating is 5.53, the standard deviation is 2.40, and the median is 6.

Visualization. To intuitively see how the user-input prompts and the texts from the LAION training set are distributed, we use UMAP [19] to visualize the prompts and the texts based on their BERT [10] embeddings in Figure 8. In the visualization, we find a clear gap between LAION (red circles) and the other datasets, meaning that the training set can hardly represent the real distribution of user-input prompts. This visualization also aligns with our findings about vocabulary coverage, where the terms in SAC are the best covered by LAION and the vocabulary of Midjourney is the most distant from that of LAION.

[Figure 8 omitted: 2-D UMAP scatter plot of prompt embeddings, with points colored by dataset (Midjourney, DiffusionDB, SAC, LAION).]
Figure 8: UMAP visualization of prompt embeddings. A clear gap can be identified between the LAION training data (red circles) and the user-input prompts (other colors).
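A minimal sketch of this embedding-and-projection step is shown below. The choice of bert-base-uncased, the mean pooling over token states, and the UMAP settings are our assumptions; the paper does not pin down these details here.

```python
import numpy as np
import torch
import umap
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> np.ndarray:
    """Mean-pooled BERT embeddings (the pooling choice is our assumption)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)   # exclude padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# In the real analysis, 'texts' would be sampled prompts and captions from
# Midjourney, DiffusionDB, SAC, and LAION, with a label per source.
texts = ["a cat, oil on canvas", "hands, by karel thole", "trending on artstation",
         "stock photo of a red car", "woman smiling at camera", "sunset over a lake"]
coords = umap.UMAP(n_components=2, n_neighbors=2, metric="cosine",
                   init="random").fit_transform(embed(texts))  # tiny toy sample
# Plot 'coords' colored by source to reproduce a Figure 8-style view.
```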
Non-representative training data. We observe a substantial gap between user-input prompts and the texts in open training data such as the LAION training set. The out-of-vocabulary (OOV) problem is severe: about 75% of the terms in the Midjourney prompts are not covered by LAION's vocabulary. Figure 8 likewise displays a gap between the prompt embedding distributions. All this evidence indicates that the texts (mostly image captions) in the open training data can hardly represent users' information needs, and calls for alternative forms of supervision during training. ChatGPT [23] has already demonstrated that reinforcement learning from human feedback (RLHF) [7] can provide rich supervision and guidance to a model; for text-to-image generation, however, related work is still limited. Note that our analysis is based on the open datasets that are included in the models' training data, and it does not consider private training data, which could have a different coverage of the space.
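The vocabulary-coverage statistic can be reproduced with simple set arithmetic over the tokenized corpora. The snippet below computes OOV over unique terms (term types), which is one plausible reading of the 75% figure; the variable names and inputs are toy stand-ins for illustration.

```python
def vocabulary(token_lists) -> set[str]:
    """Unique terms across a collection of tokenized texts."""
    return {term for tokens in token_lists for term in tokens}

# Toy stand-ins for the tokenized Midjourney prompts and LAION captions.
midjourney_tokens = [["octane", "render", "volumetric", "lighting", "--ar"]]
laion_tokens = [["stock", "photo", "of", "studio", "lighting"]]

mj_vocab = vocabulary(midjourney_tokens)
oov_terms = mj_vocab - vocabulary(laion_tokens)  # prompt terms unseen in LAION
print(f"OOV rate: {len(oov_terms) / len(mj_vocab):.0%}")  # 80% on this toy data
```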
Received 7 February 2023; revised 15 March 2023; accepted 6 March 2023