Eryk Salvaggio
How to Read an AI Image: Toward a Media Studies
Methodology for the Analysis of Synthetic Images
2023
https://doi.org/10.25969/mediarep/22328
Background
Conceptual Framework
Technical Background
Figure 1: As Gaussian noise is introduced to the image, clusters remain around the
densest concentrations of pixel information; created with Stable Diffusion in February
2023
As the model works backward from noise, our prompts constrain the possible
pathways that the model is allowed to take. Prompted with “flowers”, the model
cannot use what it has learned about the breakdown of cat photographs. We
might constrain it further: “Flowers in the nighttime sky”. This introduces new
sets of constraints: “Flowers”, but also “night”, and “sky”. All of these words are
the result of datasets of image-caption pairs taken from the world wide web. CLIP
and LAION aggregate this information and then ignore the inputs. These images,
labeled by internet users, are assembled into categories, or categories are inferred
by the model based on their similarities to existing categories. All that remains
is data – itself a biased and constrained representation of the social consensus,
shaped by often arbitrary, often malicious, and almost always unconsidered
boundaries about what defines these categories.
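The forward process described above can be illustrated concretely. The sketch below is a simplification, not the training procedure of any particular model: it mixes an image with increasing amounts of Gaussian noise, approximating the progression shown in Figure 1. The file name and the noise levels are illustrative assumptions.

```python
# Minimal illustration of the forward (noising) half of diffusion:
# an image is progressively mixed with Gaussian noise until little of
# the original remains. A diffusion model learns to reverse this process.
# The file name and the noise levels below are illustrative only.
import numpy as np
from PIL import Image

image = np.asarray(Image.open("flowers.jpg"), dtype=np.float32) / 255.0

for step, noise_level in enumerate([0.1, 0.3, 0.6, 0.9]):
    noise = np.random.normal(0.0, 1.0, image.shape)
    # Blend the original pixels with pure noise; at high noise levels,
    # only the densest concentrations of pixel information stay visible.
    noised = np.sqrt(1.0 - noise_level) * image + np.sqrt(noise_level) * noise
    out = np.clip(noised, 0.0, 1.0) * 255.0
    Image.fromarray(out.astype(np.uint8)).save(f"noised_step_{step}.png")
```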
This paper proposes that when we look at AI images, specifically Diffusion
images, we are looking at infographics about these datasets, including their
categories, biases, and stereotypes. To read these images, we consider them rep-
resentations of the underlying data, visualizing an ‘internet consensus’. In the
resulting images, prompts produce abstractions of the data’s centralizing tendencies.
When images are more closely aligned to the abstract ideal of these stereotypes,
they are clean, ‘strong’ images. When images drift from this centralizing con-
sensus, they are more difficult to categorize. Therefore, images of certain catego-
ries may appear ‘weak’ – either occurring less often or with lower definition or
clarity.
These ideal ‘types’ are socially constructed and encoded by anyone who
uploads an image to the internet with a descriptive caption. For example, a ran-
dom sample of the training data associated with the phrase “Typical American”
within the LAION 5B dataset that drives Stable Diffusion suggests the images and
associations for “Typical American” as a category: images of flags, painted faces
from Independence Day events, as would be expected. Social stereotypes, related
to obesity and cowboy hats, are also prevalent. Curiously, one meme appears
multiple times, a man holding a Big Gulp from 7-11 (a kind of oversized soft
drink). Figure 2 is an image in response to the prompt “Typical American” in
which the man holds a large beverage container, like a Big Gulp, whilst wearing
face paint and a cowboy hat. We see that while the relationship between the data-
set and the images that Diffusion produces is not literal, these outcomes are
nonetheless connected to the concepts tied to this phrase within the dataset.
Archives are the stories of those who curate them, and Diffusion-generated
images are no different. They visualize the constraints of the prompt, as defined
by a dataset of human-generated captions that is assembled by CLIP or LAION’s
automated categorizations. I propose that these images are a visualization of this
archive. They struggle to show anything the archive does not contain or that is not
clearly categorized in accordance with the prompt. This suggests that we can read
images created by these systems. The next section proposes a methodology for
reading these images which blends media analysis and data auditing techniques.
As a case study, it presents DALL·E 2 generated images of people kissing.
Methodology
We need to know what elements are in the image in order to assess why they are there. In
Case Study 1 (fig. 3), the image portrays a heterosexual white couple. A reluc-
tant (?) male is being kissed by a woman. In this case, the man’s lips are protrud-
ing, which is rare compared to our sample. The man is also weakly represented:
his eyes and ears have notable distortions. In the following analysis of the image,
weak features thus refer to smudged, blurry, distorted, glitched, or otherwise
striking features of the image. Strong features represent aspects of the image that
are of high clarity, realistic, or at least realistically represented.
While this paper examines photographs, similar weak and strong presence
can be found in a variety of images produced through Diffusion systems in other
styles as well. For example, if oil paintings frequently depict houses, trees, or a
particular style of dress, it may be read as a strong feature, corresponding
to a strong presence of those aspects in the dataset. You may discover that
producing oil paintings in the style of 18th century European masters does not
generate images of black women. This would be a weak signal from the data, sug-
gesting that the referenced datasets of 18th century portraiture did not contain
portraits of black women (Note that these are hypotheticals and have not been
specifically verified).
The next step is to generate a larger sample of images using the same prompt or
model. I use a minimum of nine, because nine images
can be placed side by side and compared on a grid. For some examinations, I have
generated 18-27 or as many as 90-120. While creating this expanded sample set,
we would continue to look for any conceptually interesting images from the same
prompt. These images do not have to be notable in the same way that the initial
source image was. The image that fascinated, intrigued, or irritated us was inter-
esting for a reason. The priority is to understand that reason by understanding
the context – interpreting the patterns present across many similarly generat-
ed images. We will not yet have a coherent theory of what makes these images
notable. We are simply trying to understand the generative space that surrounds the
image of interest. This generative, or latent, space is where the data’s weaknesses
and strengths present themselves. Even a few samples will produce recognizable
patterns, after all.
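Generating such a sample can be scripted. The sketch below assumes the open-source Hugging Face diffusers library and a publicly available Stable Diffusion checkpoint; the checkpoint name and prompt are stand-ins, not the DALL·E 2 system used in the case study. It produces nine images from a single prompt and tiles them on a three-by-three grid for side-by-side comparison.

```python
# Sketch: generate nine images from one prompt and tile them on a 3x3 grid.
# Assumes the Hugging Face `diffusers` library and a GPU; the checkpoint
# name and the prompt are placeholders, not the system of the case study.
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "photograph of humans kissing"
# Nine images in one call; on limited memory this can be looped instead.
images = pipe(prompt, num_images_per_prompt=9).images

# Tile the nine outputs so patterns can be compared at a glance.
w, h = images[0].size
grid = Image.new("RGB", (3 * w, 3 * h))
for i, img in enumerate(images):
    grid.paste(img, ((i % 3) * w, (i // 3) * h))
grid.save("sample_grid.png")
```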
Now we can study the new set of images for patterns and similarities by applying
a form of content analysis. We describe what the image portrays ‘literally’ (the
denoted meaning). Are there particularly strong correlations between any of the
images? Look for certain compositions/arrangements, color schemes, lighting
effects, figures or poses, or other expressive elements that are strong across all
(or some meaningful subsections) of the sample pool. These indicate certain
biases in the source data. When patterns are present, we will call these signals.
Akin to symptoms, signals are observable elements of the image that point to
a common underlying cause. Strong signals suggest the frequency of a pattern
in the data; the strongest signals are near-universal and easily
dismissed as obvious. A strong signal would include tennis balls
being round, cats having fur, etc. A weak signal, on the other hand, suggests that
the image is on the periphery of the model’s central tendencies for the prompt.
The most obvious indicators of weak signals are images that simply cannot be
created realistically or with great detail. The smaller the number of examples in
a dataset, the fewer images the model may learn from, and the more errors will
be present in whatever it generates. These may be visible in blurred appearances,
such as smudges, glitches, or distortions. Weak signals may also be indicated
through a comparison of what patterns are present against what patterns might
otherwise be possible.
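The tallying itself can stay simple. The sketch below assumes a hand-coded spreadsheet in which each row is one generated image and each column records a yes/no judgment about a denoted feature; the file name, column names, and thresholds are hypothetical. Counting how often each feature appears across the pool is one rough way to separate near-universal ‘strong’ signals from rare or weak ones.

```python
# Sketch: tally hand-coded features across a sample of generated images.
# The CSV file and its column names are hypothetical; each row is one image,
# each feature column holds 1 (present) or 0 (absent).
import csv
from collections import Counter

counts: Counter = Counter()
total = 0
with open("coding_sheet.csv", newline="") as f:
    for row in csv.DictReader(f):
        total += 1
        for feature, value in row.items():
            if feature != "image_id" and value == "1":
                counts[feature] += 1

# Near-universal features read as strong signals; rare ones as weak.
for feature, n in counts.most_common():
    rate = n / total
    label = "strong" if rate > 0.8 else "weak" if rate < 0.2 else "mixed"
    print(f"{feature}: {n}/{total} ({label} signal)")
```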
Strong signals: In the given example, the images render skin textures quite
well. They seem professionally lit, with studio backgrounds. They are all close-
ups focused on the couple. Women tend to have protruding lips, while men tend
to have their mouths closed. These are therefore strong signals in the data,
suggesting an adjacency to central tendencies within the assigned category of
the prompt. These signals may not be consistent across all images, but are impor-
tant to recognize because they provide a contrast and context for what is weakly
represented.
Weak signals: In the case study, three important things are apparent to me. First,
most pictures are heteronormative, i.e., the images portray only man/woman
couples. The present test run, created in November 2022, differs from an earlier
test set (created in October 2022 and made public online, cf. Salvaggio 2022).
In the original test set, all couples were heterosexual. Second, there is a strong
presence of multiracial couples: another change from October 2022 when nearly
all couples shared skin tones. Third, they are missing convincing interpersonal
contact. This is, in fact, identical in both test sets from different months. The
strong signal across the kissing images might be a sense of hesitancy as if an invis-
ible barrier exists between the two partners in the image. The lips of the figures
are weak: inconsistent and imperfect. With an inventory of strong and weak pat-
terns, we can begin asking critical questions toward a hypothesis.
1. What data would need to be present to explain these strong signals?
2. What data would need to be absent to explain these weak signals?
Weaknesses in your images may be a result of sparse training data, training
biased toward exclusion, or reductive system interventions such as censorship.
Strengths may be the result of prevalence in your training data, or encouraged
by system interventions; DALL·E 2, for example, randomly introduces diversifying
keywords into prompts (cf. Offert/Phan 2022). Strengths may also reflect cohesion
between your prompt and the ‘central tendency’ of images in the dataset: if you
prompt “apple”, you may produce more consistent and realistic representations of
apples than if you request an “apple-car”. The more often some
feature is in the data, the more often it will be emphasized in the image. In sum-
mary, you can only see what’s in the data and you cannot see what is not in the
data. When something is strikingly wrong or unconvincing, or repeatedly impos-
sible to generate at all, that is an insight into the underlying model.
An additional case study could provide even more context. In 2019, while
studying the FFHQ dataset that was used to generate images of human faces
for StyleGAN, I noted that the faces of black women were consistently more
distorted than the faces of other races and genders. I asked the same question:
What data was present to make white faces so clear and photorealistic? What
data was absent to make black women’s faces so distorted and uncanny? I began
to formulate a hypothesis. In the case of black women’s faces being distorted, I
could hypothesize that black women were underrepresented in the dataset: that
this distortion was the result of a weak signal. In the case study of kissing cou-
ples, something else is missing. One hypothesis might be that the dataset used
by OpenAI does not contain many images of anyone kissing. That might explain
the awkwardness of the poses. I might also begin to inquire about the absence of
same-sex couples and conclude that LGBTQ couples were absent from the dataset.
While this conclusion is unlikely, we may use it as an example of how to test a
theory, or whatever you find in your own samples, in the next step.
Each image is the product of a dataset. To continue our research into interpreting
these images, it is helpful to address the following questions as specifically as
possible:
1. What is the dataset and where did it come from?
2. What can we verify about what is included in the dataset and what is excluded?
3. How was the dataset collected?
Often, the source of training data is identified in white papers associated with
any given model. There are tools being developed, such as Matt Dryhurst and
Holly Herndon’s Swarm, that can find source images associated with a given prompt
in some sets of training data (LAION). When training data is available, it can
confirm that we are interpreting the image-data relationship correctly. OpenAI
trained DALL·E 2 on hundreds of millions of images with associated captions. As
of this writing, the data used in DALL·E 2 is proprietary, and outsiders do not have
access to those images. In other cases, the underlying training dataset is open
source, and a researcher can see what training material they draw from. For the
sake of this exercise, we’ll look through the LAION dataset, which is used for the
diffusion engines Stable Diffusion and Midjourney. When we look at the images
that LAION uses for “Photograph of humans kissing”, we can see that the training
data for this prompt in that library consists mostly of stock photographs where
actors are posed for a kiss, suggesting a model trained on images that display
little genuine emotion or romantic connection. For GAN models, which
produce variations on specific categories of images (for example, faces, cats, or
cars), many rely on open training datasets containing merely thousands of imag-
es. Researchers may download portions of them and examine a proportionate
sample. This becomes harder as datasets grow exponentially
larger. For examining race and face quality through StyleGAN, I downloaded
the training data – the FFHQ dataset – and randomly examined a sub-portion of
training images to look for racialized patterns. This confirmed that the propor-
tion of white faces far outweighed faces of color.
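Where an index of the training data is publicly searchable, this check can be scripted rather than performed through a web interface. The sketch below uses the open-source clip-retrieval client to query a hosted LAION index for captions and image URLs matching a phrase; the service URL, index name, and result fields follow that project’s published examples and may have changed, so they should be treated as assumptions.

```python
# Sketch: query a hosted LAION index for images whose captions match a phrase.
# Assumes the open-source `clip-retrieval` client; the service URL and index
# name follow that project's published examples and may no longer be current.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",
    indice_name="laion5B-L-14",
    num_images=40,
)

results = client.query(text="photograph of humans kissing")
for r in results:
    # Each result carries the caption and source URL of a training image,
    # which can then be inspected by hand for stock-photography patterns.
    print(r.get("caption"), r.get("url"))
```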
While we do not have training data for DALL·E 2, we can make certain inferenc-
es by examining other large datasets. For example, we might test the likelihood
of a hypothesis that the dominance of heterosexual couples in stock photography
contributes to the relative absence of LGBTQ subjects in the images. This would
explain the presence of heterosexual couples (a strong signal from the dataset) and
the absence of LGBTQ couples that occurred in our earlier tests from 2022. How-
ever, LAION’s images found for the prompt query “kissing” are almost exclusively
pictures of women kissing. While DALL·E 2’s training data remains in a black box,
we now have at least some sense of what a large training set might look like and
can recalibrate the hypothesis. The massive presence of women kissing women in
the dataset suggests that the weak pattern is probably not a result of sparse train-
ing data or a bias in data. We would instead conclude that the bias runs the other
way: if the training data is overwhelmed with images of women kissing, then the
outcomes of the prompt should also be biased toward women kissing. Even in the
October 2022 sample, however, women kissing women seemed to be rare in the
generated output.
This suggests we need to look for interventions. An intervention is a system-lev-
el design choice, such as a content filter, which prevents the generation of certain
images. Here we do have data even for DALL·E 2 that can inform this conclusion.
‘Pornographic’ images were explicitly removed from OpenAI’s dataset to ensure
it does not reproduce similar content. Other datasets, such as LAION, contain vast
amounts of explicit and violent material (cf. Birhane 2021). By contrast, OpenAI
deployed a system-level intervention into their dataset:
We conducted an internal audit of our filtering of sexual content to see if it concentrated or
exacerbated any particular biases in the training data. We found that our initial approach
to filtering of sexual content reduced the quantity of generated images of women in gen-
eral, and we made adjustments to our filtering approach as a result (OpenAI 2022: n.pag.).
Requests to DALL·E 2 are hence restricted to what OpenAI calls ‘G-rated’ con-
tent, referring to the motion picture rating for determining age appropriateness.
Figure 5: First page of screen results from a search of LAION training data associated
with the word “Kissing” indicates a strong bias toward images of women kissing. Screen
grab from haveibeentrained.com [Accessed March 22, 2023]
Interventions occur on two levels. First, through dataset curation: what is collected
and later trained on. Second, through system-level affordances and/or interventions:
what can and cannot be produced or requested.
We now have a hypothesis for understanding our original image. We may propose
that the content filter excludes women kissing women from the training data as
a form of ‘explicit’ content. We deduce this because women kissing is flagged as
explicit content on the output side, suggesting an ideological, cultural, or social
bias against gay women. This bias is evidenced in at least one content moderation
decision (banning their generation) and may be present in decisions about what
is and is not included in the training data. The strangeness of the pose in the ini-
tial image, and of others showing couples kissing, may also be a result of content
restrictions in the training data that reflect OpenAI’s bias toward, and selection
for, G-rated content. How was ‘G-rated’ defined, however, and how was the data
parsed from one category to another? Human, not machinic, editorial process-
es were likely involved. Including more ‘explicit’ images in the training data
would likely not solve this problem and might create new ones: pornographic content
would introduce additional distortions. But in a move to exclude explicit content,
the system has also filtered out women kissing women, resulting in a series of
images that recreate dominant social expectations of relationships and kisses as
‘normal’ between men and women.
Returning to the target image, we may ask: What do we see in it that makes
sense compared to what we have learned or inferred? What was encoded into the
image through data and decisions? How can we make sense of the information
encoded into this image by the data that produced it? With a few theories in
mind, I would run the experiment again: this time, rather than selecting images
for the patterns they shared with the notable image, use any images generated
from the prompt. Are the same patterns replicated across these images? How
many of these images support the theory? How many images challenge or com-
plicate the theory? Looking at the broader range of generated images, we can
see if our observations apply consistently – or consistently enough – to make
a confident assertion. Crucially, the presence of ‘successful’ images does not
undermine the claim that weak images reveal weaknesses in data. Every image
is a statistical product: odds are weighted toward certain outcomes. When
outcomes fail, that failure offers insight into gaps, strengths, and
weaknesses of those weights. A given subject may occasionally, or even predominantly, be
rendered well. What matters to us is what the failures suggest about the underly-
ing data. Likewise, conducting new searches across time can be a useful means of
tracking evolutions, acknowledgments, and calibrations for recognized biases.
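Such tracking across time lends itself to the same simple tallying used earlier. The sketch below compares two hand-coded batches of outputs generated on different dates and reports how the frequency of each coded feature shifts; the file names and the coding scheme are hypothetical stand-ins for whatever features a given study records.

```python
# Sketch: compare coded feature frequencies between two batches of outputs
# generated at different times. File names and features are hypothetical.
import csv
from collections import Counter

def feature_rates(path: str) -> dict:
    """Return the share of images in a coding sheet exhibiting each feature."""
    counts, total = Counter(), 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            for feature, value in row.items():
                if feature != "image_id" and value == "1":
                    counts[feature] += 1
    return {feature: n / total for feature, n in counts.items()}

october = feature_rates("batch_2022_10.csv")
november = feature_rates("batch_2022_11.csv")

# Report how each coded feature's frequency changed between the two batches.
for feature in sorted(set(october) | set(november)):
    before, after = october.get(feature, 0.0), november.get(feature, 0.0)
    print(f"{feature}: {before:.0%} -> {after:.0%}")
```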
Each of these hypotheses warrants a deeper analysis than the scope of this paper
would allow. The goal of this paper was to present a methodology toward the
analysis of generative images produced by Diffusion-based models. Our case
study suggests that examples of cultural, social, and economic values are embed-
ded into the dataset. This approach, combined with more established forms
of critical image analysis, can give us ways to read the images as infographics.
The method is meant to generate insights and questions for further inquiry,
rather than producing statistical claims, though one could design research for
quantifying the resulting claims or hypotheses. The method has succeeded in
generating strong claims for further investigations interrogating the underly-
ing weaknesses of image generation models. This includes the absence of black
women in training datasets for StyleGAN, and now, the exclusion of gay women
in DALL·E 2’s output. Ideally, these insights and techniques move us away from
the ‘magic spell’ of spectacle that these images are so often granted, and to provide
a deeper literacy into where these images are drawn from. Identi-
fying the widespread use of stock photography, and what that means about the
system’s limited understanding of human relationships, emotional and physical
connections, is another pathway for critical analysis and interpretations.
The method is meant to move us further from the illusion of ‘neutral’ and
unbiased technologies which is still prevalent in the discourse around these
tools. We often see AI systems deployed as if they are free of human biases – the
Edmonton police (Canada) recently issued a wanted poster including an AI-gen-
erated image of a suspect based on his DNA (cf. Xiang 2022). That’s pure mystifi-
cation. They are bias engines. Every image should be read as a map of those biases,
and they are made more legible using this approach. For artists and the general
public creating AI-images, it also points to a strategy for revealing these prob-
lems. One constraint of this approach is that models can change at any given
time. It is obvious that OpenAI could recalibrate their DALL·E 2 model to include
images of women kissing tomorrow. However, when models calibrate for bias
on the user end it does not erase the presence of that bias. Models form abstrac-
tions of categories based on the corpus of the images they analyze. Removing
access to those images, on the user’s end, does not remove their contribution to
that abstraction. The results of early, uncalibrated outcomes are still useful in
analyzing contemporary and future outputs. Generating samples over time also
presents opportunities for another methodology, tracking the evolution (or lack
thereof) for a system’s stereotypes in response to social changes. Media studies
may benefit from the study of models that adapt or continuously update their
underlying training images or that adjust their system interventions.
Likewise, this approach has limits. One critique is that researchers cannot
simply look at training data that is not accessible. As these models move away
from research contexts and toward technology companies seeking to make a
profit from them, proprietary models are likely to be more protected, akin to
trade secrets. We are left making informed inferences about DALL·E 2’s proprie-
tary dataset by referencing datasets of a comparable size and time frame, such
as LAION 5B. Even when we can find the underlying data, researchers may use
this method only as a starting point for analysis. It raises the question of where
to begin when there are billions of images in a dataset. The method marks
only a starting point for examining the underlying training structures at the
site where audiences encounter the products of that dataset, which is the AI-pro-
duced image.
Thanks to Valentine Kozin and Lukas R.A. Wilde for feedback on an early draft of this essay.
Bibliography