
Generative Models - Design, Calibration and Evaluation

Generative and Graphical Models AI60201, Module 3

Adway Mitra

Indian Institute of Technology Kharagpur

19 October 2022

Contents

1 Background

2 Deep Autoregressive Models

3 Variational Autoencoder (VAE)
      Training a VAE

4 Deep Generative Models

5 Normalizing Flows

6 Denoising Diffusion Models

7 Generative Adversarial Networks
      Variants of GAN

8 Evaluating Generative Models


Background


Generation of Complex Structures

Aim of a generative model: to synthesize new data-points
Data-points may be complex structures like images, videos, speech, etc.
They should be similar, but not identical, to the examples already present in a dataset
For example, if a dataset has 1000 face images from 10 different persons, the generative model should produce a face image, but not necessarily of any of these persons
Conditional generation: the generated datapoint should correspond to an input from the user
Example: the user provides a text caption, according to which an image is generated


Image Generation

An image is basically an M × N matrix, where each value (pixel) lies between 0 and 255
Images are highly complex structures with spatial properties
Deep Neural Networks, especially Convolutional Neural Networks, are suitable for representing images
Deep Neural Networks like U-Net and autoencoders can produce an image as output from a low-dimensional input
In generative models, the low-dimensional input can be sampled from a prior
The low-dimensional input, or the prior, can encode the condition specified by the user
A likelihood-based generative model defines a distribution over the space of all images
The desired images (e.g. visually meaningful images containing some specified object) should have high probability

Classification of Generative Models

Likelihood-based (define a probability distribution over image space)
    Fully observed (no latent variable): Autoregressive Models
    Latent variable based: Variational Autoencoders, Normalizing Flows
Without likelihood (no such distribution is defined): Generative Adversarial Networks (and variants)


Deep Autoregressive Models


Autoregressive Models

Generate each pixel sequentially, conditioned on those already generated
The image is scanned, pixel by pixel, in a fixed order (row-wise or column-wise)
For each pixel, a probability distribution is defined over the possible values [0, 255], parameterized by the values of the previous pixels (according to the sequential order)
Given an image X, its likelihood is $p(X; \theta) = \prod_{i,j} p(X_{i,j} \mid X_{1,1}, \ldots, X_{i,j-1})$
Drawback: the scan order is artificial, and may not carry enough information to insert new objects in the image
Good for evaluating the likelihood of a given image, less suitable for sampling/synthesizing new images
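To make the chain-rule factorization concrete, here is a minimal sketch on a toy 2×2 binary image; the conditional distribution (conditional_prob) is a hand-coded stand-in for illustration, not a trained model:

```python
import numpy as np

# Toy conditional p(X_ij = 1 | previous pixels); the functional form is an
# assumption for illustration only.
def conditional_prob(prev_pixels):
    if len(prev_pixels) == 0:
        return 0.5
    return 0.25 + 0.5 * np.mean(prev_pixels)   # always within (0, 1)

def log_likelihood(image):
    # log p(X) = sum_{i,j} log p(X_ij | X_11, ..., X_i,j-1), scanned row-wise
    flat = image.flatten()
    ll = 0.0
    for k, x in enumerate(flat):
        p1 = conditional_prob(flat[:k])
        ll += np.log(p1 if x == 1 else 1.0 - p1)
    return ll

x = np.array([[1, 0], [1, 1]])
print(log_likelihood(x))   # log-likelihood under the toy model
```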


PixelRNN

Each pixel's distribution is parameterized by a neural network
The sequence of pixel values $(X_{1,1}, \ldots, X_{i,j-1})$ in scan order is the input to a Recurrent Neural Network

Figure: Pixel Recurrent Neural Network


PixelCNN

Faster but less accurate than PixelRNN
For each pixel, focus only on the pixels within a receptive field
Two filters for the horizontal and vertical neighboring pixels

Figure: Pixel Convolutional Neural Network
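A minimal sketch of the masking idea behind such models: a PixelCNN-style causal mask for a single 3×3 filter (the helper name causal_mask is illustrative), zeroing every entry at and after the centre pixel in raster-scan order so that a convolution only sees already-generated pixels:

```python
import numpy as np

def causal_mask(k=3, include_centre=False):
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c:] = 0        # centre pixel and everything to its right
    mask[c + 1:, :] = 0    # all rows below the centre
    if include_centre:
        mask[c, c] = 1     # later layers may also see the current pixel
    return mask

print(causal_mask())
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```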


Variational Autoencoder (VAE)


Latent Variable Formulation

Start with an image seed Z, generated from a prior Z ∼ P
It is then passed through a neural network $g_\theta$ which produces an image as output
g specifies the neural network architecture, θ its parameters
Add random noise to each pixel independently
$Z \sim \mathcal{N}(0, 1)$, $X \sim \mathcal{N}(g_\theta(Z), \sigma^2 I_{m \times n})$
Probability of any image: $p(X) = \int p(X|Z)\, p(Z)\, dZ$
Aim: estimate the neural network parameters θ by maximizing the probability of the desired images (training)
Problem 1: the marginal likelihood p(X) cannot be calculated analytically, due to the presence of $g_\theta$
Problem 2: $p(X_i, Z_i)$ is easy to calculate, but we do not know $Z_i$ for each training image $X_i$


Variational Inference Network

For each training image $X_i$, estimate the corresponding $Z_i$
$p(Z_i|X_i)$ cannot be calculated analytically, since p(X) cannot be
Alternative: variational inference, with q(Z|X) approximating p(Z|X)
$q(Z|X) \sim \mathcal{N}(\mu(X_i), \Sigma(X_i))$
$\{\mu(X_i), \Sigma(X_i)\} = h_\phi(X_i)$, where h is another neural network with parameters ϕ
$h_\phi$ is called the encoder, while $g_\theta$ is called the decoder
The parameters θ, ϕ are estimated simultaneously
The loss function involves both $X_i$ and $Z_i$
$X_i$ must have high probability according to $p_{\theta^*}$, while $q_\phi(Z_i|X_i)$ should be close to $p(Z_i)$

Training a VAE

Objective Function

$\theta^* = \arg\max_{\theta,\phi} \prod_{i=1}^{N} p(X_i \mid g_\theta(Z_i))$, where $Z_i \sim \mathcal{N}(h_\phi(X_i))$
The neural network parameters θ, ϕ may be estimated by backpropagation, after defining a suitable loss function
$\ell(\theta, \phi) = \sum_{i=1}^{N} \|X_i - g_\theta(Z_i)\| + \sum_{i=1}^{N} KL(\mathcal{N}(h_\phi(X_i)) \,\|\, \mathcal{N}(0, 1))$
Equivalent to maximizing the evidence lower bound (ELBO)
Backpropagation cannot be applied directly, as it includes sampling $Z_i$ as an intermediate step
Solution: decouple the sampling from the backpropagation
Reparameterization Trick: $\epsilon_i \sim \mathcal{N}(0, 1)$ sampled independently, $Z_i = \mu_\phi(X_i) + \epsilon_i \sigma_\phi(X_i)$
Now backpropagation can be applied to estimate θ, ϕ
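A minimal PyTorch sketch of the reparameterization trick and the resulting loss, assuming a diagonal-Gaussian encoder whose outputs are mu and logvar (all names are illustrative; the encoder and decoder networks are left abstract):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    eps = torch.randn_like(mu)                   # eps ~ N(0, I), sampled outside the graph
    return mu + eps * torch.exp(0.5 * logvar)    # Z = mu + eps * sigma, differentiable

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")    # ||X - g_theta(Z)||^2
    # KL( N(mu, sigma^2) || N(0, I) ), closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# gradients flow through the sampling step back to mu and logvar
mu = torch.zeros(4, 8, requires_grad=True)       # batch of 4, latent dim 8
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
print(mu.grad.shape, logvar.grad.shape)          # torch.Size([4, 8]) twice
```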


VAE Graphical Model


Synthesizing New Images

Once θ∗ is estimated, new images can be generated by first sampling Z from the prior and then running it through the decoder network $g_{\theta^*}$
Using the parameters θ∗ ensures that the produced image will be similar to the images used for training (e.g. if the training set had handwritten digit images, only such images will be synthesized)
Since Z is generated independently for each image, no two images are going to be identical
The differences may be manifested in any attribute, or set of attributes, that can be encoded by a single real number like Z
We can use the inference network to carry out a post-processing analysis of Z, to find which attribute(s) it represents


Supervised VAE

The datapoints used for training may be accompanied by class labels $(X_i, Y_i)$
The generator/decoder network should be sensitive to the class label, i.e. $g_\theta(Y_i, Z_i)$
Similarly, the inference/encoder network should be able to predict the probability distribution $q(Y_i, Z_i|X_i)$
The training process remains unchanged
The loss function will include a cross-entropy loss for $Y_i$ (between the encoder's prediction and the actual label)
For new image generation, the user specifies $Y_i$, and the image is generated accordingly


VAE Variants Graphical Model


Interpretation

The latent variable represents some attribute, or combination of attributes, of the image
However, these attributes are not specified by the user
They can be stylistic (e.g. thickness, level of blurring) or semantic (e.g. complexion, gender)
Basically, Z captures the most prominent directions of variation in the dataset (a rough analogy with non-linear PCA)
The nature of Z (real/discrete/binary) determines how much variation can be accommodated by it
It may be possible to include more variations with more latent variables!


Examples of VAE images


Generating new images via the latent variable is effectively an interpolation of the training images
Hence, VAE images often look blurry, as the RMSE loss fails to retain the sharpness of edges

Figure: Images generated by VAE



Deep Generative Models


Deep Latent Variables

We may want several latent variables to capture the variability of the training dataset
The variables can be arranged in layers to indicate hierarchical attributes (e.g. gender and ethnicity may be at a higher level of variability than complexion)
Assumption: one layer of latent variables is connected with another layer


Deep Boltzmann Machine

An undirected graphical model over layers of variables $Z^L, Z^{L-1}, \ldots, Z^2, Z^1, Z^0$, where $Z^0 = X$ (observations)
Each layer of latent variables $Z^l$ is connected to its neighboring layers only ($Z^{l-1}$ and $Z^{l+1}$)
Each variable i in layer l ($Z_i^l$) is connected to each variable j in layer (l − 1) through $W_{ij}^l$
The joint distribution of all variables is represented as a product of edge potential functions
$\phi(Z_i^l, Z_j^{l-1}) = \exp(-W_{ij}^l Z_i^l Z_j^{l-1})$
$p(Z) \propto \exp\left(-\sum_{l=1}^{L} (Z^{l-1})^T W^l Z^l\right)$
The last layer, connecting $Z^1$ with $Z^0 = X$, may be represented by a more complex neural network like $g_\theta$ (especially if X is a complex object like an image)


Inference in DBM

We may be interested in finding the meaning of every latent variable (i.e. which attribute it represents)
Approach: carry out inference over all latent variables to compute the posteriors $p(Z_i^l = 1 \mid X)$
These cannot be estimated directly, so we can use Gibbs Sampling
According to the D-separation rules, each variable in layer l is independent of all variables except those in the neighboring layers
For simplicity, assume all variables are binary
$p(Z_i^l = 1 \mid Z_{-i}) = \sigma\left(-(Z^{l-1})^T W_i^l - W_i^{l+1} Z^{l+1}\right)$, where σ denotes the sigmoid function
We sample each variable in one iteration, keeping the rest constant
We repeat this for many iterations, collecting samples of each variable regularly. The posteriors are estimated from these samples
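A minimal sketch of one Gibbs sweep over the middle layer of a toy 3-layer binary DBM, following the slide's sign convention for the potentials; the sizes and random weights are arbitrary placeholders:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n0, n1, n2 = 6, 4, 3
W1 = rng.normal(scale=0.1, size=(n0, n1))    # couples Z^0 (visible) and Z^1
W2 = rng.normal(scale=0.1, size=(n1, n2))    # couples Z^1 and Z^2
z0 = rng.integers(0, 2, n0)                  # clamped observation X
z1 = rng.integers(0, 2, n1)
z2 = rng.integers(0, 2, n2)

for i in range(n1):
    # the conditional of Z^1_i depends only on the neighbouring layers Z^0, Z^2
    p = sigmoid(-(z0 @ W1[:, i]) - (W2[i, :] @ z2))
    z1[i] = rng.random() < p
print(z1)
```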


Training of DBM by Contrastive Divergence

Training a DBM involves estimating the parameters W
Maximum Likelihood: $W^* = \arg\max_W \sum_{Z^1, \ldots, Z^L} p(X, Z)$
Difficult to compute, because there are too many combinatorial configurations to sum over
Alternative approach: through sampling
Starting from X, sample values of $Z^1, \ldots, Z^L$; let these values be $\{\hat{Z}_l\}_{l=0}^{L}$
Starting from $Z^L$, sample values of $Z^{L-1}, \ldots, Z^1, Z^0$; call these values $\{Z_l^*\}_{l=0}^{L}$
Sampling is done from the conditional distributions mentioned earlier
At every layer l, update $W^l$ according to $\Delta W^l = \hat{Z}_{l-1} \hat{Z}_l^T - Z_{l-1}^* (Z_l^*)^T$
$W^l = W^l - \alpha \Delta W^l$, where ∆W is called the contrastive divergence


DBM Training Algorithm


Sampling from Deep Generative Model

There is no direct way to sample from the joint distribution p
Two samples can be compared using the marginal likelihood p(X)
Start with an initial sample $x^0$. Perturb it by adding random noise, and accept the new sample if it has higher likelihood
$x^{t+1} = x^t + z^t$ where $z^t \sim \mathcal{N}(0, \sigma)$
Accept $x^{t+1}$ if $p(x^{t+1}) \geq p(x^t)$
If the gradient of p at x can be calculated, we can use Langevin Dynamics for sampling
Initialize $x^0$ randomly, then navigate $x^{t+1} = x^t + \epsilon \nabla_x \log p(x^t) + \sqrt{2\epsilon}\, z_t$ where $z_t \sim \mathcal{N}(0, 1)$
As $t \to \infty$ and $\epsilon \to 0$, it can be shown that $x^t \sim p(x)$
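A minimal sketch of Langevin dynamics on a 1-D standard normal target, where $\nabla_x \log p(x) = -x$ is available in closed form; the step size and iteration counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(x):
    return -x                # for the target p = N(0, 1)

eps = 1e-3
x = rng.normal()
samples = []
for t in range(50_000):
    x = x + eps * grad_log_p(x) + np.sqrt(2 * eps) * rng.normal()
    if t > 10_000:           # discard burn-in
        samples.append(x)
print(np.mean(samples), np.var(samples))   # should approach 0 and 1
```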


Normalizing Flows


Inversion Model

We want a generative model where parameter estimation, sampling and inference are all easy
In addition, there should be many latent variables to support the variations in the training dataset
Possible solution: invertible latent variable models!
$Z \sim \pi$, $X = g_\theta(Z)$, $Z = h(X)$ where $h = g_\theta^{-1}$
Some special neural networks are invertible


Inversion Model

Inversion formula: helps us define the distribution of the transformed variable X in terms of the original latent variable Z
$p_X(x) = p_Z(h(x)) \, |h'(x)|$, where h is the inverse function
In case h′(x) is a matrix, |·| denotes the determinant
Example: if X = AZ where A is an invertible matrix, then $h(X) = A^{-1}X$ and $h'(X) = A^{-1}$
So, according to the formula, $p_X(x) = p_Z(A^{-1}x) \, \frac{1}{|\det(A)|}$, where $p_Z$ is the prior distribution on the latent variable Z
This becomes more complex if we consider non-linear transformations


Flow of Transformations

Let us consider a sequence of latent variables $Z^0, Z^1, \ldots, Z^L$, as in a deep generative model
Flow of transformations: $Z^L \sim \pi$, $Z^{L-1} = f_L(Z^L)$, …, $Z^1 = f_2(Z^2)$, $X = f_1(Z^1)$
Essentially a composite function $X = f(Z^L) = f_1(f_2(\ldots f_L(Z^L)))$
Each of these functions is invertible (deterministic or probabilistic)
Then the output distribution is $p_X(x; \theta) = p_Z(f^{-1}(x)) \prod_{l=1}^{L} \left| \det\left( \frac{\partial (f_\theta^l)^{-1}(Z^l)}{\partial Z^l} \right) \right|$
Here θ is the set of parameters of all the invertible transformations
This allows us to define a closed-form distribution over the output variable X (we could not evaluate it in the case of the DBM, due to the partition function)


Learning and Inference Problem

Sampling from flow models is easy (just follow the transformations in sequence)
The main problem: estimating the parameters θ
$\theta^* = \arg\max_\theta \log p(X; \theta) = \arg\max_\theta \sum_i \left( \log p_Z(f^{-1}(X_i)) + \sum_{l=1}^{L} \log \left| \det\left( \frac{\partial (f_\theta^l)^{-1}(X)}{\partial X} \right) \right|_{X=X_i} \right)$
Inference: we need to apply the inversion formula to estimate Z from X
In the case of flows, we need to compute the Jacobian matrix as the derivative
Its (i, j)-th entry is $\frac{\partial h_i}{\partial z_j}$, where $h = f^{-1}$
In the case of deep models with L latent variables, it is expensive to calculate the determinant of an L × L Jacobian
Solution: the Jacobian should be triangular, e.g. $h_L$ involves just $z_L$, $h_l$ involves $\{z_L, \ldots, z_l\}$, etc.


Examples of Flow Models

NICE: Non-linear Independent Components Estimation
L latent variables and L observations
$X_{1:l} = s_{1:l} \odot Z_{1:l}$, $X_{l+1:L} = Z_{l+1:L} + g_\theta(Z_{1:l})$
Alternative formulation: $X_{l+1:L} = Z_{l+1:L} \odot \exp(-h_\theta(Z_{1:l})) + g_\theta(Z_{1:l})$
Here $s = \{s_1, \ldots, s_l\}$ is a set of scaling factors, and $g_\theta$, $h_\theta$ are non-linear functions that may be represented by neural networks
Inverse mapping: $Z_{1:l} = X_{1:l} \odot \frac{1}{s_{1:l}}$, $Z_{l+1:L} = X_{l+1:L} - g_\theta\left(X_{1:l} \odot \frac{1}{s_{1:l}}\right)$
The Jacobian here is upper triangular by design
Can be trained on image datasets to generate new images
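A minimal sketch of an additive coupling layer of this kind, with a fixed split point and a stand-in coupling function g (not a trained network); the inverse is exact, and for this additive layer the Jacobian is triangular with unit diagonal:

```python
import numpy as np

def g(z):                      # illustrative coupling function
    return np.tanh(z)

def forward(z, l):
    x = z.copy()
    x[l:] = z[l:] + g(z[:l])   # identity on the first block, shift on the rest
    return x

def inverse(x, l):
    z = x.copy()
    z[l:] = x[l:] - g(x[:l])   # exactly invertible, so log |det J| = 0 here
    return z

z = np.random.default_rng(0).normal(size=6)
assert np.allclose(inverse(forward(z, 3), 3), z)
print("invertibility check passed")
```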


Examples of Flow Models

It is possible to build an invertible neural network using specific invertible units
MintNet uses masked convolutions, which are invertible, unlike normal convolutions
A masked convolution causes the input elements (e.g. image pixels) to have a sequential generative structure, and the corresponding Jacobian is triangular
In a masked convolution filter, the receptive field is restricted by setting some elements to 0


Denoising Diffusion Models


Introduction

We start with any valid sample $x_0$, e.g. an image from a reference dataset
We add some random noise to it, e.g. IID Gaussian noise, to get a noisy version $x_1$ of $x_0$, i.e. $x_1 = x_0 + e_0$ where $e_0 \sim q(x_0)$
It is possible to recover $x_0$ from $x_1$ by adding some "counter-noise", i.e. $x_0 = x_1 + f_1$ where $f_1 \sim p(x_1)$!
It may be possible to analytically calculate $p(x_1)$ from $q(x_0)$
But if we keep on adding noise to the samples, to get $\{x_0, x_1, x_2, \ldots, x_T\}$, can we recover $x_0$ from $x_T$?
We need to predict the noise at each step!


Forward Diffusion Process


Consider $x_0$ as an m × n image, obtained from the data distribution as $q(x_0)$
Simplest model of corruption: scale each pixel value and add Gaussian noise, as $(x_1)_{ij} = a_0 (x_0)_{ij} + (e_0)_{ij}$ where $(e_0)_{ij} \sim q((x_0)_{ij})$
Generalize: $(x_{t+1})_{ij} = a_t (x_t)_{ij} + (e_t)_{ij}$ where $(e_t)_{ij} \sim q((x_t)_{ij})$
Specifically, set $q(x_t) = \mathcal{N}(0, \beta_t)$ and $a_t = \sqrt{1 - \beta_t}$, i.e. $x_{t+1} \sim \mathcal{N}(\sqrt{1 - \beta_t}\, x_t, \beta_t I)$
It can be shown that $q(x_t | x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I)$ where $\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$
With a specific choice of parameters $\{\beta_0, \beta_1, \ldots\}$ (called the schedule), we have $x_T \sim \mathcal{N}(0, I)$, i.e. we keep adding Gaussian noise to the original sample till we are left with standard Gaussian noise!
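A minimal sketch of sampling from the closed-form $q(x_t | x_0)$, using an assumed linear schedule and a toy "image" (both are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # an assumed linear schedule
alphas = np.cumprod(1.0 - betas)          # alpha_t = prod_s (1 - beta_s)

def q_sample(x0, t):
    e = rng.normal(size=x0.shape)          # e ~ N(0, I)
    return np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * e

x0 = rng.uniform(-1, 1, size=(8, 8))       # a toy "image"
# early steps stay close to x0; by t = T the sample is nearly pure noise
print(np.std(q_sample(x0, 10)), np.std(q_sample(x0, T - 1)))
```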


Denoising Diffusion
Is the reverse process possible, i.e. can we obtain $x_0$ from $x_T$?
If so, it opens up a generative model where we sample the code $Z = x_T$ from $\mathcal{N}(0, I)$ and denoise it to get $x_0$, which is a valid sample following the data distribution $q(x_0)$!
This is possible only if we can find $q(x_{t-1} | x_t)$, which is generally intractable!
Solution: approximate it using $p_\theta(x_{t-1} | x_t) = \mathcal{N}(u_\theta(x_t, t), \sigma_t^2 I)$!
Here $u_\theta$ is an unknown function, represented by a neural network with parameters θ


Learning the Denoising Model

Role of $u_\theta$: to predict the noise $e_{t-1}$ based on $x_t$, so that we can estimate $x_{t-1}$, and finally reach $x_0$!
Clearly we want $p_\theta(x_{t-1} | x_t)$ to approximate $q(x_{t-1} | x_t, x_0)$
Another aim: choose $p_\theta$ to maximize the likelihood of the training samples, i.e. $\hat{\theta} = \arg\max_\theta \mathbb{E}_{x_0 \sim q}[\log p_\theta(x_0)]$
We can design a suitable loss function based on the above objectives, using variational approximations
Ho et al (NeurIPS 2020) propose $u_\theta(x_t, t) = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}} e_\theta(x_t, t) \right)$, where $e_\theta(x_t, t)$ is the noise-prediction part
Important to remember: $q(x_t | x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I)$, i.e. $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, e$ where $e \sim \mathcal{N}(0, 1)$


Training Denoising Diffusion Model

The noise prediction model $e_\theta(x_t, t)$ should predict the Gaussian noise e, where $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, e$
We need to estimate the parameters θ by minimizing the loss function $\|e - e_\theta(x_t, t)\|^2$ using Gradient Descent, as follows:
1 Choose a schedule $\beta_0, \beta_1, \ldots$
2 Choose an architecture $e_\theta$, initialize θ
3 Sample a training image $x_0 \sim q(x_0)$ from the training distribution
4 Sample a diffusion step $t \sim U(1, T)$ and noise $e \sim \mathcal{N}(0, 1)$
5 Calculate $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, e$
6 Calculate the gradient $\nabla_\theta \|e - e_\theta(x_t, t)\|^2$ and update θ by gradient descent
7 Repeat steps 3–6 till convergence
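A minimal PyTorch sketch of one such training step on toy flattened "images"; the tiny linear e_theta, the batch size and the schedule are placeholders (in practice $e_\theta$ is a U-Net taking $(x_t, t)$):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = torch.cumprod(1.0 - betas, dim=0)       # the slide's alpha_t

e_theta = torch.nn.Linear(64 + 1, 64)            # stand-in noise predictor
opt = torch.optim.Adam(e_theta.parameters(), lr=1e-3)

x0 = torch.rand(32, 64)                          # batch of flattened toy images
t = torch.randint(0, T, (32,))                   # t ~ U(1, T)
e = torch.randn_like(x0)                         # e ~ N(0, I)
a = alphas[t].unsqueeze(1)
xt = a.sqrt() * x0 + (1.0 - a).sqrt() * e        # x_t in closed form

pred = e_theta(torch.cat([xt, t.float().unsqueeze(1) / T], dim=1))
loss = ((e - pred) ** 2).mean()                  # ||e - e_theta(x_t, t)||^2
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```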


Noise Prediction Model


We need to choose a suitable neural architecture for $e_\theta$ to predict the Gaussian noise from $(x_t, t)$
t is a parameter to the noise prediction network, so that the same model $e_\theta$ can be used for all diffusion stages $x_1, x_2, \ldots, x_T$


Generation of new samples

Once the noise prediction model $e_\theta$ is ready, so is the denoising model $u_\theta(x_t, t)$!
We can now generate the code $Z = x_T$ from $\mathcal{N}(0, I)$, and go on sampling $x_{t-1} \sim \mathcal{N}(u_\theta(x_t, t), \sigma_t^2 I)$
Finally we should get $x_0$, which will follow $q(x_0)$, i.e. the data distribution
1 Sample the code $x_T \sim \mathcal{N}(0, I)$
2 For t = T to 1:
3     Sample $z \sim \mathcal{N}(0, I)$, calculate $x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}} e_\theta(x_t, t) \right) + \sigma_t z$
4 End for
5 Return $x_0$
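A minimal sketch of this reverse loop as a function, written to reuse the schedule and the stand-in noise predictor from the training sketch above; $\sigma_t^2 = \beta_t$ is one common choice and is assumed here:

```python
import torch

@torch.no_grad()
def sample(e_theta, betas, alphas, T, shape=(1, 64)):
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        tt = torch.full((shape[0], 1), t / T)
        eps = e_theta(torch.cat([x, tt], dim=1))
        x = (x - betas[t] / (1.0 - alphas[t]).sqrt() * eps) / (1.0 - betas[t]).sqrt()
        if t > 0:                                # no noise added at the final step
            x = x + betas[t].sqrt() * torch.randn_like(x)   # sigma_t^2 = beta_t assumed
    return x                                     # an approximate draw from q(x_0)
```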


Comparison with other models

The Denoising Diffusion model has clear parallels with the VAE and especially with Normalizing Flows
All three approaches first generate noise, then progressively transform it towards a specific distribution over the output space
In the case of the VAE, the transformation steps are not explicitly built into the model, though each layer of the decoder network may be considered as a step
In all three cases, the learning problem involves an encoder q(z|x) in addition to the decoder p(x|z)
In the case of NF, the encoder can be directly calculated, because the decoder steps are invertible. In VAE and Denoising Diffusion, a variational approximation is needed
Unlike in VAE and NF, the initial noise in denoising diffusion models is of the same size as the target space (e.g. an m × n image)
Empirically, Denoising Diffusion is seen to generate the best quality images

Generative Adversarial Networks


Adversarial Learning

Many generators: DBM, VAE, Normalizing Flows, Denoising Diffusion, ...
Each of them generates new samples, based on the samples they were trained on
A generator can be considered successful only if it produces samples that seem to belong to the training set
Adversarial learning: have a discriminator critically evaluate the generator's performance
The generator and discriminator are iteratively updated with contrasting aims
The generator should fool the discriminator by generating realistic samples
The discriminator should be as efficient as possible


2-sample Test

Consider two datasets $\{X_1, \ldots, X_N\} \sim P$ and $\{Z_1, \ldots, Z_M\} \sim Q$
We do not know P or Q
Given a new sample Y, can we estimate whether it was obtained from P or Q?
There is no provably correct way of solving this, but heuristics are available:
    Estimate P and Q from the samples
    Compare the closeness of Y with the two sets of samples
    Train a classifier to identify possible discriminative features
Generative Adversarial Networks use the third option to evaluate the generator's performance
A binary classifier is trained to discriminate between dataset samples and generated samples
Choice of classifier: depends on the nature of the samples (e.g. CNNs for images)


GAN objective function

Original data distribution: $p_{DATA}$; generator's distribution: $p_{GEN}$
Consider a discriminator that defines D(x) as the probability that a given sample was obtained from $p_{DATA}$
If the discriminator is good, D(x) should be high (close to 1) if $x \sim p_{DATA}$, and low (close to 0) if $x \sim p_{GEN}$
Discriminator parameters ϕ: estimated by maximizing the objective $\phi^* = \arg\max_\phi \left[ \mathbb{E}_{x \sim p_{DATA}}[\log D_\phi(x)] + \mathbb{E}_{x \sim p_{GEN}}[\log(1 - D_\phi(x))] \right]$
The generator, however, wants to ensure that samples from $p_{GEN}$ have high D(x), i.e. that the above objective is minimized
$\{\theta^*, \phi^*\} = \arg\min_\theta \arg\max_\phi \left[ \mathbb{E}_{x \sim p_{DATA}}[\log D_\phi(x)] + \mathbb{E}_{x \sim p_{GEN}}[\log(1 - D_\phi(x))] \right]$


Training a GAN

$p_{DATA}$ is usually not known; instead we simply have a dataset
$p_{GEN}$ may be hard to calculate analytically, depending on the architecture
Hence the expectations in the objective have to be approximated with a set of samples (Monte Carlo)
$\{\theta^*, \phi^*\} = \arg\min_\theta \arg\max_\phi \left[ \sum_{i=1}^{M} \log D_\phi(x_i) + \sum_{j=1}^{N} \log(1 - D_\phi(G_\theta(z_j))) \right]$
Here, $\{x_1, \ldots, x_M\}$ are training samples, while $\{z_1, \ldots, z_N\}$ are latent variable samples from the generator's prior
We estimate the parameters θ of the generator and ϕ of the discriminator in alternating steps
There is no guarantee of convergence; we hope that at some point an equilibrium is reached, where the discriminator classifies samples from $p_{GEN}$ as those from $p_{DATA}$ despite the optimization of ϕ
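A minimal PyTorch sketch of these alternating updates on toy 1-D data; the architectures are stand-ins, and the generator update uses the common non-saturating variant rather than the exact minimax objective above:

```python
import torch

G = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 1))
D = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 1), torch.nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = torch.nn.BCELoss()

for step in range(500):
    x = torch.randn(64, 1) * 0.5 + 2.0           # "data": N(2, 0.25)
    z = torch.randn(64, 8)                       # latent samples from the prior
    # discriminator step: maximize log D(x) + log(1 - D(G(z)))
    d_loss = (bce(D(x), torch.ones(64, 1)) +
              bce(D(G(z).detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # generator step: fool the discriminator (non-saturating form)
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(256, 8)).mean().item())      # drifts toward the data mean ~2
```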


Probabilistic Perspective of GAN

The aim is to minimize the distance between $p_{GEN}$ and $p_{DATA}$
The optimized D(x) may be interpreted as $D^*(x) = \frac{p_{DATA}(x)}{p_{DATA}(x) + p_{GEN}(x)}$
Plug this expression into the GAN objective and remove the maximization over ϕ:
$\theta^* = \arg\min_\theta \left[ KL\left(p_{GEN} \,\Big\|\, \frac{p_{GEN} + p_{DATA}}{2}\right) + KL\left(p_{DATA} \,\Big\|\, \frac{p_{GEN} + p_{DATA}}{2}\right) \right] - \log 4$
θ is chosen to minimize the Jensen-Shannon Divergence (J-S Div) between $p_{DATA}$ and $p_{GEN}$: one distance measure between them!
Why not consider other distance measures between two distributions?


Conditional GAN

Our dataset may have labelled data as $\{X_i, Y_i\}$
We may specify the label to generate a new sample from the generator
The generator and discriminator parameters must be made specific to the class label
$\{\theta^*, \phi^*\} = \arg\min_\theta \arg\max_\phi \left[ \mathbb{E}_{x \sim p_{DATA}}[\log D_\phi(x|y)] + \mathbb{E}_{x \sim p_{GEN}}[\log(1 - D_\phi(G_\theta(z|y)))] \right]$
In this case, $p_{DATA}$ may be conditional (like $p_{DATA}(x|y)$) or joint ($p_{DATA}(x, y)$)

Variants of GAN

Mode Collapse

$p_{DATA}$ may have many modes, i.e. regions of high probability density in the sample space
Mode collapse: a situation where $p_{GEN}$ contains only one or two of these modes
Result: most generated samples are near-identical, or fall into only a small number of categories
Reasons:
    1 The generator distribution is less expressive than $p_{DATA}$
    2 The alternating mini-max optimization is unable to find an equilibrium
Typically, the generator latches on to a few samples on which the discriminator fails at any given iteration
The generator may keep switching between modes in different iterations


Prevention of Mode Collapse


Mode collapse happens when the generator produces similar samples, irrespective of the Z-value!
One way to prevent this: the discriminator should consider $(Z_{GEN}, X_{GEN})$ instead of $X_{GEN}$ only!
Problem: there is no $Z_{DATA}$ corresponding to $X_{DATA}$!
Solution: build an encoder network E that maps $X_{DATA}$ to $Z_{DATA} = E(X_{DATA})$, in the same latent space as Z
The discriminator then has to distinguish between $(X_{DATA}, Z_{DATA})$ and $(X_{GEN}, Z_{GEN})$
Bidirectional Generative Adversarial Network (Bi-GAN)!!


Other Issues with GAN

Vanilla GAN essentially finds a generator that minimizes the J-S Divergence between $p_{GEN}$ and $p_{DATA}$
The J-S Divergence need not be the best distance between two distributions
It may be possible to try out other measures between two distributions
GAN's performance is bounded by the generator and discriminator architectures chosen by the user


f-GAN

Another well-known distance between any two distributions is the f-divergence $D_f(P \| Q) = \mathbb{E}_{x \sim Q}\left[ f\left( \frac{P(x)}{Q(x)} \right) \right]$, where f is any convex, lower semi-continuous function
We cannot minimize it directly, but we may work with a tight lower bound of it
$D_f(P \| Q) = \sup_{T \in \tau} \left[ \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^*(T(x))] \right]$
Here, T is a function from the function class τ, and $f^*$ is the Fenchel conjugate of f
The Fenchel conjugate can be found analytically for a few standard functions
f-GAN formulation: $\min_\theta \max_T \left[ \mathbb{E}_{x \sim p_{DATA}}[T(x)] - \mathbb{E}_{x \sim p_{GEN,\theta}}[f^*(T(x))] \right]$
Here T plays the role of the discriminator, while the θ-parameterized $p_{GEN}$ plays the role of the generator
Maximizing w.r.t. T tightens the lower bound; minimizing w.r.t. θ minimizes the divergence


W-GAN

Another suitable distance measure: the Earth Mover's Distance, or Wasserstein Distance, $D_W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|_1]$
Here Π(P, Q) is the set of joint distributions whose marginals are P and Q
A lower bound on $D_W(P, Q)$ is $\mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)]$, where f is any function with Lipschitz constant ≤ 1
$|f(x) - f(y)| \leq \kappa \|x - y\|_1$, where κ is the Lipschitz constant
Now set $P = p_{DATA}$, $Q = p_{GEN}$, and consider f as analogous to the discriminator
Wasserstein GAN objective: minimize over θ and maximize over f
The constraint on the Lipschitz constant is enforced by a gradient penalty or by weight clipping
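A minimal sketch of the gradient-penalty approach (as in WGAN-GP, Gulrajani et al.), which softly enforces the 1-Lipschitz constraint by penalizing the critic's gradient norm away from 1 on random interpolates; the critic here is a stand-in:

```python
import torch

def gradient_penalty(f, x_real, x_fake, lam=10.0):
    eps = torch.rand(x_real.size(0), 1)                    # interpolation weights
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grads, = torch.autograd.grad(f(x_hat).sum(), x_hat, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

f = torch.nn.Linear(4, 1)                                  # stand-in critic
gp = gradient_penalty(f, torch.randn(8, 4), torch.randn(8, 4))
gp.backward()                                              # differentiable w.r.t. the critic
print(gp.item())
```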


Evaluating Generative Models


Test Likelihood

One well-known approach to evaluate a generative model: calculate the probability of the test samples w.r.t. the trained model with its estimated parameters
Not applicable for likelihood-free models like GANs
Even for VAEs, numerically calculating the likelihood is tough (due to the intractable integral over the latent variables)
In such cases: draw many samples from the generator, use them to estimate a density, then calculate the test likelihood
Problem: not intuitive; in many cases even an intuitively bad sample can have high likelihood


Quality of Generated Samples

Quality depends on the nature of the data; in the case of images, it is visual appeal
Human-in-the-loop evaluation: can human annotators distinguish between generated and original samples?
Can different classes of samples be distinguished from each other (in the case of labelled data)?
Suppose we have a classifier c that can classify the dataset samples accurately
How confidently can it classify the generated samples?
The entropy $-\mathbb{E}_{x \sim p_{GEN}}\left[ \int_y c(y|x) \log c(y|x)\, dy \right]$ should be low, i.e. c(y|x) should be high for one class and low for the rest
Sharpness: $S = \exp(-\text{entropy})$


Diversity of Generated Samples

We do not want the different samples to be similar to each other (unlabelled case), or to belong to the same class (labelled case)
The class distribution of the generated samples should cover that of the dataset
The marginal distribution $c(y) = \mathbb{E}_{x \sim p_{GEN}}[c(y|x)]$ should have high entropy, or low cross-entropy with the dataset's class distribution (if known)
Diversity score: $D = \exp\left(-\mathbb{E}_{x \sim p_{GEN}}\left[ \int_y c(y|x) \log c(y)\, dy \right]\right)$
Inception Score: $I = S \times D$, i.e. high sharpness and high diversity are both rewarded
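A minimal sketch of the S × D decomposition, given classifier probabilities c(y|x) for a set of generated samples; here the probabilities are drawn from a toy Dirichlet instead of a real classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10) * 0.3, size=500)   # toy c(y|x) for 500 samples
logp = np.log(probs + 1e-12)                          # epsilon guards against underflow

cond_entropy = -np.mean(np.sum(probs * logp, axis=1))            # E_x H(c(y|x))
marginal = probs.mean(axis=0)                                    # c(y)
cross_entropy = -np.mean(np.sum(probs * np.log(marginal), axis=1))

S = np.exp(-cond_entropy)   # high when each sample is classified confidently
D = np.exp(cross_entropy)   # high when the class marginal is spread out
print(S, D, S * D)          # S * D = exp(E_x KL(c(y|x) || c(y))), the Inception Score
```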


Properties of Latent Representation

The latent representation of the generator should contain some information about the sample
Variety in the generated samples is ensured by the distribution of this latent random variable
It should be possible to estimate the latent representation given a sample
Can the latent representation capture all the variations across the training samples?
We may carry out clustering of the latent representations of a set of samples
Ideally, each cluster should contain samples having the same set of attributes or class label
The latent representation should provide lossless compression, i.e. the generator should be able to reconstruct a sample exactly from its latent representation

Acknowledgement

Thank you!
