CAP6412
Advanced Computer Vision
Mubarak Shah
shah@crcv.ucf.edu
HEC-245
Lecture-5: Diffusion Models-Part-II
Diffusion models in vision: A survey
https://arxiv.org/pdf/2209.04747.pdf
Alin Croitoru (University of Bucharest, Romania), alincroitoru97@gmail.com
Vlad Hondru (University of Bucharest, Romania), vladhondru25@gmail.com
Radu Tudor Ionescu (University of Bucharest, Romania), raducu.ionescu@gmail.com
Mubarak Shah (University of Central Florida, US), shah@crcv.ucf.edu
High-level overview
• Diffusion models are probabilistic models used for image generation
• They involve reversing the process of gradually degrading the data
• Consist of two processes:
o The forward process: data is progressively destroyed by adding noise across multiple time steps
o The reverse process: a neural network sequentially removes the noise to recover the original data
[Figure: the forward process maps the data distribution to a standard Gaussian; the reverse process maps the standard Gaussian back to the data distribution.]
High-level overview
• Three categories:
Denoising Diffusion Probabilistic Models (DDPM)
Noise Conditioned Score Networks (NCSN)
Stochastic Differential Equations (SDE)
Denoising Diffusion Probabilistic Models (DDPMs)
Forward process
[Figure: forward process, $x_0 \sim p(x_0) \rightarrow x_1 \rightarrow \dots \rightarrow x_T \sim \mathcal{N}(0, I)$.]
Denoising Diffusion Probabilistic Models (DDPMs)
[Figure: reverse process, $x_T \sim \mathcal{N}(0, I) \rightarrow \dots \rightarrow x_0 \sim p(x_0)$.]
Denoising Diffusion Probabilistic Models (DDPMs)
Forward process (iterative): the image is gradually replaced with noise.
$x_t \sim p(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right), \quad \beta_t \ll 1,\ t = 1, \dots, T$
$x_0 \rightarrow x_1 \rightarrow \dots \rightarrow x_{T-1} \rightarrow x_T$
Denoising Diffusion Probabilistic Models (DDPMs)
Forward process, ancestral sampling (one shot). Notations: $\alpha_t = 1 - \beta_t$, $\hat{\beta}_t = \prod_{i=1}^{t} \alpha_i$
$x_t \sim p(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\hat{\beta}_t}\, x_0,\ (1-\hat{\beta}_t)\, I\right)$
$x_0 \rightarrow x_1 \rightarrow \dots \rightarrow x_{T-1} \rightarrow x_T$
DDPMs. Training objective
Remember that the reverse process runs $x_T \rightarrow \dots \rightarrow x_t \rightarrow x_{t-1} \rightarrow \dots \rightarrow x_0$, and each reverse transition is approximated by a neural network with weights $\theta$:
$p(x_{t-1} \mid x_t) \approx p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$
DDPMs. Training objective
Simplification: fix the variance instead of learning it, and predict/learn only the mean.
$p(x_{t-1} \mid x_t) \approx p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$
(reverse process; the mean $\mu_\theta$ is approximated by a neural network with weights $\theta$)
DDPMs. Training objective
[Figure: a UNet-like neural network takes $x_t$ and $t$ as input and outputs $\mu_\theta(x_t, t)$; the next sample is drawn as $x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$.]
DDPMs. Training Algorithm
$\mathcal{L} = \min_\theta \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{x_0 \sim p(x_0),\, z_t \sim \mathcal{N}(0, I)} \left\| z_t - z_\theta(x_t, t) \right\|_2^2$, with $\hat{\beta}_t = \prod_{i=1}^{t} \alpha_i$
Training algorithm:
Repeat
  $x_0 \sim p(x_0)$  % sample an image from the data set
  $t \sim \mathcal{U}(\{1, \dots, T\})$  % randomly choose a time step t of the forward process
  $z_t \sim \mathcal{N}(0, I)$  % sample the noise z_t
  $x_t = \sqrt{\hat{\beta}_t}\, x_0 + \sqrt{1-\hat{\beta}_t}\, z_t$  % get the noisy image
  $\theta = \theta - lr \cdot \nabla_\theta \mathcal{L}$  % update the neural network weights
Until convergence
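A hedged PyTorch sketch of one training step following the algorithm above. `model` stands for any noise-prediction network $z_\theta(x_t, t)$ (e.g. a UNet) and `beta_hat` is the cumulative product $\hat{\beta}$ as a 1-D tensor; both, along with the optimizer and data, are placeholders rather than the survey's implementation.

```python
import torch

def train_step(model, optimizer, x0, beta_hat):
    """One DDPM training step: predict the injected noise z_t from x_t and t."""
    T = beta_hat.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # t ~ U{1..T} (0-indexed)
    z = torch.randn_like(x0)                                     # z_t ~ N(0, I)
    bh = beta_hat[t].view(-1, *([1] * (x0.dim() - 1)))           # broadcast over image dims
    xt = bh.sqrt() * x0 + (1.0 - bh).sqrt() * z                  # noisy image (one-shot forward)
    loss = ((z - model(xt, t)) ** 2).mean()                      # || z_t - z_theta(x_t, t) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                             # theta <- theta - lr * grad
    return loss.item()
```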
DDPMs. Sampling
[Figure: the noisy image $x_t$ and the time step $t$ are passed to the neural network, which outputs $z_\theta(x_t, t)$.]
• Pass the current noisy image, along with $t$, to the neural network
• From the resulting noise estimate, compute the mean of the Gaussian distribution
DDPMs. Sampling
[Figure: the predicted noise $z_\theta(x_t, t)$ is used to compute the mean $\mu_\theta(x_t, t)$ and to sample $x_{t-1}$.]
Sample the image for the next iteration:
$x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$, where $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\hat{\beta}_t}}\, z_\theta(x_t, t)\right)$
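A PyTorch sketch of the sampling loop above. `model` is again a placeholder noise-prediction network, and setting $\sigma_t^2 = \beta_t$ is one common choice assumed here, not the only option.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, device="cpu"):
    """Reverse process: start from pure noise and denoise step by step."""
    alphas = 1.0 - betas
    beta_hat = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                                  # z_theta(x_t, t)
        mean = (x - (1 - alphas[t]) / (1 - beta_hat[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * z                           # sigma_t^2 = beta_t (assumed)
    return x
```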
Outline
1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Score Function
• The score function is the direction in which we need to change the input x so that its probability density increases
• Mixture of two Gaussians in 2D
Credit images Yang Song: https://yang-song.net/blog/2021/score/
Score Function
• Second formulation of diffusion models
• Based on the Langevin dynamics method:
• Start from a random sample
• Apply iterative updates with the score function to modify the sample
• The result has a higher chance of being a sample from the true distribution p(x)
Naïve score-based model
• Score: gradient of the logarithm of the probability density with respect to the input
• Annealed Langevin dynamics:
$x_{i+1} = x_i + \frac{\gamma}{2} \nabla_{x} \log p(x_i) + \sqrt{\gamma}\, \omega_i$
  o $\gamma$: step size, controls the magnitude of the update in the direction of the score
  o $\nabla_{x} \log p(x_i)$: score, estimated by the score network
  o $\omega_i$: noise, random Gaussian noise $\mathcal{N}(0, I)$
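A small NumPy illustration of the Langevin update above, using the analytic score of an isotropic Gaussian in place of a learned score network; the target distribution, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def gaussian_score(x, mu=0.0, sigma=1.0):
    """Score of N(mu, sigma^2 I): grad_x log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x = rng.uniform(-10.0, 10.0, size=(2,))       # start from a random point
gamma = 0.1                                    # step size
for _ in range(1000):
    # Langevin update: x <- x + (gamma/2) * score(x) + sqrt(gamma) * noise
    x = x + 0.5 * gamma * gaussian_score(x) + np.sqrt(gamma) * rng.standard_normal(2)
# after many updates, x behaves like a sample from N(0, I)
```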
Naïve score-based model
• The score is approximated with a neural network
• Score network is trained using score matching
$\mathbb{E}_{x \sim p(x)} \left\| s_\theta(x) - \nabla_x \log p(x) \right\|_2^2$
• Denoising score matching:
  Add small noise to each sample of the data:
  $\tilde{x} \sim \mathcal{N}\!\left(\tilde{x};\ x,\ \sigma^2 I\right) = p_\sigma(\tilde{x} \mid x)$
  Objective:
  $\mathbb{E}_{x \sim p(x),\ \tilde{x} \sim p_\sigma(\tilde{x} \mid x)} \left\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x) \right\|_2^2$
  After training: $s_\theta(x) \approx \nabla_x \log p_\sigma(x)$
Naïve score-based model. Problems
• Manifold hypothesis: real data resides on low dimensional manifolds
• The score is undefined outside these low dimensional manifolds
• Data being concentrated in regions results in further issues:
Incorrectly estimating the score within the low-density regions
Langevin dynamics never converging to the high-density region
Credit images Yang Song: https://yang-song.net/blog/2021/score/
Naïve score-based model. Problems
Credit images Yang Song: https://yang-song.net/blog/2021/score/
Noise Conditioned Score Network (NCSNs)
• Solution:
Perturb the data with random Gaussian noise at different scales
Learn score estimates for the noisy distributions via a single score network
Credit images Yang Song: https://yang-song.net/blog/2021/score/
Noise Conditioned Score Network (NCSNs)
• Given a sequence of Gaussian noise scales $\sigma_1 < \sigma_2 < \dots < \sigma_T$ such that:
o $p_{\sigma_1}(x) \approx p(x)$, i.e. it approximates the true data distribution
o $p_{\sigma_T}(x) \approx \mathcal{N}(0, I)$, i.e. it is almost equal to the standard Gaussian distribution
• And the forward process, i.e. the noise perturbation, is given by:
$p_{\sigma_t}(x_t \mid x) = \mathcal{N}\!\left(x_t;\ x,\ \sigma_t^2 I\right) = \frac{1}{\sigma_t \sqrt{2\pi}} \cdot \exp\!\left(-\frac{1}{2} \cdot \frac{(x_t - x)^2}{\sigma_t^2}\right)$
• The gradient can be written as:
$\nabla_{x_t} \log p_{\sigma_t}(x_t \mid x) = -\frac{x_t - x}{\sigma_t^2}$
Noise Conditioned Score Network (NCSNs)
• Training the NCSN with denoising score matching, the following objective is minimized:
$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(x_t \mid x)} \left\| s_\theta(x_t, \sigma_t) + \frac{x_t - x}{\sigma_t^2} \right\|_2^2$
Noise Conditioned Score Network (NCSNs)
• Training the NCSN with denoising score matching, the following objective is minimized:
$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \lambda(\sigma_t)\, \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(x_t \mid x)} \left\| s_\theta(x_t, \sigma_t) + \frac{x_t - x}{\sigma_t^2} \right\|_2^2$
where $\lambda(\sigma_t)$ is the weighting function.
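A PyTorch sketch of the weighted objective above. `score_net` is a placeholder for any network $s_\theta(x_t, \sigma_t)$, and $\lambda(\sigma) = \sigma^2$ is assumed as the weighting (the choice used in the original NCSN paper), not mandated by the slide.

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    """Denoising score matching averaged over randomly chosen noise scales."""
    t = torch.randint(0, len(sigmas), (x.shape[0],), device=x.device)   # pick a scale per sample
    sigma = sigmas[t].view(-1, *([1] * (x.dim() - 1)))
    noise = torch.randn_like(x)
    x_t = x + sigma * noise                                  # x_t ~ N(x, sigma_t^2 I)
    target = -(x_t - x) / sigma**2                           # true score of p_sigma(x_t | x)
    pred = score_net(x_t, t)
    weight = sigmas[t] ** 2                                  # lambda(sigma) = sigma^2 (assumed)
    per_sample = ((pred - target) ** 2).flatten(1).sum(dim=1)
    return (weight * per_sample).mean()
```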
Noise Conditioned Score Network (NCSNs). Sampling
Annealed Langevin dynamics
Parameters:
  $N$ – number of iterations for Langevin dynamics at each noise scale
  $\sigma_1 < \sigma_2 < \dots < \sigma_T$ – noise scales
  $\gamma_t$ – update magnitude
Algorithm:
  $x_T^0 \sim \mathcal{N}(0, I)$  % sample some standard Gaussian noise
  for $t = T, \dots, 1$ do:  % start from the largest noise scale, which is denoted by the time step
    for $i = 1, \dots, N$ do:  % for N iterations execute the Langevin dynamics updates
      $\omega \sim \mathcal{N}(0, I)$  % get noise
      $x_t^{i} = x_t^{i-1} + \frac{\gamma_t}{2}\, s_\theta(x_t^{i-1}, \sigma_t) + \sqrt{\gamma_t}\, \omega$  % update
    $x_{t-1}^{0} = x_t^{N}$  % next iteration: the last sample at scale $\sigma_t$ initializes the next scale
  return $x_0^0$
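A PyTorch sketch of annealed Langevin dynamics as outlined above. The step-size rule $\gamma_t = \epsilon \cdot \sigma_t^2 / \sigma_1^2$ and the default values of $N$ and $\epsilon$ follow one common choice and are assumptions here; `score_net` is a placeholder for the trained NCSN.

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, shape, sigmas, n_steps=100, eps=2e-5, device="cpu"):
    """Loop from the largest noise scale down to the smallest, running Langevin updates at each."""
    x = torch.randn(shape, device=device)                    # start from standard Gaussian noise
    for t in range(len(sigmas) - 1, -1, -1):                 # sigmas assumed sorted ascending
        step = eps * (sigmas[t] / sigmas[0]) ** 2            # gamma_t (assumed step-size rule)
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        for _ in range(n_steps):                             # N Langevin updates per scale
            z = torch.randn_like(x)
            x = x + 0.5 * step * score_net(x, t_batch) + step.sqrt() * z
        # x now seeds the Langevin chain at the next (smaller) noise scale
    return x
```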
DDPM vs NCSN. Losses
DDPM: $\mathcal{L} = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{x_0 \sim p(x_0),\, z_t \sim \mathcal{N}(0, I)} \left\| z_\theta(x_t, t) - z_t \right\|_2^2$
NCSN: $\mathcal{L} = \frac{1}{T}\sum_{t=1}^{T} \lambda(\sigma_t)\, \mathbb{E}_{x \sim p(x),\, x_t \sim p_{\sigma_t}(x_t \mid x)} \left\| s_\theta(x_t, \sigma_t) + \frac{x_t - x}{\sigma_t^2} \right\|_2^2$
• In the DDPM loss, the weighting function is missing because better sample quality is obtained when $\lambda$ is set to 1.
• We can rewrite the noise $z$ as $z = \frac{x_t - x}{\sigma_t}$, so that $\frac{x_t - x}{\sigma_t^2} = \frac{z}{\sigma_t}$.
• So, $s_\theta$ learns to approximate a scaled negative noise, $-\frac{z}{\sigma_t}$.
DDPM vs NCSN. Sampling
DDPM: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\hat{\beta}_t}}\, z_\theta(x_t, t)\right) + \sqrt{\beta_t}\, z$
• Iterative updates are based on subtracting some form of noise from the noisy image.
NCSN: $x_t^{i} = x_t^{i-1} + \frac{\gamma_t}{2}\, s_\theta(x_t^{i-1}, \sigma_t) + \sqrt{\gamma_t}\, z$
• This is also true for NCSN, because $s_\theta(x_t, \sigma_t)$ approximates the negative of the noise:
$\mathcal{L}_{NCSN} = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(x_t \mid x)} \left\| s_\theta(x_t, \sigma_t) + \frac{x_t - x}{\sigma_t^2} \right\|_2^2 = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(x_t \mid x)} \left\| s_\theta(x_t, \sigma_t) - \left(-\frac{x_t - x}{\sigma_t^2}\right) \right\|_2^2$
$\mathcal{L}_{DDPM} = \min_\theta \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{x_0 \sim p(x_0),\, z_t \sim \mathcal{N}(0, I)} \left\| z_t - z_\theta(x_t, t) \right\|_2^2$, with $z = \frac{x_t - x}{\sigma_t}$
• Therefore, the generative processes defined by NCSN and DDPM are very similar.
Outline
1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Stochastic Differential Equations (SDEs)
• A generalized framework that can be applied over the previous two methods
• Here, the diffusion process is continuous and is described by an SDE
• Works by the same principle:
Gradually transforms the data distribution p(x0) into noise
Reverse the process to obtain the original data distribution
Stochastic Differential Equations (SDEs)
• The forward diffusion process is represented by the following SDE:
$\frac{\partial x}{\partial t} = f(x, t) + \sigma(t)\, \omega_t \iff \partial x = f(x, t)\, \partial t + \sigma(t) \cdot \partial \omega$
  o $f(x, t)$: drift coefficient, gradually nullifies the data $x_0$
  o $\sigma(t)$: diffusion coefficient, controls how much Gaussian noise is added
  o $\partial \omega$: white Gaussian noise, notation for $\mathcal{N}(0, \partial t)$
Stochastic Differential Equations (SDEs)
• The reverse-time SDE is defined as:
$\partial x = \left[ f(x, t) - \sigma(t)^2 \cdot \nabla_x \log p_t(x) \right] \partial t + \sigma(t) \cdot \partial \omega$
• The training objective is similar to NCSN, but adapted for continuous time:
$\mathcal{L}^{*} = \mathbb{E}_{t}\, \lambda(t)\, \mathbb{E}_{p(x_0)}\, \mathbb{E}_{p(x_t \mid x_0)} \left\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \right\|_2^2$
• The score function is used in the reverse-time SDE:
  A neural network is employed to estimate the score function.
  Then a numerical SDE solver is used to generate samples.
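A NumPy sketch of solving the reverse-time SDE above with a simple Euler–Maruyama scheme. The callables `score_fn`, `f`, and `sigma` are placeholders supplied by the caller (in practice `score_fn` would be the trained score network), and the step count is an arbitrary assumption.

```python
import numpy as np

def reverse_sde_sample(score_fn, f, sigma, x_T, T=1.0, n_steps=1000, seed=0):
    """Integrate dx = [f(x,t) - sigma(t)^2 * score(x,t)] dt + sigma(t) dw backward from t=T to 0."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt
        drift = f(x, t) - sigma(t) ** 2 * score_fn(x, t)
        # going backward in time: drift enters with a negative step, noise scales with sqrt(dt)
        x = x - drift * dt + sigma(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x
```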
Stochastic Differential Equations (SDEs). NCSN
• The process of NCSN:
$x_t \sim \mathcal{N}\!\left(x_t;\ x_{t-1},\ (\sigma_t^2 - \sigma_{t-1}^2) \cdot I\right) \Rightarrow x_t = x_{t-1} + \sqrt{\sigma_t^2 - \sigma_{t-1}^2}\, z_{t-1}$
• We can reformulate the above expression to look like a discretization of an SDE:
$x_t - x_{t-1} = \sqrt{\frac{\sigma_t^2 - \sigma_{t-1}^2}{t - (t-1)}} \cdot z_{t-1}$
• Translating the above discretization to the continuous case:
$\partial x = \sqrt{\frac{\partial \sigma^2(t)}{\partial t}}\, \partial \omega(t)$
which is the general SDE $\partial x = f(x, t)\, \partial t + \sigma(t) \cdot \partial \omega$ with $f(x, t) = 0$ and $\sigma(t) = \sqrt{\frac{\partial \sigma^2(t)}{\partial t}}$.
Stochastic Differential Equations (SDEs). DDPM
• The process of DDPM:
$x_t \sim \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \Rightarrow x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t} \cdot z_{t-1}$
• If we consider a time step size $\Delta t = \frac{1}{T}$ instead of 1, and $\beta(t)\Delta t = \beta_t$:
$x_t = \sqrt{1 - \beta(t)\Delta t}\, x_{t-\Delta t} + \sqrt{\beta(t)\Delta t} \cdot z$
• Using the Taylor expansion of $\sqrt{1 - \beta(t)\Delta t}$:
$x_t \approx \left(1 - \frac{\beta(t)\Delta t}{2}\right) x_{t-\Delta t} + \sqrt{\beta(t)\Delta t} \cdot z$
$x_t \approx x_{t-\Delta t} - \frac{\beta(t)\Delta t}{2}\, x_{t-\Delta t} + \sqrt{\beta(t)\Delta t} \cdot z \iff x_t - x_{t-\Delta t} = -\frac{\beta(t)\Delta t}{2}\, x_{t-\Delta t} + \sqrt{\beta(t)\Delta t} \cdot z$
• For the continuous case, the above becomes:
$\partial x = -\frac{1}{2}\beta(t)\, x\, \partial t + \sqrt{\beta(t)}\, \partial \omega(t)$
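A small NumPy check of the continuous limit above: simulating $\partial x = -\frac{1}{2}\beta(t)\, x\, \partial t + \sqrt{\beta(t)}\, \partial \omega(t)$ with Euler–Maruyama drives an arbitrary starting distribution toward a standard Gaussian. The linear $\beta(t)$ schedule is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = lambda t: 0.1 + 19.9 * t            # assumed linear beta(t) on [0, 1]
n_steps = 1000
dt = 1.0 / n_steps

x = rng.uniform(-2.0, 2.0, size=(10_000,)) # "data": an arbitrary non-Gaussian start
for i in range(n_steps):
    t = i * dt
    # Euler-Maruyama step of the forward VP-SDE
    x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)

print(x.mean(), x.std())                   # approximately 0 and 1
```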
Outline
1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Conditional generation.
Diffusion models estimate the score function $\nabla_{x_t} \log p_t(x_t)$ to sample from a distribution $p(x)$.
Sampling from $p(x \mid y)$ requires the score function of that conditional density, $\nabla_{x_t} \log p_t(x_t \mid y)$, where $y$ is the condition.
Solution 1. Conditional training: train the model with an additional input $y$ to estimate $\nabla_{x_t} \log p_t(x_t \mid y)$.
[Figure: a network that also receives $y$, so that $s_\theta(x_t, t, y) \approx \nabla_{x_t} \log p_t(x_t \mid y)$.]
Conditional generation. Classifier Guidance
Diffusion models estimate the score function $\nabla_{x_t} \log p_t(x_t)$ to sample from a distribution $p(x)$.
Sampling from $p(x \mid y)$ requires the score function of that conditional density, $\nabla_{x_t} \log p_t(x_t \mid y)$.
Solution 2. Classifier guidance:
Bayes rule:
$p_t(x_t \mid y) = \frac{p_t(y \mid x_t) \cdot p_t(x_t)}{p_t(y)} \iff$
Logarithm:
$\log p_t(x_t \mid y) = \log p_t(y \mid x_t) + \log p_t(x_t) - \log p_t(y) \iff$
Gradient (the term $\nabla_{x_t} \log p_t(y)$ vanishes because $p_t(y)$ does not depend on $x_t$):
$\nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t) - \nabla_{x_t} \log p_t(y) \iff$
$\nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t)$
(first term: classifier; second term: unconditional diffusion model)
Conditional generation. Classifier Guidance
Solution 2. Classifier guidance:
$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t)$, where $s$ is the guidance weight
[Figure: samples generated with $s = 1$ vs. $s = 10$]
Problems:
• Good gradient estimates are needed at each step of the denoising process
• The classifier must be robust to the noise added to the image
• This requires training the classifier on noisy data, which can be problematic
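A hedged PyTorch sketch of the classifier-guided score above: the gradient $\nabla_{x_t} \log p_t(y \mid x_t)$ is obtained by backpropagating through a (noise-aware) classifier. `score_net` and `classifier` are placeholder callables, not a specific published implementation.

```python
import torch

def guided_score(score_net, classifier, x_t, t, y, s=1.0):
    """Classifier guidance: s * grad log p(y | x_t) + grad log p(x_t)."""
    uncond = score_net(x_t, t)                                   # unconditional score
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(x_t.shape[0]), y].sum()
        cls_grad = torch.autograd.grad(selected, x_in)[0]        # grad_x log p(y | x_t)
    return s * cls_grad + uncond
```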
Conditional generation. Classifier-free Guidance
Solution 3. Classifier-free guidance
$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t)$
Bayes rule:
$p_t(y \mid x_t) = \frac{p_t(x_t \mid y) \cdot p_t(y)}{p_t(x_t)}$
Logarithm:
$\log p_t(y \mid x_t) = \log p_t(x_t \mid y) - \log p_t(x_t) + \log p_t(y)$
Gradient:
$\nabla_{x_t} \log p_t(y \mid x_t) = \nabla_{x_t} \log p_t(x_t \mid y) - \nabla_{x_t} \log p_t(x_t)$
Substituting into the expression above, $\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t)$:
$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \left(\nabla_{x_t} \log p_t(x_t \mid y) - \nabla_{x_t} \log p_t(x_t)\right) + \nabla_{x_t} \log p_t(x_t)$
$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(x_t \mid y) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$
Both the conditional and the unconditional score are learned by a single model.
Conditional generation. Classifier-free Guidance
[Figure: a single network $s_\theta(x_t, t, y)$, conditioned on $y$, estimates $\nabla_{x_t} \log p_t(x_t \mid y)$.]
Conditional generation. Classifier-free Guidance
[Figure: the same network receives either the condition $y$ or a null token $\emptyset$, so that $s_\theta(x_t, t, y)$ estimates the conditional score and $s_\theta(x_t, t, \emptyset)$ estimates the unconditional score.]
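A PyTorch sketch of the classifier-free combination in the score form from the slides. `score_net` is a placeholder for a single network trained with both real conditions and a null token (represented here by `None`).

```python
import torch

def cfg_score(score_net, x_t, t, y, s=2.0):
    """Classifier-free guidance: s * score(x_t | y) + (1 - s) * score(x_t | null)."""
    cond = score_net(x_t, t, y)        # conditioned on y
    uncond = score_net(x_t, t, None)   # conditioned on the null token
    return s * cond + (1.0 - s) * uncond
```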
CLIP guidance
What is a CLIP model?
• Trained with a contrastive cross-entropy loss over matched image-text pairs, jointly learning an image encoder $f(\cdot)$ and a text encoder $g(\cdot)$
• The optimal value of the similarity $f(x) \cdot g(c)$ is $\log p(x \mid c) - \log p(c)$ (up to an additive constant)
Slide from: Denoising Diffusion-based Generative Modeling: Foundations and Applications
Karsten Kreis Ruiqi Gao Arash Vahdat
Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
CLIP guidance
Replace the classifier in classifier guidance with a CLIP model:
$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(x_t \mid y) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$
$\nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \log p_t(x_t \mid c) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$
$\nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \left(\log p_t(x_t \mid c) - \log p(c)\right) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$
$\nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \left(f(x_t) \cdot g(c)\right) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$
where the term $f(x_t) \cdot g(c)$ is computed by the CLIP model ($f$: image encoder, $g$: text encoder).
Slide from: Denoising Diffusion-based Generative Modeling: Foundations and Applications
Karsten Kreis Ruiqi Gao Arash Vahdat
Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
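A sketch of the CLIP-guided term above, where the log-probability is replaced by the dot product $f(x_t) \cdot g(c)$ between CLIP image and text embeddings. `score_net`, `clip_image_encoder`, and `clip_text_encoder` are placeholder callables, not the GLIDE implementation.

```python
import torch

def clip_guided_score(score_net, clip_image_encoder, clip_text_encoder, x_t, t, caption, s=3.0):
    """CLIP guidance: s * grad_x (f(x_t) . g(c)) + (1 - s) * unconditional score."""
    uncond = score_net(x_t, t)
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        img_emb = clip_image_encoder(x_in)                   # f(x_t)
        txt_emb = clip_text_encoder(caption)                 # g(c)
        sim = (img_emb * txt_emb).sum()                      # dot-product similarity
        clip_grad = torch.autograd.grad(sim, x_in)[0]        # grad_x (f(x_t) . g(c))
    return s * clip_grad + (1.0 - s) * uncond
```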
Outline
1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Research directions
Unconditional image generation:
• Sampling efficiency
• Image quality
Conditional image generation:
• Text-to-image generation
Complex tasks in computer vision:
• Image editing, even based on text
• Super-resolution
• Image segmentation
• Anomaly detection in medical images
• Video generation
Thank you!
Survey: https://arxiv.org/abs/2209.04747
GitHub: https://github.com/CroitoruAlin/Diffusion-Models-in-Vision-A-Survey