Generative Image Dynamics
[Figure 1: input image; neural stochastic motion texture (X and Y coefficients at frequencies such as 0.2 Hz and 0.4 Hz); seamlessly looping video; interactive dynamics.]
Figure 1. Our approach models a generative image-space prior on scene dynamics: from a single RGB image, our model generates a neural
stochastic motion texture, a motion representation that models dense long-term motion trajectories in the Fourier domain. We show that
our motion priors enable applications such as turning a single picture into a seamlessly looping video, or simulating object dynamics in
response to an interactive user excitation (e.g., dragging and releasing a point on the object). On the right, we visualize the output videos
using space-time X-t slices through 10 seconds of video (along the scanline shown in the input picture).
Abstract

We present an approach to modeling an image-space prior on scene dynamics. Our prior is learned from a collection of motion trajectories extracted from real video sequences containing natural, oscillating motion such as trees, flowers, candles, and clothes blowing in the wind. Given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a per-pixel long-term motion representation in the Fourier domain, which we call a neural stochastic motion texture. This representation can be converted into dense motion trajectories that span an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping dynamic videos, or allowing users to realistically interact with objects in real pictures. See our project page for more results: generative-dynamics.github.io.

1. Introduction

The natural world is always in motion, with even seemingly static scenes containing subtle oscillations as a result of factors such as wind, water currents, respiration, or other natural rhythms. Motion is one of the most salient visual signals, and humans are particularly sensitive to it: captured imagery without motion (or even with slightly unrealistic motion) will often seem uncanny or unreal.

While it is easy for humans to interpret or imagine motion in scenes, training a model to learn realistic scene motion is far from trivial. The motion we observe in the world is the result of a scene's underlying physical dynamics, i.e., forces applied to objects that respond according to their unique physical properties: their mass, elasticity, and so on. These properties and forces are hard to measure and capture at scale, but fortunately, in many cases measuring them is unnecessary: we can instead capture and learn from the resulting observed motion. This observed motion is multi-modal and grounded in complex physical effects, but it is nevertheless often predictable: candles will flicker in certain ways, trees will sway, and their leaves will rustle. This predictability is ingrained in our human perception of real scenes: by viewing a still image, we can imagine plausible motions that might have been ongoing as the picture was captured, or, if there might have been many possible such motions, a distribution of natural motions conditioned on that image. Given the facility with which humans are able to imagine these possible motions, a natural research problem is to model this same distribution computationally.

Recent advances in generative models, and in particular conditional diffusion models [40, 78, 80], have enabled us to model highly rich and complex distributions, including distributions of real images conditioned on text [68–70]. This capability has enabled a number of previously impossible applications, such as text-conditioned generation of arbitrary, diverse, and realistic image content. Following the success of these image models, recent work has shown that modeling other domains, such as videos [7, 39] and 3D geometry [72, 90, 91, 93], can be similarly useful for downstream applications.

In this paper, we explore modeling a generative prior for image-space scene motion, i.e., the motion of all pixels in a single image. This model is trained on automatically extracted motion trajectories from a large collection of real video sequences. Conditioned on an input image, the trained model predicts a neural stochastic motion texture: a set of coefficients of a motion basis that characterize each pixel's trajectory into the future. We limit our scope to real-world scenes with natural, oscillating dynamics such as trees and flowers moving in the wind, and therefore choose the Fourier series as our basis functions. We predict a neural stochastic motion texture using a diffusion model that generates coefficients for a single frequency at a time, but coordinates these predictions across frequency bands. The resulting frequency-space textures can then be transformed into dense, long-range pixel motion trajectories, which can (along with an image-based rendering diffusion model) be used to synthesize future frames, turning still images into realistic animations, as illustrated in Fig. 1.

Compared with priors over raw RGB pixels, priors over motion capture more fundamental, lower-dimensional underlying structure that efficiently explains variations in pixel values. Hence, our motion representation leads to more coherent long-term generation and more fine-grained control over animations compared with prior methods that perform image animation via raw video synthesis. We also demonstrate that our generated motion representation is convenient for a number of downstream applications, such as creating seamless looping videos, editing the generated motion, and enabling interactive dynamic images, i.e., simulating the response of object dynamics to user-applied forces.

2. Related Work

Generative synthesis. Recent advances in generative models have enabled photorealistic synthesis of images conditioned on text prompts [16, 17, 23, 68–70]. These generative text-to-image models can be augmented to synthesize video sequences by extending the generated image tensors along a time dimension [7, 9, 39, 58, 77, 96, 101]. While these methods are effective at producing plausible video sequences that capture the spatiotemporal statistics of real footage, the resulting videos can suffer from a number of common artifacts, such as incoherent motion, unrealistic temporal variation in textures, and violations of physical constraints like preservation of mass.

Animating images. Instead of generating videos entirely from text, other techniques take as input a still picture and animate it. Many recent deep learning methods adopt a 3D U-Net architecture to produce video volumes directly from an input image [26, 33, 37, 43, 49, 83]. Because these models are effectively the same video generation models (but conditioned on image information instead of text), they exhibit similar artifacts to those mentioned above. One way to overcome these limitations is to not directly generate the video content itself, but instead animate an input source image through explicit or implicit image-based rendering, i.e., moving the image content around according to motion derived from external sources such as a driving video [47, 74–76, 89], motion or 3D geometry priors [8, 28, 42, 60, 61, 87, 91, 92, 94, 99], user annotations [6, 18, 31, 35, 88, 95, 98], or a physical simulation [20, 22]. These methods demonstrate greater temporal coherence and realism, but require additional guidance signals or user input, or otherwise rely on limited motion representations (e.g., optical flow fields, as opposed to full-video dense motion trajectories).

Motion models and motion priors. A number of other works leverage representations of motion beyond two-frame flow fields, both in Eulerian and Lagrangian domains. For instance, Fourier or phase-based motion representations (like ours) have been used for magnifying and visualizing motion [63, 85], or for video editing applications [59]. These representations can also be used in motion prediction, where an image or video is used to inform a deterministic future motion estimate [32, 66], or a richer distribution of possible motions (which can be modeled explicitly or by predicting the pixel values that would be induced by some implicit motion estimate) [84, 86, 94]. Our work can similarly be thought of as learning priors for motion induced by underlying scene dynamics, where our prior is in the form of an image-conditioned distribution over long-range dense trajectories. Other recent work has demonstrated the advantages of modeling and predicting motion using generative models in a number of closed-domain settings such as humans and animals [2, 19, 27, 67, 81, 97].
Videos as textures. Certain moving scenes can be thought of as a kind of texture, termed dynamic textures by Doretto et al. [25], which model videos as space-time samples of a stochastic process. Dynamic textures can represent smooth, natural motions such as waves, flames, or moving trees, and have been widely used for video classification, segmentation, or encoding [12–15, 71]. A related kind of texture, called a video texture, represents a moving scene as a set of input video frames along with transition probabilities between any pair of frames [73]. A large body of work exists for estimating and producing dynamic or video textures through analysis of scene motion and pixel statistics, with the aim of generating seamlessly looping or infinitely varying output videos [1, 21, 30, 54, 55, 73]. In contrast to much of this previous work, our method learns priors in advance that can then be applied to single images.

3. Overview

Given a single picture I_0, our goal is to generate a video {Î_1, Î_2, ..., Î_T} of length T featuring oscillation dynamics such as those of trees, flowers, or candle flames moving in the breeze. Our system consists of two modules, a motion prediction module and an image-based rendering module. Our pipeline begins by using a latent diffusion model (LDM) to predict a neural stochastic motion texture S = (S_{f_0}, S_{f_1}, ..., S_{f_{K-1}}) for the input image I_0. A stochastic motion texture is a frequency representation of per-pixel motion trajectories in an input image (Sec. 4). The predicted stochastic motion texture is then transformed into a sequence of motion displacement fields F = (F_1, F_2, ..., F_T) using an inverse discrete Fourier transform. These motion fields, in turn, are used to determine the position of each input pixel at each future time step. Given these predicted motion fields, our rendering module animates the input RGB image using an image-based rendering technique that splats encoded features from the input image and decodes these splatted features into an output frame with an image synthesis network (Sec. 5). Because our method explicitly estimates a representation of motion from a single picture, it enables several downstream applications, such as the animation of a single still picture with varying speed and motion magnitude, the generation of seamless looping video, and the simulation of object dynamics in response to an external user excitation (i.e., interactive dynamics) (Sec. 6).

4. Neural stochastic motion textures

4.1. Motion textures

As proposed by Chuang et al. [20], a motion texture defines a sequence of time-varying 2D displacement maps F = {F_t | t = 1, ..., T}, where the 2D displacement vector F_t(p) at each pixel coordinate p of the input image I_0 defines the position of that pixel at a future time t. To generate a future frame at time t, one can splat pixels from I_0 using the corresponding displacement map F_t, resulting in a forward-warped image I'_t:

    I'_t(p + F_t(p)) = I_0(p).   (1)

4.2. Stochastic motion textures

As demonstrated by prior work in computer graphics [20, 24, 46, 64], many natural motions, especially the oscillating motions we focus on, can be described as a superposition of a small number of harmonic oscillators with different frequencies, amplitudes, and phases. One way to introduce stochasticity to the motions is to integrate noise fields, but as observed by prior work [20], directly adding random noise into the spatial and temporal domain of the estimated motion fields often leads to unrealistic or erratic animations.

Moreover, adopting motion textures in the temporal domain, as defined above, implies predicting T 2D displacement fields in order to generate a video with T frames. To avoid predicting such a large output representation for long output videos, many prior animation methods either generate video frames autoregressively [7, 28, 53, 56, 83], or predict each future output frame independently via an extra time embedding [4]. However, neither strategy ensures long-term temporal consistency of generated video frames, and both can produce videos that drift or diverge over time.

To address the above issues, we represent per-pixel motion textures (i.e., full motion trajectories for all pixels) for the input scene in the frequency domain and formulate the motion prediction problem as a multi-modal image-to-image translation task. We adopt a latent diffusion model (LDM) to generate a stochastic motion texture, comprised of a 4K-channel 2D motion spectrum map, where K ≪ T is the number of frequencies modeled, and where at each frequency we need four scalars to represent the complex Fourier coefficients for the x and y dimensions. Fig. 1 illustrates these neural stochastic motion textures.

The motion trajectory of a pixel at future time steps, F(p) = {F_t(p) | t = 1, 2, ..., T}, and its representation in the frequency domain as the motion spectrum, S(p) = {S_{f_k}(p) | k = 0, 1, ..., T/2 − 1}, are related by the Fast Fourier transform (FFT):

    S(p) = FFT(F(p)).   (2)
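To make Eqs. (1) and (2) concrete, the NumPy sketch below (ours, not the authors' code) converts a K-term per-pixel motion spectrum into T displacement fields with an inverse FFT, zero-padding the unmodeled high frequencies, and then forward-splats the input image as in Eq. (1). The array shapes, the zero-padding, and the nearest-neighbor splat are our assumptions.

```python
import numpy as np

def spectrum_to_displacements(S, T):
    """S: complex spectrum of shape (K, H, W, 2) holding the first K Fourier
    terms of each pixel's x/y trajectory. Returns displacement fields F of
    shape (T, H, W, 2), assuming the unmodeled high frequencies are zero."""
    K, H, W, _ = S.shape
    full = np.zeros((T // 2 + 1, H, W, 2), dtype=np.complex64)
    full[:K] = S                              # keep only the K low-frequency terms
    return np.fft.irfft(full, n=T, axis=0)    # inverse FFT along the time axis

def forward_splat(I0, F_t):
    """Naive forward warping of Eq. (1): copy each source pixel to its
    displaced location (no soft weighting, holes left as zeros)."""
    H, W, _ = I0.shape
    out = np.zeros_like(I0)
    ys, xs = np.mgrid[0:H, 0:W]
    xd = np.clip(np.round(xs + F_t[..., 0]).astype(int), 0, W - 1)
    yd = np.clip(np.round(ys + F_t[..., 1]).astype(int), 0, H - 1)
    out[yd, xd] = I0[ys, xs]
    return out

# toy usage
S = (np.random.randn(16, 8, 8, 2) + 1j * np.random.randn(16, 8, 8, 2)).astype(np.complex64)
F = spectrum_to_displacements(S, T=150)       # (150, 8, 8, 2)
frame_10 = forward_splat(np.random.rand(8, 8, 3), F[10])
```

The paper's rendering module replaces this naive pixel splat with feature-level softmax splatting (Sec. 5); the FFT relation itself is the same.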
How should we select the K output frequencies for our representation? Prior work in real-time animation has observed that most natural oscillation motions are composed primarily of low-frequency components [24, 64]. To validate this hypothesis, we computed the average power spectrum of the motion extracted from 1,000 randomly sampled 5-second real video clips. As shown in the left plot of Fig. 2, the power spectrum of the motion decreases exponentially with increasing frequency. This suggests that most natural oscillation motions can indeed be well represented by low-frequency terms. In practice, we found that the first K = 16 Fourier coefficients are sufficient to realistically reproduce the original natural motion in a range of real videos and scenes.

Figure 2. Left: We visualize the average motion power spectrum for the x and y motion components extracted from a dataset of real videos, shown as the blue and green curves. Natural oscillation motions are composed primarily of low-frequency components, and so we use the first K = 16 terms, as marked by red dots. Right: we show a histogram of the amplitude of Fourier terms at 3 Hz (K = 16) after (1) scaling amplitude by image width and height (blue), or (2) frequency adaptive normalization (red). Our adaptive normalization prevents the coefficients from concentrating at extreme values.

4.3. Predicting motion with a diffusion model

We choose a latent diffusion model (LDM) [69] as the backbone for our motion prediction module, as LDMs are more computationally efficient than pixel-space diffusion models while preserving generation quality. A standard LDM consists of two main modules: (1) a variational autoencoder (VAE) that compresses the input image to a latent space through an encoder z = E(I), then reconstructs the input from the latent features via a decoder I = D(z), and (2) a U-Net based diffusion model that learns to iteratively denoise latent features starting from Gaussian random noise. Our training applies this not to an input image but to stochastic motion textures from a real video sequence, which are encoded and then diffused for n steps with a pre-defined variance schedule to produce noisy latents z^n. The 2D U-Nets are trained to denoise the noisy latents by iteratively estimating the noise ε_θ(z^n; n, c) used to update the latent features at each step n ∈ {1, 2, ..., N}. The training loss for the LDM is written as

    L_LDM = E_{n ~ U[1, N], ε_n ~ N(0, 1)} || ε_n − ε_θ(z^n; n, c) ||²,   (3)

where c is the embedding of any conditional signal, such as text, semantic labels, or, in our case, the first frame of the training video sequence, I_0. The clean latent features z^0 are then passed through the decoder to recover the stochastic motion textures.
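The loss in Eq. (3) is the standard ε-prediction objective; the schematic PyTorch step below is a sketch under our own assumptions, not the paper's implementation. In particular, the paper conditions the U-Net by concatenating a downsampled I_0 with the noisy latents (see Fig. 3), whereas here the conditioning signal is simply passed as an argument, and the noise schedule, encoder, and denoiser are stand-ins.

```python
import torch
import torch.nn.functional as F_nn

def ldm_training_step(eps_model, vae_encode, S_prime, I0_cond, alphas_cumprod, N=1000):
    """One denoising training step for Eq. (3) with a DDPM-style schedule.
    eps_model(z_n, n, c) predicts the added noise; vae_encode maps a normalized
    motion texture S' to latents; I0_cond is the conditioning signal."""
    z0 = vae_encode(S_prime)                              # clean latents z^0
    n = torch.randint(1, N + 1, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                            # eps_n ~ N(0, 1)
    a_bar = alphas_cumprod[n - 1].view(-1, 1, 1, 1)
    z_n = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps    # diffuse for n steps
    eps_hat = eps_model(z_n, n, I0_cond)                  # eps_theta(z^n; n, c)
    return F_nn.mse_loss(eps_hat, eps)                    # || eps_n - eps_theta ||^2

# toy usage with stand-in components
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
vae_encode = lambda s: s[:, :, ::2, ::2]                  # fake 2x "encoder"
eps_model = lambda z, n, c: torch.zeros_like(z)           # fake denoiser
loss = ldm_training_step(eps_model, vae_encode,
                         torch.randn(2, 4, 32, 32), None, alphas_cumprod)
```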
Frequency adaptive normalization. One issue we observed is that stochastic motion textures have particular distribution characteristics across frequencies. As visualized in the left plot of Fig. 2, the amplitude of our motion textures spans a range of 0 to 100 and decays approximately exponentially with increasing frequency. As diffusion models require that output values lie between 0 and 1 for stable training and denoising, we must normalize the coefficients of S extracted from real videos before using them for training. If we scale the magnitudes of the S coefficients to [0, 1] based on image width and height as in prior work [28, 72], almost all the coefficients at higher frequencies will end up close to zero, as shown in Fig. 2 (right). Models trained on such data can produce inaccurate motions, since during inference, even small prediction errors can lead to large relative errors after denormalization when the magnitudes of the normalized S coefficients are very close to zero.

To address this issue, we employ a simple but effective frequency adaptive normalization technique. In particular, we first independently normalize Fourier coefficients at each frequency based on statistics computed from the training set. Namely, at each individual frequency f_j, we compute the 97th percentile of the Fourier coefficient magnitudes over all input samples and use that value as a per-frequency scaling factor s_{f_j}. Furthermore, we apply a power transformation to each scaled Fourier coefficient to pull it away from extremely small or large values. In practice, we found that a square root transform performs better than other transformations, such as log or reciprocal. In summary, the final coefficient values of the stochastic motion texture S(p) at frequency f_j (used for training our LDM) are computed as

    S'_{f_j}(p) = sign(S_{f_j}(p)) · sqrt( |S_{f_j}(p)| / s_{f_j} ).   (4)

As shown in the right plot of Fig. 2, after applying frequency adaptive normalization the stochastic motion texture coefficients no longer concentrate in a range of extremely small values.
Our training applies this not to an input image but to stochas- Frequency-coordinated denoising. The straightforward
tic motion textures from a real video sequence, which are way to to predict a stochastic motion texture S with K fre-
encoded and then diffused for n steps with a pre-defined quency bands is to output a tensor of 4K channels from a
variance schedule to produce noisy latents z n . The 2D U- standard diffusion U-Net. However, as in prior work [7], we
Nets are trained to denoise the noisy latents by iteratively observe that training a model to produce such a large number
estimating the noise θ (z n ; n, c) used to update the latent of channels tends to produce over-smoothed and inaccurate
feature at each step n ∈ (1, 2, ..., N ). The training loss for output. An alternative would be to independently predict a
the LDM is written as motion spectrum map at each individual frequency by in-
LLDM = En∈U [1,N ],n ∈N (0,1) ||n − θ (z n ; n, c)||2 (3)
jecting an extra frequency embedding to the LDM, but this
results in uncorrelated predictions in the frequency domain,
where c is the embedding of any conditional signal, such as leading to unrealistic motion.
text, semantic labels, or, in our case, the first frame of the Therefore, we propose a frequency-coordinated denois-
training video sequence, I0 . The clean latent features z 0 are ing strategy as shown in Fig. 3. In particular, given an input
then passed through the decoder to recover the stochastic image I0 , we first train an LDM θ to predict a stochastic
motion textures. motion texture map Sfj with four channels to represent each
Frequency adaptive normalization. One issue we ob- individual frequency fj , where we inject extra frequency
served is that stochastic motion textures have particular dis- embedding along with time-step embedding to the LDM
tribution characteristics across frequencies. As visualized in network. We then freeze the parameters of this LDM model
Train
Spatial layer
Reshape
Inference
… Frequency Attention
…
Iterative denoising
Reshape
Noisy latent
Figure 3. Motion prediction module. We predict a neural stochastic motion texture S through a frequency-coordinated denoising model.
Each block of the diffusion network θ interleaves 2D spatial layers with frequency cross-attention layers (red box, right), and iteratively
denoises latent features z n . The denoised features are fed to a VAE decoder D to produce S. During training, we concatenate the downsampled
input I0 with noisy latent features encoded from a real motion texture via a VAE encoder E, and replace the noisy features with Gaussian
noise z N during inference (left).
Therefore, we propose a frequency-coordinated denoising strategy, as shown in Fig. 3. In particular, given an input image I_0, we first train an LDM ε_θ to predict a stochastic motion texture map S_{f_j} with four channels representing each individual frequency f_j, where we inject an extra frequency embedding along with the time-step embedding into the LDM network. We then freeze the parameters of this LDM model ε_θ and introduce attention layers, interleaving them with the 2D spatial layers of ε_θ across the K frequency bands. Specifically, for a batch of B input images, the 2D spatial layers of ε_θ treat the corresponding B·K noisy latent features of channel size C as independent samples of shape (B·K) × C × H × W. The cross-attention layers then interpret these as consecutive features spanning the frequency axis: we reshape the latent features from the preceding 2D spatial layers to B × K × C × H × W before feeding them to the attention layers. In other words, the frequency attention layers are used to coordinate the pre-trained motion latent features across all frequency channels in order to produce coherent stochastic motion textures. In our experiments, we observed that the average VAE reconstruction error improves from 0.024 to 0.018 when we switch from a standard 2D U-Net to a frequency-coordinated denoising module, suggesting an improved upper bound on LDM prediction accuracy; in our ablation study in Sec. 7.6, we also demonstrate that this design choice improves video generation quality compared with the simpler configurations mentioned above.
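A minimal PyTorch sketch of the reshaping logic behind these frequency cross-attention layers is shown below: the spatial layers see the B·K latents as independent samples, while the attention layer regroups them and attends across the K bands at each spatial location. The module is illustrative only; the exact placement and parameterization inside the paper's U-Net blocks are not reproduced here.

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Coordinates latent features across K frequency bands.
    Input/output layout matches the 2D spatial layers: (B*K, C, H, W)."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, z, B, K):
        BK, C, H, W = z.shape
        assert BK == B * K
        x = z.view(B, K, C, H, W).permute(0, 3, 4, 1, 2)   # (B, H, W, K, C)
        x = x.reshape(B * H * W, K, C)                     # sequences over frequency
        x, _ = self.attn(x, x, x)                          # attend across the K bands
        x = x.reshape(B, H, W, K, C).permute(0, 3, 4, 1, 2)
        return x.reshape(B * K, C, H, W)                   # back to spatial-layer layout

# toy usage: batch of 2 images, K = 16 frequency bands, 8-channel latents
layer = FrequencyAttention(channels=8)
z = torch.randn(2 * 16, 8, 4, 4)
out = layer(z, B=2, K=16)   # same shape as z
```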
5. Image-based rendering

We now describe how we take a stochastic motion texture S predicted for a given input image I_0 and render a future frame Î_t at time t. We first derive motion trajectory fields in the time domain using the inverse temporal FFT applied at each pixel, F(p) = FFT⁻¹(S(p)). The motion trajectory fields determine the position of every input pixel at every future time step. To produce a future frame Î_t, we adopt a deep image-based rendering technique and perform splatting with the predicted motion field F_t to forward-warp the encoded I_0, as shown in Fig. 4. Since forward warping can lead to holes, and multiple source pixels can map to the same output 2D location, we adopt the feature pyramid softmax splatting strategy proposed in prior work on frame interpolation [62]. Specifically, we encode I_0 through a feature extractor network to produce a multi-scale feature map M = {M_j | j = 0, ..., J}. For each individual feature map M_j at scale j, we resize and scale the predicted 2D motion field F_t according to the resolution of M_j. We use flow magnitude, as a proxy for geometry, to determine the contributing weight of each source pixel mapped to its destination location. In particular, we compute a per-pixel weight W(p) = (1/T) Σ_t ||F_t(p)||₂ as the average magnitude of the predicted motion trajectory fields. In other words, we assume large motions correspond to moving foreground objects, and small or zero motions correspond to background objects. We use motion-derived weights instead of learnable ones because we observe that in the single-view case, learnable weights are not effective for addressing disocclusion ambiguities, as shown in the second column of Fig. 5. With the motion field F_t and weights W, we apply softmax splatting to warp the feature map at each scale, producing a warped feature M'_{j,t} = W_softmax(M_j, F_t, W), where W_softmax is the softmax splatting operation. The warped features M'_{j,t} are then injected into intermediate blocks of an image synthesis decoder network to produce a final rendered image Î_t.

We jointly train the feature extractor and synthesis networks with start and target frames (I_0, I_t) randomly sampled from real videos, where we use the estimated flow field from I_0 to I_t to warp encoded features from I_0, and supervise predictions Î_t against I_t with a VGG perceptual loss [45]. As shown in Fig. 5, compared to direct average splatting and a baseline deep warping method [42], our motion-aware feature splatting produces a frame without holes or artifacts.
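The following NumPy sketch is a simplified rendition of the motion-weighted splatting described above: it computes the per-pixel weight W(p) and scatters features to their destinations with exponential weights. It uses nearest-neighbor scattering rather than the bilinear kernels of the reference softmax splatting implementation [62], and the feature layout (H, W, C) is our assumption.

```python
import numpy as np

def motion_weights(F_traj):
    """F_traj: (T, H, W, 2) displacement fields. Per-pixel weight W(p) is the
    average motion magnitude over the trajectory, used as a depth proxy."""
    return np.linalg.norm(F_traj, axis=-1).mean(axis=0)          # (H, W)

def softmax_splat(feat, F_t, W):
    """Simplified softmax splatting: each source pixel contributes its feature
    to the nearest destination pixel with weight exp(W), and the accumulated
    contributions are normalized."""
    H, Wd, C = feat.shape
    out = np.zeros((H, Wd, C))
    denom = np.zeros((H, Wd, 1))
    ys, xs = np.mgrid[0:H, 0:Wd]
    xd = np.clip(np.round(xs + F_t[..., 0]).astype(int), 0, Wd - 1)
    yd = np.clip(np.round(ys + F_t[..., 1]).astype(int), 0, H - 1)
    w = np.exp(W - W.max())[..., None]                           # stabilized weights
    np.add.at(out, (yd, xd), w * feat)
    np.add.at(denom, (yd, xd), w)
    return out / np.maximum(denom, 1e-8)

# toy usage
F_traj = np.random.randn(150, 8, 8, 2)
weights = motion_weights(F_traj)                                 # W(p)
warped = softmax_splat(np.random.rand(8, 8, 16), F_traj[10], weights)
```

Because larger weights dominate the normalized sum, pixels with larger average motion (assumed foreground) win when several sources land on the same destination, which is the disocclusion behavior the motion-derived weights are meant to provide.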
[Figure 4: image-based rendering module; multi-scale features of I_0 are warped by softmax splatting, subject to the weights W.]

Figure 5. From left to right, we show a rendered future frame with (a) average splatting in RGB pixel space, (b) softmax splatting with learnable weights [42], and (c) our motion-aware feature splatting.
… to generate stochastic motion textures. For our ablation study, we run DDIM for 200 steps and set η = 0 for all the configurations. We also show generated videos of up to a resolution of 512 × 288, created by fine-tuning our models on higher-resolution data.

We adopt ResNet-34 [36] as our multi-scale feature extractor. Our image synthesis network is based on a co-modulation StyleGAN architecture, a prior conditional image generation and inpainting model [53, 100]. Our rendering module runs in real time at 25 FPS on a single Nvidia V100 GPU during inference.

We adopt the universal guidance technique [3] to generate seamless looping videos, where we set weights w = 1.5, u = 200 and the number of self-recurrence iterations to 2. We refer readers to the supplementary material for full details of network architectures and hyper-parameter settings.

Figure 6. Sliding Window FID and DT-FVD. We show sliding window FID with a window size of 30 frames, and DT-FVD with a window size of 16 frames, for videos generated by different methods.

7.2. Data and baselines

Data. Since our focus is on natural scenes exhibiting oscillatory motion such as trees, flowers, and candles moving in the wind, we collect and process a set of 2,631 videos of such phenomena from online sources as well as from our own captures, where we withhold 10% of the videos for testing and use the remainder for training. To generate ground truth stochastic motion textures for training our motion prediction module, we apply a coarse-to-fine image-pyramid-based optical flow algorithm [10, 57] between selected starting frames and every future frame within a video sequence. Note that we found the choice of optical flow method to be crucial: we observed that deep-learning-based flow estimators tend to produce over-smoothed flow fields, leading to blobby or unrealistic animations. We treat every 10th frame from each training video as a starting image and generate corresponding ground truth stochastic motion textures using the following 149 frames. We filter out samples with incorrect motion estimates or significant camera motion by removing examples with an average flow motion magnitude larger than 8 pixels, or where all pixels have an average motion magnitude larger than one pixel. In total, our data consists of more than 130K samples of image-motion pairs.
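The clip-filtering criteria above translate directly into a small check; the sketch below uses the thresholds stated in the text, while the flow estimator, array shapes, and the exact way the averages are taken are our assumptions.

```python
import numpy as np

def keep_clip(flows, max_mean_mag=8.0, camera_motion_mag=1.0):
    """flows: (T, H, W, 2) optical flow from the starting frame to each future
    frame (any per-pixel flow estimator works here). A clip is dropped if its
    average flow magnitude exceeds `max_mean_mag` pixels, or if every pixel
    moves more than `camera_motion_mag` pixels on average (a heuristic
    indicating significant camera motion)."""
    mag = np.linalg.norm(flows, axis=-1)          # (T, H, W)
    per_pixel_mean = mag.mean(axis=0)             # average over time
    if per_pixel_mean.mean() > max_mean_mag:
        return False                              # overall motion too large
    if (per_pixel_mean > camera_motion_mag).all():
        return False                              # everything moves: likely camera motion
    return True

# toy usage: a starting frame with flows to the following 149 frames
flows = np.random.randn(149, 64, 64, 2) * 0.5
print(keep_clip(flows))
```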
Baselines. We compare our approach to several recent single-image animation and video prediction methods. Both Endo et al. [28] and DMVFN [44] predict instantaneous 2D motion fields and future frames in an auto-regressive manner. Other recent work such as Stochastic Image-to-Video (I2V) [26] and MCVD [83] adopts either VAEs or diffusion models to predict video frames directly from a single picture. LFDM [61] predicts flow fields in latent space with a diffusion model, then uses those flow fields to warp the encoded input image, generating future frames via a decoder. We apply these models autoregressively to generate longer videos by taking the last output frame and using it as the input to another round of generation until the video reaches a length of 150 frames. We train all the above methods on our data using their respective open-source implementations.

Figure 7. X-t slices of videos generated by different approaches. From left to right: input image and corresponding X-t video slices from the ground truth video, from videos generated by three baselines [26, 28, 83], and finally videos generated by our approach.

7.3. Metrics

We evaluate the quality of the videos generated by our approach and by prior baselines in two main ways. First, we evaluate the quality of individual synthesized frames using metrics designed for image synthesis tasks. We adopt the Fréchet Inception Distance (FID) [38] and Kernel Inception Distance (KID) [5] to measure the average distance between the distribution of generated frames and the distribution of ground truth frames.
We further use a sliding window FID (FID_sw) with a window size of 30 frames, as proposed by [53, 56], to measure how generated frame quality degrades over time.

Second, to evaluate the quality and temporal coherence of synthesized videos in both the spatial and temporal domains, we adopt the Fréchet Video Distance (FVD) [82], which is based on an I3D model [11] trained on the Kinetics dataset [48]. To more faithfully reflect synthesis quality for the natural oscillation motions we seek to generate, we also adopt the Dynamic Texture Fréchet Video Distance (DT-FVD) proposed by Dorkenwald et al. [26], which measures FVD with an I3D model trained on the Dynamic Textures Database [34], a dataset consisting primarily of natural motion textures. Similarly, we introduce a sliding window FVD with a window size of 16 to measure how generated video quality degrades over time. For all the methods, we evaluate each error metric on a 256 × 128 central crop of the predicted videos, with 150 frames generated without temporal interpolation, at 256 × 128 resolution.

Table 2. Ablation study. We run all configurations using DDIM with 200 steps. Please see Sec. 7.6 for the details of the different configurations. The first three columns measure image synthesis quality; the last two measure video synthesis quality.

  Method                FID↓   FID_sw↓  KID↓   FVD↓    DT-FVD↓
  K = 4                 3.20   4.15     0.03   30.18   1.98
  K = 8                 3.25   4.30     0.04   28.81   1.85
  K = 24                3.26   4.25     0.04   27.50   1.58
  Scale w/ resolution   3.75   4.34     0.05   35.05   1.93
  Independent pred.     3.20   4.21     0.04   36.30   1.80
  Volume pred.          3.56   4.61     0.04   30.67   1.80
  Average splat         4.22   5.14     0.07   28.62   1.76
  Baseline splat [42]   3.69   4.73     0.05   27.98   1.68
  Full (K = 16)         3.21   4.21     0.04   27.63   1.60

7.4. Quantitative results

Table 1 shows quantitative comparisons between our approach and baselines on our test set of unseen video clips. Our approach significantly outperforms prior single-image animation baselines in terms of both image and video synthesis quality. Specifically, our much lower FVD and DT-FVD distances suggest that the videos generated by our approach are more realistic and more temporally coherent. Further, Fig. 6 shows the sliding window FID and sliding window DT-FVD distances of generated videos from different methods. Thanks to our global stochastic motion texture representation, videos generated by our approach are more temporally consistent and do not suffer from drift or degradation over time.

7.5. Qualitative results

We visualize qualitative comparisons between videos generated by our approach and by baselines in two ways. First, we show spatio-temporal X-t slices of the generated videos, a standard way of visualizing small or subtle motions in a video [85].
As shown in Fig. 7, our generated video dynamics more strongly resemble the motion patterns observed in the corresponding real reference videos (second column), compared to other methods. Baselines such as Stochastic I2V [26] and MCVD [83] fail to model both appearance and motion realistically over time. Endo et al. [28] produces video frames with fewer artifacts but exhibits over-smoothed or non-oscillating motions.

Figure 8. Visual comparisons of generated future frames and corresponding motion fields. By inspecting differences with a reference image from the ground truth video, we observe that our approach produces more realistic textures and motions compared with baselines. We refer readers to the supplementary video for full results.

We also qualitatively compare the quality of individual generated frames and motions across different methods by visualizing the predicted image Î_t and its corresponding motion displacement field at time t = 128. Fig. 8 shows that the frames generated by our approach exhibit fewer artifacts and distortions compared to other methods, and that our corresponding 2D motion fields most closely resemble the reference displacement fields estimated from the corresponding real videos. In contrast, the background content generated by other methods tends to drift, as shown in the flow visualizations in the even-numbered rows. Moreover, the video frames generated by other methods exhibit significant color distortion or ghosting artifacts, suggesting that the baselines are less stable when generating videos of long duration.

7.6. Ablation study

We conduct an ablation study to validate the major design choices in our motion prediction and rendering modules, comparing our full configuration with different variants. Specifically, we evaluate results using different numbers of frequency bands K = 4, 8, 16, and 24. We observe that increasing the number of frequency bands improves video prediction quality, but the improvement is marginal when using more than 16 frequencies. Next, we remove adaptive frequency normalization from the ground truth stochastic motion textures, and instead just scale them based on input image width and height (Scale w/ resolution). Additionally, we remove the frequency-coordinated denoising module (Independent pred.), or replace it with a simpler module where a tensor volume of 4K-channel stochastic motion textures is predicted jointly via a standard 2D U-Net diffusion model (Volume pred.). Finally, we compare results where we render video frames using average splatting (Average splat), or use a baseline rendering method that applies softmax splatting over single-scale features subject to learnable weights, as used in Holynski et al. [42] (Baseline splat). From Table 2, we observe that all simpler or alternative configurations lead to worse performance compared with our full model.

8. Discussion and conclusion

Limitations. Since our approach only predicts stochastic motion textures at low frequencies, it might fail to model general non-oscillating motions or high-frequency vibrations such as those of musical instruments. Furthermore, the quality of our generated videos relies on the quality of the motion trajectories estimated from the real video sequences. Thus, we observed that animation quality can degrade if observed motions in the real videos consist of large displacements.
Moreover, since our approach is based on image-based rendering from input pixels, the animation quality can also degrade if the generated videos require the creation of large amounts of content unseen in the input frame.

Conclusion. We present a new approach for modeling natural oscillation dynamics from a single still picture. Our image-space motion prior is represented with a neural stochastic motion texture, a frequency representation of per-pixel motion trajectories, which is learned from collections of real-world videos. Our stochastic motion textures are predicted using our frequency-coordinated latent diffusion model and are used to animate future video frames using a neural image-based rendering module. We show that our approach produces photo-realistic animations from a single picture and significantly outperforms prior baseline methods, and that it can enable other downstream applications such as creating interactive animations.

Acknowledgements. We thank Rick Szeliski, Andrew Liu, Boyang Deng, Qianqian Wang, Xuan Luo, and Lucy Chai for fruitful discussions and helpful comments.

References

[1] Aseem Agarwala, Ke Colin Zheng, Chris Pal, Maneesh Agrawala, Michael Cohen, Brian Curless, David Salesin, and Richard Szeliski. Panoramic video textures. In ACM SIGGRAPH 2005 Papers, pages 821–827, 2005.
[2] Hyemin Ahn, Esteve Valls Mascaro, and Dongheui Lee. Can we use diffusion probabilistic models for 3D motion prediction? arXiv preprint arXiv:2302.14503, 2023.
[3] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 843–852, 2023.
[4] Hugo Bertiche, Niloy J Mitra, Kuldeep Kulkarni, Chun-Hao P Huang, Tuanfeng Y Wang, Meysam Madadi, Sergio Escalera, and Duygu Ceylan. Blowing in the wind: CycleNet for human cinemagraphs from still images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 459–468, 2023.
[5] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
[6] Andreas Blattmann, Timo Milbich, Michael Dorkenwald, and Björn Ommer. iPOKE: Poking a still image for controlled stochastic video synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14707–14717, 2021.
[7] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
[8] Richard Strong Bowen, Richard Tucker, Ramin Zabih, and Noah Snavely. Dimensions of motion: Monocular prediction through flow subspaces. In 2022 International Conference on 3D Vision (3DV), pages 454–464. IEEE, 2022.
[9] Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35:31769–31781, 2022.
[10] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In Computer Vision – ECCV 2004, Proceedings, Part IV, pages 25–36. Springer, 2004.
[11] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[12] Dan Casas, Marco Volino, John Collomosse, and Adrian Hilton. 4D video textures for interactive character appearance. In Computer Graphics Forum, volume 33, pages 371–380. Wiley Online Library, 2014.
[13] Antoni B Chan and Nuno Vasconcelos. Mixtures of dynamic textures. In Tenth IEEE International Conference on Computer Vision (ICCV'05), volume 1, pages 641–647. IEEE, 2005.
[14] Antoni B Chan and Nuno Vasconcelos. Classifying video with kernel dynamic textures. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–6. IEEE, 2007.
[15] Antoni B Chan and Nuno Vasconcelos. Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):909–926, 2008.
[16] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
[17] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.
[18] Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023.
[19] Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.
[20] Yung-Yu Chuang, Dan B Goldman, Ke Colin Zheng, Brian Curless, David H Salesin, and Richard Szeliski. Animating pictures with stochastic motion textures. In ACM SIGGRAPH 2005 Papers, pages 853–860, 2005.
[21] Vincent C Couture, Michael S Langer, and Sebastien Roy. Omnistereo video textures without ghosting. In 2013 International Conference on 3D Vision (3DV), pages 64–70. IEEE, 2013.
[22] Abe Davis, Justin G Chen, and Frédo Durand. Image-space modal bases for plausible manipulation of objects in video. ACM Transactions on Graphics (TOG), 34(6):1–7, 2015.
[23] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[24] Julien Diener, Mathieu Rodriguez, Lionel Baboud, and Lionel Reveret. Wind projection basis for real-time animation of trees. In Computer Graphics Forum, volume 28, pages 533–540. Wiley Online Library, 2009.
[25] Gianfranco Doretto, Alessandro Chiuso, Ying Nian Wu, and Stefano Soatto. Dynamic textures. International Journal of Computer Vision, 51:91–109, 2003.
[26] Michael Dorkenwald, Timo Milbich, Andreas Blattmann, Robin Rombach, Konstantinos G. Derpanis, and Bjorn Ommer. Stochastic image-to-video synthesis using cINNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3742–3753, 2021.
[27] Yuming Du, Robin Kips, Albert Pumarola, Sebastian Starke, Ali Thabet, and Artsiom Sanakoyeu. Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 481–490, 2023.
[28] Yuki Endo, Yoshihiro Kanamori, and Shigeru Kuriyama. Animating landscape: Self-supervised learning of decoupled motion and appearance for single-image video synthesis. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2019), 38(6):175:1–175:19, 2019.
[29] Dave Epstein, Allan Jabri, Ben Poole, Alexei A Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. arXiv preprint arXiv:2306.00986, 2023.
[30] Matthew Flagg, Atsushi Nakazawa, Qiushuang Zhang, Sing Bing Kang, Young Kee Ryu, Irfan Essa, and James M Rehg. Human video textures. In Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games, pages 199–206, 2009.
[31] Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari. Stochastic latent residual video prediction. In International Conference on Machine Learning, pages 3233–3246. PMLR, 2020.
[32] Ruohan Gao, Bo Xiong, and Kristen Grauman. Im2Flow: Motion hallucination from static images for action recognition. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[33] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
[34] Isma Hadji and Richard P Wildes. A new large scale dynamic texture dataset with application to convnet understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 320–335, 2018.
[35] Zekun Hao, Xun Huang, and Serge Belongie. Controllable video generation with sparse trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7854–7863, 2018.
[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[37] Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint arXiv:2307.06940, 2023.
[38] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[39] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[40] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[41] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[42] Aleksander Holynski, Brian L Curless, Steven M Seitz, and Richard Szeliski. Animating pictures with Eulerian motion fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5810–5819, 2021.
[43] Tobias Hoppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. Transactions on Machine Learning Research, 2022.
[44] Xiaotao Hu, Zhewei Huang, Ailin Huang, Jun Xu, and Shuchang Zhou. A dynamic multi-scale voxel flow network for video prediction. arXiv preprint arXiv:2303.09875, 2023.
[45] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision – ECCV 2016, Proceedings, Part II, pages 694–711. Springer, 2016.
[46] Hitoshi Kanda and Jun Ohya. Efficient, realistic method for animating dynamic behaviors of 3D botanical trees. In 2003 International Conference on Multimedia and Expo (ICME'03), volume 2, pages II-89. IEEE, 2003.
[47] Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. DreamPose: Fashion image-to-video synthesis via stable diffusion. arXiv preprint arXiv:2304.06025, 2023.
[48] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[49] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
[50] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T Freeman. Learning the depths of moving people by watching frozen people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4521–4530, 2019.
[51] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[52] Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. DynIBaR: Neural dynamic image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4273–4284, 2023.
[53] Zhengqi Li, Qianqian Wang, Noah Snavely, and Angjoo Kanazawa. InfiniteNature-Zero: Learning perpetual view generation of natural scenes from single images. In European Conference on Computer Vision, pages 515–534. Springer, 2022.
[54] Jing Liao, Mark Finch, and Hugues Hoppe. Fast computation of seamless video loops. ACM Transactions on Graphics (TOG), 34(6):1–10, 2015.
[55] Zicheng Liao, Neel Joshi, and Hugues Hoppe. Automated video looping with progressive dynamism. ACM Transactions on Graphics (TOG), 32(4):1–10, 2013.
[56] Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite Nature: Perpetual view generation of natural scenes from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14458–14467, 2021.
[57] Ce Liu. Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology, 2009.
[58] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. VideoFusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10209–10218, 2023.
[59] Long Mai and Feng Liu. Motion-adjustable neural implicit video representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10738–10747, 2022.
[60] Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Implicit warping for animation with image sets. Advances in Neural Information Processing Systems, 35:22438–22450, 2022.
[61] Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18444–18455, 2023.
[62] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5437–5446, 2020.
[63] Tae-Hyun Oh, Ronnachai Jaroensri, Changil Kim, Mohamed Elgharib, Frédo Durand, William T Freeman, and Wojciech Matusik. Learning-based video motion magnification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 633–648, 2018.
[64] Shin Ota, Machiko Tamura, Kunihiko Fujita, T Fujimoto, K Muraoka, and Norishige Chiba. 1/f^β noise-based real-time animation of trees swaying in wind fields. In Proceedings Computer Graphics International 2003, pages 52–59. IEEE, 2003.
[65] Automne Petitjean, Yohan Poirier-Ginter, Ayush Tewari, Guillaume Cordonnier, and George Drettakis. ModalNeRF: Neural modal analysis and synthesis for free-viewpoint navigation in dynamically vibrating scenes. In Computer Graphics Forum, volume 42, 2023.
[66] Silvia L. Pintea, Jan C. van Gemert, and Arnold W. M. Smeulders. Déjà vu: Motion prediction in static images. In Proc. European Conf. on Computer Vision (ECCV), 2014.
[67] Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H Bermano, and Daniel Cohen-Or. Single motion diffusion. arXiv preprint arXiv:2302.05905, 2023.
[68] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[69] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[70] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[71] Payam Saisan, Gianfranco Doretto, Ying Nian Wu, and Stefano Soatto. Dynamic texture recognition. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 2, pages II-II. IEEE, 2001.
[72] Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, and David J. Fleet. The surprising effectiveness of diffusion models for optical flow and monocular depth estimation, 2023.
[73] Arno Schödl, Richard Szeliski, David H Salesin, and Irfan Essa. Video textures. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 489–498, 2000.
[74] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2377–2386, 2019.
[75] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in Neural Information Processing Systems, 32, 2019.
[76] Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov. Motion representations for articulated animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13653–13662, 2021.
[77] Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3626–3636, 2022.
[78] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[79] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[80] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[81] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.
[82] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
[83] Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher Pal. MCVD: Masked conditional video diffusion for prediction, generation, and interpolation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[84] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Neural Information Processing Systems, 2016.
[85] Neal Wadhwa, Michael Rubinstein, Frédo Durand, and William T Freeman. Phase-based video motion processing. ACM Transactions on Graphics (TOG), 32(4):1–10, 2013.
[86] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[87] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In Proceedings of the IEEE International Conference on Computer Vision, pages 2443–2451, 2015.
[88] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. VideoComposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023.
[89] Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. Latent image animator: Learning to animate images via latent space navigation. arXiv preprint arXiv:2203.09043, 2022.
[90] Frederik Warburg, Ethan Weber, Matthew Tancik, Aleksander Holynski, and Angjoo Kanazawa. Nerfbusters: Removing ghostly artifacts from casually captured NeRFs. arXiv preprint arXiv:2304.10532, 2023.
[91] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022.
[92] Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. Photo wake-up: 3D character animation from a single photo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5908–5917, 2019.
[93] Jamie Wynn and Daniyar Turmukhambetov. DiffusioNeRF: Regularizing neural radiance fields with denoising diffusion models. In CVPR, 2023.
[94] Tianfan Xue, Jiajun Wu, Katherine L Bouman, and William T Freeman. Visual dynamics: Stochastic future generation via layered cross convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2236–2250, 2019.
[95] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.
[96] Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18456–18466, 2023.
[97] Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, and Ziwei Liu. ReMoDiffuse: Retrieval-augmented motion diffusion model. arXiv preprint arXiv:2304.01116, 2023.
[98] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023.
[99] Jian Zhao and Hui Zhang. Thin-plate spline motion model for image animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3657–3666, 2022.
[100] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. In International Conference on Learning Representations (ICLR), 2021.
[101] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.