Stable Video Diffusion · CodersSampling/generative-models@059d8e9 · GitHub

Commit 059d8e9

Author: Tim Dockhorn
Commit message: Stable Video Diffusion
1 parent 477d8b9 · commit 059d8e9


59 files changed, +5418 −1646 lines

README.md

Lines changed: 79 additions & 40 deletions
@@ -4,26 +4,48 @@
## News

**November 21, 2023**

- We are releasing Stable Video Diffusion, an image-to-video model, for research purposes:
  - [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid): This model was trained to generate 14 frames at resolution 576x1024 given a context frame of the same size. We use the standard image encoder from SD 2.1, but replace the decoder with a temporally-aware `deflickering decoder`.
  - [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt): Same architecture as `SVD` but finetuned for 25 frame generation.
- We provide a streamlit demo `scripts/demo/video_sampling.py` and a standalone python script `scripts/sampling/simple_video_sample.py` for inference of both models.
- Alongside the model, we release a [technical report](https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets).

![tile](assets/tile.gif)

**July 26, 2023**

- We are releasing two new open models with a permissive [`CreativeML Open RAIL++-M` license](model_licenses/LICENSE-SDXL1.0) (see [Inference](#inference) for file hashes):
  - [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0): An improved version over `SDXL-base-0.9`.
  - [SDXL-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0): An improved version over `SDXL-refiner-0.9`.

![sample2](assets/001_with_eval.png)

**July 4, 2023**

- A technical report on SDXL is now available [here](https://arxiv.org/abs/2307.01952).

**June 22, 2023**

- We are releasing two new diffusion models for research purposes:
  - `SDXL-base-0.9`: The base model was trained on a variety of aspect ratios on images with resolution 1024^2. The base model uses [OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main) for text encoding, whereas the refiner model only uses the OpenCLIP model.
  - `SDXL-refiner-0.9`: The refiner has been trained to denoise small noise levels of high-quality data and as such is not expected to work as a text-to-image model; instead, it should only be used as an image-to-image model.

If you would like to access these models for your research, please apply using one of the following links: [SDXL-0.9-Base model](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9) and [SDXL-0.9-Refiner](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9).
You can apply via either of the two links, and if you are granted access, you can access both.
Please log in to your Hugging Face account with your organization email to request access.

**We plan to do a full release soon (July).**
@@ -32,21 +54,32 @@ Please log in to your Hugging Face Account with your organization email to reque
### General Philosophy

Modularity is king. This repo implements a config-driven approach where we build and combine submodules by calling `instantiate_from_config()` on objects defined in yaml configs. See `configs/` for many examples.
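A minimal sketch of the pattern (assuming the helper is importable from `sgm.util`, as in the old `ldm` codebase; the `torch.nn.Conv2d` target is just an arbitrary example):

```python
from omegaconf import OmegaConf
from sgm.util import instantiate_from_config  # assumed location of the helper

# A config names a `target` class by import path plus its constructor `params`;
# instantiate_from_config imports the class and calls it with those params.
cfg = OmegaConf.create(
    """
    target: torch.nn.Conv2d
    params:
      in_channels: 3
      out_channels: 64
      kernel_size: 3
    """
)

module = instantiate_from_config(cfg)  # equivalent to torch.nn.Conv2d(3, 64, kernel_size=3)
print(type(module))
```

Nesting such blocks is how the yaml configs in `configs/` compose encoders, decoders, conditioners, and losses into a full model.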
### Changelog from the old `ldm` codebase

For training, we use [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), but it should be easy to use other training wrappers around the base modules. The core diffusion model class (formerly `LatentDiffusion`, now `DiffusionEngine`) has been cleaned up:

- No more extensive subclassing! We now handle all types of conditioning inputs (vectors, sequences and spatial conditionings, and all combinations thereof) in a single class: `GeneralConditioner`, see `sgm/modules/encoders/modules.py`.
- We separate guiders (such as classifier-free guidance, see `sgm/modules/diffusionmodules/guiders.py`) from the samplers (`sgm/modules/diffusionmodules/sampling.py`), and the samplers are independent of the model.
- We adopt the ["denoiser framework"](https://arxiv.org/abs/2206.00364) for both training and inference (the most notable change is probably the option to train continuous-time models); a minimal sketch of the idea follows this list:
  * Discrete-time models (denoisers) are simply a special case of continuous-time models (denoisers); see `sgm/modules/diffusionmodules/denoiser.py`.
  * The following features are now independent: weighting of the diffusion loss function (`sgm/modules/diffusionmodules/denoiser_weighting.py`), preconditioning of the network (`sgm/modules/diffusionmodules/denoiser_scaling.py`), and sampling of noise levels during training (`sgm/modules/diffusionmodules/sigma_sampling.py`).
- Autoencoding models have also been cleaned up.
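To make the denoiser framework concrete, here is a sketch of the decomposition from the linked paper (Karras et al., 2022): the denoiser wraps the raw network in σ-dependent scalings, and discrete-time models arise by restricting σ to a fixed grid. The scalings below are the EDM defaults and are an illustrative assumption, not necessarily what this repo's `denoiser_scaling.py` ships:

```python
import torch

def edm_scalings(sigma: torch.Tensor, sigma_data: float = 0.5):
    """EDM preconditioning: skip/output/input scalings and the noise conditioning."""
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = 0.25 * sigma.log()
    return c_skip, c_out, c_in, c_noise

def denoise(network, x_noisy: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise): the object that the
    loss weighting, the sigma sampler, and the samplers all interact with."""
    s = sigma.view(-1, 1, 1, 1)  # broadcast per-sample sigmas over (B, C, H, W)
    c_skip, c_out, c_in, c_noise = edm_scalings(s)
    return c_skip * x_noisy + c_out * network(c_in * x_noisy, c_noise.flatten())

# Tiny smoke test with a dummy "network" that ignores the noise conditioning.
dummy_net = lambda x, t: torch.zeros_like(x)
x = torch.randn(2, 3, 8, 8)
sigma = torch.tensor([0.1, 1.0])
print(denoise(dummy_net, x, sigma).shape)  # torch.Size([2, 3, 8, 8])
```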
## Installation:

<a name="installation"></a>

#### 1. Clone the repo
@@ -60,29 +93,17 @@ cd generative-models
This is assuming you have navigated to the `generative-models` root after cloning it.

**NOTE:** This is tested under `python3.10`. For other python versions, you might encounter version conflicts.

**PyTorch 2.0**

```shell
# install required packages from pypi
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install -r requirements/pt2.txt
```

#### 3. Install `sgm`
@@ -114,8 +135,10 @@ depending on your use case and PyTorch version, manually.
## Inference

We provide a [streamlit](https://streamlit.io/) demo for text-to-image and image-to-image sampling in `scripts/demo/sampling.py`.
We provide file hashes for the complete file as well as for only the saved tensors in the file (see [Model Spec](https://github.com/Stability-AI/ModelSpec) for a script to evaluate that).
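The full-file hash can be reproduced with a plain SHA-256 over the checkpoint bytes; the tensor-only hash is defined by the Model Spec tooling linked above, so this sketch (the file name is a placeholder) covers only the former:

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the complete checkpoint file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(file_sha256("checkpoints/sd_xl_base_1.0.safetensors"))  # placeholder path
```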
The following models are currently supported:

- [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
@@ -136,19 +159,20 @@ The following models are currently supported:
**Weights for SDXL**:

**SDXL-1.0:**
The weights of SDXL-1.0 are available (subject to a [`CreativeML Open RAIL++-M` license](model_licenses/LICENSE-SDXL1.0)) here:

- base model: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/
- refiner model: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/

**SDXL-0.9:**
The weights of SDXL-0.9 are available and subject to a [research license](model_licenses/LICENSE-SDXL0.9).
If you would like to access these models for your research, please apply using one of the following links: [SDXL-base-0.9 model](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9) and [SDXL-refiner-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9).
You can apply via either of the two links, and if you are granted access, you can access both.
Please log in to your Hugging Face account with your organization email to request access.

After obtaining the weights, place them into `checkpoints/`.
Next, start the demo using
@@ -166,6 +190,7 @@ not the same as in previous Stable Diffusion 1.x/2.x versions.
To run the script you need to either have a working installation as above or try an _experimental_ import using only a minimal amount of packages:

```bash
python -m venv .detect
source .detect/bin/activate
```
@@ -177,6 +202,7 @@ pip install --no-deps invisible-watermark
To run the script you need to have a working installation as above. The script is then usable in the following ways (don't forget to activate your virtual environment beforehand, e.g. `source .pt1/bin/activate`):

```bash
# test a single file
python scripts/demo/detect.py <your filename here>
```
@@ -203,11 +229,21 @@ run
```shell
python main.py --base configs/example_training/toy/mnist_cond.yaml
```

**NOTE 1:** Using the non-toy-dataset configs `configs/example_training/imagenet-f8_cond.yaml`, `configs/example_training/txt2img-clipl.yaml` and `configs/example_training/txt2img-clipl-legacy-ucg-training.yaml` for training will require edits depending on the used dataset (which is expected to be stored in tar files in the [webdataset format](https://github.com/webdataset/webdataset)). To find the parts which have to be adapted, search for comments containing `USER:` in the respective config.

**NOTE 2:** This repository supports both `pytorch1.13` and `pytorch2` for training generative models. However, for autoencoder training, as e.g. in `configs/example_training/autoencoder/kl-f4/imagenet-attnfree-logvar.yaml`, only `pytorch1.13` is supported.

**NOTE 3:** Training latent generative models (as e.g. in `configs/example_training/imagenet-f8_cond.yaml`) requires retrieving the checkpoint from [Hugging Face](https://huggingface.co/stabilityai/sdxl-vae/tree/main) and replacing the `CKPT_PATH` placeholder in [this line](configs/example_training/imagenet-f8_cond.yaml#81). The same is to be done for the provided text-to-image configs.
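One way to fetch that checkpoint and obtain a local path to paste over the `CKPT_PATH` placeholder is via `huggingface_hub` (the exact filename inside the `stabilityai/sdxl-vae` repository is an assumption here; check the repo's file list if it differs):

```python
from huggingface_hub import hf_hub_download

# Downloads the SDXL VAE checkpoint into the local HF cache and returns its path.
ckpt_path = hf_hub_download(
    repo_id="stabilityai/sdxl-vae",
    filename="sdxl_vae.safetensors",  # assumed filename
)
print(ckpt_path)  # use this path in place of CKPT_PATH in the config
```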
### Building New Diffusion Models
@@ -216,7 +252,8 @@ python main.py --base configs/example_training/toy/mnist_cond.yaml
The `GeneralConditioner` is configured through the `conditioner_config`. Its only attribute is `emb_models`, a list of different embedders (all inherited from `AbstractEmbModel`) that are used to condition the generative model. All embedders should define whether or not they are trainable (`is_trainable`, default `False`), the classifier-free guidance dropout rate that is used (`ucg_rate`, default `0`), and an input key (`input_key`), for example, `txt` for text-conditioning or `cls` for class-conditioning.
When computing conditionings, the embedder will get `batch[input_key]` as input.
We currently support two- to four-dimensional conditionings, and conditionings of different embedders are concatenated appropriately.
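As a sketch of the shape a `conditioner_config` can take (the `GeneralConditioner` target path and the embedder entries below are written from this description; the embedder classes and their parameters are placeholders, not copied from a shipped config):

```python
from omegaconf import OmegaConf

conditioner_config = OmegaConf.create(
    """
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: True                           # optimized jointly with the model
          ucg_rate: 0.2                                # classifier-free guidance dropout
          input_key: cls                               # embedder receives batch["cls"]
          target: my_project.embedders.ClassEmbedder   # placeholder AbstractEmbModel subclass
          params:
            embed_dim: 512
            n_classes: 1000
        - is_trainable: False
          ucg_rate: 0.1
          input_key: txt                               # embedder receives batch["txt"]
          target: my_project.embedders.TextEmbedder    # placeholder AbstractEmbModel subclass
          params:
            max_length: 77
    """
)
```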
@@ -229,7 +266,8 @@ enough as we plan to experiment with transformer-based diffusion backbones.
#### Loss

The loss is configured through `loss_config`. For standard diffusion model training, you will have to set `sigma_sampler_config`.
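A sketch of what that can look like (both class names are assumptions inferred from the module paths above; check the modules for the exact names and parameters):

```python
from omegaconf import OmegaConf

loss_config = OmegaConf.create(
    """
    target: sgm.modules.diffusionmodules.loss.StandardDiffusionLoss        # assumed class name
    params:
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.EDMSampling    # assumed class name
        params:
          p_mean: -1.2    # log-normal noise-level sampling, as in the EDM paper
          p_std: 1.2
    """
)
```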
#### Sampler config
@@ -239,8 +277,9 @@ guidance.
### Dataset Handling

For large scale training we recommend using the data pipelines from our [data pipelines](https://github.com/Stability-AI/datapipelines) project. The project is contained in the requirements and automatically included when following the steps from the [Installation section](#installation).
Small map-style datasets should be defined here in the repository (e.g., MNIST, CIFAR-10, ...), and return a dict of data keys/values, e.g.,
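The example is cut off by the diff here; for illustration, a map-style dataset of this kind might look like the following sketch (the dataset, the `jpg`/`cls` keys, and the random tensors are placeholders, not the repository's actual example):

```python
from typing import Dict

import torch
from torch.utils.data import Dataset

class ToyDictDataset(Dataset):
    """Minimal map-style dataset whose samples are dicts of named tensors."""

    def __init__(self, num_samples: int = 1000, resolution: int = 32, num_classes: int = 10):
        self.num_samples = num_samples
        self.resolution = resolution
        self.num_classes = num_classes

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        # Stand-ins for an image and a class label; a real dataset (MNIST, CIFAR-10, ...)
        # would load and transform actual data here.
        return {
            "jpg": torch.randn(3, self.resolution, self.resolution),
            "cls": torch.tensor(idx % self.num_classes),
        }
```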

assets/test_image.png (482 KB)

assets/tile.gif (17.8 MB)

configs/example_training/autoencoder/kl-f4/imagenet-attnfree-logvar.yaml

Lines changed: 6 additions & 17 deletions
@@ -29,44 +29,33 @@ model:
```yaml
        in_channels: 3
        out_ch: 3
        ch: 128
        ch_mult: [1, 2, 4]
        num_res_blocks: 4
        attn_resolutions: []
        dropout: 0.0

    decoder_config:
      target: sgm.modules.diffusionmodules.model.Decoder
      params: ${model.params.encoder_config.params}

data:
  target: sgm.data.dataset.StableDataModuleFromConfig
  params:
    train:
      datapipeline:
        urls:
          - DATA-PATH
        pipeline_config:
          shardshuffle: 10000
          sample_shuffle: 10000

        decoders:
          - pil

        postprocessors:
          - target: sdata.mappers.TorchVisionImageTransforms
            params:
              key: jpg
              transforms:
                - target: torchvision.transforms.Resize
                  params:
```
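The new `params: ${model.params.encoder_config.params}` line replaces the previously duplicated decoder parameters with an OmegaConf-style interpolation that points back at the encoder block; a standalone sketch of how such a reference resolves:

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    """
    model:
      params:
        encoder_config:
          params:
            ch: 128
            ch_mult: [1, 2, 4]
        decoder_config:
          params: ${model.params.encoder_config.params}
    """
)

# Interpolations resolve against the config root, so the decoder sees exactly
# the encoder's parameter block without it being written out twice.
resolved = OmegaConf.to_container(cfg, resolve=True)
print(resolved["model"]["params"]["decoder_config"]["params"])
# -> {'ch': 128, 'ch_mult': [1, 2, 4]}
```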
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
```yaml
model:
  base_learning_rate: 4.5e-6
  target: sgm.models.autoencoder.AutoencodingEngine
  params:
    input_key: jpg
    monitor: val/loss/rec
    disc_start_iter: 0

    encoder_config:
      target: sgm.modules.diffusionmodules.model.Encoder
      params:
        attn_type: vanilla-xformers
        double_z: true
        z_channels: 8
        resolution: 256
        in_channels: 3
        out_ch: 3
        ch: 128
        ch_mult: [1, 2, 4, 4]
        num_res_blocks: 2
        attn_resolutions: []
        dropout: 0.0

    decoder_config:
      target: sgm.modules.diffusionmodules.model.Decoder
      params: ${model.params.encoder_config.params}

    regularizer_config:
      target: sgm.modules.autoencoding.regularizers.DiagonalGaussianRegularizer

    loss_config:
      target: sgm.modules.autoencoding.losses.GeneralLPIPSWithDiscriminator
      params:
        perceptual_weight: 0.25
        disc_start: 20001
        disc_weight: 0.5
        learn_logvar: True

        regularization_weights:
          kl_loss: 1.0

data:
  target: sgm.data.dataset.StableDataModuleFromConfig
  params:
    train:
      datapipeline:
        urls:
          - DATA-PATH
        pipeline_config:
          shardshuffle: 10000
          sample_shuffle: 10000

        decoders:
          - pil

        postprocessors:
          - target: sdata.mappers.TorchVisionImageTransforms
            params:
              key: jpg
              transforms:
                - target: torchvision.transforms.Resize
                  params:
                    size: 256
                    interpolation: 3
                - target: torchvision.transforms.ToTensor
          - target: sdata.mappers.Rescaler
          - target: sdata.mappers.AddOriginalImageSizeAsTupleAndCropToSquare
            params:
              h_key: height
              w_key: width

    loader:
      batch_size: 8
      num_workers: 4


lightning:
  strategy:
    target: pytorch_lightning.strategies.DDPStrategy
    params:
      find_unused_parameters: True

  modelcheckpoint:
    params:
      every_n_train_steps: 5000

  callbacks:
    metrics_over_trainsteps_checkpoint:
      params:
        every_n_train_steps: 50000

    image_logger:
      target: main.ImageLogger
      params:
        enable_autocast: False
        batch_frequency: 1000
        max_images: 8
        increase_log_steps: True

  trainer:
    devices: 0,
    limit_val_batches: 50
    benchmark: True
    accumulate_grad_batches: 1
    val_check_interval: 10000
```
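Reading the config above: with a `ch_mult` of length four, the encoder halves the spatial resolution at every level but the last, giving a factor-8 latent space (versus factor 4 for the `kl-f4` config earlier in this commit). A small sanity-check sketch, assuming the usual `ldm`/`sgm` Encoder behavior of one downsample per level except the last:

```python
# Values taken from the config above.
ch_mult = [1, 2, 4, 4]
z_channels = 8
resolution = 256

f = 2 ** (len(ch_mult) - 1)   # spatial downsampling factor -> 8 (a "kl-f8" autoencoder)
latent_hw = resolution // f   # -> 32

print(f"latent shape: {z_channels} x {latent_hw} x {latent_hw}")  # 8 x 32 x 32
```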
