Diffusers documentation
Stable Video Diffusion

Stable Video Diffusion (SVD)์€ ์ž…๋ ฅ ์ด๋ฏธ์ง€์— ๋งž์ถฐ 2~4์ดˆ ๋ถ„๋Ÿ‰์˜ ๊ณ ํ•ด์ƒ๋„(576x1024) ๋น„๋””์˜ค๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ•๋ ฅํ•œ image-to-video ์ƒ์„ฑ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” SVD๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฏธ์ง€์—์„œ ์งง์€ ๋™์˜์ƒ์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

Before you begin, make sure you have the following libraries installed:

!pip install -q -U diffusers transformers accelerate

์ด ๋ชจ๋ธ์—๋Š” SVD์™€ SVD-XT ๋‘ ๊ฐ€์ง€ ์ข…๋ฅ˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. SVD ์ฒดํฌํฌ์ธํŠธ๋Š” 14๊ฐœ์˜ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•˜๋„๋ก ํ•™์Šต๋˜์—ˆ๊ณ , SVD-XT ์ฒดํฌํฌ์ธํŠธ๋Š” 25๊ฐœ์˜ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•˜๋„๋ก ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” SVD-XT ์ฒดํฌํฌ์ธํŠธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Conditioning ์ด๋ฏธ์ง€ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
"source image of a rocket"
"generated video from source image"

torch.compile

UNet์„ ์ปดํŒŒ์ผํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ์‚ด์ง ์ฆ๊ฐ€ํ•˜์ง€๋งŒ, 20~25%์˜ ์†๋„ ํ–ฅ์ƒ์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

- pipe.enable_model_cpu_offload()
+ pipe.to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
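The same `torch.compile` call preserves the module's numerics while tracing its forward pass; a minimal sketch of the pattern, using a stand-in `nn.Linear` in place of the SVD UNet (so it runs without a GPU or model download):

```python
import torch

# Stand-in for pipe.unet; the compile call itself is what the diff above adds.
module = torch.nn.Linear(4, 4)
compiled = torch.compile(module, mode="reduce-overhead", fullgraph=True)

x = torch.randn(2, 4)
with torch.no_grad():
    eager_out = module(x)
    compiled_out = compiled(x)

# The compiled module computes the same outputs as the eager one.
print(torch.allclose(eager_out, compiled_out, atol=1e-5))
```

The first call to a compiled module is slow (compilation happens then); the speedup shows up on subsequent calls, which is why it pays off across the many denoising steps of a video generation.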

๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ค„์ด๊ธฐ

๋น„๋””์˜ค ์ƒ์„ฑ์€ ๊ธฐ๋ณธ์ ์œผ๋กœ ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ ํฐ text-to-image ์ƒ์„ฑ๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ โ€˜num_framesโ€™๋ฅผ ํ•œ ๋ฒˆ์— ์ƒ์„ฑํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ๋งค์šฐ ๋†’์Šต๋‹ˆ๋‹ค. ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ์ถ”๋ก  ์†๋„์™€ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ ˆ์ถฉํ•˜๋Š” ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ์˜ต์…˜์ด ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ชจ๋ธ ์˜คํ”„๋กœ๋ง ํ™œ์„ฑํ™”: ํŒŒ์ดํ”„๋ผ์ธ์˜ ๊ฐ ๊ตฌ์„ฑ ์š”์†Œ๊ฐ€ ๋” ์ด์ƒ ํ•„์š”ํ•˜์ง€ ์•Š์„ ๋•Œ CPU๋กœ ์˜คํ”„๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค.
  • Feed-forward chunking ํ™œ์„ฑํ™”: feed-forward ๋ ˆ์ด์–ด๊ฐ€ ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ ํฐ ๋‹จ์ผ feed-forward๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๋Œ€์‹  ๋ฃจํ”„๋กœ ๋ฐ˜๋ณตํ•ด์„œ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค.
  • decode_chunk_size ๊ฐ์†Œ: VAE๊ฐ€ ํ”„๋ ˆ์ž„๋“ค์„ ํ•œ๊บผ๋ฒˆ์— ๋””์ฝ”๋”ฉํ•˜๋Š” ๋Œ€์‹  chunk ๋‹จ์œ„๋กœ ๋””์ฝ”๋”ฉํ•ฉ๋‹ˆ๋‹ค. decode_chunk_size=1์„ ์„ค์ •ํ•˜๋ฉด ํ•œ ๋ฒˆ์— ํ•œ ํ”„๋ ˆ์ž„์”ฉ ๋””์ฝ”๋”ฉํ•˜๊ณ  ์ตœ์†Œํ•œ์˜ ๋ฉ”๋ชจ๋ฆฌ๋งŒ ์‚ฌ์šฉํ•˜์ง€๋งŒ(GPU ๋ฉ”๋ชจ๋ฆฌ์— ๋”ฐ๋ผ ์ด ๊ฐ’์„ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค), ๋™์˜์ƒ์— ์•ฝ๊ฐ„์˜ ๊นœ๋ฐ•์ž„์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
- pipe.enable_model_cpu_offload()
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipe.enable_model_cpu_offload()
+ pipe.unet.enable_forward_chunking()
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
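To see how `decode_chunk_size` bounds peak memory, the chunking can be illustrated in plain Python (`chunk_indices` is a hypothetical helper for illustration, not a diffusers API):

```python
def chunk_indices(num_frames: int, decode_chunk_size: int) -> list[list[int]]:
    # Split frame indices into groups of at most decode_chunk_size frames,
    # mirroring how the VAE decodes latents chunk by chunk: peak decode
    # memory scales with the chunk size, not with num_frames.
    return [
        list(range(start, min(start + decode_chunk_size, num_frames)))
        for start in range(0, num_frames, decode_chunk_size)
    ]


# 25 frames (SVD-XT) decoded 8 at a time -> four decode passes.
print([len(chunk) for chunk in chunk_indices(25, 8)])  # [8, 8, 8, 1]

# decode_chunk_size=1 -> 25 passes, one frame each: minimal memory, slowest.
print(len(chunk_indices(25, 1)))  # 25
```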

์ด๋Ÿฌํ•œ ๋ชจ๋“  ๋ฐฉ๋ฒ•๋“ค์„ ์‚ฌ์šฉํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด 8GAM VRAM๋ณด๋‹ค ์ ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Micro-conditioning

Stable Video Diffusion also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video:

  • fps: the frames per second of the generated video.
  • motion_bucket_id: the motion bucket id to use for the generated video. This can be used to control the motion of the generated video. Increasing the motion bucket id increases the motion of the generated video.
  • noise_aug_strength: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.

For example, to generate a video with more motion, use the motion_bucket_id and noise_aug_strength micro-conditioning parameters:

import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
  "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Conditioning ์ด๋ฏธ์ง€ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
