OpenCL: Add CPU fallback for unsupported operations #13621
@rmatif Thank you for trying it out.

> I was wondering if the OpenCL backend could be enabled for …

If an op is not supported, GGML automatically falls it back to the CPU. To enable this, I think the new backend API should be used. In particular, I think …
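The fallback idea can be sketched with a toy scheduler. This is not the real ggml API; `opencl_supports`, `schedule`, and the op names are illustrative stand-ins for the backend's `supports_op` callback. Each node in the graph is assigned to the GPU backend if that backend supports the op, and to the CPU otherwise:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration of ggml-style backend scheduling: each
// graph node runs on the preferred backend when it supports the op,
// with the CPU as the universal fallback.
enum class Backend { OpenCL, CPU };

static bool opencl_supports(const std::string &op) {
    // Illustrative subset; the real list lives in the backend's supports_op().
    static const std::set<std::string> supported = {"MUL_MAT", "ADD", "SOFT_MAX"};
    return supported.count(op) > 0;
}

static std::vector<Backend> schedule(const std::vector<std::string> &graph) {
    std::vector<Backend> plan;
    for (const auto &op : graph) {
        plan.push_back(opencl_supports(op) ? Backend::OpenCL : Backend::CPU);
    }
    return plan;
}
```

With this kind of scheduling, a graph containing an op the GPU backend cannot run no longer crashes; the unsupported node simply executes on the CPU.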
Thanks for the hint! I gave it a shot with this PR: leejet/stable-diffusion.cpp#680, plus this small fix: ggml-org/ggml@8606b82. Unfortunately, the performance is disappointing: it is 2x to 3x slower than the CPU on a Snapdragon 8 Gen 3. I'm not sure whether I'm doing something wrong, but the results are similar to what we observed in llama.cpp. My memory allocation isn't optimized, as this was just a quick test to see whether it was worth pursuing. If you have some time to take a look, that would be great, but in its current state I don't think it's worth adding this backend to stable-diffusion.cpp.

[DEBUG] stable-diffusion.cpp:183 - Using OpenCL backend
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.42.23.12
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 887 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels......................................
[INFO ] stable-diffusion.cpp:205 - loading model from 'realisticVisionV60B1_v51HyperVAE_q4_0.gguf'
[INFO ] model.cpp:909 - load realisticVisionV60B1_v51HyperVAE_q4_0.gguf using gguf format
[DEBUG] model.cpp:926 - init from 'realisticVisionV60B1_v51HyperVAE_q4_0.gguf'
[INFO ] stable-diffusion.cpp:252 - Version: SD 1.x
[INFO ] stable-diffusion.cpp:285 - Weight type: q4_0
[INFO ] stable-diffusion.cpp:286 - Conditioner weight type: q4_0
[INFO ] stable-diffusion.cpp:287 - Diffusion model weight type: q4_0
[INFO ] stable-diffusion.cpp:288 - VAE weight type: q4_0
[DEBUG] stable-diffusion.cpp:290 - ggml tensor size = 400 bytes
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1213 - clip params backend buffer size = 66.61 MB(VRAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1213 - unet params backend buffer size = 1272.85 MB(VRAM) (686 tensors)
[DEBUG] ggml_extend.hpp:1213 - vae params backend buffer size = 94.47 MB(VRAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:432 - loading weights
[DEBUG] model.cpp:1731 - loading tensors from realisticVisionV60B1_v51HyperVAE_q4_0.gguf
[INFO ] stable-diffusion.cpp:531 - total params memory size = 1433.92MB (VRAM 1433.92MB, RAM 0.00MB): clip 66.61MB(VRAM), unet 1272.85MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:535 - loading model from 'realisticVisionV60B1_v51HyperVAE_q4_0.gguf' completed, taking 2.57s
[INFO ] stable-diffusion.cpp:569 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:613 - finished loaded file
[DEBUG] stable-diffusion.cpp:1561 - txt2img 256x256
[DEBUG] stable-diffusion.cpp:1254 - prompt after extract and remove lora: "cute cat"
[INFO ] stable-diffusion.cpp:703 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1259 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:357 - parse 'cute cat' to [['cute cat', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] ggml_extend.hpp:1148 - clip compute buffer size for OpenCL: 1.40 MB
[DEBUG] ggml_extend.hpp:1148 - clip compute buffer size for CPU: 20.58 MB
[DEBUG] conditioner.hpp:485 - computing condition graph completed, taking 87 ms
[INFO ] stable-diffusion.cpp:1392 - get_learned_condition completed, taking 89 ms
[INFO ] stable-diffusion.cpp:1415 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1452 - generating image: 1/1 - seed 42
[DEBUG] stable-diffusion.cpp:821 - Sample
[DEBUG] ggml_extend.hpp:1148 - unet compute buffer size for OpenCL: 47.07 MB
[DEBUG] ggml_extend.hpp:1148 - unet compute buffer size for CPU: 10.24 MB
|==================================================| 4/4 - 7.67s/it
[INFO ] stable-diffusion.cpp:1491 - sampling completed, taking 30.90s
[INFO ] stable-diffusion.cpp:1499 - generating 1 latent images completed, taking 31.07s
[INFO ] stable-diffusion.cpp:1502 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1148 - vae compute buffer size for OpenCL: 416.00 MB
[DEBUG] ggml_extend.hpp:1148 - vae compute buffer size for CPU: 128.00 MB
[DEBUG] stable-diffusion.cpp:1103 - computing vae [mode: DECODE] graph completed, taking 83.36s
[INFO ] stable-diffusion.cpp:1512 - latent 1 decoded, taking 83.36s
[INFO ] stable-diffusion.cpp:1516 - decode_first_stage completed, taking 83.36s
[INFO ] stable-diffusion.cpp:1641 - txt2img completed in 114.53s
save result PNG image to 'output.png'
@rmatif I think it's somewhat expected that it will be slow: there is additional overhead from transferring data between the GPU and CPU whenever unsupported ops fall back to the CPU. I will play with stable-diffusion.cpp a bit; I think we will need to add the missing ops to make it faster.
@lhez It would be really great if you have some time to look into it; I believe it has a lot of potential. I tested MNN, and they offer a Stable Diffusion demo that performs really well with OpenCL. Their approach is different from sdcpp, though: they don't compute the graph on the fly; instead it's baked into the weights and aggressively optimized ahead of time, and, as with ONNX, the results are quite bad. Still, maybe some inspiration could be drawn from their kernels.
I'm trying to add OpenCL backend support in leejet/stable-diffusion.cpp#680, but it crashes when the backend encounters an unsupported operation.
Example for the SD1.5 model:
Is it possible to add CPU fallback when some ops are not supported by the backend?
Ideally, the missing ops would be implemented, but until that happens a CPU fallback would be nice.
In case it helps, I listed all the ops that stable-diffusion.cpp needs to run the supported models, should someone have the motivation to implement some of the missing ones:
@lhez @max-krasnyansky