OpenCL: Add CPU fallback for unsupported operations #13621


Closed
rmatif opened this issue May 18, 2025 · 4 comments

rmatif commented May 18, 2025

I'm trying to add OpenCL backend support to stable-diffusion.cpp (leejet/stable-diffusion.cpp#680), but it crashes when the backend encounters an unsupported operation.

Example for the SD1.5 model:

[DEBUG] ggml_extend.hpp:1134 - clip compute buffer size: 1.40 MB(VRAM)
[DEBUG] conditioner.hpp:485  - computing condition graph completed, taking 45 ms
[INFO ] stable-diffusion.cpp:1392 - get_learned_condition completed, taking 91 ms
[INFO ] stable-diffusion.cpp:1415 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1452 - generating image: 1/1 - seed 42
[DEBUG] stable-diffusion.cpp:821  - Sample
[DEBUG] ggml_extend.hpp:1134 - unet compute buffer size: 559.90 MB(VRAM)
ggml_backend_opencl_graph_compute: error: op not supported node_185 (GROUP_NORM)
../ggml/src/ggml-opencl/ggml-opencl.cpp:1679: GGML_ASSERT(ok) failed

Is it possible to add a CPU fallback when some ops are not supported by the backend?
Ideally, the perfect solution would be to implement the missing ops, but until that happens, a CPU fallback would be nice.

In case it helps, I listed all the ops stable-diffusion.cpp needs to run all the supported models, if someone has the motivation to implement some of the missing ones (a small probe for finding the unsupported ones is sketched after the list):

  • ADD
  • CONCAT
  • CONT
  • DIAG_MASK_INF
  • GET_ROWS
  • GROUP_NORM
  • IM2COL
  • MUL
  • MUL_MAT
  • NORM
  • PAD
  • PERMUTE
  • REPEAT
  • RESHAPE
  • RMS_NORM
  • SCALE
  • SOFT_MAX
  • TIMESTEP_EMBEDDING
  • UNARY
  • UPSCALE
  • VIEW
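
As a starting point, the unsupported nodes can be listed before computing by querying ggml_backend_supports_op, the same check the scheduler uses when splitting graphs. A minimal sketch (the helper name is just illustrative):

#include <stdio.h>
#include "ggml.h"
#include "ggml-backend.h"

// Walk a built graph and print every node the backend reports as unsupported.
static void print_unsupported_ops(ggml_backend_t backend, struct ggml_cgraph * graph) {
    for (int i = 0; i < ggml_graph_n_nodes(graph); i++) {
        struct ggml_tensor * node = ggml_graph_node(graph, i);
        if (!ggml_backend_supports_op(backend, node)) {
            fprintf(stderr, "unsupported: %s (%s)\n", node->name, ggml_op_name(node->op));
        }
    }
}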

@lhez @max-krasnyansky

lhez (Contributor) commented May 19, 2025

@rmatif Thank you for trying it out - I was wondering if the OpenCL backend could be enabled for stable-diffusion.cpp.

If an op is not supported, GGML can automatically fall back to the CPU. To enable this, I think the new backend API should be used. In particular, ggml_backend_sched_alloc_graph calls ggml_backend_sched_split_graph, which splits the graph into subgraphs based on the result of ggml_backend_supports_op -- unsupported ops are assigned to the CPU backend in this step.

stable-diffusion.cpp still seems to use ggml_gallocr_alloc_graph (the old way, before the backend API). So the graph does not get split based on op availability, and unsupported ops simply fail instead of falling back to the CPU.
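
Roughly, the wiring would look like this (a minimal sketch, assuming the current ggml backend scheduler API; the exact ggml_backend_sched_new signature has changed between ggml revisions, so check your ggml-backend.h):

#include "ggml.h"
#include "ggml-backend.h"

// Minimal sketch: compute a graph on a GPU backend with CPU fallback.
// Backend order matters: the CPU backend must come last, as it is the
// catch-all for ops the other backends reject.
static void compute_with_fallback(ggml_backend_t gpu, ggml_backend_t cpu,
                                  struct ggml_cgraph * graph) {
    ggml_backend_t backends[2] = { gpu, cpu };
    ggml_backend_buffer_type_t bufts[2] = {
        ggml_backend_get_default_buffer_type(gpu),
        ggml_backend_get_default_buffer_type(cpu),
    };

    // Signature varies across ggml revisions (newer ones add more flags).
    ggml_backend_sched_t sched =
        ggml_backend_sched_new(backends, bufts, 2, GGML_DEFAULT_GRAPH_SIZE, false);

    // Splits the graph via ggml_backend_sched_split_graph(), routing nodes
    // that fail ggml_backend_supports_op() to the CPU backend, and allocates
    // buffers per split. This replaces the old ggml_gallocr_alloc_graph call.
    ggml_backend_sched_alloc_graph(sched, graph);
    ggml_backend_sched_graph_compute(sched, graph);

    ggml_backend_sched_free(sched);
}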

rmatif (Author) commented May 21, 2025

@lhez

Thanks for the hint! I gave it a shot with this PR: leejet/stable-diffusion.cpp#680 and this small fix: ggml-org/ggml@8606b82.

Unfortunately, the performance is a bit disappointing: it's 2x to 3x slower than the CPU on a Snapdragon 8 Gen 3. I'm not sure if I'm doing something wrong, but the results are similar to what we observed in llama.cpp. My memory allocation isn't optimized either, as this was just a quick test to see whether the approach was worth pursuing.

If you have some time to take a look, that would be great. But in its current state, I don't think it's worth adding this backend to stable-diffusion.cpp.

[DEBUG] stable-diffusion.cpp:183  - Using OpenCL backend
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.42.23.12
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 887 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels......................................
[INFO ] stable-diffusion.cpp:205  - loading model from 'realisticVisionV60B1_v51HyperVAE_q4_0.gguf'
[INFO ] model.cpp:909  - load realisticVisionV60B1_v51HyperVAE_q4_0.gguf using gguf format
[DEBUG] model.cpp:926  - init from 'realisticVisionV60B1_v51HyperVAE_q4_0.gguf'
[INFO ] stable-diffusion.cpp:252  - Version: SD 1.x 
[INFO ] stable-diffusion.cpp:285  - Weight type:                 q4_0
[INFO ] stable-diffusion.cpp:286  - Conditioner weight type:     q4_0
[INFO ] stable-diffusion.cpp:287  - Diffusion model weight type: q4_0
[INFO ] stable-diffusion.cpp:288  - VAE weight type:             q4_0
[DEBUG] stable-diffusion.cpp:290  - ggml tensor size = 400 bytes
[DEBUG] clip.hpp:171  - vocab size: 49408
[DEBUG] clip.hpp:182  -  trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1213 - clip params backend buffer size =  66.61 MB(VRAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1213 - unet params backend buffer size =  1272.85 MB(VRAM) (686 tensors)
[DEBUG] ggml_extend.hpp:1213 - vae params backend buffer size =  94.47 MB(VRAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:432  - loading weights
[DEBUG] model.cpp:1731 - loading tensors from realisticVisionV60B1_v51HyperVAE_q4_0.gguf
[INFO ] stable-diffusion.cpp:531  - total params memory size = 1433.92MB (VRAM 1433.92MB, RAM 0.00MB): clip 66.61MB(VRAM), unet 1272.85MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:535  - loading model from 'realisticVisionV60B1_v51HyperVAE_q4_0.gguf' completed, taking 2.57s
[INFO ] stable-diffusion.cpp:569  - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:613  - finished loaded file
[DEBUG] stable-diffusion.cpp:1561 - txt2img 256x256
[DEBUG] stable-diffusion.cpp:1254 - prompt after extract and remove lora: "cute cat"
[INFO ] stable-diffusion.cpp:703  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1259 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:357  - parse 'cute cat' to [['cute cat', 1], ]
[DEBUG] clip.hpp:311  - token length: 77
[DEBUG] ggml_extend.hpp:1148 - clip compute buffer size for OpenCL: 1.40 MB
[DEBUG] ggml_extend.hpp:1148 - clip compute buffer size for CPU: 20.58 MB
[DEBUG] conditioner.hpp:485  - computing condition graph completed, taking 87 ms
[INFO ] stable-diffusion.cpp:1392 - get_learned_condition completed, taking 89 ms
[INFO ] stable-diffusion.cpp:1415 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1452 - generating image: 1/1 - seed 42
[DEBUG] stable-diffusion.cpp:821  - Sample
[DEBUG] ggml_extend.hpp:1148 - unet compute buffer size for OpenCL: 47.07 MB
[DEBUG] ggml_extend.hpp:1148 - unet compute buffer size for CPU: 10.24 MB
  |==================================================| 4/4 - 7.67s/it
[INFO ] stable-diffusion.cpp:1491 - sampling completed, taking 30.90s
[INFO ] stable-diffusion.cpp:1499 - generating 1 latent images completed, taking 31.07s
[INFO ] stable-diffusion.cpp:1502 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1148 - vae compute buffer size for OpenCL: 416.00 MB
[DEBUG] ggml_extend.hpp:1148 - vae compute buffer size for CPU: 128.00 MB
[DEBUG] stable-diffusion.cpp:1103 - computing vae [mode: DECODE] graph completed, taking 83.36s
[INFO ] stable-diffusion.cpp:1512 - latent 1 decoded, taking 83.36s
[INFO ] stable-diffusion.cpp:1516 - decode_first_stage completed, taking 83.36s
[INFO ] stable-diffusion.cpp:1641 - txt2img completed in 114.53s
save result PNG image to 'output.png'

rmatif closed this as completed May 21, 2025
lhez (Contributor) commented May 22, 2025

@rmatif I think it's kind of expected that it will be slow: there is additional overhead from transferring data between the GPU and the CPU whenever unsupported ops fall back to the CPU. I will play with stable-diffusion.cpp a bit. I think we will need to implement the currently unsupported ops to make it faster.
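
One way to quantify that overhead is to ask the scheduler how fragmented the graph ended up; each split boundary implies tensor copies between backends. A small sketch using the scheduler introspection helpers (assuming they are present in your ggml revision):

#include <stdio.h>
#include "ggml-backend.h"

// Log how many subgraph splits and inter-backend tensor copies the scheduler
// produced for the last computed graph - a rough proxy for fallback cost.
static void log_sched_overhead(ggml_backend_sched_t sched) {
    fprintf(stderr, "sched: %d splits, %d copies\n",
            ggml_backend_sched_get_n_splits(sched),
            ggml_backend_sched_get_n_copies(sched));
}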

rmatif (Author) commented May 22, 2025

@lhez That would be really great if you have some time to look into it. I believe it has a lot of potential.

I tested MNN, and they offer a Stable Diffusion demo that performs really well with OpenCL. Their approach is different from sdcpp's: they don't compute the graph on the fly; instead, it is baked into the weights and aggressively optimized (and, like ONNX, the results are quite bad). Still, maybe some inspiration could be drawn from their kernels.
