sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices by ShanoToni · Pull Request #13973 · ggml-org/llama.cpp


Open · ShanoToni wants to merge 1 commit into ggml-org:master from reorder_intel_on_by_default
Conversation

ShanoToni

This PR proposes changing the default of the environment variable GGML_SYCL_DISABLE_OPT from ON (reorder disabled) to OFF (reorder enabled). This allows easier testing on newer hardware, without needing to modify the list of devices that support the reorder feature.
Concerns regarding the reorder feature from #13254 have been resolved.
Regarding performance on older devices, the README has been amended to suggest disabling the feature.
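
For reference, here is a minimal sketch of the kind of env-var toggle being discussed. This is not the actual ggml-sycl source; the exact parsing is an assumption, and the helper name is hypothetical:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not the actual ggml-sycl implementation): returns true
// when the user opts out of the reorder by setting GGML_SYCL_DISABLE_OPT to a
// non-"0" value; unset or "0" keeps the optimization enabled by default.
static bool ggml_sycl_opt_disabled() {
    const char * v = std::getenv("GGML_SYCL_DISABLE_OPT");
    return v != nullptr && std::strcmp(v, "0") != 0;
}

int main() {
    std::printf("reorder %s\n", ggml_sycl_opt_disabled() ? "disabled" : "enabled");
    return 0;
}
```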

Below are benchmark runs of two different models, showing the performance improvements from having the feature enabled:

Llama2-7B Q4_0 PVC

With reorder

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           pp512 |      2618.08 ± 13.71 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           tg128 |         74.49 ± 0.25 |

Without

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           pp512 |      2611.12 ± 25.25 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           tg128 |         36.02 ± 0.07 |

gemma2 2B Q4_K PVC

With Reorder

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           pp512 |      7765.55 ± 60.15 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           tg128 |         99.06 ± 0.12 |

Without

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           pp512 |     7766.70 ± 145.88 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           tg128 |         89.94 ± 0.28 |

Llama2-7B Q4_0 ARC-A770

With reorder

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       1711.58 ± 3.83 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         34.16 ± 0.22 |

Without

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       1711.60 ± 1.30 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         29.97 ± 0.22 |

gemma2 2B Q4_K ARC-A770

With Reorder

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |       3645.32 ± 4.61 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         38.23 ± 0.13 |

Without

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |       3639.22 ± 7.44 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         35.55 ± 0.14 |

Llama2-7B Q4_0 Lunar Lake

With reorder

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           pp512 |       258.61 ± 24.36 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           tg128 |         19.85 ± 0.04 |

Without

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           pp512 |        470.37 ± 1.56 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           tg128 |         12.92 ± 0.99 |

gemma2 2B Q4_K Lunar Lake

With Reorder

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           pp512 |       613.06 ± 21.23 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           tg128 |         29.01 ± 0.32 |

Without

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           pp512 |       643.79 ± 87.98 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           tg128 |         24.78 ± 0.27 |

Llama2-7B Q4_0 Intel B580

With reorder

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      2162.42 ± 13.20 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         66.66 ± 0.21 |

Without

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       2168.30 ± 6.27 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         39.67 ± 0.09 |

gemma2 2B Q4_K Intel B580

With Reorder

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |      5685.01 ± 21.13 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         87.73 ± 1.75 |

Without

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |      5678.08 ± 18.61 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         66.92 ± 0.62 |

github-actions bot added the documentation, ggml, and SYCL labels on Jun 2, 2025
@NeoZhangJianyu
Collaborator

@ShanoToni
Thank you for testing on different Intel GPUs!

But it doesn't cover the older iGPUs in Intel Core CPUs (11th generation and later).

I designed hw_info to check whether the hardware supports reorder, so that the feature won't have a negative impact on Intel GPUs even when reorder brings no benefit.

For newer GPUs, this feature always brings a benefit, as your tests show.
For older GPUs, it brings no benefit, or even decreases inference performance. Moreover, reordering the weights at startup reduces performance in the prepare stage on any GPU.

That's why I designed the structure to check the GPU type: enable reorder on the GPUs that benefit from it, and disable it for the others.

In this PR, the GPU name is used to check for Intel GPUs. That's not a good method. In the future, there will be a need to detect the detailed GPU model to set different parameters for better performance; using the architecture from the SYCL API can support that, while the GPU name can't provide detailed type information.

I suggest keeping the legacy function.

It's OK to enable this feature by default.

ShanoToni force-pushed the reorder_intel_on_by_default branch from 4c20538 to bc14320 on June 3, 2025
@ShanoToni
Author

@NeoZhangJianyu Appreciate the comments.
Regarding the concerns:

  1. While I agree that older Intel devices might not benefit from the performance improvement, the current implementation of the reorder would prevent testing the backend on newer devices without first adding them to the list. We believe newer devices are a higher priority for performance than the older generations.
  2. I agree it was not the best way to check devices. With @Alcpz, the check was changed from the device name to the architecture group, ensuring we reorder only on Intel GPUs (see the sketch after this list).
  3. I was not 100% clear on which legacy code parts you were referring to; I assume sycl_hw_info. I have commented them out, as they are currently not used, but kept them in the codebase.
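
For illustration, here is a minimal sketch of an architecture-based check using the sycl_ext_oneapi_device_architecture extension. The specific architectures listed are examples only, not the PR's actual gating list, and the function name is hypothetical:

```cpp
#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

// Illustrative only: decide reorder support from the device architecture
// instead of string-matching the device name. Which architectures qualify
// is an assumption here, not the PR's actual list.
static bool device_supports_reorder(const sycl::device & dev) {
    if (!dev.is_gpu()) {
        return false;
    }
    switch (dev.get_info<syclex::info::device::architecture>()) {
        case syclex::architecture::intel_gpu_pvc:      // Data Center GPU Max
        case syclex::architecture::intel_gpu_acm_g10:  // Arc A-series (Alchemist)
            return true;
        default:
            return false;
    }
}

int main() {
    sycl::queue q{sycl::gpu_selector_v};
    return device_supports_reorder(q.get_device()) ? 0 : 1;
}
```

Unlike a name string, the architecture enum can also distinguish detailed GPU models, which addresses the future per-model tuning requirement raised above.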

@Rbiessy
Collaborator
Rbiessy commented Jun 3, 2025

LGTM. Not approving yet since we want to measure the impact on one more device before merging.

@NeoZhangJianyu
Collaborator
NeoZhangJianyu commented Jun 4, 2025

> @NeoZhangJianyu Appreciate the comments.
> Regarding the concerns:
>
>   1. While I agree that older Intel devices might not benefit from the performance improvement, the current implementation of the reorder would prevent testing the backend on newer devices without first adding them to the list. We believe newer devices are a higher priority for performance than the older generations.
>   2. I agree it was not the best way to check devices. With @Alcpz, the check was changed from the device name to the architecture group, ensuring we reorder only on Intel GPUs.
>   3. I was not 100% clear on which legacy code parts you were referring to; I assume sycl_hw_info. I have commented them out, as they are currently not used, but kept them in the codebase.

The legacy code provides two paths, one for new GPUs and one for old GPUs.
The path for old GPUs avoids the reorder so that their performance is not reduced.

This PR simplifies the code by removing the path for old GPUs, which reduces their performance.
That's the problem.

All my words are only suggestions. It depends on you.

No old users, no new users.
In fact, I see that many llama.cpp users are using old iGPUs.
