sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices by ShanoToni · Pull Request #13973 · ggml-org/llama.cpp


Open · ShanoToni wants to merge 1 commit into ggml-org:master from reorder_intel_on_by_default
Conversation

ShanoToni

This PR proposes changing the default of the environment variable GGML_SYCL_DISABLE_OPT from ON (reorder disabled) to OFF (reorder enabled). This allows easier testing on newer hardware, without needing to modify the list of devices that support the reorder feature.
Concerns regarding the reorder feature from #13254 have been resolved.
Regarding performance on older devices, the README has been amended to suggest disabling the feature.
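
For reference, here is a minimal sketch of the kind of env-var toggle being discussed. This is not the actual ggml-sycl source; the exact parsing is an assumption, and the helper name is hypothetical:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not the actual ggml-sycl implementation): returns true
// when the user opts out of the reorder by setting GGML_SYCL_DISABLE_OPT to a
// non-"0" value; unset or "0" keeps the optimization enabled by default.
static bool ggml_sycl_opt_disabled() {
    const char * v = std::getenv("GGML_SYCL_DISABLE_OPT");
    return v != nullptr && std::strcmp(v, "0") != 0;
}

int main() {
    std::printf("reorder %s\n", ggml_sycl_opt_disabled() ? "disabled" : "enabled");
    return 0;
}
```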

Below are benchmark runs of two different models, showing the performance improvements from having the feature enabled:

Llama2-7B Q4_0 PVC

With reorder

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           pp512 |      2618.08 ± 13.71 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           tg128 |         74.49 ± 0.25 |

Without

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           pp512 |      2611.12 ± 25.25 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           tg128 |         36.02 ± 0.07 |

gemma2 2B Q4_K PVC

With Reorder

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           pp512 |      7765.55 ± 60.15 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           tg128 |         99.06 ± 0.12 |

Without

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           pp512 |     7766.70 ± 145.88 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           tg128 |         89.94 ± 0.28 |

Llama2-7B Q4_0 ARC-A770

With reorder

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       1711.58 ± 3.83 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         34.16 ± 0.22 |

Without

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       1711.60 ± 1.30 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         29.97 ± 0.22 |

gemma2 2B Q4_K ARC-A770

With Reorder

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |       3645.32 ± 4.61 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         38.23 ± 0.13 |

Without

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |       3639.22 ± 7.44 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         35.55 ± 0.14 |

Llama2-7B Q4_0 Lunar Lake

With reorder

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           pp512 |       258.61 ± 24.36 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           tg128 |         19.85 ± 0.04 |

Without

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           pp512 |        470.37 ± 1.56 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |       8 |           tg128 |         12.92 ± 0.99 |

gemma2 2B Q4_K Lunar Lake

With Reorder

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           pp512 |       613.06 ± 21.23 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           tg128 |         29.01 ± 0.32 |

Without

| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           pp512 |       643.79 ± 87.98 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |       8 |           tg128 |         24.78 ± 0.27 |

Llama2-7B Q4_0 Intel B580

With reorder

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      2162.42 ± 13.20 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         66.66 ± 0.21 |

Without

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       2168.30 ± 6.27 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         39.67 ± 0.09 |

gemma2 2B Q4_K Intel B580

With Reorder

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |      5685.01 ± 21.13 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         87.73 ± 1.75 |

Without

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |      5678.08 ± 18.61 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         66.92 ± 0.62 |

github-actions bot added the documentation, ggml, and SYCL labels on Jun 2, 2025
@NeoZhangJianyu
Collaborator

@ShanoToni
Thank you for testing on different Intel GPUs!

But it doesn't cover the older iGPUs in Intel Core CPUs (11th generation and later).

I designed hw_info to check whether the hardware supports reorder, so that the feature won't have a negative impact on Intel GPUs even when reorder brings no benefit.

For newer GPUs, this feature always brings a benefit, as your tests show.
For older GPUs, it brings no benefit, or even decreases inference performance. Moreover, reordering the weights at startup reduces performance in the prepare stage on any GPU.

That's why I designed the structure to check the GPU type: enable reorder on the GPUs that benefit from it, and disable it for the others.

In this PR, the GPU name is used to check for Intel GPUs. That's not a good method. In the future, there will be a need to detect the detailed GPU model to set different parameters for better performance; using the architecture from the SYCL API can support that, while the GPU name can't provide detailed type information.

I suggest keeping the legacy function.

It's OK to enable this feature by default.

ShanoToni force-pushed the reorder_intel_on_by_default branch from 4c20538 to bc14320 on June 3, 2025
@ShanoToni
Author

@NeoZhangJianyu Appreciate the comments.
Regarding the concerns:

  1. While I agree that older Intel devices might not benefit from the performance improvement, the current implementation of the reorder would prevent testing the backend on newer devices without first adding them to the list. We believe newer devices are a higher priority for performance than the older generations.
  2. I agree it was not the best way to check devices. With @Alcpz, the check was changed from the device name to the architecture group, ensuring we reorder only on Intel GPUs (see the sketch after this list).
  3. I was not 100% clear on which legacy code parts you were referring to; I assume sycl_hw_info. I have commented them out, as they are currently not used, but kept them in the codebase.
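
For illustration, here is a minimal sketch of an architecture-based check using the sycl_ext_oneapi_device_architecture extension. The specific architectures listed are examples only, not the PR's actual gating list, and the function name is hypothetical:

```cpp
#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

// Illustrative only: decide reorder support from the device architecture
// instead of string-matching the device name. Which architectures qualify
// is an assumption here, not the PR's actual list.
static bool device_supports_reorder(const sycl::device & dev) {
    if (!dev.is_gpu()) {
        return false;
    }
    switch (dev.get_info<syclex::info::device::architecture>()) {
        case syclex::architecture::intel_gpu_pvc:      // Data Center GPU Max
        case syclex::architecture::intel_gpu_acm_g10:  // Arc A-series (Alchemist)
            return true;
        default:
            return false;
    }
}

int main() {
    sycl::queue q{sycl::gpu_selector_v};
    return device_supports_reorder(q.get_device()) ? 0 : 1;
}
```

Unlike a name string, the architecture enum can also distinguish detailed GPU models, which addresses the future per-model tuning requirement raised above.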

@Rbiessy
Collaborator
Rbiessy commented Jun 3, 2025

LGTM. Not approving yet since we want to measure the impact on one more device before merging.

@NeoZhangJianyu
Collaborator
NeoZhangJianyu commented Jun 4, 2025

> @NeoZhangJianyu Appreciate the comments.
> Regarding the concerns:
>
>   1. While I agree that older Intel devices might not benefit from the performance improvement, the current implementation of the reorder would prevent testing the backend on newer devices without first adding them to the list. We believe newer devices are a higher priority for performance than the older generations.
>   2. I agree it was not the best way to check devices. With @Alcpz, the check was changed from the device name to the architecture group, ensuring we reorder only on Intel GPUs.
>   3. I was not 100% clear on which legacy code parts you were referring to; I assume sycl_hw_info. I have commented them out, as they are currently not used, but kept them in the codebase.

The legacy code provides two paths, one for new GPUs and one for old GPUs.
The path for old GPUs avoids the reorder so that their performance is not reduced.

This PR simplifies the code by removing the path for old GPUs, which reduces their performance.
That's the problem.

All my words are only suggestions. It depends on you.

No old users, no new users.
In fact, I see that many llama.cpp users are using old iGPUs.
