vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs by rillomas · Pull Request #14001 · ggml-org/llama.cpp

Merged · 7 commits into ggml-org:master · Jun 5, 2025

Conversation

rillomas (Contributor) commented Jun 4, 2025

Enabling VK_KHR_cooperative_matrix on Intel Xe2 GPUs (currently Lunar Lake and Battlemage) yields a significant performance improvement, while older GPUs like the Arc A770 show performance regressions with it.
This PR enables VK_KHR_cooperative_matrix only for Xe2 GPUs until the regression on older GPUs is resolved.

Reference: #13530

llama-bench results

| OS | Platform | Benchmark | b5589-xe2-enabled | b5583-master | Difference |
| --- | --- | --- | ---: | ---: | ---: |
| Windows 11 24H2 (gfx driver 32.0.101.6795) | U7-268V | pp512 | 416.53 | 148.60 | 280% |
| | | tg128 | 37.86 | 36.87 | 103% |
| | i5-13400 + Arc B580 | pp512 | 1631.44 | 490.89 | 332% |
| | | tg128 | 126.69 | 129.95 | 97% |
| | i9-12900K + Arc A770 | pp512 | 977.81 | 974.92 | 100% |
| | | tg128 | 96.83 | 96.42 | 100% |
| Ubuntu 24.04.2 (Mesa 24.2.8) | U7-268V | pp512 | 167.94 | 122.42 | 137% |
| | | tg128 | 13.67 | 13.75 | 99% |
| | i5-13400 + Arc B580 | pp512 | 583.28 | 420.57 | 139% |
| | | tg128 | 41.80 | 41.82 | 100% |
| | i9-12900K + Arc A770 | pp512 | 328.25 | 333.88 | 98% |
| | | tg128 | 39.00 | 39.14 | 100% |

Windows

Lunar Lake Core Ultra 7 268V

Before

λ llama-bench.exe -m ..\gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        148.60 ± 3.52 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         36.87 ± 0.61 |

build: 7e00e60e (5583)

After

λ llama-bench.exe -m ..\gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |       416.53 ± 25.15 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         37.86 ± 0.19 |

build: a0dd7795 (5589)

Battlemage Arc B580

Before

λ llama-bench.exe -m ..\gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B580 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |       490.89 ± 12.28 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |        129.95 ± 0.27 |

build: 7e00e60e (5583)

After

λ llama-bench.exe -m ..\gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B580 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |       1631.44 ± 4.42 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |        126.69 ± 0.50 |

build: a0dd7795 (5589)

Alchemist Arc A770

Before

λ llama-bench.exe -m C:\Users\cpie-ace\Documents\Axell\gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        974.92 ± 4.06 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         96.42 ± 0.40 |

build: 7e00e60e (5583)

After

λ llama-bench.exe -m C:\Users\cpie-ace\Documents\Axell\gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        977.81 ± 1.79 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         96.83 ± 0.30 |

build: a0dd7795 (5589)

Linux

Lunar Lake Core Ultra 7 268V

Before

$ ./llama-bench -m ~/Downloads/gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (LNL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 131072 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        122.42 ± 0.63 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         13.75 ± 0.01 |

build: 7e00e60e (5583)

After

$ ./llama-bench -m ~/Downloads/gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (LNL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 131072 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        167.94 ± 0.63 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         13.67 ± 0.05 |

build: a0dd7795 (5589)

Battlemage Arc B580

Before

$ ./llama-bench -m ~/Downloads/gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G21) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 163840 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        420.52 ± 0.19 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         41.82 ± 0.31 |

build: 7e00e60e (5583)

After

$ ./llama-bench -m ~/Downloads/gemma-2-2b-it-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G21) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 163840 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        583.28 ± 0.69 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         41.80 ± 0.27 |

build: a0dd7795 (5589)

Alchemist Arc A770

Before

$ ./llama-bench -m ~/Downloads/gemma-2-2b-it-Q4_K_M.gguf 
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        333.88 ± 0.34 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         39.14 ± 0.10 |

build: 7e00e60e (5583)

After

$ ./llama-bench -m ~/Downloads/gemma-2-2b-it-Q4_K_M.gguf 
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           pp512 |        328.35 ± 5.95 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | Vulkan     |  99 |           tg128 |         39.00 ± 0.24 |

build: a0dd7795 (5589)

@github-actions github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Jun 4, 2025
@rillomas rillomas marked this pull request as ready for review June 4, 2025 02:24
0cc4m (Collaborator) left a comment
This looks good, did you also check performance on Linux to make sure that we don't cause a regression for Xe2 there?

rillomas (Contributor, Author) commented Jun 4, 2025

Thanks for your review. I'll benchmark the current version on Ubuntu and update the PR soon.

Update: I've updated the PR comment with the Ubuntu data.

0cc4m (Collaborator) commented Jun 5, 2025

Thank you for testing with Ubuntu. There is no reason not to merge, but it's seriously disappointing how much worse the Linux driver currently is. I hope Intel closes that gap in the near future.

codecnotsupported commented

> Thank you for testing with Ubuntu. There is no reason not to merge, but it's seriously disappointing how much worse the Linux driver currently is. I hope Intel closes that gap in the near future.

I believe there is an issue in the Mesa bug tracker for that exact problem.
That said, the SYCL backend has better performance.

$ ./build/bin/llama-bench -m ./models/gemma-2-2b-it-q4_k_m.gguf 
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Arc(TM) B580 Graphics)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz)
load_backend: failed to find ggml_backend_init in ./llama.cpp/build/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in ./llama.cpp/build/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           pp512 |      2186.80 ± 16.75 |
| gemma2 2B Q4_K - Medium        |   1.59 GiB |     2.61 B | SYCL       |  99 |           tg128 |         54.23 ± 0.31 |

0cc4m (Collaborator) commented Jun 5, 2025

Yeah, we created an issue about coopmat performance, but not about the general performance issues. SYCL is faster, but less flexible than Vulkan.

0cc4m (Collaborator) left a comment
LGTM. Thank you for the contribution!

@0cc4m 0cc4m merged commit 669c13e into ggml-org:master Jun 5, 2025
45 checks passed
@rillomas rillomas deleted the allow-list-intel-coopmat branch June 5, 2025 21:56
furyhawk pushed a commit to furyhawk/llama.cpp that referenced this pull request Jun 6, 2025
…ggml-org#14001)

* allowing B580 and U9-288V

* experimenting code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check