SYCL: Add mrope kernel by qnixsynapse · Pull Request #13755 · ggml-org/llama.cpp · GitHub
SYCL: Add mrope kernel #13755


Merged · 3 commits merged into master on May 30, 2025

Conversation

qnixsynapse (Collaborator)

No description provided.

The github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on May 24, 2025.
@Rbiessy (Collaborator) left a comment:

LGTM!

Comment on lines 141 to 142
dst[i + 0] = x[i + 0];
dst[i + 1] = x[i + 1];
@Alcpz (Collaborator) commented on May 29, 2025:

Suggested change:
- dst[i + 0] = x[i + 0];
- dst[i + 1] = x[i + 1];
+ *reinterpret_cast<sycl::vec<T, 2> *>(dst + i) = *reinterpret_cast<const sycl::vec<T, 2> *>(x + i);

I've tested this change, and for big enough tensors it makes a noticeable difference in performance:

dst[i + 0] = x[i + 0];
dst[i + 1] = x[i + 1];
  ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1):                14944 runs -    68.72 us/run -    71995 kB/run - 1007.63 GB/s
  ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1):                 4400 runs -   232.89 us/run -   167995 kB/run -  701.70 GB/s
  ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1):                14944 runs -    68.82 us/run -    71995 kB/run - 1006.27 GB/s
  ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1):                 4400 runs -   232.11 us/run -   167995 kB/run -  704.05 GB/s
  ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1):                46650 runs -    21.58 us/run -    35997 kB/run - 1597.78 GB/s
  ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1):                11200 runs -    92.34 us/run -    83997 kB/run -  876.16 GB/s
  ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1):                47583 runs -    21.43 us/run -    35997 kB/run - 1608.58 GB/s
  ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1):                10800 runs -    92.64 us/run -    83997 kB/run -  873.38 GB/s
        *reinterpret_cast<sycl::vec<T, 2> *>(dst + i) = *reinterpret_cast<const sycl::vec<T, 2> *>(x + i);
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1):                21015 runs -    47.71 us/run -    71995 kB/run - 1451.41 GB/s
  ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1):                 6000 runs -   167.92 us/run -   167995 kB/run -  973.19 GB/s
  ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1):                21482 runs -    47.53 us/run -    71995 kB/run - 1456.84 GB/s
  ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1):                 6000 runs -   168.06 us/run -   167995 kB/run -  972.38 GB/s
  ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1):                50382 runs -    20.08 us/run -    35997 kB/run - 1716.75 GB/s
  ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1):                14800 runs -    68.47 us/run -    83997 kB/run - 1181.62 GB/s
  ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1):                49449 runs -    20.29 us/run -    35997 kB/run - 1699.40 GB/s
  ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1):                14800 runs -    68.92 us/run -    83997 kB/run - 1174.02 GB/s

I think it's worth considering changing it. Do you know which model makes use of this (Qwen2.5-VL?)? I'd be happy to grab the tk/s to see if this makes a significant change. I suppose it's not that impactful, though, as our backend's bottlenecks are mostly in mul_mat.

qnixsynapse (Collaborator, Author) replied:

From this comment, Qwen 2.5 needs it.

qnixsynapse (Collaborator, Author) replied:

I have added this suggestion to the other kernels as well. Thanks a lot!

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.
@Alcpz (Collaborator) left a comment:

I can't continue the review for a couple of days (in case you introduce other changes), but if @Rbiessy is happy with the changes, I'm OK with the merge. We can iterate on this in a different PR.

Edit: Not sure if someone else wants to review the code.

@qnixsynapse qnixsynapse merged commit b49a8ff into master May 30, 2025
46 checks passed
@qnixsynapse qnixsynapse deleted the sycl/mrope branch May 30, 2025 14:11