SYCL: Add mrope kernel #13755
Conversation
LGTM!
ggml/src/ggml-sycl/rope.cpp (outdated)

```cpp
dst[i + 0] = x[i + 0];
dst[i + 1] = x[i + 1];
```
Suggested change:

```diff
-dst[i + 0] = x[i + 0];
-dst[i + 1] = x[i + 1];
+*reinterpret_cast<sycl::vec<T, 2> *>(dst + i) = *reinterpret_cast<const sycl::vec<T, 2> *>(x + i);
```
I've tried this change, and for big enough tensors it makes a noticeable difference in performance:
With the scalar copies (`dst[i + 0] = x[i + 0]; dst[i + 1] = x[i + 1];`):
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 14944 runs - 68.72 us/run - 71995 kB/run - 1007.63 GB/s
ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 4400 runs - 232.89 us/run - 167995 kB/run - 701.70 GB/s
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 14944 runs - 68.82 us/run - 71995 kB/run - 1006.27 GB/s
ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 4400 runs - 232.11 us/run - 167995 kB/run - 704.05 GB/s
ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 46650 runs - 21.58 us/run - 35997 kB/run - 1597.78 GB/s
ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 11200 runs - 92.34 us/run - 83997 kB/run - 876.16 GB/s
ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 47583 runs - 21.43 us/run - 35997 kB/run - 1608.58 GB/s
ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 10800 runs - 92.64 us/run - 83997 kB/run - 873.38 GB/s
With the vectorized copy (`*reinterpret_cast<sycl::vec<T, 2> *>(dst + i) = *reinterpret_cast<const sycl::vec<T, 2> *>(x + i);`):
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 21015 runs - 47.71 us/run - 71995 kB/run - 1451.41 GB/s
ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 6000 runs - 167.92 us/run - 167995 kB/run - 973.19 GB/s
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 21482 runs - 47.53 us/run - 71995 kB/run - 1456.84 GB/s
ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 6000 runs - 168.06 us/run - 167995 kB/run - 972.38 GB/s
ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 50382 runs - 20.08 us/run - 35997 kB/run - 1716.75 GB/s
ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 14800 runs - 68.47 us/run - 83997 kB/run - 1181.62 GB/s
ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 49449 runs - 20.29 us/run - 35997 kB/run - 1699.40 GB/s
ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 14800 runs - 68.92 us/run - 83997 kB/run - 1174.02 GB/s
I think it's worth considering this change. Do you know which model makes use of it (Qwen2.5-VL?)? I'd be happy to grab the tok/s to see if it makes a significant difference. I suppose it's not that impactful, since our backend issues are mostly in the mul_mat kernels.
From this comment, Qwen 2.5 needs it.
I have added this suggestion to the other kernels as well. Thanks a lot!
Uses `sycl::vec` to load and store two elements at a time, significantly improving performance in `rope_norm`, `rope_neox`, and `rope_multi`. This reduces the number of memory accesses and leverages SIMD instructions for faster execution.
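For illustration, here is a minimal, self-contained sketch of that pattern; it is not the PR code. The helper name `copy_tail_vectorized`, the queue setup, and the assumption that the element count is even and the pointers are suitably aligned for a 2-element access are introduced here for the example only.

```cpp
// Minimal sketch: copy pairs of contiguous elements with one 2-wide
// sycl::vec load/store each, instead of two scalar loads/stores.
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

template <typename T>
void copy_tail_vectorized(sycl::queue &q, const T *x, T *dst, size_t n) {
    // Assumes n is even and x/dst are aligned for a 2-element vector access
    // (true here because they come from sycl::malloc_device).
    q.parallel_for(sycl::range<1>(n / 2), [=](sycl::id<1> idx) {
        const size_t i = 2 * idx[0];
        *reinterpret_cast<sycl::vec<T, 2> *>(dst + i) =
            *reinterpret_cast<const sycl::vec<T, 2> *>(x + i);
    }).wait();
}

int main() {
    sycl::queue q;
    constexpr size_t n = 8;
    std::vector<float> host_x = {0, 1, 2, 3, 4, 5, 6, 7};

    float *x   = sycl::malloc_device<float>(n, q);
    float *dst = sycl::malloc_device<float>(n, q);
    q.memcpy(x, host_x.data(), n * sizeof(float)).wait();

    copy_tail_vectorized(q, x, dst, n);

    std::vector<float> host_dst(n);
    q.memcpy(host_dst.data(), dst, n * sizeof(float)).wait();
    for (float v : host_dst) {
        printf("%g ", v); // expect: 0 1 2 3 4 5 6 7
    }
    printf("\n");

    sycl::free(x, q);
    sycl::free(dst, q);
    return 0;
}
```

The benchmarked shapes above have n_dims=128 with ne0=1280, so most of each row goes through the untouched pass-through copy rather than the rotation itself, which is presumably why halving the number of memory transactions shows up so clearly in the bandwidth numbers.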
I can't continue the review for a couple of days (in case you introduce other changes), but if @Rbiessy is happy with the changes, I'm ok with the merge. We can iterate over this in a different PR.
Edit: Not sure if someone else wants to review the code.