SYCL: Add mrope kernel #13755
Conversation
LGTM!
ggml/src/ggml-sycl/rope.cpp (outdated)

```cpp
dst[i + 0] = x[i + 0];
dst[i + 1] = x[i + 1];
```
Suggested change:

```diff
-dst[i + 0] = x[i + 0];
-dst[i + 1] = x[i + 1];
+*reinterpret_cast<sycl::vec<T, 2> *>(dst + i) = *reinterpret_cast<const sycl::vec<T, 2> *>(x + i);
```
I've tried this change, and for big enough tensors it makes a noticeable difference in performance:
With the scalar copies (`dst[i + 0] = x[i + 0]; dst[i + 1] = x[i + 1];`):
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 14944 runs - 68.72 us/run - 71995 kB/run - 1007.63 GB/s
ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 4400 runs - 232.89 us/run - 167995 kB/run - 701.70 GB/s
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 14944 runs - 68.82 us/run - 71995 kB/run - 1006.27 GB/s
ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 4400 runs - 232.11 us/run - 167995 kB/run - 704.05 GB/s
ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 46650 runs - 21.58 us/run - 35997 kB/run - 1597.78 GB/s
ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 11200 runs - 92.34 us/run - 83997 kB/run - 876.16 GB/s
ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 47583 runs - 21.43 us/run - 35997 kB/run - 1608.58 GB/s
ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 10800 runs - 92.64 us/run - 83997 kB/run - 873.38 GB/s
With the vectorized copy (`*reinterpret_cast<sycl::vec<T, 2> *>(dst + i) = *reinterpret_cast<const sycl::vec<T, 2> *>(x + i);`):
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 21015 runs - 47.71 us/run - 71995 kB/run - 1451.41 GB/s
ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 6000 runs - 167.92 us/run - 167995 kB/run - 973.19 GB/s
ROPE(type=f32,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 21482 runs - 47.53 us/run - 71995 kB/run - 1456.84 GB/s
ROPE(type=f32,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 6000 runs - 168.06 us/run - 167995 kB/run - 972.38 GB/s
ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 50382 runs - 20.08 us/run - 35997 kB/run - 1716.75 GB/s
ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=0,v=1): 14800 runs - 68.47 us/run - 83997 kB/run - 1181.62 GB/s
ROPE(type=f16,ne_a=[1280,1200,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 49449 runs - 20.29 us/run - 35997 kB/run - 1699.40 GB/s
ROPE(type=f16,ne_a=[1280,2800,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.424500,ef=0.746500,af=1.424500,ff=1,v=1): 14800 runs - 68.92 us/run - 83997 kB/run - 1174.02 GB/s
I think it's worth considering this change. Do you know which model makes use of it (Qwen2.5-VL?)? I'd be happy to grab the tok/s to see if it makes a significant difference. I suppose it's not that impactful, since our backend issues are mostly in the mul_mat kernels.
From this comment, Qwen 2.5 needs it.
I have added this suggestion to the other kernels as well. Thanks a lot!
Uses `sycl::vec` to load and store two elements at a time, significantly improving performance in `rope_norm`, `rope_neox`, and `rope_multi`. This reduces the number of memory accesses and leverages SIMD instructions for faster execution.
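For illustration, here is a minimal, self-contained sketch of that pattern; it is not the PR code. The helper name `copy_tail_vectorized`, the queue setup, and the assumption that the element count is even and the pointers are suitably aligned for a 2-element access are introduced here for the example only.

```cpp
// Minimal sketch: copy pairs of contiguous elements with one 2-wide
// sycl::vec load/store each, instead of two scalar loads/stores.
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

template <typename T>
void copy_tail_vectorized(sycl::queue &q, const T *x, T *dst, size_t n) {
    // Assumes n is even and x/dst are aligned for a 2-element vector access
    // (true here because they come from sycl::malloc_device).
    q.parallel_for(sycl::range<1>(n / 2), [=](sycl::id<1> idx) {
        const size_t i = 2 * idx[0];
        *reinterpret_cast<sycl::vec<T, 2> *>(dst + i) =
            *reinterpret_cast<const sycl::vec<T, 2> *>(x + i);
    }).wait();
}

int main() {
    sycl::queue q;
    constexpr size_t n = 8;
    std::vector<float> host_x = {0, 1, 2, 3, 4, 5, 6, 7};

    float *x   = sycl::malloc_device<float>(n, q);
    float *dst = sycl::malloc_device<float>(n, q);
    q.memcpy(x, host_x.data(), n * sizeof(float)).wait();

    copy_tail_vectorized(q, x, dst, n);

    std::vector<float> host_dst(n);
    q.memcpy(host_dst.data(), dst, n * sizeof(float)).wait();
    for (float v : host_dst) {
        printf("%g ", v); // expect: 0 1 2 3 4 5 6 7
    }
    printf("\n");

    sycl::free(x, q);
    sycl::free(dst, q);
    return 0;
}
```

The benchmarked shapes above have n_dims=128 with ne0=1280, so most of each row goes through the untouched pass-through copy rather than the rotation itself, which is presumably why halving the number of memory transactions shows up so clearly in the bandwidth numbers.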
I can't continue the review for a couple of days (in case you introduce other changes), but if @Rbiessy is happy with the changes, I'm ok with the merge. We can iterate over this in a different PR.
Edit: Not sure if someone else wants to review the code.