-
Notifications
You must be signed in to change notification settings - Fork 11.9k
remove templates from soft_max_f32_submitter to allow SYCL graph updates #13724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@lslusarczyk Thank you! |
We will run some benchmarks for this change, thanks. I suspect the templates arguments were introduced to optimize the kernel in some cases. It would probably be safer to use the "runtime version" of softmax (i.e. with the templates Also have you been able to identify how come a different soft_max can be used in each iteration? I can see this depends on the first dimension of |
I can confirm this has too much impact on performance when SYCL-Graph is not used, see the results of
with the PR:
I think this is enough evidence to keep the template arguments when SYCL-Graph is not used. I suggest storing in the backend context whether SYCL-Graph ends up being used so that it can be checked when submitting softmax kernels. This could be a new field in llama.cpp/ggml/src/ggml-sycl/common.hpp Line 318 in 6f180b9
I'm also attaching some full model benchmark results below. Most cases seem slightly worse. The improvement in the TG of phi3 3B Q4_K - Medium is odd. I haven't looked closer at this model. master:
PR:
|
SYCL graph feature has nothing with hardware platform. |
@NeoZhangJianyu SYCL-Graph already has a global variable to try and enable SYCL-Graph but there are also more checks as SYCL-Graph doesn't support all graphs not all devices, see llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp Line 3897 in 4f81b33
Given this check, |
@qnixsynapse , @NeoZhangJianyu , thank you for reviewing my code and for your comments. @Rbiessy , thank you very much for checking the performance. Before deciding to have two code paths, which will make code uglier (instead of simplier as I tried in this PR), I'd like to try to understand why these templates were causing better performance. By code analysis I expected nearly no impact by removing template parameters that I changed to be arguments. Expect my updates here in a few days. |
When
soft_max_f32_sycl
is templated, then update of SYCL graph will fail because of different node type. Having just kernel parameters allows to update just parameters.