remove templates from soft_max_f32_submitter to allow SYCL graph updates by lslusarczyk · Pull Request #13724 · ggml-org/llama.cpp · GitHub

remove templates from soft_max_f32_submitter to allow SYCL graph updates #13724


Open
lslusarczyk wants to merge 1 commit into master

Conversation

lslusarczyk
Contributor

When soft_max_f32_sycl is templated, updating the SYCL graph fails because the node type differs between iterations. Passing these values as plain kernel parameters means only the parameters need to be updated.
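
For readers unfamiliar with the failure mode, here is a hedged sketch of the difference; the kernels and names are simplified illustrations, not the actual llama.cpp code:

```cpp
#include <cmath>
#include <sycl/sycl.hpp>

// Templated submitter: each ncols_template value instantiates a distinct
// kernel type. A graph node recorded for submit_templated<512> cannot be
// updated to submit_templated<32>; the update is rejected as a node mismatch.
template <int ncols_template>
void submit_templated(sycl::queue &q, const float *x, float *dst, int nrows) {
    q.parallel_for(sycl::range<1>(nrows), [=](sycl::id<1> row) {
        float m = -INFINITY;
        for (int c = 0; c < ncols_template; ++c)   // compile-time bound
            m = sycl::fmax(m, x[row * ncols_template + c]);
        dst[row] = m;                              // exp/sum/normalize omitted
    });
}

// Runtime submitter: a single kernel type; ncols is an ordinary argument, so a
// SYCL-Graph update only has to refresh the kernel arguments.
void submit_runtime(sycl::queue &q, const float *x, float *dst, int nrows, int ncols) {
    q.parallel_for(sycl::range<1>(nrows), [=](sycl::id<1> row) {
        float m = -INFINITY;
        for (int c = 0; c < ncols; ++c)            // runtime bound
            m = sycl::fmax(m, x[row * ncols + c]);
        dst[row] = m;                              // exp/sum/normalize omitted
    });
}
```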

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 23, 2025
@lslusarczyk lslusarczyk marked this pull request as ready for review May 23, 2025 12:26
Collaborator
@qnixsynapse qnixsynapse left a comment


LGTM

@NeoZhangJianyu
Collaborator

@lslusarczyk
Will you report such issues to the SYCL graph team?
I think it would also help improve SYCL graph at the same time.

Thank you!

@Rbiessy Rbiessy self-requested a review May 26, 2025 08:11
@Rbiessy
Collaborator
Rbiessy commented May 26, 2025

We will run some benchmarks for this change, thanks. I suspect the template arguments were introduced to optimize the kernel in some cases. It would probably be safer to use the "runtime version" of softmax (i.e. with the template arguments <false, 0, 0>) when SYCL-Graph is used and keep the existing version otherwise.
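
As a hedged illustration of that suggestion (reusing the illustrative submit_templated / submit_runtime helpers from the sketch above; none of these names are existing llama.cpp functions), the dispatch could look roughly like this:

```cpp
// Sketch only: use the single non-templated kernel when SYCL-Graph is in use,
// and keep the specialized instantiations on the eager path.
void soft_max_dispatch(sycl::queue &q, const float *x, float *dst,
                       int nrows, int ncols, bool use_sycl_graph) {
    if (use_sycl_graph) {
        // Graph path: one kernel type, so a graph update only swaps arguments.
        submit_runtime(q, x, dst, nrows, ncols);
    } else {
        // Eager path: keep specialized instantiations for common widths.
        switch (ncols) {
            case 32:  submit_templated<32>(q, x, dst, nrows);  break;
            case 512: submit_templated<512>(q, x, dst, nrows); break;
            default:  submit_runtime(q, x, dst, nrows, ncols); break;
        }
    }
}
```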

Also, have you been able to identify why a different soft_max can be used in each iteration? I can see this depends on the first dimension of src0, ne00; can it change based on the values of previous operations? Can you show an example of a model where different sizes are used in different iterations? I just merged a PR to add more debug prints; it could be useful to share the logs of a model running with GGML_SYCL_DEBUG=1 set, or with your own logs.

@Rbiessy
Collaborator
Rbiessy commented May 26, 2025

I can confirm this has too large an impact on performance when SYCL-Graph is not used; see the results of ./bin/test-backend-ops perf -o SOFT_MAX -b SYCL0 for some sizes used in a few models:
with master:

  SOFT_MAX(type=f32,ne=[512,512,12,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):              2622 runs -   633.51 us/run -    25600 kB/run -   38.55 GB/s
  SOFT_MAX(type=f32,ne=[512,512,32,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):              1010 runs -  1572.38 us/run -    66560 kB/run -   40.41 GB/s
  SOFT_MAX(type=f32,ne=[32,1,12,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):               106483 runs -     9.86 us/run -        3 kB/run -    0.30 GB/s
  SOFT_MAX(type=f32,ne=[32,1,32,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):               106483 runs -     9.97 us/run -        8 kB/run -    0.78 GB/s

with the PR:

  SOFT_MAX(type=f32,ne=[512,512,12,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):              1311 runs -   896.08 us/run -    25600 kB/run -   27.26 GB/s
  SOFT_MAX(type=f32,ne=[512,512,32,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):               505 runs -  2215.32 us/run -    66560 kB/run -   28.68 GB/s
  SOFT_MAX(type=f32,ne=[32,1,12,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):                98292 runs -    10.75 us/run -        3 kB/run -    0.28 GB/s
  SOFT_MAX(type=f32,ne=[32,1,32,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):                73719 runs -    14.28 us/run -        8 kB/run -    0.54 GB/s

I think this is enough evidence to keep the template arguments when SYCL-Graph is not used. I suggest storing in the backend context whether SYCL-Graph ends up being used so that it can be checked when submitting softmax kernels. This could be a new field in opt_feature:

optimize_feature opt_feature;
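
For illustration, the proposal could look roughly like this; the field name and check site are assumptions, not existing code:

```cpp
// Hedged sketch: exact existing members may differ; names here are illustrative.
struct optimize_feature {
    bool reorder    = false;  // existing per-device reorder feature (see discussion below)
    bool sycl_graph = false;  // proposed: set when SYCL-Graph is actually used
};

// At the softmax call site (illustrative):
// if (ctx.opt_feature.sycl_graph) submit_runtime(...); else /* templated path */;
```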

I'm also attaching some full model benchmark results below. Most cases seem slightly worse. The improvement in the TG of phi3 3B Q4_K - Medium is odd. I haven't looked closer at this model.

master:

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 1414.85 ± 40.19 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 34.10 ± 0.61 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 1461.42 ± 35.38 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 36.84 ± 0.27 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 467.35 ± 1.67 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.93 ± 0.91 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 435.12 ± 1.73 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.14 ± 1.01 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | 0 | pp512 | 618.69 ± 98.03 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | 0 | tg128 | 25.09 ± 0.41 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | pp512 | 593.43 ± 1.43 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | tg128 | 21.58 ± 0.08 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | pp512 | 752.97 ± 1.28 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | tg128 | 18.07 ± 0.07 |

PR:

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 1387.46 ± 37.68 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 34.23 ± 0.55 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 1428.39 ± 47.00 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 36.81 ± 0.26 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 438.09 ± 1.09 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.97 ± 0.91 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 464.23 ± 1.75 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 11.66 ± 0.70 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | 0 | pp512 | 584.07 ± 11.48 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | 0 | tg128 | 24.78 ± 0.34 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | pp512 | 582.92 ± 10.35 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | tg128 | 21.82 ± 0.06 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | pp512 | 710.09 ± 8.09 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | tg128 | 20.81 ± 0.14 |

@NeoZhangJianyu
Collaborator

> I can confirm this has too large an impact on performance when SYCL-Graph is not used [...] I think this is enough evidence to keep the template arguments when SYCL-Graph is not used. I suggest storing in the backend context whether SYCL-Graph ends up being used so that it can be checked when submitting softmax kernels. This could be a new field in opt_feature [...]

opt_feature is used to save the optimization features supported by different hardware platforms.
For example, the iGPU in 10th-gen Core processors can't support the reorder feature, so its value is false there.

The SYCL graph feature has nothing to do with the hardware platform.
It should be saved in a single global variable for enabling/disabling it.

@Rbiessy
Collaborator
Rbiessy commented May 27, 2025

@NeoZhangJianyu SYCL-Graph already has a global variable to try and enable SYCL-Graph, but there are also more checks, because SYCL-Graph supports neither all graphs nor all devices; see

const bool graph_support = dpct::get_device(sycl_ctx->device).has(sycl::aspect::ext_oneapi_limited_graph);

Given this check, opt_feature seems like a good fit to me.
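
For illustration, the device-capability check could feed such a flag roughly like this; the field name and the global enable flag are assumptions, not the current code:

```cpp
// Hedged sketch: combine the global SYCL-Graph enable flag with the
// per-device capability check quoted above when initializing the context.
static bool device_supports_sycl_graph(int device) {
    return dpct::get_device(device).has(sycl::aspect::ext_oneapi_limited_graph);
}

// During backend initialization (illustrative):
// ctx.opt_feature.sycl_graph = /* global enable flag */ && device_supports_sycl_graph(ctx.device);
```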

@lslusarczyk
Contributor Author

@qnixsynapse , @NeoZhangJianyu , thank you for reviewing my code and for your comments.

@Rbiessy, thank you very much for checking the performance. Before deciding to have two code paths, which will make the code uglier (instead of simpler, as I tried in this PR), I'd like to understand why these templates gave better performance. From code analysis, I expected nearly no impact from changing the template parameters into runtime arguments.

Expect my updates here in a few days.
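
One plausible explanation, offered here only as an assumption to investigate rather than a conclusion from this PR, is that the compile-time parameters let buffers be statically sized (register- or fixed-local-memory friendly), whereas runtime values force dynamically sized local memory:

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: statically vs. dynamically sized working storage.
template <int block_size_template>
void submit_fixed(sycl::handler &cgh, int nrows) {
    cgh.parallel_for(sycl::nd_range<1>(nrows * block_size_template, block_size_template),
                     [=](sycl::nd_item<1> it) {
        float vals[block_size_template];   // statically sized, register-friendly
        vals[0] = 0.0f;
        (void)vals; (void)it;
    });
}

void submit_dynamic(sycl::handler &cgh, int nrows, int block_size) {
    sycl::local_accessor<float, 1> vals(sycl::range<1>(block_size), cgh);  // sized at run time
    cgh.parallel_for(sycl::nd_range<1>(nrows * block_size, block_size),
                     [=](sycl::nd_item<1> it) {
        vals[it.get_local_id(0)] = 0.0f;   // touches work-group local memory
    });
}
```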
