remove templates from soft_max_f32_submitter to allow SYCL graph updates by lslusarczyk · Pull Request #13724 · ggml-org/llama.cpp · GitHub

remove templates from soft_max_f32_submitter to allow SYCL graph updates #13724


Open
lslusarczyk wants to merge 1 commit into master

Conversation

lslusarczyk
Contributor

When soft_max_f32_sycl is templated, updating the SYCL graph fails because the node type differs between iterations. Passing these values as plain kernel parameters means only the parameters need to be updated.
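
For readers unfamiliar with the failure mode, here is a hedged sketch of the difference; the kernels and names are simplified illustrations, not the actual llama.cpp code:

```cpp
#include <cmath>
#include <sycl/sycl.hpp>

// Templated submitter: each ncols_template value instantiates a distinct
// kernel type. A graph node recorded for submit_templated<512> cannot be
// updated to submit_templated<32>; the update is rejected as a node mismatch.
template <int ncols_template>
void submit_templated(sycl::queue &q, const float *x, float *dst, int nrows) {
    q.parallel_for(sycl::range<1>(nrows), [=](sycl::id<1> row) {
        float m = -INFINITY;
        for (int c = 0; c < ncols_template; ++c)   // compile-time bound
            m = sycl::fmax(m, x[row * ncols_template + c]);
        dst[row] = m;                              // exp/sum/normalize omitted
    });
}

// Runtime submitter: a single kernel type; ncols is an ordinary argument, so a
// SYCL-Graph update only has to refresh the kernel arguments.
void submit_runtime(sycl::queue &q, const float *x, float *dst, int nrows, int ncols) {
    q.parallel_for(sycl::range<1>(nrows), [=](sycl::id<1> row) {
        float m = -INFINITY;
        for (int c = 0; c < ncols; ++c)            // runtime bound
            m = sycl::fmax(m, x[row * ncols + c]);
        dst[row] = m;                              // exp/sum/normalize omitted
    });
}
```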

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 23, 2025
@lslusarczyk lslusarczyk marked this pull request as ready for review May 23, 2025 12:26
Collaborator
@qnixsynapse qnixsynapse left a comment


LGTM

@NeoZhangJianyu
Collaborator

@lslusarczyk
Will you report such issues to the SYCL graph team?
I think it would also help improve SYCL graph at the same time.

Thank you!

@Rbiessy Rbiessy self-requested a review May 26, 2025 08:11
@Rbiessy
Collaborator
Rbiessy commented May 26, 2025

We will run some benchmarks for this change, thanks. I suspect the template arguments were introduced to optimize the kernel in some cases. It would probably be safer to use the "runtime version" of softmax (i.e. with the template arguments <false, 0, 0>) when SYCL-Graph is used and keep the existing version otherwise.
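
As a hedged illustration of that suggestion (reusing the illustrative submit_templated / submit_runtime helpers from the sketch above; none of these names are existing llama.cpp functions), the dispatch could look roughly like this:

```cpp
// Sketch only: use the single non-templated kernel when SYCL-Graph is in use,
// and keep the specialized instantiations on the eager path.
void soft_max_dispatch(sycl::queue &q, const float *x, float *dst,
                       int nrows, int ncols, bool use_sycl_graph) {
    if (use_sycl_graph) {
        // Graph path: one kernel type, so a graph update only swaps arguments.
        submit_runtime(q, x, dst, nrows, ncols);
    } else {
        // Eager path: keep specialized instantiations for common widths.
        switch (ncols) {
            case 32:  submit_templated<32>(q, x, dst, nrows);  break;
            case 512: submit_templated<512>(q, x, dst, nrows); break;
            default:  submit_runtime(q, x, dst, nrows, ncols); break;
        }
    }
}
```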

Also, have you been able to identify why a different soft_max can be used in each iteration? I can see this depends on the first dimension of src0, ne00; can it change based on the values of previous operations? Can you show an example of a model where different sizes are used in different iterations? I just merged a PR to add more debug prints; it could be useful to share the logs of a model running with GGML_SYCL_DEBUG=1 set, or with your own logs.

@Rbiessy
Collaborator
Rbiessy commented May 26, 2025

I can confirm this has too large an impact on performance when SYCL-Graph is not used; see the results of ./bin/test-backend-ops perf -o SOFT_MAX -b SYCL0 for some sizes used in a few models:
with master:

  SOFT_MAX(type=f32,ne=[512,512,12,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):              2622 runs -   633.51 us/run -    25600 kB/run -   38.55 GB/s
  SOFT_MAX(type=f32,ne=[512,512,32,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):              1010 runs -  1572.38 us/run -    66560 kB/run -   40.41 GB/s
  SOFT_MAX(type=f32,ne=[32,1,12,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):               106483 runs -     9.86 us/run -        3 kB/run -    0.30 GB/s
  SOFT_MAX(type=f32,ne=[32,1,32,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):               106483 runs -     9.97 us/run -        8 kB/run -    0.78 GB/s

with the PR:

  SOFT_MAX(type=f32,ne=[512,512,12,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):              1311 runs -   896.08 us/run -    25600 kB/run -   27.26 GB/s
  SOFT_MAX(type=f32,ne=[512,512,32,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):               505 runs -  2215.32 us/run -    66560 kB/run -   28.68 GB/s
  SOFT_MAX(type=f32,ne=[32,1,12,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):                98292 runs -    10.75 us/run -        3 kB/run -    0.28 GB/s
  SOFT_MAX(type=f32,ne=[32,1,32,1],mask=1,m_prec=f32,scale=1.000000,max_bias=0.000000):                73719 runs -    14.28 us/run -        8 kB/run -    0.54 GB/s

I think this is enough evidence to keep the template arguments when SYCL-Graph is not used. I suggest storing in the backend context whether SYCL-Graph ends up being used so that it can be checked when submitting softmax kernels. This could be a new field in opt_feature:

optimize_feature opt_feature;
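
For illustration, the proposal could look roughly like this; the field name and check site are assumptions, not existing code:

```cpp
// Hedged sketch: exact existing members may differ; names here are illustrative.
struct optimize_feature {
    bool reorder    = false;  // existing per-device reorder feature (see discussion below)
    bool sycl_graph = false;  // proposed: set when SYCL-Graph is actually used
};

// At the softmax call site (illustrative):
// if (ctx.opt_feature.sycl_graph) submit_runtime(...); else /* templated path */;
```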

I'm also attaching some full model benchmark results below. Most cases seem slightly worse. The improvement in the TG of phi3 3B Q4_K - Medium is odd. I haven't looked closer at this model.

master:

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 1414.85 ± 40.19 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 34.10 ± 0.61 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 1461.42 ± 35.38 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 36.84 ± 0.27 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 467.35 ± 1.67 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.93 ± 0.91 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 435.12 ± 1.73 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.14 ± 1.01 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | 0 | pp512 | 618.69 ± 98.03 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | 0 | tg128 | 25.09 ± 0.41 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | pp512 | 593.43 ± 1.43 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | tg128 | 21.58 ± 0.08 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | pp512 | 752.97 ± 1.28 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | tg128 | 18.07 ± 0.07 |

PR:

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 1387.46 ± 37.68 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 34.23 ± 0.55 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 1428.39 ± 47.00 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 36.81 ± 0.26 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 438.09 ± 1.09 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.97 ± 0.91 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 464.23 ± 1.75 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 11.66 ± 0.70 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | 0 | pp512 | 584.07 ± 11.48 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | 0 | tg128 | 24.78 ± 0.34 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | pp512 | 582.92 ± 10.35 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | tg128 | 21.82 ± 0.06 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | pp512 | 710.09 ± 8.09 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | 0 | tg128 | 20.81 ± 0.14 |

@NeoZhangJianyu
Collaborator

> I can confirm this has too large an impact on performance when SYCL-Graph is not used [...] I think this is enough evidence to keep the template arguments when SYCL-Graph is not used. I suggest storing in the backend context whether SYCL-Graph ends up being used so that it can be checked when submitting softmax kernels. This could be a new field in opt_feature [...]

opt_feature is used to save the optimization features supported by different hardware platforms.
For example, the iGPU in 10th-gen Core processors can't support the reorder feature, so its value is false there.

The SYCL graph feature has nothing to do with the hardware platform.
It should be saved in a single global variable for enabling/disabling it.

@Rbiessy
Collaborator
Rbiessy commented May 27, 2025

@NeoZhangJianyu SYCL-Graph already has a global variable to try and enable SYCL-Graph, but there are also more checks, because SYCL-Graph supports neither all graphs nor all devices; see

const bool graph_support = dpct::get_device(sycl_ctx->device).has(sycl::aspect::ext_oneapi_limited_graph);

Given this check, opt_feature seems like a good fit to me.
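
For illustration, the device-capability check could feed such a flag roughly like this; the field name and the global enable flag are assumptions, not the current code:

```cpp
// Hedged sketch: combine the global SYCL-Graph enable flag with the
// per-device capability check quoted above when initializing the context.
static bool device_supports_sycl_graph(int device) {
    return dpct::get_device(device).has(sycl::aspect::ext_oneapi_limited_graph);
}

// During backend initialization (illustrative):
// ctx.opt_feature.sycl_graph = /* global enable flag */ && device_supports_sycl_graph(ctx.device);
```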

@lslusarczyk
Contributor Author

@qnixsynapse , @NeoZhangJianyu , thank you for reviewing my code and for your comments.

@Rbiessy, thank you very much for checking the performance. Before deciding to have two code paths, which will make the code uglier (instead of simpler, as I tried in this PR), I'd like to understand why these templates gave better performance. From code analysis, I expected nearly no impact from changing the template parameters into runtime arguments.

Expect my updates here in a few days.
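
One plausible explanation, offered here only as an assumption to investigate rather than a conclusion from this PR, is that the compile-time parameters let buffers be statically sized (register- or fixed-local-memory friendly), whereas runtime values force dynamically sized local memory:

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: statically vs. dynamically sized working storage.
template <int block_size_template>
void submit_fixed(sycl::handler &cgh, int nrows) {
    cgh.parallel_for(sycl::nd_range<1>(nrows * block_size_template, block_size_template),
                     [=](sycl::nd_item<1> it) {
        float vals[block_size_template];   // statically sized, register-friendly
        vals[0] = 0.0f;
        (void)vals; (void)it;
    });
}

void submit_dynamic(sycl::handler &cgh, int nrows, int block_size) {
    sycl::local_accessor<float, 1> vals(sycl::range<1>(block_size), cgh);  // sized at run time
    cgh.parallel_for(sycl::nd_range<1>(nrows * block_size, block_size),
                     [=](sycl::nd_item<1> it) {
        vals[it.get_local_id(0)] = 0.0f;   // touches work-group local memory
    });
}
```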
