MLA kv cache: fix split graph backend assignment when kv cache store on CPU #13648

xiang1guo · 2025-05-20T07:00:37Z

Background

A possible mino fix based on #12801 if I understand correctly.

With the new DS-V3 GGUF file for reduced KV cache, if the user sets no KV offload with LLAMA_ARG_NO_KV_OFFLOAD=1, we still see the MLA computation is assigned to gpu-backend(tested with sycl backend), which is unexpected in my understanding.

Solution

The root cause is that the last SV_absorb matmul's output node for MLA is not controlled by env and assigned to be GPU-backend when doing graph splits (expand up stage). The node is added together with #12801 for MLA with reduced KV Cache, and then the setting in here is invalid because of the new node. This PR forces that node to be the CPU backend if the user sets LLAMA_ARG_NO_KV_OFFLOAD=1 so that the comp 8000 utation between the KV cache store and the attention output can be assigned to CPU.

Testing results.(SYCL Backend)

The model I used is the newly updated DeepSeek-V3-0324 for Q4_K_M version.
https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/tree/main/Q4_K_M

Functionality wise

LLAMA_ARG_NO_KV_OFFLOAD=1 Force KV cache on CPU. Before the change.

## SPLIT #247: CPU # 2 inputs: [Kcur-28 (   1M)] [Vcur-28 (reshaped) (transposed) (   1M)] 
node #2242 (       CPY): cache_k_l28 (view) ( ( 306K) [  CPU         ]:        CPU#Kcur-28#0 (   1M) [ NULL         ]   cache_k_l28 (view) ( 306K) [  CPU         ]
node #2246 (       CPY): cache_v_l28 (view) ( (   1M) [  CPU         ]: CPU#Vcur-28 (reshape (   1M) [ NULL         ]   cache_v_l28 (view) (   1M) [  CPU         ]

## SPLIT #248: SYCL0 # 2 inputs: [cache_k_l28 (view) (permuted) (   1M)] [cache_v_l28 (view) (permuted) (   2M)] 
node #2252 (   MUL_MAT):            node_2252 ( 512M) [SYCL0         ]: SYCL0#cache_k_l28 (v (   1M) [ NULL         ]   Qcur-28 (permuted) ( 144M) [SYCL0         ]
node #2253 (  SOFT_MAX):            node_2253 ( 512M) [SYCL0         ]:            node_2252 ( 512M) [SYCL0         ]      SYCL0#leaf_13#0 (   4M) [ NULL         ]
node #2254 (   MUL_MAT):            node_2254 ( 128M) [SYCL0         ]: SYCL0#cache_v_l28 (v (   2M) [ NULL         ]            node_2253 ( 512M) [SYCL0         ]
node #2255 (   MUL_MAT):            node_2255 (  32M) [SYCL0         ]: blk.28.attn_v_b.weig (   4M) [SYCL0         ]            node_2254 ( 128M) [SYCL0         ]

## SPLIT #249: CPU # 1 inputs: [ (permuted) (  32M)] 
node #2257 (      CONT):           kqv_out-28 (  32M) [  CPU         ]:    CPU# (permuted)#0 (  32M) [ NULL         ]

LLAMA_ARG_NO_KV_OFFLOAD=1 Force KV cache on CPU. After the change.

## SPLIT #14: CPU # 3 inputs: [Kcur-2 (   1M)] [Vcur-2 (reshaped) (transposed) (   1M)] [Qcur-2 (permuted) ( 144M)] 
node #119 (       CPY): cache_k_l2 (view) (c ( 306K) [  CPU         ]:         CPU#Kcur-2#0 (   1M) [ NULL         ]    cache_k_l2 (view) ( 306K) [  CPU         ]
node #123 (       CPY): cache_v_l2 (view) (c (   1M) [  CPU         ]: CPU#Vcur-2 (reshaped (   1M) [ NULL         ]    cache_v_l2 (view) (   1M) [  CPU         ]
node #129 (   MUL_MAT):             node_129 ( 512M) [  CPU         ]: cache_k_l2 (view) (p (   1M) [  CPU         ] CPU#Qcur-2 (permuted ( 144M) [ NULL         ]
node #130 (  SOFT_MAX):             node_130 ( 512M) [  CPU         ]:             node_129 ( 512M) [  CPU         ]              leaf_13 (   4M) [  CPU         ]
node #131 (   MUL_MAT):             node_131 ( 128M) [  CPU         ]: cache_v_l2 (view) (p (   2M) [  CPU         ]             node_130 ( 512M) [  CPU         ]

Performance-wise

Though it depends on the CPU/GPU kernel implementation for MLA, and profiling results show low efficiency in the first matmul in SYCL backend, therefore, the performance speedup is not very meaningful. I just put some perf numbers here for reference.
Prefill speedup: 7%
Decoding speedup: 40%

Testing results.(CUDA Backend tested with 4070)

The model I used is the newly updated DeepSeek-V3-0324 for Q4_K_M version.
https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/tree/main/Q4_K_M

Functionality wise

LLAMA_ARG_NO_KV_OFFLOAD=1 Force KV cache on CPU. Before the change.

## SPLIT #9: CPU # 2 inputs: [Kcur-0 (   1M)] [Vcur-0 (reshaped) (transposed) (   1M)] 
node # 23 (       CPY): cache_k_l0 (view) (c ( 306K) [  CPU         ]:         CPU#Kcur-0#0 (   1M) [ NULL         ]    cache_k_l0 (view) ( 306K) [  CPU         ]
node # 27 (       CPY): cache_v_l0 (view) (c (   1M) [  CPU         ]: CPU#Vcur-0 (reshaped (   1M) [ NULL         ]    cache_v_l0 (view) (   1M) [  CPU         ]

## SPLIT #10: CUDA0 # 3 inputs: [cache_k_l0 (view) (permuted) (   1M)] [leaf_13 (   4M)] [cache_v_l0 (view) (permuted) (   2M)] 
node # 33 (   MUL_MAT):              node_33 ( 512M) [CUDA0         ]: CUDA0#cache_k_l0 (vi (   1M) [ NULL         ]    Qcur-0 (permuted) ( 144M) [CUDA0         ]
node # 34 (  SOFT_MAX):              node_34 ( 512M) [CUDA0         ]:              node_33 ( 512M) [CUDA0         ]      CUDA0#leaf_13#0 (   4M) [ NULL         ]
node # 35 (   MUL_MAT):              node_35 ( 128M) [CUDA0         ]: CUDA0#cache_v_l0 (vi (   2M) [ NULL         ]              node_34 ( 512M) [CUDA0         ]

## SPLIT #11: CUDA0 # 1 inputs: [blk.0.attn_v_b.weight (   4M)] 
node # 36 (   MUL_MAT):              node_36 (  32M) [CUDA0         ]: CUDA0#blk.0.attn_v_b (   4M) [ NULL         ]              node_35 ( 128M) [CUDA0         ]

## SPLIT #12: CPU # 1 inputs: [ (permuted) (  32M)] 
node # 38 (      CONT):            kqv_out-0 (  32M) [  CPU         ]:    CPU# (permuted)#0 (  32M) [ NULL         ]

LLAMA_ARG_NO_KV_OFFLOAD=1 Force KV cache on CPU. After the change.

## SPLIT #9: CPU # 3 inputs: [Kcur-0 (   1M)] [Vcur-0 (reshaped) (transposed) (   1M)] [Qcur-0 (permuted) ( 144M)] 
node # 23 (       CPY): cache_k_l0 (view) (c ( 306K) [  CPU         ]:         CPU#Kcur-0#0 (   1M) [ NULL         ]    cache_k_l0 (view) ( 306K) [  CPU         ]
node # 27 (       CPY): cache_v_l0 (view) (c (   1M) [  CPU         ]: CPU#Vcur-0 (reshaped (   1M) [ NULL         ]    cache_v_l0 (view) (   1M) [  CPU         ]
node # 33 (   MUL_MAT):              node_33 ( 512M) [  CPU         ]: cache_k_l0 (view) (p (   1M) [  CPU         ] CPU#Qcur-0 (permuted ( 144M) [ NULL         ]
node # 34 (  SOFT_MAX):              node_34 ( 512M) [  CPU         ]:              node_33 ( 512M) [  CPU         ]              leaf_13 (   4M) [  CPU         ]
node # 35 (   MUL_MAT):              node_35 ( 128M) [  CPU         ]: cache_v_l0 (view) (p (   2M) [  CPU         ]              node_34 ( 512M) [  CPU         ]
node # 36 (   MUL_MAT):              node_36 (  32M) [  CPU         ]: blk.0.attn_v_b.weigh (   4M) [  CPU         ]              node_35 ( 128M) [  CPU         ]
node # 38 (      CONT):            kqv_out-0 (  32M) [  CPU         ]:           (permuted) (  32M) [  CPU         ]

Performance-wise

Prefill: 94% of master branch.
Decoding: 98.6% of master branch.

…en kv cache store on CPU

xiang1guo · 2025-05-20T07:02:46Z

@jukofyork Please help take a look at this change. Thanks!

xiang1guo · 2025-05-21T05:23:56Z

@ggerganov Can you please help to take a look at this PR?

ggerganov · 2025-05-21T08:13:05Z

I think the change is good. Would like to see some reports how much it affects the performance on some CUDA configurations.

xiang1guo · 2025-05-21T08:21:51Z

I think the change is good. Would like to see some reports how much it affects the performance on some CUDA configurations.

@ggerganov Thanks for the reply! I will try to find one and collect some data.

jukofyork · 2025-05-21T14:06:13Z

@jukofyork Please help take a look at this change. Thanks!

Sorry for the slow reply - I'm away until late tomorrow, but will try to look at it then.

xiang1guo · 2025-05-22T01:19:36Z

@jukofyork Please help take a look at this change. Thanks!

Sorry for the slow reply - I'm away until late tomorrow, but will try to look at it then.

Yeah, feel free on this one~ I noticed your comment in last PR that you are on leave :)

xiang1guo · 2025-05-22T06:04:04Z

I think the change is good. Would like to see some reports how much it affects the performance on some CUDA configurations.

@ggerganov Thanks for the reply! I will try to find one and collect some data.

@ggerganov I updated CUDA backend performance data in PR description, the performance data is slightly slower than master branch. It depends on the CPU and GPU kernel implementation I think.

slaren · 2025-05-22T23:35:10Z

If using a CPU KV cache with MLA, I would suggest using --overridetensor to force attn_v_b in the CPU. Otherwise, even with this change, this tensor is going to be stored in VRAM and copied back to the CPU on every evaluation.

xiang1guo · 2025-05-26T01:29:03Z

If using a CPU KV cache with MLA, I would suggest using --overridetensor to force attn_v_b in the CPU. Otherwise, even with this change, this tensor is going to be stored in VRAM and copied back to the CPU on every evaluation.

oh, sorry for the late reply, @slaren just missed your comment. Yes, you are right, -ot attn_v_b=CPU can help to save the memory copy from VRAM to CPU and solve the issue. I was just considering the correct use of LLAMA_ARG_NO_KV_OFFLOAD=1. Becasuse with this setting, SDPA is not calculated in CPU which is not as expected. Thanks again for pointing this out! @ggerganov Please guide if we still need to merge this.

ggerganov · 2025-05-30T10:40:30Z

@xiang1guo I think this change is not necessary given that there is a better alternative.

src: llama-graph: MLA kv cache: fix split graph backend assignment wh…

0fb5ca6

…en kv cache store on CPU

ggerganov closed this May 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MLA kv cache: fix split graph backend assignment when kv cache store on CPU #13648

MLA kv cache: fix split graph backend assignment when kv cache store on CPU #13648

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MLA kv cache: fix split graph backend assignment when kv cache store on CPU #13648

MLA kv cache: fix split graph backend assignment when kv cache store on CPU #13648

Uh oh!

Conversation

Uh oh!

Background

Solution

Testing results.(SYCL Backend)

Functionality wise

Performance-wise

Testing results.(CUDA Backend tested with 4070)

Functionality wise

Performance-wise

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!