8000 TG improvements for MoE models by ikawrakow · Pull Request #404 · ikawrakow/ik_llama.cpp · GitHub
TG improvements for MoE models #404


Merged — 3 commits, May 10, 2025
Conversation

ikawrakow
Owner

This PR does 3 things:

  • Removes an unnecessary device-to-host copy of the selected expert IDs on CUDA. This gives a few percent improvement in CUDA TG speed for MoE models.
  • Fixes bugs related to Smart Expert Reduction (SER, see SER - Smart Expert Reduction #239). The issue was that the GGML_OP_GET_ROWS op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts, garbage weights were used for the disabled experts, which could lead to NaNs.
  • Further improves CUDA TG performance with SER enabled. Here the ggml_cuda_op_mul_mat_vec_q_id function did not consider that an expert may be disabled, and needlessly calculated the matrix-vector multiplication for disabled experts.
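The pattern behind the second and third fixes can be illustrated with a small sketch. This is not the actual ik_llama.cpp code (the real work happens in CUDA kernels and ggml ops); it is a hypothetical CPU-side illustration of the invariant: with SER, a disabled expert is marked by an ID of -1, and every stage that touches expert outputs must skip such entries instead of computing with them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of combining expert outputs under SER.
// expert_ids[e] == -1 marks an expert disabled by SER; it must be
// skipped, otherwise its (garbage) contribution pollutes the sum.
std::vector<float> combine_experts(
        const std::vector<std::vector<float>>& expert_out, // per-expert outputs
        const std::vector<int>&                expert_ids, // -1 => disabled
        const std::vector<float>&              weights,    // routing weights
        std::size_t dim) {
    std::vector<float> result(dim, 0.0f);
    for (std::size_t e = 0; e < expert_ids.size(); ++e) {
        if (expert_ids[e] < 0) continue; // SER-disabled expert: skip entirely
        for (std::size_t i = 0; i < dim; ++i)
            result[i] += weights[e] * expert_out[expert_ids[e]][i];
    }
    return result;
}
```

Skipping at this level both avoids the NaN-producing garbage reads and saves the needless matrix-vector work for disabled experts.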

Prompt processing is not affected by these changes.

Here is a graph obtained with sweep-bench showing TG performance as a function of the number of tokens in the KV cache N_KV. The model is DeepSeek-Lite quantized to Q4_0. The GPU is RTX-4080. Black symbols are without using SER, red symbols are with -ser 4,1. The command line is

./bin/llama-sweep-bench -m $model -t 1 -ngl 100 -fmoe -mla 3 -fa -b 4096 -ub 4096 [-ser 4,1]

[Graph: TG speed vs. N_KV for DeepSeek-Lite Q4_0 on RTX-4080, black = no SER, red = -ser 4,1]

Iwan Kawrakow added 3 commits May 10, 2025 09:49
We get 3-4% TG speed improvement for DeepSeek-Lite just from that.
With smart expert reduction (SER), one potentially uses fewer
experts than specified by the model. This is accomplished by setting
the IDs of the not-selected experts to -1. Most of the necessary
machinery was implemented when I added the SER option, but I forgot
to update get_rows() for non-quantized tensors. As a result, we
get random garbage for the weights of the not-selected experts,
which leads to garbage output. This commit fixes it on the CPU.
I'm not quite sure yet why the GPU is not working.
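To make the commit's get_rows() point concrete, here is a minimal CPU sketch, assuming a flat row-major float tensor (the function name and signature are illustrative, not ik_llama.cpp's actual ggml implementation): a row ID of -1 must yield a zeroed destination row rather than leaving it uninitialized.

```cpp
#include <cstring>

// Hypothetical get_rows for float (non-quantized) tensors under SER.
// src: n_src_rows x n_cols row-major matrix; row_ids: n_rows indices,
// where -1 marks an expert disabled by SER. Writing zeros for such
// rows avoids the random garbage the commit message describes.
void get_rows_f32(const float* src, int n_cols,
                  const int* row_ids, int n_rows, float* dst) {
    for (int r = 0; r < n_rows; ++r) {
        if (row_ids[r] < 0) {
            // Disabled expert: emit a well-defined (zero) row.
            std::memset(dst + (size_t)r * n_cols, 0, n_cols * sizeof(float));
        } else {
            std::memcpy(dst + (size_t)r * n_cols,
                        src + (size_t)row_ids[r] * n_cols,
                        n_cols * sizeof(float));
        }
    }
}
```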