8000 TG improvements for MoE models by ikawrakow · Pull Request #404 · ikawrakow/ik_llama.cpp · GitHub
TG improvements for MoE models #404


Merged — 3 commits, May 10, 2025
Conversation

ikawrakow
Owner

This PR does 3 things:

  • Removes an unnecessary device-to-host copy of the selected expert IDs on CUDA. This gives a few percent improvement in CUDA TG speed for MoE models.
  • Fixes bugs related to Smart Expert Reduction (SER, see SER - Smart Expert Reduction #239). The issue was that the GGML_OP_GET_ROWS op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts, garbage weights were used for the disabled experts, which could lead to NaNs.
  • Further improves CUDA TG performance with SER enabled. Here the ggml_cuda_op_mul_mat_vec_q_id function did not consider that an expert may be disabled, and needlessly calculated the matrix-vector multiplication for disabled experts.
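The pattern behind the second and third fixes can be illustrated with a small sketch. This is not the actual ik_llama.cpp code (the real work happens in CUDA kernels and ggml ops); it is a hypothetical CPU-side illustration of the invariant: with SER, a disabled expert is marked by an ID of -1, and every stage that touches expert outputs must skip such entries instead of computing with them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of combining expert outputs under SER.
// expert_ids[e] == -1 marks an expert disabled by SER; it must be
// skipped, otherwise its (garbage) contribution pollutes the sum.
std::vector<float> combine_experts(
        const std::vector<std::vector<float>>& expert_out, // per-expert outputs
        const std::vector<int>&                expert_ids, // -1 => disabled
        const std::vector<float>&              weights,    // routing weights
        std::size_t dim) {
    std::vector<float> result(dim, 0.0f);
    for (std::size_t e = 0; e < expert_ids.size(); ++e) {
        if (expert_ids[e] < 0) continue; // SER-disabled expert: skip entirely
        for (std::size_t i = 0; i < dim; ++i)
            result[i] += weights[e] * expert_out[expert_ids[e]][i];
    }
    return result;
}
```

Skipping at this level both avoids the NaN-producing garbage reads and saves the needless matrix-vector work for disabled experts.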

Prompt processing is not affected by these changes.

Here is a graph obtained with sweep-bench showing TG performance as a function of the number of tokens in the KV cache N_KV. The model is DeepSeek-Lite quantized to Q4_0. The GPU is RTX-4080. Black symbols are without using SER, red symbols are with -ser 4,1. The command line is

./bin/llama-sweep-bench -m $model -t 1 -ngl 100 -fmoe -mla 3 -fa -b 4096 -ub 4096 [-ser 4,1]

[Graph: TG speed vs. N_KV for DeepSeek-Lite Q4_0 on RTX-4080, black = no SER, red = -ser 4,1]

Iwan Kawrakow added 3 commits May 10, 2025 09:49
We get 3-4% TG speed improvement for DeepSeek-Lite just from that.
With smart expert reduction (SER), one potentially uses fewer
experts than specified by the model. This is accomplished by setting
the IDs of the not-selected experts to -1. Most of the necessary
machinery was implemented when I added the SER option, but I forgot
to update get_rows() for non-quantized tensors. As a result, we
get random garbage for the weights of the not-selected experts,
which leads to garbage output. This commit fixes it on the CPU.
I'm not quite sure yet why the GPU is not working.
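To make the commit's get_rows() point concrete, here is a minimal CPU sketch, assuming a flat row-major float tensor (the function name and signature are illustrative, not ik_llama.cpp's actual ggml implementation): a row ID of -1 must yield a zeroed destination row rather than leaving it uninitialized.

```cpp
#include <cstring>

// Hypothetical get_rows for float (non-quantized) tensors under SER.
// src: n_src_rows x n_cols row-major matrix; row_ids: n_rows indices,
// where -1 marks an expert disabled by SER. Writing zeros for such
// rows avoids the random garbage the commit message describes.
void get_rows_f32(const float* src, int n_cols,
                  const int* row_ids, int n_rows, float* dst) {
    for (int r = 0; r < n_rows; ++r) {
        if (row_ids[r] < 0) {
            // Disabled expert: emit a well-defined (zero) row.
            std::memset(dst + (size_t)r * n_cols, 0, n_cols * sizeof(float));
        } else {
            std::memcpy(dst + (size_t)r * n_cols,
                        src + (size_t)row_ids[r] * n_cols,
                        n_cols * sizeof(float));
        }
    }
}
```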