TG improvements for MoE models #404
Merged
This PR does two things:

- The `GGML_OP_GET_ROWS` op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts, garbage weights were used for the disabled experts, which could lead to NaNs (see the first sketch below).
- The `ggml_cuda_op_mul_mat_vec_q_id` function did not consider that an expert may be disabled, and needlessly calculated the matrix-vector multiplication for disabled experts (see the second sketch below).

Prompt processing is not affected by these changes.
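To illustrate the first fix, here is a minimal CUDA sketch of a `get_rows`-style gather for float tensors. It is an assumption-laden illustration, not the PR's actual code: it assumes a disabled expert is signaled by a negative row id, and that zeroing the corresponding output row is what keeps a disabled expert from contributing garbage to the weighted combination of expert results.

```cuda
#include <cstdint>

// One thread block per destination row. `row_ids` holds one source-row index
// per destination row; a negative id (assumed convention) marks a disabled
// expert. Writing zeros for that row means the expert contributes nothing
// when the expert results are later combined, instead of injecting garbage.
__global__ void get_rows_f32_id(const float * src, const int32_t * row_ids,
                                float * dst, int ncols) {
    const int row = blockIdx.x;
    const int id  = row_ids[row];
    float * dst_row = dst + (int64_t)row * ncols;
    if (id < 0) {
        // disabled expert -> neutral (zero) weights
        for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
            dst_row[col] = 0.0f;
        }
        return;
    }
    const float * src_row = src + (int64_t)id * ncols;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        dst_row[col] = src_row[col];
    }
}
```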
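For the second fix, a similar sketch of a per-expert matrix-vector kernel, again under stated assumptions (a negative expert id marks a disabled expert; the matrix is shown as plain `float` rather than a quantized type; `blockDim.x` is one warp). The early return is the point: a block assigned to a disabled expert exits before doing any dot-product work, and its untouched output slot is assumed to be ignored downstream because the combine weight for a disabled expert is zero after the first fix.

```cuda
#include <cstdint>

// blockIdx.y selects a (token, expert-id) slot, blockIdx.x a matrix row.
// A negative id (assumed convention) marks a disabled expert, for which the
// block returns immediately instead of computing a useless matvec.
__global__ void mul_mat_vec_id_sketch(const float * mat, const float * vec,
                                      float * dst, const int32_t * ids,
                                      int ncols) {
    const int expert = ids[blockIdx.y];
    if (expert < 0) return;  // disabled expert: skip the multiplication entirely

    const int row = blockIdx.x;
    const float * mat_row = mat + ((int64_t)expert * gridDim.x + row) * ncols;
    const float * v       = vec + (int64_t)blockIdx.y * ncols;

    float sum = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        sum += mat_row[col] * v[col];
    }
    // warp-level reduction (assumes blockDim.x == 32)
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    if (threadIdx.x == 0) {
        dst[(int64_t)blockIdx.y * gridDim.x + row] = sum;
    }
}
```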
Here is a graph obtained with `sweep-bench` showing TG performance as a function of the number of tokens in the KV cache (`N_KV`). The model is DeepSeek-Lite quantized to `Q4_0`. The GPU is an RTX-4080. Black symbols are without SER (smart expert reduction), red symbols are with `-ser 4,1`. The command line is