llama : do not cap thread count when MoE on CPU (#5419) · mglambda/llama.cpp@e5ca393 · GitHub

Commit e5ca393

llama : do not cap thread count when MoE on CPU (ggml-org#5419)
* Not capping thread count when MoE inference is running on CPU
* Whitespace
1 parent e4124c2 commit e5ca393

File tree

1 file changed: +3 -1 lines changed

llama.cpp

Lines changed: 3 additions & 1 deletion
@@ -7285,7 +7285,9 @@ static int llama_decode_internal(
         // TODO: this is mostly important for Apple Silicon where CBLAS is still performing very well
         // we still need some threads to process all non-mul_mat ops, but not too much to avoid interfering
         // with the BLAS calls. need a better solution
-        if (n_tokens >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas()) {
+        // MoE Special Case: This logic applies when hparams.n_expert == 0, i.e. the model is NOT an MoE model. When an MoE is
+        // being processed then Accelerate/BLAS will not be involved, so capping would limit performance.
+        if (n_tokens >= 32 && hparams.n_expert == 0 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas()) {
             n_threads = std::min(4, n_threads);
         }
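
For illustration, a minimal standalone sketch of the thread-selection logic after this change. The helper name pick_n_threads, its parameter list, and the example values are hypothetical simplifications of llama_decode_internal; only the condition mirrors the patched if statement above.

#include <algorithm>
#include <cstdio>

// Minimal sketch (not the actual llama.cpp code): pick the number of threads
// for a decode call. The cap to 4 threads only applies when BLAS will handle
// the large matrix multiplications on CPU, i.e. a dense (non-MoE) model, a
// batch of at least 32 tokens, CPU BLAS available, and no GPU-backed BLAS.
static int pick_n_threads(int n_threads, int n_tokens, int n_expert,
                          bool cpu_has_blas, bool cpu_has_gpublas) {
    const bool blas_handles_matmul =
        n_tokens >= 32 && n_expert == 0 && cpu_has_blas && !cpu_has_gpublas;
    if (blas_handles_matmul) {
        // BLAS does the heavy lifting; keep only a few threads for the
        // remaining non-mul_mat ops so they do not interfere with the BLAS calls.
        return std::min(4, n_threads);
    }
    // MoE model (n_expert > 0) or no CPU BLAS path: ggml runs the matmuls
    // itself, so keep the full requested thread count.
    return n_threads;
}

int main() {
    // Dense model, large batch, CPU BLAS present: capped to 4 threads.
    std::printf("%d\n", pick_n_threads(8, 64, /*n_expert=*/0, true, false));
    // MoE model (n_expert > 0): the cap does not apply, all 8 threads are kept.
    std::printf("%d\n", pick_n_threads(8, 64, /*n_expert=*/8, true, false));
    return 0;
}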

0 commit comments