System Info
The Llama4 model family adopts an MoE layer implementation for better efficiency.
However, in the current implementation the MoE layer in fact performs an ordinary dense FFN forward pass, with all experts involved in the computation. One can see that the `gate_up_proj` matrix has the same shape as if all `num_experts` were active.

I guess the intent was to perform the computation only for the experts selected by the router. A minimal sketch of the dense behavior described above is shown below.
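
The following PyTorch sketch illustrates the reported behavior; it is not the actual transformers code, and all shapes and names are illustrative. Every expert processes every token, so the layer costs as much as `num_experts` dense FFNs, and the router weights only rescale results that were already computed.

```python
# Illustrative sketch (not the transformers implementation): a "dense" MoE
# forward where all experts run on all tokens and routing is applied afterwards.
import torch

num_experts, hidden, intermediate, seq = 16, 64, 128, 8
tokens = torch.randn(seq, hidden)

# Stacked per-expert weights, analogous to a (num_experts, hidden, 2*intermediate)
# gate_up_proj: the parameter (and compute) scales with num_experts.
gate_up_proj = torch.randn(num_experts, hidden, 2 * intermediate)
down_proj = torch.randn(num_experts, intermediate, hidden)

# Dense pass: batched matmul over *all* experts for *all* tokens.
x = tokens.unsqueeze(0).expand(num_experts, -1, -1)                      # (E, S, H)
gate, up = torch.bmm(x, gate_up_proj).chunk(2, dim=-1)                   # (E, S, I) each
expert_out = torch.bmm(torch.nn.functional.silu(gate) * up, down_proj)   # (E, S, H)

# Router weights are applied only after the fact, so the FLOPs spent on
# non-selected experts are wasted.
router_weights = torch.softmax(torch.randn(num_experts, seq, 1), dim=0)
output = (router_weights * expert_out).sum(dim=0)                        # (S, H)
print(output.shape)  # torch.Size([8, 64])
```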
Who can help?
Reproduction
Any usage of the model
Expected behavior
Only the experts chosen by the router should be involved in the computation.
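
For comparison, here is a minimal, hypothetical sketch of the expected sparse behavior (again illustrative, not a proposed patch): tokens are dispatched to their top-k experts and only those experts run, so unused experts are skipped entirely.

```python
# Illustrative sketch of routed (sparse) MoE computation: each token is
# processed only by its top-k selected experts.
import torch

num_experts, hidden, intermediate, seq, top_k = 16, 64, 128, 8, 2
tokens = torch.randn(seq, hidden)
gate_up_proj = torch.randn(num_experts, hidden, 2 * intermediate)
down_proj = torch.randn(num_experts, intermediate, hidden)

router_logits = torch.randn(seq, num_experts)
weights, selected = torch.topk(torch.softmax(router_logits, dim=-1), top_k, dim=-1)

output = torch.zeros_like(tokens)
for e in range(num_experts):
    token_idx, slot = torch.where(selected == e)   # tokens routed to expert e
    if token_idx.numel() == 0:
        continue                                    # expert never selected: no compute
    x = tokens[token_idx]                           # (n_e, H)
    gate, up = (x @ gate_up_proj[e]).chunk(2, dim=-1)
    y = (torch.nn.functional.silu(gate) * up) @ down_proj[e]
    output.index_add_(0, token_idx, weights[token_idx, slot].unsqueeze(-1) * y)

print(output.shape)  # torch.Size([8, 64])
```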