System Info
The Llama4 model family adopts an MoE layer implementation for better efficiency.
However, in the current implementation the MoE layer in fact performs an ordinary dense FFN forward pass, with all experts involved in the computation. One can see that the `gate_up_proj` matrix has the same shape as if all `num_experts` were active.

I guess the intent was to perform the computation only for the experts selected by the router. A minimal sketch of the dense behavior described above is shown below.
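
The following PyTorch sketch illustrates the reported behavior; it is not the actual transformers code, and all shapes and names are illustrative. Every expert processes every token, so the layer costs as much as `num_experts` dense FFNs, and the router weights only rescale results that were already computed.

```python
# Illustrative sketch (not the transformers implementation): a "dense" MoE
# forward where all experts run on all tokens and routing is applied afterwards.
import torch

num_experts, hidden, intermediate, seq = 16, 64, 128, 8
tokens = torch.randn(seq, hidden)

# Stacked per-expert weights, analogous to a (num_experts, hidden, 2*intermediate)
# gate_up_proj: the parameter (and compute) scales with num_experts.
gate_up_proj = torch.randn(num_experts, hidden, 2 * intermediate)
down_proj = torch.randn(num_experts, intermediate, hidden)

# Dense pass: batched matmul over *all* experts for *all* tokens.
x = tokens.unsqueeze(0).expand(num_experts, -1, -1)                      # (E, S, H)
gate, up = torch.bmm(x, gate_up_proj).chunk(2, dim=-1)                   # (E, S, I) each
expert_out = torch.bmm(torch.nn.functional.silu(gate) * up, down_proj)   # (E, S, H)

# Router weights are applied only after the fact, so the FLOPs spent on
# non-selected experts are wasted.
router_weights = torch.softmax(torch.randn(num_experts, seq, 1), dim=0)
output = (router_weights * expert_out).sum(dim=0)                        # (S, H)
print(output.shape)  # torch.Size([8, 64])
```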
Who can help?
Reproduction
Any usage of the model
Expected behavior
Only the experts chosen by the router should be involved in the computation.
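
For comparison, here is a minimal, hypothetical sketch of the expected sparse behavior (again illustrative, not a proposed patch): tokens are dispatched to their top-k experts and only those experts run, so unused experts are skipped entirely.

```python
# Illustrative sketch of routed (sparse) MoE computation: each token is
# processed only by its top-k selected experts.
import torch

num_experts, hidden, intermediate, seq, top_k = 16, 64, 128, 8, 2
tokens = torch.randn(seq, hidden)
gate_up_proj = torch.randn(num_experts, hidden, 2 * intermediate)
down_proj = torch.randn(num_experts, intermediate, hidden)

router_logits = torch.randn(seq, num_experts)
weights, selected = torch.topk(torch.softmax(router_logits, dim=-1), top_k, dim=-1)

output = torch.zeros_like(tokens)
for e in range(num_experts):
    token_idx, slot = torch.where(selected == e)   # tokens routed to expert e
    if token_idx.numel() == 0:
        continue                                    # expert never selected: no compute
    x = tokens[token_idx]                           # (n_e, H)
    gate, up = (x @ gate_up_proj[e]).chunk(2, dim=-1)
    y = (torch.nn.functional.silu(gate) * up) @ down_proj[e]
    output.index_add_(0, token_idx, weights[token_idx, slot].unsqueeze(-1) * y)

print(output.shape)  # torch.Size([8, 64])
```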