Convert Pre-LN Transformers into equivalent Pre-RMSNorm Transformers to accelerate inference and training · Issue #23786 · huggingface/transformers
LayerNorm and RMSNorm are the two most widely used normalization methods in Transformers. In our paper (https://arxiv.org/abs/2305.14858) we unify them for pre-normalization Transformers: the arithmetic equivalence lets us convert a Pre-LN Transformer into a Pre-RMSNorm model with no change in the model's functionality. Because RMSNorm is more efficient than LayerNorm, the conversion enables faster, functionally equivalent inference and training for any Pre-LN Transformer, e.g., GPT and ViT.
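As a minimal sketch of the arithmetic identity this relies on (not the full conversion procedure from the paper): LayerNorm(x) equals RMSNorm applied to the mean-centered input, so once the residual stream is kept zero-mean, each Pre-LN block can use RMSNorm instead of LayerNorm. The `RMSNorm` class below is a plain re-implementation written just for this check, not an existing class in `transformers`.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Standalone RMSNorm for the equivalence check below."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the last dimension.
        rms = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(rms + self.eps)


dim = 64
x = torch.randn(2, 8, dim, dtype=torch.float64)

layer_norm = nn.LayerNorm(dim, eps=1e-5).double()
rms_norm = RMSNorm(dim, eps=1e-5).double()

# Subtracting the per-token mean makes the RMS of the input equal to its
# standard deviation, so LayerNorm(x) == RMSNorm(x - mean(x)).
x_centered = x - x.mean(dim=-1, keepdim=True)
print(torch.allclose(layer_norm(x), rms_norm(x_centered), atol=1e-10))  # True
```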