upstream apex.normalization.FusedRMSNorm
#72643
Comments
(if it's a variant of LayerNorm, should it be supported by LayerNorm natively, e.g. if weight is not None and bias is None, it should call this new fused kernel?)
From https://arxiv.org/abs/1910.07467
So it's the same as LayerNorm, but:
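For reference, the definition from that paper keeps only a learned gain g_i and drops both the mean subtraction and the bias that LayerNorm has:

```math
\bar{a}_i = \frac{a_i}{\operatorname{RMS}(\mathbf{a})}\, g_i,
\qquad
\operatorname{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} a_i^2}
```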
@vadimkantorov the problem is that you also want to ignore the average term (E[x]) there?
It'd be awesome to have this available in PyTorch!! It was used again, in this big paper: https://arxiv.org/pdf/2112.11446.pdf
Another big model using it: https://arxiv.org/abs/2302.13971
Also I've noticed that there seems to be a test for a fused version: https://github.com/pytorch/pytorch/blame/5dd52e250f66a5e3377eb39228cd929871f1eb5d/test/functorch/test_memory_efficient_fusion.py#L155
Hey!
But if this request is put on hold for another year it might come through. Which is fine too, since we do have
tbh, if I could just use
Just FYI, RMSNorm isn't just LayerNorm without bias (otherwise we could have used the functional method that allows not passing a bias). You also need to remove the mean estimation. Totally fine with the plan to use
@albanD (also
yes
@PetrochukM this is the whole point of torch.compile (and the big difference with existing JITs in ML frameworks): it is designed to work with partial graphs and small pieces!
Thank you for clarifying that one doesn't have to compile the whole model to fuse just one component, @albanD. That's great! I disagree that this should be left to users. You want PyTorch to be the winning framework, correct? Make it great out of the box. Expecting users to figure out that they need to torch.compile things themselves is a lot to ask. If it works, why not?
I tend to agree with @stas00, as pytorch core/domain libraries are an important source of idioms that are adopted by the community. So if an RMSNorm module can be implemented trivially in core by torch.compiling a simple impl, it's great to have it in core + tests + perf tests. Then this could also be a showcase/doc reference of a use case where torch.compile works great (kind of dogfooding).
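A minimal sketch of such a trivially compilable module (illustrative only, not an actual core API; it assumes normalization over the last dimension and an eps of 1e-6):

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """Sketch of RMSNorm: scale by 1/RMS(x) over the last dim; no mean subtraction, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt of the mean of squares; keepdim so it broadcasts over the last dim
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


# Wrapping the module in torch.compile lets Inductor fuse the reduction
# and the elementwise ops instead of relying on a hand-written kernel.
device = "cuda" if torch.cuda.is_available() else "cpu"
norm = torch.compile(RMSNorm(1024).to(device))
out = norm(torch.randn(8, 128, 1024, device=device))
```

Keeping the eager definition this simple is what makes the "just torch.compile it in core" option attractive.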
Yeah, if compiling works, why not offer out-of-the-box PyTorch modules that have been compiled together? Especially modules that are used in state-of-the-art models. Most state-of-the-art models cannot be built in plain PyTorch because they require many layers that are not readily available and are hard to implement. The community ends up needing to use libraries like
@albanD the problem is that we need to do a zero-to-one of having torch.compile inside regular torch library code. This is a reasonable thing to want to do, but there hasn't really been any emphasis on it (since most of the effort has been on torch.compile with full libraries). Because there is no emphasis on this style of use case, there are big perf gaps (e.g., guard evaluation overhead matters a lot more in this regime). So to make your suggestion into reality, we need to also spend some time making this work well. Or we just enlist the OSS community's help in adding the FusedRMSNorm into the framework directly and kick the can a few more months.
This is definitely the long-term vision where compile will be always there, optimizing what it can and leaving the rest as python code. Right now we're in a weird transition period though since, as you said, this is not stable enough to be in this always-on state. And so the question is still open on how to best spend our resources: add more fused one-off kernels that will be obsolete when compile is the default, or work on making compile the default. To come back to this particular issue, adding the new flag to be able to do RMSNorm is a definite "yes we want it", but adding the fused implementation is more "we would accept a simple PR but no one on the core team is working on it".
I think having certain things in core that are torch.compile'd demonstrates the maturity of the technology to users, so it's a good goal to have by itself. Plus, in core you could have automated perf tests comparing it against the legacy apex fused kernels. If wanted, this stuff could go into some separate experimental pytorch package (from which kernels can graduate into core). As long as it's built/released/tested along with the rest of pytorch, it's already strictly better than apex. Once pytorch core has torch.compile'd kernels, it will show true commitment to the technology.
If things are still unstable and it'd be wasteful to allocate resources to such porting because an automatic solution is "imminent", then I think it's perfectly fine to close this feature request and tell the user to use torch.compile.
We unify the Pre-LN Transformer and the Pre-RMSNorm Transformer. Considering that RMSNorm offers superior efficiency compared to LayerNorm in theory, we believe that providing an official RMSNorm API would greatly benefit the community, allowing them to harness this improvement in both training and inference effectively. We also release our implementation at https://github.com/ZixuanJiang/pre-rmsnorm-transformer for reference. Thanks for your consideration.
Following up on the discussion here, from further discussion, it seems like LayerNorm computes (x - E[x]) / sqrt(Var[x] + eps) * weight + bias, and RMSNorm computes x / sqrt(E[x^2] + eps) * weight, where Var[x] = E[x^2] - E[x]^2, and so the two coincide exactly when E[x] = 0 and bias = 0.
If my understanding here is correct, we would accept the addition of
It also seems that there's an RMSNorm impl in flash-attention (fused with dropout): https://github.com/Dao-AILab/flash-attention/blob/4f285b354796fb17df8636485b9a04df3ebbb7dc/flash_attn/ops/rms_norm.py#L11
Am I also understanding correctly that semantically
There is also an impl of RMSNorm in Triton at https://github.com/kakaobrain/trident/blob/main/trident/kernel/rms_norm.py - maybe it can be incorporated into core?
+1
Any progress on this issue? cc: @albanD
As a matter of fact, the RMSNorm from FasterTransformer is much faster than Apex's.
Could you please share some factual information to support this claim, @vince62s?
@stas00 I can try to make a standalone snippet, but I can tell you that in this PR https://github.com/OpenNMT/OpenNMT-py/pull/2539/files it had a huge impact on inference tok/sec for a Mistral-7B LM, for instance (separately from the other change, the kv_cache from flash2, which also had an impact).
Should probably be closed now.
The PR adds the API but not the fused kernel. So we can keep this open in case someone wants to investigate if we would get a benefit from a manually fused kernel. And if so, wants to upstream such an implementation.
huggingface/transformers#30236 Is it a problem with my algorithm or a problem with the RMSNorm implementation? self.weight is bf16, hidden_states is fp32, input_dtype is bf16: `return self.weight * hidden_states.to(input_dtype)  # bf16 * bf16` and I get a loss spike.
This is quite interesting, maybe there needs to be some more expressive torch.mul args to allow some upcasts during the multiplication itself (compute dtype), e.g.
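For context, the dtype handling that Transformers-style RMSNorm implementations typically use looks roughly like the sketch below (illustrative, not the exact upstream code; `hidden_size` and the eps default are placeholders): the reduction runs in fp32 and the result is cast back to the input dtype only for the final elementwise product with the weight.

```python
import torch
from torch import nn


class RMSNormWithCast(nn.Module):
    # Illustrative sketch of the usual mixed-precision handling:
    # do the reduction in fp32 for stability, cast back before multiplying by the weight.
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype                  # e.g. bf16 under mixed precision
        hidden_states = hidden_states.to(torch.float32)    # upcast for the reduction
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
        # cast back first, so the final product is bf16 * bf16 as in the snippet above
        return self.weight * hidden_states.to(input_dtype)
```

Whether that final product should instead run in a higher compute dtype is exactly the open question raised above.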
It also appears that Intel IPEX has some optimized code paths for RMSNorm and LayerNorm on CPUs (mentioning for pointers for a potential eventual upstream): https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm-modeling
Another fused RMSNorm impl in Triton: https://x.com/hsu_byron/status/1827072742291861975 at https://github.com/linkedin/Liger-Kernel @msaroufim The max-mem improvements in Liger are massive. Curious if some of these optimizations could be added to core (like linear layer + softmax + cross-entropy; Inductor would not create such fusions now?) or if these Triton kernels could be added to PyTorch (at least as some known fusion patterns for Inductor). And of course the best would be to be able to generate these fusions with Inductor.
As an extension, maybe it would be nice to have some way to extend Inductor by registering into it some known fusion pattern to be matched, along with user-provided Triton code. This way, the Triton code could be contributed first e.g. in some out-of-tree package like HF or torchao and the original model code could sometimes be kept intact. And maybe then these patterns can be brought in-tree.
RoPE speedup is also pretty wild.
So for this kind of work we basically have 4 options, sorted in decreasing difficulty:
Honestly I'd just vote for 4 since that's easiest and useful now. 3 isn't that crazy nowadays since it's easier to package Triton kernels, given they get JIT'd vs CUDA. 1 and 2 sound like the right long-term thing to do, but I'd love to hear more from @Chillee or @eellison on this: how hard is it to go from a solid Triton kernel to getting it working with the pattern matcher today? Ideally, if there are a few n00b-friendly reference examples it could help us move faster here.
I think the main advantages of in-core are discoverability and the resources to keep it tested for new accelerators, to be notified if any regressions happen (on pytorch's large bench), and avoiding forcing other packages to take on another dependency. Maybe torchao can be a "staging" area before inclusion in core, but probably major packages like HF could be wary of taking a dependency on it too (or maybe not?).
The problem with 4 is that it keeps the speeding-up packages fragmented and often not tested against / plugging into each other. E.g. another speeding-up package announced yesterday: https://x.com/DAlistarh/status/1826538436225806468, https://github.com/IST-DASLab/Sparse-Marlin. Currently, wide adoption happens via domination of either PyTorch core or some other ultra-popular package like HF. In the end, IMO everyone wins if good, continuously tested Triton kernels somehow get widely adopted - either by PyTorch core (eventually) or via some other ultra-popular package (I hoped it would be torchao, but I think for now it's just one of tons of other packages experimenting with quantization APIs and kernels). So IMO what's lacking now is a PyTorch-proper mechanism for wide adoption of speed-ups besides inclusion in core.
Also, including some more fused kernels / pattern matchings in core sets a new baseline for all newer packages, simplifying the benchmarking work for them. Currently there are tons of repos with fast SwiGLU or fast RoPE (including xformers or trident or apex or oneflow). For any such repo, benchmarking against all the others is not easy and is error-prone, so having a better baseline in-core is very useful.
Regarding the bar on kernel inclusion, I propose to have it lower for kernels used in popular LLM models, as the benefit from having the speedups in-core is large. Another question is whether these proposed kernels are benchmarked against vanilla torch.compile versions...
From my testing, I remember that vanilla torch.compile has on-par or better perf compared to handwritten Triton kernels for RMSNorm.
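A rough way to reproduce that kind of comparison on your own shapes (a sketch: it only compares eager vs. torch.compile, assumes a CUDA device, and the sizes/dtype are arbitrary; a handwritten Triton kernel could be slotted in as a third entry):

```python
import torch
from torch.utils import benchmark


def rms_norm(x, weight, eps=1e-6):
    # eager reference: normalize by the root mean square over the last dim
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight


dim = 4096
x = torch.randn(8192, dim, device="cuda", dtype=torch.bfloat16)
w = torch.ones(dim, device="cuda", dtype=torch.bfloat16)

compiled = torch.compile(rms_norm)
compiled(x, w)  # warm up once so compilation time is excluded

for label, fn in [("eager", rms_norm), ("torch.compile", compiled)]:
    timer = benchmark.Timer(stmt="fn(x, w)", globals={"fn": fn, "x": x, "w": w})
    print(label, timer.timeit(100))
```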
Maybe the only interesting thing there is the fused linear + softmax + loss (in terms of max-mem reduction). Or another question is why it's showing such big speedups against HF, assuming HF can now use torch.compile... Or maybe the baseline/config in those benchmarks is not very strong? Or is the problem that torch.compile does not decompose/compile RMSNorm / applying RoPE into Triton?
The chunked linear + crossentropy makes a big difference in terms of peak memory, and for finetuning where you're often batch-size constrained, I've seen it make a huge difference in terms of total throughput. |
So I guess maybe what could be cool is inclusion in PyTorch core of some extremely popular shortcuts from popular LLM models (like RoPE / SwiGLU) and having torch.compile applied to them by PyTorch itself. At least it would raise the eager baseline for any newly proposed fused ops...
It's also an interesting question how to ensure that the Inductor-generated code for these useful shortcuts does not regress over time. So maybe one way is to copy-paste its generated Triton code into core and only regenerate it once in a while? (or just torch.compile it always...)
For having torch.compile calls in the standard PyTorch nn library, it would probably be useful to have predictable performance, at least having some assurance that a user would not, without an explicit wish, modify global Inductor options in client code and impact the core-provided modules.
Unsloth also includes Triton kernels for
I wonder if this means that PyTorch's native torch.compile Inductor/Triton codegen does not produce fast enough code for LayerNorm/RMSNorm? They patch out Torch's impl with their own in https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_zoo/patching_utils.py#L70
Also, RMSNorm got added to ONNX:
🚀 The feature, motivation and pitch
All T5 models and their derivatives (t5, mt5, t0, etc.) use `RMSNorm` instead of `LayerNorm`. The former is a subset of the latter: it only scales and doesn't shift.
The original need was a discovery that all HF Transformers t5-based models were somewhat slow under mixed precision, because of a "manual" implementation of `T5LayerNorm` where manual up/down-casting was causing a significant bottleneck.
While researching this I have run into other users who wanted to use a fast `RMSNorm` (but didn't save the references).
NVIDIA/apex recently implemented `apex.normalization.FusedRMSNorm`, but building apex is far from easy for a lay person.
I have benchmarked it in an ensemble and it gives a pretty significant gain - about 10% improvement on the full back-to-back application: huggingface/transformers#14656 - so clearly multiple times faster on just the norm part.
So to ease users' path to faster t5-based models, if possible it'd be great to have this subset of `LayerNorm`'s functionality available in pytorch. It's already in the nvfuser branch: csarofeen#1428
I will see if I can find other users who may want a fast `RMSNorm`.
Thank you!
cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345