🚀 The feature, motivation and pitch
All T5 models and their derivatives (t5, mt5, t0, etc.) use RMSNorm instead of LayerNorm. The former is a subset of the latter: it only scales and doesn't shift.
The original need came from discovering that all HF Transformers T5-based models were somewhat slow under mixed precision because of the "manual" implementation of T5LayerNorm, where explicit up/down-casting was causing a significant bottleneck.
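For illustration, here is a minimal sketch of such a scale-only norm with manual fp32 casting (the class name and details are illustrative, not the exact Transformers code):

```python
import torch
from torch import nn

class ManualRMSNorm(nn.Module):
    # Illustrative sketch of a scale-only (RMS) norm with explicit casting,
    # similar in spirit to T5LayerNorm; not the exact HF Transformers code.
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        # Up-cast to fp32 for a numerically stable mean of squares;
        # under AMP this extra casting is a big part of the slowdown.
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        # Down-cast back to the original dtype and apply the scale only;
        # unlike LayerNorm there is no mean subtraction and no bias/shift.
        return self.weight * hidden_states.to(input_dtype)
```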
While researching this I ran into other users who wanted to use a fast RMSNorm (but I didn't save the references).
NVIDIA/apex recently implemented apex.normalization.FusedRMSNorm, but building apex is far from easy for a lay person.
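For reference, using it looks roughly like the sketch below. This assumes apex was built with its CUDA extensions, and that the constructor mirrors nn.LayerNorm; treat the exact arguments as an assumption:

```python
import torch
from apex.normalization import FusedRMSNorm

# Assumes apex was built with CUDA extensions; the constructor is
# assumed to mirror nn.LayerNorm (normalized_shape, eps).
norm = FusedRMSNorm(1024, eps=1e-6).cuda().half()
x = torch.randn(8, 128, 1024, device="cuda", dtype=torch.float16)
y = norm(x)  # scale-only normalization, fused into a single kernel
```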
I have benchmarked it in an ensemble and it gives a pretty significant gain: about a 10% improvement on the full back-to-back application (huggingface/transformers#14656), which implies the norm part alone is multiple times faster.
So, to ease users' path to faster T5-based models, it would be great to have this subset of LayerNorm's functionality available in PyTorch.
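A hypothetical API could simply mirror nn.LayerNorm minus the shift; the following is a proposal sketch, not an existing PyTorch module:

```python
import torch
from torch import nn

# Hypothetical usage of the requested module, mirroring nn.LayerNorm;
# nn.RMSNorm here is the proposed API, not something PyTorch ships today.
rms_norm = nn.RMSNorm(1024, eps=1e-6)
x = torch.randn(8, 128, 1024)
y = rms_norm(x)  # scales by 1/RMS and applies a learned weight, no shift
```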
It's already in the nvfuser branch: csarofeen#1428
I will see if I can find other users who may want a fast RMSNorm.
Thank you!