fix: support grad clipping for TP through replicating non-sharded modules by kmehant · Pull Request #36132 · huggingface/transformers · GitHub

fix: support grad clipping for TP through replicating non-sharded modules #36132


Merged
merged 3 commits into huggingface:main on Jun 6, 2025

Conversation

@kmehant (Contributor) commented Feb 11, 2025

What does this PR do?

torch.nn.utils.clip_grad_norm_ does not support a heterogeneous set of parameters containing a mix of DTensors and plain Tensors. This PR enables gradient clipping by distributing the non-sharded modules that are not involved in TP: we replicate all such modules across the device mesh.

The PR also adds a new parallel style, ReplicateParallel, so that the existing TP APIs can be used as-is for this module-replication operation. We could consider contributing this back to PyTorch if it makes sense (cc: @kwen2501); otherwise we can maintain it in transformers.
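
As a rough illustration, a replicate style could look something like the sketch below. This is not this PR's exact implementation; the use of distribute_module and the import paths (which follow recent PyTorch) are assumptions based on the public DTensor APIs.

import torch.nn as nn
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor import distribute_module
from torch.distributed.tensor.parallel import ParallelStyle

class ReplicateParallel(ParallelStyle):
    """Mark a module as replicated so all of its parameters become DTensors."""

    def _apply(self, module: nn.Module, device_mesh: DeviceMesh) -> nn.Module:
        # distribute_module with no partition_fn replicates every parameter on
        # the mesh, so the whole model exposes a homogeneous set of DTensor
        # parameters that clip_grad_norm_ can reduce over.
        return distribute_module(module, device_mesh)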

⭐ Note: We will rebase this PR once #34194 is merged; some of the workflow changes that you see here will disappear once that PR lands.

fixes: #36296

Concerns

Concern 1

When we do two TP runs with gradient clipping under the exact same training settings, we don't get exact loss parity between the runs, though both runs eventually converge. I am worried that the Replicate sharding has something to do with this.
(Screenshot from 2025-02-11 comparing the loss curves of the two runs.)

Concern 2

Grad norms are not the same on each rank. I would assume that in TP training the grad norms should come out the same across ranks; however, that's not the case.
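
For reference, one way this can be checked (a diagnostic sketch, not part of this PR; the helper name is made up):

import torch.distributed as dist

def gather_grad_norms(local_norm: float) -> list:
    # local_norm: e.g. the value returned by clip_grad_norm_, as a Python float.
    # Collect the grad norm reported on every rank so they can be compared.
    norms = [None] * dist.get_world_size()
    dist.all_gather_object(norms, local_norm)
    return norms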

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @muellerzr and @SunMarc
@kwen2501 from PyTorch

@SunMarc (Member) commented Feb 20, 2025

Can you have a look @kwen2501 ?

@kmehant force-pushed the tp-gradnorm branch 2 times, most recently from 9bb5edd to ebfb17d on February 20, 2025 14:18
@kwen2501 (Contributor):

@weifengpy @mori360 do you mind having a look at the two concerns here? Thanks!

@weifengpy:

> @weifengpy @mori360 do you mind having a look at the two concerns here? Thanks!

> does not support a heterogeneous set of parameters containing a mix of DTensors and plain Tensors

implicit_replication was introduced to mix DTensors with plain tensors. Maybe it's cleaner to use it here.

from torch.distributed._tensor.experimental import implicit_replication

with implicit_replication():
    # call gradient clipping here, e.g. torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    ...

code pointer: https://github.com/pytorch/pytorch/blob/8b818ab58f635f999de2c8a5bf8e6c01d0c122ed/test/distributed/tensor/parallel/test_tp_examples.py#L262-L264

@kmehant (Contributor, Author) commented Feb 21, 2025

@weifengpy Do you recommend using implicit_replication instead?

@kwen2501 (Contributor) left a comment:


Thanks for fixing gradient clipping with TP.
Functionality-wise, the code change looks reasonable.
I am consulting with my colleagues to see whether it is strictly necessary to explicitly annotate non-sharded modules as Replicate.

Comment on lines 120 to 127 of the TP plan config:
"layers.*.self_attn.o_proj": "rowwise_output_dtensor",
"layers.*.mlp.gate_proj": "colwise",
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
"layers.*.mlp.down_proj": "rowwise_output_dtensor",
"embed_tokens": "replicateparallel_output_dtensor",
"layers.*.post_attention_layernorm": "replicateparallel_output_dtensor",
"layers.*.input_layernorm": "replicateparallel_output_dtensor",
"norm": "replicateparallel_output_dtensor",
@kwen2501 (Contributor):


Thanks for extending the configs here.
I wonder if some of these settings would be more interesting to training than to inference?
(On the other hand, I don't know much about HF's user profile -- training more or inference more?)
If some of the settings are specific to training, is it possible to separate them out? Or, shall we make the config somehow customizable at run time?

@kmehant (Contributor, Author):


@kwen2501 True, these are needed for training (for the grad norm), but not really needed for inference. Does using Replicate incur costs?

Comment on lines 354 to 358
class ReplicateParallel(ParallelStyle):
"""
Replicate an nn.Module.
Users can compose it with other parallel styles like RowwiseParallel to achieve a fully distributed model.
A fully distributed model is needed for gradient clipping.
@kwen2501 (Contributor):


@weifengpy @wz337 @tianyu-l
I wonder if there is anything we can do on DTensor side so that users don't have to annotate the entire model to perform gradient clipping?

@kmehant (Contributor, Author):


@weifengpy @kwen2501 Should we be using implicit_replication() as an alternative?

Comment on lines 347 to 342
# TODO need to add the __repr__ that shows that it is a colwise parallel
# See https://github.com/pytorch/pytorch/issues/145726
@kwen2501 (Contributor):


nit: keep this TODO?

@kmehant (Contributor, Author):


Fixed, thanks.

@ArthurZucker (Collaborator) left a comment:


Very nice, waiting on @kwen2501's feedback, but make sure to rebase since we just merged #36335!

@ArthurZucker (Collaborator):

@kmehant we finished the refactoring, if you still want to work on this!

@kmehant (Contributor, Author) commented Mar 11, 2025

#36132 (comment)

Thanks for the update, I will rebase the PR by EoD.

@kmehant (Contributor, Author) commented Mar 11, 2025

@ArthurZucker I have rebased my PR, thanks

@kmehant force-pushed the tp-gradnorm branch 9 times, most recently from 6290f7f to 99a3817 on March 17, 2025 11:39
@kmehant (Contributor, Author) commented Mar 17, 2025

#36132 (comment)

@ArthurZucker I kept it consistent with the refactored code now. Waiting on @kwen2501 to confirm whether the recommendation is to use implicit_replication() from torch instead of introducing a new ParallelStyle module for replication.

@kwen2501 (Contributor):

Yeah, per @weifengpy's comment, I think implicit_replication() is preferred over creating new strategies. You can limit its usage to the minimum possible if there is a risk concern -- e.g., only when clip_grad_norm_ is used.
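
For illustration, a minimal sketch of scoping implicit_replication() to just the clipping call; the surrounding training-step names (model, optimizer, max_grad_norm) are assumptions, not the code this PR adds.

import torch
from torch.distributed._tensor.experimental import implicit_replication

def clip_and_step(model, optimizer, max_grad_norm):
    # Only inside this context are plain (non-DTensor) gradients treated as
    # replicated DTensors, so the mixed parameter set clips cleanly.
    with implicit_replication():
        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    return total_norm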

kmehant added 2 commits March 25, 2025 16:25
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
@kmehant mentioned this pull request on May 12, 2025
@ArthurZucker (Collaborator) left a comment:


Thanks!
I am not familiar with implicit_replication, but if it is recommended by @kwen2501 I am happy to use it! I suppose it requires a specific torch version check, no?

If so, using our own replicate style would be a bit better AFAIK for broader support (starting from torch 2.3, vs. only the versions where implicit_replication is defined).
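
For illustration, a minimal sketch of what such a gate could look like; the try/except import fallback and the no-op context are assumptions, not the code this PR ended up with.

import contextlib

try:
    # newer torch exposes the experimental API under torch.distributed.tensor
    from torch.distributed.tensor.experimental import implicit_replication
except ImportError:
    try:
        # older torch keeps it under the private _tensor namespace
        from torch.distributed._tensor.experimental import implicit_replication
    except ImportError:
        implicit_replication = None

def maybe_implicit_replication():
    # Fall back to a no-op context on torch versions without the API.
    return implicit_replication() if implicit_replication is not None else contextlib.nullcontext()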

@@ -234,6 +234,7 @@
AutocastKwargs,
DistributedDataParallelKwargs,
DistributedType,
TorchTensorParallelPlugin,
@ArthurZucker (Collaborator):


not seeing this used!

@SunMarc (Member) left a comment:


Much needed!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc merged commit 3d15606 into huggingface:main on Jun 6, 2025
20 checks passed
bvantuan pushed a commit to bvantuan/transformers that referenced this pull request Jun 12, 2025
fix: support grad clipping for TP through replicating non-sharded modules (huggingface#36132)

* feat: fix tp grad norm:

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* feat: use implicit replication

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

---------

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Labels: None yet
Projects: None yet
Successfully merging this pull request may close these issues: tensor parallel training bug
6 participants