[dtensor] add src_data_rank to distribute_tensor API #143883
Conversation
As titled, this PR adds a kwarg src_data_rank to the distribute_tensor API, to allow the user to specify a specific rank as the source of the full tensor data. Previously we defaulted to group_rank=0 as the source of truth for single-device semantics. This new option:
* gives advanced users the flexibility to choose the source data rank
* allows the user to explicitly pass None, which skips the communications (scatter/broadcast) in cases that do not care about single-device semantics (e.g. loading from a checkpoint)
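For context, a minimal usage sketch of the new kwarg (hedged: assumes a 4-GPU mesh for illustration, and that src_data_rank lands as a keyword argument defaulting to group rank 0, as the description above states):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cuda", (4,))
full_tensor = torch.randn(8, 8)

# Default: group rank 0 is the source of truth; its data is
# scattered/broadcast so the result matches single-device semantics.
dt = distribute_tensor(full_tensor, mesh, [Shard(0)])

# Pick a different rank as the source of the full tensor data.
dt = distribute_tensor(full_tensor, mesh, [Shard(0)], src_data_rank=2)

# None skips the scatter/broadcast entirely; each rank shards its own
# local full_tensor (useful e.g. when loading from a checkpoint).
dt = distribute_tensor(full_tensor, mesh, [Shard(0)], src_data_rank=None)
```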
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143883
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit cd83f97 with merge base d88a8c4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Welcome back! Had a comment on the semantics of distribute_tensor with shard placements when src_data_rank is None.
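To make the semantics under discussion concrete, a rough sketch (illustrative only, not the actual implementation; assumes a 4-GPU mesh and even sharding on mesh dim 0):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (4,))
full_tensor = torch.randn(8, 8)

# With src_data_rank=None and a Shard(0) placement, no collective runs:
# each rank slices its *own* full_tensor, so the result only matches
# single-device semantics if every rank's full_tensor already agrees.
local_shard = full_tensor.chunk(mesh.size(), dim=0)[mesh.get_local_rank()]
```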
LGTM! Thanks for supporting 0-collective tensor sharding.
@@ -41,15 +45,21 @@ def world_size(self) -> int:
         return 4

     @with_comms
-    def test_distribute_tensor(self):
+    def test_distribute_tensor_rank(self):
it would be good if we also test uneven sharding.
I think this feature is somewhat orthogonal to even vs. uneven sharding; I'll try to add that as a follow-up later.
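For reference, a hypothetical shape for such a test (editorial sketch of the reviewer's suggestion, not code from the PR; names and sizes are illustrative):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cuda", (4,))

# dim 0 of size 5 cannot split evenly across 4 ranks, so local shard
# shapes differ per rank; the test would check that skipping the
# scatter (src_data_rank=None) still yields the right local shapes.
t = torch.randn(5, 8)
dt = distribute_tensor(t, mesh, [Shard(0)], src_data_rank=None)
```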
LGTM!
As titled, this PR propagates src_data_rank in the TP API, so that module-level APIs can leverage the flexibility to choose the source data rank and avoid the communication when it is not needed.
Pull Request resolved: #144005
Approved by: https://github.com/tianyu-l
ghstack dependencies: #143883
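A sketch of what the follow-up enables (hedged: assumes #144005 exposes src_data_rank on parallelize_module as a keyword argument mirroring distribute_tensor):

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

mesh = init_device_mesh("cuda", (4,))
model = nn.Linear(16, 16)

# src_data_rank=None: each rank shards the weights it already holds
# (e.g. loaded from a checkpoint) with no scatter/broadcast.
model = parallelize_module(model, mesh, ColwiseParallel(), src_data_rank=None)
```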