[DTensor] have split_strategy return OpStrategy instead of TupleStrategy by XilunWu · Pull Request #158051 · pytorch/pytorch · GitHub

Conversation

@XilunWu (Contributor) commented Jul 10, 2025

Stack from ghstack (oldest at bottom):

Summary
split_strategy used TupleStrategy as its return type because DTensor sharding
propagation's OpStrategy support for multi-returns only applies to Tuple.

However, TupleStrategy is not a good fit for the split op. TupleStrategy was
originally introduced to handle the sharding strategy of foreach_* ops, where
the input args can be split into independent subsets with respect to sharding
decisions, and so can the outputs.

To address the misuse, this PR adds OpStrategy propagation for List[Tensor]
returns (note that this support is INCOMPLETE because it only checks that the
return type is torch.ListType). Nevertheless, the logic for Tuple returns makes
a similar assumption, so I think it is fine to unblock it in this way.

Besides adding OpStrategy support for ops with a List[Tensor] return type,
this PR also changes split_strategy's return type from TupleStrategy to OpStrategy.

Test
pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial
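
To make the TupleStrategy vs. OpStrategy distinction above concrete, here is a
minimal, self-contained sketch. The class names mirror DTensor's internal types
but are simplified stand-ins (not the real classes under torch.distributed.tensor):
foreach_* ops shard each list element independently, which is what TupleStrategy
models, whereas every chunk returned by split follows one shared sharding decision,
which a single OpStrategy captures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OpSpec:
    # One sharding choice: placements for the output(s) and for each input.
    output_placements: Tuple[str, ...]
    input_placements: List[Tuple[str, ...]] = field(default_factory=list)


@dataclass
class OpStrategy:
    # All candidate sharding choices for a single op; split's chunks all
    # follow whichever OpSpec the propagation ends up picking.
    strategies: List[OpSpec]


@dataclass
class TupleStrategy:
    # One child strategy per *independent* argument/output group, which is
    # the foreach_* situation, not the split situation.
    children: List[OpStrategy]


# foreach_add_([a, b, c], 1): each list element can be sharded independently,
# so one child OpStrategy per element is the right shape.
foreach_like = TupleStrategy(
    children=[OpStrategy([OpSpec(("Shard(0)",), [("Shard(0)",)])]) for _ in range(3)]
)

# aten.split returns List[Tensor], but every chunk derives from the same input
# under the same sharding decision, so a single OpStrategy is sufficient.
split_like = OpStrategy([OpSpec(("Partial()",), [("Partial()",)])])
print(type(foreach_like).__name__, type(split_like).__name__)
```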

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tianyu-l

@pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158051

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0f2e156 with merge base 38371f6:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jul 10, 2025
XilunWu added a commit that referenced this pull request Jul 10, 2025
@XilunWu added the topic: not user facing and module: dtensor labels Jul 10, 2025
@zpcore mentioned this pull request Jul 10, 2025
@zpcore (Member) commented Jul 10, 2025

The logic LGTM! Requesting a minor fix on metadata: we need to add tensor_meta
to input_specs. Without it the test (#157991) can still pass, but it will cause
issues for the follow-up ops that consume the strategy from split. Below is the
detailed explanation:

When we call generate_redistribute_costs(input_strategy, input_spec), only the
metadata in input_strategy.output_spec is needed for the computation. That is
to say, we only need to add metadata to input_specs for every op strategy; the
output spec's tensor meta will then be propagated and automatically assigned
here: op_schema.op, output_sharding.output_spec, out_tensor_meta. This
guarantees that follow-up ops will have the required metadata in the strategy
to compute costs and to further propagate the metadata to the next op.
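
To illustrate the point with simplified stand-in types (these are not DTensor's
real spec classes or the real generate_redistribute_costs signature): the
redistribute cost is sized from the producer spec's tensor metadata, so a spec
without tensor_meta leaves downstream ops unable to compute the cost.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TensorMeta:
    shape: Tuple[int, ...]
    dtype: str


@dataclass
class Spec:
    # Stand-in for a DTensor spec: placements plus (optionally) tensor metadata.
    placements: Tuple[str, ...]
    tensor_meta: Optional[TensorMeta] = None


def redistribute_cost_sketch(src: Spec, dst: Spec) -> float:
    # Cost estimation has to size the communication from the producer's
    # metadata; a missing tensor_meta makes the cost impossible to compute.
    if src.tensor_meta is None:
        raise ValueError("producer spec is missing tensor_meta")
    numel = 1
    for d in src.tensor_meta.shape:
        numel *= d
    return 0.0 if src.placements == dst.placements else float(numel)


meta = TensorMeta(shape=(8, 16), dtype="float32")
producer_out = Spec(("Partial()",), tensor_meta=meta)  # what split produces
consumer_in = Spec(("Replicate()",))                   # what the next op wants

# Works only because the producer's output spec carries tensor_meta; the same
# requirement applies to the input_specs that split records for its input.
print(redistribute_cost_sketch(producer_out, consumer_in))  # 128.0
```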

… TupleStrategy"


**Summary**
`split_strategy` used `TupleStrategy` as return type because DTensor sharding
propagation's `OpStrategy` support on multi-returns only applies to `Tuple`.

However, `TupleStrategy`'s design is complicated and we want to avoid its use
as much as possible. This PR adds `OpStrategy` propagation for `List[Tensor]`
(note that this support is INCOMPLETE because it only checks the return type
to be `torch.ListType`). Nevertheless, the logic for `Tuple` returns also made
similar assumption so I think it's fine to unblock in such a way.

Besides adding `OpStrategy` support to ops having `List[Tensor]` return type,
this PR also changes `split_strategy`'s return from `TupleStrategy` to `OpStrategy`.

**Test**
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`


cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k tianyu-l

[ghstack-poisoned]
XilunWu added a commit that referenced this pull request Jul 11, 2025
@XilunWu requested a review from wanchaol July 11, 2025 08:53
@wconstab (Contributor) commented:

Nit:

> However, TupleStrategy's design is complicated and we want to avoid its use
> as much as possible.

I think this isn't quite how I would say it. It should not be that we avoid
TupleStrategy because it is complicated; rather, TupleStrategy has a very
specific case in which it can be used, and this is not that case.

I have attempted to document better what the contract for TupleStrategy is in
#158132, please see if you agree.

On that note, I was thinking about whether we could validate this
automatically, but I am afraid it is not possible without understanding the
semantics of the ops (e.g., it is not enough to just check the op schema). So I
think better docs are all we can do?
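
For what it's worth, a quick probe of the op schemas (assuming a local PyTorch
install; the exact type strings printed may vary) shows why schema inspection
alone cannot decide this: a foreach op and split both advertise a tensor-list
return, yet only the former has per-element-independent sharding semantics.

```python
import torch

# Both overloads return a list of tensors, so the schema alone gives no hint
# about whether the list elements are independent (foreach_*) or all derived
# from one input under one sharding decision (split).
for op in (torch.ops.aten.split.Tensor, torch.ops.aten._foreach_add.Scalar):
    schema = op._schema  # torch._C.FunctionSchema for this overload
    print(schema.name, "->", [str(ret.type) for ret in schema.returns])
```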

@XilunWu requested a review from zpcore July 14, 2025 23:03
@wconstab (Contributor) left a comment


Thanks! I confirmed this works for me locally.

@pytorchmergebot (Collaborator) commented:

Starting merge as part of PR stack under #158112

pytorchmergebot pushed a commit that referenced this pull request Jul 15, 2025
**Summary**
Implemented the test pattern described in #157991 (comment) as a util method in `DTensorTestBase`. The differences from `DTensorTestBase._test_op` are:
1. it allows users to specify the `Partial` placement.
2. it supports a tree-like output structure.

**Test**
So far only `DistTensorOpsTest.test_split_on_partial` adopts `DTensorTestBase._test_op_on_dtensor`.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: #158112
Approved by: https://github.com/Skylion007, https://github.com/zpcore
ghstack dependencies: #158051
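
For reference, the pattern wrapped by that util looks roughly like the
following when written against the public DTensor API. This is a sketch only,
assuming a process group is already initialized (e.g. launched via torchrun);
it is not the actual `_test_op_on_dtensor` implementation.

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial

mesh = init_device_mesh("cpu", (dist.get_world_size(),))

# Build a Partial() DTensor: each rank holds one summand of the full value.
local = torch.randn(4, 8)
dt = DTensor.from_local(local, mesh, placements=[Partial()])
full = dt.full_tensor()  # replicated reference value (sum over ranks)

# The op under test returns a tree of tensors (a tuple of chunks for split);
# compare each leaf of the DTensor result against the plain-tensor reference.
for dt_chunk, ref_chunk in zip(torch.split(dt, 2, dim=0), torch.split(full, 2, dim=0)):
    torch.testing.assert_close(dt_chunk.full_tensor(), ref_chunk)
```
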
@github-actions bot deleted the gh/XilunWu/156/head branch August 15, 2025 02:20

Labels

ciflow/inductor, Merged, module: dtensor, oncall: distributed, topic: not user facing
