[DTensor] have split_strategy return OpStrategy instead of TupleStrategy by XilunWu · Pull Request #158051 · pytorch/pytorch · GitHub

Conversation

@XilunWu (Contributor) commented Jul 10, 2025

Stack from ghstack (oldest at bottom):

Summary
split_strategy used TupleStrategy as its return type because DTensor sharding
propagation's OpStrategy support for multi-returns only applies to Tuple.

However, TupleStrategy is not a good fit for the split op. TupleStrategy was
originally introduced to handle the sharding strategy of foreach_* ops, where
the input args can be split into independent subsets with respect to sharding
decisions, and so can the outputs.

To address the misuse, this PR adds OpStrategy propagation for List[Tensor]
returns (note that this support is INCOMPLETE because it only checks that the
return type is torch.ListType). Nevertheless, the logic for Tuple returns makes
a similar assumption, so I think it is fine to unblock it in this way.

Besides adding OpStrategy support for ops with a List[Tensor] return type,
this PR also changes split_strategy's return type from TupleStrategy to OpStrategy.

Test
pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial
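
To make the TupleStrategy vs. OpStrategy distinction above concrete, here is a
minimal, self-contained sketch. The class names mirror DTensor's internal types
but are simplified stand-ins (not the real classes under torch.distributed.tensor):
foreach_* ops shard each list element independently, which is what TupleStrategy
models, whereas every chunk returned by split follows one shared sharding decision,
which a single OpStrategy captures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OpSpec:
    # One sharding choice: placements for the output(s) and for each input.
    output_placements: Tuple[str, ...]
    input_placements: List[Tuple[str, ...]] = field(default_factory=list)


@dataclass
class OpStrategy:
    # All candidate sharding choices for a single op; split's chunks all
    # follow whichever OpSpec the propagation ends up picking.
    strategies: List[OpSpec]


@dataclass
class TupleStrategy:
    # One child strategy per *independent* argument/output group, which is
    # the foreach_* situation, not the split situation.
    children: List[OpStrategy]


# foreach_add_([a, b, c], 1): each list element can be sharded independently,
# so one child OpStrategy per element is the right shape.
foreach_like = TupleStrategy(
    children=[OpStrategy([OpSpec(("Shard(0)",), [("Shard(0)",)])]) for _ in range(3)]
)

# aten.split returns List[Tensor], but every chunk derives from the same input
# under the same sharding decision, so a single OpStrategy is sufficient.
split_like = OpStrategy([OpSpec(("Partial()",), [("Partial()",)])])
print(type(foreach_like).__name__, type(split_like).__name__)
```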

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tianyu-l

@pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158051

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0f2e156 with merge base 38371f6:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jul 10, 2025
XilunWu added a commit that referenced this pull request Jul 10, 2025
@XilunWu added the topic: not user facing and module: dtensor labels Jul 10, 2025
@zpcore mentioned this pull request Jul 10, 2025
@zpcore (Member) commented Jul 10, 2025

The logic LGTM! Requesting a minor fix on metadata: we need to add tensor_meta
to input_specs. Without it the test (#157991) can still pass, but it will cause
issues for the follow-up ops that consume the strategy from split. Below is the
detailed explanation:

When we call generate_redistribute_costs(input_strategy, input_spec), only the
metadata in input_strategy.output_spec is needed for the computation. That is
to say, we only need to add metadata to input_specs for every op strategy; the
output spec's tensor meta will then be propagated and automatically assigned
here: op_schema.op, output_sharding.output_spec, out_tensor_meta. This
guarantees that follow-up ops will have the required metadata in the strategy
to compute costs and to further propagate the metadata to the next op.
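
To illustrate the point with simplified stand-in types (these are not DTensor's
real spec classes or the real generate_redistribute_costs signature): the
redistribute cost is sized from the producer spec's tensor metadata, so a spec
without tensor_meta leaves downstream ops unable to compute the cost.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TensorMeta:
    shape: Tuple[int, ...]
    dtype: str


@dataclass
class Spec:
    # Stand-in for a DTensor spec: placements plus (optionally) tensor metadata.
    placements: Tuple[str, ...]
    tensor_meta: Optional[TensorMeta] = None


def redistribute_cost_sketch(src: Spec, dst: Spec) -> float:
    # Cost estimation has to size the communication from the producer's
    # metadata; a missing tensor_meta makes the cost impossible to compute.
    if src.tensor_meta is None:
        raise ValueError("producer spec is missing tensor_meta")
    numel = 1
    for d in src.tensor_meta.shape:
        numel *= d
    return 0.0 if src.placements == dst.placements else float(numel)


meta = TensorMeta(shape=(8, 16), dtype="float32")
producer_out = Spec(("Partial()",), tensor_meta=meta)  # what split produces
consumer_in = Spec(("Replicate()",))                   # what the next op wants

# Works only because the producer's output spec carries tensor_meta; the same
# requirement applies to the input_specs that split records for its input.
print(redistribute_cost_sketch(producer_out, consumer_in))  # 128.0
```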

… TupleStrategy"


**Summary**
`split_strategy` used `TupleStrategy` as return type because DTensor sharding
propagation's `OpStrategy` support on multi-returns only applies to `Tuple`.

However, `TupleStrategy`'s design is complicated and we want to avoid its use
as much as possible. This PR adds `OpStrategy` propagation for `List[Tensor]`
(note that this support is INCOMPLETE because it only checks the return type
to be `torch.ListType`). Nevertheless, the logic for `Tuple` returns also made
similar assumption so I think it's fine to unblock in such a way.

Besides adding `OpStrategy` support to ops having `List[Tensor]` return type,
this PR also changes `split_strategy`'s return from `TupleStrategy` to `OpStrategy`.

**Test**
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`


cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k tianyu-l

[ghstack-poisoned]
XilunWu added a commit that referenced this pull request Jul 11, 2025
@XilunWu requested a review from wanchaol July 11, 2025 08:53
@wconstab (Contributor) commented:

Nit:

> However, TupleStrategy's design is complicated and we want to avoid its use
> as much as possible.

I think this isn't quite how I would say it. It should not be that we avoid
TupleStrategy because it is complicated; rather, TupleStrategy has a very
specific case in which it can be used, and this is not that case.

I have attempted to document better what the contract for TupleStrategy is in
#158132, please see if you agree.

On that note, I was thinking about whether we could validate this
automatically, but I am afraid it is not possible without understanding the
semantics of the ops (e.g., it is not enough to just check the op schema). So I
think better docs are all we can do?
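
For what it's worth, a quick probe of the op schemas (assuming a local PyTorch
install; the exact type strings printed may vary) shows why schema inspection
alone cannot decide this: a foreach op and split both advertise a tensor-list
return, yet only the former has per-element-independent sharding semantics.

```python
import torch

# Both overloads return a list of tensors, so the schema alone gives no hint
# about whether the list elements are independent (foreach_*) or all derived
# from one input under one sharding decision (split).
for op in (torch.ops.aten.split.Tensor, torch.ops.aten._foreach_add.Scalar):
    schema = op._schema  # torch._C.FunctionSchema for this overload
    print(schema.name, "->", [str(ret.type) for ret in schema.returns])
```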

@XilunWu requested a review from zpcore July 14, 2025 23:03
@wconstab (Contributor) left a comment


Thanks! I confirmed this works for me locally.

@pytorchmergebot (Collaborator) commented:

Starting merge as part of PR stack under #158112

pytorchmergebot pushed a commit that referenced this pull request Jul 15, 2025
**Summary**
Implemented the test pattern described in #157991 (comment) as a util method in `DTensorTestBase`. The differences from `DTensorTestBase._test_op` are:
1. it allows users to specify the `Partial` placement.
2. it supports a tree-like output structure.

**Test**
So far only `DistTensorOpsTest.test_split_on_partial` adopts `DTensorTestBase._test_op_on_dtensor`.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: #158112
Approved by: https://github.com/Skylion007, https://github.com/zpcore
ghstack dependencies: #158051
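
For reference, the pattern wrapped by that util looks roughly like the
following when written against the public DTensor API. This is a sketch only,
assuming a process group is already initialized (e.g. launched via torchrun);
it is not the actual `_test_op_on_dtensor` implementation.

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial

mesh = init_device_mesh("cpu", (dist.get_world_size(),))

# Build a Partial() DTensor: each rank holds one summand of the full value.
local = torch.randn(4, 8)
dt = DTensor.from_local(local, mesh, placements=[Partial()])
full = dt.full_tensor()  # replicated reference value (sum over ranks)

# The op under test returns a tree of tensors (a tuple of chunks for split);
# compare each leaf of the DTensor result against the plain-tensor reference.
for dt_chunk, ref_chunk in zip(torch.split(dt, 2, dim=0), torch.split(full, 2, dim=0)):
    torch.testing.assert_close(dt_chunk.full_tensor(), ref_chunk)
```
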
@github-actions bot deleted the gh/XilunWu/156/head branch August 15, 2025 02:20

Labels

ciflow/inductor, Merged, module: dtensor, oncall: distributed, topic: not user facing
