[DTensor] have split_strategy return OpStrategy instead of TupleStrategy #158051
Conversation
The logic LGTM! Requesting a minor fix on metadata: we need to add `tensor_meta` to `input_specs`. Without it the test (#157991) can still pass, but it will cause issues for the following ops that consume the strategy. When we do
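For context on this request, here is a minimal sketch of attaching tensor metadata to the input specs of a strategy. The classes below are simplified stand-ins that only mirror the shape of DTensor's internal `DTensorSpec`/`TensorMeta`/`OpSpec` (which live in private `torch.distributed.tensor` modules); this illustrates the idea, not the actual fix in this PR.

```python
from dataclasses import dataclass, field
from typing import Optional

import torch


# Simplified stand-ins for DTensor's internal spec classes (assumption:
# the real ones live under torch.distributed.tensor's private modules).
@dataclass
class TensorMeta:
    shape: torch.Size
    stride: tuple
    dtype: torch.dtype


@dataclass
class DTensorSpec:
    placements: tuple
    tensor_meta: Optional[TensorMeta] = None


@dataclass
class OpSpec:
    output_specs: list
    input_specs: list = field(default_factory=list)


def attach_input_meta(op_spec: OpSpec, input_tensor: torch.Tensor) -> None:
    """Fill in tensor_meta on every input spec so downstream ops that
    consume this strategy (e.g. during redistribute) see real shapes."""
    meta = TensorMeta(input_tensor.shape, input_tensor.stride(), input_tensor.dtype)
    for spec in op_spec.input_specs:
        spec.tensor_meta = meta
```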
Nit:
I think this isn't quite how I would say it. It should not be that we avoid `TupleStrategy` because it is complicated; rather, `TupleStrategy` has a very specific case where it can be used, and this is not that case. I have attempted to document the contract for `TupleStrategy` better in #158132; please see if you agree. On that note, I was thinking about whether we could validate this automatically, but I am afraid it is not possible without understanding the semantics of the ops (e.g. it is not enough to just check the op schema). So I think better docs are all we can do?
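To illustrate the distinction being drawn (my paraphrase, not a quote from #158132): `TupleStrategy` is for ops whose list elements carry independent sharding decisions, such as `foreach_*`, while `split` has one input whose output chunks share a single correlated decision, which is what one `OpStrategy` expresses. A schematic sketch with hypothetical stand-in containers, not the real DTensor classes:

```python
from dataclasses import dataclass


# Hypothetical, simplified containers used only to show the two shapes.
@dataclass
class OpStrategy:
    # each entry is one sharding choice that covers ALL inputs/outputs at once
    strategies: list


@dataclass
class TupleStrategy:
    # one independent child strategy per element of a list input/output,
    # as in foreach_* ops where elements don't constrain each other
    children: list


# foreach_add([a, b, c], [x, y, z]): the (a, x), (b, y), (c, z) pairs are
# sharded independently -> TupleStrategy(children=[OpStrategy(...)] * 3).
#
# split(t, sizes, dim): every output chunk is tied to the single placement
# chosen for t -> one OpStrategy whose entries each describe all chunks.
```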
Thanks! I confirmed this works for me locally.
Starting merge as part of PR stack under #158112
**Summary**
Implemented the test pattern described in #157991 (comment) as a util method in `DTensorTestBase`. The differences from `DTensorTestBase._test_op` are:
1. allowing users to specify the `Partial` placement;
2. supporting tree-like output structures.

**Test**
So far only `DistTensorOpsTest.test_split_on_partial` adopts `DTensorTestBase._test_op_on_dtensor`.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: #158112
Approved by: https://github.com/Skylion007, https://github.com/zpcore
ghstack dependencies: #158051
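A rough sketch of the helper described above (the name `_test_op_on_dtensor` comes from the PR; the body here is an assumption about the pattern, not the merged implementation): the caller builds the DTensor with whatever placements it wants to exercise, including `Partial`, and the helper compares the DTensor result against the op applied to the materialized full tensor, walking tree-like outputs with `tree_map`.

```python
from torch.distributed.tensor import DTensor
from torch.utils._pytree import tree_map


def _test_op_on_dtensor(self, op, dtensor, *args, **kwargs):
    """Sketch only. `dtensor` may use any placements, including Partial,
    since the caller constructs it (e.g. via DTensor.from_local)."""
    # Reference result on the materialized tensor; full_tensor() reduces
    # Partial placements, so both sides are plain torch.Tensors to compare.
    expected = op(dtensor.full_tensor(), *args, **kwargs)
    actual = op(dtensor, *args, **kwargs)
    actual_full = tree_map(
        lambda t: t.full_tensor() if isinstance(t, DTensor) else t, actual
    )
    # tree_map over both trees handles list/tuple/nested outputs uniformly.
    tree_map(lambda a, e: self.assertEqual(a, e), actual_full, expected)
```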
Stack from ghstack (oldest at bottom):
Summary

`split_strategy` used `TupleStrategy` as its return type because DTensor sharding propagation's `OpStrategy` support on multi-returns only applies to `Tuple`. However, `TupleStrategy` is not a good fit for the `split` op. `TupleStrategy` was initially introduced to handle the sharding strategy of `foreach_*` ops, where the input args can be split into independent subsets regarding sharding decisions, and so can the outputs.

To address the misuse, this PR adds `OpStrategy` propagation for `List[Tensor]` (note that this support is INCOMPLETE because it only checks that the return type is `torch.ListType`; see the sketch after this description). Nevertheless, the logic for `Tuple` returns also made a similar assumption, so I think it's fine to unblock in this way.

Besides adding `OpStrategy` support to ops with a `List[Tensor]` return type, this PR also changes `split_strategy`'s return from `TupleStrategy` to `OpStrategy`.

Test

`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tianyu-l
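For readers unfamiliar with the "checks the return type to be `torch.ListType`" caveat above, this is roughly the kind of schema-based check involved (an assumption about the approach, not the exact sharding-propagation code):

```python
import torch


def returns_tensor_list(op: torch._ops.OpOverload) -> bool:
    """True if the op's schema declares a single Tensor[] return, which is
    the (incomplete) signal used to expand one output spec per element."""
    returns = op._schema.returns
    return len(returns) == 1 and isinstance(returns[0].type, torch.ListType)


# aten.split_with_sizes returns Tensor[], so this prints True.
print(returns_tensor_list(torch.ops.aten.split_with_sizes.default))
```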