add device generalisation support for distributed tests #152471

harikodali · 2025-04-29T21:42:42Z

MOTIVATION

Generalize the distributed test cases so they can run on non-CUDA devices.

CHANGES

test/distributed/optim/test_zero_redundancy_optimizer.py
test/distributed/test_c10d_logger.py
test/distributed/test_compute_comm_reordering.py

Replaced hard-coded device names with get_devtype from torch.testing._internal.common_fsdp.
The tests now use DistributedTestBase instead of MultiProcessTestCase so they can use its helper functions.
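As a rough illustration of the idea (not the code in this PR), the sketch below picks a device type once from a set of availability probes instead of hard-coding "cuda". The helper name `pick_device_type` and the probe mapping are hypothetical; in the PR itself this role is played by get_devtype.

```python
from typing import Callable, Mapping


def pick_device_type(
    probes: Mapping[str, Callable[[], bool]], default: str = "cpu"
) -> str:
    """Return the first device type whose availability probe succeeds.

    `probes` maps a device type name to a zero-argument callable.
    This helper is hypothetical and only illustrates the pattern.
    """
    for name, is_available in probes.items():
        try:
            if is_available():
                return name
        except Exception:
            # A backend that is missing or broken counts as unavailable.
            continue
    return default


# With PyTorch installed, the probes could be built like this (assumption):
#   import torch
#   probes = {"cuda": torch.cuda.is_available}
#   device_type = pick_device_type(probes)
```

A test can then build tensors and process groups from the selected device type rather than from a literal string.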

torch/testing/_internal/common_distributed.py

Extended the common utility functions.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot · 2025-04-29T21:42:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152471

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

harikodali · 2025-05-12T05:07:54Z

@albanD, @wconstab, @EikanWang, @cyyever, @guangyey: Could you help review and merge this one?

gnadathur · 2025-05-12T21:33:15Z

cc: @d4l3k , @kwen2501

d4l3k

Changes generally seem fine -- just a few nits

d4l3k · 2025-05-12T22:38:43Z

torch/testing/_internal/common_distributed.py

-            os.remove(self.file_name)
-        except OSError:
-            pass
-
    @property
    def world_size(self) -> int:
        return torch.cuda.device_count()


does this need to change as well?
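One way a device-agnostic `world_size` could look is sketched below. This assumes the backend exposes a `device_count()` function under an attribute named after the device type (as `torch.cuda` does); the helper `device_count_for` is hypothetical and is not part of common_distributed.py.

```python
def device_count_for(root, device_type: str, cpu_default: int = 1) -> int:
    """Count devices of `device_type` under `root` (e.g. the torch module).

    Hypothetical helper: looks up `root.<device_type>.device_count()` and
    falls back to 0 when the backend module is missing.
    """
    if device_type == "cpu":
        # CPU has no device_count(); the caller chooses the world size.
        return cpu_default
    backend = getattr(root, device_type, None)
    if backend is None or not hasattr(backend, "device_count"):
        return 0
    return int(backend.device_count())


# With PyTorch installed, usage might be (assumption):
#   import torch
#   world_size = device_count_for(torch, "cuda")
```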

d4l3k · 2025-05-12T22:39:42Z

test/distributed/optim/test_zero_redundancy_optimizer.py


 class TestZeroRedundancyOptimizerSingleRank(TestZeroRedundancyOptimizer):
    def test_state_dict(self):
        """Check that ZeroRedundancyOptimizer exposes the expected state dict
        interface, irrespective of the sharding."""
-        self.dist_init(self.rank)
+        self.create_pg(self.device)


Do we need to pass device here? Seems like we can infer it from the setting on the class since it's on self
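The suggestion could be sketched as an optional argument that defaults to the class-level setting. The class below is a hypothetical stand-in, not DistributedTestBase, and its `create_pg` only records the resolved device instead of creating a process group.

```python
from typing import List, Optional


class DeviceAwareTestSketch:
    """Hypothetical stand-in for a test base class with a device setting."""

    device: str = "cpu"  # subclasses would override this

    def __init__(self) -> None:
        self.created: List[str] = []

    def create_pg(self, device: Optional[str] = None) -> str:
        # Fall back to the class-level device when none is passed.
        resolved = device if device is not None else self.device
        self.created.append(resolved)
        return resolved


class XpuTestSketch(DeviceAwareTestSketch):
    device = "xpu"
```

With this shape, a call such as `self.create_pg()` would be enough in the common case, while an explicit device could still be passed when a test needs one.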

d4l3k · 2025-05-12T22:39:57Z

test/distributed/optim/test_zero_redundancy_optimizer.py

@@ -20,6 +19,7 @@
 if not dist.is_available():
    print("Distributed not available, skipping tests", file=sys.stderr)
    sys.exit(0)
+from torch._inductor.utils import is_gpu


Is it fine for us to depend on inductor code?

I think it is a bit weird.
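If the inductor dependency is a concern, one alternative would be a small local check in the test utilities. The sketch below is hypothetical: the name `is_gpu_type` and the contents of `GPU_TYPES` are assumptions for illustration and do not mirror the inductor implementation.

```python
# Assumed list of accelerator device types; adjust to the backends under test.
GPU_TYPES = ("cuda", "xpu", "hpu")


def is_gpu_type(device) -> bool:
    """Return True if `device` names an accelerator in GPU_TYPES.

    Accepts a string such as "cuda" or "cuda:0", or any object with a
    `.type` attribute (for example a torch.device).
    """
    device_type = getattr(device, "type", device)
    if not isinstance(device_type, str):
        return False
    # Drop an optional index suffix such as ":0".
    return device_type.split(":", 1)[0] in GPU_TYPES
```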

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Apr 29, 2025

pytorchbot added the open source label Apr 29, 2025

etaf added the ciflow/xpu label May 1, 2025

etaf reviewed May 1, 2025


HDCharles requested a review from xmfan May 2, 2025 03:52

HDCharles added the triaged label May 2, 2025

HDCharles requested a review from Skylion007 May 2, 2025 03:54

add device generalisation support for distributed tests

c640bda

harikodali force-pushed the distributed_test_1 branch from b3e51b4 to c640bda Compare May 8, 2025 21:21

pytorch-bot bot removed the ciflow/xpu label May 8, 2025

xmfan requested a review from fegin May 8, 2025 21:26

fix bug and lint issues

8476291

albanD requested a review from wconstab May 12, 2025 13:25

make fake pg tests device agnostic

35b31de

d4l3k approved these changes May 12, 2025
