add device generalisation support for distributed tests by harikodali · Pull Request #152471 · pytorch/pytorch · GitHub

add device generalisation support for distributed tests #152471


Open · wants to merge 3 commits into base: main

Conversation

@harikodali commented Apr 29, 2025

MOTIVATION

Generalize distributed test cases so they can run on non-CUDA devices.

CHANGES

  • test/distributed/optim/test_zero_redundancy_optimizer.py
  • test/distributed/test_c10d_logger.py
  • test/distributed/test_compute_comm_reordering.py

Replaced hard-coded device names with get_devtype from torch.testing._internal.common_fsdp.
Switched from MultiProcessTestCase to DistributedTestBase to make use of its helper functions.

  • torch/testing/_internal/common_distributed.py

Extended the common utility functions.
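The replacement pattern described above can be sketched roughly as follows. The real get_devtype lives in torch.testing._internal.common_fsdp; the availability flags and fallback order below are illustrative assumptions, not the actual implementation:

```python
# Illustrative sketch of device generalisation: instead of hard-coding
# "cuda", tests resolve the device type at runtime. The boolean flags
# stand in for calls like torch.cuda.is_available(); the real helper
# is get_devtype from torch.testing._internal.common_fsdp.

def get_devtype(cuda_available: bool, xpu_available: bool) -> str:
    """Pick an accelerator device type, falling back to CPU."""
    if cuda_available:
        return "cuda"
    if xpu_available:
        return "xpu"
    return "cpu"

# Before: tensor created with device="cuda" (hard-coded)
# After:  tensor created with device=get_devtype(...) (generalised)
```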

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot bot commented Apr 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152471

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Apr 29, 2025
@etaf etaf added the ciflow/xpu Run XPU CI tasks label May 1, 2025
@HDCharles HDCharles requested a review from xmfan May 2, 2025 03:52
@HDCharles HDCharles added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 2, 2025
@HDCharles HDCharles requested a review from Skylion007 May 2, 2025 03:54
@harikodali harikodali force-pushed the distributed_test_1 branch from b3e51b4 to c640bda Compare May 8, 2025 21:21
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 8, 2025
@xmfan xmfan requested a review from fegin May 8, 2025 21:26
@harikodali (Author)

@albanD, @wconstab, @EikanWang, @cyyever, @guangyey: could you help review and merge this one?

@albanD albanD requested a review from wconstab May 12, 2025 13:25
@gnadathur

cc: @d4l3k , @kwen2501

@d4l3k (Member) left a comment

Changes generally seem fine -- just a few nits

            os.remove(self.file_name)
        except OSError:
            pass

    @property
    def world_size(self) -> int:
        return torch.cuda.device_count()
Member:

does this need to change as well?
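The question above is whether world_size, which currently calls torch.cuda.device_count(), also needs generalising. One device-agnostic shape it could take is dispatching on the resolved device type; the counts mapping below is an assumption standing in for the per-backend device_count() calls, not actual PyTorch code:

```python
# Hypothetical sketch: make world_size device-agnostic by looking up
# the device count for the resolved device type instead of calling
# torch.cuda.device_count() directly. The `counts` dict stands in for
# per-backend calls such as torch.cuda.device_count() and
# torch.xpu.device_count().

def world_size(device_type: str, counts: dict[str, int]) -> int:
    """Return the number of devices for the given backend, 0 if absent."""
    return counts.get(device_type, 0)
```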


class TestZeroRedundancyOptimizerSingleRank(TestZeroRedundancyOptimizer):
    def test_state_dict(self):
        """Check that ZeroRedundancyOptimizer exposes the expected state dict
        interface, irrespective of the sharding."""
        self.dist_init(self.rank)
        self.create_pg(self.device)
Member:

Do we need to pass device here? Seems like we can infer it from the setting on the class, since it's on self.

@@ -20,6 +19,7 @@
if not dist.is_available():
    print("Distributed not available, skipping tests", file=sys.stderr)
    sys.exit(0)
from torch._inductor.utils import is_gpu
Member:

Is it fine for us to depend on inductor code?

Collaborator:

I think it is a bit weird.
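For context on the dependency question: a local equivalent of the imported helper, sketched here under the assumption that is_gpu is essentially a membership test on accelerator device-type strings (the exact set of types is an assumption for illustration), would avoid reaching into inductor code:

```python
# Hedged sketch of what a local replacement for
# torch._inductor.utils.is_gpu might look like: a membership test on
# known accelerator device-type strings. The exact set below is an
# assumption for illustration, not the real inductor definition.
GPU_TYPES = {"cuda", "xpu"}

def is_gpu(device_type: str) -> bool:
    """Return True if the device-type string names a GPU backend."""
    return device_type in GPU_TYPES
```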

Labels
oncall: distributed (add this issue/PR to distributed oncall triage queue) · open source · release notes: distributed (c10d) (release notes category) · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
7 participants