Adding XPU support to DTensor examples by githubsgi · Pull Request #153213 · pytorch/pytorch · GitHub
Open · wants to merge 4 commits into main

Conversation

githubsgi (Contributor):

Adds XPU support to visualize_sharding_example.py and comm_mode_features_example.py.

topic: not user facing

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

Rebase of #152973.

pytorch-bot bot commented May 8, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153213

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 14 Unrelated Failures

As of commit 1add9bb with merge base 7cb5c75:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 8, 2025
@githubsgi (Contributor Author):

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label May 8, 2025
@EikanWang (Collaborator) left a comment:

Please help fix the linter failure.

@EikanWang EikanWang requested a review from Copilot May 12, 2025 06:11
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR adds support for XPU in DTensor examples by updating how the device type is determined in the visualization and communication mode feature examples.

  • Updated device mesh initialization in visualize_sharding_example.py using the accelerator’s current type.
  • Removed the legacy get_device_type() function in comm_mode_features_example.py and replaced it with torch.accelerator.current_accelerator().type.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File: torch/distributed/tensor/examples/visualize_sharding_example.py
  Replaces the hardcoded "cuda" with the current accelerator type for device mesh creation.

File: torch/distributed/tensor/examples/comm_mode_features_example.py
  Removes the get_device_type() function and uses the accelerator's device type directly.
Comments suppressed due to low confidence (2)

torch/distributed/tensor/examples/visualize_sharding_example.py:20

  • The assignment to device_type assumes that torch.accelerator.current_accelerator() always returns a valid accelerator; consider adding documentation or a fallback mechanism to handle cases where no accelerator is available.
device_type = torch.accelerator.current_accelerator().type

torch/distributed/tensor/examples/comm_mode_features_example.py:44

  • Directly assigning self.device_type based on torch.accelerator.current_accelerator().type relies on the presence of an accelerator; consider clarifying this assumption or providing fallback behavior for environments without an XPU.
self.device_type = torch.accelerator.current_accelerator().type

@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 13, 2025
@@ -49,7 +41,7 @@ class CommDebugModeExample:
     def __init__(self, world_size: int, rank: int) -> None:
         self.world_size = world_size
         self.rank = rank
-        self.device_type = get_device_type()
+        self.device_type = torch.accelerator.current_accelerator().type if torch.accelerator.current_accelerator() and torch.accelerator.device_count() else 'cpu'
A collaborator left a comment:

@githubsgi, for CUDA the example requires the device count to be at least 4 (torch.cuda.device_count() >= 4). Should torch.accelerator.device_count() be greater than or equal to 4 as well?

@githubsgi (Contributor Author) commented May 13, 2025:

An assert was added above, which makes "torch.cuda.device_count() >= 4" redundant:

assert int(os.getenv("WORLD_SIZE", "1")) >= 4, "We need at least 4 devices"

@EikanWang (Collaborator) commented May 14, 2025:

WORLD_SIZE may mean multiple nodes, while torch.cuda.device_count() implies a single node with multiple devices. It may be okay for the example for now. @kwen2501, any comments?
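The distinction can be made concrete with illustrative numbers (the values below are hypothetical, not taken from the PR): the WORLD_SIZE assert bounds the total rank count, not the number of devices any single node can see.

```python
# WORLD_SIZE counts ranks across all nodes of a job; a per-node device
# count only covers the local machine. Example: 2 nodes with 4 devices each.
world_size = 8          # what WORLD_SIZE would report for the whole job
local_device_count = 4  # what torch.cuda.device_count() would report per node

# The example's assert passes on the total rank count...
assert world_size >= 4, "We need at least 4 devices"

# ...even though it says nothing about devices on any single node.
print(world_size // local_device_count)  # number of nodes here: 2
```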

@EikanWang EikanWang requested a review from kwen2501 May 14, 2025 15:45
@EikanWang EikanWang added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels May 14, 2025
@githubsgi (Contributor Author) commented May 14, 2025:

Not sure why the lint checkers are complaining about the following.

-        self.device_type = get_device_type()
+        self.device_type = 'cpu' if not torch.accelerator.current_accelerator() else torch.accelerator.current_accelerator().type

Check failure on line 45 in torch/distributed/tensor/examples/comm_mode_features_example.py (GitHub Actions / lintrunner-noclang / linux-job):

MYPY [union-attr]: Item "None" of "device | None" has no attribute "type"

@EikanWang @colesbury, do you have any insight?
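The mypy error comes from calling `.type` on an `Optional` return value: because `current_accelerator()` is called twice, the `not ...` guard does not narrow the result of the second call. Binding the result to a local variable once lets mypy narrow it. A torch-free sketch (the `_Device` class and `current_accelerator` stub below are hypothetical, standing in for the torch.accelerator API):

```python
from typing import Optional


class _Device:
    def __init__(self, type: str) -> None:
        self.type = type


def current_accelerator() -> Optional[_Device]:
    """Hypothetical stub for torch.accelerator.current_accelerator()."""
    return None  # pretend no accelerator is present


# Rejected by mypy: the two calls are independent expressions, so the
# second is still Optional[_Device] and `.type` is a [union-attr] error.
#     device_type = 'cpu' if not current_accelerator() else current_accelerator().type

# Accepted: bind once, then narrow on the local variable.
acc = current_accelerator()
device_type = acc.type if acc is not None else "cpu"
print(device_type)  # cpu (with the stub above)
```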

Labels

  • ciflow/trunk: Trigger trunk jobs on your pull request
  • ciflow/xpu: Run XPU CI tasks
  • oncall: distributed: Add this issue/PR to distributed oncall triage queue
  • open source
  • topic: not user facing: topic category
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
4 participants