Adding XPU support to DTensor examples by githubsgi · Pull Request #153213 · pytorch/pytorch · GitHub
Open · wants to merge 4 commits into main

Conversation

githubsgi (Contributor):

Adds XPU support to visualize_sharding_example.py and comm_mode_features_example.py.

topic: not user facing

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

Rebase of #152973.

pytorch-bot bot commented May 8, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153213

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 14 Unrelated Failures

As of commit 1add9bb with merge base 7cb5c75:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 8, 2025
@githubsgi (Contributor Author):

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label May 8, 2025
@EikanWang (Collaborator) left a comment:

Please help fix the linter failure.

@EikanWang EikanWang requested a review from Copilot May 12, 2025 06:11
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR adds support for XPU in DTensor examples by updating how the device type is determined in the visualization and communication mode feature examples.

  • Updated device mesh initialization in visualize_sharding_example.py using the accelerator’s current type.
  • Removed the legacy get_device_type() function in comm_mode_features_example.py and replaced it with torch.accelerator.current_accelerator().type.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File: torch/distributed/tensor/examples/visualize_sharding_example.py
  Replaces the hardcoded "cuda" with the current accelerator type for device mesh creation.

File: torch/distributed/tensor/examples/comm_mode_features_example.py
  Removes the get_device_type() function and uses the accelerator's device type directly.
Comments suppressed due to low confidence (2)

torch/distributed/tensor/examples/visualize_sharding_example.py:20

  • The assignment to device_type assumes that torch.accelerator.current_accelerator() always returns a valid accelerator; consider adding documentation or a fallback mechanism to handle cases where no accelerator is available.
device_type = torch.accelerator.current_accelerator().type

torch/distributed/tensor/examples/comm_mode_features_example.py:44

  • Directly assigning self.device_type based on torch.accelerator.current_accelerator().type relies on the presence of an accelerator; consider clarifying this assumption or providing fallback behavior for environments without an XPU.
self.device_type = torch.accelerator.current_accelerator().type

@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 13, 2025
@@ -49,7 +41,7 @@ class CommDebugModeExample:
     def __init__(self, world_size: int, rank: int) -> None:
         self.world_size = world_size
         self.rank = rank
-        self.device_type = get_device_type()
+        self.device_type = torch.accelerator.current_accelerator().type if torch.accelerator.current_accelerator() and torch.accelerator.device_count() else 'cpu'
A collaborator left a comment:

@githubsgi, for CUDA the example requires the device count to be at least 4 (torch.cuda.device_count() >= 4). Should torch.accelerator.device_count() be greater than or equal to 4 as well?

@githubsgi (Contributor Author) commented May 13, 2025:

An assert was added above, which makes "torch.cuda.device_count() >= 4" redundant:

assert int(os.getenv("WORLD_SIZE", "1")) >= 4, "We need at least 4 devices"

@EikanWang (Collaborator) commented May 14, 2025:

WORLD_SIZE may mean multiple nodes, while torch.cuda.device_count() implies a single node with multiple devices. It may be okay for the example for now. @kwen2501, any comments?
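The distinction can be made concrete with illustrative numbers (the values below are hypothetical, not taken from the PR): the WORLD_SIZE assert bounds the total rank count, not the number of devices any single node can see.

```python
# WORLD_SIZE counts ranks across all nodes of a job; a per-node device
# count only covers the local machine. Example: 2 nodes with 4 devices each.
world_size = 8          # what WORLD_SIZE would report for the whole job
local_device_count = 4  # what torch.cuda.device_count() would report per node

# The example's assert passes on the total rank count...
assert world_size >= 4, "We need at least 4 devices"

# ...even though it says nothing about devices on any single node.
print(world_size // local_device_count)  # number of nodes here: 2
```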

@EikanWang EikanWang requested a review from kwen2501 May 14, 2025 15:45
@EikanWang EikanWang added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels May 14, 2025
@githubsgi (Contributor Author) commented May 14, 2025:

Not sure why the lint checkers are complaining about the following.

-        self.device_type = get_device_type()
+        self.device_type = 'cpu' if not torch.accelerator.current_accelerator() else torch.accelerator.current_accelerator().type

Check failure on line 45 in torch/distributed/tensor/examples/comm_mode_features_example.py (GitHub Actions / lintrunner-noclang / linux-job):

MYPY [union-attr]: Item "None" of "device | None" has no attribute "type"

@EikanWang @colesbury, do you have any insight?
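The mypy error comes from calling `.type` on an `Optional` return value: because `current_accelerator()` is called twice, the `not ...` guard does not narrow the result of the second call. Binding the result to a local variable once lets mypy narrow it. A torch-free sketch (the `_Device` class and `current_accelerator` stub below are hypothetical, standing in for the torch.accelerator API):

```python
from typing import Optional


class _Device:
    def __init__(self, type: str) -> None:
        self.type = type


def current_accelerator() -> Optional[_Device]:
    """Hypothetical stub for torch.accelerator.current_accelerator()."""
    return None  # pretend no accelerator is present


# Rejected by mypy: the two calls are independent expressions, so the
# second is still Optional[_Device] and `.type` is a [union-attr] error.
#     device_type = 'cpu' if not current_accelerator() else current_accelerator().type

# Accepted: bind once, then narrow on the local variable.
acc = current_accelerator()
device_type = acc.type if acc is not None else "cpu"
print(device_type)  # cpu (with the stub above)
```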

Labels

  • ciflow/trunk: Trigger trunk jobs on your pull request
  • ciflow/xpu: Run XPU CI tasks
  • oncall: distributed: Add this issue/PR to distributed oncall triage queue
  • open source
  • topic: not user facing: topic category
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
4 participants