
[DeviceMesh] Add some documentation for from_group API and add a 2D test #146364


Closed
wz337 wants to merge 2 commits

Conversation

@wz337 (Contributor) commented Feb 3, 2025

@pytorch-bot added the oncall: distributed label (add this issue/PR to the distributed oncall triage queue) on Feb 3, 2025
@pytorch-bot commented Feb 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146364

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2842508 with merge base 810d2a3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wz337 added the module: dtensor and release notes: distributed (dtensor) labels on Feb 3, 2025
@wz337 requested a review from @fduwjj on February 3, 2025 22:57
@@ -823,11 +846,17 @@ def from_group(
(_get_group_tag(group), group_ranks, group.group_name)
]
return device_mesh

# nD scenario
groups = list(group)
if len(groups) == 0:
raise ValueError("Expects at least one ProcessGroup to be passed")
if mesh is None:
@fduwjj (Contributor) commented Feb 4, 2025

Not a blocker, but I have a question about this condition: do we really need it? Can't we infer it, i.e. just take the sizes of the passed-in PGs as the mesh? Also, if mesh is passed in, we don't need mesh_dim_names, right? They are OR, not AND?

@wz337 (Contributor, Author) replied:

Discussed offline: we will keep the mesh tensor required for now, but this is subject to change.
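
For context on the nD path in the diff above, here is a minimal usage sketch of building a 2D DeviceMesh from per-dimension ProcessGroups. This is an illustration, not the 2D test added in this PR; it assumes a 4-rank job and the from_group signature at the time of this change (group, device_type, mesh=None, *, mesh_dim_names=None), and the "dp"/"tp" names and rank bookkeeping are made up for the example.

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh

# Assumes launch via `torchrun --nproc_per_node=4 this_script.py`.
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Global 2x2 layout of ranks: [[0, 1], [2, 3]].
mesh_tensor = torch.arange(4).reshape(2, 2)

# One ProcessGroup per mesh dimension. new_group is collective, so every rank
# creates every group and then keeps the one it belongs to.
dim0_groups = [dist.new_group(mesh_tensor[:, c].tolist()) for c in range(2)]  # [0, 2] and [1, 3]
dim1_groups = [dist.new_group(mesh_tensor[r, :].tolist()) for r in range(2)]  # [0, 1] and [2, 3]
my_dim0 = dim0_groups[rank % 2]
my_dim1 = dim1_groups[rank // 2]

# nD scenario: pass one group per mesh dimension; the mesh tensor (the condition
# discussed in this thread) and the dim names are currently required.
mesh_2d = DeviceMesh.from_group(
    [my_dim0, my_dim1],
    device_type="cuda",
    mesh=mesh_tensor,
    mesh_dim_names=("dp", "tp"),
)

dist.destroy_process_group()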

@fduwjj (Contributor) left a review comment:

Stamp to unblock; the unit test looks good to me, but I have some questions regarding the args needed.

@wz337 force-pushed the add_from_group_doc_and_test branch 3 times, most recently from 79e8922 to c13af6e on February 5, 2025 04:48
@fegin (Contributor) left a review comment:

Left some comments; we may need to rethink the API signature.

of the default process group. Default is None.
mesh_dim_names (tuple[str], optional): A tuple of mesh dimension names to assign
to each dimension of the multi-dimensional array describing the layout of devices.
Its length must match the length of `mesh_shape`. Each string in `mesh_dim_names`
Contributor comment:

mesh.shape?
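
As a side note on the mesh_dim_names docstring quoted above: the names are what later let callers slice out a per-dimension sub-mesh by name. A short sketch (not part of this PR's diff; it assumes a 4-rank/4-GPU job launched with torchrun):

from torch.distributed.device_mesh import init_device_mesh

# Assumes `torchrun --nproc_per_node=4 this_script.py`.
mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

dp_mesh = mesh_2d["dp"]  # this rank's 1D sub-mesh along the "dp" dimension
tp_mesh = mesh_2d["tp"]  # this rank's 1D sub-mesh along the "tp" dimension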

Contributor comment:

mesh and mesh_dim_names are asymmetrical: one can be positional while the other must be keyword-only, yet they are used together.

@wz337 (Contributor, Author) replied:

Yes... it is inconsistent. This is because, when we originally introduced init_device_mesh, mesh_dim_names was not required and was also an optional keyword arg...

https://github.com/pytorch/pytorch/blob/main/torch/distributed/device_mesh.py#L940-L945

Also, the mesh tensor should be a mesh_shape in order to be consistent with init_device_mesh, so it may be better to make the breaking change now rather than later.
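
To see the asymmetry concretely, one cheap, runnable check (no process group needed) is to print both signatures; the commented output below is roughly what they look like around the time of this PR and may differ in other versions:

import inspect
from torch.distributed.device_mesh import DeviceMesh, init_device_mesh

print(inspect.signature(init_device_mesh))
# roughly: (device_type: str, mesh_shape: tuple[int, ...], *, mesh_dim_names: Optional[tuple[str, ...]] = None)

print(inspect.signature(DeviceMesh.from_group))
# roughly: (group, device_type, mesh=None, *, mesh_dim_names=None)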

@wz337 (Contributor, Author) commented Feb 14, 2025

@pytorchmergebot rebase

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator):

Successfully rebased add_from_group_doc_and_test onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add_from_group_doc_and_test && git pull --rebase)

@pytorchmergebot force-pushed the add_from_group_doc_and_test branch from c13af6e to ede80d2 on February 14, 2025 07:18
@wz337 (Contributor, Author) commented Feb 25, 2025

@pytorchmergebot rebase

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator):

Successfully rebased add_from_group_doc_and_test onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add_from_group_doc_and_test && git pull --rebase)

@pytorchmergebot force-pushed the add_from_group_doc_and_test branch from ede80d2 to 4e62c31 on February 25, 2025 21:24
@wz337 (Contributor, Author) commented Feb 28, 2025

@pytorchmergebot merge

@pytorch-bot added the ciflow/trunk label (trigger trunk jobs on your pull request) on Feb 28, 2025
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

Labels: ciflow/trunk · Merged · module: dtensor · oncall: distributed · release notes: distributed (dtensor)
4 participants