try root fix for FP8 tensor by mayank31398 · Pull Request #143248 · pytorch/pytorch · GitHub


try root fix for FP8 tensor #143248


Closed
wants to merge 6 commits into from

Conversation

mayank31398
Contributor
@mayank31398 mayank31398 commented Dec 14, 2024

Signed-off-by: Mayank Mishra <mayank31398@gmail.com>
@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels Dec 14, 2024
pytorch-bot bot commented Dec 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143248

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit bd2d8b1 with merge base 0b75b7f:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Signed-off-by: Mayank Mishra <mayank31398@gmail.com>
Signed-off-by: Mayank Mishra <mayank31398@gmail.com>
fegin

This comment was marked as outdated.

Contributor
@fegin fegin left a comment


I don't think using _MeshEnv is a good idea. @wz337, we discussed not supporting slicing of submeshes when we introduced flattening. We should revisit this option. I don't quite get why FP8 requires this feature, though. cc @weifengpy

@weifengpy
Contributor

> I don't think using _MeshEnv is a good idea. @wz337, we discussed not supporting slicing of submeshes when we introduced flattening. We should revisit this option. I don't quite get why FP8 requires this feature, though. cc @weifengpy

FP8 requires an all-reduce of max(abs(tensor)) across the sharded dim and must avoid all-reducing across the replicated dim, so we pass mesh[-1] to the FP8 all-gather.
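A minimal sketch of the reduction described above, using plain Python lists in place of real tensors and collectives. The 2x4 mesh shape, the sample values, and the helper names `local_amax` and `all_reduce_max` are illustrative assumptions, not the TorchAO or FSDP API. It only shows why reducing along the shard dim (the last mesh dim) is enough: each replica group already holds the full set of shards.

```python
# Simulated 2x4 HSDP mesh: 2 replica groups (mesh dim 0), 4 shards each (mesh dim 1).
# Plain Python stands in for torch.distributed; no real collectives run here.

def local_amax(shard):
    """max(abs(x)) over one rank's local shard."""
    return max(abs(x) for x in shard)

def all_reduce_max(values):
    """Stand-in for an all-reduce with a MAX op: every rank receives the max."""
    m = max(values)
    return [m] * len(values)

# Hypothetical full weight, sharded evenly across the last mesh dim.
full_weight = [0.5, -3.0, 1.25, 2.0, -0.75, 0.1, 2.5, -1.0]
num_replicas, num_shards = 2, 4
shard_size = len(full_weight) // num_shards

# mesh[r][s] holds the shard owned by the rank at (replica r, shard s).
# Every replica row holds identical shards, as in HSDP.
mesh = [
    [full_weight[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    for r in range(num_replicas)
]

# Reduce along the shard dim only, once per replica group.
amax_per_rank = [
    all_reduce_max([local_amax(shard) for shard in replica_row])
    for replica_row in mesh
]

# Reference value computed on the unsharded weight.
global_amax = local_amax(full_weight)
```

Every rank ends up with the same value as `global_amax` without any communication across the replicated dim, since a reduction there would only combine identical copies.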

@soulitzer soulitzer added the triaged label Dec 17, 2024
@mayank31398
Contributor Author

is there any conclusion on how to fix this?

Signed-off-by: Mayank Mishra <mayank31398@gmail.com>
@mayank31398
Contributor Author

@fegin I have made the changes

@mayank31398 mayank31398 requested a review from fegin December 19, 2024 05:53
Signed-off-by: Mayank Mishra <mayank31398@gmail.com>
@awgu awgu added the ciflow/trunk label Dec 19, 2024
Contributor
@fegin fegin left a comment


Let's land this PR to unblock HSDP + TorchAO. We will need a follow-up on the DeviceMesh side to discuss how to support submesh slicing. cc @wz337

Signed-off-by: Mayank Mishra <mayank31398@gmail.com>
@mayank31398
Contributor Author

@fegin @awgu can you guys merge this?
would love to get it in today's nightly

@awgu
Collaborator
awgu commented Dec 19, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Labels
ciflow/trunk - Trigger trunk jobs on your pull request
Merged
oncall: distributed - Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (fsdp) - release notes category
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Distributed] Cannot create submesh from submesh error. Was "is this error message for DeviceMesh needed?"
7 participants