try root fix for FP8 tensor #143248
Conversation
Signed-off-by: Mayank Mishra <mayank31398@gmail.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143248
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit bd2d8b1 with merge base 0b75b7f:
BROKEN TRUNK - The following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I don't think using _MeshEnv is a good idea. @wz337 we discussed not supporting slicing of submeshes when introducing flattening. We should revisit this option. I don't quite get why FP8 requires this feature though. cc., @weifengpy
FP8 requires an all-reduce of abs(max(tensor)) across the sharded dim while avoiding the all-reduce across the replicated dim, so we pass mesh[-1] to the FP8 all-gather.
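To illustrate the comment above, here is a minimal sketch (plain Python, not PyTorch's actual implementation) of why the FP8 amax reduction only needs to run across the sharded mesh dimension: replicated ranks hold identical copies of each shard, so reducing over the replicate dim would be redundant. The 2x2 mesh layout and helper names below are illustrative assumptions, not code from this PR.

```python
# Simulate a 2x2 HSDP-style mesh: rows replicate the data, columns shard it.
# FP8 scaling needs the global amax of the logical tensor.

def local_amax(chunk):
    # Per-shard reduction each rank computes locally.
    return max(abs(v) for v in chunk)

# One logical tensor [1.0, -3.0, 2.5, -0.5], split into two shards along
# the shard dim. Every replica row holds the same two shards.
shard0 = [1.0, -3.0]
shard1 = [2.5, -0.5]

# An all-reduce(max) across the *shard* dim alone recovers the global amax;
# including the replicate dim would just re-reduce identical values.
global_amax = max(local_amax(shard0), local_amax(shard1))
```

This is the reason the PR slices out only the last mesh dimension (`mesh[-1]`) for the FP8 all-gather: that is the dimension along which the values actually differ across ranks.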
is there any conclusion on how to fix this?
@fegin I have made the changes |
Let's land this PR to unblock HSDP + TorchAO. We will need a follow-up on the DeviceMesh side to discuss how to support submesh slicing. cc., @wz337
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #143194
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o