-
Notifications
You must be signed in to change notification settings - Fork 24.3k
fix randint distribution for large max #143787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143787
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 64642db with merge base b5cf8e2 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
9044a08
to
28c48ec
Compare
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
@pytorchmergebot revert -m "failing internal tests, to be fixed first" -c ghfirst As discussed with @ngimel |
@pytorchbot successfully started a revert job. Check the current status here. |
This reverts commit 8059d56. Reverted #143787 on behalf of https://github.com/wdvr due to failing internal tests, to be fixed first ([comment](#143787 (comment)))
@ngimel your PR has been successfully reverted. |
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…ly added case on XPU, align CUDA." The newly added case `test_randint_distribution` from #143787 was set expected failure for CUDA but not for XPU. We add the expected failure here because if fails with the same reason as CUDA. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
… XPU, align CUDA." The newly added case `test_randint_distribution` from #143787 was set expected failure for CUDA but not for XPU. We add the expected failure here because if fails with the same reason as CUDA. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
… CUDA. (#144457) The newly added case `test_randint_distribution` from #143787 was set expected failure for CUDA but not for XPU. We add the expected failure here because if fails with the same reason as CUDA. Pull Request resolved: #144457 Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel, https://github.com/liangan1
Fixes #ISSUE_NUMBER
Similar to #143682, for large maximum values we were sampling integers via % and it doesn't provide uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values
<= 2**32 / 128
)This comes with significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
torch.compile
has always been producing correct results for this, and it's performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.__launch_bounds__
slightly regress perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained.cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov