fix randint distribution for large max by ngimel · Pull Request #143787 · pytorch/pytorch · GitHub

fix randint distribution for large max #143787

Closed

ngimel wants to merge 6 commits into pytorch/pytorch from the ngimel/random_fix branch.

Conversation

@ngimel (Collaborator) commented on Dec 24, 2024

Fixes #ISSUE_NUMBER
Similar to #143682: for large maximum values we were sampling integers via `%`, which does not produce a uniform distribution. This PR limits the maximum skew to approximately 1% (random32 is used only for max values <= 2**32 / 128).
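For illustration only, a minimal standalone sketch (not the ATen implementation; the helper name and the NumPy demo are assumptions) of why plain `%` is biased and why restricting the 32-bit path to ranges <= 2**32 / 128 bounds the skew to roughly 1%:

```python
# Sketch (not the ATen code): why `rand32 % n` is skewed for large n, and why
# using the 32-bit path only for n <= 2**32 // 128 keeps the skew under ~1%.
import numpy as np


def worst_case_skew(n: int, word_bits: int = 32) -> float:
    # With a word_bits-bit source, outputs below 2**word_bits % n occur
    # floor(2**word_bits / n) + 1 times, the rest floor(2**word_bits / n) times,
    # so the small outputs are over-represented by 1 / floor(2**word_bits / n).
    m = 2**word_bits
    return 0.0 if m % n == 0 else 1.0 / (m // n)


print(worst_case_skew(2**32 // 128 - 1))  # ~0.0078: under ~1% for any n <= 2**32 // 128
print(worst_case_skew(3 * 2**30))         # 1.0: small outputs appear twice as often as large ones

# Empirical demo with a tiny 8-bit "word" so the bias is easy to see:
rng = np.random.default_rng(0)
samples = rng.integers(0, 2**8, size=1_000_000) % 200
counts = np.bincount(samples, minlength=200)
print(counts[:56].mean() / counts[56:].mean())  # ~2.0, since 2**8 % 200 == 56
```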
This comes with a significant perf penalty, especially on CUDA, but it's a pretty bad bug, so we'll have to figure out what can be done to recover the performance.
torch.compile has always produced correct results here, and its performance is also significantly better than current eager (~660 GB/s for eager on H100 vs ~1200 GB/s for torch.compile), so we need to figure out why torch.compile is faster.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify it better, but that accounts for only 20-30 GB/s, so the big difference is still unexplained.
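For reference, the eager vs. torch.compile bandwidth comparison above could be reproduced with a rough benchmark along these lines; the element count, dtype, and upper bound are illustrative assumptions, not the exact configuration behind the quoted numbers:

```python
# Rough bandwidth comparison of eager vs. torch.compile randint with a large bound.
import torch

assert torch.cuda.is_available()

n = 2**28        # 256M int64 elements -> 2 GiB written per call (assumed size)
high = 2**40     # large bound, well past the 32-bit fast path (assumed bound)


def eager_randint():
    return torch.randint(0, high, (n,), device="cuda", dtype=torch.int64)


compiled_randint = torch.compile(eager_randint)


def bandwidth_gbps(fn, iters=50):
    # Time `iters` calls with CUDA events and report achieved write bandwidth.
    fn()  # warmup (also triggers compilation for the compiled variant)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    seconds_per_call = start.elapsed_time(end) / 1e3 / iters
    return n * 8 / seconds_per_call / 1e9  # 8 bytes per int64 element


print("eager    :", bandwidth_gbps(eager_randint), "GB/s")
print("compiled :", bandwidth_gbps(compiled_randint), "GB/s")
```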

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

pytorch-bot (bot) commented on Dec 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143787

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 64642db with merge base b5cf8e2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ngimel (Collaborator, Author) commented on Dec 24, 2024

@pytorchbot merge

pytorch-bot (bot) added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Dec 24, 2024
@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 2 mandatory check(s) failed. Dig deeper by viewing the failures on hud.

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@ngimel (Collaborator, Author) commented on Dec 26, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@ngimel (Collaborator, Author) commented on Dec 26, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@wdvr (Contributor) commented on Dec 27, 2024

@pytorchmergebot revert -m "failing internal tests, to be fixed first" -c ghfirst

As discussed with @ngimel

@pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Dec 27, 2024
This reverts commit 8059d56.

Reverted #143787 on behalf of https://github.com/wdvr due to failing internal tests, to be fixed first.
@pytorchmergebot (Collaborator) commented

@ngimel your PR has been successfully reverted.

pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels on Dec 27, 2024
@facebook-github-bot (Contributor) commented

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ngimel (Collaborator, Author) commented on Jan 8, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

etaf added a commit that referenced this pull request Jan 9, 2025
…ly added case on XPU, align CUDA."


The newly added case `test_randint_distribution` from #143787 was marked as an expected failure for CUDA but not for XPU. We add the expected failure here because it fails for the same reason as CUDA.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
etaf added a commit that referenced this pull request Jan 9, 2025
… XPU, align CUDA."


The newly added case `test_randint_distribution` from #143787 was marked as an expected failure for CUDA but not for XPU. We add the expected failure here because it fails for the same reason as CUDA.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Jan 10, 2025
… CUDA. (#144457)

The newly added case `test_randint_distribution` from #143787 was marked as an expected failure for CUDA but not for XPU. We add the expected failure here because it fails for the same reason as CUDA.

Pull Request resolved: #144457
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel, https://github.com/liangan1
github-actions (bot) deleted the ngimel/random_fix branch on February 12, 2025
Labels
- ci-no-td (Do not run TD on this PR)
- ciflow/inductor
- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- module: inductor
- release notes: cpp (release notes category)
- Reverted