fix randint distribution for large max #143787

ngimel · 2024-12-24T04:50:45Z

Fixes #ISSUE_NUMBER
Similar to #143682: for large maximum values we were sampling integers via `%`, which does not produce a uniform distribution. Here we limit the maximum skew to approximately 1% (random32 is used for max values <= 2**32 / 128).
This comes with a significant perf penalty, especially for CUDA, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
torch.compile has always produced correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but that accounts for only 20-30 GB/s, so the big difference is still unexplained.
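
To make the "approx 1%" figure concrete, here is a minimal sketch in plain Python. It is not the code from this PR (the actual change is in the C++/CUDA kernels), and `modulo_skew` is a hypothetical helper introduced only for illustration. It assumes a perfectly uniform 32-bit source reduced with `% range_size` and reports how much more likely the most frequent output is than the least frequent one.

```python
def modulo_skew(range_size: int, bits: int = 32) -> float:
    """Relative excess probability of the most likely value over the least
    likely one when a uniform `bits`-bit integer is reduced with `% range_size`.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    total = 1 << bits
    if not 1 <= range_size <= total:
        raise ValueError("range_size must be in [1, 2**bits]")
    # Each residue has either `low_count` or `low_count + 1` preimages;
    # exactly `remainder` residues get the extra one.
    low_count, remainder = divmod(total, range_size)
    if remainder == 0:
        return 0.0  # range divides 2**bits evenly, so there is no bias
    return 1.0 / low_count


# Threshold quoted in the description: 2**32 / 128
threshold = (1 << 32) // 128

print(modulo_skew(threshold))       # 0.0 (power of two, divides evenly)
print(modulo_skew(threshold - 1))   # 0.0078125, i.e. about 0.8%
print(modulo_skew(3 * (1 << 30)))   # 1.0, some values are twice as likely
```

Under these assumptions, any range at or below the quoted threshold has a skew of at most 1/128, which is consistent with the roughly 1% bound stated above, while ranges close to 2**32 can be skewed by as much as 100%.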

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

pytorch-bot · 2024-12-24T04:50:49Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143787

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 64642db with merge base b5cf8e2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ngimel · 2024-12-24T23:32:10Z

@pytorchbot merge

pytorchmergebot · 2024-12-24T23:34:06Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-12-24T23:34:17Z

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

ngimel · 2024-12-26T18:42:49Z

@pytorchbot merge

pytorchmergebot · 2024-12-26T18:45:20Z

Merge started


pytorchmergebot · 2024-12-26T19:11:55Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / win-vs2019-cpu-py3 / build


Failing merge rule: Core Maintainers

ngimel · 2024-12-26T19:28:06Z

@pytorchbot merge

pytorchmergebot · 2024-12-26T19:29:54Z

Merge started


pytorchmergebot added the Merged label Dec 26, 2024

pytorchmergebot closed this in 8059d56 Dec 26, 2024

pytorchmergebot removed the merging label Dec 26, 2024

@pytorchmergebot revert -m "failing internal tests, to be fixed first" -c ghfirst

As discussed with @ngimel

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Dec 27, 2024

Revert "fix randint distribution for large max (#143787)"

This reverts commit 8059d56.

Reverted #143787 on behalf of https://github.com/wdvr due to failing internal tests, to be fixed first (see the comment on #143787)

@ngimel your PR has been successfully reverted.

pytorchmergebot added Reverted ci-no-td labels Dec 27, 2024

pytorchmergebot reopened this Dec 27, 2024

ngimel added 2 commits January 7, 2025 18:12

Merge branch 'main' into random_fix

df3f93e

use higher threshold

64642db

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorchbot merge

pytorchmergebot added the merging label Jan 8, 2025

Merge started


pytorchmergebot closed this in ab1f627 Jan 8, 2025

pytorchmergebot removed the merging label Jan 8, 2025

etaf mentioned this pull request

Jan 9, 2025

[Inductor UT] Add expected failure for newly added case on XPU, align CUDA. #144457

Closed

albanD mentioned this pull request

Jan 9, 2025

DISABLED test_qs8_conv1d_batchnorm_seq (__main__.TestConv1d) #144466

Closed

huydhn mentioned this pull request

Jan 9, 2025

UNSTABLE pull / linux-jammy-py3-clang12-executorch / test (executorch) #144480

Closed

etaf added a commit that referenced this pull request

Jan 9, 2025


Update base for Update on "[Inductor UT] Add expected failure for newly added case on XPU, align CUDA."

5d96a7e


The newly added case `test_randint_distribution` from #143787 was marked as an expected failure for CUDA but not for XPU.
We add the expected failure here because it fails for the same reason as on CUDA.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]

etaf added a commit that referenced this pull request

Jan 9, 2025


Update on "[Inductor UT] Add expected failure for newly added case on XPU, align CUDA."

5f11e42


The newly added case `test_randint_distribution` from #143787 was marked as an expected failure for CUDA but not for XPU.
We add the expected failure here because it fails for the same reason as on CUDA.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request

Jan 10, 2025


[Inductor UT] Add expected failure for newly added case on XPU, align CUDA. (#144457)

e5111d0

The newly added case `test_randint_distribution` from #143787 was marked as an expected failure for CUDA but not for XPU.
We add the expected failure here because it fails for the same reason as on CUDA.

Pull Request resolved: #144457
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel, https://github.com/liangan1

github-actions deleted the ngimel/random_fix branch

February 12, 2025 02:05

fix randint distribution for large max

10b7dca

ngimel requested review from eqy and syed-ahmed as code owners December 24, 2024 04:50

pytorch-bot bot added ciflow/inductor module: inductor labels Dec 24, 2024

ngimel added the release notes: cpp release notes category label Dec 24, 2024

ngimel added 2 commits December 24, 2024 11:23

Merge branch 'main' into random_fix

7179a21

skip inductor dynamic tests

b4c530f

ngimel mentioned this pull request Dec 24, 2024

Inductor with dynamic shapes fails for randint with >INT_MAX maximum value #143809

Open

eqy approved these changes Dec 24, 2024

View reviewed changes

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 24, 2024

pytorchmergebot added the merging label Dec 24, 2024

pytorchmergebot removed the merging label Dec 24, 2024

pytorchmergebot added the merging label Dec 26, 2024

pytorchmergebot removed the merging label Dec 26, 2024

skip dynamic shape tests, adjust cpu rng test

28c48ec

ngimel force-pushed the ngimel/random_fix branch from 9044a08 to 28c48ec Compare December 26, 2024 19:26

pytorchmergebot added the merging label Dec 26, 2024
