[5/N] Reconcile barrier and NaN checker #134707

kwen2501 · 2024-08-28T18:53:27Z

Stack from ghstack (oldest at bottom):

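Reconcile barrier with the NaN checker by using a zeros() tensor instead of an empty() tensor, since an empty() tensor is uninitialized and may contain NaN.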

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]
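
For illustration only: an empty() tensor exposes whatever bytes were already in memory, and some float32 bit patterns decode as NaN, while all-zero bytes always decode as 0.0. The snippet below is a minimal pure-Python sketch of that point, not the code changed in this PR. `has_nan` and the sample buffers are hypothetical names made up for the example.

```python
import math
import struct


def has_nan(raw: bytes) -> bool:
    """Return True if any little-endian float32 decoded from `raw` is NaN."""
    count = len(raw) // 4
    values = struct.unpack(f"<{count}f", raw[: count * 4])
    return any(math.isnan(v) for v in values)


# Stand-in for zeros(): every byte is 0, so every float32 decodes to 0.0.
zero_filled = bytes(16)

# Stand-in for empty(): the buffer keeps stale bytes. Here the last word
# happens to hold 0x7FC00000, which is a float32 NaN bit pattern.
stale = bytes(12) + struct.pack("<I", 0x7FC00000)

print(has_nan(zero_filled))  # False
print(has_nan(stale))        # True
```

A checker that scans the barrier's tensor would therefore fire on the stale buffer and never on the zero-filled one.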

pytorch-bot · 2024-08-28T18:53:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134707

❌ 1 New Failure, 1 Unrelated Failure

As of commit e7d6ec4 with merge base 0dbc728:

NEW FAILURE - The following job has failed:

pull / linux-focal-py3.11-clang10 / test (crossref, 2, 2, linux.2xlarge) (gh)
test_transformers.py::TestSDPAPrivateUse1Only::test_scaled_dot_product_fused_attention_overrideable_backward

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 1, linux.rocm.gpu) (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 2d29d9f Pull Request resolved: #134707

ghstack-source-id: 47fb4be Pull Request resolved: #134707

wconstab

agree the test is lackluster because we don't have control over the empty tensor, so it will likely pass (or be flaky).

i think it's still worth having, since we'd probably get notified via flaky-test behavior if the fix got reverted.

i wonder if we can allocate and free a NaN tensor right before calling .barrier() and trick the allocator into giving back the same memory to raise the chance of hitting the case.
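
The allocate-and-free idea can be sketched with a toy model. This is a hedged, pure-Python illustration under stated assumptions, not PyTorch's caching allocator and not the test in this PR: `ToyCachingAllocator`, `poison`, and `nan_check` are hypothetical names, and the toy assumes a freed block is handed back to the next request of the same size.

```python
import math


class ToyCachingAllocator:
    """Toy model: freed buffers are cached by size and reused without clearing."""

    def __init__(self):
        self._free = {}  # size -> list of freed buffers

    def empty(self, n):
        cached = self._free.get(n)
        if cached:
            return cached.pop()  # recycled block, old contents intact
        # Toy simplification: a fresh block starts at 0.0 here; real
        # uninitialized memory can hold anything.
        return [0.0] * n

    def zeros(self, n):
        buf = self.empty(n)
        buf[:] = [0.0] * n  # explicitly overwrite whatever was there
        return buf

    def free(self, buf):
        self._free.setdefault(len(buf), []).append(buf)


def nan_check(buf):
    """Return True if the buffer contains any NaN."""
    return any(math.isnan(x) for x in buf)


def poison(alloc, n):
    """Allocate a buffer, fill it with NaN, and free it right away."""
    buf = alloc.empty(n)
    buf[:] = [math.nan] * n
    alloc.free(buf)


alloc = ToyCachingAllocator()

poison(alloc, 4)
print(nan_check(alloc.empty(4)))  # True: the recycled block still holds NaN

poison(alloc, 4)
print(nan_check(alloc.zeros(4)))  # False: zeros() overwrites the stale NaN
```

In the toy, poisoning right before the allocation makes the empty() case fail every time instead of only occasionally. Whether a real allocator returns the same block depends on its size classes and streams, so a real test built this way would still only raise the odds.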

kwen2501 · 2024-08-28T20:04:56Z

@pytorchbot merge

pytorchmergebot · 2024-08-28T20:06:39Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

ghstack-source-id: f659324 Pull Request resolved: #134707

pytorchmergebot · 2024-08-28T22:15:43Z

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team

Raised by workflow job

ghstack-source-id: 1c80d23 Pull Request resolved: #134707

kwen2501 · 2024-08-28T22:22:54Z

@pytorchbot merge

pytorchmergebot · 2024-08-28T22:25:35Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-08-29T01:38:29Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm6.1-py3.8 / test (default, 2, 2, linux.rocm.gpu)

ghstack-source-id: 94cc275 Pull Request resolved: #134707

kwen2501 · 2024-08-29T06:34:22Z

@pytorchbot merge

pytorchmergebot · 2024-08-29T06:36:07Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-08-29T10:07:59Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu)

ghstack-source-id: 75e591c Pull Request resolved: #134707

kwen2501 · 2024-08-29T17:03:27Z

@pytorchbot -h

pytorch-bot · 2024-08-29T17:03:29Z

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick,close}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch
    close               Close a PR

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditing purposes, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as last resort, prefer `--ignore-current` to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or main as branch options and will default to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributor may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO           Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

Close

usage: @pytorchbot close

Close a PR [Can be used on issues]

kwen2501 · 2024-08-29T17:04:19Z

@pytorchbot merge -i

pytorchmergebot · 2024-08-29T17:06:14Z

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-focal-py3.11-clang10 / test (crossref, 2, 2, linux.2xlarge), trunk / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 1, linux.rocm.gpu)

By using a zeros() tensor instead of empty() tensor.

Pull Request resolved: pytorch#134707
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: pytorch#134345, pytorch#134357, pytorch#134701

[5/N] Reconcile barrier and NaN checker

5fc4c67

[ghstack-poisoned]

kwen2501 mentioned this pull request Aug 28, 2024

[4/N] Test NaN checker against broadcast #134701

Closed

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Aug 28, 2024

kwen2501 added a commit that referenced this pull request Aug 28, 2024

[5/N] Reconcile barrier and NaN checker

19d755c

ghstack-source-id: 2d29d9f Pull Request resolved: #134707

kwen2501 requested review from shuqiangzhang and wconstab August 28, 2024 18:56

Update on "[5/N] Reconcile barrier and NaN checker"

c2fb58b

By using a zeros() tensor instead of empty() tensor. cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Aug 28, 2024

[5/N] Reconcile barrier and NaN checker

2e10809

ghstack-source-id: 47fb4be Pull Request resolved: #134707

shuqiangzhang approved these changes Aug 28, 2024

wconstab approved these changes Aug 28, 2024

pytorch-bot bot added the ciflow/trunk label Aug 28, 2024

pytorchmergebot added the merging label Aug 28, 2024

Update on "[5/N] Reconcile barrier and NaN checker"

7be3054

By using a zeros() tensor instead of empty() tensor. cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Aug 28, 2024

[5/N] Reconcile barrier and NaN checker

35abab3

ghstack-source-id: f659324 Pull Request resolved: #134707

pytorchmergebot removed the merging label Aug 28, 2024

Update on "[5/N] Reconcile barrier and NaN checker"

599c218

By using a zeros() tensor instead of empty() tensor. cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Aug 28, 2024

[5/N] Reconcile barrier and NaN checker

065393f

ghstack-source-id: 1c80d23 Pull Request resolved: #134707

pytorchmergebot added the merging label Aug 28, 2024

kwen2501 mentioned this pull request Aug 29, 2024

[6/N] Add USE_C10D_NCCL #134741

Closed

pytorchmergebot removed the merging label Aug 29, 2024

Update on "[5/N] Reconcile barrier and NaN checker"

6194857

By using a zeros() tensor instead of empty() tensor. cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

kwen2501 mentioned this pull request Aug 29, 2024

[1/N] Move NaN check onto NCCL stream #134300

Closed

This was referenced Aug 29, 2024

[2/N] Add flag to control which rank should perform NaN check #134345

Closed

[3/N] Set correct device to CUDA guards #134357

Closed

kwen2501 added a commit that referenced this pull request Aug 29, 2024

[5/N] Reconcile barrier and NaN checker

c675147

ghstack-source-id: 94cc275 Pull Request resolved: #134707

pytorchmergebot added the merging label Aug 29, 2024

pytorchmergebot removed the merging label Aug 29, 2024

Update on "[5/N] Reconcile barrier and NaN checker"

e7d6ec4

By using a zeros() tensor instead of empty() tensor. cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Aug 29, 2024

[5/N] Reconcile barrier and NaN checker

c050d4e

ghstack-source-id: 75e591c Pull Request resolved: #134707

pytorchmergebot added the merging label Aug 29, 2024

pytorchmergebot added the Merged label Aug 29, 2024

pytorchmergebot closed this in 5470fcd Aug 29, 2024

pytorchmergebot removed the merging label Aug 29, 2024

github-actions bot deleted the gh/kwen2501/56/head branch October 3, 2024 02:05
