8000 [RFC][BE] assume error checking is on by default by c-p-i-o · Pull Request #141914 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[RFC][BE] assume error checking is on by default #141914

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

c-p-i-o
Copy link
Contributor
@c-p-i-o c-p-i-o commented Dec 2, 2024

Summary:
Remove conditional MACRO ENABLE_NCCL_ERROR_CHECKING and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request #142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
NCCL 2.7 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan: Unit tests.

Differential Revision: D66672062

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

Copy link
pytorch-bot bot commented Dec 2, 2024
8000

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141914

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure

As of commit 44fd16e with merge base 1a26cdd (image):

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Dec 2, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66672062

@c-p-i-o c-p-i-o self-assigned this Dec 2, 2024
@c-p-i-o c-p-i-o marked this pull request as draft December 2, 2024 21:35
c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Dec 3, 2024
Summary:

Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Assume that the world has all upgraded to later version of NCCL library.

Test Plan: Unit tests.

Differential Revision: D66672062
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66672062

c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Dec 3, 2024
Summary:

Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Assume that the world has all upgraded to later version of NCCL library.

Test Plan: Unit tests.

Differential Revision: D66672062
@c-p-i-o c-p-i-o requested a review from wconstab December 3, 2024 23:05
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66672062

@c-p-i-o c-p-i-o requested a review from kwen2501 December 3, 2024 23:05
c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Dec 3, 2024
Summary:

Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

1. Assuming that the world has all upgraded to later version of NCCL library.
2. Assuming that at Meta this always evaluates to true.

Test Plan: Unit tests.

Differential Revision: D66672062
@c-p-i-o c-p-i-o requested review from fduwjj and wconstab and removed request for wconstab December 3, 2024 23:05
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66672062

@c-p-i-o c-p-i-o requested a review from eqy December 3, 2024 23:05
@c-p-i-o c-p-i-o marked this pull request as ready for review December 3, 2024 23:06
@wconstab
Copy link
Contributor
wconstab commented Dec 3, 2024

I think this PR is essentially masquerading a bigger assumption (NCCL >=2.4) under a smaller title (error checking). Would be good to make a few things more explicit

  • did we pick the right cutoff version? (is 2.4 old enough?)
  • should we add an explicit assertion so users get a clear signal they are using unsupported nccl rather than a linker error for missing error checking?

I'd vote for making a specific PR for the nccl version assertion, and get more eyes on that (also state why we feel 2.4 or 2.x is the right oldest cutoff version to pick). If we can get that landed, then this PR is a no-brainer on top.

@c-p-i-o
Copy link
Contributor Author
c-p-i-o commented Dec 4, 2024

I think this PR is essentially masquerading a bigger assumption (NCCL >=2.4) under a smaller title (error checking). Would be good to make a few things more explicit

  • did we pick the right cutoff version? (is 2.4 old enough?)
  • should we add an explicit assertion so users get a clear signal they are using unsupported nccl rather than a linker error for missing error checking?

I'd vote for making a specific PR for the nccl version assertion, and get more eyes on that (also state why we feel 2.4 or 2.x is the right oldest cutoff version to pick). If we can get that landed, then this PR is a no-brainer on top.

I made a pull request #142023 with a static_assert that NCCL version is 2.4 or above.
I added the rationale for my thought:

  1. 2.4 released 2 years ago.
  2. Enabling error checking (this PR) should be beneficial to all and they are probably getting it anyway.
    For the stragglers who haven't upgraded, this might provide impetus to update/upgrade?

Copy link
Contributor
@wconstab wconstab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please update this PR desc to reference the 2.4 PR.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 4, 2024
pytorchmergebot pushed a commit that referenced this pull request Dec 5, 2024
Summary:
Static assert that NCCL VERSION is greater than 2.4.
This is in preparation of enabling error checking by default in PyTorch library and removal of some macros.
This is in PR #141914.
The rationale behind this version is:
1. 2.4 released ~2 years ago so it's unlikely that someone is still using the old library.
2. Enabling error checking is benefitial to the community as it helps debug subtle bugs in production environments.

Test Plan: unit tests

Differential Revision: D66737055

Pull Request resolved: #142023
Approved by: https://github.com/kwen2501
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66672062

@c-p-i-o
Copy link
Contributor Author
c-p-i-o commented Jan 28, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/141914/head returned non-zero exit code 1

Rebasing (1/1)
Auto-merging test/distributed/test_c10d_nccl.py
Auto-merging torch/csrc/distributed/c10d/NCCLUtils.hpp
Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
error: could not apply f874f3760aa...  [RFC][BE] assume error checking is on by default (#141914)
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply f874f3760aa... [RFC][BE] assume error checking is on by default (#141914)

Raised by https://github.com/pytorch/pytorch/actions/runs/13002690637

pytorch-bot bot pushed a commit that referenced this pull request Jan 28, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request #142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan:
Unit tests.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o
c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Jan 28, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request pytorch#142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan:
Unit tests.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o
c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Jan 28, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request pytorch#142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan:
Unit tests.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o
@c-p-i-o c-p-i-o force-pushed the export-D66672062 branch 2 times, most recently from ef38dd8 to a4a3fa4 Compare January 28, 2025 03:44
c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Jan 28, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request pytorch#142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.


Test Plan:
Unit tests.





cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66672062

c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Jan 28, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request pytorch#142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.


Test Plan:
Unit tests.





cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66672062

Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request pytorch#142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.


Test Plan:
Unit tests.





cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66672062

@pytorch-bot pytorch-bot bot temporarily deployed to upload-benchmark-results January 28, 2025 04:21 Inactive
@pytorch-bot pytorch-bot bot temporarily deployed to upload-benchmark-results January 28, 2025 04:21 Inactive
@pytorch-bot pytorch-bot bot had a problem deploying to upload-benchmark-results January 28, 2025 04:21 Failure
kwen2501 added a commit that referenced this pull request Mar 10, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request #142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan:
Unit tests.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o

[ghstack-poisoned]
kwen2501 pushed a commit that referenced this pull request Mar 10, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request #142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan:
Unit tests.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o

ghstack-source-id: 07f274b
Pull Request resolved: #148900
kwen2501 added a commit that referenced this pull request Mar 10, 2025
…default (#141914)"

Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request #142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan:
Unit tests.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 10, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request #142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan:
Unit tests.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o

ghstack-source-id: 14c319b
Pull Request resolved: #148900
kwen2501 added a commit that referenced this pull request Mar 10, 2025
Summary:
Remove conditional MACRO `ENABLE_NCCL_ERROR_CHECKING` and assume that error checking is always on.
These checks were wrapped in a macro because older NCCL libraries didn't have the pre-requisite functions to do error checks. This check was put in several years ago.

Pull request #142023 adds a static_assert that NCCL version should be 2.7 or above for PyTorch to work.
2.4 released about 2 years ago so it's relatively safe to assume that everyone has upgraded.

Assume that the world has all upgraded to later version of NCCL library.

Release note for PyTorch must specify that going forward, PyTorch will only work with NCCL version 2.7 and above.

Test Plan:
Unit tests.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k

Reviewed By: wconstab, fduwjj, kwen2501

Differential Revision: D66672062

Pulled By: c-p-i-o

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
@kwen2501
Copy link
Contributor
kwen2501 commented Mar 10, 2025

Closing via #148900. Thanks @c-p-i-o !

@kwen2501 kwen2501 closed this Mar 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request fb-exported oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
0