[CD] Fix slim-wheel cuda_nvrtc import problem by atalman · Pull Request #145582 · pytorch/pytorch · GitHub

[CD] Fix slim-wheel cuda_nvrtc import problem #145582

Closed
atalman wants to merge 2 commits

@atalman (Contributor) commented Jan 24, 2025

Similar fix as: #144816

Fixes: #145580

Found during testing of #138340

Note that both nvrtc and nvjitlink exist for CUDA 11.8, 12.4, and 12.6, so the `if` statement that checks the CUDA version can be removed safely. Preloading can then apply to all supported CUDA versions.

CUDA 11.8 path:

```
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib
__init__.py  __pycache__  libnvrtc-builtins.so.11.8  libnvrtc-builtins.so.12.4  libnvrtc.so.11.2  libnvrtc.so.12
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/nvjitlink/lib
__init__.py  __pycache__  libnvJitLink.so.12
```
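
The listings above show where the slim wheels place these libraries. As a rough illustration of what a preload step might look like, here is a minimal sketch. It is not the code from this PR: the helper names `find_wheel_lib` and `preload_wheel_lib`, the glob patterns, and the choice of `sys.path` as the default search root are all assumptions made for the example. The idea is that once the wheel's copy has been loaded, a later `dlopen` by soname is expected to reuse it rather than search `LD_LIBRARY_PATH` again.

```python
import ctypes
import glob
import os
import sys
from typing import Iterable, Optional


def find_wheel_lib(
    lib_folder: str,
    pattern: str,
    search_roots: Optional[Iterable[str]] = None,
) -> Optional[str]:
    """Return the first path matching <root>/nvidia/<lib_folder>/lib/<pattern>.

    Hypothetical helper: the directory layout mirrors the listings above,
    and search_roots defaults to sys.path (an assumption).
    """
    roots = sys.path if search_roots is None else search_roots
    for root in roots:
        candidates = sorted(
            glob.glob(os.path.join(root, "nvidia", lib_folder, "lib", pattern))
        )
        if candidates:
            return candidates[0]
    return None


def preload_wheel_lib(lib_folder: str, pattern: str) -> bool:
    """Best-effort preload of a wheel-shipped library; never raises on failure."""
    path = find_wheel_lib(lib_folder, pattern)
    if path is None:
        return False
    try:
        # Loading by absolute path maps the wheel's copy into the process.
        ctypes.CDLL(path)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Illustrative patterns only; a trailing digit skips unversioned names.
    print(preload_wheel_lib("cuda_nvrtc", "libnvrtc.so.*[0-9]"))
    print(preload_wheel_lib("nvjitlink", "libnvJitLink.so.*[0-9]"))
```

Returning `False` instead of raising keeps the preload optional, so an installation without the `nvidia` wheel directories behaves as before.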

Test with rc 2.6 and CUDA 11.8:

```
python cudnn_test.py
2.6.0+cu118
---------------------------------------------SDPA-Flash---------------------------------------------
ALL GOOD
---------------------------------------------SDPA-CuDNN---------------------------------------------
ALL GOOD
```

Thank you @nWEIdia for discovering this issue

cc @seemethere @malfet @osalpekar

pytorch-bot bot commented Jan 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145582

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 148 Pending

As of commit 4c476a0 with merge base d6bea39 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@atalman added the labels module: binaries (Anything related to official binaries that we release to users), topic: binaries, and topic: not user facing (topic category) on Jan 24, 2025
@nWEIdia (Collaborator) left a comment

Thanks for the great fix!
For reference, the root cause is cuDNN calling dlopen on libnvrtc.
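
One way to check which copy of libnvrtc a process ended up with is to look at its memory mappings on Linux. The following is a minimal diagnostic sketch, not part of this PR. The function names `mapped_libraries` and `current_process_maps` are hypothetical, and the parsing assumes the usual six-column layout of `/proc/<pid>/maps`.

```python
import os
from typing import List


def mapped_libraries(maps_text: str, name_fragment: str) -> List[str]:
    """Return unique mapped file paths whose basename contains name_fragment.

    maps_text is the content of /proc/<pid>/maps; columns are assumed to be
    address, perms, offset, dev, inode, pathname.
    """
    paths: List[str] = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        if name_fragment in os.path.basename(path) and path not in paths:
            paths.append(path)
    return paths


def current_process_maps() -> str:
    """Read the mappings of the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return f.read()
```

After importing torch and running a cuDNN-backed operation, calling `mapped_libraries(current_process_maps(), "libnvrtc")` should list the path or paths that were loaded, which shows whether the wheel's copy or a system copy was picked up.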

@@ -317,16 +317,17 @@ def _load_global_deps() -> None:
# shipped as wheel, which results in OS picking wrong/older version of nvjitlink library
# if `LD_LIBRARY_PATH` is defined
# See https://github.com/pytorch/pytorch/issues/138460
if version.cuda not in ["12.4", "12.6"]: # type: ignore[name-defined]
return
# Similar issue as above exist for cuda_nvrtc for reference
Contributor

nitpick: exists

@malfet (Contributor) commented Jan 24, 2025

@pytorchbot merge -f "What can possible go wrong?"

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


pytorchmergebot pushed a commit to AnantGulati/pytorch that referenced this pull request Jan 24, 2025
Pull Request resolved: pytorch#145582
Approved by: https://github.com/nWEIdia, https://github.com/eqy, https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

@atalman (Contributor, Author) commented Jan 24, 2025

@pytorchbot cherry-pick --onto release/2.6 -c critical

pytorchbot pushed a commit that referenced this pull request Jan 24, 2025
(cherry picked from commit 9752c7c)

@pytorchbot (Collaborator) commented

Cherry picking #145582

The cherry pick PR is at #145614 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:


malfet pushed a commit that referenced this pull request Jan 24, 2025
[CD] Fix slim-wheel cuda_nvrtc import problem (#145582)

Co-authored-by: atalman <atalman@fb.com>
Labels
ciflow/binaries_wheel (Trigger binary build and upload jobs for wheel on the PR)
Merged
module: binaries (Anything related to official binaries that we release to users)
topic: binaries
topic: not user facing (topic category)
Development

Successfully merging this pull request may close these issues.

torch crashes on ubuntu:24.04 during SDPA-CuDNN test
7 participants