[CD] Fix slim-wheel nvjit-link import problem by malfet · Pull Request #141063 · pytorch/pytorch · GitHub

[CD] Fix slim-wheel nvjit-link import problem #141063


Closed
malfet wants to merge 9 commits

Conversation

@malfet (Contributor) commented Nov 19, 2024

When another toolkit (say CUDA 12.3) is installed and `LD_LIBRARY_PATH` points to it, `import torch` will fail with

```
ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```

This cannot be worked around by tweaking the rpath, as the failure also depends on the library load order, which is not guaranteed by any linker. Instead, solve it by preloading `nvjitlink` right after the global deps are loaded, by running something along the lines of the following:

```python
if version.cuda in ["12.4", "12.6"]:
    with open("/proc/self/maps") as f:
        _maps = f.read()
    # libtorch_global_deps.so always depends on cudart; check if it is installed via a wheel
    if "nvidia/cuda_runtime/lib/libcudart.so" in _maps:
        # If all of the above conditions are met, preload nvjitlink
        _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```
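
For context, here is a minimal sketch of what such a preload helper could look like. The name `_preload_cuda_deps_sketch`, the `sys.path` scan, and the `RTLD_GLOBAL` flag are illustrative assumptions, not the exact `torch/__init__.py` implementation:

```python
import ctypes
import glob
import os
import sys


def _preload_cuda_deps_sketch(lib_folder: str, lib_name_glob: str) -> None:
    """Hypothetical stand-in for a preload helper: locate the wheel-installed
    library under <site-packages>/nvidia/<lib_folder>/lib and dlopen it."""
    for path in sys.path:
        candidates = glob.glob(os.path.join(path, "nvidia", lib_folder, "lib", lib_name_glob))
        if candidates:
            # RTLD_GLOBAL exposes the library's symbols to everything loaded later,
            # e.g. libcusparse.so.12 resolving __nvJitLinkComplete_12_4
            ctypes.CDLL(candidates[0], mode=ctypes.RTLD_GLOBAL)
            return
    raise OSError(f"Could not find {lib_name_glob} under nvidia/{lib_folder}/lib")
```

Called as `_preload_cuda_deps_sketch("nvjitlink", "libnvJitLink.so.*[0-9]")`, this would map the wheel's copy of nvjitlink before `libcusparse.so.12` tries to resolve its symbols.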

Fixes #140797
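
To sanity-check the fix, one can inspect the process mappings after import. A hypothetical verification snippet (assuming a Linux system where `/proc/self/maps` exists; not part of this PR):

```python
import torch  # noqa: F401 -- importing torch runs the preload logic on CUDA 12.4/12.6 wheels

# libnvJitLink should be mapped from the nvidia wheel directory,
# not from a system toolkit such as /usr/local/cuda-12.3
with open("/proc/self/maps") as f:
    for line in f.read().splitlines():
        if "libnvJitLink" in line:
            print(line)
```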

malfet requested a review from a team as a code owner on November 19, 2024, 21:19
pytorch-bot (bot) commented Nov 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141063

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 17 Pending, 2 Unrelated Failures

As of commit 9aa303f with merge base c15d650:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

malfet added the `release notes: build`, `topic: bug fixes`, and `ciflow/binaries_wheel` labels on Nov 19, 2024
@malfet (Contributor, Author) commented Nov 20, 2024

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased malfet-patch-13 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via `git checkout malfet-patch-13 && git pull --rebase`)

@Jack-Khuu (Contributor)

Just checking in on the plans for this PR

malfet added the `ciflow/trunk` label on Jan 14, 2025
@facebook-github-bot (Contributor)

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@malfet (Contributor, Author) commented Jan 14, 2025

@pytorchbot merge -f "Linux builds are fine"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (`-f`) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` as a last resort and instead consider `-i`/`--ignore-current` to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report their signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet (Contributor, Author) commented Jan 15, 2025

@pytorchbot cherry-pick --onto release/2.6 -c critical

pytorchbot pushed a commit that referenced this pull request Jan 15, 2025
Pull Request resolved: #141063
Approved by: https://github.com/kit1980

Co-authored-by: Sergii Dymchenko <sdym@meta.com>
(cherry picked from commit f297571)
@pytorchbot (Collaborator)

Cherry picking #141063

The cherry pick PR is at #144816 and it is recommended to link a critical cherry pick PR with an issue.


malfet added a commit that referenced this pull request Jan 15, 2025
[CD] Fix slim-wheel nvjit-link import problem (#141063)

Pull Request resolved: #141063
Approved by: https://github.com/kit1980

Co-authored-by: Sergii Dymchenko <sdym@meta.com>
(cherry picked from commit f297571)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
malfet deleted the malfet-patch-13 branch on January 21, 2025, 02:56
Labels
ciflow/binaries_wheel · ciflow/trunk · Merged · release notes: build · topic: bug fixes