[CD] Fix slim-wheel nvjit-link import problem #141063
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141063
Note: Links to docs will display an error until the docs builds have been completed.
❌ 10 New Failures, 17 Pending, 2 Unrelated Failures
As of commit 9aa303f with merge base c15d650:
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
24546be to dabe054
dabe054 to d2e5375
Just checking in on the plans for this PR
d2e5375 to ad01ffb
@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Co-authored-by: Sergii Dymchenko <sdym@meta.com>
@pytorchbot merge -f "Linux builds are fine"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot cherry-pick --onto release/2.6 -c critical
When another toolkit (say CUDA-12.3) is installed and `LD_LIBRARY_PATH` points to it, `import torch` will fail with

```
ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```

It could not be worked around by tweaking rpath, as it also depends on the library load order, which is not guaranteed by any linker. Instead, solve this by preloading `nvjitlink` right after global deps are loaded, by running something along the lines of the following

```python
if version.cuda in ["12.4", "12.6"]:
    with open("/proc/self/maps") as f:
        _maps = f.read()
    # libtorch_global_deps.so always depends on cudart, check if it's installed via wheel
    if "nvidia/cuda_runtime/lib/libcudart.so" in _maps:
        # If all of the above conditions are met, preload nvjitlink
        _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```

Fixes #140797
Pull Request resolved: #141063
Approved by: https://github.com/kit1980
Co-authored-by: Sergii Dymchenko <sdym@meta.com>
(cherry picked from commit f297571)
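For readers who want to experiment with the approach outside of `torch/__init__.py`, here is a minimal standalone sketch of the same idea. It is not the code from this PR: the function names `wheel_cudart_is_mapped`, `find_wheel_lib`, and `preload_nvjitlink_if_needed` are hypothetical, and the lookup of `nvidia/<folder>/lib/<pattern>` under each `sys.path` entry is an assumption about how the pip-installed CUDA wheels are laid out, inferred from the path in the error message above.

```python
import ctypes
import glob
import os
import sys


def wheel_cudart_is_mapped(maps_text: str) -> bool:
    """Return True if a pip-installed libcudart appears in a /proc/self/maps dump."""
    return "nvidia/cuda_runtime/lib/libcudart.so" in maps_text


def find_wheel_lib(lib_folder: str, lib_glob: str, search_paths=None):
    """Return the first file matching <path>/nvidia/<lib_folder>/lib/<lib_glob>, or None.

    The directory layout is an assumption for this sketch.
    """
    for base in (sys.path if search_paths is None else search_paths):
        pattern = os.path.join(base, "nvidia", lib_folder, "lib", lib_glob)
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[0]
    return None


def preload_nvjitlink_if_needed(maps_path: str = "/proc/self/maps") -> bool:
    """Load the wheel's libnvJitLink by absolute path; return True if it was loaded."""
    try:
        with open(maps_path) as f:
            maps_text = f.read()
    except OSError:
        # No readable maps file (for example, not on Linux): nothing to do.
        return False
    if not wheel_cudart_is_mapped(maps_text):
        # cudart did not come from a wheel, so leave library resolution alone.
        return False
    lib_path = find_wheel_lib("nvjitlink", "libnvJitLink.so.*[0-9]")
    if lib_path is None:
        return False
    # Loading by absolute path sidesteps the LD_LIBRARY_PATH search for this
    # one library; the intent is that it is already in the process by the
    # time libcusparse needs its symbols.
    ctypes.CDLL(lib_path)
    return True
```

Calling `preload_nvjitlink_if_needed()` before anything pulls in `libcusparse` mirrors the ordering described above; the two helper functions are kept separate so the detection and lookup logic can be checked without a CUDA installation.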
Cherry picking #141063
The cherry pick PR is at #144816 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:
Details for Dev Infra team
Raised by workflow job
[CD] Fix slim-wheel nvjit-link import problem (#141063)
(cherry picked from commit f297571)
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>