CUDA deps cannot be preloaded under Bazel #117350
Comments
Are there possibly any other workarounds for this, other than patching the wheel as you suggested, @georgevreilly? Wondering if there is a way to get bazel/pip to place the CUDA deps somewhere torch can find them.
For reference, the issue seems to come from the packaged CUDA dependencies: PyTorch 2.0 ships with CUDA 11 by default, while PyTorch 2.2.0 ships with CUDA 12 by default.
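As a quick way to see which CUDA wheels your installed torch actually pins (the names differ between the cu11 and cu12 generations), the package metadata is enough. A small stdlib-only sketch; the output depends on your install:

```python
# Sketch: list the nvidia-* requirements pinned by the installed torch wheel.
from importlib.metadata import requires

cuda_requirements = [
    requirement
    for requirement in (requires("torch") or [])
    if requirement.startswith("nvidia-")
]
for requirement in cuda_requirements:
    print(requirement)  # e.g. nvidia-cublas-cu12==... plus any platform markers
```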
This is the workaround I applied in GitHub Actions for our project.
I solved this in our bzlmod-based repo using this:

```starlark
# Install pip packages.
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
hub_name = "pypi",
python_version = PYTHON_VERSION,
requirements_lock = "//:.requirements_lock.txt",
)
pip.override(
file = "torch-2.2.1-cp39-cp39-manylinux1_x86_64.whl",
patch_strip = 1,
patches = [
# We have to patch pytorch to fix its dynamic library search code to work
# with the bazel rules_python directory layout.
"@//third_party/pytorch:pytorch.patch",
"@//third_party/pytorch:pytorch_record.patch",
],
)
use_repo(pip, "pypi")
```

(Be careful about copy-pasting this; there's sensitive whitespace that won't copy correctly.)
@mark64 This seems to be a nice workaround, but as far as I understand it only works for Python 3.9 and PyTorch 2.2.1. Could you elaborate on how to proceed for other versions?
When you change versions, hopefully it's as easy as modifying the wheel `file` name in the `pip.override`. However, if the patches no longer apply properly, you'll need to re-create them from the new source files. The record patch file was tricky, though, because it should have been just the one change to the patched file's entry in the wheel's RECORD file.
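For anyone re-creating the patches for another version: a wheel's `*.dist-info/RECORD` lists every file with its sha256 and size, so after patching `torch/__init__.py` that one entry has to be regenerated as well. A minimal sketch of computing the replacement line (the path is a placeholder for wherever you unpacked the wheel):

```python
# Sketch: recompute the RECORD entry for a file patched inside an unpacked wheel.
import base64
import hashlib
from pathlib import Path

patched_file = Path("torch/__init__.py")  # relative to the unpacked wheel root
data = patched_file.read_bytes()

# RECORD uses urlsafe base64 of the sha256 digest, without '=' padding.
digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest()).rstrip(b"=").decode()
print(f"{patched_file.as_posix()},sha256={digest},{len(data)}")
# Use this line to replace the old torch/__init__.py entry in torch-*.dist-info/RECORD.
```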
Here's a little gist that works for us: https://gist.github.com/parth-emancro/ac075492a4c55b7aea149ba2aa3d2841
Looks like this was the easiest way to run things on our side. @mark64, I would've used your solution if not for ...
Hey, there's another approach. Bazel offers the ...
I debugged this a bit more in our case with 2.4.0. While there is already some logic for preloading the deps, the problem I noticed was that the preload logic only applies if the globals library fails to load with a CUDA-related library load failure. In our case the CUDA library itself wasn't the one that was failing, but a later load was. Here is the change I applied:

```diff
diff --git a/torch/__init__.py b/torch/__init__.py
index 1c4d5e4..37d8df6 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -219,15 +219,11 @@ def _load_global_deps() -> None:
split_build_lib_name = LIBTORCH_PKG_NAME
library_path = find_package_path(split_build_lib_name)
-
- if library_path:
- global_deps_lib_path = os.path.join(library_path, 'lib', lib_name)
- try:
- ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
- except OSError as err:
- # Can only happen for wheel with cuda libs as PYPI deps
- # As PyTorch is not purelib, but nvidia-*-cu12 is
- cuda_libs: Dict[str, str] = {
+ # Can only happen for wheel with cuda libs as PYPI deps
+ # As PyTorch is not purelib, but nvidia-*-cu12 is
+ cuda_libs: Dict[str, str] = {
+ 'nvjitlink': 'libnvJitLink.so.*[0-9]',
+ 'cusparse': 'libcusparse.so.*[0-9]',
'cublas': 'libcublas.so.*[0-9]',
'cudnn': 'libcudnn.so.*[0-9]',
'cuda_nvrtc': 'libnvrtc.so.*[0-9]',
@@ -240,13 +236,13 @@ def _load_global_deps() -> None:
'nccl': 'libnccl.so.*[0-9]',
'nvtx': 'libnvToolsExt.so.*[0-9]',
}
- is_cuda_lib_err = [lib for lib in cuda_libs.values() if lib.split('.')[0] in err.args[0]]
- if not is_cuda_lib_err:
- raise err
- for lib_folder, lib_name in cuda_libs.items():
- _preload_cuda_deps(lib_folder, lib_name)
- ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
+ for lib_folder, lib_name in cuda_libs.items():
+ _preload_cuda_deps(lib_folder, lib_name)
+
+ if library_path:
+ global_deps_lib_path = os.path.join(library_path, 'lib', lib_name)
+ ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
if library_path:
# loading libtorch_global_deps first due its special logic
load_shared_libraries(library_path)
```

I don't think there is any realistic downside to this, since the libraries would have been loaded moments later anyway. And actually, in the Bazel case, potentially missing the vendored libraries and instead loading the system libraries seems a bit risky, since they could be different versions; in practice I don't know how much that matters, though.
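For readers unfamiliar with the preload helper being moved around here: roughly, `_preload_cuda_deps` walks `sys.path` looking for the `nvidia/<lib_folder>/lib/` directories that the NVIDIA PyPI wheels install, and loads the first match with `RTLD_GLOBAL`. A simplified sketch of that idea (not the actual torch source):

```python
# Simplified sketch (not the actual torch source): find a vendored NVIDIA
# library on sys.path and load it with RTLD_GLOBAL so that libraries loaded
# afterwards can resolve their shared-library dependencies against it.
import ctypes
import glob
import os
import sys


def preload_cuda_dep(lib_folder: str, lib_pattern: str) -> None:
    for path in sys.path:
        candidates = glob.glob(os.path.join(path, "nvidia", lib_folder, "lib", lib_pattern))
        if candidates:
            ctypes.CDLL(candidates[0], mode=ctypes.RTLD_GLOBAL)
            return
    raise OSError(f"{lib_pattern} not found under any sys.path entry")


# Under Bazel/rules_python every nvidia-*-cu12 wheel sits in its own directory
# on sys.path, so this scan still finds them even though the wheels' rpaths
# (which assume a flat site-packages) do not.
preload_cuda_dep("cublas", "libcublas.so.*[0-9]")
```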
Previously, CUDA deps were only loaded if loading the globals library failed with a CUDA shared-library-related error. But it's possible for the globals library to load successfully and then for the torch native libraries to fail with a CUDA error. This now handles both cases. Fixes pytorch#117350
For whatever reason, this patch is what works for me on torch 2.6 + CUDA 12.4. Also, since we use WORKSPACE instead of MODULE.bazel, I had to reinvent pip.override with a repository_rule.
@lidingsnyk Can you please share the commands you ran in WORKSPACE to apply this patch? The documentation for applying those patches seems to be very sparse.
@FrankPortman This probably works fine: a bunch of common bash commands are used in patch_pip.bzl, so it should work on Linux, and probably on Mac too (or can be fixed easily). In the WORKSPACE I simply invoke the repository rule from the patch_pip.bzl that I wrote. My torch_patch_build_file is based on the BUILD file of torch in my Bazel runfiles: it has a target `pkg`, so I basically just copied it and made it depend on the original.
@lidingsnyk TY for the answer. I actually finally ended up understanding how the ... works. Note the JSONified string.
@FrankPortman Thanks! Yes, this is much better than my approach. Great comment; it's not easy to figure this out from the official Bazel documentation.
I realized today that my patch #137059 doesn't solve all cases, depending on your rpaths. We noticed in our build that it is possible for some libs to be discovered from their system installations, therefore sidestepping the preload logic; but then later, when one of the non-default-installed libs like cudnn is loaded, it cannot be found and the load fails. At this point, if the PyTorch "preload" logic runs, it will load a potentially mismatched version of the NVIDIA libraries alongside the system ones that have already been loaded. It's unclear to me how much of a real-world issue that would cause, but I assume the NVIDIA library versions that PyTorch pins are important, so it seems to me like we should actually always call the preload logic before trying to load any native library, so that the correct paths are loaded. (It might still be possible to conflict with system libraries if another native extension was loaded before PyTorch, but that doesn't seem solvable.)
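A quick way to see which copies actually ended up in the process on Linux (a debugging sketch only; the library-name list and the pip-vs-system heuristic are my assumptions):

```python
# Debugging sketch (Linux only): after importing torch, list where each CUDA
# library was mapped from. Pip-installed copies live under .../nvidia/<pkg>/lib/,
# while system copies typically live under /usr/lib or /usr/local/cuda.
import torch  # noqa: F401  (importing torch triggers the shared-library loads)

names_of_interest = ("libcublas", "libcudnn", "libcusparse", "libnvJitLink", "libcudart")

with open("/proc/self/maps") as maps:
    loaded_paths = {
        line.split()[-1]
        for line in maps
        if ".so" in line.rsplit(" ", 1)[-1]
    }

for path in sorted(loaded_paths):
    if any(name in path for name in names_of_interest):
        origin = "pip wheel" if "/nvidia/" in path else "system?"
        print(f"{origin:9s} {path}")
```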
🐛 Describe the bug
If Torch 2.1.0 is used as a dependency with Bazel and rules_python, `_preload_cuda_deps` fails with `OSError: libcufft.so.11: cannot open shared object file: No such file or directory`.

Torch 2.1 works fine if you install it and its CUDA dependencies into a single `site-packages` (e.g., in a virtualenv). It doesn't work with Bazel, as Bazel installs each dependency into its own directory tree, which is appended to `PYTHONPATH`.

This can be fixed by slightly reordering `cuda_libs` in `_load_global_deps` so that they are topologically sorted.

I have a full repro of the problem, which has a tiny Python app that works in a regular virtualenv but fails with Bazel. I also created a tool there that patches the Torch wheel. The patched wheel works in Bazel for us.
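To make "topologically sorted" concrete: if the preload order is derived from a dependency map, the standard library can emit an order in which every library is preloaded after the libraries it needs. The edges below are illustrative assumptions, not the real CUDA library dependency graph:

```python
# Sketch: derive a safe preload order from a dependency map with the stdlib.
# The edges are illustrative assumptions, not the actual libcu* dependencies.
from graphlib import TopologicalSorter

# node -> libraries that must already be loaded before it (its predecessors)
assumed_deps = {
    "cusolver": {"cublas", "cusparse"},
    "cusparse": {"nvjitlink"},
    "cufft": set(),
    "cublas": set(),
    "nvjitlink": set(),
}

preload_order = list(TopologicalSorter(assumed_deps).static_order())
print(preload_order)  # each library appears after everything it depends on
```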
Related Issues
Versions
cc @malfet @seemethere