Can't import torch --> OSError related to libcublasLt.so.11 · Issue #88882 · pytorch/pytorch · GitHub
Can't import torch --> OSError related to libcublasLt.so.11 #88882

Open
tbloch1 opened this issue Nov 11, 2022 · 4 comments
Labels
module: binaries - Anything related to official binaries that we release to users
module: cuda - Related to torch.cuda, and CUDA support in general
module: regression - It used to work, and now it doesn't
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

tbloch1 commented Nov 11, 2022

🐛 Describe the bug

When I try to import torch in my docker container, I get an OSError:
/usr/local/lib/python3.8/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11

I've followed the steps from Issue #51080, which seems like it might be similar, but to no effect. Note that I don't have conda, so I didn't follow those specific steps, just the pip ones.

However, I have found that if I import tensorflow first, list my devices, and then import torch, it works fine...

try:
    import torch
except OSError as e:
    print(e)
    import tensorflow as tf
    tf.config.list_physical_devices()
    import torch
    print([torch.device(i) for i in range(torch.cuda.device_count())])
>>> /usr/local/lib/python3.8/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11
>>> [device(type='cuda', index=0), device(type='cuda', index=1)]

This makes me suspect that it is a different issue from Issue #51080.
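For reference, a rough (untested) sketch of how one could check whether the libcublasLt.so.11 that dlopen picks up by default actually exports the missing symbol:

import ctypes

# Rough sketch: load whichever libcublasLt.so.11 the dynamic loader finds first,
# then probe for cublasLtGetStatusString. ctypes raises AttributeError for a
# missing symbol, so hasattr() works as a quick check; an older system copy
# (e.g. from CUDA 11.2) that lacks the symbol would explain the error above.
lt = ctypes.CDLL("libcublasLt.so.11")
print(hasattr(lt, "cublasLtGetStatusString"))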

The full traceback of import torch is:

OSError                                   Traceback (most recent call last)
/root/GI_Data/KPVESQC5_AI4Q_P/Exp_workflow_a.ipynb Cell 2' in <cell line: 1>()
----> 1 import torch

File /usr/local/lib/python3.8/dist-packages/torch/__init__.py:191, in <module>
    180 else:
    181     # Easy way.  You want this most of the time, because it will prevent
    182     # C++ symbols from libtorch clobbering C++ symbols from other
   (...)
    188     #
    189     # See Note [Global dependencies]
    190     if USE_GLOBAL_DEPS:
--> 191         _load_global_deps()
    192     from torch._C import *  # noqa: F403
    194 # Appease the type checker; ordinarily this binding is inserted by the
    195 # torch._C module initialization code in C

File /usr/local/lib/python3.8/dist-packages/torch/__init__.py:153, in _load_global_deps()
    150 here = os.path.abspath(__file__)
    151 lib_path = os.path.join(os.path.dirname(here), 'lib', lib_name)
--> 153 ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)

File /usr/lib/python3.8/ctypes/__init__.py:373, in CDLL.__init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    370 self._FuncPtr = _FuncPtr
    372 if handle is None:
--> 373     self._handle = _dlopen(self._name, mode)
    374 else:
    375     self._handle = handle

OSError: /usr/local/lib/python3.8/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11

Versions

Python version: 3.8.10
Docker version: 20.10.21
BASE_IMAGE=tensorflow/tensorflow
IMAGE_VERSION=2.9.1-gpu-jupyter

Output of nvcc -V:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

Output of torch.__version__ in python:

'1.13.0+cu117'

Output of collect_env.py:

PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Mar 15 2022, 12:22:08)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.29
Is CUDA available: N/A
CUDA runtime version: 11.2.152
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: 
GPU 0: NVIDIA A10
GPU 1: NVIDIA A10

Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] torch==1.13.0
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.14.0
[conda] Could not collect

cc @seemethere @malfet @osalpekar @atalman @ptrblck @ezyang @ngimel

malfet added the module: cuda and module: binaries labels on Nov 11, 2022
Contributor
malfet commented Nov 11, 2022

Can you please clarify how you've installed PyTorch into your container? Using pip, I assume? In that case, do you mind sharing the output of pip list?

Also, please try installing the torch wheel with pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu117/ and see if that makes the error go away.

malfet added the module: regression label on Nov 11, 2022
Author
tbloch1 commented Nov 14, 2022

Yep, I installed pytorch using pip, following the install guide.

Sure thing. The pip list output is below.

I've tried that install method; it says all the requirements are already satisfied, and the error remains.

Package                      Version
---------------------------- --------------------
absl-py                      1.0.0
affine                       2.3.1
argon2-cffi                  21.3.0
argon2-cffi-bindings         21.2.0
asciitree                    0.3.3
asttokens                    2.0.5
astunparse                   1.6.3
attrs                        21.4.0
backcall                     0.2.0
beautifulsoup4               4.11.1
bleach                       5.0.0
cachetools                   5.1.0
celluloid                    0.2.0
certifi                      2019.11.28
cffi                         1.15.0
chardet                      3.0.4
click                        8.1.3
click-plugins                1.1.1
cligj                        0.7.2
cloudpickle                  2.2.0
cycler                       0.11.0
Cython                       0.29.32
dask                         2022.10.2
dbus-python                  1.2.16
debugpy                      1.6.0
decorator                    5.1.1
defusedxml                   0.7.1
entrypoints                  0.4
executing                    0.8.3
fasteners                    0.18
fastjsonschema               2.15.3
Fiona                        1.8.22
flatbuffers                  1.12
fonttools                    4.33.3
fsspec                       2022.10.0
gast                         0.4.0
geopandas                    0.12.1
google-auth                  2.6.6
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.46.3
h5py                         3.6.0
hdbscan                      0.8.29
idna                         2.8
importlib-metadata           4.11.4
importlib-resources          5.7.1
ipykernel                    5.1.1
ipympl                       0.9.2
ipython                      8.3.0
ipython-genutils             0.2.0
ipywidgets                   7.7.0
jedi                         0.17.2
Jinja2                       3.1.2
joblib                       1.2.0
jsonschema                   4.5.1
jupyter                      1.0.0
jupyter-client               7.3.1
jupyter-console              6.4.3
jupyter-core                 4.10.0
jupyter-http-over-ws         0.0.8
jupyterlab-pygments          0.2.2
jupyterlab-widgets           1.1.0
keras                        2.9.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.2
libclang                     14.0.1
lightgbm                     3.3.3
llvmlite                     0.39.1
locket                       1.0.0
Markdown                     3.3.7
MarkupSafe                   2.1.1
matplotlib                   3.5.2
matplotlib-inline            0.1.3
mistune                      0.8.4
munch                        2.5.0
nbclient                     0.6.3
nbconvert                    6.5.0
nbformat                     5.7.0
nest-asyncio                 1.5.5
notebook                     6.4.11
numba                        0.56.3
numcodecs                    0.10.2
numpy                        1.22.4
nvidia-cublas-cu11           11.10.3.66
nvidia-cuda-nvrtc-cu11       11.7.99
nvidia-cuda-runtime-cu11     11.7.99
nvidia-cudnn-cu11            8.5.0.96
nvidia-pyindex               1.0.9
oauthlib                     3.2.0
opt-einsum                   3.3.0
packaging                    21.3
pandas                       1.5.1
pandocfilters                1.5.0
parso                        0.7.1
partd                        1.3.0
pexpect                      4.8.0
pickleshare                  0.7.5
Pillow                       9.1.1
pip                          22.3.1
prometheus-client            0.14.1
prompt-toolkit               3.0.29
protobuf                     3.19.4
psutil                       5.9.1
ptyprocess                   0.7.0
pure-eval                    0.2.2
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pycparser                    2.21
Pygments                     2.12.0
PyGObject                    3.36.0
pynndescent                  0.5.8
pyparsing                    3.0.9
pyproj                       3.4.0
pyrsistent                   0.18.1
python-apt                   2.0.0+ubuntu0.20.4.7
python-dateutil              2.8.2
pytz                         2022.6
PyYAML                       6.0
pyzmq                        23.0.0
qtconsole                    5.3.0
QtPy                         2.1.0
rasterio                     1.3.3
requests                     2.22.0
requests-oauthlib            1.3.1
requests-unixsocket          0.2.0
rioxarray                    0.12.4
rsa                          4.8
scikit-learn                 1.1.3
scipy                        1.9.3
seaborn                      0.12.1
Send2Trash                   1.8.0
setuptools                   65.5.1
Shapely                      1.8.5.post1
six                          1.14.0
snuggs                       1.4.7
soupsieve                    2.3.2.post1
stack-data                   0.2.0
tensorboard                  2.9.0
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.9.1
tensorflow-estimator         2.9.0
tensorflow-io-gcs-filesystem 0.26.0
termcolor                    1.1.0
terminado                    0.15.0
threadpoolctl                3.1.0
tinycss2                     1.1.1
toolz                        0.12.0
torch                        1.13.0
torchaudio                   0.13.0
torchvision                  0.14.0
tornado                      6.1
tqdm                         4.64.1
traitlets                    5.2.1.post0
typing_extensions            4.4.0
umap-learn                   0.5.3
urllib3                      1.25.8
wcwidth                      0.2.5
webencodings                 0.5.1
Werkzeug                     2.1.2
wheel                        0.38.3
widgetsnbextension           3.6.0
wrapt                        1.14.1
xarray                       2022.10.0
zarr                         2.13.3
zipp                         3.8.0

Contributor
mergian commented Dec 16, 2022

I just ran into the same issue on Ubuntu 22.04. The reason is that PyTorch loads libcublas.so in _preload_cuda_deps (in torch/__init__.py), which uses symbols from libcublasLt.so.11, but dlopen cannot find that library. I suspect NVIDIA forgot to set RPATH=$ORIGIN within libcublas.so, so the library's own directory is not searched when its dependencies are loaded.

Quick solution: set LD_LIBRARY_PATH to point to that directory, in your case:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/python3.8/dist-packages/torch/lib/../../nvidia/cublas/lib/

A more solid fix would be to change _preload_cuda_deps to:

def _preload_cuda_deps():
    """ Preloads cudnn/cublas deps if they could not be found otherwise """
    # Should only be called on Linux if default path resolution have failed
    assert platform.system() == 'Linux', 'Should only be called on Linux'
    for path in sys.path:
        nvidia_path = os.path.join(path, 'nvidia')
        if not os.path.exists(nvidia_path):
            continue
        cublaslt_path = os.path.join(nvidia_path, 'cublas', 'lib', 'libcublasLt.so.11')
        cublas_path = os.path.join(nvidia_path, 'cublas', 'lib', 'libcublas.so.11')
        cudnn_path = os.path.join(nvidia_path, 'cudnn', 'lib', 'libcudnn.so.8')
        if not os.path.exists(cublaslt_path) or not os.path.exists(cublas_path) or not os.path.exists(cudnn_path):
            continue
        break

    ctypes.CDLL(cublaslt_path)
    ctypes.CDLL(cublas_path)
    ctypes.CDLL(cudnn_path)
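Until a fix like that lands in torch, a user-side workaround along the same lines (a minimal sketch, assuming the nvidia-cublas-cu11 pip wheel is installed under site-packages) is to preload the libraries before importing torch:

import ctypes
import os
import sys

# Minimal sketch, assuming the nvidia pip wheels are installed: loading
# libcublasLt.so.11 and libcublas.so.11 into the process before "import torch"
# lets dlopen resolve cublasLtGetStatusString without RPATH or LD_LIBRARY_PATH.
for path in sys.path:
    lib_dir = os.path.join(path, 'nvidia', 'cublas', 'lib')
    if not os.path.isdir(lib_dir):
        continue
    for lib in ('libcublasLt.so.11', 'libcublas.so.11'):
        candidate = os.path.join(lib_dir, lib)
        if os.path.exists(candidate):
            ctypes.CDLL(candidate)
    break

import torch  # should now resolve against the preloaded cuBLAS libraries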

Author
tbloch1 commented Dec 16, 2022

Thanks for the reply @mergian! Seems like that's the right direction for a full solution.

In my case, I managed to solve it by trial and error, eventually using the following to install PyTorch:

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116

I guess this fixed it because there was previously a mismatch in CUDA versions? I'm not too sure. Either way, I no longer have the problem after this, and torch works fine.

I don't want to close this issue, since it seems like there is a deeper underlying problem that needs addressing.
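In case anyone wants a quick sanity check after reinstalling, something like this (a minimal sketch) is enough to confirm the import and the GPUs:

import torch

# Quick sanity check: the import should no longer raise the OSError,
# and the GPUs should be visible.
print(torch.__version__)
print(torch.cuda.is_available())
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])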
