Description
🐛 Bug
When calling some functions like torch::mean() on a particular GPU tensor, a CUDA runtime error occurs:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: invalid configuration argument
Exception raised from launch_reduce_kernel at /pytorch/aten/src/ATen/native/cuda/Reduce.cuh:828 (most recent call first):
Here is the complete output of gdb backtrace (running with CUDA_LAUNCH_BLOCKING=1):
Thread 1 "train" received signal SIGABRT, Aborted.
0x00007fff600fb70f in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-5.el8.0.2.x86_64 libgomp-8.3.1-5.el8.0.2.x86_64 libibverbs-26.0-8.el8.x86_64 libnl3-3.5.0-1.el8.x86_64 libstdc++-8.3.1-5.el8.0.2.x86_64 sssd-client-2.2.3-20.el8.x86_64 zlib-1.2.11-16.el8_2.x86_64
(gdb) bt
#0 0x00007fff600fb70f in raise () from /lib64/libc.so.6
#1 0x00007fff600e5b25 in abort () from /lib64/libc.so.6
#2 0x00007fff60ce806b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () from /lib64/libstdc++.so.6
#3 0x00007fff60cee50c in __cxxabiv1::__terminate(void (*)()) ()
from /lib64/libstdc++.so.6
#4 0x00007fff60cee567 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007fff60cee7c8 in __cxa_throw () from /lib64/libstdc++.so.6
#6 0x00007fff83782b3c in void at::native::gpu_reduce_kernel<double, double, 4, at::native::MeanOps<double, float>, double>(at::TensorIterator&, at::native::MeanOps<double, float> const&, double, at::native::AccumulationBuffer*, long) ()
from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cuda.so
#7 0x00007fff83773bf2 in at::native::mean_kernel_cuda(at::TensorIterator&) ()
from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cuda.so
#8 0x00007fffe48299a6 in void at::native::DispatchStub<void (*)(at::TensorIterator&), at::native::mean_stub>::operator()<at::TensorIterator&>(c10::DeviceType, at::TensorIterator&) ()
from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#9 0x00007fffe48179f2 in at::native::mean_out_cpu_gpu(at::Tensor&, at::Tensor const&, c10::ArrayRef<long>, bool, c10::optional<c10::ScalarType>) ()
from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#10 0x00007fffe4817f3b in at::native::mean_cpu_gpu(at::Tensor const&, c10::ArrayRef<long>, bool, c10::optional<c10::ScalarType>) ()
from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#11 0x00007fffe4818008 in at::native::mean_cpu_gpu(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#12 0x00007fff8216b373 in at::CUDAType::mean(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cuda.so
#13 0x00007fff821a8dde in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::optional<c10::ScalarType>), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cuda.so
#14 0x00007fffe4d5fa91 in at::Tensor c10::Dispatcher::callWithDispatchKey<at::Tensor, at::Tensor const&, c10::optional<c10::ScalarType> >(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>)> const&, c10::DispatchKey, at::Tensor const&, c10::optional<c10::ScalarType>) const () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#15 0x00007fffe4c79d8a in at::mean(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#16 0x00007fffe622470e in torch::autograd::VariableType::(anonymous namespace)::mean(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#17 0x00007fffe44994ce in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::optional<c10::ScalarType>), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#18 0x00007fffe4d5fa91 in at::Tensor c10::Dispatcher::callWithDispatchKey<at::Tensor, at::Tensor const&, c10::optional<c10::ScalarType> >(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>)> const&, c10::DispatchKey, at::Tensor const&, c10::optional<c10::ScalarType>) const () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#19 0x00007fffe4c79d8a in at::mean(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#20 0x000000000047bacd in one_batch_netImpl::calc_std_avg_only (this=0x883e900, coord_energy_force_batch=std::vector of length 5, capacity 5 = {...}, nei_info_batch=std::vector of length 4, capacity 4 = {...}, descrpt_and_deriv_batch=std::vector of length 4, capacity 4 = {...}, parameters_info=0x22c4ee0, DEVICE=...) at /home/admin/fanyi/Softwares/NN_train/src_O0/struct_DP.h:232
#21 0x00000000004770c6 in train (parameters_info=0x22c4ee0, frame_info=0x7fff40f29010, training_dataset=0x7fffffffd520, model=0x7fffffffd4b0, optimizer=0x7fffffffd460) at /home/admin/fanyi/Softwares/NN_train/src_O0/train_DP.cpp:186
#22 0x00000000004206a0 in train_NN_DP (param_filename=0x7fffffffdbd5 "PARAMS.json") at /home/admin/fanyi/Softwares/NN_train/src_O0/train_NN.cpp:254
#23 0x00000000004200ef in train_NN (param_filename=0x7fffffffdbd5 "PARAMS.json") at /home/admin/fanyi/Softwares/NN_train/src_O0/train_NN.cpp:51
#24 0x000000000040fd09 in main (argc=2, argv=0x7fffffffd878) at /home/admin/fanyi/Softwares/NN_train/src_O0/main.cpp:83
The code at /home/admin/fanyi/Softwares/NN_train/src_O0/struct_DP.h:232 is:
torch::Tensor xyz_hat_avg = torch::mean(xyz_hat);
This first happened in my C++ code using PyTorch's C++ API. I saved this tensor xyz_hat using torch::save() (the attachment gpu_tensor_cpp.tar.gz), then loaded it in Python using torch.jit.load. The same error occurred when calling torch.mean():
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid configuration argument
To Reproduce
Steps to reproduce the behavior:
# gpu_tensor_cpp.tar.gz is the attachment
$ tar -zxvf gpu_tensor_cpp.tar.gz
$ python3
>>> import torch
>>> a = list(torch.jit.load("gpu_tensor_cpp").parameters())[0]
>>> a.device
device(type='cuda', index=0)
>>> a.dtype
torch.float64
>>> torch.mean(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid configuration argument
>>> torch.sum(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid configuration argument
>>> a2 = a.clone()
>>> torch.sum(a2)
tensor(2855.0410, device='cuda:0', dtype=torch.float64)
Expected behavior
Operations on tensor a, such as torch.sum(a), should return the same result as on tensor a2, since a2 is just a clone of a, as shown above.
Environment
PyTorch version: 1.7.0+cu110
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: CentOS Linux 8 (Core) (x86_64)
GCC version: (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Clang version: Could not collect
CMake version: version 3.11.4
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB
GPU 4: A100-SXM4-40GB
GPU 5: A100-SXM4-40GB
GPU 6: A100-SXM4-40GB
GPU 7: A100-SXM4-40GB
Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.7.0+cu110
[conda] Could not collect
Additional context
This error can be worked around by calling .to("cpu") first or by simply making a clone() of the problematic tensor. But I would still like to understand what actually triggers it. Maybe the data stored in the problematic tensor is corrupted?
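One way to look into that question is to compare the layout metadata of the failing tensor with that of its clone, since clone() produces a fresh allocation. The sketch below is only a diagnostic aid: the helper names tensor_layout and layout_diff are hypothetical, and the idea that layout or alignment is involved is an untested assumption, not something this report establishes. It relies only on standard torch.Tensor methods (stride(), storage_offset(), is_contiguous(), data_ptr()).

```python
def tensor_layout(t):
    """Collect layout metadata for a torch.Tensor (hypothetical helper)."""
    return {
        "shape": tuple(t.shape),
        "stride": tuple(t.stride()),
        "storage_offset": t.storage_offset(),
        "is_contiguous": t.is_contiguous(),
        # Address of the first element modulo 16 bytes; a nonzero value
        # means the data does not start on a 16-byte boundary.
        "data_ptr_mod_16": t.data_ptr() % 16,
    }


def layout_diff(a, b):
    """Return the metadata fields on which two tensors differ (hypothetical helper)."""
    la, lb = tensor_layout(a), tensor_layout(b)
    return {key: (la[key], lb[key]) for key in la if la[key] != lb[key]}


# Usage in the interpreter session from this report (a is the loaded tensor):
#   >>> layout_diff(a, a.clone())
```

An empty result would suggest the two tensors have the same layout and alignment, in which case the difference must lie elsewhere; any reported field would be a concrete detail to include in the discussion.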
I tried running the same C++ code on another system with PyTorch 1.4.0 + CUDA 10.1 installed using pip, and everything works fine there. Here is the environment for that system:
PyTorch version: 1.4.0
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: CentOS Linux release 7.7.1908 (Core) (x86_64)
GCC version: (GCC) 7.5.0
Clang version: Could not collect
CMake version: version 2.8.12.2
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
Nvidia driver version: 450.51.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] numpydoc==0.9.1
[pip3] torch==1.4.0
[conda] _pytorch_select 0.2 gpu_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.0.130 0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.14 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] numpy 1.17.2 py37haad9e8e_0
[conda] numpy-base 1.17.2 py37hde5b4d6_0
[conda] numpydoc 0.9.1 py_0
[conda] pytorch 1.3.1 cuda100py37h53c1284_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @ngimel @heitorschueroff