Closed
Labels
high priority, module: crash, module: regression, oncall: cpu inductor, oncall: pt2, triaged
Description
🐛 Describe the bug
When running our internal tests against the 2.6.0 RC ARM wheels (https://download.pytorch.org/whl/test/torch/) on a Grace Hopper system with one GPU, the test below crashes, alternating between a segmentation fault and a bus error from run to run.
The crashes reproduce with both the CUDA and the CPU wheels.
python test/inductor/test_aot_inductor_package.py -k test_add -k cpu
Error:
Running only test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_add
Running 1 items in this shard
Fatal Python error: Segmentation fault
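Because the failure alternates between a segmentation fault and a bus error, it can help to run the reproducer several times and count how each run ends. The following is a minimal sketch, not part of the original report: the helper names `describe_returncode` and `tally_outcomes` are hypothetical, the command is the one quoted above, and it assumes the test process itself is killed by the signal, so that `subprocess` reports a negative return code.

```python
import signal
import subprocess
import sys
from collections import Counter

# Reproducer command quoted in this report; run from the root of a PyTorch checkout.
REPRO_CMD = [
    sys.executable,
    "test/inductor/test_aot_inductor_package.py",
    "-k", "test_add",
    "-k", "cpu",
]


def describe_returncode(returncode: int) -> str:
    """Map a subprocess return code to a short label (hypothetical helper)."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def tally_outcomes(cmd, runs=10):
    """Run cmd `runs` times and count how each run ended (hypothetical helper)."""
    outcomes = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        outcomes[describe_returncode(result.returncode)] += 1
    return outcomes
```

Calling `tally_outcomes(REPRO_CMD)` returns a `Counter` keyed by outcome label, which makes it easy to see whether both `SIGSEGV` and `SIGBUS` show up across runs.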
Backtrace:
(gdb) bt
#0 0x0000ef67c4019c54 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
from /tmp/KFSXvp/data/aotinductor/model/cnxl5jak4cmycxqpoiy3wdbyygqqgbph4tl5wjzolu24zpiqo25v.so
#1 0x0000ef67c40312f4 in torch::aot_inductor::AOTInductorModelContainer::AOTInductorModelContainer(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) ()
from /tmp/KFSXvp/data/aotinductor/model/cnxl5jak4cmycxqpoiy3wdbyygqqgbph4tl5wjzolu24zpiqo25v.so
#2 0x0000ef67c4017744 in AOTInductorModelContainerCreateWithDevice ()
from /tmp/KFSXvp/data/aotinductor/model/cnxl5jak4cmycxqpoiy3wdbyygqqgbph4tl5wjzolu24zpiqo25v.so
#3 0x0000ef6b33cbc464 in torch::inductor::AOTIModelContainerRunner::AOTIModelContainerRunner(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#4 0x0000ef6b33cbcf80 in torch::inductor::AOTIModelContainerRunnerCpu::AOTIModelContainerRunnerCpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#5 0x0000ef6b33cbd038 in torch::inductor::(anonymous namespace)::create_aoti_runner_cpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
Output from our nightly CI, captured with cuda-gdb:
Program terminated with signal SIGBUS, Bus error.
#0 0x0000eca2a34b7628 in ?? () from /usr/lib/aarch64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0xeca2a37b4580 (LWP 706))]
(cuda-gdb) where
#0 0x0000eca2a34b7628 in ?? () from /usr/lib/aarch64-linux-gnu/libc.so.6
#1 0x0000eca2a346cb3c in raise () from /usr/lib/aarch64-linux-gnu/libc.so.6
#2 <signal handler called>
#3 0x0000ec9e30089c54 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /tmp/6cr6fe/data/aotinductor/model/csovzderskxoxbxohbxsgppmjvvjbbnermfydfa4ubnngqepcq2c.so
#4 0x0000ec9e300a12f4 in torch::aot_inductor::AOTInductorModelContainer::AOTInductorModelContainer(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) () from /tmp/6cr6fe/data/aotinductor/model/csovzderskxoxbxohbxsgppmjvvjbbnermfydfa4ubnngqepcq2c.so
#5 0x0000ec9e30087744 in AOTInductorModelContainerCreateWithDevice () from /tmp/6cr6fe/data/aotinductor/model/csovzderskxoxbxohbxsgppmjvvjbbnermfydfa4ubnngqepcq2c.so
#6 0x0000eca292c1c7e8 in torch::inductor::AOTIModelContainerRunner::AOTIModelContainerRunner(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#7 0x0000eca292c1d2b8 in torch::inductor::AOTIModelContainerRunnerCpu::AOTIModelContainerRunnerCpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) ()
from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#8 0x0000eca292c1d368 in torch::inductor::(anonymous namespace)::create_aoti_runner_cpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#9 0x0000eca292c19f2c in torch::inductor::AOTIModelPackageLoader::AOTIModelPackageLoader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#10 0x0000eca2976da644 in pybind11::cpp_function::initialize<pybind11::detail::initimpl::constructor<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>::execute<pybind11::class_<torch::inductor::AOTIModelPackageLoader>, , 0>(pybind11::class_<torch::inductor::AOTIModelPackageLoader>&)::{lambda(pybind11::detail::value_and_holder&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#1}, void, pybind11::detail::value_and_holder&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor>(pybind11::class_<torch::inductor::AOTIModelPackageLoader>&&, void (*)(pybind11::detail::value_and_holder&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so
#11 0x0000eca29719e430 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so
#12 0x00000000005041c4 in ?? ()
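In both backtraces the innermost frames come from the generated model `.so` under `/tmp`, while the outer frames come from `libtorch_cpu.so` and `libtorch_python.so`. The sketch below groups frame numbers by the shared object named after `from`, which makes that split easier to see in long traces. It is an illustrative helper only (the name `frames_by_library` is hypothetical), and it assumes the gdb layout shown above, where a frame starts with `#N` and the `from <path>` part may wrap onto the following line.

```python
import re
from collections import OrderedDict

FRAME_START = re.compile(r"^#(\d+)\s")    # a frame begins with "#<number> "
FROM_LIB = re.compile(r"\bfrom (\S+)\s*$")  # trailing "from <shared object>"


def frames_by_library(backtrace: str) -> "OrderedDict[str, list]":
    """Group gdb frame numbers by the shared object each frame belongs to."""
    frames = []
    for raw in backtrace.splitlines():
        line = raw.strip()
        if not line:
            continue
        if FRAME_START.match(line):
            frames.append(line)
        elif frames:
            # Continuation of the previous frame, e.g. a wrapped "from ..." line.
            frames[-1] += " " + line
        # Lines before the first frame (such as "(gdb) bt") are ignored.

    grouped = OrderedDict()
    for frame in frames:
        number = int(FRAME_START.match(frame).group(1))
        lib = FROM_LIB.search(frame)
        # Frames without a "from" part (e.g. "<signal handler called>") are unknown.
        name = lib.group(1) if lib else "<unknown>"
        grouped.setdefault(name, []).append(number)
    return grouped
```

Pasting one of the backtraces above into a string and passing it to `frames_by_library` yields one entry per shared object, in the order the objects first appear in the trace.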
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @chauhang @penguinwu @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @chenyang78 @atalman @malfet @ptrblck @nWEIdia @xwang233
Versions
Reproduced in a plain Ubuntu 24.04 container with the 2.6.0 RC wheel.