seg fault in aot_inductor_package on arm GPU with 2.6.0 RC #145441
@tinglvv

Description

🐛 Describe the bug

When running internal tests for the 2.6.0 RC ARM wheels (https://download.pytorch.org/whl/test/torch/) on a Grace Hopper machine with 1 GPU, we get segmentation faults and bus errors, alternating between runs, on the test below.

The errors reproduce with both the CUDA and the CPU wheels.

python test/inductor/test_aot_inductor_package.py -k test_add -k cpu

Error:

Running only test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_add
Running 1 items in this shard

Fatal Python error: Segmentation fault
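
For context, test_add in test_aot_inductor_package exercises the AOTInductor packaging flow: export a small model, AOT-compile and package it, then load the package and run it. A minimal sketch of that flow (an approximation, not the exact test body) is:

# Rough sketch of the flow the failing test exercises (approximation; the real
# test lives in test/inductor/test_aot_inductor_package.py).
import torch


class Add(torch.nn.Module):
    def forward(self, x, y):
        return x + y


example_inputs = (torch.randn(4, 4), torch.randn(4, 4))

# Export and AOT-compile the model into a .pt2 package on disk.
ep = torch.export.export(Add(), example_inputs)
package_path = torch._inductor.aoti_compile_and_package(ep)

# Loading the package dlopens the compiled model .so and constructs
# AOTInductorModelContainer -- the step where the crashes below occur.
loaded = torch._inductor.aoti_load_package(package_path)
print(loaded(*example_inputs))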

Backtrace:

(gdb) bt
#0  0x0000ef67c4019c54 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
   from /tmp/KFSXvp/data/aotinductor/model/cnxl5jak4cmycxqpoiy3wdbyygqqgbph4tl5wjzolu24zpiqo25v.so
#1  0x0000ef67c40312f4 in torch::aot_inductor::AOTInductorModelContainer::AOTInductorModelContainer(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) ()
   from /tmp/KFSXvp/data/aotinductor/model/cnxl5jak4cmycxqpoiy3wdbyygqqgbph4tl5wjzolu24zpiqo25v.so
#2  0x0000ef67c4017744 in AOTInductorModelContainerCreateWithDevice ()
   from /tmp/KFSXvp/data/aotinductor/model/cnxl5jak4cmycxqpoiy3wdbyygqqgbph4tl5wjzolu24zpiqo25v.so
#3  0x0000ef6b33cbc464 in torch::inductor::AOTIModelContainerRunner::AOTIModelContainerRunner(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#4  0x0000ef6b33cbcf80 in torch::inductor::AOTIModelContainerRunnerCpu::AOTIModelContainerRunnerCpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#5  0x0000ef6b33cbd038 in torch::inductor::(anonymous namespace)::create_aoti_runner_cpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so

Backtrace from our nightly CI, captured with cuda-gdb:

Program terminated with signal SIGBUS, Bus error.
#0  0x0000eca2a34b7628 in ?? () from /usr/lib/aarch64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0xeca2a37b4580 (LWP 706))]
(cuda-gdb) where
#0  0x0000eca2a34b7628 in ?? () from /usr/lib/aarch64-linux-gnu/libc.so.6
#1  0x0000eca2a346cb3c in raise () from /usr/lib/aarch64-linux-gnu/libc.so.6
#2  <signal handler called>
#3  0x0000ec9e30089c54 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /tmp/6cr6fe/data/aotinductor/model/csovzderskxoxbxohbxsgppmjvvjbbnermfydfa4ubnngqepcq2c.so
#4  0x0000ec9e300a12f4 in torch::aot_inductor::AOTInductorModelContainer::AOTInductorModelContainer(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) () from /tmp/6cr6fe/data/aotinductor/model/csovzderskxoxbxohbxsgppmjvvjbbnermfydfa4ubnngqepcq2c.so
#5  0x0000ec9e30087744 in AOTInductorModelContainerCreateWithDevice () from /tmp/6cr6fe/data/aotinductor/model/csovzderskxoxbxohbxsgppmjvvjbbnermfydfa4ubnngqepcq2c.so
#6  0x0000eca292c1c7e8 in torch::inductor::AOTIModelContainerRunner::AOTIModelContainerRunner(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#7  0x0000eca292c1d2b8 in torch::inductor::AOTIModelContainerRunnerCpu::AOTIModelContainerRunnerCpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#8  0x0000eca292c1d368 in torch::inductor::(anonymous namespace)::create_aoti_runner_cpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#9  0x0000eca292c19f2c in torch::inductor::AOTIModelPackageLoader::AOTIModelPackageLoader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so
#10 0x0000eca2976da644 in pybind11::cpp_function::initialize<pybind11::detail::initimpl::constructor<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>::execute<pybind11::class_<torch::inductor::AOTIModelPackageLoader>, , 0>(pybind11::class_<torch::inductor::AOTIModelPackageLoader>&)::{lambda(pybind11::detail::value_and_holder&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#1}, void, pybind11::detail::value_and_holder&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor>(pybind11::class_<torch::inductor::AOTIModelPackageLoader>&&, void (*)(pybind11::detail::value_and_holder&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so
#11 0x0000eca29719e430 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so
#12 0x00000000005041c4 in ?? ()
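
Both traces terminate in std::_Sp_counted_base::_M_release inside the compiled model .so while AOTInductorModelContainer is being constructed, i.e. during package load rather than during compilation. To confirm that, the load step can be isolated in a fresh process (optionally under gdb); a hedged sketch, assuming a .pt2 package produced as in the snippet above and a hypothetical path name:

# load_only.py: isolate the package-load step. Run e.g. as
#   gdb --args python load_only.py /path/to/add_model.pt2
# The path argument is hypothetical; point it at a package produced earlier.
import sys
import torch

package_path = sys.argv[1]

# aoti_load_package constructs torch::inductor::AOTIModelPackageLoader, which
# builds the AOTInductorModelContainer seen in the frames above.
loaded = torch._inductor.aoti_load_package(package_path)
print(loaded(torch.randn(4, 4), torch.randn(4, 4)))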

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @chauhang @penguinwu @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @chenyang78 @atalman @malfet @ptrblck @nWEIdia @xwang233

Versions

Reproduced in a plain Ubuntu 24.04 container with the 2.6.0 RC wheel.
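
The full environment report is not captured here; the standard way to collect it for PyTorch issues is python -m torch.utils.collect_env, or programmatically for the key details:

# Quick environment check inside the container (standard torch attributes).
import torch

print(torch.__version__)           # should show the 2.6.0 RC build string
print(torch.version.cuda)          # CUDA version the wheel targets (None for the CPU wheel)
print(torch.cuda.is_available())   # whether the GPU is visible to torch
print(torch.__config__.show())     # build/compiler configuration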

Labels

high priority
module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError)
module: regression (It used to work, and now it doesn't)
oncall: cpu inductor (CPU Inductor issues for Intel team to triage)
oncall: pt2
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
