UNSTABLE trunk / libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug / build #148495


Closed
malfet opened this issue Mar 4, 2025 · 12 comments
Labels
module: ci (Related to continuous integration) · module: regression (It used to work, and now it doesn't) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) · unstable

Comments

@malfet
Contributor
malfet commented Mar 4, 2025

See https://hud.pytorch.org/hud/pytorch/pytorch/c677f3251f46b4bffdaa7758fb7102d665b6f11b/1?per_page=50&name_filter=%20libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug%20%2F%20build, but the revert did not help.

Currently failing with errors similar to:

/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'

cc @seemethere @pytorch/pytorch-dev-infra

@pytorch-bot pytorch-bot bot added the module: ci (Related to continuous integration) and unstable labels Mar 4, 2025
pytorch-bot bot commented Mar 4, 2025
Hello there! From the UNSTABLE prefix in this issue title, it looks like you are attempting to unstable a job in PyTorch CI. The information I have parsed is below:
  • Job name: trunk / libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug / build
  • Credential: malfet

Within ~15 minutes, trunk / libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug / build and all of its dependants will be unstable in PyTorch CI. Please verify that the job name looks correct. With great power comes great responsibility.

@malfet malfet added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and module: regression (It used to work, and now it doesn't) labels and removed the unstable label Mar 4, 2025
@seemethere
Member
seemethere commented Mar 4, 2025

Appears as though the docker image was updated between the run that worked and the run that was broken:

Works (d54cab7)

308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9:f6d0deb4923e1fd0ea09c9fc1eb4bf27966d8211

Broken (7ab6749)

308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9:e4800fd93ba7d48bf4197a488fd32c12de647b0e

I went through and poked around the images a bit; the g++ versions appear to match, so I'm not sure what has actually changed.

It might take a bit of investigation to see what has actually changed, possibly using a technique similar to the one here: https://stackoverflow.com/a/48111973
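A rough sketch of that approach (an assumption on my part, not something that was actually run), using the two image tags quoted above: export both filesystems and diff the file listings to see what changed between them.

```bash
# create (but don't start) a container from each image so its filesystem can be exported
docker create --name good 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9:f6d0deb4923e1fd0ea09c9fc1eb4bf27966d8211
docker create --name bad  308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9:e4800fd93ba7d48bf4197a488fd32c12de647b0e

# export each filesystem as a tar stream and keep only the sorted file list
docker export good | tar -tf - | sort > /tmp/good.lst
docker export bad  | tar -tf - | sort > /tmp/bad.lst

# files added/removed between the two images (checksums would be needed to catch content-only changes)
diff /tmp/good.lst /tmp/bad.lst

docker rm good bad
```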

@malfet
Contributor Author
malfet commented Mar 7, 2025

Looks like someone undid my hack...

$ ldd -r /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so |grep stdc++
	libstdc++.so.6 => /opt/conda/envs/py_3.10/lib/libstdc++.so.6 (0x00007f1899787000)

Or maybe not:

$ objdump -T /lib/x86_64-linux-gnu/libstdc++.so.6|c++filt|grep throw_bad_array
000000000009e371 g    DF .text	0000000000000035  CXXABI_1.3.8 __cxa_throw_bad_array_new_length
000000000009e203 g    DF .text	0000000000000033  CXXABI_1.3.8 __cxa_throw_bad_array_length
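A follow-up check worth doing here (a sketch, not output that was actually captured): the `__cxa_*` entries above are the C++ ABI helpers, while the unresolved symbol is the `std::` helper, so grep both candidate libstdc++ libraries for its mangled name directly. If memory serves, `_ZSt28__throw_bad_array_new_lengthv` is only exported starting with the GCC 11 runtime (GLIBCXX_3.4.29), which would explain why neither of these older copies provides it.

```bash
# does either libstdc++ from the ldd/objdump output above export the std:: helper?
for lib in /opt/conda/envs/py_3.10/lib/libstdc++.so.6 /lib/x86_64-linux-gnu/libstdc++.so.6; do
  echo "== $lib"
  objdump -T "$lib" | grep _ZSt28__throw_bad_array_new_lengthv || echo "   (not exported)"
done
```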

@cyyever
Collaborator
cyyever commented Mar 7, 2025

Obviously it's a linking error caused by an ABI break.

@malfet
Contributor Author
malfet commented Mar 7, 2025

@cyyever do you have any idea what breaks it? (or where those new definitions are coming from?)

@malfet
Contributor Author
malfet commented Mar 7, 2025

Another weird state of things:

jenkins@64f78f2fe4ea:~/cpp-build/caffe2/build$ grep _ZSt28__throw_bad_array_new_lengthv . -R
Binary file lib/libtorch_cuda.so matches

Attempting to re-run the last successful build here, to confirm that it indeed comes from some change in the docker image.

@malfet
Contributor Author
malfet commented Mar 7, 2025

This looks interesting...

$ grep _ZSt28__throw_bad_array_new_lengthv /opt/conda/envs/py_3.10/lib/libmagma.a 
Binary file /opt/conda/envs/py_3.10/lib/libmagma.a matches

So it looks like the culprit is #148135.
And indeed:

# grep _ZSt28__throw_bad_array_new_lengthv  /opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a 
Binary file /opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a matches
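(A side note on the grep trick, as a sketch rather than captured output: a binary grep only shows the string is present somewhere in the archive; `nm` would confirm it appears as an undefined reference, i.e. something the final link of `libtorch_cuda.so` has to resolve against whichever libstdc++ it gets.)

```bash
# 'U' entries would confirm libmagma.a carries an unresolved reference to the gcc-11-era helper
nm /opt/conda/envs/py_3.10/lib/libmagma.a 2>/dev/null | grep _ZSt28__throw_bad_array_new_lengthv | sort -u
```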

@cyyever
Collaborator
cyyever commented Mar 7, 2025

It may be. But unless some library explicitly pins libstdc++ to an old version, `std::__throw_bad_array_new_length()` should be found.

@cyyever
Collaborator
cyyever commented Mar 7, 2025
grep libstdc -r .github/ .ci*
.ci/docker/common/install_conda.sh:  # libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
.ci/docker/common/install_conda.sh:  # which is provided in libstdcxx 12 and up.
.ci/docker/common/install_conda.sh:  conda_install libstdcxx-ng=12.3.0 -c conda-forge
.ci/docker/common/install_base.sh:    libstdc++-devel \
.ci/docker/manywheel/build_scripts/build.sh:MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel  mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
.ci/magma/package_files/cmakelists.patch:+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -static-libstdc++ -fno-exceptions")

magma uses -static-libstdc++, which may be the reason.

malfet added a commit that referenced this issue Mar 7, 2025
I.e. s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/

Fixes #148728 and #148495

ghstack-source-id: a857b0e
Pull Request resolved: #148740
malfet added a commit that referenced this issue Mar 7, 2025
I.e. s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/

Fixes #148728 and #148495

ghstack-source-id: 081787e
Pull Request resolved: #148740
malfet added a commit that referenced this issue Mar 7, 2025
I.e. s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/

Fixes #148728 and #148495

ghstack-source-id: 36cdab3
Pull Request resolved: #148740
@malfet
Contributor Author
malfet commented Mar 7, 2025
grep libstdc -r .github/ .ci*
.ci/docker/common/install_conda.sh:  # libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
.ci/docker/common/install_conda.sh:  # which is provided in libstdcxx 12 and up.
.ci/docker/common/install_conda.sh:  conda_install libstdcxx-ng=12.3.0 -c conda-forge
.ci/docker/common/install_base.sh:    libstdc++-devel \
.ci/docker/manywheel/build_scripts/build.sh:MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel  mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
.ci/magma/package_files/cmakelists.patch:+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -static-libstdc++ -fno-exceptions")

magma uses -static-libstdc++, maybe the reason.

The same flags line also adds -fno-exceptions, which would have squashed that code, but I wouldn't be surprised if the patch has been failing to apply for a while.

malfet added a commit that referenced this issue Mar 7, 2025
I.e. s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/

Fixes #148728 and #148495

ghstack-source-id: 48a9940
Pull Request resolved: #148740
pytorchmergebot pushed a commit that referenced this issue Mar 7, 2025
I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/`

This accidentally fixes undefined symbol reference errors, namely:
```
/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'
```
This happens because `libmagma.a`, which was built with gcc-11 (after #148135), references symbols that are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`.

Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or
```
$ echo "int* foo(int x) { return new int[x];}"|g++ -std=c++17 -S -O3 -x c++ -o - -
	.file	""
	.text
	.section	.text.unlikely,"ax",@progbits
.LCOLDB0:
	.text
.LHOTB0:
	.p2align 4
	.globl	_Z3fooi
	.type	_Z3fooi, @function
_Z3fooi:
.LFB0:
	.cfi_startproc
	endbr64
	movslq	%edi, %rdi
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movabsq	$2305843009213693950, %rax
	cmpq	%rax, %rdi
	ja	.L2
	salq	$2, %rdi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	_Znam@PLT
	.cfi_endproc
	.section	.text.unlikely
	.cfi_startproc
	.type	_Z3fooi.cold, @function
_Z3fooi.cold:
.LFSB0:
.L2:
	.cfi_def_cfa_offset 16
	call	__cxa_throw_bad_array_new_length@PLT
	.cfi_endproc
```

Fixes #148728 and #148495
Pull Request resolved: #148740
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
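For reference, a minimal sketch of a source pattern that does pull in the `std::` helper when built with the gcc-11 headers (an illustration on my part, not something from the PR, and it assumes `g++-11` is installed): since GCC 11, `std::allocator<T>::allocate()` length-checks via `std::__throw_bad_array_new_length()`, so an allocator/`std::vector` allocation path compiled against those headers can leave the reference behind, in contrast to plain `new[]`, which goes through `__cxa_throw_bad_array_new_length` as shown above.

```bash
# expect a cold-path `call _ZSt28__throw_bad_array_new_lengthv@PLT` in the output
echo '#include <cstddef>
#include <memory>
int* foo(std::size_t n) { return std::allocator<int>().allocate(n); }' | \
  g++-11 -std=c++17 -O2 -S -x c++ -o - - | grep throw_bad_array
```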
@malfet malfet closed this as completed Mar 7, 2025
@ZainRizvi
Contributor

Temporarily reopening this issue so that viable/strict upgrades can continue ignoring the status of this job for the next few hours. It can be closed once the viable/strict upgrade branch catches up to #148740.

@ZainRizvi ZainRizvi reopened this Mar 7, 2025
@malfet
Contributor Author
malfet commented Mar 25, 2025

@ZainRizvi if the plan was to reopen it only temporarily, it should have been closed by now.
But also, this job is gone now (it has been replaced with libtorch-linux-focal-cuda12.6-py3.10-gcc11-debug), so closing.

@malfet malfet closed this as completed Mar 25, 2025