UNSTABLE trunk / libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug / build #148495


Closed
malfet opened this issue Mar 4, 2025 · 12 comments
Labels
module: ci (Related to continuous integration) · module: regression (It used to work, and now it doesn't) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) · unstable

Comments

@malfet
Contributor
malfet commented Mar 4, 2025

See https://hud.pytorch.org/hud/pytorch/pytorch/c677f3251f46b4bffdaa7758fb7102d665b6f11b/1?per_page=50&name_filter=%20libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug%20%2F%20build, but the revert did not help.

Currently failing with errors similar to:

/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'

cc @seemethere @pytorch/pytorch-dev-infra

@pytorch-bot pytorch-bot bot added the module: ci (Related to continuous integration) and unstable labels Mar 4, 2025
pytorch-bot bot commented Mar 4, 2025
Hello there! From the UNSTABLE prefix in this issue title, it looks like you are attempting to unstable a job in PyTorch CI. The information I have parsed is below:
  • Job name: trunk / libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug / build
  • Credential: malfet

Within ~15 minutes, trunk / libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug / build and all of its dependants will be unstable in PyTorch CI. Please verify that the job name looks correct. With great power comes great responsibility.

@malfet malfet added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and module: regression (It used to work, and now it doesn't) labels and removed the unstable label Mar 4, 2025
@seemethere
Member
seemethere commented Mar 4, 2025

Appears as though the docker image was updated between the run that worked and the run that was broken:

Works (d54cab7)

308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9:f6d0deb4923e1fd0ea09c9fc1eb4bf27966d8211

Broken (7ab6749)

308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9:e4800fd93ba7d48bf4197a488fd32c12de647b0e

I went through and poked around the images a bit; the g++ versions appear to match, so I'm not sure what has actually changed.

It might take a bit of investigation to see what has actually changed, possibly using a technique similar to the one here: https://stackoverflow.com/a/48111973
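A rough sketch of that approach (an assumption on my part, not something that was actually run), using the two image tags quoted above: export both filesystems and diff the file listings to see what changed between them.

```bash
# create (but don't start) a container from each image so its filesystem can be exported
docker create --name good 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9:f6d0deb4923e1fd0ea09c9fc1eb4bf27966d8211
docker create --name bad  308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9:e4800fd93ba7d48bf4197a488fd32c12de647b0e

# export each filesystem as a tar stream and keep only the sorted file list
docker export good | tar -tf - | sort > /tmp/good.lst
docker export bad  | tar -tf - | sort > /tmp/bad.lst

# files added/removed between the two images (checksums would be needed to catch content-only changes)
diff /tmp/good.lst /tmp/bad.lst

docker rm good bad
```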

@malfet
Contributor Author
malfet commented Mar 7, 2025

Looks like someone undid my hack...

$ ldd -r /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so |grep stdc++
	libstdc++.so.6 => /opt/conda/envs/py_3.10/lib/libstdc++.so.6 (0x00007f1899787000)

Or maybe not:

$ objdump -T /lib/x86_64-linux-gnu/libstdc++.so.6|c++filt|grep throw_bad_array
000000000009e371 g    DF .text	0000000000000035  CXXABI_1.3.8 __cxa_throw_bad_array_new_length
000000000009e203 g    DF .text	0000000000000033  CXXABI_1.3.8 __cxa_throw_bad_array_length
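A follow-up check worth doing here (a sketch, not output that was actually captured): the `__cxa_*` entries above are the C++ ABI helpers, while the unresolved symbol is the `std::` helper, so grep both candidate libstdc++ libraries for its mangled name directly. If memory serves, `_ZSt28__throw_bad_array_new_lengthv` is only exported starting with the GCC 11 runtime (GLIBCXX_3.4.29), which would explain why neither of these older copies provides it.

```bash
# does either libstdc++ from the ldd/objdump output above export the std:: helper?
for lib in /opt/conda/envs/py_3.10/lib/libstdc++.so.6 /lib/x86_64-linux-gnu/libstdc++.so.6; do
  echo "== $lib"
  objdump -T "$lib" | grep _ZSt28__throw_bad_array_new_lengthv || echo "   (not exported)"
done
```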

@cyyever
Collaborator
cyyever commented Mar 7, 2025

Obviously it's a linking error caused by an ABI break.

@malfet
Contributor Author
malfet commented Mar 7, 2025

@cyyever do you have any idea what breaks it? (or where those new definitions are coming from?)

@malfet
Contributor Author
malfet commented Mar 7, 2025

Another weird state of things:

jenkins@64f78f2fe4ea:~/cpp-build/caffe2/build$ grep _ZSt28__throw_bad_array_new_lengthv . -R
Binary file lib/libtorch_cuda.so matches

Attempting to re-run the last successful build here, to confirm that it indeed comes from some change in the docker image.

@malfet
Contributor Author
malfet commented Mar 7, 2025

This looks interesting...

$ grep _ZSt28__throw_bad_array_new_lengthv /opt/conda/envs/py_3.10/lib/libmagma.a 
Binary file /opt/conda/envs/py_3.10/lib/libmagma.a matches

So it looks like the culprit is #148135.
And indeed:

# grep _ZSt28__throw_bad_array_new_lengthv  /opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a 
Binary file /opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a matches
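(A side note on the grep trick, as a sketch rather than captured output: a binary grep only shows the string is present somewhere in the archive; `nm` would confirm it appears as an undefined reference, i.e. something the final link of `libtorch_cuda.so` has to resolve against whichever libstdc++ it gets.)

```bash
# 'U' entries would confirm libmagma.a carries an unresolved reference to the gcc-11-era helper
nm /opt/conda/envs/py_3.10/lib/libmagma.a 2>/dev/null | grep _ZSt28__throw_bad_array_new_lengthv | sort -u
```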

@cyyever
Collaborator
cyyever commented Mar 7, 2025

It may be. But unless some library explicitly pins libstdc++ to an old version, `std::__throw_bad_array_new_length()` should be found.

@cyyever
Collaborator
cyyever commented Mar 7, 2025
grep libstdc -r .github/ .ci*
.ci/docker/common/install_conda.sh:  # libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
.ci/docker/common/install_conda.sh:  # which is provided in libstdcxx 12 and up.
.ci/docker/common/install_conda.sh:  conda_install libstdcxx-ng=12.3.0 -c conda-forge
.ci/docker/common/install_base.sh:    libstdc++-devel \
.ci/docker/manywheel/build_scripts/build.sh:MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel  mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
.ci/magma/package_files/cmakelists.patch:+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -static-libstdc++ -fno-exceptions")

magma uses -static-libstdc++, which may be the reason.

malfet added a commit that referenced this issue Mar 7, 2025
I.e. s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/

Fixes #148728 and #148495

ghstack-source-id: a857b0e
Pull Request resolved: #148740
malfet added a commit that referenced this issue Mar 7, 2025
I.e. s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/

Fixes #148728 and #148495

ghstack-source-id: 081787e
Pull Request resolved: #148740
malfet added a commit that referenced this issue Mar 7, 2025
I.e. s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/

Fixes #148728 and #148495

ghstack-source-id: 36cdab3
Pull Request resolved: #148740
@malfet
Contributor Author
malfet commented Mar 7, 2025
grep libstdc -r .github/ .ci*
.ci/docker/common/install_conda.sh:  # libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
.ci/docker/common/install_conda.sh:  # which is provided in libstdcxx 12 and up.
.ci/docker/common/install_conda.sh:  conda_install libstdcxx-ng=12.3.0 -c conda-forge
.ci/docker/common/install_base.sh:    libstdc++-devel \
.ci/docker/manywheel/build_scripts/build.sh:MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel  mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
.ci/magma/package_files/cmakelists.patch:+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -static-libstdc++ -fno-exceptions")

magma uses -static-libstdc++, maybe the reason.

The same flags line also adds -fno-exceptions, which would have squashed that code, but I wouldn't be surprised if the patch has been failing to apply for a while.

malfet added a commit that referenced this issue Mar 7, 2025
I.e. s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/

Fixes #148728 and #148495

ghstack-source-id: 48a9940
Pull Request resolved: #148740
pytorchmergebot pushed a commit that referenced this issue Mar 7, 2025
I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/`

This accidentally fixes undefined symbol reference errors, namely:
```
/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'
```
This happens because `libmagma.a`, which was built with gcc-11 (after #148135), references symbols that are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`.

Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or
```
$ echo "int* foo(int x) { return new int[x];}"|g++ -std=c++17 -S -O3 -x c++ -o - -
	.file	""
	.text
	.section	.text.unlikely,"ax",@progbits
.LCOLDB0:
	.text
.LHOTB0:
	.p2align 4
	.globl	_Z3fooi
	.type	_Z3fooi, @function
_Z3fooi:
.LFB0:
	.cfi_startproc
	endbr64
	movslq	%edi, %rdi
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movabsq	$2305843009213693950, %rax
	cmpq	%rax, %rdi
	ja	.L2
	salq	$2, %rdi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	_Znam@PLT
	.cfi_endproc
	.section	.text.unlikely
	.cfi_startproc
	.type	_Z3fooi.cold, @function
_Z3fooi.cold:
.LFSB0:
.L2:
	.cfi_def_cfa_offset 16
	call	__cxa_throw_bad_array_new_length@PLT
	.cfi_endproc
```

Fixes #148728 and #148495
Pull Request resolved: #148740
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
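For reference, a minimal sketch of a source pattern that does pull in the `std::` helper when built with the gcc-11 headers (an illustration on my part, not something from the PR, and it assumes `g++-11` is installed): since GCC 11, `std::allocator<T>::allocate()` length-checks via `std::__throw_bad_array_new_length()`, so an allocator/`std::vector` allocation path compiled against those headers can leave the reference behind, in contrast to plain `new[]`, which goes through `__cxa_throw_bad_array_new_length` as shown above.

```bash
# expect a cold-path `call _ZSt28__throw_bad_array_new_lengthv@PLT` in the output
echo '#include <cstddef>
#include <memory>
int* foo(std::size_t n) { return std::allocator<int>().allocate(n); }' | \
  g++-11 -std=c++17 -O2 -S -x c++ -o - - | grep throw_bad_array
```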
@malfet malfet closed this as completed Mar 7, 2025
@ZainRizvi
Contributor

Temporarily reopening this issue so that viable/strict upgrades can continue ignoring the status of this job for the next few hours. It can be closed once the viable/strict upgrade branch catches up to #148740.

@ZainRizvi ZainRizvi reopened this Mar 7, 2025
@malfet
Contributor Author
malfet commented Mar 25, 2025

@ZainRizvi if the plan was to reopen it only temporarily, it should have been closed by now.
But also, this job is gone now (it has been replaced with libtorch-linux-focal-cuda12.6-py3.10-gcc11-debug), so closing.

@malfet malfet closed this as completed Mar 25, 2025