UNSTABLE trunk / libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug / build #148495
Hello there! From the UNSTABLE prefix in this issue title, it looks like you are attempting to unstable a job in PyTorch CI. The information I have parsed is below:
Within ~15 minutes, |
It appears that the Docker image was updated between the run that worked and the run that broke:
Works (d54cab7)
Broken (7ab6749)
I poked around the images a bit; the g++ versions appear to match, and I'm unsure what has actually changed. It might take some investigation to find out, possibly with a technique similar to the one here: https://stackoverflow.com/a/48111973 |
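One way to double-check which compiler and libstdc++ headers an image actually provides is to compile a tiny probe inside each image and compare the results. The sketch below is only an illustration and not something taken from the thread: `toolchain_description` is a hypothetical helper name, and it relies solely on the predefined macros `__GNUC__`, `__GNUC_MINOR__`, `_GLIBCXX_RELEASE` (libstdc++ from GCC 7 onward) and `__GLIBCXX__`.

```cpp
#include <string>

// Hypothetical helper (not part of PyTorch or its CI): describes the compiler
// and the libstdc++ headers this translation unit was compiled against.
std::string toolchain_description() {
    std::string out;
#if defined(__GNUC__)
    // Also true for Clang, which defines __GNUC__ for compatibility.
    out += "GNU-compatible compiler " + std::to_string(__GNUC__) + "." +
           std::to_string(__GNUC_MINOR__);
#else
    out += "non-GNU compiler";
#endif
#if defined(_GLIBCXX_RELEASE)
    // Major GCC release the libstdc++ headers belong to.
    out += ", libstdc++ release " + std::to_string(_GLIBCXX_RELEASE);
#elif defined(__GLIBCXX__)
    // Older libstdc++ only exposes a datestamp.
    out += ", libstdc++ datestamp " + std::to_string(__GLIBCXX__);
#else
    out += ", not libstdc++";
#endif
    return out;
}
```

Printing `toolchain_description()` from a small `main` in both images would show whether the headers differ. It says nothing about prebuilt static archives pulled into the link, which would need to be inspected separately.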
Looks like someone undid my hack...
Or maybe not
|
Obviously it's a linking error caused by an ABI break. |
@cyyever do you have any idea what breaks it? (or where those new definitions are coming from?) |
Another weird state of things:
Attempting to re-run the last successful build here, to confirm that it indeed comes from some change in the Docker image |
This looks interesting...
So it looks like the culprit is #148135
|
It may be. But unless some library explicitly pins libstdc++ to an old version, `std::__throw_bad_array_new_length()' should be found. |
magma uses -static-libstdc++, which may be the reason. |
This flag also suggests |
I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/`

Which accidentally fixes undefined symbol reference errors, namely
```
/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'
```
Which happens because `libmagma.a`, which was built with gcc-11 (after #148135), contains symbols which are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`.

Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or
```
$ echo "int* foo(int x) { return new int[x];}"|g++ -std=c++17 -S -O3 -x c++ -o - -
    .file   ""
    .text
    .section    .text.unlikely,"ax",@progbits
.LCOLDB0:
    .text
.LHOTB0:
    .p2align 4
    .globl  _Z3fooi
    .type   _Z3fooi, @function
_Z3fooi:
.LFB0:
    .cfi_startproc
    endbr64
    movslq  %edi, %rdi
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    movabsq $2305843009213693950, %rax
    cmpq    %rax, %rdi
    ja  .L2
    salq    $2, %rdi
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    jmp _Znam@PLT
    .cfi_endproc
    .section    .text.unlikely
    .cfi_startproc
    .type   _Z3fooi.cold, @function
_Z3fooi.cold:
.LFSB0:
.L2:
    .cfi_def_cfa_offset 16
    call    __cxa_throw_bad_array_new_length@PLT
    .cfi_endproc
```

Fixes #148728 and #148495

Pull Request resolved: #148740
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
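For readers unfamiliar with the cold path in that listing, here is a minimal sketch of when it is taken, assuming a C++11-or-later compiler such as a recent GCC. `foo` mirrors the one-liner from the commit message, while `throws_bad_array_new_length` is a hypothetical helper added purely for illustration. Note that the compiler-emitted call here is `__cxa_throw_bad_array_new_length`, which is a different symbol from the `std::__throw_bad_array_new_length()` named in the linker error.

```cpp
#include <new>

// Same shape as the snippet in the commit message: array-new with a
// runtime-determined element count.
int* foo(int x) { return new int[x]; }

// Hypothetical helper: reports whether foo(x) ends in
// std::bad_array_new_length, i.e. whether the size check before
// operator new[] rejects the requested length.
bool throws_bad_array_new_length(int x) {
    try {
        // The volatile pointer makes the allocation observable, so the
        // optimizer cannot drop the new[]/delete[] pair.
        int* volatile p = foo(x);
        delete[] p;
        return false;
    } catch (const std::bad_array_new_length&) {
        return true;
    }
}
```

With GCC, a negative count such as `throws_bad_array_new_length(-1)` should take the `ja .L2` branch shown above and report the exception, while a small positive count allocates normally. Older compilers may instead surface a plain `std::bad_alloc`, which this helper deliberately does not catch.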
Temporarily reopening this issue so that viable/strict upgrades can continue ignoring the status of this job for the next few hours. It can be closed once the viable/strict upgrade branch catches up to #148740 |
@ZainRizvi if your plan was to reopen it temporarily, it should have been closed. |
See https://hud.pytorch.org/hud/pytorch/pytorch/c677f3251f46b4bffdaa7758fb7102d665b6f11b/1?per_page=50&name_filter=%20libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug%20%2F%20build but the revert did not help.
Currently failing out with errors similar to:
cc @seemethere @pytorch/pytorch-dev-infra