CI: Compiler sanitizers tests are hanging intermittently · Issue #25875 · numpy/numpy · GitHub

Closed
charris opened this issue Feb 22, 2024 · 13 comments · Fixed by #26006 or #26295

Comments

@charris
Member
charris commented Feb 22, 2024

The actions label is gcc_sanitizers. All of the test runs show errors, some of which look valid, but none cause the test to fail. I have to wonder where the bogus values come from; are they byproducts of the sanitizer? See https://github.com/numpy/numpy/actions/runs/8008572322/job/21875289661 for examples.

I also note that the time was normally around 20 minutes, it is now well in excess of 2 hours. Something has changed.

@ngoldbaum
Member

I believe all of those errors are from UBSan and will not fail that job until #24209 is fixed.

"I also note that the time was normally around 20 minutes, it is now well in excess of 2 hours. Something has changed."

The job you linked to ran in 20 minutes. Do you have a job where it took hours to run?

@ngoldbaum
Member

Ah like this one: https://github.com/numpy/numpy/actions/runs/8008729929/job/21875802671

Yes, there's a heisenbug that crashes the test runner every so often. It only happens with the compiler sanitizers job and may be a bug in the GCC sanitizer implementation; I haven't been able to reproduce it with clang. It might also be a real issue.

@ngoldbaum ngoldbaum changed the title linux_compiler_sanitizers.yml shows multiple errors, but still passes. Compiler sanitizers tests are hanging intermittently Feb 22, 2024
@ngoldbaum ngoldbaum changed the title Compiler sanitizers tests are hanging intermittently CI: Compiler sanitizers tests are hanging intermittently Feb 22, 2024
@charris
Member Author
charris commented Feb 22, 2024

I cancelled that one, it wasn't about to finish any time soon.

EDIT: But was still running.

@ngoldbaum
Member

The crash is such that the test run doesn't actually end; it times out after six hours. I agree, not great!

@mattip
Member
mattip commented Mar 13, 2024

Reopening until we are sure the test is no longer hanging. It also seems there is a failure that is not picked up by the pytest mechanism:

numpy/_core/tests/test_api.py::test_copyto_fromscalar ../numpy/_core/src/multiarray/common.h:288:31: runtime error: load of misaligned address 0x6020000c7212 for type 'unsigned int', which requires 4 byte alignment
0x6020000c7212: note: pointer points here
 00 00  00 01 00 00 00 01 00 00  00 00 00 00 00 00 00 00  00 11 00 00 04 00 00 00  07 00 00 3c 00 00
              ^ 
PASSED
numpy/_core/tests/test_api.py::test_copyto PASSED
numpy/_core/tests/test_api.py::test_copyto_permut ../numpy/_core/src/multiarray/common.h:288:31: runtime error: load of misaligned address 0x6020001d4492 for type 'unsigned int', which requires 4 byte alignment
0x6020001d4492: note: pointer points here
 00 00  00 01 00 01 00 01 00 01  00 00 00 00 00 00 00 00  03 11 00 00 09 00 00 00  07 00 00 3c 00 00
              ^ 
PASSED

@mattip mattip reopened this Mar 13, 2024
@mattip
Member
mattip commented Mar 13, 2024

Actually, searching that log for "runtime error" shows many of them...
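A search like this can be scripted so the reports are grouped by source location rather than read line by line. The following is a minimal sketch, assuming the raw job log is available as text; the helper name `summarize_ubsan` and the regular expression are illustrative and not part of NumPy's CI tooling. The sample input is the excerpt quoted above with the memory dump lines left out.

```python
import re
from collections import Counter

# Matches UBSan reports of the form
#   <file>:<line>:<col>: runtime error: <message>
UBSAN_RE = re.compile(
    r"(?P<file>\S+):(?P<line>\d+):(?P<col>\d+): runtime error: (?P<msg>.*)"
)


def summarize_ubsan(log_text):
    """Count UBSan 'runtime error' reports per (file, line) location."""
    counts = Counter()
    for raw_line in log_text.splitlines():
        match = UBSAN_RE.search(raw_line)
        if match:
            counts[(match["file"], int(match["line"]))] += 1
    return counts


# Lines copied from the log excerpt quoted earlier in this thread.
sample_log = """\
numpy/_core/tests/test_api.py::test_copyto_fromscalar ../numpy/_core/src/multiarray/common.h:288:31: runtime error: load of misaligned address 0x6020000c7212 for type 'unsigned int', which requires 4 byte alignment
PASSED
numpy/_core/tests/test_api.py::test_copyto PASSED
numpy/_core/tests/test_api.py::test_copyto_permut ../numpy/_core/src/multiarray/common.h:288:31: runtime error: load of misaligned address 0x6020001d4492 for type 'unsigned int', which requires 4 byte alignment
PASSED
"""

summary = summarize_ubsan(sample_log)
for (path, line), n in summary.most_common():
    print(f"{n:4d}  {path}:{line}")
```

On the sample, both reports collapse into a single location in `common.h`, which is the kind of grouping that helps tell one recurring report from many distinct ones.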

@seberg
Member
seberg commented Mar 13, 2024

I think most of them are somewhat intentional, i.e. some code chooses to ignore alignment on platforms where we know that is OK (and probably better), but the sanitizers complain about it anyway.

Not sure what to do about those. Maybe those code paths were just optimizations from a time long past, and with a safe code path the compiler would generate fast code anyway.
(I am also OK to just ignore the issue, since unaligned arrays are pretty rare either way.)
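To make the alignment point concrete, here is a small Python sketch. It checks the two addresses from the UBSan reports quoted earlier in this thread against the 4-byte requirement, then reads a 32-bit value from an odd offset by copying bytes. The byte-copy read is only an analogy for a "safe code path" (in C this is commonly done with `memcpy`, which is an assumption here, not something stated in the thread), and none of this is NumPy's actual code.

```python
import struct

# Addresses reported by UBSan in the log excerpt quoted earlier in this thread.
reported = [0x6020000C7212, 0x6020001D4492]

# A 4-byte 'unsigned int' load is aligned only when the address is a multiple of 4.
misalignment = [addr % 4 for addr in reported]
for addr, off in zip(reported, misalignment):
    print(f"{addr:#x} is {off} bytes past a 4-byte boundary")

# Byte-copy read: struct's standard ("<") mode imposes no alignment, so reading
# at offset 2 works on any platform. The buffer is made up for the example.
buf = bytes(range(16))
value = struct.unpack_from("<I", buf, 2)[0]
print(f"value read at offset 2: {value:#010x}")
```

Both reported addresses sit 2 bytes past a 4-byte boundary, which is why the sanitizer flags the load even on platforms where the hardware tolerates it.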

@ngoldbaum
Member

Those are all UBSan errors that won’t fail the build until #24209 is fixed.

@mattip
Member
mattip commented Mar 13, 2024

Ahh, thanks, I missed that. I changed the title of #24209 so a search for sanitizer makes it more prominent.

@ngoldbaum
Member

I just looked at one of the recent failures. It looks like this test is failing in a new way, where if you look in the raw logs there are many many lines like:

2024-03-13T10:24:04.4788196Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4788381Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4788558Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4788738Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4788928Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4789110Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4789286Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4789470Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4789652Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4789834Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4790017Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4790193Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4790375Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4790560Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4790744Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4790933Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:04.4791116Z AddressSanitizer:DEADLYSIGNAL

I'm not sure why this is getting printed to stderr every 20 microseconds or so, and only on some test runs. It actually seems to start before the tests even begin executing:

2024-03-13T10:23:59.5807981Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
2024-03-13T10:23:59.5853061Z Downloading iniconfig-2.0.0-py3-none-any.whl (5.9 kB)
2024-03-13T10:23:59.7449512Z Installing collected packages: sortedcontainers, typing_extensions, pluggy, iniconfig, execnet, attrs, pytest, hypothesis, pytest-xdist
2024-03-13T10:24:00.3361522Z Successfully installed attrs-23.2.0 execnet-2.0.2 hypothesis-6.99.5 iniconfig-2.0.0 pluggy-1.4.0 pytest-8.1.1 pytest-xdist-3.5.0 sortedcontainers-2.4.0 typing_extensions-4.10.0
2024-03-13T10:24:00.6105621Z Invoking `build` prior to running tests:
2024-03-13T10:24:00.9137798Z $ /opt/hostedtoolcache/Python/3.11.8/x64/bin/python vendored-meson/meson/meson.py compile -C build
2024-03-13T10:24:00.9166244Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9167759Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9168550Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9169412Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9170111Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9170756Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9171404Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9175971Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9176791Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9177516Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9178080Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9178603Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9179115Z AddressSanitizer:DEADLYSIGNAL
2024-03-13T10:24:00.9179624Z AddressSanitizer:DEADLYSIGNAL
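The "every 20 microseconds or so" estimate can be checked against the timestamps in the first excerpt. The sketch below parses the first seven timestamps by hand, because they carry seven fractional digits (100 ns resolution) while Python's `datetime` keeps only six. The helper name `to_ticks` is illustrative.

```python
from datetime import datetime, timezone
from statistics import median

# First seven timestamps of the AddressSanitizer:DEADLYSIGNAL excerpt above.
stamps = [
    "2024-03-13T10:24:04.4788196Z",
    "2024-03-13T10:24:04.4788381Z",
    "2024-03-13T10:24:04.4788558Z",
    "2024-03-13T10:24:04.4788738Z",
    "2024-03-13T10:24:04.4788928Z",
    "2024-03-13T10:24:04.4789110Z",
    "2024-03-13T10:24:04.4789286Z",
]


def to_ticks(stamp):
    """Convert a log timestamp to integer 100 ns ticks since the Unix epoch.

    Assumes exactly seven fractional digits, as in the log lines above.
    """
    whole, frac = stamp.rstrip("Z").split(".")
    base = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return int(base.timestamp()) * 10_000_000 + int(frac)


ticks = [to_ticks(s) for s in stamps]
gaps_ticks = [b - a for a, b in zip(ticks, ticks[1:])]  # gaps in 100 ns units
median_us = median(gaps_ticks) / 10  # convert to microseconds
print(f"median gap: {median_us:.1f} microseconds over {len(gaps_ticks)} intervals")
```

For these seven lines the gaps come out at roughly 18 microseconds, consistent with the estimate. Note that these are the timestamps attached by the log collector, so they show how fast the lines arrived in the log, not necessarily how fast the process emitted them.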

I guess if this gets to be too annoying we can disable the tests. We could also look into using the clang sanitizers, which might be more stable than the gcc sanitizers.

@ngoldbaum
Member

It's now failing on every run in the same way. I still don't understand why this is happening, so I've manually disabled the workflow in the GitHub Actions settings.

I think if we build numpy with clang we should be able to use the clang sanitizers, which are generally better tested (Google uses them internally on all code).

@ngoldbaum
Member

Darn, here's one that's hanging with the clang sanitizers: https://github.com/numpy/numpy/actions/runs/8741299034/job/23986999441

@ngoldbaum
Member

I think this hasn't happened in a while since the switch to clang, so I'm closing.

5 participants