CI: Compiler sanitizers tests are hanging intermittently #25875
Comments
I believe all of those errors are from UBSan and do not fail that job until #24209 is fixed.
The job you linked to ran in 20 minutes. Do you have a job where it took hours to run?
Ah, like this one: https://github.com/numpy/numpy/actions/runs/8008729929/job/21875802671 Yes, there's a heisenbug that crashes the test runner every so often. It only happens with the compiler sanitizers job and may be a bug in the GCC sanitizer implementation; I haven't been able to reproduce it on clang. It might also be a real issue.
I cancelled that one; it wasn't about to finish any time soon. EDIT: But it was still running.
The crash is such that the test run never actually ends; it times out after six hours. I agree, not great!
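For reproducing a hang like this locally without waiting for a multi-hour CI timeout, a wrapper along the following lines may help. This is only a sketch: `run_with_timeout` is a hypothetical helper name, and the command and timeout you pass in are placeholders, not what the NumPy workflow actually runs.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return its exit code, or None if it exceeded the timeout."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired,
        # so a hung process does not linger after this returns.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Placeholder command: a child that sleeps longer than the timeout.
    result = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], 2
    )
    print("timed out" if result is None else f"exit code {result}")
```

Returning `None` on timeout keeps a hang distinguishable from an ordinary non-zero exit code.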
Reopening until we are sure the test is no longer hanging. It also seems there is a failure that is not picked up by the pytest mechanism.
Actually, searching that log for "runtime error" shows many of them...
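A search like that can be turned into a rough per-message tally, which makes it easier to see which reports dominate a log. The sketch below assumes the usual UBSan line shape of `file:line:col: runtime error: message`; `count_runtime_errors` is a hypothetical helper, and the sample log lines are invented for illustration rather than taken from the linked job.

```python
import re
from collections import Counter


def count_runtime_errors(log_text):
    """Tally 'runtime error' lines in a log, grouped by message text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = re.search(r"runtime error: (.*)", line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines, only to show the expected input shape.
    sample_log = "\n".join([
        "example.c:10:5: runtime error: signed integer overflow",
        "collected 3 items",
        "other.c:22:9: runtime error: signed integer overflow",
    ])
    for message, count in count_runtime_errors(sample_log).most_common():
        print(count, message)
```

Messages that embed addresses or values will not group together without extra normalization, which this sketch does not attempt.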
I think most of them are somewhat intentional, i.e. some code chooses to ignore alignment on platforms where we know that is OK (and probably better), but the sanitizers complain about it anyway. Not sure what to do about those. Maybe those code paths were just optimizations from a time long past, and with a safe code path the compiler would generate fast code anyway.
Those are all UBSan errors that won’t fail the build until #24209 is fixed.
Ahh, thanks, I missed that. I changed the title of #24209 so a search for sanitizer makes it more prominent.
I just looked at one of the recent failures. It looks like this test is failing in a new way, where if you look in the raw logs there are many many lines like:
I'm not sure why this is getting printed to stderr every 20 microseconds or so, and only on some test runs. It actually seems to start before the tests even begin executing:
I guess if this gets to be too annoying we can disable the tests. We could also look into using the clang sanitizers, which might be more stable than the gcc sanitizers.
It's now failing on every run in the same way. I still don't understand why this is happening, so I've manually disabled the workflow in the GitHub Actions settings. I think if we build numpy with clang we should be able to use the clang sanitizers, which are generally better tested (Google uses them internally on all code).
Darn, here's one that's hanging with the clang sanitizers: https://github.com/numpy/numpy/actions/runs/8741299034/job/23986999441
This hasn't happened in a while, I think, since the switch to clang, so I'm closing.
The actions label is gcc_sanitizers. All of the test runs show errors, some of which look valid, but none cause the test to fail. I have to wonder where the bogus values come from: are they byproducts of the sanitizer? See https://github.com/numpy/numpy/actions/runs/8008572322/job/21875289661 for examples. I also note that the run time was normally around 20 minutes; it is now well in excess of 2 hours. Something has changed.