BUG: Computer crash with numpy 1.23.4 and eigvals / eigvalsh on large matrices #22516
Comments
I will also note that the broken version seemed to happily compute eigenvalues for matrices smaller than roughly 1000x1000 but crashes consistently on anything larger. OS information:
Can you install threadpoolctl and call:
(or whatever python you are using), just to make sure about the BLAS versions. Also, since this is a fairly common source of issues: you are not using any virtualization/container layer?
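The exact command referenced above was lost in extraction; a plausible equivalent using threadpoolctl's Python API (an assumption, not necessarily the invocation requested) would be:

```python
# Sketch: report which BLAS/LAPACK libraries (and how many threads) this
# environment's numpy is actually using.
import pprint

import numpy  # noqa: F401  (importing numpy loads its bundled BLAS)
from threadpoolctl import threadpool_info

pprint.pprint(threadpool_info())
```

Running this once in each virtualenv makes it easy to compare the OpenBLAS versions and thread counts between the working and breaking installs.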
No containers. I just have two virtualenvs with the working/breaking versions of numpy. Breaking:
Working:
Thanks, two more things that might help narrow things down and that should be simple:
It would also be interesting to know whether upgrading OpenBLAS fixes it, but that is a bit trickier. EDIT: You could also run with
This is a remote machine and I just left it, so I'm approaching these in safest-first order. Both environment variables worked:
Restricting OMP_NUM_THREADS by itself worked.
Setting OPENBLAS_CORETYPE by itself worked.
gdb seems not to reveal anything; the computer becomes inaccessible via ssh (presumably crashed as before).
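For concreteness, a sketch of the workaround described above (the variables must be set before OpenBLAS is loaded, i.e. before importing numpy; the specific matrix size and values are illustrative only):

```python
# Sketch of the two workarounds discussed: both environment variables must be
# set before numpy (and hence OpenBLAS) is imported for them to take effect.
import os

os.environ["OMP_NUM_THREADS"] = "1"          # restrict OpenBLAS to one thread
os.environ["OPENBLAS_CORETYPE"] = "Haswell"  # avoid the AVX-512 kernels

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))
print(np.linalg.eigvals(a)[:5])
```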
Thanks, can you try the gdb again non-remotely? It would be strange if that actually fails. After that we probably have to move this to OpenBLAS, although I am hoping the backtrace will narrow things down a bit more.
Strange indeed. gdb gives no info; the computer simply turns off. Filmed here.
Well, that is rather odd; it didn't even occur to me that it crashes the whole computer. Now I am wondering if there is a chance of a hardware issue?
I have stress tested the computer with a few GPU and CPU things. I've also loaded the RAM up to about 90% without a problem. The only thing that seems to crash it is this specific numpy/BLAS call. If there's a C program which replicates what numpy is doing under the hood, I can try running that to see if it also crashes.
Not ruling out hardware; it's just hard to see how it would come down to such a specific operation.
Testing numpy on the broken build gives:
So no other major errors, apparently.
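For reference, the test suite can be invoked from within Python via numpy.test(); a sketch only, since the exact invocation used above is not shown (it also requires pytest and hypothesis to be installed):

```python
# Sketch: numpy.test() defaults to the "fast" subset of tests;
# label="full" includes the slow tests as well.
import numpy

numpy.test(label="full", verbose=2)
```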
Looks like that only ran the slow tests; you should probably just do the opposite. Writing it in C is definitely possible, the code starts here: numpy/numpy/linalg/umath_linalg.cpp Line 2403 in 1d4cb4b
and you can ignore everything about "outer strides" and the for loop in there. I think "linearize matrix" is also just there to copy things in case the input is not contiguous (and thus does nothing as well). Not sure I am up for detangling it into a clean C reproducer right now.
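Not the C reproducer being discussed, but as a hedged illustration of roughly the same code path from Python: np.linalg.eigvals ultimately calls LAPACK's *geev driver, which SciPy also exposes directly (note that a pip-installed SciPy may bundle a different OpenBLAS than numpy does, so this is a sketch rather than an exact equivalent):

```python
# Hedged sketch: call LAPACK dgeev directly (eigenvalues only), bypassing
# numpy's umath_linalg wrapper.
import numpy as np
from scipy.linalg.lapack import dgeev

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))

wr, wi, vl, vr, info = dgeev(a, compute_vl=0, compute_vr=0)
assert info == 0
print(wr[:5] + 1j * wi[:5])
```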
I'll try to get a minimal code snippet with versions to reproduce; I'm worried that since it appears thread-dependent, it may be hard.
Thanks, there is always a good chance that the next OpenBLAS release fixes it. IIRC, the scikit-image folks had some similar issues in CI and they went away with an upgrade (but I think that was an older NumPy version). The threading going on here is within OpenBLAS, so that shouldn't matter. Let's ping @martin-frbg: thanks for taking a look, just in case you know something off the top of your head about issues in eigvals/eigvalsh for larger matrices.
Computer shutting down or terminating the ssh session could be a sign of running out of memory (or perhaps stack - does
Computer still shuts down. |
Weird. If it works with coretype set to Haswell, the issue is probably related to AVX512 operations. But how they could bring down the system is beyond me - if there was a cooling or power supply problem, I'd expect it to show up in the other stress tests as well. Does it work with OMP_NUM_THREADS set to values greater than 1, but smaller than the actual core count?
Setting cores to 1-3 works, crashes at 4. This is a 10 physical core machine (20 logical). |
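One way to script that thread-count sweep without restarting Python each time is threadpoolctl's threadpool_limits (a sketch, under the assumption that capping the BLAS pool at runtime behaves the same as setting OMP_NUM_THREADS up front):

```python
# Sketch: vary the OpenBLAS thread count at runtime and run the failing call.
import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))

for n in (1, 2, 3, 4, 8):
    print("threads:", n, flush=True)
    with threadpool_limits(limits=n, user_api="blas"):
        np.linalg.eigvals(a)
    print("  ok")
```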
I see above that you have tried other things.
Do you have access to the system logs?
To perform the experiment I crashed the computer at a specific time (16:45). Searching through /var/log for instances of the string 16:45 yielded few results, none of which seem relevant, but they are included below anyway.
mhh, so it's not shutting down but just (s)hell freezing over? I'm running out of ideas for why it would go straight from running fine to killing the entire system between 3 and 4 threads without crashing just the test program or returning garbage results first.
Yeah - this seems sufficiently strange that it may be too much of a time sink to really debug. I can try a few things and update here periodically if you don't mind keeping the bug open, but as long as I have a workaround and this doesn't appear to affect other people, I don't mind this being lowest priority. Things I may test:
Maybe the OOM killer is kicking out the sshd process instead of the numpy process, but I would expect that to show up in the system logs. Have you tried severely limiting the resources available with ulimit to see if you can crash just the numpy process?
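A sketch of one way to do that limiting from within Python, using the resource module rather than the shell's ulimit (the 8 GiB cap is an arbitrary example value, not something suggested in the thread):

```python
# Sketch: cap the address space so a genuine out-of-memory condition kills
# only this Python process (MemoryError or SIGKILL) rather than the machine.
import resource

limit = 8 * 1024**3  # 8 GiB, arbitrary; pick something well below physical RAM
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))
print(np.linalg.eigvals(a)[:5])
```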
Update: the crash also happens with fewer than 4 threads, it's just less consistent. So far it's been 100% of runs with 4 threads, but only about 33% of runs with 3 threads crash. Maybe some race condition? I want to check whether it's a property of the random matrix which causes this 33% chance or whether it's independent of the matrix.
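A sketch of that check: re-run one fixed matrix many times, then repeat with a fresh matrix each iteration (assuming the crash, when it occurs, shows up within a modest number of runs):

```python
# Sketch: distinguish "bad matrix" from "race condition" by looping over the
# same fixed input and then over fresh random inputs.
import numpy as np

rng = np.random.default_rng(12345)
fixed = rng.standard_normal((2000, 2000))

for i in range(20):
    np.linalg.eigvals(fixed)            # identical input every iteration
    print("fixed run", i, "ok", flush=True)

for i in range(20):
    fresh = rng.standard_normal((2000, 2000))
    np.linalg.eigvals(fresh)            # new input every iteration
    print("fresh run", i, "ok", flush=True)
```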
Ah, sorry, the red LED doesn't always happen; in this attempt the computer just froze and then rebooted without the CPU error indicator.
Hmm, a race condition should never crash the system - garbage results or, at worst, a hanging program would be a lot more likely. Maybe it is a (subtly) broken CPU after all, and/or OpenBLAS's multithreaded kernels hammering the AVX512 units harder than other code would.
Just had the opportunity to check, and this is still a problem on the latest microcode. I'm not sure when I'll have a chance to update the BIOS, but that feels like a stretch anyway.
Agree that the BIOS is much less likely than the microcode. Still, this does not have me rushing to buy an i9-7900X when I have not heard of similar problems from anybody else.
Describe the issue:
My entire computer crashes when computing eigenvalues on the pip precompiled numpy but not the source/compiled version.
The version of LAPACK/BLAS differs between the two: the pip version was multithreaded, but the locally compiled version was not.
The matrices were not large enough to cause memory pressure on this machine. I have used the same numpy version on different machines without any issue, so I am including CPU information as well:
Broken on: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
It worked on a Ryzen 5900X, however.
Reproduce the code example:
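The original snippet was lost in extraction; below is a plausible reconstruction consistent with the report (eigvals/eigvalsh on a random matrix larger than roughly 1000x1000), not the reporter's exact code:

```python
# Reconstruction: any sufficiently large matrix appears to trigger the crash
# on the affected machine with the pip-installed numpy 1.23.4.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))

vals = np.linalg.eigvals(a)             # crashes the machine on the broken setup
sym_vals = np.linalg.eigvalsh(a + a.T)  # eigvalsh reportedly crashes as well
print(vals[:3], sym_vals[:3])
```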
Error message:
NumPy/Python version information:
Broken:
Working version on the same machine:
Context for the issue:
A lot of my work involves getting eigenvalues of matrices, but I do have a fallback which works.