BUG: Computer crash with numpy 1.23.4 and eigvals / eigvalsh on large matrices · Issue #22516 · numpy/numpy · GitHub

Open
Renmusxd opened this issue Nov 2, 2022 · 29 comments

@Renmusxd
Renmusxd commented Nov 2, 2022

Describe the issue:

My entire computer crashes when computing eigenvalues with the pip-precompiled numpy, but not with a locally compiled version.
The LAPACK/BLAS backend differs between the two: the pip version is multithreaded, while the locally compiled version is not.
The matrices are not large enough to cause memory pressure on this machine. I have used the same numpy version on other machines without any issue, so I am including CPU information as well:
broken on: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
It worked on a Ryzen 5900X, however.

Reproduce the code example:

import numpy as np
np.linalg.eigvals(np.random.random((1000,1000)))

Error message:

N/A: the computer crashes and no logs are available, as far as I know.
Help getting relevant logs would be appreciated.
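
For reference, a sketch of places where logs from a hard crash can sometimes be recovered on Ubuntu. This assumes systemd-journald; the persistent-journal setup step and the pstore path may not apply to every machine:

```shell
# One-time setup (requires root): make the journal persistent so kernel
# messages survive a reboot.
#   sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald
# After the next crash and reboot, list boots and read the previous boot's
# kernel log with `journalctl -b -1 -k`:
command -v journalctl >/dev/null && journalctl --list-boots | tail -n 3 || true
# Firmware-backed crash dumps, if the platform supports pstore:
[ -d /sys/fs/pstore ] && ls /sys/fs/pstore || true
```

If the machine powers off before anything is flushed to disk, even these may come up empty.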

NumPy/Python version information:

Broken

>>> import sys, numpy; print(numpy.__version__, sys.version)
1.23.4 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]

>>> numpy.show_config()
openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CLX,AVX512_CNL,AVX512_ICL

Working version on same machine

>>> import sys, numpy; print(numpy.__version__, sys.version)
1.23.4 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]
>>> numpy.show_config()
blas_armpl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
accelerate_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
    language = c
    define_macros = [('HAVE_CBLAS', None), ('NO_ATLAS_INFO', -1)]
    libraries = ['f77blas', 'cblas', 'atlas', 'f77blas', 'cblas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
blas_opt_info:
    language = c
    define_macros = [('HAVE_CBLAS', None), ('NO_ATLAS_INFO', -1)]
    libraries = ['f77blas', 'cblas', 'atlas', 'f77blas', 'cblas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
lapack_armpl_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
    language = f77
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas', 'f77blas', 'cblas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    define_macros = [('NO_ATLAS_INFO', -1)]
lapack_opt_info:
    language = f77
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas', 'f77blas', 'cblas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    define_macros = [('NO_ATLAS_INFO', -1)]
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CLX,AVX512_CNL,AVX512_ICL

Context for the issue:

A lot of my work involves getting eigenvalues of matrices, but I do have a fallback which works.

@Renmusxd
Author
Renmusxd commented Nov 2, 2022

I will also note that the broken version seemed to happily compute eigenvalues for matrices smaller than ~1000x1000, but crashes consistently on anything larger.

OS information:

Operating System: Ubuntu 22.04.1 LTS              
Kernel: Linux 5.15.0-52-generic
Architecture: x86-64

@seberg
Member
seberg commented Nov 2, 2022

Can you install threadpoolctl and call:

python -m threadpoolctl -i numpy

(or whatever python you are using), just to confirm the BLAS versions. Also, since it is a fairly common source of issues: you are not using any virtualization/container layer?
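
For anyone following along, the same information can be pulled from inside Python; a hedged sketch that assumes threadpoolctl is installed (it degrades to a no-op otherwise):

```python
try:
    from threadpoolctl import threadpool_info  # assumed installed
except ImportError:
    threadpool_info = None

import numpy as np  # importing numpy loads its bundled BLAS, if any

pools = threadpool_info() if threadpool_info is not None else None
if pools:
    for pool in pools:
        # e.g. 'openblas', '0.3.20', 20 for the pip wheel described above
        print(pool.get("internal_api"), pool.get("version"), pool.get("num_threads"))
```

An empty list, as in the "working" virtualenv below, just means no recognized BLAS thread pool was detected.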

@Renmusxd
Author
Renmusxd commented Nov 2, 2022

No containers. I just have two virtualenvs with the working/breaking versions of numpy.

Breaking:

➜  ~ python -m threadpoolctl -i numpy
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/home/sumner/.virtualenvs/unitary/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so",
    "version": "0.3.20",
    "threading_layer": "pthreads",
    "architecture": "SkylakeX",
    "num_threads": 20
  }
]

Working

➜  ~ python -m threadpoolctl -i numpy
[]

@seberg
Member
seberg commented Nov 2, 2022

Thanks, two more things that might help narrow things down and that should be simple:

  • run with OMP_NUM_THREADS=1 (environment variable, can use export OMP_NUM_THREADS=1 also)
  • run with OPENBLAS_CORETYPE=Haswell

It would also be interesting if upgrading openblas would fix it, but that is a bit trickier.

EDIT: You could also run with gdb --args python then type r (press enter) and then bt to see if the gdb backtrace has anything interesting for the crash. Not holding my breath, but at least we can see where in openblas things go bad.
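
A sketch of applying those two knobs from inside a script rather than the shell. Note that OpenBLAS reads both variables once at library load time, so they must be set before numpy is first imported (the Haswell coretype is taken from the suggestion above; it sidesteps the AVX512 kernels):

```python
import os

# Must be set BEFORE numpy (and hence OpenBLAS) is imported:
# OpenBLAS reads these once when the shared library is loaded.
os.environ["OMP_NUM_THREADS"] = "1"           # single BLAS thread
os.environ["OPENBLAS_CORETYPE"] = "Haswell"   # avoid the AVX512 kernels

import numpy as np

w = np.linalg.eigvals(np.random.random((1000, 1000)))
print(w.shape)
```

Setting the variables after numpy is imported has no effect, which is an easy way to get a misleading negative result.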

@Renmusxd
Author
Renmusxd commented Nov 2, 2022

This is a remote machine and I have just left it, so I'm approaching these in safest-first order.

Both environment variables worked.

➜  ~ export OMP_NUM_THREADS=1
➜  ~ export OPENBLAS_CORETYPE=Haswell
➜  ~ python
Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.linalg.eigvals(np.random.random((1000,1000)))
array([ 5.00275497e+02+0.j        , -9.00694947e+00+1.89576205j,
       -9.00694947e+00-1.89576205j, -9.15084972e+00+0.j        ,
       -8.26777682e+00+3.78122657j, -8.26777682e+00-3.78122657j, ...

Restricting OMP by itself worked

➜  ~ export OMP_NUM_THREADS=1                  
➜  ~ python                                    
Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.linalg.eigvals(np.random.random((1000,1000)))
array([ 5.00598735e+02+0.j        ,  9.16729605e+00+0.95796487j, ...

Setting CORETYPE by itself worked

➜  ~ export OPENBLAS_CORETYPE=Haswell          
➜  ~ python                                    
Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.linalg.eigvals(np.random.random((1000,1000)))
array([ 5.00541614e+02+0.j        , -2.98517771e+00+8.85454335j, ...

gdb does not seem to reveal anything; the computer becomes inaccessible via ssh (presumably crashed as before).

➜  ~ gdb --args python
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)
(gdb) r
Starting program: /home/sumner/.virtualenvs/unitary/bin/python 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
[New Thread 0x7ffff3fff640 (LWP 8045)]
[New Thread 0x7ffff37fe640 (LWP 8046)]
[New Thread 0x7fffeeffd640 (LWP 8047)]
[New Thread 0x7fffec7fc640 (LWP 8048)]
[New Thread 0x7fffe9ffb640 (LWP 8049)]
[New Thread 0x7fffe77fa640 (LWP 8050)]
[New Thread 0x7fffe4ff9640 (LWP 8051)]
[New Thread 0x7fffe27f8640 (LWP 8052)]
[New Thread 0x7fffdfff7640 (LWP 8053)]
[New Thread 0x7fffdd7f6640 (LWP 8054)]
[New Thread 0x7fffdaff5640 (LWP 8055)]
[New Thread 0x7fffd87f4640 (LWP 8056)]
[New Thread 0x7fffd5ff3640 (LWP 8057)]
[New Thread 0x7fffd37f2640 (LWP 8058)]
[New Thread 0x7fffd0ff1640 (LWP 8059)]
[New Thread 0x7fffce7f0640 (LWP 8060)]
[New Thread 0x7fffcbfef640 (LWP 8061)]
[New Thread 0x7fffc97ee640 (LWP 8062)]
[New Thread 0x7fffc6fed640 (LWP 8063)]
>>> np.linalg.eigvals(np.random.random((1000,1000)))

@seberg
Member
seberg commented Nov 3, 2022

Thanks, can you try the gdb run again non-remotely? It would be strange if that actually fails. After that, we probably have to move it to the OpenBLAS tracker, although I am hoping the backtrace will narrow things down a bit more.
The main function on the blas/lapack side should be dgeev for eigvals (dsyevd would be the eigvalsh path), though.

@Renmusxd
Author
Renmusxd commented Nov 3, 2022

Strange indeed. gdb gives no info; the computer simply turns off. Filmed here.

@seberg
Member
seberg commented Nov 3, 2022

Well, that is rather odd; it didn't even occur to me that it crashes the whole computer. Now I am wondering if there is a chance of a hardware issue?
Even a terrible bug in openblas shouldn't be able to take down the whole system.

@Renmusxd
Author
Renmusxd commented Nov 3, 2022

I have stress tested the computer with a few GPU and CPU workloads. I've also loaded the RAM up to about 90% without a problem. The only thing that seems to crash it is this specific numpy/BLAS call. If there's a C program which replicates what numpy is doing under the hood, I can try running it to see if it also crashes.

@Renmusxd
Author
Renmusxd commented Nov 3, 2022

Not ruling out hardware; it's just hard to see how it would come down to such a specific operation.

@Renmusxd
Author
Renmusxd commented Nov 3, 2022

Testing numpy on the broken build gives

>>> numpy.test('slow', verbose=2)
...
1137 passed, 166 skipped, 19380 deselected, 1 xfailed, 40 warnings in 370.70s (0:06:10)

So no other major errors apparently.

@seberg
Member
seberg commented Nov 3, 2022

That looks like it ran only the slow tests; you should probably do the opposite. Writing it in C is definitely possible; the code starts here:

if (init_geev(&geev_params,

and you can ignore everything about "outer strides" and the for loop in there. I think "linearize matrix" is also just there to copy things in case the input is not contiguous (and thus does nothing here as well).

Not sure I am up for untangling it into a clean C reproducer right now.
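
As a middle ground before a full C program, the same LAPACK routine can be driven directly through SciPy's LAPACK bindings, bypassing numpy's linalg wrapper entirely. A hedged sketch: scipy is an assumed extra dependency here, it links its own BLAS build (so this exercises a possibly different OpenBLAS than the numpy wheel), and on the affected machine it may crash the box just like the numpy call:

```python
import numpy as np

try:
    from scipy.linalg.lapack import dgeev  # scipy assumed installed
except ImportError:
    dgeev = None

info = None
if dgeev is not None:
    a = np.random.random((1000, 1000))
    # Eigenvalues only: skip computing left/right eigenvectors.
    wr, wi, vl, vr, info = dgeev(a, compute_vl=0, compute_vr=0)
    print("dgeev info:", info)  # 0 means the routine converged
```

If this crashes too, the problem is squarely on the OpenBLAS/LAPACK side rather than in numpy's wrapper code.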

@Renmusxd
Author
Renmusxd commented Nov 7, 2022

I'll try to get a minimal code snippet with versions to reproduce; I'm worried that, since it appears thread-dependent, it may be hard.

@seberg
Member
seberg commented Nov 8, 2022

Thanks, there is always a good chance that the next openblas release fixes it. IIRC, the scikit-image folks had some potentially similar issues in CI and they went away with an upgrade (but I think that was an older NumPy version).

The threading going on here is within openblas, so that shouldn't matter. Let's ping @martin-frbg: thanks for taking a look, in case you know something off the top of your head about issues in eigvals/eigvalsh for larger matrices.

@martin-frbg

The computer shutting down or terminating the ssh session could be a sign of running out of memory (or perhaps stack; does ulimit -s unlimited help?)

@Renmusxd
Author
Renmusxd commented Nov 9, 2022

Computer still shuts down.

@martin-frbg

Weird. If it works with the coretype set to Haswell, the issue is probably related to AVX512 operations. But how those could bring down the system is beyond me; if there were a cooling or power-supply problem, I'd expect it to show up in the other stress tests as well. Does it work with OMP_NUM_THREADS set to values greater than 1 but smaller than the actual core count?

@Renmusxd
Author
Renmusxd commented Nov 9, 2022

Setting the thread count to 1-3 works; it crashes at 4. This is a 10-physical-core machine (20 logical).

@mattip
Member
mattip commented Nov 9, 2022

It sounds like a hardware problem. Have you tried any other stress tests?

I see above that you have tried other things.

@martin-frbg

Do you have access to the system logs?

@Renmusxd
Author
Renmusxd commented Nov 9, 2022

To perform the experiment I crashed the computer at a specific time (16:45). Searching through /var/log for instances of the string 16:45 yielded few results, none of which seem relevant, but they are included below because why not.

➜  ~ sudo /home/sumner/.cargo/bin/rg "16:45" /var/log
/var/log/syslog
14150:Nov  9 16:45:18 mephistopheles gnome-shell[1683]: Failed to import DBusMenu, quicklists are not avaialble: Error: Requiring Dbusmenu, version none: Typelib file for namespace 'Dbusmenu' (any version) not found
14151:Nov  9 16:45:19 mephistopheles firefox_firefox.desktop[3553]: [2022-11-09T21:45:19Z ERROR viaduct::backend::ffi] Missing HTTP status
14152:Nov  9 16:45:19 mephistopheles firefox_firefox.desktop[3553]: message repeated 4 times: [ [2022-11-09T21:45:19Z ERROR viaduct::backend::ffi] Missing HTTP status]
14153:Nov  9 16:45:21 mephistopheles systemd[1465]: snap.firefox.firefox.35937b79-ac83-4079-935d-719a747f81ea.scope: Consumed 11min 22.381s CPU time.

mephistopheles is the hostname.
If there are other log files you believe I should check, please let me know; I figured /var/log would cover most cases.

@martin-frbg

Mhh, so it's not shutting down but just (s)hell freezing over? I'm running out of ideas for why it would go straight from running fine to killing the entire system between 3 and 4 threads, without crashing just the test program or returning garbage results first.

@Renmusxd
Author
Renmusxd commented Nov 9, 2022

Yeah, this seems sufficiently strange that it may be too much of a time sink to really debug. I can try a few things and update here periodically if you don't mind keeping the bug open; so long as I have a workaround and this doesn't appear to affect other people, I don't mind this being lowest priority.

Things I may test:

  1. See if there's a tradeoff between the number of threads and the size of the matrix (may suggest something memory-related?)
  2. See if I can get a crash with a single thread under any conditions; if so, I can step through the binary over a few hours and try to isolate a specific failing instruction
  3. Get a minimal broken version calling OpenBLAS directly (also easier if a single thread can crash it)
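
Item 1 of that plan can be sketched with threadpoolctl, which caps the BLAS thread pool per call without restarting the interpreter. The thread counts and sizes below are illustrative brackets around the thresholds observed in this thread; threadpoolctl is an assumed dependency, and on the affected machine this sweep may well crash the box, so save everything first:

```python
import numpy as np

try:
    from threadpoolctl import threadpool_limits  # assumed installed
except ImportError:
    threadpool_limits = None

results = {}
if threadpool_limits is not None:
    for n_threads in (1, 4):       # bracket the observed 3-vs-4 threshold
        for size in (500, 1000):   # bracket the observed ~1000x1000 threshold
            # Cap only the BLAS thread pool for this one call.
            with threadpool_limits(limits=n_threads, user_api="blas"):
                np.linalg.eigvals(np.random.random((size, size)))
            results[(n_threads, size)] = "ok"  # reached only if no crash
            print(n_threads, size, "ok")
```

The last (threads, size) pair printed before a crash would localize the threshold.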

@mattip
Member
mattip commented Nov 9, 2022

Maybe the OOM killer is kicking out the sshd process instead of the numpy process, but I would expect that to show up in the system logs. Have you tried severely limiting the available resources with ulimit to see if you can crash just the numpy process?

@Renmusxd
Author

Update: the crash happens with fewer than 4 threads too; it's just less consistent. So far it's been 100% of runs with 4 threads, but only about 33% of runs with 3 threads crash. Maybe some race condition?
And it's a full crash: the motherboard lights up the red "cpu error" LED. I've only ever seen that for hardware problems, but it generally happens on boot, not from some specific operation.

I want to check whether it's a property of the random matrix that causes this 33% chance, or if it's independent of the matrix.

@Renmusxd
Author

Ah, sorry: the red LED doesn't always happen; in this attempt the computer just froze and then rebooted without the CPU error indicator.
Still no logs from the time of the crash.

@martin-frbg

Hmm, a race condition should never crash the system; garbage results or, at worst, a hanging program would be a lot more likely. Maybe it is a (subtly) broken CPU after all, and/or OpenBLAS's multithreaded kernels hammer the AVX512 units harder than other code would.
One other idea: does this system run the latest available BIOS and CPU microcode?

@Renmusxd
Author

Just had the opportunity to check, and this is still a problem on the latest microcode. I'm not sure when I'll have a chance to update the BIOS, but that feels like a stretch anyway.

@martin-frbg

Agreed that the BIOS is much less likely than the microcode. Still, this does not have me rushing to buy an i9-7900X when I have not heard of similar problems from anybody else.

4 participants