BUG: SIGABRT on using ThreadPoolExecutor with linalg.eigvalsh in v1.26.0b1 #24512
Comments
Thanks for whittling it down to a minimal reproducer. It sounds like something to do with OpenBLAS threading.

Workarounds

Can you see how many threads are open when you use your worker pool? I think you might be getting 8 for each worker, and each of those threads allocates a working buffer. I see OpenBLAS wants to use 8 threads on your machine. Could you control this with either threadpoolctl or via setting the OPENBLAS_NUM_THREADS environment variable?

Further analysis

@martin-frbg, do you know of anything that might have caused a regression between 0.3.23 and the 0.3.23.dev snapshot shipped in the wheels?
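For illustration (a sketch, not from the original thread), the threadpoolctl route would cap the BLAS thread pool around the offending calls:

```python
# Sketch of the threadpoolctl workaround suggested above: limit every
# BLAS library loaded in this process to a single thread for the
# duration of the `with` block.
import numpy as np
from threadpoolctl import threadpool_limits

a = np.eye(8)
with threadpool_limits(limits=1, user_api="blas"):
    print(np.linalg.eigvalsh(a))
```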
Not immediately aware of anything that could have caused this (btw. I use the Milestone feature of gh to track non-trivial changes for the next release). Will a build from source automatically pull in 0.3.23 on whatever platform? Incidentally, INFO=8 from DSYEVD means "your work array is too small".
A build from source will pull in whatever is on the platform via pkg-config. The wheel builds download and provision a specific version to be available via pkg-config before building.
I can reproduce this, but only when I deliberately build libopenblas for a smaller NUM_THREADS than actually present in the target system. Are you building the "experimental" c2f4bdbb with the exact same parameters that the previously used OpenBLAS binary was built with, especially NUM_THREADS (which defaults to the number of cores in the build host)?
Hmm. Nothing changed in the build scripts since 0.3.23. But calling …

I see this line in the Windows build script and this in the POSIX one, so it does seem we are setting NUM_THREADS.
Thanks for the quick feedback!

In the failing environment with pre-built NumPy 1.26.0b1:

```
$ python -m threadpoolctl -i numpy
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "num_threads": 8,
    "prefix": "libopenblas",
    "filepath": "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so",
    "version": "0.3.23.dev",
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  }
]
```

I get the same output in the environment with the passing self-built NumPy, except for the version field, which is "0.3.23" instead of "0.3.23.dev".
Not sure if I am doing it wrong, but `OPENBLAS_NUM_THREADS=1 python local/debug-pr7101.py` seems to have no impact, regardless of the values 0, 1, 2, 4, 8, 16. The error persists.
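(A general note, not necessarily what happened here: OpenBLAS reads OPENBLAS_NUM_THREADS only when it initializes, so when setting it from inside Python rather than the shell, it has to be set before NumPy is first imported.)

```python
# Sketch: the variable must be in the environment before OpenBLAS is
# loaded, i.e. before the first `import numpy`, to have any effect.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # OpenBLAS initializes here and reads the variable
print(np.linalg.eigvalsh(np.eye(4)))
```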
Hm, strange. And if numpy is really running OpenBLAS with just 8 threads instead of "all it can get", you should be safely below either of the two platform-specific compile-time limits Matti mentioned. Maybe I just tried too hard to create a failing configuration of OpenBLAS and the actual problem is elsewhere? (BTW it is still a bit unclear to me from the logs scattered across multiple issue tickets what your hardware and operating system is. I see Windows mentioned, but the quoted paths look unixoid?)
Sorry, I may have muddied the waters by mentioning Windows. The system under test is Linux + Python 3.11, as can be seen by opening the "Pre-built v1.26.0b1" details subsection above.
Yes.

```python
import sys; print(sys.version)
import platform; print(platform.platform())
# 3.11.3 (main, Jun 5 2023, 09:32:32) [GCC 13.1.1 20230429]
# Linux-6.4.11-arch2-1-x86_64-with-glibc2.38
```

You can also see this in action on our CI in scikit-image/scikit-image#7101. It fails on linux-cp3.11-pre, but also on the Windows job "Default Python311-x64-pre".
Thanks. With a bit of patience, the failures are also reproducible with the 0.3.23 release (and the build-time NUM_THREADS set to 24 on 4-core hardware). So at least no recent regression in OpenBLAS, and it seems to me that some of the "double free"/"invalid pointer" messages are generated before OpenBLAS gets initialized - at least before a DYNAMIC_ARCH build announces (with OPENBLAS_VERBOSE=2) which CPU it has detected.
gdb backtraces lead back to a free() in NumPy's …

I seem to recall seeing …
Possible, though right now I am not even sure that OpenBLAS' DSYEVD is ever reached. (Unless the python/numpy environment catches any write to stdout from Fortran code.)
This may be a case where we should do the work of separating NumPy vs. OpenBLAS by comparing with Netlib, and creating a pure C or Fortran reproducer for OpenBLAS if we do determine it's specific to OpenBLAS and not Netlib. Cc @steppi, who is working on streamlining that process as much as possible.
On it!
Thanks. All I can say so far is that I see no evidence (neither from print statements added to the code nor from gdb breakpoints) that OpenBLAS' implementations of DSYEVD and XERBLA are ever entered in the sequence that leads to the LAPACK-like error message, and libopenblas does not feature in any gdb backtrace.

... and I see the exact same (mis)behaviour when I replace NumPy's libopenblas with 0.3.21, or 0.3.15.
I'm seeing the same misbehavior with Netlib reference BLAS as well. I'm using FlexiBLAS to swap out BLAS versions, and was able to replicate by building numpy from the branch …
I've identified that numpy is calling into the non-thread-safe lapack_lite rather than the LAPACK seen in …, and only when building with meson; the example passes with

```
python setup.py build_ext --inplace -j 4
```

I'm building with the following (one needs to set up flexiblas for this to work), and have reproduced on …:

```
spin build --clean -- -Dblas=flexiblas -Dlapack=flexiblas
```

Below are some details of what I observed during the debugging process. Within …
Of the three behaviors seen, things work correctly when one thread completes all of its work before the other. One sees the …

That this lack of thread safety appears when numpy claims to be using either OpenBLAS or Netlib BLAS, whose implementations are independent, points away from any particular BLAS library. I'll keep looking into this to see what could be going wrong in the meson build.
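For readers following along, a sketch of why this was hard to spot: the usual inspection tools report the external BLAS/LAPACK that NumPy is linked against, even though in this bug the bundled lapack_lite was doing the actual work.

```python
# Both calls below are real, documented tools; note that neither would
# have revealed the lapack_lite fallback described in this bug.
import numpy as np

np.show_config()  # build-time BLAS/LAPACK configuration

from threadpoolctl import threadpool_info  # pip install threadpoolctl
print(threadpool_info())  # BLAS/OpenMP libraries actually loaded
```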
@rgommers found the issue in https://github.com/numpy/numpy/blob/main/numpy/linalg/meson.build. The …
Closes numpygh-24512, where `linalg.eigvalsh` was observed to be non-thread safe. This was due to the non-thread-safe `lapack_lite` being called instead of the installed BLAS/LAPACK.

Co-authored-by: Albert Steppi <albert.steppi@gmail.com>
Closes gh-24512, where `linalg.eigvalsh` was observed to be non-thread safe. This was due to the non-thread-safe `lapack_lite` being called instead of the installed BLAS/LAPACK.

Co-authored-by: Ralf Gommers <ralf.gommers@gmail.com>
Thanks to everyone for tackling this so quickly! 👍
Describe the issue:
In scikit-image, we have started to encounter unexpected crashes in `numpy.linalg.eigvalsh` when used via a `ThreadPoolExecutor` with NumPy 1.26.0b1. I have now managed to reduce the reproducing example from scikit-image/scikit-image#6970 (comment) to one only using NumPy (see below and also scikit-image/scikit-image#7101 (comment)). That's why I am reasonably confident that the error originates on NumPy's side.
Some additional observations:

- When NumPy is built in-place from source with `python setup.py build_ext --inplace -j 4`, the minimal example below passes.
- The example also passes with `num_workers=1`.

Reproduce the code example:
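The original snippet was lost here; a hypothetical reconstruction matching the description in this thread (a `ThreadPoolExecutor` with several workers calling `eigvalsh`; the matrix size, task count, and `num_workers` value are illustrative, not the original values) would be:

```python
# Hypothetical reconstruction of the lost reproducer described above.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

num_workers = 4
rng = np.random.default_rng(0)
# Pre-build random symmetric matrices in the main thread.
mats = [m + m.T for m in (rng.random((64, 64)) for _ in range(32))]

with ThreadPoolExecutor(max_workers=num_workers) as pool:
    # With the affected 1.26.0b1 wheels this reportedly aborts with
    # "free(): invalid pointer" most of the time.
    results = list(pool.map(np.linalg.eigvalsh, mats))

print("computed", len(results), "spectra")
```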
Error message:
The behavior is a bit erratic: most of the time I get the `free(): invalid pointer` SIGABRT, but sometimes the traceback below concerning the illegal value, and very rarely no error at all. This seems to depend a bit on the size of the passed array and the number of concurrent tasks?

Runtime information:
Pre-built v1.26.0b1
In-place build v1.26.0b1 from source
```
gdb --args python local/debug-pr7101.py
```

`debug-pr7101.py` contains the minimal example above.

Context for the issue:
This is currently blocking us from upgrading our dependency on NumPy to 1.26.0b1 for scikit-image in scikit-image/scikit-image#7101. It's been a very tricky thing to debug and I am a bit out of my depth now. :)