Occasional segfault with cacheing allocator and multiple threads · Issue #11942 · numpy/numpy · GitHub
Closed
bdrosen96 opened this issue Sep 13, 2018 · 15 comments
@bdrosen96 commented Sep 13, 2018

The use of the caching memory allocator in numpy with multiple threads seems to result in buggy behavior. It is not clear whether this is due to the caching allocator methods being invoked from places where the GIL is not held, or due to other issues. We noticed these issues when we upgraded from numpy 1.13.3 to 1.14.0, likely due to PRs such as #8920.
We also identified the following commits as potentially problematic:

git diff 1108252...5a08e209863038b411bc506893093c5a4b8377c9 --stat
 doc/source/reference/c-api.array.rst         |  8 ++++----
 numpy/core/src/multiarray/alloc.h            | 12 ++++++++++++
 numpy/core/src/multiarray/arrayobject.c      |  8 ++++----
 numpy/core/src/multiarray/arraytypes.c.src   |  8 ++++----
 numpy/core/src/multiarray/compiled_base.c    |  9 +++++----
 numpy/core/src/multiarray/conversion_utils.c |  5 +++--
 numpy/core/src/multiarray/descriptor.c       | 21 +++++++++++----------
 numpy/core/src/multiarray/dtype_transfer.c   | 23 ++++++++++++-----------
 numpy/core/src/multiarray/getset.c           | 11 ++++++-----
 numpy/core/src/multiarray/item_selection.c   | 31 ++++++++++++++++---------------
 numpy/core/src/multiarray/methods.c          | 17 +++++++++--------
 numpy/core/src/multiarray/multiarraymodule.c | 12 ++++++------
 numpy/core/src/multiarray/nditer_pywrap.c    |  9 +++++----
 numpy/core/src/multiarray/scalartypes.c.src  | 18 ++++++++----------
 numpy/core/src/multiarray/shape.c            |  3 ++-
 15 files changed, 107 insertions(+), 88 deletions(-)

This issue seems to still be present in the latest master.

For local testing we were able to resolve the issue by making the allocator caches into Thread Local Storage objects:

$ git diff
diff --git a/numpy/core/src/multiarray/alloc.c b/numpy/core/src/multiarray/alloc.c
index f8305d1..fa6b3c0 100644
--- a/numpy/core/src/multiarray/alloc.c
+++ b/numpy/core/src/multiarray/alloc.c
@@ -32,8 +32,8 @@ typedef struct {
     npy_uintp available; /* number of cached pointers */
     void * ptrs[NCACHE];
 } cache_bucket;
-static cache_bucket datacache[NBUCKETS];
-static cache_bucket dimcache[NBUCKETS_DIM];
+static __thread cache_bucket datacache[NBUCKETS];
+static __thread cache_bucket dimcache[NBUCKETS_DIM];
 
 /*

This solution is likely not viable for all supported platforms, as __thread may not be available on some of them. Rather, it served to verify that the allocator cache was being improperly accessed across threads.

Reproducing code example:

The following attachment contains what is needed to reproduce this problem. Unfortunately it does not occur every time, so the test may need to be run several times before it triggers. The number of cores or the speed of the machine may also affect how easily it reproduces. When it does occur, the result is either a segfault or an IndexError.

gh8920.tar.gz

Error message:

Numpy/Python version information:

@charris (Member) commented Sep 13, 2018

@juliantaylor Thoughts?

@charris added the "00 - Bug" and "Priority: high" labels Sep 13, 2018
@pprett commented Sep 14, 2018

There was a bug with joblib in the repro case above. I updated it:
gh8920-14092018.tar.gz

@bdrosen96 (Author)

I also tried using the following diff to gather information about whether the GIL was held or not:

diff --git a/numpy/core/src/multiarray/alloc.c b/numpy/core/src/multiarray/alloc.c
index f8305d1..1d774c8 100644
--- a/numpy/core/src/multiarray/alloc.c
+++ b/numpy/core/src/multiarray/alloc.c
@@ -35,6 +35,47 @@ typedef struct {
 static cache_bucket datacache[NBUCKETS];
 static cache_bucket dimcache[NBUCKETS_DIM];
 
+#include <Python.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <execinfo.h>
+
+extern PyThreadState * _PyThreadState_Current;
+
+int PyGILState_Check2()
+{
+  PyThreadState * tstate = _PyThreadState_Current;
+  return tstate && (tstate == PyGILState_GetThisThreadState());
+}
+
+void
+print_trace (void)
+{
+  void *array[10];
+  size_t size;
+  char **strings;
+  size_t i;
+
+  size = backtrace (array, 10);
+  strings = backtrace_symbols (array, size);
+
+  printf ("Obtained %zd stack frames.\n", size);
+
+  for (i = 0; i < size; i++)
+     printf ("%s\n", strings[i]);
+
+  free (strings);
+}
+
+void assert_has_gil()
+{
+    int has_gil = PyGILState_Check2();
+    //assert(has_gil == 1);
+    if (!has_gil) {
+        printf("Missing GIL\n");
+        print_trace();
+    }
+}
 /*
  * very simplistic small memory block cache to avoid more expensive libc
  * allocations
@@ -45,6 +86,7 @@ static NPY_INLINE void *
 _npy_alloc_cache(npy_uintp nelem, npy_uintp esz, npy_uint msz,
                  cache_bucket * cache, void * (*alloc)(size_t))
 {
+    assert_has_gil();
     assert((esz == 1 && cache == datacache) ||
            (esz == sizeof(npy_intp) && cache == dimcache));
     if (nelem < msz) {
@@ -75,6 +117,7 @@ static NPY_INLINE void
 _npy_free_cache(void * p, npy_uintp nelem, npy_uint msz,
                 cache_bucket * cache, void (*dealloc)(void *))
 {
+    assert_has_gil();
     if (p != NULL && nelem < msz) {
         if (cache[nelem].available < NCACHE) {
             cache[nelem].ptrs[cache[nelem].available++] = p;

@charris (Member) commented Sep 15, 2018

@bdrosen96 Were you able to find any trouble spots?

@bdrosen96 (Author)

Yes. For example:

#2 0x00007ffff592526a in assert_has_gil ()
at numpy/core/src/multiarray/alloc.c:79
#3 0x00007ffff59255e3 in _npy_free_cache (
dealloc=0x7ffff5925570 <PyDataMem_FREE>, cache=0x7ffff5c924c0,
msz=1024, nelem=632, p=0x0) at numpy/core/src/multiarray/alloc.c:123
#4 npy_free_cache (p=0x0, sz=632) at numpy/core/src/multiarray/alloc.c:165
#5 0x00007ffff598cd4a in _new_argsortlike (op=op@entry=0x7fffe70b7350,
axis=0, argsort=argsort@entry=0x7ffff5a0ea50 <aquicksort_long>,
argpart=argpart@entry=0x0, kth=kth@entry=0x0, nkth=nkth@entry=0)
at numpy/core/src/multiarray/item_selection.c:1081
#6 0x00007ffff598f3f9 in PyArray_ArgSort (op=op@entry=0x7fffe70b7350, axis=0,
which=) at numpy/core/src/multiarray/item_selection.c:1300
#7 0x00007ffff59d3365 in array_argsort (self=0x7fffe70b7350,
args=, kwds=)
at numpy/core/src/multiarray/methods.c:1331
#8 0x00000000004c37ed in PyEval_EvalFrameEx ()

@bdrosen96 (Author)
diff --git a/numpy/core/src/multiarray/item_selection.c b/numpy/core/src/multiarray/item_selection.c
index 486eb43..b1a6e23 100644
--- a/numpy/core/src/multiarray/item_selection.c
+++ b/numpy/core/src/multiarray/item_selection.c
@@ -980,7 +980,9 @@ _new_argsortlike(PyArrayObject *op, int axis, PyArray_ArgSortFunc *argsort,
     NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(op));
 
     if (needcopy) {
+        NPY_END_THREADS_DESCR(PyArray_DESCR(op));
         valbuffer = npy_alloc_cache(N * elsize);
+        NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(op));
         if (valbuffer == NULL) {
             ret = -1;
             goto fail;
@@ -988,7 +990,9 @@ _new_argsortlike(PyArrayObject *op, int axis, PyArray_ArgSortFunc *argsort,
     }
 
     if (needidxbuffer) {
+        NPY_END_THREADS_DESCR(PyArray_DESCR(op));
         idxbuffer = (npy_intp *)npy_alloc_cache(N * sizeof(npy_intp));
+        NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(op));
         if (idxbuffer == NULL) {
             ret = -1;
             goto fail;
@@ -1078,9 +1082,9 @@ _new_argsortlike(PyArrayObject *op, int axis, PyArray_ArgSortFunc *argsort,
     }
 
 fail:
+    NPY_END_THREADS_DESCR(PyArray_DESCR(op));
     npy_free_cache(valbuffer, N * elsize);
     npy_free_cache(idxbuffer, N * sizeof(npy_intp));
-    NPY_END_THREADS_DESCR(PyArray_DESCR(op));
     if (ret < 0) {
         if (!PyErr_Occurred()) {
             /* Out of memory during sorting or buffer creation */

@bdrosen96 (Author)

There might be other places, but I have not run across them yet.

@charris (Member) commented Sep 15, 2018

I wonder if all the benefits of using the cache would be lost with the requirement to hold the GIL? I'm also worried that it might be too easy to make mistakes. OTOH, thread-local storage might come with its own problems: what happens if a thread goes away, or if the data should be global? If there were a way to make the calls to the cache transparently safe, that would be the way to go. @juliantaylor Thoughts?

@bdrosen96 (Author)

Perhaps just use a normal mutex (pthread, etc.) as part of the cache allocator?

@juliantaylor (Contributor) commented Sep 16, 2018

Putting a lock into a small memory allocation routine is a serious scalability issue. On the other hand, Python doesn't scale across threads anyway, and we already require a lock (the GIL), so we could just put that assumption into the code.
The need for this code has also been somewhat alleviated by recent improvements in glibc concerning small memory allocations (incidentally implemented via a thread-local cache, as proposed here too).

The use of the caching allocator in the non-locked code was a serious blunder that I should have spotted (it was even visible in the PR's diff). But automated tests are always better than manual ones, so we should at least add the GIL assert to our test suite (we already have a test-suite run in debug mode in our existing tests).

@juliantaylor added this to the 1.15.2 release milestone Sep 16, 2018
@charris (Member) commented Sep 16, 2018

A test would be wonderful.

@juliantaylor self-assigned this Sep 16, 2018
@bdrosen96 (Author)

If you want to add this assertion to the test run to check for other cases where the GIL may not be held, it might be better to use:

int PyGILState_Check2(void)
{
    PyThreadState * tstate = _PyThreadState_Current;
    if (!tstate)
        return -1;
    return (tstate == PyGILState_GetThisThreadState());
}

This distinguishes tstate being NULL from tstate not matching the current GIL holder. (I'm not sure whether it can legitimately be NULL if no other threads are running.)

juliantaylor added a commit to juliantaylor/numpy that referenced this issue Sep 18, 2018
There currently only is a global cache for small allocations so the
functions must be called while the GIL is held.
Ensure this by checking the GIL state in debug mode (which is tested via
a ci configuration).
Closes numpygh-11942
charris pushed a commit to charris/numpy that referenced this issue Sep 19, 2018
There currently only is a global cache for small allocations so the
functions must be called while the GIL is held.
Ensure this by checking the GIL state in debug mode (which is tested via
a ci configuration).
Closes numpygh-11942
@charris charris removed this from the 1.15.2 release milestone Sep 19, 2018
@charris (Member) commented Sep 23, 2018

@bdrosen96 I have released 1.14.6. Can you verify that it works for you?

@pprett commented Sep 24, 2018

@charris @bdrosen96 verifying it now

@pprett commented Sep 24, 2018

@charris I confirm that 1.14.6 does not trigger the repro case -- thanks for the quick turnaround on this, @juliantaylor!
