8000 MNT: disable the allocator cache for nogil builds by ngoldbaum · Pull Request #26251 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

MNT: disable the allocator cache for nogil builds #26251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 11, 2024

Conversation

ngoldbaum
Copy link
Member

These caches are implicitly locked by the GIL, so in the nogil build I see use-after-free errors in tests that use python threads. See for example this ASAN error in one of the stringdtype tests that does multithreaded array mutation and access:

==31117==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000003010 at pc 0x000149921bbc bp 0x00016fece010 sp 0x00016fece008
READ of size 8 at 0x602000003010 thread T13
    #0 0x149921bb8 in PyArray_MultiplyList multiarraymodule.c:181
    #1 0x149999e14 in PyArray_ClearArray refcount.c:83
    #2 0x1497578c4 in array_dealloc arrayobject.c:427
    #3 0x106961118 in _PyEval_EvalFrameDefault generated_cases.c.h:4818
    #4 0x10681ac98 in method_vectorcall classobject.c:70
    #5 0x106a47364 in thread_run _threadmodule.c:337
    #6 0x1069de8b0 in pythread_wrapper thread_pthread.h:241
    #7 0x1057b8760 in asan_thread_start(void*)+0x3c (libclang_rt.asan_osx_dynamic.dylib:arm64+0x50760)
    #8 0x186dfe030 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64+0x7030)
    #9 0xa358800186df8e38  (<unknown module>)

0x602000003010 is located 0 bytes inside of 16-byte region [0x602000003010,0x602000003020)
freed by thread T14 here:
    #0 0x1057bbb1c in free+0x90 (libclang_rt.asan_osx_dynamic.dylib:arm64+0x53b1c)
    #1 0x14974ee18 in _npy_free_cache alloc.c:140
    #2 0x14974ef0c in npy_free_cache_dim alloc.c:204
    #3 0x149757be8 in array_dealloc arrayobject.c:452
    #4 0x1498ea800 in Py_DECREF object.h:909
    #5 0x1498ee59c in Py_XDECREF object.h:1038
    #6 0x1498fe2dc in array_assign_subscript mapping.c:2139
    #7 0x106962ccc in _PyEval_EvalFrameDefault generated_cases.c.h:5529
    #8 0x10681ac98 in method_vectorcall classobject.c:70
    #9 0x106a47364 in thread_run _threadmodule.c:337
    #10 0x1069de8b0 in pythread_wrapper thread_pthread.h:241
    #11 0x1057b8760 in asan_thread_start(void*)+0x3c (libclang_rt.asan_osx_dynamic.dylib:arm64+0x50760)
    #12 0x186dfe030 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64+0x7030)
    #13 0x587d000186df8e38  (<unknown module>)

previously allocated by thread T0 here:
    #0 0x1057bb9e8 in malloc+0x8c (libclang_rt.asan_osx_dynamic.dylib:arm64+0x539e8)
    #1 0x14974e9f4 in _npy_alloc_cache alloc.c:104
    #2 0x14974eeac in npy_alloc_cache_dim alloc.c:193
    #3 0x1497e5a20 in PyArray_NewFromDescr_int ctors.c:765
    #4 0x1497e8c20 in PyArray_NewFromDescrAndBase ctors.c:1045
    #5 0x1497e8ba0 in PyArray_NewFromDescr ctors.c:1030
    #6 0x1497e912c in PyArray_NewLikeArrayWithShape ctors.c:1119
    #7 0x1497e99d8 in PyArray_NewLikeArray ctors.c:1207
    #8 0x1497b4ec0 in PyArray_NewCopy convert.c:493
    #9 0x1499107d4 in array_copy methods.c:1141
    #10 0x106824e68 in method_vectorcall_FASTCALL_KEYWORDS descrobject.c:420
    #11 0x106817e88 in PyObject_Vectorcall call.c:327
    #12 0x10695106c in _PyEval_EvalFrameDefault generated_cases.c.h:813
    #13 0x1068181b4 in PyObject_CallOneArg call.c:395
    #14 0x10687efa8 in _PyObject_GenericGetAttrWithDict object.c:1577

I initially thought we could keep the caching by marking the caches as thread local, but that won't work with any allocation happening in a dynamically created thread. Instead I think the only safe thing we can do without adding a mutex to access the caches is disable the cache entirely.

Happy to entertain alternative approaches.

With this change when the gil is disabled numpy/_core/tests/test_stringdtype.py::test_threaded_access_and_mutation goes from seg faulting every time it's executed to working every time I run it.

@mattip
Copy link
Member
mattip commented Apr 9, 2024

What is the performance penalty for disabling the caches?

@seberg
Copy link
Member
seberg commented Apr 10, 2024

I am not sure how the nogil branch works, but I think I once heard about it using mimalloc? It might actually make sense to make our "default" allocator use PyMem_RawMalloc on that branch (maybe even generally if Python adopts it generally).

OTOH, that may also be something for the future to worry about.

(EDIT: I actually, not even sure we need "Raw", since I think these need the GIL anyway.)

@rgommers rgommers added 03 - Maintenance 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703) labels Apr 10, 2024
@ngoldbaum
Copy link
Member Author

What is the performance penalty for disabling the caches?

I ran the benchmarks with a nogil python build but with the GIL explicitly enabled to avoid any threading bugs in the benchmarks. Unfortunately asv is sort of broken with nogil python at the moment, so I had to run the comparison manually, and the tools in asv to compare benchmark results don't work with this mode. The first file in the gist is the core benchmark suite with this PR applied, the second is without.

https://gist.github.com/ngoldbaum/b632c35e5359ea78d90f541ed5cc21d2

Eyeballing the results, I see approximately a 10% performance penalty for operations on 8000 small arrays.

As a general rule for the nogil work, I'm hoping to end up with a version of numpy that runs stably with nogil python and that goal may initially be achieved at the cost of single-threaded performance in the nogil build of python. Does that sound like a reasonable rule of thumb for this work?

using mimalloc

PyMem_Malloc uses mimalloc already, so we're using mimalloc already through that. I asked Sam about this today and he said there shouldn't be much performance difference compared with pymalloc. PyMem_RawMalloc remains a wrapper around the system malloc

@ngoldbaum
Copy link
Member Author

Also I'm happy to add a section to the tracking issue where we keep track of places we sacrifice single-threaded performance to avoid threading bugs, as an opportunity to come back and add e.g. locking or a more sophisticated implementation later when we're optimizing nogil numpy.

@rgommers
Copy link
Member

Does that sound like a reasonable rule of thumb for this work?

That sounds like the right thing to do to me. It's too early to micro-optimize. And optimizing performance will be a lot easier to do later, once asv works without hiccups, PyMutex has become public, the no-GIL build is easier to install, etc.

@mhvk
Copy link
Contributor
mhvk commented Apr 10, 2024

I like how this is isolated by the ifdef, so that a build with the GIL will suffer no consequences. In general, probably good to look at all cases of caching to see if the actual calculation cannot be sped up...

@ngoldbaum
Copy link
Member Author
ngoldbaum commented Apr 11, 2024

We talked about this yesterday at the community meeting and there seemed to be agreement that focusing on stability at the cost of performance is OK for this work, at least as long as there aren't any catastrophic single-threaded performance implications. There also shouldn't be negative performance implications for the "normal" python build.

I'll open a followup issue to track spots where we should come back and improve single-threaded performance and link it from the tracking issue.

@seberg, you weren't at the meeting yesterday and I just want to double-check, does this plan sound reasonable to you?

@seberg
Copy link
Member
seberg commented Apr 11, 2024

does this plan sound reasonable to you?

Yes of course, even if this ends up lost in time, I don't think it is a huge issue. I was just a bit curious if using the default allocator (and thus mimalloc) might not give us great overall performance even in the normal build.

@ngoldbaum ngoldbaum merged commit 01e220d into numpy:main Apr 11, 2024
@ngoldbaum
Copy link
Member Author

OK great, let's bring this in then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
03 - Maintenance 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0