-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
Occasional segfault with cacheing allocator and multiple threads #11942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking &ldq 8000 uo;Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@juliantaylor Thoughts? |
There was a bug with joblib in the repro case above. I updated it: |
I also tried using the following diff to gather info about whether GIL was held or not:
|
@bdrosen96 Were you able to find any trouble spots? |
yes. For example: #2 0x00007ffff592526a in assert_has_gil () |
|
there might be other places, but I have not run across them yet. |
I wonder if all the benefits of using the cache would be lost with the requirement to hold the GIL? I'm also worried that it might be too easy to make mistakes. OTOH, thread local might come with its own problems, I mean, what happens if a thread goes away or if the data should be global? If there were a way to make the call to the cache transparently safe, that would be the way to go. @juliantaylor Thoughts? |
perhaps just use a normal mutex (pthread, etc) as part of the cache allocator? |
putting a lock into a small memory allocation routine is a serious scalability issue, though python doesn't scale anyway and we already require a lock anyway so we could just put that assumption into the code. the usage of the caching allocator in the non locked code was a serious blunder (it was even visible in the PR's diff) that I should have spotted. But automated tests are always better than manuals, so we should at least add the gil assert to our testsuite (we have a testsuite run in debug mode in our existing tests) |
A test would be wonderful. |
If you want to add this assertion to the test run to check for other cases where gil may not be held, it might be better to use:
To distinguish between tstate being null vs not equal to current gil. (not sure if it can legitimately be null if not other threads running) |
There currently only is a global cache for small allocations so the functions must be called while the GIL is held. Ensure this by checking the GIL state in debug mode (which is tested via a ci configuration). Closes numpygh-11942
There currently only is a global cache for small allocations so the functions must be called while the GIL is held. Ensure this by checking the GIL state in debug mode (which is tested via a ci configuration). Closes numpygh-11942
@bdrosen96 I have released 1.14.6. Can you verify that it works for you? |
@charris @bdrosen96 verifying it now |
@charris i confirm that 1.14.6 does not trigger the repro case -- thanks guys for the quick turnaround on this @juliantaylor ! |
The use of the caching memory allocator in numpy with multiple threads seems to result in some buggy behavior. It is not clear if this is due to the cacheing allocator methods being invoked from places where GIL is not held or due to other issues. When we upgraded from numpy 1.13.3 to 1.14.0 we noticed these issues, likely due to PRs such as #8920.
We also identified such commits as potentially problematic:
This issue seems to be still be present in the latest master.
For local testing we were able to resolve issues by making the allocator objects into Thread Local Storage object by doing:
Although it is doubtful that this solution would be viable for all supported platforms as __thread may not be supported for some of them. This was rather to verify that the allocator cache was improperly accessed across threads.
Reproducing code example:
The following attachment contains what is needed to reproduce this problem. Unfortunately it does not occur every time, so it may need to be run several times before it occurs. The number of cores or speed of the machine used may also make a difference in how easy it is to reproduce. When it does occur, it will be either a segfault or an IndexError.
gh8920.tar.gz
Error message:
Numpy/Python version information:
The text was updated successfully, but these errors were encountered: