-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
Circumventing cblas_sgemv segfault in Accelerate #5205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Circumventing cblas_sgemv segfault in Accelerate #5205
Conversation
No bento support yet, I have no idea how bento works... |
Great - thanks for doing that. There is a fair amount of trailing whitespace in the changes - in case you get a chance to clear that up. How about adding a test. I don't know how to make the segfault happen reliably - Julian - do you have something effective? This is a version of the test in the issue:
|
How do we make or run a test for something that will segfault the test suite? Spawn a subprocess? |
I was thinking it would be better to segfault the test-suite than let it creep back accidentally. |
But I guess running the test in another python subprocess would work too, if rather complicated. |
Something like:
|
@@ -20,6 +20,11 @@ | |||
#include <limits.h> | |||
#include <stdio.h> | |||
|
|||
#ifdef APPLE_ACCELERATE_SGEMV_PATCH | |||
#include <pthread.h> | |||
extern pthread_key_t tls_memory_error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to use NPY_TLS instead of pthreads, though I never tested if it works on mac, it does on linux and windows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NPY_TLS cannot be used on MacOS X.
this is only needed when the cpu supports AVX, does clang have something equivalent to gccs |
/* ----------------------------------------------------------------- */ | ||
/* Management of aligned memory */ | ||
|
||
#define unlikely(x) __builtin_expect(!!(x), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have NP_UNLIKELY
MacOS X object files does not support thread local storage. The storage qualifier |
I think Also I did not know about Currently |
If I commit more changes to my code will they appear in this PR, or do I need to do something else? |
Any changes you push into this github branch will appear in the PR. New
|
they appear when you push them This seems a bit to complicated to put in on the day we want to cut the 1.9.1rc. And I really hate workarounds, fix the issue at its source. Maybe if more people file bugs apple will do something. for 1.9.1 maybe we could fix at least some of the more common occurrences of the issue by using an aligned allocator at the core of numpy. |
With |
Aha, |
This seems to work: #include <stdio.h>
#include <stdlib.h>
static int cpu_supports_avx(void)
{
char tmp[1024];
FILE *p;
size_t count;
p = popen("sysctl -a | grep AVX", "r");
if (p == NULL) return -1;
count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
pclose(p);
return (count ? 1 : 0);
}
int main(int argc, char *argcv[])
{
printf("has avx: %d\n", cpu_supports_avx());
return 0;
} |
It seems the sgemv bug has been removed in Yosemite. Either this workaround can be ignored, or we should have a dynamic check for Mavericks as well. Mavericks is the only version of MacOS X where it is present. That said, I have addressed the issues raised here except for adding the test case suggested by @matthew-brett . NPY_TLS cannot be used, so it must use pthreads. |
The question is, what should we do for shipping OSX wheels? It seems to me that we will have no choice but to ship Mavericks-compatible wheels at least for the next year. So then the question is, it is worth keeping something like this in the code-base to make that possible, or not? I think there are big advantages in terms of performance, and licensing, to using Accelerate. So, as the person responsible for building the OSX wheels, I put in a strong plea to find some way of making this work. |
This code: https://gist.github.com/zchothia/3078968 successfully detects AVX on my 10.9 laptop and absence of AVX on my 10.6 desktop. |
This is the code I use to check for AVX and OS version. It is only executed on module import. #include <stdio.h>
#include <stdlib.h>
static int cpu_supports_avx(void)
{
char tmp[1024];
FILE *p;
size_t count;
p = popen("sysctl -n machdep.cpu.features | grep AVX", "r");
if (p == NULL) return -1;
count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
pclose(p);
return (count ? 1 : 0);
}
static int using_mavericks(void)
{
char tmp[1024];
FILE *p;
size_t count;
p = popen("sw_vers -productVersion | grep 10\\.9\\.", "r");
if (p == NULL) return -1;
count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
pclose(p);
return (count ? 1 : 0);
}
int main(int argc, char *argcv[])
{
printf("has avx: %d\n", cpu_supports_avx());
printf("mavericks: %d\n", using_mavericks());
return 0;
} |
Added a test case to def test_npdot_segfault():
if sys.platform != 'darwin': return
from subprocess import call
# Test for float32 np.dot segfault
# https://github.com/numpy/numpy/issues/4007
# This always segfaults when the sgemv alignment bug is present
script = """
import numpy as np
def aligned_array(N, align, dtype):
d = dtype()
tmp = np.zeros(N * d.nbytes + align, dtype=np.uint8)
address = tmp.__array_interface__["data"][0]
for offset in range(align):
if (address + offset) % align == 0: break
return tmp[offset:offset+N*d.nbytes].view(dtype=dtype)
m = aligned_array(100,15,np.float32)
s = aligned_array(10000,15,np.float32).reshape(100,100)
np.dot(s,m)"""
assert_equal(call([sys.executable, '-c', script]),0)
# test the sanity of np.dot after applying patch
script = """
import numpy as np
from numpy.testing import assert_allclose
for i in range(100):
m0 = np.random.rand(200)
s0 = np.random.rand(10000,200)
desired = np.dot(s0,m0).astype(np.float32)
m1 = m0.astype(np.float32)
s1 = s0.astype(np.float32)
assert_allclose(np.dot(s1,m1),desired,atol=0.01)
m = np.asfortranarray(m1)
s = np.asfortranarray(s1)
assert_allclose(np.dot(s,m),desired,atol=0.01)
m0 = np.random.rand(200)
s0 = np.random.rand(200,10000)
desired = np.dot(s0.T,m0).astype(np.float32)
m1 = m0.astype(np.float32)
s1 = s0.astype(np.float32)
assert_allclose(np.dot(s1.T,m1),desired,atol=0.01)
m = np.asfortranarray(m1)
s = np.asfortranarray(s1)
assert_allclose(np.dot(s.T,m),desired,atol=0.01)
m0 = np.random.rand(89)
s0 = np.random.rand(10000,89)
desired = np.dot(s0,m0).astype(np.float32)
m1 = m0.astype(np.float32)
s1 = s0.astype(np.float32)
assert_allclose(np.dot(s1,m1),desired,atol=0.01)
m = np.asfortranarray(m1)
s = np.asfortranarray(s1)
assert_allclose(np.dot(s,m),desired,atol=0.01)"""
assert_equal(call([sys.executable, '-c', script]),0) |
Pinging @matthew-brett @juliantaylor @njsmith to hear your opinion before I do more work with this. |
I'm generally very keen to be able to use Accelerate on OSX, as y'all know. Sturla - I thought the plan was to implement sgemv with sgemm if there was likely to be trouble - but here you are making copies of the arrays; was there a problem with the sgemm idea? |
That is why I said on the SciPy list that I have a fix, but it requires some small changes to _dotblas.c With an sgemm based fix the changes to _dotblas.c would not be required. But if you look at the code the changes to _dotblas.c is just a tiny part of it. Using sgemm is not likely to substantially reduce the LOC. The reason why I just copied the array was to save my brain from going up in a ball of smoke. CBLAS allows multiple combinations of array transpose and array order, and this affect the ordering of the array rank and leading dimension arguments. And since we are working with two cblas functions, there are two layers of shuffling of array shape variables to keep under control. Due to the inconsistencies in the design of BLAS it is not symmetrical. I was just lazy and figured it was easier to make an aligned copy of misaligned inputs. I wrote the fix in the middle of the night, and when I tried to get all of the combinations correct my brain refused to cooperate. But there is nothing wrong with making a an aligned temporary copy, and it might actually run faster than sgemm on misaligned data. Copy-in and/or copy-out of misaligned input arrays is an optimization strategy that Fortran compilers often employ for that very reason. And it is also why passing an array to a subroutine call will invalidate Fortran pointers to the array. |
So - am I right in thinking that using a version of the gemm fix would reduce the impact on numpy quite a a bit - and make the patch easier to accept? I can try and work on this, but not before Monday afternoon pacific time. |
For reference, I think this is the fix that you (Sturla) were discussing with the author of vecLibFort: https://github.com/mcg1969/vecLibFort/blob/master/static.h#L96 I seem to remember you were worried about the fact that he wasn't checking the alignment of the vector A, but just the matrix X. |
His fix does not cover all the possibilities in Fortran BLAS. For example he thinks the segfault only happens when a transpose flag is set. Also it does not cover all the fail modes. Further more CBLAS is more complex than Fortran BLAS because it allows for C and Fortran order arrays. The emulated sgemv in libVecFort also appears simpler than it is because of tons of macro magic. So keep in mind what you compare. The dlopen/dlsym stuff is in libVecFort too, just hidden in macros. And they are not directly reusable because they also take care of Fortran name mangling. We also cannot use sgemv_ from libVecFort because it calls cblas_sgemv, and we would create an infinite recursion if we try to replace the latter. Changing it to use cblas_sgemm is not difficult. But the lines of code will not be substantially reduced. We still need the code to call the original function when the fix is not needed, the code that works out the CPU and OS version, the changes to the build process, the test... You can avoid the 34 extra lines in _dotblas.c, but the other C file will not be substantially smaller. Each combination of arguments to cblas_sgemv will also result in a different sgemm call, so we also need additional tests to cover all of them. Thus it will add to the test suite (not that it is bad). The current code just forwards the arguments and is logically much simpler. |
I still think libVecFort would benefit SciPy, but not NumPy. In scipy.linalg we only have Fortran BLAS and LAPACK wrapped with f2py. |
If I understand correctly, the big advantage of the sgemm fix is that it has minimal impact on maintenance burden for our non OSX friends. Julian - I am guessing you would be happier to accept this if it was essentially just the extra file with the patch in it, with good tests (which I guess would be good for numpy in general), and no edits to _dotblas.c ? Maybe more minor as an advantage for the sgemm route is the ability to avoid temporary array copies, which could be large (float32 often used by people trying to save memory). |
There is a macroguard which turns the hack into nothing unless the symbol |
The absolute path to Accelerate is also hardcoded in NumPy's setup script, so I don't think it is much worse to hardcode it in this workaround. |
fine, the branch needs squashing into one commit though |
How do I do that? Assume I'm an idiot with git. |
in the file it gives you to edit replace all picks except the first with |
I'm hoping Apple fixes this up in a 10.9 upgrade and we can stop worrying about it for numpy 1.10. |
I hope Apple takes care of this too. We shouldn't need this fix. |
wasn't 10.9.5 the last 10.9 update? if so I doubt it will still get fixed. |
$ git rebase -i origin/maintenance/1.9.x
Successfully rebased and updated refs/heads/maintenance/1.9.x. This is the file I got
|
Or perhaps it did? |
You still need to squash it by replacing the
where n is the number of commits you want to modify. After all that you will need to force push to origin. |
is origin the numpy remote or you own? you have to rebase onto to original numpy branch |
origin is my own remote. |
Yeah, looks like you aren't on a working branch. That's a no-no ;) Hmm..., you can fix that up but will probably need to make another PR and clean up your origin. First, while in |
Sturla - I can do this for you if you like - let me know? |
Also - might be a good idea to :
|
no need to do anything fancy, just rebase against the right branch, super manual way:
but we can also do that for you if you have further trouble |
Gosh, it actually worked... :-) |
You should expand your commit message with an explanation of what you have done. You can use You are still not in a working branch, which will cause synchronization problems if you do another pr, but we can live with it here. |
SGEMV in Accelerate framework will segfault on MacOS X version 10.9 (aka Mavericks) if arrays are not aligned to 32 byte boundaries and the CPU supports AVX instructions. This can produce segfaults in numpy.dot if we use numpy.float32 as dtype. This patch overshadows the symbols cblas_sgemv, sgemv_ and sgemv exported by Accelerate to produce the correct behavior. The MacOS X version and CPU specs are checked on module import. If Mavericks and AVX are detected the call to SGEMV is emulated with a call to SGEMM if the arrays are not 32 byte aligned. If the exported symbols cannot be overshadowed on module import, a fatal error is produced and the process aborts. All the fixes are in a self-contained C file and do not alter the _dotblas C code. The patch is not applied unless NumPy is configured to link with Apple's Accelerate framework.
I'd prefer to keep it here to not spread the discussion to even more places |
Ok, I posted a new PR from a working branch as @charris suggested. But reopen and close the other if you prefer. |
Does this look ok now, or is there anything else I need to do? |
Looks fine to me... |
There is some leftover cruft after the switch from |
Ok, now that is taken care of in #5223. |
http://article.gmane.org/gmane.comp.python.scientific.devel/19141
#4007