Circumventing cblas_sgemv segfault in Accelerate #5205

sturlamolden · 2014-10-18T07:46:36Z

http://article.gmane.org/gmane.comp.python.scientific.devel/19141
#4007

sturlamolden · 2014-10-18T08:15:05Z

No bento support yet, I have no idea how bento works...

matthew-brett · 2014-10-18T08:45:42Z

Great - thanks for doing that.

There is a fair amount of trailing whitespace in the changes - in case you get a chance to clear that up.

How about adding a test. I don't know how to make the segfault happen reliably - Julian - do you have something effective?

This is a version of the test in the issue:

def test_npdot_segfault():
    # Test for float32 np.dot segfault
    # https://github.com/numpy/numpy/issues/4007
    m = 10000
    n = 200
    A = np.ones((m, n)).astype(np.float32)
    X = np.ones(n).astype(np.float32)
    # This segfaults about half the time when the bug is present
    np.dot(A, X)

sturlamolden · 2014-10-18T08:51:12Z

How do we make or run a test for something that will segfault the test suite? Spawn a subprocess?

matthew-brett · 2014-10-18T08:54:05Z

I was thinking it would be better to segfault the test-suite than let it creep back accidentally.

matthew-brett · 2014-10-18T08:55:02Z

But I guess running the test in another python subprocess would work too, if rather complicated.

matthew-brett · 2014-10-18T09:06:14Z

Something like:

import sys
from subprocess import call

from nose.tools import assert_equal

def test_npdot_segfault():
    # Test for float32 np.dot segfault
    # https://github.com/numpy/numpy/issues/4007
    # This segfaults about half the time when the bug is present
    assert_equal(call([sys.executable, '-c',
                       'import numpy as np; '
                       'np.dot('
                       'np.ones((10000, 200)).astype(np.float32), '
                       'np.ones(200).astype(np.float32))']),
                 0)

juliantaylor · 2014-10-18T10:27:49Z

numpy/core/blasdot/_dotblas.c

@@ -20,6 +20,11 @@
 #include <limits.h>
 #include <stdio.h>

+#ifdef APPLE_ACCELERATE_SGEMV_PATCH
+#include <pthread.h>
+extern pthread_key_t tls_memory_error;


you should be able to use NPY_TLS instead of pthreads, though I never tested if it works on mac, it does on linux and windows

NPY_TLS cannot be used on MacOS X.

juliantaylor · 2014-10-18T10:30:42Z

this is only needed when the cpu supports AVX, does clang have something equivalent to gccs __builtin_cpu_supports("avx")?

juliantaylor · 2014-10-18T12:09:50Z

numpy/core/blasdot/apple_sgemv_patch.c

+/* ----------------------------------------------------------------- */
+/* Management of aligned memory */
+
+#define unlikely(x) __builtin_expect(!!(x), 0)


we have NP_UNLIKELY

sturlamolden · 2014-10-18T14:05:43Z

MacOS X object files does not support thread local storage. The storage qualifier __thread (which is what the NPY_TLSmacro exands to) is unavailable on the platform. clang does not accept it. That is why I used pthreads_setspecificwhich on OS X emulates TLS inside the posix library. It is quite efficient though, and only needs three instructions. It is quite outrageous that a modern desktop OS does not support thread local storage, but that is sadly the case here.

sturlamolden · 2014-10-18T14:24:36Z

I think __builtin_cpu_supports("avx") might be available. Clang supports most GCC extensions. I will try.

Also I did not know about NPY_UNLIKELY, sorry. Which header file should I include?

Currently apple_sgemv_patch.c does not include any of NumPy. Python.h is only included for Py_intptr_t, since intptr_t is only available in C99. I thought of using longinstead though, because I know it has the size of a void* on OSX (at least with clang, gcc or icc), but it is not correct C to assume this. I could use npy_intp if I include NumPy.

sturlamolden · 2014-10-18T14:28:20Z

If I commit more changes to my code will they appear in this PR, or do I need to do something else?

njsmith · 2014-10-18T14:51:01Z

Any changes you push into this github branch will appear in the PR. New
commits don't trigger any notification messages, though, so it's important
to ping the thread if you want people to notice and look at new changes.
On 18 Oct 2014 15:28, "Sturla Molden" notifications@github.com wrote:

If I commit more changes to my code will they appear in this PR, or do I
need to do something else?

—
Reply to this email directly or view it on GitHub
#5205 (comment).

juliantaylor · 2014-10-18T14:53:12Z

they appear when you push them

This seems a bit to complicated to put in on the day we want to cut the 1.9.1rc. And I really hate workarounds, fix the issue at its source. Maybe if more people file bugs apple will do something.
I'd rather disable the use of accelerate completely if its such a large issue in practice, (thought we should also disable openblas then too, it still happily produces wrong results my cpu with the latest version ...)

for 1.9.1 maybe we could fix at least some of the more common occurrences of the issue by using an aligned allocator at the core of numpy.
of course it will not make @njsmith happy as it will prevent us from using pythons tracemalloc.

sturlamolden · 2014-10-18T15:21:51Z

With __builtin_cpu_supports("avx") both clang and gcc complains about unknown builtin. icc compiles the code but the _dotblas dies with an import error because ___builtin_cpu_supports (with three underscores) is not found.

sturlamolden · 2014-10-18T15:27:59Z

Aha, __builtin_cpu_supports is available in gcc 4.8. If I compile with gcc 4.8.1 it works. However with icc, clang or Apple's gcc it is not available.

sturlamolden · 2014-10-18T16:22:10Z

This seems to work:

#include <stdio.h>
#include <stdlib.h>

static int cpu_supports_avx(void)
{
    char tmp[1024];
    FILE *p;
    size_t count;
    p = popen("sysctl -a | grep AVX", "r");
    if (p == NULL) return -1;
    count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
    pclose(p);
    return (count ? 1 : 0);
}

int main(int argc, char *argcv[]) 
{
    printf("has avx: %d\n", cpu_supports_avx());
    return 0;
}

sturlamolden · 2014-10-18T18:14:54Z

It seems the sgemv bug has been removed in Yosemite. Either this workaround can be ignored, or we should have a dynamic check for Mavericks as well. Mavericks is the only version of MacOS X where it is present.

That said, I have addressed the issues raised here except for adding the test case suggested by @matthew-brett . NPY_TLS cannot be used, so it must use pthreads.

matthew-brett · 2014-10-18T18:25:59Z

The question is, what should we do for shipping OSX wheels?

It seems to me that we will have no choice but to ship Mavericks-compatible wheels at least for the next year.

So then the question is, it is worth keeping something like this in the code-base to make that possible, or not?

I think there are big advantages in terms of performance, and licensing, to using Accelerate. So, as the person responsible for building the OSX wheels, I put in a strong plea to find some way of making this work.

matthew-brett · 2014-10-18T18:37:25Z

This code:

https://gist.github.com/zchothia/3078968

successfully detects AVX on my 10.9 laptop and absence of AVX on my 10.6 desktop.

sturlamolden · 2014-10-18T18:39:46Z

This is the code I use to check for AVX and OS version. It is only executed on module import.

#include <stdio.h>
#include <stdlib.h>

static int cpu_supports_avx(void)
{
    char tmp[1024];
    FILE *p;
    size_t count;
    p = popen("sysctl -n machdep.cpu.features | grep AVX", "r");
    if (p == NULL) return -1;
    count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
    pclose(p);
    return (count ? 1 : 0);
}

static int using_mavericks(void)
{
    char tmp[1024];
    FILE *p;
    size_t count;
    p = popen("sw_vers -productVersion | grep 10\\.9\\.", "r");
    if (p == NULL) return -1;
    count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
    pclose(p);
    return (count ? 1 : 0);
}

int main(int argc, char *argcv[]) 
{
    printf("has avx:   %d\n", cpu_supports_avx());
    printf("mavericks: %d\n", using_mavericks());
    return 0;
}

sturlamolden · 2014-10-18T20:20:06Z

Added a test case to test_blasdot.py that should always work

def test_npdot_segfault():
    if sys.platform != 'darwin': return
    from subprocess import call

    # Test for float32 np.dot segfault
    # https://github.com/numpy/numpy/issues/4007

    # This always segfaults when the sgemv alignment bug is present

    script = """
import numpy as np
def aligned_array(N, align, dtype):
    d = dtype()
    tmp = np.zeros(N * d.nbytes + align, dtype=np.uint8)
    address = tmp.__array_interface__["data"][0]
    for offset in range(align):
        if (address + offset) % align == 0: break
    return tmp[offset:offset+N*d.nbytes].view(dtype=dtype)
m = aligned_array(100,15,np.float32)
s = aligned_array(10000,15,np.float32).reshape(100,100)
np.dot(s,m)"""

    assert_equal(call([sys.executable, '-c', script]),0)

    # test the sanity of np.dot after applying patch

    script = """
import numpy as np
from numpy.testing import assert_allclose

for i in range(100):
    m0 = np.random.rand(200)
    s0 = np.random.rand(10000,200)
    desired = np.dot(s0,m0).astype(np.float32)    
    m1 = m0.astype(np.float32)
    s1 = s0.astype(np.float32)
    assert_allclose(np.dot(s1,m1),desired,atol=0.01)    
    m = np.asfortranarray(m1)
    s = np.asfortranarray(s1)
    assert_allclose(np.dot(s,m),desired,atol=0.01) 

    m0 = np.random.rand(200)
    s0 = np.random.rand(200,10000)
    desired = np.dot(s0.T,m0).astype(np.float32)    
    m1 = m0.astype(np.float32)
    s1 = s0.astype(np.float32)
    assert_allclose(np.dot(s1.T,m1),desired,atol=0.01)    
    m = np.asfortranarray(m1)
    s = np.asfortranarray(s1)
    assert_allclose(np.dot(s.T,m),desired,atol=0.01) 

    m0 = np.random.rand(89)
    s0 = np.random.rand(10000,89)
    desired = np.dot(s0,m0).astype(np.float32)    
    m1 = m0.astype(np.float32)
    s1 = s0.astype(np.float32)
    assert_allclose(np.dot(s1,m1),desired,atol=0.01)    
    m = np.asfortranarray(m1)
    s = np.asfortranarray(s1)
    assert_allclose(np.dot(s,m),desired,atol=0.01)"""    

    assert_equal(call([sys.executable, '-c', script]),0)

sturlamolden · 2014-10-18T20:23:39Z

Pinging @matthew-brett @juliantaylor @njsmith to hear your opinion before I do more work with this.

matthew-brett · 2014-10-18T22:54:31Z

I'm generally very keen to be able to use Accelerate on OSX, as y'all know.

Sturla - I thought the plan was to implement sgemv with sgemm if there was likely to be trouble - but here you are making copies of the arrays; was there a problem with the sgemm idea?

sturlamolden · 2014-10-19T00:22:34Z

That is why I said on the SciPy list that I have a fix, but it requires some small changes to _dotblas.c With an sgemm based fix the changes to _dotblas.c would not be required. But if you look at the code the changes to _dotblas.c is just a tiny part of it. Using sgemm is not likely to substantially reduce the LOC.

The reason why I just copied the array was to save my brain from going up in a ball of smoke. CBLAS allows multiple combinations of array transpose and array order, and this affect the ordering of the array rank and leading dimension arguments. And since we are working with two cblas functions, there are two layers of shuffling of array shape variables to keep under control. Due to the inconsistencies in the design of BLAS it is not symmetrical. I was just lazy and figured it was easier to make an aligned copy of misaligned inputs. I wrote the fix in the middle of the night, and when I tried to get all of the combinations correct my brain refused to cooperate.

But there is nothing wrong with making a an aligned temporary copy, and it might actually run faster than sgemm on misaligned data. Copy-in and/or copy-out of misaligned input arrays is an optimization strategy that Fortran compilers often employ for that very reason. And it is also why passing an array to a subroutine call will invalidate Fortran pointers to the array.

matthew-brett · 2014-10-19T01:12:43Z

So - am I right in thinking that using a version of the gemm fix would reduce the impact on numpy quite a a bit - and make the patch easier to accept?

I can try and work on this, but not before Monday afternoon pacific time.

matthew-brett · 2014-10-19T01:35:33Z

For reference, I think this is the fix that you (Sturla) were discussing with the author of vecLibFort:

https://github.com/mcg1969/vecLibFort/blob/master/static.h#L96

I seem to remember you were worried about the fact that he wasn't checking the alignment of the vector A, but just the matrix X.

sturlamolden · 2014-10-19T02:42:31Z

His fix does not cover all the possibilities in Fortran BLAS. For example he thinks the segfault only happens when a transpose flag is set. Also it does not cover all the fail modes. Further more CBLAS is more complex than Fortran BLAS because it allows for C and Fortran order arrays. The emulated sgemv in libVecFort also appears simpler than it is because of tons of macro magic. So keep in mind what you compare. The dlopen/dlsym stuff is in libVecFort too, just hidden in macros. And they are not directly reusable because they also take care of Fortran name mangling. We also cannot use sgemv_ from libVecFort because it calls cblas_sgemv, and we would create an infinite recursion if we try to replace the latter.

Changing it to use cblas_sgemm is not difficult. But the lines of code will not be substantially reduced. We still need the code to call the original function when the fix is not needed, the code that works out the CPU and OS version, the changes to the build process, the test... You can avoid the 34 extra lines in _dotblas.c, but the other C file will not be substantially smaller. Each combination of arguments to cblas_sgemv will also result in a different sgemm call, so we also need additional tests to cover all of them. Thus it will add to the test suite (not that it is bad). The current code just forwards the arguments and is logically much simpler.

sturlamolden · 2014-10-19T02:50:42Z

I still think libVecFort would benefit SciPy, but not NumPy. In scipy.linalg we only have Fortran BLAS and LAPACK wrapped with f2py.

matthew-brett · 2014-10-19T02:55:25Z

If I understand correctly, the big advantage of the sgemm fix is that it has minimal impact on maintenance burden for our non OSX friends. Julian - I am guessing you would be happier to accept this if it was essentially just the extra file with the patch in it, with good tests (which I guess would be good for numpy in general), and no edits to _dotblas.c ?

Maybe more minor as an advantage for the sgemm route is the ability to avoid temporary array copies, which could be large (float32 often used by people trying to save memory).

sturlamolden · 2014-10-22T20:10:49Z

There is a macroguard which turns the hack into nothing unless the symbol APPLE_ACCELERATE_SGEMV_PATCH is defined. It is only defined when NumPy is configured to link with Accelerate on OSX. See the proposed changes to distutils/system_info.py. @charris

sturlamolden · 2014-10-22T20:15:17Z

The absolute path to Accelerate is also hardcoded in NumPy's setup script, so I don't think it is much worse to hardcode it in this workaround.

juliantaylor · 2014-10-22T20:16:10Z

fine, the branch needs squashing into one commit though

sturlamolden · 2014-10-22T20:18:34Z

How do I do that? Assume I'm an idiot with git.

juliantaylor · 2014-10-22T20:22:22Z

git rebase -i origin/maintenance/1.9.x

in the file it gives you to edit replace all picks except the first with f (pro tip have an editor that can do block editing) and update the commit message with something meaninging full (commit --amend)

charris · 2014-10-22T20:29:31Z

I'm hoping Apple fixes this up in a 10.9 upgrade and we can stop worrying about it for numpy 1.10.

sturlamolden · 2014-10-22T20:41:31Z

I hope Apple takes care of this too. We shouldn't need this fix.

juliantaylor · 2014-10-22T20:42:27Z

wasn't 10.9.5 the last 10.9 update? if so I doubt it will still get fixed.

sturlamolden · 2014-10-22T21:00:53Z

$ git rebase -i origin/maintenance/1.9.x
Successfully rebased and updated refs/heads/maintenance/1.9.x.

This is the file I got

noop

# Rebase 245c362..245c362 onto 245c362
#
# Commands:
#  p, pick = use commit
#  r, reword = use commit, but edit the commit message
#  e, edit = use commit, but stop for amending
#  s, squash = use commit, but meld into previous commit
#  f, fixup = like "squash", but discard this commit's log message
#  x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

git push origin says everything is up to date. Didn't seem to work though.

sturlamolden · 2014-10-22T21:03:22Z

Or perhaps it did?

charris · 2014-10-22T21:09:40Z

You still need to squash it by replacing the pick labels with f on all but the first, and by r on the first, then write and quit. Do that before rebasing on 1.9.x. with something like

git rebase -i HEAD~n

where n is the number of commits you want to modify. After all that you will need to force push to origin.

juliantaylor · 2014-10-22T21:10:22Z

is origin the numpy remote or you own? you have to rebase onto to original numpy branch
are you on the branch you have the changes on?

sturlamolden · 2014-10-22T21:12:45Z

origin is my own remote.

charris · 2014-10-22T21:21:28Z

Yeah, looks like you aren't on a working branch. That's a no-no ;) Hmm..., you can fix that up but will probably need to make another PR and clean up your origin. First, while in maintenance/1.9.x make a new branch git branch <new-branch-name>, that will be your working branch. Now do git reset --hard <sha first commit not yours> to clean up maintenance/1.9.x and force push that git push --force origin maintenance/1.9.x, that will clean up origin. Now checkout your working branch and proceed.

matthew-brett · 2014-10-22T21:27:06Z

Sturla - I can do this for you if you like - let me know?

matthew-brett · 2014-10-22T21:29:42Z

Also - might be a good idea to :

diff --git a/numpy/core/tests/test_blasdot.py b/numpy/core/tests/test_blasdot.py
index aa8bbd1..6b5afef 100644
--- a/numpy/core/tests/test_blasdot.py
+++ b/numpy/core/tests/test_blasdot.py
@@ -199,7 +199,7 @@ def test_npdot_segfault():
         return aligned

     def assert_dot_close(A, X, desired):
-        assert_allclose(np.dot(A, X), desired, rtol=1e-5, atol=1e-9)
+        assert_allclose(np.dot(A, X), desired, rtol=1e-5, atol=1e-7)

     m = aligned_array(100, 15, np.float32)
     s = aligned_array((100, 100), 15, np.float32)

juliantaylor · 2014-10-22T21:34:25Z

no need to do anything fancy, just rebase against the right branch, super manual way:

git remote add correctorigin git://github.com/numpy/numpy
git fetch correctorigin
git checkout  <the branch with your changes>
git rebase -i correctorigin/maintenance/1.9.x
git push -f origin <the branch with your changes>

but we can also do that for you if you have further trouble

sturlamolden · 2014-10-22T22:06:13Z

Gosh, it actually worked... :-)

charris · 2014-10-22T22:16:38Z

You should expand your commit message with an explanation of what you have done. You can use git rebase -i HEAD^ and replace pick with r (reword). Add blank line after summary, then explanation. The first line might also begin with BUG:. You will need to force push again after that.

You are still not in a working branch, which will cause synchronization problems if you do another pr, but we can live with it here.

SGEMV in Accelerate framework will segfault on MacOS X version 10.9 (aka Mavericks) if arrays are not aligned to 32 byte boundaries and the CPU supports AVX instructions. This can produce segfaults in numpy.dot if we use numpy.float32 as dtype. This patch overshadows the symbols cblas_sgemv, sgemv_ and sgemv exported by Accelerate to produce the correct behavior. The MacOS X version and CPU specs are checked on module import. If Mavericks and AVX are detected the call to SGEMV is emulated with a call to SGEMM if the arrays are not 32 byte aligned. If the exported symbols cannot be overshadowed on module import, a fatal error is produced and the process aborts. All the fixes are in a self-contained C file and do not alter the _dotblas C code. The patch is not applied unless NumPy is configured to link with Apple's Accelerate framework.

juliantaylor · 2014-10-22T23:09:42Z

I'd prefer to keep it here to not spread the discussion to even more places

sturlamolden · 2014-10-22T23:21:42Z

Ok, I posted a new PR from a working branch as @charris suggested. But reopen and close the other if you prefer.

sturlamolden · 2014-10-23T05:02:50Z

Does this look ok now, or is there anything else I need to do?

#5223

matthew-brett · 2014-10-23T05:15:07Z

Looks fine to me...

sturlamolden · 2014-10-23T16:26:35Z

There is some leftover cruft after the switch from popen to system. The checks for return value less than 0 are no longer needed, They are benign, but they don't have to be there anymore. I will get rid of them.

sturlamolden · 2014-10-23T17:42:49Z

Ok, now that is taken care of in #5223.

juliantaylor reviewed Oct 18, 2014
View reviewed changes

sturlamolden closed this Oct 22, 2014

juliantaylor reopened this Oct 22, 2014

sturlamolden mentioned this pull request Oct 22, 2014

BUG: Ugly fix for Apple's cblas_sgemv segfault #5223

Merged

sturlamolden closed this Oct 22, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Circumventing cblas_sgemv segfault in Accelerate #5205

Circumventing cblas_sgemv segfault in Accelerate #5205

Circumventing cblas_sgemv segfault in Accelerate #5205

Circumventing cblas_sgemv segfault in Accelerate #5205

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment