Circumventing cblas_sgemv segfault in Accelerate by sturlamolden · Pull Request #5205 · numpy/numpy · GitHub

Circumventing cblas_sgemv segfault in Accelerate #5205


Closed
wants to merge 1 commit into from

Conversation

@sturlamolden
Contributor Author

No bento support yet, I have no idea how bento works...

@matthew-brett
Contributor

Great - thanks for doing that.

There is a fair amount of trailing whitespace in the changes - in case you get a chance to clear that up.

How about adding a test? I don't know how to make the segfault happen reliably - Julian - do you have something effective?

This is a version of the test in the issue:

import numpy as np

def test_npdot_segfault():
    # Test for float32 np.dot segfault
    # https://github.com/numpy/numpy/issues/4007
    m = 10000
    n = 200
    A = np.ones((m, n)).astype(np.float32)
    X = np.ones(n).astype(np.float32)
    # This segfaults about half the time when the bug is present
    np.dot(A, X)

@sturlamolden
Contributor Author

How do we make or run a test for something that will segfault the test suite? Spawn a subprocess?

@matthew-brett
Contributor

I was thinking it would be better to segfault the test-suite than let it creep back accidentally.

@matthew-brett
Contributor

But I guess running the test in another python subprocess would work too, if rather complicated.

@matthew-brett
Contributor

Something like:

import sys
from subprocess import call

from nose.tools import assert_equal

def test_npdot_segfault():
    # Test for float32 np.dot segfault
    # https://github.com/numpy/numpy/issues/4007
    # This segfaults about half the time when the bug is present
    assert_equal(call([sys.executable, '-c',
                       'import numpy as np; '
                       'np.dot('
                       'np.ones((10000, 200)).astype(np.float32), '
                       'np.ones(200).astype(np.float32))']),
                 0)

@@ -20,6 +20,11 @@
#include <limits.h>
#include <stdio.h>

#ifdef APPLE_ACCELERATE_SGEMV_PATCH
#include <pthread.h>
extern pthread_key_t tls_memory_error;
Contributor

You should be able to use NPY_TLS instead of pthreads, though I never tested whether it works on Mac; it does on Linux and Windows.

Contributor Author

NPY_TLS cannot be used on MacOS X.

@juliantaylor
Contributor

This is only needed when the CPU supports AVX. Does clang have something equivalent to GCC's __builtin_cpu_supports("avx")?

/* ----------------------------------------------------------------- */
/* Management of aligned memory */

#define unlikely(x) __builtin_expect(!!(x), 0)
Contributor

We have NPY_UNLIKELY.

@sturlamolden
Contributor Author

MacOS X object files do not support thread-local storage. The storage qualifier __thread (which is what the NPY_TLS macro expands to) is unavailable on the platform; clang does not accept it. That is why I used pthread_setspecific, which on OS X emulates TLS inside the POSIX library. It is quite efficient though, and only needs three instructions. It is quite outrageous that a modern desktop OS does not support thread-local storage, but that is sadly the case here.
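For readers unfamiliar with the pthreads approach, a minimal sketch of the key-based TLS emulation looks like this (the key name tls_memory_error comes from the patch; the once-guard helpers here are illustrative, not the exact code in the PR):

#include <pthread.h>

/* One process-wide key; pthread_getspecific/pthread_setspecific give each
 * thread its own value slot, emulating __thread on OS X. */
pthread_key_t tls_memory_error;
static pthread_once_t tls_key_once = PTHREAD_ONCE_INIT;

static void create_tls_key(void)
{
    /* No destructor: only a small flag-like value is stored. */
    pthread_key_create(&tls_memory_error, NULL);
}

static void set_memory_error(void *flag)
{
    pthread_once(&tls_key_once, create_tls_key);
    pthread_setspecific(tls_memory_error, flag);
}

static void *get_memory_error(void)
{
    pthread_once(&tls_key_once, create_tls_key);
    return pthread_getspecific(tls_memory_error);
}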

@sturlamolden
Contributor Author

I think __builtin_cpu_supports("avx") might be available. Clang supports most GCC extensions. I will try.

Also I did not know about NPY_UNLIKELY, sorry. Which header file should I include?

Currently apple_sgemv_patch.c does not include anything from NumPy. Python.h is only included for Py_intptr_t, since intptr_t is only available in C99. I thought of using long instead, because I know it has the size of a void* on OSX (at least with clang, gcc or icc), but it is not correct C to assume this. I could use npy_intp if I include NumPy.
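For context, the integer type is only needed to test pointer alignment, along these lines (a sketch; the actual macro in the patch may differ):

#include <Python.h>  /* Py_intptr_t, avoiding a hard dependency on C99 intptr_t */

/* The Accelerate AVX kernels need 32-byte aligned data, so flag anything
 * that is not aligned to a 32-byte boundary. */
static int is_misaligned(const void *p)
{
    return ((Py_intptr_t)p & 31) != 0;
}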

@sturlamolden
Contributor Author

If I commit more changes to my code will they appear in this PR, or do I need to do something else?

@njsmith
Member
njsmith commented Oct 18, 2014

Any changes you push into this GitHub branch will appear in the PR. New commits don't trigger any notification messages, though, so it's important to ping the thread if you want people to notice and look at new changes.

@juliantaylor
Contributor

They appear when you push them.

This seems a bit too complicated to put in on the day we want to cut the 1.9.1rc. And I really hate workarounds; fix the issue at its source. Maybe if more people file bugs Apple will do something.
I'd rather disable the use of Accelerate completely if it's such a large issue in practice (though we should also disable OpenBLAS then too; it still happily produces wrong results on my CPU with the latest version ...).

For 1.9.1 maybe we could fix at least some of the more common occurrences of the issue by using an aligned allocator at the core of numpy.
Of course it will not make @njsmith happy, as it will prevent us from using Python's tracemalloc.

@sturlamolden
Contributor Author

With __builtin_cpu_supports("avx") both clang and gcc complain about an unknown builtin. icc compiles the code, but _dotblas dies with an import error because ___builtin_cpu_supports (with three underscores) is not found.

@sturlamolden
Contributor Author

Aha, __builtin_cpu_supports is available in gcc 4.8. If I compile with gcc 4.8.1 it works. However, with icc, clang, or Apple's gcc it is not available.
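A possible fallback when the builtin is missing is to query CPUID directly; both clang and gcc ship <cpuid.h> (whether Apple's old gcc does is an assumption). A sketch; note that a fully correct AVX check would also test OSXSAVE and the XGETBV state bits, which is omitted here:

#include <cpuid.h>

static int cpu_has_avx(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    /* CPUID leaf 1: ECX bit 28 advertises AVX. */
    return (ecx >> 28) & 1;
}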

@sturlamolden
Contributor Author

This seems to work:

#include <stdio.h>
#include <stdlib.h>

static int cpu_supports_avx(void)
{
    char tmp[1024];
    FILE *p;
    size_t count;
    p = popen("sysctl -a | grep AVX", "r");
    if (p == NULL) return -1;
    count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
    pclose(p);
    return (count ? 1 : 0);
}

int main(int argc, char *argv[])
{
    printf("has avx: %d\n", cpu_supports_avx());
    return 0;
}

@sturlamolden
Contributor Author

It seems the sgemv bug has been removed in Yosemite. Either this workaround can be ignored, or we should have a dynamic check for Mavericks as well. Mavericks is the only version of MacOS X where it is present.

That said, I have addressed the issues raised here except for adding the test case suggested by @matthew-brett . NPY_TLS cannot be used, so it must use pthreads.

@matthew-brett
Contributor

The question is, what should we do for shipping OSX wheels?

It seems to me that we will have no choice but to ship Mavericks-compatible wheels at least for the next year.

So then the question is, is it worth keeping something like this in the code-base to make that possible, or not?

I think there are big advantages in terms of performance, and licensing, to using Accelerate. So, as the person responsible for building the OSX wheels, I put in a strong plea to find some way of making this work.

@matthew-brett
Contributor

This code:

https://gist.github.com/zchothia/3078968

successfully detects AVX on my 10.9 laptop and absence of AVX on my 10.6 desktop.
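For what it's worth, the same check can be done in-process with sysctlbyname instead of spawning a shell. A sketch, assuming the hw.optional.avx1_0 sysctl key (worth verifying on the target OS X releases):

#include <sys/types.h>
#include <sys/sysctl.h>

/* Returns 1 if the kernel reports AVX support, 0 otherwise (including when
 * the key does not exist, e.g. on hardware or releases without AVX). */
static int cpu_supports_avx(void)
{
    int has_avx = 0;
    size_t size = sizeof(has_avx);
    if (sysctlbyname("hw.optional.avx1_0", &has_avx, &size, NULL, 0) != 0)
        return 0;
    return has_avx != 0;
}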

@sturlamolden
Contributor Author

This is the code I use to check for AVX and OS version. It is only executed on module import.

#include <stdio.h>
#include <stdlib.h>

static int cpu_supports_avx(void)
{
    char tmp[1024];
    FILE *p;
    size_t count;
    p = popen("sysctl -n machdep.cpu.features | grep AVX", "r");
    if (p == NULL) return -1;
    count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
    pclose(p);
    return (count ? 1 : 0);
}

static int using_mavericks(void)
{
    char tmp[1024];
    FILE *p;
    size_t count;
    p = popen("sw_vers -productVersion | grep 10\\.9\\.", "r");
    if (p == NULL) return -1;
    count = fread((void *)&tmp, 1, sizeof(tmp)-1, p);
    pclose(p);
    return (count ? 1 : 0);
}

int main(int argc, char *argv[])
{
    printf("has avx:   %d\n", cpu_supports_avx());
    printf("mavericks: %d\n", using_mavericks());
    return 0;
}

@sturlamolden
Contributor Author

Added a test case to test_blasdot.py that should always work:

def test_npdot_segfault():
    if sys.platform != 'darwin': return
    from subprocess import call

    # Test for float32 np.dot segfault
    # https://github.com/numpy/numpy/issues/4007

    # This always segfaults when the sgemv alignment bug is present

    script = """
import numpy as np
def aligned_array(N, align, dtype):
    d = dtype()
    tmp = np.zeros(N * d.nbytes + align, dtype=np.uint8)
    address = tmp.__array_interface__["data"][0]
    for offset in range(align):
        if (address + offset) % align == 0: break
    return tmp[offset:offset+N*d.nbytes].view(dtype=dtype)
m = aligned_array(100,15,np.float32)
s = aligned_array(10000,15,np.float32).reshape(100,100)
np.dot(s,m)"""

    assert_equal(call([sys.executable, '-c', script]),0)

    # test the sanity of np.dot after applying patch

    script = """
import numpy as np
from numpy.testing import assert_allclose

for i in range(100):
    m0 = np.random.rand(200)
    s0 = np.random.rand(10000,200)
    desired = np.dot(s0,m0).astype(np.float32)    
    m1 = m0.astype(np.float32)
    s1 = s0.astype(np.float32)
    assert_allclose(np.dot(s1,m1),desired,atol=0.01)    
    m = np.asfortranarray(m1)
    s = np.asfortranarray(s1)
    assert_allclose(np.dot(s,m),desired,atol=0.01) 

    m0 = np.random.rand(200)
    s0 = np.random.rand(200,10000)
    desired = np.dot(s0.T,m0).astype(np.float32)    
    m1 = m0.astype(np.float32)
    s1 = s0.astype(np.float32)
    assert_allclose(np.dot(s1.T,m1),desired,atol=0.01)    
    m = np.asfortranarray(m1)
    s = np.asfortranarray(s1)
    assert_allclose(np.dot(s.T,m),desired,atol=0.01) 

    m0 = np.random.rand(89)
    s0 = np.random.rand(10000,89)
    desired = np.dot(s0,m0).astype(np.float32)    
    m1 = m0.astype(np.float32)
    s1 = s0.astype(np.float32)
    assert_allclose(np.dot(s1,m1),desired,atol=0.01)    
    m = np.asfortranarray(m1)
    s = np.asfortranarray(s1)
    assert_allclose(np.dot(s,m),desired,atol=0.01)"""    

    assert_equal(call([sys.executable, '-c', script]),0)

@sturlamolden
Contributor Author

Pinging @matthew-brett @juliantaylor @njsmith to hear your opinion before I do more work with this.

@matthew-brett
Contributor

I'm generally very keen to be able to use Accelerate on OSX, as y'all know.

Sturla - I thought the plan was to implement sgemv with sgemm if there was likely to be trouble - but here you are making copies of the arrays; was there a problem with the sgemm idea?

@sturlamolden
Contributor Author

That is why I said on the SciPy list that I have a fix, but it requires some small changes to _dotblas.c. With an sgemm-based fix the changes to _dotblas.c would not be required. But if you look at the code, the changes to _dotblas.c are just a tiny part of it. Using sgemm is not likely to substantially reduce the LOC.

The reason why I just copied the array was to save my brain from going up in a ball of smoke. CBLAS allows multiple combinations of array transpose and array order, and this affects the ordering of the array rank and leading dimension arguments. And since we are working with two cblas functions, there are two layers of shuffling of array shape variables to keep under control. Due to the inconsistencies in the design of BLAS it is not symmetrical. I was just lazy and figured it was easier to make an aligned copy of misaligned inputs. I wrote the fix in the middle of the night, and when I tried to get all of the combinations correct my brain refused to cooperate.

But there is nothing wrong with making an aligned temporary copy, and it might actually run faster than sgemm on misaligned data. Copy-in and/or copy-out of misaligned input arrays is an optimization strategy that Fortran compilers often employ for that very reason. And it is also why passing an array to a subroutine call will invalidate Fortran pointers to the array.
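A sketch of the copy-in idea for a contiguous float vector (buffer handling and names here are illustrative, not the patch's exact code):

#include <stdlib.h>
#include <string.h>

/* Make a 32-byte aligned copy of a contiguous float buffer so that the
 * Accelerate AVX kernels never see a misaligned pointer. The caller frees
 * the copy with free() after the BLAS call (copy-in; copy-out would be
 * needed for an output argument). */
static float *make_aligned_copy(const float *x, size_t n)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 32, n * sizeof(float)) != 0)
        return NULL;
    memcpy(buf, x, n * sizeof(float));
    return (float *)buf;
}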

@matthew-brett
Contributor

So - am I right in thinking that using a version of the gemm fix would reduce the impact on numpy quite a bit - and make the patch easier to accept?

I can try and work on this, but not before Monday afternoon pacific time.

@matthew-brett
Contributor

For reference, I think this is the fix that you (Sturla) were discussing with the author of vecLibFort:

https://github.com/mcg1969/vecLibFort/blob/master/static.h#L96

I seem to remember you were worried about the fact that he wasn't checking the alignment of the vector A, but just the matrix X.

@sturlamolden
Contributor Author

His fix does not cover all the possibilities in Fortran BLAS. For example, he thinks the segfault only happens when a transpose flag is set. Also, it does not cover all the failure modes. Furthermore, CBLAS is more complex than Fortran BLAS because it allows for C and Fortran order arrays. The emulated sgemv in vecLibFort also appears simpler than it is because of tons of macro magic. So keep in mind what you compare. The dlopen/dlsym stuff is in vecLibFort too, just hidden in macros. And they are not directly reusable because they also take care of Fortran name mangling. We also cannot use sgemv_ from vecLibFort because it calls cblas_sgemv, and we would create an infinite recursion if we try to replace the latter.

Changing it to use cblas_sgemm is not difficult. But the lines of code will not be substantially reduced. We still need the code to call the original function when the fix is not needed, the code that works out the CPU and OS version, the changes to the build process, the test... You can avoid the 34 extra lines in _dotblas.c, but the other C file will not be substantially smaller. Each combination of arguments to cblas_sgemv will also result in a different sgemm call, so we also need additional tests to cover all of them. Thus it will add to the test suite (not that it is bad). The current code just forwards the arguments and is logically much simpler.
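For reference, the basic mapping from sgemv to sgemm is simple in the contiguous, no-transpose case; it is the full set of order/transpose/stride combinations that multiplies the work. A sketch, assuming row-major storage and incX == incY == 1:

#include <Accelerate/Accelerate.h>  /* or any header declaring cblas_sgemm */

/* y := alpha*A*x + beta*y with A an m-by-n matrix, expressed as a GEMM
 * against a 1-column right-hand side. Every other cblas_sgemv argument
 * combination (column-major, transposed A, non-unit strides) needs its
 * own mapping and its own test. */
static void sgemv_as_sgemm(int m, int n, float alpha, const float *A, int lda,
                           const float *x, float beta, float *y)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, 1, n, alpha, A, lda, x, 1, beta, y, 1);
}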

@sturlamolden
Contributor Author

I still think vecLibFort would benefit SciPy, but not NumPy. In scipy.linalg we only have Fortran BLAS and LAPACK wrapped with f2py.

@matthew-brett
Contributor

If I understand correctly, the big advantage of the sgemm fix is that it has minimal impact on the maintenance burden for our non-OSX friends. Julian - I am guessing you would be happier to accept this if it was essentially just the extra file with the patch in it, with good tests (which I guess would be good for numpy in general), and no edits to _dotblas.c?

A more minor advantage of the sgemm route is the ability to avoid temporary array copies, which could be large (float32 is often used by people trying to save memory).

@sturlamolden
Contributor Author

There is a macro guard which turns the hack into nothing unless the symbol APPLE_ACCELERATE_SGEMV_PATCH is defined. It is only defined when NumPy is configured to link with Accelerate on OSX. See the proposed changes to distutils/system_info.py. @charris

@sturlamolden
Contributor Author

The absolute path to Accelerate is also hardcoded in NumPy's setup script, so I don't think it is much worse to hardcode it in this workaround.

@juliantaylor
Contributor

fine, the branch needs squashing into one commit though

@sturlamolden
Contributor Author

How do I do that? Assume I'm an idiot with git.

@juliantaylor
Contributor
git rebase -i origin/maintenance/1.9.x

In the file it gives you to edit, replace all picks except the first with f (pro tip: have an editor that can do block editing) and update the commit message with something meaningful (commit --amend).

@charris
Member
charris commented Oct 22, 2014

I'm hoping Apple fixes this up in a 10.9 upgrade and we can stop worrying about it for numpy 1.10.

@sturlamolden
Contributor Author

I hope Apple takes care of this too. We shouldn't need this fix.

@juliantaylor
Contributor

Wasn't 10.9.5 the last 10.9 update? If so, I doubt it will still get fixed.

@sturlamolden
Contributor Author
$ git rebase -i origin/maintenance/1.9.x
Successfully rebased and updated refs/heads/maintenance/1.9.x.

This is the file I got

noop

# Rebase 245c362..245c362 onto 245c362
#
# Commands:
#  p, pick = use commit
#  r, reword = use commit, but edit the commit message
#  e, edit = use commit, but stop for amending
#  s, squash = use commit, but meld into previous commit
#  f, fixup = like "squash", but discard this commit's log message
#  x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out



git push origin says everything is up to date. Didn't seem to work though.

@sturlamolden
Contributor Author

Or perhaps it did?

@charris
Member
charris commented Oct 22, 2014

You still need to squash it by replacing the pick labels with f on all but the first, and with r on the first, then write and quit. Do that before rebasing on 1.9.x, with something like

git rebase -i HEAD~n

where n is the number of commits you want to modify. After all that you will need to force push to origin.

@juliantaylor
Contributor

Is origin the numpy remote or your own? You have to rebase onto the original numpy branch.
Are you on the branch you have the changes on?

@sturlamolden
Contributor Author

origin is my own remote.

@charris
Member
charris commented Oct 22, 2014

Yeah, looks like you aren't on a working branch. That's a no-no ;) Hmm, you can fix that up, but you will probably need to make another PR and clean up your origin. First, while in maintenance/1.9.x, make a new branch with git branch <new-branch-name>; that will be your working branch. Now do git reset --hard <sha of first commit not yours> to clean up maintenance/1.9.x, and force push that with git push --force origin maintenance/1.9.x; that will clean up origin. Now check out your working branch and proceed.

@matthew-brett
Contributor

Sturla - I can do this for you if you like - let me know?

@matthew-brett
Contributor

Also - might be a good idea to:

diff --git a/numpy/core/tests/test_blasdot.py b/numpy/core/tests/test_blasdot.py
index aa8bbd1..6b5afef 100644
--- a/numpy/core/tests/test_blasdot.py
+++ b/numpy/core/tests/test_blasdot.py
@@ -199,7 +199,7 @@ def test_npdot_segfault():
         return aligned

     def assert_dot_close(A, X, desired):
-        assert_allclose(np.dot(A, X), desired, rtol=1e-5, atol=1e-9)
+        assert_allclose(np.dot(A, X), desired, rtol=1e-5, atol=1e-7)

     m = aligned_array(100, 15, np.float32)
     s = aligned_array((100, 100), 15, np.float32)

@juliantaylor
Contributor

No need to do anything fancy, just rebase against the right branch. The super-manual way:

git remote add correctorigin git://github.com/numpy/numpy
git fetch correctorigin
git checkout  <the branch with your changes>
git rebase -i correctorigin/maintenance/1.9.x
git push -f origin <the branch with your changes>

but we can also do that for you if you have further trouble

@sturlamolden
Contributor Author

Gosh, it actually worked... :-)

@charris
Member
charris commented Oct 22, 2014

You should expand your commit message with an explanation of what you have done. You can use git rebase -i HEAD^ and replace pick with r (reword). Add a blank line after the summary, then the explanation. The first line might also begin with BUG:. You will need to force push again after that.

You are still not in a working branch, which will cause synchronization problems if you do another PR, but we can live with it here.

SGEMV in Accelerate framework will segfault on MacOS X version 10.9
(aka Mavericks) if arrays are not aligned to 32 byte boundaries
and the CPU supports AVX instructions. This can produce segfaults
in numpy.dot if we use numpy.float32 as dtype. This patch overshadows
the symbols cblas_sgemv, sgemv_ and sgemv exported by Accelerate
to produce the correct behavior. The MacOS X version and CPU specs
are checked on module import. If Mavericks and AVX are detected
the call to SGEMV is emulated with a call to SGEMM if the arrays
are not 32 byte aligned. If the exported symbols cannot be
overshadowed on module import, a fatal error is produced and the
process aborts. All the fixes are in a self-contained C file
and do not alter the _dotblas C code. The patch is not applied
unless NumPy is configured to link with Apple's Accelerate
framework.
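To illustrate the "call the original when the fix is not needed" part described in the commit message above, the dlopen/dlsym lookup is roughly as follows (a sketch; the error handling and the hardcoded framework path mirror the description but may not match the patch line for line):

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Signature simplified: the real declaration uses the CBLAS enum types. */
typedef void (*sgemv_fn)(int order, int trans, int m, int n, float alpha,
                         const float *A, int lda, const float *x, int incx,
                         float beta, float *y, int incy);

static sgemv_fn accelerate_cblas_sgemv = NULL;

static void load_accelerate_sgemv(void)
{
    void *lib = dlopen(
        "/System/Library/Frameworks/Accelerate.framework/Accelerate",
        RTLD_NOW | RTLD_LOCAL);
    if (lib != NULL)
        accelerate_cblas_sgemv = (sgemv_fn)dlsym(lib, "cblas_sgemv");
    if (accelerate_cblas_sgemv == NULL) {
        /* Fatal error on module import, as the commit message describes. */
        fprintf(stderr, "could not resolve cblas_sgemv from Accelerate\n");
        abort();
    }
}
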
@juliantaylor
Contributor

I'd prefer to keep it here to not spread the discussion to even more places

@sturlamolden
Contributor Author

Ok, I posted a new PR from a working branch as @charris suggested. But reopen and close the other if you prefer.

@sturlamolden
Contributor Author

Does this look ok now, or is there anything else I need to do?

#5223

@matthew-brett
Contributor

Looks fine to me...

@sturlamolden
Contributor Author

There is some leftover cruft after the switch from popen to system. The checks for a return value less than 0 are no longer needed. They are benign, but they don't have to be there anymore. I will get rid of them.

@sturlamolden
Contributor Author

Ok, now that is taken care of in #5223.
