10000 BUG: better detect memory overlap when calling gufuncs by mattip · Pull Request #11381 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: better detect memory overlap when calling gufuncs #11381

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

mattip
Copy link
Member
@mattip mattip commented Jun 19, 2018

Ufuncs need to check output args for memory overlap. In PyUFunc_GeneralizedFunction, the check assumed the gufunc was operating elementwise.

Found when I started implementing matmul as a gufunc. We seem to have avoided this issue by not having an out kwarg in our current gufuncs (i.e. linalg, see issue #11380).

This may impact downstream users of gufuncs who already try to detect this condition, but brings gufuncs into alignment with ufuncs and the comment in the code "For generalized ufuncs, we can't do buffering, so must COPY or UPDATEIFCOPY".

Test added that failed before, now passes. An additional passing test added for matmul, which currently uses einsum for implementation, but that will change in the future.

@mattip
Copy link
Member Author
mattip commented Jun 19, 2018

Looking through scipy, astropy I find no use of PyUFunc_FromFuncAndDataAndSignature (a signature is needed to trigger this issue). The public facing Numba repo does not use it via c, and exposes it to python via _internal.fromfunc which is wrapped by build_ufunc which does not call the variant with the gufunc signature.

@mattip
Copy link
Member Author
mattip commented Jun 20, 2018

Found and fixed a bug in 4c75113 where ret was not precisely out in einsum if there is overlap. ret was the temporary ndarray created by writeback semantics, not the original out ndarray. If they have the same dtype the values are guaranteed to be the same, but they are not the same entity which wastes memory.

This could be a separate PR but they both come under the heading of gufuncs and memory overlap.

NPY_ITER_ALIGNED |
NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE;
NPY_ITER_ALIGNED;
if (op[i] && PyArray_NDIM(op[i]) < 2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's special about 0d / 1d arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My fix was wrong. Any element-wise gufunc (one with an empty signature '(),()->()' will not hit this code path, it will not have core_enabled. Removing this flag.

Copy link
Contributor
@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 8000 catch - a copy & paste error from quite a while ago, I suspect! Looks good modulo the single-element thing you're already removing.

@pv
Copy link
Member
pv commented Jun 21, 2018

One can also argue this is a bug in the matmul ufunc implementation, and the NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE is as intended. I think there is a design decision that ufunc inner loops are supposed to be able to deal with overlapping operands/aliasing, which is certainly assumed in the scalar case. This is of course more difficult to deal with in the gufunc case (I don't recall whether the inner loop core dimensions are contiguous in memory or not).

I suspect (without trying) that matmul.accumulate and reduce don't work correctly currently, which would demonstrate it's a bug in the inner loop implementation. EDIT: I see now reduce/accumulate raise an error these operations are not defined for gufuncs, fair enough, although the reason why is not obvious at first glance.

Anyway, I'd prefer to make an explicit design decision here, rather than considering this a bugfix.

@mhvk
Copy link
Contributor
mhvk commented Jun 21, 2018

The inner loops just get access to the arrays; the loop dimensions do not have to be contiguous.

In principle, the inner loop indeed should check whether in and out overlap.

Indeed, I just wrote code that explicitly uses the fact that in and out can be the same to speed up the loop... I hadn't quite appreciated that this change might make that much less efficient...

@mattip
Copy link
Member Author
mattip commented Jun 21, 2018

Closing this (I will refactor the einsum issue to a different PR), but I think that code is copy-pasted and wrong.

The documentation seems to strengthen my claim, and a couple of other comments:

  • We are duplicating effort. Since the flags is unconditionally set, the gufunc call tries to prevent copying, and consumers of the code who need to detect overlap will have to run that check in reverse and invent a mechanism to allocate new memory.
  • The C code that check the overlap states the following, which is even clearer than the documentation
           * Operands with NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE enabled are not
           * considered overlapping if the arrays are exactly the same. In this
           * case, the iterator loops through them in the same order element by
           * element.  (As usual, the user-provided inner loop is assumed to be
           * able to deal with this level of simple aliasing.)
    
    it cannot be assumed that generalized ufuncs "loop through in the same order element by element", that is the job of simple ufuncs
  • The only public heavy consumer of gufuncs, linalg, makes no effort to deal with overlaps, while making a large effort to deal with non-contiguous or wrongly-ordered data. On the other hand, the python wrappers for the actual _umath_linalg do not expose the out argument so the lack does not mean much
  • You can always add the flag to the gufunc->op_flags, but cannot remove it since the flags are only additive. This claim too is weak since there is actually no official interface to set those flags, one would have to reach inside the C struct and set them.

@mattip
Copy link
Member Author
mattip commented Jun 21, 2018

@mhvk can you share your gufunc? I am curious how I could turn it into a test and rewrite it to not depend on this IMO buggy behaviour.

@njsmith
Copy link
Member
njsmith commented Jun 22, 2018

it cannot be assumed that generalized ufuncs "loop through in the same order element by element", that is the job of simple ufuncs

Whenever possible, I think it'd be good to move towards unifying the gufunc/ufunc semantics and code paths.

Also, I think there may be cases where gufuncs want to allow aliasing and handle it specially? For example, consider a sort gufunc with signature (n)->(n). If the input and outputs are the same, it can perform the sort in-place, which will be more efficient than allocating a temporary buffer, copying the input into it, sorting it, and then copying back into the output (= input) buffer.

It's also entirely possible that at some point we'll have simple ufuncs where handling aliasing is difficult -- e.g. think about a ufunc where the scalars are some complex multi-field dtype.

So I lean towards saying that we should handle gufuncs and ufuncs the same way here... which might include adding some flag that a loop can use to say that it doesn't handle aliasing, and ask numpy to take care of that for it using common machinery?

@mhvk
Copy link
Contributor
mhvk commented Jun 22, 2018

@mattip - it is exactly the case of identical in and out that may need to be preserved (though your point about being able to add but not subtract a flag is taken). The PR that I mentioned is astropy/astropy#7502, but that may not help you much since it uses a rather complicated jinja2 template to help generate code for a lot of functions - but some of those functions act in-place and this is where the ability to give in=out helps. A specific example where this is checked is:

static void ufunc_loop_aper(
    char **args, npy_intp *dimensions, npy_intp* steps, void* data)
{
    npy_intp i_o;
    npy_intp n_o = *dimensions++;
    char *theta = *args++;
    npy_intp s_theta = *steps++;
    char *astrom_in = *args++;
    npy_intp s_astrom_in = *steps++;
    char *astrom = *args++;
    npy_intp s_astrom = *steps++;
    double (*_theta);
    eraASTROM (*_astrom);
    for (i_o = 0; i_o < n_o;
         i_o++, theta += s_theta, astrom += s_astrom, astrom_in += s_astrom_in) {
        _theta = ((double (*))theta);
        _astrom = ((eraASTROM (*))astrom);
        if (astrom_in != astrom) {
            memcpy(astrom, astrom_in, 1*sizeof(eraASTROM));
        }
        eraAper(*_theta, _astrom);
    }
}

@mattip
8000 Copy link
Member Author
mattip commented Jun 24, 2018

there should be a point in the ufunc call that a ufunc can say "element-wise aliasing is OK", or "leave all aliasing to the inner loop" or "always copy-on-overlap"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0