MAINT: Figure out alignment for complex loops #17359

Open

seberg opened this issue Sep 21, 2020 · 14 comments
seberg commented Sep 21, 2020

I am mostly creating this issue in the hope of a short discussion. Maybe @ahaldane you have an opinion on this?

Currently, we have the alignment on dtypes, which works fine. But the dtype copy code sometimes actually uses the full itemsize. This is relevant only for complex data types as far as I can tell, because np.complex64 has 32-bit alignment, since that is the alignment of the embedded floats.
In some cases the copy code currently uses a uint64 to copy a complex64, which is convenient because we can reuse it for all types. But it breaks the alignment assumption and requires us to check against both alignments (although in most cases they match).
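To make the mismatch concrete, a dtype's advertised alignment and itemsize can be inspected directly; the printed values are typical for x86-64 and may differ on other platforms:

import numpy as np

c64 = np.dtype(np.complex64)
print(c64.alignment)                   # 4: the alignment of the embedded float32
print(c64.itemsize)                    # 8: what a uint64-based copy implicitly assumes
print(np.dtype(np.uint64).alignment)   # 8: stricter than the complex64 alignment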

It would be nice to solve this in the long run. My main issue is that it is confusing that alignment is usually 32-bit but sometimes 64-bit here. Possible solutions:

  1. Define the alignment as 64-bit for this complex dtype, even though that is much stricter than usually necessary.
  2. Specialize complex copy code so that it cannot have any alignment issues (does not use uint64 internally).
  3. Signal the alignment specific to a certain functionality. This is possible, but annoying since we need the alignment requirement before getting the final inner-loop function, because we would like to set up buffering first.

There might be one other "middle" ground: require complex to provide a 32-bit aligned copy function, but actually use the current uint64 copy function as a fast path (because at that point, we already know that just copying the data is sufficient).
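As a rough sketch of that middle ground (the helper names here are hypothetical, just to illustrate the dispatch):

def get_complex64_copy_loop(src_is_8byte_aligned, dst_is_8byte_aligned):
    # Hypothetical sketch: the uint64 copy is only a fast path, chosen
    # when both pointers happen to be 8-byte aligned; the fallback only
    # relies on the dtype's own 4-byte (float32) alignment.
    if src_is_8byte_aligned and dst_is_8byte_aligned:
        return copy_as_uint64       # assumed fast-path helper
    return copy_as_two_float32s     # assumed always-valid helper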

seberg commented Sep 21, 2020

Or, we could embrace the two meanings of alignment and flag both uint and itemsize alignment, at least when the itemsize is a power of two.

@eric-wieser

Or, we could embrace the two meanings of alignment,

At this point, a size_t alignment might be simplest - tracking alignment and size completely separately is consistent with how alignof and sizeof work in C.
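NumPy already exposes the two quantities separately on descriptors; for an aligned structured dtype they behave like alignof and sizeof of the equivalent C struct (values typical for x86-64):

import numpy as np

dt = np.dtype([('a', np.int8), ('b', np.complex64)], align=True)
print(dt.alignment)  # 4: the strictest field alignment, like alignof in C
print(dt.itemsize)   # 12: includes the padding after 'a', like sizeof in C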

seberg commented Sep 21, 2020

@eric-wieser we do have alignment stored on the descriptor (albeit as an int, not that it matters much).

Embracing both means we would need two flags, though: PyArray_ISALIGNED and PyArray_IS_ITEMSIZE_ALIGNED.
After some more thought, I do think that is probably the most reasonable solution.

The tricky part is indicating the required alignment of an inner-loop. Although maybe it is less important: we could be fine with "aligned" only indicating the normal alignment, but still signal both alignments when fetching the optimal inner-loop. As a reminder, the NEP 42/43 setup is something like:

class ArrayMethod:  # not really a class
    flags = NPY_METH_SUPPORTS_UNALIGNED

    def resolve_descriptors(self, given_descriptors, loop_descriptors):
        """Find the correct descriptors for the operation (an additional cast may be required)."""
        pass

    def get_loop(self, int aligned, ...):
        """Find the optimal loop, similar to our current PyArray_GetDTypeTransferFunction."""
        pass

(not that I want to make that public right away.)

The problem is that the alignment requirement should be checked before get_loop: if the data is unaligned (and that is not supported), buffering is necessary, and buffering can mean that a contiguous loop can be used instead of a strided one.

Now, I think this is fine for now: NPY_METH_SUPPORTS_UNALIGNED will just be based on the normal alignment (our complex casting rules do support unaligned data), but the int aligned that is passed in could actually flag both PyArray_ISALIGNED and PyArray_IS_ITEMSIZE_ALIGNED.
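A minimal sketch of how both could be packed into the single aligned argument (the flag values are hypothetical, not NumPy API):

# hypothetical bit values, for illustration only
NORMAL_ALIGNED = 1 << 0    # would mirror PyArray_ISALIGNED
ITEMSIZE_ALIGNED = 1 << 1  # would mirror PyArray_IS_ITEMSIZE_ALIGNED

def pack_aligned(is_aligned, is_itemsize_aligned):
    aligned = 0
    if is_aligned:
        aligned |= NORMAL_ALIGNED
    if is_itemsize_aligned:
        aligned |= ITEMSIZE_ALIGNED
    return aligned  # passed to get_loop() in place of the old 0/1 value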

@ahaldane

I have a little writeup on alignment from the last time I worked on it here. (The formatting of the little table seems screwed up; raw version here.)

I think the bug/problem you're talking about is the one I mention in the very last paragraph: the copy code and cast code are deeply intertwined, so both need to check for both types of alignment. If we could untangle them, then copy code would only need to check for "uint alignment" and cast code only for "true alignment".

@ahaldane

I think these are the problems with the options:

  1. Define the alignment as 64-bit for this complex dtype, even though that is much stricter than usually necessary.

--> This would make structured datatypes with complex fields have incorrect padding/alignment, usually mismatched with C structs, which is a problem. It is also not backward compatible, for the same reason.

  2. Specialize complex copy code so that it cannot have any alignment issues (does not use uint64 internally).

--> Possible, but I believe the uint64 copy gives a good speedup. (In fact, were you the one who wrote this optimization?)

  3. Signal the alignment specific to a certain functionality. This is possible, but annoying since we need the alignment requirement before getting the final inner-loop function, because we would like to set up buffering first.

This is similar to the approach I took by defining two types of alignment, uint-alignment and true-alignment, except I didn't have the energy to go all the way and properly split up the alignment checks. The issue, as I recall, is that the current code re-uses the buffer setup code for both copying and casting.

I think it should be possible, though perhaps tedious, to untangle the strided-copy setup code from the strided-cast setup code. This would optimize/simplify the alignment checks.

Out of curiosity, is there a particular problem you want to fix, or is this out of interest in code cleanliness? If there is a specific problem it might help focus what we want to do about it.

seberg commented Sep 22, 2020

@ahaldane the point is user-dtypes, so that part is a bit fuzzy. To get to user-dtypes, I want to reorganize everything so that the dtypes are not hard-coded anymore. Further, I want to organize/group things better into array-methods like the one I sketched above.
One thing is that in the end, ufuncs and casting should look the same (at the core, not the dispatching).

Now, I am trying not to rewrite everything, but basically I am in a position to at least set the stage for your last sentence:

If there is ever a big rewrite of this code it would be good to allow them to use different alignments.

but I am not sure I want to handle copy (or copyswap) distinctly, except as a fast path. So, to keep the uint alignment for copy, the casts and ufuncs also need access to it.

seberg commented Sep 22, 2020

Oh, one interesting thing is that float128 is currently only uint/copy-aligned at 8 bytes. But I am not sure that it matters much?

I really like the ITEMSIZE_ALIGNED flag line of thought. It is very clear what that means, and for all practical purposes I assume that ITEMSIZE_ALIGNED is, at least currently, always stricter than ALIGNED (so passing a flag with both, where we previously passed int aligned, can't break things due to some oversight).

Additionally, adding a flag on the array object for that should be painless as well, meaning that we don't have to worry about duplicating the logic (which is currently confined to casting), e.g. for ufuncs.
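A quick check that the itemsize alignment is indeed at least as strict for the builtin types on a typical platform, since each itemsize there is a multiple of the alignment:

import numpy as np

for t in [np.int8, np.int32, np.float64, np.complex64, np.complex128]:
    dt = np.dtype(t)
    # the itemsize is a multiple of the alignment, so a pointer that is
    # itemsize-aligned is automatically aligned in the normal sense too
    assert dt.itemsize % dt.alignment == 0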

seberg commented Sep 22, 2020

@seiko2plus curious if you have any thoughts on flagging array alignment with respect to SIMD loops/machinery? I suppose it probably needs peeling regardless, and an array-wide flag would likely not be of too much use?

@ahaldane

The computations to determine whether an array and its strides are aligned are not that complex, so potentially you could recompute them on the fly instead of storing a flag. It depends on how often you recompute.

The relevant functions are raw_array_is_aligned, with a for loop over the strides, and npy_uint_alignment, with a switch statement.
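A Python sketch of what those two functions do (approximate; the real implementations live in NumPy's C internals):

import numpy as np

def uint_alignment(itemsize):
    # roughly npy_uint_alignment: the alignment of the unsigned int type
    # a copy loop of this itemsize would use (0 if no single uint fits)
    uints = {1: np.uint8, 2: np.uint16, 4: np.uint32, 8: np.uint64}
    if itemsize not in uints:
        return 0
    return np.dtype(uints[itemsize]).alignment

def raw_array_is_aligned(arr, alignment):
    # roughly raw_array_is_aligned: the base pointer and every stride
    # must be a multiple of the requested alignment
    if alignment <= 1:
        return True
    if arr.ctypes.data % alignment != 0:
        return False
    return all(s % alignment == 0 for s in arr.strides)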

seberg commented Sep 22, 2020

@ahaldane true, the computation may be fine to simply redo.

The problem is how to do two things:

  1. Signal the alignment requirements for a specific functionality (e.g. ufuncs and most casts require normally aligned data) to NumPy/the iterator.

    • We need this, because most ufuncs require aligned data, so the iterator has to use buffering to ensure it. But copyswap functions do not require aligned data (so the iterator shouldn't do useless copying).
  2. The iterator must signal the provided alignment to the complex copy+copyswap "functions" (which are dynamically chosen):

    • We need this, because we want to use a more optimized function if the data is aligned.

Now the second step is currently PyArray_GetDTypeTransferFunction, which is, at this time, only passed a single flag. And that single flag means we actually pass is_aligned && is_uint_aligned. There are other options:

  • We try to pass out the specific alignment requirement before the iterator is set up fully (this feels like too much complexity though).
  • We pass the full stride information, so that PyArray_GetDTypeTransferFunction() can calculate the alignment itself. That, again, feels a bit too complicated?
  • We ignore this (for now), only flag normal alignment and add a fast-path for simple copies if necessary.

mattip commented Sep 22, 2020

most ufuncs require aligned data

Is this still true? I thought ARM and x86 floats are now not that much slower when they are not aligned.

seberg commented Sep 22, 2020

@mattip, true, but we do still support Sparc? There is also a comment in one place that compiler optimization leads to the use of (SIMD?) instructions, so that the inner-loop functions/cast-functions actually do require alignment even on those architectures.

seiko2plus commented Sep 22, 2020

@seberg,
There's no need to specialize SIMD loops to handle aligned memory access, except for a few cases that may require non-temporal memory hints, which can be achieved by peeling as you mentioned.

However, on Arm7 and earlier, pointers should be aligned on natural size boundaries; for example, LDRD and STRD doubleword (64-bit) data transfers must be 8-byte aligned on Arm6 and 4-byte aligned on Arm7, otherwise the CPU raises an alignment exception (SIGBUS).

seberg commented Nov 3, 2020

After thinking about this more, my current preference is storing the power-of-two alignment in the array flags. As far as I can tell that would use up about 3 bits, which we should have available (2**(2**3-1) == 128 byte alignment is probably a decent maximum). We would still have to store the current aligned flag for backcompat (and convenience).

Such an alignment could probably also be useful when passed into structured dtypes.
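A sketch of that encoding (the bit position is hypothetical; the point is only that a 3-bit log2 covers alignments up to 128 bytes):

ALIGN_SHIFT = 10                  # assumed free bit position, illustration only
ALIGN_MASK = 0b111 << ALIGN_SHIFT

def pack_alignment(flags, alignment):
    # store log2(alignment) in 3 bits; powers of two up to
    # 2**(2**3 - 1) == 128 bytes fit
    exp = alignment.bit_length() - 1
    assert 1 << exp == alignment and exp < 8
    return (flags & ~ALIGN_MASK) | (exp << ALIGN_SHIFT)

def unpack_alignment(flags):
    return 1 << ((flags & ALIGN_MASK) >> ALIGN_SHIFT)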

As to "it is fast to re-calculate the alignment": that may be true, but at least right now it is not that fast (though the current code could likely be quite a bit faster); currently it can take ~10% of a very small cast/copy (and there are a lot more things there that could probably be sped up).
