-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
MAINT: Figure out alignment for complex loops #17359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Or, we could embrace the two meanings of alignment, and flag both |
At this point, a |
@eric-wieser we do have alignment stored on the descriptor (albeit an int, not that it matters though). Embracing both, means that we should have two flags: The tricky part is indicating the required alignment of an inner-loop. Although, maybe it is less important: We could be fine with "aligned" only indicating the normal alignment. But still signal both alignments when fetching the optimal inner-loop. As a reminder, the NEP 42/43 setup is somethign like:
(not that I want to make that public right away.) The problem is, the alignment requirement should be checked before Now, I think for now this is fine, |
I have a little writeup on alignment from the last time I worked on it here. (the formatting of the little table seems screwed up, raw version here). I think the bug/problem you're talking about is the one I mention in the very last paragraph, that the copy code and cast code are deeply intertwined so both need to check for both types of alignment. If we could untangle them, then copy code would only need to check for "uint alignment" and cast code only for "true alignment". |
I think these are the problems with the options:
--> This would make structured datatypes with complex fields have incorrect padding/alignment, and be usually mismatched from C structs, which is a problem. Also not back-compatible for this reason.
--> possible, but I believe it gives a good speedup (In fact, were you the one who wrote this optimization?)
This is similar the approach I took by defining two types of alignment, uint-alignment and true-alignment, except I didn't have the energy to go all the way and properly split up the alignment checks. The issue as I recall is that the current code re-uses the buffer setup code for both copying and casting. I think it should be possible, though perhaps tedious, to untangle the strided-copy setup code from the strided-cast setup code. THis would optimize/simplify the alignment checks. Out of curiosity, is there a particular problem you want to fix, or is this out of interest in code cleanliness? If there is a specific problem it might help focus what we want to do about it. |
@ahaldane the point is user-dtypes, so that part is a bit fuzzy. To get to user-dtypes, I want to reorganize everything so that the dtypes are not hard-coded anymore. Further, I want to organize/group better into such array-methods as I wrote above. Now, I am trying to not rewrite everything, but basically, I am in the position to at least set the stage for your last sentence:
but, I am not sure I want to handle copy (or copyswap) distinctly, unless as a fast path. So, to keep the the uint alignment for copy, the casts and ufuncs also need access to it. |
Oh, one interesting thing is the I really like the Additionally, adding a flag on the array object for that should be painless as well meaning that we don't have to worry about duplicating the logic (which is currently confined to casting) e.g. for ufuncs. |
@seiko2plus curious if you have any thought on flagging array alignment with respect to SIMD loops/machinery? I suppose it probably needs to be peeling regardless, and an array-wide flag would likely not be of too much use? |
The computations to determine whether an array+strides are aligned are not that complex, so potentially you could recompute it on the fly instead of storing a flag. Depends how often you recompute. The relevant functions are |
@ahaldane true, the computation may be fine to simply redo. The problem is how to do two things:
Now the second step is currently
|
Is this still true? I thought ARM and x86 floats are now not that much slower when they are not aligned. |
@mattip, true, but we do still support Sparc? There is also a comment in one place that the compiler optimization leads to use of (SIMD?) instructions so that the inner-loop functions/cast-functions actually do require alignment even on those architectures. |
@seberg, However, on Arm7 and earlier the pointers should be aligned on natural size boundaries, |
After thinking about this more, my current preference is that storing the power-of-two alignment in the array flags. As far as I can tell that would use up about 3 bits, which we should have available ( Such alignment could also be useful passed into structured dtypes probably. As to the "it is fast to re-calculate the alignment". That may be true, but at least right now it is not that fast (but, the current code could likely be quite a bit faster), right now it can take ~10% of a very small cast/copy (and there are a lot more things that probably could be sped up there). |
I am mostly creating this issue for the hope of a short discussion. Maybe @ahaldane you have an opinion on this?
Currently, we have the
alignment
on dtypes, which works fine. But the dtype copy code actually uses the full itemsize sometimes. This is relevant only for complex data types as far as I can tell, becausenp.complex64
has a 32-bit alignment, since that is the alignment of the embedded floats.The copy code in some cases will use
uint64
to copy acomplex64
currently, which is convenient, because we can reuse this for all types. But, it breaks alignment and requires us to check against both alignments (although in most cases they match).It would be nice to solve this in the long run. My main issue is that it is confusing that alignment is usually 32-bit but sometimes 64-bit here. Possible solutions:
uint64
internally).There might be one other "middle" ground: Require complex to provide a 32bit aligned copy function, but actually use the current
uint64
copy function as a fast-path (because at that point, we already know that just copying the data is sufficient).The text was updated successfully, but these errors were encountered: