ENH: add inplace cases to fast ufunc loop macros #7999
Conversation
I always wondered why inplace didn't help speed much. Great to see it addressed (even if I cannot really comment on the code).
Heh, the code is "just the same" with some extra branches that all do the same thing but help the compiler create an optimized version of the branch. Though we should check quite carefully since it might stretch the tests... Do you want to create a release note?
    tout * out = (tout *)op1; \
    op; \
}
#define BASE_BINARY_LOOP_S(tin, tout, cin, cinp, vin, vinp, op) \
Jeesh, this macro argument passing is starting to give me the creeps ;). By now I think it's correct, but it could maybe use a comment here to explain things a bit more in depth. I am not used to such macros, and it took me a bit to figure out that in1 always ends up pointing to the correct args[0], etc.
Or maybe it is just one of those things you have to stare at until it makes sense...
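The argument-binding trick under discussion can be illustrated with a standalone sketch. This is a hypothetical, simplified reconstruction, not the actual NumPy macro (the real one takes strides and a count from the ufunc machinery): the loop body `op` always refers to `in1` and `in2`, and the macro arguments decide which operand is the loop-invariant scalar and which is strided, so `in1` always ends up bound to the data from `args[0]` no matter which side the scalar is on.

```c
#include <stddef.h>

#define N 4  /* fixed element count to keep the sketch small */

/* Hypothetical simplified _S ("scalar") loop: cin is the name of the
 * loop-invariant scalar loaded once from cinp, vin is the name of the
 * strided value loaded from vinp each iteration.  op refers to in1,
 * in2 and out regardless of which operand is the scalar. */
#define BASE_BINARY_LOOP_S(tin, tout, cin, cinp, vin, vinp, op) \
    const tin cin = *(tin *)(cinp);                             \
    for (size_t i = 0; i < N; i++) {                            \
        const tin vin = ((tin *)(vinp))[i];                     \
        tout *out = ((tout *)op1) + i;                          \
        op;                                                     \
    }

/* args[0] holds a single scalar, args[1] an array: bind in1 to the
 * scalar slot, in2 to the strided slot. */
static void
sub_scalar_first(char *args[3])
{
    char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];
    BASE_BINARY_LOOP_S(int, int, in1, ip1, in2, ip2, *out = in1 - in2)
}

/* args[1] holds the scalar: swap the macro arguments -- in1 is still
 * the value coming from args[0]. */
static void
sub_scalar_second(char *args[3])
{
    char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];
    BASE_BINARY_LOOP_S(int, int, in2, ip2, in1, ip1, *out = in1 - in2)
}
```

Both expansions stamp out the identical body `*out = in1 - in2`; only the load sites of `in1`/`in2` differ, which is why swapping the macro arguments rather than the body keeps the operand order correct.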
LGTM in general. Do we have basic tests which exercise all of these branches in one way or another? The only thing that worried me a little for a bit was somehow ending up with in1 and in2 swapped (which would not show up in all of the ufuncs).
the macro is quite ugly, I thought about just keeping the original code, but the macro makes it clear that the code in all branches is exactly the same.
I don't think it needs release notes; these are only used for the integer and bitwise ops, and I doubt the latter are often written inplace. The float loops were already fine as they are manually optimized (good compilers are probably common enough now, as we dropped 2.6 and thus redhat6, that we could also autovectorize those). But this becomes very significant if the temporary elision branch is merged, as that will turn a lot more operations into inplace operations.
(force-pushed from ef9ad9f to 291ec29)
added explicit tests via the code used to test the float loops too, should cover more than necessary (assuming non-buggy compilers)
Both gcc and clang don't automatically specialize the inplace case, so add extra conditions to the loop macros to get the compilers to emit decent code. Without them inplace code ends up much slower than the out of place code.
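The commit message's point can be made concrete with a minimal sketch (illustrative code only, not the actual NumPy loops; function and variable names here are made up). In the generic strided loop the compiler cannot prove unit strides or the aliasing relationship between output and inputs, so it emits conservative scalar code; an explicit branch for the contiguous in-place case gives it a loop it can vectorize.

```c
#include <stddef.h>

/* Generic strided binary add: handles every layout, but the unknown
 * strides and possible aliasing between op1 and ip1/ip2 block
 * vectorization. */
static void
add_loop_generic(char *ip1, char *ip2, char *op1,
                 ptrdiff_t is1, ptrdiff_t is2, ptrdiff_t os, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        *(int *)op1 = *(int *)ip1 + *(int *)ip2;
        ip1 += is1;
        ip2 += is2;
        op1 += os;
    }
}

/* Dispatcher with an extra condition for the contiguous in-place case:
 * once the compiler sees unit strides and out == in1 it can specialize
 * and vectorize the branch. */
static void
add_loop(char *ip1, char *ip2, char *op1,
         ptrdiff_t is1, ptrdiff_t is2, ptrdiff_t os, size_t n)
{
    if (op1 == ip1 &&
        is1 == (ptrdiff_t)sizeof(int) &&
        is2 == (ptrdiff_t)sizeof(int) &&
        os == (ptrdiff_t)sizeof(int)) {
        int *a = (int *)op1;
        const int *b = (const int *)ip2;
        for (size_t i = 0; i < n; i++) {
            a[i] += b[i];   /* specialized in-place body: a += b */
        }
    }
    else {
        add_loop_generic(ip1, ip2, op1, is1, is2, os, n);
    }
}
```

Both branches compute the same thing; the duplication exists purely so the optimizer can prove the layout in the fast branch, which is the same idea the PR applies inside the loop macros.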
(force-pushed from 249460a to c46f6d0)
Due to the lambdas used it didn't actually generate inplace cases. Fix a few tests that now break due to assuming that the cases are always out of place. Also update some numbers to make it more likely to find issues like loading from the wrong array.
(force-pushed from c46f6d0 to d555a0a)
turns out the inplace case of the alignment data generator was broken all along; should be fixed now, and luckily everything works :)
@juliantaylor from the first look, I am not quite sure what that test fix changed?
I don't know what I was thinking when I originally wrote this (it's been quite a while), but it uses a lambda to create data and the inplace one does
Sorry, silly me, I only looked at the first two thirds, and the interesting stuff is in the last bit ;).
Anyway, can't find anything weird, so will merge tomorrow unless something comes up.
Thanks Julian!
@juliantaylor, it seems airspeed velocity thinks that this commit made the performance of reduction along the fast axis considerably slower. Any idea why that might be?
The benchmarks are run on an Intel Atom N2800 + Debian jessie (gcc 4.9.2), if that helps.
Hmmm, I guess it is not unlikely that newer compilers might not show the slowdown, so likely nothing to worry about (but maybe worth a quick check).
which benchmark exactly?
Seems like this one with axis=0 and integer loops, which is of course the slow axis, sorry, so it makes a little more sense (since it should go into this branch).
Frankly, I think it should be exactly the case that would get faster, because as far as I can think right now, the thing that supposedly got slower is the fully contiguous case with
an Atom is a terrible CPU performance-wise; quite possibly vectorizing just makes things slower there
Sounds like it's not worth thinking about for long...
fwiw, these cases are faster on my Intel Core laptop, the opposite of the Atom.