ENH: add inplace cases to fast ufunc loop macros · Pull Request #7999 · numpy/numpy

Merged · 2 commits from juliantaylor:inplace-opt into numpy:master on Sep 2, 2016

Conversation

@juliantaylor (Contributor)

Neither gcc nor clang automatically specializes the in-place case, so
add extra conditions to the loop macros to get the compilers to emit
decent code.
Without them, in-place code ends up much slower than the out-of-place
code.
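To illustrate the idea, here is a minimal sketch in plain C (hypothetical names, not NumPy's actual loop macros): every branch contains the same loop body, but the explicit aliasing guards let the compiler prove the pointer relationships inside each branch and emit a separately optimized, vectorizable version for each, including the in-place ones.

    #include <stddef.h>

    /* Minimal sketch of the technique; names are hypothetical, not
     * NumPy's actual loop macros. Each branch contains the identical
     * loop, but the guards tell the compiler exactly how out aliases
     * the inputs, so every branch can be specialized instead of
     * falling back to worst-case scalar code. */
    static void
    add_int_contig(int *in1, int *in2, int *out, ptrdiff_t n)
    {
        if (out == in1) {
            /* in-place on the first operand: effectively out[i] += in2[i] */
            for (ptrdiff_t i = 0; i < n; i++) {
                out[i] = in1[i] + in2[i];
            }
        }
        else if (out == in2) {
            /* in-place on the second operand */
            for (ptrdiff_t i = 0; i < n; i++) {
                out[i] = in1[i] + in2[i];
            }
        }
        else {
            /* general out-of-place case */
            for (ptrdiff_t i = 0; i < n; i++) {
                out[i] = in1[i] + in2[i];
            }
        }
    }

With guards like these, a call such as np.add(a, b, out=a) lands in the first branch, where the only aliasing the compiler has to respect is the benign out == in1 case.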

@mhvk (Contributor) commented Aug 31, 2016

I always wondered why inplace didn't help speed much. Great to see it addressed (even if I cannot really comment on the code).

@seberg (Member) commented Aug 31, 2016

Heh, the code is "just the same" with some extra branches that all do the same thing but help the compiler create an optimized version of the branch. Though we should check quite carefully since it might stretch the tests...

Do you want to create a release note?

Review thread on the diff (excerpt):

    tout * out = (tout *)op1; \
    op; \
    }
    #define BASE_BINARY_LOOP_S(tin, tout, cin, cinp, vin, vinp, op) \
Review comment (Member):

Jeesh, this macro argument passing is starting to give me the creeps ;). By now I think it's correct, but it could maybe use a comment here to explain things a bit more in depth. I am not used to such macros, and it took me a bit to figure out that in1 always ends up pointing to the correct args[0], etc.

Review comment (Member):

Or maybe it is just one of those things you have to stare at until it makes sense...
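For anyone staring at the same thing, a hedged, simplified reconstruction of the pattern (not the actual NumPy macro or its call sites): cin/cinp name the constant, stride-0 input and its pointer, vin/vinp the varying input. Passing the names through the macro lets both scalar cases share one body while in1 and in2 still bind to the intended args slot.

    #include <stddef.h>
    #include <stdio.h>

    /* Simplified, hypothetical version of the macro under review.
     * cin is declared once outside the loop (the scalar input), vin
     * once per iteration (the strided input); n and op1 are expected
     * to exist in the enclosing scope, as in the real loop macros. */
    #define BASE_BINARY_LOOP_S(tin, tout, cin, cinp, vin, vinp, op)  \
        {                                                            \
            const tin cin = *(const tin *)(cinp);                    \
            for (ptrdiff_t i = 0; i < n; i++) {                      \
                const tin vin = ((const tin *)(vinp))[i];            \
                tout *out = (tout *)op1 + i;                         \
                op;                                                  \
            }                                                        \
        }

    int main(void)
    {
        int a = 10;              /* scalar operand */
        int b[3] = {1, 2, 3};    /* array operand */
        int res[3];
        ptrdiff_t n = 3;
        char *op1 = (char *)res;

        /* scalar is the first input: in1 binds to the scalar */
        BASE_BINARY_LOOP_S(int, int, in1, &a, in2, b, *out = in1 + in2);
        /* scalar is the second input: same body, arguments swapped */
        BASE_BINARY_LOOP_S(int, int, in2, &a, in1, b, *out = in1 + in2);

        printf("%d %d %d\n", res[0], res[1], res[2]);  /* prints 11 12 13 */
        return 0;
    }

The point is that whichever operand is the scalar, the op expression can keep referring to in1 and in2 by name; the caller's argument order is what routes each name to the right pointer.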

@seberg (Member) commented Aug 31, 2016

LGTM in general. Do we have basic tests which exercise all of these branches in one way or another? The only thing that worried me a little for a bit was somehow ending up with in1 and in2 swapped (which would not show up in all of the ufuncs).

@juliantaylor (Contributor, Author) commented Sep 1, 2016

The macro is quite ugly; I thought about just keeping the original code, but the macro makes clear that the code in all branches is exactly the same.
I don't know if we explicitly test these cases. I should probably check.

@juliantaylor (Contributor, Author)

I don't think it needs release notes; these are only used for integer and bitwise ops, and I doubt the latter are often written in-place. Float loops were already fine, as they are manually optimized (good compilers are probably common enough now, since we dropped Python 2.6 and thus RedHat 6, that we could autovectorize those too).

But this becomes very significant if the temporary elision branch is merged; that will turn a lot more operations into in-place operations.

@juliantaylor (Contributor, Author)

Added explicit tests via the code used to test the float loops too; that should cover more than necessary (assuming non-buggy compilers).

@juliantaylor force-pushed the inplace-opt branch 3 times, most recently from 249460a to c46f6d0 on September 1, 2016.
Second commit: Due to the lambdas used, the test data generator didn't actually generate in-place cases. Fix a few tests that now break because they assume the cases are always out of place. Also update some numbers to make it more likely to catch issues like loading from the wrong array.
@juliantaylor (Contributor, Author)

Turns out the in-place case of the alignment data generator was broken all along; it should be fixed now. Luckily everything works :)

@seberg (Member) commented Sep 1, 2016

@juliantaylor from the first look, I am not quite sure what that test fix changed?

@juliantaylor (Contributor, Author)

I don't know what I was thinking when I originally wrote this (it's been quite a while), but it uses a lambda to create the data, and the in-place variant called inp1() twice, which creates two separate arrays, so it was not actually in place.
I switched it to properly pass the same array; see also the commit message.

@seberg (Member) commented Sep 1, 2016

Sorry, silly me, I only looked at the first two thirds, and the interesting stuff is in the last bit ;).

@seberg (Member) commented Sep 1, 2016

Anyway, can't find anything weird, so will merge tomorrow unless something comes up.

@seberg (Member) commented Sep 2, 2016

Thanks Julian!

seberg merged commit 8eedd3e into numpy:master on Sep 2, 2016.
@seberg (Member) commented Sep 3, 2016

@juliantaylor, it seems airspeed velocity thinks that this commit made the performance of reduction along the fast axis considerably slower. Any idea why that might be?

@pv (Member) commented Sep 3, 2016

The benchmarks are run on Intel Atom N2800 + Debian jessie (gcc 4.9.2) if that helps.

@seberg (Member) commented Sep 3, 2016

Hmmm, I guess it is not unlikely that newer compilers might not show the slowdown, so likely nothing to worry about (but maybe worth a quick check).

juliantaylor deleted the inplace-opt branch on September 3, 2016.
@juliantaylor (Contributor, Author)

Which benchmark exactly? The changed loops are not used for the reductions, so it shouldn't have changed anything.

@pv (Member) commented Sep 3, 2016

@seberg (Member) commented Sep 3, 2016

Seems like it's this one with axis=0 and integer loops, which is of course the slow axis (sorry), so it makes a little more sense, since that should go through this branch:

class AddReduceSeparate(Benchmark):
    params = [[0, 1], TYPES1]
    param_names = ['axis', 'type']

    def setup(self, axis, typename):
        self.a = get_squares()[typename]

    def time_reduce(self, axis, typename):
        np.add.reduce(self.a, axis=axis)

Frankly, I think it should be exactly the case that would get faster, because as far as I can tell right now, the thing that supposedly got slower is the fully contiguous case with args[2] == args[0]...
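To make the suspected aliasing concrete (a hedged reading of why a slow-axis reduce exercises the new branch; hypothetical helper, not NumPy's actual reduction code): reducing along axis 0 repeatedly accumulates each contiguous row into the output buffer, so every inner-loop call sees its output at the same address as its first input.

    #include <stddef.h>

    /* Hypothetical sketch: in a slow-axis add.reduce, the output
     * buffer doubles as the accumulator, so the binary loop is called
     * with out == in1 and takes the in-place branch; the body is
     * effectively acc[i] += row[i]. */
    static void
    accumulate_row(int *acc, const int *row, ptrdiff_t n)
    {
        for (ptrdiff_t i = 0; i < n; i++) {
            acc[i] = acc[i] + row[i];
        }
    }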

@juliantaylor (Contributor, Author)

An Atom is a terrible CPU performance-wise; quite possibly vectorizing just makes things slower there.
I can test that on my own next week.

@seberg (Member) commented Sep 3, 2016

Sounds like it's not worth thinking about for long...

@pv (Member) commented Sep 3, 2016

FWIW, these cases are faster on my Intel Core laptop; it goes the opposite way from the Atom.
Probably not worth optimizing for the Atom.
