ENH: add inplace cases to fast ufunc loop macros · Pull Request #7999 · numpy/numpy

Merged · 2 commits from juliantaylor:inplace-opt into numpy:master on Sep 2, 2016

Conversation

@juliantaylor (Contributor)

Neither gcc nor clang automatically specializes the in-place case, so
add extra conditions to the loop macros to get the compilers to emit
decent code.
Without them, in-place code ends up much slower than the out-of-place
code.
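To illustrate the idea, here is a minimal sketch in plain C (hypothetical names, not NumPy's actual loop macros): every branch contains the same loop body, but the explicit aliasing guards let the compiler prove the pointer relationships inside each branch and emit a separately optimized, vectorizable version for each, including the in-place ones.

    #include <stddef.h>

    /* Minimal sketch of the technique; names are hypothetical, not
     * NumPy's actual loop macros. Each branch contains the identical
     * loop, but the guards tell the compiler exactly how out aliases
     * the inputs, so every branch can be specialized instead of
     * falling back to worst-case scalar code. */
    static void
    add_int_contig(int *in1, int *in2, int *out, ptrdiff_t n)
    {
        if (out == in1) {
            /* in-place on the first operand: effectively out[i] += in2[i] */
            for (ptrdiff_t i = 0; i < n; i++) {
                out[i] = in1[i] + in2[i];
            }
        }
        else if (out == in2) {
            /* in-place on the second operand */
            for (ptrdiff_t i = 0; i < n; i++) {
                out[i] = in1[i] + in2[i];
            }
        }
        else {
            /* general out-of-place case */
            for (ptrdiff_t i = 0; i < n; i++) {
                out[i] = in1[i] + in2[i];
            }
        }
    }

With guards like these, a call such as np.add(a, b, out=a) lands in the first branch, where the only aliasing the compiler has to respect is the benign out == in1 case.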

@mhvk (Contributor) commented Aug 31, 2016

I always wondered why inplace didn't help speed much. Great to see it addressed (even if I cannot really comment on the code).

@seberg (Member) commented Aug 31, 2016

Heh, the code is "just the same" with some extra branches that all do the same thing but help the compiler create an optimized version of the branch. Though we should check quite carefully since it might stretch the tests...

Do you want to create a release note?

Review thread on the diff (excerpt):

    tout * out = (tout *)op1; \
    op; \
    }
    #define BASE_BINARY_LOOP_S(tin, tout, cin, cinp, vin, vinp, op) \
Review comment (Member):

Jeesh, this macro argument passing is starting to give me the creeps ;). By now I think it's correct, but it could maybe use a comment here to explain things a bit more in depth. I am not used to such macros, and it took me a bit to figure out that in1 always ends up pointing to the correct args[0], etc.

Review comment (Member):

Or maybe it is just one of those things you have to stare at until it makes sense...
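For anyone staring at the same thing, a hedged, simplified reconstruction of the pattern (not the actual NumPy macro or its call sites): cin/cinp name the constant, stride-0 input and its pointer, vin/vinp the varying input. Passing the names through the macro lets both scalar cases share one body while in1 and in2 still bind to the intended args slot.

    #include <stddef.h>
    #include <stdio.h>

    /* Simplified, hypothetical version of the macro under review.
     * cin is declared once outside the loop (the scalar input), vin
     * once per iteration (the strided input); n and op1 are expected
     * to exist in the enclosing scope, as in the real loop macros. */
    #define BASE_BINARY_LOOP_S(tin, tout, cin, cinp, vin, vinp, op)  \
        {                                                            \
            const tin cin = *(const tin *)(cinp);                    \
            for (ptrdiff_t i = 0; i < n; i++) {                      \
                const tin vin = ((const tin *)(vinp))[i];            \
                tout *out = (tout *)op1 + i;                         \
                op;                                                  \
            }                                                        \
        }

    int main(void)
    {
        int a = 10;              /* scalar operand */
        int b[3] = {1, 2, 3};    /* array operand */
        int res[3];
        ptrdiff_t n = 3;
        char *op1 = (char *)res;

        /* scalar is the first input: in1 binds to the scalar */
        BASE_BINARY_LOOP_S(int, int, in1, &a, in2, b, *out = in1 + in2);
        /* scalar is the second input: same body, arguments swapped */
        BASE_BINARY_LOOP_S(int, int, in2, &a, in1, b, *out = in1 + in2);

        printf("%d %d %d\n", res[0], res[1], res[2]);  /* prints 11 12 13 */
        return 0;
    }

The point is that whichever operand is the scalar, the op expression can keep referring to in1 and in2 by name; the caller's argument order is what routes each name to the right pointer.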

@seberg (Member) commented Aug 31, 2016

LGTM in general. Do we have basic tests which exercise all of these branches in one way or another? The only thing that worried me a little for a bit was somehow ending up with in1 and in2 swapped (which would not show up in all of the ufuncs).

@juliantaylor (Contributor, Author) commented Sep 1, 2016

The macro is quite ugly; I thought about just keeping the original code, but the macro makes clear that the code in all branches is exactly the same.
I don't know if we explicitly test these cases. I should probably check.

@juliantaylor (Contributor, Author)

I don't think it needs release notes; these are only used for integer and bitwise ops, and I doubt the latter are often written in-place. Float loops were already fine, as they are manually optimized (good compilers are probably common enough now, since we dropped Python 2.6 and thus RedHat 6, that we could autovectorize those too).

But this becomes very significant if the temporary elision branch is merged; that will turn a lot more operations into in-place operations.

@juliantaylor (Contributor, Author)

Added explicit tests via the code used to test the float loops too; that should cover more than necessary (assuming non-buggy compilers).

@juliantaylor force-pushed the inplace-opt branch 3 times, most recently from 249460a to c46f6d0 on September 1, 2016.
Second commit: Due to the lambdas used, the test data generator didn't actually generate in-place cases. Fix a few tests that now break because they assume the cases are always out of place. Also update some numbers to make it more likely to catch issues like loading from the wrong array.
@juliantaylor (Contributor, Author)

Turns out the in-place case of the alignment data generator was broken all along; it should be fixed now. Luckily everything works :)

@seberg (Member) commented Sep 1, 2016

@juliantaylor from the first look, I am not quite sure what that test fix changed?

@juliantaylor (Contributor, Author)

I don't know what I was thinking when I originally wrote this (it's been quite a while), but it uses a lambda to create the data, and the in-place variant called inp1() twice, which creates two separate arrays, so it was not actually in place.
I switched it to properly pass the same array; see also the commit message.

@seberg (Member) commented Sep 1, 2016

Sorry, silly me, I only looked at the first two thirds, and the interesting stuff is in the last bit ;).

@seberg (Member) commented Sep 1, 2016

Anyway, can't find anything weird, so will merge tomorrow unless something comes up.

@seberg (Member) commented Sep 2, 2016

Thanks Julian!

seberg merged commit 8eedd3e into numpy:master on Sep 2, 2016.
@seberg (Member) commented Sep 3, 2016

@juliantaylor, it seems airspeed velocity thinks that this commit made the performance of reduction along the fast axis considerably slower. Any idea why that might be?

@pv (Member) commented Sep 3, 2016

The benchmarks are run on Intel Atom N2800 + Debian jessie (gcc 4.9.2) if that helps.

@seberg (Member) commented Sep 3, 2016

Hmmm, I guess it is not unlikely that newer compilers might not show the slowdown, so likely nothing to worry about (but maybe worth a quick check).

juliantaylor deleted the inplace-opt branch on September 3, 2016.
@juliantaylor (Contributor, Author)

Which benchmark exactly? The changed loops are not used for the reductions, so it shouldn't have changed anything.

@pv (Member) commented Sep 3, 2016

@seberg (Member) commented Sep 3, 2016

Seems like it's this one with axis=0 and integer loops, which is of course the slow axis (sorry), so it makes a little more sense, since that should go through this branch:

class AddReduceSeparate(Benchmark):
    params = [[0, 1], TYPES1]
    param_names = ['axis', 'type']

    def setup(self, axis, typename):
        self.a = get_squares()[typename]

    def time_reduce(self, axis, typename):
        np.add.reduce(self.a, axis=axis)

Frankly, I think it should be exactly the case that would get faster, because as far as I can tell right now, the thing that supposedly got slower is the fully contiguous case with args[2] == args[0]...
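To make the suspected aliasing concrete (a hedged reading of why a slow-axis reduce exercises the new branch; hypothetical helper, not NumPy's actual reduction code): reducing along axis 0 repeatedly accumulates each contiguous row into the output buffer, so every inner-loop call sees its output at the same address as its first input.

    #include <stddef.h>

    /* Hypothetical sketch: in a slow-axis add.reduce, the output
     * buffer doubles as the accumulator, so the binary loop is called
     * with out == in1 and takes the in-place branch; the body is
     * effectively acc[i] += row[i]. */
    static void
    accumulate_row(int *acc, const int *row, ptrdiff_t n)
    {
        for (ptrdiff_t i = 0; i < n; i++) {
            acc[i] = acc[i] + row[i];
        }
    }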

@juliantaylor (Contributor, Author)

An Atom is a terrible CPU performance-wise; quite possibly vectorizing just makes things slower there.
I can test that on my own next week.

@seberg (Member) commented Sep 3, 2016

Sounds like it's not worth thinking about for long...

@pv (Member) commented Sep 3, 2016

FWIW, these cases are faster on my Intel Core laptop; it goes the opposite way from the Atom.
Probably not worth optimizing for the Atom.
