ENH: speed up putmask avoiding % in inner loop #5501
Conversation
    if (mask[i]) {
-       in[i] = vals[i%nv];
+       in[i] = vals[i];
This should be `vals[j]`.
As is, the compiler will remove the dead `j` branch, so the benchmarks are not valid; the tests also seem insufficient.
But I do think this should still be beneficial, as the branch should be predicted correctly most of the time.
Oops, that's embarrassing... Will update the benchmarks and take a look at expanding the tests.
Should it have a big impact, one could consider a special `ni == nv` loop.
I did have the `nv >= ni` branch at one point, but removed it because it was showing virtually no performance improvement. Guess we now know why. ;-)
Force-pushed from 18cd64b to fc8db73
OK, after fixing the bug in the previous code, this is still faster for my original test case:
So it is ~70% faster for a half-full mask, although the worst case of an empty mask results in almost the same timing as above, i.e. a 33% slowdown for that case. I will take a look at the tests later today, as they obviously need some extending.
    for (i = 0, j = 0; i < ni; i++, j++) {
        if (j >= nv) {
            j = 0;
        }
Speaking of branch prediction and code bumming: `gcc -O2` generates a loop with three conditional jumps, but if this conditional block is replaced with `j *= (j < nv);` there are only two of them. Might be worth a try to see whether the multiplication is faster than the modulo.
Maybe worth a try, though a predicted near branch has among the best latency/throughput of all instructions; multiplication is, I think, in the range of latency 4, throughput 1 per cycle, which is likely the reason the compiler does not do this transformation itself.
With the single benchmark I have been using, the multiplication clocks in at 2.36 us, slightly better than current master, but noticeably slower than the single `if` in this PR, which takes 1.56 us.
Playing the "minimize branch mispredictions" game, my best effort so far, clocking in at 1.25 us and outperforming the current code in this PR, is the following:
i = 0;
while (i < ni) {
for (j = 0; j < nv; ++j, ++i) {
if (i >= ni) {
break;
}
in[i] = vals[j];
}
}
I suppose this is capitalizing on the CPU's ability to predict loops smartly. I am not fully sure it is worth uglifying the code like this, nor would I bet any money on the result being consistent across processor architectures. My timings come from a Sandy Bridge Intel i5, FWIW.
The extra performance was, of course, coming from leaving the mask check out... Putting it back in, the following takes the same time (1.55 us) as the much more readable single `if` in the current PR:
for (i = 0; i < ni;) {
for (j = 0; j < nv && i < ni; ++j, ++i) {
if (mask[i]) {
in[i] = vals[j];
}
}
}
LGTM. Jaime, do you still feel good about this?
Oh, this isn't merged yet :)
Well, then, I'll just merge it as is. Some tests of this functionality would be nice...
Thanks Jaime.
`putmask(a, mask, b)` is roughly equivalent to `a[mask] = b`, but `b` is tiled as needed if it is shorter than `a` and `m`. That was being done with a `%` operation, which is generally not cheap. On my system, checking the index in every iteration and resetting it to 0 when it exceeds the array's length is over 2x faster in the following test case:

There is a tradeoff on the sparsity of the mask: this PR's method has to do cheap work to keep the index updated in every iteration, while current master does expensive work only in iterations where the mask is `True`. But even with an all-`False` mask, performance is virtually the same:

For types that do not have a fast `putmask` function defined, moving chunks of memory around dominates the total time. But applying the same idea as above and removing some duplicated work from the inner loop, the improvements are more modest, but still noticeable: