8000 ENH: speed up putmask avoiding % in inner loop by jaimefrio · Pull Request #5501 · numpy/numpy · GitHub
ENH: speed up putmask avoiding % in inner loop #5501


Merged — 1 commit merged into numpy:master on Feb 24, 2015

Conversation

jaimefrio
Member

putmask(a, mask, b) is roughly equivalent to a[mask] = b, but b is tiled as needed if it is shorter than a and m. That was being done with a % operation, which is generally not cheap. On my system, checking the index on every iteration and resetting it to 0 when it reaches b's length is over 2x faster in the following test case:

import numpy as np
a = np.arange(1000)
m = (a & 1).astype(bool)  # 50% full mask
b = np.arange(100)

%timeit np.putmask(a,m,b)
100000 loops, best of 3: 2.67 us per loop  # current master
1000000 loops, best of 3: 1.15 us per loop  # this PR
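The two strategies can be sketched as standalone C helpers (hypothetical names, not the actual numpy source, which lives in the type-specialized fastputmask loops):

```c
#include <stddef.h>

/* Current master: tile vals with a modulo in every masked iteration. */
static void putmask_modulo(long *in, const unsigned char *mask,
                           const long *vals, size_t ni, size_t nv)
{
    size_t i;
    for (i = 0; i < ni; i++) {
        if (mask[i]) {
            in[i] = vals[i % nv];   /* integer division is comparatively expensive */
        }
    }
}

/* This PR: carry a second index and reset it instead of dividing. */
static void putmask_reset(long *in, const unsigned char *mask,
                          const long *vals, size_t ni, size_t nv)
{
    size_t i, j;
    for (i = 0, j = 0; i < ni; i++, j++) {
        if (j >= nv) {
            j = 0;                  /* cheap, well-predicted branch */
        }
        if (mask[i]) {
            in[i] = vals[j];
        }
    }
}
```

Since j increments in lockstep with i and wraps at nv, j is always equal to i % nv, so the two helpers write identical results.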

There is a tradeoff on the sparsity of the mask: this PR's method has to do cheap work to keep the index updated in every iteration, while current master does expensive work only in iterations where the mask is True. But even with an all False mask, performance is virtually the same:

m = np.zeros(1000, bool)
%timeit np.putmask(a, m, b)
1000000 loops, best of 3: 1.04 us per loop  # current master
1000000 loops, best of 3: 1.1 us per loop  # this PR

For types that do not have a fastputmask function defined, moving chunks of memory around dominates the total time, so applying the same idea as above and removing some duplicated work from the inner loop yields more modest, but still noticeable, improvements:

a = np.arange(1000, dtype=object)
b = np.arange(100, dtype=object)
%timeit np.putmask(a, m, b)  # same m as before
100000 loops, best of 3: 10.5 us per loop  # current master
100000 loops, best of 3: 10.1 us per loop  # this PR

a = np.arange(1000).astype('S')
b = np.arange(100).astype('S')
%timeit np.putmask(a, m, b)
100000 loops, best of 3: 13.3 us per loop  # current master
100000 loops, best of 3: 10 us per loop  # this PR
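The generic path can be sketched the same way; this is a simplified, hypothetical helper that copies raw bytes of a fixed itemsize and ignores the reference counting the real object-dtype path has to do:

```c
#include <string.h>
#include <stddef.h>

/* Simplified sketch of the generic (memcpy-based) path with the same
 * index-reset trick; not the actual numpy implementation. */
static void putmask_generic(char *in, const unsigned char *mask,
                            const char *vals, size_t ni, size_t nv,
                            size_t itemsize)
{
    size_t i, j;
    char *dst = in;
    for (i = 0, j = 0; i < ni; i++, j++, dst += itemsize) {
        if (j >= nv) {
            j = 0;
        }
        if (mask[i]) {
            memcpy(dst, vals + j * itemsize, itemsize);
        }
    }
}
```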

    if (mask[i]) {
-       in[i] = vals[i % nv];
+       in[i] = vals[i];
Contributor

this should be vals[j]
as is, the compiler will remove the dead j branch, so the benchmarks are not valid; it also seems the tests are insufficient.
but I do think it should still be beneficial, as the branch should be predicted most of the time

Member Author

Oops, that's embarrassing... Will update the benchmarks, and take a look at expanding the tests.

Contributor

Should it have a big impact, one could consider a special ni == nv loop.

Member Author

I did have the nv >= ni branch at one point, but removed it because it was showing virtually no performance improvement. Guess we now know why. ;-)

@jaimefrio
Member Author

Ok, after fixing the bug in the previous code, this is still faster for my original test case:

%timeit np.putmask(a, m, b)
1000000 loops, best of 3: 1.56 us per loop

So it is 70% faster for a half full mask, although the worst case of an empty mask will result in almost the same timing as above, so a 33% slowdown for that case.

I will take a look at the tests later today, as they obviously need some extending.

for (i = 0, j = 0; i < ni; i++, j++) {
    if (j >= nv) {
        j = 0;
    }
Contributor

Speaking of branch prediction and code bumming, gcc -O2 generates a loop with three conditional jumps, but if this conditional block is replaced with

j *= (j < nv);

there's only two of them. Might be worth a try to see if multiplication is faster than modulo.
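A sketch of that branchless variant (hypothetical helper name; it relies on a C comparison evaluating to 0 or 1, so the multiply either keeps j or wraps it to 0 when j reaches nv):

```c
#include <stddef.h>

/* Branchless wrap suggested in review: (j < nv) is 0 or 1 in C. */
static void putmask_mul(long *in, const unsigned char *mask,
                        const long *vals, size_t ni, size_t nv)
{
    size_t i, j;
    for (i = 0, j = 0; i < ni; i++, j++) {
        j *= (j < nv);          /* j == nv  ->  j * 0 == 0 */
        if (mask[i]) {
            in[i] = vals[j];
        }
    }
}
```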

Contributor

Maybe worth a try, though a predicted near branch is among the lowest-latency, highest-throughput instructions of all; multiplication is, I think, in the range of latency 4, throughput 1 cycle, which is likely the reason the compiler does not do this transformation.

Member Author

With the single benchmark I have been using, multiplying clocks in at 2.36 us: slightly better than current master, but noticeably slower than the single if in this PR, which takes 1.56 us.

Playing the "minimize branch misprediction" game, my best effort so far, clocking in at 1.25 us and outperforming the current code in this PR is the following:

i = 0;
while (i < ni) {
    for (j = 0; j < nv; ++j, ++i) {
        if (i >= ni) {
            break;
        }
        in[i] = vals[j];
    }
}

I suppose this is capitalizing on the hardware's ability to smartly predict loops. I am not fully sure it is worth uglifying the code like this, nor would I bet any money on the result being consistent across processor architectures. My timings come from a Sandy Bridge Intel i5, FWIW.

Member Author

The extra performance was of course coming from leaving the mask check out... Putting it back in, the following takes the same time (1.55 us) as the much more readable single if in the current PR:

for (i = 0; i < ni;) {
    for (j = 0; j < nv && i < ni; ++j, ++i) {
        if (mask[i]) {
            in[i] = vals[j];
        }
    }
}
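As a self-contained sketch (hypothetical helper name), the nested form keeps j implicitly equal to i % nv without ever computing a remainder: each pass of the inner loop walks vals from the start, and both indices advance together.

```c
#include <stddef.h>

/* Nested-loop form from the comment above: the inner loop's trip count
 * is nv in every full pass, which loop predictors handle well. */
static void putmask_nested(long *in, const unsigned char *mask,
                           const long *vals, size_t ni, size_t nv)
{
    size_t i, j;
    for (i = 0; i < ni;) {
        for (j = 0; j < nv && i < ni; ++j, ++i) {
            if (mask[i]) {
                in[i] = vals[j];
            }
        }
    }
}
```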

@charris
Member
charris commented Feb 24, 2015

LGTM. Jaime, do you still feel good about this?

@juliantaylor
Contributor

oh, this isn't merged yet :)
some tests for the earlier bug would be good, but I don't mind merging it now either

charris added a commit that referenced this pull request Feb 24, 2015
ENH: speed up putmask avoiding % in inner loop
@charris charris merged commit 2e016ac into numpy:master Feb 24, 2015
@charris
Member
charris commented Feb 24, 2015

Well, then, I'll just merge it as is. Some tests of this functionality would be nice...

@charris
Member
charris commented Feb 24, 2015

Thanks Jaime.

@jaimefrio jaimefrio deleted the faster_fastputmask branch February 24, 2015 22:13