ENH: speed up putmask avoiding % in inner loop #5501
Conversation
    if (mask[i]) {
-       in[i] = vals[i%nv];
+       in[i] = vals[i];
This should be `vals[j]`.
As is, the compiler will remove the dead `j` branch, so the benchmarks are not valid; the tests also seem insufficient.
But I do think this should still be beneficial, as the branch should be predicted correctly most of the time.
Oops, that's embarrassing... Will update the benchmarks and take a look at expanding the tests.
Should it have a big impact, one could consider a special `ni == nv` loop.
I did have the `nv >= ni` branch at one point, but removed it because it was showing virtually no performance improvement. Guess we now know why. ;-)
Force-pushed from 18cd64b to fc8db73
OK, after fixing the bug in the previous code, this is still faster for my original test case:
So it is ~70% faster for a half-full mask, although the worst case of an empty mask results in almost the same timing as above, i.e. a 33% slowdown for that case. I will take a look at the tests later today, as they obviously need some extending.
    for (i = 0, j = 0; i < ni; i++, j++) {
        if (j >= nv) {
            j = 0;
        }
Speaking of branch prediction and code bumming: `gcc -O2` generates a loop with three conditional jumps, but if this conditional block is replaced with `j *= (j < nv);` there are only two of them. Might be worth a try to see whether the multiplication is faster than the modulo.
Maybe worth a try, though a predicted near branch has among the best latency/throughput of all instructions; multiplication is, I think, in the range of latency 4, throughput 1 per cycle, which is likely the reason the compiler does not do this transformation itself.
With the single benchmark I have been using, the multiplication clocks in at 2.36 us, slightly better than current master, but noticeably slower than the single `if` in this PR, which takes 1.56 us.
Playing the "minimize branch mispredictions" game, my best effort so far, clocking in at 1.25 us and outperforming the current code in this PR, is the following:
i = 0;
while (i < ni) {
for (j = 0; j < nv; ++j, ++i) {
if (i >= ni) {
break;
}
in[i] = vals[j];
}
}
I suppose this is capitalizing on the CPU's ability to predict loops smartly. I am not fully sure it is worth uglifying the code like this, nor would I bet any money on the result being consistent across processor architectures. My timings come from a Sandy Bridge Intel i5, FWIW.
The extra performance was, of course, coming from leaving the mask check out... Putting it back in, the following takes the same time (1.55 us) as the much more readable single `if` in the current PR:
for (i = 0; i < ni;) {
for (j = 0; j < nv && i < ni; ++j, ++i) {
if (mask[i]) {
in[i] = vals[j];
}
}
}
LGTM. Jaime, do you still feel good about this?
Oh, this isn't merged yet :)
Well, then, I'll just merge it as is. Some tests of this functionality would be nice...
Thanks Jaime.
`putmask(a, mask, b)` is roughly equivalent to `a[mask] = b`, but `b` is tiled as needed if it is shorter than `a` and `m`. That was being done with a `%` operation, which is generally not cheap. On my system, checking the index in every iteration and resetting it to 0 when it exceeds the array's length is over 2x faster in the following test case:

There is a tradeoff on the sparsity of the mask: this PR's method has to do cheap work to keep the index updated in every iteration, while current master does expensive work only in iterations where the mask is `True`. But even with an all-`False` mask, performance is virtually the same:

For types that do not have a fast `putmask` function defined, moving chunks of memory around dominates the total time. But applying the same idea as above and removing some duplicated work from the inner loop, the improvements are more modest, but still noticeable: