10000 use libdivide to make integer division faster · Issue #14959 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

use libdivide to make integer division faster #14959

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
msalib opened this issue Nov 22, 2019 · 12 comments · Fixed by #17727
Closed

use libdivide to make integer division faster #14959

msalib opened this issue Nov 22, 2019 · 12 comments · Fixed by #17727

Comments

@msalib
Copy link
msalib commented Nov 22, 2019

Right now, integer division uses the standard division instructions, but libdivide, a header-only C library, allows you to replace integer division with integer multiplication and a bitshift. On my machine, that's about 5x faster.

Alternatively, it would be good if integer division by a power of 2 (and maybe multiplication too?) were implemented with bitshifts.

@eric-wieser
Copy link
Member

If the libdivide approach is really that good, wouldn't it make more sense to try and upstream it into gcc so that all c code benefits?

@eric-wieser
Copy link
Member

I suppose specifically for broadcast multiplies, the shift / multiply could be computed just once, then applied to the entire axis.

@msalib
Copy link
Author
msalib commented Nov 22, 2019

It is actually in gcc and other compilers. The problem though is that in order to use this approach, you have to do a small bit of work that costs about 3 integer divisions. Compilers can do that for division by constant, but they can't do it for cases where the divisor isn't visible to them. For numpy where you can see exactly how many divisions you need to do ahead of time, this could be a big win.

@ganesh-k13
Copy link
Member

@eric-wieser , @seberg , we can also explore Karatsuba if numbers are huge? Also, is this open to change? would like to explore by replacing a small bit and testing performance.

@mattip
Copy link
Member
mattip commented Nov 6, 2020

The code to move the constant value out of the loop is pretty deep in the C macros, I think this would be the place to expand the logic so that if is1 is 0 then ip1 is left out of the for loop, and likewise for is2 and ip2. So that macro would have three alternative paths. Then according to the comment above, the compiler would notice it is using a constant across a loop and perform the needed magic. In any case, it might save a noop ip1 += 0.

Benchmarking this would be important to a PR to show that the increase in code complexity is worth the change.

@ganesh-k13
Copy link
Member
ganesh-k13 commented Nov 6, 2020

Performance gains are massive :)

Sample Code:

print(np.__version__)
x = np.arange(1000000000)
cProfile.run('x//30')
print(x//30)

Now output 🎉

1.19.2
         3 function calls in 6.859 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    6.859    6.859    6.859    6.859 <string>:1(<module>)
        1    0.000    0.000    6.859    6.859 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


[       0        0        0 ... 33333333 33333333 33333333]

And after libdivde:

1.20.0.dev0+4403503
         3 function calls in 2.919 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.918    2.918    2.918    2.918 <string>:1(<module>)
        1    0.000    0.000    2.919    2.919 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


[       0        0        0 ... 33333333 33333333 33333333]

Things I did:

  1. Explode BINARY_LOOP over
    @TYPE@_divide(char **args, npy_intp const *dimensions, npy_intp const *steps, void *NPY_UNUSED(func))
    . I have not figured a clean way to if-else inside the macro,will explore.
  2. Removed ip2+=is2(few millis gain)
  3. created libdivide_s64_t, it's hardcoded for now.
  4. On actual division: ((@type@ *)op1) = libdivide_s64_do(in1, &fast_d);

I'll make it more generic and extend to other types.

@seberg
Copy link
Member
seberg commented Nov 6, 2020

Hmm, I think I once also thought that the line if (((in1 > 0) != (in2 > 0)) && (in1 % in2 != 0)) { in the integer division loops look like they might be bad for performance, since there is a modulo in there which gets calculated fairly often (and the result is not used).

But it indeed looks like it is may be worth the trouble of adding a fast path for 0-stride case! (I am not sure how big a dependency/chunk of code libdivide is).

@ganesh-k13
Copy link
Member

Thanks @seberg , libdivide is a header only, we can package it(79KB) or add it as a dependency to install.

@seberg
Copy link
Member
seberg commented Nov 6, 2020

@ganesh-k13 one thing to remember is that we cannot hardcode certain SIMD instructions sets. So the functions provided need to be wrapped into the new universal intrinsics, also we need to be careful about retaining edge-case behaviour (or at least intentionally breaking it).
Vendoring seems like the probably method, or even taking parts (I am assuming the license permits this and we should ask the authors in either case).

@ganesh-k13
Copy link
Member

Thanks for the heads up @seberg . Seems like it's better if we build it separately. Perhaps have a knob of some kind to use libdivide(at least initially, like put it in a #def). Coming to instruction sets, take a look at this. Seems like they are using a wider range for compatibility. So maybe we could use it as is.

need to be careful about retaining edge-case behavior

Can you elaborate on that?

we should ask the authors in either case

Do we mail them?

@ganesh-k13
Copy link
Member

Also, when we build numpy, I see this output:

########### CLIB COMPILER OPTIMIZATION ###########
CPU baseline  : 
  Requested   : 'min'
  Enabled     : SSE SSE2 SSE3
  Flags       : -msse -msse2 -msse3
  Extra checks: none

Perhaps we can use this to build libdivide as well?

@seberg
Copy link
Member
seberg commented Nov 7, 2020

Oh, about edge cases, I am unsure if we have some defined behaviour e.g. for np.iinfo(np.int64).min // -1...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
0