BUG: ndarray.sum() slow compared to + operation for narrow array · Issue #22870 · numpy/numpy · GitHub

Open
Astlaan opened this issue Dec 22, 2022 · 5 comments
Astlaan commented Dec 22, 2022

Describe the issue:

I ran the following code:

import numpy as np

a = np.random.randint(1, 100, (10**8, 2))
%timeit a.sum(axis=1)
%timeit a[:,0] + a[:,1]

and this was the output:

1.01 s ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
163 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I.e., the .sum() method was about 6-7x slower than the manual sum. If I instead use a slightly less narrow array, say (10**6, 2), .sum() is still 6-7x slower, so this doesn't appear to be just NumPy C-function call overhead.
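For anyone who hits this, there are a few equivalent spellings of the narrow row sum that avoid the slow reduce path; relative speed varies by NumPy version and platform, so this is a sketch of equivalent alternatives, not a benchmark (all names below are my own, not from the thread):

```python
import numpy as np

a = np.random.randint(1, 100, (10**6, 2))

# Manual slice addition: two ufunc calls, each over a long column.
manual = a[:, 0] + a[:, 1]

# einsum can express the same row sum.
via_einsum = np.einsum('ij->i', a)

# A matrix-vector product with a ones vector is another equivalent form.
via_dot = a @ np.ones(a.shape[1], dtype=a.dtype)

# All three agree with .sum(axis=1); only the speed differs.
assert np.array_equal(manual, a.sum(axis=1))
assert np.array_equal(via_einsum, a.sum(axis=1))
assert np.array_equal(via_dot, a.sum(axis=1))
```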

For more "square" arrays, we actually see the expected improvement:

b = np.random.randint(1, 100, (1000000, 200))

%%timeit
b.sum(axis=1)
# 105 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
c = b[:, 0].copy()  # copy, so the accumulation does not overwrite b
for i in range(1, b.shape[1]):
    c += b[:, i]
# 4 s ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is .sum() needlessly slow for narrow arrays?

Reproduce the code example:

import numpy as np

a = np.random.randint(1, 100, (100000000, 2))
%timeit a.sum(axis=1)
%timeit a[:,0] + a[:,1]

Error message:

No response

Runtime information:

NumPy version: 1.21.5
Python version: 3.9.15 (main, Nov 24 2022, 14:39:17) [MSC v.1916 64 bit (AMD64)]

Context for the issue:

No response

MDK8888 commented Dec 27, 2022

Hey, currently working on this! I am looking for the C source code for np.sum right now. I've been able to track it down to here; will keep looking!

seberg (Member) commented Jan 1, 2023

This goes into the ufunc/umath C code. NumPy tries to operate in the fastest memory order, but for narrow arrays that isn't the better option: the memory order doesn't matter much (probably), while the per-call overhead is much larger (effectively NumPy unrolls the inner-most loop).

NumPy could try heuristics to choose the "slow" direction, but I would have to dig in to see whether a good heuristic is actually easy to find (or, more importantly, to add in a nice way).
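A rough Python-level sketch of what choosing the "slow" memory direction amounts to for a (N, 2) C-order array: accumulate over the short axis, so each ufunc call runs over one long column instead of making N calls over length-2 rows. The function name is my own illustration, not a NumPy internal:

```python
import numpy as np

def rowsum_by_columns(a):
    """Row sum computed by iterating over the short axis:
    only a.shape[1] ufunc calls are made, each with a long
    inner loop, instead of one short loop per row."""
    out = a[:, 0].copy()           # copy so we don't mutate a
    for j in range(1, a.shape[1]):
        out += a[:, j]             # one long vectorized add per column
    return out

a = np.random.randint(1, 100, (10**5, 2))
assert np.array_equal(rowsum_by_columns(a), a.sum(axis=1))
```

This is essentially what the manual `a[:,0] + a[:,1]` timing in the report does, which is why it sidesteps the per-call overhead.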

charris (Member) commented Jan 1, 2023

@dgasmith may have some thoughts on this.

seberg (Member) commented Jan 2, 2023

Hmmm, it seems likely that a large chunk is also due to SIMD overhead for very small loop sizes, making the "fast in memory" choice even worse. This comment suggests that might be easy to improve a lot, though, and with that PR merged (or similar), that might just happen.

mattip (Member) commented Jan 2, 2023

PR #22889 will not affect this, since this is a reduce, which as far as I can tell does not use SIMD processing (it goes directly to BASE_BINARY_REDUCE_LOOP and relies on gcc -O3 optimization).
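For reference, the reduce machinery being discussed can be exercised directly from Python: `ndarray.sum(axis=...)` is equivalent to `np.add.reduce` over that axis, so any fix to the reduce loop would affect both spellings. A small check of that equivalence:

```python
import numpy as np

a = np.random.randint(1, 100, (10**4, 2))

# sum(axis=1) goes through the add ufunc's reduce path,
# so np.add.reduce is an equivalent spelling of the same operation.
assert np.array_equal(np.add.reduce(a, axis=1), a.sum(axis=1))
```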
