Describe the issue:
I ran the following code:
a = np.random.randint(1, 100, (10**8, 2))
%timeit a.sum(axis=1)
%timeit a[:,0]+a[:,1]
and this was the output:
1.01 s ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
163 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I.e., the .sum() method was about 6-7x slower than the manual sum. If I instead use a shorter array of the same width, say (10**6, 2), .sum() is still 6-7x slower, so this doesn't appear to be just NumPy C-function call overhead.
For wider arrays, .sum() is much faster than the manual column loop, as expected:
b = np.random.randint(1, 100, (1000000, 200))
%%timeit
b.sum(axis=1)
105 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
c = b[:,0].copy()  # copy so the in-place adds below do not modify b
for i in range(1, b.shape[1]):
    c += b[:,i]
4 s ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is .sum() needlessly slow for narrow arrays?
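As a possible stopgap while this is open, the two observations above (column adds win for narrow arrays, .sum() wins for wide ones) could be combined in a small helper. This is only a sketch: row_sum and its max_columns cutoff are hypothetical names and values chosen for illustration, not part of NumPy and not tuned against any benchmark.

```python
import numpy as np


def row_sum(arr, max_columns=8):
    """Sum a 2-D array along axis 1.

    Hypothetical helper: `max_columns` is an arbitrary illustrative cutoff,
    not a measured crossover point.
    """
    if arr.ndim != 2:
        raise ValueError("expected a 2-D array")
    n_cols = arr.shape[1]
    if n_cols == 0 or n_cols > max_columns:
        # Wide (or empty) arrays: use the built-in reduction.
        return arr.sum(axis=1)
    # Narrow arrays: accumulate column by column.
    # Copy the first column so the in-place adds never write into `arr`.
    total = arr[:, 0].copy()
    for i in range(1, n_cols):
        total += arr[:, i]
    return total
```

Note that the column-wise path keeps the input dtype, whereas .sum() may upcast small integer types to the platform default integer, so the two paths can differ in result dtype and overflow behavior even when the values agree.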
Reproduce the code example:
import numpy as np
a = np.random.randint(1, 100, (100000000, 2))
%timeit a.sum(axis=1)
%timeit a[:,0]+a[:,1]
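For anyone reproducing this outside IPython, where %timeit is not available, the same comparison can be sketched with the standard timeit module. The array size here is an assumption chosen so the script runs quickly; it is much smaller than the 10**8 rows above. The function names reduce_sum and column_add are just labels for this sketch.

```python
import timeit

import numpy as np

# Smaller than the 10**8 rows in the report so the script finishes quickly;
# this size is an arbitrary choice for illustration.
a = np.random.randint(1, 100, (10**5, 2))


def reduce_sum():
    # Built-in reduction along the short axis.
    return a.sum(axis=1)


def column_add():
    # Manual sum of the two columns.
    return a[:, 0] + a[:, 1]


# Both approaches should produce the same values.
assert np.array_equal(reduce_sum(), column_add())

for fn in (reduce_sum, column_add):
    # Best of 5 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=5)) / 10
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```

Absolute timings will depend on the machine and NumPy version, so only the ratio between the two lines is meaningful here.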
Error message:
No response
Runtime information:
NumPy: 1.21.5
Python: 3.9.15 (main, Nov 24 2022, 14:39:17) [MSC v.1916 64 bit (AMD64)]
Context for the issue:
No response