8000 Speed problem for searchsorted when different integer dtypes · Issue #13579 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

Speed problem for searchsorted when different integer dtypes #13579

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
topper-123 opened this issue May 17, 2019 · 3 comments · May be fixed by #16942
Open

Speed problem for searchsorted when different integer dtypes #13579

topper-123 opened this issue May 17, 2019 · 3 comments · May be fixed by #16942

Comments

@topper-123
Copy link
topper-123 commented May 17, 2019

Reproducing code example:

>>> n = 1_000_000
>>> arr = np.array([1] * n + [2] * n + [3] * n, dtype='int8')
>>> %timeit arr.searchsorted(2)  # search for a python int
7.38 ms   # slow
>>> code = np.int16(1)  # numpy int, but different dtype
>>> %timeit arr.searchsorted(code)
4.04 ms  # slow
>>> code = np.int8(1)
>>> %timeit arr.searchsorted(code)  # same dtype
2.57 µs  # fast

The cause of the slowness is that searchsorted upcasts the input array arr and value v to the same dtype, if they have different dtypes, and this costs time. This is not always needed.

In Pandas we've worked around this by making special integer type checks in a custom version of searchsorted, but IMO everyone would benefit if this/something similar was put into numpy instead.

See https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L1726 for the pandas version. The relevant PR is pandas-dev/pandas#22034. The pandas version of runs at 17.3 µs. This is slower than the optimal numpy version, probably because it's done in python + makes some checks that likely wouldn't be needed to do in in numpy.

BTW, I wouldn't be able to implement this in numpy, because I don't know C...

Numpy/Python version information:

1.15.4 3.6.7 |Anaconda, Inc.| (default, Oct 28 2018, 19:44:12) [MSC v.1915 64 bit (AMD64)]

@topper-123 topper-123 changed the title Speed problem for searchesorted when different integer dtypes Speed problem for searchsorted when different integer dtypes May 17, 2019
@seberg
Copy link
Member
8000 seberg commented May 17, 2019

Not sure about the code, to really speed things up for such mismatches, you would have to decide that the needle is small and only do the casting as needed for each of the elements (at least that would be my intuition).

@seberg
Copy link
Member
seberg commented May 18, 2019

This was previously reported in gh-5370, which has a similar discussion (close that in favor of this one).

@mattip
Copy link
Member
mattip commented May 19, 2019

#5370 replaced by #13579

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants
0