10000 Speed problem for searchsorted when different integer dtypes · Issue #13579 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content
Speed problem for searchsorted when different integer dtypes #13579
Open
@topper-123

Description

@topper-123

Reproducing code example:

>>> n = 1_000_000
>>> arr = np.array([1] * n + [2] * n + [3] * n, dtype='int8')
>>> %timeit arr.searchsorted(2)  # search for a python int
7.38 ms   # slow
>>> code = np.int16(1)  # numpy int, but different dtype
>>> %timeit arr.searchsorted(code)
4.04 ms  # slow
>>> code = np.int8(1)
>>> %timeit arr.searchsorted(code)  # same dtype
2.57 µs  # fast

The cause of the slowness is that searchsorted upcasts the input array arr and value v to the same dtype, if they have different dtypes, and this costs time. This is not always needed.

In Pandas we've worked around this by making special integer type checks in a custom version of searchsorted, but IMO everyone would benefit if this/something similar was put into numpy instead.

See https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L1726 for the pandas version. The relevant PR is pandas-dev/pandas#22034. The pandas version of runs at 17.3 µs. This is slower than the optimal numpy version, probably because it's done in python + makes some checks that likely wouldn't be needed to do in in numpy.

BTW, I wouldn't be able to implement this in numpy, because I don't know C...

Numpy/Python version information:

1.15.4 3.6.7 |Anaconda, Inc.| (default, Oct 28 2018, 19:44:12) [MSC v.1915 64 bit (AMD64)]

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0