np.array(string_array.tolist(), dtype=int) is faster than without the .tolist() · Issue #11014 · numpy/numpy · GitHub
@alexmojaki

np.array(string_array.tolist(), dtype=int)

is almost twice as fast as

np.array(string_array, dtype=int)

Full demo script:

from timeit import timeit
import numpy as np
import sys

# One million decimal strings: '0', '1', ..., '999999'
string_array = np.array(list(map(str, range(1000000))))

# Direct conversion of the string array
print(timeit(lambda: np.array(string_array, dtype=int), number=10))
# Conversion via an intermediate Python list
print(timeit(lambda: np.array(string_array.tolist(), dtype=int), number=10))
print(np.__version__)
print(sys.version)

Output:

3.470734696020372
2.058091988990782
1.14.2
3.6.2 (default, Jul 29 2017, 00:00:00) 
[GCC 4.8.4]

This seems like an optimisation that numpy should be able to do automatically, and also possibly a hint at an underlying performance problem.


Summary by @seberg 2021-11:

  • As Eric notes at the end, the current timings should be mainly due to the weird casting functions.
  • The solution will be to implement new-style casts (instead of the weird legacy cast function), for string to integer casts. Even the old functions are bad (they go via scalars!), but there is probably not much point in trying to improve them.
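
To check whether the gap still shows up on a given NumPy version, a minimal sketch along these lines may help. It is not part of the original report: the helper name `time_conversions`, the smaller array size, and the extra `astype(int)` candidate are assumptions added for illustration. It first confirms that every spelling yields the same integers and only then times them; no particular timings are implied.

```python
from timeit import timeit

import numpy as np


def time_conversions(n=100_000, repeats=3):
    """Time several spellings of a string-to-int conversion.

    Hypothetical helper, not taken from the issue; `n` and `repeats`
    are arbitrary illustrative defaults.
    """
    string_array = np.array([str(i) for i in range(n)])

    candidates = {
        "np.array(arr, dtype=int)":
            lambda: np.array(string_array, dtype=int),
        "np.array(arr.tolist(), dtype=int)":
            lambda: np.array(string_array.tolist(), dtype=int),
        # Extra candidate, not in the original report.
        "arr.astype(int)":
            lambda: string_array.astype(int),
    }

    expected = np.arange(n)
    timings = {}
    for label, convert in candidates.items():
        # Confirm each spelling gives the same values before timing it.
        assert np.array_equal(convert(), expected)
        timings[label] = timeit(convert, number=repeats)
    return timings


if __name__ == "__main__":
    for label, seconds in time_conversions().items():
        print(f"{label}: {seconds:.3f} s")
```

Running this on both an older and a newer NumPy release would show whether the difference reported above persists; the correctness check guards against comparing conversions that silently disagree.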
