np.array(string_array.tolist(), dtype=int) is faster than without the .tolist() · Issue #11014 · numpy/numpy · GitHub

np.array(string_array.tolist(), dtype=int) is faster than without the .tolist() #11014


Open
alexmojaki opened this issue Apr 30, 2018 · 12 comments

Comments

@alexmojaki
alexmojaki commented Apr 30, 2018

np.array(string_array.tolist(), dtype=int)

is almost twice as fast as

np.array(string_array, dtype=int)

Full demo script:

from timeit import timeit
import numpy as np
import sys

string_array = np.array(list(map(str, range(1000000))))

print(timeit(lambda: np.array(string_array, dtype=int), number=10))
print(timeit(lambda: np.array(string_array.tolist(), dtype=int), number=10))
print(np.__version__)
print(sys.version)

Output:

3.470734696020372
2.058091988990782
1.14.2
3.6.2 (default, Jul 29 2017, 00:00:00) 
[GCC 4.8.4]

This seems like an optimisation that numpy should be able to do automatically, and also possibly a hint at an underlying performance problem.


Summary by @seberg 2021-11:

  • As Eric notes at the end, the current timings should be mainly due to the weird casting functions.
  • The solution will be to implement new-style casts (instead of the weird legacy cast function), for string to integer casts. Even the old functions are bad (they go via scalars!), but there is probably not much point in trying to improve them.
@Dan-Patterson

@alexmojaki I got a similar ratio of the times as you did using your script for numpy 1.13.3 but using IPython %timeit, the differences were less dramatic.

%timeit -n10 -r10 np.array(string_array, dtype=int)
349 ms ± 4.07 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit -n10 -r10 np.array(string_array.tolist(), dtype=int)
317 ms ± 21.4 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Can you confirm your timing by other means?

@alexmojaki
Author

Using more manual timing:

from contextlib import contextmanager
from time import time

import numpy as np


@contextmanager
def timer(description='Operation'):
    start = time()
    yield
    elapsed = time() - start
    message = '%s took %s seconds' % (description, elapsed)
    print(message)


string_array = np.array(list(map(str, range(1000000))))

with timer('plain'):
    for _ in range(10):
        np.array(string_array, dtype=int)

with timer('tolist'):
    for _ in range(10):
        np.array(string_array.tolist(), dtype=int)
plain took 6.220866918563843 seconds
tolist took 3.8044321537017822 seconds

@Dan-Patterson

Something else must be going on. Using your script above, I ran it from n=2 to 10 and the ratios weren't as marked as yours. Here are a couple of the samples

n = 2
plain took 0.6495838165283203 seconds
tolist took 0.622671365737915 seconds

n = 4
plain took 1.3122646808624268 seconds
tolist took 1.2245070934295654 seconds

n = 6
plain took 1.9206631183624268 seconds
tolist took 1.9051151275634766 seconds

n = 8
plain took 2.601041316986084 seconds
tolist took 2.455733299255371 seconds

n = 10
plain took 3.2278361320495605 seconds
tolist took 3.079383611679077 seconds

@alexmojaki
Author

You mentioned using numpy 1.13.3. Did you mean 1.14.3? If not, can you try with that? Also what version of Python are you using?

@Dan-Patterson
Copy link
Dan-Patterson commented May 1, 2018 via email

@alexmojaki
Author

Well I just downgraded to 1.13.3, and now tolist() is slower as one would expect. So I guess it's a new issue.

@eric-wieser
Member

@alexmojaki: Did .tolist() get slower in 1.13 on your machine, or did array(...) get faster (compared to 1.14)?

@alexmojaki
Author

It looks like in 1.14 array got slower. Here are some fresh runs with both scripts:

1.13:

plain took 3.423121929168701 seconds
tolist took 3.9357383251190186 seconds

3.5675266589969397
3.9893020659219474

1.14:

plain took 6.254313945770264 seconds
tolist took 3.8215630054473877 seconds

6.432628321927041
3.8574256210122257

@ahaldane
Member
ahaldane commented May 1, 2018

I haven't checked, but perhaps #9978 is involved...

@ahaldane
Member
ahaldane commented May 1, 2018

Actually, we modified arraytypes.c.src a couple times in the last few months, that was just the first one I remembered.

My bet for this slowdown is now #9856, since it modified the inner loop of STRING_to_INT:

     for (i = 0; i < n; i++, ip+=skip, op+=oskip) {
-        temp = @from@_getitem(ip, aip);
+        PyObject *new;
+        PyObject *temp = PyArray_Scalar(ip, PyArray_DESCR(aip), (PyObject *)aip);
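The effect of that change can be approximated from Python. This is an illustrative micro-benchmark, not the actual C code path: iterating over a string array yields one np.str_ scalar per element, roughly mirroring the per-element PyArray_Scalar call in the new loop, while .tolist() materialises plain Python str objects in a single pass first.

```python
from timeit import timeit
import numpy as np

string_array = np.array([str(i) for i in range(100_000)])

# Indexing a unicode array produces np.str_ scalar objects, loosely
# corresponding to the PyArray_Scalar call inside the cast loop.
scalar_path = timeit(
    lambda: np.array([int(s) for s in string_array], dtype=int), number=5)

# tolist() converts to plain Python str objects in one C-level pass
# before any parsing happens.
list_path = timeit(
    lambda: np.array(string_array.tolist(), dtype=int), number=5)

print(scalar_path, list_path)
```

Both variants produce the same integer array; only the number of intermediate scalar objects differs.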

@eric-wieser
Member
eric-wieser commented May 1, 2018

Not sure how best to fix this. There's a horrible dance going on between doing the work in scalartypes.c.src and doing it in arraytypes.c.src.

Ideally we'd do a conversion straight from array to array, rather than the round trip that currently happens of:

ndarray[np.str_] → np.str_ → int → ndarray[np.int32]

I'm pretty sure there are more steps there, but we would need to add instrumentation to find them.
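The round trip above can be spelled out from Python for a single element. This is only a sketch of the conceptual steps; the real work happens in C:

```python
import numpy as np

a = np.array(['7'])                   # ndarray of fixed-width unicode
s = a[0]                              # 1. element access yields an np.str_ scalar
assert type(s) is np.str_
i = int(s)                            # 2. the scalar is parsed to a Python int
out = np.array([i], dtype=np.int32)   # 3. and packed back into an int array
```

Doing this per element is what makes the cast expensive compared with a direct buffer-to-buffer conversion.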

@ahaldane
Member
ahaldane commented May 1, 2018

Agreed.

In that file we have a lot of functions like INT_fromstr, which avoid all that complication by using PyOS_strtol, which should be fast. But that expects a null-terminated string, and our fixed-width strings are not null-terminated. The C standard library has no function to parse an integer from a non-null-terminated buffer, so we'd probably have to make a null-terminated copy of each value first, or else roll our own strntol.
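The copy-then-parse idea can be sketched in Python. NumPy's fixed-width byte-string fields are NUL-padded in the underlying buffer, so each field can be sliced out and stripped before parsing. This is a conceptual analogue only, not the proposed C implementation:

```python
import numpy as np

a = np.array([b'42', b'7'], dtype='S4')   # fixed-width 4-byte fields
raw = a.tobytes()                          # fields are NUL-padded in the buffer
width = a.dtype.itemsize

# Strip the NUL padding from each field before handing it to int(),
# mirroring "make a null-terminated copy" before calling PyOS_strtol.
vals = [int(raw[i:i + width].rstrip(b'\x00'))
        for i in range(0, len(raw), width)]
```

In C the equivalent would be a small stack buffer holding the copied field plus a terminating NUL.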

That's ignoring any back-compat quirks we might have to support if we change things.
