Closed
Description
Zarr version
2.11.0 and up (including current main)
Numcodecs version
0.10.2
Python Version
3.10
Operating System
Mac and Linux
Installation
conda, pip and from source
Description
It appears that since #789 (commit 5c71212), i.e. from zarr 2.11.0 onward, there is a performance regression that affects reading zarr data through a store backed by an fsspec FSMap.
In our test example (in practice we use xarray), we have a zarr array made of 2K files (1 GB compressed in total). Reading it via:
np.asarray(zarr.open(fsspec.get_mapper(...), mode="r"))
- on zarr 2.10.3 took about 12 seconds
- on zarr 2.13.3 took about 90 seconds (so roughly 7x longer)
- the same problem exists starting from version 2.11.0
Looking at the stack traces from the two versions, it looks like 2.10.3 fetched multiple items asynchronously in one call (getitems in fsspec/mapping.py), while 2.13.3 makes a synchronous call per storage item (__getitem__ in fsspec/mapping.py)?
zarr 2.13.3
Thread 3529282 (idle): "MainThread"
do_futex_wait.constprop.0 (libpthread-2.31.so)
__new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
PyThread_acquire_lock_timed.localalias (python3.10)
lock_PyThread_acquire_lock (python3.10)
wait (threading.py:324)
wait (threading.py:607)
sync (fsspec/asyn.py:86)
wrapper (fsspec/asyn.py:113)
__getitem__ (fsspec/mapping.py:143)
__getitem__ (zarr/storage.py:724)
_chunk_getitem (zarr/core.py:1966)
_get_selection (zarr/core.py:1267)
_get_basic_selection_nd (zarr/core.py:976)
get_basic_selection (zarr/core.py:933)
__getitem__ (zarr/core.py:807)
__array__ (zarr/core.py:589)
PyArray_FromArrayAttr_int (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
_array_from_array_like (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
PyArray_DiscoverDTypeAndShape_Recursive (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
PyArray_DiscoverDTypeAndShape (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
PyArray_FromAny (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
PyArray_CheckFromAny (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
_array_fromobject_generic (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
array_asarray (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
<module> (<stdin>:2)
zarr 2.10.3
Thread 3530062 (idle): "MainThread"
do_futex_wait.constprop.0 (libpthread-2.31.so)
__new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
PyThread_acquire_lock_timed.localalias (python3.10)
lock_PyThread_acquire_lock (python3.10)
wait (threading.py:324)
wait (threading.py:607)
sync (fsspec/asyn.py:86)
wrapper (fsspec/asyn.py:113)
getitems (fsspec/mapping.py:93)
_chunk_getitems (zarr/core.py:1847)
_get_selection (zarr/core.py:1136)
_get_basic_selection_nd (zarr/core.py:841)
get_basic_selection (zarr/core.py:798)
__getitem__ (zarr/core.py:673)
__array__ (zarr/core.py:469)
PyArray_FromArrayAttr_int (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
_array_from_array_like (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
PyArray_DiscoverDTypeAndShape_Recursive (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
PyArray_DiscoverDTypeAndShape (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
PyArray_FromAny (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
PyArray_CheckFromAny (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
_array_fromobject_generic (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
array_asarray (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
<module> (<stdin>:2)
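The difference between the two stack traces can be illustrated with a small, self-contained simulation. This is only a sketch of the suspected access patterns, not zarr or fsspec code: `fetch`, `get_one_at_a_time`, `get_batched` and the `LATENCY` value are hypothetical names and numbers chosen for illustration. It shows why blocking on each item in turn costs roughly (number of items) x (latency), while one batched call costs roughly one latency.

```python
import asyncio
import time

LATENCY = 0.01  # assumed per-request round trip in seconds (illustrative only)


async def fetch(key: str) -> bytes:
    """Stand-in for one remote chunk read; sleeps to mimic network latency."""
    await asyncio.sleep(LATENCY)
    return key.encode()


def get_one_at_a_time(keys):
    """Per-item pattern: block on each request before issuing the next."""
    return {key: asyncio.run(fetch(key)) for key in keys}


def get_batched(keys):
    """Batched pattern: issue all requests concurrently and wait once."""

    async def gather_all():
        values = await asyncio.gather(*(fetch(key) for key in keys))
        return dict(zip(keys, values))

    return asyncio.run(gather_all())


def timed(func, keys):
    """Run func(keys) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(keys)
    return result, time.perf_counter() - start


keys = [f"0.{i}" for i in range(20)]
serial_result, serial_seconds = timed(get_one_at_a_time, keys)
batched_result, batched_seconds = timed(get_batched, keys)
print(f"per item: {serial_seconds:.3f}s, batched: {batched_seconds:.3f}s")
```

Both functions return the same data; only the waiting pattern differs. With about 2K chunks and real network latency, a gap of this kind would be consistent with the slowdown reported here, though that is an inference from the traces rather than a measured breakdown.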
Steps to reproduce
We need an existing zarr array to read:
np.asarray(zarr.open(fsspec.get_mapper(...), mode="r"))
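To see which access path is taken without attaching a profiler, one option is to wrap the mapper in a counting proxy. The sketch below is a hypothetical helper (`CountingStore` is not part of zarr or fsspec) and is demonstrated on a plain dict so it runs without any remote data. How a given zarr version treats such a wrapper is an assumption that would need checking.

```python
from collections import Counter
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Hypothetical proxy that counts how an underlying mapping is read."""

    def __init__(self, inner):
        self.inner = inner
        self.calls = Counter()

    def __getitem__(self, key):
        # Single-key read path.
        self.calls["__getitem__"] += 1
        return self.inner[key]

    def getitems(self, keys, **kwargs):
        # Batched read path: delegate if the inner mapping supports it,
        # otherwise fall back to a simple loop over the inner mapping.
        self.calls["getitems"] += 1
        if hasattr(self.inner, "getitems"):
            return self.inner.getitems(keys, **kwargs)
        return {key: self.inner[key] for key in keys if key in self.inner}

    def __contains__(self, key):
        # Delegate directly so membership tests do not inflate the read count.
        return key in self.inner

    def __setitem__(self, key, value):
        self.inner[key] = value

    def __delitem__(self, key):
        del self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)


store = CountingStore({"0.0": b"a", "0.1": b"b"})
single = store["0.0"]
batch = store.getitems(["0.0", "0.1"])
print(dict(store.calls))
```

Passing `CountingStore(fsspec.get_mapper(...))` to `zarr.open` in place of the bare mapper and inspecting `store.calls` afterwards might show whether reads arrive one key at a time or in batches, but this usage is untested here and may behave differently across zarr versions.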
Additional output
No response