It seems like iterating over a chunked array is inefficient at the moment, presumably because we're repeatedly decompressing the chunks. For example, if I do
```python
for pos, row in enumerate(data_root.variants):
    print(row)
    if pos == 1000:
        break
```
it takes several minutes (`data_root.variants` is a large 2D chunked matrix), but if I do
```python
for pos, row in enumerate(chunk_iterator(data_root.variants)):
    print(row)
    if pos == 1000:
        break
```
it takes less than a second, where
```python
def chunk_iterator(array):
    """
    Utility to iterate over the rows in the specified array efficiently
    by accessing one chunk at a time.
    """
    chunk_size = array.chunks[0]
    for j in range(array.shape[0]):
        if j % chunk_size == 0:
            chunk = array[j: j + chunk_size][:]
        yield chunk[j % chunk_size]
```
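Since `data_root.variants` isn't reproducible here, a rough self-contained timing sketch along the same lines looks like this (the array size, chunking, and the `time_first_rows` helper are arbitrary illustrations; it reuses the `chunk_iterator` above):

```python
import time

import numpy as np
import zarr

# Synthetic stand-in for data_root.variants; sizes and chunking are arbitrary.
data = zarr.array(np.random.rand(50_000, 20), chunks=(1_000, 20))

def time_first_rows(iterable, n=1000):
    """Time how long it takes to pull the first n rows out of an iterable."""
    start = time.perf_counter()
    for pos, _ in enumerate(iterable):
        if pos == n:
            break
    return time.perf_counter() - start

print("row-by-row :", time_first_rows(data))                  # re-reads a chunk per row
print("chunk-wise :", time_first_rows(chunk_iterator(data)))  # decompresses each chunk once
```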
To me, it's quite a surprising gotcha that zarr isn't doing this chunkwise decompression, and I think it would be good to do it by default. There is a small extra memory overhead, but I think that's probably OK, given the performance benefits.
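To make the "by default" idea concrete, here is a rough sketch of a thin wrapper that gives an array chunk-aware plain iteration; presumably the same few lines could live on the array class itself (the `ChunkwiseRows` name is just for illustration, not a proposed patch):

```python
import numpy as np
import zarr

class ChunkwiseRows:
    """Thin wrapper giving a 2D zarr array chunk-aware plain iteration."""

    def __init__(self, array):
        self.array = array

    def __iter__(self):
        chunk_size = self.array.chunks[0]
        chunk = None
        for j in range(self.array.shape[0]):
            if j % chunk_size == 0:
                # The only decompression point: one chunk-aligned slab per chunk.
                chunk = self.array[j: j + chunk_size]
            yield chunk[j % chunk_size]

# Plain iteration now touches each chunk exactly once.
z = zarr.array(np.arange(20).reshape(10, 2), chunks=(4, 2))
for row in ChunkwiseRows(z):
    print(row)
```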
Any thoughts?