It seems like iterating over a chunked array is inefficient at the moment, presumably because we're repeatedly decompressing the chunks. For example, if I do
```python
for pos, row in enumerate(data_root.variants):
    print(row)
    if pos == 1000:
        break
```
it takes several minutes (`data_root.variants` is a large 2D chunked matrix), but if I do
```python
for pos, row in enumerate(chunk_iterator(data_root.variants)):
    print(row)
    if pos == 1000:
        break
```
it takes less than a second, where
```python
def chunk_iterator(array):
    """
    Utility to iterate over the rows in the specified array efficiently
    by accessing one chunk at a time.
    """
    chunk_size = array.chunks[0]
    for j in range(array.shape[0]):
        if j % chunk_size == 0:
            chunk = array[j: j + chunk_size][:]
        yield chunk[j % chunk_size]
```
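Since `data_root.variants` isn't reproducible here, a rough self-contained timing sketch along the same lines looks like this (the array size, chunking, and the `time_first_rows` helper are arbitrary illustrations; it reuses the `chunk_iterator` above):

```python
import time

import numpy as np
import zarr

# Synthetic stand-in for data_root.variants; sizes and chunking are arbitrary.
data = zarr.array(np.random.rand(50_000, 20), chunks=(1_000, 20))

def time_first_rows(iterable, n=1000):
    """Time how long it takes to pull the first n rows out of an iterable."""
    start = time.perf_counter()
    for pos, _ in enumerate(iterable):
        if pos == n:
            break
    return time.perf_counter() - start

print("row-by-row :", time_first_rows(data))                  # re-reads a chunk per row
print("chunk-wise :", time_first_rows(chunk_iterator(data)))  # decompresses each chunk once
```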
To me, it's quite a surprising gotcha that zarr isn't doing this chunkwise decompression, and I think it would be good to do it by default. There is a small extra memory overhead, but I think that's probably OK, given the performance benefits.
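To make the "by default" idea concrete, here is a rough sketch of a thin wrapper that gives an array chunk-aware plain iteration; presumably the same few lines could live on the array class itself (the `ChunkwiseRows` name is just for illustration, not a proposed patch):

```python
import numpy as np
import zarr

class ChunkwiseRows:
    """Thin wrapper giving a 2D zarr array chunk-aware plain iteration."""

    def __init__(self, array):
        self.array = array

    def __iter__(self):
        chunk_size = self.array.chunks[0]
        chunk = None
        for j in range(self.array.shape[0]):
            if j % chunk_size == 0:
                # The only decompression point: one chunk-aligned slab per chunk.
                chunk = self.array[j: j + chunk_size]
            yield chunk[j % chunk_size]

# Plain iteration now touches each chunk exactly once.
z = zarr.array(np.arange(20).reshape(10, 2), chunks=(4, 2))
for row in ChunkwiseRows(z):
    print(row)
```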
Any thoughts?