Default efficient row iteration #398
Closed
@jeromekelleher

Description

It seems like iterating over a chunked array is inefficient at the moment, presumably because the whole chunk is decompressed again for every row we access. For example, if I do

    for pos, row in enumerate(data_root.variants):
        print(row)
        if pos == 1000:
            break

it takes several minutes (data_root.variants is a large 2D chunked matrix), but if I do

    for pos, row in enumerate(chunk_iterator(data_root.variants)):
        print(row)
        if pos == 1000:
            break

it takes less than a second, where

    def chunk_iterator(array):
        """
        Utility to iterate over the rows in the specified array efficiently
        by accessing one chunk at a time.
        """
        chunk_size = array.chunks[0]
        for j in range(array.shape[0]):
            if j % chunk_size == 0:
                # Decompress the next chunk of rows in one go.
                chunk = array[j: j + chunk_size]
            # Serve the row from the in-memory chunk.
            yield chunk[j % chunk_size]
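
For anyone wanting to reproduce the effect locally, here is a rough, self-contained benchmark sketch. It assumes zarr's zarr.zeros creation convenience and the chunk_iterator defined above; the shape, chunk size and dtype are arbitrary, and on versions of zarr with the row-at-a-time behaviour described here the first timing should be dramatically larger than the second.

    import time

    import zarr

    # Synthetic stand-in for data_root.variants: 100,000 rows stored
    # in chunks of 1,000 rows (shape, chunk size and dtype are arbitrary).
    z = zarr.zeros((100_000, 50), chunks=(1_000, 50), dtype="i4")
    z[:] = 1  # write every chunk so reads actually have to decompress data

    def time_first_rows(rows, n=1_000):
        """Time how long it takes to pull the first n rows from an iterable."""
        start = time.perf_counter()
        for pos, row in enumerate(rows):
            if pos == n:
                break
        return time.perf_counter() - start

    print("row at a time:  ", time_first_rows(z))
    print("chunk at a time:", time_first_rows(chunk_iterator(z)))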

To me, it's quite a surprising gotcha that zarr isn't doing this chunkwise decompression, and I think it would be good to do it by default. There is a small extra memory overhead (one decompressed chunk held in memory at a time), but I think that's probably OK, given the performance benefits.

Any thoughts?
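
For concreteness, this is roughly what the proposal could look like if it were wired in as the default iteration behaviour, sketched here as a hypothetical subclass of zarr.Array rather than zarr's actual implementation. It buffers along the first axis only and keeps a single decompressed chunk in memory at a time:

    import zarr

    class ChunkBufferedArray(zarr.Array):
        """Sketch: make a plain 'for row in array' loop decompress one chunk at a time."""

        def __iter__(self):
            if self.ndim == 0:
                raise TypeError("iteration over a 0-d array")
            chunk_rows = self.chunks[0]
            for start in range(0, self.shape[0], chunk_rows):
                # One chunk read (and decompression) per chunk_rows rows...
                buffer = self[start:start + chunk_rows]
                # ...then rows are served from the in-memory numpy buffer.
                yield from buffer

With something like this as the default, the plain enumerate(data_root.variants) loop at the top would behave like the chunk_iterator version, at the cost of holding one decompressed chunk in memory.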
