Making the output datasets of `add_to_collection` split-able · Issue #88 · scverse/annbatch · GitHub

Making the output datasets of add_to_collection split-able #88

@ilan-gold

Description of feature

The idea of `add_to_collection` is that you can load the whole dataset into memory, inject a new dataset, and then write it out to disk. At some point, though, the shard becomes too big, in which case you probably want some sort of "split" option where, past a certain size, the shard gets split into two (or n).

An alternative to this would be handling everything truly lazily, but I think we'd lose I/O efficiency, since writing from memory to disk in one pass is going to be faster than writing iteratively.
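
To make the "split" option concrete, here is a minimal sketch of how the split points could be planned once a shard grows past a size limit. It is not part of annbatch: the function name `plan_shard_splits` and the parameter `max_obs_per_shard` are hypothetical, and it assumes the limit is expressed as a number of rows (observations). The shard would still be assembled in memory first; each returned range would then be written out as its own shard.

```python
import math


def plan_shard_splits(n_obs: int, max_obs_per_shard: int) -> list[tuple[int, int]]:
    """Plan half-open (start, stop) row ranges for splitting one shard.

    Hypothetical helper, not an annbatch API. Uses the fewest shards that
    respect the limit and balances rows so shard sizes differ by at most one.
    """
    if n_obs < 0:
        raise ValueError("n_obs must be non-negative")
    if max_obs_per_shard <= 0:
        raise ValueError("max_obs_per_shard must be positive")
    if n_obs == 0:
        return []

    # Fewest shards such that none exceeds the limit.
    n_shards = math.ceil(n_obs / max_obs_per_shard)
    # Spread rows evenly; the first `extra` shards get one more row.
    base, extra = divmod(n_obs, n_shards)

    ranges = []
    start = 0
    for i in range(n_shards):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# A shard that grew to 10 rows with a limit of 4 is split into three shards.
print(plan_shard_splits(10, 4))
```

Balancing the ranges, rather than filling each shard to the limit and leaving a small remainder, avoids creating one tiny trailing shard. Whether that matters here depends on how the shards are read back, which the issue leaves open.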

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
