8000 Add ManifestStore for loading data from ManifestArrays by maxrjones · Pull Request #490 · zarr-developers/VirtualiZarr · GitHub
[go: up one dir, main page]

Skip to content

Add ManifestStore for loading data from ManifestArrays #490

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 60 commits into from
Mar 25, 2025

Conversation

maxrjones
Copy link
Member
@maxrjones maxrjones commented Mar 16, 2025

This PR allows loading data directly from ManifestArrays using obstore.
Possible TO-DOs:

  • Implement get_partial_values, exists, list_prefix, and list.

General checklist

Copy link
codecov bot commented Mar 16, 2025

Codecov Report

Attention: Patch coverage is 85.14286% with 26 lines in your changes missing coverage. Please review.

Project coverage is 87.65%. Comparing base (206277f) to head (61168f4).
Report is 52 commits behind head on develop.

Files with missing lines Patch % Lines
virtualizarr/manifests/store.py 85.21% 21 Missing ⚠️
virtualizarr/vendor/zarr/metadata.py 73.68% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #490      +/-   ##
===========================================
- Coverage    87.91%   87.65%   -0.26%     
===========================================
  Files           28       31       +3     
  Lines         1713     1888     +175     
===========================================
+ Hits          1506     1655     +149     
- Misses         207      233      +26     
Files with missing lines Coverage Δ
virtualizarr/manifests/__init__.py 100.00% <100.00%> (ø)
virtualizarr/manifests/group.py 100.00% <100.00%> (ø)
virtualizarr/vendor/zarr/metadata.py 73.68% <73.68%> (ø)
virtualizarr/manifests/store.py 85.21% <85.21%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member
@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome @maxrjones - I'm so impressed that you got this far this quickly. My comments are mostly nits about API choices.

Comment on lines 149 to 150
except _ALLOWED_EXCEPTIONS:
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are any of these exceptions allowed? Don't we want to raise them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how Zarr handles empty chunks (xref #16). I'll need to put some more thought into this and return with a more complete answer later on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you read this? #427 (comment)

Seems related.

@TomNicholas TomNicholas added the enhancement New feature or request label Mar 17, 2025
@TomNicholas TomNicholas added this to the v2.0.0 milestone Mar 17, 2025
Co-authored-by: Tom Nicholas <tom@earthmover.io>
@maxrjones
Copy link
Member Author

Thanks for prompt review, @TomNicholas! I addressed the simpler comments and will build out the test suite and request a follow-up once that's done and all the exposed issues are solved.

Co-authored-by: Tom Nicholas <tom@earthmover.io>
# Check each key to see if it's a prefix of the uri_string
for key in sorted_keys:
if request_key.startswith(key):
return StoreRequest(store=stores[key], key=request_key[len(key) :])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using key=request_key[len(key) :] makes some assumptions that are not universally valid. I will look into this tomorrow.

@TomNicholas
Copy link
Member

This has tests, which pass, so LGTM so we can iterate + build upon it.

maxrjones and others added 2 commits March 25, 2025 16:25
Co-authored-by: Tom Nicholas <tom@earthmover.io>
@maxrjones maxrjones merged commit f4d9f5b into develop Mar 25, 2025
12 checks passed
@maxrjones maxrjones deleted the virtual-obstore-store branch March 25, 2025 20:38
TomNicholas added a commit that referenced this pull request Mar 25, 2025
* need latest version of xarray to import internals correctly

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

* Add ManifestStore for loading data from ManifestArrays (#490)

* Draft ManifestStore implementation

---------

Co-authored-by: Tom Nicholas <tom@earthmover.io>
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make it work for dask delayed

* correct docstring

---------

Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
TomNicholas added a commit that referenced this pull request Mar 29, 2025
* copy implementation from xarray

* sketch idea for
C496
 lithops parallelization

* standardize naming of variables

* add to public API

* fix errors caused by trying to import xarray types

* start writing tests

* passing test for combining in serial

* requires_kerchunk

* test for lithops with default LocalHost executor

* notes on confusing AssertionError

* ensure lithops is installed

* remove uneeded fixture

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Additions to `open_virtual_mfdataset` (#508)

* need latest version of xarray to import internals correctly

* Fix metadata equality for nan fill value (#502)

* add check that works for fill_values too

* note about removing once merged upstream

* type hint

* regression test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove accidental changes to pyproject.toml

* Update pyproject.toml

* ignore mypy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Setup intersphinx mapping for docs (#503)

* Setup intersphinx mapping for docs

---------

Co-authored-by: Kyle Barron <kylebarron2@gmail.com>

* Change default loadable_variables (and indexes) to match xarray's behaviour (#477)

* draft refactor

* sketch of simplified handling of loadable_variables

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get at least some tests working

* separate VirtualBackend api definition from common utilities

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove indexes={} everywhere in tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* stop passing through loadable_variables to where it isn't used

* implement logic to load 1D dimension coords by default

* remove more instances of indexes={}

* remove more indexes={}

* refactor logic for choosing loadable_variables

* fix more tets

* xfail Aimee's test that I don't understand

* xfail test that explicitly specifies no indexes

* made a bunch more stuff pass

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix netcdf3 reader

* fix bad import in FITS reader

* fix import in tiff reader

* fix import in icechunk test

* release note

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update docstring

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix fits reader

* xfail on empty dict for indexes

* linting

* actually test new expected behaviour

* fix logic for setting loadable_variables

* update docs page to reflect new behaviour

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix expected behaviour in another tests

* additional assert

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use encode_dataset_coordinates in kerchunk writer

* Encode zarr vars

* fix some mypy errors

* move drop_variables implmentation to the end of every reader

* override loadable_variables and raise warning

* fix failing test by not creating loadable variables that would get inlined by default

* improve error message

* remove some more occurrences of indexes={}

* skip slow test

* slay mypy errors

* docs typos

* should fix dmrpp test

* Delete commented-out code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unecessary test skip

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>

* Update pyproject.toml deps (#504)

* re-add icechunk to upstream tests

* add pytest-asyncio to test envs

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>

* Additions to `open_virtual_mfdataset` (#509)

* need latest version of xarray to import internals correctly

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

* Add ManifestStore for loading data from ManifestArrays (#490)

* Draft ManifestStore implementation

---------

Co-authored-by: Tom Nicholas <tom@earthmover.io>
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make it work for dask delayed

* correct docstring

---------

Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* More open_virtual_mfdataset (#510)

* need latest version of xarray to import internals correctly

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

* make it work for dask delayed

* correct docstring

* added compliant executor for lithops

* add links to lithops issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Final fixes for open_virtual_mfdataset (#517)

* need latest version of xarray to import internals correctly

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

* make it work for dask delayed

* correct docstring

* added compliant executor for lithops

* add links to lithops issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* specify dask and lithops executors with a string again

* fix easy typing stuff

* fix typing errors by aligning executor signatures

* remove open_virtual_mfdataset from public API for now

* release note

* refactor construction of expected result

* implement preprocess arg, and dodge lithops bug

* update comment

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Apply suggestions from code reviewRemRemove new deps

* remove rogue print statement

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
@TomNicholas TomNicholas mentioned this pull request Mar 31, 2025
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading data from ManifestArrays without saving references to disk first
3 participants
0