xarray.open_mzar: open multiple zarr files (in parallel) by Mikejmnez · Pull Request #4003 · pydata/xarray · GitHub

xarray.open_mzar: open multiple zarr files (in parallel) #4003

Closed
wants to merge 58 commits into from
Changes from 1 commit
Commits
58 commits
f55ed1c
create def for multiple zarr files and added commentary/definition, w…
Apr 23, 2020
49f6512
just as with ``xr.open_mfdatasets``, identify the paths as local dire…
Apr 23, 2020
f35a3e5
added error if no path
Apr 23, 2020
9f728aa
finished copying similar code from `xr.open_mfdatasets`
Apr 23, 2020
8d0a844
remove blank lines
Apr 23, 2020
b3b0f1d
fixed typo
Apr 23, 2020
2221943
added ``xr.open_mzarr()`` to the list of available modules to call
Apr 23, 2020
ac35e7c
imported missing function
Apr 23, 2020
64654f3
imported missing glob
Apr 23, 2020
d5a5cef
imported function from backend.api
Apr 23, 2020
4c0ef19
imported function to facilitate mzarr
Apr 23, 2020
d158c21
correctly imported functions from core to mzarr
Apr 23, 2020
5171420
imported to use on open_mzarr
Apr 23, 2020
e1e51bb
removed lock and autoclose since not taken by ``open_zarr``
Apr 23, 2020
b6bf2cf
fixed typo
Apr 23, 2020
3bc4be8
class is not needed since zarr stores don`t remain open
Apr 23, 2020
a79b125
removed old behavior
Apr 23, 2020
2d3bbb5
set default
Apr 23, 2020
f7cf580
listed open_mzarr
Apr 25, 2020
53c8623
removed unused imported function
Apr 25, 2020
34d755e
imported Path - hadn`t before
Apr 25, 2020
b39b37e
remove unncessesary comments
Apr 25, 2020
276006a
modified comments
Apr 25, 2020
6f04be6
isorted zarr
Apr 25, 2020
aa97e1a
isorted
Apr 25, 2020
06de16a
erased open_mzarr. Added capability to open_dataset to open zarr files
Apr 28, 2020
f94fc9f
removed imported but unused
Apr 28, 2020
16e08e3
comment to `zarr` engine
Apr 28, 2020
22828fc
added chunking code from `open_zarr`
Apr 28, 2020
021f2cc
remove import `open_mzarr``
Apr 28, 2020
985f28c
removed `open_mzarr`` from top-level-function
Apr 28, 2020
e8ed887
missing return in nested function
Apr 29, 2020
d693514
moved outside of nested function, had touble with reading before assi…
Apr 29, 2020
df34f18
added missing argument associated with zarr stores, onto the definiti…
Apr 29, 2020
98351c7
isort zarr.py
Apr 29, 2020
160bd67
removed blank lines, fixed typo on `chunks`
Apr 29, 2020
7e57e9b
removed imported but unused
Apr 29, 2020
ac0f093
restored conditional for `auto`
Apr 29, 2020
6a1516c
removed imported but unused `dask.array`
Apr 29, 2020
8999faf
added capabilities for file_or_obj to be a mutablemapper such as `fss…
Apr 29, 2020
5df0985
moved to a different conditional since file_or_obj is a mutablemappin…
Apr 29, 2020
2d94ea2
isort api.py
Apr 29, 2020
377ef53
restored the option for when file_or_obk is a str, such as an url.
Apr 29, 2020
f48c84b
fixed relabel
Apr 29, 2020
8376cca
update open_dataset for zarr files
Apr 29, 2020
aed1cc5
remove open_zarr from tests, now open_dataset(engine=`zarr`)
Apr 29, 2020
b488363
remove extra file, and raise deprecating warning on open_zarr
Apr 29, 2020
bae7f10
added internal call to open_dataset from depricated open_zarr
Apr 29, 2020
37ff214
defined engine=`zarr`
Apr 29, 2020
b8b98f5
correct argument for open_dataset
Apr 29, 2020
5c37329
pass arguments as backend_kwargs
Apr 29, 2020
831f15b
pass backend_kwargs as argument
Apr 29, 2020
80dd7da
typo
Apr 29, 2020
4ebf380
set `overwrite_enconded_chunks as backend_kwargs
Apr 29, 2020
4ce3007
do not pass as backend, use for chunking
Apr 29, 2020
89a780b
removed commented code
May 22, 2020
6f6eb23
moved definitions to zarr backends
May 22, 2020
62893ab
Merge pull request #1 from Mikejmnez/new_branch
May 22, 2020
create def for multiple zarr files and added commentary/definition, which matches almost exactly that of ``xr.open_mfdatasets``, but without ``engine``
Miguel Jimenez-Urias committed Apr 23, 2020
commit f55ed1c3abff3fceca2bfde86794ff6379232b16
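
For context, a hedged sketch of how the API introduced in this commit is meant to be called, mirroring ``xr.open_mfdataset`` but without an ``engine`` argument. Exposing the backend function at the top level as ``xr.open_mfzarr``, and the ``"time"`` chunk key, are illustrative assumptions only; later commits in this PR replace the standalone function with ``open_dataset(engine="zarr")``.

import xarray as xr

# Hypothetical call (assumes the backend function were re-exported as
# xr.open_mfzarr): open every zarr store matching the glob and combine
# the resulting datasets by inspecting their coordinates.
ds = xr.open_mfzarr(
    "path/to/my/files/*.zarr",
    combine="by_coords",
    chunks={"time": 10},  # dask chunk size along the (assumed) "time" dimension
    parallel=True,        # open the stores through dask.delayed
)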
154 changes: 154 additions & 0 deletions xarray/backends/zarr.py
@@ -692,3 +692,157 @@ def maybe_chunk(name, var, chunks):

variables = {k: maybe_chunk(k, v, chunks) for k, v in ds.variables.items()}
return ds._replace_vars_and_dims(variables)


def open_mfzarr(
paths,
chunks=None,
concat_dim="_not_supplied",
compat='no_conflicts',
preprocess=None,
lock=None,
data_vars="all",
coords='different',
combine="_old_auto",
autoclose=None,
parallel=False,
join="outer",
attrs_file=None,
**kwargs,
):
"""Open multiple zarr files as a single dataset.


If combine="by_coords" then the function ``combine_by_coords`` is used to
combine the datasets into one before returning the result, and if
combine="nested" then ``combine_nested`` is used. The filepaths must be
structured according to which combining function is used, the details of
which are given in the documentation ``combine_by_coords`` and
``combine_nested``. Requires dask to be installed. Global attributes from
the ``attrs_file`` are used for the combined dataset.

Parameters
----------
paths : str or sequence
Either a string glob in the form ``"path/to/my/files/*.zarr"``,
``"path/to/my/files/*"`` (assuming the only directory is a zarr
store), or an explicit list of files to open. Paths can be given as
strings or as pathlib Paths.
chunks : int or dict, optional
Dictionary with keys given by dimension names and values given by
chunk sizes. In general, these should divide the dimensions of each
dataset. If int, chunk each dimension by ``chunks``. By default,
chunks will be chosen to load entire input files into memory at once.
This has a major impact on performance: please see the full
documentation for more details [2]_.
concat_dim : str, or list of str, DataArray, Index or None, optional
Dimensions to concatenate files along. You only need to provide this
argument if any of the dimensions along which you want to concatenate
is not a dimension of the original dataset, e.g. you want to stack a
collection of 2D arrays along a third dimension. Set
``concat_dim=[..., None, ...]`` explicitly to disable concatenation
along a particular dimension.
combine : {'by_coords', 'nested'}, optional
Whether ``xarray.combine_by_coords`` or ``xarray.combine_nested`` is
used to combine all the data. If this argument is not provided,
``xarray.combine_by_coords`` is set by default.
compat : {'identical', 'equals', 'broadcast_equals',
'no_conflicts', 'override'}, optional
String indicating how to compare variables of the same name for
potential conflicts when merging:

* 'broadcast_equals': all values must be equal when variables are
broadcast against each other to ensure common dimensions.
* 'equals': all values and dimensions must be the same.
* 'identical': all values, dimensions and attributes must be the same.
* 'no_conflicts': only values which are not null in both datasets
must be equal. The returned dataset then contains the combination of
all non-null values.
* 'override': skip comparing and pick variable from first dataset.

preprocess : callable, optional
If provided, call this function on each dataset prior to concatenation.
You can find the file-name from which each dataset was loaded in
``ds.encoding["source"]``.
lock : False or duck threading.Lock, optional
Resource lock to use when reading data from disk. Only relevant when
using dask or another form of parallelism. By default, appropriate
locks are chosen to safely read and write files with the currently
active dask scheduler.
data_vars : {'minimal', 'different', 'all' or list of str}, optional
These data variables will be concatenated together:
* 'minimal': Only data variables in which the dimension already
appears are included.
* 'different': Data variables which are not equal (ignoring
attributes) across all datasets are also concatenated (as well as
all for which dimension already appears). Beware: this option may
load the data payload of data variables into memory if they are not
already loaded.
* 'all': All data variables will be concatenated.
* list of str: The listed data variables will be concatenated, in
addition to the 'minimal' data variables.
coords : {'minimal', 'different', 'all' or list of str}, optional
These coordinate variables will be concatenated together:
* 'minimal': Only coordinates in which the dimension already appears
are included.
* 'different': Coordinates which are not equal (ignoring attributes)
across all datasets are also concatenated (as well as all for which
dimension already appears). Beware: this option may load the data
payload of coordinate variables into memory if they are not already
loaded.
* 'all': All coordinate variables will be concatenated, except
those corresponding to other dimensions.
* list of str: The listed coordinate variables will be concatenated,
in addition to the 'minimal' coordinates.
parallel : bool, optional
If True, the open and preprocess steps of this function will be
performed in parallel using ``dask.delayed``. Default is False.
join : {'outer', 'inner', 'left', 'right', 'exact', 'override'}, optional
String indicating how to combine differing indexes
(excluding concat_dim) in objects
- 'outer': use the union of object indexes
- 'inner': use the intersection of object indexes
- 'left': use indexes from the first object with each dimension
- 'right': use indexes from the last object with each dimension
- 'exact': instead of aligning, raise `ValueError` when indexes to be
aligned are not equal
- 'override': if indexes are of same size, rewrite indexes to be
those of the first object with that dimension. Indexes for the same
dimension must have the same size in all objects.
attrs_file : str or pathlib.Path, optional
Path of the file used to read global attributes from.
By default global attributes are read from the first file provided,
with wildcard matches sorted by filename.
**kwargs : optional
Additional arguments passed on to :py:func:`xarray.open_zarr`.


Returns
-------
xarray.Dataset

Notes
-----
``open_mfzarr`` opens files with read-only access. When you modify values
of a Dataset, even one linked to files on disk, only the in-memory copy you
are manipulating in xarray is modified: the original file on disk is never
touched.

See Also
--------
combine_by_coords
combine_nested
auto_combine
open_dataset

References
----------
.. [1] http://xarray.pydata.org/en/stable/dask.html
.. [2] http://xarray.pydata.org/en/stable/dask.html#chunking-and-performance
"""
pass
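
For reference, a minimal sketch of what this stub could eventually do, based only on the docstring above: expand a glob into store paths, open each store with ``open_zarr`` (optionally through ``dask.delayed`` when ``parallel=True``), apply ``preprocess``, and hand the datasets to ``combine_by_coords`` or ``combine_nested``. The helper name ``_open_mfzarr_sketch`` and its reduced argument list are illustrative assumptions, not the PR's final implementation (later commits fold this behaviour into ``open_dataset(engine="zarr")``).

from glob import glob
from pathlib import Path

import xarray as xr


def _open_mfzarr_sketch(paths, chunks=None, concat_dim=None, combine="by_coords",
                        preprocess=None, parallel=False, **kwargs):
    # Expand a glob string such as "path/to/my/files/*.zarr" into a sorted list.
    if isinstance(paths, str):
        paths = sorted(glob(paths))
    else:
        paths = [str(p) if isinstance(p, Path) else p for p in paths]
    if not paths:
        raise OSError("no zarr stores to open")

    open_ = xr.open_zarr
    if parallel:
        import dask

        # Wrap the open and preprocess steps so they are built lazily and can
        # be executed in parallel by the active dask scheduler.
        open_ = dask.delayed(xr.open_zarr)
        if preprocess is not None:
            preprocess = dask.delayed(preprocess)

    datasets = [open_(p, chunks=chunks, **kwargs) for p in paths]
    if preprocess is not None:
        datasets = [preprocess(ds) for ds in datasets]
    if parallel:
        datasets = list(dask.compute(*datasets))  # materialize the delayed opens

    # Combine by coordinate values, or by the nested list structure of ``paths``.
    if combine == "by_coords":
        return xr.combine_by_coords(datasets)
    return xr.combine_nested(datasets, concat_dim=concat_dim)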