SciPy 2019 submission (#397) · rjgildea/zarr-python@37672e1 · GitHub

Commit 37672e1

SciPy 2019 submission (zarr-developers#397)
* scipy2019 submission initial draft
* address @jhamman comments
* add summary, add authors, add HCA example
* order authors
* address @rabernat comments
* Update docs/talks/scipy2019/submission.rst Co-Authored-By: alimanfoo <alimanfoo@googlemail.com>
* address @jakirkham and @shikharsg comments
* minor fixes
1 parent ff456a8 commit 37672e1

1 file changed: +144 -0 lines changed

docs/talks/scipy2019/submission.rst

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
Zarr - scalable storage of tensor data for use in parallel and distributed computing
====================================================================================

SciPy 2019 submission.


Short summary
-------------

Many scientific problems involve computing over large N-dimensional
typed arrays of data, and reading or writing data is often the major
bottleneck limiting speed or scalability. The Zarr project is
developing a simple, scalable approach to storage of such data in a
way that is compatible with a range of approaches to distributed and
parallel computing. We describe the Zarr protocol and data storage
format, and the current state of implementations for various
programming languages including Python. We also describe current uses
of Zarr in malaria genomics, the Human Cell Atlas, and the Pangeo
project.


Abstract
--------

Background
~~~~~~~~~~

Across a broad range of scientific disciplines, data are naturally
represented and stored as N-dimensional typed arrays, also known as
tensors. The volume of data being generated is outstripping our
ability to analyse it, and scientific communities are looking for ways
to leverage modern multi-core CPUs and distributed computing
platforms, including cloud computing. Retrieval and storage of data is
often the major bottleneck, and new approaches to data storage are
needed to accelerate distributed computations and enable them to scale
on a variety of platforms.

Methods
~~~~~~~

We have designed a new storage format and protocol for tensor data
[1_], and have released an open source Python implementation [2_,
3_]. Our approach builds on data storage concepts from HDF5 [4_],
particularly chunking and compression, and hierarchical organisation
of datasets. Key design goals include: a simple protocol and format
that can be implemented in other programming languages; support for
multiple concurrent readers or writers; support for a variety of
parallel computing environments, from multi-threaded execution on a
single CPU to multi-process execution across a multi-node cluster; a
pluggable storage subsystem with support for file systems, key-value
databases and cloud object stores; and a pluggable encoding subsystem
with support for a variety of modern compressors.

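The ideas of chunking, compression and a pluggable key-value store can
be illustrated with a minimal sketch that uses only the Python standard
library. This is a toy model written for this illustration, not the
Zarr format or the Zarr Python API: the key names (``meta``,
``chunk/<n>``), the chunk length and the helper functions are all
hypothetical, and a plain ``dict`` stands in for a file system or
object store.

```python
import json
import zlib
from array import array
from concurrent.futures import ThreadPoolExecutor

CHUNK_LEN = 4  # hypothetical chunk length, kept small for illustration


def write_chunk(store, index, values):
    """Compress one chunk of C int values and store it under its own key."""
    store[f"chunk/{index}"] = zlib.compress(array("i", values).tobytes())


def write_array(store, data):
    """Split `data` into fixed-size chunks and write them concurrently."""
    store["meta"] = json.dumps({"length": len(data), "chunk_len": CHUNK_LEN})
    chunks = [data[i:i + CHUNK_LEN] for i in range(0, len(data), CHUNK_LEN)]
    # Each chunk has its own key, so writers never touch the same entry.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda item: write_chunk(store, *item), enumerate(chunks)))


def read_array(store):
    """Decompress every chunk in order and reassemble the full array."""
    meta = json.loads(store["meta"])
    n_chunks = -(-meta["length"] // meta["chunk_len"])  # ceiling division
    out = array("i")
    for index in range(n_chunks):
        out.frombytes(zlib.decompress(store[f"chunk/{index}"]))
    return list(out)


store = {}  # assumption: any mutable mapping could play this role
data = list(range(10))
write_array(store, data)
restored = read_array(store)
```

Because every chunk lives under a separate key, writers working on
different chunks do not need to coordinate, which is the property that
makes this layout convenient for multi-threaded or multi-process use.
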
Results
~~~~~~~

We illustrate the use of Zarr with examples from several scientific
domains. Zarr is being used within the Pangeo project [5_], which is
building a community platform for big data geoscience. The Pangeo
community have converted a number of existing climate modelling and
satellite observation datasets to Zarr [6_], and have demonstrated
their use in computations using HPC and cloud computing
environments. Within the MalariaGEN project [7_], Zarr is used to
store genome variation data from next-generation sequencing of natural
populations of malaria parasites and mosquitoes [8_], and these data
are used as input to analyses of the evolution of these organisms in
response to selective pressure from anti-malarial drugs and
insecticides. Zarr is being used within the Human Cell Atlas (HCA)
project [9_], which is building a reference atlas of healthy human
cell types. This project hopes to leverage this information to better
understand the dysregulation of cellular states that underlie human
disease. The Human Cell Atlas uses Zarr as the output data format
because it enables the project to easily generate matrices containing
user-selected subsets of cells.

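One way to see why a chunked layout suits subset selection is the
following standard-library sketch, in which only the chunks that
contain the requested rows are decompressed. It is an assumption-laden
illustration, not the Human Cell Atlas pipeline and not the Zarr API:
the matrix shape, the chunk height and the functions ``build_store``
and ``select_rows`` are hypothetical.

```python
import zlib
from array import array

ROWS_PER_CHUNK = 2  # hypothetical chunk height
N_COLS = 3          # hypothetical number of columns per row


def build_store(matrix):
    """Store a row-major matrix as compressed chunks of ROWS_PER_CHUNK rows."""
    store = {}
    for start in range(0, len(matrix), ROWS_PER_CHUNK):
        flat = [v for row in matrix[start:start + ROWS_PER_CHUNK] for v in row]
        store[start // ROWS_PER_CHUNK] = zlib.compress(array("d", flat).tobytes())
    return store


def select_rows(store, rows):
    """Return the requested rows, decompressing each needed chunk only once."""
    cache = {}
    result = []
    for r in rows:
        chunk_id, offset = divmod(r, ROWS_PER_CHUNK)
        if chunk_id not in cache:
            values = array("d")
            values.frombytes(zlib.decompress(store[chunk_id]))
            cache[chunk_id] = values
        begin = offset * N_COLS
        result.append(list(cache[chunk_id][begin:begin + N_COLS]))
    # Also report which chunks were actually read.
    return result, sorted(cache)


matrix = [[float(r * N_COLS + c) for c in range(N_COLS)] for r in range(6)]
store = build_store(matrix)
subset, chunks_read = select_rows(store, [0, 1, 5])
```

Here rows 0, 1 and 5 fall in the first and last chunks, so the middle
chunk is never decompressed.
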
Conclusions
~~~~~~~~~~~

Zarr is generating interest across a range of scientific domains, and
work is ongoing to establish a community process to support further
development of the specifications and implementations in other
programming languages [10_, 11_, 12_], and to build interoperability
with a similar project called N5 [13_]. Other packages within the
PyData ecosystem, notably Dask [14_], Xarray [15_] and Intake [16_],
have added capability to read and write Zarr, and together these
packages provide a compelling solution for large-scale data science
using Python [17_]. Zarr has recently been presented in several
venues, including a webinar for the ESIP Federation tech dive series
[18_], and a talk at the AGU Fall Meeting 2018 [19_].


References
~~~~~~~~~~

.. _1: https://zarr.readthedocs.io/en/stable/spec/v2.html
.. _2: https://github.com/zarr-developers/zarr
.. _3: https://github.com/zarr-developers/numcodecs
.. _4: https://www.hdfgroup.org/solutions/hdf5/
.. _5: https://pangeo.io/
.. _6: https://pangeo.io/catalog.html
.. _7: https://www.malariagen.net/
.. _8: http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html
.. _9: https://www.humancellatlas.org/
.. _10: https://github.com/constantinpape/z5
.. _11: https://github.com/lasersonlab/ndarray.scala
.. _12: https://github.com/meggart/ZarrNative.jl
.. _13: https://github.com/saalfeldlab/n5
.. _14: http://docs.dask.org/en/latest/array-creation.html
.. _15: http://xarray.pydata.org/en/stable/io.html
.. _16: https://github.com/ContinuumIO/intake-xarray
.. _17: http://matthewrocklin.com/blog/work/2018/01/22/pangeo-2
.. _18: http://wiki.esipfed.org/index.php/Interoperability_and_Technology/Tech_Dive_Webinar_Series#8_March.2C_2018:_.22Zarr:_A_simple.2C_open.2C_scalable_solution_for_big_NetCDF.2FHDF_data_on_the_Cloud.22:_Alistair_Miles.2C_University_of_Oxford.
.. _19: https://agu.confex.com/agu/fm18/meetingapp.cgi/Paper/390015


Authors
-------

Project contributors are listed in alphabetical order by surname.

* `Ryan Abernathey <https://github.com/rabernat>`_, Columbia University
* `Stephan Balmer <https://github.com/sbalmer>`_, Meteotest
* `Ambrose Carr <https://github.com/ambrosejcarr>`_, Chan Zuckerberg Initiative
* `Tim Crone <https://github.com/tjcrone>`_, Columbia University
* `Martin Durant <https://github.com/martindurant>`_, Anaconda, Inc.
* `Jan Funke <https://github.com/funkey>`_, HHMI Janelia
* `Darren Gallagher <https://github.com/dazzag24>`_, Satavia
* `Fabian Gans <https://github.com/meggart>`_, Max Planck Institute for Biogeochemistry
* `Shikhar Goenka <https://github.com/shikharsg>`_, Satavia
* `Joe Hamman <https://github.com/jhamman>`_, NCAR
* `Stephan Hoyer <https://github.com/shoyer>`_, Google
* `Jerome Kelleher <https://github.com/jeromekelleher>`_, University of Oxford
* `John Kirkham <https://github.com/jakirkham>`_, HHMI Janelia
* `Alistair Miles <https://github.com/alimanfoo>`_, University of Oxford
* `Josh Moore <https://github.com/joshmoore>`_, University of Dundee
* `Charles Noyes <https://github.com/CSNoyes>`_, University of Southern California
* `Tarik Onalan <https://github.com/onalant>`_
* `Constantin Pape <https://github.com/constantinpape>`_, University of Heidelberg
* `Zain Patel <https://github.com/mzjp2>`_, University of Cambridge
* `Matthew Rocklin <https://github.com/mrocklin>`_, NVIDIA
* `Stephan Saalfeld <https://github.com/axtimwalde>`_, HHMI Janelia
* `Vincent Schut <https://github.com/vincentschut>`_, Satelligence
* `Justin Swaney <https://github.com/jmswaney>`_, MIT
* `Ryan Williams <https://github.com/ryan-williams>`_, Chan Zuckerberg Initiative
