| 1 | +Zarr - scalable storage of tensor data for use in parallel and distributed computing |
| 2 | +==================================================================================== |
| 3 | + |
| 4 | +SciPy 2019 submission. |
| 5 | + |
| 6 | + |
| 7 | +Short summary |
| 8 | +------------- |
| 9 | + |
| 10 | +Many scientific problems involve computing over large N-dimensional |
| 11 | +typed arrays of data, and reading or writing data is often the major |
| 12 | +bottleneck limiting speed or scalability. The Zarr project is |
| 13 | +developing a simple, scalable approach to storage of such data in a |
| 14 | +way that is compatible with a range of approaches to distributed and |
| 15 | +parallel computing. We describe the Zarr protocol and data storage |
| 16 | +format, and the current state of implementations for various |
| 17 | +programming languages including Python. We also describe current uses |
| 18 | +of Zarr in malaria genomics, the Human Cell Atlas, and the Pangeo |
| 19 | +project. |
| 20 | + |
| 21 | + |
| 22 | +Abstract |
| 23 | +-------- |
| 24 | + |
| 25 | +Background |
| 26 | +~~~~~~~~~~ |
| 27 | + |
| 28 | +Across a broad range of scientific disciplines, data are naturally |
| 29 | +represented and stored as N-dimensional typed arrays, also known as |
| 30 | +tensors. The volume of data being generated is outstripping our |
| 31 | +ability to analyse it, and scientific communities are looking for ways |
| 32 | +to leverage modern multi-core CPUs and distributed computing |
| 33 | +platforms, including cloud computing. Retrieval and storage of data is |
| 34 | +often the major bottleneck, and new approaches to data storage are |
| 35 | +needed to accelerate distributed computations and enable them to scale |
| 36 | +on a variety of platforms. |
| 37 | + |
| 38 | +Methods |
| 39 | +~~~~~~~ |
| 40 | + |
| 41 | +We have designed a new storage format and protocol for tensor data |
| 42 | +[1_], and have released an open source Python implementation [2_, |
| 43 | +3_]. Our approach builds on data storage concepts from HDF5 [4_], |
| 44 | +particularly chunking and compression, and hierarchical organisation |
| 45 | +of datasets. Key design goals include: a simple protocol and format |
| 46 | +that can be implemented in other programming languages; support for |
| 47 | +multiple concurrent readers or writers; support for a variety of |
| 48 | +parallel computing environments, from multi-threaded execution on a |
| 49 | +single CPU to multi-process execution across a multi-node cluster; |
| 50 | +a pluggable storage subsystem with support for file systems, key-value |
| 51 | +databases and cloud object stores; and a pluggable encoding subsystem with |
| 52 | +support for a variety of modern compressors. |
| 53 | + |
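The chunking, compression and pluggable-storage ideas listed under Methods can be illustrated with a minimal sketch that uses only the Python standard library. It is not the Zarr implementation or its on-disk format: the function names ``write_chunked`` and ``read_slice``, the ``"meta"`` key and the integer chunk keys are all hypothetical, chosen for illustration. The sketch assumes a one-dimensional array of 64-bit floats and a store that behaves like a dictionary from string keys to bytes.

```python
import json
import struct
import zlib


def write_chunked(store, values, chunk_len):
    """Split a 1-D list of floats into chunks and compress each under its own key."""
    # Hypothetical metadata layout, loosely inspired by the idea of
    # storing shape, chunk shape and dtype alongside the chunks.
    meta = {"shape": [len(values)], "chunks": [chunk_len], "dtype": "<f8"}
    store["meta"] = json.dumps(meta).encode()
    for i in range(0, len(values), chunk_len):
        chunk = values[i:i + chunk_len]
        raw = struct.pack("<%dd" % len(chunk), *chunk)
        # Each chunk is an independent compressed blob, so separate
        # workers could read or write different chunks concurrently.
        store[str(i // chunk_len)] = zlib.compress(raw)


def read_slice(store, start, stop):
    """Return values[start:stop], decompressing only the overlapping chunks."""
    meta = json.loads(store["meta"])
    chunk_len = meta["chunks"][0]
    stop = min(stop, meta["shape"][0])
    out = []
    if start >= stop:
        return out
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    for idx in range(first, last + 1):
        raw = zlib.decompress(store[str(idx)])
        chunk = struct.unpack("<%dd" % (len(raw) // 8), raw)
        lo = max(start - idx * chunk_len, 0)
        hi = min(stop - idx * chunk_len, len(chunk))
        out.extend(chunk[lo:hi])
    return out


# A plain dict stands in for the storage layer here; any mapping from
# string keys to bytes (a directory wrapper, a key-value database, an
# object-store client) could be swapped in without changing the functions.
store = {}
data = [float(i) for i in range(10)]
write_chunked(store, data, chunk_len=4)
result = read_slice(store, 3, 9)
```

Because the functions only touch the store through key lookups and assignments, changing the storage backend or the compressor would not affect the slicing logic, which is the separation of concerns the design goals describe.
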
| 54 | +Results |
| 55 | +~~~~~~~ |
| 56 | + |
| 57 | +We illustrate the use of Zarr with examples from several scientific |
| 58 | +domains. Zarr is being used within the Pangeo project [5_], which is |
| 59 | +building a community platform for big data geoscience. The Pangeo |
| 60 | +community have converted a number of existing climate modelling and |
| 61 | +satellite observation datasets to Zarr [6_], and have demonstrated |
| 62 | +their use in computations using HPC and cloud computing |
| 63 | +environments. Within the MalariaGEN project [7_], Zarr is used to |
| 64 | +store genome variation data from next-generation sequencing of natural |
| 65 | +populations of malaria parasites and mosquitoes [8_] and these data |
| 66 | +are used as input to analyses of the evolution of these organisms in |
| 67 | +response to selective pressure from anti-malarial drugs and |
| 68 | +insecticides. Zarr is being used within the Human Cell Atlas (HCA) |
| 69 | +project [9_], which is building a reference atlas of healthy human |
| 70 | +cell types. This project hopes to leverage this information to better |
| 71 | +understand the dysregulation of cellular states that underlie human |
| 72 | +disease. The Human Cell Atlas uses Zarr as the output data format |
| 73 | +because it enables the project to easily generate matrices containing |
| 74 | +user-selected subsets of cells. |
| 75 | + |
| 76 | +Conclusions |
| 77 | +~~~~~~~~~~~ |
| 78 | + |
| 79 | +Zarr is generating interest across a range of scientific domains, and |
| 80 | +work is ongoing to establish a community process to support further |
| 81 | +development of the specifications and implementations in other |
| 82 | +programming languages [10_, 11_, 12_] and to build interoperability |
| 83 | +with a similar project called N5 [13_]. Other packages within the |
| 84 | +PyData ecosystem, notably Dask [14_], Xarray [15_] and Intake [16_], |
| 85 | +have added capability to read and write Zarr, and together these |
| 86 | +packages provide a compelling solution for large scale data science |
| 87 | +using Python [17_]. Zarr has recently been presented in several |
| 88 | +venues, including a webinar for the ESIP Federation tech dive series |
| 89 | +[18_], and a talk at the AGU Fall Meeting 2018 [19_]. |
| 90 | + |
| 91 | + |
| 92 | +References |
| 93 | +~~~~~~~~~~ |
| 94 | + |
| 95 | +.. _1: https://zarr.readthedocs.io/en/stable/spec/v2.html |
| 96 | +.. _2: https://github.com/zarr-developers/zarr |
| 97 | +.. _3: https://github.com/zarr-developers/numcodecs |
| 98 | +.. _4: https://www.hdfgroup.org/solutions/hdf5/ |
| 99 | +.. _5: https://pangeo.io/ |
| 100 | +.. _6: https://pangeo.io/catalog.html |
| 101 | +.. _7: https://www.malariagen.net/ |
| 102 | +.. _8: http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html |
| 103 | +.. _9: https://www.humancellatlas.org/ |
| 104 | +.. _10: https://github.com/constantinpape/z5 |
| 105 | +.. _11: https://github.com/lasersonlab/ndarray.scala |
| 106 | +.. _12: https://github.com/meggart/ZarrNative.jl |
| 107 | +.. _13: https://github.com/saalfeldlab/n5 |
| 108 | +.. _14: http://docs.dask.org/en/latest/array-creation.html |
| 109 | +.. _15: http://xarray.pydata.org/en/stable/io.html |
| 110 | +.. _16: https://github.com/ContinuumIO/intake-xarray |
| 111 | +.. _17: http://matthewrocklin.com/blog/work/2018/01/22/pangeo-2 |
| 112 | +.. _18: http://wiki.esipfed.org/index.php/Interoperability_and_Technology/Tech_Dive_Webinar_Series#8_March.2C_2018:_.22Zarr:_A_simple.2C_open.2C_scalable_solution_for_big_NetCDF.2FHDF_data_on_the_Cloud.22:_Alistair_Miles.2C_University_of_Oxford. |
| 113 | +.. _19: https://agu.confex.com/agu/fm18/meetingapp.cgi/Paper/390015 |
| 114 | + |
| 115 | + |
| 116 | +Authors |
| 117 | +------- |
| 118 | + |
| 119 | +Project contributors are listed in alphabetical order by surname. |
| 120 | + |
| 121 | +* `Ryan Abernathey <https://github.com/rabernat>`_, Columbia University |
| 122 | +* `Stephan Balmer <https://github.com/sbalmer>`_, Meteotest |
| 123 | +* `Ambrose Carr <https://github.com/ambrosejcarr>`_, Chan Zuckerberg Initiative |
| 124 | +* `Tim Crone <https://github.com/tjcrone>`_, Columbia University |
| 125 | +* `Martin Durant <https://github.com/martindurant>`_, Anaconda, Inc. |
| 126 | +* `Jan Funke <https://github.com/funkey>`_, HHMI Janelia |
| 127 | +* `Darren Gallagher <https://github.com/dazzag24>`_, Satavia |
| 128 | +* `Fabian Gans <https://github.com/meggart>`_, Max Planck Institute for Biogeochemistry |
| 129 | +* `Shikhar Goenka <https://github.com/shikharsg>`_, Satavia |
| 130 | +* `Joe Hamman <https://github.com/jhamman>`_, NCAR |
| 131 | +* `Stephan Hoyer <https://github.com/shoyer>`_, Google |
| 132 | +* `Jerome Kelleher <https://github.com/jeromekelleher>`_, University of Oxford |
| 133 | +* `John Kirkham <https://github.com/jakirkham>`_, HHMI Janelia |
| 134 | +* `Alistair Miles <https://github.com/alimanfoo>`_, University of Oxford |
| 135 | +* `Josh Moore <https://github.com/joshmoore>`_, University of Dundee |
| 136 | +* `Charles Noyes <https://github.com/CSNoyes>`_, University of Southern California |
| 137 | +* `Tarik Onalan <https://github.com/onalant>`_ |
| 138 | +* `Constantin Pape <https://github.com/constantinpape>`_, University of Heidelberg |
| 139 | +* `Zain Patel <https://github.com/mzjp2>`_, University of Cambridge |
| 140 | +* `Matthew Rocklin <https://github.com/mrocklin>`_, NVIDIA |
| 141 | +* `Stephan Saalfeld <https://github.com/axtimwalde>`_, HHMI Janelia |
| 142 | +* `Vincent Schut <https://github.com/vincentschut>`_, Satelligence |
| 143 | +* `Justin Swaney <https://github.com/jmswaney>`_, MIT |
| 144 | +* `Ryan Williams <https://github.com/ryan-williams>`_, Chan Zuckerberg Initiative |