The v3 spec permits the existence of Zarr groups without any distinguishing metadata.
In the section comparing v3 with v2, the spec states:
v3 allows for greater flexibility in how groups and arrays are created. In particular, v3 supports implicit groups, which are groups that do not have a metadata document but whose existence is implied by descendant nodes. This change enables multiple arrays to be created in parallel without generating race conditions for the metadata when creating parent groups.
So the argument here is that we want to avoid race conditions when creating arrays in parallel. Is this a serious problem for anyone? Personally, I was not aware that parallel hierarchy mutation was a design goal of Zarr. I always thought that the only parallelism guarantees were for separate array chunks; since creating nodes in the hierarchy is so simple (just write a JSON document), there shouldn't be a motivation for parallelizing this process, at least that's how it seems to me.
Later, there is a section comparing explicit and implicit groups, which states:
This specification defines both implicit and explicit groups, but implementations may create an explicit group for all implicit groups they encounter, in particular when using a hierarchical storage.
Erasure of an implicit group may automatically erase any empty parent. For example on a S3 store where the namespace is flat, erasure of the last key with a prefix will erase all implicit groups in the prefix.
Care must be taken when erasing an array or a group if the parent needs to be converted into an explicit group.
A race-condition arises if a client writes an array at path P, and another client concurrently assumes P is an implicit group and writes subgroups or arrays into it. Implementations can avoid this race condition by exclusively using explicit groups.
So here we learn that implicit groups actually introduce a new type of race condition, because they make the structure of a Zarr hierarchy ambiguous, and there's a suggestion that implementations modify the Zarr hierarchies they encounter, inserting explicit groups wherever implicit groups are detected. I don't think this is great. First, we have traded the race condition that motivated implicit groups for another one, so the net number of race conditions is unchanged. Second, we are encouraging implementations to mutate the hierarchies they encounter, perhaps as an admission that implicit groups might be a bit of a headache in practice.
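To make the race described in the spec text concrete, here is a minimal sketch of one possible interleaving. It is not zarr-python code: the dict-backed `store`, the `<path>/zarr.json` key layout, and the helpers `metadata_key`, `node_type`, and `write_node` are all hypothetical simplifications chosen for illustration.

```python
import json

# Hypothetical flat key-value store standing in for an object store.
store = {}


def metadata_key(path):
    # Assumed layout: one metadata document per node (illustrative only).
    return f"{path}/zarr.json"


def node_type(store, path):
    """Return 'array' or 'group' if a metadata document exists, else None."""
    doc = store.get(metadata_key(path))
    return json.loads(doc)["node_type"] if doc is not None else None


def write_node(store, path, kind):
    store[metadata_key(path)] = json.dumps({"node_type": kind})


P = "root/P"

# Client B looks at P first. There is no metadata document, so B assumes
# that P is (or can become) an implicit group.
b_saw = node_type(store, P)

# Client A then creates an array at P.
write_node(store, P, "array")

# Client B acts on its stale observation and writes a child array under P.
write_node(store, f"{P}/child", "array")

# The store now contains an array nested inside an array.
print(node_type(store, P), node_type(store, f"{P}/child"))
```

If client B were required to write an explicit group document at P before adding children, the conflict would at least surface as two competing metadata documents for the same node, rather than passing silently.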
I'm honestly not sure what the advantage is of implicit groups. Here are some disadvantages, from my POV:
- Implicit groups make the structure of the hierarchy more ambiguous. With implicit groups, two Zarr hierarchies can be "identical" yet have very different contents, because one may have explicit groups where the other has implicit groups.
- Implicit groups make the identity of a single node ambiguous. In zarr-python, we have an API that consumes paths on a file system / object store and attempts to infer whether that path points to a Zarr array or group. With implicit groups, literally any valid path can be interpreted as a Zarr group. This means that the boundary of a Zarr hierarchy is not well defined, and essentially includes the entire file system. It becomes impossible for a user to include an extra non-zarr directory inside a Zarr hierarchy. Do we want this outcome?
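Both disadvantages can be illustrated with a small sketch. This is not the zarr-python API: `infer_node_type`, the `allow_implicit` flag, the `<path>/zarr.json` key layout, and the loose "any key under this prefix" check are assumptions made for the example.

```python
import json


def metadata_key(path):
    # Assumed layout: one metadata document per node (illustrative only).
    return f"{path}/zarr.json"


def infer_node_type(store, path, allow_implicit):
    """Guess what kind of node lives at `path` in a flat key-value store."""
    doc = store.get(metadata_key(path))
    if doc is not None:
        return json.loads(doc)["node_type"]
    # Loose check: with implicit groups, any prefix that has keys beneath it
    # can be read as a group, even without a metadata document.
    if allow_implicit and any(key.startswith(path + "/") for key in store):
        return "group"
    return None


array_doc = json.dumps({"node_type": "array"})
group_doc = json.dumps({"node_type": "group"})

# Same logical hierarchy (group "root" containing array "a"), stored two ways.
explicit = {"root/zarr.json": group_doc, "root/a/zarr.json": array_doc}
implicit = {"root/a/zarr.json": array_doc, "root/notes/readme.txt": "not zarr"}

# Both stores report a group at "root", but their stored keys differ.
print(infer_node_type(explicit, "root", allow_implicit=True))
print(infer_node_type(implicit, "root", allow_implicit=True))

# A plain directory of non-Zarr files is also reported as a group.
print(infer_node_type(implicit, "root/notes", allow_implicit=True))

# Without implicit groups, the same paths are simply not Zarr nodes.
print(infer_node_type(implicit, "root/notes", allow_implicit=False))
```

With `allow_implicit=False` the hierarchy boundary is determined by metadata documents alone, which is what lets `root/notes` coexist with Zarr data without being mistaken for a group.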
I think we should reconsider including implicit groups in the v3 spec. Removing implicit groups would simplify some matters over in the ongoing zarr-python v3 refactoring effort. The main question I have is whether there is anyone who really needs implicit groups for some reason, in which case I am curious to learn more about that use case.