In order for an ndarray/dataframe system to interact with a variety of frameworks in a distributed environment (such as clusters of workstations) a stable description of the distribution characteristics is needed.
The __partitioned__ protocol accomplishes this by defining a structure which provides the necessary meta-information. Implementations show that it allows data exchange between different distributed frameworks without unnecessary data transmission or even copies.
The structure defines how the data is partitioned, where it is located and provides a function to access the local data. It does not define or provide any means of communication, messaging and/or resource management. It merely describes the current distribution state.
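To make the description above concrete, here is a minimal sketch of what such a structure could look like for a toy 1-D container. The field names (shape, partition_tiling, partitions, start, data, location, get) follow the draft proposal, but the container class, the node ids, and the identity get function are purely illustrative assumptions, not part of any real framework:

```python
class PartitionedList:
    """Toy single-process container sketching a __partitioned__ producer.
    A real producer (e.g. a Ray- or Dask-backed container) would put
    opaque handles such as ObjectRefs/futures under "data" and real
    worker ids under "location"; here everything is local."""

    def __init__(self, data, n_parts=2):
        step = -(-len(data) // n_parts)  # ceiling division
        self._chunks = [data[i:i + step] for i in range(0, len(data), step)]

    @property
    def __partitioned__(self):
        parts = {}
        offset = 0
        for i, chunk in enumerate(self._chunks):
            parts[(i,)] = {
                "start": (offset,),        # global offset of this partition
                "shape": (len(chunk),),    # shape of this partition
                "data": chunk,             # opaque handle; here the data itself
                "location": [f"node{i}"],  # hypothetical process/node id
            }
            offset += len(chunk)
        return {
            "shape": (offset,),                      # global shape
            "partition_tiling": (len(self._chunks),),  # partitions per dimension
            "partitions": parts,
            # resolve a handle to local data; a no-op for in-process data,
            # but e.g. ray.get for a Ray-backed producer
            "get": lambda handle: handle,
        }

meta = PartitionedList(list(range(10)), n_parts=2).__partitioned__
```

Note that the structure is pure metadata: producing it moves no data, and a consumer only materializes a partition when it calls get on that partition's handle.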
In a way, this is similar to the meta-data which dlpack provides for exchanging data within a single node (including the local GPU), but it does so for data which lives on more than one process/node. It complements mechanisms for intra-node exchange, such as dlpack, __array_interface__ and the like.
The current lack of such a structure typically leads to one of the following scenarios when connecting different frameworks:
- A data consumer implements dedicated import functionality for every distributed data container it considers important enough. As an example, xgboost_ray implements a variety of data_sources. This PR lets xgboost_ray automatically work with any new container that supports __partitioned__ (the extra code for modin DF and Ray/MLDataSet is no longer needed once they support it, too).
- The user needs to explicitly deal with the distributed nature of the data. This either leads to unnecessary data transfer, or developers need to understand the internals of the data container; in the latter case they are exposed to explicit parallelism/distribution, while the original intent of the producer was often to hide exactly that. The implementation here avoids this by entirely hiding the distribution features while still allowing zero-copy data exchange between __partitioned__ producers (exemplified by modin (PR) and HeAT (PR)) and scikit-learn-intelex/daal4py.
- Frameworks like dask and ray wrap established APIs (such as pytorch, xgboost, etc.) and ask developers to switch to the new framework/API and to adopt their programming model.
A structure like __partitioned__ enables distributed data exchange between various frameworks which can be seamless to users while avoiding unnecessary data transfer.
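The consumer side of such an exchange can be sketched as well. The gather helper below is hypothetical (the proposal defines only the structure, not this function), and the hand-built dict stands in for what a producer such as modin or HeAT might expose; field names follow the draft proposal:

```python
def gather(parted):
    """Framework-agnostic consumer sketch: reassemble a 1-D sequence
    from any __partitioned__-style dict, without knowing which
    framework produced it (hypothetical helper)."""
    out = [None] * parted["shape"][0]
    get = parted["get"]
    for part in parted["partitions"].values():
        start = part["start"][0]
        chunk = get(part["data"])  # data is only materialized here
        out[start:start + part["shape"][0]] = chunk
    return out

# Hand-built structure standing in for a producer's __partitioned__;
# in-process, so the handle under "data" is simply the data itself.
meta = {
    "shape": (6,),
    "partition_tiling": (2,),
    "partitions": {
        (0,): {"start": (0,), "shape": (3,), "data": [0, 1, 2], "location": ["node0"]},
        (1,): {"start": (3,), "shape": (3,), "data": [3, 4, 5], "location": ["node1"]},
    },
    "get": lambda handle: handle,
}
```

A smarter consumer would first inspect each partition's location and fetch only the partitions that are not already local, which is exactly how the unnecessary data transfer described above is avoided.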
The __partitioned__ protocol is an initial proposal; any feedback from this consortium will be highly appreciated. We would like to build a neutral, open governance that continues to oversee the evolution of the spec. For example, if some version of it receives a positive response from the community, we would be more than happy to donate the __partitioned__ protocol to the data-api consortium or to host it in a clean github org.