Commit ac1a5ca · iskode/dataframe-api

Merge branch 'main' of https://github.com/data-apis/dataframe-api into variable-length-string-support
2 parents: 6010ae7 + 52abf7a

File tree

2 files changed: +42 -31 lines changed

protocol/dataframe_protocol_summary.md

Lines changed: 42 additions & 31 deletions
@@ -1,8 +1,9 @@
-# `__dataframe__` protocol - summary
+# The `__dataframe__` protocol
+
+This document aims to describe the scope of the dataframe interchange protocol,
+as well as its essential design requirements/principles and the functionality
+it needs to support.

-_We've had a lot of discussion in a couple of GitHub issues and in meetings.
-This description attempts to summarize that, and extract the essential design
-requirements/principles and functionality it needs to support._


 ## Purpose of `__dataframe__`

@@ -11,7 +12,8 @@ a way to convert one type of dataframe into another type (for example,
 convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into
 a Vaex dataframe).

-Currently (Nov'20) there is no way to do this in an implementation-independent way.
+Currently (June 2020) there is no way to do this in an
+implementation-independent way.

 The main use case this protocol intends to enable is to make it possible to
 write code that can accept any type of dataframe instead of being tied to a
@@ -30,7 +32,7 @@ def somefunc(df, ...):

 ### Non-goals

-Providing a _complete standardized dataframe API_ is not a goal of the
+Providing a _complete, standardized dataframe API_ is not a goal of the
 `__dataframe__` protocol. Instead, this is a goal of the full dataframe API
 standard, which the Consortium for Python Data API Standards aims to provide
 in the future. When that full API standard is implemented by dataframe
@@ -40,8 +42,8 @@ libraries, the example above can change to:
 def get_df_module(df):
     """Utility function to support programming against a dataframe API"""
     if hasattr(df, '__dataframe_namespace__'):
-       # Retrieve the namespace
-       pdx = df.__dataframe_namespace__()
+        # Retrieve the namespace
+        pdx = df.__dataframe_namespace__()
     else:
         # Here we can raise an exception if we only want to support compliant dataframes,
         # or convert to our default choice of dataframe if we want to accept (e.g.) dicts
@@ -57,6 +59,7 @@ def somefunc(df, ...):
     # From now on, use `df` methods and `pdx` functions/objects
 ```

+
 ### Constraints

 An important constraint on the `__dataframe__` protocol is that it should not
@@ -94,13 +97,14 @@ For a protocol to exchange dataframes between libraries, we need both a model
 of what we mean by "dataframe" conceptually for the purposes of the protocol,
 and a model of how the data is represented in memory:

-![Image of a dataframe model, containing chunks, columns and 1-D arrays](conceptual_model_df_memory.png)
+![Conceptual model of a dataframe, containing chunks, columns and 1-D arrays](images/dataframe_conceptual_model.png)

-The smallest building block are **1-D arrays**, which are contiguous in
-memory and contain data with the same dtype. A **column** consists of one or
-more 1-D arrays (if, e.g., missing data is represented with a boolean mask,
-that's a separate array). A **chunk** contains a set of columns of uniform
-length. A **dataframe** contains one or more chunks.
+The smallest building blocks are **1-D arrays** (or "buffers"), which are
+contiguous in memory and contain data with the same dtype. A **column**
+consists of one or more 1-D arrays (if, e.g., missing data is represented with
+a boolean mask, that's a separate array). A **dataframe** contains one or more columns.
+A column or a dataframe can be "chunked"; a **chunk** is a subset of a column
+or dataframe that contains a set of (neighboring) rows.


 ## Protocol design requirements
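
The hierarchy described in the hunk above (buffers at the bottom, columns built from one or more buffers, dataframes built from columns, chunks as subsets of neighboring rows) can be illustrated with a small Python sketch. The class and attribute names below are hypothetical; they are not part of this commit or of the protocol, and only mirror the document's terminology:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Buffer:
    """A contiguous, 1-D block of memory holding values of a single dtype."""
    dtype: str    # e.g. "int64", "float32", or "bool" for a validity mask
    nbytes: int   # size of the underlying memory block


@dataclass
class Column:
    """One or more 1-D buffers: the data itself plus, e.g., an optional
    validity (missing-data) mask stored as a separate buffer."""
    name: str
    data: Buffer
    validity: Optional[Buffer] = None


@dataclass
class DataFrame:
    """One or more columns of equal length."""
    columns: List[Column] = field(default_factory=list)


@dataclass
class Chunk:
    """A subset of neighboring rows of a column or dataframe. A chunked
    dataframe is then just a sequence of such row-wise slices."""
    source: DataFrame
    row_start: int
    row_stop: int
```
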
@@ -121,7 +125,7 @@ length. A **dataframe** contains one or more chunks.
 6. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
    and provide an explicit way to force such transfers (e.g. a `force=` or
    `copy=` keyword that the caller can set to `True`).
-7. Must be zero-copy if possible.
+7. Must be zero-copy wherever possible.
 8. Must support missing values (`NA`) for all supported dtypes.
 9. Must support string, categorical and datetime dtypes.
 10. Must allow the consumer to inspect the representation for missing values
@@ -141,7 +145,7 @@ length. A **dataframe** contains one or more chunks.
    _Rationale: prescribing a single in-memory representation in this
    protocol would lead to unnecessary copies being made if that representation
    isn't the native one a library uses._
-   _Note: the memory layout is columnnar. Row-major dataframes can use this
+   _Note: the memory layout is columnar. Row-major dataframes can use this
    protocol, but not in a zero-copy fashion (see requirement 2 above)._
 12. Must support chunking, i.e. accessing the data in "batches" of rows.
     There must be metadata the consumer can access to learn in how many
@@ -168,14 +172,21 @@ We'll also list some things that were discussed but are not requirements:
 3. Extension dtypes, i.e. a way to extend the set of dtypes that is
    explicitly supported, are out of scope.
    _Rationale: complex to support, not used enough to justify that complexity._
-4. "virtual columns", i.e. columns for which the data is not yet in memory
+4. Support for strided storage in buffers.
+   _Rationale: this is supported by a subset of dataframes only, mainly those
+   that use NumPy arrays. In many real-world use cases, strided arrays will
+   force a copy at some point, so requiring contiguous memory layout (and hence
+   an extra copy at the moment `__dataframe__` is used) is considered a good
+   trade-off for reduced implementation complexity._
+5. "virtual columns", i.e. columns for which the data is not yet in memory
    because it uses lazy evaluation, are not supported other than through
    letting the producer materialize the data in memory when the consumer
    calls `__dataframe__`.
    _Rationale: the full dataframe API will support this use case by
    "programming to an interface"; this data interchange protocol is
    fundamentally built around describing data in memory_.

+
 ### To be decided

 _The connection between dataframe and array interchange protocols_. If we
@@ -194,7 +205,7 @@ _Should there be a standard `from_dataframe` constructor function?_ This
 isn't completely necessary, however it's expected that a full dataframe API
 standard will have such a function. The array API standard also has such a
 function, namely `from_dlpack`. Adding at least a recommendation on syntax
-for this function would make sense, e.g., `from_dataframe(df, stream=None)`.
+for this function makes sense, e.g., simply `from_dataframe(df)`.
 Discussion at https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685903651
 is relevant.

@@ -209,14 +220,16 @@ except `__dataframe__` is a Python-level rather than C-level interface.
 The data types format specification of that interface is something that could
 be used unchanged.

-The main (only?) limitation seems to be that it does not have device support
-- @kkraus14 will bring this up on the Arrow dev mailing list. Also note that
-that interface only talks about arrays; dataframes, chunking and the metadata
-inspection can all be layered on top in this Python-level protocol, but are
-not discussed in the interface itself.
+The main limitation seems to be that it does not have device support
+-- `@kkraus14` will bring this up on the Arrow dev mailing list. Another
+identified issue is that the "deleter" on the Arrow C struct is present at the
+column level, and there are use cases for having it at the buffer level
+(mixed-device dataframes, more granular control over memory).

 Note that categoricals are supported, Arrow uses the phrasing
-"dictionary-encoded types" for categorical.
+"dictionary-encoded types" for categorical. Also, what it calls "array" means
+"column" in the terminology of this document (and every Python dataframe
+library).

 The Arrow C Data Interface says specifically it was inspired by [Python's
 buffer protocol](https://docs.python.org/3/c-api/buffer.html), which is also
@@ -245,7 +258,7 @@ library that implements `__array__` must depend (optionally at least) on
 NumPy, and call a NumPy `ndarray` constructor itself from within `__array__`.


-### What is wrong with `.to_numpy?` and `.to_arrow()`?
+### What is wrong with `.to_numpy()` and `.to_arrow()`?

 Such methods ask the object they are attached to to turn itself into a NumPy or
 Arrow array, which means each library must have at least an optional
@@ -261,7 +274,7 @@ constructor it needs. For example, `x = np.asarray(df['colname'])` (where

 ### Does an interface describing memory work for virtual columns?

-Vaex is an example of a library that can have "virtual columns" (see @maartenbreddels
+Vaex is an example of a library that can have "virtual columns" (see `@maartenbreddels`
 [comment here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-686373569)).
 If the protocol includes a description of data layout in memory, does that
 work for such a virtual column?
@@ -285,17 +298,15 @@ computational graph approach like Dask uses, etc.)._

 ## Possible direction for implementation

-### Rough prototypes
+### Rough initial prototypes (historical)

 The `cuDFDataFrame`, `cuDFColumn` and `cuDFBuffer` sketched out by @kkraus14
 [here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685123386)
-seems to be in the right direction.
+looked like it was in the right direction.

 [This prototype](https://github.com/wesm/dataframe-protocol/pull/1) by Wes
 McKinney was the first attempt, and has some useful features.

-TODO: work this out after making sure we're all on the same page regarding requirements.
-

 ### Relevant existing protocols

@@ -363,4 +374,4 @@ The `=`, `<`, `>` are denoting endianness; Arrow only supports native endianness
 - [`__array_interface__` protocol](https://numpy.org/devdocs/reference/arrays.interface.html)
 - [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html)
 - [DLPack](https://github.com/dmlc/dlpack)
-- [Array data interchange in API standard](https://data-apis.github.io/array-api/latest/design_topics/data_interchange.html)
+- [Array data interchange in API standard](https://data-apis.github.io/array-api/latest/design_topics/data_interchange.html)