- # `__dataframe__` protocol - summary
+ # The `__dataframe__` protocol
+
+ This document aims to describe the scope of the dataframe interchange protocol,
+ as well as its essential design requirements/principles and the functionality
+ it needs to support.

- _We've had a lot of discussion in a couple of GitHub issues and in meetings.
- This description attempts to summarize that, and extract the essential design
- requirements/principles and functionality it needs to support._

## Purpose of `__dataframe__`
@@ -11,7 +12,8 @@ a way to convert one type of dataframe into another type (for example,
convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into
a Vaex dataframe).

- Currently (Nov'20) there is no way to do this in an implementation-independent way.
+ Currently (June 2020) there is no way to do this in an
+ implementation-independent way.

The main use case this protocol intends to enable is to make it possible to
write code that can accept any type of dataframe instead of being tied to a
@@ -30,7 +32,7 @@ def somefunc(df, ...):
### Non-goals

- Providing a _complete standardized dataframe API_ is not a goal of the
+ Providing a _complete, standardized dataframe API_ is not a goal of the
`__dataframe__` protocol. Instead, this is a goal of the full dataframe API
standard, which the Consortium for Python Data API Standards aims to provide
in the future. When that full API standard is implemented by dataframe
@@ -40,8 +42,8 @@ libraries, the example above can change to:
def get_df_module(df):
    """Utility function to support programming against a dataframe API"""
    if hasattr(df, '__dataframe_namespace__'):
-       # Retrieve the namespace
-       pdx = df.__dataframe_namespace__()
+         # Retrieve the namespace
+         pdx = df.__dataframe_namespace__()
    else:
        # Here we can raise an exception if we only want to support compliant dataframes,
        # or convert to our default choice of dataframe if we want to accept (e.g.) dicts
@@ -57,6 +59,7 @@ def somefunc(df, ...):
    # From now on, use `df` methods and `pdx` functions/objects
```

+

### Constraints

An important constraint on the `__dataframe__` protocol is that it should not
@@ -94,13 +97,14 @@ For a protocol to exchange dataframes between libraries, we need both a model
of what we mean by "dataframe" conceptually for the purposes of the protocol,
and a model of how the data is represented in memory:

- ![Image of a dataframe model, containing chunks, columns and 1-D arrays](conceptual_model_df_memory.png)
+ ![Conceptual model of a dataframe, containing chunks, columns and 1-D arrays](images/dataframe_conceptual_model.png)

- The smallest building block are **1-D arrays**, which are contiguous in
- memory and contain data with the same dtype. A **column** consists of one or
- more 1-D arrays (if, e.g., missing data is represented with a boolean mask,
- that's a separate array). A **chunk** contains a set of columns of uniform
- length. A **dataframe** contains one or more chunks.
+ The smallest building blocks are **1-D arrays** (or "buffers"), which are
+ contiguous in memory and contain data with the same dtype. A **column**
+ consists of one or more 1-D arrays (if, e.g., missing data is represented with
+ a boolean mask, that's a separate array). A **dataframe** contains one or more
+ columns. A column or a dataframe can be "chunked"; a **chunk** is a subset of
+ a column or dataframe that contains a set of (neighboring) rows.
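+ 
+ As a rough illustration of this model, the relationships between these
+ building blocks could be sketched as plain Python classes (all names here are
+ hypothetical, for illustration only, and not part of the protocol):
+ 
+ ```python
+ from dataclasses import dataclass, field
+ from typing import List, Optional
+ 
+ @dataclass
+ class Buffer:
+     """A contiguous, single-dtype block of memory."""
+     ptr: int      # pointer to the start of the block
+     bufsize: int  # size of the block in bytes
+ 
+ @dataclass
+ class Column:
+     """One or more 1-D arrays, e.g. a data buffer plus an optional mask."""
+     data: Buffer
+     mask: Optional[Buffer] = None  # boolean mask for missing values, if any
+ 
+ @dataclass
+ class DataFrame:
+     """A set of equal-length columns, accessible in chunks of rows."""
+     columns: List[Column] = field(default_factory=list)
+ ```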

## Protocol design requirements
@@ -121,7 +125,7 @@ length. A **dataframe** contains one or more chunks.
6. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
   and provide an explicit way to force such transfers (e.g. a `force=` or
   `copy=` keyword that the caller can set to `True`).
- 7. Must be zero-copy if possible.
+ 7. Must be zero-copy wherever possible.
8. Must support missing values (`NA`) for all supported dtypes.
9. Must support string, categorical and datetime dtypes.
10. Must allow the consumer to inspect the representation for missing values
@@ -141,7 +145,7 @@ length. A **dataframe** contains one or more chunks.
    _Rationale: prescribing a single in-memory representation in this
    protocol would lead to unnecessary copies being made if that representation
    isn't the native one a library uses._
-   _Note: the memory layout is columnnar. Row-major dataframes can use this
+   _Note: the memory layout is columnar. Row-major dataframes can use this
    protocol, but not in a zero-copy fashion (see requirement 2 above)._
12. Must support chunking, i.e. accessing the data in "batches" of rows
    (a consumer-side sketch follows the lists below).
    There must be metadata the consumer can access to learn in how many
@@ -168,14 +172,21 @@ We'll also list some things that were discussed but are not requirements:
3. Extension dtypes, i.e. a way to extend the set of dtypes that is
   explicitly supported, are out of scope.
   _Rationale: complex to support, not used enough to justify that complexity._
- 4. "virtual columns", i.e. columns for which the data is not yet in memory
+ 4. Support for strided storage in buffers.
+    _Rationale: this is supported by a subset of dataframes only, mainly those
+    that use NumPy arrays. In many real-world use cases, strided arrays will
+    force a copy at some point, so requiring contiguous memory layout (and
+    hence an extra copy at the moment `__dataframe__` is used) is considered a
+    good trade-off for reduced implementation complexity._
+ 5. "virtual columns", i.e. columns for which the data is not yet in memory
   because it uses lazy evaluation, are not supported other than through
   letting the producer materialize the data in memory when the consumer
   calls `__dataframe__`.
   _Rationale: the full dataframe API will support this use case by
   "programming to an interface"; this data interchange protocol is
   fundamentally built around describing data in memory._

+
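+ To make the chunking requirement (requirement 12 above) concrete, here is a
+ hypothetical consumer-side sketch; the `num_chunks`, `get_chunks` and
+ `num_rows` names are illustrative assumptions, not an agreed-upon API:
+ 
+ ```python
+ def process_in_batches(df):
+     """Consume any compliant dataframe one chunk of rows at a time."""
+     xdf = df.__dataframe__()
+     # Metadata tells the consumer in how many chunks the data is stored;
+     # each chunk is itself a smaller dataframe-like object.
+     for chunk in xdf.get_chunks():
+         print(f"processing {chunk.num_rows()} rows")
+ ```
+ 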

### To be decided

_The connection between dataframe and array interchange protocols_. If we
@@ -194,7 +205,7 @@ _Should there be a standard `from_dataframe` constructor function?_ This
isn't completely necessary; however, it's expected that a full dataframe API
standard will have such a function. The array API standard also has such a
function, namely `from_dlpack`. Adding at least a recommendation on syntax
- for this function would make sense, e.g., `from_dataframe(df, stream=None)`.
+ for this function makes sense, e.g., simply `from_dataframe(df)`.
Discussion at https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685903651
is relevant.
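+ 
+ A minimal sketch of what such a constructor could look like in a consuming
+ library (`_from_interchange_object` is a hypothetical internal helper of that
+ library, not part of any standard):
+ 
+ ```python
+ def from_dataframe(df):
+     """Construct this library's dataframe from any compliant dataframe."""
+     if not hasattr(df, '__dataframe__'):
+         raise TypeError("`df` does not support the `__dataframe__` protocol")
+     # Hand the interchange object to this library's own converter.
+     return _from_interchange_object(df.__dataframe__())
+ ```
+ 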
@@ -209,14 +220,16 @@ except `__dataframe__` is a Python-level rather than C-level interface.
The data types format specification of that interface is something that could
be used unchanged.

- The main (only?) limitation seems to be that it does not have device support
- - @kkraus14 will bring this up on the Arrow dev mailing list. Also note that
- that interface only talks about arrays; dataframes, chunking and the metadata
- inspection can all be layered on top in this Python-level protocol, but are
- not discussed in the interface itself.
+ The main limitation is that it does not have device support
+ -- `@kkraus14` will bring this up on the Arrow dev mailing list. Another
+ identified issue is that the "deleter" on the Arrow C struct is present at the
+ column level, and there are use cases for having it at the buffer level
+ (mixed-device dataframes, more granular control over memory).
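+ 
+ To illustrate why a buffer-level "deleter" matters, a hypothetical
+ Python-level buffer object could carry its own release callback, so that each
+ buffer of a mixed-device dataframe frees its memory on the correct device
+ (all names here are illustrative, not part of Arrow or this protocol):
+ 
+ ```python
+ class Buffer:
+     """A block of memory plus the callback that knows how to free it."""
+     def __init__(self, ptr, bufsize, release):
+         self.ptr = ptr           # host or device pointer
+         self.bufsize = bufsize   # size in bytes
+         self._release = release  # called exactly once to free the memory
+ 
+     def __del__(self):
+         if self._release is not None:
+             self._release(self)
+             self._release = None
+ ```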

Note that categoricals are supported; Arrow uses the phrasing
- "dictionary-encoded types" for categorical.
+ "dictionary-encoded types" for categorical. Also, what it calls "array" means
+ "column" in the terminology of this document (and every Python dataframe
+ library).

The Arrow C Data Interface says specifically it was inspired by [Python's
buffer protocol](https://docs.python.org/3/c-api/buffer.html), which is also
@@ -245,7 +258,7 @@ library that implements `__array__` must depend (optionally at least) on
NumPy, and call a NumPy `ndarray` constructor itself from within `__array__`.

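+ For illustration, a minimal producer-side column implementing this mechanism
+ could look as follows (a hypothetical class, for illustration only);
+ `np.asarray` calls `__array__` under the hood:
+ 
+ ```python
+ import numpy as np
+ 
+ class MyColumn:
+     def __init__(self, values):
+         self._values = list(values)
+ 
+     def __array__(self, dtype=None):
+         # This is the optional NumPy dependency described above: the
+         # producer itself calls a NumPy constructor inside `__array__`.
+         return np.array(self._values, dtype=dtype)
+ 
+ x = np.asarray(MyColumn([1, 2, 3]))  # -> array([1, 2, 3])
+ ```
+ 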
- ### What is wrong with `.to_numpy?` and `.to_arrow()`?
+ ### What is wrong with `.to_numpy()` and `.to_arrow()`?

Such methods ask the object they are attached to to turn itself into a NumPy
or Arrow array, which means each library must have at least an optional
@@ -261,7 +274,7 @@ constructor it needs. For example, `x = np.asarray(df['colname'])` (where

### Does an interface describing memory work for virtual columns?

- Vaex is an example of a library that can have "virtual columns" (see @maartenbreddels
+ Vaex is an example of a library that can have "virtual columns" (see `@maartenbreddels`
[comment here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-686373569)).
If the protocol includes a description of data layout in memory, does that
work for such a virtual column?
work for such a virtual column?
@@ -285,17 +298,15 @@ computational graph approach like Dask uses, etc.)._
285
298
286
299
## Possible direction for implementation
287
300
288
- ### Rough prototypes
301
+ ### Rough initial prototypes (historical)
289
302
290
303
The ` cuDFDataFrame ` , ` cuDFColumn ` and ` cuDFBuffer ` sketched out by @kkraus14
291
304
[ here] ( https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685123386 )
292
- seems to be in the right direction.
305
+ looked like it was in the right direction.
293
306
294
307
[ This prototype] ( https://github.com/wesm/dataframe-protocol/pull/1 ) by Wes
295
308
McKinney was the first attempt, and has some useful features.
296
309
297
- TODO: work this out after making sure we're all on the same page regarding requirements.
298
-
299
310
300
311
### Relevant existing protocols
301
312
@@ -363,4 +374,4 @@ The `=`, `<`, `>` are denoting endianness; Arrow only supports native endianness
363
374
- [ ` __array_interface__ ` protocol] ( https://numpy.org/devdocs/reference/arrays.interface.html )
364
375
- [ Arrow C Data Interface] ( https://arrow.apache.org/docs/format/CDataInterface.html )
365
376
- [ DLPack] ( https://github.com/dmlc/dlpack )
366
- - [ Array data interchange in API standard] ( https://data-apis.github.io/array-api/latest/design_topics/data_interchange.html )
377
+ - [ Array data interchange in API standard] ( https://data-apis.github.io/array-api/latest/design_topics/data_interchange.html )
0 commit comments