Requirements document for the dataframe interchange protocol by rgommers · Pull Request #35 · data-apis/dataframe-api
Requirements document for the dataframe interchange protocol #35

Merged (20 commits) on Jun 25, 2021
Commits
49f3b06
Add a summary document for the dataframe interchange protocol
rgommers Sep 15, 2020
31e301f
Process some review comments
rgommers Sep 17, 2020
a9b5e5c
Process a few more review comments.
rgommers Nov 5, 2020
82bc5ae
Link to Release callback semantics in Arrow C Data Interface docs
rgommers Nov 5, 2020
6197aee
Add design requirements for column selection and df metadata
rgommers Nov 5, 2020
2f51ba8
Edit the nested/heterogeneous dtypes non-requirement
rgommers Nov 5, 2020
183851d
Add requirements for chunking and memory layout description
rgommers Jan 5, 2021
c7575c1
Add TBD notes on dataframe-array connection and from_dataframe
rgommers Jan 5, 2021
5f278b3
Address review comments
rgommers Jan 6, 2021
1708e03
Add details on implementation options
rgommers Jan 6, 2021
3291dd9
Add details about the C implementation
rgommers Jan 6, 2021
e8caeba
Add an image of the dataframe model and its memory layout.
rgommers Jan 6, 2021
c5de640
Add link to discussion on array-dataframe connection
rgommers Jan 6, 2021
93d6e69
Some more updates for review comments
rgommers Jan 7, 2021
7d06066
Update table to indicate Arrow does support categoricals.
rgommers Jan 7, 2021
c0b5759
Add section on dtype format strings
rgommers Jan 7, 2021
6839642
Reflow some lines
rgommers Jan 8, 2021
53446ac
Add a requirement on semantic meaning of NaN/NaT, and timezone detail
rgommers Jan 8, 2021
b37ac91
Textual tweak: say columns in a data frame are ordered
rgommers Feb 9, 2021
be3cd32
Update requirements document for recent decisions/insights
rgommers Jun 25, 2021
Add a requirement on semantic meaning of NaN/NaT, and timezone detail
rgommers committed Jan 8, 2021
commit 53446acbe395773c4c684c4b3cd3042f17eef05c
9 changes: 7 additions & 2 deletions protocol/dataframe_protocol_summary.md
@@ -125,7 +125,10 @@ length. A **dataframe** contains one or more chunks.
 8. Must support missing values (`NA`) for all supported dtypes.
 9. Must supports string, categorical and datetime dtypes.
 10. Must allow the consumer to inspect the representation for missing values
-that the producer uses for each column or data type.
+that the producer uses for each column or data type. Sentinel values,
+bit masks, and boolean masks must be supported.
+Must also be able to define whether the semantic meaning of `NaN` and
+`NaT` is "not-a-number/datetime" or "missing".
 _Rationale: this enables the consumer to control how conversion happens,
 for example if the producer uses `-128` as a sentinel value in an `int8`
 column while the consumer uses a separate bit mask, that information
@@ -317,7 +320,9 @@ Here are the four most relevant existing protocols, and what requirements they s
 2. Can be done only via separate masks of boolean arrays.
 3. `__array_interface__` has a `mask` attribute, which is a separate boolean array also implementing the `__array_interface__` protocol.
 4. Only fixed-length strings as sequence of char or unicode.
-5. Only NumPy datetime and timedelta, which are limited compared to what the Arrow format offers.
+5. Only NumPy datetime and timedelta, not timezones. For the purpose of data
+interchange, timezones could be represented separately in metadata if
+desired.
 6. No explicit support, however categoricals can be mapped to either integers
 or strings. Unclear how to communicate that information from producer to consumer.
 7. No explicit support, categoricals can only be mapped to integers.
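To make the revised requirement 10 concrete, here is a minimal sketch of the three missing-value representations it names (sentinel value, boolean mask, bit mask) and of the `NaN` semantics switch. It assumes NumPy; the function names, the `nan_means_missing` flag, and the choice of an Arrow-style validity bitmap (set bit means valid, least-significant bit first) are illustrative assumptions, not part of the protocol document.

```python
import numpy as np

# Hypothetical producer column: int8 data that uses -128 as a sentinel for
# "missing", matching the example in the rationale of requirement 10.
SENTINEL = -128
values = np.array([1, SENTINEL, 3, SENTINEL, 5, 6, 7, 8, 9], dtype=np.int8)


def sentinel_to_boolean_mask(data, sentinel):
    """Return a boolean mask in which True marks a missing element."""
    return data == sentinel


def boolean_mask_to_bitmask(missing):
    """Pack a boolean 'missing' mask into a validity bitmap.

    Assumption: a set bit means "valid", and bits are packed
    least-significant bit first within each byte.
    """
    valid = ~missing
    return np.packbits(valid, bitorder="little")


def float_missing_mask(data, nan_means_missing):
    """Interpret NaN according to the semantics the producer declares.

    If NaN means "missing", NaN entries are flagged; if NaN means
    "not-a-number", nothing is flagged and NaN stays an ordinary value.
    """
    if nan_means_missing:
        return np.isnan(data)
    return np.zeros(data.shape, dtype=bool)


missing = sentinel_to_boolean_mask(values, SENTINEL)
bitmask = boolean_mask_to_bitmask(missing)
print(missing.tolist())
print(bitmask.tolist())  # -> [245, 1]
```

A consumer that stores validity as a bit mask can only perform this conversion if the producer exposes both the representation (here, a sentinel) and its value, which is the point of the rationale in the diff.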
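The reworded item 5 suggests that timezones could travel separately in metadata alongside timezone-naive NumPy datetimes. The sketch below shows one way a consumer might recombine the two. The metadata key `utc_offset_minutes`, the assumption that the stored values are UTC, and the decision to treat `NaT` as missing are all hypothetical choices made for illustration; the document does not specify a metadata layout.

```python
from datetime import datetime, timedelta, timezone

import numpy as np

# NumPy datetime64 values carry no timezone. Assumption: the producer
# stores them as UTC and describes the zone in separate column metadata.
stamps = np.array(["2021-01-08T12:00:00", "NaT"], dtype="datetime64[s]")
column_metadata = {"utc_offset_minutes": 60}  # hypothetical metadata key


def to_aware_datetimes(data, metadata):
    """Combine naive datetime64 values with the offset from the metadata.

    NaT is interpreted as "missing" here and mapped to None; a consumer
    could equally keep it as "not-a-datetime" if the producer says so.
    """
    tz = timezone(timedelta(minutes=metadata["utc_offset_minutes"]))
    seconds = data.astype("datetime64[s]").astype(np.int64)
    result = []
    for is_missing, sec in zip(np.isnat(data), seconds):
        if is_missing:
            result.append(None)
        else:
            utc_value = datetime.fromtimestamp(int(sec), tz=timezone.utc)
            result.append(utc_value.astimezone(tz))
    return result


aware = to_aware_datetimes(stamps, column_metadata)
print(aware[0].isoformat())  # -> 2021-01-08T13:00:00+01:00
```

A fixed UTC offset keeps the sketch dependency-free; a real exchange would more likely carry a timezone name so that daylight-saving rules can be applied.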