feat(python): add entry point to the Consortium DataFrame API #10244
Conversation
This is very interesting. Does that mean we do not implement the interchange protocol within the library anymore, but implement it in the external package?
I was planning to work on this during the coming days, so your timing is impeccable 🎉
😄 sorry for the confusion, this is separate from (but related to!) the interchange protocol. The interchange protocol is intended more for converting between dataframe libraries. That's fine for plotting libraries which don't do any heavy lifting (e.g. that's what plotly does).

The Standard is more an attempt to agree on a minimal set of "essential dataframe functionality" which works the same way across dataframe libraries (in separate namespaces, as at this point there's no chance of getting everyone to agree on having the same main namespace). It would allow you to write something like

```python
def remove_outliers(df, column_name):
    # Get a Standard-compliant dataframe.
    df_standard = df.__dataframe_consortium_standard__()
    # Use methods from the Standard specification.
    col = df_standard.get_column_by_name(column_name)
    z_score = (col - col.mean()) / col.std()
    df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
    # Return the result as a dataframe from the original library.
    return df_standard_filtered.dataframe
```

and have it work for any dataframe library which implements the Standard.

So, it doesn't replace the interchange protocol; rather, it enables things which are out-of-scope for the interchange protocol. I plan to make a similar PR to pandas after this. Other libraries will follow, but those two will already be enough to get things moving in scikit-learn.
Thanks for the explanation; makes sense!
Just one final comment, then it has my blessing!
Good to go if the CI passes!
Here's an entry point to the Consortium's DataFrame API Standard
It enables dataframe-consuming libraries to just check for a __dataframe_consortium_standard__ attribute on a dataframe they receive. Then, so long as they stick to the spec defined in https://data-apis.org/dataframe-api/draft/index.html, their code should work the same way regardless of what the original backing dataframe library was.

Use-case: currently, scikit-learn is very keen on using this, as they're not keen on having to depend on pyarrow (which will become required in pandas), and the interchange protocol only goes so far (e.g. a way to convert to ndarray is out-of-scope for it). If we can get this to work there, then other use cases may emerge.
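A minimal sketch of what that check could look like on the consuming side (the function name and error message here are illustrative, not taken from scikit-learn or from this PR):

```python
def to_standard_dataframe(df):
    """Accept any dataframe object exposing the Consortium Standard entry point."""
    entry_point = getattr(df, "__dataframe_consortium_standard__", None)
    if entry_point is None:
        raise TypeError(
            "Expected a dataframe implementing __dataframe_consortium_standard__, "
            f"got {type(df).__name__}"
        )
    # Everything downstream uses only methods defined in the spec,
    # so the original library (polars, pandas, ...) no longer matters.
    return entry_point()
```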
Maintenance burden on polars: none (unless, once it gains traction, you want it upstreamed). The compat package will be developed, maintained, and tested by the consortium and community of libraries which use it. It is up to consuming libraries to set a minimum version of the dataframe-api-compat package
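For context, a rough sketch of what such an entry point could look like on the polars side, with all of the Standard logic delegated to the external package. The helper name and keyword argument below are assumptions for illustration, not necessarily the exact API of dataframe-api-compat:

```python
from __future__ import annotations

from typing import Any


class DataFrame:  # stand-in for polars.DataFrame, illustration only
    def __dataframe_consortium_standard__(self, *, api_version: str | None = None) -> Any:
        """Return a Standard-compliant wrapper around this dataframe.

        The wrapper itself lives in the external dataframe-api-compat
        package; polars only forwards to it (helper name assumed).
        """
        import dataframe_api_compat.polars_standard

        return dataframe_api_compat.polars_standard.convert_to_standard_compliant_dataframe(
            self, api_version=api_version
        )
```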
The current spec should be enough for scikit-learn, and having this entry point makes it easier to move forwards with development (without monkey-patching / special-casing)