feat(python): add entry point to the Consortium DataFrame API by MarcoGorelli · Pull Request #10244 · pola-rs/polars · GitHub

@MarcoGorelli
Collaborator
@MarcoGorelli MarcoGorelli commented Aug 2, 2023

Here's an entry point to the Consortium's DataFrame API Standard.

It enables dataframe-consuming libraries to just check for a __dataframe_consortium_standard__ attribute on a dataframe they receive. So long as they stick to the spec defined in https://data-apis.org/dataframe-api/draft/index.html, their code should work the same way, regardless of what the original backing dataframe library was.
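As a rough illustration of that check from the consuming library's side, here is a minimal sketch. The helper name `to_standard` and the `FakeDataFrame` class are hypothetical stand-ins made up for this example; only the `__dataframe_consortium_standard__` attribute name comes from this PR.

```python
def to_standard(df):
    """Return a Standard-compliant object for ``df``, or raise if unsupported."""
    # Look up the entry point without assuming which library ``df`` comes from.
    entry_point = getattr(df, "__dataframe_consortium_standard__", None)
    if entry_point is None:
        raise TypeError(
            f"{type(df).__name__} does not expose "
            "__dataframe_consortium_standard__"
        )
    return entry_point()


class FakeDataFrame:
    """Hypothetical stand-in for a dataframe that provides the entry point."""

    def __dataframe_consortium_standard__(self):
        # A real library would return its Standard-compliant wrapper here.
        return "standard-compliant wrapper"


print(to_standard(FakeDataFrame()))
```

An object without the attribute (for example a plain `object()`) makes `to_standard` raise `TypeError`, so a consuming library can fall back to another code path or report a clear error.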

Use-case: scikit-learn is currently very keen on using this, as they'd rather not depend on pyarrow (which will become required in pandas), and the interchange protocol only goes so far (e.g. converting to an ndarray is out of scope for it). If we can get this to work there, other use cases may emerge.

Maintenance burden on polars: none (unless, once it gains traction, you want it upstreamed). The compat package will be developed, maintained, and tested by the consortium and community of libraries which use it. It is up to consuming libraries to set a minimum version of the dataframe-api-compat package
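One way a consuming library might enforce such a minimum version is sketched below. This is only an assumption about how a consumer could do it, not something this PR adds: the helper names are hypothetical, any minimum passed in is up to the consumer, and the version parsing is deliberately simplistic (real code would more likely rely on `packaging.version`).

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn '0.1.12' into (0, 1, 12), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically, padding with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def require_compat_package(minimum):
    """Raise ImportError unless dataframe-api-compat >= ``minimum`` is installed."""
    try:
        installed = metadata.version("dataframe-api-compat")
    except metadata.PackageNotFoundError as exc:
        raise ImportError("dataframe-api-compat is not installed") from exc
    if not meets_minimum(installed, minimum):
        raise ImportError(
            f"dataframe-api-compat>={minimum} is required, found {installed}"
        )
    return installed
```

A consumer would call `require_compat_package(...)` once, before calling the entry point, so that users get a clear error instead of a failure deep inside the compat layer.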

The current spec should be enough for scikit-learn, and having this entry point makes it easier to move forwards with development (without monkey-patching / special-casing)

@github-actions github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels Aug 2, 2023
@MarcoGorelli MarcoGorelli marked this pull request as ready for review August 2, 2023 11:13
Contributor
@stinodego stinodego left a comment

This is very interesting. Does that mean we do not implement the interchange protocol within the library anymore, but implement it in the external package?

I was planning to work on this during the coming days, so your timing is impeccable 🎉

@MarcoGorelli
Collaborator Author
MarcoGorelli commented Aug 2, 2023

> so your timing is impeccable 🎉

😄 sorry for the confusion, this is separate from (but related to!) the interchange protocol

The interchange protocol is intended more for converting between dataframe libraries. This is OK for plotting libraries that don't do any heavy lifting (e.g. that's what plotly does)

The standard is more of an attempt to agree on a minimal set of "essential dataframe functionality" that works the same way across dataframe libraries. It lives in a separate namespace in each library, as at this point there's no chance of getting everyone to agree on the same main namespace. That would allow you to write something like

def remove_outliers(df, column_name):
    # Get a Standard-compliant dataframe.
    df_standard = df.__dataframe_consortium_standard__()
    # Use methods from the Standard specification.
    col = df_standard.get_column_by_name(column_name)
    z_score = (col - col.mean()) / col.std()
    df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
    # Return the result as a dataframe from the original library.
    return df_standard_filtered.dataframe

and have it work for any df which has a namespace implementing the Consortium's standard (without any conversions between libraries, or heavy dependencies like pyarrow)

So, it doesn't replace the interchange protocol, but rather it enables things which are out-of-scope for the interchange protocol

I plan to make a similar PR to pandas after this. Other libraries will follow, but these will already be enough to get things moving in scikit-learn

@MarcoGorelli MarcoGorelli marked this pull request as draft August 2, 2023 11:49
@stinodego
Contributor

Thanks for the explanation; makes sense!

@MarcoGorelli MarcoGorelli marked this pull request as ready for review August 2, 2023 13:27
@MarcoGorelli MarcoGorelli marked this pull request as draft August 2, 2023 13:38
@MarcoGorelli MarcoGorelli marked this pull request as ready for review August 2, 2023 14:00
Contributor
@stinodego stinodego left a comment

Just one final comment, then it has my blessing!

Contributor
@stinodego stinodego left a comment

Good to go if the CI passes!
