feat(python): add entry point to the Consortium DataFrame API #10244
Conversation
This is very interesting. Does that mean we do not implement the interchange protocol within the library anymore, but implement it in the external package?
I was planning to work on this during the coming days, so your timing is impeccable 🎉
😄 sorry for the confusion, this is separate from (but related to!) the interchange protocol. The interchange protocol is intended more for converting between dataframe libraries. That's fine for plotting libraries which don't do any heavy lifting (e.g. that's what plotly does).

The Standard is more an attempt to agree on a minimal set of "essential dataframe functionality" which works the same way across dataframe libraries (in separate namespaces, as at this point there's no chance of getting everyone to agree on having the same main namespace). It would allow you to write something like

```python
def remove_outliers(df, column_name):
    # Get a Standard-compliant dataframe.
    df_standard = df.__dataframe_consortium_standard__()
    # Use methods from the Standard specification.
    col = df_standard.get_column_by_name(column_name)
    z_score = (col - col.mean()) / col.std()
    df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
    # Return the result as a dataframe from the original library.
    return df_standard_filtered.dataframe
```

and have it work for any dataframe library which implements the Standard.

So, it doesn't replace the interchange protocol; rather, it enables things which are out-of-scope for the interchange protocol. I plan to make a similar PR to pandas after this. Other libraries will follow, but those two will already be enough to get things moving in scikit-learn.
Thanks for the explanation; makes sense!
Just one final comment, then it has my blessing!
Good to go if the CI passes!
Here's an entry point to the Consortium's DataFrame API Standard
It enables dataframe-consuming libraries to just check for a __dataframe_consortium_standard__ attribute on a dataframe they receive. Then, so long as they stick to the spec defined in https://data-apis.org/dataframe-api/draft/index.html, their code should work the same way regardless of what the original backing dataframe library was.

Use-case: currently, scikit-learn is very keen on using this, as they're not keen on having to depend on pyarrow (which will become required in pandas), and the interchange protocol only goes so far (e.g. a way to convert to ndarray is out-of-scope for it). If we can get this to work there, then other use cases may emerge.
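A minimal sketch of what that check could look like on the consuming side (the function name and error message here are illustrative, not taken from scikit-learn or from this PR):

```python
def to_standard_dataframe(df):
    """Accept any dataframe object exposing the Consortium Standard entry point."""
    entry_point = getattr(df, "__dataframe_consortium_standard__", None)
    if entry_point is None:
        raise TypeError(
            "Expected a dataframe implementing __dataframe_consortium_standard__, "
            f"got {type(df).__name__}"
        )
    # Everything downstream uses only methods defined in the spec,
    # so the original library (polars, pandas, ...) no longer matters.
    return entry_point()
```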
Maintenance burden on polars: none (unless, once it gains traction, you want it upstreamed). The compat package will be developed, maintained, and tested by the consortium and community of libraries which use it. It is up to consuming libraries to set a minimum version of the dataframe-api-compat package
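For context, a rough sketch of what such an entry point could look like on the polars side, with all of the Standard logic delegated to the external package. The helper name and keyword argument below are assumptions for illustration, not necessarily the exact API of dataframe-api-compat:

```python
from __future__ import annotations

from typing import Any


class DataFrame:  # stand-in for polars.DataFrame, illustration only
    def __dataframe_consortium_standard__(self, *, api_version: str | None = None) -> Any:
        """Return a Standard-compliant wrapper around this dataframe.

        The wrapper itself lives in the external dataframe-api-compat
        package; polars only forwards to it (helper name assumed).
        """
        import dataframe_api_compat.polars_standard

        return dataframe_api_compat.polars_standard.convert_to_standard_compliant_dataframe(
            self, api_version=api_version
        )
```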
The current spec should be enough for scikit-learn, and having this entry point makes it easier to move forwards with development (without monkey-patching / special-casing)