8000 Refactor `_convert_container` by Charlie-XIAO · Pull Request #28681 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Refactor _convert_container #28681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 38 commits into
base: main
Choose a base branch
from

Conversation

Charlie-XIAO
Copy link
Contributor
@Charlie-XIAO Charlie-XIAO commented Mar 23, 2024

Follow up of #28521, see #28521 (comment).

This PR splits constructor_name into constructor_type and constructor_lib, and adds two parameters sparse_container and sparse_format to reflect dense versus sparse. /cc @glemaitre

Mapping from the original constructor_name to the specifications in this PR

constructor_type constructor_lib sparse_container sparse_format
Default "pandas" None "csr"
"list" "list"
"tuple" "tuple"
"slice" "slice"
"array" "array"
"sparse"/"sparse_csr" "array" "matrix"
"sparse_csc" "array" "matrix" "csc"
"sparse_csr_array" "array" "array"
"sparse_csc_array" "array" "array" "csc"
"dataframe"/"pandas" "dataframe"
"polars" "dataframe" "polars"
"pyarrow" "dataframe" "pyarrow"
"series" "series"
"polars_series" "series" "polars"
"index" "index"

Some rules of when each parameter is applicable

  • sparse_container has no effect unless constructor_type="array"
  • sparse_format has no effect unless constructor_type="array" and sparse_container is not None
  • constructor_lib has no effect unless constructor_type is one of "dataframe", "series", and "index"
  • min_version has no effect unless constructor_lib is used
  • column_names (renamed from columns_name) has no effect unless constructor_type="dataframe"
  • categorical_feature_names has no effect unless constructor_type="dataframe"

Constraints on the shape of the input container

  • constructor_type="slice" requires 1-dimensional container with exactly 2 elements
  • constructor_type="dataframe" requires 2-dimensional container (previously it converts (n,) shape to (n, 1) shape)
  • constructor_type="series" requires 1-dimensional container (previously it creates series with each element being a list when given 2-dimensional container)
  • constructor_type="index" requires 1-dimensional container (similar to "series")

Copy link
github-actions bot commented Mar 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 1c6f2e1. Link to the linter CI: here

@Charlie-XIAO Charlie-XIAO marked this pull request as ready for review March 24, 2024 04:03
Copy link
Member
@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @Charlie-XIAO. Here are some comments. I'm +1 for the refactoring of _convert_container.

@glemaitre
Copy link
Member

Once the CIs are green, I'll review ;) (I am buying a bit of time).
But indeed, I really like the refactoring. Thanks for taking care of it @Charlie-XIAO.

@glemaitre glemaitre self-requested a review March 28, 2024 06:51
@Charlie-XIAO
Copy link
Contributor Author

Now I have resolved all suggestions from Jérémie, and CI is green :)

@glemaitre
Copy link
Member

I'm going to resolve the conflicts and make a review.

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments before to look at the tests.

@glemaitre glemaitre self-requested a review May 21, 2024 16:06
@glemaitre glemaitre added this to the 1.6 milestone May 21, 2024
@adrinjalali
Copy link
Member

@Charlie-XIAO would you mind resolving conflicts?

@Charlie-XIAO
Copy link
Contributor Author

Will do soon (maybe within one day or two).

@glemaitre glemaitre modified the milestones: 1.6, 1.7 Nov 6, 2024
@Charlie-XIAO
Copy link
Contributor Author
Charlie-XIAO commented Dec 23, 2024

Sorry for the super long delay @adrinjalali - I was super busy with coursework during the semester 😢. I have now resolved the conflicts, but not sure if anything has changed since my last update since it is affecting quite a lot of tests.

@Charlie-XIAO
Copy link
Contributor Author
Charlie-XIAO commented Dec 23, 2024

The test failure seems to be because that to_pandas of polars requires pyarrow as well: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.to_pandas.html, should we just skip the tests of polars when pyarrow is not installed, or try find a workaround?

@Charlie-XIAO Charlie-XIAO changed the title RFC _convert_container Refactor _convert_container Jan 2, 2025
Copy link
Member
@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thanks @Charlie-XIAO .

@@ -2365,6 +2365,15 @@ def _is_polars_df(X):
return isinstance(X, pl.DataFrame)


def _is_arrow_df(X):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a CI with arrow installed? (I think we should)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think we do have it in pylatest_conda_forge_mkl_linux-64:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

5 participants
0