8000 add array api support in label binarizer by jeromedockes · Pull Request #28626 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

add array api support in label binarizer #28626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jeromedockes
Copy link
Contributor
@jeromedockes jeromedockes commented Mar 13, 2024

Towards #26024.

Moving this part out of #27961 to make it easier to review and because the LabelBinarizer is used in other estimators than Ridge and RidgeCV which are the focus of #27961 .

TODO

  • add tests

Copy link
github-actions bot commented Mar 13, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9f4762b. Link to the linter CI: here

Copy link
Member
@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need tests as well.

Question: should we update LabelBinarizer as part of this PR? Or keep it minimal to now delay work for RidgeClassifier?

@ogrisel
Copy link
Member
ogrisel commented Mar 14, 2024

Also all public API (functions or classes) updated to support Array API should be listed in:

@jeromedockes
Copy link
Contributor Author

after discussion with @ogrisel we decided that as the current logic is to build a sparse matrix that is later converted to dense, and moreover the LabelBinarizer is unlikely to be the performance bottleneck in most pipelines, for now we will keep all the function's logic in numpy but convert the results to the correct array api backend at the end so that other estimators that actually benefit from the array api can still use the LabelBinarizer.

It will always be possible later to add a different code path for the sparse_output=False case, and for that code path it might be more beneficial to rely on the array api

@jeromedockes jeromedockes changed the title [WIP] add array api support in label binarizer add array api support in label binarizer Mar 26, 2024
Copy link
Member
@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a quick fix for the test and resolved a conflict when merging with main.

Here are a few suggestions for improvement, otherwise LGTM. Thanks @jeromedockes!


if not y_is_array_api:
return Y
return y_xp.asarray(Y, device=device_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a new test case to cover that line?

@@ -47,6 +47,9 @@ See :ref:`array_api` for more details.

**Classes:**

- :class:`sklearn.preprocessing.LabelBinarizer` now supports Array API compliant inputs.
:pr:`28626` by :user:`Jérôme Dockès <jeromedockes>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we cut 1.5.X earlier this week, this entry will have to be pushed to v1.6.rst (after a merge of main into this branch).

@@ -303,7 +304,8 @@ def fit(self, y):
raise ValueError("y has 0 samples: %r" % y)

self.sparse_input_ = sp.issparse(y)
self.classes_ = unique_labels(y)
xp, _ = get_namespace(y)
self.classes_ = _convert_to_numpy(unique_labels(y), xp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to convert classes_ to numpy?

This is related to #26083 as well.

This right now is a bit confusing, since fitting an estimator which supports array API ends up with an object where some attributes are in the same space as input X, and some are still in numpy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that so far the policy is: store the fitted attributes with the array type that makes most sense to be efficient at prediction time assuming the prediction-time data container type will be consistent with the fit-time data container type. Since this PR only changes LabelBinarizer for convenience without actually delegating any computation to the underlying Array API namespace (see the inline comments), I think it's better to always keep classes_ as a numpy array for now.

If one day we decide to recode label_binarize to actually delegate some computation to the underlying namespace, then we think about make the type of classes_ input dependent instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR only changes LabelBinarizer for convenience without actually delegating any computation to the underlying Array API namespace (see the inline comments), I think it's better to always keep classes_ as a numpy array for now.

Then I'm confused. Doesn't check_array already convert data to numpy? So we already support non-numpy data.

If the point is to have the output of predict in the same space as what user gives, shouldn't that be like a decorator or something around predict? Or a simply function call at the end of predict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a simply function call at the end of predict?

that is quite close to what is done here: we record the input namespace and device at the beginning of the function, convert to numpy and do everything in numpy, and convert back to the input format and device where the function returns.

As you say, it could probably be implemented as a decorator applied to fit, label_binarize and inverse_transform

xp_y = xp.asarray(y, device=device)
xp_lb = LabelBinarizer(sparse_output=False)
with config_context(array_api_dispatch=True):
xp_transformed = xp_lb.fit_transform(xp_y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also want to test the output of transform when passed array api inputs (instead of just testing the output of fit_transform).

@jeromedockes
Copy link
Contributor Author

Based on @adrinjalali's comments above and further discussion during the array API meeting today, we decided not to add array API support to the LabelBinarizer yet. Estimators such as the RidgeClassifier that rely on the LabelBinarizer will need to handle the conversion to and from numpy themselves (including of the classes_ attribute if they expose it).

Anyway in the case an estimator uses the LabelBinarizer to binarize a sequence of target labels, but X is on the GPU, at least some of the conversion logic would have to be handled after the call to LabelBinarizer.fit_transform.

We will look for opportunities for refactoring this logic once array api support has been added to some estimators that use the LabelBinarizer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0