add array api support in label binarizer #28626

jeromedockes · 2024-03-13T16:32:57Z

Towards #26024.

Moving this part out of #27961 to make it easier to review, and because the LabelBinarizer is used in estimators other than Ridge and RidgeCV, which are the focus of #27961.

TODO

add tests

github-actions · 2024-03-13T16:34:13Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9f4762b.


ogrisel

We need tests as well.

Question: should we update LabelBinarizer as part of this PR? Or keep it minimal to not delay work for RidgeClassifier?

sklearn/preprocessing/_label.py

ogrisel · 2024-03-14T14:30:00Z

Also, all public API (functions or classes) updated to support Array API should be listed in:

https://scikit-learn.org/dev/modules/array_api.html#support-for-array-api-compatible-inputs

jeromedockes · 2024-03-15T10:45:50Z

After discussion with @ogrisel, we decided to keep all of the function's logic in numpy for now and to convert the results to the correct array API backend at the end. The current logic builds a sparse matrix that is later converted to dense, and the LabelBinarizer is unlikely to be the performance bottleneck in most pipelines. This way, other estimators that actually benefit from the array API can still use the LabelBinarizer.

It will always be possible later to add a different code path for the sparse_output=False case, and that code path might benefit more from relying on the array API.
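A minimal sketch of the "compute in numpy, convert at the end" idea, not the code from this PR. The names `binarize_in_numpy` and `_one_hot` are hypothetical, the toy one-hot encoding only stands in for label_binarize, and the namespace lookup assumes the input exposes the array API standard's `__array_namespace__` method (falling back to numpy otherwise). Device handling is left out.

```python
import numpy as np


def _one_hot(y_np, classes):
    # Toy stand-in for label_binarize: one indicator column per class.
    return (y_np[:, None] == classes[None, :]).astype(np.int64)


def binarize_in_numpy(y, classes):
    """Hypothetical helper: compute in NumPy, return in the input's namespace."""
    # Record the input namespace up front (array API standard protocol).
    get_ns = getattr(y, "__array_namespace__", None)
    xp = get_ns() if get_ns is not None else np
    # Assumption: np.asarray can read `y`; an array on a non-CPU device
    # would need an explicit transfer to host memory first.
    Y = _one_hot(np.asarray(y), np.asarray(classes))
    # Convert the NumPy result back to the caller's namespace.
    return xp.asarray(Y)


y = np.asarray([2, 0, 1, 2])
Y = binarize_in_numpy(y, classes=[0, 1, 2])
```

With a numpy input the round trip is a no-op; the point is only that the namespace is recorded before the conversion and reused when returning.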


ogrisel

I pushed a quick fix for the test and resolved a conflict when merging with main.

Here are a few suggestions for improvement, otherwise LGTM. Thanks @jeromedockes!

ogrisel · 2024-03-28T14:56:50Z

sklearn/preprocessing/_label.py

+
+                if not y_is_array_api:
+                    return Y
+                return y_xp.asarray(Y, device=device_)


Could you please add a new test case to cover that line?

sklearn/preprocessing/_label.py

ogrisel · 2024-05-07T09:11:26Z

doc/whats_new/v1.5.rst

@@ -47,6 +47,9 @@ See :ref:`array_api` for more details.

 **Classes:**

+- :class:`sklearn.preprocessing.LabelBinarizer` now supports Array API compliant inputs.
+  :pr:`28626` by :user:`Jérôme Dockès <jeromedockes>`.


Since we cut 1.5.X earlier this week, this entry will have to be pushed to v1.6.rst (after a merge of main into this branch).

adrinjalali · 2024-05-08T09:44:58Z

sklearn/preprocessing/_label.py

@@ -303,7 +304,8 @@ def fit(self, y):
            raise ValueError("y has 0 samples: %r" % y)

        self.sparse_input_ = sp.issparse(y)
-        self.classes_ = unique_labels(y)
+        xp, _ = get_namespace(y)
+        self.classes_ = _convert_to_numpy(unique_labels(y), xp)


Why do we need to convert classes_ to numpy?

This is related to #26083 as well.

This right now is a bit confusing, since fitting an estimator which supports array API ends up with an object where some attributes are in the same space as input X, and some are still in numpy.

I think that so far the policy is: store the fitted attributes with the array type that makes most sense to be efficient at prediction time assuming the prediction-time data container type will be consistent with the fit-time data container type. Since this PR only changes LabelBinarizer for convenience without actually delegating any computation to the underlying Array API namespace (see the inline comments), I think it's better to always keep classes_ as a numpy array for now.

If one day we decide to recode label_binarize to actually delegate some computation to the underlying namespace, then we can think about making the type of classes_ input-dependent instead.

> Since this PR only changes LabelBinarizer for convenience without actually delegating any computation to the underlying Array API namespace (see the inline comments), I think it's better to always keep classes_ as a numpy array for now.

Then I'm confused. Doesn't check_array already convert data to numpy? So we already support non-numpy data.

If the point is to have the output of predict in the same space as what the user gives, shouldn't that be like a decorator or something around predict? Or simply a function call at the end of predict?

> Or simply a function call at the end of predict?

That is quite close to what is done here: we record the input namespace and device at the beginning of the function, convert to numpy and do everything in numpy, and convert back to the input format and device where the function returns.

As you say, it could probably be implemented as a decorator applied to fit, label_binarize and inverse_transform.
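A rough sketch of what such a decorator could look like, under stated assumptions: `numpy_roundtrip` and `toy_inverse_transform` are hypothetical names, the namespace is looked up through `__array_namespace__` with a numpy fallback, and devices are ignored. Nothing like this exists in the PR.

```python
import functools

import numpy as np


def numpy_roundtrip(func):
    """Hypothetical decorator: run `func` on NumPy data, convert output back."""

    @functools.wraps(func)
    def wrapper(y, *args, **kwargs):
        # Record the input namespace before converting anything.
        get_ns = getattr(y, "__array_namespace__", None)
        xp = get_ns() if get_ns is not None else np
        # Assumption: `y` is readable by np.asarray (it lives in host memory).
        result = func(np.asarray(y), *args, **kwargs)
        # Convert back to the input namespace where the function returns.
        return xp.asarray(result)

    return wrapper


@numpy_roundtrip
def toy_inverse_transform(Y, classes):
    # Toy stand-in for inverse_transform: class of the largest column per row.
    return np.asarray(classes)[np.argmax(Y, axis=1)]


labels = toy_inverse_transform(np.asarray([[0, 1], [1, 0], [0, 1]]), classes=[10, 20])
```

Applying the same idea to methods such as fit would additionally need to skip `self` and decide which fitted attributes stay in numpy.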

ogrisel · 2024-05-10T16:13:23Z

sklearn/preprocessing/tests/test_label.py

+    xp_y = xp.asarray(y, device=device)
+    xp_lb = LabelBinarizer(sparse_output=False)
+    with config_context(array_api_dispatch=True):
+        xp_transformed = xp_lb.fit_transform(xp_y)


We might also want to test the output of transform when passed array api inputs (instead of just testing the output of fit_transform).


jeromedockes · 2024-05-15T11:15:07Z

Based on @adrinjalali's comments above and further discussion during the array API meeting today, we decided not to add array API support to the LabelBinarizer yet. Estimators such as the RidgeClassifier that rely on the LabelBinarizer will need to handle the conversion to and from numpy themselves (including of the classes_ attribute if they expose it).

In any case, when an estimator uses the LabelBinarizer to binarize a sequence of target labels but X is on the GPU, at least some of the conversion logic has to be handled after the call to LabelBinarizer.fit_transform.
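To illustrate that case, here is a hedged sketch in which the labels are a plain Python list and the output namespace is taken from X instead. `binarize_targets_like` is a hypothetical helper, the one-hot encoding is a toy replacement for LabelBinarizer.fit_transform, and device placement is only mentioned in a comment.

```python
import numpy as np


def binarize_targets_like(X, y):
    """Hypothetical helper: binarize labels in NumPy, return them like X."""
    # The labels may be a plain Python sequence, so they carry no namespace.
    classes, codes = np.unique(np.asarray(y), return_inverse=True)
    columns = np.arange(classes.shape[0])
    Y = (codes[:, None] == columns[None, :]).astype(np.float64)
    # Take the target namespace from X, not from y.
    get_ns = getattr(X, "__array_namespace__", None)
    xp = get_ns() if get_ns is not None else np
    # Assumption: a real estimator would also move Y to X's device here.
    return classes, xp.asarray(Y)


X = np.zeros((4, 2))
classes, Y = binarize_targets_like(X, ["cat", "dog", "bird", "cat"])
```

Here `classes` stays a numpy array while `Y` follows X, which mirrors the split between classes_ and the binarized targets discussed above.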

We will look for opportunities for refactoring this logic once array api support has been added to some estimators that use the LabelBinarizer.

add array api support in label binarizer

8191596

github-actions bot added the module:preprocessing label Mar 13, 2024

Merge remote-tracking branch 'upstream/main' into arrayapi-labelbinar…

b56425d

…izer

ogrisel reviewed Mar 14, 2024


update label_binarize

1c08d5b

jeromedockes added 8 commits March 20, 2024 17:50

Merge remote-tracking branch 'upstream/main' into arrayapi-labelbinar…

2a69ad3

…izer

do all label binarizing in numpy

d3be4cc

convert output of inverse_transform

0e6b716

add test

d718c07

fix inverse_transform for sparse Y

c4a4941

update changelog and array_api.rst

9ea0c55

add test for binary case

ba117f3

Merge remote-tracking branch 'upstream/main' into arrayapi-labelbinar…

8e859a3

…izer

jeromedockes changed the title ~~[WIP] add array api support in label binarizer~~ add array api support in label binarizer Mar 26, 2024

ogrisel mentioned this pull request Mar 28, 2024

Make more of the "tools" of scikit-learn Array API compatible #26024

Open

ogrisel added 2 commits March 28, 2024 16:11

Merge main + fix conflict in import statements

35d31c0

Fix broken test with pytorch on a non-CPU device

a8a270e

ogrisel approved these changes Mar 28, 2024


ogrisel mentioned this pull request Mar 28, 2024

ENH Array API support for LabelEncoder #27381

Merged

betatim added the Array API label Apr 10, 2024

ogrisel reviewed May 7, 2024


adrinjalali reviewed May 8, 2024


ogrisel reviewed May 10, 2024


jeromedockes and others added 4 commits May 13, 2024 14:11

Apply suggestions from code review

86e7daf

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Merge remote-tracking branch 'upstream/main' into arrayapi-labelbinar…

21c441a

…izer

formatting

632f4e8

add test for case where y is constant & for transform (in addition to…

54ff366

… fit_transform)

fix text removed from whatsnew in merge

9f4762b

jeromedockes closed this May 15, 2024

OmarManzoor mentioned this pull request Dec 9, 2024

ENH Add array api for log_loss #30439

Closed

lesteve mentioned this pull request Dec 19, 2024

ENH Array API support for confusion_matrix #30440

Open
