-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
add array api support in label binarizer #28626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add array api support in label binarizer #28626
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need tests as well.
Question: should we update LabelBinarizer
as part of this PR? Or keep it minimal to now delay work for RidgeClassifier
?
Also all public API (functions or classes) updated to support Array API should be listed in: |
after discussion with @ogrisel we decided that as the current logic is to build a sparse matrix that is later converted to dense, and moreover the LabelBinarizer is unlikely to be the performance bottleneck in most pipelines, for now we will keep all the function's logic in numpy but convert the results to the correct array api backend at the end so that other estimators that actually benefit from the array api can still use the LabelBinarizer. It will always be possible later to add a different code path for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a quick fix for the test and resolved a conflict when merging with main.
Here are a few suggestions for improvement, otherwise LGTM. Thanks @jeromedockes!
|
||
if not y_is_array_api: | ||
return Y | ||
return y_xp.asarray(Y, device=device_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a new test case to cover that line?
doc/whats_new/v1.5.rst
Outdated
@@ -47,6 +47,9 @@ See :ref:`array_api` for more details. | |||
|
|||
**Classes:** | |||
|
|||
- :class:`sklearn.preprocessing.LabelBinarizer` now supports Array API compliant inputs. | |||
:pr:`28626` by :user:`Jérôme Dockès <jeromedockes>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we cut 1.5.X earlier this week, this entry will have to be pushed to v1.6.rst
(after a merge of main
into this branch).
@@ -303,7 +304,8 @@ def fit(self, y): | |||
raise ValueError("y has 0 samples: %r" % y) | |||
|
|||
self.sparse_input_ = sp.issparse(y) | |||
self.classes_ = unique_labels(y) | |||
xp, _ = get_namespace(y) | |||
self.classes_ = _convert_to_numpy(unique_labels(y), xp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to convert classes_
to numpy?
This is related to #26083 as well.
This right now is a bit confusing, since fitting an estimator which supports array API ends up with an object where some attributes are in the same space as input X, and some are still in numpy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that so far the policy is: store the fitted attributes with the array type that makes most sense to be efficient at prediction time assuming the prediction-time data container type will be consistent with the fit-time data container type. Since this PR only changes LabelBinarizer
for convenience without actually delegating any computation to the underlying Array API namespace (see the inline comments), I think it's better to always keep classes_
as a numpy array for now.
If one day we decide to recode label_binarize
to actually delegate some computation to the underlying namespace, then we think about make the type of classes_
input dependent instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR only changes LabelBinarizer for convenience without actually delegating any computation to the underlying Array API namespace (see the inline comments), I think it's better to always keep classes_ as a numpy array for now.
Then I'm confused. Doesn't check_array
already convert data to numpy
? So we already support non-numpy data.
If the point is to have the output of predict
in the same space as what user gives, shouldn't that be like a decorator or something around predict
? Or a simply function call at the end of predict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or a simply function call at the end of predict?
that is quite close to what is done here: we record the input namespace and device at the beginning of the function, convert to numpy and do everything in numpy, and convert back to the input format and device where the function returns.
As you say, it could probably be implemented as a decorator applied to fit
, label_binarize
and inverse_transform
xp_y = xp.asarray(y, device=device) | ||
xp_lb = LabelBinarizer(sparse_output=False) | ||
with config_context(array_api_dispatch=True): | ||
xp_transformed = xp_lb.fit_transform(xp_y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also want to test the output of transform
when passed array api inputs (instead of just testing the output of fit_transform
).
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Based on @adrinjalali's comments above and further discussion during the array API meeting today, we decided not to add array API support to the LabelBinarizer yet. Estimators such as the RidgeClassifier that rely on the LabelBinarizer will need to handle the conversion to and from numpy themselves (including of the Anyway in the case an estimator uses the LabelBinarizer to binarize a sequence of target labels, but X is on the GPU, at least some of the conversion logic would have to be handled after the call to We will look for opportunities for refactoring this logic once array api support has been added to some estimators that use the LabelBinarizer. |
Towards #26024.
Moving this part out of #27961 to make it easier to review and because the LabelBinarizer is used in other estimators than Ridge and RidgeCV which are the focus of #27961 .
TODO