add array api support in label binarizer by jeromedockes · Pull Request #28626 · scikit-learn/scikit-learn · GitHub
Closed
1 change: 1 addition & 0 deletions doc/modules/array_api.rst
@@ -101,6 +101,7 @@ Estimators
- :class:`preprocessing.MaxAbsScaler`
- :class:`preprocessing.MinMaxScaler`
- :class:`preprocessing.Normalizer`
- :class:`preprocessing.LabelBinarizer`

Metrics
-------
4 changes: 3 additions & 1 deletion doc/whats_new/v1.6.rst
@@ -38,7 +38,9 @@ See :ref:`array_api` for more details.

**Classes:**

-
- :class:`sklearn.preprocessing.LabelBinarizer` now supports Array API compliant inputs.
  :pr:`28626` by :user:`Jérôme Dockès <jeromedockes>` and :user:`Olivier Grisel <ogrisel>`.


Metadata Routing
----------------
44 changes: 40 additions & 4 deletions sklearn/preprocessing/_label.py
@@ -17,6 +17,7 @@

from ..base import BaseEstimator, TransformerMixin, _fit_context
from ..utils import column_or_1d
from ..utils._array_api import _convert_to_numpy, device, get_namespace
from ..utils._encode import _encode, _unique
from ..utils._param_validation import Interval, validate_params
from ..utils.multiclass import type_of_target, unique_labels
@@ -303,7 +304,8 @@ def fit(self, y):
raise ValueError("y has 0 samples: %r" % y)

self.sparse_input_ = sp.issparse(y)
self.classes_ = unique_labels(y)
xp, _ = get_namespace(y)
self.classes_ = _convert_to_numpy(unique_labels(y), xp)
Member:

Why do we need to convert classes_ to numpy?

This is related to #26083 as well.

Right now this is a bit confusing: fitting an estimator which supports array API ends up with an object where some attributes are in the same namespace as the input X, and some are still in numpy.

Member:

I think that so far the policy is: store the fitted attributes with the array type that makes the most sense for efficiency at prediction time, assuming the prediction-time data container type will be consistent with the fit-time data container type. Since this PR only changes LabelBinarizer for convenience without actually delegating any computation to the underlying Array API namespace (see the inline comments), I think it's better to always keep classes_ as a numpy array for now.

If one day we decide to recode label_binarize to actually delegate some computation to the underlying namespace, then we can think about making the type of classes_ input dependent instead.
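The policy described in this comment — keep `classes_` as NumPy regardless of the input namespace — can be observed directly. A minimal sketch with plain NumPy input (per this PR, the same holds for Array API inputs under `array_api_dispatch=True`):

```python
import numpy as np

from sklearn.preprocessing import LabelBinarizer

# classes_ holds the sorted unique labels and, per the policy discussed
# above, is always stored as a NumPy array.
lb = LabelBinarizer().fit(np.array([1, 0, 2, 2]))
print(lb.classes_)        # [0 1 2]
print(type(lb.classes_))  # <class 'numpy.ndarray'>
```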

Member:

> Since this PR only changes LabelBinarizer for convenience without actually delegating any computation to the underlying Array API namespace (see the inline comments), I think it's better to always keep classes_ as a numpy array for now.

Then I'm confused. Doesn't check_array already convert data to numpy? So we already support non-numpy data.

If the point is to have the output of predict in the same space as what the user gives, shouldn't that be a decorator or something around predict? Or simply a function call at the end of predict?

Contributor Author:

> Or simply a function call at the end of predict?

That is quite close to what is done here: we record the input namespace and device at the beginning of the function, convert to numpy, do everything in numpy, and convert back to the input namespace and device where the function returns.

As you say, it could probably be implemented as a decorator applied to fit, label_binarize and inverse_transform.
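The decorator idea floated here can be sketched roughly as follows. This is a hypothetical helper (`_numpy_roundtrip` is not a scikit-learn API); for simplicity it detects the namespace from the input's module name instead of using `get_namespace` / `_convert_to_numpy`, and it ignores devices:

```python
import importlib
from functools import wraps

import numpy as np


def _numpy_roundtrip(func):
    """Hypothetical sketch: run ``func`` on a NumPy copy of its first
    argument and convert the result back to the input's array namespace."""

    @wraps(func)
    def wrapper(y, *args, **kwargs):
        # Crude namespace detection for the sketch; the real code would use
        # sklearn.utils._array_api.get_namespace and also track the device.
        root = type(y).__module__.split(".")[0]
        result = func(np.asarray(y), *args, **kwargs)
        if root in ("numpy", "builtins"):
            return result
        # e.g. a torch.Tensor input comes back as a torch.Tensor.
        return importlib.import_module(root).asarray(result)

    return wrapper


@_numpy_roundtrip
def double(y):
    # Body runs entirely on NumPy, regardless of the input namespace.
    return y * 2
```

Applied to fit / label_binarize / inverse_transform, this would factor out the record-convert-restore pattern that the PR currently repeats inline.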

return self

def fit_transform(self, y):
@@ -396,6 +398,21 @@ def inverse_transform(self, Y, threshold=None):
"""
check_is_fitted(self)

# LabelBinarizer supports Array API compatibility for convenience when
# used as a sub-component of a classifier that does. However
# label_binarize internally uses a NumPy copy of the data because
# all the operations are meant to construct the backing NumPy arrays of a
# scipy.sparse CSR datastructure even when sparse_output=False.
#
# In the future, we might consider a dedicated code path for the
# sparse_output=False case that would directly be implemented using Array
# API without the intermediate NumPy conversion and scipy.sparse
# datastructure.
xp, is_array_api_compliant = get_namespace(Y)
device_ = device(Y) if is_array_api_compliant else None
if not sp.issparse(Y):
Y = _convert_to_numpy(Y, xp)

if threshold is None:
threshold = (self.pos_label + self.neg_label) / 2.0

@@ -410,11 +427,13 @@
y_inv = sp.csr_matrix(y_inv)
elif sp.issparse(y_inv):
y_inv = y_inv.toarray()
if is_array_api_compliant and not sp.issparse(y_inv):
y_inv = xp.asarray(y_inv, device=device_)

return y_inv

def _more_tags(self):
return {"X_types": ["1dlabels"]}
return {"X_types": ["1dlabels"], "array_api_support": True}


@validate_params(
@@ -487,6 +506,18 @@ def label_binarize(y, *, classes, neg_label=0, pos_label=1, sparse_output=False)
[0],
[1]])
"""
# label_binarize supports Array API compatibility for convenience when
# LabelBinarizer is used as a sub-component of a classifier that does.
# However label_binarize internally uses a NumPy copy of the data because
# all the operations are meant to construct the backing NumPy arrays of a
# scipy.sparse CSR datastructure even when sparse_output=False.
y_xp, y_is_array_api = get_namespace(y)
if y_is_array_api:
device_ = device(y)
y = _convert_to_numpy(y, y_xp)
classes_xp, classes_is_array_api = get_namespace(classes)
if classes_is_array_api:
classes = _convert_to_numpy(classes, classes_xp)
if not isinstance(y, list):
# XXX Workaround that will be removed when list of list format is
# dropped
@@ -535,7 +566,10 @@ def label_binarize(y, *, classes, neg_label=0, pos_label=1, sparse_output=False)
else:
Y = np.zeros((len(y), 1), dtype=int)
Y += neg_label
return Y

if not y_is_array_api:
return Y
return y_xp.asarray(Y, device=device_)
Member:

Could you please add a new test case to cover that line?

elif len(classes) >= 3:
y_type = "multiclass"

@@ -595,7 +629,9 @@ def label_binarize(y, *, classes, neg_label=0, pos_label=1, sparse_output=False)
else:
Y = Y[:, -1].reshape((-1, 1))

return Y
if not y_is_array_api:
return Y
return y_xp.asarray(Y, device=device_)
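With this change, `label_binarize` and `LabelBinarizer` still compute everything in NumPy, but the result is converted back to the caller's namespace and device. The NumPy behavior itself is unchanged; a quick illustration of the round trip (with an Array API input under `array_api_dispatch=True`, the same values would come back in the input's namespace):

```python
import numpy as np

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
Y = lb.fit_transform(np.array([1, 0, 2]))
# One column per class (classes_ == [0, 1, 2]), one row per sample.
print(Y)
# inverse_transform recovers the original labels.
print(lb.inverse_transform(Y))
```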


def _inverse_binarize_multiclass(y, classes):
42 changes: 40 additions & 2 deletions sklearn/preprocessing/tests/test_label.py
@@ -2,7 +2,7 @@
import pytest
from scipy.sparse import issparse

from sklearn import datasets
from sklearn import config_context, datasets
from sklearn.preprocessing._label import (
LabelBinarizer,
LabelEncoder,
@@ -11,7 +11,16 @@
_inverse_binarize_thresholding,
label_binarize,
)
from sklearn.utils._testing import assert_array_equal, ignore_warnings
from sklearn.utils._array_api import (
_convert_to_numpy,
get_namespace,
yield_namespace_device_dtype_combinations,
)
from sklearn.utils._testing import (
_array_api_for_tests,
assert_array_equal,
ignore_warnings,
)
from sklearn.utils.fixes import (
COO_CONTAINERS,
CSC_CONTAINERS,
@@ -216,6 +225,35 @@ def test_label_binarizer_sparse_errors(csr_container):
)


@pytest.mark.parametrize(
"array_namespace, device, dtype_name", yield_namespace_device_dtype_combinations()
)
@pytest.mark.parametrize(
"y",
[
np.array([1, 0, 2]),
np.array([0, 0, 0]),
np.array([1, 0, 0]),
np.array([[0, 1, 1], [1, 0, 1]]),
],
)
def test_label_binarizer_array_api(y, array_namespace, device, dtype_name):
xp = _array_api_for_tests(array_namespace, device)
xp_y = xp.asarray(y, device=device)
xp_lb = LabelBinarizer(sparse_output=False)
with config_context(array_api_dispatch=True):
xp_fit_transformed = xp_lb.fit_transform(xp_y)
xp_transformed = xp_lb.transform(xp_y)
xp_inv_transformed = xp_lb.inverse_transform(xp_fit_transformed)
np_lb = LabelBinarizer(sparse_output=False)
np_transformed = np_lb.fit_transform(y)
assert get_namespace(xp_fit_transformed)[0].__name__ == xp.__name__
assert get_namespace(xp_transformed)[0].__name__ == xp.__name__
assert get_namespace(xp_inv_transformed)[0].__name__ == xp.__name__
assert_array_equal(_convert_to_numpy(xp_fit_transformed, xp), np_transformed)
assert_array_equal(_convert_to_numpy(xp_inv_transformed, xp), y)


@pytest.mark.parametrize(
"values, classes, unknown",
[