FIX validate in fit for LabelBinarizer estimator #21434
Conversation
I think that we should add an entry in the changelog since it could have an effect on third-party libraries. Please add an entry to the change log at doc/whats_new/v1.1.rst:

- |Fix| :class:`preprocessing.LabelBinarizer` now validates input parameters in `fit`
  instead of `__init__`.
  :pr:`21434` by :user:`krumetoft <krumetoft>`.

It should go in the :mod:`sklearn.preprocessing` section.
You will need to edit the tests where we were testing the pattern `with pytest.raises(...): LabelBinarizer(...)` by also calling `fit`. Here is the diff:

diff --git a/sklearn/preprocessing/_label.py b/sklearn/preprocessing/_label.py
index 72bd8c6d65..12f6d08b5c 100644
--- a/sklearn/preprocessing/_label.py
+++ b/sklearn/preprocessing/_label.py
@@ -275,6 +275,19 @@ class LabelBinarizer(TransformerMixin, BaseEstimator):
self : object
Returns the instance itself.
"""
+ if self.neg_label >= self.pos_label:
+ raise ValueError(
+ f"neg_label={self.neg_label} must be strictly less than "
+ f"pos_label={self.pos_label}."
+ )
+
+ if self.sparse_output and (self.pos_label == 0 or self.neg_label != 0):
+ raise ValueError(
+ "Sparse binarization is only supported with non "
+ "zero pos_label and zero neg_label, got "
+ f"pos_label={self.pos_label} and neg_label={self.neg_label}"
+ )
+
self.y_type_ = type_of_target(y)
if "multioutput" in self.y_type_:
raise ValueError(
diff --git a/sklearn/preprocessing/tests/test_label.py b/sklearn/preprocessing/tests/test_label.py
index 5142144bcb..3dc4afaf89 100644
--- a/sklearn/preprocessing/tests/test_label.py
+++ b/sklearn/preprocessing/tests/test_label.py
@@ -124,25 +124,35 @@ def test_label_binarizer_errors():
lb = LabelBinarizer().fit(one_class)
multi_label = [(2, 3), (0,), (0, 2)]
- with pytest.raises(ValueError):
+ err_msg = "You appear to be using a legacy multi-label data representation."
+ with pytest.raises(ValueError, match=err_msg):
lb.transform(multi_label)
lb = LabelBinarizer()
- with pytest.raises(ValueError):
+
+ err_msg = "This LabelBinarizer instance is not fitted yet"
+ with pytest.raises(ValueError, match=err_msg):
lb.transform([])
- with pytest.raises(ValueError):
+ with pytest.raises(ValueError, match=err_msg):
lb.inverse_transform([])
- with pytest.raises(ValueError):
- LabelBinarizer(neg_label=2, pos_label=1)
- with pytest.raises(ValueError):
- LabelBinarizer(neg_label=2, pos_label=2)
-
- with pytest.raises(ValueError):
- LabelBinarizer(neg_label=1, pos_label=2, sparse_output=True)
+ input_labels = [0, 1, 0, 1]
+ err_msg = "neg_label=2 must be strictly less than pos_label=1."
+ with pytest.raises(ValueError, match=err_msg):
+ LabelBinarizer(neg_label=2, pos_label=1).fit(input_labels)
+ err_msg = "neg_label=2 must be strictly less than pos_label=2."
+ with pytest.raises(ValueError, match=err_msg):
+ LabelBinarizer(neg_label=2, pos_label=2).fit(input_labels)
+ err_msg = (
+ "Sparse binarization is only supported with non zero pos_label and zero "
+ "neg_label, got pos_label=2 and neg_label=1"
+ )
+ with pytest.raises(ValueError, match=err_msg):
+ LabelBinarizer(neg_label=1, pos_label=2, sparse_output=True).fit(input_labels)
# Fail on y_type
- with pytest.raises(ValueError):
+ err_msg = "foo format is not supported"
+ with pytest.raises(ValueError, match=err_msg):
_inverse_binarize_thresholding(
y=csr_matrix([[1, 2], [2, 1]]),
output_type="foo",
@@ -152,11 +162,13 @@ def test_label_binarizer_errors():
# Sequence of seq type should raise ValueError
y_seq_of_seqs = [[], [1, 2], [3], [0, 1, 3], [2]]
- with pytest.raises(ValueError):
+ err_msg = "You appear to be using a legacy multi-label data representation"
+ with pytest.raises(ValueError, match=err_msg):
LabelBinarizer().fit_transform(y_seq_of_seqs)
# Fail on the number of classes
- with pytest.raises(ValueError):
+ err_msg = "The number of class is not equal to the number of dimension of y."
+ with pytest.raises(ValueError, match=err_msg):
_inverse_binarize_thresholding(
y=csr_matrix([[1, 2], [2, 1]]),
output_type="foo",
@@ -165,7 +177,8 @@ def test_label_binarizer_errors():
)
# Fail on the dimension of 'binary'
- with pytest.raises(ValueError):
+ err_msg = "output_type='binary', but y.shape"
+ with pytest.raises(ValueError, match=err_msg):
_inverse_binarize_thresholding(
y=np.array([[1, 2, 3], [2, 1, 3]]),
output_type="binary",
@@ -174,9 +187,10 @@ def test_label_binarizer_errors():
)
# Fail on multioutput data
- with pytest.raises(ValueError):
+ err_msg = "Multioutput target data is not supported with label binarization"
+ with pytest.raises(ValueError, match=err_msg):
LabelBinarizer().fit(np.array([[1, 3], [2, 1]]))
- with pytest.raises(ValueError):
+ with pytest.raises(ValueError, match=err_msg):
label_binarize(np.array([[1, 3], [2, 1]]), classes=[1, 2, 3])
You can use it to edit your PR.
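For illustration, here is a minimal sketch of the behaviour change this diff introduces (assuming the patch above is applied): constructing the estimator with inconsistent parameters no longer raises, and the error is deferred until `fit` is called.

from sklearn.preprocessing import LabelBinarizer

# No error at construction time any more: the parameter checks moved out of __init__.
lb = LabelBinarizer(neg_label=2, pos_label=1)

# Validation now happens when fitting.
try:
    lb.fit([0, 1, 0, 1])
except ValueError as exc:
    print(exc)  # neg_label=2 must be strictly less than pos_label=1.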
Thank you, @glemaitre! I've just updated doc/whats_new/v1.1 and test_label.py. On a second review, I am wondering whether it is an issue that the input will be validated in transform, but not in fit (as the checks are performed by label_binarize in transform only)?
If you check my diff, I moved the validation into `fit` (see the `_label.py` diff above). Otherwise, we don't do any validation.
Apologies, I missed that - corrected now!
Can you run black on the 2 files that are detected by our linter: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=34025&view=logs&jobId=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&j=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&t=fc67071d-c3d4-58b8-d38e-cafc0d3c731a In the future, you might want to install pre-commit.
Thank you for your patience and suggestion - I will use pre-commit.
LGTM.
Thank you, @krumetoft, for this first-time contribution!
Thank you, @jjerphan! I need to thank Guillaume for his patience and guidance.
You can change a local git repository config to use another identity (e.g. @krumeto in this case) using `git config user.name` and `git config user.email`.
LGTM
Minor comments about the tests. Otherwise LGTM.
Hey @thomasjpfan, thank you, it makes sense! I changed test_label.py accordingly.
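For context, a minimal sketch of the `pytest.raises(..., match=...)` pattern the reviewers asked for in the updated tests; the estimator, parameters, and message come from the diff above, while using `re.escape` is just one way of handling regex metacharacters in the expected message.

import re

import pytest

from sklearn.preprocessing import LabelBinarizer


def test_label_binarizer_neg_label_error():
    # `match` is interpreted as a regular expression, so escape the literal message.
    err_msg = re.escape("neg_label=2 must be strictly less than pos_label=1.")
    with pytest.raises(ValueError, match=err_msg):
        LabelBinarizer(neg_label=2, pos_label=1).fit([0, 1, 0, 1])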
Regarding your last commit, @krumetoft, you can use the ...
Thank you, @jjerphan! I saw the piece of code in one of Olivier's comments and decided to give it a run :)
Arff, I did not see that there is a conflict. @krumetoft @krumeto, could you solve the merge conflict?
Attempt to resolve merge conflict
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Force-pushed 646119c to 11d80f1.
Hey @glemaitre, the merge conflict is resolved (line 292 in _label.py).
Thank you, @krumeto!
Reference Issues/PRs
This closes the LabelBinarizer part of #21406
What does this implement/fix? Explain your changes.
Removed a parameter check from `__init__` of `LabelBinarizer` that is also performed by the `label_binarize` function during `transform`.
#DataUmbrella sprint
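For reference, a minimal sketch of the check on the function side (assuming a scikit-learn version where `label_binarize` validates `neg_label`/`pos_label` itself), which is why the duplicated check in `__init__` could be dropped:

from sklearn.preprocessing import label_binarize

# label_binarize performs the same neg_label/pos_label consistency check,
# so keeping a copy of it in LabelBinarizer.__init__ was redundant.
try:
    label_binarize([0, 1, 1, 0], classes=[0, 1], neg_label=2, pos_label=1)
except ValueError as exc:
    print(exc)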