[MRG] Fix to remove duplicate classes when constructing a MultiLabelBinarizer #12195

samwaterbury · 2018-09-29T01:12:57Z

Reference Issues/PRs

Fixes issue #12194.

What does this implement/fix? Explain your changes.

The bug is that if the user supplies duplicate classes to MultiLabelBinarizer(classes=...), the duplicates do not get removed. This can cause problems; for example, in the original issue it raised a ValueError when calling the method tocoo().

This fix simply removes duplicate classes while preserving order using an OrderedDict, which is the fastest method I know of for doing this that is compatible with Python 2/3.
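For illustration, a minimal sketch of the order-preserving de-duplication described above. The helper name `unique_preserving_order` is hypothetical and this is not the actual diff from this PR; it only shows the OrderedDict idiom, which assumes the class labels are hashable.

```python
from collections import OrderedDict


def unique_preserving_order(classes):
    """Drop duplicate items from `classes`, keeping first-seen order."""
    # OrderedDict.fromkeys keeps one key per distinct item, in insertion
    # order, on both Python 2.7 and Python 3.
    return list(OrderedDict.fromkeys(classes))


print(unique_preserving_order([1, 3, 2, 3]))  # [1, 3, 2]
```

Each label keeps the position of its first occurrence, so a user-supplied column order is not reshuffled the way sorting would reshuffle it.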

Any other comments?

It passes all tests, runs the original issue example without error, and has not failed on several other examples I found on the internet (it’s only 2 LOC).

qinhanmin2014 · 2018-09-29T04:06:30Z

Why should we support duplicate classes?

qinhanmin2014 · 2018-09-29T04:07:11Z

How about simply raising an error?

rth · 2018-09-29T14:19:13Z

Also please add to the classes docstring that classes are expected to be unique.

qinhanmin2014 · 2018-09-29T14:27:55Z

@samwaterbury We also need a test to ensure that the error message is raised.

samwaterbury · 2018-09-30T04:57:52Z

Alright, it now throws an error if there are duplicate classes. I also made a quick test for this and added a note about the requirement to the docstring.

qinhanmin2014 · 2018-09-30T07:49:38Z

sklearn/preprocessing/tests/test_label.py

@@ -374,6 +374,9 @@ def test_multilabel_binarizer_given_classes():
    mlb = MultiLabelBinarizer(classes=[1, 3, 2])
    assert_array_equal(mlb.fit(inp).transform(inp), indicator_mat)

+    # ensure a ValueError is thrown if given duplicate classes
+    assert_raises(ValueError, MultiLabelBinarizer, classes=[1, 3, 2, 3])


These days, we tend to use assert_raise_message to ensure that the correct error message is raised.
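To make the difference concrete, here is a standard-library sketch of checking the message as well as the exception type. Both `check_unique` and `raises_with_message` are hypothetical stand-ins written for this example (they are not scikit-learn helpers), and the error text is only illustrative.

```python
def check_unique(classes):
    # Hypothetical validation standing in for the code under test.
    if len(set(classes)) < len(classes):
        raise ValueError("The classes argument contains duplicate classes.")


def raises_with_message(exc_type, message, func, *args, **kwargs):
    """Return True if func raises exc_type and its text contains message."""
    try:
        func(*args, **kwargs)
    except exc_type as exc:
        return message in str(exc)
    # No exception at all counts as a failure.
    return False


# A type-only check would pass for any ValueError; this also pins the text.
assert raises_with_message(ValueError, "duplicate classes",
                           check_unique, [1, 3, 2, 3])
```

Matching on the message guards against the test passing because some unrelated ValueError was raised earlier in the call.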

qinhanmin2014 · 2018-09-30T07:52:07Z

AFAIK, it's uncommon to raise errors in __init__, but since we've done similar things in LabelBinarizer, I'll accept it.

TomDLT · 2018-10-01T11:39:28Z

> AFAIK, it's uncommon to raise errors in __init__, but since we've done similar things in LabelBinarizer, I'll accept it.

But then, if we change the classes with est.set_params(), the check will not happen. I think it is better to perform the check in the fit method.

qinhanmin2014 · 2018-10-01T13:06:54Z

> But then, if we change the classes with est.set_params(), the check will not happen. I think it is better to perform the check in the fit method.

Let's put it in fit :)
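A toy sketch of why the check belongs in fit. `ToyBinarizer` is a hypothetical stand-in, not scikit-learn's MultiLabelBinarizer, and its `set_params` is a simplified assumption of how parameters get reassigned after construction.

```python
class ToyBinarizer:
    """Hypothetical estimator used only to illustrate where validation goes."""

    def __init__(self, classes=None):
        # Store the parameter untouched; no validation happens here.
        self.classes = classes

    def set_params(self, **params):
        # Simplified: assign attributes directly, never calling __init__.
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, y):
        if self.classes is None:
            # Infer the classes from the data when none were given.
            classes = sorted(set(label for labels in y for label in labels))
        elif len(set(self.classes)) < len(self.classes):
            raise ValueError("classes contains duplicates")
        else:
            classes = list(self.classes)
        self.classes_ = classes
        return self


est = ToyBinarizer(classes=[1, 3, 2])
est.set_params(classes=[1, 3, 2, 3])  # a check in __init__ would miss this
try:
    est.fit([(1, 2), (3,)])
except ValueError as exc:
    print("caught:", exc)
```

Because fit reads the parameter at the moment it is used, the duplicates are caught no matter how the value was set.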

samwaterbury · 2018-10-01T14:42:27Z

> AFAIK, it's uncommon to raise errors in __init__, but since we've done similar things in LabelBinarizer, I'll accept it.

> But then, if we change the classes with est.set_params(), the check will not happen. I think it is better to perform the check in the fit method.

Good point, I had not considered that. I'll put it in fit and change the test to use assert_raise_message while I'm at it. Lastly, I'm not sure if the comment was deleted, but I tested set vs np.unique and set seems to be the winner speed-wise.

qinhanmin2014 · 2018-10-01T14:48:05Z

> Lastly, I'm not sure if the comment was deleted, but I tested set vs np.unique and set seems to be the winner speed-wise.

I'm OK with either solution. I don't think the difference between them is significant, and I don't think it's an important part of MultiLabelBinarizer.

Fix to remove duplicate classes when constructing a MultiLabelBinarizer.

01e93d3

samwaterbury changed the title ~~Fix to remove duplicate classes when constructing a MultiLabelBinarizer~~ [MRG] Fix to remove duplicate classes when constructing a MultiLabelBinarizer Sep 29, 2018

qinhanmin2014 mentioned this pull request Sep 29, 2018

MultiLabelBinarizer constructs pathological CSR matrices #12194

Closed

samwaterbury added 2 commits September 29, 2018 19:47

Merge branch 'master' of github.com:scikit-learn/scikit-learn into mlb

430b691

Changed MultiLabelBinarizer to throw error for duplicate classes.

240a410

qinhanmin2014 approved these changes Sep 30, 2018

Moved error throwing to fit function.

6f07fcb

TomDLT approved these changes Oct 1, 2018

qinhanmin2014 approved these changes Oct 2, 2018

qinhanmin2014 merged commit 11612fc into scikit-learn:master Oct 2, 2018

samwaterbury deleted the mlb branch October 2, 2018 00:16

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 15, 2018

MNT Raise error for duplicate classes when constructing a MultiLabelBinarizer (scikit-learn#12195)

1478728
