8000 [MRG] Fix to remove duplicate classes when constructing a MultiLabelBinarizer by samwaterbury · Pull Request #12195 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] Fix to remove duplicate classes when constructing a MultiLabelBinarizer #12195

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 2, 2018

Conversation

samwaterbury
Copy link
Contributor
@samwaterbury samwaterbury commented Sep 29, 2018

Reference Issues/PRs

Fixes issue #12194.

What does this implement/fix? Explain your changes.

The bug is that if the user supplies duplicate classes to MultiLabelBinarizer(classes=...) the duplicates do not get removed. This has the potential to cause problems; for example in the original issue it raised an ValueError when calling the method tocoo().

This fix simply removes duplicate classes while preserving order using an OrderedDict, which is the fastest method I know of for doing this that is compatible with Python 2/3.

Any other comments?

It passes all tests, runs the original issue example without error, and has not failed on several other examples I found on the internet (it’s only 2 LOC).

@samwaterbury samwaterbury changed the title Fix to remove duplicate classes when constructing a MultiLabelBinarizer [MRG] Fix to remove duplicate classes when constructing a MultiLabelBinarizer Sep 29, 2018
@qinhanmin2014
Copy link
Member

Why should we support duplicate classes?

@qinhanmin2014
Copy link
Member

How about simply raising an error?

@rth
Copy link
Member
rth commented Sep 29, 2018

Also please add to the classes docstring that classes are expected to be unique.

@qinhanmin2014
Copy link
Member

@samwaterbury We also need a test to ensure that the error message is raised.

@samwaterbury
Copy link
Contributor Author

Alright, it now throws an error if there are duplicate classes. I also made a quick test for this and added a note about the requirement to the docstring.

Uh oh!

There was an error while loading. Please reload this page.

@@ -374,6 +374,9 @@ def test_multilabel_binarizer_given_classes():
mlb = MultiLabelBinarizer(classes=[1, 3, 2])
assert_array_equal(mlb.fit(inp).transform(inp), indicator_mat)

# ensure a ValueError is thrown if given duplicate classes
assert_raises(ValueError, MultiLabelBinarizer, classes=[1, 3, 2, 3])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These days, we tend to use assert_raise_message to ensure that correct error message is raised.

@qinhanmin2014
Copy link
Member

AFAIK, it's uncommon to raise errors in __init__, but since we've done similar things in LabelBinarizer, I'll accept it.

@TomDLT
Copy link
Member
TomDLT commented Oct 1, 2018

AFAIK, it's uncommon to raise errors in __init__, but since we've done similar things in LabelBinarizer, I'll accept it.

But then, if we change the classes with est.set_params(), the check will not happen. I think it is better to perform the check in the fit method.

@qinhanmin2014
Copy link
Member

But then, if we change the classes with est.set_params(), the check will not happen. I think it is better to perform the check in the fit method.

Let's put it in fit :)

@samwaterbury
Copy link
Contributor Author

AFAIK, it's uncommon to raise errors in __init__, but since we've done similar things in LabelBinarizer, I'll accept it.

But then, if we change the classes with est.set_params(), the check will not happen. I think it is better to perform the check in the fit method.

Good point, I had not considered that. I'll put it in fit and change the test to use assert_raise_message while I'm at it. Lastly, I'm not sure if the comment was deleted, but I tested set vs np.unique and set seems to be the winner speed-wise.

@qinhanmin2014
Copy link
Member

Lastly, I'm not sure if the comment was deleted, but I tested set vs np.unique and set seems to be the winner speed-wise.

I'm OK with either solution. I don't think the difference between them is significant and I don't think it's an important part for MultiLabelBinarizer.

@qinhanmin2014 qinhanmin2014 merged commit 11612fc into scikit-learn:master Oct 2, 2018
@samwaterbury samwaterbury deleted the mlb branch October 2, 2018 00:16
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0