-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] Fix to remove duplicate classes when constructing a MultiLabelBinarizer #12195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why should we support duplicate classes? |
How about simply raising an error? |
Also please add to the |
@samwaterbury We also need a test to ensure that the error message is raised. |
Alright, it now throws an error if there are duplicate classes. I also made a quick test for this and added a note about the requirement to the docstring. |
@@ -374,6 +374,9 @@ def test_multilabel_binarizer_given_classes(): | |||
mlb = MultiLabelBinarizer(classes=[1, 3, 2]) | |||
assert_array_equal(mlb.fit(inp).transform(inp), indicator_mat) | |||
|
|||
# ensure a ValueError is thrown if given duplicate classes | |||
assert_raises(ValueError, MultiLabelBinarizer, classes=[1, 3, 2, 3]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These days, we tend to use assert_raise_message
to ensure that correct error message is raised.
AFAIK, it's uncommon to raise errors in |
But then, if we change the classes with |
Let's put it in |
Good point, I had not considered that. I'll put it in |
I'm OK with either solution. I don't think the difference between them is significant and I don't think it's an important part for MultiLabelBinarizer. |
Reference Issues/PRs
Fixes issue #12194.
What does this implement/fix? Explain your changes.
The bug is that if the user supplies duplicate classes to
MultiLabelBinarizer(classes=...)
the duplicates do not get removed. This has the potential to cause problems; for example in the original issue it raised anValueError
when calling the methodtocoo()
.This fix simply removes duplicate classes while preserving order using an OrderedDict, which is the fastest method I know of for doing this that is compatible with Python 2/3.
Any other comments?
It passes all tests, runs the original issue example without error, and has not failed on several other examples I found on the internet (it’s only 2 LOC).