-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
ENH support multilabel targets in LabelEncoder #1987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Also, support 1d-array of sequences as a multilabel format
This allows LabelBinarizer to take advantage of LabelEncoder efficiency, and for both to share future features.
I've edited |
(But apparently passing |
for i, (first_el, second_el) in enumerate(zip(first, second)): | ||
assert_array_equal(len(first_el), len(second_el), | ||
'In sequence of sequence element {}' | ||
'{}'.format(i, err_msg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that you mean
assert_array_equal(first_el, second_el,
'In sequence of sequence element {}'
'{}'.format(i, err_msg))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're quite right. Fixed.
This was PRed prematurely without tests, doc updates, etc. It is now ready for review. I can remove the inheritance of |
Why have you chosen to extend |
That's a good reason for me to take this bit out of this PR. One reason is a "label binarizer" IS a label encoder. Another reason is that there would be nothing in |
This reverts commit 8669605. Conflicts: sklearn/preprocessing.py sklearn/tests/test_preprocessing.py
Okay. That's been removed from this PR. We'll talk about it later ;) |
@@ -97,6 +98,25 @@ def assert_raise_message(exception, message, function, *args, **kwargs): | |||
assert_in(message, error_message) | |||
|
|||
|
|||
def assert_sequences_equal(first, second, err_msg=''): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the advantage of this new function compare to assert_equal
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do without and use only arrays with dtype=object? That way we make sure in the tests that we are manipulating arrays, and not lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have discussed arrays of dtype object below. assert_equal
won't work if the outer or inner sequence is an array.
General remark about this PR: can we stick to arrays rather than lists? |
Please excuse me for not yet reading the detailed comments here. I think this and related issues are more important. In short: I wish. I had originally implemented this etc. as producing an array of arrays. But it turned out they were explicitly not supported by However, the problem is bigger than this. Firstly on the matter of arrays / not arrays: currently, we are making a distinction between arrays and array-likes in order to support multilabel target types. For example, to not be confused for a sequence of sequences, a label indicator matrix or multilabel-multioutputs must be an array. A sequence of sequences until now could not be constituted of arrays. I dislike this very much, but have been writing code to support it, naively assuming it was agreed on (somehow, some day, somewhere). However, the problem is bigger than this. Supporting the sequence of sequences multilabel type makes a mess. On top of the data type problem above, this is for three reasons:
In conclusion, I propose (@arjoly) that we abandon support for sequence of sequences-style multilabel data (except perhaps for import/export), and when we do choose to support it, to use Footnotes: (*) sparse def sparse_and(y1, y2):
y1 = y1.tocoo()
y2 = y2.tocoo()
y = y1.copy()
y.rows, y.cols = zip(*set(zip(y1.rows, y1.cols)) & set(zip(y2.rows, y2.cols)))
y.data = np.ones(y.cols.shape)
return y or arithmetically in def sparse_and(y1. y2):
y1 = y1.tocsr()
y2 = y2.tocsr()
y = (y1 + y2)
y.data[y.data < 2] = 0
y.eliminate_zeros()
return y Again this illustrates that |
I should note that one way to keep a sparse multilabel format is to explicitly make them arrays of sets (which is what they should have been all along): >>> numpy.asarray([set([1])])
array([set([1])], dtype=object) |
Sorry, something went wrong.
[And if we go for supporting an Very, very roughly: class aos_matrix(numpy.ndarray):
"""Represents a sparse boolean matrix as an array of sets
"""
def __new__(self, sets, n_labels):
# TODO
@property
def shape(self):
return len(self._sets), self._n_labels
ndim = 1
@staticmethod
def _vectorize(func, otypes='O'):
vfunc = np.vectorize(func, otypes=otypes)
def wrapper(*args):
return vfunc(*[aos_matrix(arg) for arg in args])
return wrapper
__and__ = _vectorize(set.intersection)
__or__ = _vectorize(set.union)
__xor__ = _vectorize(set.symmetric_difference)
sizes = _vectorize(len, otypes='i')
def sum(self, axis=None):
if axis is None:
return self.sizes().sum()
elif axis == 1:
return self.sizes()
elif axis == 0:
res = np.zeros(self._n_labels)
for s in self:
res[s] += 1
else:
raise ValueError Let polymorphism handle the underlying data type. |
That seems reasonnable to me (my gut feeling was telling me the same First, who added it? Can this person tell us his/her thoughts? Second, what are the usecases? Are we sure that we are answering them What will will be our replacement strategy? The indicator matrix seems "Explicit is better than implicit". I prefer the idea of providing a Last, should this be discussed on the mailing list? I don't know how long |
Admittedly, I did not realise how old the support for multilabel, as sequences of sequences, goes. It seems the first relevant commit is ecb869c, with work pursued on a branch by @mblondel. Yes, I think I will move this discussion to the mailing list (which is what I had intended this morning, but then you gave me reason to raise it here). |
I agree with this proposition. |
@jnothman During the sprint, we agreed that the support of the There is a first warning on the improved multilabel page (thanks to @schwarty). |
Good to hear, thanks. The sparse matrix option may be difficult before boolean ops are supported, i.e. the next release. |
So I don't think this PR is needed. |
Also, support 1d-array of arrays (with the outer array having
dtype=object
) as a multilabel format and theirnp.vectorize
d processing.This makes it easy to support multilabel data with string (or non-consecutive / negative-inclusive) labels. In particular,
bincount
and related operations that can be used on LabelEncoded multiclass targets can also be used with multilabel data.Also added inheritance of
LabelBinarizer
fromLabelEncoder
.