8000 ENH support multilabel targets in LabelEncoder by jnothman · Pull Request #1987 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH support multilabel targets in LabelEncoder #1987

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jnothman
Copy link
Member

Also, support 1d-array of arrays (with the outer array having dtype=object) as a multilabel format and their np.vectorized processing.

This makes it easy to support multilabel data with string (or non-consecutive / negative-inclusive) labels. In particular, bincount and related operations that can be used on LabelEncoded multiclass targets can also be used with multilabel data.

Also added inheritance of LabelBinarizer from LabelEncoder.

Also, support 1d-array of sequences as a multilabel format
This allows LabelBinarizer to take advantage of LabelEncoder efficiency, and for both to share future features.
@jnothman
Copy link
Member Author

I've edited LabelBinarizer to have LabelEncoder as a parent class. This reduces duplicate implementation of class-to-index stuff, which means adding a classes parameter to LabelBinarizer affects both (c.f. #1643).

@jnothman
Copy link
Member Author

(But apparently passing test_processing doesn't mean the changes pass all tests...)

for i, (first_el, second_el) in enumerate(zip(first, second)):
assert_array_equal(len(first_el), len(second_el),
'In sequence of sequence element {}'
'{}'.format(i, err_msg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you mean

        assert_array_equal(first_el, second_el,
                           'In sequence of sequence element {}'
                           '{}'.format(i, err_msg))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're quite right. Fixed.

@jnothman
Copy link
Member Author

This was PRed prematurely without tests, doc updates, etc.

It is now ready for review.

I can remove the inheritance of LabelBinarizer from LabelEncoder if that is deemed wise.

@arjoly
Copy link
Member
arjoly commented May 28, 2013

Why have you chosen to extend LabelEncoder instead of creating a base abstract classe?

@jnothman
Copy link
Member Author

Why have you chosen to extend LabelEncoder instead of creating a base abstract classe?

That's a good reason for me to take this bit out of this PR. One reason is a "label binarizer" IS a label encoder. Another reason is that there would be nothing in LabelEncoder if there were an ABC. But I guess that happens elsewhere in scikit-learn.

This reverts commit 8669605.

Conflicts:

	sklearn/preprocessing.py
	sklearn/tests/test_preprocessing.py
@jnothman
Copy link
Member Author

Okay. That's been removed from this PR. We'll talk about it later ;)

@@ -97,6 +98,25 @@ def assert_raise_message(exception, message, function, *args, **kwargs):
assert_in(message, error_message)


def assert_sequences_equal(first, second, err_msg=''):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of this new function compare to assert_equal?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do without and use only arrays with dtype=object? That way we make sure in the tests that we are manipulating arrays, and not lists.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have discussed arrays of dtype object below. assert_equal won't work if the outer or inner sequence is an array.

@GaelVaroquaux
Copy link
Member

General remark about this PR: can we stick to arrays rather than lists?

@jnothman
Copy link
Member Author

General remark about this PR: can we stick to arrays rather than lists?

Please excuse me for not yet reading the detailed comments here. I think this and related issues are more important.

In short: I wish. I had originally implemented this etc. as producing an array of arrays. But it turned out they were explicitly not supported by is_multilabel. (In the end, I modify is_multilabel anyway to accept sequences of arrays, and produce a list of arrays because vectorize was only more obfuscated than map.) One problem with arrays in this context is that if we happen to have a sequence of sequences that are all the same length, calling np.asarray will transform them into a 2d array, not a 1d array of objects, and dtype=object doesn't help. You have to go: y_out = np.zeros(len(y), dtype=object); y_out[:] = y. Even though we may provide a helper function (jnothman@96dbd77#L1R141), this behaviour potentially throws up pitfalls for users, so I think it's better to explicitly work with lists so users don't think it's safe to make their own arrays from an arbitrary sequence of sequences.

However, the problem is bigger than this. Firstly on the matter of arrays / not arrays: currently, we are making a distinction between arrays and array-likes in order to support multilabel target types. For example, to not be confused for a sequence of sequences, a label indicator matrix or multilabel-multioutputs must be an array. A sequence of sequences until now could not be constituted of arrays. I dislike this very much, but have been writing code to support it, naively assuming it was agreed on (somehow, some day, somewhere).

However, the problem is bigger than this. Supporting the sequence of sequences multilabel type makes a mess. On top of the data type problem above, this is for three reasons:

  • multilabel support is in its early days; aren't we better off just supporting one (dense) format until we're happy with that? how many classes (and samples; at what sparsity) do we need to make s-o-s a substantially more efficient choice than a label indicator matrix?
  • everything supporting multilabel targets is written in duplicate and it's a pain to manage.
  • they are essentially just sparse lil matrices with all(data == 1). We are reimplementing the wheel, without scipy.sparse's tested, ?efficient, data structure-abstracted (lil is often the wrong choice for our operations) implementations of sum, transpose, etc. (And lil encapsulates the handling of array(dtype=object) to avoid those pitfalls). One problem with using scipy.sparse now is its poor support for boolean operations (support sparse boolean matrices and logical operations (Trac #639) scipy/scipy#1166). I imagine that for metrics we can get by with sum, __and__ and __xor__; we would need to implement the last two ourselves(). Nonetheless, like sparse support for X, I think we should only *gradually support sparse multilabel y.

In conclusion, I propose (@arjoly) that we abandon support for sequence of sequences-style multilabel data (except perhaps for import/export), and when we do choose to support it, to use scipy.sparse (**).


Footnotes:

(*) sparse __and__ and __xor__ (assuming all data is 1) can be implemented easily using set operations over coo matrices:

def sparse_and(y1, y2):
    y1 = y1.tocoo()
    y2 = y2.tocoo()
    y = y1.copy()
    y.rows, y.cols = zip(*set(zip(y1.rows, y1.cols)) & set(zip(y2.rows, y2.cols)))
    y.data = np.ones(y.cols.shape)
    return y

or arithmetically in csc/r:

def sparse_and(y1. y2):
    y1 = y1.tocsr()
    y2 = y2.tocsr()
    y = (y1 + y2)
    y.data[y.data < 2] = 0
    y.eliminate_zeros()
    return y

Again this illustrates that lil is often not our best choice.
(**) with the annoyance that data filled with 1 or True will be stored unnecessarily.

@jnothman
Copy link
Member Author

can we stick to arrays rather than lists?

if we happen to have a sequence of sequences that are all the same length, calling np.asarray will transform them into a 2d array, not a 1d array of objects

I should note that one way to keep a sparse multilabel format is to explicitly make them arrays of sets (which is what they should have been all along):

>>> numpy.asarray([set([1])])
array([set([1])], dtype=object)

@jnothman
Copy link
Member Author

[And if we go for supporting an array of sets, it should be wrapped by an object that supports the methods we care about from the label indicator matrix, essentially reimplementing a special case of scipy.sparse:

Very, very roughly:

class aos_matrix(numpy.ndarray):
    """Represents a sparse boolean matrix as an array of sets
    """
    def __new__(self, sets, n_labels):
        # TODO

    @property
    def shape(self):
        return len(self._sets), self._n_labels

    ndim = 1

    @staticmethod
    def _vectorize(func, otypes='O'):
        vfunc = np.vectorize(func, otypes=otypes)
        def wrapper(*args):
            return vfunc(*[aos_matrix(arg) for arg in args])
        return wrapper

    __and__ = _vectorize(set.intersection)
    __or__ = _vectorize(set.union)
    __xor__ = _vectorize(set.symmetric_difference)
    sizes = _vectorize(len, otypes='i')

    def sum(self, axis=None):
        if axis is None:
            return self.sizes().sum()
        elif axis == 1:
            return self.sizes()
        elif axis == 0:
            res = np.zeros(self._n_labels)
            for s in self:
                res[s] += 1
        else:
            raise ValueError

Let polymorphism handle the underlying data type.
]

@GaelVaroquaux
Copy link
Member

In conclusion, I propose that we abandon support for sequence of
sequences-style multilabel data (except perhaps for import/export), and
when we do, to use scipy.sparse (**).

That seems reasonnable to me (my gut feeling was telling me the same
thing), however, let's think this over well.

First, who added it? Can this person tell us his/her thoughts?

Second, what are the usecases? Are we sure that we are answering them
reasonnably well? Will there be anything that stops working?

What will will be our replacement strategy? The indicator matrix seems
nice to me, as the output of the one-hot encoding is a special case of
the indicator matrix. This makes me like the pattern even more: we would
be killing two birds with one stone. For the user's sake, could we even
modify the OneHotEncoder to transform sequences-of-sequences to indicator
matrices (probably with an option, off by default, to accept varying-length
inputs)? Or should we add another utility function?

"Explicit is better than implicit". I prefer the idea of providing a
couple of transformation utilities than hidding the transformation all
over the scikit.

Last, should this be discussed on the mailing list? I don't know how long
the support of sequences-of-sequences has been in the codebase (I am
having difficulties following all the development).

@jnothman
Copy link
Member Author

For the user's sake, could we even modify the OneHotEncoder to transform sequences-of-sequences to indicator matrices

LabelBinarizer currently does this job (and it is basically the composition of LabelEncoder and OneHotEncoder, but isn't built that way).

First, who added it? Can this person tell us his/her thoughts? Second, what are the usecases? Are we sure that we are answering them reasonnably well? Will there be anything that stops working? What will will be our replacement strategy?

Admittedly, I did not realise how old the support for multilabel, as sequences of sequences, goes. It seems the first relevant commit is ecb869c, with work pursued on a branch by @mblondel. Yes, I think I will move this discussion to the mailing list (which is what I had intended this morning, but then you gave me reason to raise it here).

@arjoly
Copy link
Member
arjoly commented May 29, 2013

In conclusion, I propose (@arjoly) that we abandon support for sequence of sequences-style multilabel data (except perhaps for import/export), and when we do choose to support it, to use scipy.sparse (**).

I agree with this proposition.

@arjoly
Copy link
Member
arjoly commented Aug 1, 2013

@jnothman During the sprint, we agreed that the support of the sequence of sequences should disappear in the favor of the sparse matrix format with some good helpers for the user. Thanks to your argumentations which has truly enhanced the discussion!

There is a first warning on the improved multilabel page (thanks to @schwarty).

@jnothman
Copy link
Member Author
jnothman commented Aug 1, 2013

Good to hear, thanks. The sparse matrix option may be difficult before boolean ops are supported, i.e. the next release.

@jnothman
Copy link
Member Author
jnothman commented Aug 1, 2013

So I don't think this PR is needed.

@jnothman jnothman closed this Aug 1, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0