WIP: Classes support for all classifiers by erg · Pull Request #1304 · scikit-learn/scikit-learn · GitHub

WIP: Classes support for all classifiers #1304


Closed · wants to merge 11 commits

Conversation

@erg (Contributor) commented Oct 31, 2012

I'm taking a stab at decoupling the classes in y from the classes that the estimators can predict. I think this cleanup would allow for easier design of an ensemble class that could use arbitrary classifiers and combine their results through voting, averaging, stacking, etc.

See #1183

So far I've done naive_bayes, lda, trees, and forests. I tried to do qda and almost got it right, but there are issues with nan and inf. LabelBinarizer needs some testing as well.

I've gone with the convention of class variables set in __init__ having no trailing underscore and any other private variables using one. So self.classes gets set in __init__ and then self.classes_ gets set in fit. Maybe there needs to be a parameter method that resets self.classes_ to None so that classifiers are not in half-working states if the user sets classes after the fact?
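As a rough illustration of that convention, a hypothetical toy estimator (not code from this PR) might look like:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class ToyClassifier(BaseEstimator, ClassifierMixin):
    # classes (no underscore) is a plain __init__ parameter set by the user;
    # classes_ (trailing underscore) is derived during fit.
    def __init__(self, classes=None):
        self.classes = classes

    def fit(self, X, y):
        self.classes_ = (np.asarray(self.classes) if self.classes is not None
                         else np.unique(y))
        # ... actual fitting would go here ...
        return self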

Any help with the remaining classifiers, code cleanups, or suggestions would be appreciated.

@mblondel (Member) commented Nov 1, 2012

I think it may be simpler to add a classes option to predict_proba and decision_function (predict obviously is not affected). Basically one would just need to compare np.unique(y) and self.classes_, and add additional all-zero columns to the returned 2d-array if necessary.
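A minimal sketch of that padding idea (a hypothetical helper, not API from this PR; it assumes the full class list is sorted, as np.unique output is, and contains every fitted class):

import numpy as np

def pad_proba(proba, fitted_classes, all_classes):
    # Expand a (n_samples, n_fitted_classes) predict_proba result so it has
    # one column per entry of all_classes, with zero probability for classes
    # the estimator never saw during fit.
    all_classes = np.asarray(all_classes)
    padded = np.zeros((proba.shape[0], all_classes.shape[0]))
    cols = np.searchsorted(all_classes, fitted_classes)  # column of each fitted class
    padded[:, cols] = proba
    return padded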



def is_multilabel(y):
# the explicit check for ndarray is for forward compatibility; future
Review comment (Member):
Sorry, I don't understand this. Can you elaborate?

Reply (@erg, Contributor Author):

Just moved it out of preprocessing.py into base.py.

ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 593) def _is_label_indicator_matrix(y):
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 594)     return hasattr(y, "shape") and len(y.shape) == 2
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 595) 
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 596) 
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 597) def _is_multilabel(y):
7b060267 sklearn/preprocessing.py               (Lars Buitinck   2012-05-18 09:25:21 +0200 598)     # the explicit check for ndarray is for forward compatibility; future
7b060267 sklearn/preprocessing.py               (Lars Buitinck   2012-05-18 09:25:21 +0200 599)     # versions of Numpy might want to register ndarray as a Sequence
6d041623 sklearn/preprocessing.py               (Lars Buitinck   2012-05-18 12:28:30 +0200 600)     return not isinstance(y[0], np.ndarray) and isinstance(y[0], Sequence) \
855257f6 sklearn/preprocessing.py               (Lars Buitinck   2012-05-24 16:08:07 +0200 601)        and not isinstance(y[0], basestring) \
7b060267 sklearn/preprocessing.py               (Lars Buitinck   2012-05-18 09:25:21 +0200 602)         or _is_label_indicator_matrix(y)

@amueller (Member) commented Nov 1, 2012

I am not really convinced this change is necessary.
I can see the point in the online setting and in bagging. But why not deal with it there?
We have the label encoder, which can do exactly what this PR does in the mixin. I feel this adds a lot of complexity and magic without really adding much benefit.

@amueller (Member) commented Nov 1, 2012

Thinking about it, I'm not entirely sure LabelEncoder can do what I just claimed. Still, I think it is not a good idea to try to pack this logic into the base classes and try to discover what the user is doing.

@amueller (Member) commented Nov 1, 2012

Going back to the original discussion, it seems that custom bagging implementations were the motivation. Why don't we solve this by implementing a bagging meta estimator? @glouppe started that at some point, I think. I would guess that it should not be too hard.

@agramfort (Member):

> Going back to the original discussion, it seems that custom bagging implementations were the motivation. Why don't we solve this by implementing a bagging meta estimator? @glouppe started that at some point, I think. I would guess that it should not be too hard.

+1

@erg (Contributor Author) commented Nov 1, 2012

My original motivation was making ensembles of classifiers and combining their predict_proba outputs in different ways without having to mess around with the sparse shape when some classes are not in y. It seems like a cleaner API when you can tell the classifier what to expect and get zeros as the probabilities for classes that are missing from the dataset. With just predict, setting classes beforehand is pointless.

@erg (Contributor Author) commented Nov 1, 2012

Yes, LabelEncoder should take a classes= parameter too.

@amueller (Member) commented Nov 2, 2012

I'd like to hear others' opinions before moving this forward.

After a longish discussion on the score_func front, I am -1 on @mblondel's suggestion, though I am not too much of a fan of adding an additional __init__ parameter to all classifiers - I don't know an alternative though.

Looking at the code again, the stuff that seemed too magic to me was there before.
Refactoring some of the class logic seems desirable. I'll have to have a closer look though.

I do see your point, though, and think it would be nice if we could be sure of the shape of the decision function.
I just don't want to clutter the interface too much or have too much magic around ;)

@mblondel (Member) commented Nov 2, 2012

Well, personally I'm even -1 on dealing with this problem at all. It can easily be dealt with by a small utility function on the user side, and ensemble methods should in my opinion do rejection sampling to ensure that all classes are always represented.

@GaelVaroquaux (Member):

> Still, I think it is not a good idea to try to pack this logic into the base classes and try to discover what the user is doing.

+1. I am worried about the added complexity both in the code base and in
the behavior of the toolkit.

@GaelVaroquaux (Member):

> My original motivation was making ensembles of classifiers and combining their predict_proba outputs in different ways without having to mess around with the sparse shape when some classes are not in y.

Is there a clear-cut (well-established and published, with many citations) scenario for such a combination? I can see that it is interesting to explore for research purposes, but to be a goal for the scikit, it needs to have a wide impact.

@GaelVaroquaux (Member):

> Yes, LabelEncoder should take a classes= parameter too.

I haven't thought this over a lot, but I guess that I would be +1 here.

@GaelVaroquaux (Member):

> Well, personally I'm even -1 on dealing with this problem at all. It can easily be dealt with by a small utility function on the user side, and ensemble methods should in my opinion do rejection sampling to ensure that all classes are always represented.

Same feeling here.

@ogrisel (Member) commented Nov 10, 2012

> My original motivation was making ensembles of classifiers and combining their predict_proba outputs in different ways without having to mess around with the sparse shape when some classes are not in y.
>
> Is there a clear-cut (well-established and published, with many citations) scenario for such a combination? I can see that it is interesting to explore for research purposes, but to be a goal for the scikit, it needs to have a wide impact.

AFAIK this is a standard way to implement stacking ensemble methods.

For instance: slide 10 of http://www.ukkdd2007.org/slides/UKKDD-2007-Niall-talk.pdf

Edit: this is also implemented in Orange for the StackingClassifier: http://orange.biolab.si/doc/reference/Orange.ensemble/#stacking (the example is interesting).

Such stacked models are very important for winning ML contests such as Netflix, Kaggle and the like, and can also be very useful in industry to gain some percentage points in predictive accuracy, even though they are often less interesting from a pure machine learning research point of view.

@amueller (Member):

From my perspective there was no doubt that this is helpful for building ensembles.
I think the question is: do we want that in the general estimator interface or do we want to pack this into some ensemble class (or somewhere else)? I.e., is the gain worth the somewhat more complex code?

@GaelVaroquaux (Member):

> From my perspective there was no doubt that this is helpful for building ensembles.
> I think the question is: do we want that in the general estimator interface or do we want to pack this into some ensemble class (or somewhere else)? I.e., is the gain worth the somewhat more complex code?

I agree. I didn't mean to say that it was completely useless, but I meant to ask whether it was worth adding global complexity to the scikit.

@ogrisel (Member) commented Nov 10, 2012

> From my perspective there was no doubt that this is helpful for building ensembles.
> I think the question is: do we want that in the general estimator interface or do we want to pack this into some ensemble class (or somewhere else)? I.e., is the gain worth the somewhat more complex code?

I am not sure either. Moving the complexity to some ensemble base class would sound reasonable to me. Maybe @amueller's FeatureUnion class could be a good candidate to demonstrate the usefulness of this class identifier normalization, by extending it with options to use predict_proba / decision_function instead of just transform, as is currently the case.

Having usage examples is critical to make good decisions on such design issues.

@amueller (Member):

On 11/10/2012 03:57 PM, Olivier Grisel wrote:

> Having usage examples is critical to make good decisions on such design issues.

+1

@mrjbq7 (Contributor) commented Nov 29, 2012

This would be a useful feature to have, particularly when combining classifiers that have been trained on data sets where they share common labels (e.g., 0, 1, 2) but each might only have seen a subset of the labels (e.g., {0, 1} or {0, 2}).

@mrjbq7 (Contributor) commented Nov 29, 2012

(Specifically, predict_proba() for each classifier in the example above would produce a two-column array, but the first classifier would label the columns [0, 1] and the second [0, 2], so to combine them you would need to know that, or reshape both to [0, 1, 2] and combine more easily.)

@amueller (Member):

@mrjbq7 I think you could achieve that pretty easily using LabelBinarizer. If this is not the case, I would rather extend LabelBinarizer than modify all classifiers.

@erg (Contributor Author) commented Nov 29, 2012

I don't see how LabelBinarizer helps. How would LabelBinarizer make the shape of clf2.predict_proba() and clf3.predict_proba() the same as clf1.predict_proba()? (Please write the code if you have time, should be three lines max if it were to work?)

As for making the code harder to understand, the complexity introduced by having a classes= parameter in init for the trees did not make them any harder to follow. For LDA and QDA the issue is that when some labels are not present in the y passed to fit, they produce a zero somewhere and calling log on that zero blows up. I did a fix for LDA and was close for QDA. What you gain is code in a base class that can deal with unique-ing labels and getting an inverse label transform, rather than duplicating this code all over. Also, it handles multi-output labels (for DecisionTreeClassifier), which takes quite a bit of code--if other classifiers were to support this, you'd want to avoid duplicating it.

I think having the classes= as a keyword arg in fit is the wrong approach. If you set it in init as classes, then fit can use that in setting classes_.

You end up with a much cleaner way of combining probabilities in ensembles, random sampling problems, cross-validation, boosting, bagging--anything where you are not guaranteed that all of the data makes it into each classifier.

from sklearn.tree import DecisionTreeClassifier
clf1 = DecisionTreeClassifier()
clf2 = DecisionTreeClassifier()
clf3 = DecisionTreeClassifier()

clf1.fit([[1],[2],[3], [5],[6],[7], [10],[11],[12]], [0,0,0, 1,1,1, 2,2,2])
clf2.fit([[1],[2],[3], [5],[6],[7]], [0,0,0, 1,1,1])
clf3.fit([[1],[2],[3], [10],[11],[12]], [0,0,0, 2,2,2])

clf1.predict([[1], [5], [10]])
clf2.predict([[1], [5]])
clf3.predict([[1], [10]])
clf1.classes_
clf2.classes_
clf3.classes_
clf1.predict_proba([[1], [5], [10]])
clf2.predict_proba([[1], [5]])
clf3.predict_proba([[1], [10]])



Output in ipython2:

In [75]: clf1.predict([[1], [5], [10]])
Out[75]: array([ 0.,  1.,  2.])

In [76]: clf2.predict([[1], [5]])
Out[76]: array([ 0.,  1.])

In [77]: clf3.predict([[1], [10]])
Out[77]: array([ 0.,  2.])

In [78]: clf1.classes_
Out[78]: [array([0, 1, 2])]

In [79]: clf2.classes_
Out[79]: [array([0, 1])]

In [80]: clf3.classes_
Out[80]: [array([0, 2])]

In [81]: clf1.predict_proba([[1], [5], [10]])
Out[81]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [82]: clf2.predict_proba([[1], [5]])
Out[82]:
array([[ 1.,  0.],
       [ 0.,  1.]])

In [83]: clf3.predict_proba([[1], [10]])
Out[83]:
array([[ 1.,  0.],
       [ 0.,  1.]])

@amueller (Member):

@erg I will try to come up with it tomorrow.

> What you gain is code in a base class that can deal with unique-ing labels and getting an inverse label transform, rather than duplicating this code all over. Also, it handles multi-output labels (for DecisionTreeClassifier), which takes quite a bit of code--if other classifiers were to support this, you'd want to avoid duplicating it.

I really don't see how this is connected to the classes init parameter. As I keep repeating, the unique step is implemented nearly everywhere. Could you explain how an init parameter makes this simpler? You keep bringing this up and I keep saying that we already have that feature, and I don't see where this is going.

Also: how does the init parameter help handle multi-output labels? I don't see a connection there, either.

@ogrisel (Member) commented Nov 29, 2012

When you do subsampling for bagging models (and maybe stacking too) for instance, the sub-datasets might not have the whole set of classes occurring in them. But you would still like to have the same shape in the predict_proba output for each individual estimator trained on those subsampled training sets, to make it easier to combine those predicted outputs in the stacking or bagging meta estimator.

Also I think having the ability to fix the expected classes in the __init__ would be useful (necessary?) for online models (e.g. classifiers with a partial_fit method).

@amueller (Member):

@ogrisel was this in response to my questions?

I agree with what you are saying, but I'm not sure this PR is the right solution.

@erg (Contributor Author) commented Nov 29, 2012

> I really don't see how this is connected to the classes init parameter. As I keep repeating, the unique step is implemented nearly everywhere. Could you explain how an init parameter makes this simpler? You keep bringing this up and I keep saying that we already have that feature, and I don't see where this is going.

You are right that we already have the unique stuff and that we have to do it anyway to make sure that you're not trying to fit for labels that you aren't expecting (for input validation). The multi-output labels unique code is hairier than a one-liner, and having this in a base class would be useful if other classifiers start supporting it.

The classes init parameter is just to make sure the shape of predict_proba is correct. From this, you can do things like make an Ensemble class and pass it a bunch of classifiers. Then when you call fit it takes the classes parameter from init, clones each classifier with the classes param, and then calls fit on each of them however you want. In predict_proba you now have classifiers that are all looking for the same thing, regardless of what subset of data they were fit on. You can now combine probabilities with voting, averaging, maximum, any way you want, or you can use this Ensemble as part of another Ensemble. The one piece that's missing is that the shape of the predicted probabilities has to line up somehow, whether it's a classes parameter or some kind of preprocessing/postprocessing transformer. To go through all of the training and prediction just to get a naked ndarray of the wrong shape that lost all its column labels and can't be combined with others is disheartening and hard to work with.

Basically my idea of using a classifier is that you create it and tell it what to look for, train it, and then have it classify things for you. scikit has conflated training a classifier with telling it what to look for, and it irks me. :)

A very simple solution that I'm going to use as a workaround is to make a function reshape_proba(clf, desired_labels, proba) that will grab the clf.classes_ and reshape it to what I need. So now all of my clf.predict_proba lines can have this extra step and I'll be happy.
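For illustration, such a helper and the kind of combination it enables might look roughly like this (hypothetical code following the signature described above, not part of this PR):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def reshape_proba(clf, desired_labels, proba):
    # Line up proba's columns with desired_labels, inserting zero-probability
    # columns for labels clf never saw during fit.  np.ravel copes with
    # classes_ being either an array or a list wrapping an array.
    desired_labels = np.asarray(desired_labels)
    full = np.zeros((proba.shape[0], desired_labels.shape[0]))
    for i, label in enumerate(np.ravel(clf.classes_)):
        full[:, desired_labels == label] = proba[:, i:i + 1]
    return full

clf2 = DecisionTreeClassifier().fit([[1], [2], [5], [6]], [0, 0, 1, 1])
clf3 = DecisionTreeClassifier().fit([[1], [2], [10], [11]], [0, 0, 2, 2])
X_test = [[1], [5], [10]]
p2 = reshape_proba(clf2, [0, 1, 2], clf2.predict_proba(X_test))
p3 = reshape_proba(clf3, [0, 1, 2], clf3.predict_proba(X_test))
averaged = (p2 + p3) / 2.0   # both arrays are now (3, 3), so they combine directly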

@erg (Contributor Author) commented Nov 29, 2012

@ogrisel Agreed, for online learning something like this would be mandatory.

@amueller (Member):

For multi-output: as far as I can tell, it is just not possible for any non-tree method except nearest neighbors. This is why I wouldn't want this in a base class.

Don't get me wrong, I'm not totally opposed to the idea.
Your PR just seems overly complex to me. Having a lightweight method for reshaping the probabilities seems much better.

Not sure what the others think, though.


@GaelVaroquaux (Member):

> Basically my idea of using a classifier is that you create it and tell it what to look for, train it, and then have it classify things for you. scikit has conflated training with telling it what to look for, and it irks me. :)

I disagree with this statement: the scikit makes it easy to do easy
things.

That doesn't mean that I am against having the option of specifying classes explicitly. We just need to weigh the costs and benefits.

@GaelVaroquaux (Member):

> Your PR just seems overly complex to me. Having a lightweight method for reshaping the probabilities seems much better.
>
> Not sure what the others think, though.

Haven't looked at the PR yet.

G

@amueller (Member):

@erg You are right, it is not possible to implement this with LabelBinarizer currently.

@amueller closed this Nov 30, 2012
@amueller reopened this Nov 30, 2012
@amueller (Member):

Sorry, misclick!

@amueller (Member):

OK, short (not that short) summary of my thoughts and what I feel is the result of the discussion so far:

  • At the moment it is hard to guarantee a certain shape for predict_proba and decision_function.
  • This makes ensemble learning in particular cumbersome.
  • This problem has to be addressed for online learning anyway.

I see two possible solutions:

  • specifying classes in init as done in this PR
  • providing a utility function and dealing with the online things in the online classifiers.

I am concerned that adding an init parameter for this specific use-case is overkill. Also, I think this PR does way too many things.

@erg would you be happy with a utility function / estimator that would solve the problem of getting consistent predict_proba shapes?

@erg (Contributor Author) commented Nov 30, 2012

I wrote a utility function to reshape all the proba arrays. So I don't really care anymore, except to maybe contribute something to scikit if stuff works well. I agree that it's pretty hard to redo each classifier, so the utility function is a better approach.

I'm also experimenting with Estimator and Ensemble classes that take a classes= parameter. The Estimator class currently fixes up the proba result and also uses a modified LabelEncoder to transform labels so you can use arbitrary labels with any classifier. It also handles the classes_ variable itself so you don't have to do it for each machine learning algorithm.

The Ensemble class just calls fit on a list of estimators and returns all of the results from predict and predict_proba. Then I subclassed it and have another ensemble that can do voting/averaging for predict/predict_proba. One thing I like about this is you can pass in a bunch of estimators built with different parameters rather than cloning the same one.

I also wrote something to resample a dataset with replacement. You pass its transform function an X, y, a scale or n parameter, and a proba which can be None for resampling with the same label distribution, 'balanced' for balancing all of the labels, or a dict of class/probability pairs for whatever you want. The scale parameter is a float that will downscale/upscale your number of samples, so 0.8 would be 80% and 2.5 would be more than twice the data. The n just returns that many samples, and you can only use one or the other. This is unrelated to predict_proba, but I don't really want to make a separate post about it atm.
Edit: It's just a function now, not a transformer, since there's no concept of fit or inverse_transform.
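Roughly, such a resampling helper might look like the sketch below (hypothetical names and signature inferred from the description above; not the actual code from this work):

import numpy as np

def resample(X, y, scale=None, n=None, proba=None, random_state=None):
    # proba=None keeps the original label distribution, proba='balanced'
    # samples all classes equally, and a {label: probability} dict gives an
    # arbitrary target distribution.  Exactly one of scale or n must be set.
    X, y = np.asarray(X), np.asarray(y)
    if (scale is None) == (n is None):
        raise ValueError("pass exactly one of scale or n")
    n_samples = n if n is not None else int(scale * len(y))
    rng = np.random.RandomState(random_state)

    if proba is None:
        weights = np.ones(len(y)) / len(y)            # keep the label distribution
    else:
        classes = np.unique(y)
        if proba == 'balanced':
            class_p = {c: 1.0 / len(classes) for c in classes}
        else:
            class_p = proba                           # user-supplied dict
        counts = {c: np.sum(y == c) for c in classes}
        weights = np.array([class_p[label] / counts[label] for label in y])
        weights /= weights.sum()

    idx = rng.choice(len(y), size=n_samples, replace=True, p=weights)
    return X[idx], y[idx]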

So finally I combined it all with a BaggedEnsemble, which takes a list of estimators and a scale, n, etc., and calls fit on each estimator with resampled data. Since it inherits from the ensemble that does voting/averaging, it's just a fit function and a constructor, and turns out to be about 15 lines long.

I feel like these classes are a good start to ensemble learning and they fix most of my API gripes with scikit. We'd probably want to fix LabelBinarizer to be like LabelEncoder. I could contribute them if you want, and maybe someone could implement more ensemble techniques on top of this. I would also need to clean it up a little and write more docs and tests, and then get it past the PR review committee.

@amueller (Member) commented Dec 1, 2012

So... what is the plan now? Close this PR and start a new one on ensembles? I'm not sure what you mean by "LabelBinarizer should be like LabelEncoder".

@erg (Contributor Author) commented Dec 6, 2012

In the spirit of closing issues, I'll close this and submit a new pull request when I figure out exactly what to contribute.

@erg closed this Dec 6, 2012
@amueller (Member) commented Dec 6, 2012

@erg Sorry this one didn't work out.

@glouppe (Contributor) commented Dec 6, 2012

I did not take part in the discussion, but I wanted to add that I really share @erg's long-term goal. Having an Ensemble class that could take arbitrary estimators would be really, really great in my opinion. At the time, I remember that I closed my branch on Bagging precisely because of the huge inconsistencies between our estimators in terms of interface.

@amueller (Member) commented Dec 6, 2012

@glouppe I think the interfaces have become much more consistent since your PR. I also think having this would be great.
