WIP: Classes support for all classifiers by erg · Pull Request #1304 · scikit-learn/scikit-learn · GitHub

WIP: Classes support for all classifiers #1304


Closed · wants to merge 11 commits

Conversation

@erg (Contributor) commented Oct 31, 2012

I'm taking a stab at decoupling the classes in y from the classes that the estimators can predict. I think this cleanup would allow for easier design of an ensemble class that could use arbitrary classifiers and combine their results through voting, averaging, stacking, etc.

See #1183

So far I've done naive_bayes, lda, trees, and forests. I tried to do qda and almost got it right, but there are issues with nan and inf. LabelBinarizer needs some testing as well.

I've gone with the convention of class variables set in __init__ having no trailing underscore and any other private variables using one. So self.classes gets set in __init__ and then self.classes_ gets set in fit. Maybe there needs to be a parameter method that resets self.classes_ to None so that classifiers are not in half-working states if the user sets classes after the fact?
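As a rough illustration of that convention, a hypothetical toy estimator (not code from this PR) might look like:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class ToyClassifier(BaseEstimator, ClassifierMixin):
    # classes (no underscore) is a plain __init__ parameter set by the user;
    # classes_ (trailing underscore) is derived during fit.
    def __init__(self, classes=None):
        self.classes = classes

    def fit(self, X, y):
        self.classes_ = (np.asarray(self.classes) if self.classes is not None
                         else np.unique(y))
        # ... actual fitting would go here ...
        return self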

Any help with the remaining classifiers, code cleanups, or suggestions would be appreciated.

@mblondel (Member) commented Nov 1, 2012

I think it may be simpler to add a classes option to predict_proba and decision_function (predict obviously is not affected). Basically one would just need to compare np.unique(y) and self.classes_, and add additional all-zero columns to the returned 2d-array if necessary.
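A minimal sketch of that padding idea (a hypothetical helper, not API from this PR; it assumes the full class list is sorted, as np.unique output is, and contains every fitted class):

import numpy as np

def pad_proba(proba, fitted_classes, all_classes):
    # Expand a (n_samples, n_fitted_classes) predict_proba result so it has
    # one column per entry of all_classes, with zero probability for classes
    # the estimator never saw during fit.
    all_classes = np.asarray(all_classes)
    padded = np.zeros((proba.shape[0], all_classes.shape[0]))
    cols = np.searchsorted(all_classes, fitted_classes)  # column of each fitted class
    padded[:, cols] = proba
    return padded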



def is_multilabel(y):
# the explicit check for ndarray is for forward compatibility; future
Review comment (Member):
Sorry, I don't understand this. Can you elaborate?

Reply (@erg, Contributor Author):

Just moved it out of preprocessing.py into base.py.

ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 593) def _is_label_indicator_matrix(y):
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 594)     return hasattr(y, "shape") and len(y.shape) == 2
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 595) 
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 596) 
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 597) def _is_multilabel(y):
7b060267 sklearn/preprocessing.py               (Lars Buitinck   2012-05-18 09:25:21 +0200 598)     # the explicit check for ndarray is for forward compatibility; future
7b060267 sklearn/preprocessing.py               (Lars Buitinck   2012-05-18 09:25:21 +0200 599)     # versions of Numpy might want to register ndarray as a Sequence
6d041623 sklearn/preprocessing.py               (Lars Buitinck   2012-05-18 12:28:30 +0200 600)     return not isinstance(y[0], np.ndarray) and isinstance(y[0], Sequence) \
855257f6 sklearn/preprocessing.py               (Lars Buitinck   2012-05-24 16:08:07 +0200 601)        and not isinstance(y[0], basestring) \
7b060267 sklearn/preprocessing.py               (Lars Buitinck   2012-05-18 09:25:21 +0200 602)         or _is_label_indicator_matrix(y)

@amueller (Member) commented Nov 1, 2012

I am not really convinced this change is necessary.
I can see the point in the online setting and in bagging. But why not deal with it there?
We have the label encoder, which can do exactly what this PR does in the mixin. I feel this adds a lot of complexity and magic without really adding much benefit.

@amueller (Member) commented Nov 1, 2012

Thinking about it, I'm not entirely sure LabelEncoder can do what I just claimed. Still, I think it is not a good idea to try to pack this logic into the base classes and try to discover what the user is doing.

@amueller (Member) commented Nov 1, 2012

Going back to the original discussion, it seems that custom bagging implementations were the motivation. Why don't we solve this by implementing a bagging meta estimator? @glouppe started that at some point, I think. I would guess that it should not be too hard.

@agramfort (Member):

> Going back to the original discussion, it seems that custom bagging implementations were the motivation. Why don't we solve this by implementing a bagging meta estimator? @glouppe started that at some point, I think. I would guess that it should not be too hard.

+1

@erg (Contributor Author) commented Nov 1, 2012

My original motivation was making ensembles of classifiers and combining their predict_proba outputs in different ways without having to mess around with the sparse shape when some classes are not in y. It seems like a cleaner API when you can tell the classifier what to expect and get zeros as the probabilities for classes that are missing from the dataset. With just predict, setting classes beforehand is pointless.

@erg (Contributor Author) commented Nov 1, 2012

Yes, LabelEncoder should take a classes= parameter too.

@amueller (Member) commented Nov 2, 2012

I'd like to hear others' opinions before moving this forward.

After a longish discussion on the score_func front, I am -1 on @mblondel's suggestion, though I am not too much of a fan of adding an additional __init__ parameter to all classifiers - I don't know an alternative though.

Looking at the code again, the stuff that seemed too magic to me was there before.
Refactoring some of the class logic seems desirable. I'll have to have a closer look though.

I do see your point, though, and think it would be nice if we could be sure of the shape of the decision function.
I just don't want to clutter the interface too much or have too much magic around ;)

@mblondel (Member) commented Nov 2, 2012

Well, personally I'm even -1 on dealing with this problem at all. It can easily be dealt with by a small utility function on the user side, and ensemble methods should in my opinion do rejection sampling to ensure that all classes are always represented.

@GaelVaroquaux (Member):

> Still, I think it is not a good idea to try to pack this logic into the base classes and try to discover what the user is doing.

+1. I am worried about the added complexity both in the code base and in
the behavior of the toolkit.

@GaelVaroquaux (Member):

> My original motivation was making ensembles of classifiers and combining their predict_proba outputs in different ways without having to mess around with the sparse shape when some classes are not in y.

Is there a clear-cut (well-established and published, with many citations) scenario for such a combination? I can see that it is interesting to explore for research purposes, but to be a goal for the scikit, it needs to have a wide impact.

@GaelVaroquaux (Member):

> Yes, LabelEncoder should take a classes= parameter too.

I haven't thought this over a lot, but I guess that I would be +1 here.

@GaelVaroquaux (Member):

> Well, personally I'm even -1 on dealing with this problem at all. It can easily be dealt with by a small utility function on the user side, and ensemble methods should in my opinion do rejection sampling to ensure that all classes are always represented.

Same feeling here.

@ogrisel (Member) commented Nov 10, 2012

> My original motivation was making ensembles of classifiers and combining their predict_proba outputs in different ways without having to mess around with the sparse shape when some classes are not in y.
>
> Is there a clear-cut (well-established and published, with many citations) scenario for such a combination? I can see that it is interesting to explore for research purposes, but to be a goal for the scikit, it needs to have a wide impact.

AFAIK this is a standard way to implement stacking ensemble methods.

For instance: slide 10 of http://www.ukkdd2007.org/slides/UKKDD-2007-Niall-talk.pdf

Edit: this is also implemented in Orange for the StackingClassifier: http://orange.biolab.si/doc/reference/Orange.ensemble/#stacking (the example is interesting).

Such stacked models are very important for winning ML contests such as Netflix, Kaggle and the like, and can also be very useful in industry to gain some percentage points in predictive accuracy, even though they are often less interesting from a pure machine learning research point of view.

@amueller (Member):

From my perspective there was no doubt that this is helpful for building ensembles.
I think the question is: do we want that in the general estimator interface or do we want to pack this into some ensemble class (or somewhere else)? I.e., is the gain worth the somewhat more complex code?

@GaelVaroquaux (Member):

> From my perspective there was no doubt that this is helpful for building ensembles.
> I think the question is: do we want that in the general estimator interface or do we want to pack this into some ensemble class (or somewhere else)? I.e., is the gain worth the somewhat more complex code?

I agree. I didn't mean to say that it was completely useless, but I meant to ask whether it was worth adding global complexity to the scikit.

@ogrisel (Member) commented Nov 10, 2012

> From my perspective there was no doubt that this is helpful for building ensembles.
> I think the question is: do we want that in the general estimator interface or do we want to pack this into some ensemble class (or somewhere else)? I.e., is the gain worth the somewhat more complex code?

I am not sure either. Moving the complexity to some ensemble base class would sound reasonable to me. Maybe @amueller's FeatureUnion class could be a good candidate to demonstrate the usefulness of this class identifier normalization, by extending it with options to use predict_proba / decision_function instead of just transform, as is currently the case.

Having usage examples is critical to make good decisions on such design issues.

@amueller (Member):

On 11/10/2012 03:57 PM, Olivier Grisel wrote:

> Having usage examples is critical to make good decisions on such design issues.

+1

@mrjbq7 (Contributor) commented Nov 29, 2012

This would be a useful feature to have, particularly when combining classifiers that have been trained on data sets where they share common labels (e.g., 0, 1, 2) but each might only have seen a subset of the labels (e.g., {0, 1} or {0, 2}).

@mrjbq7 (Contributor) commented Nov 29, 2012

(Specifically, predict_proba() for each classifier in the example above would produce a two-column array, but the first classifier would label the columns [0, 1] and the second [0, 2], so to combine them you would need to know that, or reshape both to [0, 1, 2] and combine more easily.)

@amueller (Member):

@mrjbq7 I think you could achieve that pretty easily using LabelBinarizer. If this is not the case, I would rather extend LabelBinarizer than modify all classifiers.

@erg (Contributor Author) commented Nov 29, 2012

I don't see how LabelBinarizer helps. How would LabelBinarizer make the shape of clf2.predict_proba() and clf3.predict_proba() the same as clf1.predict_proba()? (Please write the code if you have time, should be three lines max if it were to work?)

As for making the code harder to understand, the complexity introduced by having a classes= parameter in init for the trees did not make them any harder to follow. For LDA and QDA the issue is that when some labels are not present in the y passed to fit, they produce a zero somewhere and calling log on that zero blows up. I did a fix for LDA and was close for QDA. What you gain is code in a base class that can deal with unique-ing labels and getting an inverse label transform, rather than duplicating this code all over. Also, it handles multi-output labels (for DecisionTreeClassifier), which takes quite a bit of code--if other classifiers were to support this, you'd want to avoid duplicating it.

I think having the classes= as a keyword arg in fit is the wrong approach. If you set it in init as classes, then fit can use that in setting classes_.

You end up with a much cleaner way of combining probabilities in ensembles, random sampling problems, cross-validation, boosting, bagging--anything where you are not guaranteed that all of the data makes it into each classifier.

from sklearn.tree import DecisionTreeClassifier
clf1 = DecisionTreeClassifier()
clf2 = DecisionTreeClassifier()
clf3 = DecisionTreeClassifier()

clf1.fit([[1],[2],[3], [5],[6],[7], [10],[11],[12]], [0,0,0, 1,1,1, 2,2,2])
clf2.fit([[1],[2],[3], [5],[6],[7]], [0,0,0, 1,1,1])
clf3.fit([[1],[2],[3], [10],[11],[12]], [0,0,0, 2,2,2])

clf1.predict([[1], [5], [10]])
clf2.predict([[1], [5]])
clf3.predict([[1], [10]])
clf1.classes_
clf2.classes_
clf3.classes_
clf1.predict_proba([[1], [5], [10]])
clf2.predict_proba([[1], [5]])
clf3.predict_proba([[1], [10]])



Output in ipython2:

In [75]: clf1.predict([[1], [5], [10]])
Out[75]: array([ 0.,  1.,  2.])

In [76]: clf2.predict([[1], [5]])
Out[76]: array([ 0.,  1.])

In [77]: clf3.predict([[1], [10]])
Out[77]: array([ 0.,  2.])

In [78]: clf1.classes_
Out[78]: [array([0, 1, 2])]

In [79]: clf2.classes_
Out[79]: [array([0, 1])]

In [80]: clf3.classes_
Out[80]: [array([0, 2])]

In [81]: clf1.predict_proba([[1], [5], [10]])
Out[81]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [82]: clf2.predict_proba([[1], [5]])
Out[82]:
array([[ 1.,  0.],
       [ 0.,  1.]])

In [83]: clf3.predict_proba([[1], [10]])
Out[83]:
array([[ 1.,  0.],
       [ 0.,  1.]])

@amueller (Member):

@erg I will try to come up with it tomorrow.

> What you gain is code in a base class that can deal with unique-ing labels and getting an inverse label transform, rather than duplicating this code all over. Also, it handles multi-output labels (for DecisionTreeClassifier), which takes quite a bit of code--if other classifiers were to support this, you'd want to avoid duplicating it.

I really don't see how this is connected to the classes init parameter. As I keep repeating, the unique step is implemented nearly everywhere. Could you explain how an init parameter makes this simpler? You keep bringing this up and I keep saying that we already have that feature, and I don't see where this is going.

Also: how does the init parameter help handle multi-output labels? I don't see a connection there, either.

@ogrisel (Member) commented Nov 29, 2012

When you do subsampling for bagging models (and maybe stacking too) for instance, the sub-datasets might not have the whole set of classes occurring in them. But you would still like to have the same shape in the predict_proba output for each individual estimator trained on those subsampled training sets, to make it easier to combine those predicted outputs in the stacking or bagging meta estimator.

Also I think having the ability to fix the expected classes in the __init__ would be useful (necessary?) for online models (e.g. classifiers with a partial_fit method).

@amueller (Member):

@ogrisel was this in response to my questions?

I agree with what you are saying, but I'm not sure this PR is the right solution.

@erg (Contributor Author) commented Nov 29, 2012

> I really don't see how this is connected to the classes init parameter. As I keep repeating, the unique step is implemented nearly everywhere. Could you explain how an init parameter makes this simpler? You keep bringing this up and I keep saying that we already have that feature, and I don't see where this is going.

You are right that we already have the unique stuff and that we have to do it anyway to make sure that you're not trying to fit for labels that you aren't expecting (for input validation). The multi-output labels unique code is hairier than a one-liner, and having this in a base class would be useful if other classifiers start supporting it.

The classes init parameter is just to make sure the shape of predict_proba is correct. From this, you can do things like make an Ensemble class and pass it a bunch of classifiers. Then when you call fit it takes the classes parameter from init, clones each classifier with the classes param, and then calls fit on each of them however you want. In predict_proba you now have classifiers that are all looking for the same thing, regardless of what subset of data they were fit on. You can now combine probabilities with voting, averaging, maximum, any way you want, or you can use this Ensemble as part of another Ensemble. The one piece that's missing is that the shape of the predicted probabilities has to line up somehow, whether it's a classes parameter or some kind of preprocessing/postprocessing transformer. To go through all of the training and prediction just to get a naked ndarray of the wrong shape that lost all its column labels and can't be combined with others is disheartening and hard to work with.

Basically my idea of using a classifier is that you create it and tell it what to look for, train it, and then have it classify things for you. scikit has conflated training a classifier with telling it what to look for, and it irks me. :)

A very simple solution that I'm going to use as a workaround is to make a function reshape_proba(clf, desired_labels, proba) that will grab the clf.classes_ and reshape it to what I need. So now all of my clf.predict_proba lines can have this extra step and I'll be happy.
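For illustration, such a helper and the kind of combination it enables might look roughly like this (hypothetical code following the signature described above, not part of this PR):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def reshape_proba(clf, desired_labels, proba):
    # Line up proba's columns with desired_labels, inserting zero-probability
    # columns for labels clf never saw during fit.  np.ravel copes with
    # classes_ being either an array or a list wrapping an array.
    desired_labels = np.asarray(desired_labels)
    full = np.zeros((proba.shape[0], desired_labels.shape[0]))
    for i, label in enumerate(np.ravel(clf.classes_)):
        full[:, desired_labels == label] = proba[:, i:i + 1]
    return full

clf2 = DecisionTreeClassifier().fit([[1], [2], [5], [6]], [0, 0, 1, 1])
clf3 = DecisionTreeClassifier().fit([[1], [2], [10], [11]], [0, 0, 2, 2])
X_test = [[1], [5], [10]]
p2 = reshape_proba(clf2, [0, 1, 2], clf2.predict_proba(X_test))
p3 = reshape_proba(clf3, [0, 1, 2], clf3.predict_proba(X_test))
averaged = (p2 + p3) / 2.0   # both arrays are now (3, 3), so they combine directly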

@erg (Contributor Author) commented Nov 29, 2012

@ogrisel Agreed, for online learning something like this would be mandatory.

@amueller (Member):

For multi-output: as far as I can tell, it is just not possible for any non-tree method except nearest neighbors. This is why I wouldn't want this in a base class.

Don't get me wrong, I'm not totally opposed to the idea.
Your PR just seems overly complex to me. Having a lightweight method for reshaping the probabilities seems much better.

Not sure what the others think, though.


@GaelVaroquaux (Member):

> Basically my idea of using a classifier is that you create it and tell it what to look for, train it, and then have it classify things for you. scikit has conflated training with telling it what to look for, and it irks me. :)

I disagree with this statement: the scikit makes it easy to do easy
things.

That doesn't mean that I am against having the option of specifying classes explicitly. We just need to weigh the costs and benefits.

@GaelVaroquaux (Member):

> Your PR just seems overly complex to me. Having a lightweight method for reshaping the probabilities seems much better.
>
> Not sure what the others think, though.

Haven't looked at the PR yet.

G

@amueller (Member):

@erg You are right, it is not possible to implement this with LabelBinarizer currently.

@amueller closed this Nov 30, 2012
@amueller reopened this Nov 30, 2012
@amueller (Member):

Sorry, misclick!

@amueller (Member):

OK, short (not that short) summary of my thoughts and what I feel is the result of the discussion so far:

  • At the moment it is hard to guarantee a certain shape for predict_proba and decision_function.
  • This makes ensemble learning in particular cumbersome.
  • This problem has to be addressed for online learning anyway.

I see two possible solutions:

  • specifying classes in init as done in this PR
  • providing a utility function and dealing with the online things in the online classifiers.

I am concerned that adding an init parameter for this specific use-case is overkill. Also, I think this PR does way too many things.

@erg would you be happy with a utility function / estimator that would solve the problem of getting consistent predict_proba shapes?

@erg (Contributor Author) commented Nov 30, 2012

I wrote a utility function to reshape all the proba arrays. So I don't really care anymore, except to maybe contribute something to scikit if stuff works well. I agree that it's pretty hard to redo each classifier, so the utility function is a better approach.

I'm also experimenting with Estimator and Ensemble classes that take a classes= parameter. The Estimator class currently fixes up the proba result and also uses a modified LabelEncoder to transform labels so you can use arbitrary labels with any classifier. It also handles the classes_ variable itself so you don't have to do it for each machine learning algorithm.

The Ensemble class just calls fit on a list of estimators and returns all of the results from predict and predict_proba. Then I subclassed it and have another ensemble that can do voting/averaging for predict/predict_proba. One thing I like about this is you can pass in a bunch of estimators built with different parameters rather than cloning the same one.

I also wrote something to resample a dataset with replacement. You pass its transform function an X, y, a scale or n parameter, and a proba which can be None for resampling with the same label distribution, 'balanced' for balancing all of the labels, or a dict of class/probability pairs for whatever you want. The scale parameter is a float that will downscale/upscale your number of samples, so 0.8 would be 80% and 2.5 would be more than twice the data. The n just returns that many samples, and you can only use one or the other. This is unrelated to predict_proba, but I don't really want to make a separate post about it atm.
Edit: It's just a function now, not a transformer, since there's no concept of fit or inverse_transform.
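Roughly, such a resampling helper might look like the sketch below (hypothetical names and signature inferred from the description above; not the actual code from this work):

import numpy as np

def resample(X, y, scale=None, n=None, proba=None, random_state=None):
    # proba=None keeps the original label distribution, proba='balanced'
    # samples all classes equally, and a {label: probability} dict gives an
    # arbitrary target distribution.  Exactly one of scale or n must be set.
    X, y = np.asarray(X), np.asarray(y)
    if (scale is None) == (n is None):
        raise ValueError("pass exactly one of scale or n")
    n_samples = n if n is not None else int(scale * len(y))
    rng = np.random.RandomState(random_state)

    if proba is None:
        weights = np.ones(len(y)) / len(y)            # keep the label distribution
    else:
        classes = np.unique(y)
        if proba == 'balanced':
            class_p = {c: 1.0 / len(classes) for c in classes}
        else:
            class_p = proba                           # user-supplied dict
        counts = {c: np.sum(y == c) for c in classes}
        weights = np.array([class_p[label] / counts[label] for label in y])
        weights /= weights.sum()

    idx = rng.choice(len(y), size=n_samples, replace=True, p=weights)
    return X[idx], y[idx]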

So finally I combined it all with a BaggedEnsemble, which takes a list of estimators and a scale, n, etc., and calls fit on each estimator with resampled data. Since it inherits from the ensemble that does voting/averaging, it's just a fit function and a constructor, and turns out to be about 15 lines long.

I feel like these classes are a good start to ensemble learning and they fix most of my API gripes with scikit. We'd probably want to fix LabelBinarizer to be like LabelEncoder. I could contribute them if you want, and maybe someone could implement more ensemble techniques on top of this. I would also need to clean it up a little and write more docs and tests, and then get it past the PR review committee.

@amueller (Member) commented Dec 1, 2012

So... what is the plan now? Close this PR and start a new one on ensembles? I'm not sure what you mean by "LabelBinarizer should be like LabelEncoder".

@erg (Contributor Author) commented Dec 6, 2012

In the spirit of closing issues, I'll close this and submit a new pull request when I figure out exactly what to contribute.

@erg closed this Dec 6, 2012
@amueller (Member) commented Dec 6, 2012

@erg Sorry this one didn't work out.

@glouppe (Contributor) commented Dec 6, 2012

I did not take part in the discussion, but I wanted to add that I really share @erg's long-term goal. Having an Ensemble class that could take arbitrary estimators would be really, really great in my opinion. At the time, I remember that I closed my branch on Bagging precisely because of the huge inconsistencies between our estimators in terms of interface.

@amueller (Member) commented Dec 6, 2012

@glouppe I think the interfaces have become much more consistent since your PR. I also think having this would be great.
