WIP: Classes support for all classifiers #1304
Conversation
I think it may be simpler to add a …
def is_multilabel(y):
    # the explicit check for ndarray is for forward compatibility; future
Sorry, I don't understand this. Can you elaborate?
Just moved it out of preprocessing.py into base.py.
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 593) def _is_label_indicator_matrix(y):
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 594) return hasattr(y, "shape") and len(y.shape) == 2
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 595)
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 596)
ebb42e78 sklearn/preprocessing/preprocessing.py (Mathieu Blondel 2012-01-21 05:29:42 +0900 597) def _is_multilabel(y):
7b060267 sklearn/preprocessing.py (Lars Buitinck 2012-05-18 09:25:21 +0200 598) # the explicit check for ndarray is for forward compatibility; future
7b060267 sklearn/preprocessing.py (Lars Buitinck 2012-05-18 09:25:21 +0200 599) # versions of Numpy might want to register ndarray as a Sequence
6d041623 sklearn/preprocessing.py (Lars Buitinck 2012-05-18 12:28:30 +0200 600) return not isinstance(y[0], np.ndarray) and isinstance(y[0], Sequence) \
855257f6 sklearn/preprocessing.py (Lars Buitinck 2012-05-24 16:08:07 +0200 601) and not isinstance(y[0], basestring) \
7b060267 sklearn/preprocessing.py (Lars Buitinck 2012-05-18 09:25:21 +0200 602) or _is_label_indicator_matrix(y)
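Stitched back together from the blame output above, the moved function reads like this (a verbatim reconstruction of the 2012 code; note the Python 2-era basestring and the then-valid collections.Sequence import):

import numpy as np
from collections import Sequence

def _is_label_indicator_matrix(y):
    return hasattr(y, "shape") and len(y.shape) == 2

def is_multilabel(y):
    # the explicit check for ndarray is for forward compatibility; future
    # versions of Numpy might want to register ndarray as a Sequence
    return not isinstance(y[0], np.ndarray) and isinstance(y[0], Sequence) \
        and not isinstance(y[0], basestring) \
        or _is_label_indicator_matrix(y)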
I am not really convinced this change is necessary.
Thinking about it, I'm not entirely sure LabelEncoder can do what I just claimed. Still, I think it is not a good idea to try to pack this logic into the base classes and try to discover what the user is doing.
Going back to the original discussion, it seems that custom bagging implementations were the motivation. Why don't we solve this by implementing a bagging meta estimator? @glouppe started that at some point, I think. I would guess that it should not be too hard.
+1
My original motivation was making ensembles of classifiers and combining their …
Yes, …
I'd like to hear others' opinions before moving this forward. After a longish discussion on the … Looking at the code again, the stuff that seemed too magic to me was there before. I do see your point, though, and think it would be nice if we could make sure of the shape of the decision function.
Well, personally I'm even -1 for dealing with this problem at all. It can easily be dealt with by a small utility function on the user side, and ensemble methods should in my opinion do reject sampling to ensure that all classes are always represented.
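For illustration, the rejection-sampling idea could be a few lines on the user side (a sketch; resample_all_classes is a made-up helper, not scikit-learn API, and X and y are assumed to be numpy arrays):

import numpy as np

def resample_all_classes(X, y, rng=np.random):
    # redraw bootstrap samples until every class present in y
    # also appears in the sample
    n_classes = len(np.unique(y))
    while True:
        idx = rng.randint(0, len(y), size=len(y))
        if len(np.unique(y[idx])) == n_classes:
            return X[idx], y[idx]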
+1. I am worried about the added complexity both in the code base and in …
Is there a clear cut (well established and published with many citations) …
I haven't thought this over a lot, but I guess that I would be +1 here.
Same feeling here.
AFAIK this is a standard way to implement stacking ensemble methods. For instance: slide 10 of http://www.ukkdd2007.org/slides/UKKDD-2007-Niall-talk.pdf Edit: this is also implemented in Orange for the … Such stacked models are very important to win ML contests such as Netflix, Kaggle and such, and can also be very useful in industry to gain some percent in predictive accuracy, even though often less interesting from a pure machine learning research point of view.
From my perspective there was no doubt that this is helpful for building ensembles.
I agree. I didn't mean to say that it was completely useless, but I meant …
I am not sure either. Moving the complexity to some ensemble base class would sound reasonable to me. Maybe @amueller's … I think having usage examples is critical to make good decisions on such design issues.
On 11/10/2012 03:57 PM, Olivier Grisel wrote:
…Fix bug where class_prior can't be a numpy array.
…updates to be sure it's correct.
… Move _is_multilabel from preprocessing to base, and rename it to is_multilabel.
…and use in preprocessing. Use prepare_classes() and have it decide between single or multilabel.
This would be a useful feature to have, particularly when combining classifiers that have been trained on data sets where they share common labels (e.g., 0, 1, 2) but each might only have seen a subset of the labels (e.g., {0, 1} or {0, 2}).
(Specifically, the …)
@mrjbq7 I think you could achieve that pretty easily using LabelBinarizer. If this is not the case, I would rather extend LabelBinarizer than modify all classifiers.
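Presumably something like this, where the binarizer is fit once on the full label set and then reused for every subset (a sketch):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit([0, 1, 2])           # every label that can ever occur
Y1 = lb.transform([0, 1])   # shape (2, 3): a column for each of 0, 1, 2
Y2 = lb.transform([0, 2])   # same column layout, even though 1 is absent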
I don't see how … As for making the code harder to understand, the complexity introduced for having a classes= parameter in __init__ for trees was not any harder. … I think having classes= as a keyword arg in … You end up with a much cleaner way of combining probabilities in ensembles, random sampling problems, cross-validation, boosting, bagging -- anything where you are not guaranteed that all of the data makes it into each classifier.

from sklearn.tree import DecisionTreeClassifier
clf1 = DecisionTreeClassifier()
clf2 = DecisionTreeClassifier()
clf3 = DecisionTreeClassifier()
clf1.fit([[1],[2],[3], [5],[6],[7], [10],[11],[12]], [0,0,0, 1,1,1, 2,2,2])
clf2.fit([[1],[2],[3], [5],[6],[7]], [0,0,0, 1,1,1])
clf3.fit([[1],[2],[3], [10],[11],[12]], [0,0,0, 2,2,2])
clf1.predict([[1], [5], [10]])
clf2.predict([[1], [5]])
clf3.predict([[1], [10]])
clf1.classes_
clf2.classes_
clf3.classes_
clf1.predict_proba([[1], [5], [10]])
clf2.predict_proba([[1], [5]])
clf3.predict_proba([[1], [10]])
Output in ipython2:
In [75]: clf1.predict([[1], [5], [10]])
Out[75]: array([ 0., 1., 2.])
In [76]: clf2.predict([[1], [5]])
Out[76]: array([ 0., 1.])
In [77]: clf3.predict([[1], [10]])
Out[77]: array([ 0., 2.])
In [78]: clf1.classes_
Out[78]: [array([0, 1, 2])]
In [79]: clf2.classes_
Out[79]: [array([0, 1])]
In [80]: clf3.classes_
Out[80]: [array([0, 2])]
In [81]: clf1.predict_proba([[1], [5], [10]])
Out[81]:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
In [82]: clf2.predict_proba([[1], [5]])
Out[82]:
array([[ 1., 0.],
[ 0., 1.]])
In [83]: clf3.predict_proba([[1], [10]])
Out[83]:
array([[ 1., 0.],
[ 0., 1.]])
@erg I will try to come up with it tomorrow.
I really don't see how this is connected to the … Also: how does the __init__ parameter help handle multi-output labels? I don't see a connection there, either.
When you do subsampling for bagging models (and maybe stacking too) for instance, the sub-datasets might not have the whole set of classes occurring in them. But you would still like to have the same shape in the … Also I think having the ability to fix the expected …
@ogrisel was this in response to my questions? I agree with what you are saying, but I'm not sure this PR is the right solution.
You are right that we already have the unique stuff and that we have to do it anyway to make sure that you're not trying to … Basically my idea of using a classifier is that you create it and tell it what to look for, train it, and then have it classify things for you. scikit has conflated training a classifier with telling it what to look for, and it irks me. :) A very simple solution that I'm going to use as a workaround is to make a function …
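Such a workaround might look like this (a sketch; expand_proba is a hypothetical name, and clf.classes_ is assumed to be a flat array of labels):

import numpy as np

def expand_proba(proba, classes_, all_classes):
    # copy each column of a predict_proba output into the position its
    # label occupies in the global class list, zero-filling the rest
    all_classes = list(all_classes)
    full = np.zeros((proba.shape[0], len(all_classes)))
    for j, label in enumerate(classes_):
        full[:, all_classes.index(label)] = proba[:, j]
    return full

With this, clf2's (n_samples, 2) output from the example above can be lined up against clf1's (n_samples, 3) output before averaging or voting.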
@ogrisel Agreed, for online learning something like this would be mandatory.
For multi-output: as far as I can tell, it is just not possible for any non-tree method except nearest neighbors. This is why I wouldn't want this in a base class. Don't get me wrong, I'm not totally opposed to the idea. Not sure what the others think, though.
I disagree with this statement: the scikit makes it easy to do easy … That doesn't mean that I am against having the option of specifying …
Haven't looked at the PR yet. G
@erg You are right, it is not possible to implement this with …
Sorry, misclick!
Ok, short (not that short) summary of my thoughts and what I feel is the result of the discussion so far: …
I see two possible solutions: …
I am concerned that adding an … @erg would you be happy with a utility function / estimator that would solve the problem of getting consistent …
I wrote a utility function to reshape all the proba arrays. So I don't really care anymore, except to maybe contribute something to scikit if stuff works well. I agree that it's pretty hard to redo each classifier, so the utility function is a better approach. I'm also experimenting with … I also wrote something to resample a dataset with replacement. You pass its … So finally I combined it all with a … I feel like these classes are a good start to ensemble learning and they fix most all of my API gripes with scikit. We'd probably want to fix …
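Combining the pieces, the averaging step of such an ensemble could be as small as this (a sketch reusing the hypothetical expand_proba above; again, clf.classes_ is assumed to be a flat array of labels):

import numpy as np

def average_proba(classifiers, X, all_classes):
    # expand each classifier's probabilities to the common class layout,
    # then average across classifiers
    probas = [expand_proba(clf.predict_proba(X), clf.classes_, all_classes)
              for clf in classifiers]
    return np.mean(probas, axis=0)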
So... what is the plan now? Close this PR and start a new one on ensembles? I'm not sure what you mean by …
In the spirit of closing issues, I'll close this and submit a new pull request when I figure out exactly what to contribute.
@erg Sorry this one didn't work out.
I did not take part in the discussion, but I wanted to add that I really share @erg's long-term goal. Having an …
@glouppe I think the interfaces became much more consistent since your PR. I also think having this would be great.
I'm taking a stab at decoupling the classes in y from the classes that the estimators can predict. I think this cleanup would allow for easier design of an ensemble class that could use arbitrary classifiers and combine their results through voting, averaging, stacking, etc. See #1183.

So far I've done naive_bayes, lda, trees, and forests. I tried to do qda and almost got it right, but there are issues with nan and inf. LabelBinarizer needs some testing as well.

I've gone with the convention of class variables set in __init__ having no trailing underscore and any other private variables using one. So self.classes gets set in __init__ and then self.classes_ gets set in fit. Maybe there needs to be a parameter method that resets self.classes_ to None so that classifiers are not in half-working states if the user sets classes after the fact?

Any help with the remaining classifiers, code cleanups, or suggestions would be appreciated.