MRG: Dummy estimators #1373
Conversation
I like it! Very useful in my opinion.
Nice, I think it's useful for this kind of comparison. I've seen it mentioned before, so 👍 from me.
Indeed, Kaggle always provides those kinds of evaluations as baseline benchmark values for the public leaderboard. Very useful sanity check.
+1
Maybe "stratified" could be renamed "frequency_weighted"?
+1 for the estimator.
I thought it was an April fools' joke :) but I like it, and I am eager to have students reporting statistically significant scores with it ;)
For such purposes, I usually rely on
(I'm working on a binary classification problem right now and "most_frequent" is useful for me as it always predicts the negative class.)
Any opinion on the name changes suggested by @ogrisel and @amueller? I personally prefer "stratified" but have no opinion on "sampling" vs "strategy". Do I need to add this to the narrative doc? If so, where? We could rename the model selection chapter to "model selection and evaluation" (the section on metrics could go there too).
On Thu, Nov 15, 2012 at 08:27:51AM -0800, Mathieu Blondel wrote:
I prefer "stratified" as this is the term that we already use.
Yes. We should not add anything more to the scikit without adding it to the documentation.
That would be good.
Yes, "model selection and evaluation" sounds good. For "stratified": to me, a stratified data set is one where all classes are equally likely. sklearn uses a different terminology for StratifiedKFold, for example - I'm not sure how widely used that is.
I can't find the issue by @jaquesgrobler now but we could find out whether
OK, +1 for "stratified" as this is the word used in sklearn for proportional labels.
Attributes
----------
`label_encoder_` : LabelEncoder object
We don't usually put LabelEncoders on estimators as public attributes. Instead, I'd prefer a property:

@property
def classes_(self):
    return self._label_encoder.classes_
I think I would like all classifiers to have a `classes_` attribute. Not sure if I would implement it like this, though ;)
On Thu, Nov 15, 2012 at 08:45:40AM -0800, Lars Buitinck wrote:
> We don't usually put LabelEncoders on estimators as public attributes.
> Instead, I'd prefer a property:
>     @property
>     def classes_(self):
>         return self._label_encoder.classes_

You are hiding things from the user. Not a good idea, IMHO. I prefer the property-less approach.
I think it would be good if we handled this in a consistent way across classifiers. Usually I rooted for `self.classes_, y = unique(y, return_inverse=True)` and I used that in a lot of places.
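For context, a minimal sketch of that idiom inside a fit method (the class name and surrounding code are hypothetical, not from this PR):

import numpy as np

class SomeClassifier(object):
    def fit(self, X, y):
        # np.unique returns the sorted distinct labels and, with
        # return_inverse=True, y re-encoded as indices into classes_.
        self.classes_, y = np.unique(y, return_inverse=True)
        # ... fit the model on the integer-encoded y ...
        return self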
> Usually I rooted for `self.classes_, y = unique(y, return_inverse=True)` and I used that in a lot of places.

+1: simple, readable and efficient.
Skipping the `LabelEncoder` entirely is fine with me.
You mean the `classes_` property? Where?
`RidgeClassifier`, `LinearSVC` and `sklearn.multiclass`. I see now that the latter exposes its binarizer.
Ok, I will remove `LabelEncoder`.
`KernelPCA` exposes its `KernelCenterer` too... I wouldn't mind making it private.
If you don't want to do input validation in
Three questions/suggestions:

Oh and +1 for
Will this support multi-output classification?
What other stuff would you put in it?
Skimming through the intro of the paper, OneR seems to be based on decision stumps.
Could be useful for metrics such as ROC I guess.
Do you mean multi-task classification? AFAIK, it's supported nowhere in the scikit.
Multi-output classification is supported by decision trees as of a couple months ago. Here's a random page about multi-task in scikit.
The multi-task lasso is for regression; that is somewhat different and has quite a different goal.
@amueller I'm not actually using multi-output decision trees for anything, but it's related to the bugs/inconsistencies with classifying arbitrary labels and passing classes to classifiers in the constructor to get the same shape on predict_proba for ensemble learning. Perhaps if every classifier took a
Nearly all classes already transform the labels before fitting and then transform them back after predicting. Having the possibility to pass a dictionary of predictable classes is really a somewhat different issue, and there is AFAIK no consensus on whether we want that and, if so, how it should be implemented.
What we should decide, though, is whether we want a
My gut feeling is to favor `classes_` everywhere: it is much simpler. I
+1. `LabelEncoder` is more useful for user-land code.
:class:`RandomClassifier` implements three different strategies for
random classification. `stratified` generates predictions by respecting
the training set's class distribution. `most_frequent` always predicts the
most frequent label in the training set (useful for binary classification).
Why is this particularly useful for binary classification?
Because there it will achieve an error rate <= 50%: the majority class covers at least half of the samples.
Strictly < 50% only if the classes are unbalanced. And in the unbalanced case, this is also true for multi-class... I don't really see the distinction.
True.
I renamed to
Any further comment? :)
.. currentmodule:: sklearn.dummy

When doing classification, a simple sanity check consists in comparing one's
classifier against simple rules of thumb.
If the heading is "Dummy estimators", I would probably rephrase the sentence to reflect that, i.e. "When doing supervised learning ... one's estimator ..." and then say "three such strategies for classification."
I think one important point about the dummy estimators is not mentioned in the docs yet: these "rules of thumb" are data independent. I think this is worth mentioning.
clf = DummyClassifier(strategy="stratified", random_state=0)
clf.fit(X, y)
assert_array_equal(clf.predict(X), [1, 2, 2, 1, 1])
Not really important, but I would prefer a test that takes a bigger X, does a bincount and compares ratios. Same for `test_uniform_strategy`. Though this is complaining on a pretty high level ;)
You mean, check the expectations? I had the same idea but didn't do it since I thought it would require lots of samples to reach close enough values. I will try if I have time though.
Yeah, exactly. But shouldn't ~100 samples be enough? And that should be pretty much instant, right?
Well, as I said, just nitpicking on a high level ;)
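Something along these lines might work (a sketch of the suggested test, not code from this PR; the sample size and tolerance are guesses):

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((500, 1))                # features are ignored by the dummy
y = np.array([0] * 400 + [1] * 100)   # 80% / 20% class distribution

clf = DummyClassifier(strategy="stratified", random_state=0)
clf.fit(X, y)

# Empirical prediction frequencies should roughly match the class priors.
ratios = np.bincount(clf.predict(X).astype(int)) / float(len(X))
assert abs(ratios[0] - 0.8) < 0.1
assert abs(ratios[1] - 0.2) < 0.1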
I would imagine that the common tests complain as you don't do proper input validation. And also maybe because the models are pretty bad ^^
All the examples do binary classification. It would be nice to have multiclass as well, and maybe symbolic labels.
Class labels.

`class_prior_` : array, shape = [n_classes]
    Probability of each class.
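For illustration, such a prior can be computed with a bincount over integer-encoded labels (just a sketch, not necessarily how this PR implements it):

import numpy as np

y = np.array([0, 0, 0, 1, 1])
class_prior_ = np.bincount(y) / float(len(y))  # -> array([0.6, 0.4])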
Shouldn't this be a scalar in the binary case? That's how we handle `intercept_` in the linear models.
I don't like treating the binary case differently.
BTW, I called it `class_prior_` to follow `MultinomialNB`. Do you treat the binary case differently in it?
Neither do I, but it's a long-standing convention.
As for mimicking NB: I've been considering changing its behavior, I just never got round to it. It does some stuff differently from all the other estimators, and I feel it should be more like the other linear models.
I thought about it, but the way it's implemented, it doesn't change anything regardless of how many classes there are.
Sure, I'll add that.
I haven't run the whole test suite yet. Damn, this PR is taking more time than I thought... (as a scikit-learn core developer, I should know better though :b)
Man, you're fast :p Regarding common tests, it would be nice if all assertions could have an explicit message.
fast = procrastinating ;)
About taking time: apart from the common tests, feel free to ignore my comments. And you really should have known better ;)
~~What are you referring to?~~ Got you (I'm tired)
Anything else?
+1 for merge :)
Merged, thanks.
This PR is gonna win the prize for the silliest classifier ever added to scikit-learn.
I think it's a good sanity check when doing classification to compare with random classification. This estimator supports three strategies for generating random predictions: stratified, most_frequent and uniform. The first two strategies can give surprisingly good accuracy if you're doing binary classification or if your dataset is unbalanced.
If people think it's useful, I'm gonna add it to the documentation.
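A quick usage sketch of the three strategies described above (assuming the DummyClassifier name used in the tests in this thread; the exact predictions depend on the random seed):

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((6, 1))                # features are ignored by the baselines
y = np.array([1, 1, 1, 1, 2, 2])    # unbalanced: class 1 dominates

for strategy in ("stratified", "most_frequent", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X, y)
    # most_frequent always predicts 1; the other two sample at random.
    print(strategy, clf.predict(X))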