MRG: Dummy estimators by mblondel · Pull Request #1373 · scikit-learn/scikit-learn · GitHub

MRG: Dummy estimators #1373


Merged
merged 20 commits into scikit-learn:master on Nov 22, 2012

Conversation

mblondel
Member

This PR is gonna win the prize for the silliest classifier ever added to scikit-learn.

I think it's a good sanity check when doing classification to compare with random classification. This estimator supports three strategies for generating random predictions: stratified, most_frequent and uniform. The first two strategies can give surprisingly good accuracy if you're doing binary classification or if your dataset is unbalanced.

If people think it's useful, I'm gonna add it to the documentation.
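A minimal sketch of the kind of baseline check being proposed, written against the DummyClassifier name the estimator ends up with later in this PR; the toy data here is made up for illustration:

from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1, 1]  # unbalanced binary labels

for strategy in ("stratified", "most_frequent", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X, y)
    # baseline accuracy to compare a real classifier against
    print(strategy, clf.score(X, y))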

@glouppe
Contributor
glouppe commented Nov 15, 2012

I like it! Very useful in my opinion.

@jaquesgrobler
Member

Nice, I think it's useful for comparisons. I've seen it mentioned before, so 👍 from me.

@ogrisel
Member
ogrisel commented Nov 15, 2012

Indeed, Kaggle always provides these kinds of evaluations as baseline benchmark values for the public leaderboard. Very useful sanity check.

@ogrisel
Member
ogrisel commented Nov 15, 2012

+1

@ogrisel
Member
ogrisel commented Nov 15, 2012

Maybe "stratified" could be renamed "frequency_weighted"?

@amueller
Member

+1 for the estimator.
Maybe rename `sampling` to `strategy`?

@agramfort
Member

I thought it was an April Fools' joke :)

but I like it and I am eager to have students reporting statistically significant scores with it ;)

@GaelVaroquaux
Member

For such purposes, I usually rely on
cross_validation.permutation_test_score
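For contrast, a short sketch of the permutation-test approach mentioned here, assuming a recent scikit-learn where it lives in sklearn.model_selection (it was in sklearn.cross_validation at the time of this PR); unlike a dummy baseline, it needs a real estimator whose score is compared against scores on permuted labels:

from sklearn.model_selection import permutation_test_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
score, perm_scores, pvalue = permutation_test_score(
    SVC(kernel="linear"), X, y, cv=5, n_permutations=100)
# pvalue estimates how often a chance model matches the real score
print(score, pvalue)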

@mblondel
Member Author

permutation_test_score seems to be measuring something different: it needs an estimator whereas RandomClassifier doesn't. BTW permutation_test_score isn't in the narrative documentation and the docstring is quite laconic. Would be nice to improve it.

(I'm working on a binary classification problem right now and "most_frequent" is useful for me as it always predicts the negative class.)

@mblondel
Member Author

Any opinion on the name changes suggested by @ogrisel and @amueller? I personally prefer "stratified" but no opinion on "sampling" vs "strategy".

Do I need to add this to the narrative doc? If so, where? We could rename the model selection chapter to "model selection and evaluation" (the section on metrics could go there too).

@GaelVaroquaux
Member

On Thu, Nov 15, 2012 at 08:27:51AM -0800, Mathieu Blondel wrote:

Any opinion on the name changes suggested by @ogrisel and @amueller? I
personally prefer "stratified" but no opinion on "sampling" vs "strategy".

I prefer stratified as this is the term that we already use, and also
'sampling' because it is more explicit.

Do I need to add this to the narrative doc?

Yes. We should not add anything more to the scikit without adding it
to the narrative doc (it's a mistake that we have made in the past...).

If so, where? We could rename the model selection chapter to "model
selection and evaluation" (the section on metrics could go there too).

That would be good.

@amueller
Member

Yes, model selection and evaluation sounds good.

Regarding "stratified": to me, a stratified data set is one where all classes are equally likely. sklearn uses a different terminology for StratifiedKFold, for example; I'm not sure how widely used that is.
Another proposal: proportional.

@amueller
Member

I can't find the issue by @jaquesgrobler now, but we could find out whether `sampling` or `strategy` lowers the entropy more and what meaning (if any) they have in other places.

@amueller
Member

OK, +1 for stratified as this is the word used in sklearn for proportional labels.
I don't really care that much about strategy vs sampling. I just felt sampling was a bit odd since one of the methods is not sampling-based, and to me sampling sounds like a boolean (for some unknown reason).


Attributes
----------
`label_encoder_` : LabelEncoder object
Member

We don't usually put LabelEncoders on estimators as public attributes. Instead, I'd prefer a property

@property
def classes_(self):
    return self._label_encoder.classes_

Member

I think I would like all classifiers to have a classes_ attribute.
Not sure if I would implement it like this, though ;)

Member

On Thu, Nov 15, 2012 at 08:45:40AM -0800, Lars Buitinck wrote:

We don't usually put LabelEncoders on estimators as public attributes. Instead,
I'd prefer a property

@property
def classes_(self):
    return self._label_encoder.classes_

You are hiding things from the user. Not a good idea, IMHO. I prefer the
property-less approach.

Member

I think it would be good if we handled this in a consistent way across classifiers.
Usually I rooted for self.classes_, y = unique(y, return_inverse=True) and I used that in a lot of places.
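A quick illustration of that idiom with made-up labels, showing why a single unique call both records classes_ and encodes y for internal use:

import numpy as np

y = np.array(["spam", "ham", "ham", "spam", "eggs"])
classes_, y_encoded = np.unique(y, return_inverse=True)
print(classes_)               # ['eggs' 'ham' 'spam']
print(y_encoded)              # [2 1 1 2 0], integer codes the estimator can fit on
print(classes_[y_encoded])    # maps predictions back to the original labels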

Member

Usually I rooted for self.classes_, y = unique(y, return_inverse=True) and I
used that in a lot of places.

+1: simple, readable and efficient.

Member

Skipping the LabelEncoder entirely is fine with me.

Member

You mean the classes_ property? Where?

Member

RidgeClassifier, LinearSVC and sklearn.multiclass. I see now that the latter exposes its binarizer.

Member Author

Ok, I will remove LabelEncoder.

Member Author

KernelPCA exposes its KernelCenterer too... I wouldn't mind making it private.

@amueller
Member

If you don't want to do input validation in predict, you have to adjust some of the common tests.
I'd prefer doing the input validation to be honest (how long could it take?).

@larsmans
Member

Three questions/suggestions:

  • Maybe rename the module to sklearn.baseline?
  • Is RandomClassifier(sampling="most_frequent") equivalent to a Weka OneR classifier, and if so, should we mention that?
  • Does it make sense to implement predict_proba for this beast?

@larsmans
Member

Oh and +1 for strategy.

@erg
Contributor

Will this support multi-output classification?

@mblondel
Member Author

Maybe rename the module to sklearn.baseline?

What other stuff would you put in it?

Is RandomClassifier(sampling="most_frequent") equivalent to a Weka OneR classifier, and if so, should we mention that?

Skimming through the intro of the paper, OneR seems to be based on decision stumps.

Does it make sense to implement predict_proba for this beast?

Could be useful for metrics such as ROC I guess.

Will this support multi-output classification?

Do you mean multi-task classification? AFAIK, it's supported nowhere in the scikit.
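On the predict_proba point above, a minimal sketch of how such probabilities could feed a ROC-style metric, using today's sklearn.metrics.roc_auc_score; the data is synthetic and only meant to show the call pattern:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (rng.rand(200) > 0.7).astype(int)   # unbalanced binary labels

clf = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)
proba = clf.predict_proba(X)[:, 1]      # probability assigned to the positive class
print(roc_auc_score(y, proba))          # should hover around chance level (~0.5)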

@amueller
Member

I think @erg means what the multi-output forest implements.
I'd call it structured prediction. And true, we don't really support that. @erg, what is your setting that makes you so interested in this? I'm quite curious. Usually I'd use some CRF for this kind of task.

@erg
Contributor
erg commented Nov 15, 2012

Multi-output classification is supported by decision trees as of a couple months ago.
http://scikit-learn.org/dev/modules/tree.html#multi-output-problems

Here's a random page about multi-task in scikit.
http://scikit-learn.org/stable/auto_examples/linear_model/plot_multi_task_lasso_support.html

@amueller
Member

The multi-task lasso is for regression; that is somewhat different and has quite a different goal.

@erg
Contributor
erg commented Nov 15, 2012

@amueller I'm not actually using multi-output decision trees for anything, but it's related to the bugs/inconsistencies with classifying arbitrary labels and passing classes to classifiers in the constructor to get the same shape on predict_proba for ensemble learning.

Perhaps if every classifier took a LabelEncoder that could optionally take a classes= parameter, and then all labels were transformed before classification and then inverse transformed after, that would solve several problems at once for me. The reason multi-output comes up is that decision trees support it, so the LabelEncoder would have to as well. (It's not that hard, I have the code for it.)

@amueller
Member

Nearly all classes already transform the labels before fitting and then transform them back after predicting.
As I said elsewhere, it is just a matter of making this consistent across all classifiers.
This is made somewhat harder by the class weight logic in the SVM and SGD. There, the LabelEncoder is really useful, as an inverse transform is needed.
Related issues are #745, #1037 and #406. Feel free to attack those.

Having the possibility to pass a dictionary of predictable classes is really a somewhat different issue and there is afaik no consensus on whether we want that and if so, how it should be implemented.
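On the inverse transform point above, a quick illustration of the round trip that makes LabelEncoder handy (e.g. for mapping internal codes back to the original labels); the labels are made up:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform([1, 5, 5, 9])
print(le.classes_)                    # [1 5 9]
print(encoded)                        # [0 1 1 2]
print(le.inverse_transform(encoded))  # [1 5 5 9], back to the original labels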

@amueller
Member

What we should decide, though, is whether we want a classes_ everywhere (I thought the answer was yes) and if / when a classifier should have a LabelEncoder.

@GaelVaroquaux
Member

What we should decide, though, is whether we want a classes_ everywhere (I
thought the answer was yes) and if / when a classifier should have a
LabelEncoder.

My gut feeling is to favor classes_ everywhere: it is much simpler. I
don't want the scikit to become a big framework with nested components
(like VTK for instance).

@mblondel
Member Author

+1

LabelEncoder is more useful for user-land code.

:class:`RandomClassifier` implements three different strategies for
random classification. `stratified` generates predictions by respecting
the training set's class distribution. `most_frequent` always predicts the
most frequent label in the training set (useful for binary classification).
Member

Why is this particularly useful for binary classification?

Member

Because there it will achieve <=50% error rate.

Member

<50% only if unbalanced. In the unbalanced case, this is also true for multi-class...
I don't really see the distinction.

Member

True.

@mblondel
Member Author

I renamed it to dummy.DummyClassifier and added dummy.DummyRegressor. I also force-pushed a rebased version of this PR. Should be good for merge.
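A minimal sketch of the regressor counterpart, assuming the default behaviour of predicting the training-set mean; the data is made up:

from sklearn.dummy import DummyRegressor

X = [[0], [1], [2], [3]]
y = [1.0, 2.0, 3.0, 6.0]

reg = DummyRegressor()        # predicts the mean of y seen during fit
reg.fit(X, y)
print(reg.predict([[10]]))    # [3.0], i.e. the mean of the training targets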

@mblondel
Member Author

Any further comment? :)

.. currentmodule:: sklearn.dummy

When doing classification, a simple sanity check consists in comparing one's
classifier against simple rules of thumb.
Member

If the heading is Dummy estimators I would probably rephrase the sentence to reflect that, i.e. "When doing supervised learning ... one's estimator ..." and then say "three such strategies for classification."

@amueller
Member

I think one important point about the dummy estimators is not mentioned in the docs yet: these "rules of thumb" are data-independent. I think this is worth mentioning.


clf = DummyClassifier(strategy="stratified", random_state=0)
clf.fit(X, y)
assert_array_equal(clf.predict(X), [1, 2, 2, 1, 1])
Member

Not really important but I would prefer a test that takes a bigger X, does a bincount and compares ratios.
Same for test_uniform_strategy. Though this is complaining on a pretty high level ;)

Member Author

You mean, check the expectations? I had the same idea but didn't do it since I thought it would require lots of samples to reach close enough values. I will try if I have time though.

Member

Yeah, exactly. But shouldn't ~100 samples be enough? And that should run pretty much instantly, right?
Well, as I said, just nitpicking on a high level ;)
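A sketch of the kind of check being discussed: fit on a larger sample, then compare the predicted class ratios against the training ratios via bincount. The sample size and tolerance here are guesses, not what was committed:

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.dummy import DummyClassifier

rng = np.random.RandomState(0)
X = np.zeros((500, 1))   # features are ignored by the dummy classifier
y = rng.choice([0, 1, 2], size=500, p=[0.5, 0.3, 0.2])

clf = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)
pred_ratios = np.bincount(clf.predict(X)) / 500.0
train_ratios = np.bincount(y) / 500.0
assert_array_almost_equal(pred_ratios, train_ratios, decimal=1)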

@amueller
Member

I would imagine that the common tests complain as you don't do proper input validation. And also maybe because the models are pretty bad ^^
I'd prefer to do input validation during fit, as the user is using this only if she is in debug mode anyway.
The tests about performance should probably be skipped for these explicitly.

@larsmans
Member

All the examples do binary classification. It would be nice to have multiclass as well, and maybe symbolic labels.

Class labels.

`class_prior_` : array, shape = [n_classes]
Probability of each class.
Member

Shouldn't this be a scalar in the binary case? That's how we handle intercept_ in the linear models.

Member Author

I don't like treating the binary case differently.

Member Author

BTW, I called it class_prior_ to follow MultinomialNB. Do you treat the binary case differently in it?

Member

Neither do I, but it's a long-standing convention.

As for mimicking NB: I've been considering changing its behavior, I just never got round to it. It does some stuff differently from all the other estimators, and I feel it should be more like the other linear models.

@mblondel
Member Author

All the examples do binary classification.

I thought about it, but the way it's implemented, the number of classes doesn't change anything.

and maybe symbolic labels

Sure, I'll add that.

@mblondel
Member Author

I would imagine that the common tests complain as you don't do proper input validation

I haven't run the whole test suite yet. Damn, this PR is taking more time than I thought... (as a scikit-learn core developer, I should know better though :b)

@mblondel
Member Author

Man, you're fast :p

Regarding common tests, it would be nice if all assertions could have an explicit message. test_common was detecting the old random_classifier.pyc on my disk and running tests for it. Took me a while to figure out :)

@amueller
Member

fast = procrastinating ;)
Yes, better assertions with more explicit error message are somewhere in todos of the common tests (there are several issues on that).
I think @ibayer started some refactoring there.

@amueller
Member

About taking time: apart from the common test, feel free to ignore my comments.

And you really should have known better ;)

@mblondel
Member Author

About taking time: apart from the common test, feel free to ignore my comments.

---What are you referring to?---

Got you (I'm tired)

@mblondel
Member Author

Anything else?

@amueller
Member

+1 for merge :)

mblondel added a commit that referenced this pull request Nov 22, 2012
@mblondel merged commit 0e4a693 into scikit-learn:master on Nov 22, 2012
@mblondel
Member Author

Merged, thanks.

9 participants