MRG: Dummy estimators by mblondel · Pull Request #1373 · scikit-learn/scikit-learn · GitHub

MRG: Dummy estimators #1373


Merged
merged 20 commits into scikit-learn:master on Nov 22, 2012

Conversation

mblondel
Member

This PR is gonna win the prize for the silliest classifier ever added to scikit-learn.

I think it's a good sanity check when doing classification to compare with random classification. This estimator supports three strategies for generating random predictions: stratified, most_frequent and uniform. The first two strategies can give surprisingly good accuracy if you're doing binary classification or if your dataset is unbalanced.

If people think it's useful, I'm gonna add it to the documentation.
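A minimal sketch of the kind of baseline check being proposed, written against the DummyClassifier name the estimator ends up with later in this PR; the toy data here is made up for illustration:

from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1, 1]  # unbalanced binary labels

for strategy in ("stratified", "most_frequent", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X, y)
    # baseline accuracy to compare a real classifier against
    print(strategy, clf.score(X, y))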

@glouppe
Contributor
glouppe commented Nov 15, 2012

I like it! Very useful in my opinion.

@jaquesgrobler
Member

Nice, I think it's useful for comparisons. I've seen it mentioned before, so 👍 from me.

@ogrisel
Member
ogrisel commented Nov 15, 2012

Indeed, Kaggle always provides these kinds of evaluations as baseline benchmark values for the public leaderboard. Very useful sanity check.

@ogrisel
Member
ogrisel commented Nov 15, 2012

+1

@ogrisel
Member
ogrisel commented Nov 15, 2012

Maybe "stratified" could be renamed "frequency_weighted"?

@amueller
Member

+1 for the estimator.
Maybe rename `sampling` to `strategy`?

@agramfort
Member

I thought it was an April Fools' joke :)

but I like it and I am eager to have students reporting statistically significant scores with it ;)

@GaelVaroquaux
Member

For such purposes, I usually rely on
cross_validation.permutation_test_score
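For contrast, a short sketch of the permutation-test approach mentioned here, assuming a recent scikit-learn where it lives in sklearn.model_selection (it was in sklearn.cross_validation at the time of this PR); unlike a dummy baseline, it needs a real estimator whose score is compared against scores on permuted labels:

from sklearn.model_selection import permutation_test_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
score, perm_scores, pvalue = permutation_test_score(
    SVC(kernel="linear"), X, y, cv=5, n_permutations=100)
# pvalue estimates how often a chance model matches the real score
print(score, pvalue)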

@mblondel
Member Author

permutation_test_score seems to be measuring something different: it needs an estimator whereas RandomClassifier doesn't. BTW permutation_test_score isn't in the narrative documentation and the docstring is quite laconic. Would be nice to improve it.

(I'm working on a binary classification problem right now and "most_frequent" is useful for me as it always predicts the negative class.)

@mblondel
Member Author

Any opinion on the name changes suggested by @ogrisel and @amueller? I personally prefer "stratified" but no opinion on "sampling" vs "strategy".

Do I need to add this to the narrative doc? If so, where? We could rename the model selection chapter to "model selection and evaluation" (the section on metrics could go there too).

@GaelVaroquaux
Member

On Thu, Nov 15, 2012 at 08:27:51AM -0800, Mathieu Blondel wrote:

Any opinion on the name changes suggested by @ogrisel and @amueller? I
personally prefer "stratified" but no opinion on "sampling" vs "strategy".

I prefer stratified as this is the term that we already use, and also
'sampling' because it is more explicit.

Do I need to add this to the narrative doc?

Yes. We should not add anything more to the scikit without adding it
to the narrative doc (it's a mistake that we have made in the past...).

If so, where? We could rename the model selection chapter to "model
selection and evaluation" (the section on metrics could go there too).

That would be good.

@amueller
Member

Yes, model selection and evaluation sounds good.

Regarding "stratified": to me, a stratified data set is one where all classes are equally likely. sklearn uses a different terminology for StratifiedKFold, for example; I'm not sure how widely used that is.
Another proposal: proportional.

@amueller
Member

I can't find the issue by @jaquesgrobler now, but we could find out whether `sampling` or `strategy` lowers the entropy more and what meaning (if any) they have in other places.

@amueller
Member

OK, +1 for stratified as this is the word used in sklearn for proportional labels.
I don't really care that much about strategy vs sampling. I just felt sampling was a bit odd since one of the methods is not sampling-based, and to me sampling sounds like a boolean (for some unknown reason).


Attributes
----------
`label_encoder_` : LabelEncoder object
Member

We don't usually put LabelEncoders on estimators as public attributes. Instead, I'd prefer a property

@property
def classes_(self):
    return self._label_encoder.classes_

Member

I think I would like all classifiers to have a classes_ attribute.
Not sure if I would implement it like this, though ;)

Member

On Thu, Nov 15, 2012 at 08:45:40AM -0800, Lars Buitinck wrote:

We don't usually put LabelEncoders on estimators as public attributes. Instead,
I'd prefer a property

@property
def classes_(self):
    return self._label_encoder.classes_

You are hiding things from the user. Not a good idea, IMHO. I prefer the
property-less approach.

Member

I think it would be good if we handled this in a consistent way across classifiers.
Usually I rooted for self.classes_, y = unique(y, return_inverse=True) and I used that in a lot of places.
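A quick illustration of that idiom with made-up labels, showing why a single unique call both records classes_ and encodes y for internal use:

import numpy as np

y = np.array(["spam", "ham", "ham", "spam", "eggs"])
classes_, y_encoded = np.unique(y, return_inverse=True)
print(classes_)               # ['eggs' 'ham' 'spam']
print(y_encoded)              # [2 1 1 2 0], integer codes the estimator can fit on
print(classes_[y_encoded])    # maps predictions back to the original labels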

Member

Usually I rooted for self.classes_, y = unique(y, return_inverse=True) and I
used that in a lot of places.

+1: simple, readable and efficient.

Member

Skipping the LabelEncoder entirely is fine with me.

Member

You mean the classes_ property? Where?

Member

RidgeClassifier, LinearSVC and sklearn.multiclass. I see now that the latter exposes its binarizer.

Member Author

Ok, I will remove LabelEncoder.

Member Author

KernelPCA exposes its KernelCenterer too... I wouldn't mind making it private.

@amueller
Member

If you don't want to do input validation in predict, you have to adjust some of the common tests.
I'd prefer doing the input validation to be honest (how long could it take?).

@larsmans
Member

Three questions/suggestions:

  • Maybe rename the module to sklearn.baseline?
  • Is RandomClassifier(sampling="most_frequent") equivalent to a Weka OneR classifier, and if so, should we mention that?
  • Does it make sense to implement predict_proba for this beast?

@larsmans
Member

Oh and +1 for strategy.

@erg
Contributor

Will this support multi-output classification?

@mblondel
Member Author

Maybe rename the module to sklearn.baseline?

What other stuff would you put in it?

Is RandomClassifier(sampling="most_frequent") equivalent to a Weka OneR classifier, and if so, should we mention that?

Skimming through the intro of the paper, OneR seems to be based on decision stumps.

Does it make sense to implement predict_proba for this beast?

Could be useful for metrics such as ROC I guess.

Will this support multi-output classification?

Do you mean multi-task classification? AFAIK, it's supported nowhere in the scikit.
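On the predict_proba point above, a minimal sketch of how such probabilities could feed a ROC-style metric, using today's sklearn.metrics.roc_auc_score; the data is synthetic and only meant to show the call pattern:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (rng.rand(200) > 0.7).astype(int)   # unbalanced binary labels

clf = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)
proba = clf.predict_proba(X)[:, 1]      # probability assigned to the positive class
print(roc_auc_score(y, proba))          # should hover around chance level (~0.5)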

@amueller
Member

I think @erg means what the multi-output forest implements.
I'd call it structured prediction. And true, we don't really support that. @erg, what is your setting that makes you so interested in this? I'm quite curious. Usually I'd use some CRF for this kind of task.

@erg
Contributor
erg commented Nov 15, 2012

Multi-output classification is supported by decision trees as of a couple months ago.
http://scikit-learn.org/dev/modules/tree.html#multi-output-problems

Here's a random page about multi-task in scikit.
http://scikit-learn.org/stable/auto_examples/linear_model/plot_multi_task_lasso_support.html

@amueller
Member

The multi-task lasso is for regression; that is somewhat different and has quite a different goal.

@erg
Contributor
erg commented Nov 15, 2012

@amueller I'm not actually using multi-output decision trees for anything, but it's related to the bugs/inconsistencies with classifying arbitrary labels and passing classes to classifiers in the constructor to get the same shape on predict_proba for ensemble learning.

Perhaps if every classifier took a LabelEncoder that could optionally take a classes= parameter, and then all labels were transformed before classification and then inverse transformed after, that would solve several problems at once for me. The reason multi-output comes up is that decision trees support it, so the LabelEncoder would have to as well. (It's not that hard, I have the code for it.)

@amueller
Member

Nearly all classes already transform the labels before fitting and then transform them back after predicting.
As I said elsewhere, it is just a matter of making this consistent across all classifiers.
This is made somewhat harder by the class weight logic in the SVM and SGD. There, the LabelEncoder is really useful, as an inverse transform is needed.
Related issues are #745, #1037 and #406. Feel free to attack those.

Having the possibility to pass a dictionary of predictable classes is really a somewhat different issue and there is afaik no consensus on whether we want that and if so, how it should be implemented.
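On the inverse transform point above, a quick illustration of the round trip that makes LabelEncoder handy (e.g. for mapping internal codes back to the original labels); the labels are made up:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform([1, 5, 5, 9])
print(le.classes_)                    # [1 5 9]
print(encoded)                        # [0 1 1 2]
print(le.inverse_transform(encoded))  # [1 5 5 9], back to the original labels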

@amueller
Member

What we should decide, though, is whether we want a classes_ everywhere (I thought the answer was yes) and if / when a classifier should have a LabelEncoder.

@GaelVaroquaux
Member

What we should decide, though, is whether we want a classes_ everywhere (I
thought the answer was yes) and if / when a classifier should have a
LabelEncoder.

My gut feeling is to favor classes_ everywhere: it is much simpler. I
don't want the scikit to become a big framework with nested components
(like VTK for instance).

@mblondel
Member Author

+1

LabelEncoder is more useful for user-land code.

:class:`RandomClassifier` implements three different strategies for
random classification. `stratified` generates predictions by respecting
the training set's class distribution. `most_frequent` always predicts the
most frequent label in the training set (useful for binary classification).
Member

Why is this particularly useful for binary classification?

Member

Because there it will achieve <=50% error rate.

Member

<50% only if unbalanced. In the unbalanced case, this is also true for multi-class...
I don't really see the distinction.

Member

True.

@mblondel
Member Author

I renamed it to dummy.DummyClassifier and added dummy.DummyRegressor. I also force-pushed a rebased version of this PR. Should be good for merge.
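A minimal sketch of the regressor counterpart, assuming the default behaviour of predicting the training-set mean; the data is made up:

from sklearn.dummy import DummyRegressor

X = [[0], [1], [2], [3]]
y = [1.0, 2.0, 3.0, 6.0]

reg = DummyRegressor()        # predicts the mean of y seen during fit
reg.fit(X, y)
print(reg.predict([[10]]))    # [3.0], i.e. the mean of the training targets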

@mblondel
Member Author

Any further comment? :)

.. currentmodule:: sklearn.dummy

When doing classification, a simple sanity check consists in comparing one's
classifier against simple rules of thumb.
Member

If the heading is Dummy estimators I would probably rephrase the sentence to reflect that, i.e. "When doing supervised learning ... one's estimator ..." and then say "three such strategies for classification."

@amueller
Member

I think one important point about the dummy estimators is not mentioned in the docs yet: these "rules of thumb" are data-independent. I think this is worth mentioning.


clf = DummyClassifier(strategy="stratified", random_state=0)
clf.fit(X, y)
assert_array_equal(clf.predict(X), [1, 2, 2, 1, 1])
Member

Not really important but I would prefer a test that takes a bigger X, does a bincount and compares ratios.
Same for test_uniform_strategy. Though this is complaining on a pretty high level ;)

Member Author

You mean, check the expectations? I had the same idea but didn't do it since I thought it would require lots of samples to reach close enough values. I will try if I have time though.

Member

Yeah, exactly. But shouldn't ~100 samples be enough? And that should run pretty much instantly, right?
Well, as I said, just nitpicking on a high level ;)
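A sketch of the kind of check being discussed: fit on a larger sample, then compare the predicted class ratios against the training ratios via bincount. The sample size and tolerance here are guesses, not what was committed:

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.dummy import DummyClassifier

rng = np.random.RandomState(0)
X = np.zeros((500, 1))   # features are ignored by the dummy classifier
y = rng.choice([0, 1, 2], size=500, p=[0.5, 0.3, 0.2])

clf = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)
pred_ratios = np.bincount(clf.predict(X)) / 500.0
train_ratios = np.bincount(y) / 500.0
assert_array_almost_equal(pred_ratios, train_ratios, decimal=1)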

@amueller
Member

I would imagine that the common tests complain as you don't do proper input validation. And also maybe because the models are pretty bad ^^
I'd prefer to do input validation during fit, as the user is using this only if she is in debug mode anyway.
The tests about performance should probably be skipped for these explicitly.

@larsmans
Member

All the examples do binary classification. It would be nice to have multiclass as well, and maybe symbolic labels.

Class labels.

`class_prior_` : array, shape = [n_classes]
Probability of each class.
Member

Shouldn't this be a scalar in the binary case? That's how we handle intercept_ in the linear models.

Member Author

I don't like treating the binary case differently.

Member Author

BTW, I called it class_prior_ to follow MultinomialNB. Do you treat the binary case differently in it?

Member

Neither do I, but it's a long-standing convention.

As for mimicking NB: I've been considering changing its behavior, I just never got round to it. It does some stuff differently from all the other estimators, and I feel it should be more like the other linear models.

@mblondel
Member Author

All the examples do binary classification.

I thought about it, but the way it's implemented, the number of classes doesn't change anything.

and maybe symbolic labels

Sure, I'll add that.

@mblondel
Member Author

I would imagine that the common tests complain as you don't do proper input validation

I haven't run the whole test suite yet. Damn, this PR is taking more time than I thought... (as a scikit-learn core developer, I should know better though :b)

@mblondel
Member Author

Man, you're fast :p

Regarding common tests, it would be nice if all assertions could have an explicit message. test_common was detecting the old random_classifier.pyc on my disk and running tests for it. Took me a while to figure out :)

@amueller
Member

fast = procrastinating ;)
Yes, better assertions with more explicit error message are somewhere in todos of the common tests (there are several issues on that).
I think @ibayer started some refactoring there.

@amueller
Member

About taking time: apart from the common test, feel free to ignore my comments.

And you really should have known better ;)

@mblondel
Member Author

About taking time: apart from the common test, feel free to ignore my comments.

---What are you referring to?---

Got you (I'm tired)

@mblondel
Member Author

Anything else?

@amueller
Member

+1 for merge :)

mblondel added a commit that referenced this pull request Nov 22, 2012
@mblondel merged commit 0e4a693 into scikit-learn:master on Nov 22, 2012
@mblondel
Member Author

Merged, thanks.

9 participants