Add paired test for cross-validation #12730
Comments
Is this what you're referring to? And can I take this on? |
@amueller , by package, do you mean …? | @MiladShahidi , if I understand correctly, does the article you cite argue against this approach? |
I found this talk in which he mentions Tidyposterior (Not the best name if you ask me!). And Tidyposterior cites that paper. Not sure what @amueller exactly means by paired test, but you're right @adrinjalali, in that this paper first beats the hell out of frequentist t-tests and then proposes a Bayesian approach. Maybe we could implement both? Even after all that criticism, t-test does not seem totally useless to me, and it is what most people may want to do, at least for now. |
@MiladShahidi yes that's what I meant. I would definitely include the frequentist test. The Bayesian paper doesn't really meet our inclusion criteria. |
Thanks @amueller for clarifying. Shall I start working on this then? |
Sure, go ahead @MiladShahidi! |
I've not looked at the references. Are we talking about a Wilcoxon rank sum test between pairs of candidates? Are we providing a method to do this between a specified pair? Calculating by default for all pairs for all scorers? Calculating for all adjacent pairs in order of descending score? |
I guess a single method would assume there's only one test, and do no multiple-test correction. We can of course blame the user if they do multiple tests and don't correct for them, but I guess doing the multiple tests ourselves and correcting for them would be nicer? Another option on top of the ones you mention is to have it for the best vs. all others, which somehow makes sense to me. One thing I'm wondering about is that the default CV is (will be) 3 (5); my question is, what is the power of the test with two sets of size 3 (or 5)? Would that be reliable? Or is my question not valid here? |
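For concreteness, here is a minimal sketch (not a proposed scikit-learn API) of what a paired test between two candidates' per-fold scores could look like, with a simple Bonferroni correction when several pairs are tested. The dataset, the two estimators, and the 10-fold CV are illustrative choices, not part of the proposal.

```python
# Hedged sketch: paired Wilcoxon signed-rank test on per-fold CV scores.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same splits for both candidates

candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "svc": SVC(gamma="scale"),
}
scores = {name: cross_val_score(est, X, y, cv=cv) for name, est in candidates.items()}

# Paired test on the fold-by-fold score differences.
stat, p = wilcoxon(scores["logreg"], scores["svc"])

# With k candidates compared pairwise there would be k * (k - 1) / 2 comparisons;
# here there is only one, so the Bonferroni correction is a no-op.
n_comparisons = 1
p_corrected = min(1.0, p * n_comparisons)
print(f"Wilcoxon statistic={stat:.3f}, corrected p-value={p_corrected:.3f}")
```

With only a handful of folds the signed-rank test has very little power, which is exactly the concern raised above about the default 3- or 5-fold CV.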
Your question is valid here, IMO. For more than 10 splits, the test statistic is said to approximate a normal distribution. For smaller sample sizes, I recall looking up the level of significance from a table. http://vassarstats.net/textbook/ch12a.html presents the test statistic for which critical p-values are reached for 5 <= N < 10... but that's not a p-value itself, nor corrected for multiple hypotheses. |
How about this: We report the (estimate of the) variance of each cross-validated metric over K folds by default, and provide an additional method for testing equality across specified pairs. One of the arguments of this method can be the type of correction (Bonferroni and friends!) the user wants to apply to the p-values to account for multiple testing. This will give people a hint that they may need to correct the p-values (or even throw a warning if more than one pair is given but no correction method is specified). To estimate the variance and perform that paired t-test, Nadeau and Bengio (2000) propose an estimator (and a t-test) that corrects for covariance across folds. Since Wilcoxon's rank test assumes that pairs are drawn independently, I think it is not suited to the context of cross-validation. But these guys are trying to take that covariance into account. And I agree that none of this stuff will make sense with less than 5 observations. Let me know what you think about this. |
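For reference, a hedged sketch of the Nadeau & Bengio (2000) corrected paired t-test on per-fold score differences. The function name and signature are illustrative, not an existing scikit-learn API; `n_train` and `n_test` are the fold sizes assumed roughly constant across splits.

```python
# Sketch of the "corrected resampled" paired t-test (Nadeau & Bengio, 2000).
import numpy as np
from scipy.stats import t as t_dist


def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Paired t-test with a variance correction that accounts for the
    dependence between folds induced by overlapping training sets."""
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    k = diff.shape[0]                 # number of folds / repetitions
    mean_diff = diff.mean()
    var_diff = diff.var(ddof=1)
    # Correction factor: 1/k as in the usual paired t-test, plus n_test/n_train
    # to compensate for the covariance across folds.
    corrected_var = (1.0 / k + n_test / n_train) * var_diff
    t_stat = mean_diff / np.sqrt(corrected_var)
    p_value = 2 * t_dist.sf(np.abs(t_stat), df=k - 1)  # two-sided
    return t_stat, p_value
```

The correction inflates the variance relative to the naive paired t-test, so it is more conservative, which is the point of using it on cross-validated scores.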
Milad, that sounds like a feature worth implementing and us reviewing a Pull Request for. |
Great. Will work on it. |
This would be a useful addition to sklearn, as many [typical] users do not realize there is a distribution over the performance (accuracy or RMSE etc.), and that its shape could vary in a substantial way. While I'd love to contribute to this in various ways (as I tried to with the discussion in #9631), I'll be busy in the short term, so I just want to bring a few pieces of info to your attention that might help the efforts in this direction. Important references on performance comparison of predictive models: https://crossinvalidation.com/2017/08/10/must-read-machine-learning-papers-%EF%BB%BF/ Various post hocs that could be used: https://github.com/raamana/scikit-posthocs (which could use help in adding more tests and comprehensive validation etc.). |
@MiladShahidi are you still working on this? Or should we put it up for grabs? |
I'm afraid I could not find the time to work on this. Would be great to have someone else look into it. |
I'm working on this at the Paris sprint and will submit a PR with a basic proposal to discuss in more detail |
Closing as fixed in #17432. |
It would be nice to add error bars for grid-search results. Right now I think we're showing std over folds, which is not the right thing (there's a nice talk & package by Max Kuhn about it, also mentioned in today's NIPS invited talk).
We should implement a paired test to have a better measure of uncertainty between parameter settings.
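As a rough illustration of what the issue asks for, the per-split scores can already be pulled out of `GridSearchCV.cv_results_` (the `split{i}_test_score` keys are the existing API) and compared fold by fold, rather than only reporting the std over folds. The estimator, grid, and the Wilcoxon test below are illustrative choices, not the proposed implementation.

```python
# Sketch: paired comparison of the two best grid-search candidates per fold.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=10).fit(X, y)

results = search.cv_results_
n_splits = search.n_splits_
# Per-fold test scores, shape (n_candidates, n_splits).
fold_scores = np.vstack(
    [results[f"split{i}_test_score"] for i in range(n_splits)]
).T

# Paired test between the best candidate and the runner-up.
order = np.argsort(results["mean_test_score"])[::-1]
best, second = order[0], order[1]
stat, p = wilcoxon(fold_scores[best], fold_scores[second])
print(f"best={results['params'][best]}, second={results['params'][second]}, p={p:.3f}")
```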