Add paired test for cross-validation #12730
Comments
Is this what you're referring to? And can I take this on? |
@amueller , by package, do you mean …? | @MiladShahidi , if I understand correctly, does the article you cite argue against this approach? |
I found this talk in which he mentions Tidyposterior (Not the best name if you ask me!). And Tidyposterior cites that paper. Not sure what @amueller exactly means by paired test, but you're right @adrinjalali, in that this paper first beats the hell out of frequentist t-tests and then proposes a Bayesian approach. Maybe we could implement both? Even after all that criticism, t-test does not seem totally useless to me, and it is what most people may want to do, at least for now. |
@MiladShahidi yes that's what I meant. I would definitely include the frequentist test. The Bayesian paper doesn't really meet our inclusion criteria. |
Thanks @amueller for clarifying. Shall I start working on this then? |
Sure, go ahead @MiladShahidi! |
I've not looked at the references. Are we talking about a Wilcoxon rank sum test between pairs of candidates? Are we providing a method to do this between a specified pair? Calculating by default for all pairs for all scorers? Calculating for all adjacent pairs in order of descending score? |
I guess a single method would assume there's only one test, and do no multiple-test correction. We can of course blame the user if they do multiple tests and don't correct for them, but I guess doing the multiple tests ourselves and correcting for them would be nicer? Another option on top of the ones you mention is to have it for the best vs. all others, which somehow makes sense to me. One thing I'm wondering about is that the default CV is (will be) 3 (5); my question is, what is the power of the test with two sets of size 3 (or 5)? Would that be reliable? Or is my question not valid here? |
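For concreteness, here is a minimal sketch (not a proposed scikit-learn API) of what a paired test between two candidates' per-fold scores could look like, with a simple Bonferroni correction when several pairs are tested. The dataset, the two estimators, and the 10-fold CV are illustrative choices, not part of the proposal.

```python
# Hedged sketch: paired Wilcoxon signed-rank test on per-fold CV scores.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same splits for both candidates

candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "svc": SVC(gamma="scale"),
}
scores = {name: cross_val_score(est, X, y, cv=cv) for name, est in candidates.items()}

# Paired test on the fold-by-fold score differences.
stat, p = wilcoxon(scores["logreg"], scores["svc"])

# With k candidates compared pairwise there would be k * (k - 1) / 2 comparisons;
# here there is only one, so the Bonferroni correction is a no-op.
n_comparisons = 1
p_corrected = min(1.0, p * n_comparisons)
print(f"Wilcoxon statistic={stat:.3f}, corrected p-value={p_corrected:.3f}")
```

With only a handful of folds the signed-rank test has very little power, which is exactly the concern raised above about the default 3- or 5-fold CV.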
Your question is valid here, IMO. For more than 10 splits, the test statistic is said to approximate a normal distribution. For smaller sample sizes, I recall looking up the level of significance from a table. http://vassarstats.net/textbook/ch12a.html presents the test statistic for which critical p-values are reached for 5 <= N < 10... but that's not a p-value itself, nor corrected for multiple hypotheses. |
How about this: We report the (estimate of the) variance of each cross-validated metric over K folds by default, and provide an additional method for testing equality across specified pairs. One of the arguments of this method can be the type of correction (Bonferroni and friends!) the user wants to apply to the p-values to account for multiple testing. This will give people a hint that they may need to correct the p-values (or even throw a warning if more than one pair is given but no correction method is specified). To estimate the variance and perform that paired t-test, Nadeau and Bengio (2000) propose an estimator (and a t-test) that corrects for covariance across folds. Since Wilcoxon's rank test assumes that pairs are drawn independently, I think it is not suited to the context of cross-validation. But these guys are trying to take that covariance into account. And I agree that none of this stuff will make sense with less than 5 observations. Let me know what you think about this. |
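For reference, a hedged sketch of the Nadeau & Bengio (2000) corrected paired t-test on per-fold score differences. The function name and signature are illustrative, not an existing scikit-learn API; `n_train` and `n_test` are the fold sizes assumed roughly constant across splits.

```python
# Sketch of the "corrected resampled" paired t-test (Nadeau & Bengio, 2000).
import numpy as np
from scipy.stats import t as t_dist


def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Paired t-test with a variance correction that accounts for the
    dependence between folds induced by overlapping training sets."""
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    k = diff.shape[0]                 # number of folds / repetitions
    mean_diff = diff.mean()
    var_diff = diff.var(ddof=1)
    # Correction factor: 1/k as in the usual paired t-test, plus n_test/n_train
    # to compensate for the covariance across folds.
    corrected_var = (1.0 / k + n_test / n_train) * var_diff
    t_stat = mean_diff / np.sqrt(corrected_var)
    p_value = 2 * t_dist.sf(np.abs(t_stat), df=k - 1)  # two-sided
    return t_stat, p_value
```

The correction inflates the variance relative to the naive paired t-test, so it is more conservative, which is the point of using it on cross-validated scores.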
Milad, that sounds like a feature worth implementing and us reviewing a Pull Request for. |
Great. Will work on it. |
This would be a useful addition to sklearn, as many [typical] users do not realize there is a distribution over the performance (accuracy or RMSE etc.), and that its shape could vary in a substantial way. While I'd love to contribute to this in various ways (as I tried to with the discussion in #9631), I'll be busy in the short term, so I just want to bring a few pieces of info to your attention that might help the efforts in this direction. Important references on performance comparison of predictive models: https://crossinvalidation.com/2017/08/10/must-read-machine-learning-papers-%EF%BB%BF/ Various post hocs that could be used: https://github.com/raamana/scikit-posthocs (which could use help in adding more tests and comprehensive validation etc.). |
@MiladShahidi are you still working on this? Or should we put it up for grabs? |
I'm afraid I could not find the time to work on this. Would be great to have someone else look into it. |
I'm working on this at the Paris sprint and will submit a PR with a basic proposal to discuss in more detail |
Closing as fixed in #17432. |
It would be nice to add error bars for grid-search results. Right now I think we're showing std over folds, which is not the right thing (there's a nice talk & package by Max Kuhn about it, also mentioned in today's NIPS invited talk).
We should implement a paired test to have a better measure of uncertainty between parameter settings.
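As a rough illustration of what the issue asks for, the per-split scores can already be pulled out of `GridSearchCV.cv_results_` (the `split{i}_test_score` keys are the existing API) and compared fold by fold, rather than only reporting the std over folds. The estimator, grid, and the Wilcoxon test below are illustrative choices, not the proposed implementation.

```python
# Sketch: paired comparison of the two best grid-search candidates per fold.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=10).fit(X, y)

results = search.cv_results_
n_splits = search.n_splits_
# Per-fold test scores, shape (n_candidates, n_splits).
fold_scores = np.vstack(
    [results[f"split{i}_test_score"] for i in range(n_splits)]
).T

# Paired test between the best candidate and the runner-up.
order = np.argsort(results["mean_test_score"])[::-1]
best, second = order[0], order[1]
stat, p = wilcoxon(fold_scores[best], fold_scores[second])
print(f"best={results['params'][best]}, second={results['params'][second]}, p={p:.3f}")
```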