DOC Add grid search stats example by martinagvilas · Pull Request #17432 · scikit-learn/scikit-learn · GitHub

DOC Add grid search stats example #17432


Merged
merged 24 commits into scikit-learn:master on Nov 6, 2020

Conversation

martinagvilas
Contributor
@martinagvilas martinagvilas commented Jun 3, 2020

This PR implements the example discussed in #16344 and solves #12730

The example shows how to run a statistical comparison of the results of a grid search, using both a frequentist and a Bayesian approach.

As discussed in #16344, it implements the most conventional ways of running this statistical comparison.
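To make the discussion concrete, here is a minimal sketch (not the PR's actual code; the dataset, grid and variable names are illustrative) of the kind of setup being compared: a grid search over SVC kernels whose per-split test scores are the inputs to the frequentist and Bayesian comparisons.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
param_grid = [
    {"kernel": ["linear"]},
    {"kernel": ["poly"], "degree": [2, 3]},
    {"kernel": ["rbf"]},
]
search = GridSearchCV(SVC(), param_grid, cv=cv).fit(X, y)

# one column of per-split test scores per candidate model; pairwise
# differences of these columns are what the statistical tests compare
split_keys = [k for k in search.cv_results_
              if k.startswith("split") and k.endswith("test_score")]
scores = np.vstack([search.cv_results_[k] for k in split_keys])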

cc @adrinjalali who worked with me on this during the January sprint

@martinagvilas martinagvilas changed the title from "Add grid search stats example" to "DOC Add grid search stats example" on Jun 3, 2020
@lucyleeow
Member

For some reason the plots aren't showing up in the built doc: https://105431-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/grid_search_stats.html
Not sure why. This existing example also uses seaborn and the plots work fine: https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html

@martinagvilas
Contributor Author

@lucyleeow thanks for letting me know! It's weird because I could see the outputs when building locally by running `EXAMPLES_PATTERN=grid_search_stats make html` in the doc folder. I will take a look to see if I'm missing something.

@lucyleeow
Member
lucyleeow commented Jun 3, 2020

No worries. Maybe @cmarmo will know why?

@NicolasHug
Member

Try adding a link to the example from grid_search.rst, and also maybe push a commit with [doc build] in the message https://scikit-learn.org/stable/developers/contributing.html#continuous-integration-ci

@cmarmo
Contributor
cmarmo commented Jun 3, 2020

Hi @martinagvilas, do you mind trying to remove the `plt.show` directives from your code? In principle, you do not need them with seaborn, and I'm under the impression that adding them hides the output of the previous command. I'm wondering if your version of sphinx-gallery and the one used in CI are different.
Also, apparently forcing the complete build takes too much time and CircleCI is not happy: I don't know how to solve that... but you can try removing the [doc build] directive for now.
Hope this could be helpful.

@NicolasHug
Member

Re the doc building issue, it's not related to this PR, it's happening on master as well.
I'm not quite sure what's happening though

@martinagvilas
Contributor Author

Seems to be deploying correctly now here. Thanks for the help!

Member
@lucyleeow lucyleeow left a comment


This was really interesting and I think people will find it really useful. I just have a few suggestions.

Also FYI the notation

:class:`sklearn.model_selection.GridSearchCV`

gives you the whole class path 'sklearn.model_selection.GridSearchCV' linked in the built doc, whereas if you add ~:

:class:`~sklearn.model_selection.GridSearchCV`

it gives you just 'GridSearchCV' linked, which can be nicer in the body of the text.

sns.scatterplot(
X[:, 0], X[:, 1], hue=y,
marker='o', s=25, edgecolor='k', legend=False
).set_title("Data")
Member

I think we should keep the plt.show() at the end of plotting code blocks as it prevents the text output (e.g., Text(0.5, 1.0, 'Data')) below the plot. I don't think it was the cause of the plotting problem before.

Member

plt.show() will stop the execution of the code. That's why we would need it at the end.

Member

If I recall correctly.
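For context, a minimal sketch (not the example's actual code) of the plt.show() pattern discussed above: ending a notebook-style plotting block with plt.show() so that only the figure, and not the repr of the last return value, is rendered.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 50))

plt.scatter(x, y)
plt.title("Data")  # returns a Text object that would otherwise be echoed below the plot
plt.show()         # close the block so only the figure is shown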

@martinagvilas
Contributor Author

@lucyleeow thanks for the feedback! I agree with all your suggestions, so I have made the necessary changes.

@lucyleeow
Member

Looks great! @adrinjalali or @ogrisel might be able to take a look?

@adrinjalali
Member

@amueller is interested in this as well I think.

@glemaitre glemaitre self-requested a review June 16, 2020 08:48
Member
@glemaitre glemaitre left a comment


I find the example super interesting and useful.

I recall something linked to bootstrapping to compare models from the original thread pointed out by @ogrisel. Would it be relevant here?


print(__doc__)

import numpy as np
Member

Since we have a nice notebook-style example, I would rather delay each import to the "cell" (part of the code) where it is used. It will be more intuitive.

Contributor Author

I changed the example to make the imports in the respective cells, but I don't know if you were suggesting to do that only for the import-from statements while leaving whole-package imports (e.g. import numpy as np) at the beginning, like in this example.
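For illustration, a toy sketch (not the example's actual content) of the layout in question: sphinx-gallery cells separated by `# %%`, with whole-package imports at the top and other imports deferred to the cell that first uses them.

import numpy as np

# %%
# Generate some data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# %%
# Import plotting tools only in the cell that first needs them.
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1])
plt.show()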


@martinagvilas
Contributor Author
martinagvilas commented Jun 18, 2020

@glemaitre thanks for the review! I implemented most of your suggestions and replied to the ones where I have some questions/comments before implementing the changes.

Regarding the bootstrap approach: I think in the end the suggestion was to start the PR showing the most basic approaches and then think about the bootstrap approach that @ogrisel suggested. I would be happy to include it in this example or in another one if this is too long already.

Member
@lorentzenchr lorentzenchr left a comment


A first round.
If I have not said it before, I really appreciate this example :rocket:

@lorentzenchr
Member

@martinagvilas Thanks for your update.
Could we at least mention that t-tests for competing predictions are also known as the Diebold-Mariano test? This would link to the forecasting literature, thus connecting domains that IMHO should connect more, and improve Google searches.

Member
@adrinjalali adrinjalali left a comment


Some minor points, otherwise looks pretty good to me.

# credible intervals which compute the range of values that have a particular
# probability of containing the true mean of differences.
#
# Let's determine the credible intervals of our data using 50%, 75% and 95%:
Member

I don't understand the credible interval from this text, maybe expand a tiny bit?

Member

Some discussion in #17432 (comment).

Contributor Author

Is it clearer now? I could also link to some website explaining this concept in more detail. I also provided an intuition of the results below the table.
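For readers following along, a rough sketch of how such credible intervals can be read off a Student's t posterior, assuming a frozen scipy.stats distribution named t_post as in the example; the numbers below are made up and stand in for the example's corrected statistics.

from scipy import stats

# hypothetical stand-ins for the example's corrected statistics
mean_diff, corrected_std, dof = 0.01, 0.02, 99
t_post = stats.t(dof, loc=mean_diff, scale=corrected_std)

for prob in (0.50, 0.75, 0.95):
    low, high = t_post.interval(prob)
    print(f"{prob:.0%} credible interval: [{low:.3f}, {high:.3f}]")

Each interval is the central region of the posterior that contains the stated probability mass of the mean difference.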

Comment on lines +457 to +461
# We observe that after correcting for multiple comparisons, the only model
# that significantly differs from the others is `'2_poly'`.
# `'rbf'`, the model ranked first by
# :class:`~sklearn.model_selection.GridSearchCV`, does not significantly
# differ from `'linear'` or `'3_poly'`.
Member

there are 3 p-values < 0.05 though, hmm, and the rbf-linear p-value is 0 on the table.

Contributor Author

Thanks for spotting this! This is the result of the error in the index of the dataframe that you mentioned earlier.
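As a side note on the correction being discussed, a hedged illustration (made-up p-values; the example may use a different procedure) of a Bonferroni-style adjustment: with k models there are k*(k-1)/2 pairwise tests, so each raw p-value is multiplied by the number of comparisons and capped at 1.

import numpy as np

raw_p_values = np.array([0.001, 0.020, 0.150, 0.400, 0.700, 0.900])  # made-up values
n_comparisons = len(raw_p_values)  # k*(k-1)/2 pairwise tests for k models
corrected = np.minimum(raw_p_values * n_comparisons, 1.0)
print(corrected)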

Comment on lines +502 to +506
# :class:`~sklearn.model_selection.GridSearchCV` `'rbf'`, has approximately a
# 6.8% chance of being worse than `'linear'`, and a 1.8% chance of being worse
# than `'3_poly'`.
# `'rbf'` and `'linear'` have a 43% probability of being practically
# equivalent, while `'rbf'` and `'3_poly'` have a 10% chance of being so.
Member

again, the table seems to show a different story I guess.

Contributor Author

thanks! same as above.
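For completeness, a sketch (not the example's exact code; the posterior parameters and ROPE width are illustrative) of how such probabilities can be obtained from the posterior of the mean difference using a region of practical equivalence (ROPE).

from scipy import stats

t_post = stats.t(99, loc=0.005, scale=0.01)  # hypothetical posterior of the mean difference
rope = 0.01                                  # illustrative region of practical equivalence

worse_prob = t_post.cdf(0.0)                       # P(mean difference < 0)
rope_prob = t_post.cdf(rope) - t_post.cdf(-rope)   # P(|mean difference| < rope)
print(f"P(worse): {worse_prob:.1%}  P(practically equivalent): {rope_prob:.1%}")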

@martinagvilas
Contributor Author

Thanks for the review @adrinjalali. I have updated the example with your suggestions (and correcting the errors you spotted).

Could we at least mention that t-tests for competing predictions are also known as Diebold-Mariano test? This would link to the forecast literature, thus connecting domains that IMHO should connect more and improve google searches.

@lorentzenchr could you clarify what you mean by this? I'm not familiar with the forecasting literature, but from the references you suggested I couldn't gather where this statement is made. You must be right, but I'm just wary of adding something that I don't fully understand.

@martinagvilas
Contributor Author

@lorentzenchr all the changes you requested have been added now

Member
@lorentzenchr lorentzenchr left a comment


LGTM, finally :smirk: Apart from a few tiny nitpicks.

@lorentzenchr
Member
lorentzenchr commented Nov 2, 2020

Diebold Mariano Test

Very short: the task of point forecasting is the prediction of a future event. If you have several forecasts, how do you decide which one is better? Take forecasts of the past where you meanwhile know the outcome (out-of-time) and compare the realized average loss/scoring functions (choose a scoring function that is strictly consistent for the functional of interest, often the expectation or a quantile). This type of testing is also known as backtesting and is very familiar to, if not almost the same as, out-of-sample comparison of the predictiveness of machine learning models.
The focus in forecasting is out-of-time; the focus in ML is often out-of-sample.

[1] Chapter 1.1. proposes to use a t-test on the difference of the realized average losses. It also discusses the correlation of samples which corresponds to our Nadeau and Bengio corrected estimation of the variance.

In the end, it's just a t-test that is nowadays called a Diebold-Mariano (type) test in the context of forecast comparison.
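For concreteness, a rough sketch of the corrected paired t-test referred to above (Nadeau and Bengio's variance correction); the score differences and fold sizes are made up and the variable names are illustrative.

import numpy as np
from scipy import stats

# per-split differences in test scores between two models (made-up numbers)
differences = np.array([0.02, 0.01, -0.01, 0.03, 0.00, 0.02, 0.01, -0.02, 0.04, 0.01])
n_train, n_test = 270, 30  # hypothetical train/test fold sizes

n = differences.shape[0]
mean = differences.mean()
# corrected variance: (1/n + n_test/n_train) times the sample variance of the differences
corrected_var = (1 / n + n_test / n_train) * differences.var(ddof=1)
t_stat = mean / np.sqrt(corrected_var)
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 1)
print(t_stat, p_value)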

References of forecast comparison

References that mention a Diebold-Mariano test

Disclosure: I know some of the authors of [7].

@martinagvilas Is this enough explanation or does it more confuse than clarify?

Member
@NicolasHug NicolasHug left a comment


Thanks @martinagvilas , great example!
I have a bunch of comments but they're mostly nits. LGTM anyway

plt.show()

# %%
# We will compare the performance of SVC estimators that vary on their `kernel`
Member

Maybe link to SVC's page, or just write "Support Vector Classifier estimators"

# St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)
#
# where :math:`\mu` represents the mean difference in performance of the
Member

Should mu get a hat?

Contributor Author

The description of mu in this case was confusing; I changed it to say the posterior of the mean difference in performance. In this case mu shouldn't get a hat because it represents the estimated posterior of the mean parameter, and thus is not a point estimate (see this paper). But this was not clear before, as you pointed out.

@martinagvilas
Contributor Author

@martinagvilas Is this enough explanation or does it more confuse than clarify?

Seems clear to me! Thanks for taking the time to explain.

I have addressed your final issues; hope it looks okay.

@martinagvilas
Contributor Author

Thanks for the review @NicolasHug! I have added your suggestions now.

Member
@NicolasHug NicolasHug left a comment


Thanks @martinagvilas , I realized my #17432 (comment) made little sense, but this led me to some additional questions

Comment on lines 270 to 272
# model is better than the second. Bayesian estimation will output a
# distribution that specifies the probability of each parameter value, which in
# this example is the mean of the differences in the performance of two models.
Member

Just wondering: is the random variable that we're trying to model really the mean (i.e. expectation) of the differences, or is it actually the differences themselves? The description and the rest of the text suggest the former.

Also (almost) unrelated, but may I suggest defining mu here already:

Bayesian estimation will output a distribution that specifies the distribution followed by the mean :math:`\mu` of the differences in the performance of two models.

(if what we model isn't the average diff but just the diff, then we can just reformulate and maybe switch mu for some other letter)

Contributor Author

Thanks for noticing this and all of the comments below, the wording is indeed wrong in many places.

Answering the comment here: we are indeed trying to model the mean of the differences.

# this example is the mean of the differences in the performance of two models.
#
# To obtain the posterior distribution we need to define a prior that models
# our beliefs of how the means are distributed before looking at the data,
Member

Should this be "mean" (without an s)?
Or should this be "differences"?

Contributor Author
Contributor Author
@martinagvilas martinagvilas Nov 6, 2020


you are right, this should say "how the mean is distributed"

plt.xticks(np.arange(-0.04, 0.06, 0.01))
plt.fill_between(x, t_post.pdf(x), 0, facecolor='blue', alpha=.2)
plt.ylabel("Probability density")
plt.xlabel("Mean difference")
Member

Suggested change
plt.xlabel("Mean difference")
plt.xlabel("Mean difference $\mu$")

Comment on lines 276 to 277
# and multiply it by a likelihood function that computes how likely our
# observed mean differences are, given the values that the mean of differences
Member

about "observed mean differences":

I find this a bit strange because, as far as I understand, we don't really observe multiple "mean differences"; we observe multiple differences, but these account for just one empirical mean difference?


Comment on lines 297 to 298
# where :math:`\mu` represents the posterior of the mean difference in
# performance,
Member

strictly speaking, I believe that the posterior is the distribution, not mu itself?

If we go with my suggested wording above where mu is defined, we probably don't need to specify it again here

Contributor Author

The posterior is the distribution of the mean of the differences.

@martinagvilas
Contributor Author

@NicolasHug thanks for all these comments. Did my answers make sense? If so I will add the suggestions. But if it is still confusing let me know.

I think the confusion with all these wordings comes from the fact that we are using a Bayesian approach with a closed-form solution and not a sampling approach. In the closed-form solution we are marginalizing over the variance already, so our posterior is only the posterior of the mean.

I think one part I missed from the description in this text is the fact that (in a standard sampling approach) we should have a prior for the mean of the differences, as well as a prior for the variance. The likelihood should then be computed with respect to the different combinations of the values of the mean and the variance. After obtaining the posterior we would then obtain the marginal posterior density of mu. This is clearer in the box named "Bayesian correlated t-test" in this paper. Should I correct the description in the text to add this also?

@NicolasHug
Member

Thanks @martinagvilas, yes it makes sense to me now.

we should have a prior for the mean of the differences, as well as a prior for the variance

From the ref it seems that the normal-gamma prior is in fact a joint prior, for both the mean and the variance. This wasn't too shocking to me (perhaps because I'm not super familiar with Bayesian stats), so I would not mention the closed-form vs sampling discussion. Since the ref is properly referenced in this section, I think it's fine to be a little succinct and leave out "details" (like the generative distribution, etc.)

@martinagvilas
Contributor Author

@NicolasHug great! I added only your suggestions then.

Member
@lorentzenchr lorentzenchr left a comment


I tried to clarify some Bayesian methodology - as suggestions.

As this PR already has 2 approvals, it's ready for merge in my opinion. We can also improve in follow-up PRs.

Comment on lines +271 to +272
# distribution followed by the mean :math:`\mu` of the differences in the
# performance of two models.
Member

Suggested change
# distribution followed by the mean :math:`\mu` of the differences in the
# performance of two models.
# posterior distribution :math:`p(\mu|data and prior belief)` of the mean
# difference in performance of two models, that is :math:`\mu`.

Comment on lines +280 to +282
# Bayesian estimation can be carried out in many forms to answer our question,
# but in this example we will implement the approach suggested by Benavoli and
# collegues [4]_.
Member

Suggested change
# Bayesian estimation can be carried out in many forms to answer our question,
# but in this example we will implement the approach suggested by Benavoli and
# collegues [4]_.
# Bayesian estimation can be carried out in many forms to answer our question,
# but in this example we will apply the closed-form expression approach suggested
# by Benavoli and collegues [4]_.

Comment on lines +284 to +291
# One way of defining our posterior using a closed-form expression is to select
# a prior conjugate to the likelihood function. Benavoli and collegues [4]_
# show that when comparing the performance of two classifiers we can model the
# prior as a Normal-Gamma distribution (with both mean and variance unknown)
# conjugate to a normal likelihood, to thus express the posterior as a normal
# distribution.
# Marginalizing out the variance from this normal posterior, we can define the
# posterior of the mean parameter as a Student's t-distribution. Specifically:
Member

Suggested change
# One way of defining our posterior using a closed-form expression is to select
# a prior conjugate to the likelihood function. Benavoli and collegues [4]_
# show that when comparing the performance of two classifiers we can model the
# prior as a Normal-Gamma distribution (with both mean and variance unknown)
# conjugate to a normal likelihood, to thus express the posterior as a normal
# distribution.
# Marginalizing out the variance from this normal posterior, we can define the
# posterior of the mean parameter as a Student's t-distribution. Specifically:
# One standard way to obtain a closed-form posterior distribution is to select
# a prior conjugate to the likelihood function. Benavoli and collegues [4]_
# use a Normal distribution for the data generating process, i.e. the
# likelihood for the observed differences in performance, and a Normal-Gamma
# distribution (with both mean and variance unknown) as conjugate prior. The
# posterior of mean and variance is then itself a Normal-Gamma distribution.
# Choosing some prior parameters and marginalizing out the variance gives us a
# Student's t-distribution as posterior for the mean parameter :math:`\mu`.
# Specifically:

Member

Or, as @NicolasHug suggested, we make it more succinct and just state that [4] uses conjugate prior, makes some further assumptions and then it turns out to be the Student's t.

Comment on lines +301 to +302
# and :math:`\hat{\sigma}^2` represents the variance of the observed
# differences.
Member

Suggested change
# and :math:`\hat{\sigma}^2` represents the variance of the observed
# differences.
# and :math:`\hat{\sigma}^2` represents the estimated variance of the observed
# differences.

Or empirical?

Comment on lines +294 to +295
# St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)
Member

Suggested change
# St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)
# p(\mu|data and prior belief) =
# St(\mu; n-1, \overline{x}, (\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)

@NicolasHug
Member

I'd be +1 to merge now considering how long this has been opened.

@lorentzenchr
Member

I'd be +1 to merge now considering how long this has been opened.

@martinagvilas Final wish for change or merge as is?

@martinagvilas
Contributor Author

@lorentzenchr @NicolasHug maybe we can merge this now and if others find it confusing further suggestions can be added in a separate PR?

@NicolasHug NicolasHug merged commit 5e85a65 into scikit-learn:master Nov 6, 2020
@NicolasHug
Member

Thanks a lot @martinagvilas !
