DOC Add grid search stats example by martinagvilas · Pull Request #17432 · scikit-learn/scikit-learn · GitHub

DOC Add grid search stats example #17432


Merged
merged 24 commits into scikit-learn:master on Nov 6, 2020

Conversation

martinagvilas
Contributor
@martinagvilas martinagvilas commented Jun 3, 2020

This PR implements the example discussed in #16344 and solves #12730

The example shows how to run a statistical comparison of the results of a grid search, using both a frequentist and a Bayesian approach.

As discussed in #16344, it implements the most conventional ways of running this statistical comparison.
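To make the discussion concrete, here is a minimal sketch (not the PR's actual code; the dataset, grid and variable names are illustrative) of the kind of setup being compared: a grid search over SVC kernels whose per-split test scores are the inputs to the frequentist and Bayesian comparisons.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
param_grid = [
    {"kernel": ["linear"]},
    {"kernel": ["poly"], "degree": [2, 3]},
    {"kernel": ["rbf"]},
]
search = GridSearchCV(SVC(), param_grid, cv=cv).fit(X, y)

# one column of per-split test scores per candidate model; pairwise
# differences of these columns are what the statistical tests compare
split_keys = [k for k in search.cv_results_
              if k.startswith("split") and k.endswith("test_score")]
scores = np.vstack([search.cv_results_[k] for k in split_keys])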

cc @adrinjalali who worked with me on this during the January sprint

@martinagvilas martinagvilas changed the title from "Add grid search stats example" to "DOC Add grid search stats example" on Jun 3, 2020
@lucyleeow
Member

For some reason the plots aren't showing up in the built doc: https://105431-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/grid_search_stats.html
Not sure why. This existing example also uses seaborn and the plots work fine: https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html

@martinagvilas
Contributor Author

@lucyleeow thanks for letting me know! It's weird because I could see the outputs when building locally by running `EXAMPLES_PATTERN=grid_search_stats make html` in the doc folder. I will take a look to see if I'm missing something.

@lucyleeow
Member
lucyleeow commented Jun 3, 2020

No worries. Maybe @cmarmo will know why?

@NicolasHug
Member

Try adding a link to the example from grid_search.rst, and also maybe push a commit with [doc build] in the message https://scikit-learn.org/stable/developers/contributing.html#continuous-integration-ci

@cmarmo
Contributor
cmarmo commented Jun 3, 2020

Hi @martinagvilas, do you mind trying to remove the `plt.show` directives from your code? In principle, you do not need them with seaborn, and I'm under the impression that adding them hides the output of the previous command. I'm wondering if your version of sphinx-gallery and the one used in CI are different.
Also, apparently forcing the complete build takes too much time and CircleCI is not happy: I don't know how to solve that... but you can try removing the [doc build] directive for now.
Hope this could be helpful.

@NicolasHug
Member

Re the doc building issue, it's not related to this PR, it's happening on master as well.
I'm not quite sure what's happening though

@martinagvilas
Contributor Author

Seems to be deploying correctly now here. Thanks for the help!

Member
@lucyleeow lucyleeow left a comment


This was really interesting and I think people will find it really useful. I just have a few suggestions.

Also FYI the notation

:class:`sklearn.model_selection.GridSearchCV`

gives you the whole class path 'sklearn.model_selection.GridSearchCV' linked in the built doc, whereas if you add ~:

:class:`~sklearn.model_selection.GridSearchCV`

it gives you just 'GridSearchCV' linked, which can be nicer in the body of the text.

sns.scatterplot(
X[:, 0], X[:, 1], hue=y,
marker='o', s=25, edgecolor='k', legend=False
).set_title("Data")
Member

I think we should keep the plt.show() at the end of plotting code blocks as it prevents the text output (e.g., Text(0.5, 1.0, 'Data')) below the plot. I don't think it was the cause of the plotting problem before.

Member

plt.show() will stop the execution of the code. That's why we would need it at the end.

Member

If I recall correctly.
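For context, a minimal sketch (not the example's actual code) of the plt.show() pattern discussed above: ending a notebook-style plotting block with plt.show() so that only the figure, and not the repr of the last return value, is rendered.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 50))

plt.scatter(x, y)
plt.title("Data")  # returns a Text object that would otherwise be echoed below the plot
plt.show()         # close the block so only the figure is shown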

@martinagvilas
Contributor Author

@lucyleeow thanks for the feedback! I agree with all your suggestions, so I have made the necessary changes.

@lucyleeow
Member

Looks great! @adrinjalali or @ogrisel might be able to take a look?

@adrinjalali
Member

@amueller is interested in this as well I think.

@glemaitre glemaitre self-requested a review June 16, 2020 08:48
Member
@glemaitre glemaitre left a comment


I find the example super interesting and useful.

I recall something linked to bootstrapping to compare models from the original thread pointed out by @ogrisel. Would it be relevant here?


print(__doc__)

import numpy as np
Member

Since we have a nice notebook-style example, I would rather delay each import to the "cell" (part of the code) where it is used. It will be more intuitive.

Contributor Author

I changed the example to make the imports in the respective cells, but I don't know if you were suggesting to do that only for the import-from statements while leaving whole-package imports (e.g. import numpy as np) at the beginning, like in this example.
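For illustration, a toy sketch (not the example's actual content) of the layout in question: sphinx-gallery cells separated by `# %%`, with whole-package imports at the top and other imports deferred to the cell that first uses them.

import numpy as np

# %%
# Generate some data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# %%
# Import plotting tools only in the cell that first needs them.
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1])
plt.show()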


@martinagvilas
Contributor Author
martinagvilas commented Jun 18, 2020

@glemaitre thanks for the review! I implemented most of your suggestions and replied to the ones where I have some questions/comments before implementing the changes.

Regarding the bootstrap approach: I think in the end the suggestion was to start the PR showing the most basic approaches and then think about the bootstrap approach that @ogrisel suggested. I would be happy to include it in this example or in another one if this is too long already.

Member
@lorentzenchr lorentzenchr left a comment


A first round.
If I have not said it before, I really appreciate this example :rocket:

@lorentzenchr
Member

@martinagvilas Thanks for your update.
Could we at least mention that t-tests for competing predictions are also known as the Diebold-Mariano test? This would link to the forecasting literature, thus connecting domains that IMHO should connect more, and improve Google searches.

Member
@adrinjalali adrinjalali left a comment


Some minor points, otherwise looks pretty good to me.

# credible intervals which compute the range of values that have a particular
# probability of containing the true mean of differences.
#
# Let's determine the credible intervals of our data using 50%, 75% and 95%:
Member

I don't understand the credible interval from this text, maybe expand a tiny bit?

Member

Some discussion in #17432 (comment).

Contributor Author

Is it clearer now? I could also link to some website explaining this concept in more detail. I also provided an intuition of the results below the table.
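For readers following along, a rough sketch of how such credible intervals can be read off a Student's t posterior, assuming a frozen scipy.stats distribution named t_post as in the example; the numbers below are made up and stand in for the example's corrected statistics.

from scipy import stats

# hypothetical stand-ins for the example's corrected statistics
mean_diff, corrected_std, dof = 0.01, 0.02, 99
t_post = stats.t(dof, loc=mean_diff, scale=corrected_std)

for prob in (0.50, 0.75, 0.95):
    low, high = t_post.interval(prob)
    print(f"{prob:.0%} credible interval: [{low:.3f}, {high:.3f}]")

Each interval is the central region of the posterior that contains the stated probability mass of the mean difference.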

Comment on lines +457 to +461
# We observe that after correcting for multiple comparisons, the only model
# that significantly differs from the others is `'2_poly'`.
# `'rbf'`, the model ranked first by
# :class:`~sklearn.model_selection.GridSearchCV`, does not significantly
# differ from `'linear'` or `'3_poly'`.
Member

there are 3 p-values < 0.05 though, hmm, and the rbf-linear p-value is 0 on the table.

Contributor Author

Thanks for spotting this! This is the result of the error in the index of the dataframe that you mentioned earlier.
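As a side note on the correction being discussed, a hedged illustration (made-up p-values; the example may use a different procedure) of a Bonferroni-style adjustment: with k models there are k*(k-1)/2 pairwise tests, so each raw p-value is multiplied by the number of comparisons and capped at 1.

import numpy as np

raw_p_values = np.array([0.001, 0.020, 0.150, 0.400, 0.700, 0.900])  # made-up values
n_comparisons = len(raw_p_values)  # k*(k-1)/2 pairwise tests for k models
corrected = np.minimum(raw_p_values * n_comparisons, 1.0)
print(corrected)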

Comment on lines +502 to +506
# :class:`~sklearn.model_selection.GridSearchCV` `'rbf'`, has approximately a
# 6.8% chance of being worse than `'linear'`, and a 1.8% chance of being worse
# than `'3_poly'`.
# `'rbf'` and `'linear'` have a 43% probability of being practically
# equivalent, while `'rbf'` and `'3_poly'` have a 10% chance of being so.
Member

again, the table seems to show a different story I guess.

Contributor Author

thanks! same as above.
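For completeness, a sketch (not the example's exact code; the posterior parameters and ROPE width are illustrative) of how such probabilities can be obtained from the posterior of the mean difference using a region of practical equivalence (ROPE).

from scipy import stats

t_post = stats.t(99, loc=0.005, scale=0.01)  # hypothetical posterior of the mean difference
rope = 0.01                                  # illustrative region of practical equivalence

worse_prob = t_post.cdf(0.0)                       # P(mean difference < 0)
rope_prob = t_post.cdf(rope) - t_post.cdf(-rope)   # P(|mean difference| < rope)
print(f"P(worse): {worse_prob:.1%}  P(practically equivalent): {rope_prob:.1%}")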

@martinagvilas
Contributor Author

Thanks for the review @adrinjalali. I have updated the example with your suggestions (and correcting the errors you spotted).

Could we at least mention that t-tests for competing predictions are also known as Diebold-Mariano test? This would link to the forecast literature, thus connecting domains that IMHO should connect more and improve google searches.

@lorentzenchr could you clarify what you mean by this? I'm not familiar with the forecasting literature, but from the references you suggested I couldn't gather where this statement is made. You must be right, but I'm just wary of adding something that I don't fully understand.

@martinagvilas
Contributor Author

@lorentzenchr all the changes you requested have been added now

Member
@lorentzenchr lorentzenchr left a comment


LGTM, finally :smirk: Apart from a few tiny nitpicks.

@lorentzenchr
Member
lorentzenchr commented Nov 2, 2020

Diebold Mariano Test

Very short: the task of point forecasting is the prediction of a future event. If you have several forecasts, how do you decide which one is better? Take forecasts of the past where you meanwhile know the outcome (out-of-time) and compare the realized average loss/scoring functions (choose a scoring function that is strictly consistent for the functional of interest, often the expectation or a quantile). This type of testing is also known as backtesting and is very familiar to, if not almost the same as, out-of-sample comparison of the predictiveness of machine learning models.
The focus in forecasting is out-of-time; the focus in ML is often out-of-sample.

[1] Chapter 1.1. proposes to use a t-test on the difference of the realized average losses. It also discusses the correlation of samples which corresponds to our Nadeau and Bengio corrected estimation of the variance.

In the end, it's just a t-test that is nowadays called a Diebold-Mariano (type) test in the context of forecast comparison.
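For concreteness, a rough sketch of the corrected paired t-test referred to above (Nadeau and Bengio's variance correction); the score differences and fold sizes are made up and the variable names are illustrative.

import numpy as np
from scipy import stats

# per-split differences in test scores between two models (made-up numbers)
differences = np.array([0.02, 0.01, -0.01, 0.03, 0.00, 0.02, 0.01, -0.02, 0.04, 0.01])
n_train, n_test = 270, 30  # hypothetical train/test fold sizes

n = differences.shape[0]
mean = differences.mean()
# corrected variance: (1/n + n_test/n_train) times the sample variance of the differences
corrected_var = (1 / n + n_test / n_train) * differences.var(ddof=1)
t_stat = mean / np.sqrt(corrected_var)
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 1)
print(t_stat, p_value)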

References of forecast comparison

References that mention a Diebold-Mariano test

Disclosure: I know some of the authors of [7].

@martinagvilas Is this enough explanation or does it more confuse than clarify?

Member
@NicolasHug NicolasHug left a comment


Thanks @martinagvilas , great example!
I have a bunch of comments but they're mostly nits. LGTM anyway

plt.show()

# %%
# We will compare the performance of SVC estimators that vary on their `kernel`
Member

Maybe link to SVC's page, or just write "Support Vector Classifier estimators"

# St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)
#
# where :math:`\mu` represents the mean difference in performance of the
Member

Should mu get a hat?

Contributor Author

The description of mu in this case was confusing; I changed it to say the posterior of the mean difference in performance. In this case mu shouldn't get a hat because it represents the estimated posterior of the mean parameter, and thus is not a point estimate (see this paper). But this was not clear before, as you pointed out.

@martinagvilas
Contributor Author

@martinagvilas Is this enough explanation or does it more confuse than clarify?

Seems clear to me! Thanks for taking the time to explain.

I have addressed your final issues; hope it looks okay.

@martinagvilas
Contributor Author

Thanks for the review @NicolasHug! I have added your suggestions now.

Member
@NicolasHug NicolasHug left a comment


Thanks @martinagvilas , I realized my #17432 (comment) made little sense, but this led me to some additional questions

Comment on lines 270 to 272
# model is better than the second. Bayesian estimation will output a
# distribution that specifies the probability of each parameter value, which in
# this example is the mean of the differences in the performance of two models.
Member

Just wondering: is the random variable that we're trying to model really the mean (i.e. expectation) of the differences, or is it actually the differences themselves? The description and the rest of the text suggest the former.

Also (almost) unrelated, but may I suggest defining mu here already:

Bayesian estimation will output a distribution that specifies the distribution followed by the mean :math:`\mu` of the differences in the performance of two models.

(if what we model isn't the average diff but just the diff, then we can just reformulate and maybe switch mu for some other letter)

Contributor Author

Thanks for noticing this and all of the comments below, the wording is indeed wrong in many places.

Answering the comment here: we are indeed trying to model the mean of the differences.

# this example is the mean of the differences in the performance of two models.
#
# To obtain the posterior distribution we need to define a prior that models
# our beliefs of how the means are distributed before looking at the data,
Member

Should this be "mean" (without an s)?
Or should this be "differences"?

Contributor Author
Contributor Author
@martinagvilas martinagvilas Nov 6, 2020


you are right, this should say "how the mean is distributed"

plt.xticks(np.arange(-0.04, 0.06, 0.01))
plt.fill_between(x, t_post.pdf(x), 0, facecolor='blue', alpha=.2)
plt.ylabel("Probability density")
plt.xlabel("Mean difference")
Member

Suggested change
plt.xlabel("Mean difference")
plt.xlabel("Mean difference $\mu$")

Comment on lines 276 to 277
# and multiply it by a likelihood function that computes how likely our
# observed mean differences are, given the values that the mean of differences
Member

about "observed mean differences":

I find this a bit strange because, as far as I understand, we don't really observe multiple "mean differences"; we observe multiple differences, but these account for just one empirical mean difference?


Comment on lines 297 to 298
# where :math:`\mu` represents the posterior of the mean difference in
# performance,
Member

strictly speaking, I believe that the posterior is the distribution, not mu itself?

If we go with my suggested wording above where mu is defined, we probably don't need to specify it again here

Contributor Author

The posterior is the distribution of the mean of the differences.

@martinagvilas
Contributor Author

@NicolasHug thanks for all these comments. Did my answers make sense? If so I will add the suggestions. But if it is still confusing let me know.

I think the confusion with all these wordings comes from the fact that we are using a Bayesian approach with a closed-form solution and not a sampling approach. In the closed-form solution we are marginalizing over the variance already, so our posterior is only the posterior of the mean.

I think one part I missed from the description in this text is the fact that (in a standard sampling approach) we should have a prior for the mean of the differences, as well as a prior for the variance. The likelihood should then be computed with respect to the different combinations of the values of the mean and the variance. After obtaining the posterior we would then obtain the marginal posterior density of mu. This is clearer in the box named "Bayesian correlated t-test" in this paper. Should I correct the description in the text to add this also?

@NicolasHug
Member

Thanks @martinagvilas, yes it makes sense to me now.

we should have a prior for the mean of the differences, as well as a prior for the variance

From the ref it seems that the normal-gamma prior is in fact a joint prior, for both the mean and the variance. This wasn't too shocking to me (perhaps because I'm not super familiar with Bayesian stats), so I would not mention the closed-form vs sampling discussion. Since the ref is properly referenced in this section, I think it's fine to be a little succinct and leave out "details" (like the generative distribution, etc.)

@martinagvilas
Contributor Author

@NicolasHug great! I added only your suggestions then.

Member
@lorentzenchr lorentzenchr left a comment


I tried to clarify some Bayesian methodology - as suggestions.

As this PR already has 2 approvals, it's ready for merge in my opinion. We can also improve in follow-up PRs.

Comment on lines +271 to +272
# distribution followed by the mean :math:`\mu` of the differences in the
# performance of two models.
Member

Suggested change
# distribution followed by the mean :math:`\mu` of the differences in the
# performance of two models.
# posterior distribution :math:`p(\mu|data and prior belief)` of the mean
# difference in performance of two models, that is :math:`\mu`.

Comment on lines +280 to +282
# Bayesian estimation can be carried out in many forms to answer our question,
# but in this example we will implement the approach suggested by Benavoli and
# collegues [4]_.
Member

Suggested change
# Bayesian estimation can be carried out in many forms to answer our question,
# but in this example we will implement the approach suggested by Benavoli and
# collegues [4]_.
# Bayesian estimation can be carried out in many forms to answer our question,
# but in this example we will apply the closed-form expression approach suggested
# by Benavoli and collegues [4]_.

Comment on lines +284 to +291
# One way of defining our posterior using a closed-form expression is to select
# a prior conjugate to the likelihood function. Benavoli and collegues [4]_
# show that when comparing the performance of two classifiers we can model the
# prior as a Normal-Gamma distribution (with both mean and variance unknown)
# conjugate to a normal likelihood, to thus express the posterior as a normal
# distribution.
# Marginalizing out the variance from this normal posterior, we can define the
# posterior of the mean parameter as a Student's t-distribution. Specifically:
Member

Suggested change
# One way of defining our posterior using a closed-form expression is to select
# a prior conjugate to the likelihood function. Benavoli and collegues [4]_
# show that when comparing the performance of two classifiers we can model the
# prior as a Normal-Gamma distribution (with both mean and variance unknown)
# conjugate to a normal likelihood, to thus express the posterior as a normal
# distribution.
# Marginalizing out the variance from this normal posterior, we can define the
# posterior of the mean parameter as a Student's t-distribution. Specifically:
# One standard way to obtain a closed-form posterior distribution is to select
# a prior conjugate to the likelihood function. Benavoli and collegues [4]_
# use a Normal distribution for the data generating process, i.e. the
# likelihood for the observed differences in performance, and a Normal-Gamma
# distribution (with both mean and variance unknown) as conjugate prior. The
# posterior of mean and variance is then itself a Normal-Gamma distribution.
# Choosing some prior parameters and marginalizing out the variance gives us a
# Student's t-distribution as posterior for the mean parameter :math:`\mu`.
# Specifically:

Member

Or, as @NicolasHug suggested, we make it more succinct and just state that [4] uses conjugate prior, makes some further assumptions and then it turns out to be the Student's t.

Comment on lines +301 to +302
# and :math:`\hat{\sigma}^2` represents the variance of the observed
# differences.
Member

Suggested change
# and :math:`\hat{\sigma}^2` represents the variance of the observed
# differences.
# and :math:`\hat{\sigma}^2` represents the estimated variance of the observed
# differences.

Or empirical?

Comment on lines +294 to +295
# St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)
Member

Suggested change
# St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)
# p(\mu|data and prior belief) =
# St(\mu; n-1, \overline{x}, (\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)

@NicolasHug
Member

I'd be +1 to merge now considering how long this has been opened.

@lorentzenchr
Member

I'd be +1 to merge now considering how long this has been opened.

@martinagvilas Final wish for change or merge as is?

@martinagvilas
Contributor Author

@lorentzenchr @NicolasHug maybe we can merge this now and if others find it confusing further suggestions can be added in a separate PR?

@NicolasHug NicolasHug merged commit 5e85a65 into scikit-learn:master Nov 6, 2020
@NicolasHug
Member

Thanks a lot @martinagvilas !
