DOC Add grid search stats example #17432
Conversation
For some reason the plots aren't showing up in the built doc: https://105431-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/grid_search_stats.html

@lucyleeow thanks for letting me know! It's weird because I could see the outputs when building locally by running

No worries. Maybe @cmarmo will know why?

Try adding a link to the example from

Hi @martinagvilas, do you mind trying to remove the

Re the doc building issue, it's not related to this PR, it's happening on master as well.

Seems to be deploying correctly now here. Thanks for the help!
This was really interesting and I think people will find it really useful. I just have a few suggestions.
Also FYI, the notation :class:`sklearn.model_selection.GridSearchCV` gives you the whole class path 'sklearn.model_selection.GridSearchCV' linked in the built doc, whereas if you add ~, i.e. :class:`~sklearn.model_selection.GridSearchCV`, it gives you just 'GridSearchCV' linked, which can be nicer in the body of text.
sns.scatterplot(
    X[:, 0], X[:, 1], hue=y,
    marker='o', s=25, edgecolor='k', legend=False
).set_title("Data")
I think we should keep the plt.show() at the end of plotting code blocks as it prevents the text output (e.g., Text(0.5, 1.0, 'Data')) below the plot. I don't think it was the cause of the plotting problem before.
plt.show() will stop the execution of the code. That's why we would need it at the end.
If I recall well.
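A minimal standalone sketch of the pattern being discussed (assuming matplotlib with a non-interactive backend; the data are made up here, and the actual example plots with seaborn):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

# The title-setting call returns a Text object; without plt.show() at the
# end of the cell, its repr (e.g. Text(0.5, 1.0, 'Data')) is printed
# below the plot in the rendered notebook-style example.
plt.scatter(X[:, 0], X[:, 1], marker="o", s=25, edgecolor="k")
plt.title("Data")
plt.show()
```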
examples/model_selection/plot_grid_search_stats.py
@lucyleeow thanks for the feedback! I agree with all your suggestions, so I have made the necessary changes.

Looks great! @adrinjalali or @ogrisel might be able to take a look?

@amueller is interested in this as well I think.
I find the example super interesting and useful.
I recall something linked with some bootstrapping to compare models from the original thread pointed out by @ogrisel. Would it be relevant here?
print(__doc__)

import numpy as np
Since we have a nice notebook-style example, I would rather delay the import just on the "cell" (part of the code) where we will use it. It will be more intuitive.
I changed the example to make the imports in the respective cells, but I don't know if you were suggesting to do that only for the import-from statements while leaving whole-package imports (e.g. import numpy as np) at the beginning, like in this example.
@glemaitre thanks for the review! I implemented most of your suggestions and replied to the ones where I have some questions/comments before implementing the changes. Regarding the bootstrap approach: I think in the end the suggestion was to start the PR showing the most basic approaches and then think about the bootstrap approach that @ogrisel suggested. I would be happy to include it in this example or in another one if this is too long already.
A first round.
If I have not said it before, I really very much appreciate this example :rocket:
@martinagvilas Thanks for your update.
Some minor points, otherwise looks pretty good to me.
# credible intervals which compute the range of values that have a particular
# probability of containing the true mean of differences.
#
# Let's determine the credible intervals of our data using 50%, 75% and 95%:
I don't understand the credible interval from this text, maybe expand a tiny bit?
Some discussion in #17432 (comment).
Is it clearer now? I could also link to some website explaining this concept in more detail. I also provided an intuition of the results below the table.
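For readers following the thread, the computation can be sketched like this (a hypothetical sketch assuming scipy; the posterior parameters are illustrative, not the ones from the example):

```python
from scipy import stats

# Illustrative Student's t posterior over the mean difference in scores:
# df = n - 1 repetitions, centered on the observed mean difference,
# scaled by the corrected standard deviation (all values assumed).
n = 10
t_post = stats.t(df=n - 1, loc=0.01, scale=0.02)

# A p% credible interval is the central region holding p% of the
# posterior probability mass of containing the true mean of differences.
for prob in (0.50, 0.75, 0.95):
    low, high = t_post.interval(prob)
    print(f"{prob:.0%} credible interval: [{low:.4f}, {high:.4f}]")
```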
# We observe that after correcting for multiple comparisons, the only model
# that significantly differs from the others is `'2_poly'`.
# `'rbf'`, the model ranked first by
# :class:`~sklearn.model_selection.GridSearchCV`, does not significantly
# differ from `'linear'` or `'3_poly'`.
there are 3 p-values < 0.05 though, hmm, and the rbf-linear p-value is 0 on the table.
Thanks for spotting this! This is the result of the error in the index of the dataframe that you mentioned earlier.
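A multiple-comparison correction of the kind discussed here can be sketched as follows (a Bonferroni-style sketch; the pair names and p-values are made up for illustration, not taken from the example's table):

```python
# Uncorrected pairwise p-values (illustrative values only).
p_values = {
    "rbf vs linear": 0.150,
    "rbf vs 3_poly": 0.080,
    "rbf vs 2_poly": 0.001,
}

# Bonferroni correction: multiply each p-value by the number of
# comparisons, capping at 1, before checking the 0.05 threshold.
n_comparisons = len(p_values)
corrected = {pair: min(p * n_comparisons, 1.0) for pair, p in p_values.items()}
significant = [pair for pair, p in corrected.items() if p < 0.05]
print(corrected)
print(significant)
```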
# :class:`~sklearn.model_selection.GridSearchCV` `'rbf'`, has approximately a
# 6.8% chance of being worse than `'linear'`, and a 1.8% chance of being worse
# than `'3_poly'`.
# `'rbf'` and `'linear'` have a 43% probability of being practically
# equivalent, while `'rbf'` and `'3_poly'` have a 10% chance of being so.
again, the table seems to show a different story I guess.
thanks! same as above.
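The quantities discussed in this thread can be read off the posterior (a hypothetical sketch assuming scipy; the posterior parameters and the region-of-practical-equivalence bound are illustrative):

```python
from scipy import stats

# Illustrative posterior over the mean score difference between two models
# (positive values meaning the first model is better).
t_post = stats.t(df=9, loc=0.005, scale=0.01)

# Probability that the first model is worse: posterior mass below 0.
p_worse = t_post.cdf(0)

# Probability of practical equivalence: posterior mass inside a region of
# practical equivalence (ROPE) of +/- 1% accuracy (an assumed bound).
rope = 0.01
p_equiv = t_post.cdf(rope) - t_post.cdf(-rope)

print(f"P(worse) = {p_worse:.3f}, P(practically equivalent) = {p_equiv:.3f}")
```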
Thanks for the review @adrinjalali. I have updated the example with your suggestions (and correcting the errors you spotted).
@lorentzenchr could you clarify what you mean by this? I'm not familiar with the forecasting literature, but from the references you suggested I couldn't gather where this statement is made. You must be right, but I'm just wary of adding something that I don't fully understand.

@lorentzenchr all the changes you requested have been added now
LGTM, finally :smirk: Apart from a few tiny nitpicks.
Diebold-Mariano Test

Very short: the task of point forecasting is the prediction of a future event. If you have several forecasts, how do you decide which one is better? Take forecasts of the past where you meanwhile know the outcome (out-of-time) and compare the realized average loss/scoring functions (choose a scoring function that is strictly consistent for the functional of interest, often the expectation or a quantile). This type of testing is also known as backtesting and is very familiar to, if not almost the same as, out-of-sample comparison of the predictiveness of machine learning models.

[1] Chapter 1.1 proposes to use a t-test on the difference of the realized average losses. It also discusses the correlation of samples, which corresponds to our Nadeau and Bengio corrected estimation of the variance. In the end, it's just a t-test that is nowadays called a Diebold-Mariano (type) test in the context of forecast comparison.

References of forecast comparison
References that mention a Diebold-Mariano test
Disclosure: I know some of the authors of [7]. @martinagvilas Is this enough explanation or does it confuse more than it clarifies?
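The test being described can be sketched as follows (a hedged sketch assuming scipy/numpy; the loss differences are made-up values, and the correction factor assumes a 90/10 train/test split):

```python
import numpy as np
from scipy import stats

# Realized per-evaluation loss differences between two models (illustrative).
d = np.array([0.10, -0.05, 0.08, 0.02, 0.04, -0.01, 0.06, 0.03])
n = d.shape[0]

# Plain t statistic on the mean loss difference...
t_plain = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# ...versus a Nadeau-Bengio style corrected variance, which inflates the
# variance to account for correlation between resampled evaluations
# (the n_test/n_train = 1/9 factor is an assumption for this sketch).
corrected_var = (1 / n + 1 / 9) * d.var(ddof=1)
t_corr = d.mean() / np.sqrt(corrected_var)

p_value = 2 * stats.t.sf(abs(t_corr), df=n - 1)
print(t_plain, t_corr, p_value)
```

The corrected statistic is always smaller in magnitude than the plain one, making the test more conservative.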
Thanks @martinagvilas , great example!
I have a bunch of comments but they're mostly nits. LGTM anyway
plt.show()

# %%
# We will compare the performance of SVC estimators that vary on their `kernel`
Maybe link to SVC's page, or just write "Support Vector Classifier estimators".
# St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)
#
# where :math:`\mu` represents the mean difference in performance of the
Should mu get a hat?
The description of mu in this case was confusing, I changed it to say the posterior of the mean difference in performance. In this case mu shouldn't get a hat because it represents the estimated posterior of the mean parameter, and thus is not a point estimate (see this paper). But this was not clear before, as you pointed out.
Seems clear to me! Thanks for taking the time to explain. I have addressed your final issues, hope it looks okay.

Thanks for the review @NicolasHug! I have added your suggestions now.
Thanks @martinagvilas , I realized my #17432 (comment) made little sense, but this led me to some additional questions
# model is better than the second. Bayesian estimation will output a
# distribution that specifies the probability of each parameter value, which in
# this example is the mean of the differences in the performance of two models.
Just wondering: is the random variable that we're trying to model really the mean (i.e. expectation) of the differences, or is it actually the differences themselves? The description and the rest of the text suggests the former.
Also (almost) unrelated but may I suggest to define mu here already:
Bayesian estimation will output a distribution that specifies the distribution followed by the mean :math:`\mu` of the differences in the performance of two models.
(if what we model isn't the average diff but just the diff, then we can just reformulate and maybe switch mu for some other letter)
Thanks for noticing this and all of the comments below, the wording is indeed wrong in many places.
Answering the comment here: we are indeed trying to model the mean of the differences.
# this example is the mean of the differences in the performance of two models.
#
# To obtain the posterior distribution we need to define a prior that models
# our beliefs of how the means are distributed before looking at the data,
Should this be "mean" (without an s)?
Or should this be "differences"?
you are right, this should say "how the mean is distributed"
plt.xticks(np.arange(-0.04, 0.06, 0.01))
plt.fill_between(x, t_post.pdf(x), 0, facecolor='blue', alpha=.2)
plt.ylabel("Probability density")
plt.xlabel("Mean difference")
- plt.xlabel("Mean difference")
+ plt.xlabel("Mean difference $\mu$")
# and multiply it by a likelihood function that computes how likely our
# observed mean differences are, given the values that the mean of differences
About "observed mean differences": I find this a bit strange because as far as I understand, we don't really observe multiple "mean differences"; we observe multiple differences, but these account for just one empirical mean difference?
here the likelihood should be how likely our differences are, given the (...)
# where :math:`\mu` represents the posterior of the mean difference in
# performance,
strictly speaking, I believe that the posterior is the distribution, not mu itself?
If we go with my suggested wording above where mu is defined, we probably don't need to specify it again here
The posterior is the distribution of the mean of the differences.
@NicolasHug thanks for all these comments. Did my answers make sense? If so I will add the suggestions. But if it is still confusing let me know.

I think the confusion with all these wordings comes from the fact that we are using a Bayesian approach with a closed-form solution and not a sampling approach. In the closed-form solution we are marginalizing over the variance already, so our posterior is only the posterior of the mean. I think one part I missed from the description in this text is the fact that (in a standard sampling approach) we should have a prior for the mean of the differences, as well as a prior for the variance. And the likelihood should be computed with respect to the different combinations of the values of the mean and the variance. After obtaining the posterior we would then obtain the marginal posterior density of mu. This is more clear in the box named "Bayesian correlated t-test" in this paper. Should I correct the description in the text to add this also?
Thanks @martinagvilas, yes it makes sense to me now.

From the ref it seems that the normal-gamma prior is in fact a joint prior, for both the mean and the variance. This wasn't too shocking to me (perhaps because I'm not super familiar with Bayesian stats), so I would not mention the closed form vs sampling discussion. Since the ref is properly referenced in this section, I think it's fine to be a little succinct and leave out "details" (like the generative distribution, etc.).
@NicolasHug great! I added only your suggestions then.
I tried to clarify some Bayesian methodology - as suggestions.
As this PR already has 2 approvals, it's ready for merge in my opinion. We can also improve in follow-up PRs.
# distribution followed by the mean :math:`\mu` of the differences in the
# performance of two models.
- # distribution followed by the mean :math:`\mu` of the differences in the
- # performance of two models.
+ # posterior distribution :math:`p(\mu|data and prior belief)` of the mean
+ # difference in performance of two models, that is :math:`\mu`.
# Bayesian estimation can be carried out in many forms to answer our question,
# but in this example we will implement the approach suggested by Benavoli and
# collegues [4]_.
- # Bayesian estimation can be carried out in many forms to answer our question,
- # but in this example we will implement the approach suggested by Benavoli and
- # collegues [4]_.
+ # Bayesian estimation can be carried out in many forms to answer our question,
+ # but in this example we will apply the closed-form expression approach
+ # suggested by Benavoli and collegues [4]_.
# One way of defining our posterior using a closed-form expression is to select
# a prior conjugate to the likelihood function. Benavoli and collegues [4]_
# show that when comparing the performance of two classifiers we can model the
# prior as a Normal-Gamma distribution (with both mean and variance unknown)
# conjugate to a normal likelihood, to thus express the posterior as a normal
# distribution.
# Marginalizing out the variance from this normal posterior, we can define the
# posterior of the mean parameter as a Student's t-distribution. Specifically:
- # One way of defining our posterior using a closed-form expression is to select
- # a prior conjugate to the likelihood function. Benavoli and collegues [4]_
- # show that when comparing the performance of two classifiers we can model the
- # prior as a Normal-Gamma distribution (with both mean and variance unknown)
- # conjugate to a normal likelihood, to thus express the posterior as a normal
- # distribution.
- # Marginalizing out the variance from this normal posterior, we can define the
- # posterior of the mean parameter as a Student's t-distribution. Specifically:
+ # One standard way to obtain a closed-form posterior distribution is to select
+ # a prior conjugate to the likelihood function. Benavoli and collegues [4]_
+ # use a Normal distribution for the data generating process, i.e. the
+ # likelihood for the observed differences in performance, and a Normal-Gamma
+ # distribution (with both mean and variance unknown) as conjugate prior. The
+ # posterior of mean and variance is then itself a Normal-Gamma distribution.
+ # Choosing some prior parameters and marginalizing out the variance gives us a
+ # Student's t-distribution as posterior for the mean parameter :math:`\mu`.
+ # Specifically:
Or, as @NicolasHug suggested, we make it more succinct and just state that [4] uses conjugate prior, makes some further assumptions and then it turns out to be the Student's t.
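Either way, the closed-form posterior can be sketched directly from the formula (a sketch assuming numpy/scipy; the per-split score differences and the 90/10 train/test ratio are made-up illustrative values):

```python
import numpy as np
from scipy import stats

# Illustrative per-split score differences between two models.
diffs = np.array([0.02, -0.01, 0.03, 0.01, 0.00,
                  0.02, -0.02, 0.01, 0.02, 0.01])
n = diffs.shape[0]

# St(mu; n-1, x_bar, (1/n + n_test/n_train) * sigma_hat^2):
# the n_test/n_train term inflates the variance to account for the
# correlation between splits (a 90/10 split is assumed here).
n_test, n_train = 1, 9
dof = n - 1
x_bar = diffs.mean()
sigma2_hat = diffs.var(ddof=1)
scale = np.sqrt((1 / n + n_test / n_train) * sigma2_hat)
t_post = stats.t(df=dof, loc=x_bar, scale=scale)

# e.g. posterior probability that the first model is better:
print(f"P(mu > 0) = {1 - t_post.cdf(0):.3f}")
```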
# and :math:`\hat{\sigma}^2` represents the variance of the observed
# differences.
- # and :math:`\hat{\sigma}^2` represents the variance of the observed
- # differences.
+ # and :math:`\hat{\sigma}^2` represents the estimated variance of the observed
+ # differences.
Or empirical?
# St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
# \hat{\sigma}^2)
- # St(\mu;n-1,\overline{x},(\frac{1}{n}+\frac{n_{test}}{n_{train}})
- # \hat{\sigma}^2)
+ # p(\mu|data and prior belief) =
+ # St(\mu; n-1, \overline{x}, (\frac{1}{n}+\frac{n_{test}}{n_{train}})
+ # \hat{\sigma}^2)
I'd be +1 to merge now considering how long this has been opened.

@martinagvilas Final wish for change or merge as is?

@lorentzenchr @NicolasHug maybe we can merge this now, and if others find it confusing further suggestions can be added in a separate PR?

Thanks a lot @martinagvilas!
This PR implements the example discussed in #16344 and solves #12730
The example shows how to run a statistical comparison of the results of a grid-search, using both a Frequentist and Bayesian approach.
As discussed in #16344 it implements the most conventional ways of running this statistical comparison.
cc @adrinjalali who worked with me on this during the January sprint