Bugfixes in gaussian_process subpackage #2632
Conversation
it's been a while indeed! :) Is there a way to test that it's actually fixing a bug? Does the CV score improve in the example? A bug fix with no test added or modified is always suspicious :)
I agree with @agramfort - a non-regression test is needed with the PR 👍
Well, the bug is obvious if you read the code carefully...

```console
$ git checkout master
Switched to branch 'master'
$ sed -i -e 's|gp.theta0 \= gp.theta|gp.theta0 \= gp.theta_|' examples/gaussian_process/gp_diabetes_dataset.py
$ python examples/gaussian_process/gp_diabetes_dataset.py | tail -n 1
The 20-Folds estimate of the coefficient of determination is R2 = 0.432307382773
$ git checkout -- examples/gaussian_process/gp_diabetes_dataset.py
$ git checkout bugfix-gp-random_start
Switched to branch 'bugfix-gp-random_start'
$ python examples/gaussian_process/gp_diabetes_dataset.py | tail -n 1
The 20-Folds estimate of the coefficient of determination is R2 = 0.436042406555
```
```diff
@@ -40,7 +40,7 @@
 gp.fit(X, y)

 # Deactivate maximum likelihood estimation for the cross-validation loop
-gp.theta0 = gp.theta  # Given correlation parameter = MLE
+gp.theta0 = gp.theta_  # Given correlation parameter = MLE
```
Why "Given"? If it ends with `_` then it's estimated rather than given, no?
The trick is that I am setting `gp.theta0` to the `theta_` estimated from the full dataset (first call to `fit`), and setting `thetaL` and `thetaU` to `None` to force `theta_` to remain the same during the cross-validation (that's what is called the "given" case in the `fit` method; grep for it).
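For illustration, here is a minimal sketch of that trick, assuming the legacy (pre-0.18) `GaussianProcess` API discussed in this thread; the toy data are placeholders:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcess  # legacy API

# Toy data standing in for the diabetes example.
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = np.sin(X).sum(axis=1)

gp = GaussianProcess(theta0=1e-1, thetaL=1e-3, thetaU=1e1)
gp.fit(X, y)                   # MLE: estimates gp.theta_ within [thetaL, thetaU]

gp.theta0 = gp.theta_          # start from the MLE of the full dataset...
gp.thetaL = gp.thetaU = None   # ...and drop the bounds: without bounds, fit()
                               # treats theta0 as *given* and skips the MLE
gp.fit(X, y)                   # e.g. inside a cross-validation loop
```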
OK, +1 for merge then.
The goal of a test would be to avoid having the same bug creep back in.
We still need the regression test here - as Gael mentions, it's good to avoid these kinds of bugs crawling back in.
…g likelihood over several random starts. The sign of the optimal_rlf_value was set wrongly in _arg_max_reduced_likelihood_function(), since it was set to the negative of the return value of reduced_likelihood_function(). This error was probably caused by confusing reduced_likelihood_function(theta) with minus_reduced_likelihood_function(log10t). As a result, _arg_max_reduced_likelihood_function() returned the worst local maximum instead of the best one when performing several random starts.
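Schematically, the corrected restart loop looks like the sketch below. This is not the actual scikit-learn source; the names are hypothetical and only mirror those mentioned above, with the search running in log10 space as described:

```python
import numpy as np
from scipy.optimize import fmin_cobyla


def arg_max_rlf(rlf, theta0_candidates):
    """Keep the restart that attains the *largest* reduced likelihood."""
    best_theta, best_rlf = None, -np.inf
    for theta0 in theta0_candidates:
        # fmin_cobyla minimizes, so hand it the *opposite* of rlf.
        log10t = fmin_cobyla(lambda t: -rlf(10.0 ** t),
                             np.atleast_1d(np.log10(theta0)),
                             cons=[lambda t: 1.0],  # trivial constraint
                             disp=0)
        value = float(rlf(10.0 ** log10t))  # back on the maximization scale
        # The bug effectively flipped the sign of `value`, so the comparison
        # below selected the worst restart instead of the best one.
        if value > best_rlf:
            best_theta, best_rlf = 10.0 ** log10t, value
    return best_theta, best_rlf


# Toy usage: a unimodal "reduced likelihood" peaking at theta = 10 ** 0.5.
rlf = lambda theta: -float((np.log10(theta) - 0.5) ** 2)
theta_opt, rlf_opt = arg_max_rlf(rlf, [1e-2, 1.0, 1e2])
```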
It seems that only a missing regression test has kept this PR from being merged for seven months. I have created a test that checks that the reduced likelihood function never decreases with increasing `random_start`. As I cannot push to this PR (can I?), I paste it here:

```python
import numpy as np

from sklearn.gaussian_process import GaussianProcess
from sklearn.utils.testing import assert_true  # assumed; as in the test module


def test_random_starts():
    """
    Test that an increasing number of random starts of GP fitting only
    increases the reduced likelihood function of the optimal theta.
    """
    n_input_dims = 3
    n_samples = 100
    np.random.seed(0)
    X = np.random.random(n_input_dims * n_samples).reshape(n_samples,
                                                           n_input_dims) * 2 - 1
    y = np.sin(X).sum(axis=1) + np.sin(3 * X).sum(axis=1)
    best_likelihood = -np.inf
    for random_start in range(1, 10):
        np.random.seed(0)  # The GP random_state is not used consistently
        gp = GaussianProcess(regr="constant", corr="squared_exponential",
                             theta0=[1e-0] * n_input_dims,
                             thetaL=[1e-4] * n_input_dims,
                             thetaU=[1e+1] * n_input_dims,
                             random_start=random_start, random_state=0,
                             verbose=False).fit(X, y)
        rlf = gp.reduced_likelihood_function()[0]
        assert_true(rlf >= best_likelihood)
        best_likelihood = rlf
```

@dubourg Can you review this test and add it to test_gaussian_process.py in your PR? Otherwise, I would have to open a second PR.
BTW: I noticed that gaussian_process.py uses scipy.rand() once instead of self.random_state. Can we fix this issue in this PR?
I think it would be a good idea to fix that here as well.
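For reference, a minimal sketch of the idiomatic fix using `sklearn.utils.check_random_state`; the class below is illustrative, not the real estimator:

```python
from sklearn.utils import check_random_state


class ToyEstimator(object):
    """Illustrative only; stands in for the real GaussianProcess."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def _random_restart_points(self, n_restarts, n_dims):
        # Draw from the estimator's seeded RNG instead of the module-level
        # generator, so results depend only on random_state.
        rng = check_random_state(self.random_state)
        return rng.rand(n_restarts, n_dims)  # replaces scipy.rand(...)


points = ToyEstimator(random_state=0)._random_restart_points(5, 3)
```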
Has this been fixed? If not, it would be fantastic to have a PR fixing it.
Yes, that was fixed.
Thanks a lot to everyone involved!
Hi list,

It's been a while... ;-)

I've been told the `random_start` feature of the GaussianProcess estimator was making it worse rather than better. It was indeed due to bad handling of the sign in the randomly restarted maximization of the reduced likelihood function (implemented as the minimization of the opposite reduced likelihood function with `fmin_cobyla`).

While running the GP tests and examples, I spotted another mistake in the `gp_diabetes_dataset` example, which appeared when the optimized `theta` value stored in the estimator was made "private" by renaming it to `theta_`.

Cheers!