Bugfixes in gaussian_process subpackage by dubourg · Pull Request #2632 · scikit-learn/scikit-learn

Bugfixes in gaussian_process subpackage #2632

Merged · 2 commits into scikit-learn:master on Aug 11, 2014

Conversation

@dubourg (Contributor) commented Dec 3, 2013

Hi list,

It's been a while... ;-)

I've been told that the random_start feature of the GaussianProcess estimator was making estimates worse rather than better.
This was indeed due to bad sign handling in the randomly restarted maximization of the reduced likelihood function (implemented as the minimization of the opposite of the reduced likelihood function with fmin_cobyla).
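
For context, here is a minimal sketch of the restart pattern in question (hypothetical names and bodies, not the actual scikit-learn code): since fmin_cobyla minimizes the opposite of the reduced likelihood, the value kept across restarts must be compared with its original sign, which is exactly the step that was wrong.

import numpy as np
from scipy.optimize import fmin_cobyla

def maximize_with_restarts(reduced_likelihood, theta0, n_restarts, rng):
    """Maximize reduced_likelihood via randomly restarted COBYLA runs."""
    best_theta, best_rlf = None, -np.inf
    for _ in range(n_restarts):
        start = theta0 * rng.uniform(0.5, 2.0, size=np.shape(theta0))
        # COBYLA minimizes, so hand it the OPPOSITE of the likelihood
        theta_opt = fmin_cobyla(lambda t: -reduced_likelihood(t), start,
                                cons=[lambda t: 1.0],  # trivially satisfied
                                disp=0)
        # re-evaluate with the true sign before comparing across restarts
        rlf = reduced_likelihood(theta_opt)
        if rlf > best_rlf:
            best_theta, best_rlf = theta_opt, rlf
    return best_theta, best_rlf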

While running the GP tests and examples, I spotted another mistake, in the gp_diabetes_dataset example, which appeared when the optimized theta value stored in the estimator was made "private" by renaming it to theta_.

Cheers!

@coveralls

Coverage remained the same when pulling 2912384 on dubourg:bugfix-gp-random_start into 87343d7 on scikit-learn:master.

@agramfort (Member)

it's been a while indeed! :)

is there a way to test that it's actually fixing a bug? does the cv score improve in the example?

A bug fix with no test added or modified is always suspicious :)

@jaquesgrobler (Member)

I agree with @agramfort: a non-regression test is needed with this PR 👍

@dubourg (Contributor, Author) commented Dec 16, 2013

Well, the bug is obvious if you read the code carefully...
If you need proof, though, here is the difference in terms of the 20-fold R2 score on the diabetes dataset:

$ git checkout master 
Switched to branch 'master'
$ sed -i -e 's|gp.theta0 \= gp.theta|gp.theta0 \= gp.theta_|' examples/gaussian_process/gp_diabetes_dataset.py
$ python examples/gaussian_process/gp_diabetes_dataset.py |tail -n 1
The 20-Folds estimate of the coefficient of determination is R2 = 0.432307382773
$ git checkout -- examples/gaussian_process/gp_diabetes_dataset.py
$ git checkout bugfix-gp-random_start 
Switched to branch 'bugfix-gp-random_start'
$ python examples/gaussian_process/gp_diabetes_dataset.py |tail -n 1
The 20-Folds estimate of the coefficient of determination is R2 = 0.436042406555

@@ -40,7 +40,7 @@
 gp.fit(X, y)

 # Deactivate maximum likelihood estimation for the cross-validation loop
-gp.theta0 = gp.theta # Given correlation parameter = MLE
+gp.theta0 = gp.theta_ # Given correlation parameter = MLE
Member (review comment):

why Given? if it ends with _ then it's estimated rather than given, no?

@dubourg (Contributor, Author) replied:

The trick is that I am setting gp.theta0 to the theta_ estimated from the full dataset (first call to fit), and setting thetaL and thetaU to None to force theta_ to remain the same during the cross-validation (that's what's called the "given" case in the fit method; grep for it).
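
For illustration, a minimal sketch of that trick (pre-0.18 GaussianProcess API; the hyperparameter values here are arbitrary, not those of the example):

from sklearn.datasets import load_diabetes
from sklearn.gaussian_process import GaussianProcess

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# First fit: theta_ is estimated by maximum likelihood within [thetaL, thetaU]
gp = GaussianProcess(theta0=1e-1, thetaL=1e-4, thetaU=1e0)
gp.fit(X, y)

# Freeze the MLE for the cross-validation loop: reuse theta_ as theta0 and
# drop the bounds, so every subsequent fit treats theta as "given"
gp.theta0 = gp.theta_
gp.thetaL, gp.thetaU = None, None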

@agramfort (Member)

ok. +1 for merge then

@GaelVaroquaux (Member)

> Well, the bug is obvious if you read the code carefully...

The goal of a test would be to avoid having the same bug creep back into the codebase later on, when we refactor.

@jaquesgrobler (Member)

We still need the regression test here. As Gael mentions, it's good to avoid these kinds of bugs crawling back in.

dubourg referenced this pull request in jmetzen/scikit-learn on Jul 21, 2014:

…g likelihood over several random starts.

The sign of the optimal_rlf_value was set wrongly in _arg_max_reduced_likelihood_function(), since it was set to the negative of the return value of reduced_likelihood_function(). This error was probably caused by confusing reduced_likelihood_function(theta) with minus_reduced_likelihood_function(log10t). It resulted, however, in _arg_max_reduced_likelihood_function() returning the worst local maximum instead of the best one when performing several random starts.
@jmetzen (Member) commented Jul 22, 2014

As it seems, only a regression test has kept this PR from being merged for 7 months. I have created a test that checks that the reduced likelihood of the optimal theta never decreases with increasing random_start. As I cannot push to this PR (can I?), I paste it here:

# imports added so the snippet runs standalone (assert_true lived in
# sklearn.utils.testing at the time)
import numpy as np

from sklearn.gaussian_process import GaussianProcess
from sklearn.utils.testing import assert_true


def test_random_starts():
    """
    Test that an increasing number of random starts of GP fitting only
    increases the reduced likelihood function of the optimal theta.
    """
    n_input_dims = 3
    n_samples = 100
    np.random.seed(0)
    X = np.random.random(n_input_dims * n_samples).reshape(n_samples,
                                                           n_input_dims) * 2 - 1
    y = np.sin(X).sum(axis=1) + np.sin(3 * X).sum(axis=1)
    best_likelihood = -np.inf
    for random_start in range(1, 10):
        np.random.seed(0)  # The GP random_state is not used consistently
        gp = GaussianProcess(regr="constant", corr="squared_exponential",
                             theta0=[1e-0] * n_input_dims,
                             thetaL=[1e-4] * n_input_dims,
                             thetaU=[1e+1] * n_input_dims,
                             random_start=random_start, random_state=0,
                             verbose=False).fit(X, y)
        rlf = gp.reduced_likelihood_function()[0]
        assert_true(rlf >= best_likelihood)  # must never get worse
        best_likelihood = rlf

@dubourg Can you review this test and add it to test_gaussian_process.py in your PR? Otherwise, I would have to open a second PR.

@jmetzen (Member) commented Jul 22, 2014

BTW: I noticed that gaussian_process.py uses scipy.rand() once instead of self.random_state. Can we fix this issue in this PR?
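
A hedged sketch of what such a fix could look like (illustrative, not the actual scikit-learn code): route the draw through the estimator's random_state via check_random_state instead of the global scipy.rand().

import numpy as np
from sklearn.utils import check_random_state

def draw_random_start(thetaL, thetaU, random_state=None):
    """Sample one restart point, log10-uniformly between the bounds."""
    rng = check_random_state(random_state)  # instead of scipy.rand()
    log_L, log_U = np.log10(thetaL), np.log10(thetaU)
    return 10.0 ** (log_L + rng.rand(*np.shape(thetaL)) * (log_U - log_L))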

@kastnerkyle (Member)

I think it would be a good idea to fix that here as well

@larsmans merged commit 2912384 into scikit-learn:master on Aug 11, 2014
@GaelVaroquaux (Member)

> BTW: I noticed that gaussian_process.py uses scipy.rand() once instead of self.random_state.

Has this been fixed? If not, it would be fantastic to have a PR fixing it.

@larsmans (Member)

Yes, that was fixed.

@GaelVaroquaux (Member)

> Yes, that was fixed.

Thanks a lot to everyone involved!
