Bugfixes in gaussian_process subpackage by dubourg · Pull Request #2632 · scikit-learn/scikit-learn

Bugfixes in gaussian_process subpackage #2632

Merged · 2 commits into scikit-learn:master on Aug 11, 2014

Conversation

@dubourg (Contributor) commented Dec 3, 2013

Hi list,

It's been a while... ;-)

I've been told that the random_start feature of the GaussianProcess estimator was making estimates worse rather than better.
This was indeed due to bad sign handling in the randomly restarted maximization of the reduced likelihood function (implemented as the minimization of the opposite of the reduced likelihood function with fmin_cobyla).
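
For context, here is a minimal sketch of the restart pattern in question (hypothetical names and bodies, not the actual scikit-learn code): since fmin_cobyla minimizes the opposite of the reduced likelihood, the value kept across restarts must be compared with its original sign, which is exactly the step that was wrong.

import numpy as np
from scipy.optimize import fmin_cobyla

def maximize_with_restarts(reduced_likelihood, theta0, n_restarts, rng):
    """Maximize reduced_likelihood via randomly restarted COBYLA runs."""
    best_theta, best_rlf = None, -np.inf
    for _ in range(n_restarts):
        start = theta0 * rng.uniform(0.5, 2.0, size=np.shape(theta0))
        # COBYLA minimizes, so hand it the OPPOSITE of the likelihood
        theta_opt = fmin_cobyla(lambda t: -reduced_likelihood(t), start,
                                cons=[lambda t: 1.0],  # trivially satisfied
                                disp=0)
        # re-evaluate with the true sign before comparing across restarts
        rlf = reduced_likelihood(theta_opt)
        if rlf > best_rlf:
            best_theta, best_rlf = theta_opt, rlf
    return best_theta, best_rlf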

While running the GP tests and examples, I spotted another mistake, in the gp_diabetes_dataset example, which appeared when the optimized theta value stored in the estimator was made "private" by renaming it to theta_.

Cheers!

@coveralls

Coverage remained the same when pulling 2912384 on dubourg:bugfix-gp-random_start into 87343d7 on scikit-learn:master.

@agramfort (Member)

it's been a while indeed! :)

is there a way to test that it's actually fixing a bug? does the cv score improve in the example?

A bug fix with no test added or modified is always suspicious :)

@jaquesgrobler (Member)

I agree with @agramfort: a non-regression test is needed with this PR 👍

@dubourg (Contributor, Author) commented Dec 16, 2013

Well, the bug is obvious if you read the code carefully...
If you need proof, though, here is the difference in terms of the 20-fold R2 score on the diabetes dataset:

$ git checkout master 
Switched to branch 'master'
$ sed -i -e 's|gp.theta0 \= gp.theta|gp.theta0 \= gp.theta_|' examples/gaussian_process/gp_diabetes_dataset.py
$ python examples/gaussian_process/gp_diabetes_dataset.py |tail -n 1
The 20-Folds estimate of the coefficient of determination is R2 = 0.432307382773
$ git checkout -- examples/gaussian_process/gp_diabetes_dataset.py
$ git checkout bugfix-gp-random_start 
Switched to branch 'bugfix-gp-random_start'
$ python examples/gaussian_process/gp_diabetes_dataset.py |tail -n 1
The 20-Folds estimate of the coefficient of determination is R2 = 0.436042406555

@@ -40,7 +40,7 @@
 gp.fit(X, y)

 # Deactivate maximum likelihood estimation for the cross-validation loop
-gp.theta0 = gp.theta # Given correlation parameter = MLE
+gp.theta0 = gp.theta_ # Given correlation parameter = MLE
Member (review comment):

why Given? if it ends with _ then it's estimated rather than given, no?

@dubourg (Contributor, Author) replied:

The trick is that I am setting gp.theta0 to the theta_ estimated from the full dataset (first call to fit), and setting thetaL and thetaU to None to force theta_ to remain the same during the cross-validation (that's what's called the "given" case in the fit method; grep for it).
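
For illustration, a minimal sketch of that trick (pre-0.18 GaussianProcess API; the hyperparameter values here are arbitrary, not those of the example):

from sklearn.datasets import load_diabetes
from sklearn.gaussian_process import GaussianProcess

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# First fit: theta_ is estimated by maximum likelihood within [thetaL, thetaU]
gp = GaussianProcess(theta0=1e-1, thetaL=1e-4, thetaU=1e0)
gp.fit(X, y)

# Freeze the MLE for the cross-validation loop: reuse theta_ as theta0 and
# drop the bounds, so every subsequent fit treats theta as "given"
gp.theta0 = gp.theta_
gp.thetaL, gp.thetaU = None, None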

@agramfort (Member)

ok. +1 for merge then

@GaelVaroquaux (Member)

> Well, the bug is obvious if you read the code carefully...

The goal of a test would be to avoid having the same bug creep back into the codebase later on, when we refactor.

@jaquesgrobler (Member)

We still need the regression test here. As Gael mentions, it's good to avoid these kinds of bugs crawling back in.

dubourg referenced this pull request in jmetzen/scikit-learn on Jul 21, 2014:

…g likelihood over several random starts.

The sign of the optimal_rlf_value was set wrongly in _arg_max_reduced_likelihood_function(), since it was set to the negative of the return value of reduced_likelihood_function(). This error was probably caused by confusing reduced_likelihood_function(theta) with minus_reduced_likelihood_function(log10t). It resulted, however, in _arg_max_reduced_likelihood_function() returning the worst local maximum instead of the best one when performing several random starts.
@jmetzen (Member) commented Jul 22, 2014

As it seems, only a regression test has kept this PR from being merged for 7 months. I have created a test that checks that the reduced likelihood of the optimal theta never decreases with increasing random_start. As I cannot push to this PR (can I?), I paste it here:

# imports added so the snippet runs standalone (assert_true lived in
# sklearn.utils.testing at the time)
import numpy as np

from sklearn.gaussian_process import GaussianProcess
from sklearn.utils.testing import assert_true


def test_random_starts():
    """
    Test that an increasing number of random starts of GP fitting only
    increases the reduced likelihood function of the optimal theta.
    """
    n_input_dims = 3
    n_samples = 100
    np.random.seed(0)
    X = np.random.random(n_input_dims * n_samples).reshape(n_samples,
                                                           n_input_dims) * 2 - 1
    y = np.sin(X).sum(axis=1) + np.sin(3 * X).sum(axis=1)
    best_likelihood = -np.inf
    for random_start in range(1, 10):
        np.random.seed(0)  # The GP random_state is not used consistently
        gp = GaussianProcess(regr="constant", corr="squared_exponential",
                             theta0=[1e-0] * n_input_dims,
                             thetaL=[1e-4] * n_input_dims,
                             thetaU=[1e+1] * n_input_dims,
                             random_start=random_start, random_state=0,
                             verbose=False).fit(X, y)
        rlf = gp.reduced_likelihood_function()[0]
        assert_true(rlf >= best_likelihood)  # must never get worse
        best_likelihood = rlf

@dubourg Can you review this test and add it to test_gaussian_process.py in your PR? Otherwise, I would have to open a second PR.

@jmetzen (Member) commented Jul 22, 2014

BTW: I noticed that gaussian_process.py uses scipy.rand() once instead of self.random_state. Can we fix this issue in this PR?
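
A hedged sketch of what such a fix could look like (illustrative, not the actual scikit-learn code): route the draw through the estimator's random_state via check_random_state instead of the global scipy.rand().

import numpy as np
from sklearn.utils import check_random_state

def draw_random_start(thetaL, thetaU, random_state=None):
    """Sample one restart point, log10-uniformly between the bounds."""
    rng = check_random_state(random_state)  # instead of scipy.rand()
    log_L, log_U = np.log10(thetaL), np.log10(thetaU)
    return 10.0 ** (log_L + rng.rand(*np.shape(thetaL)) * (log_U - log_L))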

@kastnerkyle (Member)

I think it would be a good idea to fix that here as well

@larsmans merged commit 2912384 into scikit-learn:master on Aug 11, 2014
@GaelVaroquaux (Member)

> BTW: I noticed that gaussian_process.py uses scipy.rand() once instead of self.random_state.

Has this been fixed? If not, it would be fantastic to have a PR fixing it.

@larsmans (Member)

Yes, that was fixed.

@GaelVaroquaux (Member)

> Yes, that was fixed.

Thanks a lot to everyone involved!
