8000 Allow multiple input feature with the same value for gaussian processes by filthysocks · Pull Request #3163 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Allow multiple input feature with the same value for gaussian processes #3163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

filthysocks
Copy link

Fix for issue #3162. Allows to have multiple input feature with the same values for Gaussian Processes

@GaelVaroquaux
Copy link
Member

To consider merging this, we need to have a test added.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) when pulling 8ad7472 on filthysocks:allow_multiple_inputfeature_gp into 795a616 on scikit-learn:master.

@filthysocks
Copy link
Author

I added a unit test for nosiy data and used here duplicate input feature. I also changed the code slightly so it will only allow duplicate input feature for noisy data (unit test added for this as well).

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) when pulling 3fb6e73 on filthysocks:allow_multiple_inputfeature_gp into 795a616 on scikit-learn:master.

for corr in all_corr:
test_1d(regr='constant', corr=corr, random_start=random_start)
test_2d(regr='constant', corr=corr, random_start=random_start)
test_2d_2d(regr='constant', corr=corr, random_start=random_start)
check2d(X, regr='constant', corr=corr, random_start=random_start)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be appropriate to use yield to run check2d and check2d_2d

@jnothman
Copy link
Member
jnothman commented Jun 1, 2014

@filthysocks, one reason this PR hasn't gained much attention is that I don't think "multiple features" could possibly be what you mean. Of course GP allows multiple predictors in X (i.e. X.shape[1] > 1).

By "multiple features" do you mean repeated samples? repeated features? It looks like the former, but I'm not familiar with the GP implementation, so it's hard to know..

@filthysocks
Copy link
Author

"multiple input feature with the same value" so with this i mean repeated samples. So if i would try to estimate the function sin(x)*x with a GP (as in the test), currently all my samples for x would need to be different (except I choose the cov. function 'pure_nugget'). However, there is no need for this limitation.

@@ -525,6 +525,15 @@ def predict(self, X, eval_MSE=False, batch_size=None):

return y

def _prohibitMultipleInputFeature(self, D):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @jnothman said this name is confusing. Can you please give an example of invalid input? Also function names are lower case in Python. A better name would be _has_repeated_samples(self, pairwise_distances).

@filthysocks
Copy link
Author

Thanks for your feedback, it's very much appreciated.

Can you please give an example of invalid input?

There is a unit test that triggers a value error

@raises(ValueError)
def test_no_multiple_feature():
    '''
    check that multiple features are not allowed for non-noisy data
    '''
    X = np.array([[-4.61611719, -6.00099547],
                  [4.10469096, 5.32782448],
                  [0.00000000, -0.50000000],
                  [-6.17289014, -4.6984743],
                  [-6.17289014, -4.6984743],
                  [1.3109306, -6.93271427],
                  [-5.03823144, 3.10584743],
                  [-2.87600388, 6.74310541],
                  [5.21301203, 4.26386883]])

    check2d(X)

I am not familiar with GP either it's hard to me to tell whether this is correct or not.

For sure, but how useful is a regression model that does not allow for repeated samples? That's kind of the point of regression models.
Just have a look at the 2. chapter of http://www.gaussianprocess.org/gpml/chapters/RW2.pdf
You don't even need to read it, just look at the math. There is no such restriction on the input values.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.12%) when pulling b3e9374 on filthysocks:allow_multiple_inputfeature_gp into 795a616 on scikit-learn:master.

@kastnerkyle
Copy link
Member

This looks reasonable to me, and doesn't appear to break the current API since our default is (apparently) to already add diagonal loading (I really hate the term nugget). However, I still don't understand why it was broken in the first place if the loading was being added - can you clarify this @filthysocks?

@filthysocks
Copy link
Author

Well, there is just an if condition in there saying 'you are not allowed to have repeated samples even if you use a nugget'. This is reasonable if you assume no noise (except for machine imprecisions). It would be allowed if you use a specific kernel (self.corr != correlation.pure_nugget) ... However, there is just no need to limit it to just this one. So it is not really broken in some sense. It is just setting a limitation that is just unnecessary.

@glouppe
Copy link
Contributor
glouppe commented Oct 19, 2015

Closing. We now have a new implementation of gaussian processes (see #4270).

Thanks for you contribution.

@glouppe glouppe closed this Oct 19, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0