FIX: make GridSearchCV work with precomputed kernels by daien · Pull Request #649 · scikit-learn/scikit-learn

Merged: 3 commits merged into scikit-learn:master on Mar 3, 2012

Conversation

@daien (Contributor) commented Feb 23, 2012

Simple fix (+test) based on checking whether base_clf.kernel == 'precomputed'


from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV

clf = SVC(kernel='precomputed')
cv = GridSearchCV(clf, {'C': [0.1, 1.0]})
cv.fit(K, y_)  # K: precomputed (n_samples, n_samples) kernel matrix, y_: labels
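
For context on why plain row-slicing of X is not enough here: with a precomputed kernel both axes index samples, so cross-validation has to take the train-vs-train block for fitting and the test-vs-train block for scoring. A minimal, self-contained sketch of that slicing on toy data (an illustration only, not the actual patch):

import numpy as np
from sklearn.svm import SVC

# Toy data: 20 samples, 5 features, two alternating classes
X = np.random.RandomState(0).rand(20, 5)
y = np.array([0, 1] * 10)
K = np.dot(X, X.T)  # linear kernel, precomputed as a (20, 20) Gram matrix

train, test = np.arange(15), np.arange(15, 20)
clf = SVC(kernel='precomputed', C=1.0)
clf.fit(K[np.ix_(train, train)], y[train])         # train-vs-train block
print(clf.score(K[np.ix_(test, train)], y[test]))  # test-vs-train block
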
Member

Maybe you should also test the score?
Or at least/also the cv.best_score_?

@daien (Contributor Author)

By just checking that the best_score_ attribute of cv is defined, by checking that it is positive, or something else?

Member

And add tests for the cases that raise expected exceptions with assert_raises.

@ogrisel (Member) commented Feb 23, 2012

Other than that I am 👍 for merging.

@ogrisel (Member) commented Feb 23, 2012

Would be great to do the same for sklearn.cross_validation.cross_val_score: maybe there is some code to factorize.

@GaelVaroquaux (Member)

I am worried that this is a hack that special cases a problem to work around a design fault.

Not to say that I am against merging, but we clearly cannot add a list of such hacks each time that a problem cannot be cross-validated by splitting X and y. This is why I mentioned 'an elegant solution' on the mailing list.

@mblondel (Member) commented Mar 2, 2012

The same problem arises in clustering algorithms with a similarity matrix. We could agree upon a flag (e.g. symmetric_X_ = True) that must be set by the estimator in fit.
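
Purely as an illustration of that convention (symmetric_X_ is the name proposed above, not an existing scikit-learn attribute):

import numpy as np

class SimilarityClusteringSketch(object):
    """Hypothetical estimator that consumes a precomputed similarity matrix."""

    def fit(self, S, y=None):
        # S is an (n_samples, n_samples) similarity matrix.
        self.symmetric_X_ = True  # proposed flag: CV tools should slice both axes of X
        self.affinity_matrix_ = np.asarray(S)
        return self

A cross-validation tool could then inspect that flag to decide whether to slice both axes of X.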

@ogrisel (Member) commented Mar 2, 2012

I am wondering if it wouldn't be better to have a dedicated API for kernel / similarity fitting models. Like:

clf.fit_kernel(K, y) instead.

That would make it possible to provide a fit(X, y) method that accepts a design matrix X and turns it into a kernel automatically (using the kneighbors_graph machinery) for unsuspecting users of, for instance, the SpectralClustering class.
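
A hypothetical sketch of what that dual entry point could look like (fit_kernel is the name suggested here, not an existing scikit-learn method; rbf_kernel merely stands in for whatever graph or kernel construction the estimator would use):

from sklearn.metrics.pairwise import rbf_kernel

class KernelEstimatorSketch(object):
    """Toy estimator exposing both entry points from the proposal."""

    def fit_kernel(self, K, y=None):
        # Expert users pass a precomputed kernel / similarity matrix directly.
        self.kernel_ = K
        # ... actual fitting on K would happen here ...
        return self

    def fit(self, X, y=None):
        # Unsuspecting users pass a plain (n_samples, n_features) design matrix;
        # the estimator builds the kernel itself and delegates to fit_kernel.
        return self.fit_kernel(rbf_kernel(X), y)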

@daien (Contributor Author) commented Mar 2, 2012

It's funny because some time ago, I already made a similar pull request that yielded a similar discussion from the same people and the patch withered away...

I completely agree with you about the fact that it's a wider design problem that needs a proper generic solution. However, I also believe that this particular use case (cross-validation with non-linear SVMs) is frequent (at least in my domain) and therefore this god-awful hack might still be of some use. Furthermore, I don't think that it hugely pollutes the code base.

French people know that sometimes "le mieux est l'ennemi du bien" ("The best is the enemy of the good"). And as savvy pythonistas you also know that

Special cases aren't special enough to break the rules.
Although practicality beats purity.

Nevertheless, I agree that methods that consume kernels or similarity matrices are somewhat problematic in scikit-learn, because they break the X = [n_samples, n_features] API convention. It surely needs a proper discussion, but, IMHO, it's not going to be trivial to solve while maintaining the beauty and simplicity of the existing API. If it were, you would have already done it ;-) But I'm hopeful!

@amueller (Member) commented Mar 3, 2012

I like @ogrisel's idea.
Another idea I just had (for this particular case) is to subclass GridSearchCV for the "precomputed kernel/distance" case.
A problem would then be, though, to ensure that people only call it with the right kind of estimators.

So maybe a combination of @ogrisel's idea and mine? Have a grid search tool that uses the fit_kernel function?
Btw, I would rather use a name without "kernel" in it. I would like something that contains "similarity" or "precomputed".
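
One possible shape for that combination, purely as a sketch (PrecomputedGridSearchCV is a hypothetical name; the guard mirrors the check this PR adds):

from sklearn.grid_search import GridSearchCV

class PrecomputedGridSearchCV(GridSearchCV):
    """Hypothetical grid search restricted to precomputed-kernel estimators."""

    def fit(self, K, y=None, **params):
        # Ensure callers only use this with the right kind of estimator.
        if getattr(self.estimator, 'kernel', None) != 'precomputed':
            raise ValueError("estimator must be constructed with kernel='precomputed'")
        return super(PrecomputedGridSearchCV, self).fit(K, y, **params)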

@amueller (Member) commented Mar 3, 2012

As an afterthought: would this function be used both for algorithms using similarity and for those using dissimilarity? That might be confusing.

@mblondel (Member) commented Mar 3, 2012

BTW, I'm +1 for merging this PR as a temporary fix until we figure out a nicer solution.

@amueller (Member) commented Mar 3, 2012

Ok, let's merge :)

@ogrisel (Member) commented Mar 3, 2012

I am also OK with merging this as a temporary fix, but we should really think about how to improve the API for precomputed kernel / affinity / distance matrices (SVM, KernelPCA, SpectralClustering...).

@GaelVaroquaux (Member)

+1 on temporary fix and on improving long term.


amueller added a commit that referenced this pull request Mar 3, 2012
FIX: make GridSearchCV work with precomputed kernels
@amueller merged commit da9528f into scikit-learn:master Mar 3, 2012
@amueller (Member) commented Mar 3, 2012

Thanks for the contribution @daien :)

@daien (Contributor Author) commented Mar 4, 2012

You're very welcome :-) I hope I will be able to help more when I'm done with writing... And not just with hacks this time ;-)

@GaelVaroquaux (Member)

@daien: any contribution is welcome. Thanks a lot.

@alexis-mignon (Contributor)

Wouldn't a simple use_kernel option in grid search and cross-validation do the trick?
Or maybe adding a kernel_data attribute to estimators?

@alexis-mignon (Contributor)

The same hack should be added to cross_validation.cross_val_score()

@amueller (Member) commented Jul 9, 2012

@alexis-mignon This should be fixed in #803 but that hasn't been reviewed yet :-/
