Initial implementation of GridFactory by jnothman · Pull Request #21784 · scikit-learn/scikit-learn
Status: Open · wants to merge 2 commits into main
Conversation

@jnothman (Member) commented Nov 25, 2021

Reference Issues/PRs

Fixes #19045

What does this implement/fix? Explain your changes.

This implements a GridFactory with functionality comparable to searchgrid to allow easier specification of parameter search spaces, especially as a composite estimator is built.
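As a rough illustration of the idea (the names `GridFactory`, `set`, and `build_grid` here are hypothetical, not the PR's actual API), the factory's core job is to attach a search space to each estimator and later flatten those spaces into the `step__param` grid that `GridSearchCV` expects:

```python
# Hypothetical sketch of the searchgrid-style pattern: per-estimator
# search spaces, flattened into a "step__param" grid. Not the PR's code.

class GridFactory:
    def __init__(self):
        self._spaces = {}  # maps id(estimator) -> {param: candidate values}

    def set(self, estimator, **space):
        """Record a search space for this estimator and return it unchanged."""
        self._spaces[id(estimator)] = space
        return estimator

    def build_grid(self, named_steps):
        """Flatten recorded spaces into a GridSearchCV-style param grid."""
        grid = {}
        for name, est in named_steps:
            for param, values in self._spaces.get(id(est), {}).items():
                grid[name + "__" + param] = values
        return grid

# Stand-ins for real estimators, to keep the sketch self-contained.
class FakePCA: pass
class FakeLogReg: pass

factory = GridFactory()
pca = factory.set(FakePCA(), n_components=[2, 5, 10])
clf = factory.set(FakeLogReg(), C=[0.1, 1.0, 10.0])
grid = factory.build_grid([("pca", pca), ("logreg", clf)])
# grid == {"pca__n_components": [2, 5, 10], "logreg__C": [0.1, 1.0, 10.0]}
```

The user never types a `__`-prefixed name; the prefixes are derived from the step names when the grid is built.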

Any other comments?

A complete version of this will:

  • import tests from searchgrid
  • adopt GridFactory in some examples
  • add a DistributionFactory for use with Randomized searches (this is tricky because currently Randomized searches do not support conditional search spaces, unless I'm much mistaken)
  • add factory support to *SearchCV (currently only GridSearchCV)
  • maybe add a function to interpret the parameter names in cv_results_ in a __-free way

I note that supporting a factory to be passed to GridSearchCV instead of a parameter grid doesn't save the user much code. Maybe we don't need to support it?

@glemaitre (Member)

I don't think I commented on the issue, but I really like the cognitive overhead this factory removes for the user. The __ is problematic to explain when teaching.

I would be happy to review and give feedback whenever you want. Just ping me directly.

@ogrisel (Member) commented Nov 25, 2021

Wouldn't it make sense to add methods directly to the estimators themselves?

>>> pipeline = make_pipeline(
...     PCA().with_search_space(n_components=[2, 5, 10, 20, 50, 100]),
...     LogisticRegression().with_search_space(C=np.logspace(-6, 6, num=13)),
... )
>>> GridSearchCV(pipeline, cv=5).fit(X, y).best_params_
{"pca__n_components": 100, "logisticregression__C": 1e-3}

EDIT: One could even special-case pipelines to accept tuples of estimators, to express a disjunction in the search space:

>>> pipeline = make_pipeline(
...     (
...         PCA().with_search_space(n_components=[2, 5, 10, 20, 50, 100]),
...         NMF().with_search_space(n_components=[2, 5, 10, 20, 50, 100]),
...     ),
...     LogisticRegression().with_search_space(C=np.logspace(-6, 6, num=13)),
... )
>>> GridSearchCV(pipeline, cv=5).fit(X, y).best_params_
{"nmf__n_components": 100, "logisticregression__C": 1e-3}

If you call pipeline.fit(X, y) it would just use the first element of each tuple by default, but such disjunctive pipeline would primarily be meant to be tuned by grid search or random search.

This is a bit implicit, but given the gain in readability that might be worth it.
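The "first element of each tuple" default described above can be sketched in a few lines (names here are illustrative, not a proposed API): a disjunctive step list is resolved to a concrete one before a plain `fit` call.

```python
# Minimal sketch of resolving disjunctive (tuple) pipeline steps:
# a plain fit() would use the first candidate of each tuple, while a
# search would enumerate all of them.
def resolve_steps(steps):
    """Pick the first candidate from each disjunctive (tuple) step."""
    return [step[0] if isinstance(step, tuple) else step for step in steps]

steps = [("pca", "nmf"), "logisticregression"]
assert resolve_steps(steps) == ["pca", "logisticregression"]
```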

@ogrisel (Member) commented Nov 25, 2021

BTW, another source of inspiration to tackle this problem:

@jnothman (Member, Author)

Thanks for those references, I'll enjoy reading them.

I don't mind setting directly on the estimators, either with a method on the estimators or with an external function that just assumes it can write to a certain attribute of the estimator. Do we think this would be better?

@glemaitre (Member) commented Nov 25, 2021 via email

@jnothman (Member, Author)

I agree it would probably be more intuitive to use.
It would require a SLEP.

@jnothman (Member, Author)

Neuraxle has some nice solutions, recognising some of the longstanding questionable design choices in scikit-learn. One of our troubles is again going to be clone, although perhaps here the whole parameter space can be calculated without cloning first.

@ogrisel (Member) commented Nov 26, 2021

I agree. Let's put this API design on the agenda of the next dev meeting if you plan to attend @jnothman.

@thomasjpfan (Member) commented Nov 26, 2021

The interaction between attributes not set in __init__ and clone was observed in the metadata routing PR. A possible solution is to introduce an attrs attribute that clone will always copy over. This is similar to the attrs attribute on pandas DataFrames or xarray Datasets.

For this PR, we could use the attrs attribute to store the hyperparameter space.
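A toy sketch of that idea (not scikit-learn's actual clone implementation): a clone that reconstructs an estimator from its constructor parameters, as scikit-learn's clone does, but additionally carries over an `attrs` dict by convention.

```python
# Hypothetical sketch: clone() rebuilds the estimator from its
# __init__ params but also copies an `attrs` dict, mirroring the
# pandas.DataFrame.attrs idea. Names are illustrative only.

class Estimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.attrs = {}  # survives cloning by convention

    def get_params(self):
        return {"alpha": self.alpha}

def clone(estimator):
    """Rebuild from constructor params, then copy attrs verbatim."""
    new = type(estimator)(**estimator.get_params())
    new.attrs = dict(estimator.attrs)
    return new

est = Estimator(alpha=0.5)
est.attrs["search_space"] = {"alpha": [0.1, 1.0, 10.0]}
cloned = clone(est)
# cloned.alpha == 0.5 and cloned.attrs carries the search space along
```

The search space then survives the clone calls that cross-validation performs, without being an __init__ parameter itself.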

@amueller (Member)

add a DistributionFactory for use with Randomized searches (this is tricky because currently Randomized searches do not support conditional search spaces, unless I'm much mistaken)

They support lists of search spaces, which is the same as GridSearchCV, right?
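For reference, a list of dicts is how GridSearchCV's param_grid expresses a conditional ("disjunctive") space: each dict is an independent sub-grid, and the candidates are the union of the sub-grids' expansions. A rough pure-Python sketch of that expansion (not scikit-learn's ParameterGrid implementation):

```python
# Sketch of expanding a list-of-dicts search space into individual
# candidate settings, as GridSearchCV's param_grid=[...] form does.
from itertools import product

def expand(param_grid):
    """Yield one dict per parameter combination, per sub-grid."""
    for sub_grid in param_grid:
        keys = sorted(sub_grid)
        for values in product(*(sub_grid[k] for k in keys)):
            yield dict(zip(keys, values))

grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]
candidates = list(expand(grid))
# 2 "linear" settings + 4 "rbf" settings = 6 candidates in total
```

Parameters like gamma only appear in the sub-grid where they are relevant, which is the conditional structure being discussed.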

@amueller (Member)

Wouldn't it make sense to add methods directly to the estimators themselves?

I think what I find slightly counter-intuitive about that is that with_search_space doesn't change the behavior of fit. But I think that's probably fine?
Overall, something like this plus built-in search spaces is something I've wanted for years.

Successfully merging this pull request may close these issues: Better (__-free) ways to specify grid search hyperparameters
5 participants