[MRG] extending BaseSearchCV with a custom search strategy #9599
Conversation
Ping @betatim
Decided to make this MRG. Why not?
This is messy but, I hope, fixed.
This is exactly what I tried to describe with words :) 👍
sklearn/model_selection/_search.py
Outdated
To be overridden by implementors.

It can iteratively generate a list of candidate parameter dicts, and is
I think there is a grammatical mistake with this sentence (or I just don't understand it)
I think it parses. But it's not easy :)
How about "As in the following snippet, implementations yield lists of candidates, where each candidate is a dict of parameter settings. The yield expression then returns the results dict corresponding to those candidates."?
sklearn/model_selection/_search.py
Outdated
@@ -555,6 +559,22 @@ def classes_(self):
        self._check_is_fitted("classes_")
        return self.best_estimator_.classes_

    def _generate_candidates(self):
        """Generates lists of candidates to search as a coroutine
-> "Generate lists of candidate parameters to evaluate."?
Having our conversation in mind I still had to think twice to grok this, so I'll try and make a suggestion for the doc string that would have helped me.
I use "candidate" to mean a full parameter setting, i.e. multiple parameters and their values.
kk, my main issue was the "coroutine". I think that as a reader/user here I don't need to know about coroutines (and the fifteen different ways people define them). IMHO the important bit is "this has to be a generator that can have stuff sent into it". There aren't that many of those out in the wild I'd guess.
sklearn/model_selection/_search.py
Outdated
"""Base class for hyper parameter search with cross-validation.""" | ||
"""Abstract base class for hyper parameter search with cross-validation. | ||
|
||
Implementers should provide ``__init__`` and either ``_get_param_iterator`` |
for my education: why do we need to provide __init__
if it is empty (like in the test)? Doesn't python call the base __init__
automatically?
Only because the current code marks it as an abstractmethod.
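A minimal sketch of the consequence, assuming only standard abc behaviour:

    from abc import ABC, abstractmethod

    class Base(ABC):
        @abstractmethod
        def __init__(self):
            pass

    class Child(Base):
        def __init__(self):  # must be defined, even if trivial
            super().__init__()

    Child()  # works; a subclass without __init__ would raise
             # TypeError: Can't instantiate abstract class ...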
All this result munging makes a mess of the code, admittedly. We could perhaps make it a little cleaner if
@betatim, do you have a preference as to whether you would rather receive a dict of arrays, as in
Now "coroutine" free.
As it is now (dict of arrays) seems fine. Why not pass all results back instead of just the last iteration's? In skopt we would end up doing that, as you need the whole history.
No particular reason why not except that I thought it a cleaner interface.
But I can see issues with expecting the generator to merge those arrays
together. Maybe it's safer and the code cleaner to pass results as a list
of dicts. Much easier then to accumulate from round to round...?
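For concreteness, the two shapes under discussion look roughly like this (values made up):

    # One merged dict of arrays: each key holds one entry per candidate.
    results_as_dict_of_arrays = {
        'param_C': [1, 10],
        'mean_test_score': [0.81, 0.79],
    }

    # One dict per candidate: trivially accumulated from round to round.
    results_as_list_of_dicts = [
        {'param_C': 1, 'mean_test_score': 0.81},
        {'param_C': 10, 'mean_test_score': 0.79},
    ]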
One dict per parameter setting tried? Sounds like a lot of generators will write the for-loop code to collect all scores together, all parameter values, etc. :-/ An already merged dict of arrays seems best, I think.
And a cumulative one, in your opinion, is better?
Yes. At least for
@betatim, I've updated this to provide cumulative history to the coroutine. As a downside, it repeats some work on aggregating results, but I suppose that could be made more efficient at a later point if it became an issue, and at least it's not a function of the dataset size, except where that determines the number of splits. WDYT?
And now I'm wondering whether we should just add a
I feel the same way: merge this PR without any promises (/private methods only). I still have some questions on the API, and would like to play with it.
@stsievert can you maybe provide a bit of context on your usage, just to inform our decision making / out of curiosity?
@amueller I'm implementing an adaptive hyperparameter search called Hyperband in dask/dask-ml#221. In this setting, Hyperband exploits the fact that optimization is iterative and treats computation as a scarce resource, so it relies on
This PR cannot be used directly unless
Yes, there aren't a lot of users. But those libraries in turn *should*
probably have several users, given the limitations of our own SearchCVs.
Anyway, thanks for the +1 @amueller. Does this formally get your approval
in its current state? Is someone else keen on reviewing?
@amueller, I would also be interested in your thoughts on #11354 which is
sort of a sister PR in solving "Customizing BaseSearchCV is hard" (#9499)
When I looked at it I felt #11354 wasn't that widely applicable. I think @janvanrji had argued to me it would be useful for successive halving, but I argued that's easy to implement with the current code. Need to revisit the motivation.
Another review of this as an experimental private interface? @lesteve? @glemaitre?
LGTM with tiny nitpicks.
The changes are minimal and they simplify things for contrib projects, so I would be +1.
if attr[0].islower() and attr[-1:] == '_' and \
        attr not in {'cv_results_', 'best_estimator_',
                     'refit_time_'}:
    assert_equal(getattr(gscv, attr), getattr(mycv, attr),
assert
def test_custom_run_search():
    def check_results(results, gscv):
        exp_results = gscv.cv_results_
        assert_equal(sorted(results.keys()), sorted(exp_results))
plain assert
    assert_array_equal(exp_results[k], results[k],
                       err_msg='Checking ' + k)
else:
    assert_almost_equal(exp_results[k], results[k],
assert_allclose
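For reference, a self-contained illustration of the suggested helpers (values made up):

    from numpy.testing import assert_allclose

    # Plain assert for exact comparisons:
    assert sorted(['b', 'a']) == ['a', 'b']
    # assert_allclose (rather than assert_almost_equal) for floats:
    assert_allclose(0.81111111, 0.81111112, rtol=1e-6)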
sklearn/model_selection/_search.py
Outdated
@@ -577,6 +578,30 @@ def classes_(self):
        self._check_is_fitted("classes_")
        return self.best_estimator_.classes_

    @abstractmethod
    def _run_search(self, evaluate_candidates):
        """Repeatedly calls evaluate_candidates to conduct a search
backticks around evaluate_candidates
and a full stop
hahaha thanks for keeping me on my toes!
Now, do I add a what's new along the lines of:
?
A what's new entry could be nice, I think, even though these are not public changes.
LGTM. Please see the following comments.
sklearn/model_selection/_search.py
Outdated
::

    def _run_search(self):
typo:
def _run_search(self, evaluate_candidates):
...
    @abstractmethod
    def _run_search(self, evaluate_candidates):
        """Repeatedly calls `evaluate_candidates` to conduct a search.
Please add some motivation for the intent of this abstract method. For instance:
This method, implemented in sub-classes, makes it possible to customize the
scheduling of evaluations: GridSearchCV and RandomizedSearchCV schedule
evaluations for their whole parameter search space at once, but other, more
sequential approaches are also possible: for instance, it is possible to
iteratively schedule evaluations for new regions of the parameter search
space based on previously collected evaluation results. This makes it
possible to implement Bayesian optimization or, more generally, sequential
model-based optimization by deriving from the BaseSearchCV abstract base
class.
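To illustrate (a sketch of such a sequential subclass; the candidate values are made up, but evaluate_candidates and the mean_test_score key are from this PR):

    def _run_search(self, evaluate_candidates):
        # Evaluate a first batch, then schedule a follow-up candidate
        # only if the cheaper setting looked more promising.
        results = evaluate_candidates([{'C': 1}, {'C': 10}])
        if results['mean_test_score'][0] > results['mean_test_score'][1]:
            evaluate_candidates([{'C': 0.1}])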
Very nice text!
Maybe for another PR, I think it might also be interesting to add:

class BaseSearchCV(...):

    @staticmethod
    def _fit_and_score(base_estimator, X, y, ...):
        """Make it possible to override model evaluation in subclasses"""
        return _fit_and_score(base_estimator, X, y, ...)

This would make it possible to override the default implementation in subclasses, for instance to snapshot the fitted models on disk and later build an ensemble of the top performers, or to save the model evaluations in a DB to feed a dashboard.
I'm not sure about allowing users to customise
And: let's merge this when green... Sorry it's taken a while, @betatim!
The method BaseSearchCV._run_search is now mandatory on derived classes, because it's abstract.
This breaks external libraries, for example skopt.BayesSearchCV. Is this on purpose?
@saddy001 that's unfortunate. Ideally this refactor would make BayesSearchCV easier to implement. Maybe we should not make this abstract but just raise a
It seems to implement the following methods:
The code is here: https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/searchcv.py
not sure what the
Yep sorry, dir is also listing inherited attributes. I think this is what you want:
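(The snippet itself is missing from this extract; presumably something along these lines was meant — my guess:)

    import skopt

    # vars(cls) lists only the attributes defined on the class itself,
    # unlike dir(), which also includes everything inherited:
    print(sorted(vars(skopt.BayesSearchCV)))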
Fixes #9499 through inheritance, or by providing a new AdaptiveSearchCV. The idea is that something like skopt.BayesSearchCV can be implemented as:
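(The snippet itself is missing from this extract. Roughly, and purely as an illustration, a Bayesian search might subclass BaseSearchCV like this; the optimizer's ask/tell interface follows skopt.Optimizer, and the point-to-dict conversion is hypothetical:)

    from sklearn.model_selection._search import BaseSearchCV

    class BayesSearchCV(BaseSearchCV):
        def __init__(self, estimator, optimizer, param_names, n_iter=32,
                     **kwargs):
            self.optimizer = optimizer      # e.g. a skopt.Optimizer
            self.param_names = param_names  # maps optimizer points to params
            self.n_iter = n_iter
            super().__init__(estimator, **kwargs)

        def _run_search(self, evaluate_candidates):
            for _ in range(self.n_iter):
                point = self.optimizer.ask()
                params = dict(zip(self.param_names, point))
                results = evaluate_candidates([params])
                # ask/tell optimizers minimize, so negate the latest score.
                self.optimizer.tell(point, -results['mean_test_score'][-1])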
Note that it should be possible to use scipy.optimize minimizers in this framework.
I've also provided AdaptiveSearchCV, which accepts a search parameter as a public interface, rather than relying on inheritance, but I'm undecided about it.
TODO:
- Whether to keep AdaptiveSearchCV or leave a private/protected API.
- Should _run_search be allowed to change the cv_results_ dict? Should we put (in)formal requirements (or an API) on how cv_results_ can be changed?