8000 [MRG] extending BaseSearchCV with a custom search strategy by jnothman · Pull Request #9599 · scikit-learn/scikit-learn · GitHub

[MRG] extending BaseSearchCV with a custom search strategy #9599


Merged
merged 32 commits into scikit-learn:master on Aug 5, 2018

Conversation

@jnothman (Member) commented Aug 21, 2017

Fixes #9499 through inheritance, or by providing a new AdaptiveSearchCV

The idea is that something like skopt.BayesSearchCV can be implemented as:

class MySearchCV(BaseSearchCV):
    def __init__(....):
        ...

    def _run_search(self, evaluate_candidates):
        results = evaluate_candidates(initial_candidates)
        while ...:
            results = evaluate_candidates(more_candidates)

Note that it should be possible to use scipy.optimize minimizers in this framework.
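
For concreteness, here is a hypothetical sketch of such a subclass. It assumes the interface described above (evaluate_candidates takes a list of parameter dicts and returns the cumulative cv_results_-style dict); the import path, the two-stage strategy and the C parameter are illustrative only.

from sklearn.model_selection._search import BaseSearchCV  # private module path


class TwoStageSearchCV(BaseSearchCV):
    """Hypothetical search: coarse sweep, then refine around the best."""

    def __init__(self, estimator, **kwargs):
        # Forwards cv, scoring, n_jobs, etc. to BaseSearchCV.
        super().__init__(estimator, **kwargs)

    def _run_search(self, evaluate_candidates):
        # Stage 1: evaluate a coarse set of candidates.
        results = evaluate_candidates([{'C': c} for c in (0.01, 1.0, 100.0)])
        # Stage 2: refine around the best-ranked candidate seen so far,
        # using the cv_results_-style keys 'params' and 'rank_test_score'.
        ranks = list(results['rank_test_score'])
        best = results['params'][ranks.index(min(ranks))]
        evaluate_candidates([{'C': best['C'] * f} for f in (0.3, 3.0)])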

I've also provided AdaptiveSearchCV, which exposes the search strategy as a parameter in a public interface rather than relying on inheritance, but I'm undecided about it.

TODO:

  • Get other core devs to comment on broad API
  • Probably remove AdaptiveSearchCV and leave a private/protected API
  • Decide how we document an API for inheritance, as this is a bit new to Scikit-learn
  • Decide whether this is an "experimental" API or something whose stability we want to guarantee
  • Should _run_search be allowed to change the cv_results_ dict? Should we put (in)formal requirements (or an API) on how cv_results_ can be changed?

@jnothman (Member Author)

Ping @betatim

@jnothman jnothman changed the title [WIP] *SearchCV can now provide candidates through a coroutine [MRG] *SearchCV can now provide candidates through a coroutine Aug 22, 2017
@jnothman (Member Author)

Decided to make this MRG. Why not?

@jnothman jnothman changed the title [MRG] *SearchCV can now provide candidates through a coroutine [WIP] *SearchCV can now provide candidates through a coroutine Aug 22, 2017
@jnothman (Member Author)

This is messy but, I hope, fixed.

@jnothman jnothman changed the title [WIP] *SearchCV can now provide candidates through a coroutine [MRG] *SearchCV can now provide candidates through a coroutine Aug 22, 2017
@betatim (Member) commented Aug 22, 2017

This is exactly what I tried to describe with words :) 👍


To be overridden by implementors.

It can iteratively generate a list of candidate parameter dicts, and is
Member

I think there is a grammatical mistake with this sentence (or I just don't understand it)

Member Author

I think it parses. But it's not easy :)

How about "As in the following snippet, implementations yield lists of candidates, where each candidate is a dict of parameter settings. The yield expression then returns the results dict corresponding to those candidates."?

@@ -555,6 +559,22 @@ def classes_(self):
        self._check_is_fitted("classes_")
        return self.best_estimator_.classes_

    def _generate_candidates(self):
        """Generates lists of candidates to search as a coroutine
Member

-> "Generate lists of candidate parameters to evaluate."?

Having our conversation in mind I still had to think twice to grok this, so I'll try and make a suggestion for the doc string that would have helped me.

Member Author

I use "candidate" to mean a full parameter setting, i.e. multiple parameters and their values.

Member

kk, my main issue was the "coroutine". I think that as a reader/user here I don't need to know about coroutines (and the fifteen different ways people define them). IMHO the important bit is "this has to be a generator that can have stuff sent into it". There aren't that many of those out in the wild I'd guess.

"""Base class for hyper parameter search with cross-validation."""
"""Abstract base class for hyper parameter search with cross-validation.

Implementers should provide ``__init__`` and either ``_get_param_iterator``
Member

for my education: why do we need to provide __init__ if it is empty (like in the test)? Doesn't Python call the base __init__ automatically?

Member Author

Only because the current code marks it as an abstractmethod.

@jnothman (Member Author)

All this result munging makes a mess of the code, admittedly. We could perhaps make it a little cleaner if _generate_candidates was sent a list of dicts rather than a dict of lists as results.

_generate_candidates could also be renamed to _generate_candidate_lists if that's clearer.
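
To make the two layouts under discussion concrete (values are made up):

# cv_results_-style dict of arrays, as currently sent back:
{'params': [{'C': 1}, {'C': 10}],
 'mean_test_score': [0.81, 0.84],
 'rank_test_score': [2, 1]}

# the alternative: a list of dicts, one per candidate
[{'params': {'C': 1}, 'mean_test_score': 0.81, 'rank_test_score': 2},
 {'params': {'C': 10}, 'mean_test_score': 0.84, 'rank_test_score': 1}]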

@jnothman (Member Author)

@betatim, do you have a preference as to whether you would rather receive a dict of arrays, as in cv_results_ or a list of dicts?

@jnothman (Member Author)

Now "coroutine" free.

@betatim (Member) commented Aug 22, 2017

As it is now (dict of arrays) seems fine.

Why not pass all results back instead of just the last iteration's? In skopt we would end up doing that, as you need the whole history.

@jnothman (Member Author) commented Aug 22, 2017 via email

@betatim (Member) commented Aug 22, 2017

One dict per parameter setting tried? Sounds like a lot of generators will write for-loop code to collect all the scores together, all the parameter values, etc. :-/

An already-merged dict of arrays seems best, I think.

@jnothman (Member Author) commented Aug 22, 2017 via email

@betatim (Member) commented Aug 24, 2017

Yes. At least for skopt we need the "complete" history and I would guess most other approaches are also interested in the whole history so far to make their stopping decision.

@jnothman (Member Author)

@betatim, I've updated this to provide cumulative history to the coroutine. As a downside, it repeats some work on aggregating results, but I suppose that could be made more efficient at a later point if it became an issue, and at least it's not a function of the dataset size except where that determines the number of splits.

WDYT?

@jnothman (Member Author)

And now I'm wondering whether we should just add a warm_start parameter to GridSearchCV, allowing users to change the parameter grid and also, perhaps, the number of splits, and for this to be included in the same results listing and ranking.

@stsievert (Contributor)

> I think we shouldn't promise anything but merge this. I'm not sure how others feel about that?

I feel the same way: merge this PR without any promises (/private methods only). I still have some questions on the API, and would like to play with it.

@amueller (Member)

@stsievert can you maybe provide a bit of context on your usage, just to inform our decision making / out of curiosity?

@stsievert (Contributor)

@amueller I'm implementing an adaptive hyperparameter search called Hyperband in dask/dask-ml#221. In this setting, Hyperband recognizes that optimization is iterative and treats computation as a scarce resource, so it relies on partial_fit (though it doesn't have to be that way; it can be applied to black box models too).

This PR cannot be used directly unless evaluate can be customized to support partial_fit (see #11266 for related work). I think this PR caught my interest because I study adaptive methods in graduate school.

@jnothman jnothman changed the title [WIP] extending BaseSearchCV with a custom search strategy [MRG] extending BaseSearchCV with a custom search strategy Jul 24, 2018
@jnothman (Member Author) commented Jul 24, 2018 via email

@amueller (Member)

When I looked at it I felt #11354 wasn't that widely applicable. I think @janvanrji had argued to me it would be useful for successive halving, but I argued that's easy to implement with the current code. Need to revisit the motivation.

@jnothman (Member Author)

Another review of this as an experimental private interface? @lesteve? @glemaitre?

@glemaitre (Member) left a comment

LGTM with tiny nitpicks.

The changes are minimal and they simplify things for contrib projects, so I would be +1.

if attr[0].islower() and attr[-1:] == '_' and \
        attr not in {'cv_results_', 'best_estimator_',
                     'refit_time_'}:
    assert_equal(getattr(gscv, attr), getattr(mycv, attr),
Member

assert

def test_custom_run_search():
    def check_results(results, gscv):
        exp_results = gscv.cv_results_
        assert_equal(sorted(results.keys()), sorted(exp_results))
Member

plain assert

        assert_array_equal(exp_results[k], results[k],
                           err_msg='Checking ' + k)
    else:
        assert_almost_equal(exp_results[k], results[k],
Member

assert_allclose

@@ -577,6 +578,30 @@ def classes_(self):
        self._check_is_fitted("classes_")
        return self.best_estimator_.classes_

    @abstractmethod
    def _run_search(self, evaluate_candidates):
        """Repeatedly calls evaluate_candidates to conduct a search
Member

backticks around evaluate_candidates

Member

and a full stop

@jnothman (Member Author) commented Jul 25, 2018 via email

@jnothman (Member Author)

Now, do I add a what's new along the lines of:

- `BaseSearchCV` now has an experimental private interface to support
  customized parameter search strategies, through its ``_run_search``
  method.  See the implementations in :class:`model_selection.GridSearchCV`
  and :class:`model_selection.RandomizedSearchCV` and please provide feedback
  if you use this. Note that we do not assure the stability of this API
  beyond version 0.20. :issue:`9599` by `Joel Nothman`_

?

@glemaitre (Member)

A what's new entry would be nice, I think, even though these are not public changes.

@ogrisel (Member) left a comment

LGTM. Please see the following comments.


::

def _run_search(self):
Member

typo:

def _run_search(self, evaluate_candidates):
   ...

    @abstractmethod
    def _run_search(self, evaluate_candidates):
        """Repeatedly calls `evaluate_candidates` to conduct a search.

@ogrisel (Member) commented Aug 3, 2018

Please add some motivation for the intent of this abstract method. For instance:

This method, implemented in sub-classes, makes it possible to customize the
scheduling of evaluations: GridSearchCV and RandomizedSearchCV schedule
evaluations for their whole parameter search space at once, but other, more
sequential approaches are also possible: for instance, it is possible to
iteratively schedule evaluations for new regions of the parameter search
space based on previously collected evaluation results. This makes it
possible to implement Bayesian optimization or, more generally, sequential
model-based optimization by deriving from the BaseSearchCV abstract base
class.

Member Author

Very nice text!

@ogrisel (Member) commented Aug 3, 2018

Maybe for another PR, I think it might also be interesting to add:

class BaseSearchCV(...):

    @staticmethod
    def _fit_and_score(base_estimator, X, y, ...):
        """Make it possible to override model evaluation in subclasses"""
        return _fit_and_score(base_estimator, X, y, ...)

This would make it possible to override the default implementation in subclasses: for instance, to snapshot the fitted models to disk in order to later build an ensemble of the top performers, or to save the model evaluations in a DB to feed a dashboard.

@jnothman (Member Author) commented Aug 5, 2018

I'm not sure about allowing users to customise _fit_and_score... It seems too powerful/essential, and something we need to be able to change on a whim. If they want to save estimators or log to a DB, scoring can do that in a hacky way. I've proposed before having another callback for storing diagnostics.

@jnothman (Member Author) commented Aug 5, 2018

And: let's merge this when green... Sorry it's taken a while, @betatim!

@jnothman jnothman merged commit 477c921 into scikit-learn:master Aug 5, 2018
@saddy001

The method BaseSearchCV._run_search is now mandatory for derived classes because it's abstract. Classes that inherit from BaseSearchCV and don't implement the new method fail with

TypeError: Can't instantiate abstract class <class_name> with abstract methods _run_search

This breaks external libraries, for example skopt.BayesSearchCV. Is this on purpose?

@amueller (Member)

@saddy001 that's unfortunate. Ideally this refactor would make BayesSearchCV easier to implement. Maybe we should not make this abstract but just raise a NotImplementedError, in case the derived class overwrites fit (which I guess is what BayesSearchCV does). If they overwrite fit, what exactly does the class use from BaseSearchCV right now?
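
Something like the following sketch, perhaps (just the shape of the suggestion, not what this PR merged):

def _run_search(self, evaluate_candidates):
    """Repeatedly calls `evaluate_candidates` to conduct a search."""
    # Non-abstract fallback: subclasses that override fit() entirely
    # (e.g. skopt.BayesSearchCV) can still be instantiated, while
    # subclasses relying on BaseSearchCV.fit must override this method.
    raise NotImplementedError("_run_search not implemented.")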

@saddy001

It seems to implement the following methods:

dir(BayesSearchCV)
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_is_fitted', '_check_search_space', '_estimator_type', '_fit', '_fit_best_model', '_format_results', '_get_param_names', '_make_optimizer', '_run_search', '_step', 'best_params_', 'best_score_', 'classes_', 'decision_function', 'fit', 'get_params', 'inverse_transform', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params', 'total_iterations', 'transform']

The code is here: https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/searchcv.py
Maybe we should 'ping' the developers?

@amueller (Member)

not sure what the dir is supposed to be telling me. My question was what do they inherit, and I think the answer is score?

@saddy001

Yep sorry, dir is also listing inherited attributes. I think this is what you want:

BaseSearchCV.__dict__.keys() - BayesSearchCV.__dict__.keys()
{'score', '_run_search', '_check_is_fitted', 'predict_log_proba', 'inverse_transform', 'predict', 'classes_', '_format_results', 'predict_proba', 'transform', 'decision_function', '_estimator_type'}
