8000 [MRG] extending BaseSearchCV with a custom search strategy by jnothman · Pull Request #9599 · scikit-learn/scikit-learn · GitHub

[MRG] extending BaseSearchCV with a custom search strategy #9599


Merged
merged 32 commits into scikit-learn:master on Aug 5, 2018

Conversation

@jnothman (Member) commented Aug 21, 2017

Fixes #9499 through inheritance, or by providing a new AdaptiveSearchCV

The idea is that something like skopt.BayesSearchCV can be implemented as:

class MySearchCV(BaseSearchCV):
    def __init__(....):
        ...

    def _run_search(self, evaluate_candidates):
        results = evaluate_candidates(initial_candidates)
        while ...:
            results = evaluate_candidates(more_candidates)

Note that it should be possible to use scipy.optimize minimizers in this framework.
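
For concreteness, here is a hypothetical sketch of such a subclass. It assumes the interface described above (evaluate_candidates takes a list of parameter dicts and returns the cumulative cv_results_-style dict); the import path, the two-stage strategy and the C parameter are illustrative only.

from sklearn.model_selection._search import BaseSearchCV  # private module path


class TwoStageSearchCV(BaseSearchCV):
    """Hypothetical search: coarse sweep, then refine around the best."""

    def __init__(self, estimator, **kwargs):
        # Forwards cv, scoring, n_jobs, etc. to BaseSearchCV.
        super().__init__(estimator, **kwargs)

    def _run_search(self, evaluate_candidates):
        # Stage 1: evaluate a coarse set of candidates.
        results = evaluate_candidates([{'C': c} for c in (0.01, 1.0, 100.0)])
        # Stage 2: refine around the best-ranked candidate seen so far,
        # using the cv_results_-style keys 'params' and 'rank_test_score'.
        ranks = list(results['rank_test_score'])
        best = results['params'][ranks.index(min(ranks))]
        evaluate_candidates([{'C': best['C'] * f} for f in (0.3, 3.0)])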

I've also provided AdaptiveSearchCV, which exposes the search strategy as a parameter in a public interface rather than relying on inheritance, but I'm undecided about it.

TODO:

  • Get other core devs to comment on broad API
  • Probably remove AdaptiveSearchCV and leave a private/protected API
  • Decide how we document an API for inheritance, as this is a bit new to Scikit-learn
  • Decide whether this is an "experimental" API or something whose stability we want to guarantee
  • Should _run_search be allowed to change the cv_results_ dict? Should we put (in)formal requirements (or an API) on how cv_results_ can be changed?

@jnothman (Member Author)

Ping @betatim

@jnothman jnothman changed the title [WIP] *SearchCV can now provide candidates through a coroutine [MRG] *SearchCV can now provide candidates through a coroutine Aug 22, 2017
@jnothman (Member Author)

Decided to make this MRG. Why not?

@jnothman jnothman changed the title [MRG] *SearchCV can now provide candidates through a coroutine [WIP] *SearchCV can now provide candidates through a coroutine Aug 22, 2017
@jnothman (Member Author)

This is messy but, I hope, fixed.

@jnothman jnothman changed the title [WIP] *SearchCV can now provide candidates through a coroutine [MRG] *SearchCV can now provide candidates through a coroutine Aug 22, 2017
@betatim (Member) commented Aug 22, 2017

This is exactly what I tried to describe with words :) 👍


To be overridden by implementors.

It can iteratively generate a list of candidate parameter dicts, and is
Member

I think there is a grammatical mistake with this sentence (or I just don't understand it)

Member Author

I think it parses. But it's not easy :)

How about "As in the following snippet, implementations yield lists of candidates, where each candidate is a dict of parameter settings. The yield expression then returns the results dict corresponding to those candidates."?

@@ -555,6 +559,22 @@ def classes_(self):
        self._check_is_fitted("classes_")
        return self.best_estimator_.classes_

    def _generate_candidates(self):
        """Generates lists of candidates to search as a coroutine
Member

-> "Generate lists of candidate parameters to evaluate."?

Having our conversation in mind I still had to think twice to grok this, so I'll try and make a suggestion for the doc string that would have helped me.

Member Author

I use "candidate" to mean a full parameter setting, i.e. multiple parameters and their values.

Member

kk, my main issue was the "coroutine". I think that as a reader/user here I don't need to know about coroutines (and the fifteen different ways people define them). IMHO the important bit is "this has to be a generator that can have stuff sent into it". There aren't that many of those out in the wild I'd guess.

"""Base class for hyper parameter search with cross-validation."""
"""Abstract base class for hyper parameter search with cross-validation.

Implementers should provide ``__init__`` and either ``_get_param_iterator``
Member

for my education: why do we need to provide __init__ if it is empty (like in the test)? Doesn't Python call the base __init__ automatically?

Member Author

Only because the current code marks it as an abstractmethod.

@jnothman (Member Author)

All this result munging makes a mess of the code, admittedly. We could perhaps make it a little cleaner if _generate_candidates was sent a list of dicts rather than a dict of lists as results.

_generate_candidates could also be renamed to _generate_candidate_lists if that's clearer.
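
To make the two layouts under discussion concrete (values are made up):

# cv_results_-style dict of arrays, as currently sent back:
{'params': [{'C': 1}, {'C': 10}],
 'mean_test_score': [0.81, 0.84],
 'rank_test_score': [2, 1]}

# the alternative: a list of dicts, one per candidate
[{'params': {'C': 1}, 'mean_test_score': 0.81, 'rank_test_score': 2},
 {'params': {'C': 10}, 'mean_test_score': 0.84, 'rank_test_score': 1}]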

@jnothman (Member Author)

@betatim, do you have a preference as to whether you would rather receive a dict of arrays, as in cv_results_ or a list of dicts?

@jnothman (Member Author)

Now "coroutine" free.

@betatim (Member) commented Aug 22, 2017

As it is now (dict of arrays) seems fine.

Why not pass all results back instead of just the last iteration's? In skopt we would end up doing that, as you need the whole history.

@jnothman (Member Author) commented Aug 22, 2017 via email

@betatim (Member) commented Aug 22, 2017

One dict per parameter setting tried? Sounds like a lot of generators will write for-loop code to collect all the scores together, all the parameter values, etc. :-/

An already-merged dict of arrays seems best, I think.

@jnothman (Member Author) commented Aug 22, 2017 via email

@betatim (Member) commented Aug 24, 2017

Yes. At least for skopt we need the "complete" history and I would guess most other approaches are also interested in the whole history so far to make their stopping decision.

@jnothman (Member Author)

@betatim, I've updated this to provide cumulative history to the coroutine. As a downside, it repeats some work on aggregating results, but I suppose that could be made more efficient at a later point if it became an issue, and at least it's not a function of the dataset size except where that determines the number of splits.

WDYT?

@jnothman (Member Author)

And now I'm wondering whether we should just add a warm_start parameter to GridSearchCV, allowing users to change the parameter grid and also, perhaps, the number of splits, and for this to be included in the same results listing and ranking.

@stsievert (Contributor)

> I think we shouldn't promise anything but merge this. I'm not sure how others feel about that?

I feel the same way: merge this PR without any promises (/private methods only). I still have some questions on the API, and would like to play with it.

@amueller (Member)

@stsievert can you maybe provide a bit of context on your usage, just to inform our decision making / out of curiosity?

@stsievert (Contributor)

@amueller I'm implementing an adaptive hyperparameter search called Hyperband in dask/dask-ml#221. In this setting, Hyperband recognizes that optimization is iterative and treats computation as a scarce resource, so it relies on partial_fit (though it doesn't have to be that way; it can be applied to black box models too).

This PR cannot be used directly unless evaluate can be customized to support partial_fit (see #11266 for related work). I think this PR caught my interest because I study adaptive methods in graduate school.

@jnothman jnothman changed the title [WIP] extending BaseSearchCV with a custom search strategy [MRG] extending BaseSearchCV with a custom search strategy Jul 24, 2018
@jnothman (Member Author) commented Jul 24, 2018 via email

@amueller (Member)

When I looked at it I felt #11354 wasn't that widely applicable. I think @janvanrji had argued to me it would be useful for successive halving, but I argued that's easy to implement with the current code. Need to revisit the motivation.

@jnothman (Member Author)

Another review of this as an experimental private interface? @lesteve? @glemaitre?

@glemaitre (Member) left a comment

LGTM with tiny nitpicks.

The changes are minimal and they simplify things for contrib projects, so I would be +1.

if attr[0].islower() and attr[-1:] == '_' and \
        attr not in {'cv_results_', 'best_estimator_',
                     'refit_time_'}:
    assert_equal(getattr(gscv, attr), getattr(mycv, attr),
Member

assert

def test_custom_run_search():
    def check_results(results, gscv):
        exp_results = gscv.cv_results_
        assert_equal(sorted(results.keys()), sorted(exp_results))
Member

plain assert

        assert_array_equal(exp_results[k], results[k],
                           err_msg='Checking ' + k)
    else:
        assert_almost_equal(exp_results[k], results[k],
Member

assert_allclose

@@ -577,6 +578,30 @@ def classes_(self):
        self._check_is_fitted("classes_")
        return self.best_estimator_.classes_

    @abstractmethod
    def _run_search(self, evaluate_candidates):
        """Repeatedly calls evaluate_candidates to conduct a search
Member

backticks around evaluate_candidates

Member

and a full stop

@jnothman (Member Author) commented Jul 25, 2018 via email

@jnothman (Member Author)

Now, do I add a what's new along the lines of:

- `BaseSearchCV` now has an experimental private interface to support
  customized parameter search strategies, through its ``_run_search``
  method.  See the implementations in :class:`model_selection.GridSearchCV`
  and :class:`model_selection.RandomizedSearchCV` and please provide feedback
  if you use this. Note that we do not assure the stability of this API
  beyond version 0.20. :issue:`9599` by `Joel Nothman`_

?

@glemaitre (Member)

A what's new entry would be nice, I think, even though these are not public changes.

@ogrisel (Member) left a comment

LGTM. Please see the following comments.


::

def _run_search(self):
Member

typo:

def _run_search(self, evaluate_candidates):
   ...

    @abstractmethod
    def _run_search(self, evaluate_candidates):
        """Repeatedly calls `evaluate_candidates` to conduct a search.

@ogrisel (Member) commented Aug 3, 2018

Please add some motivation for the intent of this abstract method. For instance:

This method, implemented in sub-classes, makes it possible to customize the
scheduling of evaluations: GridSearchCV and RandomizedSearchCV schedule
evaluations for their whole parameter search space at once, but other, more
sequential approaches are also possible: for instance, it is possible to
iteratively schedule evaluations for new regions of the parameter search
space based on previously collected evaluation results. This makes it
possible to implement Bayesian optimization or, more generally, sequential
model-based optimization by deriving from the BaseSearchCV abstract base
class.

Member Author

Very nice text!

@ogrisel (Member) commented Aug 3, 2018

Maybe for another PR, I think it might also be interesting to add:

class BaseSearchCV(...):

    @staticmethod
    def _fit_and_score(base_estimator, X, y, ...):
        """Make it possible to override model evaluation in subclasses"""
        return _fit_and_score(base_estimator, X, y, ...)

This would make it possible to override the default implementation in subclasses: for instance, to snapshot the fitted models to disk in order to later build an ensemble of the top performers, or to save the model evaluations in a DB to feed a dashboard.

@jnothman (Member Author) commented Aug 5, 2018

I'm not sure about allowing users to customise _fit_and_score... It seems too powerful/essential, and something we need to be able to change on a whim. If they want to save estimators or log to a DB, scoring can do that in a hacky way. I've proposed before having another callback for storing diagnostics.

@jnothman (Member Author) commented Aug 5, 2018

And: let's merge this when green... Sorry it's taken a while, @betatim!

@jnothman jnothman merged commit 477c921 into scikit-learn:master Aug 5, 2018
@saddy001

The method BaseSearchCV._run_search is now mandatory for derived classes because it's abstract. Classes that inherit from BaseSearchCV and don't implement the new method fail with

TypeError: Can't instantiate abstract class <class_name> with abstract methods _run_search

This breaks external libraries, for example skopt.BayesSearchCV. Is this on purpose?

@amueller (Member)

@saddy001 that's unfortunate. Ideally this refactor would make BayesSearchCV easier to implement. Maybe we should not make this abstract but just raise a NotImplementedError, in case the derived class overwrites fit (which I guess is what BayesSearchCV does). If they overwrite fit, what exactly does the class use from BaseSearchCV right now?
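
Something like the following sketch, perhaps (just the shape of the suggestion, not what this PR merged):

def _run_search(self, evaluate_candidates):
    """Repeatedly calls `evaluate_candidates` to conduct a search."""
    # Non-abstract fallback: subclasses that override fit() entirely
    # (e.g. skopt.BayesSearchCV) can still be instantiated, while
    # subclasses relying on BaseSearchCV.fit must override this method.
    raise NotImplementedError("_run_search not implemented.")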

@saddy001

It seems to implement the following methods:

dir(BayesSearchCV)
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_is_fitted', '_check_search_space', '_estimator_type', '_fit', '_fit_best_model', '_format_results', '_get_param_names', '_make_optimizer', '_run_search', '_step', 'best_params_', 'best_score_', 'classes_', 'decision_function', 'fit', 'get_params', 'inverse_transform', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params', 'total_iterations', 'transform']

The code is here: https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/searchcv.py
Maybe we should 'ping' the developers?

@amueller (Member)

not sure what the dir is supposed to be telling me. My question was what do they inherit, and I think the answer is score?

@saddy001

Yep sorry, dir is also listing inherited attributes. I think this is what you want:

BaseSearchCV.__dict__.keys() - BayesSearchCV.__dict__.keys()
{'score', '_run_search', '_check_is_fitted', 'predict_log_proba', 'inverse_transform', 'predict', 'classes_', '_format_results', 'predict_proba', 'transform', 'decision_function', '_estimator_type'}
