Customizing BaseSearchCV is hard #9499
I think (without looking in detail) that the SMAC implementation could have reused BaseSearchCV/RandomizedSearchCV more, albeit with a little overhead in each run. I would be +1 for something like:
WDYT? |
yeah, that would be good, I think. I don't think SMAC could have reused more. How? |
Could it not have reused more by just calling GridSearchCV(....).fit() within it?
|
Maybe. You'd have to consolidate all the dictionaries you get at the end. I guess that would be simpler than copying all the code. Hm it depends a bit what |
From my point of view the tricky bit is passing feedback from |
There's also a subtle trickiness in ensuring the CV splits are the same across calls and the fact that we could reuse the parallel backend. So we basically want a coroutine. Either:

```python
def candidates_generator():
    new_cv_results = yield candidates
    # ... determine new candidates to try or give up
```

and the search does something like:

```python
for candidates in gen:
    ... _fit_and_score(..., candidates, ...)
    gen.send(cv_results)
```

or inverted:

```python
def search_cv_generator(estimator, candidates, X, y, ...):
    while candidates:
        ...
        candidates = yield cv_results
```

which may make more sense. |
Perhaps the latter case, more refined, is:

```python
def search_cv_coroutine(estimator, X, y, ...):
    candidates = yield
    while candidates:
        ...
        candidates = yield cv_results
```

In the latter case, your skopt code might be:

```python
def fit(...):
    search = search_cv_coroutine(*args, **kwargs)
    next(search)  # prime the coroutine so it reaches the first yield
    results = search.send(candidates)
    all_results = results
    while results are not good enough:
        candidates = ...
        results = search.send(candidates)
        for k in all_results:
            all_results[k] = np.concatenate([all_results[k], results[k]])
    ... store results ...
```
|
@betatim yeah that's about where I thought. Though I wouldn't split the fit method, I would just encapsulate the call that generates the candidates and gets the results. I agree that co-routine is probably the best pattern for this. Do you know how that would work with the Parallel interface, though? I would still not have skopt overwrite fit, because most of the fit is parameter validation and result parsing, and these are exactly the parts they don't want to copy. |
The parallel context manager should work fine here
|
Btw, @betatim, the problem with passing results back from …

We can also achieve a similar effect to a coroutine with a class having an … |
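(A guess at the class-based equivalent hinted at above; nothing here is from the thread, and the ask/tell method names and all values are invented for illustration.)

```python
class CandidateProposer:
    """Holds the state a coroutine would otherwise keep between yields:
    the search loop calls ask() for the next batch of param dicts and
    tell() with the resulting scores."""

    def __init__(self, batches):
        self._batches = list(batches)
        self.history = []

    def ask(self):
        # Next batch of candidate parameter settings, or None when done.
        return self._batches.pop(0) if self._batches else None

    def tell(self, cv_results):
        # Record results; a smarter proposer would use them to build
        # the next batch instead of working from a fixed list.
        self.history.append(cv_results)


proposer = CandidateProposer([[{"C": 1}, {"C": 10}], [{"C": 0.1}]])
batch = proposer.ask()
# ... evaluate batch, call proposer.tell(results), repeat until ask() is None
```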
Thinking out loud here: there are a lot of things already in …

"All" I want to do is be given the results so far and asked for a new set of parameters to try. How can we make it so that these lines become customisable while keeping everything else fixed? I like the idea of sending stuff into a generator. How would it work, interface-wise, for that generator to return more than one new point per iteration (for parallelism)? My guess: you always send in all past results and the generator returns a list of next points per iteration. That does mean the generator has to know something about how "parallel" you want to be running. With a setup like this all … |
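(A toy version of that guess; the names and the fixed number of rounds are invented. Each round, the caller sends in everything evaluated so far and receives a batch sized for the available parallelism.)

```python
import random


def batch_proposer(batch_size=4, n_rounds=3):
    # Hypothetical interface: yield a list of candidate param dicts,
    # receive all results so far via .send() before proposing the next batch.
    def random_batch():
        return [{"C": 10 ** random.uniform(-2, 2)} for _ in range(batch_size)]

    past_results = yield random_batch()
    for _ in range(n_rounds - 1):
        # A real proposer would use past_results here (e.g. a BO model).
        past_results = yield random_batch()


gen = batch_proposer()
first_batch = next(gen)   # prime the generator to get the first batch
# ... evaluate first_batch, then: next_batch = gen.send(results_so_far)
```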
I was a bit uncertain about your statement, so I just mocked up one possible solution at #9599. |
Could _fit_and_score return the estimator with ret? Then in BaseSearchCV().fit(), we could get all trained estimators, and could keep the top 5 or top 10 best estimators, rather than using refit=True to retrain the best estimator and get the best_estimator_. |
Well, you would get one estimator for each fold. What about the user providing a path to which estimators were pickled? |
@jnothman I think each estimator is assigned to a hyper-parameter setting rather than a fold: every parameter setting gets an estimator that was trained in the function _fit_and_score, so we could retrieve or pickle it (and how would we pickle these estimators? I cannot find examples), and not need to fit the best estimator again by setting refit=True. |
No, refit fits on the entire training set. _fit_and_score deals with subsets of the training data. |
So if I want to get the top 5 estimators, I need to train the model with these top 5 parameter settings on the entire training set? Could refit be changed to accept an integer, so that users can get the top 5 estimators? |
That would be encouraging bad practice statistically. Apart from which, it's easy to do yourself. |
Sorry I'm a little bit late to the party, but this issue aims to solve something that I have bumped into in several personal projects. I would love to see this solved and would be happy to create a PR if that would potentially speed things up a little bit. Looking back at the email that I sent @amueller (FFR, on August 3rd, 2018), my idea back then was to make _fit_and_score a member function of BaseSearchCV so that it can be overridden by any child class, but I now see how that might not be the cleanest solution. Is the solution proposed by @jnothman on August 13 still relevant? If so, I could make some time for that this week. |
I'm very impressed at you sending Andy an email in August 2018! I would also really like to see this solved. #9599 is my attempt, and I know no substantial critique of it. As stated there, it needs road testing (as well as review from other core devs). Please try using that interface: using that branch, inherit from |
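(For concreteness, a minimal sketch of the kind of subclass that interface invites, assuming the _run_search(evaluate_candidates) hook from #9599; the class name and the C values are made up, and evaluate_candidates is assumed to return the accumulated cv_results_-style dict as in that proposal.)

```python
from sklearn.model_selection._search import BaseSearchCV


class CoarseToFineSearchCV(BaseSearchCV):
    """Hypothetical subclass: only the search strategy is customised."""

    def __init__(self, estimator, **kwargs):
        super().__init__(estimator=estimator, **kwargs)

    def _run_search(self, evaluate_candidates):
        # Cross-validate a first, coarse batch of candidates.
        results = evaluate_candidates([{"C": 1.0}, {"C": 100.0}])
        # Use the scores so far to decide where to look next.
        if results["mean_test_score"][0] >= results["mean_test_score"][1]:
            evaluate_candidates([{"C": 0.1}, {"C": 0.3}])
        else:
            evaluate_candidates([{"C": 30.0}, {"C": 300.0}])
```

Everything outside _run_search (validation, CV splitting, joblib parallelism, building cv_results_, refitting) would still come from the base class.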
To start positive, I really like this! If I understood the generators and the send function correctly, this send function returns the results from all executed runs in the while loop to the generator. This allows the BO logic to be implemented in a class inheriting from ParamSampler. The intention is probably to add a statement like …

Unfortunately, I don't think the current codebase works as intended. This line automatically goes over the whole generator, taking None values for the remaining yield statements. The following example illustrates my problem. It is a sampler that is almost completely copied from the ParamSampler used in Random Search; however, I added that after all parameter settings have been evaluated, it once more samples the best setting amongst the earlier sampled parameters (not useful in practice, but just for illustrative purposes):
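(The example code block did not survive in this copy of the thread; the following is only a guessed reconstruction from the description above. The generator-plus-send interface and every name here are assumptions, not the actual code from the branch.)

```python
import numpy as np
from sklearn.model_selection import ParameterSampler


def best_again_sampler(param_distributions, n_iter, random_state=None):
    # Yield random candidates, receiving the cv_results_-so-far back
    # through .send() after each batch is evaluated.
    results = None
    for params in ParameterSampler(param_distributions, n_iter,
                                   random_state=random_state):
        results = yield [params]
    # After everything has been evaluated, yield the best setting once
    # more (not useful in practice, just to illustrate the problem).
    best = int(np.argmax(results["mean_test_score"]))
    yield [results["params"][best]]
```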
Unfortunately, the code crashes on this one, due to the aforementioned line of code. If it helps, I can contribute a unit test and a proposal for a fix in your branch. |
In order to use joblib's parallelism, you should be yielding a list (or finite iterable) of candidate param dicts at once. So use …

Your code shouldn't determine the best. That's assumed to be behaviour provided by BaseSearchCV (even if there are alternative ways to do that too). You shouldn't be yielding it again at the end. |
I've given a toy example in the docs there now. Take a look |
I think I badly explained my intentions. I am aware of this mechanism; however, when implementing something similar to Hyperband, one wants to use the results from earlier iterations and rerun the best results on a higher budget, which is what I tried to outline in the code. I think this does exactly what I need. Basically, one should now be able to put this logic in the …

Small side note: in your latest commit you changed the return value of the (now) … |
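(As an illustration of putting that kind of logic behind the hook, a crude successive-halving sketch: the budget is expressed through an n_estimators parameter purely as an example, and all numbers are arbitrary. This is not code from the thread or the branch.)

```python
import numpy as np
from sklearn.model_selection import ParameterSampler


def run_halving_search(evaluate_candidates, param_distributions,
                       n_candidates=16, budgets=(10, 50, 250)):
    # Body one might put in a _run_search override: evaluate many
    # configurations cheaply, keep the best half, and re-run the
    # survivors with a larger budget.
    candidates = list(ParameterSampler(param_distributions, n_candidates))
    for budget in budgets:
        batch = [dict(c, n_estimators=budget) for c in candidates]
        results = evaluate_candidates(batch)
        # Results are cumulative, so rank only this batch's entries.
        scores = np.asarray(results["mean_test_score"][-len(batch):])
        keep = np.argsort(scores)[::-1][:max(1, len(batch) // 2)]
        candidates = [candidates[i] for i in keep]
```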
You're right, thanks. |
I had some time to play with the new interface, and in short I really like it. It is now possible to write a …

One thing that I also really like (although this was already the case before) is that the …

In an ideal world, the selection of the best parameter settings would also be replaceable. It would be really awesome if there were some overridable function that takes the results array as an input and returns the best index. Use case: Hyperband. Sometimes (due to bad luck) a configuration that was run on a fraction of the budget obtained the best performance, but (according to the paper) we would still like to recommend a parameter setting that was obtained on the full budget. |
Yes, but I think that needs to be a separate change.

I wonder if we should make this customisability more of a public interface, i.e. provide a CustomSearchCV(estimator, search_coroutine, ...), rather than using inheritance. This might be better in terms of our promising API stability... |
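(Hypothetically, such a public interface might look something like the following; CustomSearchCV does not exist, and this is only one reading of the suggestion above.)

```python
def my_search(evaluate_candidates):
    # User-supplied strategy: two rounds of hand-picked candidates.
    results = evaluate_candidates([{"alpha": 0.1}, {"alpha": 1.0}])
    if results["mean_test_score"][0] > results["mean_test_score"][1]:
        evaluate_candidates([{"alpha": 0.01}])


# Hypothetical public API, passed as an argument instead of subclassing:
# search = CustomSearchCV(Ridge(), my_search, cv=5)
# search.fit(X, y)
```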
There's probably a lot of interplay between the various components, and not all will work together. For example, for the Hyperband search_coroutine, one would probably also override the |
Another design question here. Should _generate_candidates be a function of fit arguments (X, y, ...)? |
ping @stsievert, if you have any thoughts on #9599 (@stsievert is also implementing Hyperband in dask/dask-searchcv#72) |
See also #11269 for flexibility in identifying the "best" estimator.
|
You mean that it would get X and y as parameters? I don't see the immediate advantage. There is one more related thing that is not immediately supported, which is manipulating (subsampling) the X and y in the search coroutine. (Is this what you meant?) If this were somehow possible that would be great, but I suppose there are also other ways of achieving this within the search routine (e.g., wrapping the estimator). |
No, I didn't mean for that purpose... we assume constant CV indices through all experiments. I meant perhaps so that the search could select good initial candidates given the data. |
Ah, of course. That would make perfect sense indeed. |
#9599 implements the functionality I'd like to see: there is some function call after each evaluation, which means it works for adaptive selection schemes (e.g., Bayesian approaches and Hyperband). I've added a review, and will be watching. This would enable the adaptive implementations discussed in dask/dask-ml#161. I'll try to make #11266 work with #9599.
That's been my experience too. |
Please explain your use case for overriding |
Sorry for the slow reply! For a hyperband or successive halving implementation, it wouldn't be a bad idea to add meta-data to the …

This can be used by the callable refit function as implemented in #11354, to determine which hyperparameter setting should be chosen (the original algorithm claims that only hyperparameters run on the maximum subsample size can be recommended). |
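(For example, assuming the search recorded its training budget in a hypothetical cv_results_ column called "budget", a callable refit along the lines of #11354 could restrict the choice to full-budget entries. A sketch, not code from the thread.)

```python
import numpy as np


def refit_full_budget(cv_results):
    # refit callables receive cv_results_ and must return a single
    # best_index_; here we only consider candidates evaluated on the
    # maximum budget and pick the best of those.
    budgets = np.asarray(cv_results["budget"])        # hypothetical column
    scores = np.asarray(cv_results["mean_test_score"])
    full = np.flatnonzero(budgets == budgets.max())
    return int(full[np.argmax(scores[full])])


# e.g. SomeHalvingSearchCV(..., refit=refit_full_budget)   # hypothetical class
```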
I'd rather implementers not play with _format_results, but we should have some contract about how the custom search function is allowed to modify the results. Hmmm. |
I'll be at SciPy next week. Currently, I do not completely get the whole idea behind results_container (line 682). Some questions that I wasn't able to answer:
Still interested in this? How should I add stuff? Mention it here or PR to your fork? |
results_container is because Python 2 has "local" and "global" variable scope, but not "nonlocal". It is an implementation detail (a hack) that can disappear when we end Python 2 support.

Yes, I've considered letting _run_search return the last results... but in most cases this is unnecessary. It implies that _run_search can modify the results arbitrarily, and I don't like that idea, or that we might then need to implement checks on the returned results if we support custom _run_search implementations.

And it's best to comment on the PR if it's about implementation.
|
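(The closure issue being worked around, in toy form; none of this is from the actual codebase.)

```python
def py2_style():
    results_container = [None]        # mutable cell: works on Python 2

    def evaluate(candidates):
        results_container[0] = {"evaluated": candidates}

    evaluate(["a", "b"])
    return results_container[0]


def py3_style():
    results = None

    def evaluate(candidates):
        nonlocal results              # Python 3 only
        results = {"evaluated": candidates}

    evaluate(["a", "b"])
    return results


assert py2_style() == py3_style() == {"evaluated": ["a", "b"]}
```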
@jnothman , what was the motivation for passing |
If a method, how would |
Dear sir, how to use …

replace: …

As you can see, we can feed a generator into a TensorFlow model's … |
A generator is not going to work if you need to pass through the dataset multiple times, as you do with CV. DaskML might be an appropriate solution (https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.model_selection.GridSearchCV.html). |
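(Dask-ML's GridSearchCV is intended as a drop-in for scikit-learn's; a minimal sketch, assuming dask-ml is installed and using arbitrary parameters.)

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from dask_ml.model_selection import GridSearchCV  # drop-in replacement

X, y = make_classification(n_samples=200, random_state=0)
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}

search = GridSearchCV(SVC(), param_grid)  # same interface as sklearn's
search.fit(X, y)
print(search.best_params_)
```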
There are several projects working on smarter ways to search parameters. Right now, BaseSearchCV makes it really hard to customize parts, mostly because the fit function is really long, and changing anything in the parameter loop means copying the whole fit method, see here. Maybe factoring this out into a method, or just splitting up fit a bit more, might be a good idea?
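(For context, one shape such a split could take: essentially the _run_search(evaluate_candidates) hook later proposed in #9599. The method names follow that proposal; the bodies are only placeholders.)

```python
class BaseSearchCV:
    def fit(self, X, y=None):
        # ... parameter validation, CV splitting, Parallel setup ...
        all_results = {}

        def evaluate_candidates(candidate_params):
            # ... run _fit_and_score for each candidate on each split
            # and fold the scores into all_results ...
            return all_results

        self._run_search(evaluate_candidates)
        # ... select best_index_, optionally refit, expose cv_results_ ...
        return self

    def _run_search(self, evaluate_candidates):
        # Subclasses supply the search strategy; e.g. a grid search would
        # call evaluate_candidates(ParameterGrid(self.param_grid)).
        raise NotImplementedError
```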