
Customizing BaseSearchCV is hard #9499

Closed · amueller opened this issue Aug 4, 2017 · 44 comments

@amueller (Member) commented Aug 4, 2017

There are several projects working on smarter ways to search parameters.
Right now, BaseSearchCV makes it really hard to customize parts, mostly because the fit method is really long, and changing anything in the parameter loop means copying the whole fit method; see here.

Maybe factoring this out into a method, or just splitting up fit a bit more might be a good idea?

@jnothman (Member) commented Aug 5, 2017

I think (without looking in detail) that the SMAC implementation could have reused BaseSearchCV/RandomizedSearchCV more, albeit with a little overhead in each run.

I would be +1 for something like:

def search_cv(estimator, candidates, X, y, ...):
    ...
    return cv_results_

WDYT?
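For concreteness, here is one way an external optimizer might drive such a free function. This is only a rough sketch: run_search, propose_candidates and the keyword arguments are all hypothetical, not scikit-learn API.

import numpy as np

def run_search(search_cv, propose_candidates, estimator, X, y, n_rounds=5, **kwargs):
    # ``search_cv`` is the free function sketched above; ``propose_candidates``
    # wraps the optimizer (random, Bayesian, ...) and maps the results so far
    # to a new list of parameter dicts. Both are assumed callables.
    all_results = None
    candidates = propose_candidates(None)
    for _ in range(n_rounds):
        cv_results = search_cv(estimator, candidates, X, y, **kwargs)
        if all_results is None:
            all_results = dict(cv_results)
        else:
            all_results = {k: np.concatenate([all_results[k], cv_results[k]])
                           for k in all_results}
        candidates = propose_candidates(all_results)
    return all_results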

@amueller (Member Author) commented Aug 9, 2017

Yeah, that would be good, I think. I don't think SMAC could have reused more, though. How?

@jnothman (Member) commented Aug 9, 2017 via email

@amueller (Member Author)

Maybe. You'd have to consolidate all the dictionaries you get at the end. I guess that would be simpler than copying all the code.

Hm, it depends a bit on what candidates is in your example. I don't think that's actually needed, since we already have _get_param_iterator, and any optimization would create these on the fly?

@betatim (Member) commented Aug 13, 2017

From my point of view, the tricky bit is passing feedback from _fit_and_score to the iterator from _get_param_iterator. So I would suggest splitting things somewhere around here and replacing the first part in scikit-optimize.

@jnothman (Member) commented Aug 13, 2017

There's also a subtle trickiness in ensuring the CV splits are the same across calls, and in the fact that we could reuse the parallel backend.

So we basically want a coroutine. Either:

def candidates_generator():
    new_cv_results = yield candidates
    # ... determine new candidates to try or give up

and the search does something like:

for candidates in gen:
    .... _fit_and_score(..., candidates, ...)
    gen.send(cv_results)

or inverted:

def search_cv_generator(estimator, candidates, X, y, ...):
    while candidates:
        ...
        candidates = yield cv_results

which may make more sense.

@jnothman (Member) commented Aug 13, 2017

Perhaps the latter case more refined is:

def search_cv_coroutine(estimator, X, y, ...):
    candidates = yield
    while candidates:
        ...
        candidates = yield cv_results

In the latter case, your skopt code might be:

def fit(...):
    search = search_cv_coroutine(*args, **kwargs)
    next(search)  # prime the coroutine up to its first yield
    results = search.send(candidates)
    all_results = results
    while results are not good enough:
        candidates = ...
        results = search.send(candidates)
        for k in all_results:
            all_results[k] = np.concatenate([all_results[k], results[k]])
    ... store results ...

@amueller (Member Author)

@betatim yeah, that's about where I thought. Though I wouldn't split the fit method; I would just encapsulate the call that generates the candidates and gets the results.

I agree that a coroutine is probably the best pattern for this. Do you know how that would work with the Parallel interface, though?

I would still not have skopt overwrite fit, because most of fit is parameter validation and result parsing, and those are exactly the parts they don't want to copy.
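For illustration, one way this could work (a sketch only, not scikit-learn code; fit_and_score_one and splits are hypothetical stand-ins) is for the coroutine itself to own the Parallel context, so the joblib worker pool is created once and reused for every batch of candidates that gets sent in:

from joblib import Parallel, delayed

def search_cv_coroutine(fit_and_score_one, splits, n_jobs=None):
    splits = list(splits)  # materialize so the same CV splits are reused
    with Parallel(n_jobs=n_jobs) as parallel:  # one worker pool for all batches
        candidates = yield
        while candidates:
            out = parallel(delayed(fit_and_score_one)(params, train, test)
                           for params in candidates
                           for train, test in splits)
            candidates = yield out  # caller sends the next batch of params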

@jnothman (Member) commented Aug 13, 2017 via email

@jnothman (Member)

Btw, @betatim, the problem with passing results back from _fit_and_score is that I don't want to make any promises about its interface. I would rather return a list of dicts even if it takes some overhead to transform it. I'm still not sure what the right format for returning things here is.

We can also achieve a similar effect to a coroutine with a class having an evaluate method (rather than the awkward send). It makes it slightly messier to use a parallel context manager, but perhaps produces a cleaner interface overall.
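For illustration, the class-based alternative could look roughly like this (hypothetical names, not a scikit-learn class). Note that without a with-block around Parallel, the worker pool is not kept warm between calls, which is the messiness with the parallel context manager alluded to above:

from joblib import Parallel, delayed

class CandidateEvaluator:
    def __init__(self, fit_and_score_one, splits, n_jobs=None):
        self._fit_and_score_one = fit_and_score_one
        self._splits = list(splits)  # fixed CV splits across all calls
        self._n_jobs = n_jobs

    def evaluate(self, candidates):
        # fit and score every candidate on every split; the raw outputs are
        # aggregated by the caller
        return Parallel(n_jobs=self._n_jobs)(
            delayed(self._fit_and_score_one)(params, train, test)
            for params in candidates
            for train, test in self._splits)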

@betatim (Member) commented Aug 21, 2017

Thinking out loud here: there are a lot of things already in RandomizedSearchCV that I'd prefer not to have to write myself: argument validation (the first part of fit), the contents of _fit_and_score, and the aggregation of results (the second part of fit).

"All" I want to do is be given the results so far and be asked for a new set of parameters to try. How can we make it so that these lines become customisable while keeping everything else fixed? I like the idea of sending stuff into a generator. How would it work, interface-wise, for that generator to return more than one new point per iteration (for parallelism)? My guess: you always send in all past results and the generator returns a list of next points per iteration (see the sketch below). That does mean the generator has to know something about how "parallel" you want to be running.

With a setup like this, all skopt would have to do is provide BaseSearchCV with a custom generator as an argument to __init__, right?
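A sketch of that interface (propose_from is a hypothetical optimizer hook that maps the results so far and a batch size to a list of parameter dicts; nothing here is scikit-learn API):

def candidate_batches(propose_from, n_initial=8, batch_size=4, n_batches=10):
    # prime with ``next(gen)`` to get the warm-up batch, then repeatedly call
    # ``gen.send(all_results_so_far)`` to get the next batch of points
    results_so_far = yield propose_from(None, n_initial)
    for _ in range(n_batches):
        results_so_far = yield propose_from(results_so_far, batch_size)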

@jnothman (Member)

I was a bit uncertain about your statement, so I just mocked up one possible solution at #9599.

@jerryfuyu0104

Could _fit_and_score return the estimator along with ret? Then in BaseSearchCV().fit() we could get all trained estimators and keep, say, the top 5 or top 10 best ones, rather than using refit=True to retrain only the best estimator and get best_estimator_.

@jnothman (Member) commented Mar 8, 2018 via email

@jerryfuyu0104

@jnothman I think each estimator corresponds to a hyper-parameter setting rather than a fold; every parameter setting gets an estimator that was trained in the func _fit_and_score, so we could get or pickle it (and how would one pickle these estimators? I cannot find examples) and would not need to fit the best estimator again by setting refit=True.

@jnothman (Member) commented Mar 8, 2018 via email

@jerryfuyu0104

So if I want the top 5 estimators, I need to train the model with these top 5 parameter settings on the entire training set? Could refit be modified to accept an integer, so users can get the top 5 estimators?
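(For what it's worth, with the existing public API the top-k settings can already be refit manually after the search; a sketch, assuming a single scoring metric:)

import numpy as np
from sklearn.base import clone

def refit_top_k(search, X, y, k=5):
    # rank_test_score starts at 1 for the best candidate
    results = search.cv_results_
    best = np.argsort(results['rank_test_score'])[:k]
    return [clone(search.estimator).set_params(**results['params'][i]).fit(X, y)
            for i in best]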

@jnothman (Member) commented Mar 8, 2018 via email

@janvanrijn (Contributor)

Sorry I'm a little bit late to the party, but this issue aims to solve something that I have bumped into in several personal projects. I would love to see this solved and would be happy to create a PR if that would potentially speed things up a little.

Looking back at the email that I sent @amueller (FFR: on August 3rd, 2018), my idea back then was to make _fit_and_score a member function of BaseSearchCV so that it can be overridden by any child class, but I now see how that might not be the cleanest solution.

Is the solution proposed by @jnothman on August 13 still relevant? If so, I could make some time for it this week.

@jnothman (Member)

I'm very impressed at you sending Andy an email in August 2018!

I would also really like to see this solved. #9599 is my attempt, and I know of no substantial critique of it. As stated there, it needs road testing (as well as review from other core devs). Please try using that interface: using that branch, inherit from BaseSearchCV and provide __init__ and either _get_param_iterator or _generate_candidates. Show us what your custom *SearchCV looks like and whether you're happy with that interface. Thanks.
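For orientation, a minimal subclass against that branch might look roughly like this (the _generate_candidates hook and its send-based protocol are the ones proposed in #9599 and may still change):

from sklearn.model_selection import ParameterSampler
from sklearn.model_selection._search import BaseSearchCV

class MySearchCV(BaseSearchCV):
    def __init__(self, estimator, param_distributions, n_iter=10, **kwargs):
        super(MySearchCV, self).__init__(estimator=estimator, **kwargs)
        self.param_distributions = param_distributions
        self.n_iter = n_iter

    def _generate_candidates(self):
        # propose a first batch, receive the results so far, and decide
        # whether to propose more
        first_batch = list(ParameterSampler(self.param_distributions,
                                            self.n_iter))
        results = yield first_batch
        # ... inspect ``results`` and yield further batches if desired ...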

@janvanrijn (Contributor)

To start positive: I really like this! If I understood the generators and the send function correctly, the send function returns the results from all executed runs in the while loop to the generator. This allows the BO logic to be implemented in a class inheriting from ParameterSampler. The intention is probably to add a statement like obtained_results = yield at the end of the BO version of the ParameterSampler, so that it can determine which grid points should be sampled next. Is this correct?

Unfortunately, I don't think the current codebase works as intended. This line automatically runs through the whole generator, taking None values for the remaining yield statements.

The following example illustrates my problem. It is a sampler that is almost completely copied from the ParameterSampler used in random search; however, I added that after all parameter settings have been evaluated, it samples the best setting among the earlier sampled parameters once more (not useful in practice, but just for illustrative purposes):

import warnings

import numpy as np

from sklearn.externals import six
from sklearn.model_selection import ParameterGrid
from sklearn.utils import check_random_state
from sklearn.utils.fixes import sp_version
from sklearn.utils.random import sample_without_replacement


class ParameterSamplerRepeatBestParams(object):

    def __init__(self, param_distributions, n_iter, random_state=None):
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.random_state = random_state


    def __iter__(self):
        # check if all distributions are given as lists
        # in this case we want to sample without replacement
        all_lists = np.all([not hasattr(v, "rvs")
                            for v in self.param_distributions.values()])
        rnd = check_random_state(self.random_state)

        if all_lists:
            # look up sampled parameter settings in parameter grid
            param_grid = ParameterGrid(self.param_distributions)
            grid_size = len(param_grid)
            n_iter = self.n_iter

            if grid_size < n_iter:
                warnings.warn(
                    'The total space of parameters %d is smaller '
                    'than n_iter=%d. Running %d iterations. For exhaustive '
                    'searches, use GridSearchCV.'
                    % (grid_size, self.n_iter, grid_size), UserWarning)
                n_iter = grid_size
            for i in sample_without_replacement(grid_size, n_iter,
                                                random_state=rnd):
                yield param_grid[i]

        else:
            # Always sort the keys of a dictionary, for reproducibility
            items = sorted(self.param_distributions.items())
            for _ in six.moves.range(self.n_iter):
                params = dict()
                for k, v in items:
                    if hasattr(v, "rvs"):
                        if sp_version < (0, 16):
                            params[k] = v.rvs()
                        else:
                            params[k] = v.rvs(random_state=rnd)
                    else:
                        params[k] = v[rnd.randint(len(v))]
                yield params
        # JvR: wait for the send statement, obtain the results and determine the best hyperparameter setting
        obtained_result = yield
        best_idx = np.argmax(obtained_result['mean_test_score'])
        best_params = obtained_result['params'][best_idx]
        yield best_params

    def __len__(self):
        """Number of points that will be sampled."""
        return self.n_iter

Unfortunately, the code crashes on this one, due to the aforementioned line of code. If it helps, I can contribute a unit test and a proposal for a fix in your branch.

@jnothman (Member)

In order to use joblib's parallelism, you should be yielding a list (or finite iterable) of candidate param dicts at once. So use yield [params] at each step, rather than yield params. Please help improve the documentation in my branch.

Your code shouldn't determine the best. That's assumed to be behaviour provided by BaseSearchCV (even if there are alternative ways to do that too). You shouldn't be yielding it again at the end.

@jnothman (Member)

I've given a toy example in the docs there now. Take a look.

@janvanrijn (Contributor)

Your code shouldn't determine the best. That's assumed to be behaviour provided by BaseSearchCV (even if there are alternative ways to do that too).

I think I explained my intentions badly. I am aware of this mechanism; however, when implementing something similar to Hyperband, one wants to use the results from earlier iterations and rerun the best results on a higher budget. That is what I tried to outline in the code.

I think this does exactly what I need. Basically, one should now be able to put this logic in the _generate_candidates() fn: after an initial grid is returned, the results will be passed back to this function, and a new parameter grid can be constructed. I will try to implement this later this week to verify whether it has the required functionality.

Small side note: in your latest commit you changed the return value of the (now) _generate_candidates fn of grid search on line 1184 to yield ParameterGrid(self.param_grid). For random search it is (still) return ParameterSampler(...) on line 1192, probably breaking the unit tests.

@jnothman (Member) commented Jun 15, 2018 via email

@janvanrijn (Contributor)

I had some time to play with the new interface, and in short, I really like it. It is now possible to write a _generate_candidates() that mimics Successive Halving (a subroutine of Hyperband) in about 10 lines of code (a sketch follows below).

One thing that I also really like (although this was already the case before) is that _format_results() can be overridden. I feel this is crucial for adding additional meta-data about the individual runs.

In an ideal world, the selection of the best parameter setting would also be replaceable. It would be really awesome if there were some overridable function that takes the results array as input and returns the best index. Use case for Hyperband: sometimes (due to bad luck) a configuration that was run on a fraction of the budget obtained the best performance, but (according to the paper) we would still like to recommend a parameter setting that was obtained on the full budget.
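For the record, the successive-halving-style generator referred to above could look roughly like this (a sketch against that branch; it assumes the results sent back are cumulative and, unlike real successive halving, it does not touch the budget/subsample size):

import numpy as np
from sklearn.model_selection import ParameterSampler
from sklearn.model_selection._search import BaseSearchCV

class HalvingLikeSearchCV(BaseSearchCV):
    # __init__ omitted: assume it stores ``param_distributions`` and
    # ``n_candidates`` and forwards everything else to BaseSearchCV.

    def _generate_candidates(self):
        candidates = list(ParameterSampler(self.param_distributions,
                                           self.n_candidates))
        while len(candidates) > 1:
            results = yield candidates
            # keep the best half of the batch we just proposed (the results
            # dict is assumed cumulative, hence the slice from the end)
            scores = np.asarray(results['mean_test_score'])[-len(candidates):]
            keep = np.argsort(scores)[::-1][:len(candidates) // 2]
            candidates = [candidates[i] for i in keep]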

@jnothman (Member) commented Jun 15, 2018 via email

@janvanrijn (Contributor)

There's probably a lot of interplay between the various components, and not all of them will work together. For example, for the Hyperband search coroutine one would probably also override the _format_results() function, which would require using inheritance anyway.

@jnothman (Member) commented Jun 15, 2018 via email

@TomAugspurger (Contributor)

ping @stsievert, if you have any thoughts on #9599 (@stsievert is also implementing Hyperband in dask/dask-searchcv#72)

@jnothman (Member) commented Jun 15, 2018 via email

@janvanrijn (Contributor)

Should _generate_candidates be a function of fit arguments (X, y, ...)?

You mean that it would get X and y as parameters? I don't see the immediate advantage.

There is one more related thing that is not immediately supported, which is the manipulation (subsampling) of X and y by the search coroutine. (Is this what you meant?) If this were somehow possible that would be great, but I suppose there are also other ways of achieving it within the search routine (e.g., wrapping the estimator).

@jnothman (Member)

No, I didn't mean it for that purpose... we assume constant CV indices across all experiments.

I meant perhaps so that the search could select good initial candidates given the data.

@janvanrijn (Contributor)

Ah, of course. That would make perfect sense indeed.

@stsievert (Contributor) commented Jun 15, 2018

#9599 implements the functionality I'd like to see: there is some function call after each evaluation, which means it works for adaptive selection schemes (e.g., Bayesian approaches and Hyperband). I've added a review, and will be watching. This would enable the adaptive implementations discussed in dask/dask-ml#161.

I'll try to make #11266 work with #9599.

changing anything in the parameter loop means copying the whole fit method

That's been my experience too.

@jnothman (Member)

Please explain your use case for overriding _format_results, @janvanrijn? If it's adding a measure that can be estimated independently for each fit, could you do it with a custom scorer?

@janvanrijn (Contributor)

Please explain your use case for overriding _format_results, @janvanrijn? If it's adding a measure that can be estimated independently for each fit, could you do it with a custom scorer?

Sorry for the slow reply!

For a Hyperband or successive halving implementation, it wouldn't be a bad idea to add meta-data to the cv_results_ field that indicates things such as:

  • the subsample size (indicating which stage of successive halving we are in)
  • the bracket number (indicating which stage of Hyperband we are in)
  • the Hyperband iteration number (since the algorithm can be repeated multiple times)

This can be used by the callable refit function implemented in #11354 to determine which hyperparameter setting should be chosen (the original algorithm claims that only hyperparameter settings run on the maximum subsample size can be recommended); a sketch follows below.
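A sketch of such a callable refit, assuming the search has added a hypothetical subsample_size key to cv_results_ and a single scoring metric is used: it restricts the choice to rows evaluated on the full budget and returns the winning index, as the callable-refit contract in #11354 expects.

import numpy as np

def refit_full_budget(cv_results):
    subsample_size = np.asarray(cv_results['subsample_size'])
    mean_test_score = np.asarray(cv_results['mean_test_score'])
    full_budget = subsample_size == subsample_size.max()
    rows = np.flatnonzero(full_budget)
    # index of the best-scoring row among those run on the full budget
    return int(rows[np.argmax(mean_test_score[full_budget])])

# used as: SomeSearchCV(..., refit=refit_full_budget)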

@jnothman (Member) commented Jul 3, 2018

I'd rather implementers not play with _format_results, but we should have some contract about how the custom search function is allowed to modify the results. Hmmm.

@janvanrijn (Contributor)

I'd rather implementers not play with _format_results, but we should have some contract about how the custom search function is allowed to modify the results. Hmmm.

I'll be at SciPy next week.

Currently, I do not completely get the whole idea behind results_container (line 682). Some questions that I wasn't able to answer:

  1. Why is it a list of dicts? (We only seem to be interested in element 0.)
  2. Line 732: why results = results_container[0] if we could also grab the result from the return value of _run_search?
  3. If point 2 were the case, the evaluate_candidates fn could be wrapped and the results could be enhanced by search implementations (I think...)

Please help improve the documentation in my branch.

Still interested in this? How should I add stuff? Mention it here or PR to your fork?

@jnothman (Member) commented Jul 3, 2018 via email

@NicolasHug (Member)

@jnothman, what was the motivation for passing evaluate_candidates() to _run_search versus, e.g., just making it a private method?

@jnothman (Member) commented Oct 4, 2019

@jnothman, what was the motivation for passing evaluate_candidates() to _run_search versus, e.g., just making it a private method?

If it were a method, how would evaluate_candidates be provided with the other variables it needs to do its job? evaluate_candidates is a closure over several other variables available in _fit. If it were a method, those variables would need to be stored on the object. This is a cleaner functional abstraction (though maybe it's more functional, paradigmatically, than a lot of code in scikit-learn).
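For reference, the resulting pattern on the implementer's side looks roughly like this: _run_search receives the evaluate_candidates closure and never needs X, y, the splits, or the parallel backend directly (a sketch; __init__ is assumed to store param_distributions and n_iter):

from sklearn.model_selection import ParameterSampler
from sklearn.model_selection._search import BaseSearchCV

class MyAdaptiveSearchCV(BaseSearchCV):
    # assume __init__ stores ``param_distributions`` and ``n_iter``

    def _run_search(self, evaluate_candidates):
        candidates = list(ParameterSampler(self.param_distributions,
                                           self.n_iter))
        results = evaluate_candidates(candidates)
        # ``results`` is the cv_results_-style dict so far; an adaptive search
        # would inspect it and call evaluate_candidates again with new params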

@wangyexiang

Dear sir, how can one use GridSearchCV or RandomizedSearchCV on a huge dataset? I want to load the data in batches from disk with a generator to avoid a MemoryError, and then feed the batch data to the fit method.
Maybe like this:

model = RandomizedSearchCV()
model.fit(batch_data)  # where batch_data is a generator

instead of:

model = RandomizedSearchCV()
model.fit(x, y)

As you can see, we can feed a generator into a TensorFlow model's fit method, and this is pretty useful for huge datasets, like an image dataset.
Thank you very much.
@jnothman
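(One practical workaround, as a sketch rather than an official recipe: scikit-learn's search estimators expect an array-like X rather than a generator, but a memory-mapped array on disk keeps RAM usage low and can be shared by the joblib workers. The file names, dtypes and shapes below are assumptions.)

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# assumes X.dat / y.dat were previously written to disk with these dtypes/shapes
X = np.memmap('X.dat', dtype='float32', mode='r', shape=(10_000_000, 100))
y = np.memmap('y.dat', dtype='int8', mode='r', shape=(10_000_000,))

search = RandomizedSearchCV(SGDClassifier(),
                            {'alpha': [1e-4, 1e-3, 1e-2]},
                            n_iter=3, n_jobs=2)
search.fit(X, y)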

@jnothman (Member) commented Nov 12, 2019 via email
