[WIP] Gaussian Process-based hyper-parameter optimization #5491
Conversation
sklearn/gp_search.py
```python
def compute_expected_improvement(predictions, sigma, y_best):
    ei_array = np.zeros(predictions.shape[0])
```
This could be easily vectorized.

It would make sense to follow the input format of RandomizedSearchCV, which accepts both distributions and lists.
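A minimal sketch of what a vectorized version could look like, assuming a minimization setting where `y_best` is the lowest score observed so far (not the PR's actual implementation):

```python
import numpy as np
from scipy.stats import norm

def compute_expected_improvement(predictions, sigma, y_best):
    """Vectorized expected improvement, assuming the objective is minimized."""
    predictions = np.asarray(predictions, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (y_best - predictions) / sigma
        ei = sigma * (z * norm.cdf(z) + norm.pdf(z))
    ei[sigma == 0.0] = 0.0  # no predictive uncertainty -> no expected improvement
    return ei
```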
sklearn/gp_search.py
```
acquisition function : string, optional
    Function to maximize in order to choose the next parameter to test.
    - Simple : maximize the predicted output
```
"Greedy" might be a better name than "simple" as it is commonly used to denote strategies which purely exploit the current knowledge without any exploration
The general framework is called "bayesian optimization". Maybe something along the lines of |
sklearn/gp_search.py
```
n_init : int, optional
    Number of random iterations to perform before the smart search.
    Default is 30.
```
30 seems relatively large IMO. I'm not sure if there are any established heuristics for choosing this value, but I would personally set it to no more than `2 * self.n_parameters`.

@fabianp No, that was not intended.
sklearn/gp_search.py
```
Parameters
----------
estimator : 1) sklearn estimator or 2) callable
```
I assume we'll drop callable support
This could be useful if we wanted a generic Bayesian optimization algorithm, but indeed, this is certainly out of scope.

Thanks all for your comments! This kind of thing keeps me going :-)
sklearn/gp_search.py
```python
def compute_expected_improvement(predictions, sigma, y_best):
    ei_array = np.zeros(predictions.shape[0])
    for i in range(ei_array.shape[0]):
        z = (y_best - predictions[i]) / sigma[i]
```
Shouldn't it be `(predictions[i] - y_best) / sigma[i]` instead? (See equation 4 of http://arxiv.org/pdf/1012.2599v1.)
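For reference, the standard expected-improvement formula (as in the tutorial linked above) is written for maximization; the sign in the numerator flips when the search minimizes the score, which is presumably why the code above uses `y_best - predictions[i]`:

```latex
\mathrm{EI}(x) =
\begin{cases}
\bigl(\mu(x) - f(x^{+})\bigr)\,\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0,\\
0 & \text{if } \sigma(x) = 0,
\end{cases}
\qquad
Z = \frac{\mu(x) - f(x^{+})}{\sigma(x)}
```

where $f(x^{+})$ is the best observation so far, and $\Phi$, $\phi$ are the standard normal CDF and PDF.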
It is possible to rewrite this. For example, we could support the search method being a generator/coroutine that repeatedly yields a list of parameter settings to try, and is sent (with `send`) the resulting scores.
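A rough sketch of the kind of generator-driven protocol suggested here; the function names and the driver loop are hypothetical, not an existing scikit-learn API:

```python
import random

def random_search_strategy(param_bounds, batch_size=5, n_batches=10):
    """Toy strategy: yields batches of random candidates, receives scores via .send()."""
    for _ in range(n_batches):
        batch = [
            {name: random.uniform(lo, hi) for name, (lo, hi) in param_bounds.items()}
            for _ in range(batch_size)
        ]
        scores = yield batch  # the driver sends back one score per candidate
        # A smarter (e.g. GP-based) strategy would update its model with `scores` here.

def run_search(strategy, evaluate):
    """Drive a strategy generator: evaluate each yielded batch and send scores back."""
    batch = next(strategy)
    try:
        while True:
            scores = [evaluate(params) for params in batch]
            batch = strategy.send(scores)
    except StopIteration:
        pass

# Usage (with a hypothetical scoring function):
# run_search(random_search_strategy({"C": (0.01, 100.0)}), evaluate=my_cv_score)
```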
@glouppe, the problem with following the same input as RandomizedSearchCV is that we need a way to specify either a discrete list or a range of possible values. As far as I know, RandomizedSearchCV allows specifying either a discrete list or a distribution, but not a range of possible values (i.e. an interval). Right now I don't know what the most intuitive way to do this is.
@fabianp An interval could be specified by a uniform distribution (with a uniform prior).
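Concretely, in RandomizedSearchCV terms an interval can already be written as a frozen scipy distribution, so the same convention could carry over here (the parameter names below are just examples):

```python
from scipy.stats import uniform, randint

param_distributions = {
    "C": uniform(loc=1e-3, scale=1e3 - 1e-3),  # continuous interval [1e-3, 1e3]
    "max_iter": randint(50, 500),              # integer range [50, 500)
    "penalty": ["l1", "l2"],                   # discrete list of choices
}
```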
@jnothman I like the idea, but that would involve a refactoring of the grid_search objects... Since my goal is to have a usable GPSearchCV object by the end of the week, I would leave that for a future PR.
@glouppe So what you are proposing is to get the bounds from the distribution and ignore the distribution? Or to optimize the expected improvement by sampling from this distribution?
The inconvenience I see is that it forbids you from using standard derivative-free optimizers (e.g. scipy's cobyla) on the acquisition function.

Hmm, I am not sure I see the connection. Why would it prevent you from using cobyla?
I'm benching on the MNIST data with LR and leaving it overnight. Will let you know the results tomorrow.

Yay! I fixed the joblib parallel handling in the last commit using the context manager. I was creating and destroying the forks for every iteration, which led to significant overhead. There is still quite a bit to do, but we are getting there. These are the new graphs. Note that I have set …
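For illustration, the kind of fix being described: entering `Parallel` as a context manager keeps the worker pool alive across iterations instead of re-spawning it for every batch. The candidate batches and scoring function below are toy stand-ins:

```python
from joblib import Parallel, delayed

def fit_and_score(params):
    # stand-in for cross-validating one candidate setting (hypothetical helper)
    return -(params["C"] - 1.0) ** 2

candidate_batches = [
    [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}],
    [{"C": 0.5}, {"C": 2.0}],
]

# The worker pool is created once and reused for every iteration of the search.
with Parallel(n_jobs=2) as parallel:
    for batch in candidate_batches:
        scores = parallel(delayed(fit_and_score)(p) for p in batch)
        # a GP-based search would refit its surrogate on `scores` here
```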
I wanted to suggest that a few days ago, and looked at the code to notice …
By the way, with regards to optimization: do you have constraints? Because cobyla is slow, and therefore useful only if you have constraints. If those constraints are bound constraints, I would expect "l_bfgs_b" to be faster. If you don't have constraints, I would still investigate l_bfgs_b, or Nelder-Mead (optimize.fmin). For all options, I would maybe limit the number of iterations of the optimizer, as you don't want a very good optimization. By the way, a hopefully useful resource on choosing an optimization method: http://www.scipy-lectures.org/advanced/mathematical_optimization/index.html#practical-guide-to-optimization-with-scipy
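A small sketch of what an iteration-limited, bounded inner optimization could look like; the toy `neg_acquisition` is a hypothetical stand-in for the negated expected improvement under the current GP:

```python
import numpy as np
from scipy.optimize import minimize

def neg_acquisition(x):
    # toy surrogate for -EI(x); in the real search this would query the GP
    return -np.exp(-np.sum((x - 0.3) ** 2))

bounds = [(0.0, 1.0), (0.0, 1.0)]
x0 = np.array([0.5, 0.5])

# L-BFGS-B handles bound constraints; maxiter is kept small since a very
# precise optimum of the acquisition function is not needed.
result = minimize(neg_acquisition, x0, method="L-BFGS-B",
                  bounds=bounds, options={"maxiter": 20})
next_candidate = result.x
```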
I am sorry if I am missing an important detail. The graphs that @MechCoder has made are awesome, but... is it possible to find examples where there is a really significant advantage to hyper-parameter optimization? Some of the earlier graphs seem to show this, but in the later ones the advantage looks marginal. I would have thought that there were cases where making a significant change in the hyper-parameters of a classifier/regressor would really make a huge difference. Off the top of my head, I would guess that MLP might be a good classifier to look at for this, as well as the regularization parameters of our standard classifiers/regressors. Also, are there interesting/surprising results from the searches that have been tried, or do we just find that the higher/lower some parameters go (e.g. the number of trees for random forests) the better? Finally, on a related but slightly different topic, maybe there are other sorts of hyper-parameters, such as how data is preprocessed, what sort of scaling is used, which features are treated as numerical vs categorical, which interaction terms are introduced, etc., that would be of interest as well.
@lesshaste Thanks for raising some very valid points. I am still at the point of convincing myself, but I'll still try to put forward a case for it.

However, all this is in theory, and I hope we can find cases where this beats the existing searches convincingly. I do not have the resources currently to bench on huge problems, but feel free to try yourself. I can however bench for simple problems with different estimators. Regarding your last question(s), they are all API issues. I wanted to bench with the MLP, but it takes the number of layers as a list, so I need to tweak the code for this particular use case. I could not, off the top of my head, recall another estimator that allows parameters as a list. How would you handle different normalizations with the present grid search API? The most straightforward way is to run different searches across different pipelines and compare your best results. Also, we can handle categorical parameters, but that involves changing the API to also allow a dict-list format in addition to the already convoluted dict-dict format.
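For what it's worth, the existing API can also fold a preprocessing choice into a single search by treating a whole pipeline step as a parameter. A hedged sketch (step names are illustrative; `sklearn.grid_search` later moved to `sklearn.model_selection`):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later releases

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

param_grid = {
    "scale": [StandardScaler(), MinMaxScaler()],  # the preprocessing step itself is searched
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
# search.fit(X, y) then compares every (scaler, C) combination in one run
```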
Thanks a lot for the link, Gael, but I'm afraid we are still in the "show me it is useful" stage. Just FYI, I have used …
I was told this would be the right way to deal with the log vs linear parameters, and that's what's implemented in spearmint: http://www.cs.toronto.edu/~zemel/documents/bayesopt-warping.pdf

They integrate out the parameters of the warping, which is easy for them, I guess, as they already do that with the kernel parameters. We maximize the kernel parameters IIRC, and that happens in the GP. I'm not sure how simple it is for us to integrate out the warping parameters.
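A much simpler, hand-rolled alternative to learned input warping (and not what the linked paper does) is to fix a log transform for parameters that naturally live on a log scale and let the GP work in the transformed space. Parameter names here are illustrative:

```python
import numpy as np

def to_gp_space(params):
    # the GP models the score as a function of (log10(C), max_depth),
    # so uniform exploration in GP space is log-uniform in C
    return np.array([np.log10(params["C"]), float(params["max_depth"])])

def from_gp_space(x):
    return {"C": 10.0 ** x[0], "max_depth": int(round(x[1]))}
```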
I see that there is currently no low-level interface to run tests on Branin or some other function. That wouldn't be hard to refactor, though, right?
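Such a low-level interface would mostly need a callable to optimize; for reference, the Branin function commonly used in Bayesian-optimization benchmarks (global minimum ≈ 0.397887 on [-5, 10] x [0, 15]):

```python
import numpy as np

def branin(x1, x2):
    """Branin-Hoo test function, usually evaluated on x1 in [-5, 10], x2 in [0, 15]."""
    a = 1.0
    b = 5.1 / (4.0 * np.pi ** 2)
    c = 5.0 / np.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * np.cos(x1) + s
```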
Thanks, Andy, for the comments. I will follow up in a while.

Hi all, there's a package already out there called Optunity for this kind of optimisation. Might be worth checking it out: #6662
Hello Everyone,

This appears to be the main thread on hyper-parameter optimization in sklearn. I've had some experience in a related field called Meta-Optimization, so I thought I might share some of my thoughts in case they are useful to you. Please note that I'm fairly new to Python and I'm completely new to sklearn.

**Random and Grid Search**

First off, Random Search on any kind of optimization problem will only find near-optimal solutions if the search-space is very smooth. This is because the probability of finding near-optimal parameters decreases exponentially towards zero as the number of parameters increases. This is the so-called Curse of Dimensionality in the search-space of hyper-parameters. If n is the number of parameters and near-optimal parameter choices only occupy 10% of the range of each parameter, then the probability of finding near-optimal parameters by completely random sampling is p = 0.1^n. For n = 10 parameters this probability is p = 1e-10, and it will hence require 1e+10 random samples on average to find near-optimal hyper-parameters. I've sometimes seen people claim that Random Search eventually converges to the optimal parameters, and I also saw this claim in a recent lecture on hyper-parameter tuning, but the claim is a bit of a stretch - an infinite stretch to be exact.

**Practical Use**

You may object that both GridSearch and RandomSearch in sklearn seem to work in practice. Well, they may find acceptable hyper-parameters, but it is highly unlikely that they find near-optimal parameters. Unless the search-space of hyper-parameters is very smooth, there may be a very large difference in performance between acceptable and near-optimal parameters. However, I do agree that it is much preferable to automate either grid- or random-search instead of doing it manually. But even better would of course be to have a proper optimizer to find near-optimal hyper-parameters.

**Meta-Optimization**

The problem I was working on was the tuning of control-parameters (aka behavioural parameters, aka hyper-parameters) for heuristic optimizers. I prefer to call this Meta-Optimization, although it is conceptually very similar to hyper-parameter optimization. The Wikipedia page has a short description of the concept, which was apparently first explored back in the 1970's: https://en.wikipedia.org/wiki/Meta-optimization

The main problem when I worked on this was the massive computational needs. The popular heuristic optimizers of the day, such as Particle Swarm Optimization (PSO), typically required many thousand evaluations of the cost-function. This was intractable when the meta-cost-function consisted of another 50 optimization runs that might take hours to complete. I found a system that was simple, worked well and was quite fast. It had two main components: (1) a meta-optimizer that only required few iterations, and (2) a measure of performance whose calculation could be aborted pre-emptively in many cases. These two things in combination made it perform reasonably fast.

I made a short YouTube video on how it works: https://www.youtube.com/watch?v=O6OQPpzVHBc
And another short video with a demonstration: https://www.youtube.com/watch?v=6cM-e10YRdI

**Python Source-Code**

I recently implemented this form of meta-optimization in Python as an exercise for myself to get better at Python programming (it is surprising how many language constructs you get to exercise for such a small project). You can download it here if you want to see the actual source-code and try running it. See the file demo-meta-optimize.py: https://github.com/Hvass-Labs/swarmops

**Meta-Optimization in sklearn?**

I considered proposing to implement LUS for hyper-parameter tuning in sklearn, but I see two major problems with that approach: (1) LUS only works for real-valued parameters and would therefore have to be combined with another heuristic, or grid-search, or random-search for the discrete hyper-parameters. I suppose you could hack it by mapping continuous numbers to discrete parameters, but I'm not sure it would work, as LUS assumes the search-space is numerically ordered and reasonably smooth. (2) LUS typically requires about 5 runs because it has a tendency to become stuck in local optima, and each run requires a number of iterations equal to about 30 x the number of hyper-parameters to be tuned. Because of how the performance measure was designed for meta-optimization, it is possible to abort the computation pre-emptively in many cases, thus making the tuning process reasonably fast, only taking a small fraction of the full computation time. An example is shown in the following image, where all the crosses are hyper-parameters that were contemplated but whose calculation was aborted pre-emptively because the parameters were found to be inferior (note the meta-optimization is doing minimization in this image). I'm not sure this would work when tuning the hyper-parameters of Machine Learning methods, so this method may not be fast enough.

**Bayesian Optimization**

I've looked at some of the other proposals for hyper-parameter tuning in Machine Learning. I like the idea of Bayesian Optimization (BO). As far as I can tell from having just studied it briefly, it is also a heuristic optimizer with some of the strengths and weaknesses of heuristics. However, BO works quite differently than e.g. LUS or PSO, because BO builds an approximation to the search-space in a clever way and improves the approximation with each evaluation of the cost-function. The main drawback seems to be that BO is more complicated to implement (certainly compared to LUS), and it should only be used when the cost-function is very time-consuming to compute, because BO has significant computational overhead of its own. I'm not quite sure if the standard BO variants support a combination of continuous and discrete hyper-parameters, because it is always demonstrated on continuous parameters. Can someone elaborate? A recent paper by Wang et al. (@ziyuw) proposed a BO variant which supports both types of parameters and claims to have other advantages as well. (I also find it amusing that they call their variant REMBO, while LUS means louse in Danish :-) http://arxiv.org/abs/1301.1942

**Please Update Wikipedia**

If some of the experts on Bayesian Optimization should read this, then please consider updating the Wikipedia page so the rest of us can get a quick overview. The page receives more than 60 views per day, so it is certainly worth your time - does any of your academic papers get this many views? :-) https://en.wikipedia.org/wiki/Bayesian_optimization

**Meta-Meta-Optimization**

In my opinion, it makes sense for the research community to look for a good method for doing meta-optimization / hyper-parameter optimization. If we think about it a bit, we see that this process can of course also be automated with meta-meta-optimization, aka hyper-hyper-parameter optimization. But is it worth the trouble to build such a system? This is what I wrote on the subject in my thesis: http://www.hvass-labs.org/people/magnus/thesis/pedersen08thesis.pdf

**Meta-Meta Implementation**

SwarmOps for C# supports meta-meta-optimization with any meta-nesting-depth you may like. You may also be able to get this working in SwarmOps for Python. I can't quite remember if the Java and C versions of SwarmOps also support this. http://www.hvass-labs.org/projects/swarmops/

**Questions?**

I'm sorry to hi-jack your thread with such a long post, but perhaps some of these thoughts might be useful to you when implementing hyper-parameter optimization in sklearn. If you have any questions then please tag the comment with @Hvass-Labs so I see it. I'll look forward to seeing your solution in one of the next versions of sklearn.
What's the status of this PR? It seems to be stalled on the fact that the internal optimization was taking a lot of time, but with LBFGS and early stopping, I was hoping for an improvement in time (I didn't see a bench). Any hope of getting this merged or mergeable?
@GaelVaroquaux I'm not really a part of the effort, and it has been a month since I looked at this, so I don't know if the following suggestion is helpful. If the problem is that the inner loop of the Bayes Optimizer is too slow because it needs too many random samples, or because L-BFGS-B is too slow for some reason, then how about trying a heuristic optimizer for the inner loop of the Bayes Optimizer? LUS and PS in my SwarmOps library were specifically made for semi-smooth search-spaces that had to be searched in few iterations (typically something like 20 x the dimensionality of the search-space). It may be worth a try to plug them into the Bayes Optimizer code you guys have already made - or perhaps try and let them optimize the real-valued hyper-parameters directly without using the Gaussian approximation. Perhaps it'll work. https://github.com/Hvass-Labs/swarmops

I also tried TPOT a while ago. It is a great idea, although it still seemed to be under development. Perhaps the author @rhiever has some insights to share in this thread.
Since this was opened, @MechCoder and @glouppe have made a very nice start on scikit-optimize.
From the TPOT end: currently, genetic programming (also shortened as GP) optimization is likely not better than Bayesian optimization for tuning real-valued hyperparameters. TPOT specializes more in optimizing a sequence of preprocessors, models, etc. and tuning the parameters of those operators at a granular level. You might look into some of the tricks and tips used in auto-sklearn. They use meta-learning to choose a good model/parameters to start with (which may be useful), and their paper reports that Bayesian optimization works quite well in the hyperparameter optimization domain. I don't think it would be possible to take their code and port it right into sklearn (as it's quite heavy on the dependencies), but it could be useful if you gutted it for the optimization process. /cc @mfeurer
I took a quick look at the source-code for scikit-optimize and I like that you can pass a parameter for the type of search. So far it supports either random sampling or L-BFGS-B for searching the Gaussian approximation. This leaves the door open for using other search methods and heuristics, as suggested above. The source-code for skopt looks to be reasonably easy to modify, although both the variable names and comments could perhaps be made a bit more descriptive so it is easier for others to understand and modify (please consider this, @glouppe and @MechCoder). https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/gp_opt.py

Both TPOT by @rhiever et al. and the work by @mfeurer et al. look to be fascinating contributions to the field. Perhaps something like that can be added to sklearn when they're ready? It would certainly be useful to the end-users if they could automatically discover and tune the entire pipeline without having to import other libraries. I think it would be well worth the effort to consider how it could be integrated directly into sklearn.
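For readers landing here later, the scikit-optimize entry point being discussed looks roughly like this; the exact signature has changed across releases, so treat this as an approximation of the later documented API rather than of the gp_opt.py linked above:

```python
from skopt import gp_minimize

def objective(params):
    # stand-in for e.g. a negated cross-validation score
    x, y = params
    return (x - 2.0) ** 2 + (y + 1.0) ** 2

result = gp_minimize(
    objective,
    dimensions=[(-5.0, 5.0), (-5.0, 5.0)],  # one (low, high) bound per parameter
    n_calls=30,
    random_state=0,
)
print(result.x, result.fun)  # best parameters found and their objective value
```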
@Hvass-Labs Thanks for your enthusiasm and comments! Scikit-optimize is still very much in its infancy and much remains to be done before a first usable release, but hopefully we will get there by this summer. In the meantime, PRs are welcome if you feel like helping us :)
Hey all! It's been a long time; what's the status of the PR?

There's scikit-optimize now: https://scikit-optimize.github.io/ I'm in favor of closing this PR and deferring to scikit-optimize. @fabianp ok with you?

Absolutely. Closing.
Based on @sds-dubois's code (PR #5185), this PR adds an API that is more compatible with the GridSearchCV interface, plus some tests (a rough usage sketch is given after the TODO list below).
Things still TODO are:

- … (`test_n_iter_smaller_n_iter`).
- `GPSearchCV().scores_` should have the same structure as `GridSearchCV().grid_scores_`.
- `_sample_candidates` …
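A hypothetical usage sketch of the GPSearchCV described in this PR, mirroring the GridSearchCV interface; the import path, constructor arguments, and attributes are illustrative and may not match the final code:

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.gp_search import GPSearchCV  # module added by this PR (name under discussion)

digits = load_digits()
X, y = digits.data, digits.target

param_space = {
    "C": (1e-3, 1e3),           # a continuous interval
    "kernel": ["rbf", "poly"],  # a discrete list of choices
}

search = GPSearchCV(SVC(), param_space, n_init=10, n_iter=40, cv=3)
search.fit(X, y)
print(search.best_params_)
print(search.scores_)  # intended to mirror GridSearchCV().grid_scores_
```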
I think the module name is not ideal, since in the future we might have other *SearchCV objects (e.g. based on trees). Any suggestions?
CC @Djabbz