MRG Random sampling hyper parameters by amueller · Pull Request #1194 · scikit-learn/scikit-learn · GitHub

MRG Random sampling hyper parameters #1194


Closed

Conversation

amueller
Member

This is another shot at implementing randomized hyper parameter search in sklearn.
In contrast to #455, I use scipy.stats distributions for each parameter.

If I understand James' paper correctly, one of the important points is not to use a grid.
Imagine having one totally irrelevant parameter. Using a grid makes a large part of the samples redundant, while sampling from a continuous distribution is not affected by the presence of the irrelevant parameter at all.

For discrete parameters, the user can also specify a list of values, e.g. kernel=['rbf', 'linear', 'poly'], which is then sampled from uniformly.

I don't like passing objects as parameters, but this seems to be the best way to specify a distribution.
A user can put in any object that supports .rvs(), and scipy.stats.distributions already provides a solid set of predefined ones.
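
To make this concrete, a specification along these lines could look as follows (just a sketch; the parameter names and distributions are illustrative, not the final API):

# Sketch of a parameter specification: continuous parameters get scipy.stats
# distributions (anything with .rvs() works), discrete parameters get a plain
# list that is sampled from uniformly.
import numpy as np
import scipy.stats

param_distributions = {
    'C': scipy.stats.expon(scale=100),      # continuous, has .rvs()
    'gamma': scipy.stats.expon(scale=0.1),  # continuous, has .rvs()
    'kernel': ['rbf', 'linear', 'poly'],    # discrete, sampled uniformly
}

# Drawing a single parameter setting "by hand":
rng = np.random.RandomState(0)
sample = dict((k, v.rvs() if hasattr(v, 'rvs') else v[rng.randint(len(v))])
              for k, v in param_distributions.items())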

@amueller
Member Author
amueller commented Oct 6, 2012

This draft does not support tree-like parameter spaces btw.
These are not so uncommon, even in simple standard models. For example, if you use an rbf kernel in an SVM, you need a gamma and a C parameter. If you use a linear kernel, you only need C.
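
This PR does not implement anything like it, but a conceivable workaround is a list of flat specifications, one per branch of the tree, sampled in two stages (purely a hypothetical sketch):

# Hypothetical sketch, not part of this PR: emulate a tree-structured space
# with one flat specification per branch. A draw first picks a branch
# uniformly, then samples each parameter within that branch.
import numpy as np
import scipy.stats

branches = [
    {'kernel': ['rbf'], 'C': scipy.stats.expon(scale=100),
     'gamma': scipy.stats.expon(scale=0.1)},                    # rbf: C and gamma
    {'kernel': ['linear'], 'C': scipy.stats.expon(scale=100)},  # linear: only C
]

def sample_branch(branches, rng):
    branch = branches[rng.randint(len(branches))]
    return dict((k, v.rvs() if hasattr(v, 'rvs') else v[rng.randint(len(v))])
                for k, v in branch.items())

rng = np.random.RandomState(0)
settings = [sample_branch(branches, rng) for _ in range(5)]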

@alextp
Member
alextp commented Oct 6, 2012

About tree-structured spaces, I think it's ok to do without them: because you're doing random search, having extra parameters which are ignored doesn't really cost you anything in statistical efficiency.

In general I'm +1 for merging this code, as I think it's useful and we should steer people towards doing more random search and less explicit grid search (especially as adding more hyperparameters can often improve performance at the expense of more tuning).

params[k] = v[rnd.randint(len(v))]
yield params


Contributor

Minor point -- code using yield cannot be serialized, but it is often helpful to serialize long-running code such as hyper-parameter optimization.

Consider making __iter__ return self and adding a next method that returns a params sample?

I'm happy enough to do this as a PR of my own later if you like, so that any discussion can be separate from this PR. Anyway there may be other serialization problems, this might be necessary but insufficient.
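
A rough sketch of that suggestion (hypothetical names, not the code in this PR): an object whose __iter__ returns self and whose next draws one parameter setting, so the object itself can in principle be pickled as long as its attributes can.

# Hypothetical sketch of the next()-based iterator suggested above.
# Picklability still depends on the attributes (e.g. the distributions).
import numpy as np


class SampledParams(object):
    def __init__(self, param_distributions, n_iter, random_state=0):
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.rng = np.random.RandomState(random_state)
        self.n_drawn = 0

    def __iter__(self):
        return self

    def next(self):  # __next__ under Python 3
        if self.n_drawn >= self.n_iter:
            raise StopIteration
        self.n_drawn += 1
        return dict((k, v.rvs() if hasattr(v, 'rvs')
                     else v[self.rng.randint(len(v))])
                    for k, v in self.param_distributions.items())

    __next__ = next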

Member Author

Will change it, as the draws are independent anyway...

Member Author

I'm not sure how to do this right.
It should only sample n_iter times.
But I want to be able to iterate over it more than once.
I guess that means I have to distinguish between container and iterator.
Not sure how to do this without creating another class. Any help?

Contributor

@amueller AFAIK you're right, you can't get independent iterators like yield gives you without another class. If this is, in your judgement, a pain, then leave it for later / never.
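
For the record, the usual way out is exactly that extra class: a small container whose __iter__ starts a fresh pass of n_iter draws each time it is called (a sketch with hypothetical names, not necessarily what this PR ends up with).

# Sketch of a re-iterable container: iterating over it twice gives two full
# passes of n_iter samples. It only stores plain attributes, so it remains
# picklable to the extent its distributions are.
import numpy as np


class SamplerContainer(object):
    def __init__(self, param_distributions, n_iter, random_state=0):
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.random_state = random_state

    def __len__(self):
        return self.n_iter

    def __iter__(self):
        rng = np.random.RandomState(self.random_state)
        for _ in range(self.n_iter):
            yield dict((k, v.rvs() if hasattr(v, 'rvs')
                        else v[rng.randint(len(v))])
                       for k, v in self.param_distributions.items())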

Member

Creating custom iterators is easy enough (I just did it), but pickling them is a pain because scipy.stats distributions can't be pickled.
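
If pickling ever becomes a hard requirement, one possible workaround (my assumption, not something done in this PR) is to store the distribution by name and parameters and rebuild the scipy.stats object lazily:

# Hypothetical workaround, not part of this PR: a thin wrapper that stores the
# scipy.stats distribution by name and parameters, so the wrapper pickles even
# where the frozen distribution object itself does not.
import scipy.stats


class PicklableRV(object):
    def __init__(self, dist_name, *args, **kwargs):
        self.dist_name = dist_name
        self.args = args
        self.kwargs = kwargs

    def rvs(self, *rvs_args, **rvs_kwargs):
        frozen = getattr(scipy.stats, self.dist_name)(*self.args, **self.kwargs)
        return frozen.rvs(*rvs_args, **rvs_kwargs)

# PicklableRV('expon', scale=100) can be used wherever an object with .rvs()
# is expected, e.g. as a value in param_distributions.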

@zaxtax
Contributor
zaxtax commented Nov 9, 2012

What needs to be done to get this merged?

@amueller
Member Author
amueller commented Nov 9, 2012

At least an example, documentation and tests. Sorry this stalled. Not sure if I have time for it on the weekend, I have some other PRs.

There are some changes I made to GridSearchCV in this PR that need to be reverted. Generally, it needs to be rebased, but that should be easy.

Not sure if everyone is happy with the API, none of the core-devs voiced an opinion afaik.
There was not much response to my mail on the list either.
Maybe @GaelVaroquaux, @ogrisel, @larsmans or @mblondel (sorry for random non-inclusive pings) have an opinion?

@amueller
Member Author
amueller commented Nov 9, 2012

ok, rebased. Code should be fine now.

@zaxtax
Contributor
zaxtax commented Nov 9, 2012

Cool. If I push to your repo, will the changes appear here?

@amueller
Member Author
amueller commented Nov 9, 2012

they are here I think.

@amueller
Member Author

I did some refactoring and added an example. Apart from the docs I think this should be good to go. Will try to do the user guide tomorrow.

@amueller
Member Author

Added docs, renamed PR to MRG.
Still have to fill in a citation, though.

@amueller
Member Author

Any reviews please?

@ogrisel
Member
ogrisel commented Nov 26, 2012

I'll add that one to my todo list. Unfortunately don't have time today.

@amueller
Member Author

Thanks :) don't worry, you did so much reviewing for me yesterday already! I just didn't want it to get lost ;)

@zaxtax
Contributor
zaxtax commented Nov 29, 2012

+1 LGTM

@larsmans
Member

May I lazily ask a question, without having read the code or the paper? This just takes random samples of the hyperparameters and remembers the best sample, right?

The reason I'm asking is because I wonder what will happen to random_state hyperparameters in fancy meta-optimizers. Not all estimators that take such a parameter can do random restarts internally.

@amueller
Member Author

For each sampled set of parameters, cross-validation is run; that hopefully makes the score robust to the seed. Does that answer your question?


@amueller
Member Author

Rebased and stuff, should be mergeable again.

self.estimator_scores_ = [
(clf_params, score, all_scores)
for clf_params, (score, _), all_scores
in zip(parameter_iterator, scores, cv_scores)]
Member

Is the comment still relevant?

Member Author

no. Thanks :)

@arjoly
Member
arjoly commented Dec 27, 2012

In `BaseSearchCV._fit`, there are some names that seem to be more "grid"-specific.

@amueller
Member Author

@arjoly thanks for your comments, I'll address them shortly :)

@amueller
Member Author
amueller commented Mar 2, 2013

Ok, should be good now. If there are no more comments, I'll squash and merge.

@ogrisel
Member
ogrisel commented Mar 2, 2013

On a related note @ogrisel have you looked at #1626 and the discussion in #1674?

Not yet but I need to focus on my PyCon tutorial in the short term :)

@@ -72,11 +75,141 @@ def __iter__(self):
yield params


class IterGrid(ParameterGrid):
"""Generators on the combination of the various parameter lists given.

Member

You should add a note on the deprecation either here or at the end of the docstring.
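
Something along these lines, for instance (only a sketch; the exact wording and the targeted release are up to the PR):

class IterGrid(ParameterGrid):
    """Generators on the combination of the various parameter lists given.

    .. note::

        ``IterGrid`` is deprecated; it has been renamed to
        :class:`ParameterGrid` and will be removed in a future release.
        Use ``ParameterGrid`` instead.
    """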

@ogrisel
Member
ogrisel commented Mar 2, 2013

Code coverage is good:

sklearn.grid_search                                239     10    96%   258, 347-352, 366, 405, 634-636, 667

@ogrisel
Member
ogrisel commented Mar 2, 2013

In the example, rather than just reporting the top parameters, it would be better to report, say, the top 3 models along with the standard error of the mean validation score:

diff --git a/examples/randomized_search.py b/examples/randomized_search.py
index c416508..359a4eb 100644
--- a/examples/randomized_search.py
+++ b/examples/randomized_search.py
@@ -21,6 +21,8 @@ simultaneously using grid search, but pick only the ones deemed most important.
 print __doc__

 from time import time
+from operator import itemgetter
+from scipy.stats import sem
 from scipy.stats import randint as sp_randint

 from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
@@ -44,12 +46,23 @@ param_dist = {"max_depth": sp_randint(1, 5), "max_features": sp_randint(1, 4),
 random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                    n_iter=20)

+# Utility function to report best scores
+def report(cv_scores, n_top=3):
+    top_scores = sorted(cv_scores, key=itemgetter(1),
+                        reverse=True)[:n_top]
+    for i, score in enumerate(top_scores):
+        print("Model with rank: {0}".format(i + 1))
+        print("Mean validation score: {0:.3f} (+/-{1:.3f})".format(
+              score.mean_validation_score,
+              sem(score.cv_validation_scores)))
+        print("Parameters: {0}".format(score.parameters))
+        print("")
+
 start = time()
 random_search.fit(X, y)
 print("RandomizedSearchCV took %.2f seconds for 20 iterations."
       % (time() - start))
-print("Best score: %f" % random_search.best_score_)
-print("Best parameters: %s" % repr(random_search.best_params_))
+report(random_search.cv_scores_)

 # use a full grid over all parameters
 param_grid = {"max_depth": range(1, 5), "max_features": range(1, 4),
@@ -63,5 +76,4 @@ grid_search.fit(X, y)

 print("GridSearchCV took %.2f seconds for %d parameter settings."
       % (time() - start, len(grid_search.cv_scores_)))
-print("Best score: %f" % grid_search.best_score_)
-print("Best parameters: %s" % repr(grid_search.best_params_))
+report(grid_search.cv_scores_)

that outputs:

RandomizedSearchCV took 0.86 seconds for 20 iterations.
Model with rank: 1
Mean validation score: 0.960 (+/-0.020)
Parameters: {'bootstrap': False, 'min_samples_leaf': 4, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 1, 'max_depth': 2}

Model with rank: 2
Mean validation score: 0.960 (+/-0.020)
Parameters: {'bootstrap': False, 'min_samples_leaf': 2, 'min_samples_split': 3, 'criterion': 'entropy', 'max_features': 1, 'max_depth': 2}

Model with rank: 3
Mean validation score: 0.960 (+/-0.023)
Parameters: {'bootstrap': False, 'min_samples_leaf': 4, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 2, 'max_depth': 4}

GridSearchCV took 33.51 seconds for 768 parameter settings.
Model with rank: 1
Mean validation score: 0.960 (+/-0.020)
Parameters: {'bootstrap': False, 'min_samples_leaf': 4, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 1, 'max_depth': 2}

Model with rank: 2
Mean validation score: 0.960 (+/-0.020)
Parameters: {'bootstrap': False, 'min_samples_leaf': 2, 'min_samples_split': 3, 'criterion': 'entropy', 'max_features': 1, 'max_depth': 2}

Model with rank: 3
Mean validation score: 0.960 (+/-0.023)
Parameters: {'bootstrap': False, 'min_samples_leaf': 4, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 2, 'max_depth': 4}

We could even add a report(n_top=3, format_params=True) method to the search object directly and use pprint.pformat for the formatting of the parameters in the report.
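
A minimal sketch of what that helper could look like as a free function, mirroring the example code above (hypothetical, not part of this PR; it assumes the cv_scores_ entries expose mean_validation_score, cv_validation_scores and parameters as in the diff):

from operator import itemgetter
from pprint import pformat

from scipy.stats import sem


def report(cv_scores, n_top=3, format_params=True):
    # Rank the cross-validated parameter settings by mean validation score.
    top_scores = sorted(cv_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (+/-{1:.3f})".format(
            score.mean_validation_score, sem(score.cv_validation_scores)))
        params = pformat(score.parameters) if format_params else score.parameters
        print("Parameters: {0}".format(params))
        print("")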

>>> list(ParameterSampler(param_grid, n_iter=4))
... #doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
[{'a': ..., 'b': ...}, {'a': ..., 'b': ...},
{'a': ..., 'b': ...}, {'a': ..., 'b': ...}]
Member

Wouldn't it be possible to fix the seed of the sampler and display the values of a and the first 3 digits of b before ellipsis in the output?
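
One way to generate literal expected values for such a doctest would be to fix the seeds once and paste the printed output in (a sketch only; it assumes ParameterSampler takes a random_state argument, and .rvs() draws may still go through the global numpy state, hence the explicit np.random.seed):

# Sketch: fix the seeds, print the drawn values once, and copy them into the
# doctest instead of relying on ellipses. The random_state argument of
# ParameterSampler is assumed here.
import numpy as np
from scipy.stats import expon

from sklearn.grid_search import ParameterSampler

param_grid = {'a': [1, 2], 'b': expon()}
np.random.seed(0)
sampler = ParameterSampler(param_grid, n_iter=4, random_state=0)
for setting in sampler:
    print({'a': setting['a'], 'b': round(setting['b'], 3)})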

@ogrisel
Member
ogrisel commented Mar 2, 2013

The doctest failure on travis is real:

======================================================================
FAIL: Doctest: model_selection.rst
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for model_selection.rst
  File "/home/travis/build/scikit-learn/scikit-learn/doc/tutorial/statistical_inference/model_selection.rst", line 0

----------------------------------------------------------------------
File "/home/travis/build/scikit-learn/scikit-learn/doc/tutorial/statistical_inference/model_selection.rst", line 148, in l_selection.rst
Failed example:
    clf.best_score_
Expected:
    0.988991985997974
Got:
    0.98899999999999999

missing an ellipsis.

@amueller
Member Author
amueller commented Mar 3, 2013

Not sure why that changed... but an ellipsis seems like a good idea anyway.

@amueller
Member Author
amueller commented Mar 3, 2013

Oh, my last commit fixes an unrelated bug in master. I should probably add a test. n_test_samples was wrong in fit_grid_point as far as I can tell.

@amueller
Member Author
amueller commented Mar 3, 2013

So I guess I shouldn't use an ellipsis, because this caught a bug ^^

@amueller
Member Author
amueller commented Mar 3, 2013

should I put the bugfix + test in a separate PR?

@amueller
Member Author
amueller commented Mar 3, 2013

I would rather not add a function for reporting just now. I started something at some point but it was not generic enough. Once this is in, I'd like to revisit my ideas and create helper functions or an interface for GridSearchCV and RandomizedSearchCV that is more powerful and easier to use. I'll take your code for the example, though ;)

@ogrisel
Member
ogrisel commented Mar 3, 2013

should I put the bugfix + test in a separate PR?

That would be great. We can get a quick merge for the bugfix.

I would rather not add a function for reporting just now. I started something at some point but it was not generic enough. Once this is in, I'd like to revisit my ideas and create helper functions or an interface for GridSearchCV and RandomizedSearchCV that is more powerful and easier to use. I'll take your code for the example, though ;)

Alright. Let's add the reporting function only in the example for now. It's likely to change if we add train time + train score anyway.

@amueller
Member Author
amueller commented Mar 3, 2013

there was no test for the iid parameter... great....

@amueller
Member Author
amueller commented Mar 3, 2013

See #1731 for the bugfix / regression test

@amueller
Member Author
amueller commented Mar 3, 2013

Ok, I think I addressed your comments.

@ogrisel
Member
ogrisel commented Mar 3, 2013

Looks good to me :) +1 for merging.

@amueller
Member Author
amueller commented Mar 3, 2013

Ok, I'll squash and merge.

@amueller
Member Author
amueller commented Mar 3, 2013

@larsmans is there a good way to squash commits if I merged master in-between? I could drop all the history for a clean commit, not sure that is a great idea, though.

@amueller
Member Author
amueller commented Mar 3, 2013

Ok, I merged without history. Everything else would be way too messy and/or too much work. Thanks for the reviews, guys :)

@ogrisel
Member
ogrisel commented Mar 3, 2013

You definitely deserve a 🍺 for this one :)

@ogrisel
Member
ogrisel commented Mar 3, 2013

Or maybe a Caol Ila :)

@amueller
Member Author
amueller commented Mar 3, 2013

Not sure I deserve one, will take it anyway ;)

@arjoly
Member
arjoly commented Mar 4, 2013

Cool !!! Congratz @amueller !!!
