-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Restructure the output attributes of *SearchCV #1768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
add docstring for GridSearchCV, RandomizedSearchCV and fit_grid_point. In "fit_grid_point" I used test_score rather than validation_score, as the split is given to the function. rbf svm grid search example now also shows training scores - which illustrates overfitting for high C, and training/prediction times... which pasically serve to illustrate that this is possible. Maybe random forests would be better to evaluate training times?
Currently, no tests have been added, and backwards compatibility is eschewed
This includes some refactoring and creation of new Scorer classes to wrap legacy scorer forms. Conflicts: sklearn/grid_search.py sklearn/tests/test_grid_search.py
Thus the attributes stored by BaseSearchCV._fit() are no longer redundant. Also: test for these attributes
@@ -170,8 +169,8 @@ def __iter__(self): | |||
yield params | |||
|
|||
|
|||
def fit_grid_point(X, y, base_clf, clf_params, train, test, scorer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be called fit_fold
for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, with deprecation of the current function name.
Thanks a lot for your work. I hope I can incooperate you chages soon. |
This replaces Scorer.store() Also: tests for new Scorer functionality and descendants, and fixes broken WrapScorer
|
||
We can observe that the lower right half of the parameters (below the diagonal | ||
with high C and gamma values) is characteristic of parameters that yields an | ||
overfitting model: the trainin score is very high but there is a wide gap. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
training score?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR does to many thing / unrelated things at once. Current master basically does a rename of the On the other hand, one could argue that your changes are not going far enough. If you do have a grid of parameters, the reshaping is still annoying to do by hand, right? I'm all for incremental improvements, but not if they change the api ;) |
Btw, I don't see how this is an alternative to #1742. |
All true, @amueller . It is not strictly an alternative to #1742 (which also grew much broader than its title suggests); rather, [what I meant:] it incorporates an alternative approach to returning additional statistics from the parameter search. The implementation at #1742 (d884180) actually breaks backwards-compatibility: where This patch doesn't change the structure of I am not sure why you are concerned that the output cannot be indexed directly by parameter values. I don't think I ever suggested reshaping the output to be indexed by parameter values: this might indeed be a nice extension (though as far as I can tell would still not be trivial to use), but is certainly separate. Moreover this critique -- and the potential extension -- would apply equally to this patch and to the existing code. Finally, I included multiple scores in this PR because:
But if you insist, it can be removed; but it can then only be proposed after the restructure patch is accepted. As this was my first content PR to scikit-learn, would you advise me on the appropriate action? (If I am to split it, do I merely edit this branch, or do I close this PR and reopen?) |
Okay. I've split it up. Start with #1787 |
Thanks for responding to my comments. I think I agree with you. Thanks for splitting up the PR. I think this is the way to go. |
BaseSearchCV
now stores attribsbest_index_
,grid_results_
andfold_results_
from whichbest_params_
,best_score_
andgrid_scores_
are derived; andbest_estimator_
.grid_results_
andfold_results_
are structured arrays, allowing for flexible further computation.An alternative to #1742