Restructure the output attributes of *SearchCV #1768


Closed
jnothman wants to merge 13 commits

Conversation

jnothman (Member)
  • BaseSearchCV now stores the attributes best_index_, grid_results_ and fold_results_, from which best_params_, best_score_ and grid_scores_ are derived; best_estimator_ is stored as before.
  • grid_results_ and fold_results_ are structured arrays, allowing for flexible further computation (see the sketch after this list).
  • The contents of these arrays are extensible. Currently they may store training scores and times as well as test scores, the number of test samples, the parameters for each grid point, etc.
  • Multiple scores may be output, such as precision and recall alongside the F1 objective.
  • Some refactoring and additional testing is included.
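
A minimal sketch (not code from this patch) of how such a structured grid_results_ might be consumed; the field names test_score, train_time and parameters are assumptions for illustration:

import numpy as np

# Toy stand-in for a grid_results_-style structured array over three
# grid points; one record per parameter setting.
grid_results = np.array(
    [(0.81, 0.02, {'C': 1}),
     (0.90, 0.03, {'C': 10}),
     (0.86, 0.05, {'C': 100})],
    dtype=[('test_score', float), ('train_time', float),
           ('parameters', object)])

# The derived attributes then fall out of simple array operations:
best_index = grid_results['test_score'].argmax()
best_score = grid_results['test_score'][best_index]
best_params = grid_results['parameters'][best_index]
print(best_index, best_score, best_params)  # 1 0.9 {'C': 10}

Because the dtype carries field names, downstream code selects columns by name rather than by tuple position.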

An alternative to #1742

amueller and others added 9 commits March 11, 2013 21:45
add docstring for GridSearchCV, RandomizedSearchCV and fit_grid_point. In "fit_grid_point" I used test_score rather than validation_score, as the split is given to the function.
rbf svm grid search example now also shows training scores - which illustrates overfitting for high C, and training/prediction times... which basically serve to illustrate that this is possible. Maybe random forests would be better to evaluate training times?
Currently, no tests have been added, and backwards compatibility is eschewed
This includes some refactoring and creation of new Scorer classes to
wrap legacy scorer forms.

Conflicts:
	sklearn/grid_search.py
	sklearn/tests/test_grid_search.py
Thus the attributes stored by BaseSearchCV._fit() are no longer redundant.

Also: tests for these attributes
@@ -170,8 +169,8 @@ def __iter__(self):
yield params


def fit_grid_point(X, y, base_clf, clf_params, train, test, scorer,
jnothman (Member Author)

Should this be called fit_fold for consistency?

Member

Yes, with deprecation of the current function name.

@amueller (Member)

Thanks a lot for your work. I hope I can incorporate your changes soon.

This replaces Scorer.store()

Also: tests for new Scorer functionality and descendants, and fixes broken WrapScorer

We can observe that the lower right half of the parameters (below the diagonal
with high C and gamma values) is characteristic of parameters that yields an
overfitting model: the trainin score is very high but there is a wide gap. The
Member

training score?

jnothman (Member Author)

Hmm. This is inherited from my merge with d884180 of #1742. Do I fix it, or do I remove the file from the changeset?

@amueller (Member)

I think this PR does too many things / unrelated things at once.
The restructuring of the parameter search results into grid and fold results is independent of the restructuring of the scorer. Both are quite major changes.

Current master basically does a rename of the grid_scores_ attribute of 0.13.1. You changed the structure. Are there major benefits? It does seem a bit more intuitive but changing the structure of objects is annoying to users, so we shouldn't do it lightly.

On the other hand, one could argue that your changes are not going far enough. If you do have a grid of parameters, the reshaping is still annoying to do by hand, right?
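
(For context, a hedged sketch of the by-hand reshape in question; the C and gamma lists are hypothetical, and the flat array is assumed to iterate with C varying slowest:)

import numpy as np

# Hypothetical parameter grid; flat_scores stands in for per-point
# mean scores in grid iteration order.
C_values = [1, 10, 100]
gamma_values = [0.01, 0.1]
flat_scores = np.linspace(0.5, 0.9, len(C_values) * len(gamma_values))

# The manual reshape into an (n_C, n_gamma) score matrix:
score_grid = flat_scores.reshape(len(C_values), len(gamma_values))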

I'm all for incremental improvements, but not if they change the api ;)

@amueller (Member)

Btw, I don't see how this is an alternative to #1742.
There, I add some additional info to the current parameter search results. This PR here changes the structure of the results and adds multiple Scorer classes.

@jnothman (Member Author)

All true, @amueller. It is not strictly an alternative to #1742 (which also grew much broader than its title suggests); rather, what I meant is that it incorporates an alternative approach to returning additional statistics from the parameter search. The implementation at #1742 (d884180) actually breaks backwards compatibility: where grid_scores_ returned a list of 3-tuples, there it returns a list of 6-tuples.

This patch doesn't change the structure of grid_scores_. More precisely, it proposes an output that is strictly more expressive, extensible and immediately powerful than the previous representation. So grid_scores_ etc. can still be calculated as before (in a backwards-compatible way, unlike in #1742's patch), but: scores can also be accessed by name; additional data can be added; data is stored compactly in memory as numpy arrays; and these can be operated on to quickly find, for example, the score (or training time) variance across all parameters considered (for a particular fold, or over means).
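
For example (a sketch only, not this patch's code; the field names and shapes are assumptions), per-fold results laid out as a structured array make such aggregates one-liners:

import numpy as np

# Hypothetical fold_results_-style array: one record per (grid point, fold).
n_points, n_folds = 3, 5
rng = np.random.RandomState(0)
fold_results = np.zeros((n_points, n_folds),
                        dtype=[('test_score', float), ('train_time', float)])
fold_results['test_score'] = rng.uniform(0.7, 0.95, size=(n_points, n_folds))
fold_results['train_time'] = rng.uniform(0.01, 0.20, size=(n_points, n_folds))

# Variance of the test score across folds, for every grid point at once:
score_var = fold_results['test_score'].var(axis=1)
# Mean training time per grid point:
mean_time = fold_results['train_time'].mean(axis=1)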

I am not sure why you are concerned that the output cannot be indexed directly by parameter values. I don't think I ever suggested reshaping the output to be indexed by parameter values: this might indeed be a nice extension (though as far as I can tell would still not be trivial to use), but is certainly separate. Moreover this critique -- and the potential extension -- would apply equally to this patch and to the existing code.

Finally, I included multiple scores in this PR because:

  • it is what motivated me to make this change: it requires an extensible design (rather than one where all field names/types are preset);
  • it depends on this design (or on a hack using __float__ or __{add,iadd,mul}__ as in our ML discussions);
  • it continues to exemplify the power and utility of this design (a sketch follows this list).
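
To illustrate that last point (again only a sketch; the metric field names are invented for illustration):

import numpy as np

# With multiple scorers, each metric becomes a named field, so the result
# dtype can be built at fit time rather than being preset.
grid_results = np.zeros(4, dtype=[('precision', float),
                                  ('recall', float), ('f1', float)])
grid_results['precision'] = [0.90, 0.80, 0.85, 0.70]
grid_results['recall'] = [0.60, 0.75, 0.80, 0.90]
p, r = grid_results['precision'], grid_results['recall']
grid_results['f1'] = 2 * p * r / (p + r)

# Choose the best grid point by the F1 objective, but keep all metrics:
best = grid_results[grid_results['f1'].argmax()]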

If you insist, it can be removed, but it can then only be proposed after the restructure patch is accepted. As this is my first content PR to scikit-learn, would you advise me on the appropriate action? (If I am to split it, do I merely edit this branch, or do I close this PR and reopen?)

@jnothman (Member Author)

Okay. I've split it up. Start with #1787

jnothman closed this Mar 18, 2013
@amueller (Member)

Thanks for responding to my comments. I think I agree with you. Thanks for splitting up the PR. I think this is the way to go.
I would really love to look at your contribution in more detail but I am constantly on the road at the moment (London, Berlin, New York ^^). I'll try to make some time to review #1787.
