Restructure the output attributes of *SearchCV #1768


Closed
jnothman wants to merge 13 commits

Conversation

jnothman (Member)
  • BaseSearchCV now stores the attributes best_index_, grid_results_ and fold_results_, from which best_params_, best_score_ and grid_scores_ are derived; best_estimator_ is stored as before.
  • grid_results_ and fold_results_ are structured arrays, allowing for flexible further computation (see the sketch after this list).
  • The contents of these arrays are extensible. Currently they may store training scores and times as well as test scores, the number of test samples, the parameters for each grid point, etc.
  • Multiple scores may be output, such as precision and recall alongside the F1 objective.
  • Some refactoring and additional testing is included.
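
A minimal sketch (not code from this patch) of how such a structured grid_results_ might be consumed; the field names test_score, train_time and parameters are assumptions for illustration:

import numpy as np

# Toy stand-in for a grid_results_-style structured array over three
# grid points; one record per parameter setting.
grid_results = np.array(
    [(0.81, 0.02, {'C': 1}),
     (0.90, 0.03, {'C': 10}),
     (0.86, 0.05, {'C': 100})],
    dtype=[('test_score', float), ('train_time', float),
           ('parameters', object)])

# The derived attributes then fall out of simple array operations:
best_index = grid_results['test_score'].argmax()
best_score = grid_results['test_score'][best_index]
best_params = grid_results['parameters'][best_index]
print(best_index, best_score, best_params)  # 1 0.9 {'C': 10}

Because the dtype carries field names, downstream code selects columns by name rather than by tuple position.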

An alternative to #1742

amueller and others added 9 commits March 11, 2013 21:45
add docstring for GridSearchCV, RandomizedSearchCV and fit_grid_point. In "fit_grid_point" I used test_score rather than validation_score, as the split is given to the function.
rbf svm grid search example now also shows training scores - which illustrates overfitting for high C, and training/prediction times... which basically serve to illustrate that this is possible. Maybe random forests would be better to evaluate training times?
Currently, no tests have been added, and backwards compatibility is eschewed
This includes some refactoring and creation of new Scorer classes to
wrap legacy scorer forms.

Conflicts:
	sklearn/grid_search.py
	sklearn/tests/test_grid_search.py
Thus the attributes stored by BaseSearchCV._fit() are no longer redundant.

Also: tests for these attributes
@@ -170,8 +169,8 @@ def __iter__(self):
yield params


def fit_grid_point(X, y, base_clf, clf_params, train, test, scorer,
jnothman (Member Author)

Should this be called fit_fold for consistency?

Member

Yes, with deprecation of the current function name.

@amueller (Member)

Thanks a lot for your work. I hope I can incorporate your changes soon.

This replaces Scorer.store()

Also: tests for new Scorer functionality and descendants, and fixes broken WrapScorer

We can observe that the lower right half of the parameters (below the diagonal
with high C and gamma values) is characteristic of parameters that yields an
overfitting model: the trainin score is very high but there is a wide gap. The
Member

training score?

jnothman (Member Author)

Hmm. This is inherited from my merge with d884180 of #1742. Do I fix it, or do I remove the file from the changeset?

@amueller (Member)

I think this PR does too many things / unrelated things at once.
The restructuring of the parameter search results into grid and fold results is independent of the restructuring of the scorer. Both are quite major changes.

Current master basically does a rename of the grid_scores_ attribute of 0.13.1. You changed the structure. Are there major benefits? It does seem a bit more intuitive but changing the structure of objects is annoying to users, so we shouldn't do it lightly.

On the other hand, one could argue that your changes are not going far enough. If you do have a grid of parameters, the reshaping is still annoying to do by hand, right?
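
(For context, a hedged sketch of the by-hand reshape in question; the C and gamma lists are hypothetical, and the flat array is assumed to iterate with C varying slowest:)

import numpy as np

# Hypothetical parameter grid; flat_scores stands in for per-point
# mean scores in grid iteration order.
C_values = [1, 10, 100]
gamma_values = [0.01, 0.1]
flat_scores = np.linspace(0.5, 0.9, len(C_values) * len(gamma_values))

# The manual reshape into an (n_C, n_gamma) score matrix:
score_grid = flat_scores.reshape(len(C_values), len(gamma_values))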

I'm all for incremental improvements, but not if they change the api ;)

@amueller (Member)

Btw, I don't see how this is an alternative to #1742.
There, I add some additional info to the current parameter search results. This PR here changes the structure of the results and adds multiple Scorer classes.

@jnothman (Member Author)

All true, @amueller. It is not strictly an alternative to #1742 (which also grew much broader than its title suggests); rather, what I meant is that it incorporates an alternative approach to returning additional statistics from the parameter search. The implementation at #1742 (d884180) actually breaks backwards compatibility: where grid_scores_ returned a list of 3-tuples, there it returns a list of 6-tuples.

This patch doesn't change the structure of grid_scores_. More precisely, it proposes an output that is strictly more expressive, extensible and immediately powerful than the previous representation. So grid_scores_ etc. can still be calculated as before (in a backwards-compatible way, unlike in #1742's patch), but: scores can also be accessed by name; additional data can be added; data is stored compactly in memory as numpy arrays; and these can be operated on to quickly find, for example, the score (or training time) variance across all parameters considered (for a particular fold, or over means).
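
For example (a sketch only, not this patch's code; the field names and shapes are assumptions), per-fold results laid out as a structured array make such aggregates one-liners:

import numpy as np

# Hypothetical fold_results_-style array: one record per (grid point, fold).
n_points, n_folds = 3, 5
rng = np.random.RandomState(0)
fold_results = np.zeros((n_points, n_folds),
                        dtype=[('test_score', float), ('train_time', float)])
fold_results['test_score'] = rng.uniform(0.7, 0.95, size=(n_points, n_folds))
fold_results['train_time'] = rng.uniform(0.01, 0.20, size=(n_points, n_folds))

# Variance of the test score across folds, for every grid point at once:
score_var = fold_results['test_score'].var(axis=1)
# Mean training time per grid point:
mean_time = fold_results['train_time'].mean(axis=1)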

I am not sure why you are concerned that the output cannot be indexed directly by parameter values. I don't think I ever suggested reshaping the output to be indexed by parameter values: this might indeed be a nice extension (though as far as I can tell would still not be trivial to use), but is certainly separate. Moreover this critique -- and the potential extension -- would apply equally to this patch and to the existing code.

Finally, I included multiple scores in this PR because:

  • it is what motivated me to make this change: it requires an extensible design (rather than one where all field names/types are preset);
  • it depends on this design (or on a hack using __float__ or __{add,iadd,mul}__ as in our ML discussions);
  • it continues to exemplify the power and utility of this design (a sketch follows this list).
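
To illustrate that last point (again only a sketch; the metric field names are invented for illustration):

import numpy as np

# With multiple scorers, each metric becomes a named field, so the result
# dtype can be built at fit time rather than being preset.
grid_results = np.zeros(4, dtype=[('precision', float),
                                  ('recall', float), ('f1', float)])
grid_results['precision'] = [0.90, 0.80, 0.85, 0.70]
grid_results['recall'] = [0.60, 0.75, 0.80, 0.90]
p, r = grid_results['precision'], grid_results['recall']
grid_results['f1'] = 2 * p * r / (p + r)

# Choose the best grid point by the F1 objective, but keep all metrics:
best = grid_results[grid_results['f1'].argmax()]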

If you insist, it can be removed, but it can then only be proposed after the restructure patch is accepted. As this is my first content PR to scikit-learn, would you advise me on the appropriate action? (If I am to split it, do I merely edit this branch, or do I close this PR and reopen?)

@jnothman (Member Author)

Okay. I've split it up. Start with #1787

jnothman closed this Mar 18, 2013
@amueller (Member)

Thanks for responding to my comments. I think I agree with you. Thanks for splitting up the PR. I think this is the way to go.
I would really love to look at your contribution in more detail but I am constantly on the road at the moment (London, Berlin, New York ^^). I'll try to make some time to review #1787.
