[RFC] Changes to model_selection? · Issue #5053 · scikit-learn/scikit-learn · GitHub
Closed
amueller opened this issue Jul 29, 2015 · 39 comments · Fixed by #6660

@amueller
Member

Once #4294 is merged, we might want to think about changing some of the API / default behaviors of the contents of this module.

There are several breaking changes that I'd like to make, and this might or might not be a good opportunity:

  • changing the default n_folds to 5, as it is already in some of the cv objects (3 seems really low).
  • rename grid_scores_ to be consistent with RandomizedSearchCV
  • possibly redo the _CVScoreTuple [needs to be done for multiple metrics I think]
  • rename labels to groups (?) in the cv iterators.
  • rename the p parameter in LeavePOut

was there anything else? ping @rvraghav93
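For context, the splitter shape these proposals target can be sketched in a few lines. This is an illustrative sketch, not scikit-learn code (SketchKFold is a made-up name): a data-independent constructor, a split() method taking X, y, and group information, and 5 folds by default instead of 3.

```python
# Minimal sketch of the proposed splitter API shape (not scikit-learn code).
class SketchKFold:
    def __init__(self, n_splits=5):  # the proposed new default, up from 3
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        # Yield (train_indices, test_indices) pairs over contiguous folds.
        n = len(X)
        fold_sizes = [n // self.n_splits + (1 if i < n % self.n_splits else 0)
                      for i in range(self.n_splits)]
        start = 0
        for size in fold_sizes:
            test = list(range(start, start + size))
            train = list(range(0, start)) + list(range(start + size, n))
            yield train, test
            start += size

folds = list(SketchKFold().split(range(10)))
print(len(folds))  # 5
```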

@raghavrv
Member
raghavrv commented Aug 2, 2015

(I'm checking the entries as soon as there is an issue/PR for the same)


MAJOR

  • General tests for CV objects?


MODERATE


MINOR


RFC

Moved from #5569 to here

@jnothman
Member
jnothman commented Aug 3, 2015

Defaulting n_folds to 5 is hard because it affects all *CV classes, not just things in model_selection.

rename grid_scores_ to be consistent with RandomizedSearchCV

This was postponed until we have a solution for multiple metrics and/or training scores, which may mandate changing the format, too: namedtuples are user-friendly but inflexible to versioning.

Having a clearer delineation of "labels" and "groups" would certainly be nice...
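The versioning concern with namedtuples can be made concrete in a few lines. This is a sketch: the field names mirror the then-current _CVScoreTuple, and the dict layout is one flexible alternative, roughly the dict-of-lists shape grid search results later took.

```python
from collections import namedtuple

# Any consumer that unpacks the tuple positionally breaks the moment a
# new field (a train score, per-metric scores, ...) is added.
CVScore = namedtuple('CVScore',
                     ['parameters', 'mean_validation_score', 'cv_validation_scores'])
entry = CVScore({'C': 1}, 0.9, [0.88, 0.92])
params, mean, per_fold = entry  # positional unpacking: fragile to new fields

# A dict of parallel lists can grow new keys ('mean_train_score',
# 'mean_test_accuracy', ...) without disturbing existing consumers.
results = {
    'params': [{'C': 1}, {'C': 10}],
    'mean_test_score': [0.90, 0.85],
}
best = max(range(len(results['params'])),
           key=lambda i: results['mean_test_score'][i])
print(results['params'][best])  # {'C': 1}
```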

@jnothman
Member
jnothman commented Aug 3, 2015

In new style CV splitters, the constructor no longer differs between stratified and unstratified splitters. Should stratified become a constructor parameter so that we are not proliferating pairs of classes with parallel APIs?
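A hypothetical sketch of what that might look like (this API was never adopted; scikit-learn ultimately kept StratifiedKFold as a separate class, and FlexKFold is a made-up name): the stratified/unstratified distinction folded into a single constructor flag.

```python
from collections import defaultdict

class FlexKFold:  # hypothetical name, illustration only
    def __init__(self, n_splits=5, stratify=False):
        self.n_splits = n_splits
        self.stratify = stratify

    def split(self, X, y=None):
        if self.stratify:
            # Deal each class's indices round-robin into folds so every
            # fold roughly preserves the class balance of y.
            by_class = defaultdict(list)
            for i, label in enumerate(y):
                by_class[label].append(i)
            folds = [[] for _ in range(self.n_splits)]
            k = 0
            for idx in by_class.values():
                for i in idx:
                    folds[k % self.n_splits].append(i)
                    k += 1
        else:
            folds = [list(range(len(X)))[f::self.n_splits]
                     for f in range(self.n_splits)]
        for f in range(self.n_splits):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield sorted(train), sorted(test)

y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
for train, test in FlexKFold(stratify=True).split(range(10), y):
    print(test)  # each test fold holds one sample of each class
```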

@amueller
Member Author
amueller commented Aug 3, 2015

True, changing n_folds is problematic :-/
For the stratified option: That might be interesting. I think we should also think about how this relates to stratifying in the regression case, see the discussion at #4757.

@mblondel
Member
mblondel commented Aug 5, 2015

Should stratified become a constructor parameter so that we are not proliferating pairs of classes with parallel APIs?

+1

@jnothman
Member
jnothman commented Aug 6, 2015

A minor change that could be included directly in #4294: drop grid_search.fit_grid_point.

@jnothman
Member
jnothman commented Aug 6, 2015

Half kidding: Is now also the time to rename get_params and set_params to get_config (as in Keras) and set_config, so as to avoid the confusion of parameters vs hyperparameters, etc.?

@amueller
Member Author
amueller commented Aug 6, 2015

is there a confusion about that? I wasn't aware.

@jnothman
Member
jnothman commented Aug 7, 2015

Well, the confusion is only when discussing the models, rather than being an inherent API problem. It's irrelevant; it's a silly thing to change, certainly as long as we call function/constructor arguments "parameters".

@raghavrv
Member

@amueller @vene @jnothman sorry for going back and forth (w.r.t. a discussion on Gitter) on this, but how does merging LabelKFold and KFold sound to you?

i.e. LabelKFold === KFold.split(X, y, labels) and KFold === KFold.split(X, y, labels=None)? (similarly for ShuffleSplit?)

Also, if this sounds okay... we should think about how StratifiedKFold with labels should be handled. (This is in reference to Joel's comment on combining KFold and StratifiedKFold like KFold(stratify=False).)

So after that is done, what would KFold(stratify=True, n_folds=5).split(X, y, labels=some_label) mean? Could it mean we stratify using y and use labels to make disjoint sets (is that even possible)?

Or would it be better to simply keep it as a separate class, as it is right now?
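The LabelKFold semantics at stake can be sketched in a few lines. This is a simplified illustration, not scikit-learn's implementation (the real class also balances fold sizes; this naive version just deals groups round-robin): folds partition the distinct labels, so all samples sharing a label land on the same side of every split.

```python
def label_kfold(labels, n_splits):
    # Partition the distinct labels across folds, then map back to samples.
    unique = sorted(set(labels))
    for f in range(n_splits):
        held_out = set(unique[f::n_splits])
        test = [i for i, g in enumerate(labels) if g in held_out]
        train = [i for i, g in enumerate(labels) if g not in held_out]
        yield train, test

labels = ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd']
for train, test in label_kfold(labels, n_splits=2):
    # No label ever straddles the train/test boundary.
    assert not {labels[i] for i in train} & {labels[i] for i in test}
```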

@jnothman
Member

I think a separate class. It's not just an option.


@amueller
Member Author

I also think a separate class would be good, if nothing else for visibility.

@raghavrv
Member

Okay! Thanks!

@raghavrv
Member

rename labels to groups (?) in the cv iterators.

Does this mean LabelKFold should be renamed to GroupedKFold?

@jnothman @amueller @vene

@raghavrv
Member

Also could we rename p to p_samples?

@raghavrv
Member

or n_test_samples or n_test?

@jnothman
Member

Personally, I much prefer GroupedKFold over LabelKFold as a description.

@jnothman
Member

I think leave p alone. It's a bit weird that KFold doesn't just have a parameter k.

@raghavrv
Member

I think leave p alone.

Pun intended? ;)

@jnothman
Member

I'm missing the pun. And I make dad jokes all the time.

@raghavrv
Member

Nevermind ;) (I thought LeavePOut --> Leave p alone :P )

@raghavrv
Member

BTW on a similar note should we also rename "Leave*LabelOut" --> "Leave*GroupOut"?

@GaelVaroquaux
Member
GaelVaroquaux commented Apr 14, 2016 via email

Personally, I much prefer GroupedKFold over LabelKFold as a description.

But we shouldn't be renaming things just for cosmetics (consistency is a valid argument, though). It costs a lot of energy, both for our developers and our users.

@raghavrv
Member

But we shouldn't be renaming things just for cosmetics (consistency is a valid argument, though)

I think the motivation for renaming labels --> groups is to demarcate clearly what we mean by labels. (As labels can also mean the target or class labels). Thanks for the comment! Could you also express your views at #6660 please?

@vene
Member
vene commented Apr 14, 2016

That's my view as well. labels was a weird, overloaded word in this context. I was personally very confused about what LeavePLabelOut did for a while, because I was thinking of y as labels. I thought it was some sort of zero-shot learning evaluation.

Groups is much better; though there could be confusion with, say, GroupLasso.

Such breakage is annoying on two accounts. (a) user code will need to change: but this will happen anyway because of the move to model_selection. (b) books becoming out of date, as @GaelVaroquaux points out.

Now, my first impulse is to say that (b) is also moot because of model_selection, but it's not that simple. It's easier to realize that cross_validation.LeavePLabelOut became model_selection.LeavePLabelOut than that it became model_selection.LeavePGroupOut.

Still, I think that if you at least know what your intention is, and understand the word "Group", the latter can be easily identified as relevant when tabbing through the imports.

(In light of this, I would have preferred LabelKFold to be KFoldLabels so that it would be easy to type KFold[tab] and find KFoldGroups instead...)
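The confusion described above can be made concrete with a pure-Python sketch of the LeavePLabelOut semantics with p=1 (leave_one_group_out is an illustrative name, not scikit-learn's API): the "labels" being held out are grouping ids, such as which subject a sample came from, not the targets y.

```python
def leave_one_group_out(groups):
    # Hold out every sample belonging to one group per split.
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield train, test

y      = [0, 1, 0, 1, 0, 1]                    # class labels: never held out
groups = ['s1', 's1', 's2', 's2', 's3', 's3']  # subject ids: these are held out
splits = list(leave_one_group_out(groups))
print(len(splits))  # 3 splits, one per subject
```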

@raghavrv
Member

@vene Could you copy the comment to #6660 please?? :)

@ogrisel
Member
ogrisel commented Sep 11, 2016

Re-opening as #6660 only fixed the part on renaming labels to groups.

@ogrisel ogrisel reopened this Sep 11, 2016
@jnothman
Member

I don't think anything remains here that needs to be changed before the release of model_selection. I think anything we want to consider changing can be moved to its own specific issue and this one closed.

@raghavrv
Member

+1 I'll do that tomorrow!

@amueller
Member Author

I think this issue can be closed. @raghavrv are there any urgent things you still see?

@raghavrv
Member

A few more tests that I was planning to raise as a PR soon... Otherwise this can be closed.

@raghavrv
Member

@amueller Do we want to remove _permutation_test_score? I tried doing that here at this commit - 8cfebf2

It seems to add one more line and some code complexity, and I don't feel it's worth it. Could I get reviews on the commit, and a quick comment on whether we need that as a PR?

@vene
Member
vene commented Sep 13, 2016

I think I prefer the version with the auxiliary function; it's much more readable, to me at least.

@raghavrv
Member

Yes. Thanks for the comment!

@amueller
Member Author

why is groups passed and not used?

@raghavrv
Member

I am sorry... I don't follow. groups is used for both cv and shuffling... (It's passed to _shuffle to shuffle the target, and to split for group CVs.)

@amueller
Member Author

sorry lack of coffee

@raghavrv
Member

I think it can be closed now... The other comments are either addressed / have PRs for them. @amueller
