StratifiedGroupShuffleSplit and StratifiedGroupKFold #15239
Could you please add tests @hermidalc ;)
folds are made by preserving the percentage of groups for each class.

Note: like the StratifiedShuffleSplit strategy, stratified random group
splits do not guarantee that all folds will be different, although this is
I don't understand this sentence. What do you mean by "folds not being different"?
That text is copied from StratifiedShuffleSplit, and the meaning behind it is that shuffle splitting does not guarantee that each split will be different from another.
Ah. I think in sklearn we call folds the partitions in the split, not the repetitions.
I think partitions/splits/folds is what is meant here, and I believe in StratifiedShuffleSplit as well. With randomized splits there is no guarantee that a partition will be different from the others.
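To make the point about repeated partitions concrete, here is a minimal sketch in plain Python. It is not the PR's code: `shuffle_group_test_sets` is a hypothetical helper that draws each repetition's test groups independently, which is the behaviour the docstring note describes. With 4 groups and one test group per repetition, only 4 distinct test sets exist, so 5 repetitions must contain a duplicate.

```python
import random


def shuffle_group_test_sets(group_ids, n_splits, n_test_groups, seed=0):
    """Hypothetical helper: draw the test groups of each repetition
    independently, the way a shuffle-split style iterator would."""
    rng = random.Random(seed)
    unique_groups = sorted(set(group_ids))
    for _ in range(n_splits):
        # Each repetition is sampled without looking at earlier ones.
        yield frozenset(rng.sample(unique_groups, n_test_groups))


test_sets = list(
    shuffle_group_test_sets(["a", "b", "c", "d"], n_splits=5, n_test_groups=1)
)
# Only 4 single-group test sets are possible, so at least one repeats.
n_distinct = len(set(test_sets))
print(f"{len(test_sets)} repetitions, {n_distinct} distinct test sets")
```

Nothing in the sampling step prevents two repetitions from choosing the same groups, which is why the note only says that distinct partitions are not guaranteed.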
Hey @hermidalc, any update on this?
Can we get tests by any chance, @hermidalc?
See reply in #13621
closed by #18649
This implements a Stratified version of the GroupShuffleSplit and GroupKFold cross-validators. Note that GroupKFold differs from StratifiedGroupKFold in that GroupKFold attempts to approximately balance the number of groups in each fold regardless of group class, whereas StratifiedGroupKFold attempts to stratify the group class percentages in each fold to be the same as that of the entire data.
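The stratification goal described above can be illustrated with a small sketch. This is not the implementation in this PR: `stratified_group_folds` is a hypothetical helper, and it assumes every group belongs to exactly one class. It deals the groups of each class across the folds in turn, so each fold ends up with roughly the same class mix of groups as the full data.

```python
from collections import Counter, defaultdict


def stratified_group_folds(groups, group_class, n_splits):
    """Hypothetical helper: map each group to a fold index so that every
    fold holds about the same share of groups from each class.

    Assumes each group has a single class, given by `group_class`.
    """
    # Collect the distinct groups of each class in first-seen order.
    groups_by_class = defaultdict(list)
    for g in dict.fromkeys(groups):
        groups_by_class[group_class[g]].append(g)

    fold_of_group = {}
    for class_groups in groups_by_class.values():
        # Deal the groups of one class across the folds like cards.
        for i, g in enumerate(class_groups):
            fold_of_group[g] = i % n_splits
    return fold_of_group


# Per-sample group labels: 9 groups, some with more than one sample.
groups = ["g1", "g1", "g2", "g3", "g3", "g4", "g5", "g6", "g7", "g7", "g8", "g9"]
# Six groups of class 0 and three groups of class 1.
group_class = {"g1": 0, "g2": 0, "g3": 0, "g4": 0, "g5": 0, "g6": 0,
               "g7": 1, "g8": 1, "g9": 1}

folds = stratified_group_folds(groups, group_class, n_splits=3)
# Count groups per (fold, class) pair to check the class mix of each fold.
per_fold = Counter((folds[g], group_class[g]) for g in group_class)
for fold in range(3):
    print(fold, {cls: per_fold[(fold, cls)] for cls in (0, 1)})
```

In this toy data the overall ratio is two class-0 groups for every class-1 group, and each of the three folds receives exactly that ratio. A cross-validator that ignores group class would only aim to balance the number of groups per fold, so a single fold could end up holding all class-1 groups.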
There are two important points regarding the implementation logic:
This makes the logic straightforward and covers a lot of use cases (at least the ones I needed them for in my work).
TODO: