StratifiedGroupShuffleSplit · Issue #12076 · scikit-learn/scikit-learn · GitHub
StratifiedGroupShuffleSplit #12076

Open
paulaceccon opened this issue Sep 14, 2018 · 16 comments

Comments

@paulaceccon

It would be nice to have a way to perform GroupShuffleSplit with some level of stratification.
Using GroupShuffleSplit does not prevent a target class (for classification problems) from being present only in the train set or only in the test set.
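A minimal sketch of the problem being described (toy data; all names here are illustrative, not from the issue): GroupShuffleSplit assigns whole groups to one side, ignoring y, so a class that lives in only a few groups can end up entirely in train or entirely in test.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 8 groups of 10 samples; the positive class lives only in
# groups 6 and 7, so whole-group assignment can strand it on one side.
groups = np.repeat(np.arange(8), 10)
y = (groups >= 6).astype(int)
X = np.zeros((80, 1))  # features are irrelevant for the split itself

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups))

# GroupShuffleSplit ignores y: the class ratios of the two sides can
# differ wildly, and one side may contain no positives at all.
print("train positive ratio:", y[train_idx].mean())
print("test positive ratio:", y[test_idx].mean())
```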

@amueller
Member

The problem with this is defining the expected behavior. I'm pretty sure there's an open (or closed?) issue on that. What would you expect to happen?

@amin-nejad

I recently had this problem of needing a StratifiedGroupShuffleSplit, and this is the solution I came up with, which worked for me, in case anyone is interested:

https://stackoverflow.com/questions/56872664/complex-dataset-split-stratifiedgroupshufflesplit

@impiyush
impiyush commented Mar 3, 2020

I am not sure if this is already implemented in the library, but I believe it would be helpful to have something like this. It can come in handy with imbalanced data when you want to do cross-validation, splitting on groups while keeping the class ratio.
I am happy to work on this and open a PR if agreed upon, considering I already have some experience dealing with this kind of problem.

@jnothman
Member
jnothman commented Mar 4, 2020 via email

@impiyush
impiyush commented Mar 5, 2020

Thanks for sharing links to those issues @jnothman. I will review them and try to have a PR out soon. I need to understand those two cases better to make sure I code them up as per the expectations.

@impiyush
impiyush commented Mar 7, 2020

@jnothman I see there are a couple of pull requests already open (PR-15239 and PR-9413).

  • Are these PRs going to be merged, or should I create a new PR to tackle this?
  • Should I review them and try to combine them into one?

I am not sure I am following what the expectation is here, and I also want to be conscious of others' PRs.

I see that the base class for StratifiedKFold also accepts a groups option (line 315), but it doesn't look like StratifiedKFold itself uses it. Can I use this groups argument to create stratified splits on groups?

@impiyush

Any suggestions on this, @jnothman?

@davzaman

Any update on this?

@cmarmo cmarmo added the Needs Decision Requires decision label Mar 28, 2022
@BJohnBraddock

I could also really use this, considering sklearn.model_selection.StratifiedGroupKFold exists and works great!

@Pierre-Bartet

Is there any subtle reason why StratifiedGroupKFold was implemented but not StratifiedGroupShuffleSplit (or a group argument added to train_test_split)?

@totuta
totuta commented Apr 10, 2024

Is there any subtle reason why StratifiedGroupKFold was implemented but not StratifiedGroupShuffleSplit (or a group argument added to train_test_split)?

It is because the current implementation of StratifiedGroupKFold needs to sort groups, not shuffle them. It employs a greedy search with a priority queue, prioritizing groups in descending order of the standard deviation of their class distribution. Groups are first sorted by that standard deviation, and then, in sorted order, each group is assigned to the best of the K folds. If the groups are shuffled, the algorithm collapses.

The side effect of not having a ShuffleSplit is that you cannot specify test_size directly. You can only set it indirectly via K (n_splits=int(1/desired_test_ratio)), which is a pain when you need test_size to be a specific value.

@totuta
totuta commented Apr 10, 2024

Additionally, from StratifiedGroupKFold:

shuffle: bool, default=False
This implementation can only shuffle groups that have approximately the same y distribution, no global shuffle will be performed.

Because the implementation relies on the sorted order of the groups, it can only shuffle groups that have "approximately the same y distribution". A global shuffle is not possible, because it would scramble the sorted order.

@Pierre-Bartet
Pierre-Bartet commented Apr 11, 2024

Thanks for the explanations. Why couldn't StratifiedGroupShuffleSplit behave with the same caveats about shuffling? (Not a rhetorical question.)

@totuta
totuta commented Apr 11, 2024

To summarize,

The current implementation of StratifiedGroupKFold

  • requires multiple folds (at least 2, i.e. K ≥ 2) to perform the "stratified" split,
  • and that automatically makes test_size equal to 1/K (you can't change it).

This is

  • a very clever choice that yields a super-fast, good-enough algorithm,
  • not an exact optimization, but a good-enough greedy search
    • (exact optimization is NP-complete here).

Therefore, without modifying the current algorithm, the only way to implement StratifiedGroupShuffleSplit (with the same shuffling caveats) is to perform K splits and select one or more folds out of the K, depending on the test_size you want.

To make test_size exact, there will be cases where K becomes prohibitively large (e.g. test_size=1_000_000 (integer) or 0.235 (float)).

But those anti-patterns may not happen often, so that is probably why only StratifiedGroupKFold was adopted, I suppose.

@Pierre-Bartet

Therefore, without modifying the current algorithm, the only way to implement StratifiedGroupShuffleSplit (with the same shuffling caveats) is to perform K splits and select one or multiple folds out of K, depending on the test_size you want.

Yes, that is what people do (in fact, most of the time people use train_test_split and don't even deal properly with groups).

@DavidRosen
DavidRosen commented Apr 10, 2025

Here's a simplified implementation that lazily combines StratifiedGroupShuffleSplit functionality with its split method to obtain a single train-test split, by invoking StratifiedGroupKFold just for its first fold (it supports multi-class targets, not just binary):

https://stackoverflow.com/a/79565815/11929115

It rounds the desired test fraction to 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, etc., which is just fine for me. A more general version could instead round the desired test fraction to the nearest multiple of 0.05 (i.e. 5%, 10%, 15%, 20%, ...) and concatenate the first N test folds from a 20-fold split, or do something even fancier. A 40-fold split would allow e.g. 32.5%, which is plenty close to 1/3. A 60-fold split would also allow exactly 1/2, 1/3, 1/4, n/5, n/6, n/10, etc., although this may be overkill since the resulting test fraction is never exact anyway, especially if you have any large-ish groups.

More bothersome to me is that the stratification from StratifiedGroupKFold isn't as precise as I'd like, except maybe with entirely very-small groups or if all the groups have very similar class distributions (unlikely).
