StratifiedGroupShuffleSplit #12076
Comments
The problem with this is defining the expected behavior. I'm pretty sure there's an open (or closed?) issue on that. What would you expect to happen? |
I recently had this problem of needing a https://stackoverflow.com/questions/56872664/complex-dataset-split-stratifiedgroupshufflesplit |
I am not sure whether this is already implemented in the library, but I believe it would be helpful to have something like it. It comes in handy with imbalanced data when you want to cross-validate while splitting on groups and preserving the class ratio. |
I think we would be very happy for a contributor to champion this and the
parallel issue on stratified kfold with groups (#13621). We have discussed
two main cases: one where the y within a group is homogeneous and another
where the y is heterogeneous. I think the algorithm for the former is much
less heuristic. But I do not recall the details of all past discussions and
pull requests (e.g. #14524) on the topic. It would be great if a
contributor such as yourself @impiyush reviewed the work done so far here
and proposed pull requests with solutions.
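
To tell which of the two cases a dataset falls into, a small check along these lines may help. It is only an illustrative sketch: `groups_with_mixed_labels` is a hypothetical helper name, not part of scikit-learn, and the toy labels are made up.

```python
from collections import defaultdict


def groups_with_mixed_labels(y, groups):
    """Return the sorted ids of groups that contain more than one distinct label.

    Hypothetical helper for illustration only. An empty result means y is
    homogeneous within every group (the first case discussed above).
    """
    labels_per_group = defaultdict(set)
    for label, group in zip(y, groups):
        labels_per_group[group].add(label)
    return sorted(g for g, labels in labels_per_group.items() if len(labels) > 1)


# Toy data (made up): groups "a" and "b" are homogeneous, group "c" is mixed.
y = [0, 0, 1, 1, 0, 1]
groups = ["a", "a", "b", "b", "c", "c"]
mixed = groups_with_mixed_labels(y, groups)
print("heterogeneous groups:", mixed)
```

If the returned list is empty, each group can be treated as a single labelled unit and stratified directly; otherwise some heuristic balancing is needed.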
|
Thanks for sharing links to those issues @jnothman. I will review them and try to have a PR out soon. I need to understand those two cases better to make sure I code them up as per the expectations. |
@jnothman I see there are already a couple of pull requests open (PR-15239 and PR-9413).
I am not sure I follow what the expectation is here, and I also want to be mindful of others' PRs. I see that the base class for StratifiedKFold also has a groups option (line 315), but it doesn't look like StratifiedKFold itself uses it. Can I use this groups argument to create stratified splits on groups? |
Any suggestions on this, @jnothman? |
Any update on this? |
I could also really use this, especially considering that sklearn.model_selection.StratifiedGroupKFold exists and works great! |
Is there any subtle reason why |
It is because the current implementation of The side-effect of not having |
Additionally, from
Because the implementation is based on the sorted result of groups, it can only shuffle groups that have "approximately the same y distribution". Global shuffle is not possible, because it will scramble the sorted result. |
Thanks for the explanations, why couldn't |
To summarize, The current implementation of
Which is
Therefore, without modifying the current algorithm, the only way to implement To make But those anti-patterns may not happen often. So that is probably why they only adopted |
Yes that is what people do (in fact most of the time they use |
Here's a simplified implementation lazily combining StratifiedGroupShuffleSplit functionality with its split method to obtain a single train-test split by invoking StratifiedGroupKFold just for its first fold (supports multi-class targets, not just binary): https://stackoverflow.com/a/79565815/11929115 It rounds desired test fraction to 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, etc., which is just fine for me. A more general version could alternatively e.g. round the desired test fraction to the nearest multiple of 0.05 (i.e. 5%, 10%, 15%, 20%, ....) and just concatenate the first N test folds from a 20-fold split, or maybe do something even fancier. A 40-fold split would allow e.g. 32.5% which is plenty close to 1/3. Or a 60-fold split would also allow exactly 1/2, 1/3, 1/4, n/5, n/6, n/10, etc., although this may be overkill since the resulting test fraction is never exact anyway, especially if you have any large-ish groups. More bothersome to me is that the stratification from StratifiedGroupKFold isn't as precise as I'd like, except maybe with entirely very-small groups or if all the groups have very similar class distributions (unlikely). |
It would be nice to have a way to perform GroupShuffleSplit with some level of stratification.
Using GroupShuffleSplit does not prevent a target class (in classification problems) from appearing only in the train set or only in the test set.
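
A small sketch of how the problem can be observed. `classes_missing_from_split` is a hypothetical helper written for this illustration, and the toy data (a rare class confined to two groups) is made up.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def classes_missing_from_split(y, train_idx, test_idx):
    """Return (classes absent from train, classes absent from test)."""
    y = np.asarray(y)
    all_classes = set(np.unique(y).tolist())
    missing_train = sorted(all_classes - set(y[train_idx].tolist()))
    missing_test = sorted(all_classes - set(y[test_idx].tolist()))
    return missing_train, missing_test


# Toy data (made up): 10 groups of 4 samples; only groups 0 and 1 hold class 1.
groups = np.repeat(np.arange(10), 4)
y = np.where(groups < 2, 1, 0)
X = np.zeros((len(y), 1))

gss = GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
n_bad = 0
for train_idx, test_idx in gss.split(X, y, groups):
    missing_train, missing_test = classes_missing_from_split(y, train_idx, test_idx)
    if missing_train or missing_test:
        n_bad += 1
print(f"{n_bad} of 50 splits lack a class on one side")
```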