Clarify group order in GroupKFold and LeaveOneGroupOut #18338
Comments
What cv objects should guarantee is that the order does not change every time you call split. I am more wondering why you have no shuffle and random_state parameters in GroupKFold to match the KFold API. Unless the need you report is quite common, I would write my own code (a private function or object) to do what you suggest.
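A minimal sketch of what such a private helper could look like, assuming the goal is simply to hand out groups to folds in sorted group-ID order. The name `ordered_group_folds` is hypothetical and not part of scikit-learn, and unlike GroupKFold this sketch makes no attempt to balance the number of samples per fold.

```python
def ordered_group_folds(groups, n_splits):
    """Yield (train_idx, test_idx) with groups assigned to folds in sorted ID order.

    Hypothetical helper, not part of scikit-learn. Fold sizes are NOT balanced.
    """
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Split the sorted group IDs into n_splits contiguous chunks.
    base, extra = divmod(len(unique), n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_groups = set(unique[start:stop])
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx
        start = stop


groups = [0, 0, 0, 1, 1, 2, 2, 3]
for train_idx, test_idx in ordered_group_folds(groups, n_splits=2):
    # Show which group IDs are held out in each fold.
    print(sorted({groups[i] for i in test_idx}))
```

Adding samples to an existing group does not change which fold that group lands in here, because only the sorted group IDs decide the assignment. The price is that folds can become very uneven in size.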
GroupKFold is a deterministic packing algorithm. The only shuffling possible is among groups of equal size, and of samples within groups. This has surprised several users. We should either enhance the documentation, or add an option to shuffle in whichever ways possible, and then explicitly document the limitations.
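To illustrate why fold membership follows group sizes rather than group IDs, here is a simplified sketch of a greedy packing of this kind: largest group first, each into the currently lightest fold. The function name `greedy_group_folds` and the tie-breaking rule are assumptions for illustration. This is not scikit-learn's actual implementation, and its tie-breaking may differ.

```python
from collections import Counter


def greedy_group_folds(groups, n_splits):
    """Map each group ID to a fold index using a greedy size-based packing.

    Illustrative sketch only; tie-breaking by group ID is an assumption.
    """
    sizes = Counter(groups)
    # Largest groups first; ties broken by ascending group ID (assumption).
    order = sorted(sizes, key=lambda g: (-sizes[g], g))
    fold_sizes = [0] * n_splits
    group_to_fold = {}
    for g in order:
        # Pick the first fold that currently holds the fewest samples.
        lightest = fold_sizes.index(min(fold_sizes))
        group_to_fold[g] = lightest
        fold_sizes[lightest] += sizes[g]
    return group_to_fold


# Three groups of equal size: assignment follows group ID in this sketch.
print(greedy_group_folds([0, 0, 1, 1, 2, 2], n_splits=3))
# → {0: 0, 1: 1, 2: 2}

# One extra sample in group 1 makes it the largest, so it is packed first.
print(greedy_group_folds([0, 0, 1, 1, 1, 2, 2], n_splits=3))
# → {1: 0, 0: 1, 2: 2}
```

The second call shows the effect reported in this issue: changing the size of one group can move other groups to different folds, even though the set of group IDs is unchanged.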
OK, got it, makes sense. @lorentzenchr for your use in linear models I would suggest you try LeaveOneGroupOut. I think it does what you want here.
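A small sketch of that suggestion, using made-up data in which the group IDs deliberately appear out of order. It only uses `LeaveOneGroupOut.split` and records which group is held out in each iteration, so you can check the order on your installed version rather than take it on trust.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data (assumption): group IDs are not sorted in the input.
groups = np.array([2, 2, 0, 1, 1, 0, 2])
X = np.arange(len(groups)).reshape(-1, 1)

logo = LeaveOneGroupOut()
held_out = []
for train_idx, test_idx in logo.split(X, groups=groups):
    # Each test fold should contain exactly one distinct group.
    test_group = np.unique(groups[test_idx])
    held_out.append(int(test_group[0]))
    print("held-out group:", test_group, "test indices:", test_idx)
```

If the held-out groups come out in ascending group-ID order regardless of input order, that matches the behaviour discussed in this thread.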
@agramfort I wonder if that is by chance or behaviour that can be relied upon.
Read the code to check.
Of course, it is not by chance ( |
+1 to say this in the documentation and ensure that we have a test that would break otherwise.
Help wanted :smiley:
If this is still open, may I work on this? I am a first-time contributor @lorentzenchr.
take
Hi @lorentzenchr, I am not certain if this is the right place to ask. I am a first-time contributor. I love the library and it has helped me immensely in my studies so far. I was hoping to work on this issue as my first issue.
@sharyar This is the perfect place to ask. Glad to hear you like scikit-learn. Contributions are always welcome.
@lorentzenchr, you mean to create a pull request after I have updated the docstrings for the two functions, right?
@sharyar If you are still interested, the best thing is to open a PR very early. This way, we can help you there.
I also want this. I've had several cases where I need to know which groups are coming through at each GroupKFold iteration. I think the most general way to do this is to also return...Wait a minute. If I've got |
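One way to see which groups land in each fold, without changing the splitter, is to index the `groups` array with the indices that `split` already returns. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (assumption): four groups of unequal size.
groups = np.array([0, 0, 0, 1, 1, 2, 2, 3])
X = np.zeros((len(groups), 1))

gkf = GroupKFold(n_splits=2)
# For each fold, look up the distinct group IDs behind the test indices.
test_groups_per_fold = [
    np.unique(groups[test_idx]).tolist()
    for _, test_idx in gkf.split(X, groups=groups)
]
print(test_groups_per_fold)
```

The same lookup works for the training indices, so no extra return value from the splitter is needed for this purpose.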
@sharyar Are you still working on this? If not I'd like to make a pull request.
Describe the bug
GroupKFold builds folds with non-overlapping groups. The order of the groups seems random and not sorted by group ID. A naive user like me expects the groups to be ordered over the folds.
Steps/Code to Reproduce
Results show group ID of train and test folds:
Now we add a few samples to the first group (group ID = 0).
Result:
Expected Results
I expect the same (first) output for both:
Versions
python: 3.7.2
sklearn: 0.24.dev0 (current master as of today morning)