8000 Clarify group order in GroupKFold and LeaveOneGroupOut · Issue #18338 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Clarify group order in GroupKFold and LeaveOneGroupOut #18338

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lorentzenchr opened this issue Sep 4, 2020 · 16 comments · Fixed by #22582
Closed

Clarify group order in GroupKFold and LeaveOneGroupOut #18338

lorentzenchr opened this issue Sep 4, 2020 · 16 comments · Fixed by #22582
Assignees
Labels
Documentation Easy Well-defined and straightforward way to resolve module:test-suite everything related to our tests

Comments

@lorentzenchr
Copy link
Member

Describe the bug

GroupKFold builds folds with non-overlapping groups in it. The order of the groups seems random and not sorted by group ID. A naiv user like me expects the groups to be ordered over the folds.

Steps/Code to Reproduce

import numpy as np
from sklearn.model_selection import GroupKFold


n_splits, n_samples, n_features = 3, 2, 2
X = np.arange(n_splits * n_samples * n_features).reshape(n_splits * n_samples, n_features)
groups = np.concatenate([np.full(n_samples, i) for i in range(n_splits)])
splits = list(GroupKFold(n_splits=n_splits).split(X, groups=groups))
[(np.unique(groups[train]), np.unique(groups[test])) for train, test in splits]

Results show group ID of train and test folds:

[(array([0, 1]), array([2])),
 (array([0, 2]), array([1])),
 (array([1, 2]), array([0]))]

Now we add a few samples to the first group (group ID = 0).

X = np.r_[X[:n_samples, :], X]
groups = np.r_[groups[:n_samples], groups]
splits = list(GroupKFold(n_splits=n_splits).split(X, groups=groups))
[(np.unique(groups[train]), np.unique(groups[test])) for train, test in splits]

Result:

[(array([1, 2]), array([0])),
 (array([0, 1]), array([2])),
 (array([0, 2]), array([1]))]

Expected Results

I expect the same (first) output for both:

[(array([0, 1]), array([2])),
 (array([0, 2]), array([1])),
 (array([1, 2]), array([0]))]

Versions

python: 3.7.2
sklearn: 0.24.dev0 (current master as of today morning)

@agramfort
Copy link
Member
8000

what cv objects should guarantee is that order does not change every time you call split. I am more wondering why you have no shuffle and random_state parameters in GroupKFold to match the KFold API.

unless the need you report is quite common I would write my own code / private function/object to do what you suggest

@jnothman
Copy link
Member
jnothman commented Sep 7, 2020

GroupKFold is a deterministic packing algorithm. It's not possible to shuffle except among groups of equal size, and to shuffle samples within groups. This has surprised several users. We should either enhance the documentation, or add an option to shuffle in whichever ways possible, and then explicitly document the limitations.

@agramfort
Copy link
Member

ok got it makes sense.

@lorentzenchr for your use in linear models I would suggest you try: LeaveOneGroupOut

I think it does what you want here.

@lorentzenchr
Copy link
Member Author

@agramfort LeaveOneGroupOut does indeed preserve the order of the groups if samples are added as in the code snippet above.

I wonder if that is by chance or behaviour that can be relied upon.

@agramfort
Copy link
Member
agramfort commented Sep 9, 2020 via email

@lorentzenchr
Copy link
Member Author

Of course, it is not by chance (LeaveOneGroupOut._iter_test_masks as compared to GroupKFold._iter_test_indices). But it is also not mentioned in the docs. So my question should have better been: Is it something that we want to guarantee and mention in the documentation or not?

@agramfort
Copy link
Member
agramfort commented Sep 10, 2020 via email

@cmarmo cmarmo added the module:test-suite everything related to our tests label Sep 15, 2020
@lorentzenchr lorentzenchr changed the title BUG Group order of GroupKFold Clearify group order in GroupKFold and LeaveOneGroupOut Oct 4, 2020
@lorentzenchr
Copy link
Member Author

+1 to say this in the documentation and ensure that we have test
that would break otherwise

Help wanted:smiley:

@lorentzenchr lorentzenchr added the Easy Well-defined and straightforward way to resolve label Jan 24, 2021
@sharyar
Copy link
sharyar commented Feb 9, 2021

If this is still open, may I work on this? I am a first-time contributor @lorentzenchr.

@sharyar
Copy link
sharyar commented Feb 9, 2021

take

@sharyar
Copy link
sharyar commented Feb 9, 2021

Hi @lorentzenchr, I am not certain if this is the right place to ask. I am a first-time contributor. I love the library and it has helped me immensely in my studies so far. I was hoping to work on this issue as my first issue.
As far as I can understand, this issue requires that the documentation be updated, does that indicate the docstring within the function definition only, or is that referring to another piece of documentation?
You also mentioned ensuring there are tests that break if this documentation doesn't exist, how do I go about doing that effectively?

8000

@lorentzenchr
Copy link
Member Author

@sharyar This is the perfect place to ask. Glad to hear you like scikit-learn. Contributions are always welcome.
I would start with opening a pull request (PR) and extend to docstrings of the classes GroupKFold and LeaveOneGroupOut. After that, we can figure out which test makes most sense, maybe based on my code snipped above.

@sharyar
Copy link
sharyar commented Feb 12, 2021

@lorentzenchr, you mean to create a pull request after I have updated the docstrings for the two functions right?

@lorentzenchr
Copy link
Member Author

@sharyar If you are still interested, the best thing is to open a PR very early. This way, we can help you there.

@pavelkomarov
Copy link

I also want this. I've had several cases where I need to know which groups are coming through at each GroupKFold iteration. I think the most general way to do this is to also return...Wait a minute. If I've got X, y, groups as numpy arrays, and .split returns me indices, I can see which groups things belong to by indexing groups[train_ndx] and groups[test_ndx].

@SamAdamDay
Copy link
Contributor

@sharyar Are you still working on this? If not I'd like to make a pull request.

@lesteve lesteve changed the title Clearify group order in GroupKFold and LeaveOneGroupOut Clarify group order in GroupKFold and LeaveOneGroupOut Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Easy Well-defined and straightforward way to resolve module:test-suite everything related to our tests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants
0