8000 DOC/TST Clarify group order in GroupKFold and LeaveOneGroupOut by SamAdamDay · Pull Request #22582 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

DOC/TST Clarify group order in GroupKFold and LeaveOneGroupOut #22582

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 6, 2022

Conversation

SamAdamDay
Copy link
Contributor

Reference Issues/PRs

Fixes #18338

What does this implement/fix? Explain your changes.

  • I have added notes to the docstrings of GroupKFold and LeaveOneGroupOut explaining that in the former groups are distributed in an arbitrary order between folds, while in the latter the splits are ordered according to the index of the group left out.
  • I added a test to make sure that LeaveOneGroupOut does indeed order the splits in this way.

Any other comments?

  • Concerning the order in GroupKFold:
    • The order of the groups in the folds calculated by GroupKFold depends on the numpy implementation of argsort. (In GroupKFold line 526 we take the indices which sort the array of group sizes. Two groups of the same size may be swapped if we don't use a stable sorting algorithm.)
    • This was implementation was actually changed in numpy version 1.12.0, from a quicksort to introsort.
    • We could make this stable across numpy versions by using a stable sorting algorithm (e.g. we can just pass kind="stable" to argsort)
  • Comparing to other cross validators in _split.py:
    • It might be good to mention the default ordering in the docstrings of other cvs.
    • Other cvs have a shuffle option. In Clarify group order in GroupKFold and LeaveOneGroupOut #18338 one possible suggestion is with respect to GroupKFold is:

      add an option to shuffle in whichever ways possible, and then explicitly document the limitations.

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

@@ -1138,6 +1142,11 @@ class LeaveOneGroupOut(BaseCrossValidator):
[3 4]] [[5 6]
[7 8]] [1 2] [1 2]

Notes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this section above the example section

@@ -496,6 +496,10 @@ class GroupKFold(_BaseKFold):
[7 8]] [[1 2]
[3 4]] [3 4] [1 2]

Notes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same for this section

X = np.ones(len(groups))

splits = iter(LeaveOneGroupOut().split(X, groups=groups))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make a small for loop here.

    expected_indices = [
        ([0, 1, 4, 5], [2, 3]),
        ([0, 1, 2, 3], [4, 5]),
        ([2, 3, 4, 5], [0, 1]),
    ]
    for expected_train, expected_test in expected_indices:
        train, test = next(splits)
        assert_array_equal(train, expected_train)
        assert_array_equal(test, expected_test)

@SamAdamDay
Copy link
Contributor Author

Thanks for your comments. I've made the changes you suggest.

Just to confirm, do we not want to make the group ordering in GroupKFold stable for all future versions by using a stable sorting algorithm?

@glemaitre
Copy link
Member

I assume that this is fine for the moment. At least the grouping is deterministic. I don't know if ordering is enough to change the behaviour for future versions.

We can merge this PR for the moment and revisit the issue of ordering in another one.

@glemaitre glemaitre changed the title Clarify group order in GroupKFold and LeaveOneGroupOut DOC/TST Clarify group order in GroupKFold and LeaveOneGroupOut May 6, 2022
@glemaitre glemaitre merged commit 3275bea into scikit-learn:main May 6, 2022
@glemaitre
Copy link
Member

Thanks @SparklePigBang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify group order in GroupKFold and LeaveOneGroupOut
3 participants
0