Describe the bug
GroupKFold builds folds with non-overlapping groups. The assignment of groups to folds seems random rather than sorted by group ID. A naive user like me expects the groups to be assigned to the folds in a fixed order based on group ID.
Steps/Code to Reproduce
import numpy as np
from sklearn.model_selection import GroupKFold
n_splits, n_samples, n_features = 3, 2, 2
X = np.arange(n_splits * n_samples * n_features).reshape(n_splits * n_samples, n_features)
groups = np.concatenate([np.full(n_samples, i) for i in range(n_splits)])
splits = list(GroupKFold(n_splits=n_splits).split(X, groups=groups))
[(np.unique(groups[train]), np.unique(groups[test])) for train, test in splits]
The results show the group IDs in the train and test folds:
[(array([0, 1]), array([2])),
(array([0, 2]), array([1])),
(array([1, 2]), array([0]))]
Now we add two more samples to the first group (group ID = 0) by duplicating its rows.
X = np.r_[X[:n_samples, :], X]
groups = np.r_[groups[:n_samples], groups]
splits = list(GroupKFold(n_splits=n_splits).split(X, groups=groups))
[(np.unique(groups[train]), np.unique(groups[test])) for train, test in splits]
Result:
[(array([1, 2]), array([0])),
(array([0, 1]), array([2])),
(array([0, 2]), array([1]))]
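A possible explanation for the change, sketched under an assumption: if GroupKFold balances folds by sample count (visiting groups from largest to smallest and placing each in the currently lightest fold), then growing group 0 changes which fold it lands in. The helper `greedy_group_to_fold` below is hypothetical and is not scikit-learn code; the tie-breaking via a stable sort is also an assumption.

```python
import numpy as np


def greedy_group_to_fold(groups, n_splits):
    """Hypothetical helper: map each group ID to a fold index, largest groups first."""
    unique_groups, counts = np.unique(groups, return_counts=True)
    # Visit groups from largest to smallest; the stable sort makes ties
    # deterministic (an assumption made for this sketch).
    order = np.argsort(counts, kind="stable")[::-1]
    fold_sizes = np.zeros(n_splits, dtype=int)
    group_to_fold = {}
    for idx in order:
        # Place the group in the fold that currently holds the fewest samples.
        lightest = int(np.argmin(fold_sizes))
        fold_sizes[lightest] += counts[idx]
        group_to_fold[int(unique_groups[idx])] = lightest
    return group_to_fold


equal_groups = np.array([0, 0, 1, 1, 2, 2])
bigger_first_group = np.array([0, 0, 0, 0, 1, 1, 2, 2])
mapping_equal = greedy_group_to_fold(equal_groups, n_splits=3)
mapping_bigger = greedy_group_to_fold(bigger_first_group, n_splits=3)
```

With equal group sizes this sketch puts group 2 in the first test fold, and with the enlarged group 0 it puts group 0 there, which is consistent with the two outputs above.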
Expected Results
I expect the same (first) output for both:
[(array([0, 1]), array([2])),
(array([0, 2]), array([1])),
(array([1, 2]), array([0]))]
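For comparison, here is a minimal sketch of a splitter whose fold assignment depends only on the sorted group IDs and not on group sizes. The function `ordered_group_folds` is a hypothetical helper, not part of scikit-learn, and it assigns groups in ascending ID order, so the fold order differs from the expected output above while still being identical for both inputs.

```python
import numpy as np


def ordered_group_folds(groups, n_splits):
    """Hypothetical helper: yield (train, test) indices, folds ordered by group ID."""
    groups = np.asarray(groups)
    unique_groups, inverse = np.unique(groups, return_inverse=True)
    if n_splits > len(unique_groups):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # The group with rank r among the sorted unique IDs goes to fold r % n_splits,
    # regardless of how many samples that group contains.
    fold_of_sample = inverse % n_splits
    indices = np.arange(len(groups))
    for fold in range(n_splits):
        test_mask = fold_of_sample == fold
        yield indices[~test_mask], indices[test_mask]


groups_small = np.array([0, 0, 1, 1, 2, 2])
groups_large = np.array([0, 0, 0, 0, 1, 1, 2, 2])
test_groups_small = [np.unique(groups_small[test]).tolist()
                     for _, test in ordered_group_folds(groups_small, 3)]
test_groups_large = [np.unique(groups_large[test]).tolist()
                     for _, test in ordered_group_folds(groups_large, 3)]
```

Both inputs yield test folds holding groups 0, 1 and 2 in that order. The trade-off is that folds are no longer balanced by sample count when group sizes differ.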
Versions
python: 3.7.2
sklearn: 0.24.dev0 (current master as of this morning)