Stratified GroupKFold #13621
@TomDLT @NicolasHug What do you think?
Might be interesting in theory, but I'm not sure how useful it'd be in practice. We can certainly keep the issue open and see how many people request this feature.
Do you assume that each group is in a single class?
See also #9413
@jnothman Yes, I had a similar thing in mind. However, I see that the pull request is still open. I meant that a group will not be repeated across folds. If we have ID as groups, then the same ID will not occur across multiple folds.
I understand this is relevant to use of RFECV. Grouping AND stratification are useful for quite imbalanced datasets with inter-record dependency. So: +1!
This would definitely be useful. For instance, when working with highly imbalanced time-series medical data, you want to keep patients separate while (approximately) balancing the imbalanced class in each fold. I have also found that StratifiedKFold takes groups as a parameter but doesn't group according to them, which should probably be flagged up.
Another good use of this feature would be financial data, which is usually very imbalanced. In my case, I have a highly imbalanced dataset with several records for the same entity (just different points in time). We want to do a
Also see #14524 I think?
Another use case for a stratified GroupShuffleSplit and GroupKFold is biological "repeated measures" designs, where you have multiple samples per subject or other parent biological unit. Also, in many real-world datasets in biology there is class imbalance. Each group of samples has the same class, so it's important to stratify and keep groups together.
Hi, I think it would be quite useful for medicine ML. Is it implemented already?
@amueller Do you think we should implement this, given that people are interested in it?
I'm very interested too... it would be really useful in spectroscopy when you have several replicate measurements for each of your samples; they really need to stay in the same fold during cross-validation. And if you have several unbalanced classes that you are trying to classify, you really want to use the stratify feature too. Therefore I vote for it too! Sorry, I'm not good enough to participate in the development, but for those who take part in it, you can be sure it will be used :-)
Please look at the referenced issues and PRs in this thread, as work has at least been attempted on this.
I think we should implement it, but I think I still don't know what we actually want. @hermidalc has a restriction that members of the same group must be of the same class. That's not the general case, right? It would be good if people who are interested could describe their use case and what they really want out of this. There are #15239, #14524 and #9413, which I remember all having different semantics.
@amueller totally agree with you, I spent a few hours today looking for something between the different versions available (#15239 #14524 and #9413) but couldn't really understand if any of these would fit my need. So here is my use case if it can help: This is why I believe a stratified (always my 6 classes represented in each fold) group (always keep the 3 replicate measures of each of my samples together) kfold seems to be very much what I am looking for here.
My use case and why I wrote up |
@fcoppey For you, the samples within a group always have the same class, right? @hermidalc I'm not very familiar with the terminology, but from Wikipedia it sounds like "repeated measures design" doesn't mean the same group must be within the same class, as it says "A crossover trial has a repeated measures design in which each patient is assigned to a sequence of two or more treatments, of which one may be a standard treatment or a placebo." Irrespective of the name, it sounds to me like you both have the same use case, while I was thinking about a case similar to what's described in the crossover study. Or maybe a bit simpler: you could have a patient become sick over time (or get better), so the outcome for a patient could change.
Actually, the Wikipedia article you link to explicitly says "Longitudinal analysis—Repeated measure designs allow researchers to monitor how participants change over time, both long- and short-term situations.", so I think that means changing the class is included.
@amueller yes, you're right, I realized I miswrote above: I meant to say "in my use cases of this design", not "in this use case in general". There can be many quite elaborate types of repeated measures designs, though for the two types I've needed, I needed something right away that works, so I wanted to put it out there for others to use and to get something started in sklearn. Plus, if I'm not mistaken, it's more complicated to design the stratification logic when within-group class labels can be different.
@amueller yes, always. They are replicates of the same measurement, in order to include the intra-variability of the device in the prediction.
@hermidalc yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that from the name it's somewhat clear what it does, and we should think about whether these two versions should live in the same class. It should be quite easy to make. GroupKFold, I think, heuristically trades off the two of them by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach. Should we also add GroupStratifiedKFold in the same PR? Or leave that for later?
+1 for separately handling the group constraint where all samples have the same class.
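The "adding to the smallest fold first" heuristic mentioned above can be sketched as follows. This is a simplified, hypothetical illustration of the greedy idea, not scikit-learn's actual GroupKFold code; the function name `greedy_group_folds` is made up for this sketch.

```python
from collections import defaultdict

def greedy_group_folds(groups, n_splits=3):
    """Assign each group to the currently smallest fold, largest groups first."""
    sizes = defaultdict(int)              # number of samples per group
    for g in groups:
        sizes[g] += 1
    fold_sizes = [0] * n_splits
    assignment = {}
    # Process the largest groups first so small groups can fill the gaps.
    for g, n in sorted(sizes.items(), key=lambda kv: -kv[1]):
        fold = fold_sizes.index(min(fold_sizes))
        assignment[g] = fold
        fold_sizes[fold] += n
    return assignment

assignment = greedy_group_folds(["a", "a", "a", "b", "b", "c", "d"], n_splits=2)
print(assignment)
```

With groups of sizes 3, 2, 1, 1 and two folds, this yields fold sample counts of 4 and 3; stratification would additionally need to weigh per-class counts when picking the fold, which is what the implementations later in this thread do.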
I'm not totally understanding this, a
I will add |
Very common use-case in medicine and biology when you have repeated measures. Below is an example implementation, inspired by this kaggle kernel.

```python
import numpy as np
from collections import Counter, defaultdict
from sklearn.utils import check_random_state


class RepeatedStratifiedGroupKFold():

    def __init__(self, n_splits=5, n_repeats=1, random_state=None):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def split(self, X, y=None, groups=None):
        k = self.n_splits

        def eval_y_counts_per_fold(y_counts, fold):
            y_counts_per_fold[fold] += y_counts
            std_per_label = []
            for label in range(labels_num):
                label_std = np.std(
                    [y_counts_per_fold[i][label] / y_distr[label]
                     for i in range(k)]
                )
                std_per_label.append(label_std)
            y_counts_per_fold[fold] -= y_counts
            return np.mean(std_per_label)

        rnd = check_random_state(self.random_state)
        for repeat in range(self.n_repeats):
            labels_num = np.max(y) + 1
            y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
            y_distr = Counter()
            for label, g in zip(y, groups):
                y_counts_per_group[g][label] += 1
                y_distr[label] += 1

            y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
            groups_per_fold = defaultdict(set)

            groups_and_y_counts = list(y_counts_per_group.items())
            rnd.shuffle(groups_and_y_counts)

            for g, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
                best_fold = None
                min_eval = None
                for i in range(k):
                    fold_eval = eval_y_counts_per_fold(y_counts, i)
                    if min_eval is None or fold_eval < min_eval:
                        min_eval = fold_eval
                        best_fold = i
                y_counts_per_fold[best_fold] += y_counts
                groups_per_fold[best_fold].add(g)

            all_groups = set(groups)
            for i in range(k):
                train_groups = all_groups - groups_per_fold[i]
                test_groups = groups_per_fold[i]

                train_indices = [i for i, g in enumerate(groups)
                                 if g in train_groups]
                test_indices = [i for i, g in enumerate(groups)
                                if g in test_groups]

                yield train_indices, test_indices
```

Comparing `RepeatedStratifiedKFold` and `RepeatedStratifiedGroupKFold`:

```python
import matplotlib.pyplot as plt
from sklearn import model_selection


def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=plt.cm.coolwarm,
                   vmin=-.2, vmax=1.2)

    ax.scatter(range(len(X)), [ii + 1.5] * len(X), c=y, marker='_',
               lw=lw, cmap=plt.cm.Paired)
    ax.scatter(range(len(X)), [ii + 2.5] * len(X), c=group, marker='_',
               lw=lw, cmap=plt.cm.tab20c)

    yticklabels = list(range(n_splits)) + ['class', 'group']
    ax.set(yticks=np.arange(n_splits + 2) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits + 2.2, -.2], xlim=[0, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)


# demonstration
np.random.seed(1338)
n_splits = 4
n_repeats = 5

# Generate the class/group data
n_points = 100
X = np.random.randn(100, 10)

percentiles_classes = [.4, .6]
y = np.hstack([[ii] * int(100 * perc)
               for ii, perc in enumerate(percentiles_classes)])

# Evenly spaced groups
g = np.hstack([[ii] * 5 for ii in range(20)])

fig, ax = plt.subplots(1, 2, figsize=(14, 4))

cv_nogrp = model_selection.RepeatedStratifiedKFold(n_splits=n_splits,
                                                   n_repeats=n_repeats,
                                                   random_state=1338)
cv_grp = RepeatedStratifiedGroupKFold(n_splits=n_splits,
                                      n_repeats=n_repeats,
                                      random_state=1338)

plot_cv_indices(cv_nogrp, X, y, g, ax[0], n_splits * n_repeats)
plot_cv_indices(cv_grp, X, y, g, ax[1], n_splits * n_repeats)

plt.show()
```
+1 for StratifiedGroupKFold. I am trying to detect falls of seniors, taking sensor data from the smart watch. Since we don't have much fall data, we do simulations with different watches that get different classes. I also do augmentations on the data before I train: from each data point I create 9 points, and this is a group. It is important that a group not be in both train and test, as explained.
I would like to be able to use StratifiedGroupKFold as well. I am looking at a dataset for predicting financial crises, where the years before, after and during each crisis is its own group. During training and cross-validation, members of each group should not leak between the folds. |
Is there any way to generalize this for the multilabel scenario (Multilabel_
+1 for this. We're analyzing user accounts for spam, so we want to group by user, but also stratify because spam is relatively low-incidence. For our use case, any user who spams once is flagged as a spammer in all data, so a group member will always have the same label. |
@hermidalc Hope your PhD work has been successful! Cheers
@bfeeny @dispink it's very easy to use the two classes I wrote above. Create a file, e.g.:

```python
from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


class RepeatedStratifiedGroupKFold(_RepeatedSplits):
    """Repeated Stratified K-Fold cross validator.

    Repeats Stratified K-Fold with non-overlapping groups n times with
    different randomization in each repetition.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    n_repeats : int, default=10
        Number of times cross-validator needs to be repeated.

    random_state : int or RandomState instance, default=None
        Controls the generation of the random states for each repetition.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import RepeatedStratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
    ...                                   random_state=36851234)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
    TRAIN: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
     TEST: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
    TRAIN: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]
     TEST: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
    TRAIN: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
     TEST: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]

    Notes
    -----
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting `random_state`
    to an integer.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        super().__init__(StratifiedGroupKFold, n_splits=n_splits,
                         n_repeats=n_repeats, random_state=random_state)
```
@hermidalc Thank you for the positive reply! Sincerely |
To test I made the file and ran:

```python
In [6]: import numpy as np
   ...: from split import StratifiedGroupKFold
   ...:
   ...: X = np.ones((17, 2))
   ...: y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
   ...: cv = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=777)
   ...: for train_idxs, test_idxs in cv.split(X, y, groups):
   ...:     print("TRAIN:", groups[train_idxs])
   ...:     print("      ", y[train_idxs])
   ...:     print(" TEST:", groups[test_idxs])
   ...:     print("      ", y[test_idxs])
   ...:
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]
```
There does seem to be regular interest in this feature, @hermidalc, and we could likely find someone to finish it off if you didn't mind.
@hermidalc 'You have to make sure that every sample in the same group has the same class label.' Obviously that's the problem. My samples in the same group don't share the same class. Mmm...it seems to be another branch of development. |
Actually @dispink I was wrong, this algorithm does not require that all members of a group belong to the same class. For example:

```python
In [2]: X = np.ones((17, 2))
   ...: y = np.array([0, 2, 1, 1, 2, 0, 0, 1, 2, 1, 1, 1, 0, 2, 0, 1, 0])
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
   ...: cv = StratifiedGroupKFold(n_splits=3)
   ...: for train_idxs, test_idxs in cv.split(X, y, groups):
   ...:     print("TRAIN:", groups[train_idxs])
   ...:     print("      ", y[train_idxs])
   ...:     print(" TEST:", groups[test_idxs])
   ...:     print("      ", y[test_idxs])
   ...:
TRAIN: [1 1 2 2 3 3 3 4 8 8]
       [0 2 1 1 2 0 0 1 1 0]
 TEST: [5 5 5 5 6 6 7]
       [2 1 1 1 0 2 0]
TRAIN: [1 1 4 5 5 5 5 6 6 7 8 8]
       [0 2 1 2 1 1 1 0 2 0 1 0]
 TEST: [2 2 3 3 3]
       [1 1 2 0 0]
TRAIN: [2 2 3 3 3 5 5 5 5 6 6 7]
       [1 1 2 0 0 2 1 1 1 0 2 0]
 TEST: [1 1 4 8 8]
       [0 2 1 1 0]
```

So I'm not quite sure what is going on with your data, since even from your screenshots one cannot truly see what your data layout is and what might be happening. I would suggest you first reproduce the examples I showed here, to make sure it's not a scikit-learn version issue (since I'm using 0.22.2), and if you can reproduce them, then start from small parts of your data and test those. Using ~104k samples is difficult to troubleshoot.
@hermidalc Thank you for the reply! |
+1 |
Anyone mind if I pick this issue up? |
+1 |
Hi there, any news about this feature? I'm dealing with a project that requires this kind of folding and have missed it!
Totally necessary! |
It makes it really confusing that the documentation for StratifiedKFold and RepeatedStratifiedKFold includes groups as a parameter to the split function, but in reality, this parameter does not affect the splits in any way. Either the solutions in this thread should be merged into the existing classes (so the parameter actually does something), or there should be new classes (StratifiedGroupKFold and RepeatedStratifiedGroupKFold) and the useless group parameter should be taken out of the non-group classes. |
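To illustrate the complaint, here is a quick check (a sketch; the data is made up for demonstration) showing that `StratifiedKFold` accepts `groups` but still places members of the same group in both train and test:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 2, 1, 2, 1, 2, 1, 2])
X = np.zeros((len(y), 1))

cv = StratifiedKFold(n_splits=2)
for tr, te in cv.split(X, y, groups=groups):
    # Non-empty intersection: the same group appears in both train and
    # test, i.e. the `groups` argument is silently ignored.
    print(sorted(set(groups[tr]) & set(groups[te])))
```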
@hermidalc how can I use StratifiedGroupKFold with GridSearchCV? |
Like you would use any other CV iterator with
@hermidalc I think I solved both above with the following code:
Does this code seem right to you? In addition, and mainly, the code is extremely slow (perhaps because the number of unique values in the feature I'm stratifying by is pretty large, not 2 like in the examples above). Any ideas on how to make it more efficient? Thanks!
Adding to the discussion, I have a situation where I have
My intention is to split a dataset of radiographs, and to make sure samples of different age groups are represented in accordance with the original distribution.
If I'm understanding correctly and you are converting |
It would be optimal if y can be a code word ( |
I would review https://scikit-learn.org/stable/modules/multiclass.html. Stratified CV iterators in scikit-learn work with multiclass problems. So for any such dataset, each sample should have a class label from |
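If the labels in question aren't already integer-encoded, a common approach (a generic sketch, not from this thread; the age-bin strings are invented for illustration) is `LabelEncoder`, which maps arbitrary labels to `0..n_classes - 1`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

ages = np.array(["0-20", "21-40", "41-60", "21-40", "0-20"])
le = LabelEncoder()
y = le.fit_transform(ages)  # integer codes usable by stratified CV iterators
print(y)            # [0 1 2 1 0]
print(le.classes_)  # ['0-20' '21-40' '41-60']
```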
Sure, I can and have added one additional column to the metadata. |
Description
Currently sklearn does not have a stratified group kfold feature. Either we can use stratification or we can use group kfold. However, it would be good to have both.
I would like to implement it, if we decide to have it.