[WIP] LabelKFold: balance folds without sorting #5396
This changes LabelKFold so that the original or shuffled order of samples is reflected in the folds. Instead of sorting the labels by frequency, balance is achieved just by looking at the smallest fold at each iteration. This means shuffling has an effect beyond tie breaking, and the order of samples can be used as a simple way of achieving stratification. Closes scikit-learn#5390; see also scikit-learn#5300
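The description above can be sketched roughly as follows. This is a minimal illustration of the idea as described, not the code in this pull request; the function name `assign_folds_in_order` and the tie-breaking rule (lowest fold index wins) are assumptions made for the example.

```python
from collections import Counter


def assign_folds_in_order(labels, n_folds):
    """Assign each label to the currently smallest fold, in order of first appearance.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    counts = Counter(labels)      # number of samples per label
    fold_sizes = [0] * n_folds    # running number of samples in each fold
    label_to_fold = {}
    for label in labels:          # walk samples in their given order, no sorting
        if label not in label_to_fold:
            # Assumption: ties are broken in favour of the lowest fold index.
            smallest = fold_sizes.index(min(fold_sizes))
            label_to_fold[label] = smallest
            fold_sizes[smallest] += counts[label]
    return [label_to_fold[label] for label in labels]


labels = ["a", "a", "b", "c", "c", "c", "d"]
folds = assign_folds_in_order(labels, n_folds=2)
print(folds)
```

Because labels are visited in sample order, shuffling the samples beforehand changes which labels end up together, rather than only breaking ties.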
This heuristic is better when shuffling, but it can also create very uneven folds, which can be problematic for small datasets. For instance if you take a very easy case like … Maybe we could switch between methods depending on whether shuffling is needed or not?
I see your point, but it only holds for very small datasets, and I wonder whether such datasets are relevant for machine learning; to me they seem too small to learn from. But if I'm wrong, I would suggest adding the old behavior with a
Scale @JeanKossaifi's example by a factor of 1000 and it is no longer too small a dataset. Yet the issue is still there.
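To make the concern concrete, here is a hypothetical case (not the example from the comment above, which is not preserved in this thread): three groups of sizes 2, 2 and 4, appearing in that order, split into 2 folds. The helper `greedy_fold_sizes` is an assumed name used only for this sketch.

```python
def greedy_fold_sizes(group_sizes, n_folds):
    """Put each group into the currently smallest fold and return the fold sizes.

    Hypothetical helper for illustration only.
    """
    fold_sizes = [0] * n_folds
    for size in group_sizes:
        smallest = fold_sizes.index(min(fold_sizes))  # ties go to the lowest index
        fold_sizes[smallest] += size
    return fold_sizes


group_sizes = [2, 2, 4]  # made-up group sizes, in order of appearance

# Order of appearance, as proposed in this pull request.
in_order = greedy_fold_sizes(group_sizes, 2)

# Largest groups first, as in the frequency-sorted behaviour.
by_frequency = greedy_fold_sizes(sorted(group_sizes, reverse=True), 2)

# Same groups with 1000 times as many samples each.
scaled = greedy_fold_sizes([1000 * s for s in group_sizes], 2)

print(in_order, by_frequency, scaled)
```

In this made-up case the order-of-appearance assignment gives folds of 6 and 2 samples, while sorting by frequency gives 4 and 4. Multiplying every group size by 1000 keeps the same 3:1 ratio, since only the number of groups relative to the number of folds matters.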
@glouppe if you scaled the number of labels as well, the problem wouldn't occur. But indeed, if you don't, you get a pathological case. That's why I specifically wondered whether such cases are relevant in practice. Initially I decided to remove the old balancing behavior to keep the code simple, but I can add it back as a parameter if there's consensus that it's worth it.
I can see this happening, e.g., in situations where you collect the same number of samples per patient and then want to do k-fold without having samples from the same patient in both train and test. Is that really a pathological case? Absent a consensus, maybe the simplest is to remove shuffling from this class? (Better to remove it than to ship semi-broken behaviour with the release. CC: @ogrisel @amueller)
@ogrisel is telling me (orally, crazy!) that he thinks that shuffling should be removed for 0.17. I agree with him.
Okay, I am on it.
It's a problematic case as soon as the number of labels gets close to the number of folds. I'm saying this is unusual, but I can see it's debatable. I'd still like to have the shuffling and the non-balanced deterministic folds merged. Should I make a new pull request with both the old balancing and the new shuffling? And if so, would it be preferable to have a single class with parameters, or two different classes?
There are many cases where the dataset is small (expert annotations are expensive and time-consuming, and data from patients/subjects can be rare or hard to obtain). In these cases we might want to prioritise balance of the folds over shuffling to prevent over-fitting (e.g. learning subject-dependent features), and shuffling might not be that important given that we usually do some kind of averaging across the folds. In addition, the problem with that heuristic is that it can produce arbitrarily good or bad results. Increasing the number of labels doesn't solve the problem. Take for instance
The heuristic I implemented allows for both shuffling and stratification (by sorting the dataset beforehand). I'm sure it's useful because it solves a problem with a real-world dataset I'm working on. It is clear to me that balancing and stratification/shuffling are two conflicting goals when constructing cross-validation folds, so it makes sense to offer both.
This has been dormant for a while but seems like a good feature to have. It needs to be implemented in
I'm assuming @andreasvc is no longer working on this and am marking it as Need Contributor.
Superseded by #9413