[WIP] LabelKFold: balance folds without sorting by andreasvc · Pull Request #5396 · scikit-learn/scikit-learn · GitHub

[WIP] LabelKFold: balance folds without sorting #5396

Closed
andreasvc wants to merge 5 commits

Conversation

andreasvc
Contributor

This changes LabelKFold so that the original or shuffled order of
samples is reflected in the folds. Instead of sorting the labels by
frequency, balance is achieved just by looking at the smallest
fold at each iteration.

This means shuffling has an effect beyond tie breaking, and the order of
samples can be used as a simple way of achieving stratification.

Closes #5390; see also #5300
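
For illustration, a minimal sketch of the idea (hypothetical helper name, not the code in this patch): walk over the labels in their given, possibly shuffled, order and put each new label, together with all of its samples, into whichever fold is currently smallest.

```python
import numpy as np

def greedy_label_folds(labels, n_folds):
    """Sketch: assign each label, in order of first appearance,
    to the currently smallest fold."""
    labels = np.asarray(labels)
    counts = dict(zip(*np.unique(labels, return_counts=True)))
    fold_sizes = np.zeros(n_folds, dtype=int)
    label_to_fold = {}
    for label in labels:                      # original or shuffled order
        if label not in label_to_fold:
            fold = int(np.argmin(fold_sizes))
            label_to_fold[label] = fold
            fold_sizes[fold] += counts[label]
    # fold index of every sample, in the original sample order
    return np.array([label_to_fold[label] for label in labels])

labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
print(greedy_label_folds(labels, n_folds=2))  # [0 0 0 1 1 1 1 1 1 0]
```

Because the labels are visited in sample order, shuffling the samples changes which labels end up in the same fold, and pre-ordering the samples can be used to steer the composition of the folds.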

@andreasvc andreasvc changed the title LabelKFold: balance folds without sorting [WIP] LabelKFold: balance folds without sorting Oct 13, 2015
@andreasvc andreasvc changed the title [WIP] LabelKFold: balance folds without sorting LabelKFold: balance folds without sorting Oct 14, 2015
@JeanKossaifi
Contributor

This heuristic is better when shuffling, but it can also create very uneven folds, which can be problematic for small datasets.

For instance, if you take a very easy case like labels = [1]*5 + [2]*5 + [3]*10 and n_folds = 2, you get one fold three times as big as the other (15 samples versus 5), while the original method gives perfectly even folds.

Maybe we could switch between methods depending on whether shuffling is needed or not?
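
For reference, a rough reconstruction of the existing strategy (not the actual LabelKFold code): sort the label groups by size, largest first, and always pour the next group into the lightest fold. On the example above this yields two folds of 10 samples each, whereas an order-preserving assignment has to put the large group of label 3 on top of label 1 or label 2.

```python
import numpy as np

def frequency_sorted_fold_sizes(labels, n_folds):
    """Sketch of the frequency-sorted balancing: process label groups
    from largest to smallest, always filling the lightest fold."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    fold_sizes = np.zeros(n_folds, dtype=int)
    for idx in np.argsort(counts)[::-1]:      # largest groups first
        fold_sizes[np.argmin(fold_sizes)] += counts[idx]
    return fold_sizes

labels = [1] * 5 + [2] * 5 + [3] * 10
print(frequency_sorted_fold_sizes(labels, n_folds=2))  # [10 10]
```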

@andreasvc
Contributor Author

I see your point, but it only holds for very small datasets, and I wonder whether such datasets are relevant for machine learning; to me they seem too small to learn from. But if I'm wrong, I would suggest adding the old behavior back via a balance=True parameter.

@glouppe
Contributor
glouppe commented Oct 15, 2015

I see your point, but it only holds for very small datasets, and I wonder whether such datasets are relevant for machine learning; to me they seem too small to learn from. But if I'm wrong, I would suggest adding the old behavior back via a balance=True parameter.

Scale @JeanKossaifi's example by a factor of 1000 and it is no longer a too-small dataset, yet the issue is still there.

@andreasvc
Contributor Author

@glouppe if you let the number of labels scale as well, the problem doesn't occur. But indeed, if you don't, you get a pathological case. That's why I specifically wondered whether those cases are relevant in practice. Initially I decided to remove the old balancing behavior to keep the code simple, but I can add it back as a parameter if there's consensus that it's worth it.

@glouppe glouppe added the Bug label Oct 19, 2015
@glouppe
Contributor
glouppe commented Oct 19, 2015

I can see this happening, e.g., in situations where you collect the same number of samples per patient and then want to do k-fold CV without having samples from the same patient in both train and test. Is that really a pathological case?

In the absence of consensus, maybe the simplest fix is to remove shuffling from this class? (Better to remove it than to ship a semi-broken behaviour with the release. CC: @ogrisel @amueller)

@GaelVaroquaux
Member
GaelVaroquaux commented Oct 19, 2015 via email

@glouppe
Contributor
glouppe commented Oct 19, 2015

Okay, I am on it.

@glouppe glouppe changed the title LabelKFold: balance folds without sorting [WIP] LabelKFold: balance folds without sorting Oct 20, 2015
@andreasvc
Contributor Author

I can see this happening, e.g., in situations where you collect the same number of samples per patient and then want to do k-fold CV without having samples from the same patient in both train and test. Is that really a pathological case?

It's a problematic case as soon as the number of labels gets close to the number of folds; I'm saying that this is unusual, but I can see it's debatable.

I'd still like to have the shuffling and the non-balanced deterministic folds merged. Should I make a new pull request with both the old balancing and the new shuffling? And if so, would it be preferable to have a single class with parameters, or two different classes?

@JeanKossaifi
Contributor

There are many cases where the dataset is small (expert annotations are expensive and time-consuming, and data from patients/subjects can be rare or hard to obtain). In these cases we might want to prioritise balance of the folds over shuffling to prevent over-fitting (e.g., learning subject-dependent features), and shuffling might not be that important given that we usually do some kind of averaging across the folds.

In addition, the problem with that heuristic is that it can produce arbitrarily good or bad results. Increasing the number of labels doesn't solve the problem. Take, for instance, labels = [k for i in range(1, 30) for k in [i]*10*i] and n_folds=5.

@andreasvc
Contributor Author

The heuristic I implemented allows for both shuffling and stratification (by sorting the dataset beforehand). I'm sure it's useful because it solves a problem with a real-world dataset I'm working on. It is clear to me that balancing and stratification/shuffling are two conflicting goals when constructing cross-validation folds, so it makes sense to offer both.
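
As a sketch of that usage pattern (using the same made-up order-preserving helper as above, not a scikit-learn API): sort the samples by the class to stratify on before building the folds, so that the groups of each class get spread over the folds.

```python
import numpy as np

def greedy_label_folds(labels, n_folds):
    """Order-preserving assignment: each new label goes, with all of
    its samples, into the currently smallest fold."""
    labels = np.asarray(labels)
    counts = dict(zip(*np.unique(labels, return_counts=True)))
    fold_sizes = np.zeros(n_folds, dtype=int)
    label_to_fold = {}
    for label in labels:
        if label not in label_to_fold:
            fold = int(np.argmin(fold_sizes))
            label_to_fold[label] = fold
            fold_sizes[fold] += counts[label]
    return np.array([label_to_fold[label] for label in labels])

# Ten groups of four samples each; 7 groups of class 0 and 3 of class 1,
# interleaved in the original sample order.
groups = np.repeat(np.arange(10), 4)
group_class = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
y = group_class[groups]

# Sort the samples by class, assign folds in that order, then map the
# fold ids back to the original sample order.
order = np.argsort(y, kind="stable")
folds = np.empty_like(groups)
folds[order] = greedy_label_folds(groups[order], n_folds=2)

for k in range(2):
    print(k, np.bincount(y[folds == k]))  # both folds contain both classes
```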

@amueller amueller added this to the 0.19 milestone Oct 27, 2016
@amueller
Member
amueller commented Dec 7, 2016

This has been dormant for a while but seems like a good feature to have. It needs to be implemented in model_selection/_split now, though. Do you want to try that?

@amueller amueller removed the Bug label Dec 7, 2016
@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 14, 2017
@jnothman
Member

I'm assuming @andreasvc is no longer working on this and am marking it as Need Contributor.

@andreasvc
Contributor Author

Superseded by #9413

@andreasvc andreasvc closed this Jul 20, 2017
@jnothman jnothman mentioned this pull request Apr 15, 2019