Balanced/Weighted Sampling #6568


Open · anjishnu opened this issue Mar 21, 2016 · 12 comments

@anjishnu

Many classification applications need to deal with skewed input data. For several recent projects I've had to implement techniques to re-weight samples during training to get the best results. Ideally this could be supported generically by scikit-learn in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py

In my use case I was able to get significantly better results by assuming a uniform prior during training with the skewed labels, but it makes sense to have a generic way to add weights to the sampled training distribution for cases where researchers have good reason to incorporate a particular prior.

@MechCoder
Member

Unless I'm missing something obvious, if you want to maintain your class frequency across all folds, you can use the StratifiedKFold object.

If you want to reweight your samples on the basis of classes during training, it depends on the specific model, but a good number of them support class_weight already, for instance LogisticRegression (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L496).
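A minimal sketch of both suggestions, assuming the modern import paths (at the time of this issue, StratifiedKFold still lived in sklearn.cross_validation):

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Skewed toy data: roughly 90% of the samples fall in class 0.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# StratifiedKFold preserves the (skewed) class frequencies in every fold.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # class_weight='balanced' reweights samples inversely to class
    # frequency, i.e. it effectively assumes a uniform prior over classes.
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(X[train_idx], y[train_idx])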

@anjishnu
Author

As you pointed out, StratifiedKFold maintains class frequency, which is sort of the opposite of this use case. The class-weights case is what I'm going for, but doesn't it make more sense to support this in the data-processing/sampling step rather than reimplementing it in each model, which could make each implementation more complex?

In particular, SGD-based non-linear models (*cough* neural nets *cough*) can be quite sensitive to the priors of your sampling. This is not going to be as useful for a random forest or an SVM, so to some extent it depends on how scikit-learn plans to approach deep learning in the future.

@agramfort
Member
agramfort commented Mar 21, 2016 via email

@anjishnu
Author

Yes, that PR should cover my use case.

@MechCoder
Member

@agramfort @anjishnu What are your thoughts on having a new class LabelResampler, which takes a method parameter that accepts a class dict as a prior and does the sampling at transform time? This would be convenient to use in a Pipeline, and it would let you grid-search across various class priors as parameters.

On the downside, since the current API does not allow changing the sample size during transform, we cannot oversample or undersample. But I think that is okay, because bootstrap approaches maintain the sample size, right?

@anjishnu
Author

I don't fully understand the behavior of this function you're describing. Are you saying that it would have a fixed-size output? I think it would be relatively trivial to use it in conjunction with a separate sampler, but what are the advantages of this approach vis-à-vis #1454 or adding it as a KFold subclass?

@MechCoder
Member

Sorry for the delayed reply. I did not mean a function but a class, ClassResampler. This would be similar to the custom label distribution in #1454.

The use case would be to tune the class prior as a hyperparameter. For example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import ClassResampler  # proposed class; does not exist in scikit-learn
from sklearn.linear_model import SGDClassifier
from sklearn.grid_search import GridSearchCV  # moved to sklearn.model_selection in later releases

# Candidate class priors to grid-search over.
param_grid = {'resampler__method': [{1: 0.9, 0: 0.1}, {1: 0.3, 0: 0.7}]}
pipeline = Pipeline([('resampler', ClassResampler()), ('sgd', SGDClassifier())])
gscv = GridSearchCV(pipeline, param_grid)
gscv.fit(X, y)

In this way, you can get your best class_prior. Is that a common use case?

@anjishnu
Author
anjishnu commented Apr 1, 2016

Let me go through the pipeline code to understand it better. In general, yes, it seems useful. In practice, though, I've found that keeping the overall sample size constant while resampling to a prior is detrimental to the majority class in certain cases (if it has a lot of variance).

That seems to be the main flaw with this approach. In most of the use cases I run into, I'd end up artificially inflating the input before passing it through the resampler, to make sure that the majority class is fully represented even after resampling. It would be nice to have that taken care of by scikit-learn as well, but the preprocessing to get my intended result is simple enough that this takes care of 90% of the use case.
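One hypothetical sketch of that inflate-before-resampling workaround: oversample every minority class up to the majority count, so that a size-preserving uniform resample no longer has to discard majority samples. The helper name and the choice of random oversampling with replacement are illustrative only, not anything scikit-learn provides:

import numpy as np

def inflate_minority(X, y, random_state=None):
    """Oversample every minority class to the majority count so that a
    size-preserving uniform-prior resample can keep all majority samples."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        c_idx = np.where(y == c)[0]
        if n == n_max:
            parts.append(c_idx)  # keep the majority class intact
        else:
            parts.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(parts)
    return X[idx], y[idx]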

@amueller
Member

ClassResampler cannot exist with our current pipeline, as you cannot change y. @anjishnu also check out https://github.com/scikit-learn-contrib/imbalanced-learn, which hacked the pipeline to make this possible.
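For reference, a short example with imbalanced-learn's Pipeline variant, which accepts samplers that change both X and y at fit time (current imbalanced-learn API; the dataset is assumed to be any skewed classification problem):

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import SGDClassifier

pipe = Pipeline([
    ('resample', RandomUnderSampler(random_state=0)),  # applied at fit time only
    ('sgd', SGDClassifier()),
])
pipe.fit(X, y)  # resamples the training data, then fits the classifier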

@jnothman
Member

(But you can create a class-resampling CV splitter...)
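A hypothetical sketch of such a splitter, wrapping StratifiedKFold and re-drawing each training fold to a requested class prior. The class name and the prior dict are illustrative only:

import numpy as np
from sklearn.model_selection import StratifiedKFold

class PriorResamplingKFold:
    def __init__(self, prior, n_splits=5, random_state=None):
        self.prior = prior  # e.g. {0: 0.5, 1: 0.5}
        self.cv = StratifiedKFold(n_splits=n_splits)
        self.rng = np.random.default_rng(random_state)

    def split(self, X, y, groups=None):
        y = np.asarray(y)
        for train, test in self.cv.split(X, y):
            # Resample the training indices (with replacement) so the label
            # distribution matches the requested prior; the test fold keeps
            # the natural distribution.
            p = np.array([self.prior[label] for label in y[train]])
            train = self.rng.choice(train, size=len(train), p=p / p.sum())
            yield train, test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.cv.get_n_splits(X, y, groups)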


@amueller
Member

(or a meta-estimator ;) - and reference #5972 for your idea...
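And a hypothetical sketch of the meta-estimator variant: resample (X, y) to a requested prior inside fit, then delegate to a wrapped estimator. Because the prior is a constructor parameter, this also composes with the plain Pipeline and GridSearchCV, as in the example above. Nothing here is actual scikit-learn API:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class ResampledClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, estimator, prior=None, random_state=None):
        self.estimator = estimator
        self.prior = prior  # e.g. {0: 0.5, 1: 0.5}; None means no resampling
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        if self.prior is not None:
            # Draw a same-size bootstrap of the training set whose label
            # distribution follows the requested prior.
            rng = np.random.default_rng(self.random_state)
            p = np.array([self.prior[label] for label in y])
            idx = rng.choice(len(y), size=len(y), p=p / p.sum())
            X, y = X[idx], y[idx]
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)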

@cmarmo
Contributor
cmarmo commented Dec 10, 2021

See also #13269.
