Balanced/Weighted Sampling #6568
Unless I'm missing something obvious: if you want to maintain your class frequency across all folds, you can use the StratifiedKFold cross-validator. If you want to reweight your samples on the basis of classes during training, it depends on the specific model, but a good number of them support class_weight.
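For reference, a minimal sketch of both existing options using the current API (the toy data and the choice of `class_weight='balanced'` are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data: roughly a 90% / 10% class split.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# StratifiedKFold keeps the 90/10 ratio inside every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# class_weight='balanced' reweights classes inversely to their frequency
# during training, without changing the data itself.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
print(cross_val_score(clf, X, y, cv=cv, scoring="f1"))
```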
As you pointed out, StratifiedKFold maintains class frequency, which is more or less the opposite of this use case. The class weights case is what I'm going for - but doesn't it make more sense to support this in the data-processing/sampling stage rather than reimplementing it in each model, which could make those implementations more complex? In particular, SGD-based non-linear models (coughcoughneuralnets) can be quite sensitive to the priors over your sampling - this is not going to be as useful for a random forest or an SVM - so to some extent it depends on how scikit-learn plans to approach deep learning in the future.
Are you talking about #1454?
Yes, that PR should cover my use case.
@agramfort @anjishnu What are your thoughts about having a new class for this? On the downside, since the current API does not allow changing the sample size during transform, such a class could not be used directly inside a Pipeline.
I don't fully understand the behavior of this function you're describing - are you saying that it would have a fixed-size output? I think it would be relatively trivial to use it in conjunction with a separate sampler - but what are the advantages of this approach vis-à-vis #1454 or adding it as a KFold subclass?
Sorry for the delayed reply. I did not mean a function but a class. The use case would be to tune the class prior as a hyperparameter. For example (a rough sketch of the idea below):
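Something along these lines, spelled out by hand since no such resampler class exists in scikit-learn yet (`resample_to_prior` and the candidate priors are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Skewed toy data, roughly 90% class 0.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

def resample_to_prior(y, prior, n_samples, rng):
    """Return indices drawn with replacement so the labels follow `prior`."""
    idx = []
    for cls, p in zip(np.unique(y), prior):
        cls_idx = np.flatnonzero(y == cls)
        idx.append(rng.choice(cls_idx, size=int(round(p * n_samples)), replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for prior in ([0.5, 0.5], [0.7, 0.3], [0.9, 0.1]):   # candidate class priors
    scores = []
    for train, test in cv.split(X, y):
        res = resample_to_prior(y[train], prior, len(train), rng)
        clf = SVC().fit(X[train][res], y[train][res])
        scores.append(f1_score(y[test], clf.predict(X[test])))
    print(prior, np.mean(scores))
```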
In this way, you can get your best class_prior. Is that a common use case?
Let me go through the pipeline code to understand it better. In general, yes, it seems useful. In practice I've found that holding the overall sample size constant while resampling to a new prior is detrimental to the majority class in certain cases (if it has a lot of variance). That seems to be the main flaw with this approach - in most of the use cases I run into, I end up artificially inflating the input before passing it through the resampler, to make sure that the majority class is fully represented even after resampling (roughly the sketch below). It would be nice to have that taken care of by scikit-learn as well, but doing the preprocessing to get my intended result is simple enough that this takes care of 90% of the use case.
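To make that workaround concrete, one way to do it is to pick the total size so that resampling to the target prior never shrinks any class below its original count (the function name and signature here are mine, not anything in scikit-learn):

```python
import numpy as np

def inflate_then_resample(X, y, prior, rng):
    """Resample to `prior` with a total size chosen so every class keeps
    all of its original points; extra points are drawn with replacement."""
    classes, counts = np.unique(y, return_counts=True)
    # smallest n_total such that n_total * prior[c] >= counts[c] for every class
    n_total = int(np.ceil(max(c / p for c, p in zip(counts, prior))))
    idx = []
    for cls, p, count in zip(classes, prior, counts):
        cls_idx = np.flatnonzero(y == cls)
        n_cls = int(round(p * n_total))
        # keep every original point of the class, then top up with replacement
        extra = rng.choice(cls_idx, size=max(n_cls - count, 0), replace=True)
        idx.append(np.concatenate([cls_idx, extra]))
    idx = rng.permutation(np.concatenate(idx))
    return X[idx], y[idx]
```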
(But you can create a class-resampling CV splitter...)
(or a meta-estimator ;) - and reference #5972 for your idea...
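The meta-estimator route could look roughly like this (a sketch only; the class name and its `class_prior` parameter are illustrative, not an existing scikit-learn API):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class ResampledClassifier(BaseEstimator, ClassifierMixin):
    """Resample (X, y) to a target class prior, then fit the wrapped classifier."""

    def __init__(self, estimator, class_prior=None, random_state=None):
        self.estimator = estimator
        self.class_prior = class_prior
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        classes = np.unique(y)
        # default to a uniform prior over the observed classes
        prior = (self.class_prior if self.class_prior is not None
                 else [1 / len(classes)] * len(classes))
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=int(round(p * len(y))), replace=True)
            for c, p in zip(classes, prior)
        ])
        self.estimator_ = clone(self.estimator).fit(X[idx], y[idx])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)
```

Because `class_prior` is an ordinary constructor parameter, it can then be tuned like any other hyperparameter, e.g. `GridSearchCV(ResampledClassifier(SVC()), {"class_prior": [[0.5, 0.5], [0.7, 0.3]]})`.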
See also #13269. |
Many classification applications need to deal with skewed input data - recently, for several projects, I've had to implement techniques to re-weight samples during training to get the best results. Ideally this could be supported generically by scikit-learn in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py
In my use case I was able to get significantly better results by assuming a uniform prior during training despite the skewed labels - but it makes sense to have a generic way to add weights to the sampled training distribution for cases where researchers have good reason to incorporate a particular prior.
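For what it's worth, the uniform-prior case can already be approximated with per-sample weights (a minimal sketch; the choice of SGDClassifier and the toy data are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Skewed binary problem, roughly 90% of samples in class 0.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' weights each sample inversely to its class frequency,
# which amounts to training under a uniform class prior.
w = compute_sample_weight("balanced", y)
clf = SGDClassifier(random_state=0).fit(X, y, sample_weight=w)
```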