How are class_weight and sample_weight passed and used for imbalanced classification? #17363
Comments
Which algorithm are you referring to? The behavior will depend on the estimator itself.
At the moment, I am looking at Logistic Regression.
It is used in the loss function, where each sample's loss term is multiplied by a weight that is the inverse of its class frequency.
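To illustrate the inverse-frequency weighting, here is a minimal sketch (toy labels of my own, not from the thread) using scikit-learn's public helper, which implements `n_samples / (n_classes * bincount(y))` for the "balanced" mode:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced: 8 vs 2
classes = np.unique(y)

# 'balanced' weights: inverse of class frequency.
weights = compute_class_weight("balanced", classes=classes, y=y)
print(weights)  # [0.625 2.5] -> the minority class gets the larger weight

# The same formula written out explicitly.
manual = len(y) / (len(classes) * np.bincount(y))
print(manual)  # [0.625 2.5]
```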
You can check the following loss and negative-gradient computation: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_logistic.py#L84. Note that this is for the LBFGS solver, but the other solvers handle sample weights in the same way.
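For intuition about what the linked code computes, here is a rough NumPy sketch of a weighted binary logistic loss (my own reconstruction of the idea, not scikit-learn's actual implementation): each sample's loss term is multiplied by its sample weight before summing.

```python
import numpy as np

def weighted_log_loss(w, X, y, sample_weight, alpha=1.0):
    """Weighted binary logistic loss with an L2 penalty.

    Assumes y is encoded in {-1, +1}, as in scikit-learn's internal
    logistic loss formulation.
    """
    z = y * (X @ w)
    # log(1 + exp(-y_i * w.x_i)), computed in a numerically stable way.
    per_sample = np.logaddexp(0.0, -z)
    # Sample weights scale each sample's contribution to the data-fit term.
    return sample_weight @ per_sample + 0.5 * alpha * (w @ w)
```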
Thanks for the quick reply and the references. However, I already knew about line #L84 in this file, and about the other files. My question is whether sample_weight is used for subsampling, since I can see this argument being passed to some C/C++ code. I want to know whether sample_weight or class_weight affects the ratio of samples used in training.
When speaking about C/C++, do you mean the liblinear solver?
Yes, I meant this solver. What is the mechanism behind this operation? Is there any theory supporting it?
If I recall correctly, it comes down to having a regularization parameter C for each sample, which is multiplied by that sample's weight; see the formula below.
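If that recollection is right, the weighted L2-regularized logistic objective solved by liblinear looks roughly like this (my reconstruction, where $s_i$ denotes the per-sample weight):

$$\min_{w}\ \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} s_i \,\log\!\bigl(1 + e^{-y_i\, w^\top x_i}\bigr)$$

so the effective cost parameter for sample $i$ is $C \cdot s_i$, and no subsampling is involved.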
Though I am not glad to see this answer, thank you all the same. I wish I could understand the underlying theory behind it.
There isn't much theory behind it. The starting point is that to handle class imbalance we would like to repeat samples from underrepresented classes. By definition, using sample weights is equivalent to repeating samples (e.g. a sample weight of 2 for a sample == repeating that sample twice). This in turn puts constraints on the loss function (see #15657). So in the end, by looking at the loss function you will see the impact of sample weights. We are just not documenting sample weights in the loss functions so far, but I think we should. For logistic regression, the expression with sample weights is in the link above.
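To make the repetition equivalence concrete, here is a minimal runnable sketch (toy data of my own, not from the thread) checking that fitting with integer sample weights matches fitting on a dataset where each sample is physically repeated that many times:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randint(0, 2, size=20)
sw = rng.randint(1, 4, size=20)  # integer weights in {1, 2, 3}

# Fit with sample weights.
clf_w = LogisticRegression(tol=1e-8, max_iter=1000).fit(X, y, sample_weight=sw)

# Fit on the dataset with sample i repeated sw[i] times.
X_rep = np.repeat(X, sw, axis=0)
y_rep = np.repeat(y, sw)
clf_r = LogisticRegression(tol=1e-8, max_iter=1000).fit(X_rep, y_rep)

# Both fits minimize the same objective, so the coefficients agree.
print(np.allclose(clf_w.coef_, clf_r.coef_, atol=1e-4))  # True
```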
Hi @lampda, your question has already received some attention, so I'm closing the issue. If you are interested in continuing the discussion, please join the community on Stack Overflow or the scikit-learn mailing list; the issue tracker is mainly for bugs and new features. Thanks for your understanding.
I was trying to track down how the parameters class_weight and sample_weight are passed and used in classification, and I found that the weights are passed to a C/C++ file, beyond which I could not follow them.
I also read about both parameters in the documentation, but the docs only say that class_weight alters the loss function.
My question is: if class_weight impacts only the loss function, why are class_weight, and in turn the sample weights derived from it, passed down to the C/C++ file?
Explicitly, if the user specifies class_weight, does the model or algorithm (and which algorithm) subsample the training instances?
Moreover, is there any theory comparing subsampling by sample_weight with altering the loss function?
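For reference on the class_weight-to-sample_weight mapping asked about above, here is a minimal sketch (toy labels of my own) using scikit-learn's public helper: each sample simply receives its class's weight, and no subsampling takes place.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([0, 0, 0, 1])

# Map the per-class weights onto each sample.
sw = compute_sample_weight(class_weight={0: 1.0, 1: 3.0}, y=y)
print(sw)  # [1. 1. 1. 3.]
```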