how class_weight or sample_weight passed and used for imbalanced classification? · Issue #17363 · scikit-learn/scikit-learn · GitHub

how class_weight or sample_weight passed and used for imbalanced classification? #17363



Closed
pamdla opened this issue May 27, 2020 · 11 comments

@pamdla
pamdla commented May 27, 2020

I was trying to track down how the parameters class_weight and sample_weight are passed and used in classification. I found that the weights are passed down to a C/C++ file, but I could not follow it any further.

I also read about both parameters in the documentation, but it only says that class_weight alters the loss function.

My question is: if class_weight impacts only the loss function, why is class_weight (and, in turn, the sample_weight derived from it) passed to a C/C++ file?

Explicitly, if a user specifies class_weight, does the model or algorithm (and which algorithm) subsample the training instances?

Moreover, is there any theory about subsampling by sample_weight versus altering the loss function?

@glemaitre
Member

Which algorithm are you referring to? It will depend on the algorithm (estimator) itself.

@pamdla
Author
pamdla commented May 28, 2020

At the moment, I am looking at LogisticRegression.

@glemaitre
Member

It is used in the loss function, where each sample is multiplied by a weight that is the inverse of its class frequency.
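To make the inverse-frequency idea concrete, here is a small sketch (not the internal code path) using the public helper sklearn.utils.class_weight.compute_class_weight, whose "balanced" mode implements that heuristic:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: four samples of class 0, one sample of class 1
y = np.array([0, 0, 0, 0, 1])

# "balanced" computes n_samples / (n_classes * np.bincount(y)),
# i.e. weights proportional to the inverse class frequency
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
# class 0 -> 5 / (2 * 4) = 0.625, class 1 -> 5 / (2 * 1) = 2.5
```

The minority class ends up with the larger weight, so its per-sample loss terms count more.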

@glemaitre
Member

You can check the following loss and neg gradient computation: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_logistic.py#L84

Note that this is for the LBFGS solver, but class_weight is handled similarly by the other optimizers.
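As an illustration of what "used in the loss" means, here is a hypothetical minimal NumPy version of a weighted logistic loss (illustrative only, not the scikit-learn implementation; the function name is made up):

```python
import numpy as np

def weighted_logistic_loss(w, X, y, sample_weight, alpha=0.0):
    """Toy weighted logistic loss with an optional L2 term.

    y is encoded in {-1, +1}; each per-sample loss log(1 + exp(-y_i * x_i.w))
    is multiplied by sample_weight[i] before summing.
    """
    z = y * (X @ w)
    return np.sum(sample_weight * np.log1p(np.exp(-z))) + 0.5 * alpha * (w @ w)

# Weighting a sample by 2 gives the same loss as including it twice
X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.3, -0.7])

loss_weighted = weighted_logistic_loss(w, X, y, np.array([2.0, 1.0]))
loss_repeated = weighted_logistic_loss(
    w, np.vstack([X, X[:1]]), np.append(y, y[0]), np.ones(3))
```

No sample is dropped or duplicated in memory; the weight only rescales each sample's contribution to the objective.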

@pamdla
Author
pamdla commented May 28, 2020

Thanks for the quick reply and refs.

However, I already knew about line #L84 in this file, and in other files. My question is whether sample_weight is used for subsampling, since I can see this argument is passed to some C/C++ file. I want to know whether sample_weight or class_weight impacts the proportion of samples used in training.

@glemaitre
Member

When speaking about C/C++, do you mean the liblinear solver? The weights are applied in the same manner at the C level, without any resampling: the samples are multiplied by the inverse of the class frequency.
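One way to check that the liblinear path weights the loss rather than resampling: with the intercept disabled, liblinear and LBFGS minimize the same weighted convex objective, so class_weight should lead them to (numerically) the same solution. A small sketch, assuming a recent scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small imbalanced dataset: five samples of class 0, two of class 1
X = np.array([[-2.0, 0.5], [-1.5, -0.3], [-1.0, 1.2], [-0.5, -0.8],
              [0.0, 0.3], [2.0, -0.5], [2.5, 1.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1])

# Same weighted objective, two different solvers
common = dict(class_weight="balanced", fit_intercept=False,
              tol=1e-10, max_iter=10_000)
clf_lbfgs = LogisticRegression(solver="lbfgs", **common).fit(X, y)
clf_liblinear = LogisticRegression(solver="liblinear", **common).fit(X, y)
# Both coefficient vectors agree up to solver tolerance
```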

@pamdla
Author
pamdla commented May 29, 2020

Yes, I meant this solver. What is the mechanism behind this operation? Is there any theory supporting it?

@glemaitre
Member

If I recall properly, it comes down to having a per-sample C which is multiplied by the sample weight.

@pamdla
Author
pamdla commented Jun 3, 2020

Though I am not glad to see this answer, thank you all the same. I wish I could understand the underlying theory behind it.

@rth
Member
rth commented Jun 3, 2020

There isn't much theory behind it. The starting point is that, to handle class imbalance, we would like to repeat samples from underrepresented classes. By definition, using sample weights is equivalent to repeating samples (e.g. a sample weight of 2 for a sample == repeating that sample twice). This in turn puts constraints on the loss function (see #15657).

So in the end, by looking at the loss function you can see the impact of sample weights. We just don't document sample weights in the loss functions so far, but I think we should. For logistic regression, the expression with sample weights is in the link above.
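The repeat-equivalence above can be checked directly with LogisticRegression: fitting with a sample weight of 2 should give the same model as duplicating that sample (a quick sketch; equality holds up to solver tolerance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Model A: the last sample appears twice in the training data
clf_dup = LogisticRegression(tol=1e-10, max_iter=10_000).fit(
    np.vstack([X, X[-1:]]), np.append(y, y[-1]))

# Model B: the last sample is given weight 2 instead
clf_sw = LogisticRegression(tol=1e-10, max_iter=10_000).fit(
    X, y, sample_weight=np.array([1.0, 1.0, 1.0, 2.0]))
# The fitted coefficients and intercepts match
```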

@cmarmo
Contributor
cmarmo commented Sep 14, 2020

Hi @pamdla, your question has already received some attention, so I'm closing the issue. If you are interested in continuing the discussion, please join the community on Stack Overflow or the scikit-learn mailing list. The issue tracker is mainly for bugs and new features. Thanks for your understanding.

@cmarmo cmarmo closed this as completed Sep 14, 2020