how class_weight or sample_weight passed and used for imbalanced classification? · Issue #17363 · scikit-learn/scikit-learn · GitHub

how class_weight or sample_weight passed and used for imbalanced classification? #17363



Closed
pamdla opened this issue May 27, 2020 · 11 comments

@pamdla
pamdla commented May 27, 2020

I was trying to track down how the parameters class_weight and sample_weight are passed and used in classification. I found that the weights are passed down to a C/C++ file, but I could not follow it any further.

I also read about both parameters in the documentation, but it only says that class_weight alters the loss function.

My question is: if class_weight impacts only the loss function, why is class_weight (and, in turn, the sample_weight derived from it) passed to a C/C++ file?

Explicitly, if a user specifies class_weight, does the model or algorithm (and which algorithm) subsample the training instances?

Moreover, is there any theory about subsampling by sample_weight versus altering the loss function?

@glemaitre
Member

Which algorithm are you referring to? It will depend on the algorithm (estimator) itself.

@pamdla
Author
pamdla commented May 28, 2020

At the moment, I am looking at LogisticRegression.

@glemaitre
Member

It is used in the loss function, where each sample is multiplied by a weight that is the inverse of its class frequency.
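To make the inverse-frequency idea concrete, here is a small sketch (not the internal code path) using the public helper sklearn.utils.class_weight.compute_class_weight, whose "balanced" mode implements that heuristic:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: four samples of class 0, one sample of class 1
y = np.array([0, 0, 0, 0, 1])

# "balanced" computes n_samples / (n_classes * np.bincount(y)),
# i.e. weights proportional to the inverse class frequency
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
# class 0 -> 5 / (2 * 4) = 0.625, class 1 -> 5 / (2 * 1) = 2.5
```

The minority class ends up with the larger weight, so its per-sample loss terms count more.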

@glemaitre
Member

You can check the following loss and neg gradient computation: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_logistic.py#L84

Note that this is for the LBFGS solver, but class_weight is handled similarly by the other optimizers.
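As an illustration of what "used in the loss" means, here is a hypothetical minimal NumPy version of a weighted logistic loss (illustrative only, not the scikit-learn implementation; the function name is made up):

```python
import numpy as np

def weighted_logistic_loss(w, X, y, sample_weight, alpha=0.0):
    """Toy weighted logistic loss with an optional L2 term.

    y is encoded in {-1, +1}; each per-sample loss log(1 + exp(-y_i * x_i.w))
    is multiplied by sample_weight[i] before summing.
    """
    z = y * (X @ w)
    return np.sum(sample_weight * np.log1p(np.exp(-z))) + 0.5 * alpha * (w @ w)

# Weighting a sample by 2 gives the same loss as including it twice
X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.3, -0.7])

loss_weighted = weighted_logistic_loss(w, X, y, np.array([2.0, 1.0]))
loss_repeated = weighted_logistic_loss(
    w, np.vstack([X, X[:1]]), np.append(y, y[0]), np.ones(3))
```

No sample is dropped or duplicated in memory; the weight only rescales each sample's contribution to the objective.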

@pamdla
Author
pamdla commented May 28, 2020

Thanks for the quick reply and refs.

However, I already knew about line #L84 in this file, and in other files. My question is whether sample_weight is used for subsampling, since I can see this argument is passed to some C/C++ file. I want to know whether sample_weight or class_weight impacts the proportion of samples used in training.

@glemaitre
Member

When speaking about C/C++, do you mean the liblinear solver? The weights are applied in the same manner at the C level, without any resampling: the samples are multiplied by the inverse of the class frequency.
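One way to check that the liblinear path weights the loss rather than resampling: with the intercept disabled, liblinear and LBFGS minimize the same weighted convex objective, so class_weight should lead them to (numerically) the same solution. A small sketch, assuming a recent scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small imbalanced dataset: five samples of class 0, two of class 1
X = np.array([[-2.0, 0.5], [-1.5, -0.3], [-1.0, 1.2], [-0.5, -0.8],
              [0.0, 0.3], [2.0, -0.5], [2.5, 1.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1])

# Same weighted objective, two different solvers
common = dict(class_weight="balanced", fit_intercept=False,
              tol=1e-10, max_iter=10_000)
clf_lbfgs = LogisticRegression(solver="lbfgs", **common).fit(X, y)
clf_liblinear = LogisticRegression(solver="liblinear", **common).fit(X, y)
# Both coefficient vectors agree up to solver tolerance
```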

@pamdla
Author
pamdla commented May 29, 2020

Yes, I meant this solver. What is the mechanism behind this operation? Is there any theory supporting it?

@glemaitre
Member

If I recall properly, it comes down to having a per-sample C which is multiplied by the sample weight.

@pamdla
Author
pamdla commented Jun 3, 2020

Though I am not glad to see this answer, thank you all the same. I wish I could understand the underlying theory behind it.

@rth
Member
rth commented Jun 3, 2020

There isn't much theory behind it. The starting point is that, to handle class imbalance, we would like to repeat samples from underrepresented classes. By definition, using sample weights is equivalent to repeating samples (e.g. a sample weight of 2 for a sample == repeating that sample twice). This in turn puts constraints on the loss function (see #15657).

So in the end, by looking at the loss function you can see the impact of sample weights. We just don't document sample weights in the loss functions so far, but I think we should. For logistic regression, the expression with sample weights is in the link above.
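The repeat-equivalence above can be checked directly with LogisticRegression: fitting with a sample weight of 2 should give the same model as duplicating that sample (a quick sketch; equality holds up to solver tolerance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Model A: the last sample appears twice in the training data
clf_dup = LogisticRegression(tol=1e-10, max_iter=10_000).fit(
    np.vstack([X, X[-1:]]), np.append(y, y[-1]))

# Model B: the last sample is given weight 2 instead
clf_sw = LogisticRegression(tol=1e-10, max_iter=10_000).fit(
    X, y, sample_weight=np.array([1.0, 1.0, 1.0, 2.0]))
# The fitted coefficients and intercepts match
```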

@cmarmo
Contributor
cmarmo commented Sep 14, 2020

Hi @pamdla, your question has already received some attention, so I'm closing the issue. If you are interested in continuing the discussion, please join the community on Stack Overflow or the scikit-learn mailing list. The issue tracker is mainly for bugs and new features. Thanks for your understanding.

@cmarmo cmarmo closed this as completed Sep 14, 2020