[WIP] FIX make sure sample_weight is taken into account by estimators #14246
Conversation
This tests only a necessary condition of sample_weight support, and only in some situations. There will be cases where the weighted version is implemented differently... and cases like MLP where we might decide that the minibatch should be sampled without weights and the updates weighted.
A sufficient condition would also check that sample_weight had some effect at all on the model learnt.
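A minimal sketch of such a sufficiency check, for illustration only; the helper name, the dataset, and the choice of a deterministic regressor are assumptions rather than anything proposed in this PR:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

def check_sample_weight_has_effect(estimator, X, y):
    """Non-uniform sample_weight should change what the estimator learns."""
    rng = np.random.RandomState(0)
    sample_weight = rng.uniform(0.1, 10.0, size=X.shape[0])
    pred_plain = clone(estimator).fit(X, y).predict(X)
    pred_weighted = clone(estimator).fit(
        X, y, sample_weight=sample_weight).predict(X)
    # If sample_weight is silently ignored, the two fits are identical.
    assert not np.allclose(pred_plain, pred_weighted)

# Noisy data so that reweighting actually moves the least-squares solution.
X, y = make_regression(n_samples=50, n_features=3, noise=10.0, random_state=0)
check_sample_weight_has_effect(LinearRegression(), X, y)
```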
You are right. As I mentioned in my comments in #14532, this test can fail due to implementation details rather than bugs. However, it allowed me to find some bugs (at least one so far). I will certainly implement the version that you are mentioning. However, depending on the follow-up, I might propose implementing both tests with an additional estimator tag. But I first need to investigate all failing estimators and the underlying reasons.
```python
# check that the estimator yields the same results for an over-sampled
# dataset (built by repeating indices) and for the equivalent sample_weight
if (has_fit_parameter(estimator_orig, "sample_weight") and
        not (hasattr(estimator_orig, "_pairwise")
```
why don't we use the pairwise tool to make it pairwise? Or do we not properly support sample weights then?
No idea; the base test comes from the other sample_weight common test.
I can't see the builds :(
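For reference, the "pairwise tool" question above usually amounts to precomputing a kernel matrix so that a pairwise estimator can be fitted on a square input; a minimal sketch, using SVC with kernel="precomputed" as an arbitrary pairwise estimator (not this PR's actual handling):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import linear_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=40, random_state=0)
K = linear_kernel(X, X)                 # X converted to an (n, n) kernel matrix
weights = np.tile([1.0, 2.0], 20)       # arbitrary non-uniform sample weights

est = SVC(kernel="precomputed")
est.fit(K, y, sample_weight=weights)    # pairwise estimator, weighted fit
print(est.predict(K[:5]))               # rows of K against the training set
```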
AdaBoostRegressor, BaggingRegressor, CalibratedClassifierCV, GradientBoostingRegressor, IsolationForest, LinearSVC, LinearSVR, MiniBatchKMeans, KMeans, NuSVR, OneClassSVM, Perceptron, RANSACRegressor, RandomForestRegressor, SGDClassifier, SGDRegressor
I think the ensemble failures could be due to the bootstrapping, which sees a different X and therefore generates different bootstrap samples. The SVM failures could be linked to what we discussed in the other PR. Something is wrong with SGD (but this is stochastic) and with the linear SVMs (we should check the solver there).
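An illustrative reproduction of the bootstrapping point, not code from the PR: repeating rows changes the population the bootstrap is drawn from, so even with a fixed random_state the weighted and over-sampled forests need not be identical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=30, n_features=4, noise=5.0, random_state=0)
weights = np.tile([1, 2], 15)              # integer weights: 1 or 2 per sample
X_over = np.repeat(X, weights, axis=0)     # equivalent over-sampled dataset
y_over = np.repeat(y, weights)

rf_weighted = RandomForestRegressor(random_state=0).fit(X, y, sample_weight=weights)
rf_oversampled = RandomForestRegressor(random_state=0).fit(X_over, y_over)

# Typically False: the bootstrap indices are drawn over different row counts.
print(np.allclose(rf_weighted.predict(X), rf_oversampled.predict(X)))
```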
LinearSVC just ignores the sample weights, see #10873
Digging into the code, it actually seems reasonable: only one configuration uses sample_weight (plus another one fixed in #15018), and these are not triggered by the default parameters. The test passes for LinearSVC with that configuration. So one clean way would be to raise a NotImplementedError for the other configurations in the meantime (until somebody implements it, if that is possible). Our test could then cover the supported configuration.
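A minimal sketch of the NotImplementedError guard suggested above; the helper name and the supports_sample_weight flag are placeholders, not the actual LinearSVC/liblinear code:

```python
def _check_sample_weight_supported(sample_weight, supports_sample_weight):
    """Hypothetical guard: refuse weights that would be silently ignored.

    `supports_sample_weight` would be True only for the solver
    configuration that actually forwards the weights to liblinear.
    """
    if sample_weight is not None and not supports_sample_weight:
        raise NotImplementedError(
            "sample_weight is not supported for the current combination of "
            "parameters and would be silently ignored."
        )
```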
Should we close this and only do SLEP 6 on it?
Actually, we do have it for CalibratedClassifierCV, so I guess we can close this.
Closes #14191.

It seems that we don't have a common test checking that over-sampling or under-sampling X gives results equivalent to using sample_weight. This PR introduces this new common test. In addition, it fixes the estimators which do not follow this constraint when they should.
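For reference, a minimal sketch of the kind of common test described above; the function name, the integer-weight construction, and the tolerance are illustrative rather than the exact code added to estimator_checks.py:

```python
import numpy as np
from sklearn.base import clone
from sklearn.utils.validation import has_fit_parameter

def check_sample_weight_equals_oversampling(estimator, X, y):
    """Fitting with integer sample_weight should match fitting on a dataset
    in which each row is repeated sample_weight times."""
    if not has_fit_parameter(estimator, "sample_weight"):
        return
    rng = np.random.RandomState(42)
    weights = rng.randint(1, 4, size=X.shape[0])   # integer weights in {1, 2, 3}
    X_repeated = np.repeat(X, weights, axis=0)
    y_repeated = np.repeat(y, weights)

    est_weighted = clone(estimator).fit(X, y, sample_weight=weights)
    est_repeated = clone(estimator).fit(X_repeated, y_repeated)

    np.testing.assert_allclose(est_weighted.predict(X), est_repeated.predict(X))
```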