[WIP] FIX make sure sample_weight is taken into account by estimators by glemaitre · Pull Request #14246 · scikit-learn/scikit-learn · GitHub

[WIP] FIX make sure sample_weight is taken into account by estimators #14246


Closed
wants to merge 4 commits

Conversation

glemaitre
Member

closes #14191

It seems that we don't have a common test checking that over-sampling or under-sampling X gives results equivalent to using sample_weight. This PR introduces this new common test.

In addition, it fixes the estimators which do not follow this constraint even though they should.
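
For illustration only, here is a minimal sketch (not the actual common test) of the equivalence being checked; the estimator and the integer weights are arbitrary choices for the example:

import numpy as np
from numpy.testing import assert_allclose
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50, random_state=0)
rng = np.random.RandomState(0)

# integer weights play the role of repetition counts (no zero weight here)
sample_weight = rng.randint(1, 4, size=X.shape[0])

# build the equivalent "over-sampled" dataset by repeating each row
X_repeated = np.repeat(X, sample_weight, axis=0)
y_repeated = np.repeat(y, sample_weight)

tree_weighted = DecisionTreeClassifier(random_state=0).fit(
    X, y, sample_weight=sample_weight)
tree_repeated = DecisionTreeClassifier(random_state=0).fit(X_repeated, y_repeated)

# both models should give the same predictions on the original data
assert_allclose(tree_weighted.predict_proba(X), tree_repeated.predict_proba(X))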

@glemaitre glemaitre changed the title FIX make sure sample_weight is taken into account by estimators [WIP] FIX make sure sample_weight is taken into account by estimators Jul 3, 2019
Member
@jnothman jnothman left a comment


This tests a necessary condition of sample_weight support only in some situations. There will be cases where the weighted version is implemented differently... And cases like MLP where we might decide that the minibatch should be sampled without weights and the updates weighted.

A sufficient condition would also check that sample_weight had some effect at all on the model learnt.
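
For illustration, such an effect check could look roughly like the following sketch (not proposed test code; the estimator and the weighting scheme are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)

# put almost all the weight on class 0; an estimator that honours
# sample_weight should now behave differently from the unweighted fit
sample_weight = np.where(y == 0, 100.0, 0.01)

clf_plain = LogisticRegression().fit(X, y)
clf_weighted = LogisticRegression().fit(X, y, sample_weight=sample_weight)

# if sample_weight were silently ignored, the predictions would be identical
assert not np.array_equal(clf_plain.predict(X), clf_weighted.predict(X))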

@glemaitre
Member Author

This tests a necessary condition of sample_weight support only in some situations. There will be cases where the weighted version is implemented differently... And cases like MLP where we might decide that the minibatch should be sampled without weights and the updates weighted.

A sufficient condition would also check that sample_weight had some effect at all on the model learnt.

You are right. It was my comment in #14532 that this test can fail due to implementation details rather than bugs. However, it allowed me to find some bugs (at least one so far).

I will certainly implement the version that you are mentioning. However, depending on the follow-up, I would maybe propose implementing both tests with an additional estimator tag. But I need to investigate all the failing estimators and what the underlying reasons are.

# check that the estimator yields the same results for an over-sampled
# dataset built by repeating indices and for the equivalent sample_weight
if (has_fit_parameter(estimator_orig, "sample_weight") and
not (hasattr(estimator_orig, "_pairwise")
Member


why don't we use the pairwise tool to make it pairwise? Or do we not properly support sample weights then?

Member Author


No idea; the base test was taken from the other sample_weight common test.

@amueller
Member

I can't see the builds :(
Which estimators were failing? As I said in #15015, I agree with @jnothman that online algorithms might be problematic, and also that CV-based algorithms might be problematic.
Given all the bugs, I feel we do need a test like this, though.
Dare I say we need a tag :(
Or should we make sure that the batch algorithms actually do "the right thing"? That might be more costly and not worth it, though. For convex optimization we should end up in the same place if we train until convergence, but for MLPs it's probably tricky? Though the prediction results on the training set really shouldn't be that different when changing the sampling of the batches a bit.
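
As an aside, the "same place after convergence" argument for convex objectives can be illustrated with a batch solver and a tight tolerance; this is just a sketch, with LogisticRegression/lbfgs chosen arbitrarily:

import numpy as np
from numpy.testing import assert_allclose
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50, random_state=0)
rng = np.random.RandomState(0)
sample_weight = rng.randint(1, 4, size=X.shape[0])

X_rep = np.repeat(X, sample_weight, axis=0)
y_rep = np.repeat(y, sample_weight)

# both fits minimise the same objective (weighted sum of losses + penalty),
# so with a tight tolerance the solutions should agree
params = dict(solver="lbfgs", tol=1e-12, max_iter=10000)
coef_weighted = LogisticRegression(**params).fit(
    X, y, sample_weight=sample_weight).coef_
coef_repeated = LogisticRegression(**params).fit(X_rep, y_rep).coef_

assert_allclose(coef_weighted, coef_repeated, rtol=1e-4, atol=1e-8)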

@glemaitre
Member Author

adaboostregressor, baggingregressor, calibratedclassifiercv, gradientboostingregressor, isolationforest, linearsvc, linearsvr, minibatchkmeans, kmeans, nusvr, oneclasssvm, perceptron, ransacregressor, randomforestregressor, sgdclassifier, sgdregressor

@glemaitre
Member Author

I think that the ensemble failures could be due to the bootstrapping, which sees a different X and therefore generates different bootstrap samples. The SVM failures could be linked to what we discussed in the other PR.

Something is wrong with SGD (but this is stochastic) and with the linear SVMs (we should check the solver there).

@amueller
Member

LinearSVC just ignores the sample weights, see #10873

@glemaitre
Member Author

Digging into the code, it actually seems reasonable:

Only one configuration uses sample_weight (plus another one fixed in #15018), and these configurations are not triggered by the default parameters. The test passes for LinearSVC with dual=False (after merging #15018) and for LinearSVR with dual=False and loss='squared_epsilon_insensitive'.

So one clean way would be to raise a NotImplementedError for the other configurations in the meantime (in case somebody wants to implement them, if that is possible). Our test could then cover the supported configuration.
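
For reference, a sketch of the kind of check this would enable for the supported configuration (assuming #15018 is merged; the dataset and tolerances are ad hoc):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, random_state=0)
rng = np.random.RandomState(0)
sample_weight = rng.randint(1, 4, size=X.shape[0])

X_rep = np.repeat(X, sample_weight, axis=0)
y_rep = np.repeat(y, sample_weight)

# dual=False is the liblinear configuration that actually uses sample_weight;
# the default dual=True path is the one discussed above
svc_weighted = LinearSVC(dual=False, tol=1e-8, max_iter=100000).fit(
    X, y, sample_weight=sample_weight)
svc_repeated = LinearSVC(dual=False, tol=1e-8, max_iter=100000).fit(X_rep, y_rep)

assert_allclose(svc_weighted.coef_, svc_repeated.coef_, rtol=1e-4, atol=1e-6)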

@jnothman jnothman modified the milestones: 0.22, 0.23 Oct 31, 2019
@thomasjpfan thomasjpfan modified the milestones: 0.23, 0.24 Apr 20, 2020
@cmarmo cmarmo removed this from the 0.24 milestone Oct 15, 2020
Base automatically changed from master to main January 22, 2021 10:51
@adrinjalali
Member

Should we close this and only do slep6 on it?

@adrinjalali
Member

Actually, we do have it for CalibratedClassifierCV. So I guess we can close this.

@adrinjalali adrinjalali closed this Feb 1, 2024
Development

Successfully merging this pull request may close these issues.

BaggingClassifier vs LinearSVC training slow
6 participants