[MRG] SVM: Ensure nonnegative sample weights (#9494) by jondo · Pull Request #9674 · scikit-learn/scikit-learn · GitHub
[MRG] SVM: Ensure nonnegative sample weights (#9494) #9674


Closed · wants to merge 1 commit from the check-negative-SVC-weights branch

Conversation

@jondo (Contributor) commented Sep 2, 2017

As a first step toward dealing with #9494, this PR reproduces the issue in a unit test.

This should fail with the same IndexError as in the original report.
It only fails when all of the weights of the second class are negative.
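A minimal sketch of that triggering condition (the data values here are illustrative assumptions, not copied from the issue or the test):

```python
import numpy as np

# Illustrative data in the spirit of #9494: every sample of the
# second class gets a negative weight, which is the configuration
# under which SVC's fit/predict path raised the IndexError.
y = np.array([0, 0, 0, 1, 1, 1])
sample_weights = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# The failing condition: all weights of the second class are negative.
second_class_all_negative = bool(np.all(sample_weights[y == 1] < 0))
print(second_class_all_negative)  # True
```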

@glemaitre (Member)

Uhm no chances on this one :)

@jondo (Contributor Author) commented Sep 2, 2017

@glemaitre, do you mean that the original issue should be closed without any code change, or that there is something fundamentally wrong with my attempt at test-driven development?
Can you reproduce the error with my test?

@glemaitre (Member)

Oh sorry, actually the CI did not run, and thus I believed that your test did not fail.

@jondo (Contributor Author) commented Sep 2, 2017

I see. Deactivating CI there was a bad late-night idea. I will switch it back on, so that you don't need to test locally.

@jondo force-pushed the check-negative-SVC-weights branch from e9b3329 to f6d6fe3 on September 2, 2017 15:24
@jondo (Contributor Author) commented Sep 2, 2017

FWIW, a Travis build additionally shows the stderr message "warning: class label 1 specified in weight is not found", which might be related and seems to come from here in linear.cpp.

@jondo (Contributor Author) commented Sep 2, 2017

No, I guess the warning must come from the corresponding place in svm.cpp, since the test calls svm.SVC, not svm.LinearSVC (though I am reading all of this code for the first time).

Actually, this warning might just be a side effect (or even totally unrelated), because the code there is about the class-label weights (weight). The sample weights are in W; they were introduced by @fabianp in 2010, and they don't exist in upstream LIBSVM.

@jondo (Contributor Author) commented Sep 4, 2017

(See my comment in the issue.)

@jnothman (Member) commented Sep 5, 2017

Add a fix?

@jondo force-pushed the check-negative-SVC-weights branch from f6d6fe3 to 744862c on September 5, 2017 21:04
@jondo changed the title from "[WIP] Reproduce the IndexError from #9494" to "[MRG] SVM: Ensure nonnegative sample weights (#9494)" on Sep 5, 2017
@jondo force-pushed the check-negative-SVC-weights branch from 744862c to 66060f7 on September 6, 2017 16:30
@jondo (Contributor Author) commented Sep 8, 2017

Ready for review.

@@ -170,6 +170,9 @@ def fit(self, X, y, sample_weight=None):
"boolean masks (use `indices=True` in CV)."
% (sample_weight.shape, X.shape))

if sample_weight.shape[0] > 0 and min(sample_weight) < 0:
Review comment (Member):

use np.min
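A sketch of the validation using the suggested np.min (the helper name and the error message are illustrative assumptions, not the exact code from this diff):

```python
import numpy as np

def _check_sample_weight_nonnegative(sample_weight):
    # Hypothetical helper mirroring the check added in this PR,
    # written with np.min as the reviewer suggested.
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    if sample_weight.shape[0] > 0 and np.min(sample_weight) < 0:
        raise ValueError("sample_weight must contain only "
                         "nonnegative values")
    return sample_weight

_check_sample_weight_nonnegative([1.0, 0.0, 2.0])  # passes silently
try:
    _check_sample_weight_nonnegative([1.0, -1.0])
except ValueError:
    print("raised ValueError")  # negative weight rejected
```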



def test_negative_weights():
W = np.array([1, 1, 1, -1, -1, -1])
Review comment (Member):

rename: W -> sample_weights

@agramfort (Member)

I feel this should be done with a check_sample_weights function, tested on all estimators via a common test that can be disabled for speed with the new context manager.

@jondo force-pushed the check-negative-SVC-weights branch from 66060f7 to 2af259d on September 10, 2017 20:02
@jondo (Contributor Author) commented Sep 10, 2017

@agramfort:
I have applied the small changes. Would (sample_weight < 0).any() be even better?

I guess sklearn/base.py is the place to call check_sample_weights for all estimators, and utils/validation.py the place to define it? But is there any estimator that supports negative weights?

Should the disabling look as in #7548? E.g.

with sklearn.config_context(assume_positive_sample_weights=True):
    clf.fit(X, Y, sample_weight=sample_weights)
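On the (sample_weight < 0).any() question: the two checks agree on nonempty arrays, and .any() additionally handles the empty case without a shape guard. A small numpy comparison (not code from the PR):

```python
import numpy as np

w = np.array([1.0, 2.0, -0.5])
# Equivalent on nonempty arrays:
assert (np.min(w) < 0) == bool((w < 0).any())

# .any() needs no shape[0] > 0 guard: it is simply False on empty
# input, whereas np.min raises on an empty array.
empty = np.array([])
assert not (empty < 0).any()
print("checks agree")
```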

@jnothman (Member) commented Sep 10, 2017 via email

@agramfort (Member) commented Sep 11, 2017 via email

@jondo (Contributor Author) commented Sep 11, 2017

Besides svm.SVC, are there other estimators that need nonnegative weights? I found no such case in the sklearn docs. (Should I document it for svm.SVC?)

I don't even know about the other SVM estimators. I guess that at least svm.LinearSVC might also need nonnegative weights, since LIBLINEAR's README.weight also says "Please make sure all weights are non-negative". So the common test can be reused at least there.

@jnothman (Member) commented Sep 11, 2017 via email

@amueller (Member)
I think most people assume weights are non-negative, but as @agramfort said, they don't need to be for AdaBoost (I'm not sure about gradient boosting and other tree-based algorithms).

Can you run a test over all_estimators, check which ones take weights and which ones require non-negative ones, and add error messages?
I guess we'd have a common test that checks that either a good error is raised or the result is correct (I think flipping the sign of y would be OK for binary; not sure what negative weights do in multiclass).
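A sketch of the first half of that scan, detecting which estimators accept sample_weight via signature inspection; in scikit-learn this would be run over sklearn.utils.all_estimators(), but the stand-in classes here are hypothetical so the snippet stays self-contained:

```python
import inspect

def accepts_sample_weight(estimator_class):
    # True when the estimator's fit signature exposes sample_weight.
    sig = inspect.signature(estimator_class.fit)
    return "sample_weight" in sig.parameters

# Hypothetical stand-ins for real estimators:
class DummyWithWeights:
    def fit(self, X, y, sample_weight=None):
        return self

class DummyWithoutWeights:
    def fit(self, X, y):
        return self

print(accepts_sample_weight(DummyWithWeights))     # True
print(accepts_sample_weight(DummyWithoutWeights))  # False
```

The second half (deciding which of those require non-negative weights) would then need a fit call with a negative weight and a check that either a clear error is raised or the result matches a known-correct baseline.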

@alexshacked (Contributor)
Actually, the SVC algorithm handles negative weights. It happens in the implementation, in function PREFIX(train) in svm.cpp:2338. The first thing this function does is call remove_zero_weight(), also located in svm.cpp, which removes from the input all samples that have a negative or zero weight, so the training algorithm processes only the samples with positive weights. The problem in the specific example that uncovered bug #9494 is that all samples with the "0" label also received a negative weight. So after removing the samples with negative weights, the training algorithm in svm.cpp was left with only the samples having label "1". This is of course an invalid state, and this is why coef_ was not created: for SVM training, the input must contain samples with label "0" and samples with label "1". Both classes must be present in the training data.
Function PREFIX(train) in svm.cpp:2338 does not test that both labels appear in the input data because it relies on the testing done in the base class, in sklearn.svm.base.BaseLibSVM.fit(). There we call validate_targets() (base.py:147), which tests exactly that. But this function looks at the labels, not at the weights. So in the #9494 scenario, sklearn.svm.base.BaseLibSVM.fit() validates the input since it sees samples with both types of labels, but then in the implementation in svm.cpp, all the samples with negative weights are removed, leaving only samples with label "1" in the training data.
Since BaseLibSVM is inherited by many specialized classes (NuSVC, SVR, NuSVR, OneClassSVM), testing for negative weights in BaseLibSVM.fit() could interfere with the other algorithms that inherit from this class. This issue was mentioned by the participants in pull request #9674.

A more encapsulated fix could be done in the implementation, in PREFIX(train) in svm.cpp: after removing the samples with zero/negative weights, the function should test that both class labels are still present in the remaining training set, and if not, throw an exception.
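The proposed check can be sketched in Python terms (the real fix would live in C++ in svm.cpp; the function name here is illustrative):

```python
import numpy as np

def classes_remaining_after_filter(y, sample_weight):
    # Python-level sketch of what remove_zero_weight() leaves behind:
    # only samples with strictly positive weight survive.
    y = np.asarray(y)
    sample_weight = np.asarray(sample_weight)
    return np.unique(y[sample_weight > 0])

# The #9494 scenario: the whole second class carries negative weights.
remaining = classes_remaining_after_filter(
    y=[0, 0, 0, 1, 1, 1], sample_weight=[1, 1, 1, -1, -1, -1])
print(remaining)  # only class 0 survives
if remaining.size < 2:
    print("invalid training set: fewer than two classes remain")
```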

I see that this issue has not been handled for some time, and I would be happy to take it on and handle the pull request.

@alexshacked (Contributor) commented Jul 1, 2019

@amueller, @agramfort: I saw that issue #9494 (which is handled by this PR) was tagged "help wanted".
I analysed the problem starting from the PR comments and then by debugging the code; my findings are in the comment above. I would like to proceed with the above solution. What do you think?

@jnothman (Member) commented Jul 1, 2019 via email

@alexshacked (Contributor)
Opened PR #14282

@alexshacked (Contributor)
Closed PR #14282
Opened PR #14286

@amueller added the "Superseded (PR has been replaced by a newer PR)" label on Aug 6, 2019
6 participants