Weird behavior in LogisticRegression on parameter class_weight · Issue #1411 · scikit-learn/scikit-learn · GitHub
Weird behavior in LogisticRegression on parameter class_weight #1411


Closed
ShusenLiu opened this issue Nov 26, 2012 · 31 comments · Fixed by #1491

@ShusenLiu

The class_weight parameter sets a different misclassification penalty for each class. For example, in a 0/1 classification problem, if we set class_weight={0:0.95, 1:0.05}, we can expect the classifier to be more careful with class-0 samples, since misclassifying a 0 as a 1 is heavily penalized.

But the LogisticRegression class seems to get this wrong:

from sklearn import datasets
from sklearn import svm
from sklearn import linear_model

# 100 samples; half labeled 0, the other half 1
X, Y = datasets.make_classification()
Y.sum()
>>> 50 

# unweighted (balanced) LR classifier
clr0 = linear_model.LogisticRegression()
clr0.fit(X, Y)
clr0.score(X,Y)
>>> 0.84999999999999998
clr0.predict(X).sum()
>>> 49

# class-weighted LR classifier
clr1 = linear_model.LogisticRegression(class_weight={0:0.9, 1:0.1})
clr1.fit(X, Y)
clr1.score(X,Y)
>>> 0.63
clr1.predict(X).sum()
>>> 85

The weighted classifier clr1 is supposed to classify more data as label 0, but it actually predicts far more data as 1. When we choose another classifier, say SVM, the behavior seems reasonable:

 
# unweighted (balanced) SVM classifier
clr2 = svm.SVC()
clr2.fit(X,Y)
clr2.score(X,Y)
>>> 0.95999999999999996
clr2.predict(X).sum()
>>> 46

# class-weighted SVM classifier
clr3 = svm.SVC(class_weight={0:0.6, 1:0.4})
clr3.fit(X,Y)
clr3.score(X,Y)
>>> 0.84999999999999998
clr3.predict(X).sum()
>>> 35.0

# another class-weighted SVM classifier
clr4 = svm.SVC(class_weight={0:0.9, 1:0.1})
clr4.fit(X,Y)
clr4.score(X,Y)
>>> 0.5
clr4.predict(X).sum()
>>> 0.0

When class_weight[0] vs class_weight[1] is 5:5, SVC predicts roughly half of the data as 0.
When class_weight[0] vs class_weight[1] is 6:4, SVC predicts more of the data as 0.
When class_weight[0] vs class_weight[1] is 9:1, SVC predicts all of the data as 0.

Is this a bug?
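For reference, the semantics the report assumes can be checked without liblinear at all. Below is a minimal sketch (assumptions: plain NumPy, two Gaussian blobs, and a hypothetical fit_weighted_lr helper doing gradient descent on the class-weighted log loss): up-weighting class 0 makes class-0 errors costlier, so fewer samples should be predicted as 1.

```python
import numpy as np

rng = np.random.RandomState(0)
# Two Gaussian blobs: class 0 centered at (-1, -1), class 1 at (+1, +1).
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.array([0] * 50 + [1] * 50)

def fit_weighted_lr(X, y, class_weight, lr=0.1, n_iter=500):
    """Gradient descent on the class-weighted logistic loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    sw = np.where(y == 1, class_weight[1], class_weight[0])  # per-sample weight
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y=1)
        g = sw * (p - y)                          # weighted log-loss gradient
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

w0, b0 = fit_weighted_lr(X, y, {0: 0.5, 1: 0.5})   # equal weights
w1, b1 = fit_weighted_lr(X, y, {0: 0.9, 1: 0.1})   # favor class 0
n0, n1 = predict(X, w0, b0).sum(), predict(X, w1, b1).sum()
print(n0, n1)   # n1 should not exceed n0
```

Under these semantics, the weighted model should predict 1 at most as often as the unweighted one, which is the opposite of what the LogisticRegression run above shows.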

@fannix
Contributor
fannix commented Nov 27, 2012

I think this implementation uses Liblinear, so the meaning of the class weight depends on the actual optimization method implemented. I tried different regularizers, and the number of 1s predicted also changed dramatically. Maybe you shouldn't set the class weight :)

@ShusenLiu
Author

This is weird behavior; we should consider it a bug, right?

I looked up the LibLinear README, line 198:


LibLinear 1.92 README, line 198:

We implement the 1-vs-the-rest multi-class strategy for classification.
In training i vs. non_i, their C parameters are (weight from -wi)*C
and C, respectively. If there are only two classes, we train only one
model. Thus weight1*C vs. weight2*C is used. See examples below.


Obviously LibLinear supports different weights for different classes.

Is this a bug in sklearn wrapper, or a bug in liblinear?


@fannix
Contributor
fannix commented Nov 28, 2012

I think this is unlikely to be a bug in sklearn or liblinear.

I tried a couple of examples and printed out the confusion matrix. I found that the larger the weight you give to class 0, the harder the LR classifier tries to avoid misclassifying class-1 samples as 0, and hence it refrains from predicting class 0 at all. This causes fewer instances to be predicted as class 0, contradicting what you assume.

You can read more about the parameter C in the following references:

http://jmlr.csail.mit.edu/papers/volume9/fan08a/fan08a.pdf
http://pyml.sourceforge.net/doc/howto.pdf
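The confusion-matrix check described above can be sketched against the current scikit-learn API. Note that on releases containing the fix from #1491, the direction is the reverse of what this comment observed: a larger weight on class 0 pulls predictions toward class 0.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=100, random_state=0)

clf = LogisticRegression().fit(X, y)
clf_w = LogisticRegression(class_weight={0: 0.9, 1: 0.1}).fit(X, y)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y, clf.predict(X)))
print(confusion_matrix(y, clf_w.predict(X)))
```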


@amueller
Member

Sorry for the lack of feedback, the devs are busy at the moment.
I'll try to take a look later today.
Meng Xinfan: Sorry, I can't see your posts on github either :-/

@fannix
Contributor
fannix commented Nov 28, 2012

Really? Very strange. I can see that the comment count is increasing. Anyway, my reply is the "I think this is unlikely to be a bug..." comment above.


@amueller
Member

@fannix Well, now I have the reply in my inbox a couple of times but still cannot see it here. Maybe contact GitHub about this? They are usually very responsive. It shows that you are part of the discussion, at least.

@fannix
Contributor
fannix commented Nov 28, 2012

OK, thanks.

@fannix
Contributor
fannix commented Nov 30, 2012

@amueller You can read my comments now.

@amueller
Member

Yes :) great. Did you contact github?
@ShusenLiu sorry for the delay, I really meant to have a look at this but I've been busy. Will do on the weekend.

@fannix
Contributor
fannix commented Nov 30, 2012

Yes, it turns out that I was flagged as a spammer and blocked.


@amueller
Member

@fannix Let that be a lesson to you ;)

[ the lesson being that machine learning algorithms can not be trusted]

@amueller
Member

Ok after having looked at this for 10 minutes, this is definitely a bug in the semantics of sklearn.
Liblinear considers "class weights" the scaling of "C". So higher means more regularization!
While in sklearn higher means usually "more important".
My fix would be to take the inverse of the class weight before passing it to liblinear.

After a quick check this produces similar results as class weights with SVC(kernel="linear").

I'll have a look at what we are doing in SVC now.

@amueller
Member

Hm there is no explanation of the class weights in the liblinear docs. I have no idea why the behavior of liblinear and libsvm is different here. It seems like we do exactly the same thing on the scikit-learn side.
@larsmans do you have any idea?

@fannix
Contributor
fannix commented Nov 30, 2012

I think the class weight is the same as LibSVM's, as described in the paper below:

http://jmlr.csail.mit.edu/papers/volume9/fan08a/fan08a.pdf


@amueller
Member
amueller commented Dec 1, 2012

@fannix from the docs it looks like it would be the same. But then why is the effect different? Maybe somewhere in the ova and ovo the class indices are switched.... uh oh starting to feel a bit guilty now... maybe I broke this...

@amueller
Member
amueller commented Dec 1, 2012

I'm still bisecting but it seems that I reversed the behavior of SVC, but not LinearSVC when I messed with the signs / class ordering. Which would explain why they are inconsistent now.

@amueller
Member
amueller commented Dec 1, 2012

Turns out I'm really bad with git bisect :(

@amueller
Member
amueller commented Dec 1, 2012

Any help with finding out when the behavior of SVC changed and any idea what to do about it would be welcome.

@pprett
Member
pprett commented Dec 4, 2012

@amueller I checked the liblinear source code: it only supports class weights if the solver is not one of::

param->solver_type == L2R_L2LOSS_SVR ||
param->solver_type == L2R_L1LOSS_SVR_DUAL ||
param->solver_type == L2R_L2LOSS_SVR_DUAL

The else branch in linear.cpp #2344 is the relevant part: here the weighted_C array gets created using the class weights and the C parameter (see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/liblinear/linear.cpp#L2389 )

when you print weighted_C in the above example you'll get the following output::

0:0.9000
1:0.1000

the routine train_one implements binary classification; the last two arguments Cp and Cn are the weighted Cs for the positive and negative class, respectively.

For binary classification with logistic loss the invocation of train_one looks as follows::

train_one(&sub_prob, param, &model_->w[0], weighted_C[0], weighted_C[1]);

(see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/liblinear/linear.cpp#L2443)

So the weighted C of the positive class is weighted_C[0] and that of the negative class is weighted_C[1]; this is wrong, it should be the other way around.
If you switch the indices 0 and 1, the results should be OK.

The question is: does liblinear sort the class labels in ascending order and pick the first one as the positive class? If so, that's the opposite of what sklearn does...

For OVA and Crammer & Singer this is not an issue.
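The index handling described here can be mirrored in a few lines of Python; weighted_C, Cp, and Cn follow the names in linear.cpp, but this is only an illustration of the swap, not the real solver:

```python
C = 1.0
class_weight = {0: 0.9, 1: 0.1}

# liblinear builds weighted_C[i] = class_weight[label_i] * C, in label order.
weighted_C = [class_weight[0] * C, class_weight[1] * C]

# Buggy binary call: Cp (positive class) taken from index 0, Cn from
# index 1, even though label 1 is treated as the positive class.
Cp_buggy, Cn_buggy = weighted_C[0], weighted_C[1]

# Fixed call: swap the indices so each class gets its own weighted C.
Cp_fixed, Cn_fixed = weighted_C[1], weighted_C[0]

print(Cp_buggy, Cn_buggy)   # 0.9 0.1
print(Cp_fixed, Cn_fixed)   # 0.1 0.9
```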


@amueller
Member
amueller commented Dec 4, 2012

Thanks for investigating!
Should we just switch the indices there?
I switched some indices in LibSVM to fix the sign of the decision function, and I think this is the reason why the two now behave differently.
So is it right that a large C -> stronger weight for the positive class? That means the current docstring is confusing at best, in particular because it is the same for LibSVM and LibLinear, which have opposite behaviors...

@fannix
Contributor
fannix commented Dec 4, 2012
import pylab as pl
import sklearn
from sklearn import linear_model, svm
import numpy as np
from sklearn import datasets

X, y = datasets.make_classification(n_samples=100, n_features=2, n_redundant=0)
pl.scatter(X[:, 0], X[:, 1], c=y)

clr0 = linear_model.LogisticRegression()
clr0.fit(X, y)
clr0.predict(X).sum()
w = clr0.coef_[0]
a = -w[0] / w[1]                         # slope of the decision boundary
xx = np.linspace(-5, 5)
yy = a * xx - clr0.intercept_[0] / w[1]  # boundary: w[0]*x + w[1]*y + b = 0
pl.plot(xx, yy, 'k--', label='no weights')

clr1 = linear_model.LogisticRegression(class_weight={0: 0.9, 1: 0.1})
clr1.fit(X, y)
w1 = clr1.coef_[0]
a1 = -w1[0] / w1[1]
xx1 = np.linspace(-5, 5)
yy1 = a1 * xx1 - clr1.intercept_[0] / w1[1]  # use a1 (slope of the weighted model), not a
pl.plot(xx1, yy1, 'k-', label='with weights')

pl.legend()

pl.show()

This is the plot showing the imbalance problem. Hope it helps. I am also trying to make sense of the class weights...

@fannix
Contributor
fannix commented Dec 4, 2012

I tried a couple of classifiers. It turns out LinearSVC and LogisticRegression behave the same, and both behave differently from SVC(kernel="linear").
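A sketch of that comparison with the current API (after #1491 landed, all three should agree in direction; the dataset and weights here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=100, random_state=0)
cw = {0: 0.9, 1: 0.1}  # penalize class-0 mistakes nine times harder

counts = {}
for build in (lambda w: LogisticRegression(class_weight=w),
              lambda w: LinearSVC(class_weight=w),
              lambda w: SVC(kernel="linear", class_weight=w)):
    plain = build(None).fit(X, y).predict(X).sum()      # unweighted baseline
    weighted = build(cw).fit(X, y).predict(X).sum()     # with class weights
    counts[type(build(None)).__name__] = (plain, weighted)

for name, (plain, weighted) in counts.items():
    # Each classifier should predict no more 1s once class 0 is up-weighted.
    print(name, plain, weighted)
```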

@fannix
Contributor
fannix commented Dec 4, 2012

According to the LibSVM guide: http://pyml.sourceforge.net/doc/howto.pdf

Assuming n+ (n-) is the number of positive (negative) examples, then

C+ / C- = n- / n+

Therefore I don't think a large C+ leads to a stronger weight for the positive class.
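This n-/n+ heuristic is essentially what scikit-learn's class_weight='balanced' option computes (weights proportional to n_samples / (n_classes * n_c)); a quick check, assuming a current scikit-learn:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 80 + [1] * 20)   # imbalanced: n- = 80, n+ = 20

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights)                      # the minority class gets the larger weight

# The ratio matches the C+ / C- = n- / n+ rule quoted above.
print(weights[1] / weights[0])      # 80 / 20 = 4.0
```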

@pprett
Member
pprett commented Dec 4, 2012

I think we have to be careful: first, I'm referring to Liblinear, not LibSVM; second, I think the semantics of C change depending on the solver. Consider the solver L2R_LR (logistic regression in the primal): see
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/liblinear/linear.cpp#L2257
and
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/liblinear/linear.cpp#L55,
respectively. First, Cp and Cn are copied into an array C of size n_samples; then C[i] multiplies the gradient (or the objective value of the loss function).


@amueller
Member

Ok this is worse than I thought as this means the 'auto' parameter does completely crazy things?!
I have to check what happens in #1464.

@amueller
Member

OK, sorry for the long delay, getting back to it now.
I think the class_weight parameter should do the same thing as in other classifiers, i.e. large means that a class is more likely. This is the opposite to what LibSVM and LibLinear do "natively", as @fannix pointed out.
This is the same as what LibSVM does natively. Large C_+ reduces the number of false negatives, i.e. more samples are classified as being positive. This is the same semantics as sklearn has!

I think my approach will be to write tests that check the expected behavior, and put in some switches so it does what I think it should. The signs and the 0 <-> 1 switch are unfortunately already somewhat messed up :(
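A direction-only test of the kind described might look like this (hypothetical sketch; it asserts the sign of the effect rather than exact counts):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
n_base = LogisticRegression().fit(X, y).predict(X).sum()

# Up-weighting class 0 should never increase the number of predicted 1s,
# and up-weighting class 1 should never decrease it.
n_heavy0 = LogisticRegression(class_weight={0: 10.0, 1: 1.0}).fit(X, y).predict(X).sum()
n_heavy1 = LogisticRegression(class_weight={0: 1.0, 1: 10.0}).fit(X, y).predict(X).sum()

assert n_heavy0 <= n_base <= n_heavy1
print("direction invariants hold")
```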

@amueller
Member

Corrected my statement above. Larger C_i means more samples are classified as class i!

@amueller
Member

There is a fix in #1491.

@ShusenLiu
Author

Thanks for all your hard work ^.^

@amueller
Member

No problem :)
Thanks for reporting and your patience. The fix will be in the next release (which will be soon).

@amueller
Member
amueller commented Jan 3, 2013

Merged #1491 so this is fixed in master now :)

@amueller amueller closed this as completed Jan 3, 2013