Description
The class_weight parameter sets a different misclassification weight for each class. For example, in a 0/1 classification problem, if we set class_weight={0: 0.95, 1: 0.05}, we expect the classifier to favor predicting 0, since misclassifying a true 0 as 1 is heavily penalized.
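For concreteness, here is a minimal sketch of the weighting I have in mind, assuming the weights simply scale each sample's log-loss term by the weight of its true class (my reading of the parameter, not sklearn's actual internals):

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, class_weight):
    # y_true: array of 0/1 labels; p_pred: predicted P(y=1) per sample.
    # Each sample's loss term is scaled by the weight of its true class,
    # so with class_weight={0: 0.95, 1: 0.05} an error on a true 0
    # costs 19x as much as an error on a true 1.
    w = np.array([class_weight[y] for y in y_true])
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)  # avoid log(0)
    return np.sum(w * -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```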
But the LogisticRegression class seems to get this wrong:
```python
from sklearn import datasets
from sklearn import svm
from sklearn import linear_model

# 100 samples, half labeled 0, the rest labeled 1
X, Y = datasets.make_classification()
Y.sum()
>>> 50

# balanced LR classifier
clr0 = linear_model.LogisticRegression()
clr0.fit(X, Y)
clr0.score(X, Y)
>>> 0.84999999999999998
clr0.predict(X).sum()
>>> 49

# imbalanced LR classifier
clr1 = linear_model.LogisticRegression(class_weight={0: 0.9, 1: 0.1})
clr1.fit(X, Y)
clr1.score(X, Y)
>>> 0.63
clr1.predict(X).sum()
>>> 85
```
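As a cross-check that avoids the class_weight code path entirely, a 9:1 preference for class 0 can be emulated by oversampling (a rough sketch; repeating each class-0 sample 9 times approximates multiplying its loss term by 9):

```python
import numpy as np

# emulate class_weight={0: 0.9, 1: 0.1} by repeating each class-0
# sample 9 times, multiplying its contribution to the loss by 9
reps = np.where(Y == 0, 9, 1)
X_over = np.repeat(X, reps, axis=0)
Y_over = np.repeat(Y, reps)

clr_over = linear_model.LogisticRegression()
clr_over.fit(X_over, Y_over)
# with class 0 upweighted this way, far fewer predicted 1s are expected
clr_over.predict(X).sum()
```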
The imbalanced classifier clr1 is supposed to classify more of the data as label 0, but it actually predicts far more of the data as 1. When we choose another classifier, say an SVM, the behavior seems reasonable:
```python
# balanced SVM classifier
clr2 = svm.SVC()
clr2.fit(X, Y)
clr2.score(X, Y)
>>> 0.95999999999999996
clr2.predict(X).sum()
>>> 46

# imbalanced SVM classifier
clr3 = svm.SVC(class_weight={0: 0.6, 1: 0.4})
clr3.fit(X, Y)
clr3.score(X, Y)
>>> 0.84999999999999998
clr3.predict(X).sum()
>>> 35.0

# another imbalanced SVM classifier
clr4 = svm.SVC(class_weight={0: 0.9, 1: 0.1})
clr4.fit(X, Y)
clr4.score(X, Y)
>>> 0.5
clr4.predict(X).sum()
>>> 0.0
```
When class_weight[0] vs class_weight[1] is 5:5 (the default, equal weighting), SVC predicts roughly half of the data as 0.
When class_weight[0] vs class_weight[1] is 6:4, SVC predicts more of the data as 0.
When class_weight[0] vs class_weight[1] is 9:1, SVC predicts all of the data as 0.
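To see the trend side by side, a small sweep over the weight given to class 0 (reusing X, Y and the imports from above; the grid of ratios is just illustrative) prints how many samples each estimator labels 1:

```python
# sweep the weight on class 0; as w0 grows, a correct implementation
# should label fewer and fewer samples as 1
for w0 in [0.5, 0.6, 0.7, 0.8, 0.9]:
    cw = {0: w0, 1: 1.0 - w0}
    lr = linear_model.LogisticRegression(class_weight=cw).fit(X, Y)
    sv = svm.SVC(class_weight=cw).fit(X, Y)
    print(cw, lr.predict(X).sum(), sv.predict(X).sum())
```

If class_weight behaved consistently, both counts should shrink toward 0 as w0 grows.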
Is this a bug?