[MRG] FIX class_weight='auto' on SGDClassifier #3515
Conversation
```diff
@@ -492,6 +492,27 @@ def test_auto_weight(self):
         y_pred = clf.predict(X)
         assert_greater(metrics.f1_score(y, y_pred), 0.96)

+    def test_auto_weight_non_contiguous(self):
```
perhaps this test should be performed as part of sklearn.utils.estimator_checks.check_class_weight_auto_classifiers
Good suggestion. I will do that tomorrow.
@jnothman I re-pushed the test as a common estimator check. @larsmans @arjoly @kastnerkyle any more review? I would like to make a 0.15.1 release today. Once this is merged, please @MechCoder have a look at why
This is a problem that @GaelVaroquaux and I had discussed during the implementation. For Log Reg CV, since the distribution of classes is not necessarily the same across all folds, we get a different result.
@MechCoder that makes sense, I will update the inline comment and skip without raising an exception then.
done.
thanks!
I merged this fix as I really want to include it in 0.15.1 and I don't want to postpone it further.
backported in 0.15.X as de9de2d
I'm using 0.15.1 and still having this problem:
I'm using partial_fit and all of the classes are included in every method call.
Can you give a sample of y and X that is causing this problem?
The x for each iteration of partial_fit is a 50K long numpy array of ~7K feature vectors between 0.0 and 1.0. Imagine this, but with many more zeros:

The y is a list made up of strings from this set: ['toppings', 'pasta', 'wine', 'nuts', 'cordial', 'bedroom', 'poultry', 'biscuits', 'meat', 'pie', 'baking', 'shoes', 'beans', 'sweets', 'beer', 'pet food', 'coffee', 'rice', 'clothes', 'cake', 'furniture', 'bathroom', 'milk', 'pizza', 'phone', 'juice', 'soft drinks', 'cheese', 'tv', 'spirits', 'spices', 'fruits', 'vegetables', 'seafood', 'bread', 'kitchen', 'pc', 'butter', 'eggs', 'chocolate']

If I use the LabelEncoder, the problem goes away: http://stackoverflow.com/questions/24808821/sgdclassifier-with-class-weight-auto-fails-on-scikit-learn-0-15-but-not-0-14

But I think that shouldn't be necessary? Anyway, it sometimes works, so it's quite puzzling.
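For reference, a minimal sketch of the LabelEncoder workaround described above (the label strings here are illustrative, not the actual dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Map string labels to contiguous integer codes before fitting.
le = LabelEncoder()
y = ['wine', 'cheese', 'wine', 'pasta']
y_encoded = le.fit_transform(y)

# le.classes_ holds the sorted unique labels; y_encoded indexes into it.
print(le.classes_)   # ['cheese' 'pasta' 'wine']
print(y_encoded)     # [2 0 2 1]

# The original string labels can be recovered at prediction time.
print(le.inverse_transform(y_encoded))
```

The encoded integer labels can then be passed to `partial_fit` in place of the raw strings.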
Hmm, that is puzzling and I can't seem to replicate this. It definitely shouldn't be necessary to use the LabelEncoder. To confirm that we're talking about 0.15.1, can you run

for me?
Excellent, now you say it sometimes works? Is there a small example that you can consistently get to throw the error?
This is what I tried to do to replicate but wasn't successful. Do you get an error when you try running this?
What I can say is that changing the size of the partial_fit dataset makes the error more likely to occur: e.g. when training with a set of 5K vectors plus ground truth I seem to always get the problem, but with a bigger set that includes those same 5K vectors plus ground truth I don't. However, if I keep loading a large dataset from a file for partial_fit, I might run into a problem later on.

I was speculating that sklearn raises an exception in some cases when the dataset being fitted does not include every class that is in the set of classes; that is the only explanation I can think of. So I ran the old code and got the error after training 40K vectors in 5K chunks. Then I ran the code again, outputting the length of the labels in the ground truth table, and it worked just fine until the chunk between 160K and 165K was being fitted:

Every second line here is the output of

So basically, how many and which classes are included in that specific chunk. My initial guess was right: when a chunk of vectors is fitted using partial_fit and it doesn't include every possible class in the ground truth, sklearn will throw this error. Keep in mind that the classes I supply as the third parameter to partial_fit never change; it's always all 40 labels.

So is this behaviour to be expected? Am I supposed to only supply the classes included in that chunk? Because then I don't get why the third parameter is necessary at all, if not for sklearn to know every possible label in advance.
Ah, actually that is precisely the problem. When using the LabelEncoder I still have the problem if the classes argument of partial_fit does not precisely match set(y). So I'm guessing this is the problem then? I still don't understand why classes is a required parameter in partial_fit() and not in the normal fit() method. In the source code, the docstring for partial_fit says:

So according to the last line, y doesn't need to include all of the labels in classes, yet partial_fit fails every time this is the case.
y doesn't need to include all the labels. y_all does.
Yeah, I know, that's what I'm saying. It's not supposed to throw an error, but it does. Try running this code; it will produce `ValueError: classes should have valid labels that are in y`:

```python
from sklearn.linear_model import SGDClassifier
import numpy as np

classes = ['a', 'b']
classifier = SGDClassifier(
    shuffle=True,
    class_weight="auto",
    penalty="elasticnet",
    loss="log"
)

# a dataset including datapoints from all possible classes (e.g. 'a' and 'b')
x_1 = np.array([[0, 1], [1, 1]])
y_1 = ['a', 'b']

# a dataset only including datapoints from one of the classes (e.g. 'a')
x_2 = np.array([[0, 1], [1, 0]])
y_2 = ['a', 'a']

# first partial_fit call includes all possible classes
classifier.partial_fit(X=x_1, y=y_1, classes=classes)
print('fit dataset 1 with no issues!')

# second partial_fit does not state the possible classes (although stating it changes nothing)
# second call also only includes datapoints labeled 'a'
# this will produce --> ValueError: classes should have valid labels that are in y
classifier.partial_fit(X=x_2, y=y_2)
print('this will never print')
```

According to the source code, it shouldn't be a problem that y in the second dataset does not include all classes mentioned in classes, yet it throws an error.

edit: made what's happening in the example code more obvious
This code works fine for me. Python 3.4, sklearn dev branch.
What are your versions, etc.? Something is fishy...
Running python 3.4.1 and sklearn 0.15.1 (as mentioned earlier).
In the offending sklearn source code I found this:

```python
elif class_weight == 'auto':
    # Find the weight of each class as present in y.
    le = LabelEncoder()
    y_ind = le.fit_transform(y)
    if not all(np.in1d(classes, le.classes_)):
        raise ValueError("classes should have valid labels that are in y")
```

So I'm pretty sure the arguments in np.in1d(classes, le.classes_) should be switched, and this is the cause of the error. This was from "/sklearn/utils/class_weight.py", line 44 BTW!
I was able to reproduce the error with the code you gave. Looking at it now...
Great! Thanks.
Alrighty, thanks for being patient. This looks like a bug. The line:

```python
if not all(np.in1d(classes, le.classes_)):
    raise ValueError("classes should have valid labels that are in y")
```

shouldn't exist, because it suggests that all of the classes should be represented in y, which is definitely not the case (especially for partial_fit). And as a matter of fact, you're right that the opposite should be true (the arguments should be switched).

For now, you should be able to get around this bug either by passing all 50k samples into a fit method or by making your chunks big enough so that each chunk includes all labels. (I'm aware that both of these work-arounds aren't very good.)

If you want though, you can download the source and change the 'auto' weight method to this:

```python
elif class_weight == 'auto':
    # Find the weight of each class as present in y.
    le = LabelEncoder()
    class_ind = le.fit_transform(classes)
    y_ind = le.transform(y)
    # inversely proportional to the number of samples in the class
    recip_freq = 1. / np.bincount(y_ind)
    recip_freq.resize(class_ind.shape[0])
    weight = recip_freq[class_ind] / np.mean(recip_freq)
```

Then you can build from source and this should work. I'll try to get a fix going for this as soon as possible, sorry for the trouble.
@ogrisel Can this be reopened? |
Regression reported in #3485