[MRG] FIX class_weight='auto' on SGDClassifier by ogrisel · Pull Request #3515 · scikit-learn/scikit-learn

[MRG] FIX class_weight='auto' on SGDClassifier #3515


Closed
ogrisel wants to merge 1 commit into scikit-learn:master from ogrisel:fix-class-weight-auto

Conversation

ogrisel
Member
@ogrisel ogrisel commented Jul 31, 2014

Regression reported in #3485

@ogrisel ogrisel added this to the 0.15.1 milestone Jul 31, 2014
@@ -492,6 +492,27 @@ def test_auto_weight(self):
y_pred = clf.predict(X)
assert_greater(metrics.f1_score(y, y_pred), 0.96)

def test_auto_weight_non_contiguous(self):
Member

perhaps this test should be performed as part of sklearn.utils.estimator_checks.check_class_weight_auto_classifiers

Member Author

Good suggestion. I will do that tomorrow.

@ogrisel
Member Author
ogrisel commented Aug 1, 2014

@jnothman I re-pushed the test as a common estimator check.

@larsmans @arjoly @kastnerkyle any more review? I would like to make a 0.15.1 release today.

Once this is merged, please @MechCoder have a look at why LogisticRegressionCV does not behave like all the other linear models on this toy check.
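
(For context, a common check of this kind roughly fits a classifier on an imbalanced toy problem with and without class_weight='auto' and verifies that the weighting does not hurt the minority class. The sketch below only illustrates that idea; the dataset, estimator and metric are assumptions, not the actual code of sklearn.utils.estimator_checks.check_class_weight_auto_classifiers. Note that 'auto' was later renamed 'balanced'.)

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

# Illustrative sketch only -- the real common check lives in
# sklearn.utils.estimator_checks and differs in its details.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # majority class
               rng.normal(3, 1, size=(10, 2))])   # minority class
y = np.array([0] * 100 + [1] * 10)

plain = SGDClassifier(random_state=0).fit(X, y)
weighted = SGDClassifier(class_weight='auto', random_state=0).fit(X, y)

# A check of this flavor asserts that the 'auto' weighting does not degrade
# (and typically improves) performance on the minority class.
print(f1_score(y, plain.predict(X)), f1_score(y, weighted.predict(X)))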

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 7276460 on ogrisel:fix-class-weight-auto into ad5773c on scikit-learn:master.

@MechCoder
Member

This is a problem that @GaelVaroquaux and I had discussed during the implementation. For LogisticRegressionCV:

  1. If the class_weight given is "auto", then the sample weights are calculated separately for each fold.

Since the distribution of classes is not necessarily the same across all folds, we get a different result.
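
(A minimal sketch of the point, assuming a small helper that mirrors the 'auto' heuristic of weights inversely proportional to class counts, normalized by their mean; it is not the library code. The same declared classes can get different weights depending on which fold the weights are computed on.)

import numpy as np

def auto_class_weights(y, classes):
    # 'auto' heuristic (renamed 'balanced' in later releases): weights
    # inversely proportional to class frequencies in y, normalized so
    # that they average to 1.
    counts = np.array([np.sum(y == c) for c in classes], dtype=float)
    recip = 1.0 / counts
    return recip / recip.mean()

classes = np.array([0, 1])
y_full = np.array([0] * 90 + [1] * 10)       # 9:1 imbalance overall
fold_a = np.array([0] * 45 + [1] * 5)        # fold with the same 9:1 ratio
fold_b = np.array([0] * 48 + [1] * 2)        # fold with a 24:1 ratio

print(auto_class_weights(y_full, classes))   # [ 0.2  1.8]
print(auto_class_weights(fold_a, classes))   # [ 0.2  1.8]
print(auto_class_weights(fold_b, classes))   # [ 0.08  1.92] -- differs on this fold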

@ogrisel
Member Author
ogrisel commented Aug 1, 2014

@MechCoder that makes sense, I will update the inline comment and skip without raising an exception then.

@ogrisel
Member Author
ogrisel commented Aug 1, 2014

done.

@MechCoder
Member

thanks!

@ogrisel
Member Author
ogrisel commented Aug 1, 2014

I merged this fix as I really want to include it in 0.15.1 and I don't want to postpone it further.

@ogrisel
Member Author
ogrisel commented Aug 1, 2014

Backported to 0.15.X as de9de2d.

@ogrisel ogrisel closed this Aug 1, 2014
@simongray

I'm using 0.15.1 and still having this problem:

Traceback (most recent call last):
  File "/Users/simongray/Code/etilbudsavis/offer-sniffer/build_model3.py", line 53, in <module>
    predictor.train(dataset)
  File "/Users/simongray/Code/etilbudsavis/offer-sniffer/offerhandling.py", line 363, in train
    self._classifier_labels
  File "/usr/local/lib/python3.4/site-packages/sklearn/linear_model/stochastic_gradient.py", line 452, in partial_fit
    coef_init=None, intercept_init=None)
  File "/usr/local/lib/python3.4/site-packages/sklearn/linear_model/stochastic_gradient.py", line 333, in _partial_fit
    self.classes_, y)
  File "/usr/local/lib/python3.4/site-packages/sklearn/utils/class_weight.py", line 43, in compute_class_weight
    raise ValueError("classes should have valid labels that are in y")
ValueError: classes should have valid labels that are in y

I'm using partial_fit and all of the classes are included in every method call.

@dsullivan7
Contributor

Can you give a sample of y and X that is causing this problem?

@simongray

The X for each iteration of partial_fit is a 50K-long numpy array of ~7K-dimensional feature vectors with values between 0.0 and 1.0. Imagine this, but with many more zeros:

[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  1.  1.  0.]
 [ 1.  0.  0.  0.  0.]]

The y is a list made up of strings from this set:

['toppings', 'pasta', 'wine', 'nuts', 'cordial', 'bedroom', 'poultry', 'biscuits', 'meat', 'pie', 'baking', 'shoes', 'beans', 'sweets', 'beer', 'pet food', 'coffee', 'rice', 'clothes', 'cake', 'furniture', 'bathroom', 'milk', 'pizza', 'phone', 'juice', 'soft drinks', 'cheese', 'tv', 'spirits', 'spices', 'fruits', 'vegetables', 'seafood', 'bread', 'kitchen', 'pc', 'butter', 'eggs', 'chocolate']

If I use the LabelEncoder, the problem goes away: http://stackoverflow.com/questions/24808821/sgdclassifier-with-class-weight-auto-fails-on-scikit-learn-0-15-but-not-0-14

But I think that shouldn't be necessary? Anyway, it sometimes works, so it's quite puzzling.
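
(For context, the workaround from the linked Stack Overflow question looks roughly like the sketch below; the data and variable names here are placeholders, and, as the later comments in this thread show, it only helps when every chunk happens to contain all of the classes.)

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder

# Rough sketch of the LabelEncoder workaround (placeholder data and names).
all_labels = ['toppings', 'pasta', 'wine']        # full label set, known in advance
le = LabelEncoder().fit(all_labels)
classes = np.arange(len(le.classes_))             # integer class ids 0..n_classes-1

clf = SGDClassifier(class_weight='auto')          # 'auto' as in the 0.15-era API

X_chunk = np.random.rand(6, 10)
y_chunk = le.transform(['pasta', 'wine', 'pasta', 'toppings', 'wine', 'pasta'])
clf.partial_fit(X_chunk, y_chunk, classes=classes)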

@dsullivan7
Contributor

Hmm, that is puzzling and I can't seem to replicate this. It definitely shouldn't be necessary to use the LabelEncoder. To confirm that we're talking about 0.15.1, can you run

 import sklearn
 print(sklearn.__version__) 

for me?

@simongray
>>> print(sklearn.__version__) 
0.15.1

@dsullivan7
Contributor

Excellent. Now, you say it sometimes works? Is there a small example that you can consistently get to throw the error?

@dsullivan7
Contributor

This is what I tried in order to replicate the problem, but I wasn't successful. Do you get an error when you try running this?

from sklearn.linear_model import SGDClassifier
import numpy as np

c = ['toppings', 'pasta', 'wine', 'nuts', 'cordial', 'bedroom', 'poultry', 'biscuits', 'meat', 'pie', 'baking', 'shoes', 'beans', 'sweets', 'beer', 'pet food', 'coffee', 'rice', 'clothes', 'cake', 'furniture', 'bathroom', 'milk', 'pizza', 'phone', 'juice', 'soft drinks', 'cheese', 'tv', 'spirits', 'spices', 'fruits', 'vegetables', 'seafood', 'bread', 'kitchen', 'pc', 'butter', 'eggs', 'chocolate']
c = np.array(c)
X = np.ones((50000, 7000))
r = np.random.randint(0, c.shape[0], 50000)
y = c[r]

clf = SGDClassifier()

for i in range(1000):
    clf.partial_fit(X[i * 1000:i * 1000 + 1000, :], y[i * 1000:i * 1000 + 1000], classes=c)

@simongray

What I can say is that changing the size of the partial_fit dataset changes how likely the error is to occur, e.g. when training with a set of 5K vectors + ground truth I seem to always get the problem, but with a bigger set that includes those same 5K vectors + ground truth I don't. However, if I keep loading a large dataset from a file for partial_fit, I might still run into the problem later on.

I was speculating that sklearn raises an exception in some cases when the chunk being fitted does not include every class in the declared set of classes; that is the only explanation I can think of. So I ran the old code and got the error after training 40K vectors in 5K chunks. Then I ran the code again, this time printing the number of distinct labels in each chunk's ground truth, and it worked just fine until the chunk between 160K and 165K was being fitted:

training offers 0 to 5000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 5000 to 10000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'butter', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 10000 to 15000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'rice', 'sweets', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 15000 to 20000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 20000 to 25000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'butter', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 25000 to 30000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 30000 to 35000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'butter', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'rice', 'sweets', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 35000 to 40000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 40000 to 45000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 45000 to 50000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 50000 to 55000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 55000 to 60000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 60000 to 65000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 65000 to 70000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 70000 to 75000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 75000 to 80000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 80000 to 85000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 85000 to 90000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'rice', 'sweets', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 90000 to 95000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 95000 to 100000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 100000 to 105000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'butter', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 105000 to 110000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'butter', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 110000 to 115000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 115000 to 120000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 120000 to 125000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'butter', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 125000 to 130000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 130000 to 135000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 135000 to 140000
40 ['pc', 'milk', 'poultry', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 140000 to 145000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'butter', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 145000 to 150000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'pasta', 'furniture']
training offers 150000 to 155000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'butter', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 155000 to 160000
40 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'butter', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
training offers 160000 to 165000
39 ['pc', 'poultry', 'milk', 'seafood', 'soft drinks', 'beans', 'biscuits', 'baking', 'juice', 'toppings', 'bread', 'vegetables', 'meat', 'cordial', 'clothes', 'cheese', 'chocolate', 'tv', 'pie', 'phone', 'nuts', 'kitchen', 'spices', 'bathroom', 'sweets', 'rice', 'coffee', 'fruits', 'beer', 'pet food', 'wine', 'pizza', 'bedroom', 'cake', 'eggs', 'spirits', 'shoes', 'furniture', 'pasta']
Traceback (most recent call last):
  File "/Users/simongray/Code/etilbudsavis/offer-sniffer/build_model3.py", line 54, in <module>
    predictor.train(dataset)
  File "/Users/simongray/Code/etilbudsavis/offer-sniffer/offerhandling.py", line 361, in train
    self._classifier_labels
  File "/usr/local/lib/python3.4/site-packages/sklearn/linear_model/stochastic_gradient.py", line 452, in partial_fit
    coef_init=None, intercept_init=None)
  File "/usr/local/lib/python3.4/site-packages/sklearn/linear_model/stochastic_gradient.py", line 333, in _partial_fit
    self.classes_, y)
  File "/usr/local/lib/python3.4/site-packages/sklearn/utils/class_weight.py", line 43, in compute_class_weight
    raise ValueError("classes should have valid labels that are in y")
ValueError: classes should have valid labels that are in y

Every second line here is the output of

included_tags = list(set([offer['tag'] for offer in offers]))
print(len(included_tags), included_tags)

So basically, how many and which classes are included in that specific chunk. My initial guess was right: when a chunk of vectors is fitted using partial_fit and its ground truth doesn't include every possible class, sklearn throws this error. Keep in mind that the classes I supply as the third parameter to partial_fit never change; it's always all 40 labels.

So is this behaviour to be expected? Am I supposed to supply only the classes included in that chunk? Because then I don't see why the third parameter is necessary at all, if not for sklearn to know every possible label in advance.

@simongray

Ah, actually that is precisely the problem. Even when using the LabelEncoder, I still have the problem if the classes argument of partial_fit does not exactly match set(y). So I'm guessing this is the cause then?

I still don't understand why classes is a required parameter in partial_fit() and not in the normal fit() method. In the source code, the docstring for partial_fit says:

        classes : array, shape = [n_classes]
            Classes across all calls to partial_fit.
            Can be obtained by via `np.unique(y_all)`, where y_all is the
            target vector of the entire dataset.
            This argument is required for the first call to partial_fit
            and can be omitted in the subsequent calls.
            Note that y doesn't need to contain all labels in `classes`.

So according to the last line, y doesn't need to include all of the labels in classes, yet partial_fit fails every time y is missing some of them.

@kastnerkyle
Member

y doesn't need to include all the labels. y_all does.

@simongray

Yeah, I know, that's what I'm saying. It's not supposed to throw an error, but it does. Try running this code; it will produce ValueError: classes should have valid labels that are in y:

from sklearn.linear_model import SGDClassifier
import numpy as np

classes = ['a', 'b']

classifier = SGDClassifier(
    shuffle=True,
    class_weight="auto",
    penalty="elasticnet",
    loss="log"
)

# a dataset including datapoints from all possible classes (e.g. 'a' and 'b')
x_1 = np.array([[0,1],[1,1]])
y_1 = ['a', 'b']

# a dataset only including datapoints from one of the classes (e.g. 'a')
x_2 = np.array([[0,1],[1,0]])
y_2 = ['a', 'a']

# first partial_fit call includes all possible classes
classifier.partial_fit(X=x_1, y=y_1, classes=classes)
print('fit dataset 1 with no issues!')

# second partial_fit does not state the possible classes (although stating it changes nothing)
# second call also only includes datapoints labeled 'a'
# this will produce --> ValueError: classes should have valid labels that are in y
classifier.partial_fit(X=x_2, y=y_2)
print('this will never print')

According to the source code, it shouldn't be a problem that y in the second dataset does not include all classes mentioned in classes, yet it throws an error.

edit: made what's happening in the example code more obvious

@kastnerkyle
Member

This code works fine for me. Python 3.4, sklearn dev branch.

@kastnerkyle
Member

What are your versions, etc.? Something is fishy...

@simongray

Running Python 3.4.1 and sklearn 0.15.1 (as mentioned earlier).

@simongray

In the offending sklearn source code I found this:

elif class_weight == 'auto':
    # Find the weight of each class as present in y.
    le = LabelEncoder()
    y_ind = le.fit_transform(y)
    if not all(np.in1d(classes, le.classes_)):
        raise ValueError("classes should have valid labels that are in y")

So I'm pretty sure the arguments in np.in1d(classes, le.classes_) should be switched, and this is the cause of the error. This is from sklearn/utils/class_weight.py, line 44, BTW!
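
(A quick sketch of why the argument order matters, using np.unique(y) as a stand-in for what le.classes_ is after fitting the encoder on y:)

import numpy as np

classes = np.array(['a', 'b'])      # all classes declared up front
y = np.array(['a', 'a'])            # a partial_fit chunk containing only 'a'

# The 0.15.1 check asks "is every declared class present in y?" -- this is
# False for any chunk that is missing a class, hence the ValueError.
print(np.in1d(classes, np.unique(y)).all())   # False

# The intended check is the other way around: "is every label seen in y
# among the declared classes?"
print(np.in1d(np.unique(y), classes).all())   # True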

@dsullivan7
Contributor

I was able to reproduce the error with the code you gave. Looking at it now...

@simongray

Great! Thanks.

@dsullivan7
Contributor

Alrighty, thanks for being patient. This looks like a bug. The line:

if not all(np.in1d(classes, le.classes_)):
    raise ValueError("classes should have valid labels that are in y")

shouldn't exist, because it requires that all of the classes be represented in y, which is definitely not the case (especially for partial_fit). And as a matter of fact, you're right that the opposite should be true (the arguments should be switched). For now, you should be able to get around this bug either by passing all 50K samples into a single fit call or by making your chunks big enough that each chunk includes all labels. (I'm aware that both of these workarounds aren't very good.) If you want, though, you can download the source and change the 'auto' weight branch to this:

elif class_weight == 'auto':
    # Find the weight of each class as present in y.
    le = LabelEncoder()
    class_ind = le.fit_transform(classes)
    y_ind = le.transform(y)

    # inversely proportional to the number of samples in the class
    recip_freq = 1. / np.bincount(y_ind)
    recip_freq.resize(class_ind.shape[0])
    weight = recip_freq[class_ind] / np.mean(recip_freq)

Then you can build from source and this should work. I'll try to get a fix going for this as soon as possible, sorry for the trouble.

@dsullivan7
Contributor

@ogrisel Can this be reopened?
