BUG CalibratedClassifierCV when train has fewer classes than test · Issue #17827 · scikit-learn/scikit-learn


Closed
lucyleeow opened this issue Jul 3, 2020 · 3 comments

Comments

@lucyleeow
Member
lucyleeow commented Jul 3, 2020

Describe the bug

This may be very uncommon and not a big problem.

For CalibratedClassifierCV, if the training subset of a CV split contains fewer classes than the test subset, e.g.:

y_test  = [1 1 2 2 3 3]
y_train = [1 1 1 2 2 2]

then the classifier fit on this split (one member of the ensemble) will only be fit on 2 classes. At prediction time, the output of this classifier will have shape (n_samples, 2), as it can only predict 2 classes. The third class's probabilities will be filled with 0's, see this line:

proba = np.zeros((X.shape[0], n_classes))

resulting in a lower probability than expected. We also average the probabilities from the ensemble of classifier/calibrator pairs by dividing by the number of pairs in the ensemble, see:

mean_proba /= len(self.calibrated_classifiers_)

This also gives a lower probability than expected, as one member of the ensemble contributed all 0's for that class.
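To make the averaging effect concrete, here is a minimal NumPy sketch (the per-fold probabilities are made up for illustration) of how a zero-filled column from one fold drags down the averaged probability:

```python
import numpy as np

# Hypothetical per-fold probabilities for one sample with 3 classes.
# Fold 1's classifier saw all 3 classes in its train split:
fold1 = np.array([0.2, 0.3, 0.5])
# Fold 2's classifier only saw classes 1 and 2, so the missing
# third class's column was zero-filled:
fold2 = np.array([0.6, 0.4, 0.0])

# Averaging over the ensemble members, as CalibratedClassifierCV does:
mean_proba = (fold1 + fold2) / 2
print(mean_proba)
# Class 3's probability is halved to 0.25, even though the one fold
# that actually saw class 3 estimated it at 0.5.
```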

Steps/Code to Reproduce

I am not sure of the best way to reproduce, but the code below results in a lower proba for class 3 than you would expect:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.calibration import CalibratedClassifierCV


X, _ = make_classification(n_samples=12, n_features=4, n_classes=3,
                           n_clusters_per_class=1, random_state=7)
y = [1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3]
# Make class 3 easier to predict
X[-2:, :] = np.abs(X[-2:, :]) * 10
clf = RandomForestClassifier()
kfold = KFold(n_splits=2)
calb_clf = CalibratedClassifierCV(clf, cv=kfold)
calb_clf.fit(X, y)
# Predict the last sample in X
calb_clf.predict_proba(X[-1, :].reshape((1, -1)))

Gives:
[[0.31342392 0.61173178 0.0748443 ]]

I would suggest maybe adding a warning whenever a train subset does not contain all the classes present in y. Again, I am not sure how much of a problem this is, as it is probably uncommon for a train subset to contain fewer classes than the full set present in y.
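A minimal sketch of what such a warning could look like; `check_train_classes` is a hypothetical helper for illustration, not part of scikit-learn:

```python
import warnings
import numpy as np

def check_train_classes(y, train_idx):
    """Warn if a CV train subset is missing classes present in y.

    Hypothetical sketch of the warning suggested above, not
    scikit-learn's actual implementation.
    """
    y = np.asarray(y)
    all_classes = np.unique(y)
    train_classes = np.unique(y[train_idx])
    missing = np.setdiff1d(all_classes, train_classes)
    if missing.size:
        warnings.warn(
            f"Train subset is missing classes {missing.tolist()}; "
            "their predicted probabilities will be 0 for this fold."
        )
    return missing

# Second KFold(2) split from the example above: train on the first
# half of y, which never contains class 3.
y = [1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3]
missing = check_train_classes(y, train_idx=np.arange(6))
```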

cc @NicolasHug @ogrisel who have been reviewing other calibrator stuff.

@NicolasHug
Member

I'm not sure we want to care about this, because users should use StratifiedKFold, and in general I wouldn't expect calibration to be useful with so few examples, which is when such discrepancies might happen.

@lucyleeow
Member Author

No problem, happy to close! It is indeed a very edge case.

@lucyleeow
Member Author

@NicolasHug thinking about this more, we specifically amended this function to allow cases where the train and test subsets have different numbers of classes (#7799), but I don't think we deal with it well when it occurs...

I feel like we shouldn't have allowed this if we aren't going to 'care' about it...?
