BUG: Using GridSearchCV with scoring='roc_auc' and GMM as classifier gives IndexError #7598

Closed
Rendiere opened this issue Oct 7, 2016 · 11 comments · Fixed by #12486
Labels: Bug, Easy (well-defined and straightforward way to resolve), help wanted, Sprint

Comments

@Rendiere
Rendiere commented Oct 7, 2016

When performing a grid search with GridSearchCV, using the out-of-the-box 'roc_auc' scoring method and the out-of-the-box GMM classifier from sklearn.mixture.GMM, I get an IndexError.
Code to reproduce:

from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.mixture import GMM

X, y = datasets.make_classification(n_samples=10000, n_features=10, n_classes=2)
# Vanilla GMM model
gmm_model = GMM()
# Standard param grid
param_grid = {'n_components': [1, 2, 3, 4],
              'covariance_type': ['tied', 'full', 'spherical']}
grid_search = GridSearchCV(gmm_model, param_grid, scoring='roc_auc')
# Fit the grid search with this data
grid_search.fit(X, y)

Sorry if the format is incorrect. First time I am posting.

ERROR:
File "*/python2.7/site-packages/sklearn/metrics/scorer.py", line 175, in __call__
y_pred = y_pred[:, 1]
IndexError: index 1 is out of bounds for axis 1 with size 1

@Rendiere
Author
Rendiere commented Oct 7, 2016

Updated to 0.17.1 and the issue persists (after changing GMM to GaussianMixture).

@amueller
Member
amueller commented Oct 7, 2016

The error is strange, but GMM is not a supervised model, so AUC doesn't really make sense.
We might want to raise a better error, though it's hard to detect what's going on here in a sense.
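
If the goal is to tune the mixture itself rather than to classify, one option is to leave `scoring` unset so that GridSearchCV falls back to the estimator's own `score` method, which for GaussianMixture is the average per-sample log-likelihood. The following is a minimal sketch under assumptions not in the original report: it uses the newer `sklearn.model_selection` and `GaussianMixture` APIs, a smaller dataset, `n_redundant=0`, and `cv=3`.

```python
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# Labels are discarded: the mixture is fitted and scored without y.
# n_redundant=0 is an assumption made here to keep the covariances well conditioned.
X, _ = make_classification(n_samples=300, n_features=10, n_classes=2,
                           n_redundant=0, random_state=0)

param_grid = {"n_components": [1, 2, 3, 4],
              "covariance_type": ["tied", "full", "spherical"]}

# No scoring argument: GridSearchCV uses GaussianMixture.score,
# i.e. the mean log-likelihood of the held-out samples.
grid_search = GridSearchCV(GaussianMixture(random_state=0), param_grid, cv=3)
grid_search.fit(X)

print(grid_search.best_params_)
```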

@jnothman
Member
jnothman commented Oct 8, 2016

Do you really mean updated to 0.17.1, not 0.18?

On 8 October 2016 at 03:28, Andreas Mueller wrote:

> The error is strange, but GMM is not a supervised model, so AUC doesn't really make sense.



@Spikhalskiy
Contributor

Getting the same error with

    cv = GridSearchCV(
        estimator=DecisionTreeClassifier(),
        param_grid={
            'max_depth': [20],
            'class_weight': ['auto'],
            'min_samples_split': [100],
            'min_samples_leaf': [30],
            'criterion': ['gini']
        },
        scoring='roc_auc',
        n_jobs=-1,
    )

Log:

__call__(self=make_scorer(roc_auc_score, needs_threshold=True), clf=DecisionTreeClassifier(class_weight='auto', crit...resort=False, random_state=None, splitter='best'), X=memmap([[ 2.14686672e-01, 0.00000000e+00, 0...000000e+00, 1.00000000e+00, 0.00000000e+00]]), y=memmap([0, 0, 0, ..., 0, 0, 0]), sample_weight=None) 
170 
171 except (NotImplementedError, AttributeError): 
172 y_pred = clf.predict_proba(X) 
173 
174 if y_type == "binary": 
--> 175 y_pred = y_pred[:, 1] 
y_pred = array([[ 1.], 
[ 1.], 
[ 1.], 
..., 
[ 1.], 
[ 1.], 
[ 1.]]) 
176 elif isinstance(y_pred, list): 
177 y_pred = np.vstack([p[:, -1] for p in y_pred]).T 
178 
179 if sample_weight is not None: 

IndexError: index 1 is out of bounds for axis 1 with size 1 

@jnothman
Member

It looks like you might have been training your DecisionTreeClassifier on a single class there. What does the y you pass to GridSearchCV.fit look like?

Yes, this error message is not very helpful.
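
The single-class case can be reproduced outside of GridSearchCV. The sketch below uses synthetic data (an assumption, not the poster's dataset): a DecisionTreeClassifier fitted on labels that contain only one class returns a `predict_proba` array with a single column, so selecting column 1 fails in the same way as line 175 of the scorer.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y_single = np.zeros(50, dtype=int)  # only class 0 is present

clf = DecisionTreeClassifier(random_state=0).fit(X, y_single)
proba = clf.predict_proba(X)  # one column per class seen during fit

print(clf.classes_, proba.shape)

try:
    proba[:, 1]  # what the scorer does for a binary target
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```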

@sedeh
sedeh commented Apr 11, 2018

I ran into this error, and the single-class explanation from @jnothman is correct. Here's what my data looks like.

X_test.shape: (750, 34)
y_test.shape: (750,)
y_test value_counts: True    750

So my data contains a single class: True.

Perhaps a more descriptive error message would help, something along the lines of your comment: "It looks like you might have a single class." Looking at line 175 long enough may give that away too.
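
One way such a message could look is sketched below. `positive_class_scores` is a hypothetical helper written for illustration only; it is not scikit-learn's code and not necessarily what the eventual fix does. It checks the shape of the predicted probabilities before selecting the positive-class column.

```python
import numpy as np


def positive_class_scores(y_pred, classes):
    """Hypothetical helper: return the positive-class column or fail clearly."""
    y_pred = np.asarray(y_pred)
    if y_pred.ndim != 2 or y_pred.shape[1] < 2:
        raise ValueError(
            "Expected probabilities for 2 classes, but got an array of shape "
            f"{y_pred.shape}; the estimator was fitted on classes {classes}. "
            "It looks like you might have a single class in the training data."
        )
    return y_pred[:, 1]


# A single-column output, like the one shown in the log above.
single_column = np.ones((3, 1))
try:
    positive_class_scores(single_column, classes=[True])
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```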

@liuwanfei

Can someone help with this issue? I do not know how to fix it still. Thank you!

@liuwanfei

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# X, y: the poster's data (not shown in the thread)
clf = ExtraTreesClassifier()
my_cv = TimeSeriesSplit(n_splits=5)  # time series split

param_grid = {
    'n_estimators': [100, 300, 500, 700, 1000, 2000, 5000],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': range(2, 20, 2),
    'bootstrap': [True, False]
}

clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=my_cv, n_jobs=-1, scoring='roc_auc', return_train_score=False)
clf.fit(X, y)

@Spikhalskiy
Contributor

@liuwanfei you likely have just one class in y; at least, that looks like it was the issue for most people in this thread. There should be a clear error message instead of this exception.
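
With a time-ordered split, y can contain both classes overall while individual folds contain only one. The check below is a sketch on made-up labels (an assumption: the first half of the series is all class 0 and the second half all class 1); it lists the folds whose training or test part has a single class.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Made-up, time-ordered labels: 30 samples of class 0, then 30 of class 1.
y = np.array([0] * 30 + [1] * 30)
X = np.arange(60).reshape(-1, 1)

cv = TimeSeriesSplit(n_splits=5)

single_class_train_folds = []
single_class_test_folds = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    if len(np.unique(y[train_idx])) < 2:
        single_class_train_folds.append(fold)
    if len(np.unique(y[test_idx])) < 2:
        single_class_test_folds.append(fold)

print("train folds with one class:", single_class_train_folds)
print("test folds with one class:", single_class_test_folds)
```

Running the same loop on the real y before calling GridSearchCV.fit shows whether any fold is affected.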

@Rendiere
Author
Rendiere commented Jul 13, 2018

Yes, what the people above have mentioned is correct - if you train with one class you will get this error.

However, if you have a look at my code, I generated a dataset with 2 classes, so that was not the case for me. What was causing the issue is that my param grid was set up with a subtle error. Remember that the "roc_auc" scorer uses probabilities as inputs to create the ROC curve, and in my example above, my parameter space for n_components was [1,2,3,4].

If you think about it, a GMM with one component will output only one probability per sample. Thus, the output of model.predict_proba will be an array with a single column, which is why an IndexError occurs on y_pred[:, 1].

So, for your case, check whether one of the parameter combinations results in the classifier being constrained to predicting a single class.

P.S I only realised this now, almost 2 years after the post. lol
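
The one-component case can be checked directly. This sketch assumes the newer `GaussianMixture` class rather than the deprecated `GMM`, and a smaller dataset than in the original report.

```python
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

X, y = make_classification(n_samples=200, n_features=10, n_classes=2,
                           random_state=0)

# y is ignored: the mixture is unsupervised, so the number of columns in
# predict_proba equals n_components, not the number of classes in y.
gmm = GaussianMixture(n_components=1, random_state=0).fit(X)
proba = gmm.predict_proba(X)

print(proba.shape)

try:
    proba[:, 1]  # same indexing as the 'roc_auc' scorer for a binary y
    message = None
except IndexError as exc:
    message = str(exc)

print(message)
```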

@amueller added the Easy (well-defined and straightforward way to resolve) label Jul 14, 2018
@AMDonati
AMDonati commented Sep 29, 2018

I and @reshamas are working on it
