8000 order of returned probabilites unclear for cross_val_predict with method=predict_proba · Issue #7863 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
8000

order of returned probabilites unclear for cross_val_predict with method=predict_proba #7863

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
simonm3 opened this issue Nov 13, 2016 · 9 comments · Fixed by #7889
Closed
Labels
Bug Moderate Anything that requires some knowledge of conventions and best practices

Comments

@simonm3
Copy link
simonm3 commented Nov 13, 2016

Cross_val_predict has a new method parameter which is typically set to "predict_proba" to return probabilities for each class.

However the order of the classes returned is unclear. Either self.classes_ needs to be set; or the results need to be returned in a predictable order. Otherwise we have a list of probabilities for each class but no way to know which column relates to which class.

@jnothman jnothman added the Bug label Nov 14, 2016
@jnothman
Copy link
Member

Yes, I suppose this (and disagreements in the set of classes between splits) was overlooked.

(For internal models, I'm fairly sure that classes_ is always alphabetically ordered.)

@jnothman jnothman added Moderate Anything that requires some knowledge of conventions and best practices Need Contributor labels Nov 14, 2016
@dalmia
Copy link
Contributor
dalmia commented Nov 15, 2016

Can I start working on this?

@dalmia
Copy link
Contributor
dalmia commented Nov 16, 2016

After running a few tests on the iris dataset using LogisticRegression as the estimator, it became clear that the order of the classes appearing did not matter in the final result. This gives a simple and clear explanation.

@jnothman
Copy link
Member

Iris aside, it's possible to create a cross-validation strategy that will
land up with only a subset of the classes being present in the training
data for a particular training set. This should be handled.

Firstly the documentation should point out that the classes will be in
sorted order, and this should, in theory at least, be confirmed, reordering
if necessary, from the underlying estimators.

On 16 November 2016 at 11:41, Aman Dalmia notifications@github.com wrote:

After running a few tests on the iris datasets using LogisticRegression as
the estimator, it became clear that the order of the classes appearing did
not matter in the final result. This
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_log_proba
gives a simple and clear explanation.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz65Z6-gNAXqLB0OC8OTlEeBKG9pBeks5q-lFNgaJpZM4Kw2SU
.

@dalmia
Copy link
Contributor
dalmia commented Nov 16, 2016

True, we could miss out on a few observations in cross-validation. As for the confirmation of the classes being sorted, since these lines of the cross_val_predict function already ensure that only estimators capable of calculating the probability are passed 'predict_proba' as the method, and all such estimators return the classes in sorted order, do we need any other mode of confirmation?
Please correct me if I misunderstood what you intended to say.

 if not callable(getattr(estimator, method)):
        raise AttributeError('{} not implemented in estimator'
                             .format(method))

@jnothman
Copy link
Member

Well, we don't promise that all return classes in sorted order; all store
classes in .classes_

On 16 November 2016 at 14:44, Aman Dalmia notifications@github.com wrote:

True, we could miss out on a few observations in cross-validation. As for
the confirmation of the classes being sorted, since these lines of the
cross_val_predict function already ensure that only estimators capable of
calculating the probability are passed 'predict_proba' as the method, and
all such estimators return the classes in sorted order, do we need any
other mode of confirmation?
Please correct me if I misunderstood what you intended to say.

if not callable(getattr(estimator, method)):
raise AttributeError('{} not implemented in estimator'
.format(method))


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz643iTLmmU5CBwLaMK5h6Iwqak0f3ks5q-nwFgaJpZM4Kw2SU
.

@dalmia
Copy link
Contributor
dalmia commented Nov 16, 2016

Then how about, in the _fit_and_predict function, I retrieve the classes_ attribute, make a dictionary of the column indices and the class labels, sort them on the class labels and apply the transformation to the predictions, thereby ensuring that a sorted result is returned?

@jnothman
Copy link
Member

That sort of thing, yes

On 16 November 2016 at 16:23, Aman Dalmia notifications@github.com wrote:

Then how about, in the fit_and_predict function, I retrieve the 'classes'
attribute, make a dictionary of the column indices and the class labels,
sort them and apply the transformation to the predictions, thereby ensuring
that a sorted result is returned?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz61TyLjZQJX7Jjehj00p84uYkIiMHks5q-pNWgaJpZM4Kw2SU
.

@simonm3
Copy link 9293
Author
simonm3 commented Nov 16, 2016

That would be a good idea as it would make sure 3rd party classifiers used the same standard. I know it works today for LogisticRegression and RandomForestClassifier. However I am unsure for XGBClassifier as that is outside sklearn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Moderate Anything that requires some knowledge of conventions and best practices
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants
0