order of returned probabilities unclear for cross_val_predict with method=predict_proba #7863

simonm3 · 2016-11-13T23:53:30Z

cross_val_predict has a new method parameter, which is typically set to "predict_proba" to return probabilities for each class.

However, the order of the returned classes is unclear. Either self.classes_ needs to be set, or the results need to be returned in a predictable order. Otherwise we have a list of probabilities for each class but no way to know which column relates to which class.

jnothman · 2016-11-14T13:30:15Z

Yes, I suppose this (and disagreements in the set of classes between splits) was overlooked.

(For internal models, I'm fairly sure that classes_ is always alphabetically ordered.)

dalmia · 2016-11-15T23:47:38Z

Can I start working on this?

dalmia · 2016-11-16T00:41:48Z

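After running a few tests on the iris dataset using LogisticRegression as the estimator, it became clear that the order in which the classes appear did not affect the final result. The LogisticRegression documentation (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_log_proba) gives a simple and clear explanation.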

jnothman · 2016-11-16T02:45:27Z

Iris aside, it's possible to create a cross-validation strategy in which only a subset of the classes is present in the training data for a particular split. This should be handled.

Firstly, the documentation should point out that the classes will be in sorted order. This should, in theory at least, be confirmed against the underlying estimators, reordering the columns if necessary.
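
To illustrate the subset-of-classes case, here is a minimal sketch using NumPy only. The helper name `expand_to_all_classes` is hypothetical, and filling the columns of unseen classes with 0 is an assumption made for illustration, not something this thread settles on.

```python
import numpy as np


def expand_to_all_classes(fold_proba, fold_classes, all_classes):
    """Spread one fold's probability columns over the full, sorted class list.

    Hypothetical helper: columns for classes the fold never saw are
    filled with 0 (an assumption, other fill values are possible).
    `all_classes` must be sorted, e.g. the output of np.unique(y).
    """
    fold_proba = np.asarray(fold_proba, dtype=float)
    all_classes = np.asarray(all_classes)
    # Column position of each fold class inside the sorted full class list.
    cols = np.searchsorted(all_classes, fold_classes)
    out = np.zeros((fold_proba.shape[0], len(all_classes)))
    out[:, cols] = fold_proba
    return out


all_classes = np.unique([0, 1, 2])   # every class in the full data set
fold_classes = np.array([0, 2])      # class 1 was absent from this training split
fold_proba = np.array([[0.9, 0.1],
                       [0.3, 0.7]])
full = expand_to_all_classes(fold_proba, fold_classes, all_classes)
```

With this layout every fold yields the same number of columns, so the per-fold outputs can be stacked into one array.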

dalmia · 2016-11-16T03:44:03Z

True, we could miss out on a few observations in cross-validation. As for the confirmation of the classes being sorted, since these lines of the cross_val_predict function already ensure that only estimators capable of calculating the probability are passed 'predict_proba' as the method, and all such estimators return the classes in sorted order, do we need any other mode of confirmation?
Please correct me if I misunderstood what you intended to say.

    if not callable(getattr(estimator, method)):
        raise AttributeError('{} not implemented in estimator'
                             .format(method))

jnothman · 2016-11-16T04:26:03Z

Well, we don't promise that all estimators return classes in sorted order, but all of them store their classes in .classes_.

dalmia · 2016-11-16T05:23:32Z

Then how about, in the _fit_and_predict function, I retrieve the classes_ attribute, make a dictionary of the column indices and the class labels, sort them on the class labels and apply the transformation to the predictions, thereby ensuring that a sorted result is returned?
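
A rough sketch of that idea, using NumPy's argsort in place of an explicit dictionary. The function name `sort_proba_columns` is hypothetical, and this is not the code that was eventually merged.

```python
import numpy as np


def sort_proba_columns(proba, classes):
    """Reorder probability columns so they follow the sorted class labels.

    Hypothetical helper: `classes` plays the role of an estimator's
    classes_ attribute, in whatever order the estimator stored it.
    """
    proba = np.asarray(proba)
    classes = np.asarray(classes)
    order = np.argsort(classes)  # column permutation that sorts the labels
    return proba[:, order], classes[order]


# An estimator that stored its classes in a non-sorted order.
classes = np.array(["dog", "cat", "bird"])
proba = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3]])
sorted_proba, sorted_classes = sort_proba_columns(proba, classes)
```

Returning the sorted labels alongside the reordered matrix keeps the column-to-class mapping explicit for the caller.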

jnothman · 2016-11-16T05:26:44Z

That sort of thing, yes

simonm3 · 2016-11-16T11:14:48Z

That would be a good idea as it would make sure 3rd party classifiers used the same standard. I know it works today for LogisticRegression and RandomForestClassifier. However I am unsure for XGBClassifier as that is outside sklearn.

jnothman added the Bug label Nov 14, 2016

jnothman added the Moderate and Need Contributor labels Nov 14, 2016

dalmia mentioned this issue Nov 16, 2016

[MRG + 1] Fix the cross_val_predict function for method='predict_proba' #7889

Merged

jnothman closed this as completed in #7889 Jan 7, 2017

qinhanmin2014 mentioned this issue Jun 19, 2017

[MRG+1] Incorrent implementation of noise_variance_ in PCA._fit_truncated #9108

Merged

flamby mentioned this issue May 30, 2019

sklearn estimator compatibility (useful for OneVsRest or VotingClassifier meta-estimators) imoscovitz/wittgenstein#7

Open
