[MRG + 1] Fix the cross_val_predict function for method='predict_proba' #7889
Conversation
@amueller I had the same doubt before I started working on the issue. However, the discussion in the thread mentioned that the classes are not always promised to be returned in sorted order. Also, for classifiers like RandomForestClassifier, the docstring doesn't mention that the `classes_` attribute is sorted.
Do we require that to be true for external classifiers?
It's undocumented and untested, so I guess it's not required. We could require it. I'm not sure where in scikit-learn we might rely on it.
Actually, it is tested: … So I suppose we should worry about absent, but not out-of-order, classes...
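For illustration, a minimal sketch of the sorted-`classes_` behaviour being discussed (toy data, not from this PR): a fitted scikit-learn classifier exposes its labels via `classes_` in sorted order, matching `np.unique` of the training labels.

```python
from sklearn.linear_model import LogisticRegression

# Labels supplied in unsorted order; the fitted attribute comes back sorted.
clf = LogisticRegression().fit([[0.], [1.], [2.]], [6, -4, 1])
print(clf.classes_)  # [-4  1  6]
```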
With this being tested, it's true that we don't need to worry about them being out of order. To check for absent classes, though, would checking that the number of unique elements in `y` matches the number of classes seen by each fold's estimator be enough?
Not necessarily, no. They may have differing subsets of classes in rare cases.
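A quick sketch of that rare case (an illustration only; the labels mirror the `y = [1, 1, -4, 6]` test data that appears later in this PR):

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.array([1, 1, -4, 6])
for train, _ in KFold(n_splits=2).split(y):
    # Each training fold can see a different subset of np.unique(y).
    print(np.unique(y[train]))
# [-4  6]
# [1]
```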
I guess …
This reminds me of a long-ago PR that added a `classes` parameter to `fit`.
Yes, I was trying to think of a more general solution... I don't mind a `classes` parameter, though.
I was wondering whether we can create a meta-estimator without adding parameter indirection via `estimator__param`-style nesting. We could get around that using metaclasses, though I'm not sure if that'll look nice. I'm trying to come up with a better solution...
Meta-estimator or not, this needs tests for: …
All reactions
Sorry, something went wrong.
I've marked it WIP: untested changes are not ready for merge.
@jnothman @amueller Actually, I am not familiar with what a meta-estimator is and hence am unable to understand the solution being proposed. Could you please give me a brief explanation of what is being discussed and how a meta-estimator differs from a common estimator, so that I can participate constructively as well?
A meta-estimator is just an estimator that wraps an estimator. Regardless of whether we use a meta-estimator in the solution, the tests are the same. @amueller, as I've said before, you can make meta-estimators without parameter indirection using a polymorphic `clone`.
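For readers following along, a minimal sketch of what a meta-estimator looks like (`WrappedClassifier` is a hypothetical name for illustration, not code from this PR):

```python
from sklearn.base import BaseEstimator, clone

class WrappedClassifier(BaseEstimator):
    """An estimator that wraps another estimator and delegates to it."""

    def __init__(self, estimator):
        # Exposed via get_params(deep=True) as estimator__<name> --
        # the "parameter indirection" discussed above.
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)
```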
@jnothman You mean with polymorphic clone we could allow …
```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


class ClassesMeta(type):
    def __call__(cls, *args, **kwargs):
        classes = kwargs.pop("classes", None)
        obj = type.__call__(cls, *args, **kwargs)
        obj.classes = classes
        return obj


class ClassesMixin(object):
    def get_params(self, deep=True):
        params = super(ClassesMixin, self).get_params(deep=deep)
        params['classes'] = self.classes
        return params

    def set_params(self, **params):
        if 'classes' in params:
            self.classes = params.pop('classes')
        return super(ClassesMixin, self).set_params(**params)

    def fit(self, X, y):
        self._private_le = LabelEncoder().fit(self.classes)
        return super(ClassesMixin, self).fit(
            X, self._private_le.transform(y))

    def predict(self, X):
        return self._private_le.inverse_transform(
            super(ClassesMixin, self).predict(X))

    def predict_proba(self, X):
        probs = super(ClassesMixin, self).predict_proba(X)
        # Zero-pad columns for classes absent from the training data;
        # columns follow the sorted class order. The encoded labels in
        # self.classes_ are already indices into that sorted order.
        padded_probs = np.zeros((probs.shape[0], len(self.classes)))
        padded_probs[:, self.classes_] = probs
        return padded_probs


def add_classes_wrapper(cls):
    return ClassesMeta(cls.__name__ + "WithClasses", (ClassesMixin, cls), {})
```

And then:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

LogisticRegressionWithClasses = add_classes_wrapper(LogisticRegression)
asdf = LogisticRegressionWithClasses(classes=[0, 1, 2, 4])
iris = load_iris()
asdf.fit(iris.data, iris.target)
asdf.predict_proba(iris.data).shape  # (150, 4)
```

Not the solution we want to use here, probably ;) Just wanted to see how bad it would be.
You should check out my suggestion here: … And ignore everything afterwards.
@amueller But it seems that the …
@amueller Did you mean …
I made …
You would really benefit from writing a test case that currently fails but that you would like to succeed with your patch. That, especially if you then propose a solution, would help us much better in deciding the appropriate direction.
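A sketch of such a currently-failing test (an illustration only, assuming the patched cross_val_predict pads columns for classes absent from a fold, as the diff below suggests; the data mirrors this PR's own test case):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

X = np.arange(8).reshape(4, 2).astype(float)
y = np.array([1, 1, -4, 6])

# With KFold(3), each training fold is missing one of the three classes,
# so the per-fold predict_proba outputs have only two columns each.
probs = cross_val_predict(LogisticRegression(), X, y,
                          cv=KFold(n_splits=3), method='predict_proba')
assert probs.shape == (4, 3)  # one column per class in np.unique(y)
```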
@raghavrv Thanks for letting me know. I'll try to fix it.
Is this ready for merge?
```diff
@@ -474,6 +481,14 @@ def _fit_and_predict(estimator, X, y, train, test, verbose, fit_params,
     estimator.fit(X_train, y_train, **fit_params)
     func = getattr(estimator, method)
     predictions = func(X_test)
+    if method in ['decision_function', 'predict_proba', 'predict_log_proba']:
+        true_classes = np.unique(y)
```
Why not have `_, n_classes = np.unique(y, return_counts=True)`, as you don't use the `true_classes`?
As @jnothman mentioned, `return_counts` is not supported on all numpy versions, and it's true that only the number of classes is used, so I have thought of another workaround. Will make the change.
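One portable workaround, sketched for illustration: count the classes without `return_counts`, which older numpy releases lack.

```python
import numpy as np

y = np.array([1, 1, -4, 6])
n_classes = len(np.unique(y))  # works on all supported numpy versions
print(n_classes)  # 3
```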
```diff
+    assert_array_almost_equal(expected_predictions, predictions)
+
+    # Testing unordered labels
+    y = [1, 1, -4, 6]
```
Do you want to explicitly test for a use-case where the `classes_` is guaranteed to not be sorted, but cross_val_predict returns output in sorted order?
As mentioned here, `classes_` should indeed be sorted. Please have a look.
Fair enough! Thanks for checking :)
But wait, could we have a mock estimator that does not have the `classes_` sorted? @jnothman Is it worth having such a test?
Never mind, this has been discussed before. Sorry for not checking... +1 for merge with the whatsnew entry...
I'm fairly sure that `return_counts` is not available on all supported versions of numpy.
And a whatsnew entry, please...
I am also getting this error when building the docs locally:

```
Exception occurred:
  File "/home/ubuntu/scikit-learn/doc/sphinxext/sphinx_gallery/docs_resolv.py", line 48, in _get_data
    with open(url, 'r') as fid:
IOError: [Errno 2] No such file or directory: '/home/ubuntu/scikit-learn/doc/_build/html/stable/modules/generated/sklearn.feature_selection.SelectKBest.rst.html'
The full traceback has been saved in /tmp/sphinx-err-wz9PtY.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
Embedding documentation hyperlinks in examples..
processing: feature_stacker.html
make: *** [html] Error 1
```

Any workaround for this?
Merging with an updated master will fix it for CircleCI.
+1 for merge after updating master (and all CIs turning green).
@raghavrv, nothing here could cause the doc build to fail.
Merged, thanks @dalmia!
…a' (scikit-learn#7889) Handle the case where different CV splits have different sets of classes present.
I started getting a different shape for …
Successfully merging this pull request may close these issues.
order of returned probabilities unclear for cross_val_predict with method=predict_proba
Reference Issue
Fixes #7863
What does this implement/fix? Explain your changes.
cross_val_predict() did not specify the order of the classes when method='predict_proba' was passed as the argument. Firstly, the change documents that the returned columns are sorted by class label. Secondly, for estimators having a classes_ attribute, it ensures that the predictions are returned in sorted order.

Any other comments?
I haven't tested the code yet; I will complete that within a day or so.
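For illustration, a minimal sketch of the padding idea such a fix relies on (not this PR's actual code; `pad_fold_probabilities` is a hypothetical helper):

```python
import numpy as np

def pad_fold_probabilities(probs, fold_classes, all_classes):
    """Re-align one fold's predict_proba columns to the full sorted class
    set, filling zeros for classes that fold's training data never saw."""
    padded = np.zeros((probs.shape[0], len(all_classes)))
    padded[:, np.searchsorted(all_classes, fold_classes)] = probs
    return padded

probs = np.array([[0.7, 0.3]])      # fold trained on classes [-4, 1] only
all_classes = np.array([-4, 1, 6])  # np.unique(y) over the full dataset
print(pad_fold_probabilities(probs, np.array([-4, 1]), all_classes))
# [[0.7 0.3 0. ]]
```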