[MRG + 1] Fix the cross_val_predict function for method='predict_proba' #7889
Conversation
@amueller I had the same doubt before I started working on the issue. However, the discussion in the thread mentioned that the classes are not always promised to be returned in sorted order. Also, for classifiers like RandomForestClassifier, the docstring doesn't mention that the classes_ attribute is sorted.
Do we require that to be true for external classifiers?
It's undocumented and untested, so I guess it's not required. We could require it. I'm not sure where in scikit-learn we might rely on it.
Actually, it is tested. So I suppose we should worry about absent, but not out-of-order, classes...
With this being tested, it's true that we don't need to worry about them being out-of-order. To check for absent classes though, checking that the number of unique elements in y matches the number of classes in each training fold should be enough, right?
Not necessarily, no. They may have differing subsets of classes, in rare cases.
I guess
This reminds me of a long-ago PR that added
Yes, I was trying to think of a more general solution... I don't mind a
I was wondering whether we can create a meta-estimator without adding parameter indirection. We could get around that using metaclasses, though I'm not sure if that'll look nice. I'm trying to come up with a better solution...
Meta-estimator or not, this needs tests for:
I've marked it WIP: untested changes are not ready for merge.
@jnothman @amueller Actually, I am not familiar with what a meta-estimator is and hence am unable to understand the solution that is being proposed. Could you please give me a brief explanation of what is being discussed and how a meta-estimator differs from a common estimator, so that I can participate constructively as well?
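As an aside for readers following the thread: a meta-estimator is simply an estimator that takes another estimator as a constructor parameter and delegates fitting and prediction to it, the way OneVsRestClassifier or CalibratedClassifierCV do in scikit-learn. The following is only a minimal sketch of the kind of wrapper being floated here; PaddedProbaClassifier is a hypothetical name, and this is not the solution the thread eventually converged on.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class PaddedProbaClassifier(BaseEstimator, ClassifierMixin):
    """Wrap an estimator and report probabilities for a fixed list of classes."""

    def __init__(self, estimator, classes=None):
        self.estimator = estimator
        self.classes = classes

    def fit(self, X, y):
        # Assumes every label in y appears in the user-supplied class list.
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = np.sort(np.asarray(self.classes))
        return self

    def predict_proba(self, X):
        probs = self.estimator_.predict_proba(X)
        padded = np.zeros((probs.shape[0], len(self.classes_)))
        # Columns of the wrapped estimator correspond to estimator_.classes_;
        # place them at the matching positions in the full (sorted) class list.
        cols = np.searchsorted(self.classes_, self.estimator_.classes_)
        padded[:, cols] = probs
        return padded


clf = PaddedProbaClassifier(LogisticRegression(), classes=[0, 1, 2, 4])

The design trade-off being debated is that such a wrapper introduces an extra layer of parameters (the wrapped estimator becomes a nested parameter), which the metaclass experiment below tries to avoid.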
@jnothman You mean with polymorphic clone we could allow
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder


class ClassesMeta(type):
    def __call__(cls, *args, **kwargs):
        # Pop the extra `classes` keyword so the wrapped estimator's
        # __init__ never sees it, then attach it to the instance.
        classes = kwargs.pop("classes", None)
        obj = type.__call__(cls, *args, **kwargs)
        obj.classes = classes
        return obj


class ClassesMixin(object):
    def get_params(self, deep=True):
        params = super(ClassesMixin, self).get_params(deep=deep)
        params['classes'] = self.classes
        return params

    def set_params(self, **params):
        self.classes = params.pop('classes', None)
        return super(ClassesMixin, self).set_params(**params)

    def fit(self, X, y):
        # Encode y against the full, user-supplied class list.
        self._private_le = LabelEncoder().fit(self.classes)
        return super(ClassesMixin, self).fit(X, self._private_le.transform(y))

    def predict(self, X):
        return self._private_le.inverse_transform(
            super(ClassesMixin, self).predict(X))

    def predict_proba(self, X):
        probs = super(ClassesMixin, self).predict_proba(X)
        padded_probs = np.zeros((probs.shape[0], len(self.classes)))
        # After fitting on encoded labels, the underlying estimator's
        # classes_ are indices into the label encoder's (sorted) class
        # list, so they can be used directly as column positions.
        padded_probs[:, self.classes_] = probs
        return padded_probs


def add_classes_wrapper(cls):
    return ClassesMeta(cls.__name__ + "WithClasses", (ClassesMixin, cls), {})

And then

LogisticRegressionWithClasses = add_classes_wrapper(LogisticRegression)
asdf = LogisticRegressionWithClasses(classes=[0, 1, 2, 4])

from sklearn.datasets import load_iris
iris = load_iris()
asdf.fit(iris.data, iris.target)
asdf.predict_proba(iris.data).shape

Not the solution we want to use here, probably, though ;) Just wanted to see how bad it would be.
You should check out my suggestion here: And ignore everything afterwards.
@amueller But it seems that the
@amueller Did you mean
You would really benefit from writing a test case that currently fails but that you would like to succeed with your patch. That, especially if you then propose a solution, would much better help us decide on the appropriate direction.
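For what it's worth, a reproduction along the lines being asked for might look roughly like the following sketch. The label counts are made up so that class 0 is absent from some training folds; before this patch the per-fold predict_proba arrays can then disagree in width or in which class each column refers to.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

iris = load_iris()
X = iris.data
# Relabel so that class 0 is rare and missing from some training folds.
y = np.array([0] * 5 + [1] * 72 + [2] * 73)

probas = cross_val_predict(LogisticRegression(), X, y,
                           cv=KFold(n_splits=3), method='predict_proba')

# With the fix, there is one column per class in np.unique(y), in sorted
# order, and folds that never saw a class contribute zeros for it.
assert probas.shape == (len(y), 3)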
@raghavrv Thanks for letting me know. I'll try to fix it.
Is this ready for merge?
@@ -474,6 +481,14 @@ def _fit_and_predict(estimator, X, y, train, test, verbose, fit_params,
     estimator.fit(X_train, y_train, **fit_params)
     func = getattr(estimator, method)
     predictions = func(X_test)
+    if method in ['decision_function', 'predict_proba', 'predict_log_proba']:
+        true_classes = np.unique(y)
Why not have _, n_classes = np.unique(y, return_counts=True), as you don't use the "true_classes"?
As mentioned by @jnothman, return_counts is not supported in all versions, and it's true that only the number of classes is being used, so I have thought of another workaround. Will make the change.
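For readers following along, the re-indexing idea under discussion amounts to roughly this: take a fold's probability columns, which correspond to that fold estimator's classes_, and place them at the matching positions among all classes seen in y, leaving zeros for classes absent from that training fold. This is a rough sketch only, not the exact code merged here, and it covers just the 2-D predict_proba case (decision_function for binary problems needs separate handling).

import numpy as np

def pad_fold_predictions(fold_probs, fold_classes, all_classes):
    # fold_probs: (n_samples, len(fold_classes)) probabilities from one fold.
    # Both class arrays are assumed sorted, as produced by np.unique.
    padded = np.zeros((fold_probs.shape[0], len(all_classes)))
    columns = np.searchsorted(all_classes, fold_classes)
    padded[:, columns] = fold_probs
    return padded

all_classes = np.unique([1, 1, -4, 6])            # array([-4,  1,  6])
fold_probs = np.array([[0.7, 0.3], [0.2, 0.8]])   # columns for classes [1, 6]
print(pad_fold_predictions(fold_probs, np.array([1, 6]), all_classes))
# [[0.   0.7  0.3]
#  [0.   0.2  0.8]]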
+    assert_array_almost_equal(expected_predictions, predictions)
+
+    # Testing unordered labels
+    y = [1, 1, -4, 6]
Do you want to explicitly test for a use-case where the classes_ is guaranteed to not be sorted but cross_val_predict returns output in sorted order?
As mentioned here, classes_ should indeed be sorted. Please have a look.
Fair enough! Thanks for checking :)
But wait, could we have a mock estimator that does not have the classes_ sorted? @jnothman Is it worth having such a test?
Never mind, this has been discussed before. Sorry for not checking... +1 for merge with the whatsnew entry...
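The sorted-order point is easy to check empirically, since scikit-learn classifiers derive classes_ from the sorted unique labels during fit. A quick check with the unordered labels used in the test above:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([1, 1, -4, 6])          # deliberately unordered labels

clf = LogisticRegression().fit(X, y)
print(clf.classes_)                  # [-4  1  6], i.e. np.unique(y)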
I'm fairly sure that return_counts is not available on all supported versions of numpy.
And a whatsnew entry please...
I am also getting this error when building the docs locally:

Exception occurred:
  File "/home/ubuntu/scikit-learn/doc/sphinxext/sphinx_gallery/docs_resolv.py", line 48, in _get_data
    with open(url, 'r') as fid:
IOError: [Errno 2] No such file or directory: '/home/ubuntu/scikit-learn/doc/_build/html/stable/modules/generated/sklearn.feature_selection.SelectKBest.rst.html'
The full traceback has been saved in /tmp/sphinx-err-wz9PtY.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
Embedding documentation hyperlinks in examples..
processing: feature_stacker.html
make: *** [html] Error 1

Any workaround for this?
Merging with an updated master will fix it for CircleCI.
+1 for merge after updating master (and all CIs turning green).
@raghavrv, nothing here could cause the doc build to fail.
Merged, thanks @dalmia!
…a' (scikit-learn#7889) Handle the case where different CV splits have different sets of classes present.
I started getting a different shape for
Reference Issue
Fixes #7863
What does this implement/fix? Explain your changes.
cross_val_predict() did not specify the order of the classes when method='predict_proba' was passed as the argument. So firstly, the change mentions in the documentation that the returned columns follow sorted class order. Secondly, for estimators having a classes_ attribute, it ensures that the predictions are returned in that sorted order.
Any other comments?
Currently, I haven't tested the code yet. I will complete that within a day or so.
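To illustrate the documented behaviour, here is a short sketch, assuming nothing beyond what this PR states: the columns of the array returned for method='predict_proba' follow the sorted class order, i.e. np.unique(y).

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

iris = load_iris()
X, y = iris.data, iris.target

probas = cross_val_predict(LogisticRegression(), X, y, cv=3,
                           method='predict_proba')
print(probas.shape)   # (150, 3): one column per class
print(np.unique(y))   # [0 1 2], the column order of probas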