[MRG+1] Fixes #7578 added check_decision_proba_consistency in estimator_checks #8253
Conversation
sklearn/utils/estimator_checks.py (outdated)
@@ -114,7 +116,7 @@ def _yield_classifier_checks(name, Classifier):
    yield check_classifiers_regression_target
    if (name not in ["MultinomialNB", "LabelPropagation", "LabelSpreading"]
        # TODO some complication with -1 label
        and name not in ["DecisionTreeClassifier",
You need to put the `and` on the previous line to make flake8 happy. The error from Travis gives you a hint as to what to do:

./sklearn/utils/estimator_checks.py:119:9: W503 line break before binary operator
    and name not in ["DecisionTreeClassifier",
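To illustrate the fix being requested, here is a minimal sketch; the function name and the yielded placeholder are illustrative, not the actual code in the PR:

```python
def _classifier_checks_sketch(name):
    # Rejected by flake8 at the time (W503): the `and` begins the
    # continuation line, e.g.
    #     if (name not in [...]
    #             and name not in [...]):
    # Accepted: move the `and` to the end of the previous line instead.
    if (name not in ["MultinomialNB", "LabelPropagation", "LabelSpreading"] and
            name not in ["DecisionTreeClassifier", "ExtraTreeClassifier"]):
        yield "some_check"  # placeholder for whichever checks this block yields
```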
sklearn/utils/estimator_checks.py (outdated)
@ignore_warnings(category=DeprecationWarning)
def check_rank_corr(name, Estimator):
Perhaps call it `check_decision_proba_consistency`.
sklearn/utils/estimator_checks.py (outdated)
    predict_proba methods has outputs with perfect rank correlation.
    """
    X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
Why are we using multilabel data? Why not just binary?
sklearn/utils/estimator_checks.py (outdated)
    try:
        classif = OneVsRestClassifier(estimator)
        classif.fit(X, Y)
        a = classif.predict_proba([i for i in range(20)])
Usually the input would be 2d. Why is it 1d? For test data, you can just generate something random and uniform, or take it from a distribution similar to the training data.
sklearn/utils/estimator_checks.py (outdated)
    if hasattr(estimator, "predict_proba"):
        try:
            classif = OneVsRestClassifier(estimator)
why are we doing OvR?
sklearn/utils/estimator_checks.py (outdated)
    a = classif.predict_proba([i for i in range(20)])
    b = classif.decision_function([i for i in range(20)])
    assert_equal(
        rankdata(a, method='average'), rankdata(b, method='average'))
`method` shouldn't matter as long as tied values have tied ranks. But if we're working with non-binary classification, we need to do this comparison column-wise. Use `assert_array_equal` rather than `assert_equal`.
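A minimal sketch of the column-wise comparison being suggested; the arrays are made-up outputs, just to show the shape of the check:

```python
import numpy as np
from numpy.testing import assert_array_equal
from scipy.stats import rankdata

# Made-up outputs for a 3-class problem: one column of scores per class.
a = np.array([[0.2, 0.5, 0.3],
              [0.5, 0.1, 0.4]])    # predict_proba-like output
b = np.array([[-1.0, 0.4, 0.1],
              [0.9, -2.0, 0.2]])   # decision_function-like output

# Rank the samples within each class column and require the orderings
# to match exactly; any `method` works as long as ties stay tied.
for col in range(a.shape[1]):
    assert_array_equal(rankdata(a[:, col], method='average'),
                       rankdata(b[:, col], method='average'))
```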
sklearn/utils/estimator_checks.py (outdated)
        assert_equal(
            rankdata(a, method='average'), rankdata(b, method='average'))
    except ValueError:
What is this to catch? The `try` block should go around the smallest scope that we want to exclude, otherwise this test can pass when all estimators raise a `ValueError` because the test is broken.
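A toy illustration of this scoping concern; the two helper functions are stand-ins (not sklearn code) for a fit/predict call that may legitimately raise and for an assertion that is accidentally broken:

```python
def might_reject_input():
    # Stands in for a fit/predict call that may legitimately raise.
    raise ValueError("estimator rejects this input")

def assertion_with_a_bug():
    # Stands in for the rank-correlation assertion, here accidentally broken.
    raise ValueError("bug in the test itself")

# Too broad: the buggy assertion is swallowed too, so the check "passes"
# for every estimator even though it never tested anything.
try:
    might_reject_input()
    assertion_with_a_bug()
except ValueError:
    print("check skipped (and the bug went unnoticed)")

# Narrower: only the call we expect may raise is guarded; a bug in the
# assertion now surfaces as a loud failure instead of a silent pass.
try:
    might_reject_input()
except ValueError:
    print("check skipped for this estimator")
else:
    assertion_with_a_bug()  # would raise here and fail the test run
```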
sklearn/utils/estimator_checks.py (outdated)
    if hasattr(estimator, "decision_function"):

        if hasattr(estimator, "predict_proba"):
Use `and`.
@jnothman Travis CI fails even though no errors were found.
Yeah, I have opened an issue on Travis about that (this is under investigation):
You seem to have confused a number of different patches in your PR. You should be using a branch in your fork to avoid this, not master.
You can keep this one on master, but you need to revert or otherwise remove your changes pertaining to other issues.
@jnothman anything else needed?
sklearn/utils/estimator_checks.py (outdated)
@@ -56,8 +56,9 @@
BOSTON = None
CROSS_DECOMPOSITION = ['PLSCanonical', 'PLSRegression', 'CCA', 'PLSSVD']
MULTI_OUTPUT = ['CCA', 'DecisionTreeRegressor', 'ElasticNet',
                'ExtraTreeRegressor', 'ExtraTreesRegressor', 'GaussianProcess',
MULTI_OUTPUT = ['CCA', 'DecisionTreeClassifier', 'DecisionTreeRegressor',
all others here are regressors. What makes you sure it's appropriate to include multioutput classifiers here?
In the function `check_supervised_y_2d`, the line inside the warning section, `estimator.fit(X, y[:, np.newaxis])`, does not give any warnings for the classifiers I included, so I added them to the MULTI_OUTPUT list. Otherwise it would give me an error saying `expected 1 DataConversionWarning, got:`. I checked the sklearn documentation for DecisionTreeClassifier and it says y can accept [n_samples, n_outputs].
Is this change relevant to the rest of the PR? Perhaps it should be a separate PR.
Sure, making changes then.
Nosetests fail if I do not include them. Can one PR reference two issues? Otherwise this will not pass. Maybe I will open an issue and reference both it and this issue from this PR. What do you say?
Could you be more specific about what fails if you do not include them?
sklearn/utils/estimator_checks.py (outdated)
@@ -113,12 +114,12 @@ def _yield_classifier_checks(name, Classifier):
    # basic consistency testing
    yield check_classifiers_train
    yield check_classifiers_regression_target
    if (name not in ["MultinomialNB", "LabelPropagation", "LabelSpreading"]
    if (name not in ["MultinomialNB", "LabelPropagation", "LabelSpreading"]):
please don't add these parentheses
sklearn/utils/estimator_checks.py (outdated)
        # TODO some complication with -1 label
        and name not in ["DecisionTreeClassifier",
                         "ExtraTreeClassifier"]):
    if (name not in ["DecisionTreeClassifier", "ExtraTreeClassifier"]):
please don't add these parentheses
build_tools/travis/test_script.sh (outdated)
@@ -8,6 +8,7 @@
set -e
?
This was added by mistake during the days when Travis went down, when I foolishly tried to make Travis work in order to get my PR to pass tests, as this was my first one. Will correct it.
sklearn/utils/estimator_checks.py (outdated)
    return (p.name != 'self'
            and p.kind != p.VAR_KEYWORD
            and p.kind != p.VAR_POSITIONAL)
    return (p.name != 'self' and p.kind != p.VAR_KEYWORD and
why change this?
travis.log (outdated)
@@ -0,0 +1,87 @@
Command line:
please remove this file.
sklearn/utils/estimator_checks.py (outdated)
@ignore_warnings(category=DeprecationWarning)
def check_decision_proba_consistency(name, Estimator):
    """
We don't usually add docstrings to checks, because nose doesn't play nicely.
sklearn/utils/estimator_checks.py (outdated)
    predict_proba methods has outputs with perfect rank correlation.
    """
    rnd = np.random.RandomState(0)
    X_train = (3*rnd.uniform(size=(10, 4))).astype(int)
we usually use integer features, or binary, in common tests in case estimators can't deal with real-valued features.
Absolutely, I added `.astype(int)`.
sklearn/utils/estimator_checks.py (outdated)
    if (hasattr(estimator, "decision_function") and
            hasattr(estimator, "predict_proba")):

        estimator.fit(X_train, y)
Nothing seems to be entering this case: I've modified it to say `assert False` but nothing is failing from sklearn/tests/test_common.py.
Ah. You've put this in as a regressor check.
sklearn/utils/estimator_checks.py (outdated)
@@ -162,6 +163,7 @@ def _yield_regressor_checks(name, Regressor):
    yield check_regressors_no_decision_function
    yield check_supervised_y_2d
    yield check_supervised_y_no_nan
    yield check_decision_proba_consistency
This should be in `...classifier_checks`, not regressors.
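A minimal sketch of what is being asked for, assuming the generator structure shown in the diffs above; the placeholder body and abbreviated surroundings are illustrative, not the merged code:

```python
def check_decision_proba_consistency(name, Estimator):
    """Placeholder for the check added in this PR."""


def _yield_classifier_checks(name, Classifier):
    # ... the existing classifier checks would be yielded here ...
    # The new check belongs with the classifier checks, since only
    # classifiers expose both decision_function and predict_proba.
    yield check_decision_proba_consistency
```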
Awesome, that's great, I get what you are trying to say: likelihood(point belongs to A) / likelihood(point belongs to B); if both are around 0.5, say 0.6/0.4, the point could belong to either side, so the values won't peak so much. Thanks a lot. It should definitely work. I will make changes and update.
@jnothman I did not exactly take the points in the middle; I just brought the cluster centres nearer and they kind of overlap. The other thing I thought of was to draw ellipses around both blobs and use all the points outside them as the test set, but this worked. Is it fine or should I improvise?
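A hedged sketch of the data generation described here, mirroring the parameters visible in the diff below but otherwise illustrative: two nearby centres make the classes overlap, so predicted probabilities stay away from 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two nearby cluster centres so the blobs overlap substantially.
centers = [(2, 2), (4, 4)]
X, y = make_blobs(n_samples=100, random_state=0,
                  centers=centers, cluster_std=1.0, shuffle=True)

# Test points drawn from a similar region, as the reviewer suggests below.
rng = np.random.RandomState(0)
X_test = rng.randn(20, 2) + 4
```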
sklearn/utils/estimator_checks.py (outdated)
        # TODO some complication with -1 label
        and name not in ["DecisionTreeClassifier",
                         "ExtraTreeClassifier"]):
    if name not in ["DecisionTreeClassifier", "ExtraTreeClassifier"]:
Why did you change this from `and` to a separate `if`? That's what creates the errors in your screenshot.
Making changes.
sklearn/utils/estimator_checks.py (outdated)
    centers = [(2, 2), (4, 4)]
    X, y = make_blobs(n_samples=100, random_state=0, n_features=4,
                      centers=centers, cluster_std=1.0, shuffle=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
With this approach, the probabilities are again going to be very peaked around 0 and 1, since the blobs are more-or-less linearly separable, encouraging numerical precision errors etc. For test, I'd just use `np.random.randn() + 3` or something.
LGTM
sklearn/utils/estimator_checks.py (outdated)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                        random_state=0)
    X_test = np.random.randn(20, 2)+4
Please insert spaces around `+`.
Thanks a lot for all the reviews on the PR. Learnt a lot. Making changes.
Please add an entry in what's new. Put it in API changes to say "Estimators with both x and y are now required ..."
I have my Networks exam now; I will surely do it by evening.
@shubham0704 please use "Fix #issueNumber" in your PR description; this way the associated issue gets closed automatically when the PR is merged. For more details, look at this. I have edited your description, but please remember to do it next time.
Sure. Thanks @lesteve.
[RFC] - request for close :)
    return (p.name != 'self'
            and p.kind != p.VAR_KEYWORD
            and p.kind != p.VAR_POSITIONAL)
    return (p.name != 'self' and
For next time, try not to change things that are not related to your PR. This adds noise into the diff and makes it harder for the review to be efficient.
Sure @lesteve. Thanks a lot.
LGTM, merging, thanks a lot!
Reference Issue
Fix #7578
What does this implement/fix? Explain your changes.
It addresses the need for a test function that checks whether the outputs of predict_proba and decision_function are perfectly rank-correlated.
Any other comments?
I need to understand the testing part of this function. I have done the pep8 linting and pyflakes, but received one error during the nose check stating that set_testing_parameters() takes exactly 1 value (0 given) and error=1. Also, where is the best place to yield this function? I did not make that change because I was unsure.
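For readers following along, here is a minimal sketch of the kind of check discussed in this PR, pieced together from the snippets reviewed above; the function signature, data parameters, and the example estimator are illustrative assumptions, not the merged implementation:

```python
import numpy as np
from numpy.testing import assert_array_equal
from scipy.stats import rankdata
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression


def check_decision_proba_consistency(estimator):
    # Overlapping blobs keep predicted probabilities away from 0 and 1,
    # avoiding the numerical-precision issues raised in the review.
    centers = [(2, 2), (4, 4)]
    X, y = make_blobs(n_samples=100, random_state=0,
                      centers=centers, cluster_std=1.0, shuffle=True)
    rng = np.random.RandomState(0)
    X_test = rng.randn(20, 2) + 4  # test points near one of the centres

    if hasattr(estimator, "decision_function") and hasattr(estimator, "predict_proba"):
        estimator.fit(X, y)
        a = estimator.predict_proba(X_test)[:, 1]  # probability of the positive class
        b = estimator.decision_function(X_test)    # raw decision scores
        # The two outputs must induce exactly the same ranking of the test points.
        assert_array_equal(rankdata(a), rankdata(b))


# Example: LogisticRegression exposes both methods, so the check applies.
check_decision_proba_consistency(LogisticRegression())
```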