8000 [MRG+2] Deprecate reorder parameter in auc by qinhanmin2014 · Pull Request #9851 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+2] Deprecate reorder parameter in auc #9851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Oct 17, 2017
Merged

[MRG+2] Deprecate reorder parameter in auc #9851

merged 22 commits into from
Oct 17, 2017

Conversation

qinhanmin2014
Copy link
Member
@qinhanmin2014 qinhanmin2014 commented Sep 29, 2017

Reference Issue

proposed in #9786 by @lesteve

with this change reorder=False(reorder=True?) in the auc function is not used in our code. I think we should deprecate the reorder=False(reorder=True?) parameter and potentially (up for debate) have some threshold in case there are some small negative values in np.diff(tpr) or np.diff(fpr)

What does this implement/fix? Explain your changes.

(1)add tolerance to roc_auc_score when checking whether x is increasing. (discarded)

(2)improve error message when x is neither increasing nor decreasing.

(3)deprecate reorder parameter. reorder=False is now the default behaviour.

reorder=True is introduced in 9f18586 for roc_auc_score. Now it is not used in roc_auc_score due to #9786.

Any other comments?

cc @jnothman @lesteve

@lesteve
Copy link
Member
lesteve commented Sep 29, 2017

I would deprecate reorder in the same PR.

To be honest I am not sure it is worth adding tol. If the user has non increasing fpr (which means he is already in a special case, i.e. not using roc_curve which guarantees that fpr and tpr are increasing) then it is up to him to preprocess fpr and tpr before passing them to auc. The error message could be a bit improved to help the user do the right thing. e.g. by mentioning how many negative np.diff(fpr) there is and what's the most negative values we found.

@qinhanmin2014 qinhanmin2014 changed the title [MRG] Add tol parameter to avoid unexpected error in auc [MRG] Add tol parameter and deprecate reorder parameter in auc Sep 29, 2017
@qinhanmin2014
Copy link
Member Author

@lesteve Thanks for the instant reply :)
I have addressed all your comments. Now the PR:
(1)add tolerance to roc_auc_score when checking whether x is increasing.
(2)improve error message when x is neither increasing nor decreasing.
(3)deprecate reorder parameter. reorder=False is now the default behaviour.
If you are sure that we don't need some part of the PR, please give a clear instruction and I'll remove it.

@lesteve
Copy link
8000
Member
lesteve commented Oct 3, 2017

I would be in favour of removing tol and making the error message more informative.

@qinhanmin2014
Copy link
Member Author

@lesteve Thanks. I have removed parameter tol.
For the erorr message, I have sligtly improved it previously. Now it looks like this:

auc([2, 1, 3, 4], [4, 5, 6, 7])
# x is neither increasing nor decreasing : [2 1 3 4].
# np.diff(x) contains 2 positive values and 1 negative values.

Since x can be either increasing or decreasing, I'm wondering what's the best information to provide.
WDYT? Thanks a lot :)

@lesteve
Copy link
Member
lesteve commented Oct 3, 2017

Maybe the most negative value and the most positive value of np.diff? This way you can see that his how far your array is from being increasing or decreasing.

@qinhanmin2014
Copy link
Member Author

@lesteve Thanks for the instant reply :) I have updated the error message.

auc([2.5, 1.6, 3.7, 4.8], [5, 6, 7, 8])
# ValueError: x is neither increasing nor decreasing : [ 2.5  1.6  3.7  4.8].
# np.diff(x) contains 2 positive values and 1 negative values.
# The most positive value in np.diff(x) : x[2] - x[1] = 2.1.
# The most negative value in np.diff(x) : x[1] - x[0] = -0.9.

@@ -426,7 +427,20 @@ def test_auc_errors():
assert_raises(ValueError, auc, [0.0], [0.1])

# x is not in order
assert_raises(ValueError, auc, [1.0, 0.0, 0.5], [0.0, 0.0, 0.0])
error_message = ("x is neither increasing nor decreasing : "
"[ 2.5 1.6 3.7 4.8]. np.diff(x) contains 2 positive "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy 1.14 will change np array repr/str so don't hard-code it like this instead use '{}'.format(your_array).

"x[{}] = {}. The most negative value in "
"np.diff(x) : x[{}] - x[{}] = {}."
.format(x, (dx > 0).sum(), (dx < 0).sum(),
dx.argmax() + 1, dx.argmax(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define the variables that you reuse outside of the .format

@qinhanmin2014
Copy link
Member Author

@lesteve Thanks :) Both comments are addressed.

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjust the title to remove reference to tol?

y : array, shape = [n]
y coordinates.
reorder : boolean, optional (default=False)
reorder : boolean, optional (default=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make the default something semantically meaningful, like 'warn' or 'deprecated', please?

@@ -47,13 +47,18 @@ def auc(x, y, reorder=False):
Parameters
----------
x : array, shape = [n]
x coordinates.
Increasing or decreasing x coordinates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "x coordinates. These must be either monotonic increasing or monotonic decreasing."? Otherwise, this seems vague.

"the x array is not increasing: %s" % x)
maxpos = dx.argmax()
minpos = dx.argmin()
raise ValueError("x is neither increasing nor decreasing "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this message is a bit obscure to a reader of the code, more so than to someone who receives the error by surprise. It deserves a comment, perhaps a reference to the issue.

@qinhanmin2014 qinhanmin2014 changed the title [MRG] Add tol parameter and deprecate reorder parameter in auc [MRG] Deprecate reorder parameter and improve error message in auc Oct 9, 2017
raise ValueError("Reordering is not turned on, and "
"the x array is not increasing: %s" % x)
# Output the most positive value and most negative value in
# np.diff(x) for debugging.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it meant to handle the case that adjacent values are out of order but only within a small tolerance?

minpos = dx.argmin()
raise ValueError("x is neither increasing nor decreasing "
": {}. The most positive value in np.diff(x)"
" : x[{}] - x[{}] = {}. The most negative "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that this error message will most often leave the user bewildered.

@qinhanmin2014
Copy link
Member Author
qinhanmin2014 commented Oct 9, 2017

@jnothman Thanks :) The error message is suggested by @lesteve here. (In fact, I have slightly shortened it just now.) I don't have strong opinion on this. Seems that it might indeed introduce some confusion. But at least the original error message is not right. So do you think "x is neither increasing nor decreasing : {}".format(x) (very similar to the original message) is enough?

@jnothman
Copy link
Member
jnothman commented Oct 9, 2017

Why do we want to allow increasing or decreasing? Why don't we require it to be increasing, and state that there is a negative diff of ...? I think it would be much clearer to the user why this is relevant.

@jnothman
Copy link
Member
jnothman commented Oct 9, 2017

It's not like we cause great harm to the user by asking them to reverse their inputs.

@qinhanmin2014
Copy link
Member Author

@jnothman Thanks. From my perspective, I agree with you. Allow x to be decreasing seems awkward but the implementation is there. So what's your final decision or do you want me to further investigate the origin of the code?

@jnothman
Copy link
Member
jnothman commented Oct 10, 2017 via email

@jnothman jnothman changed the title [MRG+1] Deprecate reorder parameter in auc [MRG+1?] Deprecate reorder parameter in auc Oct 10, 2017
@qinhanmin2014
Copy link
Member Author

@jnothman Personally, I would prefer to remain reorder parameter. Like the support for decreasing x, they have long been here and we will not need much effort to support them (though I don't think they are useful). The deprecation is proposed by lesteve, you might ping him for more reasons. Or if you are pretty sure, please give a clear instruction and I'll remove the deprecation and leave this PR as a doc improvement :)

@jnothman
Copy link
Member
jnothman commented Oct 10, 2017 via email

@lesteve
Copy link
Member
lesteve commented Oct 10, 2017

Apologies for taking some time to come back to this one.

In other words, a way to put this in plots:

import matplotlib.pyplot as plt
plt.figure()
plt.plot([0, 1, 1], [0, 0, 1], 'o-')
plt.figure()
# this curve has the two last points swapped
plt.plot([0, 1, 1], [0, 1, 0], 'o-')
plt.show()

figure_1

figure_2

Before #9786 was merged we would be switching from the first curve to the second curve because of floating point noise in x and reorder=True. The areas under the curve of this two curves are very different and this was the reason we were seeing roc_auc_score differences a lot bigger than what you would expect from floating point differences.

In the case we fixed in #9786, fpr was not increasing only for floating point reasons. In fact we were guaranteed that fpr were in the right order because they were computed with decreasing thresholds. reorder=True was an easy way to get rid of the error "x is not increasing" but the wrong thing to do because rearranging the curve as reorder=True does was drastically changing the auc score (as the plots above are trying to illustrate). This is the main reason why I would be in favour of dropping the reorder parameter in the long run.

@qinhanmin2014
Copy link
Member Author

@lesteve Thanks a lot for the detailed reply :)
So I think your opinion is that it is better for us to force users to pass increasing/decreasing x so that we can have a sanity check to avoid wrong result, right? Considering that the result from auc will be significantly influenced when x is sorted unexpectedly due to slight floating point error. I think at this point at least you persuade me to do the deprecation. I'll update the reason for deprecation.
ping @jnothman

@codecov
Copy link
codecov bot commented Oct 10, 2017

Codecov Report

Merging #9851 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9851      +/-   ##
==========================================
- Coverage   96.17%   96.16%   -0.01%     
==========================================
  Files         336      336              
  Lines       62413    62504      +91     
==========================================
+ Hits        60024    60108      +84     
- Misses       2389     2396       +7
Impacted Files Coverage Δ
sklearn/metrics/ranking.py 98.36% <100%> (+0.04%) ⬆️
sklearn/metrics/tests/test_ranking.py 100% <100%> (ø) ⬆️
sklearn/_build_utils/__init__.py 59.18% <0%> (-2.05%) ⬇️
sklearn/manifold/tests/test_t_sne.py 98.93% <0%> (-1.07%) ⬇️
sklearn/datasets/rcv1.py 30.9% <0%> (-0.58%) ⬇️
sklearn/manifold/t_sne.py 99.19% <0%> (-0.01%) ⬇️
sklearn/datasets/california_housing.py 39.47% <0%> (ø) ⬆️
sklearn/datasets/tests/test_samples_generator.py 100% <0%> (ø) ⬆️
sklearn/datasets/species_distributions.py 28% <0%> (ø) ⬆️
sklearn/decomposition/tests/test_pca.py 100% <0%> (ø) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update daeb3ad...4c4ea88. Read the comment docs.

@jnothman
Copy link
Member
jnothman commented Oct 11, 2017 via email

@qinhanmin2014
Copy link
Member Author

ping @lesteve
I think I'll wait for the final decision from @jnothman and @lesteve.
Personally, after @lesteve's explanation, I might prefer to deprecate reorder. It seems worthwhile to force users to change their code since the potential error might be significant. Also, another important reason is that, I don't think the reorder parameter is designed for general case, because sometimes users might don't want to use y to break ties (like the case below). So it might be better that users sort x and arrange y themselves to get the result they expect.
index
Thanks for the the detailed instructions from all of you :)

@jnothman
Copy link
Member
jnothman commented Oct 11, 2017 via email

@qinhanmin2014
Copy link
Member Author

@jnothman I have updated a more detailed reason for deprecation. Do you think it is enough for you to remove the ? in the title? Thanks :)

@jnothman
Copy link
Member

Sure

@jnothman jnothman changed the title [MRG+1?] Deprecate reorder parameter in auc [MRG+1] Deprecate reorder parameter in auc Oct 11, 2017
Copy link
Member
@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, otherwise LGTM.

general use) and is no longer used there. What's more, the result
from auc will be significantly influenced if x is sorted
unexpectedly due to slight floating point error (See issue #9786).
Future (and default) behavior is equivalent to `reorder=False`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rst and not markdown so you need double-backticks around reorder=False. Single backtick is italic in rst.

@@ -83,8 +94,13 @@ def auc(x, y, reorder=False):
raise ValueError('At least 2 points are needed to compute'
' area under curve, but x.shape = %s' % x.shape)

if reorder != 'deprecated':
warnings.warn("The `reorder` parameter has been deprecated "
"in version 0.20 and will be removed in 0.22.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best practices: the warning ideally:

  • explains why there is a warning (maybe here the doc + release notes is enough in this case)
  • what to do to get rid of it. In this case say it is recommended to not set reorder (same behaviour as reorder=False) and make sure to have x and y ordered (maybe add as the points of a curve. I am not sure how to phrase this better ...)

@qinhanmin2014
Copy link
Member Author

@lesteve Thanks :) Comments addressed.

@lesteve
Copy link
Member
lesteve commented Oct 12, 2017

@jnothman any comments on this one? If I don't hear anything I'll merge it in a few days.

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jnothman jnothman changed the title [MRG+1] Deprecate reorder parameter in auc [MRG+2] Deprecate reorder parameter in auc Oct 17, 2017
@jnothman jnothman merged commit a7f8c32 into scikit-learn:master Oct 17, 2017
@qinhanmin2014 qinhanmin2014 deleted the auc_tol branch October 17, 2017 02:57
@lesteve
Copy link
Member
lesteve commented Oct 17, 2017

Good stuff @qinhanmin2014!

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0