-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
MAINT use nanmin to replace nan by finite values in ranking of SearchCV #24543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR should solve the remaining failure shown in the CIs using the dev versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you run with [scipy-dev]
to see if this fixes the issues in CI?
Done. It will still fail until #24521 is in |
I just merged #24521. |
OK merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the behavior on with scikit-learn 1.1 and SciPy 1.8 is:
from scipy.stats import rankdata
import numpy as np
array_means = np.asarray([np.nan] * 3)
min_array_means = min(array_means) - 1
array_means = np.nan_to_num(array_means, copy=True, nan=min_array_means)
rankdata(-array_means, method="min")
# array([1, 2, 3])
With this PR and the nanmin
fix:
from scipy.stats import rankdata
import numpy as np
array_means = np.asarray([np.nan] * 3)
min_array_means = np.nanmin(array_means) - 1
if np.isnan(min_array_means):
min_array_means = 0
array_means = np.nan_to_num(array_means, copy=True, nan=min_array_means)
rankdata(-array_means, method="min")
# array([1, 1, 1])
Even tho this is a edge case, it is backward breaking change. If we move forward with this behavior change, I think it belongs in the change log.
Furthermore, for array_means = np.asarray([np.nan] * 3 + [1, 2, 3])
, the rank_results
are:
- SciPy 1.8 and on scikit-learn 1.1 ->
rank_results=array([4, 4, 4, 3, 2, 1])
- SciPy 1.10 and
nanmin
fix ->rank_results=array([4, 5, 6, 3, 2, 1])
sklearn/model_selection/_search.py
Outdated
@@ -968,7 +968,10 @@ def _store(key_name, array, weights=None, splits=False, rank=False): | |||
# when input is nan, scipy >= 1.10 rankdata returns nan. To | |||
# keep previous behaviour nans are set to be smaller than the | |||
# minimum value in the array before ranking | |||
min_array_means = min(array_means) - 1 | |||
min_array_means = np.nanmin(array_means) - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When array_means
is all nan, there is a RuntimeWarning:
RuntimeWarning: All-NaN slice encountered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can either catch it or check for all nans first before calling nanmin
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I catch this case early.
Ranking only nan as no meaning and we should not give a ranking IMO.
I would be fine with the current incompatibility.
We can always get this behaviour and make it a bug fix with a change log entry. WDYT?
…Sent from my iPhone
On 29 Sep 2022, at 19:16, Thomas J. Fan ***@***.***> wrote:
@thomasjpfan commented on this pull request.
Overall, the behavior on with scikit-learn 1.1 and SciPy 1.8 is:
from scipy.stats import rankdata
import numpy as np
array_means = np.asarray([np.nan] * 3)
min_array_means = min(array_means) - 1
array_means = np.nan_to_num(array_means, copy=True, nan=min_array_means)
rankdata(-array_means, method="min")
# array([1, 2, 3])
With this PR and the nanmin fix:
from scipy.stats import rankdata
import numpy as np
array_means = np.asarray([np.nan] * 3)
min_array_means = np.nanmin(array_means) - 1
if np.isnan(min_array_means):
min_array_means = 0
array_means = np.nan_to_num(array_means, copy=True, nan=min_array_means)
rankdata(-array_means, method="min")
# array([1, 1, 1])
Even tho this is a edge case, it is backward breaking change. If we move forward with this behavior change, I think it belongs in the change log.
Furthermore, for array_means = np.asarray([np.nan] * 3 + [1, 2, 3]), the rank_results are:
SciPy 1.8 and on scikit-learn 1.1 -> rank_results=array([4, 4, 4, 3, 2, 1])
SciPy 1.10 and nanmin fix -> rank_results=array([4, 5, 6, 3, 2, 1])
In sklearn/model_selection/_search.py:
> @@ -968,7 +968,10 @@ def _store(key_name, array, weights=None, splits=False, rank=False):
# when input is nan, scipy >= 1.10 rankdata returns nan. To
# keep previous behaviour nans are set to be smaller than the
# minimum value in the array before ranking
- min_array_means = min(array_means) - 1
+ min_array_means = np.nanmin(array_means) - 1
When array_means is all nan, there is a RuntimeWarning:
RuntimeWarning: All-NaN slice encountered
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you authored the thread.
|
I am okay with making it a bug fix and adding it to the change log. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to add or update a test to check more precisely for the new behavior ;)
And also please re-push a commit for |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I modified the previous test to assert the new behaviour. It would fail on |
@ogrisel It is green ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, do we need to have a similar description in two separate places of the changelog?
I tried to suggest a wording tweak to make it sligthly more clear, not 100% sure I succeeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (once @lesteve's feedback has been taken into account).
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Thanks @lesteve. I will be merging since I don't run the right CI anyway. |
…CV (scikit-learn#24543) Co-authored-by: Tim Head <betatim@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Fixes #24424
For SciPy >= 1.10,
rankdata
does not deal withnan
.When finding the minimum finite values, we should be using
np.nanmin
and notmin
to discardnan
values. The subsequent casting is therefore safe.