[WIP] MNT Use isinstance instead of dtype.kind check for scalar validation. #10017
Conversation
For completeness, here are some similar tuples:
Shall we agree on lower/upper case?
Hijacking since it's also relevant to an open PR I have, #10042, and @jnothman sent me to #7394 (which this is the successor of, right?)
Uppercase was used since it was global at that time. If we define something, I would use lower case.
You also have FLOAT_DTYPE in validation.py.
Yes, validation.py please.
Also important for the matter: should 0d arrays be considered as ints for all practical purposes (and be included in the int types)? It seems scikit-learn is still inconsistent in this aspect.
I don't think we should differentiate between scalar arrays and Python scalars... But I also don't think we should promise that we support scalar arrays. It's tempting to raise an error when they are passed as parameters... But I'm sure we'll break someone's code if we do so...
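The three-way distinction being discussed can be sketched as follows (a minimal illustration, assuming numpy >= 1.9, where numpy scalar types subclass the `numbers` ABCs):

```python
import numbers

import numpy as np

# Three things that all "look like" the integer 3 to a user:
py_int = 3               # Python scalar
np_scalar = np.int64(3)  # numpy scalar
arr_0d = np.array(3)     # 0d array (a scalar array)

# Python and numpy scalars pass an isinstance check against the
# numbers ABCs; a 0d array does not, since it is an ndarray.
print(isinstance(py_int, numbers.Integral))     # True
print(isinstance(np_scalar, numbers.Integral))  # True
print(isinstance(arr_0d, numbers.Integral))     # False

# Indexing with an empty tuple unwraps the 0d array into a scalar.
print(isinstance(arr_0d[()], numbers.Integral))  # True
```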
@glemaitre regarding that, maybe an option is to set |
Actually there are not many places where |
Actually it should be equivalent to |
Actually, |
So at the moment this doesn't handle 0d arrays. Are we making that a policy?
It doesn't. I kept the behavior that was in the codebase. IMHO if a function needs to accept 0d arrays it should be handled there directly, in the same manner that:

```python
def my_func(param=None):
    if not (isinstance(param, integer_types) or param is None):
        if param.ndim == 0 and isinstance(param[()], integer_types):
            param = param[()]
        else:
            raise ValueError("n_estimators must be an integer, "
                             "got {0}.".format(type(param)))
    ...
```
I think we have previously had issues where 0d arrays were passed in unknowingly, but it may have been due to bugs in numpy/scipy/etc. that previously existed (e.g. iteration over a memmapped array or something obscure like that).
And answering @jnothman: I'm +1 to not explicitly support 0d arrays. @glemaitre any thoughts?
0d arrays were not supported; I don't think we should try to add support. It would make the code overly complex for no real benefit.
I don't think the code becomes overly complex if we have an is_integer or a check_integer helper instead of using isinstance. Yes, 0d arrays were not supported, but this could also be inducing unintentional behaviour. Note that, for instance, 0d arrays are output when iterating over a numpy array. Thus for instance |
If that's the case, I think we should implement our own checker (or validation unwrapping the 0d array).

```python
>>> foo = np.array(42)
>>> isinstance(foo, (numbers.Integral, np.integer))
False
>>> isinstance(foo[()], (numbers.Integral, np.integer))
True
```

and adding |
@jnothman

```python
>>> import numbers
>>> import numpy as np
>>> from sklearn.model_selection import ParameterGrid
>>> [isinstance(d['a'], (numbers.Integral, np.integer))
...  for d in ParameterGrid(dict(a=np.arange(3)))]
[True, True, True]
>>> [isinstance(d['a'], np.ndarray) for d in ParameterGrid(dict(a=np.arange(3)))]
[False, False, False]
```
Oh dear... am I getting multiple things confused? They do have attributes like shape and ndim. Well, whatever we get out of ParameterGrid should be handled and tested!
I had not realized that instances of |
I agree with that. 0d array is a weird quirk which, to be honest, I don't really understand. They should not be confused with numpy scalars.

```python
arr_0d = np.array(0)   # 0d array
scalar = np.int64(0)   # numpy scalar
```

We should definitely support numpy scalars because they are very easy to get, e.g. by indexing, doing a reduction like |
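To illustrate how easily numpy scalars (as opposed to 0d arrays) show up in ordinary code, here is a small sketch; the variable names are just for illustration:

```python
import numpy as np

x = np.arange(5)

by_indexing = x[0]            # a numpy scalar (e.g. np.int64)
by_reduction = x.sum()        # also a numpy scalar
by_iteration = next(iter(x))  # iterating a 1d array yields numpy scalars too

# None of these are arrays, 0d or otherwise.
for value in (by_indexing, by_reduction, by_iteration):
    print(type(value), isinstance(value, np.ndarray))  # ..., False
```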
Also note that in recent versions of numpy:
sklearn/utils/validation.py (Outdated)

```diff
@@ -24,6 +24,8 @@
 from ..externals.joblib import Memory

+integer_types = (numbers.Integral, np.integer)
+floating_types = (float, np.floating)
```
Those two constants should be moved to sklearn.utils.fixes and renamed to:

```python
# In Python 1.8.2 np.integer is not yet a subclass of numbers.Integral
SCALAR_INTEGRAL_TYPES = (numbers.Integral, np.integer)
SCALAR_REAL_TYPES = (numbers.Real, np.floating)
```
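A minimal sketch of how such a constant could be used to validate a scalar parameter; the helper name `check_n_estimators` is hypothetical (not part of the PR), and only the tuple itself comes from the suggestion above:

```python
import numbers

import numpy as np

# Tuple as suggested in the review comment above.
SCALAR_INTEGRAL_TYPES = (numbers.Integral, np.integer)

def check_n_estimators(n_estimators):
    # Hypothetical validation helper: accept Python and numpy integer
    # scalars alike; reject everything else (floats, 0d arrays, ...).
    if not isinstance(n_estimators, SCALAR_INTEGRAL_TYPES):
        raise TypeError("n_estimators must be an integer, got %r."
                        % type(n_estimators))
    return int(n_estimators)

print(check_n_estimators(10))           # 10
print(check_n_estimators(np.int32(7)))  # 7
```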
Ideally, it would be nice to find out the version of numpy that made its scalar types subclasses of the corresponding Python scalar base types from numbers.
> In Python 1.8.2 np.integer is not yet a subclass of numbers.Integral

I think that you mean Numpy, I could not find Python 1.* in conda :)
> it would be nice to find out the version of numpy that made its scalar types subclasses of the corresponding Python scalar

It is 1.9.
@ogrisel I disagree with SCALAR_REAL_TYPES, since integers derive from Real whereas they do not derive from float.

```python
>>> import numbers
>>> isinstance(int(42), numbers.Real)
True
>>> isinstance(int(42), float)
False
```
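To make the objection concrete, here is a small sketch: a tuple built on `numbers.Real` would silently accept integer parameters (since `Integral` is a subclass of `Real`), whereas a float-based tuple rejects them.

```python
import numbers

import numpy as np

value = 42  # a Python int passed where a float might be expected

# Integral subclasses Real, so a Real-based tuple lets ints through...
print(isinstance(value, (numbers.Real, np.floating)))  # True

# ...while a float-based tuple does not.
print(isinstance(value, (float, np.floating)))  # False
```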
There's also another one in scikit-learn/sklearn/externals/six.py (lines 35 to 46 in 3fa7a06), which is only used in these two tests, but I'm not sure that it has the same functionality:
scikit-learn/sklearn/model_selection/tests/test_split.py (lines 549 to 550 in 3fa7a06)
scikit-learn/sklearn/tests/test_cross_validation.py (lines 446 to 447 in 3fa7a06)
Do not modify the vendored six module, but feel free to replace usage of |
Indeed, good catch.
sklearn/utils/validation.py (Outdated)

```diff
 from .. import get_config as _get_config
 from ..exceptions import NonBLASDotWarning
 from ..exceptions import NotFittedError
 from ..exceptions import DataConversionWarning
 from ..externals.joblib import Memory

+floating_types = (float, np.floating)
```
I think we should rename this to scalar_floating_types to make it explicit that this should not be used to validate array dtypes.
Also, because this tuple of types is not provided as a backport for old versions of numpy but is going to be used to validate scalar parameters in similar ways to scalar_integer_types, I think we should keep both constants here instead of in sklearn.utils.fixes. Sorry for the back and forth comments.
There are places where only numbers.Integral is used, like:
scikit-learn/sklearn/tree/export.py (line 14 in 3fa7a06)
scikit-learn/sklearn/tree/export.py (lines 413 to 420 in 3fa7a06)
which can raise an error if numpy is 1.8.2. But there's no open issue for that. Shall we make it homogeneous and add a test for it? Another disturbing thought is this comment:
scikit-learn/sklearn/feature_extraction/hashing.py (lines 103 to 105 in 3fa7a06)
```python
>>> np.__version__
'1.8.2'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[False, False, False, False, False]

>>> np.__version__
'1.9.3'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

>>> np.__version__
'1.10.4'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

>>> np.__version__
'1.11.3'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

>>> np.__version__
'1.12.1'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]
```
And IIRC there has also been some variation across Python versions. It's unlikely that an int16 scalar (or a uint) is going to be used for a parameter, but there's no harm in supporting it. Yes, we should be replacing uses of numbers.Integral, and of dtype.kind checks for scalars, in this PR.
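One way to package the isinstance-based check discussed in this thread is a small helper. This is only a sketch: the name `is_integer` is taken from the earlier suggestion above, and excluding booleans is an extra assumption of mine (since bool subclasses numbers.Integral):

```python
import numbers

import numpy as np

def is_integer(x):
    # Accept Python ints and numpy integer scalars. Listing np.integer
    # explicitly also covers numpy < 1.9, where numpy integer types do
    # not subclass numbers.Integral. Booleans are deliberately rejected
    # even though bool subclasses Integral (an assumption, see above).
    return (isinstance(x, (numbers.Integral, np.integer))
            and not isinstance(x, (bool, np.bool_)))

print(is_integer(3))             # True
print(is_integer(np.uint16(3)))  # True
print(is_integer(3.0))           # False
print(is_integer(True))          # False
```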
Since the minimal numpy version is currently 1.11.0, a significant part of this PR is no longer needed (and was fixed in #14004).
This was taken care of by #14004, but some were left off:

```
~/code/scikit-learn master
(mne) ❯ git grep -i 'np.asarray(test_size)'
sklearn/model_selection/_split.py: test_size_type = np.asarray(test_size).dtype.kind
```

I'll close this one and maybe open a new one.
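For reference, the two styles being compared in that remaining spot can be sketched side by side; `test_size = 0.25` is just an example value:

```python
import numpy as np

test_size = 0.25

# dtype.kind style, as in the grep hit above: wrap the scalar in an
# array and inspect the dtype kind character ('f' float, 'i' int, ...).
kind = np.asarray(test_size).dtype.kind
dtype_kind_says_float = kind == 'f'

# isinstance style, as proposed by this PR.
isinstance_says_float = isinstance(test_size, (float, np.floating))

print(dtype_kind_says_float, isinstance_says_float)  # True True
```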
Reference Issues/PRs
This PR takes over the stalled #7394.
What does this implement/fix? Explain your changes.
Any other comments?
Based on the comments, @lesteve and @jnothman were in favor of using a helper function. I propose to use a set of types to mimic this:
My only question is where should I define such tuples. Any ideas?