[WIP] MNT Use isinstance instead of dtype.kind check for scalar validation. #10017 · scikit-learn/scikit-learn

Closed

Conversation

@massich (Contributor) commented Oct 26, 2017

Reference Issues/PRs

This PR takes over the stalled #7394.

What does this implement/fix? Explain your changes.

Any other comments?

Based on the comments, @lesteve and @jnothman were in favor of using a helper function. I propose to use a tuple of types to mimic this:

>>> from six import string_types
>>> isinstance('foo', string_types)
True

My only question is where I should define such tuples. Any ideas?
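For illustration, a minimal sketch of what such a tuple could look like for integers (integer_types is just the proposed name here, not an existing scikit-learn constant):

>>> import numbers
>>> import numpy as np
>>> integer_types = (numbers.Integral, np.integer)
>>> isinstance(3, integer_types)
True
>>> isinstance(np.int32(3), integer_types)
True
>>> isinstance(3.0, integer_types)
False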

@massich (Contributor Author) commented Oct 26, 2017

For completeness, here are some similar tuples:

Shall we agree on lower/upper case?

@massich massich force-pushed the model_selection_enhancements branch from 53f65f6 to 01c417f Compare October 26, 2017 20:25
@massich massich changed the title [WIP] MNT Use isinstance instead of dtype.kind check for scalar validation. [MRG] MNT Use isinstance instead of dtype.kind check for scalar validation. Oct 26, 2017
@CoderPat (Contributor)

Hijacking, since it's also relevant to an open PR I have (#10042), and @jnothman sent me to #7394 (which this is the successor of, right?).
It would probably fit into the utils folder. Maybe create a types.py? Also, would this require scikit-learn-wide changes, or just starting to use this from now on when type checking is required?

@glemaitre (Member)

Shall we agree on lower/upper case?

Uppercase was used since it was a global at that time. If we define something new, I would use lower case.

@glemaitre (Member)

For completeness, here are some similar tuples:

You also have FLOAT_DTYPES in validation.py.

@glemaitre (Member)

I would suggest defining it inside validation.py or in __init__.py, I think, but I would defer to the experience of @lesteve @jnothman @amueller.

@jnothman (Member) commented Oct 31, 2017 via email

@CoderPat (Contributor) commented Nov 1, 2017

Also important for the matter: should 0d arrays be considered ints for all practical purposes (and be included in the int types)? It seems scikit-learn is still inconsistent in this respect.

@jnothman (Member) commented Nov 1, 2017 via email

@massich massich force-pushed the model_selection_enhancements branch from 01c417f to c05498e Compare November 3, 2017 14:33

@massich massich force-pushed the model_selection_enhancements branch from c05498e to f838fa9 Compare November 3, 2017 14:45
@massich (Contributor Author) commented Nov 3, 2017

@glemaitre, regarding FLOAT_DTYPES: I don't think it can be substituted by floating_types, since it is used with check_array, which expects a tuple of dtypes and converts to the first one when there's no match (see here).

Maybe an option is to set floating_types to (np.float64, float, np.floating), but I'm not sure that this is the right move.
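For reference, a small sketch of that conversion behaviour (assuming FLOAT_DTYPES = (np.float64, np.float32, np.float16) as defined in sklearn.utils.validation):

>>> import numpy as np
>>> from sklearn.utils.validation import check_array, FLOAT_DTYPES
>>> # a dtype already in the tuple is kept as-is
>>> check_array(np.ones((2, 2), dtype=np.float32), dtype=FLOAT_DTYPES).dtype
dtype('float32')
>>> # anything else is converted to the first entry, np.float64
>>> check_array(np.ones((2, 2), dtype=np.int64), dtype=FLOAT_DTYPES).dtype
dtype('float64')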

@massich (Contributor Author) commented Nov 3, 2017

Actually, there are not many places where isinstance is used with floating apart from validation.py, so I think we can leave it as it is.

@glemaitre (Member)

Maybe an option is to set floating_types to (np.float64, float, np.floating), but I'm not sure that this is the right move.

Actually it should be equivalent to floating_types=(np.float64, float, np.float32, np.float16)

@massich (Contributor Author) commented Nov 3, 2017

Actually, np.floating also covers float128 (where the platform provides it), which might not be desirable in check_array, in the same manner that we might not want to feed float (the non-numpy type) to check_array.
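For example (the exact set of float widths is platform-dependent; float128 is typically available on Linux/macOS builds of numpy but not on Windows):

>>> import numpy as np
>>> np.sctypes['float']
[<class 'numpy.float16'>, <class 'numpy.float32'>, <class 'numpy.float64'>, <class 'numpy.float128'>]
>>> isinstance(np.float128(1.0), np.floating)
True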

@jnothman (Member) commented Nov 4, 2017

So at the moment this doesn't handle 0d arrays. Are we making that a policy?

@massich (Contributor Author) commented Nov 4, 2017

It doesn't. I kept the behavior that was in the codebase. IMHO, if a function needs to accept 0d arrays, it should be handled there directly (in the same manner that None is handled), with something like:

def my_func(param=None):
    if not (isinstance(param, integer_types) or param is None):
        # unwrap a 0d integer array into its scalar, reject everything else
        if (isinstance(param, np.ndarray) and param.ndim == 0
                and isinstance(param[()], integer_types)):
            param = param[()]
        else:
            raise ValueError("n_estimators must be an integer, got {0}."
                             .format(type(param)))
    ...

@jnothman (Member) commented Nov 4, 2017

I think we have previously had issues where 0d arrays were passed in unknowingly, but it may have been due to bugs in numpy/scipy/etc that previously existed (e.g. iteration over memmapped array or something obscure like that).

@massich (Contributor Author) commented Nov 4, 2017

And answering @jnothman: I'm +1 for not explicitly supporting 0d arrays. @glemaitre, any thoughts?

@ogrisel (Member) commented Nov 7, 2017

0d arrays were not supported; I don't think we should try to add support. It would make the code overly complex for no real benefit.

@jnothman (Member) commented Nov 7, 2017

I don't think the code becomes overly complex if we have an is_integer or a check_integer helper instead of using isinstance.

Yes, 0d arrays were not supported, but this could also be inducing unintentional behaviour. Note that, for instance, 0d arrays are output when iterating over a numpy array. Thus for instance ParameterGrid({'min_weight_fraction_leaf': np.linspace(.1, 1), 'max_depth': np.arange(2, 10, 2)}) will set parameters to numpy 0d arrays, not to Python scalars. (Perhaps this is a bug in ParameterGrid, but it's certainly longstanding.)

Using if isinstance(my_param, integer_types): do something; else: do some other thing will do the other - unwanted - thing. Not supporting 0d arrays means we need to test our code better and make sure we never use else in validation and always use elif. Are you still sure there are no real benefits?

@massich (Contributor Author) commented Nov 7, 2017

If that's the case, I think we should implement our own checker (or a validation helper unwrapping the 0d array).

>>> import numbers
>>> import numpy as np
>>> foo = np.array(42)
>>> isinstance(foo, (numbers.Integral, np.integer))
False
>>> isinstance(foo[()], (numbers.Integral, np.integer))
True

and adding np.ndarray to the possible types defeats the purpose.
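A minimal sketch of what such a checker could look like (check_integer is a hypothetical helper name, following @jnothman's suggestion above, not something this PR adds):

import numbers
import numpy as np

def check_integer(value, name="parameter"):
    # unwrap a 0d array into the scalar it contains
    if isinstance(value, np.ndarray) and value.ndim == 0:
        value = value[()]
    if not isinstance(value, (numbers.Integral, np.integer)):
        raise ValueError("{} must be an integer, got {}."
                         .format(name, type(value)))
    return value

With this, check_integer(np.array(42)) would return a numpy integer scalar, while check_integer(np.array(1.5)) or check_integer('foo') would raise a ValueError.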

@ogrisel (Member) commented Nov 7, 2017

@jnothman ParameterGrid returns instances of np.integer or np.float scalar classes, but not instances of 0d arrays:

>>> import numbers
>>> import numpy as np
>>> from sklearn.model_selection import ParameterGrid

>>> [isinstance(d['a'], (numbers.Integral, np.integer))
...  for d in ParameterGrid(dict(a=np.arange(3)))]
...
[True, True, True]
>>> [isinstance(d['a'], np.ndarray) for d in ParameterGrid(dict(a=np.arange(3)))]
[False, False, False]

@jnothman (Member) commented Nov 7, 2017 via email

@ogrisel (Member) commented Nov 8, 2017

I had not realized that instances of np.int64 had an ndim and a shape attribute either. I agree we should be careful to test that what we get out of ParameterGrid iterations is compatible with the set_params and constructors of estimators. I don't know if it's possible to write a test_common for this. Python 3 type hints would be helpful here.
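For what it's worth, numpy scalars do expose these attributes even though they are not ndarrays (quick check, assuming a 64-bit build):

>>> import numpy as np
>>> x = np.int64(0)
>>> x.ndim, x.shape
(0, ())
>>> isinstance(x, np.ndarray)
False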

@lesteve (Member) commented Nov 8, 2017

0d arrays were not supported; I don't think we should try to add support. It would make the code overly complex for no real benefit.

I agree with that. 0d arrays are a weird quirk which, to be honest, I don't really understand. They should not be confused with numpy scalars.

arr_0d = np.array(0)  # 0d array
scalar = np.int64(0)  # numpy scalar

We should definitely support numpy scalars because they are very easy to get, e.g. by indexing, doing a reduction like .max() or .sum(), iterating over an array, etc.
As for numpy 0d arrays, I am not sure how you would get one without explicitly doing something like np.array(0).
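A quick illustration of the difference (integer widths assume a 64-bit platform):

>>> import numbers
>>> import numpy as np
>>> a = np.arange(3)
>>> # indexing and reductions give numpy scalars, which pass the isinstance check
>>> isinstance(a[0], (numbers.Integral, np.integer))
True
>>> isinstance(a.sum(), (numbers.Integral, np.integer))
True
>>> # a 0d array is a full ndarray and does not
>>> isinstance(np.array(0), (numbers.Integral, np.integer))
False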

@ogrisel (Member) commented Nov 8, 2017

Also note that in recent versions of numpy, issubclass(np.integer, numbers.Integral) is True, but this is not the case in numpy 1.8.2, which we still support. So it's worth maintaining a SCALAR_INTEGER_TYPES = (numbers.Integral, np.integer) constant in sklearn.utils.fixes for the time being.
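A sketch of what that constant in sklearn/utils/fixes.py could look like (SCALAR_INTEGER_TYPES is the name suggested here, not an existing constant):

# sklearn/utils/fixes.py (sketch)
import numbers
import numpy as np

# np.integer is listed explicitly because on numpy 1.8.2 it is not yet
# registered as a subclass of numbers.Integral; on numpy >= 1.9 the second
# entry is redundant but harmless.
SCALAR_INTEGER_TYPES = (numbers.Integral, np.integer)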

@@ -24,6 +24,8 @@
from ..externals.joblib import Memory


integer_types = (numbers.Integral, np.integer)
floating_types = (float, np.floating)
Member

Those two constants should be moved to sklearn.utils.fixes and renamed to:

# In Python 1.8.2 np.integer is not yet a subclass of numbers.Integral
SCALAR_INTEGRAL_TYPES = (numbers.Integral, np.integer)
SCALAR_REAL_TYPES = (numbers.Real, np.floating)

Member

Ideally, it would be nice to find out the version of numpy that made its scalar types subclasses of the corresponding Python scalar base types from numbers.

@glemaitre (Member) commented Nov 8, 2017

In Python 1.8.2 np.integer is not yet a subclass of numbers.Integral

I think that you mean Numpy, I could not find Python 1.* in conda :)

Contributor Author

it would be nice to find out the version of numpy that made its scalar types subclasses of the corresponding Python scalar base types from numbers

it is 1.9

Contributor Author

@ogrisel I disagree with scalar_real_types, since integers are instances of numbers.Real whereas they are not instances of float:

>>> import numbers
>>> isinstance(int(42), numbers.Real)
True
>>> isinstance(int(42), float)
False

@massich massich force-pushed the model_selection_enhancements branch from 0a56a5b to a67a607 Compare November 8, 2017 14:16
@massich (Contributor Author) commented Nov 8, 2017

There's also another integer_types in six. Here:

if PY3:
    string_types = str,
    integer_types = int,
    class_types = type,
    text_type = str
    binary_type = bytes
    MAXSIZE = sys.maxsize
else:
    string_types = basestring,
    integer_types = (int, long)
    class_types = (type, types.ClassType)

which is only used in these two tests, but I'm not sure that it has the same functionality:

for typ in six.integer_types:
    ss4 = ShuffleSplit(test_size=typ(2), random_state=0).split(X)

for typ in six.integer_types:
    ss4 = cval.ShuffleSplit(10, test_size=typ(2), random_state=0)
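It does behave differently for numpy integers, at least on Python 3 (a quick check; six here is the copy vendored in sklearn.externals):

>>> import numpy as np
>>> from sklearn.externals import six
>>> six.integer_types
(<class 'int'>,)
>>> isinstance(np.int64(2), six.integer_types)
False
>>> isinstance(np.int64(2), (int, np.integer))
True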

@massich (Contributor Author) commented Nov 8, 2017

Just to not lose this comment within the review:
@ogrisel I disagree with scalar_real_types, since integers are instances of numbers.Real whereas they are not instances of float:

>>> import numbers
>>> isinstance(int(42), numbers.Real)
True
>>> isinstance(int(42), float)
False

@ogrisel (Member) commented Nov 8, 2017

Do not modify the vendored six module, but feel free to replace usages of six.integer_types with your own constants from sklearn.utils.fixes instead.

@ogrisel (Member) commented Nov 8, 2017

@ogrisel I disagree with scalar_real_types, since integers are instances of numbers.Real whereas they are not instances of float.

Indeed, good catch. SCALAR_FLOATING_TYPES then.

from .. import get_config as _get_config
from ..exceptions import NonBLASDotWarning
from ..exceptions import NotFittedError
from ..exceptions import DataConversionWarning
from ..externals.joblib import Memory


floating_types = (float, np.floating)
Member

I think we should rename this to scalar_floating_types to make it explicit that it should not be used to validate array dtypes.

Also, because this tuple of types is not provided as a backport for old versions of numpy but is going to be used to validate scalar parameters in similar ways to scalar_integer_types, I think we should keep both constants here instead of in sklearn.utils.fixes. Sorry for the back-and-forth comments.

@massich massich force-pushed the model_selection_enhancements branch from a67a607 to e0e6281 Compare November 8, 2017 15:56
@massich (Contributor Author) commented Nov 8, 2017

There are places where only numbers.Integral is used, like:

from numbers import Integral

if isinstance(precision, Integral):
    if precision < 0:
        raise ValueError("'precision' should be greater or equal to 0."
                         " Got {} instead.".format(precision))
else:
    raise ValueError("'precision' should be an integer. Got {}"
                     " instead.".format(type(precision)))

which can wrongly raise for numpy integers if numpy is 1.8.2, but there's no open issue about that. Shall we make it homogeneous and add a test for it?

Another disturbing thought is this comment:

def _validate_params(n_features, input_type):
    # strangely, np.int16 instances are not instances of Integral,
    # while np.int64 instances are...

>>> np.__version__
'1.8.2'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[False, False, False, False, False]

.... 

>>> np.__version__
'1.9.3'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

...

>>> np.__version__
'1.10.4'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

...

>>> np.__version__
'1.11.3'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

...

>>> np.__version__
'1.12.1'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

@jnothman (Member) commented Nov 8, 2017 via email

@massich massich changed the title [MRG] MNT Use isinstance instead of dtype.kind check for scalar validation. [WIP] MNT Use isinstance instead of dtype.kind check for scalar validation. Nov 9, 2017
@rth (Member) commented Jun 13, 2019

Also note that in recent versions of numpy: issubclass(np.integer, numbers.Integral) is True but this is not the case in numpy 1.8.2 which we still support. So it's worth maintaining a SCALAR_INTEGER_TYPES

Since the minimal numpy version is currently 1.11.0, a significant part of this PR is no longer needed (and was fixed in #14004).

@amueller amueller added the Needs Decision Requires decision label Aug 6, 2019
@massich (Contributor Author) commented Sep 6, 2019

This was taken care of by #14004, but some cases were left out:

~/code/scikit-learn master
(mne) ❯ git grep -i 'np.asarray(test_size)'                             
sklearn/model_selection/_split.py:    test_size_type = np.asarray(test_size).dtype.kind
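For reference, a sketch of how that remaining dtype.kind check could be rewritten with isinstance (illustrative only; _test_size_kind is a hypothetical helper, not what a follow-up PR necessarily does):

import numbers
import numpy as np

def _test_size_kind(test_size):
    # mirror the 'i' / 'f' classification of np.asarray(test_size).dtype.kind
    if isinstance(test_size, (numbers.Integral, np.integer)):
        return 'i'
    if isinstance(test_size, (numbers.Real, np.floating)):
        return 'f'
    raise ValueError("Invalid value for test_size: {!r}".format(test_size))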

I'll close this one and maybe open a new one

@massich massich closed this Sep 6, 2019
Labels
Needs Decision Requires decision