[WIP] MNT Use isinstance instead of dtype.kind check for scalar validation. #10017
Conversation
For completeness, here are some similar tuples:
Shall we agree on lower/upper case?
Hijacking since it's also relevant to an open PR I have, #10042, and @jnothman sent me to #7394 (which this is the successor of, right?)
Uppercase was used since it was global at that time. If we define something, I would use lower case.
You also have FLOAT_DTYPE in validation.py.
Yes, validation.py please.
Also important for the matter: should 0d arrays be considered as ints for all practical purposes (and be included in the int types)? It seems scikit-learn is still inconsistent in this aspect.
I don't think we should differentiate between scalar arrays and Python scalars... But I also don't think we should promise that we support scalar arrays. It's tempting to raise an error when they are passed as parameters... But I'm sure we'll break someone's code if we do so...
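The three-way distinction being discussed can be sketched as follows (a minimal illustration, assuming numpy >= 1.9, where numpy scalar types subclass the `numbers` ABCs):

```python
import numbers

import numpy as np

# Three things that all "look like" the integer 3 to a user:
py_int = 3               # Python scalar
np_scalar = np.int64(3)  # numpy scalar
arr_0d = np.array(3)     # 0d array (a scalar array)

# Python and numpy scalars pass an isinstance check against the
# numbers ABCs; a 0d array does not, since it is an ndarray.
print(isinstance(py_int, numbers.Integral))     # True
print(isinstance(np_scalar, numbers.Integral))  # True
print(isinstance(arr_0d, numbers.Integral))     # False

# Indexing with an empty tuple unwraps the 0d array into a scalar.
print(isinstance(arr_0d[()], numbers.Integral))  # True
```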
@glemaitre regarding that, maybe an option is to set |
Actually there are not many places where |
Actually it should be equivalent to |
Actually, |
So at the moment this doesn't handle 0d arrays. Are we making that a policy?
It doesn't. I kept the behavior that was in the codebase. IMHO if a function needs to accept 0d arrays it should be handled there directly, in the same manner that:

```python
def my_func(param=None):
    if not (isinstance(param, integer_types) or param is None):
        if param.ndim == 0 and isinstance(param[()], integer_types):
            param = param[()]
        else:
            raise ValueError("n_estimators must be an integer, "
                             "got {0}.".format(type(param)))
    ...
```
I think we have previously had issues where 0d arrays were passed in unknowingly, but it may have been due to bugs in numpy/scipy/etc. that previously existed (e.g. iteration over a memmapped array or something obscure like that).
And answering @jnothman: I'm +1 to not explicitly support 0d arrays. @glemaitre any thoughts?
0d arrays were not supported; I don't think we should try to add support. It would make the code overly complex for no real benefit.
I don't think the code becomes overly complex if we have an is_integer or a check_integer helper instead of using isinstance. Yes, 0d arrays were not supported, but this could also be inducing unintentional behaviour. Note that, for instance, 0d arrays are output when iterating over a numpy array. Thus for instance |
If that's the case, I think we should implement our own checker (or validation unwrapping the 0d array).

```python
>>> foo = np.array(42)
>>> isinstance(foo, (numbers.Integral, np.integer))
False
>>> isinstance(foo[()], (numbers.Integral, np.integer))
True
```

and adding |
@jnothman

```python
>>> import numbers
>>> import numpy as np
>>> from sklearn.model_selection import ParameterGrid
>>> [isinstance(d['a'], (numbers.Integral, np.integer))
...  for d in ParameterGrid(dict(a=np.arange(3)))]
[True, True, True]
>>> [isinstance(d['a'], np.ndarray) for d in ParameterGrid(dict(a=np.arange(3)))]
[False, False, False]
```
Oh dear... am I getting multiple things confused? They do have attributes like shape and ndim. Well, whatever we get out of ParameterGrid should be handled and tested!
I had not realized that instances of |
I agree with that. 0d array is a weird quirk which, to be honest, I don't really understand. They should not be confused with numpy scalars.

```python
arr_0d = np.array(0)   # 0d array
scalar = np.int64(0)   # numpy scalar
```

We should definitely support numpy scalars because they are very easy to get, e.g. by indexing, doing a reduction like |
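To illustrate how easily numpy scalars (as opposed to 0d arrays) show up in ordinary code, here is a small sketch; the variable names are just for illustration:

```python
import numpy as np

x = np.arange(5)

by_indexing = x[0]            # a numpy scalar (e.g. np.int64)
by_reduction = x.sum()        # also a numpy scalar
by_iteration = next(iter(x))  # iterating a 1d array yields numpy scalars too

# None of these are arrays, 0d or otherwise.
for value in (by_indexing, by_reduction, by_iteration):
    print(type(value), isinstance(value, np.ndarray))  # ..., False
```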
Also note that in recent versions of numpy:
sklearn/utils/validation.py (Outdated)

```diff
@@ -24,6 +24,8 @@
 from ..externals.joblib import Memory

+integer_types = (numbers.Integral, np.integer)
+floating_types = (float, np.floating)
```
Those two constants should be moved to sklearn.utils.fixes and renamed to:

```python
# In Python 1.8.2 np.integer is not yet a subclass of numbers.Integral
SCALAR_INTEGRAL_TYPES = (numbers.Integral, np.integer)
SCALAR_REAL_TYPES = (numbers.Real, np.floating)
```
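A minimal sketch of how such a constant could be used to validate a scalar parameter; the helper name `check_n_estimators` is hypothetical (not part of the PR), and only the tuple itself comes from the suggestion above:

```python
import numbers

import numpy as np

# Tuple as suggested in the review comment above.
SCALAR_INTEGRAL_TYPES = (numbers.Integral, np.integer)

def check_n_estimators(n_estimators):
    # Hypothetical validation helper: accept Python and numpy integer
    # scalars alike; reject everything else (floats, 0d arrays, ...).
    if not isinstance(n_estimators, SCALAR_INTEGRAL_TYPES):
        raise TypeError("n_estimators must be an integer, got %r."
                        % type(n_estimators))
    return int(n_estimators)

print(check_n_estimators(10))           # 10
print(check_n_estimators(np.int32(7)))  # 7
```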
Ideally, it would be nice to find out the version of numpy that made its scalar types subclasses of the corresponding Python scalar base types from numbers.
> In Python 1.8.2 np.integer is not yet a subclass of numbers.Integral

I think that you mean Numpy, I could not find Python 1.* in conda :)
> it would be nice to find out the version of numpy that made its scalar types subclasses of the corresponding Python scalar

It is 1.9.
@ogrisel I disagree with SCALAR_REAL_TYPES, since integers derive from Real whereas they do not derive from float.

```python
>>> import numbers
>>> isinstance(int(42), numbers.Real)
True
>>> isinstance(int(42), float)
False
```
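To make the objection concrete, here is a small sketch: a tuple built on `numbers.Real` would silently accept integer parameters (since `Integral` is a subclass of `Real`), whereas a float-based tuple rejects them.

```python
import numbers

import numpy as np

value = 42  # a Python int passed where a float might be expected

# Integral subclasses Real, so a Real-based tuple lets ints through...
print(isinstance(value, (numbers.Real, np.floating)))  # True

# ...while a float-based tuple does not.
print(isinstance(value, (float, np.floating)))  # False
```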
There's also another one in scikit-learn/sklearn/externals/six.py (lines 35 to 46 in 3fa7a06), which is only used in these two tests, but I'm not sure that it has the same functionality:
scikit-learn/sklearn/model_selection/tests/test_split.py (lines 549 to 550 in 3fa7a06)
scikit-learn/sklearn/tests/test_cross_validation.py (lines 446 to 447 in 3fa7a06)
Do not modify the vendored six module, but feel free to replace usage of |
Indeed, good catch.
sklearn/utils/validation.py (Outdated)

```diff
 from .. import get_config as _get_config
 from ..exceptions import NonBLASDotWarning
 from ..exceptions import NotFittedError
 from ..exceptions import DataConversionWarning
 from ..externals.joblib import Memory

+floating_types = (float, np.floating)
```
I think we should rename this to scalar_floating_types to make it explicit that this should not be used to validate array dtypes.
Also, because this tuple of types is not provided as a backport for old versions of numpy but is going to be used to validate scalar parameters in similar ways to scalar_integer_types, I think we should keep both constants here instead of in sklearn.utils.fixes. Sorry for the back and forth comments.
There are places where only numbers.Integral is used, like:
scikit-learn/sklearn/tree/export.py (line 14 in 3fa7a06)
scikit-learn/sklearn/tree/export.py (lines 413 to 420 in 3fa7a06)
which can raise an error if numpy is 1.8.2. But there's no open issue for that. Shall we make it homogeneous and add a test for it? Another disturbing thought is this comment:
scikit-learn/sklearn/feature_extraction/hashing.py (lines 103 to 105 in 3fa7a06)
```python
>>> np.__version__
'1.8.2'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[False, False, False, False, False]

>>> np.__version__
'1.9.3'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

>>> np.__version__
'1.10.4'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

>>> np.__version__
'1.11.3'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]

>>> np.__version__
'1.12.1'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]
```
And IIRC there has also been some variation across Python versions. It's unlikely that an int16 scalar (or a uint) is going to be used for a parameter, but there's no harm in supporting it. Yes, we should be replacing uses of numbers.Integral, and of dtype.kind checks for scalars, in this PR.
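One way to package the isinstance-based check discussed in this thread is a small helper. This is only a sketch: the name `is_integer` is taken from the earlier suggestion above, and excluding booleans is an extra assumption of mine (since bool subclasses numbers.Integral):

```python
import numbers

import numpy as np

def is_integer(x):
    # Accept Python ints and numpy integer scalars. Listing np.integer
    # explicitly also covers numpy < 1.9, where numpy integer types do
    # not subclass numbers.Integral. Booleans are deliberately rejected
    # even though bool subclasses Integral (an assumption, see above).
    return (isinstance(x, (numbers.Integral, np.integer))
            and not isinstance(x, (bool, np.bool_)))

print(is_integer(3))             # True
print(is_integer(np.uint16(3)))  # True
print(is_integer(3.0))           # False
print(is_integer(True))          # False
```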
Since the minimal numpy version is currently 1.11.0, a significant part of this PR is no longer needed (and was fixed in #14004).
This was taken care of by #14004, but some were left off:

```
~/code/scikit-learn master
(mne) ❯ git grep -i 'np.asarray(test_size)'
sklearn/model_selection/_split.py: test_size_type = np.asarray(test_size).dtype.kind
```

I'll close this one and maybe open a new one.
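For reference, the two styles being compared in that remaining spot can be sketched side by side; `test_size = 0.25` is just an example value:

```python
import numpy as np

test_size = 0.25

# dtype.kind style, as in the grep hit above: wrap the scalar in an
# array and inspect the dtype kind character ('f' float, 'i' int, ...).
kind = np.asarray(test_size).dtype.kind
dtype_kind_says_float = kind == 'f'

# isinstance style, as proposed by this PR.
isinstance_says_float = isinstance(test_size, (float, np.floating))

print(dtype_kind_says_float, isinstance_says_float)  # True True
```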
Reference Issues/PRs
This PR takes over the stalled #7394.
What does this implement/fix? Explain your changes.
Any other comments?
Based on the comments, @lesteve and @jnothman were in favor of using a helper function. I propose to use a set of types to mimic this:
My only question is where should I define such tuples. Any ideas?