Preserving dtype for float32 / float64 in transformers #11000

glemaitre · 2018-04-20T11:34:39Z

This is the issue which we want to tackle during the Man AHL Hackathon.

We would like that the transformer does not convert float32 to float64 whenever possible. The transformers which are currently failing are:

We could think to extend it to integer whenever possible and applicable.

Also the following transformers are not included in the common tests. We should write a specific test:

# some strange ones                                                                                                                                                         
DONT_TEST = ['SparseCoder', 'DictVectorizer',                                                                                                                               
             'TfidfTransformer',                                                                                                                     
             'TfidfVectorizer' (check 10443), 'IsotonicRegression',                                                                                                                       
             'CategoricalEncoder',                                                                                                 
             'FeatureHasher',                                                                                             
             'TruncatedSVD', 'PolynomialFeatures',                                                                                                                          
             'GaussianRandomProjectionHash', 'HashingVectorizer',                                                                                                           
             'CountVectorizer']

We could also check classifiers, regressors or clusterers (see #8769 for more context),

AffinitiyPropagation -> Bug in Incorrect Clusters Due To Dtype Mismatch #10832
check SVC -> SVC: Do not enforce that input data is of type np.float64 #10713

Below the code executed to find the failure.

# Let's check the 32 - 64 bits type conservation.                                                                                                                       
if isinstance(X, np.ndarray):                                                                                                                                           
    for dtype in [np.float32, np.float64]:                                                                                                                              
        X_cast = X.astype(dtype)                                                                                                                                        
                                                                                                                                                                            
        transformer = clone(transformer_orig)                                                                                                                           
        set_random_state(transformer)                                                                                                                                   
                                                                                                                                                                            
        if hasattr(transformer, 'fit_transform'):                                                                                                                       
            X_trans = transformer.fit_transform(X_cast, y_)                                                                                                             
        elif hasattr(transformer, 'fit_transform'):                                                                                                                     
            transformer.fit(X_cast, y_)                                                                                                                                 
            X_trans = transformer.transform(X_cast)                                                                                                                     
                                                                                                                                                                            
        # FIXME: should we check that the dtype of some attributes are the                                                                                              
        # same than dtype.                                                                                                                                              
        assert X_trans.dtype == X_cast.dtype, 'transform dtype: {} - original dtype: {}'.format(X_trans.dtype, X_cast.dtype)

Tips to run the test for a specific transformer:

Choose a transformer, for instance FastICA
If this class does not already have a method named _more_tags: add the following code snippet at the bottom of the class definition:

    def _more_tags(self):
        return {"preserves_dtype": [np.float64, np.float32]}

Run the common tests for this specific class:

pytest sklearn/tests/test_common.py -k "FastICA and check_transformer_preserve_dtypes" -v

It should fail: read the error message and try to understand why the fit_transform method (if it exists) or the transform method returns a float64 data array when it is passed a float32 input array.

It might be helpful to use a debugger, for instance by adding the line:

import pdb; pdb.set_trace()

at the beginning of the fit_transform method and then re-rerunning pytest with:

pytest sklearn/tests/test_common.py -k "FastICA and check_transformer_preserve_dtypes" --pdb

Then using the l (list), n (next), s (step into a function call), p some_array_variable.dtype (p stands for print) and c (continue) commands to interactively debug the execution of this fit_transform call.

ping @rth feel free to edit this thread.

The text was updated successfully, but these errors were encountered:

jnothman · 2018-04-21T12:53:27Z

just a note that we landed up creating untested regressions, particularly in things like Euclidean distance calculation (#9354), when we made similar changes not long ago.... please make sure the motivations and the risks of these changes are clear. We've still not fixed up that previous mess.

rth · 2018-04-21T13:28:07Z

Thanks for your comment and for mentioning this issue @jnothman ! The question is whether the gains from suporting 32bit arrays overweight the risk of running into new numerical issues.

On a different topic, the above test code can be re-written in a self-consistent way as,

import numpy as np

from sklearn.base import clone
from sklearn.utils.testing import set_random_state

def check_transformer_dtypes_casting(transformer, X, y):
    """Check that a transformer preserves 64bit / 32bit
    dtypes

    Parameters
    ----------
    transformer : estimator
      a transformer instance

    X : array, shape (n_samples, n_features)
      Training vector, where n_samples is the number of samples
      and n_features is the number of features.
    y : array, shape (n_samples)
      Target vector relative to X.
    """
    for dtype_in, dtype_out in [(np.float32, np.float32),
                                (np.float64, np.float64),
                                (np.int, np.float64)]:
        X_cast = X.copy().astype(dtype_in)

        transformer = clone(transformer)
        set_random_state(transformer)

        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y)
        elif hasattr(transformer, 'fit_transform'):
            transformer.fit(X_cast, y)
            X_trans = transformer.transform(X_cast)

        # FIXME: should we check that the dtype of some attributes are the
        # same than dtype.
        assert X_trans.dtype == dtype_out, \
            ('transform dtype: {} - original dtype: {}'
             .format(X_trans.dtype, X_cast.dtype))

For instance, we can test the LocallyLinearEmbedding as follows,

from sklearn.manifold import LocallyLinearEmbedding

X = np.random.RandomState(0).randn(1000, 100)

estimator = LocallyLinearEmbedding()

check_transformer_dtypes_casting(estimator, X, None)

which in this case will produce

AssertionError: transform dtype: float64 - original dtype: float32

glemaitre · 2018-04-21T13:45:08Z

@jnothman noted that we should be careful. The motivation is the same as in #8769 mentioned by @GaelVaroquaux . For transformers, I think that there is an extra incentive: user could use other library (tensorflow, etc.) in which default type is float32. Using the scikit-learn transformers will trigger some extra conversion which would not be required. A bit similar vision than letting the NaN past through.

LalliAcqua · 2019-11-02T14:36:51Z

Checking Isomap for this issue #11000

ksslng · 2020-01-29T11:25:44Z

I'm working on GenericUnivariateSelect as well as implementing a generic test for all estimators to preserve dtypes.

Henley13 · 2020-01-30T12:36:33Z

Should we wait for @ksslng to merge his PR with a generic test before working on a specific transformer from the above list ? Or focus on the "strange ones" instead ?

rth · 2020-01-30T13:05:45Z

@Henley13 It would have been ideal to have that common test, but you can certainly start investigation an estimator from the list meanwhile.

It should be noted that some of the above estimators may never preserve dtype , either because they use a significant number of linear algebra and we are not able to guarantee that results are consistent in 32bit (or because they are not frequently used enough to deserve this effort). So starting with frequently used and relatively simple estimators would be best.

Another point related is that you could check that some of the supervised/unsupervised models are able to work in 32bit (again focusing on the most popular ones). There is a number of PRs linked above that have done that, but many models probably still need work. So you could look into say linear checking that models are able to work without doing conversion to 64bit when given 32bit input and what would be necessary to fix it.

svenstehle · 2022-09-02T08:19:27Z

Hi, I checked GaussianRandomProjection.
In the DONT_TEST list this is referenced as GaussianRandomProjectionHash - maybe a typo?

It seems GaussianRandomProjection was already fixed in #22114
The tests introduced in this PR also cover GaussianRandomProjection, since it is part of the tested estimators all_RandomProjection. Running the tests [test_random_projection_dtype_match, test_random_projection_numerical_consistency] locally also passed without any errors.

_more_tags already exists in the base class BaseRandomProjection.

I think we can tick off GaussianRandomProjection. I don't know if the recommended benchmarking was done though.

jeremiedbb · 2022-09-02T08:23:40Z

~~@svenstehle Thanks for looking into that. Would you like to open a PR ?~~ EDIT: there's nothing to actually besides removing it from the checklist. Thanks

svenstehle · 2022-09-02T08:34:01Z

~~Working on GenericUnivariateSelect~~
Correction: GenericUnivariateSelect already implemented in #22370

sklearn/tests/test_common.py::test_estimators[GenericUnivariateSelect()-check_transformer_preserve_dtypes] PASSED

svenstehle · 2022-09-02T08:55:51Z

Birch was fixed in #22968

sklearn/tests/test_common.py::test_estimators[Birch()-check_transformer_preserve_dtypes] PASSED

IvanLauLinTiong · 2022-09-02T08:57:14Z

Working on FeatureAgglomeration.

svenstehle · 2022-09-02T09:04:50Z

~~BernoulliRBM was fixed in #16352~~

@jeremiedbb discovered that we were missing the necessary update to _more_tags. Added this in #24318

sklearn/tests/test_common.py::test_estimators[BernoulliRBM()-check_transformer_preserve_dtypes] PASSED

MiniBatchDictionaryLearning seems to have been fixed with PRs ~~#22002~~ and #22428; but @jeremiedbb pls double-check if that is correct

sklearn/tests/test_common.py::test_estimators[MiniBatchDictionaryLearning()-check_transformer_preserve_dtypes] PASSED

svenstehle · 2022-09-02T09:57:47Z

Continuing work of @thibsej on FactorAnalysis in #13303

MisaOgura · 2022-09-02T10:48:27Z

Looking at LocallyLinearEmbedding

betatim · 2022-09-02T14:14:33Z

I think we can tick SkewedChi2Sampler in the list a 9E88 s pytest sklearn/tests/test_common.py -k "SkewedChi2Sampler and check_transformer_preserve_dtypes" -v says it passes the test.

jeremiedbb · 2022-09-02T14:35:00Z

@betatim not yet: this test only runs on float64 by default. To trigger the test on both float64 and float32, you need to set the "preserve_dtype": [np.float64, np.float32] estimator tag.

rprkh · 2022-09-04T05:54:44Z

Working on SkewedChi2Sampler

OmarManzoor · 2022-09-09T11:38:26Z

@jeremiedbb, @glemaitre
Is there any part of this issue that is remaining and can be worked on or all of them are taken or complete?

glemaitre · 2022-09-09T12:00:16Z

I don't see any PR for Isomap.

OmarManzoor · 2022-09-09T12:10:34Z

change in cdef class BinaryTree necessary to fix float32/float64 conversion in Isomap #15474

I see. It is mentioned above along with the issue #15474. Does that issue need to be completed first in order to work on Isomap within the current one?

glemaitre · 2022-09-09T12:14:28Z

Does that issue need to be completed first in order to work on Isomap within the current one?

It seems so.

ogrisel · 2022-09-19T09:02:57Z

I see. It is mentioned above along with the issue #15474. Does that issue need to be completed first in order to work on Isomap within the current one?

No this is the opposite: we need to make Isomap preserve the float32 and then update the tests accordingly.

glemaitre added the Sprint label Apr 20, 2018

rth changed the title ~~[Man AHL Hackathon] Preserving dtype with float / double in transformer~~ [Man AHL Hackathon] Preserving dtype for float32 / float64 in transformers Apr 20, 2018

Mariand012 mentioned this issue Apr 21, 2018

[WIP] Avoid float64 conversion for float32 in transformers #11004

Closed

rth changed the title ~~[Man AHL Hackathon] Preserving dtype for float32 / float64 in transformers~~ Preserving dtype for float32 / float64 in transformers Apr 27, 2018

rth mentioned this issue Apr 28, 2018

yield tests are deprecated in pytest #10728

Closed

glemaitre mentioned this issue Jun 12, 2018

[MRG+1] FIX: enforce consistency between dense and sparse cases in StandardScaler #11235

Merged

matiasbattocchia mentioned this issue Dec 19, 2018

Type error when using Pipeline skorch-dev/skorch#414

Closed

massich mentioned this issue Feb 25, 2019

LogisticRegression convert to float64 (for SAG solver) #13243

Merged

This was referenced Feb 26, 2019

[MRG+2] Add float32 support for Linear Discriminant Analysis #13273

Merged

Add float32 support for Latent Dirichlet Allocation #13275

Closed

Float32 support factor analysis #13303

Closed

This was referenced Mar 1, 2019

[MRG+1] ENH Ridge with solver SAG/SAGA does not cast to float64 #13302

Merged

ENH: Preserve float32/64 for SGD #13346

Closed

Migrate WeightVector to use tempita #13358

Closed

rth mentioned this issue Jul 12, 2019

MAINT Common sample_weight validation #14307

Merged

Henley13 mentioned this issue Aug 5, 2019

[MRG] Bayesian regression model dtype consistency #9087

Closed

rth mentioned this issue Oct 30, 2019

Add dtype parameter to KBinsDiscretizer #15409

Closed

LalliAcqua mentioned this issue Nov 2, 2019

Support for float32 in KDTree and BallTree #15474

Closed

tirthasheshpatel mentioned this issue Dec 2, 2019

QuantileTransformer quantiles can be unordered because of rounding errors which cause np.interp to return nonsense results #15733

Closed

jeremiedbb mentioned this issue Jan 29, 2020

[MRG] ENH Make NMF preserve floating dtype #16280

Merged

ksslng mentioned this issue Jan 29, 2020

[MRG] Add common test and estimator tag for preserving float32 dtype in transformers #16290

Closed

Henley13 mentioned this issue Jan 31, 2020

[MRG] 32/64-bit float consistency with BernoulliRBM #16352

Merged

rth mentioned this issue Jun 24, 2020

Add 32 bit support to neural_network module #17700

Closed

ogrisel mentioned this issue Jun 29, 2022

MNT Use _validate_params in NMF and MiniBatchNMF #23463

Merged

francoisgoupil added this to EuroPython 2022 Jul 13, 2022

francoisgoupil moved this to Issues that are list of sub-issues in EuroPython 2022 Jul 13, 2022

betatim mentioned this issue Sep 2, 2022

ENH Use X's dtype for the projection in RBFSampler #24317

Merged

svenstehle mentioned this issue Sep 2, 2022

[ENH] add dtype preservation to BernoulliRBM #24318

Merged

svenstehle mentioned this issue Sep 2, 2022

ENH add dtype preservation to FactorAnalysis #24321

Open

1 task

MisaOgura mentioned this issue Sep 2, 2022

ENH Add dtype preservation to LocallyLinearEmbedding #24337

Open

IvanLauLinTiong mentioned this issue Sep 3, 2022

ENH Add dtype preservation to FeatureAgglomeration #24346

Draft

This was referenced Sep 4, 2022

ENH Add dtype preservation to SkewedChi2Sampler #24349

Closed

ENH Add dtype preservation to SkewedChi2Sampler #24350

Merged

This was referenced Sep 23, 2022

RFC Should pairwise_distances preserve float32 ? #24502

Open

ENH Preserving dtype for np.float32 in LatentDirichletAllocation #24528

Merged

rprkh mentioned this issue Oct 21, 2022

ENH Add dtype preservation for Isomap #24714

Merged

glemaitre mentioned this issue Jan 15, 2024

ENH TfidfTransformer perserves np.float32 dtype #28136

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserving dtype for float32 / float64 in transformers #11000

Preserving dtype for float32 / float64 in transformers #11000

Preserving dtype for float32 / float64 in transformers #11000

Preserving dtype for float32 / float64 in transformers #11000

Comments

Tips to run the test for a specific transformer: