Preserving dtype for float32 / float64 in transformers #11000
Comments
Just a note that we ended up creating untested regressions, particularly in things like Euclidean distance calculation (#9354), when we made similar changes not long ago... please make sure the motivations and the risks of these changes are clear. We've still not fixed up that previous mess.
|
Thanks for your comment and for mentioning this issue @jnothman! The question is whether the gains from supporting 32bit arrays outweigh the risk of running into new numerical issues. On a different topic, the above test code can be re-written in a self-consistent way as:

```python
import numpy as np

from sklearn.base import clone
from sklearn.utils.testing import set_random_state


def check_transformer_dtypes_casting(transformer, X, y):
    """Check that a transformer preserves 64bit / 32bit dtypes

    Parameters
    ----------
    transformer : estimator
        a transformer instance

    X : array, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples
        and n_features is the number of features.

    y : array, shape (n_samples)
        Target vector relative to X.
    """
    for dtype_in, dtype_out in [(np.float32, np.float32),
                                (np.float64, np.float64),
                                (np.int, np.float64)]:
        X_cast = X.copy().astype(dtype_in)
        transformer = clone(transformer)
        set_random_state(transformer)
        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y)
        else:
            transformer.fit(X_cast, y)
            X_trans = transformer.transform(X_cast)
        # FIXME: should we also check that the dtypes of the fitted
        # attributes match the input dtype?
        assert X_trans.dtype == dtype_out, \
            ('transform dtype: {} - original dtype: {}'
             .format(X_trans.dtype, X_cast.dtype))
```

For instance, we can test LocallyLinearEmbedding:

```python
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.RandomState(0).randn(1000, 100)
estimator = LocallyLinearEmbedding()
check_transformer_dtypes_casting(estimator, X, None)
```

which in this case will produce
|
@jnothman noted that we should be careful. The motivation is the same as in #8769 mentioned by @GaelVaroquaux. For transformers, I think there is an extra incentive: users may be working with other libraries (tensorflow, etc.) in which the default dtype is float32. Using the scikit-learn transformers would then trigger extra conversions that would not otherwise be required. It is a vision similar to letting NaN pass through. |
Checking Isomap for this issue #11000 |
I'm working on |
Should we wait for @ksslng to merge his PR with a generic test before working on a specific transformer from the above list? Or focus on the "strange ones" instead? |
@Henley13 It would have been ideal to have that common test, but you can certainly start investigating an estimator from the list in the meantime. It should be noted that some of the above estimators may never preserve dtype, either because they rely on a significant amount of linear algebra and we are not able to guarantee that results are consistent in 32bit, or because they are not used frequently enough to deserve this effort. So starting with frequently used and relatively simple estimators would be best. A related point is that you could check that some of the supervised/unsupervised models are able to work in 32bit (again focusing on the most popular ones). There are a number of PRs linked above that have done that, but many models probably still need work. So you could look into, say, linear models, checking that they are able to work without conversion to 64bit when given 32bit input, and what would be necessary to fix that. |
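For instance, a minimal sketch of such a manual check on a linear model (using LogisticRegression purely as an illustration; whether the attributes stay in 32bit depends on the estimator and the scikit-learn version) could look like:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Build a small toy problem and cast the input to 32bit.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_32 = X.astype(np.float32)

clf = LogisticRegression().fit(X_32, y)

# Inspect whether the fitted attributes and the outputs stayed in 32bit
# or were silently upcast to 64bit somewhere along the way.
print("coef_ dtype:", clf.coef_.dtype)
print("decision_function dtype:", clf.decision_function(X_32).dtype)
```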
Hi, I checked It seems
I think we can tick off |
sklearn/tests/test_common.py::test_estimators[GenericUnivariateSelect()-check_transformer_preserve_dtypes] PASSED |
sklearn/tests/test_common.py::test_estimators[Birch()-check_transformer_preserve_dtypes] PASSED |
Working on |
@jeremiedbb discovered that we were missing the necessary update to
sklearn/tests/test_common.py::test_estimators[BernoulliRBM()-check_transformer_preserve_dtypes] PASSED
sklearn/tests/test_common.py::test_estimators[MiniBatchDictionaryLearning()-check_transformer_preserve_dtypes] PASSED |
Looking at |
I think we can tick |
@betatim not yet: this test only runs on float64 by default. To trigger the test on both float64 and float32, you need to set the |
Working on |
@jeremiedbb, @glemaitre |
I don't see any PR for |
It seems so. |
No, this is the opposite: we need to make Isomap preserve float32 and then update the tests accordingly. |
This is the issue which we want to tackle during the Man AHL Hackathon.
We would like the transformers not to convert float32 to float64 whenever possible. The transformers which are currently failing are listed below (only the entries whose linked PRs are still legible in this copy are shown):
- Birch (Add dtype preservation to Birch #22968)
- RBFSampler (Preserve X's dtype for the projection in RBFSampler #24317)

We could think of extending this to integer dtypes whenever possible and applicable.
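As a concrete illustration, a quick manual check for one transformer from the list (taking Birch from #22968 as an example; the observed dtypes depend on the scikit-learn version installed) might look like:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
X_32 = rng.randn(100, 5).astype(np.float32)

# Fit and transform on float32 input, then check whether the output and
# the relevant fitted attributes were upcast to float64.
brc = Birch(n_clusters=3).fit(X_32)
print("transform dtype:", brc.transform(X_32).dtype)
print("subcluster_centers_ dtype:", brc.subcluster_centers_.dtype)
```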
Also the following transformers are not included in the common tests. We should write a specific test:
We could also check classifiers, regressors, or clusterers (see #8769 for more context).
Below is the code executed to find the failures.
Tips to run the test for a specific transformer, taking FastICA as an example:

Using _more_tags, add the following code snippet at the bottom of the class definition (the snippet itself is missing from this copy; a hedged sketch is given at the end of this issue description). The test fails whenever the fit_transform method (if it exists) or the transform method returns a float64 data array when it is passed a float32 input array.

It might be helpful to use a debugger, for instance by adding the line import pdb; pdb.set_trace() at the beginning of the fit_transform method and then re-running pytest with the relevant test selected. Then use the l (list), n (next), s (step into a function call), p some_array_variable.dtype (p stands for print) and c (continue) commands to interactively debug the execution of this fit_transform call.

ping @rth feel free to edit this thread.
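Since the code snippet and the exact pytest command are missing above, here is a hedged sketch of what they could look like. The _more_tags method returning a "preserves_dtype" tag is an assumption based on scikit-learn's estimator-tags mechanism, and the pytest invocation at the end is only an illustrative guess, not the command from the original issue.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Hypothetical _more_tags snippet: in scikit-learn versions that support the
# "preserves_dtype" estimator tag, a method like this at the bottom of the
# FastICA class definition declares which dtypes the estimator preserves.
# It is monkey-patched here only so the sketch is self-contained; in a real
# PR you would edit the class in sklearn/decomposition/ instead.
def _more_tags(self):
    return {"preserves_dtype": [np.float64, np.float32]}

FastICA._more_tags = _more_tags

# Quick manual version of what the common test verifies: does fit_transform
# return float32 output when given float32 input?
X_32 = np.random.RandomState(0).randn(200, 10).astype(np.float32)
X_trans = FastICA(n_components=3, random_state=0).fit_transform(X_32)
print("transformed dtype:", X_trans.dtype)

# Illustrative pytest invocation to run only the FastICA common tests with
# output capturing disabled (so the pdb prompt is usable):
#   pytest sklearn/tests/test_common.py -k FastICA -s
```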