MaxAbsScaler Upcasts Pandas to float64 #15093
Comments
It should probably be preserving dtype. It doesn't look like this issue should result from check_array, which looks like it is set up to preserve dtype in MaxAbsScaler.

Can you please confirm that this is still an issue in scikit-learn 0.21 (you have an old version)?
Thanks for the quick response!

Upon a closer look, this might be a bug in check_array, though I don't know enough about its desired functionality to comment.

```python
dtypes_orig = None
if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
    dtypes_orig = np.array(array.dtypes)  # correctly pulls the float32 dtypes from pandas

if dtype_numeric:
    if dtype_orig is not None and dtype_orig.kind == "O":
        # if input is object, convert to float.
        dtype = np.float64
    else:
        dtype = None

if isinstance(dtype, (list, tuple)):
    if dtype_orig is not None and dtype_orig in dtype:
        # no dtype conversion required
        dtype = None
    else:
        # dtype conversion required. Let's select the first element of the
        # list of accepted types.
        dtype = dtype[0]  # Should this be dtype = dtypes_orig[0]? dtype[0] is always float64
```

Thanks again!
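For readers outside the scikit-learn codebase, the branch in question can be sketched in isolation. This is a simplified paraphrase of the dtype-selection logic quoted above, not the actual check_array implementation: when the caller passes an accepted-dtype list like `[np.float64, np.float32]` and the input's original dtype is detected, the "no conversion" branch fires and float32 survives; if `dtype_orig` was not detected (as happened for pandas DataFrames in this bug), the code falls through to `dtype[0]`, which is float64.

```python
import numpy as np

def select_dtype(dtype, dtype_orig):
    """Simplified paraphrase of check_array's dtype-selection branch.

    `dtype` is the accepted-dtype spec passed by the caller (here a list);
    `dtype_orig` is the dtype detected on the input, or None when
    detection fails (as it did for pandas DataFrames in this bug).
    """
    if isinstance(dtype, (list, tuple)):
        if dtype_orig is not None and dtype_orig in dtype:
            # no dtype conversion required
            dtype = None
        else:
            # conversion required: fall back to the first accepted type,
            # which for [float64, float32] is always float64
            dtype = dtype[0]
    return dtype

accepted = [np.float64, np.float32]

# dtype detected correctly -> None, i.e. no conversion, float32 preserved
print(select_dtype(accepted, np.dtype(np.float32)))

# detection failed (dtype_orig is None) -> upcast to float64
print(select_dtype(accepted, None))
```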
It shouldn't be going down that path... It should be using the "no dtype conversion required" path.
Can confirm it's a bug in the handling of pandas introduced here: #10949

Fixed in #15094. (I should be writing grants, in case that's not obvious)

Y'all are awesome, thanks!
Description
I am working with the ColumnTransformer and, due to memory constraints, am trying to produce a float32 sparse matrix. Unfortunately, regardless of the pandas input dtype, the output is always float64.

I've identified one of the Pipeline scalers, MaxAbsScaler, as the culprit. Other preprocessing classes, such as OneHotEncoder, have an optional dtype argument. This argument does not exist in MaxAbsScaler (among others). It appears that the upcasting happens when check_array is executed.

Is it possible to specify a dtype? Or is there a commonly accepted practice for doing so from the ColumnTransformer?

Thank you!
Steps/Code to Reproduce
Example:
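The original example code was not captured in this extract. A minimal reproduction along these lines (assuming scikit-learn and pandas are installed; the column names and values are illustrative) shows the upcast on affected versions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# a float32 DataFrame, as in the report
df = pd.DataFrame({"a": [1.0, -2.0, 4.0],
                   "b": [0.5, 1.0, -0.25]}, dtype=np.float32)

Xt = MaxAbsScaler().fit_transform(df)
print(df.dtypes.unique())  # input columns are float32
print(Xt.dtype)  # float64 on affected versions (<= 0.21); float32 after the fix
```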
Expected Results
The transformed output preserves the input's float32 dtype.
Actual Results
The output is upcast to float64.
Versions
Darwin-18.7.0-x86_64-i386-64bit
Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:07:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.17.1
SciPy 1.3.1
Scikit-Learn 0.20.3
Pandas 0.25.1