MaxAbsScaler Upcasts Pandas to float64 #15093
Comments
It should probably be preserving dtype. It doesn't look like this issue should result from check_array, which looks like it is set up to preserve dtype in MaxAbsScaler.

Can you please confirm that this is still an issue in scikit-learn 0.21 (you have an old version)?
Thanks for the quick response!

Upon a closer look, this might be a bug in check_array, though I don't know enough about its desired functionality to comment.

```python
dtypes_orig = None
if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
    dtypes_orig = np.array(array.dtypes)  # correctly pulls the float32 dtypes from pandas

if dtype_numeric:
    if dtype_orig is not None and dtype_orig.kind == "O":
        # if input is object, convert to float.
        dtype = np.float64
    else:
        dtype = None

if isinstance(dtype, (list, tuple)):
    if dtype_orig is not None and dtype_orig in dtype:
        # no dtype conversion required
        dtype = None
    else:
        # dtype conversion required. Let's select the first element of the
        # list of accepted types.
        dtype = dtype[0]  # Should this be dtype = dtypes_orig[0]? dtype[0] is always float64
```

Thanks again!
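For readers outside the scikit-learn codebase, the branch in question can be sketched in isolation. This is a simplified paraphrase of the dtype-selection logic quoted above, not the actual check_array implementation: when the caller passes an accepted-dtype list like `[np.float64, np.float32]` and the input's original dtype is detected, the "no conversion" branch fires and float32 survives; if `dtype_orig` was not detected (as happened for pandas DataFrames in this bug), the code falls through to `dtype[0]`, which is float64.

```python
import numpy as np

def select_dtype(dtype, dtype_orig):
    """Simplified paraphrase of check_array's dtype-selection branch.

    `dtype` is the accepted-dtype spec passed by the caller (here a list);
    `dtype_orig` is the dtype detected on the input, or None when
    detection fails (as it did for pandas DataFrames in this bug).
    """
    if isinstance(dtype, (list, tuple)):
        if dtype_orig is not None and dtype_orig in dtype:
            # no dtype conversion required
            dtype = None
        else:
            # conversion required: fall back to the first accepted type,
            # which for [float64, float32] is always float64
            dtype = dtype[0]
    return dtype

accepted = [np.float64, np.float32]

# dtype detected correctly -> None, i.e. no conversion, float32 preserved
print(select_dtype(accepted, np.dtype(np.float32)))

# detection failed (dtype_orig is None) -> upcast to float64
print(select_dtype(accepted, None))
```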
It shouldn't be going down that path... It should be using the "no dtype conversion required" path.
Can confirm it's a bug in the handling of pandas introduced here: #10949

Fixed in #15094. (I should be writing grants, in case that's not obvious)

Y'all are awesome, thanks!
Description
I am working with the ColumnTransformer and, due to memory constraints, am trying to produce a float32 sparse matrix. Unfortunately, regardless of the pandas input dtype, the output is always float64.

I've identified one of the Pipeline scalers, MaxAbsScaler, as the culprit. Other preprocessing classes, such as OneHotEncoder, have an optional dtype argument. This argument does not exist in MaxAbsScaler (among others). It appears that the upcasting happens when check_array is executed.

Is it possible to specify a dtype? Or is there a commonly accepted practice for doing so from the ColumnTransformer?

Thank you!
Steps/Code to Reproduce
Example:
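The original example code was not captured in this extract. A minimal reproduction along these lines (assuming scikit-learn and pandas are installed; the column names and values are illustrative) shows the upcast on affected versions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# a float32 DataFrame, as in the report
df = pd.DataFrame({"a": [1.0, -2.0, 4.0],
                   "b": [0.5, 1.0, -0.25]}, dtype=np.float32)

Xt = MaxAbsScaler().fit_transform(df)
print(df.dtypes.unique())  # input columns are float32
print(Xt.dtype)  # float64 on affected versions (<= 0.21); float32 after the fix
```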
Expected Results
The transformed output preserves the input's float32 dtype.
Actual Results
The output is upcast to float64.
Versions
Darwin-18.7.0-x86_64-i386-64bit
Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:07:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.17.1
SciPy 1.3.1
Scikit-Learn 0.20.3
Pandas 0.25.1