_handle_zeros_in_scale causing improper scaling when using StandardScaler() · Issue #17794 · scikit-learn/scikit-learn · GitHub
Closed
@rastna12

Description

Describe the bug

The function _handle_zeros_in_scale has no floating point tolerance when checking whether scale == 0.0. As a result, floating point rounding can cause this check to miss a scale that should be zero, so it is not set to 1.0. StandardScaler() can then produce incorrectly scaled values, since scale is a tiny nonzero number instead of 1, which introduces numerical instability.

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import StandardScaler

# Constant column filled with 14.62: triggers the issue.
data_fails = np.full((1000, 1), 14.62, dtype=float)
# Constant column filled with 100.0: works as intended.
data_works = np.full((1000, 1), 100.0, dtype=float)

scaler_fails = StandardScaler()
scaler_works = StandardScaler()

scaled_fails = scaler_fails.fit_transform(data_fails)  # returns an array filled with -1.0
scaled_works = scaler_works.fit_transform(data_works)  # returns an array filled with 0.0

print('\n Results: \n\n')
print(scaled_fails[0][0])
print(scaled_works[0][0])

Expected Results

Both scaled results are expected to be zero vectors, since both inputs are constant-valued vectors.

Actual Results

Standard scaling subtracts the mean and divides by the standard deviation when the appropriate flags are set, as in the example above. The variance of a constant-valued vector is 0, which should be caught and replaced by 1 in the function _handle_zeros_in_scale. However, this does not happen because of small variations introduced by floating point representation. As a result, the centered values are divided by a small floating point value instead of 1, which yields incorrect scaling when using StandardScaler().
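
The mechanism can be reproduced by hand, as a rough sketch. It assumes the computed mean lands one floating point step above the true value, which is simulated here with np.nextafter; the actual rounding error inside scikit-learn's mean computation may differ. Under that assumption, the centered values and the standard deviation are the same tiny residue, so the exact-zero check is skipped and the ratio comes out as -1.0.

```python
import numpy as np

x = np.full(5, 14.62)

# Assumption for illustration: the computed mean is off by one representable
# step. np.nextafter returns the next float64 after x[0] toward +inf.
mean = np.nextafter(x[0], np.inf)

# Population standard deviation around that slightly-off mean.
scale = np.sqrt(np.mean((x - mean) ** 2))

print(scale == 0.0)                    # False: the exact-zero check misses it
print(float(((x - mean) / scale)[0]))  # -1.0 instead of the expected 0.0
```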

The error occurs at line 77 in my version of _data, inside the function _handle_zeros_in_scale. It currently reads:
scale[scale == 0.0] = 1.0
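
One possible direction, as a sketch only, is to compare against a small tolerance rather than exact zero. The helper name handle_near_zeros_in_scale, its tol parameter, and the default of 10 machine epsilons are all assumptions made for illustration; this is not scikit-learn's implementation.

```python
import numpy as np


def handle_near_zeros_in_scale(scale, tol=None):
    """Return a copy of `scale` with near-zero entries replaced by 1.0.

    Hypothetical helper for illustration; not scikit-learn's code.
    """
    scale = np.array(scale, dtype=float)  # np.array copies its input by default
    if tol is None:
        # Assumed default: a small multiple of float64 machine epsilon.
        tol = 10 * np.finfo(scale.dtype).eps
    # Treat anything within `tol` of zero as a constant feature.
    scale[np.abs(scale) < tol] = 1.0
    return scale


# 1.8e-15 stands in for a rounding residue; 2.0 is a genuine scale.
print(handle_near_zeros_in_scale([0.0, 1.8e-15, 2.0]).tolist())  # [1.0, 1.0, 2.0]
```

An absolute tolerance like this does not account for the magnitude of the data, so the right threshold is a judgment call: a residue on a feature with very large values could still exceed it.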

Versions

Python dependencies:
pip: 20.0.2
setuptools: 47.1.1.post20200604
sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 1.0.3
matplotlib: 3.2.1
joblib: 0.15.1
