_handle_zeros_in_scale causing improper scaling when using StandardScaler() · Issue #17794 · scikit-learn/scikit-learn · GitHub

Closed
rastna12 opened this issue Jun 30, 2020 · 2 comments

@rastna12

Describe the bug

The function _handle_zeros_in_scale checks scale == 0.0 exactly, with no floating-point tolerance. Because of rounding error, the computed scale of a constant feature can be a tiny nonzero number, so the check misses it and scale is not reset to 1.0. StandardScaler() then divides by a value near 0 instead of 1, which can produce incorrectly scaled values and introduces numerical instability.

Steps/Code to Reproduce

from sklearn.preprocessing import StandardScaler
import numpy as np

# Two constant-valued columns of shape (1000, 1)
data_fails = np.full((1000, 1), 14.62, dtype=float)  # filled with 14.62, triggers the issue
data_works = np.full((1000, 1), 100.0, dtype=float)  # filled with 100.0, works as intended

scaler_fails = StandardScaler()
scaler_works = StandardScaler()

scaled_fails = scaler_fails.fit_transform(data_fails)  # reported result: array filled with -1.0 (sklearn 0.22.1)
scaled_works = scaler_works.fit_transform(data_works)  # reported result: array filled with 0.0

print('\n Results: \n\n')
print(scaled_fails[0][0])
print(scaled_works[0][0])

Expected Results

Both scaled results should be zero vectors, since both inputs are constant-valued vectors.

Actual Results

Standard scaling subtracts the mean and divides by the standard deviation when the appropriate flags are set, as in the example above. The variance of a constant-valued vector is 0, which should be caught and replaced by 1 in the function _handle_zeros_in_scale. However, this does not happen because of variations introduced by floating-point representation: the computed scale is a tiny nonzero value rather than exactly 0. The centered values are then divided by that small floating-point value, which results in incorrect scaling when using StandardScaler().
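One way to see why the output can be exactly -1.0 rather than an arbitrary large number is the following sketch. It does not call scikit-learn. It assumes, purely for illustration, that the computed mean lands one representable float above the true constant, and then repeats the centering and scaling by hand. If the mean is off by d, every centered value is -d and the standard deviation is |d|, so the ratio is -1.

```python
import numpy as np

c = 14.62
x = np.full(1000, c)

# Assumption for illustration only: the computed mean is off by one ULP.
mean_off = np.nextafter(c, np.inf)

centered = x - mean_off                  # every entry is the same tiny negative number
scale = np.sqrt(np.mean(centered ** 2))  # tiny, but not exactly zero

# Exact comparison, as in the check described above: it does not fire.
if scale == 0.0:
    scale = 1.0

scaled = centered / scale                # entries are approximately -1.0, not 0.0
print(scale, scaled[0])
```

With an exact mean, centered would be all zeros, the scale would be exactly 0.0, and the check would reset it to 1.0 as intended.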

The error occurs at line 77 in my version of _data, inside the function _handle_zeros_in_scale, which currently reads:
scale[scale == 0.0] = 1.0
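A minimal sketch of what a tolerance-based check could look like, compared with the exact comparison quoted above. The helper name handle_zeros_with_tolerance and its atol parameter are hypothetical and are not part of scikit-learn; only np.isclose is a real NumPy call.

```python
import numpy as np

def handle_zeros_with_tolerance(scale, atol=1e-8):
    """Hypothetical variant: treat entries within atol of zero as zero."""
    scale = np.array(scale, dtype=float)  # work on a copy
    scale[np.isclose(scale, 0.0, atol=atol)] = 1.0
    return scale

raw = np.array([0.0, 1.8e-15, 2.5])

# Exact comparison: only the entry that is exactly 0.0 is replaced.
strict = raw.copy()
strict[strict == 0.0] = 1.0

# Tolerant comparison: the near-zero entry is replaced as well.
tolerant = handle_zeros_with_tolerance(raw)

print(strict)
print(tolerant)
```

The trade-off is that a feature whose true standard deviation is below atol would also be left unscaled, so the tolerance has to be chosen with the expected data range in mind.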

Versions

Python dependencies:
pip: 20.0.2
setuptools: 47.1.1.post20200604
sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 1.0.3
matplotlib: 3.2.1
joblib: 0.15.1

@jnothman
Member
jnothman commented Jun 30, 2020 via email

rastna12 added a commit to rastna12/scikit-learn that referenced this issue Jul 1, 2020
Added floating point tolerance to _handle_zeros_in_scale to address issue scikit-learn#17794 created on 6/30/2020. I'm using numpy's isclose() function with default absolute and relative tolerance values. The default values handled my test cases fine up until floats around 1e+20 when the variable 'scale' grew to non-zero values even for constant-valued vectors. There may be floating point sensitivities in that function as well but that's outside the scope of this issue.  

I also could not test the first if-statement in _handle_zeros_in_scale which checks for scalars close to zero through StandardScaler(). Scalar values passed in are stopped by check_array(). It may be prudent to adjust this statement as well, but without a way to properly check it and deeper knowledge of the package at the moment, I didn't want to mess with it.
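The 1e+20 observation in the commit message above is consistent with how np.isclose behaves when compared against zero: the test is |a - b| <= atol + rtol * |b|, so with b = 0 only the absolute tolerance (default 1e-8) matters, regardless of the magnitude of the data. The sketch below contrasts that with a check that scales the tolerance by the feature mean. The function near_zero_relative, its factor parameter, and the sample scale values are assumptions made for illustration; they are neither measured values nor scikit-learn code.

```python
import numpy as np

def near_zero_absolute(scale):
    # np.isclose against 0.0 reduces to |scale| <= atol (default 1e-8).
    return np.isclose(scale, 0.0)

def near_zero_relative(scale, mean, factor=10.0):
    # Hypothetical alternative: tolerance proportional to the feature magnitude.
    eps = np.finfo(np.float64).eps
    return np.abs(scale) <= factor * eps * np.abs(mean)

# Illustrative rounding-level scales for two constant features.
mean = np.array([14.62, 1e20])
scale = np.array([1e-15, 1e4])

abs_mask = near_zero_absolute(scale)
rel_mask = near_zero_relative(scale, mean)

print(abs_mask)
print(rel_mask)
```

A purely relative check has its own blind spot: for a feature whose mean is zero it only accepts a scale of exactly zero, so a practical version would likely combine both kinds of tolerance.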
rastna12 added a commit to rastna12/scikit-learn that referenced this issue Jul 1, 2020
Updating format from linting results.

@jeremiedbb
Member

This has been fixed in #19788
