_handle_zeros_in_scale causing improper scaling when using StandardScaler() #17794
I agree that this should have some tolerance to correspond to floating point imprecision in computing the variance. Pull request welcome.
rastna12 added a commit to rastna12/scikit-learn that referenced this issue on Jul 1, 2020:
Added floating point tolerance to _handle_zeros_in_scale to address issue scikit-learn#17794, created on 6/30/2020. I'm using numpy's isclose() function with its default absolute and relative tolerance values. The defaults handled my test cases fine up until floats around 1e+20, when the variable 'scale' grew to non-zero values even for constant-valued vectors. There may be floating point sensitivities in that function as well, but that's outside the scope of this issue. I also could not test the first if-statement in _handle_zeros_in_scale, which checks for scalars close to zero, through StandardScaler(): scalar values passed in are stopped by check_array(). It may be prudent to adjust that statement as well, but without a way to properly check it and without deeper knowledge of the package at the moment, I didn't want to mess with it.
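A minimal sketch of the change this commit describes, assuming the fix simply swaps the exact comparison for numpy's isclose() with default tolerances (the signature here is simplified and not necessarily scikit-learn's actual one):

import numpy as np

def _handle_zeros_in_scale(scale, copy=True):
    # Map (near-)zero scales to 1.0 so later division is a no-op
    # instead of a blow-up. Sketch only; the real scikit-learn
    # function may carry additional checks.
    if np.isscalar(scale):
        return 1.0 if np.isclose(scale, 0.0) else scale
    if copy:
        scale = scale.copy()
    # Tolerance-based check replacing the exact `scale == 0.0` test,
    # so tiny variances caused by floating point error also map to 1.0.
    scale[np.isclose(scale, 0.0)] = 1.0
    return scale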
rastna12 added a second commit to rastna12/scikit-learn that referenced this issue on Jul 1, 2020:
Updating format from linting results.
This has been fixed in #19788.
Describe the bug
There is no floating point tolerance in the function _handle_zeros_in_scale when checking whether scale == 0.0. As a result, floating point imprecision can cause the check to miss a near-zero scale and leave it unset instead of replacing it with 1.0. The end result is potentially incorrectly scaled values when using StandardScaler(), since scale will be near 0 instead of 1, introducing numerical instability.
Steps/Code to Reproduce
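The original reproduction snippet is not preserved in this copy of the issue; a minimal script consistent with the description (the constants and shapes are illustrative assumptions) would be:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two constant-valued column vectors; the specific constants are
# illustrative. Large magnitudes make variance round-off more likely.
x1 = np.full((5, 1), 50.0)
x2 = np.full((5, 1), 1.5e15)

for x in (x1, x2):
    scaled = StandardScaler(with_mean=True, with_std=True).fit_transform(x)
    print(scaled.ravel())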
Expected Results
Expected both scaled results to be zero vectors, since both inputs are constant-valued vectors.
Actual Results
Standard scaling subtracts the mean and divides by the standard deviation when the appropriate flags are set, as in the example above. The variance of a constant-valued vector is 0, which should be caught and replaced by 1 in _handle_zeros_in_scale. However, this is not happening, due to variations introduced by floating point representation: the mean ends up divided by a small floating point value, resulting in incorrect scaling when using StandardScaler().
The error occurs at line 77 of _data.py (in my version), inside the function _handle_zeros_in_scale, which currently reads:
scale[scale == 0.0] = 1.0
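To illustrate why this exact equality test is fragile (the tiny residual value below is an illustrative stand-in for floating point error), compare it with a tolerance-based check:

import numpy as np

# One true zero and one near-zero scale left over from floating
# point error; 3e-17 is an assumed residual for demonstration.
scale = np.array([1.0, 0.0, 3e-17])

exact = scale.copy()
exact[exact == 0.0] = 1.0                  # misses 3e-17 entirely
tolerant = scale.copy()
tolerant[np.isclose(tolerant, 0.0)] = 1.0  # catches it

print(exact)     # [1.e+00 1.e+00 3.e-17] -> later division blows up
print(tolerant)  # [1. 1. 1.]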
Versions
Python dependencies:
pip: 20.0.2
setuptools: 47.1.1.post20200604
sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 1.0.3
matplotlib: 3.2.1
joblib: 0.15.1