Description
Describe the bug
There is no floating point tolerance in the function _handle_zeros_in_scale when checking whether scale == 0.0. As a result, floating point precision can cause this check to fail incorrectly, so scale is not set to 1.0. The end result is potentially incorrectly scaled values when using StandardScaler(), since the value of scale will be near 0 instead of 1, introducing numerical instability.
Steps/Code to Reproduce
from sklearn.preprocessing import StandardScaler
import numpy as np
data_fails = np.full((1000, 1), 14.62, dtype=float)  # array filled with 14.62, triggers the issue
data_works = np.full((1000, 1), 100.0, dtype=float)  # array filled with 100.0, works as intended
scaler_fails = StandardScaler()
scaler_works = StandardScaler()
scaled_fails = scaler_fails.fit_transform(data_fails) #Returns array filled with -1.0
scaled_works = scaler_works.fit_transform(data_works) #Returns array filled with 0.0
print('\n Results: \n\n')
print(scaled_fails[0][0])
print(scaled_works[0][0])
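To confirm where the mismatch comes from, the fitted statistics can be inspected directly through the public var_ and scale_ attributes of StandardScaler. The exact numbers depend on platform and library versions, so this is a diagnostic sketch rather than guaranteed output; on the affected setup, scale_ for data_fails is expected to be a tiny nonzero value instead of the 1.0 that _handle_zeros_in_scale should have substituted.

print(scaler_fails.var_[0], scaler_fails.scale_[0])  # var_ near zero, scale_ tiny but nonzero
print(scaler_works.var_[0], scaler_works.scale_[0])  # var_ exactly 0.0, scale_ replaced by 1.0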
Expected Results
Expected both scaled results to be the zero vector, since both inputs are constant-valued vectors.
Actual Results
Standard scaling subtracts the mean and divides by the standard deviation when the corresponding flags are set, as in the example above. The variance of a constant-valued vector is 0, which should be caught and replaced by 1 in _handle_zeros_in_scale. However, this is not happening due to variations introduced by floating point representation, so the mean is divided by a small floating point value, resulting in incorrect scaling when using StandardScaler().
The error occurs at line 77 of _data.py in my version, inside the function _handle_zeros_in_scale, which currently reads:
scale[scale == 0.0] = 1.0
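One possible fix, sketched below purely as an illustration, is to use a tolerance-based comparison instead of exact equality. The helper name and the tolerance constant are assumptions for this sketch, not scikit-learn's actual implementation:

import numpy as np

def handle_zeros_in_scale_sketch(scale, atol=None):
    # Replace (near-)zero scale values with 1.0 before dividing.
    # The tolerance below is an assumed choice for illustration only.
    scale = np.array(scale, dtype=np.float64, copy=True)
    if atol is None:
        atol = 10 * np.finfo(scale.dtype).eps  # assumed tolerance near machine epsilon
    scale[np.isclose(scale, 0.0, atol=atol)] = 1.0
    return scale

A relative tolerance tied to the feature's magnitude may be more appropriate than a fixed absolute one, since the residual floating point error in the computed variance grows with the magnitude of the data.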
Versions
Python dependencies:
pip: 20.0.2
setuptools: 47.1.1.post20200604
sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 1.0.3
matplotlib: 3.2.1
joblib: 0.15.1