Uniform columns return a standard deviation of 1 in StandardScaler · Issue #4609 · scikit-learn/scikit-learn · GitHub

Uniform columns return a standard deviation of 1 in StandardScaler #4609

Closed
TroyHernandez opened this issue Apr 16, 2015 · 8 comments
Labels: Documentation, Easy (well-defined and straightforward way to resolve)

Comments

@TroyHernandez

It should be 0.

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> StandardScaler().fit(np.zeros(10)).std_
1.0
@lesteve
Member
lesteve commented Apr 17, 2015

StandardScaler.fit uses sklearn.preprocessing.data._mean_and_std; its docstring may hint at why it behaves this way.
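
To illustrate the kind of zero handling being discussed, here is a minimal sketch in plain NumPy. The helper name `mean_and_safe_std` is hypothetical and this is not the library's actual `_mean_and_std` code; it only assumes that a zero standard deviation is replaced by 1 so that later division is safe.

```python
import numpy as np

def mean_and_safe_std(X):
    """Hypothetical helper: column means and stds, with zero stds replaced by 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0    # constant columns: avoid dividing by zero later
    return mean, std

# First column is constant, second column varies.
X = np.array([[0.0, 1.0],
              [0.0, 5.0]])
mean, std = mean_and_safe_std(X)
print(mean)  # [0. 3.]
print(std)   # [1. 2.]
```

The constant column reports 1 rather than 0, which matches the behaviour described in this issue.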

@amueller
Member

Yeah, it is a bit awkward, but setting it to 0 is also awkward. I am a bit conflicted about what would be best here.
I think we should document this behavior in the std_ docstring. That might be best.

@amueller added the Easy and Documentation labels Apr 17, 2015
@TroyHernandez
Author

0 is the correct answer, so I find it less awkward than 1. Returning 0, but handling it internally to avoid NaNs, and throwing a warning would be my preference. Adding functionality to optionally remove those columns would be even better.
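
The preference described above could look roughly like the following sketch. It is written in plain NumPy and is not scikit-learn API: the function name `scale_reporting_true_std` and the `drop_constant` flag are hypothetical, chosen only for illustration.

```python
import warnings
import numpy as np

def scale_reporting_true_std(X, drop_constant=False):
    """Hypothetical sketch: report the true std, warn on constant columns."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)                      # true std, may contain zeros
    constant = std == 0.0
    if constant.any():
        warnings.warn(f"{int(constant.sum())} constant column(s) found; their std is 0")
    divisor = np.where(constant, 1.0, std)   # internal handling to avoid NaNs
    Xt = (X - mean) / divisor
    if drop_constant:
        Xt = Xt[:, ~constant]                # optionally remove constant columns
    return Xt, std

X = np.array([[7.0, 1.0],
              [7.0, 5.0]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Xt, std = scale_reporting_true_std(X, drop_constant=True)

print(std)          # [0. 2.]
print(Xt.ravel())   # [-1.  1.]
print(len(caught))  # 1
```

Here the reported standard deviation stays 0 for the constant column, while the division itself uses 1 internally.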

@amueller
Member

What do you mean by "the correct answer"? This is not a function to compute the standard deviation; you can use np.std for that. This is the "internal handling" to get the desired scaling.

I feel that the promise here is that

scaler.transform(X) == (X - scaler.mean_) / scaler.std_ 

and not that

scaler.mean_ == np.mean(X, axis=0)
scaler.std_ == np.std(X, axis=0)

I don't see what the usefulness of the second contract would be.
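
The difference between the two contracts can be checked with a small NumPy sketch. The variable names are illustrative, and the scaling factor is built by hand here rather than taken from a fitted scaler, on the assumption that zero standard deviations are replaced by 1.

```python
import numpy as np

X = np.array([[0.0, 1.0],
              [0.0, 5.0]])

mean_ = X.mean(axis=0)
true_std = X.std(axis=0)                          # [0., 2.]
std_ = np.where(true_std == 0.0, 1.0, true_std)   # [1., 2.], the scaling factor

# First contract: the transform is (X - mean_) / std_, and it stays finite.
transformed = (X - mean_) / std_

# Dividing by the true std instead yields 0/0 = nan in the constant column.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = (X - mean_) / true_std

print(np.isnan(transformed).any())  # False
print(np.isnan(naive).any())        # True
```

With the scaling factor set to 1, the constant column maps to zeros; with the true standard deviation it maps to NaN.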

Optionally removing these columns would be a nice addition. PR welcome.

@TroyHernandez
Author

By "correct" I mean reporting the calculated standard deviation (in LaTeX, sorry):
$$ \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} $$
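
As a quick check of that formula, the sketch below computes the population standard deviation by hand and compares it with `np.std`, whose default `ddof=0` uses the same 1/N normalisation. The function name `population_std` is illustrative only.

```python
import math
import numpy as np

def population_std(x):
    """Square root of the mean squared deviation from the mean (1/N form)."""
    n = len(x)
    mu = sum(x) / n
    return math.sqrt(sum((xi - mu) ** 2 for xi in x) / n)

zeros = [0.0] * 10
print(population_std(zeros))   # 0.0
print(float(np.std(zeros)))    # 0.0
```

For a uniform column the calculated value is 0, which is the number the issue title expects.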

I'm new to sklearn. That PR may be a while.

@amueller
Member

Well, I agree that naming the attribute std is confusing. I think there is a PR that refactors it and renames it scaling.

@TomDLT
Member
TomDLT commented Jun 10, 2015

> I think there is a PR that refactors it and renames it scaling.

Yes, it is #3639.

@amueller
Member

Fixed by #4796.
